Decision Tree - Mushrooms Selection
Introduction
This report walks through the application of the decision tree method to the mushroom dataset and interprets each step, identifying the key characteristics that determine whether a mushroom is safe to eat.
Load packages and dataset
library(pacman)
pacman::p_load(rpart, caret, rpart.plot, rattle, readxl,
               readr, janitor, ggplot2, dplyr, gt, tidyr,
               ggthemes, scales, gridExtra, corrplot)
original <- read_excel("D:/Documents/Data Analytics/NorthEastern/Courses/ALY 6040/Module 2 Technique Practice - Decision Trees - Mushrooms/mushrooms.xlsx")
mushrooms <- original # Keeping the original data set
The dataset was loaded with the read_excel() function from the readxl package. Another way to import it is to click the Import button in the Environment pane, which generates a call to the same function. read_excel() is fast and efficient even when the dataset is large.
1. Data Cleaning
The dataset contains 8,124 observations and 23 character features. The class feature is the target variable, as it indicates whether a mushroom is edible or poisonous. bruises could be converted to a logical variable because it holds true/false values, and ring number could be recoded as numeric because it counts how many rings the mushroom has. The complete.cases() function was used to find rows with missing observations; in this case there were none. The clean_names() function was used to replace the "-" characters in the column names, which otherwise cause errors in R formulas. Inspecting the structure of the dataset revealed that the veil type feature holds the same value across all observations; a constant column is redundant and was therefore removed. Overall, the dataset is clean, but the single-letter codes must constantly be checked against the attribute description list, which is inconvenient.
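The cleaning chunk itself is not shown in the report; a minimal sketch of the steps described above, reusing the object names from the rest of the analysis, would be:
str(original)                                # 8,124 obs. of 23 character variables, printed below
sum(!complete.cases(mushrooms))              # 0, i.e., no rows with missing observations
mushrooms <- janitor::clean_names(mushrooms) # "veil-type" becomes "veil_type", etc.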
## tibble [8,124 × 23] (S3: tbl_df/tbl/data.frame)
## $ class : chr [1:8124] "p" "e" "e" "p" ...
## $ cap-shape : chr [1:8124] "x" "x" "b" "x" ...
## $ cap-surface : chr [1:8124] "s" "s" "s" "y" ...
## $ cap-color : chr [1:8124] "n" "y" "w" "w" ...
## $ bruises : chr [1:8124] "t" "t" "t" "t" ...
## $ odor : chr [1:8124] "p" "a" "l" "p" ...
## $ gill-attachment : chr [1:8124] "f" "f" "f" "f" ...
## $ gill-spacing : chr [1:8124] "c" "c" "c" "c" ...
## $ gill-size : chr [1:8124] "n" "b" "b" "n" ...
## $ gill-color : chr [1:8124] "k" "k" "n" "n" ...
## $ stalk-shape : chr [1:8124] "e" "e" "e" "e" ...
## $ stalk-root : chr [1:8124] "e" "c" "c" "e" ...
## $ stalk-surface-above-ring: chr [1:8124] "s" "s" "s" "s" ...
## $ stalk-surface-below-ring: chr [1:8124] "s" "s" "s" "s" ...
## $ stalk-color-above-ring : chr [1:8124] "w" "w" "w" "w" ...
## $ stalk-color-below-ring : chr [1:8124] "w" "w" "w" "w" ...
## $ veil-type : chr [1:8124] "p" "p" "p" "p" ...
## $ veil-color : chr [1:8124] "w" "w" "w" "w" ...
## $ ring-number : chr [1:8124] "o" "o" "o" "o" ...
## $ ring-type : chr [1:8124] "p" "p" "p" "p" ...
## $ spore-print-color : chr [1:8124] "k" "n" "n" "k" ...
## $ population : chr [1:8124] "s" "n" "n" "s" ...
## $ habitat : chr [1:8124] "u" "g" "m" "u" ...
## [1] 0
Deleting redundant variable “veil_type”
mushrooms$veil_type <- NULL
skimr::skim(original) # reveals the redundancy because it shows how many unique values each variable has
| | |
|---|---|
| Name | original |
| Number of rows | 8124 |
| Number of columns | 23 |
| Column type frequency: character | 23 |
| Group variables | None |

Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
class | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
cap-shape | 0 | 1 | 1 | 1 | 0 | 6 | 0 |
cap-surface | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
cap-color | 0 | 1 | 1 | 1 | 0 | 10 | 0 |
bruises | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
odor | 0 | 1 | 1 | 1 | 0 | 9 | 0 |
gill-attachment | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
gill-spacing | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
gill-size | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
gill-color | 0 | 1 | 1 | 1 | 0 | 12 | 0 |
stalk-shape | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
stalk-root | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
stalk-surface-above-ring | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
stalk-surface-below-ring | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
stalk-color-above-ring | 0 | 1 | 1 | 1 | 0 | 9 | 0 |
stalk-color-below-ring | 0 | 1 | 1 | 1 | 0 | 9 | 0 |
veil-type | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
veil-color | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
ring-number | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
ring-type | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
spore-print-color | 0 | 1 | 1 | 1 | 0 | 9 | 0 |
population | 0 | 1 | 1 | 1 | 0 | 6 | 0 |
habitat | 0 | 1 | 1 | 1 | 0 | 7 | 0 |
2. Exploratory Data Analysis
After glancing at the variables with the skim() function, the odor variable was analyzed. A table was created to count how many observations of each class fall under each odor value (see the reconstruction below). The insight from this simple table is that some odor values occur only in poisonous mushrooms, making odor a key factor in determining whether a mushroom is safe to eat. The same idea was then generalized into a function that counts, for each feature, how many of its values are not shared between poisonous and edible mushrooms, indicating which features are most discriminative in mushroom classification.
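The chunk that produced the class-by-odor table below is not included in the report; a one-line reconstruction would be:
table(mushrooms$class, mushrooms$odor) # rows: class (e/p); columns: odor codes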
##
## a c f l m n p s y
## e 400 0 0 400 0 3408 0 0 0
## p 0 192 2160 0 36 120 256 576 576
# For each feature, count the levels that appear in only one class ("perfect splits")
number.perfect.splits <- apply(X = mushrooms[-1], MARGIN = 2, FUN = function(col){
  t <- table(mushrooms$class, col)  # class-by-level contingency table
  sum(t == 0)                       # zero cells = levels exclusive to one class
})
Afterwards, the order() function was used to sort the perfect-split counts in descending order, and a bar plot was created. The mushrooms can mostly be differentiated by odor, stalk color above or below the ring, and spore print color. The features least useful for classification are stalk surface below and above the ring, stalk shape, gill size, gill spacing, gill attachment, and bruises.
order <- order(number.perfect.splits, decreasing = TRUE)
number.perfect.splits <- number.perfect.splits[order]
par(mar = c(10, 2, 2, 2)) # widens the bottom margin so the rotated feature names fit
barplot(number.perfect.splits,
        main = "Number of perfect splits vs feature",
        xlab = "", ylab = "Number of perfect splits", # features run along the x axis
        las = 2, col = "wheat")
3. Build Decision Tree
First, the seed is set to 12345 for reproducible results. The next step is to split the dataset into training and test sets so that the measured effectiveness is realistic rather than overly optimistic (Kabacoff, 2022). 80% of the observations were randomly assigned to the training set, and the remaining 20% to the test set.
set.seed(12345)
train <- sample(1:nrow(mushrooms), size = ceiling(0.80*nrow(mushrooms)), replace = FALSE)
mushrooms_train <- mushrooms[train,]
mushrooms_test <- mushrooms[-train,] # selects the remaining 20% of the rows
A penalty matrix is created because errors in deciding which mushrooms are edible carry very different costs. Here it is 10 times worse to classify a poisonous mushroom as edible; since "edible" is the positive class, this is a false positive. The penalty for a false negative is only 1, since misclassifying an edible mushroom as poisonous endangers nobody's life. The penalty matrix is used when building the classification tree (Williams, 2010; Aslett, n.d.).
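The chunk defining the matrix is not shown in the report. A minimal sketch, assuming rpart's loss-matrix convention (rows = actual class, columns = predicted class, levels in alphabetical order "e", "p"), would be:
# Assumed reconstruction: cost 10 for predicting "e" when the truth is "p" (false positive),
# cost 1 for predicting "p" when the truth is "e" (false negative), 0 on the diagonal
penalty.matrix <- matrix(c( 0, 1,
                           10, 0),
                         byrow = TRUE, nrow = 2)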
In the next phase, the classic decision tree is built with the rpart() function, using a formula (class ~ .) that includes all predictors in the training set and incorporating the penalty matrix via parms = list(loss = penalty.matrix), so the tree minimizes the weighted misclassification cost rather than the raw number of misclassifications. Finally, method = "class" indicates a classification tree rather than, say, a regression tree.
The tree is split five times and has six terminal nodes. The intensity of the color in each box reflects the predicted class at that node (Milborrow, 2021). Within each box, the first number is the predicted probability of being poisonous, and the second shows the percentage of observations in the node.
tree <- rpart(class ~ ., # uses all predictors in the training set
              data = mushrooms_train,
              parms = list(loss = penalty.matrix), # applies the misclassification costs defined above
              method = "class")
rpart.plot(tree, nn=TRUE)
Root Node
The tree starts at the root, where 48% of the mushrooms are poisonous even though the probability of being edible is higher (52%). The initial prediction is nevertheless "poisonous" because class assignment minimizes the expected cost under the penalty matrix: predicting "edible" would cost roughly 0.48 × 10 = 4.8 per observation, while predicting "poisonous" costs only 0.52 × 1 = 0.52. Since this is the root node, all observations are included (100%). This node splits into node 2 and node 3 (terminal).
Leaf Node 3
This is a terminal node in which the mushroom is classified as poisonous; 46% of the observations land here. If the odor is anything other than almond "a", anise "l", or none "n", the mushroom has a 100% probability of being poisonous.
Node 2
If the mushroom's odor is almond "a", anise "l", or none "n", it is predicted edible; 54% of the mushrooms fall in this node. The probability of being poisonous is only 3% at this stage, but additional features still need to be inspected. This node branches into node 4 and node 5.
Leaf Node 4
When the odor is "a", "l", or "n" and the spore print color is buff "b", chocolate "h", black "k", brown "n", orange "o", purple "u", or yellow "y", the mushroom has a 100% probability of being edible. 45% of the observations meet these criteria.
Node 5
If the spore print color is instead green "r" or white "w", the probability that the mushroom is poisonous rises to 18%. At this stage, 9% of the observations meet this criterion. The node develops further into node 10 and node 11.
Node 10
The mushroom is likely edible if the population is clustered "c", numerous "n", scattered "s", or solitary "y". In this node the probability of being edible is 97%, and 7% of the observations share all these characteristics. The node finally splits into terminal node 20 and terminal node 21.
Node 11
When the mushroom matches all the previous conditions but the population is abundant "a" or several "v", it has a 64% probability of being poisonous. However, only 2% of the observations in this dataset match this description. This node separates into terminal nodes 22 and 23.
Leaf Node 20
Within this branch, if the gill size is broad "b", the mushroom has a 100% probability of being edible, given that all the previous conditions were met. 6% of the cases carry all these descriptors.
Leaf Node 21
When the gill size is narrow "n" rather than broad, the mushroom is classified as poisonous even though all the other conditions are met. Only a very small share of mushrooms falls in this node.
Leaf Node 22
If the habitat is leaves "l" or paths "p", the mushroom has a 100% probability of being edible, but only a few cases (1%) show these traits.
Leaf Node 23
If the habitat is not leaves or paths but grasses "g", meadows "m", urban "u", waste "w", or woods "d", the mushroom is poisonous with 100% probability. Only 1% of the observations fall in this node.
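As a compact cross-check of the node-by-node walkthrough above, the rpart.plot package can also print each leaf as a readable rule. This call is not part of the original analysis, only an optional sketch:
rpart.plot::rpart.rules(tree, cover = TRUE) # one rule per leaf, with the share of observations covered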
4. Tuning the Tree
This decision tree effectively uses mushroom features such as odor, spore print color, gill size, population, and habitat to classify mushrooms as edible or poisonous. The terminal nodes express clear, high-confidence rules (either 100% edible or 100% poisonous), demonstrating the tree's ability to discern between classes from the data provided. The tree can be further tuned by choosing the smallest tree whose cross-validated error lies within one standard deviation of the minimum cross-validated error (Kabacoff, 2022). In this case the selected complexity parameter is 0.01 and the tree has five splits. The prune() function cuts off the least important splits based on the chosen complexity parameter. The fact that the tree remains unchanged at a CP of 0.01 after cross-validation suggests the initial model was already well balanced in size and accuracy.
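The pruning chunk for this tree is not shown in the report; a minimal sketch, mirroring the tree2 code used later in Section 6, would be:
tree$cptable # cross-validated error (xerror) per candidate CP, printed below
cp.optim <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
tree <- prune(tree, cp = cp.optim) # prunes to the best CP; here the tree is unchanged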
## CP nsplit rel error xerror xstd
## 1 0.70701391 0 1.00000000 1.000000e+01 0.1192054496
## 2 0.15596330 1 0.29298609 2.929861e-02 0.0029221105
## 3 0.08671204 2 0.13702279 1.131400e+00 0.0560459358
## 4 0.03551347 3 0.05031074 1.535957e-01 0.0210674150
## 5 0.01479728 4 0.01479728 1.515241e-01 0.0210551689
## 6 0.01000000 5 0.00000000 5.918911e-04 0.0004184658
5. Validating the Decision Tree
The confusion matrix is one of the most important parts of the analysis because it shows the accuracy of the model; here the model predicted every case correctly. The True Positives (829) are the mushrooms correctly classified as edible. Special attention should be paid to the fact that no poisonous mushroom was predicted edible: the False Positive count is 0. That Type I error would be the worst-case scenario for this application, since people's lives would be in danger. The True Negatives (795) are the mushrooms correctly classified as poisonous, meaning the model predicted poisonous and the actual class was poisonous. The False Negative count is also 0, so no edible mushrooms are misclassified as poisonous and every edible variety remains available for consumption. False negatives are not considered as serious as the Type I error, but it is ideal that there are no misclassifications at all (Costa E Silva et al., 2020).
pred <- predict(object = tree, # the decision tree to use
                mushrooms_test[-1], # test predictors, with the class feature removed
                type = "class") # return predicted classes
t <- table(mushrooms_test$class, pred) # actual vs. predicted, input for the confusion matrix
confusionMatrix(t)
## Confusion Matrix and Statistics
##
## pred
## e p
## e 829 0
## p 0 795
##
## Accuracy : 1
## 95% CI : (0.9977, 1)
## No Information Rate : 0.5105
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5105
## Detection Rate : 0.5105
## Detection Prevalence : 0.5105
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : e
##
Furthermore, additional calculations regarding the model's accuracy are performed. Accuracy is 1, which means that all instances were correctly classified; in real-world situations a result this good is usually too good to be true, because it would mean the predictions can be trusted 100%. A Precision (Positive Predictive Value) of 1 indicates that every mushroom predicted edible was actually edible. Sensitivity (True Positive Rate) is the proportion of actual positives correctly identified by the model; at 1, 100% of edible mushrooms were correctly identified. Specificity (True Negative Rate) is the proportion of actual negatives correctly identified; at 1, all poisonous mushrooms were correctly identified. Overall, these results indicate an exceptionally accurate and precise model with perfect classification performance, probably because the test set is a random subset of the same, highly separable dataset, which flatters the evaluation (Kabacoff, 2022; Gil, 2018).
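The chunk behind these figures is not shown; computed from the confusion table t above, a hedged reconstruction would be:
# Assumed reconstruction of the manual metrics; rows of t are actual, columns predicted
TP <- t["e", "e"]; TN <- t["p", "p"]
FP <- t["p", "e"]; FN <- t["e", "p"]
(TP + TN) / sum(t) # accuracy: 1
TP / (TP + FP) # precision (positive predictive value): 1
TP / (TP + FN) # sensitivity (true positive rate): 1
TN / (TN + FP) # specificity (true negative rate): 1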
6. What If: Penalty Matrix is not applied
If the penalty matrix were not applied, the model would have only two splits and an accuracy of 99%, and 11 poisonous mushrooms would be classified as edible, which could potentially lead to fatal cases.
tree2 <- rpart(class ~ ., # uses all predictors in the training set
               data = mushrooms_train, method = "class")
rpart.plot(tree2, nn=TRUE)
# Only two splits remain, and two features matter: odor and spore print color
cp.optim <- tree2$cptable[which.min(tree2$cptable[,"xerror"]), "CP"]
tree2$cptable
## CP nsplit rel error xerror xstd
## 1 0.96827940 0 1.00000000 1.00000000 0.012905966
## 2 0.01986543 1 0.03172060 0.03172060 0.003163669
## 3 0.01000000 2 0.01185517 0.01185517 0.001943424
# 0.01 is the complexity parameter with the lowest cross-validated error; 2 splits
# Tree pruning using the best complexity parameter
tree2 <- prune(tree2, cp=cp.optim) # prunes the tree based on 0.01 CP
rpart.plot(tree2, nn=TRUE) # the tree is unchanged because we already had the best option.
#Testing the model
pred2 <- predict(object = tree2, # the pruned tree built without the penalty matrix
                 mushrooms_test[-1], # test predictors, with the class feature removed
                 type = "class") # return predicted classes
# Calculating accuracy
t2 <- table(mushrooms_test$class, pred2) # prepares the data for the confusion matrix
confusionMatrix(t2)
## Confusion Matrix and Statistics
##
## pred2
## e p
## e 829 0
## p 11 784
##
## Accuracy : 0.9932
## 95% CI : (0.9879, 0.9966)
## No Information Rate : 0.5172
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9864
##
## Mcnemar's Test P-Value : 0.002569
##
## Sensitivity : 0.9869
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9862
## Prevalence : 0.5172
## Detection Rate : 0.5105
## Detection Prevalence : 0.5105
## Balanced Accuracy : 0.9935
##
## 'Positive' Class : e
##
Conclusion
In this practice, a classic decision tree classifier was applied to the mushroom dataset.
The most important features to look for when foraging mushrooms are odor, spore print color, gill size, population, and habitat. The model's performance is exceptional: perfect classification accuracy and precision are uncommon in practice, especially on real-world datasets with inherent complexity and noise. Contributing factors are that the data's characteristics are simple and well separated, the classes are balanced (52% edible, 48% poisonous), the model was evaluated on held-out test data (20% of the observations), and the penalty matrix likely improved performance by prioritizing the costliest errors (Brownlee, 2020). Without the penalty matrix, the model would have only two splits and an accuracy of 99%, and 11 poisonous mushrooms would be classified as edible, which could potentially lead to fatal cases.
The decision tree model can be safely used to determine whether a mushroom can be consumed by humans. Additional variables could be added, such as the region of the country where a mushroom grows and the time of year it usually fruits (Mushrooms and Other Fungi, n.d.), since without the proper conditions and the right region, mushrooms do not produce fruiting bodies. Further analysis could compare models (random forests, support vector machines) or identify which mushrooms are suitable for culinary or medicinal use.
References
Aslett, L. (n.d.). Data Mining Lab 4: New Tree Data Set and Loss Matrices. Retrieved April 19, 2024, from https://www.louisaslett.com/Courses/Data_Mining_09-10/ST4003-Lab4-New_Tree_Data_Set_and_Loss_Matrices.pdf
Brownlee, J. (2020, August 15). 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset. Machine Learning Mastery. https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Costa E Silva, E., Lopes, C., Correia, A., & Faria, S. (2020). A logistic regression model for consumer default risk. Journal of Applied Statistics, 47, 1-17. https://doi.org/10.1080/02664763.2020.1759030
Gil, J. I. (2018, March 23). Why do I get a 100% accuracy decision tree? Cross Validated. https://stats.stackexchange.com/questions/336055/why-do-i-get-a-100-accuracy-decision-tree
Kabacoff, R. (2022). R in Action: Data Analysis and Graphics with R and Tidyverse. Manning Publications Co.
Milborrow, S. (2021, June 1). Plotting rpart trees with the rpart.plot package. Retrieved April 19, 2024, from http://www.milbo.org/rpart-plot/prp.pdf
Mushrooms and Other Fungi. (n.d.). U.S. National Park Service. Retrieved April 21, 2024, from https://www.nps.gov/chir/learn/nature/mushrooms.htm
Williams, G. (2010, August 22). Data Mining Survivor: Tuning Parameters - Loss Matrix. Togaware. Retrieved April 19, 2024, from https://datamining.togaware.com/survivor/Loss_Matrix.html