When the classification rules need to be transparent and explicit, the kNN and naive Bayes methods we presented earlier may not be appropriate, as they do not express their decisions as explicit rules. In some situations, we need well-stated rules for our decisions, just like a scoring criterion for driving ability, credit scoring, or a grading rubric. For instance, a driving test may assign credit or deduct penalty points for each driving task; the cumulative result of such explicit rules is effectively a decision tree that judges whether an applicant is qualified to receive a driver's license.
In this chapter, we will (1) see a simple motivational example of decision trees based on the Iris data, (2) describe decision-tree divide and conquer methods, (3) examine certain measures quantifying classification accuracy, (4) show strategies for pruning decision trees, (5) work through a Quality of Life in Chronic Disease case-study, and (6) review the One Rule and RIPPER algorithms.
Decision tree learners enable classification via tree structures modeling the relationships among all features and potential outcomes in the data. All decision trees begin with a trunk (root) representing all data points as members of the same cohort. Iteratively, the trunk is split into narrower and narrower branches by forking decisions based on the intrinsic data structure. At each step, splitting the data into branches may involve binary or multinomial classification. The final decision is obtained when the tree branching process terminates. The terminal (leaf) nodes represent the action to be taken as the result of the series of branching decisions. For predictive models, the leaf nodes provide the expected forecasting results given the series of events in the decision tree.
There are a number of R packages available for decision tree classification, including rpart, C5.0, party, etc.
Let's start with a simple example using the Iris dataset, which we saw in Chapter 2. The data features (attributes) include Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, and the classes are represented by the Species taxa (setosa, versicolor, and virginica).
# install.packages("party")
library("party")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris); table(iris$Species)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
##
## setosa versicolor virginica
## 50 50 50
The ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris) call below builds a conditional-inference decision tree.
iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
print(iris_ctree)
##
## Conditional inference tree with 4 terminal nodes
##
## Response: Species
## Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## Number of observations: 150
##
## 1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
## 2)* weights = 50
## 1) Petal.Length > 1.9
## 3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
## 4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
## 5)* weights = 46
## 4) Petal.Length > 4.8
## 6)* weights = 8
## 3) Petal.Width > 1.7
## 7)* weights = 46
plot(iris_ctree, cex=2) # party::plot.BinaryTree()
head(iris); tail(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
Similarly, we can demonstrate a classification of the iris taxa via rpart:
library(rpart)
library(rpart.utils)
library(plotly)
# iris_rpart = rpart(Species~., data=iris)
# print(iris_rpart)
#
# Use the `rattle::fancyRpartPlot` to generates an elegant plot
# install.packages("rattle")
# library(rattle)
# fancyRpartPlot(iris_rpart, cex = 1.5, caption = "rattle::fancyRpartPlot (Iris flowers)")
#
#### Plot_ly decision tree visualization
iris_rpart <- rpart(Species ~ . , method="class", data=iris)
# printcp(iris_rpart)

# Set the Graph/Tree Node names
treeFrame <- iris_rpart$frame

# Identify Leaf nodes
isLeave <- treeFrame$var == "<leaf>"
nodes <- rep(NA, length(isLeave))

# Get the 3 Iris taxa labels: "setosa", "versicolor", "virginica" for all nodes (including leaves)
ylevel <- attr(iris_rpart, "ylevels")
nodes[isLeave] <- ylevel[treeFrame$yval][isLeave]
nodes[!isLeave] <- labels(iris_rpart)[-1][!isLeave[-length(isLeave)]]

# Get all Tree Graph Connections (source --> target)
treeFrame <- iris_rpart$frame
treeRules <- rpart.utils::rpart.rules(iris_rpart)

targetPaths <- sapply(as.numeric(row.names(treeFrame)),
                      function(x) { strsplit(unlist(treeRules[x]), split=",") })
lastStop <- sapply(1:length(targetPaths),
                   function(x) { targetPaths[[x]][length(targetPaths[[x]])] })
oneBefore <- sapply(1:length(targetPaths),
                    function(x) { targetPaths[[x]][length(targetPaths[[x]])-1] })

# Initialize the tree edges
target = c()
source = c()
values = treeFrame$n
for (i in 2:length(oneBefore)) {
  tmpNode = oneBefore[[i]]
  q = which(lastStop == tmpNode)
  q = ifelse(length(q)==0, 1, q)
  source = c(source, q)
  target = c(target, i)
}
source = source - 1
target = target - 1
# Display the sankey tree
plot_ly(type = "sankey", orientation = "h",
node = list(label = nodes, pad = 15, thickness = 30, line = list(color="black", width=1)),
link = list(source = source, target = target, value=values[-1])) %>%
layout(title = "Iris rPart Decision Tree", font = list(size = 12))
A decision tree can be viewed as an upside-down tree where a series of logical decisions are encoded as branches between nodes. Classification begins at the root node and proceeds through successive branches until a terminal node is reached. This iterative process splits the data into different classes by rigid criteria.
Decision trees involve recursive partitioning that use data features and attributes to split the data into groups (nodes) of similar classes.
To make classification trees using data features, we need to observe the pattern between the data features and potential classes using training data. We can draw scatter plots and separate groups that are clearly bundled together. Each of the groups is considered a segment of the data. After getting the approximate range of each feature value within each group, we can make the decision tree.
\[ X = [X_1, X_2, X_3, ..., X_k] = \underbrace{ \begin{pmatrix} x_{1, 1} & x_{1, 2} & ... & x_{1, k} \\ x_{2, 1} & x_{2, 2} & ... & x_{2, k} \\ ... & ... & ... & ... \\ x_{n, 1} & x_{n, 2} & ... & x_{n, k} \end{pmatrix} }_{features/attributes} \left\{ \begin{array}{ll} & \\ cases & \\ & \end{array} \right. \]
The decision tree algorithms use a top-down recursive divide-and-conquer approach (sometimes they may also use bottom-up or mixed splitting strategies) to divide and evaluate the splits of a dataset \(D\) (input). The best split decision corresponds to the split with the highest information gain, reflecting a partition of the data into \(K\) subsets (using divide-and-conquer). The iterative algorithm terminates when some stopping criteria are reached. Examples of stopping conditions used to terminate the recursive process include:
- All the samples in a node belong to the same class, i.e., the node is pure.
- There are no remaining attributes on which the samples may be further partitioned.
- There are no samples left.
One objective criterion for splitting or clustering data into groups is based on the information gain measure, or impurity reduction, which can be used to select the test attribute at each node in the decision tree. The attribute with the highest information gain (i.e., greatest entropy reduction) is selected as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions. There are three main measures to evaluate the impurity reduction: misclassification error, Gini index, and entropy.
For a given table containing pairs of attributes and their class labels, we can assess the homology of the classes in the table. A table is pure (homogeneous) if it only contains a single class. If a data table contains several classes, then we say that the table is impure or heterogeneous. This degree of impurity or heterogeneity can be quantitatively evaluated using impurity measures like entropy, Gini index, and misclassification error.
Entropy is an information measure of the amount of disorder or uncertainty in a system. Suppose we have a data set \(D=(X_1, X_2, ..., X_n)\) that includes \(n\) features (variables) and suppose each of these features can take on any of \(k\) possible values (states). Then the cardinality of the entire system is \(k^n\), as each of the features is assumed to have \(k\) independent states, and thus the total number of different datasets that can be expected is \(\underbrace{k\times k\times k \times ...\times k}_{n}=k^n\). Suppose \({\displaystyle p_{1},p_{2},..., p_k}\) represent the proportions of each class (note: \(\sum_{i} {p_i}=1\)) present in the child node that results from a split in a decision tree classifier. Then the entropy measure is defined by: \[ Entropy(D)= - \sum_{i} {p_i \log_{2}{p_i}}.\] If each of the \(1\le i\leq k\) states is equally likely to be observed, with probability \(p_i=\frac{1}{k}\), then the entropy is maximized: \[ Entropy(D)=- \sum_{i=1}^{k} {\frac{1}{k} \log_2{\frac{1}{k}}}= \sum_{i=1}^{k} {\frac{1}{k} \log_2{k}}= k\left(\frac{1}{k}\right) \log_2{k}=\log_2{k}.\]
At the other extreme, the entropy is minimized when the probability of one class is unitary and all the others are trivial. Note that by L'Hopital's Rule \(\lim_{x\longrightarrow 0^+} {x \log(x)} =\lim_{x\longrightarrow 0^+} {\frac{\log(x)}{\frac{1}{x}}}=\lim_{x\longrightarrow 0^+} {\frac{\frac{1}{x}}{-\frac{1}{x^2}}}=\lim_{x\longrightarrow 0^+} {(-x)}=0\). Hence, for a single-class classification where the probability of one class is unitary (\(p_{i_o}=1\)) and the others are trivial (\(p_{i\not= i_o}=0\)):
\[Entropy(D)=p_{i_o}\log{p_{i_o}} + \sum_{i\not=i_o}{p_i \log{p_i}}=1\times \log{1} + \lim_{x \longrightarrow 0^+}{\sum_{i\not=i_o}{x \log(x)}} = 0+0=0.\]
In classification settings, higher entropy (i.e., more disorder) corresponds to a sample that has a mixed collection of labels. Conversely, lower entropy corresponds to a classification where we have mostly pure partitions. In general, the entropy of a sample \(D=\{X_1, X_2, ..., X_n\}\) is defined by:
\[H(D) = - \sum_{i=1}^{k} {P(c_i|D) \log{P(c_i|D)}}, \]
where \(P(c_i|D)\) is the probability of a data point in \(D\) being labeled with class \(c_i\), and \(k\) is the number of classes (clusters). \(P(c_i|D)\) can be estimated from the observed data by:
\[P(c_i|D) = \frac{P(D | c_i)P(c_i)}{P(D)} = \frac{|\{x_j \in D | x_j \text{ has label } y_j = c_i\}|}{|D|}.\]
Observe that if the observations are evenly split among all \(k\) classes, then \(P(c_i|D)=\frac{1}{k}\) and \[H(D) = - \sum_{i=1}^{k} {\frac{1}{k} \log{\frac{1}{k}}}=\log{k}.\]
At the other extreme, if all the observations are from one class then: \[H(D)=-1*\log(1)=0.\]
Also note that the base of the log function is somewhat irrelevant and can be used to normalize (scale) the range of the entropy \(\left ( log_b(x) = \frac{log_2(x)}{log_2(b)}\right )\).
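To make these extreme cases concrete, here is a minimal R sketch (the helper function entropy() is ours, not from a package) verifying that the uniform distribution maximizes entropy at \(\log_2{k}\) and that a pure class yields 0:
# Shannon entropy (base 2) of a probability vector; 0*log(0) is treated as 0
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

entropy(rep(1/4, 4))   # uniform over k=4 classes: log2(4) = 2
entropy(c(1, 0, 0, 0)) # a single (pure) class: 0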
The information gain is the expected reduction in entropy caused by knowing the value of an attribute.
Similar to the entropy measure, the misclassification error and the Gini index can also be used to evaluate information gain. The misclassification error is defined by the formula:
\[ME = 1-\max_k (p_k).\]
And the Gini index is expressed as: \[GI = \sum_{k}p_k(1-p_k)=1-\sum_{k}p_k^2.\]
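As a quick side-by-side comparison, the following sketch (the helper names are ours) evaluates all three impurity measures on one impure and one pure two-class node:
misclass_error <- function(p) 1 - max(p)     # ME = 1 - max_k(p_k)
gini_index     <- function(p) 1 - sum(p^2)   # GI = 1 - sum_k(p_k^2)
entropy2       <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

p_impure <- c(0.8, 0.2)
c(ME = misclass_error(p_impure), GI = gini_index(p_impure), H = entropy2(p_impure))
# A pure node scores 0 under all three measures
c(ME = misclass_error(c(1, 0)), GI = gini_index(c(1, 0)), H = entropy2(c(1, 0)))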
The C5.0 algorithm is a popular implementation of decision trees. To begin with, let's consider the term purity. If a segment of data contains observations from only a single class, it is considered pure. The entropy provides a mathematical formalism for measuring the purity of data segments.
\[Entropy(S)=- \sum_{i=1}^c {p_i \log_2(p_i)},\]
where \(c\) is the total number of class levels and \(p_i\) refers to the proportion of observations that fall into class \(i\) (i.e., the probability that a randomly selected data point belongs to the \(i^{th}\) class level). For 2 possible classes, the entropy ranges from 0 to 1. For \(n\) classes, the entropy ranges from \(0\) to \(\log_2(n)\), where the minimum entropy corresponds to data that is purely homogeneous (completely deterministic/predictable) and the maximum entropy represents completely disordered data (stochastic or extremely noisy). You might wonder what the benefit of using the entropy is. The answer is that the smaller the entropy, the more information is gained by the corresponding decision node split. Systems (data) with high entropy carry significant information content (randomness), whereas data with low entropy are highly compressible, indicating internal structure.
If we only have one class in a segment, the entropy is trivial: \(Entropy(S)=(-1)\times \log_2(1)=0\). Let's try another example. If we have a segment of data that contains two classes, where the first class contains 80% of the data and the second the remaining 20%, then the entropy is: \(Entropy(S)=-0.8\log_2(0.8)-0.2\log_2(0.2)=0.7219281.\)
For a two class problem, the relationship between class-proportions and entropy is illustrated in the following graph. Assume \(x\) is the proportion for one of the classes.
N <- 100
set.seed(1234)
x <- runif(N)
x_ind <- sort(x)
y <- -x_ind*log2(x_ind) - (1-x_ind)*log2(1-x_ind)
# curve(-x*log2(x)-(1-x)*log2(1-x), col="red", main="Entropy for Different Proportions", xlab = "x (proportion for class 1)", ylab = "Entropy", lwd=3)
# segments(0.8, 0, 0.8, 0.7219281, lwd=2, col="blue", lty = 2)
# points(0.8, 0.7219281, col = "blue", pch=21, cex=2, lwd=2)
# text(-0.02, 0.3, pos=4, srt = 90, "High-Structure/Low-Entropy")
# text(0.48, 0.3, pos=4, srt = 90, "Low-Structure/High-Entropy")
# text(0.96, 0.3, pos=4, srt = 90, "High-Structure/Low-Entropy")
# text(0.65, 0.8, pos=4, "(x=0.8,Entr=0.72)")
modelLabels <- c('High-Structure/Low-Entropy', 'Low-Structure/High-Entropy', 'High-Structure/Low-Entropy')
modelLabels.x <- c(0.03, 0.5, 0.97)
modelLabels.y <- c(0.5, 0.7, 0.5)
modelLabels.col <- c("blue", "red", "blue")
plot_ly() %>%
add_lines(x = ~x_ind, y = ~y, name="Entropy", line = list(width = 4)) %>%
add_segments(x=0.8, xend=0.8, y=0.07, yend = 0.7219281, showlegend=F) %>%
# add_segments(x=-0.2, xend=1.1, y=0, yend = 0, showlegend=F) %>%
# add_markers(x = ~x, y = ~y, name="Sample Simulated Data") %>%
#add_lines(x = ~c(point1$x,point2$x), y = ~c(point1$y,point2$y), name="Second PC, Orthogonal to lm(Y ~ X)",
# line = list(width = 4)) %>%
add_markers(x = 0.8, y = 0.7219281, name="(x=0.8,Entropy=0.72)", marker = list(size = 20,
color = 'green', line = list(color = 'yellow', width = 2))) %>%
layout(title="Binary Class Entropy",
xaxis=list(title="Proportion of Class 1 Obsevations", scaleanchor="y"), # control the y:x axes aspect ratio
yaxis = list(title="Entropy", scaleanchor = "x", range=c(-0.1, 1.1)), legend = list(orientation = 'h'),
annotations = list(text=modelLabels, x=modelLabels.x, y=modelLabels.y, textangle=90,
font=list(size=15, color=modelLabels.col), showarrow=FALSE))
The closer the binary proportion split is to \(0.5\) the greater the entropy. The more homogeneous the split (one class becomes the majority) the lower the entropy. Decision trees aim to find splits in the data that reduce the entropy, i.e., increasing the homogeneity of the elements within all classes.
This measure allows us to compare the information gained by using different features to partition the data. Let's consider this scenario. Suppose \(S\) and \(S_1\) represent the entropy of the system before and after the splitting/partitioning of the data according to a specific data feature attribute (\(F\)). Denote the entropies of the original and the derived partition by \(Entropy(S)\) and \(Entropy(S_1)\), respectively. The information we gained from partitioning the data using this specific feature (F) is calculated as a change in the entropy:
\[Gain(F) = Entropy(S)-Entropy(S_1)\]
Note that smaller entropy \(Entropy(S_1)\) corresponds to better classification and more information gained. A more complicated case arises when a partition creates multiple segments. Then the entropy of the partition is calculated by the following formula:
\[Entropy(S)=\sum_{i=1}^n w_i Entropy(P_i)=\sum_{i=1}^{n} w_i \left (-\sum_{j=1}^{c}p_{ij}\log_2(p_{ij})\right ),\]
where \(w_i\) is the proportion of examples falling in segment \(P_i\) and \(p_{ij}\) is the proportion of class \(j\) within segment \(i\). Thus, the total entropy of a partition method is a weighted sum of the entropies of the segments created by this method.
When we get the maximum reduction in entropy with a feature (\(F\)) then the \(Gain(F)=Entropy(S)\), since \(Entropy(S_1)=0\). On the contrary, if we gain no information with this feature, we have \(Gain(F)=0\).
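As a hedged numerical illustration (the feature F and its split are hypothetical), consider a parent node with 100 cases split 50/50 between two classes, which F partitions into two equal segments that are each 90% pure:
entropy2 <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

parent <- entropy2(c(0.5, 0.5))   # parent node entropy = 1 bit
# segment 1: 50 cases, 90% class A; segment 2: 50 cases, 10% class A
child  <- 0.5*entropy2(c(0.9, 0.1)) + 0.5*entropy2(c(0.1, 0.9))
parent - child                    # Gain(F) ~ 0.53 bits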
When growing a decision tree, we can keep splitting the observations using as many partitions as we like, which may eventually over-classify the data. An extreme example would be making each observation its own class, which is meaningless.
So how do we control the size of the decision tree? One possible solution is to set a cutoff for the number of decisions that a decision tree may make. Similarly, we can control the minimum number of points in each segment. This approach is called early stopping, or pre-pruning, the decision tree. However, it might terminate the decision procedure before some important partitions occur.
Another solution, post-pruning, begins by growing a large decision tree and subsequently trimming branches based on error rates, with penalties at the nodes. This is often more effective than pre-pruning.
The C5.0 algorithm uses the post-pruning method to control the size of the decision tree. It first grows a large, overfitting tree that contains all the possible partitions. Then it cuts out nodes and branches that have little effect on classification errors.
In this chapter, we are using the Quality of Life and chronic disease dataset, Case06_QoL_Symptom_ChronicIllness.csv. This dataset has 41 variables; a detailed description of each variable is provided here. The most important variables for this case study are CHARLSONSCORE (the Charlson Comorbidity Index) and CHRONICDISEASESCORE (the composite chronic disease score we will try to predict).
Let’s load the data first.
qol <- read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
str(qol)
## 'data.frame': 2356 obs. of 41 variables:
## $ ID : int 171 171 172 179 180 180 181 182 183 186 ...
## $ INTERVIEWDATE : int 0 427 0 0 0 42 0 0 0 0 ...
## $ LANGUAGE : int 1 1 1 1 1 1 1 1 1 2 ...
## $ AGE : int 49 49 62 44 64 64 52 48 49 78 ...
## $ RACE_ETHNICITY : int 3 3 3 7 3 3 3 3 3 4 ...
## $ SEX : int 2 2 2 2 1 1 2 1 1 1 ...
## $ QOL_Q_01 : int 4 4 3 6 3 3 4 2 3 5 ...
## $ QOL_Q_02 : int 4 3 3 6 2 5 4 1 4 6 ...
## $ QOL_Q_03 : int 4 4 4 6 3 6 4 3 4 4 ...
## $ QOL_Q_04 : int 4 4 2 6 3 6 2 2 5 2 ...
## $ QOL_Q_05 : int 1 5 4 6 2 6 4 3 4 3 ...
## $ QOL_Q_06 : int 4 4 2 6 1 2 4 1 2 4 ...
## $ QOL_Q_07 : int 1 2 5 -1 0 5 8 4 3 7 ...
## $ QOL_Q_08 : int 6 1 3 6 6 6 3 1 2 4 ...
## $ QOL_Q_09 : int 3 4 3 6 2 2 4 2 2 4 ...
## $ QOL_Q_10 : int 3 1 3 6 3 6 3 2 4 3 ...
## $ MSA_Q_01 : int 1 3 2 6 2 3 4 1 1 2 ...
## $ MSA_Q_02 : int 1 1 2 6 1 6 4 3 2 4 ...
## $ MSA_Q_03 : int 2 1 2 6 1 2 3 3 1 2 ...
## $ MSA_Q_04 : int 1 3 2 6 1 2 1 4 1 5 ...
## $ MSA_Q_05 : int 1 1 1 6 1 2 1 6 3 2 ...
## $ MSA_Q_06 : int 1 2 2 6 1 2 1 1 2 2 ...
## $ MSA_Q_07 : int 2 1 3 6 1 1 1 1 1 5 ...
## $ MSA_Q_08 : int 1 1 1 6 1 1 1 1 2 1 ...
## $ MSA_Q_09 : int 1 1 1 6 2 2 4 6 2 1 ...
## $ MSA_Q_10 : int 1 1 1 6 1 1 1 1 1 3 ...
## $ MSA_Q_11 : int 2 3 2 6 1 1 2 1 1 5 ...
## $ MSA_Q_12 : int 1 1 2 6 1 1 2 6 1 3 ...
## $ MSA_Q_13 : int 1 1 1 6 1 6 2 1 4 2 ...
## $ MSA_Q_14 : int 1 1 1 6 1 2 1 1 3 1 ...
## $ MSA_Q_15 : int 2 1 1 6 1 1 3 2 1 3 ...
## $ MSA_Q_16 : int 2 3 5 6 1 2 1 2 1 2 ...
## $ MSA_Q_17 : int 2 1 1 6 1 1 1 1 1 3 ...
## $ PH2_Q_01 : int 3 2 1 5 1 1 3 1 2 3 ...
## $ PH2_Q_02 : int 4 4 1 5 1 2 1 1 4 2 ...
## $ TOS_Q_01 : int 2 2 2 4 1 1 2 2 1 1 ...
## $ TOS_Q_02 : int 1 1 1 4 4 4 1 2 4 4 ...
## $ TOS_Q_03 : int 4 4 4 4 4 4 4 4 4 4 ...
## $ TOS_Q_04 : int 5 5 5 5 5 5 5 5 5 5 ...
## $ CHARLSONSCORE : int 2 2 3 1 0 0 2 8 0 1 ...
## $ CHRONICDISEASESCORE: num 1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...
Most of the coded variables, like QOL_Q_01 (health rating), have ordinal values (1=excellent, 2=very good, 3=good, 4=fair, 5=poor, 6=no answer). We can use the table() function to see their distributions. We also have some numerical variables in the dataset, like CHRONICDISEASESCORE, which we can inspect using summary().
Our variable of interest, CHRONICDISEASESCORE, has some missing data. A simple way to address this is to delete the observations with missing values. You could also try to impute the missing values using the various imputation methods mentioned in Chapter 2.
table(qol$QOL_Q_01)
##
## 1 2 3 4 5 6
## 44 213 801 900 263 135
qol <- qol[!qol$CHRONICDISEASESCORE==-9, ]
summary(qol$CHRONICDISEASESCORE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.880 1.395 1.497 1.970 4.760
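As a hedged alternative sketch (run on a fresh copy, so it does not affect the pipeline below), one could median-impute the missing scores (coded -9) instead of dropping those cases:
# Alternative (not used below): median-impute instead of deleting the flagged cases
qol_imp <- read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
obs_med <- median(qol_imp$CHRONICDISEASESCORE[qol_imp$CHRONICDISEASESCORE != -9])
qol_imp$CHRONICDISEASESCORE[qol_imp$CHRONICDISEASESCORE == -9] <- obs_med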
Let's create two classes using the variable CHRONICDISEASESCORE. We classify the patients with CHRONICDISEASESCORE < mean(CHRONICDISEASESCORE) as having minor disease, and the rest as having severe disease. This dichotomous classification (qol$cd) may not be perfect; we will discuss alternative classification strategies in the practice problems at the end of the chapter.
qol$cd <- qol$CHRONICDISEASESCORE > 1.497
qol$cd <- factor(qol$cd, levels=c(F, T), labels = c("minor_disease", "severe_disease"))
To make the qol data more organized, we can order the data by the variable ID.
qol <- qol[order(qol$ID), ]

# Remove ID (col=1); the clinical Diagnosis (col=41) will be handled later
qol <- qol[ , -1]
Then, we are able to subset training and testing datasets. Here is an example of a non-random split of the entire data into training (\(2114\)) and testing (\(100\)) sets:
qol_train <- qol[1:2114, ]
qol_test  <- qol[2115:2214, ]
And here is an example of random assignment of cases into training and testing sets (an 80-20% split):
set.seed(1234)
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train <- qol[train_index, ]
qol_test  <- qol[-train_index, ]
We can quickly inspect the distributions of the training and testing data to ensure they are not vastly different. We can see that the classes are split fairly equally between the training and testing datasets.
prop.table(table(qol_train$cd))
##
## minor_disease severe_disease
## 0.5324675 0.4675325
prop.table(table(qol_test$cd))
##
## minor_disease severe_disease
## 0.4853273 0.5146727
In this section, we are using the C5.0() function from the C50 package. The function C5.0() has the following components:
m<-C5.0(train, class, trials=1, costs=NULL)
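The roles of these arguments, annotated as a brief sketch (see ?C50::C5.0 for the authoritative documentation):
# train  - a data frame of training features (the x argument)
# class  - a factor vector with the class label of each training row (the y argument)
# trials - an integer number of boosting iterations (1 = a single tree)
# costs  - an optional matrix of costs associated with the possible errors
m <- C5.0(train, class, trials=1, costs=NULL)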
You could delete the # in the following code and run it in R to install and load the C50 package.
# install.packages("C50")
library(C50)
In the qol dataset (the ID column is already removed), column 41 is the class vector (qol$cd) and column 40 is its numerical version (qol$CHRONICDISEASESCORE). We need to exclude these two columns to create a training dataset that contains only features.
summary(qol_train[,-c(40, 41)])
## INTERVIEWDATE LANGUAGE AGE RACE_ETHNICITY
## Min. : 0.00 Min. :1.000 Min. :20.00 Min. :1.000
## 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.:52.00 1st Qu.:3.000
## Median : 0.00 Median :1.000 Median :59.00 Median :3.000
## Mean : 20.29 Mean :1.199 Mean :58.55 Mean :3.618
## 3rd Qu.: 0.00 3rd Qu.:1.000 3rd Qu.:66.00 3rd Qu.:4.000
## Max. :440.00 Max. :2.000 Max. :90.00 Max. :7.000
## SEX QOL_Q_01 QOL_Q_02 QOL_Q_03
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :1.000 Median :4.000 Median :3.000 Median :4.000
## Mean :1.423 Mean :3.678 Mean :3.417 Mean :3.705
## 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :2.000 Max. :6.000 Max. :6.000 Max. :6.000
## QOL_Q_04 QOL_Q_05 QOL_Q_06 QOL_Q_07
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :-1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 1.000
## Median :3.000 Median :3.000 Median :3.000 Median : 5.000
## Mean :3.021 Mean :3.099 Mean :2.963 Mean : 4.172
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.: 7.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :10.000
## QOL_Q_08 QOL_Q_09 QOL_Q_10 MSA_Q_01
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.000 Median :3.000
## Mean :2.885 Mean :3.151 Mean :2.922 Mean :2.975
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_02 MSA_Q_03 MSA_Q_04 MSA_Q_05
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :2.000 Median :2.000 Median :1.000
## Mean :3.066 Mean :2.298 Mean :2.404 Mean :1.895
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_06 MSA_Q_07 MSA_Q_08 MSA_Q_09 MSA_Q_10
## Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :1.00 Median :2.000 Median :1.000
## Mean :2.401 Mean :2.172 Mean :1.48 Mean :2.193 Mean :1.717
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:1.00 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :6.000 Max. :6.000 Max. :6.00 Max. :6.000 Max. :6.000
## MSA_Q_11 MSA_Q_12 MSA_Q_13 MSA_Q_14
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :2.242 Mean :2.036 Mean :1.897 Mean :2.002
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_15 MSA_Q_16 MSA_Q_17 PH2_Q_01
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :2.000
## Mean :1.765 Mean :1.823 Mean :2.061 Mean :2.479
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## PH2_Q_02 TOS_Q_01 TOS_Q_02 TOS_Q_03
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:4.000
## Median :2.000 Median :2.000 Median :4.000 Median :4.000
## Mean :2.368 Mean :2.211 Mean :3.297 Mean :3.787
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :6.000 Max. :5.000 Max. :5.000 Max. :5.000
## TOS_Q_04 CHARLSONSCORE
## Min. :1.000 Min. :-9.0000
## 1st Qu.:5.000 1st Qu.: 0.0000
## Median :5.000 Median : 1.0000
## Mean :4.679 Mean : 0.9339
## 3rd Qu.:5.000 3rd Qu.: 1.0000
## Max. :6.000 Max. :10.0000
set.seed(1234)
qol_model <- C5.0(qol_train[ , -c(40, 41)], qol_train$cd)
qol_model
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Tree size: 5
##
## Non-standard options: attempt to group attributes
summary(qol_model)
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
##
##
## C5.0 [Release 2.07 GPL Edition] Sun Aug 15 18:12:09 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 1771 cases (40 attributes) from undefined.data
##
## Decision tree:
##
## CHARLSONSCORE <= 0: minor_disease (645/169)
## CHARLSONSCORE > 0:
## :...CHARLSONSCORE > 5: minor_disease (29/9)
## CHARLSONSCORE <= 5:
## :...AGE > 47: severe_disease (936/352)
## AGE <= 47:
## :...MSA_Q_08 <= 1: minor_disease (133/46)
## MSA_Q_08 > 1: severe_disease (28/8)
##
##
## Evaluation on training data (1771 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 5 584(33.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 583 360 (a): class minor_disease
## 224 604 (b): class severe_disease
##
##
## Attribute usage:
##
## 100.00% CHARLSONSCORE
## 61.94% AGE
## 9.09% MSA_Q_08
##
##
## Time: 0.0 secs
# plot(qol_model, subtree = 17) # ?C50::plot.C5.0
The output of qol_model indicates that we have a tree with 5 terminal nodes. summary(qol_model) reports that the classification error of this decision tree on the training data is 33%.
Now we can make predictions using the decision tree that we just built. The predict() function we will use is the same as the one we showed in earlier chapters, e.g., Chapter 2 and Chapter 7. In general, predict() is extended by each specific type of regression, classification, clustering, or forecasting machine learning technique. For example, randomForest::predict.randomForest() is invoked by:
predict(RF_model, newdata, type="response", norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE, cutoff, ...),
where type represents the type of prediction output to be generated - "response" (equivalent to "class"), "prob", or "votes". Thus, the predicted values are either the predicted "response" class labels, a matrix of class probabilities, or vote counts.
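For the C5.0 model fitted above, the analogous options are type="class" (predicted labels) and type="prob" (class-probability estimates); see ?C50::predict.C5.0. For example:
# Predicted labels vs. class probabilities for the first few test cases
head(predict(qol_model, qol_test[ , -c(40, 41)], type = "class"))
head(predict(qol_model, qol_test[ , -c(40, 41)], type = "prob"))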
This time, we are going to introduce the confusionMatrix() function from the caret package as the evaluation method. When we combine it with the table() function, the output of the evaluation is very straightforward.
# See docs for predict: ?C50::predict.C5.0
# remove the last 2 columns (CHRONICDISEASESCORE and cd), which represent the clinical outcomes we are predicting
qol_pred <- predict(qol_model, qol_test[ , -c(40, 41)])
# install.packages("caret")
library(caret)
confusionMatrix(table(qol_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred minor_disease severe_disease
## minor_disease 143 71
## severe_disease 72 157
##
## Accuracy : 0.6772
## 95% CI : (0.6315, 0.7206)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 3.001e-12
##
## Kappa : 0.3538
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6651
## Specificity : 0.6886
## Pos Pred Value : 0.6682
## Neg Pred Value : 0.6856
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4831
## Balanced Accuracy : 0.6769
##
## 'Positive' Class : minor_disease
##
The confusion matrix shows a prediction accuracy of about 68%; however, this may vary (see the corresponding confidence interval).
Recall that the C5.0 function has an option trials=, an integer specifying the number of boosting iterations. The default value of one indicates that a single model is used, and we can specify a larger number of iterations, for instance trials=6.
set.seed(1234)
qol_boost6 <- C5.0(qol_train[ , -c(40, 41)], qol_train$cd, trials=6) # try alternative values for the trials option
qol_boost6
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd, trials = 6)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Number of boosting iterations: 6
## Average tree size: 4.7
##
## Non-standard options: attempt to group attributes
We can see that the average tree size shrank to about 4.7 (this may vary from run to run).
Since this is a fairly small tree, we can visualize it with the function plot(). We also use the option type="simple" to make the tree look more condensed. You can also zoom in and display just a specific branch of the tree by using subtree=.
plot(qol_boost6, type="simple") # C50::plot.C5.0
Caution: The plotting of decision trees may fail if you have columns that start with numbers or contain special characters (e.g., "5variable", "!variable"). In general, avoid spaces, special characters, and other non-alphanumeric symbols in column/row names.
The next step is to make predictions and test their accuracy.
qol_boost_pred6 <- predict(qol_boost6, qol_test[ , -c(40, 41)])
confusionMatrix(table(qol_boost_pred6, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_boost_pred6 minor_disease severe_disease
## minor_disease 145 69
## severe_disease 70 159
##
## Accuracy : 0.6862
## 95% CI : (0.6408, 0.7292)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 1.747e-13
##
## Kappa : 0.3718
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6744
## Specificity : 0.6974
## Pos Pred Value : 0.6776
## Neg Pred Value : 0.6943
## Prevalence : 0.4853
## Detection Rate : 0.3273
## Detection Prevalence : 0.4831
## Balanced Accuracy : 0.6859
##
## 'Positive' Class : minor_disease
##
The accuracy is about 69%. However, this may vary each time we run the experiment (mind the confidence interval). In some studies, the trials option provides a significant improvement to the overall accuracy. A good choice for this option is trials=10.
.
Suppose we want to reduce the false negative rate, which in this case means misclassifying a severe case as minor. A false negative (failure to detect a severe disease case) may be more costly than a false positive (misclassifying a minor disease case as severe). Misclassification errors can be expressed as a cost matrix:
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2)
error_cost
## [,1] [,2]
## [1,] 0 4
## [2,] 1 0
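Here the off-diagonal entries penalize the two error types asymmetrically: the 4 makes one error type four times as costly as the other. As a hedged readability sketch (commented out so it does not alter the pipeline below), we could attach the class levels as dimnames; confirm the expected predicted-vs-true row/column orientation in ?C50::C5.0 before relying on it:
# Optional: label the cost matrix with the class levels for readability
# (check ?C50::C5.0 for the expected row/column orientation)
# dimnames(error_cost) <- list(levels(qol_train$cd), levels(qol_train$cd))
# error_cost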
Let's build a decision tree with the option costs=error_cost.
.
set.seed(1234)
qol_cost <- C5.0(qol_train[ , -c(40, 41)], qol_train$cd, costs=error_cost)
qol_cost_pred <- predict(qol_cost, qol_test)
confusionMatrix(table(qol_cost_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_cost_pred minor_disease severe_disease
## minor_disease 73 14
## severe_disease 142 214
##
## Accuracy : 0.6479
## 95% CI : (0.6014, 0.6923)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 1.024e-08
##
## Kappa : 0.2829
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3395
## Specificity : 0.9386
## Pos Pred Value : 0.8391
## Neg Pred Value : 0.6011
## Prevalence : 0.4853
## Detection Rate : 0.1648
## Detection Prevalence : 0.1964
## Balanced Accuracy : 0.6391
##
## 'Positive' Class : minor_disease
##
Although the overall accuracy decreased, the number of false negatives was reduced from 71 (without specifying a cost matrix) to 14 (when specifying a non-trivial, loaded cost matrix). This comes at the cost of an increased rate of false-positive labeling (minor disease cases misclassified as severe).
There are multiple options for plotting trees fitted by rpart or C5.0.
library("rpart")
# remove CHRONICDISEASESCORE, but keep *cd* label
set.seed(1234)
qol_model <- rpart(cd~., data=qol_train[ , -40], cp=0.01)
# here we use rpart::cp = *complexity parameter* = 0.01
qol_model
## n= 1771
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1771 828 minor_disease (0.5324675 0.4675325)
## 2) CHARLSONSCORE< 0.5 645 169 minor_disease (0.7379845 0.2620155) *
## 3) CHARLSONSCORE>=0.5 1126 467 severe_disease (0.4147425 0.5852575)
## 6) AGE< 47.5 165 66 minor_disease (0.6000000 0.4000000) *
## 7) AGE>=47.5 961 368 severe_disease (0.3829344 0.6170656) *
You can also plot the tree directly using rpart.plot:
library(rpart.plot)
##
## Attaching package: 'rpart.plot'
## The following object is masked from 'package:rpart.utils':
##
## rpart.rules
rpart.plot(qol_model, type = 4,extra = 1,clip.right.labs = F)
We can use fancyRpartPlot to render a more colorful decision tree.
library("rattle")
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(qol_model, cex = 1, caption = "rattle::fancyRpartPlot (QoL Data)")
qol_pred <- predict(qol_model, qol_test, type = 'class')
confusionMatrix(table(qol_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred minor_disease severe_disease
## minor_disease 143 74
## severe_disease 72 154
##
## Accuracy : 0.6704
## 95% CI : (0.6245, 0.7141)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 2.276e-11
##
## Kappa : 0.3405
##
## Mcnemar's Test P-Value : 0.934
##
## Sensitivity : 0.6651
## Specificity : 0.6754
## Pos Pred Value : 0.6590
## Neg Pred Value : 0.6814
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4898
## Balanced Accuracy : 0.6703
##
## 'Positive' Class : minor_disease
##
These results are consistent with their counterparts reported using C5.0.
How can we tune the parameters to further improve the results?
set.seed(1234)
control = rpart.control(cp = 0.000, xval = 100, minsplit = 2)
qol_model = rpart(cd ~ ., data = qol_train[ , -40], control = control)
# plotcp(qol_model)
# printcp(qol_model)

data <- as.data.frame(qol_model$cptable)
data$CP <- as.factor(data$CP)
plot_ly(data = data, x = ~CP, y = ~xerror, type = 'scatter', mode='lines+markers',
name = 'Test', error_y = ~list(array = xstd, color = 'gray')) %>%
#add_segments(x=0, xend=0.23, y=0.76, yend = 0.76, line = list(dash = "dash"), showlegend=F) %>%
layout(title="Complexity Parameter vs. (CV) Error Rate")
Now we can prune the tree according to the optimal cp, the complexity parameter to which the rpart object will be trimmed. Instead of using the raw error (e.g., \(1-R^2\), RMSE) to capture the discrepancy between the observed and model-predicted labels, we will use xerror, which averages the discrepancy between observed and predicted classifications using cross-validation; see Chapter 20.
set.seed(1234)
selected_tr <- prune(qol_model, cp = qol_model$cptable[which.min(qol_model$cptable[ , "xerror"]), "CP"])
fancyRpartPlot(selected_tr, cex = 1, caption = "rattle::fancyRpartPlot (QoL Data)")

qol_pred_tune <- predict(selected_tr, qol_test, type = 'class')
confusionMatrix(table(qol_pred_tune, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred_tune minor_disease severe_disease
## minor_disease 143 74
## severe_disease 72 154
##
## Accuracy : 0.6704
## 95% CI : (0.6245, 0.7141)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 2.276e-11
##
## Kappa : 0.3405
##
## Mcnemar's Test P-Value : 0.934
##
## Sensitivity : 0.6651
## Specificity : 0.6754
## Pos Pred Value : 0.6590
## Neg Pred Value : 0.6814
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4898
## Balanced Accuracy : 0.6703
##
## 'Positive' Class : minor_disease
##
The result is roughly the same as that of C5.0. Although there is no substantial classification improvement, the tree-pruning process generates a graphical representation of the decision-making protocol (selected_tr) that is much simpler and more intuitive than the original (un-pruned) tree (qol_model):
fancyRpartPlot(qol_model, cex = 0.1, caption = "rattle::fancyRpartPlot (QoL Data)")
We can change the splitting criterion via parms=list(split=...); rpart supports split="information" (entropy-based) and split="gini". Experiment with these and compare the results.
set.seed(1234)
qol_model = rpart(cd ~ ., data=qol_train[ , -40], parms = list(split = "information")) # rpart accepts "information" (entropy) or "gini"
fancyRpartPlot(qol_model, cex = 1, caption = "rattle::fancyRpartPlot (QoL data)")
# Modify and test using "error" and "gini"
# qol_pred<-predict(qol_model, qol_test,type = 'class')
# confusionMatrix(table(qol_pred, qol_test$cd))
In addition to the classification trees we just saw, we can explore classification rules that utilize if-else logical statements to assign classes to unlabeled data. Below we review three classification rule strategies.
Separate-and-conquer repeatedly splits the data (and subsets of the data) by rules that cover a subset of examples. This procedure is very similar to the divide-and-conquer approach. However, a notable difference is that each rule can be independent, whereas each decision node in a tree is linked to past decisions.
To understand the One Rule (OneR) algorithm, we need to know about its "sibling", the ZeroR rule. The ZeroR rule assigns the mode (majority) class to unlabeled test observations regardless of their feature values. The One Rule (OneR) algorithm is an improved version of ZeroR that uses a single rule for classification. In other words, OneR splits the training dataset into several segments based on feature values. Then it assigns the modes of the classes in those segments to the related observations in the unlabeled test data. In practice, we first test multiple rules and pick the rule with the smallest error rate as our One Rule. Remember, the rules are somewhat subjective.
The Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm is a combination of the ideas behind decision trees and classification rules. It consists of a three-step process: (1) grow a rule, (2) prune it, and (3) optimize the resulting rule set.
Let’s take another look at the same dataset as Case Study 1 - this time applying classification rules. Naturally, we skip over the first two data handling steps and go directly to step three.
Let's start by using the OneR() function in the RWeka package. Before installing the package, you may want to check that the Java installation on your computer is up to date and that its version matches the version of R (i.e., 64-bit R needs 64-bit Java).
The function OneR() has the following components:
m <- OneR(class ~ predictors, data=mydata)
Using the shorthand formula y ~ . , we include all of the column variables as predictors.
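As a quick hedged illustration on a familiar dataset, applying OneR() to the iris data (seen at the start of this chapter) yields a single-feature rule, typically based on one of the petal measurements:
# OneR baseline on the iris data
library(RWeka)
iris_1R <- OneR(Species ~ ., data = iris)
iris_1R   # typically a single rule on one of the petal measurements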
instead, we include all of the column variables as predictors.# install.packages("RWeka")
library(RWeka)
# just remove the CHRONICDISEASESCORE but keep cd
set.seed(1234)
qol_1R <- OneR(cd~., data=qol[ , -40])
qol_1R
## CHARLSONSCORE:
## < -4.5 -> severe_disease
## < 0.5 -> minor_disease
## < 5.5 -> severe_disease
## < 8.5 -> minor_disease
## >= 8.5 -> severe_disease
## (1453/2214 instances correct)
Note that 1,453 out of 2,214 cases are correctly classified, \(66\%\), by the “one rule”.
summary(qol_1R)
##
## === Summary ===
##
## Correctly Classified Instances 1453 65.6278 %
## Incorrectly Classified Instances 761 34.3722 %
## Kappa statistic 0.3206
## Mean absolute error 0.3437
## Root mean squared error 0.5863
## Relative absolute error 68.8904 %
## Root relative squared error 117.3802 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 609 549 | a = minor_disease
## 212 844 | b = severe_disease
We obtain a rule that correctly classifies 66% of the patients, which is in line with the prior decision tree classification results. Due to algorithmic stochasticity, these results may vary from run to run, although we used set.seed(1234) to ensure some reproducibility.
Another possible option for classification rules is the RIPPER algorithm that we discussed earlier in the chapter. In R, we use the Java-based function JRip() to invoke this algorithm.
JRip() uses a similar invocation protocol as OneR():
m <- JRip(class ~ predictors, data=mydata)
set.seed(1234)
qol_jrip1 <- JRip(cd~., data=qol[ , -40])
qol_jrip1
## JRIP rules:
## ===========
##
## (CHARLSONSCORE >= 1) and (RACE_ETHNICITY >= 4) and (AGE >= 49) => cd=severe_disease (448.0/132.0)
## (CHARLSONSCORE >= 1) and (AGE >= 53) => cd=severe_disease (645.0/265.0)
## => cd=minor_disease (1121.0/360.0)
##
## Number of Rules : 3
summary(qol_jrip1)
##
## === Summary ===
##
## Correctly Classified Instances 1457 65.8085 %
## Incorrectly Classified Instances 757 34.1915 %
## Kappa statistic 0.3158
## Mean absolute error 0.4459
## Root mean squared error 0.4722
## Relative absolute error 89.3711 %
## Root relative squared error 94.5364 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 761 397 | a = minor_disease
## 360 696 | b = severe_disease
This JRip() classifier uses only 3 rules and has a comparable accuracy of 66%. As each individual has unique characteristics, classification in real-world data is rarely perfect (close to 100% accuracy).
Another idea is to generate many trees, let each tree predict, and finally ensemble those weighted votes into a combined classification result. This is precisely the idea behind random forest classification; see Chapter 14.
# install.packages("randomForest")
require(randomForest)
set.seed(12)
# rf.fit <- tuneRF(qol_train[ , -40], qol_train[ , 40], stepFactor=1.5)
rf.fit <- randomForest(cd ~ . , data=qol_train[ , -40], importance=TRUE, ntree=2000, mtry=26)
# varImpPlot(rf.fit, cex=0.5); print(rf.fit)
# type either 1 or 2, specifying the type of importance measure (1=mean decrease in accuracy, 2=mean decrease in node impurity)

# Accuracy
imp_rf.fit <- importance(rf.fit)
x <- rownames(imp_rf.fit)
y <- imp_rf.fit[,3]
plot_ly(x = ~y, y = ~reorder(x,y), name = "Var.Imp", type = "bar") %>%
layout(title="RF Variable Importance Plot (Accuracy)",
xaxis=list(title="Importance (mean decrease in accuracy)"),
yaxis = list(title="Variables (Ordered)"))
# Gini Index
y <- imp_rf.fit[,4]
plot_ly(x = ~y, y = ~reorder(x,y), name = "Var.Imp", type = "bar") %>%
layout(title="RF Variable Importance Plot (Gini)",
xaxis=list(title="Importance (Gini mean decrease)"),
yaxis = list(title="Variables (Ordered)"))
rf.fit1 <- randomForest(cd~. , data=qol_train[ , -40], importance=TRUE, ntree=2000, mtry=26)
rf.fit2 <- randomForest(cd~. , data=qol_train[ , -40], importance=TRUE, nodesize=5, ntree=5000, mtry=26)
# plot(rf.fit,log="x",main="MSE Errors vs. Iterations: Three Models rf.fit, rf.fit1, rf.fit2")
# points(1:5000, rf.fit1$mse, col="red", type="l")
# points(1:5000, rf.fit2$mse, col="green", type="l")
# legend("topright", col=c("black", "red", "green"), lwd=c(2, 2, 2), legend=c("rf.fit", "rf.fit1", "rf.fit2"))
plot_ly(x = c(1:1000), y = rf.fit$err.rate[1:1000,1], name = 'OOB (rf.fit)', type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = c(1:1000), y = rf.fit1$err.rate[1:1000,1], name = 'OOB (rf.fit1)', type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = c(1:1000), y = rf.fit2$err.rate[1:1000,1], name = 'OOB (rf.fit2)', type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = "Out Of Bag Error Rates for 3 RF Models",
xaxis = list(title = "Number of Trees"),
yaxis = list (title = "OOB Error"))
qol_pred2 <- predict(rf.fit2, qol_test, type = 'class')
confusionMatrix(table(qol_pred2, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred2 minor_disease severe_disease
## minor_disease 143 72
## severe_disease 72 156
##
## Accuracy : 0.6749
## 95% CI : (0.6291, 0.7184)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 5.956e-12
##
## Kappa : 0.3493
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6651
## Specificity : 0.6842
## Pos Pred Value : 0.6651
## Neg Pred Value : 0.6842
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4853
## Balanced Accuracy : 0.6747
##
## 'Positive' Class : minor_disease
##
These variable importance plots (cf. randomForest::varImpPlot()) show the rank order of feature importance according to the specific index (mean decrease in accuracy, left, and Gini, right). More information about random forests is available in Chapter 14: Improving Model Performance.
We can also use the randomForest::partialPlot() method to examine the relative impact of the top salient features (continuous or binary variables). For each feature, this method works by fixing the feature at each value in its observed range, substituting that value across all training observations, averaging the resulting model predictions, and plotting the averaged prediction against the feature value.
Note that for binary features, partialPlot() yields only two values of the corresponding mass function, whereas for continuous variables it generates the marginal effect of the variable on the class probability (for classification problems) or on the continuous response (for regression problems). Using the QoL case study (Case06_QoL_Symptom_ChronicIllness.csv), this example shows the partial plots of the top six rank-ordered features chosen by the random forest model.
imp <- randomForest::importance(rf.fit)
impvar <- rownames(imp)[order(imp[, 1], decreasing=TRUE)]
# op <- par(mfrow=c(2, 3))
# for (i in 1:6) { # use seq_along(impvar) to plot the marginal probabilities for all features
#   partialPlot(rf.fit, qol_train, impvar[i], xlab=impvar[i],
#               main=paste("Partial Dependence of 'CD'\n on ", impvar[i]))
# }

p <- vector("list", 6)
for (i in 1:6) { # use seq_along(impvar) to plot the marginal probabilities for all features
  p[[i]] <- partialPlot(rf.fit, qol_train, impvar[i], xlab=impvar[i],
                        main=paste("Partial Dependence of 'CD'\n on ", impvar[i]), plot=F)
}
plot_ly(x = p[[1]]$x, y = p[[1]]$y, name = impvar[1], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = paste0("CD Partial Dependence on ",impvar[1]),
xaxis = list(title = impvar[1]), yaxis = list (title = "Impact"))
plot_ly(x = p[[2]]$x, y = p[[2]]$y, name = impvar[2], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = paste0("CD Partial Dependence on ",impvar[2]),
xaxis = list(title = impvar[2]), yaxis = list (title = "Impact"))
plot_ly(x = p[[3]]$x, y = p[[3]]$y, name = impvar[3], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = p[[4]]$x, y = p[[4]]$y, name = impvar[4], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = p[[5]]$x, y = p[[5]]$y, name = impvar[5], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = p[[6]]$x, y = p[[6]]$y, name = impvar[6], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = paste0("CD Partial Dependence on ",impvar[3], " ", impvar[4], " ",impvar[5], " ",impvar[6]),
xaxis = list(title = "MSA Questions"), yaxis = list (title = "Impact"))
In random forest (RF) classification, the node size (nodesize) refers to the smallest node that can be split; i.e., nodes with fewer cases than the nodesize are never subdivided. Increasing the node size leads to smaller trees, which may compromise predictive power. On the flip side, increasing the tree size (maxnodes) and the number of trees (ntree) tends to increase the predictive accuracy. However, there are tradeoffs between increasing node size and tree size simultaneously. To optimize the RF predictive accuracy, try smaller node sizes and more trees. Ensembling (forest) results from a larger number of trees will likely generate better results. A hedged tuning sketch follows.
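Below is a minimal sketch (our own grid, not part of the original analysis) comparing a few nodesize settings via the out-of-bag (OOB) error; the exact error rates will vary from run to run:
# Compare several nodesize settings via the final OOB error rate
set.seed(12)
for (ns in c(1, 5, 20)) {
  fit <- randomForest(cd ~ ., data = qol_train[ , -40], ntree = 1000, nodesize = ns)
  cat("nodesize =", ns, " OOB error =", round(fit$err.rate[1000, 1], 3), "\n")
}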
Another decision tree technique we can test is Extra-Trees (extremely randomized trees). Extra-Trees are similar to Random Forests except that, at each node, Random Forest chooses the best cutting threshold for each candidate feature, whereas Extra-Trees draw cut points uniformly at random and then split on the candidate feature achieving the biggest gain (best classification score). The R extraTrees package provides one implementation of Extra-Trees in which the single random cut is extended to several random cuts per feature. This extension improves performance, as it reduces the probability of making very poor cuts while maintaining the designed extra-tree stochasticity. Below, we compare the results of four alternative classification methods - random forest, SVM, kNN, and Extra-Trees.
library(caret)
library(extraTrees)
# For direct Extra-Tree Call
# install.packages("extraTrees")
# library(extraTrees)
# set.seed(12)
# ## Request more memory to JVM (e.g., 5GB
# options( java.parameters = "-Xmx5g" )
# library(extraTrees)
#
# system.time({
# # call extraTrees with 3 CPU cores/threads and track exec-time
# et.fit3 <- extraTrees(qol_train[ , -c(40,41)], qol_train[ , 41],
# numThreads=3, ntree=5000, mtry=26)
# })
# qol_pred3 <- predict(et.fit3, qol_test[ , -c(40,41)])
# confusionMatrix(table(qol_pred3, qol_test$cd))
# qol_pred<-predict(rf.fit, qol_test, type = 'class')
# qol_pred1<-predict(rf.fit1, qol_test, type = 'class')
# qol_pred2<-predict(rf.fit2, qol_test, type = 'class')
control <- trainControl(method="repeatedcv", number=10, repeats=3)

## Run all subsequent models in parallel
library(doParallel)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
system.time({
  rf.fit  <- train(cd~., data=qol_test[ , -40], method="rf", trControl=control);
  knn.fit <- train(cd~., data=qol_test[ , -40], method="knn", trControl=control);
  svm.fit <- train(cd~., data=qol_test[ , -40], method="svmRadial", trControl=control);
  et.fit  <- train(cd~., data=qol_test[ , -40], method="extraTrees", trControl=control)
})
## user system elapsed
## 4.07 0.09 100.50
stopCluster(cl) # close multi-core cluster
results <- resamples(list(RF=rf.fit, kNN=knn.fit, SVM=svm.fit, ET=et.fit))
# summary of model differences
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, kNN, SVM, ET
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.5000000 0.6000000 0.6363636 0.6364525 0.6780303 0.7777778 0
## kNN 0.3863636 0.5027778 0.5681818 0.5589785 0.6113901 0.7045455 0
## SVM 0.4418605 0.5681818 0.6000000 0.5987902 0.6243393 0.7272727 0
## ET 0.5000000 0.5909091 0.6363636 0.6290083 0.6724806 0.7777778 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.006160164 0.19999942 0.2673909 0.2707204 0.3558133 0.5535714 0
## kNN -0.222222222 0.00591716 0.1363636 0.1166633 0.2223136 0.3991597 0
## SVM -0.119305857 0.13367597 0.2019697 0.1971070 0.2471653 0.4579055 0
## ET 0.000000000 0.17671518 0.2697064 0.2557297 0.3429830 0.5535714 0
# plot summaries
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(results, scales=scales) # Box plots of accuracy & Kappa
densityplot(results, scales=scales, pch = "|") # Density plots of accuracy
dotplot(results, scales=scales) # Dot plots of accuracy & Kappa
splom(results) # contrast pair-wise model scatterplots of prediction accuracy
In the previous case study, we classified the CHRONICDISEASESCORE into two groups. What will happen if we use three groups? Let's separate CHRONICDISEASESCORE into three categories. Recall the quantile() function we discussed in Chapter 2; we can use it to obtain cut-points for the classification (note that the loop below uses the cut-points 0.7 and 2.2 rather than the exact tertiles). Then a for loop will split the variable CHRONICDISEASESCORE into three categories.
quantile(qol$CHRONICDISEASESCORE, probs = c(1/3, 2/3))
## 33.33333% 66.66667%
## 1.06 1.80
for (i in 1:2214) {
  if (qol$CHRONICDISEASESCORE[i]>0.7 & qol$CHRONICDISEASESCORE[i]<2.2) {
    qol$cdthree[i] = 2
  } else if (qol$CHRONICDISEASESCORE[i]>=2.2) {
    qol$cdthree[i] = 3
  } else {
    qol$cdthree[i] = 1
  }
}
qol$cdthree <- factor(qol$cdthree, levels=c(1, 2, 3), labels = c("minor_disease", "mild_disease", "severe_disease"))
table(qol$cdthree)
##
## minor_disease mild_disease severe_disease
## 379 1431 404
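As an aside, the same three-way labeling can be obtained more compactly (without hard-coding the sample size) with cut(); this sketch uses the same 0.7 and 2.2 cut-points, though scores exactly equal to a cut-point may be binned slightly differently than in the loop above:
# Vectorized alternative to the for-loop above
cdthree_alt <- cut(qol$CHRONICDISEASESCORE, breaks = c(-Inf, 0.7, 2.2, Inf),
                   labels = c("minor_disease", "mild_disease", "severe_disease"))
table(cdthree_alt)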
After labeling the three categories in the new variable cdthree, our job of preparing the class variable is done. Let's follow the earlier sections of the chapter to see how well the tree and rule classifiers perform in the three-category case. First, let's build a tree classifier using C5.0() with 10 boosting trials. One small tip: the training dataset cannot include column 40 (CHRONICDISEASESCORE), column 41 (cd), or now column 42 (cdthree), because they all contain class-outcome-related variables.
# qol_train1<-qol[1:2114, ]
# qol_test1<-qol[2115:2214, ]
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train1 <- qol[train_index, ]
qol_test1 <- qol[-train_index, ]
prop.table(table(qol_train1$cdthree))
##
## minor_disease mild_disease severe_disease
## 0.1829475 0.6295878 0.1874647
prop.table(table(qol_test1$cdthree))
##
## minor_disease mild_disease severe_disease
## 0.1241535 0.7133183 0.1625282
set.seed(1234)
qol_model1 <- C5.0(qol_train1[ , -c(40, 41, 42)], qol_train1$cdthree, trials=10)
qol_model1
##
## Call:
## C5.0.default(x = qol_train1[, -c(40, 41, 42)], y = qol_train1$cdthree, trials
## = 10)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Number of boosting iterations: 10
## Average tree size: 238.5
##
## Non-standard options: attempt to group attributes
qol_pred1 <- predict(qol_model1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 6 46 3
## mild_disease 26 267 23
## severe_disease 1 60 11
##
## Overall Statistics
##
## Accuracy : 0.6411
## 95% CI : (0.5945, 0.6858)
## No Information Rate : 0.842
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0469
##
## Mcnemar's Test P-Value : 3.943e-05
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity 0.18182 0.7158
## Specificity 0.88049 0.3000
## Pos Pred Value 0.10909 0.8449
## Neg Pred Value 0.93041 0.1654
## Prevalence 0.07449 0.8420
## Detection Rate 0.01354 0.6027
## Detection Prevalence 0.12415 0.7133
## Balanced Accuracy 0.53115 0.5079
## Class: severe_disease
## Sensitivity 0.29730
## Specificity 0.84975
## Pos Pred Value 0.15278
## Neg Pred Value 0.92992
## Prevalence 0.08352
## Detection Rate 0.02483
## Detection Prevalence 0.16253
## Balanced Accuracy 0.57353
We can see that the prediction accuracy with three categories (64%) is notably lower than what we obtained with two categories.
Second, let's build a rule classifier with OneR().
.
set.seed(1234)
qol_1R1 <- OneR(cdthree~., data=qol[ , -c(40, 41)])
qol_1R1
## INTERVIEWDATE:
## < 3.5 -> mild_disease
## < 28.5 -> severe_disease
## < 282.0 -> mild_disease
## < 311.5 -> severe_disease
## >= 311.5 -> mild_disease
## (1436/2214 instances correct)
summary(qol_1R1)
##
## === Summary ===
##
## Correctly Classified Instances 1436 64.86 %
## Incorrectly Classified Instances 778 35.14 %
## Kappa statistic 0.022
## Mean absolute error 0.2343
## Root mean squared error 0.484
## Relative absolute error 67.5977 %
## Root relative squared error 116.2958 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 0 375 4 | a = minor_disease
## 0 1422 9 | b = mild_disease
## 0 390 14 | c = severe_disease
qol_pred1 <- predict(qol_1R1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 0 55 0
## mild_disease 0 315 1
## severe_disease 0 71 1
##
## Overall Statistics
##
## Accuracy : 0.7133
## 95% CI : (0.6688, 0.755)
## No Information Rate : 0.9955
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0086
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity NA 0.714286
## Specificity 0.8758 0.500000
## Pos Pred Value NA 0.996835
## Neg Pred Value NA 0.007874
## Prevalence 0.0000 0.995485
## Detection Rate 0.0000 0.711061
## Detection Prevalence 0.1242 0.713318
## Balanced Accuracy NA 0.607143
## Class: severe_disease
## Sensitivity 0.500000
## Specificity 0.839002
## Pos Pred Value 0.013889
## Neg Pred Value 0.997305
## Prevalence 0.004515
## Detection Rate 0.002257
## Detection Prevalence 0.162528
## Balanced Accuracy 0.669501
This OneR classifier is based purely on the value of INTERVIEWDATE. Its internal classification accuracy is 65%, while the external (validation data) prediction accuracy is 71%. However, the latter assessment is somewhat misleading, as the vast majority of the external validation cases are assigned to a single class - mild_disease.
Finally, let's revisit the JRip() classifier using the same three class labels according to cdthree.
set.seed(1234)
qol_jrip1 <- JRip(cdthree~., data=qol[ , -c(40, 41)])
qol_jrip1
## JRIP rules:
## ===========
##
## (CHARLSONSCORE <= 0) and (AGE <= 50) and (MSA_Q_06 <= 1) and (QOL_Q_07 >= 1) and (MSA_Q_09 <= 1) => cdthree=minor_disease (35.0/11.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_10 >= 4) and (QOL_Q_07 >= 9) => cdthree=severe_disease (54.0/20.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 5) and (MSA_Q_09 <= 4) and (MSA_Q_04 >= 3) => cdthree=severe_disease (64.0/30.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 4) and (PH2_Q_01 >= 3) and (QOL_Q_10 >= 4) and (RACE_ETHNICITY >= 4) => cdthree=severe_disease (43.0/19.0)
## => cdthree=mild_disease (2018.0/653.0)
##
## Number of Rules : 5
summary(qol_jrip1)
##
## === Summary ===
##
## Correctly Classified Instances 1481 66.8925 %
## Incorrectly Classified Instances 733 33.1075 %
## Kappa statistic 0.1616
## Mean absolute error 0.3288
## Root mean squared error 0.4055
## Relative absolute error 94.8702 %
## Root relative squared error 97.42 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 24 342 13 | a = minor_disease
## 10 1365 56 | b = mild_disease
## 1 311 92 | c = severe_disease
qol_pred1 <- predict(qol_jrip1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 4 49 2
## mild_disease 3 299 14
## severe_disease 0 54 18
##
## Overall Statistics
##
## Accuracy : 0.7246
## 95% CI : (0.6805, 0.7657)
## No Information Rate : 0.9074
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1859
##
## Mcnemar's Test P-Value : 2.748e-14
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity 0.571429 0.7438
## Specificity 0.883028 0.5854
## Pos Pred Value 0.072727 0.9462
## Neg Pred Value 0.992268 0.1890
## Prevalence 0.015801 0.9074
## Detection Rate 0.009029 0.6749
## Detection Prevalence 0.124153 0.7133
## Balanced Accuracy 0.727228 0.6646
## Class: severe_disease
## Sensitivity 0.52941
## Specificity 0.86797
## Pos Pred Value 0.25000
## Neg Pred Value 0.95687
## Prevalence 0.07675
## Detection Rate 0.04063
## Detection Prevalence 0.16253
## Balanced Accuracy 0.69869
In terms of predictive accuracy on the testing data (qol_test1$cdthree), we can see from these outputs that the RIPPER algorithm performed best (72%), compared to the C5.0 decision tree (64%) and the OneR algorithm (71%). This suggests that simple algorithms may outperform complex methods for certain real-world case studies. Later, in Chapter 14, we will provide more details about optimizing and improving classification and prediction performance.
Try these techniques with other data from the list of our Case-Studies.