SOCR ≫ | DSPA ≫ | Topics ≫ |
When classification needs to be apparent, kNN or naive Bayes we presented earlier may not be useful as they do not have rigid rules for classification. In some cases, we need to specify well stated rules for our decision. Just like a scoring criterion for driving ability or credit scoring. We have credit or penalty for each driving operation. In this case, we actually have a decision tree to judge whether this applicant is qualified to get a driver’s license.
In this chapter, we will (1) see a simple motivational example of decision trees based on the Iris data, (2) describe decision-tree divide and conquer methods, (3) examine certain measures quantifying classification accuracy, (4) show strategies for pruning decision trees, (5) work through a Quality of Life in Chronic Disease case-study, and (6) review the One Rule and RIPPER algorithms.
Decision tree learners enable classification via tree structures modeling the relationships among all features and potential outcomes in the data. All decision trees begin with trunk (all data are part of the same cohort), which is then split into narrower and narrower branches by branching decisions based on the intrinsic data structure. At each step, splitting the data into branches may include binary or multinomial classification. The final decision is obtained when the tree branching process terminates. The terminal (leaf) nodes represent the action to be taken as the result of the series of branching decisions. For predictive models, the leaf nodes provide the expected forecasting results given the series of events in the tree.
There are a number of R
packages available for decision tree classification including rpart
, C5.0
, party
, etc.
Let’s start by seeing a simple example using the Iris dataset, which we saw in Chapter 2. The data features or attributes include Sepal.Length
, Sepal.Width
, Petal.Length
, and Petal.Width
, and classes are represented by the Species taxa
(setosa; versicolor; and virginica).
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
##
## setosa versicolor virginica
## 50 50 50
The ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
function will build a decision tree.
iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
print(iris_ctree)
##
## Conditional inference tree with 4 terminal nodes
##
## Response: Species
## Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## Number of observations: 150
##
## 1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
## 2)* weights = 50
## 1) Petal.Length > 1.9
## 3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
## 4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
## 5)* weights = 46
## 4) Petal.Length > 4.8
## 6)* weights = 8
## 3) Petal.Width > 1.7
## 7)* weights = 46
plot(iris_ctree, cex=2)
head(iris); tail(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
Similarly, we can demonstrate a classification of the iris taxa via rpart
:
library(rpart)
iris_rpart = rpart(Species~., data=iris)
print(iris_rpart)
## n= 150
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
## 2) Petal.Length< 2.45 50 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
## 6) Petal.Width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259) *
## 7) Petal.Width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *
# Use the `rattle::fancyRpartPlot` to generates an elegant plot
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(iris_rpart, cex = 1.5)
The decision tree algorithm represents an upside down tree with lots of branches where a series of logical decisions are encoded as branches between nodes of the tree. The classification begin at the root node and goes through many branches until it gets to the terminal nodes. This iterative process splits the data into different classes by a rigid criterion.
Decision trees involve recursive partitioning that uses data features and attributes to split the data into groups (nodes) of similar classes.
To make classification trees using data features, we need to observe the pattern between the data features and potential classes using training data. We can draw scatter plots and separate groups that are clearly clotted together. Each of the group is considered a segment of the data. After getting the approximate range of each feature value under each group, we can make the decision tree.
\[ X = [X_1, X_2, X_3, ..., X_k] = \underbrace{ \begin{pmatrix} x_{1, 1} & x_{1, 2} & ... & x_{1, k} \\ x_{2, 1} & x_{2, 2} & ... & x_{2, k} \\ ... & ... & ... & ... \\ x_{n, 1} & x_{n, 2} & ... & x_{n, k} \end{pmatrix} }_{features/attributes} \left\{ \begin{array}{ll} & \\ cases & \\ & \end{array} \right. \]
The decision tree algorithms use a top-down recursive divide-and-conquer approach (sometimes they may also use bottom up or mixed splitting strategies) to divide and evaluate the splits of a dataset \(D\) (input). The best split decision corresponds to the split with the highest information gain, reflecting a partition of the data into \(K\) subsets (using divide-and-conquer). The iterative algorithm terminates when some stopping criteria are reached. Examples of stopping conditions used to terminate the recursive process include:
pure
.One objective criteria for splitting or clustering data into groups is based on the information gain measure, or impurity reduction, which can be used to select the test attribute at each node in the decision tree. The attribute with the highest information gain (i.e., greatest entropy reduction) is selected as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions. There are three main indices to evaluate the impurity reduction: Misclassification error, Gini index and Entropy.
For a given table containing pairs of attributes and their class labels, we can assess the homology of the classes in the table. A table is pure (homogenous) if it only contains a single class. If a data table contains several classes, then we say that the table is impure or heterogeneous. This degree of impurity or heterogeneity can be quantitatively evaluated using impurity measures like entropy, Gini index, and misclassification error.
The Entropy
is an information measure of the amount of disorder or uncertainty in a system. Suppose we have a data set \(D=(X_1, X_2, ..., X_n)\) that includes \(n\) features (variables) and suppose each of these features can take on any of \(k\) possible values (states). Then the cardinality of the entire system is \(k^n\) as each of the features are assumed to have \(k\) independent states, thus the total number of different datasets that can be expected is \(\underbrace{k\times k\times k \times ...\times k}_{n}=k^n\). Suppose \({\displaystyle p_{1},p_{2},..., p_n}\) represent the proportions of each class (note: \(\sum_{i} {p_i}=1\)) present in the child node that results from a split in a decision tree classifier. Then the entropy measure is defined by: \[ Entropy(D)= - \sum_{i} {p_i \log_{2}{p_i}}.\] If each of the \(1\le i\leq k\) states for each feature is equally likely to be observed with probability \(p_i=\frac{1}{k}\), then the entropy is maximized: \[ Entropy(D)=- \sum_{i=1}^{k} {\frac{1}{k} \log{\frac{1}{k}}}=
\sum_{i=1}^{k} {\frac{1}{k} \log{k}}=
\frac{1}{k}\sum_{i=1}^{k} {1} = 1.\]
In the other extreme, the entropy is minimized (as by L’Hopital’s Rule \(\lim_{x\longrightarrow 0} {x\times log(x)} =\lim_{x\longrightarrow 0} {\frac{\frac{1}{x}}{-\frac{1}{x^2}}}=\lim_{x\longrightarrow 0} {x}=0\)) for a single class classification where the probability of one class is unitary (\(p_{i_o}=1\)) and the other ones are trivial (\(p_{i\not= i_o}=0\)):
\[Entropy(D)=- \sum_{i} {\frac{1}{k} \log{\frac{1}{k}}}= p_{i_o}\times \log{p_{i_o}} + \sum_{i\not=i_o}{p_i \log{p_i}}=\] \[=1\times \log{1} + \lim_{x \longrightarrow 0}{\sum_{i\not=i_o}{x \log(x)}} = 0+0=0.\]
In classification settings, higher entropy (i.e., more disorder) corresponds to a sample that has a mixed collection of labels. Conversely, lower entropy corresponds to a classification where we have mostly pure partitions. In general, the entropy of a sample \(D=\{X_1, X_2, ..., X_n\}\) is defined by:
\[H(D) = - \sum_{i=1}^{k} {P(c_i|D) \log{P(c_i|D)}}, \]
where \(P(c_i|D)\) is the probability of a data point in \(D\) being labeled with class \(c_i\), and \(k\) is the number of classes (clusters). \(P(c_i|D)\) can be estimated from the observed data by:
\[P(c_i|D) = \frac{|\{x_j \in D | x_j \text{ has label } y_j = c_i\}|}{|D|}.\]
Observe that if the observations are evenly split amongst all \(k\) classes then \(P(c_i|D)=\frac{1}{k}\) and \[H(D) = - \sum_{i=1}^{k} {\frac{1}{k} \log{\frac{1}{k}}}=1\]
At the other extreme, if all the observations are from one class then: \[H(D)=-1*log_k(1)=0.\]
Also note that the base of the log function is somewhat irrelevant and can be used to normalize (scale) the range of the entropy \(\left ( log_b(x) = \frac{log_2(x)}{log_2(b)}\right )\).
The Gain
is the expected reduction in entropy caused by knowing the value of an attribute.
Similar to the Entropy
measure, the Misclassification error
and the Gini index
are also applied to evaluate information gain. The Misclassification error is defined by the formula:
\[ME = 1-\max_k (p_k).\]
And the Gini index is expressed as: \[GI = \sum_{k}p_k(1-p_k)=1-\sum_{k}p_k^2.\]
C5.0 algorithm is a popular implementation of decision trees.
To begin with, let’s consider the term purity
. If the segments of data contains a single class, they are considered pure
. The entropy
represents a mathematical formalism measuring purity for data segments.
\[Entropy(S)=- \sum_{i=1}^c {p_i log_2(p_i)}\]
Where entropy
is the measurement, c is the number of total class levels, and \(p_i\) refers to the proportion of observations that fall into each class (i.e., probability of a randomly selected data point to belong to the \(i^{th}\) class level. For 2 possible classes, the entropy
ranges from 0 to 1. For \(n\) classes, the entropy ranges from \(0\) to \(log_2(n)\), where the minimum entropy corresponds to data that is purely homogeneous (completely deterministic/predictable) and the maximum entropy represents completely disordered data (stochastic or extremely noisy). You might wonder what is the benefit of using the entropy? Another way to say this is the smaller the entropy the more information is contained in this split method. Systems (data) with high entropy indicate significant information content (randomness) and data with low entropy indicates highly-compressible data with structure in it.
If we only have one class in the segment then \(Entropy(S)=(-1)\times log_2(1)=0\)
Let’s try another example. If we have a segment of data that contains two classes, the first class contains 80% of the data and the second class contains the remaining 20%. Then, we have the following entropy
: \(Entropy(S)=-0.8log_2(0.8)-0.2log_2(0.2)=0.7219281.\)
The relationship for two class proportions and entropy are illustrated in the following graph. Assume \(x\) is the proportion for one of the classes.
set.seed(1234)
x<-runif(100)
curve(-x*log2(x)-(1-x)*log2(1-x), col="red", main="Entropy for Different Proportions", xlab = "x (proportion for class 1)", ylab = "Entropy", lwd=3)
The closer the binary proportion split is to \(0.5\) the greater the entropy. The more homogeneous the split (one class becomes the majority) the lower the entropy. Decision trees aim to find splits in the data that reduce the entropy, i.e., increasing the homogeneity of the elements within all classes.
This measuring mechanism could be used to measure and compare the information we get using different features as data partitioning characteristics. Let’s consider this scenario. Suppose \(S\) and \(S_1\) represent the entropy of the system before and after the splitting/partitioning of the data according to a specific data feature attribute (\(F\)). Denote the entropies of the original and the derived partition by \(Entropy(S)\) and \(Entropy(S_1)\), respectively. The information we gained from partitioning the data using this specific feature (F) is calculated as a change in the entropy:
\[Gain(F) = Entropy(S)-Entropy(S_1)\]
Note that smaller entropy \(Entropy(S_1)\) corresponds with better classification and more information gained. A more complicated case would be that the partitions create multiple segments. Then, the entropy for each partition method is calculated by the following formula:
\[Entropy(S)=\sum_{i=1}^n w_i Entropy(P_i)=\sum_{i=1}^{n} w_i (\sum_{j=1}^{c}-p_ilog_2(p_i))\]
Where \(w_i\) is the proportion of examples falling in that segment and \(P_i\) is segment i. Thus, the total entropy of a partition method is calculated by a weighted sum of entropies for each segment created by this method.
When we get the maximum reduction in entropy with a feature (\(F\)) then the \(Gain(F)=Entropy(S)\), since \(Entropy(S_1)=0\). On the contrary, if we gain no information with this feature, we have \(Gain(F)=0\).
While making a decision tree, we can classify those observations using as many splits as we want. This eventually might over classify our data. An extreme example of this would be that we make each observation as a class, which is meaningless.
So how to control the size of the decision tree? One possible solution is to make a cutoff for the number of decisions that a decision tree could make. Similarly, we can control the number of examples in each segment to be not too small. This method is called early stopping or pre-pruning the decision tree. However, this might make the decision procedure to stop before some important partition could happen.
Another solution post-pruning is that we begin with growing a big decision tree and subsequently reduce the branches based on error rates with penalty at the nodes. This is often more effective than the previous solution.
C5.0 algorithm
uses the post-pruning method to control the size of the decision tree. It first grows an overfitting large tree to contain all the possibilities of partitioning. Then it cuts out nodes and branches with little effect on classification errors.
In this chapter we are using the Quality of life and chronic disease dataset, Case06_QoL_Symptom_ChronicIllness.csv
. This dataset has 41 variables. Detailed description for each variable is provided here.
Important variables:
Let’s load the data first.
qol<-read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
str(qol)
## 'data.frame': 2356 obs. of 41 variables:
## $ ID : int 171 171 172 179 180 180 181 182 183 186 ...
## $ INTERVIEWDATE : int 0 427 0 0 0 42 0 0 0 0 ...
## $ LANGUAGE : int 1 1 1 1 1 1 1 1 1 2 ...
## $ AGE : int 49 49 62 44 64 64 52 48 49 78 ...
## $ RACE_ETHNICITY : int 3 3 3 7 3 3 3 3 3 4 ...
## $ SEX : int 2 2 2 2 1 1 2 1 1 1 ...
## $ QOL_Q_01 : int 4 4 3 6 3 3 4 2 3 5 ...
## $ QOL_Q_02 : int 4 3 3 6 2 5 4 1 4 6 ...
## $ QOL_Q_03 : int 4 4 4 6 3 6 4 3 4 4 ...
## $ QOL_Q_04 : int 4 4 2 6 3 6 2 2 5 2 ...
## $ QOL_Q_05 : int 1 5 4 6 2 6 4 3 4 3 ...
## $ QOL_Q_06 : int 4 4 2 6 1 2 4 1 2 4 ...
## $ QOL_Q_07 : int 1 2 5 -1 0 5 8 4 3 7 ...
## $ QOL_Q_08 : int 6 1 3 6 6 6 3 1 2 4 ...
## $ QOL_Q_09 : int 3 4 3 6 2 2 4 2 2 4 ...
## $ QOL_Q_10 : int 3 1 3 6 3 6 3 2 4 3 ...
## $ MSA_Q_01 : int 1 3 2 6 2 3 4 1 1 2 ...
## $ MSA_Q_02 : int 1 1 2 6 1 6 4 3 2 4 ...
## $ MSA_Q_03 : int 2 1 2 6 1 2 3 3 1 2 ...
## $ MSA_Q_04 : int 1 3 2 6 1 2 1 4 1 5 ...
## $ MSA_Q_05 : int 1 1 1 6 1 2 1 6 3 2 ...
## $ MSA_Q_06 : int 1 2 2 6 1 2 1 1 2 2 ...
## $ MSA_Q_07 : int 2 1 3 6 1 1 1 1 1 5 ...
## $ MSA_Q_08 : int 1 1 1 6 1 1 1 1 2 1 ...
## $ MSA_Q_09 : int 1 1 1 6 2 2 4 6 2 1 ...
## $ MSA_Q_10 : int 1 1 1 6 1 1 1 1 1 3 ...
## $ MSA_Q_11 : int 2 3 2 6 1 1 2 1 1 5 ...
## $ MSA_Q_12 : int 1 1 2 6 1 1 2 6 1 3 ...
## $ MSA_Q_13 : int 1 1 1 6 1 6 2 1 4 2 ...
## $ MSA_Q_14 : int 1 1 1 6 1 2 1 1 3 1 ...
## $ MSA_Q_15 : int 2 1 1 6 1 1 3 2 1 3 ...
## $ MSA_Q_16 : int 2 3 5 6 1 2 1 2 1 2 ...
## $ MSA_Q_17 : int 2 1 1 6 1 1 1 1 1 3 ...
## $ PH2_Q_01 : int 3 2 1 5 1 1 3 1 2 3 ...
## $ PH2_Q_02 : int 4 4 1 5 1 2 1 1 4 2 ...
## $ TOS_Q_01 : int 2 2 2 4 1 1 2 2 1 1 ...
## $ TOS_Q_02 : int 1 1 1 4 4 4 1 2 4 4 ...
## $ TOS_Q_03 : int 4 4 4 4 4 4 4 4 4 4 ...
## $ TOS_Q_04 : int 5 5 5 5 5 5 5 5 5 5 ...
## $ CHARLSONSCORE : int 2 2 3 1 0 0 2 8 0 1 ...
## $ CHRONICDISEASESCORE: num 1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...
Most of the coded variables like QOL_Q_01
(heath rating) have ordinal values(1=excellent, 2=very good, 3=good, 4=fair, 5= poor, 6=no answer). We can use the table()
function to see its distribution. We also have some numerical variables in the dataset like CHRONICDISEASESCORE
. We can take a look at it by using summary()
.
Our variable of interest CHRONICDISEASESCORE
has some missing data. A simple way to address this is just deleting those observations with missing values. You could also try to impute the missing value using various imputation methods mentioned in Chapter 2.
table(qol$QOL_Q_01)
##
## 1 2 3 4 5 6
## 44 213 801 900 263 135
qol<-qol[!qol$CHRONICDISEASESCORE==-9, ]
summary(qol$CHRONICDISEASESCORE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.880 1.395 1.497 1.970 4.760
Let’s create two classes using variable CHRONICDISEASESCORE
. We classify the patients with CHRONICDISEASESCORE < mean(CHRONICDISEASESCORE)
as having minor disease and the rest as having severe disease. This dichotomous classification (qol$cd
) may not be perfect and we will talk about alternative classification strategies in the practice problem in the end of the chapter.
qol$cd<-qol$CHRONICDISEASESCORE>1.497
qol$cd<-factor(qol$cd, levels=c(F, T), labels = c("minor_disease", "severe_disease"))
To make the qol
data more organized, we can order the data by the variable ID
.
qol<-qol[order(qol$ID), ]
# Remove ID (col=1) # the clinical Diagnosis (col=41) will be handled later
qol <- qol[ , -1]
Then, we are able to subset training and testing datasets. Here is an example of a non-random split of the entire data into training (\(2114\)) and testing (\(100\)) sets:
qol_train<-qol[1:2114, ]
qol_test<-qol[2115:2214, ]
And here is an example of random assignments of cases into training and testing sets (80-20% slit):
set.seed(1234)
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train<-qol[train_index, ]
qol_test<-qol[-train_index, ]
We can quickly inspect the distributions of the training and testing data to ensure they are not vastly different. We can see that the classes are split fairly equal in training and test datasets.
prop.table(table(qol_train$cd))
##
## minor_disease severe_disease
## 0.5279503 0.4720497
prop.table(table(qol_test$cd))
##
## minor_disease severe_disease
## 0.503386 0.496614
In this section we are using the C5.0()
function from the C50
package.
The function C5.0()
has following components:
m<-C5.0(train, class, trials=1, costs=NULL)
You could delete the #
in the following code and run it in R to install and load the C50
package.
# install.packages("C50")
library(C50)
In the qol
dataset (ID column is already removed), column 41 is the class vector (qol$cd
) and column 40 is the numerical version of vector 41 (qol$CHRONICDISEASESCORE
). We need to delete these two columns to create our training data that only contains features.
summary(qol_train[,-c(40, 41)])
## INTERVIEWDATE LANGUAGE AGE RACE_ETHNICITY
## Min. : 0.00 Min. :1.000 Min. :20.00 Min. :1.000
## 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.:52.00 1st Qu.:3.000
## Median : 0.00 Median :1.000 Median :59.00 Median :3.000
## Mean : 21.68 Mean :1.217 Mean :58.74 Mean :3.614
## 3rd Qu.: 0.00 3rd Qu.:1.000 3rd Qu.:67.00 3rd Qu.:4.000
## Max. :440.00 Max. :2.000 Max. :90.00 Max. :7.000
## SEX QOL_Q_01 QOL_Q_02 QOL_Q_03
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :1.000 Median :4.000 Median :3.000 Median :4.000
## Mean :1.422 Mean :3.661 Mean :3.408 Mean :3.714
## 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :2.000 Max. :6.000 Max. :6.000 Max. :6.000
## QOL_Q_04 QOL_Q_05 QOL_Q_06 QOL_Q_07
## Min. :1.000 Min. :1.000 Min. :1.00 Min. :-1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.00 1st Qu.: 1.000
## Median :3.000 Median :3.000 Median :3.00 Median : 5.000
## Mean :3.033 Mean :3.134 Mean :2.96 Mean : 4.139
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.: 7.000
## Max. :6.000 Max. :6.000 Max. :6.00 Max. :10.000
## QOL_Q_08 QOL_Q_09 QOL_Q_10 MSA_Q_01
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.000 Median :3.000
## Mean :2.892 Mean :3.161 Mean :2.938 Mean :2.966
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_02 MSA_Q_03 MSA_Q_04 MSA_Q_05
## Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.00 Median :2.000 Median :2.000 Median :1.000
## Mean :3.06 Mean :2.273 Mean :2.401 Mean :1.885
## 3rd Qu.:4.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :6.00 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_06 MSA_Q_07 MSA_Q_08 MSA_Q_09
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :1.000 Median :2.000
## Mean :2.408 Mean :2.224 Mean :1.474 Mean :2.163
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_10 MSA_Q_11 MSA_Q_12 MSA_Q_13
## Min. :1.0 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.0 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.0 Median :1.000 Median :1.000 Median :1.000
## Mean :1.7 Mean :2.239 Mean :2.034 Mean :1.885
## 3rd Qu.:2.0 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :6.0 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_14 MSA_Q_15 MSA_Q_16 MSA_Q_17
## Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.00 Median :1.000 Median :1.000
## Mean :1.998 Mean :1.77 Mean :1.835 Mean :2.072
## 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :6.000 Max. :6.00 Max. :6.000 Max. :6.000
## PH2_Q_01 PH2_Q_02 TOS_Q_01 TOS_Q_02
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
## Median :2.000 Median :2.000 Median :2.000 Median :4.000
## Mean :2.484 Mean :2.369 Mean :2.239 Mean :3.316
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :6.000 Max. :6.000 Max. :5.000 Max. :5.000
## TOS_Q_03 TOS_Q_04 CHARLSONSCORE
## Min. :1.000 Min. :1.000 Min. :-9.0000
## 1st Qu.:4.000 1st Qu.:5.000 1st Qu.: 0.0000
## Median :4.000 Median :5.000 Median : 1.0000
## Mean :3.787 Mean :4.686 Mean : 0.8826
## 3rd Qu.:4.000 3rd Qu.:5.000 3rd Qu.: 1.0000
## Max. :5.000 Max. :6.000 Max. :10.0000
set.seed(1234)
qol_model<-C5.0(qol_train[,-c(40, 41)], qol_train$cd)
qol_model
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Tree size: 25
##
## Non-standard options: attempt to group attributes
summary(qol_model)
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
##
##
## C5.0 [Release 2.07 GPL Edition] Fri Jun 30 15:37:03 2017
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 1771 cases (40 attributes) from undefined.data
##
## Decision tree:
##
## CHARLSONSCORE <= 0: minor_disease (665/180)
## CHARLSONSCORE > 0:
## :...AGE <= 47:
## :...MSA_Q_08 > 2: severe_disease (15/4)
## : MSA_Q_08 <= 2:
## : :...MSA_Q_14 <= 1: minor_disease (86/20)
## : MSA_Q_14 > 1:
## : :...MSA_Q_10 > 4: minor_disease (6)
## : MSA_Q_10 <= 4:
## : :...TOS_Q_03 > 4: severe_disease (8)
## : TOS_Q_03 <= 4:
## : :...MSA_Q_17 > 2: minor_disease (8/1)
## : MSA_Q_17 <= 2:
## : :...QOL_Q_01 <= 2: minor_disease (4)
## : QOL_Q_01 > 2: severe_disease (38/13)
## AGE > 47:
## :...RACE_ETHNICITY > 3:
## :...QOL_Q_07 > 5: severe_disease (133/26)
## : QOL_Q_07 <= 5:
## : :...QOL_Q_10 > 5: severe_disease (24/2)
## : QOL_Q_10 <= 5:
## : :...MSA_Q_14 <= 5: severe_disease (202/72)
## : MSA_Q_14 > 5: minor_disease (11/2)
## RACE_ETHNICITY <= 3:
## :...QOL_Q_01 <= 2: minor_disease (50/20)
## QOL_Q_01 > 2:
## :...CHARLSONSCORE > 1: severe_disease (184/58)
## CHARLSONSCORE <= 1:
## :...MSA_Q_04 > 5: minor_disease (27/8)
## MSA_Q_04 <= 5:
## :...QOL_Q_07 <= 5:
## :...QOL_Q_05 <= 2:
## : :...TOS_Q_04 <= 2: minor_disease (5)
## : : TOS_Q_04 > 2: severe_disease (52/15)
## : QOL_Q_05 > 2:
## : :...MSA_Q_06 <= 5: minor_disease (119/46)
## : MSA_Q_06 > 5: severe_disease (10/2)
## QOL_Q_07 > 5:
## :...QOL_Q_09 <= 2: severe_disease (18/1)
## QOL_Q_09 > 2:
## :...RACE_ETHNICITY <= 2: minor_disease (12/5)
## RACE_ETHNICITY > 2:
## :...MSA_Q_17 > 3: severe_disease (19/2)
## MSA_Q_17 <= 3:
## :...PH2_Q_01 <= 3: severe_disease (50/14)
## PH2_Q_01 > 3:
## :...MSA_Q_14 <= 3: minor_disease (21/6)
## MSA_Q_14 > 3: severe_disease (4)
##
##
## Evaluation on training data (1771 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 25 497(28.1%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 726 209 (a): class minor_disease
## 288 548 (b): class severe_disease
##
##
## Attribute usage:
##
## 100.00% CHARLSONSCORE
## 62.45% AGE
## 53.13% RACE_ETHNICITY
## 38.40% QOL_Q_07
## 34.61% QOL_Q_01
## 21.91% MSA_Q_14
## 19.03% MSA_Q_04
## 13.38% QOL_Q_10
## 10.50% QOL_Q_05
## 9.32% MSA_Q_08
## 8.13% MSA_Q_17
## 7.28% MSA_Q_06
## 7.00% QOL_Q_09
## 4.23% PH2_Q_01
## 3.61% MSA_Q_10
## 3.27% TOS_Q_03
## 3.22% TOS_Q_04
##
##
## Time: 0.0 secs
The output of qol_model
indicates that we have a tree that has 25 terminal nodes. summary(qol_model)
suggests that the classification error for decision tree is 28% percent in the training data.
Now we can make predictions using the decision tree that we just build. The predict()
function we will use is the same as the one we showed in earlier chapters, e.g., Chapter 2 and Chapter 7. In general, predict()
is extended by each specific type of regression, classificaiton, clustering or forecasting machine learning technique. For example, randomForest::predict.randomForest()
is invoked by:
predict(RF_model, newdata, type="response", norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE, cutoff, ...),
where type
represents type of prediction output to be generated - “response” (equivalent to “class”), “prob” or “votes”. Thus, the predicted values are either predicted “response” class labels, matrix of class probabilities, or vote counts.
This time we are going to introduce the confusionMatrix()
function under package caret
as the evaluation method. When we combine it with a table()
function, the output of the evaluation is very straight forward.
qol_pred<-predict(qol_model, qol_test[ ,-c(40, 41)]) # removing the last 2 columns CHRONICDISEASESCORE and cd, whjich represent the clinical outcomes we are predicting!
# install.packages("caret")
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
confusionMatrix(table(qol_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred minor_disease severe_disease
## minor_disease 149 89
## severe_disease 74 131
##
## Accuracy : 0.6321
## 95% CI : (0.5853, 0.6771)
## No Information Rate : 0.5034
## P-Value [Acc > NIR] : 3.317e-08
##
## Kappa : 0.2637
## Mcnemar's Test P-Value : 0.2728
##
## Sensitivity : 0.6682
## Specificity : 0.5955
## Pos Pred Value : 0.6261
## Neg Pred Value : 0.6390
## Prevalence : 0.5034
## Detection Rate : 0.3363
## Detection Prevalence : 0.5372
## Balanced Accuracy : 0.6318
##
## 'Positive' Class : minor_disease
##
The Confusion Matrix shows prediction accuracy is about 63%, however, this may vary (see the corresponding confidence interval).
Recall the C5.0
function, there is an option trials=
, which is an integer specifying the number of boosting iterations. The default value of one indicates that a single model is used, and we can specify a larger number of iterations, for instance trials=6
.
set.seed(1234)
qol_boost6<-C5.0(qol_train[ , -c(40, 41)], qol_train$cd, trials=6) # try alternative values for the trials option
qol_boost6
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd, trials = 6)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Number of boosting iterations: 6
## Average tree size: 11.7
##
## Non-standard options: attempt to group attributes
We can see that the size of the tree reduced to about 12 (this may vary at each run).
Since this is a fairly small tree we can visualize it by the function plot()
. We also use the option type="simple"
to make the tree look more condensed.
plot(qol_boost6, type="simple")
Caution: The plotting of decision trees will fail if you have columns that start with numbers or special characters (e.g., “5variable”, “!variable”). In general, avoid spaces, special characters, and other non-terminal symbols in column/row names.
Next step would be making predictions and testing its accuracy.
qol_boost_pred6 <- predict(qol_boost6, qol_test[ ,-c(40, 41)])
confusionMatrix(table(qol_boost_pred6, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_boost_pred6 minor_disease severe_disease
## minor_disease 140 75
## severe_disease 83 145
##
## Accuracy : 0.6433
## 95% CI : (0.5968, 0.688)
## No Information Rate : 0.5034
## P-Value [Acc > NIR] : 1.987e-09
##
## Kappa : 0.2868
## Mcnemar's Test P-Value : 0.5776
##
## Sensitivity : 0.6278
## Specificity : 0.6591
## Pos Pred Value : 0.6512
## Neg Pred Value : 0.6360
## Prevalence : 0.5034
## Detection Rate : 0.3160
## Detection Prevalence : 0.4853
## Balanced Accuracy : 0.6434
##
## 'Positive' Class : minor_disease
##
The accuracy is about 64%. However, this may vary each time we run the experiment (mind the confidence interval). In some studies, the trials option provides significant improvement to the overall accuracy. A good choice for this option is trials=10
.
Suppose we want to reduce the false negative rate, in this case, misclassifying a severe case as minor. False negative (failure to detect a severe disease case) may be more costly than false positive (misclassifying a minor disease case as severe). Misclassification errors can be expressed as a matrix:
error_cost<-matrix(c(0, 1, 4, 0), nrow = 2)
error_cost
## [,1] [,2]
## [1,] 0 4
## [2,] 1 0
Let’s build a decision tree with the option cpsts=error_cost
.
set.seed(1234)
qol_cost<-C5.0(qol_train[-c(40, 41)], qol_train$cd, costs=error_cost)
qol_cost_pred<-predict(qol_cost, qol_test)
confusionMatrix(table(qol_cost_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_cost_pred minor_disease severe_disease
## minor_disease 60 17
## severe_disease 163 203
##
## Accuracy : 0.5937
## 95% CI : (0.5463, 0.6398)
## No Information Rate : 0.5034
## P-Value [Acc > NIR] : 8.352e-05
##
## Kappa : 0.1909
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.2691
## Specificity : 0.9227
## Pos Pred Value : 0.7792
## Neg Pred Value : 0.5546
## Prevalence : 0.5034
## Detection Rate : 0.1354
## Detection Prevalence : 0.1738
## Balanced Accuracy : 0.5959
##
## 'Positive' Class : minor_disease
##
Although the overall accuracy decreased, the false negative cell labels were reduced from 75 (without specifying a cost matrix) to 17 (when specifying a non-trivial (loaded) cost matrix). This comes at the cost of increasing the rate of false-positive labeling (minor disease cases misclassified as severe).
There are multiple choices to plot trees fitted by rpart
, C4.5
.
library("rpart")
# remove CHRONICDISEASESCORE, but keep *cd* label
set.seed(1234)
qol_model<-rpart(cd~., data=qol_train[, -40], cp=0.01)
# here we use rpart::cp = *complexity parameter* = 0.01
qol_model
## n= 1771
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1771 836 minor_disease (0.5279503 0.4720497)
## 2) CHARLSONSCORE< 0.5 665 180 minor_disease (0.7293233 0.2706767) *
## 3) CHARLSONSCORE>=0.5 1106 450 severe_disease (0.4068716 0.5931284)
## 6) AGE< 47.5 165 65 minor_disease (0.6060606 0.3939394) *
## 7) AGE>=47.5 941 350 severe_disease (0.3719447 0.6280553) *
You can also plot directly using rpart.plot
library(rpart.plot)
rpart.plot(qol_model, type = 4,extra = 1,clip.right.labs = F)
We can use fancyRpartPlot
library("rattle")
fancyRpartPlot(qol_model, cex = 1)
qol_pred<-predict(qol_model, qol_test,type = 'class')
confusionMatrix(table(qol_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred minor_disease severe_disease
## minor_disease 133 64
## severe_disease 90 156
##
## Accuracy : 0.6524
## 95% CI : (0.606, 0.6967)
## No Information Rate : 0.5034
## P-Value [Acc > NIR] : 1.759e-10
##
## Kappa : 0.3053
## Mcnemar's Test P-Value : 0.04395
##
## Sensitivity : 0.5964
## Specificity : 0.7091
## Pos Pred Value : 0.6751
## Neg Pred Value : 0.6341
## Prevalence : 0.5034
## Detection Rate : 0.3002
## Detection Prevalence : 0.4447
## Balanced Accuracy : 0.6528
##
## 'Positive' Class : minor_disease
##
These results are consistent with their counterparts reported using C5.0
. How can we tune the parameters to further improve the results?
set.seed(1234)
control = rpart.control(cp = 0.000, xxval = 100, minsplit = 2)
qol_model= rpart(cd ~ ., data = qol_train[ , -40], control = control)
plotcp(qol_model)
printcp(qol_model)
##
## Classification tree:
## rpart(formula = cd ~ ., data = qol_train[, -40], control = control)
##
## Variables actually used in tree construction:
## [1] AGE CHARLSONSCORE INTERVIEWDATE LANGUAGE
## [5] MSA_Q_01 MSA_Q_02 MSA_Q_03 MSA_Q_04
## [9] MSA_Q_05 MSA_Q_06 MSA_Q_07 MSA_Q_08
## [13] MSA_Q_09 MSA_Q_10 MSA_Q_11 MSA_Q_12
## [17] MSA_Q_13 MSA_Q_14 MSA_Q_15 MSA_Q_16
## [21] MSA_Q_17 PH2_Q_01 PH2_Q_02 QOL_Q_01
## [25] QOL_Q_02 QOL_Q_03 QOL_Q_04 QOL_Q_05
## [29] QOL_Q_06 QOL_Q_07 QOL_Q_08 QOL_Q_09
## [33] QOL_Q_10 RACE_ETHNICITY SEX TOS_Q_01
## [37] TOS_Q_02 TOS_Q_03 TOS_Q_04
##
## Root node error: 836/1771 = 0.47205
##
## n= 1771
##
## CP nsplit rel error xerror xstd
## 1 0.24641148 0 1.0000000 1.00000 0.025130
## 2 0.04186603 1 0.7535885 0.75359 0.024099
## 3 0.00717703 2 0.7117225 0.71651 0.023816
## 4 0.00657895 3 0.7045455 0.72967 0.023920
## 5 0.00598086 9 0.6543062 0.74282 0.024020
## 6 0.00478469 14 0.6244019 0.74282 0.024020
## 7 0.00418660 17 0.6100478 0.75239 0.024090
## 8 0.00398724 21 0.5933014 0.75359 0.024099
## 9 0.00358852 32 0.5466507 0.75957 0.024141
## 10 0.00318979 41 0.5143541 0.77033 0.024215
## 11 0.00299043 53 0.4665072 0.78110 0.024286
## 12 0.00239234 59 0.4485646 0.78469 0.024309
## 13 0.00209330 91 0.3708134 0.80024 0.024406
## 14 0.00199362 95 0.3624402 0.82057 0.024522
## 15 0.00191388 108 0.3349282 0.83014 0.024574
## 16 0.00179426 122 0.2978469 0.82416 0.024542
## 17 0.00159490 151 0.2416268 0.82656 0.024555
## 18 0.00153794 164 0.2177033 0.82895 0.024567
## 19 0.00149522 171 0.2069378 0.83134 0.024580
## 20 0.00119617 182 0.1866029 0.83134 0.024580
## 21 0.00089713 295 0.0514354 0.86842 0.024758
## 22 0.00079745 306 0.0406699 0.87440 0.024783
## 23 0.00071770 309 0.0382775 0.87321 0.024778
## 24 0.00068353 314 0.0346890 0.87321 0.024778
## 25 0.00059809 321 0.0299043 0.88876 0.024841
## 26 0.00039872 367 0.0023923 0.88995 0.024846
## 27 0.00000000 373 0.0000000 0.89474 0.024864
Now, we can prune the tree according to the optimal cp
, complexity parameter to which the rpart
object will be trimmed. Instead of using the real error (e.g., \(1-R^2\), RMSE) to capture the discrepancy between the observed labels and the model-predicted labels, we will use the xerror
which averages the discrepancy between observed and predicted classifications using cross-validation, see Chapter 20.
set.seed(1234)
selected_tr <- prune(qol_model, cp= qol_model$cptable[which.min(qol_model$cptable[,"xerror"]),"CP"])
fancyRpartPlot(selected_tr, cex = 1)
qol_pred_tune<-predict(selected_tr, qol_test,type = 'class')
confusionMatrix(table(qol_pred_tune, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred_tune minor_disease severe_disease
## minor_disease 133 64
## severe_disease 90 156
##
## Accuracy : 0.6524
## 95% CI : (0.606, 0.6967)
## No Information Rate : 0.5034
## P-Value [Acc > NIR] : 1.759e-10
##
## Kappa : 0.3053
## Mcnemar's Test P-Value : 0.04395
##
## Sensitivity : 0.5964
## Specificity : 0.7091
## Pos Pred Value : 0.6751
## Neg Pred Value : 0.6341
## Prevalence : 0.5034
## Detection Rate : 0.3002
## Detection Prevalence : 0.4447
## Balanced Accuracy : 0.6528
##
## 'Positive' Class : minor_disease
##
The result is roughly same as that of C5.0
. Despite the fact that there is no substantial classification improvement, the tree-pruning process generates a graphical representation of the decision making protocol (selected_tr
) that is much simpler and intuitive compared to the original (un-pruned) tree (qol_model
):
fancyRpartPlot(qol_model, cex = 0.1)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
We can change split=“entropy” to “error” or “gini” to apply an alternative information gain index. Experiment with these and compare the results.
set.seed(1234)
qol_model = rpart(cd ~ ., data=qol_train[ , -40], parms = list(split = "entropy"))
fancyRpartPlot(qol_model, cex = 1)
# Modify and test using "error" and "gini"
# qol_pred<-predict(qol_model, qol_test,type = 'class')
# confusionMatrix(table(qol_pred, qol_test$cd))
In addition to the classification trees we just saw, we can explore classification rules that utilize if-else
logical statements to assign classes to unlabeled data. Below we review three classification rule strategies.
Separate and conquer repeatedly splits the data (and subsets of the data) by rules that cover a subset of examples. This procedure is very similar to the Divide and conquer approach. However, a notable difference is that each rule can be independent, yet, each decision node in a tree has to be linked to past decisions.
To understand One Rule (OneR)
algorithm, we need to know about its “sibling” - ZeroR
rule. ZeroR
rule means that we assign the mode class to unlabeled test observations regardless of its feature value. One rule algorithm is an improved version of ZeroR
that uses a single rule for classification. In other words, OneR
splits the training dataset into several segments based on feature values. Then, it assigns the mode of the classes in each segment to related observations in the unlabeled test data. In practice, we first test multiple rules and pick the rule with smallest error rate to be our One Rule
. Remember, the rules are subjective.
The Repeated Incremental Pruning to Produce Error Reduction algorithm is a combination of the ideas behind decision tree and classification rules. It consists of a three-step process:
Let’s take another look at the same dataset as Case Study 1 - this time applying classification rules. Naturally, we skip over the first two data handling steps and go directly to step three.
Let’s start by using the OneR()
function in RWeka
package. Before installing the package you might want to check that the Java program in your computer is up to date. Also, its version has to match the version of R (i.e., 64bit R needs 64bit Java).
The function OneR()
has the following components:
m<-OneR(class~predictors, data=mydata)
y ~ .
instead, we include all of the column variables as predictors.# install.packages("RWeka")
library(RWeka)
# just remove the CHRONICDISEASESCORE but keep cd
set.seed(1234)
qol_1R<-OneR(cd~., data=qol[ , -40])
qol_1R
## CHARLSONSCORE:
## < -4.5 -> severe_disease
## < 0.5 -> minor_disease
## < 5.5 -> severe_disease
## < 8.5 -> minor_disease
## >= 8.5 -> severe_disease
## (1453/2214 instances correct)
Note that 1,453 out of 2,214 cases are correctly classified, \(66\%\), by the “one rule”.
summary(qol_1R)
##
## === Summary ===
##
## Correctly Classified Instances 1453 65.6278 %
## Incorrectly Classified Instances 761 34.3722 %
## Kappa statistic 0.3206
## Mean absolute error 0.3437
## Root mean squared error 0.5863
## Relative absolute error 68.8904 %
## Root relative squared error 117.3802 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 609 549 | a = minor_disease
## 212 844 | b = severe_disease
We obtain a rule that correctly specifies 66% of the patients, which is in line with the prior decision tree classification results. Due to algorithmic stochasticity, it’s normal that these results may vary each time you run the algorithm, albeit we used set.seed(1234)
to ensure some result reproducibility.
Another possible option for the classification rules would be the RIPPER
rule algorithm that we discussed earlier in the chapter. In R we use the Java
based function JRip()
to invoke this algorithm.
JRip()
function has same components with the OneR()
function:
m<-JRip(class~predictors, data=mydata)
set.seed(1234)
qol_jrip1<-JRip(cd~., data=qol[ , -40])
qol_jrip1
## JRIP rules:
## ===========
##
## (CHARLSONSCORE >= 1) and (RACE_ETHNICITY >= 4) and (AGE >= 49) => cd=severe_disease (448.0/132.0)
## (CHARLSONSCORE >= 1) and (AGE >= 53) => cd=severe_disease (645.0/265.0)
## => cd=minor_disease (1121.0/360.0)
##
## Number of Rules : 3
summary(qol_jrip1)
##
## === Summary ===
##
## Correctly Classified Instances 1457 65.8085 %
## Incorrectly Classified Instances 757 34.1915 %
## Kappa statistic 0.3158
## Mean absolute error 0.4459
## Root mean squared error 0.4722
## Relative absolute error 89.3711 %
## Root relative squared error 94.5364 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 761 397 | a = minor_disease
## 360 696 | b = severe_disease
This JRip()
classifier uses only 3 rules and has a relatively similar accuracy 66%. As each individual has unique characteristics, classification in real world data is rarely perfect (close to 100% accuracy).
Another idea is to repeat the generation of trees multiple times, predict according to each tree’s performance, and finally ensemble those weighted votes into a combined classification result. This is precisely the idea behind random forest
classification, see Chapter 14.
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(12)
# rf.fit <- tuneRF(qol_train[ , -40], qol_train[ , 40], stepFactor=1.5)
rf.fit <- randomForest(cd~. , data=qol_train[ , -40],importance=TRUE,ntree=2000,mtry=26)
varImpPlot(rf.fit, cex=0.5); print(rf.fit)
##
## Call:
## randomForest(formula = cd ~ ., data = qol_train[, -40], importance = TRUE, ntree = 2000, mtry = 26)
## Type of random forest: classification
## Number of trees: 2000
## No. of variables tried at each split: 26
##
## OOB estimate of error rate: 35.86%
## Confusion matrix:
## minor_disease severe_disease class.error
## minor_disease 576 359 0.3839572
## severe_disease 276 560 0.3301435
rf.fit1 <- randomForest(cd~. , data=qol_train[ , -40],importance=TRUE,ntree=2000,mtry=26)
rf.fit2 <- randomForest(cd~. , data=qol_train[ , -40], importance=TRUE, nodesize=5, ntree=5000, mtry=26)
plot(rf.fit,log="x",main="rf.fit (Black), rf.fit1 (Red), rf.fit2 (Green)")
points(1:5000, rf.fit1$mse, col="red", type="l")
points(1:5000, rf.fit2$mse, col="green", type="l")
qol_pred<-predict(rf.fit2, qol_test, type = 'class')
confusionMatrix(table(qol_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred minor_disease severe_disease
## minor_disease 138 69
## severe_disease 85 151
##
## Accuracy : 0.6524
## 95% CI : (0.606, 0.6967)
## No Information Rate : 0.5034
## P-Value [Acc > NIR] : 1.759e-10
##
## Kappa : 0.305
## Mcnemar's Test P-Value : 0.2268
##
## Sensitivity : 0.6188
## Specificity : 0.6864
## Pos Pred Value : 0.6667
## Neg Pred Value : 0.6398
## Prevalence : 0.5034
## Detection Rate : 0.3115
## Detection Prevalence : 0.4673
## Balanced Accuracy : 0.6526
##
## 'Positive' Class : minor_disease
##
These variable importance plots (varplot
) show the rank orders of importance of the features according to the specific index (Accuracy, left, and Gini, right). More information about random forests is available in Chapter 14: Improving Model Performance.
In random forest (RF) classification, the node size (nodesize
) refers to the smallest node that can be split, i.e., nodes with fewer cases than the nodesize
are never subdivided. Increasing the node size leads to smaller trees, which may compromise previous predictive power. On the flip side, increasing the tree size (maxnodes
) and the number of trees (ntree
) tends to increase the predictive accuracy. However, there are tradeoffs between increasing node-size and tree-size simultaneously. To optimize the RF predictive accuracy, try smaller node sizes and more trees. Ensembling (forest) results from a larger number of trees will likely generate better results.
In the previous case study, we classified the CHRONICDISEASESCORE
into two groups. What will happen if we use three groups? Let’s separate CHRONICDISEASESCORE
evenly into three groups. Recall the quantile()
function that we talked about in Chapter 2. We can use it to get the cut-points for classification. Then, a for
loop will help us split the variable CHRONICDISEASESCORE
into three categories.
quantile(qol$CHRONICDISEASESCORE, probs = c(1/3, 2/3))
## 33.33333% 66.66667%
## 1.06 1.80
for(i in 1:2214){
if(qol$CHRONICDISEASESCORE[i]>0.7&qol$CHRONICDISEASESCORE[i]<2.2){
qol$cdthree[i]=2
}
else if(qol$CHRONICDISEASESCORE[i]>=2.2){
qol$cdthree[i]=3
}
else{
qol$cdthree[i]=1
}
}
qol$cdthree<-factor(qol$cdthree, levels=c(1, 2, 3), labels = c("minor_disease", "mild_disease", "severe_disease"))
After labeling the three categories in the new variable cdthree
, our job of preparing the class variable is done. Let’s follow along the earlier sections in the chapter to figure out how well does the tree classifiers and rule classifiers perform in the three-category case. First, try to build a tree classifier using C5.0()
with 10 boost trials. One small tip is that in the training dataset, we cannot have column 40 (CHRONICDISEASESCORE
), 41 (cd
) and now 42 (cdthree
) because they all contain class outcome related variables.
# qol_train1<-qol[1:2114, ]
# qol_test1<-qol[2115:2214, ]
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train1<-qol[train_index, ]
qol_test1<-qol[-train_index, ]
prop.table(table(qol_train1$cdthree))
##
## minor_disease mild_disease severe_disease
## 0.1699605 0.6459627 0.1840768
prop.table(table(qol_test1$cdthree))
##
## minor_disease mild_disease severe_disease
## 0.1760722 0.6478555 0.1760722
set.seed(1234)
qol_model1<-C5.0(qol_train1[ , -c(40, 41, 42)], qol_train1$cdthree, trials=10)
qol_model1
##
## Call:
## C5.0.default(x = qol_train1[, -c(40, 41, 42)], y =
## qol_train1$cdthree, trials = 10)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Number of boosting iterations: 10
## Average tree size: 230.5
##
## Non-standard options: attempt to group attributes
qol_pred1<-predict(qol_model1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 12 58 8
## mild_disease 23 239 25
## severe_disease 3 61 14
##
## Overall Statistics
##
## Accuracy : 0.5982
## 95% CI : (0.5509, 0.6442)
## No Information Rate : 0.8081
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0923
## Mcnemar's Test P-Value : 4.174e-07
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity 0.31579 0.6676
## Specificity 0.83704 0.4353
## Pos Pred Value 0.15385 0.8328
## Neg Pred Value 0.92877 0.2372
## Prevalence 0.08578 0.8081
## Detection Rate 0.02709 0.5395
## Detection Prevalence 0.17607 0.6479
## Balanced Accuracy 0.57641 0.5514
## Class: severe_disease
## Sensitivity 0.2979
## Specificity 0.8384
## Pos Pred Value 0.1795
## Neg Pred Value 0.9096
## Prevalence 0.1061
## Detection Rate 0.0316
## Detection Prevalence 0.1761
## Balanced Accuracy 0.5681
We can see that the prediction accuracy with three categories is way lower than the one we did with two categories.
Secondly, try to build a rule classifier with OneR()
.
set.seed(1234)
qol_1R1<-OneR(cdthree~., data=qol[ , -c(40, 41)])
qol_1R1
## INTERVIEWDATE:
## < 3.5 -> mild_disease
## < 28.5 -> severe_disease
## < 282.0 -> mild_disease
## < 311.5 -> severe_disease
## >= 311.5 -> mild_disease
## (1436/2214 instances correct)
summary(qol_1R1)
##
## === Summary ===
##
## Correctly Classified Instances 1436 64.86 %
## Incorrectly Classified Instances 778 35.14 %
## Kappa statistic 0.022
## Mean absolute error 0.2343
## Root mean squared error 0.484
## Relative absolute error 67.5977 %
## Root relative squared error 116.2958 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 0 375 4 | a = minor_disease
## 0 1422 9 | b = mild_disease
## 0 390 14 | c = severe_disease
qol_pred1<-predict(qol_1R1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 0 78 0
## mild_disease 0 285 2
## severe_disease 0 76 2
##
## Overall Statistics
##
## Accuracy : 0.6479
## 95% CI : (0.6014, 0.6923)
## No Information Rate : 0.991
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.012
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity NA 0.64920
## Specificity 0.8239 0.50000
## Pos Pred Value NA 0.99303
## Neg Pred Value NA 0.01282
## Prevalence 0.0000 0.99097
## Detection Rate 0.0000 0.64334
## Detection Prevalence 0.1761 0.64786
## Balanced Accuracy NA 0.57460
## Class: severe_disease
## Sensitivity 0.500000
## Specificity 0.826879
## Pos Pred Value 0.025641
## Neg Pred Value 0.994521
## Prevalence 0.009029
## Detection Rate 0.004515
## Detection Prevalence 0.176072
## Balanced Accuracy 0.663440
The OneRule
classifier that is purely based on the value of the INTERVIEWDATE
has 65% internal classification accuracy, and also 65% external (validation data) prediction accuracy. Although, the latter assessment is a bit misleading as the vast majority of external validation data are classified in only one class - mild_disease.
Finally, let’s revisit the JRip()
classifier with the same three class labels according to cdthree
.
set.seed(1234)
qol_jrip1<-JRip(cdthree~., data=qol[ , -c(40, 41)])
qol_jrip1
## JRIP rules:
## ===========
##
## (CHARLSONSCORE <= 0) and (AGE <= 50) and (MSA_Q_06 <= 1) and (QOL_Q_07 >= 1) and (MSA_Q_09 <= 1) => cdthree=minor_disease (35.0/11.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_10 >= 4) and (QOL_Q_07 >= 9) => cdthree=severe_disease (54.0/20.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 5) and (MSA_Q_09 <= 4) and (MSA_Q_04 >= 3) => cdthree=severe_disease (64.0/30.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 4) and (PH2_Q_01 >= 3) and (QOL_Q_10 >= 4) and (RACE_ETHNICITY >= 4) => cdthree=severe_disease (43.0/19.0)
## => cdthree=mild_disease (2018.0/653.0)
##
## Number of Rules : 5
summary(qol_jrip1)
##
## === Summary ===
##
## Correctly Classified Instances 1481 66.8925 %
## Incorrectly Classified Instances 733 33.1075 %
## Kappa statistic 0.1616
## Mean absolute error 0.3288
## Root mean squared error 0.4055
## Relative absolute error 94.8702 %
## Root relative squared error 97.42 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 24 342 13 | a = minor_disease
## 10 1365 56 | b = mild_disease
## 1 311 92 | c = severe_disease
qol_pred1<-predict(qol_jrip1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 5 70 3
## mild_disease 2 275 10
## severe_disease 0 61 17
##
## Overall Statistics
##
## Accuracy : 0.6704
## 95% CI : (0.6245, 0.7141)
## No Information Rate : 0.9165
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1583
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity 0.71429 0.6773
## Specificity 0.83257 0.6757
## Pos Pred Value 0.06410 0.9582
## Neg Pred Value 0.99452 0.1603
## Prevalence 0.01580 0.9165
## Detection Rate 0.01129 0.6208
## Detection Prevalence 0.17607 0.6479
## Balanced Accuracy 0.77343 0.6765
## Class: severe_disease
## Sensitivity 0.56667
## Specificity 0.85230
## Pos Pred Value 0.21795
## Neg Pred Value 0.96438
## Prevalence 0.06772
## Detection Rate 0.03837
## Detection Prevalence 0.17607
## Balanced Accuracy 0.70948
In terms of the predictive accuracy on the testing data (qol_test1$cdthree
), we can see from these outputs that the RIPPER
algorithm performed better (67%) than the C5.0
decision tree (60%), yet worse than the OneR
algorithm (65%), which suggests that simple algorithms might outperform complex methods for certain real world case-studies. Later, in Chapter 14, we will provide more details about optimizing and improving classification and prediction performance.
Try to replicate these results with other data from the list of our Case-Studies.