When the classification rules need to be transparent and explicit, the kNN and naive Bayes methods we presented earlier may not be appropriate, as they do not express their decisions as explicit rules. In some situations, we need well-stated rules for our decisions, just like a scoring criterion for driving ability, credit scoring, or a grading rubric. For instance, a driving test may assign credit or deduct penalty points for each driving task; the cumulative result of such explicit rules is effectively a decision tree that judges whether an applicant is qualified to receive a driver's license.
In this chapter, we will (1) see a simple motivational example of decision trees based on the Iris data, (2) describe decision-tree divide and conquer methods, (3) examine certain measures quantifying classification accuracy, (4) show strategies for pruning decision trees, (5) work through a Quality of Life in Chronic Disease case-study, and (6) review the One Rule and RIPPER algorithms.
Decision tree learners enable classification via tree structures modeling the relationships among all features and potential outcomes in the data. All decision trees begin with a trunk (root) representing all data points as members of the same cohort. Iteratively, the trunk is split into narrower and narrower branches by forking decisions based on the intrinsic data structure. At each step, splitting the data into branches may involve binary or multinomial classification. The final decision is obtained when the tree branching process terminates. The terminal (leaf) nodes represent the action to be taken as the result of the series of branching decisions. For predictive models, the leaf nodes provide the expected forecasting results given the series of events in the decision tree.
There are a number of R packages available for decision tree classification, including rpart, C5.0, party, etc.
Let's start with a simple example using the Iris dataset, which we saw in Chapter 2. The data features (attributes) include Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, and the classes are represented by the Species taxa (setosa, versicolor, and virginica).
# install.packages("party")
library("party")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris); table(iris$Species)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
##
## setosa versicolor virginica
## 50 50 50
The ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris) call below builds a conditional-inference decision tree.
iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
print(iris_ctree)
##
## Conditional inference tree with 4 terminal nodes
##
## Response: Species
## Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## Number of observations: 150
##
## 1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
## 2)* weights = 50
## 1) Petal.Length > 1.9
## 3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
## 4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
## 5)* weights = 46
## 4) Petal.Length > 4.8
## 6)* weights = 8
## 3) Petal.Width > 1.7
## 7)* weights = 46
plot(iris_ctree, cex=2) # party::plot.BinaryTree()
head(iris); tail(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
Similarly, we can demonstrate a classification of the iris taxa via rpart:
library(rpart)
library(rpart.utils)
library(plotly)
# iris_rpart = rpart(Species~., data=iris)
# print(iris_rpart)
#
# Use the `rattle::fancyRpartPlot` to generates an elegant plot
# install.packages("rattle")
# library(rattle)
# fancyRpartPlot(iris_rpart, cex = 1.5, caption = "rattle::fancyRpartPlot (Iris flowers)")
#
#### Plot_ly decision tree visualization
iris_rpart <- rpart(Species ~ . , method="class", data=iris)
# printcp(iris_rpart)

# Set the Graph/Tree Node names
treeFrame <- iris_rpart$frame

# Identify Leaf nodes
isLeave <- treeFrame$var == "<leaf>"
nodes <- rep(NA, length(isLeave))

# Get the 3 Iris taxa labels: "setosa", "versicolor", "virginica" for all nodes (including leaves)
ylevel <- attr(iris_rpart, "ylevels")
nodes[isLeave] <- ylevel[treeFrame$yval][isLeave]
nodes[!isLeave] <- labels(iris_rpart)[-1][!isLeave[-length(isLeave)]]

# Get all Tree Graph Connections (source --> target)
treeFrame <- iris_rpart$frame
treeRules <- rpart.utils::rpart.rules(iris_rpart)

targetPaths <- sapply(as.numeric(row.names(treeFrame)),
                      function(x) { strsplit(unlist(treeRules[x]), split=",") })
lastStop <- sapply(1:length(targetPaths),
                   function(x) { targetPaths[[x]][length(targetPaths[[x]])] })
oneBefore <- sapply(1:length(targetPaths),
                    function(x) { targetPaths[[x]][length(targetPaths[[x]])-1] })

# Initialize the tree edges
target = c()
source = c()
values = treeFrame$n
for (i in 2:length(oneBefore)) {
  tmpNode = oneBefore[[i]]
  q = which(lastStop == tmpNode)
  q = ifelse(length(q)==0, 1, q)
  source = c(source, q)
  target = c(target, i)
}
source = source - 1
target = target - 1
# Display the sankey tree
plot_ly(type = "sankey", orientation = "h",
node = list(label = nodes, pad = 15, thickness = 30, line = list(color="black", width=1)),
link = list(source = source, target = target, value=values[-1])) %>%
layout(title = "Iris rPart Decision Tree", font = list(size = 12))
A decision tree can be viewed as an upside-down tree where a series of logical decisions are encoded as branches between nodes. Classification begins at the root node and proceeds through successive branches until a terminal node is reached. This iterative process splits the data into different classes by rigid criteria.
Decision trees involve recursive partitioning that use data features and attributes to split the data into groups (nodes) of similar classes.
To make classification trees using data features, we need to observe the pattern between the data features and potential classes using training data. We can draw scatter plots and separate groups that are clearly bundled together. Each of the groups is considered a segment of the data. After getting the approximate range of each feature value within each group, we can make the decision tree.
\[ X = [X_1, X_2, X_3, ..., X_k] = \underbrace{ \begin{pmatrix} x_{1, 1} & x_{1, 2} & ... & x_{1, k} \\ x_{2, 1} & x_{2, 2} & ... & x_{2, k} \\ ... & ... & ... & ... \\ x_{n, 1} & x_{n, 2} & ... & x_{n, k} \end{pmatrix} }_{features/attributes} \left\{ \begin{array}{ll} & \\ cases & \\ & \end{array} \right. \]
The decision tree algorithms use a top-down recursive divide-and-conquer approach (sometimes they may also use bottom-up or mixed splitting strategies) to divide and evaluate the splits of a dataset \(D\) (input). The best split decision corresponds to the split with the highest information gain, reflecting a partition of the data into \(K\) subsets (using divide-and-conquer). The iterative algorithm terminates when some stopping criteria are reached. Examples of stopping conditions used to terminate the recursive process include:
- All the samples in a node belong to the same class, i.e., the node is pure.
- There are no remaining attributes on which the samples may be further partitioned.
- There are no samples left.
One objective criterion for splitting or clustering data into groups is based on the information gain measure, or impurity reduction, which can be used to select the test attribute at each node in the decision tree. The attribute with the highest information gain (i.e., greatest entropy reduction) is selected as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions. There are three main measures to evaluate the impurity reduction: misclassification error, Gini index, and entropy.
For a given table containing pairs of attributes and their class labels, we can assess the homology of the classes in the table. A table is pure (homogeneous) if it only contains a single class. If a data table contains several classes, then we say that the table is impure or heterogeneous. This degree of impurity or heterogeneity can be quantitatively evaluated using impurity measures like entropy, Gini index, and misclassification error.
Entropy is an information measure of the amount of disorder or uncertainty in a system. Suppose we have a data set \(D=(X_1, X_2, ..., X_n)\) that includes \(n\) features (variables) and suppose each of these features can take on any of \(k\) possible values (states). Then the cardinality of the entire system is \(k^n\), as each of the features is assumed to have \(k\) independent states, and thus the total number of different datasets that can be expected is \(\underbrace{k\times k\times k \times ...\times k}_{n}=k^n\). Suppose \({\displaystyle p_{1},p_{2},..., p_k}\) represent the proportions of each class (note: \(\sum_{i} {p_i}=1\)) present in the child node that results from a split in a decision tree classifier. Then the entropy measure is defined by: \[ Entropy(D)= - \sum_{i} {p_i \log_{2}{p_i}}.\] If each of the \(1\le i\leq k\) states is equally likely to be observed, with probability \(p_i=\frac{1}{k}\), then the entropy is maximized: \[ Entropy(D)=- \sum_{i=1}^{k} {\frac{1}{k} \log_2{\frac{1}{k}}}= \sum_{i=1}^{k} {\frac{1}{k} \log_2{k}}= k\left(\frac{1}{k}\right) \log_2{k}=\log_2{k}.\]
At the other extreme, the entropy is minimized when the probability of one class is unitary and all the others are trivial. Note that by L'Hopital's Rule \(\lim_{x\longrightarrow 0^+} {x \log(x)} =\lim_{x\longrightarrow 0^+} {\frac{\log(x)}{\frac{1}{x}}}=\lim_{x\longrightarrow 0^+} {\frac{\frac{1}{x}}{-\frac{1}{x^2}}}=\lim_{x\longrightarrow 0^+} {(-x)}=0\). Hence, for a single-class classification where the probability of one class is unitary (\(p_{i_o}=1\)) and the others are trivial (\(p_{i\not= i_o}=0\)):
\[Entropy(D)=p_{i_o}\log{p_{i_o}} + \sum_{i\not=i_o}{p_i \log{p_i}}=1\times \log{1} + \lim_{x \longrightarrow 0^+}{\sum_{i\not=i_o}{x \log(x)}} = 0+0=0.\]
In classification settings, higher entropy (i.e., more disorder) corresponds to a sample that has a mixed collection of labels. Conversely, lower entropy corresponds to a classification where we have mostly pure partitions. In general, the entropy of a sample \(D=\{X_1, X_2, ..., X_n\}\) is defined by:
\[H(D) = - \sum_{i=1}^{k} {P(c_i|D) \log{P(c_i|D)}}, \]
where \(P(c_i|D)\) is the probability of a data point in \(D\) being labeled with class \(c_i\), and \(k\) is the number of classes (clusters). \(P(c_i|D)\) can be estimated from the observed data by:
\[P(c_i|D) = \frac{P(D | c_i)P(c_i)}{P(D)} = \frac{|\{x_j \in D | x_j \text{ has label } y_j = c_i\}|}{|D|}.\]
Observe that if the observations are evenly split among all \(k\) classes, then \(P(c_i|D)=\frac{1}{k}\) and \[H(D) = - \sum_{i=1}^{k} {\frac{1}{k} \log{\frac{1}{k}}}=\log{k}.\]
At the other extreme, if all the observations are from one class then: \[H(D)=-1*\log(1)=0.\]
Also note that the base of the log function is somewhat irrelevant and can be used to normalize (scale) the range of the entropy \(\left ( log_b(x) = \frac{log_2(x)}{log_2(b)}\right )\).
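To make these extreme cases concrete, here is a minimal R sketch (the helper function entropy() is ours, not from a package) verifying that the uniform distribution maximizes entropy at \(\log_2{k}\) and that a pure class yields 0:
# Shannon entropy (base 2) of a probability vector; 0*log(0) is treated as 0
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

entropy(rep(1/4, 4))   # uniform over k=4 classes: log2(4) = 2
entropy(c(1, 0, 0, 0)) # a single (pure) class: 0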
The information gain is the expected reduction in entropy caused by knowing the value of an attribute.
Similar to the entropy measure, the misclassification error and the Gini index can also be used to evaluate information gain. The misclassification error is defined by the formula:
\[ME = 1-\max_k (p_k).\]
And the Gini index is expressed as: \[GI = \sum_{k}p_k(1-p_k)=1-\sum_{k}p_k^2.\]
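As a quick side-by-side comparison, the following sketch (the helper names are ours) evaluates all three impurity measures on one impure and one pure two-class node:
misclass_error <- function(p) 1 - max(p)     # ME = 1 - max_k(p_k)
gini_index     <- function(p) 1 - sum(p^2)   # GI = 1 - sum_k(p_k^2)
entropy2       <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

p_impure <- c(0.8, 0.2)
c(ME = misclass_error(p_impure), GI = gini_index(p_impure), H = entropy2(p_impure))
# A pure node scores 0 under all three measures
c(ME = misclass_error(c(1, 0)), GI = gini_index(c(1, 0)), H = entropy2(c(1, 0)))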
The C5.0 algorithm is a popular implementation of decision trees. To begin with, let's consider the term purity. If a segment of data contains observations from only a single class, it is considered pure. The entropy provides a mathematical formalism for measuring the purity of data segments.
\[Entropy(S)=- \sum_{i=1}^c {p_i \log_2(p_i)},\]
where \(c\) is the total number of class levels and \(p_i\) refers to the proportion of observations that fall into class \(i\) (i.e., the probability that a randomly selected data point belongs to the \(i^{th}\) class level). For 2 possible classes, the entropy ranges from 0 to 1. For \(n\) classes, the entropy ranges from \(0\) to \(\log_2(n)\), where the minimum entropy corresponds to data that is purely homogeneous (completely deterministic/predictable) and the maximum entropy represents completely disordered data (stochastic or extremely noisy). You might wonder what the benefit of using the entropy is. The answer is that the smaller the entropy, the more information is gained by the corresponding decision node split. Systems (data) with high entropy carry significant information content (randomness), whereas data with low entropy are highly compressible, indicating internal structure.
If we only have one class in a segment, the entropy is trivial: \(Entropy(S)=(-1)\times \log_2(1)=0\). Let's try another example. If we have a segment of data that contains two classes, where the first class contains 80% of the data and the second the remaining 20%, then the entropy is: \(Entropy(S)=-0.8\log_2(0.8)-0.2\log_2(0.2)=0.7219281.\)
For a two class problem, the relationship between class-proportions and entropy is illustrated in the following graph. Assume \(x\) is the proportion for one of the classes.
N <- 100
set.seed(1234)
x <- runif(N)
x_ind <- sort(x)
y <- -x_ind*log2(x_ind) - (1-x_ind)*log2(1-x_ind)
# curve(-x*log2(x)-(1-x)*log2(1-x), col="red", main="Entropy for Different Proportions", xlab = "x (proportion for class 1)", ylab = "Entropy", lwd=3)
# segments(0.8, 0, 0.8, 0.7219281, lwd=2, col="blue", lty = 2)
# points(0.8, 0.7219281, col = "blue", pch=21, cex=2, lwd=2)
# text(-0.02, 0.3, pos=4, srt = 90, "High-Structure/Low-Entropy")
# text(0.48, 0.3, pos=4, srt = 90, "Low-Structure/High-Entropy")
# text(0.96, 0.3, pos=4, srt = 90, "High-Structure/Low-Entropy")
# text(0.65, 0.8, pos=4, "(x=0.8,Entr=0.72)")
modelLabels <- c('High-Structure/Low-Entropy', 'Low-Structure/High-Entropy', 'High-Structure/Low-Entropy')
modelLabels.x <- c(0.03, 0.5, 0.97)
modelLabels.y <- c(0.5, 0.7, 0.5)
modelLabels.col <- c("blue", "red", "blue")
plot_ly() %>%
add_lines(x = ~x_ind, y = ~y, name="Entropy", line = list(width = 4)) %>%
add_segments(x=0.8, xend=0.8, y=0.07, yend = 0.7219281, showlegend=F) %>%
# add_segments(x=-0.2, xend=1.1, y=0, yend = 0, showlegend=F) %>%
# add_markers(x = ~x, y = ~y, name="Sample Simulated Data") %>%
#add_lines(x = ~c(point1$x,point2$x), y = ~c(point1$y,point2$y), name="Second PC, Orthogonal to lm(Y ~ X)",
# line = list(width = 4)) %>%
add_markers(x = 0.8, y = 0.7219281, name="(x=0.8,Entropy=0.72)", marker = list(size = 20,
color = 'green', line = list(color = 'yellow', width = 2))) %>%
layout(title="Binary Class Entropy",
xaxis=list(title="Proportion of Class 1 Obsevations", scaleanchor="y"), # control the y:x axes aspect ratio
yaxis = list(title="Entropy", scaleanchor = "x", range=c(-0.1, 1.1)), legend = list(orientation = 'h'),
annotations = list(text=modelLabels, x=modelLabels.x, y=modelLabels.y, textangle=90,
font=list(size=15, color=modelLabels.col), showarrow=FALSE))
The closer the binary proportion split is to \(0.5\) the greater the entropy. The more homogeneous the split (one class becomes the majority) the lower the entropy. Decision trees aim to find splits in the data that reduce the entropy, i.e., increasing the homogeneity of the elements within all classes.
This measure allows us to compare the information gained by using different features to partition the data. Let's consider this scenario. Suppose \(S\) and \(S_1\) represent the entropy of the system before and after the splitting/partitioning of the data according to a specific data feature attribute (\(F\)). Denote the entropies of the original and the derived partition by \(Entropy(S)\) and \(Entropy(S_1)\), respectively. The information we gained from partitioning the data using this specific feature (F) is calculated as a change in the entropy:
\[Gain(F) = Entropy(S)-Entropy(S_1)\]
Note that smaller entropy \(Entropy(S_1)\) corresponds to better classification and more information gained. A more complicated case arises when a partition creates multiple segments. Then the entropy of the partition is calculated by the following formula:
\[Entropy(S)=\sum_{i=1}^n w_i Entropy(P_i)=\sum_{i=1}^{n} w_i \left (-\sum_{j=1}^{c}p_{ij}\log_2(p_{ij})\right ),\]
where \(w_i\) is the proportion of examples falling in segment \(P_i\) and \(p_{ij}\) is the proportion of class \(j\) within segment \(i\). Thus, the total entropy of a partition method is a weighted sum of the entropies of the segments created by this method.
When we get the maximum reduction in entropy with a feature (\(F\)) then the \(Gain(F)=Entropy(S)\), since \(Entropy(S_1)=0\). On the contrary, if we gain no information with this feature, we have \(Gain(F)=0\).
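As a hedged numerical illustration (the feature F and its split are hypothetical), consider a parent node with 100 cases split 50/50 between two classes, which F partitions into two equal segments that are each 90% pure:
entropy2 <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

parent <- entropy2(c(0.5, 0.5))   # parent node entropy = 1 bit
# segment 1: 50 cases, 90% class A; segment 2: 50 cases, 10% class A
child  <- 0.5*entropy2(c(0.9, 0.1)) + 0.5*entropy2(c(0.1, 0.9))
parent - child                    # Gain(F) ~ 0.53 bits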
When growing a decision tree, we can keep splitting the observations using as many partitions as we like, which may eventually over-classify the data. An extreme example would be making each observation its own class, which is meaningless.
So how do we control the size of the decision tree? One possible solution is to set a cutoff for the number of decisions that a decision tree may make. Similarly, we can control the minimum number of points in each segment. This approach is called early stopping, or pre-pruning, the decision tree. However, it might terminate the decision procedure before some important partitions occur.
Another solution, post-pruning, begins by growing a large decision tree and subsequently trimming branches based on error rates, with penalties at the nodes. This is often more effective than pre-pruning.
The C5.0 algorithm uses the post-pruning method to control the size of the decision tree. It first grows a large, overfitting tree that contains all the possible partitions. Then it cuts out nodes and branches that have little effect on classification errors.
In this chapter, we are using the Quality of Life and chronic disease dataset, Case06_QoL_Symptom_ChronicIllness.csv. This dataset has 41 variables; a detailed description of each variable is provided here. The most important variables for this case study are CHARLSONSCORE (the Charlson Comorbidity Index) and CHRONICDISEASESCORE (the composite chronic disease score we will try to predict).
Let’s load the data first.
qol <- read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
str(qol)
## 'data.frame': 2356 obs. of 41 variables:
## $ ID : int 171 171 172 179 180 180 181 182 183 186 ...
## $ INTERVIEWDATE : int 0 427 0 0 0 42 0 0 0 0 ...
## $ LANGUAGE : int 1 1 1 1 1 1 1 1 1 2 ...
## $ AGE : int 49 49 62 44 64 64 52 48 49 78 ...
## $ RACE_ETHNICITY : int 3 3 3 7 3 3 3 3 3 4 ...
## $ SEX : int 2 2 2 2 1 1 2 1 1 1 ...
## $ QOL_Q_01 : int 4 4 3 6 3 3 4 2 3 5 ...
## $ QOL_Q_02 : int 4 3 3 6 2 5 4 1 4 6 ...
## $ QOL_Q_03 : int 4 4 4 6 3 6 4 3 4 4 ...
## $ QOL_Q_04 : int 4 4 2 6 3 6 2 2 5 2 ...
## $ QOL_Q_05 : int 1 5 4 6 2 6 4 3 4 3 ...
## $ QOL_Q_06 : int 4 4 2 6 1 2 4 1 2 4 ...
## $ QOL_Q_07 : int 1 2 5 -1 0 5 8 4 3 7 ...
## $ QOL_Q_08 : int 6 1 3 6 6 6 3 1 2 4 ...
## $ QOL_Q_09 : int 3 4 3 6 2 2 4 2 2 4 ...
## $ QOL_Q_10 : int 3 1 3 6 3 6 3 2 4 3 ...
## $ MSA_Q_01 : int 1 3 2 6 2 3 4 1 1 2 ...
## $ MSA_Q_02 : int 1 1 2 6 1 6 4 3 2 4 ...
## $ MSA_Q_03 : int 2 1 2 6 1 2 3 3 1 2 ...
## $ MSA_Q_04 : int 1 3 2 6 1 2 1 4 1 5 ...
## $ MSA_Q_05 : int 1 1 1 6 1 2 1 6 3 2 ...
## $ MSA_Q_06 : int 1 2 2 6 1 2 1 1 2 2 ...
## $ MSA_Q_07 : int 2 1 3 6 1 1 1 1 1 5 ...
## $ MSA_Q_08 : int 1 1 1 6 1 1 1 1 2 1 ...
## $ MSA_Q_09 : int 1 1 1 6 2 2 4 6 2 1 ...
## $ MSA_Q_10 : int 1 1 1 6 1 1 1 1 1 3 ...
## $ MSA_Q_11 : int 2 3 2 6 1 1 2 1 1 5 ...
## $ MSA_Q_12 : int 1 1 2 6 1 1 2 6 1 3 ...
## $ MSA_Q_13 : int 1 1 1 6 1 6 2 1 4 2 ...
## $ MSA_Q_14 : int 1 1 1 6 1 2 1 1 3 1 ...
## $ MSA_Q_15 : int 2 1 1 6 1 1 3 2 1 3 ...
## $ MSA_Q_16 : int 2 3 5 6 1 2 1 2 1 2 ...
## $ MSA_Q_17 : int 2 1 1 6 1 1 1 1 1 3 ...
## $ PH2_Q_01 : int 3 2 1 5 1 1 3 1 2 3 ...
## $ PH2_Q_02 : int 4 4 1 5 1 2 1 1 4 2 ...
## $ TOS_Q_01 : int 2 2 2 4 1 1 2 2 1 1 ...
## $ TOS_Q_02 : int 1 1 1 4 4 4 1 2 4 4 ...
## $ TOS_Q_03 : int 4 4 4 4 4 4 4 4 4 4 ...
## $ TOS_Q_04 : int 5 5 5 5 5 5 5 5 5 5 ...
## $ CHARLSONSCORE : int 2 2 3 1 0 0 2 8 0 1 ...
## $ CHRONICDISEASESCORE: num 1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...
Most of the coded variables, like QOL_Q_01 (health rating), have ordinal values (1=excellent, 2=very good, 3=good, 4=fair, 5=poor, 6=no answer). We can use the table() function to see their distributions. We also have some numerical variables in the dataset, like CHRONICDISEASESCORE, which we can inspect using summary().
Our variable of interest, CHRONICDISEASESCORE, has some missing data. A simple way to address this is to delete the observations with missing values. You could also try to impute the missing values using the various imputation methods mentioned in Chapter 2.
table(qol$QOL_Q_01)
##
## 1 2 3 4 5 6
## 44 213 801 900 263 135
qol <- qol[!qol$CHRONICDISEASESCORE==-9, ]
summary(qol$CHRONICDISEASESCORE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.880 1.395 1.497 1.970 4.760
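As a hedged alternative sketch (run on a fresh copy, so it does not affect the pipeline below), one could median-impute the missing scores (coded -9) instead of dropping those cases:
# Alternative (not used below): median-impute instead of deleting the flagged cases
qol_imp <- read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
obs_med <- median(qol_imp$CHRONICDISEASESCORE[qol_imp$CHRONICDISEASESCORE != -9])
qol_imp$CHRONICDISEASESCORE[qol_imp$CHRONICDISEASESCORE == -9] <- obs_med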
Let's create two classes using the variable CHRONICDISEASESCORE. We classify the patients with CHRONICDISEASESCORE < mean(CHRONICDISEASESCORE) as having minor disease, and the rest as having severe disease. This dichotomous classification (qol$cd) may not be perfect; we will discuss alternative classification strategies in the practice problems at the end of the chapter.
qol$cd <- qol$CHRONICDISEASESCORE > 1.497
qol$cd <- factor(qol$cd, levels=c(F, T), labels = c("minor_disease", "severe_disease"))
To make the qol data more organized, we can order the data by the variable ID.
qol <- qol[order(qol$ID), ]

# Remove ID (col=1); the clinical Diagnosis (col=41) will be handled later
qol <- qol[ , -1]
Then, we are able to subset training and testing datasets. Here is an example of a non-random split of the entire data into training (\(2114\)) and testing (\(100\)) sets:
qol_train <- qol[1:2114, ]
qol_test  <- qol[2115:2214, ]
And here is an example of random assignment of cases into training and testing sets (an 80-20% split):
set.seed(1234)
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train <- qol[train_index, ]
qol_test  <- qol[-train_index, ]
We can quickly inspect the distributions of the training and testing data to ensure they are not vastly different. We can see that the classes are split fairly equally between the training and testing datasets.
prop.table(table(qol_train$cd))
##
## minor_disease severe_disease
## 0.5324675 0.4675325
prop.table(table(qol_test$cd))
##
## minor_disease severe_disease
## 0.4853273 0.5146727
In this section, we are using the C5.0() function from the C50 package. The function C5.0() has the following components:
m<-C5.0(train, class, trials=1, costs=NULL)
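The roles of these arguments, annotated as a brief sketch (see ?C50::C5.0 for the authoritative documentation):
# train  - a data frame of training features (the x argument)
# class  - a factor vector with the class label of each training row (the y argument)
# trials - an integer number of boosting iterations (1 = a single tree)
# costs  - an optional matrix of costs associated with the possible errors
m <- C5.0(train, class, trials=1, costs=NULL)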
You could delete the # in the following code and run it in R to install and load the C50 package.
# install.packages("C50")
library(C50)
In the qol dataset (the ID column is already removed), column 41 is the class vector (qol$cd) and column 40 is its numerical version (qol$CHRONICDISEASESCORE). We need to exclude these two columns to create a training dataset that contains only features.
summary(qol_train[,-c(40, 41)])
## INTERVIEWDATE LANGUAGE AGE RACE_ETHNICITY
## Min. : 0.00 Min. :1.000 Min. :20.00 Min. :1.000
## 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.:52.00 1st Qu.:3.000
## Median : 0.00 Median :1.000 Median :59.00 Median :3.000
## Mean : 20.29 Mean :1.199 Mean :58.55 Mean :3.618
## 3rd Qu.: 0.00 3rd Qu.:1.000 3rd Qu.:66.00 3rd Qu.:4.000
## Max. :440.00 Max. :2.000 Max. :90.00 Max. :7.000
## SEX QOL_Q_01 QOL_Q_02 QOL_Q_03
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :1.000 Median :4.000 Median :3.000 Median :4.000
## Mean :1.423 Mean :3.678 Mean :3.417 Mean :3.705
## 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :2.000 Max. :6.000 Max. :6.000 Max. :6.000
## QOL_Q_04 QOL_Q_05 QOL_Q_06 QOL_Q_07
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :-1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 1.000
## Median :3.000 Median :3.000 Median :3.000 Median : 5.000
## Mean :3.021 Mean :3.099 Mean :2.963 Mean : 4.172
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.: 7.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :10.000
## QOL_Q_08 QOL_Q_09 QOL_Q_10 MSA_Q_01
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.000 Median :3.000
## Mean :2.885 Mean :3.151 Mean :2.922 Mean :2.975
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_02 MSA_Q_03 MSA_Q_04 MSA_Q_05
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :2.000 Median :2.000 Median :1.000
## Mean :3.066 Mean :2.298 Mean :2.404 Mean :1.895
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_06 MSA_Q_07 MSA_Q_08 MSA_Q_09 MSA_Q_10
## Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :1.00 Median :2.000 Median :1.000
## Mean :2.401 Mean :2.172 Mean :1.48 Mean :2.193 Mean :1.717
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:1.00 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :6.000 Max. :6.000 Max. :6.00 Max. :6.000 Max. :6.000
## MSA_Q_11 MSA_Q_12 MSA_Q_13 MSA_Q_14
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :2.242 Mean :2.036 Mean :1.897 Mean :2.002
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_15 MSA_Q_16 MSA_Q_17 PH2_Q_01
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :2.000
## Mean :1.765 Mean :1.823 Mean :2.061 Mean :2.479
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## PH2_Q_02 TOS_Q_01 TOS_Q_02 TOS_Q_03
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:4.000
## Median :2.000 Median :2.000 Median :4.000 Median :4.000
## Mean :2.368 Mean :2.211 Mean :3.297 Mean :3.787
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :6.000 Max. :5.000 Max. :5.000 Max. :5.000
## TOS_Q_04 CHARLSONSCORE
## Min. :1.000 Min. :-9.0000
## 1st Qu.:5.000 1st Qu.: 0.0000
## Median :5.000 Median : 1.0000
## Mean :4.679 Mean : 0.9339
## 3rd Qu.:5.000 3rd Qu.: 1.0000
## Max. :6.000 Max. :10.0000
set.seed(1234)
qol_model <- C5.0(qol_train[ , -c(40, 41)], qol_train$cd)
qol_model
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Tree size: 5
##
## Non-standard options: attempt to group attributes
summary(qol_model)
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
##
##
## C5.0 [Release 2.07 GPL Edition] Sun Aug 15 18:12:09 2021
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 1771 cases (40 attributes) from undefined.data
##
## Decision tree:
##
## CHARLSONSCORE <= 0: minor_disease (645/169)
## CHARLSONSCORE > 0:
## :...CHARLSONSCORE > 5: minor_disease (29/9)
## CHARLSONSCORE <= 5:
## :...AGE > 47: severe_disease (936/352)
## AGE <= 47:
## :...MSA_Q_08 <= 1: minor_disease (133/46)
## MSA_Q_08 > 1: severe_disease (28/8)
##
##
## Evaluation on training data (1771 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 5 584(33.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 583 360 (a): class minor_disease
## 224 604 (b): class severe_disease
##
##
## Attribute usage:
##
## 100.00% CHARLSONSCORE
## 61.94% AGE
## 9.09% MSA_Q_08
##
##
## Time: 0.0 secs
# plot(qol_model, subtree = 17) # ?C50::plot.C5.0
The output of qol_model indicates that we have a tree with 5 terminal nodes. summary(qol_model) reports that the classification error of this decision tree on the training data is 33%.
Now we can make predictions using the decision tree that we just built. The predict() function we will use is the same as the one we showed in earlier chapters, e.g., Chapter 2 and Chapter 7. In general, predict() is extended by each specific type of regression, classification, clustering, or forecasting machine learning technique. For example, randomForest::predict.randomForest() is invoked by:
predict(RF_model, newdata, type="response", norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE, cutoff, ...),
where type represents the type of prediction output to be generated - "response" (equivalent to "class"), "prob", or "votes". Thus, the predicted values are either the predicted "response" class labels, a matrix of class probabilities, or vote counts.
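For the C5.0 model fitted above, the analogous options are type="class" (predicted labels) and type="prob" (class-probability estimates); see ?C50::predict.C5.0. For example:
# Predicted labels vs. class probabilities for the first few test cases
head(predict(qol_model, qol_test[ , -c(40, 41)], type = "class"))
head(predict(qol_model, qol_test[ , -c(40, 41)], type = "prob"))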
This time, we are going to introduce the confusionMatrix() function from the caret package as the evaluation method. When we combine it with the table() function, the output of the evaluation is very straightforward.
# See docs for predict: ?C50::predict.C5.0
# remove the last 2 columns (CHRONICDISEASESCORE and cd), which represent the clinical outcomes we are predicting
qol_pred <- predict(qol_model, qol_test[ , -c(40, 41)])
# install.packages("caret")
library(caret)
confusionMatrix(table(qol_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred minor_disease severe_disease
## minor_disease 143 71
## severe_disease 72 157
##
## Accuracy : 0.6772
## 95% CI : (0.6315, 0.7206)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 3.001e-12
##
## Kappa : 0.3538
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6651
## Specificity : 0.6886
## Pos Pred Value : 0.6682
## Neg Pred Value : 0.6856
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4831
## Balanced Accuracy : 0.6769
##
## 'Positive' Class : minor_disease
##
The confusion matrix shows a prediction accuracy of about 68%; however, this may vary (see the corresponding confidence interval).
Recall that the C5.0 function has an option trials=, an integer specifying the number of boosting iterations. The default value of one indicates that a single model is used, and we can specify a larger number of iterations, for instance trials=6.
set.seed(1234)
qol_boost6 <- C5.0(qol_train[ , -c(40, 41)], qol_train$cd, trials=6) # try alternative values for the trials option
qol_boost6
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd, trials = 6)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Number of boosting iterations: 6
## Average tree size: 4.7
##
## Non-standard options: attempt to group attributes
We can see that the average tree size shrank to about 4.7 (this may vary from run to run).
Since this is a fairly small tree, we can visualize it with the function plot(). We also use the option type="simple" to make the tree look more condensed. You can also zoom in and display just a specific branch of the tree by using subtree=.
plot(qol_boost6, type="simple") # C50::plot.C5.0
Caution: The plotting of decision trees may fail if you have columns that start with numbers or contain special characters (e.g., "5variable", "!variable"). In general, avoid spaces, special characters, and other non-alphanumeric symbols in column/row names.
The next step is to make predictions and test their accuracy.
qol_boost_pred6 <- predict(qol_boost6, qol_test[ , -c(40, 41)])
confusionMatrix(table(qol_boost_pred6, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_boost_pred6 minor_disease severe_disease
## minor_disease 145 69
## severe_disease 70 159
##
## Accuracy : 0.6862
## 95% CI : (0.6408, 0.7292)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 1.747e-13
##
## Kappa : 0.3718
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6744
## Specificity : 0.6974
## Pos Pred Value : 0.6776
## Neg Pred Value : 0.6943
## Prevalence : 0.4853
## Detection Rate : 0.3273
## Detection Prevalence : 0.4831
## Balanced Accuracy : 0.6859
##
## 'Positive' Class : minor_disease
##
The accuracy is about 69%. However, this may vary each time we run the experiment (mind the confidence interval). In some studies, the trials option provides a significant improvement to the overall accuracy. A good choice for this option is trials=10.
.
Suppose we want to reduce the false negative rate, which in this case means misclassifying a severe case as minor. A false negative (failure to detect a severe disease case) may be more costly than a false positive (misclassifying a minor disease case as severe). Misclassification errors can be expressed as a cost matrix:
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2)
error_cost
## [,1] [,2]
## [1,] 0 4
## [2,] 1 0
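Here the off-diagonal entries penalize the two error types asymmetrically: the 4 makes one error type four times as costly as the other. As a hedged readability sketch (commented out so it does not alter the pipeline below), we could attach the class levels as dimnames; confirm the expected predicted-vs-true row/column orientation in ?C50::C5.0 before relying on it:
# Optional: label the cost matrix with the class levels for readability
# (check ?C50::C5.0 for the expected row/column orientation)
# dimnames(error_cost) <- list(levels(qol_train$cd), levels(qol_train$cd))
# error_cost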
Let's build a decision tree with the option costs=error_cost.
.
set.seed(1234)
qol_cost <- C5.0(qol_train[ , -c(40, 41)], qol_train$cd, costs=error_cost)
qol_cost_pred <- predict(qol_cost, qol_test)
confusionMatrix(table(qol_cost_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_cost_pred minor_disease severe_disease
## minor_disease 73 14
## severe_disease 142 214
##
## Accuracy : 0.6479
## 95% CI : (0.6014, 0.6923)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 1.024e-08
##
## Kappa : 0.2829
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3395
## Specificity : 0.9386
## Pos Pred Value : 0.8391
## Neg Pred Value : 0.6011
## Prevalence : 0.4853
## Detection Rate : 0.1648
## Detection Prevalence : 0.1964
## Balanced Accuracy : 0.6391
##
## 'Positive' Class : minor_disease
##
Although the overall accuracy decreased, the number of false negatives was reduced from 71 (without specifying a cost matrix) to 14 (when specifying a non-trivial, loaded cost matrix). This comes at the cost of an increased rate of false-positive labeling (minor disease cases misclassified as severe).
There are multiple options for plotting trees fitted by rpart or C5.0.
library("rpart")
# remove CHRONICDISEASESCORE, but keep *cd* label
set.seed(1234)
qol_model <- rpart(cd~., data=qol_train[ , -40], cp=0.01)
# here we use rpart::cp = *complexity parameter* = 0.01
qol_model
## n= 1771
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1771 828 minor_disease (0.5324675 0.4675325)
## 2) CHARLSONSCORE< 0.5 645 169 minor_disease (0.7379845 0.2620155) *
## 3) CHARLSONSCORE>=0.5 1126 467 severe_disease (0.4147425 0.5852575)
## 6) AGE< 47.5 165 66 minor_disease (0.6000000 0.4000000) *
## 7) AGE>=47.5 961 368 severe_disease (0.3829344 0.6170656) *
You can also plot the tree directly using rpart.plot:
library(rpart.plot)
##
## Attaching package: 'rpart.plot'
## The following object is masked from 'package:rpart.utils':
##
## rpart.rules
rpart.plot(qol_model, type = 4,extra = 1,clip.right.labs = F)
We can use fancyRpartPlot to render a more colorful decision tree.
library("rattle")
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(qol_model, cex = 1, caption = "rattle::fancyRpartPlot (QoL Data)")
qol_pred <- predict(qol_model, qol_test, type = 'class')
confusionMatrix(table(qol_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred minor_disease severe_disease
## minor_disease 143 74
## severe_disease 72 154
##
## Accuracy : 0.6704
## 95% CI : (0.6245, 0.7141)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 2.276e-11
##
## Kappa : 0.3405
##
## Mcnemar's Test P-Value : 0.934
##
## Sensitivity : 0.6651
## Specificity : 0.6754
## Pos Pred Value : 0.6590
## Neg Pred Value : 0.6814
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4898
## Balanced Accuracy : 0.6703
##
## 'Positive' Class : minor_disease
##
These results are consistent with their counterparts reported using C5.0.
How can we tune the parameters to further improve the results?
set.seed(1234)
control = rpart.control(cp = 0.000, xval = 100, minsplit = 2)
qol_model = rpart(cd ~ ., data = qol_train[ , -40], control = control)
# plotcp(qol_model)
# printcp(qol_model)

data <- as.data.frame(qol_model$cptable)
data$CP <- as.factor(data$CP)
plot_ly(data = data, x = ~CP, y = ~xerror, type = 'scatter', mode='lines+markers',
name = 'Test', error_y = ~list(array = xstd, color = 'gray')) %>%
#add_segments(x=0, xend=0.23, y=0.76, yend = 0.76, line = list(dash = "dash"), showlegend=F) %>%
layout(title="Complexity Parameter vs. (CV) Error Rate")
Now we can prune the tree according to the optimal cp, the complexity parameter to which the rpart object will be trimmed. Instead of using the raw error (e.g., \(1-R^2\), RMSE) to capture the discrepancy between the observed and model-predicted labels, we will use xerror, which averages the discrepancy between observed and predicted classifications using cross-validation; see Chapter 20.
set.seed(1234)
selected_tr <- prune(qol_model, cp = qol_model$cptable[which.min(qol_model$cptable[ , "xerror"]), "CP"])
fancyRpartPlot(selected_tr, cex = 1, caption = "rattle::fancyRpartPlot (QoL Data)")

qol_pred_tune <- predict(selected_tr, qol_test, type = 'class')
confusionMatrix(table(qol_pred_tune, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred_tune minor_disease severe_disease
## minor_disease 143 74
## severe_disease 72 154
##
## Accuracy : 0.6704
## 95% CI : (0.6245, 0.7141)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 2.276e-11
##
## Kappa : 0.3405
##
## Mcnemar's Test P-Value : 0.934
##
## Sensitivity : 0.6651
## Specificity : 0.6754
## Pos Pred Value : 0.6590
## Neg Pred Value : 0.6814
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4898
## Balanced Accuracy : 0.6703
##
## 'Positive' Class : minor_disease
##
The result is roughly the same as that of C5.0. Although there is no substantial classification improvement, the tree-pruning process generates a graphical representation of the decision-making protocol (selected_tr) that is much simpler and more intuitive than the original (un-pruned) tree (qol_model):
fancyRpartPlot(qol_model, cex = 0.1, caption = "rattle::fancyRpartPlot (QoL Data)")
We can change the splitting criterion via parms=list(split=...); rpart supports split="information" (entropy-based) and split="gini". Experiment with these and compare the results.
set.seed(1234)
qol_model = rpart(cd ~ ., data=qol_train[ , -40], parms = list(split = "information")) # rpart accepts "information" (entropy) or "gini"
fancyRpartPlot(qol_model, cex = 1, caption = "rattle::fancyRpartPlot (QoL data)")
# Modify and test using "error" and "gini"
# qol_pred<-predict(qol_model, qol_test,type = 'class')
# confusionMatrix(table(qol_pred, qol_test$cd))
In addition to the classification trees we just saw, we can explore classification rules that utilize if-else logical statements to assign classes to unlabeled data. Below we review three classification rule strategies.
Separate-and-conquer repeatedly splits the data (and subsets of the data) by rules that cover a subset of examples. This procedure is very similar to the divide-and-conquer approach. However, a notable difference is that each rule can be independent, whereas each decision node in a tree is linked to past decisions.
To understand the One Rule (OneR) algorithm, we need to know about its "sibling", the ZeroR rule. The ZeroR rule assigns the mode (majority) class to unlabeled test observations regardless of their feature values. The One Rule (OneR) algorithm is an improved version of ZeroR that uses a single rule for classification. In other words, OneR splits the training dataset into several segments based on feature values. Then it assigns the modes of the classes in those segments to the related observations in the unlabeled test data. In practice, we first test multiple rules and pick the rule with the smallest error rate as our One Rule. Remember, the rules are somewhat subjective.
The Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm is a combination of the ideas behind decision trees and classification rules. It consists of a three-step process: (1) grow a rule, (2) prune it, and (3) optimize the resulting rule set.
Let’s take another look at the same dataset as Case Study 1 - this time applying classification rules. Naturally, we skip over the first two data handling steps and go directly to step three.
Let's start by using the OneR() function in the RWeka package. Before installing the package, you may want to check that the Java installation on your computer is up to date and that its version matches the version of R (i.e., 64-bit R needs 64-bit Java).
The function OneR() has the following components:
m <- OneR(class ~ predictors, data=mydata)
Using the shorthand formula y ~ . , we include all of the column variables as predictors.
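As a quick hedged illustration on a familiar dataset, applying OneR() to the iris data (seen at the start of this chapter) yields a single-feature rule, typically based on one of the petal measurements:
# OneR baseline on the iris data
library(RWeka)
iris_1R <- OneR(Species ~ ., data = iris)
iris_1R   # typically a single rule on one of the petal measurements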
instead, we include all of the column variables as predictors.# install.packages("RWeka")
library(RWeka)
# just remove the CHRONICDISEASESCORE but keep cd
set.seed(1234)
qol_1R <- OneR(cd~., data=qol[ , -40])
qol_1R
## CHARLSONSCORE:
## < -4.5 -> severe_disease
## < 0.5 -> minor_disease
## < 5.5 -> severe_disease
## < 8.5 -> minor_disease
## >= 8.5 -> severe_disease
## (1453/2214 instances correct)
Note that 1,453 out of 2,214 cases are correctly classified, \(66\%\), by the “one rule”.
summary(qol_1R)
##
## === Summary ===
##
## Correctly Classified Instances 1453 65.6278 %
## Incorrectly Classified Instances 761 34.3722 %
## Kappa statistic 0.3206
## Mean absolute error 0.3437
## Root mean squared error 0.5863
## Relative absolute error 68.8904 %
## Root relative squared error 117.3802 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 609 549 | a = minor_disease
## 212 844 | b = severe_disease
We obtain a rule that correctly classifies 66% of the patients, which is in line with the prior decision tree classification results. Due to algorithmic stochasticity, these results may vary from run to run, although we used set.seed(1234) to ensure some reproducibility.
Another possible option for classification rules is the RIPPER algorithm that we discussed earlier in the chapter. In R, we use the Java-based function JRip() to invoke this algorithm.
JRip() uses a similar invocation protocol as OneR():
m <- JRip(class ~ predictors, data=mydata)
set.seed(1234)
qol_jrip1 <- JRip(cd~., data=qol[ , -40])
qol_jrip1
## JRIP rules:
## ===========
##
## (CHARLSONSCORE >= 1) and (RACE_ETHNICITY >= 4) and (AGE >= 49) => cd=severe_disease (448.0/132.0)
## (CHARLSONSCORE >= 1) and (AGE >= 53) => cd=severe_disease (645.0/265.0)
## => cd=minor_disease (1121.0/360.0)
##
## Number of Rules : 3
summary(qol_jrip1)
##
## === Summary ===
##
## Correctly Classified Instances 1457 65.8085 %
## Incorrectly Classified Instances 757 34.1915 %
## Kappa statistic 0.3158
## Mean absolute error 0.4459
## Root mean squared error 0.4722
## Relative absolute error 89.3711 %
## Root relative squared error 94.5364 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 761 397 | a = minor_disease
## 360 696 | b = severe_disease
This JRip() classifier uses only 3 rules and has a comparable accuracy of 66%. As each individual has unique characteristics, classification in real-world data is rarely perfect (close to 100% accuracy).
Another idea is to generate many trees, let each tree predict, and finally ensemble those weighted votes into a combined classification result. This is precisely the idea behind random forest classification; see Chapter 14.
# install.packages("randomForest")
require(randomForest)
set.seed(12)
# rf.fit <- tuneRF(qol_train[ , -40], qol_train[ , 40], stepFactor=1.5)
rf.fit <- randomForest(cd ~ . , data=qol_train[ , -40], importance=TRUE, ntree=2000, mtry=26)
# varImpPlot(rf.fit, cex=0.5); print(rf.fit)
# type either 1 or 2, specifying the type of importance measure (1=mean decrease in accuracy, 2=mean decrease in node impurity)

# Accuracy
imp_rf.fit <- importance(rf.fit)
x <- rownames(imp_rf.fit)
y <- imp_rf.fit[,3]
plot_ly(x = ~y, y = ~reorder(x,y), name = "Var.Imp", type = "bar") %>%
layout(title="RF Variable Importance Plot (Accuracy)",
xaxis=list(title="Importance (mean decrease in accuracy)"),
yaxis = list(title="Variables (Ordered)"))
# Gini Index
y <- imp_rf.fit[,4]
plot_ly(x = ~y, y = ~reorder(x,y), name = "Var.Imp", type = "bar") %>%
layout(title="RF Variable Importance Plot (Gini)",
xaxis=list(title="Importance (Gini mean decrease)"),
yaxis = list(title="Variables (Ordered)"))
rf.fit1 <- randomForest(cd~. , data=qol_train[ , -40], importance=TRUE, ntree=2000, mtry=26)
rf.fit2 <- randomForest(cd~. , data=qol_train[ , -40], importance=TRUE, nodesize=5, ntree=5000, mtry=26)
# plot(rf.fit,log="x",main="MSE Errors vs. Iterations: Three Models rf.fit, rf.fit1, rf.fit2")
# points(1:5000, rf.fit1$mse, col="red", type="l")
# points(1:5000, rf.fit2$mse, col="green", type="l")
# legend("topright", col=c("black", "red", "green"), lwd=c(2, 2, 2), legend=c("rf.fit", "rf.fit1", "rf.fit2"))
plot_ly(x = c(1:1000), y = rf.fit$err.rate[1:1000,1], name = 'OOB (rf.fit)', type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = c(1:1000), y = rf.fit1$err.rate[1:1000,1], name = 'OOB (rf.fit1)', type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = c(1:1000), y = rf.fit2$err.rate[1:1000,1], name = 'OOB (rf.fit2)', type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = "Out Of Bag Error Rates for 3 RF Models",
xaxis = list(title = "Number of Trees"),
yaxis = list (title = "OOB Error"))
qol_pred2 <- predict(rf.fit2, qol_test, type = 'class')
confusionMatrix(table(qol_pred2, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred2 minor_disease severe_disease
## minor_disease 143 72
## severe_disease 72 156
##
## Accuracy : 0.6749
## 95% CI : (0.6291, 0.7184)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 5.956e-12
##
## Kappa : 0.3493
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6651
## Specificity : 0.6842
## Pos Pred Value : 0.6651
## Neg Pred Value : 0.6842
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4853
## Balanced Accuracy : 0.6747
##
## 'Positive' Class : minor_disease
##
These variable importance plots (cf. randomForest::varImpPlot()) show the rank order of feature importance according to the specific index (mean decrease in accuracy, left, and Gini, right). More information about random forests is available in Chapter 14: Improving Model Performance.
We can also use the randomForest::partialPlot() method to examine the relative impact of the top salient features (continuous or binary variables). For each feature, this method works by fixing the feature at each value in its observed range, substituting that value across all training observations, averaging the resulting model predictions, and plotting the averaged prediction against the feature value.
Note that for binary features, partialPlot() yields only two values of the corresponding mass function, whereas for continuous variables it generates the marginal effect of the variable on the class probability (for classification problems) or on the continuous response (for regression problems). Using the QoL case study (Case06_QoL_Symptom_ChronicIllness.csv), this example shows the partial plots of the top six rank-ordered features chosen by the random forest model.
imp <- randomForest::importance(rf.fit)
impvar <- rownames(imp)[order(imp[, 1], decreasing=TRUE)]
# op <- par(mfrow=c(2, 3))
# for (i in 1:6) { # use seq_along(impvar) to plot the marginal probabilities for all features
#   partialPlot(rf.fit, qol_train, impvar[i], xlab=impvar[i],
#               main=paste("Partial Dependence of 'CD'\n on ", impvar[i]))
# }

p <- vector("list", 6)
for (i in 1:6) { # use seq_along(impvar) to plot the marginal probabilities for all features
  p[[i]] <- partialPlot(rf.fit, qol_train, impvar[i], xlab=impvar[i],
                        main=paste("Partial Dependence of 'CD'\n on ", impvar[i]), plot=F)
}
plot_ly(x = p[[1]]$x, y = p[[1]]$y, name = impvar[1], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = paste0("CD Partial Dependence on ",impvar[1]),
xaxis = list(title = impvar[1]), yaxis = list (title = "Impact"))
plot_ly(x = p[[2]]$x, y = p[[2]]$y, name = impvar[2], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = paste0("CD Partial Dependence on ",impvar[2]),
xaxis = list(title = impvar[2]), yaxis = list (title = "Impact"))
plot_ly(x = p[[3]]$x, y = p[[3]]$y, name = impvar[3], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = p[[4]]$x, y = p[[4]]$y, name = impvar[4], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = p[[5]]$x, y = p[[5]]$y, name = impvar[5], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = p[[6]]$x, y = p[[6]]$y, name = impvar[6], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = paste0("CD Partial Dependence on ",impvar[3], " ", impvar[4], " ",impvar[5], " ",impvar[6]),
xaxis = list(title = "MSA Questions"), yaxis = list (title = "Impact"))
In random forest (RF) classification, the node size (nodesize) refers to the smallest node that can be split; i.e., nodes with fewer cases than the nodesize are never subdivided. Increasing the node size leads to smaller trees, which may compromise predictive power. On the flip side, increasing the tree size (maxnodes) and the number of trees (ntree) tends to increase the predictive accuracy. However, there are tradeoffs between increasing node size and tree size simultaneously. To optimize the RF predictive accuracy, try smaller node sizes and more trees. Ensembling (forest) results from a larger number of trees will likely generate better results. A hedged tuning sketch follows.
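Below is a minimal sketch (our own grid, not part of the original analysis) comparing a few nodesize settings via the out-of-bag (OOB) error; the exact error rates will vary from run to run:
# Compare several nodesize settings via the final OOB error rate
set.seed(12)
for (ns in c(1, 5, 20)) {
  fit <- randomForest(cd ~ ., data = qol_train[ , -40], ntree = 1000, nodesize = ns)
  cat("nodesize =", ns, " OOB error =", round(fit$err.rate[1000, 1], 3), "\n")
}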
Another decision tree technique we can test is Extra-Trees (extremely randomized trees). Extra-Trees are similar to Random Forests except that, at each node, Random Forest chooses the best cutting threshold for each candidate feature, whereas Extra-Trees draw cut points uniformly at random and then split on the candidate feature achieving the biggest gain (best classification score). The R extraTrees package provides one implementation of Extra-Trees in which the single random cut is extended to several random cuts per feature. This extension improves performance, as it reduces the probability of making very poor cuts while maintaining the designed extra-tree stochasticity. Below, we compare the results of four alternative classification methods - random forest, SVM, kNN, and Extra-Trees.
library(caret)
library(extraTrees)
# For direct Extra-Tree Call
# install.packages("extraTrees")
# library(extraTrees)
# set.seed(12)
# ## Request more memory to JVM (e.g., 5GB
# options( java.parameters = "-Xmx5g" )
# library(extraTrees)
#
# system.time({
# # call extraTrees with 3 CPU cores/threads and track exec-time
# et.fit3 <- extraTrees(qol_train[ , -c(40,41)], qol_train[ , 41],
# numThreads=3, ntree=5000, mtry=26)
# })
# qol_pred3 <- predict(et.fit3, qol_test[ , -c(40,41)])
# confusionMatrix(table(qol_pred3, qol_test$cd))
# qol_pred<-predict(rf.fit, qol_test, type = 'class')
# qol_pred1<-predict(rf.fit1, qol_test, type = 'class')
# qol_pred2<-predict(rf.fit2, qol_test, type = 'class')
control <- trainControl(method="repeatedcv", number=10, repeats=3)

## Run all subsequent models in parallel
library(doParallel)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
system.time({
  rf.fit  <- train(cd~., data=qol_test[ , -40], method="rf", trControl=control);
  knn.fit <- train(cd~., data=qol_test[ , -40], method="knn", trControl=control);
  svm.fit <- train(cd~., data=qol_test[ , -40], method="svmRadial", trControl=control);
  et.fit  <- train(cd~., data=qol_test[ , -40], method="extraTrees", trControl=control)
})
## user system elapsed
## 4.07 0.09 100.50
stopCluster(cl) # close multi-core cluster
results <- resamples(list(RF=rf.fit, kNN=knn.fit, SVM=svm.fit, ET=et.fit))
# summary of model differences
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, kNN, SVM, ET
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.5000000 0.6000000 0.6363636 0.6364525 0.6780303 0.7777778 0
## kNN 0.3863636 0.5027778 0.5681818 0.5589785 0.6113901 0.7045455 0
## SVM 0.4418605 0.5681818 0.6000000 0.5987902 0.6243393 0.7272727 0
## ET 0.5000000 0.5909091 0.6363636 0.6290083 0.6724806 0.7777778 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.006160164 0.19999942 0.2673909 0.2707204 0.3558133 0.5535714 0
## kNN -0.222222222 0.00591716 0.1363636 0.1166633 0.2223136 0.3991597 0
## SVM -0.119305857 0.13367597 0.2019697 0.1971070 0.2471653 0.4579055 0
## ET 0.000000000 0.17671518 0.2697064 0.2557297 0.3429830 0.5535714 0
# plot summaries
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(results, scales=scales) # Box plots of accuracy & Kappa
densityplot(results, scales=scales, pch = "|") # Density plots of accuracy
dotplot(results, scales=scales) # Dot plots of accuracy & Kappa
splom(results) # contrast pair-wise model scatterplots of prediction accuracy
In the previous case study, we classified the CHRONICDISEASESCORE into two groups. What will happen if we use three groups? Let's separate CHRONICDISEASESCORE into three categories. Recall the quantile() function we discussed in Chapter 2; we can use it to obtain cut-points for the classification (note that the loop below uses the cut-points 0.7 and 2.2 rather than the exact tertiles). Then a for loop will split the variable CHRONICDISEASESCORE into three categories.
quantile(qol$CHRONICDISEASESCORE, probs = c(1/3, 2/3))
## 33.33333% 66.66667%
## 1.06 1.80
for (i in 1:2214) {
  if (qol$CHRONICDISEASESCORE[i]>0.7 & qol$CHRONICDISEASESCORE[i]<2.2) {
    qol$cdthree[i] = 2
  } else if (qol$CHRONICDISEASESCORE[i]>=2.2) {
    qol$cdthree[i] = 3
  } else {
    qol$cdthree[i] = 1
  }
}
qol$cdthree <- factor(qol$cdthree, levels=c(1, 2, 3), labels = c("minor_disease", "mild_disease", "severe_disease"))
table(qol$cdthree)
##
## minor_disease mild_disease severe_disease
## 379 1431 404
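As an aside, the same three-way labeling can be obtained more compactly (without hard-coding the sample size) with cut(); this sketch uses the same 0.7 and 2.2 cut-points, though scores exactly equal to a cut-point may be binned slightly differently than in the loop above:
# Vectorized alternative to the for-loop above
cdthree_alt <- cut(qol$CHRONICDISEASESCORE, breaks = c(-Inf, 0.7, 2.2, Inf),
                   labels = c("minor_disease", "mild_disease", "severe_disease"))
table(cdthree_alt)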
After labeling the three categories in the new variable cdthree, our job of preparing the class variable is done. Let's follow the earlier sections of the chapter to see how well the tree and rule classifiers perform in the three-category case. First, let's build a tree classifier using C5.0() with 10 boosting trials. One small tip: the training dataset cannot include column 40 (CHRONICDISEASESCORE), column 41 (cd), or now column 42 (cdthree), because they all contain class-outcome-related variables.
# qol_train1<-qol[1:2114, ]
# qol_test1<-qol[2115:2214, ]
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train1 <- qol[train_index, ]
qol_test1 <- qol[-train_index, ]
prop.table(table(qol_train1$cdthree))
##
## minor_disease mild_disease severe_disease
## 0.1829475 0.6295878 0.1874647
prop.table(table(qol_test1$cdthree))
##
## minor_disease mild_disease severe_disease
## 0.1241535 0.7133183 0.1625282
set.seed(1234)
qol_model1 <- C5.0(qol_train1[ , -c(40, 41, 42)], qol_train1$cdthree, trials=10)
qol_model1
##
## Call:
## C5.0.default(x = qol_train1[, -c(40, 41, 42)], y = qol_train1$cdthree, trials
## = 10)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Number of boosting iterations: 10
## Average tree size: 238.5
##
## Non-standard options: attempt to group attributes
qol_pred1 <- predict(qol_model1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 6 46 3
## mild_disease 26 267 23
## severe_disease 1 60 11
##
## Overall Statistics
##
## Accuracy : 0.6411
## 95% CI : (0.5945, 0.6858)
## No Information Rate : 0.842
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0469
##
## Mcnemar's Test P-Value : 3.943e-05
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity 0.18182 0.7158
## Specificity 0.88049 0.3000
## Pos Pred Value 0.10909 0.8449
## Neg Pred Value 0.93041 0.1654
## Prevalence 0.07449 0.8420
## Detection Rate 0.01354 0.6027
## Detection Prevalence 0.12415 0.7133
## Balanced Accuracy 0.53115 0.5079
## Class: severe_disease
## Sensitivity 0.29730
## Specificity 0.84975
## Pos Pred Value 0.15278
## Neg Pred Value 0.92992
## Prevalence 0.08352
## Detection Rate 0.02483
## Detection Prevalence 0.16253
## Balanced Accuracy 0.57353
We can see that the prediction accuracy with three categories (64%) is notably lower than what we obtained with two categories.
Second, let's build a rule classifier with OneR().
.
set.seed(1234)
qol_1R1 <- OneR(cdthree~., data=qol[ , -c(40, 41)])
qol_1R1
## INTERVIEWDATE:
## < 3.5 -> mild_disease
## < 28.5 -> severe_disease
## < 282.0 -> mild_disease
## < 311.5 -> severe_disease
## >= 311.5 -> mild_disease
## (1436/2214 instances correct)
summary(qol_1R1)
##
## === Summary ===
##
## Correctly Classified Instances 1436 64.86 %
## Incorrectly Classified Instances 778 35.14 %
## Kappa statistic 0.022
## Mean absolute error 0.2343
## Root mean squared error 0.484
## Relative absolute error 67.5977 %
## Root relative squared error 116.2958 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 0 375 4 | a = minor_disease
## 0 1422 9 | b = mild_disease
## 0 390 14 | c = severe_disease
qol_pred1 <- predict(qol_1R1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 0 55 0
## mild_disease 0 315 1
## severe_disease 0 71 1
##
## Overall Statistics
##
## Accuracy : 0.7133
## 95% CI : (0.6688, 0.755)
## No Information Rate : 0.9955
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0086
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity NA 0.714286
## Specificity 0.8758 0.500000
## Pos Pred Value NA 0.996835
## Neg Pred Value NA 0.007874
## Prevalence 0.0000 0.995485
## Detection Rate 0.0000 0.711061
## Detection Prevalence 0.1242 0.713318
## Balanced Accuracy NA 0.607143
## Class: severe_disease
## Sensitivity 0.500000
## Specificity 0.839002
## Pos Pred Value 0.013889
## Neg Pred Value 0.997305
## Prevalence 0.004515
## Detection Rate 0.002257
## Detection Prevalence 0.162528
## Balanced Accuracy 0.669501
This OneR classifier is based purely on the value of INTERVIEWDATE. Its internal classification accuracy is 65%, while the external (validation data) prediction accuracy is 71%. However, the latter assessment is somewhat misleading, as the vast majority of the external validation cases are assigned to a single class - mild_disease.
Finally, let's revisit the JRip() classifier using the same three class labels according to cdthree.
set.seed(1234)
qol_jrip1 <- JRip(cdthree~., data=qol[ , -c(40, 41)])
qol_jrip1
## JRIP rules:
## ===========
##
## (CHARLSONSCORE <= 0) and (AGE <= 50) and (MSA_Q_06 <= 1) and (QOL_Q_07 >= 1) and (MSA_Q_09 <= 1) => cdthree=minor_disease (35.0/11.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_10 >= 4) and (QOL_Q_07 >= 9) => cdthree=severe_disease (54.0/20.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 5) and (MSA_Q_09 <= 4) and (MSA_Q_04 >= 3) => cdthree=severe_disease (64.0/30.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 4) and (PH2_Q_01 >= 3) and (QOL_Q_10 >= 4) and (RACE_ETHNICITY >= 4) => cdthree=severe_disease (43.0/19.0)
## => cdthree=mild_disease (2018.0/653.0)
##
## Number of Rules : 5
summary(qol_jrip1)
##
## === Summary ===
##
## Correctly Classified Instances 1481 66.8925 %
## Incorrectly Classified Instances 733 33.1075 %
## Kappa statistic 0.1616
## Mean absolute error 0.3288
## Root mean squared error 0.4055
## Relative absolute error 94.8702 %
## Root relative squared error 97.42 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 24 342 13 | a = minor_disease
## 10 1365 56 | b = mild_disease
## 1 311 92 | c = severe_disease
qol_pred1 <- predict(qol_jrip1, qol_test1)
confusionMatrix(table(qol_test1$cdthree, qol_pred1))
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 4 49 2
## mild_disease 3 299 14
## severe_disease 0 54 18
##
## Overall Statistics
##
## Accuracy : 0.7246
## 95% CI : (0.6805, 0.7657)
## No Information Rate : 0.9074
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1859
##
## Mcnemar's Test P-Value : 2.748e-14
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity 0.571429 0.7438
## Specificity 0.883028 0.5854
## Pos Pred Value 0.072727 0.9462
## Neg Pred Value 0.992268 0.1890
## Prevalence 0.015801 0.9074
## Detection Rate 0.009029 0.6749
## Detection Prevalence 0.124153 0.7133
## Balanced Accuracy 0.727228 0.6646
## Class: severe_disease
## Sensitivity 0.52941
## Specificity 0.86797
## Pos Pred Value 0.25000
## Neg Pred Value 0.95687
## Prevalence 0.07675
## Detection Rate 0.04063
## Detection Prevalence 0.16253
## Balanced Accuracy 0.69869
In terms of predictive accuracy on the testing data (qol_test1$cdthree), we can see from these outputs that the RIPPER algorithm performed best (72%), compared to the C5.0 decision tree (64%) and the OneR algorithm (71%). This suggests that simple algorithms may outperform complex methods for certain real-world case studies. Later, in Chapter 14, we will provide more details about optimizing and improving classification and prediction performance.
Try these techniques with other data from the list of our Case-Studies.