In this chapter we will present various progressively supervised machine learning, classification and clustering techniques. There are two categories of machine learning techniques - unsupervised and supervised (human-guided). In general, supervised classification methods aim to identify or predict predefined classes and label new objects as members of specific classes. Whereas unsupervised clustering approaches attempt to group objects into subsets, without knowing a priori labels, and determine relationships between objects.

In the context of machine learning and artificial intelligence, classification and clustering apply mostly for supervised and unsupervised problems, respectively.

Unsupervised classification refers to methods where the outcomes (groupings with common characteristics) are automatically derived based on intrinsic affinities and associations in the data without human indication of clustering. Unsupervised learning is purely based on input data ($X$) without corresponding output labels. The goal is to model the underlying structure, affinities, or distribution in the data in order to learn more about its intrinsic characteristics. It is called unsupervised learning because there are no a priori correct answers and there is no human guidance. Algorithms are left to their own devices to discover and present the interesting structure in the data. Clustering (discover the inherent groupings in the data) and association (discover association rules that describe the data) represent the core unsupervised learning problems. The k-means clustering and the Apriori association rule provide solutions to unsupervised learning problems.

Unsupervised Clustering Approaches
Bayesian		Hierarchical		Partitioning Based
Decision Based	Non-parametric	Divisive (Top-Down)	Agglomerative (Bottom-Up)	Spectral	K-Means / Centroid	Graph-Theoretic	Model Based
Bayesian Classifier has high computational requirements. As there are a priori given labels, some prior is needed/specified to train the classifier, which is in turn used to label new data. Then, the newly labeled samples are subsequently used to train a new (supervised) classifier, i.e., decision-directed unsupervised learning. If the initial classifier is not appropriate, the process may diverge	Chinese restaurant process (CRP), infinite Hidden Markov Model (HMM)	Principal Direction Divisive Partitioning (PDDP)	Start with each case in a separate cluster and repeatedly join the closest pairs into clusters, until a stopping criterion is matched: (1) there is only a single cluster for all cases, (2) predetermined number of clusters is reached, (3) the distance between the closest clusters exceeds a threshold	SpecC	kmeans, hkmeans	Growing Neural Gas	mclust

Supervised classification methods utilize user provided labels representative of specific classes associated with concrete observations, cases or units. These training classes/outcomes are used as references for the classification. Many problems can be addressed by decision-support systems utilizing combinations of supervised and unsupervised classification processes. Supervised learning involves input variables ($X$) and an outcome variable ($Y$) to learn mapping functions from the input to the output: $Y = f(X)$. The goal is to approximate the mapping function so that when it is applied to new (validation) data ($Z$) it (accurately) predicts the (expected) outcome variables ($Y$). It is called supervised learning because the learning process is supervised by initial training labels guiding and correcting the learning until the algorithm achieves an acceptable level of performance.

Regression (output variable is a real value) and classification (output variable is a category) problems represent the two types of supervised learning. Examples of supervised machine learning algorithms include Linear regression and Random forest, both provide solutions for regression-type problems, but Random forest also provides solutions to classification problems.

Just like categorization of exploratory data analytics, Chapter 3, is challenging, so is systematic codification of machine learning techniques. The table below attempts to provide a rough representation of common machine learning methods. However, it is not really intended to be a gold-standard protocol for choosing the best analytical method. Before you settle on a specific strategy for data analysis, you should always review the data characteristics in light of the assumptions of each technique and assess the potential to gain new knowledge or extract valid information from applying a specific technique.

Inference	Outcome	Supervised	Unsupervised
Classification & Prediction	Binary	Classification-Rules, OneR, kNN, NaiveBayes, Decision-Tree, C5.0, AdaBoost, XGBoost, LDA/QDA, Logit/Poisson, SVM	Apriori, Association-Rules, k-Means, NaiveBayes
Classification & Prediction	Categorical	Regression Modeling & Forecasting	Apriori, Association-Rules, k-Means, NaiveBayes
Regression Modeling	Real Quantitative	(MLR) Regression Modeling, LDA/QDA, SVM, Decision-Tree, NeuralNet	Regression Modeling Tree, Apriori/Association-Rules

Many of these will be discussed in later chapters. In this chapter, we will present step-by-step the k-nearest neighbor (kNN) algorithm. Specifically, we will demonstrate (1) data retrieval and normalization, (2) splitting the data into training and testing sets, (3) fitting models on the training data, (4) evaluating model performance on testing data, (5) improving model performance, and (6) determining optimal values of $k$.

In Chapter 9, we will present detailed strategies, and evaluation metrics, to assess the performance of all clustering and classification methods.

1 k-Nearest Neighbor Approach

Classification tasks could be very difficult for a large number of complex and heterogeneous features and target classes. In those scenarios, where the items of similar class type tend to be homogeneous, nearest neighbor classification may be appropriate because assigning unlabeled cases to their most similarly labeled neighbors may be fairly easy to accomplish.

Such classification methods can help us understand the story behind the unlabeled data using known data and avoiding analyzing those complicated features and target classes. This is because these techniques have no prior distribution assumptions. However, this non-parametric approach makes the methods rely heavy on the training instances, which explains their lazy algorithms designation.

The KNN algorithm involves the following steps:

Create a training dataset that has classified examples labeled by nominal variables and different features in ordinal or numerical variables.
Create a testing dataset containing unlabeled examples with similar features as the training data.
Given a predetermined number $k$, match each test case with the $k$ closest training records that are “nearest” to the test case, according to a certain similarity or distance measure.
Assign a test case class label according to the majority vote of the $k$ nearest training cases.

Mathematically, for a given $k$, a specific similarity metric $d$, and a new testing case $x$, the kNN classifier performs two steps ($k$ is typically odd to avoid ties):

Runs through the whole training dataset ($y$) computing $d(x,y)$. Let $A$ represent the $k$ closest points to $x$ in the training data $y$.
Estimates the conditional probability for each class, which corresponds to the fraction of points in $A$ with that given class label. If $I(z)$ is an indicator function $I(z) = \begin{cases} 1 & z= true \\ 0 & otherwise \end{cases}$, then the testing data input $x$ gets assigned to the class with the largest probability, $P(y=j|X=x)$:

\[P(y=j|X=x) =\frac{1}{k} \sum_{i\in A}{I(y^{(i)}=j)}.\]

1.1 Distance Function and Dummy coding

How to measure the similarity between records? We can think of similarity measures as the distance metrics between the two records or cases. There are many distance functions to choose from. Traditionally, we use Euclidean distance as our similarity metric.

If we use a line to connect the two points representing the testing and the training records in $n$ dimensional space, the length of the line is the Euclidean distance. If $a, b$ both have $n$ features, the coordinates for them are $(a_1, a_2, \cdots, a_n)$ and $(b_1, b_2, \cdots, b_n)$. Our distance could be:

\[dist(a, b)=\sqrt{(a_1-b_1)^2+(a_2-b_2)^2+\cdots+(a_n-b_n)^2}.\]

When we have nominal features, it requires a little trick to apply the Euclidean distance formula. We could create dummy variables as indicators of all the nominal feature levels. The dummy variable would equal one when we have the feature and zero otherwise. We show two examples:

\[Gender= \left\{ \begin{array}{ll} 0 & X=male \\ 1 & X=female \\ \end{array} \right. \]

\[Cold= \left\{ \begin{array}{ll} 0 & Temp \geq 37F \\ 1 & Temp < 37F \\ \end{array} \right. \]

This allows only binary expressions. If we have multiple nominal categories, just make each one as a dummy variable and apply the Euclidean distance.

1.2 Estimation of the hyperparameter k

The parameter k could not be too large or too small. If our k is too large, the test record tends to be classified as the most popular class in the training records, rather than the most similar one. On the other hand, if the k is too small, outliers, noise, or mislabeled training data cases might lead to errors in predictions.

The common practice is to calculate the square root of the number of training cases and use that number as (initial) estimate of k.

A more robust way would be to choose several k values and select the one with the optimal (best) classifying performance.

1.3 Rescaling of the features

Different features might have different scales. For example, we can have a measure of pain scaling from 1 to 10 or 1 to 100. Some similarity or distance measures assume the same measuring unit in all feature dimensions. This requires that the data may need to be transferred into the same scale. Re-scaling can make each feature contribute to the distance in a relatively equal manner, avoiding potential bias.

1.4 Rescaling Formulas

There are many alternative strategies to rescale the data.

1.4.1 min-max normalization

\[X_{new}=\frac{X-min(X)}{max(X)-min(X)}.\]

After re-scaling the data, $X_{new}$ would range from 0 to 1, representing the distance between each value and its minimum as a percentage. Larger values indicate further distance from the minimum, 100% means that the value is at the maximum.

1.4.2 z-score standardization

\[X_{new}=\frac{X-\mu}{\sigma}=\frac{X-Mean(X)}{SD(X)}\]

This is based on the properties of normal distribution that we have talked about in Chapter 2. After z-score standardization, the re-scaled feature will have unbounded range. This is different from the min-max normalization which always has a finite range from 0 to 1. Following z-score standardization, the transformed features are uniteless and may resemble standard normal distribution.

1.5 Case Study: Youth Development

1.5.1 Step 1: Collecting Data

The data we are using for this case study is the “Boys Town Study of Youth Development”, which is the second case study, CaseStudy02_Boystown_Data.csv.

Variables:

ID: Case subject identifier
Sex: dichotomous variable (1=male, 2=female)
GPA: Interval-level variable with range of 0-5 (0-“A” average, 1- “B” average, 2- “C” average, 3- “D” average, 4-“E”, 5-“F”“)
Alcohol use: Interval level variable from 0-11 (drink everyday - never drinked)
Attitudes on drinking in the household: Alcatt - Interval level variable from 0-6 (totally approve - totally disapprove)
DadJob: 1-yes, dad has a job and 2- no
MomJob: 1-yes and 2-no
Parent closeness (example: In your opinion, does your mother make you feel close to her?)
- Dadclose: Interval level variable 0-7 (usually-never)
- Momclose: interval level variable 0-7 (usually-never)
Delinquency:
- larceny (how many times have you taken things >$50?): Interval level data 0-4 (never - many times),
- vandalism: Interval level data 0-7 (never - many times)

1.5.2 Step 2: Exploring and preparing the data

First, we need to load and do some data manipulation. We are using the Euclidean distance so dummy variables (one hot encoding) should be used. The following codes transferred sex, dadjob and momjob into dummy variables.

library(class)
library(gmodels)
boystown<-read.csv("https://umich.instructure.com/files/399119/download?download_frd=1", sep=" ")
boystown$sex<-boystown$sex-1
boystown$dadjob <- -1*(boystown$dadjob-2)
boystown$momjob <- -1*(boystown$momjob-2)
str(boystown)

## 'data.frame':    200 obs. of  11 variables:
##  $ id        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex       : num  0 0 0 0 1 1 0 0 1 1 ...
##  $ gpa       : int  5 0 3 2 3 3 1 5 1 3 ...
##  $ Alcoholuse: int  2 4 2 2 6 3 2 6 5 2 ...
##  $ alcatt    : int  3 2 3 1 2 0 0 3 0 1 ...
##  $ dadjob    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ momjob    : num  0 0 0 0 1 0 0 0 1 1 ...
##  $ dadclose  : int  1 3 2 1 2 1 3 6 3 1 ...
##  $ momclose  : int  1 4 2 2 1 2 1 2 3 2 ...
##  $ larceny   : int  1 0 0 3 1 0 0 0 1 1 ...
##  $ vandalism : int  3 0 2 2 2 0 5 1 4 0 ...

The str() function tells us that we have 200 observations and 11 variables. However, the ID variable is not important in this case study so we can delete it. In this case-study, we can focus on academic performance, GPA, recidivism, vandalism and larceny, alcohol use, or other outcome variables. One concrete example involves trying to predict recidivism and use knn to classify participants in two categories.

Let’s focus on a specific outcome variable representing two or more infractions of vandalism and larceny, which can be considered as “Recidivism”. Participants with one or no infractions can be labeled as “Controls”. First we can use PCA to explore the data and then construct the new derived recidivism variable and split the data into training and testing sets.

# GPA study
# boystown <- boystown[, -1]
# table(boystown$gpa)
# boystown$grade <- boystown$gpa %in% c(3, 4, 5)    # GPA: (1-“A” average, 2-“B” average, 3-“C” average, 4+ below “C” average)
# boystown$grade <- factor(boystown$grade, levels=c(F, T), labels = c("above_avg", "avg_or_below"))
# table(boystown$grade)

# You can also try with alternative outcomes, e.g., alcohol use
# Outcome Alcohol Use 6 categories (1- everyday, 2- once or twice/wk, 3- once or twice/month, 4- less than once or twice per month, 5- once or twice, 6- never)
# Y <- as.factor(ifelse(boystown$Alcoholuse > 3, "1"Heavy Alcohol Use", "0"low Alcohol Use)); table(Y)
# Y <- as.factor(boystown$Alcoholuse); table(Y)
# X <- boystown[, -c(1, 4)]

# First explore the data by running a PCA
rawData <- boystown[ , -1]; head(rawData)

##   sex gpa Alcoholuse alcatt dadjob momjob dadclose momclose larceny vandalism
## 1   0   5          2      3      1      0        1        1       1         3
## 2   0   0          4      2      1      0        3        4       0         0
## 3   0   3          2      3      1      0        2        2       0         2
## 4   0   2          2      1      1      0        1        2       3         2
## 5   1   3          6      2      1      1        2        1       1         2
## 6   1   3          3      0      1      0        1        2       0         0

pca1 <- prcomp(as.matrix(rawData), center = T)
summary(pca1)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6     PC7
## Standard deviation     1.9264 1.6267 1.4556 1.3887 1.3023 1.2495 0.95092
## Proportion of Variance 0.2463 0.1756 0.1406 0.1280 0.1126 0.1036 0.06001
## Cumulative Proportion  0.2463 0.4219 0.5625 0.6905 0.8031 0.9067 0.96672
##                            PC8     PC9    PC10
## Standard deviation     0.46931 0.44030 0.29546
## Proportion of Variance 0.01462 0.01287 0.00579
## Cumulative Proportion  0.98134 0.99421 1.00000

pca1$rotation

##                     PC1          PC2          PC3          PC4         PC5
## sex         0.017329782 -0.041942445  0.038644169  0.028574405  0.01345929
## gpa         0.050811919  0.251727224 -0.552866390  0.482012801  0.12517612
## Alcoholuse -0.955960402 -0.171043716 -0.004495945  0.207154515 -0.09577419
## alcatt     -0.087619514  0.345341205 -0.639579077 -0.330797550 -0.46933240
## dadjob      0.007936535 -0.001870547 -0.017028011  0.006888380  0.01469275
## momjob      0.030397552 -0.003518869  0.008763667 -0.001482878  0.02342974
## dadclose    0.052947782 -0.689075136 -0.246264250 -0.452876476 -0.21528193
## momclose   -0.055904765 -0.273643927 -0.435005164 -0.092086033  0.75752753
## larceny     0.063455041 -0.005344755 -0.108928362  0.033669839  0.07405310
## vandalism  -0.254240056  0.486423422  0.147159970 -0.632255411  0.35813577
##                     PC6         PC7          PC8          PC9         PC10
## sex        -0.026402404  0.02939770  0.996267983 -0.036561664 -0.001236127
## gpa         0.599548626 -0.13874602  0.035498809 -0.004136514  0.019965416
## Alcoholuse  0.015534060  0.05830281  0.002438336 -0.032626941 -0.008177261
## alcatt     -0.362724708  0.01546912  0.045913748 -0.019571649  0.001150003
## dadjob     -0.001537909 -0.04457354 -0.001761749 -0.050310596 -0.997425089
## momjob     -0.006536869 -0.01490905 -0.037550689 -0.997051461  0.051468175
## dadclose    0.456556766 -0.04166384  0.008660225 -0.005147468  0.001021197
## momclose   -0.381040880 -0.06639096 -0.008710480  0.018261764  0.020666371
## larceny     0.084351462  0.98381408 -0.026413939 -0.013630662 -0.039663208
## vandalism   0.383715180 -0.00198434  0.042593385 -0.003164270 -0.004956997

Y <- ifelse (boystown$vandalism + boystown$larceny > 1, "Recidivism", "Control")  # more than 1 vandalism or larceny conviction
X <- boystown[, -c(1, 10, 11)]  # covariate set excludes these 3 columns index/ID, vandalism, and larceny

boystown_z <- as.data.frame(lapply(X, scale))
bt_train <- boystown_z[1:150, ]
bt_test  <- boystown_z[151:200, ]
bt_train_labels <- Y[1:150]  
bt_test_labels  <- Y[151:200]
table(bt_train_labels)

## bt_train_labels
##    Control Recidivism 
##         46        104

Let’s look at the proportions for the two categories in both the training and testing sets.

round(prop.table(table(bt_train_labels)), digits=2)

## bt_train_labels
##    Control Recidivism 
##       0.31       0.69

round(prop.table(table(bt_test_labels)), digits=2)

## bt_test_labels
##    Control Recidivism 
##        0.2        0.8

We can see that most of the participants have incidents related to recidivism (~70-80%).

The remaining 8 features use different measuring scales. If we use these features directly, variables with larger scale may have a greater impact on the classification performance. Therefore, re-scaling may be useful to level the playing field.

1.5.3 Step 3: Normalizing Data

First let’s create a function of our own using the min-max normalization formula. We can check the function using some trial vectors.

normalize<-function(x){
  return((x-min(x))/(max(x)-min(x)))
}

# some test examples:
normalize(c(1, 2, 3, 4, 5))

## [1] 0.00 0.25 0.50 0.75 1.00

normalize(c(1, 3, 6, 7, 9))

## [1] 0.000 0.250 0.625 0.750 1.000

After confirming the function definition, we can use lapply() to apply the normalization transformation to each element in a “list” of predictors of the binary outcome=recidivism.

boystown_n <- as.data.frame(lapply(bt_train, normalize))

# alternatively we can use "scale", as we showed above
#boystown_z <- as.data.frame(lapply(X, scale))

Let’s compare the two alternative normalization approaches on one of the features, Alcohol use.

summary(boystown_n$Alcoholuse)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2727  0.3636  0.3576  0.4545  1.0000

summary(boystown_z$Alcoholuse)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.04795 -0.98958  0.06879  0.00000  0.59798  3.77309

1.5.4 Step 4: Data preparation - creating training and test datasets

We have 200 observations in this dataset. The more data we use to train the algorithm, the more precise the prediction would be. We can use $3/4$ of the data for training and the remaining $1/4$ for testing. For simplicity, we will just take the first cases as training and the remaining as testing. Alternatively, we can use a randomization strategy to split the data into testing and training sets.

# may want to use random split of the raw data into training and testing
# subset_int <- sample(nrow(boystown_n),floor(nrow(boystown_n)*0.8))  
# 80% training + 20% testing
# bt_train<- boystown_n [subset_int, ]; bt_test<-boystown_n[-subset_int, ] 

# Note that the object boystown_n already excludes the outcome variable (Delinquency), index 11!
bt_train <- boystown_z[1:150, ]
bt_test  <- boystown_z[151:200, ]

Then let’s extract the recidivism labels or classes for the training and testing sets.

bt_train_labels <- Y[1:150]  
bt_test_labels  <- Y[151:200]

1.5.5 Step 5 - Training a model on the data

First, we will use the class::knn() method.

#install.packages('class', repos = "https://cran.us.r-project.org")
library(class)

The function knn() has following components:

p <- knn(train, test, class, k)

train: data frame containing numeric training data (features)
test: data frame containing numeric testing data (features)
class/cl: class for each observation in the training data
k: predetermined integer indication the number of nearest neighbors

We can first test with $k=7$, which is less than the square root of our number of observations: $\sqrt{200}\approx 14$.

bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=7)

1.5.6 Step 6 - Evaluating model performance

We utilize the CrossTable() function in Chapter 2 to evaluate the KNN model. We have two classes in this example. The goal is to create a $2\times 2$ table that shows the matched true and predicted classes as well as the unmatched ones. However chi-square values are not needed so we use option prop.chisq=False to get rid of it.

# install.packages("gmodels", repos="https://cran.us.r-project.org")
library(gmodels)
CrossTable(x=bt_test_labels, y=bt_test_pred, prop.chisq = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_pred 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          0 |         10 |         10 | 
##                |      0.000 |      1.000 |      0.200 | 
##                |      0.000 |      0.213 |            | 
##                |      0.000 |      0.200 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          3 |         37 |         40 | 
##                |      0.075 |      0.925 |      0.800 | 
##                |      1.000 |      0.787 |            | 
##                |      0.060 |      0.740 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          3 |         47 |         50 | 
##                |      0.060 |      0.940 |            | 
## ---------------|------------|------------|------------|
## 
##

In this table, the cells in the first row-first column and the second row-second column contain the number for cases that have predicted classes matching the true class labels. The other two cells are the counts for unmatched cases. The accuracy of this classifier is calculated by: $\frac{cell[1, 1]+cell[2, 2]}{total}=\frac{37}{50}=0.72$. Note that this value may slightly fluctuate each time you run the classifier, due to the stochastic nature of the algorithm.

1.5.7 Step 7 - Improving model performance

The normalization strategy may play a role in the classification performance. We can try alternative standardization methods - standard Z-score centralization and normalization (via scale() method). Let’s try standardization prior to training the kNN and predicting and assessing the accuracy of the results.

# bt_train <- bt_z[1:150, ]
# bt_test  <- bt_z[151:200, ]
# bt_train_labels <- boystown[1:150, 11]
# bt_test_labels  <- boystown[151:200, 11]
# bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=7)
# CrossTable(x=bt_test_labels, y=bt_test_pred, prop.chisq = F)

# bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=18) # retrieve Probabilities
# getTraingDataThreshold <- quantile(attributes(bt_test_pred)$prob, c(table(bt_train_labels)[1]/length(bt_train_labels), 0.5))[1]
# bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > getTraingDataThreshold, 1, 0)          # Binarize the probabilities
bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=14) # retrieve Probabilities
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control")          # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_predBin 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          0 |         10 |         10 | 
##                |      0.000 |      1.000 |      0.200 | 
##                |      0.000 |      0.217 |            | 
##                |      0.000 |      0.200 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          4 |         36 |         40 | 
##                |      0.100 |      0.900 |      0.800 | 
##                |      1.000 |      0.783 |            | 
##                |      0.080 |      0.720 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          4 |         46 |         50 | 
##                |      0.080 |      0.920 |            | 
## ---------------|------------|------------|------------|
## 
##

# CT 
print(paste0("Prediction accuracy of model 'bt_test_pred' (k=14) is ", (CT$prop.tbl[1,1]+CT$prop.tbl[2,2]) ))

## [1] "Prediction accuracy of model 'bt_test_pred' (k=14) is 0.72"

Under the z-score method, the prediction result is similar to the previous result using the normalization strategy. Albeit, in general, there may be marginal differences, e.g., a few more cases may be correctly labeled based on one of the standardization or normalization approaches.

1.5.8 Testing alternative values of k

Originally, we used the square root of 200 as our k. However, this might not be the best k in this study. We can test different k’s for their predicting performances.

# bt_train <- boystown_n[1:150, ]
# bt_test  <- boystown_n[151:200, ]
# bt_train_labels <- boystown[1:150, 11]
# bt_test_labels  <- boystown[151:200, 11]
# bt_test_pred1   <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=1)
# bt_test_pred9   <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=9)
# bt_test_pred11  <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, k=11)
# bt_test_pred21  <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, k=21)
# bt_test_pred27  <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=27)
# ct_1  <- CrossTable(x=bt_test_labels, y=bt_test_pred1, prop.chisq = F)
# ct_9  <- CrossTable(x=bt_test_labels, y=bt_test_pred9, prop.chisq = F)
# ct_11 <- CrossTable(x=bt_test_labels, y=bt_test_pred11, prop.chisq = F)
# ct_21 <- CrossTable(x=bt_test_labels, y=bt_test_pred21, prop.chisq = F)
# ct_27 <- CrossTable(x=bt_test_labels, y=bt_test_pred27, prop.chisq = F)

bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=18)        # retrieve Probabilities
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control")     # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_predBin 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          1 |          9 |         10 | 
##                |      0.100 |      0.900 |      0.200 | 
##                |      1.000 |      0.184 |            | 
##                |      0.020 |      0.180 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          0 |         40 |         40 | 
##                |      0.000 |      1.000 |      0.800 | 
##                |      0.000 |      0.816 |            | 
##                |      0.000 |      0.800 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          1 |         49 |         50 | 
##                |      0.020 |      0.980 |            | 
## ---------------|------------|------------|------------|
## 
##

# CT 
print(paste0("Prediction accuracy of model 'bt_test_pred' (k=18) is ", (CT$prop.tbl[1,1]+CT$prop.tbl[2,2]) ))

## [1] "Prediction accuracy of model 'bt_test_pred' (k=18) is 0.82"

The choice of $k$ in KNN clustering is very important and it can be fine-tuned using the e1071::tune.knn() of the caret::train() methods. Note that using the tune.knn() method without explicit control over the class-probability cutoff does not generate good results. Specifying that probability exceeding $0.6$ corresponds to recidivism significantly improves the kNN performance using caret::train() and caret::predict() methods.

# install.packages("e1071")
library(e1071)
knntuning = tune.knn(x= bt_train, y = as.factor(bt_train_labels), k = 1:30)
knntuning

## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  k
##  9
## 
## - best performance: 0.3

summary(knntuning)

## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  k
##  9
## 
## - best performance: 0.3 
## 
## - Detailed performance results:
##     k     error dispersion
## 1   1 0.4800000 0.09322745
## 2   2 0.4733333 0.13498971
## 3   3 0.3933333 0.09660918
## 4   4 0.4066667 0.06629526
## 5   5 0.3466667 0.06126244
## 6   6 0.3800000 0.09454243
## 7   7 0.3200000 0.08195151
## 8   8 0.3266667 0.07981460
## 9   9 0.3000000 0.09026709
## 10 10 0.3266667 0.10634210
## 11 11 0.3000000 0.09026709
## 12 12 0.3066667 0.08999314
## 13 13 0.3133333 0.08916623
## 14 14 0.3200000 0.09838197
## 15 15 0.3133333 0.08916623
## 16 16 0.3066667 0.09532271
## 17 17 0.3133333 0.08916623
## 18 18 0.3066667 0.09532271
## 19 19 0.3133333 0.08916623
## 20 20 0.3066667 0.07825252
## 21 21 0.3066667 0.07825252
## 22 22 0.3066667 0.07825252
## 23 23 0.3066667 0.07825252
## 24 24 0.3066667 0.07825252
## 25 25 0.3066667 0.07825252
## 26 26 0.3066667 0.07825252
## 27 27 0.3066667 0.07825252
## 28 28 0.3066667 0.07825252
## 29 29 0.3066667 0.07825252
## 30 30 0.3066667 0.07825252

library(caret)
knnControl <- trainControl(
    method = "cv", ## cross validation
    number = 10,   ## 10-fold
    summaryFunction = twoClassSummary,
    classProbs = TRUE,
    verboseIter = FALSE
)
# bt_train_stringLabels <- ifelse (bt_train_labels==1, "Recidivism", "Control")
knn_model <- train(x=bt_train, y=bt_train_labels , metric = "ROC", method = "knn", tuneLength = 20, trControl = knnControl)
print(knn_model)

## k-Nearest Neighbors 
## 
## 150 samples
##   8 predictor
##   2 classes: 'Control', 'Recidivism' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 134, 135, 136, 136, 134, 134, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens   Spec     
##    5  0.3500227  0.090  0.9127273
##    7  0.4272955  0.090  0.9218182
##    9  0.4607273  0.110  0.9709091
##   11  0.4863409  0.110  0.9809091
##   13  0.4926591  0.085  0.9900000
##   15  0.5005682  0.025  0.9900000
##   17  0.5253409  0.000  0.9900000
##   19  0.5278182  0.000  1.0000000
##   21  0.5694773  0.000  1.0000000
##   23  0.5822955  0.000  1.0000000
##   25  0.5842500  0.000  1.0000000
##   27  0.5767500  0.000  1.0000000
##   29  0.5905682  0.000  1.0000000
##   31  0.5996136  0.000  1.0000000
##   33  0.6027727  0.000  1.0000000
##   35  0.5826591  0.000  1.0000000
##   37  0.5882273  0.000  1.0000000
##   39  0.5885227  0.000  1.0000000
##   41  0.5805682  0.000  1.0000000
##   43  0.5936591  0.000  1.0000000
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 33.

# Here we are providing a cutoff of 60% probability for derived class label=Recidivism
library(dplyr)
summaryPredictions <- predict(knn_model, newdata = bt_test, type = "prob") # %>% mutate('knnPredClass'=names(.)[apply(., 1, which.max)])
summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
testDataPredSummary <- as.data.frame(cbind(trueLabels=bt_test_labels, controlProb=summaryPredictions$Control, 
                                           recidivismProb=summaryPredictions$Recidivism, knnPredLabel=summaryPredictionsLabel))
print(paste0("Accuracy = ", 2*as.numeric(table(testDataPredSummary$trueLabels == testDataPredSummary $knnPredLabel)[2]), "%"))

## [1] "Accuracy = 78%"

It’s useful to visualize the error rate against the value of $k$. This can help us select the optimal $k$ parameter that minimizes the cross-validation (CV) error, see Chapter 9.

library(class)
library(ggplot2)
library(reshape2)
# define a function that generates CV folds
cv_partition <- function(y, num_folds = 10, seed = NULL) {
  if(!is.null(seed)) {
    set.seed(seed)
  }
  n <- length(y)
  
  # split() divides the data into the folds defined by gl().
  # gl() generates factors according to the pattern of their levels

  folds <- split(sample(seq_len(n), n), gl(n = num_folds, k = 1, length = n))
  folds <- lapply(folds, function(fold) {
    list(
      training = which(!seq_along(y) %in% fold),
      test = fold
    )
  })
  names(folds) <- paste0("Fold", names(folds))
  return(folds)
}

# Generate 10-folds of the data
folds = cv_partition(bt_train_labels, num_folds = 10)

# Define a training set_CV_error calculation function
train_cv_error = function(K) {
  #Train error
  #### knnbt = knn(train = bt_train, test = bt_train, cl = bt_train_labels, k = K)
  #### train_error = mean(knnbt != bt_train_labels)
  knn_model <- train(x=bt_train, y=bt_train_labels , metric = "ROC", method = "knn", tuneLength = 20, trControl = knnControl)
  summaryPredictions <- predict(knn_model, newdata = bt_train, type = "prob")
  summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
  train_error = mean(summaryPredictionsLabel != bt_train_labels)

  #CV error
  cverrbt = sapply(folds, function(fold) {
    ###  knnbt = knn(train = bt_train[fold$training,], cl = bt_train_labels[fold$training], test = bt_train[fold$test,], k=K)
    knn_model <- train(x=bt_train[fold$training,], y=bt_train_labels[fold$training] , metric = "ROC", method = "knn", tuneLength = 20, trControl = knnControl)
    summaryPredictions <- predict(knn_model, newdata = bt_train[fold$test,], type = "prob")
    summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
    mean(summaryPredictionsLabel != bt_train_labels[fold$test])
    # mean(bt_train_labels[fold$test] != knn(train = bt_train[fold$training,], 
    #                                        cl = bt_train_labels[fold$training], 
    #                                        test = bt_train[fold$test,], k=K))
    }
  )

  cv_error = mean(cverrbt)

  #Test error
  knn.test = knn(train = bt_train, test = bt_test, cl = bt_train_labels, k = K)
  test_error = mean(knn.test != bt_test_labels)
  return(c(train_error, cv_error, test_error))
}

k_err = sapply(1:30, function(k) train_cv_error(k))
df_errs = data.frame(t(k_err), 1:30)
colnames(df_errs) = c('Train', 'CV', 'Test', 'K')
dataL <- melt(df_errs, id="K")

# require(ggplot2)
# library(reshape2)
# ggplot(dataL, aes_string(x="K", y="value", colour="variable",
#    group="variable", linetype="variable", shape="variable")) +
#    geom_line(size=0.8) + labs(x = "Number of nearest neighbors (k)",
#            y = "Classification error",
#            colour="", group="",
#            linetype="", shape="") +
#   geom_point(size=2.8) +
#   geom_vline(xintercept=4:5, colour = "pink")+
#   geom_text(aes(4,0,label = "4", vjust = 1)) +
#   geom_text(aes(5,0,label = "5", vjust = 1))

library(plotly)
plot_ly(dataL, x = ~K, y = ~value, color = ~variable, type = "scatter", mode = "markers+lines") %>% 
  add_segments(x=25, xend=25, y=0.0, yend=0.33, type = "scatter", name="k=9",
               line=list(color="darkgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  add_segments(x=14, xend=14, y=0.0, yend=0.33, type = "scatter", name="k=14",
               line=list(color="lightgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  add_segments(x=18, xend=18, y=0.0, yend=0.36, type = "scatter",  name="k=18",
               line=list(color="gray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  layout(title='K-NN Training, CV, and Testing Error Rates against k', 
           legend=list(title=list(text='<b> Samples </b>')), 
           xaxis=list(title='Number of nearest neighbors (k)'), yaxis=list(title='Classification error'))

1.5.9 Quantitative Assessment

First review the fundamentals of hypothesis testing inference and recall that:

Confusion Matrix	Negative	Positive
kNN Fails to reject	TN	FN
kNN rejects	FP	TP
Metrics	Specificity: TN/(TN+FP)	Sensitivity: TP/(TP+FN)

Suppose we want to evaluate the kNN model ($k=12$) as to how well it predicts recidivism. Let’s report manually some of the accuracy metrics for the kNN model ($k=12$). Combining the results, we get the following model sensitivity and specificity.

bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=12) # retrieve Probabilities
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control")          # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_predBin 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          2 |          8 |         10 | 
##                |      0.200 |      0.800 |      0.200 | 
##                |      0.286 |      0.186 |            | 
##                |      0.040 |      0.160 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          5 |         35 |         40 | 
##                |      0.125 |      0.875 |      0.800 | 
##                |      0.714 |      0.814 |            | 
##                |      0.100 |      0.700 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          7 |         43 |         50 | 
##                |      0.140 |      0.860 |            | 
## ---------------|------------|------------|------------|
## 
##

mod12_TN <- CT$prop.row[1, 1]  
mod12_FP <- CT$prop.row[1, 2]
mod12_FN <- CT$prop.row[2, 1]
mod12_TP <- CT$prop.row[2, 2]

mod12_sensi <- mod12_TN/(mod12_TN+mod12_FP) 
mod12_speci <- mod12_TP/(mod12_TP+mod12_FN)
print(paste0("kNN model k=12 Sensitivity=", mod12_sensi))

## [1] "kNN model k=12 Sensitivity=0.2"

print(paste0("kNN model k=12 Specificity=", mod12_speci))

## [1] "kNN model k=12 Specificity=0.875"

table(bt_test_labels, bt_test_predBin)

##               bt_test_predBin
## bt_test_labels Control Recidivism
##     Control          2          8
##     Recidivism       5         35

Therefore, model12, corresponding to $k=12$, yields a marginal accuracy on the testing cases.

Another strategy for model validation and improvement involves the use of the caret::confusionMatrix() method, which reports several complementary metrics quantifying the performance of the prediction model.

Let’s examine deeper the performance of model12 to predict recidivism.

# corr12 <- cor(as.numeric(bt_test_labels), as.numeric(bt_test_predBin))
# corr12
# plot(as.numeric(bt_test_labels), as.numeric(bt_test_pred9))
# table(as.numeric(bt_test_labels), as.numeric(bt_test_pred9))

# install.packages("caret")
library("caret")

# compute the accuracy, LOR, sensitivity/specificity of 3 kNN models

# Model 12: bt_test_predBin
#confusionMatrix(as.numeric(bt_test_labels), as.numeric(bt_test_pred1))
confusionMatrix(as.factor(bt_test_labels), as.factor(bt_test_predBin))

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Control Recidivism
##   Control          2          8
##   Recidivism       5         35
##                                           
##                Accuracy : 0.74            
##                  95% CI : (0.5966, 0.8537)
##     No Information Rate : 0.86            
##     P-Value [Acc > NIR] : 0.9927          
##                                           
##                   Kappa : 0.0845          
##                                           
##  Mcnemar's Test P-Value : 0.5791          
##                                           
##             Sensitivity : 0.2857          
##             Specificity : 0.8140          
##          Pos Pred Value : 0.2000          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.1400          
##          Detection Rate : 0.0400          
##    Detection Prevalence : 0.2000          
##       Balanced Accuracy : 0.5498          
##                                           
##        'Positive' Class : Control         
##

Finally, we can use a 3D plot to display the confusionMatrix() results of model12 (mod12_TN, mod12_FN, mod12_FP, mod12_TP).

# install.packages("scatterplot3d")
library(scatterplot3d)
grid_xy <- matrix(c(0, 1, 1, 0), nrow=2, ncol=2)
intensity <- matrix(c(mod12_TN, mod12_FN, mod12_FP, mod12_TP), nrow=2, ncol=2)

# scatterplot3d(grid_xy, intensity, pch=16, highlight.3d=TRUE, type="h", main="3D Scatterplot") 

s3d.dat <- data.frame(cols=as.vector(col(grid_xy)), 
      rows=as.vector(row(grid_xy)), 
      value=as.vector(intensity))
scatterplot3d(s3d.dat, pch=16, highlight.3d=TRUE, type="h", xlab="real", ylab="predicted", zlab="Agreement", 
              main="3D Scatterplot: Model12 Results (FP, FN, TP, TN)")

# scatterplot3d(s3d.dat, type="h", lwd=5, pch=" ", xlab="real", ylab="predicted", zlab="Agreement", main="Model9 Results (FP, FN, TP, TN)")

plot_ly(x = c("TN", "FN", "FP", "TP"),
  y = c(mod12_TN, mod12_FN, mod12_FP, mod12_TP),
  name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Confusion Matrix", 
           legend=list(title=list(text='<b> Model k=12; Performance Metrics </b>')), 
           xaxis=list(title='Metrics'), yaxis=list(title='Probability'))

# plot_ly(type = 'barpolar', r = c(0.2, 0.8, 0.2, 0.85),theta = c(0, 90, 180, 360)) %>%
#     layout(polar = list(radialaxis = list(visible = T,range = c(0,10)),
#       sector = c(0,360), radialaxis = list(tickfont = list(size = 45)),
#       angularaxis = list(tickfont = list(size = 10))))

1.6 Case Study: Predicting Galaxy Spins

Let’s now use the SOCR Case-Study 22 (22_SDSS_GalaxySpins_Case_Study) to train a kNN classifier on $49,122$ (randomly selected out of a total $51,122$) training cases and test the accuracy of predicting the Galactic Spin (L=left or R=right hand spin) on the remaining (randomly chosen testing $2,000$ galaxies). Evaluate and report the classifier performance. Try to compare and improve the performance of several independent classifiers, e.g., kNN, naive Bayesian, LDA.

Report/graph the training, testing and CV error rates for different k parameters and justify your “optimal” choice for $k=k_o$. How good is your Galactic spin classifier? Provide some visual and numeric evidence.

Some sample code to get you started is included below.

#loading libraries
library(e1071); library(caret); library(class); library(gmodels)

#loading Galaxy data
galaxy_data <- read.csv("https://umich.instructure.com/files/6105118/download?download_frd=1", sep=",")

#View(head(galaxy_data))
dim(galaxy_data)

## [1] 51122    12

#dropping Galaxy ID
galaxy_data <- galaxy_data[ , -1]
str(galaxy_data)

## 'data.frame':    51122 obs. of  11 variables:
##  $ RA    : num  236 237 238 238 238 ...
##  $ DEC   : num  -0.493 -0.482 -0.506 -0.544 -0.527 ...
##  $ HAND  : chr  "R" "L" "L" "L" ...
##  $ UZS   : num  17.4 19.2 19.4 19.5 17.8 ...
##  $ GZS   : num  16.2 18 17.5 18.3 16.6 ...
##  $ RZS   : num  15.6 17.4 16.6 17.9 16.2 ...
##  $ IZS   : num  15.2 17 16.2 17.6 15.8 ...
##  $ ZZS   : num  14.9 16.7 15.8 17.3 15.6 ...
##  $ ELLIPS: num  0.825 0.961 0.411 0.609 0.505 ...
##  $ PHIS  : num  154 100.2 29.2 109.4 39.6 ...
##  $ RSS   : num  0.0547 0.0977 0.0784 0.076 0.0796 ...

#randomly assigning 2k for validation purposes
subset_int <- sample(nrow(galaxy_data), 2000) 

#training data, dropping the label
galaxy_data_train <- galaxy_data[-subset_int, -3]; dim(galaxy_data_train) # test

## [1] 49122    10

galaxy_data_test <- galaxy_data[subset_int, -3]; dim(galaxy_data_test);

## [1] 2000   10

#summary(galaxy_data_train)
# labels
galaxy_train_labels <- as.factor(galaxy_data[-subset_int, 3])
galaxy_test_labels <- as.factor(galaxy_data[subset_int, 3])

#summary(galaxy_data_test)
# trying to figure out the best K for this kNN model
# testing alternative values of K
gd_test_pred1<-knn(train=galaxy_data_train, test=galaxy_data_test,
                   cl=galaxy_train_labels, k=1)
gd_test_pred10<-knn(train=galaxy_data_train, test=galaxy_data_test,
                    cl=galaxy_train_labels, k=10)
gd_test_pred20<-knn(train=galaxy_data_train, test=galaxy_data_test,
                     cl=galaxy_train_labels, k=20)
gd_test_pred50<-knn(train=galaxy_data_train, test=galaxy_data_test,
                    cl=galaxy_train_labels, k=50)
ct_1<-CrossTable(x=galaxy_test_labels, y=gd_test_pred1, prop.chisq = F)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                    | gd_test_pred1 
## galaxy_test_labels |         L |         R | Row Total | 
## -------------------|-----------|-----------|-----------|
##                  L |       985 |        20 |      1005 | 
##                    |     0.980 |     0.020 |     0.502 | 
##                    |     0.980 |     0.020 |           | 
##                    |     0.492 |     0.010 |           | 
## -------------------|-----------|-----------|-----------|
##                  R |        20 |       975 |       995 | 
##                    |     0.020 |     0.980 |     0.497 | 
##                    |     0.020 |     0.980 |           | 
##                    |     0.010 |     0.487 |           | 
## -------------------|-----------|-----------|-----------|
##       Column Total |      1005 |       995 |      2000 | 
##                    |     0.502 |     0.497 |           | 
## -------------------|-----------|-----------|-----------|
## 
##

confusionMatrix(galaxy_test_labels, gd_test_pred1)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   L   R
##          L 985  20
##          R  20 975
##                                           
##                Accuracy : 0.98            
##                  95% CI : (0.9729, 0.9857)
##     No Information Rate : 0.5025          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.96            
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9801          
##             Specificity : 0.9799          
##          Pos Pred Value : 0.9801          
##          Neg Pred Value : 0.9799          
##              Prevalence : 0.5025          
##          Detection Rate : 0.4925          
##    Detection Prevalence : 0.5025          
##       Balanced Accuracy : 0.9800          
##                                           
##        'Positive' Class : L               
##

#Alternatively we can use the tuning function
knn.tune = tune.knn(x= galaxy_data_train, y = galaxy_train_labels, k = 1:20,
                    tunecontrol=tune.control(sampling = "fix") , fix=10)

#Summarize the resampling results set
summary(knn.tune)

## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: fixed training/validation set 
## 
## - best parameters:
##  k
##  1
## 
## - best performance: 0.1782094 
## 
## - Detailed performance results:
##     k     error dispersion
## 1   1 0.1782094         NA
## 2   2 0.4143765         NA
## 3   3 0.4387444         NA
## 4   4 0.4043606         NA
## 5   5 0.4055820         NA
## 6   6 0.4369122         NA
## 7   7 0.4438744         NA
## 8   8 0.4374008         NA
## 9   9 0.4381947         NA
## 10 10 0.4428973         NA
## 11 11 0.4469891         NA
## 12 12 0.4487602         NA
## 13 13 0.4524246         NA
## 14 14 0.4515085         NA
## 15 15 0.4518749         NA
## 16 16 0.4532185         NA
## 17 17 0.4576768         NA
## 18 18 0.4588983         NA
## 19 19 0.4620740         NA
## 20 20 0.4617687         NA

# plot(knn.tune)
df <- as.data.frame(cbind(x=knn.tune$performance$k, y=knn.tune$performance$error))
plot_ly(df, x = ~x, y = ~y, type = "scatter", mode = "markers+lines") %>% 
  add_segments(x=1, xend=1, y=0.0, yend=0.45, type = "scatter", 
               line=list(color="gray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  add_segments(x=5, xend=5, y=0.0, yend=0.45, type = "scatter", 
               line=list(color="lightgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  layout(title='Galaxy-spin k-NN Prediction - Error Rate against k', 
           xaxis=list(title='Number of nearest neighbors (k)'), yaxis=list(title='Classification error'))

2 Probabilistic Learning - Naive Bayes Classification

In the previous section, we described the types of machine learning methods and presented a lazy learning classification of numerical data using k-nearest neighbors. What about nominal features or textual data? In this section, we will begin to explore some classification techniques for categorical data. Specifically, we will (1) present the Naive Bayes algorithm, (2) review its assumptions, (3) discuss Laplace estimation, and (4) illustrate the naive Bayesian classifier on a Head and Neck Cancer Medication case-study. Later, in Chapter 7, we will also discuss text mining and natural language processing of unstructured text data.

2.1 Overview of the Naive Bayes Method

It may be useful to first review the basics of probability theory and Bayesian inference. Bayes classifiers use training data to calculate an observed probability of each class based on all the features. The probability links feature values to classes like a map. When labeling the test data, we utilize the feature values in the test data and the “map” to classify our test data with the most likely class. This idea seems simple but the corresponding algorithmic implementations might be very sophisticated.

The best scenario of accurately estimating the probability of an outcome-class map is when all features simultaneously contribute to the Bayes classifier. The naive Bayes algorithm is frequently used for text classifications. The maximum a posteriori assignment to the class label is based on obtaining the conditional probability density function for each feature given the value of the class variable.

2.2 Model Assumptions

The method name, Naive Bayes, suggests the dependence of the technique on some simple (“naive”) assumptions. Its most important assumption is that all of the features are expected to be equally important and independent. This rarely happens in real world data. However, sometimes even when the assumptions are violated, Naive Bayes still performs fairly well, particularly when the number of features $p$ is large. This is why the Naive Bayes algorithm is often used as a powerful text classifier.

There are interesting relations between QDA (Quadratic Discriminant Analysis), LDA (Linear Discriminant Analysis) and Naive Bayes classification. Additional information about LDA and QDA is available here.

2.3 Bayes Formula

Let’s first define the set-theoretic Bayes formula. We assume that $B_i$’s are mutually exclusive events, for all $i= 1, 2, \cdots, n$, where n represents the number of features. If $A$ and $B$ are two events, the Bayes conditional probability formula is as follows: \[Posterior\, Probability=\frac{likelihood\times Prior\, Probability}{Marginal\, Likelihood}\]

Symbolically,

\[P(A|B)=\frac{P(B|A)P(A)}{P(B)}.\] When $B_i's$ represent a partition of the event space, $S=\cup {B_i}$ and $B_i\cap B_j = \emptyset,\ \forall i\not= j$. So we have:

\[P(A|B)=\frac{P(B|A)\times P(A)}{P(B|B_1)\times P(B_1) + P(B|B_2)\times P(B_2) + \cdots + P(B|B_n)\times P(B_n)}.\]

Now, let’s represent the Bayes formula in terms of classification using observed features. Having observed $n$ features, $F_i$, for each of $K$ possible class outcomes, $C_k$. Using the Bayes’ theorem, by decomposing the conditional probability, the Bayesian classifier may be reformulated to make it more tractable.

\[ P(C_k \mid F_1, \dots, F_n)= \frac{P(F_1, \dots, F_n|C_k)P(C_k)}{P(F_1, \dots, F_n)}.\]

In the above expression, only the numerator depends on the class label, $C_k$, as the values of the features $F_i$ are observed (or imputed) making the denominator constant. Let’s focus on the numerator.

The numerator essentially represents the joint probability model:

\[P(F_1, \dots, F_n|C_k)P(C_k) = \underbrace{P(F_1, \cdots, F_n, C_k)}_{\text{joint model}}.\]

Repeatedly using the chain rule and the definition of conditional probability simplifies this to:

\[P(F_1, \dots ,F_n, C_{k})=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2, \dots ,F_n, C_{k})=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3, \dots ,F_n, C_{k})=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3|F_4, \dots ,F_n, C_{k})\times P(F_4, \dots ,F_n, C_{k})=\] \[=\cdots=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3|F_4, \dots ,F_n, C_{k})\times \cdots \times P(F_n| C_{k})\times P(C_k)\]

Note that the “naive” qualifier in the method name, Naive Bayes classifier, is attributed to the oversimplification of the conditional probability. Assuming each feature $F_i$ is conditionally statistical independent of every other feature $F_j, \ \forall j\neq i$, given the category $C_k$, we get:

\[P(F_i | F_{i+1}, \dots , F_{n}, C_k ) = P(F_i | C_k).\]

This reduces the joint probability model to:

\[P(F_1, \dots ,F_n, C_{k})= P(F_1| C_{k})\times P(F_2| C_{k})\times P(F_3| C_{k})\times \cdots \times P(F_n| C_{k})\times P(C_k)\]

Therefore, the joint model is:

\[P(F_1, \dots ,F_n, C_{k})= P(C_k) \prod_{i=1}^n {P(F_i| C_{k})}\]

Essentially, we express the probability of class level $L$ given an observation, represented as a set of independent features $F_1, F_2, \cdots, F_n$. Then the posterior probability that the observation is in class $L$ is equal to:

\[P(C_L|F_1, \cdots, F_n)=\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)},\] where the denominator, $\prod_{i=1}^nP(F_i)$, is a scaling factor that represents the marginal probability of observing all features jointly.

For a given case $X=(F_1, F_2, \cdots, F_n)$, i.e., given vector of features, the naive Bayes classifier assigns the most likely class $\hat{C}$ by calculating $\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)}$ for all class labels $L$, and then assigning the class $\hat{C}$ corresponding to the maximum posterior probability. Analytically, $\hat{C}$ is defined by: \[ \hat{C} = \arg\max_L{\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)}}.\]

As the denominator is static for $L$, the posterior probability above is maximized when the numerator is maximized, i.e., $\hat{C} = \arg\max_L {P(C_L)\prod_{i=1}^nP(F_i|C_L)}.$

The contingency table below illustrates schematically how the Bayesian, marginal, conditional, and joint probabilities may be calculated for a finite number of features (columns) and classes (rows).

$Features\\Classes$	$F_1$	$F_2$	$\cdots$	$F_n$	Total
$C_1$	$\cdots$	$\cdots$	$\cdots$	$\cdots$	Marginal $P(C_1)$
$C_2$	$\cdots$	$\cdots$	$\cdots$	Joint $P(C_2, F_n)$	$\cdots$
$\cdots$	$\cdots$	$\cdots$	$\cdots$	$\cdots$	$\cdots$
$C_L$	Conditional $P(F_1\mid C_L)=\frac{P(F_1,C_L)}{P(C_L)}$	$\cdots$	$\cdots$	$\cdots$	$\cdots$
Total		Marginal $P(F_2)$	$\cdots$	$\cdots$	$N$

In the DSPA2 Appendix, we provide additional technical details, code, and applications of Bayesian simulation, modeling and inference.

2.4 The Laplace Estimator

If at least one $P(F_i|C_L)=0$ then $P(C_L|F_1, \cdots, F_n)=0$, which means the probability of being in this class is 0. However, $P(F_i|C_L)=0$ could happen by random chance, e.g., when selecting the training data.

One of the solutions to this scenario is Laplace estimation, also known as Laplace smoothing, which can be accomplished in two ways. One is to add a small number to each count in the frequency table, which allows each class-feature combination to be at least one in the training data. Then $P(F_i|C_L)>0$ for all $i$ and we avoid degenerate cases. Another strategy is to add some small value, $\epsilon$, to the numerator and denominator when calculating the posterior probability. Note that these small perturbations of the denominator should be larger than the changes in the numerator to avoid trivial ($0$) posterior for another class.

2.5 Case Study: Head and Neck Cancer Medication

2.5.1 Step 1: Collecting Data

We utilize the Inpatient Head and Neck Cancer Medication data for this case study, which is the case study 14 in our data archive.

Variables:

PID: coded patient ID
ENC_ID: coded encounter ID
Seer_stage: SEER cancer stage (0 =In situ, 1=Localized, 2=Regional by direct extension, 3=Regional to lymph nodes, 4=Regional (both codes 2 and 3), 5=Regional, NOS, 7= Distant metastases/systemic disease, 8=Not applicable, 9=Unstaged, unknown, or unspecified). See: SSM.
Medication_desc: description of the chemical composition of the medication
Medication_summary: brief description about medication brand and usage
Dose: the dosage in the medication summary
Unit: the unit for dosage in the Medication_summary
Frequency: the frequency of use in the Medication_summary
Total_dose_count: total dosage count according to the Medication_summary

2.5.2 Step 2: Exploring and preparing the data

Let’s first load the cancer dataset.

hn_med<-read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1", stringsAsFactors = FALSE)
str(hn_med)

## 'data.frame':    662 obs. of  9 variables:
##  $ PID               : int  10000 10008 10029 10063 10071 10103 1012 10135 10136 10143 ...
##  $ ENC_ID            : int  46836 46886 47034 47240 47276 47511 3138 47739 47744 47769 ...
##  $ seer_stage        : int  1 1 4 1 9 1 1 1 9 1 ...
##  $ MEDICATION_DESC   : chr  "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
##  $ MEDICATION_SUMMARY: chr  "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 15 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg  Per PACU protocol" ...
##  $ DOSE              : chr  "150" "5000" "1.5" "50" ...
##  $ UNIT              : chr  "mg" "unit" "g" "microgram" ...
##  $ FREQUENCY         : chr  "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
##  $ TOTAL_DOSE_COUNT  : int  5 3 11 2 1 2 2 6 15 1 ...

Change the seer_stage (cancer stage indicator) variable into a factor.

hn_med$seer_stage <- factor(hn_med$seer_stage)
str(hn_med$seer_stage)

##  Factor w/ 9 levels "0","1","2","3",..: 2 2 5 2 9 2 2 2 9 2 ...

table(hn_med$seer_stage)

## 
##   0   1   2   3   4   5   7   8   9 
##  21 265  53  90  46  18  87  14  68

2.5.2.1 Data preparation - processing text data for analysis

As you can see, the medication_summary contains a great amount of text. We should do some text mining to prepare the data for analysis. In R, the tm package is a good choice for text mining.

# install.packages("tm", repos = "https://cran.us.r-project.org")
# requires R V.3.3.1 +
library(tm)

First step for text mining is to convert text features (text elements) into a corpus object, which is a collection of text documents.

hn_med_corpus <- Corpus(VectorSource(hn_med$MEDICATION_SUMMARY))
print(hn_med_corpus)

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 662

After we constructed the corpus object, we could see that we have 662 documents. Each document represents an encounter (e.g., notes on medical treatment) for a patient.

inspect(hn_med_corpus[1:3])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] (Zantac) 150 mg tablet oral two times a day
## [2] 5,000 unit subcutaneous three times a day  
## [3] (Unasyn) 15 g IV every 6 hours

hn_med_corpus[[1]]$content

## [1] "(Zantac) 150 mg tablet oral two times a day"

hn_med_corpus[[2]]$content

## [1] "5,000 unit subcutaneous three times a day"

hn_med_corpus[[3]]$content

## [1] "(Unasyn) 15 g IV every 6 hours"

There are unwanted punctuations and other symbols in the corpus document that we want to remove. We use the tm_map() function for the cleaning.

corpus_clean <- tm_map(hn_med_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removePunctuation)
# corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
# corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

The above lines of code changed all the characters to lowercase, removed all punctuations and extra white spaces (typically created by deleting punctuations), and removed numbers (we could also convert the corpus to plain text).

inspect(corpus_clean[1:3])

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] zantac mg tablet oral two times a day  unit subcutaneous three times a day 
## [3] unasyn g iv every hours

corpus_clean[[1]]$content

## [1] "zantac mg tablet oral two times a day"

corpus_clean[[2]]$content

## [1] " unit subcutaneous three times a day"

corpus_clean[[3]]$content

## [1] "unasyn g iv every hours"

DocumentTermMatrix() function can successfully tokenize the medication summary into words. It can count frequent terms in each document in the corpus object.

hn_med_dtm <- DocumentTermMatrix(corpus_clean)

2.5.2.2 Data preparation - creating training and test datasets

Just like in the kNN method, we need to separate the dataset into training and test subsets. We have to subset the raw data with other features, the corpus object and the document term matrix.

set.seed(12345)
subset_int <- sample(nrow(hn_med),floor(nrow(hn_med)*0.8))  # 80% training + 20% testing
hn_med_train <- hn_med[subset_int, ]
hn_med_test <- hn_med[-subset_int, ]
hn_med_dtm_train <- hn_med_dtm[subset_int, ]
hn_med_dtm_test <-hn_med_dtm[-subset_int, ]
corpus_train <- corpus_clean[subset_int]
corpus_test  <- corpus_clean[-subset_int]

# hn_med_train<-hn_med[1:562, ]
#hn_med_test<-hn_med[563:662, ]
# hn_med_dtm_train<-hn_med_dtm[1:562, ]
# hn_med_dtm_test<-hn_med_dtm[563:662, ]
#corpus_train<-corpus_clean[1:562]
#corpus_test<-corpus_clean[563:662]

Let’s examine the distribution of seer stages in the training and test datasets.

prop.table(table(hn_med_train$seer_stage))

## 
##          0          1          2          3          4          5          7 
## 0.01890359 0.41965974 0.07561437 0.13988658 0.07561437 0.03213611 0.12098299 
##          8          9 
## 0.01890359 0.09829868

prop.table(table(hn_med_test$seer_stage))

## 
##           0           1           2           3           4           5 
## 0.082706767 0.323308271 0.097744361 0.120300752 0.045112782 0.007518797 
##           7           8           9 
## 0.172932331 0.030075188 0.120300752

We can separate (dichotomize) the seer_stage into two categories:

no stage or early stage cancer (seer in [0;4]), and
later stage cancer (seer in [5;9]).

Of course, other binarizations are possible as well. Note that a %in% b is an intuitive interface to match and acts as a binary operator returning a logical vector (T or F) indicating if there is a match between the left and right operands.

hn_med_train$stage <- hn_med_train$seer_stage %in% c(5:9)
hn_med_train$stage <- factor(hn_med_train$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test$stage  <- hn_med_test$seer_stage %in% c(5:9)
hn_med_test$stage  <- factor(hn_med_test$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train$stage))

## 
## early_stage later_stage 
##   0.7296786   0.2703214

prop.table(table(hn_med_test$stage))

## 
## early_stage later_stage 
##   0.6691729   0.3308271

2.5.2.3 Visualizing text data - word clouds

A word cloud can help us visualize text data. More frequent words would have larger fonts in the figure, while less common words are appearing in smaller fonts. There is a wordcloud package in R that is commonly used for creating these figures.

# install.packages("wordcloud", repos = "https://cran.us.r-project.org")
library(wordcloud)
wordcloud(corpus_train, min.freq = 40, random.order = FALSE, colors=brewer.pal(5, "Dark2"))

The random.order=FALSE option makes more frequent words appear in the center of the word cloud. The min.freq=40 option sets the cutoff word frequency to be at least 40 times in the corpus object. Therefore, the words must appear in at least 40 medication summaries to be shown on the graph.

We can also visualize the difference between early stages and later stages using this type of graph.

early <- subset(hn_med_train, stage=="early_stage")
later <- subset(hn_med_train, stage=="later_stage")
wordcloud(early$MEDICATION_SUMMARY, max.words = 20, colors=brewer.pal(3, "Dark2"))

wordcloud(later$MEDICATION_SUMMARY, max.words = 20, colors=brewer.pal(3, "Dark2"))

# tdm = as.matrix(hn_med_dtm)
# tdm1 <- tdm[nchar(rownames(tdm)) < 15, ]
# colnames(tdm1) <- hn_med_dtm$dimnames$Terms[1:14]
# comparison.cloud(tdm1, random.order=FALSE,
#                  colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
#                  title.size=1, min.words=5, max.words=50, scale=c(1.5, 0.4),rot.per=0.4)

We can see that the frequent words are somewhat different in the medication summaries of the early and later stage patients.

2.5.2.4 Data preparation - creating indicator features for frequent words

For simplicity, we utilize the medication summary as the only feature to classify cancer stages. You may recall that earlier in the kNN section we used the native data features to obtain kNN classifications. Now, we are going to make frequencies of words as text-derived data features.

summary(findFreqTerms(hn_med_dtm_train, 5))

##    Length     Class      Mode 
##       108 character character

hn_med_dict <- as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train <- DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test  <- DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))

The above code limits the document term matrix with words that have appeared in at least 5 different documents. This created (about) 118 features for us to use.

The Naive Bayes classifier trains on data with categorical features, as it uses frequency tables for learning the data affinities. To create the combinations of class and feature values comprising the frequency-table (matrix), all features must be categorical. For instance, numeric features have to be converted (binned) into categories. Thus, we need to transform our word count features into categorical data. One way to achieve this is to change the count into an indicator of whether this word appears in the document describing the patient encounter. We can create a simple function to convert the presence of a specific word (column) in an encounter document (row) to a binary “Yes” (present) or “No” (not present).

convert_counts <- function(wordFreq) {
  wordFreq <- ifelse(wordFreq > 0, 1, 0)
  wordFreq <- factor(wordFreq, levels = c(0, 1), labels = c("No", "Yes"))
  return(wordFreq)
}

We employ hard-thresholding here x<-ifelse(x>0, 1, 0). This is saying that if we have an x that is greater than 0, we assign value 1 to it, otherwise the value is set to 0.

Now let’s apply our own function convert_counts() on each column (MARGIN=2) of the training and testing datasets.

hn_train <- apply(hn_train, MARGIN = 2, convert_counts)
hn_test <- apply(hn_test, MARGIN = 2, convert_counts)

# Check the structure of hn_train and hn_train:
# head(hn_train); dim(hn_train)

So far, we successfully created indicators for words that appeared at least in 5 different documents in the training data.

2.5.3 Step 3 - training a model on the data

The package we will use for the Naive Bayes classifier is called e1071.

# install.packages("e1071", repos = "https://cran.us.r-project.org")
library(e1071)

The function naiveBayes() has following components:

m<-naiveBayes(train, class, laplace=0)

train: data frame containing numeric training data (features)
class: factor vector with the class for each row in the training data.
laplace: positive double controlling Laplace smoothing; default is $0$ and disables Laplace smoothing.

Next, we will invoke the naiveBayes() classifier.

hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)

Then we can use the classifier to make predictions using predict(). Recall that when we presented the AdaBoost example in Chapter 2, we saw the general machine learning and artificial intelligence protocol, from training, to prediction, and assessment.

For most supervised and unsupervised methods, the generic function predict() has the following syntax:

p <- predict(m, test, type="class")

m: classifier, in our case naiveBayes()
test: test data frame or matrix
type: either "class" or "raw" specifies whether the predictions should be the most likely class value or the raw predicted probabilities.

hn_test_pred <- predict(hn_classifier, hn_test)

2.5.4 Step 4 - evaluating model performance

Similarly to the approach in the earlier kNN approach, we can use cross table to compare the predicted vs. the true classes using the testing data.

library(gmodels)
CT <- CrossTable(hn_test_pred, hn_med_test$stage)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  133 
## 
##  
##              | hn_med_test$stage 
## hn_test_pred | early_stage | later_stage |   Row Total | 
## -------------|-------------|-------------|-------------|
##  early_stage |          64 |          30 |          94 | 
##              |       0.019 |       0.039 |             | 
##              |       0.681 |       0.319 |       0.707 | 
##              |       0.719 |       0.682 |             | 
##              |       0.481 |       0.226 |             | 
## -------------|-------------|-------------|-------------|
##  later_stage |          25 |          14 |          39 | 
##              |       0.046 |       0.093 |             | 
##              |       0.641 |       0.359 |       0.293 | 
##              |       0.281 |       0.318 |             | 
##              |       0.188 |       0.105 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |          89 |          44 |         133 | 
##              |       0.669 |       0.331 |             | 
## -------------|-------------|-------------|-------------|
## 
##

CT

## $t
##              y
## x             early_stage later_stage
##   early_stage          64          30
##   later_stage          25          14
## 
## $prop.row
##              y
## x             early_stage later_stage
##   early_stage   0.6808511   0.3191489
##   later_stage   0.6410256   0.3589744
## 
## $prop.col
##              y
## x             early_stage later_stage
##   early_stage   0.7191011   0.6818182
##   later_stage   0.2808989   0.3181818
## 
## $prop.tbl
##              y
## x             early_stage later_stage
##   early_stage   0.4812030   0.2255639
##   later_stage   0.1879699   0.1052632

mod_TN <- CT$prop.row[1, 1]  
mod_FP <- CT$prop.row[1, 2]
mod_FN <- CT$prop.row[2, 1]
mod_TP <- CT$prop.row[2, 2]
# caret::confusionMatrix(hn_test_pred, hn_med_test$stage)
# CT$prop.row
library(plotly)
plot_ly(x = c("TN", "FN", "FP", "TP"),
  y = c(mod_TN, mod_FN, mod_FP, mod_TP),
  name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Confusion Matrix", 
           legend=list(title=list(text='<b> Metrics </b>')), 
           xaxis=list(title='Metrics'), yaxis=list(title='Probability'))

It may be worth quickly looking forward in Chapter 9 where we present a summary table containing commonly used measures for evaluating the performance of binary tests, classifiers, or predictions.

In this case, the prediction accuracy of the Naive Bayes classifier, assessed on the testing dataset, is:

\[ACC = \frac{T P + T N}{T P + F P + F N + T N } = \frac{78}{133}=0.59.\]

From the cross table we can see that our testing-data prediction accuracy is $\frac{78}{133}=0.59$.

2.5.5 Step 5 - improving model performance

After setting laplace=5, the accuracy goes to $acc=\frac{87}{133}=0.65$. Although there is a small difference in accuracy, in general, such protocol alterations may increase or decrease the rate of true later stage patient identification using testing and validation data.

hn_med_dict <- as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train <- DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test  <- DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))

set.seed(1234)
# View(as.data.frame(inspect(hn_train)))
traindf <- data.frame(stage=hn_med_train$stage, as.matrix(hn_train))
testdf <- data.frame(stage=hn_med_test$stage, as.matrix(hn_test))

# for Training and Testing data, transform predictors/features to factors
for (cols in names(traindf)) traindf[[cols]] <- factor(traindf[[cols]])
for (cols in names(testdf)) testdf[[cols]] <- factor(testdf[[cols]])

# check variable types
# str(traindf)

# run NB classifier (for cancer stage)
hn_classifier <- naiveBayes(traindf[,-1], traindf$stage, laplace = 5, type = "class")
hn_test_pred1 <- predict(hn_classifier, testdf)
caret::confusionMatrix(hn_test_pred1, testdf$stage)

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    early_stage later_stage
##   early_stage          83          40
##   later_stage           6           4
##                                           
##                Accuracy : 0.6541          
##                  95% CI : (0.5668, 0.7344)
##     No Information Rate : 0.6692          
##     P-Value [Acc > NIR] : 0.6804          
##                                           
##                   Kappa : 0.0292          
##                                           
##  Mcnemar's Test P-Value : 1.141e-06       
##                                           
##             Sensitivity : 0.93258         
##             Specificity : 0.09091         
##          Pos Pred Value : 0.67480         
##          Neg Pred Value : 0.40000         
##              Prevalence : 0.66917         
##          Detection Rate : 0.62406         
##    Detection Prevalence : 0.92481         
##       Balanced Accuracy : 0.51175         
##                                           
##        'Positive' Class : early_stage     
##

Accuracy is $\frac{87}{133} = 0.65$. This is a marginal performance improvement, compared to the prior ($laplace=0$) default model. Notice that choosing a different Laplace smoothing parameter alters the balance between the true/false and positive/negative elements in the $2\times 2$ confusion matrix.

2.5.6 Step 6 - compare Naive Bayesian vs. LDA

As mentioned earlier, Naive Bayes with normality assumption is a special case of Discriminant Analysis. It might be interesting to compare the prediction results of Naive Bayes and LDA classification.

Note: LDA assumes the predictors are jointly approximately normally distributed.

library(MASS)
# df_hn_train = data.frame(lapply(as.data.frame(hn_train),as.numeric), 
#                          stage = hn_med_train$stage)
# df_hn_test = data.frame(lapply(as.data.frame(hn_test),as.numeric), 
#                         stage = hn_med_test$stage)

library("dplyr")
binarizeFunction <- function(x) { ifelse(x=="Yes", 1,0) }

#  A function to Convert Categorical variables to numeric
cat2Numeric <- function (dfInput) {
  df = as.data.frame(lapply( as.data.frame(dfInput), factor)) %>%
                       mutate_all(binarizeFunction)
   return(df)
}

# define the numeric DF of predictors (X) and outcome (Y=stage)
df_hn_train = data.frame(as.matrix(hn_train), stage = as.numeric(hn_med_train$stage))
df_hn_test = data.frame(as.matrix(hn_test), stage = as.numeric(hn_med_test$stage))

# Remove the multicollinearity - this should be done via VIF assessment, 
#      but for now, just take the first few predictors
df_hn_train <- df_hn_train[ , c(1:34, 40:50, 60:70, 109)]
df_hn_train$stage <- hn_med_train$stage
df_hn_test$stage <- hn_med_test$stage

# Fit LDA
set.seed(1234)
hn_lda <- lda(data=df_hn_train, stage ~ .)
# hn_pred = predict(hn_lda, df_hn_test[,-104])
hn_pred = predict(hn_lda, df_hn_test)
CrossTable(hn_pred$class, df_hn_test$stage)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  133 
## 
##  
##               | df_hn_test$stage 
## hn_pred$class | early_stage | later_stage |   Row Total | 
## --------------|-------------|-------------|-------------|
##   early_stage |          85 |          37 |         122 | 
##               |       0.138 |       0.280 |             | 
##               |       0.697 |       0.303 |       0.917 | 
##               |       0.955 |       0.841 |             | 
##               |       0.639 |       0.278 |             | 
## --------------|-------------|-------------|-------------|
##   later_stage |           4 |           7 |          11 | 
##               |       1.535 |       3.104 |             | 
##               |       0.364 |       0.636 |       0.083 | 
##               |       0.045 |       0.159 |             | 
##               |       0.030 |       0.053 |             | 
## --------------|-------------|-------------|-------------|
##  Column Total |          89 |          44 |         133 | 
##               |       0.669 |       0.331 |             | 
## --------------|-------------|-------------|-------------|
## 
##

# caret::confusionMatrix(hn_pred$class, df_hn_test$stage)

There are differences in the performance of the Naive Bayesian ($acc=\frac{87}{133}=0.65$) and the LDA classifiers ($acc=\frac{92}{133}=0.69$) in terms of the overall accuracy. We can compare the type II (false-negative prediction) error rates of the naive Bayesian and the LDA classifiers are $\frac{40}{133}=0.30$ and $\frac{37}{133}=0.28$, respectively. Such differences may have a nuanced clinical importance for early detection of later-stage cancer.

Later, we will step deeper into the space of classification problems and see more sophisticated approaches, especially for difficult problems like cancer forecasting.

3 Decision Trees and Divide and Conquer Classification

When classification models need to be apparent and directly mechanistically interpreted, kNN and naive Bayes methods we saw earlier may not be useful, as they do not yield explicit classification rules. In some cases, we need to derive direct and clear rules for our classification decisions. Just like a scoring criterion for driving ability, credit scoring, or grading rubrics, mechanistic understanding of the derived labeling, prediction, or decision making is often beneficial. For instance, we can assign credit or deduct a penalty for each driving operation task. These rules facilitate the critical decision to issue a driver license to applicants that are qualified, ready, and able to safely operate a specific type of a motor vehicle.

In this section, we will (1) see a simple motivational example of decision trees based on the Iris data, (2) describe decision-tree divide and conquer methods, (3) examine certain measures quantifying classification accuracy, (4) show strategies for pruning decision trees, (5) work through a Quality of Life in Chronic Disease case-study, and (6) review the One Rule and RIPPER algorithms.

3.1 Motivation

Decision tree learners enable classification via tree structures modeling the relationships among all features and potential outcomes in the data. All decision trees begin with a trunk representing all points that are part of the same cohort. Iteratively, the trunk is split into narrower and narrower branches by forking-decisions based on the intrinsic data structure. At each step, splitting the data into branches may include binary or multinomial classification. The final decision is obtained when the tree branching process terminates. The terminal (leaf) nodes represent the action to be taken as the result of the series of branching decisions. For predictive models, the leaf nodes provide the expected forecasting results given the series of events in the decision tree. There are a number of R packages available for decision tree classification including rpart, C5.0, party, etc.

3.2 Hands-on Example: Iris Data

Let’s start by seeing a simple example using the Iris dataset, which we saw in Chapter 2. The data features or attributes include Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, and classes are represented by the Species taxa (setosa, versicolor, and virginica).

# install.packages("party")
library("party")
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

head(iris); table(iris$Species)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

## 
##     setosa versicolor  virginica 
##         50         50         50

The ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris) function will build a conditional-inference decision tree.

iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
print(iris_ctree)

## 
##   Conditional inference tree with 4 terminal nodes
## 
## Response:  Species 
## Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width 
## Number of observations:  150 
## 
## 1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
##   2)*  weights = 50 
## 1) Petal.Length > 1.9
##   3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
##     4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
##       5)*  weights = 46 
##     4) Petal.Length > 4.8
##       6)*  weights = 8 
##   3) Petal.Width > 1.7
##     7)*  weights = 46

plot(iris_ctree, cex=2)  #  party::plot.BinaryTree()

head(iris); tail(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

Similarly, we can demonstrate a classification of the iris taxa via rpart:

# install from https://cran.r-project.org/web/packages/rpart.utils/
library(rpart)
library(rpart.utils)
library(plotly)
# iris_rpart = rpart(Species~., data=iris)
# print(iris_rpart)
# 
# Use the `rattle::fancyRpartPlot` to generates an elegant plot
# install.packages("rattle")
# library(rattle)
# fancyRpartPlot(iris_rpart, cex = 1.5, caption = "rattle::fancyRpartPlot (Iris flowers)")
#
#### Plot_ly decision tree visualization
iris_rpart <- rpart(Species ~ . , method="class", data=iris)
# printcp(iris_rpart)

# Set the Graph/Tree Node names
treeFrame <- iris_rpart$frame

# Identify Leaf nodes
isLeave <- treeFrame$var == "<leaf>"
nodes <- rep(NA, length(isLeave))

# Get the 3 Isis taxa labels: "setosa", "versicolor", "virginica" for all nodes (including leaves)
ylevel <- attr(iris_rpart, "ylevels")
nodes[isLeave] <- ylevel[treeFrame$yval][isLeave]
nodes[!isLeave] <- labels(iris_rpart)[-1][!isLeave[-length(isLeave)]]

# Get all Tree Graph Connections (source --> target)
treeFrame <- iris_rpart$frame
treeRules <- rpart.utils::rpart.rules(iris_rpart)

targetPaths <- sapply(as.numeric(row.names(treeFrame)), 
                      function(x) { strsplit(unlist(treeRules[x]),split=",") } )

lastStop <- sapply(1:length(targetPaths),
                   function(x) { targetPaths[[x]][length(targetPaths[[x]])] } )

oneBefore <- sapply(1:length(targetPaths),
                    function(x) { targetPaths[[x]][length(targetPaths[[x]])-1] } )

# Initialize the tree edges
target=c()
source=c()
values=treeFrame$n
for(i in 2:length(oneBefore)) {
  tmpNode=oneBefore[[i]]
  q = which(lastStop==tmpNode)

  q=ifelse(length(q)==0, 1, q)
  source = c(source, q)
  target = c(target, i)
}
source=source-1
target=target-1

# Display the sankey tree
plot_ly(type = "sankey", orientation = "h", 
             node = list(label = nodes, pad = 15, thickness = 30, line = list(color="black", width=1)),
             link = list(source = source, target = target, value=values[-1])) %>% 
 layout(title = "Iris rPart Decision Tree", font = list(size = 12))

3.3 Decision Tree Overview

The decision tree algorithm represents an upside down tree with lots of branches where a series of logical decisions are encoded as branches between nodes of the tree. The classification begins at the root node and goes through many branches until it gets to the terminal nodes. This iterative process splits the data into different classes by a rigid criterion.

3.3.1 Divide and conquer

Decision trees involve recursive partitioning that use data features and attributes to split the data into groups (nodes) of similar classes.

To make classification trees using data features, we need to observe the pattern between the data features and potential classes using training data. We can draw scatter plots and separate groups that are clearly bundled together. Each of the groups is considered a segment of the data. After getting the approximate range of each feature value within each group, we can make the decision tree.

\[X = [X_1, X_2, X_3, \cdots, X_k] = \underbrace{ \begin{pmatrix} x_{1, 1} & x_{1, 2} & \cdots & x_{1, k} \\ x_{2, 1} & x_{2, 2} & \cdots & x_{2, k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n, 1} & x_{n, 2} & \cdots & x_{n, k} \end{pmatrix} }_{features/attributes} \left\} \begin{array}{ll} & \\ cases & \\ & \end{array} \right.\]

The decision tree algorithms use a top-down recursive divide-and-conquer approach (sometimes they may also use bottom up or mixed splitting strategies) to divide and evaluate the splits of a dataset $D$ (input). The best split decision corresponds to the split with the highest information gain, reflecting a partition of the data into $K$ subsets (using divide-and-conquer). The iterative algorithm terminates when some stopping criteria are reached. Examples of stopping conditions used to terminate the recursive process include:

All the samples belong to the same class, that is they have the same label and the sample is already pure.
Stop when the majority of the points are already of the same class (relative to some error threshold).
There are no remaining attributes on which the samples may be further partitioned.

One objective criterion for splitting or clustering data into groups is based on the information gain measure, or impurity reduction, which can be used to select the test attribute at each node in the decision tree. The attribute with the highest information gain (i.e., greatest entropy reduction) is selected as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions. Three commonly used measures for evaluating the impurity reduction are: Misclassification error, Gini index and Entropy.

For a given table containing pairs of attributes and their class labels, we can assess the homology of the classes in the table. A table is pure (homogeneous) if it only contains a single class. If a data table contains several classes, then we say that the table is impure or heterogeneous. This degree of impurity or heterogeneity can be quantitatively evaluated using impurity measures like entropy, Gini index, and misclassification error.

3.3.2 Entropy

The Entropy is an information measure of the amount of disorder or uncertainty in a system. Suppose we have a data set $D=(X_1, X_2, \cdots, X_n)$ that includes $n$ features (variables) and suppose each of these features can take on any of $k$ possible values (states). Then the cardinality of the entire system is $k^n$ as each of the features are assumed to have $k$ independent states, thus the total number of different datasets that can be expected is $\underbrace{k\times k\times k \times \cdots\times k}_{n}=k^n$. Suppose ${\displaystyle p_{1},p_{2},\cdots, p_k}$ represent the proportions of each class (note: $\sum_{i} {p_i}=1$) present in the child node that results from a split in a decision tree classifier. Then the entropy measure is defined by:

\[Entropy(D)= - \sum_{i} {p_i \log_{2}{p_i}}.\]

If each of the $1\le i\leq k$ states for each feature is equally likely to be observed with probability $p_i=\frac{1}{k}$, then the entropy is maximized: \[ Entropy(D)=- \sum_{i=1}^{k} {\frac{1}{k} \log{\frac{1}{k}}}= \sum_{i=1}^{k} {\frac{1}{k} \log{k}}= k{\frac{1}{k} \log{k}}=\log{k} = O(1).\]

In the other extreme, the entropy is minimized when the probability of one class is unitary and all other ones are trivial. Note that by L’Hopital’s Rule $\lim_{x\longrightarrow 0} {(x\times \log(x))} =\lim_{x\longrightarrow 0} {\frac{\frac{1}{x}}{-\frac{1}{x^2}}}=\lim_{x\longrightarrow 0} {x}=0$ for a single class classification where the probability of one class is unitary ($p_{i_o}=1$) and the other ones are trivial ($p_{i\not= i_o}=0$):

\[Entropy(D)=p_{i_o}\times \log{p_{i_o}} + \sum_{i\not=i_o}{p_i \log{p_i}}=\] \[=1\times \log{1} + \lim_{x \longrightarrow 0}{\sum_{i\not=i_o}{x \log(x)}} = 0+0=0.\]

In classification settings, higher entropy (i.e., more disorder) corresponds to a sample that has a mixed collection of labels. Conversely, lower entropy corresponds to a classification where we have mostly pure partitions. In general, the entropy of a sample $D=\{X_1, X_2, \cdots, X_n\}$ is defined by:

\[H(D) = - \sum_{i=1}^{k} {P(c_i|D) \log{P(c_i|D)}}, \]

where $P(c_i|D)$ is the probability of a data point in $D$ being labeled with class $c_i$, and $k$ is the number of classes (clusters). $P(c_i|D)$ can be estimated from the observed data by:

\[P(c_i|D) = \frac{P(D | c_i)P(c_i)}{P(D)} = \frac{|\{x_j \in D | x_j \text{ has label } y_j = c_i\}|}{|D|}.\]

Observe that if the observations are evenly split among all $k$ classes then $P(c_i|D)=\frac{1}{k}$ and \[H(D) = - \sum_{i=1}^{k} {\frac{1}{k} \log{\frac{1}{k}}}=\log{k}=O(1).\]

At the other extreme, if all the observations are from one class then: \[H(D)=-1*\log(1)=0.\]

Also note that the base of the log function is somewhat irrelevant and can be used to normalize (scale) the range of the entropy $\left ( \log_b(x) = \frac{\log_2(x)}{\log_2(b)}\right )$.

The Information gain is the expected reduction in entropy caused by knowing the value of an attribute.

3.3.3 Misclassification Error and Gini Index

Similar to the Entropy measure, the Misclassification error and the Gini index are also applied to evaluate information gain. The Misclassification error is defined by the formula:

\[ME = 1-\max_k (p_k).\]

And the Gini index is expressed as: \[GI = \sum_{k}p_k(1-p_k)=1-\sum_{k}p_k^2.\]

3.3.4 C5.0 Decision Tree Algorithm

The C5.0 algorithm is a popular implementation of decision trees. To begin with, let’s consider the term purity. If the segments of data contain a single class, they are considered pure. The entropy represents a mathematical formalism measuring purity for data segments.

\[Entropy(S)=- \sum_{i=1}^c {p_i \log_2(p_i)},\]

where entropy is the measurement, c is the number of total class levels, and $p_i$ refers to the proportion of observations that fall into each class (i.e., probability of a randomly selected data point to belong to the $i^{th}$ class level. For 2 possible classes, the entropy ranges from 0 to 1. For $n$ classes, the entropy ranges from $0$ to $\log_2(n)$, where the minimum entropy corresponds to data that is purely homogeneous (completely deterministic/predictable) and the maximum entropy represents completely disordered data (stochastic or extremely noisy). You might wonder what is the benefit of using entropy? The answer is because the smaller the entropy the more information is contained in the corresponding decision node split. Systems (data) with high entropy indicate significant information content (randomness) and data with low entropy indicates highly-compressible data with structure in it.

If we only have one class in the segment then the entropy is trivial, $Entropy(S)=(-1)\times \log_2(1)=0$. Let’s try another example. If we have a segment of data that contains two classes, the first class contains 80% of the data and the second class contains the remaining 20%. Then, we have the following entropy: $Entropy(S)=-0.8\log_2(0.8)-0.2\log_2(0.2)=0.7219281.$

For a two class problem, the relationship between class-proportions and entropy is illustrated in the following graph. Assume $x$ is the proportion for one of the classes.

N <- 1000
set.seed(1234)
x <- runif(N)
x_ind <- sort(x)
y <- -x_ind*log2(x_ind)-(1-x_ind)*log2(1-x_ind)

# curve(-x*log2(x)-(1-x)*log2(1-x), col="red", main="Entropy for Different Proportions", xlab = "x (proportion for class 1)", ylab = "Entropy", lwd=3)
# segments(0.8, 0, 0.8, 0.7219281, lwd=2, col="blue", lty = 2)
# points(0.8, 0.7219281, col = "blue", pch=21, cex=2, lwd=2)
# text(-0.02, 0.3, pos=4, srt = 90, "High-Structure/Low-Entropy")
# text(0.48, 0.3, pos=4, srt = 90, "Low-Structure/High-Entropy")
# text(0.96, 0.3, pos=4, srt = 90, "High-Structure/Low-Entropy")
# text(0.65, 0.8, pos=4, "(x=0.8,Entr=0.72)")

modelLabels <- c('High-Structure/Low-Entropy', 'Low-Structure/High-Entropy', 'High-Structure/Low-Entropy')
modelLabels.x <- c(0.03, 0.5, 0.97)
modelLabels.y <- c(0.5, 0.7, 0.5)
modelLabels.col <- c("blue", "red", "blue")
plot_ly() %>%
  add_lines(x = ~x_ind, y = ~y, name="Entropy", line = list(width = 4)) %>%
  add_segments(x=0.8, xend=0.8, y=0.001, yend = 0.7219281, showlegend=F) %>%
  # add_segments(x=-0.2, xend=1.1, y=0, yend = 0, showlegend=F) %>%
  # add_markers(x = ~x, y = ~y, name="Sample Simulated Data") %>%
  #add_lines(x = ~c(point1$x,point2$x), y = ~c(point1$y,point2$y), name="Second PC, Orthogonal to lm(Y ~ X)", 
  #          line = list(width = 4))  %>% 
  add_markers(x = 0.8, y = 0.7219281, name="(x=0.8,Entropy=0.72)", marker = list(size = 20,
                                    color = 'green', line = list(color = 'yellow', width = 2))) %>%
  layout(title="Binary Class Entropy",
         xaxis=list(title="Proportion of Class 1 Observations", scaleanchor="y"),  # control the y:x axes aspect ratio
         yaxis = list(title="Entropy", scaleanchor  = "x", range=c(-0.1, 1.1)), legend = list(orientation = 'h'),
         annotations = list(text=modelLabels,  x=modelLabels.x, y=modelLabels.y, textangle=90,
                            font=list(size=15, color=modelLabels.col), showarrow=FALSE))

The closer the binary proportion split is to $0.5$ the greater the entropy. The more homogeneous the split (one class becomes the majority) the lower the entropy. Decision trees aim to find splits in the data that reduce the entropy, i.e., increasing the homogeneity of the elements within all classes.

This measuring mechanism could be used to measure and compare the information we get using different features as data partitioning characteristics. Let’s consider this scenario. Suppose $S$ and $S_1$ represent the entropy of the system before and after the splitting/partitioning of the data according to a specific data feature attribute ($F$). Denote the entropies of the original and the derived partition by $Entropy(S)$ and $Entropy(S_1)$, respectively. The information we gained from partitioning the data using this specific feature (F) is calculated as a change in the entropy:

\[Gain(F) = Entropy(S)-Entropy(S_1)\]

Note that smaller entropy $Entropy(S_1)$ corresponds with better classification and more information gained. A more complicated case would be that the partitions create multiple segments. Then, the entropy for each partition method is calculated by the following formula:

\[Entropy(S)=\sum_{i=1}^n w_i Entropy(P_i)=\sum_{i=1}^{n} w_i \left (-\sum_{j=1}^{c}p_i\log_2(p_i)\right )\]

Where $w_i$ is the proportion of examples falling in that segment and $P_i$ is segment i. Thus, the total entropy of a partition method is calculated by a weighted sum of entropies for each segment created by this method.

When we get the maximum reduction in entropy with a feature ($F$) then the $Gain(F)=Entropy(S)$, since $Entropy(S_1)=0$. On the contrary, if we gain no information with this feature, we have $Gain(F)=0$.

3.3.5 Pruning the Decision Tree

While making a decision tree, we can classify those observations using as many splits as we want. This eventually might over classify our data. An extreme example of this would be that we make each observation as a class, which is meaningless.

So how to control the size of the decision tree? One possible solution is to make a cutoff for the number of decisions that a decision tree could make. Similarly, we can control the minimum number of points in each segment. This method is called early stopping, or pre-pruning, the decision tree. However, this might terminate the decision procedure ahead of some important pending partition.

Another solution post-pruning can be applied when we begin with growing a big decision tree and subsequently reduce the branches based on error rates with penalty at the nodes. This is often more effective than the pre-pruning solution.

The C5.0 algorithm uses the post-pruning method to control the size of the decision tree. It first grows an overfitting large tree that contains all the possibilities of partitioning. Then, it cuts out nodes and branches with little effect on classification errors.

3.4 Case Study 1: Quality of Life and Chronic Disease

3.4.1 Step 1: Collecting Data

Let’s use the Quality of life and chronic disease dataset, Case06_QoL_Symptom_ChronicIllness.csv, which has 41 variables. Detailed description for each variable is provided here.

Important variables:

Charlson Comorbidity Index: ranging from 0-10. A score of 0 indicates no comorbid conditions. Higher scores indicate a greater level of comorbidity.
Chronic Disease Score: A summary score based on the presence and complexity of prescription medications for select chronic conditions. A high score in decades the patient has severe chronic diseases. -9 indicates a missing value.

3.4.2 Step 2 - exploring and preparing the data

Let’s load the data first.

qol <- read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
str(qol)

## 'data.frame':    2356 obs. of  41 variables:
##  $ ID                 : int  171 171 172 179 180 180 181 182 183 186 ...
##  $ INTERVIEWDATE      : int  0 427 0 0 0 42 0 0 0 0 ...
##  $ LANGUAGE           : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ AGE                : int  49 49 62 44 64 64 52 48 49 78 ...
##  $ RACE_ETHNICITY     : int  3 3 3 7 3 3 3 3 3 4 ...
##  $ SEX                : int  2 2 2 2 1 1 2 1 1 1 ...
##  $ QOL_Q_01           : int  4 4 3 6 3 3 4 2 3 5 ...
##  $ QOL_Q_02           : int  4 3 3 6 2 5 4 1 4 6 ...
##  $ QOL_Q_03           : int  4 4 4 6 3 6 4 3 4 4 ...
##  $ QOL_Q_04           : int  4 4 2 6 3 6 2 2 5 2 ...
##  $ QOL_Q_05           : int  1 5 4 6 2 6 4 3 4 3 ...
##  $ QOL_Q_06           : int  4 4 2 6 1 2 4 1 2 4 ...
##  $ QOL_Q_07           : int  1 2 5 -1 0 5 8 4 3 7 ...
##  $ QOL_Q_08           : int  6 1 3 6 6 6 3 1 2 4 ...
##  $ QOL_Q_09           : int  3 4 3 6 2 2 4 2 2 4 ...
##  $ QOL_Q_10           : int  3 1 3 6 3 6 3 2 4 3 ...
##  $ MSA_Q_01           : int  1 3 2 6 2 3 4 1 1 2 ...
##  $ MSA_Q_02           : int  1 1 2 6 1 6 4 3 2 4 ...
##  $ MSA_Q_03           : int  2 1 2 6 1 2 3 3 1 2 ...
##  $ MSA_Q_04           : int  1 3 2 6 1 2 1 4 1 5 ...
##  $ MSA_Q_05           : int  1 1 1 6 1 2 1 6 3 2 ...
##  $ MSA_Q_06           : int  1 2 2 6 1 2 1 1 2 2 ...
##  $ MSA_Q_07           : int  2 1 3 6 1 1 1 1 1 5 ...
##  $ MSA_Q_08           : int  1 1 1 6 1 1 1 1 2 1 ...
##  $ MSA_Q_09           : int  1 1 1 6 2 2 4 6 2 1 ...
##  $ MSA_Q_10           : int  1 1 1 6 1 1 1 1 1 3 ...
##  $ MSA_Q_11           : int  2 3 2 6 1 1 2 1 1 5 ...
##  $ MSA_Q_12           : int  1 1 2 6 1 1 2 6 1 3 ...
##  $ MSA_Q_13           : int  1 1 1 6 1 6 2 1 4 2 ...
##  $ MSA_Q_14           : int  1 1 1 6 1 2 1 1 3 1 ...
##  $ MSA_Q_15           : int  2 1 1 6 1 1 3 2 1 3 ...
##  $ MSA_Q_16           : int  2 3 5 6 1 2 1 2 1 2 ...
##  $ MSA_Q_17           : int  2 1 1 6 1 1 1 1 1 3 ...
##  $ PH2_Q_01           : int  3 2 1 5 1 1 3 1 2 3 ...
##  $ PH2_Q_02           : int  4 4 1 5 1 2 1 1 4 2 ...
##  $ TOS_Q_01           : int  2 2 2 4 1 1 2 2 1 1 ...
##  $ TOS_Q_02           : int  1 1 1 4 4 4 1 2 4 4 ...
##  $ TOS_Q_03           : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ TOS_Q_04           : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ CHARLSONSCORE      : int  2 2 3 1 0 0 2 8 0 1 ...
##  $ CHRONICDISEASESCORE: num  1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...

Most of the coded variables like QOL_Q_01(health rating) have ordinal values (1=excellent, 2=very good, 3=good, 4=fair, 5= poor, 6=no answer) and we can use the table() function to see their distributions. We also have some numerical variables in the dataset like CHRONICDISEASESCORE, which can be examined using summary().

Our outcome of interest CHRONICDISEASESCORE has some missing data. A simple way to address this is just to remove those observations with missing values. However, we could also try to impute the missing values using various imputation methods mentioned in Chapter 2.

table(qol$QOL_Q_01)

## 
##   1   2   3   4   5   6 
##  44 213 801 900 263 135

qol <- qol[!qol$CHRONICDISEASESCORE==-9, ]
summary(qol$CHRONICDISEASESCORE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.880   1.395   1.497   1.970   4.760

Let’s create two classes using the variable CHRONICDISEASESCORE. We classify the patients with CHRONICDISEASESCORE < mean(CHRONICDISEASESCORE) as having minor disease and the rest as having severe disease. This dichotomous classification (qol$cd) may not be perfect and we will talk about alternative classification strategies in the practice problem at the end of the chapter.

qol$cd <- qol$CHRONICDISEASESCORE > 1.497   # mean(qol$CHRONICDISEASESCORE)
qol$cd <- factor(qol$cd, levels=c(F, T), labels = c("minor_disease", "severe_disease"))

3.4.2.1 Data preparation - creating random training and test datasets

To make the qol data more organized, we can order the data by the variable ID.

qol <- qol[order(qol$ID), ]

# Remove ID (col=1) # the clinical Diagnosis (col=41) will be handled later
qol <- qol[ , -1]

Then, we are able to subset training and testing datasets. Here is an example of a non-random split of the entire data into training ($2114$) and testing ($100$) sets:

qol_train <- qol[1:2114, ]
qol_test <- qol[2115:2214, ]

And here is an example of random assignments of cases into training and testing sets (80-20% split):

set.seed(1234)
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train <- qol[train_index, ]
qol_test <- qol[-train_index, ]

We can quickly inspect the distributions of the training and testing data to ensure they are not vastly different. We can see that the classes are split fairly equal in training and test datasets.

prop.table(table(qol_train$cd))

## 
##  minor_disease severe_disease 
##      0.5324675      0.4675325

prop.table(table(qol_test$cd))

## 
##  minor_disease severe_disease 
##      0.4853273      0.5146727

3.4.3 Step 3 - training a model on the data

In this section we are using the C5.0() function from the C50 package.

The function C5.0() has following components:

m<-C5.0(train, class, trials=1, costs=NULL)

train: data frame containing numeric training data (features)
class: factor vector with the class for each row in the training data.
trials: an optional number to control the boosting iterations (default=1).
costs: an optional matrix to specify the costs of false positive and false negative.

You could delete the # in the following code and run it in R to install and load the C50 package.

# install.packages("C50")
library(C50)

In the qol dataset (ID column is already removed), column 41 is the class vector (qol$cd) and column 40 is the numerical version of vector 41 (qol$CHRONICDISEASESCORE). We need to delete these two columns to create our training data that only contains feature variables that will be used for predicting the outcome.

summary(qol_train[,-c(40, 41)])

##  INTERVIEWDATE       LANGUAGE          AGE        RACE_ETHNICITY 
##  Min.   :  0.00   Min.   :1.000   Min.   :20.00   Min.   :1.000  
##  1st Qu.:  0.00   1st Qu.:1.000   1st Qu.:52.00   1st Qu.:3.000  
##  Median :  0.00   Median :1.000   Median :59.00   Median :3.000  
##  Mean   : 20.29   Mean   :1.199   Mean   :58.55   Mean   :3.618  
##  3rd Qu.:  0.00   3rd Qu.:1.000   3rd Qu.:66.00   3rd Qu.:4.000  
##  Max.   :440.00   Max.   :2.000   Max.   :90.00   Max.   :7.000  
##       SEX           QOL_Q_01        QOL_Q_02        QOL_Q_03    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :1.000   Median :4.000   Median :3.000   Median :4.000  
##  Mean   :1.423   Mean   :3.678   Mean   :3.417   Mean   :3.705  
##  3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :2.000   Max.   :6.000   Max.   :6.000   Max.   :6.000  
##     QOL_Q_04        QOL_Q_05        QOL_Q_06        QOL_Q_07     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :-1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.: 1.000  
##  Median :3.000   Median :3.000   Median :3.000   Median : 5.000  
##  Mean   :3.021   Mean   :3.099   Mean   :2.963   Mean   : 4.172  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.: 7.000  
##  Max.   :6.000   Max.   :6.000   Max.   :6.000   Max.   :10.000  
##     QOL_Q_08        QOL_Q_09        QOL_Q_10        MSA_Q_01    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :2.885   Mean   :3.151   Mean   :2.922   Mean   :2.975  
##  3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :6.000   Max.   :6.000   Max.   :6.000   Max.   :6.000  
##     MSA_Q_02        MSA_Q_03        MSA_Q_04        MSA_Q_05    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :2.000   Median :2.000   Median :1.000  
##  Mean   :3.066   Mean   :2.298   Mean   :2.404   Mean   :1.895  
##  3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :6.000   Max.   :6.000   Max.   :6.000   Max.   :6.000  
##     MSA_Q_06        MSA_Q_07        MSA_Q_08       MSA_Q_09        MSA_Q_10    
##  Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :2.000   Median :1.00   Median :2.000   Median :1.000  
##  Mean   :2.401   Mean   :2.172   Mean   :1.48   Mean   :2.193   Mean   :1.717  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:1.00   3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :6.000   Max.   :6.000   Max.   :6.00   Max.   :6.000   Max.   :6.000  
##     MSA_Q_11        MSA_Q_12        MSA_Q_13        MSA_Q_14    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :2.242   Mean   :2.036   Mean   :1.897   Mean   :2.002  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :6.000   Max.   :6.000   Max.   :6.000   Max.   :6.000  
##     MSA_Q_15        MSA_Q_16        MSA_Q_17        PH2_Q_01    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :2.000  
##  Mean   :1.765   Mean   :1.823   Mean   :2.061   Mean   :2.479  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:4.000  
##  Max.   :6.000   Max.   :6.000   Max.   :6.000   Max.   :6.000  
##     PH2_Q_02        TOS_Q_01        TOS_Q_02        TOS_Q_03    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:4.000  
##  Median :2.000   Median :2.000   Median :4.000   Median :4.000  
##  Mean   :2.368   Mean   :2.211   Mean   :3.297   Mean   :3.787  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :6.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##     TOS_Q_04     CHARLSONSCORE    
##  Min.   :1.000   Min.   :-9.0000  
##  1st Qu.:5.000   1st Qu.: 0.0000  
##  Median :5.000   Median : 1.0000  
##  Mean   :4.679   Mean   : 0.9339  
##  3rd Qu.:5.000   3rd Qu.: 1.0000  
##  Max.   :6.000   Max.   :10.0000

set.seed(1234)
qol_model <- C5.0(qol_train[,-c(40, 41)], qol_train$cd)
qol_model

## 
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
## 
## Classification Tree
## Number of samples: 1771 
## Number of predictors: 39 
## 
## Tree size: 5 
## 
## Non-standard options: attempt to group attributes

summary(qol_model)

## 
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed Feb 26 10:36:46 2025
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 1771 cases (40 attributes) from undefined.data
## 
## Decision tree:
## 
## CHARLSONSCORE <= 0: minor_disease (645/169)
## CHARLSONSCORE > 0:
## :...CHARLSONSCORE > 5: minor_disease (29/9)
##     CHARLSONSCORE <= 5:
##     :...AGE > 47: severe_disease (936/352)
##         AGE <= 47:
##         :...MSA_Q_08 <= 1: minor_disease (133/46)
##             MSA_Q_08 > 1: severe_disease (28/8)
## 
## 
## Evaluation on training data (1771 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       5  584(33.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     583   360    (a): class minor_disease
##     224   604    (b): class severe_disease
## 
## 
##  Attribute usage:
## 
##  100.00% CHARLSONSCORE
##   61.94% AGE
##    9.09% MSA_Q_08
## 
## 
## Time: 0.0 secs

# plot(qol_model, subtree = 17) # ?C50::plot.C5.0

The output of qol_model indicates that we have a tree that has 25 terminal nodes. summary(qol_model) suggests that the classification error for the decision tree is 28% percent in the training data.

3.4.4 Step 4 - evaluating model performance

Now we can make predictions using the decision tree that we just built. The predict() function we will use is the same as the one we showed earlier, e.g., in the Naive Bayes classifier, and in Chapter 2. In general, predict() is extended by each specific type of regression, classification, clustering or forecasting machine learning technique. For example, randomForest::predict.randomForest() is invoked by:

predict(RF_model, newdata, type="response", norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE, cutoff, ...),

where type represents the type of prediction output to be generated - “response” (equivalent to “class”), “prob” or “votes”. Thus, the predicted values are either predicted “response” class labels, matrix of class probabilities, or vote counts.

This time we are going to introduce the confusionMatrix() function under package caret as the evaluation method. When we combine it with a table() function, the output of the evaluation is very straight forward.

# See docs for predict # ?C50::predict.C5.0
qol_pred <- predict(qol_model, qol_test[ ,-c(40, 41)])  # removing the last 2 columns CHRONICDISEASESCORE and cd, which represent the clinical outcomes we are predicting!
# install.packages("caret")
library(caret)
confusionMatrix(table(qol_pred, qol_test$cd))

## Confusion Matrix and Statistics
## 
##                 
## qol_pred         minor_disease severe_disease
##   minor_disease            143             71
##   severe_disease            72            157
##                                           
##                Accuracy : 0.6772          
##                  95% CI : (0.6315, 0.7206)
##     No Information Rate : 0.5147          
##     P-Value [Acc > NIR] : 3.001e-12       
##                                           
##                   Kappa : 0.3538          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.6651          
##             Specificity : 0.6886          
##          Pos Pred Value : 0.6682          
##          Neg Pred Value : 0.6856          
##              Prevalence : 0.4853          
##          Detection Rate : 0.3228          
##    Detection Prevalence : 0.4831          
##       Balanced Accuracy : 0.6769          
##                                           
##        'Positive' Class : minor_disease   
##

The Confusion Matrix shows prediction accuracy is about 63%, however, this may vary (see the corresponding confidence interval).

3.4.5 Step 5 - trial option

Recall that the C5.0() function has a trials option, an integer specifying the number of boosting iterations. The default value of one indicates that a single model is used, and we can specify a larger number of iterations, for instance trials=6.

set.seed(1234)
qol_boost6 <- C5.0(qol_train[ , -c(40, 41)], qol_train$cd, trials=6) 
# try alternative values for the trials option
qol_boost6

## 
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd, trials = 6)
## 
## Classification Tree
## Number of samples: 1771 
## Number of predictors: 39 
## 
## Number of boosting iterations: 6 
## Average tree size: 4.7 
## 
## Non-standard options: attempt to group attributes

The size of the tree may vary with each repeated experiment and depends on the parameter settings.

Since this is a fairly small tree we can visualize it by the function plot(). We also use the option type="simple" to make the tree look more condensed. We can also zoom in and display just a specific branch of the tree by using the subtree parameter.

plot(qol_boost6, type="simple")   # C50::plot.C5.0

Caution: The plotting of decision trees will fail if you have columns that start with numbers or special characters (e.g., “5variable”, “!variable”). In general, avoid spaces, special characters, and other non-terminal symbols in column/row names.

The next step is to demonstrate prediction and testing of the accuracy of a fitted decition tree model.

qol_boost_pred6 <- predict(qol_boost6, qol_test[ ,-c(40, 41)])
confusionMatrix(table(qol_boost_pred6, qol_test$cd))

## Confusion Matrix and Statistics
## 
##                 
## qol_boost_pred6  minor_disease severe_disease
##   minor_disease            145             69
##   severe_disease            70            159
##                                           
##                Accuracy : 0.6862          
##                  95% CI : (0.6408, 0.7292)
##     No Information Rate : 0.5147          
##     P-Value [Acc > NIR] : 1.747e-13       
##                                           
##                   Kappa : 0.3718          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.6744          
##             Specificity : 0.6974          
##          Pos Pred Value : 0.6776          
##          Neg Pred Value : 0.6943          
##              Prevalence : 0.4853          
##          Detection Rate : 0.3273          
##    Detection Prevalence : 0.4831          
##       Balanced Accuracy : 0.6859          
##                                           
##        'Positive' Class : minor_disease   
##

The accuracy is about 64%. However, this may vary each time we run the experiment (mind the confidence interval). In some studies, the trials option provides significant improvement to the overall accuracy. A good choice for this option is trials=10.

3.4.5.1 Loading the misclassification error matrix

Suppose we want to reduce the false negative rate, which corresponds to the case of misclassifying a severe case as mild. False negative (failure to detect a severe disease case) may be more costly than false positive (misclassifying a minor disease case as severe). Misclassification errors can be expressed as a matrix, which can be loaded to penalize specific types of erroneous predicitons.

error_cost <- matrix(c(0, 1, 4, 0), nrow = 2)
error_cost

##      [,1] [,2]
## [1,]    0    4
## [2,]    1    0

Let’s build a decision tree with the option cpsts=error_cost.

set.seed(1234)
qol_cost <- C5.0(qol_train[-c(40, 41)], qol_train$cd, costs=error_cost)
qol_cost_pred <- predict(qol_cost, qol_test)
confusionMatrix(table(qol_cost_pred, qol_test$cd))

## Confusion Matrix and Statistics
## 
##                 
## qol_cost_pred    minor_disease severe_disease
##   minor_disease             73             14
##   severe_disease           142            214
##                                           
##                Accuracy : 0.6479          
##                  95% CI : (0.6014, 0.6923)
##     No Information Rate : 0.5147          
##     P-Value [Acc > NIR] : 1.024e-08       
##                                           
##                   Kappa : 0.2829          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.3395          
##             Specificity : 0.9386          
##          Pos Pred Value : 0.8391          
##          Neg Pred Value : 0.6011          
##              Prevalence : 0.4853          
##          Detection Rate : 0.1648          
##    Detection Prevalence : 0.1964          
##       Balanced Accuracy : 0.6391          
##                                           
##        'Positive' Class : minor_disease   
##

Although the overall accuracy decreased, the false negative cell labels were reduced from 75 (without specifying a cost matrix) to 17 (when specifying a non-trivial (loaded) cost matrix). This comes at the cost of increasing the rate of false-positive labeling (minor disease cases misclassified as severe).

3.4.5.2 Parameter Tuning

There are multiple choices to plot trees fitted by rpart or C5.0.

library("rpart") 
# remove CHRONICDISEASESCORE, but keep *cd* label
set.seed(1234)
qol_model<-rpart(cd~., data=qol_train[, -40], cp=0.01) 
# here we use rpart::cp = *complexity parameter* = 0.01
qol_model

## n= 1771 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1771 828 minor_disease (0.5324675 0.4675325)  
##   2) CHARLSONSCORE< 0.5 645 169 minor_disease (0.7379845 0.2620155) *
##   3) CHARLSONSCORE>=0.5 1126 467 severe_disease (0.4147425 0.5852575)  
##     6) AGE< 47.5 165  66 minor_disease (0.6000000 0.4000000) *
##     7) AGE>=47.5 961 368 severe_disease (0.3829344 0.6170656) *

You can also plot directly using rpart.plot

library(rpart.plot)

## 
## Attaching package: 'rpart.plot'

## The following object is masked from 'package:rpart.utils':
## 
##     rpart.rules

rpart.plot(qol_model, type = 4,extra = 1,clip.right.labs = F)

The method fancyRpartPlot() provides another decision tree rendering.

library("rattle")

## Loading required package: tibble

## Loading required package: bitops

## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

fancyRpartPlot(qol_model, cex = 1, caption = "rattle::fancyRpartPlot (QoL Data)")

qol_pred <- predict(qol_model, qol_test,type = 'class')
confusionMatrix(table(qol_pred, qol_test$cd))

## Confusion Matrix and Statistics
## 
##                 
## qol_pred         minor_disease severe_disease
##   minor_disease            143             74
##   severe_disease            72            154
##                                           
##                Accuracy : 0.6704          
##                  95% CI : (0.6245, 0.7141)
##     No Information Rate : 0.5147          
##     P-Value [Acc > NIR] : 2.276e-11       
##                                           
##                   Kappa : 0.3405          
##                                           
##  Mcnemar's Test P-Value : 0.934           
##                                           
##             Sensitivity : 0.6651          
##             Specificity : 0.6754          
##          Pos Pred Value : 0.6590          
##          Neg Pred Value : 0.6814          
##              Prevalence : 0.4853          
##          Detection Rate : 0.3228          
##    Detection Prevalence : 0.4898          
##       Balanced Accuracy : 0.6703          
##                                           
##        'Positive' Class : minor_disease   
##

These results are consistent with their counterparts reported using C5.0. How can we tune the parameters to further improve the results?

set.seed(1234)
control = rpart.control(cp = 0.000, xxval = 100, minsplit = 2)
qol_model= rpart(cd ~ ., data = qol_train[ , -40], control = control)
# plotcp(qol_model)
# printcp(qol_model)

data <- as.data.frame(qol_model$cptable)
data$CP <- as.factor(data$CP)

plot_ly(data = data, x = ~CP, y = ~xerror, type = 'scatter', mode='lines+markers',
        name = 'Test', error_y = ~list(array = xstd,  color = 'gray')) %>% 
  #add_segments(x=0, xend=0.23, y=0.76, yend = 0.76, line = list(dash = "dash"), showlegend=F) %>%
  layout(title="Complexity Parameter vs. (CV) Error Rate")

Now, we can prune the tree according to the optimal cp, complexity parameter to which the rpart object will be trimmed. Instead of using the real error (e.g., $1-R^2$, RMSE) to capture the discrepancy between the observed labels and the model-predicted labels, we will use the xerror which averages the discrepancy between observed and predicted classifications using cross-validation, see Chapter 9. We can compare the full and pruned decision trees by contrasting the outputs of summary(qol_model) and summary(selected_tr).

set.seed(1234)
selected_tr <- prune(qol_model, cp= qol_model$cptable[which.min(qol_model$cptable[,"xerror"]),"CP"])
fancyRpartPlot(selected_tr, cex = 1, caption = "rattle::fancyRpartPlot (QoL Data)")

qol_pred_tune <- predict(selected_tr, qol_test, type = 'class')
confusionMatrix(table(qol_pred_tune, qol_test$cd))

## Confusion Matrix and Statistics
## 
##                 
## qol_pred_tune    minor_disease severe_disease
##   minor_disease            143             74
##   severe_disease            72            154
##                                           
##                Accuracy : 0.6704          
##                  95% CI : (0.6245, 0.7141)
##     No Information Rate : 0.5147          
##     P-Value [Acc > NIR] : 2.276e-11       
##                                           
##                   Kappa : 0.3405          
##                                           
##  Mcnemar's Test P-Value : 0.934           
##                                           
##             Sensitivity : 0.6651          
##             Specificity : 0.6754          
##          Pos Pred Value : 0.6590          
##          Neg Pred Value : 0.6814          
##              Prevalence : 0.4853          
##          Detection Rate : 0.3228          
##    Detection Prevalence : 0.4898          
##       Balanced Accuracy : 0.6703          
##                                           
##        'Positive' Class : minor_disease   
##

The result is roughly the same as that of C5.0. Despite the fact that there is no substantial classification improvement, the tree-pruning process generates a graphical representation of the decision making protocol (selected_tr) that is much simpler and intuitive compared to the original (un-pruned) tree (qol_model):

fancyRpartPlot(qol_model, cex = 0.1, caption = "rattle::fancyRpartPlot (QoL Data)")

3.5 Comparing different impurity indices

We can change split=“entropy” to “error” or “gini” to apply an alternative information gain index. Experiment with these and compare the results.

set.seed(1234)
qol_model = rpart(cd ~ ., data=qol_train[ , -40], parms = list(split = "entropy"))
fancyRpartPlot(qol_model, cex = 1, caption = "rattle::fancyRpartPlot (QoL data)")

# Modify and test using "error" and "gini"
# qol_pred<-predict(qol_model, qol_test,type = 'class')
# confusionMatrix(table(qol_pred, qol_test$cd))

3.6 Classification rules

In addition to the classification trees we just saw, we can explore classification rules that utilize if-else logical statements to assign classes to unlabeled data. Below we review three classification rule strategies.

3.6.1 Separate and conquer

Separate and conquer repeatedly splits the data (and subsets of the data) by rules that cover a subset of examples. This procedure is very similar to the Divide and conquer approach. However, a notable difference is that each rule can be independent, yet, each decision node in a tree has to be linked to past decisions.

3.6.2 The One Rule algorithm

To understand the One Rule (OneR) algorithm, we need to know about its “sibling” - ZeroR rule. ZeroR rule means that we assign the mode class to unlabeled test observations regardless of its feature value. The One Rule (OneR) algorithm is an improved version of ZeroR that uses a single rule for classification. In other words, OneR splits the training dataset into several segments based on feature values. Then, it assigns the mode of the classes in each segment to related observations in the unlabeled test data. In practice, we first test multiple rules and pick the rule with the smallest error rate to be our One Rule. Remember, the rules are subjective.

3.6.3 The RIPPER algorithm

The Repeated Incremental Pruning to Produce Error Reduction algorithm is a combination of the ideas behind decision tree and classification rules. It consists of a three-step process:

Grow: add conditions to a rule until it cannot split the data into more segments.
Prune: delete some of the conditions that have large error rates.
Optimize: repeat the above two steps until we cannot add or delete any of the conditions.

3.7 Case Study 2: QoL in Chronic Disease (Take 2)

Let’s take another look at the same dataset as Case Study 1 (QoL in Chronic Disease) - this time applying classification rules. Naturally, we skip over the first two data handling steps and go directly to step three.

3.7.1 (Continuing) Step 3 - training a model on the data

Let’s start by using the OneR() function in the RWeka package. Before installing the package you might want to check that the Java program in your computer is up to date. Also, its version has to match the version of R (i.e., 64bit R needs 64bit Java).

The function OneR() has the following components:

m<-OneR(class~predictors, data=mydata)

class: factor vector with the class for each row mydata.
predictors: feature variables in mydata. If we want to include $x_1$, $x_2$ as predictors and $y$ as the class label variable, we do $y\sim x_1+x_2$. When using a y ~ . instead, we include all of the column variables as predictors.
mydata: the dataset where the features and labels could be found.

# install.packages("RWeka")
library(RWeka)
# just remove the CHRONICDISEASESCORE but keep cd
set.seed(1234)
qol_1R <- OneR(cd~., data=qol[ , -40])
qol_1R

## CHARLSONSCORE:
##  < -4.5  -> severe_disease
##  < 0.5   -> minor_disease
##  < 5.5   -> severe_disease
##  < 8.5   -> minor_disease
##  >= 8.5  -> severe_disease
## (1453/2214 instances correct)

Note that $1,453$ out of $2,214$ cases are correctly classified, $66\%$, by the “one rule”.

3.7.2 Step 4 - evaluating model performance

summary(qol_1R)

## 
## === Summary ===
## 
## Correctly Classified Instances        1453               65.6278 %
## Incorrectly Classified Instances       761               34.3722 %
## Kappa statistic                          0.3206
## Mean absolute error                      0.3437
## Root mean squared error                  0.5863
## Relative absolute error                 68.8904 %
## Root relative squared error            117.3802 %
## Total Number of Instances             2214     
## 
## === Confusion Matrix ===
## 
##    a   b   <-- classified as
##  609 549 |   a = minor_disease
##  212 844 |   b = severe_disease

We obtain a rule that correctly specifies 66% of the patients, which is in line with the prior decision tree classification results. Due to algorithmic stochasticity, it’s normal that these results may vary each time you run the algorithm, albeit we used set.seed(1234) to ensure some result reproducibility.

3.7.3 Step 5 - Alternative model1

Another possible option for the classification rules would be the RIPPER rule algorithm that we discussed earlier in the chapter. In R we use the Java based function JRip() to invoke this algorithm.

JRip() function uses a similar invocation protocol as OneR():

m<-JRip(class~predictors, data=mydata)

set.seed(1234)
qol_jrip1 <- JRip(cd~., data=qol[ , -40])
qol_jrip1

## JRIP rules:
## ===========
## 
## (CHARLSONSCORE >= 1) and (RACE_ETHNICITY >= 4) and (AGE >= 49) => cd=severe_disease (448.0/132.0)
## (CHARLSONSCORE >= 1) and (AGE >= 53) => cd=severe_disease (645.0/265.0)
##  => cd=minor_disease (1121.0/360.0)
## 
## Number of Rules : 3

summary(qol_jrip1)

## 
## === Summary ===
## 
## Correctly Classified Instances        1457               65.8085 %
## Incorrectly Classified Instances       757               34.1915 %
## Kappa statistic                          0.3158
## Mean absolute error                      0.4459
## Root mean squared error                  0.4722
## Relative absolute error                 89.3711 %
## Root relative squared error             94.5364 %
## Total Number of Instances             2214     
## 
## === Confusion Matrix ===
## 
##    a   b   <-- classified as
##  761 397 |   a = minor_disease
##  360 696 |   b = severe_disease

This JRip() classifier uses only 3 rules and has a relatively similar accuracy of 66%. As each individual has unique characteristics, classification in real world data is rarely perfect (close to 100% accuracy).

3.7.4 Step 5 - Alternative model2

Another idea is to repeat the generation of trees multiple times, predict according to each tree’s performance, and finally ensemble those weighted votes into a combined classification result. This is precisely the idea behind random forest classification, see Chapter 9.

# install.packages("randomForest")
require(randomForest)
set.seed(12)
# rf.fit <- tuneRF(qol_train[ , -40], qol_train[ , 40], stepFactor=1.5)
rf.fit <- randomForest(cd ~ . , data=qol_train[ , -40],importance=TRUE,ntree=2000,mtry=26)
# varImpPlot(rf.fit, cex=0.5); print(rf.fit)

# type either 1 or 2, specifying the type of importance measure (1=mean decrease in accuracy, 2=mean decrease in node impurity)

# Accuracy
imp_rf.fit <- importance(rf.fit)
x <- rownames(imp_rf.fit)
y <- imp_rf.fit[,3]
plot_ly(x = ~y, y = ~reorder(x,y),  name = "Var.Imp", type = "bar") %>%
  layout(title="RF Variable Importance Plot (Accuracy)",
         xaxis=list(title="Importance (mean decrease in accuracy)"),  
         yaxis = list(title="Variables (Ordered)"))

# Gini Index
y <- imp_rf.fit[,4]
plot_ly(x = ~y, y = ~reorder(x,y),  name = "Var.Imp", type = "bar") %>%
  layout(title="RF Variable Importance Plot (Gini)",
         xaxis=list(title="Importance (Gini mean decrease)"),  
         yaxis = list(title="Variables (Ordered)"))

rf.fit1 <- randomForest(cd~. , data=qol_train[ , -40],importance=TRUE,ntree=2000,mtry=26)
rf.fit2 <- randomForest(cd~. , data=qol_train[ , -40], importance=TRUE, nodesize=5, ntree=5000, mtry=26)

# plot(rf.fit,log="x",main="MSE Errors vs. Iterations: Three Models rf.fit, rf.fit1, rf.fit2")
# points(1:5000, rf.fit1$mse, col="red", type="l")
# points(1:5000, rf.fit2$mse, col="green", type="l")
# legend("topright", col=c("black", "red", "green"), lwd=c(2, 2, 2), legend=c("rf.fit", "rf.fit1", "rf.fit2"))

plot_ly(x = c(1:1000), y = rf.fit$err.rate[1:1000,1], name = 'OOB (rf.fit)', type = 'scatter', mode = 'lines',
        line = list(width = 2)) %>% 
  add_trace(x = c(1:1000), y = rf.fit1$err.rate[1:1000,1], name = 'OOB (rf.fit1)', type = 'scatter', mode = 'lines',
        line = list(width = 2)) %>% 
  add_trace(x = c(1:1000), y = rf.fit2$err.rate[1:1000,1], name = 'OOB (rf.fit2)', type = 'scatter', mode = 'lines',
        line = list(width = 2)) %>% 
  layout(title = "Out Of Bag Error Rates for 3 RF Models",
         xaxis = list(title = "Number of Trees"),
         yaxis = list (title = "OOB Error"))

qol_pred2<-predict(rf.fit2, qol_test, type = 'class')
confusionMatrix(table(qol_pred2, qol_test$cd))

## Confusion Matrix and Statistics
## 
##                 
## qol_pred2        minor_disease severe_disease
##   minor_disease            143             72
##   severe_disease            72            156
##                                           
##                Accuracy : 0.6749          
##                  95% CI : (0.6291, 0.7184)
##     No Information Rate : 0.5147          
##     P-Value [Acc > NIR] : 5.956e-12       
##                                           
##                   Kappa : 0.3493          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.6651          
##             Specificity : 0.6842          
##          Pos Pred Value : 0.6651          
##          Neg Pred Value : 0.6842          
##              Prevalence : 0.4853          
##          Detection Rate : 0.3228          
##    Detection Prevalence : 0.4853          
##       Balanced Accuracy : 0.6747          
##                                           
##        'Positive' Class : minor_disease   
##

The variable importance plots (varplot) show the rank orders of importance of the most salient features according to the specific metric (e.g., accuracy, Gini). We can also use the randomForest::partialPlot() method to examine the relative impact of the top salient features (continuous or binary variables). For each feature, this method works by

Creating a test data set identical to the training data,
Sequentially setting the variable of interest for all observations to a selected value,
Computing the average value of the predicted response for each test set,
Plotting the results against the selected value.

Note that for binary features, partialPlot() would result in only two values of the corresponding mass function. Whereas for continuous variables, it generates the marginal probability effect of the variable corresponding to the class probability (for classification problems) or the continuous response (for regression problems). Using the QoL case-study, (Case06_QoL_Symptom_ChronicIllness.csv), this example shows the partial plots of the top six rank-ordered features of the QoL dataset chosen by the random forest model.

imp <- randomForest::importance(rf.fit)
impvar <- rownames(imp)[order(imp[, 1], decreasing=TRUE)]
# op <- par(mfrow=c(2, 3))
# for (i in 1:6) {   # seq_along(impvar)) { # to plot the marginal probabilities for all features
#     partialPlot(rf.fit, qol_train, impvar[i], xlab=impvar[i],
#                 main=paste("Partial Dependence of 'CD'\n on ", impvar[i]))
# }

p <- vector("list", 6)
for (i in 1:6) {   # seq_along(impvar)) { # to plot the marginal probabilities for all features
    p[[i]] <- partialPlot(rf.fit, qol_train, impvar[i], xlab=impvar[i],
                main=paste("Partial Dependence of 'CD'\n on ", impvar[i]), plot=F)
}

plot_ly(x = p[[1]]$x, y = p[[1]]$y, name = impvar[1], type = 'scatter', mode = 'lines',
        line = list(width = 2)) %>% 
  layout(title = paste0("CD Partial Dependence on ",impvar[1]),
         xaxis = list(title = impvar[1]), yaxis = list (title = "Impact"))

plot_ly(x = p[[2]]$x, y = p[[2]]$y, name = impvar[2], type = 'scatter', mode = 'lines',
        line = list(width = 2)) %>% 
  layout(title = paste0("CD Partial Dependence on ",impvar[2]),
         xaxis = list(title = impvar[2]), yaxis = list (title = "Impact"))

plot_ly(x = p[[3]]$x, y = p[[3]]$y, name = impvar[3], type = 'scatter', mode = 'lines',
        line = list(width = 2)) %>%
  add_trace(x = p[[4]]$x, y = p[[4]]$y, name = impvar[4], type = 'scatter', mode = 'lines',
        line = list(width = 2)) %>%
  add_trace(x = p[[5]]$x, y = p[[5]]$y, name = impvar[5], type = 'scatter', mode = 'lines',
        line = list(width = 2)) %>%
  add_trace(x = p[[6]]$x, y = p[[6]]$y, name = impvar[6], type = 'scatter', mode = 'lines',
        line = list(width = 2)) %>% 
  layout(title = paste0("CD Partial Dependence on ",impvar[3], " ", impvar[4], " ",impvar[5], " ",impvar[6]),
         xaxis = list(title = "MSA Questions"), yaxis = list (title = "Impact"))

# reset the device plot canvas
# par(op)

In random forest (RF) classification, the node size (nodesize) refers to the smallest node that can be split, i.e., nodes with fewer cases than the nodesize are never subdivided. Increasing the node size leads to smaller trees, which may compromise previous predictive power. On the flip side, increasing the tree size (maxnodes) and the number of trees (ntree) tends to increase the predictive accuracy. However, there are tradeoffs between increasing node-size and tree-size simultaneously. To optimize the RF predictive accuracy, try smaller node sizes and more trees. Ensembling (forest) results from a larger number of trees will likely generate better results.

3.7.5 Step 6 - Alternative model3

Another decision three technique we can test is Extra-Trees (extremely randomized trees). Extra-trees are similar to Random Forests with the exception that at each node, Random Forest chooses the best-cutting threshold for the feature, whereas Extra-Tree selects a uniformly random cut, so that the feature with the biggest gain (or best classification score) is chosen.. The R extraTrees package provides one implementation of Extra-Trees where single random cuts are extended to several random cuts for each feature. This extension improves performance as it reduces the probability of making very poor cuts and still maintains the designed extra-tree stochastic cutting. Below, we compare the results of four alternative chronic disease classification methods - random forest, SVM, kNN, and extra-tree.

# https://cran.r-project.org/src/contrib/Archive/extraTrees/
library(caret)
library(extraTrees)
# For direct Extra-Tree Call
# install.packages("extraTrees")
# library(extraTrees)
# set.seed(12)
# ## Request more memory to JVM (e.g., 5GB
# options( java.parameters = "-Xmx5g" )
# library(extraTrees)
# 
# system.time({
#   # call extraTrees with 3 CPU cores/threads and track exec-time
#   et.fit3 <- extraTrees(qol_train[ , -c(40,41)], qol_train[ , 41],
#                         numThreads=3, ntree=5000, mtry=26)
# })
# qol_pred3 <- predict(et.fit3, qol_test[ , -c(40,41)])
# confusionMatrix(table(qol_pred3, qol_test$cd))
# qol_pred<-predict(rf.fit, qol_test, type = 'class')
# qol_pred1<-predict(rf.fit1, qol_test, type = 'class')
# qol_pred2<-predict(rf.fit2, qol_test, type = 'class')

control <- trainControl(method="repeatedcv", number=10, repeats=3)

## Run all subsequent models in parallel
library(doParallel)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
system.time({
  rf.fit  <- train(cd~., data=qol_test[ , -40], method="rf", trControl=control);
  knn.fit <- train(cd~., data=qol_test[ , -40], method="knn", trControl=control);
  svm.fit <- train(cd~., data=qol_test[ , -40], method="svmRadial", trControl=control);
  et.fit  <- train(cd~., data=qol_test[ , -40], method="extraTrees", trControl=control)
})

##    user  system elapsed 
##    1.94    0.07   44.81

stopCluster(cl) # close multi-core cluster

results <- resamples(list(RF=rf.fit, kNN=knn.fit, SVM=svm.fit, ET=et.fit))

# summary of model differences
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, kNN, SVM, ET 
## Number of resamples: 30 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.5000000 0.6000000 0.6363636 0.6364525 0.6780303 0.7777778    0
## kNN 0.3863636 0.5027778 0.5681818 0.5589785 0.6113901 0.7045455    0
## SVM 0.4418605 0.5681818 0.6000000 0.5987902 0.6243393 0.7272727    0
## ET  0.5000000 0.5909091 0.6363636 0.6290083 0.6724806 0.7777778    0
## 
## Kappa 
##             Min.    1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF   0.006160164 0.19999942 0.2673909 0.2707204 0.3558133 0.5535714    0
## kNN -0.222222222 0.00591716 0.1363636 0.1166633 0.2223136 0.3991597    0
## SVM -0.119305857 0.13367597 0.2019697 0.1971070 0.2471653 0.4579055    0
## ET   0.000000000 0.17671518 0.2697064 0.2557297 0.3429830 0.5535714    0

# plot summaries
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(results, scales=scales)                 # Box plots of

densityplot(results, scales=scales, pch = "|") # Density plots of accuracy

dotplot(results, scales=scales)                # Dot plots of accuracy & Kappa

splom(results)      # contrast pair-wise model scatterplots of prediction accuracy

3.8 Practice Problem

3.8.1 Iris Species

The classification of the iris flowers represents an easy example of the Naive Bayesian classifier.

library(e1071)
data(iris)
nbc_model <- naiveBayes(Species ~ ., data = iris)
## alternatively:
nbc_model <- naiveBayes(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")], iris[,"Species"])
predicted.nbcvalues <- predict(nbc_model, iris[,c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])
table(predicted.nbcvalues, iris[, "Species"])

##                    
## predicted.nbcvalues setosa versicolor virginica
##          setosa         50          0         0
##          versicolor      0         47         3
##          virginica       0          3        47

3.8.2 Cancer Study

In the previous cancer case study, we classified the patients with seer_stage of “not applicable”(seer_stage=8) and “unstaged, unknown or unspecified” (seer_stage=9) as no cancer or early cancer stages. Let’s remove these two categories and replicate the Naive Bayes classifier case study again.

hn_med1 <- hn_med[!hn_med$seer_stage %in% c(8, 9), ]
str(hn_med1); dim(hn_med1)

## 'data.frame':    580 obs. of  9 variables:
##  $ PID               : int  10000 10008 10029 10063 10103 1012 10135 10143 10152 10184 ...
##  $ ENC_ID            : int  46836 46886 47034 47240 47511 3138 47739 47769 47800 47938 ...
##  $ seer_stage        : Factor w/ 9 levels "0","1","2","3",..: 2 2 5 2 2 2 2 2 7 2 ...
##  $ MEDICATION_DESC   : chr  "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
##  $ MEDICATION_SUMMARY: chr  "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 15 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg  Per PACU protocol" ...
##  $ DOSE              : chr  "150" "5000" "1.5" "50" ...
##  $ UNIT              : chr  "mg" "unit" "g" "microgram" ...
##  $ FREQUENCY         : chr  "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
##  $ TOTAL_DOSE_COUNT  : int  5 3 11 2 2 2 6 1 24 2 ...

## [1] 580   9

Now we have only 580 observations. We can either use the first 480 of them as the training dataset and the last 100 as the test dataset, or select 80-20 (training-testing) split, and evaluate the prediction accuracy when laplace=1?

We can use the same code for creating the classes in training and test dataset. Since the seer_stage=8 or 9 is not in the data, we classify seer_stage=0, 1, 2 or 3 as “early_stage” and seer_stage=4, 5 or 7 as “later_stage”.

hn_med_train1$stage <- hn_med_train1$seer_stage %in% c(4, 5, 7)
hn_med_train1$stage <- factor(hn_med_train1$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test1$stage <- hn_med_test1$seer_stage %in% c(4, 5, 7)
hn_med_test1$stage <- factor(hn_med_test1$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train1$stage))

## 
## early_stage later_stage 
##   0.7392241   0.2607759

prop.table(table(hn_med_test1$stage))

## 
## early_stage later_stage 
##   0.7413793   0.2586207

To test the naïve Bayes classifier, we can u Use the document term matrices constructed from a corpus of high-frequency terms that have appeared in at least five documents in the training dataset.

summary(findFreqTerms(hn_med_dtm_train1, 5))

##    Length     Class      Mode 
##       116 character character

hn_med_dict1 <- as.character(findFreqTerms(hn_med_dtm_train1, 5))
hn_train1 <- DocumentTermMatrix(corpus_train1, list(dictionary=hn_med_dict1))
hn_test1 <- DocumentTermMatrix(corpus_test1, list(dictionary=hn_med_dict1))

traindf1 <- data.frame(stage=hn_med_train1$stage, as.matrix(hn_train1))
testdf1 <- data.frame(stage=hn_med_test1$stage, as.matrix(hn_test1))

# for Training and Testing data, transform predictors/features to factors
for (cols in names(traindf1[-1])) traindf1[[cols]] <- factor(traindf1[[cols]])
for (cols in names(testdf1[-1])) testdf1[[cols]] <- factor(testdf1[[cols]])

# check variable types
# str(traindf)

# run NB classifier (for cancer stage)
hn_classifier1 <- naiveBayes(traindf1[,-1], traindf1$stage, laplace = 15, type = "class")
hn_test_pred1 <- predict(hn_classifier1, testdf1)
caret::confusionMatrix(hn_test_pred1, hn_med_test1$stage)

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    early_stage later_stage
##   early_stage          85          29
##   later_stage           1           1
##                                           
##                Accuracy : 0.7414          
##                  95% CI : (0.6518, 0.8182)
##     No Information Rate : 0.7414          
##     P-Value [Acc > NIR] : 0.5489          
##                                           
##                   Kappa : 0.0312          
##                                           
##  Mcnemar's Test P-Value : 8.244e-07       
##                                           
##             Sensitivity : 0.98837         
##             Specificity : 0.03333         
##          Pos Pred Value : 0.74561         
##          Neg Pred Value : 0.50000         
##              Prevalence : 0.74138         
##          Detection Rate : 0.73276         
##    Detection Prevalence : 0.98276         
##       Balanced Accuracy : 0.51085         
##                                           
##        'Positive' Class : early_stage     
##

\[ACC = \frac{T P + T N}{T P + F P + F N + T N } = \frac{86}{116}=0.74.\] Note that this is not a particularly useful classifier, as the improved accuracy corresponds with predictions that are predominantly reflecting a single class , early_stage.

3.8.3 Baseball Data

Use the MLB Data (01a_data.txt) to predict the Player's Position (or perhaps the player’s Team) using naiveBayes classifier. Compute and report the agreement between predicted and actual labels (for the player’s position). Below is some example code.

mydata <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1',as.is=T, header=T)  # 01a_data.txt
# mydata <- read.table('data.txt',as.is=T, header=T)
sample_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproductible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = sample_size)
train <- mydata[train_ind, ]
# TESTING DATA
test <- mydata[-train_ind, ]
library("e1071")
nbc_model <- naiveBayes(train[ , c("Weight", "Height", "Age")], as.factor(train$Position), laplace = 15)
nbc_model

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = train[, c("Weight", "Height", "Age")], 
##     y = as.factor(train$Position), laplace = 15)
## 
## A-priori probabilities:
## as.factor(train$Position)
##           Catcher Designated_Hitter     First_Baseman        Outfielder 
##        0.07225806        0.01806452        0.05161290        0.19225806 
##    Relief_Pitcher    Second_Baseman         Shortstop  Starting_Pitcher 
##        0.29806452        0.05677419        0.04645161        0.22064516 
##     Third_Baseman 
##        0.04387097 
## 
## Conditional probabilities:
##                          Weight
## as.factor(train$Position)     [,1]     [,2]
##         Catcher           206.6786 15.21973
##         Designated_Hitter 222.5000 22.84311
##         First_Baseman     212.5750 20.87556
##         Outfielder        200.3289 18.47951
##         Relief_Pitcher    203.9913 21.84650
##         Second_Baseman    184.8864 10.67537
##         Shortstop         182.4444 13.55717
##         Starting_Pitcher  204.9825 22.58356
##         Third_Baseman     200.5588 19.18047
## 
##                          Height
## as.factor(train$Position)     [,1]     [,2]
##         Catcher           72.89286 1.865267
##         Designated_Hitter 74.57143 1.603567
##         First_Baseman     73.97500 1.941286
##         Outfielder        73.16779 2.021515
##         Relief_Pitcher    74.53247 2.160421
##         Second_Baseman    71.27273 1.703125
##         Shortstop         71.88889 1.720096
##         Starting_Pitcher  74.70760 2.187370
##         Third_Baseman     73.08824 2.287882
## 
##                          Age
## as.factor(train$Position)     [,1]     [,2]
##         Catcher           29.75571 4.445924
##         Designated_Hitter 31.47857 4.471898
##         First_Baseman     29.78425 5.219565
##         Outfielder        28.99342 4.255701
##         Relief_Pitcher    28.61420 4.160627
##         Second_Baseman    29.32386 4.616064
##         Shortstop         28.48972 3.919462
##         Starting_Pitcher  28.00661 4.401236
##         Third_Baseman     28.73353 3.798898

predicted.nbcvalues <- predict(nbc_model, as.data.frame(test))

# report results
tab <- table(predicted.nbcvalues, test$Position)
tab_df <- tidyr::spread(as.data.frame(tab), key = Var2, value = Freq)

sum(diag(table(predicted.nbcvalues, test$Position)))

## [1] 74

plot_ly(x = colnames(tab), y = colnames(tab), z = as.matrix(tab_df[, -1]), type = "heatmap")

# write.csv(file="./test.csv" , table(predicted.nbcvalues, test$Position))

Heatmap of the actual (vertical columns) and predicted (horizontal rows) MLB data positions based on multi-class naïve Bayesian classifier.

3.8.4 Medical Specialty Text-Notes Classification

Let’s demonstrate text-classification using a clinical transcription text dataset, which consists of an index and 5 data elements - description, medical_specialty (prediction outcome target), sample_name, transcription, and keywords. Our task is to derive computed phenotypes automatically classifying the 40 different medical specialties using the clinical transcription text.

dataCT <- read.csv('https://umich.instructure.com/files/21152999/download?download_frd=1', header=T) 
str(dataCT)

## 'data.frame':    4999 obs. of  6 variables:
##  $ Index            : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ description      : chr  " A 23-year-old white female presents with complaint of allergies." " Consult for laparoscopic gastric bypass." " Consult for laparoscopic gastric bypass." " 2-D M-Mode. Doppler.  " ...
##  $ medical_specialty: chr  " Allergy / Immunology" " Bariatrics" " Bariatrics" " Cardiovascular / Pulmonary" ...
##  $ sample_name      : chr  " Allergic Rhinitis " " Laparoscopic Gastric Bypass Consult - 2 " " Laparoscopic Gastric Bypass Consult - 1 " " 2-D Echocardiogram - 1 " ...
##  $ transcription    : chr  "SUBJECTIVE:,  This 23-year-old white female presents with complaint of allergies.  She used to have allergies w"| __truncated__ "PAST MEDICAL HISTORY:, He has difficulty climbing stairs, difficulty with airline seats, tying shoes, used to p"| __truncated__ "HISTORY OF PRESENT ILLNESS: , I have seen ABC today.  He is a very pleasant gentleman who is 42 years old, 344 "| __truncated__ "2-D M-MODE: , ,1.  Left atrial enlargement with left atrial diameter of 4.7 cm.,2.  Normal size right and left "| __truncated__ ...
##  $ keywords         : chr  "allergy / immunology, allergic rhinitis, allergies, asthma, nasal sprays, rhinitis, nasal, erythematous, allegr"| __truncated__ "bariatrics, laparoscopic gastric bypass, weight loss programs, gastric bypass, atkin's diet, weight watcher's, "| __truncated__ "bariatrics, laparoscopic gastric bypass, heart attacks, body weight, pulmonary embolism, potential complication"| __truncated__ "cardiovascular / pulmonary, 2-d m-mode, doppler, aortic valve, atrial enlargement, diastolic function, ejection"| __truncated__ ...

# 1. EDA
library(dplyr)
mySummary <- dataCT %>%
  count(medical_specialty, sort = TRUE) 
mySummary

##                 medical_specialty    n
## 1                         Surgery 1103
## 2      Consult - History and Phy.  516
## 3      Cardiovascular / Pulmonary  372
## 4                      Orthopedic  355
## 5                       Radiology  273
## 6                General Medicine  259
## 7                Gastroenterology  230
## 8                       Neurology  223
## 9   SOAP / Chart / Progress Notes  166
## 10        Obstetrics / Gynecology  160
## 11                        Urology  158
## 12              Discharge Summary  108
## 13           ENT - Otolaryngology   98
## 14                   Neurosurgery   94
## 15          Hematology - Oncology   90
## 16                  Ophthalmology   83
## 17                     Nephrology   81
## 18         Emergency Room Reports   75
## 19          Pediatrics - Neonatal   70
## 20                Pain Management   62
## 21        Psychiatry / Psychology   53
## 22                   Office Notes   51
## 23                       Podiatry   47
## 24                    Dermatology   29
## 25     Cosmetic / Plastic Surgery   27
## 26                      Dentistry   27
## 27                        Letters   23
## 28      Physical Medicine - Rehab   21
## 29                 Sleep Medicine   20
## 30                  Endocrinology   19
## 31                     Bariatrics   18
## 32         IME-QME-Work Comp etc.   16
## 33                   Chiropractic   14
## 34           Diets and Nutritions   10
## 35                   Rheumatology   10
## 36              Speech - Language    9
## 37                        Autopsy    8
## 38       Lab Medicine - Pathology    8
## 39           Allergy / Immunology    7
## 40      Hospice - Palliative Care    6

plot_ly(dataCT, x = ~medical_specialty) %>%
  add_histogram()

# 2. Preprocess the medical clinical notes (transcription)
# library(tm)
dataCT_corpus <- Corpus(VectorSource(dataCT$transcription))
dataCT_corpus_clean <- tm_map(dataCT_corpus, tolower)
dataCT_corpus_clean <- tm_map(dataCT_corpus_clean, removePunctuation)
dataCT_corpus_clean <- tm_map(dataCT_corpus_clean, removeNumbers)
dataCT_corpus_clean <- tm_map(dataCT_corpus_clean, stripWhitespace)
dataCT_corpus_dtm <- DocumentTermMatrix(dataCT_corpus_clean)

set.seed(1234)
subset_train <- sample(nrow(dataCT),floor(nrow(dataCT)*0.8))  # 80% training + 20% testing
dataCT_train <- dataCT[subset_train, ]
dataCT_test <- dataCT[-subset_train, ]
dataCT_corpus_dtm_train <- dataCT_corpus_dtm[subset_train, ]
hn_med_dtm_test <- dataCT_corpus_dtm[-subset_train, ]
dataCT_corpus_train <- dataCT_corpus_clean[subset_train]
dataCT_corpus_test <- dataCT_corpus_clean[-subset_train]

dataCT_train$MedSpecFac <- factor(dataCT_train$medical_specialty)
dataCT_test$MedSpecFac <- factor(dataCT_test$medical_specialty)
# prop.table(table(dataCT_test$medical_specialty))
prop.table(table(dataCT_test$MedSpecFac))

## 
##           Allergy / Immunology                        Autopsy 
##                          0.002                          0.002 
##                     Bariatrics     Cardiovascular / Pulmonary 
##                          0.004                          0.070 
##                   Chiropractic     Consult - History and Phy. 
##                          0.003                          0.100 
##     Cosmetic / Plastic Surgery                      Dentistry 
##                          0.003                          0.004 
##                    Dermatology              Discharge Summary 
##                          0.003                          0.020 
##         Emergency Room Reports                  Endocrinology 
##                          0.024                          0.002 
##           ENT - Otolaryngology               Gastroenterology 
##                          0.016                          0.052 
##               General Medicine          Hematology - Oncology 
##                          0.051                          0.022 
##         IME-QME-Work Comp etc.       Lab Medicine - Pathology 
##                          0.003                          0.004 
##                        Letters                     Nephrology 
##                          0.001                          0.018 
##                      Neurology                   Neurosurgery 
##                          0.037                          0.021 
##        Obstetrics / Gynecology                   Office Notes 
##                          0.028                          0.009 
##                  Ophthalmology                     Orthopedic 
##                          0.020                          0.082 
##                Pain Management          Pediatrics - Neonatal 
##                          0.017                          0.012 
##      Physical Medicine - Rehab                       Podiatry 
##                          0.004                          0.012 
##        Psychiatry / Psychology                      Radiology 
##                          0.010                          0.058 
##                   Rheumatology                 Sleep Medicine 
##                          0.001                          0.004 
##  SOAP / Chart / Progress Notes              Speech - Language 
##                          0.030                          0.002 
##                        Surgery                        Urology 
##                          0.221                          0.028

summary(findFreqTerms(dataCT_corpus_dtm_train, 5))

##    Length     Class      Mode 
##     12630 character character

dataCT_corpus_dict <- as.character(findFreqTerms(dataCT_corpus_dtm_train, 5))

dataCT_train1 <- DocumentTermMatrix(dataCT_corpus_train, list(dictionary=dataCT_corpus_dict))
dataCT_test1 <- DocumentTermMatrix(dataCT_corpus_test, list(dictionary=dataCT_corpus_dict))

dataCT_train1 <- apply(dataCT_train1, MARGIN = 2, convert_counts)
dataCT_test1 <- apply(dataCT_test1, MARGIN = 2, convert_counts)

dataCT_classifier <- naiveBayes(dataCT_train1, dataCT_train$MedSpecFac, laplace = 0)
dataCT_pred <- predict(dataCT_classifier, dataCT_test1)
# table(dataCT_pred)

# report results
tab <- table(dataCT_pred, dataCT_test$MedSpecFac)
tab_df <- tidyr::spread(as.data.frame(tab), key = Var2, value = Freq)
# gmodels::CrossTable(dataCT_pred, dataCT_test$MedSpecFac)

sum(diag(table(dataCT_pred, dataCT_test$MedSpecFac)))

## [1] 51

plot_ly(x = colnames(tab), y = colnames(tab), z = as.matrix(tab_df[, -1]), type = "heatmap") %>%
  layout(title="Confusion Matrix of Naive-Bayesian Classification of 40 Medical Specialties using Text-Notes (Test Data)",
          xaxis=list(title='True Specialties'), yaxis=list(title='Derived NB Class Labels'))

Later, in Chapter 14 (Deep Learning) we will extend this example and demonstrate a more powerful binary and multinomial (categorical) classification of the medical specialty unit based on the clinical notes, using deep neural networks.

Readers are encouraged to apply the Naive Bayesian classification techniques to some new data from the list of our Case-Studies.

3.8.5 Chronic Disease

In the previous Chronic Disease case study, we classified the CHRONICDISEASESCORE into two groups. What will happen if we use three groups? Let’s separate CHRONICDISEASESCORE evenly into three groups. Recall the quantile() function that we talked about in Chapter 2. We can use it to get the cut-points for classification. Then, a for loop will help us split the variable CHRONICDISEASESCORE into three categories.

quantile(qol$CHRONICDISEASESCORE, probs = c(1/3, 2/3))

## 33.33333% 66.66667% 
##      1.06      1.80

for(i in 1:2214){
  if(qol$CHRONICDISEASESCORE[i]>0.7&qol$CHRONICDISEASESCORE[i]<2.2){
    qol$cdthree[i]=2
  }
  else if(qol$CHRONICDISEASESCORE[i]>=2.2){
    qol$cdthree[i]=3
  }
  else{
    qol$cdthree[i]=1
  }
}

qol$cdthree<-factor(qol$cdthree, levels=c(1, 2, 3), labels = c("minor_disease", "mild_disease", "severe_disease"))
table(qol$cdthree)

## 
##  minor_disease   mild_disease severe_disease 
##            379           1431            404

After labeling the three categories in the new variable cdthree, our job of preparing the class variable is done. Let’s follow along the earlier sections in the chapter to figure out how well the tree classifiers and rule classifiers perform in the three-category case. First, try to build a tree classifier using C5.0() with 10 boost trials. One small tip is that in the training dataset, we cannot have column 40 (CHRONICDISEASESCORE), 41 (cd) and now 42 (cdthree) because they all contain class outcome related variables.

# qol_train1<-qol[1:2114, ]
# qol_test1<-qol[2115:2214, ]
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train1<-qol[train_index, ]
qol_test1<-qol[-train_index, ]

prop.table(table(qol_train1$cdthree))

## 
##  minor_disease   mild_disease severe_disease 
##      0.1693958      0.6437041      0.1869001

prop.table(table(qol_test1$cdthree))

## 
##  minor_disease   mild_disease severe_disease 
##      0.1783296      0.6568849      0.1647856

set.seed(1234)
qol_model1<-C5.0(qol_train1[ , -c(40, 41, 42)], qol_train1$cdthree, trials=10)
qol_model1

## 
## Call:
## C5.0.default(x = qol_train1[, -c(40, 41, 42)], y = qol_train1$cdthree, trials
##  = 10)
## 
## Classification Tree
## Number of samples: 1771 
## Number of predictors: 39 
## 
## Number of boosting iterations: 10 
## Average tree size: 230 
## 
## Non-standard options: attempt to group attributes

qol_pred1<-predict(qol_model1, qol_test1)

confusionMatrix(table(qol_test1$cdthree, qol_pred1))

## Confusion Matrix and Statistics
## 
##                 qol_pred1
##                  minor_disease mild_disease severe_disease
##   minor_disease             14           61              4
##   mild_disease              17          242             32
##   severe_disease             1           59             13
## 
## Overall Statistics
##                                        
##                Accuracy : 0.6072       
##                  95% CI : (0.56, 0.653)
##     No Information Rate : 0.8172       
##     P-Value [Acc > NIR] : 1            
##                                        
##                   Kappa : 0.091        
##                                        
##  Mcnemar's Test P-Value : 1.457e-07    
## 
## Statistics by Class:
## 
##                      Class: minor_disease Class: mild_disease
## Sensitivity                       0.43750              0.6685
## Specificity                       0.84185              0.3951
## Pos Pred Value                    0.17722              0.8316
## Neg Pred Value                    0.95055              0.2105
## Prevalence                        0.07223              0.8172
## Detection Rate                    0.03160              0.5463
## Detection Prevalence              0.17833              0.6569
## Balanced Accuracy                 0.63967              0.5318
##                      Class: severe_disease
## Sensitivity                        0.26531
## Specificity                        0.84772
## Pos Pred Value                     0.17808
## Neg Pred Value                     0.90270
## Prevalence                         0.11061
## Detection Rate                     0.02935
## Detection Prevalence               0.16479
## Balanced Accuracy                  0.55651

We can see that the prediction accuracy with three categories is way lower than the one we did with two categories. In addition, we can try to build a one-rule classifier with OneR().

set.seed(1234)
qol_1R1<-OneR(cdthree~., data=qol[ , -c(40, 41)])
qol_1R1

## INTERVIEWDATE:
##  < 3.5   -> mild_disease
##  < 28.5  -> severe_disease
##  < 282.0 -> mild_disease
##  < 311.5 -> severe_disease
##  >= 311.5    -> mild_disease
## (1436/2214 instances correct)

summary(qol_1R1)

## 
## === Summary ===
## 
## Correctly Classified Instances        1436               64.86   %
## Incorrectly Classified Instances       778               35.14   %
## Kappa statistic                          0.022 
## Mean absolute error                      0.2343
## Root mean squared error                  0.484 
## Relative absolute error                 67.5977 %
## Root relative squared error            116.2958 %
## Total Number of Instances             2214     
## 
## === Confusion Matrix ===
## 
##     a    b    c   <-- classified as
##     0  375    4 |    a = minor_disease
##     0 1422    9 |    b = mild_disease
##     0  390   14 |    c = severe_disease

qol_pred1<-predict(qol_1R1, qol_test1)

confusionMatrix(table(qol_test1$cdthree, qol_pred1))

## Confusion Matrix and Statistics
## 
##                 qol_pred1
##                  minor_disease mild_disease severe_disease
##   minor_disease              0           78              1
##   mild_disease               0          289              2
##   severe_disease             0           68              5
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6637          
##                  95% CI : (0.6176, 0.7076)
##     No Information Rate : 0.9819          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0445          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: minor_disease Class: mild_disease
## Sensitivity                            NA             0.66437
## Specificity                        0.8217             0.75000
## Pos Pred Value                         NA             0.99313
## Neg Pred Value                         NA             0.03947
## Prevalence                         0.0000             0.98194
## Detection Rate                     0.0000             0.65237
## Detection Prevalence               0.1783             0.65688
## Balanced Accuracy                      NA             0.70718
##                      Class: severe_disease
## Sensitivity                        0.62500
## Specificity                        0.84368
## Pos Pred Value                     0.06849
## Neg Pred Value                     0.99189
## Prevalence                         0.01806
## Detection Rate                     0.01129
## Detection Prevalence               0.16479
## Balanced Accuracy                  0.73434

The OneRule classifier that is purely based on the value of the INTERVIEWDATE. The internal classification accuracy is 65% and equal to the external (validation data) prediction accuracy. Although, the latter assessment is a bit misleading as the vast majority of external validation data are classified in only one class - mild_disease.

Finally, let’s revisit the JRip() classifier with the same three class labels according to cdthree.

set.seed(1234)
qol_jrip1<-JRip(cdthree~., data=qol[ , -c(40, 41)])
qol_jrip1

## JRIP rules:
## ===========
## 
## (CHARLSONSCORE <= 0) and (AGE <= 50) and (MSA_Q_06 <= 1) and (QOL_Q_07 >= 1) and (MSA_Q_09 <= 1) => cdthree=minor_disease (35.0/11.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_10 >= 4) and (QOL_Q_07 >= 9) => cdthree=severe_disease (54.0/20.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 5) and (MSA_Q_09 <= 4) and (MSA_Q_04 >= 3) => cdthree=severe_disease (64.0/30.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 4) and (PH2_Q_01 >= 3) and (QOL_Q_10 >= 4) and (RACE_ETHNICITY >= 4) => cdthree=severe_disease (43.0/19.0)
##  => cdthree=mild_disease (2018.0/653.0)
## 
## Number of Rules : 5

summary(qol_jrip1)

## 
## === Summary ===
## 
## Correctly Classified Instances        1481               66.8925 %
## Incorrectly Classified Instances       733               33.1075 %
## Kappa statistic                          0.1616
## Mean absolute error                      0.3288
## Root mean squared error                  0.4055
## Relative absolute error                 94.8702 %
## Root relative squared error             97.42   %
## Total Number of Instances             2214     
## 
## === Confusion Matrix ===
## 
##     a    b    c   <-- classified as
##    24  342   13 |    a = minor_disease
##    10 1365   56 |    b = mild_disease
##     1  311   92 |    c = severe_disease

qol_pred1<-predict(qol_jrip1, qol_test1)

confusionMatrix(table(qol_test1$cdthree, qol_pred1))

## Confusion Matrix and Statistics
## 
##                 qol_pred1
##                  minor_disease mild_disease severe_disease
##   minor_disease              4           73              2
##   mild_disease               0          272             19
##   severe_disease             0           56             17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6614          
##                  95% CI : (0.6152, 0.7054)
##     No Information Rate : 0.9052          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.131           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: minor_disease Class: mild_disease
## Sensitivity                      1.000000              0.6783
## Specificity                      0.829157              0.5476
## Pos Pred Value                   0.050633              0.9347
## Neg Pred Value                   1.000000              0.1513
## Prevalence                       0.009029              0.9052
## Detection Rate                   0.009029              0.6140
## Detection Prevalence             0.178330              0.6569
## Balanced Accuracy                0.914579              0.6130
##                      Class: severe_disease
## Sensitivity                        0.44737
## Specificity                        0.86173
## Pos Pred Value                     0.23288
## Neg Pred Value                     0.94324
## Prevalence                         0.08578
## Detection Rate                     0.03837
## Detection Prevalence               0.16479
## Balanced Accuracy                  0.65455

In terms of the predictive accuracy on the testing data (qol_test1$cdthree), we can see from these outputs that the RIPPER algorithm performed better (67%) compared to the C5.0 decision tree (60%) and similarly to the OneR algorithm (65%). This suggests that simple algorithms might outperform complex methods for certain real world case-studies. Later, in Chapter 9, we will provide more details about optimizing and improving classification and prediction performance.

These supervised classification techniques can be further tested with other data from the list of our Case-Studies.

\(Features\\Classes\)	\(F_1\)	\(F_2\)	\(\cdots\)	\(F_n\)	Total
\(C_1\)	\(\cdots\)	\(\cdots\)	\(\cdots\)	\(\cdots\)	Marginal \(P(C_1)\)
\(C_2\)	\(\cdots\)	\(\cdots\)	\(\cdots\)	Joint \(P(C_2, F_n)\)	\(\cdots\)
\(\cdots\)	\(\cdots\)	\(\cdots\)	\(\cdots\)	\(\cdots\)	\(\cdots\)
\(C_L\)	Conditional \(P(F_1\mid C_L)=\frac{P(F_1,C_L)}{P(C_L)}\)	\(\cdots\)	\(\cdots\)	\(\cdots\)	\(\cdots\)
Total		Marginal \(P(F_2)\)	\(\cdots\)	\(\cdots\)	\(N\)

DSPA2: Data Science and Predictive Analytics (UMich HS650)

Supervised Classification

SOCR/MIDAS (Ivo Dinov)

February 2025