In this chapter we present a progression of machine learning, classification, and clustering techniques. There are two broad categories of machine learning methods - unsupervised and supervised (human-guided). In general, supervised classification methods aim to identify or predict predefined classes and to label new objects as members of specific classes, whereas unsupervised clustering approaches attempt to group objects into subsets without a priori labels and to determine relationships between objects.
In the context of machine learning and artificial intelligence, classification and clustering apply mostly to supervised and unsupervised problems, respectively.
Unsupervised classification refers to methods where the outcomes (groupings with common characteristics) are automatically derived based on intrinsic affinities and associations in the data, without human-provided labels. Unsupervised learning is based purely on input data (\(X\)) without corresponding output labels. The goal is to model the underlying structure, affinities, or distribution in the data in order to learn more about its intrinsic characteristics. It is called unsupervised learning because there are no a priori correct answers and no human guidance; algorithms are left to their own devices to discover and present the interesting structure in the data. Clustering (discovering the inherent groupings in the data) and association (discovering rules that describe the data) represent the core unsupervised learning problems. The k-means clustering and Apriori association-rule algorithms provide example solutions to these two unsupervised learning problems.
Decision Based | Non-parametric | Divisive (Top-Down) | Agglomerative (Bottom-Up) | Spectral | K-Means / Centroid | Graph-Theoretic | Model Based |
---|---|---|---|---|---|---|---|
Bayesian classifiers have high computational requirements. As there are a priori given labels, some prior must be specified to train the classifier, which is in turn used to label new data; the newly labeled samples are subsequently used to train a new (supervised) classifier, i.e., decision-directed unsupervised learning. If the initial classifier is not appropriate, the process may diverge | Chinese restaurant process (CRP), infinite Hidden Markov Model (HMM) | Principal Direction Divisive Partitioning (PDDP) | Start with each case in a separate cluster and repeatedly join the closest pairs into clusters until a stopping criterion is met: (1) there is only a single cluster for all cases, (2) a predetermined number of clusters is reached, or (3) the distance between the closest clusters exceeds a threshold | SpecC | kmeans, hkmeans | Growing Neural Gas | mclust |
Supervised classification methods utilize user-provided labels representative of specific classes associated with concrete observations, cases, or units. These training classes/outcomes are used as references for the classification. Many problems can be addressed by decision-support systems utilizing combinations of supervised and unsupervised classification processes. Supervised learning involves input variables (\(X\)) and an outcome variable (\(Y\)), and aims to learn a mapping function from the input to the output: \(Y = f(X)\). The goal is to approximate the mapping function so that when it is applied to new (validation) data (\(Z\)) it (accurately) predicts the (expected) outcome variables (\(Y\)). It is called supervised learning because the learning process is supervised by initial training labels that guide and correct the learning until the algorithm achieves an acceptable level of performance.
Regression (output variable is a real value) and classification (output variable is a category) problems represent the two types of supervised learning. Examples of supervised machine learning algorithms include linear regression and random forests; both provide solutions for regression-type problems, and random forests also provide solutions to classification problems.
Just like the categorization of exploratory data analytics in Chapter 3, a systematic codification of machine learning techniques is challenging. The table below attempts to provide a rough representation of common machine learning methods. However, it is not intended to be a gold-standard protocol for choosing the best analytical method. Before you settle on a specific strategy for data analysis, you should always review the data characteristics in light of the assumptions of each technique and assess the potential to gain new knowledge or extract valid information from applying a specific technique.
Inference | Outcome | Supervised | Unsupervised |
---|---|---|---|
Classification & Prediction | Binary | Classification-Rules, OneR, kNN, NaiveBayes, Decision-Tree, C5.0, AdaBoost, XGBoost, LDA/QDA, Logit/Poisson, SVM | Apriori, Association-Rules, k-Means, NaiveBayes |
Classification & Prediction | Categorical | Regression Modeling & Forecasting | Apriori, Association-Rules, k-Means, NaiveBayes |
Regression Modeling | Real Quantitative | (MLR) Regression Modeling, LDA/QDA, SVM, Decision-Tree, NeuralNet | Regression Modeling Tree, Apriori/Association-Rules |
Many of these will be discussed in later chapters. In this chapter, we will present step-by-step the k-nearest neighbor (kNN) algorithm. Specifically, we will demonstrate (1) data retrieval and normalization, (2) splitting the data into training and testing sets, (3) fitting models on the training data, (4) evaluating model performance on testing data, (5) improving model performance, and (6) determining optimal values of \(k\).
In Chapter 9, we will present detailed strategies and evaluation metrics to assess the performance of all clustering and classification methods.
Classification tasks can be very difficult when there are a large number of complex and heterogeneous features and target classes. In scenarios where items of the same class tend to be homogeneous, nearest neighbor classification may be appropriate, because assigning unlabeled cases to the class of their most similar labeled neighbors is fairly easy to accomplish.
Such classification methods can help us understand the story behind unlabeled data using known (labeled) data, while avoiding explicit modeling of the complicated features and target classes. This is because these techniques make no prior distributional assumptions. However, this non-parametric approach makes the methods rely heavily on the training instances, which explains their designation as lazy learning algorithms.
The kNN algorithm involves the following steps: (1) choose the number of neighbors \(k\) and a distance (similarity) metric; (2) compute the distances between the new (unlabeled) case and all labeled training cases; (3) identify the \(k\) training cases nearest to the new case; and (4) assign to the new case the most common class among these \(k\) neighbors.
Mathematically, for a given \(k\), a specific similarity metric \(d\), and a new testing case \(x\), the kNN classifier performs two steps (\(k\) is typically odd to avoid ties): first, it identifies the set \(A\) of the \(k\) training cases closest to \(x\) with respect to \(d\); second, it estimates the conditional probability of each class \(j\) as the fraction of cases in \(A\) whose labels equal \(j\),
\[P(y=j|X=x) =\frac{1}{k} \sum_{i\in A}{I(y^{(i)}=j)}.\]
How do we measure the similarity between records? We can think of a similarity measure as a distance metric between two records or cases. There are many distance functions to choose from; traditionally, we use the Euclidean distance as our similarity metric.
If we use a line to connect the two points representing the testing and the training records in \(n\) dimensional space, the length of the line is the Euclidean distance. If \(a, b\) both have \(n\) features, the coordinates for them are \((a_1, a_2, \cdots, a_n)\) and \((b_1, b_2, \cdots, b_n)\). Our distance could be:
\[dist(a, b)=\sqrt{(a_1-b_1)^2+(a_2-b_2)^2+\cdots+(a_n-b_n)^2}.\]
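For instance, a minimal R sketch of this distance computation (using two hypothetical feature vectors) could be:
# Euclidean distance between two feature vectors (hypothetical example values)
euclid_dist <- function(a, b) sqrt(sum((a - b)^2))
euclid_dist(c(1, 0, 2), c(3, 1, 2))   # sqrt((1-3)^2 + (0-1)^2 + (2-2)^2) = sqrt(5)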
When we have nominal features, it requires a little trick to apply the Euclidean distance formula. We could create dummy variables as indicators of all the nominal feature levels. The dummy variable would equal one when we have the feature and zero otherwise. We show two examples:
\[Gender= \left\{ \begin{array}{ll} 0 & X=male \\ 1 & X=female \\ \end{array} \right. \]
\[Cold= \left\{ \begin{array}{ll} 0 & Temp \geq 37F \\ 1 & Temp < 37F \\ \end{array} \right. \]
This coding allows only binary (0/1) values. If a nominal feature has multiple categories, we simply create a dummy variable for each level (one-hot encoding) and then apply the Euclidean distance formula.
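A minimal R sketch of this dummy coding (with a hypothetical three-level nominal feature) could be:
# one-hot encode a hypothetical 3-level nominal feature
color <- c("red", "green", "blue", "green")
is_red   <- as.numeric(color == "red")
is_green <- as.numeric(color == "green")
is_blue  <- as.numeric(color == "blue")
cbind(is_red, is_green, is_blue)   # one indicator column per level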
The parameter \(k\) should be neither too large nor too small. If \(k\) is too large, the test record tends to be classified as the most popular class among the training records, rather than according to its most similar neighbors. On the other hand, if \(k\) is too small, outliers, noise, or mislabeled training cases might lead to prediction errors.
A common practice is to calculate the square root of the number of training cases and use that number as an (initial) estimate of \(k\).
A more robust approach is to try several \(k\) values and select the one with the best classification performance.
Different features might be measured on different scales. For example, pain may be rated on a scale from 1 to 10 or from 1 to 100. Many similarity or distance measures implicitly assume the same measuring unit in all feature dimensions, which means the data may need to be transformed to a common scale. Re-scaling makes each feature contribute to the distance in a relatively equal manner, avoiding potential bias.
There are many alternative strategies to rescale the data; two common ones are min-max normalization and z-score standardization, shown below.
\[X_{new}=\frac{X-min(X)}{max(X)-min(X)}.\]
After re-scaling, \(X_{new}\) ranges from 0 to 1 and represents the distance of each value from the minimum as a percentage of the full range; larger values indicate greater distance from the minimum, and 100% means the value is at the maximum.
\[X_{new}=\frac{X-\mu}{\sigma}=\frac{X-Mean(X)}{SD(X)}\]
This is based on the properties of the normal distribution discussed in Chapter 2. After z-score standardization, the re-scaled feature has an unbounded range, unlike min-max normalization, which always yields values in the finite range from 0 to 1. Following z-score standardization, the transformed features are unitless and may resemble a standard normal distribution.
The data we are using for this case study is the “Boys Town Study of Youth Development”, which is the second case study, CaseStudy02_Boystown_Data.csv.
Variables:
First, we need to load the data and do some manipulation. Since we are using the Euclidean distance, dummy variables (one-hot encoding) should be used. The following code transforms sex, dadjob and momjob into dummy variables.
library(class)
library(gmodels)
# load the Boys Town data (space-separated file)
boystown <- read.csv("https://umich.instructure.com/files/399119/download?download_frd=1", sep=" ")
# recode sex from 1/2 to a 0/1 indicator
boystown$sex <- boystown$sex-1
# recode dadjob and momjob so that the two levels become 1/0 indicators
boystown$dadjob <- -1*(boystown$dadjob-2)
boystown$momjob <- -1*(boystown$momjob-2)
str(boystown)
## 'data.frame': 200 obs. of 11 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : num 0 0 0 0 1 1 0 0 1 1 ...
## $ gpa : int 5 0 3 2 3 3 1 5 1 3 ...
## $ Alcoholuse: int 2 4 2 2 6 3 2 6 5 2 ...
## $ alcatt : int 3 2 3 1 2 0 0 3 0 1 ...
## $ dadjob : num 1 1 1 1 1 1 1 1 1 1 ...
## $ momjob : num 0 0 0 0 1 0 0 0 1 1 ...
## $ dadclose : int 1 3 2 1 2 1 3 6 3 1 ...
## $ momclose : int 1 4 2 2 1 2 1 2 3 2 ...
## $ larceny : int 1 0 0 3 1 0 0 0 1 1 ...
## $ vandalism : int 3 0 2 2 2 0 5 1 4 0 ...
The str() function tells us that we have 200 observations and 11 variables. However, the ID variable is not important in this case study, so we can delete it. In this case study, we can focus on academic performance (GPA), recidivism (vandalism and larceny), alcohol use, or other outcome variables. One concrete example involves trying to predict recidivism and using knn to classify participants into two categories.
Let’s focus on a specific outcome variable representing two or more infractions of vandalism and larceny, which can be considered as “Recidivism”. Participants with one or no infractions can be labeled as “Controls”. First we can use PCA to explore the data and then construct the new derived recidivism variable and split the data into training and testing sets.
# GPA study
# boystown <- boystown[, -1]
# table(boystown$gpa)
# boystown$grade <- boystown$gpa %in% c(3, 4, 5) # GPA: (1-“A” average, 2-“B” average, 3-“C” average, 4+ below “C” average)
# boystown$grade <- factor(boystown$grade, levels=c(F, T), labels = c("above_avg", "avg_or_below"))
# table(boystown$grade)
# You can also try with alternative outcomes, e.g., alcohol use
# Outcome Alcohol Use 6 categories (1- everyday, 2- once or twice/wk, 3- once or twice/month, 4- less than once or twice per month, 5- once or twice, 6- never)
# Y <- as.factor(ifelse(boystown$Alcoholuse > 3, "1_Heavy_Alcohol_Use", "0_Low_Alcohol_Use")); table(Y)
# Y <- as.factor(boystown$Alcoholuse); table(Y)
# X <- boystown[, -c(1, 4)]
# First explore the data by running a PCA
rawData <- boystown[ , -1]; head(rawData)
## sex gpa Alcoholuse alcatt dadjob momjob dadclose momclose larceny vandalism
## 1 0 5 2 3 1 0 1 1 1 3
## 2 0 0 4 2 1 0 3 4 0 0
## 3 0 3 2 3 1 0 2 2 0 2
## 4 0 2 2 1 1 0 1 2 3 2
## 5 1 3 6 2 1 1 2 1 1 2
## 6 1 3 3 0 1 0 1 2 0 0
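# The PCA call producing the output below is not shown in the extracted text;
# a minimal sketch (assuming prcomp on the centered, unscaled covariates) would be:
pca1 <- prcomp(rawData)   # center=TRUE by default
summary(pca1)             # importance of components
pca1$rotation             # principal component loadings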
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9264 1.6267 1.4556 1.3887 1.3023 1.2495 0.95092
## Proportion of Variance 0.2463 0.1756 0.1406 0.1280 0.1126 0.1036 0.06001
## Cumulative Proportion 0.2463 0.4219 0.5625 0.6905 0.8031 0.9067 0.96672
## PC8 PC9 PC10
## Standard deviation 0.46931 0.44030 0.29546
## Proportion of Variance 0.01462 0.01287 0.00579
## Cumulative Proportion 0.98134 0.99421 1.00000
## PC1 PC2 PC3 PC4 PC5
## sex 0.017329782 -0.041942445 0.038644169 0.028574405 0.01345929
## gpa 0.050811919 0.251727224 -0.552866390 0.482012801 0.12517612
## Alcoholuse -0.955960402 -0.171043716 -0.004495945 0.207154515 -0.09577419
## alcatt -0.087619514 0.345341205 -0.639579077 -0.330797550 -0.46933240
## dadjob 0.007936535 -0.001870547 -0.017028011 0.006888380 0.01469275
## momjob 0.030397552 -0.003518869 0.008763667 -0.001482878 0.02342974
## dadclose 0.052947782 -0.689075136 -0.246264250 -0.452876476 -0.21528193
## momclose -0.055904765 -0.273643927 -0.435005164 -0.092086033 0.75752753
## larceny 0.063455041 -0.005344755 -0.108928362 0.033669839 0.07405310
## vandalism -0.254240056 0.486423422 0.147159970 -0.632255411 0.35813577
## PC6 PC7 PC8 PC9 PC10
## sex -0.026402404 0.02939770 0.996267983 -0.036561664 -0.001236127
## gpa 0.599548626 -0.13874602 0.035498809 -0.004136514 0.019965416
## Alcoholuse 0.015534060 0.05830281 0.002438336 -0.032626941 -0.008177261
## alcatt -0.362724708 0.01546912 0.045913748 -0.019571649 0.001150003
## dadjob -0.001537909 -0.04457354 -0.001761749 -0.050310596 -0.997425089
## momjob -0.006536869 -0.01490905 -0.037550689 -0.997051461 0.051468175
## dadclose 0.456556766 -0.04166384 0.008660225 -0.005147468 0.001021197
## momclose -0.381040880 -0.06639096 -0.008710480 0.018261764 0.020666371
## larceny 0.084351462 0.98381408 -0.026413939 -0.013630662 -0.039663208
## vandalism 0.383715180 -0.00198434 0.042593385 -0.003164270 -0.004956997
Y <- ifelse (boystown$vandalism + boystown$larceny > 1, "Recidivism", "Control") # more than 1 vandalism or larceny conviction
X <- boystown[, -c(1, 10, 11)] # covariate set excludes these 3 columns index/ID, vandalism, and larceny
boystown_z <- as.data.frame(lapply(X, scale))
bt_train <- boystown_z[1:150, ]
bt_test <- boystown_z[151:200, ]
bt_train_labels <- Y[1:150]
bt_test_labels <- Y[151:200]
table(bt_train_labels)
## bt_train_labels
## Control Recidivism
## 46 104
Let’s look at the proportions for the two categories in both the training and testing sets.
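The code generating these proportions is not shown above; a minimal sketch (assuming the label vectors defined earlier) is:
round(prop.table(table(bt_train_labels)), 2)
round(prop.table(table(bt_test_labels)), 2)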
## bt_train_labels
## Control Recidivism
## 0.31 0.69
## bt_test_labels
## Control Recidivism
## 0.2 0.8
We can see that most of the participants have incidents related to recidivism (~70-80%).
The remaining 8 features use different measuring scales. If we use these features directly, variables with larger scale may have a greater impact on the classification performance. Therefore, re-scaling may be useful to level the playing field.
First let’s create a function of our own using the min-max normalization formula. We can check the function using some trial vectors.
normalize<-function(x){
return((x-min(x))/(max(x)-min(x)))
}
# some test examples:
normalize(c(1, 2, 3, 4, 5))
## [1] 0.00 0.25 0.50 0.75 1.00
## [1] 0.000 0.250 0.625 0.750 1.000
After confirming the function definition, we can use lapply() to apply the normalization transformation to each element in a “list” of predictors of the binary outcome (recidivism).
boystown_n <- as.data.frame(lapply(bt_train, normalize))
# alternatively we can use "scale", as we showed above
#boystown_z <- as.data.frame(lapply(X, scale))
Let’s compare the two alternative normalization approaches on one of the features, Alcohol use.
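The summaries below are presumably produced by comparing the Alcoholuse feature under the two transforms; a minimal sketch is:
summary(boystown_n$Alcoholuse)   # min-max normalized version
summary(boystown_z$Alcoholuse)   # z-score standardized version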
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2727 0.3636 0.3576 0.4545 1.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.04795 -0.98958 0.06879 0.00000 0.59798 3.77309
We have 200 observations in this dataset. The more data we use to train the algorithm, the more precise the predictions will be. We can use \(3/4\) of the data for training and the remaining \(1/4\) for testing. For simplicity, we will just take the first 150 cases for training and the remaining 50 for testing. Alternatively, we can use a randomization strategy to split the data into training and testing sets.
# may want to use random split of the raw data into training and testing
# subset_int <- sample(nrow(boystown_n),floor(nrow(boystown_n)*0.8))
# 80% training + 20% testing
# bt_train<- boystown_n [subset_int, ]; bt_test<-boystown_n[-subset_int, ]
# Note that the object boystown_n already excludes the outcome variable (Delinquency), index 11!
bt_train <- boystown_z[1:150, ]
bt_test <- boystown_z[151:200, ]
Then let’s extract the recidivism labels or classes for the training and testing sets.
First, we will use the class::knn() method. The function knn() has the following components:
p <- knn(train, test, cl, k)
where train is the data frame of training predictors, test is the data frame of testing predictors, cl is the factor of true class labels for the training cases, and k is the number of neighbors to consider; the function returns the predicted class labels p for the testing cases.
We can first test with \(k=7\), which is less than the square root of our number of observations: \(\sqrt{200}\approx 14\).
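The fitting code is not displayed above; a minimal sketch, assuming the training/testing objects created earlier, would be:
bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=7)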
We utilize the CrossTable() function from Chapter 2 to evaluate the kNN model. We have two classes in this example. The goal is to create a \(2\times 2\) table that shows the matched true and predicted classes as well as the unmatched ones. However, the chi-square contributions are not needed, so we use the option prop.chisq=FALSE to omit them.
# install.packages("gmodels", repos="https://cran.us.r-project.org")
library(gmodels)
CrossTable(x=bt_test_labels, y=bt_test_pred, prop.chisq = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 50
##
##
## | bt_test_pred
## bt_test_labels | Control | Recidivism | Row Total |
## ---------------|------------|------------|------------|
## Control | 0 | 10 | 10 |
## | 0.000 | 1.000 | 0.200 |
## | 0.000 | 0.213 | |
## | 0.000 | 0.200 | |
## ---------------|------------|------------|------------|
## Recidivism | 3 | 37 | 40 |
## | 0.075 | 0.925 | 0.800 |
## | 1.000 | 0.787 | |
## | 0.060 | 0.740 | |
## ---------------|------------|------------|------------|
## Column Total | 3 | 47 | 50 |
## | 0.060 | 0.940 | |
## ---------------|------------|------------|------------|
##
##
In this table, the cells in the first row-first column and the second row-second column contain the counts of cases whose predicted classes match the true class labels. The other two cells are the counts of mismatched cases. The accuracy of this classifier is calculated by \(\frac{cell[1, 1]+cell[2, 2]}{total}=\frac{0+37}{50}=0.74\). Note that this value may fluctuate slightly each time you run the classifier, due to random tie-breaking among equidistant neighbors and any random splitting of the data.
The normalization strategy may play a role in the classification performance. We can try an alternative standardization method - standard z-score centering and scaling (via the scale() method). Let's apply this standardization prior to training the kNN classifier, and then predict and assess the accuracy of the results.
# bt_train <- bt_z[1:150, ]
# bt_test <- bt_z[151:200, ]
# bt_train_labels <- boystown[1:150, 11]
# bt_test_labels <- boystown[151:200, 11]
# bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=7)
# CrossTable(x=bt_test_labels, y=bt_test_pred, prop.chisq = F)
# bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, prob=T, k=18) # retrieve Probabilities
# getTrainingDataThreshold <- quantile(attributes(bt_test_pred)$prob, c(table(bt_train_labels)[1]/length(bt_train_labels), 0.5))[1]
# bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > getTrainingDataThreshold, 1, 0) # Binarize the probabilities
bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, prob=T, k=14) # retrieve Probabilities
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control") # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 50
##
##
## | bt_test_predBin
## bt_test_labels | Control | Recidivism | Row Total |
## ---------------|------------|------------|------------|
## Control | 0 | 10 | 10 |
## | 0.000 | 1.000 | 0.200 |
## | 0.000 | 0.217 | |
## | 0.000 | 0.200 | |
## ---------------|------------|------------|------------|
## Recidivism | 4 | 36 | 40 |
## | 0.100 | 0.900 | 0.800 |
## | 1.000 | 0.783 | |
## | 0.080 | 0.720 | |
## ---------------|------------|------------|------------|
## Column Total | 4 | 46 | 50 |
## | 0.080 | 0.920 | |
## ---------------|------------|------------|------------|
##
##
# CT
print(paste0("Prediction accuracy of model 'bt_test_pred' (k=14) is ", (CT$prop.tbl[1,1]+CT$prop.tbl[2,2]) ))
## [1] "Prediction accuracy of model 'bt_test_pred' (k=14) is 0.72"
Under the z-score method, the prediction result is similar to the previous result using the min-max normalization strategy. In general, though, there may be marginal differences, e.g., a few more cases may be correctly labeled under one of the standardization or normalization approaches.
Originally, we used (roughly) the square root of 200 as our \(k\). However, this might not be the best \(k\) for this study. We can test different values of \(k\) and compare their predictive performance.
# bt_train <- boystown_n[1:150, ]
# bt_test <- boystown_n[151:200, ]
# bt_train_labels <- boystown[1:150, 11]
# bt_test_labels <- boystown[151:200, 11]
# bt_test_pred1 <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=1)
# bt_test_pred9 <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=9)
# bt_test_pred11 <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=11)
# bt_test_pred21 <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=21)
# bt_test_pred27 <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=27)
# ct_1 <- CrossTable(x=bt_test_labels, y=bt_test_pred1, prop.chisq = F)
# ct_9 <- CrossTable(x=bt_test_labels, y=bt_test_pred9, prop.chisq = F)
# ct_11 <- CrossTable(x=bt_test_labels, y=bt_test_pred11, prop.chisq = F)
# ct_21 <- CrossTable(x=bt_test_labels, y=bt_test_pred21, prop.chisq = F)
# ct_27 <- CrossTable(x=bt_test_labels, y=bt_test_pred27, prop.chisq = F)
bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, prob=T, k=18) # retrieve Probabilities
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control") # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 50
##
##
## | bt_test_predBin
## bt_test_labels | Control | Recidivism | Row Total |
## ---------------|------------|------------|------------|
## Control | 1 | 9 | 10 |
## | 0.100 | 0.900 | 0.200 |
## | 1.000 | 0.184 | |
## | 0.020 | 0.180 | |
## ---------------|------------|------------|------------|
## Recidivism | 0 | 40 | 40 |
## | 0.000 | 1.000 | 0.800 |
## | 0.000 | 0.816 | |
## | 0.000 | 0.800 | |
## ---------------|------------|------------|------------|
## Column Total | 1 | 49 | 50 |
## | 0.020 | 0.980 | |
## ---------------|------------|------------|------------|
##
##
# CT
print(paste0("Prediction accuracy of model 'bt_test_pred' (k=18) is ", (CT$prop.tbl[1,1]+CT$prop.tbl[2,2]) ))
## [1] "Prediction accuracy of model 'bt_test_pred' (k=18) is 0.82"
The choice of \(k\) in kNN classification is very important, and it can be fine-tuned using the e1071::tune.knn() or the caret::train() methods. Note that using the tune.knn() method without explicit control over the class-probability cutoff does not generate good results here. Specifying that a probability exceeding \(0.6\) corresponds to recidivism significantly improves the kNN performance when using caret::train() and the associated predict() method.
# install.packages("e1071")
library(e1071)
knntuning = tune.knn(x= bt_train, y = as.factor(bt_train_labels), k = 1:30)
knntuning
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## k
## 12
##
## - best performance: 0.2866667
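# The detailed performance results below are presumably produced by summarizing the tuning object:
summary(knntuning)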
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## k
## 12
##
## - best performance: 0.2866667
##
## - Detailed performance results:
## k error dispersion
## 1 1 0.4866667 0.1407212
## 2 2 0.4466667 0.1475730
## 3 3 0.3866667 0.1079552
## 4 4 0.4133333 0.1167460
## 5 5 0.3533333 0.1090928
## 6 6 0.3733333 0.1264911
## 7 7 0.3333333 0.0993808
## 8 8 0.3200000 0.1124365
## 9 9 0.3000000 0.1305260
## 10 10 0.2933333 0.1303367
## 11 11 0.3000000 0.1448712
## 12 12 0.2866667 0.1441878
## 13 13 0.3200000 0.1209020
## 14 14 0.3133333 0.1219188
## 15 15 0.3133333 0.1177987
## 16 16 0.3200000 0.1167460
## 17 17 0.3066667 0.1225249
## 18 18 0.3133333 0.1177987
## 19 19 0.3066667 0.1225249
## 20 20 0.3066667 0.1225249
## 21 21 0.3066667 0.1225249
## 22 22 0.3066667 0.1225249
## 23 23 0.3066667 0.1225249
## 24 24 0.3066667 0.1225249
## 25 25 0.3066667 0.1225249
## 26 26 0.3066667 0.1225249
## 27 27 0.3066667 0.1225249
## 28 28 0.3066667 0.1225249
## 29 29 0.3066667 0.1225249
## 30 30 0.3066667 0.1225249
library(caret)
knnControl <- trainControl(
method = "cv", ## cross validation
number = 10, ## 10-fold
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = FALSE
)
# bt_train_stringLabels <- ifelse (bt_train_labels==1, "Recidivism", "Control")
knn_model <- train(x=bt_train, y=bt_train_labels , metric = "ROC", method = "knn", tuneLength = 20, trControl = knnControl)
print(knn_model)
## k-Nearest Neighbors
##
## 150 samples
## 8 predictor
## 2 classes: 'Control', 'Recidivism'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 136, 135, 134, 134, 134, 136, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.3583182 0.085 0.9045455
## 7 0.4536136 0.065 0.9527273
## 9 0.4608864 0.085 0.9718182
## 11 0.5139773 0.090 0.9909091
## 13 0.5074318 0.040 0.9718182
## 15 0.5069773 0.000 0.9709091
## 17 0.5485455 0.000 0.9909091
## 19 0.5355227 0.000 1.0000000
## 21 0.5661591 0.000 1.0000000
## 23 0.5805000 0.000 1.0000000
## 25 0.5720909 0.000 1.0000000
## 27 0.5680000 0.000 1.0000000
## 29 0.5782955 0.000 1.0000000
## 31 0.5880227 0.000 1.0000000
## 33 0.5736136 0.000 1.0000000
## 35 0.5948864 0.000 1.0000000
## 37 0.6018864 0.000 1.0000000
## 39 0.5770682 0.000 1.0000000
## 41 0.5783636 0.000 1.0000000
## 43 0.5610455 0.000 1.0000000
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 37.
# Here we are providing a cutoff of 60% probability for derived class label=Recidivism
library(dplyr)
summaryPredictions <- predict(knn_model, newdata = bt_test, type = "prob") # %>% mutate('knnPredClass'=names(.)[apply(., 1, which.max)])
summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
testDataPredSummary <- as.data.frame(cbind(trueLabels=bt_test_labels, controlProb=summaryPredictions$Control,
recidivismProb=summaryPredictions$Recidivism, knnPredLabel=summaryPredictionsLabel))
print(paste0("Accuracy = ", 2*as.numeric(table(testDataPredSummary$trueLabels == testDataPredSummary $knnPredLabel)[2]), "%"))
## [1] "Accuracy = 80%"
It’s useful to visualize the error rate against the value of \(k\). This can help us select the optimal \(k\) parameter that minimizes the cross-validation (CV) error, see Chapter 9.
library(class)
library(ggplot2)
library(reshape2)
# define a function that generates CV folds
cv_partition <- function(y, num_folds = 10, seed = NULL) {
if(!is.null(seed)) {
set.seed(seed)
}
n <- length(y)
# split() divides the data into the folds defined by gl().
# gl() generates factors according to the pattern of their levels
folds <- split(sample(seq_len(n), n), gl(n = num_folds, k = 1, length = n))
folds <- lapply(folds, function(fold) {
list(
training = which(!seq_along(y) %in% fold),
test = fold
)
})
names(folds) <- paste0("Fold", names(folds))
return(folds)
}
# Generate 10-folds of the data
folds = cv_partition(bt_train_labels, num_folds = 10)
# Define a training set_CV_error calculation function
train_cv_error = function(K) {
#Train error
#### knnbt = knn(train = bt_train, test = bt_train, cl = bt_train_labels, k = K)
#### train_error = mean(knnbt != bt_train_labels)
knn_model <- train(x=bt_train, y=bt_train_labels , metric = "ROC", method = "knn", tuneLength = 20, trControl = knnControl)
summaryPredictions <- predict(knn_model, newdata = bt_train, type = "prob")
summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
train_error = mean(summaryPredictionsLabel != bt_train_labels)
#CV error
cverrbt = sapply(folds, function(fold) {
### knnbt = knn(train = bt_train[fold$training,], cl = bt_train_labels[fold$training], test = bt_train[fold$test,], k=K)
knn_model <- train(x=bt_train[fold$training,], y=bt_train_labels[fold$training] , metric = "ROC", method = "knn", tuneLength = 20, trControl = knnControl)
summaryPredictions <- predict(knn_model, newdata = bt_train[fold$test,], type = "prob")
summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
mean(summaryPredictionsLabel != bt_train_labels[fold$test])
# mean(bt_train_labels[fold$test] != knn(train = bt_train[fold$training,],
# cl = bt_train_labels[fold$training],
# test = bt_train[fold$test,], k=K))
}
)
cv_error = mean(cverrbt)
#Test error
knn.test = knn(train = bt_train, test = bt_test, cl = bt_train_labels, k = K)
test_error = mean(knn.test != bt_test_labels)
return(c(train_error, cv_error, test_error))
}
k_err = sapply(1:30, function(k) train_cv_error(k))
df_errs = data.frame(t(k_err), 1:30)
colnames(df_errs) = c('Train', 'CV', 'Test', 'K')
dataL <- melt(df_errs, id="K")
# require(ggplot2)
# library(reshape2)
# ggplot(dataL, aes_string(x="K", y="value", colour="variable",
# group="variable", linetype="variable", shape="variable")) +
# geom_line(size=0.8) + labs(x = "Number of nearest neighbors (k)",
# y = "Classification error",
# colour="", group="",
# linetype="", shape="") +
# geom_point(size=2.8) +
# geom_vline(xintercept=4:5, colour = "pink")+
# geom_text(aes(4,0,label = "4", vjust = 1)) +
# geom_text(aes(5,0,label = "5", vjust = 1))
library(plotly)
plot_ly(dataL, x = ~K, y = ~value, color = ~variable, type = "scatter", mode = "markers+lines") %>%
add_segments(x=25, xend=25, y=0.0, yend=0.33, type = "scatter", name="k=25",
line=list(color="darkgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>%
add_segments(x=14, xend=14, y=0.0, yend=0.33, type = "scatter", name="k=14",
line=list(color="lightgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>%
add_segments(x=18, xend=18, y=0.0, yend=0.36, type = "scatter", name="k=18",
line=list(color="gray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>%
layout(title='K-NN Training, CV, and Testing Error Rates against k',
legend=list(title=list(text='<b> Samples </b>')),
xaxis=list(title='Number of nearest neighbors (k)'), yaxis=list(title='Classification error'))
First, let's review the fundamentals of hypothesis-testing inference and recall the confusion-matrix terminology:
Confusion Matrix | Negative | Positive |
---|---|---|
kNN Fails to reject | TN | FN |
kNN rejects | FP | TP |
Metrics | Specificity: TN/(TN+FP) | Sensitivity: TP/(TP+FN) |
Suppose we want to evaluate how well the kNN model (\(k=12\)) predicts recidivism. Let's manually report some of the accuracy metrics for this model, treating the first factor level (Control) as the positive class, as caret does by default. Combining the results, we get the following model sensitivity and specificity.
bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, prob=T, k=12) # retrieve Probabilities
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control") # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 50
##
##
## | bt_test_predBin
## bt_test_labels | Control | Recidivism | Row Total |
## ---------------|------------|------------|------------|
## Control | 2 | 8 | 10 |
## | 0.200 | 0.800 | 0.200 |
## | 0.286 | 0.186 | |
## | 0.040 | 0.160 | |
## ---------------|------------|------------|------------|
## Recidivism | 5 | 35 | 40 |
## | 0.125 | 0.875 | 0.800 |
## | 0.714 | 0.814 | |
## | 0.100 | 0.700 | |
## ---------------|------------|------------|------------|
## Column Total | 7 | 43 | 50 |
## | 0.140 | 0.860 | |
## ---------------|------------|------------|------------|
##
##
mod12_TN <- CT$prop.row[1, 1]   # P(predicted Control    | true Control)
mod12_FP <- CT$prop.row[1, 2]   # P(predicted Recidivism | true Control)
mod12_FN <- CT$prop.row[2, 1]   # P(predicted Control    | true Recidivism)
mod12_TP <- CT$prop.row[2, 2]   # P(predicted Recidivism | true Recidivism)
mod12_sensi <- mod12_TN/(mod12_TN+mod12_FP)  # proportion of true Controls predicted correctly
mod12_speci <- mod12_TP/(mod12_TP+mod12_FN)  # proportion of true Recidivism cases predicted correctly
print(paste0("kNN model k=12 Sensitivity=", mod12_sensi))
## [1] "kNN model k=12 Sensitivity=0.2"
## [1] "kNN model k=12 Specificity=0.875"
## bt_test_predBin
## bt_test_labels Control Recidivism
## Control 2 8
## Recidivism 5 35
Therefore, model12, corresponding to \(k=12\), yields a marginal accuracy on the testing cases.
Another strategy for model validation and improvement involves the use of the caret::confusionMatrix() method, which reports several complementary metrics quantifying the performance of the prediction model. Let's examine more deeply the performance of model12 in predicting recidivism.
# corr12 <- cor(as.numeric(bt_test_labels), as.numeric(bt_test_predBin))
# corr12
# plot(as.numeric(bt_test_labels), as.numeric(bt_test_pred9))
# table(as.numeric(bt_test_labels), as.numeric(bt_test_pred9))
# install.packages("caret")
library("caret")
# compute the accuracy, LOR, sensitivity/specificity of 3 kNN models
# Model 12: bt_test_predBin
#confusionMatrix(as.numeric(bt_test_labels), as.numeric(bt_test_pred1))
confusionMatrix(as.factor(bt_test_labels), as.factor(bt_test_predBin))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Control Recidivism
## Control 2 8
## Recidivism 5 35
##
## Accuracy : 0.74
## 95% CI : (0.5966, 0.8537)
## No Information Rate : 0.86
## P-Value [Acc > NIR] : 0.9927
##
## Kappa : 0.0845
##
## Mcnemar's Test P-Value : 0.5791
##
## Sensitivity : 0.2857
## Specificity : 0.8140
## Pos Pred Value : 0.2000
## Neg Pred Value : 0.8750
## Prevalence : 0.1400
## Detection Rate : 0.0400
## Detection Prevalence : 0.2000
## Balanced Accuracy : 0.5498
##
## 'Positive' Class : Control
##
Finally, we can use a 3D plot to display the confusionMatrix() results of model12 (mod12_TN, mod12_FN, mod12_FP, mod12_TP).
# install.packages("scatterplot3d")
library(scatterplot3d)
grid_xy <- matrix(c(0, 1, 1, 0), nrow=2, ncol=2)
intensity <- matrix(c(mod12_TN, mod12_FN, mod12_FP, mod12_TP), nrow=2, ncol=2)
# scatterplot3d(grid_xy, intensity, pch=16, highlight.3d=TRUE, type="h", main="3D Scatterplot")
s3d.dat <- data.frame(cols=as.vector(col(grid_xy)),
rows=as.vector(row(grid_xy)),
value=as.vector(intensity))
scatterplot3d(s3d.dat, pch=16, highlight.3d=TRUE, type="h", xlab="real", ylab="predicted", zlab="Agreement",
main="3D Scatterplot: Model12 Results (FP, FN, TP, TN)")
# scatterplot3d(s3d.dat, type="h", lwd=5, pch=" ", xlab="real", ylab="predicted", zlab="Agreement", main="Model9 Results (FP, FN, TP, TN)")
plot_ly(x = c("TN", "FN", "FP", "TP"),
y = c(mod12_TN, mod12_FN, mod12_FP, mod12_TP),
name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>%
layout(title="Confusion Matrix",
legend=list(title=list(text='<b> Model k=12; Performance Metrics </b>')),
xaxis=list(title='Metrics'), yaxis=list(title='Probability'))
Let’s now use the SOCR Case-Study 22 (22_SDSS_GalaxySpins_Case_Study) to train a kNN classifier on \(49,122\) training cases (randomly selected out of a total of \(51,122\)) and test the accuracy of predicting the galactic spin (L=left or R=right hand spin) on the remaining \(2,000\) randomly chosen testing galaxies. Evaluate and report the classifier performance. Try to compare and improve the performance of several independent classifiers, e.g., kNN, naive Bayes, LDA.
Report/graph the training, testing and CV error rates for different k parameters and justify your “optimal” choice for \(k=k_o\). How good is your Galactic spin classifier? Provide some visual and numeric evidence.
Some sample code to get you started is included below.
#loading libraries
library(e1071); library(caret); library(class); library(gmodels)
#loading Galaxy data
galaxy_data <- read.csv("https://umich.instructure.com/files/6105118/download?download_frd=1", sep=",")
#View(head(galaxy_data))
dim(galaxy_data)
## [1] 51122 12
## 'data.frame': 51122 obs. of 11 variables:
## $ RA : num 236 237 238 238 238 ...
## $ DEC : num -0.493 -0.482 -0.506 -0.544 -0.527 ...
## $ HAND : chr "R" "L" "L" "L" ...
## $ UZS : num 17.4 19.2 19.4 19.5 17.8 ...
## $ GZS : num 16.2 18 17.5 18.3 16.6 ...
## $ RZS : num 15.6 17.4 16.6 17.9 16.2 ...
## $ IZS : num 15.2 17 16.2 17.6 15.8 ...
## $ ZZS : num 14.9 16.7 15.8 17.3 15.6 ...
## $ ELLIPS: num 0.825 0.961 0.411 0.609 0.505 ...
## $ PHIS : num 154 100.2 29.2 109.4 39.6 ...
## $ RSS : num 0.0547 0.0977 0.0784 0.076 0.0796 ...
#randomly assigning 2k for validation purposes
subset_int <- sample(nrow(galaxy_data), 2000)
#training data, dropping the label
galaxy_data_train <- galaxy_data[-subset_int, -3]; dim(galaxy_data_train)  # training set, HAND label (column 3) dropped
## [1] 49122 10
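# the matching test-set definition (not shown in the extracted text) is presumably:
galaxy_data_test <- galaxy_data[subset_int, -3]; dim(galaxy_data_test)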
## [1] 2000 10
#summary(galaxy_data_train)
# labels
galaxy_train_labels <- as.factor(galaxy_data[-subset_int, 3])
galaxy_test_labels <- as.factor(galaxy_data[subset_int, 3])
#summary(galaxy_data_test)
# trying to figure out the best K for this kNN model
# testing alternative values of K
gd_test_pred1<-knn(train=galaxy_data_train, test=galaxy_data_test,
cl=galaxy_train_labels, k=1)
gd_test_pred10<-knn(train=galaxy_data_train, test=galaxy_data_test,
cl=galaxy_train_labels, k=10)
gd_test_pred20<-knn(train=galaxy_data_train, test=galaxy_data_test,
cl=galaxy_train_labels, k=20)
gd_test_pred50<-knn(train=galaxy_data_train, test=galaxy_data_test,
cl=galaxy_train_labels, k=50)
ct_1<-CrossTable(x=galaxy_test_labels, y=gd_test_pred1, prop.chisq = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 2000
##
##
## | gd_test_pred1
## galaxy_test_labels | L | R | Row Total |
## -------------------|-----------|-----------|-----------|
## L | 1013 | 8 | 1021 |
## | 0.992 | 0.008 | 0.510 |
## | 0.984 | 0.008 | |
## | 0.506 | 0.004 | |
## -------------------|-----------|-----------|-----------|
## R | 16 | 963 | 979 |
## | 0.016 | 0.984 | 0.489 |
## | 0.016 | 0.992 | |
## | 0.008 | 0.481 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 1029 | 971 | 2000 |
## | 0.514 | 0.485 | |
## -------------------|-----------|-----------|-----------|
##
##
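# The summary below is presumably produced by caret::confusionMatrix, mirroring the earlier usage:
confusionMatrix(galaxy_test_labels, gd_test_pred1)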
## Confusion Matrix and Statistics
##
## Reference
## Prediction L R
## L 1013 8
## R 16 963
##
## Accuracy : 0.988
## 95% CI : (0.9822, 0.9923)
## No Information Rate : 0.5145
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.976
##
## Mcnemar's Test P-Value : 0.153
##
## Sensitivity : 0.9845
## Specificity : 0.9918
## Pos Pred Value : 0.9922
## Neg Pred Value : 0.9837
## Prevalence : 0.5145
## Detection Rate : 0.5065
## Detection Prevalence : 0.5105
## Balanced Accuracy : 0.9881
##
## 'Positive' Class : L
##
#Alternatively we can use the tuning function
knn.tune = tune.knn(x= galaxy_data_train, y = galaxy_train_labels, k = 1:20,
tunecontrol=tune.control(sampling = "fix") , fix=10)
#Summarize the resampling results set
summary(knn.tune)
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: fixed training/validation set
##
## - best parameters:
## k
## 1
##
## - best performance: 0.1772933
##
## - Detailed performance results:
## k error dispersion
## 1 1 0.1772933 NA
## 2 2 0.4163308 NA
## 3 3 0.4374008 NA
## 4 4 0.3946501 NA
## 5 5 0.4021009 NA
## 6 6 0.4250031 NA
## 7 7 0.4277513 NA
## 8 8 0.4273849 NA
## 9 9 0.4342250 NA
## 10 10 0.4416148 NA
## 11 11 0.4455234 NA
## 12 12 0.4455234 NA
## 13 13 0.4472334 NA
## 14 14 0.4521803 NA
## 15 15 0.4511421 NA
## 16 16 0.4511421 NA
## 17 17 0.4509588 NA
## 18 18 0.4547453 NA
## 19 19 0.4571272 NA
## 20 20 0.4528521 NA
# plot(knn.tune)
df <- as.data.frame(cbind(x=knn.tune$performance$k, y=knn.tune$performance$error))
plot_ly(df, x = ~x, y = ~y, type = "scatter", mode = "markers+lines") %>%
add_segments(x=1, xend=1, y=0.0, yend=0.45, type = "scatter",
line=list(color="gray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>%
add_segments(x=5, xend=5, y=0.0, yend=0.45, type = "scatter",
line=list(color="lightgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>%
layout(title='Galaxy-spin k-NN Prediction - Error Rate against k',
xaxis=list(title='Number of nearest neighbors (k)'), yaxis=list(title='Classification error'))
In the previous section, we described the types of machine learning methods and presented a lazy learning classification of numerical data using k-nearest neighbors. What about nominal features or textual data? In this section, we will begin to explore some classification techniques for categorical data. Specifically, we will (1) present the Naive Bayes algorithm, (2) review its assumptions, (3) discuss Laplace estimation, and (4) illustrate the naive Bayesian classifier on a Head and Neck Cancer Medication case-study. Later, in Chapter 7, we will also discuss text mining and natural language processing of unstructured text data.
It may be useful to first review the basics of probability theory and Bayesian inference. Bayes classifiers use training data to calculate an observed probability of each class based on all the features. The probability links feature values to classes like a map. When labeling the test data, we utilize the feature values in the test data and the “map” to classify our test data with the most likely class. This idea seems simple but the corresponding algorithmic implementations might be very sophisticated.
The best scenario for accurately estimating the class probabilities is when all features simultaneously contribute information to the Bayes classifier. The naive Bayes algorithm is frequently used for text classification. The maximum a posteriori assignment of the class label is based on obtaining the conditional probability density function of each feature given the value of the class variable.
The method name, Naive Bayes, suggests the dependence of the technique on some simple (“naive”) assumptions. Its most important assumption is that all of the features are expected to be equally important and independent. This rarely happens in real world data. However, sometimes even when the assumptions are violated, Naive Bayes still performs fairly well, particularly when the number of features \(p\) is large. This is why the Naive Bayes algorithm is often used as a powerful text classifier.
There are interesting relations between QDA (Quadratic Discriminant Analysis), LDA (Linear Discriminant Analysis) and Naive Bayes classification. Additional information about LDA and QDA is available here.
Let’s first define the set-theoretic Bayes formula. We assume that \(B_i\)’s are mutually exclusive events, for all \(i= 1, 2, \cdots, n\), where n represents the number of features. If \(A\) and \(B\) are two events, the Bayes conditional probability formula is as follows: \[Posterior\, Probability=\frac{likelihood\times Prior\, Probability}{Marginal\, Likelihood}\]
Symbolically,
\[P(A|B)=\frac{P(B|A)P(A)}{P(B)}.\] When the \(B_i\)'s represent a partition of the event space, \(S=\cup_i {B_i}\) and \(B_i\cap B_j = \emptyset,\ \forall i\not= j\), the law of total probability expands the denominator, and Bayes' formula for any event \(B_i\) in the partition becomes:
\[P(B_i|A)=\frac{P(A|B_i)\times P(B_i)}{P(A|B_1)\times P(B_1) + P(A|B_2)\times P(B_2) + \cdots + P(A|B_n)\times P(B_n)}.\]
Now, let's express the Bayes formula in terms of classification using observed features. Suppose we have observed \(n\) features, \(F_i\), and there are \(K\) possible class outcomes, \(C_k\). Using Bayes' theorem and decomposing the conditional probability, the Bayesian classifier can be reformulated to make it more tractable.
\[ P(C_k \mid F_1, \dots, F_n)= \frac{P(F_1, \dots, F_n|C_k)P(C_k)}{P(F_1, \dots, F_n)}.\]
In the above expression, only the numerator depends on the class label, \(C_k\), as the values of the features \(F_i\) are observed (or imputed) making the denominator constant. Let’s focus on the numerator.
The numerator essentially represents the joint probability model:
\[P(F_1, \dots, F_n|C_k)P(C_k) = \underbrace{P(F_1, \cdots, F_n, C_k)}_{\text{joint model}}.\]
Repeatedly using the chain rule and the definition of conditional probability simplifies this to:
\[P(F_1, \dots ,F_n, C_{k})=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2, \dots ,F_n, C_{k})=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3, \dots ,F_n, C_{k})=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3|F_4, \dots ,F_n, C_{k})\times P(F_4, \dots ,F_n, C_{k})=\] \[=\cdots=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3|F_4, \dots ,F_n, C_{k})\times \cdots \times P(F_n| C_{k})\times P(C_k)\]
Note that the “naive” qualifier in the method name, Naive Bayes classifier, is attributed to the oversimplification of the conditional probability. Assuming each feature \(F_i\) is conditionally statistical independent of every other feature \(F_j, \ \forall j\neq i\), given the category \(C_k\), we get:
\[P(F_i | F_{i+1}, \dots , F_{n}, C_k ) = P(F_i | C_k).\]
This reduces the joint probability model to:
\[P(F_1, \dots ,F_n, C_{k})= P(F_1| C_{k})\times P(F_2| C_{k})\times P(F_3| C_{k})\times \cdots \times P(F_n| C_{k})\times P(C_k)\]
Therefore, the joint model is:
\[P(F_1, \dots ,F_n, C_{k})= P(C_k) \prod_{i=1}^n {P(F_i| C_{k})}\]
Essentially, we express the probability of class level \(L\) given an observation, represented as a set of independent features \(F_1, F_2, \cdots, F_n\). Then the posterior probability that the observation is in class \(L\) is equal to:
\[P(C_L|F_1, \cdots, F_n)=\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)},\] where the denominator, \(\prod_{i=1}^nP(F_i)\), is a scaling factor that represents the marginal probability of observing all features jointly.
For a given case \(X=(F_1, F_2, \cdots, F_n)\), i.e., given vector of features, the naive Bayes classifier assigns the most likely class \(\hat{C}\) by calculating \(\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)}\) for all class labels \(L\), and then assigning the class \(\hat{C}\) corresponding to the maximum posterior probability. Analytically, \(\hat{C}\) is defined by: \[ \hat{C} = \arg\max_L{\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)}}.\]
As the denominator is static for \(L\), the posterior probability above is maximized when the numerator is maximized, i.e., \(\hat{C} = \arg\max_L {P(C_L)\prod_{i=1}^nP(F_i|C_L)}.\)
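As a toy numerical illustration of this argmax rule (with made-up priors and conditional probabilities, not taken from the case study below), consider:
# hypothetical two-class example with three binary features observed as F1=Yes, F2=No, F3=Yes
prior <- c(classA = 0.6, classB = 0.4)
lik_A <- c(0.8, 0.3, 0.5)   # assumed P(F1=Yes|A), P(F2=No|A), P(F3=Yes|A)
lik_B <- c(0.2, 0.9, 0.7)   # same conditionals, assumed for class B
unnorm <- c(classA = prior[["classA"]] * prod(lik_A),
            classB = prior[["classB"]] * prod(lik_B))
posterior <- unnorm / sum(unnorm)   # normalize so the class probabilities sum to 1
posterior
names(which.max(posterior))         # the naive Bayes class assignment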
The contingency table below illustrates schematically how the Bayesian, marginal, conditional, and joint probabilities may be calculated for a finite number of features (columns) and classes (rows).
\(Features\\Classes\) | \(F_1\) | \(F_2\) | \(\cdots\) | \(F_n\) | Total |
---|---|---|---|---|---|
\(C_1\) | \(\cdots\) | \(\cdots\) | \(\cdots\) | \(\cdots\) | Marginal \(P(C_1)\) |
\(C_2\) | \(\cdots\) | \(\cdots\) | \(\cdots\) | Joint \(P(C_2, F_n)\) | \(\cdots\) |
\(\cdots\) | \(\cdots\) | \(\cdots\) | \(\cdots\) | \(\cdots\) | \(\cdots\) |
\(C_L\) | Conditional \(P(F_1\mid C_L)=\frac{P(F_1,C_L)}{P(C_L)}\) | \(\cdots\) | \(\cdots\) | \(\cdots\) | \(\cdots\) |
Total | Marginal \(P(F_1)\) | Marginal \(P(F_2)\) | \(\cdots\) | Marginal \(P(F_n)\) | \(N\) |
In the DSPA2 Appendix, we provide additional technical details, code, and applications of Bayesian simulation, modeling and inference.
If at least one \(P(F_i|C_L)=0\) then \(P(C_L|F_1, \cdots, F_n)=0\), which means the probability of being in this class is 0. However, \(P(F_i|C_L)=0\) could happen by random chance, e.g., when selecting the training data.
One of the solutions to this scenario is Laplace estimation, also known as Laplace smoothing, which can be accomplished in two ways. One is to add a small number to each count in the frequency table, which allows each class-feature combination to be at least one in the training data. Then \(P(F_i|C_L)>0\) for all \(i\) and we avoid degenerate cases. Another strategy is to add some small value, \(\epsilon\), to the numerator and denominator when calculating the posterior probability. Note that these small perturbations of the denominator should be larger than the changes in the numerator to avoid trivial (\(0\)) posterior for another class.
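A minimal R sketch of the first (add-a-constant) strategy, using hypothetical word counts for one class, could be:
# hypothetical counts of a word appearing in documents of a given class
word_counts <- c(never = 0, zantac = 12, heparin = 3)
n_docs  <- 100                     # hypothetical number of training documents in this class
laplace <- 1                       # Laplace smoothing constant
p_unsmoothed <- word_counts / n_docs
p_smoothed   <- (word_counts + laplace) / (n_docs + laplace * length(word_counts))
rbind(p_unsmoothed, p_smoothed)    # the zero estimate becomes a small positive probability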
We utilize the Inpatient Head and Neck Cancer Medication data for this case study, which is the case study 14 in our data archive.
Variables:
Let’s first load the cancer dataset.
hn_med<-read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1", stringsAsFactors = FALSE)
str(hn_med)
## 'data.frame': 662 obs. of 9 variables:
## $ PID : int 10000 10008 10029 10063 10071 10103 1012 10135 10136 10143 ...
## $ ENC_ID : int 46836 46886 47034 47240 47276 47511 3138 47739 47744 47769 ...
## $ seer_stage : int 1 1 4 1 9 1 1 1 9 1 ...
## $ MEDICATION_DESC : chr "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
## $ MEDICATION_SUMMARY: chr "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 15 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg Per PACU protocol" ...
## $ DOSE : chr "150" "5000" "1.5" "50" ...
## $ UNIT : chr "mg" "unit" "g" "microgram" ...
## $ FREQUENCY : chr "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
## $ TOTAL_DOSE_COUNT : int 5 3 11 2 1 2 2 6 15 1 ...
Change the seer_stage (cancer stage indicator) variable into a factor.
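The conversion code is not shown in the extracted text; a minimal sketch producing output of the type below is:
hn_med$seer_stage <- factor(hn_med$seer_stage)
str(hn_med$seer_stage); table(hn_med$seer_stage)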
## Factor w/ 9 levels "0","1","2","3",..: 2 2 5 2 9 2 2 2 9 2 ...
##
## 0 1 2 3 4 5 7 8 9
## 21 265 53 90 46 18 87 14 68
As you can see, the medication_summary contains a great amount of text. We should do some text mining to prepare the data for analysis. In R, the tm package is a good choice for text mining.
# install.packages("tm", repos = "https://cran.us.r-project.org")
# requires R V.3.3.1 +
library(tm)
The first step in text mining is to convert the text features (text elements) into a corpus object, which is a collection of text documents.
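The corpus construction code is not displayed above; a minimal sketch, assuming the standard tm workflow, is:
hn_med_corpus <- Corpus(VectorSource(hn_med$MEDICATION_SUMMARY))
print(hn_med_corpus)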
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 662
After we constructed the corpus object, we could see that we have 662 documents. Each document represents an encounter (e.g., notes on medical treatment) for a patient.
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 3
##
## [1] (Zantac) 150 mg tablet oral two times a day
## [2] 5,000 unit subcutaneous three times a day
## [3] (Unasyn) 15 g IV every 6 hours
## [1] "(Zantac) 150 mg tablet oral two times a day"
## [1] "5,000 unit subcutaneous three times a day"
## [1] "(Unasyn) 15 g IV every 6 hours"
There is unwanted punctuation and there are other symbols in the corpus documents that we want to remove. We use the tm_map() function for the cleaning.
corpus_clean <- tm_map(hn_med_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removePunctuation)
# corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
# corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
The above lines of code changed all the characters to lowercase, removed all punctuations and extra white spaces (typically created by deleting punctuations), and removed numbers (we could also convert the corpus to plain text).
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 3
##
## [1] zantac mg tablet oral two times a day unit subcutaneous three times a day
## [3] unasyn g iv every hours
## [1] "zantac mg tablet oral two times a day"
## [1] " unit subcutaneous three times a day"
## [1] "unasyn g iv every hours"
The DocumentTermMatrix() function can tokenize the medication summaries into words and count the frequent terms in each document in the corpus object.
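The corresponding code is not shown in the extracted text; a minimal sketch creating the document-term matrix used below is:
hn_med_dtm <- DocumentTermMatrix(corpus_clean)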
Just like in the kNN method, we need to separate the dataset into training and test subsets. We have to subset the raw data with other features, the corpus object and the document term matrix.
set.seed(12345)
subset_int <- sample(nrow(hn_med),floor(nrow(hn_med)*0.8)) # 80% training + 20% testing
hn_med_train <- hn_med[subset_int, ]
hn_med_test <- hn_med[-subset_int, ]
hn_med_dtm_train <- hn_med_dtm[subset_int, ]
hn_med_dtm_test <-hn_med_dtm[-subset_int, ]
corpus_train <- corpus_clean[subset_int]
corpus_test <- corpus_clean[-subset_int]
# hn_med_train<-hn_med[1:562, ]
#hn_med_test<-hn_med[563:662, ]
# hn_med_dtm_train<-hn_med_dtm[1:562, ]
# hn_med_dtm_test<-hn_med_dtm[563:662, ]
#corpus_train<-corpus_clean[1:562]
#corpus_test<-corpus_clean[563:662]
Let’s examine the distribution of seer stages in the training and test datasets.
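The code for these distributions is not shown; a minimal sketch (assuming the seer_stage factor created earlier) is:
prop.table(table(hn_med_train$seer_stage))
prop.table(table(hn_med_test$seer_stage))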
##
## 0 1 2 3 4 5 7
## 0.01890359 0.41965974 0.07561437 0.13988658 0.07561437 0.03213611 0.12098299
## 8 9
## 0.01890359 0.09829868
##
## 0 1 2 3 4 5
## 0.082706767 0.323308271 0.097744361 0.120300752 0.045112782 0.007518797
## 7 8 9
## 0.172932331 0.030075188 0.120300752
We can separate (dichotomize) the seer_stage variable into two categories: early_stage (seer_stage 0-4) and later_stage (seer_stage 5-9), as coded below.
Of course, other binarizations are possible as well. Note that a %in% b is an intuitive interface to match() and acts as a binary operator returning a logical vector (TRUE or FALSE) indicating whether there is a match between the left and right operands.
hn_med_train$stage <- hn_med_train$seer_stage %in% c(5:9)
hn_med_train$stage <- factor(hn_med_train$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test$stage <- hn_med_test$seer_stage %in% c(5:9)
hn_med_test$stage <- factor(hn_med_test$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train$stage))
##
## early_stage later_stage
## 0.7296786 0.2703214
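# the corresponding proportions for the test set (code not shown above) are presumably from:
prop.table(table(hn_med_test$stage))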
##
## early_stage later_stage
## 0.6691729 0.3308271
A word cloud can help us visualize text data, with more frequent words displayed in larger fonts and less common words in smaller fonts. The wordcloud package in R is commonly used for creating these figures.
# install.packages("wordcloud", repos = "https://cran.us.r-project.org")
library(wordcloud)
wordcloud(corpus_train, min.freq = 40, random.order = FALSE, colors=brewer.pal(5, "Dark2"))
The random.order=FALSE option makes more frequent words appear closer to the center of the word cloud. The min.freq=40 option sets the cutoff word frequency to be at least 40 times in the corpus object. Therefore, the words must appear in at least 40 medication summaries to be shown on the graph.
We can also visualize the difference between early stages and later stages using this type of graph.
early <- subset(hn_med_train, stage=="early_stage")
later <- subset(hn_med_train, stage=="later_stage")
wordcloud(early$MEDICATION_SUMMARY, max.words = 20, colors=brewer.pal(3, "Dark2"))
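# the matching word cloud for the later-stage group (code not shown above) would presumably be:
wordcloud(later$MEDICATION_SUMMARY, max.words = 20, colors=brewer.pal(3, "Dark2"))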
# tdm = as.matrix(hn_med_dtm)
# tdm1 <- tdm[nchar(rownames(tdm)) < 15, ]
# colnames(tdm1) <- hn_med_dtm$dimnames$Terms[1:14]
# comparison.cloud(tdm1, random.order=FALSE,
# colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
# title.size=1, min.words=5, max.words=50, scale=c(1.5, 0.4),rot.per=0.4)
We can see that the frequent words are somewhat different in the medication summaries of the early and later stage patients.
For simplicity, we utilize the medication summary as the only feature to classify cancer stages. You may recall that earlier in the kNN section we used the native data features to obtain kNN classifications. Now, we are going to use word frequencies as text-derived data features.
## Length Class Mode
## 108 character character
hn_med_dict <- as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train <- DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test <- DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))
The above code restricts the document-term matrix to words that appear at least 5 times across the training documents. This leaves about 108 features for us to use.
The Naive Bayes classifier trains on data with categorical features, as it uses frequency tables for learning the data affinities. To create the combinations of class and feature values comprising the frequency-table (matrix), all features must be categorical. For instance, numeric features have to be converted (binned) into categories. Thus, we need to transform our word count features into categorical data. One way to achieve this is to change the count into an indicator of whether this word appears in the document describing the patient encounter. We can create a simple function to convert the presence of a specific word (column) in an encounter document (row) to a binary “Yes” (present) or “No” (not present).
convert_counts <- function(wordFreq) {
wordFreq <- ifelse(wordFreq > 0, 1, 0)
wordFreq <- factor(wordFreq, levels = c(0, 1), labels = c("No", "Yes"))
return(wordFreq)
}
We employ hard-thresholding here, x <- ifelse(x>0, 1, 0). This says that if x is greater than 0, we assign the value 1 to it, otherwise the value is set to 0.
Now let's apply our own function convert_counts() to each column (MARGIN=2) of the training and testing datasets.
hn_train <- apply(hn_train, MARGIN = 2, convert_counts)
hn_test <- apply(hn_test, MARGIN = 2, convert_counts)
# Check the structure of hn_train and hn_test:
# head(hn_train); dim(hn_train)
So far, we have successfully created indicators for the words that appear at least 5 times across the training documents.
The package we will use for the Naive Bayes classifier is called e1071. The function naiveBayes() has the following components:
m <- naiveBayes(train, class, laplace=0)
Next, we will invoke the naiveBayes() classifier. Then we can use the classifier to make predictions using predict(). Recall that when we presented the AdaBoost example in Chapter 2, we saw the general machine learning and artificial intelligence protocol, from training, to prediction, and assessment. For most supervised and unsupervised methods, the generic function predict() has the following syntax:
p <- predict(m, test, type="class")
For predictions from a naiveBayes() model, the type argument ("class" or "raw") specifies whether the predictions should be the most likely class value or the raw predicted probabilities. Similar to the earlier kNN approach, we can use a cross table to compare the predicted vs. the true classes using the testing data.
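For completeness, here is a hedged reconstruction of the calls generating the cross-tabulation below; the exact chunk is not echoed, and the use of gmodels::CrossTable() as well as the object names hn_classifier and CT are assumptions consistent with the code that follows:
library(e1071)
library(gmodels)
hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)   # default laplace=0
hn_test_pred <- predict(hn_classifier, hn_test)             # predicted class labels
CT <- CrossTable(hn_test_pred, hn_med_test$stage)           # compare predictions vs. truth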
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 133
##
##
## | hn_med_test$stage
## hn_test_pred | early_stage | later_stage | Row Total |
## -------------|-------------|-------------|-------------|
## early_stage | 64 | 30 | 94 |
## | 0.019 | 0.039 | |
## | 0.681 | 0.319 | 0.707 |
## | 0.719 | 0.682 | |
## | 0.481 | 0.226 | |
## -------------|-------------|-------------|-------------|
## later_stage | 25 | 14 | 39 |
## | 0.046 | 0.093 | |
## | 0.641 | 0.359 | 0.293 |
## | 0.281 | 0.318 | |
## | 0.188 | 0.105 | |
## -------------|-------------|-------------|-------------|
## Column Total | 89 | 44 | 133 |
## | 0.669 | 0.331 | |
## -------------|-------------|-------------|-------------|
##
##
## $t
## y
## x early_stage later_stage
## early_stage 64 30
## later_stage 25 14
##
## $prop.row
## y
## x early_stage later_stage
## early_stage 0.6808511 0.3191489
## later_stage 0.6410256 0.3589744
##
## $prop.col
## y
## x early_stage later_stage
## early_stage 0.7191011 0.6818182
## later_stage 0.2808989 0.3181818
##
## $prop.tbl
## y
## x early_stage later_stage
## early_stage 0.4812030 0.2255639
## later_stage 0.1879699 0.1052632
mod_TN <- CT$prop.row[1, 1]
mod_FP <- CT$prop.row[1, 2]
mod_FN <- CT$prop.row[2, 1]
mod_TP <- CT$prop.row[2, 2]
# caret::confusionMatrix(hn_test_pred, hn_med_test$stage)
# CT$prop.row
library(plotly)
plot_ly(x = c("TN", "FN", "FP", "TP"),
y = c(mod_TN, mod_FN, mod_FP, mod_TP),
name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>%
layout(title="Confusion Matrix",
legend=list(title=list(text='<b> Metrics </b>')),
xaxis=list(title='Metrics'), yaxis=list(title='Probability'))
It may be worth quickly looking forward in Chapter 9 where we present a summary table containing commonly used measures for evaluating the performance of binary tests, classifiers, or predictions.
In this case, the prediction accuracy of the Naive Bayes classifier, assessed on the testing dataset, is:
\[ACC = \frac{T P + T N}{T P + F P + F N + T N } = \frac{78}{133}=0.59.\]
From the cross table we can see that our testing-data prediction accuracy is \(\frac{78}{133}=0.59\).
After setting laplace=5, the accuracy goes to \(acc=\frac{87}{133}=0.65\). Although there is a small difference in accuracy, in general, such protocol alterations may increase or decrease the rate of true later-stage patient identification using testing and validation data.
hn_med_dict <- as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train <- DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test <- DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))
set.seed(1234)
# View(as.data.frame(inspect(hn_train)))
traindf <- data.frame(stage=hn_med_train$stage, as.matrix(hn_train))
testdf <- data.frame(stage=hn_med_test$stage, as.matrix(hn_test))
# for Training and Testing data, transform predictors/features to factors
for (cols in names(traindf)) traindf[[cols]] <- factor(traindf[[cols]])
for (cols in names(testdf)) testdf[[cols]] <- factor(testdf[[cols]])
# check variable types
# str(traindf)
# run NB classifier (for cancer stage)
hn_classifier <- naiveBayes(traindf[,-1], traindf$stage, laplace = 5, type = "class")
hn_test_pred1 <- predict(hn_classifier, testdf)
caret::confusionMatrix(hn_test_pred1, testdf$stage)
## Confusion Matrix and Statistics
##
## Reference
## Prediction early_stage later_stage
## early_stage 83 40
## later_stage 6 4
##
## Accuracy : 0.6541
## 95% CI : (0.5668, 0.7344)
## No Information Rate : 0.6692
## P-Value [Acc > NIR] : 0.6804
##
## Kappa : 0.0292
##
## Mcnemar's Test P-Value : 1.141e-06
##
## Sensitivity : 0.93258
## Specificity : 0.09091
## Pos Pred Value : 0.67480
## Neg Pred Value : 0.40000
## Prevalence : 0.66917
## Detection Rate : 0.62406
## Detection Prevalence : 0.92481
## Balanced Accuracy : 0.51175
##
## 'Positive' Class : early_stage
##
Accuracy is \(\frac{87}{133} = 0.65\). This is a marginal performance improvement, compared to the prior (\(laplace=0\)) default model. Notice that choosing a different Laplace smoothing parameter alters the balance between the true/false and positive/negative elements in the \(2\times 2\) confusion matrix.
As mentioned earlier, Naive Bayes with normality assumption is a special case of Discriminant Analysis. It might be interesting to compare the prediction results of Naive Bayes and LDA classification.
Note: LDA assumes the predictors are jointly approximately normally distributed.
library(MASS)
# df_hn_train = data.frame(lapply(as.data.frame(hn_train),as.numeric),
# stage = hn_med_train$stage)
# df_hn_test = data.frame(lapply(as.data.frame(hn_test),as.numeric),
# stage = hn_med_test$stage)
library("dplyr")
binarizeFunction <- function(x) { ifelse(x=="Yes", 1,0) }
# A function to Convert Categorical variables to numeric
cat2Numeric <- function (dfInput) {
df = as.data.frame(lapply( as.data.frame(dfInput), factor)) %>%
mutate_all(binarizeFunction)
return(df)
}
# define the numeric DF of predictors (X) and outcome (Y=stage)
df_hn_train = data.frame(as.matrix(hn_train), stage = as.numeric(hn_med_train$stage))
df_hn_test = data.frame(as.matrix(hn_test), stage = as.numeric(hn_med_test$stage))
# Remove the multicollinearity - this should be done via VIF assessment,
# but for now, just take the first few predictors
df_hn_train <- df_hn_train[ , c(1:34, 40:50, 60:70, 109)]
df_hn_train$stage <- hn_med_train$stage
df_hn_test$stage <- hn_med_test$stage
# Fit LDA
set.seed(1234)
hn_lda <- lda(data=df_hn_train, stage ~ .)
# hn_pred = predict(hn_lda, df_hn_test[,-104])
hn_pred = predict(hn_lda, df_hn_test)
CrossTable(hn_pred$class, df_hn_test$stage)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 133
##
##
## | df_hn_test$stage
## hn_pred$class | early_stage | later_stage | Row Total |
## --------------|-------------|-------------|-------------|
## early_stage | 85 | 37 | 122 |
## | 0.138 | 0.280 | |
## | 0.697 | 0.303 | 0.917 |
## | 0.955 | 0.841 | |
## | 0.639 | 0.278 | |
## --------------|-------------|-------------|-------------|
## later_stage | 4 | 7 | 11 |
## | 1.535 | 3.104 | |
## | 0.364 | 0.636 | 0.083 |
## | 0.045 | 0.159 | |
## | 0.030 | 0.053 | |
## --------------|-------------|-------------|-------------|
## Column Total | 89 | 44 | 133 |
## | 0.669 | 0.331 | |
## --------------|-------------|-------------|-------------|
##
##
There are differences in the performance of the Naive Bayesian (\(acc=\frac{87}{133}=0.65\)) and the LDA (\(acc=\frac{92}{133}=0.69\)) classifiers in terms of overall accuracy. The type II (false-negative) error rates of the Naive Bayesian and LDA classifiers are \(\frac{40}{133}=0.30\) and \(\frac{37}{133}=0.28\), respectively. Such differences may have a nuanced clinical importance for early detection of later-stage cancer.
Later, we will step deeper into the space of classification problems and see more sophisticated approaches, especially for difficult problems like cancer forecasting.
When classification models need to be apparent and directly mechanistically interpreted, kNN and naive Bayes methods we saw earlier may not be useful, as they do not yield explicit classification rules. In some cases, we need to derive direct and clear rules for our classification decisions. Just like a scoring criterion for driving ability, credit scoring, or grading rubrics, mechanistic understanding of the derived labeling, prediction, or decision making is often beneficial. For instance, we can assign credit or deduct a penalty for each driving operation task. These rules facilitate the critical decision to issue a driver license to applicants that are qualified, ready, and able to safely operate a specific type of a motor vehicle.
In this section, we will (1) see a simple motivational example of decision trees based on the Iris data, (2) describe decision-tree divide and conquer methods, (3) examine certain measures quantifying classification accuracy, (4) show strategies for pruning decision trees, (5) work through a Quality of Life in Chronic Disease case-study, and (6) review the One Rule and RIPPER algorithms.
Decision tree learners enable classification via tree structures
modeling the relationships among all features and potential outcomes in
the data. All decision trees begin with a trunk representing all points
that are part of the same cohort. Iteratively, the trunk is split into
narrower and narrower branches by forking-decisions based on the
intrinsic data structure. At each step, splitting the data into branches
may include binary or multinomial classification. The final decision is
obtained when the tree branching process terminates. The terminal (leaf)
nodes represent the action to be taken as the result of the series of
branching decisions. For predictive models, the leaf nodes provide the
expected forecasting results given the series of events in the decision
tree. There are a number of R packages available for decision tree classification, including rpart, C5.0, party, etc.
Let's start with a simple example using the Iris dataset, which we saw in Chapter 2. The data features or attributes include Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, and the classes are represented by the Species taxa (setosa, versicolor, and virginica).
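The structure, first few cases, and species frequencies shown below can be obtained, presumably, with:
str(iris)
head(iris)
table(iris$Species)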
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
##
## setosa versicolor virginica
## 50 50 50
The call ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris) will build a conditional-inference decision tree.
library(party)   # ctree() is provided by the party package
iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
print(iris_ctree)
##
## Conditional inference tree with 4 terminal nodes
##
## Response: Species
## Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
## Number of observations: 150
##
## 1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
## 2)* weights = 50
## 1) Petal.Length > 1.9
## 3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
## 4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
## 5)* weights = 46
## 4) Petal.Length > 4.8
## 6)* weights = 8
## 3) Petal.Width > 1.7
## 7)* weights = 46
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
Similarly, we can demonstrate a classification of the iris taxa via rpart:
# install from https://cran.r-project.org/web/packages/rpart.utils/
library(rpart)
library(rpart.utils)
library(plotly)
# iris_rpart = rpart(Species~., data=iris)
# print(iris_rpart)
#
# Use `rattle::fancyRpartPlot` to generate an elegant plot
# install.packages("rattle")
# library(rattle)
# fancyRpartPlot(iris_rpart, cex = 1.5, caption = "rattle::fancyRpartPlot (Iris flowers)")
#
#### Plot_ly decision tree visualization
iris_rpart <- rpart(Species ~ . , method="class", data=iris)
# printcp(iris_rpart)
# Set the Graph/Tree Node names
treeFrame <- iris_rpart$frame
# Identify Leaf nodes
isLeave <- treeFrame$var == "<leaf>"
nodes <- rep(NA, length(isLeave))
# Get the 3 iris taxa labels: "setosa", "versicolor", "virginica" for all nodes (including leaves)
ylevel <- attr(iris_rpart, "ylevels")
nodes[isLeave] <- ylevel[treeFrame$yval][isLeave]
nodes[!isLeave] <- labels(iris_rpart)[-1][!isLeave[-length(isLeave)]]
# Get all Tree Graph Connections (source --> target)
treeFrame <- iris_rpart$frame
treeRules <- rpart.utils::rpart.rules(iris_rpart)
targetPaths <- sapply(as.numeric(row.names(treeFrame)),
function(x) { strsplit(unlist(treeRules[x]),split=",") } )
lastStop <- sapply(1:length(targetPaths),
function(x) { targetPaths[[x]][length(targetPaths[[x]])] } )
oneBefore <- sapply(1:length(targetPaths),
function(x) { targetPaths[[x]][length(targetPaths[[x]])-1] } )
# Initialize the tree edges
target=c()
source=c()
values=treeFrame$n
for(i in 2:length(oneBefore)) {
tmpNode=oneBefore[[i]]
q = which(lastStop==tmpNode)
q=ifelse(length(q)==0, 1, q)
source = c(source, q)
target = c(target, i)
}
source=source-1
target=target-1
# Display the sankey tree
plot_ly(type = "sankey", orientation = "h",
node = list(label = nodes, pad = 15, thickness = 30, line = list(color="black", width=1)),
link = list(source = source, target = target, value=values[-1])) %>%
layout(title = "Iris rPart Decision Tree", font = list(size = 12))
The decision tree algorithm can be thought of as an upside-down tree with many branches, where a series of logical decisions is encoded as branches between nodes. The classification begins at the root node and follows the branches until it reaches a terminal node. This iterative process splits the data into different classes by a rigid criterion.
Decision trees involve recursive partitioning that use data features and attributes to split the data into groups (nodes) of similar classes.
To make classification trees using data features, we need to observe the pattern between the data features and potential classes using training data. We can draw scatter plots and separate groups that are clearly bundled together. Each of the groups is considered a segment of the data. After getting the approximate range of each feature value within each group, we can make the decision tree.
\[X = [X_1, X_2, X_3, \cdots, X_k] = \underbrace{ \begin{pmatrix} x_{1, 1} & x_{1, 2} & \cdots & x_{1, k} \\ x_{2, 1} & x_{2, 2} & \cdots & x_{2, k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n, 1} & x_{n, 2} & \cdots & x_{n, k} \end{pmatrix} }_{features/attributes} \left\} \begin{array}{ll} & \\ cases & \\ & \end{array} \right.\]
The decision tree algorithms use a top-down recursive divide-and-conquer approach (sometimes they may also use bottom up or mixed splitting strategies) to divide and evaluate the splits of a dataset \(D\) (input). The best split decision corresponds to the split with the highest information gain, reflecting a partition of the data into \(K\) subsets (using divide-and-conquer). The iterative algorithm terminates when some stopping criteria are reached. Examples of stopping conditions used to terminate the recursive process include:
- All the cases in a given node belong to the same class, i.e., the node is pure.

One objective criterion for splitting or clustering data into groups is based on the information gain measure, or impurity reduction, which can be used to select the test attribute at each node in the decision tree. The attribute with the highest information gain (i.e., greatest entropy reduction) is selected as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions. Three commonly used measures for evaluating the impurity reduction are: Misclassification error, Gini index and Entropy.
For a given table containing pairs of attributes and their class labels, we can assess the homology of the classes in the table. A table is pure (homogeneous) if it only contains a single class. If a data table contains several classes, then we say that the table is impure or heterogeneous. This degree of impurity or heterogeneity can be quantitatively evaluated using impurity measures like entropy, Gini index, and misclassification error.
The Entropy is an information measure of the amount of disorder or uncertainty in a system. Suppose we have a data set \(D=(X_1, X_2, \cdots, X_n)\) that includes \(n\) features (variables) and suppose each of these features can take on any of \(k\) possible values (states). Then the cardinality of the entire system is \(k^n\), as each of the features is assumed to have \(k\) independent states; thus the total number of different datasets that can be expected is \(\underbrace{k\times k\times \cdots\times k}_{n}=k^n\). Suppose \(p_{1},p_{2},\cdots, p_k\) represent the proportions of each class (note: \(\sum_{i} {p_i}=1\)) present in the child node that results from a split in a decision tree classifier. Then the entropy measure is defined by:
\[Entropy(D)= - \sum_{i} {p_i \log_{2}{p_i}}.\]
If each of the \(1\le i\leq k\) states for each feature is equally likely to be observed with probability \(p_i=\frac{1}{k}\), then the entropy is maximized: \[ Entropy(D)=- \sum_{i=1}^{k} {\frac{1}{k} \log{\frac{1}{k}}}= \sum_{i=1}^{k} {\frac{1}{k} \log{k}}= k{\frac{1}{k} \log{k}}=\log{k} = O(1).\]
In the other extreme, the entropy is minimized when the probability of one class is unitary and all other ones are trivial. Note that by L'Hopital's Rule \(\lim_{x\longrightarrow 0} {(x\times \log(x))} =\lim_{x\longrightarrow 0} {\frac{\frac{1}{x}}{-\frac{1}{x^2}}}=\lim_{x\longrightarrow 0} {(-x)}=0\), so for a single class classification where the probability of one class is unitary (\(p_{i_o}=1\)) and the other ones are trivial (\(p_{i\not= i_o}=0\)):
\[Entropy(D)=p_{i_o}\times \log{p_{i_o}} + \sum_{i\not=i_o}{p_i \log{p_i}}=\] \[=1\times \log{1} + \lim_{x \longrightarrow 0}{\sum_{i\not=i_o}{x \log(x)}} = 0+0=0.\]
In classification settings, higher entropy (i.e., more disorder) corresponds to a sample that has a mixed collection of labels. Conversely, lower entropy corresponds to a classification where we have mostly pure partitions. In general, the entropy of a sample \(D=\{X_1, X_2, \cdots, X_n\}\) is defined by:
\[H(D) = - \sum_{i=1}^{k} {P(c_i|D) \log{P(c_i|D)}}, \]
where \(P(c_i|D)\) is the probability of a data point in \(D\) being labeled with class \(c_i\), and \(k\) is the number of classes (clusters). \(P(c_i|D)\) can be estimated from the observed data by:
\[P(c_i|D) = \frac{P(D | c_i)P(c_i)}{P(D)} = \frac{|\{x_j \in D | x_j \text{ has label } y_j = c_i\}|}{|D|}.\]
Observe that if the observations are evenly split among all \(k\) classes then \(P(c_i|D)=\frac{1}{k}\) and \[H(D) = - \sum_{i=1}^{k} {\frac{1}{k} \log{\frac{1}{k}}}=\log{k}=O(1).\]
At the other extreme, if all the observations are from one class then: \[H(D)=-1*\log(1)=0.\]
Also note that the base of the log function is somewhat irrelevant and can be used to normalize (scale) the range of the entropy \(\left ( \log_b(x) = \frac{\log_2(x)}{\log_2(b)}\right )\).
The Information gain is the expected reduction in entropy caused by knowing the value of an attribute. Similar to the Entropy measure, the Misclassification error and the Gini index are also applied to evaluate information gain. The Misclassification error is defined by the formula:
\[ME = 1-\max_k (p_k).\]
And the Gini index is expressed as: \[GI = \sum_{k}p_k(1-p_k)=1-\sum_{k}p_k^2.\]
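As a quick numerical illustration (a hedged sketch, not part of the original chapter code), we can compare the three impurity measures for a toy node containing 80% of one class and 20% of the other:
p <- c(0.8, 0.2)                  # class proportions within the node
entropy_val <- -sum(p * log2(p))  # ~ 0.722
miscl_err <- 1 - max(p)           # 0.2
gini_idx <- 1 - sum(p^2)          # 0.32
c(Entropy = entropy_val, ME = miscl_err, Gini = gini_idx)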
The C5.0 algorithm is a popular implementation of decision trees. To begin with, let's consider the term purity. If the segments of data contain a single class, they are considered pure. The entropy represents a mathematical formalism measuring purity for data segments.
\[Entropy(S)=- \sum_{i=1}^c {p_i \log_2(p_i)},\]
where c is the total number of class levels, and \(p_i\) refers to the proportion of observations that fall into the \(i^{th}\) class (i.e., the probability of a randomly selected data point belonging to the \(i^{th}\) class level). For 2 possible classes, the entropy ranges from 0 to 1. For \(n\) classes, the entropy ranges from \(0\) to \(\log_2(n)\), where the minimum entropy corresponds to data that is purely homogeneous (completely deterministic/predictable) and the maximum entropy represents completely disordered data (stochastic or extremely noisy). You might wonder what the benefit of using entropy is. The answer is that the smaller the entropy, the more information is contained in the corresponding decision node split. Systems (data) with high entropy indicate significant information content (randomness), whereas data with low entropy indicate highly-compressible data with structure in them.
If we only have one class in the segment, then the entropy is trivial, \(Entropy(S)=(-1)\times \log_2(1)=0\). Let's try another example. If we have a segment of data that contains two classes, where the first class contains 80% of the data and the second class contains the remaining 20%, then the entropy is \(Entropy(S)=-0.8\log_2(0.8)-0.2\log_2(0.2)=0.7219281.\)
For a two class problem, the relationship between class-proportions and entropy is illustrated in the following graph. Assume \(x\) is the proportion for one of the classes.
N <- 1000
set.seed(1234)
x <- runif(N)
x_ind <- sort(x)
y <- -x_ind*log2(x_ind)-(1-x_ind)*log2(1-x_ind)
# curve(-x*log2(x)-(1-x)*log2(1-x), col="red", main="Entropy for Different Proportions", xlab = "x (proportion for class 1)", ylab = "Entropy", lwd=3)
# segments(0.8, 0, 0.8, 0.7219281, lwd=2, col="blue", lty = 2)
# points(0.8, 0.7219281, col = "blue", pch=21, cex=2, lwd=2)
# text(-0.02, 0.3, pos=4, srt = 90, "High-Structure/Low-Entropy")
# text(0.48, 0.3, pos=4, srt = 90, "Low-Structure/High-Entropy")
# text(0.96, 0.3, pos=4, srt = 90, "High-Structure/Low-Entropy")
# text(0.65, 0.8, pos=4, "(x=0.8,Entr=0.72)")
modelLabels <- c('High-Structure/Low-Entropy', 'Low-Structure/High-Entropy', 'High-Structure/Low-Entropy')
modelLabels.x <- c(0.03, 0.5, 0.97)
modelLabels.y <- c(0.5, 0.7, 0.5)
modelLabels.col <- c("blue", "red", "blue")
plot_ly() %>%
add_lines(x = ~x_ind, y = ~y, name="Entropy", line = list(width = 4)) %>%
add_segments(x=0.8, xend=0.8, y=0.001, yend = 0.7219281, showlegend=F) %>%
# add_segments(x=-0.2, xend=1.1, y=0, yend = 0, showlegend=F) %>%
# add_markers(x = ~x, y = ~y, name="Sample Simulated Data") %>%
#add_lines(x = ~c(point1$x,point2$x), y = ~c(point1$y,point2$y), name="Second PC, Orthogonal to lm(Y ~ X)",
# line = list(width = 4)) %>%
add_markers(x = 0.8, y = 0.7219281, name="(x=0.8,Entropy=0.72)", marker = list(size = 20,
color = 'green', line = list(color = 'yellow', width = 2))) %>%
layout(title="Binary Class Entropy",
xaxis=list(title="Proportion of Class 1 Observations", scaleanchor="y"), # control the y:x axes aspect ratio
yaxis = list(title="Entropy", scaleanchor = "x", range=c(-0.1, 1.1)), legend = list(orientation = 'h'),
annotations = list(text=modelLabels, x=modelLabels.x, y=modelLabels.y, textangle=90,
font=list(size=15, color=modelLabels.col), showarrow=FALSE))
The closer the binary proportion split is to \(0.5\) the greater the entropy. The more homogeneous the split (one class becomes the majority) the lower the entropy. Decision trees aim to find splits in the data that reduce the entropy, i.e., increasing the homogeneity of the elements within all classes.
This measuring mechanism could be used to measure and compare the information we get using different features as data partitioning characteristics. Let’s consider this scenario. Suppose \(S\) and \(S_1\) represent the entropy of the system before and after the splitting/partitioning of the data according to a specific data feature attribute (\(F\)). Denote the entropies of the original and the derived partition by \(Entropy(S)\) and \(Entropy(S_1)\), respectively. The information we gained from partitioning the data using this specific feature (F) is calculated as a change in the entropy:
\[Gain(F) = Entropy(S)-Entropy(S_1)\]
Note that smaller entropy \(Entropy(S_1)\) corresponds with better classification and more information gained. A more complicated case would be that the partitions create multiple segments. Then, the entropy for each partition method is calculated by the following formula:
\[Entropy(S)=\sum_{i=1}^n w_i Entropy(P_i)=\sum_{i=1}^{n} w_i \left (-\sum_{j=1}^{c}p_j\log_2(p_j)\right ),\]
where \(w_i\) is the proportion of examples falling in segment \(P_i\). Thus, the total entropy of a partition method is calculated by a weighted sum of the entropies for each segment created by this method.
When we get the maximum reduction in entropy with a feature (\(F\)) then the \(Gain(F)=Entropy(S)\), since \(Entropy(S_1)=0\). On the contrary, if we gain no information with this feature, we have \(Gain(F)=0\).
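To make the gain computation concrete, here is a hedged toy sketch (the numbers are illustrative, not from the chapter data) of the weighted child entropy and the resulting information gain for a hypothetical binary split:
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
parent_entropy <- entropy(c(0.6, 0.4))      # parent node with a 60/40 class mix
child1 <- entropy(c(0.9, 0.1))              # left child (holds 50% of the cases)
child2 <- entropy(c(0.3, 0.7))              # right child (holds 50% of the cases)
weighted_child <- 0.5*child1 + 0.5*child2   # Entropy(S_1) as a weighted sum
gain <- parent_entropy - weighted_child     # positive gain => the split reduces entropy
gain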
While making a decision tree, we can classify the observations using as many splits as we want, which eventually might over-classify the data. An extreme example would be treating each observation as its own class, which is meaningless.
So how do we control the size of the decision tree? One possible solution is to set a cutoff for the number of decisions that a decision tree can make. Similarly, we can control the minimum number of points in each segment. This method is called early stopping, or pre-pruning, the decision tree. However, this might terminate the decision procedure before some important partitions are made. Another solution, post-pruning, begins by growing a large decision tree and subsequently reduces the branches based on error rates, with a penalty at the nodes. This is often more effective than pre-pruning.
The C5.0 algorithm uses the post-pruning method to control the size of the decision tree. It first grows a large, overfitting tree that contains all the possibilities of partitioning. Then, it cuts out nodes and branches that have little effect on the classification errors.
Let's use the Quality of life and chronic disease dataset, Case06_QoL_Symptom_ChronicIllness.csv, which has 41 variables. A detailed description of each variable is provided here.
Important variables:
Let's load the data first.
## 'data.frame': 2356 obs. of 41 variables:
## $ ID : int 171 171 172 179 180 180 181 182 183 186 ...
## $ INTERVIEWDATE : int 0 427 0 0 0 42 0 0 0 0 ...
## $ LANGUAGE : int 1 1 1 1 1 1 1 1 1 2 ...
## $ AGE : int 49 49 62 44 64 64 52 48 49 78 ...
## $ RACE_ETHNICITY : int 3 3 3 7 3 3 3 3 3 4 ...
## $ SEX : int 2 2 2 2 1 1 2 1 1 1 ...
## $ QOL_Q_01 : int 4 4 3 6 3 3 4 2 3 5 ...
## $ QOL_Q_02 : int 4 3 3 6 2 5 4 1 4 6 ...
## $ QOL_Q_03 : int 4 4 4 6 3 6 4 3 4 4 ...
## $ QOL_Q_04 : int 4 4 2 6 3 6 2 2 5 2 ...
## $ QOL_Q_05 : int 1 5 4 6 2 6 4 3 4 3 ...
## $ QOL_Q_06 : int 4 4 2 6 1 2 4 1 2 4 ...
## $ QOL_Q_07 : int 1 2 5 -1 0 5 8 4 3 7 ...
## $ QOL_Q_08 : int 6 1 3 6 6 6 3 1 2 4 ...
## $ QOL_Q_09 : int 3 4 3 6 2 2 4 2 2 4 ...
## $ QOL_Q_10 : int 3 1 3 6 3 6 3 2 4 3 ...
## $ MSA_Q_01 : int 1 3 2 6 2 3 4 1 1 2 ...
## $ MSA_Q_02 : int 1 1 2 6 1 6 4 3 2 4 ...
## $ MSA_Q_03 : int 2 1 2 6 1 2 3 3 1 2 ...
## $ MSA_Q_04 : int 1 3 2 6 1 2 1 4 1 5 ...
## $ MSA_Q_05 : int 1 1 1 6 1 2 1 6 3 2 ...
## $ MSA_Q_06 : int 1 2 2 6 1 2 1 1 2 2 ...
## $ MSA_Q_07 : int 2 1 3 6 1 1 1 1 1 5 ...
## $ MSA_Q_08 : int 1 1 1 6 1 1 1 1 2 1 ...
## $ MSA_Q_09 : int 1 1 1 6 2 2 4 6 2 1 ...
## $ MSA_Q_10 : int 1 1 1 6 1 1 1 1 1 3 ...
## $ MSA_Q_11 : int 2 3 2 6 1 1 2 1 1 5 ...
## $ MSA_Q_12 : int 1 1 2 6 1 1 2 6 1 3 ...
## $ MSA_Q_13 : int 1 1 1 6 1 6 2 1 4 2 ...
## $ MSA_Q_14 : int 1 1 1 6 1 2 1 1 3 1 ...
## $ MSA_Q_15 : int 2 1 1 6 1 1 3 2 1 3 ...
## $ MSA_Q_16 : int 2 3 5 6 1 2 1 2 1 2 ...
## $ MSA_Q_17 : int 2 1 1 6 1 1 1 1 1 3 ...
## $ PH2_Q_01 : int 3 2 1 5 1 1 3 1 2 3 ...
## $ PH2_Q_02 : int 4 4 1 5 1 2 1 1 4 2 ...
## $ TOS_Q_01 : int 2 2 2 4 1 1 2 2 1 1 ...
## $ TOS_Q_02 : int 1 1 1 4 4 4 1 2 4 4 ...
## $ TOS_Q_03 : int 4 4 4 4 4 4 4 4 4 4 ...
## $ TOS_Q_04 : int 5 5 5 5 5 5 5 5 5 5 ...
## $ CHARLSONSCORE : int 2 2 3 1 0 0 2 8 0 1 ...
## $ CHRONICDISEASESCORE: num 1.6 1.6 1.54 2.97 1.28 1.28 1.31 1.67 2.21 2.51 ...
Most of the coded variables like QOL_Q_01 (health rating) have ordinal values (1=excellent, 2=very good, 3=good, 4=fair, 5=poor, 6=no answer) and we can use the table() function to see their distributions. We also have some numerical variables in the dataset, like CHRONICDISEASESCORE, which can be examined using summary().
Our outcome of interest, CHRONICDISEASESCORE, has some missing data. A simple way to address this is just to remove the observations with missing values. However, we could also try to impute the missing values using the various imputation methods mentioned in Chapter 2.
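The distribution of QOL_Q_01 and the summary of CHRONICDISEASESCORE shown below can be obtained, presumably, with:
table(qol$QOL_Q_01)
summary(qol$CHRONICDISEASESCORE)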
##
## 1 2 3 4 5 6
## 44 213 801 900 263 135
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.880 1.395 1.497 1.970 4.760
Let's create two classes using the variable CHRONICDISEASESCORE. We classify the patients with CHRONICDISEASESCORE < mean(CHRONICDISEASESCORE) as having minor disease and the rest as having severe disease. This dichotomous classification (qol$cd) may not be perfect and we will talk about alternative classification strategies in the practice problem at the end of the chapter.
qol$cd <- qol$CHRONICDISEASESCORE > 1.497 # mean(qol$CHRONICDISEASESCORE)
qol$cd <- factor(qol$cd, levels=c(F, T), labels = c("minor_disease", "severe_disease"))
To make the qol data more organized, we can order the data by the variable ID.
qol <- qol[order(qol$ID), ]
# Remove ID (col=1) # the clinical Diagnosis (col=41) will be handled later
qol <- qol[ , -1]
Then, we are able to subset training and testing datasets. Here is an example of a non-random split of the entire data into training (\(2114\)) and testing (\(100\)) sets:
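A hedged sketch of such an index-based split (assuming the 2,214 complete cases are ordered as above; the exact chunk is not shown) might look like:
# qol_train <- qol[1:2114, ]       # first 2,114 rows for training
# qol_test <- qol[2115:2214, ]     # remaining 100 rows for testing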
And here is an example of random assignments of cases into training and testing sets (80-20% split):
set.seed(1234)
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train <- qol[train_index, ]
qol_test <- qol[-train_index, ]
We can quickly inspect the distributions of the training and testing data to ensure they are not vastly different. We can see that the classes are split fairly equally between the training and test datasets.
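The class proportions below are presumably produced by:
prop.table(table(qol_train$cd))
prop.table(table(qol_test$cd))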
##
## minor_disease severe_disease
## 0.5324675 0.4675325
##
## minor_disease severe_disease
## 0.4853273 0.5146727
In this section we are using the C5.0() function from the C50 package. The function C5.0() has the following components:
m <- C5.0(train, class, trials=1, costs=NULL)
You could delete the # in the following code and run it in R to install and load the C50 package.
In the qol dataset (the ID column is already removed), column 41 is the class vector (qol$cd) and column 40 is the numerical version of vector 41 (qol$CHRONICDISEASESCORE). We need to delete these two columns to create our training data that only contains the feature variables used for predicting the outcome.
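A hedged reconstruction of the chunk producing the feature summary and the fitted C5.0 model reported below (the fit is consistent with the C5.0.default() call echoed in the output, but the exact chunk is not shown):
# install.packages("C50")
library(C50)
summary(qol_train[ , -c(40, 41)])                          # feature summary
set.seed(1234)
qol_model <- C5.0(qol_train[ , -c(40, 41)], qol_train$cd)  # fit the classification tree
qol_model                                                  # print tree size
summary(qol_model)                                         # detailed tree, errors, attribute usage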
## INTERVIEWDATE LANGUAGE AGE RACE_ETHNICITY
## Min. : 0.00 Min. :1.000 Min. :20.00 Min. :1.000
## 1st Qu.: 0.00 1st Qu.:1.000 1st Qu.:52.00 1st Qu.:3.000
## Median : 0.00 Median :1.000 Median :59.00 Median :3.000
## Mean : 20.29 Mean :1.199 Mean :58.55 Mean :3.618
## 3rd Qu.: 0.00 3rd Qu.:1.000 3rd Qu.:66.00 3rd Qu.:4.000
## Max. :440.00 Max. :2.000 Max. :90.00 Max. :7.000
## SEX QOL_Q_01 QOL_Q_02 QOL_Q_03
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
## Median :1.000 Median :4.000 Median :3.000 Median :4.000
## Mean :1.423 Mean :3.678 Mean :3.417 Mean :3.705
## 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :2.000 Max. :6.000 Max. :6.000 Max. :6.000
## QOL_Q_04 QOL_Q_05 QOL_Q_06 QOL_Q_07
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :-1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 1.000
## Median :3.000 Median :3.000 Median :3.000 Median : 5.000
## Mean :3.021 Mean :3.099 Mean :2.963 Mean : 4.172
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.: 7.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :10.000
## QOL_Q_08 QOL_Q_09 QOL_Q_10 MSA_Q_01
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :3.000 Median :3.000
## Mean :2.885 Mean :3.151 Mean :2.922 Mean :2.975
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_02 MSA_Q_03 MSA_Q_04 MSA_Q_05
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :3.000 Median :2.000 Median :2.000 Median :1.000
## Mean :3.066 Mean :2.298 Mean :2.404 Mean :1.895
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_06 MSA_Q_07 MSA_Q_08 MSA_Q_09 MSA_Q_10
## Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :1.00 Median :2.000 Median :1.000
## Mean :2.401 Mean :2.172 Mean :1.48 Mean :2.193 Mean :1.717
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:1.00 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :6.000 Max. :6.000 Max. :6.00 Max. :6.000 Max. :6.000
## MSA_Q_11 MSA_Q_12 MSA_Q_13 MSA_Q_14
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :2.242 Mean :2.036 Mean :1.897 Mean :2.002
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## MSA_Q_15 MSA_Q_16 MSA_Q_17 PH2_Q_01
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :2.000
## Mean :1.765 Mean :1.823 Mean :2.061 Mean :2.479
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :6.000 Max. :6.000 Max. :6.000 Max. :6.000
## PH2_Q_02 TOS_Q_01 TOS_Q_02 TOS_Q_03
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:4.000
## Median :2.000 Median :2.000 Median :4.000 Median :4.000
## Mean :2.368 Mean :2.211 Mean :3.297 Mean :3.787
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :6.000 Max. :5.000 Max. :5.000 Max. :5.000
## TOS_Q_04 CHARLSONSCORE
## Min. :1.000 Min. :-9.0000
## 1st Qu.:5.000 1st Qu.: 0.0000
## Median :5.000 Median : 1.0000
## Mean :4.679 Mean : 0.9339
## 3rd Qu.:5.000 3rd Qu.: 1.0000
## Max. :6.000 Max. :10.0000
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Tree size: 5
##
## Non-standard options: attempt to group attributes
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd)
##
##
## C5.0 [Release 2.07 GPL Edition] Sat Apr 20 18:03:53 2024
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 1771 cases (40 attributes) from undefined.data
##
## Decision tree:
##
## CHARLSONSCORE <= 0: minor_disease (645/169)
## CHARLSONSCORE > 0:
## :...CHARLSONSCORE > 5: minor_disease (29/9)
## CHARLSONSCORE <= 5:
## :...AGE > 47: severe_disease (936/352)
## AGE <= 47:
## :...MSA_Q_08 <= 1: minor_disease (133/46)
## MSA_Q_08 > 1: severe_disease (28/8)
##
##
## Evaluation on training data (1771 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 5 584(33.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 583 360 (a): class minor_disease
## 224 604 (b): class severe_disease
##
##
## Attribute usage:
##
## 100.00% CHARLSONSCORE
## 61.94% AGE
## 9.09% MSA_Q_08
##
##
## Time: 0.0 secs
The output of qol_model indicates that we have a tree with 5 terminal nodes. summary(qol_model) suggests that the classification error for the decision tree is about 33% in the training data.
Now we can make predictions using the decision tree that we just built. The predict() function we will use is the same as the one we showed earlier, e.g., in the Naive Bayes classifier and in Chapter 2. In general, predict() is extended by each specific type of regression, classification, clustering or forecasting machine learning technique. For example, randomForest::predict.randomForest() is invoked by:
predict(RF_model, newdata, type="response", norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE, cutoff, ...),
where type represents the type of prediction output to be generated - "response" (equivalent to "class"), "prob" or "votes". Thus, the predicted values are either predicted "response" class labels, a matrix of class probabilities, or vote counts.
This time we are going to introduce the confusionMatrix() function from the caret package as the evaluation method. When we combine it with the table() function, the output of the evaluation is very straightforward.
# See docs for predict # ?C50::predict.C5.0
qol_pred <- predict(qol_model, qol_test[ ,-c(40, 41)]) # removing the last 2 columns CHRONICDISEASESCORE and cd, which represent the clinical outcomes we are predicting!
# install.packages("caret")
library(caret)
confusionMatrix(table(qol_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred minor_disease severe_disease
## minor_disease 143 71
## severe_disease 72 157
##
## Accuracy : 0.6772
## 95% CI : (0.6315, 0.7206)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 3.001e-12
##
## Kappa : 0.3538
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6651
## Specificity : 0.6886
## Pos Pred Value : 0.6682
## Neg Pred Value : 0.6856
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4831
## Balanced Accuracy : 0.6769
##
## 'Positive' Class : minor_disease
##
The confusion matrix shows a prediction accuracy of about 68%; however, this may vary (see the corresponding confidence interval).
Recall that the C5.0() function has a trials option, an integer specifying the number of boosting iterations. The default value of one indicates that a single model is used, and we can specify a larger number of iterations, for instance trials=6.
set.seed(1234)
qol_boost6 <- C5.0(qol_train[ , -c(40, 41)], qol_train$cd, trials=6)
# try alternative values for the trials option
qol_boost6
##
## Call:
## C5.0.default(x = qol_train[, -c(40, 41)], y = qol_train$cd, trials = 6)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Number of boosting iterations: 6
## Average tree size: 4.7
##
## Non-standard options: attempt to group attributes
The size of the tree may vary with each repeated experiment and depends on the parameter settings.
Since this is a fairly small tree, we can visualize it with the function plot(). We also use the option type="simple" to make the tree look more condensed. We can also zoom in and display just a specific branch of the tree by using the subtree parameter.
Caution: The plotting of decision trees will fail if you have columns that start with numbers or special characters (e.g., “5variable”, “!variable”). In general, avoid spaces, special characters, and other non-terminal symbols in column/row names.
The next step is to demonstrate prediction and testing of the accuracy of a fitted decision tree model.
qol_boost_pred6 <- predict(qol_boost6, qol_test[ ,-c(40, 41)])
confusionMatrix(table(qol_boost_pred6, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_boost_pred6 minor_disease severe_disease
## minor_disease 145 69
## severe_disease 70 159
##
## Accuracy : 0.6862
## 95% CI : (0.6408, 0.7292)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 1.747e-13
##
## Kappa : 0.3718
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6744
## Specificity : 0.6974
## Pos Pred Value : 0.6776
## Neg Pred Value : 0.6943
## Prevalence : 0.4853
## Detection Rate : 0.3273
## Detection Prevalence : 0.4831
## Balanced Accuracy : 0.6859
##
## 'Positive' Class : minor_disease
##
The accuracy is about 69%. However, this may vary each time we run the experiment (mind the confidence interval). In some studies, the trials option provides significant improvement to the overall accuracy. A good choice for this option is trials=10.
Suppose we want to reduce the false negative rate, which corresponds to misclassifying a severe case as mild. A false negative (failure to detect a severe disease case) may be more costly than a false positive (misclassifying a minor disease case as severe). Misclassification errors can be expressed as a matrix, which can be loaded to penalize specific types of erroneous predictions.
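A hedged sketch of a cost matrix consistent with the output below, penalizing false negatives four times more heavily than false positives:
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2)   # rows/columns ordered as (minor_disease, severe_disease)
error_cost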
## [,1] [,2]
## [1,] 0 4
## [2,] 1 0
Let's build a decision tree with the option costs=error_cost.
set.seed(1234)
qol_cost <- C5.0(qol_train[-c(40, 41)], qol_train$cd, costs=error_cost)
qol_cost_pred <- predict(qol_cost, qol_test)
confusionMatrix(table(qol_cost_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_cost_pred minor_disease severe_disease
## minor_disease 73 14
## severe_disease 142 214
##
## Accuracy : 0.6479
## 95% CI : (0.6014, 0.6923)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 1.024e-08
##
## Kappa : 0.2829
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3395
## Specificity : 0.9386
## Pos Pred Value : 0.8391
## Neg Pred Value : 0.6011
## Prevalence : 0.4853
## Detection Rate : 0.1648
## Detection Prevalence : 0.1964
## Balanced Accuracy : 0.6391
##
## 'Positive' Class : minor_disease
##
Although the overall accuracy decreased, the false-negative count was reduced from 71 (without specifying a cost matrix) to 14 (when specifying a non-trivial, loaded cost matrix). This comes at the cost of increasing the rate of false-positive labeling (minor disease cases misclassified as severe).
There are multiple choices for plotting trees fitted by rpart or C5.0.
library("rpart")
# remove CHRONICDISEASESCORE, but keep *cd* label
set.seed(1234)
qol_model<-rpart(cd~., data=qol_train[, -40], cp=0.01)
# here we use rpart::cp = *complexity parameter* = 0.01
qol_model
## n= 1771
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 1771 828 minor_disease (0.5324675 0.4675325)
## 2) CHARLSONSCORE< 0.5 645 169 minor_disease (0.7379845 0.2620155) *
## 3) CHARLSONSCORE>=0.5 1126 467 severe_disease (0.4147425 0.5852575)
## 6) AGE< 47.5 165 66 minor_disease (0.6000000 0.4000000) *
## 7) AGE>=47.5 961 368 severe_disease (0.3829344 0.6170656) *
You can also plot the tree directly using rpart.plot.
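A hedged sketch of the corresponding call (the exact chunk is not echoed in the rendered output):
library(rpart.plot)
rpart.plot(qol_model)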
##
## Attaching package: 'rpart.plot'
## The following object is masked from 'package:rpart.utils':
##
## rpart.rules
The method fancyRpartPlot() provides another decision tree rendering.
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
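The rendering itself is presumably produced by a call similar to the one used later for the pruned tree:
fancyRpartPlot(qol_model, cex = 1, caption = "rattle::fancyRpartPlot (QoL Data)")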
qol_pred <- predict(qol_model, qol_test,type = 'class')
confusionMatrix(table(qol_pred, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred minor_disease severe_disease
## minor_disease 143 74
## severe_disease 72 154
##
## Accuracy : 0.6704
## 95% CI : (0.6245, 0.7141)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 2.276e-11
##
## Kappa : 0.3405
##
## Mcnemar's Test P-Value : 0.934
##
## Sensitivity : 0.6651
## Specificity : 0.6754
## Pos Pred Value : 0.6590
## Neg Pred Value : 0.6814
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4898
## Balanced Accuracy : 0.6703
##
## 'Positive' Class : minor_disease
##
These results are consistent with their counterparts reported using C5.0. How can we tune the parameters to further improve the results?
set.seed(1234)
control = rpart.control(cp = 0.000, xval = 100, minsplit = 2)
qol_model= rpart(cd ~ ., data = qol_train[ , -40], control = control)
# plotcp(qol_model)
# printcp(qol_model)
data <- as.data.frame(qol_model$cptable)
data$CP <- as.factor(data$CP)
plot_ly(data = data, x = ~CP, y = ~xerror, type = 'scatter', mode='lines+markers',
name = 'Test', error_y = ~list(array = xstd, color = 'gray')) %>%
#add_segments(x=0, xend=0.23, y=0.76, yend = 0.76, line = list(dash = "dash"), showlegend=F) %>%
layout(title="Complexity Parameter vs. (CV) Error Rate")
Now, we can prune the tree according to the optimal cp, the complexity parameter to which the rpart object will be trimmed. Instead of using the real error (e.g., \(1-R^2\), RMSE) to capture the discrepancy between the observed labels and the model-predicted labels, we will use the xerror, which averages the discrepancy between observed and predicted classifications using cross-validation, see Chapter 9. We can compare the full and pruned decision trees by contrasting the outputs of summary(qol_model) and summary(selected_tr).
set.seed(1234)
selected_tr <- prune(qol_model, cp= qol_model$cptable[which.min(qol_model$cptable[,"xerror"]),"CP"])
fancyRpartPlot(selected_tr, cex = 1, caption = "rattle::fancyRpartPlot (QoL Data)")
qol_pred_tune <- predict(selected_tr, qol_test, type = 'class')
confusionMatrix(table(qol_pred_tune, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred_tune minor_disease severe_disease
## minor_disease 143 74
## severe_disease 72 154
##
## Accuracy : 0.6704
## 95% CI : (0.6245, 0.7141)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 2.276e-11
##
## Kappa : 0.3405
##
## Mcnemar's Test P-Value : 0.934
##
## Sensitivity : 0.6651
## Specificity : 0.6754
## Pos Pred Value : 0.6590
## Neg Pred Value : 0.6814
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4898
## Balanced Accuracy : 0.6703
##
## 'Positive' Class : minor_disease
##
The result is roughly the same as that of C5.0. Despite the fact that there is no substantial classification improvement, the tree-pruning process generates a graphical representation of the decision making protocol (selected_tr) that is much simpler and more intuitive compared to the original (un-pruned) tree (qol_model):
We can change split=“entropy” to “error” or “gini” to apply an alternative information gain index. Experiment with these and compare the results.
In addition to the classification trees we just saw, we can explore classification rules that utilize if-else logical statements to assign classes to unlabeled data. Below we review three classification rule strategies.
Separate and conquer repeatedly splits the data (and subsets of the data) by rules that cover a subset of examples. This procedure is very similar to the Divide and conquer approach. However, a notable difference is that each rule can be independent, yet, each decision node in a tree has to be linked to past decisions.
To understand the One Rule (OneR) algorithm, we need to know about its "sibling", the ZeroR rule. The ZeroR rule assigns the mode class to unlabeled test observations regardless of their feature values. The One Rule (OneR) algorithm is an improved version of ZeroR that uses a single rule for classification. In other words, OneR splits the training dataset into several segments based on feature values. Then, it assigns the mode of the classes in each segment to the related observations in the unlabeled test data. In practice, we first test multiple rules and pick the rule with the smallest error rate as our One Rule. Remember, the rules are subjective.
The Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm combines the ideas behind decision trees and classification rules. It consists of a three-step process:
Let’s take another look at the same dataset as Case Study 1 (QoL in Chronic Disease) - this time applying classification rules. Naturally, we skip over the first two data handling steps and go directly to step three.
Let's start by using the OneR() function in the RWeka package. Before installing the package, you might want to check that the Java installation on your computer is up to date. Also, its version has to match the version of R (i.e., 64-bit R needs 64-bit Java). The function OneR() has the following components:
m <- OneR(class~predictors, data=mydata)
Using y ~ . as the formula includes all of the column variables as predictors.
predictors.# install.packages("RWeka")
library(RWeka)
# just remove the CHRONICDISEASESCORE but keep cd
set.seed(1234)
qol_1R <- OneR(cd~., data=qol[ , -40])
qol_1R
## CHARLSONSCORE:
## < -4.5 -> severe_disease
## < 0.5 -> minor_disease
## < 5.5 -> severe_disease
## < 8.5 -> minor_disease
## >= 8.5 -> severe_disease
## (1453/2214 instances correct)
Note that \(1,453\) out of \(2,214\) cases are correctly classified, \(66\%\), by the “one rule”.
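The detailed training-data evaluation below can be obtained, presumably, with:
summary(qol_1R)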
##
## === Summary ===
##
## Correctly Classified Instances 1453 65.6278 %
## Incorrectly Classified Instances 761 34.3722 %
## Kappa statistic 0.3206
## Mean absolute error 0.3437
## Root mean squared error 0.5863
## Relative absolute error 68.8904 %
## Root relative squared error 117.3802 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 609 549 | a = minor_disease
## 212 844 | b = severe_disease
We obtain a rule that correctly specifies 66% of the patients, which is in line with the prior decision tree classification results. Due to algorithmic stochasticity, it's normal that these results may vary each time you run the algorithm, albeit we used set.seed(1234) to ensure some result reproducibility.
Another possible option for the classification rules would be the RIPPER rule algorithm that we discussed earlier in the chapter. In R, we use the Java-based function JRip() to invoke this algorithm. JRip() uses a similar invocation protocol as OneR():
m <- JRip(class~predictors, data=mydata)
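A hedged reconstruction of the model-fitting chunk producing the rules and evaluation below (the object name qol_jrip is an assumption):
set.seed(1234)
qol_jrip <- JRip(cd ~ ., data = qol[ , -40])
qol_jrip             # print the induced JRIP rules
summary(qol_jrip)    # training-data evaluation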
## JRIP rules:
## ===========
##
## (CHARLSONSCORE >= 1) and (RACE_ETHNICITY >= 4) and (AGE >= 49) => cd=severe_disease (448.0/132.0)
## (CHARLSONSCORE >= 1) and (AGE >= 53) => cd=severe_disease (645.0/265.0)
## => cd=minor_disease (1121.0/360.0)
##
## Number of Rules : 3
##
## === Summary ===
##
## Correctly Classified Instances 1457 65.8085 %
## Incorrectly Classified Instances 757 34.1915 %
## Kappa statistic 0.3158
## Mean absolute error 0.4459
## Root mean squared error 0.4722
## Relative absolute error 89.3711 %
## Root relative squared error 94.5364 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 761 397 | a = minor_disease
## 360 696 | b = severe_disease
This JRip() classifier uses only 3 rules and has a relatively similar accuracy of 66%. As each individual has unique characteristics, classification in real-world data is rarely perfect (i.e., close to 100% accuracy).
Another idea is to repeat the generation of trees multiple times, predict according to each tree's performance, and finally ensemble those weighted votes into a combined classification result. This is precisely the idea behind random forest classification, see Chapter 9.
# install.packages("randomForest")
require(randomForest)
set.seed(12)
# rf.fit <- tuneRF(qol_train[ , -40], qol_train[ , 40], stepFactor=1.5)
rf.fit <- randomForest(cd ~ . , data=qol_train[ , -40],importance=TRUE,ntree=2000,mtry=26)
# varImpPlot(rf.fit, cex=0.5); print(rf.fit)
# type either 1 or 2, specifying the type of importance measure (1=mean decrease in accuracy, 2=mean decrease in node impurity)
# Accuracy
imp_rf.fit <- importance(rf.fit)
x <- rownames(imp_rf.fit)
y <- imp_rf.fit[,3]
plot_ly(x = ~y, y = ~reorder(x,y), name = "Var.Imp", type = "bar") %>%
layout(title="RF Variable Importance Plot (Accuracy)",
xaxis=list(title="Importance (mean decrease in accuracy)"),
yaxis = list(title="Variables (Ordered)"))
# Gini Index
y <- imp_rf.fit[,4]
plot_ly(x = ~y, y = ~reorder(x,y), name = "Var.Imp", type = "bar") %>%
layout(title="RF Variable Importance Plot (Gini)",
xaxis=list(title="Importance (Gini mean decrease)"),
yaxis = list(title="Variables (Ordered)"))
rf.fit1 <- randomForest(cd~. , data=qol_train[ , -40],importance=TRUE,ntree=2000,mtry=26)
rf.fit2 <- randomForest(cd~. , data=qol_train[ , -40], importance=TRUE, nodesize=5, ntree=5000, mtry=26)
# plot(rf.fit,log="x",main="MSE Errors vs. Iterations: Three Models rf.fit, rf.fit1, rf.fit2")
# points(1:5000, rf.fit1$mse, col="red", type="l")
# points(1:5000, rf.fit2$mse, col="green", type="l")
# legend("topright", col=c("black", "red", "green"), lwd=c(2, 2, 2), legend=c("rf.fit", "rf.fit1", "rf.fit2"))
plot_ly(x = c(1:1000), y = rf.fit$err.rate[1:1000,1], name = 'OOB (rf.fit)', type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = c(1:1000), y = rf.fit1$err.rate[1:1000,1], name = 'OOB (rf.fit1)', type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = c(1:1000), y = rf.fit2$err.rate[1:1000,1], name = 'OOB (rf.fit2)', type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = "Out Of Bag Error Rates for 3 RF Models",
xaxis = list(title = "Number of Trees"),
yaxis = list (title = "OOB Error"))
qol_pred2<-predict(rf.fit2, qol_test, type = 'class')
confusionMatrix(table(qol_pred2, qol_test$cd))
## Confusion Matrix and Statistics
##
##
## qol_pred2 minor_disease severe_disease
## minor_disease 143 72
## severe_disease 72 156
##
## Accuracy : 0.6749
## 95% CI : (0.6291, 0.7184)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 5.956e-12
##
## Kappa : 0.3493
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6651
## Specificity : 0.6842
## Pos Pred Value : 0.6651
## Neg Pred Value : 0.6842
## Prevalence : 0.4853
## Detection Rate : 0.3228
## Detection Prevalence : 0.4853
## Balanced Accuracy : 0.6747
##
## 'Positive' Class : minor_disease
##
The variable importance plots (varImpPlot()) show the rank order of the most salient features according to a specific metric (e.g., mean decrease in accuracy, mean decrease in Gini). We can also use the randomForest::partialPlot() method to examine the relative impact of the top salient features (continuous or binary variables). For each feature, this method works by fixing the feature at each of its observed values, averaging the model predictions over the remaining features in the training data, and reporting the resulting marginal (partial) dependence.
Note that for binary features, partialPlot() yields only two values of the corresponding mass function, whereas for continuous variables it generates the marginal effect of the variable on the class probability (for classification problems) or on the continuous response (for regression problems). Using the QoL case-study (Case06_QoL_Symptom_ChronicIllness.csv), the example below shows the partial plots of the top six rank-ordered features chosen by the random forest model.
imp <- randomForest::importance(rf.fit)
impvar <- rownames(imp)[order(imp[, 1], decreasing=TRUE)]
# op <- par(mfrow=c(2, 3))
# for (i in 1:6) { # seq_along(impvar)) { # to plot the marginal probabilities for all features
# partialPlot(rf.fit, qol_train, impvar[i], xlab=impvar[i],
# main=paste("Partial Dependence of 'CD'\n on ", impvar[i]))
# }
p <- vector("list", 6)
for (i in 1:6) { # seq_along(impvar)) { # to plot the marginal probabilities for all features
p[[i]] <- partialPlot(rf.fit, qol_train, impvar[i], xlab=impvar[i],
main=paste("Partial Dependence of 'CD'\n on ", impvar[i]), plot=F)
}
plot_ly(x = p[[1]]$x, y = p[[1]]$y, name = impvar[1], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = paste0("CD Partial Dependence on ",impvar[1]),
xaxis = list(title = impvar[1]), yaxis = list (title = "Impact"))
plot_ly(x = p[[2]]$x, y = p[[2]]$y, name = impvar[2], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = paste0("CD Partial Dependence on ",impvar[2]),
xaxis = list(title = impvar[2]), yaxis = list (title = "Impact"))
plot_ly(x = p[[3]]$x, y = p[[3]]$y, name = impvar[3], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = p[[4]]$x, y = p[[4]]$y, name = impvar[4], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = p[[5]]$x, y = p[[5]]$y, name = impvar[5], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
add_trace(x = p[[6]]$x, y = p[[6]]$y, name = impvar[6], type = 'scatter', mode = 'lines',
line = list(width = 2)) %>%
layout(title = paste0("CD Partial Dependence on ",impvar[3], " ", impvar[4], " ",impvar[5], " ",impvar[6]),
xaxis = list(title = "MSA Questions"), yaxis = list (title = "Impact"))
In random forest (RF) classification, the node size (nodesize) refers to the smallest node that can be split, i.e., nodes with fewer cases than nodesize are never subdivided. Increasing the node size leads to smaller trees, which may compromise predictive power. On the flip side, increasing the maximum tree size (maxnodes) and the number of trees (ntree) tends to increase the predictive accuracy. However, there are tradeoffs when increasing node-size and tree-size simultaneously. To optimize the RF predictive accuracy, try smaller node sizes and more trees; ensembling (forest) results over a larger number of trees will generally yield more stable predictions.
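A minimal tuning sketch of this idea is shown below, reusing the qol_train data from above; the specific nodesize values, ntree=1000, and mtry=26 are illustrative assumptions, not recommended settings.
# Sketch: compare out-of-bag (OOB) error across a few illustrative nodesize values
library(randomForest)
set.seed(12)
node_sizes <- c(1, 5, 20)
oob_err <- sapply(node_sizes, function(ns) {
  fit <- randomForest(cd ~ . , data=qol_train[ , -40], ntree=1000, mtry=26, nodesize=ns)
  tail(fit$err.rate[ , "OOB"], 1)    # OOB error after the final tree
})
data.frame(nodesize=node_sizes, OOB_error=oob_err)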
Another decision tree technique we can test is Extra-Trees (extremely randomized trees). Extra-Trees are similar to Random Forests, except that at each node Random Forest chooses the best cutting threshold for each candidate feature, whereas Extra-Trees draw a uniformly random cut for each candidate feature and then select the feature whose (random) cut yields the biggest gain (or best classification score). The R extraTrees package provides one implementation of Extra-Trees in which the single random cut is extended to several random cuts per feature. This extension improves performance, as it reduces the probability of making very poor cuts while still maintaining the designed stochastic cutting of Extra-Trees. Below, we compare the results of four alternative chronic disease classification methods - random forest, SVM, kNN, and extra-trees.
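As a side note, the multi-cut extension is controlled in the extraTrees package by the numRandomCuts argument (naming this parameter is an assumption about the package interface); a commented-out sketch, analogous to the direct call further below:
# Sketch (assumption): numRandomCuts > 1 draws several random cut-points per feature
# library(extraTrees)
# et.fit.multiCut <- extraTrees(qol_train[ , -c(40, 41)], qol_train[ , 41],
#                               ntree=500, mtry=26, numRandomCuts=3)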
# https://cran.r-project.org/src/contrib/Archive/extraTrees/
library(caret)
library(extraTrees)
# For direct Extra-Tree Call
# install.packages("extraTrees")
# library(extraTrees)
# set.seed(12)
# ## Request more memory to JVM (e.g., 5GB
# options( java.parameters = "-Xmx5g" )
# library(extraTrees)
#
# system.time({
# # call extraTrees with 3 CPU cores/threads and track exec-time
# et.fit3 <- extraTrees(qol_train[ , -c(40,41)], qol_train[ , 41],
# numThreads=3, ntree=5000, mtry=26)
# })
# qol_pred3 <- predict(et.fit3, qol_test[ , -c(40,41)])
# confusionMatrix(table(qol_pred3, qol_test$cd))
# qol_pred<-predict(rf.fit, qol_test, type = 'class')
# qol_pred1<-predict(rf.fit1, qol_test, type = 'class')
# qol_pred2<-predict(rf.fit2, qol_test, type = 'class')
control <- trainControl(method="repeatedcv", number=10, repeats=3)
## Run all subsequent models in parallel
library(doParallel)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)
system.time({
rf.fit <- train(cd~., data=qol_test[ , -40], method="rf", trControl=control);
knn.fit <- train(cd~., data=qol_test[ , -40], method="knn", trControl=control);
svm.fit <- train(cd~., data=qol_test[ , -40], method="svmRadial", trControl=control);
et.fit <- train(cd~., data=qol_test[ , -40], method="extraTrees", trControl=control)
})
## user system elapsed
## 0.16 0.03 38.03
stopCluster(cl) # close multi-core cluster
results <- resamples(list(RF=rf.fit, kNN=knn.fit, SVM=svm.fit, ET=et.fit))
# summary of model differences
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, kNN, SVM, ET
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.5000000 0.6000000 0.6363636 0.6364525 0.6780303 0.7777778 0
## kNN 0.3863636 0.5027778 0.5681818 0.5589785 0.6113901 0.7045455 0
## SVM 0.4418605 0.5681818 0.6000000 0.5987902 0.6243393 0.7272727 0
## ET 0.5000000 0.5909091 0.6363636 0.6290083 0.6724806 0.7777778 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.006160164 0.19999942 0.2673909 0.2707204 0.3558133 0.5535714 0
## kNN -0.222222222 0.00591716 0.1363636 0.1166633 0.2223136 0.3991597 0
## SVM -0.119305857 0.13367597 0.2019697 0.1971070 0.2471653 0.4579055 0
## ET 0.000000000 0.17671518 0.2697064 0.2557297 0.3429830 0.5535714 0
The classification of the iris flowers represents a simple example of the naive Bayes classifier.
library(e1071)
data(iris)
nbc_model <- naiveBayes(Species ~ ., data = iris)
## alternatively:
nbc_model <- naiveBayes(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")], iris[,"Species"])
predicted.nbcvalues <- predict(nbc_model, iris[,c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])
table(predicted.nbcvalues, iris[, "Species"])
##
## predicted.nbcvalues setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 47 3
## virginica 0 3 47
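The resubstitution agreement (computed on the same data used to fit the model, so it is an optimistic estimate) can be summarized directly:
# overall (resubstitution) agreement: (50 + 47 + 47)/150 = 0.96
mean(predicted.nbcvalues == iris[ , "Species"])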
In the previous cancer case study, we classified the patients with seer_stage of "not applicable" (seer_stage=8) and "unstaged, unknown or unspecified" (seer_stage=9) as no cancer or early cancer stages. Let's remove these two categories and replicate the naive Bayes classifier case study again.
## 'data.frame': 580 obs. of 9 variables:
## $ PID : int 10000 10008 10029 10063 10103 1012 10135 10143 10152 10184 ...
## $ ENC_ID : int 46836 46886 47034 47240 47511 3138 47739 47769 47800 47938 ...
## $ seer_stage : Factor w/ 9 levels "0","1","2","3",..: 2 2 5 2 2 2 2 2 7 2 ...
## $ MEDICATION_DESC : chr "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
## $ MEDICATION_SUMMARY: chr "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 15 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg Per PACU protocol" ...
## $ DOSE : chr "150" "5000" "1.5" "50" ...
## $ UNIT : chr "mg" "unit" "g" "microgram" ...
## $ FREQUENCY : chr "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
## $ TOTAL_DOSE_COUNT : int 5 3 11 2 2 2 6 1 24 2 ...
## [1] 580 9
Now we have only 580 observations. We can either use the first 480 of them as the training dataset and the last 100 as the test dataset, or select a random 80-20 (training-testing) split, and then evaluate the prediction accuracy (e.g., with laplace=1).
We can use the same code as before for creating the classes in the training and test datasets. Since seer_stage=8 or 9 are no longer in the data, we classify seer_stage=0, 1, 2 or 3 as "early_stage" and seer_stage=4, 5 or 7 as "later_stage".
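The construction of hn_med_train1 and hn_med_test1 is not shown here; a minimal sketch, assuming the filtered 580-case data frame, its document-term matrix, and the cleaned corpus are named hn_med1, hn_med_dtm1, and corpus1 (these source object names are assumptions), could look like this.
# Sketch (assumed source object names: hn_med1, hn_med_dtm1, corpus1)
set.seed(1234)
train_index1 <- sample(seq_len(nrow(hn_med1)), size=floor(0.8*nrow(hn_med1)))
hn_med_train1 <- hn_med1[train_index1, ]
hn_med_test1 <- hn_med1[-train_index1, ]
hn_med_dtm_train1 <- hn_med_dtm1[train_index1, ]
hn_med_dtm_test1 <- hn_med_dtm1[-train_index1, ]
corpus_train1 <- corpus1[train_index1]
corpus_test1 <- corpus1[-train_index1]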
hn_med_train1$stage <- hn_med_train1$seer_stage %in% c(4, 5, 7)
hn_med_train1$stage <- factor(hn_med_train1$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test1$stage <- hn_med_test1$seer_stage %in% c(4, 5, 7)
hn_med_test1$stage <- factor(hn_med_test1$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train1$stage))
##
## early_stage later_stage
## 0.7392241 0.2607759
##
## early_stage later_stage
## 0.7413793 0.2586207
To test the naive Bayes classifier, we can use the document term matrices constructed from a corpus of high-frequency terms that appear in at least five documents of the training dataset.
## Length Class Mode
## 116 character character
hn_med_dict1 <- as.character(findFreqTerms(hn_med_dtm_train1, 5))
hn_train1 <- DocumentTermMatrix(corpus_train1, list(dictionary=hn_med_dict1))
hn_test1 <- DocumentTermMatrix(corpus_test1, list(dictionary=hn_med_dict1))
traindf1 <- data.frame(stage=hn_med_train1$stage, as.matrix(hn_train1))
testdf1 <- data.frame(stage=hn_med_test1$stage, as.matrix(hn_test1))
# for Training and Testing data, transform predictors/features to factors
for (cols in names(traindf1[-1])) traindf1[[cols]] <- factor(traindf1[[cols]])
for (cols in names(testdf1[-1])) testdf1[[cols]] <- factor(testdf1[[cols]])
# check variable types
# str(traindf)
# run NB classifier (for cancer stage)
hn_classifier1 <- naiveBayes(traindf1[,-1], traindf1$stage, laplace = 15)  # type="class" is a predict() argument, not needed here
hn_test_pred1 <- predict(hn_classifier1, testdf1)
caret::confusionMatrix(hn_test_pred1, hn_med_test1$stage)
## Confusion Matrix and Statistics
##
## Reference
## Prediction early_stage later_stage
## early_stage 85 29
## later_stage 1 1
##
## Accuracy : 0.7414
## 95% CI : (0.6518, 0.8182)
## No Information Rate : 0.7414
## P-Value [Acc > NIR] : 0.5489
##
## Kappa : 0.0312
##
## Mcnemar's Test P-Value : 8.244e-07
##
## Sensitivity : 0.98837
## Specificity : 0.03333
## Pos Pred Value : 0.74561
## Neg Pred Value : 0.50000
## Prevalence : 0.74138
## Detection Rate : 0.73276
## Detection Prevalence : 0.98276
## Balanced Accuracy : 0.51085
##
## 'Positive' Class : early_stage
##
\[ACC = \frac{TP + TN}{TP + FP + FN + TN} = \frac{86}{116} = 0.74.\] Note that this is not a particularly useful classifier, as the seemingly improved accuracy reflects predictions that are predominantly assigned to a single class, early_stage.
Use the MLB data (01a_data.txt) to predict the player's Position (or perhaps the player's Team) using the naiveBayes classifier. Compute and report the agreement between predicted and actual labels (for the player's position). Below is some example code.
mydata <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1',as.is=T, header=T) # 01a_data.txt
# mydata <- read.table('data.txt',as.is=T, header=T)
sample_size <- floor(0.75 * nrow(mydata))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = sample_size)
train <- mydata[train_ind, ]
# TESTING DATA
test <- mydata[-train_ind, ]
library("e1071")
nbc_model <- naiveBayes(train[ , c("Weight", "Height", "Age")], as.factor(train$Position), laplace = 15)
nbc_model
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = train[, c("Weight", "Height", "Age")],
## y = as.factor(train$Position), laplace = 15)
##
## A-priori probabilities:
## as.factor(train$Position)
## Catcher Designated_Hitter First_Baseman Outfielder
## 0.07225806 0.01806452 0.05161290 0.19225806
## Relief_Pitcher Second_Baseman Shortstop Starting_Pitcher
## 0.29806452 0.05677419 0.04645161 0.22064516
## Third_Baseman
## 0.04387097
##
## Conditional probabilities:
## Weight
## as.factor(train$Position) [,1] [,2]
## Catcher 206.6786 15.21973
## Designated_Hitter 222.5000 22.84311
## First_Baseman 212.5750 20.87556
## Outfielder 200.3289 18.47951
## Relief_Pitcher 203.9913 21.84650
## Second_Baseman 184.8864 10.67537
## Shortstop 182.4444 13.55717
## Starting_Pitcher 204.9825 22.58356
## Third_Baseman 200.5588 19.18047
##
## Height
## as.factor(train$Position) [,1] [,2]
## Catcher 72.89286 1.865267
## Designated_Hitter 74.57143 1.603567
## First_Baseman 73.97500 1.941286
## Outfielder 73.16779 2.021515
## Relief_Pitcher 74.53247 2.160421
## Second_Baseman 71.27273 1.703125
## Shortstop 71.88889 1.720096
## Starting_Pitcher 74.70760 2.187370
## Third_Baseman 73.08824 2.287882
##
## Age
## as.factor(train$Position) [,1] [,2]
## Catcher 29.75571 4.445924
## Designated_Hitter 31.47857 4.471898
## First_Baseman 29.78425 5.219565
## Outfielder 28.99342 4.255701
## Relief_Pitcher 28.61420 4.160627
## Second_Baseman 29.32386 4.616064
## Shortstop 28.48972 3.919462
## Starting_Pitcher 28.00661 4.401236
## Third_Baseman 28.73353 3.798898
predicted.nbcvalues <- predict(nbc_model, as.data.frame(test))
# report results
tab <- table(predicted.nbcvalues, test$Position)
tab_df <- tidyr::spread(as.data.frame(tab), key = Var2, value = Freq)
sum(diag(table(predicted.nbcvalues, test$Position)))
## [1] 74
Heatmap of the actual (vertical columns) and predicted (horizontal rows) MLB data positions based on multi-class naïve Bayesian classifier.
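To report the agreement rate and reproduce a heatmap like the one referenced above, a short sketch reusing the tab and tab_df objects from the previous chunk (the plotly layout choices are illustrative):
# overall agreement: proportion of correctly classified test cases
sum(diag(tab))/sum(tab)
# heatmap of predicted (rows) vs. actual (columns) positions
plot_ly(x = colnames(tab), y = rownames(tab), z = as.matrix(tab_df[ , -1]), type = "heatmap") %>%
  layout(title="Naive Bayes Classification of MLB Player Positions (Test Data)",
         xaxis=list(title="Actual Position"), yaxis=list(title="Predicted Position"))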
Let's demonstrate text classification using a clinical transcription dataset, which consists of an index and 5 data elements - description, medical_specialty (the prediction outcome target), sample_name, transcription, and keywords. Our task is to derive computed phenotypes by automatically classifying the 40 different medical specialties using the clinical transcription text.
dataCT <- read.csv('https://umich.instructure.com/files/21152999/download?download_frd=1', header=T)
str(dataCT)
## 'data.frame': 4999 obs. of 6 variables:
## $ Index : int 0 1 2 3 4 5 6 7 8 9 ...
## $ description : chr " A 23-year-old white female presents with complaint of allergies." " Consult for laparoscopic gastric bypass." " Consult for laparoscopic gastric bypass." " 2-D M-Mode. Doppler. " ...
## $ medical_specialty: chr " Allergy / Immunology" " Bariatrics" " Bariatrics" " Cardiovascular / Pulmonary" ...
## $ sample_name : chr " Allergic Rhinitis " " Laparoscopic Gastric Bypass Consult - 2 " " Laparoscopic Gastric Bypass Consult - 1 " " 2-D Echocardiogram - 1 " ...
## $ transcription : chr "SUBJECTIVE:, This 23-year-old white female presents with complaint of allergies. She used to have allergies w"| __truncated__ "PAST MEDICAL HISTORY:, He has difficulty climbing stairs, difficulty with airline seats, tying shoes, used to p"| __truncated__ "HISTORY OF PRESENT ILLNESS: , I have seen ABC today. He is a very pleasant gentleman who is 42 years old, 344 "| __truncated__ "2-D M-MODE: , ,1. Left atrial enlargement with left atrial diameter of 4.7 cm.,2. Normal size right and left "| __truncated__ ...
## $ keywords : chr "allergy / immunology, allergic rhinitis, allergies, asthma, nasal sprays, rhinitis, nasal, erythematous, allegr"| __truncated__ "bariatrics, laparoscopic gastric bypass, weight loss programs, gastric bypass, atkin's diet, weight watcher's, "| __truncated__ "bariatrics, laparoscopic gastric bypass, heart attacks, body weight, pulmonary embolism, potential complication"| __truncated__ "cardiovascular / pulmonary, 2-d m-mode, doppler, aortic valve, atrial enlargement, diastolic function, ejection"| __truncated__ ...
## medical_specialty n
## 1 Surgery 1103
## 2 Consult - History and Phy. 516
## 3 Cardiovascular / Pulmonary 372
## 4 Orthopedic 355
## 5 Radiology 273
## 6 General Medicine 259
## 7 Gastroenterology 230
## 8 Neurology 223
## 9 SOAP / Chart / Progress Notes 166
## 10 Obstetrics / Gynecology 160
## 11 Urology 158
## 12 Discharge Summary 108
## 13 ENT - Otolaryngology 98
## 14 Neurosurgery 94
## 15 Hematology - Oncology 90
## 16 Ophthalmology 83
## 17 Nephrology 81
## 18 Emergency Room Reports 75
## 19 Pediatrics - Neonatal 70
## 20 Pain Management 62
## 21 Psychiatry / Psychology 53
## 22 Office Notes 51
## 23 Podiatry 47
## 24 Dermatology 29
## 25 Cosmetic / Plastic Surgery 27
## 26 Dentistry 27
## 27 Letters 23
## 28 Physical Medicine - Rehab 21
## 29 Sleep Medicine 20
## 30 Endocrinology 19
## 31 Bariatrics 18
## 32 IME-QME-Work Comp etc. 16
## 33 Chiropractic 14
## 34 Diets and Nutritions 10
## 35 Rheumatology 10
## 36 Speech - Language 9
## 37 Autopsy 8
## 38 Lab Medicine - Pathology 8
## 39 Allergy / Immunology 7
## 40 Hospice - Palliative Care 6
# 2. Preprocess the medical clinical notes (transcription)
# library(tm)
dataCT_corpus <- Corpus(VectorSource(dataCT$transcription))
dataCT_corpus_clean <- tm_map(dataCT_corpus, tolower)
dataCT_corpus_clean <- tm_map(dataCT_corpus_clean, removePunctuation)
dataCT_corpus_clean <- tm_map(dataCT_corpus_clean, removeNumbers)
dataCT_corpus_clean <- tm_map(dataCT_corpus_clean, stripWhitespace)
dataCT_corpus_dtm <- DocumentTermMatrix(dataCT_corpus_clean)
set.seed(1234)
subset_train <- sample(nrow(dataCT),floor(nrow(dataCT)*0.8)) # 80% training + 20% testing
dataCT_train <- dataCT[subset_train, ]
dataCT_test <- dataCT[-subset_train, ]
dataCT_corpus_dtm_train <- dataCT_corpus_dtm[subset_train, ]
dataCT_corpus_dtm_test <- dataCT_corpus_dtm[-subset_train, ]
dataCT_corpus_train <- dataCT_corpus_clean[subset_train]
dataCT_corpus_test <- dataCT_corpus_clean[-subset_train]
dataCT_train$MedSpecFac <- factor(dataCT_train$medical_specialty)
dataCT_test$MedSpecFac <- factor(dataCT_test$medical_specialty)
# prop.table(table(dataCT_test$medical_specialty))
prop.table(table(dataCT_test$MedSpecFac))
##
## Allergy / Immunology Autopsy
## 0.002 0.002
## Bariatrics Cardiovascular / Pulmonary
## 0.004 0.070
## Chiropractic Consult - History and Phy.
## 0.003 0.100
## Cosmetic / Plastic Surgery Dentistry
## 0.003 0.004
## Dermatology Discharge Summary
## 0.003 0.020
## Emergency Room Reports Endocrinology
## 0.024 0.002
## ENT - Otolaryngology Gastroenterology
## 0.016 0.052
## General Medicine Hematology - Oncology
## 0.051 0.022
## IME-QME-Work Comp etc. Lab Medicine - Pathology
## 0.003 0.004
## Letters Nephrology
## 0.001 0.018
## Neurology Neurosurgery
## 0.037 0.021
## Obstetrics / Gynecology Office Notes
## 0.028 0.009
## Ophthalmology Orthopedic
## 0.020 0.082
## Pain Management Pediatrics - Neonatal
## 0.017 0.012
## Physical Medicine - Rehab Podiatry
## 0.004 0.012
## Psychiatry / Psychology Radiology
## 0.010 0.058
## Rheumatology Sleep Medicine
## 0.001 0.004
## SOAP / Chart / Progress Notes Speech - Language
## 0.030 0.002
## Surgery Urology
## 0.221 0.028
## Length Class Mode
## 12630 character character
dataCT_corpus_dict <- as.character(findFreqTerms(dataCT_corpus_dtm_train, 5))
dataCT_train1 <- DocumentTermMatrix(dataCT_corpus_train, list(dictionary=dataCT_corpus_dict))
dataCT_test1 <- DocumentTermMatrix(dataCT_corpus_test, list(dictionary=dataCT_corpus_dict))
dataCT_train1 <- apply(dataCT_train1, MARGIN = 2, convert_counts)
dataCT_test1 <- apply(dataCT_test1, MARGIN = 2, convert_counts)
dataCT_classifier <- naiveBayes(dataCT_train1, dataCT_train$MedSpecFac, laplace = 0)
dataCT_pred <- predict(dataCT_classifier, dataCT_test1)
# table(dataCT_pred)
# report results
tab <- table(dataCT_pred, dataCT_test$MedSpecFac)
tab_df <- tidyr::spread(as.data.frame(tab), key = Var2, value = Freq)
# gmodels::CrossTable(dataCT_pred, dataCT_test$MedSpecFac)
sum(diag(table(dataCT_pred, dataCT_test$MedSpecFac)))
## [1] 51
plot_ly(x = colnames(tab), y = colnames(tab), z = as.matrix(tab_df[, -1]), type = "heatmap") %>%
layout(title="Confusion Matrix of Naive-Bayesian Classification of 40 Medical Specialties using Text-Notes (Test Data)",
xaxis=list(title='True Specialties'), yaxis=list(title='Derived NB Class Labels'))
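For reference, the overall agreement rate on the test set can be compared against the no-information rate (always predicting the most frequent specialty, Surgery); a quick check using the objects from the previous chunk:
sum(diag(tab))/sum(tab)                          # overall agreement on the test set
max(prop.table(table(dataCT_test$MedSpecFac)))   # no-information rate (Surgery)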
Later, in Chapter 14 (Deep Learning) we will extend this example and demonstrate a more powerful binary and multinomial (categorical) classification of the medical specialty unit based on the clinical notes, using deep neural networks.
Readers are encouraged to apply the Naive Bayesian classification techniques to some new data from the list of our Case-Studies.
In the previous Chronic Disease case study, we classified CHRONICDISEASESCORE into two groups. What will happen if we use three groups? Recall the quantile() function that we discussed in Chapter 2; we can use it to inspect the tertile cut-points (shown below). Then, a for loop splits the variable CHRONICDISEASESCORE into three categories, using the fixed cut-points 0.7 and 2.2.
## 33.33333% 66.66667%
## 1.06 1.80
# assign cdthree: 1=minor (<=0.7), 2=mild (0.7-2.2), 3=severe (>=2.2)
for (i in 1:2214) {
  if (qol$CHRONICDISEASESCORE[i] > 0.7 & qol$CHRONICDISEASESCORE[i] < 2.2) {
    qol$cdthree[i] <- 2
  } else if (qol$CHRONICDISEASESCORE[i] >= 2.2) {
    qol$cdthree[i] <- 3
  } else {
    qol$cdthree[i] <- 1
  }
}
qol$cdthree<-factor(qol$cdthree, levels=c(1, 2, 3), labels = c("minor_disease", "mild_disease", "severe_disease"))
table(qol$cdthree)
##
## minor_disease mild_disease severe_disease
## 379 1431 404
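The same three-way labeling can be obtained without an explicit loop; a vectorized sketch using cut() with the same cut-points (note that scores exactly equal to 2.2 fall into the middle bin here, whereas the loop above assigns them to severe_disease):
# vectorized alternative to the loop above (cut-points 0.7 and 2.2)
qol$cdthree <- cut(qol$CHRONICDISEASESCORE, breaks=c(-Inf, 0.7, 2.2, Inf),
                   labels=c("minor_disease", "mild_disease", "severe_disease"))
# for an (approximately) even tertile split, use the quantile cut-points instead:
# breaks=c(-Inf, quantile(qol$CHRONICDISEASESCORE, c(1/3, 2/3)), Inf)
table(qol$cdthree)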
After labeling the three categories in the new variable cdthree, our job of preparing the class variable is done. Let's follow the earlier sections of this chapter to see how well the tree and rule classifiers perform in the three-category case. First, we build a tree classifier using C5.0() with 10 boosting trials. One small tip: the training dataset cannot include column 40 (CHRONICDISEASESCORE), column 41 (cd), or the new column 42 (cdthree), because they all contain outcome-related variables.
# qol_train1<-qol[1:2114, ]
# qol_test1<-qol[2115:2214, ]
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train1<-qol[train_index, ]
qol_test1<-qol[-train_index, ]
prop.table(table(qol_train1$cdthree))
##
## minor_disease mild_disease severe_disease
## 0.1693958 0.6437041 0.1869001
##
## minor_disease mild_disease severe_disease
## 0.1783296 0.6568849 0.1647856
set.seed(1234)
qol_model1<-C5.0(qol_train1[ , -c(40, 41, 42)], qol_train1$cdthree, trials=10)
qol_model1
##
## Call:
## C5.0.default(x = qol_train1[, -c(40, 41, 42)], y = qol_train1$cdthree, trials
## = 10)
##
## Classification Tree
## Number of samples: 1771
## Number of predictors: 39
##
## Number of boosting iterations: 10
## Average tree size: 230
##
## Non-standard options: attempt to group attributes
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 14 61 4
## mild_disease 17 242 32
## severe_disease 1 59 13
##
## Overall Statistics
##
## Accuracy : 0.6072
## 95% CI : (0.56, 0.653)
## No Information Rate : 0.8172
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.091
##
## Mcnemar's Test P-Value : 1.457e-07
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity 0.43750 0.6685
## Specificity 0.84185 0.3951
## Pos Pred Value 0.17722 0.8316
## Neg Pred Value 0.95055 0.2105
## Prevalence 0.07223 0.8172
## Detection Rate 0.03160 0.5463
## Detection Prevalence 0.17833 0.6569
## Balanced Accuracy 0.63967 0.5318
## Class: severe_disease
## Sensitivity 0.26531
## Specificity 0.84772
## Pos Pred Value 0.17808
## Neg Pred Value 0.90270
## Prevalence 0.11061
## Detection Rate 0.02935
## Detection Prevalence 0.16479
## Balanced Accuracy 0.55651
We can see that the prediction accuracy with three categories is considerably lower than with two categories. In addition, we can try to build a one-rule classifier with OneR().
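The call producing the OneR() output below is not shown; a minimal sketch, assuming the RWeka package is installed and using the same outcome-column exclusions as above (the exact arguments are assumptions):
# Sketch: one-rule classifier for the three-category outcome
library(RWeka)
qol_1R1 <- OneR(cdthree ~ . , data=qol[ , -c(40, 41)])
summary(qol_1R1)                                      # internal (training) performance
qol_pred1 <- predict(qol_1R1, qol_test1)              # external (test) predictions
caret::confusionMatrix(qol_pred1, qol_test1$cdthree)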
## INTERVIEWDATE:
## < 3.5 -> mild_disease
## < 28.5 -> severe_disease
## < 282.0 -> mild_disease
## < 311.5 -> severe_disease
## >= 311.5 -> mild_disease
## (1436/2214 instances correct)
##
## === Summary ===
##
## Correctly Classified Instances 1436 64.86 %
## Incorrectly Classified Instances 778 35.14 %
## Kappa statistic 0.022
## Mean absolute error 0.2343
## Root mean squared error 0.484
## Relative absolute error 67.5977 %
## Root relative squared error 116.2958 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 0 375 4 | a = minor_disease
## 0 1422 9 | b = mild_disease
## 0 390 14 | c = severe_disease
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 0 78 1
## mild_disease 0 289 2
## severe_disease 0 68 5
##
## Overall Statistics
##
## Accuracy : 0.6637
## 95% CI : (0.6176, 0.7076)
## No Information Rate : 0.9819
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0445
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity NA 0.66437
## Specificity 0.8217 0.75000
## Pos Pred Value NA 0.99313
## Neg Pred Value NA 0.03947
## Prevalence 0.0000 0.98194
## Detection Rate 0.0000 0.65237
## Detection Prevalence 0.1783 0.65688
## Balanced Accuracy NA 0.70718
## Class: severe_disease
## Sensitivity 0.62500
## Specificity 0.84368
## Pos Pred Value 0.06849
## Neg Pred Value 0.99189
## Prevalence 0.01806
## Detection Rate 0.01129
## Detection Prevalence 0.16479
## Balanced Accuracy 0.73434
This OneR classifier is based purely on the value of INTERVIEWDATE. The internal classification accuracy is 65%, roughly equal to the external (validation data) prediction accuracy. However, the latter assessment is a bit misleading, as the vast majority of the external validation cases are classified into a single class, mild_disease.
Finally, let's revisit the JRip() classifier with the same three class labels according to cdthree.
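As before, the fitting call is omitted from the output below; a minimal sketch, assuming RWeka is loaded and using the same data layout (the exact arguments are assumptions):
# Sketch: RIPPER rule learner for the three-category outcome
qol_jrip1 <- JRip(cdthree ~ . , data=qol[ , -c(40, 41)])
summary(qol_jrip1)                                    # rules and internal performance
qol_pred1 <- predict(qol_jrip1, qol_test1)            # external (test) predictions
caret::confusionMatrix(qol_pred1, qol_test1$cdthree)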
## JRIP rules:
## ===========
##
## (CHARLSONSCORE <= 0) and (AGE <= 50) and (MSA_Q_06 <= 1) and (QOL_Q_07 >= 1) and (MSA_Q_09 <= 1) => cdthree=minor_disease (35.0/11.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_10 >= 4) and (QOL_Q_07 >= 9) => cdthree=severe_disease (54.0/20.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 5) and (MSA_Q_09 <= 4) and (MSA_Q_04 >= 3) => cdthree=severe_disease (64.0/30.0)
## (CHARLSONSCORE >= 1) and (QOL_Q_02 >= 4) and (PH2_Q_01 >= 3) and (QOL_Q_10 >= 4) and (RACE_ETHNICITY >= 4) => cdthree=severe_disease (43.0/19.0)
## => cdthree=mild_disease (2018.0/653.0)
##
## Number of Rules : 5
##
## === Summary ===
##
## Correctly Classified Instances 1481 66.8925 %
## Incorrectly Classified Instances 733 33.1075 %
## Kappa statistic 0.1616
## Mean absolute error 0.3288
## Root mean squared error 0.4055
## Relative absolute error 94.8702 %
## Root relative squared error 97.42 %
## Total Number of Instances 2214
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 24 342 13 | a = minor_disease
## 10 1365 56 | b = mild_disease
## 1 311 92 | c = severe_disease
## Confusion Matrix and Statistics
##
## qol_pred1
## minor_disease mild_disease severe_disease
## minor_disease 4 73 2
## mild_disease 0 272 19
## severe_disease 0 56 17
##
## Overall Statistics
##
## Accuracy : 0.6614
## 95% CI : (0.6152, 0.7054)
## No Information Rate : 0.9052
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.131
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: minor_disease Class: mild_disease
## Sensitivity 1.000000 0.6783
## Specificity 0.829157 0.5476
## Pos Pred Value 0.050633 0.9347
## Neg Pred Value 1.000000 0.1513
## Prevalence 0.009029 0.9052
## Detection Rate 0.009029 0.6140
## Detection Prevalence 0.178330 0.6569
## Balanced Accuracy 0.914579 0.6130
## Class: severe_disease
## Sensitivity 0.44737
## Specificity 0.86173
## Pos Pred Value 0.23288
## Neg Pred Value 0.94324
## Prevalence 0.08578
## Detection Rate 0.03837
## Detection Prevalence 0.16479
## Balanced Accuracy 0.65455
In terms of the predictive accuracy on the testing data (qol_test1$cdthree), we can see from these outputs that the RIPPER algorithm (66%) performed better than the C5.0 decision tree (61%) and similarly to the OneR algorithm (66%). This suggests that simple algorithms may outperform more complex methods for certain real-world case-studies. Later, in Chapter 9, we will provide more details about optimizing and improving classification and prediction performance.
These supervised classification techniques can be further tested with other data from the list of our Case-Studies.