SOCR ≫ DSPA ≫ DSPA2 Topics ≫

In this chapter we will present various progressively supervised machine learning, classification and clustering techniques. There are two categories of machine learning techniques - unsupervised and supervised (human-guided). In general, supervised classification methods aim to identify or predict predefined classes and label new objects as members of specific classes. Whereas unsupervised clustering approaches attempt to group objects into subsets, without knowing a priori labels, and determine relationships between objects.

In the context of machine learning and artificial intelligence, classification and clustering apply mostly for supervised and unsupervised problems, respectively.

Unsupervised classification refers to methods where the outcomes (groupings with common characteristics) are automatically derived based on intrinsic affinities and associations in the data without human indication of clustering. Unsupervised learning is purely based on input data (\(X\)) without corresponding output labels. The goal is to model the underlying structure, affinities, or distribution in the data in order to learn more about its intrinsic characteristics. It is called unsupervised learning because there are no a priori correct answers and there is no human guidance. Algorithms are left to their own devices to discover and present the interesting structure in the data. Clustering (discover the inherent groupings in the data) and association (discover association rules that describe the data) represent the core unsupervised learning problems. The k-means clustering and the Apriori association rule provide solutions to unsupervised learning problems.

Unsupervised Clustering Approaches
Bayesian
Hierarchical
Partitioning Based
Decision Based Non-parametric Divisive (Top-Down) Agglomerative (Bottom-Up) Spectral K-Means / Centroid Graph-Theoretic Model Based
Bayesian Classifier has high computational requirements. As there are a priori given labels, some prior is needed/specified to train the classifier, which is in turn used to label new data. Then, the newly labeled samples are subsequently used to train a new (supervised) classifier, i.e.,
decision-directed unsupervised learning. If the initial classifier is not appropriate, the process may diverge
Chinese restaurant process (CRP), infinite Hidden Markov Model (HMM) Principal Direction Divisive Partitioning (PDDP) Start with each case in a separate cluster and repeatedly join the closest pairs into clusters, until a stopping criterion is matched: (1) there is only a single cluster for all cases, (2) predetermined number of clusters is reached, (3) the distance between the closest clusters exceeds a threshold SpecC kmeans, hkmeans Growing Neural Gas mclust

Supervised classification methods utilize user provided labels representative of specific classes associated with concrete observations, cases or units. These training classes/outcomes are used as references for the classification. Many problems can be addressed by decision-support systems utilizing combinations of supervised and unsupervised classification processes. Supervised learning involves input variables (\(X\)) and an outcome variable (\(Y\)) to learn mapping functions from the input to the output: \(Y = f(X)\). The goal is to approximate the mapping function so that when it is applied to new (validation) data (\(Z\)) it (accurately) predicts the (expected) outcome variables (\(Y\)). It is called supervised learning because the learning process is supervised by initial training labels guiding and correcting the learning until the algorithm achieves an acceptable level of performance.

Regression (output variable is a real value) and classification (output variable is a category) problems represent the two types of supervised learning. Examples of supervised machine learning algorithms include Linear regression and Random forest, both provide solutions for regression-type problems, but Random forest also provides solutions to classification problems.

Just like categorization of exploratory data analytics, Chapter 3, is challenging, so is systematic codification of machine learning techniques. The table below attempts to provide a rough representation of common machine learning methods. However, it is not really intended to be a gold-standard protocol for choosing the best analytical method. Before you settle on a specific strategy for data analysis, you should always review the data characteristics in light of the assumptions of each technique and assess the potential to gain new knowledge or extract valid information from applying a specific technique.

Inference Outcome Supervised Unsupervised
Classification & Prediction Binary Classification-Rules, OneR, kNN, NaiveBayes, Decision-Tree, C5.0, AdaBoost, XGBoost, LDA/QDA, Logit/Poisson, SVM Apriori, Association-Rules, k-Means, NaiveBayes
Classification & Prediction Categorical Regression Modeling & Forecasting Apriori, Association-Rules, k-Means, NaiveBayes
Regression Modeling Real Quantitative (MLR) Regression Modeling, LDA/QDA, SVM, Decision-Tree, NeuralNet Regression Modeling Tree, Apriori/Association-Rules

Many of these will be discussed in later chapters. In this chapter, we will present step-by-step the k-nearest neighbor (kNN) algorithm. Specifically, we will demonstrate (1) data retrieval and normalization, (2) splitting the data into training and testing sets, (3) fitting models on the training data, (4) evaluating model performance on testing data, (5) improving model performance, and (6) determining optimal values of \(k\).

In Chapter 9, we will present detailed strategies, and evaluation metrics, to assess the performance of all clustering and classification methods.

1 k-Nearest Neighbor Approach

Classification tasks could be very difficult for a large number of complex and heterogeneous features and target classes. In those scenarios, where the items of similar class type tend to be homogeneous, nearest neighbor classification may be appropriate because assigning unlabeled cases to their most similarly labeled neighbors may be fairly easy to accomplish.

Such classification methods can help us understand the story behind the unlabeled data using known data and avoiding analyzing those complicated features and target classes. This is because these techniques have no prior distribution assumptions. However, this non-parametric approach makes the methods rely heavy on the training instances, which explains their lazy algorithms designation.

The KNN algorithm involves the following steps:

  1. Create a training dataset that has classified examples labeled by nominal variables and different features in ordinal or numerical variables.
  2. Create a testing dataset containing unlabeled examples with similar features as the training data.
  3. Given a predetermined number \(k\), match each test case with the \(k\) closest training records that are “nearest” to the test case, according to a certain similarity or distance measure.
  4. Assign a test case class label according to the majority vote of the \(k\) nearest training cases.

Mathematically, for a given \(k\), a specific similarity metric \(d\), and a new testing case \(x\), the kNN classifier performs two steps (\(k\) is typically odd to avoid ties):

  • Runs through the whole training dataset (\(y\)) computing \(d(x,y)\). Let \(A\) represent the \(k\) closest points to \(x\) in the training data \(y\).
  • Estimates the conditional probability for each class, which corresponds to the fraction of points in \(A\) with that given class label. If \(I(z)\) is an indicator function \(I(z) = \begin{cases} 1 & z= true \\ 0 & otherwise \end{cases}\), then the testing data input \(x\) gets assigned to the class with the largest probability, \(P(y=j|X=x)\):

\[P(y=j|X=x) =\frac{1}{k} \sum_{i\in A}{I(y^{(i)}=j)}.\]

1.1 Distance Function and Dummy coding

How to measure the similarity between records? We can think of similarity measures as the distance metrics between the two records or cases. There are many distance functions to choose from. Traditionally, we use Euclidean distance as our similarity metric.

If we use a line to connect the two points representing the testing and the training records in \(n\) dimensional space, the length of the line is the Euclidean distance. If \(a, b\) both have \(n\) features, the coordinates for them are \((a_1, a_2, \cdots, a_n)\) and \((b_1, b_2, \cdots, b_n)\). Our distance could be:

\[dist(a, b)=\sqrt{(a_1-b_1)^2+(a_2-b_2)^2+\cdots+(a_n-b_n)^2}.\]

When we have nominal features, it requires a little trick to apply the Euclidean distance formula. We could create dummy variables as indicators of all the nominal feature levels. The dummy variable would equal one when we have the feature and zero otherwise. We show two examples:

\[Gender= \left\{ \begin{array}{ll} 0 & X=male \\ 1 & X=female \\ \end{array} \right. \]

\[Cold= \left\{ \begin{array}{ll} 0 & Temp \geq 37F \\ 1 & Temp < 37F \\ \end{array} \right. \]

This allows only binary expressions. If we have multiple nominal categories, just make each one as a dummy variable and apply the Euclidean distance.

1.2 Estimation of the hyperparameter k

The parameter k could not be too large or too small. If our k is too large, the test record tends to be classified as the most popular class in the training records, rather than the most similar one. On the other hand, if the k is too small, outliers, noise, or mislabeled training data cases might lead to errors in predictions.

The common practice is to calculate the square root of the number of training cases and use that number as (initial) estimate of k.

A more robust way would be to choose several k values and select the one with the optimal (best) classifying performance.

1.3 Rescaling of the features

Different features might have different scales. For example, we can have a measure of pain scaling from 1 to 10 or 1 to 100. Some similarity or distance measures assume the same measuring unit in all feature dimensions. This requires that the data may need to be transferred into the same scale. Re-scaling can make each feature contribute to the distance in a relatively equal manner, avoiding potential bias.

1.4 Rescaling Formulas

There are many alternative strategies to rescale the data.

1.4.1 min-max normalization

\[X_{new}=\frac{X-min(X)}{max(X)-min(X)}.\]

After re-scaling the data, \(X_{new}\) would range from 0 to 1, representing the distance between each value and its minimum as a percentage. Larger values indicate further distance from the minimum, 100% means that the value is at the maximum.

1.4.2 z-score standardization

\[X_{new}=\frac{X-\mu}{\sigma}=\frac{X-Mean(X)}{SD(X)}\]

This is based on the properties of normal distribution that we have talked about in Chapter 2. After z-score standardization, the re-scaled feature will have unbounded range. This is different from the min-max normalization which always has a finite range from 0 to 1. Following z-score standardization, the transformed features are uniteless and may resemble standard normal distribution.

1.5 Case Study: Youth Development

1.5.1 Step 1: Collecting Data

The data we are using for this case study is the “Boys Town Study of Youth Development”, which is the second case study, CaseStudy02_Boystown_Data.csv.

Variables:

  • ID: Case subject identifier
  • Sex: dichotomous variable (1=male, 2=female)
  • GPA: Interval-level variable with range of 0-5 (0-“A” average, 1- “B” average, 2- “C” average, 3- “D” average, 4-“E”, 5-“F”“)
  • Alcohol use: Interval level variable from 0-11 (drink everyday - never drinked)
  • Attitudes on drinking in the household: Alcatt - Interval level variable from 0-6 (totally approve - totally disapprove)
  • DadJob: 1-yes, dad has a job and 2- no
  • MomJob: 1-yes and 2-no
  • Parent closeness (example: In your opinion, does your mother make you feel close to her?)
    • Dadclose: Interval level variable 0-7 (usually-never)
    • Momclose: interval level variable 0-7 (usually-never)
  • Delinquency:
    • larceny (how many times have you taken things >$50?): Interval level data 0-4 (never - many times),
    • vandalism: Interval level data 0-7 (never - many times)

1.5.2 Step 2: Exploring and preparing the data

First, we need to load and do some data manipulation. We are using the Euclidean distance so dummy variables (one hot encoding) should be used. The following codes transferred sex, dadjob and momjob into dummy variables.

library(class)
library(gmodels)
boystown<-read.csv("https://umich.instructure.com/files/399119/download?download_frd=1", sep=" ")
boystown$sex<-boystown$sex-1
boystown$dadjob <- -1*(boystown$dadjob-2)
boystown$momjob <- -1*(boystown$momjob-2)
str(boystown)
## 'data.frame':    200 obs. of  11 variables:
##  $ id        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex       : num  0 0 0 0 1 1 0 0 1 1 ...
##  $ gpa       : int  5 0 3 2 3 3 1 5 1 3 ...
##  $ Alcoholuse: int  2 4 2 2 6 3 2 6 5 2 ...
##  $ alcatt    : int  3 2 3 1 2 0 0 3 0 1 ...
##  $ dadjob    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ momjob    : num  0 0 0 0 1 0 0 0 1 1 ...
##  $ dadclose  : int  1 3 2 1 2 1 3 6 3 1 ...
##  $ momclose  : int  1 4 2 2 1 2 1 2 3 2 ...
##  $ larceny   : int  1 0 0 3 1 0 0 0 1 1 ...
##  $ vandalism : int  3 0 2 2 2 0 5 1 4 0 ...

The str() function tells us that we have 200 observations and 11 variables. However, the ID variable is not important in this case study so we can delete it. In this case-study, we can focus on academic performance, GPA, recidivism, vandalism and larceny, alcohol use, or other outcome variables. One concrete example involves trying to predict recidivism and use knn to classify participants in two categories.

Let’s focus on a specific outcome variable representing two or more infractions of vandalism and larceny, which can be considered as “Recidivism”. Participants with one or no infractions can be labeled as “Controls”. First we can use PCA to explore the data and then construct the new derived recidivism variable and split the data into training and testing sets.

# GPA study
# boystown <- boystown[, -1]
# table(boystown$gpa)
# boystown$grade <- boystown$gpa %in% c(3, 4, 5)    # GPA: (1-“A” average, 2-“B” average, 3-“C” average, 4+ below “C” average)
# boystown$grade <- factor(boystown$grade, levels=c(F, T), labels = c("above_avg", "avg_or_below"))
# table(boystown$grade)

# You can also try with alternative outcomes, e.g., alcohol use
# Outcome Alcohol Use 6 categories (1- everyday, 2- once or twice/wk, 3- once or twice/month, 4- less than once or twice per month, 5- once or twice, 6- never)
# Y <- as.factor(ifelse(boystown$Alcoholuse > 3, "1"Heavy Alcohol Use", "0"low Alcohol Use)); table(Y)
# Y <- as.factor(boystown$Alcoholuse); table(Y)
# X <- boystown[, -c(1, 4)]

# First explore the data by running a PCA
rawData <- boystown[ , -1]; head(rawData)
##   sex gpa Alcoholuse alcatt dadjob momjob dadclose momclose larceny vandalism
## 1   0   5          2      3      1      0        1        1       1         3
## 2   0   0          4      2      1      0        3        4       0         0
## 3   0   3          2      3      1      0        2        2       0         2
## 4   0   2          2      1      1      0        1        2       3         2
## 5   1   3          6      2      1      1        2        1       1         2
## 6   1   3          3      0      1      0        1        2       0         0
pca1 <- prcomp(as.matrix(rawData), center = T)
summary(pca1)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5    PC6     PC7
## Standard deviation     1.9264 1.6267 1.4556 1.3887 1.3023 1.2495 0.95092
## Proportion of Variance 0.2463 0.1756 0.1406 0.1280 0.1126 0.1036 0.06001
## Cumulative Proportion  0.2463 0.4219 0.5625 0.6905 0.8031 0.9067 0.96672
##                            PC8     PC9    PC10
## Standard deviation     0.46931 0.44030 0.29546
## Proportion of Variance 0.01462 0.01287 0.00579
## Cumulative Proportion  0.98134 0.99421 1.00000
pca1$rotation
##                     PC1          PC2          PC3          PC4         PC5
## sex         0.017329782 -0.041942445  0.038644169  0.028574405  0.01345929
## gpa         0.050811919  0.251727224 -0.552866390  0.482012801  0.12517612
## Alcoholuse -0.955960402 -0.171043716 -0.004495945  0.207154515 -0.09577419
## alcatt     -0.087619514  0.345341205 -0.639579077 -0.330797550 -0.46933240
## dadjob      0.007936535 -0.001870547 -0.017028011  0.006888380  0.01469275
## momjob      0.030397552 -0.003518869  0.008763667 -0.001482878  0.02342974
## dadclose    0.052947782 -0.689075136 -0.246264250 -0.452876476 -0.21528193
## momclose   -0.055904765 -0.273643927 -0.435005164 -0.092086033  0.75752753
## larceny     0.063455041 -0.005344755 -0.108928362  0.033669839  0.07405310
## vandalism  -0.254240056  0.486423422  0.147159970 -0.632255411  0.35813577
##                     PC6         PC7          PC8          PC9         PC10
## sex        -0.026402404  0.02939770  0.996267983 -0.036561664 -0.001236127
## gpa         0.599548626 -0.13874602  0.035498809 -0.004136514  0.019965416
## Alcoholuse  0.015534060  0.05830281  0.002438336 -0.032626941 -0.008177261
## alcatt     -0.362724708  0.01546912  0.045913748 -0.019571649  0.001150003
## dadjob     -0.001537909 -0.04457354 -0.001761749 -0.050310596 -0.997425089
## momjob     -0.006536869 -0.01490905 -0.037550689 -0.997051461  0.051468175
## dadclose    0.456556766 -0.04166384  0.008660225 -0.005147468  0.001021197
## momclose   -0.381040880 -0.06639096 -0.008710480  0.018261764  0.020666371
## larceny     0.084351462  0.98381408 -0.026413939 -0.013630662 -0.039663208
## vandalism   0.383715180 -0.00198434  0.042593385 -0.003164270 -0.004956997
Y <- ifelse (boystown$vandalism + boystown$larceny > 1, "Recidivism", "Control")  # more than 1 vandalism or larceny conviction
X <- boystown[, -c(1, 10, 11)]  # covariate set excludes these 3 columns index/ID, vandalism, and larceny

boystown_z <- as.data.frame(lapply(X, scale))
bt_train <- boystown_z[1:150, ]
bt_test  <- boystown_z[151:200, ]
bt_train_labels <- Y[1:150]  
bt_test_labels  <- Y[151:200]
table(bt_train_labels)
## bt_train_labels
##    Control Recidivism 
##         46        104

Let’s look at the proportions for the two categories in both the training and testing sets.

round(prop.table(table(bt_train_labels)), digits=2)
## bt_train_labels
##    Control Recidivism 
##       0.31       0.69
round(prop.table(table(bt_test_labels)), digits=2)
## bt_test_labels
##    Control Recidivism 
##        0.2        0.8

We can see that most of the participants have incidents related to recidivism (~70-80%).

The remaining 8 features use different measuring scales. If we use these features directly, variables with larger scale may have a greater impact on the classification performance. Therefore, re-scaling may be useful to level the playing field.

1.5.3 Step 3: Normalizing Data

First let’s create a function of our own using the min-max normalization formula. We can check the function using some trial vectors.

normalize<-function(x){
  return((x-min(x))/(max(x)-min(x)))
}

# some test examples:
normalize(c(1, 2, 3, 4, 5))
## [1] 0.00 0.25 0.50 0.75 1.00
normalize(c(1, 3, 6, 7, 9))
## [1] 0.000 0.250 0.625 0.750 1.000

After confirming the function definition, we can use lapply() to apply the normalization transformation to each element in a “list” of predictors of the binary outcome=recidivism.

boystown_n <- as.data.frame(lapply(bt_train, normalize))

# alternatively we can use "scale", as we showed above
#boystown_z <- as.data.frame(lapply(X, scale))

Let’s compare the two alternative normalization approaches on one of the features, Alcohol use.

summary(boystown_n$Alcoholuse)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2727  0.3636  0.3576  0.4545  1.0000
summary(boystown_z$Alcoholuse)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.04795 -0.98958  0.06879  0.00000  0.59798  3.77309

1.5.4 Step 4: Data preparation - creating training and test datasets

We have 200 observations in this dataset. The more data we use to train the algorithm, the more precise the prediction would be. We can use \(3/4\) of the data for training and the remaining \(1/4\) for testing. For simplicity, we will just take the first cases as training and the remaining as testing. Alternatively, we can use a randomization strategy to split the data into testing and training sets.

# may want to use random split of the raw data into training and testing
# subset_int <- sample(nrow(boystown_n),floor(nrow(boystown_n)*0.8))  
# 80% training + 20% testing
# bt_train<- boystown_n [subset_int, ]; bt_test<-boystown_n[-subset_int, ] 

# Note that the object boystown_n already excludes the outcome variable (Delinquency), index 11!
bt_train <- boystown_z[1:150, ]
bt_test  <- boystown_z[151:200, ]

Then let’s extract the recidivism labels or classes for the training and testing sets.

bt_train_labels <- Y[1:150]  
bt_test_labels  <- Y[151:200]

1.5.5 Step 5 - Training a model on the data

First, we will use the class::knn() method.

#install.packages('class', repos = "https://cran.us.r-project.org")
library(class)

The function knn() has following components:

p <- knn(train, test, class, k)

  • train: data frame containing numeric training data (features)
  • test: data frame containing numeric testing data (features)
  • class/cl: class for each observation in the training data
  • k: predetermined integer indication the number of nearest neighbors

We can first test with \(k=7\), which is less than the square root of our number of observations: \(\sqrt{200}\approx 14\).

bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=7)

1.5.6 Step 6 - Evaluating model performance

We utilize the CrossTable() function in Chapter 2 to evaluate the KNN model. We have two classes in this example. The goal is to create a \(2\times 2\) table that shows the matched true and predicted classes as well as the unmatched ones. However chi-square values are not needed so we use option prop.chisq=False to get rid of it.

# install.packages("gmodels", repos="https://cran.us.r-project.org")
library(gmodels)
CrossTable(x=bt_test_labels, y=bt_test_pred, prop.chisq = F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_pred 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          0 |         10 |         10 | 
##                |      0.000 |      1.000 |      0.200 | 
##                |      0.000 |      0.213 |            | 
##                |      0.000 |      0.200 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          3 |         37 |         40 | 
##                |      0.075 |      0.925 |      0.800 | 
##                |      1.000 |      0.787 |            | 
##                |      0.060 |      0.740 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          3 |         47 |         50 | 
##                |      0.060 |      0.940 |            | 
## ---------------|------------|------------|------------|
## 
## 

In this table, the cells in the first row-first column and the second row-second column contain the number for cases that have predicted classes matching the true class labels. The other two cells are the counts for unmatched cases. The accuracy of this classifier is calculated by: \(\frac{cell[1, 1]+cell[2, 2]}{total}=\frac{37}{50}=0.72\). Note that this value may slightly fluctuate each time you run the classifier, due to the stochastic nature of the algorithm.

1.5.7 Step 7 - Improving model performance

The normalization strategy may play a role in the classification performance. We can try alternative standardization methods - standard Z-score centralization and normalization (via scale() method). Let’s try standardization prior to training the kNN and predicting and assessing the accuracy of the results.

# bt_train <- bt_z[1:150, ]
# bt_test  <- bt_z[151:200, ]
# bt_train_labels <- boystown[1:150, 11]
# bt_test_labels  <- boystown[151:200, 11]
# bt_test_pred <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=7)
# CrossTable(x=bt_test_labels, y=bt_test_pred, prop.chisq = F)

# bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=18) # retrieve Probabilities
# getTraingDataThreshold <- quantile(attributes(bt_test_pred)$prob, c(table(bt_train_labels)[1]/length(bt_train_labels), 0.5))[1]
# bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > getTraingDataThreshold, 1, 0)          # Binarize the probabilities
bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=14) # retrieve Probabilities
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control")          # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_predBin 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          0 |         10 |         10 | 
##                |      0.000 |      1.000 |      0.200 | 
##                |      0.000 |      0.217 |            | 
##                |      0.000 |      0.200 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          4 |         36 |         40 | 
##                |      0.100 |      0.900 |      0.800 | 
##                |      1.000 |      0.783 |            | 
##                |      0.080 |      0.720 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          4 |         46 |         50 | 
##                |      0.080 |      0.920 |            | 
## ---------------|------------|------------|------------|
## 
## 
# CT 
print(paste0("Prediction accuracy of model 'bt_test_pred' (k=14) is ", (CT$prop.tbl[1,1]+CT$prop.tbl[2,2]) ))
## [1] "Prediction accuracy of model 'bt_test_pred' (k=14) is 0.72"

Under the z-score method, the prediction result is similar to the previous result using the normalization strategy. Albeit, in general, there may be marginal differences, e.g., a few more cases may be correctly labeled based on one of the standardization or normalization approaches.

1.5.8 Testing alternative values of k

Originally, we used the square root of 200 as our k. However, this might not be the best k in this study. We can test different k’s for their predicting performances.

# bt_train <- boystown_n[1:150, ]
# bt_test  <- boystown_n[151:200, ]
# bt_train_labels <- boystown[1:150, 11]
# bt_test_labels  <- boystown[151:200, 11]
# bt_test_pred1   <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=1)
# bt_test_pred9   <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=9)
# bt_test_pred11  <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, k=11)
# bt_test_pred21  <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, k=21)
# bt_test_pred27  <- knn(train=bt_train, test=bt_test, cl=bt_train_labels, k=27)
# ct_1  <- CrossTable(x=bt_test_labels, y=bt_test_pred1, prop.chisq = F)
# ct_9  <- CrossTable(x=bt_test_labels, y=bt_test_pred9, prop.chisq = F)
# ct_11 <- CrossTable(x=bt_test_labels, y=bt_test_pred11, prop.chisq = F)
# ct_21 <- CrossTable(x=bt_test_labels, y=bt_test_pred21, prop.chisq = F)
# ct_27 <- CrossTable(x=bt_test_labels, y=bt_test_pred27, prop.chisq = F)

bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=18)        # retrieve Probabilities
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control")     # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_predBin 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          1 |          9 |         10 | 
##                |      0.100 |      0.900 |      0.200 | 
##                |      1.000 |      0.184 |            | 
##                |      0.020 |      0.180 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          0 |         40 |         40 | 
##                |      0.000 |      1.000 |      0.800 | 
##                |      0.000 |      0.816 |            | 
##                |      0.000 |      0.800 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          1 |         49 |         50 | 
##                |      0.020 |      0.980 |            | 
## ---------------|------------|------------|------------|
## 
## 
# CT 
print(paste0("Prediction accuracy of model 'bt_test_pred' (k=18) is ", (CT$prop.tbl[1,1]+CT$prop.tbl[2,2]) ))
## [1] "Prediction accuracy of model 'bt_test_pred' (k=18) is 0.82"

The choice of \(k\) in KNN clustering is very important and it can be fine-tuned using the e1071::tune.knn() of the caret::train() methods. Note that using the tune.knn() method without explicit control over the class-probability cutoff does not generate good results. Specifying that probability exceeding \(0.6\) corresponds to recidivism significantly improves the kNN performance using caret::train() and caret::predict() methods.

# install.packages("e1071")
library(e1071)
knntuning = tune.knn(x= bt_train, y = as.factor(bt_train_labels), k = 1:30)
knntuning
## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##   k
##  18
## 
## - best performance: 0.3
summary(knntuning)
## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##   k
##  18
## 
## - best performance: 0.3 
## 
## - Detailed performance results:
##     k     error dispersion
## 1   1 0.4733333 0.11946382
## 2   2 0.5066667 0.10976968
## 3   3 0.3866667 0.10795518
## 4   4 0.4066667 0.06629526
## 5   5 0.3600000 0.08999314
## 6   6 0.3800000 0.09962894
## 7   7 0.3400000 0.08577893
## 8   8 0.3400000 0.10634210
## 9   9 0.3133333 0.10446808
## 10 10 0.3200000 0.06885304
## 11 11 0.3133333 0.09454243
## 12 12 0.3066667 0.09532271
## 13 13 0.3066667 0.10036969
## 14 14 0.3066667 0.10036969
## 15 15 0.3133333 0.07730012
## 16 16 0.3066667 0.07825252
## 17 17 0.3066667 0.07825252
## 18 18 0.3000000 0.06478835
## 19 19 0.3066667 0.07825252
## 20 20 0.3066667 0.07825252
## 21 21 0.3066667 0.07825252
## 22 22 0.3066667 0.07825252
## 23 23 0.3066667 0.07825252
## 24 24 0.3066667 0.07825252
## 25 25 0.3066667 0.07825252
## 26 26 0.3066667 0.07825252
## 27 27 0.3066667 0.07825252
## 28 28 0.3066667 0.07825252
## 29 29 0.3066667 0.07825252
## 30 30 0.3066667 0.07825252
library(caret)
knnControl <- trainControl(
    method = "cv", ## cross validation
    number = 10,   ## 10-fold
    summaryFunction = twoClassSummary,
    classProbs = TRUE,
    verboseIter = FALSE
)
# bt_train_stringLabels <- ifelse (bt_train_labels==1, "Recidivism", "Control")
knn_model <- train(x=bt_train, y=bt_train_labels , metric = "ROC", method = "knn", tuneLength = 20, trControl = knnControl)
print(knn_model)
## k-Nearest Neighbors 
## 
## 150 samples
##   8 predictor
##   2 classes: 'Control', 'Recidivism' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 136, 135, 136, 135, 134, 135, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens   Spec     
##    5  0.3588409  0.045  0.9018182
##    7  0.4134773  0.130  0.9536364
##    9  0.4573182  0.085  0.9818182
##   11  0.4820909  0.065  0.9900000
##   13  0.4305455  0.000  1.0000000
##   15  0.4852955  0.000  0.9900000
##   17  0.4554773  0.000  0.9900000
##   19  0.4944773  0.000  1.0000000
##   21  0.5283182  0.000  1.0000000
##   23  0.5135682  0.000  1.0000000
##   25  0.5105227  0.000  1.0000000
##   27  0.5492045  0.000  1.0000000
##   29  0.5420909  0.000  1.0000000
##   31  0.5585455  0.000  1.0000000
##   33  0.5451818  0.000  1.0000000
##   35  0.5723636  0.000  1.0000000
##   37  0.5715227  0.000  1.0000000
##   39  0.5812273  0.000  1.0000000
##   41  0.5827500  0.000  1.0000000
##   43  0.6021591  0.000  1.0000000
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 43.
# Here we are providing a cutoff of 60% probability for derived class label=Recidivism
library(dplyr)
summaryPredictions <- predict(knn_model, newdata = bt_test, type = "prob") # %>% mutate('knnPredClass'=names(.)[apply(., 1, which.max)])
summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
testDataPredSummary <- as.data.frame(cbind(trueLabels=bt_test_labels, controlProb=summaryPredictions$Control, 
                                           recidivismProb=summaryPredictions$Recidivism, knnPredLabel=summaryPredictionsLabel))
print(paste0("Accuracy = ", 2*as.numeric(table(testDataPredSummary$trueLabels == testDataPredSummary $knnPredLabel)[2]), "%"))
## [1] "Accuracy = 80%"

It’s useful to visualize the error rate against the value of \(k\). This can help us select the optimal \(k\) parameter that minimizes the cross-validation (CV) error, see Chapter 9.

library(class)
library(ggplot2)
library(reshape2)
# define a function that generates CV folds
cv_partition <- function(y, num_folds = 10, seed = NULL) {
  if(!is.null(seed)) {
    set.seed(seed)
  }
  n <- length(y)
  
  # split() divides the data into the folds defined by gl().
  # gl() generates factors according to the pattern of their levels

  folds <- split(sample(seq_len(n), n), gl(n = num_folds, k = 1, length = n))
  folds <- lapply(folds, function(fold) {
    list(
      training = which(!seq_along(y) %in% fold),
      test = fold
    )
  })
  names(folds) <- paste0("Fold", names(folds))
  return(folds)
}

# Generate 10-folds of the data
folds = cv_partition(bt_train_labels, num_folds = 10)

# Define a training set_CV_error calculation function
train_cv_error = function(K) {
  #Train error
  #### knnbt = knn(train = bt_train, test = bt_train, cl = bt_train_labels, k = K)
  #### train_error = mean(knnbt != bt_train_labels)
  knn_model <- train(x=bt_train, y=bt_train_labels , metric = "ROC", method = "knn", tuneLength = 20, trControl = knnControl)
  summaryPredictions <- predict(knn_model, newdata = bt_train, type = "prob")
  summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
  train_error = mean(summaryPredictionsLabel != bt_train_labels)

  #CV error
  cverrbt = sapply(folds, function(fold) {
    ###  knnbt = knn(train = bt_train[fold$training,], cl = bt_train_labels[fold$training], test = bt_train[fold$test,], k=K)
    knn_model <- train(x=bt_train[fold$training,], y=bt_train_labels[fold$training] , metric = "ROC", method = "knn", tuneLength = 20, trControl = knnControl)
    summaryPredictions <- predict(knn_model, newdata = bt_train[fold$test,], type = "prob")
    summaryPredictionsLabel <- ifelse (summaryPredictions$Recidivism > 0.6, "Recidivism", "Control")
    mean(summaryPredictionsLabel != bt_train_labels[fold$test])
    # mean(bt_train_labels[fold$test] != knn(train = bt_train[fold$training,], 
    #                                        cl = bt_train_labels[fold$training], 
    #                                        test = bt_train[fold$test,], k=K))
    }
  )

  cv_error = mean(cverrbt)

  #Test error
  knn.test = knn(train = bt_train, test = bt_test, cl = bt_train_labels, k = K)
  test_error = mean(knn.test != bt_test_labels)
  return(c(train_error, cv_error, test_error))
}

k_err = sapply(1:30, function(k) train_cv_error(k))
df_errs = data.frame(t(k_err), 1:30)
colnames(df_errs) = c('Train', 'CV', 'Test', 'K')
dataL <- melt(df_errs, id="K")

# require(ggplot2)
# library(reshape2)
# ggplot(dataL, aes_string(x="K", y="value", colour="variable",
#    group="variable", linetype="variable", shape="variable")) +
#    geom_line(size=0.8) + labs(x = "Number of nearest neighbors (k)",
#            y = "Classification error",
#            colour="", group="",
#            linetype="", shape="") +
#   geom_point(size=2.8) +
#   geom_vline(xintercept=4:5, colour = "pink")+
#   geom_text(aes(4,0,label = "4", vjust = 1)) +
#   geom_text(aes(5,0,label = "5", vjust = 1))

library(plotly)
plot_ly(dataL, x = ~K, y = ~value, color = ~variable, type = "scatter", mode = "markers+lines") %>% 
  add_segments(x=25, xend=25, y=0.0, yend=0.33, type = "scatter", name="k=9",
               line=list(color="darkgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  add_segments(x=14, xend=14, y=0.0, yend=0.33, type = "scatter", name="k=14",
               line=list(color="lightgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  add_segments(x=18, xend=18, y=0.0, yend=0.36, type = "scatter",  name="k=18",
               line=list(color="gray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  layout(title='K-NN Training, CV, and Testing Error Rates against k', 
           legend=list(title=list(text='<b> Samples </b>')), 
           xaxis=list(title='Number of nearest neighbors (k)'), yaxis=list(title='Classification error'))

1.5.9 Quantitative Assessment

First review the fundamentals of hypothesis testing inference and recall that:

Confusion Matrix Negative Positive
kNN Fails to reject TN FN
kNN rejects FP TP
Metrics Specificity: TN/(TN+FP) Sensitivity: TP/(TP+FN)

Suppose we want to evaluate the kNN model (\(k=12\)) as to how well it predicts recidivism. Let’s report manually some of the accuracy metrics for the kNN model (\(k=12\)). Combining the results, we get the following model sensitivity and specificity.

bt_test_pred <- knn(train=bt_train, test=bt_test,  cl=bt_train_labels, prob=T, k=12) # retrieve Probabilities
bt_test_predBin <- ifelse(attributes(bt_test_pred)$prob > 0.6, "Recidivism", "Control")          # Binarize the probabilities
CT <- CrossTable(x=bt_test_labels, y=bt_test_predBin, prop.chisq = F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  50 
## 
##  
##                | bt_test_predBin 
## bt_test_labels |    Control | Recidivism |  Row Total | 
## ---------------|------------|------------|------------|
##        Control |          2 |          8 |         10 | 
##                |      0.200 |      0.800 |      0.200 | 
##                |      0.286 |      0.186 |            | 
##                |      0.040 |      0.160 |            | 
## ---------------|------------|------------|------------|
##     Recidivism |          5 |         35 |         40 | 
##                |      0.125 |      0.875 |      0.800 | 
##                |      0.714 |      0.814 |            | 
##                |      0.100 |      0.700 |            | 
## ---------------|------------|------------|------------|
##   Column Total |          7 |         43 |         50 | 
##                |      0.140 |      0.860 |            | 
## ---------------|------------|------------|------------|
## 
## 
mod12_TN <- CT$prop.row[1, 1]  
mod12_FP <- CT$prop.row[1, 2]
mod12_FN <- CT$prop.row[2, 1]
mod12_TP <- CT$prop.row[2, 2]

mod12_sensi <- mod12_TN/(mod12_TN+mod12_FP) 
mod12_speci <- mod12_TP/(mod12_TP+mod12_FN)
print(paste0("kNN model k=12 Sensitivity=", mod12_sensi))
## [1] "kNN model k=12 Sensitivity=0.2"
print(paste0("kNN model k=12 Specificity=", mod12_speci))
## [1] "kNN model k=12 Specificity=0.875"
table(bt_test_labels, bt_test_predBin)
##               bt_test_predBin
## bt_test_labels Control Recidivism
##     Control          2          8
##     Recidivism       5         35

Therefore, model12, corresponding to \(k=12\), yields a marginal accuracy on the testing cases.

Another strategy for model validation and improvement involves the use of the caret::confusionMatrix() method, which reports several complementary metrics quantifying the performance of the prediction model.

Let’s examine deeper the performance of model12 to predict recidivism.

# corr12 <- cor(as.numeric(bt_test_labels), as.numeric(bt_test_predBin))
# corr12
# plot(as.numeric(bt_test_labels), as.numeric(bt_test_pred9))
# table(as.numeric(bt_test_labels), as.numeric(bt_test_pred9))

# install.packages("caret")
library("caret")

# compute the accuracy, LOR, sensitivity/specificity of 3 kNN models

# Model 12: bt_test_predBin
#confusionMatrix(as.numeric(bt_test_labels), as.numeric(bt_test_pred1))
confusionMatrix(as.factor(bt_test_labels), as.factor(bt_test_predBin))
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Control Recidivism
##   Control          2          8
##   Recidivism       5         35
##                                           
##                Accuracy : 0.74            
##                  95% CI : (0.5966, 0.8537)
##     No Information Rate : 0.86            
##     P-Value [Acc > NIR] : 0.9927          
##                                           
##                   Kappa : 0.0845          
##                                           
##  Mcnemar's Test P-Value : 0.5791          
##                                           
##             Sensitivity : 0.2857          
##             Specificity : 0.8140          
##          Pos Pred Value : 0.2000          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.1400          
##          Detection Rate : 0.0400          
##    Detection Prevalence : 0.2000          
##       Balanced Accuracy : 0.5498          
##                                           
##        'Positive' Class : Control         
## 

Finally, we can use a 3D plot to display the confusionMatrix() results of model12 (mod12_TN, mod12_FN, mod12_FP, mod12_TP).

# install.packages("scatterplot3d")
library(scatterplot3d)
grid_xy <- matrix(c(0, 1, 1, 0), nrow=2, ncol=2)
intensity <- matrix(c(mod12_TN, mod12_FN, mod12_FP, mod12_TP), nrow=2, ncol=2)

# scatterplot3d(grid_xy, intensity, pch=16, highlight.3d=TRUE, type="h", main="3D Scatterplot") 

s3d.dat <- data.frame(cols=as.vector(col(grid_xy)), 
      rows=as.vector(row(grid_xy)), 
      value=as.vector(intensity))
scatterplot3d(s3d.dat, pch=16, highlight.3d=TRUE, type="h", xlab="real", ylab="predicted", zlab="Agreement", 
              main="3D Scatterplot: Model12 Results (FP, FN, TP, TN)") 

# scatterplot3d(s3d.dat, type="h", lwd=5, pch=" ", xlab="real", ylab="predicted", zlab="Agreement", main="Model9 Results (FP, FN, TP, TN)")

plot_ly(x = c("TN", "FN", "FP", "TP"),
  y = c(mod12_TN, mod12_FN, mod12_FP, mod12_TP),
  name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Confusion Matrix", 
           legend=list(title=list(text='<b> Model k=12; Performance Metrics </b>')), 
           xaxis=list(title='Metrics'), yaxis=list(title='Probability'))
# plot_ly(type = 'barpolar', r = c(0.2, 0.8, 0.2, 0.85),theta = c(0, 90, 180, 360)) %>%
#     layout(polar = list(radialaxis = list(visible = T,range = c(0,10)),
#       sector = c(0,360), radialaxis = list(tickfont = list(size = 45)),
#       angularaxis = list(tickfont = list(size = 10))))

1.6 Case Study: Predicting Galaxy Spins

Let’s now use the SOCR Case-Study 22 (22_SDSS_GalaxySpins_Case_Study) to train a kNN classifier on \(49,122\) (randomly selected out of a total \(51,122\)) training cases and test the accuracy of predicting the Galactic Spin (L=left or R=right hand spin) on the remaining (randomly chosen testing \(2,000\) galaxies). Evaluate and report the classifier performance. Try to compare and improve the performance of several independent classifiers, e.g., kNN, naive Bayesian, LDA.

Report/graph the training, testing and CV error rates for different k parameters and justify your “optimal” choice for \(k=k_o\). How good is your Galactic spin classifier? Provide some visual and numeric evidence.

Some sample code to get you started is included below.

#loading libraries
library(e1071); library(caret); library(class); library(gmodels)

#loading Galaxy data
galaxy_data <- read.csv("https://umich.instructure.com/files/6105118/download?download_frd=1", sep=",")

#View(head(galaxy_data))
dim(galaxy_data)
## [1] 51122    12
#dropping Galaxy ID
galaxy_data <- galaxy_data[ , -1]
str(galaxy_data)
## 'data.frame':    51122 obs. of  11 variables:
##  $ RA    : num  236 237 238 238 238 ...
##  $ DEC   : num  -0.493 -0.482 -0.506 -0.544 -0.527 ...
##  $ HAND  : chr  "R" "L" "L" "L" ...
##  $ UZS   : num  17.4 19.2 19.4 19.5 17.8 ...
##  $ GZS   : num  16.2 18 17.5 18.3 16.6 ...
##  $ RZS   : num  15.6 17.4 16.6 17.9 16.2 ...
##  $ IZS   : num  15.2 17 16.2 17.6 15.8 ...
##  $ ZZS   : num  14.9 16.7 15.8 17.3 15.6 ...
##  $ ELLIPS: num  0.825 0.961 0.411 0.609 0.505 ...
##  $ PHIS  : num  154 100.2 29.2 109.4 39.6 ...
##  $ RSS   : num  0.0547 0.0977 0.0784 0.076 0.0796 ...
#randomly assigning 2k for validation purposes
subset_int <- sample(nrow(galaxy_data), 2000) 

#training data, dropping the label
galaxy_data_train <- galaxy_data[-subset_int, -3]; dim(galaxy_data_train) # test
## [1] 49122    10
galaxy_data_test <- galaxy_data[subset_int, -3]; dim(galaxy_data_test);
## [1] 2000   10
#summary(galaxy_data_train)
# labels
galaxy_train_labels <- as.factor(galaxy_data[-subset_int, 3])
galaxy_test_labels <- as.factor(galaxy_data[subset_int, 3])

#summary(galaxy_data_test)
# trying to figure out the best K for this kNN model
# testing alternative values of K
gd_test_pred1<-knn(train=galaxy_data_train, test=galaxy_data_test,
                   cl=galaxy_train_labels, k=1)
gd_test_pred10<-knn(train=galaxy_data_train, test=galaxy_data_test,
                    cl=galaxy_train_labels, k=10)
gd_test_pred20<-knn(train=galaxy_data_train, test=galaxy_data_test,
                     cl=galaxy_train_labels, k=20)
gd_test_pred50<-knn(train=galaxy_data_train, test=galaxy_data_test,
                    cl=galaxy_train_labels, k=50)
ct_1<-CrossTable(x=galaxy_test_labels, y=gd_test_pred1, prop.chisq = F)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2000 
## 
##  
##                    | gd_test_pred1 
## galaxy_test_labels |         L |         R | Row Total | 
## -------------------|-----------|-----------|-----------|
##                  L |       948 |        28 |       976 | 
##                    |     0.971 |     0.029 |     0.488 | 
##                    |     0.977 |     0.027 |           | 
##                    |     0.474 |     0.014 |           | 
## -------------------|-----------|-----------|-----------|
##                  R |        22 |      1002 |      1024 | 
##                    |     0.021 |     0.979 |     0.512 | 
##                    |     0.023 |     0.973 |           | 
##                    |     0.011 |     0.501 |           | 
## -------------------|-----------|-----------|-----------|
##       Column Total |       970 |      1030 |      2000 | 
##                    |     0.485 |     0.515 |           | 
## -------------------|-----------|-----------|-----------|
## 
## 
confusionMatrix(galaxy_test_labels, gd_test_pred1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    L    R
##          L  948   28
##          R   22 1002
##                                           
##                Accuracy : 0.975           
##                  95% CI : (0.9672, 0.9814)
##     No Information Rate : 0.515           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.9773          
##             Specificity : 0.9728          
##          Pos Pred Value : 0.9713          
##          Neg Pred Value : 0.9785          
##              Prevalence : 0.4850          
##          Detection Rate : 0.4740          
##    Detection Prevalence : 0.4880          
##       Balanced Accuracy : 0.9751          
##                                           
##        'Positive' Class : L               
## 
#Alternatively we can use the tuning function
knn.tune = tune.knn(x= galaxy_data_train, y = galaxy_train_labels, k = 1:20,
                    tunecontrol=tune.control(sampling = "fix") , fix=10)

#Summarize the resampling results set
summary(knn.tune)
## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: fixed training/validation set 
## 
## - best parameters:
##  k
##  1
## 
## - best performance: 0.1909124 
## 
## - Detailed performance results:
##     k     error dispersion
## 1   1 0.1909124         NA
## 2   2 0.4222548         NA
## 3   3 0.4482106         NA
## 4   4 0.4155979         NA
## 5   5 0.4174301         NA
## 6   6 0.4397215         NA
## 7   7 0.4417980         NA
## 8   8 0.4405765         NA
## 9   9 0.4416758         NA
## 10 10 0.4503481         NA
## 11 11 0.4500428         NA
## 12 12 0.4504703         NA
## 13 13 0.4502870         NA
## 14 14 0.4520581         NA
## 15 15 0.4527299         NA
## 16 16 0.4590204         NA
## 17 17 0.4606694         NA
## 18 18 0.4604861         NA
## 19 19 0.4617687         NA
## 20 20 0.4598143         NA
# plot(knn.tune)
df <- as.data.frame(cbind(x=knn.tune$performance$k, y=knn.tune$performance$error))
plot_ly(df, x = ~x, y = ~y, type = "scatter", mode = "markers+lines") %>% 
  add_segments(x=1, xend=1, y=0.0, yend=0.45, type = "scatter", 
               line=list(color="gray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  add_segments(x=5, xend=5, y=0.0, yend=0.45, type = "scatter", 
               line=list(color="lightgray", width = 2, dash = 'dot'), mode = "lines", showlegend=FALSE) %>% 
  layout(title='Galaxy-spin k-NN Prediction - Error Rate against k', 
           xaxis=list(title='Number of nearest neighbors (k)'), yaxis=list(title='Classification error'))

2 Probabilistic Learning - Naive Bayes Classification

In the previous section, we described the types of machine learning methods and presented a lazy learning classification of numerical data using k-nearest neighbors. What about nominal features or textual data? In this section, we will begin to explore some classification techniques for categorical data. Specifically, we will (1) present the Naive Bayes algorithm, (2) review its assumptions, (3) discuss Laplace estimation, and (4) illustrate the naive Bayesian classifier on a Head and Neck Cancer Medication case-study. Later, in Chapter 7, we will also discuss text mining and natural language processing of unstructured text data.

2.1 Overview of the Naive Bayes Method

It may be useful to first review the basics of probability theory and Bayesian inference. Bayes classifiers use training data to calculate an observed probability of each class based on all the features. The probability links feature values to classes like a map. When labeling the test data, we utilize the feature values in the test data and the “map” to classify our test data with the most likely class. This idea seems simple but the corresponding algorithmic implementations might be very sophisticated.

The best scenario of accurately estimating the probability of an outcome-class map is when all features simultaneously contribute to the Bayes classifier. The naive Bayes algorithm is frequently used for text classifications. The maximum a posteriori assignment to the class label is based on obtaining the conditional probability density function for each feature given the value of the class variable.

2.2 Model Assumptions

The method name, Naive Bayes, suggests the dependence of the technique on some simple (“naive”) assumptions. Its most important assumption is that all of the features are expected to be equally important and independent. This rarely happens in real world data. However, sometimes even when the assumptions are violated, Naive Bayes still performs fairly well, particularly when the number of features \(p\) is large. This is why the Naive Bayes algorithm is often used as a powerful text classifier.

There are interesting relations between QDA (Quadratic Discriminant Analysis), LDA (Linear Discriminant Analysis) and Naive Bayes classification. Additional information about LDA and QDA is available here.

2.3 Bayes Formula

Let’s first define the set-theoretic Bayes formula. We assume that \(B_i\)’s are mutually exclusive events, for all \(i= 1, 2, \cdots, n\), where n represents the number of features. If \(A\) and \(B\) are two events, the Bayes conditional probability formula is as follows: \[Posterior\, Probability=\frac{likelihood\times Prior\, Probability}{Marginal\, Likelihood}\]

Symbolically,

\[P(A|B)=\frac{P(B|A)P(A)}{P(B)}.\] When \(B_i's\) represent a partition of the event space, \(S=\cup {B_i}\) and \(B_i\cap B_j = \emptyset,\ \forall i\not= j\). So we have:

\[P(A|B)=\frac{P(B|A)\times P(A)}{P(B|B_1)\times P(B_1) + P(B|B_2)\times P(B_2) + \cdots + P(B|B_n)\times P(B_n)}.\]

Now, let’s represent the Bayes formula in terms of classification using observed features. Having observed \(n\) features, \(F_i\), for each of \(K\) possible class outcomes, \(C_k\). Using the Bayes’ theorem, by decomposing the conditional probability, the Bayesian classifier may be reformulated to make it more tractable.

\[ P(C_k \mid F_1, \dots, F_n)= \frac{P(F_1, \dots, F_n|C_k)P(C_k)}{P(F_1, \dots, F_n)}.\]

In the above expression, only the numerator depends on the class label, \(C_k\), as the values of the features \(F_i\) are observed (or imputed) making the denominator constant. Let’s focus on the numerator.

The numerator essentially represents the joint probability model:

\[P(F_1, \dots, F_n|C_k)P(C_k) = \underbrace{P(F_1, \cdots, F_n, C_k)}_{\text{joint model}}.\]

Repeatedly using the chain rule and the definition of conditional probability simplifies this to:

\[P(F_1, \dots ,F_n, C_{k})=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2, \dots ,F_n, C_{k})=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3, \dots ,F_n, C_{k})=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3|F_4, \dots ,F_n, C_{k})\times P(F_4, \dots ,F_n, C_{k})=\] \[=\cdots=\] \[=P(F_1| F_2, \dots ,F_n, C_{k})\times P(F_2|F_3, \dots ,F_n, C_{k})\times P(F_3|F_4, \dots ,F_n, C_{k})\times \cdots \times P(F_n| C_{k})\times P(C_k)\]

Note that the “naive” qualifier in the method name, Naive Bayes classifier, is attributed to the oversimplification of the conditional probability. Assuming each feature \(F_i\) is conditionally statistical independent of every other feature \(F_j, \ \forall j\neq i\), given the category \(C_k\), we get:

\[P(F_i | F_{i+1}, \dots , F_{n}, C_k ) = P(F_i | C_k).\]

This reduces the joint probability model to:

\[P(F_1, \dots ,F_n, C_{k})= P(F_1| C_{k})\times P(F_2| C_{k})\times P(F_3| C_{k})\times \cdots \times P(F_n| C_{k})\times P(C_k)\]

Therefore, the joint model is:

\[P(F_1, \dots ,F_n, C_{k})= P(C_k) \prod_{i=1}^n {P(F_i| C_{k})}\]

Essentially, we express the probability of class level \(L\) given an observation, represented as a set of independent features \(F_1, F_2, \cdots, F_n\). Then the posterior probability that the observation is in class \(L\) is equal to:

\[P(C_L|F_1, \cdots, F_n)=\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)},\] where the denominator, \(\prod_{i=1}^nP(F_i)\), is a scaling factor that represents the marginal probability of observing all features jointly.

For a given case \(X=(F_1, F_2, \cdots, F_n)\), i.e., given vector of features, the naive Bayes classifier assigns the most likely class \(\hat{C}\) by calculating \(\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)}\) for all class labels \(L\), and then assigning the class \(\hat{C}\) corresponding to the maximum posterior probability. Analytically, \(\hat{C}\) is defined by: \[ \hat{C} = \arg\max_L{\frac{P(C_L)\prod_{i=1}^nP(F_i|C_L)}{\prod_{i=1}^nP(F_i)}}.\]

As the denominator is static for \(L\), the posterior probability above is maximized when the numerator is maximized, i.e., \(\hat{C} = \arg\max_L {P(C_L)\prod_{i=1}^nP(F_i|C_L)}.\)

The contingency table below illustrates schematically how the Bayesian, marginal, conditional, and joint probabilities may be calculated for a finite number of features (columns) and classes (rows).

\(Features\\Classes\) \(F_1\) \(F_2\) \(\cdots\) \(F_n\) Total
\(C_1\) \(\cdots\) \(\cdots\) \(\cdots\) \(\cdots\) Marginal \(P(C_1)\)
\(C_2\) \(\cdots\) \(\cdots\) \(\cdots\) Joint \(P(C_2, F_n)\) \(\cdots\)
\(\cdots\) \(\cdots\) \(\cdots\) \(\cdots\) \(\cdots\) \(\cdots\)
\(C_L\) Conditional \(P(F_1\mid C_L)=\frac{P(F_1,C_L)}{P(C_L)}\) \(\cdots\) \(\cdots\) \(\cdots\) \(\cdots\)
Total Marginal \(P(F_2)\) \(\cdots\) \(\cdots\) \(N\)

In the DSPA2 Appendix, we provide additional technical details, code, and applications of Bayesian simulation, modeling and inference.

2.4 The Laplace Estimator

If at least one \(P(F_i|C_L)=0\) then \(P(C_L|F_1, \cdots, F_n)=0\), which means the probability of being in this class is 0. However, \(P(F_i|C_L)=0\) could happen by random chance, e.g., when selecting the training data.

One of the solutions to this scenario is Laplace estimation, also known as Laplace smoothing, which can be accomplished in two ways. One is to add a small number to each count in the frequency table, which allows each class-feature combination to be at least one in the training data. Then \(P(F_i|C_L)>0\) for all \(i\) and we avoid degenerate cases. Another strategy is to add some small value, \(\epsilon\), to the numerator and denominator when calculating the posterior probability. Note that these small perturbations of the denominator should be larger than the changes in the numerator to avoid trivial (\(0\)) posterior for another class.

2.5 Case Study: Head and Neck Cancer Medication

2.5.1 Step 1: Collecting Data

We utilize the Inpatient Head and Neck Cancer Medication data for this case study, which is the case study 14 in our data archive.

Variables:

  • PID: coded patient ID
  • ENC_ID: coded encounter ID
  • Seer_stage: SEER cancer stage (0 =In situ, 1=Localized, 2=Regional by direct extension, 3=Regional to lymph nodes, 4=Regional (both codes 2 and 3), 5=Regional, NOS, 7= Distant metastases/systemic disease, 8=Not applicable, 9=Unstaged, unknown, or unspecified). See: SSM.
  • Medication_desc: description of the chemical composition of the medication
  • Medication_summary: brief description about medication brand and usage
  • Dose: the dosage in the medication summary
  • Unit: the unit for dosage in the Medication_summary
  • Frequency: the frequency of use in the Medication_summary
  • Total_dose_count: total dosage count according to the Medication_summary

2.5.2 Step 2: Exploring and preparing the data

Let’s first load the cancer dataset.

hn_med<-read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1", stringsAsFactors = FALSE)
str(hn_med)
## 'data.frame':    662 obs. of  9 variables:
##  $ PID               : int  10000 10008 10029 10063 10071 10103 1012 10135 10136 10143 ...
##  $ ENC_ID            : int  46836 46886 47034 47240 47276 47511 3138 47739 47744 47769 ...
##  $ seer_stage        : int  1 1 4 1 9 1 1 1 9 1 ...
##  $ MEDICATION_DESC   : chr  "ranitidine" "heparin injection" "ampicillin/sulbactam IVPB UH" "fentaNYL injection UH" ...
##  $ MEDICATION_SUMMARY: chr  "(Zantac) 150 mg tablet oral two times a day" "5,000 unit subcutaneous three times a day" "(Unasyn) 15 g IV every 6 hours" "25 - 50 microgram IV every 5 minutes PRN severe pain\nMaximum dose 200 mcg  Per PACU protocol" ...
##  $ DOSE              : chr  "150" "5000" "1.5" "50" ...
##  $ UNIT              : chr  "mg" "unit" "g" "microgram" ...
##  $ FREQUENCY         : chr  "two times a day" "three times a day" "every 6 hours" "every 5 minutes" ...
##  $ TOTAL_DOSE_COUNT  : int  5 3 11 2 1 2 2 6 15 1 ...

Change the seer_stage (cancer stage indicator) variable into a factor.

hn_med$seer_stage <- factor(hn_med$seer_stage)
str(hn_med$seer_stage)
##  Factor w/ 9 levels "0","1","2","3",..: 2 2 5 2 9 2 2 2 9 2 ...
table(hn_med$seer_stage)
## 
##   0   1   2   3   4   5   7   8   9 
##  21 265  53  90  46  18  87  14  68

2.5.2.1 Data preparation - processing text data for analysis

As you can see, the medication_summary contains a great amount of text. We should do some text mining to prepare the data for analysis. In R, the tm package is a good choice for text mining.

# install.packages("tm", repos = "https://cran.us.r-project.org")
# requires R V.3.3.1 +
library(tm)

First step for text mining is to convert text features (text elements) into a corpus object, which is a collection of text documents.

hn_med_corpus <- Corpus(VectorSource(hn_med$MEDICATION_SUMMARY))
print(hn_med_corpus)
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 662

After we constructed the corpus object, we could see that we have 662 documents. Each document represents an encounter (e.g., notes on medical treatment) for a patient.

inspect(hn_med_corpus[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] (Zantac) 150 mg tablet oral two times a day
## [2] 5,000 unit subcutaneous three times a day  
## [3] (Unasyn) 15 g IV every 6 hours
hn_med_corpus[[1]]$content
## [1] "(Zantac) 150 mg tablet oral two times a day"
hn_med_corpus[[2]]$content
## [1] "5,000 unit subcutaneous three times a day"
hn_med_corpus[[3]]$content
## [1] "(Unasyn) 15 g IV every 6 hours"

There are unwanted punctuations and other symbols in the corpus document that we want to remove. We use the tm_map() function for the cleaning.

corpus_clean <- tm_map(hn_med_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removePunctuation)
# corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
# corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

The above lines of code changed all the characters to lowercase, removed all punctuations and extra white spaces (typically created by deleting punctuations), and removed numbers (we could also convert the corpus to plain text).

inspect(corpus_clean[1:3])
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 3
## 
## [1] zantac mg tablet oral two times a day  unit subcutaneous three times a day 
## [3] unasyn g iv every hours
corpus_clean[[1]]$content
## [1] "zantac mg tablet oral two times a day"
corpus_clean[[2]]$content
## [1] " unit subcutaneous three times a day"
corpus_clean[[3]]$content
## [1] "unasyn g iv every hours"

DocumentTermMatrix() function can successfully tokenize the medication summary into words. It can count frequent terms in each document in the corpus object.

hn_med_dtm <- DocumentTermMatrix(corpus_clean)

2.5.2.2 Data preparation - creating training and test datasets

Just like in the kNN method, we need to separate the dataset into training and test subsets. We have to subset the raw data with other features, the corpus object and the document term matrix.

set.seed(12345)
subset_int <- sample(nrow(hn_med),floor(nrow(hn_med)*0.8))  # 80% training + 20% testing
hn_med_train <- hn_med[subset_int, ]
hn_med_test <- hn_med[-subset_int, ]
hn_med_dtm_train <- hn_med_dtm[subset_int, ]
hn_med_dtm_test <-hn_med_dtm[-subset_int, ]
corpus_train <- corpus_clean[subset_int]
corpus_test  <- corpus_clean[-subset_int]

# hn_med_train<-hn_med[1:562, ]
#hn_med_test<-hn_med[563:662, ]
# hn_med_dtm_train<-hn_med_dtm[1:562, ]
# hn_med_dtm_test<-hn_med_dtm[563:662, ]
#corpus_train<-corpus_clean[1:562]
#corpus_test<-corpus_clean[563:662]

Let’s examine the distribution of seer stages in the training and test datasets.

prop.table(table(hn_med_train$seer_stage))
## 
##          0          1          2          3          4          5          7 
## 0.01890359 0.41965974 0.07561437 0.13988658 0.07561437 0.03213611 0.12098299 
##          8          9 
## 0.01890359 0.09829868
prop.table(table(hn_med_test$seer_stage))
## 
##           0           1           2           3           4           5 
## 0.082706767 0.323308271 0.097744361 0.120300752 0.045112782 0.007518797 
##           7           8           9 
## 0.172932331 0.030075188 0.120300752

We can separate (dichotomize) the seer_stage into two categories:

  • no stage or early stage cancer (seer in [0;4]), and
  • later stage cancer (seer in [5;9]).

Of course, other binarizations are possible as well. Note that a %in% b is an intuitive interface to match and acts as a binary operator returning a logical vector (T or F) indicating if there is a match between the left and right operands.

hn_med_train$stage <- hn_med_train$seer_stage %in% c(5:9)
hn_med_train$stage <- factor(hn_med_train$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
hn_med_test$stage  <- hn_med_test$seer_stage %in% c(5:9)
hn_med_test$stage  <- factor(hn_med_test$stage, levels=c(F, T), labels = c("early_stage", "later_stage"))
prop.table(table(hn_med_train$stage))
## 
## early_stage later_stage 
##   0.7296786   0.2703214
prop.table(table(hn_med_test$stage))
## 
## early_stage later_stage 
##   0.6691729   0.3308271

2.5.2.3 Visualizing text data - word clouds

A word cloud can help us visualize text data. More frequent words would have larger fonts in the figure, while less common words are appearing in smaller fonts. There is a wordcloud package in R that is commonly used for creating these figures.

# install.packages("wordcloud", repos = "https://cran.us.r-project.org")
library(wordcloud)
wordcloud(corpus_train, min.freq = 40, random.order = FALSE, colors=brewer.pal(5, "Dark2"))

The random.order=FALSE option makes more frequent words appear in the center of the word cloud. The min.freq=40 option sets the cutoff word frequency to be at least 40 times in the corpus object. Therefore, the words must appear in at least 40 medication summaries to be shown on the graph.

We can also visualize the difference between early stages and later stages using this type of graph.

early <- subset(hn_med_train, stage=="early_stage")
later <- subset(hn_med_train, stage=="later_stage")
wordcloud(early$MEDICATION_SUMMARY, max.words = 20, colors=brewer.pal(3, "Dark2"))

wordcloud(later$MEDICATION_SUMMARY, max.words = 20, colors=brewer.pal(3, "Dark2"))

# tdm = as.matrix(hn_med_dtm)
# tdm1 <- tdm[nchar(rownames(tdm)) < 15, ]
# colnames(tdm1) <- hn_med_dtm$dimnames$Terms[1:14]
# comparison.cloud(tdm1, random.order=FALSE,
#                  colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
#                  title.size=1, min.words=5, max.words=50, scale=c(1.5, 0.4),rot.per=0.4)

We can see that the frequent words are somewhat different in the medication summaries of the early and later stage patients.

2.5.2.4 Data preparation - creating indicator features for frequent words

For simplicity, we utilize the medication summary as the only feature to classify cancer stages. You may recall that earlier in the kNN section we used the native data features to obtain kNN classifications. Now, we are going to make frequencies of words as text-derived data features.

summary(findFreqTerms(hn_med_dtm_train, 5))
##    Length     Class      Mode 
##       108 character character
hn_med_dict <- as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train <- DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test  <- DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))

The above code limits the document term matrix with words that have appeared in at least 5 different documents. This created (about) 118 features for us to use.

The Naive Bayes classifier trains on data with categorical features, as it uses frequency tables for learning the data affinities. To create the combinations of class and feature values comprising the frequency-table (matrix), all features must be categorical. For instance, numeric features have to be converted (binned) into categories. Thus, we need to transform our word count features into categorical data. One way to achieve this is to change the count into an indicator of whether this word appears in the document describing the patient encounter. We can create a simple function to convert the presence of a specific word (column) in an encounter document (row) to a binary “Yes” (present) or “No” (not present).

convert_counts <- function(wordFreq) {
  wordFreq <- ifelse(wordFreq > 0, 1, 0)
  wordFreq <- factor(wordFreq, levels = c(0, 1), labels = c("No", "Yes"))
  return(wordFreq)
}

We employ hard-thresholding here x<-ifelse(x>0, 1, 0). This is saying that if we have an x that is greater than 0, we assign value 1 to it, otherwise the value is set to 0.

Now let’s apply our own function convert_counts() on each column (MARGIN=2) of the training and testing datasets.

hn_train <- apply(hn_train, MARGIN = 2, convert_counts)
hn_test <- apply(hn_test, MARGIN = 2, convert_counts)

# Check the structure of hn_train and hn_train:
# head(hn_train); dim(hn_train)

So far, we successfully created indicators for words that appeared at least in 5 different documents in the training data.

2.5.3 Step 3 - training a model on the data

The package we will use for the Naive Bayes classifier is called e1071.

# install.packages("e1071", repos = "https://cran.us.r-project.org")
library(e1071)

The function naiveBayes() has following components:

m<-naiveBayes(train, class, laplace=0)

  • train: data frame containing numeric training data (features)
  • class: factor vector with the class for each row in the training data.
  • laplace: positive double controlling Laplace smoothing; default is \(0\) and disables Laplace smoothing.

Next, we will invoke the naiveBayes() classifier.

hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)

Then we can use the classifier to make predictions using predict(). Recall that when we presented the AdaBoost example in Chapter 2, we saw the general machine learning and artificial intelligence protocol, from training, to prediction, and assessment.

For most supervised and unsupervised methods, the generic function predict() has the following syntax:

p <- predict(m, test, type="class")

  • m: classifier, in our case naiveBayes()
  • test: test data frame or matrix
  • type: either "class" or "raw" specifies whether the predictions should be the most likely class value or the raw predicted probabilities.
hn_test_pred <- predict(hn_classifier, hn_test)

2.5.4 Step 4 - evaluating model performance

Similarly to the approach in the earlier kNN approach, we can use cross table to compare the predicted vs. the true classes using the testing data.

library(gmodels)
CT <- CrossTable(hn_test_pred, hn_med_test$stage)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  133 
## 
##  
##              | hn_med_test$stage 
## hn_test_pred | early_stage | later_stage |   Row Total | 
## -------------|-------------|-------------|-------------|
##  early_stage |          64 |          30 |          94 | 
##              |       0.019 |       0.039 |             | 
##              |       0.681 |       0.319 |       0.707 | 
##              |       0.719 |       0.682 |             | 
##              |       0.481 |       0.226 |             | 
## -------------|-------------|-------------|-------------|
##  later_stage |          25 |          14 |          39 | 
##              |       0.046 |       0.093 |             | 
##              |       0.641 |       0.359 |       0.293 | 
##              |       0.281 |       0.318 |             | 
##              |       0.188 |       0.105 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |          89 |          44 |         133 | 
##              |       0.669 |       0.331 |             | 
## -------------|-------------|-------------|-------------|
## 
## 
CT
## $t
##              y
## x             early_stage later_stage
##   early_stage          64          30
##   later_stage          25          14
## 
## $prop.row
##              y
## x             early_stage later_stage
##   early_stage   0.6808511   0.3191489
##   later_stage   0.6410256   0.3589744
## 
## $prop.col
##              y
## x             early_stage later_stage
##   early_stage   0.7191011   0.6818182
##   later_stage   0.2808989   0.3181818
## 
## $prop.tbl
##              y
## x             early_stage later_stage
##   early_stage   0.4812030   0.2255639
##   later_stage   0.1879699   0.1052632
mod_TN <- CT$prop.row[1, 1]  
mod_FP <- CT$prop.row[1, 2]
mod_FN <- CT$prop.row[2, 1]
mod_TP <- CT$prop.row[2, 2]
# caret::confusionMatrix(hn_test_pred, hn_med_test$stage)
# CT$prop.row
library(plotly)
plot_ly(x = c("TN", "FN", "FP", "TP"),
  y = c(mod_TN, mod_FN, mod_FP, mod_TP),
  name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Confusion Matrix", 
           legend=list(title=list(text='<b> Metrics </b>')), 
           xaxis=list(title='Metrics'), yaxis=list(title='Probability'))

It may be worth quickly looking forward in Chapter 9 where we present a summary table containing commonly used measures for evaluating the performance of binary tests, classifiers, or predictions.

In this case, the prediction accuracy of the Naive Bayes classifier, assessed on the testing dataset, is:

\[ACC = \frac{T P + T N}{T P + F P + F N + T N } = \frac{78}{133}=0.59.\]

From the cross table we can see that our testing-data prediction accuracy is \(\frac{78}{133}=0.59\).

2.5.5 Step 5 - improving model performance

After setting laplace=5, the accuracy goes to \(acc=\frac{87}{133}=0.65\). Although there is a small difference in accuracy, in general, such protocol alterations may increase or decrease the rate of true later stage patient identification using testing and validation data.

hn_med_dict <- as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train <- DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test  <- DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))

set.seed(1234)
# View(as.data.frame(inspect(hn_train)))
traindf <- data.frame(stage=hn_med_train$stage, as.matrix(hn_train))
testdf <- data.frame(stage=hn_med_test$stage, as.matrix(hn_test))

# for Training and Testing data, transform predictors/features to factors
for (cols in names(traindf)) traindf[[cols]] <- factor(traindf[[cols]])
for (cols in names(testdf)) testdf[[cols]] <- factor(testdf[[cols]])

# check variable types
# str(traindf)

# run NB classifier (for cancer stage)
hn_classifier <- naiveBayes(traindf[,-1], traindf$stage, laplace = 5, type = "class")
hn_test_pred1 <- predict(hn_classifier, testdf)
caret::confusionMatrix(hn_test_pred1, testdf$stage)
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    early_stage later_stage
##   early_stage          83          40
##   later_stage           6           4
##                                           
##                Accuracy : 0.6541          
##                  95% CI : (0.5668, 0.7344)
##     No Information Rate : 0.6692          
##     P-Value [Acc > NIR] : 0.6804          
##                                           
##                   Kappa : 0.0292          
##                                           
##  Mcnemar's Test P-Value : 1.141e-06       
##                                           
##             Sensitivity : 0.93258         
##             Specificity : 0.09091         
##          Pos Pred Value : 0.67480         
##          Neg Pred Value : 0.40000         
##              Prevalence : 0.66917         
##          Detection Rate : 0.62406         
##    Detection Prevalence : 0.92481         
##       Balanced Accuracy : 0.51175         
##                                           
##        'Positive' Class : early_stage     
## 

Accuracy is \(\frac{87}{133} = 0.65\). This is a marginal performance improvement, compared to the prior (\(laplace=0\)) default model. Notice that choosing a different Laplace smoothing parameter alters the balance between the true/false and positive/negative elements in the \(2\times 2\) confusion matrix.

2.5.6 Step 6 - compare Naive Bayesian vs. LDA

As mentioned earlier, Naive Bayes with normality assumption is a special case of Discriminant Analysis. It might be interesting to compare the prediction results of Naive Bayes and LDA classification.

Note: LDA assumes the predictors are jointly approximately normally distributed.

library(MASS)
# df_hn_train = data.frame(lapply(as.data.frame(hn_train),as.numeric), 
#                          stage = hn_med_train$stage)
# df_hn_test = data.frame(lapply(as.data.frame(hn_test),as.numeric), 
#                         stage = hn_med_test$stage)

library("dplyr")
binarizeFunction <- function(x) { ifelse(x=="Yes", 1,0) }

#  A function to Convert Categorical variables to numeric
cat2Numeric <- function (dfInput) {
  df = as.data.frame(lapply( as.data.frame(dfInput), factor)) %>%
                       mutate_all(binarizeFunction)
   return(df)
}

# define the numeric DF of predictors (X) and outcome (Y=stage)
df_hn_train = data.frame(as.matrix(hn_train), stage = as.numeric(hn_med_train$stage))
df_hn_test = data.frame(as.matrix(hn_test), stage = as.numeric(hn_med_test$stage))

# Remove the multicollinearity - this should be done via VIF assessment, 
#      but for now, just take the first few predictors
df_hn_train <- df_hn_train[ , c(1:34, 40:50, 60:70, 109)]
df_hn_train$stage <- hn_med_train$stage
df_hn_test$stage <- hn_med_test$stage

# Fit LDA
set.seed(1234)
hn_lda <- lda(data=df_hn_train, stage ~ .)
# hn_pred = predict(hn_lda, df_hn_test[,-104])
hn_pred = predict(hn_lda, df_hn_test)
CrossTable(hn_pred$class, df_hn_test$stage)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  133 
## 
##  
##               | df_hn_test$stage 
## hn_pred$class | early_stage | later_stage |   Row Total | 
## --------------|-------------|-------------|-------------|
##   early_stage |          85 |          37 |         122 | 
##               |       0.138 |       0.280 |             | 
##               |       0.697 |       0.303 |       0.917 | 
##               |       0.955 |       0.841 |             | 
##               |       0.639 |       0.278 |             | 
## --------------|-------------|-------------|-------------|
##   later_stage |           4 |           7 |          11 | 
##               |       1.535 |       3.104 |             | 
##               |       0.364 |       0.636 |       0.083 | 
##               |       0.045 |       0.159 |             | 
##               |       0.030 |       0.053 |             | 
## --------------|-------------|-------------|-------------|
##  Column Total |          89 |          44 |         133 | 
##               |       0.669 |       0.331 |             | 
## --------------|-------------|-------------|-------------|
## 
## 
# caret::confusionMatrix(hn_pred$class, df_hn_test$stage)

There are differences in the performance of the Naive Bayesian (\(acc=\frac{87}{133}=0.65\)) and the LDA classifiers (\(acc=\frac{92}{133}=0.69\)) in terms of the overall accuracy. We can compare the type II (false-negative prediction) error rates of the naive Bayesian and the LDA classifiers are \(\frac{40}{133}=0.30\) and \(\frac{37}{133}=0.28\), respectively. Such differences may have a nuanced clinical importance for early detection of later-stage cancer.

Later, we will step deeper into the space of classification problems and see more sophisticated approaches, especially for difficult problems like cancer forecasting.

3 Decision Trees and Divide and Conquer Classification

When classification models need to be apparent and directly mechanistically interpreted, kNN and naive Bayes methods we saw earlier may not be useful, as they do not yield explicit classification rules. In some cases, we need to derive direct and clear rules for our classification decisions. Just like a scoring criterion for driving ability, credit scoring, or grading rubrics, mechanistic understanding of the derived labeling, prediction, or decision making is often beneficial. For instance, we can assign credit or deduct a penalty for each driving operation task. These rules facilitate the critical decision to issue a driver license to applicants that are qualified, ready, and able to safely operate a specific type of a motor vehicle.

In this section, we will (1) see a simple motivational example of decision trees based on the Iris data, (2) describe decision-tree divide and conquer methods, (3) examine certain measures quantifying classification accuracy, (4) show strategies for pruning decision trees, (5) work through a Quality of Life in Chronic Disease case-study, and (6) review the One Rule and RIPPER algorithms.

3.1 Motivation

Decision tree learners enable classification via tree structures modeling the relationships among all features and potential outcomes in the data. All decision trees begin with a trunk representing all points that are part of the same cohort. Iteratively, the trunk is split into narrower and narrower branches by forking-decisions based on the intrinsic data structure. At each step, splitting the data into branches may include binary or multinomial classification. The final decision is obtained when the tree branching process terminates. The terminal (leaf) nodes represent the action to be taken as the result of the series of branching decisions. For predictive models, the leaf nodes provide the expected forecasting results given the series of events in the decision tree. There are a number of R packages available for decision tree classification including rpart, C5.0, party, etc.

3.2 Hands-on Example: Iris Data

Let’s start by seeing a simple example using the Iris dataset, which we saw in Chapter 2. The data features or attributes include Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, and classes are represented by the Species taxa (setosa, versicolor, and virginica).

# install.packages("party")
library("party")
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris); table(iris$Species)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 
##     setosa versicolor  virginica 
##         50         50         50

The ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris) function will build a conditional-inference decision tree.

iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
print(iris_ctree)
## 
##   Conditional inference tree with 4 terminal nodes
## 
## Response:  Species 
## Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width 
## Number of observations:  150 
## 
## 1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
##   2)*  weights = 50 
## 1) Petal.Length > 1.9
##   3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
##     4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
##       5)*  weights = 46 
##     4) Petal.Length > 4.8
##       6)*  weights = 8 
##   3) Petal.Width > 1.7
##     7)*  weights = 46
plot(iris_ctree, cex=2)  #  party::plot.BinaryTree()

head(iris); tail(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

Similarly, we can demonstrate a classification of the iris taxa via rpart:

# install from https://cran.r-project.org/web/packages/rpart.utils/
library(rpart)
library(rpart.utils)
library(plotly)
# iris_rpart = rpart(Species~., data=iris)
# print(iris_rpart)
# 
# Use the `rattle::fancyRpartPlot` to generates an elegant plot
# install.packages("rattle")
# library(rattle)
# fancyRpartPlot(iris_rpart, cex = 1.5, caption = "rattle::fancyRpartPlot (Iris flowers)")
#
#### Plot_ly decision tree visualization
iris_rpart <- rpart(Species ~ . , method="class", data=iris)
# printcp(iris_rpart)

# Set the Graph/Tree Node names
treeFrame <- iris_rpart$frame

# Identify Leaf nodes
isLeave <- treeFrame$var == "<leaf>"
nodes <- rep(NA, length(isLeave))

# Get the 3 Isis taxa labels: "setosa", "versicolor", "virginica" for all nodes (including leaves)
ylevel <- attr(iris_rpart, "ylevels")
nodes[isLeave] <- ylevel[treeFrame$yval][isLeave]
nodes[!isLeave] <- labels(iris_rpart)[-1][!isLeave[-length(isLeave)]]

# Get all Tree Graph Connections (source --> target)
treeFrame <- iris_rpart$frame
treeRules <- rpart.utils::rpart.rules(iris_rpart)

targetPaths <- sapply(as.numeric(row.names(treeFrame)), 
                      function(x) { strsplit(unlist(treeRules[x]),split=",") } )

lastStop <- sapply(1:length(targetPaths),
                   function(x) { targetPaths[[x]][length(targetPaths[[x]])] } )

oneBefore <- sapply(1:length(targetPaths),
                    function(x) { targetPaths[[x]][length(targetPaths[[x]])-1] } )

# Initialize the tree edges
target=c()
source=c()
values=treeFrame$n
for(i in 2:length(oneBefore)) {
  tmpNode=oneBefore[[i]]
  q = which(lastStop==tmpNode)

  q=ifelse(length(q)==0, 1, q)
  source = c(source, q)
  target = c(target, i)
}
source=source-1
target=target-1

# Display the sankey tree
plot_ly(type = "sankey", orientation = "h", 
             node = list(label = nodes, pad = 15, thickness = 30, line = list(color="black", width=1)),
             link = list(source = source, target = target, value=values[-1])) %>% 
 layout(title = "Iris rPart Decision Tree", font = list(size = 12))

3.3 Decision Tree Overview

The decision tree algorithm represents an upside down tree with lots of branches where a series of logical decisions are encoded as branches between nodes of the tree. The classification begins at the root node and goes through many branches until it gets to the terminal nodes. This iterative process splits the data into different classes by a rigid criterion.

3.3.1 Divide and conquer

Decision trees involve recursive partitioning that use data features and attributes to split the data into groups (nodes) of similar classes.

To make classification trees using data features, we need to observe the pattern between the data features and potential classes using training data. We can draw scatter plots and separate groups that are clearly bundled together. Each of the groups is considered a segment of the data. After getting the approximate range of each feature value within each group, we can make the decision tree.

\[X = [X_1, X_2, X_3, \cdots, X_k] = \underbrace{ \begin{pmatrix} x_{1, 1} & x_{1, 2} & \cdots & x_{1, k} \\ x_{2, 1} & x_{2, 2} & \cdots & x_{2, k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n, 1} & x_{n, 2} & \cdots & x_{n, k} \end{pmatrix} }_{features/attributes} \left\} \begin{array}{ll} & \\ cases & \\ & \end{array} \right.\]

The decision tree algorithms use a top-down recursive divide-and-conquer approach (sometimes they may also use bottom up or mixed splitting strategies) to divide and evaluate the splits of a dataset \(D\) (input). The best split decision corresponds to the split with the highest information gain, reflecting a partition of the data into \(K\) subsets (using divide-and-conquer). The iterative algorithm terminates when some stopping criteria are reached. Examples of stopping conditions used to terminate the recursive process include:

  • All the samples belong to the same class, that is they have the same label and the sample is already pure.
  • Stop when the majority of the points are already of the same class (relative to some error threshold).
  • There are no remaining attributes on which the samples may be further partitioned.

One objective criterion for splitting or clustering data into groups is based on the information gain measure, or impurity reduction, which can be used to select the test attribute at each node in the decision tree. The attribute with the highest information gain (i.e., greatest entropy reduction) is selected as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions. Three commonly used measures for evaluating the impurity reduction are: Misclassification error, Gini index and Entropy.

For a given table containing pairs of attributes and their class labels, we can assess the homology of the classes in the table. A table is pure (homogeneous) if it only contains a single class. If a data table contains several classes, then we say that the table is impure or heterogeneous. This degree of impurity or heterogeneity can be quantitatively evaluated using impurity measures like entropy, Gini index, and misclassification error.

3.3.2 Entropy

The Entropy is an information measure of the amount of disorder or uncertainty in a system. Suppose we have a data set \(D=(X_1, X_2, \cdots, X_n)\) that includes \(n\) features (variables) and suppose each of these features can take on any of \(k\) possible values (states). Then the cardinality of the entire system is \(k^n\) as each of the features are assumed to have \(k\) independent states, thus the total number of different datasets that can be expected is \(\underbrace{k\times k\times k \times \cdots\times k}_{n}=k^n\). Suppose \({\displaystyle p_{1},p_{2},\cdots, p_k}\) represent the proportions of each class (note: \(\sum_{i} {p_i}=1\)) present in the child node that results from a split in a decision tree classifier. Then the entropy measure is defined by:

\[Entropy(D)= - \sum_{i} {p_i \log_{2}{p_i}}.\]

If each of the \(1\le i\leq k\) states for each feature is equally likely to be observed with probability \(p_i=\frac{1}{k}\), then the entropy is maximized: \[ Entropy(D)=- \sum_{i=1}^{k} {\frac{1}{k} \log{\frac{1}{k}}}= \sum_{i=1}^{k} {\frac{1}{k} \log{k}}= k{\frac{1}{k} \log{k}}=\log{k} = O(1).\]

In the other extreme, the entropy is minimized when the probability of one class is unitary and all other ones are trivial. Note that by L’Hopital’s Rule \(\lim_{x\longrightarrow 0} {(x\times \log(x))} =\lim_{x\longrightarrow 0} {\frac{\frac{1}{x}}{-\frac{1}{x^2}}}=\lim_{x\longrightarrow 0} {x}=0\) for a single class classification where the probability of one class is unitary (\(p_{i_o}=1\)) and the other ones are trivial (\(p_{i\not= i_o}=0\)):

\[Entropy(D)=p_{i_o}\times \log{p_{i_o}} + \sum_{i\not=i_o}{p_i \log{p_i}}=\] \[=1\times \log{1} + \lim_{x \longrightarrow 0}{\sum_{i\not=i_o}{x \log(x)}} = 0+0=0.\]

In classification settings, higher entropy (i.e., more disorder) corresponds to a sample that has a mixed collection of labels. Conversely, lower entropy corresponds to a classification where we have mostly pure partitions. In general, the entropy of a sample \(D=\{X_1, X_2, \cdots, X_n\}\) is defined by:

\[H(D) = - \sum_{i=1}^{k} {P(c_i|D) \log{P(c_i|D)}}, \]

where \(P(c_i|D)\) is the probability of a data point in \(D\) being labeled with class \(c_i\), and \(k\) is the number of classes (clusters). \(P(c_i|D)\) can be estimated from the observed data by:

\[P(c_i|D) = \frac{P(D | c_i)P(c_i)}{P(D)} = \frac{|\{x_j \in D | x_j \text{ has label } y_j = c_i\}|}{|D|}.\]

Observe that if the observations are evenly split among all \(k\) classes then \(P(c_i|D)=\frac{1}{k}\) and \[H(D) = - \sum_{i=1}^{k} {\frac{1}{k} \log{\frac{1}{k}}}=\log{k}=O(1).\]

At the other extreme, if all the observations are from one class then: \[H(D)=-1*\log(1)=0.\]

Also note that the base of the log function is somewhat irrelevant and can be used to normalize (scale) the range of the entropy \(\left ( \log_b(x) = \frac{\log_2(x)}{\log_2(b)}\right )\).

The Information gain is the expected reduction in entropy caused by knowing the value of an attribute.

3.3.3 Misclassification Error and Gini Index

Similar to the Entropy measure, the Misclassification error and the Gini index are also applied to evaluate information gain. The Misclassification error is defined by the formula:

\[ME = 1-\max_k (p_k).\]

And the Gini index is expressed as: \[GI = \sum_{k}p_k(1-p_k)=1-\sum_{k}p_k^2.\]

3.3.4 C5.0 Decision Tree Algorithm

The C5.0 algorithm is a popular implementation of decision trees. To begin with, let’s consider the term purity. If the segments of data contain a single class, they are considered pure. The entropy represents a mathematical formalism measuring purity for data segments.

\[Entropy(S)=- \sum_{i=1}^c {p_i \log_2(p_i)},\]

where entropy is the measurement, c is the number of total class levels, and \(p_i\) refers to the proportion of observations that fall into each class (i.e., probability of a randomly selected data point to belong to the \(i^{th}\) class level. For 2 possible classes, the entropy ranges from 0 to 1. For \(n\) classes, the entropy ranges from \(0\) to \(\log_2(n)\), where the minimum entropy corresponds to data that is purely homogeneous (completely deterministic/predictable) and the maximum entropy represents completely disordered data (stochastic or extremely noisy). You might wonder what is the benefit of using entropy? The answer is because the smaller the entropy the more information is contained in the corresponding decision node split. Systems (data) with high entropy indicate significant information content (randomness) and data with low entropy indicates highly-compressible data with structure in it.

If we only have one class in the segment then the entropy is trivial, \(Entropy(S)=(-1)\times \log_2(1)=0\). Let’s try another example. If we have a segment of data that contains two classes, the first class contains 80% of the data and the second class contains the remaining 20%. Then, we have the following entropy: \(Entropy(S)=-0.8\log_2(0.8)-0.2\log_2(0.2)=0.7219281.\)

For a two class problem, the relationship between class-proportions and entropy is illustrated in the following graph. Assume \(x\) is the proportion for one of the classes.

N <- 1000
set.seed(1234)
x <- runif(N)
x_ind <- sort(x)
y <- -x_ind*log2(x_ind)-(1-x_ind)*log2(1-x_ind)

# curve(-x*log2(x)-(1-x)*log2(1-x), col="red", main="Entropy for Different Proportions", xlab = "x (proportion for class 1)", ylab = "Entropy", lwd=3)
# segments(0.8, 0, 0.8, 0.7219281, lwd=2, col="blue", lty = 2)
# points(0.8, 0.7219281, col = "blue", pch=21, cex=2, lwd=2)
# text(-0.02, 0.3, pos=4, srt = 90, "High-Structure/Low-Entropy")
# text(0.48, 0.3, pos=4, srt = 90, "Low-Structure/High-Entropy")
# text(0.96, 0.3, pos=4, srt = 90, "High-Structure/Low-Entropy")
# text(0.65, 0.8, pos=4, "(x=0.8,Entr=0.72)")

modelLabels <- c('High-Structure/Low-Entropy', 'Low-Structure/High-Entropy', 'High-Structure/Low-Entropy')
modelLabels.x <- c(0.03, 0.5, 0.97)
modelLabels.y <- c(0.5, 0.7, 0.5)
modelLabels.col <- c("blue", "red", "blue")
plot_ly() %>%
  add_lines(x = ~x_ind, y = ~y, name="Entropy", line = list(width = 4)) %>%
  add_segments(x=0.8, xend=0.8, y=0.001, yend = 0.7219281, showlegend=F) %>%
  # add_segments(x=-0.2, xend=1.1, y=0, yend = 0, showlegend=F) %>%
  # add_markers(x = ~x, y = ~y, name="Sample Simulated Data") %>%
  #add_lines(x = ~c(point1$x,point2$x), y = ~c(point1$y,point2$y), name="Second PC, Orthogonal to lm(Y ~ X)", 
  #          line = list(width = 4))  %>% 
  add_markers(x = 0.8, y = 0.7219281, name="(x=0.8,Entropy=0.72)", marker = list(size = 20,
                                    color = 'green', line = list(color = 'yellow', width = 2))) %>%
  layout(title="Binary Class Entropy",
         xaxis=list(title="Proportion of Class 1 Observations", scaleanchor="y"),  # control the y:x axes aspect ratio
         yaxis = list(title="Entropy", scaleanchor  = "x", range=c(-0.1, 1.1)), legend = list(orientation = 'h'),
         annotations = list(text=modelLabels,  x=modelLabels.x, y=modelLabels.y, textangle=90,
                            font=list(size=15, color=modelLabels.col), showarrow=FALSE))