1 Model Performance Assessment, Validation, and Improvement

In previous chapters, we used several measures, such as prediction accuracy, to evaluate classification and regression models. In general, accurate predictions for one dataset does not necessarily imply that our model is perfect or that it will reproduce when tested on external or prospective data. We need additional metrics to evaluate the model performance and ensure it is robust, reproducible, reliable, and unbiased.

In this chapter, we will (1) discuss various evaluation strategies for prediction, clustering, classification, regression, and decision-making, (2) demonstrate performance visualization, e.g., visualization of ROC curves, (3) discuss performance tradeoffs, and (4) present internal statistical cross-validation and bootstrap sampling.

1.1 Measuring the performance of classification methods

Prediction accuracy represents just one evaluation aspect of classification model performance and reliability assessment of clustering methods. Different classification models and alternative clustering techniques may be appropriate for different situations. For example, when screening newborns for rare genetic defects, we may want the model to have as few true-negatives as possible. We don’t want to classify anyone as “non-carriers” when they actually may have a defective gene, since early treatment might impact near- and long-term life experiences.

We can use the following three types of data to evaluate the performance of a classifier model.

Actual class values (for supervised classification).
Predicted class values.
Estimated probabilities of the prediction.

We have already seen examples of these cases. For instance, the last type of validation relies on the predict(model, test_data) function that we used previously in the classification and regression chapters (Chapter 3-6). Let’s revisit some of the models and test data we discussed in Chapter 5 - Inpatient Head and Neck Cancer Medication data. We will demonstrate prediction probability estimation using this case-study CaseStudy14_HeadNeck_Cancer_Medication.csv.

hn_med <- read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1",
                   stringsAsFactors = FALSE)
hn_med$seer_stage <- factor(hn_med$seer_stage)
library(tm)
hn_med_corpus <- Corpus(VectorSource(hn_med$MEDICATION_SUMMARY))
 corpus_clean <- tm_map(hn_med_corpus, tolower)
 corpus_clean <- tm_map(corpus_clean, removePunctuation)
 corpus_clean <- tm_map(corpus_clean, stripWhitespace)
 corpus_clean <- tm_map(corpus_clean, removeNumbers)
 hn_med_dtm   <- DocumentTermMatrix(corpus_clean)
 hn_med_train <- hn_med[1:562, ]
 hn_med_test  <- hn_med[563:662, ]
 hn_med_dtm_train <- hn_med_dtm[1:562, ]
 hn_med_dtm_test <- hn_med_dtm[563:662, ]
 corpus_train <- corpus_clean[1:562]
 corpus_test <- corpus_clean[563:662]
 hn_med_train$stage <- hn_med_train$seer_stage %in% c(4, 5, 7)
 hn_med_train$stage <- factor(hn_med_train$stage, levels=c(F, T), 
                              labels = c("early_stage", "later_stage"))
 hn_med_test$stage <- hn_med_test$seer_stage %in% c(4, 5, 7)
 hn_med_test$stage <- factor(hn_med_test$stage, levels=c(F, T), 
                             labels = c("early_stage", "later_stage"))
 
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}

hn_med_dict <- as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train <- DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test  <- DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))
hn_train <- apply(hn_train, MARGIN = 2, convert_counts)
hn_test  <- apply(hn_test, MARGIN = 2, convert_counts)
library(e1071)
hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)

pred_raw <- predict(hn_classifier, hn_test, type="raw")
head(pred_raw)

##      early_stage later_stage
## [1,]   0.8812634  0.11873664
## [2,]   0.8812634  0.11873664
## [3,]   0.8425828  0.15741724
## [4,]   0.9636007  0.03639926
## [5,]   0.8654298  0.13457022
## [6,]   0.8654298  0.13457022

The above output includes the prediction probabilities for the first 6 rows of the data. This example is based on Naive Bayes classifier (naiveBayes), however the same approach works for any other machine learning classification or prediction technique. The type="raw" indicates that the prediction call will return the conditional a-posterior probabilities for each class. When type="class", predict() returns the class label corresponding to the maximal probability.

In addition, we can report the predicted probability with the outputs of the Naive Bayesian decision-support system (hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)):

pred_nb <- predict(hn_classifier, hn_test)
head(stats::ftable(pred_nb))

##                                       
##  "pred_nb" "early_stage" "later_stage"
##                                       
##                       96             4

The general predict() method automatically subclasses to the specific predict.naiveBayes(object, newdata, type = c("class", "raw"), threshold = 0.001, ...) call where type="raw" and type = "class" specify the output as the conditional a-posterior probabilities for each class or the class with maximal probability, respectively. Back in Chapter 5, we discussed the C5.0 and the randomForest classifiers to predict the chronic disease score in a another case-study Quality of Life (QoL).

qol <- read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
qol <- qol[!qol$CHRONICDISEASESCORE==-9, ]
qol$cd <- qol$CHRONICDISEASESCORE>1.497
qol$cd <- factor(qol$cd, levels=c(F, T), labels = c("minor_disease", "severe_disease"))
qol <- qol[order(qol$ID), ]

# Remove ID (col=1) # the clinical Diagnosis (col=41) will be handled later
qol <- qol[ , -1]

# 80-20%  training-testing data split
set.seed(1234)
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train <- qol[train_index, ]
qol_test <- qol[-train_index, ]

library(C50)
set.seed(1234)
qol_model <- C5.0(qol_train[,-c(40, 41)], qol_train$cd)

Below are the (probability) results of the C5.0 classification tree model prediction.

pred_prob <- predict(qol_model, qol_test, type="prob")
head(pred_prob)

##    minor_disease severe_disease
## 1      0.3762353      0.6237647
## 3      0.3762353      0.6237647
## 4      0.2942230      0.7057770
## 9      0.7376664      0.2623336
## 10     0.3762353      0.6237647
## 12     0.3762353      0.6237647

These can be contrasted against the C5.0 tree classification label results.

pred_tree <- predict(qol_model, qol_test)
head(pred_tree); head(stats::ftable(pred_tree))

## [1] severe_disease severe_disease severe_disease minor_disease  severe_disease
## [6] severe_disease
## Levels: minor_disease severe_disease

##                                              
##  "pred_tree" "minor_disease" "severe_disease"
##                                              
##                          214              229

The same complementary types of outputs can be reported for most machine learning classification and prediction approaches.

1.2 Evaluation strategies

In Chapter 5, we saw an attempt to categorize the supervised classification and unsupervised clustering methods. Similarly, the table below summarizes the basic types of evaluation and validation strategies for different forecasting, prediction, ensembling, and clustering techniques. (Internal) Statistical Cross Validation or external validation should always be applied to ensure reliability and reproducibility of the results. The SciKit clustering performance evaluation and Classification metrics sections provide details about many alternative techniques and metrics for performance evaluation of clustering and classification methods.

Inference	Outcome	Evaluation Metrics	R functions
Classification & Prediction	Binary	Accuracy, Sensitivity, Specificity, PPV/Precision, NPV/Recall, LOR	`caret::confusionMatrix`, `gmodels::CrossTable`, `cluster::silhouette`
Classification & Prediction	Categorical	Accuracy, Sensitivity/Specificity, PPV, NPV, LOR, Silhouette Coefficient	`caret::confusionMatrix`, `gmodels::CrossTable`, `cluster::silhouette`
Regression Modeling	Real Quantitative	correlation coefficient, $R^2$, RMSE, Mutual Information, Homogeneity and Completeness Scores	`cor`, `metrics::mse`

1.2.1 Binary outcomes

More details about binary test assessment is available on the Scientific Methods for Health Sciences (SMHS) EBook site. The table below summarizes the key measures commonly used to evaluate the performance of binary tests, classifiers, or predictions.

		Actual Condition		Test Interpretation
		Absent ($H_0$ is true)	Present ($H_1$ is true)	Test Interpretation
Test Result	Negative (fail to reject $H_0$)	TN Condition absent + Negative result = True (accurate) Negative	FN Condition present + Negative result = False (invalid) Negative Type II error* (proportional to $\beta$)*	$NPV$ $=\frac{TN}{TN+FN}$
Test Result	Positive (reject $H_0$)	FP Condition absent + Positive result = False Positive Type I error ($\alpha$)	TP Condition Present + Positive result = True Positive	$PPV=Precision$ $=\frac{TP}{TP+FP}$
Test Interpretation	$Power =1-\beta$ $= 1-\frac{FN}{FN+TP}$	$Specificity=\frac{TN}{TN+FP}$	$Power=Recall=Sensitivity$ $=\frac{TP}{TP+FN}$	$LOR=\ln\left (\frac{TN\times TP}{FP\times FN}\right )$

1.2.2 Cross-tables, contingency tables, and confusion-matrices

In ML and AI, the concepts of cross table, contingency table, and confusion matrix are often used to quantify the performance of a classifier. A contingency table represents a statistical cross tabulation that summarizes the multivariate frequency distribution of two (or more) categorical variables. In the special case of cross tabulating the performance of an AI classifier relative to ground truth, this comparison is referred to as a confusion matrix, for supervised learning, or a matching matrix, for unsupervised learning. Typically, the rows and columns in these cross tables represent observed instances in predicted and actual class labels, respectively.

We already saw some confusion matrices in Chapter 8. For binary classes, these will be just $2\times 2$ matrices where each of the cells represents the agreement, match, or discrepancy between the real and predicted class labels, indexed by the row and column indices.

Graph $2\times 2$ table:

cross table
	predict_T	predict_F
TRUE	TP	TN
FALSE	FP	FN

True Positive(TP): Number of observations that are correctly classified as “yes” or “success”.
True Negative(TN): Number of observations that are correctly classified as “no” or “failure”.
False Positive(FP): Number of observations that are incorrectly classified as “yes” or “success”.
False Negative(FN): Number of observations that are incorrectly classified as “no” or “failure”.

Using confusion matrices to measure performance: The way we calculate accuracy using these four cells is summarized by the following formula

\[accuracy=\frac{TP+TN}{TP+TN+FP+FN}=\frac{TP+TN}{\text{Total number of observations}}\ .\]

On the other hand, the error rate, or proportion of incorrectly classified observations is calculated using:

\[error rate=\frac{FP+FN}{TP+TN+FP+FN}==\frac{FP+FN}{\text{Total number of observations}}=1-accuracy\ .\]

If we look at the numerator and denominator carefully, we can see that the error rate and accuracy add up to 1. Therefore, a 95% accuracy means 5% error rate.

In R, we have multiple ways to obtain a confusion table. The simplest way would be table(). For example, in Chapter 5, to get a plain $2\times 2$ table reporting the agreement between the real clinical cancer labels and their machine learning predicted counterparts, we used:

hn_test_pred <- predict(hn_classifier, hn_test)
table(hn_test_pred, hn_med_test$stage)

##              
## hn_test_pred  early_stage later_stage
##   early_stage          73          23
##   later_stage           4           0

The reason we sometimes use the gmodels::CrossTable() function, e.g., see Chapter 5, is because it reports additional information about the model performance.

library(gmodels)
CrossTable(hn_test_pred, hn_med_test$stage)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##              | hn_med_test$stage 
## hn_test_pred | early_stage | later_stage |   Row Total | 
## -------------|-------------|-------------|-------------|
##  early_stage |          73 |          23 |          96 | 
##              |       0.011 |       0.038 |             | 
##              |       0.760 |       0.240 |       0.960 | 
##              |       0.948 |       1.000 |             | 
##              |       0.730 |       0.230 |             | 
## -------------|-------------|-------------|-------------|
##  later_stage |           4 |           0 |           4 | 
##              |       0.275 |       0.920 |             | 
##              |       1.000 |       0.000 |       0.040 | 
##              |       0.052 |       0.000 |             | 
##              |       0.040 |       0.000 |             | 
## -------------|-------------|-------------|-------------|
## Column Total |          77 |          23 |         100 | 
##              |       0.770 |       0.230 |             | 
## -------------|-------------|-------------|-------------|
## 
##

The second entry in each cell of the crossTable table reports the Chi-square contribution. This uses the standard Chi-Square formula for computing relative discrepancy between observed and expected counts. For instance, the Chi-square contribution of cell(1,1), (hn_med_test$stage=early_stage and hn_test_pred=early_stage), can be computed as follows from the $\frac{(Observed-Expected)^2}{Expected}$ formula. Assuming independence between the rows and columns (i.e., random classification), the expected cell(1,1) value is computed as the product of the corresponding row (96) and column (77) marginal counts, $\frac{96\times 77}{100}$. Thus the Chi-square value for cell(1,1) is:

\[\text{Chi-square cell(1,1)} = \frac{(Observed-Expected)^2}{Expected}=\] \[=\frac{\left (73-\frac{96\times 77}{100}\right ) ^2}{\frac{96\times 77}{100}}=0.01145022\ .\]

Note, that each cell Chi-square value represents one of the four (in this case) components of the Chi-square test-statistics, which tries to answer the question if there is no association between observed and predicted class labels. That is, under the null-hypothesis there is no association between actual and observed counts for each level of the factor variable, which allows us to quantify whether the derived classification jibes with the real class annotations (labels). The aggregate sum of all Chi-square values represents the $\chi_o^2 = \sum_{all-categories}{(O-E)^2 \over E} \sim \chi_{(df)}^2$ statistics, where $df = (\# rows - 1)\times (\# columns - 1)$.

Using either table (CrossTable, confusionMatrix), we can calculate accuracy and error rate by hand.

accuracy <- (73+0)/100
accuracy

## [1] 0.73

error_rate <- (23+4)/100
error_rate

## [1] 0.27

1-accuracy

## [1] 0.27

For matrices that are larger than $2\times 2$, all diagonal elements count the observations that are correctly classified and the off-diagonal elements represent incorrectly labeled cases.

1.2.3 Other measures of performance beyond accuracy

So far we discussed two performance methods - table() and CrossTable(). A third function is caret::confusionMatrix() which provides the easiest way to report model performance. Notice that the first argument is an actual vector of the labels, i.e., $Test\_Y$ and the second argument, of the same length, represents the vector of predicted labels.

This example was presented as the first case-study in Chapter 5.

library(caret)
qol_pred <- predict(qol_model, qol_test)
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")

## Confusion Matrix and Statistics
## 
##                 
## qol_pred         minor_disease severe_disease
##   minor_disease            143             71
##   severe_disease            72            157
##                                           
##                Accuracy : 0.6772          
##                  95% CI : (0.6315, 0.7206)
##     No Information Rate : 0.5147          
##     P-Value [Acc > NIR] : 3.001e-12       
##                                           
##                   Kappa : 0.3538          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.6886          
##             Specificity : 0.6651          
##          Pos Pred Value : 0.6856          
##          Neg Pred Value : 0.6682          
##              Prevalence : 0.5147          
##          Detection Rate : 0.3544          
##    Detection Prevalence : 0.5169          
##       Balanced Accuracy : 0.6769          
##                                           
##        'Positive' Class : severe_disease  
##

1.2.3.1 Silhouette coefficient

In Chapter 8 we already saw the Silhouette coefficient, which captures the shape of the clustering boundaries. It is a function of the intracluster distance of a sample in the dataset ($i$). Recall that

$d_i$ is the average dissimilarity of point $i$ with all other data points within its cluster. Then, $d_i$ captures the quality of the assignment of $i$ to its current class label. Smaller or larger $d_i$ values suggest better or worse overall assignment for $i$ to its cluster, respectively. The average dissimilarity of $i$ to a cluster $C$ is the average distance between $i$ and all points in the cluster of points labeled $C$.
$l_i$ is the lowest average dissimilarity of point $i$ to any other cluster that $i$ is not a member of. The cluster corresponding to $l_i$, the lowest average dissimilarity, is called the $i$ neighboring cluster, as it is the next best fit cluster for $i$.

Then, the Silhouette coefficient for a sample point $i$ is:

\[-1\leq Silhouette(i)=\frac{l_i-d_i}{\max(l_i,d_i)} \leq 1.\] For interpreting the Silhouette of point $i$, we use:

\[Silhouette(i) \approx \begin{cases} -1 & \text{sample } i \text{ is closer to a neighboring cluster} \\ 0 & \text {the sample } i \text{ is near the border of its cluster, i.e., } i \text{ represents the closest point in its cluster to the rest of the dataset clusters} \\ 1 & \text {the sample } i \text{ is near the center of its cluster} \end{cases}.\]

The mean Silhouette value represents the arithmetic average of all Silhouette coefficients (either within a cluster, or overall) and represents the quality of the cluster (clustering). High mean Silhouette corresponds to compact clustering (dense and separated clusters), whereas low values represent more diffused clusters. The Silhouette value is useful when the number of predicted clusters is smaller than the number of samples.

1.2.3.2 The kappa ( $\kappa$ ) statistic

The Kappa statistic was originally developed to measure the reliability between two human raters. It can be harnessed in machine learning applications to compare the accuracy of a classifier, where one rater represents the ground truth (for labeled data, these are the actual values of each instance) and the second rater represents the results of the automated machine learning classifier. The order of listing the raters is irrelevant.

Kappa statistic measures the possibility of a correct prediction by chance alone and answers the question of How much better is the agreement (between the ground truth and the machine learning prediction) than would be expected by chance alone? Its value is between $0$ and $1$. When $\kappa=1$, we have a perfect agreement between a computed prediction (typically the result of a model-based or model-free technique forecasting an outcome of interest) and an expected prediction (typically random, by-chance, prediction). A common interpretation of the Kappa statistics includes:

Poor agreement: less than 0.20
Fair agreement: 0.20-0.40
Moderate agreement: 0.40-0.60
Good agreement: 0.60-0.80
Very good agreement: 0.80-1

In the above confusionMatrix output, we have a fair agreement. For different problems, we may have different interpretations of Kappa statistics.

To understand Kappa statistic better, let’s look at its definition.

Predicted \ Observed	Minor	Severe	Row Sum
Minor	$A=143$	$B=71$	$A+B=214$
Severe	$C=72$	$D=157$	$C+D=229$
Column Sum	$A+C=215$	$B+D=228$	$A+B+C+D=443$

In this table, $A=143, B=71, C=72, D=157$ denote the frequencies (counts) of cases within each of the cells in the $2\times 2$ design. Then

\[ObservedAgreement = (A+D)=300.\]

\[ExpectedAgreement = \frac{(A+B)\times (A+C)+(C+D)\times (B+D)}{A+B+C+D}=221.72.\]

\[(Kappa)\ \kappa = \frac{(ObservedAgreement) – (ExpectedAgreement)} {(A+B+C+D) – (ExpectedAgreement)}=0.35.\]

In this manual calculation of kappa statistics ($\kappa$) we used the corresponding values we saw earlier in the Quality of Life (QoL) case-study, where chronic-disease binary outcome qol$cd<-qol$CHRONICDISEASESCORE>1.497, and we used the cd prediction (qol_pred).

table(qol_pred, qol_test$cd)

##                 
## qol_pred         minor_disease severe_disease
##   minor_disease            143             71
##   severe_disease            72            157

According to the above table, high agreement between actual and predicted clinical diagnosis (CD) corresponds to the high algorithmic performance (accuracy).

A=143; B=71; C=72; D=157
# A+B+ C+D # 443
# ((A+B)*(A+C)+(C+D)*(B+D))/(A+B+C+D)   # 221.7201
EA=((A+B)*(A+C)+(C+D)*(B+D))/(A+B+C+D) # Expected accuracy
OA=A+D; OA    # Observed accuracy

## [1] 300

k=(OA-EA)/(A+B+C+D - EA); k # 0.3537597

## [1] 0.3537597

# Compare against the official kappa score
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$overall[2] # report official Kappa

##     Kappa 
## 0.3537597

The manually and automatically computed accuracies coincide ($\sim 0.35$).

Let’s now look at computing Kappa. It may be trickier to obtain the expected agreement. Probability rules tell us that the probability of the union of two disjoint events equals the sum of the individual (marginal) probabilities for these two events. Thus, we have:

round(confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$overall[2], 2) # report official Kappa

## Kappa 
##  0.35

We get a similar value in the confusionTable() output. A more straightforward way of getting the Kappa statistics is by using the Kappa() function in the vcd package.

# install.packages(vcd)
library(vcd)
Kappa(table(qol_pred, qol_test$cd))

##             value     ASE     z Pr(>|z|)
## Unweighted 0.3538 0.04446 7.957 1.76e-15
## Weighted   0.3538 0.04446 7.957 1.76e-15

The combination of Kappa() and table function yields a $2\times 4$ matrix. The Kappa statistic is under the unweighted value.

Generally speaking, predicting a severe disease outcome is a more critical problem than predicting a mild disease state. Thus, weighted Kappa is also useful. We give the severe disease a higher weight. The Kappa test result is not acceptable since the classifier may make too many mistakes for the severe disease cases. The Kappa value is $0.26374$. Notice that the range of weighted Kappa may exceed [0,1].

Kappa(table(qol_pred, qol_test$cd),weights = matrix(c(1,10,1,10),nrow=2))

##                value     ASE        z  Pr(>|z|)
## Unweighted  0.353760 0.04446  7.95721 1.760e-15
## Weighted   -0.004386 0.04667 -0.09397 9.251e-01

When the predicted value is the first argument, the row and column names represent the true labels and the predicted labels, respectively.

table(qol_pred, qol_test$cd)

##                 
## qol_pred         minor_disease severe_disease
##   minor_disease            143             71
##   severe_disease            72            157

1.2.3.3 Summary of the Kappa score for calculating prediction accuracy

Kappa compares an Observed classification accuracy (output of our ML classifier) with an Expected classification accuracy (corresponding to random chance classification). It may be used to evaluate single classifiers and/or to compare among a set of different classifiers. It takes into account random chance (agreement with a random classifier). That makes Kappa more meaningful than simply using the accuracy as a single quality metric. For instance, the interpretation of an Observed Accuracy of 80% is relative to the Expected Accuracy. Observed Accuracy of 80% is more impactful for an Expected Accuracy of 50% compared to Expected Accuracy of 75%.

1.2.3.4 Sensitivity and specificity

Take a closer look at the confusionMatrix() output where we can find two important statistics - “sensitivity” and “specificity”.

Sensitivity or true positive rate measures the proportion of “success” observations that are correctly classified. \[sensitivity=\frac{TP}{TP+FN}.\] Notice $TP+FN$ are the total number of true “success” observations.

On the other hand, specificity or true negative rate measures the proportion of “failure” observations that are correctly classified. \[specificity=\frac{TN}{TN+FP}.\] Accordingly, $TN+FP$ are the total number of true “failure” observations.

In the QoL data, considering “severe_disease” as “success” and using the table() output we can manually compute the sensitivity and specificity, as well as precision and recall (below):

sens <- 131/(131+89)
sens

## [1] 0.5954545

spec <- 149/(149+74)
spec

## [1] 0.6681614

Another R package caret also provides functions to directly calculate the sensitivity and specificity.

library(caret)
sensitivity(qol_pred, qol_test$cd, positive="severe_disease")

## [1] 0.6885965

# specificity(qol_pred, qol_test$cd)
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$byClass[1] # another way to report the sensitivity

## Sensitivity 
##   0.6885965

# confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$byClass[2] # another way to report the specificity

Sensitivity and specificity both range from 0 to 1. For either measure, a value of 1 implies that the positive and negative predictions are very accurate. However, simultaneously high sensitivity and specificity may not be attainable in real world situations. There is a tradeoff between sensitivity and specificity. To compromise, some studies loosen the demands on one and focus on achieving high values on the other.

1.2.3.5 Precision and recall

Very similar to sensitivity, precision measures the proportion of true “success” observations among predicted “success” observations.

\[precision=\frac{TP}{TP+FP}.\]

Recall is the proportion of true “failures” among all “failures”. A model with high recall captures most “interesting” cases.

\[recall=\frac{TP}{TP+FN}.\]

Again, let’s calculate these by hand for the QoL data, and report the Area under the ROC Curve (AUC).

prec   <- 157/(157+72); prec

## [1] 0.6855895

recall <- 157/(157+71); recall

## [1] 0.6885965

library (ROCR)
library(plotly)
qol_pred<-predict(qol_model, qol_test)
qol_pred <- predict(qol_model, qol_test, type = 'prob')

pred <- prediction( qol_pred[,2], qol_test$cd)
PrecRec <- performance(pred, "prec", "rec")
PrecRecAUC <- performance(pred, "auc")
paste0("AUC=", round(as.numeric(PrecRecAUC@y.values), 2))

## [1] "AUC=0.69"

# plot(PrecRec)
plot_ly(x = ~PrecRec@x.values[[1]][2:length(PrecRec@x.values[[1]])], 
        y = ~PrecRec@y.values[[1]][2:length(PrecRec@y.values[[1]])], 
        name = 'Recall-Precision relation', type='scatter', mode='markers+lines') %>%
  layout(title=paste0("Precision-Recall Plot, AUC=",
                      round(as.numeric(PrecRecAUC@y.values[[1]]), 2)),
         xaxis=list(title="Recall"), yaxis=list(title="Precision"))

# PrecRecAUC <- performance(pred, "auc")
# paste0("AUC=", round(as.numeric(PrecRecAUC@y.values), 2))

Another way to obtain precision would be posPredValue() in the caret package. Remember to specify which one is the “success” class.

qol_pred <- predict(qol_model, qol_test)
posPredValue(qol_pred, qol_test$cd, positive="severe_disease")

## [1] 0.6855895

From the definitions of precision and recall, we can derive the type 1 error and type 2 errors as follow:

\[error_1 = 1- Precision = \frac{FP}{TP+FP}.\]

\[error_2 = 1- Recall = \frac{FN}{TN+FN}.\]

Thus, we can compute the type 1 error ($0.31$) and type 2 error ($0.31$).

error1<-1-prec; error1

## [1] 0.3144105

error2<-1-recall; error2

## [1] 0.3114035

1.2.3.6 The F-measure

The F-measure, or F1-score, combines precision and recall using the harmonic mean assuming equal weights. High F1-score means high precision and high recall. This is a convenient way of measuring model performances and comparing models.

\[F1=\frac{2\times precision\times recall}{recall+precision}=\frac{2\times TP}{2\times TP+FP+FN}\]

Let’s calculate the F1-score by hand using the Quality of Life prediction results.

f1 <- (2*prec*recall)/(prec+recall)
f1

## [1] 0.6870897

We can contrast these results against direct calculations of the F1-statistics obtained using caret.

precision <- posPredValue(qol_pred, qol_test$cd, positive="severe_disease")
recall <- sensitivity(qol_pred, qol_test$cd, positive="severe_disease")
F1 <- (2 * precision * recall) / (precision + recall); F1

## [1] 0.6870897

1.2.4 Visualizing performance tradeoffs (ROC Curve)

Another choice for evaluating classifiers’ performance is by graphs rather than scalars, numerical vectors, or statistics. Graphs are usually more comprehensive than single statistics.

The R package ROCR provides user-friendly functions for visualizing model performance. Details can be found on the ROCR website.

Here, we evaluate the model performance for the Quality of Life case study in Chapter 5.

# install.packages("ROCR")
library(ROCR)
pred <- ROCR::prediction(predictions=pred_prob[, 2], labels=qol_test$cd) 
# avoid naming collision (ROCR::prediction), as
# there is another prediction function in the neuralnet package.

The prediction() method argument pred_prob[, 2] stores the probability of classifying each observation as “severe_disease”, and we saved all model prediction information into the object pred.

Receiver Operating Characteristic (ROC) curves are often used for examining the trade-off between detecting true positives and avoiding the false positives.

# curve(log(x), from=0, to=100, xlab="False Positive Rate", ylab="True Positive Rate", main="ROC curve", col="green", lwd=3, axes=F)
# Axis(side=1, at=c(0, 20, 40, 60, 80, 100), labels = c("0%", "20%", "40%", "60%", "80%", "100%"))
# Axis(side=2, at=0:5, labels = c("0%", "20%", "40%", "60%", "80%", "100%"))
# segments(0, 0, 110, 5, lty=2, lwd=3)
# segments(0, 0, 0, 4.7, lty=2, lwd=3, col="blue")
# segments(0, 4.7, 107, 4.7, lty=2, lwd=3, col="blue")
# text(20, 4, col="blue", labels = "Perfect Classifier")
# text(40, 3, col="green", labels = "Test Classifier")
# text(70, 2, col="black", labels= "Classifier with no predictive value")

x <- seq(from=0, to=1.0, by=0.01) + 0.001
plot_ly(x = ~x, y = (log(100*x)+2.3)/(log(100*x[101])+2.3), line=list(color="lightgreen"),
        name = 'Test Classifier', type='scatter', mode='lines', showlegend=T) %>%
  add_lines(x=c(0,1), y=c(0,1), line=list(color="black", dash='dash'),
               name="Classifier with no predictive value") %>%
  add_segments(x=0, xend=0, y=0, yend = 1, line=list(color="blue"), 
               name="Perfect Classifier") %>%
  add_segments(x=0, xend=1, y=1, yend = 1, line=list(color="blue"), 
               name="Perfect Classifier 2", showlegend=F) %>%
  layout(title="ROC curve", legend = list(orientation = 'h'),
         xaxis=list(title="False Positive Rate", scaleanchor="y", range=c(0,1)), 
         yaxis=list(title="True Positive Rate", scaleanchor="x"))

The blue line in the above graph represents the perfect classifier where we have 0% false positive and 100% true positive. The middle green line represents a test classifier. Most of our classifiers trained by real data will look like this. The black diagonal line illustrates a classifier with no predictive value. We can see that it has the same true positive rate and false positive rate. Thus, it cannot distinguish between the two states.

Classifiers with high true positive values have ROC curves near the (blue) perfect classifier curve. Thus, we measure the area under the ROC curve (abbreviated as AUC) as a proxy of the classifier performance. To do this, we have to change the scale of the graph above. Mapping 100% to 1, we have a $1\times 1$ square. The area under perfect classifier would be 1 and area under classifier with no predictive value being 0.5. Then, 1 and 0.5 will be the upper and lower limits for our model ROC curve. For model ROC curves, the typical interpretation of the area under curve (AUC) includes:

Outstanding: 0.9-1.0
Excellent/good: 0.8-0.9
Acceptable/fair: 0.7-0.8
Poor: 0.6-0.7
No discrimination: 0.5-0.6

Note that this rating system is somewhat subjective. We can use the ROCR package to draw ROC curves.

roc <- performance(pred, measure="tpr", x.measure="fpr")

We can specify a “performance” object by providing "tpr" (true positive rate) and "fpr" (false positive rate) parameters.

# plot(roc, main="ROC curve for Quality of Life Model", col="blue", lwd=3)
# segments(0, 0, 1, 1, lty=2)

plot_ly(x = ~roc@x.values[[1]], y = ~roc@y.values[[1]], 
        name = 'ROC Curve', type='scatter', mode='markers+lines') %>%
  add_lines(x=c(0,1), y=c(0,1), line=list(color="black", dash='dash'),
               name="Classifier with no predictive value") %>%
  layout(title="ROC Curve for Quality of Life C5.0 classification Tree Model", 
         legend = list(orientation = 'h'),
         xaxis=list(title="False Positive Rate", scaleanchor="y", range=c(0,1)), 
         yaxis=list(title="True Positive Rate", scaleanchor="x"),
         annotations = list(text=paste0("AUC=",
                      round(as.numeric(performance(pred, "auc")@y.values[[1]]), 2)),  
                            x=0.6, y=0.4, textangle=0,
                            font=list(size=15, color="blue", showarrow=FALSE)))

The segments command draws the dash line representing a classifier with no predictive value.

To measure the model performance quantitatively we need to create a new performance object with measure="auc", for area under the curve.

roc_auc <- performance(pred, measure="auc")

Now the roc_auc is stored as a S4 object. This is quite different from data frames and matrices. First, we can use the str() function to examine its structure.

str(roc_auc)

## Formal class 'performance' [package "ROCR"] with 6 slots
##   ..@ x.name      : chr "None"
##   ..@ y.name      : chr "Area under the ROC curve"
##   ..@ alpha.name  : chr "none"
##   ..@ x.values    : list()
##   ..@ y.values    :List of 1
##   .. ..$ : num 0.692
##   ..@ alpha.values: list()

It has 6 members, or “slots”, and the AUC value is stored in the member y.values. To extract object members, we use the @ symbol according to str() output.

roc_auc@y.values

## [[1]]
## [1] 0.6915953

This reports $AUC=0.65$, which suggests a fair classifier, according to the above scoring schema.

1.3 Estimating future performance (internal statistical cross-validation)

The evaluation methods we have talked about are all measuring re-substitution error. That involves building the model on training data and measuring the model performance (error/accuracy) on testing data. This evaluation process provides one mechanism of dealing with unseen data. Let’s look at some alternative strategies.

1.3.1 The holdout method

The holdout method idea is to partition a dataset into two separate sets. Using the first set to create (train) the model and the other to test (validate) the model performance. In practice, we usually use a fraction (e.g., $60\%$, or $\frac{3}{5}$) of our data for training the model, and reserve the rest (e.g., $40\%$, or $\frac{2}{5}$) for testing. Note that the testing data may also be further split into proportions for internal repeated (e.g., cross-validation) testing and final external (independent) testing.

The partition has to be randomized. In R, the best way of doing this is to create a parameter that randomly draws numbers and use this parameter to extract random rows from the original dataset. In Chapter 6, we used this method to partition the Google Trends data.

Another way of partitioning is using the caret::createDatePartition() method. We can subset the original dataset or any independent variable column of the original dataset, e.g., google_norm$RealEstate:

sub <- caret::createDataPartition(google_norm$RealEstate, p=0.75, list = F)
google_train <- google_norm[sub, ]
google_test <- google_norm[-sub, ]

To make sure that the model can be applied to future datasets, we can partition the original dataset into three separate subsets. In this way, we have two subsets of data for testing and validation. The additional validation dataset can alleviate the probability that we have a good model due to chance (non-representative subsets). A common split among training, testing, and validation subsets may be 50%, 25%, and 25% respectively.

sub <- sample(nrow(google_norm), floor(nrow(google_norm)*0.50))
google_train <- google_norm[sub, ]
google_test <- google_norm[-sub, ]
sub1 <- sample(nrow(google_test), floor(nrow(google_test)*0.5))
google_test1 <- google_test[sub1, ]
google_test2 <- google_test[-sub1, ]
nrow(google_norm)

## [1] 731

nrow(google_train) # training

## [1] 365

nrow(google_test1) # testing: internal cross validation

## [1] 183

nrow(google_test2) # testing: out of bag validation

## [1] 183

However, when we only have a very small dataset, it’s difficult to split off too much data as this may excessively reduce the training sample size. There are the following two options for evaluation of model performance with unseen data. Both of these are implemented in the caret package.

1.3.2 Cross-validation

The complete details about cross validation will be presented below. Now, we describe the fundamentals of cross-validation as an internal statistical validation technique.

This technique is known as k-fold cross-validation, or k-fold CV, which is a standard for estimating model performance. K-fold CV randomly partitions the original data into k separate random subsets called folds.

A common practice is to use k=10 or 10-fold CV. That is to split the data into 10 different subsets. Each time, one of the subsets is reserved for testing, and the rest are employed for learning/building the model. This can be accomplished using the caret::createFolds() method. Using set.seed() ensures the reproducibility of the created folds, in case you run the code multiple times. Here, we use 1234, a random number, to seed the fold separation. You can use any number for set.seed(). We demonstrate the process using the normalized Google Trend dataset.

library("caret")
set.seed(1234)
folds <- createFolds(google_norm$RealEstate, k=10)
str(folds)

## List of 10
##  $ Fold01: int [1:73] 9 12 28 31 42 54 62 81 83 86 ...
##  $ Fold02: int [1:73] 20 23 47 61 63 72 82 88 96 100 ...
##  $ Fold03: int [1:73] 15 22 26 43 49 53 73 80 101 109 ...
##  $ Fold04: int [1:74] 5 17 38 52 55 74 91 92 141 142 ...
##  $ Fold05: int [1:73] 7 14 29 35 37 39 41 48 51 57 ...
##  $ Fold06: int [1:73] 4 18 33 36 44 59 70 77 78 112 ...
##  $ Fold07: int [1:73] 6 40 45 65 66 69 71 94 98 102 ...
##  $ Fold08: int [1:73] 19 24 27 50 60 75 76 97 99 106 ...
##  $ Fold09: int [1:74] 3 11 13 16 21 30 32 34 46 56 ...
##  $ Fold10: int [1:72] 1 2 8 10 25 67 68 105 111 126 ...

Another way to cross-validate is to use methods such as sparsediscrim::cv_partition() or caret::createFolds().

# install.packages("sparsediscrim")
library(sparsediscrim)
folds2 = cv_partition(1:nrow(google_norm), num_folds=10)

And the structure of folds may be reported by:

str(folds2)

## List of 10
##  $ Fold1 :List of 2
##   ..$ training: int [1:657] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ test    : int [1:74] 493 338 652 368 698 55 261 407 379 648 ...
##  $ Fold2 :List of 2
##   ..$ training: int [1:658] 2 3 4 5 6 7 8 9 10 11 ...
##   ..$ test    : int [1:73] 377 531 465 273 466 526 454 424 417 191 ...
##  $ Fold3 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ test    : int [1:73] 280 289 467 459 216 166 224 278 453 438 ...
##  $ Fold4 :List of 2
##   ..$ training: int [1:658] 1 3 5 7 8 10 11 12 13 14 ...
##   ..$ test    : int [1:73] 272 461 103 245 53 6 684 565 400 389 ...
##  $ Fold5 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ test    : int [1:73] 124 137 202 253 729 572 451 452 187 295 ...
##  $ Fold6 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 6 7 9 10 11 ...
##   ..$ test    : int [1:73] 283 118 687 228 491 549 510 40 607 164 ...
##  $ Fold7 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 12 ...
##   ..$ test    : int [1:73] 125 95 617 369 94 204 109 539 32 412 ...
##  $ Fold8 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ test    : int [1:73] 49 480 328 406 520 182 160 27 705 386 ...
##  $ Fold9 :List of 2
##   ..$ training: int [1:658] 1 2 3 4 6 8 9 10 11 12 ...
##   ..$ test    : int [1:73] 333 515 176 627 546 674 312 340 388 288 ...
##  $ Fold10:List of 2
##   ..$ training: int [1:658] 1 2 4 5 6 7 8 9 10 11 ...
##   ..$ test    : int [1:73] 595 57 86 83 387 666 495 715 45 96 ...

Now, we have 10 different subsets in the folds object. We can use lapply() to fit the model. 90% of data will be used for training so we use [-x, ] to represent all observations not in a specific fold. In Chapter 6, we built a neural network model for the Google Trends data. We can do the same for each fold manually. Then we can train, test, and aggregate the results. Finally, we can report the agreement (e.g., correlations between the predicted and observed RealEstate values).

library(neuralnet)
fold_cv <- lapply(folds, function(x){
  google_train <- google_norm[-x, ]
  google_test <- google_norm[x, ]
  google_model <- neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI,
                            data=google_train)
  google_pred <- compute(google_model, google_test[, c(1:2, 4:8)])
  pred_results <- google_pred$net.result
  pred_cor <- cor(google_test$RealEstate, pred_results)
  return(pred_cor)
})
str(fold_cv)

## List of 10
##  $ Fold01: num [1, 1] 0.972
##  $ Fold02: num [1, 1] 0.979
##  $ Fold03: num [1, 1] 0.976
##  $ Fold04: num [1, 1] 0.975
##  $ Fold05: num [1, 1] 0.977
##  $ Fold06: num [1, 1] 0.981
##  $ Fold07: num [1, 1] 0.974
##  $ Fold08: num [1, 1] 0.973
##  $ Fold09: num [1, 1] 0.977
##  $ Fold10: num [1, 1] 0.972

From the output, we know that in most of the folds the model predicts very well. In a typical run, one fold may yield bad results. We can use the mean of these 10 correlations to represent the overall model performance. But first, we need to use the unlist() function to transform fold_cv into a vector.

mean(unlist(fold_cv))

## [1] 0.9756676

This correlation is high suggesting strong association between predicted and true values. Thus, the model is very good in terms of its prediction.

1.3.3 Bootstrap sampling

The second method is called bootstrap sampling. In k-fold CV, each observation can only be used once for testing. Bootstrap sampling relies on a sampling with replacement process. Before selecting a new sample, it recycles every observation so that each observation could appear in multiple folds.

At each iteration, bootstrap sampling uses $63.2\%$ of the original data as our training dataset and the remaining $36.8\%$ as the test dataset. Thus, compared to k-fold CV, bootstrap sampling is less representative of the full dataset. A special case of bootstrapping is the 0.632 bootstrap technique, which addresses this issue with changing the final performance error assessment formula to

\[error=0.632\times error_{test}+0.368\times error_{train}.\]

This synthesizes the optimistic model performance on training data ($error_{train}$) with the pessimistic model performance on testing data ($error_{test}$) by weighting the corresponding errors. This method is extremely reliable for small samples (it may be computationally intensive for large samples).

To see the (asymptotics) rationale behind 0.632 bootstrap technique, consider a standard training set $T$ of cardinality $n$ where our bootstrapping sampling generates $m$ new training sets $T_i$, each of size $n'$. As sampling from $T$ is uniform with replacement, some observations may be repeated in each sample $T_i$. Suppose the size of the sub-samples are of the same order as $T$, i.e., $n'=n$, then for large $n$ the sample $T_{i}$ is expected to have $\left (1 - \frac{1}{e}\right ) \sim 0.632$ unique cases from the complete original collection $T$, the remaining proportion $0.368$ are expected to be repeated duplicates. Hence the name 0.632 bootstrap sampling. A particular training data element has a probability of $1-\frac{1}{n}$ of not being picked for training, and hence, its probability of being in the testing set is $\left (1- \frac{1}{n}\right )^n=e^{-1} \sim 0.368$.

The simulation below illustrates the 0.632 bootstrap experimentally.

# define the total data size
n <- 500
# define number of Bootstrap iterations
N=2000
#define the resampling function and compute the proportion of uniquely selected elements
uniqueProportions <- function(myDataSet, sampleSize){
    indices <- sample(1:sampleSize, sampleSize, replace=TRUE) # sample With Replacement
    length(unique(indices))/sampleSize
}
# compute the N proportions of unique elements (could also use a for loop)
proportionsVector <- c(lapply(1:N, uniqueProportions, sampleSize=n), recursive=TRUE)
# report the Expected (mean) proportion of unique elements in the bootstrapping samples of n
mean(proportionsVector)

## [1] 0.632138

In general, for large $n\gg n'$, the sample $T_{i}$ is expected to have $n\left ( 1-e^{-n'/n}\right )$ unique cases, see On Estimating the Size and Confidence of a Statistical Audit).

Having the bootstrap samples, the $m$ models can be fitted (estimated) and aggregated, e.g., by averaging the outputs (for regression) or using voting methods (for classification). We will discuss this more in later Chapters. Let’s look at the implementation of the 0.632 Bootstrap technique for the QoL case-study.

# Recall: qol_model <- C5.0(qol_train[,-c(40, 41)], qol_train$cd)
# predict labels of testing data
qol_pred <- predict(qol_model, qol_test)
# compute matches and mismatches of Predicted and Observed class labels
predObsEqual <- qol_pred == qol_test$cd
predObsTF <- c(table(predObsEqual)[1], table(predObsEqual)[2]); predObsTF

## FALSE  TRUE 
##   143   300

# training error rate
train.err <- as.numeric(predObsTF[1]/(predObsTF[1]+predObsTF[2]))

# testing error rate, leave-one-out Bootstrap (LOOB) Cross-Validation
B <- 10 
loob.err <- NULL
N <- dim(qol_test)[1]   # size of test-dataset
for (b in 1:B) {
   bootIndices <- sample(1:N, N*0.9, replace=T)
   train <- qol_test[bootIndices, ]
   qol_modelBS <- C5.0(train[,-c(40, 41)], train$cd)
   inner.err <- NULL

   # for current iteration extract the appropriate testing cases for testing
   i <- (1:length(bootIndices))
   i <- i[is.na(match(i, bootIndices))]
   test <- qol_test[i, ]
  
   # predict using model at current iteration
   qol_modelBS_pred <- predict(qol_modelBS, test)
  
   predObsEqual <- qol_modelBS_pred == test$cd
   predObsTF <- c(table(predObsEqual)[1], table(predObsEqual)[2]); predObsTF
   # training error rate
   inner.err <- as.numeric(predObsTF[1]/(predObsTF[1]+predObsTF[2]))
   loob.err <- c(loob.err, mean(inner.err))
}

test.err <- ifelse(is.null(loob.err), NA, mean(loob.err))

# 0.632 Bootstrap error
boot.632 <- 0.368 * train.err + 0.632 * test.err; boot.632

## [1] 0.3876717

1.4 Improving model performance by parameter tuning

We already explored several alternative machine learning (ML) methods for prediction, classification, clustering and outcome forecasting. In many situations, we derive models by estimating model coefficients or parameters. The main question now is How can we adopt crowd-sourcing advantages of social networks to aggregate different predictive analytics strategies?

Are there reasons to believe that such ensembles of forecasting methods actually improve the performance or boost the prediction accuracy of the resulting consensus meta-algorithm? In this chapter, we are going to introduce ways that we can search for optimal parameters for a single ML method as well as aggregate different methods into ensembles to augment their collective performance, relative to any of the individual methods part of the meta-algorithm.

Recall that earlier in Chapter 6, we presented strategies for improving model performance based on meta-learning, bagging, and boosting.

One of the methods for improving model performance relies on tuning. For a given ML technique, tuning is the process of searching through the parameter space for the optimal parameter(s). The following table summarizes some of the parameters used in ML techniques we covered in previous chapters.

Model	Learning Task	Method	Parameters
KNN	Classification	`class::knn`	`data, k`
K-Means	Classification	`stats::kmeans`	`data, k`
Naive Bayes	Classification	`e1071::naiveBayes`	`train, class, laplace`
Decision Trees	Classification	`C50::C5.0`	`train, class, trials, costs`
OneR Rule Learner	Classification	`RWeka::OneR`	`class~predictors, data`
RIPPER Rule Learner	Classification	`RWeka::JRip`	`formula, data, subset, na.action, control, options`
Linear Regression	Regression	`stats::lm`	`formula, data, subset, weights, na.action, method`
Regression Trees	Regression	`rpart::rpart`	`dep_var ~ indep_var, data`
Model Trees	Regression	`RWeka::M5P`	`formula, data, subset, na.action, control`
Neural Networks	Dual use	`nnet::nnet`	`x, y, weights, size, Wts, mask,linout, entropy, softmax, censored, skip, rang, decay, maxit, Hess, trace, MaxNWts, abstol, reltol`
Support Vector Machines (Polynomial Kernel)	Dual use	`caret::train::svmLinear`	`C`
Support Vector Machines (Radial Basis Kernel)	Dual use	`caret::train::svmRadial`	`C, sigma`
Support Vector Machines (general)	Dual use	`kernlab::ksvm`	`formula, data, kernel`
Random Forests	Dual use	`randomForest::randomForest`	`formula, data`

\[\textbf{Table 1: Core parameters of several machine learning techniques.}\]

1.4.1 Using `caret` for automated parameter tuning

In Chapter 5 we used kNN and plugged in random k parameters for the number of clusters. This time we will test simultaneously multiple k values and select the parameter(s) yielding the highest prediction accuracy. Using caret allows us to specify an outcome class variable, covariate predictor features, and a specific ML method. In Chapter 5, we showed the Boys Town Study of Youth Development dataset, where we normalized all the features, stored them in a boystown_n computable object, and defined an outcome class variable (boystown$grade).

library(randomForest)
boystown <- read.csv("https://umich.instructure.com/files/399119/download?download_frd=1", sep=" ")
boystown$sex <- boystown$sex-1
boystown$dadjob <- -1*(boystown$dadjob-2)
boystown$momjob <- -1*(boystown$momjob-2)
boystown <- boystown[, -1]
table(boystown$gpa)

## 
##  0  1  2  3  4  5 
## 30 50 54 40 14 12

boystown$grade <- boystown$gpa %in% c(3, 4, 5)
boystown$grade <- factor(boystown$grade, levels=c(F, T), labels = c("above_avg", "avg_or_below"))
normalize <- function(x){
  return((x-min(x))/(max(x)-min(x)))
}
boystown_n <- as.data.frame(lapply(boystown[, -11], normalize))

str(boystown_n)

## 'data.frame':    200 obs. of  10 variables:
##  $ sex       : num  0 0 0 0 1 1 0 0 1 1 ...
##  $ gpa       : num  1 0 0.6 0.4 0.6 0.6 0.2 1 0.2 0.6 ...
##  $ Alcoholuse: num  0.182 0.364 0.182 0.182 0.545 ...
##  $ alcatt    : num  0.5 0.333 0.5 0.167 0.333 ...
##  $ dadjob    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ momjob    : num  0 0 0 0 1 0 0 0 1 1 ...
##  $ dadclose  : num  0.143 0.429 0.286 0.143 0.286 ...
##  $ momclose  : num  0.143 0.571 0.286 0.286 0.143 ...
##  $ larceny   : num  0.25 0 0 0.75 0.25 0 0 0 0.25 0.25 ...
##  $ vandalism : num  0.429 0 0.286 0.286 0.286 ...

boystown_n <- cbind(boystown_n, boystown[, 11])
str(boystown_n)

## 'data.frame':    200 obs. of  11 variables:
##  $ sex           : num  0 0 0 0 1 1 0 0 1 1 ...
##  $ gpa           : num  1 0 0.6 0.4 0.6 0.6 0.2 1 0.2 0.6 ...
##  $ Alcoholuse    : num  0.182 0.364 0.182 0.182 0.545 ...
##  $ alcatt        : num  0.5 0.333 0.5 0.167 0.333 ...
##  $ dadjob        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ momjob        : num  0 0 0 0 1 0 0 0 1 1 ...
##  $ dadclose      : num  0.143 0.429 0.286 0.143 0.286 ...
##  $ momclose      : num  0.143 0.571 0.286 0.286 0.143 ...
##  $ larceny       : num  0.25 0 0 0.75 0.25 0 0 0 0.25 0.25 ...
##  $ vandalism     : num  0.429 0 0.286 0.286 0.286 ...
##  $ boystown[, 11]: Factor w/ 2 levels "above_avg","avg_or_below": 2 1 2 1 2 2 1 2 1 2 ...

colnames(boystown_n)[11] <- "grade"

Now that the dataset includes an explicit class variable and predictor features, we can use the KNN method to predict the outcome grade. Let’s plug this information into the caret::train() function. Note that caret can use the complete dataset as it will automatically do the random sampling for the internal statistical cross-validation. To make results reproducible, we may utilize the set.seed() function that we presented earlier.

library(caret)
set.seed(123)
kNN_mod <- train(grade~., data=boystown_n, method="knn")
kNN_mod; summary(kNN_mod)

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
##   2 classes: 'above_avg', 'avg_or_below' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.7971851  0.5216369
##   7  0.8037923  0.5282107
##   9  0.7937086  0.4973139
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.

##             Length Class      Mode     
## learn        2     -none-     list     
## k            1     -none-     numeric  
## theDots      0     -none-     list     
## xNames      10     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    2     -none-     character
## param        0     -none-     list

In this case, using str(m) to summarize the object m may report out too much information. Instead, we can simply type the object name m to get a more concise information about it.

Description about the dataset: number of samples, features, and classes.
Re-sampling process: here it is using 25 bootstrap samples with 200 observations (same size as the observed dataset) each to train the model.
Candidate models with different parameters that have been evaluated: by default, caret uses 3 different choices for each parameter, but for binary parameters, it only takes 2 choices TRUE and FALSE). As KNN has only one parameter k, we have 3 candidate models reported in the output above.
Optimal model: the model with largest accuracy is the one corresponding to k=9.

Let’s see how accurate this “optimal model” is in terms of the re-substitution error. Again, we will use the predict() function specifying the object m and the dataset boystown_n. Then, we can report the contingency table showing the agreement between the predictions and real class labels.

set.seed(1234)
pred <- predict(kNN_mod, boystown_n)
table(pred, boystown_n$grade)

##               
## pred           above_avg avg_or_below
##   above_avg          132           17
##   avg_or_below         2           49

This model has $(17+2)/200=0.09$ re-substitution error (9%). This means that in the 200 observations that we used to train this model, 91% of them were correctly classified. Note that re-substitution error is different from accuracy. The accuracy of this model is $0.81$, which is reported by a model summary call. As mentioned earlier, we can obtain prediction probabilities for each observation in the original boystown_n dataset.

head(predict(kNN_mod, boystown_n, type = "prob"))

##   above_avg avg_or_below
## 1 0.0000000    1.0000000
## 2 1.0000000    0.0000000
## 3 0.7142857    0.2857143
## 4 0.8571429    0.1428571
## 5 0.2857143    0.7142857
## 6 0.5714286    0.4285714

1.5 Customizing the tuning process

The default setting of train() might not meet the specific needs for every study. In our case, the optimal $k$ might be smaller than $9$. The caret package allows us to customize the settings for train(). Specifically, caret::trainControl() can help us to customize re-sampling methods. There are 6 popular re-sampling methods that we might want to use, which are summarized in the following table.

Resampling method	Method name	Additional options and default values
Holdout sampling	`LGOCV`	`p = 0.75` (training data proportion)
k-fold cross-validation	`cv`	`number = 10` (number of folds)
Repeated k-fold cross validation	`repeatedcv`	`number = 10` (number of folds), `repeats = 10` (number of iterations)
Bootstrap sampling	`boot`	`number = 25` (resampling iterations)
0.632 bootstrap	`boot632`	`number = 25` (resampling iterations)
Leave-one-out cross-validation	`LOOCV`	None

\[\textbf{Table 2: Core re-sampling methods.}\]

Each of these methods rely on alternative representative sampling strategies to train the model. Let’s use 0.632 bootstrap for example. Just specify method="boot632" in the trainControl() function. The number of different samples to include can be customized by the number= option. Another option in trainControl() allows specification of the model performance evaluation. We can select a preferred method of evaluation for choosing the optimal model. For instance, the oneSE method chooses the simplest model within one standard error of the best performance to be the optimal model. Other strategies are also available in the caret package. For detailed information, type ?best in the R console.

We can also specify a list of $k$ values we want to test by creating a matrix or a grid.

ctrl <- trainControl(method = "boot632", number=25, selectionFunction = "oneSE")
grid <- expand.grid(k=c(1, 3, 5, 7, 9)) 
# Creates a data frame from all combinations of the supplied factors

Usually, to avoid ties, we prefer to choose an odd number of clusters $k$. Now the constraints are all set. We can start to select models again using train().

set.seed(123)
kNN_mod2 <-train(grade ~ ., data=boystown_n, method="knn", 
         metric="Kappa", 
         trControl=ctrl, 
         tuneGrid=grid)
kNN_mod2

## k-Nearest Neighbors 
## 
## 200 samples
##  10 predictor
##   2 classes: 'above_avg', 'avg_or_below' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   1  0.8748058  0.7155507
##   3  0.8389441  0.6235744
##   5  0.8411961  0.6254587
##   7  0.8384469  0.6132381
##   9  0.8341422  0.5971359
## 
## Kappa was used to select the optimal model using  the one SE rule.
## The final value used for the model was k = 1.

Here we added metric="Kappa" to include the Kappa statistics as one of the criteria to select the optimal model. We can see the output accuracy for all the candidate models are better than the default bootstrap sampling. The optimal model has k=1, a high accuracy $0.861$, and a high Kappa statistics, which is much better than the model we had in Chapter 5. Note that the output based on the SE rule may not necessarily choose the model with the highest accuracy or the highest Kappa statistic as the “optimal model”. The tuning process is more comprehensive than only looking at one statistic.

1.6 Comparing the performance of several alternative models

Earlier in Chapter 5 we saw examples of how to choose appropriate evaluation metrics and how to contrast the performance of various AI/ML methods. Below, we illustrate model comparison based on classification of Case-Study 6, Quality of Life (QoL) dataset using bagging, boosting, random forest, SVN, k nearest neighbors, and decision trees. All available caret package ML/AI training methods are listed here.

# install.packages(fastAdaboost)
library(fastAdaboost)
library(caret)    # for modeling
library(lattice)  # for plotting
control <- trainControl(method="repeatedcv", number=10, repeats=3)

## Run all subsequent models in parallel
library(adabag)
library(doParallel)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
system.time({
  rf.fit        <- train(cd~., data=qol[, -37], method="rf", trControl=control);
  knn.fit       <- train(cd~., data=qol[, -37], method="knn", trControl=control);
  svm.fit       <- train(cd~., data=qol[, -37], method="svmRadialWeights", trControl=control);
  adabag.fit    <- train(cd~., data=qol[, -37], method="AdaBag", trControl=control);
  adaboost.fit  <- train(cd~., data=qol[, -37], method="adaboost", trControl=control)
})

##    user  system elapsed 
##    3.31    0.16  125.16

stopCluster(cl) # close multi-core cluster
rm(cl)

results <- resamples(list(RF=rf.fit, kNN=knn.fit, SVM=svm.fit, Bag=adabag.fit, Boost=adaboost.fit))

# summary of model differences
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, kNN, SVM, Bag, Boost 
## Number of resamples: 30 
## 
## Accuracy 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF    0.9954751 1.0000000 1.0000000 0.9995489 1.0000000 1.0000000    0
## kNN   0.5270270 0.6004525 0.6238739 0.6225767 0.6475225 0.7090909    0
## SVM   0.9409091 0.9593219 0.9683971 0.9676231 0.9773499 0.9864865    0
## Bag   1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000    0
## Boost 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000    0
## 
## Kappa 
##             Min.   1st Qu.   Median      Mean   3rd Qu.      Max. NA's
## RF    0.99092365 1.0000000 1.000000 0.9990954 1.0000000 1.0000000    0
## kNN   0.04709345 0.1967654 0.245143 0.2434337 0.2949951 0.4155251    0
## SVM   0.88132780 0.9183430 0.936674 0.9350707 0.9545693 0.9729290    0
## Bag   1.00000000 1.0000000 1.000000 1.0000000 1.0000000 1.0000000    0
## Boost 1.00000000 1.0000000 1.000000 1.0000000 1.0000000 1.0000000    0

# Plot Accuracy Summaries
# scales <- list(x=list(relation="free"), y=list(relation="free"))
# bwplot(results, scales=scales)                 # Box plots of accuracy

# Convert (results) data-frame from wide to long format
# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector)
library(tidyr)
results_long <- gather(results$values[, -1], method, measurement, factor_key=TRUE) %>%
  separate(method, c("Technique", "Metric"), sep = "~")

# Compare original wide format to transformed long format
results$values[, -1]

##    RF~Accuracy  RF~Kappa kNN~Accuracy  kNN~Kappa SVM~Accuracy SVM~Kappa
## 1    1.0000000 1.0000000    0.6334842 0.26710338    0.9729730 0.9457478
## 2    1.0000000 1.0000000    0.6334842 0.26147943    0.9772727 0.9545455
## 3    1.0000000 1.0000000    0.6606335 0.31988839    0.9409091 0.8813278
## 4    1.0000000 1.0000000    0.6063348 0.21178207    0.9684685 0.9368857
## 5    1.0000000 1.0000000    0.6515837 0.29858621    0.9728507 0.9455665
## 6    1.0000000 1.0000000    0.5945946 0.18886002    0.9864865 0.9729290
## 7    1.0000000 1.0000000    0.7090909 0.41552511    0.9547511 0.9091955
## 8    1.0000000 1.0000000    0.6036036 0.20688535    0.9504505 0.9007398
## 9    1.0000000 1.0000000    0.5855856 0.16881003    0.9774775 0.9548450
## 10   0.9954751 0.9909237    0.5791855 0.15969582    0.9638009 0.9272907
## 11   1.0000000 1.0000000    0.6216216 0.23922977    0.9549550 0.9096533
## 12   1.0000000 1.0000000    0.6621622 0.32814139    0.9819820 0.9639200
## 13   1.0000000 1.0000000    0.5855856 0.16813294    0.9684685 0.9368344
## 14   0.9954955 0.9909690    0.6561086 0.31422505    0.9504505 0.9005781
## 15   1.0000000 1.0000000    0.6000000 0.19634703    0.9684685 0.9367315
## 16   1.0000000 1.0000000    0.6486486 0.29985444    0.9819820 0.9639200
## 17   1.0000000 1.0000000    0.6351351 0.26967752    0.9594595 0.9186548
## 18   1.0000000 1.0000000    0.6063348 0.20749351    0.9502262 0.9001602
## 19   1.0000000 1.0000000    0.6396396 0.28073870    0.9819820 0.9639493
## 20   1.0000000 1.0000000    0.5270270 0.04709345    0.9819820 0.9638907
## 21   1.0000000 1.0000000    0.6441441 0.28422170    0.9683258 0.9365229
## 22   1.0000000 1.0000000    0.6216216 0.24046921    0.9592760 0.9182390
## 23   1.0000000 1.0000000    0.6018100 0.19802062    0.9592760 0.9181650
## 24   0.9954955 0.9909690    0.5855856 0.17016090    0.9639640 0.9277226
## 25   1.0000000 1.0000000    0.6261261 0.24981679    0.9727273 0.9452963
## 26   1.0000000 1.0000000    0.6742081 0.34733388    0.9773756 0.9545772
## 27   1.0000000 1.0000000    0.6171171 0.22985879    0.9773756 0.9546183
## 28   1.0000000 1.0000000    0.6380090 0.27356397    0.9683258 0.9365229
## 29   1.0000000 1.0000000    0.6561086 0.31562220    0.9683258 0.9366165
## 30   1.0000000 1.0000000    0.5727273 0.14439388    0.9683258 0.9364760
##    Bag~Accuracy Bag~Kappa Boost~Accuracy Boost~Kappa
## 1             1         1              1           1
## 2             1         1              1           1
## 3             1         1              1           1
## 4             1         1              1           1
## 5             1         1              1           1
## 6             1         1              1           1
## 7             1         1              1           1
## 8             1         1              1           1
## 9             1         1              1           1
## 10            1         1              1           1
## 11            1         1              1           1
## 12            1         1              1           1
## 13            1         1              1           1
## 14            1         1              1           1
## 15            1         1              1           1
## 16            1         1              1           1
## 17            1         1              1           1
## 18            1         1              1           1
## 19            1         1              1           1
## 20            1         1              1           1
## 21            1         1              1           1
## 22            1         1              1           1
## 23            1         1              1           1
## 24            1         1              1           1
## 25            1         1              1           1
## 26            1         1              1           1
## 27            1         1              1           1
## 28            1         1              1           1
## 29            1         1              1           1
## 30            1         1              1           1

head(results_long)

##   Technique   Metric measurement
## 1        RF Accuracy           1
## 2        RF Accuracy           1
## 3        RF Accuracy           1
## 4        RF Accuracy           1
## 5        RF Accuracy           1
## 6        RF Accuracy           1

library(plotly)
plot_ly(results_long, x=~Technique, y = ~measurement, color = ~Metric, type = "box")

#densityplot(results, scales=scales, pch = "|") # Density plots of accuracy

densityModels <- with(results_long[which(results_long$Metric=='Accuracy'), ],
                      tapply(measurement, INDEX = Technique, density))
df <- data.frame(
  x = unlist(lapply(densityModels, "[[", "x")),
  y = unlist(lapply(densityModels, "[[", "y")),
  method = rep(names(densityModels), each = length(densityModels[[1]]$x))
)

plot_ly(df, x = ~x, y = ~y, color = ~method) %>% add_lines() %>%
  layout(title="Performance Density Plots (Accuracy)", legend = list(orientation='h'),
         xaxis=list(title="Accuracy"), yaxis=list(title="Density"))

densityModels <- with(results_long[which(results_long$Metric=='Kappa'), ],
                      tapply(measurement, INDEX = Technique, density))
df <- data.frame(
  x = unlist(lapply(densityModels, "[[", "x")),
  y = unlist(lapply(densityModels, "[[", "y")),
  method = rep(names(densityModels), each = length(densityModels[[1]]$x))
)

plot_ly(df, x = ~x, y = ~y, color = ~method) %>% add_lines() %>%
  layout(title="Performance Density Plots (Kappa)", legend = list(orientation='h'),
         xaxis=list(title="Kappa"), yaxis=list(title="Density"))

# dotplot(results, scales=scales)                # Dot plots of Accuracy & Kappa
#splom(results)      # contrast pair-wise model scatterplots of prediction accuracy (Trellis Scatterplot matrices)

# Pairs - Accuracy
results_wide <- results_long[which(results_long$Metric=='Accuracy'), -2] %>%
  pivot_wider(names_from = Technique, values_from = measurement)

df = data.frame(cbind(RF=results_wide$RF[[1]], kNN=results_wide$kNN[[1]], SVM=results_wide$SVM[[1]], Bag=results_wide$Bag[[1]], Boost=results_wide$Boost[[1]]))

dims <- dplyr::select_if(df, is.numeric)
dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL), 
        showupperhalf = FALSE, diagonal = list(visible = FALSE)) %>%
  layout(title="Performance Pairs Plot (Accuracy)")

# Pairs - Accuracy
results_wide <- results_long[which(results_long$Metric=='Kappa'), -2] %>%
  pivot_wider(names_from = Technique, values_from = measurement)

df = data.frame(cbind(RF=results_wide$RF[[1]], kNN=results_wide$kNN[[1]], SVM=results_wide$SVM[[1]], Bag=results_wide$Bag[[1]], Boost=results_wide$Boost[[1]]))

dims <- dplyr::select_if(df, is.numeric)
dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL), 
        showupperhalf = FALSE, diagonal = list(visible = FALSE)) %>%
  layout(title="Performance Pairs Plot (Kappa)")

1.7 Forecasting types and assessment approaches

Cross-validation is a strategy for validating predictive methods, classification models and clustering techniques by assessing the reliability and stability of the results of the corresponding statistical analyses (e.g., predictions, classifications, forecasts) based on independent datasets. For prediction of trend, association, clustering, and classification, a model is usually trained on one dataset (training data) and subsequently tested on new data (testing or validation data). Statistical internal cross-validation defines a test dataset to evaluate the model predictive performance as well as assess its power to avoid overfitting. Overfitting is the process of computing a predictive or classification model that describes random error, i.e., fits to the noise components of the observations, instead of identifying actual relationships and salient features in the data.

In this section, we will use Google Flu Trends, Autism, and Parkinson’s disease case-studies to illustrate (1) alternative forecasting types using linear and non-linear predictions, (2) exhaustive and non-exhaustive internal statistical cross-validation, and (3) explore complementary predictor functions.

In Chapter 5 we discussed the types of classification and prediction methods, including supervised and unsupervised learning. The former are direct and predictive (there are known outcome variables that can be predicted and the corresponding forecasts can be evaluated) and the latter are indirect and descriptive (there are no a priori labels or specific outcomes).

1.7.1 Overfitting

Before we go into the cross-validation of predictive analytics, we will present several examples of overfitting that illustrates why a certain amount of skepticism and mistrust may be appropriate when dealing with forecasting models based on large and complex data.

1.7.1.1 Example (US Presidential Elections)

By 2022, there were only 58 US presidential elections and 46 presidents. That is a small dataset, and learning from it may be challenging. For instance:

If the predictor space expands to include things like having false teeth, it’s pretty easy for the model to go from fitting the generalizable features of the data (the signal, e.g., presidential actions) to matching noise patterns (e.g., irrelevant characteristics like gender of the children of presidents, or types of dentures they may wear).
When overfitting noise patterns takes place, the quality of the model fit assessed on the historical data may improve (e.g., better $R^2$, more about the Coefficient of Determination is available here). At the same time, however, the model performance may be suboptimal when used to make inferences about prospective data, e.g., future presidential elections.

This cartoon illustrates some of the (unique) noisy presidential characteristics that are thought to be unimportant to presidential elections or presidential performance. Cartoon of noisy presidential characteristics

1.7.1.2 Example (Google Flu Trends)

A March 14, 2014 article in Science (DOI: 10.1126/science.1248506), identified problems in Google Flu Trends (GFT), DOI 10.1371/journal.pone.0023610, which may be attributed in part to overfitting. The GFT was built to predict the future Centers for Disease Control and Prevention (CDC) reports of doctor office visits for influenza-like illness (ILI). In February 2013, Nature reported that GFT was predicting more than double the proportion of doctor visits compared to the CDC forecast for the same period.

The GFT model found the best matches among 50 million web search terms to fit 1,152 data points. It predicted quite high odds of finding search terms that match the propensity of the flu but are structurally unrelated and hence are not prospectively predictive. In fact, the GFT investigators reported weeding out unrelated to the flu seasonal search terms may have been strongly correlated to the CDC data, e.g., high school basketball season. The big GFT data may have overfitted the relatively small number of cases. This false-alarm result was also paired with a false-negative finding. The GFT model also missed the non-seasonal 2009 H1N1 influenza pandemic, which provides a cautionary tale about prediction, overfitting, and prospective validation.

1.7.1.3 Example (Autism)

Autistic brains constantly overfit visual and cognitive stimuli. To an autistic person, a general conversation of several adults may seem like a cacophony due to super-sensitive detail-oriented hearing and perception tuned to literally pick up all elements of the conversation and clues of the surrounding environment. At the same time, autistic brains downplay body language, sarcasm and non-literal cues. We can miss the forest for the trees when we start “overfitting”, over-interpreting the noise on top of the actual salient information. Ambient noise, trivial observations and unrelated perceptions may hide the true communication details.

Human conversations and communications involve exchanges of both critical information and random noise. Fitting a perfect model requires focus only on the “relevant” information. Overfitting occurs when attention is (excessively) consumed with peripheral noise, or worse, overwhelmed by inconsequential noise drowning the salient aspects of the communication exchange.

Any dataset is a mix of signal and noise. The main task of our brains is to sort these components and interpret the information (i.e., ignore the noise).

“One person’s noise is another person’s treasure map!”

Our predictions are most accurate if we can model as much of the signal and as little of the noise as possible. Note that in these terms, $R^2$ is a poor metric to identify predictive power - it measures how much of the signal and the noise is explained by our model. In practice, it’s hard to always identify what’s signal and what’s noise. This is why practical applications tend to favor simpler models, since the more complicated a model is, the easier it is to overfit the noise component of the observed information.

1.8 Internal Statistical Cross-validation

Internal statistical cross-validation assesses the expected performance of a prediction method in cases (subject, units, regions, etc.) drawn from a similar population as the original training data sample. Internal validation is distinct from external validation, as the latter potentially allows for the existence of differences between the populations (training data, used to develop or train the technique, and testing data, used to independently quantify the performance of the technique).

Each step in the internal statistical cross-validation protocol involves:

Randomly partitioning a sample of data into 2 complementary subsets (training + testing).
Performing the analysis, fitting or estimating the model using the training set.
Validating the analysis or evaluating the performance of the model using a separate testing set.
Increasing the iteration index and repeating the process. Various termination criteria can involve a fixed number, a desired mean variability, or an upper bound on the error-rate.

Here is one example of internal statistical cross-validation Predictive diagnostic modeling in Parkinson’s disease.

To reduce the noise and variability at each iteration, the final validation results may include the averaged performance results each iteration.

In cases when new observations are hard to obtain (due to costs, reliability, time, or other constraints), cross-validation guards against testing hypotheses suggested by the data themselves (also known as Type III error or False-Suggestion).

Cross-validation is different from conventional-validation (e.g. 80%-20% partitioning the data set into training and testing subsets) where the prediction error (e.g., Root Mean Square Error, RMSE) evaluated on the training data is not a useful estimator of model performance, as it does not generalize across multiple samples.

In general, the errors of the conventional-valuation are based on the results of a specific test dataset and may not accurately represent the model performance. A more appropriate strategy to properly estimate model prediction performance is to use cross-validation (CV), which combines (averages) prediction errors to measure the model performance. CV corrects for the expected stochastic nature of partitioning the training and testing sets and generates a more accurate and robust estimate of the expected model performance.

Relative to a simpler model, a more complex model may overfit-the-data if it has a short foresight, i.e., it generates accurate fitting results for known data but less accurate results when predicting based on new data. Knowledge from past experiences may include either relevant or irrelevant (noise) information. In challenging data-driven prediction models when uncertainty (entropy) is high, more noise is present in past information that needs to be accounted for in prospective forecasting. However, it is generally hard to discriminate patterns from noise in complex systems (i.e., deciding which part to model and which to ignore). Models that reduce the chance of fitting noise are called robust.

1.8.1 Example (Linear Regression)

Let’s demonstrate a simple model assessment using linear regression. Suppose we observe the response values {${y_1, \cdots, y_n}$}, and the corresponding $k$ predictors represented as a $kD$ vector of covariates {${x_1, \cdots, x_n }$}, where subjects/cases are indexed by $1\leq i\leq n$, and the data-elements (variables) are indexed by $1\leq j\leq k$.

\[ \begin{pmatrix} x_{1, 1} &\cdots & x_{1, k}\\ \vdots &\ddots & \vdots \\x_{n, 1} &\cdots & x_{n, k} \\ \end{pmatrix} .\]

Using least squares to estimate the linear function parameters (effect-sizes), ${\beta _1, \cdots , \beta _k }$, allows us to compute a hyperplane $y = a + x\beta$ that best fits the observed data ${(x_i, y_i )}_{1\leq i \leq n}$. This is expressed as a matrix by:

\[\begin{pmatrix} y_1\\ \vdots \\ y_n\\ \end{pmatrix} = \begin{pmatrix} a_1\\ \vdots \\ a_n\\ \end{pmatrix}+\begin{pmatrix} x_{1, 1} &\cdots & x_{1, k}\\ \vdots &\ddots & \vdots \\x_{n, 1} &\cdots & x_{n, k} \\ \end{pmatrix} \begin{pmatrix} \beta_1\\ \vdots \\ \beta_k\\ \end{pmatrix}.\]
Corresponding to the system of linear hyperplanes:

\[\left\{ \begin{array}{c} y_1=a_1+ x_{1, 1} \beta_1 +x_{1, 2} \beta_2 + \cdots +x_{1, k} \beta_k \\ y_2=a_2+ x_{2, 1} \beta_1 +x_{2, 2} \beta_2 + \cdots +x_{2, k} \beta_k \\ \vdots\\ y_n=a_n+ x_{n, 1} \beta_1 +x_{n, 2} \beta_2 + \cdots +x_{n, k} \beta_k \end{array} \right.\ .\]

One measure to evaluate the model fit may be the mean squared error (MSE). The MSE for a given value of the parameters $\alpha$ and $\beta$ on the observed training data ${(x_i, y_i) }_{1\leq i\leq n}$ is expressed as

\[MSE=\frac{1}{n}\sum_{i=1}^{n} (y_i- \underbrace{(a_1+ x_{i, 1} \beta_1 +x_{i, 2} \beta_2 + \cdots +x_{i, k} \beta_k) }_{\text{predicted value } \hat{y}_i, \text{at } {x_{i, 1}, \cdots, x_{i, k}}} )^2\ .\]

And the corresponding root mean square error (RMSE) is:

\[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_1- \underbrace{(a_1+ x_{1, 1} \beta_1 +x_{1, 2} \beta_2 + \cdots +x_{1, k} \beta_k) }_{\text{predicted value } \hat{y}_1, \text{at } {x_{i, 1}, \cdots, x_{i, k}}} )^2}\ .\]

In the linear model case, the expected value of the MSE (over the distribution of training sets) for the training set is $\frac{n - k - 1}{n + k + 1} E$, where $E$ is the expected value of the MSE for the testing/validation data. Therefore, fitting a model and computing the MSE on the training set, we may produce an over optimistic evaluation assessment (smaller RMSE) of how well the model may fit another dataset. This bias represents in-sample estimate of the fit, whereas we are interested in the cross-validation estimate as an out-of-sample estimate.

In the linear regression model, cross validation may not be as useful, since we can compute the exact correction factor $\frac{n - k - 1}{n + k + 1}$ to obtain as estimate of the exact expected out-of-sample fit using the in-sample MSE (under)estimate. However, even in this situation, cross-validation remains useful as it can be used to select an optimal regularized cost function.

In most other modeling procedures (e.g. logistic regression), there are no simple general closed-form expressions (formulas) to adjust the cross-validation error estimate from the in-sample fit estimate. Cross-validation is a generally applicable way to predict the performance of a model on a validation set using stochastic computation instead of obtaining experimental, theoretical, mathematical, or analytic error estimates.

1.8.2 Cross-validation methods

There are 2 classes of cross-validation approaches, exhaustive and non-exhaustive.

1.8.2.1 Exhaustive cross-validation

Exhaustive cross-validation methods are based on determining all possible ways to divide the original sample into training and testing data. For instance, the Leave-m-out cross-validation involves using $m$ observations for testing and the remaining ($n-m$) observations as training. The case when $m=1$, i.e., leave-1-out method, is only applicable when $n$ is small, due to its huge computational cost. This process is repeated on all partitions of the original sample. This method requires model fitting and validating $C_m^n$ times ($n$ is the total number of observations in the original sample and $m$ is the number left out for validation). This requires a very large number of steps.

1.8.2.2 Non-exhaustive cross-validation

Non-exhaustive cross validation methods avoid computing estimates/errors using all possible partitionings of the original sample, but rather approximate these. For example, in the k-fold cross-validation, the original sample is randomly partitioned into $k$ equal sized subsamples, or folds. Of the $k$ subsamples, a single subsample is kept as final testing data for validation of the model. The other $k - 1$ subsamples are used as training data. The cross-validation process is then repeated $k$ times, corresponding to the $k$ folds. Each of the $k$ subsamples is used once as the validation data. There are corresponding $k$ results that are averaged (or otherwise aggregated) to generate a final pooled model-quality estimation. In k-fold validation, all observations are used for both training and validation, and each observation is used for validation exactly once. In general, $k$ is a parameter that needs to be selected by an investigator (common values may be $5$ or $10$).

A general case of the k-fold validation is $k=n$ (the total number of observations), when it coincides with the leave-one-out cross-validation.

A variation of the k-fold validation is stratified k-fold cross-validation, where each fold has the same (approximately) mean response value. For instance, if the model represents a binary classification of cases (e.g., controls vs. patients), this implies that each fold contains roughly the same proportion of the two class labels.

Repeated random sub-sampling validation splits randomly the entire dataset into a training set, where the model is fit, and a testing set, where the predictive accuracy is assessed. Again, the results are averaged over all iterative splits. This method has an advantage over k-fold cross validation as that the proportion of the training/testing split is not dependent on the number of iterations (folds). However, its drawback is that some observations may never be selected whereas others may be selected multiple times in the testing/validation subsample. As validation subsets may overlap, the results may vary each time we repeat the validation protocol, unless we set a seed point in the algorithm.

Asymptotically, as the number of random splits increases, the repeated random sub-sampling validation approaches the leave-k-out cross-validation.

1.8.3 Case-Studies

In the examples below, we have intentionally suppressed some of the R output to save space. This is accomplished using this Rmarkdown command, {r eval=TRUE, results='hide'}, however, the reader is encouraged to try hands-on all the protocols, make modifications, inspect and interpret the outputs.

1.8.3.1 Example 1: Prediction of Parkinson’s Disease using `Adaptive Boosting` (AdaBoost)

This Parkinson’s Diseases Study involves heterogeneous neuroimaging, genetics, clinical, and phenotypic data of over 600 volunteers. The multivariate data includes three cohorts (HC=Healthy Controls, PD=Parkinson’s, SWEDD=subjects without evidence for dopaminergic deficit).

Load the PPMI data 06_PPMI_ClassificationValidationData_Short.csv.

ppmi_data <-read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)

Binarize the Dx (clinical diagnoses) classes.

# binarize the Dx classes
ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
attach(ppmi_data)

head(ppmi_data)
# View (ppmi_data)

Obtain a model-free predictive analytics, e.g., AdaBoost classification, and report the results.

# Model-free analysis, classification
# install.packages("crossval")
# install.packages("ada")
# library("crossval")
library(crossval)
library(ada)
# set up adaboosting prediction function

# Define a new AdaBoost classification result-reporting function
my.ada <- function (train.x, train.y, test.x, test.y, negative, formula){
  ada.fit <- ada(train.x, train.y)
  predict.y <- predict(ada.fit, test.x)
  #count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
 # negative  is the label of a negative "null" sample (default: "control").
  return (out)
}

Recall from Chapter 2 that when group sizes are imbalanced, we may need to rebalance them to avoid potential biases of dominant cohorts. In this case, we will re-balance the groups using the package SMOTE Synthetic Minority Oversampling Technique. SMOTE may be used to handle class imbalance in binary or multinomial (multiclass) classification.

# balance cases
# SMOTE: Synthetic Minority Oversampling Technique to handle class imbalance in binary classification.
set.seed(1000)
# install.packages("unbalanced") to deal with unbalanced group data
# https://cran.r-project.org/src/contrib/Archive/unbalanced/
library(unbalanced)
ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0) 
uniqueID <- unique(ppmi_data$FID_IID) 
ppmi_data <- ppmi_data[ppmi_data$VisitID==1, ]
ppmi_data$PD <- factor(ppmi_data$PD)

colnames(ppmi_data)
# ppmi_data.1<-ppmi_data[, c(3:281, 284, 287, 336:340, 341)]
n <- ncol(ppmi_data)
output.1 <- ppmi_data$PD

# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0) 
# remove Default Real Clinical subject classifications! 
input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID"))]
# output <- as.matrix(ppmi_data[ , which(names(ppmi_data) %in% {"PD"})])
output <- as.factor(ppmi_data$PD)
c(dim(input), dim(output))

#balance the dataset
set.seed(123)
data.1 <- ubBalance(X= input, Y=output, type="ubSMOTE", percOver=300, percUnder=150, verbose=TRUE)
balancedData <- cbind(data.1$X, data.1$Y)
table(data.1$Y)

nrow(data.1$X); ncol(data.1$X)
nrow(balancedData); ncol(balancedData)
nrow(input); ncol(input)

colnames(balancedData) <- c(colnames(input), "PD")

Next, we’ll check the re-balanced cohort sizes.

library(plotly)

###Check balance
## T test
alpha.0.05 <- 0.05
test.results.bin <- NULL        # binarized/dichotomized p-values
test.results.raw <- NULL        # raw p-values

# get a better error-handling t.test function that gracefully handles NA's and trivial variances
my.t.test.p.value <- function(input1, input2) {
obj <- try(t.test(input1, input2), silent=TRUE)
if (is(obj, "try-error")) 
  return(NA) 
else 
  return(obj$p.value)
} 

for (i in 1:ncol(balancedData)) 
{   
test.results.raw[i]  <- my.t.test.p.value(input[, i], balancedData [, i])
    test.results.bin[i] <- ifelse(test.results.raw[i]  > alpha.0.05, 1, 0)   
# binarize the p-value (0=significant, 1=otherwise) 
    print(c("i=", i, "var=", colnames(balancedData[i]), "t-test_raw_p_value=", test.results.raw[i]))
}

# we can also employ (e.g., FDR, Bonferroni) correction for multiple testing!
    # test.results.corr <- stats::p.adjust(test.results.raw, method = "fdr", n = length(test.results.raw)) 
    # where methods are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
    # plot(test.results.raw, test.results.corr)
    # sum(test.results.raw < alpha.0.05, na.rm=T)/length(test.results.raw)  #check proportion of inconsistencies
# sum(test.results.corr < alpha.0.05, na.rm =T)/length(test.results.corr)

QQ <- qqplot(input[, 5], balancedData [, 5], plot=F) # check visually for differences between the distributions of the raw (input) and rebalanced data (for only one variable, in this case)
plot_ly(x=~QQ$x, y=~QQ$y, type="scatter", mode="markers", name="Data") %>%
  add_trace(x=c(0,1), y=c(0,1), name="Ideal Distribution Match", type="scatter", mode="lines") %>%
  layout(title=paste0("QQ-Plot Original vs. Rebalanced Data (Feature=",colnames(input)[5], ")"),
         xaxis=list(title="Original Data"), yaxis=list(title="Rebalanced Data"),
         hovermode = "x unified", legend = list(orientation='h'))

# Now, check visually for differences between the distributions of the raw (input) and rebalanced data.
# par(mar=c(1,1,1,1))
# par(mfrow=c(10,10))
# for(i in c(1:62,64:101)){ qqplot(balancedData [, i],input[, i]) } #except VisitID

# as the sample-size is changed:
length(input[, 5]); length(balancedData [, 5])
# to plot raw vs. rebalanced pairs (e.g., var="L_insular_cortex_Volume"), we need to equalize the lengths
#plot (input[, 5] +0*balancedData [, 5], balancedData [, 5])   # [, 5] == "L_insular_cortex_Volume"

# print(c("T-test results: ", test.results))
# zeros (0) are significant independent between-group T-test differences, ones (1) are insignificant

for (i in 1:(ncol(balancedData)-1)) 
{
    test.results.raw [i]  <- wilcox.test(input[, i], balancedData [, i])$p.value
    test.results.bin [i] <- ifelse(test.results.raw [i] > alpha.0.05, 1, 0)
    print(c("i=", i, "Wilcoxon-test=", test.results.raw [i]))
}
print(c("Wilcoxon test results: ", test.results.bin))
# test.results.corr <- stats::p.adjust(test.results.raw, method = "fdr", n = length(test.results.raw)) 
# where methods are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
# plot(test.results.raw, test.results.corr)

QQ <- qqplot(input[, 5], balancedData [, 5], plot=F) # check visually for differences between the distributions of the raw (input) and rebalanced data (for only one variable, in this case)
plot_ly(x=~QQ$x, y=~QQ$y, type="scatter", mode="markers", name="Data") %>%
  add_trace(x=c(0,1), y=c(0,1), name="Ideal Distribution Match", type="scatter", mode="lines") %>%
  layout(title=paste0("QQ-Plot Original vs. Rebalanced Data (Feature=",colnames(input)[5], ")"),
         xaxis=list(title="Original Data"), yaxis=list(title="Rebalanced Data"),
         hovermode = "x unified", legend = list(orientation='h'))

The next step will be the actual Cross validation.

# using raw data:
X <- as.data.frame(input); Y <- output
neg <- "1"   # "Control" == "1"

# using Rebalanced data: 
X <- as.data.frame(data.1$X); Y <- data.1$Y
# balancedData<-cbind(data.1$X, data.1$Y); dim(balancedData)

# Side note: There is a function name collision for "crossval", the same method is present in the "mlr" (machine Learning in R) package and in the "crossval" package.  
# To specify a function call from a specific package do:  packagename::functionname()

set.seed(115) 
cv.out <- crossval::crossval(my.ada, X, Y, K = 5, B = 1, negative = neg)
    # the label of a negative "null" sample (default: "control")
out <- diagnosticErrors(cv.out$stat)

print(cv.out$stat)

##    FP    TP    TN    FN 
##   0.2 109.8  97.4   0.0

print(out)

##       acc      sens      spec       ppv       npv       lor 
## 0.9990357 1.0000000 0.9979508 0.9981818 1.0000000       Inf

As we can see from the reported metrics, the overall averaged AdaBoost-based diagnostic predictions are quite good.

1.8.3.2 Example 2: Sleep dataset

These data contain the effect of two soporific drugs to increase hours of sleep (treatment-compared design) on 10 patients. The data is available by default (sleep {datasets})

First, load the data and report some graphs and summaries.

data(sleep); str(sleep)

## 'data.frame':    20 obs. of  3 variables:
##  $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
##  $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

X = as.matrix(sleep[, 1, drop=FALSE]) # increase in hours of sleep, 
    # drop is logical, if TRUE the result is coerced to the lowest possible dimension. 
# The default is to drop if only one column is left, but not to drop if only one row is left.
Y = sleep[, 2] # drug given 
# plot(X ~ Y)

plot_ly(data=sleep, x=~group, y=~extra, color=~group, type="box", name="Hours of Sleep") %>%
  layout(title="Hours of Extra Sleep", legend = list(orientation='h'))

levels(Y) # "1" "2"

## [1] "1" "2"

dim(X) # 20  1

## [1] 20  1

Next, we will define a new LDA (linear discriminant analysis) predicting function and perform the Cross-validation (CV) on the resulting predictor.

library("MASS") # for lda function 

predfun.lda = function(train.x, train.y, test.x, test.y, negative)
{   lda.fit = lda(train.x, grouping=train.y)
    ynew = predict(lda.fit, test.x)$class
# count TP, FP etc.
    out = confusionMatrix(test.y, ynew, negative=negative)
return( out )
}

# install.packages("crossval")
library("crossval")

set.seed(123456)
cv.out <- crossval::crossval(predfun.lda, X, Y, K=5, B=20, negative="1", verbose=FALSE)
cv.out$stat
diagnosticErrors(cv.out$stat)

Execute the above code and interpret the diagnostic results measuring the performance of the LDA prediction.

1.8.3.3 Example 3: Model-based (linear regression) prediction using the `attitude` dataset

This data represents the average survey responses from about 35 employees from 30 (randomly selected) departments in a large organization. The data captures the proportion of favorable responses to 7 questions in each department.

Let’s load and summarize the data, which is available by default in attitude {datasets}.

# ?attitude, colnames(attitude)
# Note: when using a data frame, a time-saver is to use "." to indicate "include all covariates" in the DF.  
# E.g., fit <- lm(Y ~ ., data = D)

data("attitude")
y = attitude[, 1]       # rating variable
x = attitude[, -1]  # date frame with the remaining variables
is.factor(y)
summary( lm(y ~ . , data=x) )   # R-squared:  0.7326
# set up lm prediction function

We will demonstrate model-based analytics using lm and lda, and will validate the forecasting using CV.

predfun.lm = function(train.x, train.y, test.x, test.y) {     
  lm.fit = lm(train.y ~ . , data=train.x)
  ynew = predict(lm.fit, test.x ) 
  # compute squared error risk (MSE)
  out = mean( (ynew - test.y)^2)
  # note that, in general, when fitting linear model to continuous outcome variable (Y), 
  # we can't use the out<-confusionMatrix(test.y, ynew, negative=negative), as it requires a binary outcome
  # this is why we use the MSE as an estimate of the discrepancy between observed & predicted values
  return(out)
}

# library("MASS")
# predfun.lda = function(train.x, train.y, test.x, test.y, negative) {  
#  lda.fit = lda(train.x, grouping=train.y)
#   ynew = predict(lda.fit, test.x)$class
#   # count TP, FP etc.
#   out = confusionMatrix(test.y, ynew, negative=negative)
#   return( out )
# }

# prediction MSE using all variables
set.seed(123456)
cv.out.lm = crossval::crossval(predfun.lm, x, y, K=5, B=20, verbose=FALSE)
c(cv.out.lm$stat, cv.out.lm$stat.se)            # 72.581198  3.736784
# reducing to using only two variables
cv.out.lm = crossval::crossval(predfun.lm, x[, c(1, 3)], y, K=5, B=20, verbose=FALSE)
c(cv.out.lm$stat, cv.out.lm$stat.se)        
# 52.563957  2.015109

1.8.3.4 Example 4: Parkinson’s data (`ppmi_data`)

Let’s go back to the more elaborate Parkinson’s disease (PD/PPMI) case, load and preprocess the derived-PPMI data.

# ppmi_data <-read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
# ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
# attach(ppmi_data); head(ppmi_data)
# install.packages("crossval")
# library("crossval")
# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0) 
# input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID"))]
# output <- as.factor(ppmi_data$PD)

# remove the irrelevant variables (e.g., visit ID)
output <- as.factor(ppmi_data$PD)
input <- ppmi_data[, -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID", "VisitID"))]
X = as.matrix(input)    # Predictor variables
Y = as.matrix(output)   #  Actual PD clinical assessment 
dim(X); 
dim(Y)

fit <- lm(Y~X); 

# layout(matrix(c(1, 2, 3, 4), 2, 2))       # optional 4 graphs/page
# plot(fit)         # plot the fit 

xResid <- scale(fit$residuals)
QQ <- qqplot(xResid, rnorm(1000), plot=F) # check visually for differences between the distributions of the raw (input) and rebalanced data (for only one variable, in this case)
plot_ly(x=~QQ$x, y=~QQ$y, type="scatter", mode="markers", name="Data") %>%
  add_trace(x=c(-4,4), y=c(-4,4), name="Ideal Distribution Match", type="scatter", mode="lines") %>%
  layout(title="QQ Normal Plot of Model Residuals",
         xaxis=list(title="Model Residuals"), yaxis=list(title="Normal Residuals"),
         hovermode = "x unified", legend = list(orientation='h'))

levels(as.factor(Y))            # "0" "1"

## [1] "0" "1"

c(dim(X), dim(Y))           # 1043  103

## [1] 422 100 422   1

Apply Cross-validation to assess the performance of the linear model.

set.seed(12345)
# cv.out.lm = crossval::crossval(predfun.lm, as.data.frame(X), as.numeric(Y), K=5, B=20)

cv.out.lda = crossval::crossval(predfun.lda, X, Y, K=5, B=20, negative="1", verbose=FALSE)
# K=Number of folds; B=Number of repetitions.

# Results
#cv.out.lda$stat; 
#cv.out.lda; 
diagnosticErrors(cv.out.lda$stat)

##       acc      sens      spec       ppv       npv       lor 
## 0.9606635 0.9515000 0.9831967 0.9928696 0.8918216 7.0457111

#cv.out.lm$stat; 
#cv.out.lm; 
#diagnosticErrors(cv.out.lm$stat)

1.8.4 Summary of CV output

The cross-validation (CV) output object includes the following components:

stat.cv: Vector of statistics returned by predfun for each cross validation run
stat: statistic returned by predfun averaged over all cross validation runs
stat.se: variability capturing the corresponding standard error across cv-iterations.

1.8.5 Alternative predictor functions

We have already seen a number of predict() functions, e.g., Chapter 3. Below, we will add to the collection of predictive analytics and forecasting functions using the PPMI dataset (06_PPMI_ClassificationValidationData_Short.csv).

1.8.5.1 Logistic Regression

We will learn more about the logit model in Chapter 11. Now, we will demonstrate the use of a logit-predictor function to forecast a binary diagnostic outcome (patient or control) using the Parkinson’s disease study, 06_PPMI_ClassificationValidationData_Short.csv.

# ppmi_data <-read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
# ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
# install.packages("crossval"); library("crossval")
# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0) 

# remove the irrelevant variables (e.g., visit ID)
output <- as.factor(ppmi_data$PD)
input <- ppmi_data[, -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID", "VisitID"))]
X = as.matrix(input)    # Predictor variables
Y = as.matrix(output)

Note that the predicted values are in $log$ terms, so they need to be exponentiated to interpret them correctly.

lm.logit <- glm(as.numeric(Y) ~ ., data = as.data.frame(X), family = "binomial")
ynew <- predict(lm.logit, as.data.frame(X)); #plot(ynew)
ynew2 <- ifelse(exp(ynew)<0.5, 0, 1); # plot(ynew2)

predfun.logit = function(train.x, train.y, test.x, test.y, neg)
{   lm.logit <- glm(train.y ~ ., data = train.x, family = "binomial")
    ynew = predict(lm.logit, test.x )
 # compute TP, FP, TN, FN
ynew2 <- ifelse(exp(ynew)<0.5, 0, 1)
    out = confusionMatrix(test.y, ynew2, negative=neg)  # Binary outcome, we can use confusionMatrix
return( out )
}
# Reduce the bag of explanatory variables, purely to simplify the interpretation of the analytics in this example!
input.short <- input[, which(names(input) %in% c("R_fusiform_gyrus_Volume",                                                            
            "R_fusiform_gyrus_ShapeIndex", "R_fusiform_gyrus_Curvedness", 
 "Sex",  "Weight", "Age" , "chr12_rs34637584_GT", "chr17_rs11868035_GT",                                                              
            "UPDRS_Part_I_Summary_Score_Baseline", "UPDRS_Part_I_Summary_Score_Month_03",                                             
"UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline",                        
            "UPDRS_Part_III_Summary_Score_Baseline",                                             
            "X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline"     
))]
X = as.matrix(input.short)

cv.out.logit = crossval::crossval(predfun.logit, as.data.frame(X), as.numeric(Y), K=5, B=2, neg="1", verbose=FALSE)
cv.out.logit$stat.cv

##       FP TP TN FN
## B1.F1  2 55 26  1
## B1.F2  0 61 21  3
## B1.F3  4 57 21  2
## B1.F4  2 58 22  3
## B1.F5  0 55 24  5
## B2.F1  1 53 29  2
## B2.F2  0 63 17  4
## B2.F3  1 62 21  1
## B2.F4  3 50 29  2
## B2.F5  3 58 18  5

diagnosticErrors(cv.out.logit$stat)

##       acc      sens      spec       ppv       npv       lor 
## 0.9478673 0.9533333 0.9344262 0.9727891 0.8906250 5.6736914

Caution: Note that if you forget to exponentiate the predicted logistic model values (see ynew2 in predict.logit), you will get nonsense results, e.g., all cases may be predicted to be in one class, trivial sensitivity or negative predictive power (NPP).

1.8.5.2 Quadratic Discriminant Analysis (QDA)

In Chapter 5 we discussed the linear and quadratic discriminant analysis models. Let’s now introduce a predfun.qda() function.

predfun.qda = function(train.x, train.y, test.x, test.y, negative) {
  library("MASS") # for lda function
  qda.fit = qda(train.x, grouping=train.y)
  ynew = predict(qda.fit, test.x)$class
  out.qda = confusionMatrix(test.y, ynew, negative=negative)
  return( out.qda )
}  
  
cv.out.qda = crossval::crossval(predfun.qda, as.data.frame(input.short), as.factor(Y), K=5, B=20, neg="1")

## Error in qda.default(x, grouping, ...): rank deficiency in group 1

diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat);

## Error: object 'cv.out.qda' not found

This error message: “Error in qda.default(x, grouping, …) : rank deficiency in group 1” indicates that there is a rank deficiency, i.e. some variables are collinear and one or more covariance matrices cannot be inverted to obtain the estimates in group 1 (Controls)!

Removing the strongly correlated data elements (“R_fusiform_gyrus_Volume”, “R_fusiform_gyrus_ShapeIndex”, and “R_fusiform_gyrus_Curvedness”), resolves the rank-deficiency problem!

input.short2 <- input[, which(names(input) %in% c("R_fusiform_gyrus_Volume", "Sex",
                       "Weight", "Age" , "chr17_rs11868035_GT", "UPDRS_Part_I_Summary_Score_Baseline",
                       "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline",
                       "UPDRS_Part_III_Summary_Score_Baseline", 
                       "X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline"))]
X = as.matrix(input.short2)
cv.out.qda = crossval::crossval(predfun.qda, as.data.frame(X), as.numeric(Y), K=5, B=2, neg="1")

It makes sense to contrast the QDA and GLM/Logit predictions.

diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.logit$stat)

##       acc      sens      spec       ppv       npv       lor 
## 0.9395735 0.9550000 0.9016393 0.9597990 0.8906883 5.2706226

##       acc      sens      spec       ppv       npv       lor 
## 0.9478673 0.9533333 0.9344262 0.9727891 0.8906250 5.6736914

Clearly, both the QDA and Logit model predictions are quite similar and reliable.

1.8.6 Foundation of LDA and QDA for prediction, dimensionality reduction, or forecasting

Previously, in Chapter 5 we saw some examples of LDA/QDA methods. Now, we’ll talk more about the details. Both LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) use probabilistic models of the class conditional distribution of the data $P(X \mid Y=k)$ for each class $k$. Their predictions are obtained by using Bayesian theorem, see Appendix,

\[P(Y=k \mid X)= \frac{P(X \mid Y=k)P(Y=k)}{P(X)}=\frac{P(X \mid Y=k)P(Y=k)}{\sum_{l=0}^{\infty}P(X \mid Y=l)P(Y=l)},\]

and we select the class $k$, which maximizes this conditional probability (maximum likelihood estimation). In linear and quadratic discriminant analysis, $P(X \mid Y)$ is modeled as a multivariate Gaussian distribution with density

\[P(X \mid Y=k)= \frac{1}{{(2 \pi)}^{\frac{n}{2}} {\vert{\Sigma_{k}}\vert}^{\frac{1}{2}}}\times e^{(-\frac{1}{2}{(X- \mu_k)}^T \Sigma_k^{-1}(X- \mu_k))}.\]

This model can be used to classify data by using the training data to estimate:

the class prior probabilities $P(Y=k)$ by counting the proportion of observed instances of class $k$,
the class means $\mu_k$ by computing the empirical sample class means, and
the covariance matrices by computing either the empirical sample class covariance matrices, or by using a regularized estimator, e.g., LASSO).

In the linear case (LDA), the Gaussian components for each class are assumed to share the same covariance matrix
$\Sigma_k=\Sigma$ for each class $k$. This leads to linear decision surfaces between classes. This is clear from comparing the log-probability ratios of 2 classes ($k$ and $l$):
$LOR=\log\left (\frac{P(Y=k \mid X)}{P(Y=l \mid X)}\right )$, LOR=0 $\Longleftrightarrow$ the two probabilities are identical, i.e., yield same class label.

\[LOR=\log\left (\frac{P(Y=k \mid X)}{P(Y=l \mid X)}\right )=0\ \ \Longleftrightarrow\ \ {(\mu_k-\mu_l)}^T \Sigma^{-1} (\mu_k-\mu_l)= \frac{1}{2}(\mu_k^T \Sigma^{-1} \mu_k-\mu_l^T \Sigma^{-1} \mu_l). \]

But, in the more general quadratic case (QDA), these assumptions about the covariance matrices $\Sigma_k$ of the Gaussian components are relaxed leading to quadratic decision surfaces.

1.8.6.1 LDA (Linear Discriminant Analysis)

LDA is similar to the generalized linear model (GLM) used in ANOVA and regression analyses. LDA also attempts to express one dependent variable as a linear combination of other features or data elements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas LDA has continuous independent variables and a categorical dependent variable (i.e. Dx/class label). Logistic regression logit and probit regression are more similar to LDA than ANOVA, as they also explain a categorical variable by the values of continuous independent variables.

predfun.lda = function(train.x, train.y, test.x, test.y, neg) {
  library("MASS")
  lda.fit = lda(train.x, grouping=train.y)
  ynew = predict(lda.fit, test.x)$class
  out.lda = confusionMatrix(test.y, ynew, negative=neg)
  return( out.lda )
}

1.8.6.2 QDA (Quadratic Discriminant Analysis)

Similarly to LDA, the QDA prediction function can be defined by:

predfun.qda = function(train.x, train.y, test.x, test.y, neg) {
  library("MASS") # for lda function
  qda.fit = qda(train.x, grouping=train.y)
  ynew = predict(qda.fit, test.x)$class
  out.qda = confusionMatrix(test.y, ynew, negative=neg)
  return( out.qda )
}

1.8.6.3 Neural Network

We saw Neural Networks (NNs) in Chapter 6. Applying NNs is not straightforward. We have to create a design matrix with an indicator column for the response feature. In addition, we need to write a predict function to translate the output of neuralnet() into analytical forecasts.

# predict nn
library("neuralnet")
pred = function(nn, dat) {
  yhat = compute(nn, dat)$net.result
  yhat = apply(yhat, 1, which.max)-1
  return(yhat)
}

my.neural <- function (train.x, train.y, test.x, test.y,method,layer=c(5,5)) {
  train.x <- as.data.frame(train.x)
  train.y <- as.data.frame(train.y)
  colnames(train.x) <- paste0('V', 1:ncol(X)) 
  colnames(train.y) <- "V1"
  train_y_ind = model.matrix(~factor(train.y$V1)-1)
  colnames(train_y_ind) = paste0('out', 0:1)
  train = cbind(train.x, train_y_ind)
  y_names = paste0('out', 0:1)
  x_names = paste0('V', 1:ncol(train.x))
  nn = neuralnet(
    paste(paste(y_names, collapse='+'),
          '~',
          paste(x_names, collapse='+')),
    train,
    hidden=layer,
      linear.output=FALSE,
    lifesign='full', lifesign.step=1000)
  #predict
  predict.y <- pred(nn, test.x)
  out <- crossval::confusionMatrix(test.y, predict.y,negative = 0)
  return (out)
}

set.seed(1234) 
cv.out.nn <- crossval::crossval(my.neural, scale(X), Y, K = 5, B = 1,layer=c(20,20),verbose = F) # scaled predictors are necessary.
crossval::diagnosticErrors(cv.out.nn$stat)

##       acc      sens      spec       ppv       npv       lor 
## 0.9407583 0.9016393 0.9566667 0.8943089 0.9598662 5.3101066

1.8.6.4 SVM

We also saw SVM in Chapter 6. Let’s try cross validation on Linear and Gaussian (radial) kernel SVM. We can expect that linear SVM should achieve a close result to Gaussian or even better than Gaussian since this dataset have a large $k$ (# features) compared with $n$ (# cases), which we have studied in detail in Chapter 6.

library("e1071")
my.svm <- function (train.x, train.y, test.x, test.y,method,cost=1,gamma=1/ncol(dx_norm),coef0=0,degree=3) {
  svm_l.fit <- svm(x = train.x, y=as.factor(train.y), kernel = method)
  predict.y <- predict(svm_l.fit, test.x)
  out <- crossval::confusionMatrix(test.y, predict.y, negative = 0)
  return (out)
}

# Linear kernel
set.seed(123)
cv.out.svml <- crossval::crossval(my.svm, as.data.frame(X), Y, K = 5, B = 1,
                                  method = "linear",cost=tune_svm$best.parameters$cost,verbose = F) 
diagnosticErrors(cv.out.svml$stat)

##       acc      sens      spec       ppv       npv       lor 
## 0.9573460 0.9344262 0.9666667 0.9193548 0.9731544 6.0240527

# Gaussian kernel
set.seed(123)
cv.out.svmg <- crossval::crossval(my.svm, as.data.frame(X), Y, K = 5, B = 1,
                                  method = "radial",cost=tune_svmg$best.parameters$cost,
                                  gamma=tune_svmg$best.parameters$gamma,verbose = F)
diagnosticErrors(cv.out.svmg$stat)

##       acc      sens      spec       ppv       npv       lor 
## 0.9478673 0.9262295 0.9566667 0.8968254 0.9695946 5.6246961

Indeed, both types of kernels yield good quality predictors according to the assessment metrics reported by the diagnosticErrors() method.

1.8.6.5 k-Nearest Neighbors algorithm (k-NN)

As we saw in Chapter 5, k-NN is a non-parametric method for either classification or regression, where the input consists of the k closest training examples in the feature space, but the output depends on whether k-NN is used for classification or regression:

In k-NN classification, the output is a class membership (labels). Objects in the testing data are classified by a majority vote of their neighbors. Each object is assigned to a class that is most common among its k nearest neighbors ($k$ is always a small positive integer). When $k=1$, then an object is assigned to the class of its single nearest neighbor.
In k-NN regression, the output is the property value for the object representing the average of the values of its $k$ nearest neighbors.

Let’s now build the corresponding predfun.knn() method.

# X = as.matrix(input)  # Predictor variables X = as.matrix(input.short2)
#  Y = as.matrix(output)    # Outcome
# KNN (k-nearest neighbors)
library("class")
# knn.fit.test <- knn(X, X, cl = Y, k=3, prob=F); predict(as.matrix(knn.fit.test), X)$class
# table(knn.fit.test, Y); confusionMatrix(Y, knn.fit.test, negative="1")
# This can be used for polytomous variable (multiple classes)

predfun.knn = function(train.x, train.y, test.x, test.y, neg) {
  library("class")
  knn.fit = knn(train.x, test.x, cl = train.y, prob=T)  # knn is already a prediction function!!!
  # ynew = predict(knn.fit, test.x)$class           # no need of another prediction, in this case
  out.knn = confusionMatrix(test.y, knn.fit, negative=neg)
  return( out.knn )
}   
# cv.out.knn = crossval::crossval(predfun.knn, X, Y, K=5, B=2, neg="1")
cv.out.knn = crossval::crossval(predfun.knn, X, Y, K=5, B=2, neg="1")
#Compare all 3 classifiers (lda, qda, knn, and logit)
diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.logit$stat);

We can also examine the performance of k-NN prediction on the PPMI (Parkinson’s disease) data. Start by partitioning the data into training and testing sets.

# TRAINING: 75% of the sample size
sample_size <- floor(0.75 * nrow(input))
## set the seed to make your partition reproducible
set.seed(1234)
input.train.ind <- sample(seq_len(nrow(input)), size = sample_size)
input.train <- input[input.train.ind, ]
output.train <- as.matrix(output)[input.train.ind, ]

# TESTING DATA
input.test <- input[-input.train.ind, ]
output.test <- as.matrix(output)[-input.train.ind, ]

Then fit the k-NN model and report the results.

library("class")
knn_model <- knn(train= input.train, input.test, cl=as.factor(output.train), k=2)
#plot(knn_model)
summary(knn_model)
attributes(knn_model)

# cross-validation
knn_model.cv <- knn.cv(train= input.train, cl=as.factor(output.train), k=2)
summary(knn_model.cv)

1.8.6.6 k-Means Clustering (k-MC)

In Chapter 8, we showed that k-MC aims to partition $n$ observations into $k$ clusters where each observation belongs to the cluster with the nearest mean which acts as a prototype of a cluster. The k-MC partitions the data space into Voronoi cells. In general, there is no computationally tractable solution for this, i.e., the problem is NP-hard. However, there are efficient algorithms that converge quickly to local optima, e.g., expectation-maximization algorithm for Gaussian mixture distribution modeling that we saw in Chapter 8.

kmeans_model <- kmeans(input.train, 2)
table (output.train, kmeans_model$cluster-1)

##             
## output.train   0   1
##            0 173  50
##            1  65  28

layout(matrix(1, 1))
# tiff("C:/Users/Dinov/Desktop/test.tiff", width = 10, height = 10, units = 'in', res = 300)
fpc::plotcluster(input.train, output.train, col = kmeans_model$cluster)
legend("topright", legend=c("0(trueLabel)+0(kmeansLabel)", "1(trueLabel)+0(kmeansLabel)", "0(trueLabel)+1(kmeansLabel)", "1(trueLabel)+1(kmeansLabel)"),
       col=c("red", "black", "black", "red"), 
       text.col=c("red", "black", "black", "red"), 
       pch = 15 , y.intersp=0.6, x.intersp=0.7,
       text.width = 5, cex=1)

# plot_ly(x=~input.train, y=~output.train, color = ~as.factor(kmeans_model$cluster), type="scatter", mode="marker")

input.train <- input.train[ , apply(input.train, 2, var, na.rm=TRUE) != 0]  # remove constant columns

cluster::clusplot(input.train, kmeans_model$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

# par(mfrow=c(10,10))
# # the next figure is very large and will not render in RStudio, you may need to save it as a PDF file!
# # pdf("C:/Users/Dinov/Desktop/test.pdf", width = 50, height = 50)
# # with(ppmi_data[,1:10], pairs(input.train[,1:10], col=c(1:2)[kmeans_model$cluster])) 
# # dev.off()
# with(ppmi_data[,1:10], pairs(input.train[,1:10], col=c(1:2)[kmeans_model$cluster])) 

# Convert all Income features to factors and construct SPLOM plot
df2 <- input.train[,1:10]
col_names <- names(df2)
dims <- as.data.frame(lapply(df2[col_names], factor))

dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL), 
        showupperhalf = FALSE, diagonal = list(visible = FALSE))

# plot(input.train, col = kmeans_model$cluster) 
# points(kmeans_model$centers, col = 1:2, pch = 8, cex = 2)

## cluster centers "fitted" to each obs.:
fitted.kmeans <- fitted(kmeans_model);  head(fitted.kmeans)

##   L_insular_cortex_AvgMeanCurvature L_insular_cortex_ComputeArea
## 1                         0.1076258                     2626.614
## 2                         0.2281071                     1094.980
## 1                         0.1076258                     2626.614
## 1                         0.1076258                     2626.614
## 1                         0.1076258                     2626.614
## 2                         0.2281071                     1094.980
##   L_insular_cortex_Volume L_insular_cortex_ShapeIndex
## 1                7959.068                   0.3260816
## 2                1999.192                   0.2773389
## 1                7959.068                   0.3260816
## 1                7959.068                   0.3260816
## 1                7959.068                   0.3260816
## 2                1999.192                   0.2773389
##   L_insular_cortex_Curvedness R_insular_cortex_AvgMeanCurvature
## 1                   0.1069922                         0.1217837
## 2                   0.2723533                         0.2988653
## 1                   0.1069922                         0.1217837
## 1                   0.1069922                         0.1217837
## 1                   0.1069922                         0.1217837
## 2                   0.2723533                         0.2988653
##   R_insular_cortex_ComputeArea R_insular_cortex_Volume
## 1                    2019.1489                4927.957
## 2                     770.1955                1113.124
## 1                    2019.1489                4927.957
## 1                    2019.1489                4927.957
## 1                    2019.1489                4927.957
## 2                     770.1955                1113.124
##   R_insular_cortex_ShapeIndex R_insular_cortex_Curvedness
## 1                   0.3032212                   0.1278277
## 2                   0.2672094                   0.3530532
## 1                   0.3032212                   0.1278277
## 1                   0.3032212                   0.1278277
## 1                   0.3032212                   0.1278277
## 2                   0.2672094                   0.3530532
##   L_cingulate_gyrus_AvgMeanCurvature L_cingulate_gyrus_ComputeArea
## 1                          0.1325229                      3967.805
## 2                          0.2880964                      1311.079
## 1                          0.1325229                      3967.805
## 1                          0.1325229                      3967.805
## 1                          0.1325229                      3967.805
## 2                          0.2880964                      1311.079
##   L_cingulate_gyrus_Volume L_cingulate_gyrus_ShapeIndex
## 1                10015.619                    0.4463758
## 2                 1835.168                    0.3741087
## 1                10015.619                    0.4463758
## 1                10015.619                    0.4463758
## 1                10015.619                    0.4463758
## 2                 1835.168                    0.3741087
##   L_cingulate_gyrus_Curvedness R_cingulate_gyrus_AvgMeanCurvature
## 1                   0.07632257                          0.1434376
## 2                   0.22731432                          0.2840280
## 1                   0.07632257                          0.1434376
## 1                   0.07632257                          0.1434376
## 1                   0.07632257                          0.1434376
## 2                   0.22731432                          0.2840280
##   R_cingulate_gyrus_ComputeArea R_cingulate_gyrus_Volume
## 1                      3953.795                 9964.057
## 2                      1177.863                 1710.276
## 1                      3953.795                 9964.057
## 1                      3953.795                 9964.057
## 1                      3953.795                 9964.057
## 2                      1177.863                 1710.276
##   R_cingulate_gyrus_ShapeIndex R_cingulate_gyrus_Curvedness
## 1                    0.4379627                   0.09675352
## 2                    0.3367350                   0.27422318
## 1                    0.4379627                   0.09675352
## 1                    0.4379627                   0.09675352
## 1                    0.4379627                   0.09675352
## 2                    0.3367350                   0.27422318
##   L_caudate_AvgMeanCurvature L_caudate_ComputeArea L_caudate_Volume
## 1                  0.2601677              797.0776        1233.5135
## 2                  0.8551616              146.6532         102.8474
## 1                  0.2601677              797.0776        1233.5135
## 1                  0.2601677              797.0776        1233.5135
## 1                  0.2601677              797.0776        1233.5135
## 2                  0.8551616              146.6532         102.8474
##   L_caudate_ShapeIndex L_caudate_Curvedness R_caudate_AvgMeanCurvature
## 1            0.2914226            0.2796301                  0.1894883
## 2            0.3595532            0.6930212                  0.7598691
## 1            0.2914226            0.2796301                  0.1894883
## 1            0.2914226            0.2796301                  0.1894883
## 1            0.2914226            0.2796301                  0.1894883
## 2            0.3595532            0.6930212                  0.7598691
##   R_caudate_ComputeArea R_caudate_Volume R_caudate_ShapeIndex
## 1             1089.4075        1927.4092            0.3055045
## 2              200.8635         175.1637            0.3521822
## 1             1089.4075        1927.4092            0.3055045
## 1             1089.4075        1927.4092            0.3055045
## 1             1089.4075        1927.4092            0.3055045
## 2              200.8635         175.1637            0.3521822
##   R_caudate_Curvedness L_putamen_AvgMeanCurvature L_putamen_ComputeArea
## 1            0.1863632                  0.1704558              1108.874
## 2            0.6260534                  0.5752676               369.652
## 1            0.1863632                  0.1704558              1108.874
## 1            0.1863632                  0.1704558              1108.874
## 1            0.1863632                  0.1704558              1108.874
## 2            0.6260534                  0.5752676               369.652
##   L_putamen_Volume L_putamen_ShapeIndex L_putamen_Curvedness
## 1        2234.7919            0.2827044            0.1959905
## 2         387.9603            0.2993340            0.5815818
## 1        2234.7919            0.2827044            0.1959905
## 1        2234.7919            0.2827044            0.1959905
## 1        2234.7919            0.2827044            0.1959905
## 2         387.9603            0.2993340            0.5815818
##   R_putamen_AvgMeanCurvature R_putamen_ComputeArea R_putamen_Volume
## 1                  0.1282211             1538.4491        3700.6326
## 2                  0.3955008              539.4418         718.2451
## 1                  0.1282211             1538.4491        3700.6326
## 1                  0.1282211             1538.4491        3700.6326
## 1                  0.1282211             1538.4491        3700.6326
## 2                  0.3955008              539.4418         718.2451
##   R_putamen_ShapeIndex R_putamen_Curvedness L_hippocampus_AvgMeanCurvature
## 1            0.3193293            0.1344744                      0.1305658
## 2            0.2755695            0.4745640                      0.2868015
## 1            0.3193293            0.1344744                      0.1305658
## 1            0.3193293            0.1344744                      0.1305658
## 1            0.3193293            0.1344744                      0.1305658
## 2            0.2755695            0.4745640                      0.2868015
##   L_hippocampus_ComputeArea L_hippocampus_Volume L_hippocampus_ShapeIndex
## 1                 1573.5764             3817.140                 0.314922
## 2                  731.5145             1180.315                 0.253671
## 1                 1573.5764             3817.140                 0.314922
## 1                 1573.5764             3817.140                 0.314922
## 1                 1573.5764             3817.140                 0.314922
## 2                  731.5145             1180.315                 0.253671
##   L_hippocampus_Curvedness R_hippocampus_AvgMeanCurvature
## 1                0.1309192                      0.1242315
## 2                0.3950394                      0.2596902
## 1                0.1309192                      0.1242315
## 1                0.1309192                      0.1242315
## 1                0.1309192                      0.1242315
## 2                0.3950394                      0.2596902
##   R_hippocampus_ComputeArea R_hippocampus_Volume R_hippocampus_ShapeIndex
## 1                 1692.6730             4256.232                0.3130925
## 2                  849.2564             1488.660                0.2615934
## 1                 1692.6730             4256.232                0.3130925
## 1                 1692.6730             4256.232                0.3130925
## 1                 1692.6730             4256.232                0.3130925
## 2                  849.2564             1488.660                0.2615934
##   R_hippocampus_Curvedness L_fusiform_gyrus_AvgMeanCurvature
## 1                0.1283778                         0.1135699
## 2                0.3563544                         0.1570402
## 1                0.1283778                         0.1135699
## 1                0.1283778                         0.1135699
## 1                0.1283778                         0.1135699
## 2                0.3563544                         0.1570402
##   L_fusiform_gyrus_ComputeArea L_fusiform_gyrus_Volume
## 1                     3681.950               11910.490
## 2                     2882.248                6772.765
## 1                     3681.950               11910.490
## 1                     3681.950               11910.490
## 1                     3681.950               11910.490
## 2                     2882.248                6772.765
##   L_fusiform_gyrus_ShapeIndex L_fusiform_gyrus_Curvedness
## 1                   0.3041462                   0.0985821
## 2                   0.3158982                   0.1423833
## 1                   0.3041462                   0.0985821
## 1                   0.3041462                   0.0985821
## 1                   0.3041462                   0.0985821
## 2                   0.3158982                   0.1423833
##   R_fusiform_gyrus_AvgMeanCurvature R_fusiform_gyrus_ComputeArea
## 1                         0.1169820                     3535.754
## 2                         0.1529637                     2724.399
## 1                         0.1169820                     3535.754
## 1                         0.1169820                     3535.754
## 1                         0.1169820                     3535.754
## 2                         0.1529637                     2724.399
##   R_fusiform_gyrus_Volume R_fusiform_gyrus_ShapeIndex
## 1               11462.851                   0.3318113
## 2                6923.456                   0.3222729
## 1               11462.851                   0.3318113
## 1               11462.851                   0.3318113
## 1               11462.851                   0.3318113
## 2                6923.456                   0.3222729
##   R_fusiform_gyrus_Curvedness      Sex   Weight      Age chr12_rs34637584_GT
## 1                  0.09852701 1.394958 81.01218 60.68045                   0
## 2                  0.15437168 1.269231 85.65256 62.48391                   0
## 1                  0.09852701 1.394958 81.01218 60.68045                   0
## 1                  0.09852701 1.394958 81.01218 60.68045                   0
## 1                  0.09852701 1.394958 81.01218 60.68045                   0
## 2                  0.15437168 1.269231 85.65256 62.48391                   0
##   chr17_rs11868035_GT chr17_rs11012_GT chr17_rs393152_GT chr17_rs12185268_GT
## 1           0.6260504        0.3193277         0.4033613           0.3697479
## 2           0.5384615        0.4102564         0.5128205           0.4871795
## 1           0.6260504        0.3193277         0.4033613           0.3697479
## 1           0.6260504        0.3193277         0.4033613           0.3697479
## 1           0.6260504        0.3193277         0.4033613           0.3697479
## 2           0.5384615        0.4102564         0.5128205           0.4871795
##   chr17_rs199533_GT chr17_rs199533_DP chr17_rs199533_GQ
## 1         0.3529412          36.99580          72.88235
## 2         0.4743590          45.38462          78.16667
## 1         0.3529412          36.99580          72.88235
## 1         0.3529412          36.99580          72.88235
## 1         0.3529412          36.99580          72.88235
## 2         0.4743590          45.38462          78.16667
##   UPDRS_Part_I_Summary_Score_Baseline UPDRS_Part_I_Summary_Score_Month_03
## 1                            1.013571                            1.508369
## 2                            0.974359                            1.690367
## 1                            1.013571                            1.508369
## 1                            1.013571                            1.508369
## 1                            1.013571                            1.508369
## 2                            0.974359                            1.690367
##   UPDRS_Part_I_Summary_Score_Month_06 UPDRS_Part_I_Summary_Score_Month_09
## 1                            1.272435                            1.270631
## 2                            1.426228                            1.539340
## 1                            1.272435                            1.270631
## 1                            1.272435                            1.270631
## 1                            1.272435                            1.270631
## 2                            1.426228                            1.539340
##   UPDRS_Part_I_Summary_Score_Month_12 UPDRS_Part_I_Summary_Score_Month_18
## 1                            1.085737                            1.495304
## 2                            1.177899                            1.691696
## 1                            1.085737                            1.495304
## 1                            1.085737                            1.495304
## 1                            1.085737                            1.495304
## 2                            1.177899                            1.691696
##   UPDRS_Part_I_Summary_Score_Month_24
## 1                            1.540068
## 2                            1.605211
## 1                            1.540068
## 1                            1.540068
## 1                            1.540068
## 2                            1.605211
##   UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline
## 1                                                   4.425786
## 2                                                   4.012821
## 1                                                   4.425786
## 1                                                   4.425786
## 1                                                   4.425786
## 2                                                   4.012821
##   UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03
## 1                                                   6.952695
## 2                                                   7.073448
## 1                                                   6.952695
## 1                                                   6.952695
## 1                                                   6.952695
## 2                                                   7.073448
##   UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_06
## 1                                                   6.498689
## 2                                                   6.842894
## 1                                                   6.498689
## 1                                                   6.498689
## 1                                                   6.498689
## 2                                                   6.842894
##   UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_09
## 1                                                   6.944167
## 2                                                   7.604399
## 1                                                   6.944167
## 1                                                   6.944167
## 1                                                   6.944167
## 2                                                   7.604399
##   UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12
## 1                                                   5.072632
## 2                                                   5.114479
## 1                                                   5.072632
## 1                                                   5.072632
## 1                                                   5.072632
## 2                                                   5.114479
##   UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18
## 1                                                   7.426062
## 2                                                   8.043127
## 1                                                   7.426062
## 1                                                   7.426062
## 1                                                   7.426062
## 2                                                   8.043127
##   UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24
## 1                                                   6.259172
## 2                                                   5.976416
## 1                                                   6.259172
## 1                                                   6.259172
## 1                                                   6.259172
## 2                                                   5.976416
##   UPDRS_Part_III_Summary_Score_Baseline UPDRS_Part_III_Summary_Score_Month_03
## 1                              15.80387                              22.16758
## 2                              12.55128                              20.97394
## 1                              15.80387                              22.16758
## 1                              15.80387                              22.16758
## 1                              15.80387                              22.16758
## 2                              12.55128                              20.97394
##   UPDRS_Part_III_Summary_Score_Month_06 UPDRS_Part_III_Summary_Score_Month_09
## 1                              22.87287                              22.72272
## 2                              20.37245                              19.97371
## 1                              22.87287                              22.72272
## 1                              22.87287                              22.72272
## 1                              22.87287                              22.72272
## 2                              20.37245                              19.97371
##   UPDRS_Part_III_Summary_Score_Month_12 UPDRS_Part_III_Summary_Score_Month_18
## 1                              17.35737                              23.75503
## 2                              13.38860                              21.69028
## 1                              17.35737                              23.75503
## 1                              17.35737                              23.75503
## 1                              17.35737                              23.75503
## 2                              13.38860                              21.69028
##   UPDRS_Part_III_Summary_Score_Month_24
## 1                              19.73472
## 2                              16.22692
## 1                              19.73472
## 1                              19.73472
## 1                              19.73472
## 2                              16.22692
##   X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline
## 1                                                               5.852568
## 2                                                               6.334990
## 1                                                               5.852568
## 1                                                               5.852568
## 1                                                               5.852568
## 2                                                               6.334990
##   X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_06
## 1                                                               5.537727
## 2                                                               6.043413
## 1                                                               5.537727
## 1                                                               5.537727
## 1                                                               5.537727
## 2                                                               6.043413
##   X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_12
## 1                                                               5.970469
## 2                                                               6.585084
## 1                                                               5.970469
## 1                                                               5.970469
## 1                                                               5.970469
## 2                                                               6.585084
##   X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24
## 1                                                               6.351765
## 2                                                               7.175677
## 1                                                               6.351765
## 1                                                               6.351765
## 1                                                               6.351765
## 2                                                               7.175677
##   X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Baseline
## 1                                                                           5.231092
## 2                                                                           5.384615
## 1                                                                           5.231092
## 1                                                                           5.231092
## 1                                                                           5.231092
## 2                                                                           5.384615
##   X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06
## 1                                                                           5.023530
## 2                                                                           5.032027
## 1                                                                           5.023530
## 1                                                                           5.023530
## 1                                                                           5.023530
## 2                                                                           5.032027
##   X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_12
## 1                                                                           5.200913
## 2                                                                           5.340521
## 1                                                                           5.200913
## 1                                                                           5.200913
## 1                                                                           5.200913
## 2                                                                           5.340521
##   X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_24
## 1                                                                           5.203064
## 2                                                                           5.289329
## 1                                                                           5.203064
## 1                                                                           5.203064
## 1                                                                           5.203064
## 2                                                                           5.289329

resid.kmeans <- (input.train - fitted(kmeans_model))

# define the sum of squares function
ss <- function(data) sum(scale(data, scale = FALSE)^2)

## Equalities 
cbind(kmeans_model[c("betweenss", "tot.withinss", "totss")],    # the same two columns
                c (ss(fitted.kmeans), ss(resid.kmeans),  ss(input.train)))

##              [,1]        [,2]       
## betweenss    16891584021 16891584021
## tot.withinss 12772530290 12772538936
## totss        29664114311 29664114311

# validation
stopifnot(all.equal(kmeans_model$totss, ss(input.train)), all.equal(kmeans_model$tot.withinss, ss(resid.kmeans)), 
      ## these three are the same:
        all.equal(kmeans_model$betweenss, ss(fitted.kmeans)), 
        all.equal(kmeans_model$betweenss, kmeans_model$totss - kmeans_model$tot.withinss), 
      ## and hence also
        all.equal(ss(input.train), ss(fitted.kmeans) + ss(resid.kmeans))
)

## Error: kmeans_model$tot.withinss and ss(resid.kmeans) are not equal:
##   Mean relative difference: 6.768987e-07

# kmeans(input.train, 1)$withinss       
# trivial one-cluster, (its W.SS == ss(input.train))
clust_kmeans2 = kmeans(scale(X), center=X[1:2,],iter.max=100, algorithm='Lloyd')

We get an empty cluster instead of two clusters when we randomly select two points as the initial centers. The way to solve this problem is using k-means++.

# k++ initialize
kpp_init = function(dat, K) {
  x = as.matrix(dat)
  n = nrow(x)
  # Randomly choose a first center
  centers = matrix(NA, nrow=K, ncol=ncol(x))
  centers[1,] = as.matrix(x[sample(1:n, 1),])
  for (k in 2:K) {
    # Calculate dist^2 to closest center for each point
    dists = matrix(NA, nrow=n, ncol=k-1)
    for (j in 1:(k-1)) {
      temp = sweep(x, 2, centers[j,], '-')
      dists[,j] = rowSums(temp^2)
    }
    dists = rowMeans(dists)
    # Draw next center with probability proportional to dist^2
    cumdists = cumsum(dists)
    prop = runif(1, min=0, max=cumdists[n])
    centers[k,] = as.matrix(x[min(which(cumdists > prop)),])
  }
  return(centers)
}
clust_kmeans2_plus = kmeans(scale(X), kpp_init(scale(X), 2), iter.max=100, algorithm='Lloyd')

Now let’s evaluate the model. The first step is to justify the selection of k=2. We use silhouette() in the package cluster. Recall from Chapter 8 that the silhouette value is in the range $[-1,1]$ with negative and positive values corresponding to (mostly) “mis-labeled” and “correctly-labeled” cases, respectively.

clust_k2 = clust_kmeans2_plus$cluster
library(cluster)
# X = as.matrix(input.short2)
# as the data is too large for the silhouette plot, we'll just subsample and plot 100 random cases
subset_int <- sample(nrow(X),100)  #100 cases from 661 total cases
dis = dist(as.data.frame(scale(X[subset_int,])))
sil_k2 = silhouette(clust_k2[subset_int],dis) #best
plot(sil_k2, col=c(1:length(clust_kmeans2_plus$size)))

summary(sil_k2)

## Silhouette of 100 units in 2 clusters from silhouette.default(x = clust_k2[subset_int], dist = dis) :
##  Cluster sizes and average silhouette widths:
##         56         44 
## 0.23255908 0.03090563 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.14149  0.05501  0.14890  0.14383  0.25305  0.34628

mean(sil_k2<0)

## [1] 0.04666667

The result is pretty good. Only a very small number of samples are “mis-clustered” (having negative silhouette value).

Further, you can observe that when k=3 or k=4, the overall silhouette is smaller.

dis = dist(as.data.frame(scale(X)))
clust_kmeans3_plus = kmeans(scale(X), kpp_init(scale(X), 3), iter.max=100, algorithm='Lloyd')
summary(silhouette(clust_kmeans3_plus$cluster,dis))

## Silhouette of 422 units in 3 clusters from silhouette.default(x = clust_kmeans3_plus$cluster, dist = dis) :
##  Cluster sizes and average silhouette widths:
##        176         59        187 
## 0.10389846 0.06762243 0.13880747 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.06382  0.05580  0.11102  0.11430  0.16941  0.29441

clust_kmeans4_plus = kmeans(scale(X), kpp_init(scale(X), 4), iter.max=100, algorithm='Lloyd')
summary(silhouette(clust_kmeans4_plus$cluster,dis))

## Silhouette of 422 units in 4 clusters from silhouette.default(x = clust_kmeans4_plus$cluster, dist = dis) :
##  Cluster sizes and average silhouette widths:
##         78        116        117        111 
## 0.04558664 0.17648759 0.06998387 0.19086006 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.12123  0.05065  0.14215  0.12654  0.20929  0.32875

Then, let’s calculate the unsupervised classification error. Here, $p$ represents the percentage of class $0$ cases, which provides the weighting factor for labeling each cluster.

mat = matrix(1,nrow = length(Y))
p = sum(Y==0)/length(Y)
for (i in 1:2) {
  id = which(clust_k2==i)
  if(sum(Y[id]==0)>length(id)*p)  mat[id] = 0
}
caret::confusionMatrix(factor(Y), factor(mat))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 189 111
##          1   1 121
##                                           
##                Accuracy : 0.7346          
##                  95% CI : (0.6897, 0.7761)
##     No Information Rate : 0.5498          
##     P-Value [Acc > NIR] : 3.789e-15       
##                                           
##                   Kappa : 0.4906          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9947          
##             Specificity : 0.5216          
##          Pos Pred Value : 0.6300          
##          Neg Pred Value : 0.9918          
##              Prevalence : 0.4502          
##          Detection Rate : 0.4479          
##    Detection Prevalence : 0.7109          
##       Balanced Accuracy : 0.7581          
##                                           
##        'Positive' Class : 0               
##

#table(Y, mat)
# caret::confusionMatrix(factor(Y, levels=0:1), factor(mat, levels=0:1))

It achieves $75\%$ accuracy, which is reasonable for unsupervised classification.

Finally, let’s visualize the results by superimposing the data into the first two multidimensional scaling (MDS) dimensions.

library("ggplot2")
mds = as.data.frame(cmdscale(dis, k=2))
mds_temp = cbind( mds, as.factor(clust_k2))
names(mds_temp) = c('V1', 'V2', 'cluster k=2')

# gp_cluster = ggplot(mds_temp, aes(x=V2, y=V1, color=as.factor(clust_k2))) +
#   geom_point(aes(shape = as.factor(Y))) + theme()
# gp_cluster

plot_ly(data=mds_temp, x=~V1, y=~V2, color=~mds_temp[,3], symbol=~as.factor(Y), type="scatter", mode="markers",
        marker=list(size=15)) %>%
    layout(title = "MDS Performance",
                            hovermode = "x unified", legend = list(orientation='h'))

1.8.6.7 Spectral Clustering

Expanding on the spectral clustering approach we covered in the previous Chapter 8, suppose the multivariate dataset is represented as a set of data points $A$. We can define a similarity matrix $S={s_{(i, j)}}$, where $s_{(i, j)}$ represents a measure of the similarity between points $i, j\in A.$ Spectral clustering uses the spectrum of the similarity matrix of the high-dimensional data and perform dimensionality reduction for clustering into fewer dimensions. The spectrum of a matrix is the set of its eigenvalues. In general, if $T:\Omega \stackrel{\text{linear operator}}{\rightarrow} \Omega$ maps a vector space $\Omega$ into itself, its spectrum is the set of scalars $\lambda=\{\lambda_i\}$ such that $(T-\lambda I)v=0$, where $I$ is the identity matrix and $v$ are the eigen-vectors (or eigen-functions) for the operator $T$. The determinant of the matrix equals the product of its eigenvalues, i.e., $det(T)=\Pi_i \lambda_i$ , the trace of the matrix $tr(T)=\Sigma_i \lambda_i$, and the pseudo-determinant for a singular matrix is the product of its nonzero eigenvalues, $pseudo_{det}(T)=\Pi_{\lambda_i \neq 0}\lambda_i.$

To partition the data points into two sets $(S_1, S_2)$ suppose $v$ is the second-smallest eigenvector of the Laplacian matrix:

\[L = I - D^{-\frac{1}{2}}SD^{\frac{1}{2}}\]

of the similarity matrix $S$, where $D$ is the diagonal matrix $D_{i, i}=\Sigma_j S_{i, j}.$

This actual $(S_1, S_2)$ partitioning of the cases in the data may be done in different ways. For instance, $S_1$ may use the median $m$ of the components in $v$ and group all data points whose component in $v$ is greater than $m$. Then the remaining cases can be labeled as part of $S_2$. This approach may be used iteratively for hierarchical clustering by repeatedly partitioning the subsets.

The $specc$ method in the $kernlab$ package implements a spectral clustering algorithm where the data-clustering is performed by embedding the data into the subspace of the eigenvectors of an affinity matrix.

# install.packages("kernlab")
library("kernlab")

# review and choose a dataset (for example the Iris data
data()
#plot(iris)

Let’s look at a few simple cases of spectral clustering. We are suppressing some of the outputs to save space (e.g., #plot(my_data, col= data_sc)).

1.8.6.8 Iris petal data

Let’s look at the iris dataset we saw in Chapter 2.

library(dplyr)
library(plotly)
my_data <- as.data.frame(iris); dim(my_data)
my_data <- as.matrix(mutate_all(my_data[, -5], function(x) as.numeric(as.character(x))))

library(kernlab)
num_clusters <- 3
data_sc <- specc(my_data, centers= num_clusters)
data_sc
centers(data_sc)
withinss(data_sc)
#plot(my_data, col= data_sc)
# plot(iris[,1:2], col=data_sc, pch=as.numeric(iris$Species), lwd=2)
# legend("topright", legend=c("setosa", "versicolor", "virginica"), col=c("black", "red", "green"), 
#       lty=as.numeric(iris$Species), cex=1.2)
# legend("bottomright", legend=c("Clust1", "Clust2", "Clust3"), pch=unique(as.numeric(iris$Species)), 
#        cex=1.2)
legend <- c("setosa", "versicolor", "virginica")
plot_ly(x=~iris[,1], y=~iris[,2], type="scatter", mode="markers", color=~as.factor(data_sc@.Data), 
        marker=list(size=15), symbol=~as.numeric(iris$Species)) %>%
  layout(title="Spectral Clustering (Iris Data)", xaxis=list(title=colnames(iris)[1]), yaxis=list(title=colnames(iris)[2]))

1.8.6.9 Spirals data

Let’s look at another simple example using the kernlab::spirals dataset.

# library("kernlab")
data(spirals)
num_clusters <- 2
data_sc <- specc(spirals, centers= num_clusters)
data_sc

## Spectral Clustering object of class "specc" 
## 
##  Cluster memberships: 
##  
## 2 2 1 1 2 1 1 1 2 1 1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 1 2 1 1 2 1 2 1 2 2 1 1 1 1 2 2 2 2 2 1 2 1 2 2 1 1 1 2 2 2 2 2 1 1 2 1 2 2 2 1 1 2 1 1 1 2 2 2 2 1 2 1 2 1 2 2 2 2 2 2 1 2 2 1 1 1 2 1 1 1 1 2 2 2 1 1 2 1 1 1 2 1 2 2 2 2 1 1 2 2 1 2 2 2 1 2 1 2 2 2 2 2 1 1 1 1 1 2 2 2 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 1 1 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1 1 2 2 2 1 1 1 2 2 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 1 2 1 1 1 1 1 2 1 2 1 2 1 2 2 2 1 1 1 1 2 2 2 1 2 2 1 1 1 1 1 2 2 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 1 2 1 2 1 2 2 1 1 1 2 1 2 2 2 2 1 1 2 1 2 1 1 1 2 1 2 1 1 2 2 1 1 1 2 
##  
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  576.800192957636 
## 
## Centers:  
##             [,1]       [,2]
## [1,] -0.01770984  0.1775137
## [2,]  0.01997201 -0.1761483
## 
## Cluster size:  
## [1] 150 150
## 
## Within-cluster sum of squares:  
## [1] 118.1182 117.3429

centers(data_sc)

##             [,1]       [,2]
## [1,] -0.01770984  0.1775137
## [2,]  0.01997201 -0.1761483

withinss(data_sc)

## [1] 118.1182 117.3429

# plot(spirals, col= data_sc)

clusterNames <- paste0("Cluster ", data_sc@.Data)

plot_ly(x=~spirals[,1], y=~spirals[,2], type="scatter", mode="markers", color = ~data_sc@.Data, name=clusterNames) %>% hide_colorbar()

1.8.6.10 Income data

A kernlab::income datasets contains customer income information from a marketing survey.

data(income)
num_clusters <- 2
data_sc <- specc(income, centers= num_clusters)
data_sc

## Spectral Clustering object of class "specc" 
## 
##  Cluster memberships: 
##  
## 1 2 2 1 1 2 1 2 2 2 2 2 1 2 
##  
## String kernel function.  Type =  spectrum 
##  Hyperparameters : sub-sequence/string length =  4 
##  Normalized 
## 
## Cluster size:  
## [1] 5 9

centers(data_sc)

##      [,1]
## [1,]   NA

withinss(data_sc)

## logical(0)

# plot(income, col= data_sc)

# Convert all Income features to factors and construct SPLOM plot
df2 <- income
col_names <- names(df2)
dims <- as.data.frame(lapply(df2[col_names], factor))

dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL), 
        showupperhalf = FALSE, diagonal = list(visible = FALSE))

1.8.7 Comparing multiple classifiers

Now let’s compare all eight classifiers (AdaBoost, LDA, QDA, knn, logit, Neural Network, linear SVM and Gaussian SVM) we presented above using the PPMI dataset (06_PPMI_ClassificationValidationData_Short.csv).

# ppmi_data <-read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
# get AdaBoost CV results
set.seed(123)
cv.out.ada <- crossval::crossval(my.ada, as.data.frame(X), Y, K = 5, B = 1, negative = neg)

## Number of folds: 5 
## Total number of CV fits: 5 
## 
## Round # 1 of 1 
## CV Fit # 1 of 5 
## CV Fit # 2 of 5 
## CV Fit # 3 of 5 
## CV Fit # 4 of 5 
## CV Fit # 5 of 5

# get k-Means CV results
my.kmeans <- function (train.x, train.y, test.x, test.y, negative, formula){
  kmeans.fit <- kmeans(scale(test.x), kpp_init(scale(test.x), 2), iter.max=100, algorithm='Lloyd')
  predict.y <- kmeans.fit$cluster
  #count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
 # negative  is the label of a negative "null" sample (default: "control").
  return (out)
}
set.seed(123)
cv.out.kmeans <- crossval::crossval(my.kmeans, as.data.frame(X), Y, K = 5, B = 2, negative = neg)

## Number of folds: 5 
## Total number of CV fits: 10 
## 
## Round # 1 of 2 
## CV Fit # 1 of 10 
## CV Fit # 2 of 10 
## CV Fit # 3 of 10 
## CV Fit # 4 of 10 
## CV Fit # 5 of 10 
## 
## Round # 2 of 2 
## CV Fit # 6 of 10 
## CV Fit # 7 of 10 
## CV Fit # 8 of 10 
## CV Fit # 9 of 10 
## CV Fit # 10 of 10

# get spectral clustering CV results
my.sc <- function (train.x, train.y, test.x, test.y, negative, formula){
  sc.fit <- specc(scale(test.x), centers= 2)
  predict.y <- sc.fit@.Data
  #count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
 # negative  is the label of a negative "null" sample (default: "control").
  return (out)
}
set.seed(123)
cv.out.sc <- crossval::crossval(my.sc, as.data.frame(X), Y, K = 5, B = 2, negative = neg)

## Number of folds: 5 
## Total number of CV fits: 10 
## 
## Round # 1 of 2 
## CV Fit # 1 of 10 
## CV Fit # 2 of 10 
## CV Fit # 3 of 10 
## CV Fit # 4 of 10 
## CV Fit # 5 of 10 
## 
## Round # 2 of 2 
## CV Fit # 6 of 10 
## CV Fit # 7 of 10 
## CV Fit # 8 of 10 
## CV Fit # 9 of 10 
## CV Fit # 10 of 10

library(knitr)
res_tab=rbind(diagnosticErrors(cv.out.ada$stat),diagnosticErrors(cv.out.lda$stat),diagnosticErrors(cv.out.qda$stat),diagnosticErrors(cv.out.knn$stat),diagnosticErrors(cv.out.logit$stat),diagnosticErrors(cv.out.nn$stat),diagnosticErrors(cv.out.svml$stat),diagnosticErrors(cv.out.svmg$stat),diagnosticErrors(cv.out.kmeans$stat),diagnosticErrors(cv.out.sc$stat))
rownames(res_tab) <- c("AdaBoost", "LDA", "QDA", "knn", "logit", "Neural Network", "linear SVM", "Gaussian SVM", "k-Means", "Spectral Clustering")
kable(res_tab,caption = "Compare Result")

Compare Result
	acc	sens	spec	ppv	npv	lor
AdaBoost	0.9668246	0.9866667	0.9180328	0.9673203	0.9655172	6.7199789
LDA	0.9606635	0.9515000	0.9831967	0.9928696	0.8918216	7.0457111
QDA	0.9395735	0.9550000	0.9016393	0.9597990	0.8906883	5.2706226
knn	0.6137441	0.7400000	0.3032787	0.7231270	0.3217391	0.2142352
logit	0.9478673	0.9533333	0.9344262	0.9727891	0.8906250	5.6736914
Neural Network	0.9407583	0.9016393	0.9566667	0.8943089	0.9598662	5.3101066
linear SVM	0.9573460	0.9344262	0.9666667	0.9193548	0.9731544	6.0240527
Gaussian SVM	0.9478673	0.9262295	0.9566667	0.8968254	0.9695946	5.6246961
k-Means	0.4241706	0.3866667	0.5163934	0.6628571	0.2550607	-0.3957483
Spectral Clustering	0.4727488	0.4683333	0.4836066	0.6904177	0.2700229	-0.1924337

Leaving knn, kmeans and specc aside, the other methods achieve pretty good results. With these data, the reason for the suboptimal results of some clustering methods may be rooted in lack of training (e.g., specc and kmeans) or the curse of (high) dimensionality, which we saw in Chapter 4. As the data is super sparse, predicting from the nearest neighbors may not be too reliable.

		Actual Condition		Test Interpretation
		Absent (\(H_0\) is true)	Present (\(H_1\) is true)	Test Interpretation
Test Result	Negative (fail to reject \(H_0\))	TN Condition absent + Negative result = True (accurate) Negative	FN Condition present + Negative result = False (invalid) Negative Type II error* (proportional to \(\beta\))*	\(NPV\) \(=\frac{TN}{TN+FN}\)
Test Result	Positive (reject \(H_0\))	FP Condition absent + Positive result = False Positive Type I error (\(\alpha\))	TP Condition Present + Positive result = True Positive	\(PPV=Precision\) \(=\frac{TP}{TP+FP}\)
Test Interpretation	\(Power =1-\beta\) \(= 1-\frac{FN}{FN+TP}\)	\(Specificity=\frac{TN}{TN+FP}\)	\(Power=Recall=Sensitivity\) \(=\frac{TP}{TP+FN}\)	\(LOR=\ln\left (\frac{TN\times TP}{FP\times FN}\right )\)

Predicted \ Observed	Minor	Severe	Row Sum
Minor	\(A=143\)	\(B=71\)	\(A+B=214\)
Severe	\(C=72\)	\(D=157\)	\(C+D=229\)
Column Sum	\(A+C=215\)	\(B+D=228\)	\(A+B+C+D=443\)

DSPA2: Data Science and Predictive Analytics (UMich HS650)

Model Performance Assessment, Validation, and Improvement

SOCR/MIDAS (Ivo Dinov)

February 2025