| SOCR ≫ | DSPA ≫ | DSPA2 Topics ≫ |
In previous chapters, we used several measures, such as prediction accuracy, to evaluate classification and regression models. In general, accurate predictions for one dataset does not necessarily imply that our model is perfect or that it will reproduce when tested on external or prospective data. We need additional metrics to evaluate the model performance and ensure it is robust, reproducible, reliable, and unbiased.
In this chapter, we will (1) discuss various evaluation strategies for prediction, clustering, classification, regression, and decision-making, (2) demonstrate performance visualization, e.g., visualization of ROC curves, (3) discuss performance tradeoffs, and (4) present internal statistical cross-validation and bootstrap sampling.
Prediction accuracy represents just one evaluation aspect of classification model performance and reliability assessment of clustering methods. Different classification models and alternative clustering techniques may be appropriate for different situations. For example, when screening newborns for rare genetic defects, we may want the model to have as few true-negatives as possible. We don’t want to classify anyone as “non-carriers” when they actually may have a defective gene, since early treatment might impact near- and long-term life experiences.
We can use the following three types of data to evaluate the performance of a classifier model.
We have already seen examples of these cases. For instance, the last
type of validation relies on the predict(model, test_data)
function that we used previously in the classification and regression
chapters (Chapter 3-6).
Let’s revisit some of the models and test data we discussed in Chapter
5 - Inpatient
Head and Neck Cancer Medication data. We will demonstrate prediction
probability estimation using this case-study CaseStudy14_HeadNeck_Cancer_Medication.csv.
hn_med <- read.csv("https://umich.instructure.com/files/1614350/download?download_frd=1",
stringsAsFactors = FALSE)
hn_med$seer_stage <- factor(hn_med$seer_stage)
library(tm)
hn_med_corpus <- Corpus(VectorSource(hn_med$MEDICATION_SUMMARY))
corpus_clean <- tm_map(hn_med_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
hn_med_dtm <- DocumentTermMatrix(corpus_clean)
hn_med_train <- hn_med[1:562, ]
hn_med_test <- hn_med[563:662, ]
hn_med_dtm_train <- hn_med_dtm[1:562, ]
hn_med_dtm_test <- hn_med_dtm[563:662, ]
corpus_train <- corpus_clean[1:562]
corpus_test <- corpus_clean[563:662]
hn_med_train$stage <- hn_med_train$seer_stage %in% c(4, 5, 7)
hn_med_train$stage <- factor(hn_med_train$stage, levels=c(F, T),
labels = c("early_stage", "later_stage"))
hn_med_test$stage <- hn_med_test$seer_stage %in% c(4, 5, 7)
hn_med_test$stage <- factor(hn_med_test$stage, levels=c(F, T),
labels = c("early_stage", "later_stage"))
convert_counts <- function(x) {
x <- ifelse(x > 0, 1, 0)
x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
return(x)
}
hn_med_dict <- as.character(findFreqTerms(hn_med_dtm_train, 5))
hn_train <- DocumentTermMatrix(corpus_train, list(dictionary=hn_med_dict))
hn_test <- DocumentTermMatrix(corpus_test, list(dictionary=hn_med_dict))
hn_train <- apply(hn_train, MARGIN = 2, convert_counts)
hn_test <- apply(hn_test, MARGIN = 2, convert_counts)
library(e1071)
hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)## early_stage later_stage
## [1,] 0.8812634 0.11873664
## [2,] 0.8812634 0.11873664
## [3,] 0.8425828 0.15741724
## [4,] 0.9636007 0.03639926
## [5,] 0.8654298 0.13457022
## [6,] 0.8654298 0.13457022
The above output includes the prediction probabilities for the first
6 rows of the data. This example is based on Naive
Bayes classifier (naiveBayes), however the same approach works for
any other machine learning classification or prediction technique. The
type="raw" indicates that the prediction call will return
the conditional a-posterior probabilities for each class. When
type="class", predict() returns the class
label corresponding to the maximal probability.
In addition, we can report the predicted probability with the outputs
of the Naive Bayesian decision-support system
(hn_classifier <- naiveBayes(hn_train, hn_med_train$stage)):
##
## "pred_nb" "early_stage" "later_stage"
##
## 96 4
The general predict() method automatically subclasses to
the specific
predict.naiveBayes(object, newdata, type = c("class", "raw"), threshold = 0.001, ...)
call where type="raw" and type = "class"
specify the output as the conditional a-posterior probabilities for each
class or the class with maximal probability, respectively. Back in Chapter
5, we discussed the C5.0 and the
randomForest classifiers to predict the chronic disease
score in a another case-study Quality
of Life (QoL).
qol <- read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
qol <- qol[!qol$CHRONICDISEASESCORE==-9, ]
qol$cd <- qol$CHRONICDISEASESCORE>1.497
qol$cd <- factor(qol$cd, levels=c(F, T), labels = c("minor_disease", "severe_disease"))
qol <- qol[order(qol$ID), ]
# Remove ID (col=1) # the clinical Diagnosis (col=41) will be handled later
qol <- qol[ , -1]
# 80-20% training-testing data split
set.seed(1234)
train_index <- sample(seq_len(nrow(qol)), size = 0.8*nrow(qol))
qol_train <- qol[train_index, ]
qol_test <- qol[-train_index, ]
library(C50)
set.seed(1234)
qol_model <- C5.0(qol_train[,-c(40, 41)], qol_train$cd)Below are the (probability) results of the C5.0
classification tree model prediction.
## minor_disease severe_disease
## 1 0.3762353 0.6237647
## 3 0.3762353 0.6237647
## 4 0.2942230 0.7057770
## 9 0.7376664 0.2623336
## 10 0.3762353 0.6237647
## 12 0.3762353 0.6237647
These can be contrasted against the C5.0 tree
classification label results.
## [1] severe_disease severe_disease severe_disease minor_disease severe_disease
## [6] severe_disease
## Levels: minor_disease severe_disease
##
## "pred_tree" "minor_disease" "severe_disease"
##
## 214 229
The same complementary types of outputs can be reported for most machine learning classification and prediction approaches.
In Chapter 5, we saw an attempt to categorize the supervised classification and unsupervised clustering methods. Similarly, the table below summarizes the basic types of evaluation and validation strategies for different forecasting, prediction, ensembling, and clustering techniques. (Internal) Statistical Cross Validation or external validation should always be applied to ensure reliability and reproducibility of the results. The SciKit clustering performance evaluation and Classification metrics sections provide details about many alternative techniques and metrics for performance evaluation of clustering and classification methods.
| Inference | Outcome | Evaluation Metrics | R functions |
|---|---|---|---|
| Classification & Prediction | Binary | Accuracy, Sensitivity, Specificity, PPV/Precision, NPV/Recall, LOR | caret::confusionMatrix,
gmodels::CrossTable, cluster::silhouette |
| Classification & Prediction | Categorical | Accuracy, Sensitivity/Specificity, PPV, NPV, LOR, Silhouette Coefficient | caret::confusionMatrix,
gmodels::CrossTable, cluster::silhouette |
| Regression Modeling | Real Quantitative | correlation coefficient, \(R^2\), RMSE, Mutual Information, Homogeneity and Completeness Scores | cor, metrics::mse |
More details about binary test assessment is available on the Scientific Methods for Health Sciences (SMHS) EBook site. The table below summarizes the key measures commonly used to evaluate the performance of binary tests, classifiers, or predictions.
| Actual Condition | Test Interpretation | ||||
| Absent (\(H_0\) is true) | Present (\(H_1\) is true) | ||||
| Test Result |
Negative (fail to reject \(H_0\)) |
TN Condition absent + Negative result = True (accurate) Negative |
FN Condition present + Negative result = False (invalid) Negative Type II error (proportional to \(\beta\)) |
\(NPV\) \(=\frac{TN}{TN+FN}\) | |
|
Positive (reject \(H_0\)) |
FP Condition absent + Positive result = False Positive Type I error (\(\alpha\)) |
TP Condition Present + Positive result = True Positive |
\(PPV=Precision\) \(=\frac{TP}{TP+FP}\) | ||
| Test Interpretation |
\(Power =1-\beta\) \(= 1-\frac{FN}{FN+TP}\) |
\(Specificity=\frac{TN}{TN+FP}\) |
\(Power=Recall=Sensitivity\) \(=\frac{TP}{TP+FN}\) |
\(LOR=\ln\left (\frac{TN\times TP}{FP\times FN}\right )\) | |
See also SMHS EBook, Power, Sensitivity and Specificity section.
In ML and AI, the concepts of cross table, contingency table, and confusion matrix are often used to quantify the performance of a classifier. A contingency table represents a statistical cross tabulation that summarizes the multivariate frequency distribution of two (or more) categorical variables. In the special case of cross tabulating the performance of an AI classifier relative to ground truth, this comparison is referred to as a confusion matrix, for supervised learning, or a matching matrix, for unsupervised learning. Typically, the rows and columns in these cross tables represent observed instances in predicted and actual class labels, respectively.
We already saw some confusion matrices in Chapter 8. For binary classes, these will be just \(2\times 2\) matrices where each of the cells represents the agreement, match, or discrepancy between the real and predicted class labels, indexed by the row and column indices.
Graph \(2\times 2\) table:
| predict_T | predict_F | |
|---|---|---|
| TRUE | TP | TN |
| FALSE | FP | FN |
Using confusion matrices to measure performance: The way we calculate accuracy using these four cells is summarized by the following formula
\[accuracy=\frac{TP+TN}{TP+TN+FP+FN}=\frac{TP+TN}{\text{Total number of observations}}\ .\]
On the other hand, the error rate, or proportion of incorrectly classified observations is calculated using:
\[error rate=\frac{FP+FN}{TP+TN+FP+FN}==\frac{FP+FN}{\text{Total number of observations}}=1-accuracy\ .\]
If we look at the numerator and denominator carefully, we can see that the error rate and accuracy add up to 1. Therefore, a 95% accuracy means 5% error rate.
In R, we have multiple ways to obtain a confusion table.
The simplest way would be table(). For example, in Chapter
5, to get a plain \(2\times 2\)
table reporting the agreement between the real clinical cancer labels
and their machine learning predicted counterparts, we used:
##
## hn_test_pred early_stage later_stage
## early_stage 73 23
## later_stage 4 0
The reason we sometimes use the gmodels::CrossTable()
function, e.g., see Chapter
5, is because it reports additional information about the model
performance.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | hn_med_test$stage
## hn_test_pred | early_stage | later_stage | Row Total |
## -------------|-------------|-------------|-------------|
## early_stage | 73 | 23 | 96 |
## | 0.011 | 0.038 | |
## | 0.760 | 0.240 | 0.960 |
## | 0.948 | 1.000 | |
## | 0.730 | 0.230 | |
## -------------|-------------|-------------|-------------|
## later_stage | 4 | 0 | 4 |
## | 0.275 | 0.920 | |
## | 1.000 | 0.000 | 0.040 |
## | 0.052 | 0.000 | |
## | 0.040 | 0.000 | |
## -------------|-------------|-------------|-------------|
## Column Total | 77 | 23 | 100 |
## | 0.770 | 0.230 | |
## -------------|-------------|-------------|-------------|
##
##
The second entry in each cell of the crossTable table
reports the Chi-square contribution. This uses the standard Chi-Square
formula for computing relative discrepancy between observed
and expected counts. For instance, the Chi-square contribution
of cell(1,1), (hn_med_test$stage=early_stage and
hn_test_pred=early_stage), can be computed as follows from
the \(\frac{(Observed-Expected)^2}{Expected}\)
formula. Assuming independence between the rows and columns (i.e.,
random classification), the expected cell(1,1) value is
computed as the product of the corresponding row (96) and column (77)
marginal counts, \(\frac{96\times
77}{100}\). Thus the Chi-square value for cell(1,1) is:
\[\text{Chi-square cell(1,1)} = \frac{(Observed-Expected)^2}{Expected}=\] \[=\frac{\left (73-\frac{96\times 77}{100}\right ) ^2}{\frac{96\times 77}{100}}=0.01145022\ .\]
Note, that each cell Chi-square value represents one of the four (in this case) components of the Chi-square test-statistics, which tries to answer the question if there is no association between observed and predicted class labels. That is, under the null-hypothesis there is no association between actual and observed counts for each level of the factor variable, which allows us to quantify whether the derived classification jibes with the real class annotations (labels). The aggregate sum of all Chi-square values represents the \(\chi_o^2 = \sum_{all-categories}{(O-E)^2 \over E} \sim \chi_{(df)}^2\) statistics, where \(df = (\# rows - 1)\times (\# columns - 1)\).
Using either table (CrossTable, confusionMatrix), we can calculate accuracy and error rate by hand.
## [1] 0.73
## [1] 0.27
## [1] 0.27
For matrices that are larger than \(2\times 2\), all diagonal elements count the observations that are correctly classified and the off-diagonal elements represent incorrectly labeled cases.
So far we discussed two performance methods - table()
and CrossTable(). A third function is
caret::confusionMatrix() which provides the easiest way to
report model performance. Notice that the first argument is an
actual vector of the labels, i.e., \(Test\_Y\) and the second argument, of the
same length, represents the vector of predicted labels.
This example was presented as the first case-study in Chapter 5.
library(caret)
qol_pred <- predict(qol_model, qol_test)
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")## Confusion Matrix and Statistics
##
##
## qol_pred minor_disease severe_disease
## minor_disease 143 71
## severe_disease 72 157
##
## Accuracy : 0.6772
## 95% CI : (0.6315, 0.7206)
## No Information Rate : 0.5147
## P-Value [Acc > NIR] : 3.001e-12
##
## Kappa : 0.3538
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.6886
## Specificity : 0.6651
## Pos Pred Value : 0.6856
## Neg Pred Value : 0.6682
## Prevalence : 0.5147
## Detection Rate : 0.3544
## Detection Prevalence : 0.5169
## Balanced Accuracy : 0.6769
##
## 'Positive' Class : severe_disease
##
In Chapter 8 we already saw the Silhouette coefficient, which captures the shape of the clustering boundaries. It is a function of the intracluster distance of a sample in the dataset (\(i\)). Recall that
Then, the Silhouette coefficient for a sample point \(i\) is:
\[-1\leq Silhouette(i)=\frac{l_i-d_i}{\max(l_i,d_i)} \leq 1.\] For interpreting the Silhouette of point \(i\), we use:
\[Silhouette(i) \approx \begin{cases} -1 & \text{sample } i \text{ is closer to a neighboring cluster} \\ 0 & \text {the sample } i \text{ is near the border of its cluster, i.e., } i \text{ represents the closest point in its cluster to the rest of the dataset clusters} \\ 1 & \text {the sample } i \text{ is near the center of its cluster} \end{cases}.\]
The mean Silhouette value represents the arithmetic average of all Silhouette coefficients (either within a cluster, or overall) and represents the quality of the cluster (clustering). High mean Silhouette corresponds to compact clustering (dense and separated clusters), whereas low values represent more diffused clusters. The Silhouette value is useful when the number of predicted clusters is smaller than the number of samples.
The Kappa statistic was originally
developed to measure the reliability between two human raters. It
can be harnessed in machine learning applications to compare the
accuracy of a classifier, where one rater represents the
ground truth (for labeled data, these are the actual values of each
instance) and the second rater represents the results of
the automated machine learning classifier. The order of listing the
raters is irrelevant.
Kappa statistic measures the possibility of a correct prediction by chance alone and answers the question of How much better is the agreement (between the ground truth and the machine learning prediction) than would be expected by chance alone? Its value is between \(0\) and \(1\). When \(\kappa=1\), we have a perfect agreement between a computed prediction (typically the result of a model-based or model-free technique forecasting an outcome of interest) and an expected prediction (typically random, by-chance, prediction). A common interpretation of the Kappa statistics includes:
In the above confusionMatrix output, we have a fair
agreement. For different problems, we may have different interpretations
of Kappa statistics.
To understand Kappa statistic better, let’s look at its definition.
| Predicted \ Observed | Minor | Severe | Row Sum |
|---|---|---|---|
| Minor | \(A=143\) | \(B=71\) | \(A+B=214\) |
| Severe | \(C=72\) | \(D=157\) | \(C+D=229\) |
| Column Sum | \(A+C=215\) | \(B+D=228\) | \(A+B+C+D=443\) |
In this table, \(A=143, B=71, C=72, D=157\) denote the frequencies (counts) of cases within each of the cells in the \(2\times 2\) design. Then
\[ObservedAgreement = (A+D)=300.\]
\[ExpectedAgreement = \frac{(A+B)\times (A+C)+(C+D)\times (B+D)}{A+B+C+D}=221.72.\]
\[(Kappa)\ \kappa = \frac{(ObservedAgreement) – (ExpectedAgreement)} {(A+B+C+D) – (ExpectedAgreement)}=0.35.\]
In this manual calculation of kappa statistics (\(\kappa\)) we used the corresponding values
we saw earlier in the Quality of Life (QoL) case-study, where
chronic-disease binary outcome
qol$cd<-qol$CHRONICDISEASESCORE>1.497, and we used
the cd prediction (qol_pred).
##
## qol_pred minor_disease severe_disease
## minor_disease 143 71
## severe_disease 72 157
According to the above table, high agreement between actual and predicted clinical diagnosis (CD) corresponds to the high algorithmic performance (accuracy).
A=143; B=71; C=72; D=157
# A+B+ C+D # 443
# ((A+B)*(A+C)+(C+D)*(B+D))/(A+B+C+D) # 221.7201
EA=((A+B)*(A+C)+(C+D)*(B+D))/(A+B+C+D) # Expected accuracy
OA=A+D; OA # Observed accuracy## [1] 300
## [1] 0.3537597
# Compare against the official kappa score
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$overall[2] # report official Kappa## Kappa
## 0.3537597
The manually and automatically computed accuracies coincide (\(\sim 0.35\)).
Let’s now look at computing Kappa. It may be trickier to
obtain the expected agreement. Probability
rules tell us that the probability of the union of two disjoint
events equals the sum of the individual (marginal) probabilities
for these two events. Thus, we have:
round(confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$overall[2], 2) # report official Kappa## Kappa
## 0.35
We get a similar value in the confusionTable() output. A
more straightforward way of getting the Kappa statistics is by using the
Kappa() function in the vcd package.
## value ASE z Pr(>|z|)
## Unweighted 0.3538 0.04446 7.957 1.76e-15
## Weighted 0.3538 0.04446 7.957 1.76e-15
The combination of Kappa() and table
function yields a \(2\times 4\) matrix.
The Kappa statistic is under the unweighted value.
Generally speaking, predicting a severe disease outcome is a more critical problem than predicting a mild disease state. Thus, weighted Kappa is also useful. We give the severe disease a higher weight. The Kappa test result is not acceptable since the classifier may make too many mistakes for the severe disease cases. The Kappa value is \(0.26374\). Notice that the range of weighted Kappa may exceed [0,1].
## value ASE z Pr(>|z|)
## Unweighted 0.353760 0.04446 7.95721 1.760e-15
## Weighted -0.004386 0.04667 -0.09397 9.251e-01
When the predicted value is the first argument, the row and column names represent the true labels and the predicted labels, respectively.
##
## qol_pred minor_disease severe_disease
## minor_disease 143 71
## severe_disease 72 157
Kappa compares an Observed classification accuracy
(output of our ML classifier) with an Expected classification
accuracy (corresponding to random chance classification). It
may be used to evaluate single classifiers and/or to compare among a set
of different classifiers. It takes into account random chance (agreement
with a random classifier). That makes Kappa more
meaningful than simply using the accuracy as a single
quality metric. For instance, the interpretation of an
Observed Accuracy of 80% is relative to
the Expected Accuracy.
Observed Accuracy of 80% is more impactful for an
Expected Accuracy of 50% compared to
Expected Accuracy of 75%.
Take a closer look at the confusionMatrix() output where
we can find two important statistics - “sensitivity” and
“specificity”.
Sensitivity or true positive rate measures the proportion of “success” observations that are correctly classified. \[sensitivity=\frac{TP}{TP+FN}.\] Notice \(TP+FN\) are the total number of true “success” observations.
On the other hand, specificity or true negative rate measures the proportion of “failure” observations that are correctly classified. \[specificity=\frac{TN}{TN+FP}.\] Accordingly, \(TN+FP\) are the total number of true “failure” observations.
In the QoL data, considering “severe_disease” as “success” and using
the table() output we can manually compute the
sensitivity and specificity, as well as
precision and recall (below):
## [1] 0.5954545
## [1] 0.6681614
Another R package caret also provides functions to
directly calculate the sensitivity and specificity.
## [1] 0.6885965
# specificity(qol_pred, qol_test$cd)
confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$byClass[1] # another way to report the sensitivity## Sensitivity
## 0.6885965
# confusionMatrix(table(qol_pred, qol_test$cd), positive="severe_disease")$byClass[2] # another way to report the specificitySensitivity and specificity both range from 0 to 1. For either measure, a value of 1 implies that the positive and negative predictions are very accurate. However, simultaneously high sensitivity and specificity may not be attainable in real world situations. There is a tradeoff between sensitivity and specificity. To compromise, some studies loosen the demands on one and focus on achieving high values on the other.
Very similar to sensitivity, precision measures the proportion of true “success” observations among predicted “success” observations.
\[precision=\frac{TP}{TP+FP}.\]
Recall is the proportion of true “failures” among all “failures”. A model with high recall captures most “interesting” cases.
\[recall=\frac{TP}{TP+FN}.\]
Again, let’s calculate these by hand for the QoL data, and report the Area under the ROC Curve (AUC).
## [1] 0.6855895
## [1] 0.6885965
library (ROCR)
library(plotly)
qol_pred<-predict(qol_model, qol_test)
qol_pred <- predict(qol_model, qol_test, type = 'prob')
pred <- prediction( qol_pred[,2], qol_test$cd)
PrecRec <- performance(pred, "prec", "rec")
PrecRecAUC <- performance(pred, "auc")
paste0("AUC=", round(as.numeric(PrecRecAUC@y.values), 2))## [1] "AUC=0.69"
# plot(PrecRec)
plot_ly(x = ~PrecRec@x.values[[1]][2:length(PrecRec@x.values[[1]])],
y = ~PrecRec@y.values[[1]][2:length(PrecRec@y.values[[1]])],
name = 'Recall-Precision relation', type='scatter', mode='markers+lines') %>%
layout(title=paste0("Precision-Recall Plot, AUC=",
round(as.numeric(PrecRecAUC@y.values[[1]]), 2)),
xaxis=list(title="Recall"), yaxis=list(title="Precision"))# PrecRecAUC <- performance(pred, "auc")
# paste0("AUC=", round(as.numeric(PrecRecAUC@y.values), 2))Another way to obtain precision would be
posPredValue() in the caret package. Remember
to specify which one is the “success” class.
qol_pred <- predict(qol_model, qol_test)
posPredValue(qol_pred, qol_test$cd, positive="severe_disease")## [1] 0.6855895
From the definitions of precision and recall, we can derive the type 1 error and type 2 errors as follow:
\[error_1 = 1- Precision = \frac{FP}{TP+FP}.\]
\[error_2 = 1- Recall = \frac{FN}{TN+FN}.\]
Thus, we can compute the type 1 error (\(0.31\)) and type 2 error (\(0.31\)).
## [1] 0.3144105
## [1] 0.3114035
The F-measure, or F1-score, combines precision and recall using the harmonic mean assuming equal weights. High F1-score means high precision and high recall. This is a convenient way of measuring model performances and comparing models.
\[F1=\frac{2\times precision\times recall}{recall+precision}=\frac{2\times TP}{2\times TP+FP+FN}\]
Let’s calculate the F1-score by hand using the Quality of Life prediction results.
## [1] 0.6870897
We can contrast these results against direct calculations of the
F1-statistics obtained using caret.
precision <- posPredValue(qol_pred, qol_test$cd, positive="severe_disease")
recall <- sensitivity(qol_pred, qol_test$cd, positive="severe_disease")
F1 <- (2 * precision * recall) / (precision + recall); F1## [1] 0.6870897
Another choice for evaluating classifiers’ performance is by graphs rather than scalars, numerical vectors, or statistics. Graphs are usually more comprehensive than single statistics.
The R package ROCR provides user-friendly
functions for visualizing model performance. Details can be found on the
ROCR website.
Here, we evaluate the model performance for the Quality of Life case study in Chapter 5.
# install.packages("ROCR")
library(ROCR)
pred <- ROCR::prediction(predictions=pred_prob[, 2], labels=qol_test$cd)
# avoid naming collision (ROCR::prediction), as
# there is another prediction function in the neuralnet package.The prediction() method argument
pred_prob[, 2] stores the probability of classifying each
observation as “severe_disease”, and we saved all model prediction
information into the object pred.
Receiver Operating Characteristic (ROC) curves are often used for examining the trade-off between detecting true positives and avoiding the false positives.
# curve(log(x), from=0, to=100, xlab="False Positive Rate", ylab="True Positive Rate", main="ROC curve", col="green", lwd=3, axes=F)
# Axis(side=1, at=c(0, 20, 40, 60, 80, 100), labels = c("0%", "20%", "40%", "60%", "80%", "100%"))
# Axis(side=2, at=0:5, labels = c("0%", "20%", "40%", "60%", "80%", "100%"))
# segments(0, 0, 110, 5, lty=2, lwd=3)
# segments(0, 0, 0, 4.7, lty=2, lwd=3, col="blue")
# segments(0, 4.7, 107, 4.7, lty=2, lwd=3, col="blue")
# text(20, 4, col="blue", labels = "Perfect Classifier")
# text(40, 3, col="green", labels = "Test Classifier")
# text(70, 2, col="black", labels= "Classifier with no predictive value")
x <- seq(from=0, to=1.0, by=0.01) + 0.001
plot_ly(x = ~x, y = (log(100*x)+2.3)/(log(100*x[101])+2.3), line=list(color="lightgreen"),
name = 'Test Classifier', type='scatter', mode='lines', showlegend=T) %>%
add_lines(x=c(0,1), y=c(0,1), line=list(color="black", dash='dash'),
name="Classifier with no predictive value") %>%
add_segments(x=0, xend=0, y=0, yend = 1, line=list(color="blue"),
name="Perfect Classifier") %>%
add_segments(x=0, xend=1, y=1, yend = 1, line=list(color="blue"),
name="Perfect Classifier 2", showlegend=F) %>%
layout(title="ROC curve", legend = list(orientation = 'h'),
xaxis=list(title="False Positive Rate", scaleanchor="y", range=c(0,1)),
yaxis=list(title="True Positive Rate", scaleanchor="x"))The blue line in the above graph represents the perfect classifier where we have 0% false positive and 100% true positive. The middle green line represents a test classifier. Most of our classifiers trained by real data will look like this. The black diagonal line illustrates a classifier with no predictive value. We can see that it has the same true positive rate and false positive rate. Thus, it cannot distinguish between the two states.
Classifiers with high true positive values have ROC curves near the (blue) perfect classifier curve. Thus, we measure the area under the ROC curve (abbreviated as AUC) as a proxy of the classifier performance. To do this, we have to change the scale of the graph above. Mapping 100% to 1, we have a \(1\times 1\) square. The area under perfect classifier would be 1 and area under classifier with no predictive value being 0.5. Then, 1 and 0.5 will be the upper and lower limits for our model ROC curve. For model ROC curves, the typical interpretation of the area under curve (AUC) includes:
Note that this rating system is somewhat subjective. We can use the
ROCR package to draw ROC curves.
We can specify a “performance” object by providing "tpr"
(true positive rate) and "fpr" (false positive rate)
parameters.
# plot(roc, main="ROC curve for Quality of Life Model", col="blue", lwd=3)
# segments(0, 0, 1, 1, lty=2)
plot_ly(x = ~roc@x.values[[1]], y = ~roc@y.values[[1]],
name = 'ROC Curve', type='scatter', mode='markers+lines') %>%
add_lines(x=c(0,1), y=c(0,1), line=list(color="black", dash='dash'),
name="Classifier with no predictive value") %>%
layout(title="ROC Curve for Quality of Life C5.0 classification Tree Model",
legend = list(orientation = 'h'),
xaxis=list(title="False Positive Rate", scaleanchor="y", range=c(0,1)),
yaxis=list(title="True Positive Rate", scaleanchor="x"),
annotations = list(text=paste0("AUC=",
round(as.numeric(performance(pred, "auc")@y.values[[1]]), 2)),
x=0.6, y=0.4, textangle=0,
font=list(size=15, color="blue", showarrow=FALSE)))The segments command draws the dash line representing a classifier with no predictive value.
To measure the model performance quantitatively we need to create a
new performance object with measure="auc", for area under
the curve.
Now the roc_auc is stored as a S4 object. This is
quite different from data frames and matrices. First, we can use the
str() function to examine its structure.
## Formal class 'performance' [package "ROCR"] with 6 slots
## ..@ x.name : chr "None"
## ..@ y.name : chr "Area under the ROC curve"
## ..@ alpha.name : chr "none"
## ..@ x.values : list()
## ..@ y.values :List of 1
## .. ..$ : num 0.692
## ..@ alpha.values: list()
It has 6 members, or “slots”, and the AUC value is stored in the
member y.values. To extract object members, we use the
@ symbol according to str() output.
## [[1]]
## [1] 0.6915953
This reports \(AUC=0.65\), which suggests a fair classifier, according to the above scoring schema.
The evaluation methods we have talked about are all measuring re-substitution error. That involves building the model on training data and measuring the model performance (error/accuracy) on testing data. This evaluation process provides one mechanism of dealing with unseen data. Let’s look at some alternative strategies.
The holdout method idea is to partition a dataset into
two separate sets. Using the first set to create (train) the model and
the other to test (validate) the model performance. In practice, we
usually use a fraction (e.g., \(60\%\),
or \(\frac{3}{5}\)) of our data for
training the model, and reserve the rest (e.g., \(40\%\), or \(\frac{2}{5}\)) for testing. Note that the
testing data may also be further split into proportions for internal
repeated (e.g., cross-validation) testing and final external
(independent) testing.
The partition has to be randomized. In R, the best way of doing this is to create a parameter that randomly draws numbers and use this parameter to extract random rows from the original dataset. In Chapter 6, we used this method to partition the Google Trends data.
Another way of partitioning is using the
caret::createDatePartition() method. We can subset the
original dataset or any independent variable column of the original
dataset, e.g., google_norm$RealEstate:
sub <- caret::createDataPartition(google_norm$RealEstate, p=0.75, list = F)
google_train <- google_norm[sub, ]
google_test <- google_norm[-sub, ]To make sure that the model can be applied to future datasets, we can partition the original dataset into three separate subsets. In this way, we have two subsets of data for testing and validation. The additional validation dataset can alleviate the probability that we have a good model due to chance (non-representative subsets). A common split among training, testing, and validation subsets may be 50%, 25%, and 25% respectively.
sub <- sample(nrow(google_norm), floor(nrow(google_norm)*0.50))
google_train <- google_norm[sub, ]
google_test <- google_norm[-sub, ]
sub1 <- sample(nrow(google_test), floor(nrow(google_test)*0.5))
google_test1 <- google_test[sub1, ]
google_test2 <- google_test[-sub1, ]
nrow(google_norm)## [1] 731
## [1] 365
## [1] 183
## [1] 183
However, when we only have a very small dataset, it’s difficult to
split off too much data as this may excessively reduce the training
sample size. There are the following two options for evaluation of model
performance with unseen data. Both of these are implemented in the
caret package.
The complete details about cross validation will be presented below. Now, we describe the fundamentals of cross-validation as an internal statistical validation technique.
This technique is known as k-fold cross-validation, or k-fold CV, which is a standard for estimating model performance. K-fold CV randomly partitions the original data into k separate random subsets called folds.
A common practice is to use k=10 or 10-fold CV. That is
to split the data into 10 different subsets. Each time, one of the
subsets is reserved for testing, and the rest are employed for
learning/building the model. This can be accomplished using the
caret::createFolds() method. Using set.seed()
ensures the reproducibility of the created folds, in case you run the
code multiple times. Here, we use 1234, a random number, to
seed the fold separation. You can use any number for
set.seed(). We demonstrate the process using the normalized
Google Trend dataset.
## List of 10
## $ Fold01: int [1:73] 9 12 28 31 42 54 62 81 83 86 ...
## $ Fold02: int [1:73] 20 23 47 61 63 72 82 88 96 100 ...
## $ Fold03: int [1:73] 15 22 26 43 49 53 73 80 101 109 ...
## $ Fold04: int [1:74] 5 17 38 52 55 74 91 92 141 142 ...
## $ Fold05: int [1:73] 7 14 29 35 37 39 41 48 51 57 ...
## $ Fold06: int [1:73] 4 18 33 36 44 59 70 77 78 112 ...
## $ Fold07: int [1:73] 6 40 45 65 66 69 71 94 98 102 ...
## $ Fold08: int [1:73] 19 24 27 50 60 75 76 97 99 106 ...
## $ Fold09: int [1:74] 3 11 13 16 21 30 32 34 46 56 ...
## $ Fold10: int [1:72] 1 2 8 10 25 67 68 105 111 126 ...
Another way to cross-validate is to use methods such as
sparsediscrim::cv_partition() or
caret::createFolds().
# install.packages("sparsediscrim")
library(sparsediscrim)
folds2 = cv_partition(1:nrow(google_norm), num_folds=10)And the structure of folds may be reported by:
## List of 10
## $ Fold1 :List of 2
## ..$ training: int [1:657] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ test : int [1:74] 493 338 652 368 698 55 261 407 379 648 ...
## $ Fold2 :List of 2
## ..$ training: int [1:658] 2 3 4 5 6 7 8 9 10 11 ...
## ..$ test : int [1:73] 377 531 465 273 466 526 454 424 417 191 ...
## $ Fold3 :List of 2
## ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ test : int [1:73] 280 289 467 459 216 166 224 278 453 438 ...
## $ Fold4 :List of 2
## ..$ training: int [1:658] 1 3 5 7 8 10 11 12 13 14 ...
## ..$ test : int [1:73] 272 461 103 245 53 6 684 565 400 389 ...
## $ Fold5 :List of 2
## ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ test : int [1:73] 124 137 202 253 729 572 451 452 187 295 ...
## $ Fold6 :List of 2
## ..$ training: int [1:658] 1 2 3 4 5 6 7 9 10 11 ...
## ..$ test : int [1:73] 283 118 687 228 491 549 510 40 607 164 ...
## $ Fold7 :List of 2
## ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 12 ...
## ..$ test : int [1:73] 125 95 617 369 94 204 109 539 32 412 ...
## $ Fold8 :List of 2
## ..$ training: int [1:658] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ test : int [1:73] 49 480 328 406 520 182 160 27 705 386 ...
## $ Fold9 :List of 2
## ..$ training: int [1:658] 1 2 3 4 6 8 9 10 11 12 ...
## ..$ test : int [1:73] 333 515 176 627 546 674 312 340 388 288 ...
## $ Fold10:List of 2
## ..$ training: int [1:658] 1 2 4 5 6 7 8 9 10 11 ...
## ..$ test : int [1:73] 595 57 86 83 387 666 495 715 45 96 ...
Now, we have 10 different subsets in the folds object.
We can use lapply() to fit the model. 90% of data will be
used for training so we use [-x, ] to represent all
observations not in a specific fold. In Chapter
6, we built a neural network model for the Google Trends
data. We can do the same for each fold manually. Then we can train,
test, and aggregate the results. Finally, we can report the agreement
(e.g., correlations between the predicted and observed
RealEstate values).
library(neuralnet)
fold_cv <- lapply(folds, function(x){
google_train <- google_norm[-x, ]
google_test <- google_norm[x, ]
google_model <- neuralnet(RealEstate~Unemployment+Rental+Mortgage+Jobs+Investing+DJI_Index+StdDJI,
data=google_train)
google_pred <- compute(google_model, google_test[, c(1:2, 4:8)])
pred_results <- google_pred$net.result
pred_cor <- cor(google_test$RealEstate, pred_results)
return(pred_cor)
})
str(fold_cv)## List of 10
## $ Fold01: num [1, 1] 0.972
## $ Fold02: num [1, 1] 0.979
## $ Fold03: num [1, 1] 0.976
## $ Fold04: num [1, 1] 0.975
## $ Fold05: num [1, 1] 0.977
## $ Fold06: num [1, 1] 0.981
## $ Fold07: num [1, 1] 0.974
## $ Fold08: num [1, 1] 0.973
## $ Fold09: num [1, 1] 0.977
## $ Fold10: num [1, 1] 0.972
From the output, we know that in most of the folds the model predicts
very well. In a typical run, one fold may yield bad results. We can use
the mean of these 10 correlations to represent the
overall model performance. But first, we need to use the
unlist() function to transform fold_cv into a
vector.
## [1] 0.9756676
This correlation is high suggesting strong association between predicted and true values. Thus, the model is very good in terms of its prediction.
The second method is called bootstrap sampling. In k-fold CV, each observation can only be used once for testing. Bootstrap sampling relies on a sampling with replacement process. Before selecting a new sample, it recycles every observation so that each observation could appear in multiple folds.
At each iteration, bootstrap sampling uses \(63.2\%\) of the original data as our training dataset and the remaining \(36.8\%\) as the test dataset. Thus, compared to k-fold CV, bootstrap sampling is less representative of the full dataset. A special case of bootstrapping is the 0.632 bootstrap technique, which addresses this issue with changing the final performance error assessment formula to
\[error=0.632\times error_{test}+0.368\times error_{train}.\]
This synthesizes the optimistic model performance on training data (\(error_{train}\)) with the pessimistic model performance on testing data (\(error_{test}\)) by weighting the corresponding errors. This method is extremely reliable for small samples (it may be computationally intensive for large samples).
To see the (asymptotics) rationale behind 0.632 bootstrap technique, consider a standard training set \(T\) of cardinality \(n\) where our bootstrapping sampling generates \(m\) new training sets \(T_i\), each of size \(n'\). As sampling from \(T\) is uniform with replacement, some observations may be repeated in each sample \(T_i\). Suppose the size of the sub-samples are of the same order as \(T\), i.e., \(n'=n\), then for large \(n\) the sample \(T_{i}\) is expected to have \(\left (1 - \frac{1}{e}\right ) \sim 0.632\) unique cases from the complete original collection \(T\), the remaining proportion \(0.368\) are expected to be repeated duplicates. Hence the name 0.632 bootstrap sampling. A particular training data element has a probability of \(1-\frac{1}{n}\) of not being picked for training, and hence, its probability of being in the testing set is \(\left (1- \frac{1}{n}\right )^n=e^{-1} \sim 0.368\).
The simulation below illustrates the 0.632 bootstrap experimentally.
# define the total data size
n <- 500
# define number of Bootstrap iterations
N=2000
#define the resampling function and compute the proportion of uniquely selected elements
uniqueProportions <- function(myDataSet, sampleSize){
indices <- sample(1:sampleSize, sampleSize, replace=TRUE) # sample With Replacement
length(unique(indices))/sampleSize
}
# compute the N proportions of unique elements (could also use a for loop)
proportionsVector <- c(lapply(1:N, uniqueProportions, sampleSize=n), recursive=TRUE)
# report the Expected (mean) proportion of unique elements in the bootstrapping samples of n
mean(proportionsVector)## [1] 0.632138
In general, for large \(n\gg n'\), the sample \(T_{i}\) is expected to have \(n\left ( 1-e^{-n'/n}\right )\) unique cases, see On Estimating the Size and Confidence of a Statistical Audit).
Having the bootstrap samples, the \(m\) models can be fitted (estimated) and
aggregated, e.g., by averaging the outputs (for regression) or using
voting methods (for classification). We will discuss this more in later
Chapters. Let’s look at the
implementation of the 0.632 Bootstrap technique for the
QoL case-study.
# Recall: qol_model <- C5.0(qol_train[,-c(40, 41)], qol_train$cd)
# predict labels of testing data
qol_pred <- predict(qol_model, qol_test)
# compute matches and mismatches of Predicted and Observed class labels
predObsEqual <- qol_pred == qol_test$cd
predObsTF <- c(table(predObsEqual)[1], table(predObsEqual)[2]); predObsTF## FALSE TRUE
## 143 300
# training error rate
train.err <- as.numeric(predObsTF[1]/(predObsTF[1]+predObsTF[2]))
# testing error rate, leave-one-out Bootstrap (LOOB) Cross-Validation
B <- 10
loob.err <- NULL
N <- dim(qol_test)[1] # size of test-dataset
for (b in 1:B) {
bootIndices <- sample(1:N, N*0.9, replace=T)
train <- qol_test[bootIndices, ]
qol_modelBS <- C5.0(train[,-c(40, 41)], train$cd)
inner.err <- NULL
# for current iteration extract the appropriate testing cases for testing
i <- (1:length(bootIndices))
i <- i[is.na(match(i, bootIndices))]
test <- qol_test[i, ]
# predict using model at current iteration
qol_modelBS_pred <- predict(qol_modelBS, test)
predObsEqual <- qol_modelBS_pred == test$cd
predObsTF <- c(table(predObsEqual)[1], table(predObsEqual)[2]); predObsTF
# training error rate
inner.err <- as.numeric(predObsTF[1]/(predObsTF[1]+predObsTF[2]))
loob.err <- c(loob.err, mean(inner.err))
}
test.err <- ifelse(is.null(loob.err), NA, mean(loob.err))
# 0.632 Bootstrap error
boot.632 <- 0.368 * train.err + 0.632 * test.err; boot.632## [1] 0.3876717
We already explored several alternative machine learning (ML) methods for prediction, classification, clustering and outcome forecasting. In many situations, we derive models by estimating model coefficients or parameters. The main question now is How can we adopt crowd-sourcing advantages of social networks to aggregate different predictive analytics strategies?
Are there reasons to believe that such ensembles of forecasting methods actually improve the performance or boost the prediction accuracy of the resulting consensus meta-algorithm? In this chapter, we are going to introduce ways that we can search for optimal parameters for a single ML method as well as aggregate different methods into ensembles to augment their collective performance, relative to any of the individual methods part of the meta-algorithm.
Recall that earlier in Chapter 6, we presented strategies for improving model performance based on meta-learning, bagging, and boosting.
One of the methods for improving model performance relies on tuning. For a given ML technique, tuning is the process of searching through the parameter space for the optimal parameter(s). The following table summarizes some of the parameters used in ML techniques we covered in previous chapters.
| Model | Learning Task | Method | Parameters |
|---|---|---|---|
| KNN | Classification | class::knn |
data, k |
| K-Means | Classification | stats::kmeans |
data, k |
| Naive Bayes | Classification | e1071::naiveBayes |
train, class, laplace |
| Decision Trees | Classification | C50::C5.0 |
train, class, trials, costs |
| OneR Rule Learner | Classification | RWeka::OneR |
class~predictors, data |
| RIPPER Rule Learner | Classification | RWeka::JRip |
formula, data, subset, na.action, control, options |
| Linear Regression | Regression | stats::lm |
formula, data, subset, weights, na.action, method |
| Regression Trees | Regression | rpart::rpart |
dep_var ~ indep_var, data |
| Model Trees | Regression | RWeka::M5P |
formula, data, subset, na.action, control |
| Neural Networks | Dual use | nnet::nnet |
x, y, weights, size, Wts, mask,linout, entropy, softmax, censored, skip, rang, decay, maxit, Hess, trace, MaxNWts, abstol, reltol |
| Support Vector Machines (Polynomial Kernel) | Dual use | caret::train::svmLinear |
C |
| Support Vector Machines (Radial Basis Kernel) | Dual use | caret::train::svmRadial |
C, sigma |
| Support Vector Machines (general) | Dual use | kernlab::ksvm |
formula, data, kernel |
| Random Forests | Dual use | randomForest::randomForest |
formula, data |
\[\textbf{Table 1: Core parameters of several machine learning techniques.}\]
caret for automated parameter tuningIn Chapter
5 we used kNN and plugged in random k parameters for the
number of clusters. This time we will test simultaneously multiple
k values and select the parameter(s) yielding the highest
prediction accuracy. Using caret allows us to specify an
outcome class variable, covariate predictor features, and a specific ML
method. In Chapter
5, we showed the Boys
Town Study of Youth Development dataset, where we normalized all the
features, stored them in a boystown_n computable object,
and defined an outcome class variable (boystown$grade).
library(randomForest)
boystown <- read.csv("https://umich.instructure.com/files/399119/download?download_frd=1", sep=" ")
boystown$sex <- boystown$sex-1
boystown$dadjob <- -1*(boystown$dadjob-2)
boystown$momjob <- -1*(boystown$momjob-2)
boystown <- boystown[, -1]
table(boystown$gpa)##
## 0 1 2 3 4 5
## 30 50 54 40 14 12
boystown$grade <- boystown$gpa %in% c(3, 4, 5)
boystown$grade <- factor(boystown$grade, levels=c(F, T), labels = c("above_avg", "avg_or_below"))
normalize <- function(x){
return((x-min(x))/(max(x)-min(x)))
}
boystown_n <- as.data.frame(lapply(boystown[, -11], normalize))## 'data.frame': 200 obs. of 10 variables:
## $ sex : num 0 0 0 0 1 1 0 0 1 1 ...
## $ gpa : num 1 0 0.6 0.4 0.6 0.6 0.2 1 0.2 0.6 ...
## $ Alcoholuse: num 0.182 0.364 0.182 0.182 0.545 ...
## $ alcatt : num 0.5 0.333 0.5 0.167 0.333 ...
## $ dadjob : num 1 1 1 1 1 1 1 1 1 1 ...
## $ momjob : num 0 0 0 0 1 0 0 0 1 1 ...
## $ dadclose : num 0.143 0.429 0.286 0.143 0.286 ...
## $ momclose : num 0.143 0.571 0.286 0.286 0.143 ...
## $ larceny : num 0.25 0 0 0.75 0.25 0 0 0 0.25 0.25 ...
## $ vandalism : num 0.429 0 0.286 0.286 0.286 ...
## 'data.frame': 200 obs. of 11 variables:
## $ sex : num 0 0 0 0 1 1 0 0 1 1 ...
## $ gpa : num 1 0 0.6 0.4 0.6 0.6 0.2 1 0.2 0.6 ...
## $ Alcoholuse : num 0.182 0.364 0.182 0.182 0.545 ...
## $ alcatt : num 0.5 0.333 0.5 0.167 0.333 ...
## $ dadjob : num 1 1 1 1 1 1 1 1 1 1 ...
## $ momjob : num 0 0 0 0 1 0 0 0 1 1 ...
## $ dadclose : num 0.143 0.429 0.286 0.143 0.286 ...
## $ momclose : num 0.143 0.571 0.286 0.286 0.143 ...
## $ larceny : num 0.25 0 0 0.75 0.25 0 0 0 0.25 0.25 ...
## $ vandalism : num 0.429 0 0.286 0.286 0.286 ...
## $ boystown[, 11]: Factor w/ 2 levels "above_avg","avg_or_below": 2 1 2 1 2 2 1 2 1 2 ...
Now that the dataset includes an explicit class variable and
predictor features, we can use the KNN method to predict the outcome
grade. Let’s plug this information into the
caret::train() function. Note that caret can
use the complete dataset as it will automatically do the random sampling
for the internal statistical cross-validation. To make results
reproducible, we may utilize the set.seed() function that
we presented earlier.
library(caret)
set.seed(123)
kNN_mod <- train(grade~., data=boystown_n, method="knn")
kNN_mod; summary(kNN_mod)## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
## 2 classes: 'above_avg', 'avg_or_below'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7971851 0.5216369
## 7 0.8037923 0.5282107
## 9 0.7937086 0.4973139
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 10 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
In this case, using str(m) to summarize the object
m may report out too much information. Instead, we can
simply type the object name m to get a more concise
information about it.
caret uses 3 different choices for each
parameter, but for binary parameters, it only takes 2 choices
TRUE and FALSE). As KNN has only one parameter
k, we have 3 candidate models reported in the output
above.k=9.Let’s see how accurate this “optimal model” is in terms of the
re-substitution error. Again, we will use the predict()
function specifying the object m and the dataset
boystown_n. Then, we can report the contingency table
showing the agreement between the predictions and real class labels.
##
## pred above_avg avg_or_below
## above_avg 132 17
## avg_or_below 2 49
This model has \((17+2)/200=0.09\)
re-substitution error (9%). This means that in the 200 observations that
we used to train this model, 91% of them were correctly classified. Note
that re-substitution error is different from accuracy. The accuracy of
this model is \(0.81\), which is
reported by a model summary call. As mentioned earlier, we can obtain
prediction probabilities for each observation in the original
boystown_n dataset.
## above_avg avg_or_below
## 1 0.0000000 1.0000000
## 2 1.0000000 0.0000000
## 3 0.7142857 0.2857143
## 4 0.8571429 0.1428571
## 5 0.2857143 0.7142857
## 6 0.5714286 0.4285714
The default setting of train() might not meet the
specific needs for every study. In our case, the optimal \(k\) might be smaller than \(9\). The caret package allows
us to customize the settings for train(). Specifically,
caret::trainControl() can help us to customize re-sampling
methods. There are 6 popular re-sampling methods that we might want to
use, which are summarized in the following table.
| Resampling method | Method name | Additional options and default values |
|---|---|---|
| Holdout sampling | LGOCV |
p = 0.75 (training data proportion) |
| k-fold cross-validation | cv |
number = 10 (number of folds) |
| Repeated k-fold cross validation | repeatedcv |
number = 10 (number of folds),
repeats = 10 (number of iterations) |
| Bootstrap sampling | boot |
number = 25 (resampling iterations) |
| 0.632 bootstrap | boot632 |
number = 25 (resampling iterations) |
| Leave-one-out cross-validation | LOOCV |
None |
\[\textbf{Table 2: Core re-sampling methods.}\]
Each of these methods rely on alternative representative sampling
strategies to train the model. Let’s use 0.632 bootstrap for
example. Just specify method="boot632" in the
trainControl() function. The number of different samples to
include can be customized by the number= option. Another
option in trainControl() allows specification of the model
performance evaluation. We can select a preferred method of evaluation
for choosing the optimal model. For instance, the oneSE
method chooses the simplest model within one standard error of the best
performance to be the optimal model. Other strategies are also available
in the caret package. For detailed information, type
?best in the R console.
We can also specify a list of \(k\) values we want to test by creating a matrix or a grid.
ctrl <- trainControl(method = "boot632", number=25, selectionFunction = "oneSE")
grid <- expand.grid(k=c(1, 3, 5, 7, 9))
# Creates a data frame from all combinations of the supplied factorsUsually, to avoid ties, we prefer to choose an odd number of clusters
\(k\). Now the constraints are all set.
We can start to select models again using train().
set.seed(123)
kNN_mod2 <-train(grade ~ ., data=boystown_n, method="knn",
metric="Kappa",
trControl=ctrl,
tuneGrid=grid)
kNN_mod2## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
## 2 classes: 'above_avg', 'avg_or_below'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.8748058 0.7155507
## 3 0.8389441 0.6235744
## 5 0.8411961 0.6254587
## 7 0.8384469 0.6132381
## 9 0.8341422 0.5971359
##
## Kappa was used to select the optimal model using the one SE rule.
## The final value used for the model was k = 1.
Here we added metric="Kappa" to include the Kappa
statistics as one of the criteria to select the optimal model. We
can see the output accuracy for all the candidate models are better than
the default bootstrap sampling. The optimal model has k=1, a
high accuracy \(0.861\), and a high
Kappa statistics, which is much better than the model we had in Chapter
5. Note that the output based on the SE rule may not necessarily
choose the model with the highest accuracy or the highest Kappa
statistic as the “optimal model”. The tuning process is more
comprehensive than only looking at one statistic.
Earlier in Chapter
5 we saw examples of how to choose appropriate evaluation metrics
and how to contrast the performance of various AI/ML methods. Below, we
illustrate model comparison based on classification of Case-Study
6, Quality of Life (QoL) dataset using bagging, boosting, random
forest, SVN, k nearest neighbors, and decision trees. All available caret
package ML/AI training methods are listed here.
# install.packages(fastAdaboost)
library(fastAdaboost)
library(caret) # for modeling
library(lattice) # for plotting
control <- trainControl(method="repeatedcv", number=10, repeats=3)
## Run all subsequent models in parallel
library(adabag)
library(doParallel)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
system.time({
rf.fit <- train(cd~., data=qol[, -37], method="rf", trControl=control);
knn.fit <- train(cd~., data=qol[, -37], method="knn", trControl=control);
svm.fit <- train(cd~., data=qol[, -37], method="svmRadialWeights", trControl=control);
adabag.fit <- train(cd~., data=qol[, -37], method="AdaBag", trControl=control);
adaboost.fit <- train(cd~., data=qol[, -37], method="adaboost", trControl=control)
})## user system elapsed
## 6.28 0.75 219.67
stopCluster(cl) # close multi-core cluster
rm(cl)
results <- resamples(list(RF=rf.fit, kNN=knn.fit, SVM=svm.fit, Bag=adabag.fit, Boost=adaboost.fit))
# summary of model differences
summary(results)##
## Call:
## summary.resamples(object = results)
##
## Models: RF, kNN, SVM, Bag, Boost
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.9954751 1.0000000 1.0000000 0.9995489 1.0000000 1.0000000 0
## kNN 0.5270270 0.6004525 0.6238739 0.6225767 0.6475225 0.7090909 0
## SVM 0.9409091 0.9593219 0.9683971 0.9676231 0.9773499 0.9864865 0
## Bag 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0
## Boost 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.99092365 1.0000000 1.000000 0.9990954 1.0000000 1.0000000 0
## kNN 0.04709345 0.1967654 0.245143 0.2434337 0.2949951 0.4155251 0
## SVM 0.88132780 0.9183430 0.936674 0.9350707 0.9545693 0.9729290 0
## Bag 1.00000000 1.0000000 1.000000 1.0000000 1.0000000 1.0000000 0
## Boost 1.00000000 1.0000000 1.000000 1.0000000 1.0000000 1.0000000 0
# Plot Accuracy Summaries
# scales <- list(x=list(relation="free"), y=list(relation="free"))
# bwplot(results, scales=scales) # Box plots of accuracy
# Convert (results) data-frame from wide to long format
# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector)
library(tidyr)
results_long <- gather(results$values[, -1], method, measurement, factor_key=TRUE) %>%
separate(method, c("Technique", "Metric"), sep = "~")
# Compare original wide format to transformed long format
results$values[, -1]## RF~Accuracy RF~Kappa kNN~Accuracy kNN~Kappa SVM~Accuracy SVM~Kappa
## 1 1.0000000 1.0000000 0.6334842 0.26710338 0.9729730 0.9457478
## 2 1.0000000 1.0000000 0.6334842 0.26147943 0.9772727 0.9545455
## 3 1.0000000 1.0000000 0.6606335 0.31988839 0.9409091 0.8813278
## 4 1.0000000 1.0000000 0.6063348 0.21178207 0.9684685 0.9368857
## 5 1.0000000 1.0000000 0.6515837 0.29858621 0.9728507 0.9455665
## 6 1.0000000 1.0000000 0.5945946 0.18886002 0.9864865 0.9729290
## 7 1.0000000 1.0000000 0.7090909 0.41552511 0.9547511 0.9091955
## 8 1.0000000 1.0000000 0.6036036 0.20688535 0.9504505 0.9007398
## 9 1.0000000 1.0000000 0.5855856 0.16881003 0.9774775 0.9548450
## 10 0.9954751 0.9909237 0.5791855 0.15969582 0.9638009 0.9272907
## 11 1.0000000 1.0000000 0.6216216 0.23922977 0.9549550 0.9096533
## 12 1.0000000 1.0000000 0.6621622 0.32814139 0.9819820 0.9639200
## 13 1.0000000 1.0000000 0.5855856 0.16813294 0.9684685 0.9368344
## 14 0.9954955 0.9909690 0.6561086 0.31422505 0.9504505 0.9005781
## 15 1.0000000 1.0000000 0.6000000 0.19634703 0.9684685 0.9367315
## 16 1.0000000 1.0000000 0.6486486 0.29985444 0.9819820 0.9639200
## 17 1.0000000 1.0000000 0.6351351 0.26967752 0.9594595 0.9186548
## 18 1.0000000 1.0000000 0.6063348 0.20749351 0.9502262 0.9001602
## 19 1.0000000 1.0000000 0.6396396 0.28073870 0.9819820 0.9639493
## 20 1.0000000 1.0000000 0.5270270 0.04709345 0.9819820 0.9638907
## 21 1.0000000 1.0000000 0.6441441 0.28422170 0.9683258 0.9365229
## 22 1.0000000 1.0000000 0.6216216 0.24046921 0.9592760 0.9182390
## 23 1.0000000 1.0000000 0.6018100 0.19802062 0.9592760 0.9181650
## 24 0.9954955 0.9909690 0.5855856 0.17016090 0.9639640 0.9277226
## 25 1.0000000 1.0000000 0.6261261 0.24981679 0.9727273 0.9452963
## 26 1.0000000 1.0000000 0.6742081 0.34733388 0.9773756 0.9545772
## 27 1.0000000 1.0000000 0.6171171 0.22985879 0.9773756 0.9546183
## 28 1.0000000 1.0000000 0.6380090 0.27356397 0.9683258 0.9365229
## 29 1.0000000 1.0000000 0.6561086 0.31562220 0.9683258 0.9366165
## 30 1.0000000 1.0000000 0.5727273 0.14439388 0.9683258 0.9364760
## Bag~Accuracy Bag~Kappa Boost~Accuracy Boost~Kappa
## 1 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 1
## 5 1 1 1 1
## 6 1 1 1 1
## 7 1 1 1 1
## 8 1 1 1 1
## 9 1 1 1 1
## 10 1 1 1 1
## 11 1 1 1 1
## 12 1 1 1 1
## 13 1 1 1 1
## 14 1 1 1 1
## 15 1 1 1 1
## 16 1 1 1 1
## 17 1 1 1 1
## 18 1 1 1 1
## 19 1 1 1 1
## 20 1 1 1 1
## 21 1 1 1 1
## 22 1 1 1 1
## 23 1 1 1 1
## 24 1 1 1 1
## 25 1 1 1 1
## 26 1 1 1 1
## 27 1 1 1 1
## 28 1 1 1 1
## 29 1 1 1 1
## 30 1 1 1 1
## Technique Metric measurement
## 1 RF Accuracy 1
## 2 RF Accuracy 1
## 3 RF Accuracy 1
## 4 RF Accuracy 1
## 5 RF Accuracy 1
## 6 RF Accuracy 1
library(plotly)
plot_ly(results_long, x=~Technique, y = ~measurement, color = ~Metric, type = "box")#densityplot(results, scales=scales, pch = "|") # Density plots of accuracy
densityModels <- with(results_long[which(results_long$Metric=='Accuracy'), ],
tapply(measurement, INDEX = Technique, density))
df <- data.frame(
x = unlist(lapply(densityModels, "[[", "x")),
y = unlist(lapply(densityModels, "[[", "y")),
method = rep(names(densityModels), each = length(densityModels[[1]]$x))
)
plot_ly(df, x = ~x, y = ~y, color = ~method) %>% add_lines() %>%
layout(title="Performance Density Plots (Accuracy)", legend = list(orientation='h'),
xaxis=list(title="Accuracy"), yaxis=list(title="Density"))densityModels <- with(results_long[which(results_long$Metric=='Kappa'), ],
tapply(measurement, INDEX = Technique, density))
df <- data.frame(
x = unlist(lapply(densityModels, "[[", "x")),
y = unlist(lapply(densityModels, "[[", "y")),
method = rep(names(densityModels), each = length(densityModels[[1]]$x))
)
plot_ly(df, x = ~x, y = ~y, color = ~method) %>% add_lines() %>%
layout(title="Performance Density Plots (Kappa)", legend = list(orientation='h'),
xaxis=list(title="Kappa"), yaxis=list(title="Density"))# dotplot(results, scales=scales) # Dot plots of Accuracy & Kappa
#splom(results) # contrast pair-wise model scatterplots of prediction accuracy (Trellis Scatterplot matrices)
# Pairs - Accuracy
results_wide <- results_long[which(results_long$Metric=='Accuracy'), -2] %>%
pivot_wider(names_from = Technique, values_from = measurement)
df = data.frame(cbind(RF=results_wide$RF[[1]], kNN=results_wide$kNN[[1]], SVM=results_wide$SVM[[1]], Bag=results_wide$Bag[[1]], Boost=results_wide$Boost[[1]]))
dims <- dplyr::select_if(df, is.numeric)
dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL),
showupperhalf = FALSE, diagonal = list(visible = FALSE)) %>%
layout(title="Performance Pairs Plot (Accuracy)")# Pairs - Accuracy
results_wide <- results_long[which(results_long$Metric=='Kappa'), -2] %>%
pivot_wider(names_from = Technique, values_from = measurement)
df = data.frame(cbind(RF=results_wide$RF[[1]], kNN=results_wide$kNN[[1]], SVM=results_wide$SVM[[1]], Bag=results_wide$Bag[[1]], Boost=results_wide$Boost[[1]]))
dims <- dplyr::select_if(df, is.numeric)
dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL),
showupperhalf = FALSE, diagonal = list(visible = FALSE)) %>%
layout(title="Performance Pairs Plot (Kappa)")Cross-validation is a strategy for validating predictive methods, classification models and clustering techniques by assessing the reliability and stability of the results of the corresponding statistical analyses (e.g., predictions, classifications, forecasts) based on independent datasets. For prediction of trend, association, clustering, and classification, a model is usually trained on one dataset (training data) and subsequently tested on new data (testing or validation data). Statistical internal cross-validation defines a test dataset to evaluate the model predictive performance as well as assess its power to avoid overfitting. Overfitting is the process of computing a predictive or classification model that describes random error, i.e., fits to the noise components of the observations, instead of identifying actual relationships and salient features in the data.
In this section, we will use Google Flu Trends, Autism, and Parkinson’s disease case-studies to illustrate (1) alternative forecasting types using linear and non-linear predictions, (2) exhaustive and non-exhaustive internal statistical cross-validation, and (3) explore complementary predictor functions.
In Chapter
5 we discussed the types of classification and prediction methods,
including supervised and unsupervised
learning. The former are direct and predictive (there are known outcome
variables that can be predicted and the corresponding forecasts can be
evaluated) and the latter are indirect and descriptive (there are no
a priori labels or specific outcomes).
Before we go into the cross-validation of predictive analytics, we will present several examples of overfitting that illustrates why a certain amount of skepticism and mistrust may be appropriate when dealing with forecasting models based on large and complex data.
By 2022, there were only 58 US presidential elections and 46 presidents. That is a small dataset, and learning from it may be challenging. For instance:
This cartoon illustrates some of the (unique) noisy presidential
characteristics that are thought to be unimportant to presidential
elections or presidential performance.