Start by reviewing Chapter 13 (Model Performance Assessment). Cross-validation is a strategy for validating predictive methods, classification models and clustering techniques by assessing the reliability and stability of the results of the corresponding statistical analyses (e.g., predictions, classifications, forecasts) based on independent datasets. For prediction of trend, association, clustering, and classification, a model is usually trained on one dataset (training data) and subsequently tested on new data (testing or validation data). Statistical internal cross-validation defines a test dataset to evaluate the model predictive performance, as well as to assess its power to avoid overfitting. Overfitting is the process of computing a predictive or classification model that describes random error, i.e., fits the noise components of the observations, instead of the actual underlying relationships and salient features in the data.
In this Chapter, we will use Google Flu Trends, Autism, and Parkinson’s disease case-studies to illustrate (1) alternative forecasting types using linear and non-linear predictions, (2) exhaustive and non-exhaustive internal statistical cross-validation, and (3) complementary predictor functions.
In Chapter 6 we discussed the types of classification and prediction methods, including supervised and unsupervised learning. The former are direct and predictive (there are known outcome variables that can be predicted and the corresponding forecasts can be evaluated), and the latter are indirect and descriptive (there are no a priori labels or specific outcomes).
There are alternative metrics used for evaluation of model performance, see Chapter 13. For example, the assessment of supervised prediction and classification methods depends on the type of the labeled outcome responses - categorical (binary or polytomous) vs. continuous.
Confusion matrices reporting accuracy, FP, FN, PPV, NPV, LOR and other metrics may be used to assess predictions of dichotomous (binary) or polytomous outcomes.
\(R^2\), correlations (between predicted and observed outcomes), and RMSE measures may be used to quantify the performance of various supervised forecasting methods on continuous features.
Before we go into the cross-validation of predictive analytics, we will present several examples of overfitting that illustrate why a certain amount of skepticism and mistrust may be appropriate when dealing with forecasting models based on large and complex data.
By 2017, there had been only 58 US presidential elections and 45 presidents. That is a small dataset, and learning from it may be challenging. For instance:
This cartoon illustrates some of the (unique) noisy presidential characteristics that are thought to be unimportant to presidential elections or presidential performance.
A March 14, 2014 article in Science (DOI: 10.1126/science.1248506) identified problems in Google Flu Trends (GFT, DOI: 10.1371/journal.pone.0023610), which may be attributed in part to overfitting. GFT was built to predict the future Centers for Disease Control and Prevention (CDC) reports of doctor office visits for influenza-like illness (ILI). In February 2013, Nature reported that GFT was predicting more than double the proportion of doctor visits compared to the CDC estimates for the same period.
The GFT model found the best matches among 50 million web search terms to fit 1,152 data points. It predicted quite high odds of finding search terms that match the propensity of the flu but are structurally unrelated and hence not prospectively predictive. In fact, the GFT investigators reported weeding out seasonal search terms that were unrelated to the flu yet strongly correlated with the CDC data, e.g., searches about the high school basketball season. The big GFT data may have overfitted the relatively small number of cases. This false-alarm result was also paired with a false-negative finding: the GFT model missed the non-seasonal 2009 H1N1 influenza pandemic, which provides a cautionary tale about prediction, overfitting, and prospective validation.
Autistic brains constantly overfit visual and cognitive stimuli. To an autistic person, a general conversation of several adults may seem like a cacophony due to super-sensitive, detail-oriented hearing and perception tuned to literally pick up all elements of the conversation and clues of the surrounding environment. At the same time, autistic brains downplay body language, sarcasm and non-literal cues. We can miss the forest for the trees when we start “overfitting”, i.e., over-interpreting the noise on top of the actual salient information. Ambient noise, trivial observations and unrelated perceptions may hide the true communication details.
Human conversations and communications involve exchanges of both critical information and random noise. Fitting a perfect model requires focus only on the “relevant” information. Overfitting occurs when attention is (excessively) consumed with peripheral noise, or worse, overwhelmed by inconsequential noise drowning the salient aspects of the communication exchange.
Any dataset is a mix of signal and noise. The main task of our brains is to sort these components and interpret the information (i.e., ignore the noise).
“One person’s noise is another person’s treasure map!”
Our predictions are most accurate if we can model as much of the signal and as little of the noise as possible. Note that, in these terms, \(R^2\) is a poor metric of predictive power - it measures how much of the signal and the noise is explained by our model. In practice, it is hard to always separate the signal from the noise. This is why practical applications tend to favor simpler models: the more complicated a model is, the easier it is to overfit the noise component of the observed information.
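To make this concrete, here is a minimal, self-contained R sketch (the simulated signal, noise level, and polynomial degrees below are purely illustrative) showing how an overly flexible model can fit the training data better but predict new data worse:
# Illustrative overfitting demo: fit a low- and a high-degree polynomial to
# noisy data and compare errors on an independent realization of the noise
set.seed(1234)
n <- 50
x <- seq(0, 1, length.out = n)
y_train <- sin(2*pi*x) + rnorm(n, sd = 0.3)   # signal + noise (training)
y_test  <- sin(2*pi*x) + rnorm(n, sd = 0.3)   # same signal, new noise (testing)
fit_simple  <- lm(y_train ~ poly(x, 3))       # simpler model
fit_complex <- lm(y_train ~ poly(x, 20))      # overly flexible model
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
# the complex model typically "wins" on the training data ...
c(train_simple = rmse(y_train, fitted(fit_simple)), train_complex = rmse(y_train, fitted(fit_complex)))
# ... but tends to lose on the independent test data, as it has fit the noise
c(test_simple = rmse(y_test, fitted(fit_simple)), test_complex = rmse(y_test, fitted(fit_complex)))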
Internal statistical cross-validation assesses the expected performance of a prediction method in cases (subjects, units, regions, etc.) drawn from a similar population as the original training data sample. Internal validation is distinct from external validation, as the latter potentially allows for the existence of differences between the populations (training data, used to develop or train the technique, and testing data, used to independently quantify the performance of the technique).
Each step in the internal statistical cross-validation protocol involves randomly partitioning the sample into complementary subsets (training and testing), fitting the model on the training subset, and validating its performance on the testing subset.
Here is one example of internal statistical cross-validation: predictive diagnostic modeling in Parkinson’s disease.
To reduce the noise and variability, the final validation results may average the performance results across all iterations.
In cases when new observations are hard to obtain (due to costs, reliability, time, or other constraints), cross-validation guards against testing hypotheses suggested by the data themselves (also known as Type III error, or false-suggestion).
Cross-validation is different from conventional validation (e.g., an 80%-20% partition of the dataset into training and testing subsets), where the prediction error (e.g., root mean square error, RMSE) evaluated on the training data is not a useful estimator of model performance, as it does not generalize across multiple samples.
In general, the errors of conventional validation are based on the results of a specific test dataset and may not accurately represent the model performance. A more appropriate strategy for estimating model prediction performance is to use cross-validation (CV), which combines (averages) prediction errors to measure the model performance. CV corrects for the expected stochastic nature of partitioning the training and testing sets and generates a more accurate and robust estimate of the expected model performance.
Relative to a simpler model, a more complex model may overfit the data if it has short foresight, i.e., it generates accurate fits for the known data but less accurate predictions for new data. Knowledge from past experiences may include either relevant or irrelevant (noise) information. In challenging data-driven prediction settings, when uncertainty (entropy) is high, more noise is present in the past information and needs to be accounted for in prospective forecasting. However, it is generally hard to discriminate patterns from noise in complex systems (i.e., to decide which part to model and which to ignore). Models that reduce the chance of fitting noise are called robust.
Let’s demonstrate a simple model assessment using linear regression. Suppose we observe the response values \(\{y_1, \cdots, y_n\}\) and the corresponding predictors \(\{x_1, \cdots, x_n\}\), where each case \(x_i\) is a \(k\)-dimensional vector of covariates, the subjects/cases are indexed by \(1\leq i\leq n\), and the data elements (variables) are indexed by \(1\leq j\leq k\):
\[ \begin{pmatrix} x_{1, 1} &\cdots & x_{1, k}\\ \vdots &\ddots & \vdots \\x_{n, 1} &\cdots & x_{n, k} \\ \end{pmatrix} .\]
Using least squares to estimate the linear function parameters (effect-sizes), \({\beta _1, \cdots , \beta _k }\), allows us to compute a hyperplane \(y = a + x\beta\) that best fits the observed data \({(x_i, y_i )}_{1\leq i \leq n}\). This is expressed as a matrix by:
\[\begin{pmatrix} y_1\\ \vdots \\ y_n\\ \end{pmatrix} = \begin{pmatrix} a_1\\ \vdots \\ a_n\\ \end{pmatrix}+\begin{pmatrix} x_{1, 1} &\cdots & x_{1, k}\\ \vdots &\ddots & \vdots \\x_{n, 1} &\cdots & x_{n, k} \\ \end{pmatrix} \begin{pmatrix} \beta_1\\ \vdots \\ \beta_k\\ \end{pmatrix}.\]
Corresponding to the system of linear hyperplanes:
\[ \left\{ \begin{array}{c} y_1=a_1+ x_{1, 1} \beta_1 +x_{1, 2} \beta_2 + \cdots +x_{1, k} \beta_k \\ y_2=a_2+ x_{2, 1} \beta_1 +x_{2, 2} \beta_2 + \cdots +x_{2, k} \beta_k \\ \vdots\\ y_n=a_n+ x_{n, 1} \beta_1 +x_{n, 2} \beta_2 + \cdots +x_{n, k} \beta_k \end{array} \right. . \]
One measure to evaluate the model fit is the mean squared error (MSE). The MSE for given values of the parameters \(a\) and \(\beta\) on the observed training data \({(x_i, y_i) }_{1\leq i\leq n}\) is expressed as:
\[MSE=\frac{1}{n}\sum_{i=1}^{n} (y_i- \underbrace{(a_i+ x_{i, 1} \beta_1 +x_{i, 2} \beta_2 + \cdots +x_{i, k} \beta_k) }_{\text{predicted value } \hat{y}_i, \text{ at } {x_{i, 1}, \cdots, x_{i, k}}} )^2\]
And the corresponding root mean square error (RMSE) is:
\[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i- \underbrace{(a_i+ x_{i, 1} \beta_1 +x_{i, 2} \beta_2 + \cdots +x_{i, k} \beta_k) }_{\text{predicted value } \hat{y}_i, \text{ at } {x_{i, 1}, \cdots, x_{i, k}}} )^2}.\]
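As a quick illustration (using a small simulated data frame, not one of the chapter case-studies), the in-sample MSE and RMSE of a fitted linear model can be computed directly in R:
# compute the training-set (in-sample) MSE and RMSE of a linear model
set.seed(1)
df  <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
fit <- lm(y ~ x1 + x2, data = df)
mse  <- mean((df$y - fitted(fit))^2)   # average squared residual
rmse <- sqrt(mse)
c(MSE = mse, RMSE = rmse)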
In the linear model case, the expected value of the MSE (over the distribution of training sets) for the training set is \(\frac{n - k - 1}{n + k + 1} E\), where \(E\) is the expected value of the MSE for the testing/validation data. Therefore, fitting a model and computing the MSE on the training set may produce an overly optimistic assessment (smaller RMSE) of how well the model may fit another dataset. This biased measure represents an in-sample estimate of the fit, whereas we are interested in the cross-validation estimate as an out-of-sample estimate.
In the linear regression model, cross-validation may not be as useful, since we can compute the exact correction factor \(\frac{n - k - 1}{n + k + 1}\) and obtain an estimate of the expected out-of-sample fit from the in-sample MSE (under)estimate. However, even in this situation, cross-validation remains useful, as it can be used to select an optimal regularized cost function.
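For example, under the linear-model setting above, the (optimistic) in-sample MSE could be inflated by the stated correction factor to approximate the expected out-of-sample MSE (the numbers below are arbitrary, for illustration only):
n <- 200; k <- 10; mse_train <- 4.2   # arbitrary sample size, number of predictors, in-sample MSE
mse_out_of_sample <- mse_train * (n + k + 1) / (n - k - 1)   # approximate expected out-of-sample MSE
mse_out_of_sample   # about 4.7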
In most other modeling procedures (e.g., logistic regression), there are no simple general closed-form expressions (formulas) for adjusting the in-sample fit estimate into a cross-validation error estimate. Cross-validation is a generally applicable way to predict the performance of a model on a validation set using stochastic computation instead of experimental, theoretical, mathematical, or analytic error estimates.
There are 2 classes of cross-validation approaches, exhaustive and non-exhaustive.
Exhaustive cross-validation methods are based on determining all possible ways to divide the original sample into training and testing data. For instance, leave-m-out cross-validation uses \(m\) observations for testing and the remaining \(n-m\) observations for training. The process is repeated over all \(C_m^n={n \choose m}\) partitions of the original sample, where \(n\) is the total number of observations and \(m\) is the number left out for validation, so the model must be fit and validated \(C_m^n\) times. Even the simplest case, \(m=1\) (the leave-one-out method), is only practical when \(n\) is small, due to this computational cost, and the cost grows very quickly with \(m\).
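To appreciate this combinatorial growth, we can count the required number of model fits directly (a tiny illustration):
# number of leave-m-out model fits, C(n, m), for a modest sample size
n <- 100
choose(n, 1)   # leave-1-out: 100 fits
choose(n, 2)   # leave-2-out: 4,950 fits
choose(n, 5)   # leave-5-out: about 75 million fits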
Non-exhaustive cross-validation methods avoid computing estimates/errors for all possible partitions of the original sample, and instead approximate them. For example, in k-fold cross-validation, the original sample is randomly partitioned into \(k\) equal-sized subsamples, or folds. Of the \(k\) subsamples, a single subsample is kept as the final testing data for validation of the model. The other \(k - 1\) subsamples are used as training data. The cross-validation process is then repeated \(k\) times, corresponding to the \(k\) folds, so that each of the \(k\) subsamples is used exactly once as the validation data. The corresponding \(k\) results are averaged (or otherwise aggregated) to generate a final pooled model-quality estimate. In k-fold validation, all observations are used for both training and validation, and each observation is used for validation exactly once. In general, \(k\) is a parameter that needs to be selected by the investigator (common values are \(5\) or \(10\)).
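For concreteness, a hand-rolled k-fold loop might look like the sketch below (using the built-in mtcars data and a placeholder lm model; the crossval package used later in this chapter automates this bookkeeping):
# sketch of a manual k-fold cross-validation loop (placeholder model and data)
set.seed(1)
k <- 5
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))     # random fold assignment
cv_err <- numeric(k)
for (j in 1:k) {
  train <- mtcars[folds != j, ]
  test  <- mtcars[folds == j, ]
  fit   <- lm(mpg ~ wt + hp, data = train)    # placeholder model
  cv_err[j] <- mean((test$mpg - predict(fit, test))^2)
}
mean(cv_err)   # pooled (averaged) CV estimate of the prediction MSE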
A special case of k-fold validation is \(k=n\) (the total number of observations), when it coincides with leave-one-out cross-validation.
A variation of k-fold validation is stratified k-fold cross-validation, where each fold has approximately the same mean response value. For instance, if the model represents a binary classification of cases (e.g., controls vs. patients), each fold contains roughly the same proportion of the two class labels.
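A minimal sketch of stratified fold assignment for a binary outcome (the outcome vector y below is simulated) may look like this:
# stratified k-fold assignment: folds are assigned separately within each class
set.seed(1)
y <- factor(rep(c("Control", "Patient"), times = c(30, 70)))
k <- 5
folds <- integer(length(y))
for (cls in levels(y)) {
  idx <- which(y == cls)
  folds[idx] <- sample(rep(1:k, length.out = length(idx)))
}
table(y, folds)   # each fold contains roughly the same class proportions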
Repeated random sub-sampling validation randomly splits the entire dataset into a training set, on which the model is fit, and a testing set, on which the predictive accuracy is assessed. The results are then averaged over all iterative splits. This method has an advantage over k-fold cross-validation in that the proportion of the training/testing split does not depend on the number of iterations (folds). Its drawback is that some observations may never be selected into the testing/validation subsample, whereas others may be selected multiple times. As validation subsets may overlap, the results may vary each time we repeat the validation protocol, unless we set a random seed.
Asymptotically, as the number of random splits increases, repeated random sub-sampling validation approaches leave-m-out cross-validation (with \(m\) equal to the size of the testing subsample).
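A short sketch of repeated random sub-sampling (Monte Carlo) validation, again with the built-in mtcars data and a placeholder lm model:
# repeated random 80%-20% splits; the prediction errors are averaged over splits
set.seed(1)
B   <- 50                                     # number of random splits
err <- numeric(B)
for (b in 1:B) {
  idx    <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
  fit    <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
  test   <- mtcars[-idx, ]
  err[b] <- mean((test$mpg - predict(fit, test))^2)
}
mean(err)   # averaged over all random splits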
In the examples below, we have intentionally suppressed some of the R output to save space. This is accomplished using the R Markdown chunk option {r eval=TRUE, results='hide'}. However, the reader is encouraged to try all the protocols hands-on, make modifications, and inspect and interpret the outputs.
Adaptive Boosting (AdaBoost)
This Parkinson’s Disease study involves heterogeneous neuroimaging, genetics, clinical, and phenotypic data for over 600 volunteers, which produced multivariate data for three cohorts (HC=Healthy Controls, PD=Parkinson’s, SWEDD=subjects without evidence for dopaminergic deficit).
# update packages
# update.packages()
Load the data: 06_PPMI_ClassificationValidationData_Short.csv.
ppmi_data <-read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
Binarize the Dx (clinical diagnoses) classes.
# binarize the Dx classes
ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
attach(ppmi_data)
head(ppmi_data)
# View (ppmi_data)
Next, apply model-free predictive analytics, e.g., AdaBoost classification, and report the results.
# Model-free analysis, classification
# install.packages("crossval")
# install.packages("ada")
# library("crossval")
require(crossval)
require(ada)
#set up adaboosting prediction function
# Define a new AdaBoost classification result-reporting function
my.ada <- function (train.x, train.y, test.x, test.y, negative, formula){
ada.fit <- ada(train.x, train.y)
predict.y <- predict(ada.fit, test.x)
#count TP, FP, TN, FN, Accuracy, etc.
out <- confusionMatrix(test.y, predict.y, negative = negative)
# negative is the label of a negative "null" sample (default: "control").
return (out)
}
When group sizes are imbalanced, we may need to rebalance them to avoid potential biases toward the dominant cohorts. In this case, we will re-balance the groups using SMOTE (Synthetic Minority Oversampling Technique), which may be used to handle class imbalance in binary classification and is implemented in the unbalanced package used below.
# balance cases
# SMOTE: Synthetic Minority Oversampling Technique to handle class imbalance in binary classification.
set.seed(1000)
# install.packages("unbalanced") to deal with unbalanced group data
require(unbalanced)
ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
uniqueID <- unique(ppmi_data$FID_IID)
ppmi_data <- ppmi_data[ppmi_data$VisitID==1, ]
ppmi_data$PD <- factor(ppmi_data$PD)
colnames(ppmi_data)
# ppmi_data.1<-ppmi_data[, c(3:281, 284, 287, 336:340, 341)]
n <- ncol(ppmi_data)
output.1 <- ppmi_data$PD
# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
# remove Default Real Clinical subject classifications!
input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID"))]
# output <- as.matrix(ppmi_data[ , which(names(ppmi_data) %in% {"PD"})])
output <- as.factor(ppmi_data$PD)
c(dim(input), dim(output))
#balance the dataset
set.seed(123)
data.1<-ubBalance(X= input, Y=output, type="ubSMOTE", percOver=300, percUnder=150, verbose=TRUE)
balancedData<-cbind(data.1$X, data.1$Y)
table(data.1$Y)
nrow(data.1$X); ncol(data.1$X)
nrow(balancedData); ncol(balancedData)
nrow(input); ncol(input)
colnames(balancedData) <- c(colnames(input), "PD")
Next, we’ll check the re-balanced cohort sizes.
###Check balance
## T test
alpha.0.05 <- 0.05
test.results.bin <- NULL # binarized/dichotomized p-values
test.results.raw <- NULL # raw p-values
# get a better error-handling t.test function that gracefully handles NA's and trivial variances
my.t.test.p.value <- function(input1, input2) {
obj <- try(t.test(input1, input2), silent=TRUE)
if (is(obj, "try-error"))
return(NA)
else
return(obj$p.value)
}
for (i in 1:ncol(balancedData))
{
test.results.raw[i] <- my.t.test.p.value(input[, i], balancedData [, i])
test.results.bin[i] <- ifelse(test.results.raw[i] > alpha.0.05, 1, 0)
# binarize the p-value (0=significant, 1=otherwise)
print(c("i=", i, "var=", colnames(balancedData[i]), "t-test_raw_p_value=", test.results.raw[i]))
}
# we can also employ (e.g., FDR, Bonferonni) correction for multiple testing!
# test.results.corr <- stats::p.adjust(test.results.raw, method = "fdr", n = length(test.results.raw))
# where methods are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
# plot(test.results.raw, test.results.corr)
# sum(test.results.raw < alpha.0.05, na.rm=T)/length(test.results.raw) #check proportion of inconsistencies
# sum(test.results.corr < alpha.0.05, na.rm =T)/length(test.results.corr)
qqplot(input[, 5], balancedData [, 5]) # check visually for differences between the distributions of the raw (input) and rebalanced data (for only one variable, in this case)
# Now, check visually for differences between the distributions of the raw (input) and rebalanced data.
# par(mar=c(1,1,1,1))
# par(mfrow=c(10,10))
# for(i in c(1:62,64:101)){ qqplot(balancedData [, i],input[, i]) } #except VisitID
# as the sample-size is changed:
length(input[, 5]); length(balancedData [, 5])
# to plot raw vs. rebalanced pairs (e.g., var="L_insular_cortex_Volume"), we need to equalize the lengths
#plot (input[, 5] +0*balancedData [, 5], balancedData [, 5]) # [, 5] == "L_insular_cortex_Volume"
# print(c("T-test results: ", test.results))
# zeros (0) are significant independent between-group T-test differences, ones (1) are insignificant
for (i in 1:(ncol(balancedData)-1))
{
test.results.raw [i] <- wilcox.test(input[, i], balancedData [, i])$p.value
test.results.bin [i] <- ifelse(test.results.raw [i] > alpha.0.05, 1, 0)
print(c("i=", i, "Wilcoxon-test=", test.results.raw [i]))
}
print(c("Wilcoxon test results: ", test.results.bin))
# test.results.corr <- stats::p.adjust(test.results.raw, method = "fdr", n = length(test.results.raw))
# where methods are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
# plot(test.results.raw, test.results.corr)
The next step will be the actual cross-validation.
# using raw data:
X <- as.data.frame(input); Y <- output
neg <- "1" # "Control" == "1"
# using Rebalanced data:
X <- as.data.frame(data.1$X); Y <- data.1$Y
# balancedData<-cbind(data.1$X, data.1$Y); dim(balancedData)
# Side note: There is a function name collision for "crossval", the same method is present in the "mlr" (machine Learning in R) package and in the "crossval" package.
# To specify a function call from a specific package do: packagename::functionname()
set.seed(115)
cv.out <- crossval::crossval(my.ada, X, Y, K = 5, B = 1, negative = neg)
# the label of a negative "null" sample (default: "control")
out <- diagnosticErrors(cv.out$stat)
print(cv.out$stat)
## FP TP TN FN
## 0.6 109.6 97.0 0.2
print(out)
## acc sens spec ppv npv lor
## 0.9961427 0.9981785 0.9938525 0.9945554 0.9979424 11.3918119
As we can see from the reported metrics, the overall averaged AdaBoost-based diagnostic predictions are quite good.
These data show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients. The data are available by default in R (sleep {datasets}).
First, load the data and report some graphs and summaries.
data(sleep); str(sleep)
X = as.matrix(sleep[, 1, drop=FALSE]) # increase in hours of sleep,
# drop is logical, if TRUE the result is coerced to the lowest possible dimension.
# The default is to drop if only one column is left, but not to drop if only one row is left.
Y = sleep[, 2] # drug given
plot(X ~ Y)
levels(Y) # "1" "2"
dim(X) # 20 1
Next, we will define a new LDA (linear discriminant analysis) predicting function and perform the cross-validation (CV) on the resulting predictor.
require("MASS") # for lda function
predfun.lda = function(train.x, train.y, test.x, test.y, negative)
{ lda.fit = lda(train.x, grouping=train.y)
ynew = predict(lda.fit, test.x)$class
# count TP, FP etc.
out = confusionMatrix(test.y, ynew, negative=negative)
return( out )
}
# install.packages("crossval")
library("crossval")
set.seed(123456)
cv.out <- crossval::crossval(predfun.lda, X, Y, K=5, B=20, negative="1", verbose=FALSE)
cv.out$stat
diagnosticErrors(cv.out$stat)
Execute the above code and interpret the diagnostic results measuring the performance of the LDA prediction.
attitude dataset
This data represents a survey of clerical employees of a large organization, with 35 employees from each of 30 (randomly selected) departments. The data are the proportion of favorable responses to 7 questions in each department.
Let’s load and summarize the data, which is available by default in attitude {datasets}.
# ?attitude, colnames(attitude)
# Note: when using a data frame, a time-saver is to use "." to indicate "include all covariates" in the DF.
# E.g., fit <- lm(Y ~ ., data = D)
data("attitude")
y = attitude[, 1] # rating variable
x = attitude[, -1] # data frame with the remaining variables
is.factor(y)
summary( lm(y ~ . , data=x) ) # R-squared: 0.7326
# set up lm prediction function
We will demonstrate model-based analytics using lm and lda, and will validate the forecasting using CV.
predfun.lm = function(train.x, train.y, test.x, test.y)
{ lm.fit = lm(train.y ~ . , data=train.x)
ynew = predict(lm.fit, test.x )
# compute squared error risk (MSE)
out = mean( (ynew - test.y)^2)
# note that, in general, when fitting linear model to continuous outcome variable (Y),
# we can't use the out<-confusionMatrix(test.y, ynew, negative=negative), as it requires a binary outcome
# this is why we use the MSE as an estimate of the discrepancy between observed & predicted values
return(out)
}
# require("MASS")
#predfun.lda = function(train.x, train.y, test.x, test.y, negative)
#{ lda.fit = lda(train.x, grouping=train.y)
# ynew = predict(lda.fit, test.x)$class
# count TP, FP etc.
# out = confusionMatrix(test.y, ynew, negative=negative)
#return( out )
#}
# prediction MSE using all variables
set.seed(123456)
cv.out.lm = crossval::crossval(predfun.lm, x, y, K=5, B=20, verbose=FALSE)
c(cv.out.lm$stat, cv.out.lm$stat.se) # 72.581198 3.736784
# reducing to using only two variables
cv.out.lm = crossval::crossval(predfun.lm, x[, c(1, 3)], y, K=5, B=20, verbose=FALSE)
c(cv.out.lm$stat, cv.out.lm$stat.se)
# 52.563957 2.015109
PD data (ppmi_data)
Let’s go back to the more elaborate PD data, starting with loading and preprocessing the derived PPMI data.
# ppmi_data <-read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
# ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
# attach(ppmi_data); head(ppmi_data)
# install.packages("crossval")
# library("crossval")
# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
# input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID"))]
# output <- as.factor(ppmi_data$PD)
# remove the irrelevant variables (e.g., visit ID)
output <- as.factor(ppmi_data$PD)
input <- ppmi_data[, -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID", "VisitID"))]
X = as.matrix(input) # Predictor variables
Y = as.matrix(output) # Actual PD clinical assessment
dim(X);
dim(Y)
layout(matrix(c(1, 2, 3, 4), 2, 2)) # optional 4 graphs/page
fit <- lm(Y~X);
plot(fit) # plot the fit
levels(as.factor(Y)) # "0" "1"
## [1] "0" "1"
c(dim(X), dim(Y)) # 422 100 422 1
## [1] 422 100 422 1
Apply cross-validation to assess the performance of the linear (LDA) classification model.
set.seed(12345)
# cv.out.lm = crossval::crossval(predfun.lm, as.data.frame(X), as.numeric(Y), K=5, B=20)
cv.out.lda = crossval::crossval(predfun.lda, X, Y, K=5, B=20, negative="1", verbose=FALSE)
# K=Number of folds; B=Number of repetitions.
# Results
#cv.out.lda$stat;
#cv.out.lda;
diagnosticErrors(cv.out.lda$stat)
## acc sens spec ppv npv lor
## 0.9617299 0.9513333 0.9872951 0.9945984 0.8918919 7.3258500
#cv.out.lm$stat;
#cv.out.lm;
#diagnosticErrors(cv.out.lm$stat)
The cross-validation (CV) output object includes the following components:
stat.cv: the vector of statistics returned by predfun for each cross-validation run
stat: the mean of the statistic returned by predfun, averaged over all cross-validation runs
stat.se: the variability, i.e., the corresponding standard error.
We have already seen a number of predict() functions, e.g., in Chapter 17. Below, we will add to the collection of predictive analytics and forecasting functions.
We already saw the logit model in Chapter 17. Now, we will demonstrate a logit-predictor function.
# ppmi_data <-read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
# ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
# install.packages("crossval"); library("crossval")
# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
# remove the irrelevant variables (e.g., visit ID)
output <- as.factor(ppmi_data$PD)
input <- ppmi_data[, -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID", "VisitID"))]
X = as.matrix(input) # Predictor variables
Y = as.matrix(output)
Note that the predicted values are on the log-odds (logit) scale, so they need to be transformed (e.g., exponentiated) to interpret them correctly.
lm.logit <- glm(as.numeric(Y) ~ ., data = as.data.frame(X), family = "binomial")
ynew <- predict(lm.logit, as.data.frame(X)); #plot(ynew)
ynew2 <- ifelse(exp(ynew)<0.5, 0, 1); # plot(ynew2)
predfun.logit = function(train.x, train.y, test.x, test.y, neg)
{ lm.logit <- glm(train.y ~ ., data = train.x, family = "binomial")
ynew = predict(lm.logit, test.x )
# compute TP, FP, TN, FN
ynew2 <- ifelse(exp(ynew)<0.5, 0, 1)
out = confusionMatrix(test.y, ynew2, negative=neg) # Binary outcome, we can use confusionMatrix
return( out )
}
# Reduce the bag of explanatory variables, purely to simplify the interpretation of the analytics in this example!
input.short <- input[, which(names(input) %in% c("R_fusiform_gyrus_Volume",
"R_fusiform_gyrus_ShapeIndex", "R_fusiform_gyrus_Curvedness",
"Sex", "Weight", "Age" , "chr12_rs34637584_GT", "chr17_rs11868035_GT",
"UPDRS_Part_I_Summary_Score_Baseline", "UPDRS_Part_I_Summary_Score_Month_03",
"UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline",
"UPDRS_Part_III_Summary_Score_Baseline",
"X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline"
))]
X = as.matrix(input.short)
cv.out.logit = crossval::crossval(predfun.logit, as.data.frame(X), as.numeric(Y), K=5, B=2, neg="1", verbose=FALSE)
cv.out.logit$stat.cv
## FP TP TN FN
## B1.F1 1 50 31 2
## B1.F2 0 60 19 6
## B1.F3 2 55 19 8
## B1.F4 3 58 23 0
## B1.F5 3 60 21 1
## B2.F1 2 56 22 4
## B2.F2 0 57 23 5
## B2.F3 3 60 20 1
## B2.F4 1 58 23 2
## B2.F5 1 54 27 3
diagnosticErrors(cv.out.logit$stat)
## acc sens spec ppv npv lor
## 0.9431280 0.9466667 0.9344262 0.9726027 0.8769231 5.5331424
Caution: Note that if you forget to back-transform (exponentiate) the predicted logistic model values (see ynew2 in predfun.logit), you will get nonsense results, e.g., all cases may be predicted to be in one class, with trivial sensitivity or NPV.
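One simple way to avoid this pitfall (illustrated here on the built-in mtcars data, not on the PD data) is to request predictions directly on the probability scale with type="response" and threshold them at 0.5:
# with a binomial glm, type="response" returns probabilities in [0, 1],
# so thresholding at 0.5 is unambiguous
fit  <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
phat <- predict(fit, newdata = mtcars, type = "response")
yhat <- ifelse(phat < 0.5, 0, 1)
table(observed = mtcars$am, predicted = yhat)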
In Chapters 7 and 20 we discussed the linear and quadratic discriminant analysis models. Let’s now introduce a predfun.qda() function.
predfun.qda = function(train.x, train.y, test.x, test.y, negative)
{
require("MASS") # for lda function
qda.fit = qda(train.x, grouping=train.y)
ynew = predict(qda.fit, test.x)$class
out.qda = confusionMatrix(test.y, ynew, negative=negative)
return( out.qda )
}
cv.out.qda = crossval::crossval(predfun.qda, as.data.frame(input.short), as.factor(Y), K=5, B=20, neg="1")
## Error in qda.default(x, grouping, ...): rank deficiency in group 1
diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat);
## Error in diagnosticErrors(cv.out.qda$stat): object 'cv.out.qda' not found
This error message: “Error in qda.default(x, grouping, …) : rank deficiency in group 1” indicates that there is a rank deficiency, i.e. some variables are collinear and one or more covariance matrices cannot be inverted to obtain the estimates in group 1 (Controls)!
If you remove the strongly correlated data elements (“R_fusiform_gyrus_Volume”, “R_fusiform_gyrus_ShapeIndex”, and “R_fusiform_gyrus_Curvedness”), the rank-deficiency problem goes away!
input.short2 <- input[, which(names(input) %in% c("R_fusiform_gyrus_Volume", "Sex", "Weight", "Age" , "chr17_rs11868035_GT", "UPDRS_Part_I_Summary_Score_Baseline",
"UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline",
"UPDRS_Part_III_Summary_Score_Baseline",
"X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline"
))]
X = as.matrix(input.short2)
cv.out.qda = crossval::crossval(predfun.qda, as.data.frame(X), as.numeric(Y), K=5, B=2, neg="1")
It makes sense to contrast the QDA and GLM/Logit predictions.
diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.logit$stat)
## acc sens spec ppv npv lor
## 0.9407583 0.9533333 0.9098361 0.9629630 0.8880000 5.3285694
## acc sens spec ppv npv lor
## 0.9431280 0.9466667 0.9344262 0.9726027 0.8769231 5.5331424
Clearly, both the QDA and Logit model predictions are quite similar and reliable.
Previously, in Chapter 7 we saw some examples of LDA/QDA methods. Now, we’ll discuss more of the details. Both LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) use probabilistic models of the class-conditional distribution of the data \(P(X \mid Y=k)\) for each class \(k\). Their predictions are obtained by using Bayes’ theorem (http://wiki.socr.umich.edu/index.php/SMHS_BayesianInference#Bayesian_Rule), see Appendix:
\[P(Y=k \mid X)= \frac{P(X \mid Y=k)P(Y=k)}{P(X)}=\frac{P(X \mid Y=k)P(Y=k)}{\sum_{l}P(X \mid Y=l)P(Y=l)},\] and we select the class \(k\) that maximizes this conditional (posterior) probability (maximum a posteriori estimation). In linear and quadratic discriminant analysis, \(P(X \mid Y)\) is modelled as a multivariate Gaussian distribution with density:
\[P(X \mid Y=k)= \frac{1}{{(2 \pi)}^{n/2} {\vert{\Sigma_{k}}\vert}^{\frac{1}{2}}} e^{\left(-\frac{1}{2}{(X- \mu_k)}^T \Sigma_k^{-1}(X- \mu_k)\right)},\] where \(n\) is the dimension of \(X\).
This model can be used to classify data by using the training data to estimate the class prior probabilities \(P(Y=k)\), the class means \(\mu_k\), and the covariance matrices.
In the linear case (LDA), the Gaussians for each class are assumed to share the same covariance matrix: \(\Sigma_k=\Sigma\) for each class \(k\). This leads to linear decision surfaces between classes, which is clear from comparing the log-probability ratios of two classes (\(k\) and \(l\)):
\(LOR=\log\left(\frac{P(Y=k \mid X)}{P(Y=l \mid X)}\right)\) (LOR=0 \(\Longleftrightarrow\) the two posterior probabilities are identical, i.e., same class):
\[LOR=\log\left(\frac{P(Y=k \mid X)}{P(Y=l \mid X)}\right)=0 \ \Longleftrightarrow \ {(\mu_k-\mu_l)}^T \Sigma^{-1} X= \frac{1}{2}\left(\mu_k^T \Sigma^{-1} \mu_k-\mu_l^T \Sigma^{-1} \mu_l\right),\] which is a linear equation in \(X\) (assuming equal class priors).
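Equivalently, assuming a shared covariance matrix \(\Sigma\) and class prior probabilities \(\pi_k=P(Y=k)\), the LDA class assignment maximizes the linear discriminant (score) functions:
\[\delta_k(X)= X^T \Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log(\pi_k), \ \ \ \hat{Y}=\arg\max_k \delta_k(X),\] which are linear in \(X\).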
But, in the more general, quadratic case of QDA, there are no assumptions on the covariance matrices \(\Sigma_k\) of the Gaussians, leading to quadratic decision surfaces.
LDA is similar to GLM methods (e.g., ANOVA and regression analyses), as it also attempts to express one dependent variable as a linear combination of other features or data elements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas LDA has continuous independent variables and a categorical dependent variable (i.e., Dx/class label). Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables.
predfun.lda = function(train.x, train.y, test.x, test.y, neg)
{
require("MASS")
lda.fit = lda(train.x, grouping=train.y)
ynew = predict(lda.fit, test.x)$class
out.lda = confusionMatrix(test.y, ynew, negative=neg)
return( out.lda )
}
Similarly to LDA, the QDA prediction function can be defined by:
predfun.qda = function(train.x, train.y, test.x, test.y, neg)
{
require("MASS") # for lda function
qda.fit = qda(train.x, grouping=train.y)
ynew = predict(qda.fit, test.x)$class
out.qda = confusionMatrix(test.y, ynew, negative=neg)
return( out.qda )
}
We saw Neural Networks (NNs) in Chapter 10. Applying NNs is not straightforward. We have to create a design matrix with an indicator column for the response feature. In addition, we need to write a predict function to translate the output of neuralnet() into analytical forecasts.
# predict nn
library("neuralnet")
pred = function(nn, dat) {
yhat = compute(nn, dat)$net.result
yhat = apply(yhat, 1, which.max)-1
return(yhat)
}
my.neural <- function (train.x, train.y, test.x, test.y,method,layer=c(5,5)){
train.x <- as.data.frame(train.x)
train.y <- as.data.frame(train.y)
colnames(train.x) <- paste0('V', 1:ncol(X))
colnames(train.y) <- "V1"
train_y_ind = model.matrix(~factor(train.y$V1)-1)
colnames(train_y_ind) = paste0('out', 0:1)
train = cbind(train.x, train_y_ind)
y_names = paste0('out', 0:1)
x_names = paste0('V', 1:ncol(train.x))
nn = neuralnet(
paste(paste(y_names, collapse='+'),
'~',
paste(x_names, collapse='+')),
train,
hidden=layer,
linear.output=FALSE,
lifesign='full', lifesign.step=1000)
#predict
predict.y <- pred(nn, test.x)
out <- crossval::confusionMatrix(test.y, predict.y,negative = 0)
return (out)
}
set.seed(1234)
cv.out.nn <- crossval::crossval(my.neural, scale(X), Y, K = 5, B = 1, layer=c(20,20), verbose = F) # scaling the predictors is necessary
## hidden: 20, 20 thresh: 0.01 rep: 1/1 steps: 63 error: 1.02185 time: 0.1 secs
## hidden: 20, 20 thresh: 0.01 rep: 1/1 steps: 79 error: 1.01374 time: 0.1 secs
## hidden: 20, 20 thresh: 0.01 rep: 1/1 steps: 73 error: 1.02399 time: 0.11 secs
## hidden: 20, 20 thresh: 0.01 rep: 1/1 steps: 66 error: 1.03016 time: 0.1 secs
## hidden: 20, 20 thresh: 0.01 rep: 1/1 steps: 72 error: 1.01491 time: 0.1 secs
crossval::diagnosticErrors(cv.out.nn$stat)
## acc sens spec ppv npv
## 0.9454976303 0.9016393443 0.9633333333 0.9090909091 0.9601328904
## lor
## 5.4841051313
We also saw SVMs in Chapter 10. Let’s try cross-validation on linear and Gaussian (radial) kernel SVMs. We may expect the linear SVM to achieve results close to, or even better than, the Gaussian kernel, since this dataset has a large \(k\) (number of features) compared with \(n\) (number of cases), which we studied in detail in Chapter 10.
library("e1071")
##
## Attaching package: 'e1071'
## The following object is masked from 'package:mlr':
##
## impute
my.svm <- function (train.x, train.y, test.x, test.y,method,cost=1,gamma=1/ncol(dx_norm),coef0=0,degree=3){
svm_l.fit <- svm(x = train.x, y=as.factor(train.y),kernel = method)
predict.y <- predict(svm_l.fit, test.x)
out <- crossval::confusionMatrix(test.y, predict.y,negative = 0)
return (out)
}
# Linear kernel
set.seed(123)
# cost is assumed to come from a prior SVM-tuning step (e.g., e1071::tune.svm), whose code/output was suppressed above
cv.out.svml <- crossval::crossval(my.svm, as.data.frame(X), Y, K = 5, B = 1,method = "linear",cost=tune_svm$best.parameters$cost,verbose = F)
diagnosticErrors(cv.out.svml$stat)
## acc sens spec ppv npv
## 0.9502369668 0.9098360656 0.9666666667 0.9173553719 0.9634551495
## lor
## 5.6789307585
# Gaussian kernel
set.seed(123)
# cost and gamma are assumed to come from a prior SVM-tuning step (e.g., e1071::tune.svm), whose code/output was suppressed above
cv.out.svmg <- crossval::crossval(my.svm, as.data.frame(X), Y, K = 5, B = 1,method = "radial",cost=tune_svmg$best.parameters$cost,gamma=tune_svmg$best.parameters$gamma,verbose = F)
diagnosticErrors(cv.out.svmg$stat)
## acc sens spec ppv npv
## 0.9454976303 0.9262295082 0.9533333333 0.8897637795 0.9694915254
## lor
## 5.5470977226
Indeed, both types of kernels yield good-quality predictors according to the assessment metrics reported by the diagnosticErrors() function.
As we saw in Chapter 6, k-NN is a non-parametric method for either classification or regression, where the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification (the predicted label is the majority vote of the k nearest neighbors) or for regression (the prediction is the average of the values of the k nearest neighbors).
Let’s now build the corresponding predfun.knn() method.
# X = as.matrix(input) # Predictor variables X = as.matrix(input.short2)
# Y = as.matrix(output) # Outcome
# KNN (k-nearest neighbors)
library("class")
# knn.fit.test <- knn(X, X, cl = Y, k=3, prob=F); predict(as.matrix(knn.fit.test), X)$class
# table(knn.fit.test, Y); confusionMatrix(Y, knn.fit.test, negative="1")
# This can be used for polytomous variable (multiple classes)
predfun.knn = function(train.x, train.y, test.x, test.y, neg)
{
require("class")
knn.fit = knn(train.x, test.x, cl = train.y, prob=T) # knn is already a prediction function!!!
# ynew = predict(knn.fit, test.x)$class # no need of another prediction, in this case
out.knn = confusionMatrix(test.y, knn.fit, negative=neg)
return( out.knn )
}
cv.out.knn = crossval::crossval(predfun.knn, X, Y, K=5, B=2, neg="1")
# Compare the classifiers (LDA, QDA, kNN, and logit)
diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.knn$stat); diagnosticErrors(cv.out.logit$stat);
We can also examine the performance of k-NN prediction on the PPMI (Parkinson’s disease) data. Start by partitioning the data into training and testing sets.
# TRAINING: 75% of the sample size
sample_size <- floor(0.75 * nrow(input))
## set the seed to make your partition reproducible
set.seed(1234)
input.train.ind <- sample(seq_len(nrow(input)), size = sample_size)
input.train <- input[input.train.ind, ]
output.train <- as.matrix(output)[input.train.ind, ]
# TESTING DATA
input.test <- input[-input.train.ind, ]
output.test <- as.matrix(output)[-input.train.ind, ]
Then fit the k-NN model and report the results.
library("class")
knn_model <- knn(train= input.train, input.test, cl=as.factor(output.train), k=2)
#plot(knn_model)
summary(knn_model)
attributes(knn_model)
# cross-validation
knn_model.cv <- knn.cv(train= input.train, cl=as.factor(output.train), k=2)
summary(knn_model.cv)
In Chapter 12, we showed that k-means clustering (k-MC) aims to partition \(n\) observations into \(k\) clusters, where each observation belongs to the cluster with the nearest mean, which acts as a prototype of the cluster. k-MC partitions the data space into Voronoi cells. In general, there is no computationally tractable exact solution, i.e., the problem is NP-hard. However, there are efficient algorithms that converge quickly to local optima, e.g., Lloyd’s algorithm, or the expectation-maximization algorithm for mixtures of Gaussian distributions, via an iterative refinement approach.
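The iterative refinement idea behind k-means (Lloyd’s algorithm) can be sketched in a few lines of R (for illustration only; the function name is made up, empty clusters are not handled, and in practice we simply call kmeans(), as below):
# minimal sketch of Lloyd's k-means iteration (illustration only)
lloyd_kmeans <- function(x, k, iter = 25) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random initial centers
  cluster <- rep(1L, nrow(x))
  for (it in 1:iter) {
    # assignment step: squared distance from every point to every center
    d <- sapply(1:k, function(j) colSums((t(x) - centers[j, ])^2))
    cluster <- apply(d, 1, which.min)
    # update step: each center becomes the mean of its assigned points
    # (note: empty clusters are not handled in this sketch)
    for (j in 1:k) centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
  }
  list(cluster = cluster, centers = centers)
}
# e.g., lloyd_kmeans(scale(mtcars[, c("mpg", "wt", "hp")]), k = 2)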
kmeans_model <- kmeans(input.train, 2)
layout(matrix(1, 1))
# tiff("C:/Users/Dinov/Desktop/test.tiff", width = 10, height = 10, units = 'in', res = 300)
fpc::plotcluster(input.train, output.train, col = kmeans_model$cluster)
cluster::clusplot(input.train, kmeans_model$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
par(mfrow=c(10,10))
# the next figure is very large and will not render in RStudio, you may need to save it as PDF file!
# pdf("C:/Users/Dinov/Desktop/test.pdf", width = 50, height = 50)
# with(ppmi_data[,1:10], pairs(input.train[,1:10], col=c(1:2)[kmeans_model$cluster]))
# dev.off()
with(ppmi_data[,1:10], pairs(input.train[,1:10], col=c(1:2)[kmeans_model$cluster]))
# plot(input.train, col = kmeans_model$cluster)
# points(kmeans_model$centers, col = 1:2, pch = 8, cex = 2)
## cluster centers "fitted" to each obs.:
fitted.kmeans <- fitted(kmeans_model); head(fitted.kmeans)
## L_insular_cortex_AvgMeanCurvature L_insular_cortex_ComputeArea
## 2 0.1071299082 2635.580514
## 2 0.1071299082 2635.580514
## 1 0.2221893533 1134.578902
## 2 0.1071299082 2635.580514
## 2 0.1071299082 2635.580514
## 2 0.1071299082 2635.580514
## L_insular_cortex_Volume L_insular_cortex_ShapeIndex
## 2 7969.485443 0.3250065829
## 2 7969.485443 0.3250065829
## 1 2111.385018 0.2788562513
## 2 7969.485443 0.3250065829
## 2 7969.485443 0.3250065829
## 2 7969.485443 0.3250065829
## L_insular_cortex_Curvedness R_insular_cortex_AvgMeanCurvature
## 2 0.1069266191 0.1215389633
## 2 0.1069266191 0.1215389633
## 1 0.2583588034 0.2937407279
## 2 0.1069266191 0.1215389633
## 2 0.1069266191 0.1215389633
## 2 0.1069266191 0.1215389633
## R_insular_cortex_ComputeArea R_insular_cortex_Volume
## 2 2025.4208294 4930.217222
## 2 2025.4208294 4930.217222
## 1 789.0137289 1154.801329
## 2 2025.4208294 4930.217222
## 2 2025.4208294 4930.217222
## 2 2025.4208294 4930.217222
## R_insular_cortex_ShapeIndex R_insular_cortex_Curvedness
## 2 0.3021979707 0.1285003972
## 2 0.3021979707 0.1285003972
## 1 0.2692948282 0.3440270266
## 2 0.3021979707 0.1285003972
## 2 0.3021979707 0.1285003972
## 2 0.3021979707 0.1285003972
## L_cingulate_gyrus_AvgMeanCurvature L_cingulate_gyrus_ComputeArea
## 2 0.1330584840 3975.836928
## 2 0.1330584840 3975.836928
## 1 0.2726862644 1406.748536
## 2 0.1330584840 3975.836928
## 2 0.1330584840 3975.836928
## 2 0.1330584840 3975.836928
## L_cingulate_gyrus_Volume L_cingulate_gyrus_ShapeIndex
## 2 10007.626258 0.4473695944
## 2 10007.626258 0.4473695944
## 1 2011.044152 0.3798959265
## 2 10007.626258 0.4473695944
## 2 10007.626258 0.4473695944
## 2 10007.626258 0.4473695944
## L_cingulate_gyrus_Curvedness R_cingulate_gyrus_AvgMeanCurvature
## 2 0.07623665616 0.1431254429
## 2 0.07623665616 0.1431254429
## 1 0.20762998775 0.2723838183
## 2 0.07623665616 0.1431254429
## 2 0.07623665616 0.1431254429
## 2 0.07623665616 0.1431254429
## R_cingulate_gyrus_ComputeArea R_cingulate_gyrus_Volume
## 2 3961.003469 9968.622665
## 2 3961.003469 9968.622665
## 1 1263.162934 1835.915640
## 2 3961.003469 9968.622665
## 2 3961.003469 9968.622665
## 2 3961.003469 9968.622665
## R_cingulate_gyrus_ShapeIndex R_cingulate_gyrus_Curvedness
## 2 0.4393269631 0.09585915614
## 2 0.4393269631 0.09585915614
## 1 0.3439061415 0.25525363889
## 2 0.4393269631 0.09585915614
## 2 0.4393269631 0.09585915614
## 2 0.4393269631 0.09585915614
## L_caudate_AvgMeanCurvature L_caudate_ComputeArea L_caudate_Volume
## 2 0.2485862167 798.2086326 1236.8761875
## 2 0.2485862167 798.2086326 1236.8761875
## 1 0.7618949830 170.4455529 131.9571447
## 2 0.2485862167 798.2086326 1236.8761875
## 2 0.2485862167 798.2086326 1236.8761875
## 2 0.2485862167 798.2086326 1236.8761875
## L_caudate_ShapeIndex L_caudate_Curvedness R_caudate_AvgMeanCurvature
## 2 0.2893636194 0.2766738989 0.1859592222
## 2 0.2893636194 0.2766738989 0.1859592222
## 1 0.3416820021 0.6654124554 0.7327517931
## 2 0.2893636194 0.2766738989 0.1859592222
## 2 0.2893636194 0.2766738989 0.1859592222
## 2 0.2893636194 0.2766738989 0.1859592222
## R_caudate_ComputeArea R_caudate_Volume R_caudate_ShapeIndex
## 2 1101.9124553 1961.9315566 0.3058288485
## 2 1101.9124553 1961.9315566 0.3058288485
## 1 203.1959333 175.1086948 0.3496297630
## 2 1101.9124553 1961.9315566 0.3058288485
## 2 1101.9124553 1961.9315566 0.3058288485
## 2 1101.9124553 1961.9315566 0.3058288485
## R_caudate_Curvedness L_putamen_AvgMeanCurvature L_putamen_ComputeArea
## 2 0.1815527895 0.1694057297 1112.8660462
## 2 0.1815527895 0.1694057297 1112.8660462
## 1 0.6127042499 0.5416908762 407.9678282
## 2 0.1815527895 0.1694057297 1112.8660462
## 2 0.1815527895 0.1694057297 1112.8660462
## 2 0.1815527895 0.1694057297 1112.8660462
## L_putamen_Volume L_putamen_ShapeIndex L_putamen_Curvedness
## 2 2245.9881012 0.2819146444 0.1962480439
## 2 2245.9881012 0.2819146444 0.1962480439
## 1 447.4248324 0.2937043962 0.5629161599
## 2 2245.9881012 0.2819146444 0.1962480439
## 2 2245.9881012 0.2819146444 0.1962480439
## 2 2245.9881012 0.2819146444 0.1962480439
## R_putamen_AvgMeanCurvature R_putamen_ComputeArea R_putamen_Volume
## 2 0.1275343443 1541.8270406 3713.0640176
## 2 0.1275343443 1541.8270406 3713.0640176
## 1 0.3583084577 614.1127111 855.8598884
## 2 0.1275343443 1541.8270406 3713.0640176
## 2 0.1275343443 1541.8270406 3713.0640176
## 2 0.1275343443 1541.8270406 3713.0640176
## R_putamen_ShapeIndex R_putamen_Curvedness L_hippocampus_AvgMeanCurvature
## 2 0.3204750773 0.1327026259 0.1302027871
## 2 0.3204750773 0.1327026259 0.1302027871
## 1 0.2724724452 0.4323588270 0.2906379356
## 2 0.3204750773 0.1327026259 0.1302027871
## 2 0.3204750773 0.1327026259 0.1302027871
## 2 0.3204750773 0.1327026259 0.1302027871
## L_hippocampus_ComputeArea L_hippocampus_Volume L_hippocampus_ShapeIndex
## 2 1591.1420376 3891.725208 0.3155120036
## 2 1591.1420376 3891.725208 0.3155120036
## 1 691.6213926 1071.966747 0.2494871276
## 2 1591.1420376 3891.725208 0.3155120036
## 2 1591.1420376 3891.725208 0.3155120036
## 2 1591.1420376 3891.725208 0.3155120036
## L_hippocampus_Curvedness R_hippocampus_AvgMeanCurvature
## 2 0.1303731070 0.1244254565
## 2 0.1303731070 0.1244254565
## 1 0.4024739268 0.2596758352
## 2 0.1303731070 0.1244254565
## 2 0.1303731070 0.1244254565
## 2 0.1303731070 0.1244254565
## R_hippocampus_ComputeArea R_hippocampus_Volume R_hippocampus_ShapeIndex
## 2 1703.3011422 4300.188018 0.3139513636
## 2 1703.3011422 4300.188018 0.3139513636
## 1 818.5087378 1426.010833 0.2583831982
## 2 1703.3011422 4300.188018 0.3139513636
## 2 1703.3011422 4300.188018 0.3139513636
## 2 1703.3011422 4300.188018 0.3139513636
## R_hippocampus_Curvedness L_fusiform_gyrus_AvgMeanCurvature
## 2 0.1283933772 0.1129471492
## 2 0.1283933772 0.1129471492
## 1 0.3607905897 0.1574323260
## 2 0.1283933772 0.1129471492
## 2 0.1283933772 0.1129471492
## 2 0.1283933772 0.1129471492
## L_fusiform_gyrus_ComputeArea L_fusiform_gyrus_Volume
## 2 3711.982193 12044.820784
## 2 3711.982193 12044.820784
## 1 2868.641005 6635.242239
## 2 3711.982193 12044.820784
## 2 3711.982193 12044.820784
## 2 3711.982193 12044.820784
## L_fusiform_gyrus_ShapeIndex L_fusiform_gyrus_Curvedness
## 2 0.3048966952 0.09805833475
## 2 0.3048966952 0.09805833475
## 1 0.3199120500 0.13989495538
## 2 0.3048966952 0.09805833475
## 2 0.3048966952 0.09805833475
## 2 0.3048966952 0.09805833475
## R_fusiform_gyrus_AvgMeanCurvature R_fusiform_gyrus_ComputeArea
## 2 0.1206680660 3544.746299
## 2 0.1206680660 3544.746299
## 1 0.1509123825 2876.395903
## 2 0.1206680660 3544.746299
## 2 0.1206680660 3544.746299
## 2 0.1206680660 3544.746299
## R_fusiform_gyrus_Volume R_fusiform_gyrus_ShapeIndex
## 2 11498.095787 0.3331544076
## 2 11498.095787 0.3331544076
## 1 7258.816201 0.3245023300
## 2 11498.095787 0.3331544076
## 2 11498.095787 0.3331544076
## 2 11498.095787 0.3331544076
## R_fusiform_gyrus_Curvedness Sex Weight Age
## 2 0.1094973356 1.355102041 81.11061224 60.59132449
## 2 0.1094973356 1.355102041 81.11061224 60.59132449
## 1 0.1490957531 1.281690141 85.71267606 61.27513099
## 2 0.1094973356 1.355102041 81.11061224 60.59132449
## 2 0.1094973356 1.355102041 81.11061224 60.59132449
## 2 0.1094973356 1.355102041 81.11061224 60.59132449
## chr12_rs34637584_GT chr17_rs11868035_GT chr17_rs11012_GT
## 2 0.01224489796 0.6408163265 0.3306122449
## 2 0.01224489796 0.6408163265 0.3306122449
## 1 0.00000000000 0.5633802817 0.3521126761
## 2 0.01224489796 0.6408163265 0.3306122449
## 2 0.01224489796 0.6408163265 0.3306122449
## 2 0.01224489796 0.6408163265 0.3306122449
## chr17_rs393152_GT chr17_rs12185268_GT chr17_rs199533_GT
## 2 0.4163265306 0.3836734694 0.3714285714
## 2 0.4163265306 0.3836734694 0.3714285714
## 1 0.4507042254 0.4225352113 0.3943661972
## 2 0.4163265306 0.3836734694 0.3714285714
## 2 0.4163265306 0.3836734694 0.3714285714
## 2 0.4163265306 0.3836734694 0.3714285714
## chr17_rs199533_DP chr17_rs199533_GQ UPDRS_Part_I_Summary_Score_Baseline
## 2 38.05714286 73.41224490 1.1315502630
## 2 38.05714286 73.41224490 1.1315502630
## 1 38.71830986 75.02816901 0.9295774648
## 2 38.05714286 73.41224490 1.1315502630
## 2 38.05714286 73.41224490 1.1315502630
## 2 38.05714286 73.41224490 1.1315502630
## UPDRS_Part_I_Summary_Score_Month_03 UPDRS_Part_I_Summary_Score_Month_06
## 2 1.577289982 1.328677304
## 2 1.577289982 1.328677304
## 1 1.689833367 1.425508720
## 2 1.577289982 1.328677304
## 2 1.577289982 1.328677304
## 2 1.577289982 1.328677304
## UPDRS_Part_I_Summary_Score_Month_09 UPDRS_Part_I_Summary_Score_Month_12
## 2 1.440205481 1.318939174
## 2 1.440205481 1.318939174
## 1 1.431763576 1.005858755
## 2 1.440205481 1.318939174
## 2 1.440205481 1.318939174
## 2 1.440205481 1.318939174
## UPDRS_Part_I_Summary_Score_Month_18 UPDRS_Part_I_Summary_Score_Month_24
## 2 1.578790874 1.634198123
## 2 1.578790874 1.634198123
## 1 1.600658728 1.499416341
## 2 1.578790874 1.634198123
## 2 1.578790874 1.634198123
## 2 1.578790874 1.634198123
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline
## 2 4.776886176
## 2 4.776886176
## 1 3.478873239
## 2 4.776886176
## 2 4.776886176
## 2 4.776886176
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03
## 2 7.049362961
## 2 7.049362961
## 1 6.836643898
## 2 7.049362961
## 2 7.049362961
## 2 7.049362961
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_06
## 2 6.759766138
## 2 6.759766138
## 1 6.505717719
## 2 6.759766138
## 2 6.759766138
## 2 6.759766138
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_09
## 2 7.211055904
## 2 7.211055904
## 1 7.202864291
## 2 7.211055904
## 2 7.211055904
## 2 7.211055904
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12
## 2 5.421732325
## 2 5.421732325
## 1 4.532224835
## 2 5.421732325
## 2 5.421732325
## 2 5.421732325
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18
## 2 7.901011501
## 2 7.901011501
## 1 7.832944816
## 2 7.901011501
## 2 7.901011501
## 2 7.901011501
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24
## 2 6.637002398
## 2 6.637002398
## 1 5.330873291
## 2 6.637002398
## 2 6.637002398
## 2 6.637002398
## UPDRS_Part_III_Summary_Score_Baseline
## 2 16.16457783
## 2 16.16457783
## 1 11.67605634
## 2 16.16457783
## 2 16.16457783
## 2 16.16457783
## UPDRS_Part_III_Summary_Score_Month_03
## 2 22.30678456
## 2 22.30678456
## 1 20.61865086
## 2 22.30678456
## 2 22.30678456
## 2 22.30678456
## UPDRS_Part_III_Summary_Score_Month_06
## 2 23.00938768
## 2 23.00938768
## 1 20.34311950
## 2 23.00938768
## 2 23.00938768
## 2 23.00938768
## UPDRS_Part_III_Summary_Score_Month_09
## 2 23.01972893
## 2 23.01972893
## 1 20.33610905
## 2 23.01972893
## 2 23.01972893
## 2 23.01972893
## UPDRS_Part_III_Summary_Score_Month_12
## 2 18.33941552
## 2 18.33941552
## 1 12.56906281
## 2 18.33941552
## 2 18.33941552
## 2 18.33941552
## UPDRS_Part_III_Summary_Score_Month_18
## 2 24.38412049
## 2 24.38412049
## 1 21.90061112
## 2 24.38412049
## 2 24.38412049
## 2 24.38412049
## UPDRS_Part_III_Summary_Score_Month_24
## 2 20.30536244
## 2 20.30536244
## 1 14.26860539
## 2 20.30536244
## 2 20.30536244
## 2 20.30536244
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline
## 2 6.230119712
## 2 6.230119712
## 1 6.394366197
## 2 6.230119712
## 2 6.230119712
## 2 6.230119712
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_06
## 2 5.882457828
## 2 5.882457828
## 1 6.189583235
## 2 5.882457828
## 2 5.882457828
## 2 5.882457828
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_12
## 2 6.047880008
## 2 6.047880008
## 1 6.715505639
## 2 6.047880008
## 2 6.047880008
## 2 6.047880008
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24
## 2 6.561365121
## 2 6.561365121
## 1 7.168051258
## 2 6.561365121
## 2 6.561365121
## 2 6.561365121
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Baseline
## 2 5.183673469
## 2 5.183673469
## 1 5.507042254
## 2 5.183673469
## 2 5.183673469
## 2 5.183673469
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06
## 2 5.142491748
## 2 5.142491748
## 1 5.005294805
## 2 5.142491748
## 2 5.142491748
## 2 5.142491748
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_12
## 2 5.214925842
## 2 5.214925842
## 1 5.102426051
## 2 5.214925842
## 2 5.214925842
## 2 5.214925842
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_24
## 2 5.368586942
## 2 5.368586942
## 1 5.142588648
## 2 5.368586942
## 2 5.368586942
## 2 5.368586942
resid.kmeans <- (input.train - fitted(kmeans_model))
# define the sum of squares function
ss <- function(data) sum(scale(data, scale = FALSE)^2)
## Equalities
cbind(kmeans_model[c("betweenss", "tot.withinss", "totss")], # the same two columns
c (ss(fitted.kmeans), ss(resid.kmeans), ss(input.train)))
## [,1] [,2]
## betweenss 15462062254 15462062254
## tot.withinss 12249286905 12249286905
## totss 27711349159 27711349159
# validation
stopifnot(all.equal(kmeans_model$totss, ss(input.train)),
all.equal(kmeans_model$tot.withinss, ss(resid.kmeans)),
## these three are the same:
all.equal(kmeans_model$betweenss, ss(fitted.kmeans)),
all.equal(kmeans_model$betweenss, kmeans_model$totss - kmeans_model$tot.withinss),
## and hence also
all.equal(ss(input.train), ss(fitted.kmeans) + ss(resid.kmeans))
)
# kmeans(input.train, 1)$withinss
# trivial one-cluster, (its W.SS == ss(input.train))
clust_kmeans2 = kmeans(scale(X), center=X[1:2,],iter.max=100, algorithm='Lloyd')
We may get an “empty cluster” error, instead of two clusters, when we naively pick two data points as the initial centers (here, rows of the unscaled data are used to initialize k-means on the scaled data). One way to solve this problem is to use the k-means++ initialization.
# k++ initialize
kpp_init = function(dat, K) {
x = as.matrix(dat)
n = nrow(x)
# Randomly choose a first center
centers = matrix(NA, nrow=K, ncol=ncol(x))
centers[1,] = as.matrix(x[sample(1:n, 1),])
for (k in 2:K) {
# Calculate dist^2 from each point to each of the existing centers
dists = matrix(NA, nrow=n, ncol=k-1)
for (j in 1:(k-1)) {
temp = sweep(x, 2, centers[j,], '-')
dists[,j] = rowSums(temp^2)
}
dists = rowMeans(dists)  # average dist^2 to the existing centers (classical k-means++ uses the minimum distance instead)
# Draw the next center with probability proportional to this dist^2
cumdists = cumsum(dists)
prop = runif(1, min=0, max=cumdists[n])
centers[k,] = as.matrix(x[min(which(cumdists > prop)),])
}
return(centers)
}
clust_kmeans2_plus = kmeans(scale(X), kpp_init(scale(X), 2), iter.max=100, algorithm='Lloyd')
Now let’s evaluate the model. The first step is to justify the selection of k=2. We use silhouette() from the package cluster. Recall from Chapter 13 that the silhouette value is between -1 and 1, and negative values represent “mis-clustered” cases.
clust_k2 = clust_kmeans2_plus$cluster
require(cluster)
## Loading required package: cluster
# X = as.matrix(input.short2)
# as the data is too large for the silhouette plot, we'll just subsample and plot 100 random cases
subset_int <- sample(nrow(X),100) # 100 cases out of the 422 total cases
dis = dist(as.data.frame(scale(X[subset_int,])))
sil_k2 = silhouette(clust_k2[subset_int],dis) #best
plot(sil_k2)
summary(sil_k2)
## Silhouette of 100 units in 2 clusters from silhouette.default(x = clust_k2[subset_int], dist = dis) :
## Cluster sizes and average silhouette widths:
## 48 52
## 0.1895633766 0.1018642857
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.06886907 0.06533312 0.14169240 0.14395980 0.22658680 0.33585520
# fraction of "mis-clustered" cases, i.e., negative silhouette widths
mean(sil_k2[, "sil_width"] < 0)
## [1] 0.05
The result is reasonably good; only about 5% of the sub-sampled cases are “mis-clustered” (have negative silhouette values). For comparison, we can also examine the k=3 and k=4 solutions on the full data below; their average silhouette widths are not substantially better, and the k=4 solution contains one cluster with a near-zero average width, which supports keeping k=2.
dis = dist(as.data.frame(scale(X)))
clust_kmeans3_plus = kmeans(scale(X), kpp_init(scale(X), 3), iter.max=100, algorithm='Lloyd')
summary(silhouette(clust_kmeans3_plus$cluster,dis))
## Silhouette of 422 units in 3 clusters from silhouette.default(x = clust_kmeans3_plus$cluster, dist = dis) :
## Cluster sizes and average silhouette widths:
## 139 157 126
## 0.08356111542 0.19458813829 0.17237138090
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.06355399 0.08376430 0.16639550 0.15138420 0.21855670 0.33107050
clust_kmeans4_plus = kmeans(scale(X), kpp_init(scale(X), 4), iter.max=100, algorithm='Lloyd')
summary(silhouette(clust_kmeans4_plus$cluster,dis))
## Silhouette of 422 units in 4 clusters from silhouette.default(x = clust_kmeans4_plus$cluster, dist = dis) :
## Cluster sizes and average silhouette widths:
## 138 121 111 52
## 0.124165755516 0.170228092125 0.193359499726 0.008929262925
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.16300240 0.08751445 0.15091580 0.14137370 0.21035560 0.32293680
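As an optional check, we can scan several choices of k and record the average silhouette width of each k-means++ solution. This is a hypothetical sketch that reuses the kpp_init() function and the full-data distance object dis defined above; note that Lloyd's algorithm may occasionally stop with an empty-cluster error, in which case the call can simply be re-run.
# scan k = 2, ..., 6 and compute the average silhouette width of each solution
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(scale(X), kpp_init(scale(X), k), iter.max = 100, algorithm = 'Lloyd')
  mean(silhouette(cl$cluster, dis)[, "sil_width"])
})
names(avg_sil) <- paste0("k=", 2:6)
avg_sil   # larger average widths indicate better-separated clusters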
Next, let’s calculate the unsupervised classification error. Below, \(p\) is the overall proportion of class \(0\) cases, which serves as the baseline for labeling each cluster: a cluster whose fraction of class-0 cases exceeds this baseline is labeled \(0\), otherwise it keeps the label \(1\).
mat = matrix(1, nrow = length(Y))   # initialize all predicted labels to 1
p = sum(Y == 0)/length(Y)           # baseline proportion of class-0 cases
for (i in 1:2) {
  id = which(clust_k2 == i)
  # relabel the cluster as 0 if its class-0 fraction exceeds the baseline p
  if (sum(Y[id] == 0) > length(id)*p) {
    mat[id] = 0
  }
}
caret::confusionMatrix(as.factor(Y), as.factor(mat))   # coerce to factors for caret
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 195 105
## 1 1 121
##
## Accuracy : 0.7488152
## 95% CI : (0.7045933, 0.7895087)
## No Information Rate : 0.535545
## P-Value [Acc > NIR] : < 0.00000000000000022204
##
## Kappa : 0.5122558
## Mcnemar's Test P-Value : < 0.00000000000000022204
##
## Sensitivity : 0.9948980
## Specificity : 0.5353982
## Pos Pred Value : 0.6500000
## Neg Pred Value : 0.9918033
## Prevalence : 0.4644550
## Detection Rate : 0.4620853
## Detection Prevalence : 0.7109005
## Balanced Accuracy : 0.7651481
##
## 'Positive' Class : 0
##
It achieves about 75% accuracy (with 99% sensitivity for class 0), which is reasonable for a fully unsupervised method.
Finally, let’s visualize the results by projecting the data onto the first two multidimensional scaling (MDS) dimensions and coloring the cases by their k-means cluster labels.
library("ggplot2")
mds = as.data.frame(cmdscale(dis, k=2))
mds_temp = cbind(
mds, as.factor(clust_k2))
names(mds_temp) = c('V1', 'V2', 'cluster k=2')
gp_cluster = ggplot(mds_temp, aes(x=V2, y=V1, color=as.factor(clust_k2))) +
geom_point(aes(shape = as.factor(Y))) + theme()
gp_cluster
Suppose the multivariate dataset is represented as a set of data points \(A\). We can define a similarity matrix \(S=\{s_{(i, j)}\}\), where \(s_{(i, j)}\) represents a measure of the similarity between points \(i, j\in A.\) Spectral clustering uses the spectrum of this similarity matrix to perform dimensionality reduction before clustering the data in fewer dimensions. The spectrum of a matrix is the set of its eigenvalues. In general, if \(T:V \stackrel{\text{linear operator}}{\rightarrow} V\) maps a vector space \(V\) into itself, its spectrum is the set of scalars \(\lambda=\{\lambda_i\}\) such that \((T-\lambda I)v=0\) for some nonzero eigenvector (or eigenfunction) \(v\), where \(I\) is the identity. The determinant of the matrix equals the product of its eigenvalues, i.e., \(\det(T)=\prod_i \lambda_i\), the trace of the matrix is the sum of its eigenvalues, \(tr(T)=\sum_i \lambda_i\), and the pseudo-determinant of a singular matrix is the product of its nonzero eigenvalues, \(pseudo_{det}(T)=\prod_{\lambda_i \neq 0}\lambda_i.\)
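As a quick numeric sanity check of these spectral identities, the following small hypothetical example (unrelated to the case-study data) verifies in R that the determinant equals the product of the eigenvalues and the trace equals their sum.
# small symmetric matrix with real eigenvalues (illustrative example)
M <- matrix(c(4, 1, 0,
              1, 3, 1,
              0, 1, 2), nrow = 3, byrow = TRUE)
lambda <- eigen(M)$values
all.equal(det(M), prod(lambda))        # determinant = product of eigenvalues
all.equal(sum(diag(M)), sum(lambda))   # trace = sum of eigenvalues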
To partition the data points into two sets \((S_1, S_2)\), let \(v\) be the eigenvector corresponding to the second-smallest eigenvalue (the Fiedler vector) of the symmetric normalized Laplacian matrix
\[L = I - D^{-\frac{1}{2}}SD^{-\frac{1}{2}}\]
of the similarity matrix \(S\), where \(D\) is the diagonal degree matrix, \(D_{i, i}=\sum_j S_{i, j}.\)
The actual \((S_1, S_2)\) partitioning of the cases may then be done in different ways. For instance, we can compute the median \(m\) of the components of \(v\) and assign to \(S_1\) all data points whose component in \(v\) is greater than \(m\); the remaining cases are labeled as part of \(S_2\). This approach may be applied iteratively to the resulting subsets for hierarchical (recursive bipartition) clustering, as sketched below.
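The following is a minimal sketch of this Fiedler-vector bipartition, assuming a Gaussian similarity kernel with a user-chosen bandwidth sigma; the function name spectral_bipartition and the bandwidth default are illustrative choices, not part of any package.
spectral_bipartition <- function(A, sigma = 1) {
  # Gaussian (RBF) similarity matrix S
  d2 <- as.matrix(dist(A))^2
  S  <- exp(-d2 / (2 * sigma^2))
  # degree matrix and symmetric normalized Laplacian L = I - D^(-1/2) S D^(-1/2)
  D_inv_sqrt <- diag(1 / sqrt(rowSums(S)))
  L <- diag(nrow(S)) - D_inv_sqrt %*% S %*% D_inv_sqrt
  # eigenvector of the second-smallest eigenvalue (the Fiedler vector);
  # eigen() sorts eigenvalues in decreasing order, so it is the next-to-last column
  v <- eigen(L, symmetric = TRUE)$vectors[, nrow(S) - 1]
  # median split of the Fiedler-vector components defines the two sets S1 and S2
  ifelse(v > median(v), 1, 2)
}
# e.g., labels <- spectral_bipartition(scale(X), sigma = 1)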
The \(specc\) method in the \(kernlab\) package implements a spectral clustering algorithm in which the clustering is performed by embedding the data into the subspace spanned by the eigenvectors of an affinity matrix.
# install.packages("kernlab")
library("kernlab")
# review the available datasets and choose one (e.g., the iris data)
data()
#plot(iris)
Let’s look at a few simple cases of spectral clustering. We suppress some of the outputs to save space (e.g., # plot(my_data, col=data_sc)).
Let’s look at the iris dataset we saw in Chapter 2.
my_data <- as.matrix(iris[, -5])   # use the four numeric features; drop the Species factor
num_clusters <- 3
data_sc <- specc(my_data, centers = num_clusters)
data_sc
centers(data_sc)
withinss(data_sc)
# plot(my_data, col = data_sc)
Another simple dataset is `kernlab::spirals`.
# library("kernlab")
data(spirals)
num_clusters <- 2
data_sc <- specc(spirals, centers= num_clusters)
data_sc
## Spectral Clustering object of class "specc"
##
## Cluster memberships:
##
## 1 1 2 2 1 2 2 2 1 2 2 1 1 2 2 1 1 1 1 1 2 2 1 2 2 2 2 1 1 1 2 1 2 2 1 2 1 2 1 1 2 2 2 2 1 1 1 1 1 2 1 2 1 1 2 2 2 1 1 1 1 1 2 2 1 2 1 1 1 2 2 1 2 2 2 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 2 1 1 2 2 2 1 2 2 2 2 1 1 1 2 2 1 2 2 2 1 2 1 1 1 1 2 2 1 1 2 1 1 1 2 1 2 1 1 1 1 1 2 2 2 2 2 1 1 1 2 2 1 2 1 1 1 2 2 2 1 2 2 2 2 2 2 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 2 1 2 1 1 1 2 2 1 1 1 2 2 2 1 1 2 2 2 2 2 2 2 2 1 1 1 1 2 1 2 2 1 2 1 2 2 2 2 2 1 2 1 2 1 2 1 1 1 2 2 2 2 1 1 1 2 1 1 2 2 2 2 2 1 1 2 2 2 2 1 1 1 2 2 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 2 2 2 2 1 2 1 2 1 1 2 2 2 1 2 1 1 1 1 2 2 1 2 1 2 2 2 1 2 1 2 2 1 1 2 2 2 1
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 367.501471756465
##
## Centers:
## [,1] [,2]
## [1,] 0.01997200551 -0.1761483316
## [2,] -0.01770984369 0.1775136857
##
## Cluster size:
## [1] 150 150
##
## Within-cluster sum of squares:
## [1] 117.3429096 118.1182272
centers(data_sc)
## [,1] [,2]
## [1,] 0.01997200551 -0.1761483316
## [2,] -0.01770984369 0.1775136857
withinss(data_sc)
## [1] 117.3429096 118.1182272
plot(spirals, col= data_sc)
A customer income dataset representing a marketing survey is included as `kernlab::income`. Note that income is a data frame of 14 categorical variables; when it is passed directly to specc(), the list (string-kernel) method appears to be dispatched, so the clustering below is effectively over the 14 survey variables rather than over the individual customers, which also explains the empty centers() and withinss() results that follow.
data(income)
num_clusters <- 2
data_sc <- specc(income, centers= num_clusters)
data_sc
## Spectral Clustering object of class "specc"
##
## Cluster memberships:
##
## 2 1 2 2 2 1 2 1 2 1 1 1 2 1
##
## String kernel function. Type = spectrum
## Hyperparameters : sub-sequence/string length = 4
## Normalized
##
## Cluster size:
## [1] 7 7
centers(data_sc)
## [,1]
## [1,] NA
withinss(data_sc)
## logical(0)
plot(income, col= data_sc)
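If the goal were to cluster the individual customers rather than the 14 survey variables, one possible workaround (a hypothetical sketch, not part of the original analysis) is to dummy-code the categorical attributes and pass the resulting numeric matrix to specc(); the subsample size of 500 and the random seed are arbitrary illustration choices.
# dummy-code the categorical survey variables and cluster a subsample of customers
set.seed(1234)
inc_complete <- na.omit(income)                             # drop incomplete records
inc_sub <- inc_complete[sample(nrow(inc_complete), 500), ]  # random subset of customers
inc_num <- model.matrix(~ . - 1, data = inc_sub)            # one-hot/contrast encoding
inc_sc  <- specc(inc_num, centers = 2)
table(inc_sc@.Data)                                         # cluster sizes over the 500 customers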
Now let’s compare the eight classifiers we presented above (AdaBoost, LDA, QDA, kNN, logit, Neural Network, linear SVM, and Gaussian SVM) against the two unsupervised methods (k-means and spectral clustering).
# get AdaBoost CV results
set.seed(123)
cv.out.ada <- crossval::crossval(my.ada, as.data.frame(X), Y, K = 5, B = 1, negative = neg)
## Number of folds: 5
## Total number of CV fits: 5
##
## Round # 1 of 1
## CV Fit # 1 of 5
## CV Fit # 2 of 5
## CV Fit # 3 of 5
## CV Fit # 4 of 5
## CV Fit # 5 of 5
# get k-Means CV results
my.kmeans <- function (train.x, train.y, test.x, test.y, negative, formula) {
  kmeans.fit <- kmeans(scale(test.x), kpp_init(scale(test.x), 2), iter.max=100, algorithm='Lloyd')
  predict.y <- kmeans.fit$cluster
  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
  # negative is the label of a negative "null" sample (default: "control")
  return (out)
}
set.seed(123)
cv.out.kmeans <- crossval::crossval(my.kmeans, as.data.frame(X), Y, K = 5, B = 2, negative = neg)
## Number of folds: 5
## Total number of CV fits: 10
##
## Round # 1 of 2
## CV Fit # 1 of 10
## CV Fit # 2 of 10
## CV Fit # 3 of 10
## CV Fit # 4 of 10
## CV Fit # 5 of 10
##
## Round # 2 of 2
## CV Fit # 6 of 10
## CV Fit # 7 of 10
## CV Fit # 8 of 10
## CV Fit # 9 of 10
## CV Fit # 10 of 10
# get spectral clustering CV results
my.sc <- function (train.x, train.y, test.x, test.y, negative, formula) {
  sc.fit <- specc(scale(test.x), centers = 2)
  predict.y <- sc.fit@.Data   # extract the cluster memberships from the specc object
  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
  # negative is the label of a negative "null" sample (default: "control")
  return (out)
}
set.seed(123)
cv.out.sc <- crossval::crossval(my.sc, as.data.frame(X), Y, K = 5, B = 2, negative = neg)
## Number of folds: 5
## Total number of CV fits: 10
##
## Round # 1 of 2
## CV Fit # 1 of 10
## CV Fit # 2 of 10
## CV Fit # 3 of 10
## CV Fit # 4 of 10
## CV Fit # 5 of 10
##
## Round # 2 of 2
## CV Fit # 6 of 10
## CV Fit # 7 of 10
## CV Fit # 8 of 10
## CV Fit # 9 of 10
## CV Fit # 10 of 10
require(knitr)
## Loading required package: knitr
res_tab = rbind(diagnosticErrors(cv.out.ada$stat),
                diagnosticErrors(cv.out.lda$stat),
                diagnosticErrors(cv.out.qda$stat),
                diagnosticErrors(cv.out.knn$stat),
                diagnosticErrors(cv.out.logit$stat),
                diagnosticErrors(cv.out.nn$stat),
                diagnosticErrors(cv.out.svml$stat),
                diagnosticErrors(cv.out.svmg$stat),
                diagnosticErrors(cv.out.kmeans$stat),
                diagnosticErrors(cv.out.sc$stat))
rownames(res_tab) <- c("AdaBoost", "LDA", "QDA", "knn", "logit", "Neural Network",
                       "linear SVM", "Gaussian SVM", "k-Means", "Spectral Clustering")
kable(res_tab, caption = "Comparison of the cross-validated diagnostic errors")
| Method              | acc          | sens         | spec         | ppv          | npv          | lor          |
|---------------------|--------------|--------------|--------------|--------------|--------------|--------------|
| AdaBoost            | 0.9739336493 | 0.9933333333 | 0.9262295082 | 0.9706840391 | 0.9826086957 | 7.5341095473 |
| LDA                 | 0.9617298578 | 0.9513333333 | 0.9872950820 | 0.9945983621 | 0.8918918919 | 7.3258499745 |
| QDA                 | 0.9407582938 | 0.9533333333 | 0.9098360656 | 0.9629629630 | 0.8880000000 | 5.3285694097 |
| knn                 | 0.6113744076 | 0.7133333333 | 0.3606557377 | 0.7328767123 | 0.3384615385 | 0.3391095260 |
| logit               | 0.9431279621 | 0.9466666667 | 0.9344262295 | 0.9726027397 | 0.8769230769 | 5.5331424226 |
| Neural Network      | 0.9454976303 | 0.9016393443 | 0.9633333333 | 0.9090909091 | 0.9601328904 | 5.4841051313 |
| linear SVM          | 0.9502369668 | 0.9098360656 | 0.9666666667 | 0.9173553719 | 0.9634551495 | 5.6789307585 |
| Gaussian SVM        | 0.9454976303 | 0.9262295082 | 0.9533333333 | 0.8897637795 | 0.9694915254 | 5.5470977226 |
| k-Means             | 0.5331753555 | 0.5400000000 | 0.5163934426 | 0.7330316742 | 0.3134328358 | 0.2259399326 |
| Spectral Clustering | 0.5592417062 | 0.5900000000 | 0.4836065574 | 0.7375000000 | 0.3241758242 | 0.2983680947 |
Leaving kNN, k-means, and specc aside, the other methods achieve fairly good results. With these data, the suboptimal performance of the clustering methods may be rooted in the lack of supervised training (specc and k-means are unsupervised) or in the curse of (high) dimensionality, which we saw in Chapter 6. Since the data are high-dimensional and sparse, predictions based on nearest neighbors may not be reliable.
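To see why nearest-neighbor predictions degrade in such settings, the following small simulation (purely illustrative, using random Gaussian data rather than the case-study data) shows how the relative contrast between the farthest and the nearest pairwise distances shrinks as the dimension grows, so the “nearest” neighbor is barely closer than the farthest one.
# distance concentration in high dimensions: relative contrast (max - min)/min
set.seed(1234)
contrast <- sapply(c(2, 10, 100, 1000), function(d) {
  x <- matrix(rnorm(200 * d), ncol = d)   # 200 random points in d dimensions
  dd <- as.vector(dist(x))                # all pairwise Euclidean distances
  (max(dd) - min(dd)) / min(dd)           # shrinks toward 0 as d increases
})
names(contrast) <- paste0("d=", c(2, 10, 100, 1000))
round(contrast, 3)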