Start by reviewing Chapter 13 (Model Performance Assessment). Cross-validation is a strategy for validating predictive methods, classification models and clustering techniques by assessing the reliability and stability of the results of the corresponding statistical analyses (e.g., predictions, classifications, forecasts) based on independent datasets. For prediction of trend, association, clustering, and classification, a model is usually trained on one dataset (training data) and subsequently tested on new data (testing or validation data). Statistical internal cross-validation defines a test dataset to evaluate the model predictive performance as well as assess its power to avoid overfitting. Overfitting is the process of computing a predictive or classification model that describes random error, i.e., fits to the noise components of the observations, instead of identifying actual relationships and salient features in the data.
In this Chapter, we will use Google Flu Trends, Autism, and Parkinson’s disease case-studies to illustrate (1) alternative forecasting types using linear and non-linear predictions, (2) exhaustive and non-exhaustive internal statistical cross-validation, and (3) explore complementary predictor functions.
In Chapter 6 we discussed the types of classification and prediction methods, including supervised
and unsupervised
learning. The former are direct and predictive (there are known outcome variables that can be predicted and the corresponding forecasts can be evaluated) and the latter are indirect and descriptive (there are no a priori labels or specific outcomes).
There are alternative metrics used for evaluation of model performance, see Chapter 13. For example, assessment of supervised prediction and classification methods depends on the type of the labeled outcome responses - categorical (binary or polytomous) vs. continuous.
Confusion matrices reporting accuracy, FP, FN, PPV, NPV, LOR and other metrics may be used to assess predictions of dichotomous (binary) or polytomous outcomes. \(R^2\), correlations (between predicted and observed outcomes), and RMSE measures may be used to quantify the performance of various supervised forecasting methods on continuous features.
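As a quick reminder of how these metrics are computed, here is a minimal R sketch using a small hypothetical confusion matrix (the TP, FP, TN, FN counts below are made up purely for illustration) and a toy pair of observed/predicted continuous vectors.

# hypothetical 2x2 confusion-matrix counts (illustrative, not from a real study)
TP <- 90; FP <- 10; TN <- 80; FN <- 20
acc  <- (TP + TN) / (TP + FP + TN + FN)          # accuracy
sens <- TP / (TP + FN); spec <- TN / (TN + FP)   # sensitivity, specificity
ppv  <- TP / (TP + FP); npv  <- TN / (TN + FN)   # positive/negative predictive values
lor  <- log((TP * TN) / (FP * FN))               # one common definition of the log odds-ratio
c(acc=acc, sens=sens, spec=spec, ppv=ppv, npv=npv, lor=lor)

# continuous outcomes: correlation and RMSE between observed and predicted values
observed  <- c(2.1, 3.4, 5.0, 6.2, 7.9)
predicted <- c(2.3, 3.1, 5.4, 6.0, 8.3)
cor(observed, predicted)
sqrt(mean((observed - predicted)^2))             # RMSE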
Before we go into the cross-validation of predictive analytics, we will present several examples of overfitting that illustrate why a certain amount of skepticism and mistrust may be appropriate when dealing with forecasting models based on large and complex data.
By 2017, there were only 57 US presidential elections and 45 presidents. That is a small dataset, and learning from it may be challenging. For instance:
This cartoon illustrates some of the (unique) noisy presidential characteristics that are thought to be unimportant to presidential elections or presidential performance.
A March 14, 2014 article in Science (DOI: 10.1126/science.1248506), identified problems in Google Flu Trends (GFT), DOI 10.1371/journal.pone.0023610, which may be attributed in part to overfitting. The GFT was built to predict the future Centers for Disease Control and Prevention (CDC) reports of doctor office visits for influenza-like illness (ILI). In February 2013, Nature reported that GFT was predicting more than double the proportion of doctor visits compared to the CDC forecast for the same period.
The GFT model found the best matches among 50 million web search terms to fit 1,152 data points. It predicted quite high odds of finding search terms that match the propensity of the flu but are structurally unrelated and hence not prospectively predictive. In fact, the GFT investigators reported weeding out seasonal search terms that were unrelated to the flu yet strongly correlated with the CDC data, e.g., high school basketball season. The big GFT data may have overfitted the relatively small number of cases. This false-alarm result was also paired with a false-negative finding: the GFT model missed the non-seasonal 2009 H1N1 influenza pandemic, which provides a cautionary tale about prediction, overfitting, and prospective validation.
Autistic brains constantly overfit visual and cognitive stimuli. To an autistic person, a general conversation of several adults may seem like a cacophony due to super-sensitive detail-oriented hearing and perception tuned to literally pick up all elements of the conversation and clues of the surrounding environment. At the same time, autistic brains downplay body language, sarcasm and non-literal cues. We can miss the forest for the trees
when we start “overfitting”, over-interpreting the noise on top of the actual salient information. Ambient noise, trivial observations and unrelated perceptions may hide the true communication details.
Human conversations and communications involve exchanges of both critical information and random noise. Fitting a perfect model requires focus only on the “relevant” information. Overfitting occurs when attention is (excessively) consumed with peripheral noise, or worse, overwhelmed by inconsequential noise drowning the salient aspects of the communication exchange.
Any dataset is a mix of signal and noise. The main task of our brains is to sort these components and interpret the information (i.e., ignore the noise).
“One person’s noise is another person’s treasure map!”
Our predictions are most accurate if we can model as much of the signal and as little of the noise as possible. Note that, in these terms, \(R^2\) is a poor metric of predictive power - it measures how much of the signal and the noise is explained by our model. In practice, it is hard to always distinguish what is signal and what is noise. This is why practical applications tend to favor simpler models: the more complicated a model is, the easier it is to overfit the noise component of the observed information.
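To illustrate why a high \(R^2\) by itself does not imply predictive power, here is a small simulated sketch (synthetic data, not from the case-studies below) in which adding pure-noise predictors inflates the in-sample \(R^2\).

set.seed(1234)
n <- 100
x <- rnorm(n)
y <- 2*x + rnorm(n)                       # outcome = signal (2x) + noise
noise <- matrix(rnorm(n*50), ncol=50)     # 50 pure-noise predictors
summary(lm(y ~ x))$r.squared              # R^2 using only the real predictor
summary(lm(y ~ x + noise))$r.squared      # higher R^2, obtained purely by fitting the noise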
Internal statistical cross-validation assesses the expected performance of a prediction method in cases (subject, units, regions, etc.) drawn from a similar population as the original training data sample. Internal
validation is distinct from external
validation, as the latter potentially allows for the existence of differences between the populations (training data, used to develop or train the technique, and testing data, used to independently quantify the performance of the technique).
Each iteration of the internal statistical cross-validation protocol involves randomly partitioning the data into complementary training and testing subsets, fitting the model on the training subset, and evaluating its predictive performance on the testing subset.
Here is one example of internal statistical cross-validation: Predictive diagnostic modeling in Parkinson’s disease.
To reduce the noise and variability of any single iteration, the final validation results may report the performance results averaged across all iterations.
In cases when new observations are hard to obtain (due to costs, reliability, time, or other constraints), cross-validation guards against testing hypotheses suggested by the data themselves (also known as Type III error
or False-Suggestion).
Cross-validation is different from conventional validation (e.g., an 80%-20% partition of the data into training and testing subsets), where the prediction error (e.g., Root Mean Square Error, RMSE) evaluated on the training data is not a useful estimator of model performance, as it does not generalize across multiple samples.
In general, the errors of conventional validation are based on the results of a specific test dataset and may not accurately represent the model performance. A more appropriate strategy for estimating model prediction performance is to use cross-validation (CV), which combines (averages) prediction errors to measure the model performance. CV corrects for the expected stochastic nature of partitioning the training and testing sets and generates a more accurate and robust estimate of the expected model performance.
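The contrast between a single conventional split and CV averaging can be sketched as follows (simulated data; the exact numbers will vary with the random seed).

set.seed(4321)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2*dat$x + rnorm(n)

# conventional validation: one 80%-20% split, a single test-set RMSE
idx <- sample(n, size = 0.8*n)
fit <- lm(y ~ x, data = dat[idx, ])
sqrt(mean((dat$y[-idx] - predict(fit, dat[-idx, ]))^2))

# 5-fold CV: average the test RMSE over all 5 folds
folds <- sample(rep(1:5, length.out = n))
cv.rmse <- sapply(1:5, function(k) {
  fit.k <- lm(y ~ x, data = dat[folds != k, ])
  sqrt(mean((dat$y[folds == k] - predict(fit.k, dat[folds == k, ]))^2))
})
mean(cv.rmse)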
Relative to a simpler model, a more complex model may overfit the data if it has short foresight, i.e., it generates accurate fitting results for known data but less accurate results when predicting based on new data. Knowledge from past experiences may include either relevant or irrelevant (noise) information. In challenging data-driven prediction settings, where the uncertainty (entropy) is high, more noise is present in the past information and needs to be accounted for in prospective forecasting. However, it is generally hard to discriminate patterns from noise in complex systems (i.e., to decide which part to model and which to ignore). Models that reduce the chance of fitting noise are called robust.
Let’s demonstrate a simple model assessment using linear regression. Suppose we observe the response values \({y_1, \cdots, y_n}\) and the corresponding \(k\) predictors for each case, represented as \(k\)-dimensional covariate vectors \({x_1, \cdots, x_n }\), where subjects/cases are indexed by \(1\leq i\leq n\) and the data elements (variables) are indexed by \(1\leq j\leq k\).
\[ \begin{pmatrix} x_{1, 1} &\cdots & x_{1, k}\\ \vdots &\ddots & \vdots \\x_{n, 1} &\cdots & x_{n, k} \\ \end{pmatrix} .\]
Using least squares to estimate the linear function parameters (effect-sizes), \({\beta _1, \cdots , \beta _k }\), allows us to compute a hyperplane \(y = a + x\beta\) that best fits the observed data \({(x_i, y_i )}_{1\leq i \leq n}\). This relationship is expressed in matrix form as:
\[\begin{pmatrix} y_1\\ \vdots \\ y_n\\ \end{pmatrix} = \begin{pmatrix} a_1\\ \vdots \\ a_n\\ \end{pmatrix}+\begin{pmatrix} x_{1, 1} &\cdots & x_{1, k}\\ \vdots &\ddots & \vdots \\x_{n, 1} &\cdots & x_{n, k} \\ \end{pmatrix} \begin{pmatrix} \beta_1\\ \vdots \\ \beta_k\\ \end{pmatrix}.\]
Corresponding to the system of linear hyperplanes:
\[ \left\{ \begin{array}{c} y_1=a_1+ x_{1, 1} \beta_1 +x_{1, 2} \beta_2 + \cdots +x_{1, k} \beta_k \\ y_2=a_2+ x_{2, 1} \beta_1 +x_{2, 2} \beta_2 + \cdots +x_{2, k} \beta_k \\ \vdots\\ y_n=a_n+ x_{n, 1} \beta_1 +x_{n, 2} \beta_2 + \cdots +x_{n, k} \beta_k \end{array} \right. . \]
One measure to evaluate the model fit may be the mean squared error (MSE). The MSE for a given value of the parameters \(\alpha\) and \(\beta\) on the observed training data \({(x_i, y_i) }_{1\leq i\leq n}\) is expressed as:
\[MSE=\frac{1}{n}\sum_{i=1}^{n} \left(y_i- \underbrace{(a_i+ x_{i, 1} \beta_1 +x_{i, 2} \beta_2 + \cdots +x_{i, k} \beta_k) }_{\text{predicted value } \hat{y}_i, \text{ at } {x_{i, 1}, \cdots, x_{i, k}}} \right)^2.\]
And the corresponding root mean square error (RMSE) is:
\[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i- \underbrace{(a_i+ x_{i, 1} \beta_1 +x_{i, 2} \beta_2 + \cdots +x_{i, k} \beta_k) }_{\text{predicted value } \hat{y}_i, \text{ at } {x_{i, 1}, \cdots, x_{i, k}}} \right)^2}.\]
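For instance, a minimal sketch of these definitions, computed on the built-in mtcars data with a simple two-predictor linear model (chosen only for illustration), would be:

fit <- lm(mpg ~ wt + hp, data = mtcars)   # n=32 cases, k=2 predictors
y.hat <- fitted(fit)                      # predicted values
MSE  <- mean((mtcars$mpg - y.hat)^2)
RMSE <- sqrt(MSE)
c(MSE = MSE, RMSE = RMSE)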
In the linear model case, the expected value of the MSE (over the distribution of training sets) for the training set is \(\frac{n - k - 1}{n + k + 1} E\), where \(E\) is the expected value of the MSE for the testing/validation data. Therefore, fitting a model and computing the MSE on the training set yields an overly optimistic assessment (smaller RMSE) of how well the model may fit another dataset. This biased measure is an in-sample estimate of the fit, whereas we are interested in the cross-validation estimate as an out-of-sample estimate.
In the linear regression model, cross-validation may not be as useful, since we can compute the exact correction factor \(\frac{n - k - 1}{n + k + 1}\) to obtain an estimate of the expected out-of-sample fit from the in-sample MSE (under)estimate. However, even in this situation, cross-validation remains useful as it can be used to select an optimal regularized cost function.
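A small simulation consistent with this factor (assuming Gaussian predictors and errors, and evaluating new responses drawn at the same design points) can be sketched as follows.

set.seed(2020)
n <- 50; k <- 10
sim.once <- function() {
  Xmat <- matrix(rnorm(n*k), n, k)
  beta <- rnorm(k)
  y.train <- drop(Xmat %*% beta) + rnorm(n)    # observed training responses
  y.new   <- drop(Xmat %*% beta) + rnorm(n)    # new responses at the same design points
  fit <- lm(y.train ~ Xmat)
  c(in.mse  = mean(residuals(fit)^2),          # in-sample (training) MSE
    out.mse = mean((y.new - fitted(fit))^2))   # MSE for the new observations
}
res <- replicate(1000, sim.once())
mean(res["in.mse", ]) / mean(res["out.mse", ])   # empirical ratio of the expected MSEs
(n - k - 1) / (n + k + 1)                        # theoretical correction factor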
In most other modeling procedures (e.g., logistic regression), there are no simple general closed-form expressions (formulas) to adjust the in-sample fit estimate into a cross-validation error estimate. Cross-validation is a generally applicable way to predict the performance of a model on a validation set using stochastic computation instead of obtaining experimental, theoretical, mathematical, or analytic error estimates.
There are two classes of cross-validation approaches: exhaustive and non-exhaustive.
Exhaustive cross-validation methods are based on determining all possible ways to divide the original sample into training and testing data. For instance, leave-m-out cross-validation uses \(m\) observations for testing and the remaining \(n-m\) observations for training, and repeats this over all possible partitions of the original sample. This requires fitting and validating the model \(C_m^n={n \choose m}\) times, where \(n\) is the total number of observations in the original sample and \(m\) is the number left out for validation - an extremely large number of steps even for moderate \(n\) and \(m\). The special case \(m=1\), the leave-one-out method, requires only \(n\) model fits and is practical as long as \(n\) is not too large.
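To get a sense of the computational burden, the sketch below counts the required number of model fits for leave-m-out and runs a simple leave-one-out loop for a linear model on the built-in mtcars data (both are illustrative choices).

n <- 20
sapply(1:5, function(m) choose(n, m))   # number of partitions (model fits) for leave-m-out
choose(100, 5)                          # the count grows combinatorially with n and m

# leave-one-out (m=1) needs only n model fits
loo.sq.err <- sapply(1:nrow(mtcars), function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit, mtcars[i, ]))^2
})
sqrt(mean(loo.sq.err))                  # leave-one-out RMSE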
Non-exhaustive cross-validation methods avoid computing estimates/errors for all possible partitions of the original sample and instead approximate them. For example, in k-fold cross-validation, the original sample is randomly partitioned into \(k\) equal-sized subsamples, or folds. Of the \(k\) subsamples, a single subsample is kept as the final testing data for validation of the model, and the other \(k - 1\) subsamples are used as training data. The cross-validation process is then repeated \(k\) times, corresponding to the \(k\) folds, so that each of the \(k\) subsamples is used exactly once as the validation data. The corresponding \(k\) results are averaged (or otherwise aggregated) to generate a final pooled model-quality estimate. In k-fold validation, all observations are used for both training and validation, and each observation is used for validation exactly once. In general, \(k\) is a parameter that needs to be selected by the investigator (common values are \(5\) or \(10\)).
A special case of the k-fold validation is \(k=n\) (the total number of observations), when it coincides with the leave-one-out cross-validation.
A variation of the k-fold validation is stratified k-fold cross-validation, where each fold has the same (approximately) mean response value. For instance, if the model represents a binary classification of cases (e.g., controls vs. patients), this implies that each fold contains roughly the same proportion of the two class labels.
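A minimal sketch of stratified folds, assuming the caret package is installed: for a factor outcome, createFolds() samples within the levels of the factor, so each fold approximately preserves the class proportions.

library(caret)
y <- factor(c(rep("Control", 30), rep("Patient", 70)))   # imbalanced binary outcome
set.seed(2023)
folds <- createFolds(y, k = 5)                # list of test-set indices, one element per fold
sapply(folds, function(idx) table(y[idx]))    # each fold keeps roughly the 30/70 class ratio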
Repeated random sub-sampling validation randomly splits the entire dataset into a training set, where the model is fit, and a testing set, where the predictive accuracy is assessed. The results are then averaged over all iterative splits. This method has an advantage over k-fold cross-validation in that the proportion of the training/testing split does not depend on the number of iterations (folds). However, its drawback is that some observations may never be selected whereas others may be selected multiple times for the testing/validation subsample. As validation subsets may overlap, the results may vary each time we repeat the validation protocol, unless we set a random seed in the algorithm.
Asymptotically, as the number of random splits increases, the repeated random sub-sampling validation approaches leave-m-out cross-validation.
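A compact sketch of repeated random sub-sampling validation (again using mtcars and a simple linear model purely for illustration): each iteration draws a fresh 80%-20% split, and the test errors are averaged.

set.seed(11)
n <- nrow(mtcars)
B <- 50                                        # number of random splits
rmse.b <- replicate(B, {
  train <- sample(n, size = round(0.8*n))      # a fresh 80%-20% split at each iteration
  fit <- lm(mpg ~ wt + hp, data = mtcars[train, ])
  sqrt(mean((mtcars$mpg[-train] - predict(fit, mtcars[-train, ]))^2))
})
c(mean = mean(rmse.b), se = sd(rmse.b))        # averaged error and its variability across splits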
In the examples below, we have intentionally suppressed some of the R output to save space. This is accomplished using the Rmarkdown chunk option {r eval=TRUE, results='hide'}. However, the reader is encouraged to try all the protocols hands-on, make modifications, and inspect and interpret the outputs.
Adaptive Boosting (AdaBoost)
This Parkinson’s Disease Study involves heterogeneous neuroimaging, genetics, clinical, and phenotypic data for over 600 volunteers, which produced multivariate data for three cohorts (HC=Healthy Controls, PD=Parkinson’s, SWEDD=subjects without evidence for dopaminergic deficit).
Load the data: 06_PPMI_ClassificationValidationData_Short.csv.
ppmi_data <- read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
Binarize the Dx (clinical diagnoses) classes.
# binarize the Dx classes
ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
attach(ppmi_data)
head(ppmi_data)
# View(ppmi_data)
Apply model-free predictive analytics, e.g., AdaBoost classification, and report the results.
# Model-free analysis, classification
# install.packages("crossval")
# install.packages("ada")
# library("crossval")
require(crossval)
require(ada)
#set up adaboosting prediction function
# Define a new AdaBoost classification result-reporting function
my.ada <- function (train.x, train.y, test.x, test.y, negative, formula) {
  ada.fit <- ada(train.x, train.y)
  predict.y <- predict(ada.fit, test.x)
  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
  # negative is the label of a negative "null" sample (default: "control")
  return (out)
}
When group sizes are imbalanced, we may need to rebalance them to avoid potential biases toward the dominant cohorts. In this case, we will re-balance the groups using SMOTE (Synthetic Minority Oversampling Technique), which may be used to handle class imbalance in binary classification.
# balance cases
# SMOTE: Synthetic Minority Oversampling Technique to handle class misbalance in binary classification.
set.seed(1000)
# install.packages("unbalanced") to deal with unbalanced group data
require(unbalanced)
ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
uniqueID <- unique(ppmi_data$FID_IID)
ppmi_data <- ppmi_data[ppmi_data$VisitID==1, ]
ppmi_data$PD <- factor(ppmi_data$PD)

colnames(ppmi_data)
# ppmi_data.1 <- ppmi_data[, c(3:281, 284, 287, 336:340, 341)]
n <- ncol(ppmi_data)
output.1 <- ppmi_data$PD

# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
# remove Default Real Clinical subject classifications!
input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID"))]
# output <- as.matrix(ppmi_data[ , which(names(ppmi_data) %in% {"PD"})])
output <- as.factor(ppmi_data$PD)
c(dim(input), dim(output))

# balance the dataset
set.seed(123)
data.1 <- ubBalance(X=input, Y=output, type="ubSMOTE", percOver=300, percUnder=150, verbose=TRUE)
balancedData <- cbind(data.1$X, data.1$Y)
table(data.1$Y)

nrow(data.1$X); ncol(data.1$X)
nrow(balancedData); ncol(balancedData)
nrow(input); ncol(input)

colnames(balancedData) <- c(colnames(input), "PD")
Next, we’ll check the re-balanced cohort sizes.
library(plotly)
###Check balance
## T test
alpha.0.05 <- 0.05
test.results.bin <- NULL    # binarized/dichotomized p-values
test.results.raw <- NULL    # raw p-values

# get a better error-handling t.test function that gracefully handles NA's and trivial variances
my.t.test.p.value <- function(input1, input2) {
  obj <- try(t.test(input1, input2), silent=TRUE)
  if (is(obj, "try-error"))
    return(NA)
  else
    return(obj$p.value)
}

for (i in 1:ncol(balancedData))
{
  test.results.raw[i] <- my.t.test.p.value(input[, i], balancedData[, i])
  test.results.bin[i] <- ifelse(test.results.raw[i] > alpha.0.05, 1, 0)
  # binarize the p-value (0=significant, 1=otherwise)
  print(c("i=", i, "var=", colnames(balancedData[i]), "t-test_raw_p_value=", test.results.raw[i]))
}
# we can also employ (e.g., FDR, Bonferroni) correction for multiple testing!
# test.results.corr <- stats::p.adjust(test.results.raw, method = "fdr", n = length(test.results.raw))
# where methods are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
# plot(test.results.raw, test.results.corr)
# sum(test.results.raw < alpha.0.05, na.rm=T)/length(test.results.raw) #check proportion of inconsistencies
# sum(test.results.corr < alpha.0.05, na.rm =T)/length(test.results.corr)
QQ <- qqplot(input[, 5], balancedData[, 5], plot=F)   # check visually for differences between the distributions of the raw (input) and rebalanced data (for only one variable, in this case)
plot_ly(x=~QQ$x, y=~QQ$y, type="scatter", mode="markers", name="Data") %>%
add_trace(x=c(0,1), y=c(0,1), name="Ideal Distribution Match", type="scatter", mode="lines") %>%
layout(title=paste0("QQ-Plot Original vs. Rebalanced Data (Feature=",colnames(input)[5], ")"),
xaxis=list(title="Original Data"), yaxis=list(title="Rebalanced Data"),
hovermode = "x unified", legend = list(orientation='h'))
# Now, check visually for differences between the distributions of the raw (input) and rebalanced data.
# par(mar=c(1,1,1,1))
# par(mfrow=c(10,10))
# for(i in c(1:62,64:101)){ qqplot(balancedData [, i],input[, i]) } #except VisitID
# as the sample-size is changed:
length(input[, 5]); length(balancedData [, 5])
# to plot raw vs. rebalanced pairs (e.g., var="L_insular_cortex_Volume"), we need to equalize the lengths
#plot (input[, 5] +0*balancedData [, 5], balancedData [, 5]) # [, 5] == "L_insular_cortex_Volume"
# print(c("T-test results: ", test.results))
# zeros (0) are significant independent between-group T-test differences, ones (1) are insignificant
for (i in 1:(ncol(balancedData)-1))
{
  test.results.raw[i] <- wilcox.test(input[, i], balancedData[, i])$p.value
  test.results.bin[i] <- ifelse(test.results.raw[i] > alpha.0.05, 1, 0)
  print(c("i=", i, "Wilcoxon-test=", test.results.raw[i]))
}
print(c("Wilcoxon test results: ", test.results.bin))
# test.results.corr <- stats::p.adjust(test.results.raw, method = "fdr", n = length(test.results.raw))
# where methods are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
# plot(test.results.raw, test.results.corr)
The next step will be the actual cross-validation.
# using raw data:
X <- as.data.frame(input); Y <- output
neg <- "1"   # "Control" == "1"

# using Rebalanced data:
X <- as.data.frame(data.1$X); Y <- data.1$Y
# balancedData <- cbind(data.1$X, data.1$Y); dim(balancedData)

# Side note: There is a function name collision for "crossval", the same method is present in the "mlr" (machine Learning in R) package and in the "crossval" package.
# To specify a function call from a specific package do: packagename::functionname()
set.seed(115)
cv.out <- crossval::crossval(my.ada, X, Y, K = 5, B = 1, negative = neg)
# negative is the label of a negative "null" sample (default: "control")
out <- diagnosticErrors(cv.out$stat)
print(cv.out$stat)
## FP TP TN FN
## 0.2 109.8 97.4 0.0
print(out)
## acc sens spec ppv npv lor
## 0.9990357 1.0000000 0.9979508 0.9981818 1.0000000 Inf
As we can see from the reported metrics, the overall averaged AdaBoost-based diagnostic predictions are quite good.
These data show the effect of two soporific drugs (increase in hours of sleep, compared to control) on 10 patients. The data are available by default in R (sleep {datasets}).
First, load the data and report some graphs and summaries.
data(sleep); str(sleep)
X = as.matrix(sleep[, 1, drop=FALSE])   # increase in hours of sleep,
# drop is logical, if TRUE the result is coerced to the lowest possible dimension.
# The default is to drop if only one column is left, but not to drop if only one row is left.
Y = sleep[, 2]   # drug given
# plot(X ~ Y)
plot_ly(data=sleep, x=~group, y=~extra, color=~group, type="box", name="Hours of Sleep") %>%
layout(title="Hours of Extra Sleep", legend = list(orientation='h'))
levels(Y) # "1" "2"
dim(X) # 20 1
Next, we will define a new LDA (linear discriminant analysis) predicting function and perform cross-validation (CV) on the resulting predictor.
require("MASS") # for lda function
predfun.lda = function(train.x, train.y, test.x, test.y, negative)
{
  lda.fit = lda(train.x, grouping=train.y)
  ynew = predict(lda.fit, test.x)$class
  # count TP, FP etc.
  out = confusionMatrix(test.y, ynew, negative=negative)
  return( out )
}

# install.packages("crossval")
library("crossval")
set.seed(123456)
cv.out <- crossval::crossval(predfun.lda, X, Y, K=5, B=20, negative="1", verbose=FALSE)
cv.out$stat
diagnosticErrors(cv.out$stat)
Execute the above code and interpret the diagnostic results measuring the performance of the LDA prediction.
The attitude dataset
This data represents a survey of clerical employees of a large organization, aggregated from about 35 employees in each of 30 (randomly selected) departments. The data give the proportion of favorable responses to 7 questions in each department.
Let’s load and summarize the data, which is available by default in attitude {datasets}
.
# ?attitude, colnames(attitude)
# Note: when using a data frame, a time-saver is to use "." to indicate "include all covariates" in the DF.
# E.g., fit <- lm(Y ~ ., data = D)
data("attitude")
y = attitude[, 1]    # rating variable
x = attitude[, -1]   # data frame with the remaining variables
is.factor(y)
summary( lm(y ~ . , data=x) )   # R-squared: 0.7326
# set up lm prediction function
We will demonstrate model-based analytics using lm and lda, and will validate the forecasting using CV.
predfun.lm = function(train.x, train.y, test.x, test.y)
{
  lm.fit = lm(train.y ~ . , data=train.x)
  ynew = predict(lm.fit, test.x)
  # compute squared error risk (MSE)
  out = mean( (ynew - test.y)^2 )
  # note that, in general, when fitting linear model to continuous outcome variable (Y),
  # we can't use the out<-confusionMatrix(test.y, ynew, negative=negative), as it requires a binary outcome
  # this is why we use the MSE as an estimate of the discrepancy between observed & predicted values
  return(out)
}
# require("MASS")
#predfun.lda = function(train.x, train.y, test.x, test.y, negative)
#{ lda.fit = lda(train.x, grouping=train.y)
# ynew = predict(lda.fit, test.x)$class
# count TP, FP etc.
# out = confusionMatrix(test.y, ynew, negative=negative)
#return( out )
#}
# prediction MSE using all variables
set.seed(123456)
cv.out.lm = crossval::crossval(predfun.lm, x, y, K=5, B=20, verbose=FALSE)
c(cv.out.lm$stat, cv.out.lm$stat.se)   # 72.581198  3.736784

# reducing to using only two variables
cv.out.lm = crossval::crossval(predfun.lm, x[, c(1, 3)], y, K=5, B=20, verbose=FALSE)
c(cv.out.lm$stat, cv.out.lm$stat.se)
# 52.563957 2.015109
Let’s go back to the more elaborate PD case-study (ppmi_data), starting with loading and preprocessing the derived-PPMI data.
# ppmi_data <-read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
# ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
# attach(ppmi_data); head(ppmi_data)
# install.packages("crossval")
# library("crossval")
# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
# input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID"))]
# output <- as.factor(ppmi_data$PD)
# remove the irrelevant variables (e.g., visit ID)
output <- as.factor(ppmi_data$PD)
input <- ppmi_data[, -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID", "VisitID"))]
X = as.matrix(input)    # Predictor variables
Y = as.matrix(output)   # Actual PD clinical assessment
dim(X)
dim(Y)

fit <- lm(Y~X)

# layout(matrix(c(1, 2, 3, 4), 2, 2)) # optional 4 graphs/page
# plot(fit) # plot the fit
xResid <- scale(fit$residuals)
QQ <- qqplot(xResid, rnorm(1000), plot=F)   # compare the distribution of the (scaled) model residuals to a standard normal sample
plot_ly(x=~QQ$x, y=~QQ$y, type="scatter", mode="markers", name="Data") %>%
add_trace(x=c(-4,4), y=c(-4,4), name="Ideal Distribution Match", type="scatter", mode="lines") %>%
layout(title="QQ Normal Plot of Model Residuals",
xaxis=list(title="Model Residuals"), yaxis=list(title="Normal Residuals"),
hovermode = "x unified", legend = list(orientation='h'))
levels(as.factor(Y)) # "0" "1"
## [1] "0" "1"
c(dim(X), dim(Y)) # 422 100  422 1
## [1] 422 100 422 1
Apply cross-validation to assess the performance of the linear model.
set.seed(12345)
# cv.out.lm = crossval::crossval(predfun.lm, as.data.frame(X), as.numeric(Y), K=5, B=20)
cv.out.lda = crossval::crossval(predfun.lda, X, Y, K=5, B=20, negative="1", verbose=FALSE)
# K=Number of folds; B=Number of repetitions.
# Results
#cv.out.lda$stat;
#cv.out.lda;
diagnosticErrors(cv.out.lda$stat)
## acc sens spec ppv npv lor
## 0.9606635 0.9515000 0.9831967 0.9928696 0.8918216 7.0457111
#cv.out.lm$stat;
#cv.out.lm;
#diagnosticErrors(cv.out.lm$stat)
The cross-validation (CV) output object includes the following components:
- stat.cv: the vector of statistics returned by predfun for each cross-validation run
- stat: the statistic returned by predfun, averaged over all cross-validation runs
- stat.se: the corresponding standard error (variability)
We have already seen a number of predict() functions, e.g., Chapter 17. Below, we will add to the collection of predictive analytics and forecasting functions.
We already saw the logit model in Chapter 17. Now, we will demonstrate a logit-predictor function.
# ppmi_data <-read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
# ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
# install.packages("crossval"); library("crossval")
# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
# remove the irrelevant variables (e.g., visit ID)
output <- as.factor(ppmi_data$PD)
input <- ppmi_data[, -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID", "VisitID"))]
X = as.matrix(input)    # Predictor variables
Y = as.matrix(output)
Note that the predicted values are reported on the logit (log-odds) scale, so they need to be back-transformed (exponentiated) to be interpreted correctly.
lm.logit <- glm(as.numeric(Y) ~ ., data = as.data.frame(X), family = "binomial")
ynew <- predict(lm.logit, as.data.frame(X)); # plot(ynew)
ynew2 <- ifelse(exp(ynew)<0.5, 0, 1); # plot(ynew2)

predfun.logit = function(train.x, train.y, test.x, test.y, neg)
{
  lm.logit <- glm(train.y ~ ., data = train.x, family = "binomial")
  ynew = predict(lm.logit, test.x)
  # compute TP, FP, TN, FN
  ynew2 <- ifelse(exp(ynew)<0.5, 0, 1)
  out = confusionMatrix(test.y, ynew2, negative=neg)   # Binary outcome, we can use confusionMatrix
  return( out )
}

# Reduce the bag of explanatory variables, purely to simplify the interpretation of the analytics in this example!
input.short <- input[, which(names(input) %in% c("R_fusiform_gyrus_Volume",
    "R_fusiform_gyrus_ShapeIndex", "R_fusiform_gyrus_Curvedness",
    "Sex", "Weight", "Age", "chr12_rs34637584_GT", "chr17_rs11868035_GT",
    "UPDRS_Part_I_Summary_Score_Baseline", "UPDRS_Part_I_Summary_Score_Month_03",
    "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline",
    "UPDRS_Part_III_Summary_Score_Baseline",
    "X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline"))]
X = as.matrix(input.short)

cv.out.logit = crossval::crossval(predfun.logit, as.data.frame(X), as.numeric(Y), K=5, B=2, neg="1", verbose=FALSE)
cv.out.logit$stat.cv
## FP TP TN FN
## B1.F1 2 55 26 1
## B1.F2 0 61 21 3
## B1.F3 4 57 21 2
## B1.F4 2 58 22 3
## B1.F5 0 55 24 5
## B2.F1 1 53 29 2
## B2.F2 0 63 17 4
## B2.F3 1 62 21 1
## B2.F4 3 50 29 2
## B2.F5 3 58 18 5
diagnosticErrors(cv.out.logit$stat)
## acc sens spec ppv npv lor
## 0.9478673 0.9533333 0.9344262 0.9727891 0.8906250 5.6736914
Caution: Note that if you forget to back-transform (exponentiate) the predicted logistic model values (see ynew2 in predfun.logit), you will get nonsensical results, e.g., all cases may be predicted to be in one class, with trivial sensitivity or NPV.
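For reference, here is a minimal synthetic sketch of the relationship between the link (log-odds) scale and the probability scale of a binomial glm; plogis() is base R's inverse-logit, and type="response" returns the probabilities directly (this is a general illustration, not part of the PPMI analysis).

set.seed(7)
x <- rnorm(100); y <- rbinom(100, 1, plogis(2*x))   # synthetic binary outcome
fit <- glm(y ~ x, family = "binomial")
eta <- predict(fit)                       # default: linear predictor, i.e., log-odds
p1  <- plogis(eta)                        # back-transform to probabilities (inverse logit)
p2  <- predict(fit, type = "response")    # equivalent probabilities, computed directly
all.equal(as.numeric(p1), as.numeric(p2))
table(observed = y, predicted = ifelse(p2 < 0.5, 0, 1))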
In Chapters 7 and 20 we discussed the linear and quadratic discriminant analysis models. Let’s now introduce a predfun.qda() function.
predfun.qda = function(train.x, train.y, test.x, test.y, negative)
{
  require("MASS")   # for the qda() function
  qda.fit = qda(train.x, grouping=train.y)
  ynew = predict(qda.fit, test.x)$class
  out.qda = confusionMatrix(test.y, ynew, negative=negative)
  return( out.qda )
}

cv.out.qda = crossval::crossval(predfun.qda, as.data.frame(input.short), as.factor(Y), K=5, B=20, neg="1")
## Error in qda.default(x, grouping, ...): rank deficiency in group 1
diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat);
## Error in diagnosticErrors(cv.out.qda$stat): object 'cv.out.qda' not found
This error message: “Error in qda.default(x, grouping, …) : rank deficiency in group 1” indicates that there is a rank deficiency, i.e. some variables are collinear and one or more covariance matrices cannot be inverted to obtain the estimates in group 1 (Controls)!
If you remove the strongly correlated data elements (“R_fusiform_gyrus_Volume”, “R_fusiform_gyrus_ShapeIndex”, and “R_fusiform_gyrus_Curvedness”), the rank-deficiency problem goes away!
input.short2 <- input[, which(names(input) %in% c("R_fusiform_gyrus_Volume", "Sex", "Weight", "Age", "chr17_rs11868035_GT",
    "UPDRS_Part_I_Summary_Score_Baseline",
    "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline",
    "UPDRS_Part_III_Summary_Score_Baseline",
    "X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline"))]
X = as.matrix(input.short2)
cv.out.qda = crossval::crossval(predfun.qda, as.data.frame(X), as.numeric(Y), K=5, B=2, neg="1")
It makes sense to contrast the QDA and GLM/Logit predictions.
diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.logit$stat)
## acc sens spec ppv npv lor
## 0.9395735 0.9550000 0.9016393 0.9597990 0.8906883 5.2706226
## acc sens spec ppv npv lor
## 0.9478673 0.9533333 0.9344262 0.9727891 0.8906250 5.6736914
Clearly, both the QDA and Logit model predictions are quite similar and reliable.
Previously, in Chapter 7 we saw some examples of LDA/QDA methods. Now, we’ll talk more about the details. Both LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) use probabilistic models of the class-conditional distribution of the data \(P(X \mid Y=k)\) for each class \(k\). Their predictions are obtained by using Bayes' theorem, see Appendix:
\[P(Y=k \mid X)= \frac{P(X \mid Y=k)P(Y=k)}{P(X)}=\frac{P(X \mid Y=k)P(Y=k)}{\sum_{l}P(X \mid Y=l)P(Y=l)},\] and we select the class \(k\) that maximizes this conditional probability (maximum likelihood estimation). In linear and quadratic discriminant analysis, \(P(X \mid Y)\) is modeled as a multivariate Gaussian distribution with density:
\[P(X \mid Y=k)= \frac{1}{{(2 \pi)}^{\frac{n}{2}} {\vert{\Sigma_{k}}\vert}^{\frac{1}{2}}}\times e^{(-\frac{1}{2}{(X- \mu_k)}^T \Sigma_k^{-1}(X- \mu_k))}.\]
This model can be used to classify data by using the training data to estimate the class prior probabilities \(P(Y=k)\), the class means \(\mu_k\), and the covariance matrices \(\Sigma_k\).
In the linear case (LDA), the Gaussians for each class are assumed to share the same covariance matrix:
\(\Sigma_k=\Sigma\) for each class \(k\). This leads to linear decision surfaces between classes. This is clear from comparing the log-probability ratios of 2 classes (\(k\) and \(l\)):
\(LOR=\log\left (\frac{P(Y=k \mid X)}{P(Y=l \mid X)}\right )\), where \(LOR=0\) \(\Longleftrightarrow\) the two class probabilities are identical (same class). Assuming equal class priors, \(LOR=0 \Longleftrightarrow {(\mu_k-\mu_l)}^T \Sigma^{-1} X= \frac{1}{2}\left(\mu_k^T \Sigma^{-1} \mu_k-\mu_l^T \Sigma^{-1} \mu_l\right)\), which is a linear equation in \(X\).
But, in the more general, quadratic case of QDA, there are no assumptions on the covariance matrices \(\Sigma_k\) of the Gaussians, leading to quadratic decision surfaces.
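Taking the logarithm of the Gaussian class-conditional density above (together with Bayes' theorem) gives, up to an additive constant, the quadratic discriminant score for each class \(k\), \[ \delta_k(x)= -\frac{1}{2}\log \vert \Sigma_k \vert -\frac{1}{2}{(x-\mu_k)}^T \Sigma_k^{-1}(x-\mu_k)+\log P(Y=k), \] and the observation \(x\) is assigned to the class with the largest score. When all classes share a common covariance matrix (\(\Sigma_k=\Sigma\)), the term \(x^T\Sigma^{-1}x\) is the same for every class and cancels from the comparison, so the discriminant becomes linear in \(x\), which is exactly the LDA case above.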
LDA is similar to GLM (e.g., ANOVA and regression analyses), as it also attempts to express one dependent variable as a linear combination of other features or data elements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas LDA has continuous independent variables and a categorical dependent variable (i.e., Dx/class label). Logistic regression and probit regression are more similar to LDA than ANOVA, as they also explain a categorical variable by the values of continuous independent variables.
predfun.lda = function(train.x, train.y, test.x, test.y, neg)
{
  require("MASS")
  lda.fit = lda(train.x, grouping=train.y)
  ynew = predict(lda.fit, test.x)$class
  out.lda = confusionMatrix(test.y, ynew, negative=neg)
  return( out.lda )
}
Similarly to LDA, the QDA prediction function can be defined by:
predfun.qda = function(train.x, train.y, test.x, test.y, neg)
{
  require("MASS")   # for the qda() function
  qda.fit = qda(train.x, grouping=train.y)
  ynew = predict(qda.fit, test.x)$class
  out.qda = confusionMatrix(test.y, ynew, negative=neg)
  return( out.qda )
}
We saw Neural Networks (NNs) in Chapter 10. Applying NNs is not straightforward. We have to create a design matrix with an indicator column for the response feature. In addition, we need to write a predict function to translate the output of neuralnet() into analytical forecasts.
# predict nn
library("neuralnet")
pred = function(nn, dat) {
  yhat = compute(nn, dat)$net.result
  yhat = apply(yhat, 1, which.max)-1
  return(yhat)
}

my.neural <- function (train.x, train.y, test.x, test.y, method, layer=c(5,5)) {
  train.x <- as.data.frame(train.x)
  train.y <- as.data.frame(train.y)
  colnames(train.x) <- paste0('V', 1:ncol(X))
  colnames(train.y) <- "V1"
  train_y_ind = model.matrix(~factor(train.y$V1)-1)
  colnames(train_y_ind) = paste0('out', 0:1)
  train = cbind(train.x, train_y_ind)
  y_names = paste0('out', 0:1)
  x_names = paste0('V', 1:ncol(train.x))
  nn = neuralnet(
    paste(paste(y_names, collapse='+'),
          '~',
          paste(x_names, collapse='+')),
    train, hidden=layer,
    linear.output=FALSE,
    lifesign='full', lifesign.step=1000)
  # predict
  predict.y <- pred(nn, test.x)
  out <- crossval::confusionMatrix(test.y, predict.y, negative = 0)
  return (out)
}

set.seed(1234)
cv.out.nn <- crossval::crossval(my.neural, scale(X), Y, K = 5, B = 1, layer=c(20,20), verbose = F)   # scaling the predictors is necessary
crossval::diagnosticErrors(cv.out.nn$stat)
## acc sens spec ppv npv lor
## 0.9407583 0.9016393 0.9566667 0.8943089 0.9598662 5.3101066
We also saw SVMs in Chapter 10. Let's try cross-validation with linear and Gaussian (radial) kernel SVMs. We can expect the linear-kernel SVM to achieve results close to, or even better than, the Gaussian kernel, since this dataset has a large number of features (k) relative to the number of cases (n), which we studied in detail in Chapter 10.
library("e1071")
my.svm <- function (train.x, train.y, test.x, test.y, method, cost=1, gamma=1/ncol(dx_norm), coef0=0, degree=3) {
  svm_l.fit <- svm(x = train.x, y=as.factor(train.y), kernel = method)
  predict.y <- predict(svm_l.fit, test.x)
  out <- crossval::confusionMatrix(test.y, predict.y, negative = 0)
  return (out)
}

# Linear kernel
set.seed(123)
cv.out.svml <- crossval::crossval(my.svm, as.data.frame(X), Y, K = 5, B = 1, method = "linear", cost=tune_svm$best.parameters$cost, verbose = F)
diagnosticErrors(cv.out.svml$stat)
## acc sens spec ppv npv lor
## 0.9573460 0.9344262 0.9666667 0.9193548 0.9731544 6.0240527
# Gaussian kernel
set.seed(123)
cv.out.svmg <- crossval::crossval(my.svm, as.data.frame(X), Y, K = 5, B = 1, method = "radial", cost=tune_svmg$best.parameters$cost, gamma=tune_svmg$best.parameters$gamma, verbose = F)
diagnosticErrors(cv.out.svmg$stat)
## acc sens spec ppv npv lor
## 0.9478673 0.9262295 0.9566667 0.8968254 0.9695946 5.6246961
Indeed, both types of kernels yield good quality predictors according to the assessment metrics reported by the diagnosticErrors() method.
As we saw in Chapter 6, k-NN is a non-parametric method that can be used for either classification or regression. The input consists of the k closest training examples in the feature space, but the output depends on whether k-NN is used for classification or regression: for classification, the output is a class label determined by a majority vote among the k nearest neighbors; for regression, the output is the average of the outcome values of the k nearest neighbors.
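A minimal sketch on the built-in iris data illustrates the two modes (the regression-style averaging below is coded by hand purely for illustration; knn.reg.one is a hypothetical helper, not a library function).

library(class)
set.seed(3)
idx <- sample(nrow(iris), 100)                     # 100 training cases, 50 testing cases
# classification: majority vote among the k=5 nearest neighbors
pred <- knn(train = iris[idx, 3:4], test = iris[-idx, 3:4], cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])                   # classification accuracy

# regression-style k-NN: average the outcomes of the k nearest training cases
knn.reg.one <- function(x0, X, y, k = 5) {
  d <- sqrt(colSums((t(X) - x0)^2))                # Euclidean distances to all training cases
  mean(y[order(d)[1:k]])                           # average outcome of the k closest neighbors
}
knn.reg.one(as.numeric(iris[1, 3:4]), as.matrix(iris[idx, 3:4]), iris$Sepal.Length[idx])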
Let’s now build the corresponding predfun.knn() method.
# X = as.matrix(input) # Predictor variables X = as.matrix(input.short2)
# Y = as.matrix(output) # Outcome
# KNN (k-nearest neighbors)
library("class")
# knn.fit.test <- knn(X, X, cl = Y, k=3, prob=F); predict(as.matrix(knn.fit.test), X)$class
# table(knn.fit.test, Y); confusionMatrix(Y, knn.fit.test, negative="1")
# This can be used for polytomous variable (multiple classes)
predfun.knn = function(train.x, train.y, test.x, test.y, neg)
{
  require("class")
  knn.fit = knn(train.x, test.x, cl = train.y, prob=T)   # knn is already a prediction function!!!
  # ynew = predict(knn.fit, test.x)$class   # no need of another prediction, in this case
  out.knn = confusionMatrix(test.y, knn.fit, negative=neg)
  return( out.knn )
}
cv.out.knn = crossval::crossval(predfun.knn, X, Y, K=5, B=2, neg="1")

# Compare the four classifiers (lda, qda, knn, and logit)
diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.knn$stat); diagnosticErrors(cv.out.logit$stat)
We can also examine the performance of k-NN prediction on the PPMI (Parkinson’s disease) data. Start by partitioning the data into training and testing sets.
# TRAINING: 75% of the sample size
sample_size <- floor(0.75 * nrow(input))
## set the seed to make your partition reproducible
set.seed(1234)
input.train.ind <- sample(seq_len(nrow(input)), size = sample_size)
input.train <- input[input.train.ind, ]
output.train <- as.matrix(output)[input.train.ind, ]

# TESTING DATA
input.test <- input[-input.train.ind, ]
output.test <- as.matrix(output)[-input.train.ind, ]
Then fit the k-NN model and report the results.
library("class")
knn_model <- knn(train=input.train, input.test, cl=as.factor(output.train), k=2)
# plot(knn_model)
summary(knn_model)
attributes(knn_model)

# cross-validation
knn_model.cv <- knn.cv(train=input.train, cl=as.factor(output.train), k=2)
summary(knn_model.cv)
In Chapter 12, we showed that k-MC aims to partition \(n\) observations into \(k\) clusters, where each observation belongs to the cluster with the nearest mean, which acts as a prototype of the cluster. The k-MC algorithm partitions the data space into Voronoi cells. In general, there is no computationally tractable exact solution, i.e., the problem is NP-hard. However, there are efficient algorithms that converge quickly to local optima, e.g., the expectation-maximization algorithm for mixtures of Gaussian distributions, which uses an iterative refinement approach.
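Formally, for cluster assignments \(C_1, \cdots, C_k\) with cluster means \(\mu_1, \cdots, \mu_k\), k-MC minimizes the total within-cluster sum of squares \[ \arg\min_{C_1, \cdots, C_k} \sum_{i=1}^{k} \sum_{x\in C_i} {\Vert x-\mu_i \Vert}^2. \] Each iteration of the standard (Lloyd) algorithm alternates between assigning every observation to its nearest current cluster mean and recomputing the cluster means, which decreases this objective until it converges to a local optimum.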
kmeans_model <- kmeans(input.train, 2)
table(output.train, kmeans_model$cluster-1)
##
## output.train 0 1
## 0 173 50
## 1 65 28
layout(matrix(1, 1))
# tiff("C:/Users/Dinov/Desktop/test.tiff", width = 10, height = 10, units = 'in', res = 300)
fpc::plotcluster(input.train, output.train, col = kmeans_model$cluster)
legend("topright", legend=c("0(trueLabel)+0(kmeansLabel)", "1(trueLabel)+0(kmeansLabel)", "0(trueLabel)+1(kmeansLabel)", "1(trueLabel)+1(kmeansLabel)"),
col=c("red", "black", "black", "red"),
text.col=c("red", "black", "black", "red"),
pch = 15 , y.intersp=0.6, x.intersp=0.7,
text.width = 5, cex=1)
# plot_ly(x=~input.train, y=~output.train, color = ~as.factor(kmeans_model$cluster), type="scatter", mode="marker")
input.train <- input.train[ , apply(input.train, 2, var, na.rm=TRUE) != 0]   # remove constant columns

cluster::clusplot(input.train, kmeans_model$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# par(mfrow=c(10,10))
# # the next figure is very large and will not render in RStudio, you may need to save it as PDF file!
# # pdf("C:/Users/Dinov/Desktop/test.pdf", width = 50, height = 50)
# # with(ppmi_data[,1:10], pairs(input.train[,1:10], col=c(1:2)[kmeans_model$cluster]))
# # dev.off()
# with(ppmi_data[,1:10], pairs(input.train[,1:10], col=c(1:2)[kmeans_model$cluster]))
# Convert the first 10 features to factors and construct a SPLOM plot
df2 <- input.train[, 1:10]
col_names <- names(df2)
dims <- as.data.frame(lapply(df2[col_names], factor))

dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL),
        showupperhalf = FALSE, diagonal = list(visible = FALSE))
# plot(input.train, col = kmeans_model$cluster)
# points(kmeans_model$centers, col = 1:2, pch = 8, cex = 2)
## cluster centers "fitted" to each obs.:
fitted.kmeans <- fitted(kmeans_model); head(fitted.kmeans)
## L_insular_cortex_AvgMeanCurvature L_insular_cortex_ComputeArea
## 1 0.1076258 2626.614
## 2 0.2281071 1094.980
## 1 0.1076258 2626.614
## 1 0.1076258 2626.614
## 1 0.1076258 2626.614
## 2 0.2281071 1094.980
## L_insular_cortex_Volume L_insular_cortex_ShapeIndex
## 1 7959.068 0.3260816
## 2 1999.192 0.2773389
## 1 7959.068 0.3260816
## 1 7959.068 0.3260816
## 1 7959.068 0.3260816
## 2 1999.192 0.2773389
## L_insular_cortex_Curvedness R_insular_cortex_AvgMeanCurvature
## 1 0.1069922 0.1217837
## 2 0.2723533 0.2988653
## 1 0.1069922 0.1217837
## 1 0.1069922 0.1217837
## 1 0.1069922 0.1217837
## 2 0.2723533 0.2988653
## R_insular_cortex_ComputeArea R_insular_cortex_Volume
## 1 2019.1489 4927.957
## 2 770.1955 1113.124
## 1 2019.1489 4927.957
## 1 2019.1489 4927.957
## 1 2019.1489 4927.957
## 2 770.1955 1113.124
## R_insular_cortex_ShapeIndex R_insular_cortex_Curvedness
## 1 0.3032212 0.1278277
## 2 0.2672094 0.3530532
## 1 0.3032212 0.1278277
## 1 0.3032212 0.1278277
## 1 0.3032212 0.1278277
## 2 0.2672094 0.3530532
## L_cingulate_gyrus_AvgMeanCurvature L_cingulate_gyrus_ComputeArea
## 1 0.1325229 3967.805
## 2 0.2880964 1311.079
## 1 0.1325229 3967.805
## 1 0.1325229 3967.805
## 1 0.1325229 3967.805
## 2 0.2880964 1311.079
## L_cingulate_gyrus_Volume L_cingulate_gyrus_ShapeIndex
## 1 10015.619 0.4463758
## 2 1835.168 0.3741087
## 1 10015.619 0.4463758
## 1 10015.619 0.4463758
## 1 10015.619 0.4463758
## 2 1835.168 0.3741087
## L_cingulate_gyrus_Curvedness R_cingulate_gyrus_AvgMeanCurvature
## 1 0.07632257 0.1434376
## 2 0.22731432 0.2840280
## 1 0.07632257 0.1434376
## 1 0.07632257 0.1434376
## 1 0.07632257 0.1434376
## 2 0.22731432 0.2840280
## R_cingulate_gyrus_ComputeArea R_cingulate_gyrus_Volume
## 1 3953.795 9964.057
## 2 1177.863 1710.276
## 1 3953.795 9964.057
## 1 3953.795 9964.057
## 1 3953.795 9964.057
## 2 1177.863 1710.276
## R_cingulate_gyrus_ShapeIndex R_cingulate_gyrus_Curvedness
## 1 0.4379627 0.09675352
## 2 0.3367350 0.27422318
## 1 0.4379627 0.09675352
## 1 0.4379627 0.09675352
## 1 0.4379627 0.09675352
## 2 0.3367350 0.27422318
## L_caudate_AvgMeanCurvature L_caudate_ComputeArea L_caudate_Volume
## 1 0.2601677 797.0776 1233.5135
## 2 0.8551616 146.6532 102.8474
## 1 0.2601677 797.0776 1233.5135
## 1 0.2601677 797.0776 1233.5135
## 1 0.2601677 797.0776 1233.5135
## 2 0.8551616 146.6532 102.8474
## L_caudate_ShapeIndex L_caudate_Curvedness R_caudate_AvgMeanCurvature
## 1 0.2914226 0.2796301 0.1894883
## 2 0.3595532 0.6930212 0.7598691
## 1 0.2914226 0.2796301 0.1894883
## 1 0.2914226 0.2796301 0.1894883
## 1 0.2914226 0.2796301 0.1894883
## 2 0.3595532 0.6930212 0.7598691
## R_caudate_ComputeArea R_caudate_Volume R_caudate_ShapeIndex
## 1 1089.4075 1927.4092 0.3055045
## 2 200.8635 175.1637 0.3521822
## 1 1089.4075 1927.4092 0.3055045
## 1 1089.4075 1927.4092 0.3055045
## 1 1089.4075 1927.4092 0.3055045
## 2 200.8635 175.1637 0.3521822
## R_caudate_Curvedness L_putamen_AvgMeanCurvature L_putamen_ComputeArea
## 1 0.1863632 0.1704558 1108.874
## 2 0.6260534 0.5752676 369.652
## 1 0.1863632 0.1704558 1108.874
## 1 0.1863632 0.1704558 1108.874
## 1 0.1863632 0.1704558 1108.874
## 2 0.6260534 0.5752676 369.652
## L_putamen_Volume L_putamen_ShapeIndex L_putamen_Curvedness
## 1 2234.7919 0.2827044 0.1959905
## 2 387.9603 0.2993340 0.5815818
## 1 2234.7919 0.2827044 0.1959905
## 1 2234.7919 0.2827044 0.1959905
## 1 2234.7919 0.2827044 0.1959905
## 2 387.9603 0.2993340 0.5815818
## R_putamen_AvgMeanCurvature R_putamen_ComputeArea R_putamen_Volume
## 1 0.1282211 1538.4491 3700.6326
## 2 0.3955008 539.4418 718.2451
## 1 0.1282211 1538.4491 3700.6326
## 1 0.1282211 1538.4491 3700.6326
## 1 0.1282211 1538.4491 3700.6326
## 2 0.3955008 539.4418 718.2451
## R_putamen_ShapeIndex R_putamen_Curvedness L_hippocampus_AvgMeanCurvature
## 1 0.3193293 0.1344744 0.1305658
## 2 0.2755695 0.4745640 0.2868015
## 1 0.3193293 0.1344744 0.1305658
## 1 0.3193293 0.1344744 0.1305658
## 1 0.3193293 0.1344744 0.1305658
## 2 0.2755695 0.4745640 0.2868015
## L_hippocampus_ComputeArea L_hippocampus_Volume L_hippocampus_ShapeIndex
## 1 1573.5764 3817.140 0.314922
## 2 731.5145 1180.315 0.253671
## 1 1573.5764 3817.140 0.314922
## 1 1573.5764 3817.140 0.314922
## 1 1573.5764 3817.140 0.314922
## 2 731.5145 1180.315 0.253671
## L_hippocampus_Curvedness R_hippocampus_AvgMeanCurvature
## 1 0.1309192 0.1242315
## 2 0.3950394 0.2596902
## 1 0.1309192 0.1242315
## 1 0.1309192 0.1242315
## 1 0.1309192 0.1242315
## 2 0.3950394 0.2596902
## R_hippocampus_ComputeArea R_hippocampus_Volume R_hippocampus_ShapeIndex
## 1 1692.6730 4256.232 0.3130925
## 2 849.2564 1488.660 0.2615934
## 1 1692.6730 4256.232 0.3130925
## 1 1692.6730 4256.232 0.3130925
## 1 1692.6730 4256.232 0.3130925
## 2 849.2564 1488.660 0.2615934
## R_hippocampus_Curvedness L_fusiform_gyrus_AvgMeanCurvature
## 1 0.1283778 0.1135699
## 2 0.3563544 0.1570402
## 1 0.1283778 0.1135699
## 1 0.1283778 0.1135699
## 1 0.1283778 0.1135699
## 2 0.3563544 0.1570402
## L_fusiform_gyrus_ComputeArea L_fusiform_gyrus_Volume
## 1 3681.950 11910.490
## 2 2882.248 6772.765
## 1 3681.950 11910.490
## 1 3681.950 11910.490
## 1 3681.950 11910.490
## 2 2882.248 6772.765
## L_fusiform_gyrus_ShapeIndex L_fusiform_gyrus_Curvedness
## 1 0.3041462 0.0985821
## 2 0.3158982 0.1423833
## 1 0.3041462 0.0985821
## 1 0.3041462 0.0985821
## 1 0.3041462 0.0985821
## 2 0.3158982 0.1423833
## R_fusiform_gyrus_AvgMeanCurvature R_fusiform_gyrus_ComputeArea
## 1 0.1169820 3535.754
## 2 0.1529637 2724.399
## 1 0.1169820 3535.754
## 1 0.1169820 3535.754
## 1 0.1169820 3535.754
## 2 0.1529637 2724.399
## R_fusiform_gyrus_Volume R_fusiform_gyrus_ShapeIndex
## 1 11462.851 0.3318113
## 2 6923.456 0.3222729
## 1 11462.851 0.3318113
## 1 11462.851 0.3318113
## 1 11462.851 0.3318113
## 2 6923.456 0.3222729
## R_fusiform_gyrus_Curvedness Sex Weight Age chr12_rs34637584_GT
## 1 0.09852701 1.394958 81.01218 60.68045 0
## 2 0.15437168 1.269231 85.65256 62.48391 0
## 1 0.09852701 1.394958 81.01218 60.68045 0
## 1 0.09852701 1.394958 81.01218 60.68045 0
## 1 0.09852701 1.394958 81.01218 60.68045 0
## 2 0.15437168 1.269231 85.65256 62.48391 0
## chr17_rs11868035_GT chr17_rs11012_GT chr17_rs393152_GT chr17_rs12185268_GT
## 1 0.6260504 0.3193277 0.4033613 0.3697479
## 2 0.5384615 0.4102564 0.5128205 0.4871795
## 1 0.6260504 0.3193277 0.4033613 0.3697479
## 1 0.6260504 0.3193277 0.4033613 0.3697479
## 1 0.6260504 0.3193277 0.4033613 0.3697479
## 2 0.5384615 0.4102564 0.5128205 0.4871795
## chr17_rs199533_GT chr17_rs199533_DP chr17_rs199533_GQ
## 1 0.3529412 36.99580 72.88235
## 2 0.4743590 45.38462 78.16667
## 1 0.3529412 36.99580 72.88235
## 1 0.3529412 36.99580 72.88235
## 1 0.3529412 36.99580 72.88235
## 2 0.4743590 45.38462 78.16667
## UPDRS_Part_I_Summary_Score_Baseline UPDRS_Part_I_Summary_Score_Month_03
## 1 1.013571 1.508369
## 2 0.974359 1.690367
## 1 1.013571 1.508369
## 1 1.013571 1.508369
## 1 1.013571 1.508369
## 2 0.974359 1.690367
## UPDRS_Part_I_Summary_Score_Month_06 UPDRS_Part_I_Summary_Score_Month_09
## 1 1.272435 1.270631
## 2 1.426228 1.539340
## 1 1.272435 1.270631
## 1 1.272435 1.270631
## 1 1.272435 1.270631
## 2 1.426228 1.539340
## UPDRS_Part_I_Summary_Score_Month_12 UPDRS_Part_I_Summary_Score_Month_18
## 1 1.085737 1.495304
## 2 1.177899 1.691696
## 1 1.085737 1.495304
## 1 1.085737 1.495304
## 1 1.085737 1.495304
## 2 1.177899 1.691696
## UPDRS_Part_I_Summary_Score_Month_24
## 1 1.540068
## 2 1.605211
## 1 1.540068
## 1 1.540068
## 1 1.540068
## 2 1.605211
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline
## 1 4.425786
## 2 4.012821
## 1 4.425786
## 1 4.425786
## 1 4.425786
## 2 4.012821
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03
## 1 6.952695
## 2 7.073448
## 1 6.952695
## 1 6.952695
## 1 6.952695
## 2 7.073448
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_06
## 1 6.498689
## 2 6.842894
## 1 6.498689
## 1 6.498689
## 1 6.498689
## 2 6.842894
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_09
## 1 6.944167
## 2 7.604399
## 1 6.944167
## 1 6.944167
## 1 6.944167
## 2 7.604399
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12
## 1 5.072632
## 2 5.114479
## 1 5.072632
## 1 5.072632
## 1 5.072632
## 2 5.114479
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18
## 1 7.426062
## 2 8.043127
## 1 7.426062
## 1 7.426062
## 1 7.426062
## 2 8.043127
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24
## 1 6.259172
## 2 5.976416
## 1 6.259172
## 1 6.259172
## 1 6.259172
## 2 5.976416
## UPDRS_Part_III_Summary_Score_Baseline UPDRS_Part_III_Summary_Score_Month_03
## 1 15.80387 22.16758
## 2 12.55128 20.97394
## 1 15.80387 22.16758
## 1 15.80387 22.16758
## 1 15.80387 22.16758
## 2 12.55128 20.97394
## UPDRS_Part_III_Summary_Score_Month_06 UPDRS_Part_III_Summary_Score_Month_09
## 1 22.87287 22.72272
## 2 20.37245 19.97371
## 1 22.87287 22.72272
## 1 22.87287 22.72272
## 1 22.87287 22.72272
## 2 20.37245 19.97371
## UPDRS_Part_III_Summary_Score_Month_12 UPDRS_Part_III_Summary_Score_Month_18
## 1 17.35737 23.75503
## 2 13.38860 21.69028
## 1 17.35737 23.75503
## 1 17.35737 23.75503
## 1 17.35737 23.75503
## 2 13.38860 21.69028
## UPDRS_Part_III_Summary_Score_Month_24
## 1 19.73472
## 2 16.22692
## 1 19.73472
## 1 19.73472
## 1 19.73472
## 2 16.22692
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline
## 1 5.852568
## 2 6.334990
## 1 5.852568
## 1 5.852568
## 1 5.852568
## 2 6.334990
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_06
## 1 5.537727
## 2 6.043413
## 1 5.537727
## 1 5.537727
## 1 5.537727
## 2 6.043413
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_12
## 1 5.970469
## 2 6.585084
## 1 5.970469
## 1 5.970469
## 1 5.970469
## 2 6.585084
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24
## 1 6.351765
## 2 7.175677
## 1 6.351765
## 1 6.351765
## 1 6.351765
## 2 7.175677
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Baseline
## 1 5.231092
## 2 5.384615
## 1 5.231092
## 1 5.231092
## 1 5.231092
## 2 5.384615
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06
## 1 5.023530
## 2 5.032027
## 1 5.023530
## 1 5.023530
## 1 5.023530
## 2 5.032027
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_12
## 1 5.200913
## 2 5.340521
## 1 5.200913
## 1 5.200913
## 1 5.200913
## 2 5.340521
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_24
## 1 5.203064
## 2 5.289329
## 1 5.203064
## 1 5.203064
## 1 5.203064
## 2 5.289329
resid.kmeans <- (input.train - fitted(kmeans_model))

# define the sum of squares function
ss <- function(data) sum(scale(data, scale = FALSE)^2)

## Equalities
cbind(kmeans_model[c("betweenss", "tot.withinss", "totss")],   # the same two columns
      c(ss(fitted.kmeans), ss(resid.kmeans), ss(input.train)))
## [,1] [,2]
## betweenss 16891584021 16891584021
## tot.withinss 12772530290 12772538936
## totss 29664114311 29664114311
# validation
stopifnot(all.equal(kmeans_model$totss, ss(input.train)),
all.equal(kmeans_model$tot.withinss, ss(resid.kmeans)),
## these three are the same:
all.equal(kmeans_model$betweenss, ss(fitted.kmeans)),
all.equal(kmeans_model$betweenss, kmeans_model$totss - kmeans_model$tot.withinss),
## and hence also
all.equal(ss(input.train), ss(fitted.kmeans) + ss(resid.kmeans))
)
## Error: kmeans_model$tot.withinss and ss(resid.kmeans) are not equal:
## Mean relative difference: 6.768987e-07
# kmeans(input.train, 1)$withinss
# trivial one-cluster, (its W.SS == ss(input.train))
clust_kmeans2 = kmeans(scale(X), centers=X[1:2,], iter.max=100, algorithm='Lloyd')
We get an empty cluster instead of two clusters when we select two raw data points as the initial centers. One way to solve this problem is to use k-means++ initialization.
# k++ initialize
kpp_init = function(dat, K) {
  x = as.matrix(dat)
  n = nrow(x)
  # Randomly choose a first center
  centers = matrix(NA, nrow=K, ncol=ncol(x))
  centers[1,] = as.matrix(x[sample(1:n, 1),])
  for (k in 2:K) {
    # Calculate dist^2 to closest center for each point
    dists = matrix(NA, nrow=n, ncol=k-1)
    for (j in 1:(k-1)) {
      temp = sweep(x, 2, centers[j,], '-')
      dists[,j] = rowSums(temp^2)
    }
    dists = rowMeans(dists)
    # Draw next center with probability proportional to dist^2
    cumdists = cumsum(dists)
    prop = runif(1, min=0, max=cumdists[n])
    centers[k,] = as.matrix(x[min(which(cumdists > prop)),])
  }
  return(centers)
}
clust_kmeans2_plus = kmeans(scale(X), kpp_init(scale(X), 2), iter.max=100, algorithm='Lloyd')
Now let’s evaluate the model. The first step is to justify the selection of k=2. We use the silhouette() function in the cluster package. Recall from Chapter 13 that the silhouette value is between -1 and 1, and negative values indicate "mis-clustered" cases.
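For reference, the silhouette value of case \(i\) is \[ s(i)=\frac{b(i)-a(i)}{\max\{ a(i), b(i)\}}, \] where \(a(i)\) is the average distance from case \(i\) to the other members of its own cluster and \(b(i)\) is the smallest average distance from \(i\) to the members of any other cluster; values near \(1\) indicate well-clustered cases and negative values flag likely mis-clustered cases.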
clust_k2 = clust_kmeans2_plus$cluster
require(cluster)
## Loading required package: cluster
# X = as.matrix(input.short2)
# as the data is too large for the silhouette plot, we'll just subsample and plot 100 random cases
subset_int <- sample(nrow(X), 100)   # 100 cases from 422 total cases
dis = dist(as.data.frame(scale(X[subset_int, ])))
sil_k2 = silhouette(clust_k2[subset_int], dis)   # best
plot(sil_k2, col=c(1:length(clust_kmeans2_plus$size)))
summary(sil_k2)
## Silhouette of 100 units in 2 clusters from silhouette.default(x = clust_k2[subset_int], dist = dis) :
## Cluster sizes and average silhouette widths:
## 56 44
## 0.23255908 0.03090563
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.14149 0.05501 0.14890 0.14383 0.25305 0.34628
mean(sil_k2<0)
## [1] 0.04666667
The result is pretty good: only a very small number of cases are "mis-clustered" (i.e., have negative silhouette values). Further, you can observe that with `k=3` or `k=4`, the overall average silhouette width is smaller.
dis = dist(as.data.frame(scale(X)))
clust_kmeans3_plus = kmeans(scale(X), kpp_init(scale(X), 3), iter.max=100, algorithm='Lloyd')
summary(silhouette(clust_kmeans3_plus$cluster, dis))
## Silhouette of 422 units in 3 clusters from silhouette.default(x = clust_kmeans3_plus$cluster, dist = dis) :
## Cluster sizes and average silhouette widths:
## 176 59 187
## 0.10389846 0.06762243 0.13880747
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.06382 0.05580 0.11102 0.11430 0.16941 0.29441
clust_kmeans4_plus = kmeans(scale(X), kpp_init(scale(X), 4), iter.max=100, algorithm='Lloyd')
summary(silhouette(clust_kmeans4_plus$cluster, dis))
## Silhouette of 422 units in 4 clusters from silhouette.default(x = clust_kmeans4_plus$cluster, dist = dis) :
## Cluster sizes and average silhouette widths:
## 78 116 117 111
## 0.04558664 0.17648759 0.06998387 0.19086006
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.12123 0.05065 0.14215 0.12654 0.20929 0.32875
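More generally, we can sweep over several values of \(k\) and compare the average silhouette widths directly. The sketch below assumes `X`, `dis`, and `kpp_init()` as defined above (output not shown); larger average widths support the choice \(k=2\).
# average silhouette width for k = 2, ..., 6 (larger is better)
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(scale(X), kpp_init(scale(X), k), iter.max = 100, algorithm = 'Lloyd')
  mean(silhouette(cl$cluster, dis)[, "sil_width"])
})
names(avg_sil) <- paste0("k=", 2:6)
avg_sil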
Then, let's calculate the unsupervised classification error. Here, \(p\) denotes the proportion of class \(0\) cases in the data; it serves as the weighting factor for labeling each cluster, i.e., a cluster is labeled \(0\) only if its fraction of class-\(0\) cases exceeds \(p\).
mat = matrix(1, nrow = length(Y))
p = sum(Y==0)/length(Y)
for (i in 1:2){
  id = which(clust_k2==i)
  if (sum(Y[id]==0) > length(id)*p){
    mat[id] = 0
  }
}
caret::confusionMatrix(factor(Y), factor(mat))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 189 111
## 1 1 121
##
## Accuracy : 0.7346
## 95% CI : (0.6897, 0.7761)
## No Information Rate : 0.5498
## P-Value [Acc > NIR] : 3.789e-15
##
## Kappa : 0.4906
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9947
## Specificity : 0.5216
## Pos Pred Value : 0.6300
## Neg Pred Value : 0.9918
## Prevalence : 0.4502
## Detection Rate : 0.4479
## Detection Prevalence : 0.7109
## Balanced Accuracy : 0.7581
##
## 'Positive' Class : 0
##
#table(Y, mat)
# caret::confusionMatrix(factor(Y, levels=0:1), factor(mat, levels=0:1))
The clustering-based labels achieve about 73% accuracy, which is reasonable for an unsupervised classifier.
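The weighted-majority rule above already picks a labeling, but since k-means cluster labels are intrinsically arbitrary (label switching), a prudent extra check is to score both possible cluster-to-class assignments and keep the better one. A minimal sketch, assuming `Y` is coded with the labels 0 and 1:
# score both labelings of the two clusters and keep the larger accuracy
y_num <- as.numeric(as.character(Y))
acc_direct  <- mean(y_num == as.vector(mat))
acc_flipped <- mean(y_num == (1 - as.vector(mat)))
max(acc_direct, acc_flipped)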
Finally, let's visualize the results by projecting the data onto the first two multidimensional scaling (MDS) dimensions.
library("ggplot2")
= as.data.frame(cmdscale(dis, k=2))
mds = cbind( mds, as.factor(clust_k2))
mds_temp names(mds_temp) = c('V1', 'V2', 'cluster k=2')
# gp_cluster = ggplot(mds_temp, aes(x=V2, y=V1, color=as.factor(clust_k2))) +
# geom_point(aes(shape = as.factor(Y))) + theme()
# gp_cluster
plot_ly(data=mds_temp, x=~V1, y=~V2, color=~mds_temp[,3], symbol=~as.factor(Y), type="scatter", mode="markers",
marker=list(size=15)) %>%
layout(title = "MDS Performance",
hovermode = "x unified", legend = list(orientation='h'))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Suppose the multivariate dataset is represented as a set of data points \(A\). We can define a similarity matrix \(S=\{s_{(i, j)}\}\), where \(s_{(i, j)}\) represents a measure of the similarity between points \(i, j\in A.\) Spectral clustering uses the spectrum of the similarity matrix of the high-dimensional data and performs dimensionality reduction before clustering in fewer dimensions. The spectrum of a matrix is the set of its eigenvalues. In general, if \(T:\Omega \stackrel{\text{linear operator}}{\rightarrow} \Omega\) maps a vector space \(\Omega\) into itself, its spectrum is the set of scalars \(\lambda=\{\lambda_i\}\) such that \((T-\lambda I)v=0\) for some nonzero eigenvector (or eigenfunction) \(v\), where \(I\) is the identity matrix. The determinant of the matrix equals the product of its eigenvalues, \(\det(T)=\Pi_i \lambda_i\), the trace of the matrix is \(tr(T)=\Sigma_i \lambda_i\), and the pseudo-determinant of a singular matrix is the product of its nonzero eigenvalues, \(pseudo_{det}(T)=\Pi_{\lambda_i \neq 0}\lambda_i.\)
To partition the data points into two sets \((S_1, S_2)\), suppose \(v\) is the eigenvector corresponding to the second-smallest eigenvalue of the (normalized) Laplacian matrix:
\[L = I - D^{-\frac{1}{2}}SD^{-\frac{1}{2}}\]
of the similarity matrix \(S\), where \(D\) is the diagonal degree matrix with \(D_{i, i}=\Sigma_j S_{i, j}.\)
The actual \((S_1, S_2)\) partitioning of the cases may be done in different ways. For instance, we may take the median \(m\) of the components of \(v\), assign to \(S_1\) all data points whose component in \(v\) is greater than \(m\), and label the remaining cases as part of \(S_2\). This approach may be applied iteratively for hierarchical clustering by repeatedly partitioning the resulting subsets.
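To make the construction concrete, here is a minimal, self-contained sketch (not the `kernlab` implementation) on simulated 2D data; it assumes a Gaussian similarity kernel with bandwidth \(\sigma=1\) and applies the median split of the Fiedler vector described above.
# two Gaussian blobs as a toy dataset A (50 points each)
set.seed(1234)
A <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
sigma <- 1                                               # assumed kernel bandwidth
S <- exp(-as.matrix(dist(A))^2 / (2 * sigma^2))          # similarity matrix s_(i,j)
D_inv_sqrt <- diag(1 / sqrt(rowSums(S)))                 # D^{-1/2}
L <- diag(nrow(S)) - D_inv_sqrt %*% S %*% D_inv_sqrt     # normalized Laplacian
eig <- eigen(L, symmetric = TRUE)                        # eigenvalues in decreasing order
v <- eig$vectors[, ncol(S) - 1]                          # eigenvector of the 2nd-smallest eigenvalue
S1 <- which(v > median(v)); S2 <- which(v <= median(v))  # median split into (S1, S2)
c(size_S1 = length(S1), size_S2 = length(S2))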
The `specc()` method in the `kernlab` package implements a spectral clustering algorithm where the data clustering is performed by embedding the data into the subspace spanned by the eigenvectors of an affinity matrix.
# install.packages("kernlab")
library("kernlab")
# review and choose a dataset (for example, the Iris data)
data()
#plot(iris)
Let's look at a few simple cases of spectral clustering. We are suppressing some of the outputs to save space (e.g., `plot(my_data, col=data_sc)`). First, consider the `iris` dataset that we saw in Chapter 2.
my_data <- as.matrix(iris[, -5]); dim(my_data)   # drop the Species labels; keep the 4 numeric features
num_clusters <- 3
data_sc <- specc(my_data, centers = num_clusters)
data_sc
centers(data_sc)
withinss(data_sc)
#plot(my_data, col= data_sc)
# plot(iris[,1:2], col=data_sc, pch=as.numeric(iris$Species), lwd=2)
# legend("topright", legend=c("setosa", "versicolor", "virginica"), col=c("black", "red", "green"),
# lty=as.numeric(iris$Species), cex=1.2)
# legend("bottomright", legend=c("Clust1", "Clust2", "Clust3"), pch=unique(as.numeric(iris$Species)),
# cex=1.2)
plot_ly(x=~iris[,1], y=~iris[,2], type="scatter", mode="markers", color=~as.factor(data_sc@.Data),
marker=list(size=15), symbol=~as.numeric(iris$Species)) %>%
layout(title="Spectral Clustering (Iris Data)", xaxis=list(title=colnames(iris)[1]), yaxis=list(title=colnames(iris)[2]))
Another simple dataset is `kernlab::spirals`.
# library("kernlab")
data(spirals)
num_clusters <- 2
data_sc <- specc(spirals, centers = num_clusters)
data_sc
## Spectral Clustering object of class "specc"
##
## Cluster memberships:
##
## 2 2 1 1 2 1 1 1 2 1 1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 1 2 1 1 2 1 2 1 2 2 1 1 1 1 2 2 2 2 2 1 2 1 2 2 1 1 1 2 2 2 2 2 1 1 2 1 2 2 2 1 1 2 1 1 1 2 2 2 2 1 2 1 2 1 2 2 2 2 2 2 1 2 2 1 1 1 2 1 1 1 1 2 2 2 1 1 2 1 1 1 2 1 2 2 2 2 1 1 2 2 1 2 2 2 1 2 1 2 2 2 2 2 1 1 1 1 1 2 2 2 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 1 1 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1 1 2 2 2 1 1 1 2 2 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 1 2 1 1 1 1 1 2 1 2 1 2 1 2 2 2 1 1 1 1 2 2 2 1 2 2 1 1 1 1 1 2 2 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 1 2 1 2 1 2 2 1 1 1 2 1 2 2 2 2 1 1 2 1 2 1 1 1 2 1 2 1 1 2 2 1 1 1 2
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 184.344560953488
##
## Centers:
## [,1] [,2]
## [1,] -0.01770984 0.1775137
## [2,] 0.01997201 -0.1761483
##
## Cluster size:
## [1] 150 150
##
## Within-cluster sum of squares:
## [1] 118.1182 117.3429
centers(data_sc)
## [,1] [,2]
## [1,] -0.01770984 0.1775137
## [2,] 0.01997201 -0.1761483
withinss(data_sc)
## [1] 118.1182 117.3429
# plot(spirals, col= data_sc)
clusterNames <- paste0("Cluster ", data_sc@.Data)
plot_ly(x=~spirals[,1], y=~spirals[,2], type="scatter", mode="markers", color = ~data_sc@.Data, name=clusterNames) %>% hide_colorbar()
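For contrast, plain k-means on the same `spirals` data typically splits the points by Euclidean proximity to the two centroids rather than by spiral arm, since the two spiral clusters are not convex. A quick hedged comparison (output not shown):
# cross-tabulate spectral-clustering vs. k-means labels on the spirals data
kmeans_spirals <- kmeans(spirals, centers = 2, nstart = 10)
table(Spectral = data_sc@.Data, kMeans = kmeans_spirals$cluster)
# plot(spirals, col = kmeans_spirals$cluster)   # optional visual check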
A customer income dataset representing a marketing survey is included in `kernlab::income`. Note that its features are categorical, so `specc()` applies a string kernel and the numeric `centers()` and `withinss()` summaries are not available (see the `NA` and `logical(0)` outputs below).
data(income)
num_clusters <- 2
data_sc <- specc(income, centers = num_clusters)
data_sc
## Spectral Clustering object of class "specc"
##
## Cluster memberships:
##
## 2 1 1 2 2 1 2 1 1 1 1 1 2 1
##
## String kernel function. Type = spectrum
## Hyperparameters : sub-sequence/string length = 4
## Normalized
##
## Cluster size:
## [1] 9 5
centers(data_sc)
## [,1]
## [1,] NA
withinss(data_sc)
## logical(0)
# plot(income, col= data_sc)
# Convert all Income features to factors and construct SPLOM plot
df2 <- income
col_names <- names(df2)
dims <- as.data.frame(lapply(df2[col_names], factor))
dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL),
        showupperhalf = FALSE, diagonal = list(visible = FALSE))
Now let's compare the eight classifiers we presented above (AdaBoost, LDA, QDA, kNN, logit, Neural Network, linear SVM, and Gaussian SVM), together with the two unsupervised methods (k-Means and spectral clustering), using cross-validated diagnostic metrics.
# get AdaBoost CV results
set.seed(123)
cv.out.ada <- crossval::crossval(my.ada, as.data.frame(X), Y, K = 5, B = 1, negative = neg)
## Number of folds: 5
## Total number of CV fits: 5
##
## Round # 1 of 1
## CV Fit # 1 of 5
## CV Fit # 2 of 5
## CV Fit # 3 of 5
## CV Fit # 4 of 5
## CV Fit # 5 of 5
# get k-Means CV results
my.kmeans <- function (train.x, train.y, test.x, test.y, negative, formula) {
  kmeans.fit <- kmeans(scale(test.x), kpp_init(scale(test.x), 2), iter.max=100, algorithm='Lloyd')
  predict.y <- kmeans.fit$cluster
  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
  # negative is the label of a negative "null" sample (default: "control")
  return (out)
}
set.seed(123)
cv.out.kmeans <- crossval::crossval(my.kmeans, as.data.frame(X), Y, K = 5, B = 2, negative = neg)
## Number of folds: 5
## Total number of CV fits: 10
##
## Round # 1 of 2
## CV Fit # 1 of 10
## CV Fit # 2 of 10
## CV Fit # 3 of 10
## CV Fit # 4 of 10
## CV Fit # 5 of 10
##
## Round # 2 of 2
## CV Fit # 6 of 10
## CV Fit # 7 of 10
## CV Fit # 8 of 10
## CV Fit # 9 of 10
## CV Fit # 10 of 10
# get spectral clustering CV results
my.sc <- function (train.x, train.y, test.x, test.y, negative, formula) {
  sc.fit <- specc(scale(test.x), centers = 2)
  predict.y <- sc.fit@.Data
  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
  # negative is the label of a negative "null" sample (default: "control")
  return (out)
}
set.seed(123)
cv.out.sc <- crossval::crossval(my.sc, as.data.frame(X), Y, K = 5, B = 2, negative = neg)
## Number of folds: 5
## Total number of CV fits: 10
##
## Round # 1 of 2
## CV Fit # 1 of 10
## CV Fit # 2 of 10
## CV Fit # 3 of 10
## CV Fit # 4 of 10
## CV Fit # 5 of 10
##
## Round # 2 of 2
## CV Fit # 6 of 10
## CV Fit # 7 of 10
## CV Fit # 8 of 10
## CV Fit # 9 of 10
## CV Fit # 10 of 10
require(knitr)
## Loading required package: knitr
res_tab = rbind(diagnosticErrors(cv.out.ada$stat), diagnosticErrors(cv.out.lda$stat),
                diagnosticErrors(cv.out.qda$stat), diagnosticErrors(cv.out.knn$stat),
                diagnosticErrors(cv.out.logit$stat), diagnosticErrors(cv.out.nn$stat),
                diagnosticErrors(cv.out.svml$stat), diagnosticErrors(cv.out.svmg$stat),
                diagnosticErrors(cv.out.kmeans$stat), diagnosticErrors(cv.out.sc$stat))
rownames(res_tab) <- c("AdaBoost", "LDA", "QDA", "knn", "logit", "Neural Network", "linear SVM", "Gaussian SVM", "k-Means", "Spectral Clustering")
kable(res_tab, caption = "Comparison of CV results")
|                     |       acc |      sens |      spec |       ppv |       npv |        lor |
|---------------------|-----------|-----------|-----------|-----------|-----------|------------|
| AdaBoost            | 0.9668246 | 0.9866667 | 0.9180328 | 0.9673203 | 0.9655172 |  6.7199789 |
| LDA                 | 0.9606635 | 0.9515000 | 0.9831967 | 0.9928696 | 0.8918216 |  7.0457111 |
| QDA                 | 0.9395735 | 0.9550000 | 0.9016393 | 0.9597990 | 0.8906883 |  5.2706226 |
| knn                 | 0.6125592 | 0.7266667 | 0.3319672 | 0.7278798 | 0.3306122 |  0.2784748 |
| logit               | 0.9478673 | 0.9533333 | 0.9344262 | 0.9727891 | 0.8906250 |  5.6736914 |
| Neural Network      | 0.9407583 | 0.9016393 | 0.9566667 | 0.8943089 | 0.9598662 |  5.3101066 |
| linear SVM          | 0.9573460 | 0.9344262 | 0.9666667 | 0.9193548 | 0.9731544 |  6.0240527 |
| Gaussian SVM        | 0.9478673 | 0.9262295 | 0.9566667 | 0.8968254 | 0.9695946 |  5.6246961 |
| k-Means             | 0.4241706 | 0.3866667 | 0.5163934 | 0.6628571 | 0.2550607 | -0.3957483 |
| Spectral Clustering | 0.4727488 | 0.4683333 | 0.4836066 | 0.6904177 | 0.2700229 | -0.1924337 |
Leaving `knn`, `kmeans`, and `specc` aside, the other methods achieve fairly good results. With these data, the suboptimal performance of the clustering methods may be rooted in the absence of supervised training (e.g., `specc` and `kmeans` never see the labels) or in the curse of (high) dimensionality, which we saw in Chapter 6. As the data are very sparse in the high-dimensional feature space, predicting from the nearest neighbors may not be reliable.