Start by reviewing Chapter 13 (Model Performance Assessment). Cross-validation is a strategy for validating predictive methods, classification models and clustering techniques by assessing the reliability and stability of the results of the corresponding statistical analyses (e.g., predictions, classifications, forecasts) based on independent datasets. For prediction of trend, association, clustering, and classification, a model is usually trained on one dataset (training data) and subsequently tested on new data (testing or validation data). Statistical internal cross-validation defines a test dataset to evaluate the model predictive performance as well as assess its power to avoid overfitting. Overfitting is the process of computing a predictive or classification model that describes random error, i.e., fits to the noise components of the observations, instead of identifying actual relationships and salient features in the data.
In this Chapter, we will use Google Flu Trends, Autism, and Parkinson’s disease case-studies to illustrate (1) alternative forecasting types using linear and non-linear predictions, (2) exhaustive and non-exhaustive internal statistical cross-validation, and (3) explore complementary predictor functions.
In Chapter 6 we discussed the types of classification and prediction methods, including supervised
and unsupervised
learning. The former are direct and predictive (there are known outcome variables that can be predicted and the corresponding forecasts can be evaluated) and the latter are indirect and descriptive (there are no a priori labels or specific outcomes).
There are alternative metrics used for evaluation of model performance, see Chapter 13. For example, assessment of supervised prediction and classification methods depends on the type of the labeled outcome responses - categorical (binary or polytomous) vs. continuous.
Confusion matrices reporting accuracy, FP, FN, PPV, NPV, LOR and other metrics may be used to assess predictions of dichotomous (binary) or polytomous outcomes. \(R^2\), correlations (between predicted and observed outcomes), and RMSE measures may be used to quantify the performance of various supervised forecasting methods on continuous features.
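As a quick reminder of how these metrics are computed, here is a minimal R sketch using a small hypothetical confusion matrix (the TP, FP, TN, FN counts below are made up purely for illustration) and a toy pair of observed/predicted continuous vectors.

# hypothetical 2x2 confusion-matrix counts (illustrative, not from a real study)
TP <- 90; FP <- 10; TN <- 80; FN <- 20
acc  <- (TP + TN) / (TP + FP + TN + FN)          # accuracy
sens <- TP / (TP + FN); spec <- TN / (TN + FP)   # sensitivity, specificity
ppv  <- TP / (TP + FP); npv  <- TN / (TN + FN)   # positive/negative predictive values
lor  <- log((TP * TN) / (FP * FN))               # one common definition of the log odds-ratio
c(acc=acc, sens=sens, spec=spec, ppv=ppv, npv=npv, lor=lor)

# continuous outcomes: correlation and RMSE between observed and predicted values
observed  <- c(2.1, 3.4, 5.0, 6.2, 7.9)
predicted <- c(2.3, 3.1, 5.4, 6.0, 8.3)
cor(observed, predicted)
sqrt(mean((observed - predicted)^2))             # RMSE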
Before we go into the cross-validation of predictive analytics, we will present several examples of overfitting that illustrate why a certain amount of skepticism and mistrust may be appropriate when dealing with forecasting models based on large and complex data.
By 2017, there were only 57 US presidential elections and 45 presidents. That is a small dataset, and learning from it may be challenging. For instance:
This cartoon illustrates some of the (unique) noisy presidential characteristics that are thought to be unimportant to presidential elections or presidential performance.
A March 14, 2014 article in Science (DOI: 10.1126/science.1248506), identified problems in Google Flu Trends (GFT), DOI 10.1371/journal.pone.0023610, which may be attributed in part to overfitting. The GFT was built to predict the future Centers for Disease Control and Prevention (CDC) reports of doctor office visits for influenza-like illness (ILI). In February 2013, Nature reported that GFT was predicting more than double the proportion of doctor visits compared to the CDC forecast for the same period.
The GFT model found the best matches among 50 million web search terms to fit 1,152 data points. It predicted quite high odds of finding search terms that match the propensity of the flu but are structurally unrelated and hence not prospectively predictive. In fact, the GFT investigators reported weeding out seasonal search terms that were unrelated to the flu yet strongly correlated with the CDC data, e.g., high school basketball season. The big GFT data may have overfitted the relatively small number of cases. This false-alarm result was also paired with a false-negative finding: the GFT model missed the non-seasonal 2009 H1N1 influenza pandemic, which provides a cautionary tale about prediction, overfitting, and prospective validation.
Autistic brains constantly overfit visual and cognitive stimuli. To an autistic person, a general conversation of several adults may seem like a cacophony due to super-sensitive detail-oriented hearing and perception tuned to literally pick up all elements of the conversation and clues of the surrounding environment. At the same time, autistic brains downplay body language, sarcasm and non-literal cues. We can miss the forest for the trees
when we start “overfitting”, over-interpreting the noise on top of the actual salient information. Ambient noise, trivial observations and unrelated perceptions may hide the true communication details.
Human conversations and communications involve exchanges of both critical information and random noise. Fitting a perfect model requires focus only on the “relevant” information. Overfitting occurs when attention is (excessively) consumed with peripheral noise, or worse, overwhelmed by inconsequential noise drowning the salient aspects of the communication exchange.
Any dataset is a mix of signal and noise. The main task of our brains is to sort these components and interpret the information (i.e., ignore the noise).
“One person’s noise is another person’s treasure map!”
Our predictions are most accurate if we can model as much of the signal and as little of the noise as possible. Note that, in these terms, \(R^2\) is a poor metric of predictive power - it measures how much of the signal and the noise is explained by our model. In practice, it is hard to always distinguish what is signal and what is noise. This is why practical applications tend to favor simpler models: the more complicated a model is, the easier it is to overfit the noise component of the observed information.
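To illustrate why a high \(R^2\) by itself does not imply predictive power, here is a small simulated sketch (synthetic data, not from the case-studies below) in which adding pure-noise predictors inflates the in-sample \(R^2\).

set.seed(1234)
n <- 100
x <- rnorm(n)
y <- 2*x + rnorm(n)                       # outcome = signal (2x) + noise
noise <- matrix(rnorm(n*50), ncol=50)     # 50 pure-noise predictors
summary(lm(y ~ x))$r.squared              # R^2 using only the real predictor
summary(lm(y ~ x + noise))$r.squared      # higher R^2, obtained purely by fitting the noise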
Internal statistical cross-validation assesses the expected performance of a prediction method in cases (subject, units, regions, etc.) drawn from a similar population as the original training data sample. Internal
validation is distinct from external
validation, as the latter potentially allows for the existence of differences between the populations (training data, used to develop or train the technique, and testing data, used to independently quantify the performance of the technique).
Each iteration of the internal statistical cross-validation protocol involves randomly partitioning the data into complementary training and testing subsets, fitting the model on the training subset, and evaluating its predictive performance on the testing subset.
Here is one example of internal statistical cross-validation: Predictive diagnostic modeling in Parkinson’s disease.
To reduce the noise and variability of any single iteration, the final validation results may report the performance results averaged across all iterations.
In cases when new observations are hard to obtain (due to costs, reliability, time, or other constraints), cross-validation guards against testing hypotheses suggested by the data themselves (also known as Type III error
or False-Suggestion).
Cross-validation is different from conventional validation (e.g., an 80%-20% partition of the data into training and testing subsets), where the prediction error (e.g., Root Mean Square Error, RMSE) evaluated on the training data is not a useful estimator of model performance, as it does not generalize across multiple samples.
In general, the errors of conventional validation are based on the results of a specific test dataset and may not accurately represent the model performance. A more appropriate strategy for estimating model prediction performance is to use cross-validation (CV), which combines (averages) prediction errors to measure the model performance. CV corrects for the expected stochastic nature of partitioning the training and testing sets and generates a more accurate and robust estimate of the expected model performance.
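The contrast between a single conventional split and CV averaging can be sketched as follows (simulated data; the exact numbers will vary with the random seed).

set.seed(4321)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2*dat$x + rnorm(n)

# conventional validation: one 80%-20% split, a single test-set RMSE
idx <- sample(n, size = 0.8*n)
fit <- lm(y ~ x, data = dat[idx, ])
sqrt(mean((dat$y[-idx] - predict(fit, dat[-idx, ]))^2))

# 5-fold CV: average the test RMSE over all 5 folds
folds <- sample(rep(1:5, length.out = n))
cv.rmse <- sapply(1:5, function(k) {
  fit.k <- lm(y ~ x, data = dat[folds != k, ])
  sqrt(mean((dat$y[folds == k] - predict(fit.k, dat[folds == k, ]))^2))
})
mean(cv.rmse)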
Relative to a simpler model, a more complex model may overfit the data if it has short foresight, i.e., it generates accurate fitting results for known data but less accurate results when predicting based on new data. Knowledge from past experiences may include either relevant or irrelevant (noise) information. In challenging data-driven prediction settings, where the uncertainty (entropy) is high, more noise is present in the past information and needs to be accounted for in prospective forecasting. However, it is generally hard to discriminate patterns from noise in complex systems (i.e., to decide which part to model and which to ignore). Models that reduce the chance of fitting noise are called robust.
Let’s demonstrate a simple model assessment using linear regression. Suppose we observe the response values \({y_1, \cdots, y_n}\) and the corresponding \(k\) predictors for each case, represented as \(k\)-dimensional covariate vectors \({x_1, \cdots, x_n }\), where subjects/cases are indexed by \(1\leq i\leq n\) and the data elements (variables) are indexed by \(1\leq j\leq k\).
\[ \begin{pmatrix} x_{1, 1} &\cdots & x_{1, k}\\ \vdots &\ddots & \vdots \\x_{n, 1} &\cdots & x_{n, k} \\ \end{pmatrix} .\]
Using least squares to estimate the linear function parameters (effect-sizes), \({\beta _1, \cdots , \beta _k }\), allows us to compute a hyperplane \(y = a + x\beta\) that best fits the observed data \({(x_i, y_i )}_{1\leq i \leq n}\). This relationship is expressed in matrix form as:
\[\begin{pmatrix} y_1\\ \vdots \\ y_n\\ \end{pmatrix} = \begin{pmatrix} a_1\\ \vdots \\ a_n\\ \end{pmatrix}+\begin{pmatrix} x_{1, 1} &\cdots & x_{1, k}\\ \vdots &\ddots & \vdots \\x_{n, 1} &\cdots & x_{n, k} \\ \end{pmatrix} \begin{pmatrix} \beta_1\\ \vdots \\ \beta_k\\ \end{pmatrix}.\]
Corresponding to the system of linear hyperplanes:
\[ \left\{ \begin{array}{c} y_1=a_1+ x_{1, 1} \beta_1 +x_{1, 2} \beta_2 + \cdots +x_{1, k} \beta_k \\ y_2=a_2+ x_{2, 1} \beta_1 +x_{2, 2} \beta_2 + \cdots +x_{2, k} \beta_k \\ \vdots\\ y_n=a_n+ x_{n, 1} \beta_1 +x_{n, 2} \beta_2 + \cdots +x_{n, k} \beta_k \end{array} \right. . \]
One measure to evaluate the model fit may be the mean squared error (MSE). The MSE for a given value of the parameters \(\alpha\) and \(\beta\) on the observed training data \({(x_i, y_i) }_{1\leq i\leq n}\) is expressed as:
\[MSE=\frac{1}{n}\sum_{i=1}^{n} \left(y_i- \underbrace{(a_i+ x_{i, 1} \beta_1 +x_{i, 2} \beta_2 + \cdots +x_{i, k} \beta_k) }_{\text{predicted value } \hat{y}_i, \text{ at } {x_{i, 1}, \cdots, x_{i, k}}} \right)^2.\]
And the corresponding root mean square error (RMSE) is:
\[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i- \underbrace{(a_i+ x_{i, 1} \beta_1 +x_{i, 2} \beta_2 + \cdots +x_{i, k} \beta_k) }_{\text{predicted value } \hat{y}_i, \text{ at } {x_{i, 1}, \cdots, x_{i, k}}} \right)^2}.\]
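For instance, a minimal sketch of these definitions, computed on the built-in mtcars data with a simple two-predictor linear model (chosen only for illustration), would be:

fit <- lm(mpg ~ wt + hp, data = mtcars)   # n=32 cases, k=2 predictors
y.hat <- fitted(fit)                      # predicted values
MSE  <- mean((mtcars$mpg - y.hat)^2)
RMSE <- sqrt(MSE)
c(MSE = MSE, RMSE = RMSE)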
In the linear model case, the expected value of the MSE (over the distribution of training sets) for the training set is \(\frac{n - k - 1}{n + k + 1} E\), where \(E\) is the expected value of the MSE for the testing/validation data. Therefore, fitting a model and computing the MSE on the training set yields an overly optimistic assessment (smaller RMSE) of how well the model may fit another dataset. This biased measure is an in-sample estimate of the fit, whereas we are interested in the cross-validation estimate as an out-of-sample estimate.
In the linear regression model, cross-validation may not be as useful, since we can compute the exact correction factor \(\frac{n - k - 1}{n + k + 1}\) to obtain an estimate of the expected out-of-sample fit from the in-sample MSE (under)estimate. However, even in this situation, cross-validation remains useful as it can be used to select an optimal regularized cost function.
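A small simulation consistent with this factor (assuming Gaussian predictors and errors, and evaluating new responses drawn at the same design points) can be sketched as follows.

set.seed(2020)
n <- 50; k <- 10
sim.once <- function() {
  Xmat <- matrix(rnorm(n*k), n, k)
  beta <- rnorm(k)
  y.train <- drop(Xmat %*% beta) + rnorm(n)    # observed training responses
  y.new   <- drop(Xmat %*% beta) + rnorm(n)    # new responses at the same design points
  fit <- lm(y.train ~ Xmat)
  c(in.mse  = mean(residuals(fit)^2),          # in-sample (training) MSE
    out.mse = mean((y.new - fitted(fit))^2))   # MSE for the new observations
}
res <- replicate(1000, sim.once())
mean(res["in.mse", ]) / mean(res["out.mse", ])   # empirical ratio of the expected MSEs
(n - k - 1) / (n + k + 1)                        # theoretical correction factor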
In most other modeling procedures (e.g., logistic regression), there are no simple general closed-form expressions (formulas) to adjust the in-sample fit estimate into a cross-validation error estimate. Cross-validation is a generally applicable way to predict the performance of a model on a validation set using stochastic computation instead of obtaining experimental, theoretical, mathematical, or analytic error estimates.
There are two classes of cross-validation approaches: exhaustive and non-exhaustive.
Exhaustive cross-validation methods are based on determining all possible ways to divide the original sample into training and testing data. For instance, leave-m-out cross-validation uses \(m\) observations for testing and the remaining \(n-m\) observations for training, and repeats this over all possible partitions of the original sample. This requires fitting and validating the model \(C_m^n={n \choose m}\) times, where \(n\) is the total number of observations in the original sample and \(m\) is the number left out for validation - an extremely large number of steps even for moderate \(n\) and \(m\). The special case \(m=1\), the leave-one-out method, requires only \(n\) model fits and is practical as long as \(n\) is not too large.
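To get a sense of the computational burden, the sketch below counts the required number of model fits for leave-m-out and runs a simple leave-one-out loop for a linear model on the built-in mtcars data (both are illustrative choices).

n <- 20
sapply(1:5, function(m) choose(n, m))   # number of partitions (model fits) for leave-m-out
choose(100, 5)                          # the count grows combinatorially with n and m

# leave-one-out (m=1) needs only n model fits
loo.sq.err <- sapply(1:nrow(mtcars), function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit, mtcars[i, ]))^2
})
sqrt(mean(loo.sq.err))                  # leave-one-out RMSE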
Non-exhaustive cross-validation methods avoid computing estimates/errors for all possible partitions of the original sample and instead approximate them. For example, in k-fold cross-validation, the original sample is randomly partitioned into \(k\) equal-sized subsamples, or folds. Of the \(k\) subsamples, a single subsample is kept as the final testing data for validation of the model, and the other \(k - 1\) subsamples are used as training data. The cross-validation process is then repeated \(k\) times, corresponding to the \(k\) folds, so that each of the \(k\) subsamples is used exactly once as the validation data. The corresponding \(k\) results are averaged (or otherwise aggregated) to generate a final pooled model-quality estimate. In k-fold validation, all observations are used for both training and validation, and each observation is used for validation exactly once. In general, \(k\) is a parameter that needs to be selected by the investigator (common values are \(5\) or \(10\)).
A special case of the k-fold validation is \(k=n\) (the total number of observations), when it coincides with the leave-one-out cross-validation.
A variation of the k-fold validation is stratified k-fold cross-validation, where each fold has the same (approximately) mean response value. For instance, if the model represents a binary classification of cases (e.g., controls vs. patients), this implies that each fold contains roughly the same proportion of the two class labels.
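A minimal sketch of stratified folds, assuming the caret package is installed: for a factor outcome, createFolds() samples within the levels of the factor, so each fold approximately preserves the class proportions.

library(caret)
y <- factor(c(rep("Control", 30), rep("Patient", 70)))   # imbalanced binary outcome
set.seed(2023)
folds <- createFolds(y, k = 5)                # list of test-set indices, one element per fold
sapply(folds, function(idx) table(y[idx]))    # each fold keeps roughly the 30/70 class ratio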
Repeated random sub-sampling validation randomly splits the entire dataset into a training set, where the model is fit, and a testing set, where the predictive accuracy is assessed. The results are then averaged over all iterative splits. This method has an advantage over k-fold cross-validation in that the proportion of the training/testing split does not depend on the number of iterations (folds). However, its drawback is that some observations may never be selected whereas others may be selected multiple times for the testing/validation subsample. As validation subsets may overlap, the results may vary each time we repeat the validation protocol, unless we set a random seed in the algorithm.
Asymptotically, as the number of random splits increases, the repeated random sub-sampling validation approaches leave-m-out cross-validation.
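A compact sketch of repeated random sub-sampling validation (again using mtcars and a simple linear model purely for illustration): each iteration draws a fresh 80%-20% split, and the test errors are averaged.

set.seed(11)
n <- nrow(mtcars)
B <- 50                                        # number of random splits
rmse.b <- replicate(B, {
  train <- sample(n, size = round(0.8*n))      # a fresh 80%-20% split at each iteration
  fit <- lm(mpg ~ wt + hp, data = mtcars[train, ])
  sqrt(mean((mtcars$mpg[-train] - predict(fit, mtcars[-train, ]))^2))
})
c(mean = mean(rmse.b), se = sd(rmse.b))        # averaged error and its variability across splits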
In the examples below, we have intentionally suppressed some of the R output to save space. This is accomplished using the Rmarkdown chunk option {r eval=TRUE, results='hide'}. However, the reader is encouraged to try all the protocols hands-on, make modifications, and inspect and interpret the outputs.
Adaptive Boosting (AdaBoost)
This Parkinson’s Disease Study involves heterogeneous neuroimaging, genetics, clinical, and phenotypic data for over 600 volunteers, which produced multivariate data for three cohorts (HC=Healthy Controls, PD=Parkinson’s, SWEDD=subjects without evidence for dopaminergic deficit).
Load the data: 06_PPMI_ClassificationValidationData_Short.csv.
ppmi_data <- read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
Binarize the Dx (clinical diagnoses) classes.
# binarize the Dx classes
ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
attach(ppmi_data)
head(ppmi_data)
# View(ppmi_data)
Apply model-free predictive analytics, e.g., AdaBoost classification, and report the results.
# Model-free analysis, classification
# install.packages("crossval")
# install.packages("ada")
# library("crossval")
require(crossval)
require(ada)
#set up adaboosting prediction function
# Define a new AdaBoost classification result-reporting function
my.ada <- function (train.x, train.y, test.x, test.y, negative, formula) {
  ada.fit <- ada(train.x, train.y)
  predict.y <- predict(ada.fit, test.x)
  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
  # negative is the label of a negative "null" sample (default: "control")
  return (out)
}
When group sizes are imbalanced, we may need to rebalance them to avoid potential biases toward the dominant cohorts. In this case, we will re-balance the groups using SMOTE (Synthetic Minority Oversampling Technique), which may be used to handle class imbalance in binary classification.
# balance cases
# SMOTE: Synthetic Minority Oversampling Technique to handle class misbalance in binary classification.
set.seed(1000)
# install.packages("unbalanced") to deal with unbalanced group data
require(unbalanced)
ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
uniqueID <- unique(ppmi_data$FID_IID)
ppmi_data <- ppmi_data[ppmi_data$VisitID==1, ]
ppmi_data$PD <- factor(ppmi_data$PD)

colnames(ppmi_data)
# ppmi_data.1 <- ppmi_data[, c(3:281, 284, 287, 336:340, 341)]
n <- ncol(ppmi_data)
output.1 <- ppmi_data$PD

# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
# remove Default Real Clinical subject classifications!
input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID"))]
# output <- as.matrix(ppmi_data[ , which(names(ppmi_data) %in% {"PD"})])
output <- as.factor(ppmi_data$PD)
c(dim(input), dim(output))

# balance the dataset
set.seed(123)
data.1 <- ubBalance(X=input, Y=output, type="ubSMOTE", percOver=300, percUnder=150, verbose=TRUE)
balancedData <- cbind(data.1$X, data.1$Y)
table(data.1$Y)

nrow(data.1$X); ncol(data.1$X)
nrow(balancedData); ncol(balancedData)
nrow(input); ncol(input)

colnames(balancedData) <- c(colnames(input), "PD")
Next, we’ll check the re-balanced cohort sizes.
library(plotly)
###Check balance
## T test
alpha.0.05 <- 0.05
test.results.bin <- NULL    # binarized/dichotomized p-values
test.results.raw <- NULL    # raw p-values

# get a better error-handling t.test function that gracefully handles NA's and trivial variances
my.t.test.p.value <- function(input1, input2) {
  obj <- try(t.test(input1, input2), silent=TRUE)
  if (is(obj, "try-error"))
    return(NA)
  else
    return(obj$p.value)
}

for (i in 1:ncol(balancedData))
{
  test.results.raw[i] <- my.t.test.p.value(input[, i], balancedData[, i])
  test.results.bin[i] <- ifelse(test.results.raw[i] > alpha.0.05, 1, 0)
  # binarize the p-value (0=significant, 1=otherwise)
  print(c("i=", i, "var=", colnames(balancedData[i]), "t-test_raw_p_value=", test.results.raw[i]))
}
# we can also employ (e.g., FDR, Bonferroni) correction for multiple testing!
# test.results.corr <- stats::p.adjust(test.results.raw, method = "fdr", n = length(test.results.raw))
# where methods are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
# plot(test.results.raw, test.results.corr)
# sum(test.results.raw < alpha.0.05, na.rm=T)/length(test.results.raw) #check proportion of inconsistencies
# sum(test.results.corr < alpha.0.05, na.rm =T)/length(test.results.corr)
QQ <- qqplot(input[, 5], balancedData[, 5], plot=F)   # check visually for differences between the distributions of the raw (input) and rebalanced data (for only one variable, in this case)
plot_ly(x=~QQ$x, y=~QQ$y, type="scatter", mode="markers", name="Data") %>%
add_trace(x=c(0,1), y=c(0,1), name="Ideal Distribution Match", type="scatter", mode="lines") %>%
layout(title=paste0("QQ-Plot Original vs. Rebalanced Data (Feature=",colnames(input)[5], ")"),
xaxis=list(title="Original Data"), yaxis=list(title="Rebalanced Data"),
hovermode = "x unified", legend = list(orientation='h'))
# Now, check visually for differences between the distributions of the raw (input) and rebalanced data.
# par(mar=c(1,1,1,1))
# par(mfrow=c(10,10))
# for(i in c(1:62,64:101)){ qqplot(balancedData [, i],input[, i]) } #except VisitID
# as the sample-size is changed:
length(input[, 5]); length(balancedData [, 5])
# to plot raw vs. rebalanced pairs (e.g., var="L_insular_cortex_Volume"), we need to equalize the lengths
#plot (input[, 5] +0*balancedData [, 5], balancedData [, 5]) # [, 5] == "L_insular_cortex_Volume"
# print(c("T-test results: ", test.results))
# zeros (0) are significant independent between-group T-test differences, ones (1) are insignificant
for (i in 1:(ncol(balancedData)-1))
{
  test.results.raw[i] <- wilcox.test(input[, i], balancedData[, i])$p.value
  test.results.bin[i] <- ifelse(test.results.raw[i] > alpha.0.05, 1, 0)
  print(c("i=", i, "Wilcoxon-test=", test.results.raw[i]))
}
print(c("Wilcoxon test results: ", test.results.bin))
# test.results.corr <- stats::p.adjust(test.results.raw, method = "fdr", n = length(test.results.raw))
# where methods are "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
# plot(test.results.raw, test.results.corr)
The next step will be the actual cross-validation.
# using raw data:
X <- as.data.frame(input); Y <- output
neg <- "1"   # "Control" == "1"

# using Rebalanced data:
X <- as.data.frame(data.1$X); Y <- data.1$Y
# balancedData <- cbind(data.1$X, data.1$Y); dim(balancedData)

# Side note: There is a function name collision for "crossval", the same method is present in the "mlr" (machine Learning in R) package and in the "crossval" package.
# To specify a function call from a specific package do: packagename::functionname()
set.seed(115)
cv.out <- crossval::crossval(my.ada, X, Y, K = 5, B = 1, negative = neg)
# negative is the label of a negative "null" sample (default: "control")
out <- diagnosticErrors(cv.out$stat)
print(cv.out$stat)
## FP TP TN FN
## 0.2 109.8 97.4 0.0
print(out)
## acc sens spec ppv npv lor
## 0.9990357 1.0000000 0.9979508 0.9981818 1.0000000 Inf
As we can see from the reported metrics, the overall averaged AdaBoost-based diagnostic predictions are quite good.
These data show the effect of two soporific drugs (increase in hours of sleep, compared to control) on 10 patients. The data are available by default in R (sleep {datasets}).
First, load the data and report some graphs and summaries.
data(sleep); str(sleep)
X = as.matrix(sleep[, 1, drop=FALSE])   # increase in hours of sleep,
# drop is logical, if TRUE the result is coerced to the lowest possible dimension.
# The default is to drop if only one column is left, but not to drop if only one row is left.
Y = sleep[, 2]   # drug given
# plot(X ~ Y)
plot_ly(data=sleep, x=~group, y=~extra, color=~group, type="box", name="Hours of Sleep") %>%
layout(title="Hours of Extra Sleep", legend = list(orientation='h'))
levels(Y) # "1" "2"
dim(X) # 20 1
Next, we will define a new LDA (linear discriminant analysis) predicting function and perform cross-validation (CV) on the resulting predictor.
require("MASS") # for lda function
predfun.lda = function(train.x, train.y, test.x, test.y, negative)
{
  lda.fit = lda(train.x, grouping=train.y)
  ynew = predict(lda.fit, test.x)$class
  # count TP, FP etc.
  out = confusionMatrix(test.y, ynew, negative=negative)
  return( out )
}

# install.packages("crossval")
library("crossval")
set.seed(123456)
cv.out <- crossval::crossval(predfun.lda, X, Y, K=5, B=20, negative="1", verbose=FALSE)
cv.out$stat
diagnosticErrors(cv.out$stat)
Execute the above code and interpret the diagnostic results measuring the performance of the LDA prediction.
The attitude dataset
This data represents a survey of clerical employees of a large organization, aggregated from about 35 employees in each of 30 (randomly selected) departments. The data give the proportion of favorable responses to 7 questions in each department.
Let’s load and summarize the data, which is available by default in attitude {datasets}
.
# ?attitude, colnames(attitude)
# Note: when using a data frame, a time-saver is to use "." to indicate "include all covariates" in the DF.
# E.g., fit <- lm(Y ~ ., data = D)
data("attitude")
y = attitude[, 1]    # rating variable
x = attitude[, -1]   # data frame with the remaining variables
is.factor(y)
summary( lm(y ~ . , data=x) )   # R-squared: 0.7326
# set up lm prediction function
We will demonstrate model-based analytics using lm and lda, and will validate the forecasting using CV.
predfun.lm = function(train.x, train.y, test.x, test.y)
{
  lm.fit = lm(train.y ~ . , data=train.x)
  ynew = predict(lm.fit, test.x)
  # compute squared error risk (MSE)
  out = mean( (ynew - test.y)^2 )
  # note that, in general, when fitting linear model to continuous outcome variable (Y),
  # we can't use the out<-confusionMatrix(test.y, ynew, negative=negative), as it requires a binary outcome
  # this is why we use the MSE as an estimate of the discrepancy between observed & predicted values
  return(out)
}
# require("MASS")
#predfun.lda = function(train.x, train.y, test.x, test.y, negative)
#{ lda.fit = lda(train.x, grouping=train.y)
# ynew = predict(lda.fit, test.x)$class
# count TP, FP etc.
# out = confusionMatrix(test.y, ynew, negative=negative)
#return( out )
#}
# prediction MSE using all variables
set.seed(123456)
cv.out.lm = crossval::crossval(predfun.lm, x, y, K=5, B=20, verbose=FALSE)
c(cv.out.lm$stat, cv.out.lm$stat.se)   # 72.581198  3.736784

# reducing to using only two variables
cv.out.lm = crossval::crossval(predfun.lm, x[, c(1, 3)], y, K=5, B=20, verbose=FALSE)
c(cv.out.lm$stat, cv.out.lm$stat.se)
# 52.563957 2.015109
Let’s go back to the more elaborate PD case-study (ppmi_data), starting with loading and preprocessing the derived-PPMI data.
# ppmi_data <-read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
# ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
# attach(ppmi_data); head(ppmi_data)
# install.packages("crossval")
# library("crossval")
# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
# input <- ppmi_data[ , -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID"))]
# output <- as.factor(ppmi_data$PD)
# remove the irrelevant variables (e.g., visit ID)
output <- as.factor(ppmi_data$PD)
input <- ppmi_data[, -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID", "VisitID"))]
X = as.matrix(input)    # Predictor variables
Y = as.matrix(output)   # Actual PD clinical assessment
dim(X)
dim(Y)

fit <- lm(Y~X)

# layout(matrix(c(1, 2, 3, 4), 2, 2)) # optional 4 graphs/page
# plot(fit) # plot the fit
xResid <- scale(fit$residuals)
QQ <- qqplot(xResid, rnorm(1000), plot=F)   # compare the distribution of the (scaled) model residuals to a standard normal sample
plot_ly(x=~QQ$x, y=~QQ$y, type="scatter", mode="markers", name="Data") %>%
add_trace(x=c(-4,4), y=c(-4,4), name="Ideal Distribution Match", type="scatter", mode="lines") %>%
layout(title="QQ Normal Plot of Model Residuals",
xaxis=list(title="Model Residuals"), yaxis=list(title="Normal Residuals"),
hovermode = "x unified", legend = list(orientation='h'))
levels(as.factor(Y)) # "0" "1"
## [1] "0" "1"
c(dim(X), dim(Y)) # 422 100  422 1
## [1] 422 100 422 1
Apply cross-validation to assess the performance of the linear model.
set.seed(12345)
# cv.out.lm = crossval::crossval(predfun.lm, as.data.frame(X), as.numeric(Y), K=5, B=20)
cv.out.lda = crossval::crossval(predfun.lda, X, Y, K=5, B=20, negative="1", verbose=FALSE)
# K=Number of folds; B=Number of repetitions.
# Results
#cv.out.lda$stat;
#cv.out.lda;
diagnosticErrors(cv.out.lda$stat)
## acc sens spec ppv npv lor
## 0.9606635 0.9515000 0.9831967 0.9928696 0.8918216 7.0457111
#cv.out.lm$stat;
#cv.out.lm;
#diagnosticErrors(cv.out.lm$stat)
The cross-validation (CV) output object includes the following components:
- stat.cv: the vector of statistics returned by predfun for each cross-validation run
- stat: the statistic returned by predfun, averaged over all cross-validation runs
- stat.se: the corresponding standard error (variability)
We have already seen a number of predict() functions, e.g., Chapter 17. Below, we will add to the collection of predictive analytics and forecasting functions.
We already saw the logit model in Chapter 17. Now, we will demonstrate a logit-predictor function.
# ppmi_data <-read.csv("https://umich.instructure.com/files/330400/download?download_frd=1", header=TRUE)
# ppmi_data$ResearchGroup <- ifelse(ppmi_data$ResearchGroup == "Control", "Control", "Patient")
# install.packages("crossval"); library("crossval")
# ppmi_data$PD <- ifelse(ppmi_data$ResearchGroup=="Control", 1, 0)
# remove the irrelevant variables (e.g., visit ID)
output <- as.factor(ppmi_data$PD)
input <- ppmi_data[, -which(names(ppmi_data) %in% c("ResearchGroup", "PD", "X", "FID_IID", "VisitID"))]
X = as.matrix(input)    # Predictor variables
Y = as.matrix(output)
Note that the predicted values are reported on the logit (log-odds) scale, so they need to be back-transformed (exponentiated) to be interpreted correctly.
lm.logit <- glm(as.numeric(Y) ~ ., data = as.data.frame(X), family = "binomial")
ynew <- predict(lm.logit, as.data.frame(X)); # plot(ynew)
ynew2 <- ifelse(exp(ynew)<0.5, 0, 1); # plot(ynew2)

predfun.logit = function(train.x, train.y, test.x, test.y, neg)
{
  lm.logit <- glm(train.y ~ ., data = train.x, family = "binomial")
  ynew = predict(lm.logit, test.x)
  # compute TP, FP, TN, FN
  ynew2 <- ifelse(exp(ynew)<0.5, 0, 1)
  out = confusionMatrix(test.y, ynew2, negative=neg)   # Binary outcome, we can use confusionMatrix
  return( out )
}

# Reduce the bag of explanatory variables, purely to simplify the interpretation of the analytics in this example!
input.short <- input[, which(names(input) %in% c("R_fusiform_gyrus_Volume",
    "R_fusiform_gyrus_ShapeIndex", "R_fusiform_gyrus_Curvedness",
    "Sex", "Weight", "Age", "chr12_rs34637584_GT", "chr17_rs11868035_GT",
    "UPDRS_Part_I_Summary_Score_Baseline", "UPDRS_Part_I_Summary_Score_Month_03",
    "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline",
    "UPDRS_Part_III_Summary_Score_Baseline",
    "X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline"))]
X = as.matrix(input.short)

cv.out.logit = crossval::crossval(predfun.logit, as.data.frame(X), as.numeric(Y), K=5, B=2, neg="1", verbose=FALSE)
cv.out.logit$stat.cv
## FP TP TN FN
## B1.F1 2 55 26 1
## B1.F2 0 61 21 3
## B1.F3 4 57 21 2
## B1.F4 2 58 22 3
## B1.F5 0 55 24 5
## B2.F1 1 53 29 2
## B2.F2 0 63 17 4
## B2.F3 1 62 21 1
## B2.F4 3 50 29 2
## B2.F5 3 58 18 5
diagnosticErrors(cv.out.logit$stat)
## acc sens spec ppv npv lor
## 0.9478673 0.9533333 0.9344262 0.9727891 0.8906250 5.6736914
Caution: Note that if you forget to back-transform (exponentiate) the predicted logistic model values (see ynew2 in predfun.logit), you will get nonsensical results, e.g., all cases may be predicted to be in one class, with trivial sensitivity or NPV.
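For reference, here is a minimal synthetic sketch of the relationship between the link (log-odds) scale and the probability scale of a binomial glm; plogis() is base R's inverse-logit, and type="response" returns the probabilities directly (this is a general illustration, not part of the PPMI analysis).

set.seed(7)
x <- rnorm(100); y <- rbinom(100, 1, plogis(2*x))   # synthetic binary outcome
fit <- glm(y ~ x, family = "binomial")
eta <- predict(fit)                       # default: linear predictor, i.e., log-odds
p1  <- plogis(eta)                        # back-transform to probabilities (inverse logit)
p2  <- predict(fit, type = "response")    # equivalent probabilities, computed directly
all.equal(as.numeric(p1), as.numeric(p2))
table(observed = y, predicted = ifelse(p2 < 0.5, 0, 1))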
In Chapters 7 and 20 we discussed the linear and quadratic discriminant analysis models. Let’s now introduce a predfun.qda() function.
predfun.qda = function(train.x, train.y, test.x, test.y, negative)
{
  require("MASS")   # for the qda() function
  qda.fit = qda(train.x, grouping=train.y)
  ynew = predict(qda.fit, test.x)$class
  out.qda = confusionMatrix(test.y, ynew, negative=negative)
  return( out.qda )
}

cv.out.qda = crossval::crossval(predfun.qda, as.data.frame(input.short), as.factor(Y), K=5, B=20, neg="1")
## Error in qda.default(x, grouping, ...): rank deficiency in group 1
diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat);
## Error in diagnosticErrors(cv.out.qda$stat): object 'cv.out.qda' not found
This error message: “Error in qda.default(x, grouping, …) : rank deficiency in group 1” indicates that there is a rank deficiency, i.e. some variables are collinear and one or more covariance matrices cannot be inverted to obtain the estimates in group 1 (Controls)!
If you remove the strongly correlated data elements (“R_fusiform_gyrus_Volume”, “R_fusiform_gyrus_ShapeIndex”, and “R_fusiform_gyrus_Curvedness”), the rank-deficiency problem goes away!
input.short2 <- input[, which(names(input) %in% c("R_fusiform_gyrus_Volume", "Sex", "Weight", "Age", "chr17_rs11868035_GT",
    "UPDRS_Part_I_Summary_Score_Baseline",
    "UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline",
    "UPDRS_Part_III_Summary_Score_Baseline",
    "X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline"))]
X = as.matrix(input.short2)
cv.out.qda = crossval::crossval(predfun.qda, as.data.frame(X), as.numeric(Y), K=5, B=2, neg="1")
It makes sense to contrast the QDA and GLM/Logit predictions.
diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.logit$stat)
## acc sens spec ppv npv lor
## 0.9395735 0.9550000 0.9016393 0.9597990 0.8906883 5.2706226
## acc sens spec ppv npv lor
## 0.9478673 0.9533333 0.9344262 0.9727891 0.8906250 5.6736914
Clearly, both the QDA and Logit model predictions are quite similar and reliable.
Previously, in Chapter 7 we saw some examples of LDA/QDA methods. Now, we’ll talk more about the details. Both LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) use probabilistic models of the class-conditional distribution of the data \(P(X \mid Y=k)\) for each class \(k\). Their predictions are obtained by using Bayes' theorem, see Appendix:
\[P(Y=k \mid X)= \frac{P(X \mid Y=k)P(Y=k)}{P(X)}=\frac{P(X \mid Y=k)P(Y=k)}{\sum_{l}P(X \mid Y=l)P(Y=l)},\] and we select the class \(k\) that maximizes this conditional probability (maximum likelihood estimation). In linear and quadratic discriminant analysis, \(P(X \mid Y)\) is modeled as a multivariate Gaussian distribution with density:
\[P(X \mid Y=k)= \frac{1}{{(2 \pi)}^{\frac{n}{2}} {\vert{\Sigma_{k}}\vert}^{\frac{1}{2}}}\times e^{(-\frac{1}{2}{(X- \mu_k)}^T \Sigma_k^{-1}(X- \mu_k))}.\]
This model can be used to classify data by using the training data to estimate the class prior probabilities \(P(Y=k)\), the class means \(\mu_k\), and the covariance matrices \(\Sigma_k\).
In the linear case (LDA), the Gaussians for each class are assumed to share the same covariance matrix:
\(\Sigma_k=\Sigma\) for each class \(k\). This leads to linear decision surfaces between classes. This is clear from comparing the log-probability ratios of 2 classes (\(k\) and \(l\)):
\(LOR=\log\left (\frac{P(Y=k \mid X)}{P(Y=l \mid X)}\right )\), where \(LOR=0\) \(\Longleftrightarrow\) the two class probabilities are identical (same class). Assuming equal class priors, \(LOR=0 \Longleftrightarrow {(\mu_k-\mu_l)}^T \Sigma^{-1} X= \frac{1}{2}\left(\mu_k^T \Sigma^{-1} \mu_k-\mu_l^T \Sigma^{-1} \mu_l\right)\), which is a linear equation in \(X\).
But, in the more general, quadratic case of QDA, there are no assumptions on the covariance matrices \(\Sigma_k\) of the Gaussians, leading to quadratic decision surfaces.
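Taking the logarithm of the Gaussian class-conditional density above (together with Bayes' theorem) gives, up to an additive constant, the quadratic discriminant score for each class \(k\), \[ \delta_k(x)= -\frac{1}{2}\log \vert \Sigma_k \vert -\frac{1}{2}{(x-\mu_k)}^T \Sigma_k^{-1}(x-\mu_k)+\log P(Y=k), \] and the observation \(x\) is assigned to the class with the largest score. When all classes share a common covariance matrix (\(\Sigma_k=\Sigma\)), the term \(x^T\Sigma^{-1}x\) is the same for every class and cancels from the comparison, so the discriminant becomes linear in \(x\), which is exactly the LDA case above.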
LDA is similar to GLM (e.g., ANOVA and regression analyses), as it also attempts to express one dependent variable as a linear combination of other features or data elements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas LDA has continuous independent variables and a categorical dependent variable (i.e., Dx/class label). Logistic regression and probit regression are more similar to LDA than ANOVA, as they also explain a categorical variable by the values of continuous independent variables.
predfun.lda = function(train.x, train.y, test.x, test.y, neg)
{
  require("MASS")
  lda.fit = lda(train.x, grouping=train.y)
  ynew = predict(lda.fit, test.x)$class
  out.lda = confusionMatrix(test.y, ynew, negative=neg)
  return( out.lda )
}
Similarly to LDA, the QDA prediction function can be defined by:
predfun.qda = function(train.x, train.y, test.x, test.y, neg)
{
  require("MASS")   # for the qda() function
  qda.fit = qda(train.x, grouping=train.y)
  ynew = predict(qda.fit, test.x)$class
  out.qda = confusionMatrix(test.y, ynew, negative=neg)
  return( out.qda )
}
We saw Neural Networks (NNs) in Chapter 10. Applying NNs is not straightforward. We have to create a design matrix with an indicator column for the response feature. In addition, we need to write a predict function to translate the output of neuralnet() into analytical forecasts.
# predict nn
library("neuralnet")
pred = function(nn, dat) {
  yhat = compute(nn, dat)$net.result
  yhat = apply(yhat, 1, which.max)-1
  return(yhat)
}

my.neural <- function (train.x, train.y, test.x, test.y, method, layer=c(5,5)) {
  train.x <- as.data.frame(train.x)
  train.y <- as.data.frame(train.y)
  colnames(train.x) <- paste0('V', 1:ncol(X))
  colnames(train.y) <- "V1"
  train_y_ind = model.matrix(~factor(train.y$V1)-1)
  colnames(train_y_ind) = paste0('out', 0:1)
  train = cbind(train.x, train_y_ind)
  y_names = paste0('out', 0:1)
  x_names = paste0('V', 1:ncol(train.x))
  nn = neuralnet(
    paste(paste(y_names, collapse='+'),
          '~',
          paste(x_names, collapse='+')),
    train, hidden=layer,
    linear.output=FALSE,
    lifesign='full', lifesign.step=1000)
  # predict
  predict.y <- pred(nn, test.x)
  out <- crossval::confusionMatrix(test.y, predict.y, negative = 0)
  return (out)
}

set.seed(1234)
cv.out.nn <- crossval::crossval(my.neural, scale(X), Y, K = 5, B = 1, layer=c(20,20), verbose = F)   # scaling the predictors is necessary
crossval::diagnosticErrors(cv.out.nn$stat)
## acc sens spec ppv npv lor
## 0.9407583 0.9016393 0.9566667 0.8943089 0.9598662 5.3101066
We also saw SVMs in Chapter 10. Let's try cross-validation with linear and Gaussian (radial) kernel SVMs. We can expect the linear-kernel SVM to achieve results close to, or even better than, the Gaussian kernel, since this dataset has a large number of features (k) relative to the number of cases (n), which we studied in detail in Chapter 10.
library("e1071")
my.svm <- function (train.x, train.y, test.x, test.y, method, cost=1, gamma=1/ncol(dx_norm), coef0=0, degree=3) {
  svm_l.fit <- svm(x = train.x, y=as.factor(train.y), kernel = method)
  predict.y <- predict(svm_l.fit, test.x)
  out <- crossval::confusionMatrix(test.y, predict.y, negative = 0)
  return (out)
}

# Linear kernel
set.seed(123)
cv.out.svml <- crossval::crossval(my.svm, as.data.frame(X), Y, K = 5, B = 1, method = "linear", cost=tune_svm$best.parameters$cost, verbose = F)
diagnosticErrors(cv.out.svml$stat)
## acc sens spec ppv npv lor
## 0.9573460 0.9344262 0.9666667 0.9193548 0.9731544 6.0240527
# Gaussian kernel
set.seed(123)
cv.out.svmg <- crossval::crossval(my.svm, as.data.frame(X), Y, K = 5, B = 1, method = "radial", cost=tune_svmg$best.parameters$cost, gamma=tune_svmg$best.parameters$gamma, verbose = F)
diagnosticErrors(cv.out.svmg$stat)
## acc sens spec ppv npv lor
## 0.9478673 0.9262295 0.9566667 0.8968254 0.9695946 5.6246961
Indeed, both types of kernels yield good quality predictors according to the assessment metrics reported by the diagnosticErrors() method.
As we saw in Chapter 6, k-NN is a non-parametric method that can be used for either classification or regression. The input consists of the k closest training examples in the feature space, but the output depends on whether k-NN is used for classification or regression: for classification, the output is a class label determined by a majority vote among the k nearest neighbors; for regression, the output is the average of the outcome values of the k nearest neighbors.
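A minimal sketch on the built-in iris data illustrates the two modes (the regression-style averaging below is coded by hand purely for illustration; knn.reg.one is a hypothetical helper, not a library function).

library(class)
set.seed(3)
idx <- sample(nrow(iris), 100)                     # 100 training cases, 50 testing cases
# classification: majority vote among the k=5 nearest neighbors
pred <- knn(train = iris[idx, 3:4], test = iris[-idx, 3:4], cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])                   # classification accuracy

# regression-style k-NN: average the outcomes of the k nearest training cases
knn.reg.one <- function(x0, X, y, k = 5) {
  d <- sqrt(colSums((t(X) - x0)^2))                # Euclidean distances to all training cases
  mean(y[order(d)[1:k]])                           # average outcome of the k closest neighbors
}
knn.reg.one(as.numeric(iris[1, 3:4]), as.matrix(iris[idx, 3:4]), iris$Sepal.Length[idx])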
Let’s now build the corresponding predfun.knn() method.
# X = as.matrix(input) # Predictor variables X = as.matrix(input.short2)
# Y = as.matrix(output) # Outcome
# KNN (k-nearest neighbors)
library("class")
# knn.fit.test <- knn(X, X, cl = Y, k=3, prob=F); predict(as.matrix(knn.fit.test), X)$class
# table(knn.fit.test, Y); confusionMatrix(Y, knn.fit.test, negative="1")
# This can be used for polytomous variable (multiple classes)
predfun.knn = function(train.x, train.y, test.x, test.y, neg)
{
  require("class")
  knn.fit = knn(train.x, test.x, cl = train.y, prob=T)   # knn is already a prediction function!!!
  # ynew = predict(knn.fit, test.x)$class   # no need of another prediction, in this case
  out.knn = confusionMatrix(test.y, knn.fit, negative=neg)
  return( out.knn )
}
cv.out.knn = crossval::crossval(predfun.knn, X, Y, K=5, B=2, neg="1")

# Compare the four classifiers (lda, qda, knn, and logit)
diagnosticErrors(cv.out.lda$stat); diagnosticErrors(cv.out.qda$stat); diagnosticErrors(cv.out.knn$stat); diagnosticErrors(cv.out.logit$stat)
We can also examine the performance of k-NN prediction on the PPMI (Parkinson’s disease) data. Start by partitioning the data into training and testing sets.
# TRAINING: 75% of the sample size
sample_size <- floor(0.75 * nrow(input))
## set the seed to make your partition reproducible
set.seed(1234)
input.train.ind <- sample(seq_len(nrow(input)), size = sample_size)
input.train <- input[input.train.ind, ]
output.train <- as.matrix(output)[input.train.ind, ]

# TESTING DATA
input.test <- input[-input.train.ind, ]
output.test <- as.matrix(output)[-input.train.ind, ]
Then fit the k-NN model and report the results.
library("class")
knn_model <- knn(train=input.train, input.test, cl=as.factor(output.train), k=2)
# plot(knn_model)
summary(knn_model)
attributes(knn_model)

# cross-validation
knn_model.cv <- knn.cv(train=input.train, cl=as.factor(output.train), k=2)
summary(knn_model.cv)
In Chapter 12, we showed that k-MC aims to partition \(n\) observations into \(k\) clusters, where each observation belongs to the cluster with the nearest mean, which acts as a prototype of the cluster. The k-MC algorithm partitions the data space into Voronoi cells. In general, there is no computationally tractable exact solution, i.e., the problem is NP-hard. However, there are efficient algorithms that converge quickly to local optima, e.g., the expectation-maximization algorithm for mixtures of Gaussian distributions, which uses an iterative refinement approach.
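Formally, for cluster assignments \(C_1, \cdots, C_k\) with cluster means \(\mu_1, \cdots, \mu_k\), k-MC minimizes the total within-cluster sum of squares \[ \arg\min_{C_1, \cdots, C_k} \sum_{i=1}^{k} \sum_{x\in C_i} {\Vert x-\mu_i \Vert}^2. \] Each iteration of the standard (Lloyd) algorithm alternates between assigning every observation to its nearest current cluster mean and recomputing the cluster means, which decreases this objective until it converges to a local optimum.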
kmeans_model <- kmeans(input.train, 2)
table(output.train, kmeans_model$cluster-1)
##
## output.train 0 1
## 0 173 50
## 1 65 28
layout(matrix(1, 1))
# tiff("C:/Users/Dinov/Desktop/test.tiff", width = 10, height = 10, units = 'in', res = 300)
fpc::plotcluster(input.train, output.train, col = kmeans_model$cluster)
legend("topright", legend=c("0(trueLabel)+0(kmeansLabel)", "1(trueLabel)+0(kmeansLabel)", "0(trueLabel)+1(kmeansLabel)", "1(trueLabel)+1(kmeansLabel)"),
col=c("red", "black", "black", "red"),
text.col=c("red", "black", "black", "red"),
pch = 15 , y.intersp=0.6, x.intersp=0.7,
text.width = 5, cex=1)
# plot_ly(x=~input.train, y=~output.train, color = ~as.factor(kmeans_model$cluster), type="scatter", mode="marker")
input.train <- input.train[ , apply(input.train, 2, var, na.rm=TRUE) != 0]   # remove constant columns

cluster::clusplot(input.train, kmeans_model$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# par(mfrow=c(10,10))
# # the next figure is very large and will not render in RStudio, you may need to save it as PDF file!
# # pdf("C:/Users/Dinov/Desktop/test.pdf", width = 50, height = 50)
# # with(ppmi_data[,1:10], pairs(input.train[,1:10], col=c(1:2)[kmeans_model$cluster]))
# # dev.off()
# with(ppmi_data[,1:10], pairs(input.train[,1:10], col=c(1:2)[kmeans_model$cluster]))
# Convert the first 10 features to factors and construct a SPLOM plot
df2 <- input.train[, 1:10]
col_names <- names(df2)
dims <- as.data.frame(lapply(df2[col_names], factor))

dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL),
        showupperhalf = FALSE, diagonal = list(visible = FALSE))
# plot(input.train, col = kmeans_model$cluster)
# points(kmeans_model$centers, col = 1:2, pch = 8, cex = 2)
## cluster centers "fitted" to each obs.:
fitted.kmeans <- fitted(kmeans_model); head(fitted.kmeans)
## L_insular_cortex_AvgMeanCurvature L_insular_cortex_ComputeArea
## 1 0.1076258 2626.614
## 2 0.2281071 1094.980
## 1 0.1076258 2626.614
## 1 0.1076258 2626.614
## 1 0.1076258 2626.614
## 2 0.2281071 1094.980
## L_insular_cortex_Volume L_insular_cortex_ShapeIndex
## 1 7959.068 0.3260816
## 2 1999.192 0.2773389
## 1 7959.068 0.3260816
## 1 7959.068 0.3260816
## 1 7959.068 0.3260816
## 2 1999.192 0.2773389
## L_insular_cortex_Curvedness R_insular_cortex_AvgMeanCurvature
## 1 0.1069922 0.1217837
## 2 0.2723533 0.2988653
## 1 0.1069922 0.1217837
## 1 0.1069922 0.1217837
## 1 0.1069922 0.1217837
## 2 0.2723533 0.2988653
## R_insular_cortex_ComputeArea R_insular_cortex_Volume
## 1 2019.1489 4927.957
## 2 770.1955 1113.124
## 1 2019.1489 4927.957
## 1 2019.1489 4927.957
## 1 2019.1489 4927.957
## 2 770.1955 1113.124
## R_insular_cortex_ShapeIndex R_insular_cortex_Curvedness
## 1 0.3032212 0.1278277
## 2 0.2672094 0.3530532
## 1 0.3032212 0.1278277
## 1 0.3032212 0.1278277
## 1 0.3032212 0.1278277
## 2 0.2672094 0.3530532
## L_cingulate_gyrus_AvgMeanCurvature L_cingulate_gyrus_ComputeArea
## 1 0.1325229 3967.805
## 2 0.2880964 1311.079
## 1 0.1325229 3967.805
## 1 0.1325229 3967.805
## 1 0.1325229 3967.805
## 2 0.2880964 1311.079
## L_cingulate_gyrus_Volume L_cingulate_gyrus_ShapeIndex
## 1 10015.619 0.4463758
## 2 1835.168 0.3741087
## 1 10015.619 0.4463758
## 1 10015.619 0.4463758
## 1 10015.619 0.4463758
## 2 1835.168 0.3741087
## L_cingulate_gyrus_Curvedness R_cingulate_gyrus_AvgMeanCurvature
## 1 0.07632257 0.1434376
## 2 0.22731432 0.2840280
## 1 0.07632257 0.1434376
## 1 0.07632257 0.1434376
## 1 0.07632257 0.1434376
## 2 0.22731432 0.2840280
## R_cingulate_gyrus_ComputeArea R_cingulate_gyrus_Volume
## 1 3953.795 9964.057
## 2 1177.863 1710.276
## 1 3953.795 9964.057
## 1 3953.795 9964.057
## 1 3953.795 9964.057
## 2 1177.863 1710.276
## R_cingulate_gyrus_ShapeIndex R_cingulate_gyrus_Curvedness
## 1 0.4379627 0.09675352
## 2 0.3367350 0.27422318
## 1 0.4379627 0.09675352
## 1 0.4379627 0.09675352
## 1 0.4379627 0.09675352
## 2 0.3367350 0.27422318
## L_caudate_AvgMeanCurvature L_caudate_ComputeArea L_caudate_Volume
## 1 0.2601677 797.0776 1233.5135
## 2 0.8551616 146.6532 102.8474
## 1 0.2601677 797.0776 1233.5135
## 1 0.2601677 797.0776 1233.5135
## 1 0.2601677 797.0776 1233.5135
## 2 0.8551616 146.6532 102.8474
## L_caudate_ShapeIndex L_caudate_Curvedness R_caudate_AvgMeanCurvature
## 1 0.2914226 0.2796301 0.1894883
## 2 0.3595532 0.6930212 0.7598691
## 1 0.2914226 0.2796301 0.1894883
## 1 0.2914226 0.2796301 0.1894883
## 1 0.2914226 0.2796301 0.1894883
## 2 0.3595532 0.6930212 0.7598691
## R_caudate_ComputeArea R_caudate_Volume R_caudate_ShapeIndex
## 1 1089.4075 1927.4092 0.3055045
## 2 200.8635 175.1637 0.3521822
## 1 1089.4075 1927.4092 0.3055045
## 1 1089.4075 1927.4092 0.3055045
## 1 1089.4075 1927.4092 0.3055045
## 2 200.8635 175.1637 0.3521822
## R_caudate_Curvedness L_putamen_AvgMeanCurvature L_putamen_ComputeArea
## 1 0.1863632 0.1704558 1108.874
## 2 0.6260534 0.5752676 369.652
## 1 0.1863632 0.1704558 1108.874
## 1 0.1863632 0.1704558 1108.874
## 1 0.1863632 0.1704558 1108.874
## 2 0.6260534 0.5752676 369.652
## L_putamen_Volume L_putamen_ShapeIndex L_putamen_Curvedness
## 1 2234.7919 0.2827044 0.1959905
## 2 387.9603 0.2993340 0.5815818
## 1 2234.7919 0.2827044 0.1959905
## 1 2234.7919 0.2827044 0.1959905
## 1 2234.7919 0.2827044 0.1959905
## 2 387.9603 0.2993340 0.5815818
## R_putamen_AvgMeanCurvature R_putamen_ComputeArea R_putamen_Volume
## 1 0.1282211 1538.4491 3700.6326
## 2 0.3955008 539.4418 718.2451
## 1 0.1282211 1538.4491 3700.6326
## 1 0.1282211 1538.4491 3700.6326
## 1 0.1282211 1538.4491 3700.6326
## 2 0.3955008 539.4418 718.2451
## R_putamen_ShapeIndex R_putamen_Curvedness L_hippocampus_AvgMeanCurvature
## 1 0.3193293 0.1344744 0.1305658
## 2 0.2755695 0.4745640 0.2868015
## 1 0.3193293 0.1344744 0.1305658
## 1 0.3193293 0.1344744 0.1305658
## 1 0.3193293 0.1344744 0.1305658
## 2 0.2755695 0.4745640 0.2868015
## L_hippocampus_ComputeArea L_hippocampus_Volume L_hippocampus_ShapeIndex
## 1 1573.5764 3817.140 0.314922
## 2 731.5145 1180.315 0.253671
## 1 1573.5764 3817.140 0.314922
## 1 1573.5764 3817.140 0.314922
## 1 1573.5764 3817.140 0.314922
## 2 731.5145 1180.315 0.253671
## L_hippocampus_Curvedness R_hippocampus_AvgMeanCurvature
## 1 0.1309192 0.1242315
## 2 0.3950394 0.2596902
## 1 0.1309192 0.1242315
## 1 0.1309192 0.1242315
## 1 0.1309192 0.1242315
## 2 0.3950394 0.2596902
## R_hippocampus_ComputeArea R_hippocampus_Volume R_hippocampus_ShapeIndex
## 1 1692.6730 4256.232 0.3130925
## 2 849.2564 1488.660 0.2615934
## 1 1692.6730 4256.232 0.3130925
## 1 1692.6730 4256.232 0.3130925
## 1 1692.6730 4256.232 0.3130925
## 2 849.2564 1488.660 0.2615934
## R_hippocampus_Curvedness L_fusiform_gyrus_AvgMeanCurvature
## 1 0.1283778 0.1135699
## 2 0.3563544 0.1570402
## 1 0.1283778 0.1135699
## 1 0.1283778 0.1135699
## 1 0.1283778 0.1135699
## 2 0.3563544 0.1570402
## L_fusiform_gyrus_ComputeArea L_fusiform_gyrus_Volume
## 1 3681.950 11910.490
## 2 2882.248 6772.765
## 1 3681.950 11910.490
## 1 3681.950 11910.490
## 1 3681.950 11910.490
## 2 2882.248 6772.765
## L_fusiform_gyrus_ShapeIndex L_fusiform_gyrus_Curvedness
## 1 0.3041462 0.0985821
## 2 0.3158982 0.1423833
## 1 0.3041462 0.0985821
## 1 0.3041462 0.0985821
## 1 0.3041462 0.0985821
## 2 0.3158982 0.1423833
## R_fusiform_gyrus_AvgMeanCurvature R_fusiform_gyrus_ComputeArea
## 1 0.1169820 3535.754
## 2 0.1529637 2724.399
## 1 0.1169820 3535.754
## 1 0.1169820 3535.754
## 1 0.1169820 3535.754
## 2 0.1529637 2724.399
## R_fusiform_gyrus_Volume R_fusiform_gyrus_ShapeIndex
## 1 11462.851 0.3318113
## 2 6923.456 0.3222729
## 1 11462.851 0.3318113
## 1 11462.851 0.3318113
## 1 11462.851 0.3318113
## 2 6923.456 0.3222729
## R_fusiform_gyrus_Curvedness Sex Weight Age chr12_rs34637584_GT
## 1 0.09852701 1.394958 81.01218 60.68045 0
## 2 0.15437168 1.269231 85.65256 62.48391 0
## 1 0.09852701 1.394958 81.01218 60.68045 0
## 1 0.09852701 1.394958 81.01218 60.68045 0
## 1 0.09852701 1.394958 81.01218 60.68045 0
## 2 0.15437168 1.269231 85.65256 62.48391 0
## chr17_rs11868035_GT chr17_rs11012_GT chr17_rs393152_GT chr17_rs12185268_GT
## 1 0.6260504 0.3193277 0.4033613 0.3697479
## 2 0.5384615 0.4102564 0.5128205 0.4871795
## 1 0.6260504 0.3193277 0.4033613 0.3697479
## 1 0.6260504 0.3193277 0.4033613 0.3697479
## 1 0.6260504 0.3193277 0.4033613 0.3697479
## 2 0.5384615 0.4102564 0.5128205 0.4871795
## chr17_rs199533_GT chr17_rs199533_DP chr17_rs199533_GQ
## 1 0.3529412 36.99580 72.88235
## 2 0.4743590 45.38462 78.16667
## 1 0.3529412 36.99580 72.88235
## 1 0.3529412 36.99580 72.88235
## 1 0.3529412 36.99580 72.88235
## 2 0.4743590 45.38462 78.16667
## UPDRS_Part_I_Summary_Score_Baseline UPDRS_Part_I_Summary_Score_Month_03
## 1 1.013571 1.508369
## 2 0.974359 1.690367
## 1 1.013571 1.508369
## 1 1.013571 1.508369
## 1 1.013571 1.508369
## 2 0.974359 1.690367
## UPDRS_Part_I_Summary_Score_Month_06 UPDRS_Part_I_Summary_Score_Month_09
## 1 1.272435 1.270631
## 2 1.426228 1.539340
## 1 1.272435 1.270631
## 1 1.272435 1.270631
## 1 1.272435 1.270631
## 2 1.426228 1.539340
## UPDRS_Part_I_Summary_Score_Month_12 UPDRS_Part_I_Summary_Score_Month_18
## 1 1.085737 1.495304
## 2 1.177899 1.691696
## 1 1.085737 1.495304
## 1 1.085737 1.495304
## 1 1.085737 1.495304
## 2 1.177899 1.691696
## UPDRS_Part_I_Summary_Score_Month_24
## 1 1.540068
## 2 1.605211
## 1 1.540068
## 1 1.540068
## 1 1.540068
## 2 1.605211
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline
## 1 4.425786
## 2 4.012821
## 1 4.425786
## 1 4.425786
## 1 4.425786
## 2 4.012821
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_03
## 1 6.952695
## 2 7.073448
## 1 6.952695
## 1 6.952695
## 1 6.952695
## 2 7.073448
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_06
## 1 6.498689
## 2 6.842894
## 1 6.498689
## 1 6.498689
## 1 6.498689
## 2 6.842894
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_09
## 1 6.944167
## 2 7.604399
## 1 6.944167
## 1 6.944167
## 1 6.944167
## 2 7.604399
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_12
## 1 5.072632
## 2 5.114479
## 1 5.072632
## 1 5.072632
## 1 5.072632
## 2 5.114479
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_18
## 1 7.426062
## 2 8.043127
## 1 7.426062
## 1 7.426062
## 1 7.426062
## 2 8.043127
## UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Month_24
## 1 6.259172
## 2 5.976416
## 1 6.259172
## 1 6.259172
## 1 6.259172
## 2 5.976416
## UPDRS_Part_III_Summary_Score_Baseline UPDRS_Part_III_Summary_Score_Month_03
## 1 15.80387 22.16758
## 2 12.55128 20.97394
## 1 15.80387 22.16758
## 1 15.80387 22.16758
## 1 15.80387 22.16758
## 2 12.55128 20.97394
## UPDRS_Part_III_Summary_Score_Month_06 UPDRS_Part_III_Summary_Score_Month_09
## 1 22.87287 22.72272
## 2 20.37245 19.97371
## 1 22.87287 22.72272
## 1 22.87287 22.72272
## 1 22.87287 22.72272
## 2 20.37245 19.97371
## UPDRS_Part_III_Summary_Score_Month_12 UPDRS_Part_III_Summary_Score_Month_18
## 1 17.35737 23.75503
## 2 13.38860 21.69028
## 1 17.35737 23.75503
## 1 17.35737 23.75503
## 1 17.35737 23.75503
## 2 13.38860 21.69028
## UPDRS_Part_III_Summary_Score_Month_24
## 1 19.73472
## 2 16.22692
## 1 19.73472
## 1 19.73472
## 1 19.73472
## 2 16.22692
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Baseline
## 1 5.852568
## 2 6.334990
## 1 5.852568
## 1 5.852568
## 1 5.852568
## 2 6.334990
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_06
## 1 5.537727
## 2 6.043413
## 1 5.537727
## 1 5.537727
## 1 5.537727
## 2 6.043413
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_12
## 1 5.970469
## 2 6.585084
## 1 5.970469
## 1 5.970469
## 1 5.970469
## 2 6.585084
## X_Assessment_Non.Motor_Epworth_Sleepiness_Scale_Summary_Score_Month_24
## 1 6.351765
## 2 7.175677
## 1 6.351765
## 1 6.351765
## 1 6.351765
## 2 7.175677
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Baseline
## 1 5.231092
## 2 5.384615
## 1 5.231092
## 1 5.231092
## 1 5.231092
## 2 5.384615
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_06
## 1 5.023530
## 2 5.032027
## 1 5.023530
## 1 5.023530
## 1 5.023530
## 2 5.032027
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_12
## 1 5.200913
## 2 5.340521
## 1 5.200913
## 1 5.200913
## 1 5.200913
## 2 5.340521
## X_Assessment_Non.Motor_Geriatric_Depression_Scale_GDS_Short_Summary_Score_Month_24
## 1 5.203064
## 2 5.289329
## 1 5.203064
## 1 5.203064
## 1 5.203064
## 2 5.289329
resid.kmeans <- (input.train - fitted(kmeans_model))

# define the sum of squares function
ss <- function(data) sum(scale(data, scale = FALSE)^2)

## Equalities
cbind(kmeans_model[c("betweenss", "tot.withinss", "totss")],   # the same two columns
      c(ss(fitted.kmeans), ss(resid.kmeans), ss(input.train)))
## [,1] [,2]
## betweenss 16891584021 16891584021
## tot.withinss 12772530290 12772538936
## totss 29664114311 29664114311
# validation
stopifnot(all.equal(kmeans_model$totss, ss(input.train)),
all.equal(kmeans_model$tot.withinss, ss(resid.kmeans)),
## these three are the same:
all.equal(kmeans_model$betweenss, ss(fitted.kmeans)),
all.equal(kmeans_model$betweenss, kmeans_model$totss - kmeans_model$tot.withinss),
## and hence also
all.equal(ss(input.train), ss(fitted.kmeans) + ss(resid.kmeans))
)
## Error: kmeans_model$tot.withinss and ss(resid.kmeans) are not equal:
## Mean relative difference: 6.768987e-07
# kmeans(input.train, 1)$withinss
# trivial one-cluster, (its W.SS == ss(input.train))
clust_kmeans2 = kmeans(scale(X), centers=X[1:2,], iter.max=100, algorithm='Lloyd')
We get an empty cluster instead of two clusters when we select two raw data points as the initial centers. One way to solve this problem is to use k-means++ initialization.
# k++ initialize
kpp_init = function(dat, K) {
  x = as.matrix(dat)
  n = nrow(x)
  # Randomly choose a first center
  centers = matrix(NA, nrow=K, ncol=ncol(x))
  centers[1,] = as.matrix(x[sample(1:n, 1),])
  for (k in 2:K) {
    # Calculate dist^2 to closest center for each point
    dists = matrix(NA, nrow=n, ncol=k-1)
    for (j in 1:(k-1)) {
      temp = sweep(x, 2, centers[j,], '-')
      dists[,j] = rowSums(temp^2)
    }
    dists = rowMeans(dists)
    # Draw next center with probability proportional to dist^2
    cumdists = cumsum(dists)
    prop = runif(1, min=0, max=cumdists[n])
    centers[k,] = as.matrix(x[min(which(cumdists > prop)),])
  }
  return(centers)
}
clust_kmeans2_plus = kmeans(scale(X), kpp_init(scale(X), 2), iter.max=100, algorithm='Lloyd')
Now let’s evaluate the model. The first step is to justify the selection of k=2. We use the silhouette() function in the cluster package. Recall from Chapter 13 that the silhouette value is between -1 and 1, and negative values indicate "mis-clustered" cases.
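For reference, the silhouette value of case \(i\) is \[ s(i)=\frac{b(i)-a(i)}{\max\{ a(i), b(i)\}}, \] where \(a(i)\) is the average distance from case \(i\) to the other members of its own cluster and \(b(i)\) is the smallest average distance from \(i\) to the members of any other cluster; values near \(1\) indicate well-clustered cases and negative values flag likely mis-clustered cases.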
clust_k2 = clust_kmeans2_plus$cluster
require(cluster)
## Loading required package: cluster
# X = as.matrix(input.short2)
# as the data is too large for the silhouette plot, we'll just subsample and plot 100 random cases
subset_int <- sample(nrow(X), 100)   # 100 cases from 422 total cases
dis = dist(as.data.frame(scale(X[subset_int, ])))
sil_k2 = silhouette(clust_k2[subset_int], dis)   # best
plot(sil_k2, col=c(1:length(clust_kmeans2_plus$size)))
summary(sil_k2)
## Silhouette of 100 units in 2 clusters from silhouette.default(x = clust_k2[subset_int], dist = dis) :
## Cluster sizes and average silhouette widths:
## 56 44
## 0.23255908 0.03090563
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.14149 0.05501 0.14890 0.14383 0.25305 0.34628
mean(sil_k2<0)
## [1] 0.04666667
The result is pretty good: only a very small number of cases are "mis-clustered" (i.e., have negative silhouette values). Further, you can observe that with `k=3` or `k=4`, the overall average silhouette width is smaller.
dis = dist(as.data.frame(scale(X)))
clust_kmeans3_plus = kmeans(scale(X), kpp_init(scale(X), 3), iter.max=100, algorithm='Lloyd')
summary(silhouette(clust_kmeans3_plus$cluster, dis))
## Silhouette of 422 units in 3 clusters from silhouette.default(x = clust_kmeans3_plus$cluster, dist = dis) :
## Cluster sizes and average silhouette widths:
## 176 59 187
## 0.10389846 0.06762243 0.13880747
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.06382 0.05580 0.11102 0.11430 0.16941 0.29441
clust_kmeans4_plus = kmeans(scale(X), kpp_init(scale(X), 4), iter.max=100, algorithm='Lloyd')
summary(silhouette(clust_kmeans4_plus$cluster, dis))
## Silhouette of 422 units in 4 clusters from silhouette.default(x = clust_kmeans4_plus$cluster, dist = dis) :
## Cluster sizes and average silhouette widths:
## 78 116 117 111
## 0.04558664 0.17648759 0.06998387 0.19086006
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.12123 0.05065 0.14215 0.12654 0.20929 0.32875
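More generally, we can sweep over several values of \(k\) and compare the average silhouette widths directly. The sketch below assumes `X`, `dis`, and `kpp_init()` as defined above (output not shown); larger average widths support the choice \(k=2\).
# average silhouette width for k = 2, ..., 6 (larger is better)
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(scale(X), kpp_init(scale(X), k), iter.max = 100, algorithm = 'Lloyd')
  mean(silhouette(cl$cluster, dis)[, "sil_width"])
})
names(avg_sil) <- paste0("k=", 2:6)
avg_sil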
Then, let's calculate the unsupervised classification error. Here, \(p\) denotes the proportion of class \(0\) cases in the data; it serves as the weighting factor for labeling each cluster, i.e., a cluster is labeled \(0\) only if its fraction of class-\(0\) cases exceeds \(p\).
mat = matrix(1, nrow = length(Y))
p = sum(Y==0)/length(Y)
for (i in 1:2){
  id = which(clust_k2==i)
  if (sum(Y[id]==0) > length(id)*p){
    mat[id] = 0
  }
}
caret::confusionMatrix(factor(Y), factor(mat))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 189 111
## 1 1 121
##
## Accuracy : 0.7346
## 95% CI : (0.6897, 0.7761)
## No Information Rate : 0.5498
## P-Value [Acc > NIR] : 3.789e-15
##
## Kappa : 0.4906
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9947
## Specificity : 0.5216
## Pos Pred Value : 0.6300
## Neg Pred Value : 0.9918
## Prevalence : 0.4502
## Detection Rate : 0.4479
## Detection Prevalence : 0.7109
## Balanced Accuracy : 0.7581
##
## 'Positive' Class : 0
##
#table(Y, mat)
# caret::confusionMatrix(factor(Y, levels=0:1), factor(mat, levels=0:1))
The clustering-based labels achieve about 73% accuracy, which is reasonable for an unsupervised classifier.
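The weighted-majority rule above already picks a labeling, but since k-means cluster labels are intrinsically arbitrary (label switching), a prudent extra check is to score both possible cluster-to-class assignments and keep the better one. A minimal sketch, assuming `Y` is coded with the labels 0 and 1:
# score both labelings of the two clusters and keep the larger accuracy
y_num <- as.numeric(as.character(Y))
acc_direct  <- mean(y_num == as.vector(mat))
acc_flipped <- mean(y_num == (1 - as.vector(mat)))
max(acc_direct, acc_flipped)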
Finally, let's visualize the results by projecting the data onto the first two multidimensional scaling (MDS) dimensions.
library("ggplot2")
= as.data.frame(cmdscale(dis, k=2))
mds = cbind( mds, as.factor(clust_k2))
mds_temp names(mds_temp) = c('V1', 'V2', 'cluster k=2')
# gp_cluster = ggplot(mds_temp, aes(x=V2, y=V1, color=as.factor(clust_k2))) +
# geom_point(aes(shape = as.factor(Y))) + theme()
# gp_cluster
plot_ly(data=mds_temp, x=~V1, y=~V2, color=~mds_temp[,3], symbol=~as.factor(Y), type="scatter", mode="markers",
marker=list(size=15)) %>%
layout(title = "MDS Performance",
hovermode = "x unified", legend = list(orientation='h'))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Suppose the multivariate dataset is represented as a set of data points \(A\). We can define a similarity matrix \(S=\{s_{(i, j)}\}\), where \(s_{(i, j)}\) represents a measure of the similarity between points \(i, j\in A.\) Spectral clustering uses the spectrum of the similarity matrix of the high-dimensional data and performs dimensionality reduction before clustering in fewer dimensions. The spectrum of a matrix is the set of its eigenvalues. In general, if \(T:\Omega \stackrel{\text{linear operator}}{\rightarrow} \Omega\) maps a vector space \(\Omega\) into itself, its spectrum is the set of scalars \(\lambda=\{\lambda_i\}\) such that \((T-\lambda I)v=0\) for some nonzero eigenvector (or eigenfunction) \(v\), where \(I\) is the identity matrix. The determinant of the matrix equals the product of its eigenvalues, \(\det(T)=\Pi_i \lambda_i\), the trace of the matrix is \(tr(T)=\Sigma_i \lambda_i\), and the pseudo-determinant of a singular matrix is the product of its nonzero eigenvalues, \(pseudo_{det}(T)=\Pi_{\lambda_i \neq 0}\lambda_i.\)
To partition the data points into two sets \((S_1, S_2)\), suppose \(v\) is the eigenvector corresponding to the second-smallest eigenvalue of the (normalized) Laplacian matrix:
\[L = I - D^{-\frac{1}{2}}SD^{-\frac{1}{2}}\]
of the similarity matrix \(S\), where \(D\) is the diagonal degree matrix with \(D_{i, i}=\Sigma_j S_{i, j}.\)
The actual \((S_1, S_2)\) partitioning of the cases may be done in different ways. For instance, we may take the median \(m\) of the components of \(v\), assign to \(S_1\) all data points whose component in \(v\) is greater than \(m\), and label the remaining cases as part of \(S_2\). This approach may be applied iteratively for hierarchical clustering by repeatedly partitioning the resulting subsets.
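To make the construction concrete, here is a minimal, self-contained sketch (not the `kernlab` implementation) on simulated 2D data; it assumes a Gaussian similarity kernel with bandwidth \(\sigma=1\) and applies the median split of the Fiedler vector described above.
# two Gaussian blobs as a toy dataset A (50 points each)
set.seed(1234)
A <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
sigma <- 1                                               # assumed kernel bandwidth
S <- exp(-as.matrix(dist(A))^2 / (2 * sigma^2))          # similarity matrix s_(i,j)
D_inv_sqrt <- diag(1 / sqrt(rowSums(S)))                 # D^{-1/2}
L <- diag(nrow(S)) - D_inv_sqrt %*% S %*% D_inv_sqrt     # normalized Laplacian
eig <- eigen(L, symmetric = TRUE)                        # eigenvalues in decreasing order
v <- eig$vectors[, ncol(S) - 1]                          # eigenvector of the 2nd-smallest eigenvalue
S1 <- which(v > median(v)); S2 <- which(v <= median(v))  # median split into (S1, S2)
c(size_S1 = length(S1), size_S2 = length(S2))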
The `specc()` method in the `kernlab` package implements a spectral clustering algorithm where the data clustering is performed by embedding the data into the subspace spanned by the eigenvectors of an affinity matrix.
# install.packages("kernlab")
library("kernlab")
# review and choose a dataset (for example, the Iris data)
data()
#plot(iris)
Let's look at a few simple cases of spectral clustering. We are suppressing some of the outputs to save space (e.g., `plot(my_data, col=data_sc)`). First, consider the `iris` dataset that we saw in Chapter 2.
my_data <- as.matrix(iris[, -5]); dim(my_data)   # drop the Species labels; keep the 4 numeric features
num_clusters <- 3
data_sc <- specc(my_data, centers = num_clusters)
data_sc
centers(data_sc)
withinss(data_sc)
#plot(my_data, col= data_sc)
# plot(iris[,1:2], col=data_sc, pch=as.numeric(iris$Species), lwd=2)
# legend("topright", legend=c("setosa", "versicolor", "virginica"), col=c("black", "red", "green"),
# lty=as.numeric(iris$Species), cex=1.2)
# legend("bottomright", legend=c("Clust1", "Clust2", "Clust3"), pch=unique(as.numeric(iris$Species)),
# cex=1.2)
plot_ly(x=~iris[,1], y=~iris[,2], type="scatter", mode="markers", color=~as.factor(data_sc@.Data),
marker=list(size=15), symbol=~as.numeric(iris$Species)) %>%
layout(title="Spectral Clustering (Iris Data)", xaxis=list(title=colnames(iris)[1]), yaxis=list(title=colnames(iris)[2]))
Another simple dataset is `kernlab::spirals`.
# library("kernlab")
data(spirals)
num_clusters <- 2
data_sc <- specc(spirals, centers = num_clusters)
data_sc
## Spectral Clustering object of class "specc"
##
## Cluster memberships:
##
## 2 2 1 1 2 1 1 1 2 1 1 2 2 1 1 2 2 2 2 2 1 1 2 1 1 1 1 2 2 2 1 2 1 1 2 1 2 1 2 2 1 1 1 1 2 2 2 2 2 1 2 1 2 2 1 1 1 2 2 2 2 2 1 1 2 1 2 2 2 1 1 2 1 1 1 2 2 2 2 1 2 1 2 1 2 2 2 2 2 2 1 2 2 1 1 1 2 1 1 1 1 2 2 2 1 1 2 1 1 1 2 1 2 2 2 2 1 1 2 2 1 2 2 2 1 2 1 2 2 2 2 2 1 1 1 1 1 2 2 2 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 1 1 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 1 2 1 2 2 2 1 1 2 2 2 1 1 1 2 2 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 1 2 1 1 1 1 1 2 1 2 1 2 1 2 2 2 1 1 1 1 2 2 2 1 2 2 1 1 1 1 1 2 2 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 1 2 1 2 1 2 2 1 1 1 2 1 2 2 2 2 1 1 2 1 2 1 1 1 2 1 2 1 1 2 2 1 1 1 2
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 184.344560953488
##
## Centers:
## [,1] [,2]
## [1,] -0.01770984 0.1775137
## [2,] 0.01997201 -0.1761483
##
## Cluster size:
## [1] 150 150
##
## Within-cluster sum of squares:
## [1] 118.1182 117.3429
centers(data_sc)
## [,1] [,2]
## [1,] -0.01770984 0.1775137
## [2,] 0.01997201 -0.1761483
withinss(data_sc)
## [1] 118.1182 117.3429
# plot(spirals, col= data_sc)
clusterNames <- paste0("Cluster ", data_sc@.Data)
plot_ly(x=~spirals[,1], y=~spirals[,2], type="scatter", mode="markers", color = ~data_sc@.Data, name=clusterNames) %>% hide_colorbar()
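For contrast, plain k-means on the same `spirals` data typically splits the points by Euclidean proximity to the two centroids rather than by spiral arm, since the two spiral clusters are not convex. A quick hedged comparison (output not shown):
# cross-tabulate spectral-clustering vs. k-means labels on the spirals data
kmeans_spirals <- kmeans(spirals, centers = 2, nstart = 10)
table(Spectral = data_sc@.Data, kMeans = kmeans_spirals$cluster)
# plot(spirals, col = kmeans_spirals$cluster)   # optional visual check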
A customer income dataset representing a marketing survey is included in `kernlab::income`. Note that its features are categorical, so `specc()` applies a string kernel and the numeric `centers()` and `withinss()` summaries are not available (see the `NA` and `logical(0)` outputs below).
data(income)
num_clusters <- 2
data_sc <- specc(income, centers = num_clusters)
data_sc
## Spectral Clustering object of class "specc"
##
## Cluster memberships:
##
## 2 1 1 2 2 1 2 1 1 1 1 1 2 1
##
## String kernel function. Type = spectrum
## Hyperparameters : sub-sequence/string length = 4
## Normalized
##
## Cluster size:
## [1] 9 5
centers(data_sc)
## [,1]
## [1,] NA
withinss(data_sc)
## logical(0)
# plot(income, col= data_sc)
# Convert all Income features to factors and construct SPLOM plot
df2 <- income
col_names <- names(df2)
dims <- as.data.frame(lapply(df2[col_names], factor))
dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL),
        showupperhalf = FALSE, diagonal = list(visible = FALSE))
Now let's compare the eight classifiers we presented above (AdaBoost, LDA, QDA, kNN, logit, Neural Network, linear SVM, and Gaussian SVM), together with the two unsupervised methods (k-Means and spectral clustering), using cross-validated diagnostic metrics.
# get AdaBoost CV results
set.seed(123)
cv.out.ada <- crossval::crossval(my.ada, as.data.frame(X), Y, K = 5, B = 1, negative = neg)
## Number of folds: 5
## Total number of CV fits: 5
##
## Round # 1 of 1
## CV Fit # 1 of 5
## CV Fit # 2 of 5
## CV Fit # 3 of 5
## CV Fit # 4 of 5
## CV Fit # 5 of 5
# get k-Means CV results
my.kmeans <- function (train.x, train.y, test.x, test.y, negative, formula) {
  kmeans.fit <- kmeans(scale(test.x), kpp_init(scale(test.x), 2), iter.max=100, algorithm='Lloyd')
  predict.y <- kmeans.fit$cluster
  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
  # negative is the label of a negative "null" sample (default: "control")
  return (out)
}
set.seed(123)
cv.out.kmeans <- crossval::crossval(my.kmeans, as.data.frame(X), Y, K = 5, B = 2, negative = neg)
## Number of folds: 5
## Total number of CV fits: 10
##
## Round # 1 of 2
## CV Fit # 1 of 10
## CV Fit # 2 of 10
## CV Fit # 3 of 10
## CV Fit # 4 of 10
## CV Fit # 5 of 10
##
## Round # 2 of 2
## CV Fit # 6 of 10
## CV Fit # 7 of 10
## CV Fit # 8 of 10
## CV Fit # 9 of 10
## CV Fit # 10 of 10
# get spectral clustering CV results
my.sc <- function (train.x, train.y, test.x, test.y, negative, formula) {
  sc.fit <- specc(scale(test.x), centers = 2)
  predict.y <- sc.fit@.Data
  # count TP, FP, TN, FN, Accuracy, etc.
  out <- confusionMatrix(test.y, predict.y, negative = negative)
  # negative is the label of a negative "null" sample (default: "control")
  return (out)
}
set.seed(123)
cv.out.sc <- crossval::crossval(my.sc, as.data.frame(X), Y, K = 5, B = 2, negative = neg)
## Number of folds: 5
## Total number of CV fits: 10
##
## Round # 1 of 2
## CV Fit # 1 of 10
## CV Fit # 2 of 10
## CV Fit # 3 of 10
## CV Fit # 4 of 10
## CV Fit # 5 of 10
##
## Round # 2 of 2
## CV Fit # 6 of 10
## CV Fit # 7 of 10
## CV Fit # 8 of 10
## CV Fit # 9 of 10
## CV Fit # 10 of 10
require(knitr)
## Loading required package: knitr
res_tab = rbind(diagnosticErrors(cv.out.ada$stat), diagnosticErrors(cv.out.lda$stat),
                diagnosticErrors(cv.out.qda$stat), diagnosticErrors(cv.out.knn$stat),
                diagnosticErrors(cv.out.logit$stat), diagnosticErrors(cv.out.nn$stat),
                diagnosticErrors(cv.out.svml$stat), diagnosticErrors(cv.out.svmg$stat),
                diagnosticErrors(cv.out.kmeans$stat), diagnosticErrors(cv.out.sc$stat))
rownames(res_tab) <- c("AdaBoost", "LDA", "QDA", "knn", "logit", "Neural Network", "linear SVM", "Gaussian SVM", "k-Means", "Spectral Clustering")
kable(res_tab, caption = "Comparison of CV results")
|                     |       acc |      sens |      spec |       ppv |       npv |        lor |
|---------------------|-----------|-----------|-----------|-----------|-----------|------------|
| AdaBoost            | 0.9668246 | 0.9866667 | 0.9180328 | 0.9673203 | 0.9655172 |  6.7199789 |
| LDA                 | 0.9606635 | 0.9515000 | 0.9831967 | 0.9928696 | 0.8918216 |  7.0457111 |
| QDA                 | 0.9395735 | 0.9550000 | 0.9016393 | 0.9597990 | 0.8906883 |  5.2706226 |
| knn                 | 0.6125592 | 0.7266667 | 0.3319672 | 0.7278798 | 0.3306122 |  0.2784748 |
| logit               | 0.9478673 | 0.9533333 | 0.9344262 | 0.9727891 | 0.8906250 |  5.6736914 |
| Neural Network      | 0.9407583 | 0.9016393 | 0.9566667 | 0.8943089 | 0.9598662 |  5.3101066 |
| linear SVM          | 0.9573460 | 0.9344262 | 0.9666667 | 0.9193548 | 0.9731544 |  6.0240527 |
| Gaussian SVM        | 0.9478673 | 0.9262295 | 0.9566667 | 0.8968254 | 0.9695946 |  5.6246961 |
| k-Means             | 0.4241706 | 0.3866667 | 0.5163934 | 0.6628571 | 0.2550607 | -0.3957483 |
| Spectral Clustering | 0.4727488 | 0.4683333 | 0.4836066 | 0.6904177 | 0.2700229 | -0.1924337 |
Leaving `knn`, `kmeans`, and `specc` aside, the other methods achieve fairly good results. With these data, the suboptimal performance of the clustering methods may be rooted in the absence of supervised training (e.g., `specc` and `kmeans` never see the labels) or in the curse of (high) dimensionality, which we saw in Chapter 6. As the data are very sparse in the high-dimensional feature space, predicting from the nearest neighbors may not be reliable.