We already explored several alternative machine learning (ML) methods for prediction, classification, clustering, and outcome forecasting. In many situations, we derive models by estimating model coefficients or parameters. The main question now is: how can we adopt the crowd-sourcing advantages of social networks to aggregate different predictive analytics strategies?
Are there reasons to believe that such ensembles of forecasting methods may actually improve the performance or boost the prediction accuracy of the resulting consensus meta-algorithm? In this chapter, we introduce ways to search for optimal parameters for a single ML method, as well as ways to aggregate different methods into ensembles that augment their collective performance relative to any of the individual methods that are part of the meta-algorithm.
After we summarize the core methods, we will present automated and customized parameter tuning, and show strategies for improving model performance based on meta-learning via bagging and boosting.
One of the methods for improving model performance relies on tuning. For a given ML technique, tuning is the process of searching through the parameter space for the optimal parameter(s). The following table summarizes some of the parameters used in ML techniques we covered in previous chapters.
Model | Learning Task | Method | Parameters |
---|---|---|---|
KNN | Classification | class::knn | data, k |
K-Means | Clustering | stats::kmeans | data, k |
Naive Bayes | Classification | e1071::naiveBayes | train, class, laplace |
Decision Trees | Classification | C50::C5.0 | train, class, trials, costs |
OneR Rule Learner | Classification | RWeka::OneR | class~predictors, data |
RIPPER Rule Learner | Classification | RWeka::JRip | formula, data, subset, na.action, control, options |
Linear Regression | Regression | stats::lm | formula, data, subset, weights, na.action, method |
Regression Trees | Regression | rpart::rpart | dep_var ~ indep_var, data |
Model Trees | Regression | RWeka::M5P | formula, data, subset, na.action, control |
Neural Networks | Dual use | nnet::nnet | x, y, weights, size, Wts, mask, linout, entropy, softmax, censored, skip, rang, decay, maxit, Hess, trace, MaxNWts, abstol, reltol |
Support Vector Machines (Linear Kernel) | Dual use | caret::train::svmLinear | C |
Support Vector Machines (Radial Basis Kernel) | Dual use | caret::train::svmRadial | C, sigma |
Support Vector Machines (general) | Dual use | kernlab::ksvm | formula, data, kernel |
Random Forests | Dual use | randomForest::randomForest | formula, data |
\[\textbf{Table 1: Core parameters of several machine learning techniques.}\]
caret for automated parameter tuning

In Chapter 6 we used KNN and plugged in arbitrary values of the parameter k (the number of nearest neighbors). This time we will simultaneously test multiple k values and select the parameter(s) yielding the highest prediction accuracy. Using caret allows us to specify an outcome class variable, covariate predictor features, and a specific ML method. In Chapter 6, we introduced the Boys Town Study of Youth Development dataset, where we normalized all the features, stored them in a boystown_n computable object, and defined an outcome class variable (boystown$grade).
boystown <- read.csv("https://umich.instructure.com/files/399119/download?download_frd=1", sep=" ")
boystown$sex <- boystown$sex - 1
boystown$dadjob <- -1*(boystown$dadjob - 2)
boystown$momjob <- -1*(boystown$momjob - 2)
boystown <- boystown[, -1]
table(boystown$gpa)
##
##  0  1  2  3  4  5
## 30 50 54 40 14 12
boystown$grade <- boystown$gpa %in% c(3, 4, 5)
boystown$grade <- factor(boystown$grade, levels=c(F, T), labels = c("above_avg", "avg_or_below"))
normalize <- function(x){
  return((x-min(x))/(max(x)-min(x)))
}
boystown_n <- as.data.frame(lapply(boystown[, -11], normalize))
str(boystown_n)
## 'data.frame': 200 obs. of 10 variables:
## $ sex : num 0 0 0 0 1 1 0 0 1 1 ...
## $ gpa : num 1 0 0.6 0.4 0.6 0.6 0.2 1 0.2 0.6 ...
## $ Alcoholuse: num 0.182 0.364 0.182 0.182 0.545 ...
## $ alcatt : num 0.5 0.333 0.5 0.167 0.333 ...
## $ dadjob : num 1 1 1 1 1 1 1 1 1 1 ...
## $ momjob : num 0 0 0 0 1 0 0 0 1 1 ...
## $ dadclose : num 0.143 0.429 0.286 0.143 0.286 ...
## $ momclose : num 0.143 0.571 0.286 0.286 0.143 ...
## $ larceny : num 0.25 0 0 0.75 0.25 0 0 0 0.25 0.25 ...
## $ vandalism : num 0.429 0 0.286 0.286 0.286 ...
boystown_n <- cbind(boystown_n, boystown[, 11])
str(boystown_n)
## 'data.frame': 200 obs. of 11 variables:
## $ sex : num 0 0 0 0 1 1 0 0 1 1 ...
## $ gpa : num 1 0 0.6 0.4 0.6 0.6 0.2 1 0.2 0.6 ...
## $ Alcoholuse : num 0.182 0.364 0.182 0.182 0.545 ...
## $ alcatt : num 0.5 0.333 0.5 0.167 0.333 ...
## $ dadjob : num 1 1 1 1 1 1 1 1 1 1 ...
## $ momjob : num 0 0 0 0 1 0 0 0 1 1 ...
## $ dadclose : num 0.143 0.429 0.286 0.143 0.286 ...
## $ momclose : num 0.143 0.571 0.286 0.286 0.143 ...
## $ larceny : num 0.25 0 0 0.75 0.25 0 0 0 0.25 0.25 ...
## $ vandalism : num 0.429 0 0.286 0.286 0.286 ...
## $ boystown[, 11]: Factor w/ 2 levels "above_avg","avg_or_below": 2 1 2 1 2 2 1 2 1 2 ...
colnames(boystown_n)[11]<-"grade"
Now that the dataset includes an explicit class variable and predictor features, we can use the KNN method to predict the outcome grade
. Let’s plug this information into the caret::train()
function. Note that caret
can use the complete dataset as it will automatically do the random sampling for the internal statistical cross-validation. To make results reproducible, we may utilize the set.seed()
function that we presented previously in Chapter 13.
library(caret)
set.seed(123)
kNN_mod <- train(grade ~ ., data=boystown_n, method="knn")
kNN_mod; summary(kNN_mod)
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
## 2 classes: 'above_avg', 'avg_or_below'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7971851 0.5216369
## 7 0.8037923 0.5282107
## 9 0.7937086 0.4973139
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 7.
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 10 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
In this case, using str(kNN_mod) to summarize the fitted object may report too much information. Instead, we can simply type the object name, kNN_mod, to get a more concise summary of it, including:
Description about the dataset: number of samples, features, and classes.
Re-sampling process: here it is using 25 bootstrap samples with 200 observations (same size as the observed dataset) each to train the model.
Candidate models with different parameters that have been evaluated: by default, caret uses 3 different choices for each parameter, but for binary parameters it only takes the 2 choices (TRUE and FALSE). As KNN has only one parameter, k, we have 3 candidate models reported in the output above.
Optimal model: the model with the largest accuracy is the one corresponding to k=7.
Let's see how accurate this "optimal model" is in terms of the re-substitution error. Again, we will use the predict() function, specifying the fitted object kNN_mod and the dataset boystown_n. Then, we can report a contingency table showing the agreement between the predictions and the real class labels.
set.seed(1234)
pred <- predict(kNN_mod, boystown_n)
table(pred, boystown_n$grade)
##
## pred above_avg avg_or_below
## above_avg 132 17
## avg_or_below 2 49
This model has a \((17+2)/200=0.095\) re-substitution error (about 9.5%). This means that among the 200 observations we used to train this model, about 90.5% were correctly classified. Note that the re-substitution error is different from the accuracy. The (resampling-based) accuracy of this model is about \(0.80\), which is reported when printing the model object. As mentioned in Chapter 13, we can obtain prediction probabilities for each observation in the original boystown_n dataset.
head(predict(kNN_mod, boystown_n, type = "prob"))
## above_avg avg_or_below
## 1 0.0000000 1.0000000
## 2 1.0000000 0.0000000
## 3 0.7142857 0.2857143
## 4 0.8571429 0.1428571
## 5 0.2857143 0.7142857
## 6 0.5714286 0.4285714
The default setting of train() might not meet the specific needs of every study. In our case, the optimal k might lie outside the default search grid (k = 5, 7, 9). The caret package allows us to customize the settings for train(). Specifically, caret::trainControl() can help us customize the re-sampling methods. There are 6 popular re-sampling methods that we might want to use, which are summarized in the following table.
Resampling method | Method name | Additional options and default values |
---|---|---|
Holdout sampling | LGOCV | p = 0.75 (training data proportion) |
k-fold cross-validation | cv | number = 10 (number of folds) |
Repeated k-fold cross-validation | repeatedcv | number = 10 (number of folds), repeats = 10 (number of iterations) |
Bootstrap sampling | boot | number = 25 (resampling iterations) |
0.632 bootstrap | boot632 | number = 25 (resampling iterations) |
Leave-one-out cross-validation | LOOCV | None |
\[\textbf{Table 2: Core re-sampling methods.}\]
Each of these methods relies on an alternative representative sampling strategy to train the model. Let's take the 0.632 bootstrap as an example; just specify method="boot632"
in the trainControl()
function. The number of different samples to include can be customized by the number=
option. Another option in trainControl()
allows specification of the model performance evaluation. We can select a preferred method of evaluation for choosing the optimal model. For instance, the oneSE
method chooses the simplest model within one standard error of the best performance to be the optimal model. Other strategies are also available in the caret
package. For detailed information, type ?best
in the R console.
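For instance, a minimal sketch (assuming the kNN_mod object fitted above; perf is just a local helper name) comparing caret's built-in selection functions on the resampled performance profile:

perf <- kNN_mod$results   # resampled performance profile (k, Accuracy, Kappa, and their SDs)
best(perf, metric = "Accuracy", maximize = TRUE)                # row of the highest-accuracy model
oneSE(perf, metric = "Accuracy", num = 25, maximize = TRUE)     # simplest model within one SE of the best (25 resamples)
tolerance(perf, metric = "Accuracy", tol = 2, maximize = TRUE)  # simplest model within 2% of the best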
We can also specify a list of k values we want to test by creating a tuning grid with expand.grid().
ctrl <- trainControl(method = "boot632", number=25, selectionFunction = "oneSE")
# expand.grid() creates a data frame from all combinations of the supplied factors
grid <- expand.grid(k=c(1, 3, 5, 7, 9))
Usually, to avoid ties, we prefer to choose an odd number of nearest neighbors \(k\). Now the constraints are all set and we can select models again using train().
set.seed(123)
kNN_mod2 <- train(grade ~ ., data=boystown_n, method="knn",
                  metric="Kappa",
                  trControl=ctrl,
                  tuneGrid=grid)
kNN_mod2
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
## 2 classes: 'above_avg', 'avg_or_below'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.8748058 0.7155507
## 3 0.8389441 0.6235744
## 5 0.8411961 0.6254587
## 7 0.8384469 0.6132381
## 9 0.8341422 0.5971359
##
## Kappa was used to select the optimal model using the one SE rule.
## The final value used for the model was k = 1.
Here we added metric="Kappa" to include the Kappa statistic as one of the criteria for selecting the optimal model. We can see that the accuracies of all the candidate models are better than with the default bootstrap sampling. The optimal model has k=1, a high accuracy (\(\approx 0.87\)), and a high Kappa statistic (\(\approx 0.72\)), which is much better than the model we had in Chapter 6. Note that the one-SE rule may not necessarily choose the model with the highest accuracy or the highest Kappa statistic as the "optimal model"; the tuning process is more comprehensive than looking at a single statistic.
Meta-learning involves building and ensembling multiple learners, relying on either a single learning algorithm or several different algorithms. Meta-learners combine the outputs of several techniques and report consensus results that are, in general, more reliable.
For example, to decrease the variance (bagging) or the bias (boosting), random forests attempt, in two steps, to correct the general tendency of decision trees to overfit the training set:
Step 1: producing a distribution of simple ML models on subsets of the original data.
Step 2: combine the distribution into one “aggregated” model.
Before stepping into the details, let’s briefly summarize:
Bagging (which stands for Bootstrap Aggregating) is a way to decrease the variance of the prediction by generating additional training data from the original dataset, using sampling with replacement to produce multiple samples of the same cardinality/size as the original data. We can't expect to improve the model's predictive power by synthetically increasing the size of the training set; however, we may decrease the variance by narrowly tuning the prediction to the expected outcome. (A minimal hand-rolled illustration of this idea appears right after this summary.)
Boosting is a two-step approach that aims to reduce bias in parameter estimation. First, we use subsets of the original data to produce a series of moderately performing models and then “boost” their performance by combining them together using a particular cost function (e.g., Accuracy). Unlike bagging, in classical boosting, the subset creation is not random and depends upon the performance of the previous models: every new subset contains the elements that were (likely to be) misclassified by previous models. Usually, when using boosting, we prefer weaker classifiers. For example, a prevalent choice is to use a stump (level-one decision tree) in AdaBoost (Adaptive Boosting).
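To make the bagging idea concrete, here is a minimal hand-rolled sketch (assuming the boystown_n data frame prepared above; B, votes, and bag_pred are our own illustrative names): we draw bootstrap samples, grow one rpart tree per sample, and combine the individual predictions by majority vote.

library(rpart)
set.seed(123)
B <- 25                                  # number of bootstrap learners
n <- nrow(boystown_n)
votes <- sapply(1:B, function(b) {
  idx <- sample(n, n, replace = TRUE)    # bootstrap sample of the same size as the data
  fit <- rpart(grade ~ ., data = boystown_n[idx, ], method = "class")
  as.character(predict(fit, boystown_n, type = "class"))
})
# majority vote across the B trees for each of the n observations
bag_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bag_pred == boystown_n$grade)       # re-substitution accuracy of the ensemble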
One of the most well-known meta-learning methods is bootstrap aggregating, or bagging. It builds multiple models with bootstrap samples using a single algorithm. The models' predictions are combined by voting (for classification) or averaging (for numeric prediction). Voting means that the bagging model's prediction is the class chosen by the majority of the learners. Bagging is especially effective with unstable learners like decision trees or SVM models.
To illustrate the Bagging method we will again use the Quality of life and chronic disease dataset we saw in Chapter 8. Just like we did in the second practice problem in Chapter 10, we will use CHARLSONSCORE
as the class label, which has 11 different classes.
qol <- read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
qol <- qol[!qol$CHARLSONSCORE==-9 , -c(1, 2)]
qol$CHARLSONSCORE <- as.factor(qol$CHARLSONSCORE)
To apply bagging()
, we need to download the ipred
package first. After loading the package, we build a bagging model with CHARLSONSCORE
as class label and all other variables in the dataset as predictors. We can specify the number of voters (decision tree models we want to have), which defaults to 25.
# install.packages("ipred")
library(ipred)
## Warning: package 'ipred' was built under R version 4.1.1
set.seed(123)
mybag <- bagging(CHARLSONSCORE ~ ., data=qol, nbagg=25)
The result, mybag, is a complex class object that includes y (the vector of responses), X (the data frame of predictors), mtrees (a list of length nbagg containing the trees fitted to each bootstrap sample), OOB (a logical indicating whether the out-of-bag estimate should be computed), err (if OOB=TRUE, the out-of-bag estimate of the misclassification error, the root mean squared error, or the Brier score for censored data), and comb (a logical indicating whether a combination of models was requested).
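As a quick, hedged illustration (the component names below reflect ipred's internal representation, where each element of mtrees is assumed to store the bootstrap indices and the corresponding tree), we can peek inside the fitted ensemble:

length(mybag$mtrees)        # number of bootstrapped trees (nbagg = 25)
mybag$mtrees[[1]]$btree     # the tree grown on the first bootstrap sample (assuming a $btree slot)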
Now we will use the predict()
function to apply this forecasting model. For evaluation purposes, we create a table to inspect the re-substitution error.
bt_pred <- predict(mybag, qol)
agreement <- bt_pred==qol$CHARLSONSCORE
prop.table(table(agreement))
## agreement
## FALSE TRUE
## 0.00128866 0.99871134
This model works very well on its training data: it labeled 99.87% of the cases correctly. To evaluate its performance on testing data, we apply the caret train() function again, using 10-fold cross-validation repeated 10 times as the re-sampling method. In caret, the bagged trees method is called treebag.
library(caret)
set.seed(123)
ctrl <- trainControl(method="repeatedcv", number = 10, repeats = 10)
train(CHARLSONSCORE ~ ., data=as.data.frame(qol), method="treebag", trControl=ctrl)
## Bagged CART
##
## 2328 samples
## 38 predictor
## 11 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 2094, 2097, 2096, 2098, 2096, 2095, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5199428 0.2120838
Well, we got a marginal accuracy of 52% and a fair Kappa statistic (0.21). This result is better than the one we obtained in Chapter 10 using the ksvm() function alone (~50%). Here we combined the predictions of multiple bagged decision trees (25, by default) to reach this accuracy. It seems that we can't forecast CHARLSONSCORE very well; however, other QoL outcomes may have higher prediction accuracy. For instance, we may predict QOL_Q_01 with \(accuracy=0.6\) and \(\kappa=0.42\).
set.seed(123)
ctrl <- trainControl(method="repeatedcv", number = 10, repeats = 10)
train(as.factor(QOL_Q_01) ~ . , data=as.data.frame(qol), method="treebag", trControl=ctrl)
In addition to decision tree classification, caret
allows us to explore alternative bag()
functions. For instance, instead of bagging based on decision trees, we can bag using an SVM model. caret provides a convenient setup for SVM training, prediction, and vote counting in the list object svmBag. We can examine this object using the str() function.
str(svmBag)
## List of 3
## $ fit :function (x, y, ...)
## $ pred :function (object, x)
## $ aggregate:function (x, type = "class")
Clearly, fit
provides the training functionality, pred
the prediction and forecasting on new data, and aggregate
is a way to combine many models and achieve voting-based consensus. Using the member operator, the \(\$\) sign, we can explore these three types of elements of the svmBag
object. For instance, the fit
element may be extracted from the SVM object by:
svmBag$fit
## function (x, y, ...)
## {
## loadNamespace("kernlab")
## out <- kernlab::ksvm(as.matrix(x), y, prob.model = is.factor(y),
## ...)
## out
## }
## <bytecode: 0x0000000029645080>
## <environment: namespace:caret>
The SVM bag fit
relies on the kernlab::ksvm()
function. The other two methods, pred
and aggregate
, may be explored in a similar way. They follow the SVM model building and testing process we discussed in Chapter 10.
This svmBag
object could be used as an optional setting in the train()
function. However, this option requires that all features be linearly independent, an assumption that may not hold in real-world data.
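For completeness, here is a hedged sketch of how the three svmBag functions could be plugged into caret's generic bagging wrapper via bagControl() (reusing the qol data and CHARLSONSCORE outcome from above; given the caveat just mentioned, this is illustrative and not guaranteed to run cleanly on every dataset):

ctrl_bag <- bagControl(fit = svmBag$fit, predict = svmBag$pred, aggregate = svmBag$aggregate)
set.seed(123)
svm_bag_fit <- train(CHARLSONSCORE ~ ., data = qol, method = "bag", B = 10,
                     bagControl = ctrl_bag,
                     trControl = trainControl(method = "cv", number = 5))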
Bagging uses equal weights for all learners we include in the model. Boosting is different as it employs non-uniform weights. Suppose we have the first learner correctly classifying 60% of the observations. This 60% of data may be less likely to be included in the training dataset for the next learner. So, we have more learners working on the remaining “hard-to-classify” observations.
Mathematically, the boosting technique uses a weighted sum of functions to predict the outcome class labels. We can try to fit the true model using weighted additive modeling. We start with a random learner that can classify some of the observations mostly correctly, possibly with some errors. \[\hat{y}_1=l_1.\] This \(l_1\) is our first learner and \(\hat{y}_1\) denotes its predictions (this equation is in matrix form). Then, we can calculate the residuals of our first learner. \[\epsilon_1=y-v_1\times\hat{y}_1,\] where \(v_1\) is a shrinkage parameter to avoid overfitting. Next, we fit the residual with another learner. This learner minimizes the following objective function \(\sum_{i=1}^N||y_i-l_{k-1}-l_k||\). Here k=2
. Then we obtain a second model \(l_2\) with: \[\hat{y}_2=l_2.\] After that, we can update the residuals: \[\epsilon_2=\epsilon_1-v_2\times\hat{y}_2.\] We repeat this residual fitting until adding another learner \(l_k\) results in updated residual \(\epsilon_k\) that is smaller than a small predefined threshold. In the end, we will have an additive model like: \[L=v_1\times l_1+v_2\times l_2+...+v_k\times l_k,\] where we ensemble k weak learners to generate a stronger meta model.
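A tiny hand-rolled sketch (on simulated toy data, with rpart stumps as the weak learners and the shrinkage fixed at v = 0.1) illustrates this residual-fitting loop for a numeric outcome:

library(rpart)
set.seed(123)
n <- 500
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)
toy <- data.frame(x = x)
v <- 0.1                                  # shrinkage parameter (held constant here)
pred  <- rep(0, n)                        # current additive prediction L
resid <- y                                # current residuals (epsilon)
for (k in 1:100) {
  stump <- rpart(resid ~ x, data = data.frame(x = x, resid = resid),
                 control = rpart.control(maxdepth = 1))   # weak learner l_k
  pred  <- pred + v * predict(stump, toy)                 # L <- L + v * l_k
  resid <- y - pred                                       # update the residuals
}
mean(resid^2)                             # training MSE of the boosted ensemble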
Schapire and Freund found that, although the individual learners may be very weak predictors in isolation, boosting their collective power is expected to generate a model no worse than the best of the individual constituent models included in the ensemble, and the boosted results are usually considerably better than those of the top individual model. Although boosting can be used with almost any model, it is most commonly applied to decision trees.
Random forests, or decision tree forests, represent an ensemble method that combines bagging (bootstrap aggregation) with random feature selection, focusing on decision tree learners.
One approach to train and build random forests uses the randomForest::randomForest()
method, which has the following invocation:
m <- randomForest(expression, data, ntree=500, mtry=sqrt(p))
Here, p stands for the number of features in the data. Let's build a random forest using the Quality of Life dataset.
# install.packages("randomForest")
library(randomForest)
set.seed(123)
rf <- randomForest(as.factor(QOL_Q_01) ~ . , data=qol)
rf
##
## Call:
## randomForest(formula = as.factor(QOL_Q_01) ~ ., data = qol)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 38.4%
## Confusion matrix:
## 1 2 3 4 5 6 class.error
## 1 14 10 14 5 0 0 0.6744186
## 2 3 73 120 14 0 1 0.6540284
## 3 1 23 558 205 3 1 0.2945638
## 4 0 2 164 686 36 4 0.2309417
## 5 0 0 7 160 92 0 0.6447876
## 6 0 1 34 81 5 11 0.9166667
By default the model contains 500 voter trees and tries 6 variables at each split. Its OOB (out-of-bag) error rate is about 38%, which corresponds to a moderate accuracy (62%). Note that the OOB error rate is not a re-substitution error. Next to the confusion matrix, we see the reported OOB error rates for the specific classes. All of these error rates are reasonable estimates of future performance on unseen data. So far this model is the best of all the models we have tried, although it is still not highly predictive of QOL_Q_01.
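As a quick check, the OOB estimates are stored in the fitted randomForest object and can be inspected directly; for example:

tail(rf$err.rate, 1)   # OOB error (first column) and per-class errors after all 500 trees
# plot(rf)             # error rates vs. the number of trees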
In addition to model building, the caret package also supports model evaluation and reports more detailed performance assessments. As usual, we need to specify the re-sampling method and a parameter grid. Let's use a 10-fold CV re-sampling method as an example. The grid for this model contains information about the mtry parameter (the only tuning parameter for random forests). Previously we used the default value \(\lfloor\sqrt{38}\rfloor=6\) (38 is the number of features). This time we can compare multiple mtry values.
library(caret)
ctrl <- trainControl(method="cv", number=10)
grid_rf <- expand.grid(mtry=c(2, 4, 8, 16))
Next, we apply the train()
function with our ctrl
and grid_rf
settings.
set.seed(123)
m_rf <- train(as.factor(QOL_Q_01) ~ ., data = qol, method = "rf",
              metric = "Kappa", trControl = ctrl, tuneGrid = grid_rf)
m_rf
## Random Forest
##
## 2328 samples
## 38 predictor
## 6 classes: '1', '2', '3', '4', '5', '6'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2095, 2095, 2096, 2096, 2095, 2096, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.5760034 0.3465920
## 4 0.6039266 0.4004127
## 8 0.6125122 0.4197845
## 16 0.6180824 0.4372678
##
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 16.
This call may take a while to complete. The result appears to be a good model: with mtry=16 we reached a moderately high accuracy (0.62) and a good Kappa statistic (0.44). This is a good result for a meta-learner over 6 dispersed classes (see table(as.factor(qol$QOL_Q_01))).
More examples of using randomForest()
and interpreting its results are shown in Chapter 8.
We may achieve even higher accuracy using AdaBoost. Adaptive boosting (AdaBoost) can be used in conjunction with many other types of learning algorithms to improve their performance. The output of the other learning algorithms (‘weak learners’) is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by the previous classifiers.
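In symbols, for the binary case with weak learners \(h_k(x)\in\{-1,+1\}\) and weights \(\alpha_k\) reflecting their performance, the final boosted classifier is the sign of this weighted sum: \[F(x)=\operatorname{sign}\left(\sum_{k=1}^{K}\alpha_k h_k(x)\right).\]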
For binary outcomes, we could use the method ada::ada(), and for multiple classes (multinomial/polytomous outcomes) we can use the package adabag. The adabag::boosting() function allows us to select the algorithm by setting its coeflearn argument. The two main types of adaptive boosting that are commonly used are the AdaBoost.M1 algorithm (with coeflearn set to 'Breiman' or 'Freund') and Zhu's SAMME algorithm (coeflearn = 'Zhu'). Thus, the key parameter in the adabag::boosting() method is coeflearn.
The generalizations of AdaBoost for multiple classes (\(\geq 2\)) include AdaBoost.M1
(where individual trees are required to have an error \(\lt \frac{1}{2}\)) and SAMME
(where individual trees are required to have an error \(\lt 1-\frac{1}{nclasses}\)).
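Summarizing the adabag documentation, the three coeflearn options differ in the weight \(\alpha\) assigned to a weak learner with weighted error \(err\): \[\alpha_{Breiman}=\frac{1}{2}\ln\left(\frac{1-err}{err}\right), \quad \alpha_{Freund}=\ln\left(\frac{1-err}{err}\right), \quad \alpha_{Zhu\ (SAMME)}=\ln\left(\frac{1-err}{err}\right)+\ln(nclasses-1).\]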
Let’s see some examples using these three alternative adaptive boosting methods:
# Prep the data
qol <- read.csv("https://umich.instructure.com/files/481332/download?download_frd=1")
qol <- qol[!qol$CHARLSONSCORE==-9 , -c(1, 2)]
qol$CHARLSONSCORE <- as.factor(qol$CHARLSONSCORE)
# qol$QOL_Q_01 <- as.factor(qol$QOL_Q_01)
qol <- qol[!qol$CHARLSONSCORE==-9 , -c(1, 2)]
qol$cd <- qol$CHRONICDISEASESCORE > 1.497
qol$cd <- factor(qol$cd, levels=c(F, T), labels = c("minor_disease", "severe_disease"))
qol <- qol[!qol$CHRONICDISEASESCORE==-9, ]
# install.packages("ada"); install.packages("adabag")
library("ada"); library("adabag")
set.seed(123)
# qol_boost <- boosting(QOL_Q_01 ~ . , data=qol, mfinal = 100, coeflearn = 'Breiman')
# mean(qol_boost$class==qol$QOL_Q_01)
qol_boost <- boosting(cd ~ . , data=qol[, -37], mfinal = 100, coeflearn = 'Breiman')
mean(qol_boost$class==qol$cd)
## [1] 0.86621
set.seed(123)
#qol_boost <- boosting(QOL_Q_01 ~ ., data=qol, mfinal = 100, coeflearn = 'Breiman')
#mean(qol_boost$class==qol$QOL_Q_01)
qol_boost <- boosting(cd ~ . , data=qol[, -37], mfinal = 100, coeflearn = 'Breiman')
mean(qol_boost$class==qol$cd)
## [1] 0.86621
set.seed(1234)
#qol_boost <- boosting(QOL_Q_01 ~ ., data=qol, mfinal = 100, coeflearn = 'Zhu')
#mean(qol_boost$class==qol$QOL_Q_01)
qol_boost <- boosting(cd ~ . , data=qol[, -37], mfinal = 100, coeflearn = 'Zhu')
mean(qol_boost$class==qol$cd)
## [1] 0.9378995
We observe that the Zhu (SAMME) approach achieves the best results, with a re-substitution accuracy of about \(0.94\). Notice that the default method is AdaBoost.M1 with coeflearn = 'Breiman', and that mfinal is the number of boosting iterations.
Earlier in Chapter 8 (Decision Trees) and Chapter 13 (Model Evaluation) we saw examples of how to choose appropriate evaluation metrics and how to contrast the performance of various AI/ML methods. Below, we illustrate model comparison for the classification task in Case-Study 6, the Quality of Life (QoL) dataset, using random forest, k-nearest neighbors, SVM, bagging (AdaBag), and boosting (AdaBoost). All available caret package ML/AI training methods are listed here.
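As a quick sketch, the method names registered in the locally installed version of caret can also be enumerated directly:

names(getModelInfo())                 # all available train() methods
# getModelInfo("rf")$rf$parameters    # tuning parameters of a specific method, e.g., random forest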
# install.packages("fastAdaboost")
library(fastAdaboost)
library(caret)      # for modeling
library(lattice)    # for plotting
control <- trainControl(method="repeatedcv", number=10, repeats=3)

## Run all subsequent models in parallel
library(doParallel)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
system.time({
  rf.fit       <- train(cd~., data=qol[, -37], method="rf", trControl=control);
  knn.fit      <- train(cd~., data=qol[, -37], method="knn", trControl=control);
  svm.fit      <- train(cd~., data=qol[, -37], method="svmRadialWeights", trControl=control);
  adabag.fit   <- train(cd~., data=qol[, -37], method="AdaBag", trControl=control);
  adaboost.fit <- train(cd~., data=qol[, -37], method="adaboost", trControl=control)
})
## user system elapsed
## 53.36 0.47 2161.38
stopCluster(cl) # close multi-core cluster
rm(cl)
results <- resamples(list(RF=rf.fit, kNN=knn.fit, SVM=svm.fit, Bag=adabag.fit, Boost=adaboost.fit))
# summary of model differences
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, kNN, SVM, Bag, Boost
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.5753425 0.6027397 0.6292887 0.6313526 0.6617130 0.6803653 0
## kNN 0.4977169 0.5210928 0.5331783 0.5352908 0.5427615 0.6090909 0
## SVM 0.5590909 0.6365779 0.6521721 0.6477953 0.6632524 0.6803653 0
## Bag 0.5936073 0.6312785 0.6506849 0.6476458 0.6632524 0.6940639 0
## Boost 0.5159817 0.6027397 0.6141553 0.6113985 0.6281485 0.6772727 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.15359681 0.2060723 0.25849250 0.26188171 0.3213024 0.3626310 0
## kNN -0.01175976 0.0347122 0.06088063 0.06445134 0.0769631 0.2197938 0
## SVM 0.11378738 0.2802618 0.30831197 0.30004579 0.3280927 0.3672610 0
## Bag 0.18478397 0.2683480 0.30879227 0.29869079 0.3308841 0.3930005 0
## Boost 0.03572319 0.2040188 0.22475694 0.22152584 0.2535556 0.3566722 0
# Plot Accuracy Summaries
# scales <- list(x=list(relation="free"), y=list(relation="free"))
# bwplot(results, scales=scales) # Box plots of accuracy
# Convert (results) data-frame from wide to long format
# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector)
library(tidyr)
results_long <- gather(results$values[, -1], method, measurement, factor_key=TRUE) %>%
  separate(method, c("Technique", "Metric"), sep = "~")
# Compare original wide format to transformed long format
results$values[, -1]
## RF~Accuracy RF~Kappa kNN~Accuracy kNN~Kappa SVM~Accuracy SVM~Kappa
## 1 0.6757991 0.3490602 0.5227273 0.0375000000 0.6484018 0.3030378
## 2 0.5981735 0.1980025 0.5272727 0.0525879917 0.6210046 0.2494116
## 3 0.5981735 0.1905922 0.5045872 0.0037237644 0.6803653 0.3672610
## 4 0.5753425 0.1535968 0.5409091 0.0741666667 0.6575342 0.3223730
## 5 0.5981735 0.1935727 0.5662100 0.1241843822 0.6621005 0.3280537
## 6 0.6009174 0.2058454 0.5662100 0.1241843822 0.6500000 0.3063063
## 7 0.6301370 0.2601043 0.5388128 0.0748672048 0.6757991 0.3555887
## 8 0.6347032 0.2698175 0.5159817 0.0286216420 0.6605505 0.3253597
## 9 0.6136364 0.2253521 0.5296804 0.0504314881 0.6575342 0.3186626
## 10 0.6027397 0.2067530 0.5205479 0.0337829321 0.6803653 0.3672610
## 11 0.6118721 0.2228531 0.6090909 0.2197938144 0.6227273 0.2516393
## 12 0.6757991 0.3526375 0.5022831 0.0006698204 0.5590909 0.1137874
## 13 0.6164384 0.2302285 0.5342466 0.0626888218 0.6529680 0.3124019
## 14 0.6330275 0.2657460 0.5000000 -0.0093450004 0.6347032 0.2768697
## 15 0.6712329 0.3426165 0.5366972 0.0694784887 0.6513761 0.3077052
## 16 0.6605505 0.3190951 0.5707763 0.1330020216 0.5688073 0.1328707
## 17 0.6027397 0.2067530 0.4977169 -0.0117597648 0.6422018 0.2930318
## 18 0.6284404 0.2568807 0.5227273 0.0414937759 0.6803653 0.3643752
## 19 0.6621005 0.3220382 0.5205479 0.0337829321 0.6347032 0.2729084
## 20 0.6575342 0.3136517 0.5388128 0.0636667654 0.6666667 0.3410412
## 21 0.6666667 0.3313397 0.5616438 0.1234887017 0.6666667 0.3368316
## 22 0.6590909 0.3187448 0.5275229 0.0534524914 0.6529680 0.3089188
## 23 0.6681818 0.3325021 0.5114155 0.0153787974 0.6636364 0.3281057
## 24 0.6027397 0.2052972 0.5296804 0.0530624239 0.6484018 0.2979308
## 25 0.6803653 0.3626310 0.5388128 0.0705912013 0.6803653 0.3643752
## 26 0.5981735 0.1957937 0.5321101 0.0590724441 0.6575342 0.3192838
## 27 0.5954545 0.1882255 0.5570776 0.1032546752 0.6347032 0.2742336
## 28 0.6118721 0.2207058 0.5366972 0.0702643358 0.6454545 0.2970094
## 29 0.6666667 0.3294887 0.5433790 0.0776617251 0.6438356 0.2904378
## 30 0.6438356 0.2865258 0.5545455 0.0997912317 0.6330275 0.2743009
## Bag~Accuracy Bag~Kappa Boost~Accuracy Boost~Kappa
## 1 0.6940639 0.3930005 0.6090909 0.21721142
## 2 0.6484018 0.3026920 0.6181818 0.23730912
## 3 0.6438356 0.2917323 0.6363636 0.27362773
## 4 0.6636364 0.3330602 0.6100917 0.21688784
## 5 0.6210046 0.2480453 0.6118721 0.22213865
## 6 0.6409091 0.2847737 0.5205479 0.03822828
## 7 0.6484018 0.2979308 0.6073059 0.21980116
## 8 0.6347032 0.2755541 0.5733945 0.14173228
## 9 0.6727273 0.3553114 0.5799087 0.15066183
## 10 0.6621005 0.3243559 0.6484018 0.29664234
## 11 0.6651376 0.3336404 0.6272727 0.25300207
## 12 0.6605505 0.3242292 0.6284404 0.25374017
## 13 0.6255708 0.2574429 0.6164384 0.22523585
## 14 0.6621005 0.3212431 0.6651376 0.32630599
## 15 0.6301370 0.2681629 0.6210046 0.24044127
## 16 0.6529680 0.3148926 0.5159817 0.03572319
## 17 0.6757991 0.3590948 0.5753425 0.15282226
## 18 0.6651376 0.3341980 0.6529680 0.30481998
## 19 0.5981735 0.2038338 0.6118721 0.22427803
## 20 0.6666667 0.3350110 0.6255708 0.24786396
## 21 0.6164384 0.2295192 0.6772727 0.35667216
## 22 0.6559633 0.3153840 0.6330275 0.26512725
## 23 0.6849315 0.3714488 0.6164384 0.23375541
## 24 0.6301370 0.2621355 0.5963303 0.19163998
## 25 0.6621005 0.3220382 0.5890411 0.17296073
## 26 0.6210046 0.2480453 0.6164384 0.23445693
## 27 0.6621005 0.3218679 0.6027397 0.20383603
## 28 0.5936073 0.1847840 0.6500000 0.29826015
## 29 0.6363636 0.2783928 0.6027397 0.20456728
## 30 0.6347032 0.2689034 0.6027397 0.20602575
head(results_long)
## Technique Metric measurement
## 1 RF Accuracy 0.6757991
## 2 RF Accuracy 0.5981735
## 3 RF Accuracy 0.5981735
## 4 RF Accuracy 0.5753425
## 5 RF Accuracy 0.5981735
## 6 RF Accuracy 0.6009174
library(plotly)
plot_ly(results_long, x=~Technique, y = ~measurement, color = ~Metric, type = "box")
#densityplot(results, scales=scales, pch = "|") # Density plots of accuracy
densityModels <- with(results_long[which(results_long$Metric=='Accuracy'), ],
                      tapply(measurement, INDEX = Technique, density))
df <- data.frame(
  x = unlist(lapply(densityModels, "[[", "x")),
  y = unlist(lapply(densityModels, "[[", "y")),
  method = rep(names(densityModels), each = length(densityModels[[1]]$x))
)
plot_ly(df, x = ~x, y = ~y, color = ~method) %>% add_lines() %>%
layout(title="Performance Density Plots (Accuracy)", legend = list(orientation='h'),
xaxis=list(title="Accuracy"), yaxis=list(title="Density"))
densityModels <- with(results_long[which(results_long$Metric=='Kappa'), ],
                      tapply(measurement, INDEX = Technique, density))
df <- data.frame(
  x = unlist(lapply(densityModels, "[[", "x")),
  y = unlist(lapply(densityModels, "[[", "y")),
  method = rep(names(densityModels), each = length(densityModels[[1]]$x))
)
plot_ly(df, x = ~x, y = ~y, color = ~method) %>% add_lines() %>%
layout(title="Performance Density Plots (Kappa)", legend = list(orientation='h'),
xaxis=list(title="Kappa"), yaxis=list(title="Density"))
# dotplot(results, scales=scales) # Dot plots of Accuracy & Kappa
#splom(results) # contrast pair-wise model scatterplots of prediction accuracy (Trellis Scatterplot matrices)
# Pairs - Accuracy
results_wide <- results_long[which(results_long$Metric=='Accuracy'), -2] %>%
  pivot_wider(names_from = Technique, values_from = measurement)
df <- data.frame(cbind(RF=results_wide$RF[[1]], kNN=results_wide$kNN[[1]], SVM=results_wide$SVM[[1]], Bag=results_wide$Bag[[1]], Boost=results_wide$Boost[[1]]))

dims <- dplyr::select_if(df, is.numeric)
dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL),
        showupperhalf = FALSE, diagonal = list(visible = FALSE)) %>%
  layout(title="Performance Pairs Plot (Accuracy)")
# Pairs - Kappa
results_wide <- results_long[which(results_long$Metric=='Kappa'), -2] %>%
  pivot_wider(names_from = Technique, values_from = measurement)
df <- data.frame(cbind(RF=results_wide$RF[[1]], kNN=results_wide$kNN[[1]], SVM=results_wide$SVM[[1]], Bag=results_wide$Bag[[1]], Boost=results_wide$Boost[[1]]))

dims <- dplyr::select_if(df, is.numeric)
dims <- purrr::map2(dims, names(dims), ~list(values=.x, label=.y))
plot_ly(type = "splom", dimensions = setNames(dims, NULL),
        showupperhalf = FALSE, diagonal = list(visible = FALSE)) %>%
  layout(title="Performance Pairs Plot (Kappa)")
Try applying model improvement techniques using other data from the list of our Case-Studies.