As we mentioned in Chapter 15, variable selection is very important when dealing with bioinformatics, healthcare, and biomedical data where we may have more features than observations. Instead of trying to interrogate the complete data in its native high-dimensional state, variable selection, or feature selection, helps us focus on the most salient information contained in the observations. Due to the presence of intrinsic and extrinsic noise, the volume and complexity of big health data, as well as different methodological and technological challenges, the process of identifying the salient features may resemble finding a needle in a haystack. Here, we will illustrate alternative strategies for feature selection using filtering (e.g., correlation-based feature selection), wrapping (e.g., recursive feature elimination), and embedding (e.g., variable importance via random forest classification) techniques.
Variable selection relates to dimensionality reduction, which we saw in Chapter 5; however, there are differences between them.
Method | Process Type | Goals | Approach |
---|---|---|---|
Variable selection | Discrete process | To select unique representative features from each group of similar features | To identify highly correlated variables and choose a representative feature by post processing the data |
Dimension reduction | Continuous process | To denoise the data, enable simpler prediction, or group features so that low impact features have smaller weights | Find the essential, \(k\ll n\), components, factors, or clusters representing linear, or nonlinear, functions of the \(n\) variables which maximize an objective function like the proportion of explained variance |
Relative to the lower variance estimates in continuous dimensionality reduction, the intrinsic characteristics of the discrete feature selection process yield higher variance in bootstrap estimation and cross validation.
In Chapter 17, we will learn about another powerful technique for variable-selection using decoy features (knockoffs) to control for the false discovery rate of selecting inconsequential features as important.
There are three major classes of variable or feature selection techniques - filtering-based, wrapper-based, and embedded methods.
The different types of feature selection methods have their own pros and cons. In this chapter, we are going to introduce the randomized wrapper method using the Boruta package, which uses a random forest classification method to output variable importance measures (VIMs). Then, we will compare its results with Recursive Feature Elimination, a classical deterministic wrapper method.
Let’s start by examining random forest based feature selection as an embedded technique. Random forests perform well as classification, regression, and clustering methods, and they are also easy to use and yield accurate, robust results. A random forest, or more broadly a decision tree, prediction naturally leads to feature selection via the mean decrease in impurity or the mean decrease in accuracy criteria.
The many decision trees in a random forest include explicit conditions at each branching node, each based on a single feature. The bifurcation conditions splitting the data may be chosen by optimizing a cost function based on impurity, see Chapter 8. For classification problems, we can also use other metrics such as information gain or entropy. These measures capture the importance of each variable by computing its impact, i.e., how much the feature-based splitting decision decreases the weighted impurity in a tree. In random forests, ranking the features by the average impurity decrease attributable to each variable leads to effective feature selection.
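To make the impurity-decrease idea concrete, here is a minimal base-R sketch that computes the weighted Gini impurity decrease produced by a single split. The helper names `gini()` and `impurity_decrease()`, the threshold, and the toy data are hypothetical illustrations, not part of any package; a real random forest averages such decreases over many nodes and trees.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity decrease obtained by splitting on feature x at a threshold
impurity_decrease <- function(x, labels, threshold) {
  left  <- labels[x <= threshold]
  right <- labels[x >  threshold]
  n <- length(labels)
  gini(labels) -
    (length(left)  / n) * gini(left) -
    (length(right) / n) * gini(right)
}

# Toy data (hypothetical): x1 separates the two classes perfectly, x2 does not
labels <- c("a", "a", "a", "b", "b", "b")
x1 <- c(1, 2, 3, 7, 8, 9)
x2 <- c(1, 7, 2, 8, 3, 9)

dec_x1 <- impurity_decrease(x1, labels, threshold = 5)
dec_x2 <- impurity_decrease(x2, labels, threshold = 5)
print(c(x1 = dec_x1, x2 = dec_x2))
```

The feature with the larger decrease (here `x1`) would be ranked as more important for this split.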
First things first, let’s explore the dataset we will be using. Case Study 15, Amyotrophic Lateral Sclerosis (ALS), examines the patterns, symmetries, associations and causality in a rare but devastating disease, amyotrophic lateral sclerosis (ALS), also known as Lou Gehrig disease. This ALS case-study reflects a large clinical trial including big, multi-source and heterogeneous datasets. It would be interesting to interrogate the data and attempt to derive potential biomarkers that can be used for detecting, prognosticating, and forecasting the progression of this neurodegenerative disorder. Overcoming many scientific, technical and infrastructure barriers is required to establish complete, efficient, and reproducible protocols for such complex data. These pipeline workflows start with ingesting the raw data, preprocessing, aggregating, harmonizing, analyzing, visualizing and interpreting the findings.
In this case-study, we use the training dataset that contains 2,223 observations and 101 numeric variables. We select ALSFRS slope as our outcome variable, as it captures the patients’ clinical decline over a year. Although we have more observations than features, this is one of the examples where multiple features are highly correlated. Therefore, we need to preprocess the variables before commencing with feature selection.
The dataset is located in our case-studies archive. We can use read.csv() to directly import the CSV dataset into R using the URL reference.
ALS.train <- read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")
summary(ALS.train)
## ID Age_mean Albumin_max Albumin_median
## Min. : 1.0 Min. :18.00 Min. :37.00 Min. :34.50
## 1st Qu.: 614.5 1st Qu.:47.00 1st Qu.:45.00 1st Qu.:42.00
## Median :1213.0 Median :55.00 Median :47.00 Median :44.00
## Mean :1214.9 Mean :54.55 Mean :47.01 Mean :43.95
## 3rd Qu.:1815.5 3rd Qu.:63.00 3rd Qu.:49.00 3rd Qu.:46.00
## Max. :2424.0 Max. :81.00 Max. :70.30 Max. :51.10
## Albumin_min Albumin_range ALSFRS_slope ALSFRS_Total_max
## Min. :24.00 Min. :0.000000 Min. :-4.3452 Min. :11.00
## 1st Qu.:39.00 1st Qu.:0.009042 1st Qu.:-1.0863 1st Qu.:29.00
## Median :41.00 Median :0.012111 Median :-0.6207 Median :33.00
## Mean :40.77 Mean :0.013779 Mean :-0.7283 Mean :31.69
## 3rd Qu.:43.00 3rd Qu.:0.015873 3rd Qu.:-0.2838 3rd Qu.:36.00
## Max. :49.00 Max. :0.243902 Max. : 1.2070 Max. :40.00
## ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range ALT.SGPT._max
## Min. : 2.5 Min. : 0.00 Min. :0.00000 Min. : 10.00
## 1st Qu.:23.0 1st Qu.:14.00 1st Qu.:0.01404 1st Qu.: 32.00
## Median :28.0 Median :20.00 Median :0.02330 Median : 45.00
## Mean :27.1 Mean :19.88 Mean :0.02604 Mean : 54.44
## 3rd Qu.:32.0 3rd Qu.:27.00 3rd Qu.:0.03480 3rd Qu.: 65.00
## Max. :40.0 Max. :40.00 Max. :0.11765 Max. :944.00
## ALT.SGPT._median ALT.SGPT._min ALT.SGPT._range AST.SGOT._max
## Min. : 8.00 Min. : 1.60 Min. :0.002747 Min. : 11.00
## 1st Qu.: 22.00 1st Qu.: 15.00 1st Qu.:0.030303 1st Qu.: 30.00
## Median : 30.00 Median : 21.00 Median :0.047619 Median : 38.00
## Mean : 32.99 Mean : 23.01 Mean :0.071137 Mean : 43.13
## 3rd Qu.: 40.00 3rd Qu.: 28.00 3rd Qu.:0.077539 3rd Qu.: 48.00
## Max. :193.00 Max. :109.00 Max. :2.383117 Max. :911.00
## AST.SGOT._median AST.SGOT._min AST.SGOT._range Bicarbonate_max
## Min. : 9.00 Min. : 1.00 Min. :0.00000 Min. :20.0
## 1st Qu.: 22.00 1st Qu.:17.00 1st Qu.:0.02352 1st Qu.:29.0
## Median : 27.00 Median :20.00 Median :0.03502 Median :31.0
## Mean : 29.08 Mean :21.54 Mean :0.04919 Mean :30.9
## 3rd Qu.: 34.00 3rd Qu.:25.00 3rd Qu.:0.05243 3rd Qu.:32.0
## Max. :100.00 Max. :86.00 Max. :1.91667 Max. :52.0
## Bicarbonate_median Bicarbonate_min Bicarbonate_range
## Min. :19.50 Min. : 2.50 Min. :0.00000
## 1st Qu.:26.00 1st Qu.:22.00 1st Qu.:0.01266
## Median :27.00 Median :23.00 Median :0.01493
## Mean :26.96 Mean :23.16 Mean :0.01687
## 3rd Qu.:28.00 3rd Qu.:24.45 3rd Qu.:0.01815
## Max. :39.50 Max. :34.00 Max. :0.21429
## Blood.Urea.Nitrogen..BUN._max Blood.Urea.Nitrogen..BUN._median
## Min. : 2.921 Min. : 2.191
## 1st Qu.: 5.842 1st Qu.: 4.640
## Median : 6.937 Median : 5.423
## Mean : 7.353 Mean : 5.558
## 3rd Qu.: 8.210 3rd Qu.: 6.353
## Max. :25.192 Max. :11.866
## Blood.Urea.Nitrogen..BUN._min Blood.Urea.Nitrogen..BUN._range bp_diastolic_max
## Min. : 0.5842 Min. :0.000000 Min. : 70.00
## 1st Qu.: 3.2859 1st Qu.:0.004109 1st Qu.: 88.00
## Median : 4.0700 Median :0.005817 Median : 90.00
## Mean : 4.1609 Mean :0.007133 Mean : 92.03
## 3rd Qu.: 5.0000 3rd Qu.:0.008353 3rd Qu.: 98.00
## Max. :10.2228 Max. :0.069543 Max. :140.00
## bp_diastolic_median bp_diastolic_min bp_diastolic_range bp_systolic_max
## Min. : 56.00 Min. : 20.00 Min. :0.00000 Min. :100.0
## 1st Qu.: 78.00 1st Qu.: 65.00 1st Qu.:0.03527 1st Qu.:138.0
## Median : 80.00 Median : 70.00 Median :0.04337 Median :145.0
## Mean : 81.11 Mean : 69.89 Mean :0.04766 Mean :147.1
## 3rd Qu.: 85.00 3rd Qu.: 75.00 3rd Qu.:0.05435 3rd Qu.:157.0
## Max. :110.00 Max. :100.00 Max. :0.71429 Max. :220.0
## bp_systolic_median bp_systolic_min bp_systolic_range Calcium_max
## Min. : 90.0 Min. : 72.0 Min. :0.00000 Min. :2.171
## 1st Qu.:120.0 1st Qu.:108.0 1st Qu.:0.05272 1st Qu.:2.400
## Median :130.0 Median :110.0 Median :0.06494 Median :2.470
## Mean :129.6 Mean :113.4 Mean :0.07118 Mean :2.475
## 3rd Qu.:136.0 3rd Qu.:120.0 3rd Qu.:0.08190 3rd Qu.:2.530
## Max. :190.0 Max. :165.0 Max. :0.40462 Max. :9.460
## Calcium_median Calcium_min Calcium_range Chloride_max
## Min. :2.046 Min. :0.2438 Min. :0.0000000 Min. : 96.0
## 1st Qu.:2.283 1st Qu.:2.1707 1st Qu.:0.0003741 1st Qu.:106.0
## Median :2.345 Median :2.2300 Median :0.0004739 Median :107.0
## Mean :2.346 Mean :2.2229 Mean :0.0005407 Mean :107.2
## 3rd Qu.:2.400 3rd Qu.:2.2977 3rd Qu.:0.0005893 3rd Qu.:109.0
## Max. :2.800 Max. :2.6500 Max. :0.0129009 Max. :119.0
## Chloride_median Chloride_min Chloride_range Creatinine_max
## Min. : 90.0 Min. : 76.00 Min. :0.00000 Min. : 22.00
## 1st Qu.:102.0 1st Qu.: 98.00 1st Qu.:0.01250 1st Qu.: 65.00
## Median :104.0 Median :100.00 Median :0.01587 Median : 79.56
## Mean :103.5 Mean : 99.26 Mean :0.01787 Mean : 78.78
## 3rd Qu.:105.0 3rd Qu.:101.00 3rd Qu.:0.01990 3rd Qu.: 88.40
## Max. :111.0 Max. :109.00 Max. :0.21429 Max. :248.00
## Creatinine_median Creatinine_min Creatinine_range Gender_mean
## Min. : 18.00 Min. : 0.00 Min. :0.00000 Min. :1.000
## 1st Qu.: 53.04 1st Qu.: 39.00 1st Qu.:0.03824 1st Qu.:1.000
## Median : 62.00 Median : 53.00 Median :0.04865 Median :2.000
## Mean : 65.19 Mean : 51.98 Mean :0.05842 Mean :1.637
## 3rd Qu.: 78.85 3rd Qu.: 61.88 3rd Qu.:0.07026 3rd Qu.:2.000
## Max. :176.80 Max. :167.96 Max. :0.42095 Max. :2.000
## Glucose_max Glucose_median Glucose_min Glucose_range
## Min. : 4.160 Min. : 3.497 Min. : 0.000 Min. :0.000000
## 1st Qu.: 5.827 1st Qu.: 4.911 1st Qu.: 4.051 1st Qu.:0.003051
## Median : 6.500 Median : 5.300 Median : 4.440 Median :0.004695
## Mean : 7.160 Mean : 5.487 Mean : 4.265 Mean :0.006319
## 3rd Qu.: 7.600 3rd Qu.: 5.695 3rd Qu.: 4.800 3rd Qu.:0.007373
## Max. :33.688 Max. :26.196 Max. :12.200 Max. :0.097463
## hands_max hands_median hands_min hands_range
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000000
## 1st Qu.:5.000 1st Qu.:3.000 1st Qu.:0.000 1st Qu.:0.003610
## Median :7.000 Median :5.500 Median :3.000 Median :0.006652
## Mean :6.181 Mean :4.905 Mean :3.047 Mean :0.006883
## 3rd Qu.:8.000 3rd Qu.:7.000 3rd Qu.:5.000 3rd Qu.:0.009513
## Max. :8.000 Max. :8.000 Max. :8.000 Max. :0.042857
## Hematocrit_max Hematocrit_median Hematocrit_min Hematocrit_range
## Min. : 0.373 Min. : 0.362 Min. : 0.311 Min. :0.000000
## 1st Qu.:42.300 1st Qu.:40.000 1st Qu.:37.000 1st Qu.:0.007164
## Median :45.200 Median :42.600 Median :40.000 Median :0.009701
## Mean :41.939 Mean :39.467 Mean :36.962 Mean :0.011431
## 3rd Qu.:47.700 3rd Qu.:45.000 3rd Qu.:42.700 3rd Qu.:0.013579
## Max. :81.000 Max. :56.000 Max. :52.900 Max. :0.185714
## Hemoglobin_max Hemoglobin_median Hemoglobin_min Hemoglobin_range
## Min. :116.0 Min. :106.0 Min. : 6.204 Min. :0.00000
## 1st Qu.:144.0 1st Qu.:136.0 1st Qu.:128.000 1st Qu.:0.02321
## Median :152.0 Median :145.0 Median :136.000 Median :0.03106
## Mean :152.1 Mean :144.3 Mean :135.461 Mean :0.03824
## 3rd Qu.:160.0 3rd Qu.:152.0 3rd Qu.:145.000 3rd Qu.:0.04205
## Max. :280.0 Max. :182.0 Max. :180.000 Max. :0.56180
## leg_max leg_median leg_min leg_range
## Min. :0.00 Min. :0.00 Min. :0.000 Min. :0.000000
## 1st Qu.:3.00 1st Qu.:2.50 1st Qu.:1.000 1st Qu.:0.003378
## Median :5.00 Median :3.00 Median :2.000 Median :0.005435
## Mean :5.31 Mean :4.05 Mean :2.493 Mean :0.006163
## 3rd Qu.:8.00 3rd Qu.:6.00 3rd Qu.:3.000 3rd Qu.:0.008718
## Max. :8.00 Max. :8.00 Max. :8.000 Max. :0.042017
## mouth_max mouth_median mouth_min mouth_range
## Min. : 1.00 Min. : 0.000 Min. : 0.000 Min. :0.000000
## 1st Qu.:10.00 1st Qu.: 8.000 1st Qu.: 5.000 1st Qu.:0.001815
## Median :12.00 Median :11.000 Median : 9.000 Median :0.005329
## Mean :10.74 Mean : 9.703 Mean : 7.778 Mean :0.006595
## 3rd Qu.:12.00 3rd Qu.:12.000 3rd Qu.:11.000 3rd Qu.:0.010251
## Max. :12.00 Max. :12.000 Max. :12.000 Max. :0.036765
## onset_delta_mean onset_site_mean Platelets_max Platelets_median
## Min. :-3119 Min. :1.000 Min. : 84.0 Min. : 73.0
## 1st Qu.: -887 1st Qu.:2.000 1st Qu.:239.0 1st Qu.:204.0
## Median : -572 Median :2.000 Median :275.0 Median :233.0
## Mean : -683 Mean :1.801 Mean :285.3 Mean :238.8
## 3rd Qu.: -374 3rd Qu.:2.000 3rd Qu.:320.0 3rd Qu.:270.0
## Max. : -16 Max. :3.000 Max. :866.0 Max. :526.0
## Platelets_min Potassium_max Potassium_median Potassium_min
## Min. : 0.197 Min. : 3.400 Min. :3.000 Min. :2.400
## 1st Qu.:175.000 1st Qu.: 4.400 1st Qu.:4.000 1st Qu.:3.700
## Median :204.000 Median : 4.500 Median :4.200 Median :3.900
## Mean :208.382 Mean : 4.628 Mean :4.189 Mean :3.857
## 3rd Qu.:236.000 3rd Qu.: 4.800 3rd Qu.:4.300 3rd Qu.:4.000
## Max. :476.000 Max. :43.000 Max. :5.100 Max. :5.100
## Potassium_range pulse_max pulse_median pulse_min
## Min. :0.000000 Min. : 53.00 Min. : 50.00 Min. : 18.00
## 1st Qu.:0.001058 1st Qu.: 84.00 1st Qu.: 72.00 1st Qu.: 60.00
## Median :0.001425 Median : 90.00 Median : 77.00 Median : 64.00
## Mean :0.001744 Mean : 90.64 Mean : 76.97 Mean : 65.37
## 3rd Qu.:0.001913 3rd Qu.: 96.00 3rd Qu.: 81.00 3rd Qu.: 70.00
## Max. :0.098674 Max. :144.00 Max. :115.00 Max. :102.00
## pulse_range respiratory_max respiratory_median respiratory_min
## Min. :0.005425 Min. :2.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.036755 1st Qu.:4.00 1st Qu.:3.000 1st Qu.:2.000
## Median :0.048821 Median :4.00 Median :4.000 Median :3.000
## Mean :0.053587 Mean :3.91 Mean :3.593 Mean :2.791
## 3rd Qu.:0.062365 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :0.500000 Max. :4.00 Max. :4.000 Max. :4.000
## respiratory_range Sodium_max Sodium_median Sodium_min
## Min. :0.000000 Min. :134.0 Min. :128.0 Min. :112.0
## 1st Qu.:0.000000 1st Qu.:142.0 1st Qu.:139.0 1st Qu.:135.0
## Median :0.001828 Median :143.0 Median :140.0 Median :137.0
## Mean :0.002513 Mean :143.4 Mean :140.1 Mean :136.8
## 3rd Qu.:0.003653 3rd Qu.:145.0 3rd Qu.:141.0 3rd Qu.:138.0
## Max. :0.025424 Max. :169.0 Max. :146.5 Max. :145.0
## Sodium_range SubjectID trunk_max trunk_median
## Min. :0.00000 Min. : 533 Min. :0.000 Min. :0.000
## 1st Qu.:0.01058 1st Qu.:240826 1st Qu.:5.000 1st Qu.:3.000
## Median :0.01312 Median :496835 Median :7.000 Median :5.000
## Mean :0.01500 Mean :498880 Mean :6.204 Mean :4.893
## 3rd Qu.:0.01728 3rd Qu.:750301 3rd Qu.:8.000 3rd Qu.:6.500
## Max. :0.14286 Max. :999482 Max. :8.000 Max. :8.000
## trunk_min trunk_range Urine.Ph_max Urine.Ph_median
## Min. :0.000 Min. :0.000000 Min. :5.00 Min. :5.000
## 1st Qu.:1.000 1st Qu.:0.003643 1st Qu.:6.00 1st Qu.:5.000
## Median :3.000 Median :0.006920 Median :7.00 Median :6.000
## Mean :2.956 Mean :0.007136 Mean :6.82 Mean :5.711
## 3rd Qu.:5.000 3rd Qu.:0.009639 3rd Qu.:7.00 3rd Qu.:6.000
## Max. :8.000 Max. :0.042017 Max. :9.00 Max. :9.000
## Urine.Ph_min
## Min. :5.000
## 1st Qu.:5.000
## Median :5.000
## Mean :5.183
## 3rd Qu.:5.000
## Max. :8.000
There are 101 variables, and some of them represent statistics such as the max, min, and median values of the same clinical measurement.
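Because several summaries of one measurement tend to move together, a simple filter-style check is to look for highly correlated pairs of columns. The following is a minimal sketch on synthetic data, not on the ALS dataset; the column names, the noise levels, and the 0.9 cutoff are assumptions chosen only for illustration.

```r
set.seed(1)
n <- 200
base_signal <- rnorm(n)

# Synthetic data: two noisy summaries of the same signal plus an unrelated column
toy <- data.frame(
  m_max    = base_signal + rnorm(n, sd = 0.1),
  m_median = base_signal + rnorm(n, sd = 0.1),
  other    = rnorm(n)
)

cor_mat <- abs(cor(toy))
# Keep each pair once by zeroing the lower triangle and the diagonal
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- 0
high_pairs <- which(cor_mat > 0.9, arr.ind = TRUE)

# Report the highly correlated pairs by name
pairs_df <- data.frame(
  feature_1 = rownames(cor_mat)[high_pairs[, "row"]],
  feature_2 = colnames(cor_mat)[high_pairs[, "col"]]
)
print(pairs_df)
```

One member of each reported pair could then be kept as the representative feature; which one to keep is a modeling decision that this sketch does not make.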
Now let’s explore the Boruta() function in the Boruta package to perform variable selection based on random forest classification. Boruta() includes the following components:
vs<-Boruta(class~features, data=Mydata, pValue = 0.01, mcAdj = TRUE, maxRuns = 100, doTrace=0, getImp = getImpRfZ, ...)
- class: variable for class labels.
- features: potential features to select from.
- data: dataset containing classes and features.
- pValue: confidence level. The default value is 0.01 (note that we are testing multiple variables simultaneously).
- mcAdj: default TRUE, which applies a multiple comparisons adjustment using the Bonferroni method.
- maxRuns: maximal number of importance source runs. You may increase it to resolve attributes left Tentative.
- doTrace: verbosity level. The default, 0, means no tracing; 1 reports the decision about each attribute as soon as it is justified; 2 is the same as 1, plus the number of attributes is reported at each importance source run.
- getImp: function used to obtain attribute importance. The default is getImpRfZ, which runs random forest from the ranger package and gathers \(Z\)-scores of the mean decrease accuracy measure.
The resulting vs object is of class Boruta and contains two important components:
- finalDecision: a factor of three values, Confirmed, Rejected, or Tentative, containing the final results of the feature selection process.
- ImpHistory: a data frame of importance of attributes gathered in each importance source run. Besides the predictors’ importance, it contains the maximal, mean, and minimal importance of shadow attributes for each run. Rejected attributes get -Inf importance. This output is set to NULL if we specify holdHistory=FALSE in the Boruta call.
Caution: Running the code below will take several minutes.
# install.packages("Boruta")
library(Boruta)
set.seed(123)
als <- Boruta(ALSFRS_slope~.-ID, data=ALS.train, doTrace=0)
print(als)
## Boruta performed 99 iterations in 4.066405 mins.
## 27 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_max and 22 more;
## 60 attributes confirmed unimportant: Age_mean, Albumin_max,
## Albumin_median, Albumin_min, Albumin_range and 55 more;
## 12 tentative attributes left: ALT.SGPT._min, Chloride_range,
## Hematocrit_max, Hematocrit_median, Hematocrit_min and 7 more;
als$ImpHistory[1:6, 1:10]
## Age_mean Albumin_max Albumin_median Albumin_min Albumin_range
## [1,] 2.2680963 0.37764697 0.2375537 -0.1580937 2.7918574
## [2,] 2.0267252 1.39739377 1.4813602 0.6770461 1.7430500
## [3,] 2.3157588 -0.58408581 1.0305236 2.0934090 0.8981331
## [4,] 2.4953558 -0.94574532 0.1539726 1.4514634 2.2579837
## [5,] 0.6570802 0.07801328 -0.7698394 1.6172399 1.9590540
## [6,] 2.9302386 0.99320619 -0.1421461 1.2192271 1.7620833
## ALSFRS_Total_max ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range
## [1,] 7.197587 7.769678 17.48219 25.79845
## [2,] 7.887404 8.688664 15.49813 26.35402
## [3,] 7.779168 8.822599 16.64904 25.56681
## [4,] 8.694571 7.061077 17.00731 24.67569
## [5,] 8.352961 8.404101 16.20194 27.51207
## [6,] 8.704381 7.606126 17.05258 27.01024
## ALT.SGPT._max
## [1,] 0.5698794
## [2,] 0.6220453
## [3,] 1.3444379
## [4,] 1.9128324
## [5,] -0.3869214
## [6,] 2.1655440
This is a fairly time-consuming computation. Boruta separates the important attributes from the unimportant and tentative ones. Here the importance is measured by the out-of-bag (OOB) error. The OOB error estimates the prediction error of machine learning methods (e.g., random forests and boosted decision trees) that use bootstrap aggregation to sub-sample the training data. OOB represents the mean prediction error on each training sample \(x_i\), using only the trees that did not include \(x_i\) in their bootstrap samples. Out-of-bag estimates provide an internal assessment of the learning accuracy and avoid the need for an independent external validation dataset.
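The OOB idea can be sketched in a few lines of base R. In this illustration, simple linear models stand in for the trees of a random forest (a deliberate simplification), and the synthetic data, the number of bootstrap replicates `B`, and all variable names are assumptions made for the example.

```r
set.seed(42)
n <- 150
x <- runif(n, -2, 2)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
toy <- data.frame(x = x, y = y)

B <- 200                      # number of bootstrap replicates (assumed value)
pred_sum   <- numeric(n)      # running sum of OOB predictions per observation
pred_count <- integer(n)      # number of models for which an observation was OOB

for (b in seq_len(B)) {
  in_bag <- sample(n, n, replace = TRUE)       # bootstrap sample of row indices
  oob    <- setdiff(seq_len(n), in_bag)        # rows left out of this sample
  fit    <- lm(y ~ x, data = toy[in_bag, ])    # base learner (stand-in for a tree)
  pred_sum[oob]   <- pred_sum[oob] + predict(fit, newdata = toy[oob, ])
  pred_count[oob] <- pred_count[oob] + 1L
}

# Average, for each observation, only the predictions from models that never saw it
seen     <- pred_count > 0
oob_pred <- pred_sum[seen] / pred_count[seen]
oob_mse  <- mean((toy$y[seen] - oob_pred)^2)
print(oob_mse)
```

Each observation is predicted only by models trained without it, so `oob_mse` behaves like a validation error even though no separate test set was held out.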
The importance scores for all features at every iteration are stored in the data frame als$ImpHistory. Let’s plot a graph depicting the essential features.
Note: Again, running this code will take several minutes to complete.
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# plot(als, xlab="", xaxt="n")
# lz<-lapply(1:ncol(als$ImpHistory), function(i)
# als$ImpHistory[is.finite(als$ImpHistory[, i]), i])
# names(lz)<-colnames(als$ImpHistory)
# lb<-sort(sapply(lz, median))
# axis(side=1, las=2, labels=names(lb), at=1:ncol(als$ImpHistory), cex.axis=0.5, font = 4)
df_long <- tidyr::gather(as.data.frame(als$ImpHistory), feature, measurement)
plot_ly(df_long, y = ~measurement, color = ~feature, type = "box") %>%
layout(title="Box-and-whisker Plots across all 102 Features (ALS Data)",
xaxis = list(title="Features"),
yaxis = list(title="Importance"),
showlegend=F)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
We can see that plotting the graph is easy but extracting matched feature names may require more work. The basic plot is done by the call plot(als, xlab="", xaxt="n"), where xaxt="n" suppresses labeling of the x-axis. The following (commented-out) lines in the script reconstruct the correct x-axis labels. Here lz is a list created by the lapply() function, and each element in lz contains all the importance scores for a single feature in the original dataset. The is.finite() filter excludes the infinite (-Inf) importance values recorded for rejected features. Then, we sort the features according to their median importance and print their names on the x-axis using axis().
We have already seen similar groups of boxplots back in Chapter 2 and Chapter 3. In the default plot(als) rendering, variables with green boxes are more important than the ones represented with red boxes, and we can see the range of importance scores within a single variable in the graph.
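The comparison against shadow attributes can be illustrated with a small base-R sketch. This is not the Boruta algorithm itself: the absolute correlation with the outcome is used as a hypothetical stand-in for the random forest importance, and the sketch only counts how often each real feature beats the best shadow, without the statistical test that Boruta applies. The data and all names are assumptions.

```r
set.seed(11)
n <- 300
X <- data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
y <- 3 * X$signal + rnorm(n)

# Hypothetical importance score: absolute correlation with the outcome
importance <- function(df, y) sapply(df, function(col) abs(cor(col, y)))

runs <- 50
hits <- setNames(integer(ncol(X)), names(X))
for (r in seq_len(runs)) {
  # Shadow attributes: permuted copies of the features, unrelated to y by construction
  shadows <- as.data.frame(lapply(X, sample))
  names(shadows) <- paste0("shadow_", names(X))
  imp <- importance(cbind(X, shadows), y)
  shadow_max <- max(imp[names(shadows)])
  # A feature scores a hit when it is more important than the best shadow
  hits <- hits + as.integer(imp[names(X)] > shadow_max)
}
print(hits)
```

A feature that beats the best shadow in nearly every run is a candidate for confirmation, whereas one that rarely does is a candidate for rejection.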
It may be desirable to get rid of tentative features using TentativeRoughFix(). Note that this function should be used only when a strict decision is highly desired, because its test is much weaker than the Boruta test and can lower the confidence of the final result.
final.als <- TentativeRoughFix(als)
print(final.als)
## Boruta performed 99 iterations in 4.066405 mins.
## Tentatives roughfixed over the last 99 iterations.
## 28 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_max and 23 more;
## 71 attributes confirmed unimportant: Age_mean, Albumin_max,
## Albumin_median, Albumin_min, Albumin_range and 66 more;
final.als$finalDecision
## Age_mean Albumin_max
## Rejected Rejected
## Albumin_median Albumin_min
## Rejected Rejected
## Albumin_range ALSFRS_Total_max
## Rejected Confirmed
## ALSFRS_Total_median ALSFRS_Total_min
## Confirmed Confirmed
## ALSFRS_Total_range ALT.SGPT._max
## Confirmed Rejected
## ALT.SGPT._median ALT.SGPT._min
## Rejected Rejected
## ALT.SGPT._range AST.SGOT._max
## Rejected Rejected
## AST.SGOT._median AST.SGOT._min
## Rejected Rejected
## AST.SGOT._range Bicarbonate_max
## Rejected Rejected
## Bicarbonate_median Bicarbonate_min
## Rejected Rejected
## Bicarbonate_range Blood.Urea.Nitrogen..BUN._max
## Rejected Rejected
## Blood.Urea.Nitrogen..BUN._median Blood.Urea.Nitrogen..BUN._min
## Rejected Rejected
## Blood.Urea.Nitrogen..BUN._range bp_diastolic_max
## Rejected Rejected
## bp_diastolic_median bp_diastolic_min
## Rejected Rejected
## bp_diastolic_range bp_systolic_max
## Rejected Rejected
## bp_systolic_median bp_systolic_min
## Rejected Rejected
## bp_systolic_range Calcium_max
## Rejected Rejected
## Calcium_median Calcium_min
## Rejected Rejected
## Calcium_range Chloride_max
## Rejected Rejected
## Chloride_median Chloride_min
## Rejected Rejected
## Chloride_range Creatinine_max
## Rejected Confirmed
## Creatinine_median Creatinine_min
## Confirmed Confirmed
## Creatinine_range Gender_mean
## Rejected Rejected
## Glucose_max Glucose_median
## Rejected Rejected
## Glucose_min Glucose_range
## Rejected Rejected
## hands_max hands_median
## Confirmed Confirmed
## hands_min hands_range
## Confirmed Confirmed
## Hematocrit_max Hematocrit_median
## Rejected Rejected
## Hematocrit_min Hematocrit_range
## Rejected Rejected
## Hemoglobin_max Hemoglobin_median
## Rejected Confirmed
## Hemoglobin_min Hemoglobin_range
## Rejected Rejected
## leg_max leg_median
## Confirmed Confirmed
## leg_min leg_range
## Confirmed Confirmed
## mouth_max mouth_median
## Confirmed Confirmed
## mouth_min mouth_range
## Confirmed Confirmed
## onset_delta_mean onset_site_mean
## Confirmed Rejected
## Platelets_max Platelets_median
## Rejected Rejected
## Platelets_min Potassium_max
## Rejected Rejected
## Potassium_median Potassium_min
## Rejected Rejected
## Potassium_range pulse_max
## Rejected Rejected
## pulse_median pulse_min
## Rejected Rejected
## pulse_range respiratory_max
## Rejected Rejected
## respiratory_median respiratory_min
## Confirmed Confirmed
## respiratory_range Sodium_max
## Confirmed Rejected
## Sodium_median Sodium_min
## Rejected Rejected
## Sodium_range SubjectID
## Rejected Rejected
## trunk_max trunk_median
## Confirmed Confirmed
## trunk_min trunk_range
## Confirmed Confirmed
## Urine.Ph_max Urine.Ph_median
## Rejected Rejected
## Urine.Ph_min
## Rejected
## Levels: Tentative Confirmed Rejected
getConfirmedFormula(final.als)
## ALSFRS_slope ~ ALSFRS_Total_max + ALSFRS_Total_median + ALSFRS_Total_min +
## ALSFRS_Total_range + Creatinine_max + Creatinine_median +
## Creatinine_min + hands_max + hands_median + hands_min + hands_range +
## Hemoglobin_median + leg_max + leg_median + leg_min + leg_range +
## mouth_max + mouth_median + mouth_min + mouth_range + onset_delta_mean +
## respiratory_median + respiratory_min + respiratory_range +
## trunk_max + trunk_median + trunk_min + trunk_range
## <environment: 0x0000000037f6f450>
# report the Boruta "Confirmed" & "Tentative" features, removing the "Rejected" ones
print(final.als$finalDecision[final.als$finalDecision %in% c("Confirmed", "Tentative")])
## ALSFRS_Total_max ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range
## Confirmed Confirmed Confirmed Confirmed
## Creatinine_max Creatinine_median Creatinine_min hands_max
## Confirmed Confirmed Confirmed Confirmed
## hands_median hands_min hands_range Hemoglobin_median
## Confirmed Confirmed Confirmed Confirmed
## leg_max leg_median leg_min leg_range
## Confirmed Confirmed Confirmed Confirmed
## mouth_max mouth_median mouth_min mouth_range
## Confirmed Confirmed Confirmed Confirmed
## onset_delta_mean respiratory_median respiratory_min respiratory_range
## Confirmed Confirmed Confirmed Confirmed
## trunk_max trunk_median trunk_min trunk_range
## Confirmed Confirmed Confirmed Confirmed
## Levels: Tentative Confirmed Rejected
# how many are actually "confirmed" as important/salient?
impBoruta <- final.als$finalDecision[final.als$finalDecision %in% c("Confirmed")]; length(impBoruta)
## [1] 28
This shows the final feature selection result.
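The rough-fix step used above can be illustrated on a toy importance history. As described in the Boruta package documentation, a tentative attribute is claimed as confirmed when its median importance exceeds the median importance of the maximal shadow attribute, and rejected otherwise. The sketch below mimics that rule in base R; it is a simplified illustration rather than the package's implementation, and the matrix values and feature names are hypothetical.

```r
# Toy importance history (hypothetical values): rows are runs, columns are attributes
imp_history <- cbind(
  featA     = c(3.1, 2.8, 3.4, 2.9, 3.0),
  featB     = c(1.2, 2.6, 0.9, 1.1, 2.7),
  shadowMax = c(2.0, 2.5, 1.8, 2.2, 2.4)
)
tentative <- c("featA", "featB")

# Compare each tentative attribute's median with the median of the best shadow
shadow_median  <- median(imp_history[, "shadowMax"])
feature_median <- apply(imp_history[, tentative, drop = FALSE], 2, median)
rough_decision <- ifelse(feature_median > shadow_median, "Confirmed", "Rejected")
print(rough_decision)
```

Here `featA` is consistently above the shadow level, whereas `featB` exceeds it only in some runs, so its median falls below the shadow median.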
Let’s compare the Boruta results against a classical variable selection method - recursive feature elimination (RFE). First, we need to load two packages: caret and randomForest. Then, similar to Chapter 14, we must specify a resampling method. Here we use 10-fold CV to do the resampling.
library(caret)
library(randomForest)
set.seed(123)
control <- rfeControl(functions = rfFuncs, method = "cv", number=10)
Now, all preparations are complete and we are ready to do the RFE variable selection.
rf.train <- rfe(ALS.train[, -c(1, 7)], ALS.train[, 7], sizes=c(10, 20, 30, 40), rfeControl=control)
rf.train
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 10 0.3490 0.6831 0.2479 0.03305 0.05288 0.01580
## 20 0.3468 0.6876 0.2463 0.03110 0.04798 0.01396 *
## 30 0.3482 0.6852 0.2479 0.03253 0.04827 0.01474
## 40 0.3470 0.6876 0.2473 0.03411 0.04927 0.01494
## 99 0.3511 0.6807 0.2499 0.03300 0.04967 0.01472
##
## The top 5 variables (out of 20):
## ALSFRS_Total_range, hands_range, trunk_range, ALSFRS_Total_min, mouth_range
This calculation may take a long time to complete. The RFE invocation is different from Boruta. Here we have to specify the feature data frame and the outcome vector (the class labels in a classification setting) separately. Also, the sizes= option allows us to specify the candidate numbers of features to retain in the model. Here we use sizes=c(10, 20, 30, 40) to compare the model performance for alternative numbers of features.
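The logic of recursive feature elimination can be sketched in base R. This is a simplified illustration and not caret's `rfe()` procedure: a linear model replaces the random forest, the absolute t-statistic serves as a hypothetical importance score, one feature is dropped per step, and the ranking is computed on the full data rather than inside each resample. The synthetic data and the helper `cv_rmse()` are assumptions.

```r
set.seed(7)
n <- 300
X <- as.data.frame(matrix(rnorm(n * 6), n, 6))
names(X) <- paste0("x", 1:6)
y <- 2 * X$x1 - 1.5 * X$x2 + rnorm(n)      # only x1 and x2 carry signal
dat <- cbind(y = y, X)

# k-fold cross-validated RMSE of a linear model using the given features
cv_rmse <- function(features, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  errs <- sapply(seq_len(k), function(f) {
    fit  <- lm(reformulate(features, response = "y"), data = data[folds != f, ])
    pred <- predict(fit, newdata = data[folds == f, ])
    sqrt(mean((data$y[folds == f] - pred)^2))
  })
  mean(errs)
}

features <- names(X)
results  <- data.frame()
while (length(features) > 0) {
  fit   <- lm(reformulate(features, response = "y"), data = dat)
  tvals <- setNames(abs(coef(summary(fit))[features, "t value"]), features)
  results <- rbind(results, data.frame(
    size     = length(features),
    rmse     = cv_rmse(features, dat),
    features = paste(features, collapse = ", ")
  ))
  # Eliminate the least important remaining feature and repeat
  features <- setdiff(features, names(which.min(tvals)))
}
print(results)
```

Reading the `results` table from the largest subset down shows how the cross-validated RMSE changes as features are removed, which is the same information the sizes= option exposes for a few chosen subset sizes.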
To visualize the results, we can plot the 5 different feature size combinations listed in the summary. The one with 20 features has the lowest RMSE measure, although the RMSE differences across subset sizes are small. This is somewhat fewer than the 28 features confirmed in the Boruta output.
plot(rf.train, type=c("g", "o"), cex=1, col=1:5)
# df <- as.data.frame(cbind(variables=rf.train$variables$var[1:5], RMSE=rf.train$results$RMSE,
# Rsquared=rf.train$results$Rsquared, MAE=rf.train$results$MAE,
# RMSESD = rf.train$results$RMSESD,
# RsquaredSD= rf.train$results$RsquaredSD, MAESD=rf.train$results$MAESD))
#
# data_long <- tidyr::gather(df, Metric, value, RMSE:MAESD, factor_key=TRUE)
#
# plot_ly(data_long, x=~variables, y=~value, color=~as.factor(Metric), type = "scatter", mode="lines")
Using the functions predictors() and getSelectedAttributes(), we can compare the final results of the two alternative feature selection methods.
predRFE <- predictors(rf.train)
predBoruta <- getSelectedAttributes(final.als, withTentative = F)
The two selections overlap substantially:
intersect(predBoruta, predRFE)
## [1] "ALSFRS_Total_max" "ALSFRS_Total_median" "ALSFRS_Total_min"
## [4] "ALSFRS_Total_range" "Creatinine_max" "hands_max"
## [7] "hands_median" "hands_min" "hands_range"
## [10] "leg_median" "leg_min" "leg_range"
## [13] "mouth_median" "mouth_min" "mouth_range"
## [16] "onset_delta_mean" "respiratory_range" "trunk_median"
## [19] "trunk_min" "trunk_range"
There are 20 common variables chosen by the two techniques, i.e., every feature selected by RFE is also confirmed by Boruta, which suggests that both the Boruta and RFE methods are robust. Also, notice that the Boruta method can give similar results without utilizing the size option. If we want to consider 10 or more different sizes, the RFE procedure will be quite time consuming. Thus, the Boruta method is effective when dealing with complex real world problems.
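Beyond the intersection, it can help to see which features each method selects on its own and to summarize the agreement in a single number. The sketch below uses two hypothetical name vectors (placeholders, not the ALS results) and the Jaccard index, which is an extra summary not used elsewhere in this chapter.

```r
# Hypothetical selections from two methods (placeholder names, not the ALS results)
sel_a <- c("f1", "f2", "f3", "f4", "f5")
sel_b <- c("f2", "f3", "f5", "f6")

common  <- intersect(sel_a, sel_b)      # chosen by both methods
only_a  <- setdiff(sel_a, sel_b)        # chosen only by the first method
only_b  <- setdiff(sel_b, sel_a)        # chosen only by the second method
jaccard <- length(common) / length(union(sel_a, sel_b))

print(list(common = common, only_a = only_a, only_b = only_b, jaccard = jaccard))
```

With the objects defined earlier, the same calls could be applied to predBoruta and predRFE, e.g., setdiff(predBoruta, predRFE) would list the features confirmed by Boruta but not retained by RFE.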
Next, we can contrast the Boruta feature selection results against another classical variable selection method - stepwise model selection. Let’s start with fitting a bidirectional stepwise linear model-based feature selection.
data2 <- ALS.train[, -1]
# Define a base model - intercept only
base.mod <- lm(ALSFRS_slope ~ 1 , data= data2)
# Define the full model - including all predictors
all.mod <- lm(ALSFRS_slope ~ . , data= data2)
# ols_step <- lm(ALSFRS_slope ~ ., data=data2)
ols_step <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = 'both', k=2, trace = F)
summary(ols_step); # ols_step
##
## Call:
## lm(formula = ALSFRS_slope ~ ALSFRS_Total_range + ALSFRS_Total_median +
## ALSFRS_Total_min + Calcium_range + Calcium_max + bp_diastolic_min +
## onset_delta_mean + Calcium_min + Albumin_range + Glucose_range +
## ALT.SGPT._median + AST.SGOT._median + Glucose_max + Glucose_min +
## Creatinine_range + Potassium_range + Chloride_range + Chloride_min +
## Sodium_median + respiratory_min + respiratory_range + respiratory_max +
## trunk_range + pulse_range + Bicarbonate_max + Bicarbonate_range +
## Chloride_max + onset_site_mean + trunk_max + Gender_mean +
## Creatinine_min, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22558 -0.17875 -0.02024 0.17098 1.95100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.176e-01 6.064e-01 0.689 0.491091
## ALSFRS_Total_range -2.260e+01 1.359e+00 -16.631 < 2e-16 ***
## ALSFRS_Total_median -3.388e-02 2.868e-03 -11.812 < 2e-16 ***
## ALSFRS_Total_min 2.821e-02 3.310e-03 8.524 < 2e-16 ***
## Calcium_range 2.410e+02 4.188e+01 5.754 9.94e-09 ***
## Calcium_max -4.258e-01 8.846e-02 -4.813 1.59e-06 ***
## bp_diastolic_min -2.249e-03 8.856e-04 -2.540 0.011161 *
## onset_delta_mean -5.461e-05 1.980e-05 -2.758 0.005856 **
## Calcium_min 3.579e-01 9.501e-02 3.767 0.000169 ***
## Albumin_range -2.305e+00 8.197e-01 -2.812 0.004967 **
## Glucose_range -1.510e+01 2.929e+00 -5.156 2.75e-07 ***
## ALT.SGPT._median -2.300e-03 7.998e-04 -2.876 0.004062 **
## AST.SGOT._median 3.369e-03 1.276e-03 2.641 0.008316 **
## Glucose_max 3.279e-02 7.082e-03 4.630 3.88e-06 ***
## Glucose_min -3.507e-02 8.718e-03 -4.023 5.95e-05 ***
## Creatinine_range 5.076e-01 2.214e-01 2.293 0.021925 *
## Potassium_range -4.535e+00 2.607e+00 -1.739 0.082128 .
## Chloride_range 5.318e+00 1.188e+00 4.475 8.04e-06 ***
## Chloride_min 1.672e-02 3.797e-03 4.404 1.12e-05 ***
## Sodium_median -9.830e-03 4.639e-03 -2.119 0.034227 *
## respiratory_min -1.453e-01 2.442e-02 -5.948 3.14e-09 ***
## respiratory_range -5.834e+01 1.013e+01 -5.757 9.78e-09 ***
## respiratory_max 1.712e-01 3.395e-02 5.042 4.99e-07 ***
## trunk_range -8.705e+00 3.088e+00 -2.819 0.004860 **
## pulse_range -5.117e-01 3.016e-01 -1.697 0.089874 .
## Bicarbonate_max 7.526e-03 2.931e-03 2.568 0.010292 *
## Bicarbonate_range -2.204e+00 9.567e-01 -2.304 0.021329 *
## Chloride_max -6.918e-03 3.952e-03 -1.751 0.080143 .
## onset_site_mean 3.359e-02 2.019e-02 1.663 0.096359 .
## trunk_max 2.288e-02 8.453e-03 2.706 0.006854 **
## Gender_mean -3.360e-02 1.751e-02 -1.919 0.055066 .
## Creatinine_min 7.643e-04 4.977e-04 1.536 0.124771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3355 on 2191 degrees of freedom
## Multiple R-squared: 0.7135, Adjusted R-squared: 0.7094
## F-statistic: 176 on 31 and 2191 DF, p-value: < 2.2e-16
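The `k` argument of `step()` sets the penalty charged per model parameter: `k = 2` corresponds to the AIC used above, while `k = log(n)` gives a BIC-type criterion that tends to keep fewer predictors. The sketch below contrasts the two on the built-in `mtcars` data, which is an assumption made purely for illustration and is not the ALS dataset.

```r
# Illustration on the built-in mtcars data (not the ALS data used above)
base.fit <- lm(mpg ~ 1, data = mtcars)   # intercept-only model
full.fit <- lm(mpg ~ ., data = mtcars)   # model with all predictors
scope <- list(lower = formula(base.fit), upper = formula(full.fit))

# AIC penalty (k = 2), as in the ALS example
step_aic <- step(base.fit, scope = scope, direction = "both",
                 k = 2, trace = FALSE)

# BIC-type penalty (k = log(n))
step_bic <- step(base.fit, scope = scope, direction = "both",
                 k = log(nrow(mtcars)), trace = FALSE)

# Predictors retained under each criterion
vars_aic <- attr(terms(step_aic), "term.labels")
vars_bic <- attr(terms(step_bic), "term.labels")
print(vars_aic)
print(vars_bic)
```

Comparing the two retained sets gives a quick sense of how sensitive the stepwise selection is to the choice of penalty.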
We can report the stepwise “Confirmed” (important) features:
# get the shortlisted variable
stepwiseConfirmedVars <- names(unlist(ols_step[[1]]))
# remove the intercept
stepwiseConfirmedVars <- stepwiseConfirmedVars[!stepwiseConfirmedVars %in% "(Intercept)"]
print(stepwiseConfirmedVars)
## [1] "ALSFRS_Total_range" "ALSFRS_Total_median" "ALSFRS_Total_min"
## [4] "Calcium_range" "Calcium_max" "bp_diastolic_min"
## [7] "onset_delta_mean" "Calcium_min" "Albumin_range"
## [10] "Glucose_range" "ALT.SGPT._median" "AST.SGOT._median"
## [13] "Glucose_max" "Glucose_min" "Creatinine_range"
## [16] "Potassium_range" "Chloride_range" "Chloride_min"
## [19] "Sodium_median" "respiratory_min" "respiratory_range"
## [22] "respiratory_max" "trunk_range" "pulse_range"
## [25] "Bicarbonate_max" "Bicarbonate_range" "Chloride_max"
## [28] "onset_site_mean" "trunk_max" "Gender_mean"
## [31] "Creatinine_min"
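The feature selection results of Boruta and step overlap on a core set of predictors, such as the ALSFRS_Total summaries. We can also rank the stepwise-selected features by their importance.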
library(mlbench)
library(caret)
# estimate variable importance
predStepwise <- varImp(ols_step, scale=FALSE)
# summarize importance
print(predStepwise)
## Overall
## ALSFRS_Total_range 16.630592
## ALSFRS_Total_median 11.812263
## ALSFRS_Total_min 8.523606
## Calcium_range 5.754045
## Calcium_max 4.812942
## bp_diastolic_min 2.539766
## onset_delta_mean 2.758465
## Calcium_min 3.767450
## Albumin_range 2.812018
## Glucose_range 5.156259
## ALT.SGPT._median 2.876338
## AST.SGOT._median 2.641369
## Glucose_max 4.629759
## Glucose_min 4.022642
## Creatinine_range 2.293301
## Potassium_range 1.739268
## Chloride_range 4.474709
## Chloride_min 4.403551
## Sodium_median 2.118710
## respiratory_min 5.948488
## respiratory_range 5.756735
## respiratory_max 5.041816
## trunk_range 2.819029
## pulse_range 1.696811
## Bicarbonate_max 2.568068
## Bicarbonate_range 2.303757
## Chloride_max 1.750666
## onset_site_mean 1.663481
## trunk_max 2.706410
## Gender_mean 1.919380
## Creatinine_min 1.535642
# plot predStepwise
# plot(predStepwise)
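For linear models, `caret::varImp()` reports the absolute value of each coefficient's t-statistic, which is why the `Overall` column above matches the magnitudes in the `t value` column of the model summary (e.g., 16.63 for `ALSFRS_Total_range`). The following sketch reproduces that calculation in base R on the built-in `mtcars` data; the dataset and the model formula are assumptions chosen only for illustration.

```r
# Illustration on mtcars: importance of each term as |t-statistic|
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Extract the t-values and drop the intercept row
tvals <- summary(fit)$coefficients[-1, "t value"]

# Rank the predictors by the absolute value of their t-statistic
imp <- sort(abs(tvals), decreasing = TRUE)
print(round(imp, 3))
```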
# Boruta vs. Stepwise feature selection
intersect(predBoruta, stepwiseConfirmedVars)
## [1] "ALSFRS_Total_median" "ALSFRS_Total_min" "ALSFRS_Total_range"
## [4] "Creatinine_min" "onset_delta_mean" "respiratory_min"
## [7] "respiratory_range" "trunk_max" "trunk_range"
There are \(9\) common variables chosen by the Boruta and Stepwise feature selection methods.
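Counting shared variables is easier to interpret when it is normalized by the sizes of the two sets. One option is the Jaccard index, the number of shared features divided by the number of distinct features across both sets. The helper `jaccard()` below is hypothetical (it is not part of Boruta or caret), and the two feature vectors are toy examples rather than the actual selections.

```r
# Hypothetical helper: Jaccard similarity of two feature sets
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / length(union(a, b))
}

# Toy feature sets, for illustration only
setA <- c("ALSFRS_Total_min", "trunk_range", "hands_min", "leg_min")
setB <- c("ALSFRS_Total_min", "trunk_range", "Calcium_max")

common <- intersect(setA, setB)
print(common)
print(jaccard(setA, setB))   # 2 shared / 5 distinct features = 0.4
```

With the real selections, one would call `jaccard(predBoruta, stepwiseConfirmedVars)`.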
There is another, more elaborate, stepwise feature selection technique, implemented in the function MASS::stepAIC(), that is useful for a wider range of object classes.
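As a minimal sketch of `MASS::stepAIC()` applied to a model class other than `lm`, the example below runs a bidirectional search on a logistic regression. It uses the `birthwt` data shipped with MASS, which is an assumption for illustration and unrelated to the ALS case study.

```r
library(MASS)   # provides stepAIC() and the birthwt data

# Illustration on MASS::birthwt: logistic model for low birth weight
bw <- birthwt
bw$race <- factor(bw$race, labels = c("white", "black", "other"))

full.glm <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui,
                family = binomial, data = bw)

# Bidirectional search starting from the full model, AIC penalty (k = 2)
step.glm <- stepAIC(full.glm, direction = "both", k = 2, trace = FALSE)

print(formula(step.glm))
print(step.glm$anova)   # record of the terms dropped or added at each step
```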
You can practice variable selection with the SOCR_Data_AD_BiomedBigMetadata dataset on the SOCR website. This is a smaller dataset that has 744 observations and 63 variables. Here we use DXCURREN, the current diagnosis, as the class variable.
Let’s import the dataset first.
library(rvest)
wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_AD_BiomedBigMetadata")
html_nodes(wiki_url, "#content")
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body" role="main">\n\t\t\t<a id="top"></a>\n\ ...
alzh <- html_table(html_nodes(wiki_url, "table")[[1]])
summary(alzh)
## SID MMSCORE FAQTOTAL GDTOTAL
## Min. : 2.0 Min. :18.00 Length:744 Min. :0.000
## 1st Qu.: 355.5 1st Qu.:25.00 Class :character 1st Qu.:0.000
## Median : 697.5 Median :27.00 Mode :character Median :1.000
## Mean : 707.5 Mean :26.81 Mean :1.367
## 3rd Qu.:1063.0 3rd Qu.:29.00 3rd Qu.:2.000
## Max. :1435.0 Max. :30.00 Max. :6.000
## adascog sobcdr DXCURREN DX_Conversion
## Length:744 Min. :0.000 Min. :1.000 Length:744
## Class :character 1st Qu.:0.000 1st Qu.:1.000 Class :character
## Mode :character Median :1.500 Median :2.000 Mode :character
## Mean :1.785 Mean :1.958
## 3rd Qu.:2.625 3rd Qu.:2.000
## Max. :9.000 Max. :3.000
## DXCONTYP DX_Confidence Gender Married
## Min. :-4.000 Length:744 Min. :1.000 Min. :1.000
## 1st Qu.:-4.000 Class :character 1st Qu.:1.000 1st Qu.:1.000
## Median :-4.000 Mode :character Median :1.000 Median :1.000
## Mean :-3.962 Mean :1.407 Mean :1.083
## 3rd Qu.:-4.000 3rd Qu.:2.000 3rd Qu.:1.000
## Max. : 3.000 Max. :2.000 Max. :2.000
## Education Age Weight_Kg VSBPSYS
## Min. : 6.00 Min. :55.00 Min. : -1.00 Min. : 90.0
## 1st Qu.:14.00 1st Qu.:71.00 1st Qu.: 64.67 1st Qu.:122.0
## Median :16.00 Median :76.00 Median : 74.39 Median :135.0
## Mean :15.64 Mean :75.49 Mean : 75.28 Mean :135.5
## 3rd Qu.:18.00 3rd Qu.:80.00 3rd Qu.: 84.48 3rd Qu.:146.0
## Max. :20.00 Max. :91.00 Max. :137.44 Max. :206.0
## VSBPDIA VSPULSE VSRESP VSTEMP
## Min. : 43.00 Min. : 40.00 Min. :-1.00 Min. :-1.00
## 1st Qu.: 68.00 1st Qu.: 58.00 1st Qu.:16.00 1st Qu.:36.10
## Median : 75.00 Median : 64.00 Median :16.00 Median :36.40
## Mean : 74.56 Mean : 65.17 Mean :16.68 Mean :36.35
## 3rd Qu.: 82.00 3rd Qu.: 72.00 3rd Qu.:18.00 3rd Qu.:36.70
## Max. :103.00 Max. :100.00 Max. :32.00 Max. :37.70
## SymptomeSeverety SymptomeChronicity BC.USEA BCVOMIT
## Length:744 Length:744 Min. :1.000 Min. :1.000
## Class :character Class :character 1st Qu.:1.000 1st Qu.:1.000
## Mode :character Mode :character Median :1.000 Median :1.000
## Mean :1.032 Mean :1.016
## 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000
## BCDIARRH BCCONSTP BCABDOMN BCSWEATN
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.097 Mean :1.106 Mean :1.074 Mean :1.056
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## BCDIZZY BCENERGY BCDROWSY BCVISION BCHDACHE
## Min. :1.000 Min. :1.0 Min. :1.00 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.0 1st Qu.:1.00 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.0 Median :1.00 Median :1.000 Median :1.000
## Mean :1.125 Mean :1.2 Mean :1.13 Mean :1.059 Mean :1.093
## 3rd Qu.:1.000 3rd Qu.:1.0 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.0 Max. :2.00 Max. :2.000 Max. :2.000
## BCDRYMTH BCBREATH BCCOUGH BCPALPIT
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.087 Mean :1.078 Mean :1.116 Mean :1.031
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## BCCHEST BCURNDIS BCURNFRQ BCANKLE
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.017 Mean :1.023 Mean :1.218 Mean :1.078
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## BCMUSCLE BCRASH BCINSOMN BCDPMOOD
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.364 Mean :1.073 Mean :1.112 Mean :1.122
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## BCCRYING BCELMOOD BCWANDER BCFALL
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.035 Mean :1.012 Mean :1.004 Mean :1.046
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## BCOTHER CTWHITE CTRED PROTEIN
## Min. :1.000 Length:744 Length:744 Length:744
## 1st Qu.:1.000 Class :character Class :character Class :character
## Median :1.000 Mode :character Mode :character Mode :character
## Mean :1.046
## 3rd Qu.:1.000
## Max. :2.000
## GLUCOSE ApoEGeneAllele1 ApoEGeneAllele2 CDMEMORY
## Length:744 Min. :2.000 Min. :2.000 Min. :0
## Class :character 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:0
## Mode :character Median :3.000 Median :3.000 Median :0
## Mean :3.023 Mean :3.489 Mean :0
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:0
## Max. :4.000 Max. :4.000 Max. :0
## CDORIENT CDJUDGE CDCOMMUN CDHOME
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.5000 Median :0.0000 Median :0.5000 Median :0.0000
## Mean :0.5047 Mean :0.3085 Mean :0.3683 Mean :0.2513
## 3rd Qu.:1.0000 3rd Qu.:0.5000 3rd Qu.:0.5000 3rd Qu.:0.5000
## Max. :2.0000 Max. :2.0000 Max. :2.0000 Max. :2.0000
## CDCARE CDGLOBAL
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.2849 Mean :0.0672
## 3rd Qu.:0.5000 3rd Qu.:0.0000
## Max. :2.0000 Max. :2.0000
The data summary shows that several variables were read in as character strings. After converting them to numeric, we find some missing data. We can manage this issue either by keeping only the complete observations of the original dataset or by using multivariate imputation, see Chapter 2.
chrtofactor <- c(3, 5, 8, 10, 21:22, 51:54)
alzh[alzh == "."] <- NA # replace all missing "." values with "NA"
alzh[chrtofactor] <- data.frame(apply(alzh[chrtofactor], 2, as.numeric))
alzh <- alzh[complete.cases(alzh), ]
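The same cleaning steps can be sketched without hard-coded column positions by detecting the character columns by type. The toy data frame below is an assumption that only mimics the scraped table, with "." marking a missing value.

```r
# Toy data frame mimicking the scraped table: "." marks a missing value
toy <- data.frame(id    = 1:4,
                  score = c("12", ".", "7.5", "3"),
                  age   = c(70, 65, 80, 72),
                  stringsAsFactors = FALSE)

# Find the character columns by type instead of by position
chr_cols <- names(toy)[sapply(toy, is.character)]

toy[chr_cols] <- lapply(toy[chr_cols], function(x) {
  x[x == "."] <- NA   # recode the missing-value marker
  as.numeric(x)       # then convert to numeric
})

# Keep only the rows without missing values
toy_complete <- toy[complete.cases(toy), ]
print(colSums(is.na(toy)))   # number of missing values per column
print(nrow(toy_complete))
```

Recoding "." before the conversion avoids the coercion warnings that `as.numeric()` would otherwise raise.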
For simplicity, here we eliminated the missing data and are left with 408 complete observations. Now, we can apply the Boruta method for feature selection.
set.seed(123)
train <- Boruta(DXCURREN~.-SID, data=alzh, doTrace=0)
print(train)
## Boruta performed 99 iterations in 7.208961 secs.
## 13 attributes confirmed important: adascog, ApoEGeneAllele2, BCBREATH,
## CDCARE, CDCOMMUN and 8 more;
## 47 attributes confirmed unimportant: Age, BC.USEA, BCABDOMN, BCANKLE,
## BCCHEST and 42 more;
## 1 tentative attributes left: ApoEGeneAllele1;
Your results may differ slightly, since the Boruta importance estimates are based on random forests. We can plot the variable importance graph in the same way as in the earlier Boruta example.
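One possible way to draw such an importance plot is sketched below. To keep the example self-contained it fits Boruta to the built-in `iris` data, which is an assumption; in the case study, the `train` object would take the place of `b`.

```r
library(Boruta)

set.seed(123)
# Illustration on iris (an assumption); substitute the fitted `train` object
b <- Boruta(Species ~ ., data = iris, doTrace = 0)

# Boxplots of the importance scores per attribute, including the shadow
# attributes; las = 2 rotates the axis labels so long names stay readable
plot(b, xlab = "", las = 2, cex.axis = 0.7, main = "Variable Importance")

# Numeric summary behind the plot (mean, median, min, max importance, decision)
stats <- attStats(b)
print(stats)
```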
The final step is to get rid of the tentative features.
## Boruta performed 99 iterations in 7.208961 secs.
## Tentatives roughfixed over the last 99 iterations.
## 14 attributes confirmed important: adascog, ApoEGeneAllele1,
## ApoEGeneAllele2, BCBREATH, CDCARE and 9 more;
## 47 attributes confirmed unimportant: Age, BC.USEA, BCABDOMN, BCANKLE,
## BCCHEST and 42 more;
## [1] "MMSCORE" "FAQTOTAL" "adascog" "sobcdr"
## [5] "DX_Confidence" "BCBREATH" "ApoEGeneAllele1" "ApoEGeneAllele2"
## [9] "CDORIENT" "CDJUDGE" "CDCOMMUN" "CDHOME"
## [13] "CDCARE" "CDGLOBAL"
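Output of this form is what `TentativeRoughFix()` followed by `getSelectedAttributes()` produces. The sketch below shows the two calls on the built-in `iris` data, which is an assumption made so the example runs on its own; in the case study, they would be applied to the `train` object. If no attributes are tentative, `TentativeRoughFix()` issues a warning and returns the object unchanged.

```r
library(Boruta)

set.seed(123)
# Illustration on iris (an assumption); substitute the fitted `train` object
b <- Boruta(Species ~ ., data = iris, doTrace = 0)

# Resolve attributes still marked "Tentative" by comparing their median
# importance with the median importance of the best shadow attribute
b_final <- suppressWarnings(TentativeRoughFix(b))
print(b_final)

# Names of the attributes confirmed as important
selected <- getSelectedAttributes(b_final, withTentative = FALSE)
print(selected)
```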
Can you reproduce these results? Also try to apply some of these techniques to other data from the list of our Case-Studies.