As we mentioned in Chapter 15, variable selection is very important when dealing with bioinformatics, healthcare, and biomedical data, where we may have more features than observations. Variable selection, or feature selection, helps us focus on the core information contained in the observations instead of every piece of information. Due to the presence of intrinsic and extrinsic noise, the volume and complexity of big health data, and various methodological and technological challenges, identifying the salient features may resemble finding a needle in a haystack. Here, we will illustrate alternative strategies for feature selection using filtering (e.g., correlation-based feature selection), wrapping (e.g., recursive feature elimination), and embedding (e.g., variable importance via random forest classification) techniques.
The next chapter, Chapter 17, details another powerful variable selection technique that uses decoy features to control the false discovery rate of inconsequential features.
There are three major classes of variable or feature selection techniques - filtering-based, wrapper-based, and embedded methods.
The different types of feature selection methods have their own pros and cons. In this chapter, we introduce a randomized wrapper method using the Boruta package, which utilizes a random forest method to output variable importance measures (VIMs). Then, we will compare its results with Recursive Feature Elimination, a classical deterministic wrapper method.
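To make the filtering class concrete, here is a minimal sketch of a correlation-based filter on simulated data. The data, the feature names x1 to x6, and the choice to keep the top two features are all assumptions made for illustration; they are not part of the ALS case study.

```r
# Filter-style selection: score each feature on its own, then keep the top k.
set.seed(1)
n <- 200
X <- as.data.frame(matrix(rnorm(n * 6), nrow = n))
names(X) <- paste0("x", 1:6)

# Hypothetical outcome: only x1 and x2 carry signal, x3..x6 are noise
y <- 2 * X$x1 - 1.5 * X$x2 + rnorm(n, sd = 0.5)

# Score = absolute Pearson correlation between each feature and the outcome
scores <- sapply(X, function(col) abs(cor(col, y)))
ranked <- sort(scores, decreasing = TRUE)

# Keep the k best-scoring features (k = 2 is an arbitrary choice here)
top2 <- names(ranked)[1:2]
print(round(ranked, 3))
print(top2)
```

A filter like this is fast because it never fits a predictive model, but it scores features one at a time and so can miss features that matter only in combination. Wrapper methods such as Boruta and RFE address that limitation at a higher computational cost.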
First things first, let’s explore the dataset we will be using. Case Study 15, Amyotrophic Lateral Sclerosis (ALS), examines the patterns, symmetries, associations and causality in a rare but devastating disease, amyotrophic lateral sclerosis (ALS), also known as Lou Gehrig disease. This ALS case-study reflects a large clinical trial including big, multi-source and heterogeneous datasets. It would be interesting to interrogate the data and attempt to derive potential biomarkers that can be used for detecting, prognosticating, and forecasting the progression of this neurodegenerative disorder. Overcoming many scientific, technical and infrastructure barriers is required to establish complete, efficient, and reproducible protocols for such complex data. These pipeline workflows start with ingesting the raw data, preprocessing, aggregating, harmonizing, analyzing, visualizing and interpreting the findings.
In this case-study, we use the training dataset that contains 2,223 observations and 101 numeric variables (the columns listed in the summary below). We select ALSFRS_slope as our outcome variable, as it captures the patients’ clinical decline over a year. Although we have more observations than features, this is one of the examples where multiple features are highly correlated. Therefore, we need to preprocess the variables before commencing with feature selection.
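One simple way to check for highly correlated features is to scan the correlation matrix for pairs above a chosen cutoff. The sketch below uses simulated data in which three hypothetical features (m_max, m_median, m_min) are noisy copies of the same underlying measurement; the 0.9 cutoff is an assumption, not a recommendation from this chapter.

```r
set.seed(2)
n <- 300
base <- rnorm(n)

# Three near-duplicate summaries of one measurement plus one unrelated feature
feats <- data.frame(
  m_max    = base + rnorm(n, sd = 0.1),
  m_median = base + rnorm(n, sd = 0.1),
  m_min    = base + rnorm(n, sd = 0.1),
  other    = rnorm(n)
)

cm <- cor(feats)
cutoff <- 0.9   # assumed threshold; tune for the application

# Use the upper triangle so that each pair is reported only once
idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  feature_1 = rownames(cm)[idx[, "row"]],
  feature_2 = colnames(cm)[idx[, "col"]],
  r         = cm[idx]
)
print(high_pairs)
```

The same scan could be applied to the predictor columns of ALS.train, where max, median, and min summaries of one clinical measurement are natural candidates for such pairs.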
The dataset is located in our case-studies archive. We can use read.csv() to directly import the CSV dataset into R using the URL reference.
ALS.train<-read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")
summary(ALS.train)
## ID Age_mean Albumin_max Albumin_median
## Min. : 1.0 Min. :18.00 Min. :37.00 Min. :34.50
## 1st Qu.: 614.5 1st Qu.:47.00 1st Qu.:45.00 1st Qu.:42.00
## Median :1213.0 Median :55.00 Median :47.00 Median :44.00
## Mean :1214.9 Mean :54.55 Mean :47.01 Mean :43.95
## 3rd Qu.:1815.5 3rd Qu.:63.00 3rd Qu.:49.00 3rd Qu.:46.00
## Max. :2424.0 Max. :81.00 Max. :70.30 Max. :51.10
## Albumin_min Albumin_range ALSFRS_slope ALSFRS_Total_max
## Min. :24.00 Min. :0.000000 Min. :-4.3452 Min. :11.00
## 1st Qu.:39.00 1st Qu.:0.009042 1st Qu.:-1.0863 1st Qu.:29.00
## Median :41.00 Median :0.012111 Median :-0.6207 Median :33.00
## Mean :40.77 Mean :0.013779 Mean :-0.7283 Mean :31.69
## 3rd Qu.:43.00 3rd Qu.:0.015873 3rd Qu.:-0.2838 3rd Qu.:36.00
## Max. :49.00 Max. :0.243902 Max. : 1.2070 Max. :40.00
## ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range ALT.SGPT._max
## Min. : 2.5 Min. : 0.00 Min. :0.00000 Min. : 10.00
## 1st Qu.:23.0 1st Qu.:14.00 1st Qu.:0.01404 1st Qu.: 32.00
## Median :28.0 Median :20.00 Median :0.02330 Median : 45.00
## Mean :27.1 Mean :19.88 Mean :0.02604 Mean : 54.44
## 3rd Qu.:32.0 3rd Qu.:27.00 3rd Qu.:0.03480 3rd Qu.: 65.00
## Max. :40.0 Max. :40.00 Max. :0.11765 Max. :944.00
## ALT.SGPT._median ALT.SGPT._min ALT.SGPT._range AST.SGOT._max
## Min. : 8.00 Min. : 1.60 Min. :0.002747 Min. : 11.00
## 1st Qu.: 22.00 1st Qu.: 15.00 1st Qu.:0.030303 1st Qu.: 30.00
## Median : 30.00 Median : 21.00 Median :0.047619 Median : 38.00
## Mean : 32.99 Mean : 23.01 Mean :0.071137 Mean : 43.13
## 3rd Qu.: 40.00 3rd Qu.: 28.00 3rd Qu.:0.077539 3rd Qu.: 48.00
## Max. :193.00 Max. :109.00 Max. :2.383117 Max. :911.00
## AST.SGOT._median AST.SGOT._min AST.SGOT._range Bicarbonate_max
## Min. : 9.00 Min. : 1.00 Min. :0.00000 Min. :20.0
## 1st Qu.: 22.00 1st Qu.:17.00 1st Qu.:0.02352 1st Qu.:29.0
## Median : 27.00 Median :20.00 Median :0.03502 Median :31.0
## Mean : 29.08 Mean :21.54 Mean :0.04919 Mean :30.9
## 3rd Qu.: 34.00 3rd Qu.:25.00 3rd Qu.:0.05243 3rd Qu.:32.0
## Max. :100.00 Max. :86.00 Max. :1.91667 Max. :52.0
## Bicarbonate_median Bicarbonate_min Bicarbonate_range
## Min. :19.50 Min. : 2.50 Min. :0.00000
## 1st Qu.:26.00 1st Qu.:22.00 1st Qu.:0.01266
## Median :27.00 Median :23.00 Median :0.01493
## Mean :26.96 Mean :23.16 Mean :0.01687
## 3rd Qu.:28.00 3rd Qu.:24.45 3rd Qu.:0.01815
## Max. :39.50 Max. :34.00 Max. :0.21429
## Blood.Urea.Nitrogen..BUN._max Blood.Urea.Nitrogen..BUN._median
## Min. : 2.921 Min. : 2.191
## 1st Qu.: 5.842 1st Qu.: 4.640
## Median : 6.937 Median : 5.423
## Mean : 7.353 Mean : 5.558
## 3rd Qu.: 8.210 3rd Qu.: 6.353
## Max. :25.192 Max. :11.866
## Blood.Urea.Nitrogen..BUN._min Blood.Urea.Nitrogen..BUN._range
## Min. : 0.5842 Min. :0.000000
## 1st Qu.: 3.2859 1st Qu.:0.004109
## Median : 4.0700 Median :0.005817
## Mean : 4.1609 Mean :0.007133
## 3rd Qu.: 5.0000 3rd Qu.:0.008353
## Max. :10.2228 Max. :0.069543
## bp_diastolic_max bp_diastolic_median bp_diastolic_min bp_diastolic_range
## Min. : 70.00 Min. : 56.00 Min. : 20.00 Min. :0.00000
## 1st Qu.: 88.00 1st Qu.: 78.00 1st Qu.: 65.00 1st Qu.:0.03527
## Median : 90.00 Median : 80.00 Median : 70.00 Median :0.04337
## Mean : 92.03 Mean : 81.11 Mean : 69.89 Mean :0.04766
## 3rd Qu.: 98.00 3rd Qu.: 85.00 3rd Qu.: 75.00 3rd Qu.:0.05435
## Max. :140.00 Max. :110.00 Max. :100.00 Max. :0.71429
## bp_systolic_max bp_systolic_median bp_systolic_min bp_systolic_range
## Min. :100.0 Min. : 90.0 Min. : 72.0 Min. :0.00000
## 1st Qu.:138.0 1st Qu.:120.0 1st Qu.:108.0 1st Qu.:0.05272
## Median :145.0 Median :130.0 Median :110.0 Median :0.06494
## Mean :147.1 Mean :129.6 Mean :113.4 Mean :0.07118
## 3rd Qu.:157.0 3rd Qu.:136.0 3rd Qu.:120.0 3rd Qu.:0.08190
## Max. :220.0 Max. :190.0 Max. :165.0 Max. :0.40462
## Calcium_max Calcium_median Calcium_min Calcium_range
## Min. :2.171 Min. :2.046 Min. :0.2438 Min. :0.0000000
## 1st Qu.:2.400 1st Qu.:2.283 1st Qu.:2.1707 1st Qu.:0.0003741
## Median :2.470 Median :2.345 Median :2.2300 Median :0.0004739
## Mean :2.475 Mean :2.346 Mean :2.2229 Mean :0.0005407
## 3rd Qu.:2.530 3rd Qu.:2.400 3rd Qu.:2.2977 3rd Qu.:0.0005893
## Max. :9.460 Max. :2.800 Max. :2.6500 Max. :0.0129009
## Chloride_max Chloride_median Chloride_min Chloride_range
## Min. : 96.0 Min. : 90.0 Min. : 76.00 Min. :0.00000
## 1st Qu.:106.0 1st Qu.:102.0 1st Qu.: 98.00 1st Qu.:0.01250
## Median :107.0 Median :104.0 Median :100.00 Median :0.01587
## Mean :107.2 Mean :103.5 Mean : 99.26 Mean :0.01787
## 3rd Qu.:109.0 3rd Qu.:105.0 3rd Qu.:101.00 3rd Qu.:0.01990
## Max. :119.0 Max. :111.0 Max. :109.00 Max. :0.21429
## Creatinine_max Creatinine_median Creatinine_min Creatinine_range
## Min. : 22.00 Min. : 18.00 Min. : 0.00 Min. :0.00000
## 1st Qu.: 65.00 1st Qu.: 53.04 1st Qu.: 39.00 1st Qu.:0.03824
## Median : 79.56 Median : 62.00 Median : 53.00 Median :0.04865
## Mean : 78.78 Mean : 65.19 Mean : 51.98 Mean :0.05842
## 3rd Qu.: 88.40 3rd Qu.: 78.85 3rd Qu.: 61.88 3rd Qu.:0.07026
## Max. :248.00 Max. :176.80 Max. :167.96 Max. :0.42095
## Gender_mean Glucose_max Glucose_median Glucose_min
## Min. :1.000 Min. : 4.160 Min. : 3.497 Min. : 0.000
## 1st Qu.:1.000 1st Qu.: 5.827 1st Qu.: 4.911 1st Qu.: 4.051
## Median :2.000 Median : 6.500 Median : 5.300 Median : 4.440
## Mean :1.637 Mean : 7.160 Mean : 5.487 Mean : 4.265
## 3rd Qu.:2.000 3rd Qu.: 7.600 3rd Qu.: 5.695 3rd Qu.: 4.800
## Max. :2.000 Max. :33.688 Max. :26.196 Max. :12.200
## Glucose_range hands_max hands_median hands_min
## Min. :0.000000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.003051 1st Qu.:5.000 1st Qu.:3.000 1st Qu.:0.000
## Median :0.004695 Median :7.000 Median :5.500 Median :3.000
## Mean :0.006319 Mean :6.181 Mean :4.905 Mean :3.047
## 3rd Qu.:0.007373 3rd Qu.:8.000 3rd Qu.:7.000 3rd Qu.:5.000
## Max. :0.097463 Max. :8.000 Max. :8.000 Max. :8.000
## hands_range Hematocrit_max Hematocrit_median Hematocrit_min
## Min. :0.000000 Min. : 0.373 Min. : 0.362 Min. : 0.311
## 1st Qu.:0.003610 1st Qu.:42.300 1st Qu.:40.000 1st Qu.:37.000
## Median :0.006652 Median :45.200 Median :42.600 Median :40.000
## Mean :0.006883 Mean :41.939 Mean :39.467 Mean :36.962
## 3rd Qu.:0.009513 3rd Qu.:47.700 3rd Qu.:45.000 3rd Qu.:42.700
## Max. :0.042857 Max. :81.000 Max. :56.000 Max. :52.900
## Hematocrit_range Hemoglobin_max Hemoglobin_median Hemoglobin_min
## Min. :0.000000 Min. :116.0 Min. :106.0 Min. : 6.204
## 1st Qu.:0.007164 1st Qu.:144.0 1st Qu.:136.0 1st Qu.:128.000
## Median :0.009701 Median :152.0 Median :145.0 Median :136.000
## Mean :0.011431 Mean :152.1 Mean :144.3 Mean :135.461
## 3rd Qu.:0.013579 3rd Qu.:160.0 3rd Qu.:152.0 3rd Qu.:145.000
## Max. :0.185714 Max. :280.0 Max. :182.0 Max. :180.000
## Hemoglobin_range leg_max leg_median leg_min
## Min. :0.00000 Min. :0.00 Min. :0.00 Min. :0.000
## 1st Qu.:0.02321 1st Qu.:3.00 1st Qu.:2.50 1st Qu.:1.000
## Median :0.03106 Median :5.00 Median :3.00 Median :2.000
## Mean :0.03824 Mean :5.31 Mean :4.05 Mean :2.493
## 3rd Qu.:0.04205 3rd Qu.:8.00 3rd Qu.:6.00 3rd Qu.:3.000
## Max. :0.56180 Max. :8.00 Max. :8.00 Max. :8.000
## leg_range mouth_max mouth_median mouth_min
## Min. :0.000000 Min. : 1.00 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.003378 1st Qu.:10.00 1st Qu.: 8.000 1st Qu.: 5.000
## Median :0.005435 Median :12.00 Median :11.000 Median : 9.000
## Mean :0.006163 Mean :10.74 Mean : 9.703 Mean : 7.778
## 3rd Qu.:0.008718 3rd Qu.:12.00 3rd Qu.:12.000 3rd Qu.:11.000
## Max. :0.042017 Max. :12.00 Max. :12.000 Max. :12.000
## mouth_range onset_delta_mean onset_site_mean Platelets_max
## Min. :0.000000 Min. :-3119 Min. :1.000 Min. : 84.0
## 1st Qu.:0.001815 1st Qu.: -887 1st Qu.:2.000 1st Qu.:239.0
## Median :0.005329 Median : -572 Median :2.000 Median :275.0
## Mean :0.006595 Mean : -683 Mean :1.801 Mean :285.3
## 3rd Qu.:0.010251 3rd Qu.: -374 3rd Qu.:2.000 3rd Qu.:320.0
## Max. :0.036765 Max. : -16 Max. :3.000 Max. :866.0
## Platelets_median Platelets_min Potassium_max Potassium_median
## Min. : 73.0 Min. : 0.197 Min. : 3.400 Min. :3.000
## 1st Qu.:204.0 1st Qu.:175.000 1st Qu.: 4.400 1st Qu.:4.000
## Median :233.0 Median :204.000 Median : 4.500 Median :4.200
## Mean :238.8 Mean :208.382 Mean : 4.628 Mean :4.189
## 3rd Qu.:270.0 3rd Qu.:236.000 3rd Qu.: 4.800 3rd Qu.:4.300
## Max. :526.0 Max. :476.000 Max. :43.000 Max. :5.100
## Potassium_min Potassium_range pulse_max pulse_median
## Min. :2.400 Min. :0.000000 Min. : 53.00 Min. : 50.00
## 1st Qu.:3.700 1st Qu.:0.001058 1st Qu.: 84.00 1st Qu.: 72.00
## Median :3.900 Median :0.001425 Median : 90.00 Median : 77.00
## Mean :3.857 Mean :0.001744 Mean : 90.64 Mean : 76.97
## 3rd Qu.:4.000 3rd Qu.:0.001913 3rd Qu.: 96.00 3rd Qu.: 81.00
## Max. :5.100 Max. :0.098674 Max. :144.00 Max. :115.00
## pulse_min pulse_range respiratory_max respiratory_median
## Min. : 18.00 Min. :0.005425 Min. :2.00 Min. :0.000
## 1st Qu.: 60.00 1st Qu.:0.036755 1st Qu.:4.00 1st Qu.:3.000
## Median : 64.00 Median :0.048821 Median :4.00 Median :4.000
## Mean : 65.37 Mean :0.053587 Mean :3.91 Mean :3.593
## 3rd Qu.: 70.00 3rd Qu.:0.062365 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :102.00 Max. :0.500000 Max. :4.00 Max. :4.000
## respiratory_min respiratory_range Sodium_max Sodium_median
## Min. :0.000 Min. :0.000000 Min. :134.0 Min. :128.0
## 1st Qu.:2.000 1st Qu.:0.000000 1st Qu.:142.0 1st Qu.:139.0
## Median :3.000 Median :0.001828 Median :143.0 Median :140.0
## Mean :2.791 Mean :0.002513 Mean :143.4 Mean :140.1
## 3rd Qu.:4.000 3rd Qu.:0.003653 3rd Qu.:145.0 3rd Qu.:141.0
## Max. :4.000 Max. :0.025424 Max. :169.0 Max. :146.5
## Sodium_min Sodium_range SubjectID trunk_max
## Min. :112.0 Min. :0.00000 Min. : 533 Min. :0.000
## 1st Qu.:135.0 1st Qu.:0.01058 1st Qu.:240826 1st Qu.:5.000
## Median :137.0 Median :0.01312 Median :496835 Median :7.000
## Mean :136.8 Mean :0.01500 Mean :498880 Mean :6.204
## 3rd Qu.:138.0 3rd Qu.:0.01728 3rd Qu.:750301 3rd Qu.:8.000
## Max. :145.0 Max. :0.14286 Max. :999482 Max. :8.000
## trunk_median trunk_min trunk_range Urine.Ph_max
## Min. :0.000 Min. :0.000 Min. :0.000000 Min. :5.00
## 1st Qu.:3.000 1st Qu.:1.000 1st Qu.:0.003643 1st Qu.:6.00
## Median :5.000 Median :3.000 Median :0.006920 Median :7.00
## Mean :4.893 Mean :2.956 Mean :0.007136 Mean :6.82
## 3rd Qu.:6.500 3rd Qu.:5.000 3rd Qu.:0.009639 3rd Qu.:7.00
## Max. :8.000 Max. :8.000 Max. :0.042017 Max. :9.00
## Urine.Ph_median Urine.Ph_min
## Min. :5.000 Min. :5.000
## 1st Qu.:5.000 1st Qu.:5.000
## Median :6.000 Median :5.000
## Mean :5.711 Mean :5.183
## 3rd Qu.:6.000 3rd Qu.:5.000
## Max. :9.000 Max. :8.000
There are 101 variables in total, and many of them represent summary statistics, such as the max, min, and median values, of the same clinical measurements.
Now let’s explore the Boruta() function in the Boruta package to perform variable selection based on random forests. The Boruta() call includes the following components:
vs<-Boruta(class~features, data=Mydata, pValue = 0.01, mcAdj = TRUE, maxRuns = 100, doTrace=0, getImp = getImpRfZ, ...)
class: variable for class labels.
features: potential features to select from.
data: dataset containing classes and features.
pValue: confidence level. The default value is 0.01 (notice that we are testing multiple variables simultaneously).
mcAdj: if TRUE (the default), a multiple comparisons adjustment using the Bonferroni method is applied.
maxRuns: maximal number of importance source runs. You may increase it to resolve attributes left Tentative.
doTrace: verbosity level. The default, 0, means no tracing; 1 means reporting the decision about each attribute as soon as it is justified; 2 means the same as 1, plus reporting the number of attributes at each importance source run.
getImp: function used to obtain attribute importance. The default is getImpRfZ, which runs a random forest from the ranger package and gathers \(Z\)-scores of the mean decrease accuracy measure.

The resulting vs object is of class Boruta and contains two important components:

finalDecision: a factor with three levels, Confirmed, Rejected, or Tentative, containing the final results of the feature selection process.
ImpHistory: a data frame of the importance of attributes gathered in each importance source run. Besides the predictors’ importance, it contains the maximal, mean, and minimal importance of shadow attributes for each run. Rejected attributes get -Inf importance. This output is set to NULL if we specify holdHistory=FALSE in the Boruta call.

Note: Running the Boruta() call on the ALS data will take several minutes.
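Before running Boruta on the ALS data, it may help to see the shadow-feature idea behind the algorithm on a toy problem. The sketch below is not the Boruta implementation: it substitutes a much simpler importance score (absolute correlation with the outcome) for the random forest importance, and the simulated data and names are assumptions made for illustration.

```r
set.seed(3)
n <- 300
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
y <- 1.5 * X$x1 + X$x2 + rnorm(n)   # hypothetical: x3 and x4 are pure noise

# Stand-in importance (an assumption): |correlation| with the outcome.
# Boruta itself uses random forest importance Z-scores instead.
importance <- function(df, y) sapply(df, function(col) abs(cor(col, y)))

runs <- 50
hits <- setNames(integer(ncol(X)), names(X))
for (r in seq_len(runs)) {
  # Shadow features: permute each column to break its link with the outcome
  shadows <- as.data.frame(lapply(X, sample))
  names(shadows) <- paste0("shadow_", names(X))

  imp <- importance(cbind(X, shadows), y)
  best_shadow <- max(imp[names(shadows)])

  # A "hit" is a real feature that beats the best shadow in this run
  hits <- hits + as.integer(imp[names(X)] > best_shadow)
}
print(hits)
```

Features that beat the best shadow in nearly every run are candidates for confirmation, and those that rarely do are candidates for rejection. Boruta formalizes this decision with a statistical test on the hit counts, which is where pValue and mcAdj enter.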
# install.packages("Boruta")
library(Boruta)
## Loading required package: ranger
set.seed(123)
als<-Boruta(ALSFRS_slope~.-ID, data=ALS.train, doTrace=0)
print(als)
## Boruta performed 99 iterations in 4.627568 mins.
## 28 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_median and 23 more;
## 59 attributes confirmed unimportant: Albumin_max, Albumin_median,
## Albumin_min, ALT.SGPT._max, ALT.SGPT._median and 54 more;
## 12 tentative attributes left: Age_mean, Albumin_range,
## Creatinine_max, Hematocrit_median, Hematocrit_range and 7 more;
als$ImpHistory[1:6, 1:10]
## Age_mean Albumin_max Albumin_median Albumin_min Albumin_range
## [1,] 1.2031427 1.4969268 0.6976378 0.9385041 1.979510
## [2,] -0.1998469 0.7204092 -1.5626360 0.5777092 2.573882
## [3,] 1.9272058 -1.0274668 0.2216170 -1.2234402 1.843967
## [4,] 0.5763244 0.9097371 0.2960979 0.6137624 2.184383
## [5,] 3.3655147 1.9412326 0.3849548 1.7309793 1.134676
## [6,] 0.2603118 -0.0287943 1.4164860 2.3251879 2.259974
## ALSFRS_Total_max ALSFRS_Total_median ALSFRS_Total_min
## [1,] 6.925233 9.551064 15.92924
## [2,] 8.124101 7.867399 14.94650
## [3,] 7.443326 8.735702 17.26469
## [4,] 7.578267 7.868885 16.95563
## [5,] 7.554582 7.248834 15.42697
## [6,] 7.516362 7.145460 14.94824
## ALSFRS_Total_range ALT.SGPT._max
## [1,] 25.78135 4.1516252
## [2,] 26.11722 1.2187027
## [3,] 25.61523 2.1618804
## [4,] 28.19229 0.4305607
## [5,] 24.90620 1.2043325
## [6,] 26.57093 0.8463782
This is a fairly time-consuming computation. Boruta separates the important attributes from the unimportant and tentative ones. Here the importance is measured using the out-of-bag (OOB) error. The OOB error estimates the prediction error of machine learning methods (e.g., random forests and boosted decision trees) that utilize bootstrap aggregation to sub-sample the training data. It represents the mean prediction error on each training sample \(x_i\), using only the trees that did not include \(x_i\) in their bootstrap samples. Out-of-bag estimates provide an internal assessment of the learning accuracy and avoid the need for an independent external validation dataset.
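The out-of-bag idea can be sketched with a few lines of base R. To keep the example self-contained, the sketch below bags ordinary linear models rather than trees, which is an assumption made for simplicity; the data and variable names are simulated and hypothetical.

```r
set.seed(4)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n, sd = 0.5)

B <- 100                      # number of bootstrap models
pred_sum <- numeric(n)        # running sum of OOB predictions per row
pred_cnt <- integer(n)        # number of times each row was out-of-bag

for (b in seq_len(B)) {
  in_bag <- sample(n, n, replace = TRUE)   # bootstrap sample of row indices
  oob    <- setdiff(seq_len(n), in_bag)    # rows this model never saw
  fit    <- lm(y ~ x1 + x2, data = d[in_bag, ])
  pred_sum[oob] <- pred_sum[oob] + predict(fit, newdata = d[oob, ])
  pred_cnt[oob] <- pred_cnt[oob] + 1L
}

# Average, for each row, the predictions of the models that did not train on it
seen     <- pred_cnt > 0
oob_pred <- pred_sum[seen] / pred_cnt[seen]
oob_mse  <- mean((d$y[seen] - oob_pred)^2)
print(oob_mse)
```

Each row is left out of roughly a third of the bootstrap samples, so every observation receives predictions from models that never saw it. This is why no separate validation set is needed for the error estimate.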
The importance scores for all features at every iteration are stored in the data frame als$ImpHistory. Let’s plot a graph depicting the essential features.
Note: Again, running this code will take several minutes to complete.
plot(als, xlab="", xaxt="n")
lz<-lapply(1:ncol(als$ImpHistory), function(i)
als$ImpHistory[is.finite(als$ImpHistory[, i]), i])
names(lz)<-colnames(als$ImpHistory)
lb<-sort(sapply(lz, median))
axis(side=1, las=2, labels=names(lb), at=1:ncol(als$ImpHistory), cex.axis=0.5, font = 4)
We can see that plotting the graph is easy, but extracting the matching feature names requires more work. The basic plot is produced by the call plot(als, xlab="", xaxt="n"), where xaxt="n" suppresses plotting of the x-axis. The following lines in the script reconstruct the x-axis labels. lz is a list created by the lapply() function. Each element in lz contains all the importance scores for a single attribute, excluding the non-finite (-Inf) values that are recorded once an attribute is rejected. Then, we sort the attributes by their median importance and print their names on the x-axis using axis().
We have already seen similar groups of boxplots back in Chapter 2 and Chapter 3. In this graph, green boxes mark confirmed (important) variables, yellow boxes mark tentative ones, red boxes mark rejected ones, and blue boxes show the shadow attributes. We can also see the range of importance scores for each variable.
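The label-ordering logic can be checked on a toy importance history. The matrix below is hypothetical (three attributes, four runs) and only mimics the layout of als$ImpHistory, with -Inf recorded after an attribute is rejected.

```r
# Toy importance history: rows are runs, columns are attributes
imp_hist <- cbind(
  featA = c(5, 6, 7, 6),
  featB = c(1, 0.5, -Inf, -Inf),   # rejected after the second run
  featC = c(3, 2, 4, 3)
)

# Keep only the finite scores of each attribute
lz <- lapply(seq_len(ncol(imp_hist)), function(i)
  imp_hist[is.finite(imp_hist[, i]), i])
names(lz) <- colnames(imp_hist)

# Order the attributes by their median importance (ascending)
lb <- sort(sapply(lz, median))
print(lb)
```

Here the medians are 0.75 for featB, 3 for featC, and 6 for featA, so the labels are printed in that left-to-right order, matching the order in which the boxplots are drawn.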
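It may be desirable to get rid of tentative features. The TentativeRoughFix() function resolves them by comparing the median importance of each tentative feature with the median importance of the best shadow attribute. Notice that this function should be used only when a strict decision is highly desired, because this test is much weaker than Boruta and can lower the confidence of the final result.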
final.als<-TentativeRoughFix(als)
print(final.als)
## Boruta performed 99 iterations in 4.627568 mins.
## Tentatives roughfixed over the last 99 iterations.
## 32 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_median and 27 more;
## 67 attributes confirmed unimportant: Age_mean, Albumin_max,
## Albumin_median, Albumin_min, Albumin_range and 62 more;
final.als$finalDecision
## Age_mean Albumin_max
## Rejected Rejected
## Albumin_median Albumin_min
## Rejected Rejected
## Albumin_range ALSFRS_Total_max
## Rejected Confirmed
## ALSFRS_Total_median ALSFRS_Total_min
## Confirmed Confirmed
## ALSFRS_Total_range ALT.SGPT._max
## Confirmed Rejected
## ALT.SGPT._median ALT.SGPT._min
## Rejected Rejected
## ALT.SGPT._range AST.SGOT._max
## Rejected Rejected
## AST.SGOT._median AST.SGOT._min
## Rejected Rejected
## AST.SGOT._range Bicarbonate_max
## Rejected Rejected
## Bicarbonate_median Bicarbonate_min
## Rejected Rejected
## Bicarbonate_range Blood.Urea.Nitrogen..BUN._max
## Rejected Rejected
## Blood.Urea.Nitrogen..BUN._median Blood.Urea.Nitrogen..BUN._min
## Rejected Rejected
## Blood.Urea.Nitrogen..BUN._range bp_diastolic_max
## Rejected Rejected
## bp_diastolic_median bp_diastolic_min
## Rejected Rejected
## bp_diastolic_range bp_systolic_max
## Rejected Rejected
## bp_systolic_median bp_systolic_min
## Rejected Rejected
## bp_systolic_range Calcium_max
## Rejected Rejected
## Calcium_median Calcium_min
## Rejected Rejected
## Calcium_range Chloride_max
## Rejected Rejected
## Chloride_median Chloride_min
## Rejected Rejected
## Chloride_range Creatinine_max
## Rejected Rejected
## Creatinine_median Creatinine_min
## Confirmed Confirmed
## Creatinine_range Gender_mean
## Rejected Rejected
## Glucose_max Glucose_median
## Rejected Rejected
## Glucose_min Glucose_range
## Rejected Rejected
## hands_max hands_median
## Confirmed Confirmed
## hands_min hands_range
## Confirmed Confirmed
## Hematocrit_max Hematocrit_median
## Confirmed Rejected
## Hematocrit_min Hematocrit_range
## Confirmed Confirmed
## Hemoglobin_max Hemoglobin_median
## Rejected Confirmed
## Hemoglobin_min Hemoglobin_range
## Rejected Confirmed
## leg_max leg_median
## Confirmed Confirmed
## leg_min leg_range
## Confirmed Confirmed
## mouth_max mouth_median
## Confirmed Confirmed
## mouth_min mouth_range
## Confirmed Confirmed
## onset_delta_mean onset_site_mean
## Confirmed Rejected
## Platelets_max Platelets_median
## Rejected Rejected
## Platelets_min Potassium_max
## Rejected Rejected
## Potassium_median Potassium_min
## Rejected Rejected
## Potassium_range pulse_max
## Rejected Confirmed
## pulse_median pulse_min
## Rejected Rejected
## pulse_range respiratory_max
## Rejected Rejected
## respiratory_median respiratory_min
## Confirmed Confirmed
## respiratory_range Sodium_max
## Confirmed Rejected
## Sodium_median Sodium_min
## Rejected Rejected
## Sodium_range SubjectID
## Rejected Rejected
## trunk_max trunk_median
## Confirmed Confirmed
## trunk_min trunk_range
## Confirmed Confirmed
## Urine.Ph_max Urine.Ph_median
## Rejected Rejected
## Urine.Ph_min
## Rejected
## Levels: Tentative Confirmed Rejected
getConfirmedFormula(final.als)
## ALSFRS_slope ~ ALSFRS_Total_max + ALSFRS_Total_median + ALSFRS_Total_min +
## ALSFRS_Total_range + Creatinine_median + Creatinine_min +
## hands_max + hands_median + hands_min + hands_range + Hematocrit_max +
## Hematocrit_min + Hematocrit_range + Hemoglobin_median + Hemoglobin_range +
## leg_max + leg_median + leg_min + leg_range + mouth_max +
## mouth_median + mouth_min + mouth_range + onset_delta_mean +
## pulse_max + respiratory_median + respiratory_min + respiratory_range +
## trunk_max + trunk_median + trunk_min + trunk_range
## <environment: 0x00000000279059b0>
# report the Boruta "Confirmed" & "Tentative" features, removing the "Rejected" ones
print(final.als$finalDecision[final.als$finalDecision %in% c("Confirmed", "Tentative")])
## ALSFRS_Total_max ALSFRS_Total_median ALSFRS_Total_min
## Confirmed Confirmed Confirmed
## ALSFRS_Total_range Creatinine_median Creatinine_min
## Confirmed Confirmed Confirmed
## hands_max hands_median hands_min
## Confirmed Confirmed Confirmed
## hands_range Hematocrit_max Hematocrit_min
## Confirmed Confirmed Confirmed
## Hematocrit_range Hemoglobin_median Hemoglobin_range
## Confirmed Confirmed Confirmed
## leg_max leg_median leg_min
## Confirmed Confirmed Confirmed
## leg_range mouth_max mouth_median
## Confirmed Confirmed Confirmed
## mouth_min mouth_range onset_delta_mean
## Confirmed Confirmed Confirmed
## pulse_max respiratory_median respiratory_min
## Confirmed Confirmed Confirmed
## respiratory_range trunk_max trunk_median
## Confirmed Confirmed Confirmed
## trunk_min trunk_range
## Confirmed Confirmed
## Levels: Tentative Confirmed Rejected
# how many are actually "confirmed" as important/salient?
impBoruta <- final.als$finalDecision[final.als$finalDecision %in% c("Confirmed")]; length(impBoruta)
## [1] 32
This shows the final feature selection result.
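The rough-fix decision rule can be illustrated on a small hypothetical importance history. This is a sketch of the rule as we understand it (a tentative attribute is confirmed when its median importance exceeds the median importance of the best shadow attribute), not the package code; the attribute names and values are invented.

```r
# Hypothetical importance history for two tentative attributes and shadowMax
imp_hist <- cbind(
  tentA     = c(3.1, 2.8, 3.4, 3.0, 2.9),
  tentB     = c(1.2, 2.6, 1.9, 1.5, 2.2),
  shadowMax = c(2.4, 2.7, 2.5, 2.3, 2.6)
)
tentative <- c("tentA", "tentB")

# Median importance of the best shadow attribute across runs
shadow_median <- median(imp_hist[, "shadowMax"])

# Median importance of each tentative attribute across runs
att_median <- apply(imp_hist[, tentative, drop = FALSE], 2, median)

# Confirm attributes whose median beats the shadow median, reject the rest
decision <- ifelse(att_median > shadow_median, "Confirmed", "Rejected")
print(decision)
```

In this toy case tentA (median 3.0) is confirmed and tentB (median 1.9) is rejected against a shadow median of 2.5. Because the rule compares medians only, it carries no formal error control, which is why it is weaker than the main Boruta test.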
Let’s compare the Boruta results against a classical variable selection method - recursive feature elimination (RFE). First, we need to load two packages: caret and randomForest. Then, similar to Chapter 14, we must specify a resampling method. Here we use 10-fold CV to do the resampling.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:ranger':
##
## importance
set.seed(123)
control<-rfeControl(functions = rfFuncs, method = "cv", number=10)
Now, all preparations are complete and we are ready to do the RFE variable selection.
rf.train<-rfe(ALS.train[, -c(1, 7)], ALS.train[, 7], sizes=c(10, 20, 30, 40), rfeControl=control)
rf.train
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared RMSESD RsquaredSD Selected
## 10 0.3500 0.6837 0.03451 0.03837
## 20 0.3471 0.6894 0.03230 0.03374
## 30 0.3468 0.6900 0.03135 0.02967 *
## 40 0.3473 0.6895 0.03061 0.02887
## 99 0.3503 0.6842 0.02995 0.02868
##
## The top 5 variables (out of 30):
## ALSFRS_Total_range, trunk_range, hands_range, mouth_range, ALSFRS_Total_min
This calculation may take a long time to complete. The RFE invocation is different from Boruta. Here we have to specify the feature data frame and the outcome vector separately. Also, the sizes= option allows us to specify the numbers of features we want to evaluate. Here we use sizes=c(10, 20, 30, 40) to compare the model performance for alternative numbers of features.
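The elimination loop behind RFE can be sketched in base R. The version below is a simplification and rests on several assumptions: it ranks features by the absolute t-statistic of a linear model instead of random forest importance, it uses a single hold-out split instead of 10-fold CV, and it evaluates every subset size rather than a chosen list of sizes. The data are simulated.

```r
set.seed(5)
n <- 300
X <- as.data.frame(matrix(rnorm(n * 8), nrow = n))
names(X) <- paste0("x", 1:8)
y <- 2 * X$x1 - 1.5 * X$x2 + X$x3 + rnorm(n)   # hypothetical: x4..x8 are noise
train <- 1:200
test  <- 201:300

features   <- names(X)
eliminated <- character(0)
history    <- data.frame(size = integer(0), rmse = numeric(0))

repeat {
  # Fit on the current feature subset and record the hold-out RMSE
  d    <- cbind(y = y, X[, features, drop = FALSE])
  fit  <- lm(y ~ ., data = d[train, ])
  pred <- predict(fit, newdata = d[test, ])
  history <- rbind(history,
                   data.frame(size = length(features),
                              rmse = sqrt(mean((y[test] - pred)^2))))
  if (length(features) == 1) break

  # Rank features by |t| (intercept excluded) and drop the weakest one
  cf      <- coef(summary(fit))[-1, , drop = FALSE]
  weakest <- rownames(cf)[which.min(abs(cf[, "t value"]))]
  eliminated <- c(eliminated, weakest)
  features   <- setdiff(features, weakest)
}
print(history)
print(eliminated)
```

The history table plays the same role as the resampling performance table printed by rfe(): the preferred subset size is the one with the lowest error. Each extra size requires another round of model fitting, which is the source of the computational cost.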
To visualize the results, we can plot the 5 different feature subset sizes listed in the summary. The one with 30 features has the lowest RMSE. This result is similar to the Boruta output, which selected around 30 features.
plot(rf.train, type=c("g", "o"), cex=1, col=1:4)
Using the functions predictors() and getSelectedAttributes(), we can compare the final results of the two alternative feature selection methods.
predRFE <- predictors(rf.train)
predBoruta <- getSelectedAttributes(final.als, withTentative = FALSE)
The two selections overlap substantially:
intersect(predBoruta, predRFE)
## [1] "ALSFRS_Total_max" "ALSFRS_Total_median" "ALSFRS_Total_min"
## [4] "ALSFRS_Total_range" "Creatinine_min" "hands_max"
## [7] "hands_median" "hands_min" "hands_range"
## [10] "Hematocrit_max" "Hemoglobin_median" "leg_max"
## [13] "leg_median" "leg_min" "leg_range"
## [16] "mouth_median" "mouth_min" "mouth_range"
## [19] "onset_delta_mean" "respiratory_median" "respiratory_min"
## [22] "respiratory_range" "trunk_max" "trunk_median"
## [25] "trunk_min" "trunk_range"
There are 26 common variables chosen by the two techniques (out of 32 selected by Boruta and 30 by RFE), which suggests that both the Boruta and RFE methods are fairly robust. Also, notice that the Boruta method can give similar results without relying on the sizes option. If we want to consider 10 or more different sizes, the RFE procedure will be quite time-consuming. Thus, the Boruta method is effective when dealing with complex real-world problems.
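When comparing two selections, it can help to report what each method picked exclusively and a single overlap score, not only the intersection. The helper compare_selections() below is a hypothetical function written for this illustration, and the two short vectors are invented examples that reuse feature names from the ALS data.

```r
# Hypothetical helper: summarize the agreement between two feature selections
compare_selections <- function(a, b) {
  common <- intersect(a, b)
  list(
    common  = common,
    only_a  = setdiff(a, b),
    only_b  = setdiff(b, a),
    jaccard = length(common) / length(union(a, b))   # overlap in [0, 1]
  )
}

# Invented short selections for demonstration
sel_a <- c("ALSFRS_Total_range", "trunk_range", "hands_range", "pulse_max")
sel_b <- c("ALSFRS_Total_range", "trunk_range", "mouth_range")

res <- compare_selections(sel_a, sel_b)
print(res)
```

Applied to the real objects, the call would be compare_selections(predBoruta, predRFE). With 32 and 30 selected features and 26 in common, the Jaccard index would be 26/36, or about 0.72.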
Next, we can contrast the Boruta feature selection results against another classical variable selection method - stepwise model selection. Let’s start with fitting a bidirectional stepwise linear model-based feature selection.
data2 <- ALS.train[, -1]
# Define a base model - intercept only
base.mod <- lm(ALSFRS_slope ~ 1 , data= data2)
# Define the full model - including all predictors
all.mod <- lm(ALSFRS_slope ~ . , data= data2)
# ols_step <- lm(ALSFRS_slope ~ ., data=data2)
ols_step <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = 'both', k=2, trace = F)
summary(ols_step); ols_step
##
## Call:
## lm(formula = ALSFRS_slope ~ ALSFRS_Total_range + ALSFRS_Total_median +
## ALSFRS_Total_min + Calcium_range + Calcium_max + bp_diastolic_min +
## onset_delta_mean + Calcium_min + Albumin_range + Glucose_range +
## ALT.SGPT._median + AST.SGOT._median + Glucose_max + Glucose_min +
## Creatinine_range + Potassium_range + Chloride_range + Chloride_min +
## Sodium_median + respiratory_min + respiratory_range + respiratory_max +
## trunk_range + pulse_range + Bicarbonate_max + Bicarbonate_range +
## Chloride_max + onset_site_mean + trunk_max + Gender_mean +
## Creatinine_min, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22558 -0.17875 -0.02024 0.17098 1.95100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.176e-01 6.064e-01 0.689 0.491091
## ALSFRS_Total_range -2.260e+01 1.359e+00 -16.631 < 2e-16 ***
## ALSFRS_Total_median -3.388e-02 2.868e-03 -11.812 < 2e-16 ***
## ALSFRS_Total_min 2.821e-02 3.310e-03 8.524 < 2e-16 ***
## Calcium_range 2.410e+02 4.188e+01 5.754 9.94e-09 ***
## Calcium_max -4.258e-01 8.846e-02 -4.813 1.59e-06 ***
## bp_diastolic_min -2.249e-03 8.856e-04 -2.540 0.011161 *
## onset_delta_mean -5.461e-05 1.980e-05 -2.758 0.005856 **
## Calcium_min 3.579e-01 9.501e-02 3.767 0.000169 ***
## Albumin_range -2.305e+00 8.197e-01 -2.812 0.004967 **
## Glucose_range -1.510e+01 2.929e+00 -5.156 2.75e-07 ***
## ALT.SGPT._median -2.300e-03 7.998e-04 -2.876 0.004062 **
## AST.SGOT._median 3.369e-03 1.276e-03 2.641 0.008316 **
## Glucose_max 3.279e-02 7.082e-03 4.630 3.88e-06 ***
## Glucose_min -3.507e-02 8.718e-03 -4.023 5.95e-05 ***
## Creatinine_range 5.076e-01 2.214e-01 2.293 0.021925 *
## Potassium_range -4.535e+00 2.607e+00 -1.739 0.082128 .
## Chloride_range 5.318e+00 1.188e+00 4.475 8.04e-06 ***
## Chloride_min 1.672e-02 3.797e-03 4.404 1.12e-05 ***
## Sodium_median -9.830e-03 4.639e-03 -2.119 0.034227 *
## respiratory_min -1.453e-01 2.442e-02 -5.948 3.14e-09 ***
## respiratory_range -5.834e+01 1.013e+01 -5.757 9.78e-09 ***
## respiratory_max 1.712e-01 3.395e-02 5.042 4.99e-07 ***
## trunk_range -8.705e+00 3.088e+00 -2.819 0.004860 **
## pulse_range -5.117e-01 3.016e-01 -1.697 0.089874 .
## Bicarbonate_max 7.526e-03 2.931e-03 2.568 0.010292 *
## Bicarbonate_range -2.204e+00 9.567e-01 -2.304 0.021329 *
## Chloride_max -6.918e-03 3.952e-03 -1.751 0.080143 .
## onset_site_mean 3.359e-02 2.019e-02 1.663 0.096359 .
## trunk_max 2.288e-02 8.453e-03 2.706 0.006854 **
## Gender_mean -3.360e-02 1.751e-02 -1.919 0.055066 .
## Creatinine_min 7.643e-04 4.977e-04 1.536 0.124771
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3355 on 2191 degrees of freedom
## Multiple R-squared: 0.7135, Adjusted R-squared: 0.7094
## F-statistic: 176 on 31 and 2191 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = ALSFRS_slope ~ ALSFRS_Total_range + ALSFRS_Total_median +
## ALSFRS_Total_min + Calcium_range + Calcium_max + bp_diastolic_min +
## onset_delta_mean + Calcium_min + Albumin_range + Glucose_range +
## ALT.SGPT._median + AST.SGOT._median + Glucose_max + Glucose_min +
## Creatinine_range + Potassium_range + Chloride_range + Chloride_min +
## Sodium_median + respiratory_min + respiratory_range + respiratory_max +
## trunk_range + pulse_range + Bicarbonate_max + Bicarbonate_range +
## Chloride_max + onset_site_mean + trunk_max + Gender_mean +
## Creatinine_min, data = data2)
##
## Coefficients:
## (Intercept) ALSFRS_Total_range ALSFRS_Total_median
## 4.176e-01 -2.260e+01 -3.388e-02
## ALSFRS_Total_min Calcium_range Calcium_max
## 2.821e-02 2.410e+02 -4.258e-01
## bp_diastolic_min onset_delta_mean Calcium_min
## -2.249e-03 -5.461e-05 3.579e-01
## Albumin_range Glucose_range ALT.SGPT._median
## -2.305e+00 -1.510e+01 -2.300e-03
## AST.SGOT._median Glucose_max Glucose_min
## 3.369e-03 3.279e-02 -3.507e-02
## Creatinine_range Potassium_range Chloride_range
## 5.076e-01 -4.535e+00 5.318e+00
## Chloride_min Sodium_median respiratory_min
## 1.672e-02 -9.830e-03 -1.453e-01
## respiratory_range respiratory_max trunk_range
## -5.834e+01 1.712e-01 -8.705e+00
## pulse_range Bicarbonate_max Bicarbonate_range
## -5.117e-01 7.526e-03 -2.204e+00
## Chloride_max onset_site_mean trunk_max
## -6.918e-03 3.359e-02 2.288e-02
## Gender_mean Creatinine_min
## -3.360e-02 7.643e-04
We can report the stepwise “Confirmed” (important) features:
# get the names of the shortlisted variables (the coefficients of the stepwise model)
stepwiseConfirmedVars <- names(coef(ols_step))
# remove the intercept
stepwiseConfirmedVars <- stepwiseConfirmedVars[!stepwiseConfirmedVars %in% "(Intercept)"]
print(stepwiseConfirmedVars)
## [1] "ALSFRS_Total_range" "ALSFRS_Total_median" "ALSFRS_Total_min"
## [4] "Calcium_range" "Calcium_max" "bp_diastolic_min"
## [7] "onset_delta_mean" "Calcium_min" "Albumin_range"
## [10] "Glucose_range" "ALT.SGPT._median" "AST.SGOT._median"
## [13] "Glucose_max" "Glucose_min" "Creatinine_range"
## [16] "Potassium_range" "Chloride_range" "Chloride_min"
## [19] "Sodium_median" "respiratory_min" "respiratory_range"
## [22] "respiratory_max" "trunk_range" "pulse_range"
## [25] "Bicarbonate_max" "Bicarbonate_range" "Chloride_max"
## [28] "onset_site_mean" "trunk_max" "Gender_mean"
## [31] "Creatinine_min"
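The feature selection results of Boruta and step are similar. Next, we use caret::varImp() to estimate the importance of the stepwise-selected variables and then compare the two feature sets.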
library(mlbench)
library(caret)
# estimate variable importance
predStepwise <- varImp(ols_step, scale=FALSE)
# summarize importance
print(predStepwise)
## Overall
## ALSFRS_Total_range 16.630592
## ALSFRS_Total_median 11.812263
## ALSFRS_Total_min 8.523606
## Calcium_range 5.754045
## Calcium_max 4.812942
## bp_diastolic_min 2.539766
## onset_delta_mean 2.758465
## Calcium_min 3.767450
## Albumin_range 2.812018
## Glucose_range 5.156259
## ALT.SGPT._median 2.876338
## AST.SGOT._median 2.641369
## Glucose_max 4.629759
## Glucose_min 4.022642
## Creatinine_range 2.293301
## Potassium_range 1.739268
## Chloride_range 4.474709
## Chloride_min 4.403551
## Sodium_median 2.118710
## respiratory_min 5.948488
## respiratory_range 5.756735
## respiratory_max 5.041816
## trunk_range 2.819029
## pulse_range 1.696811
## Bicarbonate_max 2.568068
## Bicarbonate_range 2.303757
## Chloride_max 1.750666
## onset_site_mean 1.663481
## trunk_max 2.706410
## Gender_mean 1.919380
## Creatinine_min 1.535642
# plot predStepwise
# plot(predStepwise)
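To our understanding, for linear models caret's varImp() reports the absolute value of each coefficient's t-statistic. This is consistent with the output above, e.g., Glucose_min has importance 4.022642 and a t value of -4.023 in the regression summary. The following minimal base-R sketch reproduces that ranking idea on the built-in mtcars data; the dataset and the names fit, coefTable, and tImportance are illustrative assumptions and are not part of the ALS analysis.

```r
# Fit a small linear model on the built-in mtcars data (illustrative only)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|)
coefTable <- summary(fit)$coefficients

# Importance = |t value| for every term except the intercept
tImportance <- abs(coefTable[rownames(coefTable) != "(Intercept)", "t value"])

# Rank the predictors from most to least important
print(sort(tImportance, decreasing = TRUE))
```

Sorting by this quantity orders the predictors by the strength of evidence against a zero coefficient, which is not the same as ordering them by effect size.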
# Boruta vs. Stepwise feature selection
intersect(predBoruta, stepwiseConfirmedVars)
## [1] "ALSFRS_Total_median" "ALSFRS_Total_min" "ALSFRS_Total_range"
## [4] "Creatinine_min" "onset_delta_mean" "respiratory_min"
## [7] "respiratory_range" "trunk_max" "trunk_range"
There are \(9\) common variables chosen by both the Boruta and the stepwise feature selection methods.
There is another, more elaborate stepwise feature selection technique, implemented in the function MASS::stepAIC(), that is useful for a wider range of object classes.
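As a minimal sketch of how MASS::stepAIC() might be called, the example below runs a bidirectional AIC-based search on the built-in mtcars data. The dataset and the object names fullModel, aicModel, and aicVars are assumptions chosen for illustration; for the ALS analysis one would start from the corresponding full model instead.

```r
library(MASS)  # recommended package that ships with standard R distributions

# Start from the full linear model on the built-in mtcars data
fullModel <- lm(mpg ~ ., data = mtcars)

# Bidirectional stepwise search minimizing AIC; trace = FALSE silences the step log
aicModel <- stepAIC(fullModel, direction = "both", trace = FALSE)

# Names of the retained predictors (dropping the intercept)
aicVars <- setdiff(names(coef(aicModel)), "(Intercept)")
print(aicVars)
```

The same call pattern applies to other model classes that stepAIC() supports, such as glm fits.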
You can practice variable selection using the SOCR_Data_AD_BiomedBigMetadata dataset on the SOCR website. This is a smaller dataset with 744 observations and 63 variables. Here we use DXCURREN, the current diagnosis, as the class variable.
Let’s import the dataset first.
library(rvest)
## Loading required package: xml2
wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_AD_BiomedBigMetadata")
html_nodes(wiki_url, "#content")
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top ...
alzh <- html_table(html_nodes(wiki_url, "table")[[1]])
summary(alzh)
## SID MMSCORE FAQTOTAL GDTOTAL
## Min. : 2.0 Min. :18.00 Length:744 Min. :0.000
## 1st Qu.: 355.5 1st Qu.:25.00 Class :character 1st Qu.:0.000
## Median : 697.5 Median :27.00 Mode :character Median :1.000
## Mean : 707.5 Mean :26.81 Mean :1.367
## 3rd Qu.:1063.0 3rd Qu.:29.00 3rd Qu.:2.000
## Max. :1435.0 Max. :30.00 Max. :6.000
## adascog sobcdr DXCURREN DX_Conversion
## Length:744 Min. :0.000 Min. :1.000 Length:744
## Class :character 1st Qu.:0.000 1st Qu.:1.000 Class :character
## Mode :character Median :1.500 Median :2.000 Mode :character
## Mean :1.785 Mean :1.958
## 3rd Qu.:2.625 3rd Qu.:2.000
## Max. :9.000 Max. :3.000
## DXCONTYP DX_Confidence Gender Married
## Min. :-4.000 Length:744 Min. :1.000 Min. :1.000
## 1st Qu.:-4.000 Class :character 1st Qu.:1.000 1st Qu.:1.000
## Median :-4.000 Mode :character Median :1.000 Median :1.000
## Mean :-3.962 Mean :1.407 Mean :1.083
## 3rd Qu.:-4.000 3rd Qu.:2.000 3rd Qu.:1.000
## Max. : 3.000 Max. :2.000 Max. :2.000
## Education Age Weight_Kg VSBPSYS
## Min. : 6.00 Min. :55.00 Min. : -1.00 Min. : 90.0
## 1st Qu.:14.00 1st Qu.:71.00 1st Qu.: 64.67 1st Qu.:122.0
## Median :16.00 Median :76.00 Median : 74.39 Median :135.0
## Mean :15.64 Mean :75.49 Mean : 75.28 Mean :135.5
## 3rd Qu.:18.00 3rd Qu.:80.00 3rd Qu.: 84.48 3rd Qu.:146.0
## Max. :20.00 Max. :91.00 Max. :137.44 Max. :206.0
## VSBPDIA VSPULSE VSRESP VSTEMP
## Min. : 43.00 Min. : 40.00 Min. :-1.00 Min. :-1.00
## 1st Qu.: 68.00 1st Qu.: 58.00 1st Qu.:16.00 1st Qu.:36.10
## Median : 75.00 Median : 64.00 Median :16.00 Median :36.40
## Mean : 74.56 Mean : 65.17 Mean :16.68 Mean :36.35
## 3rd Qu.: 82.00 3rd Qu.: 72.00 3rd Qu.:18.00 3rd Qu.:36.70
## Max. :103.00 Max. :100.00 Max. :32.00 Max. :37.70
## SymptomeSeverety SymptomeChronicity BC.USEA BCVOMIT
## Length:744 Length:744 Min. :1.000 Min. :1.000
## Class :character Class :character 1st Qu.:1.000 1st Qu.:1.000
## Mode :character Mode :character Median :1.000 Median :1.000
## Mean :1.032 Mean :1.016
## 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000
## BCDIARRH BCCONSTP BCABDOMN BCSWEATN
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.097 Mean :1.106 Mean :1.074 Mean :1.056
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## BCDIZZY BCENERGY BCDROWSY BCVISION
## Min. :1.000 Min. :1.0 Min. :1.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.0 1st Qu.:1.00 1st Qu.:1.000
## Median :1.000 Median :1.0 Median :1.00 Median :1.000
## Mean :1.125 Mean :1.2 Mean :1.13 Mean :1.059
## 3rd Qu.:1.000 3rd Qu.:1.0 3rd Qu.:1.00 3rd Qu.:1.000
## Max. :2.000 Max. :2.0 Max. :2.00 Max. :2.000
## BCHDACHE BCDRYMTH BCBREATH BCCOUGH
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.093 Mean :1.087 Mean :1.078 Mean :1.116
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## BCPALPIT BCCHEST BCURNDIS BCURNFRQ
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.031 Mean :1.017 Mean :1.023 Mean :1.218
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## BCANKLE BCMUSCLE BCRASH BCINSOMN
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.078 Mean :1.364 Mean :1.073 Mean :1.112
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## BCDPMOOD BCCRYING BCELMOOD BCWANDER
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.122 Mean :1.035 Mean :1.012 Mean :1.004
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :2.000
## BCFALL BCOTHER CTWHITE CTRED
## Min. :1.000 Min. :1.000 Length:744 Length:744
## 1st Qu.:1.000 1st Qu.:1.000 Class :character Class :character
## Median :1.000 Median :1.000 Mode :character Mode :character
## Mean :1.046 Mean :1.046
## 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000
## PROTEIN GLUCOSE ApoEGeneAllele1 ApoEGeneAllele2
## Length:744 Length:744 Min. :2.000 Min. :2.000
## Class :character Class :character 1st Qu.:3.000 1st Qu.:3.000
## Mode :character Mode :character Median :3.000 Median :3.000
## Mean :3.023 Mean :3.489
## 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :4.000 Max. :4.000
## CDMEMORY CDORIENT CDJUDGE CDCOMMUN
## Min. :0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0 Median :0.5000 Median :0.0000 Median :0.5000
## Mean :0 Mean :0.5047 Mean :0.3085 Mean :0.3683
## 3rd Qu.:0 3rd Qu.:1.0000 3rd Qu.:0.5000 3rd Qu.:0.5000
## Max. :0 Max. :2.0000 Max. :2.0000 Max. :2.0000
## CDHOME CDCARE CDGLOBAL
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2513 Mean :0.2849 Mean :0.0672
## 3rd Qu.:0.5000 3rd Qu.:0.5000 3rd Qu.:0.0000
## Max. :2.0000 Max. :2.0000 Max. :2.0000
The data summary shows that several variables were imported as character vectors. After converting them to numeric, we find some missing data, because entries that cannot be parsed as numbers become NA. We can manage this issue either by keeping only the complete observations of the original dataset or by using multivariate imputation, see Chapter 2.
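# column indices of the character variables that need to be converted to numeric
chrtofactor<-c(3, 5, 8, 10, 21:22, 51:54)
# coerce each of these columns to numeric; entries that cannot be parsed become NA
alzh[chrtofactor]<-data.frame(apply(alzh[chrtofactor], 2, as.numeric))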
## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion
## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion
## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion
## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion
## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion
## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion
## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion
## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion
## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion
## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion
alzh<-alzh[complete.cases(alzh), ]
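The conversion and filtering steps above can be illustrated on a tiny example. This is only a sketch on a hypothetical data frame, toy, which is not the Alzheimer's data; it shows how a non-numeric entry turns into NA under as.numeric() and how complete.cases() then drops the affected row.

```r
# Hypothetical toy data: 'score' was read as text because of a non-numeric entry
toy <- data.frame(id = 1:4,
                  score = c("12", "7.5", "n/a", "3"),
                  stringsAsFactors = FALSE)

# Non-numeric strings become NA; suppressWarnings() hides the coercion warning
toy$score <- suppressWarnings(as.numeric(toy$score))

# Number of missing values introduced in each column
print(colSums(is.na(toy)))

# Keep only the rows without any missing values
toyComplete <- toy[complete.cases(toy), ]
print(nrow(toyComplete))
```

Listwise deletion of this kind is simple but discards information, which is why imputation can be preferable when many rows are affected.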
For simplicity, we eliminated the missing data from the Alzheimer's dataset and are left with 408 complete observations. Now, we can apply the Boruta method for feature selection.
## Boruta performed 99 iterations in 8.643105 secs.
## 12 attributes confirmed important: adascog, BCBREATH, CDCARE,
## CDCOMMUN, CDGLOBAL and 7 more;
## 47 attributes confirmed unimportant: Age, BC.USEA, BCABDOMN,
## BCANKLE, BCCHEST and 42 more;
## 2 tentative attributes left: ApoEGeneAllele1, ApoEGeneAllele2;
Because Boruta is a randomized method, your results may differ slightly. We can plot the variable importance graph using the plotting approach from earlier in this chapter.
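The final step is to resolve the tentative features. Here a rough fix reclassifies both tentative attributes (ApoEGeneAllele1 and ApoEGeneAllele2) as important.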
## Boruta performed 99 iterations in 8.643105 secs.
## Tentatives roughfixed over the last 99 iterations.
## 14 attributes confirmed important: adascog, ApoEGeneAllele1,
## ApoEGeneAllele2, BCBREATH, CDCARE and 9 more;
## 47 attributes confirmed unimportant: Age, BC.USEA, BCABDOMN,
## BCANKLE, BCCHEST and 42 more;
## [1] "MMSCORE" "FAQTOTAL" "adascog"
## [4] "sobcdr" "DX_Confidence" "BCBREATH"
## [7] "ApoEGeneAllele1" "ApoEGeneAllele2" "CDORIENT"
## [10] "CDJUDGE" "CDCOMMUN" "CDHOME"
## [13] "CDCARE" "CDGLOBAL"
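The Boruta calls that produced the output above are not shown in this section. The following is a minimal sketch of the general workflow, assuming the Boruta package is installed. To keep it self-contained, it uses a synthetic data frame, demo, with a hypothetical outcome DX, one informative predictor, and three noise predictors; these names and settings are assumptions and not the original analysis.

```r
library(Boruta)

# Synthetic stand-in data (hypothetical): one informative and three noise predictors
set.seed(2024)
n <- 300
signal <- rnorm(n)
demo <- data.frame(
  DX     = factor(ifelse(signal + rnorm(n, sd = 0.3) > 0, "case", "control")),
  signal = signal,
  noise1 = rnorm(n),
  noise2 = rnorm(n),
  noise3 = rnorm(n)
)

# Run Boruta: compares each feature's importance against shuffled "shadow" copies
borutaFit <- Boruta(DX ~ ., data = demo, doTrace = 0, maxRuns = 100)
print(borutaFit)

# Resolve any remaining tentative attributes; TentativeRoughFix() only warns
# and returns the fit unchanged when no tentative attributes are left
finalFit <- suppressWarnings(TentativeRoughFix(borutaFit))

# Names of the attributes confirmed as important
selected <- getSelectedAttributes(finalFit, withTentative = FALSE)
print(selected)
```

For the Alzheimer's data, one would presumably pass the cleaned alzh data frame with a formula such as DXCURREN ~ . in place of the synthetic example, possibly after excluding identifier columns like SID. Exact iteration counts and decisions will vary between runs unless a seed is set.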
Can you reproduce these results? Also try to apply some of these techniques to other data from the list of our Case-Studies.