As we mentioned earlier in Chapter 4, variable selection is very important when dealing with bioinformatics, healthcare, and biomedical data where we may have more features than observations. Instead of trying to interrogate the complete data in its native high-dimensional state, we can apply variable selection, or feature selection, to focus on the most salient information contained in the observations. Due to the presence of intrinsic and extrinsic noise, the volume and complexity of big health data, as well as different methodological and technological challenges, the process of identifying the salient features may resemble finding a needle in a haystack. Here, we will illustrate alternative strategies for feature selection using filtering (e.g., correlation-based feature selection), wrapping (e.g., recursive feature elimination), and embedding (e.g., variable importance via random forest classification) techniques.
Variable selection relates to dimensionality reduction, which we saw in Chapter 4; however, there are differences between them.
| Method | Process Type | Goals | Approach |
|---|---|---|---|
| Variable selection | Discrete process | To select unique representative features from each group of similar features | To identify highly correlated variables and choose a representative feature by post processing the data |
| Dimension reduction | Continuous process | To denoise the data, enable simpler prediction, or group features so that low impact features have smaller weights | Find the essential, \(k\ll n\), components, factors, or clusters representing linear, or nonlinear, functions of the \(n\) variables which maximize an objective function like the proportion of explained variance |
Relative to the lower variance estimates in continuous dimensionality reduction, the intrinsic characteristics of the discrete feature selection process yield higher variance in bootstrap estimation and cross validation.
In this chapter, we will also learn about another powerful technique for variable selection that uses decoy features (knockoffs) to control the false discovery rate of selecting inconsequential features as important.
There are three major classes of variable or feature selection techniques: filtering-based, wrapper-based, and embedded methods.
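To make the contrast in the table concrete, here is a minimal sketch on simulated data. The variables x1, x2, x3 and the 0.9 correlation cutoff are illustrative assumptions and are not part of the ALS analysis. Variable selection keeps a subset of the original columns, whereas dimension reduction (here PCA) builds new components as weighted combinations of all columns.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)    # nearly a duplicate of x1
x3 <- rnorm(n)                    # independent feature
X  <- data.frame(x1, x2, x3)

# Discrete process: drop one member of each highly correlated pair
cutoff <- 0.9                     # illustrative threshold (assumption)
R <- abs(cor(X))
R[lower.tri(R, diag = TRUE)] <- 0 # look at each pair once (upper triangle)
to_drop  <- colnames(R)[apply(R, 2, function(v) any(v > cutoff))]
selected <- setdiff(colnames(X), to_drop)
selected                          # original columns that are retained

# Continuous process: PCA assigns a weight (loading) to every feature
pca <- prcomp(X, center = TRUE, scale. = TRUE)
round(pca$rotation, 2)
summary(pca)$importance["Proportion of Variance", ]
```

The selection step returns a yes/no decision for each feature, while the PCA loadings vary smoothly with the data, which is one way to see why the discrete process tends to be less stable under resampling.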
The different types of feature selection methods have their own pros
and cons. In this chapter, we are going to introduce the randomized
wrapper method using the Boruta package, which utilizes a
random forest classification method to output variable importance
measures (VIMs). Then, we will compare its results with Recursive
Feature Elimination, a classical deterministic wrapper method.
Let’s start by examining random-forest-based feature selection as an embedded technique. Random forests perform well as classification, regression, and clustering methods, and they are easy to use while producing accurate and robust results. A random forest, or more broadly a decision tree, prediction naturally leads to feature selection using the mean decrease in impurity or the mean decrease in accuracy criteria.
The many decision trees captured in a random forest include explicit conditions at each branching node, which are based on single features. The intrinsic bifurcation conditions splitting the data may be based on cost function optimization using the impurity, see Chapter 5. We can also use other metrics, such as information gain or entropy, for classification problems. These measures capture the importance of each variable by computing its impact (how much the feature-based splitting decision decreases the weighted impurity in a tree). In random forests, the ranking of feature importance, which is based on the average impurity decrease due to each variable, leads to effective feature selection.
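The impurity decrease of a single split can be computed by hand. The sketch below uses the Gini impurity and a tiny hypothetical dataset (the labels y and the features a and b are made up for illustration). A random forest would average such decreases over all nodes and trees that split on a given feature.

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted impurity decrease obtained by splitting on x <= threshold
impurity_decrease <- function(x, y, threshold) {
  left  <- y[x <= threshold]
  right <- y[x >  threshold]
  n <- length(y)
  gini(y) - (length(left) / n) * gini(left) - (length(right) / n) * gini(right)
}

# Toy data (hypothetical): feature a separates the classes, feature b does not
y <- c("A", "A", "A", "A", "B", "B", "B", "B")
a <- c(1, 2, 3, 4, 5, 6, 7, 8)
b <- c(1, 5, 2, 6, 3, 7, 4, 8)

impurity_decrease(a, y, threshold = 4.5)  # 0.5: both child nodes are pure
impurity_decrease(b, y, threshold = 4.5)  # 0: child nodes are as mixed as the parent
```

Ranking features by this quantity places a ahead of b, which mirrors how the average impurity decrease orders variables in a forest.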
Step 1: Collecting Data
First things first, let’s explore the dataset we will be using. Case Study 15, Amyotrophic Lateral Sclerosis (ALS), examines the patterns, symmetries, associations and causality in a rare but devastating disease, also known as Lou Gehrig’s disease. This ALS case-study reflects a large clinical trial including big, multi-source and heterogeneous datasets. It would be interesting to interrogate the data and attempt to derive potential biomarkers that can be used for detecting, prognosticating, and forecasting the progression of this neurodegenerative disorder. Overcoming many scientific, technical and infrastructure barriers is required to establish complete, efficient, and reproducible protocols for such complex data. These pipeline workflows start with ingesting the raw data, preprocessing, aggregating, harmonizing, analyzing, visualizing and interpreting the findings.
In this case-study, we use the training dataset that contains 2,223
observations and 131 numeric variables. We select
ALSFRS slope as our outcome variable, as it captures the
patients’ clinical decline over a year. Although we have more
observations than features, this is one of the examples where multiple
features are highly correlated. Therefore, we need to preprocess the
variables before commencing with feature selection.
Step 2: Exploring and preparing the data
The dataset is located in our case-studies
archive. We can use read.csv() to directly import the
CSV dataset into R using the URL reference.
ALS.train <- read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")
summary(ALS.train)
## ID Age_mean Albumin_max Albumin_median
## Min. : 1.0 Min. :18.00 Min. :37.00 Min. :34.50
## 1st Qu.: 614.5 1st Qu.:47.00 1st Qu.:45.00 1st Qu.:42.00
## Median :1213.0 Median :55.00 Median :47.00 Median :44.00
## Mean :1214.9 Mean :54.55 Mean :47.01 Mean :43.95
## 3rd Qu.:1815.5 3rd Qu.:63.00 3rd Qu.:49.00 3rd Qu.:46.00
## Max. :2424.0 Max. :81.00 Max. :70.30 Max. :51.10
## Albumin_min Albumin_range ALSFRS_slope ALSFRS_Total_max
## Min. :24.00 Min. :0.000000 Min. :-4.3452 Min. :11.00
## 1st Qu.:39.00 1st Qu.:0.009042 1st Qu.:-1.0863 1st Qu.:29.00
## Median :41.00 Median :0.012111 Median :-0.6207 Median :33.00
## Mean :40.77 Mean :0.013779 Mean :-0.7283 Mean :31.69
## 3rd Qu.:43.00 3rd Qu.:0.015873 3rd Qu.:-0.2838 3rd Qu.:36.00
## Max. :49.00 Max. :0.243902 Max. : 1.2070 Max. :40.00
## ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range ALT.SGPT._max
## Min. : 2.5 Min. : 0.00 Min. :0.00000 Min. : 10.00
## 1st Qu.:23.0 1st Qu.:14.00 1st Qu.:0.01404 1st Qu.: 32.00
## Median :28.0 Median :20.00 Median :0.02330 Median : 45.00
## Mean :27.1 Mean :19.88 Mean :0.02604 Mean : 54.44
## 3rd Qu.:32.0 3rd Qu.:27.00 3rd Qu.:0.03480 3rd Qu.: 65.00
## Max. :40.0 Max. :40.00 Max. :0.11765 Max. :944.00
## ALT.SGPT._median ALT.SGPT._min ALT.SGPT._range AST.SGOT._max
## Min. : 8.00 Min. : 1.60 Min. :0.002747 Min. : 11.00
## 1st Qu.: 22.00 1st Qu.: 15.00 1st Qu.:0.030303 1st Qu.: 30.00
## Median : 30.00 Median : 21.00 Median :0.047619 Median : 38.00
## Mean : 32.99 Mean : 23.01 Mean :0.071137 Mean : 43.13
## 3rd Qu.: 40.00 3rd Qu.: 28.00 3rd Qu.:0.077539 3rd Qu.: 48.00
## Max. :193.00 Max. :109.00 Max. :2.383117 Max. :911.00
## AST.SGOT._median AST.SGOT._min AST.SGOT._range Bicarbonate_max
## Min. : 9.00 Min. : 1.00 Min. :0.00000 Min. :20.0
## 1st Qu.: 22.00 1st Qu.:17.00 1st Qu.:0.02352 1st Qu.:29.0
## Median : 27.00 Median :20.00 Median :0.03502 Median :31.0
## Mean : 29.08 Mean :21.54 Mean :0.04919 Mean :30.9
## 3rd Qu.: 34.00 3rd Qu.:25.00 3rd Qu.:0.05243 3rd Qu.:32.0
## Max. :100.00 Max. :86.00 Max. :1.91667 Max. :52.0
## Bicarbonate_median Bicarbonate_min Bicarbonate_range
## Min. :19.50 Min. : 2.50 Min. :0.00000
## 1st Qu.:26.00 1st Qu.:22.00 1st Qu.:0.01266
## Median :27.00 Median :23.00 Median :0.01493
## Mean :26.96 Mean :23.16 Mean :0.01687
## 3rd Qu.:28.00 3rd Qu.:24.45 3rd Qu.:0.01815
## Max. :39.50 Max. :34.00 Max. :0.21429
## Blood.Urea.Nitrogen..BUN._max Blood.Urea.Nitrogen..BUN._median
## Min. : 2.921 Min. : 2.191
## 1st Qu.: 5.842 1st Qu.: 4.640
## Median : 6.937 Median : 5.423
## Mean : 7.353 Mean : 5.558
## 3rd Qu.: 8.210 3rd Qu.: 6.353
## Max. :25.192 Max. :11.866
## Blood.Urea.Nitrogen..BUN._min Blood.Urea.Nitrogen..BUN._range bp_diastolic_max
## Min. : 0.5842 Min. :0.000000 Min. : 70.00
## 1st Qu.: 3.2859 1st Qu.:0.004109 1st Qu.: 88.00
## Median : 4.0700 Median :0.005817 Median : 90.00
## Mean : 4.1609 Mean :0.007133 Mean : 92.03
## 3rd Qu.: 5.0000 3rd Qu.:0.008353 3rd Qu.: 98.00
## Max. :10.2228 Max. :0.069543 Max. :140.00
## bp_diastolic_median bp_diastolic_min bp_diastolic_range bp_systolic_max
## Min. : 56.00 Min. : 20.00 Min. :0.00000 Min. :100.0
## 1st Qu.: 78.00 1st Qu.: 65.00 1st Qu.:0.03527 1st Qu.:138.0
## Median : 80.00 Median : 70.00 Median :0.04337 Median :145.0
## Mean : 81.11 Mean : 69.89 Mean :0.04766 Mean :147.1
## 3rd Qu.: 85.00 3rd Qu.: 75.00 3rd Qu.:0.05435 3rd Qu.:157.0
## Max. :110.00 Max. :100.00 Max. :0.71429 Max. :220.0
## bp_systolic_median bp_systolic_min bp_systolic_range Calcium_max
## Min. : 90.0 Min. : 72.0 Min. :0.00000 Min. :2.171
## 1st Qu.:120.0 1st Qu.:108.0 1st Qu.:0.05272 1st Qu.:2.400
## Median :130.0 Median :110.0 Median :0.06494 Median :2.470
## Mean :129.6 Mean :113.4 Mean :0.07118 Mean :2.475
## 3rd Qu.:136.0 3rd Qu.:120.0 3rd Qu.:0.08190 3rd Qu.:2.530
## Max. :190.0 Max. :165.0 Max. :0.40462 Max. :9.460
## Calcium_median Calcium_min Calcium_range Chloride_max
## Min. :2.046 Min. :0.2438 Min. :0.0000000 Min. : 96.0
## 1st Qu.:2.283 1st Qu.:2.1707 1st Qu.:0.0003741 1st Qu.:106.0
## Median :2.345 Median :2.2300 Median :0.0004739 Median :107.0
## Mean :2.346 Mean :2.2229 Mean :0.0005407 Mean :107.2
## 3rd Qu.:2.400 3rd Qu.:2.2977 3rd Qu.:0.0005893 3rd Qu.:109.0
## Max. :2.800 Max. :2.6500 Max. :0.0129009 Max. :119.0
## Chloride_median Chloride_min Chloride_range Creatinine_max
## Min. : 90.0 Min. : 76.00 Min. :0.00000 Min. : 22.00
## 1st Qu.:102.0 1st Qu.: 98.00 1st Qu.:0.01250 1st Qu.: 65.00
## Median :104.0 Median :100.00 Median :0.01587 Median : 79.56
## Mean :103.5 Mean : 99.26 Mean :0.01787 Mean : 78.78
## 3rd Qu.:105.0 3rd Qu.:101.00 3rd Qu.:0.01990 3rd Qu.: 88.40
## Max. :111.0 Max. :109.00 Max. :0.21429 Max. :248.00
## Creatinine_median Creatinine_min Creatinine_range Gender_mean
## Min. : 18.00 Min. : 0.00 Min. :0.00000 Min. :1.000
## 1st Qu.: 53.04 1st Qu.: 39.00 1st Qu.:0.03824 1st Qu.:1.000
## Median : 62.00 Median : 53.00 Median :0.04865 Median :2.000
## Mean : 65.19 Mean : 51.98 Mean :0.05842 Mean :1.637
## 3rd Qu.: 78.85 3rd Qu.: 61.88 3rd Qu.:0.07026 3rd Qu.:2.000
## Max. :176.80 Max. :167.96 Max. :0.42095 Max. :2.000
## Glucose_max Glucose_median Glucose_min Glucose_range
## Min. : 4.160 Min. : 3.497 Min. : 0.000 Min. :0.000000
## 1st Qu.: 5.827 1st Qu.: 4.911 1st Qu.: 4.051 1st Qu.:0.003051
## Median : 6.500 Median : 5.300 Median : 4.440 Median :0.004695
## Mean : 7.160 Mean : 5.487 Mean : 4.265 Mean :0.006319
## 3rd Qu.: 7.600 3rd Qu.: 5.695 3rd Qu.: 4.800 3rd Qu.:0.007373
## Max. :33.688 Max. :26.196 Max. :12.200 Max. :0.097463
## hands_max hands_median hands_min hands_range
## Min. :0.000 Min. :0.000 Min. :0.000 Min. :0.000000
## 1st Qu.:5.000 1st Qu.:3.000 1st Qu.:0.000 1st Qu.:0.003610
## Median :7.000 Median :5.500 Median :3.000 Median :0.006652
## Mean :6.181 Mean :4.905 Mean :3.047 Mean :0.006883
## 3rd Qu.:8.000 3rd Qu.:7.000 3rd Qu.:5.000 3rd Qu.:0.009513
## Max. :8.000 Max. :8.000 Max. :8.000 Max. :0.042857
## Hematocrit_max Hematocrit_median Hematocrit_min Hematocrit_range
## Min. : 0.373 Min. : 0.362 Min. : 0.311 Min. :0.000000
## 1st Qu.:42.300 1st Qu.:40.000 1st Qu.:37.000 1st Qu.:0.007164
## Median :45.200 Median :42.600 Median :40.000 Median :0.009701
## Mean :41.939 Mean :39.467 Mean :36.962 Mean :0.011431
## 3rd Qu.:47.700 3rd Qu.:45.000 3rd Qu.:42.700 3rd Qu.:0.013579
## Max. :81.000 Max. :56.000 Max. :52.900 Max. :0.185714
## Hemoglobin_max Hemoglobin_median Hemoglobin_min Hemoglobin_range
## Min. :116.0 Min. :106.0 Min. : 6.204 Min. :0.00000
## 1st Qu.:144.0 1st Qu.:136.0 1st Qu.:128.000 1st Qu.:0.02321
## Median :152.0 Median :145.0 Median :136.000 Median :0.03106
## Mean :152.1 Mean :144.3 Mean :135.461 Mean :0.03824
## 3rd Qu.:160.0 3rd Qu.:152.0 3rd Qu.:145.000 3rd Qu.:0.04205
## Max. :280.0 Max. :182.0 Max. :180.000 Max. :0.56180
## leg_max leg_median leg_min leg_range
## Min. :0.00 Min. :0.00 Min. :0.000 Min. :0.000000
## 1st Qu.:3.00 1st Qu.:2.50 1st Qu.:1.000 1st Qu.:0.003378
## Median :5.00 Median :3.00 Median :2.000 Median :0.005435
## Mean :5.31 Mean :4.05 Mean :2.493 Mean :0.006163
## 3rd Qu.:8.00 3rd Qu.:6.00 3rd Qu.:3.000 3rd Qu.:0.008718
## Max. :8.00 Max. :8.00 Max. :8.000 Max. :0.042017
## mouth_max mouth_median mouth_min mouth_range
## Min. : 1.00 Min. : 0.000 Min. : 0.000 Min. :0.000000
## 1st Qu.:10.00 1st Qu.: 8.000 1st Qu.: 5.000 1st Qu.:0.001815
## Median :12.00 Median :11.000 Median : 9.000 Median :0.005329
## Mean :10.74 Mean : 9.703 Mean : 7.778 Mean :0.006595
## 3rd Qu.:12.00 3rd Qu.:12.000 3rd Qu.:11.000 3rd Qu.:0.010251
## Max. :12.00 Max. :12.000 Max. :12.000 Max. :0.036765
## onset_delta_mean onset_site_mean Platelets_max Platelets_median
## Min. :-3119 Min. :1.000 Min. : 84.0 Min. : 73.0
## 1st Qu.: -887 1st Qu.:2.000 1st Qu.:239.0 1st Qu.:204.0
## Median : -572 Median :2.000 Median :275.0 Median :233.0
## Mean : -683 Mean :1.801 Mean :285.3 Mean :238.8
## 3rd Qu.: -374 3rd Qu.:2.000 3rd Qu.:320.0 3rd Qu.:270.0
## Max. : -16 Max. :3.000 Max. :866.0 Max. :526.0
## Platelets_min Potassium_max Potassium_median Potassium_min
## Min. : 0.197 Min. : 3.400 Min. :3.000 Min. :2.400
## 1st Qu.:175.000 1st Qu.: 4.400 1st Qu.:4.000 1st Qu.:3.700
## Median :204.000 Median : 4.500 Median :4.200 Median :3.900
## Mean :208.382 Mean : 4.628 Mean :4.189 Mean :3.857
## 3rd Qu.:236.000 3rd Qu.: 4.800 3rd Qu.:4.300 3rd Qu.:4.000
## Max. :476.000 Max. :43.000 Max. :5.100 Max. :5.100
## Potassium_range pulse_max pulse_median pulse_min
## Min. :0.000000 Min. : 53.00 Min. : 50.00 Min. : 18.00
## 1st Qu.:0.001058 1st Qu.: 84.00 1st Qu.: 72.00 1st Qu.: 60.00
## Median :0.001425 Median : 90.00 Median : 77.00 Median : 64.00
## Mean :0.001744 Mean : 90.64 Mean : 76.97 Mean : 65.37
## 3rd Qu.:0.001913 3rd Qu.: 96.00 3rd Qu.: 81.00 3rd Qu.: 70.00
## Max. :0.098674 Max. :144.00 Max. :115.00 Max. :102.00
## pulse_range respiratory_max respiratory_median respiratory_min
## Min. :0.005425 Min. :2.00 Min. :0.000 Min. :0.000
## 1st Qu.:0.036755 1st Qu.:4.00 1st Qu.:3.000 1st Qu.:2.000
## Median :0.048821 Median :4.00 Median :4.000 Median :3.000
## Mean :0.053587 Mean :3.91 Mean :3.593 Mean :2.791
## 3rd Qu.:0.062365 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :0.500000 Max. :4.00 Max. :4.000 Max. :4.000
## respiratory_range Sodium_max Sodium_median Sodium_min
## Min. :0.000000 Min. :134.0 Min. :128.0 Min. :112.0
## 1st Qu.:0.000000 1st Qu.:142.0 1st Qu.:139.0 1st Qu.:135.0
## Median :0.001828 Median :143.0 Median :140.0 Median :137.0
## Mean :0.002513 Mean :143.4 Mean :140.1 Mean :136.8
## 3rd Qu.:0.003653 3rd Qu.:145.0 3rd Qu.:141.0 3rd Qu.:138.0
## Max. :0.025424 Max. :169.0 Max. :146.5 Max. :145.0
## Sodium_range SubjectID trunk_max trunk_median
## Min. :0.00000 Min. : 533 Min. :0.000 Min. :0.000
## 1st Qu.:0.01058 1st Qu.:240826 1st Qu.:5.000 1st Qu.:3.000
## Median :0.01312 Median :496835 Median :7.000 Median :5.000
## Mean :0.01500 Mean :498880 Mean :6.204 Mean :4.893
## 3rd Qu.:0.01728 3rd Qu.:750301 3rd Qu.:8.000 3rd Qu.:6.500
## Max. :0.14286 Max. :999482 Max. :8.000 Max. :8.000
## trunk_min trunk_range Urine.Ph_max Urine.Ph_median
## Min. :0.000 Min. :0.000000 Min. :5.00 Min. :5.000
## 1st Qu.:1.000 1st Qu.:0.003643 1st Qu.:6.00 1st Qu.:5.000
## Median :3.000 Median :0.006920 Median :7.00 Median :6.000
## Mean :2.956 Mean :0.007136 Mean :6.82 Mean :5.711
## 3rd Qu.:5.000 3rd Qu.:0.009639 3rd Qu.:7.00 3rd Qu.:6.000
## Max. :8.000 Max. :0.042017 Max. :9.00 Max. :9.000
## Urine.Ph_min
## Min. :5.000
## 1st Qu.:5.000
## Median :5.000
## Mean :5.183
## 3rd Qu.:5.000
## Max. :8.000
There are \(131\) features, and some of the variables represent statistics like the max, min and median values of the same clinical measurements.
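One quick way to see which columns summarize the same clinical measurement is to strip the trailing statistic from each column name and group by the remaining stem. This is only a sketch: the vector cols below holds a handful of names copied from the summary output, and the suffix pattern is an assumption based on the names shown there.

```r
# A few column names copied from the summary above
cols <- c("Albumin_max", "Albumin_median", "Albumin_min", "Albumin_range",
          "Age_mean", "Gender_mean",
          "pulse_max", "pulse_median", "pulse_min", "pulse_range")

# Remove the trailing statistic to recover the underlying measurement
stem   <- sub("_(max|median|min|range|mean)$", "", cols)
groups <- split(cols, stem)

# Number of summary statistics available per measurement
lengths(groups)
```

With the data loaded, replacing cols by names(ALS.train) should give the same grouping for the full feature set.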
Step 3: Training a model on the data
Now let’s explore the Boruta() function in the
Boruta package to perform variable selection, based on
random forest classification. Boruta() includes the
following components:
vs <- Boruta(class~features, data=Mydata, pValue = 0.01, mcAdj = TRUE, maxRuns = 100, doTrace=0, getImp = getImpRfZ, ...)
- class: variable for class labels.
- features: potential features to select from.
- data: dataset containing classes and features.
- pValue: confidence level. The default value is 0.01 (notice that we are applying multiple variable selection).
- mcAdj: default TRUE, which applies a multiple comparisons adjustment using the Bonferroni method.
- maxRuns: maximal number of importance source runs. You may increase it to resolve attributes left Tentative.
- doTrace: verbosity level. The default, 0, means no tracing; 1 means reporting the decision about each attribute as soon as it is justified; 2 means the same as 1, plus reporting the number of attributes at each importance source run.
- getImp: function used to obtain attribute importance. The default is getImpRfZ, which runs random forest from the ranger package and gathers \(Z\)-scores of the mean decrease accuracy measure.

The resulting vs object is of class Boruta and contains two important components:

- finalDecision: a factor of three values: Confirmed, Rejected or Tentative, containing the final results of the feature selection process.
- ImpHistory: a data frame of importance of attributes gathered in each importance source run. Besides the predictors’ importance, it contains maximal, mean and minimal importance of shadow attributes for each run. Rejected attributes get -Inf importance. This output is set to NULL if we specify holdHistory=FALSE in the Boruta call.

Caution: Running the code below will take several minutes.
# install.packages("Boruta")
library(Boruta)
set.seed(123)
als <- Boruta(ALSFRS_slope ~ . -ID, data=ALS.train, doTrace=0)
print(als)
## Boruta performed 99 iterations in 1.113085 mins.
## 30 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_max and 25 more;
## 61 attributes confirmed unimportant: Albumin_max, Albumin_median,
## Albumin_min, Albumin_range, ALT.SGPT._max and 56 more;
## 8 tentative attributes left: Age_mean, Hematocrit_median,
## Hematocrit_range, Hemoglobin_max, Hemoglobin_min and 3 more;
als$ImpHistory[1:6, 1:10]
## Age_mean Albumin_max Albumin_median Albumin_min Albumin_range
## [1,] 2.2680963 0.37764697 0.35392394 0.1051619 2.915087
## [2,] 2.0267252 1.39739377 1.97034396 0.5878719 1.960934
## [3,] 2.3157588 -0.58408581 0.89600771 2.1274668 0.985526
## [4,] 2.4953558 -0.94574532 0.08017671 1.3725028 2.210370
## [5,] 0.6570802 0.07801328 -0.80266698 1.6603405 2.018822
## [6,] 2.9302386 0.99320619 -0.16963863 0.9274493 2.130164
## ALSFRS_Total_max ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range
## [1,] 7.404657 8.189677 17.53358 25.78601
## [2,] 7.511764 8.637098 15.75552 26.48235
## [3,] 7.837504 8.542079 16.68604 25.39762
## [4,] 8.620842 7.146983 17.18074 24.54223
## [5,] 8.597765 8.538938 16.04533 27.65627
## [6,] 8.544448 7.747993 16.89295 26.83922
## ALT.SGPT._max
## [1,] 0.94366835
## [2,] 0.42357699
## [3,] 1.52429646
## [4,] 1.77878291
## [5,] 0.07109222
## [6,] 2.32341000
This is a fairly time-consuming computation. Boruta separates the important attributes from the unimportant and tentative ones. Here the importance is measured by the out-of-bag (OOB) error. The OOB error estimates the prediction error of machine learning methods (e.g., random forests and boosted decision trees) that utilize bootstrap aggregation to sub-sample training data. OOB represents the mean prediction error on each training sample \(x_i\), using only the trees that did not include \(x_i\) in their bootstrap samples. Out-of-bag estimates provide an internal assessment of the learning accuracy and avoid the need for an independent external validation dataset.
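The OOB idea can be sketched without a random forest. The example below bags simple linear regressions on simulated data instead of trees, purely to stay within base R. The data-generating model, the number of bootstrap rounds B, and all variable names are illustrative assumptions. Each case is predicted only by the models whose bootstrap sample did not contain it.

```r
set.seed(42)
n   <- 150
x   <- runif(n, -2, 2)
y   <- 1 + 2 * x + rnorm(n, sd = 0.5)   # simulated data (assumption)
dat <- data.frame(x, y)

B        <- 200                         # number of bootstrap models
pred_sum <- numeric(n)                  # running sum of OOB predictions per case
pred_cnt <- integer(n)                  # number of models that left each case out

for (b in seq_len(B)) {
  in_bag <- sample(n, replace = TRUE)   # bootstrap sample indices
  oob    <- setdiff(seq_len(n), in_bag) # cases not drawn in this round
  fit    <- lm(y ~ x, data = dat[in_bag, ])
  pred_sum[oob] <- pred_sum[oob] + predict(fit, newdata = dat[oob, ])
  pred_cnt[oob] <- pred_cnt[oob] + 1L
}

# Average only the predictions from models that never saw each case
oob_pred <- pred_sum / pred_cnt
oob_mse  <- mean((dat$y - oob_pred)^2, na.rm = TRUE)
oob_mse
```

Because every prediction comes from models that did not train on that case, oob_mse behaves like a validation error even though no separate test set was held out.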
The importance scores for all features at every iteration are stored
in the data frame als$ImpHistory. Let’s plot a graph
depicting the essential features.
Note: Again, running this code will take several minutes to complete.
library(plotly)
# plot(als, xlab="", xaxt="n")
# lz<-lapply(1:ncol(als$ImpHistory), function(i)
# als$ImpHistory[is.finite(als$ImpHistory[, i]), i])
# names(lz)<-colnames(als$ImpHistory)
# lb<-sort(sapply(lz, median))
# axis(side=1, las=2, labels=names(lb), at=1:ncol(als$ImpHistory), cex.axis=0.5, font = 4)
# Reshape the importance history to long format: one (feature, measurement) pair per row
df_long <- tidyr::gather(as.data.frame(als$ImpHistory), feature, measurement)
plot_ly(df_long, x = ~feature, y = ~measurement, color = ~feature, type = "box") %>%
  layout(title = "Box-and-whisker Plots across all 102 Features (ALS Data)",
         xaxis = list(title = "Features", categoryorder = "total descending"),
         yaxis = list(title = "Importance"), showlegend = FALSE)
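A numeric companion to the box plot is the median finite importance per column, which is what the commented-out base R code above computes. The sketch below uses a small hypothetical matrix imp in place of als$ImpHistory, so it runs without refitting Boruta. The feature names featA and featB and all values are made up, while shadowMax follows the column naming Boruta uses for the shadow attributes.

```r
# Hypothetical stand-in for als$ImpHistory: rows are runs, columns are features
imp <- cbind(
  featA     = c(8.1, 7.9, 8.4, 8.0),
  featB     = c(1.2, 0.8, -Inf, -Inf),  # rejected after the second run
  shadowMax = c(2.5, 2.9, 2.7, 3.1)
)

# Keep only finite importances (rejected attributes are recorded as -Inf)
finite_vals <- lapply(seq_len(ncol(imp)),
                      function(i) imp[is.finite(imp[, i]), i])
names(finite_vals) <- colnames(imp)

# Median importance per column, largest first
med <- sort(sapply(finite_vals, median), decreasing = TRUE)
med
```

In this toy example featA sits well above shadowMax and featB below it, which is the kind of separation the box plot is meant to reveal. Replacing imp with als$ImpHistory should produce the corresponding ranking for the ALS features.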