
1 Variable Importance and Feature Selection

As we mentioned earlier in Chapter 4, variable selection is very important when dealing with bioinformatics, healthcare, and biomedical data, where we may have more features than observations. Instead of trying to interrogate the complete data in its native high-dimensional state, we can apply variable selection, or feature selection, to focus on the most salient information contained in the observations. Due to the presence of intrinsic and extrinsic noise, the volume and complexity of big health data, as well as various methodological and technological challenges, the process of identifying the salient features may resemble finding a needle in a haystack. Here, we will illustrate alternative strategies for feature selection using filtering (e.g., correlation-based feature selection), wrapping (e.g., recursive feature elimination), and embedding (e.g., variable importance via random forest classification) techniques.

Variable selection is related to the dimensionality reduction techniques we saw in Chapter 4; however, there are important differences between the two, summarized in the table below.

Method | Process Type | Goals | Approach
------ | ------------ | ----- | --------
Variable selection | Discrete process | To select unique representative features from each group of similar features | To identify highly correlated variables and choose a representative feature by post-processing the data
Dimension reduction | Continuous process | To denoise the data, enable simpler prediction, or group features so that low-impact features have smaller weights | To find the essential, \(k\ll n\), components, factors, or clusters representing linear, or nonlinear, functions of the \(n\) variables that maximize an objective function like the proportion of explained variance

Relative to the lower-variance estimates obtained with continuous dimensionality reduction, the intrinsic characteristics of the discrete feature selection process yield higher variance in bootstrap estimation and cross-validation.

Later in this chapter, we will also learn about another powerful variable selection technique that uses decoy features (knockoffs) to control the false discovery rate of selecting inconsequential features as important.

1.1 Feature selection methods

There are three major classes of variable or feature selection techniques - filtering-based, wrapper-based, and embedded methods.

1.1.1 Filtering techniques

  • Univariate: Univariate filtering methods focus on selecting single features with high scores based on some statistics like \(\chi^2\) or Information Gain Ratio. Each feature is viewed as independent of the others, effectively ignoring interactions between features.
    • Examples: \(\chi^2\), Euclidean distance, \(t\)-test, and Information gain.
  • Multivariate: Multivariate filtering methods rely on various (multivariate) statistics to select the principal features. They typically account for between-feature interactions by using higher-order statistics like correlation. The basic idea is to iteratively triage variables that have high correlations with other features (see the short correlation-filtering sketch after the examples below).
    • Examples: Correlation-based feature selection, Markov blanket filter, and fast correlation-based feature selection.
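
To make the multivariate filtering idea concrete, here is a minimal, hedged sketch of correlation-based filtering using caret::findCorrelation(); the simulated data, the redundant feature V11, and the 0.75 cutoff are arbitrary choices for illustration only.

# hypothetical sketch of correlation-based (multivariate) filtering
library(caret)
set.seed(123)
X <- as.data.frame(matrix(rnorm(100 * 10), ncol = 10))   # simulated feature matrix
X$V11 <- X$V1 + rnorm(100, sd = 0.05)                    # a nearly redundant copy of V1

corMat <- cor(X)                                  # pairwise feature correlations
dropIdx <- findCorrelation(corMat, cutoff = 0.75) # features flagged for triage
colnames(X)[dropIdx]                              # inspect the flagged features
X_filtered <- if (length(dropIdx) > 0) X[ , -dropIdx] else X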

1.1.2 Wrapper

  • Deterministic: Deterministic wrapper feature selection methods either start with no features (forward-selection) or with all features included in the model (backward-selection) and iteratively refine the set of chosen features according to some model quality measures. The iterative process of adding or removing features may rely on statistics like the Jaccard similarity coefficient.
    • Examples: Sequential forward selection, Recursive Feature Elimination, Plus \(q\) take-away \(r\), and Beam search.
  • Randomized: Stochastic wrapper feature selection procedures utilize a binary feature-indexing vector indicating whether or not each variable should be included in the list of salient features. At each iteration, we randomly perturb the binary indicator vector and compare the feature combinations before and after the random inclusion-exclusion change. Finally, we pick the indexing vector corresponding to the optimal performance based on some metric, like an acceptance probability measure. The iterative process continues until no improvement of the objective function is observed (a toy sketch of this strategy appears after the examples below).
    • Examples: Simulated annealing, Genetic algorithms, Estimation of distribution algorithms.
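
To illustrate the randomized wrapper strategy, the following toy sketch performs a random inclusion-exclusion search over a binary feature-indexing vector, scoring each candidate subset by the AIC of a linear model; it is only a simplified stand-in for methods like simulated annealing or genetic algorithms, and all data and settings are made up for demonstration.

# toy stochastic wrapper: random perturbations of a binary inclusion vector
set.seed(123)
n <- 200; k <- 10
X <- as.data.frame(matrix(rnorm(n * k), ncol = k))
y <- 2 * X$V1 - 3 * X$V3 + rnorm(n)        # only V1 and V3 are truly predictive
dat <- cbind(y = y, X)

score <- function(ind) {                   # AIC of the model using features with ind == 1
  if (sum(ind) == 0) return(AIC(lm(y ~ 1, data = dat)))
  AIC(lm(reformulate(names(X)[ind == 1], response = "y"), data = dat))
}

ind <- rbinom(k, 1, 0.5)                   # random initial inclusion-exclusion vector
best <- score(ind)
for (iter in 1:200) {
  cand <- ind
  j <- sample(k, 1)                        # randomly flip one indicator
  cand[j] <- 1 - cand[j]
  s <- score(cand)
  if (s < best) { ind <- cand; best <- s } # keep the perturbation only if it improves the fit
}
names(X)[ind == 1]                         # features retained by the random search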

1.1.3 Embedded Techniques

  • Embedded feature selection techniques are based on various classifiers, predictors, or clustering procedures. For instance, we can accomplish feature selection by using decision trees, where the separation of the training data relies on the features associated with the highest information gain. Deeper tree branches that further split the data may utilize weaker features. This process of choosing the vital features based on their separability characteristics continues until the classifier generates group labels that are mostly homogeneous within clusters/classes and largely heterogeneous across groups, and the information gain of further tree branching becomes marginal. The entire process may be iterated multiple times, selecting the features that appear most frequently.
    • Examples: Decision trees, random forests, weighted naive Bayes, and feature selection using weighted-SVM.

The different types of feature selection methods have their own pros and cons. In this chapter, we are going to introduce the randomized wrapper method using the Boruta package, which utilizes a random forest classification method to output variable importance measures (VIMs). Then, we will compare its results with Recursive Feature Elimination, a classical deterministic wrapper method.

1.1.4 Random Forest Feature Selection

Let’s start by examining random forest based feature selection as an embedded technique. The good performance of random forests for classification, regression, and clustering tasks is coupled with their ease of use and their accurate, robust results. A random forest, or more broadly a decision tree, prediction model naturally lends itself to feature selection via the mean decrease in impurity or the mean decrease in accuracy criteria.

The many decision trees comprising a random forest include explicit conditions at each branching node, each based on a single feature. The intrinsic bifurcation conditions splitting the data may be based on cost function optimization using the impurity, see Chapter 5. For classification problems, we can also use other metrics, such as information gain or entropy. These measures capture the importance of each variable by computing its impact, i.e., how much the feature-based splitting decision decreases the weighted impurity in a tree. In random forests, ranking feature importance by the average impurity decrease attributable to each variable leads to effective feature selection.
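
As a quick, hedged illustration of these importance measures, the sketch below fits a random forest to a small simulated dataset and extracts both the permutation-based (%IncMSE) and impurity-based (IncNodePurity) importance scores; the data-generating model is made up purely for demonstration.

# minimal sketch: variable importance measures from a random forest (simulated data)
library(randomForest)
set.seed(123)
n <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
df$y <- 1.5 * df$x1 - 2 * df$x3 + rnorm(n)   # x1 and x3 are the salient features

rf <- randomForest(y ~ ., data = df, importance = TRUE, ntree = 500)
importance(rf)    # %IncMSE (mean decrease in accuracy) and IncNodePurity (mean decrease in impurity)
varImpPlot(rf)    # rank the features by both importance criteria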

1.1.5 Case Study - ALS

Step 1: Collecting Data

First things first, let’s explore the dataset we will be using. Case Study 15, Amyotrophic Lateral Sclerosis (ALS), examines the patterns, symmetries, associations, and causality in a rare but devastating neurodegenerative disease, also known as Lou Gehrig's disease. This ALS case-study reflects a large clinical trial including big, multi-source, and heterogeneous datasets. It would be interesting to interrogate the data and attempt to derive potential biomarkers that can be used for detecting, prognosticating, and forecasting the progression of this neurodegenerative disorder. Establishing complete, efficient, and reproducible protocols for such complex data requires overcoming many scientific, technical, and infrastructure barriers. These pipeline workflows start with ingesting the raw data, followed by preprocessing, aggregating, harmonizing, analyzing, visualizing, and interpreting the findings.

In this case-study, we use the training dataset, which contains 2,223 observations and 101 numeric variables. We select ALSFRS_slope as our outcome variable, as it captures the patients’ clinical decline over a year. Although we have more observations than features, this is an example where many features are highly correlated. Therefore, we need to preprocess the variables before commencing with feature selection.

Step 2: Exploring and preparing the data

The dataset is located in our case-studies archive. We can use read.csv() to directly import the CSV dataset into R using the URL reference.

ALS.train <- read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")
summary(ALS.train)
##        ID            Age_mean      Albumin_max    Albumin_median 
##  Min.   :   1.0   Min.   :18.00   Min.   :37.00   Min.   :34.50  
##  1st Qu.: 614.5   1st Qu.:47.00   1st Qu.:45.00   1st Qu.:42.00  
##  Median :1213.0   Median :55.00   Median :47.00   Median :44.00  
##  Mean   :1214.9   Mean   :54.55   Mean   :47.01   Mean   :43.95  
##  3rd Qu.:1815.5   3rd Qu.:63.00   3rd Qu.:49.00   3rd Qu.:46.00  
##  Max.   :2424.0   Max.   :81.00   Max.   :70.30   Max.   :51.10  
##   Albumin_min    Albumin_range       ALSFRS_slope     ALSFRS_Total_max
##  Min.   :24.00   Min.   :0.000000   Min.   :-4.3452   Min.   :11.00   
##  1st Qu.:39.00   1st Qu.:0.009042   1st Qu.:-1.0863   1st Qu.:29.00   
##  Median :41.00   Median :0.012111   Median :-0.6207   Median :33.00   
##  Mean   :40.77   Mean   :0.013779   Mean   :-0.7283   Mean   :31.69   
##  3rd Qu.:43.00   3rd Qu.:0.015873   3rd Qu.:-0.2838   3rd Qu.:36.00   
##  Max.   :49.00   Max.   :0.243902   Max.   : 1.2070   Max.   :40.00   
##  ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range ALT.SGPT._max   
##  Min.   : 2.5        Min.   : 0.00    Min.   :0.00000    Min.   : 10.00  
##  1st Qu.:23.0        1st Qu.:14.00    1st Qu.:0.01404    1st Qu.: 32.00  
##  Median :28.0        Median :20.00    Median :0.02330    Median : 45.00  
##  Mean   :27.1        Mean   :19.88    Mean   :0.02604    Mean   : 54.44  
##  3rd Qu.:32.0        3rd Qu.:27.00    3rd Qu.:0.03480    3rd Qu.: 65.00  
##  Max.   :40.0        Max.   :40.00    Max.   :0.11765    Max.   :944.00  
##  ALT.SGPT._median ALT.SGPT._min    ALT.SGPT._range    AST.SGOT._max   
##  Min.   :  8.00   Min.   :  1.60   Min.   :0.002747   Min.   : 11.00  
##  1st Qu.: 22.00   1st Qu.: 15.00   1st Qu.:0.030303   1st Qu.: 30.00  
##  Median : 30.00   Median : 21.00   Median :0.047619   Median : 38.00  
##  Mean   : 32.99   Mean   : 23.01   Mean   :0.071137   Mean   : 43.13  
##  3rd Qu.: 40.00   3rd Qu.: 28.00   3rd Qu.:0.077539   3rd Qu.: 48.00  
##  Max.   :193.00   Max.   :109.00   Max.   :2.383117   Max.   :911.00  
##  AST.SGOT._median AST.SGOT._min   AST.SGOT._range   Bicarbonate_max
##  Min.   :  9.00   Min.   : 1.00   Min.   :0.00000   Min.   :20.0   
##  1st Qu.: 22.00   1st Qu.:17.00   1st Qu.:0.02352   1st Qu.:29.0   
##  Median : 27.00   Median :20.00   Median :0.03502   Median :31.0   
##  Mean   : 29.08   Mean   :21.54   Mean   :0.04919   Mean   :30.9   
##  3rd Qu.: 34.00   3rd Qu.:25.00   3rd Qu.:0.05243   3rd Qu.:32.0   
##  Max.   :100.00   Max.   :86.00   Max.   :1.91667   Max.   :52.0   
##  Bicarbonate_median Bicarbonate_min Bicarbonate_range
##  Min.   :19.50      Min.   : 2.50   Min.   :0.00000  
##  1st Qu.:26.00      1st Qu.:22.00   1st Qu.:0.01266  
##  Median :27.00      Median :23.00   Median :0.01493  
##  Mean   :26.96      Mean   :23.16   Mean   :0.01687  
##  3rd Qu.:28.00      3rd Qu.:24.45   3rd Qu.:0.01815  
##  Max.   :39.50      Max.   :34.00   Max.   :0.21429  
##  Blood.Urea.Nitrogen..BUN._max Blood.Urea.Nitrogen..BUN._median
##  Min.   : 2.921                Min.   : 2.191                  
##  1st Qu.: 5.842                1st Qu.: 4.640                  
##  Median : 6.937                Median : 5.423                  
##  Mean   : 7.353                Mean   : 5.558                  
##  3rd Qu.: 8.210                3rd Qu.: 6.353                  
##  Max.   :25.192                Max.   :11.866                  
##  Blood.Urea.Nitrogen..BUN._min Blood.Urea.Nitrogen..BUN._range bp_diastolic_max
##  Min.   : 0.5842               Min.   :0.000000                Min.   : 70.00  
##  1st Qu.: 3.2859               1st Qu.:0.004109                1st Qu.: 88.00  
##  Median : 4.0700               Median :0.005817                Median : 90.00  
##  Mean   : 4.1609               Mean   :0.007133                Mean   : 92.03  
##  3rd Qu.: 5.0000               3rd Qu.:0.008353                3rd Qu.: 98.00  
##  Max.   :10.2228               Max.   :0.069543                Max.   :140.00  
##  bp_diastolic_median bp_diastolic_min bp_diastolic_range bp_systolic_max
##  Min.   : 56.00      Min.   : 20.00   Min.   :0.00000    Min.   :100.0  
##  1st Qu.: 78.00      1st Qu.: 65.00   1st Qu.:0.03527    1st Qu.:138.0  
##  Median : 80.00      Median : 70.00   Median :0.04337    Median :145.0  
##  Mean   : 81.11      Mean   : 69.89   Mean   :0.04766    Mean   :147.1  
##  3rd Qu.: 85.00      3rd Qu.: 75.00   3rd Qu.:0.05435    3rd Qu.:157.0  
##  Max.   :110.00      Max.   :100.00   Max.   :0.71429    Max.   :220.0  
##  bp_systolic_median bp_systolic_min bp_systolic_range  Calcium_max   
##  Min.   : 90.0      Min.   : 72.0   Min.   :0.00000   Min.   :2.171  
##  1st Qu.:120.0      1st Qu.:108.0   1st Qu.:0.05272   1st Qu.:2.400  
##  Median :130.0      Median :110.0   Median :0.06494   Median :2.470  
##  Mean   :129.6      Mean   :113.4   Mean   :0.07118   Mean   :2.475  
##  3rd Qu.:136.0      3rd Qu.:120.0   3rd Qu.:0.08190   3rd Qu.:2.530  
##  Max.   :190.0      Max.   :165.0   Max.   :0.40462   Max.   :9.460  
##  Calcium_median   Calcium_min     Calcium_range        Chloride_max  
##  Min.   :2.046   Min.   :0.2438   Min.   :0.0000000   Min.   : 96.0  
##  1st Qu.:2.283   1st Qu.:2.1707   1st Qu.:0.0003741   1st Qu.:106.0  
##  Median :2.345   Median :2.2300   Median :0.0004739   Median :107.0  
##  Mean   :2.346   Mean   :2.2229   Mean   :0.0005407   Mean   :107.2  
##  3rd Qu.:2.400   3rd Qu.:2.2977   3rd Qu.:0.0005893   3rd Qu.:109.0  
##  Max.   :2.800   Max.   :2.6500   Max.   :0.0129009   Max.   :119.0  
##  Chloride_median  Chloride_min    Chloride_range    Creatinine_max  
##  Min.   : 90.0   Min.   : 76.00   Min.   :0.00000   Min.   : 22.00  
##  1st Qu.:102.0   1st Qu.: 98.00   1st Qu.:0.01250   1st Qu.: 65.00  
##  Median :104.0   Median :100.00   Median :0.01587   Median : 79.56  
##  Mean   :103.5   Mean   : 99.26   Mean   :0.01787   Mean   : 78.78  
##  3rd Qu.:105.0   3rd Qu.:101.00   3rd Qu.:0.01990   3rd Qu.: 88.40  
##  Max.   :111.0   Max.   :109.00   Max.   :0.21429   Max.   :248.00  
##  Creatinine_median Creatinine_min   Creatinine_range   Gender_mean   
##  Min.   : 18.00    Min.   :  0.00   Min.   :0.00000   Min.   :1.000  
##  1st Qu.: 53.04    1st Qu.: 39.00   1st Qu.:0.03824   1st Qu.:1.000  
##  Median : 62.00    Median : 53.00   Median :0.04865   Median :2.000  
##  Mean   : 65.19    Mean   : 51.98   Mean   :0.05842   Mean   :1.637  
##  3rd Qu.: 78.85    3rd Qu.: 61.88   3rd Qu.:0.07026   3rd Qu.:2.000  
##  Max.   :176.80    Max.   :167.96   Max.   :0.42095   Max.   :2.000  
##   Glucose_max     Glucose_median    Glucose_min     Glucose_range     
##  Min.   : 4.160   Min.   : 3.497   Min.   : 0.000   Min.   :0.000000  
##  1st Qu.: 5.827   1st Qu.: 4.911   1st Qu.: 4.051   1st Qu.:0.003051  
##  Median : 6.500   Median : 5.300   Median : 4.440   Median :0.004695  
##  Mean   : 7.160   Mean   : 5.487   Mean   : 4.265   Mean   :0.006319  
##  3rd Qu.: 7.600   3rd Qu.: 5.695   3rd Qu.: 4.800   3rd Qu.:0.007373  
##  Max.   :33.688   Max.   :26.196   Max.   :12.200   Max.   :0.097463  
##    hands_max      hands_median     hands_min      hands_range      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000000  
##  1st Qu.:5.000   1st Qu.:3.000   1st Qu.:0.000   1st Qu.:0.003610  
##  Median :7.000   Median :5.500   Median :3.000   Median :0.006652  
##  Mean   :6.181   Mean   :4.905   Mean   :3.047   Mean   :0.006883  
##  3rd Qu.:8.000   3rd Qu.:7.000   3rd Qu.:5.000   3rd Qu.:0.009513  
##  Max.   :8.000   Max.   :8.000   Max.   :8.000   Max.   :0.042857  
##  Hematocrit_max   Hematocrit_median Hematocrit_min   Hematocrit_range  
##  Min.   : 0.373   Min.   : 0.362    Min.   : 0.311   Min.   :0.000000  
##  1st Qu.:42.300   1st Qu.:40.000    1st Qu.:37.000   1st Qu.:0.007164  
##  Median :45.200   Median :42.600    Median :40.000   Median :0.009701  
##  Mean   :41.939   Mean   :39.467    Mean   :36.962   Mean   :0.011431  
##  3rd Qu.:47.700   3rd Qu.:45.000    3rd Qu.:42.700   3rd Qu.:0.013579  
##  Max.   :81.000   Max.   :56.000    Max.   :52.900   Max.   :0.185714  
##  Hemoglobin_max  Hemoglobin_median Hemoglobin_min    Hemoglobin_range 
##  Min.   :116.0   Min.   :106.0     Min.   :  6.204   Min.   :0.00000  
##  1st Qu.:144.0   1st Qu.:136.0     1st Qu.:128.000   1st Qu.:0.02321  
##  Median :152.0   Median :145.0     Median :136.000   Median :0.03106  
##  Mean   :152.1   Mean   :144.3     Mean   :135.461   Mean   :0.03824  
##  3rd Qu.:160.0   3rd Qu.:152.0     3rd Qu.:145.000   3rd Qu.:0.04205  
##  Max.   :280.0   Max.   :182.0     Max.   :180.000   Max.   :0.56180  
##     leg_max       leg_median      leg_min        leg_range       
##  Min.   :0.00   Min.   :0.00   Min.   :0.000   Min.   :0.000000  
##  1st Qu.:3.00   1st Qu.:2.50   1st Qu.:1.000   1st Qu.:0.003378  
##  Median :5.00   Median :3.00   Median :2.000   Median :0.005435  
##  Mean   :5.31   Mean   :4.05   Mean   :2.493   Mean   :0.006163  
##  3rd Qu.:8.00   3rd Qu.:6.00   3rd Qu.:3.000   3rd Qu.:0.008718  
##  Max.   :8.00   Max.   :8.00   Max.   :8.000   Max.   :0.042017  
##    mouth_max      mouth_median      mouth_min       mouth_range      
##  Min.   : 1.00   Min.   : 0.000   Min.   : 0.000   Min.   :0.000000  
##  1st Qu.:10.00   1st Qu.: 8.000   1st Qu.: 5.000   1st Qu.:0.001815  
##  Median :12.00   Median :11.000   Median : 9.000   Median :0.005329  
##  Mean   :10.74   Mean   : 9.703   Mean   : 7.778   Mean   :0.006595  
##  3rd Qu.:12.00   3rd Qu.:12.000   3rd Qu.:11.000   3rd Qu.:0.010251  
##  Max.   :12.00   Max.   :12.000   Max.   :12.000   Max.   :0.036765  
##  onset_delta_mean onset_site_mean Platelets_max   Platelets_median
##  Min.   :-3119    Min.   :1.000   Min.   : 84.0   Min.   : 73.0   
##  1st Qu.: -887    1st Qu.:2.000   1st Qu.:239.0   1st Qu.:204.0   
##  Median : -572    Median :2.000   Median :275.0   Median :233.0   
##  Mean   : -683    Mean   :1.801   Mean   :285.3   Mean   :238.8   
##  3rd Qu.: -374    3rd Qu.:2.000   3rd Qu.:320.0   3rd Qu.:270.0   
##  Max.   :  -16    Max.   :3.000   Max.   :866.0   Max.   :526.0   
##  Platelets_min     Potassium_max    Potassium_median Potassium_min  
##  Min.   :  0.197   Min.   : 3.400   Min.   :3.000    Min.   :2.400  
##  1st Qu.:175.000   1st Qu.: 4.400   1st Qu.:4.000    1st Qu.:3.700  
##  Median :204.000   Median : 4.500   Median :4.200    Median :3.900  
##  Mean   :208.382   Mean   : 4.628   Mean   :4.189    Mean   :3.857  
##  3rd Qu.:236.000   3rd Qu.: 4.800   3rd Qu.:4.300    3rd Qu.:4.000  
##  Max.   :476.000   Max.   :43.000   Max.   :5.100    Max.   :5.100  
##  Potassium_range      pulse_max       pulse_median      pulse_min     
##  Min.   :0.000000   Min.   : 53.00   Min.   : 50.00   Min.   : 18.00  
##  1st Qu.:0.001058   1st Qu.: 84.00   1st Qu.: 72.00   1st Qu.: 60.00  
##  Median :0.001425   Median : 90.00   Median : 77.00   Median : 64.00  
##  Mean   :0.001744   Mean   : 90.64   Mean   : 76.97   Mean   : 65.37  
##  3rd Qu.:0.001913   3rd Qu.: 96.00   3rd Qu.: 81.00   3rd Qu.: 70.00  
##  Max.   :0.098674   Max.   :144.00   Max.   :115.00   Max.   :102.00  
##   pulse_range       respiratory_max respiratory_median respiratory_min
##  Min.   :0.005425   Min.   :2.00    Min.   :0.000      Min.   :0.000  
##  1st Qu.:0.036755   1st Qu.:4.00    1st Qu.:3.000      1st Qu.:2.000  
##  Median :0.048821   Median :4.00    Median :4.000      Median :3.000  
##  Mean   :0.053587   Mean   :3.91    Mean   :3.593      Mean   :2.791  
##  3rd Qu.:0.062365   3rd Qu.:4.00    3rd Qu.:4.000      3rd Qu.:4.000  
##  Max.   :0.500000   Max.   :4.00    Max.   :4.000      Max.   :4.000  
##  respiratory_range    Sodium_max    Sodium_median     Sodium_min   
##  Min.   :0.000000   Min.   :134.0   Min.   :128.0   Min.   :112.0  
##  1st Qu.:0.000000   1st Qu.:142.0   1st Qu.:139.0   1st Qu.:135.0  
##  Median :0.001828   Median :143.0   Median :140.0   Median :137.0  
##  Mean   :0.002513   Mean   :143.4   Mean   :140.1   Mean   :136.8  
##  3rd Qu.:0.003653   3rd Qu.:145.0   3rd Qu.:141.0   3rd Qu.:138.0  
##  Max.   :0.025424   Max.   :169.0   Max.   :146.5   Max.   :145.0  
##   Sodium_range       SubjectID        trunk_max      trunk_median  
##  Min.   :0.00000   Min.   :   533   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.01058   1st Qu.:240826   1st Qu.:5.000   1st Qu.:3.000  
##  Median :0.01312   Median :496835   Median :7.000   Median :5.000  
##  Mean   :0.01500   Mean   :498880   Mean   :6.204   Mean   :4.893  
##  3rd Qu.:0.01728   3rd Qu.:750301   3rd Qu.:8.000   3rd Qu.:6.500  
##  Max.   :0.14286   Max.   :999482   Max.   :8.000   Max.   :8.000  
##    trunk_min      trunk_range        Urine.Ph_max  Urine.Ph_median
##  Min.   :0.000   Min.   :0.000000   Min.   :5.00   Min.   :5.000  
##  1st Qu.:1.000   1st Qu.:0.003643   1st Qu.:6.00   1st Qu.:5.000  
##  Median :3.000   Median :0.006920   Median :7.00   Median :6.000  
##  Mean   :2.956   Mean   :0.007136   Mean   :6.82   Mean   :5.711  
##  3rd Qu.:5.000   3rd Qu.:0.009639   3rd Qu.:7.00   3rd Qu.:6.000  
##  Max.   :8.000   Max.   :0.042017   Max.   :9.00   Max.   :9.000  
##   Urine.Ph_min  
##  Min.   :5.000  
##  1st Qu.:5.000  
##  Median :5.000  
##  Mean   :5.183  
##  3rd Qu.:5.000  
##  Max.   :8.000

There are \(101\) variables (including the ID and the ALSFRS_slope outcome), and many of them represent summary statistics, like the max, min, and median values of the same clinical measurements.

Step 3 - training a model on the data

Now let’s explore the Boruta() function in the Boruta package to perform variable selection, based on random forest classification. Boruta() includes the following components:

vs <- Boruta(class~features, data=Mydata, pValue = 0.01, mcAdj = TRUE, maxRuns = 100, doTrace=0, getImp = getImpRfZ, ...)

  • class: variable for class labels.
  • features: potential features to select from.
  • data: dataset containing classes and features.
  • pValue: confidence level; the default value is 0.01 (note that variable selection involves multiple testing across features, which motivates the adjustment below).
  • mcAdj: Default TRUE to apply a multiple comparisons adjustment using the Bonferroni method.
  • maxRuns: maximal number of importance source runs. You may increase it to resolve attributes left Tentative.
  • doTrace: verbosity level. The default, 0, means no tracing; 1 reports the decision about each attribute as soon as it is justified; 2 is the same as 1, plus it reports the number of attributes at each importance source run.
  • getImp: function used to obtain attribute importance. The default, getImpRfZ, runs random forest from the ranger package and gathers \(Z\)-scores of the mean decrease accuracy measure.

The resulting vs object is of class Boruta and contains two important components:

  • finalDecision: a factor of three values: Confirmed, Rejected or Tentative, containing the final results of the feature selection process.
  • ImpHistory: a data frame of importance of attributes gathered in each importance source run. Besides the predictors’ importance, it contains maximal, mean and minimal importance of shadow attributes for each run. Rejected attributes get -Inf importance. This output is set to NULL if we specify holdHistory=FALSE in the Boruta call.

Caution: Running the code below will take several minutes.

# install.packages("Boruta")
library(Boruta)
set.seed(123)
als <- Boruta(ALSFRS_slope ~ . -ID, data=ALS.train, doTrace=0)
print(als)
## Boruta performed 99 iterations in 1.191877 mins.
##  30 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_max and 25 more;
##  61 attributes confirmed unimportant: Albumin_max, Albumin_median,
## Albumin_min, Albumin_range, ALT.SGPT._max and 56 more;
##  8 tentative attributes left: Age_mean, Hematocrit_median,
## Hematocrit_range, Hemoglobin_max, Hemoglobin_min and 3 more;
als$ImpHistory[1:6, 1:10]
##       Age_mean Albumin_max Albumin_median Albumin_min Albumin_range
## [1,] 2.2680963  0.37764697     0.35392394   0.1051619      2.915087
## [2,] 2.0267252  1.39739377     1.97034396   0.5878719      1.960934
## [3,] 2.3157588 -0.58408581     0.89600771   2.1274668      0.985526
## [4,] 2.4953558 -0.94574532     0.08017671   1.3725028      2.210370
## [5,] 0.6570802  0.07801328    -0.80266698   1.6603405      2.018822
## [6,] 2.9302386  0.99320619    -0.16963863   0.9274493      2.130164
##      ALSFRS_Total_max ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range
## [1,]         7.404657            8.189677         17.53358           25.78601
## [2,]         7.511764            8.637098         15.75552           26.48235
## [3,]         7.837504            8.542079         16.68604           25.39762
## [4,]         8.620842            7.146983         17.18074           24.54223
## [5,]         8.597765            8.538938         16.04533           27.65627
## [6,]         8.544448            7.747993         16.89295           26.83922
##      ALT.SGPT._max
## [1,]    0.94366835
## [2,]    0.42357699
## [3,]    1.52429646
## [4,]    1.77878291
## [5,]    0.07109222
## [6,]    2.32341000

This is a fairly time-consuming computation. Boruta separates the important attributes from the unimportant and tentative features. Here the importance is measured by the out-of-bag (OOB) error. The OOB error estimates the prediction error of machine learning methods (e.g., random forests and boosted decision trees) that utilize bootstrap aggregation to subsample the training data. OOB represents the mean prediction error for each training sample \(x_i\), using only the trees that did not include \(x_i\) in their bootstrap samples. Out-of-bag estimates provide an internal assessment of the learning accuracy and avoid the need for an independent external validation dataset.
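
As a hedged aside, the OOB error of a plain random forest fit on the same ALS data can be inspected directly with the randomForest package; this fit (about 200 trees on the 99 predictors) is separate from the ranger-based forests that Boruta builds internally and may take a minute or two to run.

# inspect the OOB error of a random forest regression fit (may take a minute or two)
library(randomForest)
set.seed(123)
X_als <- ALS.train[ , !(names(ALS.train) %in% c("ID", "ALSFRS_slope"))]
rf.als <- randomForest(x = X_als, y = ALS.train$ALSFRS_slope, ntree = 200)
rf.als$mse[rf.als$ntree]   # OOB mean squared error using all 200 trees
plot(rf.als)               # OOB error as the forest grows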

The importance scores for all features at every iteration are stored in the data frame als$ImpHistory. Let’s plot a graph depicting the essential features.

Note: Again, running this code will take several minutes to complete.

library(plotly)
# plot(als, xlab="", xaxt="n")
# lz<-lapply(1:ncol(als$ImpHistory), function(i)
# als$ImpHistory[is.finite(als$ImpHistory[, i]), i])
# names(lz)<-colnames(als$ImpHistory)
# lb<-sort(sapply(lz, median))
# axis(side=1, las=2, labels=names(lb), at=1:ncol(als$ImpHistory), cex.axis=0.5, font = 4)

df_long <- tidyr::gather(as.data.frame(als$ImpHistory), feature, measurement)

plot_ly(df_long, x=~feature, y = ~measurement, color = ~feature, type = "box") %>%
    layout(title="Box-and-whisker Plots across all 102 Features (ALS Data)",
           xaxis = list(title="Features", categoryorder = "total descending"), 
           yaxis = list(title="Importance"), showlegend=F)

We can see that plotting the graph is easy, but extracting the matching feature names may require more work. Another basic plot may be rendered using plot(als, xlab="", xaxt="n"), where xaxt="n" suppresses labeling of the x-axis. To reconstruct the correct x-axis labels, we need to create a list using the lapply() function, where each element contains all the importance scores for a single feature in the original dataset. Then, we can exclude all rejected features, sort the remaining features according to their median importance, and print them on the x-axis using axis().

We have already seen similar groups of boxplots back in Chapter 2. In this graph, variables with green boxes are more important than the ones represented with red boxes, and the plot also shows the range of importance scores within each variable.

It may be desirable to resolve the remaining tentative features using TentativeRoughFix(). Notice that this function should be used only when a strict decision is highly desired, because this test is much weaker than Boruta’s and can lower the confidence of the final result.

final.als <- TentativeRoughFix(als)
print(final.als)
## Boruta performed 99 iterations in 1.191877 mins.
## Tentatives roughfixed over the last 99 iterations.
##  35 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_max and 30 more;
##  64 attributes confirmed unimportant: Age_mean, Albumin_max,
## Albumin_median, Albumin_min, Albumin_range and 59 more;
final.als$finalDecision
##                         Age_mean                      Albumin_max 
##                         Rejected                         Rejected 
##                   Albumin_median                      Albumin_min 
##                         Rejected                         Rejected 
##                    Albumin_range                 ALSFRS_Total_max 
##                         Rejected                        Confirmed 
##              ALSFRS_Total_median                 ALSFRS_Total_min 
##                        Confirmed                        Confirmed 
##               ALSFRS_Total_range                    ALT.SGPT._max 
##                        Confirmed                         Rejected 
##                 ALT.SGPT._median                    ALT.SGPT._min 
##                         Rejected                         Rejected 
##                  ALT.SGPT._range                    AST.SGOT._max 
##                         Rejected                         Rejected 
##                 AST.SGOT._median                    AST.SGOT._min 
##                         Rejected                         Rejected 
##                  AST.SGOT._range                  Bicarbonate_max 
##                         Rejected                         Rejected 
##               Bicarbonate_median                  Bicarbonate_min 
##                         Rejected                         Rejected 
##                Bicarbonate_range    Blood.Urea.Nitrogen..BUN._max 
##                         Rejected                         Rejected 
## Blood.Urea.Nitrogen..BUN._median    Blood.Urea.Nitrogen..BUN._min 
##                         Rejected                         Rejected 
##  Blood.Urea.Nitrogen..BUN._range                 bp_diastolic_max 
##                         Rejected                         Rejected 
##              bp_diastolic_median                 bp_diastolic_min 
##                         Rejected                         Rejected 
##               bp_diastolic_range                  bp_systolic_max 
##                         Rejected                         Rejected 
##               bp_systolic_median                  bp_systolic_min 
##                         Rejected                         Rejected 
##                bp_systolic_range                      Calcium_max 
##                         Rejected                         Rejected 
##                   Calcium_median                      Calcium_min 
##                         Rejected                         Rejected 
##                    Calcium_range                     Chloride_max 
##                         Rejected                         Rejected 
##                  Chloride_median                     Chloride_min 
##                         Rejected                         Rejected 
##                   Chloride_range                   Creatinine_max 
##                         Rejected                        Confirmed 
##                Creatinine_median                   Creatinine_min 
##                        Confirmed                        Confirmed 
##                 Creatinine_range                      Gender_mean 
##                         Rejected                         Rejected 
##                      Glucose_max                   Glucose_median 
##                         Rejected                         Rejected 
##                      Glucose_min                    Glucose_range 
##                         Rejected                         Rejected 
##                        hands_max                     hands_median 
##                        Confirmed                        Confirmed 
##                        hands_min                      hands_range 
##                        Confirmed                        Confirmed 
##                   Hematocrit_max                Hematocrit_median 
##                        Confirmed                        Confirmed 
##                   Hematocrit_min                 Hematocrit_range 
##                        Confirmed                        Confirmed 
##                   Hemoglobin_max                Hemoglobin_median 
##                        Confirmed                         Rejected 
##                   Hemoglobin_min                 Hemoglobin_range 
##                        Confirmed                        Confirmed 
##                          leg_max                       leg_median 
##                        Confirmed                        Confirmed 
##                          leg_min                        leg_range 
##                        Confirmed                        Confirmed 
##                        mouth_max                     mouth_median 
##                        Confirmed                        Confirmed 
##                        mouth_min                      mouth_range 
##                        Confirmed                        Confirmed 
##                 onset_delta_mean                  onset_site_mean 
##                        Confirmed                        Confirmed 
##                    Platelets_max                 Platelets_median 
##                         Rejected                         Rejected 
##                    Platelets_min                    Potassium_max 
##                         Rejected                         Rejected 
##                 Potassium_median                    Potassium_min 
##                         Rejected                         Rejected 
##                  Potassium_range                        pulse_max 
##                         Rejected                         Rejected 
##                     pulse_median                        pulse_min 
##                         Rejected                         Rejected 
##                      pulse_range                  respiratory_max 
##                         Rejected                         Rejected 
##               respiratory_median                  respiratory_min 
##                        Confirmed                        Confirmed 
##                respiratory_range                       Sodium_max 
##                        Confirmed                         Rejected 
##                    Sodium_median                       Sodium_min 
##                         Rejected                         Rejected 
##                     Sodium_range                        SubjectID 
##                         Rejected                         Rejected 
##                        trunk_max                     trunk_median 
##                        Confirmed                        Confirmed 
##                        trunk_min                      trunk_range 
##                        Confirmed                        Confirmed 
##                     Urine.Ph_max                  Urine.Ph_median 
##                         Rejected                         Rejected 
##                     Urine.Ph_min 
##                         Rejected 
## Levels: Tentative Confirmed Rejected
getConfirmedFormula(final.als)
## ALSFRS_slope ~ ALSFRS_Total_max + ALSFRS_Total_median + ALSFRS_Total_min + 
##     ALSFRS_Total_range + Creatinine_max + Creatinine_median + 
##     Creatinine_min + hands_max + hands_median + hands_min + hands_range + 
##     Hematocrit_max + Hematocrit_median + Hematocrit_min + Hematocrit_range + 
##     Hemoglobin_max + Hemoglobin_min + Hemoglobin_range + leg_max + 
##     leg_median + leg_min + leg_range + mouth_max + mouth_median + 
##     mouth_min + mouth_range + onset_delta_mean + onset_site_mean + 
##     respiratory_median + respiratory_min + respiratory_range + 
##     trunk_max + trunk_median + trunk_min + trunk_range
## <environment: 0x000001e68d0b5520>
# report the Boruta "Confirmed" & "Tentative" features, removing the "Rejected" ones
print(final.als$finalDecision[final.als$finalDecision %in% c("Confirmed", "Tentative")])
##    ALSFRS_Total_max ALSFRS_Total_median    ALSFRS_Total_min  ALSFRS_Total_range 
##           Confirmed           Confirmed           Confirmed           Confirmed 
##      Creatinine_max   Creatinine_median      Creatinine_min           hands_max 
##           Confirmed           Confirmed           Confirmed           Confirmed 
##        hands_median           hands_min         hands_range      Hematocrit_max 
##           Confirmed           Confirmed           Confirmed           Confirmed 
##   Hematocrit_median      Hematocrit_min    Hematocrit_range      Hemoglobin_max 
##           Confirmed           Confirmed           Confirmed           Confirmed 
##      Hemoglobin_min    Hemoglobin_range             leg_max          leg_median 
##           Confirmed           Confirmed           Confirmed           Confirmed 
##             leg_min           leg_range           mouth_max        mouth_median 
##           Confirmed           Confirmed           Confirmed           Confirmed 
##           mouth_min         mouth_range    onset_delta_mean     onset_site_mean 
##           Confirmed           Confirmed           Confirmed           Confirmed 
##  respiratory_median     respiratory_min   respiratory_range           trunk_max 
##           Confirmed           Confirmed           Confirmed           Confirmed 
##        trunk_median           trunk_min         trunk_range 
##           Confirmed           Confirmed           Confirmed 
## Levels: Tentative Confirmed Rejected
# how many are actually "confirmed" as important/salient?
impBoruta <- final.als$finalDecision[final.als$finalDecision %in% c("Confirmed")]; length(impBoruta)
## [1] 35

This shows the final feature selection results.
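
For reporting purposes, the Boruta object also stores per-attribute summary statistics of the importance history; a brief sketch using Boruta::attStats() is shown below.

# summarize the Boruta importance history (mean/median Z-scores, hit rates, and decisions)
stats.als <- attStats(final.als)
head(stats.als[order(-stats.als$meanImp), ])   # top features by mean importance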

Step 4 - evaluating model performance

1.1.5.1 Comparing with RFE

Let’s compare the Boruta results against a classical variable selection method - recursive feature elimination (RFE). First, we need to load two packages: caret and randomForest. Then, similar to Chapter 9 we must specify a resampling method. Here we use 10-fold CV to do the resampling.

library(caret)
library(randomForest)
set.seed(123)
control <- rfeControl(functions = rfFuncs, method = "cv", number=10)

Now, all preparations are complete and we are ready to do the RFE variable selection.

rf.train <- rfe(ALS.train[, -c(1, 7)], ALS.train[, 7], sizes=c(10, 20, 30, 40), rfeControl=control)
rf.train
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables   RMSE Rsquared    MAE  RMSESD RsquaredSD   MAESD Selected
##         10 0.3490   0.6831 0.2479 0.03305    0.05288 0.01580         
##         20 0.3468   0.6876 0.2463 0.03110    0.04798 0.01396        *
##         30 0.3482   0.6852 0.2479 0.03253    0.04827 0.01474         
##         40 0.3470   0.6876 0.2473 0.03411    0.04927 0.01494         
##         99 0.3511   0.6807 0.2499 0.03300    0.04967 0.01472         
## 
## The top 5 variables (out of 20):
##    ALSFRS_Total_range, hands_range, trunk_range, ALSFRS_Total_min, mouth_range

This calculation may take a long time to complete. The RFE invocation is different from Boruta; here we have to supply the feature data frame and the outcome vector separately. Also, the sizes= option allows us to specify the candidate numbers of features to retain in the model. Above we used sizes=c(10, 20, 30, 40) to compare the model performance for alternative numbers of features.

To visualize the results, we can plot the 5 different feature-subset sizes listed in the summary. The subset with 20 features has the lowest RMSE and is the one selected by RFE. This result is broadly consistent with the Boruta output, which confirmed 35 features.

plot(rf.train, type=c("g", "o"), cex=1, col=1:5)

# df <- as.data.frame(cbind(variables=rf.train$variables$var[1:5], RMSE=rf.train$results$RMSE,
#                           Rsquared=rf.train$results$Rsquared, MAE=rf.train$results$MAE,
#                           RMSESD = rf.train$results$RMSESD,
#                           RsquaredSD= rf.train$results$RsquaredSD, MAESD=rf.train$results$MAESD))
# 
# data_long <- tidyr::gather(df, Metric, value, RMSE:MAESD, factor_key=TRUE)
# 
# plot_ly(data_long, x=~variables, y=~value, color=~as.factor(Metric), type = "scatter", mode="lines")

Using the functions predictors() and getSelectedAttributes(), we can compare the final results of the two alternative feature selection methods.

predRFE <- predictors(rf.train)
predBoruta <- getSelectedAttributes(final.als, withTentative = F)

The results overlap substantially - every feature selected by RFE is also confirmed by Boruta:

intersect(predBoruta, predRFE)
##  [1] "ALSFRS_Total_max"    "ALSFRS_Total_median" "ALSFRS_Total_min"   
##  [4] "ALSFRS_Total_range"  "Creatinine_max"      "hands_max"          
##  [7] "hands_median"        "hands_min"           "hands_range"        
## [10] "leg_median"          "leg_min"             "leg_range"          
## [13] "mouth_median"        "mouth_min"           "mouth_range"        
## [16] "onset_delta_mean"    "respiratory_range"   "trunk_median"       
## [19] "trunk_min"           "trunk_range"

There are 20 common variables chosen by the two techniques, which suggests that both the Boruta and RFE methods are fairly robust. Also, notice that the Boruta method gives similar results without requiring a pre-specified subset size. If we wanted to consider 10 or more different sizes with RFE, the procedure would be quite time-consuming. Thus, the Boruta method is effective when dealing with complex real-world problems.
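
One simple way to quantify the agreement between the two selected feature sets is the Jaccard similarity, the size of their intersection divided by the size of their union; a minimal sketch follows.

# Jaccard similarity between the Boruta-confirmed and RFE-selected feature sets
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
jaccard(predBoruta, predRFE)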

1.1.5.2 Comparing with stepwise feature selection

Next, we can contrast the Boruta feature selection results against another classical variable selection method - stepwise model selection. Let’s start with fitting a bidirectional stepwise linear model-based feature selection.

data2 <- ALS.train[, -1]
# Define a base model - intercept only
base.mod <- lm(ALSFRS_slope ~ 1 , data= data2)
# Define the full model - including all predictors
all.mod <- lm(ALSFRS_slope ~ . , data= data2)
# ols_step <- lm(ALSFRS_slope ~ ., data=data2)
ols_step <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = 'both', k=2, trace = F)
summary(ols_step); # ols_step
## 
## Call:
## lm(formula = ALSFRS_slope ~ ALSFRS_Total_range + ALSFRS_Total_median + 
##     ALSFRS_Total_min + Calcium_range + Calcium_max + bp_diastolic_min + 
##     onset_delta_mean + Calcium_min + Albumin_range + Glucose_range + 
##     ALT.SGPT._median + AST.SGOT._median + Glucose_max + Glucose_min + 
##     Creatinine_range + Potassium_range + Chloride_range + Chloride_min + 
##     Sodium_median + respiratory_min + respiratory_range + respiratory_max + 
##     trunk_range + pulse_range + Bicarbonate_max + Bicarbonate_range + 
##     Chloride_max + onset_site_mean + trunk_max + Gender_mean + 
##     Creatinine_min, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.22558 -0.17875 -0.02024  0.17098  1.95100 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.176e-01  6.064e-01   0.689 0.491091    
## ALSFRS_Total_range  -2.260e+01  1.359e+00 -16.631  < 2e-16 ***
## ALSFRS_Total_median -3.388e-02  2.868e-03 -11.812  < 2e-16 ***
## ALSFRS_Total_min     2.821e-02  3.310e-03   8.524  < 2e-16 ***
## Calcium_range        2.410e+02  4.188e+01   5.754 9.94e-09 ***
## Calcium_max         -4.258e-01  8.846e-02  -4.813 1.59e-06 ***
## bp_diastolic_min    -2.249e-03  8.856e-04  -2.540 0.011161 *  
## onset_delta_mean    -5.461e-05  1.980e-05  -2.758 0.005856 ** 
## Calcium_min          3.579e-01  9.501e-02   3.767 0.000169 ***
## Albumin_range       -2.305e+00  8.197e-01  -2.812 0.004967 ** 
## Glucose_range       -1.510e+01  2.929e+00  -5.156 2.75e-07 ***
## ALT.SGPT._median    -2.300e-03  7.998e-04  -2.876 0.004062 ** 
## AST.SGOT._median     3.369e-03  1.276e-03   2.641 0.008316 ** 
## Glucose_max          3.279e-02  7.082e-03   4.630 3.88e-06 ***
## Glucose_min         -3.507e-02  8.718e-03  -4.023 5.95e-05 ***
## Creatinine_range     5.076e-01  2.214e-01   2.293 0.021925 *  
## Potassium_range     -4.535e+00  2.607e+00  -1.739 0.082128 .  
## Chloride_range       5.318e+00  1.188e+00   4.475 8.04e-06 ***
## Chloride_min         1.672e-02  3.797e-03   4.404 1.12e-05 ***
## Sodium_median       -9.830e-03  4.639e-03  -2.119 0.034227 *  
## respiratory_min     -1.453e-01  2.442e-02  -5.948 3.14e-09 ***
## respiratory_range   -5.834e+01  1.013e+01  -5.757 9.78e-09 ***
## respiratory_max      1.712e-01  3.395e-02   5.042 4.99e-07 ***
## trunk_range         -8.705e+00  3.088e+00  -2.819 0.004860 ** 
## pulse_range         -5.117e-01  3.016e-01  -1.697 0.089874 .  
## Bicarbonate_max      7.526e-03  2.931e-03   2.568 0.010292 *  
## Bicarbonate_range   -2.204e+00  9.567e-01  -2.304 0.021329 *  
## Chloride_max        -6.918e-03  3.952e-03  -1.751 0.080143 .  
## onset_site_mean      3.359e-02  2.019e-02   1.663 0.096359 .  
## trunk_max            2.288e-02  8.453e-03   2.706 0.006854 ** 
## Gender_mean         -3.360e-02  1.751e-02  -1.919 0.055066 .  
## Creatinine_min       7.643e-04  4.977e-04   1.536 0.124771    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3355 on 2191 degrees of freedom
## Multiple R-squared:  0.7135, Adjusted R-squared:  0.7094 
## F-statistic:   176 on 31 and 2191 DF,  p-value: < 2.2e-16

We can report the stepwise “Confirmed” (important) features:

# get the shortlisted variable
stepwiseConfirmedVars <- names(unlist(ols_step[[1]]))
# remove the intercept 
stepwiseConfirmedVars <- stepwiseConfirmedVars[!stepwiseConfirmedVars %in% "(Intercept)"]
print(stepwiseConfirmedVars)
##  [1] "ALSFRS_Total_range"  "ALSFRS_Total_median" "ALSFRS_Total_min"   
##  [4] "Calcium_range"       "Calcium_max"         "bp_diastolic_min"   
##  [7] "onset_delta_mean"    "Calcium_min"         "Albumin_range"      
## [10] "Glucose_range"       "ALT.SGPT._median"    "AST.SGOT._median"   
## [13] "Glucose_max"         "Glucose_min"         "Creatinine_range"   
## [16] "Potassium_range"     "Chloride_range"      "Chloride_min"       
## [19] "Sodium_median"       "respiratory_min"     "respiratory_range"  
## [22] "respiratory_max"     "trunk_range"         "pulse_range"        
## [25] "Bicarbonate_max"     "Bicarbonate_range"   "Chloride_max"       
## [28] "onset_site_mean"     "trunk_max"           "Gender_mean"        
## [31] "Creatinine_min"

The feature selection results of Boruta and step() only partially overlap.

library(mlbench)
library(caret)

# estimate variable importance
predStepwise <- varImp(ols_step, scale=FALSE)
# summarize importance
print(predStepwise)
##                       Overall
## ALSFRS_Total_range  16.630592
## ALSFRS_Total_median 11.812263
## ALSFRS_Total_min     8.523606
## Calcium_range        5.754045
## Calcium_max          4.812942
## bp_diastolic_min     2.539766
## onset_delta_mean     2.758465
## Calcium_min          3.767450
## Albumin_range        2.812018
## Glucose_range        5.156259
## ALT.SGPT._median     2.876338
## AST.SGOT._median     2.641369
## Glucose_max          4.629759
## Glucose_min          4.022642
## Creatinine_range     2.293301
## Potassium_range      1.739268
## Chloride_range       4.474709
## Chloride_min         4.403551
## Sodium_median        2.118710
## respiratory_min      5.948488
## respiratory_range    5.756735
## respiratory_max      5.041816
## trunk_range          2.819029
## pulse_range          1.696811
## Bicarbonate_max      2.568068
## Bicarbonate_range    2.303757
## Chloride_max         1.750666
## onset_site_mean      1.663481
## trunk_max            2.706410
## Gender_mean          1.919380
## Creatinine_min       1.535642
# plot predStepwise
# plot(predStepwise)

# Boruta vs. Stepwise feature selection
intersect(predBoruta, stepwiseConfirmedVars) 
##  [1] "ALSFRS_Total_median" "ALSFRS_Total_min"    "ALSFRS_Total_range" 
##  [4] "Creatinine_min"      "onset_delta_mean"    "onset_site_mean"    
##  [7] "respiratory_min"     "respiratory_range"   "trunk_max"          
## [10] "trunk_range"

There are \(10\) common variables chosen by the Boruta and stepwise feature selection methods.
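
We can also check which features all three approaches (Boruta, RFE, and stepwise selection) agree on; this one-liner simply intersects the three selected sets.

# features selected by all three methods (Boruta, RFE, and stepwise)
Reduce(intersect, list(predBoruta, predRFE, stepwiseConfirmedVars))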

There is another, more elaborate, stepwise feature selection technique implemented in the function MASS::stepAIC(), which is applicable to a wider range of object classes.
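
For completeness, here is a hedged sketch of how MASS::stepAIC() could be invoked on the same full linear model defined above; it mirrors the earlier step() call, may be slow, and is shown only to illustrate the interface.

# bidirectional AIC-based selection starting from the full model (may be slow)
library(MASS)
aic_step <- stepAIC(all.mod, direction = "both", trace = FALSE)
# formula(aic_step)          # inspect the retained predictors
length(coef(aic_step)) - 1   # number of selected features (excluding the intercept)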

1.2 Regularized Linear Modeling and Controlled Variable Selection

Many biomedical and biosocial studies involve large amounts of complex data, including cases where the number of features (\(k\)) is large and may exceed the number of cases (\(n\)). In such situations, parameter estimates are difficult to compute or may be unreliable as the system is underdetermined. Regularization provides one approach to improve model reliability, prediction accuracy, and result interpretability. It is based on augmenting the fidelity term of the objective function used in the model fitting process with a regularization term that provides restrictions on the parameter space.

Classical techniques for choosing important covariates to include in a model of complex multivariate data rely on various types of stepwise variable selection processes. These tend to improve prediction accuracy in certain situations, e.g., when a small number of features are strongly predictive, or heavily associated, with the clinical outcome or the specific biosocial trait. However, the prediction error may be large when the model relies purely on a fidelity term. Including an additional regularization term in the optimization of the cost function improves the prediction accuracy. For example, below we show that by shrinking large regression coefficients, ridge regularization reduces overfitting and improves prediction error. Similarly, the least absolute shrinkage and selection operator (LASSO) employs regularization to perform simultaneous parameter estimation and variable selection. LASSO enhances the prediction accuracy and provides a natural interpretation of the resulting model. Regularization refers to forcing certain characteristics on the model, or the corresponding scientific inference. Examples include discouraging complex models or extreme explanations, even if they fit the data, enforcing model generalizability to prospective data, or restricting model overfitting of accidental samples.
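
To preview these ideas, the hedged sketch below fits ridge (alpha = 0) and LASSO (alpha = 1) models to the ALS data using the glmnet package; reusing the ALS training data here, and reporting the number of nonzero LASSO coefficients at the cross-validated lambda, are illustrative choices rather than part of any formal analysis.

# illustrative ridge and LASSO fits on the ALS data
# install.packages("glmnet")
library(glmnet)
set.seed(123)
X_mat <- as.matrix(ALS.train[ , !(names(ALS.train) %in% c("ID", "ALSFRS_slope"))])
y_vec <- ALS.train$ALSFRS_slope

cv.ridge <- cv.glmnet(X_mat, y_vec, alpha = 0)   # ridge: shrinks all coefficients
cv.lasso <- cv.glmnet(X_mat, y_vec, alpha = 1)   # LASSO: shrinks and selects

# number of nonzero LASSO coefficients at the CV-optimal lambda (excluding the intercept)
sum(coef(cv.lasso, s = "lambda.min")[-1] != 0)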

In this section, we extend the mathematical foundation we presented in Chapter 3 and (1) discuss computational protocols for handling complex high-dimensional data, (2) illustrate model estimation by controlling the false-positive rate of selection of salient features, and (3) derive effective forecasting models.

1.2.1 General Questions

Applications of regularized linear modeling techniques will help us address problems like:

  • How to deal with extremely high-dimensional data (hundreds or thousands of features)?
  • Why mix fidelity (model fit) and regularization (model interpretability) terms in objective function optimization?
  • How to reduce the false-positive rate, increase scientific validation, and improve result reproducibility (e.g., Knockoff filtering)?

1.2.2 Model Regularization

In data-driven sciences, regularization is the process of introducing constraints, adding information to, or smoothing a model aiming to generate a realistic solution to an ill-posed (or under-determined) problem, to prevent overfitting, or to improve the model interpretability.

Regularization of objective functions is a commonly used strategy to solve ill-posed optimization problems (Chapter 13). This involves introducing another regularization term penalizing the model for not complying with the additional constraints or increasing the magnitude of the cost function to enforce convergence of the model to an “optimal” or a “unique” solution. The example below illustrates a schematic of regularization.

Suppose we fit several different (polynomial) models that have near-perfect model fidelity, i.e., all models pass very close to the set of anchor points we specified. In that sense, all models represent near-perfect solutions to this unconstrained, non-regularized, optimization problem - they fit the data well. Now, we can introduce an additional constraint that we want a simple model, e.g., smooth, differentiable, integrable; we are looking for an easy-to-interpret model. This can be accomplished by adding a regularization term to the objective function that requires the model, in addition to passing through (or near) the anchor points, to be “simple”. Which of the different models appears simpler in the example below? The fidelity of the model is captured by how closely it fits the set of anchor points (see the RMSE error). The model regularizer enforces a simpler model representation, i.e., a lower polynomial order.

The following example demonstrates the heuristics of fitting a regularized model where the objective function is a mixture of a fidelity term (polynomial fit to data) and a penalty term (enforcing conditions restricting the model flexibility to specific points).

# define a function of interest e.g., Runge's function
runge <- function(x) {
  1/(1 + 25*x^2)
}

# define the anchor (knot) points 
knots <- seq(-1, 1, 0.01)

# library(rSymPy)
library(polynom)
library(tidyverse)
library(ModelMetrics)

# function generator:
# INPUT: (Number of Interpolation nodes - 1 == order) between -1 and 1
# OUTPUT: A list containing the tuple (data_frame, f)
#   where data_frame is the corresponding nodes
#   where f is function object induced by the lagrange interpolation

lag_poly <- function(order) {
  X_nodes <- seq(-1, 1, 2/order)
  Y_coor <- runge(X_nodes)
  f <- as.function(poly.calc(X_nodes,Y_coor))
  RMSE = rmse(f(knots), runge(knots))
  print(paste("The ", order, " order polynomial interpolation has this RMS Error:", RMSE))
  X <- data.frame(X_nodes)
  Y <- data.frame(Y_coor)
  lag_poly <- c(f, X, Y)
}

#Adding OTHER order functions here
third_order <- lag_poly(3)
## [1] "The  3  order polynomial interpolation has this RMS Error: 0.243252070004637"
fourth_order <- lag_poly(4)
## [1] "The  4  order polynomial interpolation has this RMS Error: 0.278491961299107"
sixth_order <- lag_poly(6)
## [1] "The  6  order polynomial interpolation has this RMS Error: 0.273264650792483"
eigth_order <- lag_poly(8)
## [1] "The  8  order polynomial interpolation has this RMS Error: 0.367531069757977"
# twentyth_order <- lag_poly(20)

# unlist other created polynomials
X_dat <- c(
  unlist(flatten(third_order[2])),
  unlist(flatten(fourth_order[2])),
  unlist(flatten(sixth_order[2])),
  unlist(flatten(eigth_order[2])) # ,
  #unlist(flatten(twentyth_order[2]))
  )
Y_dat <-c(
  unlist(flatten(third_order[3])),
  unlist(flatten(fourth_order[3])),
  unlist(flatten(sixth_order[3])),
  unlist(flatten(eigth_order[3])) #,
  # unlist(flatten(twentyth_order[3]))
  )
Labels <- c(
  rep("third_order",4),
  rep("fourth_order",5),
  rep("sixth_order",7),
  rep("eigth_order",9) # ,
  # rep("twentyth_order",21)
  )
dat <- data.frame(X=X_dat,Y=Y_dat,label=Labels) # print(second_order[[2]])

ord=8
X_nodes <- seq(-1, 1, 2/ord)
Y_coor <- runge(X_nodes)
# fit8 <- lm(Y_coor ~ poly(X_nodes, 8, raw=TRUE))

library(plotly)
xSample <- seq(-1, 1, length.out = 1000)
plot_ly(x=~xSample, y=~third_order[[1]](xSample), type="scatter", mode="lines", name="third_order") %>%
  add_trace(x=~xSample, y=~fourth_order[[1]](xSample), mode="lines", name="fourth_order") %>%
  add_trace(x=~xSample, y=~sixth_order[[1]](xSample), mode="lines", name="sixth_order") %>%
  add_trace(x=~xSample, y=~eigth_order[[1]](xSample), mode="lines", name="eigth_order") %>%
  add_markers(x=~dat$X, y=~dat$Y, mode="markers", name="Anchor Points", marker=list(size=20)) %>%
  layout(title="Objective Functions as a Mix of Fidelity and Regularization Terms", 
         xaxis=list(title="Domain"), yaxis=list(title="Model Range"))
# the color sequence will be reversed
# ggplot(dat , aes(x=X, y=Y,color=label)) + labs(title= "Perfect Fidelity Models",
#                       y="Y", x = "X")+ scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07","#CC79A7"))+
#   theme(plot.title = element_text(hjust = 0.5))+
#   geom_point(size=3,shape=1, aes(colour = label)) + 
#   geom_function(fun = third_order[[1]], size=1, alpha=0.4,color="#CC79A7")+
#   stat_function(fun = fourth_order[[1]], size=1, alpha=0.4,color = "#FC4E07")+
#   stat_function(fun = sixth_order[[1]], size=1, alpha=0.4,color =  "#E7B800")+
#   stat_function(fun = eigth_order[[1]], size=1, alpha=0.4, color= "#00AFBB")+
#   stat_function(fun = runge, size=1, alpha=0.4)

1.2.3 Matrix notation

We should review the basics of matrix notation, linear algebra, and matrix computing we covered in Chapter 3. At the core of matrix manipulations are scalars, vectors and matrices.

  • \({y}_i\): output or response variable, \(i = 1, ..., n\) (cases, subjects, units, etc.)
  • \(x_{ij}\): input, predictor, or feature variable, \(1\leq j \leq k,\ 1\leq i \leq n.\)

\[{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \ ,\]

and

\[\quad {X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,k} \end{pmatrix}_{cases\times features}.\]

1.2.4 Regularized Linear Modeling

In the special case where the covariates are orthonormal and the number of cases exceeds the number of features (\(k<n\)), the design matrix \(X\) yields an identity cross-product matrix, \(X^T X = I\).

  • The ordinary least squares (OLS) estimates minimize the following objective function: \[\min_{ \beta \in \mathbb{R}^k } \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2\right\}.\]

In general, when \(k<n\) and \(X\) has full column rank, the cross-product matrix \(X^T X\) is a non-singular (invertible) square matrix, and the OLS estimates are expressed analytically by

\[\hat{\beta}^{OLS} = (X^T X)^{-1} X^T y.\] Otherwise, when \(k>n\), the cross-product matrix \(X^T X\) is singular, i.e., not invertible. However, the related \(k\times k\) matrix \(X^TX+\lambda I\), where \(\lambda>0\) is a regularization constant, is invertible and leads to a regularized linear model solution.

  • LASSO estimates minimize a modified cost function \[\min_{ \beta \in \mathbb{R}^{k+1}, \lambda \in \mathbb{R}^{+} } \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 + \lambda \| \beta \|_1 \right\}.\]

These LASSO estimates may be expressed via a soft-thresholding function of the OLS estimates: \[\hat{\beta}_j = S_{N \lambda}( \hat{\beta}^\text{OLS}_j ) = \hat{\beta}^\text{OLS}_j \max \left( 0, 1 - \frac{ N \lambda }{ |\hat{\beta}^{OLS}_j| } \right), \]

where \(S_{N \lambda}\) is a soft thresholding operator translating values towards zero. This is different from the hard thresholding operator, which sets smaller values to zero and leaves larger ones unchanged.

  • Ridge regression minimizes a similar objective function (using a different norm):

\[\min_{ \beta \in \mathbb{R}^{k+1}, \lambda \in \mathbb{R}^{+} } \left\{ \frac{1}{N} \| y - X \beta \|_2^2 + \lambda \| \beta \|_2^2 \right\}.\]

This yields the ridge estimates \(\hat{\beta}_j = (1 + N \lambda )^{-1} \hat{\beta}^{OLS}_j\). Thus, ridge regression shrinks all coefficients by a uniform factor, \((1 + N \lambda)^{-1}\), and does not set any coefficients to zero.

  • Best subset selection regression, also known as orthogonal matching pursuit (OMP), minimizes the same cost function with respect to the zero-norm:

\[\min_{ \beta \in \mathbb{R}^{k+1}, \lambda \in \mathbb{R}^{+} } \left\{ \frac{1}{N} \left\| y - X \beta \right\|_2^2 + \lambda \| \beta \|_0 \right\}, \]

where \(\|\cdot\|_0\) is the “\(\ell^0\) norm”, defined for \(z\in \mathbb{R}^d\) as \(\| z \|_0 = m\), where exactly \(m\) components of \(z\) are nonzero. In this case, a closed form of the parameter estimates is

\[\hat{\beta}_j = H_{ \sqrt{ N \lambda } } \left( \hat{\beta}^{OLS}_j \right) = \hat{\beta}^{OLS}_j I \left( \left| \hat{\beta}^{OLS}_j \right| \geq \sqrt{ N \lambda } \right), \]

where \(H_\alpha\) is a hard-thresholding function and \(I\) is an indicator function (it is 1 if its argument is true, and 0 otherwise).

The LASSO estimates share features with both ridge and best subset (OMP) selection: LASSO shrinks the magnitude of all the coefficients, like ridge regression, but it also sets some of them exactly to zero, as in best subset selection. Ridge regression scales all of the coefficients by a constant factor, whereas LASSO translates the coefficients towards zero by a constant amount and then sets the small ones to zero.
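
To make the contrast concrete, here is a minimal, self-contained sketch (the operator functions softThresh(), hardThresh(), and ridgeShrink() are hypothetical names introduced just for this illustration) that applies the three shrinkage rules to a grid of hypothetical OLS estimates and plots them with plotly, as used elsewhere in this chapter.

# illustrative shrinkage operators applied to a grid of hypothetical OLS estimates
softThresh  <- function(b, t) sign(b) * pmax(abs(b) - t, 0)   # LASSO: translate towards 0, then truncate
hardThresh  <- function(b, t) b * (abs(b) >= t)               # best subset (OMP): keep large, zero-out small
ridgeShrink <- function(b, t) b / (1 + t)                     # ridge: uniform rescaling, never exactly 0

bOLS <- seq(-3, 3, by = 0.05)                                 # grid of hypothetical OLS estimates

library(plotly)
plot_ly(x=~bOLS, y=~softThresh(bOLS, 1), type="scatter", mode="lines", name="soft threshold (LASSO)") %>%
  add_trace(x=~bOLS, y=~hardThresh(bOLS, 1), mode="lines", name="hard threshold (best subset)") %>%
  add_trace(x=~bOLS, y=~ridgeShrink(bOLS, 1), mode="lines", name="ridge shrinkage") %>%
  layout(title="Shrinkage Operators Applied to OLS Estimates (threshold = 1)",
         xaxis=list(title="OLS estimate"), yaxis=list(title="shrunken estimate"))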

1.2.4.1 Ridge Regression

Ridge regression relies on \(L^2\) regularization to improve the model prediction accuracy. It improves prediction error by shrinking large regression coefficients and reducing overfitting. By itself, ridge regularization does not perform variable selection and does not really help with model interpretation.

Let’s show an example using the MLB dataset 01a_data.txt, which includes each player’s Name, Team, Position, Height, Weight, and Age. We may fit any regularized linear model, e.g., \(Weight \sim Age + Height\).

# install.packages("doParallel")
library("doParallel")
library(plotly)
library(tidyr)

# Data: https://umich.instructure.com/courses/38100/files/folder/data   (01a_data.txt)
data <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1', as.is=T, header=T)    
attach(data); str(data)
## 'data.frame':    1034 obs. of  6 variables:
##  $ Name    : chr  "Adam_Donachie" "Paul_Bako" "Ramon_Hernandez" "Kevin_Millar" ...
##  $ Team    : chr  "BAL" "BAL" "BAL" "BAL" ...
##  $ Position: chr  "Catcher" "Catcher" "Catcher" "First_Baseman" ...
##  $ Height  : int  74 74 72 72 73 69 69 71 76 71 ...
##  $ Weight  : int  180 215 210 210 188 176 209 200 231 180 ...
##  $ Age     : num  23 34.7 30.8 35.4 35.7 ...
# Training Data
# Full Model: x <- model.matrix(Weight ~ ., data = data[1:900, ])
# Reduced Model
x <- model.matrix(Weight ~ Age + Height, data = data[1:900, ])
# model.matrix() creates the design (model) matrix according to the formula, including a leading column of 1's for the intercept.
y <- data[1:900, ]$Weight

# Testing Data
x.test <- model.matrix(Weight ~ Age + Height, data = data[901:1034, ])
y.test <- data[901:1034, ]$Weight

# install.packages("glmnet")
library("glmnet")
library(doParallel)
cl <- makePSOCKcluster(6)
registerDoParallel(cl); getDoParWorkers()
## [1] 6
# getDoParName(); getDoParVersion()
cv.ridge <-  cv.glmnet(x, y, type.measure="mse", alpha=0, parallel=T)
## alpha =1 for lasso only, alpha = 0 for ridge only, and 0<alpha<1 to blend ridge & lasso penalty !!!!

# plot(cv.ridge)

plotCV.glmnet <- function(cv.glmnet.object, name="") {
  df <- as.data.frame(cbind(x=log(cv.glmnet.object$lambda), y=cv.glmnet.object$cvm, 
                           errorBar=cv.glmnet.object$cvsd))

  featureNum <- cv.glmnet.object$nzero
  xFeature <- log(cv.glmnet.object$lambda)
  yFeature <- max(cv.glmnet.object$cvm)+max(cv.glmnet.object$cvsd)
  dataFeature <- data.frame(featureNum, xFeature, yFeature)

  plot_ly(data = df) %>%
    # add error bars for each CV-mean at log(lambda)
    add_trace(x = ~x, y = ~y, type = 'scatter', mode = 'markers',
        name = 'CV MSE', error_y = ~list(array = errorBar)) %>% 
    # add the lambda-min and lambda 1SD vertical dash lines
    add_lines(data=df, x=c(log(cv.glmnet.object$lambda.min), log(cv.glmnet.object$lambda.min)), 
              y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)),
              showlegend=F, line=list(dash="dash"), name="lambda.min", mode = 'lines+markers') %>%
    add_lines(data=df, x=c(log(cv.glmnet.object$lambda.1se), log(cv.glmnet.object$lambda.1se)), 
              y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)), 
              showlegend=F, line=list(dash="dash"), name="lambda.1se") %>%
    # Add Number of Features Annotations on Top
    add_trace(dataFeature, x = ~xFeature, y = ~yFeature, type = 'scatter', name="Number of Features",
        mode = 'text', text = ~featureNum, textposition = 'middle right',
        textfont = list(color = '#000000', size = 9)) %>%
    # Add top x-axis (non-zero features)
    # add_trace(data=df, x=~c(min(cv.glmnet.object$nzero),max(cv.glmnet.object$nzero)),
    #           y=~c(max(y)+max(errorBar),max(y)+max(errorBar)), showlegend=F, 
    #           name = "Non-Zero Features", yaxis = "ax", mode = "lines+markers", type = "scatter") %>%
    layout(title = paste0("Cross-Validation MSE (", name, ")"),
                            xaxis = list(title=paste0("log(",TeX("\\lambda"),")"),  side="bottom", showgrid=TRUE), # type="log"
                            hovermode = "x unified", legend = list(orientation='h'),  # xaxis2 = ax,  
                            yaxis = list(title = cv.glmnet.object$name, side="left", showgrid = TRUE))
}

plotCV.glmnet(cv.ridge, "Ridge")
coef(cv.ridge)
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept) -13.3506075
## (Intercept)   .        
## Age           0.5155822
## Height        2.7164527
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])
## [1] 18.25998
#plot variable feature coefficients against the shrinkage parameter lambda.
glmmod <-glmnet(x, y, alpha = 0)
plot(glmmod, xvar="lambda")
grid()

# for plot_glmnet with ridge/lasso coefficient path labels
# install.packages("plotmo")
library(plotmo) 
plot_glmnet(glmmod, lwd=4)        #default colors

# More elaborate plots can be generated using:
# plot_glmnet(glmmod,label=2,lwd=4) #label the 2 biggest final coefficients
# specify color of each line
# g <- "blue" 
# plot_glmnet(glmmod, lwd=4, col=c(2,g))


# report the model coefficient estimates
coef(glmmod)[, 1]
##  (Intercept)  (Intercept)          Age       Height 
## 2.016556e+02 0.000000e+00 8.327372e-37 4.789383e-36
cv.glmmod <- cv.glmnet(x, y, alpha=0)

mod.ridge <-  cv.glmnet(x, y, alpha = 0, thresh = 1e-12, parallel = T)
lambda.best <-  mod.ridge$lambda.min
lambda.best
## [1] 1.086267
ridge.pred <-  predict(mod.ridge, newx = x.test, s = lambda.best)
ridge.RMS <- mean((y.test - ridge.pred)^2); ridge.RMS
## [1] 263.8461
ridge.test.r2 <-  1 - mean((y.test - ridge.pred)^2)/mean((y.test - mean(y.test))^2)

#plot(cv.glmmod)
plotCV.glmnet(cv.glmmod, "Ridge")
best_lambda <- cv.glmmod$lambda.min
best_lambda
## [1] 1.086267

In the plots above, the different colors represent the different features, with the corresponding coefficient estimates displayed as functions of the regularization parameter, \(\lambda\). The top horizontal axis indicates the number of nonzero coefficients at the current value of \(\lambda\). For LASSO regularization, this top axis corresponds to the effective degrees of freedom (df) of the model.

Notice the usefulness of ridge regularization for model estimation in highly ill-conditioned problems (\(n\ll k\)), where slight feature perturbations may cause disproportionate changes in the corresponding coefficient estimates. When \(\lambda\) is very large, the regularization effect dominates the optimization of the objective function and the coefficients tend to zero. At the other extreme, as \(\lambda\longrightarrow 0\), the resulting model solution tends towards the ordinary least squares (OLS) solution and the coefficients exhibit large oscillations. In practice, we typically need to tune \(\lambda\) to balance this tradeoff.

Also note that in the cv.glmnet call, the extreme values of the parameter \(\alpha = 0\) (ridge) and \(\alpha = 1\) (LASSO) correspond to different types of regularization, and intermediate values \(0<\alpha<1\) correspond to elastic net blended regularization.

1.2.4.2 Least Absolute Shrinkage and Selection Operator (LASSO) Regression

Estimating the coefficients of a linear regression model using LASSO involves minimizing an objective function that includes an \(L^1\) regularization term, which tends to shrink the coefficients and set some of them exactly to zero, reducing the number of active features. A descriptive representation of the fidelity (left) and regularization (right) terms of the objective function is shown below:

\[\underbrace{\sum_{i=1}^n \left [ y_i - \beta_0 - \sum_{j=1}^k \beta_j x_{ij} \right ]^2}_{\text{fidelity term}} + \underbrace{\lambda\sum_{j=1}^{k}|\beta_j|}_{\text{regularization term}}.\]

LASSO jointly achieves model quality, reliability, and variable selection by penalizing the sum of the absolute values of the regression coefficients. This forces some coefficients to shrink exactly to zero, effectively acting as a variable selection process. It is similar to ridge regression’s penalty on the sum of the squares of the regression coefficients, although ridge regression only shrinks the magnitude of the coefficients without truncating them to \(0\).

Let’s show how to select the regularization weight parameter \(\lambda\) using training data and report the error using testing data.

mod.lasso <-  cv.glmnet(x, y, alpha = 1, thresh = 1e-12, parallel = T)
## alpha =1 for lasso only, alpha = 0 for ridge only, and 0<alpha<1 for elastic net, a blend ridge & lasso penalty !!!!
lambda.best <- mod.lasso$lambda.min
lambda.best
## [1] 0.05406379
lasso.pred <- predict(mod.lasso, newx = x.test, s = lambda.best)
LASSO.RMS <- mean((y.test - lasso.pred)^2); LASSO.RMS
## [1] 261.8045

Let’s retrieve the estimates of the model coefficients.

mod.lasso <-  glmnet(x, y, alpha = 1)
predict(mod.lasso, s = lambda.best, type = "coefficients")
## 4 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) -182.1429000
## (Intercept)    .        
## Age            0.9667182
## Height         4.8309312
lasso.test.r2 <-  1 - mean((y.test - lasso.pred)^2)/mean((y.test - mean(y.test))^2)

For comparison, let’s also fit a classical (unregularized) OLS linear model.

lm.fit <-  lm(Weight ~ Age + Height, data = data[1:900, ])
summary(lm.fit)
## 
## Call:
## lm(formula = Weight ~ Age + Height, data = data[1:900, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.602 -12.399  -0.718  10.913  74.446 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -184.3736    19.4232  -9.492  < 2e-16 ***
## Age            0.9799     0.1335   7.341 4.74e-13 ***
## Height         4.8561     0.2551  19.037  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.5 on 897 degrees of freedom
## Multiple R-squared:  0.3088, Adjusted R-squared:  0.3072 
## F-statistic: 200.3 on 2 and 897 DF,  p-value: < 2.2e-16

The OLS (unregularized) linear model has slightly larger coefficients and a larger mean squared error than LASSO, which reflects LASSO’s shrinkage.

lm.pred <-  predict(lm.fit, newx = x.test)   # caution: predict.lm() ignores the unused `newx` argument,
# so these are in-sample (training) predictions; for a genuine test-set comparison use
# predict(lm.fit, newdata = data[901:1034, ]) and compare against y.test
LM.RMS <- mean((y - lm.pred)^2); LM.RMS
## [1] 305.1995
lm.test.r2 <-  1 - mean((y - lm.pred)^2) / mean((y.test - mean(y.test))^2)
  
# barplot(c(lm.test.r2, lasso.test.r2, ridge.test.r2), col = "red", names.arg = c("OLS", "LASSO", "Ridge"), main = "Testing Data Derived R-squared")

plot_ly(x = c("OLS", "LASSO", "Ridge"),  y = c(lm.test.r2, lasso.test.r2, ridge.test.r2),  
        name = paste0("Model ", TeX("R^2") ," Performance"), type = "bar") %>%
  layout(title=TeX("\\text{Model} \\ R^2\\ \\text{Performance}")) %>%
  config(mathjax = 'cdn')

Compare the results of the three alternative models (LM, LASSO and Ridge) for these data and contrast the derived mean squared error (labeled RMS in the code) results.

library(knitr) #  kable function to convert tabular R-results into Rmd tables
# create table as data frame
RMS_Table = data.frame(LM=LM.RMS, LASSO=LASSO.RMS, Ridge=ridge.RMS)

# convert to markdown
kable(RMS_Table, format="pandoc", caption="Test Dataset RMS Results", align=c("c", "c", "c"))
Test Dataset RMS Results
LM LASSO Ridge
305.1995 261.8045 263.8461
stopCluster(cl)

As both the inputs (features or predictors) and the output (response) are observed for the testing data, we can build a learner that examines the relationship between the covariates and the observed responses. Most often, we are interested in forecasting or predicting responses for prospective (new, testing, or validation) data.

1.2.5 Predictor Standardization

Prior to fitting regularized linear models and estimating the effects, the covariates are typically standardized. Scaling the features ensures that their measurement units do not bias the distance measures or norm estimates. Standardization can be accomplished using the classic “z-score” formula, which puts each predictor on the same (unitless) scale, with mean 0 and variance 1. We use \(\hat{\beta_0} = \bar{y}\) for the intercept parameter and estimate the coefficients of the remaining (standardized) predictors. After the model is estimated, to facilitate interpretation in the context of the specific case-study, we can transform the results back to the original scale/units.
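
As a brief base-R sketch (the matrix X_sim below is simulated purely for illustration), z-score standardization can be performed with scale(); note that glmnet() standardizes predictors internally by default (standardize=TRUE) and reports coefficients on the original scale.

# z-score standardization of a simulated numeric design matrix
set.seed(1)
X_sim <- matrix(rnorm(100*3, mean=5, sd=2), ncol=3, dimnames=list(NULL, c("x1","x2","x3")))
X_sim_std <- scale(X_sim, center=TRUE, scale=TRUE)  # (x - mean)/sd, column by column
round(colMeans(X_sim_std), 12)   # approximately 0 for each column
apply(X_sim_std, 2, sd)          # exactly 1 for each column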

1.2.6 Estimation Goals

The basic problem is this: given a set of predictors \({X}\), find a function, \(f({X})\), to model or predict the outcome \(Y\).

Let’s denote the objective (loss or cost) function by \(L(y, f({X}))\). It quantifies the adequacy of the fit; a common choice is the squared error loss: \[L(y, f({X})) = (y - f({X}))^2 . \]

We are looking to find \(f\) that minimizes the expected loss: \[ E[(Y - f({X}))^2] \Rightarrow f = E[Y | {X} = {x}].\]
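
For completeness, the standard decomposition behind this result is

\[E[(Y-f({X}))^2] = \underbrace{E\left[(Y-E[Y|{X}])^2\right]}_{\text{does not depend on } f} + E\left[(E[Y|{X}]-f({X}))^2\right],\]

since the cross-term vanishes by the law of iterated expectations. The second term is nonnegative and equals zero exactly when \(f({X}) = E[Y|{X}]\), which is why the conditional expectation is the optimal predictor under squared error loss.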

1.2.7 Linear Regression

For a linear model \[Y_i = \beta_0 + x_{i,1}\beta_1 + x_{i,2}\beta_2 + \dots + x_{i,k}\beta_k + \epsilon_i, \] let’s assume that:

  • The model shorthand matrix notation is: \(Y = X\beta + \epsilon.\)
  • The conditional expectation of the outcome given the data, \(E[Y | {X} = {x}]\), is a linear function of the predictors, whose coefficients can be estimated by minimizing the residual sum of squares: \[\arg\min_{{\beta}} \sum_{i=1}^n \left (y_i - \sum_{j=1}^{k} x_{ij} \beta_j \right )^2 = \arg\min_{{\beta}} \sum_{i=1}^{n} (y_i - x_{i}^T \beta)^2.\]

Multiplying both sides of the model equation on the left by \(X^T=X'\), the transpose of the design matrix \(X\) (recall that matrix multiplication is not commutative in general), yields: \[X^T Y = X^T (X\beta) = (X^TX)\beta.\]

To solve for the effect-size coefficients, \(\beta\), we can multiply both sides on the left by the inverse of the cross-product matrix (assuming \(X^TX\) is invertible): \[(X^TX)^{-1} (X^T Y) = (X^TX)^{-1} (X^TX)\beta = \beta.\]

The ordinary least squares (OLS) estimate of \({\beta}\) is given by: \[\hat{{\beta}} = \arg\min_{{\beta}} \sum_{i=1}^n (y_i - \sum_{j=1}^{k} x_{ij} \beta_j)^2 = \arg\min_{{\beta}} \| {y} - {X}{\beta} \|^2_2 \Rightarrow\] \[\hat{{\beta}}^{OLS} = ({X}'{X})^{-1} {X}'{y} \Rightarrow \hat{f}({x}_i) = {x}_i'\hat{{\beta}}.\]
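
As a quick numerical sanity check (using simulated data; all object names here are hypothetical), the closed-form OLS solution can be computed directly from the normal equations and compared to lm():

# OLS via the normal equations vs. lm() on simulated data
set.seed(1234)
n <- 100
Xsim <- cbind(1, rnorm(n), rnorm(n))                 # design matrix with an explicit intercept column
betaTrue <- c(2, 1, -0.5)
ysim <- drop(Xsim %*% betaTrue + rnorm(n))
betaHat <- solve(crossprod(Xsim), crossprod(Xsim, ysim))   # (X'X)^{-1} X'y
cbind(normalEquations = drop(betaHat), lm = coef(lm(ysim ~ Xsim - 1)))  # the two estimates coincide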

1.2.8 Drawbacks of Linear Regression

Despite its wide use and elegant theory, linear regression has some shortcomings.

  • Prediction accuracy - Often can be improved upon;
  • Model interpretability - Linear model does not automatically do variable selection.

1.2.8.1 Assessing Prediction Accuracy

Given a new input, \({x}_0\), how do we assess our prediction \(\hat{f}({x}_0)\)?

The Expected Prediction Error (EPE) is: \[\begin{aligned} EPE({x}_0) &= E[(Y_0 - \hat{f}({x}_0))^2] \\ &= \text{Var}(\epsilon) + \text{Var}(\hat{f}({x}_0)) + \text{Bias}(\hat{f}({x}_0))^2 \\ &= \text{Var}(\epsilon) + MSE(\hat{f}({x}_0)) \end{aligned} .\]

where

  • \(\text{Var}(\epsilon)\): irreducible error variance
  • \(\text{Var}(\hat{f}({x}_0))\): sample-to-sample variability of \(\hat{f}({x}_0)\) , and
  • \(\text{Bias}(\hat{f}({x}_0))\): average difference of \(\hat{f}({x}_0)\) & \(f({x}_0)\).

1.2.8.2 Estimating the Prediction Error

A common approach to estimating the prediction error involves:

  • Randomly splitting the data into “training” and “testing” sets, where the testing data has \(m\) observations that will be used to independently validate the model quality. We estimate/calculate \(\hat{f}\) using training data;
  • Estimating prediction error using the testing set MSE \[ \hat{MSE}(\hat{f}) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{f}(x_i))^2.\]

Ideally, we want our model/predictions to perform well with new or prospective data.

1.2.8.3 Improving the Prediction Accuracy

If the true \(f(x)\) is approximately linear, \(\hat{f}\) will have low bias but possibly high variance, e.g., in high-dimensional settings with correlated predictors, when the number of features \(k\) is not much smaller than the number of cases \(n\), or in under-determined problems where \(k > n\). The goal is to minimize the total error by trading off bias (centrality) and precision (variability).

\[MSE(\hat{f}(x)) = \text{Var}(\hat{f}(x)) +\text{Bias}(\hat{f}(x))^2.\] We can accept some bias in order to reduce the variance, which may lead to a decrease in the overall \(MSE\). Regularization allows us to tune this tradeoff.

We aim to predict the outcome variable, \(Y_{n\times1}\), in terms of other features \(X_{n\times k}\). Assume a first-order relationship relating \(Y\) and \(X\) of the form \(Y=f(X)+\epsilon\), where the error term is \(\epsilon \sim N(0,\sigma)\). An estimated model \(\hat{f}(X)\) can be computed in many different ways (e.g., using least squares for linear regression, Newton-Raphson, steepest descent, stochastic gradient descent, or other methods). Then, we can decompose the expected squared prediction error at \(x\) as:

\[E(x)=E[(Y-\hat{f}(x))^2] = \underbrace{\left ( E[\hat{f}(x)]-f(x) \right )^2}_{Bias^2} + \underbrace{E\left [\left (\hat{f}(x)-E[\hat{f}(x)] \right )^2 \right]}_{\text{precision (variance)}} + \underbrace{\sigma^2}_{\text{irreducible error (noise)}}.\]

When the true \(Y\) vs. \(X\) relation is not known, infinite data may be necessary to calibrate the model \(\hat{f}\) and it may be impractical to jointly reduce both the model bias and variance. In general, minimizing the bias at the same time as minimizing the variance may not be possible.

The classical bias-precision (variance) target diagram illustrates this dichotomy; additional information is available in the SOCR SMHS EBook.

1.2.9 Variable Selection

Oftentimes, we are only interested in using a subset of the original features as model predictors. Thus, we need to identify the most relevant predictors, which usually capture the big picture of the process. This helps us avoid overly complex models that may be difficult to interpret. Typically, when considering several models that achieve similar results, it’s natural to select the simplest of them.

Linear regression does not directly quantify the importance of individual features for predicting a specific outcome, so the problem of selecting the critical predictors is of considerable practical importance.

Automatic feature subset selection methods should directly determine an optimal subset of variables. Forward or backward stepwise variable selection and forward stagewise are examples of classical methods for choosing the best subset by assessing various metrics like \(MSE\), \(C_p\), AIC, or BIC.
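
As a small illustration (reusing the MLB data loaded earlier; the choice of candidate predictors is ours), base R's step() performs stepwise selection by AIC:

# stepwise (both directions) selection by AIC on the MLB training subset
full.lm <- lm(Weight ~ Age + Height + Position, data = data[1:900, ])
step.lm <- step(full.lm, direction = "both", trace = 0)   # iteratively add/drop terms to minimize AIC
formula(step.lm)            # the retained predictors
AIC(step.lm); BIC(step.lm)  # information criteria of the selected model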

1.2.10 Regularization Framework

As before, we start with a given \({X}\) and look for a (linear) function, \(f({X})=\sum_{j=1}^{k} {x_{j} \beta_j}\), to model or predict \(y\) subject to a chosen objective (cost) function, e.g., squared error loss. Adding a penalty term to the cost function minimization yields (model parameter) estimates expressed as:

\[\hat{{\beta}}(\lambda) = \arg\min_{\beta} \left\{\sum_{i=1}^n (y_i - \sum_{j=1}^{k} {x_{ij} \beta_j})^2 + \lambda J({\beta})\right\}\]

In the above expression, \(\lambda \ge 0\) is the regularization (tuning or penalty) parameter, \(J({\beta})\) is a user-defined penalty function - typically, the intercept is not penalized.

1.2.10.1 Role of the Penalty Term

Consider the penalty \(J({\beta}) = \sum_{j=1}^k \beta_j^2 =\| {\beta} \|^2_2\) (Ridge Regression, RR).

Then, the formulation of the regularization framework is: \[\hat{{\beta}}(\lambda)^{RR} = \arg\min_{{\beta}} \left\{\sum_{i=1}^n \left (y_i - \sum_{j=1}^{k} x_{ij} \beta_j\right )^2 + \lambda \sum_{j=1}^k \beta_j^2 \right\}.\]

Or, alternatively:

\[\hat{{\beta}}(t)^{RR} = \arg\min_{{\beta}} \sum_{i=1}^n \left (y_i - \sum_{j=1}^{k} x_{ij} \beta_j\right )^2, \] subject to \[\sum_{j=1}^k \beta_j^2 \le t .\]

1.2.10.2 Role of the Regularization Parameter

The regularization parameter \(\lambda\geq 0\) directly controls the bias-variance trade-off:

  • \(\lambda = 0\) corresponds to OLS, and
  • \(\lambda \rightarrow \infty\) puts more weight on the penalty function and results in more shrinkage of the coefficients, i.e., we introduce bias for the sake of reducing the variance.

The choice of \(\lambda\) is crucial and will be discussed below as each \(\lambda\) results in a different solution \(\hat{{\beta}}(\lambda)\).

1.2.10.3 LASSO

The LASSO (Least Absolute Shrinkage and Selection Operator) regularization relies on the penalty \[J({\beta}) = \sum_{j=1}^k |\beta_j| = \| {\beta} \|_1,\] which leads to the following objective function: \[\hat{{\beta}}(\lambda)^{L} = \arg\min_{\beta} \left\{\sum_{i=1}^n \left (y_i - \sum_{j=1}^{k} x_{ij} \beta_j\right )^2 + \lambda \sum_{j=1}^k |\beta_j| \right\}.\]

In practice, subtle changes in the penalty terms frequently lead to big differences in the results. Not only does the regularization term shrink coefficients towards zero, but it sets some of them to be exactly zero. Thus, it performs continuous variable selection, hence the name, Least Absolute Shrinkage and Selection Operator (LASSO).

For further details, see Tibshirani’s LASSO website.

1.2.11 General Regularization Framework

The general regularization framework involves optimization of a more general objective function:

\[\min_{f \in {\mathcal{H}}} \left\{\sum_{i=1}^n L(y_i, f(x_i)) + \lambda J(f)\right\}, \]

where \(\mathcal{H}\) is a space of possible functions, \(L\) is the fidelity term, e.g., squared error, absolute error, zero-one, negative log-likelihood (GLM), hinge loss (support vector machines), and \(J\) is the regularizer, e.g., ridge regression, LASSO, adaptive LASSO, group LASSO, fused LASSO, thresholded LASSO, generalized LASSO, constrained LASSO, elastic-net, Dantzig selector, SCAD, MCP, smoothing splines, etc.

This represents a very general and flexible framework that allows us to incorporate prior knowledge (sparsity, structure, etc.) into the model estimation.

1.2.12 Likelihood Ratio Test (LRT), False Discovery Rate (FDR), and Logistic Transform

These concepts will be important in theoretical model development as well as in the applications we show below.

1.2.12.1 Likelihood Ratio Test (LRT)

The Likelihood Ratio Test (LRT) compares the data fit of two models. For instance, removing predictor variables from a model may reduce the model quality (i.e., a model will have a lower log likelihood). To statistically assess whether the observed difference in model fit is significant, the LRT compares the difference of the log likelihoods of the two models. When this difference is statistically significant, the full model (the one with more variables) represents a better fit to the data, compared to the reduced model. LRT is computed using the log likelihoods (\(ll\)) of the two models:

\[LRT = -2 \ln\left (\frac{L(m_1)}{L(m_2)}\right ) = 2(ll(m_2)-ll(m_1)), \] where:

  • \(m_1\) and \(m_2\) are the reduced and the full models, respectively,
  • \(L(m_1)\) and \(L(m_2)\) denote the likelihoods of the 2 models, and
  • \(ll(m_1)\) and \(ll(m_2)\) represent the log likelihood (natural log of the model likelihood function).

As \(n\longrightarrow \infty\), the distribution of the LRT statistic is asymptotically chi-squared, with degrees of freedom equal to the number of parameters that are removed from the full model. In the heart-transplant example below, \(LRT \sim \chi_{df=1}^2\), since the full model adds a single predictor (SE) to the intercept-only null model.

1.2.12.2 False Discovery Rate (FDR)

The false discovery rate (FDR) measures the expected proportion of false positives among the features declared significant (selected):

\[\underbrace{FDR}_{\text{False Discovery Rate}} =\underbrace{E}_{\text{expectation}} \underbrace{\left( \frac{\#\ \text{false positives}}{\#\ \text{selected features}}\right )}_{\text{False Discovery Proportion}}.\]

The Benjamini-Hochberg (BH) FDR procedure involves ordering the p-values, specifying a target FDR, calculating and applying the threshold. Below we show how this is accomplished in R.

# List the p-values (these are typically computed by some statistical
# analysis, later these will be ordered from smallest to largest)
pvals <- c(0.9, 0.35, 0.01, 0.013, 0.014, 0.19, 0.35, 0.5, 0.63, 0.67, 0.75, 0.81, 0.01, 0.051)
length(pvals)
## [1] 14
#enter the target FDR
alpha.star <- 0.05

# order the p-values small to large
pvals <- sort(pvals); pvals
##  [1] 0.010 0.010 0.013 0.014 0.051 0.190 0.350 0.350 0.500 0.630 0.670 0.750
## [13] 0.810 0.900
#calculate the threshold for each p-value
# threshold[i] = alpha*(i/n), where i is the index of the ordered p-value
threshold<-alpha.star*(1:length(pvals))/length(pvals)

# for each index, compare the p-value against its threshold and display the results
cbind(pvals, threshold, pvals<=threshold)
##       pvals   threshold  
##  [1,] 0.010 0.003571429 0
##  [2,] 0.010 0.007142857 0
##  [3,] 0.013 0.010714286 0
##  [4,] 0.014 0.014285714 1
##  [5,] 0.051 0.017857143 0
##  [6,] 0.190 0.021428571 0
##  [7,] 0.350 0.025000000 0
##  [8,] 0.350 0.028571429 0
##  [9,] 0.500 0.032142857 0
## [10,] 0.630 0.035714286 0
## [11,] 0.670 0.039285714 0
## [12,] 0.750 0.042857143 0
## [13,] 0.810 0.046428571 0
## [14,] 0.900 0.050000000 0

Starting with the smallest p-value and moving up, we find the largest index \(k\) for which the p-value falls below its threshold, \(\alpha^*\frac{k}{n}\); here, this yields \(\hat{k}=4\).

Next, the algorithm rejects the null hypotheses for the tests that correspond to p-values with indices \(k\leq \hat{k}=4\), i.e., we determine that \(p_{(1)}, p_{(2)}, p_{(3)}, p_{(4)}\) survive FDR correction for multiple testing.

Note: Since we controlled the FDR at \(\alpha^*=0.05\), we expect that on average only about 5% of the rejected tests are spurious. In other words, among the features declared significant after FDR correction, only about a fraction \(\alpha^*=0.05\) are expected to be false positives, i.e., features flagged as salient that are in fact unimportant.

As a comparison, the Bonferroni-corrected \(\alpha\)-level for these data is \(\frac{0.05}{14} = 0.0036\). Note that the Bonferroni cutoff coincides with the first BH threshold (the one compared against the smallest p-value). Had we used the Bonferroni correction for multiple testing, we would have concluded that none of our \(14\) results were significant!
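
The same Bonferroni comparison can be reproduced with the built-in p.adjust() function (a quick check):

# Bonferroni-adjusted p-values for the same 14 tests
p.adjust(pvals, method = "bonferroni")
sum(p.adjust(pvals, method = "bonferroni") <= 0.05)   # 0, i.e., no tests survive the Bonferroni correction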

1.2.12.3 Graphical Interpretation of the Benjamini-Hochberg (BH) Method

There’s an intuitive graphical interpretation of the BH calculations.

  • Sort the \(p\)-values from smallest to largest.
  • Plot the ordered p-values \(p_{(k)}\) on the y-axis versus their indices on the x-axis.
  • Superimpose on this plot a line that passes through the origin and has slope \(\alpha^*\).

All p-values up to and including the largest one that falls on or below this line correspond to significant results.

#generate the "relative-indices" (i/n) that will be plotted on the x-axis
x.values<-(1:length(pvals))/length(pvals)

#select observations that are less than threshold
for.test <- cbind(1:length(pvals), pvals)
pass.test <- for.test[pvals <= 0.05*x.values, ]
pass.test
##       pvals 
## 4.000 0.014
#use largest k to color points that meet Benjamini-Hochberg FDR test
last<-ifelse(is.vector(pass.test), pass.test[1], pass.test[nrow(pass.test), 1])

#widen right margin to make room for labels
par(mar=c(4.1, 4.1, 1.1, 4.1))

#plot the points (relative-index vs. probability-values)
# we can also plot the y-axis on a log-scale to spread out the values
# plot(x.values, pvals, xlab=expression(i/n), ylab="log(p-value)", log = 'y')

# plot(x.values, pvals, xlab=expression(i/n), ylab="p-value")
# #add FDR line
# abline(a=0, b=0.05, col=2, lwd=2)
# #add naive threshold line
# abline(h=.05, col=4, lty=2)
# #add Bonferroni-corrected threshold line
# abline(h=.05/length(pvals), col=4, lty=2)
# #label lines
# mtext(c('naive', 'Bonferroni'), side=4, at=c(.05, .05/length(pvals)), las=1, line=0.2)
# #use largest k to color points that meet Benjamini-Hochberg FDR test
# points(x.values[1:last], pvals[1:last], pch=19, cex=1.5)

plot_ly(x=~x.values, y=~pvals, type="scatter", mode="markers", marker=list(size=15), 
        name="observed p-values", symbols='o') %>%
  # add bounding horizontal lines
  # add naive threshold line
  add_lines(x=~c(0,1), y=~c(0.05, 0.05), mode="lines", line=list(dash='dash'), name="p=0.05") %>%
  # add conservative Bonferroni line
  add_lines(x=~c(0,1), y=~c(0.05/length(pvals), 0.05/length(pvals)), mode="lines", 
            line=list(dash='dash'), name="Bonferroni (p=0.05/n)") %>%
  # add FDR line
  add_lines(x=~c(0,1), y=~c(0, 0.05), mode="lines", line=list(dash='dash'), name="FDR Line") %>%
  # highlight the largest k to color points meeting the Benjamini-Hochberg FDR test
  add_trace(x=~x.values[1:last], y=~pvals[1:last], mode="markers",symbols='0', name="FDR Test Points") %>%
  layout (title="Benjamini-Hochberg FDR Test", legend = list(orientation='h'),
          xaxis=list(title=expression(i/n)), yaxis=list(title="p-value"))

1.2.12.4 FDR adjusting the \(p\)-values

R can automatically perform the Benjamini-Hochberg procedure. The adjusted p-values are obtained by

pvals.adjusted <- p.adjust(pvals, "BH")
pvals.adjusted
##  [1] 0.0490000 0.0490000 0.0490000 0.0490000 0.1428000 0.4433333 0.6125000
##  [8] 0.6125000 0.7777778 0.8527273 0.8527273 0.8723077 0.8723077 0.9000000

Comparing the adjusted p-values against the target level \(\alpha^*\) indicates which null hypotheses can be rejected while preserving the desired false discovery rate. We can also compute the adjusted p-values manually as follows:

# manually calculate the thresholds for the ordered p-values list
test.p <- length(pvals)/(1:length(pvals))*pvals   # test.p

# loop through each p-value and carry out the manual FDR adjustment for multiple testing
adj.p <- numeric(14)
for(i in 1:14) {
    adj.p[i] <- min(test.p[i:length(test.p)])
    adj.p[i] <- min(adj.p[i], 1)     # cap the adjusted p-value at 1
}
adj.p
##  [1] 0.0490000 0.0490000 0.0490000 0.0490000 0.1428000 0.4433333 0.6125000
##  [8] 0.6125000 0.7777778 0.8527273 0.8527273 0.8723077 0.8723077 0.9000000

Note that the manually computed (adj.p) and the automatically computed (pvals.adjusted) adjusted-p-values are the same.
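
This equivalence can be confirmed programmatically:

# verify that the manual BH adjustment matches p.adjust()
all.equal(adj.p, pvals.adjusted)   # should return TRUE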

1.2.13 Logistic Transformation

For binary outcome variables, or ordinal categorical outcomes, we may need to employ the logistic curve, which maps real-valued predictor scores onto the (bounded) outcome probability scale.

The Logistic curve is \(y=f(x)= \frac{1}{1+e^{-x}}\), where y and x represent probability and quantitative-predictor values, respectively. A slightly more general form is: \(y=f(x)= \frac{K}{1+e^{-x}}\), where the covariate \(x \in (-\infty, \infty)\) and the response \(y \in [0, K]\). For example,

library("ggplot2")
k=7
x <- seq(-10, 10, 0.1)
# plot(x, k/(1+exp(-x)), xlab="X-axis (Covariate)", ylab="Outcome k/(1+exp(-x)), k=7", type="l")

plot_ly(x=~x, y=~k/(1+exp(-x)), type="scatter", mode="lines", name="Logistic model") %>%
  layout (title="Logistic Model Y=k/(1+exp(-x)), k=7",
          xaxis=list(title="x"), yaxis=list(title="Y=k/(1+exp(-x))"))

The point of this logistic transformation is that: \[y= \frac{1}{1+e^{-x}} \Longleftrightarrow x=\ln\frac{y}{1-y},\] which represents the log-odds (when \(y\) is the probability of an event of interest).
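
In R, the logistic transform and its inverse (the logit) are available as the built-in plogis() and qlogis() functions, so this relationship can be checked numerically:

# the logistic (inverse-logit) and the logit are mutual inverses
xVal <- 1.5
yVal <- plogis(xVal)    # 1/(1 + exp(-x))
qlogis(yVal)            # log(y/(1-y)) recovers x = 1.5
all.equal(qlogis(plogis(xVal)), xVal)   # TRUE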

We use the logistic regression equation model to estimate the probability of specific outcomes:

\(\hat{P}(Y=1| x_1, x_2, \cdots, x_l)= \frac{1}{1+e^{-(a_o+\sum_{k=1}^l{a_k x_k })}}\), where the intercept \(a_o\) and the effects \(a_k, k = 1, 2, \cdots, l\), are estimated via maximum likelihood using a GLM. This model allows us to estimate the probability that the dependent (clinical outcome) variable equals \(Y=1\) (CO), i.e., surviving the surgery, given the observed values of the predictors \(X_k, k = 1, 2, \cdots, l\).

1.2.13.1 Example: Heart Transplant Surgery

Let’s look at an example of estimating the probability of surviving a heart transplant based on the surgeon’s experience. Suppose a group of 20 patients undergo heart transplantation with surgeons whose experience ranges from 1 (least) to 10 (most), measured in 100’s of operating/surgery hours. How does the surgeon’s experience affect the probability of the patient surviving?

The data below shows the outcome of the surgery (1=survival) or (0=death) according to the surgeons’ experience in 100’s of hours of practice.

Surgeon’s Experience (SE) 1 1.5 2 2.5 3 3.5 3.5 4 4.5 5 5.5 6 6.5 7 8 8.5 9 9.5 10 10
Clinical Outcome (CO) 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1
mydata <- read.csv("https://umich.instructure.com/files/405273/download?download_frd=1")  # 01_HeartSurgerySurvivalData.csv
# estimates a logistic regression model for the clinical outcome (CO), survival, using the glm 
# (generalized linear model) function. 
# convert Surgeon's Experience (SE) to a factor to indicate it should be treated as a categorical variable.
# mydata$rank <- factor(mydata$SE)
# mylogit <- glm(CO ~ SE, data = mydata, family = "binomial")
# library(ggplot2)
# ggplot(mydata, aes(x=SE, y=CO)) + geom_point() + 
#       stat_smooth(method="glm", method.args=list(family = "binomial"), se=FALSE)

mylogit <- glm(CO ~ SE, data=mydata, family = "binomial")

plot_ly(data=mydata, x=~SE, y=~CO, type="scatter", mode="markers", name="Data", marker=list(size=15)) %>%
  add_trace(x=~SE, y=~mylogit$fitted.values, type="scatter", mode="lines", name="Logit Model") %>%
  layout (title="Logistic Model Clinical Outcome ~ Surgeon's Experience",
          xaxis=list(title="SE"), yaxis=list(title="CO"), hovermode = "x unified")

Graph of a logistic regression curve showing probability of surviving the surgery versus surgeon’s experience.

The graph shows the probability of the clinical outcome, survival, (Y-axis) versus the surgeon’s experience (X-axis), with the logistic regression curve fitted to the data.

mylogit <- glm(CO ~ SE, data = mydata, family = "binomial")
summary(mylogit)
## 
## Call:
## glm(formula = CO ~ SE, family = "binomial", data = mydata)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -4.1030     1.7629  -2.327   0.0199 *
## SE            0.7583     0.3139   2.416   0.0157 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27.726  on 19  degrees of freedom
## Residual deviance: 16.092  on 18  degrees of freedom
## AIC: 20.092
## 
## Number of Fisher Scoring iterations: 5

The output indicates that the surgeon’s experience (SE) is significantly associated with the probability of surviving the surgery (p = 0.0157, Wald test). The output also provides the estimated coefficients:

  • Intercept = -4.1030 and SE = 0.7583.

These coefficients can then be used in the logistic regression equation model to estimate the probability of surviving the heart surgery:

Probability of surviving heart surgery \(CO =1/(1+exp(-(-4.1030+0.7583\times SE)))\)

For example, for a patient who is operated by a surgeon with 200 hours of operating experience (SE=2), we plug in the value 2 in the equation to get an estimated probability of survival, \(p=0.07\):

SE=2
CO =1/(1+exp(-(-4.1030+0.7583*SE)))
CO
## [1] 0.07001884


Similarly, a patient undergoing heart surgery with a doctor that has 400 operating hours experience (SE=4), the estimated probability of survival is p=0.26:

SE=4; CO =1/(1+exp(-(-4.1030+0.7583*SE))); CO
## [1] 0.2554411
CO
## [1] 0.2554411
for (SE in c(1:5)) {
  CO <- 1/(1+exp(-(-4.1030+0.7583*SE))); 
  print(c(SE, CO))
}
## [1] 1.00000000 0.03406915
## [1] 2.00000000 0.07001884
## [1] 3.0000000 0.1384648
## [1] 4.0000000 0.2554411
## [1] 5.0000000 0.4227486


The table below shows the probability of surviving surgery for several values of surgeons’ experience.

Surgeon’s Experience (SE) Probability of patient survival (Clinical Outcome)
1 0.034
2 0.07
3 0.14
4 0.26
5 0.423

The logistic regression output indicates a significant SE effect (estimate \(\beta=0.7583\), Wald z-test p-value \(=0.0157\)). In addition to the Wald method, we can compute a p-value for the SE effect using the Likelihood Ratio Test (LRT), which for these data yields \(0.0006476922\).

  Estimate Std. Error z value \(Pr(>|z|)\) (Wald)
SE 0.7583 0.3139 2.416 0.0157 *

The logit of a number \(0< p< 1\) is given by the formula \(logit(p)=\log\frac{p}{1-p}\), and represents the log-odds (of survival in this case).

confint(mylogit) 
##                  2.5 %    97.5 %
## (Intercept) -8.6083535 -1.282692
## SE           0.2687893  1.576912

So, why exponentiate the coefficients? Because

\[e^{logit(p)} =e^{\log\frac{p}{1-p}} = \frac{p}{1-p}\ \text{(the odds)},\]

so the exponentiated logistic-regression coefficients are interpretable as (multiplicative) odds ratios (OR).

exp(coef(mylogit))    # exponentiated logit model coefficients
## (Intercept)          SE 
##  0.01652254  2.13474149
  • exp(coef(mylogit)) yields (Intercept) = 0.01652254 and SE = 2.13474149 (= exp(0.7583456)).
  • coef(mylogit) yields the raw logit-model coefficients: (Intercept) = -4.1030298 and SE = 0.7583456.
exp(cbind(OR = coef(mylogit), confint(mylogit)))
##                     OR        2.5 %   97.5 %
## (Intercept) 0.01652254 0.0001825743 0.277290
## SE          2.13474149 1.3083794719 4.839986

We can compute the LRT and report its p-value by using the with() function:

with(mylogit, df.null - df.residual)

with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 0.0006476922
# mylogit$null.deviance - mylogit$deviance  # 11.63365
# mylogit$df.null - mylogit$df.residual   # 1
# [1] 0.0006476922

# CONFIRM THE RESULT
# qchisq(1-with(mylogit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE)), 1)
# qchisq(1-0.0006476922, 1)

The LRT p-value < 0.001 tells us that our model as a whole fits significantly better than the intercept-only (null) model. The residual deviance, mylogit$deviance, equals \(-2\times\) the model log-likelihood, which we can confirm by:

mylogit$deviance   # model residual deviance
## [1] 16.09223
-2*logLik(mylogit) # -2 * model_ll
## 'log Lik.' 16.09223 (df=2)

1.2.14 Implementation of Regularization

Before we dive into the theoretical formulation of model regularization, let’s start with a specific application that will ground the subsequent analytics.

1.2.14.1 Example: Neuroimaging-genetics study of Parkinson’s Disease Dataset

More information about this specific study and the included derived neuroimaging biomarkers is available here. A link to the data and a brief summary of the features are included below:

  • 05_PPMI_top_UPDRS_Integrated_LongFormat1.csv
  • Data elements include: FID_IID, L_insular_cortex_ComputeArea, L_insular_cortex_Volume, R_insular_cortex_ComputeArea, R_insular_cortex_Volume, L_cingulate_gyrus_ComputeArea, L_cingulate_gyrus_Volume, R_cingulate_gyrus_ComputeArea, R_cingulate_gyrus_Volume, L_caudate_ComputeArea, L_caudate_Volume, R_caudate_ComputeArea, R_caudate_Volume, L_putamen_ComputeArea, L_putamen_Volume, R_putamen_ComputeArea, R_putamen_Volume, Sex, Weight, ResearchGroup, Age, chr12_rs34637584_GT, chr17_rs11868035_GT, chr17_rs11012_GT, chr17_rs393152_GT, chr17_rs12185268_GT, chr17_rs199533_GT, UPDRS_part_I, UPDRS_part_II, UPDRS_part_III, time_visit

Note that the dataset includes missing values and repeated measures.

The goal of this demonstration is to use OLS, ridge regression, and LASSO to find the best predictive model for the clinical outcomes – the UPDRS score (a quantitative vector) and ResearchGroup (a factor variable) – in terms of demographic, genetic, and neuroimaging biomarkers.

We can utilize the glmnet package in R for most calculations.

#### Initial Stuff ####
# clean up
rm(list=ls())
# load required packages
# install.packages("arm")
library(glmnet)
library(arm)
library(knitr) #  kable function to convert tabular R-results into Rmd tables
# pick a random seed, but set.seed(seed) only affects the next block of code!
seed = 1234

#### Organize Data ####
# load dataset
# Data: https://umich.instructure.com/courses/38100/files/folder/data  
# (05_PPMI_top_UPDRS_Integrated_LongFormat1.csv)
data1 <- read.table('https://umich.instructure.com/files/330397/download?download_frd=1', sep=",", header=T)    
# we will deal with missing values using multiple imputation later. For now, let's just ignore incomplete cases
data1.completeRowIndexes <- complete.cases(data1); table(data1.completeRowIndexes)
## data1.completeRowIndexes
## FALSE  TRUE 
##   609  1155
prop.table(table(data1.completeRowIndexes))
## data1.completeRowIndexes
##     FALSE      TRUE 
## 0.3452381 0.6547619
attach(data1)
# View(data1[data1.completeRowIndexes, ])

# define response and predictors
y <- data1$UPDRS_part_I + data1$UPDRS_part_II + data1$UPDRS_part_III
table(y)   # Show Clinically relevant classification
## y
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
## 54 20 25 12  8  7 11 16 16  9 21 16 13 13 22 25 21 31 25 29 29 28 20 25 28 26 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
## 35 41 23 34 32 31 37 34 28 36 29 27 22 19 17 18 18 19 16  9 10 12  9 11  7 10 
## 52 53 54 55 56 57 58 59 60 61 62 63 64 66 68 69 71 75 80 81 82 
## 11  5  7  4  1  5  9  4  3  2  1  6  1  2  1  2  1  1  2  3  1
y <- y[data1.completeRowIndexes]

# X = scale(data1[,])  # Explicit Scaling is not needed, as glmnet auto standardizes predictors
# X = as.matrix(data1[, c("R_caudate_Volume", "R_putamen_Volume", "Weight", "Age", "chr17_rs12185268_GT")])  # X needs to be a matrix, not a data frame
drop_features <- c("FID_IID", "ResearchGroup", "UPDRS_part_I", "UPDRS_part_II", 
                   "UPDRS_part_III", "time_visit")
X <- data1[ , !(names(data1) %in% drop_features)]
X = as.matrix(X)   # remove columns: FID_IID, ResearchGroup, time_visit, and the y components (UPDRS_part_I + UPDRS_part_II + UPDRS_part_III)
X <- X[data1.completeRowIndexes, ]
summary(X)
##  L_insular_cortex_ComputeArea L_insular_cortex_Volume
##  Min.   :  50.03              Min.   :   22.63       
##  1st Qu.:2174.57              1st Qu.: 5867.23       
##  Median :2522.52              Median : 7362.90       
##  Mean   :2306.89              Mean   : 6710.18       
##  3rd Qu.:2752.17              3rd Qu.: 8483.80       
##  Max.   :3650.81              Max.   :13499.92       
##  R_insular_cortex_ComputeArea R_insular_cortex_Volume
##  Min.   :  40.92              Min.   :  11.84        
##  1st Qu.:1647.69              1st Qu.:3559.74        
##  Median :1931.21              Median :4465.12        
##  Mean   :1758.64              Mean   :4127.87        
##  3rd Qu.:2135.57              3rd Qu.:5319.13        
##  Max.   :2791.92              Max.   :8179.40        
##  L_cingulate_gyrus_ComputeArea L_cingulate_gyrus_Volume
##  Min.   : 127.8                Min.   :   57.33        
##  1st Qu.:2847.4                1st Qu.: 6587.07        
##  Median :3737.7                Median : 8965.03        
##  Mean   :3411.3                Mean   : 8265.03        
##  3rd Qu.:4253.7                3rd Qu.:10815.06        
##  Max.   :5944.2                Max.   :17153.19        
##  R_cingulate_gyrus_ComputeArea R_cingulate_gyrus_Volume L_caudate_ComputeArea
##  Min.   : 104.1                Min.   :   47.67         Min.   :   1.782     
##  1st Qu.:2829.4                1st Qu.: 6346.31         1st Qu.: 318.806     
##  Median :3719.4                Median : 9094.15         Median : 710.779     
##  Mean   :3368.4                Mean   : 8194.07         Mean   : 657.442     
##  3rd Qu.:4261.8                3rd Qu.:10832.53         3rd Qu.: 951.868     
##  Max.   :6593.7                Max.   :19761.77         Max.   :1453.506     
##  L_caudate_Volume    R_caudate_ComputeArea R_caudate_Volume  
##  Min.   :   0.1928   Min.   :   1.782      Min.   :   0.193  
##  1st Qu.: 264.0013   1st Qu.: 660.696      1st Qu.: 893.637  
##  Median : 998.2269   Median :1063.046      Median :1803.281  
##  Mean   : 992.2892   Mean   : 894.806      Mean   :1548.739  
##  3rd Qu.:1568.3643   3rd Qu.:1183.659      3rd Qu.:2152.509  
##  Max.   :2746.6208   Max.   :1684.563      Max.   :3579.373  
##  L_putamen_ComputeArea L_putamen_Volume   R_putamen_ComputeArea
##  Min.   :   6.76       Min.   :   1.228   Min.   :  13.93      
##  1st Qu.: 775.73       1st Qu.:1234.601   1st Qu.:1255.62      
##  Median :1029.17       Median :1911.089   Median :1490.05      
##  Mean   : 959.15       Mean   :1864.390   Mean   :1332.01      
##  3rd Qu.:1260.56       3rd Qu.:2623.722   3rd Qu.:1642.41      
##  Max.   :2129.67       Max.   :4712.661   Max.   :2251.41      
##  R_putamen_Volume        Sex            Weight            Age       
##  Min.   :   3.207   Min.   :1.000   Min.   : 43.20   Min.   :31.18  
##  1st Qu.:2474.041   1st Qu.:1.000   1st Qu.: 69.90   1st Qu.:53.87  
##  Median :3510.249   Median :1.000   Median : 80.90   Median :62.16  
##  Mean   :3083.007   Mean   :1.347   Mean   : 82.06   Mean   :61.25  
##  3rd Qu.:3994.733   3rd Qu.:2.000   3rd Qu.: 90.70   3rd Qu.:68.83  
##  Max.   :7096.580   Max.   :2.000   Max.   :135.00   Max.   :83.03  
##  chr12_rs34637584_GT chr17_rs11868035_GT chr17_rs11012_GT chr17_rs393152_GT
##  Min.   :0.00000     Min.   :0.0000      Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:0.00000     1st Qu.:0.0000      1st Qu.:0.0000   1st Qu.:0.0000   
##  Median :0.00000     Median :1.0000      Median :0.0000   Median :0.0000   
##  Mean   :0.01212     Mean   :0.6364      Mean   :0.3654   Mean   :0.4468   
##  3rd Qu.:0.00000     3rd Qu.:1.0000      3rd Qu.:1.0000   3rd Qu.:1.0000   
##  Max.   :1.00000     Max.   :2.0000      Max.   :2.0000   Max.   :2.0000   
##  chr17_rs12185268_GT chr17_rs199533_GT
##  Min.   :0.0000      Min.   :0.0000   
##  1st Qu.:0.0000      1st Qu.:0.0000   
##  Median :0.0000      Median :0.0000   
##  Mean   :0.4268      Mean   :0.4052   
##  3rd Qu.:1.0000      3rd Qu.:1.0000   
##  Max.   :2.0000      Max.   :2.0000
# randomly split data into training (80%) and test (20%) sets
set.seed(seed)
train = sample(1 : nrow(X), round((4/5) * nrow(X)))
test = -train

# subset training data
yTrain = y[train]
XTrain = X[train, ]
XTrainOLS = cbind(rep(1, nrow(XTrain)), XTrain)
  
# subset test data
yTest = y[test]
XTest = X[test, ]

#### Model Estimation & Selection ####
# Estimate models
fitOLS = lm(yTrain ~ XTrain)  # Ordinary Least Squares
# glmnet automatically standardizes the predictors
fitRidge = glmnet(XTrain, yTrain, alpha = 0)  # Ridge Regression
fitLASSO = glmnet(XTrain, yTrain, alpha = 1)  # The LASSO

Readers are encouraged to compare and contrast the resulting ridge and LASSO models.

1.2.15 Computational Complexity

Recall that the regularized regression estimates depend on the regularization parameter \(\lambda\). Fortunately, efficient algorithms for computing the estimates along an entire path of \(\lambda\) values, and for choosing an optimal \(\lambda\), do exist. Examples of solution path algorithms include least angle regression (LARS) and the pathwise cyclic coordinate descent implemented in glmnet.

We will show how to visualize the relations between the regularization parameter (\(\ln(\lambda)\)) and the number and magnitude of the corresponding coefficients for each specific regularized regression method.
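
As a minimal preview (reusing the fitRidge and fitLASSO objects estimated above), the built-in plot method for glmnet objects displays these coefficient paths directly:

# coefficient solution paths as functions of log(lambda)
par(mfrow = c(1, 2))
plot(fitRidge, xvar = "lambda", label = TRUE)  # ridge: coefficients shrink smoothly, none become exactly zero
plot(fitLASSO, xvar = "lambda", label = TRUE)  # LASSO: coefficients are set to exactly zero as lambda increases
par(mfrow = c(1, 1))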

1.2.16 Ridge and LASSO Algorithms

Below we demonstrate hands-on implementations of the ridge and LASSO regularization algorithms. These manual ridge/LASSO functions, ridge() and lasso(), draw direct parallels between the mathematical foundations of regularized linear modeling, the corresponding matrix computations, and their empirical validation. In practical applications, to avoid complexities, improve performance, and ensure reproducibility, it is best to utilize the highly-optimized, extensively validated, and generic R function glmnet(). Again, in this demonstration we use the UPDRS (Parkinson’s disease) dataset to predict the latent outcome UPDRS

\[y =UPDRS\_part\_I + UPDRS\_part\_II + UPDRS\_part\_III\] using the other covariates.

For each regularization scheme (ridge or LASSO) the algorithms will include the following four steps:

  1. Define the regularized loss-function,
  2. Optimize the objective (loss function), see DSPA Chapter 13 (Optimization) for details,
  3. Derive the effect-size estimates \(\beta\)’s using matrix algebra, see DSPA Chapter 3 (Linear Algebra and Matrix Computing) for details,
  4. Compare and contrast the alternative linear model coefficient estimates.

1.2.16.1 Manual Ridge Linear Modeling

# 1. Define RIDGE Loss
# X: model Design matrix (observed covariates data) 
# y: target (observed outcomes of interest)
# lambda: regularization parameter (default to 0.1)
# beta:  effect-size estimates for all covariates in X (incl. intercept)
ridge <- function(beta, X, y, lambda = 0.1) {
  crossprod(y - X %*% beta) + lambda * length(y) * crossprod(beta)
}

X = XTrain
y = yTrain

# 2. Optimize the Ridge objective (loss function), See DSPA Chap. 13 (Optimization)
result_ridgeOptimization <-
  optim(rep(0, ncol(X)), ridge, X=X, y=y, lambda=0.1, method='BFGS')

# 3. Analytic Ridge solution
result_ridgeAnalytic <-
  solve(crossprod(X) + diag(length(y)*.1, ncol(X))) %*% crossprod(X, y)

# 4. Official `glmnet()` Ridge solution (alpha=0 corresponds to Ridge)
library(glmnet)
glmnet_res <-
  coef(glmnet(X, y, alpha=0,lambda=c(10, 1, 0.1), 
              thresh=1e-12, intercept=F), s=0.1)

# 4. Compare and Contrast the 4 alternative Linear Model solutions
library(DT)
df <- data.frame(covariates=colnames(X),
                 lm=coef(lm(y ~ . -1, data.frame(X))),
                 ridgeOptim  = result_ridgeOptimization$par,
                 ridgeAnalyt = result_ridgeAnalytic,
                 glmnet = glmnet_res[-1, 1])
datatable(df[, -1], caption="Compare & Contrast Ridge and Alternative Linear Model Estimates")
df_long <- df %>%
  pivot_longer(!covariates, names_to = "Method", values_to = "betaEstimates")
plot_ly(data=df_long, x=~covariates, y=~betaEstimates, 
        color=~Method, symbol=~Method, type="scatter", mode="markers") %>%
  layout(title="UPDRS: Comparing Automated and Manual Ridge Linear Model Estimation")

1.2.16.2 Manual LASSO Linear Modeling

This blog post outlines the details of the derivation of coordinate descent for LASSO regularized linear modeling.

1.2.16.2.1 LASSO algorithmic Solution

The cyclic coordinate descent LASSO optimization leading to the effect-size estimates requires iteratively going over all features, one at a time, and minimizing the loss function with respect to each effect-size, \(\beta_i\):

  • Start with an initial guess \(\beta^{(0)}=(\beta^{(0)}_1,\cdots,\beta^{(0)}_n)\).
  • At the \((k+1)\)-st iteration, define \(\beta^{(k+1)}\) in terms of \(\beta^{(k)}\) by iteratively solving the single-variable optimization problem \(\beta_i^{(k+1)} = \arg\min_{\omega} f(\beta_1^{(k+1)}, \cdots, \beta_{i-1}^{(k+1)}, \underbrace{\omega}_{optimiz.}, \beta_{i+1}^{(k)}, \cdots, \beta_n^{(k)})\); a gradient-based variant of this coordinate update is \(\beta_i:=\beta_i - \alpha \frac{\partial f}{\partial \beta_i}(\beta)\).
  • Repeat for each effect-size \(\beta_i,\ \forall 1\leq i\leq n\).

Each single-variable subproblem in the LASSO coordinate descent has a closed-form solution. For normalized data, the closed-form solution is defined in terms of the soft-threshold function

\[\hat{\beta}_j^{LASSO}=\mathcal{S}(\hat{\beta}_j^{LS}, \lambda)\equiv \begin{aligned} \begin{cases} \hat{\beta}_j^{LS} + \lambda , & \text{for} \ \hat{\beta}_j^{LS} < - \lambda \\ 0, & \text{for} \ - \lambda \leq \hat{\beta}_j^{LS} \leq \lambda \\ \hat{\beta}_j^{LS} - \lambda , & \text{for} \ \hat{\beta}_j^{LS} > \lambda \end{cases} \end{aligned}\ .\]

Iterative coordinate descent updating involves repeating this step either until a convergence tolerance is achieved or the max number of iterations is exceeded.

\[\hat{\beta}_j^{LS} := \sum_{i=1}^m { \left ( x_j^{(i)} \left (y^{(i)} - \sum_{k \neq j}^n \hat{\beta}_k^{LASSO} x_k^{(i)} \right ) \right )} = \sum_{i=1}^m x_j^{(i)} \left ( y^{(i)} - \hat y^{(i)}_{pred} + \hat{\beta}_j^{LASSO} x_j^{(i)} \right )\ , \\ \hat{\beta}_j^{LASSO} :=\mathcal{S}(\hat{\beta}_j^{LS} , \lambda)\ .\]

Notice that any constant (intercept) term \(\beta_o\) is not regularized, i.e., \(\hat{\beta}_o^{LASSO}=\hat{\beta}_o^{LS}\).
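Before turning to the kernelized implementation below, here is a bare-bones sketch of the cyclic coordinate-descent update itself. This is a minimal illustration only, not the implementation used later; the function names cdLASSO and softThreshold, the tolerance, and the iteration cap are arbitrary choices, and the objective assumed is the standard \(\frac{1}{2m}||y-X\beta||_2^2 + \lambda||\beta||_1\) with centered and scaled inputs.

# A minimal cyclic coordinate-descent LASSO sketch (illustration only)
softThreshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

cdLASSO <- function(X, y, lambda = 0.1, tol = 1e-6, maxIter = 1000) {
  m <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)                                   # initial guess beta^(0)
  for (iter in 1:maxIter) {
    betaOld <- beta
    for (j in 1:p) {
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residual excluding feature j
      rho <- sum(X[, j] * r_j) / m                    # least-squares update for beta_j
      beta[j] <- softThreshold(rho, lambda) / (sum(X[, j]^2) / m)
    }
    if (max(abs(beta - betaOld)) < tol) break         # stop once the updates stabilize
  }
  beta
}
# e.g., cdLASSO(scale(XTrain), as.numeric(scale(yTrain, scale = FALSE)), lambda = 0.1)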

The example below shows a rudimentary lasso() function definition, which has a certain level of flexibility, in particular to accommodate kernel-based regularized linear modeling.

# Use the same UPDRS Parkinson's disease dataset
X = XTrain   
y = yTrain
X = apply(X, 2, scales::rescale, to = c(0, 1))  # normalize/scale the features!

lambda = 0.1

inverse <- function(X, eps = 1e-9) { # a generalization of solve() to avoid singularities
  eig.X = eigen(X, symmetric = TRUE)
  P = eig.X[[2]] 
  lambda = eig.X[[1]]
  # to avoid singularities, identify the indices of all eigenvalues > epsilon
  ind = lambda > eps
  lambda[ind]  = 1/lambda[ind] 
  lambda[!ind] = 0
  P %*% diag(lambda) %*% t(P)
}

# Default Linear kernel: k(x,y) = x' * y
rk <- function(s, t) {
  init_len = length(s)
  rk = 0
  for (i in 1:init_len) { rk = s[i]*t[i] + rk }
  return(rk)
} 

# Gram Function - kernelized cross-product
gram <- function(X, rkfunc = rk) { # compute the `crossprod` using the specified RK
  apply(X, 1, function(Row) 
    apply(X, 1, function(tRow) rkfunc(Row, tRow)) # specifies the Reproducing Kernel
  )  
}

# When we kernelize the predictor design matrix X,
# we need to kernelize the response outcome Y as well
kernelized_y <- function(X,y, rkfunc = rk) { # compute the `crossprod` using the specified RK
  apply(X, 1, function(Row) 
     rkfunc(Row, y) # specifies the Reproducing Kernel
  )  
}

# An alternative Laplace kernel, as an additional example
rk_Laplace <- function(s, t, sigma=1) {
  if (sigma <= 0) sigma=1  # avoid singularities
  init_len = length(s)
  rk_Lap = 0
  for (i in 1:init_len) { rk_Lap = (s[i] - t[i])^2 + rk_Lap }
  rk_Lap = exp(-(sqrt(rk_Lap))/sigma)
  return(rk_Lap)
} 

# Naive LASSO
lasso <- function(X, y, lambda=0.1, tol=1e-6, iter=100, kernel=rk) { # tolerance & iter-max
  GramMatrix = gram(t(X), kernel)                 # GramMatrix (nxn) xTX

  # n = length(y)
  Q = cbind(1, GramMatrix)                        # Formulate the Design matrix Q

  M = crossprod(Q)                                # no need for singularity protection as 6*6
  M_inv = inverse(M)                              # (XTXXTX)^(-1)

  kernel_y  = kernelized_y(t(X), y, kernel)       # kernelized \phi(y) linear case is XTy

  gamma_hat = crossprod(M_inv, crossprod(Q, kernel_y)) # Q and M_inv are both symmetric, this is the final (XTXXTX)^(-1)XTXXTy

  J = ncol(X)
  for (j in 1:J) {
    # Apply the LASSO (soft-thresholding-style) shrinkage to each coefficient
    gamma_hat[j] = gamma_hat[j]*max(0, 1 - length(gamma_hat)*lambda/abs(gamma_hat[j]))
  }

  f_hat = Q %*% gamma_hat

  A = Q %*% M_inv %*% t(Q)
  tr_A = sum(diag(A))                             # trace of hat matrix

  rss  = crossprod(kernel_y - f_hat)              # residual sum of squares
  gcv  = J*rss / (J - tr_A)^2                     # compute GCV score
  return(list(f_hat = f_hat, beta_hat = gamma_hat, gcv = gcv, rss = rss))
}

# Test
las = lasso(X,y)$beta_hat[-1]

# 3. Official `glmnet()` LASSO solution (alpha=1 corresponds to LASSO)
glmnet_res <-
  coef(glmnet(X, y, alpha  = 1, lambda = lambda, 
              thresh=1e-12, intercept=FALSE), s=lambda)

# 4. Compare and Contrast the alternative Linear Model solutions (lm, manual LASSO, glmnet)
# df <- data.frame(covariates=colnames(X), lm=coef(lm(y ~ . -1, data.frame(X))),
#                  lasso_soft=result_soft, lasso_hard=result_hard,
#                  glmnet=glmnet_res[-1, 1])
df <- data.frame(covariates=colnames(X), lm=coef(lm(y ~ . -1, data.frame(X))),
                 lasso=las, glmnet=glmnet_res[-1, 1])

datatable(df[, -1], caption="Compare & Contrast LASSO and Alternative Linear Model Estimates")
df_long <- df %>%
  pivot_longer(!covariates, names_to = "Method", values_to = "betaEstimates")
plot_ly(data=df_long, x=~covariates, y=~betaEstimates, 
        color=~Method, symbol=~Method, type="scatter", mode="markers") %>%
  layout(title="UPDRS: Comparing Automated and Manual LASSO Linear Model Estimation")

1.2.17 LASSO and Ridge Solution Paths

The plot for the LASSO results can be obtained as follows.

library(RColorBrewer)
### Plot Solution Path ###
# LASSO
# plot(fitLASSO, xvar="lambda", label="TRUE")
# # add label to upper x-axis
# mtext("LASSO regularizer: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

plot.glmnet <- function(glmnet.object, name="") {
  df <- as.data.frame(t(as.matrix(glmnet.object$beta)))
  df$loglambda <- log(glmnet.object$lambda)
  df <- as.data.frame(df)
  data_long <- gather(df, Variable, coefficient, 1:(dim(df)[2]-1), factor_key=TRUE)
  
  plot_ly(data = data_long) %>%
    # add error bars for each CV-mean at log(lambda)
    add_trace(x = ~loglambda, y = ~coefficient, color=~Variable, 
              colors=colorRampPalette(brewer.pal(10,"Spectral"))(dim(df)[2]),  # "Dark2", 
              type = 'scatter', mode = 'lines',
              name = ~Variable) %>%
    layout(title = paste0(name, " Model Coefficient Values"),
                            xaxis = list(title = paste0("log(",TeX("\\lambda"),")"),    side="bottom",  showgrid = TRUE),
                          hovermode = "x unified", legend = list(orientation='h'),
                            yaxis = list(title = ' Model Coefficient Values',   side="left", showgrid = TRUE))
}

plot.glmnet(fitLASSO, name="LASSO")

Similarly, the plot for the Ridge regularization can be obtained by:

### Plot Solution Path ###
# Ridge
# plot(fitRidge, xvar="lambda", label="TRUE")
# # add label to upper x-axis
# mtext("Ridge regularizer: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

plot.glmnet(fitRidge, name="Ridge")

1.2.18 Regression Solution Paths - Ridge vs. LASSO

Let’s compare the solution paths of the LASSO and Ridge regressions. Below, you will see that the LASSO coefficient paths are steeper, non-differentiable at some points, and can reach exactly \(0\), which is a consequence of the \(L_1\) norm. In contrast, the Ridge paths are smoother and only tend to \(0\) asymptotically as \(\lambda\) increases.

Let’s start by examining the joint objective function (including LASSO and Ridge terms):

\[\min_\beta \left (\sum_i (y_i-x_i\beta)^2+\lambda\left(\frac{1-\alpha}{2}||\beta||_2^2+\alpha||\beta||_1\right) \right ),\]

where \(||\beta||_1 = \sum_{j=1}^{p}|\beta_j|\) and \(||\beta||_2 = \sqrt{\sum_{j=1}^{p}\beta_j^2}\) are the \(L_1\) and \(L_2\) norms of \(\boldsymbol\beta\), respectively. The extreme values \(\alpha=0\) and \(\alpha=1\) correspond to Ridge and LASSO regularization. Two natural questions arise:

  • What if \(0 <\alpha<1\)?
  • How does the regularization penalty term affect the optimal solution?
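Regarding the first question, intermediate values of \(\alpha\) yield the elastic net, which glmnet supports directly. Below is a brief illustration, assuming the XTrain/yTrain split from the Parkinson’s disease example above and an arbitrary mixing value \(\alpha=0.5\); the geometric effect of the penalty is explored next.

library(glmnet)
# Elastic net: 0 < alpha < 1 blends the Ridge and LASSO penalties
cvEN <- cv.glmnet(XTrain, yTrain, alpha = 0.5)      # cross-validate lambda for this mixing value
coef(cvEN, s = "lambda.1se")[1:5, , drop = FALSE]   # inspect a few coefficient estimates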

In Chapter 3, we explored minimizing the SSE (sum of squared errors) for OLS (without a penalty), where the parameter \(\beta\) ranges over the entire space. In penalized optimization problems, that unconstrained optimum may not be achievable, so we look for a solution within the feasible region that is closest to it.

The penalty term acts on the objective function separately from the fidelity term (the OLS loss). Thus, the effect of \(0\leq \alpha \leq 1\) is limited to the size and shape of the penalty (feasible) region. We can visualize the feasible region as:

  • a circular (ball-shaped) region, when \(\alpha=0\), and
  • a diamond-shaped region, when \(\alpha=1\).

Below is a hands-on demonstration of that process using the following simple quadratic equation solver.

library(needs)
library(MASS)    # fractions() is used below to label the alpha values

# Constructing Quadratic Formula
quadraticEquSolver <- function(a,b,c){
  if(delta(a,b,c) > 0){ # first case D>0
    x_1 = (-b+sqrt(delta(a,b,c)))/(2*a)
    x_2 = (-b-sqrt(delta(a,b,c)))/(2*a)
    result = c(x_1,x_2)
    # print(result)
  }
  else if(delta(a,b,c) == 0){ # second case D=0
    result = -b/(2*a)
    # print(result)
  }
  else {"There are no real roots."} # third case D<0
}
# Constructing delta
delta<-function(a,b,c){
  b^2-4*a*c
}

To make this realistic, we will use the MLB dataset to first fit an OLS model. The dataset contains \(1,034\) records of heights and weights of current and recent Major League Baseball (MLB) players.

  • Height: player height in inches,
  • Weight: player weight in pounds,
  • Age: player age at the time of the record.

Then, we can obtain the SSE for any \(\boldsymbol\beta\):

\[SSE = ||Y-\hat Y||^2 = (Y-\hat Y)^{T}(Y-\hat Y)=Y^TY - 2\beta^TX^TY + \beta^TX^TX\beta.\]

Next, we will compute the contours for SSE in several situations.

library("ggplot2")
# load data
mlb <- read.table('https://umich.instructure.com/files/330381/download?download_frd=1', 
                  as.is=T, header=T)
str(mlb)
## 'data.frame':    1034 obs. of  6 variables:
##  $ Name    : chr  "Adam_Donachie" "Paul_Bako" "Ramon_Hernandez" "Kevin_Millar" ...
##  $ Team    : chr  "BAL" "BAL" "BAL" "BAL" ...
##  $ Position: chr  "Catcher" "Catcher" "Catcher" "First_Baseman" ...
##  $ Height  : int  74 74 72 72 73 69 69 71 76 71 ...
##  $ Weight  : int  180 215 210 210 188 176 209 200 231 180 ...
##  $ Age     : num  23 34.7 30.8 35.4 35.7 ...
fit <- lm(Height ~ Weight + Age -1, data = as.data.frame(scale(mlb[,4:6])))
points = data.frame(x=c(0,fit$coefficients[1]),y=c(0,fit$coefficients[2]),z=c("(0,0)","OLS Coef"))

Y=scale(mlb$Height)
X = scale(mlb[,c(5,6)])
beta1=seq(-0.556, 1.556, length.out = 100)
beta2=seq(-0.661, 0.3386, length.out = 100)
df <- expand.grid(beta1 = beta1, beta2 = beta2)
b = as.matrix(df)
df$sse <- rep(t(Y)%*%Y,100*100) - 2*b%*%t(X)%*%Y + diag(b%*%t(X)%*%X%*%t(b))
base <- ggplot(df) + 
  stat_contour(aes(beta1, beta2, z = sse),breaks = round(quantile(df$sse, seq(0, 0.2, 0.03)), 0), 
               size = 0.5,color="darkorchid2",alpha=0.8)+ 
  scale_x_continuous(limits = c(-0.4,1))+
  scale_y_continuous(limits = c(-0.55,0.4))+
  coord_fixed(ratio=1)+
  geom_point(data = points,aes(x,y))+
  geom_text(data = points,aes(x,y,label=z),vjust = 2,size=3.5)+
  geom_segment(aes(x = -0.4, y = 0, xend = 1, yend = 0),colour = "grey46",
               arrow = arrow(length=unit(0.30,"cm")),size=0.5,alpha=0.8)+
  geom_segment(aes(x = 0, y = -0.55, xend = 0, yend = 0.4),colour = "grey46",
               arrow = arrow(length=unit(0.30,"cm")),size=0.5,alpha=0.8)
plot_alpha = function(alpha=0,restrict=0.2,beta1_range=0.2,annot=c(0.15,-0.25,0.205,-0.05)){
  a=alpha; t=restrict; k=beta1_range; pos=data.frame(V1=annot[1:4])
  text=paste("(",as.character(annot[3]),",",as.character(annot[4]),")",sep = "")
  K = seq(0,k,length.out = 50)
  y = unlist(lapply((1-a)*K^2/2+a*K-t, quadraticEquSolver,
                    a=(1-a)/2,b=a))[seq(1,99,by=2)]
  fills = data.frame(x=c(rev(-K),K), y1=c(rev(y),y), y2=c(-rev(y),-y))
  p<-base+geom_line(data=fills,aes(x = x,y = y1),colour = "salmon1",alpha=0.6,size=0.7)+
    geom_line(data=fills,aes(x = x,y = y2),colour = "salmon1",alpha=0.6,size=0.7)+
    geom_polygon(data = fills, aes(x, y1),fill = "red", alpha = 0.2)+
    geom_polygon(data = fills, aes(x, y2), fill = "red", alpha = 0.2)+
    geom_segment(data=pos,aes(x = V1[1] , y = V1[2], xend = V1[3], yend = V1[4]),
                 arrow = arrow(length=unit(0.30,"cm")),alpha=0.8,colour = "magenta")+
    ggplot2::annotate("text", x = pos$V1[1]-0.01, y = pos$V1[2]-0.11,
                      label = paste(text,"\n","Point of Contact \n i.e., Coef of", "alpha=",fractions(a)),size=3)+
    xlab(expression(beta[1]))+
    ylab(expression(beta[2]))+
    ggtitle(paste("alpha =",as.character(fractions(a))))+
    theme(legend.position="none")
}
# $\alpha=0$ - Ridge
p1 <- plot_alpha(alpha=0,restrict=(0.21^2)/2,beta1_range=0.21,annot=c(0.15,-0.25,0.205,-0.05))
p1 <- p1 + ggtitle(expression(paste(alpha, "=0 (Ridge)")))
# $\alpha=1/9$
p2 <- plot_alpha(alpha=1/9,restrict=0.046,beta1_range=0.22,annot =c(0.15,-0.25,0.212,-0.02))
p2 <- p2 + ggtitle(expression(paste(alpha, "=1/9")))
# $\alpha=1/5$
p3 <- plot_alpha(alpha=1/5,restrict=0.063,beta1_range=0.22,annot=c(0.13,-0.25,0.22,0))
p3 <- p3 + ggtitle(expression(paste(alpha, "=1/5")))
# $\alpha=1/2$
p4 <- plot_alpha(alpha=1/2,restrict=0.123,beta1_range=0.22,annot=c(0.12,-0.25,0.22,0))
p4 <- p4 + ggtitle(expression(paste(alpha, "=1/2")))
# $\alpha=3/4$
p5 <- plot_alpha(alpha=3/4,restrict=0.17,beta1_range=0.22,annot=c(0.12,-0.25,0.22,0))
p5 <- p5 + ggtitle(expression(paste(alpha, "=3/4")))
# $\alpha=1$ - LASSO
t=0.22
K = seq(0,t,length.out = 50)
fills = data.frame(x=c(-rev(K),K),y1=c(rev(t-K),c(t-K)),y2=c(-rev(t-K),-c(t-K)))
p6 <- base + 
  geom_segment(aes(x = 0, y = t, xend = t, yend = 0),colour = "salmon1",alpha=0.1,size=0.2)+
  geom_segment(aes(x = 0, y = t, xend = -t, yend = 0),colour = "salmon1",alpha=0.1,size=0.2)+
  geom_segment(aes(x = 0, y = -t, xend = t, yend = 0),colour = "salmon1",alpha=0.1,size=0.2)+
  geom_segment(aes(x = 0, y = -t, xend = -t, yend = 0),colour = "salmon1",alpha=0.1,size=0.2)+
  geom_polygon(data = fills, aes(x, y1),fill = "red", alpha = 0.2)+
  geom_polygon(data = fills, aes(x, y2), fill = "red", alpha = 0.2)+
  geom_segment(aes(x = 0.12 , y = -0.25, xend = 0.22, yend = 0),colour = "magenta",
               arrow = arrow(length=unit(0.30,"cm")),alpha=0.8)+
  ggplot2::annotate("text", x = 0.11, y = -0.36,
                    label = "(0.22,0)\n Point of Contact \n i.e Coef of LASSO",size=3)+
  xlab( expression(beta[1]))+
  ylab( expression(beta[2]))+
  theme(legend.position="none")+  
  ggtitle(expression(paste(alpha, "=1 (LASSO)")))

Then, let’s add the six feasible regions corresponding to \(\alpha=0\) (Ridge), \(\alpha=\frac{1}{9}\), \(\alpha=\frac{1}{5}\), \(\alpha=\frac{1}{2}\), \(\alpha=\frac{3}{4}\) and \(\alpha=1\) (LASSO).

This figure provides some intuition into the continuum from Ridge to LASSO regularization. The elliptical curves are contours of the SSE (fidelity term), and the shaded red areas are the penalty (feasible) regions, bounded by the constraint \(\frac{1-\alpha}{2}||\beta||_2^2+\alpha||\beta||_1\leq t\).

In this example, \(\beta_2\) shrinks to \(0\) for \(\alpha=\frac{1}{5}\), \(\alpha=\frac{1}{2}\), \(\alpha=\frac{3}{4}\) and \(\alpha=1\).

We observe that the SSE contours rarely touch the circular Ridge region exactly on a coordinate axis, so Ridge estimates are rarely exactly zero. The same intuition holds in higher dimensions, where the \(L_1\) and \(L_2\) penalty regions become a hyper-diamond and a hyper-ball, and the SSE contours become hyper-ellipsoids.

Generally, as \(\alpha\) increases from \(0\) to \(1\), the coefficients of more features shrink exactly to \(0\). This property makes LASSO useful for variable selection.

By Lagrangian duality, solutions of the penalized problems \(\min_\beta {||Y-X\beta||^2_2} +\lambda ||\beta||_2^2\) and \(\min_\beta {||Y-X\beta||^2_2} +\lambda ||\beta||_1\) also solve the corresponding constrained Ridge (\(\hat{\beta}^{RR}\)) and LASSO (\(\hat{\beta}^{L}\)) optimization problems:

\[\min_\beta {||Y-X\beta||^2_2},\ \ \text{subject to}\ \ ||\beta||_2 \leq||\hat{\beta}^{RR}||_2, \] \[\min_\beta {||Y-X\beta||^2_2},\ \ \text{subject to}\ \ ||\beta||_1 \leq ||\hat{\beta}^{L}||_1. \]

Suppose we actually know the values of \(||\hat{\beta}^{RR}||_2\) and \(||\hat{\beta}^{L}||_1\), then we can pictorially represent the optimization problem and illustrate the complementary model-fitting, variable selection and shrinkage of the Ridge and LASSO regularization.

The topologies of the solution (domain) regions are different for Ridge and LASSO. Ridge regularization corresponds with ball topology and LASSO with diamond topology. This is because the solution regions are defined by \(||\hat{\beta}||_2\leq ||\hat{\beta}^{RR}||_2\) and \(||\hat{\beta}||_1\leq ||\hat{\beta}^{L}||_1\), respectively.

On the other hand, the contours of the fidelity term \(||Y-X\beta||^2_2\) are ellipsoids centered at the OLS estimate, \(\hat{\beta}^{OLS}\). To solve the constrained problem, we look for the tightest contour around \(\hat{\beta}^{OLS}\) that touches the solution domain (ball for Ridge, diamond for LASSO). The point of contact represents the solution estimate \(\hat{\beta}\). Because the LASSO domain (the \(L_1\) ball) has corners on the coordinate axes, the contact point often lands on a corner, so LASSO solutions tend to include many zeroes, whereas Ridge regression solutions (whose constraint set is a round ball) generally do not.

Let’s compare the feasibility regions corresponding to Ridge (top, \(p1\)) and LASSO (bottom, \(p6\)) regularization.

plot(p1)
SSE Contour and Penalty Region (Ridge).

plot(p6)
SSE Contour and Penalty Region (LASSO).

Then, we can plot the progression from Ridge to LASSO. (This composite plot is intense and may take several minutes to render!)

library("gridExtra")
grid.arrange(p1,p2,p3,p4,p5,p6,nrow=3)
SSE Contour and Penalty Region for 6 values of Alpha.

1.2.19 Choice of the Regularization Parameter

Efficiently obtaining the entire solution path is nice, but we still have to choose a specific \(\lambda\) regularization parameter. This is critical as \(\lambda\) controls the bias-variance tradeoff.

Traditional model selection methods rely on various metrics like Mallows’ \(C_p\), AIC, BIC, and adjusted \(R^2\).

Internal statistical validation (cross-validation) is a popular modern alternative, which offers several benefits:

  • Choice is based on predictive performance,
  • Makes fewer model assumptions,
  • More widely applicable.

1.2.20 Cross Validation Motivation

We discussed statistical internal cross validation (CV) in Chapter 9. When assessing model performance using a regularized approach, we would like a separate validation set for choosing the parameter \(\lambda\) controlling the weight of the regularizer. Reusing training sets may encourage overfitting and using testing data to pick \(\lambda\) may underestimate the true error rate. Often, when we do not have enough data for a separate validation set, cross validation provides an alternative strategy.

1.2.21 \(n\)-Fold Cross Validation

We have already seen examples of using cross validation, e.g., Chapter 9 provides more details about this internal statistical assessment strategy.

We can use either automated or manual cross-validation. In either case, the protocol involves the following iterative steps:

  1. Randomly split the training data into \(n\) parts (“folds”).
  2. Fit a model on the data in the remaining \(n-1\) folds for multiple values of \(\lambda\).
  3. Calculate prediction quality metrics (e.g., MSE, accuracy) on the held-out fold, see Chapter 9.
  4. Repeat so that each fold serves once as the held-out set, and average the prediction metrics across folds.

Common choices of \(n\) are 5, 10, and \(N\) (\(n=N\), the sample size, corresponds to leave-one-out CV). The one standard error rule suggests choosing the most parsimonious model (largest \(\lambda\), fewest nonzero parameters) whose CV MSE is within one standard error of the minimum MSE.
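As a concrete illustration, cv.glmnet reports both candidate \(\lambda\) values directly; here is a brief sketch, assuming a cross-validated fit such as the cvLASSO object constructed in the following subsection.

cvLASSO$lambda.min    # lambda minimizing the cross-validated MSE
cvLASSO$lambda.1se    # largest lambda with CV-MSE within one SE of the minimum (sparser model)
sum(as.numeric(coef(cvLASSO, s = "lambda.1se")) != 0) - 1   # number of nonzero (non-intercept) coefficients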

1.2.22 LASSO 10-Fold Cross Validation

Now, let’s apply an internal statistical cross-validation to assess the quality of the LASSO and Ridge models, based on our Parkinson’s disease case-study. Recall our split of the PD data into training (yTrain, XTrain) and testing (yTest, XTest) sets.

plotCV.glmnet <- function(cv.glmnet.object, name="") {
  df <- as.data.frame(cbind(x=log(cv.glmnet.object$lambda), y=cv.glmnet.object$cvm, 
                           errorBar=cv.glmnet.object$cvsd), nzero=cv.glmnet.object$nzero)

  featureNum <- cv.glmnet.object$nzero
  xFeature <- log(cv.glmnet.object$lambda)
  yFeature <- max(cv.glmnet.object$cvm)+max(cv.glmnet.object$cvsd)
  dataFeature <- data.frame(featureNum, xFeature, yFeature)

  plot_ly(data = df) %>%
    # add error bars for each CV-mean at log(lambda)
    add_trace(x = ~x, y = ~y, type = 'scatter', mode = 'markers',
        name = 'CV MSE', error_y = ~list(array = errorBar)) %>% 
    # add the lambda-min and lambda 1SD vertical dash lines
    add_lines(data=df, x=c(log(cv.glmnet.object$lambda.min), log(cv.glmnet.object$lambda.min)), 
              y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)),
              showlegend=F, line=list(dash="dash"), name="lambda.min", mode = 'lines+markers') %>%
    add_lines(data=df, x=c(log(cv.glmnet.object$lambda.1se), log(cv.glmnet.object$lambda.1se)), 
              y=c(min(cv.glmnet.object$cvm)-max(df$errorBar), max(cv.glmnet.object$cvm)+max(df$errorBar)), 
              showlegend=F, line=list(dash="dash"), name="lambda.1se") %>%
    # Add Number of Features Annotations on Top
    add_trace(dataFeature, x = ~xFeature, y = ~yFeature, type = 'scatter', name="Number of Features",
        mode = 'text', text = ~featureNum, textposition = 'middle right',
        textfont = list(color = '#000000', size = 9)) %>%
    # Add top x-axis (non-zero features)
    # add_trace(data=df, x=~c(min(cv.glmnet.object$nzero),max(cv.glmnet.object$nzero)),
    #           y=~c(max(y)+max(errorBar),max(y)+max(errorBar)), showlegend=F, 
    #           name = "Non-Zero Features", yaxis = "ax", mode = "lines+markers", type = "scatter") %>%
    layout(title = paste0("Cross-Validation MSE (", name, ")"),
                            xaxis = list(title=paste0("log(",TeX("\\lambda"),")"),  side="bottom", showgrid=TRUE), # type="log"
                            hovermode = "x unified", legend = list(orientation='h'),  # xaxis2 = ax,  
                            yaxis = list(title = cv.glmnet.object$name, side="left", showgrid = TRUE))
}

#### 10-fold cross validation ####
# LASSO
library("glmnet")
library(doParallel)
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
set.seed(seed)  # set seed 
# (10-fold) cross validation for the LASSO
cvLASSO = cv.glmnet(XTrain, yTrain, alpha = 1, parallel=TRUE)

# plot(cvLASSO)
# mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

plotCV.glmnet(cvLASSO, "LASSO")
# Report MSE LASSO
predLASSO <-  predict(cvLASSO, s = cvLASSO$lambda.1se, newx = XTest)
testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO
## [1] 233.183

#### 10-fold cross validation ####
# Ridge Regression
set.seed(seed)  # set seed 
# (10-fold) cross validation for Ridge Regression
cvRidge = cv.glmnet(XTrain, yTrain, alpha = 0, parallel=TRUE)

# plot(cvRidge)
# mtext("CV Ridge: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

plotCV.glmnet(cvRidge, "Ridge")
# Report MSE Ridge
predRidge <-  predict(cvRidge, s = cvRidge$lambda.1se, newx = XTest)
testMSE_Ridge <- mean((predRidge - yTest)^2); testMSE_Ridge
## [1] 233.183
stopCluster(cl)

Note that the predict() method applied to cv.glmnet or glmnet forecasting models is effectively a function wrapper to predict.glmnet(). Depending on the desired prediction output, you can use type="..." to specify one of the following kinds of output (a brief example follows the list):

  • type="link", reports the linear predictors for “binomial”, “multinomial”, “poisson” or “cox” models; for “gaussian” models it gives the fitted values.
  • type="response", reports the fitted probabilities for “binomial” or “multinomial”, fitted mean for “poisson” and the fitted relative-risk for “cox”; for “gaussian” type “response” is equivalent to type “link”.
  • type="coefficients", reports the coefficients at the requested values for s. Note that for “binomial” models, results are returned only for the class corresponding to the second level of the factor response.
  • type="class", applies only to “binomial” or “multinomial” models, and produces the class label corresponding to the maximum probability.
  • type="nonzero", returns a list of the indices of the nonzero coefficients for each value of s.

1.2.23 Stepwise OLS (ordinary least squares)

For a fair comparison, let’s also obtain an OLS stepwise model selection, which we presented earlier.

dt = as.data.frame(cbind(yTrain,XTrain))
ols_step <- lm(yTrain ~., data = dt)
ols_step <- step(ols_step, direction = 'both', k=2, trace = F)
summary(ols_step)
## 
## Call:
## lm(formula = yTrain ~ L_insular_cortex_ComputeArea + L_insular_cortex_Volume + 
##     R_insular_cortex_Volume + L_cingulate_gyrus_ComputeArea + 
##     R_cingulate_gyrus_Volume + L_caudate_Volume + L_putamen_ComputeArea + 
##     L_putamen_Volume + R_putamen_ComputeArea + Sex + Weight + 
##     Age + chr17_rs11868035_GT + chr17_rs11012_GT + chr17_rs393152_GT + 
##     chr17_rs12185268_GT, data = dt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.326  -9.250   0.067   9.257  54.184 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    4.6772645  6.8299694   0.685 0.493636    
## L_insular_cortex_ComputeArea  -0.0082211  0.0046771  -1.758 0.079131 .  
## L_insular_cortex_Volume        0.0021880  0.0012140   1.802 0.071834 .  
## R_insular_cortex_Volume       -0.0013032  0.0008958  -1.455 0.146080    
## L_cingulate_gyrus_ComputeArea  0.0074028  0.0016421   4.508 7.40e-06 ***
## R_cingulate_gyrus_Volume      -0.0014608  0.0003854  -3.790 0.000161 ***
## L_caudate_Volume              -0.0044248  0.0013782  -3.211 0.001371 ** 
## L_putamen_ComputeArea         -0.0144807  0.0053722  -2.695 0.007159 ** 
## L_putamen_Volume               0.0055039  0.0021430   2.568 0.010379 *  
## R_putamen_ComputeArea          0.0065605  0.0023052   2.846 0.004527 ** 
## Sex                            2.9342207  1.2369794   2.372 0.017896 *  
## Weight                         0.0555807  0.0346594   1.604 0.109145    
## Age                            0.1524618  0.0593021   2.571 0.010301 *  
## chr17_rs11868035_GT           -1.5710859  0.7447326  -2.110 0.035167 *  
## chr17_rs11012_GT              -7.7050024  1.9846968  -3.882 0.000111 ***
## chr17_rs393152_GT             -4.4644593  2.3601881  -1.892 0.058867 .  
## chr17_rs12185268_GT           12.9794629  2.9244435   4.438 1.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.81 on 907 degrees of freedom
## Multiple R-squared:  0.0995, Adjusted R-squared:  0.08362 
## F-statistic: 6.264 on 16 and 907 DF,  p-value: 2.123e-13

We use direction='both' to combine forward and backward selection and keep the better move at each step. The argument k=2 corresponds to the AIC penalty; choosing \(k=\log(n)\) instead yields the BIC criterion.
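For example, a BIC-flavored version of the same stepwise search only changes the penalty multiplier; the sketch below assumes the dt data frame defined above, and the object name ols_step_BIC is illustrative.

ols_step_BIC <- step(lm(yTrain ~ ., data = dt), direction = 'both',
                     k = log(nrow(dt)), trace = FALSE)   # k = log(n) corresponds to BIC
length(coef(ols_step_BIC)) - 1   # number of retained predictors (excluding the intercept)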

Then, we use the ols_step model to predict the outcome \(Y\) for some new test data.

betaHatOLS_step = ols_step$coefficients
var_step <- colnames(ols_step$model)[-1]
XTestOLS_step = cbind(rep(1, nrow(XTest)), XTest[,var_step])
predOLS_step = XTestOLS_step%*%betaHatOLS_step 
testMSEOLS_step = mean((predOLS_step - yTest)^2)
# Report MSE OLS Stepwise feature selection
testMSEOLS_step
## [1] 243.939

Alternatively, we can predict the outcomes directly using the predict() function; the results should agree with the manual computation (up to floating-point tolerance):

pred2 <- predict(ols_step, as.data.frame(XTest))
all.equal(as.numeric(pred2), as.numeric(predOLS_step))
## [1] TRUE

1.2.24 Final Models

Let’s identify the most important (predictive) features, which can then be interpreted in the context of the specific data.

# Determine final models

# Extract Coefficients
# OLS coefficient estimates
betaHatOLS = fitOLS$coefficients
# LASSO coefficient estimates 
betaHatLASSO = as.double(coef(fitLASSO, s = cvLASSO$lambda.1se))  # s is lambda
# Ridge  coefficient estimates 
betaHatRidge = as.double(coef(fitRidge, s = cvRidge$lambda.1se))

# Test Set MSE
# calculate predicted values

XTestOLS = cbind(rep(1, nrow(XTest)), XTest) # add intercept to test data
predOLS = XTestOLS%*%betaHatOLS 
predLASSO = predict(fitLASSO, s = cvLASSO$lambda.1se, newx = XTest)
predRidge = predict(fitRidge, s = cvRidge$lambda.1se, newx = XTest)

# calculate test set MSE
testMSEOLS = mean((predOLS - yTest)^2)
testMSELASSO = mean((predLASSO - yTest)^2)
testMSERidge = mean((predRidge - yTest)^2)

This plot shows a rank-ordered list of the key predictors of the clinical outcome variable (total UPDRS, y <- data1$UPDRS_part_I + data1$UPDRS_part_II + data1$UPDRS_part_III).

# Plot Regression Coefficients
# create variable names for plotting 
library("arm")
par(mar=c(2, 13, 1, 1))   # extra large left margin
varNames <- colnames(data1[ , !(names(data1) %in% drop_features)]); varNames; length(varNames)
##  [1] "L_insular_cortex_ComputeArea"  "L_insular_cortex_Volume"      
##  [3] "R_insular_cortex_ComputeArea"  "R_insular_cortex_Volume"      
##  [5] "L_cingulate_gyrus_ComputeArea" "L_cingulate_gyrus_Volume"     
##  [7] "R_cingulate_gyrus_ComputeArea" "R_cingulate_gyrus_Volume"     
##  [9] "L_caudate_ComputeArea"         "L_caudate_Volume"             
## [11] "R_caudate_ComputeArea"         "R_caudate_Volume"             
## [13] "L_putamen_ComputeArea"         "L_putamen_Volume"             
## [15] "R_putamen_ComputeArea"         "R_putamen_Volume"             
## [17] "Sex"                           "Weight"                       
## [19] "Age"                           "chr12_rs34637584_GT"          
## [21] "chr17_rs11868035_GT"           "chr17_rs11012_GT"             
## [23] "chr17_rs393152_GT"             "chr17_rs12185268_GT"          
## [25] "chr17_rs199533_GT"
## [1] 25
# # Graph 27 regression coefficients (exclude intercept [1], betaHat indices 2:27)
# coefplot(betaHatOLS[2:27], sd = rep(0, 26), pch=0, cex.pts = 3, main = "Regression Coefficient Estimates", varnames = varNames)
# coefplot(betaHatLASSO[2:27], sd = rep(0, 26), pch=1, add = TRUE, col.pts = "red", cex.pts = 3)
# coefplot(betaHatRidge[2:27], sd = rep(0, 26), pch=2, add = TRUE, col.pts = "blue", cex.pts = 3)
# legend("bottomright", c("OLS", "LASSO", "Ridge"), col = c("black", "red", "blue"), pch = c(0, 1 , 2), bty = "o", cex = 2)

df <- as.data.frame(cbind(Feature=attributes(betaHatOLS)$names[2:26], OLS=betaHatOLS[2:26],
                          LASSO=betaHatLASSO[2:26], Ridge=betaHatRidge[2:26]))

data_long <- gather(df, Method, value, OLS:Ridge, factor_key=TRUE)
data_long$value <- as.numeric(data_long$value)

# Note that Plotly will automatically order your axes by the order that is present in the data
# When using character vectors - order is alphabetic; in case of factors the order is by levels. 
# To override this behavior, specify categoryorder and categoryarray for the appropriate axis in the layout
formY <- list(categoryorder = "array", categoryarray = df$Feature)

plot_ly(data_long, x=~value, y=~Feature, type="scatter", mode="markers", 
        marker=list(size=20), color=~Method, symbol=~Method, symbols=c('circle-open','x-open','hexagon-open')) %>%
  layout(yaxis = formY)
# par()

1.2.25 Model Performance

We next quantify the performance of the models.

# Test Set MSE Table 
# create table as data frame
MSETable = data.frame(OLS=testMSEOLS, OLS_step=testMSEOLS_step, LASSO=testMSELASSO, Ridge=testMSERidge)

# convert to markdown
kable(MSETable, format="pandoc", caption="Test Set MSE", align=c("c", "c", "c", "c"))
Test Set MSE
OLS OLS_step LASSO Ridge
247.3367 243.939 233.183 233.183

1.2.26 Compare the selected variables

var_step = names(ols_step$coefficients)[-1]
var_lasso = colnames(XTrain)[which(coef(fitLASSO, s = cvLASSO$lambda.min)!=0)-1]
intersect(var_step,var_lasso)
##  [1] "L_insular_cortex_ComputeArea"  "L_insular_cortex_Volume"      
##  [3] "R_insular_cortex_Volume"       "L_cingulate_gyrus_ComputeArea"
##  [5] "R_cingulate_gyrus_Volume"      "L_caudate_Volume"             
##  [7] "L_putamen_ComputeArea"         "L_putamen_Volume"             
##  [9] "R_putamen_ComputeArea"         "Sex"                          
## [11] "Weight"                        "Age"                          
## [13] "chr17_rs11868035_GT"           "chr17_rs11012_GT"             
## [15] "chr17_rs393152_GT"             "chr17_rs12185268_GT"
coef(fitLASSO, s = cvLASSO$lambda.min)
## 26 x 1 sparse Matrix of class "dgCMatrix"
##                                          s1
## (Intercept)                    1.7349137426
## L_insular_cortex_ComputeArea  -0.0031461632
## L_insular_cortex_Volume        0.0007428576
## R_insular_cortex_ComputeArea   .           
## R_insular_cortex_Volume       -0.0007386605
## L_cingulate_gyrus_ComputeArea  0.0060323275
## L_cingulate_gyrus_Volume       .           
## R_cingulate_gyrus_ComputeArea -0.0004655064
## R_cingulate_gyrus_Volume      -0.0009788965
## L_caudate_ComputeArea          .           
## L_caudate_Volume              -0.0031007230
## R_caudate_ComputeArea          .           
## R_caudate_Volume              -0.0007081574
## L_putamen_ComputeArea         -0.0099013352
## L_putamen_Volume               0.0038351119
## R_putamen_ComputeArea          0.0056737358
## R_putamen_Volume               .           
## Sex                            2.6406206813
## Weight                         0.0577691358
## Age                            0.1642627834
## chr12_rs34637584_GT           -2.1268900512
## chr17_rs11868035_GT           -1.4279120508
## chr17_rs11012_GT              -6.6808710487
## chr17_rs393152_GT             -3.3231021784
## chr17_rs12185268_GT           10.9433767653
## chr17_rs199533_GT              .

Based on the output above, stepwise OLS selection retains 16 predictors, while LASSO with the optimal \(\lambda\) (lambda.min) keeps 19 nonzero coefficients. All 16 stepwise-selected variables are also among the LASSO-selected ones.

1.2.27 Summary

Traditional linear models are useful but also have their shortcomings:

  • Prediction accuracy may be sub-optimal.
  • Model interpretability may be challenging (especially when a large number of features are used as regressors).
  • Stepwise model selection may improve performance and interpretability, but is still not guaranteed to be optimal.

Regularization adds a penalty term to the estimation:

  • Enables exploitation of the bias-variance tradeoff.
  • Provides flexibility on specifying penalties to allow for continuous variable selection.
  • Allows incorporation of prior knowledge.

1.3 Knockoff Filtering (FDR-Controlled Feature Selection)

1.3.1 Simulated Knockoff Example

Variable selection that controls the false discovery rate (FDR) of salient features can be accomplished in different ways. Knockoff filtering represents one such strategy for controlled variable selection. To illustrate the usage of knockoff.filter, we start with a synthetic dataset constructed so that the true coefficient vector \(\beta\) has only a few nonzero entries.

The essence of the knockoff filtering is based on the following three-step process:

  • Construct the decoy features (knockoff variables), one for each real observed feature. These act as controls for assessing the importance of the real variables.
  • For each feature, \(X_j\), compute the knockoff statistic, \(W_j\), which measures the importance of the variable, relative to its decoy counterpart, \(\tilde{X}_j\). This importance is measured by comparing the corresponding parameter estimates, \(\hat{\beta}_{X_j}\) and \(\hat{\beta}_{\tilde{X}_j}\), obtained via regularized linear modeling (e.g., LASSO).
  • Determine the overall knockoff threshold. This is computed by rank-ordering the \(W_j\) statistics (from large to small) and selecting the variables \(X_j\) whose \(W_j\) exceeds a data-dependent threshold, chosen as the smallest value for which the estimated ratio of negative to positive \(W_j\)’s (a proxy for the FDR) remains below the target level \(q\), e.g., \(q=0.10\).

Mathematically, we consider \(X_j\) to be unimportant (i.e., peripheral or extraneous) if the conditional distribution of \(Y\) given \(X_1,\cdots,X_p\) does not depend on \(X_j\). Formally, \(X_j\) is unimportant if it is conditionally independent of \(Y\) given all other features, \(X_{-j}\):

\[Y \perp X_j\ |\ X_{-j}.\] We seek the Markov blanket of \(Y\), i.e., the smallest set of features \(S\) such that, conditional on \(X_S\), the response \(Y\) is independent of all remaining features. Further, to make sure we do not make too many mistakes, we search for a selected set \(\hat{S}\) that controls the false discovery rate (FDR):

\[FDR(\hat{S}) = \mathrm{E} \left (\frac{\#j\in \hat{S}:\ x_j\ unimportant}{\#j\in \hat{S}} \right) \leq q\ (e.g.\ 10\%).\]

Let’s look at one simulation example.

# Problem parameters
n = 1000          # number of observations
p = 300           # number of variables
k = 30            # number of variables with nonzero coefficients
amplitude = 3.5   # signal amplitude (for noise level = 1)

# Problem data
X = matrix(rnorm(n*p), nrow=n, ncol=p)
nonzero = sample(p, k)
beta = amplitude * (1:p %in% nonzero)
y.sample <- function() X %*% beta + rnorm(n)

To begin with, we will invoke the knockoff.filter using the default settings.

# install.packages("knockoff")
library(knockoff)
y = y.sample()
result = knockoff.filter(X, y)
print(result)
## Call:
## knockoff.filter(X = X, y = y)
## 
## Selected variables:
##  [1]   4  11  12  25  27  42  47  82  87  91  96 104 112 123 127 130 148 152 157
## [20] 182 206 227 231 236 239 242 245 254 258 278 292 298

The false discovery proportion (fdp) is:

fdp <- function(selected) sum(beta[selected] == 0) / max(1, length(selected))
fdp(result$selected)
## [1] 0.0625

The knockoff filter targets an FDR of \(q=0.10\) by default; the realized false discovery proportion in this run is about \(0.06\), below the target.

The default settings of the knockoff filter use a LASSO-based test statistic (knockoff.stat.lasso_signed_max in older versions of the package), which computes \(W_j\) statistics quantifying the discrepancy between the importance of a real feature (\(X_j\)) and that of its decoy knockoff (\(\tilde{X}_j\)):

\[W_j=\max(Z_j, \tilde{Z}_j) \times \mathrm{sign}(Z_j - \tilde{Z}_j), \] where \(Z_j\) and \(\tilde{Z}_j\) are the importance scores of \(X_j\) and \(\tilde{X}_j\) (e.g., the largest \(\lambda\) at which each enters the LASSO path). Effectively, \(W_j\) measures how much more important the variable \(X_j\) is relative to its decoy counterpart \(\tilde{X}_j\), with the magnitude of \(W_j\) quantifying the strength of this evidence.

The knockoff package includes several other test statistics, with names prefixed by stat. (knockoff.stat. in older versions). For instance, below we explicitly request a statistic based on a cross-validated GLM (stat.glmnet_coefdiff) together with a target FDR of \(0.10\).

result = knockoff.filter(X, y, fdr = 0.10, statistic = stat.glmnet_coefdiff)         # Old: statistic=knockoff.stat.fs)
#knockoff::stat.forward_selection       Importance statistics based on forward selection
#knockoff::stat.glmnet_coefdiff     Importance statistics based on a GLM with cross-validation
#knockoff::stat.glmnet_lambdadiff       Importance statistics based on a GLM
#knockoff::stat.glmnet_lambdasmax       GLM statistics for knockoff
#knockoff::stat.lasso_coefdiff      Importance statistics based the lasso with cross-validation
#knockoff::stat.lasso_coefdiff_bin      Importance statistics based on regularized logistic regression with cross-validation
#knockoff::stat.lasso_lambdadiff        Importance statistics based on the lasso
#knockoff::stat.lasso_lambdadiff_bin        Importance statistics based on regularized logistic regression
#knockoff::stat.lasso_lambdasmax        Penalized linear regression statistics for knockoff
#knockoff::stat.lasso_lambdasmax_bin        Penalized logistic regression statistics for knockoff
#knockoff::stat.random_forest       Importance statistics based on random forests
# knockoff::stat.sqrt_lasso     Importance statistics based on the square-root lasso
#knockoff::stat.stability_selection     Importance statistics based on stability selection
#knockoff::verify_stat_depends      Verify dependencies for chosen statistics)
fdp(result$selected)
## [1] 0.09090909

One can also define additional test statistics, complementing the ones already included in the package. For instance, suppose we want to implement the following test statistic:

\[W_j= |X_j^T y| - |\tilde{X}_j^T y|.\]

We can code it as:

new_knockoff_stat <- function(X, X_ko, y) {
  abs(t(X) %*% y) - abs(t(X_ko) %*% y)
}
result = knockoff.filter(X, y, statistic = new_knockoff_stat)
# print indices of selected features
print(sprintf("Number of KO-selected features: %d", length(result$selected)))
## [1] "Number of KO-selected features: 22"
cat("Indices of KO-selected features: ", result$selected)
## Indices of KO-selected features:  4 12 25 27 47 87 91 96 123 127 130 152 157 182 206 227 231 242 245 258 292 298
fdp(result$selected)
## [1] 0

1.3.2 Knockoff invocation

The knockoff.filter function is a wrapper around several simpler functions that (1) construct the knockoff variables (the create.* functions, e.g., create.second_order; knockoff.create in older package versions); (2) compute the test statistics \(W\) (the stat.* functions, e.g., stat.glmnet_coefdiff); and (3) compute the data-dependent threshold for variable selection (knockoff.threshold). A minimal sketch of these three steps, performed manually, is shown below.

The high-level function knockoff.filter automatically normalizes the columns of the input matrix (unless this behavior is explicitly disabled). However, the other functions in this package assume that the columns of the input matrix have unit Euclidean norm.
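Here is a minimal sketch of performing the three steps manually on the simulated data from the previous example, using the package functions named above; this roughly mirrors what knockoff.filter does internally, and the FDR level is an illustrative choice.

# Manual knockoff pipeline on the simulated (X, y) from the example above
library(knockoff)
Xk  <- create.second_order(X)              # 1. second-order Gaussian knockoff copies of X
W   <- stat.glmnet_coefdiff(X, Xk, y)      # 2. importance statistics W_j (real vs. knockoff)
thr <- knockoff.threshold(W, fdr = 0.10)   # 3. data-dependent selection threshold
selected <- which(W >= thr)
fdp(selected)                              # empirical false discovery proportion (fdp() defined earlier)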

1.3.3 PD Neuroimaging-genetics Case-Study

Let’s illustrate controlled variable selection via knockoff filtering using the real PD dataset.

The goal is to determine which imaging, genetics and phenotypic covariates are associated with the clinical diagnosis of PD. The dataset is available at the DSPA case-study archive site.

Preparing the data

The data set consists of clinical, genetics, and demographic measurements. To evaluate our results, we will compare diagnostic predictions created by the model for the UPDRS scores and the ResearchGroup factor variable.

Fetching and cleaning the data

First, we download the data and read it into data frames.

data1 <- read.table('https://umich.instructure.com/files/330397/download?download_frd=1', sep=",", header=T)    
# we will deal with missing values using multiple imputation later. For now, let's just ignore incomplete cases
data1.completeRowIndexes <- complete.cases(data1) # table(data1.completeRowIndexes)
prop.table(table(data1.completeRowIndexes))
## data1.completeRowIndexes
##     FALSE      TRUE 
## 0.3452381 0.6547619
# attach(data1)
# View(data1[data1.completeRowIndexes, ])
data2 <- data1[data1.completeRowIndexes, ]
Dx_label  <- data2$ResearchGroup; table(Dx_label)
## Dx_label
## Control      PD   SWEDD 
##     121     897     137

Preparing the design matrix

We now construct the design matrix \(X\) and the response vector \(Y\). The features (columns of \(X\)) represent covariates that will be used to explain the response \(Y\).

# Construct preliminary design matrix.
# define response and predictors
Y <- data1$UPDRS_part_I + data1$UPDRS_part_II + data1$UPDRS_part_III
table(Y)   # Show Clinically relevant classification
## Y
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
## 54 20 25 12  8  7 11 16 16  9 21 16 13 13 22 25 21 31 25 29 29 28 20 25 28 26 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
## 35 41 23 34 32 31 37 34 28 36 29 27 22 19 17 18 18 19 16  9 10 12  9 11  7 10 
## 52 53 54 55 56 57 58 59 60 61 62 63 64 66 68 69 71 75 80 81 82 
## 11  5  7  4  1  5  9  4  3  2  1  6  1  2  1  2  1  1  2  3  1
Y <- Y[data1.completeRowIndexes]

# X = scale(ncaaData[, -20])  # Explicit Scaling is not needed, as glmnet auto standardizes predictors
# X = as.matrix(data1[, c("R_caudate_Volume", "R_putamen_Volume", "Weight", "Age", "chr17_rs12185268_GT")])  # X needs to be a matrix, not a data frame
drop_features <- c("FID_IID", "ResearchGroup", "UPDRS_part_I", "UPDRS_part_II", "UPDRS_part_III")
X <- data1[ , !(names(data1) %in% drop_features)]
X = as.matrix(X)   # remove columns: index, ResearchGroup, and y=(PDRS_part_I + UPDRS_part_II + UPDRS_part_III)
X <- X[data1.completeRowIndexes, ]; dim(X)
## [1] 1155   26
summary(X)
##  L_insular_cortex_ComputeArea L_insular_cortex_Volume
##  Min.   :  50.03              Min.   :   22.63       
##  1st Qu.:2174.57              1st Qu.: 5867.23       
##  Median :2522.52              Median : 7362.90       
##  Mean   :2306.89              Mean   : 6710.18       
##  3rd Qu.:2752.17              3rd Qu.: 8483.80       
##  Max.   :3650.81              Max.   :13499.92       
##  R_insular_cortex_ComputeArea R_insular_cortex_Volume
##  Min.   :  40.92              Min.   :  11.84        
##  1st Qu.:1647.69              1st Qu.:3559.74        
##  Median :1931.21              Median :4465.12        
##  Mean   :1758.64              Mean   :4127.87        
##  3rd Qu.:2135.57              3rd Qu.:5319.13        
##  Max.   :2791.92              Max.   :8179.40        
##  L_cingulate_gyrus_ComputeArea L_cingulate_gyrus_Volume
##  Min.   : 127.8                Min.   :   57.33        
##  1st Qu.:2847.4                1st Qu.: 6587.07        
##  Median :3737.7                Median : 8965.03        
##  Mean   :3411.3                Mean   : 8265.03        
##  3rd Qu.:4253.7                3rd Qu.:10815.06        
##  Max.   :5944.2                Max.   :17153.19        
##  R_cingulate_gyrus_ComputeArea R_cingulate_gyrus_Volume L_caudate_ComputeArea
##  Min.   : 104.1                Min.   :   47.67         Min.   :   1.782     
##  1st Qu.:2829.4                1st Qu.: 6346.31         1st Qu.: 318.806     
##  Median :3719.4                Median : 9094.15         Median : 710.779     
##  Mean   :3368.4                Mean   : 8194.07         Mean   : 657.442     
##  3rd Qu.:4261.8                3rd Qu.:10832.53         3rd Qu.: 951.868     
##  Max.   :6593.7                Max.   :19761.77         Max.   :1453.506     
##  L_caudate_Volume    R_caudate_ComputeArea R_caudate_Volume  
##  Min.   :   0.1928   Min.   :   1.782      Min.   :   0.193  
##  1st Qu.: 264.0013   1st Qu.: 660.696      1st Qu.: 893.637  
##  Median : 998.2269   Median :1063.046      Median :1803.281  
##  Mean   : 992.2892   Mean   : 894.806      Mean   :1548.739  
##  3rd Qu.:1568.3643   3rd Qu.:1183.659      3rd Qu.:2152.509  
##  Max.   :2746.6208   Max.   :1684.563      Max.   :3579.373  
##  L_putamen_ComputeArea L_putamen_Volume   R_putamen_ComputeArea
##  Min.   :   6.76       Min.   :   1.228   Min.   :  13.93      
##  1st Qu.: 775.73       1st Qu.:1234.601   1st Qu.:1255.62      
##  Median :1029.17       Median :1911.089   Median :1490.05      
##  Mean   : 959.15       Mean   :1864.390   Mean   :1332.01      
##  3rd Qu.:1260.56       3rd Qu.:2623.722   3rd Qu.:1642.41      
##  Max.   :2129.67       Max.   :4712.661   Max.   :2251.41      
##  R_putamen_Volume        Sex            Weight            Age       
##  Min.   :   3.207   Min.   :1.000   Min.   : 43.20   Min.   :31.18  
##  1st Qu.:2474.041   1st Qu.:1.000   1st Qu.: 69.90   1st Qu.:53.87  
##  Median :3510.249   Median :1.000   Median : 80.90   Median :62.16  
##  Mean   :3083.007   Mean   :1.347   Mean   : 82.06   Mean   :61.25  
##  3rd Qu.:3994.733   3rd Qu.:2.000   3rd Qu.: 90.70   3rd Qu.:68.83  
##  Max.   :7096.580   Max.   :2.000   Max.   :135.00   Max.   :83.03  
##  chr12_rs34637584_GT chr17_rs11868035_GT chr17_rs11012_GT chr17_rs393152_GT
##  Min.   :0.00000     Min.   :0.0000      Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:0.00000     1st Qu.:0.0000      1st Qu.:0.0000   1st Qu.:0.0000   
##  Median :0.00000     Median :1.0000      Median :0.0000   Median :0.0000   
##  Mean   :0.01212     Mean   :0.6364      Mean   :0.3654   Mean   :0.4468   
##  3rd Qu.:0.00000     3rd Qu.:1.0000      3rd Qu.:1.0000   3rd Qu.:1.0000   
##  Max.   :1.00000     Max.   :2.0000      Max.   :2.0000   Max.   :2.0000   
##  chr17_rs12185268_GT chr17_rs199533_GT   time_visit   
##  Min.   :0.0000      Min.   :0.0000    Min.   : 0.00  
##  1st Qu.:0.0000      1st Qu.:0.0000    1st Qu.: 9.00  
##  Median :0.0000      Median :0.0000    Median :24.00  
##  Mean   :0.4268      Mean   :0.4052    Mean   :23.83  
##  3rd Qu.:1.0000      3rd Qu.:1.0000    3rd Qu.:36.00  
##  Max.   :2.0000      Max.   :2.0000    Max.   :54.00
mode(X) <- 'numeric'

Dx_label <- Dx_label[data1.completeRowIndexes]; length(Dx_label)
## [1] 1155

Preparing the response vector

The knockoff filter is designed to control the FDR under Gaussian noise. A quick inspection of the response vector shows that it is highly non-Gaussian.

h <- hist(Y, breaks='FD', plot = F)
plot_ly(x = h$mids, y = h$density, type = "bar") %>%
   layout(bargap=0.1, title="Histogram of Computed Variable Y = (UPDRS) part_I + part_II + part_III")

A log-transform may help to stabilize the clinical response measurements:

# hist(log(Y), breaks='FD')
h <- hist(log(Y), breaks='FD', plot = F)
plot_ly(x = h$mids, y = h$density, type = "bar") %>%
   layout(bargap=0.1, title="Histogram of log(Y)")

For binary or ordinal categorical outcomes, we can employ the logistic curve (and its inverse, the logit) to map between bounded outcome scores and real values.

The logistic curve is \(y=f(x)= \frac{1}{1+e^{-x}}\), where \(y\) and \(x\) represent the probability and the quantitative predictor values, respectively. A slightly more general form is \(y=f(x)= \frac{K}{1+e^{-x}}\), where the covariate \(x \in (-\infty, \infty)\) and the response \(y \in (0, K)\).
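A minimal numerical illustration of this transform and its inverse (the logit) follows; the function names below are just for demonstration.

logistic <- function(x, K = 1) K / (1 + exp(-x))   # maps the real line into (0, K)
logit    <- function(p, K = 1) log(p / (K - p))    # inverse transform back to the real line
logistic(c(-3, 0, 3))          # approximately 0.047, 0.500, 0.953
logit(logistic(c(-3, 0, 3)))   # recovers -3, 0, 3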

Running the knockoff filter

We now run the knockoff filter along with the Benjamini-Hochberg (BH) procedure for controlling the false discovery rate of feature selection. More details about knockoff filtering methods are available here.

Before running either selection procedure, we remove rows with missing values, drop rarely varying predictor columns (e.g., binary markers appearing fewer than three times in the sample), and remove any duplicated columns.

library(knockoff)

Y <- data1$UPDRS_part_I + data1$UPDRS_part_II + data1$UPDRS_part_III
table(Y)   # Show Clinically relevant classification
## Y
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
## 54 20 25 12  8  7 11 16 16  9 21 16 13 13 22 25 21 31 25 29 29 28 20 25 28 26 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
## 35 41 23 34 32 31 37 34 28 36 29 27 22 19 17 18 18 19 16  9 10 12  9 11  7 10 
## 52 53 54 55 56 57 58 59 60 61 62 63 64 66 68 69 71 75 80 81 82 
## 11  5  7  4  1  5  9  4  3  2  1  6  1  2  1  2  1  1  2  3  1
Y <- as.matrix(Y[data1.completeRowIndexes]); colnames(Y) <- "y"
mode(Y)
## [1] "numeric"
# X = scale(ncaaData[,-20])  # Explicit Scaling is not needed, as glmnet auto standardizes predictors
# X = as.matrix(data1[,c("R_caudate_Volume", "R_putamen_Volume","Weight", "Age", "chr17_rs12185268_GT")])  # X needs to be a matrix, not a data frame
drop_features <- c("FID_IID", "ResearchGroup", "UPDRS_part_I", "UPDRS_part_II", "UPDRS_part_III")
X <- data1[ , !(names(data1) %in% drop_features)]
X = as.matrix(X)   # remove columns: index, ResearchGroup, and y=(PDRS_part_I + UPDRS_part_II + UPDRS_part_III)
X <- X[data1.completeRowIndexes,]; dim(X); mode(X)
## [1] 1155   26
## [1] "numeric"
View(cbind(X,Y))

# Direct call to knockoff filtering looks like this:
fdr <- 0.4
set.seed(1234)
result = knockoff.filter(X, Y, fdr=fdr, knockoffs=create.second_order); print(result$selected) # Old: knockoffs='equicorrelated')
## L_cingulate_gyrus_ComputeArea         R_putamen_ComputeArea 
##                             5                            15 
##                        Weight                           Age 
##                            18                            19 
##           chr17_rs11868035_GT 
##                            21
# knockoff::create.fixed        Fixed-X knockoffs
#knockoff::create.gaussian      Model-X Gaussian knockoffs
#knockoff::create.second_order      Second-order Gaussian knockoffs
#knockoff::create.solve_asdp        Relaxed optimization for fixed-X and Gaussian knockoffs
#knockoff::create.solve_equi        Optimization for equi-correlated fixed-X and Gaussian knockoffs
#knockoff::create.solve_sdp     Optimization for fixed-X and Gaussian knockoffs
#knockoff::create_equicorrelated        Create equicorrelated fixed-X knockoffs.
#knockoff::create_sdp       Create SDP fixed-X knockoffs.
#knockoff::create.vectorize_matrix      Vectorize a matrix into the SCS format

names(result$selected)
## [1] "L_cingulate_gyrus_ComputeArea" "R_putamen_ComputeArea"        
## [3] "Weight"                        "Age"                          
## [5] "chr17_rs11868035_GT"
knockoff_selected <- names(result$selected)
  
# Run BH (Benjamini-Hochberg)
k = ncol(X)
lm.fit = lm(Y ~ X - 1) # no intercept
p.values = coef(summary(lm.fit))[,4]
cutoff = max(c(0, which(sort(p.values) <= fdr * (1:k) / k)))
BH_selected = names(which(p.values <= fdr * cutoff / k))

knockoff_selected; BH_selected
## [1] "L_cingulate_gyrus_ComputeArea" "R_putamen_ComputeArea"        
## [3] "Weight"                        "Age"                          
## [5] "chr17_rs11868035_GT"
##  [1] "XL_insular_cortex_ComputeArea"  "XL_insular_cortex_Volume"      
##  [3] "XL_cingulate_gyrus_ComputeArea" "XL_putamen_ComputeArea"        
##  [5] "XL_putamen_Volume"              "XR_putamen_ComputeArea"        
##  [7] "XSex"                           "XWeight"                       
##  [9] "XAge"                           "Xchr17_rs11868035_GT"          
## [11] "Xchr17_rs11012_GT"              "Xchr17_rs393152_GT"            
## [13] "Xchr17_rs12185268_GT"
list(Knockoff = knockoff_selected, BHq = BH_selected)
## $Knockoff
## [1] "L_cingulate_gyrus_ComputeArea" "R_putamen_ComputeArea"        
## [3] "Weight"                        "Age"                          
## [5] "chr17_rs11868035_GT"          
## 
## $BHq
##  [1] "XL_insular_cortex_ComputeArea"  "XL_insular_cortex_Volume"      
##  [3] "XL_cingulate_gyrus_ComputeArea" "XL_putamen_ComputeArea"        
##  [5] "XL_putamen_Volume"              "XR_putamen_ComputeArea"        
##  [7] "XSex"                           "XWeight"                       
##  [9] "XAge"                           "Xchr17_rs11868035_GT"          
## [11] "Xchr17_rs11012_GT"              "Xchr17_rs393152_GT"            
## [13] "Xchr17_rs12185268_GT"
# Alternatively, for more flexible Knockoff invocation
set.seed(1234)
knockoffs = function(X) create.gaussian(X, 0, Sigma=diag(dim(X)[2])) # identify var-covar matrix Sigma of rank equal to the number of features
stats = function(X, Xk, y) stat.glmnet_coefdiff(X, Xk, y, nfolds=10) # The Output X_k is an n-by-p matrix of knockoff features
result = knockoff.filter(X, Y, fdr=fdr, knockoffs=knockoffs, statistic=stats); print(result$selected)
## L_cingulate_gyrus_ComputeArea         R_putamen_ComputeArea 
##                             5                            15 
##                           Age           chr17_rs11868035_GT 
##                            19                            21 
##           chr17_rs12185268_GT 
##                            24
# Housekeeping: remove the "X" prefix that lm() added to the BH_selected feature names
BH_selected <- substring(BH_selected, 2)   # substring() is vectorized over the name vector

intersect(BH_selected,knockoff_selected)
## [1] "L_cingulate_gyrus_ComputeArea" "R_putamen_ComputeArea"        
## [3] "Weight"                        "Age"                          
## [5] "chr17_rs11868035_GT"

Several features are selected by both methods, suggesting that they may indeed be salient predictors.
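It is also informative to list the features flagged by the BH procedure but screened out by the knockoff filter, since these are the candidates whose discovery the knockoff FDR control does not support.

# Features selected by BH but not by the knockoff filter
setdiff(BH_selected, knockoff_selected)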

1.4 Practice Problems

We can practice variable selection with the SOCR_Data_AD_BiomedBigMetadata dataset on the SOCR website. This is a smaller dataset with 744 observations and 63 variables. Here we use DXCURREN (the current diagnosis) as the class variable.

Let’s import the dataset first.

library(rvest)
wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_AD_BiomedBigMetadata")
html_nodes(wiki_url, "#content")
## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body" role="main">\n\t\t\t<a id="top"></a>\n\ ...
alzh <- html_table(html_nodes(wiki_url, "table")[[1]])
summary(alzh)
##       SID            MMSCORE        FAQTOTAL            GDTOTAL     
##  Min.   :   2.0   Min.   :18.00   Length:744         Min.   :0.000  
##  1st Qu.: 355.5   1st Qu.:25.00   Class :character   1st Qu.:0.000  
##  Median : 697.5   Median :27.00   Mode  :character   Median :1.000  
##  Mean   : 707.5   Mean   :26.81                      Mean   :1.367  
##  3rd Qu.:1063.0   3rd Qu.:29.00                      3rd Qu.:2.000  
##  Max.   :1435.0   Max.   :30.00                      Max.   :6.000  
##    adascog              sobcdr         DXCURREN     DX_Conversion     
##  Length:744         Min.   :0.000   Min.   :1.000   Length:744        
##  Class :character   1st Qu.:0.000   1st Qu.:1.000   Class :character  
##  Mode  :character   Median :1.500   Median :2.000   Mode  :character  
##                     Mean   :1.785   Mean   :1.958                     
##                     3rd Qu.:2.625   3rd Qu.:2.000                     
##                     Max.   :9.000   Max.   :3.000                     
##     DXCONTYP      DX_Confidence          Gender         Married     
##  Min.   :-4.000   Length:744         Min.   :1.000   Min.   :1.000  
##  1st Qu.:-4.000   Class :character   1st Qu.:1.000   1st Qu.:1.000  
##  Median :-4.000   Mode  :character   Median :1.000   Median :1.000  
##  Mean   :-3.962                      Mean   :1.407   Mean   :1.083  
##  3rd Qu.:-4.000                      3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   : 3.000                      Max.   :2.000   Max.   :2.000  
##    Education          Age          Weight_Kg         VSBPSYS     
##  Min.   : 6.00   Min.   :55.00   Min.   : -1.00   Min.   : 90.0  
##  1st Qu.:14.00   1st Qu.:71.00   1st Qu.: 64.67   1st Qu.:122.0  
##  Median :16.00   Median :76.00   Median : 74.39   Median :135.0  
##  Mean   :15.64   Mean   :75.49   Mean   : 75.28   Mean   :135.5  
##  3rd Qu.:18.00   3rd Qu.:80.00   3rd Qu.: 84.48   3rd Qu.:146.0  
##  Max.   :20.00   Max.   :91.00   Max.   :137.44   Max.   :206.0  
##     VSBPDIA          VSPULSE           VSRESP          VSTEMP     
##  Min.   : 43.00   Min.   : 40.00   Min.   :-1.00   Min.   :-1.00  
##  1st Qu.: 68.00   1st Qu.: 58.00   1st Qu.:16.00   1st Qu.:36.10  
##  Median : 75.00   Median : 64.00   Median :16.00   Median :36.40  
##  Mean   : 74.56   Mean   : 65.17   Mean   :16.68   Mean   :36.35  
##  3rd Qu.: 82.00   3rd Qu.: 72.00   3rd Qu.:18.00   3rd Qu.:36.70  
##  Max.   :103.00   Max.   :100.00   Max.   :32.00   Max.   :37.70  
##  SymptomeSeverety   SymptomeChronicity    BC.USEA         BCVOMIT     
##  Length:744         Length:744         Min.   :1.000   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:1.000   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :1.000   Median :1.000  
##                                        Mean   :1.032   Mean   :1.016  
##                                        3rd Qu.:1.000   3rd Qu.:1.000  
##                                        Max.   :2.000   Max.   :2.000  
##     BCDIARRH        BCCONSTP        BCABDOMN        BCSWEATN    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.097   Mean   :1.106   Mean   :1.074   Mean   :1.056  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##     BCDIZZY         BCENERGY      BCDROWSY       BCVISION        BCHDACHE    
##  Min.   :1.000   Min.   :1.0   Min.   :1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.0   1st Qu.:1.00   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.0   Median :1.00   Median :1.000   Median :1.000  
##  Mean   :1.125   Mean   :1.2   Mean   :1.13   Mean   :1.059   Mean   :1.093  
##  3rd Qu.:1.000   3rd Qu.:1.0   3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.0   Max.   :2.00   Max.   :2.000   Max.   :2.000  
##     BCDRYMTH        BCBREATH        BCCOUGH         BCPALPIT    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.087   Mean   :1.078   Mean   :1.116   Mean   :1.031  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##     BCCHEST         BCURNDIS        BCURNFRQ        BCANKLE     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.017   Mean   :1.023   Mean   :1.218   Mean   :1.078  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##     BCMUSCLE         BCRASH         BCINSOMN        BCDPMOOD    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.364   Mean   :1.073   Mean   :1.112   Mean   :1.122  
##  3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##     BCCRYING        BCELMOOD        BCWANDER         BCFALL     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.035   Mean   :1.012   Mean   :1.004   Mean   :1.046  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##     BCOTHER        CTWHITE             CTRED             PROTEIN         
##  Min.   :1.000   Length:744         Length:744         Length:744        
##  1st Qu.:1.000   Class :character   Class :character   Class :character  
##  Median :1.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1.046                                                           
##  3rd Qu.:1.000                                                           
##  Max.   :2.000                                                           
##    GLUCOSE          ApoEGeneAllele1 ApoEGeneAllele2    CDMEMORY
##  Length:744         Min.   :2.000   Min.   :2.000   Min.   :0  
##  Class :character   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:0  
##  Mode  :character   Median :3.000   Median :3.000   Median :0  
##                     Mean   :3.023   Mean   :3.489   Mean   :0  
##                     3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:0  
##                     Max.   :4.000   Max.   :4.000   Max.   :0  
##     CDORIENT         CDJUDGE          CDCOMMUN          CDHOME      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.5000   Median :0.0000   Median :0.5000   Median :0.0000  
##  Mean   :0.5047   Mean   :0.3085   Mean   :0.3683   Mean   :0.2513  
##  3rd Qu.:1.0000   3rd Qu.:0.5000   3rd Qu.:0.5000   3rd Qu.:0.5000  
##  Max.   :2.0000   Max.   :2.0000   Max.   :2.0000   Max.   :2.0000  
##      CDCARE          CDGLOBAL     
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000  
##  Mean   :0.2849   Mean   :0.0672  
##  3rd Qu.:0.5000   3rd Qu.:0.0000  
##  Max.   :2.0000   Max.   :2.0000

The data summary shows that several variables were imported as character type. After converting them to numeric, we find some missing data. We can manage this issue either by keeping only the complete observations of the original dataset or by using multivariate imputation, see Chapter 2 (a brief imputation sketch follows the code below).

chrtofactor <- c(3, 5, 8, 10, 21:22, 51:54)   # indices of the character-type columns
alzh[alzh=="."] <- NA                         # replace all missing "." values with "NA"
alzh[chrtofactor] <- data.frame(apply(alzh[chrtofactor], 2, as.numeric))  # coerce to numeric
alzh <- alzh[complete.cases(alzh), ]          # keep only the complete observations
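If we preferred to retain all 744 records instead of filtering, a minimal multivariate-imputation sketch using the mice package (assumed to be installed; see Chapter 2) would replace the complete.cases() step above; the object names imp and alzh_imputed are illustrative.

# Sketch: multiple imputation instead of complete-case filtering (run before complete.cases())
# library(mice)
# imp <- mice(alzh, m=5, seed=1234, printFlag=FALSE)  # impute the missing values
# alzh_imputed <- complete(imp, 1)                    # extract one completed dataset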

For simplicity, here we remove the incomplete observations and are left with 408 complete cases. Now, we can apply the Boruta method for feature selection.

set.seed(123)
# make sure the Boruta package is loaded, e.g., library(Boruta)
train <- Boruta(DXCURREN ~ . - SID, data=alzh, doTrace=0)
print(train)
## Boruta performed 99 iterations in 1.697111 secs.
##  12 attributes confirmed important: adascog, BCBREATH, CDCARE,
## CDCOMMUN, CDGLOBAL and 7 more;
##  45 attributes confirmed unimportant: BC.USEA, BCABDOMN, BCANKLE,
## BCCHEST, BCCONSTP and 40 more;
##  4 tentative attributes left: Age, ApoEGeneAllele1, ApoEGeneAllele2,
## GLUCOSE;

Your results may differ slightly, since Boruta relies on stochastic random forests. As before, we can plot the variable importance graph.
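A minimal plotting sketch (default Boruta color coding: confirmed attributes in green, tentative in yellow, rejected in red, shadow attributes in blue):

# Variable importance box plots for the fitted Boruta object
plot(train, xlab="", las=2, cex.axis=0.6, main="Boruta Variable Importance (Alzheimer's data)")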

The final step is to resolve the remaining tentative features, forcing a final important/unimportant decision for each of them.
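A sketch of this step using Boruta's helpers: TentativeRoughFix() forces a final decision for each tentative attribute (by comparing its median importance to the median importance of the best shadow attribute), and getSelectedAttributes() extracts the names of the confirmed features; running it produces output similar to the one below.

final.train <- TentativeRoughFix(train)   # resolve the remaining tentative attributes
print(final.train)
getSelectedAttributes(final.train)        # names of the confirmed features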

## Boruta performed 99 iterations in 1.697111 secs.
## Tentatives roughfixed over the last 99 iterations.
##  13 attributes confirmed important: adascog, ApoEGeneAllele2, BCBREATH,
## CDCARE, CDCOMMUN and 8 more;
##  48 attributes confirmed unimportant: Age, ApoEGeneAllele1, BC.USEA,
## BCABDOMN, BCANKLE and 43 more;
##  [1] "MMSCORE"         "FAQTOTAL"        "adascog"         "sobcdr"         
##  [5] "DX_Confidence"   "BCBREATH"        "ApoEGeneAllele2" "CDORIENT"       
##  [9] "CDJUDGE"         "CDCOMMUN"        "CDHOME"          "CDCARE"         
## [13] "CDGLOBAL"

These techniques can be applied to many other datasets available in the Case-Studies archive.
