SOCR ≫ DSPA ≫ Topics ≫

1 How to navigate the CBDA site

This is a dynamic webpage that allows interactive navigation via ToC (top-left) and tabs (horizontal and vertical hide/expand segments). By clicking on each blue hyperlinks in the TOC, the reader can navigate between different sections. By clicking on the tabs the reader can navigate within the many parts of each section. The video below illustrates the main features of the CBDA protocol.


2 Motivation

The theoretical foundations of Big Data Science are not fully developed, yet. This research proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. We are also working on elucidating the mathematical framework that will eventually enable the study of the ergodic properties and the asymptotics of the specific statistical inference approaches utilized within the CBDA technique.


2.1 Background

This document illustrates the first release of the CBDA protocol as described in the published manuscript “Controlled Feature Selection and Compressive Big Data Analytics: Applications to Big Biomedical and Health Studies”.

The proposed Compressive Big Data analytics (CBDA) provides a general foundation for effective representation, efficient processing, and model-free inference for complex heterogeneous data archives. Specifically, CBDA allows us to eliminate noise, forecast trends, compute probabilities, estimate likelihoods, and classify large, incomplete, and heterogeneous data from multiple sources. We developed a novel method and a protocol that iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted on the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor where established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. In a nutshell, repeating the subsampling and prediction steps many times, yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the algorithm reliability and accuracy of findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable foundation for addressing some of the challenges associated with handling complex, incongruent, incomplete and multi-source data and analytics challenges.

2.2 Technology

The CBDA protocol has been developed in the R environment. Since a large number of smaller training sets are needed for the convergence of the protocol, we created a workflow that runs on the LONI pipeline environment, a free platform for high performance computing that allows the simultaneous submission of hundreds of independent instances/jobs of the CBDA protocol. The methods, software and protocols developed here are openly shared on our GitHub repository. All software, workflows, and datasets are publicly accessible.

To gain some familiarity with version control and LONI, this is a short introduction
Version Control and LONI.

This is a simple and clear Introduction to the SuperLearner, the ensemble meta-algorithm used in the CBDA protocol SuperLearner_Intro.

3 CBDA R Package

We published the first release of the CBDA package in April 2018. Below are the stats for the last week.

##         date count package
## 1 2018-12-27     8    CBDA
## 2 2018-12-28     9    CBDA
## 3 2018-12-29     7    CBDA
## 4 2018-12-30     5    CBDA
## 5 2018-12-31     5    CBDA
## 6 2019-01-01     4    CBDA
## 7 2019-01-02    14    CBDA

3.1 CBDA R Package Installation

The version 1.0.0 of the CBDA package can be downloaded and installed with the following command:

install.packages("CBDA",repos = 'https://cran.r-project.org/')

The documentation and vignettes can be found on CRAN. The source and binary files for updates can be found on our Github repository.

The necessary packages to run the CBDA algortihm are installed automatically at installation. However, they can also be installed/attached by launching the CBDA_initialization() function (see example in the R chunk below). If the parameter install is set to TRUE (by default it’s set to FALSE), then the CBDA_initialization() function installs (if needed) and attaches all the necessary packages to run the CBDA package v1.0.0. This function can be run before any production run or test. The list of packages can pe personalized to comprise extra packages needed for an expanded SL.library or for other needs by the user. The output shows a table (see Figure below) where for each package a TRUE or FALSE is displayed. Thus the necessary steps can be pursued in case some package has a FALSE.

N.B.: to eliminate a warning in Windows due to the “doMC” package not available (it was intended for Mac), install the “doMC” with the following command “install.packages(”doMC“, repos=”http://R-Forge.R-project.org“)”

ipaktable

ipaktable

3.2 Main functions of the CBDA package

The CBDA package comprises several functions. The main function is CBDA() that has all the input specifications to run a set M of subsamples from the Big Data [Xtemp, Ytemp]. We assume that the Big Data is already clean and harmonized.

After the necessary data wrangling (i.e., imputation, normalization and rebalancing), an ensemble predictor (i.e., SuperLearner) is applied to each subsample for training/learning.

The list of algorithms (or wrappers) available from the SuperLearner can be displayed by typing listWrappers(). By default, the CBDA package operates with the following algorithms (set with their default values, as described in the SuperLearner package):

  • SL.glm: wrapper for generalized linear models via glm()

  • SL.xgboost: wrapper that supports the Extreme Gradient Boosting package for SuperLearnering, which is a variant of gradient boosted machines (GBM).

  • SL.glmnet: wrapper for penalized regression using elastic net.

  • SL.svm: wrapper for Support Vector Machine

  • SL.randomForest: wrapper that supports RandomForest

  • SL.bartMachine: wrapper that supports bayesian additive regression trees via the bartMachine package

To add more wrappers/algorithms, see the example in the main CBDA() function help.

The ensemble predictive model is then validated on a fraction alpha of the Big Data. Each subsample generates a predictive model that is ranked based on performance metrics (e.g., Mean Square Error-MSE and Accuracy) during the first validation step.

After all the M subsamples have been generated and each predictive model computed, the CBDA function CBDA() calls 4 more functions to perform, respectively:

  • CONSOLIDATION (i.e., CBDA_Consolidation()) and ranking of the results where the top predictive models are selected (top) and the more frequent features (BEST) are ranked and displayed as well

  • VALIDATION (i.e., CBDA_Validation()) on the top ranked features (i.e., up to “max_covs” number of features) where nested ensemble predictive models are generated in a bottom-up fashion

  • Implementation of STOPPING CRITERIA (i.e., CBDA_Stopping_Criteria()) for the best/optimal ensemble predictive model (to avoid overfitting)

  • CLEAN UP (i.e., CBDA_CleanUp()) step for deleting unnecessary workspaces generated by the CBDA protocol.

At the end of successfull CBDA() call, 2 workspaces are generated:

  • CBDA_input_specs_label_light.RData where most of the results of the M subsamples are saved

  • CBDA_input_specs_label_light_VALIDATION.RData where most of the results of the top-ranked subsamples are saved

Throughout the execution of the CBDA protocol, a workspace with the main CBDA specifications is created and loaded whenever necessary (e.g., “CBDA_label_info.RData”). Also a CBDA object is created (e.g., CBDA_object <- CBDA()) with the following data:

  • LearningTable: information on the top features selected and the correspondent predictive models’ performances during the learning/training step

  • ConfusionMatrices: list of confusion matrices for each of the top nested predictive models in ValidationTable

  • SuperLearnerLibrary: list of all the algorithms/wrappers used in the CBDA

  • SuperLearnerCoefficients: matrix with all the coefficients generated by the SuperLearner for each subsample and for each algorithms/wrappers used in the CBDA

  • ValidationTable: information on the top features selected and the correspondent nested predictive models’ performances

3.3 CBDA input specifications

The array of input specifications comprises the following labels (in the square brackets is the default value):

  • Ytemp This is the output variable (vector) in the original Big Data

  • Xtemp This is the input variable (matrix) in the original Big Data

  • label This is the label appended to RData workspaces generated within the CBDA calls [= “CBDA_package_test”]

  • alpha Percentage of the Big Data to hold off for Validation [= 0.2]

  • Kcol_min Lower bound for the percentage of features-columns sampling (used for the Feature Sampling Range - FSR) [= 5]

  • Kcol_max Upper bound for the percentage of features-columns sampling (used for the Feature Sampling Range - FSR) [= 15]

  • Nrow_min Lower bound for the percentage of cases-rows sampling (used for the Case Sampling Range - CSR) [= 30]

  • Nrow_max Upper bound for the percentage of cases-rows sampling (used for the Case Sampling Range - CSR) [= 50]

  • misValperc Percentage of missing values to introduce in BigData (used just for testing, to mimic real cases). [= 0]

  • M Number of the BigData subsets on which perform Knockoff Filtering and SuperLearner feature mining [= 3000]

  • N_cores Number of Cores to use in the parallel implementation (default is set to 1 core) [= 1, not multicore enabled]

  • top Top predictions to select out of the M [= 1000]

  • BEST Best predictions to display out of the top selected [= 100]

  • workspace_directory Directory where the results and workspaces are saved [= getwd()]

  • max_covs Top features to include in the Validation Step where nested models are tested [= 100]

  • min_covs Minimum number of top features to include in the initial model for the Validation Step. It must be > 2 [= 5]

Sampling ranges for cases (CSR - Cases Sampling Range) and features (FSR - Feature Sampling Range) are then defined as follow:

  • FSR = [min_FSR, max_FSR]

  • CSR = [min_CSR, max_CSR]

3.4 Datasets

The CBDA protocol has been applied to 3 different datasets in the manuscript: 2 synthetic ones (Null and Binomial) and a real one (ADNI dataset). The synthetic datasets all have the same number of cases (i.e., 300), but they differ in the number of features. The binomial datasets all have only 10 “true” predictive features, while the null datasets have none.

The Null datasets are organized as follow:

The Binomial datasets are organized as follow:

  • Binomial dataset: # of features = 100, with true features (10,20,30,40,50,60,70,80,90,100)

  • Binomial dataset 3: # of features = 900, with true features (1,100,200,300,400,500,600,700,800,900)

  • Binomial dataset 5: # of features = 1500, with true features (1,100,200,400,600,800,1000,1200,1400,1500).

3.5 LONI pipeline implementation

As mentioned earlier, a large number of smaller training sets are needed for the convergence of the protocol. Thus the CBDA R package has dedicated functions for a workflow implementation that runs on the LONI pipeline environment, a free platform for high performance computing that allows the simultaneous submission of hundreds of simultaneous independent instances/jobs of the CBDA protocol. The main and secondary functions are duplicated and modified to run on the LONI pipeline. The main difference is that they have a job_id parameter that is used to identify the specific job on the server. The functions are CBDA.pipeline(), CBDA_Consolidation.pipeline(), CBDA_Validation.pipeline() and CBDA_Stopping_Criteria.pipeline(). The complete pipeline workflow implementation can be downloaded in this LONI.zip file.

3.6 Memory and storage limits to run CBDA

See Memory limits under different OS for various limitations on memory needs while running R. As far as CBDA is concerned, a CBDA object can be up to 200-300 Mb. The default limits are acceptable to run the CBDA algorithm. To know or change the memory allocation, the memory.limit() function reports or increases the limit in force on the total allocation.

memory.limit(50000)  # to allocate 50Gb of memory

However, the space needed to save all the workspaces may need to be as large as 1-5 Gb, depending on the number of subsamples. We are working on an new CBDA implementation that reduces the storage constraints.

4 CBDA Demo

4.1. Binomila Dataset Example

The following code replicates an experiment with the Binomial dataset

# Initialization
# This function call installs (if needed) and attaches all the necessary packages to run the CBDA package
# v1.0.0. It should be run before any production run or test. The output shows a table where for each
# package a TRUE or FALSE is displayed. Thus the necessary steps can be pursued in case some package has a FALSE.
CBDA_initialization()
                    
# Set the specs for the synthetic dataset to be tested
n = 300          # number of observations
p = 900          # number of variables

# Generate a nxp matrix of IID variables (e.g., ~N(0,1))
X1 = matrix(rnorm(n*p), nrow=n, ncol=p)

# Setting the nonzero ("true") variables/signals
nonzero=c(1,100,200,300,400,500,600,700,800,900)

# Set the signal amplitude (for noise level = 1)
amplitude = 10

# Allocate the nonzero coefficients in the correct places
beta = amplitude * (1:p %in% nonzero)

# Generate a linear model with a bias (e.g., white  noise ~N(0,1))
ztemp <- function() X1 %*% beta + rnorm(n)
z = ztemp()

# Pass it through an inv-logit function to
# generate the Bernoulli response variable Ytemp
pr = 1/(1+exp(-z))
Ytemp = rbinom(n,1,pr)
X2 <- cbind(Ytemp,X1)

dataset_file ="Binomial_dataset_3.txt"

# Save the synthetic dataset
write.table(X2,dataset_file,sep=",")

# Load the Synthetic dataset
Data = read.csv(dataset_file,header = TRUE)
Ytemp <- Data[,1] # set the outcome
original_names_Data <- names(Data)
cols_to_eliminate=1
Xtemp <- Data[-cols_to_eliminate] # set the matrix X of features/covariates
original_names_Xtemp <- names(Xtemp)

workspace_directory <- getwd()

# Add more wrappers/algorithms to the SuperLearner ensemble predictor
# It can be commented out if only the default set of algorithms are used,
# e.g., algorithm_list = c("SL.glm","SL.xgboost","SL.glmnet","SL.svm",
#                          "SL.randomForest","SL.bartMachine")
# This defines a "new" wrapper, based on the default SL.glmnet
SL.glmnet.0.75 <- function(..., alpha = 0.75,family="binomial"){
                  SL.glmnet(..., alpha = alpha, family = family)}

algorithm_list <- c("SL.glm","SL.xgboost","SL.glmnet.0.75",
                   "SL.glmnet","SL.svm")

# Call the Main CBDA function
# Multicore functionality NOT enabled
CBDA_object <- CBDA(Ytemp , Xtemp , M = 16 , Nrow_min = 50, Nrow_max = 70,
                     N_cores = 1 , top = 50, max_covs = 20 ,
                     min_covs = 5 , algorithm_list = algorithms_list)

As soon as it is launched, the progress of the CBDA algorithm is shown on the screen (see below).

After each step of the CBDA algorithm, completion messages and some output is displayed on the screen. For example, after the “CONSOLIDATION” phase is done, the screen shows:

Or after the Validation and STopping Criteria steps:

If the multicore is enabled, no screen messages are displayed due to the nature of the job allocations.

# Multicore functionality enabled ( i.e., N_cores > 1).
 CBDA_test <- CBDA(Ytemp , Xtemp , M = 100 , Nrow_min = 50, Nrow_max = 70,
                 N_cores = 4 , top = 50, max_covs = 20 ,
                   min_covs = 5 , SL.library = algorithms_list)

4.2 CBDA complete set of results for the synthetic datasets (Null and Binomial)

Due to the large number of experiments and the many different specification, the complete set of results for the 3 binomial datasets are shown at the following links:

The complete set of results for the histograms in Figure 3 are shown at the following links:

The complete set of results shown in the next figure (i.e., Figure 2 of the manuscript under review) are here.

The figure below shows a set of heatmaps and histograms to summarize all the experiments in the manuscript under review

In the heatmaps, the x axis represents the 16 combinations between the choice of the subsets of M (i.e., 1,000, 3,000, 6,000 and 9,000) and the choice for top-ranked predictions (i.e., 100, 200, 500 and 1,000, as described in the last 2 columns of Table 2 in the Methods section of the manuscript under review). Namely, the combinations are ordered as follows: Combination 1 = (1,000,100), Combination 2 = (1,000,200), Combination 3 = (1,000,500), Combination 4 = (1,000,1,000), Combination 5 = (3,000,100), Combination 6 = (3,000,200), Combination 7 = (3,000,500), Combination 8 = (3,000,1,000), Combination 9 = (6,000,100), Combination 10 = (6,000,200),Combination 11 = (6,000,500), Combination 12 = (6,000,1,000), Combination 13 = (9,000,100), Combination 14 = (9,000,200), Combination 15 = (9,000,500), Combination 16 = (9,000,1,000). The y axis represents the CBDA experiment specs, where Experiments 1-6 have no missing values (i.e., missValperc = 0%), and Experiments 7-12 have 20% missing values (i.e., missValperc = 20%). Both sets of experiments have the FSR and CSR ranges combined in ascending order, namely Exp1and Exp 7 = (FSR,CSR) =(1-5%,30-60%), Exp2 and Exp 8= (FSR,CSR) =(5-15%,30-60%), Exp3 and Exp 9 = (FSR,CSR) =(15-30%,30-60%), Exp4 and Exp 10 = (FSR,CSR) =(1-5%,60-80%), Exp5 and Exp 11= (FSR,CSR) =(5-15%,60-80%), Exp6 and Exp 12 = (FSR,CSR) =(15-30%,60-80%). See Table 2 for details. Panels A, C and E show the CBDA results using the Accuracy performance metric. Panels B, D and F show the CBDA results using the Mean Square Error-MSE performance metric (see Methods for details on the performance metrics). Panels A and B, C and D, E and F show the results for the 3 Binomial datasets tested, respectively.

The heatmaps show in light blue color CBDA experiments with high true positive rates, indicating a high frequency of correct feature identification (i.e., with more than 7 true features identified out of a total of 10) and in dark blue color CBDA experiments with low true positive rates, depicting a low frequency of correct feature identification (i.e., with less than 7 true features identified out of a total of 10). Our goal is to ultimately suggest the optimal input specifications for the CBDA protocol (i.e., decrease the computational time by reducing the number of samples, and increase the rate of discovery of “true” features).

The histograms illustrates a comparison between the Null and Binomial datasets experiments. The goal in this set of histograms is to validate the CBDA protocol when no signal is present in the data. For the best accuracy in the comparison, we used all the 9,000 CBDA samples, ranking the top 1,000 predictions (this is equivalent to the combination 16 in the heatmaps figure). Each histogram combines the results of all 12 experiments (using the MSE metric).

The function CBDA_spectrum_plots() generates the histograms on the figure (after loading the VALIDATION workspace created at the end of the CBDA calculations). Specifically, it generates 4 plots/histograms: count and density histograms for the top features as selected by both performance metrics (e.g., MSE and Accuracy).

In the heatmaps, the x axis represents the 16 combinations between the choice of the subsets of M (i.e., 1,000, 3,000, 6,000 and 9,000) and the choice for top-ranked predictions (i.e., 100, 200, 500 and 1,000, as described in the last 2 columns of Table 2 in the Methods section of the manuscript under review). Namely, the combinations are ordered as follows: Combination 1 = (1,000,100), Combination 2 = (1,000,200), Combination 3 = (1,000,500), Combination 4 = (1,000,1,000), Combination 5 = (3,000,100), Combination 6 = (3,000,200), Combination 7 = (3,000,500), Combination 8 = (3,000,1,000), Combination 9 = (6,000,100), Combination 10 = (6,000,200),Combination 11 = (6,000,500), Combination 12 = (6,000,1,000), Combination 13 = (9,000,100), Combination 14 = (9,000,200), Combination 15 = (9,000,500), Combination 16 = (9,000,1,000). The y axis represents the CBDA experiment specs, where Experiments 1-6 have no missing values (i.e., missValperc = 0%), and Experiments 7-12 have 20% missing values (i.e., missValperc = 20%). Both sets of experiments have the FSR and CSR ranges combined in ascending order, namely Exp1and Exp 7 = (FSR,CSR) =(1-5%,30-60%), Exp2 and Exp 8= (FSR,CSR) =(5-15%,30-60%), Exp3 and Exp 9 = (FSR,CSR) =(15-30%,30-60%), Exp4 and Exp 10 = (FSR,CSR) =(1-5%,60-80%), Exp5 and Exp 11= (FSR,CSR) =(5-15%,60-80%), Exp6 and Exp 12 = (FSR,CSR) =(15-30%,60-80%). See Table 2 for details. Panels A, C and E show the CBDA results using the Accuracy performance metric. Panels B, D and F show the CBDA results using the Mean Square Error-MSE performance metric (see Methods for details on the performance metrics). Panels A and B, C and D, E and F show the results for the 3 Binomial datasets tested, respectively.

The heatmaps show in light blue color CBDA experiments with high true positive rates, indicating a high frequency of correct feature identification (i.e., with more than 7 true features identified out of a total of 10) and in dark blue color CBDA experiments with low true positive rates, depicting a low frequency of correct feature identification (i.e., with less than 7 true features identified out of a total of 10). Our goal is to ultimately suggest the optimal input specifications for the CBDA protocol (i.e., decrease the computational time by reducing the number of samples, and increase the rate of discovery of “true” features).

The histograms illustrates a comparison between the Null and Binomial datasets experiments. The goal in this set of histograms is to validate the CBDA protocol when no signal is present in the data. For the best accuracy in the comparison, we used all the 9,000 CBDA samples, ranking the top 1,000 predictions (this is equivalent to the combination 16 in the heatmaps figure). Each histogram combines the results of all 12 experiments (using the MSE metric).

The complete workflow (implemented without the R CBDA package) is available for the Binomial and ADNI datasets.

5 Acknowledgments

We thank all the co-authors of the CBDA manuscript as well as everyone that contributed and is still contributing to the CBDA project in many ways.

More details can be found in the manuscript Simeone Marino, Jiachen Xu, Yi Zhao, Nina Zhou, Yiwang Zhou and Ivo D. Dinov. (2018) “Controlled Feature Selection and Compressive Big Data Analytics: Applications to Big Biomedical and Health Studies”.

SOCR Resource Visitor number SOCR Email