Center for Complexity and Self-Management of Chronic Disease
(CSCD): Core 2: Methods and Analytics Progress (2017-2018)
The Methods and Analytics core made significant advances in 2017-2018. This progress is summarized below:
I. Predictive big data analytics study of Amyotrophic Lateral Sclerosis (ALS)
We developed a new non-parametric method for estimating the prognosis of ALS patients
using survival analysis (PMC5749893).
Our survival ranking technique transforms patients' survival data into a linear space of
hazard ranks and enables the subsequent machine learning prediction of the neurodegenerative
progression. This technique received the top ranking in the DREAM Amyotrophic Lateral
Sclerosis (ALS) Stratification Challenge. As an application, we identified salient features
that are important in ALS diagnosis and prognosis.
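The idea of mapping right-censored survival data onto a scale of hazard ranks can be illustrated with a minimal Python sketch. This is not the published method (PMC5749893). It assumes a simple pairwise-comparison score as a stand-in, and the function names `hazard_scores` and `to_unit_ranks` are hypothetical.

```python
def hazard_scores(times, events):
    """Score each patient by pairwise comparison under right censoring.

    times  : follow-up time for each patient
    events : True if death was observed, False if the patient was censored
    Illustrative stand-in only; a higher score suggests a higher hazard.
    """
    n = len(times)
    scores = []
    for i in range(n):
        earlier = 0  # pairs where patient i is known to die before patient j
        later = 0    # pairs where patient j is known to die before patient i
        for j in range(n):
            if i == j:
                continue
            if events[i] and times[i] < times[j]:
                earlier += 1
            elif events[j] and times[j] < times[i]:
                later += 1
        scores.append(earlier - later)
    return scores


def to_unit_ranks(scores):
    """Convert scores to ranks scaled to [0, 1]; tied scores share a mean rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda k: scores[k])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank / (n - 1) if n > 1 else 0.0
        pos = end + 1
    return ranks
```

The resulting ranks form an ordinary numeric target, so under this assumption any standard regression model could be trained to predict them from patient features.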
II. Genomic Data Analysis
We introduced a new theoretical model for analyzing genetic sequence data (PMC5361063).
We compared our approach to other techniques for quantifying sequence distances and variability.
Most alignment-free methods rely on counting words, which are small contiguous fragments of the genome.
Our approach considers the locations of nucleotides in the sequences and relies more on appropriate
statistical distributions. We reported results of extracting information and comparing matching fidelity
and location regularization information to classify mutation sequences.
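For contrast, the word-counting baseline used by most alignment-free methods can be sketched in Python as follows. This shows only the baseline, not our location-based model. The function names, the choice of k = 3, and the Euclidean distance between word-frequency vectors are illustrative assumptions.

```python
from collections import Counter
import math


def kmer_counts(seq, k):
    """Count every contiguous word of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Euclidean distance between the k-mer frequency vectors of two sequences."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 or nb == 0:
        raise ValueError("both sequences must be at least k characters long")
    words = set(ca) | set(cb)
    # Counter returns 0 for words that are absent from one of the sequences.
    return math.sqrt(sum((ca[w] / na - cb[w] / nb) ** 2 for w in words))
```

Because only word frequencies enter the distance, two sequences containing the same words in different positions are indistinguishable here, which is the limitation a location-aware model is meant to address.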
III. Visualization of Extremely High-dimensional Data
The CSCD analytics team developed a new demonstration of modeling, simplifying and visualizing
extremely high-dimensional data.
Many datasets have millions of observations and attributes/features.
Datasets with many dimensions/features are subject to what is colloquially known as the
curse of dimensionality. For instance, medical images generate thousands of features and
are difficult to integrate with clinical and phenotypic information.
We used a novel manifold statistical technique, t-distributed stochastic neighbor
embedding (t-SNE), to reduce 3,000-dimensional data for 10,000 volunteers into a 3D space.
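A reduction of this kind can be sketched in Python with scikit-learn's `TSNE` class. This is a minimal illustration and not the pipeline used for the demonstration. The helper name `embed_3d`, the perplexity value, the PCA initialization, and the fixed random seed are all assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_3d(X, perplexity=30.0, seed=0):
    """Project an (n_samples, n_features) array into 3 dimensions with t-SNE.

    Perplexity must be smaller than the number of samples.
    """
    X = np.asarray(X, dtype=float)
    tsne = TSNE(
        n_components=3,        # target a 3D space, as in the text
        perplexity=perplexity,
        init="pca",            # assumed initialization for more stable layouts
        random_state=seed,     # fixed seed so repeated runs agree
    )
    return tsne.fit_transform(X)
```

Calling `embed_3d(X)` on an array with one row per volunteer and one column per feature returns an array with three columns that can be passed to any 3D scatter-plot tool.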
IV. Curricular Developments
- A new data science course was approved and offered as a summer MOOC.
It builds technical skills and provides
a tool chest of resources to manage and interrogate heterogeneous datasets.
- An electronic textbook (EBook) on
Scientific Methods for Health Sciences was developed and
shared widely with the community. This multilingual EBook is used by over 50,000 users worldwide.