SOCR ≫ DSPA ≫ DSPA2 Topics ≫

We start the DSPA journey with an overview of the mission and objectives of this textbook. Some early examples of driving motivational problems and challenges provide context into the common characteristics of big (biomedical and health) data. We will define data science and predictive analytics and emphasize the importance of their ethical, responsible, and reproducible practical use. This chapter also covers the foundations of R, contrasts R against other languages and computational data science platforms and introduces basic functions and data objects, formats, and simulation.

1 Motivation

Let’s start with a quick overview illustrating some common data science challenges, qualitative descriptions of the fundamental principles, and awareness about the power and potential pitfalls of modern data-driven scientific inquiry.

1.1 DSPA Mission and Objectives

The second edition of this textbook (DSPA2)is based on the HS650: Data Science and Predictive Analytics (DSPA) course I teach at the University of Michigan and the first DSPA edition. These materials collectively aim to provide learners with a deep understanding of the challenges, appreciation of the enormous opportunities, and a solid methodological foundation for designing, collecting, managing, processing, interrogating, analyzing and interpreting complex health and biomedical data. Readers that finish this course of training and successfully complete the examples and assignments included in the book will gain unique skills and acquire a tool-chest of methods, software tools, and protocols that can be applied to a broad spectrum of Big Data problems.

  • Vision: Enable active-learning by integrating driving motivational challenges with mathematical foundations, computational statistics, and modern scientific inference
  • Values: Effective, reliable, reproducible, and transformative data-driven discovery supporting open-science
  • Strategic priorities: Trainees will develop scientific intuition, computational skills, and data-wrangling abilities to tackle Big biomedical and health data problems. Instructors will provide well-documented R-scripts and software recipes implementing atomic data-filters as well as complex end-to-end predictive big data analytics solutions.

Before diving into the mathematical algorithms, statistical computing methods, software tools, and health analytics covered in the remaining chapters, we will discuss several driving motivational problems. These will ground all the subsequent scientific discussions, data modeling, and computational approaches.

1.2 Examples of driving motivational problems and challenges

For each of the studies below, we illustrate several clinically-relevant scientific questions, identify appropriate data sources, describe the types of data elements, and pinpoint various complexity challenges.

1.2.1 Alzheimer’s Disease

  • Identify the relation between observed clinical phenotypes and expected behavior;
  • Prognosticate future cognitive decline (3-12 months prospectively) as a function of imaging data and clinical assessment (both model-based and model-free machine learning prediction methods will be used);
  • Derive and interpret the classifications of subjects into clusters using the harmonized and aggregated data from multiple sources.
Data Source Sample Size/Data Type Summary
ADNI Archive Clinical data: demographics, clinical assessments, cognitive assessments; Imaging data: sMRI, fMRI, DTI, PiB/FDG PET; Genetics data: Ilumina SNP genotyping; Chemical biomarker: lab tests, proteomics. Each data modality comes with a different number of cohorts. Generally, \(200\le N \le 1200\). For instance, previously conducted ADNI studies with N>500 [ doi: 10.3233/JAD-150335, doi: 10.1111/jon.12252, doi: 10.3389/fninf.2014.00041] ADNI provides interesting data modalities, multiple cohorts (e.g., early-onset, mild, and severe dementia, controls) that allow effective model training and validation NACC Archive

1.2.2 Parkinson’s Disease

  • Predict the clinical diagnosis of patients using all available data (with and without the UPDRS clinical assessment, which is the basis of the clinical diagnosis by a physician);
  • Compute derived neuroimaging and genetics biomarkers that can be used to model the disease progression and provide automated clinical decisions support;
  • Generate decision trees for numeric and categorical responses (representing clinically relevant outcome variables) that can be used to suggest an appropriate course of treatment for specific clinical phenotypes.
Data Source Sample Size/Data Type Summary
PPMI Archive Demographics: age, medical history, sex; Clinical data: physical, verbal learning and language, neurological and olfactory (University of Pennsylvania Smell Identification Test, UPSIT) tests), vital signs, MDS-UPDRS scores (Movement Disorder; Society-Unified Parkinson’s Disease Rating Scale), ADL (activities of daily living), Montreal Cognitive Assessment (MoCA), Geriatric Depression Scale (GDS-15); Imaging data: structural MRI; Genetics data: llumina ImmunoChip (196,524 variants) and NeuroX (covering 240,000 exonic variants) with 100% sample success rate, and 98.7% genotype success rate genotyped for APOE e2/e3/e4. Three cohorts of subjects; Group 1 = {de novo PD Subjects with a diagnosis of PD for two years or less who are not taking PD medications}, N1 = 263; Group 2 = {PD Subjects with Scans without Evidence of a Dopaminergic Deficit (SWEDD)}, N2 = 40; Group 3 = {Control Subjects without PD who are 30 years or older and who do not have a first degree blood relative with PD}, N3 = 127 The longitudinal PPMI dataset including clinical, biological and imaging data (screening, baseline, 12, 24, and 48 month follow-ups) may be used conduct model-based predictions as well as model-free classification and forecasting analyses

1.2.3 Drug and substance use

  • Is the Risk for Alcohol Withdrawal Syndrome (RAWS) screen a valid and reliable tool for predicting alcohol withdrawal in an adult medical inpatient population?
  • What is the optimal cut-off score from the AUDIT-C to predict alcohol withdrawal based on RAWS screening?
  • Should any items be deleted from, or added to, the RAWS screening tool to enhance its performance in predicting the emergence of alcohol withdrawal syndrome in an adult medical inpatient population?
Data Source Sample Size/Data Type Summary
MAWS Data / UMHS EHR / WHO AWS Data Scores from Alcohol Use Disorders Identification Test-Consumption (AUDIT-C) [49], including dichotomous variables for any current alcohol use (AUDIT-C, question 1), total AUDIT-C score > 8, and any positive history of alcohol withdrawal syndrome (HAWS) ~1,000 positive cases per year among 10,000 adult medical inpatients, % RAWS screens completed, % positive screens, % entered into MAWS protocol who receive pharmacological treatment for AWS, % entered into MAWS protocol without a completed RAWS screen

1.2.4 Amyotrophic lateral sclerosis

  • Identify the most highly-significant variables that have power to jointly predict the progression of ALS (in terms of clinical outcomes like ALSFRS and muscle function)
  • Provide a decision tree prediction of adverse events based on subject phenotype and 0-3 month clinical assessment changes
Data Source Sample Size/Data Type Summary
ProAct Archive Over 100 clinical variables are recorded for all subjects including: Demographics: age, race, medical history, sex; Clinical data: Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS), adverse events, onset_delta, onset_site, drugs use (riluzole) The PRO-ACT training dataset contains clinical and lab test information of 8,635 patients. Information of 2,424 study subjects with valid gold standard ALSFRS slopes will be used in out processing, modeling and analysis The time points for all longitudinally varying data elements will be aggregated into signature vectors. This will facilitate the modeling and prediction of ALSFRS slope changes over the first three months (baseline to month 3)

1.2.5 Normal Brain Visualization

The SOCR Brain Visualization App has preloaded sMRI, ROI labels, and fiber track models for a normal brain. It also allows users to drag-and-drop their data into the browser to visualize and navigate through the stereotactic data (including imaging, parcellations and tractography).