First review the DSPA prerequisites.
Upon successful completion of this course, students are expected to have moderate competency in at least two of the three competencies within each of the areas listed below:
| Areas | Competency | Expectation | Notes |
|---|---|---|---|
| Algorithms and Applications | Tools | Working knowledge of basic software tools (command-line, GUI-based, or web-services) | Familiarity with statistical programming languages, e.g., R or SciKit/Python, and database querying languages, e.g., SQL or NoSQL |
| | Algorithms | Knowledge of core principles of scientific computing, applications programming, APIs, algorithm complexity, and data structures | Best practices for scientific and application programming, efficient implementation of matrix linear algebra and graphics, elementary notions of computational complexity, user-friendly interfaces, string matching |
| | Application Domain | Data analysis experience from at least one application area, gained through coursework, an internship, a research project, etc. | Applied domain examples include computational social sciences, health sciences, business and marketing, learning sciences, transportation sciences, and engineering and physical sciences |
| Data Management | Data validation & visualization | Curation, exploratory data analysis (EDA), and visualization (see the EDA sketch below the table) | Data provenance, validation, visualization via histograms, Q-Q plots, scatterplots (plotly, ggplot, Dashboard, D3.js) |
| | Data wrangling | Skills for data normalization, data cleaning, data aggregation, and data harmonization/registration (see the date-harmonization sketch below the table) | Data imperfections include missing values, inconsistent string formatting (e.g., '2016-01-01' vs. '01/01/2016'), platform-specific (PC/Mac/Linux) time representations vs. timestamps, and structured vs. unstructured data |
| | Data infrastructure | Handling databases, web-services, Hadoop, multi-source data | Data structures, SOAP protocols, ontologies, XML, JSON, streaming |
| Analysis Methods | Statistical inference | Basic understanding of bias and variance, principles of (non)parametric statistical inference, and (linear) modeling (see the inference sketch below the table) | Biological variability vs. technological noise, parametric (likelihood) vs. non-parametric (rank-order statistics) procedures, point vs. interval estimation, hypothesis testing, regression |
| | Study design and diagnostics | Design of experiments, power calculations and sample sizing, strength of evidence, p-values, False Discovery Rates | Multistage testing, variance-normalizing transforms, histogram equalization, goodness-of-fit tests, model overfitting, model reduction |
| | Machine Learning | Dimensionality reduction, k-nearest neighbors, random forests, AdaBoost, kernelization, SVM, ensemble methods, CNN (see the k-NN sketch below the table) | Empirical risk minimization; supervised, semi-supervised, and unsupervised learning; transfer learning, active learning, reinforcement learning, multiview learning, instance learning |
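The *Data validation & visualization* row mentions histograms and Q-Q plots. As a minimal EDA sketch in R, assuming only the `ggplot2` package and a simulated numeric variable (the `measurements` data frame is a placeholder, not course data):

```r
# Minimal EDA sketch: histogram and Q-Q plot for one numeric variable
library(ggplot2)

set.seed(1234)
measurements <- data.frame(value = rnorm(500, mean = 100, sd = 15))

# Histogram to inspect the empirical distribution
ggplot(measurements, aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Histogram of simulated measurements", x = "Value", y = "Count")

# Q-Q plot to compare the sample against a theoretical normal distribution
ggplot(measurements, aes(sample = value)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal Q-Q plot of simulated measurements")
```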
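The *Data wrangling* row flags inconsistent date strings such as '2016-01-01' vs. '01/01/2016'. One hedged way to harmonize them, using base R only (the `raw_dates` vector and the two candidate formats are illustrative assumptions):

```r
# Harmonizing mixed date formats into a single Date type (base R only)
raw_dates <- c("2016-01-01", "01/01/2016", "2016-03-15", "12/31/2015")

# Try the ISO format first, then fall back to a US-style month/day/year format
parse_mixed_dates <- function(x) {
  iso <- as.Date(x, format = "%Y-%m-%d")
  us  <- as.Date(x, format = "%m/%d/%Y")
  ifelse(is.na(iso), us, iso)   # returns days since 1970-01-01 as numeric
}

clean_dates <- as.Date(parse_mixed_dates(raw_dates), origin = "1970-01-01")
print(clean_dates)   # all values now share the ISO "YYYY-MM-DD" representation
```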
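The *Statistical inference* row contrasts parametric (likelihood-based) and non-parametric (rank-order) procedures. A minimal sketch comparing a two-sample t-test with its rank-based counterpart, the Wilcoxon rank-sum test, on simulated groups (the group sizes and effect size are arbitrary):

```r
# Parametric vs. non-parametric two-sample comparison on simulated data
set.seed(42)
group_a <- rnorm(40, mean = 5.0, sd = 1)   # baseline-like group
group_b <- rnorm(40, mean = 5.6, sd = 1)   # shifted group

# Parametric: two-sample t-test (assumes approximate normality)
t_res <- t.test(group_a, group_b)

# Non-parametric: Wilcoxon rank-sum test (uses rank-order statistics only)
w_res <- wilcox.test(group_a, group_b)

cat("t-test p-value:       ", signif(t_res$p.value, 3), "\n")
cat("Wilcoxon test p-value:", signif(w_res$p.value, 3), "\n")
```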
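The *Machine Learning* row lists k-nearest neighbors among the expected methods. A small sketch using the built-in `iris` data and the `class` package (the split size and `k = 5` are arbitrary choices, not course requirements):

```r
# k-nearest neighbors sketch on the built-in iris data
library(class)

set.seed(2024)
idx   <- sample(seq_len(nrow(iris)), size = 100)   # simple train/test split
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
train_labels <- iris$Species[idx]
test_labels  <- iris$Species[-idx]

# Classify each test flower by majority vote among its 5 nearest neighbors
pred <- knn(train = train, test = test, cl = train_labels, k = 5)

# Confusion table and overall accuracy
print(table(predicted = pred, actual = test_labels))
cat("Accuracy:", mean(pred == test_labels), "\n")
```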