Outline of the typical, systematic SOCR Consulting protocol for
statistical, mathematical, computational, data-analytic, and AI
consulting: from a dataset to a specific, actionable statistical
analysis plan (SAP).
Classical Statistical Consulting Protocol: Dataset to Analysis Plan
This first approach describes the classical decision chart guiding
the process of moving from an initial dataset and research question to a
well-defined statistical analysis plan. Each step involves critical
evaluation and decision-making to ensure the chosen statistical methods
are appropriate and robust.
1. Initial Consultation & Problem Definition
- Objective: Understand the research question and the
client’s goals.
- Key Questions:
- What are the primary research question(s)?
- What are the desired outcomes or decisions to be made?
- What is the context of the study?
- Who is the target audience for the results?
2. Data Understanding & Exploration
- Objective: Become familiar with the dataset, its
structure, and potential issues.
- Key Actions:
- Data Acquisition: Obtain the dataset.
- Data Dictionary Review: Understand variable
definitions, data types, and units.
- Descriptive Statistics: Calculate means, medians,
standard deviations, frequencies, etc.
- Data Visualization: Create histograms, scatter
plots, box plots, etc., to identify patterns, outliers, and
distributions.
- Initial Data Quality Assessment: Check for missing
values, inconsistencies, and potential errors.
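The descriptive-statistics and data-quality actions above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed implementation: the helper `describe`, the column `systolic_bp`, and the convention that `None` marks a missing value are all assumptions made for the example.

```python
import statistics


def describe(values):
    """Summarize one numeric column; None is assumed to mark a missing value."""
    observed = [v for v in values if v is not None]
    return {
        "n": len(values),
        "n_missing": len(values) - len(observed),
        "mean": statistics.mean(observed),
        "median": statistics.median(observed),
        "sd": statistics.stdev(observed),  # sample SD (n - 1 denominator)
        "min": min(observed),
        "max": max(observed),
    }


# Hypothetical column with one missing entry and one suspicious value (240)
systolic_bp = [118, 124, None, 131, 120, 240, 127]
summary = describe(systolic_bp)
```

A large gap between the mean and the median in `summary`, as here, is one quick signal that an outlier or a skewed distribution deserves a closer look in Step 3.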
3. Data Preprocessing & Cleaning
- Objective: Prepare the data for analysis by
addressing identified issues.
- Key Actions:
- Handle Missing Data: Imputation, deletion, or other
appropriate methods.
- Outlier Treatment: Investigate and address outliers
(e.g., removal, transformation, Winsorizing).
- Data Transformation: Apply transformations (e.g.,
log, square root) if distributions are skewed or to meet assumptions of
statistical tests.
- Feature Engineering: Create new variables if
necessary.
- Data Merging/Reshaping: Combine or restructure data
as needed.
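As a rough sketch of two of the cleaning actions above, the following shows count-based Winsorizing and a log transformation. The function name `winsorize`, the parameter `k`, and the sample values are assumptions for illustration; quantile-based Winsorizing is an equally common variant.

```python
import math


def winsorize(values, k=1):
    """Replace the k smallest and k largest values with their nearest retained neighbors."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= k < len(values) / 2")
    ordered = sorted(values)
    lo, hi = ordered[k], ordered[-k - 1]
    # Clamp every value into [lo, hi]; the original order is preserved
    return [min(max(v, lo), hi) for v in values]


# Hypothetical right-skewed measurements with one extreme value
raw = [2.1, 2.4, 1.9, 2.2, 9.8, 2.0, 2.3]
winsorized = winsorize(raw, k=1)

# Log transformation (valid only for strictly positive values)
logged = [math.log(v) for v in raw]
```

Winsorizing keeps the sample size intact, which is often preferable to deletion when the extreme value is plausible but unduly influential.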
4. Variable Identification & Measurement Scale
- Objective: Categorize variables based on their
nature and measurement scale.
- Key Actions:
- Independent Variables (Predictors): Identify
variables that might influence the outcome.
- Dependent Variables (Outcomes): Identify the
primary variable(s) of interest.
- Covariates/Confounders: Identify variables that
might influence both independent and dependent variables.
- Determine Measurement Scale:
- Categorical: Nominal (no order), Ordinal
(ordered).
- Continuous: Interval (equal intervals, no true
zero), Ratio (equal intervals, true zero).
5. Hypothesis Formulation
- Objective: Translate research questions into
testable statistical hypotheses.
- Key Actions:
- Null Hypothesis (\(H_0\)): A statement of no effect
or no difference.
- Alternative Hypothesis (\(H_1\)): A statement of an effect
or difference (can be one-sided or two-sided).
6. Statistical Assumptions Check
- Objective: Verify if the data meets the assumptions
required for proposed statistical methods.
- Key Actions:
- Normality: Shapiro-Wilk test, Q-Q plots.
- Homogeneity of Variance (Homoscedasticity):
Levene’s test, Bartlett’s test.
- Independence of Observations: Assessed through
study design, not solely statistical tests.
- Linearity: Scatter plots, residual plots.
- Multicollinearity: Variance Inflation Factor
(VIF).
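To make the multicollinearity check concrete, here is a small sketch of the Variance Inflation Factor for the special case of exactly two predictors, where \(VIF = 1 / (1 - r^2)\) and \(r\) is their Pearson correlation. The helper names and the data are hypothetical; with more than two predictors, each VIF requires regressing one predictor on all the others.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


# Hypothetical predictors with correlation 0.8
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
vif = vif_two_predictors(x1, x2)  # about 2.78
```

Common rules of thumb flag VIF values above roughly 5 or 10, but the threshold is a judgment call that the SAP should state in advance.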
7. Selection of Appropriate Statistical Analysis
- Objective: Choose the statistical method(s) that
best address the research question and are suitable for the data type
and assumptions.
- Decision Points (Examples):
- Comparing means of two groups?
- Assumptions met?
- Yes: Independent samples t-test.
- No: Mann-Whitney U test.
- Comparing means of three or more groups?
- Assumptions met?
- Yes: One-way ANOVA.
- No: Kruskal-Wallis test.
- Examining relationship between two continuous
variables?
- Assumptions met?
- Yes: Pearson correlation.
- No: Spearman rank correlation.
- Predicting a continuous outcome from one or more
predictors?
- Number of predictors?
- One: Simple linear regression.
- Two or more: Multiple linear regression.
- Predicting a categorical outcome?
- Binary outcome? Logistic regression.
- Multinomial outcome? Multinomial logistic
regression.
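The decision points above can be encoded as a simple lookup, sketched below. The function `suggest_method`, its argument names, and the goal labels are hypothetical conveniences; the returned method names are the ones listed in this step, and a real consultation would weigh more factors than these few flags.

```python
def suggest_method(goal, assumptions_met=True, n_groups=2,
                   n_predictors=1, outcome_levels=2):
    """Return the method named in the classical decision points (illustrative only)."""
    if goal == "compare_means":
        if n_groups < 2:
            raise ValueError("Comparing means requires at least two groups")
        if n_groups == 2:
            return "Independent samples t-test" if assumptions_met else "Mann-Whitney U test"
        return "One-way ANOVA" if assumptions_met else "Kruskal-Wallis test"
    if goal == "association":
        return "Pearson correlation" if assumptions_met else "Spearman rank correlation"
    if goal == "predict_continuous":
        return "Simple linear regression" if n_predictors == 1 else "Multiple linear regression"
    if goal == "predict_categorical":
        return "Logistic regression" if outcome_levels == 2 else "Multinomial logistic regression"
    raise ValueError(f"Unknown goal: {goal!r}")


# Example: three groups, normality or equal-variance assumption violated
choice = suggest_method("compare_means", assumptions_met=False, n_groups=3)
```

Writing the chart down this way also makes the chosen branch easy to record verbatim in the SAP.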
8. Statistical Analysis Plan (SAP) Development
- Objective: Document the entire process, including
methods, assumptions, and expected outputs.
- Key Components:
- Introduction (Research question, objectives)
- Data Description (Source, variables, measurement scales)
- Data Cleaning & Preprocessing steps
- Hypotheses
- Statistical Methods (Detailed description of chosen analyses)
- Assumption checks and how they will be handled
- Significance level (e.g., \(\alpha = 0.05\))
- Software to be used
- Expected outputs and reporting guidelines
9. Review & Refinement
- Objective: Ensure the SAP is clear, comprehensive,
and aligned with client needs.
- Key Actions:
- Internal review by consultant.
- Client review and feedback.
- Iterative refinement of the SAP.
10. Execution & Reporting
- Objective: Conduct the analysis and present the
findings.
- Key Actions:
- Perform statistical analyses as per SAP.
- Interpret results.
- Generate reports, visualizations, and summaries.
Mermaid Decision Chart: Classical Statistical Consulting Protocol
Enhanced SAP Protocol
This more advanced SAP version expands the classical
protocol above to support modern data science, machine learning,
and artificial intelligence analytics. Specifically, it adds the
following SAP elements:
- Ethics, Privacy, & Governance: Any modern
protocol must address IRB, HIPAA/GDPR, PII, and data security
before exploration begins.
- Power Analysis & Sample Size Evaluation:
Promoted from an afterthought to a dedicated step. There is no point in
developing an SAP if the study is fundamentally underpowered.
- Iterative Workflow Emphasis: The original
classical SAP implies a linear 10-step process. In reality,
data exploration often forces a return to problem definition. The
revised version explicitly notes iterative loops.
- Expanded AI/ML & Modern Methods: The
original Step 7 (now Step 8) lists only basic tests (t-test, ANOVA,
linear regression). It has been expanded to include mixed-effects models,
survival analysis, causal inference, and machine learning
(cross-validation, classification metrics).
- Missing Data Mechanisms: Step 2 now includes a preliminary
assessment of MCAR, MAR, and MNAR, and Step 3 ties the imputation
strategy to the identified mechanism.
- Multiple Testing & Reproducibility: Added to
the SAP development step. When multiple comparisons are planned, an alpha
adjustment is required. Additionally, reproducibility (Git, Docker,
RMarkdown/Quarto) must be planned in advance.
- Deployment & Handoff: The original Step 10 (now Step 11) has been
expanded beyond just “reporting” to include code handoff, environment
documentation, and stakeholder training.
Note on Workflow: This protocol is inherently
iterative, not strictly linear. Findings in later steps
(e.g., discovering severe missing data mechanisms during exploration)
may require revisiting earlier steps (e.g., reframing the research
question or redesigning the study).
Advanced Statistical Consulting Protocol
This advanced decision chart guides the process of moving from an
initial dataset and research question to a well-defined statistical
analysis plan. Each step involves critical evaluation and
decision-making to ensure the chosen statistical methods are
appropriate, robust, and reproducible.
1. Initial Consultation, Problem Definition & Ethical Review
- Objective: Understand the research question, the
client’s goals, and the ethical/legal constraints of the data.
- Key Questions:
- What are the primary research question(s)?
- What are the desired outcomes or decisions to be made?
- What is the context of the study and who is the target audience for
the results?
- Ethics & Privacy: Does the data contain PII?
Are there IRB restrictions? Must we comply with HIPAA, GDPR, or other
data governance frameworks?
- Deliverables & Timeline: What is the expected
format of the final output (report, dashboard, codebase,
manuscript), and when is it due?
2. Data Understanding & Exploration
- Objective: Become familiar with the dataset, its
structure, and potential quality issues.
- Key Actions:
- Data Acquisition & Security: Securely obtain
the dataset and ensure compliance with Step 1 agreements.
- Data Dictionary Review: Understand variable
definitions, data types, units, and provenance.
- Descriptive Statistics: Calculate means, medians,
standard deviations, frequencies, etc.
- Data Visualization: Create histograms, scatter
plots, box plots, and correlation matrices to identify patterns,
outliers, and distributions.
- Initial Data Quality Assessment: Check for missing
values, inconsistencies, and potential errors.
- Missing Data Mechanism: Preliminary assessment of
whether data is Missing Completely at Random (MCAR), Missing at Random
(MAR), or Missing Not at Random (MNAR).
3. Data Preprocessing & Cleaning
- Objective: Prepare the data for analysis by
addressing identified issues while preventing data leakage.
- Key Actions:
- Handle Missing Data: Imputation (Mean/Median, MICE,
KNN), deletion (listwise/pairwise), or indicator variable methods,
justified by the missing data mechanism.
- Outlier Treatment: Investigate and address outliers
(e.g., removal, transformation, Winsorizing, or keeping them for robust
modeling).
- Data Transformation: Apply transformations (e.g.,
log, Box-Cox) if distributions are skewed or to meet parametric
assumptions.
- Feature Engineering: Create new variables from
existing data (e.g., interaction terms, polynomial features,
time-since-event).
- Data Merging/Reshaping: Combine or restructure data
as needed.
- Data Splitting (For AI/ML): If predictive modeling
is used, partition data into training, validation, and testing sets
before any imputation or feature scaling, and estimate imputation and
scaling parameters on the training set only, to prevent data
leakage.
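The leakage point in the data-splitting action can be illustrated with a short sketch: split first, learn the imputation value from the training portion only, then apply that same value to both portions. The helper names, the seed, and the `ages` column are assumptions for the example, and for brevity it uses a train/test split rather than the full train/validation/test partition.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.25, seed=2024):
    """Shuffle a copy of the rows with a fixed seed, then split into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_mean_imputer(train_values):
    """Learn the fill value from the training data only (None marks missing)."""
    return statistics.mean(v for v in train_values if v is not None)


def apply_imputer(values, fill_value):
    """Replace missing entries with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical column with two missing entries
ages = [34, None, 51, 47, None, 29, 62, 40]
train, test = train_test_split(ages)
fill = fit_mean_imputer(train)              # the test set is never consulted here
train_imputed = apply_imputer(train, fill)
test_imputed = apply_imputer(test, fill)    # reuse the training-derived value
```

Computing `fill` from the full column instead would let information from the test rows shape the training data, which tends to make held-out performance look better than it is.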
4. Variable Identification & Measurement Scale
- Objective: Categorize variables based on their
nature, role, and measurement scale.
- Key Actions:
- Dependent Variables (Outcomes/Targets): Identify
the primary variable(s) of interest.
- Independent Variables (Predictors/Features):
Identify variables that might influence the outcome.
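- Covariates/Confounders/Mediators: Identify
variables that influence both the independent and dependent variables
(confounders), lie on the causal path between them (mediators), or
otherwise explain variation in the outcome (covariates); the distinction
is crucial for causal inference.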
- Determine Measurement Scale:
- Categorical: Nominal (no order), Ordinal
(ordered).
- Continuous: Interval (equal intervals, no true
zero), Ratio (equal intervals, true zero).
- Time-to-Event: Censored durations (e.g., survival
data).
5. Power Analysis & Sample Size Evaluation
- Objective: Determine if the study has sufficient
statistical power to detect a meaningful effect, or calculate the
minimum detectable effect size given the current sample.
- Key Actions:
- Estimate required sample size based on chosen alpha (\(\alpha\)), desired power (\(1 - \beta\)), and expected effect
size.
- If the dataset is fixed, compute the minimum detectable effect (MDE)
to manage stakeholder expectations regarding statistical
significance.
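Both actions above can be sketched for the simplest case: a two-sided comparison of two independent group means with equal group sizes, using the normal approximation \(n = 2\,(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2\) per group, where \(d\) is the standardized effect size (Cohen's d). The function names are hypothetical, and an exact t-based calculation typically gives a slightly larger \(n\).

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


def minimum_detectable_effect(n, alpha=0.05, power=0.80):
    """Smallest standardized effect detectable with n subjects per group."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * math.sqrt(2 / n)


required_n = n_per_group(0.5)          # 63 per group for a medium effect
mde = minimum_detectable_effect(63)    # just under 0.5
```

When the dataset is fixed, reporting `mde` up front tells the client which effects the study can realistically detect before any outcome data are examined.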
6. Hypothesis Formulation & Causal Framework
- Objective: Translate research questions into
testable statistical hypotheses and outline causal pathways.
- Key Actions:
- Null Hypothesis (\(H_0\)): A statement of no effect
or no difference.
- Alternative Hypothesis (\(H_1\)): A statement of an effect
or difference (can be one-sided or two-sided).
- Causal Diagrams (DAGs): If conducting causal
inference, draw Directed Acyclic Graphs to justify the
inclusion/exclusion of covariates and identify colliders.
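As a deliberately simplified sketch of how a DAG informs covariate selection, the code below classifies each variable by its direct edges to the exposure and the outcome. The function `classify_covariates`, the role labels, and the example graph are assumptions for illustration; it ignores indirect paths, so it is no substitute for a full back-door analysis.

```python
def classify_covariates(edges, exposure, outcome):
    """Label each other node by its direct edges to the exposure and outcome."""
    children = {}
    for cause, effect in edges:
        children.setdefault(cause, set()).add(effect)
    nodes = {n for edge in edges for n in edge} - {exposure, outcome}

    roles = {}
    for node in sorted(nodes):
        effects = children.get(node, set())
        caused_by_exposure = node in children.get(exposure, set())
        caused_by_outcome = node in children.get(outcome, set())
        if exposure in effects and outcome in effects:
            roles[node] = "confounder"   # common cause: adjust for it
        elif caused_by_exposure and caused_by_outcome:
            roles[node] = "collider"     # common effect: do not adjust
        elif caused_by_exposure and outcome in effects:
            roles[node] = "mediator"     # on the causal path
        else:
            roles[node] = "other"
    return roles


# Hypothetical DAG given as (cause, effect) pairs
edges = [
    ("age", "smoking"), ("age", "cancer"),
    ("smoking", "cancer"),
    ("smoking", "tar"), ("tar", "cancer"),
    ("smoking", "hospitalized"), ("cancer", "hospitalized"),
]
roles = classify_covariates(edges, exposure="smoking", outcome="cancer")
```

In this toy graph, adjusting for `age` removes confounding, whereas conditioning on `hospitalized` would open a spurious path between exposure and outcome.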
7. Statistical Assumptions Check
- Objective: Verify if the data meets the assumptions
required for proposed statistical methods.
- Key Actions:
- Normality: Shapiro-Wilk test, Q-Q plots.
- Homogeneity of Variance (Homoscedasticity):
Levene’s test, Bartlett’s test, residual plots.
- Independence of Observations: Assessed through
study design (check for clustering/repeated measures).
- Linearity: Scatter plots, component-plus-residual
plots.
- Multicollinearity: Variance Inflation Factor (VIF),
correlation matrices.
- Proportionality of Hazards (if survival analysis):
Schoenfeld residuals.
8. Selection of Appropriate Statistical & AI/ML Methods
- Objective: Choose the method(s) that best address
the research question, suit the data type, and respect the
assumptions.
- Decision Points (Examples):
- Comparing means of two independent groups?
- Assumptions met? Yes: Independent t-test. No: Mann-Whitney U.
- Comparing means of three or more groups?
- Assumptions met? Yes: One-way ANOVA. No: Kruskal-Wallis.
- Repeated Measures / Clustered Data?
- Mixed-effects models, Generalized Estimating Equations (GEE).
- Time-to-Event Analysis?
- Kaplan-Meier curves, Cox Proportional Hazards, Accelerated Failure
Time.
- Predicting a continuous outcome?
- Linear regression, Ridge/Lasso regression, Random Forests, Gradient
Boosting.
- Predicting a categorical outcome?
- Logistic regression (Binary/Multinomial/Ordinal), Support Vector
Machines, Neural Networks.
- Causal Effect Estimation?
- Propensity Score Matching, Inverse Probability Weighting,
Difference-in-Differences.
9. Statistical Analysis Plan (SAP) Development
- Objective: Document the entire process rigorously
before looking at the final results to prevent p-hacking.
- Key Components:
- Introduction (Research question, objectives)
- Data Description (Source, variables, measurement scales, missing
data mechanisms)
- Data Cleaning & Preprocessing steps (Imputation rules, outlier
thresholds)
- Hypotheses & Causal framework
- Statistical Methods (Detailed description of chosen analyses,
including ML evaluation metrics like AUC, RMSE, F1-score if
applicable)
- Assumption checks and how violations will be handled
- Significance level (e.g., \(\alpha = 0.05\)) and Multiple Testing
Corrections (e.g., Bonferroni, False Discovery Rate)
- Reproducibility Plan (Software versions, random seeds, Git
repository, Docker environment, Quarto/RMarkdown templates)
- Expected outputs and reporting guidelines (e.g., CONSORT, STROBE,
TRIPOD)
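The two multiple-testing corrections named in the components above can be sketched as adjusted p-values. This is a minimal illustration with hypothetical function names and made-up p-values; the False Discovery Rate procedure shown is the Benjamini-Hochberg step-up method, one common choice among several.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from five planned comparisons
raw_p = [0.001, 0.008, 0.027, 0.035, 0.20]
alpha = 0.05
n_bonferroni = sum(p < alpha for p in bonferroni(raw_p))    # 2 rejections
n_bh = sum(p < alpha for p in benjamini_hochberg(raw_p))    # 4 rejections
```

The gap between the two counts shows why the SAP should fix the correction method in advance: Bonferroni controls the family-wise error rate and is more conservative, while the FDR approach tolerates a controlled share of false positives.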
10. Review & Refinement
- Objective: Ensure the SAP is clear, comprehensive,
and aligned with client needs before execution.
- Key Actions:
- Internal review by consulting team.
- Client review and feedback.
- Iterative refinement of the SAP (all changes must be
documented).
11. Execution, Reporting & Handoff
- Objective: Conduct the analysis, present findings,
and ensure the client can maintain the work.
- Key Actions:
- Perform analyses strictly as per SAP (document any necessary
deviations).
- Interpret results in the context of the research question
(distinguish between statistical significance and practical/clinical
significance).
- Generate reports, interactive visualizations, and summaries.
- Handoff: Deliver clean, well-commented code and
environment specification files (e.g., requirements.txt, renv.lock),
and give the client a brief walkthrough/training on how to run the
code and interpret the outputs.
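One lightweight way to support the handoff is to ship a small run manifest alongside the code, recording the interpreter version, platform, and random seed used. The sketch below is an assumption about what such a manifest might contain; the function `build_run_manifest`, its field names, and the seed value are hypothetical, and it complements rather than replaces environment files such as requirements.txt or renv.lock.

```python
import json
import platform
import random


def build_run_manifest(seed):
    """Collect basic environment details and a seed check value for the handoff."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "random_seed": seed,
        # First draw from the seeded generator; a rerun should reproduce it
        "seed_check_value": rng.random(),
    }


manifest = build_run_manifest(seed=20240101)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

Saving `manifest_json` next to the results gives the client a quick way to confirm that a rerun used the same seed and a comparable environment.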
Mermaid Decision Chart: Advanced Statistical Consulting Protocol