
This SOCR DSPA2 Appendix contains a comprehensive and realistic Rmd notebook of a biomedical case-study. Using synthetic (non-human) data, this Appendix demonstrates testing, validating, training, fine-tuning, exploring and confirming the utility of a wide range of model-based statistical methods as well as model-free ML/AI techniques.

1 Introduction and Motivation

1.1 Key Scientific Question

Chronic Respiratory Distress Syndrome (CRDS) is a hypothetical progressive lung disease characterized by declining forced vital capacity (FVC), reduced exercise tolerance, and diminished quality of life.

This notebook simulates a 12-month, two-arm, parallel-group clinical trial comparing a novel therapeutic agent, RespiClear, with Standard Care in 500 patients.

The dataset is deliberately engineered, i.e., “loaded”, so that every major family of statistical methods taught in an intermediate biostatistics course can find a signal to detect. Below we describe how the data were generated, to give insight into what each statistical, analytic, or ML/AI technique is designed to recover and expected to report.

1.2 Road-Map of Methods

# Method Signal Encoded in the Simulation
1 MLR & Logistic Regression Linear relationship of QoL with age and baseline FVC; binary treatment success driven by treatment arm and smoking
2 Generalised Linear Model (GLM) Poisson-distributed hospital-visit counts whose log-rate depends on baseline lung capacity
3 ANCOVA Treatment-group difference in 12-month FVC after adjusting for baseline FVC covariate
4 MANOVA Joint treatment effect on the bivariate outcome (QoL, 6-min walk distance)
5 MANCOVA Same joint effect after partialling out age
6 Repeated-measures ANOVA Treatment × time interaction across three longitudinal FVC measurements
7a Text Mining Physician notes seeded with positive vs. negative adjectives linked to clinical outcome
7b Sensitivity Analysis Robustness of treatment effect under different success thresholds and missing-data mechanisms
8A Supervised – Random Forest + Gradient-Boosted Trees Predict treatment success and rank variable importance
8B Unsupervised – t-SNE Embedding + k-Means Clustering Discover latent patient sub-groups without outcome labels

2 Environment Setup

We start by loading all required R packages. If a package is not installed, the code attempts to install it automatically.

# Helper: install-if-missing
# Note: the ez package may need manual installation from the CRAN archive:
#   https://cran.r-project.org/src/contrib/Archive/ez/

req_pkgs <- c("tidyverse", "broom", "car", "ez", "emmeans",
              "tidytext", "wordcloud", "RColorBrewer", "mice",
              "patchwork", "kableExtra", "ggridges", "GGally")

for (pkg in req_pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg, repos = "https://cran.r-project.org")
}

library(tidyverse)
library(broom)
library(car)
library(ez)  # Note: may need manual installation from the CRAN archive: https://cran.r-project.org/src/contrib/Archive/ez/
library(emmeans)
library(tidytext)
library(wordcloud)
library(RColorBrewer)
library(mice)
library(patchwork)
library(kableExtra)
library(ggridges)
library(GGally)

theme_set(theme_minimal(base_size = 13))

3 Realistic Synthetic Data Simulation

Key idea. Every simulation choice below is a deliberate encoding of a statistical relationship. When we later apply a statistical method, we are asking: can this method recover the (artificial/synthetic) signal we planted in the data?

3.1 Baseline Demographics (Cross-sectional & Categorical)

Our design rationale for simulating a realistic Chronic Respiratory Distress Syndrome (CRDS) case-study is to create a balanced two-arm trial with three important categorical covariates and a sample size of \(n=500\).

  • Treatment is randomized \(1:1\). Because we will inject a treatment effect into the longitudinal FVC trajectory, any method that compares groups should detect it.
  • Smoking history is drawn with realistic marginal probabilities (\(30\%\) never, \(50\%\) former, \(20\%\) current). It will serve as a predictor in the logistic model.
  • Gender is included for realism and as a potential stratifier.

3.2 Continuous Covariate, Baseline Lung Capacity

\[ \text{FVC}_{\text{baseline},i} \sim \mathcal{N}(\mu = 3.5,\; \sigma = 0.5) \]

This is the covariate we will later adjust for in ANCOVA and MANCOVA.
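As a minimal sketch (the seed and object names are ours, not taken from the SOCR dataset), this baseline covariate could be generated as:

```r
set.seed(1234)                       # illustrative seed
n <- 500

# Baseline FVC (litres) drawn from N(mu = 3.5, sd = 0.5)
baseline_fvc <- rnorm(n, mean = 3.5, sd = 0.5)

round(c(mean = mean(baseline_fvc), sd = sd(baseline_fvc)), 2)
```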

3.3 Longitudinal Lung Function, The Treatment \(\times\) Time Interaction

This is the central signal for repeated-measures ANOVA (rANOVA). We generate three FVC measurements (months 0, 6, 12) with a treatment-dependent trajectory

\[ \text{FVC}_{i,t} = \begin{cases} \text{FVC}_{i,t-1} + \Delta_{\text{Respi}} + \varepsilon_{it}, & \text{if treatment}_i = \text{RespiClear} \\[4pt] \text{FVC}_{i,t-1} + \Delta_{\text{SC}} + \varepsilon_{it}, & \text{if treatment}_i = \text{Standard Care} \end{cases}, \]

where

Interval \(\Delta_{\text{Respi}}\) \(\Delta_{\text{SC}}\)
0 \(\to\) 6 mo \(+0.20\) \(+0.05\)
6 \(\to\) 12 mo \(+0.15\) \(-0.05\)

So, RespiClear patients improve, while Standard Care patients initially plateau then slightly decline, producing the classic diverging-trajectories pattern that rANOVA is designed to detect.
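A sketch of this data-generating step; the residual noise level (\(\sigma = 0.1\)) is our own assumption, since it is not stated above:

```r
set.seed(1234)
n <- 500
treatment <- sample(c("RespiClear", "Standard Care"), n, replace = TRUE)
fvc_0     <- rnorm(n, 3.5, 0.5)

# Interval drifts from the table above; the noise sd is an assumption
delta_6  <- ifelse(treatment == "RespiClear", 0.20,  0.05)
delta_12 <- ifelse(treatment == "RespiClear", 0.15, -0.05)
fvc_6    <- fvc_0 + delta_6  + rnorm(n, 0, 0.1)
fvc_12   <- fvc_6 + delta_12 + rnorm(n, 0, 0.1)

# Mean 12-month change per arm: roughly +0.35 L vs ~0.00 L
tapply(fvc_12 - fvc_0, treatment, mean)
```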

3.4 Multivariate Outcomes, QoL and Walk Distance

For MANOVA / MANCOVA we need two correlated continuous outcomes that are jointly affected by treatment.

\[ \text{QoL}_i = 50 + 0.5\,\text{age}_i + \underbrace{10 \cdot \mathbf{1}[\text{Respi}] + 2 \cdot \mathbf{1}[\text{SC}]}_{\text{treatment shift}} + \varepsilon_i^{(\text{QoL})}, \qquad \varepsilon_i^{(\text{QoL})} \sim \mathcal{N}(0, 25) \]

\[ \text{Walk}_i = 300 - 2\,\text{age}_i + 20\,\text{FVC}_{\text{base},i} + \varepsilon_i^{(\text{Walk})}, \qquad \varepsilon_i^{(\text{Walk})} \sim \mathcal{N}(0, 900) \]

Note that

  • QoL contains both an age effect and a treatment effect. MANOVA will pick up the treatment effect on QoL + Walk jointly, whereas MANCOVA will additionally partial out age.
  • Walk distance depends on baseline FVC but not directly on treatment (only indirectly through FVC improvement). This asymmetry makes the multivariate test more interesting than two univariate tests.
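The two equations above translate directly into R; the seed and helper objects are illustrative:

```r
set.seed(1234)
n <- 500
age          <- rnorm(n, 60, 12)
treatment    <- sample(c("RespiClear", "Standard Care"), n, replace = TRUE)
baseline_fvc <- rnorm(n, 3.5, 0.5)

# QoL: age slope 0.5 plus treatment shift; Var = 25 => sd = 5
qol_score <- 50 + 0.5 * age +
  ifelse(treatment == "RespiClear", 10, 2) + rnorm(n, 0, 5)

# Walk: age and baseline-FVC effects, no direct treatment term; Var = 900 => sd = 30
walk_dist <- 300 - 2 * age + 20 * baseline_fvc + rnorm(n, 0, 30)
```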

3.5 Binary & Count Outcomes, For Logistic Regression and GLM

Binary outcome (success): treatment success is defined as a total FVC improvement \(> 0.4\ L\) over 12 months. Because RespiClear’s expected improvement (\(\approx 0.35\) L) is near the threshold while Standard Care’s (\(\approx 0.00\) L) is well below it, the logistic regression should find a strong treatment predictor.

\[ \text{Success}_i = \mathbf{1}\!\bigl[\text{FVC}_{i,12} - \text{FVC}_{i,0} > 0.4\bigr] \]

Count outcome (hospital visits): visits follow a Poisson model whose log-rate decreases with better baseline lung function

\[ \text{Visits}_i \sim \text{Pois}\!\bigl(\lambda_i\bigr)\ . \]
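A sketch of both derived outcomes; the FVC-change distribution below is an illustrative stand-in for the longitudinal simulation, while the log-rate \(1.5 - 0.2\,\text{FVC}\) is the one planted for the Poisson GLM (Section 6):

```r
set.seed(1234)
n <- 500
baseline_fvc <- rnorm(n, 3.5, 0.5)
fvc_change   <- rnorm(n, 0.18, 0.25)   # illustrative stand-in for fvc_12 - fvc_0

# Binary success: total FVC improvement above 0.4 L
success <- as.integer(fvc_change > 0.4)

# Poisson visits: log(lambda) = 1.5 - 0.2 * baseline FVC
lambda          <- exp(1.5 - 0.2 * baseline_fvc)
hospital_visits <- rpois(n, lambda)
```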

3.6 Physician Notes, Seeded Text for Text Mining

We construct templated clinical notes where the adjective polarity is linked to outcome

  • Successful patients receive notes with positive adjectives (“clearing”, “improved”, “stable”, “significant”, “responsive”).
  • Unsuccessful patients receive notes with negative adjectives (“severe”, “persistent”, “minimal”, “wheezing”, “struggling”).

This ensures that a text-mining pipeline (tokenisation → sentiment scoring) will recover a strong association between note sentiment and clinical success.
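The note generator can be sketched as follows; the sentence template is ours, only the adjective lists come from the description above:

```r
set.seed(1234)
pos_adj <- c("clearing", "improved", "stable", "significant", "responsive")
neg_adj <- c("severe", "persistent", "minimal", "wheezing", "struggling")

# Build a templated note whose adjective polarity matches the outcome
make_note <- function(success) {
  adj <- sample(if (success == 1) pos_adj else neg_adj, 2)
  sprintf("Spirometry shows %s readings; patient reports %s exercise tolerance.",
          adj[1], adj[2])
}

make_note(1)   # a note seeded with positive adjectives
make_note(0)   # a note seeded with negative adjectives
```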

3.7 Assembling Wide and Long Data Frames

Let’s import the data from the SOCR server.

library(readr)

# REMOTE URL
url_wide <- "https://socr.umich.edu/docs/uploads/2026/SOCR_CRDS_CompAnalysis_CaseStudy_wide.csv"
url_long <- "https://socr.umich.edu/docs/uploads/2026/SOCR_CRDS_CompAnalysis_CaseStudy_long.csv"

# Import: Reading the Wide data
df_wide <- read_csv(url_wide, col_types = cols(
  patient_id      = col_factor(),    # Prevents math on IDs
  age             = col_double(),
  gender          = col_character(),
  treatment       = col_character(),
  smoking_history = col_character(),
  baseline_fvc    = col_double(),
  fvc_0           = col_double(),
  fvc_6           = col_double(),
  fvc_12          = col_double(),
  qol_score       = col_double(),
  walk_dist       = col_double(),
  success         = col_integer(),   # 0 or 1 is best as integer
  hospital_visits = col_integer(),
  physician_notes = col_character()
))

# Import: Reading the Long data
df_long <- read_csv(url_long, col_types = cols(
  patient_id  = col_factor(),
  age         = col_double(),
  gender      = col_character(),
  treatment   = col_character(),
  timepoint   = col_factor(levels = c("Month 0", "Month 6", "Month 12")),
  fvc_measure = col_double()   # Outcome used in rANOVA and trajectory plots
))

3.7.1 Quick Comparison: Wide vs. Long Structure

Feature df_wide df_long
Observation One row per patient. Multiple rows per patient (one per timepoint).
Key Columns fvc_0, fvc_6, fvc_12 are separate. timepoint and fvc_measure store the data.
Best Use Case Correlation tables, certain ML models. ANOVA, ggplot2 visualizations, growth modeling.

4 Exploratory Data Analysis (EDA)

Before formal modelling, we inspect the simulated data to verify the intended structure.

4.1 Demographics Summary

df_wide %>%
  group_by(treatment) %>%
  summarise(
    n           = n(),
    age_mean    = mean(age),
    age_sd      = sd(age),
    pct_female  = mean(gender == "Female") * 100,
    pct_current = mean(smoking_history == "Current") * 100,
    fvc_base_m  = mean(baseline_fvc),
    fvc_base_sd = sd(baseline_fvc)
  ) %>%
  kbl(digits = 2, caption = "Baseline Characteristics by Treatment Arm") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Baseline Characteristics by Treatment Arm
treatment n age_mean age_sd pct_female pct_current fvc_base_m fvc_base_sd
RespiClear 251 58.98 12.06 51.39 22.71 3.44 0.48
Standard Care 249 60.16 11.39 48.59 21.69 3.52 0.48

4.2 Outcome Distributions

p1 <- ggplot(df_wide, aes(x = qol_score, fill = treatment)) +
  geom_density(alpha = 0.5) +
  labs(title = "Quality-of-Life Score", x = "QoL", y = "Density")

p2 <- ggplot(df_wide, aes(x = walk_dist, fill = treatment)) +
  geom_density(alpha = 0.5) +
  labs(title = "Six-Minute Walk Distance (m)", x = "Walk Distance", y = "Density")

p3 <- ggplot(df_wide, aes(x = factor(success), fill = treatment)) +
  geom_bar(position = "dodge") +
  labs(title = "Treatment Success (FVC Δ > 0.4 L)", x = "Success", y = "Count")

p4 <- ggplot(df_wide, aes(x = hospital_visits, fill = treatment)) +
  geom_histogram(binwidth = 1, position = "dodge") +
  labs(title = "Hospital Visits (Count)", x = "Visits", y = "Frequency")

(p1 + p2) / (p3 + p4) + plot_annotation(title = "Overview of Key Outcomes")

4.3 Longitudinal FVC Trajectories

df_long %>%
  group_by(treatment, timepoint) %>%
  summarise(mean_fvc = mean(fvc_measure),
            se       = sd(fvc_measure) / sqrt(n()), .groups = "drop") %>%
  ggplot(aes(x = timepoint, y = mean_fvc, colour = treatment, group = treatment)) +
  geom_point(size = 3) +
  geom_line(linewidth = 1) +
  geom_errorbar(aes(ymin = mean_fvc - 1.96 * se, ymax = mean_fvc + 1.96 * se), width = 0.15) +
  labs(title = "Mean FVC Trajectory by Treatment Arm (±95 % CI)",
       y = "FVC (L)", x = "Time Point") +
  scale_colour_brewer(palette = "Set1")

4.4 Correlation Heatmap of Continuous Variables

df_wide %>%
  select(age, baseline_fvc, fvc_0, fvc_6, fvc_12, qol_score, walk_dist, hospital_visits) %>%
  cor() %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "r") %>%
  ggplot(aes(var1, var2, fill = r)) +
  geom_tile() +
  geom_text(aes(label = round(r, 2)), size = 3) +
  scale_fill_gradient2(low = "steelblue", mid = "white", high = "firebrick", midpoint = 0) +
  labs(title = "Pearson Correlation Matrix") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

5 Analytics 1, Multiple Linear Regression & Logistic Regression

5.1 Multiple Linear Regression (MLR)

Multiple linear regression models a continuous outcome \(Y\) as a linear function of \(p\) predictors:

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \varepsilon_i, \qquad \varepsilon_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2) \]

The ordinary least squares (OLS) estimator minimises the residual sum of squares:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y} \]
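The closed-form estimator can be verified numerically on a small simulated example (toy data, ours):

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# beta-hat = (X'X)^{-1} X'y, computed via the normal equations
X        <- cbind(1, x1, x2)
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Matches lm() to numerical precision
all.equal(as.numeric(beta_hat), as.numeric(coef(lm(y ~ x1 + x2))))
#> [1] TRUE
```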

What we planted: qol_score was generated with \(\beta_{\text{age}} = 0.5\) and a treatment-dependent intercept shift (+10 for RespiClear, +2 for Standard Care). The MLR should recover these coefficients.

Fitting the MLR model.

mlr_fit <- lm(qol_score ~ age + baseline_fvc + treatment + smoking_history,
              data = df_wide)
summary(mlr_fit)
#> 
#> Call:
#> lm(formula = qol_score ~ age + baseline_fvc + treatment + smoking_history, 
#>     data = df_wide)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -14.8487  -3.2547  -0.1824   3.2350  13.6237 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)            60.58029    1.99677  30.339   <2e-16 ***
#> age                     0.51062    0.01868  27.341   <2e-16 ***
#> baseline_fvc           -0.31206    0.45486  -0.686    0.493    
#> treatmentStandard Care -8.69804    0.44016 -19.761   <2e-16 ***
#> smoking_historyFormer   0.46496    0.56154   0.828    0.408    
#> smoking_historyNever   -0.27770    0.61355  -0.453    0.651    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.889 on 494 degrees of freedom
#> Multiple R-squared:  0.6886, Adjusted R-squared:  0.6855 
#> F-statistic: 218.5 on 5 and 494 DF,  p-value: < 2.2e-16

5.1.1 Model Diagnostics

par(mfrow = c(2, 2))
plot(mlr_fit)

par(mfrow = c(1, 1))

5.1.2 Coefficient Plot

tidy(mlr_fit, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = estimate, y = reorder(term, estimate))) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = "grey50") +
  geom_point(size = 3, colour = "steelblue") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, colour = "steelblue") +
  labs(title = "MLR Coefficient Estimates (QoL Model)",
       x = "Estimate ± 95% CI", y = NULL)

5.1.3 Interpretation

By the simulated data design, the model should show

  • A significant positive coefficient for age (~0.5), consistent with the data-generating process.
  • A significant negative coefficient for treatmentStandard Care (\(\sim −8\)) reflecting the \(10 − 2 = 8\)-point advantage for RespiClear.
  • Non-significant effects for smoking and baseline FVC (neither was included in the QoL data-generating equation).

5.2 Logistic Regression

For a binary outcome \(Y_i \in \{0,1\}\), logistic regression models the log-odds

\[ \log \frac{P(Y_i = 1)}{1 - P(Y_i = 1)} = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi} \]

Equivalently, \(P(Y_i = 1) = \text{logit}^{-1}(\mathbf{x}_i^\top \boldsymbol{\beta})\). Parameters are estimated by maximum likelihood.
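In R the inverse-logit is `plogis()`; a quick sketch of the link and its interpretation (values are illustrative):

```r
eta <- c(-2, 0, 2)          # linear predictor values
p   <- plogis(eta)          # inverse-logit: 1 / (1 + exp(-eta))
round(p, 3)
#> [1] 0.119 0.500 0.881

# A coefficient beta corresponds to the odds ratio exp(beta):
exp(0.7)   # ~2-fold increase in the odds per unit increase in X
```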

What we planted: success depends on total FVC change, which in turn depends strongly on treatment arm and indirectly on baseline FVC (higher baseline \(\to\) higher absolute values but not necessarily more change).

Fitting the Logit model.

logit_fit <- glm(success ~ treatment + smoking_history + age + baseline_fvc,
                 family = binomial(link = "logit"), data = df_wide)
summary(logit_fit)
#> 
#> Call:
#> glm(formula = success ~ treatment + smoking_history + age + baseline_fvc, 
#>     family = binomial(link = "logit"), data = df_wide)
#> 
#> Coefficients:
#>                          Estimate Std. Error z value Pr(>|z|)  
#> (Intercept)             -2.171243   1.189370  -1.826   0.0679 .
#> treatmentStandard Care -19.164938 678.972804  -0.028   0.9775  
#> smoking_historyFormer   -0.116437   0.333645  -0.349   0.7271  
#> smoking_historyNever     0.026838   0.356219   0.075   0.9399  
#> age                      0.009837   0.010844   0.907   0.3643  
#> baseline_fvc             0.344842   0.273215   1.262   0.2069  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 494.81  on 499  degrees of freedom
#> Residual deviance: 333.24  on 494  degrees of freedom
#> AIC: 345.24
#> 
#> Number of Fisher Scoring iterations: 18

5.2.1 Odds-Ratio Forest Plot

tidy(logit_fit, conf.int = TRUE, exponentiate = TRUE) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = estimate, y = reorder(term, estimate))) +
  geom_vline(xintercept = 1, linetype = "dashed", colour = "grey50") +
  geom_point(size = 3, colour = "firebrick") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0.2, colour = "firebrick") +
  scale_x_log10() +
  labs(title = "Odds Ratios for Treatment Success",
       x = "OR (log scale) ± 95% CI", y = NULL)

5.2.2 Interpretation

The treatment coefficient is large and negative, as intended, but its enormous standard error (\(\approx 679\)) and non-significant Wald \(p\)-value are a classic sign of quasi-complete separation: essentially no Standard Care patient achieves success, so the maximum-likelihood estimate diverges toward \(-\infty\) (note the 18 Fisher scoring iterations). The large drop in deviance (494.8 → 333.2) nevertheless confirms a strong treatment effect; when separation occurs, a likelihood-ratio test, a Firth-penalised logistic regression (e.g., the logistf package), or an exact method is preferable to the Wald test. Other predictors are relatively weak, as designed.

6 Analytics 2, Generalised Linear Modelling (GLM)

GLMs extend linear regression to non-normal response distributions via three components

  1. Random component: \(Y_i \sim f(\mu_i, \phi)\) for some exponential-family distribution.
  2. Systematic component: \(\eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}\).
  3. Link function: \(g(\mu_i) = \eta_i\).

For count data the natural choice is Poisson regression

\[ Y_i \sim \text{Poisson}(\lambda_i), \qquad \log \lambda_i = \beta_0 + \beta_1 X_{1i} + \cdots \]

What we planted:

\[ \log \lambda_i = 1.5 - 0.2\,\text{FVC}_{\text{base},i} \]

The Poisson GLM should recover \(\hat\beta_0 \approx 1.5\) and \(\hat\beta_1 \approx -0.2\).

GLM analysis.

pois_fit <- glm(hospital_visits ~ baseline_fvc + age + treatment,
                family = poisson(link = "log"), data = df_wide)
summary(pois_fit)
#> 
#> Call:
#> glm(formula = hospital_visits ~ baseline_fvc + age + treatment, 
#>     family = poisson(link = "log"), data = df_wide)
#> 
#> Coefficients:
#>                         Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)             1.348589   0.261722   5.153 2.57e-07 ***
#> baseline_fvc           -0.197228   0.061703  -3.196  0.00139 ** 
#> age                     0.003087   0.002534   1.218  0.22312    
#> treatmentStandard Care -0.072599   0.059836  -1.213  0.22501    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for poisson family taken to be 1)
#> 
#>     Null deviance: 541.58  on 499  degrees of freedom
#> Residual deviance: 527.77  on 496  degrees of freedom
#> AIC: 1757.8
#> 
#> Number of Fisher Scoring iterations: 5

6.0.1 Checking for Overdispersion

A key assumption of Poisson regression is equidispersion (\(\text{Var}[Y] = \mu\)). We check the dispersion parameter.

disp <- sum(residuals(pois_fit, type = "pearson")^2) / pois_fit$df.residual
cat("Estimated dispersion =", round(disp, 3), "\n")
#> Estimated dispersion = 0.964
if (disp > 1.5) {
  cat("→ Evidence of overdispersion; consider a Negative Binomial model.\n")
  if (requireNamespace("MASS", quietly = TRUE)) {
    nb_fit <- MASS::glm.nb(hospital_visits ~ baseline_fvc + age + treatment,
                           data = df_wide)
    cat("\nNegative Binomial model summary:\n")
    print(summary(nb_fit))
  }
} else {
  cat("→ Dispersion is acceptable; Poisson model is adequate.\n")
}
#> → Dispersion is acceptable; Poisson model is adequate.

6.0.2 Visualisation: Predicted vs Observed Counts

df_wide %>%
  mutate(pred_lambda = predict(pois_fit, type = "response")) %>%
  ggplot(aes(x = baseline_fvc)) +
  geom_jitter(aes(y = hospital_visits), alpha = 0.3, width = 0, height = 0.15) +
  geom_line(aes(y = pred_lambda), colour = "darkorange", linewidth = 1.2) +
  labs(title = "Poisson GLM: Hospital Visits vs Baseline FVC",
       x = "Baseline FVC (L)", y = "Hospital Visits")

7 Analytics 3, Analysis of Covariance (ANCOVA)

ANCOVA combines ANOVA with regression. For comparing group means on an outcome \(Y\) while adjusting for a continuous covariate \(X\)

\[ Y_{ij} = \mu + \tau_j + \beta(X_{ij} - \bar{X}) + \varepsilon_{ij}, \]

where \(\tau_j\) is the treatment effect for group \(j\) and \(\beta\) is the common within-group slope (homogeneity-of-slopes assumption).

The adjusted group means (least-squares means) are

\[ \bar{Y}_j^{\text{adj}} = \bar{Y}_j - \hat\beta(\bar{X}_j - \bar{X}). \]
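On toy data (ours), the adjusted-mean formula agrees exactly with predicting each group at the grand covariate mean:

```r
set.seed(7)
n <- 200
group <- rep(c("A", "B"), each = n / 2)
x <- rnorm(n, 10, 2)
y <- 5 + 3 * (group == "B") + 0.8 * x + rnorm(n)

fit  <- lm(y ~ group + x)
beta <- coef(fit)["x"]

# Adjusted means: raw group means corrected to the grand mean of x
adj <- tapply(y, group, mean) - beta * (tapply(x, group, mean) - mean(x))

# Identical to model predictions at x = mean(x)
pred <- predict(fit, newdata = data.frame(group = c("A", "B"), x = mean(x)))
all.equal(unname(adj), unname(pred))
#> [1] TRUE
```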

What we planted: The 12-month FVC differs between groups and is correlated with baseline FVC. ANCOVA should reveal a significant treatment effect on fvc_12 after controlling for baseline_fvc.

7.1 Assumption Check: Homogeneity of Slopes

ggplot(df_wide, aes(x = baseline_fvc, y = fvc_12, colour = treatment)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "FVC at 12 Months vs Baseline FVC (by Treatment)",
       x = "Baseline FVC (L)", y = "FVC at 12 Months (L)")

# Formal test
slopes_test <- lm(fvc_12 ~ baseline_fvc * treatment, data = df_wide)
cat("Interaction (homogeneity-of-slopes) test:\n")
#> Interaction (homogeneity-of-slopes) test:
anova(slopes_test)
#> Analysis of Variance Table
#> 
#> Response: fvc_12
#>                         Df  Sum Sq Mean Sq   F value Pr(>F)    
#> baseline_fvc             1 106.787 106.787 5192.4475 <2e-16 ***
#> treatment                1  16.929  16.929  823.1763 <2e-16 ***
#> baseline_fvc:treatment   1   0.040   0.040    1.9358 0.1647    
#> Residuals              496  10.201   0.021                     
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.2 ANCOVA Model

ancova_fit <- lm(fvc_12 ~ treatment + baseline_fvc, data = df_wide)
Anova(ancova_fit, type = "III")
#> Anova Table (Type III tests)
#> 
#> Response: fvc_12
#>               Sum Sq  Df  F value    Pr(>F)    
#> (Intercept)    1.432   1   69.509 7.511e-16 ***
#> treatment     16.929   1  821.629 < 2.2e-16 ***
#> baseline_fvc 113.564   1 5511.594 < 2.2e-16 ***
#> Residuals     10.240 497                       
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.3 Estimated Marginal Means

emm <- emmeans(ancova_fit, ~ treatment)
summary(emm)
#>  treatment     emmean       SE  df lower.CL upper.CL
#>  RespiClear     3.837 0.009078 497    3.819    3.854
#>  Standard Care  3.467 0.009115 497    3.449    3.485
#> 
#> Confidence level used: 0.95
pairs(emm)
#>  contrast                   estimate     SE  df t.ratio p.value
#>  RespiClear - Standard Care    0.369 0.0129 497  28.664  <.0001

7.3.1 Visualisation: Adjusted Means

emm_df <- as.data.frame(summary(emm))
ggplot(emm_df, aes(x = treatment, y = emmean, fill = treatment)) +
  geom_col(width = 0.5, alpha = 0.8) +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.15) +
  labs(title = "ANCOVA: Adjusted Mean FVC at 12 Months",
       y = "Adjusted Mean FVC (L)", x = NULL) +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Set2")

8 Analytics 4, Multivariate Analysis of Variance (MANOVA)

MANOVA generalises ANOVA to a vector of \(q\) outcomes \(\mathbf{Y}_i = (Y_{i1}, \ldots, Y_{iq})^\top\). Under the null hypothesis the group mean vectors are equal

\[ H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_g. \]

The test decomposes the total cross-product matrix as \(\mathbf{T} = \mathbf{B} + \mathbf{W}\) (between + within) and forms a test statistic from \(\mathbf{W}^{-1}\mathbf{B}\). Common choices include:

Statistic Formula
Wilks’ \(\Lambda\) \(\prod_{k} \frac{1}{1 + \lambda_k}\)
Pillai’s trace \(\sum_k \frac{\lambda_k}{1 + \lambda_k}\)
Hotelling–Lawley \(\sum_k \lambda_k\)
Roy’s largest root \(\max_k \lambda_k\)

where \(\lambda_k\) are the eigenvalues of \(\mathbf{W}^{-1}\mathbf{B}\).
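A small numerical check (toy data, ours) that Wilks’ \(\Lambda\) computed from the eigenvalues of \(\mathbf{W}^{-1}\mathbf{B}\) equals the determinant ratio \(\det(\mathbf{W})/\det(\mathbf{B}+\mathbf{W})\):

```r
set.seed(1)
n <- 60
g <- rep(c("A", "B", "C"), each = n / 3)
Y <- cbind(rnorm(n) + 2 * (g == "B"), rnorm(n) - (g == "C"))

fit <- manova(Y ~ g)
W <- crossprod(residuals(fit))                      # within-group SSCP
B <- crossprod(sweep(fitted(fit), 2, colMeans(Y)))  # between-group SSCP

lambda_k <- eigen(solve(W) %*% B)$values
c(from_eigenvalues  = prod(1 / (1 + lambda_k)),
  from_determinants = det(W) / det(W + B))
```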

What we planted: RespiClear shifts qol_score upward by ~8 points relative to Standard Care. walk_dist has no direct treatment effect but is correlated with QoL through age. MANOVA should detect a significant joint effect.

MANOVA model fitting.

manova_fit <- manova(cbind(qol_score, walk_dist) ~ treatment, data = df_wide)
summary(manova_fit, test = "Wilks")
#>            Df   Wilks approx F num Df den Df    Pr(>F)    
#> treatment   1 0.74926   83.162      2    497 < 2.2e-16 ***
#> Residuals 498                                             
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(manova_fit, test = "Pillai")
#>            Df  Pillai approx F num Df den Df    Pr(>F)    
#> treatment   1 0.25074   83.162      2    497 < 2.2e-16 ***
#> Residuals 498                                             
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8.1 Univariate Follow-Ups

summary.aov(manova_fit)
#>  Response qol_score :
#>              Df Sum Sq Mean Sq F value    Pr(>F)    
#> treatment     1   8169  8169.0  136.76 < 2.2e-16 ***
#> Residuals   498  29747    59.7                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>  Response walk_dist :
#>              Df Sum Sq Mean Sq F value Pr(>F)
#> treatment     1     50    49.9  0.0302  0.862
#> Residuals   498 822043  1650.7

8.2 Visualisation: Bivariate Outcome Space

ggplot(df_wide, aes(x = qol_score, y = walk_dist, colour = treatment)) +
  geom_point(alpha = 0.5) +
  stat_ellipse(level = 0.95, linewidth = 1) +
  labs(title = "MANOVA: Bivariate Outcome Space",
       x = "Quality of Life Score", y = "Six-Minute Walk Distance (m)")

9 Analytics 5, Multivariate Analysis of Covariance (MANCOVA)

MANCOVA extends MANOVA by including one or more continuous covariates, analogous to how ANCOVA extends ANOVA. The model becomes

\[ \mathbf{Y}_{ij} = \boldsymbol{\mu} + \boldsymbol{\tau}_j + \mathbf{B}\,\mathbf{x}_{ij} + \boldsymbol{\varepsilon}_{ij}, \]

where \(\mathbf{B}\) is a matrix of regression slopes for the covariates and \(\boldsymbol{\tau}_j\) represents the group effect.

What we planted: Because age drives both qol_score (slope +0.5) and walk_dist (slope −2), controlling for age should sharpen the treatment effect on QoL while revealing that the apparent walk-distance difference between arms (if any) is confounded with age.

Fitting the MANCOVA model.

mancova_fit <- manova(cbind(qol_score, walk_dist) ~ treatment + age, data = df_wide)
summary(mancova_fit, test = "Wilks")
#>            Df   Wilks approx F num Df den Df    Pr(>F)    
#> treatment   1 0.59214   170.82      2    496 < 2.2e-16 ***
#> age         1 0.33029   502.86      2    496 < 2.2e-16 ***
#> Residuals 497                                             
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mancova_fit, test = "Pillai")
#>            Df  Pillai approx F num Df den Df    Pr(>F)    
#> treatment   1 0.40786   170.82      2    496 < 2.2e-16 ***
#> age         1 0.66971   502.86      2    496 < 2.2e-16 ***
#> Residuals 497                                             
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9.1 Univariate Follow-Ups

summary.aov(mancova_fit)
#>  Response qol_score :
#>              Df Sum Sq Mean Sq F value    Pr(>F)    
#> treatment     1   8169  8169.0  342.03 < 2.2e-16 ***
#> age           1  17877 17876.7  748.49 < 2.2e-16 ***
#> Residuals   497  11870    23.9                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>  Response walk_dist :
#>              Df Sum Sq Mean Sq F value Pr(>F)    
#> treatment     1     50      50   0.045 0.8321    
#> age           1 270646  270646 243.947 <2e-16 ***
#> Residuals   497 551396    1109                   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9.2 Comparing MANOVA vs. MANCOVA

# Extract Wilks' Lambda for comparison
wilks_manova  <- summary(manova_fit, test = "Wilks")$stats["treatment", "Wilks"]
wilks_mancova <- summary(mancova_fit, test = "Wilks")$stats["treatment", "Wilks"]
cat("Wilks' Lambda (MANOVA, no age covariate):     ", round(wilks_manova, 4), "\n")
#> Wilks' Lambda (MANOVA, no age covariate):      0.7493
cat("Wilks' Lambda (MANCOVA, adjusting for age):   ", round(wilks_mancova, 4), "\n")
#> Wilks' Lambda (MANCOVA, adjusting for age):    0.5921
cat("\nSmaller Lambda = stronger treatment effect.\n")
#> 
#> Smaller Lambda = stronger treatment effect.

10 Analytics 6, Repeated-Measures ANOVA (rANOVA)

When the same subjects are measured at multiple time points, observations within a subject are correlated. Repeated-measures ANOVA accounts for this by partitioning the residual into between-subject and within-subject components.

For a two-factor design (treatment \(\times\) time) the model is

\[ Y_{ijk} = \mu + \alpha_j + \pi_{i(j)} + \beta_k + (\alpha\beta)_{jk} + \varepsilon_{ijk}, \]

where \(\alpha_j\): treatment effect (between-subjects), \(\pi_{i(j)}\): subject nested within treatment (random effect), \(\beta_k\): time effect (within-subjects), and \((\alpha\beta)_{jk}\): treatment × time interaction, the key quantity of interest.

A significant interaction means the trajectories differ between groups.

10.1 Sphericity

The \(F\)-test for within-subject effects requires sphericity (equal variances of all pairwise differences between time points). Violations are corrected with Greenhouse–Geisser (\(\hat\varepsilon_{\text{GG}}\)) or Huynh–Feldt adjustments.

What we planted: RespiClear shows a consistent upward trend while Standard Care flattens and then declines (see Section 3.3). The Treatment × Time interaction should be highly significant.

rANOVA modeling using ez::ezANOVA().

ranova_res <- ezANOVA(
  data       = as.data.frame(df_long),
  dv         = fvc_measure,
  wid        = patient_id,
  within     = timepoint,
  between    = treatment,
  type       = 3,
  detailed   = TRUE
)
print(ranova_res)
#> $ANOVA
#>                Effect DFn DFd          SSn        SSd            F
#> 1         (Intercept)   1 498 19208.123645 350.171508 27317.029972
#> 2           treatment   1 498     3.270260 350.171508     4.650834
#> 3           timepoint   2 996     7.933163   6.824018   578.942654
#> 4 treatment:timepoint   2 996     8.599404   6.824018   627.563304
#>               p p<.05         ges
#> 1  0.000000e+00     * 0.981753470
#> 2  3.151643e-02     * 0.009077355
#> 3 1.542797e-167     * 0.021738940
#> 4 4.345000e-177     * 0.023521672
#> 
#> $`Mauchly's Test for Sphericity`
#>                Effect         W            p p<.05
#> 3           timepoint 0.7460169 2.388085e-32     *
#> 4 treatment:timepoint 0.7460169 2.388085e-32     *
#> 
#> $`Sphericity Corrections`
#>                Effect       GGe         p[GG] p[GG]<.05       HFe         p[HF]
#> 3           timepoint 0.7974589 2.724868e-134         * 0.7996131 1.206973e-134
#> 4 treatment:timepoint 0.7974589 6.548829e-142         * 0.7996131 2.766792e-142
#>   p[HF]<.05
#> 3         *
#> 4         *

10.2 Visualisation: Individual Spaghetti Plots

set.seed(99)
sample_ids <- sample(unique(df_long$patient_id), 60)

df_long %>%
  filter(patient_id %in% sample_ids) %>%
  ggplot(aes(x = timepoint, y = fvc_measure, group = patient_id, colour = treatment)) +
  geom_line(alpha = 0.3) +
  geom_point(alpha = 0.4, size = 1) +
  stat_summary(aes(group = treatment), fun = mean, geom = "line",
               linewidth = 1.8, linetype = "solid") +
  stat_summary(aes(group = treatment), fun = mean, geom = "point",
               size = 4, shape = 18) +
  labs(title = "Repeated-Measures: Individual Trajectories (n = 60 sample) + Group Means",
       y = "FVC (L)", x = "Time Point") +
  scale_colour_brewer(palette = "Set1")

10.3 Pairwise Post-Hoc Comparisons

# Fit a cell-means linear model for flexible post-hoc contrasts
# (this ignores within-subject correlation; a true mixed model, e.g. a patient
# random intercept via lme4::lmer, would give more appropriate standard errors)
lmm_fit <- lm(fvc_measure ~ treatment * timepoint, data = df_long)
emm_time <- emmeans(lmm_fit, ~ treatment | timepoint)
pairs(emm_time)
#> timepoint = Month 0:
#>  contrast                   estimate     SE   df t.ratio p.value
#>  RespiClear - Standard Care  -0.0853 0.0437 1494  -1.950  0.0514
#> 
#> timepoint = Month 6:
#>  contrast                   estimate     SE   df t.ratio p.value
#>  RespiClear - Standard Care   0.0804 0.0437 1494   1.840  0.0660
#> 
#> timepoint = Month 12:
#>  contrast                   estimate     SE   df t.ratio p.value
#>  RespiClear - Standard Care   0.2850 0.0437 1494   6.518  <.0001

11 Analytics 7a – Text Mining of Physician Notes

Text mining converts unstructured text into quantitative features. A standard pipeline involves

  1. Tokenisation: splitting text into individual words (tokens).
  2. Stop-word removal: filtering high-frequency function words.
  3. Sentiment scoring: joining tokens with a sentiment lexicon (e.g., Bing, AFINN, NRC) to assign positive/negative polarity.
  4. Aggregation: computing a per-document sentiment score.

What we planted: Successful patients’ notes were generated with positive adjectives; unsuccessful patients’ notes with negative adjectives (see Section 3.6). Sentiment analysis should clearly separate the two groups.
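Before turning to the tidytext implementation, the four pipeline steps can be sketched in a few lines of base R on two toy notes, using a tiny hand-made lexicon (both the notes and the lexicon are hypothetical, for illustration only):

```r
notes <- c(p1 = "patient shows excellent steady improvement",
           p2 = "poor response with worsening persistent wheezing")
stop_list <- c("shows", "with")
lexicon   <- c(excellent = 1, steady = 1, improvement = 1,
               poor = -1, worsening = -1, persistent = -1)

score_note <- function(txt) {
  tokens <- strsplit(tolower(txt), "\\s+")[[1]]      # 1. tokenise
  tokens <- setdiff(tokens, stop_list)               # 2. remove stop-words
  sum(lexicon[intersect(tokens, names(lexicon))])    # 3-4. score and aggregate
}

sapply(notes, score_note)   # p1 = 3, p2 = -3
```

The real pipeline below replaces the toy lexicon with the Bing lexicon and the ad-hoc tokeniser with `unnest_tokens()`, but the logic is identical.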

11.1 Tokenisation and Sentiment Scoring

tokens <- df_wide %>%
  select(patient_id, success, treatment, physician_notes) %>%
  unnest_tokens(word, physician_notes) %>%
  anti_join(stop_words, by = "word")

head(tokens, 10)
#> # A tibble: 10 × 4
#>    patient_id success treatment     word         
#>    <fct>        <int> <chr>         <chr>        
#>  1 1                0 Standard Care spirometry   
#>  2 1                0 Standard Care deteriorating
#>  3 1                0 Standard Care readings     
#>  4 1                0 Standard Care patient      
#>  5 1                0 Standard Care describes    
#>  6 1                0 Standard Care wheezing     
#>  7 1                0 Standard Care exercise     
#>  8 1                0 Standard Care capacity     
#>  9 2                0 RespiClear    patient      
#> 10 2                0 RespiClear    poor

11.1.1 Word Frequencies by Success Group

top_words <- tokens %>%
  mutate(success_label = ifelse(success == 1, "Success", "Failure")) %>%
  count(success_label, word, sort = TRUE) %>%
  group_by(success_label) %>%
  slice_max(n, n = 10) %>%
  ungroup()

ggplot(top_words, aes(x = n, y = reorder_within(word, n, success_label), fill = success_label)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ success_label, scales = "free_y") +
  scale_y_reordered() +
  labs(title = "Top 10 Words in Physician Notes by Outcome Group",
       x = "Frequency", y = NULL) +
  scale_fill_brewer(palette = "Set1")

11.1.2 Sentiment Analysis (Bing Lexicon)

sentiment_scores <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(patient_id, success, treatment, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative,
         success_label = factor(ifelse(success == 1, "Success", "Failure")))

ggplot(sentiment_scores, aes(x = success_label, y = net_sentiment, fill = success_label)) +
  geom_boxplot(alpha = 0.7, outlier.alpha = 0.3) +
  labs(title = "Net Sentiment Score by Clinical Outcome",
       x = "Outcome", y = "Net Sentiment (positive − negative)") +
  theme(legend.position = "none") +
  scale_fill_brewer(palette = "Set1")

11.1.3 Word Cloud

cloud_df <- tokens %>%
  count(word, sort = TRUE)

wordcloud(words = cloud_df$word, freq = cloud_df$n,
          min.freq = 2, max.words = 80,
          colors = brewer.pal(8, "Dark2"),
          random.order = FALSE, scale = c(3, 0.5))

11.1.4 Statistical Test: Sentiment vs Outcome

t_test_res <- t.test(net_sentiment ~ success_label, data = sentiment_scores)
print(t_test_res)
#> 
#>  Welch Two Sample t-test
#> 
#> data:  net_sentiment by success_label
#> t = -39.65, df = 181.29, p-value < 2.2e-16
#> alternative hypothesis: true difference in means between group Failure and group Success is not equal to 0
#> 95 percent confidence interval:
#>  -3.186782 -2.884646
#> sample estimates:
#> mean in group Failure mean in group Success 
#>            -0.6071429             2.4285714

12 Analytics 7b – Sensitivity Analysis

Sensitivity analysis asks “How robust are our conclusions to changes in analytic choices?” We investigate two dimensions: the definition of the success threshold and the handling of missing data.

12.1 Threshold Sensitivity for Logistic Regression

We defined success as FVC improvement \(> 0.4\) L. What if the threshold were different?

library(broom)
library(purrr)

thresholds <- seq(0.1, 0.8, by = 0.05)

threshold_results <- map_dfr(thresholds, function(thr) {
  # Create the threshold-specific outcome
  df_wide$success_thr <- as.integer((df_wide$fvc_12 - df_wide$fvc_0) > thr)
  
  # Check if we have at least one success and one failure
  # This prevents the "interpolation" error in tidy(conf.int = TRUE)
  if (length(unique(df_wide$success_thr)) < 2) {
    return(NULL) # Skip this threshold if no variance in outcome
  }
  
  fit <- glm(success_thr ~ treatment + smoking_history + age + baseline_fvc, 
             family = binomial, data = df_wide)
  
  # Try-Catch block handles cases where model might still fail to converge
  tryCatch({
    coefs <- tidy(fit, conf.int = TRUE, exponentiate = TRUE) %>%
      filter(term == "treatmentStandard Care")
    
    if(nrow(coefs) == 0) return(NULL) # Case where treatment term is dropped
    
    tibble(
      threshold = thr,
      OR        = coefs$estimate,
      OR_low    = coefs$conf.low,
      OR_high   = coefs$conf.high,
      p_value   = coefs$p.value,
      pct_success = mean(df_wide$success_thr)
    )
  }, error = function(e) return(NULL))
})


ggplot(threshold_results, aes(x = threshold)) +
  geom_ribbon(aes(ymin = OR_low, ymax = OR_high), alpha = 0.2, fill = "steelblue") +
  geom_line(aes(y = OR), colour = "steelblue", linewidth = 1) +
  geom_hline(yintercept = 1, linetype = "dashed") +
  scale_y_log10() +
  labs(title = "Threshold Sensitivity: OR for Standard Care (vs RespiClear)",
       subtitle = "Shaded band = 95% CI",
       x = "Success Threshold (FVC Δ in L)", y = "Odds Ratio (log scale)")

12.1.1 Interpretation

As the threshold increases beyond the expected improvement of RespiClear (~0.35 L), the success rate drops for both groups but the relative advantage of RespiClear persists across a wide range, indicating robustness.

12.2 Missing-Data Sensitivity (Dropout Simulation)

We simulate Missing Not At Random (MNAR) dropout: patients with lower FVC at 6 months are more likely to drop out before month 12.

set.seed(123)

# Dropout probability rises as 6-month FVC falls (logistic in 3.5 - fvc_6), i.e., MNAR
dropout_prob <- plogis(-2 + 1.5 * (3.5 - df_wide$fvc_6))
dropout      <- rbinom(n, 1, dropout_prob)
cat("Simulated dropout rate:", round(mean(dropout) * 100, 1), "%\n")
#> Simulated dropout rate: 12.2 %
df_missing <- df_wide
df_missing$fvc_12[dropout == 1] <- NA

12.2.1 Complete-Case Analysis

cc_fit <- lm(fvc_12 ~ treatment + baseline_fvc, data = df_missing)
cat("Complete-Case ANCOVA:\n")
#> Complete-Case ANCOVA:
summary(cc_fit)$coefficients["treatmentStandard Care", ]
#>      Estimate    Std. Error       t value      Pr(>|t|) 
#> -3.679334e-01  1.384439e-02 -2.657635e+01  3.144346e-93

12.2.2 Mean Imputation

df_mean_imp <- df_missing
df_mean_imp$fvc_12[is.na(df_mean_imp$fvc_12)] <- mean(df_missing$fvc_12, na.rm = TRUE)

mi_mean_fit <- lm(fvc_12 ~ treatment + baseline_fvc, data = df_mean_imp)
cat("Mean Imputation ANCOVA:\n")
#> Mean Imputation ANCOVA:
summary(mi_mean_fit)$coefficients["treatmentStandard Care", ]
#>      Estimate    Std. Error       t value      Pr(>|t|) 
#> -3.199695e-01  2.021919e-02 -1.582505e+01  5.639715e-46

12.2.3 Multiple Imputation (MICE)

imp_data <- df_missing %>% select(fvc_12, treatment, baseline_fvc, fvc_0, fvc_6, age)
mice_imp <- mice(imp_data, m = 10, method = "pmm", printFlag = FALSE, seed = 42)

mice_fit <- with(mice_imp, lm(fvc_12 ~ treatment + baseline_fvc))
mice_pool <- pool(mice_fit)
cat("Multiple Imputation (MICE) ANCOVA:\n")
#> Multiple Imputation (MICE) ANCOVA:
summary(mice_pool)[2, ]
#>                     term   estimate  std.error statistic       df      p.value
#> 2 treatmentStandard Care -0.3536645 0.01392774 -25.39281 461.1777 1.233365e-89

12.3 Comparison of Approaches

results_compare <- tibble(
  Method = c("Full Data (Truth)", "Complete Case", "Mean Imputation", "Multiple Imputation"),
  Estimate = c(
    coef(ancova_fit)["treatmentStandard Care"],
    coef(cc_fit)["treatmentStandard Care"],
    coef(mi_mean_fit)["treatmentStandard Care"],
    summary(mice_pool)$estimate[2]
  ),
  SE = c(
    summary(ancova_fit)$coefficients["treatmentStandard Care", "Std. Error"],
    summary(cc_fit)$coefficients["treatmentStandard Care", "Std. Error"],
    summary(mi_mean_fit)$coefficients["treatmentStandard Care", "Std. Error"],
    summary(mice_pool)$std.error[2]
  )
)

results_compare %>%
  kbl(digits = 4, caption = "Sensitivity: Treatment Effect Estimates Under Different Missing-Data Strategies") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Sensitivity: Treatment Effect Estimates Under Different Missing-Data Strategies

| Method | Estimate | SE |
|---|---|---|
| Full Data (Truth) | -0.3695 | 0.0129 |
| Complete Case | -0.3679 | 0.0138 |
| Mean Imputation | -0.3200 | 0.0202 |
| Multiple Imputation | -0.3537 | 0.0139 |

Let’s visualise the missing-data sensitivity of the treatment effect on FVC at 12 months.

ggplot(results_compare, aes(x = Method, y = Estimate)) +
  geom_point(size = 3, colour = "firebrick") +
  geom_errorbar(aes(ymin = Estimate - 1.96 * SE, ymax = Estimate + 1.96 * SE),
                width = 0.2, colour = "firebrick") +
  geom_hline(yintercept = results_compare$Estimate[1], linetype = "dashed", colour = "grey40") +
  coord_flip() +
  labs(title = "Missing-Data Sensitivity: Treatment Effect on FVC at 12 Months",
       subtitle = "Dashed line = full-data estimate",
       x = NULL, y = "Treatment Effect Estimate (Standard Care − RespiClear)")

13 Method 8 – Model-Free Machine-Learning / AI Approaches

The seven methods above all rest on explicit parametric (or semi-parametric) assumptions: linearity, normality, a known link function, sphericity, etc.

Model-free machine-learning techniques make far fewer distributional assumptions and instead learn structure directly from data. We demonstrate two complementary paradigms.

| Sub-section | Paradigm | Technique | Goal |
|---|---|---|---|
| 8A | Supervised | Random Forest + Gradient-Boosted Trees | Predict treatment success and rank variable importance |
| 8B | Unsupervised | t-SNE Embedding + k-Means Clustering | Discover latent patient sub-groups without outcome labels |

The following R packages are used in this section.

ml_pkgs <- c("randomForest", "xgboost", "pROC", "Rtsne", "caret")
for (pkg in ml_pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE))
    install.packages(pkg, repos = "https://cran.r-project.org")
}
library(randomForest)
library(xgboost)
library(pROC)
library(Rtsne)
library(caret)

13.1 8A – Supervised Learning: Random Forest & Gradient-Boosted Trees

Logistic regression (Method 1) assumed a linear log-odds surface. A Random Forest is an ensemble of deep decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at every split. The prediction is the majority vote (classification) or average (regression) across all \(B\) trees.

Key properties that contrast with parametric models include:

  - No linearity assumption. Arbitrary interactions and non-linearities are captured automatically.
  - Built-in variable importance. Each feature’s contribution is measured by the mean decrease in Gini impurity (classification) or the mean decrease in accuracy when the feature is permuted.
  - Out-of-bag (OOB) error. Because each tree is trained on ~63 % of the data, the remaining ~37 % serves as an internal validation set; no separate hold-out is strictly required (though we use one anyway).
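The ~63 %/~37 % split follows from the bootstrap itself: each observation enters a bootstrap sample of size \(n\) with probability \(1 - (1 - 1/n)^n \to 1 - 1/e \approx 0.632\). A quick base-R check, independent of the trial data:

```r
# Empirically estimate the in-bag fraction of a bootstrap sample
set.seed(1)
n_obs  <- 1e5
boot   <- sample(n_obs, replace = TRUE)          # bootstrap indices
in_bag <- length(unique(boot)) / n_obs           # fraction of distinct observations drawn
c(in_bag = in_bag, out_of_bag = 1 - in_bag)      # approximately 0.632 and 0.368
```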

We also fit a Gradient-Boosted Tree model via XGBoost for comparison. Boosting builds trees sequentially, each new tree fitting the residual errors of the ensemble so far, with a learning-rate shrinkage parameter \(\eta\):

\[ \hat{f}^{(m)}(\mathbf{x}) = \hat{f}^{(m-1)}(\mathbf{x}) + \eta\,h_m(\mathbf{x}), \]

where \(h_m\) is the \(m\)-th weak learner (shallow tree).
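The update rule above can be sketched in base R with depth-1 regression “stumps” as the weak learners on a toy one-dimensional problem. This is a minimal illustration of boosting on residuals, not the XGBoost algorithm itself (which adds regularisation, second-order gradients, and column subsampling):

```r
set.seed(7)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

# Best single-split regression stump for residuals r
stump <- function(x, r) {
  splits <- quantile(x, probs = seq(0.05, 0.95, by = 0.05))
  sse <- sapply(splits, function(s) {
    pred <- ifelse(x < s, mean(r[x < s]), mean(r[x >= s]))
    sum((r - pred)^2)
  })
  s <- splits[[which.min(sse)]]
  list(split = s, left = mean(r[x < s]), right = mean(r[x >= s]))
}

eta   <- 0.1                        # learning-rate shrinkage
f_hat <- rep(mean(y), length(y))    # f^(0): constant initial fit
for (m in 1:100) {
  r <- y - f_hat                    # residuals of the current ensemble
  h <- stump(x, r)                  # fit the m-th weak learner h_m
  f_hat <- f_hat + eta * ifelse(x < h$split, h$left, h$right)
}
mean((y - f_hat)^2)                 # training MSE, well below var(y)
```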

Method Verification: success is deterministically driven by the 12-month FVC change, which in turn depends on treatment arm and stochastic noise. Both tree-based methods should identify treatment and the FVC measurements as the dominant predictors, while also detecting any non-linear or interactive patterns that the logistic model could not represent.

13.1.1 Data Preparation

# Create a modelling data frame with informative features
df_ml <- df_wide %>%
  transmute(
    age,
    gender         = factor(gender),
    treatment      = factor(treatment),
    smoking_history = factor(smoking_history),
    baseline_fvc,
    fvc_6,
    fvc_change_06  = fvc_6 - fvc_0,
    fvc_change_012 = fvc_12 - fvc_0,
    qol_score,
    walk_dist,
    hospital_visits,
    success        = factor(success, levels = c(0, 1), labels = c("Failure", "Success"))
  )

# Stratified 70/30 train–test split
set.seed(2024)
train_idx <- createDataPartition(df_ml$success, p = 0.70, list = FALSE)
df_train  <- df_ml[train_idx, ]
df_test   <- df_ml[-train_idx, ]
cat("Training set:", nrow(df_train), "| Test set:", nrow(df_test), "\n")
#> Training set: 351 | Test set: 149

13.1.2 Random Forest

set.seed(42)
rf_fit <- randomForest(
  success ~ .,
  data       = df_train,
  ntree      = 500,
  mtry       = 4,
  importance = TRUE
)
print(rf_fit)
#> 
#> Call:
#>  randomForest(formula = success ~ ., data = df_train, ntree = 500,      mtry = 4, importance = TRUE) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 4
#> 
#>         OOB estimate of  error rate: 0%
#> Confusion matrix:
#>         Failure Success class.error
#> Failure     282       0           0
#> Success       0      69           0

13.1.2.1 Variable Importance

imp_df <- importance(rf_fit) %>%
  as.data.frame() %>%
  rownames_to_column("Variable") %>%
  arrange(desc(MeanDecreaseGini))

ggplot(imp_df, aes(x = MeanDecreaseGini, y = reorder(Variable, MeanDecreaseGini))) +
  geom_col(fill = "forestgreen", alpha = 0.8) +
  labs(title = "Random Forest - Variable Importance (Gini)",
       x = "Mean Decrease in Gini Impurity", y = NULL)

Let’s evaluate performance on the held-out test set.

rf_pred_class <- predict(rf_fit, newdata = df_test, type = "response")
rf_pred_prob  <- predict(rf_fit, newdata = df_test, type = "prob")[, "Success"]

rf_cm <- confusionMatrix(rf_pred_class, df_test$success, positive = "Success")
print(rf_cm)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Failure Success
#>    Failure     120       0
#>    Success       0      29
#>                                      
#>                Accuracy : 1          
#>                  95% CI : (0.9755, 1)
#>     No Information Rate : 0.8054     
#>     P-Value [Acc > NIR] : 9.846e-15  
#>                                      
#>                   Kappa : 1          
#>                                      
#>  Mcnemar's Test P-Value : NA         
#>                                      
#>             Sensitivity : 1.0000     
#>             Specificity : 1.0000     
#>          Pos Pred Value : 1.0000     
#>          Neg Pred Value : 1.0000     
#>              Prevalence : 0.1946     
#>          Detection Rate : 0.1946     
#>    Detection Prevalence : 0.1946     
#>       Balanced Accuracy : 1.0000     
#>                                      
#>        'Positive' Class : Success    
#> 

13.1.3 Gradient-Boosted Trees (XGBoost)

# XGBoost requires a numeric matrix
xgb_features <- c("age", "baseline_fvc", "fvc_6", "fvc_change_06",
                   "fvc_change_012", "qol_score", "walk_dist", "hospital_visits")

# One-hot encode categorical variables
dummies_train <- model.matrix(~ treatment + smoking_history + gender - 1, data = df_train)
dummies_test  <- model.matrix(~ treatment + smoking_history + gender - 1, data = df_test)

X_train <- cbind(as.matrix(df_train[, xgb_features]), dummies_train)
X_test  <- cbind(as.matrix(df_test[, xgb_features]),  dummies_test)
y_train <- as.integer(df_train$success == "Success")
y_test  <- as.integer(df_test$success  == "Success")

dtrain <- xgb.DMatrix(data = X_train, label = y_train)
dtest  <- xgb.DMatrix(data = X_test,  label = y_test)

set.seed(42)
xgb_fit <- xgb.train(
  params = list(
    objective   = "binary:logistic",
    eval_metric = "auc",
    eta         = 0.1,
    max_depth   = 4,
    subsample   = 0.8,
    colsample_bytree = 0.8
  ),
  data       = dtrain,
  nrounds    = 200,
  watchlist  = list(train = dtrain, test = dtest),
  verbose    = 0
)

13.1.4 XGBoost Variable Importance

xgb_imp <- xgb.importance(model = xgb_fit)

ggplot(xgb_imp[1:min(12, nrow(xgb_imp)), ],
       aes(x = Gain, y = reorder(Feature, Gain))) +
  geom_col(fill = "darkorange", alpha = 0.8) +
  labs(title = "XGBoost - Variable Importance (Gain)",
       x = "Fractional Gain", y = NULL)

13.1.5 ROC Curves: Logistic Regression vs. Random Forest vs. XGBoost

# Logistic predictions on the same test set
logit_refit <- glm(success ~ ., family = binomial,
                   data = df_train %>% mutate(success = as.integer(success == "Success")))
logit_prob  <- predict(logit_refit, newdata = df_test, type = "response")

xgb_prob <- predict(xgb_fit, newdata = dtest)

roc_logit <- roc(y_test, logit_prob, quiet = TRUE)
roc_rf    <- roc(y_test, rf_pred_prob, quiet = TRUE)
roc_xgb   <- roc(y_test, xgb_prob, quiet = TRUE)

roc_df <- bind_rows(
  tibble(FPR = 1 - roc_logit$specificities, TPR = roc_logit$sensitivities,
         Model = sprintf("Logistic  (AUC = %.3f)", auc(roc_logit))),
  tibble(FPR = 1 - roc_rf$specificities,    TPR = roc_rf$sensitivities,
         Model = sprintf("Random Forest (AUC = %.3f)", auc(roc_rf))),
  tibble(FPR = 1 - roc_xgb$specificities,   TPR = roc_xgb$sensitivities,
         Model = sprintf("XGBoost (AUC = %.3f)", auc(roc_xgb)))
)

ggplot(roc_df, aes(x = FPR, y = TPR, colour = Model)) +
  geom_line(linewidth = 1) +
  geom_abline(linetype = "dashed", colour = "grey50") +
  labs(title = "ROC Comparison: Parametric vs Model-Free Classifiers",
       x = "False Positive Rate (1 − Specificity)",
       y = "True Positive Rate (Sensitivity)") +
  theme(legend.position = c(0.65, 0.25),
        legend.background = element_rect(fill = "white", colour = "grey80"))

13.1.6 Interpretation

Because the true data-generating process for success is a simple threshold on a near-linear quantity (total FVC change), we expect the parametric logistic model to perform comparably to the tree ensembles; there is no hidden non-linearity to exploit.

This is itself an important lesson: model-free methods are not automatically superior; they shine when the true relationship is complex or unknown. The variable-importance plots, however, provide a valuable model-agnostic check: both Random Forest and XGBoost should confirm that fvc_change_012 and treatment are the dominant drivers of success, consistent with the logistic regression coefficients.

13.2 8B – Unsupervised Learning: t-SNE Embedding & k-Means Clustering

All preceding analyses were supervised, i.e., we specified an outcome and asked which predictors explained it.

Unsupervised learning asks a different question: Are there natural groupings (clusters) among patients based solely on their clinical profiles, without reference to any outcome label?

We use two complementary techniques:

  1. t-distributed Stochastic Neighbour Embedding (t-SNE) for non-linear dimensionality reduction. t-SNE maps each high-dimensional observation \(\mathbf{x}_i \in \mathbb{R}^p\) to a low-dimensional point \(\mathbf{y}_i \in \mathbb{R}^2\) by minimising the Kullback–Leibler divergence between pairwise similarity distributions in the original and embedded spaces

\[ \mathrm{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \]

where \(p_{ij}\) are Gaussian-kernel similarities in \(\mathbb{R}^p\) and \(q_{ij}\) are Student-\(t\) kernel similarities in \(\mathbb{R}^2\).

  2. k-Means Clustering partitions the \(n\) observations into \(k\) groups by minimising the within-cluster sum of squares

\[ \underset{C_1, \ldots, C_k}{\arg\min} \sum_{j=1}^{k} \sum_{i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2. \]

Method validation: The simulation created two partially overlapping populations defined by treatment arm, with RespiClear patients having higher FVC trajectories and QoL. An unsupervised method operating on the clinical features (without seeing the treatment label) should recover groupings that correlate with, but need not perfectly replicate, the treatment assignment.
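The KL objective above can be evaluated directly for toy discrete similarity distributions (hypothetical values, base R), which makes concrete the quantity that t-SNE drives down as it adjusts the embedding coordinates:

```r
# KL divergence between two small pairwise-similarity distributions
kl_div <- function(p, q) sum(p * log(p / q))

p <- c(0.7, 0.2, 0.1)   # similarities in the original space (toy values)
q <- c(0.5, 0.3, 0.2)   # similarities in the embedded space (toy values)
kl_div(p, q)            # about 0.085; shrinks as q approaches p
kl_div(p, p)            # exactly 0 when the embedding preserves similarities
```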

13.2.1 Feature Matrix

# Select clinical features only; exclude treatment label and text
features <- df_wide %>%
  transmute(
    age,
    baseline_fvc,
    fvc_0, fvc_6, fvc_12,
    fvc_change = fvc_12 - fvc_0,
    qol_score,
    walk_dist,
    hospital_visits,
    gender_num = as.integer(factor(gender)),
    smoking_num = as.integer(factor(smoking_history))
  )

# Scale to zero mean, unit variance (essential for distance-based methods)
features_scaled <- scale(features)

13.2.2 Choosing the hyper-parameter k: Elbow and Silhouette Plots

set.seed(42)
wss <- map_dbl(1:8, function(k) {
  kmeans(features_scaled, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

sil_avg <- map_dbl(2:8, function(k) {
  km <- kmeans(features_scaled, centers = k, nstart = 25, iter.max = 50)
  mean(cluster::silhouette(km$cluster, dist(features_scaled))[, "sil_width"])
})

p_elbow <- tibble(k = 1:8, WSS = wss) %>%
  ggplot(aes(k, WSS)) +
  geom_line(linewidth = 1) + geom_point(size = 2) +
  labs(title = "Elbow Plot", x = "Number of Clusters (k)", y = "Total Within-SS")

p_sil <- tibble(k = 2:8, Silhouette = sil_avg) %>%
  ggplot(aes(k, Silhouette)) +
  geom_line(linewidth = 1) + geom_point(size = 2) +
  labs(title = "Mean Silhouette Width", x = "k", y = "Avg Silhouette")

p_elbow + p_sil

Here is the k-means clustering with k = 2.

set.seed(42)
km2 <- kmeans(features_scaled, centers = 2, nstart = 50)
df_wide$cluster <- factor(km2$cluster)

And here is the t-SNE Embedding.

set.seed(42)
tsne_out <- Rtsne(features_scaled, dims = 2, perplexity = 30,
                  max_iter = 1000, check_duplicates = FALSE)

tsne_df <- tibble(
  tSNE1     = tsne_out$Y[, 1],
  tSNE2     = tsne_out$Y[, 2],
  treatment = df_wide$treatment,
  cluster   = df_wide$cluster,
  success   = factor(df_wide$success, labels = c("Failure", "Success"))
)

This plot shows the t-SNE embedding coloured by k-means cluster.

ggplot(tsne_df, aes(tSNE1, tSNE2, colour = cluster)) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(title = "t-SNE Embedding - Coloured by k-Means Cluster (k = 2)",
       colour = "Cluster") +
  scale_colour_brewer(palette = "Set1")

And here is the t-SNE plot coloured by treatment arm; note that the treatment label was not used in the clustering or embedding.

ggplot(tsne_df, aes(tSNE1, tSNE2, colour = treatment)) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(title = "t-SNE Embedding - Coloured by True Treatment Assignment",
       subtitle = "Treatment labels were NOT used in clustering or embedding",
       colour = "Treatment") +
  scale_colour_brewer(palette = "Dark2")

Alternatively, we can colour the t-SNE embedding by clinical outcome.

ggplot(tsne_df, aes(tSNE1, tSNE2, colour = success)) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(title = "t-SNE Embedding - Coloured by Clinical Outcome",
       colour = "Outcome") +
  scale_colour_manual(values = c("Failure" = "tomato", "Success" = "seagreen"))

13.2.3 Cluster-Label Agreement

How well do the unsupervised clusters align with the known treatment groups and clinical outcomes?

# Cluster vs Treatment
ct_treat <- table(Cluster = df_wide$cluster, Treatment = df_wide$treatment)
cat("Cluster × Treatment:\n")
#> Cluster × Treatment:
print(ct_treat)
#>        Treatment
#> Cluster RespiClear Standard Care
#>       1        122           137
#>       2        129           112
cat("\nCramér's V (cluster vs treatment):",
    round(sqrt(chisq.test(ct_treat)$statistic / (n * (min(dim(ct_treat)) - 1))), 3), "\n")
#> 
#> Cramér's V (cluster vs treatment): 0.06
# Cluster vs Success
ct_success <- table(Cluster = df_wide$cluster, Success = df_wide$success)
cat("\nCluster × Success:\n")
#> 
#> Cluster × Success:
print(ct_success)
#>        Success
#> Cluster   0   1
#>       1 219  40
#>       2 183  58
cat("Cramér's V (cluster vs success):",
    round(sqrt(chisq.test(ct_success)$statistic / (n * (min(dim(ct_success)) - 1))), 3), "\n")
#> Cramér's V (cluster vs success): 0.103

13.2.4 Cluster Profiles

df_wide %>%
  group_by(cluster) %>%
  summarise(
    n             = n(),
    age_mean      = mean(age),
    fvc_change    = mean(fvc_12 - fvc_0),
    qol_mean      = mean(qol_score),
    walk_mean     = mean(walk_dist),
    pct_respi     = mean(treatment == "RespiClear") * 100,
    pct_success   = mean(success) * 100,
    .groups = "drop"
  ) %>%
  kbl(digits = 1,
      caption = "Clinical Profile of k-Means Clusters (k = 2)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Clinical Profile of k-Means Clusters (k = 2)

| cluster | n | age_mean | fvc_change | qol_mean | walk_mean | pct_respi | pct_success |
|---|---|---|---|---|---|---|---|
| 1 | 259 | 60.7 | 0.1 | 86.2 | 236.7 | 47.1 | 15.4 |
| 2 | 241 | 58.3 | 0.2 | 85.2 | 263.4 | 53.5 | 24.1 |

13.2.5 Interpretation

Because the simulation encodes a strong treatment-driven signal in the FVC trajectory and QoL, the unsupervised pipeline should recover clusters that partially mirror the treatment arms, even though treatment labels were never provided to the algorithm. The degree of alignment (measured by Cramér’s V) quantifies how much of the overall patient-level variance is attributable to the treatment effect versus idiosyncratic noise.

Where the clusters diverge from treatment labels is equally informative: some Standard Care patients with high baseline FVC may cluster with RespiClear responders, highlighting that baseline physiology and treatment interact in ways that a simple two-group label cannot capture. This kind of hypothesis generation, identifying patient sub-phenotypes, is a hallmark strength of unsupervised ML in clinical research.

14 Summary

First we outline the Key Findings Across All Methods.

tibble(
  Method = c("MLR", "Logistic Regression", "Poisson GLM", "ANCOVA",
             "MANOVA", "MANCOVA", "rANOVA", "Text Mining", "Sensitivity"),
  `Signal Recovered` = c(
    "Age (+0.5/yr) and treatment (+8 QoL points) effects on QoL",
    "Strong treatment OR for clinical success",
    "Baseline FVC negatively associated with hospital visits (β ≈ −0.2)",
    "Significant treatment effect on 12-month FVC after baseline adjustment",
    "Significant joint treatment effect on QoL + Walk distance",
    "Treatment effect sharpened after adjusting for age",
    "Significant Treatment × Time interaction,  diverging trajectories",
    "Clear sentiment separation: positive words ↔ success; negative words ↔ failure",
    "Treatment effect robust across thresholds; MNAR dropout biases CC analysis"
  )
) %>%
  kbl(caption = "Summary of Encoded Signals and Recovery by Method") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Summary of Encoded Signals and Recovery by Method

| Method | Signal Recovered |
|---|---|
| MLR | Age (+0.5/yr) and treatment (+8 QoL points) effects on QoL |
| Logistic Regression | Strong treatment OR for clinical success |
| Poisson GLM | Baseline FVC negatively associated with hospital visits (β ≈ −0.2) |
| ANCOVA | Significant treatment effect on 12-month FVC after baseline adjustment |
| MANOVA | Significant joint treatment effect on QoL + Walk distance |
| MANCOVA | Treatment effect sharpened after adjusting for age |
| rANOVA | Significant Treatment × Time interaction, diverging trajectories |
| Text Mining | Clear sentiment separation: positive words ↔ success; negative words ↔ failure |
| Sensitivity | Treatment effect robust across thresholds; MNAR dropout biases CC analysis |

14.1 How the Data Were Loaded (Data Instrumentation)

The table below maps each simulation parameter to the method designed to detect it.

tibble(
  `Simulation Choice` = c(
    "QoL = 50 + 0.5·age + treatment_shift + ε",
    "success = I(ΔFVC > 0.4)",
    "visits ~ Poisson(exp(1.5 − 0.2·FVC_base))",
    "FVC₁₂ = f(treatment, FVC_base, ε)",
    "(QoL, Walk) jointly differ by treatment",
    "Age drives both QoL and Walk",
    "FVC improves for RespiClear, declines for SC",
    "Note adjectives ↔ outcome",
    "MNAR dropout mechanism"
  ),
  `Target Method` = c(
    "MLR", "Logistic Regression", "Poisson GLM", "ANCOVA",
    "MANOVA", "MANCOVA", "rANOVA", "Text Mining", "Sensitivity Analysis"
  ),
  `Expected Finding` = c(
    "β_age ≈ 0.5, β_treatment ≈ 8",
    "OR(Standard Care) << 1",
    "β_FVC ≈ −0.2",
    "Adjusted mean FVC higher for RespiClear",
    "Wilks' Λ significant",
    "Smaller Λ after age adjustment",
    "Significant interaction F-test",
    "t-test on sentiment significant",
    "CC estimate biased; MI recovers truth"
  )
) %>%
  kbl(caption = "Data-Loading Blueprint: How Simulation Parameters Map to Analytical Methods") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Data-Loading Blueprint: How Simulation Parameters Map to Analytical Methods

| Simulation Choice | Target Method | Expected Finding |
|---|---|---|
| QoL = 50 + 0.5·age + treatment_shift + ε | MLR | β_age ≈ 0.5, β_treatment ≈ 8 |
| success = I(ΔFVC > 0.4) | Logistic Regression | OR(Standard Care) << 1 |
| visits ~ Poisson(exp(1.5 − 0.2·FVC_base)) | Poisson GLM | β_FVC ≈ −0.2 |
| FVC₁₂ = f(treatment, FVC_base, ε) | ANCOVA | Adjusted mean FVC higher for RespiClear |
| (QoL, Walk) jointly differ by treatment | MANOVA | Wilks’ Λ significant |
| Age drives both QoL and Walk | MANCOVA | Smaller Λ after age adjustment |
| FVC improves for RespiClear, declines for SC | rANOVA | Significant interaction F-test |
| Note adjectives ↔ outcome | Text Mining | t-test on sentiment significant |
| MNAR dropout mechanism | Sensitivity Analysis | CC estimate biased; MI recovers truth |

15 Appendix: Full Wide Dataset Preview

df_wide %>%
  head(20) %>%
  select(-physician_notes) %>%
  kbl(digits = 2, caption = "First 20 rows of the simulated wide-format dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                font_size = 11) %>%
  scroll_box(width = "100%")
First 20 rows of the simulated wide-format dataset

| patient_id | age | gender | treatment | smoking_history | baseline_fvc | fvc_0 | fvc_6 | fvc_12 | qol_score | walk_dist | success | hospital_visits | cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 77 | Female | Standard Care | Former | 4.66 | 4.66 | 4.74 | 4.62 | 91.95 | 235.00 | 0 | 1 | 2 |
| 2 | 77 | Female | RespiClear | Current | 3.76 | 3.76 | 3.95 | 4.10 | 91.98 | 196.82 | 0 | 3 | 2 |
| 3 | 51 | Male | Standard Care | Former | 3.99 | 3.99 | 3.86 | 3.77 | 76.93 | 267.94 | 0 | 0 | 2 |
| 4 | 73 | Male | Standard Care | Current | 3.69 | 3.69 | 3.54 | 3.37 | 88.04 | 239.11 | 0 | 1 | 1 |
| 5 | 66 | Female | Standard Care | Former | 3.00 | 3.00 | 2.92 | 2.98 | 84.39 | 168.21 | 0 | 2 | 1 |
| 6 | 61 | Male | RespiClear | Never | 3.20 | 3.20 | 3.44 | 3.58 | 86.08 | 212.04 | 0 | 3 | 1 |
| 7 | 69 | Male | Standard Care | Never | 3.58 | 3.58 | 3.62 | 3.70 | 89.50 | 242.17 | 0 | 1 | 2 |
| 8 | 45 | Female | Standard Care | Former | 2.04 | 2.04 | 2.01 | 1.82 | 78.09 | 217.68 | 0 | 4 | 1 |
| 9 | 66 | Male | Standard Care | Former | 3.08 | 3.08 | 3.05 | 3.08 | 83.05 | 259.02 | 0 | 4 | 1 |
| 10 | 68 | Female | Standard Care | Former | 3.90 | 3.90 | 4.00 | 3.89 | 88.14 | 211.49 | 0 | 1 | 2 |
| 11 | 58 | Female | Standard Care | Never | 3.35 | 3.35 | 3.24 | 3.20 | 76.81 | 225.10 | 0 | 5 | 1 |
| 12 | 69 | Male | Standard Care | Former | 3.36 | 3.36 | 3.48 | 3.51 | 86.61 | 222.35 | 0 | 2 | 1 |
| 13 | 77 | Female | Standard Care | Never | 3.93 | 3.93 | 4.03 | 3.89 | 94.30 | 216.52 | 0 | 0 | 2 |
| 14 | 50 | Female | Standard Care | Former | 3.23 | 3.23 | 3.19 | 3.16 | 78.23 | 256.94 | 0 | 1 | 1 |
| 15 | 58 | Male | RespiClear | Current | 3.81 | 3.81 | 3.96 | 4.03 | 89.71 | 272.76 | 0 | 0 | 2 |
| 16 | 78 | Female | Standard Care | Former | 2.79 | 2.79 | 2.93 | 2.77 | 97.70 | 216.95 | 0 | 4 | 1 |
| 17 | 79 | Female | Standard Care | Never | 2.89 | 2.89 | 3.07 | 3.08 | 87.75 | 145.18 | 0 | 2 | 1 |
| 18 | 45 | Female | RespiClear | Former | 2.66 | 2.66 | 2.73 | 2.85 | 86.38 | 291.86 | 0 | 4 | 1 |
| 19 | 59 | Female | Standard Care | Never | 3.54 | 3.54 | 3.63 | 3.73 | 82.64 | 293.17 | 0 | 2 | 2 |
| 20 | 62 | Male | RespiClear | Current | 3.40 | 3.40 | 3.46 | 3.72 | 87.32 | 221.30 | 0 | 2 | 1 |

16 Session Information

This notebook generates a synthetic clinical case-study with a self-contained demonstration of seven families of statistical methods, plus model-free ML/AI techniques, applied to a single, coherently simulated clinical-trial dataset.

All data are synthetic and do not represent any real patients or treatments.
