
This DSPA Appendix explores the connections between text mining, natural language processing (NLP), text classification, and second-generation \(p\)-values (SGPVs); see the references at the end.

1 Introduction

Second-generation \(p\)-values were introduced by Blume et al. (2018) as an alternative to traditional \(p\)-values that addresses some of their key limitations.

SGPVs incorporate scientific relevance by defining an interval, or region, of practical equivalence (ROPE) and measuring the proportion of data-supported hypotheses that are also scientifically relevant. \(0\leq SGPV\leq 1\), where \(SGPV=0\) indicates that the data are incompatible with the null region (supporting only scientifically meaningful effects), \(SGPV=1\) indicates that the data are compatible only with practically null effects, and \(0< SGPV < 1\) indicates inconclusive evidence.

Much like the complex-time (kime) representation of repeated-measurement longitudinal modeling using kimesurfaces, SGPVs extend traditional statistical frameworks to capture more nuanced information in observed datasets. Kime representations use complex numbers to encode temporal patterns, while SGPVs use intervals to encode practical significance.

In practice, SGPVs can be integrated into model-based statistical approaches by defining ROPEs that incorporate prior knowledge about the model parameters and by using model-based confidence intervals instead of simple parametric distribution intervals, e.g., \(t\)-test intervals. Similarly, SGPVs could potentially be adapted for model-free AI prediction by using prediction intervals instead of confidence intervals, formulating ROPEs based on prediction-error tolerances, and incorporating uncertainty quantification in deep learning networks.
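A minimal sketch of this prediction-interval adaptation, assuming a simple linear model and a hypothetical prediction-error tolerance of \(\pm 0.5\) outcome units as the ROPE:

# Sketch: SGPV from a model-based 95% prediction interval (illustrative settings)
set.seed(1)
x <- rnorm(50); y <- 0.4 * x + rnorm(50)
fit <- lm(y ~ x)
# 95% prediction interval for a new observation at x = 1
pi_new <- predict(fit, newdata = data.frame(x = 1), interval = "prediction")
rope <- c(-0.5, 0.5)   # hypothetical prediction-error tolerance
overlap <- max(0, min(pi_new[1, "upr"], rope[2]) - max(pi_new[1, "lwr"], rope[1]))
sgpv_pred <- overlap / min(diff(rope), pi_new[1, "upr"] - pi_new[1, "lwr"])
sgpv_pred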

1.1 Mathematical Formulation of Second Generation P-values

The theoretical foundation of SGPVs is based on the ratio \(p_\delta\), which measures the overlap between data-supported values and practically equivalent values. The SGPV formalism extends naturally to multivariate settings, Bayesian frameworks, bootstrap methods, and asymptotic theory. SGPVs reduce to classical \(p\)-values as \(|\Delta| \to 0\) and support Bayesian interpretation through posterior distributions.

Let \(\Theta\) be the parameter space and \(\theta \in \Theta\) be the parameter of interest. The interval of data-supported values, denoted as \(I\), is defined as \[I = \{\theta \in \Theta : \theta \text{ is compatible with the observed data at level } \alpha\}.\]

For a typical confidence interval at confidence level \(1-\alpha\), \(I = [\hat{\theta} \pm c_{\alpha/2}\cdot SE(\hat{\theta})]\), where \(\hat{\theta}\) is the point estimate of the parameter, \(SE(\hat{\theta})\) is the standard error, and \(c_{\alpha/2}\) is the critical value of the corresponding model distribution, e.g., Student’s \(t\).

The region of practically equivalent values (ROPE), \(\Delta\), is defined as \(\Delta = [\delta_0, \delta_1]\), where \(\delta_0\) is the lower bound of practical equivalence and \(\delta_1\) is the upper bound of practical equivalence.

Then, we can define the Second Generation P-value (SGPV), \(p_\delta\), as \(p_\delta = \frac{|I \cap \Delta|}{\min(|I|, |\Delta|)}\), where \(|I \cap \Delta|\) is the length of the intersection between \(I\) and \(\Delta\), \(|I|\) is the length of the interval of data-supported values, and \(|\Delta|\) is the length of the interval of practically equivalent values.
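For example, a toy numeric check of this definition, using hypothetical interval endpoints:

# Toy SGPV example: I = [0.1, 0.5], Delta (ROPE) = [-0.2, 0.2]
I     <- c(0.1, 0.5)
Delta <- c(-0.2, 0.2)
overlap <- max(0, min(I[2], Delta[2]) - max(I[1], Delta[1]))  # |I ∩ Delta| = 0.1
p_delta <- overlap / min(diff(I), diff(Delta))                # 0.1 / 0.4 = 0.25
p_delta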

For multivariate parameters \(\boldsymbol{\theta} \in \mathbb{R}^p\), the SGPV is extended to \(p_\delta = \frac{\text{vol}(I \cap \Delta)}{\min(\text{vol}(I), \text{vol}(\Delta))}\), where \(\text{vol}(\cdot)\) denotes the volume in \(\mathbb{R}^p\), \(I\) is the confidence region, and \(\Delta\) is the multivariate ROPE.
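A Monte Carlo sketch of this volume-based extension in \(\mathbb{R}^2\), with an assumed elliptical confidence region and a box-shaped ROPE (all settings illustrative):

# Volume-based SGPV in R^2 via uniform Monte Carlo sampling
set.seed(2)
theta_hat <- c(0.15, 0.10); Sigma <- diag(c(0.02, 0.02))
u <- matrix(runif(2e5, -1, 1), ncol = 2)             # uniform points on [-1, 1]^2
d2 <- mahalanobis(u, center = theta_hat, cov = Sigma)
in_I     <- d2 <= qchisq(0.95, df = 2)               # inside 95% confidence ellipse
in_Delta <- abs(u[, 1]) <= 0.2 & abs(u[, 2]) <= 0.2  # inside box ROPE
# Under uniform sampling, volume ratios equal count ratios
p_delta <- sum(in_I & in_Delta) / min(sum(in_I), sum(in_Delta))
p_delta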

1.2 SGPV Properties

  1. Range: \(p_\delta \in [0,1]\)

  2. Interpretation: \(p_\delta = 0\) indicates no overlap between \(I\) and \(\Delta\), \(p_\delta = 1\) indicates complete containment of the smaller interval within the larger one, and \(0 < p_\delta < 1\) corresponds to a partial overlap.

  3. Asymptotic Properties: Under regularity conditions, as \(n \to \infty\), \(\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N(0, \Sigma)\), and the asymptotic distribution of \(p_\delta\) can be derived (see the simulation sketch following this list) as

\[P(p_\delta = 0) \to \begin{cases} 1 & \text{if } \theta \notin \Delta \\ 0 & \text{if } \theta \in \text{int}(\Delta) \end{cases} .\]
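A small simulation illustrating this limit, assuming normal data with true mean \(\theta = 0.5\) outside \(\Delta = [-0.2, 0.2]\), where the proportion of samples with \(p_\delta = 0\) should approach \(1\) as \(n\) grows:

# Fraction of simulated datasets with p_delta = 0, for increasing n
set.seed(3)
Delta <- c(-0.2, 0.2)
sgpv_once <- function(n) {
  ci <- t.test(rnorm(n, mean = 0.5))$conf.int  # 95% CI for the mean
  ov <- max(0, min(ci[2], Delta[2]) - max(ci[1], Delta[1]))
  ov / min(diff(ci), diff(Delta))
}
sapply(c(20, 100, 500), function(n) mean(replicate(2000, sgpv_once(n)) == 0))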

1.3 Decision Theoretic Framework

The loss function associated with SGPV is

\[L(\theta, \delta) = \begin{cases} 0 & \text{if } p_\delta \text{ leads to correct decision} \\ w_1 & \text{if } p_\delta \text{ leads to Type I error} \\ w_2 & \text{if } p_\delta \text{ leads to Type II error} \end{cases} ,\] where \(w_1\) and \(w_2\) are costs associated with the Type I (false-positive) and Type II (false-negative) errors.
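For illustration, the expected loss under this rule can be computed directly from assumed costs and error rates (all values hypothetical):

# Hypothetical expected loss: E[L] = w1 * P(Type I) + w2 * P(Type II)
w1 <- 1; w2 <- 4                  # e.g., false negatives 4x as costly
p_type1 <- 0.05; p_type2 <- 0.20  # assumed operating characteristics
expected_loss <- w1 * p_type1 + w2 * p_type2
expected_loss                     # 0.85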

1.4 Relationship to Classical P-values

For a point null hypothesis \(H_0: \theta = \theta_0\), as \(|\Delta| \to 0\) centered at \(\theta_0\), \(p_\delta \approx 2\min(p, 1-p)\), where \(p\) is the classical two-sided \(p\)-value.

Bootstrapped parameter estimates \(\{\hat{\theta}^{(b)}\}_{b=1}^B\) correspond with the following (bootstrap) intervals of data-supported values \[I_B = [\hat{\theta}_{(\alpha/2)}, \hat{\theta}_{(1-\alpha/2)}],\] where \(\hat{\theta}_{(q)}\) is the \(q\)th quantile of the bootstrap distribution.

Then, the bootstrap SGPV is \(p_{\delta,B} = \frac{|I_B \cap \Delta|}{\min(|I_B|, |\Delta|)}\).

Given a Bayesian posterior distribution \(\pi(\theta|data)\),

\[p_\delta \approx \min\left(1, \frac{\int_\Delta \pi(\theta|data)d\theta}{\pi_{max}|\Delta|}\right),\] where \(\pi_{max}\) is the maximum posterior density.
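A sketch of this posterior-based approximation, assuming a normal posterior \(N(0.25, 0.1^2)\) (hypothetical values) and \(\Delta = [-0.2, 0.2]\):

# Posterior-based SGPV approximation with an assumed normal posterior
post_mean <- 0.25; post_sd <- 0.1
Delta <- c(-0.2, 0.2)
post_mass <- pnorm(Delta[2], post_mean, post_sd) -
             pnorm(Delta[1], post_mean, post_sd)   # posterior mass over Delta
pi_max <- dnorm(post_mean, post_mean, post_sd)     # maximum posterior density
p_delta <- min(1, post_mass / (pi_max * diff(Delta)))
p_delta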

Finally, for complex models, the SGPV can be approximated using Monte Carlo methods using \[\hat{p}_\delta = \frac{\sum_{i=1}^M \mathbb{1}(\theta_i \in I \cap \Delta)}{\min(\sum_{i=1}^M \mathbb{1}(\theta_i \in I), \sum_{i=1}^M \mathbb{1}(\theta_i \in \Delta))},\] where \(\{\theta_i\}_{i=1}^M\) are draws from the sampling distribution or posterior.
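A direct Monte Carlo sketch of this estimator, using assumed normal draws in place of a real sampling or posterior distribution:

# Monte Carlo SGPV estimator from simulated draws
set.seed(4)
theta_draws <- rnorm(1e5, mean = 0.25, sd = 0.1)  # assumed draws
I     <- quantile(theta_draws, c(0.025, 0.975))   # data-supported interval
Delta <- c(-0.2, 0.2)
in_I     <- theta_draws >= I[1] & theta_draws <= I[2]
in_Delta <- theta_draws >= Delta[1] & theta_draws <= Delta[2]
p_delta_hat <- sum(in_I & in_Delta) / min(sum(in_I), sum(in_Delta))
p_delta_hat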

2 Example Applications

2.1 Medical Trial Example

The first example uses a simulation of a medical trial scenario to demonstrate a custom SGPV calculation function, bootstrap uncertainty estimation, visualization of results with ROPE, and a sensitivity analysis for different effect sizes. The key difference from traditional p-value analysis is that SGPVs explicitly incorporate domain knowledge through the ROPE specification and provide more nuanced information about the evidence against the null hypothesis.

This two-arm randomized controlled trial (RCT) simulation generates \(n=200\) participants per group (total \(N = 400\)) using a true \(treatment\_effect=0.3\) standard deviations. In medical research this is considered a small-to-moderate effect size, i.e., Cohen’s \(d = 0.3\) means the treatment increases the outcome by \(0.3\) standard deviations. The control group is Normally distributed, \(N(\mu = 0, \sigma = 1)\), and represents the baseline/placebo response, where the zero mean \(\mu= 0\) simplifies interpretation. The treatment group is Normally distributed, \(N(\mu = 0.3, \sigma = 1)\); the only group difference is the shifted mean, since the identical variability (\(\sigma = 1\)) in both groups reflects the homoscedasticity assumption.

The classical statistical analysis uses a two-independent-samples \(t\)-test to contrast outcome ~ group, i.e., to test whether the outcome differs by group. The null hypothesis is \(H_0: \mu_{treatment} = \mu_{control}\) and the alternative (research) hypothesis is \(H_1: \mu_{treatment} \not= \mu_{control}\).

This simulation setup mimics a common biostatistical or clinical trial scenario with assumptions of (i) an equal allocation ratio (1:1 randomization) and (ii) normality and equal variances. It applies to continuous outcome measures, e.g., blood pressure or a quality-of-life score, compared across two parallel groups (treatment vs. control).

In this case, the classical biostatistical concepts include (i) the effect size (\(0.3\)), reflecting the true clinical significance, (ii) the sample size (\(200\) per group), determining the statistical power to detect the grouping effect, (iii) random sampling, introducing natural variability, and (iv) an independent-groups design, assuming no correlation between the two arms of the study.

There is a direct connection between the classical \(t\)-test, which only indicates whether there is a statistically significant group difference, and SGPVs, which add consideration of clinical significance through the ROPE. The treatment effect of \(0.3\) may or may not fall within our practically equivalent range. The classical \(t\)-test confidence interval is extended to the interval of data-supported values.

# Load required packages
library(tidyverse)
library(boot)
library(plotly)

# Function to calculate SGPV
calculate_sgpv <- function(ci_lower, ci_upper, delta_lower, delta_upper) {
  # Interval length calculations
  ci_length <- ci_upper - ci_lower
  delta_length <- delta_upper - delta_lower
  
  # Find overlap
  overlap_lower <- max(ci_lower, delta_lower)
  overlap_upper <- min(ci_upper, delta_upper)
  
  if (overlap_upper <= overlap_lower) {
    return(0)  # No overlap
  } else {
    overlap_length <- overlap_upper - overlap_lower
    return(overlap_length / min(ci_length, delta_length))
  }
}

# Simulate data for a medical trial
set.seed(123)
n <- 200
treatment_effect <- 0.3  # True effect size
control <- rnorm(n, mean = 0, sd = 1)
treatment <- rnorm(n, mean = treatment_effect, sd = 1)

# Create data frame
trial_data <- data.frame(
  group = rep(c("Control", "Treatment"), each = n),
  outcome = c(control, treatment)
)

# Traditional analysis
t_test_result <- t.test(outcome ~ group, data = trial_data)

# Calculate confidence interval
ci <- t_test_result$conf.int

# Define interval of practical equivalence (ROPE)
delta_lower <- -0.2  # Minimum clinically important difference
delta_upper <- 0.2

# Calculate SGPV
sgpv <- calculate_sgpv(ci[1], ci[2], delta_lower, delta_upper)

# Bootstrap to assess uncertainty
boot_sgpv <- function(data, indices) {
  d <- data[indices,]
  t_res <- t.test(outcome ~ group, data = d)
  calculate_sgpv(t_res$conf.int[1], t_res$conf.int[2], delta_lower, delta_upper)
}

boot_results <- boot(trial_data, boot_sgpv, R = 1000)

# Visualization
# ggplot(trial_data, aes(x = group, y = outcome)) +
#   geom_violin(aes(fill = group), alpha = 0.5) +
#   geom_boxplot(width = 0.2, alpha = 0.8) +
#   geom_hline(yintercept = c(delta_lower, delta_upper), 
#              linetype = "dashed", color = "red") +
#   annotate("rect", xmin = -Inf, xmax = Inf, 
#            ymin = delta_lower, ymax = delta_upper,
#            alpha = 0.2, fill = "red") +
#   labs(title = "Treatment Effect with ROPE",
#        subtitle = paste("SGPV =", round(sgpv, 3)),
#        y = "Outcome",
#        x = "Group") +
#   theme_minimal() +
#   theme(legend.position = "none")

# Print results
cat("\nTraditional p-value:", t_test_result$p.value)
## 
## Traditional p-value: 0.0003380627
cat("\nSecond Generation p-value:", sgpv)
## 
## Second Generation p-value: 0.104854
cat("\nBootstrap CI for SGPV:", 
    quantile(boot_results$t, c(0.025, 0.975)))
## 
## Bootstrap CI for SGPV: 0 0.60071
# Additional analysis: Effect size sensitivity
effect_sizes <- seq(0, 1, by = 0.1)
sgpvs <- numeric(length(effect_sizes))

for(i in seq_along(effect_sizes)) {
  treatment_temp <- rnorm(n, mean = effect_sizes[i], sd = 1)
  data_temp <- data.frame(
    group = rep(c("Control", "Treatment"), each = n),
    outcome = c(control, treatment_temp)
  )
  
  t_res <- t.test(outcome ~ group, data = data_temp)
  sgpvs[i] <- calculate_sgpv(t_res$conf.int[1], 
                            t_res$conf.int[2],
                            delta_lower, delta_upper)
}

# # Plot SGPV sensitivity
# ggplot(data.frame(effect_size = effect_sizes, sgpv = sgpvs),
#        aes(x = effect_size, y = sgpv)) +
#   geom_line() +
#   geom_point() +
#   labs(title = "SGPV Sensitivity to Effect Size",
#        x = "True Effect Size",
#        y = "Second Generation P-value") +
#   theme_minimal()
# 
# # Create violin and box plots with plotly
# library(plotly)

# Calculate kernel density for violin plots
density_control <- density(trial_data$outcome[trial_data$group == "Control"])
density_treatment <- density(trial_data$outcome[trial_data$group == "Treatment"])

# Create data frames for violin plots
violin_control <- data.frame(
  x = rep("Control", length(density_control$x)),
  y = density_control$x,
  density = density_control$y
)

violin_treatment <- data.frame(
  x = rep("Treatment", length(density_treatment$x)),
  y = density_treatment$x,   # outcome grid (was incorrectly density_treatment$y)
  density = density_treatment$y
)

# Calculate kernel density for violin plots with proper scaling
get_violin_data <- function(x, group_name) {
    dens <- density(x)
    data.frame(
        x = rep(group_name, length(dens$x)),
        y = dens$x,
        density = dens$y
    )
}

# Create violin plot data
control_data <- get_violin_data(
    trial_data$outcome[trial_data$group == "Control"], 
    "Control"
)
treatment_data <- get_violin_data(
    trial_data$outcome[trial_data$group == "Treatment"], 
    "Treatment"
)

# Calculate box plot statistics
box_stats <- function(x) {
    q <- quantile(x, probs = c(0.25, 0.5, 0.75))
    iqr <- q[3] - q[1]
    list(
        lower = max(min(x), q[1] - 1.5 * iqr),
        q1 = q[1],
        median = q[2],
        q3 = q[3],
        upper = min(max(x), q[3] + 1.5 * iqr)
    )
}

# Create the plot
p <- plot_ly() %>%
    # Add violin plots
    add_trace(
        data = control_data,
        type = "violin",
        x = ~x,
        y = ~y,
        name = "Control",
        side = "both",  # Show both sides
        points = FALSE,
        fillcolor = "rgba(31, 119, 180, 0.5)",
        line = list(color = "rgba(31, 119, 180, 0.8)"),
        meanline = list(visible = TRUE),
        showlegend = FALSE,
        spanmode = "soft"  # Smoother violin edges
    ) %>%
    add_trace(
        data = treatment_data,
        type = "violin",
        x = ~x,
        y = ~y,
        name = "Treatment",
        side = "both",  # Show both sides
        points = FALSE,
        fillcolor = "rgba(255, 127, 14, 0.5)",
        line = list(color = "rgba(255, 127, 14, 0.8)"),
        meanline = list(visible = TRUE),
        showlegend = FALSE,
        spanmode = "soft"  # Smoother violin edges
    ) %>%
    # Add box plots
    add_boxplot(
        data = trial_data,
        x = ~group,
        y = ~outcome,
        boxpoints = FALSE,
        width = 0.2,
        fillcolor = "rgba(255, 255, 255, 0.8)",
        line = list(color = "rgba(0, 0, 0, 0.8)"),
        showlegend = FALSE
    ) %>%
    # Add ROPE region
    add_trace(
        type = "scatter",
        x = c("Control", "Treatment"),
        y = rep(delta_lower, 2),
        mode = "lines",
        line = list(color = "red", dash = "dash"),
        showlegend = FALSE,
        hoverinfo = "text",
        text = "Lower ROPE bound"
    ) %>%
    add_trace(
        type = "scatter",
        x = c("Control", "Treatment"),
        y = rep(delta_upper, 2),
        mode = "lines",
        line = list(color = "red", dash = "dash"),
        showlegend = FALSE,
        hoverinfo = "text",
        text = "Upper ROPE bound"
    ) %>%
    # Layout
    layout(
        shapes = list(
            list(
                type = "rect",
                x0 = -0.5,
                x1 = 2.5,
                y0 = delta_lower,
                y1 = delta_upper,
                fillcolor = "red",
                opacity = 0.2,
                line = list(color = "transparent")
            )
        ),
        title = list(
            text = paste("Treatment Effect with ROPE<br>",
                        "<sup>SGPV =", round(sgpv, 3), "</sup>"),
            font = list(size = 16)
        ),
        xaxis = list(
            title = "Group",
            zeroline = FALSE
        ),
        yaxis = list(
            title = "Outcome",
            zeroline = FALSE
        ),
        hovermode = "closest"
    )

# Add hover information for individual points
p <- p %>% add_trace(
    data = trial_data,
    type = "scatter",
    mode = "markers",
    x = ~group,
    y = ~outcome,
    marker = list(size = 1, opacity = 0),
    hoverinfo = "text",
    text = ~sprintf(
        "Group: %s<br>Value: %.2f",
        group,
        outcome
    ),
    showlegend = FALSE
)

# Add bootstrap CI information if available
if(exists("boot_results")) {
    boot_ci <- boot.ci(boot_results, type="perc")
    boot_text <- sprintf(
        "Bootstrap 95%% CI for SGPV: (%.3f, %.3f)",
        boot_ci$percent[4],
        boot_ci$percent[5]
    )
    
    p <- p %>% layout(
        annotations = list(
            list(
                x = 0.5,
                y = max(trial_data$outcome) + 0.1,
                text = boot_text,
                showarrow = FALSE,
                font = list(size = 12)
            )
        )
    )
}
p
# Create sensitivity plot with plotly

# Create data frame with additional information
sensitivity_df <- data.frame(
    effect_size = effect_sizes,
    sgpv = sgpvs
)

# Calculate zero crossing and key points
zero_cross <- approx(sensitivity_df$sgpv, sensitivity_df$effect_size, 
                    xout = 0.05)$y # Find where SGPV crosses 0.05

# Create interactive plot
p <- plot_ly(sensitivity_df) %>%
    # Add line
    add_trace(
        x = ~effect_size,
        y = ~sgpv,
        type = 'scatter',
        mode = 'lines',
        line = list(
            color = 'rgb(31, 119, 180)',
            width = 2
        ),
        name = 'SGPV'
    ) %>%
    # Add points
    add_trace(
        x = ~effect_size,
        y = ~sgpv,
        type = 'scatter',
        mode = 'markers',
        marker = list(
            size = 8,
            color = 'rgb(31, 119, 180)',
            line = list(
                color = 'white',
                width = 1
            )
        ),
        name = 'Effect Size Points',
        text = ~sprintf(
            "Effect Size: %.2f<br>SGPV: %.3f",
            effect_size, sgpv
        ),
        hoverinfo = 'text'
    ) %>%
    # Add reference line for SGPV = 0.05
    add_trace(
        x = range(effect_sizes),
        y = c(0.05, 0.05),
        type = 'scatter',
        mode = 'lines',
        line = list(
            color = 'red',
            dash = 'dash',
            width = 1
        ),
        name = 'SGPV Threshold (0.05)',
        hoverinfo = 'name'
    ) %>%
    # Layout
    layout(
        title = list(
            text = "SGPV Sensitivity to Effect Size",
            font = list(size = 16)
        ),
        xaxis = list(
            title = "True Effect Size",
            zeroline = TRUE,
            zerolinecolor = '#969696',
            gridcolor = '#E2E2E2',
            range = c(min(effect_sizes), max(effect_sizes))
        ),
        yaxis = list(
            title = "Second Generation P-value",
            zeroline = TRUE,
            zerolinecolor = '#969696',
            gridcolor = '#E2E2E2',
            range = c(0, 1)
        ),
        # Add shaded regions for interpretation
        shapes = list(
            # Region of strong evidence (SGPV < 0.05)
            list(
                type = "rect",
                fillcolor = "rgba(144, 238, 144, 0.3)",
                line = list(color = "transparent"),
                x0 = min(effect_sizes),
                x1 = max(effect_sizes),
                y0 = 0,
                y1 = 0.05
            ),
            # Region of uncertainty (SGPV > 0.05)
            list(
                type = "rect",
                fillcolor = "rgba(255, 182, 193, 0.3)",
                line = list(color = "transparent"),
                x0 = min(effect_sizes),
                x1 = max(effect_sizes),
                y0 = 0.05,
                y1 = 1
            )
        ),
        # Add annotations
        annotations = list(
            # Mark the zero crossing point
            list(
                x = zero_cross,
                y = 0.05,
                text = sprintf("Effect Size = %.2f", zero_cross),
                showarrow = TRUE,
                arrowhead = 2,
                ax = 0,
                ay = -40
            ),
            # Label regions
            list(
                x = max(effect_sizes)/4,
                y = 0.02,
                text = "Strong Evidence",
                showarrow = FALSE,
                font = list(size = 12, color = "darkgreen")
            ),
            list(
                x = max(effect_sizes)/4,
                y = 0.5,
                text = "Uncertain Evidence",
                showarrow = FALSE,
                font = list(size = 12, color = "darkred")
            )
        ),
        hovermode = 'closest',
        showlegend = TRUE,
        legend = list(
            x = 0.05,
            y = 0.95,
            bgcolor = 'rgba(255, 255, 255, 0.9)'
        )
    )
p

2.2 SGPVs and Text Classification Example

Next, we show SGPVs in the context of text-classification, specifically for evaluating classification confidence and decision boundaries. This demonstration showcases the following aspects of SGPVs:

  1. Probabilistic SGPV: Instead of using traditional confidence intervals for means, we adapt SGPVs to work with classification probabilities. The ROPE becomes a decision-boundary region (set to \([0.1, 0.9]\) in the code below) where the class assignment is uncertain; high SGPVs indicate classification uncertainty.

  2. Bootstrap Uncertainty: We can use bootstrap resampling to estimate confidence intervals for predicted probabilities. This captures model uncertainty in a more robust way than standard probability scores.

  3. Accuracy-Coverage Tradeoff: The estimated SGPVs provide a natural way to implement selective classification; tuning the SGPV threshold trades accuracy off against coverage, and lower thresholds mean predictions are made only when the classifier is very confident (SGPV near \(0\)).

  4. Feature Relevance: The \(L_1\) regularization (LASSO) helps identify important text features, providing a framework for analyzing SGPVs for feature importance.

SGPVs provide more nuanced uncertainty quantification than traditional confidence scores and naturally handle the decision-boundary region, offering a principled way to implement “refuse to predict” options.

In the following SGPV text-classification example, the study design involves \(1,000\) (synthetically generated) text documents comprising two topic types

  • \(tech= \{"computer", "software", "data",\cdots\}\), and
  • \(health = \{"medical", "health", "patient",\cdots\}\).

To mimic a supervised learning scenario for topic classification, we simulate a binary text-classification task for the \(1,000\) text documents (\(500\) per class, i.e., per topic type). Each document is synthetically generated from a class-specific vocabulary.

The text-processing pipeline includes word (1-gram) tokenization, conversion to lower case, and text vectorization (bag-of-words representation). We then construct document-term matrices (DTMs), which transform the raw text into numerical feature matrices.

The statistical inference is based on a regularized generalized linear model. Specifically, we fit a binary classification (tech vs. health) LASSO model, i.e., \(L_1\)-penalized logistic regression with cross-validation to choose the regularization strength, and obtain sparse feature selection through the \(L_1\) penalty.

The SGPV integration utilizes a custom calculate_classification_sgpv() function with a ROPE probability interval (\([0.1, 0.9]\) in the code below) representing the “uncertainty region” for the classification, where predictions whose confidence intervals overlap this range are considered ambiguous. The confidence intervals quantifying the uncertainty are estimated via bootstrap to capture the model uncertainty and account for feature instability.

Classical statistical classification trades off confidence against uncertainty. For classification SGPVs, the parameter space becomes the probability scale, and the ROPE range becomes the decision-boundary region, enabling “refuse to predict” inference.

# Required packages
library(tidyverse)
library(tidytext)
library(text2vec)
library(glmnet)
library(boot)
library(plotly)

# Function to calculate SGPV for classification probabilities
calculate_classification_sgpv <- function(prob_ci_lower, prob_ci_upper, 
                                        decision_lower = 0.1, 
                                        decision_upper = 0.9) {
  # Calculate overlap between confidence interval and decision boundary region
  ci_length <- prob_ci_upper - prob_ci_lower
  boundary_length <- decision_upper - decision_lower
  
  overlap_lower <- max(prob_ci_lower, decision_lower)
  overlap_upper <- min(prob_ci_upper, decision_upper)
  
  if (overlap_upper <= overlap_lower) {
    return(0)
  } else {
    overlap_length <- overlap_upper - overlap_lower
    return(overlap_length / min(ci_length, boundary_length))
  }
}

# Generate synthetic text data
set.seed(123)
n_samples <- 1000

# Create synthetic documents with specific topics
generate_document <- function(topic) {
  words <- switch(topic,
    "tech" = sample(c("computer", "software", "data", "algorithm", "network",
                     "cloud", "system", "code", "digital", "technology"),
                   size = sample(5:15, 1), replace = TRUE),
    "health" = sample(c("medical", "health", "patient", "treatment", "doctor",
                       "hospital", "disease", "medicine", "clinical", "therapy"),
                     size = sample(5:15, 1), replace = TRUE)
  )
  paste(words, collapse = " ")
}

# Generate balanced dataset
documents <- data.frame(
  text = c(replicate(n_samples/2, generate_document("tech")),
           replicate(n_samples/2, generate_document("health"))),
  label = factor(rep(c("tech", "health"), each = n_samples/2)),
  stringsAsFactors = FALSE
)

# Text preprocessing and feature extraction
it <- itoken(documents$text, preprocessor = tolower, 
            tokenizer = word_tokenizer)
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)

# Split data
set.seed(456)
train_idx <- sample(nrow(documents), 0.7 * nrow(documents))
X_train <- dtm[train_idx, ]
X_test <- dtm[-train_idx, ]
y_train <- documents$label[train_idx]
y_test <- documents$label[-train_idx]

# Train logistic regression with L1 regularization
cv_glmnet <- cv.glmnet(X_train, y_train, family = "binomial", alpha = 1)
model <- glmnet(X_train, y_train, family = "binomial", 
                lambda = cv_glmnet$lambda.min)

# Improved confidence interval calculation with model uncertainty
get_prob_ci <- function(x, model, R = 500) {
  n <- nrow(X_train)
  boot_probs <- matrix(NA, nrow = R, ncol = nrow(x))
  
  # Add noise to lambda to increase uncertainty
  lambda_noise <- 0.2  # 20% variation in regularization
  
  for(i in 1:R) {
    boot_idx <- sample(n, replace = TRUE)
    
    # Add some noise to lambda
    noisy_lambda <- cv_glmnet$lambda.min * (1 + rnorm(1, 0, lambda_noise))
    
    tryCatch({
      # Fit model with noisy lambda
      boot_model <- glmnet(X_train[boot_idx,], y_train[boot_idx], 
                          family = "binomial", lambda = noisy_lambda)
      
      # Add small random noise to predictions
      preds <- predict(boot_model, x, type = "response")[,1]
      preds <- preds + rnorm(length(preds), 0, 0.05)  # Add 5% noise
      preds <- pmin(pmax(preds, 0.001), 0.999)  # Bound between 0 and 1
      boot_probs[i,] <- preds
    }, error = function(e) {
      warning("Bootstrap iteration ", i, " failed: ", e$message)
    })
  }
  
  # Calculate wider confidence intervals (99%)
  ci_bounds <- t(apply(boot_probs, 2, quantile, 
                      probs = c(0.005, 0.995), 
                      na.rm = TRUE))
  
  return(ci_bounds)
}

# Calculate probabilities and SGPVs for test set
test_probs <- predict(model, X_test, type = "response")[,1]
test_cis <- get_prob_ci(X_test, model)
test_sgpvs <- mapply(calculate_classification_sgpv,
                     test_cis[,1], test_cis[,2])

# Create results dataframe
results <- data.frame(
  true_label = y_test,
  probability = test_probs,
  ci_lower = test_cis[,1],
  ci_upper = test_cis[,2],
  sgpv = test_sgpvs,
  row.names = NULL
)

# Function to evaluate performance by SGPV threshold
evaluate_by_sgpv <- function(results, sgpv_threshold) {
  predictions <- ifelse(results$sgpv > sgpv_threshold, 
                       "uncertain",
                       ifelse(results$probability >= 0.5, "tech", "health"))
  
  certain_idx <- predictions != "uncertain"
  accuracy <- mean(predictions[certain_idx] == results$true_label[certain_idx])
  coverage <- mean(certain_idx)
  
  return(c(accuracy = accuracy, coverage = coverage))
}

# Calculate performance metrics
sgpv_thresholds <- seq(0, 1, by = 0.1)
performance <- t(sapply(sgpv_thresholds, evaluate_by_sgpv, results = results))
performance_df <- as.data.frame(performance)
performance_df$threshold <- sgpv_thresholds

# Create custom hover text for plotly
performance_df$hover_text <- sprintf(
  "SGPV Threshold: %.1f<br>Accuracy: %.3f<br>Coverage: %.3f",
  performance_df$threshold,
  performance_df$accuracy,
  performance_df$coverage
)

# Create interactive plot with plotly
p <- plot_ly(performance_df) %>%
  add_trace(
    x = ~coverage,
    y = ~accuracy,
    type = 'scatter',
    mode = 'lines+markers+text',
    line = list(color = '#1f77b4', width = 2),
    marker = list(
      size = 20,
      color = '#1f77b4',
      line = list(color = 'white', width = 1)
    ),
    text = ~sprintf("%.1f", threshold),
    textposition = 'top',
    hovertext = ~hover_text,
    hoverinfo = 'text',
    showlegend = FALSE
  ) %>%
  layout(
    title = list(
      text = "Accuracy-Coverage Tradeoff by SGPV Threshold",
      font = list(size = 20)
    ),
    xaxis = list(
      title = "Coverage (Proportion of Certain Predictions)",
      gridcolor = '#E2E2E2',
      zerolinecolor = '#969696',
      zerolinewidth = 1,
      tickformat = ".2f",
      range = c(0, 1)
    ),
    yaxis = list(
      title = "Accuracy on Certain Predictions",
      gridcolor = '#E2E2E2',
      zerolinecolor = '#969696',
      zerolinewidth = 1,
      tickformat = ".2f",
      range = c(0.5, 1)
    ),
    plot_bgcolor = 'white',
    paper_bgcolor = 'white',
    hovermode = 'closest',
    shapes = list(
      list(
        type = "rect",
        fillcolor = "rgba(144, 238, 144, 0.5)",
        line = list(color = "transparent"),
        x0 = 0, x1 = 0.5,
        y0 = 0.8, y1 = 1,
        name = "Conservative"
      ),
      list(
        type = "rect",
        fillcolor = "rgba(255, 255, 0, 0.5)",
        line = list(color = "transparent"),
        x0 = 0.5, x1 = 0.8,
        y0 = 0.7, y1 = 0.9,
        name = "Balanced"
      ),
      list(
        type = "rect",
        fillcolor = "rgba(255, 182, 193, 0.5)",
        line = list(color = "transparent"),
        x0 = 0.8, x1 = 1,
        y0 = 0.6, y1 = 0.8,
        name = "Aggressive"
      )
    ),
    annotations = list(
      list(
        x = 0.25,
        y = 0.9,
        text = "Conservative",
        showarrow = FALSE,
        font = list(size = 12, color = "darkgreen")
      ),
      list(
        x = 0.65,
        y = 0.8,
        text = "Balanced",
        showarrow = FALSE,
        font = list(size = 12, color = "darkgoldenrod")
      ),
      list(
        x = 0.9,
        y = 0.7,
        text = "Aggressive",
        showarrow = FALSE,
        font = list(size = 12, color = "darkred")
      )
    )
  ) %>%
  config(
    displayModeBar = TRUE,
    modeBarButtonsToRemove = list(
      'select2d', 'lasso2d', 'zoomIn2d', 'zoomOut2d', 'autoScale2d',
      'hoverCompareCartesian', 'hoverClosestCartesian'
    )
  )

# Display the plot
p
# Calculate overall accuracy and other metrics
predictions <- ifelse(test_probs > 0.5, "tech", "health")
overall_accuracy <- mean(predictions == y_test)

# Print summary statistics
cat("\nSummary Statistics:")
## 
## Summary Statistics:
cat("\n-----------------")
## 
## -----------------
cat("\nOverall accuracy:", round(overall_accuracy, 3))
## 
## Overall accuracy: 1
cat("\nMean SGPV:", round(mean(test_sgpvs), 3))
## 
## Mean SGPV: 0.199
cat("\nProportion of confident predictions (SGPV < 0.1):", 
    round(mean(test_sgpvs < 0.1), 3))
## 
## Proportion of confident predictions (SGPV < 0.1): 0.06
# Additional diagnostics
cat("\n\nConfidence Interval Widths:")
## 
## 
## Confidence Interval Widths:
ci_widths <- test_cis[,2] - test_cis[,1]
print(summary(ci_widths))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09579 0.11799 0.12396 0.12435 0.13000 0.16548
cat("\nSGPV Distribution:")
## 
## SGPV Distribution:
print(summary(test_sgpvs))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1609  0.2014  0.1988  0.2385  0.4018

2.3 Classical vs. SGPV fMRI Analysis

Next, we contrast SGPVs against classical parametric statistical inference using synthetic fMRI data. This simulation shows how SGPVs can be applied to fMRI analysis, comparing them with classical inference. This is particularly interesting, as fMRI analysis typically involves multiple comparisons and an extremely low signal-to-noise ratio (SNR). Real fMRI data could be used, but for simplicity we generate fMRI time series in a block design with alternating finger-tapping/rest periods, convolved with the hemodynamic response function (HRF) that governs the delayed fMRI/BOLD signal response. The noise follows a temporally autocorrelated \(AR(1)\) process with a realistic signal-to-noise ratio.

Classical fMRI analyses use a voxel-wise GLM with standard \(t\)-tests and \(p\)-values in a multiple-comparison setting (over millions of voxels in the brain). The SGPV analysis estimates the region of practical equivalence for the effect sizes and the corresponding confidence-interval overlaps, and quantifies the uncertainty. While classical \(p\)-values only consider the null hypothesis, SGPVs incorporate practical significance and handle the process uncertainty differently.

The experimental design in this simulation uses a block design with alternating task/rest periods (\(20s\) blocks). The hemodynamic response function (HRF) is modeled as a gamma density, \(h(t) \propto t^{a-1} e^{-t/b}\), with shape \(a=6\) and scale \(b=1\), normalized to unit peak. The design convolution is \(x_{conv}(t) = (x * h)(t)\), where \(x\) is the boxcar function, and the design matrix \(X = [1, x_{conv}]\) includes an intercept.

The synthetic fMRI data generation assumes the following spatial properties: \(100\) voxels with \(30\%\) activation probability, effect sizes \(\beta\sim N(\mu=0.5, \sigma^2=0.2^2)\) for active voxels, and background effects \(\beta\sim N(\mu=0, \sigma^2=0.1^2)\) for inactive voxels. The temporal properties include an \(AR(1)\) noise process \(\epsilon_t = 0.3\epsilon_{t-1} + \eta_t\), where \(\eta_t \sim N(0,1)\), a signal-to-noise ratio \(SNR= 0.5\), and the time-series model \(y_t = \beta x_{conv}(t) + \epsilon_t\).

The classical statistical analysis uses a general linear model for each voxel, \(Y = X\beta + \epsilon\), estimates the effects \(\hat{\beta}\) by ordinary least squares, and computes \(t\)-statistics \(t = \frac{\hat{\beta}}{\text{SE}(\hat{\beta})}\). Multiple-comparison corrections include the Bonferroni adjustment, \(p_{bonf} = \min(np, 1)\), and the Benjamini-Hochberg false discovery rate (FDR) procedure; for scoring, the test statistics are transformed as \(-\log_{10}(p)\).

For the second-generation \(p\)-values (SGPVs), confidence intervals for each voxel are estimated as \(CI = [\hat{\beta} \pm t_{\alpha/2}\text{SE}(\hat{\beta})]\), the region of practical equivalence (ROPE) is set to \([-\delta, \delta]\) with \(\delta = 0.2\), and the SGPV calculation uses \(\text{SGPV} = \frac{|I \cap \Delta|}{\min(|I|, |\Delta|)}\), where \(I\) is the \(CI\), \(\Delta\) is the ROPE, \(|\cdot|\) denotes interval length, and the test statistic is \(1 - SGPV\) (higher values indicate stronger evidence).

In the ROC analysis, the ground truth \(|\beta| > 0.3\) defines true activation. For each method (classical \(p\)-values, SGPV), the predictor score is the log-transformed test statistic, the binary classification uses varying thresholds, \(TPR = P(test+ \mid true+)\), and \(FPR = P(test+ \mid true-)\). Also, the AUC calculation is based on \(\text{AUC} = P(S_{pos} > S_{neg})\), where \(S\) are the scores for positive/negative cases, estimated non-parametrically via empirical distributions.

The performance evaluation includes primary metrics, e.g., the area under the ROC curve (AUC), sensitivity and specificity at the median threshold, and the distribution of test statistics by activation status. The cross-method comparison of relative performance in different SNR regimes contrasts the classical \(p\)-values (raw, Bonferroni, FDR) against the SGPV approach.

# Required packages
library(tidyverse)
library(plotly)
library(pROC)

#===============================
# Simulation Parameters
#===============================
set.seed(123)
n_timepoints <- 200  # Number of TRs
tr <- 2  # TR in seconds
duration <- n_timepoints * tr  # Total duration in seconds
n_voxels <- 100  # Number of voxels in ROI

#===============================
# Design Matrix Generation
#===============================
create_design <- function(n_timepoints, tr) {
    # Create timing for block design
    block_duration <- 20  # seconds per block
    blocks_per_condition <- floor((n_timepoints * tr) / block_duration / 2)
    
    # Create boxcar function
    design <- rep(c(rep(1, floor(block_duration/tr)), 
                   rep(0, floor(block_duration/tr))), 
                 blocks_per_condition)
    
    # Ensure correct length
    design <- design[1:n_timepoints]
    
    # Create HRF (simple gamma function)
    t <- seq(0, 30, by=tr)
    hrf <- dgamma(t, shape=6, scale=1)
    hrf <- hrf/max(hrf)
    
    # Convolve design with HRF
    design_conv <- stats::filter(design, hrf, sides=2)
    design_conv[is.na(design_conv)] <- 0  # Replace NAs
    design_conv <- scale(design_conv)  # Standardize
    
    return(list(
        boxcar = design,
        convolved = as.numeric(design_conv)
    ))
}

#===============================
# fMRI Data Generation
#===============================
generate_fmri_data <- function(design, n_voxels, snr=0.5) {
    # Matrix for data storage
    data <- matrix(0, nrow=n_timepoints, ncol=n_voxels)
    true_activation <- numeric(n_voxels)
    
    for(v in 1:n_voxels) {
        # Determine if voxel is active (30% probability)
        is_active <- runif(1) < 0.3
        
        # More realistic effect sizes for active voxels
        beta <- if(is_active) rnorm(1, 0.5, 0.2) else rnorm(1, 0, 0.1)
        true_activation[v] <- beta
        
        # Generate temporally correlated noise (AR(1) process)
        noise <- arima.sim(list(ar=0.3), n=n_timepoints)
        noise <- scale(noise)  # Standardize noise
        
        # Combine signal and noise with realistic SNR
        signal <- beta * design$convolved
        data[,v] <- signal + noise/snr
    }
    
    return(list(
        data = data,
        true_activation = true_activation
    ))
}

#===============================
# Classical Analysis
#===============================
classical_analysis <- function(data, design) {
    n_voxels <- ncol(data)
    results <- matrix(NA, nrow=n_voxels, ncol=5)
    colnames(results) <- c("beta", "t_stat", "p_value", "p_bonf", "p_fdr")
    
    # Ensure design is properly formatted
    X <- cbind(1, design$convolved)  # Design matrix with intercept
    
    # First pass: compute basic statistics
    for(v in 1:n_voxels) {
        y <- data[,v]
        if(all(is.finite(y))) {
            fit <- lm(y ~ design$convolved)
            results[v, 1:3] <- c(
                coef(fit)[2],
                summary(fit)$coefficients[2,3],
                summary(fit)$coefficients[2,4]
            )
        }
    }
    
    # Compute multiple comparison corrections
    results[,"p_bonf"] <- p.adjust(results[,"p_value"], method="bonferroni")
    results[,"p_fdr"] <- p.adjust(results[,"p_value"], method="fdr")
    
    return(results)
}

#===============================
# SGPV Analysis
#===============================
sgpv_analysis <- function(data, design, delta=0.2) {
    n_voxels <- ncol(data)
    results <- matrix(NA, nrow=n_voxels, ncol=4)
    colnames(results) <- c("beta", "ci_lower", "ci_upper", "sgpv")
    
    for(v in 1:n_voxels) {
        y <- data[,v]
        if(all(is.finite(y))) {
            fit <- lm(y ~ design$convolved)
            ci <- confint(fit)[2,]
            
            # Calculate SGPV
            rope <- c(-delta, delta)
            ci_length <- diff(ci)
            rope_length <- diff(rope)
            
            overlap_lower <- max(ci[1], rope[1])
            overlap_upper <- min(ci[2], rope[2])
            
            if(overlap_upper <= overlap_lower) {
                sgpv <- 0
            } else {
                overlap_length <- overlap_upper - overlap_lower
                sgpv <- overlap_length / min(ci_length, rope_length)
            }
            
            results[v,] <- c(coef(fit)[2], ci, sgpv)
        }
    }
    
    return(results)
}

#===============================
# ROC Analysis
#===============================
test_roc_calculation <- function() {
    # Generate example data
    design <- create_design(n_timepoints, tr)
    fmri_data <- generate_fmri_data(design, n_voxels)
    classical_results <- classical_analysis(fmri_data$data, design)
    sgpv_results <- sgpv_analysis(fmri_data$data, design)
    
    # Create true activation labels (0/1)
    activation_threshold <- 0.3
    true_activation <- factor(abs(fmri_data$true_activation) > activation_threshold, 
                            levels = c(FALSE, TRUE))
    
    # Print class distributions
    cat("\nTrue activation distribution:")
    print(table(true_activation))
    
    # Create test statistics
    test_stats <- data.frame(
        p_value = -log10(classical_results[,"p_value"]),
        p_bonf = -log10(classical_results[,"p_bonf"]),
        p_fdr = -log10(classical_results[,"p_fdr"]),
        sgpv = 1 - sgpv_results[,"sgpv"]
    )
    
    # Compute ROC curves and AUC for each method
    roc_list <- list()
    auc_values <- numeric()
    
    for(method in names(test_stats)) {
        pred <- test_stats[[method]]
        valid_idx <- is.finite(pred)
        pred <- pred[valid_idx]
        true_act <- true_activation[valid_idx]
        
        # Compute ROC curve
        roc_obj <- roc(true_act, pred)
        roc_list[[method]] <- roc_obj
        auc_values[method] <- as.numeric(auc(roc_obj))
        
        # Print performance at median threshold
        thresh <- median(pred)
        pred_class <- pred > thresh
        sens <- sum(pred_class & true_act == TRUE) / sum(true_act == TRUE)
        spec <- sum(!pred_class & true_act == FALSE) / sum(true_act == FALSE)
        cat(sprintf("\n%s - AUC: %.3f, Sensitivity: %.3f, Specificity: %.3f",
                   method, auc_values[method], sens, spec))
    }
    
    return(list(
        roc_curves = roc_list,
        auc_values = auc_values
    ))
}

#===============================
# Visualization
#===============================
create_roc_plot <- function(roc_curves, auc_values) {
    plot_data_list <- list()
    
    for(method in names(roc_curves)) {
        roc_obj <- roc_curves[[method]]
        coords <- coords(roc_obj, x = seq(0, 1, length.out = 100))
        
        plot_data_list[[method]] <- data.frame(
            FPR = 1 - coords$specificity,
            TPR = coords$sensitivity,
            Method = method,
            AUC = auc_values[method]
        )
    }
    
    plot_data <- do.call(rbind, plot_data_list)
    
    p <- plot_ly() %>%
        layout(
            title = "ROC Curves for Different Analysis Methods",
            xaxis = list(
                title = "False Positive Rate",
                range = c(0, 1),
                zeroline = TRUE,
                showgrid = TRUE
            ),
            yaxis = list(
                title = "True Positive Rate",
                range = c(0, 1),
                zeroline = TRUE,
                showgrid = TRUE
            ),
            showlegend = TRUE
        )
    
    colors <- c("blue", "red", "green", "purple")
    
    for(i in seq_along(unique(plot_data$Method))) {
        method <- unique(plot_data$Method)[i]
        method_data <- subset(plot_data, Method == method)
        method_data <- method_data[order(method_data$FPR), ]
        
        p <- p %>% add_trace(
            x = method_data$FPR,
            y = method_data$TPR,
            name = sprintf("%s (AUC = %.3f)", 
                         method, unique(method_data$AUC)),
            type = 'scatter',
            mode = 'lines',
            line = list(
                color = colors[i],
                width = 2
            ),
            hoverinfo = "text",
            text = sprintf(
                "Method: %s<br>FPR: %.3f<br>TPR: %.3f<br>AUC: %.3f",
                method, method_data$FPR, method_data$TPR, 
                unique(method_data$AUC)
            )
        )
    }
    
    p <- p %>% add_trace(
        x = c(0, 1),
        y = c(0, 1),
        name = "Chance",
        type = 'scatter',
        mode = 'lines',
        line = list(
            color = 'black',
            dash = 'dash',
            width = 1
        ),
        hoverinfo = "none"
    )
    
    return(p)
}

#===============================
# Run Analysis
#===============================
# Run analysis
test_results <- test_roc_calculation()
## 
## True activation distribution:true_activation
## FALSE  TRUE 
##    72    28
## 
## p_value - AUC: 0.881, Sensitivity: 0.893, Specificity: 0.653
## 
## p_bonf - AUC: 0.914, Sensitivity: 0.893, Specificity: 0.889
## 
## p_fdr - AUC: 0.880, Sensitivity: 0.893, Specificity: 0.667
## 
## sgpv - AUC: 0.877, Sensitivity: 0.893, Specificity: 0.653
roc_plot <- create_roc_plot(test_results$roc_curves, test_results$auc_values)
roc_plot
# Print final performance summary
cat("\n\nFinal Performance Summary:\n")
## 
## 
## Final Performance Summary:
print(data.frame(
    Method = names(test_results$auc_values),
    AUC = round(test_results$auc_values, 3)
))
##          Method   AUC
## p_value p_value 0.881
## p_bonf   p_bonf 0.914
## p_fdr     p_fdr 0.880
## sgpv       sgpv 0.877

3 References

  1. Text classification, NLP, and text mining
  2. Second-generation \(p\)-values (SGPV), e.g., Blume et al. (2018)