
This SOCR DSPA2 Appendix contains a comprehensive suite of interactive learning modules designed for rigorous mathematical-statistics courses.

The SOCR Statistical Data Analyzer (SDA), specifically the SDAP Math-Stat Learning Modules, includes interactive content that supports active learning of these abstract ideas, concepts, methods, and techniques, and accommodates different student learning styles.

1 Mathematical-Statistics Inference Learning Modules

Curriculum Architecture Overview: The modules are structured to move from probabilistic foundations to advanced inferential methods, mirroring the text’s progression.

  • Module 1: Probability Theory (Sets, Axioms, Counting, Conditional Probability, Random Variables, CDFs/PDFs)
  • Module 2: Transformations and Expectations (Deep Dive detailed below)
  • Module 3: Common Families of Distributions (Discrete/Continuous families, Exponential families, Location/Scale families)
  • Module 4: Multiple Random Variables (Joint distributions, Conditional distributions, Bivariate transformations, Covariance, Multivariate distributions)
  • Module 5: Properties of a Random Sample (Sums, Normal sampling, Order Statistics, Convergence concepts, Generating samples)
  • Module 6: Principles of Data Reduction (Sufficiency, Likelihood Principle, Equivariance)
  • Module 7: Point Estimation (Method of Moments, MLE, Bayes, EM Algorithm, MSE, Unbiasedness)
  • Module 8: Hypothesis Testing (LRTs, Bayesian tests, Power, p-values, Neyman-Pearson)
  • Module 9: Interval Estimation (Pivotal quantities, Inverting tests, Bayesian intervals)
  • Module 10: Asymptotic Evaluations (Consistency, Efficiency, Robustness, Large-sample tests)
  • Module 11: Analysis of Variance and Regression (ANOVA, Simple Linear Regression)
  • Module 12: Regression Models (Errors in variables, Logistic, Robust regression)

2 Module 1: Probability Theory

2.1 1. Module Overview

This foundational module establishes the rigorous mathematical language of probability required for statistical inference. It transitions from set-theoretic foundations and axiomatic probability to combinatorial mechanics, conditional structures, and finally the formal definition and characterization of random variables via CDFs and PDFs. Mastery of this module is prerequisite for all subsequent inferential theory.

Learning Objectives:

  • Apply set-theoretic operations to events and prove fundamental identities (e.g., DeMorgan’s Laws).
  • Deduce probabilities of complex events using the Kolmogorov Axioms and their corollaries (Boole’s/Bonferroni inequalities).
  • Formulate probabilistic outcomes for finite sample spaces using combinatorial counting techniques.
  • Decompose complex probability problems using the Law of Total Probability and update beliefs using Bayes’ Rule.
  • Differentiate between pairwise independence and mutual independence.
  • Map sample space outcomes to real numbers using Random Variables, and characterize their distributions rigorously through CDFs, PMFs, and PDFs.

2.2 2. Sub-Module 1.1: Set Theory

2.2.1 1.1.1 Sample Spaces and Events

Content: Definition of the sample space \(S\) (discrete vs. continuous), definition of an event \(A \subseteq S\), and the concepts of the null set \(\emptyset\) and the entire space \(S\).

2.2.2 1.1.2 Set Operations and Fundamental Identities

Content: Union (\(A \cup B\)), Intersection (\(A \cap B\)), Complement (\(A^c\)), and set differences. Commutative, associative, and distributive laws. Rigorous Focus: DeMorgan’s Laws: \((\cup_{i=1}^n A_i)^c = \cap_{i=1}^n A_i^c\) and \((\cap_{i=1}^n A_i)^c = \cup_{i=1}^n A_i^c\).

Interactive Resource: Venn Diagram Set Manipulator

  • Design: A dynamic Venn diagram with up to three intersecting sets (\(A, B, C\)).
  • Interaction: Users type a set-theoretic expression (e.g., \((A \cup B)^c \cap C\)). The corresponding region instantly highlights on the diagram. The tool also provides a “Proof Checker” where users construct equivalences step-by-step (e.g., verifying \(A \setminus B = A \cap B^c\)), and the diagram updates at each step to visually confirm the equality.
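A minimal Python sketch of the region-evaluation logic behind such a manipulator, using built-in set operations over a small illustrative universe (the sets `U`, `A`, `B`, `C` are arbitrary choices, not part of the tool):

```python
# Sketch of set-expression evaluation for a Venn manipulator.
# The universe and the sets A, B, C are illustrative choices.
U = set(range(12))
A = {0, 1, 2, 3, 4, 5}
B = {3, 4, 5, 6, 7, 8}
C = {5, 8, 9, 10}

def complement(X):
    return U - X

# Evaluate (A ∪ B)^c ∩ C, the example expression from the text.
region = complement(A | B) & C

# Verify DeMorgan: (A ∪ B)^c == A^c ∩ B^c
assert complement(A | B) == complement(A) & complement(B)
# Verify A \ B == A ∩ B^c
assert A - B == A & complement(B)
print(sorted(region))  # → [9, 10]
```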

2.3 3. Sub-Module 1.2: Probability Theory

2.3.1 1.2.1 Axiomatic Foundations

Content: The Kolmogorov Axioms for a probability function \(P: \mathcal{B} \to [0,1]\):

  1. \(P(A) \ge 0\) for all \(A \in \mathcal{B}\).
  2. \(P(S) = 1\).
  3. If \(A_1, A_2, \dots\) are mutually exclusive, then \(P(\cup_{i=1}^\infty A_i) = \sum_{i=1}^\infty P(A_i)\) (Countable Additivity).

2.3.2 1.2.2 The Calculus of Probabilities

Content: Derivations from the axioms: \(P(\emptyset) = 0\), \(P(A^c) = 1 - P(A)\), monotonicity (\(A \subset B \implies P(A) \le P(B)\)), and additive generalizations. Rigorous Focus: Boole’s Inequality \(P(\cup A_i) \le \sum P(A_i)\) and Bonferroni’s Inequality \(P(\cap A_i) \ge 1 - \sum P(A_i^c)\).

Interactive Resource: The Axiom Derivation Graph

  • Design: A node-based directed graph. The top node is “Kolmogorov Axioms”. Branching nodes are derived theorems.
  • Interaction: Students must click a derived theorem (e.g., \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)) and select the axioms and prior theorems required to prove it. The tool visualizes the logical flow of mathematical proof.

2.3.3 1.2.3 & 1.2.4 Counting and Enumerating Outcomes

Content: The Multiplication Principle, Permutations (ordered subsets), and Combinations (unordered subsets). The Binomial Coefficient \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\). Rigorous Focus: Enumerating equally likely outcomes to calculate probabilities.

Worked Example: The Matching Problem (Montmort)

  • Scenario: \(n\) letters placed into \(n\) addressed envelopes at random. What is the probability of exactly \(k\) matches?

  • Interactive Element: “The Hat-Check Simulator”. Users select \(n\) (e.g., \(n=5\)) and run Monte Carlo simulations to approximate the probability. The module then provides the rigorous combinatorial derivation using the Inclusion-Exclusion principle, showing how the limit as \(n \to \infty\) approaches \(1/e\).
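The simulator's Monte Carlo logic and the inclusion-exclusion formula can be sketched in a few lines of Python (function names and the \(n=5\), 20,000-trial settings are illustrative, not the SOCR implementation):

```python
import math, random

def p_exact_matches(n, k):
    # Inclusion-exclusion: P(exactly k matches) = (1/k!) * sum_{j=0}^{n-k} (-1)^j / j!
    return sum((-1) ** j / math.factorial(j) for j in range(n - k + 1)) / math.factorial(k)

def simulate(n, trials, k=0, seed=0):
    # Monte Carlo: shuffle letters into envelopes, count fixed points
    rng = random.Random(seed)
    letters = list(range(n))
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        matches = sum(1 for i, x in enumerate(letters) if i == x)
        hits += (matches == k)
    return hits / trials

n = 5
print(p_exact_matches(n, 0))  # already close to 1/e ≈ 0.3679 at n = 5
print(abs(simulate(n, 20_000) - p_exact_matches(n, 0)) < 0.02)
```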

2.4 4. Sub-Module 1.3: Conditional Probability and Independence

2.4.1 1.3.1 Conditional Probability

Content: Definition of \(P(A|B) = \frac{P(A \cap B)}{P(B)}\) for \(P(B) > 0\). The conditional probability as a restriction of the sample space.

2.4.2 1.3.2 Law of Total Probability and Bayes’ Rule

Content: Partition of a sample space \(B_1, \dots, B_k\). The Law of Total Probability: \(P(A) = \sum_{i=1}^k P(A|B_i)P(B_i)\). Bayes’ Rule: \(P(B_j|A) = \frac{P(A|B_j)P(B_j)}{\sum_{i=1}^k P(A|B_i)P(B_i)}\).

Interactive Resource: Bayesian Medical Tester

  • Design: Sliders for Disease Prevalence \(P(D)\), Test Sensitivity \(P(+|D)\), and Test Specificity \(P(-|D^c)\).
  • Interaction: As users adjust sliders, a flow-chart dynamically updates the joint probabilities. A large bar chart compares \(P(D|+)\) with \(P(+|D)\), directly addressing the base-rate fallacy. The mathematical derivation via the Law of Total Probability is displayed algebraically alongside the numerical output.
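The flow-chart computation reduces to a single application of Bayes' Rule with the Law of Total Probability in the denominator; a Python sketch (the prevalence/sensitivity/specificity values are illustrative):

```python
def posterior_positive(prevalence, sensitivity, specificity):
    """P(D | +) via Bayes' Rule; the denominator is the Law of Total Probability."""
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / p_pos

# Base-rate fallacy: a 99%-sensitive, 99%-specific test on a rare disease (prevalence 1%)
print(round(posterior_positive(0.01, 0.99, 0.99), 3))  # → 0.5
```

Despite the test's accuracy, \(P(D|+)\) is only 0.5 because false positives from the large healthy population match true positives from the small diseased one.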

2.4.3 1.3.3 Independence

Content: Definition: \(A\) and \(B\) are independent iff \(P(A \cap B) = P(A)P(B)\). Rigorous Focus: The crucial distinction between Pairwise Independence and Mutual Independence for collections of \(n\) events (requiring \(2^n - n - 1\) equations to hold).

Interactive Resource: The Three-Coin Paradox

  • Design: A simulation of three coins where the outcome of one depends on the others (e.g., Coin C is Heads iff A and B match).
  • Interaction: Students test pairs for independence (\(P(A \cap B) = P(A)P(B)\)), finding they are pairwise independent. The tool then calculates \(P(A \cap B \cap C)\), revealing it does not equal \(P(A)P(B)P(C)\), vividly demonstrating the failure of mutual independence.
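Because the three-coin experiment has only four equally likely outcomes for \((A, B)\), the paradox can be verified by exact enumeration rather than simulation; a Python sketch (encoding heads as 1 is an arbitrary choice):

```python
from itertools import product

# Coin C is Heads iff A and B match; four equally likely (A, B) outcomes
outcomes = [(a, b, int(a == b)) for a, b in product([0, 1], repeat=2)]

def prob(event):
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

pA = prob(lambda o: o[0] == 1)  # 1/2
pB = prob(lambda o: o[1] == 1)  # 1/2
pC = prob(lambda o: o[2] == 1)  # 1/2

# Pairwise independence holds for every pair:
assert prob(lambda o: o[0] == 1 and o[1] == 1) == pA * pB
assert prob(lambda o: o[0] == 1 and o[2] == 1) == pA * pC
assert prob(lambda o: o[1] == 1 and o[2] == 1) == pB * pC

# ...but mutual independence fails:
p_all = prob(lambda o: o == (1, 1, 1))
print(p_all, pA * pB * pC)  # → 0.25 0.125
```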

2.5 5. Sub-Module 1.4: Random Variables

2.5.1 1.4.1 Definition and Mapping

Content: A random variable \(X\) is a function from the sample space \(S\) to the real numbers: \(X: S \to \mathbb{R}\). The pre-image (inverse mapping) \(X^{-1}(B) = \{s \in S : X(s) \in B\}\).

Interactive Resource: Sample Space to Real Line Mapper

  • Design: A two-panel interface. Left panel: Graphical representation of \(S\) (e.g., a grid of \((die1, die2)\) outcomes). Right panel: A real number line.

  • Interaction: User defines a function (e.g., \(X = \max(die1, die2)\) or \(X = die1 + die2\)). Hovering over a point on the real line (e.g., \(X=4\)) highlights the corresponding set of points in \(S\) (the pre-image). This visualizes the abstraction from sample outcomes to real-valued measurements.

2.6 6. Sub-Module 1.5: Distribution Functions

2.6.1 1.5.1 The Cumulative Distribution Function (CDF)

Content: Definition: \(F_X(x) = P(X \le x)\) for all \(x \in \mathbb{R}\). Rigorous Focus: The four mathematical properties of a valid CDF:

  1. \(\lim_{x \to -\infty} F_X(x) = 0\).
  2. \(\lim_{x \to \infty} F_X(x) = 1\).
  3. \(F_X(x)\) is a nondecreasing function of \(x\).
  4. \(F_X(x)\) is right-continuous (i.e., \(\lim_{y \downarrow x} F_X(y) = F_X(x)\)).

Interactive Resource: CDF Sketcher & Validator

  • Design: A coordinate grid where students can draw piecewise functions using a mouse/stylus.
  • Interaction: As the student draws, the engine checks the four CDF properties in real-time. If the student tries to draw a decreasing segment, a warning appears (“Violates non-decreasing property”). If they create a jump and attempt to make it left-continuous instead of right-continuous, the tool visually “fills the dot” on the right side and “hollows the dot” on the left, enforcing the right-continuity axiom.

2.6.2 1.5.2 Calculating Probabilities via the CDF

Content: \(P(a < X \le b) = F_X(b) - F_X(a)\). Handling strict vs. non-strict inequalities: \(P(X = x) = F_X(x) - F_X(x^-)\).

2.7 7. Sub-Module 1.6: Density and Mass Functions

2.7.1 1.6.1 Probability Mass Function (pmf)

Content: Definition for discrete random variables: \(f_X(x) = P(X = x)\). Properties: \(f_X(x) \ge 0\) and \(\sum_{x \in \mathcal{X}} f_X(x) = 1\). Relationship to CDF: \(F_X(x) = \sum_{y \le x} f_X(y)\).

2.7.2 1.6.2 Probability Density Function (pdf)

Content: Definition for continuous random variables: \(f_X(x)\) such that \(F_X(x) = \int_{-\infty}^x f_X(t)dt\). Properties: \(f_X(x) \ge 0\) and \(\int_{-\infty}^\infty f_X(t)dt = 1\). Note that \(f_X(x)\) does not have to be \(\le 1\). Rigorous Focus: For continuous \(X\), \(P(X = x) = 0\), and \(P(a \le X \le b) = P(a < X < b) = \int_a^b f_X(x)dx\).

Interactive Resource: The PDF/CDF Duality Engine

  • Design: A split screen. Top: a dynamic PDF curve. Bottom: the corresponding CDF curve.

  • Interaction:
      – PDF to CDF: The user shades an area under the PDF curve from \(-\infty\) to \(x\). The area value appears, and a corresponding point plots on the CDF below. Dragging \(x\) along the x-axis dynamically builds the CDF.
      – CDF to PDF: The user manipulates points on the CDF (respecting the 4 axioms from 1.5). The PDF automatically calculates and draws as the derivative of the CDF. For discrete jumps, the PDF plots discrete spikes equal to the jump height.
      – The “Density > 1” Mythbuster: A slider allows the user to decrease the variance of a normal distribution. The PDF peak rises above 1, while the total integral remains 1, correcting the common misconception that a pdf value is a probability.
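The "Density > 1" Mythbuster reduces to two numeric facts that can be checked directly; a Python sketch (σ = 0.1 and the Riemann-sum step size are illustrative choices):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

sigma = 0.1
peak = normal_pdf(0.0, sigma=sigma)  # 1/(σ√(2π)) ≈ 3.99, well above 1

# Riemann sum of the pdf over [-1, 1] (±10σ): total mass stays 1
dx = 1e-4
total = sum(normal_pdf(-1 + i * dx, sigma=sigma) for i in range(20000)) * dx
print(peak > 1, round(total, 4))  # → True 1.0
```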

2.8 Part III: Technical Implementation Guidelines (Module 1 Specifics)

  1. Set Theory Engine: Use a library capable of boolean geometry operations (like Clipper.js or paper.js) to accurately render the unions, intersections, and differences of arbitrary shapes for the Venn Diagram manipulator.

  2. Symbolic Logic Integration: For the Axiom Derivation Graph and the Independence checker, integrate a lightweight symbolic logic validator. The student inputs \(P(A \cap B) == P(A)P(B)\), and the system tests this algebraically or numerically against simulation data.

  3. Combinatorics Optimizer: For the counting module, ensure the backend can compute large factorials exactly using arbitrary-precision arithmetic (like Python’s built-in math.comb or math.factorial) to prevent overflow when students test limits (e.g., \(n=100\) in the matching problem).

  4. Riemann/Lebesgue Integration visualization: In the PDF/CDF Duality Engine, render the area under the curve using WebGL/Canvas to ensure smooth animation as the integral limit \(x\) slides along the axis. For discrete CDFs, ensure the “step” function rendering accurately places open and closed circles to denote right-continuity.

3 Module 2: Transformations and Expectations

3.1 1. Module Overview

This module bridges the gap between basic probability distributions and the behavior of functions of random variables. It covers the mathematical rigor required to find the distribution of a transformed variable, calculate its expected value, and utilize Moment Generating Functions (MGFs) to characterize distributions. It also introduces the critical theoretical justifications for differentiating under an integral sign.

Learning Objectives:

  • Determine the pmf/pdf of a transformed random variable \(Y = g(X)\) for both monotone and piecewise monotone functions.
  • Apply the Probability Integral Transformation and its inverse for random variate generation.
  • Compute and interpret expected values, variances, and higher moments.
  • Derive and utilize Moment Generating Functions (MGFs) to identify distributions and prove convergence.
  • Mathematically justify the interchange of differentiation and integration/summation using Leibnitz’s Rule and Dominated Convergence concepts.

3.2 2. Sub-Module 2.1: Distributions of Functions of a Random Variable

3.2.1 2.1.1 The Inverse Mapping Concept

Content: Definition of \(Y = g(X)\), the mapping \(g(x): \mathcal{X} \to \mathcal{Y}\), and the inverse mapping \(g^{-1}(A) = \{x \in \mathcal{X}: g(x) \in A\}\). Proving the distribution of \(Y\) satisfies Kolmogorov Axioms.

Interactive Resource: Visual Mapping Engine

  • Design: A split-screen canvas. Left side: a number line representing \(\mathcal{X}\). Right side: a number line representing \(\mathcal{Y}\).
  • Interaction: The user selects a function \(g(x)\) (e.g., \(x^2\), \(e^x\)). A point or interval highlighted on \(\mathcal{X}\) instantly maps to \(\mathcal{Y}\), visually demonstrating how \(g^{-1}\) maps sets back to sets.
  • Rigor Check: Include a “Non-one-to-one” mode (e.g., \(y=x^2\)) where selecting \(y > 0\) on \(\mathcal{Y}\) highlights two disjoint intervals on \(\mathcal{X}\), emphasizing that \(g^{-1}(y)\) can be a set, not just a point.

3.2.2 2.1.2 The Discrete Case

Content: Finding the pmf of \(Y\) via \(f_Y(y) = \sum_{x \in g^{-1}(y)} f_X(x)\).

Worked Example: The Binomial Transformation (Example 2.1.1)

  • Scenario: \(X \sim \text{Binomial}(n, p)\), \(Y = n - X\).
  • Interactive Step-by-Step:
      1. Display the binomial pmf.
      2. Visually flip the pmf around \(n/2\).
      3. Mathematical derivation showing \(g^{-1}(y) = n-y\) and the combinatorial identity \(\binom{n}{n-y} = \binom{n}{y}\).
      4. Conclusion: \(Y \sim \text{Binomial}(n, 1-p)\).
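The combinatorial identity behind the flip can be verified numerically for every support point at once; a Python sketch (n = 7, p = 0.3 are illustrative):

```python
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 7, 0.3
# pmf of Y = n - X by re-indexing: f_Y(y) = f_X(n - y), which should match Binomial(n, 1-p)
ok = all(
    abs(binom_pmf(n, p, n - y) - binom_pmf(n, 1 - p, y)) < 1e-12
    for y in range(n + 1)
)
print(ok)  # → True
```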

3.2.3 2.1.3 The Continuous Case (Monotone Transformations)

Content: Theorem 2.1.3 (CDF method) and Theorem 2.1.5 (PDF method).

Formula: \(f_Y(y) = f_X(g^{-1}(y)) |\frac{d}{dy}g^{-1}(y)|\).

Interactive Resource: The Jacobian Visualizer

  • Design: A graph showing the transformation \(y=g(x)\).
  • Interaction: As the user drags a point along the \(y\)-axis, the tool calculates the slope of \(g^{-1}(y)\). The output visually scales the density \(f_X\) by the absolute derivative, showing why the Jacobian term is necessary to preserve total probability mass (area).

Worked Examples:

  1. Uniform-Exponential Relationship (Ex 2.1.4): \(X \sim \text{Uniform}(0,1)\), \(Y = -\log X\). Derivation showing \(Y \sim \text{Exp}(1)\).

  2. Inverted Gamma (Ex 2.1.6): \(X \sim \text{Gamma}(\alpha, \beta)\), \(Y=1/X\). Derivation of the inverted gamma pdf.

3.2.4 2.1.4 Non-Monotone Transformations

Content: Theorem 2.1.8 (Piecewise monotone transformations). Partitioning \(\mathcal{X}\) into \(A_1, \dots, A_k\).

Formula: \(f_Y(y) = \sum_{i=1}^k f_X(g_i^{-1}(y)) |\frac{d}{dy}g_i^{-1}(y)|\).

Worked Example: Normal-Chi Squared Relationship (Ex 2.1.9)

  • Scenario: \(X \sim N(0,1)\), \(Y = X^2\).
  • Interactive Element: Show the standard normal curve. Animate the “folding” of the negative half onto the positive half, demonstrating why the probability mass doubles for \(y > 0\). Derivation resulting in \(\chi^2_1\).

3.2.5 2.1.5 The Probability Integral Transformation

Content: Theorem 2.1.10. If \(X\) has continuous cdf \(F_X\), then \(Y = F_X(X) \sim \text{Uniform}(0,1)\). Definition of the generalized inverse \(F_X^{-1}(y) = \inf\{x: F_X(x) \ge y\}\).

Interactive Resource: Random Variate Generator

  • Design: A “slot machine” interface.
  • Interaction: User selects a target distribution (e.g., Exponential). The engine draws a uniform random number \(U\), plots it on the y-axis of the CDF plot, traces horizontally to the CDF curve, and drops down to the x-axis to find \(X = F^{-1}(U)\). Histograms build in real-time to prove the theorem empirically.
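The horizontal-then-vertical tracing is exactly inverse-transform sampling; a Python sketch for the Exponential case (λ = 2 and the sample size are illustrative, and the closed-form inverse CDF used here is specific to the Exponential):

```python
import math, random

def sample_exponential(lam, rng):
    """Probability integral transform: X = F⁻¹(U) = -ln(1-U)/λ for U ~ Uniform(0,1)."""
    u = rng.random()
    return -math.log(1.0 - u) / lam

rng = random.Random(42)
lam = 2.0
xs = [sample_exponential(lam, rng) for _ in range(50_000)]

# Empirical mean should approach the Exponential mean 1/λ = 0.5
mean = sum(xs) / len(xs)
print(abs(mean - 1 / lam) < 0.02)
```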

3.3 3. Sub-Module 2.2: Expected Values

3.3.1 2.2.1 Definitions and Properties

Content: Definition 2.2.1 (The “Law of the Unconscious Statistician”). Theorem 2.2.5 (Linearity, Non-negativity, Monotonicity, Boundedness).

Worked Examples:

  1. Exponential Mean (Ex 2.2.2): Integration by parts to show \(E[X] = 1/\lambda\).
  2. Binomial Mean (Ex 2.2.3): Algebraic manipulation of binomial coefficients to show \(E[X] = np\).
  3. Cauchy Mean (Ex 2.2.4): Demonstrating that \(E|X| = \infty\), hence the expectation does not exist.

3.3.2 2.2.2 Expectation as a Predictor

Content: Example 2.2.6. Minimizing \(E[(X-b)^2]\).

Derivation: Expanding \(E[(X - EX + EX - b)^2]\) to prove the minimum occurs at \(b = EX\).

Interactive Resource: MSE Optimization Slider

  • Design: A plot of a pdf (e.g., Gamma). A vertical line represents the “guess” \(b\). A calculated readout displays \(E[(X-b)^2]\).
  • Interaction: User drags the line \(b\) along the x-axis. The MSE value updates in real-time, visually demonstrating that the minimum MSE occurs precisely at the mean \(E[X]\), not the median or mode.
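The slider's readout can be mimicked with a grid search over the guess \(b\); a Python sketch (a small skewed discrete distribution stands in for the Gamma pdf of the design, and the grid resolution is arbitrary):

```python
# Numeric check that E[(X-b)^2] is minimized at b = E[X]
values = [0, 1, 2, 10]
probs = [0.4, 0.3, 0.2, 0.1]
mean = sum(v * p for v, p in zip(values, probs))  # E[X] = 1.7

def mse(b):
    return sum(p * (v - b) ** 2 for v, p in zip(values, probs))

# Scan candidate guesses b on a fine grid and find the minimizer
grid = [i / 100 for i in range(0, 1001)]
best = min(grid, key=mse)
print(round(mean, 10), best)  # → 1.7 1.7
```

Note that the minimizer is the mean, not the median (1.0) or the mode (0), even though the distribution is heavily skewed.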

3.4 4. Sub-Module 2.3: Moments and Moment Generating Functions

3.4.1 2.3.1 Moments and Variance

Content: Definitions of \(\mu'_n\) and \(\mu_n\). Definition of Variance (\(\text{Var } X\)) and Standard Deviation. Theorem 2.3.4 (\(\text{Var }(aX+b) = a^2 \text{Var } X\)). Computational formula: \(\text{Var } X = EX^2 - (EX)^2\).

Interactive Resource: Variance Decomposition

  • Design: A bar chart visualization of \(E[X^2]\) vs \((E[X])^2\).
  • Interaction: User changes parameters of a distribution. The visualization dynamically resizes the bars to show that \(\text{Var } X\) is the gap between the second raw moment and the square of the first moment.

3.4.2 2.3.2 Moment Generating Functions (MGFs)

Content: Definition 2.3.6 (\(M_X(t) = E[e^{tX}]\)). Theorem 2.3.7 (Generating moments via \(E[X^n] = M_X^{(n)}(0)\)). Theorem 2.3.15 (\(M_{aX+b}(t) = e^{bt}M_X(at)\)).

Worked Examples:

  1. Gamma MGF (Ex 2.3.8): Derivation of \(M_X(t) = (1-\beta t)^{-\alpha}\) for \(t < 1/\beta\).
  2. Binomial MGF (Ex 2.3.9): Derivation of \(M_X(t) = (pe^t + 1-p)^n\).

3.4.3 2.3.3 Uniqueness and Convergence

Content: The issue of non-unique moments (Example 2.3.10 - Lognormal vs. modified Lognormal). Theorem 2.3.11 (Uniqueness of MGF). Theorem 2.3.12 (Convergence of MGFs implies convergence of CDFs).

Interactive Resource: The Poisson Approximation Simulator (Ex 2.3.13)

  • Design: Side-by-side bar charts of Binomial\((n, p)\) and Poisson\((\lambda=np)\).
  • Interaction: A slider controls \(n\) (from 10 to 1000). As \(n \to \infty\) with constant \(\lambda\), the Binomial MGF \((1 - \frac{\lambda}{n} + \frac{\lambda}{n}e^t)^n\) converges to the Poisson MGF \(e^{\lambda(e^t-1)}\). The visualization shows the binomial bars morphing into the Poisson bars. Mathematical derivation utilizing Lemma 2.3.14 (\(\lim (1+a_n/n)^n = e^a\)) is displayed alongside.

3.5 5. Sub-Module 2.4: Differentiating Under an Integral Sign

3.5.1 2.4.1 Leibnitz’s Rule

Content: Theorem 2.4.1. Differentiating integrals with variable limits.

Formula: \(\frac{d}{d\theta} \int_{a(\theta)}^{b(\theta)} f(x,\theta)dx = f(b(\theta),\theta)b'(\theta) - f(a(\theta),\theta)a'(\theta) + \int_{a(\theta)}^{b(\theta)} \frac{\partial}{\partial \theta}f(x,\theta)dx\).

3.5.2 2.4.2 Infinite Range and Dominated Convergence

Content: Theorem 2.4.3 and Corollary 2.4.4. The necessity of the dominating function \(g(x,\theta)\) and the Lipschitz-like condition bounding the derivative.

Worked Example: Interchanging I (Ex 2.4.5)

  • Scenario: Exponential distribution. Calculating \(\frac{d}{d\lambda} E[X^n]\).
  • Interactive Element: A “proof-checker” interface. The user must identify the bounding function \(g(x, \lambda)\) that bounds \(\frac{\partial}{\partial \lambda} [x^n \lambda^{-1} e^{-x/\lambda}]\). If they input a valid \(g\), the tool confirms the interchange is justified and displays the recursive moment formula: \(E[X^{n+1}] = \lambda^2 \frac{d}{d\lambda}E[X^n] + \lambda E[X^n]\).

3.5.3 2.4.3 Interchanging Summation and Differentiation

Content: Theorem 2.4.8. Uniform convergence of series of derivatives.

Worked Example: The Geometric Distribution (Ex 2.4.7 & 2.4.9)

  • Scenario: \(X \sim \text{Geometric}(\theta)\). Finding \(E[X]\) via \(\sum x \theta(1-\theta)^x\).
  • Rigorous Step: Proving uniform convergence of the series of derivatives on closed subintervals \((c, d) \subset (0,1)\) to justify taking the derivative inside the sum.
  • Result: Derivation of \(E[X] = \frac{1-\theta}{\theta}\).

4 Technical Implementation Guidelines

To develop these resources effectively, the following tech stack and pedagogical patterns are recommended:

  1. Interactive Visualization Engine: Use D3.js or Plotly.js for the mapping and distribution visualizers. Both libraries support binding data (the pdf/cdf functions) to DOM elements, allowing real-time updates as parameters change.

  2. Symbolic Computation: Integrate SymPy (Python) or MathJax with step-by-step reveal logic. For modules involving MGFs and Leibnitz’s rule, the algebra is dense. The UI should allow users to click “Next Step” to see the expansion of \(E[(X-EX + EX - b)^2]\) or the factoring of the binomial coefficient.

  3. Assessment Algorithmics:
       • Transformation Drills: Randomly generate a base distribution (e.g., Gamma) and a transformation (e.g., \(Y = 1/X\)). The student must choose the correct theorem (Monotone vs. Piecewise) and input the resulting PDF. The backend evaluates the answer symbolically.
       • MGF Matching: Given an MGF, match it to the distribution. This reinforces Theorem 2.3.11.

  4. “Prove It” Code Blocks: For Section 2.4, provide Python/R code templates where students must write code to numerically verify that \(\lim_{h \to 0} \frac{1}{h} \int [f(x, \theta+h) - f(x, \theta)] dx\) equals \(\int \frac{\partial}{\partial \theta} f(x, \theta) dx\) for specific functions, bridging the gap between theoretical limits and numerical computation.
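One possible shape for such a template, for the Exponential density \(f(x,\theta) = \theta e^{-\theta x}\), using a midpoint Riemann sum and a central difference in \(\theta\) (the truncation point and step counts are illustrative, not prescribed):

```python
import math

# Numerically check d/dθ ∫ x·f(x,θ) dx == ∫ x·∂f/∂θ dx for f(x,θ) = θ·e^{-θx},
# where ∫ x·f dx = E[X] = 1/θ, so both sides should be -1/θ² (= -1 at θ = 1).

def integral_xf(theta, upper=50.0, steps=200_000):
    # Midpoint Riemann sum of ∫₀^upper x·θ·e^{-θx} dx
    dx = upper / steps
    return sum((i + 0.5) * dx * theta * math.exp(-theta * (i + 0.5) * dx)
               for i in range(steps)) * dx

def integral_x_dfdtheta(theta, upper=50.0, steps=200_000):
    # ∂/∂θ [θ·e^{-θx}] = (1 - θx)·e^{-θx}
    dx = upper / steps
    return sum((i + 0.5) * dx * (1 - theta * (i + 0.5) * dx) * math.exp(-theta * (i + 0.5) * dx)
               for i in range(steps)) * dx

theta, h = 1.0, 1e-4
lhs = (integral_xf(theta + h) - integral_xf(theta - h)) / (2 * h)  # derivative of the integral
rhs = integral_x_dfdtheta(theta)                                   # integral of the derivative
print(round(lhs, 3), round(rhs, 3))  # → -1.0 -1.0
```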

5 Module 3: Common Families of Distributions

5.1 1. Module Overview

This module transitions from the general mechanics of random variables (Module 2) to the specific, named families of distributions that form the workhorses of statistical inference. Rather than merely cataloging formulas, the module emphasizes the structural relationships between distributions, the unifying mathematical framework of Exponential Families, and the geometric interpretations of Location and Scale families.

Learning Objectives:

  • Identify, parameterize, and compute probabilities/moments for key discrete distributions (Uniform, Binomial, Poisson, Geometric, Negative Binomial, Hypergeometric).
  • Identify, parameterize, and compute probabilities/moments for key continuous distributions (Uniform, Gamma/Exponential/Chi-squared, Normal, Beta, Student’s t, Snedecor’s F).
  • Prove relationships and transformations between distributions (e.g., Normal to Chi-squared, Gamma to Exponential).
  • Express distributions as Exponential Families, identifying natural parameters, sufficient statistics, and the support constraints that define a full rank.
  • Generate new probability models by applying location and scale transformations to standard distributions.

5.2 2. Sub-Module 3.1 & 3.2: Discrete Distributions

5.2.1 3.2.1 The Discrete Catalog

Content: Rigorous definition, parameter space, support, mean, variance, and MGFs for:

  • Discrete Uniform: \(P(X=x) = 1/N\).
  • Binomial: \(X \sim \text{Bin}(n, p)\), arising from \(n\) independent Bernoulli trials.
  • Poisson: \(X \sim \text{Poisson}(\lambda)\), the limit of the Binomial.
  • Geometric: \(X \sim \text{Geom}(p)\), the number of trials until the first success.
  • Negative Binomial: \(X \sim \text{NegBin}(r, p)\), the number of trials until \(r\) successes.
  • Hypergeometric: Arising from sampling without replacement.

Interactive Resource: The Parameter Topology Explorer

  • Design: A multi-panel dashboard featuring a dynamic bar chart (PMF), a parameter slider panel, and an “Assumptions” checklist.
  • Interaction: As the user adjusts parameters (e.g., \(n, p\) for Binomial), the PMF updates in real-time. To demonstrate the Poisson limit, a “Link to Binomial” toggle forces \(\lambda = n \cdot p\). As \(n \to \infty\) (via slider), the Binomial bars smoothly morph into the Poisson bars.
  • Rigorous Focus: The Hypergeometric variance formula. A simulation window contrasts sampling with replacement (Binomial) vs. without replacement (Hypergeometric), visually demonstrating how the finite population correction factor affects the spread as sample size approaches population size.
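The with/without-replacement contrast can be computed directly from the closed-form variances; a Python sketch (N = 100, K = 30 are illustrative population values):

```python
def hypergeom_var(N, K, n):
    """Variance of successes drawn without replacement: n·p·(1-p)·(N-n)/(N-1), p = K/N."""
    p = K / N
    return n * p * (1 - p) * (N - n) / (N - 1)

def binom_var(n, p):
    """Variance of successes drawn with replacement."""
    return n * p * (1 - p)

N, K = 100, 30  # population of 100 containing 30 "successes"
p = K / N
for n in (5, 50, 95):
    fpc = (N - n) / (N - 1)  # finite population correction factor
    print(n, round(binom_var(n, p), 3), round(hypergeom_var(N, K, n), 3), round(fpc, 3))
```

As the sample size \(n\) approaches the population size \(N\), the correction factor drives the hypergeometric variance toward zero while the binomial variance keeps growing.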

5.3 3. Sub-Module 3.3: Continuous Distributions

5.3.1 3.3.1 The Gamma Function and Gamma Family

Content: Definition and recursive properties of the Gamma function \(\Gamma(\alpha) = \int_0^\infty t^{\alpha-1}e^{-t}dt\). The Gamma pdf: \(f(x|\alpha,\beta) = \frac{1}{\Gamma(\alpha)\beta^\alpha}x^{\alpha-1}e^{-x/\beta}\). Special Cases: Exponential (\(\alpha=1\)) and Chi-squared (\(\alpha=\nu/2, \beta=2\)).

5.3.2 3.3.2 The Normal Distribution

Content: \(X \sim N(\mu, \sigma^2)\). The pdf, standardization \(Z = (X-\mu)/\sigma\), and MGF \(M_X(t) = \exp(\mu t + \sigma^2 t^2 / 2)\).

5.3.3 3.3.3 The Beta Family

Content: \(X \sim \text{Beta}(\alpha, \beta)\). Support on \([0,1]\). Relationship to order statistics (preview of Module 5).

Interactive Resource: The Distribution Genealogy Graph

  • Design: A node-based directed graph (similar to a flowchart) mapping the transformations between distributions.
  • Interaction: The user clicks a node (e.g., \(X \sim \text{Gamma}(\alpha, \beta)\)). Edges light up showing transformations: “Divide by \(\beta\)” leads to \(\text{Gamma}(\alpha, 1)\); “Set \(\alpha=1\)” leads to \(\text{Exp}(\beta)\); “Sum \(k\) independent \(\text{Exp}(\lambda)\)” leads to \(\text{Gamma}(k, \lambda)\); “Square \(Z \sim N(0,1)\)” leads to \(\chi^2_1\).
  • Rigor Check: Clicking an edge opens a step-by-step mathematical derivation panel (using Theorem 2.1.8 from Module 2) proving the transformation.

5.4 4. Sub-Module 3.4: Exponential Families

5.4.1 3.4.1 Definition and Canonical Form

Content: A family of pdfs/pmfs is an exponential family if it can be expressed as: \[f(x|\theta) = h(x)c(\theta)\exp\left( \sum_{i=1}^k w_i(\theta)t_i(x) \right)\] Definitions of \(h(x)\) (base measure), \(c(\theta)\) (normalizing constant), \(w_i(\theta)\) (natural parameters), and \(t_i(x)\) (sufficient statistics).

Interactive Resource: The Exponential Family Decomposer

  • Design: A symbolic math interface.
  • Interaction: The user inputs a standard pdf (e.g., \(f(x|\lambda) = \lambda e^{-\lambda x}\)). The tool systematically performs algebraic manipulations to group \(\theta\)-dependent terms and \(x\)-dependent terms into the exponent. It outputs the specific functions: \(h(x) = I_{(0,\infty)}(x)\), \(c(\lambda) = \lambda\), \(w(\lambda) = -\lambda\), \(t(x) = x\).
  • Rigorous Focus: The Support Constraint. The tool highlights why the Uniform\((0,\theta)\) is not an exponential family (because the support depends on \(\theta\), preventing the factorization into \(h(x)c(\theta)\)).
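The decomposer's output for the Exponential example can be verified pointwise; a Python sketch using exactly the components listed above (the test points are arbitrary):

```python
import math

# Check the decomposition f(x|λ) = h(x)·c(λ)·exp(w(λ)·t(x))
# with h(x) = I_(0,∞)(x), c(λ) = λ, w(λ) = -λ, t(x) = x.
h = lambda x: 1.0 if x > 0 else 0.0
c = lambda lam: lam
w = lambda lam: -lam
t = lambda x: x

def f(x, lam):
    return lam * math.exp(-lam * x) if x > 0 else 0.0

ok = all(
    abs(f(x, lam) - h(x) * c(lam) * math.exp(w(lam) * t(x))) < 1e-12
    for x in (0.1, 1.0, 5.0)
    for lam in (0.5, 1.0, 2.0)
)
print(ok)  # → True
```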

5.4.2 3.4.2 Natural Parameter Space and Full Rank

Content: Defining the natural parameter space \(\Omega = \{ \theta : \int h(x)c(\theta)\exp(\sum w_i(\theta)t_i(x))dx < \infty \}\). Concept of a full-rank exponential family (the \(w_i(\theta)\) functions are linearly independent, and the \(t_i(x)\) statistics are linearly independent). Rigorous Focus: Convexity of the natural parameter space.

Worked Example: The Curved Exponential Family

  • Scenario: \(X \sim N(\theta, \theta^2)\).

  • Interactive Step-by-Step: Show that while it fits the exponential family form, the dimension of the parameter space (1) is less than the dimension of the natural parameter vector (2). The tool visually represents the parameter curve \((\mu/\sigma^2,\ -1/(2\sigma^2))\) restricted to a 1-D manifold within the 2-D natural parameter space.

5.5 5. Sub-Module 3.5: Location and Scale Families

5.5.1 3.5.1 Location Families

Content: \(X\) is a member of a location family if \(X = \theta + Z\), where \(Z\) has a “standard” pdf \(f(z)\). Thus \(f(x|\theta) = f(x-\theta)\). Visualized as horizontal shifts.

5.5.2 3.5.2 Scale Families

Content: \(X\) is a member of a scale family if \(X = \sigma Z\) (for \(\sigma > 0\)). Thus \(f(x|\sigma) = \frac{1}{\sigma}f(\frac{x}{\sigma})\). Visualized as horizontal stretching/shrinking.

5.5.3 3.5.3 Location-Scale Families

Content: \(X = \mu + \sigma Z\). Thus \(f(x|\mu,\sigma) = \frac{1}{\sigma}f(\frac{x-\mu}{\sigma})\). Examples: Normal, Cauchy, Double Exponential.

Interactive Resource: The Shape-Shifter Engine

  • Design: A canvas displaying a standard pdf \(f(z)\) (e.g., Standard Cauchy) with integration bounds.

  • Interaction:
      1. Location Phase: A slider adjusts \(\theta\). The curve slides horizontally. The tool enforces that the area remains 1 and the shape is rigid.
      2. Scale Phase: A slider adjusts \(\sigma\). The curve stretches. Crucially, the y-axis automatically scales to show that as the curve widens, it must flatten to preserve \(\int f(x)dx = 1\).

  • Rigorous Focus: The Jacobian justification. As the user applies the transformation \(x \to \mu + \sigma z\), the tool displays the differential \(dx = \sigma dz\), proving mathematically why the \(\frac{1}{\sigma}\) pre-multiplier is required in the pdf.
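The \(1/\sigma\) requirement can be confirmed by integrating the transformed Cauchy density numerically; a Python sketch (μ = 3, σ = 0.5, and the wide integration window are illustrative choices to handle the heavy tails):

```python
import math

def std_cauchy(z):
    return 1.0 / (math.pi * (1.0 + z * z))

def loc_scale_pdf(x, mu, sigma):
    # f(x | μ, σ) = (1/σ)·f((x-μ)/σ); the 1/σ comes from dx = σ dz
    return std_cauchy((x - mu) / sigma) / sigma

# Midpoint Riemann check that the transformed density still integrates to ≈ 1
mu, sigma = 3.0, 0.5
lo, hi, steps = -2000.0, 2000.0, 400_000
dx = (hi - lo) / steps
total = sum(loc_scale_pdf(lo + (i + 0.5) * dx, mu, sigma) for i in range(steps)) * dx
print(round(total, 3))  # ≈ 1.0 despite the shift and scale
```

Dropping the `/ sigma` in `loc_scale_pdf` would make the integral come out near \(\sigma\) instead of 1.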

5.6 6. Sub-Module 3.6: Inequalities and Identities

5.6.1 3.6.1 Probability Inequalities

Content: Markov’s Inequality (\(P(X \ge a) \le \frac{E[X]}{a}\) for \(X \ge 0\)) and Chebyshev’s Inequality (\(P(|X-\mu| \ge k\sigma) \le \frac{1}{k^2}\)).

Interactive Resource: The Bound Tightness Tester

  • Design: A plot showing the true tail probability \(P(|X-\mu| \ge k\sigma)\) for a chosen distribution (e.g., Exponential, Normal, Uniform) alongside the Chebyshev bound \(1/k^2\).
  • Interaction: The user varies \(k\). The tool dynamically plots both curves, demonstrating that Chebyshev is often extremely conservative.
  • Rigorous Focus: A dropdown allows switching to a Uniform distribution, where the true probability drops to 0 long before the bound reaches 0, contrasting with a heavy-tailed distribution where the bound is tighter.
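For the Exponential(1) case (where \(\mu = \sigma = 1\)) the true tail probability has a closed form, so the tester's two curves can be tabulated directly; a Python sketch (the \(k\) values are illustrative):

```python
import math

def exp_tail(k):
    """True P(|X-μ| ≥ kσ) for X ~ Exponential(1), where μ = σ = 1."""
    lower = 1.0 - math.exp(-(1.0 - k)) if k < 1 else 0.0  # P(X ≤ 1-k), zero once k ≥ 1
    upper = math.exp(-(1.0 + k))                           # P(X ≥ 1+k)
    return lower + upper

for k in (1.5, 2.0, 3.0):
    true_p = exp_tail(k)
    bound = 1.0 / k ** 2  # Chebyshev bound
    print(k, round(true_p, 4), round(bound, 4))  # the bound is far above the true tail
```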

5.6.2 3.6.2 Identities

Content: Useful algebraic identities for manipulating moments and expectations. Rigorous Focus: Stein’s Lemma (for Normal variables: \(E[g(X)(X-\mu)] = \sigma^2 E[g'(X)]\)).

Worked Example: Applying Stein’s Lemma

  • Scenario: Calculating \(E[X^3]\) for \(X \sim N(0,1)\) without direct integration.

  • Interactive Element: Step-by-step derivation. Set \(g(x) = x^2\), so \(g'(x) = 2x\). The interface shows the substitution into Stein’s Lemma: \(E[X^2 \cdot X] = 1 \cdot E[2X]\), resulting in \(E[X^3] = 2(0) = 0\). The tool then prompts the user to calculate \(E[X^4]\) by choosing \(g(x) = x^3\).
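Stein's Lemma can also be checked by simulation; a hedged Monte Carlo sketch for the \(g(x) = x^2\) case (the sample size and seed are arbitrary, and the tolerances reflect Monte Carlo noise):

```python
import random

def stein_check(g, gprime, mu=0.0, sigma=1.0, n=200_000, seed=1):
    """Monte Carlo check of Stein's Lemma: E[g(X)(X-μ)] = σ²·E[g'(X)] for X ~ N(μ, σ²)."""
    rng = random.Random(seed)
    lhs = rhs = 0.0
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        lhs += g(x) * (x - mu)
        rhs += gprime(x)
    return lhs / n, (sigma ** 2) * rhs / n

# g(x) = x², g'(x) = 2x: both sides estimate E[X³] = 0 for X ~ N(0,1)
lhs, rhs = stein_check(lambda x: x * x, lambda x: 2 * x)
print(abs(lhs) < 0.05, abs(lhs - rhs) < 0.05)
```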

5.7 Part III: Technical Implementation Guidelines (Module 3 Specifics)

  1. Symbolic Math Backend: For the Exponential Family Decomposer, a robust symbolic computation engine is required. SymPy (Python) or the Mathematica kernel is ideal. The frontend must pass LaTeX strings to the backend, and the backend must perform algebraic factoring (e.g., factoring \(\frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\) into \(\exp\) terms with \(x^2\), \(x\), and constants grouped).
  2. Graph Visualization Library: For the Distribution Genealogy Graph, use D3.js force-directed graphs or Cytoscape.js. Nodes should contain hover-state LaTeX formulas for PDFs/PMFs, and edges must support click events to open derivation modals.
  3. Numerical Integration for Bounds: For the Chebyshev/Markov Tester, the frontend cannot rely on closed-form CDFs if the user uploads a custom \(f(z)\) for a location-scale family. Use a fast WebAssembly (Wasm) port of a QUADPACK integration routine to calculate true tail probabilities on the fly.
  4. Dynamic Rescaling Logic: In the Shape-Shifter Engine, standard chart libraries (like Chart.js) scale the y-axis to the max value automatically. This must be overridden. As \(\sigma\) increases, the code must manually calculate \(f(\mu)/\sigma\) and lock the y-axis limits to prevent the visual “flattening” from being obscured by auto-zooming, ensuring the user sees the physical reduction in density height caused by the \(1/\sigma\) term.

6 Module 4: Multiple Random Variables

6.1 1. Module Overview

This module generalizes the concepts of univariate random variables to the multivariate setting. It rigorously develops the mathematical framework required to model the simultaneous behavior of multiple random variables, focusing on their joint, marginal, and conditional structures. The module covers the complex calculus of bivariate transformations, the algebra of covariance, hierarchical models, and key inequalities that bound probabilistic behavior.

Learning Objectives:

  • Calculate and interpret joint, marginal, and conditional distributions for both discrete and continuous random vectors.
  • Assess the independence of random variables and identify the structural properties of conditionally independent variables.
  • Derive the joint pdf of transformed random variables using the bivariate Jacobian matrix.
  • Compute covariance and correlation, and apply the variance of sums formula to linear combinations of random variables.
  • Deconstruct hierarchical models using the Laws of Total Expectation and Total Variance (EVE’s Law).
  • Apply the Multivariate Normal Distribution and understand its properties under affine transformations.
  • Utilize probability and expectation inequalities (Cauchy-Schwarz, Jensen’s) for theoretical bounding.

6.2 2. Sub-Module 4.1: Joint and Marginal Distributions

6.2.1 4.1.1 Joint Distributions

Content: Definition of the joint CDF \(F_{X,Y}(x,y) = P(X \le x, Y \le y)\). Joint pmf for discrete variables: \(f_{X,Y}(x,y) = P(X=x, Y=y)\). Joint pdf for continuous variables: \(P((X,Y) \in A) = \iint_A f_{X,Y}(x,y) \,dx\,dy\).

6.2.2 4.1.2 Marginal Distributions

Content: Recovering the univariate distribution from the joint distribution. Discrete: \(f_X(x) = \sum_y f_{X,Y}(x,y)\). Continuous: \(f_X(x) = \int_{-\infty}^\infty f_{X,Y}(x,y) \,dy\).

Interactive Resource: The 3D Marginalizer

  • Design: A 3D surface plot of a joint pdf \(f_{X,Y}(x,y)\) with a volume slider.
  • Interaction: The user selects a specific slice \(x = x_0\). A plane cuts through the 3D surface, highlighting the curve \(f_{X,Y}(x_0, y)\). The tool then animates “collapsing” the 3D volume along the y-axis (integrating out \(y\)), projecting the marginal density \(f_X(x)\) onto the back wall of the plot. A side-by-side 2D plot dynamically builds the marginal pdf as the user sweeps the \(x\)-slicer across the domain.
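The "collapse" animation is numerical marginalization over \(y\); a minimal sketch (assuming SciPy, with a standard bivariate Normal and an illustrative \(\rho = 0.6\), whose marginals are standard Normal for any \(\rho\)):

```python
import numpy as np
from scipy import integrate, stats

rho = 0.6

def f_xy(x, y):
    """Standard bivariate Normal pdf with correlation rho."""
    q = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-q / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

def marginal_x(x0):
    """f_X(x0) = integral of f_{X,Y}(x0, y) over y."""
    val, _err = integrate.quad(lambda y: f_xy(x0, y), -np.inf, np.inf)
    return val
```

Sweeping `x0` across a grid of slice locations rebuilds the marginal curve exactly as the side-by-side 2D panel does.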

6.3 3. Sub-Module 4.2: Conditional Distributions and Independence

6.3.1 4.2.1 Conditional Distributions

Content: Definition of the conditional pdf/pmf: \(f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}\) for \(f_X(x) > 0\). The conditional distribution as a probability distribution in its own right.

6.3.2 4.2.2 Independence

Content: \(X\) and \(Y\) are independent iff \(f_{X,Y}(x,y) = f_X(x)f_Y(y)\) for all \(x,y\). Equivalently, \(F_{X,Y}(x,y) = F_X(x)F_Y(y)\). Rigorous Focus: Support sets. If the support of \((X,Y)\) is not a Cartesian product of the supports of \(X\) and \(Y\) (e.g., a triangular region), the variables cannot be independent.

Interactive Resource: Conditional Shape-Shifter & Independence Checker

  • Design: A 2D heatmap of a joint pdf \(f_{X,Y}\). Below it, a dynamic plot for \(f_{Y|X}(y|x)\).

  • Interaction: * Conditioning: The user drags a vertical line along the x-axis (\(x_0\)). The cross-section is extracted, normalized, and plotted in the lower window as \(f_{Y|X}(y|x_0)\). * Independence Test: The user presses an “Independence Check” button. The tool visually multiplies \(f_X(x)\) and \(f_Y(y)\) and overlays the resulting surface on the joint heatmap. If they match, independence is verified. The tool explicitly flags non-Cartesian supports (e.g., \(0 < x < y < 1\)) with a red warning: “Dependent due to support constraints.”
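The product-of-marginals overlay and the support warning can both be sketched numerically. The density below, \(f(x,y) = 2\) on the triangle \(0 < x < y < 1\), is the standard illustrative case of dependence forced by a non-Cartesian support:

```python
from scipy import integrate

def f_joint(x, y):
    """Uniform density on the triangle 0 < x < y < 1 (support is not a product set)."""
    return 2.0 if 0 < x < y < 1 else 0.0

def marg_x(x):   # f_X(x) = integral_x^1 2 dy = 2(1 - x)
    return integrate.quad(lambda y: f_joint(x, y), 0, 1, points=[x])[0]

def marg_y(y):   # f_Y(y) = integral_0^y 2 dx = 2y
    return integrate.quad(lambda x: f_joint(x, y), 0, 1, points=[y])[0]

x0, y0 = 0.3, 0.8
joint = f_joint(x0, y0)             # 2.0
product = marg_x(x0) * marg_y(y0)   # 2(0.7) * 2(0.8) = 2.24, so f != f_X * f_Y
```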

6.4 4. Sub-Module 4.3: Bivariate Transformations

6.4.1 4.3.1 The Jacobian Matrix Method

Content: Finding the joint pdf of \((U,V) = (g_1(X,Y), g_2(X,Y))\).

  1. Ensure the transformation is one-to-one.
  2. Find the inverse transformation \(x = g_1^{-1}(u,v), y = g_2^{-1}(u,v)\).
  3. Compute the Jacobian determinant: \(J = \begin{vmatrix} \frac{\partial x}{\partial u} & \frac{\partial x}{\partial v} \\ \frac{\partial y}{\partial u} & \frac{\partial y}{\partial v} \end{vmatrix}\).
  4. Apply the formula: \(f_{U,V}(u,v) = f_{X,Y}(x(u,v), y(u,v)) |J|\).

6.4.2 4.3.2 Non-One-to-One Mappings and Auxiliary Variables

Content: If the mapping is not one-to-one (e.g., \(U = X/Y\)), partition the space or introduce an auxiliary variable (e.g., \(V = Y\)), find the joint pdf of \((U,V)\), and then integrate out \(V\) to find the marginal of \(U\).

Interactive Resource: The Deformation Grid Engine

  • Design: A canvas displaying a uniform grid in the \((X,Y)\) plane and a second canvas for the \((U,V)\) plane.

  • Interaction: The user defines a transformation (e.g., \(U = X+Y, V = X-Y\)). The grid in the \((X,Y)\) plane warps into a parallelogram grid in the \((U,V)\) plane.

  • Rigor Check: Hovering over a small area element \(dx \times dy\) in the \((X,Y)\) plane highlights the corresponding area \(du \times dv\) in the \((U,V)\) plane. The tool calculates the ratio of the areas, demonstrating that it exactly equals the absolute value of the Jacobian determinant \(|J|\). This visualizes why the Jacobian is needed to preserve probability mass.

Worked Example: Sum and Difference of Independent Normals

  • Scenario: \(X, Y \sim N(0,1)\) independent. Let \(U = X+Y, V = X-Y\).

  • Interactive Element: The engine walks through the inverse (\(X = (U+V)/2, Y = (U-V)/2\)), computes the Jacobian (\(J = -1/2\), \(|J| = 1/2\)), and shows the algebraic factorization proving \(U\) and \(V\) are independent \(N(0,2)\) variables.
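A quick Monte Carlo confirmation of the worked example's conclusion (illustrative NumPy sketch; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)

u, v = x + y, x - y          # the transformation from the worked example

# |J| = 1/2 and the factorized joint density imply U, V ~ independent N(0, 2)
var_u, var_v = u.var(), v.var()
corr_uv = np.corrcoef(u, v)[0, 1]
```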

6.5 5. Sub-Module 4.4: Hierarchical Models and Mixture Distributions

6.5.1 4.4.1 Hierarchical Models

Content: Modeling situations where the parameters of a distribution are themselves random variables (e.g., \(X|Y \sim \text{Poisson}(Y)\) and \(Y \sim \text{Gamma}(\alpha, \beta)\)).

6.5.2 4.4.2 Mixture Distributions

Content: The marginal distribution of \(X\) in a hierarchical model is a mixture: \(f_X(x) = \int f_{X|Y}(x|y) f_Y(y) dy\).

6.5.3 4.4.3 Laws of Total Expectation and Variance

Content:

  • Adam’s Law (Total Expectation): \(E[X] = E[E(X|Y)]\).
  • EVE’s Law (Total Variance): \(\text{Var}(X) = E[\text{Var}(X|Y)] + \text{Var}(E[X|Y])\).

Interactive Resource: The Variance Decompounder

  • Design: A flowchart visualization of a hierarchical model (e.g., \(Y \to X|Y\)). A dashboard showing \(E[X]\) and \(\text{Var}(X)\).

  • Interaction: The user adjusts the parameters of the prior \(f_Y(y)\).

  • Rigorous Focus: A dynamic bar chart visualizing EVE’s law. The total variance bar is split into two stacked segments: “Explained Variance” \([\text{Var}(E[X|Y])]\) and “Unexplained Variance” \([E[\text{Var}(X|Y)]]\). As the user increases the variance of \(Y\), the “Explained” segment grows, showing how the uncertainty in the parameter propagates to the total uncertainty in \(X\).
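EVE's Law can be checked end-to-end on the Poisson-Gamma hierarchy from 4.4.1 (illustrative parameter values; with the scale parameterization, \(E[Y] = \alpha\beta\) and \(\text{Var}(Y) = \alpha\beta^2\)):

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta = 3.0, 2.0                  # Y ~ Gamma(alpha, scale = beta)
N = 500_000

y = rng.gamma(alpha, beta, size=N)      # draw the random Poisson rate
x = rng.poisson(y)                      # X | Y ~ Poisson(Y)

unexplained = alpha * beta              # E[Var(X|Y)] = E[Y] = 6
explained = alpha * beta**2             # Var(E[X|Y]) = Var(Y) = 12
total_theory = unexplained + explained  # EVE's Law: Var(X) = 18
total_mc = x.var()
```

Increasing \(\beta\) grows the "explained" share of the total, exactly as the stacked bar chart displays.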

6.6 6. Sub-Module 4.5: Covariance and Correlation

6.6.1 4.5.1 Definitions and Properties

Content: Covariance: \(\text{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[XY] - E[X]E[Y]\). Correlation: \(\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}\). Rigorous Focus: \(\text{Cov}(X,Y) = 0 \not\implies\) Independence (except for joint Normals).

6.6.2 4.5.2 Variance of Sums

Content: \(\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y) + 2ab\text{Cov}(X,Y)\). Extension to sums of \(n\) variables: \(\text{Var}(\sum X_i) = \sum \text{Var}(X_i) + 2\sum\sum_{i<j} \text{Cov}(X_i, X_j)\).

Interactive Resource: The Scatterplot & Correlation Shifter

  • Design: A scatterplot of \((X,Y)\) data and a slider for \(\rho\) (constraining the joint distribution to be, for example, Bivariate Normal).

  • Interaction: As \(\rho\) varies from -1 to 1, the point cloud morphs from a negative slope line to a circle to a positive slope line.

  • Rigor Check: A “Non-linear dependence” button generates a dataset where \(Y = X^2\) (with \(X\) symmetric around 0). The tool calculates \(\text{Cov}(X,Y) = 0\) and \(\rho = 0\), but visually highlights the strong parabolic relationship, driving home the point that correlation only measures linear dependence.
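The "Non-linear dependence" claim is easy to verify directly (illustrative sketch with \(X \sim \text{Uniform}(-1, 1)\)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)   # symmetric around 0, so E[X^3] = 0
y = x**2                          # Y is a deterministic function of X, yet...

cov_xy = np.cov(x, y)[0, 1]       # Cov(X, X^2) = E[X^3] - E[X] E[X^2] = 0
rho = np.corrcoef(x, y)[0, 1]     # ...the correlation is (numerically) zero
```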

6.7 7. Sub-Module 4.6: Multivariate Distributions

6.7.1 4.6.1 Random Vectors and Matrices

Content: Notation for \(\mathbf{X} = (X_1, \dots, X_n)^T\). The mean vector \(\mathbf{\mu}\) and the covariance matrix \(\Sigma\) where \(\Sigma_{ij} = \text{Cov}(X_i, X_j)\).

6.7.2 4.6.2 The Multivariate Normal Distribution

Content: \(\mathbf{X} \sim N_n(\mathbf{\mu}, \Sigma)\). The joint pdf using the inverse of the covariance matrix (precision matrix).

Rigorous Focus: Affine transformations: If \(\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}\), then \(\mathbf{Y} \sim N(\mathbf{A}\mathbf{\mu} + \mathbf{b}, \mathbf{A}\Sigma\mathbf{A}^T)\).

Interactive Resource: The Covariance Ellipse Constructor

  • Design: A 2D plane drawing confidence ellipses for a Bivariate Normal. Input fields for the \(2 \times 2\) covariance matrix \(\Sigma\).

  • Interaction: The user modifies the variances (\(\sigma_1^2, \sigma_2^2\)) and the covariance (\(\sigma_{12}\)). The ellipse rotates and stretches in real-time.

  • Rigor Check: The tool computes the eigenvalues and eigenvectors of \(\Sigma\) and overlays them as principal axes on the ellipse, demonstrating that the eigenvectors dictate the orientation of the joint density, and the square roots of the eigenvalues dictate the spread along those axes.
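The principal-axis overlay is an eigen-decomposition of \(\Sigma\); a NumPy sketch with illustrative matrix entries:

```python
import numpy as np

Sigma = np.array([[4.0, 1.2],     # sigma_1^2 = 4, covariance sigma_12 = 1.2
                  [1.2, 1.0]])    # sigma_2^2 = 1

# eigenvectors give the ellipse's orientation; square roots of the
# eigenvalues give the spread along each principal axis
eigvals, eigvecs = np.linalg.eigh(Sigma)
axis_scales = np.sqrt(eigvals)

# spectral reconstruction: Sigma = V diag(lambda) V^T
recon = eigvecs @ np.diag(eigvals) @ eigvecs.T
```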

6.8 8. Sub-Module 4.7: Inequalities

6.8.1 4.7.1 Numerical Inequalities

Content: Generalized Chebyshev, Cauchy-Schwarz Inequality: \([E(XY)]^2 \le E(X^2)E(Y^2)\). Proof that \(|\rho| \le 1\) using Cauchy-Schwarz.

6.8.2 4.7.2 Functional Inequalities

Content: Jensen’s Inequality: For a convex function \(g\), \(E[g(X)] \ge g(E[X])\). For concave \(g\), \(E[g(X)] \le g(E[X])\).

Interactive Resource: Jensen’s Visual Prover

  • Design: A graphing canvas where the user can select a convex function \(g(x)\) (e.g., \(g(x) = x^2\) or \(g(x) = e^x\)) and a probability distribution for \(X\).

  • Interaction: The tool plots \(g(x)\). It calculates \(E[X]\) and draws a point \((E[X], g(E[X]))\) on the curve. It then calculates \(E[g(X)]\) and draws a horizontal line at that height. The visual gap between the horizontal line (higher) and the point on the curve (lower) explicitly demonstrates \(E[g(X)] \ge g(E[X])\). The user can drag a slider to morph \(g(x)\) from convex to concave, watching the inequality flip.
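Jensen's gap, and its flip under concavity, can be measured by simulation (illustrative sketch with \(X \sim \text{Exponential}(1)\)):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(1.0, 200_000)   # any non-degenerate X will do

# convex g(x) = x^2: E[g(X)] - g(E[X]) = Var(X) > 0
convex_gap = np.mean(x**2) - np.mean(x)**2
# concave g(x) = log(x): the inequality flips sign
concave_gap = np.mean(np.log(x)) - np.log(np.mean(x))
```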

6.9 Part III: Technical Implementation Guidelines (Module 4 Specifics)

  1. 3D Visualization Library: Module 4 heavily relies on understanding surfaces and volumes. Utilize Three.js or Plotly.js (WebGL mode) for the “3D Marginalizer”. Ensure the rendering supports transparency so the user can “see inside” the joint density volume when slicing.
  2. Linear Algebra Backend: For the Multivariate Normal and Bivariate Transformations, a client-side linear algebra library like ml-matrix or a WebAssembly port of Eigen is mandatory. The engine must dynamically invert matrices (to find the precision matrix or the inverse transformation) and calculate determinants (for the Jacobian) on the fly as users adjust parameters.
  3. Symbolic Integration for Margins: When a user defines a custom joint pdf \(f_{X,Y}(x,y)\), integrating out \(y\) analytically to find \(f_X(x)\) is often impossible. The tool must seamlessly fall back to numerical integration (e.g., using a WASM port of QUADPACK or a Simpson’s rule implementation in JS) to render the marginal curves.
  4. Deformation Grid Animation: For the Bivariate Transformation engine, do not just transform the points; transform a mesh grid. Using CSS/SVG transforms or WebGL vertex shaders, mapping the \((X,Y)\) grid to the \((U,V)\) grid provides the crucial visual intuition for the Jacobian distortion that static diagrams lack.

7 Module 5: Properties of a Random Sample

7.1 1. Module Overview

This module bridges the gap between the probabilistic behavior of individual random variables and statistical inference about populations based on observed data. It rigorously defines the concept of a random sample (iid random variables), explores the exact distributions of fundamental statistics (like the sample mean and variance) under normality, and introduces the asymptotic tools (convergence theorems and the Delta Method) that allow statisticians to approximate sampling distributions when exact solutions are intractable.

Learning Objectives:

  • Formalize the concept of a random sample and compute distributions of sums using Moment Generating Functions (MGFs).

  • Prove and apply the independence of the sample mean \(\bar{X}\) and sample variance \(S^2\) when sampling from a Normal distribution.

  • Derive the exact sampling distributions of the \(\chi^2\), Student’s \(t\), and Snedecor’s \(F\) statistics.

  • Derive the marginal and joint distributions of order statistics.

  • Distinguish between convergence in probability, almost sure convergence, and convergence in distribution.

  • Apply the Weak Law of Large Numbers (WLLN) and the Central Limit Theorem (CLT).

  • Use the Delta Method to approximate the variance and distribution of functions of the sample mean.

7.2 2. Sub-Module 5.1 & 5.2: Basic Concepts and Sums of Random Variables

7.2.1 5.1.1 The IID Assumption

Content: Definition of a random sample: \(X_1, \dots, X_n\) are independent and identically distributed (iid) random variables with cdf \(F_X(x)\). Definition of a statistic \(T = g(X_1, \dots, X_n)\) and the concept of a sampling distribution.

7.2.2 5.2.1 Distributions of Sums via MGFs

Content: The MGF of a sum of independent random variables is the product of their MGFs: \(M_{\sum X_i}(t) = \prod_{i=1}^n M_{X_i}(t)\). Rigorous Focus: Using this to prove that sums of independent Normals are Normal, and sums of independent Gamma variables (with the same scale parameter) are Gamma.

Interactive Resource: The Convolution & MGF Multiplier

  • Design: A split screen. Left: Sliders to choose \(n\) and the parameters of an iid distribution (e.g., Gamma\((\alpha, \beta)\)). Right: A dynamic plot of the MGF of the sum \(M_{\sum X_i}(t)\).

  • Interaction: The user increments \(n\). The tool visually multiplies the base MGF \(M_X(t)\) by itself \(n\) times. It then identifies the resulting functional form (e.g., \((1-\beta t)^{-n\alpha}\)) and plots the corresponding pdf of the sum below, demonstrating how the distribution shifts and spreads as \(n\) increases.
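The identification step can be cross-checked by simulation: the MGF product \((1-\beta t)^{-n\alpha}\) says the sum of \(n\) iid Gamma\((\alpha, \beta)\) variables is Gamma\((n\alpha, \beta)\) (illustrative parameters below):

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, beta, n = 1.5, 2.0, 6

# sum of n iid Gamma(alpha, scale = beta) draws, replicated 200,000 times
sums = rng.gamma(alpha, beta, size=(200_000, n)).sum(axis=1)

theory_mean = n * alpha * beta      # Gamma(n*alpha, beta) mean = 18
theory_var = n * alpha * beta**2    # Gamma(n*alpha, beta) variance = 36
mc_mean, mc_var = sums.mean(), sums.var()
```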

7.3 3. Sub-Module 5.3: Sampling from the Normal Distribution

7.3.1 5.3.1 Sample Mean and Variance Independence

Content: Definitions: \(\bar{X} = \frac{1}{n}\sum X_i\) and \(S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2\). Rigorous Focus: Theorem: If \(X_1, \dots, X_n\) are iid \(N(\mu, \sigma^2)\), then:

  1. \(\bar{X} \sim N(\mu, \sigma^2/n)\).
  2. \((n-1)S^2/\sigma^2 \sim \chi^2_{n-1}\).
  3. \(\bar{X}\) and \(S^2\) are independent.

Interactive Resource: The Independence Scatter-Proof

  • Design: A Monte Carlo simulation engine. It draws \(M=500\) samples of size \(n\) from a Normal distribution. For each sample, it calculates \(\bar{X}\) and \(S^2\).

  • Interaction: A scatterplot of the 500 points \((\bar{X}, S^2)\) is rendered. A “Test Independence” button runs a correlation test on the scatterplot, yielding \(\rho \approx 0\). The user can change the underlying population to an Exponential distribution. The scatterplot immediately shows a strong dependency (funnel shape), proving that the independence of \(\bar{X}\) and \(S^2\) is a unique and critical property of the Normal family.
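The Scatter-Proof reduces to a few lines of NumPy (illustrative sample sizes; the Exponential case exhibits the funnel-shaped dependence):

```python
import numpy as np

rng = np.random.default_rng(42)
M, n = 20_000, 10

def mean_var_corr(draw):
    """Empirical correlation between X-bar and S^2 across M samples of size n."""
    samples = draw((M, n))
    xbar = samples.mean(axis=1)
    s2 = samples.var(axis=1, ddof=1)
    return np.corrcoef(xbar, s2)[0, 1]

rho_normal = mean_var_corr(rng.standard_normal)                     # ~0: independent
rho_expon = mean_var_corr(lambda size: rng.exponential(1.0, size))  # clearly positive
```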

7.3.2 5.3.2 The Derived Distributions: Student’s \(t\) and Snedecor’s \(F\)

Content:

  • Definition of \(T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}\).

  • Definition of \(F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1-1, n_2-1}\).

Rigorous Focus: Deriving the \(t\)-distribution as a ratio of a standard Normal to the square root of an independent Chi-squared divided by its df. The convergence of \(t_{\nu} \to N(0,1)\) as \(\nu \to \infty\).

Interactive Resource: The \(t\) vs. Normal Morphing Slider

  • Design: A plot showing the standard Normal pdf. A slider controls the degrees of freedom \(\nu\).

  • Interaction: As \(\nu\) decreases from \(\infty\) down to 1, the \(t\)-distribution curve overlays the Normal, showing its heavier tails and lower peak. The tool dynamically calculates and displays the kurtosis, visually linking the mathematical moment to the shape of the tails.

7.4 4. Sub-Module 5.4: Order Statistics

7.4.1 5.4.1 Marginal Distributions of Order Statistics

Content: \(X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}\). The CDF of the \(j\)th order statistic: \(F_{X_{(j)}}(x) = \sum_{k=j}^n \binom{n}{k} [F_X(x)]^k [1-F_X(x)]^{n-k}\). The PDF of the \(j\)th order statistic: \(f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} f_X(x) [F_X(x)]^{j-1} [1-F_X(x)]^{n-j}\).

Interactive Resource: The Order Statistic Sub-Sampler

  • Design: A canvas with \(n\) slots representing a sample. Next to it, a plot of the underlying pdf \(f_X(x)\).

  • Interaction: The user selects a specific order statistic \(j\) (e.g., the minimum \(j=1\), or the median \(j=\lceil n/2 \rceil\)). The tool runs 10,000 simulations, extracts the \(j\)th order statistic from each, and builds a histogram. The theoretical pdf \(f_{X_{(j)}}(x)\) is overlaid. The user can dynamically increase \(n\), watching the distribution of the sample minimum shift sharply left (for right-skewed distributions) and the distribution of the median tighten around the population median.
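The histogram check is straightforward for the sample minimum of a Uniform\((0,1)\) parent, where the formulas above specialize to \(f_{X_{(1)}}(x) = n(1-x)^{n-1}\) and \(F_{X_{(1)}}(x) = 1-(1-x)^n\) (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5                                  # sample size; j = 1 (the minimum)

minima = rng.uniform(size=(100_000, n)).min(axis=1)

exact_mean = 1 / (n + 1)               # E[X_(1)] = 1/(n+1) for Uniform(0,1)
mc_mean = minima.mean()
# CDF check at x = 0.2: F(0.2) = 1 - 0.8^5
mc_cdf = (minima < 0.2).mean()
exact_cdf = 1 - 0.8**n
```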

7.4.2 5.4.2 Joint Distributions of Order Statistics

Content: The joint pdf of \(X_{(i)}\) and \(X_{(j)}\) for \(i < j\).

7.5 5. Sub-Module 5.5: Convergence Concepts

7.5.1 5.5.1 Convergence in Probability

Content: \(X_n \xrightarrow{P} X\) if \(\lim_{n \to \infty} P(|X_n - X| \ge \epsilon) = 0\) for all \(\epsilon > 0\).

Rigorous Focus: The Weak Law of Large Numbers (WLLN): \(\bar{X}_n \xrightarrow{P} \mu\).

7.5.2 5.5.2 Almost Sure Convergence

Content: \(X_n \xrightarrow{a.s.} X\) if \(P(\lim_{n \to \infty} X_n = X) = 1\).

Rigorous Focus: The Strong Law of Large Numbers (SLLN). The subtle but critical difference between the WLLN (probability of a deviation at a specific \(n\)) and SLLN (probability of an eventual permanent deviation).

Interactive Resource: The Limit Distinction Visualizer

  • Design: A time-series plot of \(\bar{X}_n\) vs. \(n\) for multiple independent sample paths. Two horizontal lines represent \(\mu \pm \epsilon\).

  • Interaction: * WLLN Mode: The tool highlights that at any given \(n\), the fraction of paths outside the bounds approaches 0. * SLLN Mode: The user observes that while a path might occasionally dip outside the bounds, eventually every single path enters the bounds and stays there forever. The tool includes a “Pathological Counterexample” (independent \(X_n\) with \(X_n = n\) with probability \(1/n\) and \(X_n = 0\) with probability \(1-1/n\)), which converges in probability to 0 but, by the second Borel-Cantelli lemma, fails to converge almost surely, visually demonstrating the gap between the two modes of convergence.

7.5.3 5.5.3 Convergence in Distribution

Content: \(X_n \xrightarrow{d} X\) if \(\lim_{n \to \infty} F_{X_n}(x) = F_X(x)\) for all \(x\) where \(F_X\) is continuous.

Rigorous Focus: The Central Limit Theorem (CLT): \(\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0,1)\).

Interactive Resource: The Universal CLT Engine

  • Design: A “Choose Your Own Adventure” distribution selector (e.g., highly skewed Gamma, discrete Bernoulli, heavy-tailed Cauchy).

  • Interaction: The user selects a distribution. A slider increases \(n\). The tool plots the exact sampling distribution of \(\bar{X}_n\) (via fast numerical convolution or Monte Carlo) and overlays the Normal approximation from the CLT.

  • Rigor Check: If the user selects the Cauchy distribution (which has no finite mean or variance), the CLT overlay fails to match the sampling distribution regardless of \(n\), proving that the CLT’s moment assumptions are not merely decorative.
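For a well-behaved parent the Engine's overlay can be checked in a few lines; the sketch below standardizes means of a skewed Exponential\((1)\) parent (illustrative \(M\) and \(n\)):

```python
import numpy as np

rng = np.random.default_rng(11)
M, n = 50_000, 40

# Exponential(1): mu = sigma = 1, heavily right-skewed
means = rng.exponential(1.0, size=(M, n)).mean(axis=1)
z = np.sqrt(n) * (means - 1.0) / 1.0        # CLT standardization

z_mean, z_std = z.mean(), z.std()
frac_below_1 = (z < 1.0).mean()             # compare with Phi(1) ~ 0.8413
```

Repeating the exercise with Cauchy draws leaves the sample means as dispersed at \(n = 40\) as at \(n = 4\), which is the Rigor Check's point.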

7.5.4 5.5.4 The Delta Method

Content: If \(\sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2)\), then \(\sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{d} N(0, \sigma^2 [g'(\theta)]^2)\), provided \(g'(\theta) \neq 0\). Rigorous Focus: The Second-Order Delta Method for when \(g'(\theta) = 0\).

Interactive Resource: The Tangent Line Transformer

  • Design: A plot of a function \(g(x)\) and a point \((\theta, g(\theta))\). A dynamic normal curve representing the distribution of \(X_n\) is centered at \(\theta\).

  • Interaction: The user chooses \(g(x)\) (e.g., \(g(x) = \sqrt{x}\)). As \(n\) increases, the distribution of \(X_n\) tightens around \(\theta\). The tool draws the tangent line \(g(x) \approx g(\theta) + g'(\theta)(x-\theta)\) and “reflects” the distribution of \(X_n\) off this tangent line to project the distribution of \(g(X_n)\) below. This visualizes the linearization at the heart of the Delta Method’s proof.
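The linearization can also be validated numerically: for \(X_i \sim \text{Exponential}(\text{scale} = 4)\) (so \(\theta = 4\), \(\sigma^2 = 16\)) and \(g(x) = \sqrt{x}\), the Delta Method predicts \(\text{Var}(g(\bar{X}_n)) \approx \sigma^2 [g'(\theta)]^2 / n\) (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(9)
M, n = 40_000, 200
theta, sigma2 = 4.0, 16.0

means = rng.exponential(theta, size=(M, n)).mean(axis=1)
g_means = np.sqrt(means)                 # g(X-bar)

g_prime = 1 / (2 * np.sqrt(theta))       # g'(theta) = 1/4
delta_var = sigma2 * g_prime**2 / n      # predicted variance: 1/200
mc_var = g_means.var()                   # Monte Carlo variance
```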

7.6 6. Sub-Module 5.6: Generating a Random Sample

7.6.1 5.6.1 Direct and Indirect Methods

Content: Direct Method (Probability Integral Transform from Module 2). Indirect Methods (e.g., using polar coordinates for Normals: Box-Muller transform).

7.6.2 5.6.3 The Accept/Reject Algorithm

Content: Generating from a target pdf \(f(x)\) using a proposal pdf \(g(x)\) and a constant \(M\) such that \(f(x) \le M g(x)\). Algorithm: Generate \(Y \sim g\), generate \(U \sim \text{Uniform}(0,1)\). If \(U \le f(Y)/(M g(Y))\), accept \(X=Y\); else repeat.

Interactive Resource: The Accept/Reject Dartboard

  • Design: A 2D plot showing the target curve \(f(x)\) and the envelope curve \(M g(x)\).

  • Interaction: The user clicks “Throw Dart”. An \(x\)-coordinate \(Y\) is drawn from the proposal \(g(x)\), and a \(y\)-coordinate \(U \cdot M g(Y)\) is generated with \(U \sim \text{Uniform}(0,1)\), plotting a point on the screen. If the point falls under \(f(x)\), it turns Green (Accept); if between \(f(x)\) and \(M g(x)\), it turns Red (Reject). The accepted \(x\)-values fall into a histogram below, gradually building the target distribution \(f(x)\).

  • Rigor Check: The user can adjust \(M\). If they make \(M\) too large, the acceptance rate plummets (mostly red darts), demonstrating the computational inefficiency of a poor envelope.
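The dartboard loop is the algorithm itself; a self-contained sketch targeting the Beta\((2,2)\) density \(f(x) = 6x(1-x)\) with a Uniform\((0,1)\) proposal, for which the tight envelope is \(M = \max f = 1.5\) (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
M = 1.5                                # f(x) <= M * g(x) with g = Uniform(0,1)

def f(x):
    """Target: Beta(2,2) pdf."""
    return 6.0 * x * (1.0 - x)

def accept_reject(size):
    out, trials = [], 0
    while len(out) < size:
        trials += 1
        y = rng.uniform()              # Y ~ g (the proposal)
        u = rng.uniform()              # U ~ Uniform(0,1)
        if u <= f(y) / M:              # g(y) = 1 on (0,1), so M*g(y) = M
            out.append(y)
    return np.array(out), size / trials

x, accept_rate = accept_reject(50_000)
sample_mean = x.mean()                 # Beta(2,2) mean is 1/2
```

The acceptance rate converges to \(1/M = 2/3\); inflating \(M\) leaves the output distribution unchanged but wastes darts, exactly as the Rigor Check describes.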

7.7 Part III: Technical Implementation Guidelines (Module 5 Specifics)

  1. Web Workers for Monte Carlo Simulation: Sub-modules 5.3 and 5.5 rely heavily on Monte Carlo simulation to empirically prove sampling distributions and convergence. To prevent the browser UI from freezing, all random number generation and statistical aggregation must be offloaded to Web Workers. The main thread should only receive summary data (histogram bins, means, variances) for rendering.
  2. Numerical Convolution Engine: For the CLT Engine, relying purely on Monte Carlo can be slow and “noisy”. For discrete distributions (like Binomial or Bernoulli), implement a fast Fourier transform (FFT) based convolution library. The exact distribution of the sum of \(n\) variables is the \(n\)-fold convolution of the PMF, which FFT computes in \(O(n \log n)\) time. This allows real-time updating of the exact distribution as the user drags the \(n\) slider.
  3. Path Generation for SLLN: The Almost Sure Convergence visualizer requires storing and updating multiple long-running sequences. Implement a streaming architecture where a background worker continuously appends random numbers to \(M\) distinct arrays, calculating the running mean. The visualization samples these arrays at intervals to update the time-series lines.
  4. Custom PRNG (Pseudorandom Number Generator): JavaScript’s native Math.random() is insufficient for rigorous statistical simulation (its implementation is engine-dependent and cannot be seeded). Bundle a robust, seedable PRNG such as Mersenne Twister (MT19937) in WebAssembly to ensure high-quality, reproducible random numbers for the Accept/Reject and sampling simulators.
  5. Symbolic Differentiation for Delta Method: The Tangent Line Transformer requires \(g'(x)\). Instead of hardcoding derivatives, integrate math.js or SymPy (via Pyodide) so users can type any valid differentiable JavaScript/Python math function (e.g., 1/x, sin(x)) and the tool automatically calculates and plots the derivative and resulting variance.

8 Module 6: Principles of Data Reduction

8.1 1. Module Overview

In practice, statisticians are faced with large datasets, but only a few parameters of interest. This module addresses the fundamental question: How can we reduce the data to a smaller set of summary statistics without losing information about the parameter? It formalizes three distinct philosophical and mathematical approaches to data reduction: the Sufficiency Principle, the Likelihood Principle, and the Equivariance Principle. Mastery of these principles is crucial for evaluating the optimality of estimators and tests in subsequent modules.

Learning Objectives:

  • Define and identify sufficient statistics using both the conditional definition and the Factorization Theorem.
  • Derive minimal sufficient statistics and understand their role in achieving maximum data reduction.
  • Identify ancillary statistics and understand their relationship with parameter independence.
  • Prove and apply Basu’s Theorem using the intersection of completeness, sufficiency, and ancillarity.
  • Construct the Likelihood Function and apply the formal Likelihood Principle, understanding its controversial implications for stopping rules.
  • Apply the Equivariance Principle to derive estimators under parameter transformations.

8.2 2. Sub-Module 6.1: Introduction to Data Reduction

8.2.1 6.1.1 The Core Problem

Content: Given a sample \(X_1, \dots, X_n\) from \(f(x|\theta)\), we wish to find a statistic \(T(X_1, \dots, X_n)\) that summarizes the data. The goal is to achieve dimension reduction while retaining all “information” about \(\theta\).

Interactive Resource: The Information Loss Visualizer

  • Design: A scatterplot of \(n=50\) iid data points drawn from a Normal distribution.
  • Interaction: The user applies various statistics to summarize the data: \(T_1 = \bar{X}\), \(T_2 = \text{Median}\), \(T_3 = X_1\) (first data point), \(T_4 = \sum X_i^2\). The original data points fade, leaving only the value of the statistic as a vertical line or summary point. The tool then runs a Monte Carlo simulation: given that fixed statistic, it generates “plausible” datasets consistent with that summary. The user visually observes that the pair \((\bar{X}, \sum X_i^2)\), which is jointly sufficient for \((\mu, \sigma^2)\), recreates the general shape of the data, while \(X_1\) alone loses almost all structural information.

8.3 3. Sub-Module 6.2: The Sufficiency Principle

8.3.1 6.2.1 Sufficient Statistics

Content:

  • Definition: \(T(X)\) is sufficient for \(\theta\) if the conditional distribution of the sample \(\mathbf{X}\) given \(T(\mathbf{X})=t\) does not depend on \(\theta\).
  • The Factorization Theorem (Critical Tool): \(T(X)\) is sufficient iff there exist functions \(g(t|\theta)\) and \(h(x)\) such that \(f(\mathbf{x}|\theta) = g(T(\mathbf{x})|\theta) h(\mathbf{x})\).

Interactive Resource: The Factorization Factory

  • Design: A symbolic math parser. The user inputs a joint pdf (e.g., \(f(\mathbf{x}|\theta) = \theta^n e^{-\theta \sum x_i}\) for Exponential).
  • Interaction: The user highlights portions of the mathematical expression to assign them to \(g(t|\theta)\) and \(h(\mathbf{x})\).
    • Step 1: Group \(\theta\)-dependent and \(\theta\)-independent terms.
    • Step 2: Identify the statistic \(T(\mathbf{x})\) trapped inside the \(\theta\)-dependent term.
    • Step 3: The engine verifies if the factorization is valid and declares \(T(\mathbf{X}) = \sum X_i\) a sufficient statistic.
  • Rigorous Focus: The engine rejects invalid factorizations (e.g., if the user tries to include an \(x_i\) in \(g\) that is not part of the \(T(\mathbf{x})\) mapping), enforcing the strict separation of variables.
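The Factory's final verification step can be sketched symbolically with SymPy for the Exponential example in the Design bullet (sample size \(n = 4\) is illustrative):

```python
import sympy as sp

theta = sp.symbols('theta', positive=True)
xs = sp.symbols('x1:5', positive=True)          # a sample of size n = 4

# joint pdf of iid Exponential(theta): theta^n * exp(-theta * sum(x_i))
joint = sp.prod([theta * sp.exp(-theta * xi) for xi in xs])

T = sum(xs)                                      # candidate sufficient statistic
g = theta**len(xs) * sp.exp(-theta * T)          # g(T(x) | theta)
h = sp.Integer(1)                                # h(x), free of theta

# Factorization Theorem check: f(x | theta) = g(T(x) | theta) * h(x)
factorizes = sp.simplify(joint - g * h) == 0
```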

8.3.2 6.2.2 Minimal Sufficient Statistics

Content: A sufficient statistic \(T\) is minimal sufficient if it is a function of every other sufficient statistic (maximum data reduction). Rigorous Focus: The Ratio Method: \(T(\mathbf{x})\) is minimal sufficient if \(\frac{f(\mathbf{x}|\theta)}{f(\mathbf{y}|\theta)}\) is constant in \(\theta\) iff \(T(\mathbf{x}) = T(\mathbf{y})\).

Interactive Resource: The Ratio Constancy Checker

  • Design: A dual input for two sample vectors \(\mathbf{x}\) and \(\mathbf{y}\). A dynamic plot of the function \(R(\theta) = \frac{f(\mathbf{x}|\theta)}{f(\mathbf{y}|\theta)}\).

  • Interaction: The user alters \(\mathbf{x}\) and \(\mathbf{y}\). If \(T(\mathbf{x}) = T(\mathbf{y})\) (e.g., they have the same sum), the plot \(R(\theta)\) renders as a flat horizontal line (constant in \(\theta\)). If \(T(\mathbf{x}) \neq T(\mathbf{y})\), the plot shows a curve varying with \(\theta\). This visually defines the partitioning of the sample space induced by a minimal sufficient statistic.

8.3.3 6.2.3 Ancillary Statistics

Content: A statistic \(S(\mathbf{X})\) is ancillary for \(\theta\) if its distribution does not depend on \(\theta\). Examples:

  • Sample range \(X_{(n)} - X_{(1)}\) for Uniform\((0, \theta)\).
  • Sample variance \(S^2\) for \(N(\theta, \sigma^2)\) where \(\sigma^2\) is known.

8.3.4 6.2.4 Sufficient, Ancillary, and Complete Statistics

Content:

  • Completeness: A statistic \(T\) is complete if \(E_\theta[g(T)] = 0\) for all \(\theta\) implies \(P(g(T)=0) = 1\) for all \(\theta\). (No non-trivial function of \(T\) has mean zero).

  • Basu’s Theorem: If \(T\) is a complete sufficient statistic for \(\theta\), then \(T\) is independent of every ancillary statistic.

Interactive Resource: Basu’s Theorem Correlation Engine

  • Design: A simulator drawing samples from \(N(\mu, 1)\). A scatterplot mapping \(T(\mathbf{X}) = \bar{X}\) vs. \(S(\mathbf{X}) = S^2\).

  • Interaction: The engine runs thousands of iterations, plotting \((\bar{X}, S^2)\). The empirical correlation \(\rho\) is displayed, converging to 0.

  • Rigorous Focus: A “Proof Checker” module walks through the logic: (1) Prove \(\bar{X}\) is sufficient for \(\mu\). (2) Prove \(\bar{X}\) is complete for \(\mu\). (3) Prove \(S^2\) is ancillary for \(\mu\). (4) Conclude independence via Basu. The user can change the unknown parameter (e.g., \(N(0, \sigma^2)\) with \(\sigma^2\) unknown) to see where Basu’s hypotheses fail: \(S^2\) is no longer ancillary for \(\sigma^2\), so the theorem no longer applies. Switching the population to a non-Normal family (e.g., Exponential) then shows the empirical correlation between \(\bar{X}\) and \(S^2\) departing from 0.

8.4 4. Sub-Module 6.3: The Likelihood Principle

8.4.1 6.3.1 The Likelihood Function

Content: Definition: \(L(\theta|\mathbf{x}) = f(\mathbf{x}|\theta)\). The Likelihood Principle states that all experimental information about \(\theta\) is contained in the likelihood function. Two likelihoods proportional in \(\theta\) yield the same inference.

8.4.1.1 6.3.2 The Formal Likelihood Principle

Content: Birnbaum’s Theorem: The Likelihood Principle is mathematically equivalent to the conjunction of the Sufficiency Principle and the Conditionality Principle. Rigorous Focus: The Stopping Rule Paradox.

Interactive Resource: The Stopping Rule Paradox Simulator

  • Design: A coin-flipping simulator with two distinct experimental modes. * Mode A (Binomial): Flip exactly \(n=12\) times. Observe \(x=9\) heads. * Mode B (Negative Binomial): Flip until the 3rd tail is observed. It takes \(n=12\) flips (i.e., \(x=9\) heads).

  • Interaction: The user runs both experiments. Both yield identical Likelihood Functions: \(L(p|x=9, n=12) \propto p^9(1-p)^3\).

  • Rigorous Focus: The tool then calculates the Frequentist p-value for the hypothesis \(H_0: p=0.5\) under both models. * Mode A p-value: \(P(X \ge 9 \mid n=12, p=0.5) \approx 0.073\). * Mode B p-value: \(P(Y \ge 9 \mid p=0.5) \approx 0.033\), where \(Y\) is the number of heads observed before the 3rd tail.

  • The visualization highlights that despite identical data and identical likelihoods, the Frequentist inference changes based on the intent of the experimenter (the stopping rule). This powerfully demonstrates why adherence to the Likelihood Principle fundamentally conflicts with standard Frequentist methodology.
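A minimal sketch of the two p-value computations in the classic Lindley–Phillips form of this paradox (fixed \(n=12\) in Mode A; stopping at the 3rd tail in Mode B). The function names are illustrative, and only exact combinatorics are used:

```python
from math import comb

def binom_pvalue(x=9, n=12, p=0.5):
    """Mode A: P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def negbin_pvalue(heads=9, tails=3, p=0.5):
    """Mode B (stop at the `tails`-th tail): P(Y >= heads), where Y is the number
    of heads seen before the final tail, computed via the complement P(Y <= heads-1)."""
    below = sum(comb(y + tails - 1, tails - 1) * p**y * (1 - p)**tails
                for y in range(heads))
    return 1.0 - below

pA = binom_pvalue()    # 299/4096, about 0.073
pB = negbin_pvalue()   # 67/2048,  about 0.033
```

Both modes share the likelihood \(p^9(1-p)^3\) up to a constant, yet the two p-values straddle the conventional 0.05 threshold.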

8.5 5. Sub-Module 6.4: The Equivariance Principle

8.5.1 6.4.1 Equivariance under Transformations

Content: If a parameter is transformed by a function \(\eta = g(\theta)\), and \(\hat{\theta}\) is a “good” estimator for \(\theta\), then a “good” estimator for \(\eta\) should be \(\hat{\eta} = g(\hat{\theta})\).

Rigorous Focus: The distinction between Equivariance (for transformations of parameters, e.g., \(\mu \to e^\mu\)) and Invariance (for transformations of the sample space that leave the parameter unchanged, e.g., location shifts).

Interactive Resource: The Parameter Transformation Mapper

  • Design: A number line representing \(\theta\), and a secondary curve representing \(\eta = g(\theta)\).

  • Interaction: The user selects a distribution (e.g., Exponential(\(\theta\))) and an estimator \(\hat{\theta} = \bar{X}\). They define a transformation (e.g., \(\eta = 1/\theta\), the rate parameter). The tool plots the sampling distribution of \(\hat{\theta}\). It then applies the transformation to the random variable itself, deriving and plotting the distribution of \(g(\hat{\theta}) = 1/\bar{X}\) (an inverse-gamma distribution, since \(\bar{X}\) is Gamma-distributed). It demonstrates that the equivariant estimator is naturally induced by the transformation.

8.6 Part III: Technical Implementation Guidelines (Module 6 Specifics)

  1. Advanced Symbolic Math Engine: Module 6 is intensely algebraic. The Factorization Factory and the Ratio Constancy Checker require a backend (like SymPy via Pyodide or a custom WASM module) capable of symbolic factorization, simplification of multivariate expressions, and isolation of variables. The engine must differentiate between parameters (\(\theta\)) and random variables (\(x_i\)) syntactically.

  2. Conditional Distribution Calculator: To visually prove sufficiency without relying solely on the Factorization Theorem, the tool must compute \(P(\mathbf{X}=\mathbf{x} | T(\mathbf{X})=t)\). For discrete distributions, this requires dynamic enumeration of the sample space partitioned by \(T\). For continuous distributions, numerical integration over the hyper-surface where \(T(\mathbf{x})=t\) is required (a computationally expensive but necessary feature using Markov Chain Monte Carlo sampling constrained to the surface).

  3. Likelihood Rendering Engine: For the Stopping Rule Paradox, the tool must render likelihood functions efficiently. Instead of plotting points, use WebGL to draw smooth curves based on evaluating the symbolic likelihood expression over a grid of \(\theta\) values.

  4. Hypothesis Testing Calculator API: To calculate the p-values in the Stopping Rule Paradox, integrate a statistical library (like jStat) capable of computing CDFs for Binomial and Negative Binomial distributions with high precision, as the philosophical point relies on the exact discrepancy between the two p-values.

  5. Data Generation for Basu’s Theorem: Ensure the random number generator uses independent streams for generating the normal variables. The scatterplot visualization for Basu’s theorem should use alpha-blending (opacity) so that the independence (uniform cloud) vs. dependence (clustering) is visually obvious to the student even at high sample counts.

9 Module 7: Point Estimation

9.1 1. Module Overview

This module transitions from the probabilistic properties of data to the core of statistical inference: estimating unknown population parameters. It covers the primary methodologies for deriving point estimators (Method of Moments, Maximum Likelihood, Bayes, and the EM Algorithm) and the rigorous mathematical criteria for evaluating their quality (MSE, Bias, Variance, Efficiency, and Decision-Theoretic Optimality).

Learning Objectives:

  • Derive Method of Moments estimators by equating sample and population moments.
  • Derive Maximum Likelihood Estimators (MLEs) analytically and numerically, and apply the invariance property.
  • Construct Bayesian estimators by combining prior distributions with likelihoods, optimizing under various loss functions.
  • Implement the Expectation-Maximization (EM) algorithm for models with missing or latent data.
  • Decompose Mean Squared Error (MSE) into bias and variance, and analyze the bias-variance tradeoff.
  • Prove estimator optimality using the Cramér-Rao Lower Bound, the Rao-Blackwell Theorem, and the Lehmann-Scheffé Theorem.
  • Evaluate estimators within a decision-theoretic framework using risk functions and minimax/Bayes criteria.

9.2 2. Sub-Module 7.2: Methods of Finding Estimators - Part I (Frequentist)

9.2.1 7.2.1 Method of Moments (MME)

Content: Equating population moments (\(\mu_k' = E[X^k]\)) to sample moments (\(M_k = \frac{1}{n}\sum X_i^k\)) and solving the resulting system of equations for \(\theta\). Rigorous Focus: MMEs are often simple to compute but are not necessarily optimal; they may not even be functions of sufficient statistics.

Interactive Resource: The Moment Matching Engine

  • Design: A dual-panel interface. Left: Sliders controlling the true parameters \(\theta\) of a distribution (e.g., Gamma\((\alpha, \beta)\)). Right: A random sample drawn from the distribution.
  • Interaction: The tool calculates \(M_1\) and \(M_2\) from the sample. It then dynamically solves the system \(\mu_1'(\hat{\alpha}, \hat{\beta}) = M_1\) and \(\mu_2'(\hat{\alpha}, \hat{\beta}) = M_2\). A “Moment Match” plot overlays the true population pdf with the MME-estimated pdf, showing how well the moments align. Users can click “New Sample” to observe the high variance of MMEs compared to other methods.
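The moment-matching step for Gamma\((\alpha, \beta)\) has a closed-form solution, since \(E[X] = \alpha\beta\) and \(Var(X) = \alpha\beta^2\). A minimal pure-Python sketch of what the engine solves on each "New Sample" click (names and defaults are illustrative):

```python
import random, statistics

def gamma_mme(sample):
    """Method-of-moments estimates for Gamma(alpha, beta), scale parameterization:
    solve alpha*beta = M1 and alpha*beta^2 = M2 - M1^2."""
    m1 = statistics.fmean(sample)
    m2 = statistics.fmean(x * x for x in sample)
    var = m2 - m1 * m1               # central second moment
    beta_hat = var / m1
    alpha_hat = m1 / beta_hat        # = m1^2 / var
    return alpha_hat, beta_hat

rng = random.Random(7)
data = [rng.gammavariate(2.0, 3.0) for _ in range(20000)]   # true alpha=2, beta=3
a_hat, b_hat = gamma_mme(data)
```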

9.2.2 7.2.2 Maximum Likelihood Estimators (MLE)

Content: Defining the likelihood function \(L(\theta|\mathbf{x})\) and the log-likelihood \(\ell(\theta|\mathbf{x})\). Finding \(\hat{\theta}\) that maximizes \(\ell\).

Rigorous Focus:

  1. Calculus-based derivation (setting derivatives to zero, checking second derivatives).
  2. The Invariance Property: If \(\hat{\theta}\) is the MLE of \(\theta\), then \(g(\hat{\theta})\) is the MLE of \(g(\theta)\). Casella & Berger prove this for arbitrary functions \(g\) (via the induced likelihood), not only for one-to-one transformations.

Interactive Resource: The Log-Likelihood Landscape Explorer

  • Design: A 3D surface plot or 2D contour plot of the log-likelihood function for a two-parameter family (e.g., Normal or Weibull).

  • Interaction: The user generates a sample. The likelihood surface renders instantly. The user can click on the surface to place a “climber” and manually navigate toward the peak. Alternatively, an “Ascent” button animates a numerical optimization algorithm (gradient ascent or Newton-Raphson) traversing the surface to the MLE.

  • Rigor Check: An “Invariance Tester” allows the user to type a transformation \(g(\theta)\) (e.g., \(\sqrt{\theta}\)). The tool calculates \(g(\hat{\theta})\) and plots the transformed likelihood, proving the peak aligns.
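A grid search is a crude but transparent stand-in for the climber/ascent animation, and it also checks invariance numerically. A minimal sketch for an Exponential sample in the mean parameterization (seed, grid, and names are illustrative):

```python
import math, random

rng = random.Random(0)
data = [rng.expovariate(1 / 2.5) for _ in range(500)]   # true mean theta = 2.5
n, S = len(data), sum(data)

def loglik(theta):
    """Exponential (mean parameterization) log-likelihood: -n log(theta) - S/theta."""
    return -n * math.log(theta) - S / theta

# crude grid maximization standing in for the interactive climber
grid = [0.5 + 0.001 * i for i in range(5000)]           # theta in [0.5, 5.5)
theta_hat = max(grid, key=loglik)

xbar = S / n                # the analytic MLE, for comparison
rate_hat = 1 / theta_hat    # invariance: the MLE of g(theta) = 1/theta is 1/theta_hat
```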

9.3 3. Sub-Module 7.2: Methods of Finding Estimators - Part II (Bayesian & Computational)

9.3.1 7.2.3 Bayes Estimators

Content: The posterior distribution \(\pi(\theta|\mathbf{x}) \propto f(\mathbf{x}|\theta)\pi(\theta)\). Rigorous Focus: Estimators are derived by minimizing the posterior expected loss.

  • Under squared error loss (SEL): Bayes estimator is the posterior mean \(E[\theta|\mathbf{x}]\).
  • Under absolute error loss: Bayes estimator is the posterior median.
  • Under 0-1 loss: Bayes estimator is the posterior mode (MAP).

Interactive Resource: The Prior-Posterior Dynamics Simulator

  • Design: Three plots: Prior \(\pi(\theta)\), Likelihood \(f(\mathbf{x}|\theta)\), and Posterior \(\pi(\theta|\mathbf{x})\). A dropdown for Loss Function (SEL, Absolute, 0-1).
  • Interaction: The user manipulates the prior parameters and sample size \(n\). As \(n\) increases, the posterior dynamically “peels away” from the prior and conforms to the likelihood (the Bernstein-von Mises phenomenon). Switching the loss function moves a vertical line along the posterior curve, visually distinguishing the mean, median, and mode.
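For the conjugate Beta-Binomial case, the three Bayes estimators (mean, median, mode) that the loss-function dropdown toggles between can be computed directly. A minimal sketch, assuming a Beta\((a,b)\) prior and a numerically integrated posterior median (names and the simple Riemann integration are illustrative):

```python
import math

def beta_pdf(t, a, b):
    """Beta(a, b) density via log-gamma, stable for moderate a, b."""
    logB = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - logB)

def bayes_point_estimates(x, n, a=2.0, b=2.0, steps=20000):
    """Beta(a,b) prior + Binomial(n,p) likelihood -> Beta(a+x, b+n-x) posterior.
    Returns (mean, median, mode): Bayes estimators under SEL / absolute / 0-1 loss."""
    a_post, b_post = a + x, b + n - x
    mean = a_post / (a_post + b_post)
    mode = (a_post - 1) / (a_post + b_post - 2)     # MAP; requires a_post, b_post > 1
    # median: accumulate posterior mass on a grid until it reaches 1/2
    h = 1.0 / steps
    cum, median = 0.0, 0.5
    for i in range(1, steps):
        t = i * h
        cum += beta_pdf(t, a_post, b_post) * h
        if cum >= 0.5:
            median = t
            break
    return mean, median, mode

mean, median, mode = bayes_point_estimates(x=7, n=10)   # posterior Beta(9, 5)
```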

9.3.2 7.2.4 The EM Algorithm

Content: Finding MLEs when data is incomplete or models have latent variables.

  1. E-step: Calculate the expected complete-data log-likelihood, \(Q(\theta|\theta^{(t)}) = E_{\theta^{(t)}}[\log L(\theta|\mathbf{X}, \mathbf{Y}) | \mathbf{X}]\).
  2. M-step: Maximize \(Q\) with respect to \(\theta\) to find \(\theta^{(t+1)}\). Rigorous Focus: Proving that the observed-data likelihood is non-decreasing at each iteration.

Interactive Resource: The EM Ascent Visualizer

  • Design: A plot of the observed-data log-likelihood (a hard-to-maximize curve).
  • Interaction: The user clicks “E-Step”. The tool evaluates the expected missing data given current \(\theta^{(t)}\). The user clicks “M-Step”, and the tool maximizes the surrogate \(Q\)-function (shown as a tangent parabola below the true likelihood). The parameter updates, and a point “climbs” the observed likelihood curve. This visualizes how maximizing the easier \(Q\)-function guarantees an upward step on the true, complex likelihood surface.

Worked Example: Mixture of Normals

  • Scenario: \(X_1, \dots, X_n \sim p N(\mu_1, \sigma_1^2) + (1-p) N(\mu_2, \sigma_2^2)\).

  • Interactive Element: Students step through the E-step (calculating the probability each point belongs to cluster 1 vs cluster 2) and the M-step (updating means and variances based on these soft assignments), watching the mixture model fit converge.
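The E- and M-steps the student clicks through can be sketched as a short pure-Python loop for the two-component case (equal-weight initialization, crude starting values, and iteration count are all illustrative choices, not the SDA's actual scheme):

```python
import math, random

def norm_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_mixture(data, iters=50):
    """EM for p*N(mu1, s1^2) + (1-p)*N(mu2, s2^2); returns (p, mu1, mu2)."""
    p, mu1, mu2 = 0.5, min(data), max(data)              # crude initialization
    s1 = s2 = (max(data) - min(data)) / 4
    for _ in range(iters):
        # E-step: posterior probability that each point came from component 1
        w = []
        for x in data:
            a = p * norm_pdf(x, mu1, s1)
            b = (1 - p) * norm_pdf(x, mu2, s2)
            w.append(a / (a + b))
        # M-step: update weight, means, and sds from the soft assignments
        W = sum(w)
        p = W / len(data)
        mu1 = sum(wi * x for wi, x in zip(w, data)) / W
        mu2 = sum((1 - wi) * x for wi, x in zip(w, data)) / (len(data) - W)
        s1 = max(math.sqrt(sum(wi * (x - mu1) ** 2 for wi, x in zip(w, data)) / W), 1e-3)
        s2 = max(math.sqrt(sum((1 - wi) * (x - mu2) ** 2
                               for wi, x in zip(w, data)) / (len(data) - W)), 1e-3)
    return p, mu1, mu2

rng = random.Random(42)
data = [rng.gauss(-2, 1) if rng.random() < 0.4 else rng.gauss(3, 1) for _ in range(800)]
p_hat, m1_hat, m2_hat = em_mixture(data)
```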

9.4 4. Sub-Module 7.3: Methods of Evaluating Estimators - Bias and MSE

9.4.1 7.3.1 Mean Squared Error (MSE)

Content: \(MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]\).

Rigorous Focus: The Bias-Variance Decomposition: \(MSE(\hat{\theta}) = Var(\hat{\theta}) + [Bias(\hat{\theta})]^2\). The concept of the Bias-Variance Tradeoff.

Interactive Resource: The Dartboard Tradeoff Dashboard

  • Design: Three interactive dartboards representing different estimators of \(\theta\): 1. \(\hat{\theta}_1\): Unbiased but high variance (e.g., the sample variance with denominator \(n-1\)). 2. \(\hat{\theta}_2\): Biased but low variance (e.g., a shrinkage estimator). 3. \(\hat{\theta}_3\): Optimal MSE.

  • Interaction: The user simulates thousands of estimates. The dartboards populate. A dynamic bar chart displays \(MSE = Var + Bias^2\). The user adjusts a shrinkage parameter \(\lambda\) for \(\hat{\theta}_2\). As \(\lambda\) increases, the Bias bar grows, but the Variance bar shrinks. The MSE curve is plotted against \(\lambda\), allowing the student to visually find the \(\lambda\) that minimizes overall MSE, demonstrating that unbiased estimators are not always “best.”
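The decomposition the bar chart displays is an exact algebraic identity on the simulated estimates. A minimal Monte Carlo sketch for the shrinkage estimator \((1-\lambda)\bar{X}\) of a normal mean (the shrinkage target 0, defaults, and names are illustrative):

```python
import random, statistics

def mse_parts(lmbda, theta=2.0, sigma=1.0, n=10, reps=6000, seed=3):
    """Monte Carlo bias/variance/MSE of the shrinkage estimator (1-lambda)*X-bar.
    The identity MSE = Var + Bias^2 holds exactly on the simulated draws."""
    rng = random.Random(seed)
    ests = []
    for _ in range(reps):
        xbar = statistics.fmean(rng.gauss(theta, sigma) for _ in range(n))
        ests.append((1 - lmbda) * xbar)
    bias = statistics.fmean(ests) - theta
    var = statistics.pvariance(ests)
    mse = statistics.fmean((e - theta) ** 2 for e in ests)
    return bias, var, mse

bias, var, mse = mse_parts(0.1)   # theory: bias = -0.2, var = 0.81*sigma^2/n = 0.081
```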

9.4.2 7.3.2 Best Unbiased Estimators (UMVUE)

Content: \(\hat{\theta}\) is the Uniformly Minimum Variance Unbiased Estimator (UMVUE) if \(E[\hat{\theta}] = \theta\) and \(Var(\hat{\theta}) \le Var(\hat{\theta}^*)\) for all \(\theta\) and any other unbiased \(\hat{\theta}^*\).

9.4.3 7.3.3 Fisher Information and the Cramér-Rao Lower Bound (CRLB)

Content:

  • Fisher Information: \(I(\theta) = -E\left[ \frac{\partial^2}{\partial \theta^2} \log f(X|\theta) \right]\).

  • The CRLB: Under regularity conditions, \(Var(\hat{\theta}) \ge \frac{1}{n I(\theta)}\) for any unbiased \(\hat{\theta}\). Rigorous Focus: Verifying the regularity conditions (in particular, the support must not depend on \(\theta\); Uniform\((0,\theta)\) violates this, so the CRLB does not apply to it).

Interactive Resource: The Efficiency Bounding Engine

  • Design: A dynamic plot of the variance of an estimator vs. \(\theta\). A dashed line representing the CRLB \(\frac{1}{nI(\theta)}\) is drawn.

  • Interaction: The user selects an unbiased estimator. Its variance curve is plotted. If the curve touches the CRLB, the tool flashes “Efficient!” (e.g., \(\bar{X}\) for Normal mean). If it sits above (e.g., sample median for Normal mean), the tool calculates asymptotic relative efficiency.
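The efficient/inefficient contrast the engine flags can be checked by simulation: for \(N(\mu, \sigma^2)\) with \(\sigma\) known, \(I(\mu) = 1/\sigma^2\), so the CRLB is \(\sigma^2/n\); \(\bar{X}\) attains it, while the sample median sits above it (ARE \(2/\pi\)). A minimal sketch (names and defaults illustrative):

```python
import random, statistics

def estimator_variances(mu=0.0, sigma=1.0, n=25, reps=8000, seed=11):
    """Monte Carlo variances of X-bar and the sample median for N(mu, sigma^2),
    compared against the CRLB sigma^2/n (Fisher information I(mu) = 1/sigma^2)."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(x))
        medians.append(statistics.median(x))
    crlb = sigma ** 2 / n
    return statistics.pvariance(means), statistics.pvariance(medians), crlb

v_mean, v_med, crlb = estimator_variances()   # v_mean hugs the bound; v_med exceeds it
```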

9.4.4 7.3.4 Sufficiency, Unbiasedness, and the Rao-Blackwell Theorem

Content:

  • Rao-Blackwell Theorem: If \(T\) is a sufficient statistic for \(\theta\) and \(W\) is any unbiased estimator, then \(\phi(T) = E[W|T]\) is an unbiased estimator with \(Var(\phi(T)) \le Var(W)\).

  • Lehmann-Scheffé Theorem: If \(T\) is a complete sufficient statistic, then \(\phi(T)\) is the unique UMVUE.

Interactive Resource: The Variance Reducer (Rao-Blackwell Machine)

  • Design: A pipeline graphic. Input: A naive unbiased estimator \(W\) (e.g., \(W = X_1\) for the Poisson mean \(\lambda\)). Intermediate: A complete sufficient statistic \(T = \sum X_i\). Output: The Rao-Blackwellized estimator \(\phi(T) = E[X_1 | \sum X_i = t]\).
  • Interaction: The tool walks through the conditional expectation calculation (yielding \(t/n\)). It then simulates both \(W\) and \(\phi(T)\), plotting their sampling distributions overlaid on each other. Both are centered at \(\lambda\) (both unbiased), but the distribution of \(\phi(T)\) is visibly narrower, proving variance reduction through conditioning on sufficiency.
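The variance reduction in the pipeline can be simulated directly: for a Poisson sample, \(W = X_1\) and \(\phi(T) = E[X_1 \mid \sum X_i] = \bar{X}\). A minimal sketch using inversion sampling, since the Python stdlib has no Poisson generator (names and defaults illustrative):

```python
import math, random, statistics

def poisson_draw(rng, lam):
    """Inversion sampling: walk the Poisson CDF until it exceeds a Uniform(0,1) draw."""
    u, k, p = rng.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def rao_blackwell_demo(lam=4.0, n=8, reps=10000, seed=5):
    """Compare the naive unbiased W = X_1 with phi(T) = X-bar for Poisson(lam)."""
    rng = random.Random(seed)
    w_vals, phi_vals = [], []
    for _ in range(reps):
        x = [poisson_draw(rng, lam) for _ in range(n)]
        w_vals.append(x[0])             # variance lam
        phi_vals.append(sum(x) / n)     # variance lam/n -- strictly smaller
    return statistics.pvariance(w_vals), statistics.pvariance(phi_vals)

var_w, var_phi = rao_blackwell_demo()
```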

9.5 5. Sub-Module 7.3: Methods of Evaluating Estimators - Decision Theory

9.5.1 7.3.5 Loss Function Optimality

Content: Moving beyond MSE to general loss functions \(L(\theta, a)\).

  • Risk Function: \(R(\theta, \hat{\theta}) = E_\theta[L(\theta, \hat{\theta})]\).

  • Minimax Principle: Choose \(\hat{\theta}\) to minimize \(\max_\theta R(\theta, \hat{\theta})\).

  • Bayes Risk: \(r(\pi, \hat{\theta}) = E^\pi[R(\theta, \hat{\theta})]\).

Interactive Resource: The Risk Profile Arena

  • Design: A 2D plot where the x-axis is the parameter \(\theta\) and the y-axis is the Risk \(R(\theta, \hat{\theta})\).

  • Interaction: The user selects two estimators (e.g., MLE vs. a Bayes estimator under SEL). The risk profiles for both are plotted as functions of \(\theta\). * Minimax Mode: The tool highlights the maximum risk (the peak) of both curves. The estimator with the lower peak is declared minimax. * Bayes Mode: The user shades the area under the risk curve weighted by a prior density \(\pi(\theta)\). The estimator with the smaller shaded area (Bayes risk) is the preferred Bayes rule. This visualizes how a prior can “forgive” high risk in low-probability regions of \(\theta\).

9.6 Part III: Technical Implementation Guidelines (Module 7 Specifics)

  1. Numerical Optimization Suite (for MLE & EM): The MLE and EM modules require robust optimization. The backend must include SciPy.optimize (e.g., minimize with BFGS or Nelder-Mead). For the EM algorithm, the system should allow users to define \(Q\)-functions symbolically, which the backend then differentiates and solves.
  2. Symbolic Differentiation Engine (for CRLB): Calculating Fisher Information requires taking the second derivative of the log-likelihood. Integrate SymPy to automatically parse a user-inputted \(f(x|\theta)\), compute \(\frac{\partial^2}{\partial \theta^2} \log f(x|\theta)\), and then calculate the expectation. This allows students to test CRLB on custom distributions without tedious hand-calculus.
  3. Monte Carlo Risk Calculator: Calculating exact risk functions \(R(\theta, \hat{\theta})\) analytically is often impossible for complex estimators. Implement a distributed Monte Carlo engine. For a grid of \(\theta\) values, simulate \(M\) datasets, compute the loss \(L(\theta, \hat{\theta})\) for each, and average. The UI should plot these empirical risk points alongside the theoretical risk curves.
  4. Conditional Expectation Calculator (for Rao-Blackwell): Computing \(E[W|T=t]\) symbolically is mathematically intensive. For discrete distributions, the engine can enumerate the sample space partitioned by \(T=t\). For continuous distributions, provide templated examples (Normal, Exponential, Poisson) where the algebraic Rao-Blackwellization is pre-computed, but allow numerical approximation for user-defined statistics via kernel smoothing of the joint simulations \((W, T)\).
  5. Real-time EM Animation: The EM Ascent Visualizer must be highly responsive. Cache the true likelihood curve. When the user steps through EM, only the surrogate \(Q\)-function and the current \(\theta^{(t)}\) need updating. Use D3.js transitions to smoothly animate the point climbing the likelihood surface.

10 Module 8: Hypothesis Testing

10.1 1. Module Overview

This module rigorously formalizes the statistical procedure of making decisions about population parameters based on sample data. It transitions from intuitive concepts of “surprise” to mathematically optimal testing frameworks. The module covers the primary methods for constructing tests (Likelihood Ratio, Bayesian, Union-Intersection), the rigorous evaluation of test performance (Error probabilities, Power), and the mathematical proofs of optimality under the Neyman-Pearson and Decision-Theoretic frameworks.

Learning Objectives:

  • Formulate null and alternative hypotheses, and define Type I and Type II errors.

  • Derive Likelihood Ratio Tests (LRTs) for simple and composite hypotheses.

  • Construct Bayesian tests using posterior odds and Bayes factors.

  • Build tests for composite hypotheses using the Union-Intersection and Intersection-Union principles.

  • Calculate and interpret the power function \(\pi(\theta)\), distinguishing between test size and level.

  • Prove test optimality using the Neyman-Pearson Lemma for simple hypotheses and the Karlin-Rubin Theorem for composite hypotheses.

  • Define p-values as random variables and relate them to frequentist decision rules.

  • Evaluate tests within a decision-theoretic framework using risk functions and minimax/Bayes criteria.

10.2 2. Sub-Module 8.1 & 8.2: Methods of Finding Tests - Frequentist Approaches

10.2.1 8.2.1 Likelihood Ratio Tests (LRTs)

Content: The most versatile frequentist method. Define the test statistic: \[\lambda(\mathbf{x}) = \frac{\sup_{\theta \in \Theta_0} L(\theta|\mathbf{x})}{\sup_{\theta \in \Theta} L(\theta|\mathbf{x})}\] The LRT rejects \(H_0\) for small values of \(\lambda(\mathbf{x})\). Rigorous Focus: Deriving the asymptotic distribution of \(-2\log\lambda(\mathbf{X}) \xrightarrow{d} \chi^2_{\nu}\) under regularity conditions (where \(\nu = \dim(\Theta) - \dim(\Theta_0)\)).

Interactive Resource: The LRT Surface Slicer

  • Design: A 3D surface plot of the likelihood function \(L(\theta|\mathbf{x})\) over the parameter space \(\Theta\). A highlighted subset represents the null space \(\Theta_0\).
  • Interaction: The user generates a sample. The MLE \(\hat{\theta}\) (the global peak) and the restricted MLE \(\hat{\theta}_0\) (the peak constrained to \(\Theta_0\)) are plotted. A vertical line connects \(\hat{\theta}_0\) to the surface. The ratio \(\lambda\) is visually represented as the fraction of the restricted height to the global height. As the sample shifts, the user sees that when \(\Theta_0\) is far from the global peak, \(\lambda \to 0\) (Reject \(H_0\)).
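For the known-\(\sigma\) normal mean test, the ratio of restricted to global likelihood heights has a closed form: \(-2\log\lambda = n(\bar{x}-\mu_0)^2/\sigma^2\). A minimal sketch verifying this identity numerically (names and the sample are illustrative):

```python
import math, random

def neg2_log_lambda(data, mu0, sigma=1.0):
    """-2 log(lambda) for H0: mu = mu0 in N(mu, sigma^2), sigma known.
    The supremum over the full parameter space is attained at the MLE x-bar."""
    n, xbar = len(data), sum(data) / len(data)
    def loglik(mu):
        return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
                   - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)
    return -2 * (loglik(mu0) - loglik(xbar))

rng = random.Random(2)
data = [rng.gauss(0.4, 1.0) for _ in range(30)]
stat = neg2_log_lambda(data, mu0=0.0)
xbar = sum(data) / len(data)
closed_form = len(data) * xbar ** 2   # identity: -2 log(lambda) = n(xbar - mu0)^2 / sigma^2
```

Under \(H_0\) this statistic is exactly \(\chi^2_1\) here, illustrating the asymptotic result as an exact finite-sample fact for this model.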

10.2.2 8.2.2 Bayesian Tests

Content: Formulating tests using posterior probabilities. The posterior odds ratio: \[\frac{P(\Theta_0|\mathbf{x})}{P(\Theta_1|\mathbf{x})} = \left(\frac{P(\Theta_0)}{P(\Theta_1)}\right) \times \left(\frac{\int_{\Theta_0} f(\mathbf{x}|\theta)\pi(\theta)d\theta}{\int_{\Theta_1} f(\mathbf{x}|\theta)\pi(\theta)d\theta}\right).\]

Rigorous Focus: The Bayes Factor (the ratio of integrated likelihoods), which replaces the frequentist LRT and is immune to stopping-rule paradoxes.

Interactive Resource: The Bayes Factor Dynamics Engine

  • Design: A split dashboard. Left: Prior distributions for \(\theta\) under \(H_0\) and \(H_1\). Right: The posterior distributions.

  • Interaction: The user sets prior odds to 1 (equal priors) and observes the Bayes Factor. They can then heavily skew the prior probability of \(H_0\) to 0.99. The posterior updates, showing that strong prior belief requires overwhelming data to reject \(H_0\).

  • Rigor Check: A side-panel calculates the p-value for the same data. With large sample sizes, the tool demonstrates Lindley’s Paradox: the p-value screams “Reject \(H_0\)” while the Bayes Factor suggests “Strong support for \(H_0\),” highlighting the divergence of the two paradigms.

10.2.3 8.2.3 Union-Intersection and Intersection-Union Tests

Content:

  • Union-Intersection: \(H_0: \theta \in \bigcap_{\gamma \in \Gamma} \Theta_\gamma\) vs \(H_1: \theta \in \bigcup_{\gamma \in \Gamma} \Theta_\gamma^c\). Reject \(H_0\) if any individual test rejects.

  • Intersection-Union: \(H_0: \theta \in \bigcup_{\gamma \in \Gamma} \Theta_\gamma\) vs \(H_1: \theta \in \bigcap_{\gamma \in \Gamma} \Theta_\gamma^c\). Reject \(H_0\) only if all individual tests reject. Rigorous Focus: Size control. For Union-Intersection, the overall size can exceed the individual \(\alpha\); it is bounded above by the sum of the individual sizes (motivating Bonferroni-type corrections). For Intersection-Union, if each individual test has level \(\alpha\), the overall test is also level \(\alpha\) (and, under additional conditions, has size exactly \(\alpha\)).

Interactive Resource: The Region Logician

  • Design: A 2D parameter space \((\theta_1, \theta_2)\).

  • Interaction: The user defines \(H_0\) as a union or intersection of regions (e.g., \(H_0: \theta_1 \le 0 \cup \theta_2 \le 0\)). The tool shades the rejection regions for individual tests. It then uses boolean logic operators to visually merge these regions into the final rejection region, demonstrating the intersection/union geometric property.

10.3 3. Sub-Module 8.3: Methods of Evaluating Tests - Power and Optimality

10.3.1 8.3.1 Error Probabilities and the Power Function

Content:

  • Type I Error: \(\alpha = P(\text{Reject } H_0 | \theta \in \Theta_0)\).

  • Type II Error: \(\beta = P(\text{Accept } H_0 | \theta \in \Theta_1)\).

  • Power Function: \(\pi(\theta) = P(\text{Reject } H_0 | \theta)\). Rigorous Focus: The strict distinction between a test of size \(\alpha\) (supremum of \(\pi(\theta)\) over \(\Theta_0\) is exactly \(\alpha\)) and a test of level \(\alpha\) (supremum is \(\le \alpha\)).

Interactive Resource: The Power Curve Architect

  • Design: A graph plotting the power function \(\pi(\theta)\) against \(\theta\). Horizontal lines represent the \(\alpha\) level.

  • Interaction: The user selects a test statistic and an \(\alpha\) level. They adjust the sample size \(n\). The power curve dynamically updates, pulling away from \(\alpha\) in the alternative region as \(n\) increases.

  • Rigor Check: The tool highlights the supremum of the curve in \(\Theta_0\), forcing the student to verify that the size of the test does not exceed \(\alpha\), particularly at the boundary of the null hypothesis.

10.3.2 8.3.2 Most Powerful Tests (Neyman-Pearson Lemma)

Content: Testing \(H_0: \theta = \theta_0\) vs \(H_1: \theta = \theta_1\) (Simple vs. Simple).

The Neyman-Pearson Lemma: For this simple-vs-simple problem, a test that rejects \(H_0\) when \(\frac{f(\mathbf{x}|\theta_1)}{f(\mathbf{x}|\theta_0)} > k\), with \(k\) chosen so the test has size \(\alpha\), is the Most Powerful (MP) level-\(\alpha\) test. (“Uniformly” most powerful is reserved for composite alternatives, via the Karlin-Rubin extension below.)

Rigorous Focus:

  1. The randomized test trick for discrete distributions to achieve exact size \(\alpha\).
  2. Extension to composite alternatives via Monotone Likelihood Ratios (MLR) and the Karlin-Rubin Theorem.

Interactive Resource: The Neyman-Pearson Partition Visualizer

  • Design: Overlapping density curves for \(f(x|\theta_0)\) and \(f(x|\theta_1)\). A vertical line represents the critical value \(c\).

  • Interaction: The user drags the critical value line left and right. The area under \(f(x|\theta_0)\) to the right of \(c\) shades red (Type I error \(\alpha\)). The area under \(f(x|\theta_1)\) to the right of \(c\) shades green (Power \(1-\beta\)). The tool dynamically plots the ratio of the densities, demonstrating that the optimal rejection region precisely corresponds to where the likelihood ratio exceeds a threshold.

Worked Example: The Randomized Test Simulator

  • Scenario: Testing \(H_0: \text{Poisson}(\lambda=1)\) vs \(H_1: \text{Poisson}(\lambda=2)\) with exact size \(\alpha=0.05\).

  • Interactive Element: Because Poisson is discrete, no integer critical value yields exactly 0.05. The tool calculates the CDFs, identifies the jump across 0.05, and visually explains the randomization probability \(\gamma\) required to achieve exact size. A Monte Carlo simulator implements the randomization, proving the long-run Type I error is exactly 0.05.
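The critical value \(c\) and randomization probability \(\gamma\) the tool derives can be computed exactly. A minimal sketch for a single observation \(X \sim \text{Poisson}(\lambda_0)\) (the same logic applies to the sample sum; names and defaults are illustrative):

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def randomized_test(alpha=0.05, lam0=1.0):
    """Find (c, gamma) with P(X > c) + gamma * P(X = c) = alpha exactly under
    X ~ Poisson(lam0). Decision rule: reject if X > c; if X == c, reject with
    probability gamma."""
    c, cdf = 0, poisson_pmf(0, lam0)
    while 1.0 - cdf > alpha:          # strict tail P(X > c) still exceeds alpha
        c += 1
        cdf += poisson_pmf(c, lam0)
    tail = 1.0 - cdf                  # now P(X > c) <= alpha
    gamma = (alpha - tail) / poisson_pmf(c, lam0)
    return c, gamma

c, gamma = randomized_test()   # for lam0 = 1: c = 3, gamma roughly 0.506
```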

10.3.3 8.3.4 p-Values

Content: The p-value is a test statistic defined as the smallest level \(\alpha\) at which \(H_0\) would be rejected. Formally: \(p(\mathbf{x}) = \sup_{\theta \in \Theta_0} P_\theta(T \ge T(\mathbf{x}))\).

Rigorous Focus: The p-value is a random variable. Under \(H_0\), if the test statistic is continuous, the p-value is uniformly distributed on \((0,1)\). Misinterpretation: The p-value is not the probability that \(H_0\) is true.

Interactive Resource: The p-Value Distribution Exposer

  • Design: A simulation engine running 10,000 experiments under a true \(H_0\).

  • Interaction: The tool calculates the p-value for each experiment and plots a histogram. The result is a perfectly uniform distribution.

  • Rigor Check: The user clicks “Add False \(H_0\)”. The simulation runs under the alternative, and the p-value histogram skews heavily toward 0. The tool overlays the power curve, showing that the area to the left of \(\alpha\) in the p-value histogram is exactly the power of the test.
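The uniform-histogram experiment can be sketched with one-sided \(z\)-tests under a true \(H_0\) (the standard normal survival function comes from `math.erfc`; names and defaults are illustrative):

```python
import math, random

def norm_sf(z):
    """Standard normal upper-tail probability via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def pvalue_sample(reps=10000, n=20, seed=9):
    """Simulate one-sided z-tests of H0: mu = 0 (sigma = 1 known) with H0 true.
    The resulting p-values should be Uniform(0,1)."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(reps):
        xbar = math.fsum(rng.gauss(0, 1) for _ in range(n)) / n
        pvals.append(norm_sf(xbar * math.sqrt(n)))
    return pvals

pvals = pvalue_sample()
frac_below_05 = sum(p <= 0.05 for p in pvals) / len(pvals)   # should be near 0.05
```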

10.3.4 8.3.5 Loss Function Optimality

Content: Framing hypothesis testing as a decision problem. The risk function combines Type I and Type II errors weighted by a loss function \(L(\theta, a)\).

Rigorous Focus: Proving that the Bayes test chooses the hypothesis with the smaller posterior expected loss, which directly corresponds to comparing the posterior odds to the loss ratio.

Interactive Resource: The Decision-Theoretic Risk Frontier

  • Design: A 2D plane where the x-axis is the probability of Type I error (\(\alpha\)) and the y-axis is the probability of Type II error (\(\beta\)).

  • Interaction: The user defines a family of tests (e.g., varying critical values). The tool plots the risk points \((\alpha, \beta)\), tracing out the risk frontier. The user adjusts the loss weights \(L(\text{Type I})\) and \(L(\text{Type II})\). A line representing the minimax or Bayes risk objective pivots, and the optimal test on the frontier is highlighted.

10.4 Part III: Technical Implementation Guidelines (Module 8 Specifics)

  1. Non-Central Distribution Calculator: To implement the Power Curve Architect accurately, the backend requires a library capable of calculating CDFs for non-central distributions (e.g., Non-central \(t\), Non-central \(\chi^2\), Non-central \(F\)). SciPy.stats in Python handles this natively (ncf, nct, ncx2) and should be exposed via API.
  2. Symbolic Suprema Engine (for Size): A frequent stumbling block for students is calculating the size of a test, which requires finding the supremum of the power function over \(\Theta_0\). The evaluation engine must numerically optimize the power function over the null parameter space to verify if a user-provided test violates the level \(\alpha\).
  3. Monte Carlo Integration for Bayes Factors: Calculating the marginal likelihoods \(\int f(\mathbf{x}|\theta)\pi(\theta)d\theta\) analytically is often impossible. The backend must employ MCMC (Markov Chain Monte Carlo) or bridge sampling techniques to estimate Bayes Factors numerically for arbitrary user-defined priors and likelihoods.
  4. Randomized Test Logic: The Neyman-Pearson simulator for discrete statistics requires a unique implementation. If the test statistic falls in the “randomization region,” the tool must draw a Uniform(0,1) random number \(U\) and reject \(H_0\) if \(U < \gamma\). This logic must be explicitly visualized so students understand this is a theoretical construct used to attain exact size \(\alpha\) (tracing the boundary of attainable size-power pairs), rather than a practical data analysis technique.
  5. High-Performance p-Value Simulator: Generating the uniform p-value histogram under \(H_0\) requires thousands of independent statistical tests. This must be vectorized (e.g., using NumPy) in the backend to return the array of p-values to the frontend in under 1 second for a seamless interactive experience.

11 Module 9: Interval Estimation

11.1 1. Module Overview

While point estimation provides a single best guess for a parameter, it provides no measure of the uncertainty inherent in that guess. This module rigorously develops the theory of interval estimation, constructing ranges of values that are likely to contain the true parameter. The module unifies this concept with hypothesis testing (Module 8), demonstrates various mathematical techniques for deriving intervals, and contrasts the Frequentist interpretation (Confidence Intervals) with the Bayesian interpretation (Credible Intervals).

Learning Objectives:

  • Define confidence intervals and rigorously distinguish between random interval bounds and fixed parameters.

  • Derive confidence intervals by inverting the acceptance regions of hypothesis tests.

  • Construct intervals using Pivotal Quantities and by pivoting the CDF.

  • Derive and interpret Bayesian Credible Intervals and Highest Posterior Density (HPD) regions.

  • Evaluate interval estimators based on coverage probability, size, and expected length.

  • Establish optimality of intervals via decision theory and test-inversion (Uniformly Most Accurate intervals).

11.2 2. Sub-Module 9.1: Introduction to Interval Estimation

11.2.1 9.1.1 The Concept of a Random Interval

Content: An interval estimator for \(\theta\) is a pair of statistics \(L(\mathbf{X})\) and \(U(\mathbf{X})\) such that \(L(\mathbf{X}) \le U(\mathbf{X})\). The interval \([L(\mathbf{X}), U(\mathbf{X})]\) is a random set.

Rigorous Focus: The Frequentist interpretation. The probability \(P_\theta(\theta \in [L(\mathbf{X}), U(\mathbf{X})])\) refers to the long-run frequency of random intervals capturing the fixed true parameter \(\theta\). It is not the probability that \(\theta\) lies in a fixed observed interval.

Interactive Resource: The Frequentist Target Shooter

  • Design: A number line representing the parameter space. A vertical bullseye marks the true, fixed parameter \(\theta_0\).

  • Interaction: The user clicks “Draw Sample”. The engine calculates \([L(\mathbf{x}), U(\mathbf{x})]\) and shoots a horizontal bar (the interval) onto the number line. If the bar covers \(\theta_0\), it turns Green (Hit); otherwise, Red (Miss). A counter tracks the hit rate. As the user rapidly clicks, the hit rate converges to the confidence level (e.g., 95%), proving that the 95% guarantee attaches to the procedure as a whole, not to any single realized interval.
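The hit-rate convergence can be sketched as a coverage simulation for the known-\(\sigma\) \(z\)-interval (the fixed true \(\theta\), defaults, and names are illustrative):

```python
import math, random

def coverage_demo(theta=5.0, sigma=2.0, n=16, reps=5000, conf_z=1.96, seed=21):
    """Repeatedly draw samples and count how often the known-sigma z-interval
    x-bar +/- z*sigma/sqrt(n) covers the fixed true theta (should approach ~95%)."""
    rng = random.Random(seed)
    half = conf_z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = math.fsum(rng.gauss(theta, sigma) for _ in range(n)) / n
        if xbar - half <= theta <= xbar + half:
            hits += 1
    return hits / reps

hit_rate = coverage_demo()   # long-run frequency of Green (Hit) intervals
```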

11.3 3. Sub-Module 9.2: Methods of Finding Interval Estimators - Frequentist

11.3.1 9.2.1 Inverting a Test Statistic

Content: There is a 1-to-1 duality between hypothesis tests and confidence intervals. A \(1-\alpha\) confidence set consists of all values \(\theta_0\) for which the hypothesis \(H_0: \theta = \theta_0\) would not be rejected at level \(\alpha\).

Rigorous Focus: \(C(\mathbf{x}) = \{\theta_0 : \mathbf{x} \in A(\theta_0)\}\), where \(A(\theta_0)\) is the acceptance region of the level \(\alpha\) test.

Interactive Resource: The Test-Inversion Tracer

  • Design: A dual-axis plot. Left: a plot of the test statistic \(T(\mathbf{x})\) vs. \(\theta_0\), showing the acceptance region boundaries. Right: the resulting confidence interval \([L(\mathbf{x}), U(\mathbf{x})]\) on a number line.

  • Interaction: The user observes a fixed sample \(\mathbf{x}\) (hence a fixed horizontal line \(T_{obs}\)). The tool sweeps \(\theta_0\) across the x-axis. For each \(\theta_0\), the engine checks if \(T_{obs}\) falls inside \(A(\theta_0)\). If yes, \(\theta_0\) is highlighted on the right-hand number line. The accumulation of these \(\theta_0\) values visually “paints” the confidence interval, proving that the CI is exactly the set of non-rejected nulls.

11.3.2 9.2.2 Pivotal Quantities

Content: A pivot is a function \(Q(\mathbf{X}, \theta)\) whose distribution is independent of \(\theta\) (and usually of other unknown parameters).

Example: \(Q = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}\).

Rigorous Focus: Finding constants \(a\) and \(b\) such that \(P(a \le Q(\mathbf{X}, \theta) \le b) = 1-\alpha\), and algebraically “inverting” the inequality inside the probability to isolate \(\theta\).

Interactive Resource: The Pivot Isolator

  • Design: A symbolic algebra step-by-step solver.

  • Interaction: The user inputs a pivot (e.g., \(Q = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\) for known \(\sigma\)). The tool sets up the probability inequality: \(P(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}) = 1-\alpha\). The user must click to apply algebraic operations (multiply by \(\sigma/\sqrt{n}\), subtract \(\bar{X}\), multiply by -1 [flipping inequalities]). The tool dynamically slides the terms across the inequality signs, finally isolating \(\mu\) to yield the standard \(z\)-interval.
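The algebra the Pivot Isolator animates can be verified numerically: \(\mu\) lies in the isolated interval exactly when the pivot lies between \(\pm z_{\alpha/2}\). A sketch with hypothetical summary values:

```python
import numpy as np
from scipy import stats

# Hypothetical observed summary: known sigma, sample mean from n observations.
xbar, sigma, n, alpha = 10.4, 3.0, 36, 0.05
z = stats.norm.ppf(1 - alpha / 2)

# Isolating mu in  -z <= (xbar - mu)/(sigma/sqrt(n)) <= z  gives the z-interval.
L = xbar - z * sigma / np.sqrt(n)
U = xbar + z * sigma / np.sqrt(n)

# Numeric check: mu is inside [L, U] exactly when the pivot lies in [-z, z].
mus = np.linspace(xbar - 3, xbar + 3, 2001)
pivot = (xbar - mus) / (sigma / np.sqrt(n))
agree = ((np.abs(pivot) <= z) == ((L <= mus) & (mus <= U))).all()
print(L, U, agree)
```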

11.3.3 9.2.3 Pivoting the CDF

Content: If \(X\) is continuous with CDF \(F_X(x|\theta)\), then \(F_X(X|\theta) \sim \text{Uniform}(0,1)\). We can form a pivot using \(a \le F_X(X|\theta) \le b\).

Rigorous Focus: Solving the CDF inequality for \(\theta\). This often yields one-sided intervals, particularly for scale parameters or order statistics.

Worked Example: Interval for Uniform Scale

  • Scenario: \(X_1, \dots, X_n \sim \text{Uniform}(0, \theta)\). Finding a CI for \(\theta\) using the maximum \(X_{(n)}\).

  • Interactive Element: The tool visualizes the CDF of \(X_{(n)}\), which is \((x/\theta)^n\) for \(0 \le x \le \theta\). Since \((X_{(n)}/\theta)^n \sim \text{Uniform}(0,1)\), setting \(P(\alpha \le (X_{(n)}/\theta)^n \le 1) = 1-\alpha\) and solving for \(\theta\) in terms of \(X_{(n)}\) yields the one-sided interval \(\left[ X_{(n)}, X_{(n)} / \alpha^{1/n} \right]\).
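The coverage of this interval can be checked by simulation; the true \(\theta\) and sample size below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, alpha, M = 5.0, 8, 0.05, 20_000

x = rng.uniform(0, theta, size=(M, n))
mx = x.max(axis=1)

# Since (X_(n)/theta)^n ~ Uniform(0,1), P(alpha <= (X_(n)/theta)^n) = 1 - alpha,
# which inverts to the interval [X_(n), X_(n) * alpha**(-1/n)].
lower, upper = mx, mx * alpha ** (-1 / n)
coverage = ((lower <= theta) & (theta <= upper)).mean()
print(round(coverage, 3))  # close to 0.95
```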

11.4 4. Sub-Module 9.2: Methods of Finding Interval Estimators - Bayesian

11.4.1 9.2.4 Bayesian Intervals (Credible Intervals)

Content: Given the posterior distribution \(\pi(\theta|\mathbf{x})\), a \(1-\alpha\) credible interval is any set \(C\) such that \(P(\theta \in C | \mathbf{x}) = 1-\alpha\).

Rigorous Focus:

  1. The equal-tailed interval: \([L, U]\) where \(P(\theta < L|\mathbf{x}) = \alpha/2\) and \(P(\theta > U|\mathbf{x}) = \alpha/2\).

  2. The Highest Posterior Density (HPD) region: \(C = \{\theta : \pi(\theta|\mathbf{x}) \ge k(\alpha)\}\). The HPD is the shortest possible \(1-\alpha\) credible interval.

Interactive Resource: The HPD Carver

  • Design: A plot of the posterior density \(\pi(\theta|\mathbf{x})\). A horizontal line representing the density threshold \(k\) is present. A slider controls \(\alpha\).

  • Interaction: The user adjusts \(k\). The tool shades the region where \(\pi(\theta|\mathbf{x}) \ge k\) and calculates the integral (the coverage). The user adjusts \(k\) until the coverage is exactly \(1-\alpha\). The tool simultaneously calculates the length of the resulting interval.

  • Rigor Check: The user selects a skewed posterior (e.g., Gamma). The tool overlays the Equal-Tailed interval and the HPD. The HPD is visibly shifted toward the mode, and a “Length Score” confirms the HPD is strictly shorter than the Equal-Tailed interval. For a Normal posterior, they perfectly overlap.
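The HPD Carver's threshold search can be sketched with a bisection on \(k\), as a minimal numerical example. The Gamma posterior below is a hypothetical choice, discretized on a grid:

```python
import numpy as np
from scipy import stats

# Hypothetical skewed posterior: Gamma(shape=3, scale=1); alpha = 0.05.
post = stats.gamma(a=3)
alpha = 0.05
grid = np.linspace(0, 20, 200_001)
dens = post.pdf(grid)
dx = grid[1] - grid[0]

def coverage_at(k):
    # mass of the region where the posterior density exceeds the threshold k
    return dens[dens >= k].sum() * dx

# Bisection on the threshold k until the carved region has mass 1 - alpha.
lo, hi = 0.0, dens.max()
for _ in range(60):
    mid = (lo + hi) / 2
    if coverage_at(mid) > 1 - alpha:
        lo = mid  # region still too large: raise the threshold
    else:
        hi = mid
hpd = grid[dens >= lo]
hpd_len = hpd[-1] - hpd[0]

# Equal-tailed interval for comparison.
et_lo, et_hi = post.ppf(alpha / 2), post.ppf(1 - alpha / 2)
print(round(hpd_len, 3), round(et_hi - et_lo, 3))  # HPD is shorter
```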

11.5 5. Sub-Module 9.3: Methods of Evaluating Interval Estimators

11.5.1 9.3.1 Size and Coverage Probability

Content:

  • Coverage Probability: \(P_\theta(\theta \in C(\mathbf{X}))\).

  • Confidence Coefficient: The infimum of the coverage probability over all \(\theta \in \Omega\): \(\inf_{\theta \in \Omega} P_\theta(\theta \in C(\mathbf{X}))\).

  • Rigorous Focus: A \(1-\alpha\) interval guarantees coverage of at least \(1-\alpha\) for all \(\theta\), not exactly \(1-\alpha\). Conservative intervals (e.g., exact intervals for discrete distributions) over-cover, while intervals whose assumptions are violated (like the standard \(t\)-interval applied to strongly skewed data) may under-cover.

Interactive Resource: The Coverage Probability Profiler

  • Design: A plot of Coverage Probability vs. \(\theta\). A horizontal dashed line at \(1-\alpha\).

  • Interaction: The user selects a distribution (e.g., Exponential) and the standard \(t\)-interval (derived under Normality assumptions). The tool runs Monte Carlo simulations across a grid of \(\theta\) values and plots the actual coverage. The plot shows that for small \(n\), the coverage dips below \(1-\alpha\) (violating the size guarantee), but as \(n\) increases (CLT kicks in), the coverage curve flattens to \(1-\alpha\).
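The under-coverage the Profiler reveals can be sketched directly, assuming Exponential data fed to the Normal-theory \(t\)-interval (the rate and sample sizes are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
theta, alpha, M = 1.0, 0.05, 20_000  # Exponential with mean theta

def t_coverage(n):
    # Apply the standard t-interval for the mean to skewed Exponential data.
    x = rng.exponential(theta, size=(M, n))
    xbar, s = x.mean(axis=1), x.std(axis=1, ddof=1)
    half = stats.t.ppf(1 - alpha / 2, n - 1) * s / np.sqrt(n)
    return ((xbar - half <= theta) & (theta <= xbar + half)).mean()

cov_small, cov_large = t_coverage(10), t_coverage(200)
print(round(cov_small, 3), round(cov_large, 3))  # small-n coverage dips below 0.95
```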

11.5.3 9.3.3 & 9.3.4 Loss Function Optimality

Content: Applying decision theory to intervals.

  • Loss functions: \(L(\theta, C) = \text{Length}(C)\) (penalize wide intervals) or \(L(\theta, C) = I(\theta \notin C)\) (penalize missing the parameter).

  • Risk: \(R(\theta, C) = E_\theta[L(\theta, C)]\).

Interactive Resource: The Risk-Length Tradeoff Dashboard

  • Design: A 2D plane with Expected Length on the x-axis and Coverage Error (\(1 - \text{Coverage}\)) on the y-axis.

  • Interaction: The user defines a family of intervals by tweaking a tuning parameter (e.g., varying the split of \(\alpha\) between the tails: \(\alpha_1\) and \(\alpha_2\) such that \(\alpha_1 + \alpha_2 = \alpha\)). The tool plots the risk points. The student discovers that the symmetric split (\(\alpha_1 = \alpha_2 = \alpha/2\)) minimizes the expected length for symmetric unimodal distributions, balancing the Decision-Theoretic risk.

11.6 Part III: Technical Implementation Guidelines (Module 9 Specifics)

  1. Interval Intersection Engine (for Test Inversion): The Test-Inversion Tracer requires solving the inequality \(T_{obs} \in A(\theta_0)\) for \(\theta_0\). This often requires root-finding. The backend must use numerical solvers (like scipy.optimize.brentq) to find the precise boundaries \(L(\mathbf{x})\) and \(U(\mathbf{x})\) where the test statistic hits the critical value, especially when analytical inversion is messy.
  2. Symbolic Inequality Solver (for Pivoting): The Pivot Isolator needs a symbolic math engine (like SymPy) capable of manipulating inequalities. Specifically, it must track the direction of inequality signs when multiplied by negative numbers or when reciprocals are taken of variables known to be positive/negative (e.g., \(\sigma\) or \(\theta > 0\)).
  3. HPD Numerical Optimizer: Finding the Highest Posterior Density (HPD) region analytically is rarely possible for skewed distributions. The HPD Carver must implement a numerical algorithm: (1) compute the CDF of the posterior, (2) perform a bisection search on the density threshold \(k\) such that the integral of \(\pi(\theta|\mathbf{x})\) over \(\{\theta : \pi(\theta|\mathbf{x}) \ge k\}\) equals \(1-\alpha\).
  4. Monte Carlo Coverage Engine: The Coverage Probability Profiler requires high-performance computing. For a grid of \(\theta\) values, the engine must simulate \(M\) datasets, compute the interval for each, and check if \(\theta\) is contained. This must be heavily vectorized (NumPy/JAX) to return smooth coverage curves in real-time.
  5. Visualizing the “Random” Nature: In the Frequentist Target Shooter, the random number generator must be strictly independent for each draw to simulate the true frequentist long-run behavior accurately. The UI should allow a “Turbo Mode” (e.g., draw 1,000 intervals in 1 second) to rapidly build up the histogram of hit rates and prove the coverage guarantee empirically.
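Guideline 1 can be illustrated with `scipy.optimize.brentq`. The sketch below inverts the two-sided \(z\)-test for a Normal mean with known \(\sigma\); the data are hypothetical, and the analytically known endpoints serve as a check on the root-finder:

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq

# Hypothetical sample; invert the two-sided z-test for H0: mu = mu0.
x = np.array([9.8, 11.2, 10.5, 10.9, 9.4, 10.1])
sigma, alpha = 1.0, 0.05
n, xbar = len(x), x.mean()
z = stats.norm.ppf(1 - alpha / 2)

# T(mu0) crosses the critical values +z and -z at the interval endpoints.
T = lambda mu0: (xbar - mu0) / (sigma / np.sqrt(n))
L = brentq(lambda m: T(m) - z, xbar - 10, xbar)   # where T hits +z
U = brentq(lambda m: T(m) + z, xbar, xbar + 10)   # where T hits -z
print(round(L, 4), round(U, 4))  # matches xbar -/+ z*sigma/sqrt(n)
```

For messier test statistics the same bracketing-and-root-finding pattern applies, with the analytic check replaced by a coverage simulation.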

12 Module 10: Asymptotic Evaluations

12.1 1. Module Overview

In practice, exact sampling distributions of statistics are often mathematically intractable or rely on unrealistic distributional assumptions (like strict Normality). This module equips students with the asymptotic tools necessary to approximate the behavior of estimators and test statistics when the sample size \(n\) is large. It covers the convergence of point estimators (Consistency, Efficiency), computational resampling methods (Bootstrap), robustness against model misspecification, and the large-sample theory for hypothesis tests and confidence intervals.

Learning Objectives:

  • Prove the consistency of estimators using convergence in probability and the Continuous Mapping Theorem.

  • Derive the asymptotic distribution of Maximum Likelihood Estimators (MLEs) and define Asymptotic Efficiency.

  • Calculate and interpret Asymptotic Relative Efficiency (ARE) to compare two consistent estimators.

  • Implement the nonparametric Bootstrap to estimate standard errors and construct confidence intervals.

  • Evaluate the robustness of estimators by analyzing breakdown points and influence functions, specifically contrasting the sample mean, median, and M-estimators.

  • Derive the asymptotic distribution of the Likelihood Ratio Test (Wilks’ Theorem) and formulate large-sample Wald and Score tests.

  • Construct approximate large-sample confidence intervals using MLE asymptotics and the Bootstrap.

12.2 2. Sub-Module 10.1: Point Estimation

12.2.1 10.1.1 Consistency

Content: An estimator \(\hat{\theta}_n\) is consistent for \(\theta\) if \(\hat{\theta}_n \xrightarrow{P} \theta\).

Rigorous Focus:

  1. Proving consistency via the Weak Law of Large Numbers (WLLN) and the Continuous Mapping Theorem.

  2. Proving the consistency of MLEs under regularity conditions (the global maximum of the likelihood converges in probability to the true parameter).

Interactive Resource: The Consistency Targeting Simulator

  • Design: A number line representing the parameter space, with a bullseye at the true \(\theta_0\). A dynamic density curve representing the sampling distribution of \(\hat{\theta}_n\).

  • Interaction: The user selects an estimator (e.g., MLE for Poisson, or an inconsistent estimator like \(X_1\)). As the user increases \(n\) via a slider, the sampling distribution visually collapses onto \(\theta_0\) for consistent estimators. For \(X_1\), the distribution remains static, demonstrating inconsistency regardless of sample size.

12.2.2 10.1.2 Efficiency

Content: Under regularity conditions, \(\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N\left(0, \frac{1}{I(\theta)}\right)\) for the MLE.

Rigorous Focus: Asymptotic variance vs. exact variance. An estimator is asymptotically efficient if its asymptotic variance achieves the Cramér-Rao Lower Bound (CRLB).

12.2.3 10.1.3 Calculations and Comparisons (Asymptotic Relative Efficiency - ARE)

Content: Comparing two consistent estimators, \(\hat{\theta}_1\) and \(\hat{\theta}_2\), with asymptotic variances \(v_1(\theta)/n\) and \(v_2(\theta)/n\). The ARE is \(ARE(\hat{\theta}_1, \hat{\theta}_2) = \frac{v_2(\theta)}{v_1(\theta)}\).

Worked Example: Sample Mean vs. Sample Median for Normal and Double Exponential distributions.

Interactive Resource: The ARE Arena

  • Design: A split-screen race track. Two estimators (e.g., Mean and Median) are competing to converge to \(\theta\).

  • Interaction: The user selects the underlying population (Normal vs. Heavy-tailed). A simulation runs, plotting the MSE of both estimators against \(1/n\).

  • Rigor Check: For the Normal distribution, the Mean’s MSE drops faster; the tool calculates \(ARE(Median, Mean) = 2/\pi \approx 0.64\), showing the median needs \(\sim 57\%\) more data to achieve the same precision. Switching to Double Exponential flips the ratio: \(ARE(Mean, Median) = 1/2\), visually proving the median’s superiority under heavy tails.
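The race can be run numerically by comparing empirical sampling variances; a minimal sketch with illustrative \(M\) and \(n\):

```python
import numpy as np

rng = np.random.default_rng(3)
M, n = 40_000, 100

# Normal population: the mean wins; var(median)/var(mean) approaches pi/2.
x = rng.normal(0, 1, size=(M, n))
r_norm = np.median(x, axis=1).var() / x.mean(axis=1).var()

# Laplace (double exponential): the median wins; the ratio flips below 1.
y = rng.laplace(0, 1, size=(M, n))
r_lap = np.median(y, axis=1).var() / y.mean(axis=1).var()

print(round(r_norm, 2), round(r_lap, 2))
```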

12.2.4 10.1.4 Bootstrap Standard Errors

Content: The nonparametric bootstrap procedure:

  1. Draw \(B\) resamples \(\mathbf{x}^{*1}, \dots, \mathbf{x}^{*B}\) with replacement from the observed data.

  2. Calculate the statistic \(\hat{\theta}^{*b}\) for each resample.

  3. Estimate the standard error as \(SE_{boot} = \sqrt{\frac{1}{B-1} \sum (\hat{\theta}^{*b} - \bar{\hat{\theta}}^*)^2}\).
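The three-step procedure above can be sketched directly; the Exponential sample and the 10% trimming fraction are illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
data = rng.exponential(2.0, size=50)  # hypothetical observed sample

B = 2_000
# Step 1: B resamples with replacement, as index arrays into the data.
idx = rng.integers(0, len(data), size=(B, len(data)))
# Step 2: the statistic on each resample (10% trimmed mean).
boot_stats = stats.trim_mean(data[idx], 0.1, axis=1)
# Step 3: the bootstrap standard error.
se_boot = boot_stats.std(ddof=1)
print(round(se_boot, 4))
```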

Interactive Resource: The Resampling Engine

  • Design: A histogram of the original sample. A “slot machine” style resampler. A dynamic histogram of the bootstrap distribution \(\hat{\theta}^*\).

  • Interaction: The user defines a complex statistic (e.g., sample Trimmed Mean). The engine rapidly draws resamples, updating the bootstrap histogram in real-time. The tool overlays a Normal curve with the bootstrap SE, demonstrating how the central limit theorem applies even to analytically difficult statistics.

12.3 3. Sub-Module 10.2: Robustness

12.3.1 10.2.1 The Mean and the Median

Content: Sensitivity of estimators to deviations from model assumptions (e.g., outliers).

Rigorous Focus:

  1. Breakdown Point: The fraction of contaminated data an estimator can tolerate before becoming arbitrarily wrong. (Mean = 0%, Median = 50%).

  2. Influence Function: Measures the infinitesimal sensitivity of an estimator to an outlier at \(x\).

Interactive Resource: The Contamination and Influence Sandbox

  • Design: A dot plot of a sample from \(N(0,1)\). Two vertical lines tracking the value of the Sample Mean and Sample Median.

  • Interaction: The user clicks on the far right tail of the dot plot to inject an extreme outlier. The Mean line rapidly chases the outlier, while the Median line barely moves. The tool plots the empirical Influence Function (change in estimate vs. outlier location), showing the unbounded influence of the mean and the bounded influence of the median.

12.3.2 10.2.2 M-Estimators

Content: Estimators defined as the minimizer of \(\sum_{i=1}^n \rho(x_i - \theta)\), or the root of \(\sum_{i=1}^n \psi(x_i - \theta) = 0\).

  • Mean: \(\rho(x-\theta) = (x-\theta)^2\) (\(\psi\) is unbounded).

  • Median: \(\rho(x-\theta) = |x-\theta|\) (\(\psi\) is bounded but discontinuous).

  • Huber Estimator: A hybrid that is quadratic near zero and linear in the tails, providing both efficiency (like the mean) and robustness (like the median).

Interactive Resource: The Huber Tuning Knob

  • Design: A plot of the \(\psi\)-function \(\psi(x) = x\) for \(|x| \le k\) and \(\psi(x) = k \cdot \text{sign}(x)\) for \(|x| > k\).

  • Interaction: The user adjusts the tuning constant \(k\). As \(k \to \infty\), the estimator approaches the mean (unbounded \(\psi\)). As \(k \to 0\), it approaches the median. The user sets a mixed dataset (mostly Normal with \(10\%\) extreme outliers) and adjusts \(k\) to find the optimal bias-variance tradeoff, visualizing how M-estimators tame the influence of outliers.
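A minimal sketch of the Huber location estimate via iteratively reweighted averaging; the contamination scenario and the tuning constant \(k = 1.345\) are illustrative choices:

```python
import numpy as np

def huber_location(x, k=1.345, iters=50):
    """Huber M-estimate of location via iteratively reweighted averaging.
    Weights w_i = min(1, k/|x_i - theta|) cap the influence of outliers."""
    theta = np.median(x)  # robust starting value
    for _ in range(iters):
        r = np.abs(x - theta)
        w = np.minimum(1.0, k / np.maximum(r, 1e-12))
        theta = np.sum(w * x) / np.sum(w)
    return theta

rng = np.random.default_rng(5)
clean = rng.normal(0, 1, 90)
x = np.concatenate([clean, np.full(10, 50.0)])  # 10% gross outliers

print(round(x.mean(), 2), round(huber_location(x), 2))  # mean is dragged; Huber is not
```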

12.4 4. Sub-Module 10.3: Hypothesis Testing

12.4.1 10.3.1 Asymptotic Distribution of LRTs (Wilks’ Theorem)

Content: For testing \(H_0: \theta \in \Theta_0\) vs \(H_1: \theta \in \Theta_0^c\), under regularity conditions, the LRT statistic \(-2\log\lambda(\mathbf{X}) \xrightarrow{d} \chi^2_{\nu}\), where \(\nu = \dim(\Theta) - \dim(\Theta_0)\).

Rigorous Focus: This allows for chi-squared approximations of p-values when exact \(n\) is small or the exact distribution is intractable.

Interactive Resource: The LRT Convergence Fitter

  • Design: A histogram of simulated \(-2\log\lambda\) values under \(H_0\). Overlay curves for exact distributions (if known) and the \(\chi^2_\nu\) asymptote.

  • Interaction: The user adjusts the sample size \(n\). For small \(n\), the empirical histogram is skewed relative to the \(\chi^2\) overlay. As \(n\) increases, the histogram seamlessly merges with the \(\chi^2\) curve, validating Wilks’ Theorem empirically.
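This convergence can be checked for a case with a closed-form LRT statistic. For Exponential data under \(H_0: \lambda = 1\), the statistic reduces to \(-2\log\lambda(\mathbf{X}) = 2n(\bar{X} - 1 - \log\bar{X})\); a sketch comparing its simulated 95th percentile to the \(\chi^2_1\) critical value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
M = 50_000

def lrt_quantile(n):
    # H0: rate = 1 for Exponential data; -2 log lambda = 2n(xbar - 1 - log xbar)
    xbar = rng.exponential(1.0, size=(M, n)).mean(axis=1)
    stat = 2 * n * (xbar - 1 - np.log(xbar))
    return np.quantile(stat, 0.95)

q5, q100 = lrt_quantile(5), lrt_quantile(100)
chi2_95 = stats.chi2.ppf(0.95, df=1)  # 3.841
print(round(q5, 2), round(q100, 2), round(chi2_95, 2))
```

The simulated 95th percentile at \(n = 100\) sits close to 3.841, while at \(n = 5\) the approximation is noticeably rougher.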

12.4.2 10.3.2 Other Large-Sample Tests (Wald and Score)

Content:

  1. Wald Test: Uses the MLE. \(W = \frac{(\hat{\theta} - \theta_0)^2}{\widehat{Var}(\hat{\theta})} \xrightarrow{d} \chi^2_1\).

  2. Score (Rao) Test: Uses only the estimator under \(H_0\). \(S = \frac{\ell'(\theta_0)^2}{I(\theta_0)} \xrightarrow{d} \chi^2_1\).

Rigorous Focus: Geometric interpretations. The Wald test measures distance on the parameter scale; the Score test measures the slope of the log-likelihood at the null.

Interactive Resource: The Holy Trinity Visualizer

  • Design: A plot of the log-likelihood function \(\ell(\theta)\). Markers for \(\theta_0\) (null) and \(\hat{\theta}\) (MLE).

  • Interaction: The user drags the observed data, shifting the likelihood curve. The tool dynamically draws:

      • LRT: The vertical drop from the peak \(\ell(\hat{\theta})\) to \(\ell(\theta_0)\).

      • Wald: The horizontal distance \(|\hat{\theta} - \theta_0|\).

      • Score: The tangent line slope at \(\theta_0\).

  • As the user makes \(n\) large, the log-likelihood becomes nearly quadratic, and the tool shows the three test statistics converging to the same value, i.e., the tests are asymptotically equivalent.

12.5 5. Sub-Module 10.4: Interval Estimation

12.5.1 10.4.1 Approximate Maximum Likelihood Intervals

Content: Using the asymptotic normality of the MLE: \(\hat{\theta} \pm z_{\alpha/2} \sqrt{\frac{1}{nI(\hat{\theta})}}\).

Rigorous Focus: Replacing the true Fisher Information \(I(\theta)\) with the estimated Fisher Information \(I(\hat{\theta})\) or the observed Fisher Information.

12.5.2 10.4.2 Other Large-Sample Intervals

Content:

  1. Inverting large-sample Wald and Score tests.

  2. Bootstrap Percentile Intervals: Using the quantiles of the bootstrap distribution \(\hat{\theta}^*\) directly: \([\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}]\).

Interactive Resource: The Bootstrap CI Constructor

  • Design: The bootstrap histogram of \(\hat{\theta}^*\). A slider for \(\alpha\).

  • Interaction: The user selects a highly skewed estimator (e.g., the sample variance). The Wald interval (symmetric around \(\hat{\theta}\)) is plotted, often crossing impossible parameter boundaries (e.g., \(< 0\)). The user then clicks “Percentile Method”, and the tool slices off the \(\alpha/2\) tails of the bootstrap histogram, providing an asymmetric, realistic confidence interval that respects the natural boundaries of the parameter space.
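The percentile slicing can be sketched in a few lines, assuming a small skewed (Exponential) sample and the sample variance as the statistic:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.exponential(1.0, size=30)     # small skewed sample
theta_hat = data.var(ddof=1)             # sample variance: a skewed estimator

B = 5_000
idx = rng.integers(0, len(data), size=(B, len(data)))
boot = data[idx].var(axis=1, ddof=1)     # bootstrap distribution of the variance

lo, hi = np.quantile(boot, [0.025, 0.975])  # percentile interval
print(round(lo, 3), round(theta_hat, 3), round(hi, 3))
```

Because the quantiles come from the bootstrap distribution itself, the interval is asymmetric and its lower endpoint cannot cross zero.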

12.6 Part III: Technical Implementation Guidelines (Module 10 Specifics)

  1. High-Performance Resampling Engine: The Bootstrap and robustness simulations require generating millions of random numbers rapidly. The backend must use vectorized operations (NumPy/NumPyro in Python or a WebAssembly port of a C++ random library). The UI must not freeze; progress bars or streaming updates are mandatory for \(B > 10,000\) resamples.

  2. Automatic Differentiation (AD) Engine: Calculating the Score test and Fisher Information requires taking derivatives of the log-likelihood. Instead of hardcoding derivatives, integrate an AD library (like JAX or PyTorch in Python, or math.js for basic JS). This allows students to input any valid likelihood function and instantly get the Score test statistic and Wald standard errors.

  3. Robustness Outlier Injector: The Contamination Sandbox needs a fluid UI for adding/removing data points. Use D3.js to bind data points to DOM elements that can be dragged along the x-axis, instantly recalculating the Mean, Median, and Huber estimates without re-running the entire simulation.

  4. Chi-Square Quantile Calculator: Wilks’ Theorem relies heavily on the \(\chi^2\) distribution. The frontend needs a fast statistical library (like jStat) to compute CDFs and inverse CDFs (quantiles) of \(\chi^2_\nu\) to calculate asymptotic p-values and critical values dynamically.

  5. Influence Function Plotter: The Influence Function for M-estimators is defined by the derivative of \(\rho\). The tool should allow users to choose \(\rho\) (Quadratic, Absolute, Huber) and dynamically plot \(\psi(x)\), overlaying it on the data dot-plot to visually explain why certain estimators resist outliers (bounded \(\psi\)) and others do not (unbounded \(\psi\)).

13 Module 11: Analysis of Variance and Regression

13.1 1. Module Overview

This module transitions from univariate statistics to the analysis of structured data. It introduces two of the most foundational linear models in statistics: the Oneway Analysis of Variance (ANOVA) for comparing group means, and Simple Linear Regression (SLR) for modeling the relationship between two continuous variables. The module rigorously develops these models from both an algebraic optimization perspective (Least Squares) and a probabilistic optimality perspective (BLUE and MLE), emphasizing the geometric and distributional properties of the resulting estimators.

Learning Objectives:

  • Formulate the Oneway ANOVA model and partition the Total Sum of Squares (SST) into explained (SSG) and unexplained (SSE) variation.

  • Derive and evaluate the ANOVA F test for equality of means, proving its optimality under normality.

  • Define and test linear contrasts, and apply methods for simultaneous inference (Scheffé’s method).

  • Derive Ordinary Least Squares (OLS) estimators via calculus and orthogonal projections.

  • Prove the Gauss-Markov Theorem, establishing OLS as the Best Linear Unbiased Estimator (BLUE).

  • Derive the distributions of OLS estimators under the assumption of normal errors, and construct t-tests and confidence intervals.

  • Distinguish rigorously between confidence intervals for the mean response \(E(Y|x_0)\) and prediction intervals for a new observation \(Y_0\).

13.2 2. Sub-Module 11.2: Oneway Analysis of Variance (ANOVA)

13.2.1 11.2.1 Model and Distribution Assumptions

Content: The cell means model: \(Y_{ij} = \mu_i + \epsilon_{ij}\), or the overparameterized model: \(Y_{ij} = \mu + \tau_i + \epsilon_{ij}\), for \(i=1,\dots,k\) groups and \(j=1,\dots,n_i\). Assumptions: \(\epsilon_{ij} \text{ iid } N(0, \sigma^2)\).

Rigorous Focus: The constraint \(\sum n_i \tau_i = 0\) (or similar) to make the model identifiable.

13.2.1.1 11.2.2 The Classic ANOVA Hypothesis

Content: Testing \(H_0: \mu_1 = \mu_2 = \dots = \mu_k\) vs. \(H_1: \mu_i \neq \mu_j\) for some \(i \neq j\). (Equivalently, \(H_0: \tau_1 = \dots = \tau_k = 0\)).

13.2.2 11.2.3 & 11.2.4 Sums of Squares and the ANOVA F Test

Content: Partitioning the total variation: \(\text{SST} = \text{SSG} + \text{SSE}\).

  • \(\text{SST} = \sum \sum (Y_{ij} - \bar{\bar{Y}})^2\)

  • \(\text{SSG} = \sum n_i (\bar{Y}_i - \bar{\bar{Y}})^2\)

  • \(\text{SSE} = \sum \sum (Y_{ij} - \bar{Y}_i)^2\)

The F-statistic: \(F = \frac{\text{SSG}/(k-1)}{\text{SSE}/(N-k)}\).
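The partition and the F-statistic can be computed directly; the three unequal-sized groups below are hypothetical, with `scipy.stats.f_oneway` as a cross-check:

```python
import numpy as np
from scipy import stats

# Hypothetical data: k = 3 groups with unequal sizes.
groups = [np.array([4.1, 5.0, 4.8, 5.5]),
          np.array([6.2, 6.8, 5.9, 7.1, 6.5]),
          np.array([5.1, 4.9, 5.7])]

all_y = np.concatenate(groups)
grand = all_y.mean()
k, N = len(groups), len(all_y)

sst = ((all_y - grand) ** 2).sum()
ssg = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (ssg / (k - 1)) / (sse / (N - k))
print(np.isclose(sst, ssg + sse), round(F, 3))  # partition holds

# Cross-check against SciPy's one-way ANOVA.
F_scipy, p = stats.f_oneway(*groups)
print(round(F_scipy, 3), round(p, 4))
```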

Interactive Resource: The Variance Partitioner

  • Design: A dot plot showing \(k\) groups of data points. Three overlapping histograms/density curves representing the distributions of SST, SSG, and SSE.

  • Interaction: The user clicks and drags the group means (\(\bar{Y}_i\)) left and right. As the group means move further apart, the SSG bar chart grows and the SSE remains constant. The F-statistic slider updates dynamically. If the user moves all means to the same location, SSG drops to 0 and \(F \to 0\). This directly links the geometric spread of the groups to the numerator of the F-test.

13.2.3 11.2.3 Inferences Regarding Linear Combinations of Means

Content: A contrast \(C = \sum a_i \mu_i\) where \(\sum a_i = 0\). Estimating \(C\) with \(\hat{C} = \sum a_i \bar{Y}_i\) and constructing t-tests and confidence intervals.

Rigorous Focus: Contrasts allow targeted pairwise or complex comparisons while maintaining the interpretability of the ANOVA structure.

13.2.4 11.2.5 Simultaneous Estimation of Contrasts (Scheffé’s Method)

Content: The Multiple Comparisons Problem. If we test \(m\) hypotheses at level \(\alpha\), the family-wise error rate (FWER) inflates.

Rigorous Focus: Scheffé’s method allows all possible contrasts to be tested simultaneously with an overall FWER of \(\alpha\). The critical value is modified: \(\sqrt{(k-1)F_{k-1, N-k, \alpha}}\).

Interactive Resource: The Multiple Comparisons Trap

  • Design: A dashboard simulating \(k=5\) groups where \(H_0\) is strictly true (all means are equal).

  • Interaction: The user sets the per-comparison Type I error rate to \(\alpha = 0.05\). The tool runs 10,000 simulations, performing all \(\binom{5}{2} = 10\) pairwise t-tests. A bar chart shows the percentage of simulations in which at least one false positive occurred; the FWER inflates far above \(5\%\) (the naive independence bound is \(1 - 0.95^{10} \approx 40\%\)). The user then switches to Scheffé’s adjustment, and the FWER drops back to \(0.05\), visually proving the necessity of multiple comparison corrections.
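The inflation itself is easy to reproduce; a sketch with illustrative group size and simulation count (pairwise two-sample t-tests, all null hypotheses true):

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(8)
k, n, alpha, M = 5, 20, 0.05, 2_000

false_pos = 0
for _ in range(M):
    g = rng.normal(0, 1, size=(k, n))  # H0 true: all group means equal
    if any(stats.ttest_ind(g[i], g[j]).pvalue < alpha
           for i, j in combinations(range(k), 2)):
        false_pos += 1

fwer = false_pos / M
print(round(fwer, 3))  # far above the nominal 0.05
```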

13.3 3. Sub-Module 11.3: Simple Linear Regression (SLR)

13.3.1 11.3.1 Least Squares: A Mathematical Solution

Content: Model: \(Y_i = \alpha + \beta x_i + \epsilon_i\). Minimizing the sum of squared residuals \(S(\alpha, \beta) = \sum (Y_i - \alpha - \beta x_i)^2\).

Rigorous Focus: Taking partial derivatives to find the normal equations. The geometry of least squares: the observed vector \(\mathbf{Y}\) is orthogonally projected onto the column space of the design matrix \(\mathbf{X}\). The residuals \(\mathbf{e}\) are orthogonal to the fitted values \(\hat{\mathbf{Y}}\).

Interactive Resource: The Regression Sandbox

  • Design: A 2D scatterplot. A movable line with sliders for intercept (\(\alpha\)) and slope (\(\beta\)). Squares representing the squared residuals are drawn between the points and the line.

  • Interaction: The user manually adjusts \(\alpha\) and \(\beta\) to try to minimize the total area of the squares (SSR). A dynamic “SSR Meter” shows the current error. A “Snap to OLS” button animates the line jumping to the exact mathematical minimum, and the residual squares instantly reshape.

13.3.2 11.3.2 Best Linear Unbiased Estimators (BLUE): A Statistical Solution

Content: The Gauss-Markov Theorem. Under assumptions \(E[\epsilon_i] = 0\) and \(Var(\epsilon_i) = \sigma^2\) (homoscedasticity, no normality required), the OLS estimators \(\hat{\alpha}\) and \(\hat{\beta}\) have the smallest variance among all linear unbiased estimators.

Rigorous Focus: Proving the theorem by showing that the variance of any other linear unbiased estimator differs from the OLS variance by a positive semi-definite matrix.

Interactive Resource: The Gauss-Markov Variance Smackdown

  • Design: A simulator generating non-normal errors (e.g., skewed Exponential errors). Two histograms: one for OLS \(\hat{\beta}\), one for an alternative linear unbiased estimator (e.g., a line fit perfectly through the first and last data point).

  • Interaction: The user runs thousands of simulations. Both histograms are centered at the true \(\beta\) (both unbiased). However, the alternative estimator’s histogram is visibly wider. The tool calculates the empirical variances, confirming \(Var(\hat{\beta}_{OLS}) < Var(\hat{\beta}_{alt})\), demonstrating the power of Gauss-Markov without the crutch of Normality.
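The Smackdown can be sketched as a simulation: OLS versus the line through the first and last points, under centered Exponential (non-Normal) errors. The design and sample sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
n, beta, alpha_true, M = 20, 2.0, 1.0, 10_000
x = np.linspace(0, 1, n)
sxx = ((x - x.mean()) ** 2).sum()

# Skewed (Exponential) errors, centered so E[eps] = 0: no Normality anywhere.
eps = rng.exponential(1.0, size=(M, n)) - 1.0
y = alpha_true + beta * x + eps

# OLS slope for each simulated dataset.
b_ols = ((x - x.mean()) * (y - y.mean(axis=1, keepdims=True))).sum(axis=1) / sxx
# Alternative linear unbiased estimator: line through first & last point.
b_alt = (y[:, -1] - y[:, 0]) / (x[-1] - x[0])

print(round(b_ols.mean(), 2), round(b_alt.mean(), 2))   # both centered at 2
print(round(b_ols.var(), 3), round(b_alt.var(), 3))     # OLS variance is smaller
```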

13.3.3 11.3.3 & 11.3.4 Models and Distribution Assumptions (Normal Errors)

Content: Adding the assumption \(\epsilon_i \sim N(0, \sigma^2)\).

  • OLS estimators coincide with MLEs.

  • \(\hat{\beta} \sim N(\beta, \frac{\sigma^2}{S_{xx}})\) and \(\hat{\alpha} \sim N(\alpha, \sigma^2 (\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}))\).

  • \(S^2 = \frac{\sum(Y_i - \hat{Y}_i)^2}{n-2}\) is an unbiased estimator for \(\sigma^2\).

  • t-tests for \(H_0: \beta = \beta_0\) using the statistic \(T = \frac{\hat{\beta} - \beta_0}{S/\sqrt{S_{xx}}} \sim t_{n-2}\).

Interactive Resource: The Parameter Geometry Engine

  • Design: A 3D surface plot of the log-likelihood \(L(\alpha, \beta, \sigma^2 | \mathbf{x}, \mathbf{y})\).

  • Interaction: The user rotates the 3D surface. The tool overlays the OLS normal equations as geometric planes slicing through the peak. It highlights the ridge corresponding to \(\sigma^2\), showing how maximizing over \(\alpha\) and \(\beta\) first yields the profile likelihood for \(\sigma^2\).

13.3.4 11.3.5 Estimation and Prediction at \(x = x_0\)

Content: Two distinct goals:

  1. Confidence Interval for \(E(Y|x_0)\): Estimating the mean of the distribution at \(x_0\). \(Var(\hat{Y}_0) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)\).

  2. Prediction Interval for \(Y_0\): Predicting an individual new observation at \(x_0\). \(Var(Y_0 - \hat{Y}_0) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)\).

Rigorous Focus: The prediction interval is strictly wider because it must account for the intrinsic variance of the new observation (\(\sigma^2\)) plus the variance of estimating the mean.

Interactive Resource: The Band Splitter

  • Design: A scatterplot with the OLS regression line. Two shaded regions surrounding the line.

  • Interaction: The user clicks along the x-axis to select \(x_0\). A vertical slice appears showing the Normal distribution of \(E(Y|x_0)\) and the wider Normal distribution of \(Y_0\). The user adjusts a slider for \(x_0\). As \(x_0\) moves away from \(\bar{x}\), both bands “bowtie” outward (variance inflation), but the Prediction Interval always maintains a minimum width determined by the irreducible error \(\sigma^2\).
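The two variance formulas above translate directly into half-widths; a sketch on a simulated dataset (the design and error scale are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)

xbar = x.mean()
sxx = ((x - xbar) ** 2).sum()
b = ((x - xbar) * (y - y.mean())).sum() / sxx
a = y.mean() - b * xbar
s2 = ((y - a - b * x) ** 2).sum() / (n - 2)   # unbiased estimate of sigma^2
t = stats.t.ppf(0.975, n - 2)

def bands(x0):
    leverage = 1 / n + (x0 - xbar) ** 2 / sxx
    ci = t * np.sqrt(s2 * leverage)          # half-width for E(Y|x0)
    pi = t * np.sqrt(s2 * (1 + leverage))    # half-width for a new Y0
    return ci, pi

ci_mid, pi_mid = bands(xbar)      # narrowest point of both bands
ci_far, pi_far = bands(xbar + 5)  # widths inflate away from xbar
print(round(ci_mid, 3), round(pi_mid, 3), round(ci_far, 3), round(pi_far, 3))
```

The prediction half-width always exceeds the confidence half-width because of the extra \(\sigma^2\) term, and both grow as \(x_0\) moves away from \(\bar{x}\).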

13.3.5 11.3.6 Simultaneous Estimation and Confidence Bands

Content: Constructing a confidence band for the entire regression line over an interval of \(x\) values, maintaining a family-wise confidence level of \(1-\alpha\). The Working-Hotelling method: \(E(Y|x) \in \hat{\alpha} + \hat{\beta}x \pm \sqrt{2F_{2, n-2, \alpha}} S\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}}\).

Interactive Resource: The Working-Hotelling Enveloper

  • Design: The regression line with individual confidence intervals drawn at many \(x\) points.

  • Interaction: The user clicks “Simulate”. The tool draws 100 new samples, fitting 100 regression lines, and calculating 100 individual CIs. It highlights that at the edges (extreme \(x\) values), individual CIs fail to capture the true line ~5% of the time. It then overlays the Working-Hotelling band, which smoothly contains the entire true line in approximately 95% of simulations, demonstrating the geometric widening required for simultaneous inference.

13.4 Part III: Technical Implementation Guidelines (Module 11 Specifics)

  1. Matrix/Linear Algebra Backend: While SLR can be formulated with scalar sums (\(S_{xx}, S_{xy}\)), introducing the matrix formulation (\(\hat{\beta} = (X^T X)^{-1} X^T Y\)) under the hood sets the stage for Module 12 and is standard in Casella & Berger. Use numpy.linalg to handle these operations efficiently and cleanly.

  2. Interactive Geometry Engine: For the Regression Sandbox, the visualization of residuals as squares is a classic teaching tool. Use D3.js or Canvas to render actual squares stretching between the data points and the regression line. As the user drags the line, the squares must skew and resize in real-time, and the total area (SSR) must update instantly.

  3. 3D Likelihood Surface Renderer: For the Parameter Geometry Engine, use Plotly.js or Three.js to render the bivariate Normal log-likelihood surface. The surface must be semi-transparent so the user can see the MLE point and the normal equation planes intersecting at the base.

  4. Non-Normal Random Number Generator: To properly demonstrate the Gauss-Markov Variance Smackdown, the simulator must easily draw from non-Normal distributions (e.g., Chi-squared, Exponential, Uniform) to prove that OLS remains BLUE without Normality.

  5. Multiple Comparison Math: Scheffé’s method requires looking up F-distribution quantiles. The backend must dynamically calculate \(F_{k-1, N-k, \alpha}\) to scale the t-intervals correctly for the interactive simulator.
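As a minimal sketch of the matrix backend in item 1 (illustrative names, simulated data), `numpy.linalg.lstsq` solves the normal equations stably, and the result can be cross-checked against the scalar-sum slope \(S_{xy}/S_{xx}\):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 5.0, 40)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, 40)

X = np.column_stack([np.ones_like(x), x])          # design matrix with intercept column
# Normal equations give beta_hat = (X^T X)^{-1} X^T y; lstsq is numerically safer
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Cross-check the slope against the scalar-sum formula S_xy / S_xx
Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
slope_scalar = Sxy / Sxx
```

Preferring `lstsq` (or a QR factorization) over explicitly inverting \(X^T X\) avoids the numerical instability that the inverse introduces when predictors are nearly collinear.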

14 Module 12: Regression Models

14.1 1. Module Overview

Module 11 established Simple Linear Regression under idealized conditions (exact \(x\), Normal errors, constant variance). This module tackles the messy realities of data that violate these assumptions. It rigorously addresses three major departures: (1) measurement error in the predictors (Errors in Variables), (2) non-Normal, categorical response variables (Logistic Regression), and (3) outliers and heavy-tailed error distributions (Robust Regression). The focus shifts from closed-form OLS solutions to iterative numerical optimization (MLE, IRWLS) and the analysis of estimator sensitivity.

Learning Objectives:

  • Differentiate between Functional and Structural relationships in Errors-in-Variables (EIV) models.

  • Prove and visualize the attenuation bias (regression dilution) caused by measurement error in OLS.

  • Derive the orthogonal/Deming regression solution assuming known error variance ratios.

  • Formulate the Logistic Regression model using the Bernoulli distribution and the logit link function.

  • Derive the MLE for logistic regression via Iteratively Reweighted Least Squares (IRWLS) and compute asymptotic inference using the observed Fisher Information.

  • Define M-estimators for regression, construct robust loss functions (Huber, Tukey), and compute bounded influence weights.

  • Diagnose leverage points and distinguish between good and bad leverage points in robust regression.

14.2 2. Sub-Module 12.2: Regression with Errors in Variables

14.2.1 12.2.1 Functional and Structural Relationships

Content: In standard regression, \(x_i\) is fixed and known. In EIV, we observe \(W_i = x_i + U_i\) where \(U_i\) is measurement error.

  • Functional Relationship: The true \(x_i\) are fixed, unknown constants (nuisance parameters).

  • Structural Relationship: The true \(X_i\) are random variables \(X_i \sim N(\mu_X, \sigma_X^2)\).

Rigorous Focus: The model is unidentifiable without additional information (e.g., knowing the ratio \(\lambda = \sigma_U^2 / \sigma_V^2\), where \(V_i\) is the error in \(Y\)).

14.2.2 12.2.2 & 12.2.3 Least Squares and Maximum Likelihood Solutions

Content:

  • The Attenuation Effect: If we naively regress \(Y\) on \(W\), the OLS estimator \(\hat{\beta}_{OLS}\) is biased toward 0. \(E[\hat{\beta}_{OLS}] \approx \beta \frac{\sigma_X^2}{\sigma_X^2 + \sigma_U^2}\).

  • Orthogonal/Deming Regression: Minimizing the perpendicular distances to the line, weighted by the error variance ratio \(\lambda = \sigma_U^2 / \sigma_V^2\).

Interactive Resource: The Attenuation Dilution Simulator

  • Design: A scatterplot of true \((x, y)\) points forming a tight line, overlaid with a blurred scatterplot of observed \((W, Y)\) points.

  • Interaction: The user controls the measurement error variance \(\sigma_U^2\) via a slider. As \(\sigma_U^2\) increases, the \(W\) values spread out horizontally. The tool dynamically fits the naive OLS line (regressing \(Y\) on \(W\)). The line visually flattens toward 0, and a numeric readout displays the shrinking slope, proving the attenuation bias. A second button fits the Deming Regression line, which remains stable and accurate regardless of \(\sigma_U^2\).
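A minimal simulation sketch of the attenuation effect and the Deming correction, assuming the structural model above with known error variances (all names and the seed are illustrative). Note that with the document's \(\lambda = \sigma_U^2/\sigma_V^2\), the ratio used in the standard Deming slope formula is \(\delta = \sigma_V^2/\sigma_U^2 = 1/\lambda\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
beta_true, sigma_X, sigma_U, sigma_V = 1.5, 2.0, 1.0, 1.0

X = rng.normal(0.0, sigma_X, n)                    # latent true predictor (structural model)
Y = beta_true * X + rng.normal(0.0, sigma_V, n)    # response with error V
W = X + rng.normal(0.0, sigma_U, n)                # observed predictor with error U

# Naive OLS of Y on W is attenuated by the reliability ratio
Sww = np.sum((W - W.mean()) ** 2)
Swy = np.sum((W - W.mean()) * (Y - Y.mean()))
beta_ols = Swy / Sww
reliability = sigma_X**2 / (sigma_X**2 + sigma_U**2)   # here 4/5 = 0.8

# Deming regression with known delta = sigma_V^2 / sigma_U^2 (= 1/lambda)
delta = sigma_V**2 / sigma_U**2
Syy = np.sum((Y - Y.mean()) ** 2)
beta_deming = ((Syy - delta * Sww
                + np.sqrt((Syy - delta * Sww) ** 2 + 4 * delta * Swy**2))
               / (2 * Swy))
```

With these settings the naive slope concentrates near \(\beta \cdot 0.8 = 1.2\), while the Deming slope recovers the true \(\beta = 1.5\).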

14.2.3 12.2.4 Confidence Sets

Content: Constructing confidence intervals for the slope \(\beta\) in EIV models.

Rigorous Focus: The variance of the orthogonal regression estimator is larger than the naive OLS variance, correctly reflecting the increased uncertainty due to measurement error.

14.3 3. Sub-Module 12.3: Logistic Regression

14.3.1 12.3.1 The Model

Content: Modeling a binary response \(Y_i \in \{0, 1\}\). \(Y_i \sim \text{Bernoulli}(p_i)\), where the logit of the probability is a linear function of predictors: \(\log\left(\frac{p_i}{1-p_i}\right) = \alpha + \beta x_i\).

Rigorous Focus: Why OLS is inappropriate (predicted probabilities \(<0\) or \(>1\), non-constant variance \(Var(Y_i) = p_i(1-p_i)\)).

14.3.2 12.3.2 Estimation (MLE and IRWLS)

Content: The Likelihood function: \(L(\alpha, \beta | \mathbf{x}, \mathbf{y}) = \prod p_i^{y_i} (1-p_i)^{1-y_i}\). Because the score equations are non-linear, we use numerical optimization.

Rigorous Focus: The Newton-Raphson algorithm for the MLE simplifies to Iteratively Reweighted Least Squares (IRWLS). Each update solves a weighted least squares problem where the weights are \(W_i = n_i \hat{p}_i (1-\hat{p}_i)\) and the working response is \(Z_i = \hat{\eta}_i + \frac{Y_i - n_i\hat{p}_i}{n_i\hat{p}_i(1-\hat{p}_i)}\) (for ungrouped Bernoulli data, \(n_i = 1\)).

Interactive Resource: The Logit Bender & IRWLS Tracker

  • Design: A binary scatterplot (0s and 1s on the y-axis). An S-shaped logistic curve is overlaid.

  • Interaction: The user drags sliders for \(\alpha\) and \(\beta\) to manually fit the curve. A dynamic “Log-Likelihood Meter” shows how the likelihood changes.

  • Rigor Check: The user clicks “IRWLS Step”. The tool calculates the working response \(Z_i\) and weights \(W_i\), plots them as a transformed weighted scatterplot behind the curve, and fits a Weighted Least Squares line to it. It then updates the logistic curve. Repeated clicks animate the algorithm converging, visualizing how logistic regression iteratively re-weights the data to handle heteroscedasticity.

Worked Example: Asymptotic Inference

  • Content: Calculating the observed Fisher Information matrix \(I(\hat{\alpha}, \hat{\beta}) = \mathbf{X}^T \mathbf{W} \mathbf{X}\). Using the inverse to construct Wald tests and confidence intervals for odds ratios (\(e^\beta\)).
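The IRWLS recipe and the Fisher-Information inference above can be sketched directly in NumPy. The code below (illustrative names, simulated Bernoulli data, so \(n_i = 1\)) iterates the weighted least squares step and then forms \(X^T W X\) for Wald standard errors:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(0.0, 1.0, n)
eta_true = -0.5 + 1.2 * x                          # true (alpha, beta)
y = rng.binomial(1, 1 / (1 + np.exp(-eta_true)))

X = np.column_stack([np.ones(n), x])               # design matrix with intercept column
beta = np.zeros(2)                                 # start at (0, 0)

for _ in range(25):                                # Newton-Raphson == IRWLS for the logit link
    eta = X @ beta
    p = 1 / (1 + np.exp(-eta))
    W = p * (1 - p)                                # IRWLS weights (n_i = 1)
    z = eta + (y - p) / W                          # working response Z_i
    XtW = X.T * W                                  # X^T diag(W) without building diag(W)
    beta_new = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares step
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

# Observed Fisher information at the MLE, and Wald standard errors
p = 1 / (1 + np.exp(-(X @ beta)))
info = (X.T * (p * (1 - p))) @ X                   # X^T diag(p(1-p)) X
se = np.sqrt(np.diag(np.linalg.inv(info)))
odds_ratio = np.exp(beta[1])                       # e^beta for a one-unit change in x
```

A 95% Wald interval for the odds ratio is then `np.exp(beta[1] + np.array([-1.96, 1.96]) * se[1])`.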

14.4 4. Sub-Module 12.4: Robust Regression

14.4.1 12.4.1 The Breakdown of OLS

Content: OLS minimizes \(\sum (Y_i - \alpha - \beta x_i)^2\). Because the loss function \(\rho(r) = r^2\) increases rapidly, a single outlier can drag the regression line arbitrarily far.

Rigorous Focus: Differentiating between:

  1. Vertical outliers: Outliers in \(Y\) (OLS handles poorly).

  2. Leverage points: Outliers in \(X\) (OLS handles very poorly).

  3. Bad leverage points: Leverage points that are also vertical outliers (catastrophic for OLS).

Interactive Resource: The Leverage Point Injector

  • Design: A scatterplot of well-behaved data with an OLS line and a Robust M-estimator line (e.g., Huber) perfectly overlapping.

  • Interaction: The user clicks anywhere on the canvas to inject a data point. If they inject a vertical outlier, the OLS line deflects slightly. If they inject a point at an extreme \(x\) value (high leverage) with a misfitting \(y\) value, the OLS line wildly pivots to pass through it, while the Robust line remains stable. The tool dynamically calculates Cook’s Distance for the injected point.

14.4.2 12.4.2 M-Estimators for Regression

Content: Generalizing OLS by minimizing \(\sum \rho((Y_i - \alpha - \beta x_i)/\hat{\sigma})\), where \(\rho\) is a robust loss function, and \(\hat{\sigma}\) is a robust scale estimate (e.g., MAD).

  • Huber Loss: \(\rho(r) = \frac{1}{2}r^2\) for \(|r| \le c\), and \(c|r| - \frac{1}{2}c^2\) for \(|r| > c\). (Quadratic near 0, linear in the tails).

  • Tukey’s Biweight: \(\rho(r)\) that flattens out completely for extreme outliers, entirely rejecting their influence.

Interactive Resource: The Loss Function Forge

  • Design: A dual-panel interface. Left: Plots of different \(\rho(r)\) and their derivatives \(\psi(r)\) (the influence function). Right: A scatterplot with outliers.

  • Interaction: The user adjusts the tuning constant \(c\) for the Huber loss. The right panel dynamically fits the regression line. As \(c \to \infty\), the influence function becomes unbounded, and the line behaves exactly like OLS (pulled by outliers). As \(c \to 0\), it behaves like \(L_1\) regression (median). The user can switch to Tukey’s loss, where extreme outliers are ignored entirely (their weight drops to 0).
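The loss, influence, and weight functions shown in the Forge can be sketched as plain NumPy functions. The tuning constants 1.345 (Huber) and 4.685 (Tukey) are the conventional choices giving roughly 95% efficiency at the Normal; the function names are illustrative:

```python
import numpy as np

def huber_rho(r, c=1.345):
    """Huber loss: quadratic near 0, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r**2, c * a - 0.5 * c**2)

def huber_psi(r, c=1.345):
    """Influence function rho'(r): bounded at +-c."""
    return np.clip(r, -c, c)

def huber_weight(r, c=1.345):
    """IRWLS weight w(r) = psi(r)/r: 1 for inliers, c/|r| in the tails."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / a)

def tukey_weight(r, c=4.685):
    """Tukey biweight: weight falls smoothly to exactly 0 beyond c."""
    a = np.abs(r)
    return np.where(a <= c, (1 - (r / c) ** 2) ** 2, 0.0)
```

Letting \(c \to \infty\) in `huber_weight` makes every weight 1 (OLS behavior), matching the simulator's limiting case.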

14.4.3 12.4.3 Iteratively Reweighted Least Squares (IRWLS) for Robust Regression

Content: Solving the M-estimator requires solving \(\sum \psi(r_i) x_{ij} = 0\). This is achieved via IRWLS, with weights \(w_i = \frac{\psi(r_i/\hat{\sigma})}{r_i/\hat{\sigma}}\).

Rigorous Focus: The weight function \(w(r)\) dictates the influence. For Huber, \(w(r) \to c/|r|\) as \(|r| \to \infty\), meaning outliers are down-weighted proportional to their distance. For Tukey, \(w(r) \to 0\).

Interactive Resource: The Weight Tracker

  • Design: A scatterplot where each data point is a circle. The radius of the circle represents its IRWLS weight \(w_i\).

  • Interaction: The user runs the robust regression algorithm step-by-step. Inlier points maintain large circles (weight \(\approx 1\)). Outliers visibly shrink as the algorithm iterates, visually demonstrating how robust regression “turns down the volume” on bad data points, allowing the line to fit the majority of the data.
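A compact sketch of the robust IRWLS loop with Huber weights and a MAD scale estimate, run on simulated data with one injected bad leverage point (names, seed, and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.uniform(0.0, 10.0, n)
y = 1.0 + 0.8 * x + rng.normal(0.0, 0.5, n)
x[0], y[0] = 9.5, -20.0                            # inject a bad leverage point

X = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]    # plain OLS: pulled by the outlier
beta = beta_ols.copy()                             # IRWLS starting value

c = 1.345                                          # Huber tuning constant
for _ in range(100):
    r = y - X @ beta
    sigma = 1.4826 * np.median(np.abs(r - np.median(r)))   # robust MAD scale
    u = r / sigma
    w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))       # Huber weights w_i = psi(u)/u
    XtW = X.T * w
    beta_new = np.linalg.solve(XtW @ X, XtW @ y)           # weighted LS step
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        beta = beta_new
        break
    beta = beta_new
# The injected point's weight w[0] collapses toward 0, so the robust slope
# stays near the true 0.8 while the OLS slope is dragged away by the outlier.
```

The final `w` is exactly the per-point quantity the Weight Tracker visualizes as circle radii.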

14.5 Part III: Technical Implementation Guidelines (Module 12 Specifics)

  1. Numerical Optimization Solvers (Logistic & Robust): Module 12 relies heavily on iterative algorithms that have no closed-form solution. The backend must expose robust optimization libraries. For Logistic MLE, use scipy.optimize.minimize (e.g., L-BFGS-B). For Robust M-estimators, implement the IRWLS loop manually, ensuring convergence checks (tolerances) and step-halving to prevent divergence.

  2. Matrix Algebra for IRWLS: Both Logistic Regression and Robust Regression utilize IRWLS. The core operation is Weighted Least Squares: \(\hat{\beta} = (X^T W X)^{-1} X^T W Z\). Use NumPy to compute this efficiently. Ensure the design matrix \(X\) includes a column of 1s for the intercept.

  3. Robust Scale Estimators (MAD): The Robust Regression M-estimator requires a preliminary estimate of scale \(\hat{\sigma}\) that is itself robust to the very outliers we are trying to ignore. Implement the Median Absolute Deviation (MAD): \(\hat{\sigma} = 1.4826 \times \text{median}(|r_i - \text{median}(r)|)\). The constant 1.4826 ensures consistency for the Normal distribution.

  4. Outlier Injection UI Mechanics: For the Leverage Point Injector, the UI must allow precise clicking. Map pixel coordinates directly to the \((x, y)\) data space. After injecting a point, the OLS and Robust models must refit instantly (within \(\sim 100\)ms) to provide satisfying interactive feedback. Use WebGL or highly optimized Canvas rendering if rendering thousands of points.

  5. Information Matrix Calculator for Logistic: To provide inference (standard errors, Wald tests) for Logistic Regression, the engine must calculate the Observed Fisher Information \(I(\hat{\beta}) = X^T \text{diag}(p_i(1-p_i)) X\). The tool should invert this matrix on the fly to return the variance-covariance matrix to the frontend for display.
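The consistency constant in item 3 can be checked empirically in a couple of lines (a sketch with an arbitrary seed): for Normal data with scale \(\sigma\), the rescaled MAD recovers \(\sigma\).

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.normal(0.0, 2.0, 200_000)                  # true scale sigma = 2
mad_sigma = 1.4826 * np.median(np.abs(z - np.median(z)))
# mad_sigma is close to 2.0: 1.4826 = 1/Phi^{-1}(3/4) rescales the Normal MAD to sigma
```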
