This SOCR DSPA2 Appendix contains a comprehensive suite of interactive learning modules designed for a rigorous mathematical-statistics course.
The SOCR Statistical Data Analyzer (SDA), specifically the SDAP Math-Stat Learning Modules, includes interactive content that supports active learning of these abstract ideas, concepts, methods, and techniques and accommodates different student learning styles.
Curriculum Architecture Overview: The modules are structured to move from probabilistic foundations to advanced inferential methods, mirroring the text’s progression.
This foundational module establishes the rigorous mathematical language of probability required for statistical inference. It transitions from set-theoretic foundations and axiomatic probability to combinatorial mechanics, conditional structures, and finally the formal definition and characterization of random variables via CDFs and PDFs. Mastery of this module is a prerequisite for all subsequent inferential theory.
Learning Objectives:
Content: Definition of the sample space \(S\) (discrete vs. continuous), definition of an event \(A \subseteq S\), and the concepts of the null set \(\emptyset\) and the entire space \(S\).
Content: Union (\(A \cup B\)), Intersection (\(A \cap B\)), Complement (\(A^c\)), and set differences. Commutative, associative, and distributive laws. Rigorous Focus: De Morgan’s Laws: \((\cup_{i=1}^n A_i)^c = \cap_{i=1}^n A_i^c\) and \((\cap_{i=1}^n A_i)^c = \cup_{i=1}^n A_i^c\).
Interactive Resource: Venn Diagram Set Manipulator
Content: The Kolmogorov Axioms for a probability function \(P: \mathcal{B} \to [0,1]\): 1. \(P(A) \ge 0\) for all \(A \in \mathcal{B}\). 2. \(P(S) = 1\). 3. If \(A_1, A_2, \dots\) are mutually exclusive, \(P(\cup_{i=1}^\infty A_i) = \sum_{i=1}^\infty P(A_i)\) (Countable Additivity).
Content: Derivations from the axioms: \(P(\emptyset) = 0\), \(P(A^c) = 1 - P(A)\), monotonicity (\(A \subset B \implies P(A) \le P(B)\)), and additive generalizations. Rigorous Focus: Boole’s Inequality \(P(\cup A_i) \le \sum P(A_i)\) and Bonferroni’s Inequality \(P(\cap A_i) \ge 1 - \sum P(A_i^c)\).
Interactive Resource: The Axiom Derivation Graph
Content: The Multiplication Principle, Permutations (ordered subsets), and Combinations (unordered subsets). The Binomial Coefficient \(\binom{n}{k} = \frac{n!}{k!(n-k)!}\). Rigorous Focus: Enumerating equally likely outcomes to calculate probabilities.
Worked Example: The Matching Problem (Montmort)
Scenario: \(n\) letters placed into \(n\) addressed envelopes at random. What is the probability of exactly \(k\) matches?
Interactive Element: “The Hat-Check Simulator”. Users select \(n\) (e.g., \(n=5\)) and run Monte Carlo simulations to approximate the probability. The module then provides the rigorous combinatorial derivation using the Inclusion-Exclusion principle, showing how the limit as \(n \to \infty\) approaches \(1/e\).
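The simulator’s Monte Carlo core can be sketched in a few lines of Python (the function name and defaults here are illustrative, not part of the SDA codebase):

```python
import math
import random

def matching_probability(n, k, trials=100_000, seed=0):
    """Monte Carlo estimate of P(exactly k letters land in their own envelope)."""
    rng = random.Random(seed)
    hits = 0
    letters = list(range(n))
    for _ in range(trials):
        perm = letters[:]
        rng.shuffle(perm)                   # random assignment of letters to envelopes
        matches = sum(1 for i, p in enumerate(perm) if i == p)
        if matches == k:
            hits += 1
    return hits / trials

# P(0 matches) approaches 1/e ≈ 0.3679 even for moderate n
p0 = matching_probability(10, 0)
```

Running this for several values of \(n\) shows how quickly the no-match probability stabilizes near \(1/e\).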
Content: Definition of \(P(A|B) = \frac{P(A \cap B)}{P(B)}\) for \(P(B) > 0\). The conditional probability as a restriction of the sample space.
Content: Partition of a sample space \(B_1, \dots, B_k\). The Law of Total Probability: \(P(A) = \sum_{i=1}^k P(A|B_i)P(B_i)\). Bayes’ Rule: \(P(B_j|A) = \frac{P(A|B_j)P(B_j)}{\sum_{i=1}^k P(A|B_i)P(B_i)}\).
Interactive Resource: Bayesian Medical Tester
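The computation behind such a tester is a direct application of Bayes’ Rule with the Law of Total Probability in the denominator. A minimal sketch, with illustrative prevalence/sensitivity/specificity values (not taken from the module):

```python
def posterior_disease(prior, sensitivity, specificity):
    """Bayes' rule: P(disease | positive test)."""
    p_pos_given_d = sensitivity
    p_pos_given_nd = 1 - specificity
    # Law of Total Probability over the partition {disease, no disease}
    p_pos = p_pos_given_d * prior + p_pos_given_nd * (1 - prior)
    return p_pos_given_d * prior / p_pos

# Rare disease: even a 99%-sensitive, 99%-specific test yields a modest posterior
post = posterior_disease(prior=0.001, sensitivity=0.99, specificity=0.99)
```

Here the posterior is only about 9%, the classic base-rate result that the tester is designed to make visceral.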
Content: Definition: \(A\) and \(B\) are independent iff \(P(A \cap B) = P(A)P(B)\). Rigorous Focus: The crucial distinction between Pairwise Independence and Mutual Independence for collections of \(n\) events (requiring \(2^n - n - 1\) equations to hold).
Interactive Resource: The Three-Coin Paradox
Content: A random variable \(X\) is a function from the sample space \(S\) to the real numbers: \(X: S \to \mathbb{R}\). The pre-image (inverse mapping) \(X^{-1}(B) = \{s \in S : X(s) \in B\}\).
Interactive Resource: Sample Space to Real Line Mapper
Design: A two-panel interface. Left panel: Graphical representation of \(S\) (e.g., a grid of \((die1, die2)\) outcomes). Right panel: A real number line.
Interaction: User defines a function (e.g., \(X = \max(die1, die2)\) or \(X = die1 + die2\)). Hovering over a point on the real line (e.g., \(X=4\)) highlights the corresponding set of points in \(S\) (the pre-image). This visualizes the abstraction from sample outcomes to real-valued measurements.
Content: Definition: \(F_X(x) = P(X \le x)\) for all \(x \in \mathbb{R}\). Rigorous Focus: The four mathematical properties of a valid CDF: (1) \(\lim_{x \to -\infty} F_X(x) = 0\); (2) \(\lim_{x \to \infty} F_X(x) = 1\); (3) \(F_X\) is nondecreasing; (4) \(F_X\) is right-continuous.
Interactive Resource: CDF Sketcher & Validator
Content: \(P(a < X \le b) = F_X(b) - F_X(a)\). Handling strict vs. non-strict inequalities: \(P(X = x) = F_X(x) - F_X(x^-)\).
Content: Definition for discrete random variables: \(f_X(x) = P(X = x)\). Properties: \(f_X(x) \ge 0\) and \(\sum_{x \in \mathcal{X}} f_X(x) = 1\). Relationship to CDF: \(F_X(x) = \sum_{y \le x} f_X(y)\).
Content: Definition for continuous random variables: \(f_X(x)\) such that \(F_X(x) = \int_{-\infty}^x f_X(t)dt\). Properties: \(f_X(x) \ge 0\) and \(\int_{-\infty}^\infty f_X(t)dt = 1\). Note that \(f_X(x)\) does not have to be \(\le 1\). Rigorous Focus: For continuous \(X\), \(P(X = x) = 0\), and \(P(a \le X \le b) = P(a < X < b) = \int_a^b f_X(x)dx\).
Interactive Resource: The PDF/CDF Duality Engine
Design: A split screen. Top: a dynamic PDF curve. Bottom: the corresponding CDF curve.
Interaction: * PDF to CDF: The user shades an area under the PDF curve from \(-\infty\) to \(x\). The area value appears, and a corresponding point plots on the CDF below. Dragging \(x\) along the x-axis dynamically builds the CDF. * CDF to PDF: The user manipulates points on the CDF (respecting the 4 axioms from 1.5). The PDF automatically calculates and draws as the derivative of the CDF. For discrete jumps, the PDF plots discrete spikes equal to the jump height. * The “Density > 1” Mythbuster: A slider allows the user to decrease the variance of a normal distribution. The PDF peak rises above 1, while the total integral remains 1, correcting the common misconception that a pdf is a probability.
Set Theory Engine: Use a library capable of boolean geometry operations (like Clipper.js or paper.js) to accurately render the unions, intersections, and differences of arbitrary shapes for the Venn Diagram manipulator.
Symbolic Logic Integration: For the Axiom Derivation Graph and the Independence checker, integrate a lightweight symbolic logic validator. The student inputs \(P(A \cap B) == P(A)P(B)\), and the system tests this algebraically or numerically against simulation data.
Combinatorics Optimizer: For the counting module, ensure the backend can compute large factorials exactly using arbitrary-precision integer arithmetic (like Python’s built-in math.comb or math.factorial) to prevent overflow when students test limits (e.g., \(n=100\) in the matching problem).
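As a concrete backend sketch, the matching-problem probability from Module 1 can be computed exactly with Python integers via the derangement recursion (function names are illustrative):

```python
from math import comb, factorial

def derangements(m):
    """Subfactorial: permutations of m items with no fixed points,
    via the recursion D_m = (m-1)(D_{m-1} + D_{m-2})."""
    d_prev, d_curr = 1, 0          # D_0 = 1, D_1 = 0
    if m == 0:
        return d_prev
    for i in range(2, m + 1):
        d_prev, d_curr = d_curr, (i - 1) * (d_prev + d_curr)
    return d_curr

def p_exactly_k_matches(n, k):
    """Exact Montmort probability: choose the k matched positions,
    then derange the remaining n - k letters."""
    return comb(n, k) * derangements(n - k) / factorial(n)
```

Because Python integers are arbitrary precision, \(n=100\) poses no overflow risk; the intermediate counts are huge, but the final ratio is an ordinary float.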
Riemann/Lebesgue Integration visualization: In the PDF/CDF Duality Engine, render the area under the curve using WebGL/Canvas to ensure smooth animation as the integral limit \(x\) slides along the axis. For discrete CDFs, ensure the “step” function rendering accurately places open and closed circles to denote right-continuity.
This module bridges the gap between basic probability distributions and the behavior of functions of random variables. It covers the mathematical rigor required to find the distribution of a transformed variable, calculate its expected value, and utilize Moment Generating Functions (MGFs) to characterize distributions. It also introduces the critical theoretical justifications for differentiating under an integral sign.
Learning Objectives:
Content: Definition of \(Y = g(X)\), the mapping \(g(x): \mathcal{X} \to \mathcal{Y}\), and the inverse mapping \(g^{-1}(A) = \{x \in \mathcal{X}: g(x) \in A\}\). Proving the distribution of \(Y\) satisfies Kolmogorov Axioms.
Interactive Resource: Visual Mapping Engine * Design: A split-screen canvas. Left side: a number line representing \(\mathcal{X}\). Right side: a number line representing \(\mathcal{Y}\). * Interaction: The user selects a function \(g(x)\) (e.g., \(x^2\), \(e^x\)). A point or interval highlight on \(\mathcal{X}\) instantly maps to \(\mathcal{Y}\), visually demonstrating how \(g^{-1}\) maps sets back to sets. * Rigor Check: Include a “Non-one-to-one” mode (e.g., \(y=x^2\)) where selecting \(y > 0\) on \(\mathcal{Y}\) highlights two disjoint intervals on \(\mathcal{X}\), emphasizing that \(g^{-1}(y)\) can be a set, not just a point.
Content: Finding the pmf of \(Y\) via \(f_Y(y) = \sum_{x \in g^{-1}(y)} f_X(x)\).
Worked Example: The Binomial Transformation (Example 2.1.1)
Content: Theorem 2.1.3 (CDF method) and Theorem 2.1.5 (PDF method).
Formula: \(f_Y(y) = f_X(g^{-1}(y)) |\frac{d}{dy}g^{-1}(y)|\).
Interactive Resource: The Jacobian Visualizer
Worked Examples:
Uniform-Exponential Relationship (Ex 2.1.4): \(X \sim \text{Uniform}(0,1)\), \(Y = -\log X\). Derivation showing \(Y \sim \text{Exp}(1)\).
Inverted Gamma (Ex 2.1.6): \(X \sim \text{Gamma}(\alpha, \beta)\), \(Y=1/X\). Derivation of the inverted gamma pdf.
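The Uniform–Exponential derivation above is easy to verify numerically; a stdlib sketch (seed and sample size are arbitrary choices):

```python
import math
import random

rng = random.Random(42)
n = 100_000
# X ~ Uniform(0,1); use 1 - U so the argument of log lies in (0, 1]
ys = [-math.log(1.0 - rng.random()) for _ in range(n)]  # Y = -log X

mean_y = sum(ys) / n                   # Exp(1) mean is 1
p_le_1 = sum(y <= 1 for y in ys) / n   # F(1) = 1 - e^{-1} ≈ 0.632
```

Both summaries should match the Exp(1) theory to Monte Carlo accuracy, mirroring the CDF-method derivation.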
Content: Theorem 2.1.8 (Piecewise monotone transformations). Partitioning \(\mathcal{X}\) into \(A_1, \dots, A_k\).
Formula: \(f_Y(y) = \sum_{i=1}^k f_X(g_i^{-1}(y)) |\frac{d}{dy}g_i^{-1}(y)|\).
Worked Example: Normal-Chi Squared Relationship (Ex 2.1.9)
Content: Theorem 2.1.10. If \(X\) has continuous cdf \(F_X\), then \(Y = F_X(X) \sim \text{Uniform}(0,1)\). Definition of the generalized inverse \(F_X^{-1}(y) = \inf\{x: F_X(x) \ge y\}\).
Interactive Resource: Random Variate Generator
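For discrete distributions, the generalized inverse \(F_X^{-1}(y) = \inf\{x: F_X(x) \ge y\}\) is exactly what a left bisection search on the CDF computes. A minimal sketch of such a generator (names illustrative):

```python
import random
from bisect import bisect_left
from itertools import accumulate

def inverse_cdf_draw(values, cdf, rng):
    """Generalized inverse: return inf{x : F(x) >= u} for u ~ Uniform(0,1).
    bisect_left gives the leftmost index with cdf[i] >= u, which is exactly
    the generalized inverse; the clamp guards float round-off in cdf[-1]."""
    u = rng.random()
    return values[min(bisect_left(cdf, u), len(values) - 1)]

rng = random.Random(1)
values, probs = [0, 1, 2], [0.2, 0.5, 0.3]
cdf = list(accumulate(probs))
draws = [inverse_cdf_draw(values, cdf, rng) for _ in range(50_000)]
freq1 = draws.count(1) / len(draws)   # should approach P(X = 1) = 0.5
```

Note that `bisect_left` handles the flat segments and jumps of a discrete CDF correctly, which is precisely why the infimum definition is needed.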
Content: Definition 2.2.1 (The “Law of the Unconscious Statistician”). Theorem 2.2.5 (Linearity, Non-negativity, Monotonicity, Boundedness).
Worked Examples:
Content: Example 2.2.6. Minimizing \(E[(X-b)^2]\).
Derivation: Expanding \(E[(X - EX + EX - b)^2]\) to prove the minimum occurs at \(b = EX\).
Interactive Resource: MSE Optimization Slider
Content: Definitions of \(\mu'_n\) and \(\mu_n\). Definition of Variance (\(\text{Var } X\)) and Standard Deviation. Theorem 2.3.4 (\(\text{Var }(aX+b) = a^2 \text{Var } X\)). Computational formula: \(\text{Var } X = EX^2 - (EX)^2\).
Interactive Resource: Variance Decomposition
Content: Definition 2.3.6 (\(M_X(t) = E[e^{tX}]\)). Theorem 2.3.7 (Generating moments via \(E[X^n] = M_X^{(n)}(0)\)). Theorem 2.3.15 (\(M_{aX+b}(t) = e^{bt}M_X(at)\)).
Worked Examples:
Content: The issue of non-unique moments (Example 2.3.10 - Lognormal vs. modified Lognormal). Theorem 2.3.11 (Uniqueness of MGF). Theorem 2.3.12 (Convergence of MGFs implies convergence of CDFs).
Interactive Resource: The Poisson Approximation Simulator (Ex 2.3.13)
Content: Theorem 2.4.1. Differentiating integrals with variable limits.
Formula: \(\frac{d}{d\theta} \int_{a(\theta)}^{b(\theta)} f(x,\theta)dx = f(b(\theta),\theta)b'(\theta) - f(a(\theta),\theta)a'(\theta) + \int_{a(\theta)}^{b(\theta)} \frac{\partial}{\partial \theta}f(x,\theta)dx\).
Content: Theorem 2.4.3 and Corollary 2.4.4. The necessity of the dominating function \(g(x,\theta)\) and the Lipschitz-like condition bounding the derivative.
Worked Example: Interchanging I (Ex 2.4.5)
Content: Theorem 2.4.8. Uniform convergence of series of derivatives.
Worked Example: The Geometric Distribution (Ex 2.4.7 & 2.4.9)
To develop these resources effectively, the following tech stack and pedagogical patterns are recommended:
Interactive Visualization Engine: Use D3.js or Plotly.js for the mapping and distribution visualizers. They require binding data (the pdf/cdf functions) to DOM elements to allow real-time updates as parameters change.
Symbolic Computation: Integrate SymPy (Python) or MathJax with step-by-step reveal logic. For modules involving MGFs and Leibniz’s rule, the algebra is dense. The UI should allow users to click “Next Step” to see the expansion of \(E[(X-EX + EX - b)^2]\) or the factoring of the binomial coefficient.
Assessment Algorithmics: * Transformation Drills: Randomly generate a base distribution (e.g., Gamma) and a transformation (e.g., \(Y = 1/X\)). The student must choose the correct theorem (Monotone vs. Piecewise) and input the resulting PDF. The backend evaluates symbolically. * MGF Matching: Given an MGF, match it to the distribution. This reinforces Theorem 2.3.11.
“Prove It” Code Blocks: For Section 2.4, provide Python/R code templates where students must write code to numerically verify that \(\lim_{h \to 0} \frac{1}{h} \int [f(x, \theta+h) - f(x, \theta)] dx\) equals \(\int \frac{\partial}{\partial \theta} f(x, \theta) dx\) for specific functions, bridging the gap between theoretical limits and numerical computation.
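One such “Prove It” template might look like the following stdlib sketch, using \(f(x,\theta) = e^{-\theta x}\) on \((0,\infty)\), for which both sides equal \(-1/\theta^2\) (the quadrature rule and truncation point are implementation choices, not prescribed by the module):

```python
import math

def integral(f, a, b, n=20_000):
    """Composite trapezoid rule on [a, b]."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        s += f(a + i * h)
    return s * h

theta, eps = 2.0, 1e-5
f = lambda x, t: math.exp(-t * x)        # integrand; ∫_0^∞ f dx = 1/θ
df = lambda x, t: -x * math.exp(-t * x)  # ∂f/∂θ

# Left side: difference quotient of the integral (domain truncated at x = 40,
# where the integrand is numerically zero)
lhs = (integral(lambda x: f(x, theta + eps), 0, 40)
       - integral(lambda x: f(x, theta - eps), 0, 40)) / (2 * eps)
# Right side: integral of the partial derivative; both should equal -1/θ² = -0.25
rhs = integral(lambda x: df(x, theta), 0, 40)
```

Students can then swap in integrands that violate the dominating-function condition and watch the two sides disagree.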
This module transitions from the general mechanics of random variables (Module 2) to the specific, named families of distributions that form the workhorses of statistical inference. Rather than merely cataloging formulas, the module emphasizes the structural relationships between distributions, the unifying mathematical framework of Exponential Families, and the geometric interpretations of Location and Scale families.
Learning Objectives:
Content: Rigorous definition, parameter space, support, mean, variance, and MGFs for:
Interactive Resource: The Parameter Topology Explorer * Design: A multi-panel dashboard featuring a dynamic bar chart (PMF), a parameter slider panel, and an “Assumptions” checklist. * Interaction: As the user adjusts parameters (e.g., \(n, p\) for Binomial), the PMF updates in real-time. To demonstrate the Poisson limit, a “Link to Binomial” toggle forces \(\lambda = n \cdot p\). As \(n \to \infty\) (via slider), the Binomial bars smoothly morph into the Poisson bars. * Rigorous Focus: The Hypergeometric variance formula. A simulation window contrasts sampling with replacement (Binomial) vs. without replacement (Hypergeometric), visually demonstrating how the finite population correction factor affects the spread as sample size approaches population size.
Content: Definition and recursive properties of the Gamma function \(\Gamma(\alpha) = \int_0^\infty t^{\alpha-1}e^{-t}dt\). The Gamma pdf: \(f(x|\alpha,\beta) = \frac{1}{\Gamma(\alpha)\beta^\alpha}x^{\alpha-1}e^{-x/\beta}\). Special Cases: Exponential (\(\alpha=1\)) and Chi-squared (\(\alpha=\nu/2, \beta=2\)).
Content: \(X \sim N(\mu, \sigma^2)\). The pdf, standardization \(Z = (X-\mu)/\sigma\), and MGF \(M_X(t) = \exp(\mu t + \sigma^2 t^2 / 2)\).
Content: \(X \sim \text{Beta}(\alpha, \beta)\). Support on \([0,1]\). Relationship to order statistics (preview of Module 5).
Interactive Resource: The Distribution Genealogy Graph
Content: A family of pdfs/pmfs is an exponential family if it can be expressed as: \[f(x|\theta) = h(x)c(\theta)\exp\left( \sum_{i=1}^k w_i(\theta)t_i(x) \right)\] Definitions of \(h(x)\) (base measure), \(c(\theta)\) (normalizing constant), \(w_i(\theta)\) (natural parameters), and \(t_i(x)\) (sufficient statistics).
Interactive Resource: The Exponential Family Decomposer
Content: Defining the natural parameter space \(\Omega = \{ \theta : \int h(x)c(\theta)\exp(\sum w_i(\theta)t_i(x))dx < \infty \}\). Concept of a full-rank exponential family (the \(w_i(\theta)\) functions are linearly independent, and the \(t_i(x)\) statistics are linearly independent). Rigorous Focus: Convexity of the natural parameter space.
Worked Example: The Curved Exponential Family
Scenario: \(X \sim N(\theta, \theta^2)\).
Interactive Step-by-Step: Show that while it fits the exponential family form, the dimension of the parameter space (1) is less than the dimension of the natural parameter vector (2). The tool visually represents the parameter curve \((\mu/\sigma^2, -1/(2\sigma^2))\) restricted to a 1-D manifold within the 2-D natural parameter space.
Content: \(X\) is a member of a location family if \(X = \theta + Z\), where \(Z\) has a “standard” pdf \(f(z)\). Thus \(f(x|\theta) = f(x-\theta)\). Visualized as horizontal shifts.
Content: \(X\) is a member of a scale family if \(X = \sigma Z\) (for \(\sigma > 0\)). Thus \(f(x|\sigma) = \frac{1}{\sigma}f(\frac{x}{\sigma})\). Visualized as horizontal stretching/shrinking.
Content: \(X = \mu + \sigma Z\). Thus \(f(x|\mu,\sigma) = \frac{1}{\sigma}f(\frac{x-\mu}{\sigma})\). Examples: Normal, Cauchy, Double Exponential.
Interactive Resource: The Shape-Shifter Engine
Design: A canvas displaying a standard pdf \(f(z)\) (e.g., Standard Cauchy) with integration bounds.
Interaction: 1. Location Phase: A slider adjusts \(\theta\). The curve slides horizontally. The tool enforces that the area remains 1 and the shape is rigid. 2. Scale Phase: A slider adjusts \(\sigma\). The curve stretches. Crucially, the y-axis automatically scales to show that as the curve widens, it must flatten to preserve \(\int f(x)dx = 1\).
Rigorous Focus: The Jacobian justification. As the user applies the transformation \(x \to \mu + \sigma z\), the tool displays the differential \(dx = \sigma dz\), proving mathematically why the \(\frac{1}{\sigma}\) pre-multiplier is required in the pdf.
Content: Markov’s Inequality (\(P(X \ge a) \le \frac{E[X]}{a}\) for \(X \ge 0\)) and Chebyshev’s Inequality (\(P(|X-\mu| \ge k\sigma) \le \frac{1}{k^2}\)).
Interactive Resource: The Bound Tightness Tester
Content: Useful algebraic identities for manipulating moments and expectations. Rigorous Focus: Stein’s Lemma (for Normal variables: \(E[g(X)(X-\mu)] = \sigma^2 E[g'(X)]\)).
Worked Example: Applying Stein’s Lemma
Scenario: Calculating \(E[X^3]\) for \(X \sim N(0,1)\) without direct integration.
Interactive Element: Step-by-step derivation. Set \(g(x) = x^2\), so \(g'(x) = 2x\). The interface shows the substitution into Stein’s Lemma: \(E[X^2 \cdot X] = 1 \cdot E[2X]\), resulting in \(E[X^3] = 2(0) = 0\). The tool then prompts the user to calculate \(E[X^4]\) by choosing \(g(x) = x^3\).
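A Monte Carlo check of Stein’s Lemma for the \(E[X^4]\) prompt is straightforward (seed and sample size are arbitrary):

```python
import random

rng = random.Random(7)
n = 200_000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Stein's lemma with g(x) = x^3, g'(x) = 3x^2, for X ~ N(0, 1):
# E[g(X)(X - μ)] = σ² E[g'(X)]  becomes  E[X^4] = E[3 X^2] = 3
lhs = sum(x**4 for x in xs) / n
rhs = sum(3 * x**2 for x in xs) / n
```

Both estimates hover around 3, the fourth moment of the standard Normal, without any direct integration.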
This module generalizes the concepts of univariate random variables to the multivariate setting. It rigorously develops the mathematical framework required to model the simultaneous behavior of multiple random variables, focusing on their joint, marginal, and conditional structures. The module covers the complex calculus of bivariate transformations, the algebra of covariance, hierarchical models, and key inequalities that bound probabilistic behavior.
Learning Objectives:
Content: Definition of the joint CDF \(F_{X,Y}(x,y) = P(X \le x, Y \le y)\). Joint pmf for discrete variables: \(f_{X,Y}(x,y) = P(X=x, Y=y)\). Joint pdf for continuous variables: \(P((X,Y) \in A) = \iint_A f_{X,Y}(x,y) \,dx\,dy\).
Content: Recovering the univariate distribution from the joint distribution. Discrete: \(f_X(x) = \sum_y f_{X,Y}(x,y)\). Continuous: \(f_X(x) = \int_{-\infty}^\infty f_{X,Y}(x,y) \,dy\).
Interactive Resource: The 3D Marginalizer
Content: Definition of the conditional pdf/pmf: \(f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}\) for \(f_X(x) > 0\). The conditional distribution as a probability distribution in its own right.
Content: \(X\) and \(Y\) are independent iff \(f_{X,Y}(x,y) = f_X(x)f_Y(y)\) for all \(x,y\). Equivalently, \(F_{X,Y}(x,y) = F_X(x)F_Y(y)\). Rigorous Focus: Support sets. If the support of \((X,Y)\) is not a Cartesian product of the supports of \(X\) and \(Y\) (e.g., a triangular region), the variables cannot be independent.
Interactive Resource: Conditional Shape-Shifter & Independence Checker
Design: A 2D heatmap of a joint pdf \(f_{X,Y}\). Below it, a dynamic plot for \(f_{Y|X}(y|x)\).
Interaction: * Conditioning: The user drags a vertical line along the x-axis (\(x_0\)). The cross-section is extracted, normalized, and plotted in the lower window as \(f_{Y|X}(y|x_0)\). * Independence Test: The user presses an “Independence Check” button. The tool visually multiplies \(f_X(x)\) and \(f_Y(y)\) and overlays the resulting surface on the joint heatmap. If they match, independence is verified. The tool explicitly flags non-Cartesian supports (e.g., \(0 < x < y < 1\)) with a red warning: “Dependent due to support constraints.”
Content: Finding the joint pdf of \((U,V) = (g_1(X,Y), g_2(X,Y))\).
Content: If the mapping is not one-to-one (e.g., \(U = X/Y\)), partition the space or introduce an auxiliary variable (e.g., \(V = Y\)), find the joint pdf of \((U,V)\), and then integrate out \(V\) to find the marginal of \(U\).
Interactive Resource: The Deformation Grid Engine
Design: A canvas displaying a uniform grid in the \((X,Y)\) plane and a second canvas for the \((U,V)\) plane.
Interaction: The user defines a transformation (e.g., \(U = X+Y, V = X-Y\)). The grid in the \((X,Y)\) plane warps into a parallelogram grid in the \((U,V)\) plane.
Rigor Check: Hovering over a small area element \(dx \times dy\) in the \((X,Y)\) plane highlights the corresponding area \(du \times dv\) in the \((U,V)\) plane. The tool calculates the ratio of the areas, demonstrating that it exactly equals the absolute value of the Jacobian determinant \(|J|\). This visualizes why the Jacobian is needed to preserve probability mass.
Worked Example: Sum and Difference of Independent Normals
Scenario: \(X, Y \sim N(0,1)\) independent. Let \(U = X+Y, V = X-Y\).
Interactive Element: The engine walks through the inverse (\(X = (U+V)/2, Y = (U-V)/2\)), computes the Jacobian (\(J = -1/2\), \(|J| = 1/2\)), and shows the algebraic factorization proving \(U\) and \(V\) are independent \(N(0,2)\) variables.
Content: Modeling situations where the parameters of a distribution are themselves random variables (e.g., \(X|Y \sim \text{Poisson}(Y)\) and \(Y \sim \text{Gamma}(\alpha, \beta)\)).
Content: The marginal distribution of \(X\) in a hierarchical model is a mixture: \(f_X(x) = \int f_{X|Y}(x|y) f_Y(y) dy\).
Content: The tower property of conditional expectation, \(E[X] = E[E[X|Y]]\), and the conditional variance identity (Eve’s law): \(\text{Var}(X) = E[\text{Var}(X|Y)] + \text{Var}(E[X|Y])\).
Interactive Resource: The Variance Decompounder
Design: A flowchart visualization of a hierarchical model (e.g., \(Y \to X|Y\)). A dashboard showing \(E[X]\) and \(\text{Var}(X)\).
Interaction: The user adjusts the parameters of the prior \(f_Y(y)\).
Rigorous Focus: A dynamic bar chart visualizing Eve’s law. The total variance bar is split into two stacked segments: “Explained Variance” \([\text{Var}(E[X|Y])]\) and “Unexplained Variance” \([E[\text{Var}(X|Y)]]\). As the user increases the variance of \(Y\), the “Explained” segment grows, showing how the uncertainty in the parameter propagates to the total uncertainty in \(X\).
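For the Poisson–Gamma hierarchy mentioned above, Eve’s law gives \(\text{Var}(X) = E[Y] + \text{Var}(Y) = \alpha\beta + \alpha\beta^2\), which a simulation sketch can verify (the Poisson sampler is Knuth’s classic multiplication method; parameters are illustrative):

```python
import math
import random

rng = random.Random(3)
alpha, beta = 2.0, 1.5
n = 100_000

def poisson(lam, rng):
    """Knuth's Poisson sampler: multiply uniforms until dropping below e^{-lam}."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

# Hierarchy: Y ~ Gamma(alpha, beta) (scale parameterization), X | Y ~ Poisson(Y)
xs = [poisson(rng.gammavariate(alpha, beta), rng) for _ in range(n)]
mean_x = sum(xs) / n
var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)

# Eve's law: Var(X) = E[Var(X|Y)] + Var(E[X|Y]) = αβ + αβ²
theory_mean = alpha * beta
theory_var = alpha * beta + alpha * beta**2
```

The extra \(\alpha\beta^2\) term is exactly the “Explained Variance” segment of the bar chart: overdispersion inherited from the random rate.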
Content: Covariance: \(\text{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[XY] - E[X]E[Y]\). Correlation: \(\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}\). Rigorous Focus: \(\text{Cov}(X,Y) = 0 \not\implies\) Independence (except for joint Normals).
Content: \(\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y) + 2ab\text{Cov}(X,Y)\). Extension to sums of \(n\) variables: \(\text{Var}(\sum X_i) = \sum \text{Var}(X_i) + 2\sum\sum_{i<j} \text{Cov}(X_i, X_j)\).
Interactive Resource: The Scatterplot & Correlation Shifter
Design: A scatterplot of \((X,Y)\) data and a slider for \(\rho\) (constraining the joint distribution to be, for example, Bivariate Normal).
Interaction: As \(\rho\) varies from -1 to 1, the point cloud morphs from a negative slope line to a circle to a positive slope line.
Rigor Check: A “Non-linear dependence” button generates a dataset where \(Y = X^2\) (with \(X\) symmetric around 0). The tool calculates \(\text{Cov}(X,Y) = 0\) and \(\rho = 0\), but visually highlights the strong parabolic relationship, driving home the point that correlation only measures linear dependence.
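The “Non-linear dependence” dataset can be generated in a few lines; with \(X\) symmetric about 0, \(\text{Cov}(X, X^2) = E[X^3] - E[X]E[X^2] = 0\) despite perfect functional dependence (seed and sample size are arbitrary):

```python
import random

rng = random.Random(0)
n = 100_000
xs = [rng.uniform(-1, 1) for _ in range(n)]   # symmetric about 0
ys = [x**2 for x in xs]                        # Y is a deterministic function of X

mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
```

The sample covariance is numerically zero, yet \(Y\) is completely determined by \(X\): correlation measures only linear dependence.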
Content: Notation for \(\mathbf{X} = (X_1, \dots, X_n)^T\). The mean vector \(\mathbf{\mu}\) and the covariance matrix \(\Sigma\) where \(\Sigma_{ij} = \text{Cov}(X_i, X_j)\).
Content: \(\mathbf{X} \sim N_n(\mathbf{\mu}, \Sigma)\). The joint pdf using the inverse of the covariance matrix (precision matrix).
Rigorous Focus: Affine transformations: If \(\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}\), then \(\mathbf{Y} \sim N(\mathbf{A}\mathbf{\mu} + \mathbf{b}, \mathbf{A}\Sigma\mathbf{A}^T)\).
Interactive Resource: The Covariance Ellipse Constructor
Design: A 2D plane drawing confidence ellipses for a Bivariate Normal. Input fields for the \(2 \times 2\) covariance matrix \(\Sigma\).
Interaction: The user modifies the variances (\(\sigma_1^2, \sigma_2^2\)) and the covariance (\(\sigma_{12}\)). The ellipse rotates and stretches in real-time.
Rigor Check: The tool computes the eigenvalues and eigenvectors of \(\Sigma\) and overlays them as principal axes on the ellipse, demonstrating that the eigenvectors dictate the orientation of the joint density, and the square roots of the eigenvalues dictate the spread along those axes.
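For a \(2 \times 2\) covariance matrix the eigendecomposition behind the ellipse has a closed form, which a backend sketch can use directly (function name and the example matrix are illustrative):

```python
import math

def eig2_sym(a, b, c):
    """Eigenvalues and major-axis angle of the symmetric matrix [[a, b], [b, c]].
    Eigenvalues come from the trace/determinant; the major-axis orientation is
    theta = 0.5 * atan2(2b, a - c)."""
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc       # l1 >= l2
    theta = 0.5 * math.atan2(2 * b, a - c)
    return (l1, l2), theta

# Example: Var(X1) = 4, Var(X2) = 1, Cov = 1.5
(lam1, lam2), angle = eig2_sym(4.0, 1.5, 1.0)
major, minor = math.sqrt(lam1), math.sqrt(lam2)  # 1-sd ellipse half-axis lengths
```

The eigenvalues sum to the trace and multiply to the determinant of \(\Sigma\), and their square roots give the spread along the principal axes, exactly what the overlay renders.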
Content: The generalized Chebyshev inequality and the Cauchy-Schwarz Inequality: \([E(XY)]^2 \le E(X^2)E(Y^2)\). Proof that \(|\rho| \le 1\) using Cauchy-Schwarz.
Content: Jensen’s Inequality: For a convex function \(g\), \(E[g(X)] \ge g(E[X])\). For concave \(g\), \(E[g(X)] \le g(E[X])\).
Interactive Resource: Jensen’s Visual Prover
Design: A graphing canvas where the user can select a convex function \(g(x)\) (e.g., \(g(x) = x^2\) or \(g(x) = e^x\)) and a probability distribution for \(X\).
Interaction: The tool plots \(g(x)\). It calculates \(E[X]\) and draws a point \((E[X], g(E[X]))\) on the curve. It then calculates \(E[g(X)]\) and draws a horizontal line at that height. The visual gap between the horizontal line (higher) and the point on the curve (lower) explicitly demonstrates \(E[g(X)] \ge g(E[X])\). The user can drag a slider to morph \(g(x)\) from convex to concave, watching the inequality flip.
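The gap the tool visualizes is easy to estimate numerically; for \(X \sim \text{Exp}(1)\) and the convex \(g(x) = x^2\), Jensen’s gap \(E[X^2] - (E[X])^2\) is just \(\text{Var}(X) = 1\) (seed and sample size are arbitrary):

```python
import random

rng = random.Random(5)
n = 100_000
xs = [rng.expovariate(1.0) for _ in range(n)]   # X ~ Exp(1), E[X] = 1

g = lambda x: x * x                     # convex choice of g
mean_x = sum(xs) / n
e_g_x = sum(g(x) for x in xs) / n       # E[g(X)] = E[X^2] = 2
gap = e_g_x - g(mean_x)                 # Jensen gap; equals Var(X) = 1 here
```

The gap is strictly positive, as Jensen’s Inequality requires for a non-degenerate \(X\) and strictly convex \(g\).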
This module bridges the gap between the probabilistic behavior of individual random variables and the statistical inference of populations based on observed data. It rigorously defines the concept of a random sample (iid random variables), explores the exact distributions of fundamental statistics (like the sample mean and variance) under normality, and introduces the asymptotic tools (convergence theorems and the Delta Method) that allow statisticians to approximate distributions when exact solutions are intractable.
Learning Objectives:
Formalize the concept of a random sample and compute distributions of sums using Moment Generating Functions (MGFs).
Prove and apply the independence of the sample mean \(\bar{X}\) and sample variance \(S^2\) when sampling from a Normal distribution.
Derive the exact sampling distributions of the \(\chi^2\), Student’s \(t\), and Snedecor’s \(F\) statistics.
Derive the marginal and joint distributions of order statistics.
Distinguish between convergence in probability, almost sure convergence, and convergence in distribution.
Apply the Weak Law of Large Numbers (WLLN) and the Central Limit Theorem (CLT).
Use the Delta Method to approximate the variance and distribution of functions of the sample mean.
Content: Definition of a random sample: \(X_1, \dots, X_n\) are independent and identically distributed (iid) random variables with cdf \(F_X(x)\). Definition of a statistic \(T = g(X_1, \dots, X_n)\) and the concept of a sampling distribution.
Content: The MGF of a sum of independent random variables is the product of their MGFs: \(M_{\sum X_i}(t) = \prod_{i=1}^n M_{X_i}(t)\). Rigorous Focus: Using this to prove that sums of independent Normals are Normal, and sums of independent Gamma variables (with the same scale parameter) are Gamma.
Interactive Resource: The Convolution & MGF Multiplier
Design: A split screen. Left: Sliders to choose \(n\) and the parameters of an iid distribution (e.g., Gamma\((\alpha, \beta)\)). Right: A dynamic plot of the MGF of the sum \(M_{\sum X_i}(t)\).
Interaction: The user increments \(n\). The tool visually multiplies the base MGF \(M_X(t)\) by itself \(n\) times. It then identifies the resulting functional form (e.g., \((1-\beta t)^{-n\alpha}\)) and plots the corresponding pdf of the sum below, demonstrating how the distribution shifts and spreads as \(n\) increases.
Content: Definitions: \(\bar{X} = \frac{1}{n}\sum X_i\) and \(S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2\). Rigorous Focus: Theorem: If \(X_1, \dots, X_n\) are iid \(N(\mu, \sigma^2)\), then: (a) \(\bar{X}\) and \(S^2\) are independent; (b) \(\bar{X} \sim N(\mu, \sigma^2/n)\); (c) \((n-1)S^2/\sigma^2 \sim \chi^2_{n-1}\).
Interactive Resource: The Independence Scatter-Proof
Design: A Monte Carlo simulation engine. It draws \(M=500\) samples of size \(n\) from a Normal distribution. For each sample, it calculates \(\bar{X}\) and \(S^2\).
Interaction: A scatterplot of the 500 points \((\bar{X}, S^2)\) is rendered. A “Test Independence” button runs a correlation test on the scatterplot, yielding \(\rho \approx 0\). The user can change the underlying population to an Exponential distribution. The scatterplot immediately shows a strong dependency (funnel shape), proving that the independence of \(\bar{X}\) and \(S^2\) is a unique and critical property of the Normal family.
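The simulation engine’s core is a few lines of Python; this sketch (sample counts, seed, and function names are illustrative) contrasts the Normal and Exponential cases:

```python
import random

def mean_var_pairs(draw, m=4_000, n=20, seed=11):
    """For m samples of size n, record (sample mean, sample variance)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(m):
        xs = [draw(rng) for _ in range(n)]
        xbar = sum(xs) / n
        s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
        pairs.append((xbar, s2))
    return pairs

def corr(pairs):
    """Sample correlation of the (mean, variance) cloud."""
    m = len(pairs)
    mx = sum(p[0] for p in pairs) / m
    my = sum(p[1] for p in pairs) / m
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5

r_normal = corr(mean_var_pairs(lambda r: r.gauss(0.0, 1.0)))   # ≈ 0
r_expo = corr(mean_var_pairs(lambda r: r.expovariate(1.0)))    # clearly positive
```

The near-zero correlation under normality and the strong positive correlation under the Exponential make the theorem’s exclusivity concrete.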
Content:
Definition of \(T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}\).
Definition of \(F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1-1, n_2-1}\).
Rigorous Focus: Deriving the \(t\)-distribution as a ratio of a standard Normal to the square root of an independent Chi-squared divided by its df. The convergence of \(t_{\nu} \to N(0,1)\) as \(\nu \to \infty\).
Interactive Resource: The \(t\) vs. Normal Morphing Slider
Design: A plot showing the standard Normal pdf. A slider controls the degrees of freedom \(\nu\).
Interaction: As \(\nu\) decreases from \(\infty\) down to 1, the \(t\)-distribution curve overlays the Normal, showing its heavier tails and lower peak. The tool dynamically calculates and displays the kurtosis, visually linking the mathematical moment to the shape of the tails.
Content: \(X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}\). The CDF of the \(j\)th order statistic: \(F_{X_{(j)}}(x) = \sum_{k=j}^n \binom{n}{k} [F_X(x)]^k [1-F_X(x)]^{n-k}\). The PDF of the \(j\)th order statistic: \(f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} f_X(x) [F_X(x)]^{j-1} [1-F_X(x)]^{n-j}\).
Interactive Resource: The Order Statistic Sub-Sampler
Design: A canvas with \(n\) slots representing a sample. Next to it, a plot of the underlying pdf \(f_X(x)\).
Interaction: The user selects a specific order statistic \(j\) (e.g., the minimum \(j=1\), or the median \(j=\lceil n/2 \rceil\)). The tool runs 10,000 simulations, extracts the \(j\)th order statistic from each, and builds a histogram. The theoretical pdf \(f_{X_{(j)}}(x)\) is overlaid. The user can dynamically increase \(n\), watching the distribution of the sample minimum shift sharply left (for right-skewed distributions) and the distribution of the median tighten around the population median.
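For Uniform(0,1) samples the theoretical overlay is exact: \(X_{(j)} \sim \text{Beta}(j, n-j+1)\), so for \(n=10\) the minimum has mean \(1/11\) and the 6th order statistic has mean \(6/11\). A simulation sketch (seed and counts are arbitrary):

```python
import random

rng = random.Random(9)
n, sims = 10, 50_000

# Minimum X_(1): Beta(1, n), mean 1/(n+1)
mins = [min(rng.random() for _ in range(n)) for _ in range(sims)]
# Index n//2 = 5 picks the 6th order statistic X_(6): Beta(6, 5), mean 6/11
meds = [sorted(rng.random() for _ in range(n))[n // 2] for _ in range(sims)]

mean_min = sum(mins) / sims
mean_med = sum(meds) / sims
```

Increasing \(n\) drives `mean_min` toward 0, the behavior the sub-sampler animates.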
Content: The joint pdf of \(X_{(i)}\) and \(X_{(j)}\) for \(i < j\).
Content: \(X_n \xrightarrow{P} X\) if \(\lim_{n \to \infty} P(|X_n - X| \ge \epsilon) = 0\) for all \(\epsilon > 0\).
Rigorous Focus: The Weak Law of Large Numbers (WLLN): \(\bar{X}_n \xrightarrow{P} \mu\).
Content: \(X_n \xrightarrow{a.s.} X\) if \(P(\lim_{n \to \infty} X_n = X) = 1\).
Rigorous Focus: The Strong Law of Large Numbers (SLLN). The subtle but critical difference between the WLLN (probability of a deviation at a specific \(n\)) and SLLN (probability of an eventual permanent deviation).
Interactive Resource: The Limit Distinction Visualizer
Design: A time-series plot of \(\bar{X}_n\) vs. \(n\) for multiple independent sample paths. Two horizontal lines represent \(\mu \pm \epsilon\).
Interaction: * WLLN Mode: The tool highlights that at any given \(n\), the fraction of paths outside the bounds approaches 0. * SLLN Mode: The user observes that while a path might occasionally dip outside the bounds, eventually every single path enters the bounds and stays there forever. The tool includes a “Pathological Counterexample” (e.g., independent \(X_n\) with \(X_n = n\) with prob \(1/n\), \(0\) with prob \(1-1/n\)) which converges in probability to 0 but, by the second Borel-Cantelli lemma, fails to converge almost surely, visually demonstrating the divergence.
Content: \(X_n \xrightarrow{d} X\) if \(\lim_{n \to \infty} F_{X_n}(x) = F_X(x)\) for all \(x\) where \(F_X\) is continuous.
Rigorous Focus: The Central Limit Theorem (CLT): \(\sqrt{n}(\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0,1)\).
Interactive Resource: The Universal CLT Engine
Design: A “Choose Your Own Adventure” distribution selector (e.g., highly skewed Gamma, discrete Bernoulli, heavy-tailed Cauchy).
Interaction: The user selects a distribution. A slider increases \(n\). The tool plots the exact sampling distribution of \(\bar{X}_n\) (via fast numerical convolution or Monte Carlo) and overlays the Normal approximation from the CLT.
Rigor Check: If the user selects the Cauchy distribution (which has infinite variance), the CLT overlay fails to match the sampling distribution regardless of \(n\), proving that the CLT assumptions are not merely decorative.
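The engine's two regimes can be sketched offline with a small Monte Carlo (Gamma and Cauchy are illustrative cases, not mandated by the tool): standardized Gamma means approach \(N(0,1)\), while Cauchy sample means never tighten because the mean of i.i.d. standard Cauchy draws is again standard Cauchy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Skewed case: Gamma(shape=2, scale=1), mean 2, variance 2.
reps, n = 20000, 200
xbar = rng.gamma(2.0, 1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - 2.0) / np.sqrt(2.0)  # CLT standardization

# Heavy-tailed case: the IQR of a standard Cauchy is exactly 2,
# and the sample mean's IQR stays near 2 no matter how large n gets.
def mean_iqr(n, reps=5000):
    means = rng.standard_cauchy((reps, n)).mean(axis=1)
    q1, q3 = np.percentile(means, [25, 75])
    return q3 - q1

iqr_small, iqr_large = mean_iqr(10), mean_iqr(1000)
print(z.mean(), z.std(), iqr_small, iqr_large)
```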
Content: If \(\sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2)\), then \(\sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{d} N(0, \sigma^2 [g'(\theta)]^2)\), provided \(g'(\theta) \neq 0\). Rigorous Focus: The Second-Order Delta Method for when \(g'(\theta) = 0\).
Interactive Resource: The Tangent Line Transformer
Design: A plot of a function \(g(x)\) and a point \((\theta, g(\theta))\). A dynamic normal curve representing the distribution of \(X_n\) is centered at \(\theta\).
Interaction: The user chooses \(g(x)\) (e.g., \(g(x) = \sqrt{x}\)). As \(n\) increases, the distribution of \(X_n\) tightens around \(\theta\). The tool draws the tangent line \(g(x) \approx g(\theta) + g'(\theta)(x-\theta)\) and “reflects” the distribution of \(X_n\) off this tangent line to project the distribution of \(g(X_n)\) below. This visualizes the linearization at the heart of the Delta Method’s proof.
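A quick numeric check of the first-order linearization (Exponential data and \(g(x)=\sqrt{x}\) are illustrative choices): the simulated variance of \(g(\bar{X})\) should match the Delta Method prediction \([g'(\theta)]^2\sigma^2/n\).

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 2.0, 400, 20000

# X_i ~ Exponential with mean theta (so Var = theta^2); g(x) = sqrt(x).
xbar = rng.exponential(theta, size=(reps, n)).mean(axis=1)

# First-order Delta Method: Var(g(Xbar)) ~ [g'(theta)]^2 * theta^2 / n,
# with g'(t) = 1 / (2 sqrt(t)).
pred_var = (1.0 / (2.0 * np.sqrt(theta))) ** 2 * theta ** 2 / n
ratio = np.sqrt(xbar).var() / pred_var
print(ratio)
```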
Content: Direct Method (Probability Integral Transform from Module 2). Indirect Methods (e.g., using polar coordinates for Normals: Box-Muller transform).
Content: Generating from a target pdf \(f(x)\) using a proposal pdf \(g(x)\) and a constant \(M\) such that \(f(x) \le M g(x)\). Algorithm: Generate \(Y \sim g\), generate \(U \sim \text{Uniform}(0,1)\). If \(U \le f(Y)/(M g(Y))\), accept \(X=Y\); else repeat.
Interactive Resource: The Accept/Reject Dartboard
Design: A 2D plot showing the target curve \(f(x)\) and the envelope curve \(M g(x)\).
Interaction: The user clicks “Throw Dart”. An \(x\)-coordinate is drawn from the proposal \(g(x)\), and a \(y\)-coordinate \(U \cdot M g(x)\) is generated, plotting a point on the screen. If the point falls under \(f(x)\), it turns Green (Accept); if between \(f(x)\) and \(M g(x)\), it turns Red (Reject). The accepted \(x\)-values fall into a histogram below, gradually building the target distribution \(f(x)\).
Rigor Check: The user can adjust \(M\). If they make \(M\) too large, the acceptance rate plummets (mostly red darts), demonstrating the computational inefficiency of a poor envelope.
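The dartboard's accept/reject logic reduces to a few lines; this sketch assumes a Beta(2,2) target with a Uniform(0,1) proposal, so the tight envelope constant is \(M = \max f = 1.5\) and the theoretical acceptance rate is \(1/M = 2/3\).

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):                     # target: Beta(2,2) density on [0,1]
    return 6.0 * x * (1.0 - x)

M = 1.5                       # envelope: max f = f(0.5) = 1.5, with g = 1 on [0,1]
N = 100000

y = rng.random(N)             # proposals Y ~ g = Uniform(0,1)
u = rng.random(N)             # U ~ Uniform(0,1)
accept = u <= f(y) / M        # accept-reject rule: U <= f(Y) / (M g(Y))
samples = y[accept]

accept_rate = accept.mean()   # should be near 1/M = 2/3
sample_mean = samples.mean()  # Beta(2,2) mean = 1/2
print(accept_rate, sample_mean)
```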
JavaScript’s Math.random() is insufficient for rigorous statistical simulation (it is often implemented as a 32-bit Xorshift). Bundle a robust PRNG such as the Mersenne Twister (MT19937) in WebAssembly to ensure high-quality random numbers for the Accept/Reject and sampling simulators. For the Delta Method tools, the user enters a transformation \(g(x)\) (e.g., \(1/x\), \(\sin(x)\)) and the tool automatically calculates and plots the derivative and resulting variance.

In practice, statisticians are faced with large datasets but only a few parameters of interest. This module addresses the fundamental question: how can we reduce the data to a smaller set of summary statistics without losing information about the parameter? It formalizes three distinct philosophical and mathematical approaches to data reduction: the Sufficiency Principle, the Likelihood Principle, and the Equivariance Principle. Mastery of these principles is crucial for evaluating the optimality of estimators and tests in subsequent modules.
Learning Objectives:
Content: Given a sample \(X_1, \dots, X_n\) from \(f(x|\theta)\), we wish to find a statistic \(T(X_1, \dots, X_n)\) that summarizes the data. The goal is to achieve dimension reduction while retaining all “information” about \(\theta\).
Interactive Resource: The Information Loss Visualizer
Content:
Interactive Resource: The Factorization Factory
Content: A sufficient statistic \(T\) is minimal sufficient if it is a function of every other sufficient statistic (maximum data reduction). Rigorous Focus: The Ratio Method: \(T(\mathbf{x})\) is minimal sufficient if \(\frac{f(\mathbf{x}|\theta)}{f(\mathbf{y}|\theta)}\) is constant in \(\theta\) iff \(T(\mathbf{x}) = T(\mathbf{y})\).
Interactive Resource: The Ratio Constancy Checker
Design: A dual input for two sample vectors \(\mathbf{x}\) and \(\mathbf{y}\). A dynamic plot of the function \(R(\theta) = \frac{f(\mathbf{x}|\theta)}{f(\mathbf{y}|\theta)}\).
Interaction: The user alters \(\mathbf{x}\) and \(\mathbf{y}\). If \(T(\mathbf{x}) = T(\mathbf{y})\) (e.g., they have the same sum), the plot \(R(\theta)\) renders as a flat horizontal line (constant in \(\theta\)). If \(T(\mathbf{x}) \neq T(\mathbf{y})\), the plot shows a curve varying with \(\theta\). This visually defines the partitioning of the sample space induced by a minimal sufficient statistic.
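For \(N(\theta,1)\) samples the ratio \(R(\theta)\) can be checked numerically on the log scale, where \(\log R(\theta) = \theta(\sum x_i - \sum y_i) - \tfrac{1}{2}(\sum x_i^2 - \sum y_i^2)\); the sample vectors below are illustrative.

```python
import numpy as np

# Log-likelihood ratio log R(theta) = log f(x|theta) - log f(y|theta) for N(theta, 1).
def log_ratio(x, y, thetas):
    return thetas * (x.sum() - y.sum()) - (np.square(x).sum() - np.square(y).sum()) / 2.0

thetas = np.linspace(-3, 3, 61)

x = np.array([1.0, 2.0, 3.0])
y_same = np.array([0.5, 2.5, 3.0])  # same sum => same minimal sufficient T = sum
y_diff = np.array([0.0, 1.0, 2.0])  # different sum

flat_range = np.ptp(log_ratio(x, y_same, thetas))  # ~ 0: constant in theta
vary_range = np.ptp(log_ratio(x, y_diff, thetas))  # > 0: varies with theta
print(flat_range, vary_range)
```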
Content: A statistic \(S(\mathbf{X})\) is ancillary for \(\theta\) if its distribution does not depend on \(\theta\). Examples:
Content:
Completeness: A statistic \(T\) is complete if \(E_\theta[g(T)] = 0\) for all \(\theta\) implies \(P(g(T)=0) = 1\) for all \(\theta\). (No non-trivial function of \(T\) has mean zero).
Basu’s Theorem: If \(T\) is a complete sufficient statistic for \(\theta\), then \(T\) is independent of every ancillary statistic.
Interactive Resource: Basu’s Theorem Correlation Engine
Design: A simulator drawing samples from \(N(\mu, 1)\). A scatterplot mapping \(T(\mathbf{X}) = \bar{X}\) vs. \(S(\mathbf{X}) = S^2\).
Interaction: The engine runs thousands of iterations, plotting \((\bar{X}, S^2)\). The empirical correlation \(\rho\) is displayed, converging to 0.
Rigorous Focus: A “Proof Checker” module walks through the logic: (1) Prove \(\bar{X}\) is sufficient for \(\mu\). (2) Prove \(\bar{X}\) is complete for \(\mu\). (3) Prove \(S^2\) is ancillary for \(\mu\). (4) Conclude independence via Basu. The user can change the known parameter (e.g., \(N(0, \sigma^2)\)) to see where the theorem breaks down (\(S^2\) is no longer ancillary for \(\sigma^2\), correlation no longer 0).
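The correlation engine's core computation is a short simulation (the parameter values are illustrative): for \(N(\mu, 1)\) data, \(\bar{X}\) is complete sufficient and \(S^2\) is ancillary, so Basu's Theorem predicts an empirical correlation near 0.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, n, reps = 5.0, 10, 20000

x = rng.normal(mu, 1.0, size=(reps, n))
xbar = x.mean(axis=1)             # complete sufficient statistic for mu
s2 = x.var(axis=1, ddof=1)        # ancillary for mu (location-invariant)

corr = np.corrcoef(xbar, s2)[0, 1]  # Basu: independence => correlation ~ 0
print(corr)
```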
Content: Definition: \(L(\theta|\mathbf{x}) = f(\mathbf{x}|\theta)\). The Likelihood Principle states that all experimental information about \(\theta\) is contained in the likelihood function. Two likelihoods proportional in \(\theta\) yield the same inference.
Content: Birnbaum’s Theorem: The Likelihood Principle is mathematically equivalent to the conjunction of the Sufficiency Principle and the Conditionality Principle. Rigorous Focus: The Stopping Rule Paradox.
Interactive Resource: The Stopping Rule Paradox Simulator
Design: A coin-flipping simulator with two distinct experimental modes. * Mode A (Binomial): Flip exactly \(n=12\) times. Observe \(x=9\) heads. * Mode B (Negative Binomial): Flip until \(r=3\) tails are observed. It takes \(n=12\) flips (\(x=9\) heads).
Interaction: The user runs both experiments. Both yield identical Likelihood Functions: \(L(p|x=9, n=12) \propto p^9(1-p)^3\).
Rigorous Focus: The tool then calculates the Frequentist p-value for the hypothesis \(H_0: p=0.5\) under both models. * Mode A p-value: \(P(X \ge 9 \mid n=12, p=0.5) \approx 0.073\). * Mode B p-value: \(P(X \ge 9 \mid r=3, p=0.5) \approx 0.033\).
The visualization highlights that despite identical data and identical likelihoods, the Frequentist inference changes based on the intent of the experimenter (the stopping rule). This powerfully demonstrates why adherence to the Likelihood Principle fundamentally conflicts with standard Frequentist methodology.
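Both p-values can be verified exactly with scipy.stats; this sketch follows the classical Lindley–Phillips form of the negative-binomial design (stop after \(r=3\) tails), which produces the same likelihood \(p^9(1-p)^3\) as the binomial design.

```python
from scipy.stats import binom, nbinom

# Mode A: Binomial design, n = 12 flips, observe x = 9 heads.
p_binom = binom.sf(8, 12, 0.5)   # P(X >= 9 | n=12, p=0.5)

# Mode B: Negative Binomial design, flip until r = 3 tails; x = 9 heads observed.
# scipy's nbinom counts "failures" (here, heads) before the r-th success (tail).
p_nbinom = nbinom.sf(8, 3, 0.5)  # P(X >= 9 | r=3, p=0.5)

print(round(p_binom, 3), round(p_nbinom, 3))
```

Same data, same likelihood, yet the tail probabilities (and hence the frequentist decisions at \(\alpha = 0.05\)) differ.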
Content: If a parameter is transformed by a function \(\eta = g(\theta)\), and \(\hat{\theta}\) is a “good” estimator for \(\theta\), then a “good” estimator for \(\eta\) should be \(\hat{\eta} = g(\hat{\theta})\).
Rigorous Focus: The distinction between Equivariance (for transformations of parameters, e.g., \(\mu \to e^\mu\)) and Invariance (for transformations of the sample space that leave the parameter unchanged, e.g., location shifts).
Interactive Resource: The Parameter Transformation Mapper
Design: A number line representing \(\theta\), and a secondary curve representing \(\eta = g(\theta)\).
Interaction: The user selects a distribution (e.g., Exponential(\(\theta\))) and an estimator \(\hat{\theta} = \bar{X}\). They define a transformation (e.g., \(\eta = 1/\theta\), the rate parameter). The tool plots the sampling distribution of \(\hat{\theta}\). It then applies the transformation to the random variable itself, deriving and plotting the distribution of \(g(\hat{\theta}) = 1/\bar{X}\) (an inverse-gamma distribution, since \(\bar{X}\) is gamma-distributed). It demonstrates that the equivariant estimator is naturally induced by the transformation.
Advanced Symbolic Math Engine: Module 6 is intensely algebraic. The Factorization Factory and the Ratio Constancy Checker require a backend (like SymPy via Pyodide or a custom WASM module) capable of symbolic factorization, simplification of multivariate expressions, and isolation of variables. The engine must differentiate between parameters (\(\theta\)) and random variables (\(x_i\)) syntactically.
Conditional Distribution Calculator: To visually prove sufficiency without relying solely on the Factorization Theorem, the tool must compute \(P(\mathbf{X}=\mathbf{x} | T(\mathbf{X})=t)\). For discrete distributions, this requires dynamic enumeration of the sample space partitioned by \(T\). For continuous distributions, numerical integration over the hyper-surface where \(T(\mathbf{x})=t\) is required (a computationally expensive but necessary feature using Markov Chain Monte Carlo sampling constrained to the surface).
Likelihood Rendering Engine: For the Stopping Rule Paradox, the tool must render likelihood functions efficiently. Instead of plotting points, use WebGL to draw smooth curves based on evaluating the symbolic likelihood expression over a grid of \(\theta\) values.
Hypothesis Testing Calculator API: To calculate the p-values in the Stopping Rule Paradox, integrate a statistical library (like jStat) capable of computing CDFs for Binomial and Negative Binomial distributions with high precision, as the philosophical point relies on the exact discrepancy between the two p-values.
Data Generation for Basu’s Theorem: Ensure the random number generator uses independent streams for generating the normal variables. The scatterplot visualization for Basu’s theorem should use alpha-blending (opacity) so that the independence (uniform cloud) vs. dependence (clustering) is visually obvious to the student even at high sample counts.
This module transitions from the probabilistic properties of data to the core of statistical inference: estimating unknown population parameters. It covers the primary methodologies for deriving point estimators (Method of Moments, Maximum Likelihood, Bayes, and the EM Algorithm) and the rigorous mathematical criteria for evaluating their quality (MSE, Bias, Variance, Efficiency, and Decision-Theoretic Optimality).
Learning Objectives:
Content: Equating population moments (\(\mu_k' = E[X^k]\)) to sample moments (\(M_k = \frac{1}{n}\sum X_i^k\)) and solving the resulting system of equations for \(\theta\). Rigorous Focus: MMEs are often simple to compute but are not necessarily optimal; they may not even be functions of sufficient statistics.
Interactive Resource: The Moment Matching Engine
Content: Defining the likelihood function \(L(\theta|\mathbf{x})\) and the log-likelihood \(\ell(\theta|\mathbf{x})\). Finding \(\hat{\theta}\) that maximizes \(\ell\).
Rigorous Focus:
Interactive Resource: The Log-Likelihood Landscape Explorer
Design: A 3D surface plot or 2D contour plot of the log-likelihood function for a two-parameter family (e.g., Normal or Weibull).
Interaction: The user generates a sample. The likelihood surface renders instantly. The user can click on the surface to place a “climber” and manually navigate toward the peak. Alternatively, a “Gradient Ascent” button animates the numerical optimization algorithm (Newton-Raphson) traversing the surface to the MLE.
Rigor Check: An “Invariance Tester” allows the user to type a transformation \(g(\theta)\) (e.g., \(\sqrt{\theta}\)). The tool calculates \(g(\hat{\theta})\) and plots the transformed likelihood, proving the peak aligns.
Content: The posterior distribution \(\pi(\theta|\mathbf{x}) \propto f(\mathbf{x}|\theta)\pi(\theta)\). Rigorous Focus: Estimators are derived by minimizing the posterior expected loss.
Interactive Resource: The Prior-Posterior Dynamics Simulator
Content: Finding MLEs when data is incomplete or models have latent variables.
Interactive Resource: The EM Ascent Visualizer
Worked Example: Mixture of Normals
Scenario: \(X_1, \dots, X_n \sim p N(\mu_1, \sigma_1^2) + (1-p) N(\mu_2, \sigma_2^2)\).
Interactive Element: Students step through the E-step (calculating the probability each point belongs to cluster 1 vs cluster 2) and the M-step (updating means and variances based on these soft assignments), watching the mixture model fit converge.
Content: \(MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]\).
Rigorous Focus: The Bias-Variance Decomposition: \(MSE(\hat{\theta}) = Var(\hat{\theta}) + [Bias(\hat{\theta})]^2\). The concept of the Bias-Variance Tradeoff.
Interactive Resource: The Dartboard Tradeoff Dashboard
Design: Three interactive dartboards representing different estimators of \(\theta\): 1. \(\hat{\theta}_1\): Unbiased but high variance (e.g., the sample variance with denominator \(n-1\)). 2. \(\hat{\theta}_2\): Biased but low variance (e.g., a shrinkage estimator). 3. \(\hat{\theta}_3\): Optimal MSE.
Interaction: The user simulates thousands of estimates. The dartboards populate. A dynamic bar chart displays \(MSE = Var + Bias^2\). The user adjusts a shrinkage parameter \(\lambda\) for \(\hat{\theta}_2\). As \(\lambda\) increases, the Bias bar grows, but the Variance bar shrinks. The MSE curve is plotted against \(\lambda\), allowing the student to visually find the \(\lambda\) that minimizes overall MSE, demonstrating that unbiased estimators are not always “best.”
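A sketch of the shrinkage comparison for a Normal mean (shrinking \(\bar{X}\) toward 0 by a factor \(c\); the parameter values are illustrative): the biased shrinkage estimator beats the unbiased one in MSE.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 1.0, 2.0, 10, 40000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

def mse(c):
    est = c * xbar                      # shrink toward 0 by factor c
    return np.mean((est - mu) ** 2)     # empirical MSE = Var + Bias^2

mse_unbiased = mse(1.0)                 # ~ sigma^2/n = 0.4
c_opt = mu**2 / (mu**2 + sigma**2 / n)  # analytic MSE-optimal shrinkage factor
mse_shrunk = mse(c_opt)                 # ~ 0.286 < 0.4
print(mse_unbiased, mse_shrunk)
```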
Content: \(\hat{\theta}\) is the Uniformly Minimum Variance Unbiased Estimator (UMVUE) if \(E[\hat{\theta}] = \theta\) and \(Var(\hat{\theta}) \le Var(\hat{\theta}^*)\) for all \(\theta\) and any other unbiased \(\hat{\theta}^*\).
Content:
Fisher Information: \(I(\theta) = -E\left[ \frac{\partial^2}{\partial \theta^2} \log f(X|\theta) \right]\).
The CRLB: Under regularity conditions, \(Var(\hat{\theta}) \ge \frac{1}{n I(\theta)}\) for any unbiased \(\hat{\theta}\). Rigorous Focus: Verifying the regularity conditions (the support cannot depend on \(\theta\); hence Uniform\((0,\theta)\) falls outside the scope of the CRLB).
Interactive Resource: The Efficiency Bounding Engine
Design: A dynamic plot of the variance of an estimator vs. \(\theta\). A dashed line representing the CRLB \(\frac{1}{nI(\theta)}\) is drawn.
Interaction: The user selects an unbiased estimator. Its variance curve is plotted. If the curve touches the CRLB, the tool flashes “Efficient!” (e.g., \(\bar{X}\) for Normal mean). If it sits above (e.g., sample median for Normal mean), the tool calculates asymptotic relative efficiency.
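A simulation sketch of the bounding engine for the Normal mean (parameters illustrative): \(I(\mu) = 1/\sigma^2\), so the CRLB is \(\sigma^2/n\); \(\bar{X}\) attains it, while the sample median sits above it with asymptotic variance ratio \(\pi/2 \approx 1.57\).

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n, reps = 0.0, 1.0, 100, 20000

x = rng.normal(mu, sigma, size=(reps, n))
crlb = sigma**2 / n                               # 1 / (n * I(mu)), I(mu) = 1/sigma^2

ratio_mean = x.mean(axis=1).var() / crlb          # ~ 1.0: efficient
ratio_median = np.median(x, axis=1).var() / crlb  # ~ pi/2: inefficient
print(ratio_mean, ratio_median)
```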
Content:
Rao-Blackwell Theorem: If \(T\) is a sufficient statistic for \(\theta\) and \(W\) is any unbiased estimator, then \(\phi(T) = E[W|T]\) is an unbiased estimator with \(Var(\phi(T)) \le Var(W)\).
Lehmann-Scheffé Theorem: If \(T\) is a complete sufficient statistic and \(\phi(T)\) is unbiased for \(\theta\), then \(\phi(T)\) is the unique UMVUE.
Interactive Resource: The Variance Reducer (Rao-Blackwell Machine)
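A sketch of Rao-Blackwellization for Poisson data (an illustrative choice): conditioning the crude unbiased estimator \(W = X_1\) on the sufficient statistic \(\sum X_i\) gives \(\phi(T) = E[X_1 \mid \sum X_i] = \bar{X}\), cutting the variance by roughly a factor of \(n\).

```python
import numpy as np

rng = np.random.default_rng(7)
lam, n, reps = 3.0, 10, 20000

x = rng.poisson(lam, size=(reps, n))
w = x[:, 0].astype(float)   # crude unbiased estimator W = X_1, Var = lambda
phi = x.mean(axis=1)        # Rao-Blackwellized: E[X_1 | sum X] = Xbar, Var = lambda/n

print(w.var(), phi.var())   # variance drops roughly by a factor of n
```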
Content: Moving beyond MSE to general loss functions \(L(\theta, a)\).
Risk Function: \(R(\theta, \hat{\theta}) = E_\theta[L(\theta, \hat{\theta})]\).
Minimax Principle: Choose \(\hat{\theta}\) to minimize \(\max_\theta R(\theta, \hat{\theta})\).
Bayes Risk: \(r(\pi, \hat{\theta}) = E^\pi[R(\theta, \hat{\theta})]\).
Interactive Resource: The Risk Profile Arena
Design: A 2D plot where the x-axis is the parameter \(\theta\) and the y-axis is the Risk \(R(\theta, \hat{\theta})\).
Interaction: The user selects two estimators (e.g., MLE vs. a Bayes estimator under SEL). The risk profiles for both are plotted as functions of \(\theta\). * Minimax Mode: The tool highlights the maximum risk (the peak) of both curves. The estimator with the lower peak is declared minimax. * Bayes Mode: The user shades the area under the risk curve weighted by a prior density \(\pi(\theta)\). The estimator with the smaller shaded area (Bayes risk) is the preferred Bayes rule. This visualizes how a prior can “forgive” high risk in low-probability regions of \(\theta\).
Numerical Optimization Backend: maximizing log-likelihoods requires a numerical optimizer (e.g., scipy.optimize.minimize with BFGS or Nelder-Mead). For the EM algorithm, the system should allow users to define \(Q\)-functions symbolically, which the backend then differentiates and solves.

This module rigorously formalizes the statistical procedure of making decisions about population parameters based on sample data. It transitions from intuitive concepts of “surprise” to mathematically optimal testing frameworks. The module covers the primary methods for constructing tests (Likelihood Ratio, Bayesian, Union-Intersection), the rigorous evaluation of test performance (error probabilities, power), and the mathematical proofs of optimality under the Neyman-Pearson and Decision-Theoretic frameworks.
Learning Objectives:
Formulate null and alternative hypotheses, and define Type I and Type II errors.
Derive Likelihood Ratio Tests (LRTs) for simple and composite hypotheses.
Construct Bayesian tests using posterior odds and Bayes factors.
Build tests for composite hypotheses using the Union-Intersection and Intersection-Union principles.
Calculate and interpret the power function \(\beta(\theta)\), distinguishing between test size and level.
Prove test optimality using the Neyman-Pearson Lemma for simple hypotheses and the Karlin-Rubin Theorem for composite hypotheses.
Define p-values as random variables and relate them to frequentist decision rules.
Evaluate tests within a decision-theoretic framework using risk functions and minimax/Bayes criteria.
Content: The most versatile frequentist method. Define the test statistic: \[\lambda(\mathbf{x}) = \frac{\sup_{\theta \in \Theta_0} L(\theta|\mathbf{x})}{\sup_{\theta \in \Theta} L(\theta|\mathbf{x})}\] The LRT rejects \(H_0\) for small values of \(\lambda(\mathbf{x})\). Rigorous Focus: Deriving the asymptotic distribution of \(-2\log\lambda(\mathbf{X}) \xrightarrow{d} \chi^2_{\nu}\) under regularity conditions (where \(\nu = \dim(\Theta) - \dim(\Theta_0)\)).
Interactive Resource: The LRT Surface Slicer
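The asymptotic \(\chi^2\) claim can be checked by simulation; for \(H_0: \mu = 0\) with \(N(\mu, 1)\) data the LRT statistic simplifies to \(-2\log\lambda = n\bar{X}^2\) (an illustrative case where the null distribution is exactly \(\chi^2_1\)).

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 50, 20000

# Under H0: mu = 0 with sigma = 1 known, -2 log lambda = n * Xbar^2 ~ chi2_1.
xbar = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
stat = n * xbar**2

# Rejection rate at the chi2_1 critical value 3.841 should be ~ 0.05.
reject_rate = (stat > 3.841).mean()
print(reject_rate)
```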
Content: Formulating tests using posterior probabilities. The posterior odds ratio: \[\frac{P(\Theta_0|\mathbf{x})}{P(\Theta_1|\mathbf{x})} = \left(\frac{P(\Theta_0)}{P(\Theta_1)}\right) \times \left(\frac{\int_{\Theta_0} f(\mathbf{x}|\theta)\pi(\theta)d\theta}{\int_{\Theta_1} f(\mathbf{x}|\theta)\pi(\theta)d\theta}\right).\]
Rigorous Focus: The Bayes Factor (the ratio of integrated likelihoods), which replaces the frequentist LRT and is immune to stopping-rule paradoxes.
Interactive Resource: The Bayes Factor Dynamics Engine
Design: A split dashboard. Left: Prior distributions for \(\theta\) under \(H_0\) and \(H_1\). Right: The posterior distributions.
Interaction: The user sets prior odds to 1 (equal priors) and observes the Bayes Factor. They can then heavily skew the prior probability of \(H_0\) to 0.99. The posterior updates, showing that strong prior belief requires overwhelming data to reject \(H_0\).
Rigor Check: A side-panel calculates the p-value for the same data. With large sample sizes, the tool demonstrates Lindley’s Paradox: the p-value screams “Reject \(H_0\)” while the Bayes Factor suggests “Strong support for \(H_0\),” highlighting the divergence of the two paradigms.
Content:
Union-Intersection: \(H_0: \theta \in \bigcap_{\gamma \in \Gamma} \Theta_\gamma\) vs \(H_1: \theta \in \bigcup_{\gamma \in \Gamma} \Theta_\gamma^c\). Reject \(H_0\) if any individual test rejects.
Intersection-Union: \(H_0: \theta \in \bigcup_{\gamma \in \Gamma} \Theta_\gamma\) vs \(H_1: \theta \in \bigcap_{\gamma \in \Gamma} \Theta_\gamma^c\). Reject \(H_0\) if all individual tests reject. Rigorous Focus: Size control. For Union-Intersection, the overall size \(\alpha\) is bounded by the individual sizes (requiring Bonferroni-like corrections). For Intersection-Union, if individual tests are size \(\alpha\), the overall test is exactly size \(\alpha\).
Interactive Resource: The Region Logician
Design: A 2D parameter space \((\theta_1, \theta_2)\).
Interaction: The user defines \(H_0\) as a union or intersection of regions (e.g., \(H_0: \theta_1 \le 0 \cup \theta_2 \le 0\)). The tool shades the rejection regions for individual tests. It then uses boolean logic operators to visually merge these regions into the final rejection region, demonstrating the intersection/union geometric property.
Content:
Type I Error: \(\alpha = P(\text{Reject } H_0 | \theta \in \Theta_0)\).
Type II Error: \(\beta = P(\text{Accept } H_0 | \theta \in \Theta_1)\).
Power Function: \(\pi(\theta) = P(\text{Reject } H_0 | \theta)\). Rigorous Focus: The strict distinction between a test of size \(\alpha\) (supremum of \(\pi(\theta)\) over \(\Theta_0\) is exactly \(\alpha\)) and a test of level \(\alpha\) (supremum is \(\le \alpha\)).
Interactive Resource: The Power Curve Architect
Design: A graph plotting the power function \(\pi(\theta)\) against \(\theta\). Horizontal lines represent the \(\alpha\) level.
Interaction: The user selects a test statistic and an \(\alpha\) level. They adjust the sample size \(n\). The power curve dynamically updates, pulling away from \(\alpha\) in the alternative region as \(n\) increases.
Rigor Check: The tool highlights the supremum of the curve in \(\Theta_0\), forcing the student to verify that the size of the test does not exceed \(\alpha\), particularly at the boundary of the null hypothesis.
Content: Testing \(H_0: \theta = \theta_0\) vs \(H_1: \theta = \theta_1\) (Simple vs. Simple).
The Neyman-Pearson Lemma: A test that rejects \(H_0\) if \(\frac{f(\mathbf{x}|\theta_1)}{f(\mathbf{x}|\theta_0)} > k\) is the Uniformly Most Powerful (UMP) test of size \(\alpha\).
Rigorous Focus:
Interactive Resource: The Neyman-Pearson Partition Visualizer
Design: Overlapping density curves for \(f(x|\theta_0)\) and \(f(x|\theta_1)\). A vertical line represents the critical value \(c\).
Interaction: The user drags the critical value line left and right. The area under \(f(x|\theta_0)\) to the right of \(c\) shades red (Type I error \(\alpha\)). The area under \(f(x|\theta_1)\) to the right of \(c\) shades green (Power \(1-\beta\)). The tool dynamically plots the ratio of the densities, demonstrating that the optimal rejection region precisely corresponds to where the likelihood ratio exceeds a threshold.
Worked Example: The Randomized Test Simulator
Scenario: Testing \(H_0: \text{Poisson}(\lambda=1)\) vs \(H_1: \text{Poisson}(\lambda=2)\) with exact size \(\alpha=0.05\).
Interactive Element: Because Poisson is discrete, no integer critical value yields exactly 0.05. The tool calculates the CDFs, identifies the jump across 0.05, and visually explains the randomization probability \(\gamma\) required to achieve exact size. A Monte Carlo simulator implements the randomization, proving the long-run Type I error is exactly 0.05.
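Assuming a single Poisson observation for concreteness (the text does not fix the sample size), the randomization probability \(\gamma\) and the resulting exact size can be sketched as:

```python
import numpy as np
from scipy.stats import poisson

# One observation X ~ Poisson(lambda); H0: lambda = 1 vs H1: lambda = 2.
alpha, lam0 = 0.05, 1.0

# Reject outright for X >= 4; randomize at the boundary X = 3 to hit size exactly.
p_ge4 = poisson.sf(3, lam0)          # P(X >= 4) ~ 0.019 < alpha (undershoots)
p_eq3 = poisson.pmf(3, lam0)         # probability mass at the boundary point
gamma = (alpha - p_ge4) / p_eq3      # randomization probability

# Monte Carlo check that the long-run Type I error equals alpha.
rng = np.random.default_rng(9)
x = rng.poisson(lam0, 200000)
reject = (x >= 4) | ((x == 3) & (rng.random(200000) < gamma))
print(gamma, reject.mean())
```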
Content: The p-value is a test statistic defined as the smallest level \(\alpha\) at which \(H_0\) would be rejected. Formally: \(p(\mathbf{x}) = \sup_{\theta \in \Theta_0} P_\theta(T \ge T(\mathbf{x}))\).
Rigorous Focus: The p-value is a random variable. Under \(H_0\), if the test statistic is continuous, the p-value is uniformly distributed on \((0,1)\). Misinterpretation: The p-value is not the probability that \(H_0\) is true.
Interactive Resource: The p-Value Distribution Exposer
Design: A simulation engine running 10,000 experiments under a true \(H_0\).
Interaction: The tool calculates the p-value for each experiment and plots a histogram. The result is a perfectly uniform distribution.
Rigor Check: The user clicks “Add False \(H_0\)”. The simulation runs under the alternative, and the p-value histogram skews heavily toward 0. The tool overlays the power curve, showing that the area to the left of \(\alpha\) in the p-value histogram is exactly the power of the test.
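The exposer's two key facts, uniformity under \(H_0\) and \(P(p < \alpha) = \alpha\), can be checked with a short simulation (a one-sided \(z\)-test with known \(\sigma\) is used for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)
n, reps = 30, 20000

# Z-test of H0: mu = 0 with known sigma = 1; data generated under a true H0.
z = np.sqrt(n) * rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
pvals = norm.sf(z)                         # one-sided p-values

# Under a true H0 with a continuous statistic, p-values are Uniform(0,1),
# so their mean is ~0.5 and the fraction below alpha is ~alpha.
frac_below_alpha = (pvals < 0.05).mean()
print(pvals.mean(), frac_below_alpha)
```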
Content: Framing hypothesis testing as a decision problem. The risk function combines Type I and Type II errors weighted by a loss function \(L(\theta, a)\).
Rigorous Focus: Proving that the Bayes test chooses the hypothesis with the smaller posterior expected loss, which directly corresponds to comparing the posterior odds to the loss ratio.
Interactive Resource: The Decision-Theoretic Risk Frontier
Design: A 2D plane where the x-axis is the probability of Type I error (\(\alpha\)) and the y-axis is the probability of Type II error (\(\beta\)).
Interaction: The user defines a family of tests (e.g., varying critical values). The tool plots the risk points \((\alpha, \beta)\), tracing out the risk frontier. The user adjusts the loss weights \(L(\text{Type I})\) and \(L(\text{Type II})\). A line representing the minimax or Bayes risk objective pivots, and the optimal test on the frontier is highlighted.
Power-function calculations require noncentral distributions (ncf, nct, ncx2), and these should be exposed via API.

While point estimation provides a single best guess for a parameter, it provides no measure of the uncertainty inherent in that guess. This module rigorously develops the theory of interval estimation, constructing ranges of values that are likely to contain the true parameter. The module unifies this concept with hypothesis testing (Module 8), demonstrates various mathematical techniques for deriving intervals, and contrasts the Frequentist interpretation (Confidence Intervals) with the Bayesian interpretation (Credible Intervals).
Learning Objectives:
Define confidence intervals and rigorously distinguish between random interval bounds and fixed parameters.
Derive confidence intervals by inverting the acceptance regions of hypothesis tests.
Construct intervals using Pivotal Quantities and by pivoting the CDF.
Derive and interpret Bayesian Credible Intervals and Highest Posterior Density (HPD) regions.
Evaluate interval estimators based on coverage probability, size, and expected length.
Establish optimality of intervals via decision theory and test-inversion (Uniformly Most Accurate intervals).
Content: An interval estimator for \(\theta\) is a pair of statistics \(L(\mathbf{X})\) and \(U(\mathbf{X})\) such that \(L(\mathbf{X}) \le U(\mathbf{X})\). The interval \([L(\mathbf{X}), U(\mathbf{X})]\) is a random set.
Rigorous Focus: The Frequentist interpretation. The probability \(P_\theta(\theta \in [L(\mathbf{X}), U(\mathbf{X})])\) refers to the long-run frequency of random intervals capturing the fixed true parameter \(\theta\). It is not the probability that \(\theta\) lies in a fixed observed interval.
Interactive Resource: The Frequentist Target Shooter
Design: A number line representing the parameter space. A vertical bullseye marks the true, fixed parameter \(\theta_0\).
Interaction: The user clicks “Draw Sample”. The engine calculates \([L(\mathbf{x}), U(\mathbf{x})]\) and shoots a horizontal bar (the interval) onto the number line. If the bar covers \(\theta_0\), it turns Green (Hit); otherwise, Red (Miss). A counter tracks the hit rate. As the user rapidly clicks, the hit rate converges to the confidence level (e.g., 95%), demonstrating that the 95% guarantee belongs to the method, not to any individual interval.
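The target shooter's hit-rate convergence can be reproduced offline (a known-\(\sigma\) \(z\)-interval is used for illustration; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
theta0, sigma, n, reps = 10.0, 2.0, 25, 20000

xbar = rng.normal(theta0, sigma, size=(reps, n)).mean(axis=1)
half = 1.96 * sigma / np.sqrt(n)   # 95% z-interval half-width (sigma known)

# Each replication is one "shot"; the long-run hit rate approaches 0.95.
hits = (xbar - half <= theta0) & (theta0 <= xbar + half)
hit_rate = hits.mean()
print(hit_rate)
```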
Content: There is a 1-to-1 duality between hypothesis tests and confidence intervals. A \(1-\alpha\) confidence set consists of all values \(\theta_0\) for which the hypothesis \(H_0: \theta = \theta_0\) would not be rejected at level \(\alpha\).
Rigorous Focus: \(C(\mathbf{x}) = \{\theta_0 : \mathbf{x} \in A(\theta_0)\}\), where \(A(\theta_0)\) is the acceptance region of the level \(\alpha\) test.
Interactive Resource: The Test-Inversion Tracer
Design: A dual-axis plot. Left: a plot of the test statistic \(T(\mathbf{x})\) vs. \(\theta_0\), showing the acceptance region boundaries. Right: the resulting confidence interval \([L(\mathbf{x}), U(\mathbf{x})]\) on a number line.
Interaction: The user observes a fixed sample \(\mathbf{x}\) (hence a fixed horizontal line \(T_{obs}\)). The tool sweeps \(\theta_0\) across the x-axis. For each \(\theta_0\), the engine checks if \(T_{obs}\) falls inside \(A(\theta_0)\). If yes, \(\theta_0\) is highlighted on the right-hand number line. The accumulation of these \(\theta_0\) values visually “paints” the confidence interval, proving that the CI is exactly the set of non-rejected nulls.
Content: A pivot is a function \(Q(\mathbf{X}, \theta)\) whose distribution is independent of \(\theta\) (and usually of other unknown parameters).
Example: \(Q = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}\).
Rigorous Focus: Finding constants \(a\) and \(b\) such that \(P(a \le Q(\mathbf{X}, \theta) \le b) = 1-\alpha\), and algebraically “inverting” the inequality inside the probability to isolate \(\theta\).
Interactive Resource: The Pivot Isolator
Design: A symbolic algebra step-by-step solver.
Interaction: The user inputs a pivot (e.g., \(Q = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\) for known \(\sigma\)). The tool sets up the probability inequality: \(P(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}) = 1-\alpha\). The user must click to apply algebraic operations (multiply by \(\sigma/\sqrt{n}\), subtract \(\bar{X}\), multiply by -1 [flipping inequalities]). The tool dynamically slides the terms across the inequality signs, finally isolating \(\mu\) to yield the standard \(z\)-interval.
Content: If \(X\) is continuous with CDF \(F_X(x|\theta)\), then \(F_X(X|\theta) \sim \text{Uniform}(0,1)\). We can form a pivot using \(a \le F_X(X|\theta) \le b\).
Rigorous Focus: Solving the CDF inequality for \(\theta\). This often yields one-sided intervals, particularly for scale parameters or order statistics.
Worked Example: Interval for Uniform Scale
Scenario: \(X_1, \dots, X_n \sim \text{Uniform}(0, \theta)\). Finding a CI for \(\theta\) using the maximum \(X_{(n)}\).
Interactive Element: The tool visualizes the CDF of \(X_{(n)}\), which is \((x/\theta)^n\). It shows how setting \(P(\alpha \le (X_{(n)}/\theta)^n \le 1) = 1-\alpha\) allows solving for \(\theta\) in terms of \(X_{(n)}\), resulting in the interval \(\left[ X_{(n)}, X_{(n)} / \alpha^{1/n} \right]\).
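The exact \(1-\alpha\) coverage of the interval \([X_{(n)},\, X_{(n)}/\alpha^{1/n}]\) can be confirmed by simulation (the values of \(\theta\) and \(n\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(12)
theta, n, alpha, reps = 2.0, 5, 0.05, 20000

x_max = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
lower = x_max                          # X_(n) is always below theta
upper = x_max / alpha ** (1.0 / n)     # inflate the maximum by alpha^(-1/n)

coverage = ((lower <= theta) & (theta <= upper)).mean()  # theory: exactly 1 - alpha
print(coverage)
```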
Content: Given the posterior distribution \(\pi(\theta|\mathbf{x})\), a \(1-\alpha\) credible interval is any set \(C\) such that \(P(\theta \in C | \mathbf{x}) = 1-\alpha\).
Rigorous Focus:
The equal-tailed interval: \([L, U]\) where \(P(\theta < L|\mathbf{x}) = \alpha/2\) and \(P(\theta > U|\mathbf{x}) = \alpha/2\).
The Highest Posterior Density (HPD) region: \(C = \{\theta : \pi(\theta|\mathbf{x}) \ge k(\alpha)\}\). For a unimodal posterior, the HPD is the shortest possible \(1-\alpha\) credible interval.
Interactive Resource: The HPD Carver
Design: A plot of the posterior density \(\pi(\theta|\mathbf{x})\). A horizontal line representing the density threshold \(k\) is present. A slider controls \(\alpha\).
Interaction: The user adjusts \(k\). The tool shades the region where \(\pi(\theta|\mathbf{x}) \ge k\) and calculates the integral (the coverage). The user adjusts \(k\) until the coverage is exactly \(1-\alpha\). The tool simultaneously calculates the length of the resulting interval.
Rigor Check: The user selects a skewed posterior (e.g., Gamma). The tool overlays the Equal-Tailed interval and the HPD. The HPD is visibly shifted toward the mode, and a “Length Score” confirms the HPD is strictly shorter than the Equal-Tailed interval. For a Normal posterior, they perfectly overlap.
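A numeric sketch of the carver for a skewed Gamma posterior (an illustrative choice), comparing equal-tailed and HPD lengths via a simple grid search over the lower endpoint:

```python
import numpy as np
from scipy.stats import gamma

# Skewed posterior: Gamma(shape=3, scale=1); compare 95% intervals.
a, level = 3.0, 0.95

lo_et, hi_et = gamma.ppf([0.025, 0.975], a)   # equal-tailed credible interval
len_et = hi_et - lo_et

# HPD search: slide the lower endpoint, keep 95% posterior mass, minimize length.
ls = np.linspace(1e-6, 0.999 * gamma.ppf(1.0 - level, a), 2000)
us = gamma.ppf(gamma.cdf(ls, a) + level, a)
k = np.argmin(us - ls)
len_hpd = us[k] - ls[k]
print(len_et, len_hpd)   # HPD strictly shorter for a skewed posterior
```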
Content:
Coverage Probability: \(P_\theta(\theta \in C(\mathbf{X}))\).
Confidence Coefficient (Size): The infimum of the coverage probability over all \(\theta \in \Omega\): \(\inf_{\theta \in \Omega} P_\theta(\theta \in C(\mathbf{X}))\).
Rigorous Focus: A \(1-\alpha\) interval guarantees coverage of at least \(1-\alpha\) for all \(\theta\), not exactly \(1-\alpha\). Conservative intervals may over-cover, while intervals used outside their assumptions (like the standard \(t\)-interval applied to markedly non-normal data at small \(n\)) may under-cover.
Interactive Resource: The Coverage Probability Profiler
Design: A plot of Coverage Probability vs. \(\theta\). A horizontal dashed line at \(1-\alpha\).
Interaction: The user selects a distribution (e.g., Exponential) and the standard \(t\)-interval (derived under Normality assumptions). The tool runs Monte Carlo simulations across a grid of \(\theta\) values and plots the actual coverage. The plot shows that for small \(n\), the coverage dips below \(1-\alpha\) (violating the size guarantee), but as \(n\) increases (CLT kicks in), the coverage curve flattens to \(1-\alpha\).
Content: Applying decision theory to intervals.
Loss functions: \(L(\theta, C) = \text{Length}(C)\) (penalize wide intervals) or \(L(\theta, C) = I(\theta \notin C)\) (penalize missing the parameter).
Risk: \(R(\theta, C) = E_\theta[L(\theta, C)]\).
Interactive Resource: The Risk-Length Tradeoff Dashboard
Design: A 2D plane with Expected Length on the x-axis and Coverage Error (\(1 - \text{Coverage}\)) on the y-axis.
Interaction: The user defines a family of intervals by tweaking a tuning parameter (e.g., varying the split of \(\alpha\) between the tails: \(\alpha_1\) and \(\alpha_2\) such that \(\alpha_1 + \alpha_2 = \alpha\)). The tool plots the risk points. The student discovers that the symmetric split (\(\alpha_1 = \alpha_2 = \alpha/2\)) minimizes the expected length for symmetric unimodal distributions, balancing the Decision-Theoretic risk.
Root-Finding for Test Inversion: Inverting a test to obtain an interval requires a numerical root-finder (e.g., scipy.optimize.brentq) to find the precise boundaries \(L(\mathbf{x})\) and \(U(\mathbf{x})\) where the test statistic hits the critical value, especially when analytical inversion is messy.

In practice, exact sampling distributions of statistics are often mathematically intractable or rely on unrealistic distributional assumptions (like strict Normality). This module equips students with the asymptotic tools necessary to approximate the behavior of estimators and test statistics when the sample size \(n\) is large. It covers the convergence of point estimators (Consistency, Efficiency), computational resampling methods (Bootstrap), robustness against model misspecification, and the large-sample theory for hypothesis tests and confidence intervals.
Learning Objectives:
Prove the consistency of estimators using convergence in probability and the Continuous Mapping Theorem.
Derive the asymptotic distribution of Maximum Likelihood Estimators (MLEs) and define Asymptotic Efficiency.
Calculate and interpret Asymptotic Relative Efficiency (ARE) to compare two consistent estimators.
Implement the nonparametric Bootstrap to estimate standard errors and construct confidence intervals.
Evaluate the robustness of estimators by analyzing breakdown points and influence functions, specifically contrasting the sample mean, median, and M-estimators.
Derive the asymptotic distribution of the Likelihood Ratio Test (Wilks’ Theorem) and formulate large-sample Wald and Score tests.
Construct approximate large-sample confidence intervals using MLE asymptotics and the Bootstrap.
Content: An estimator \(\hat{\theta}_n\) is consistent for \(\theta\) if \(\hat{\theta}_n \xrightarrow{P} \theta\).
Rigorous Focus:
Proving consistency via the Weak Law of Large Numbers (WLLN) and the Continuous Mapping Theorem.
Proving the consistency of MLEs under regularity conditions (the global maximum of the likelihood converges in probability to the true parameter).
Interactive Resource: The Consistency Targeting Simulator
Design: A number line representing the parameter space, with a bullseye at the true \(\theta_0\). A dynamic density curve representing the sampling distribution of \(\hat{\theta}_n\).
Interaction: The user selects an estimator (e.g., MLE for Poisson, or an inconsistent estimator like \(X_1\)). As the user increases \(n\) via a slider, the sampling distribution visually collapses onto \(\theta_0\) for consistent estimators. For \(X_1\), the distribution remains static, demonstrating inconsistency regardless of sample size.
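The collapse the simulator animates can be checked numerically. A minimal sketch, assuming a Poisson model with an illustrative true rate \(\theta_0 = 4\) and Monte Carlo settings chosen for speed: the sampling standard deviation of the MLE \(\bar{X}\) shrinks like \(1/\sqrt{n}\), while the inconsistent estimator \(X_1\) stays put.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, reps = 4.0, 2000   # illustrative true rate and replication count

def sampling_sd(estimator, n):
    """Monte Carlo SD of an estimator's sampling distribution at sample size n."""
    est = [estimator(rng.poisson(theta0, size=n)) for _ in range(reps)]
    return np.std(est)

for n in (10, 100, 1000):
    sd_mle = sampling_sd(np.mean, n)          # consistent: shrinks like 1/sqrt(n)
    sd_x1 = sampling_sd(lambda s: s[0], n)    # inconsistent: stays near sqrt(theta0)
    print(f"n={n:5d}  SD(mean)={sd_mle:.3f}  SD(X1)={sd_x1:.3f}")
```

The \(X_1\) column hovers near \(\sqrt{\theta_0} = 2\) regardless of \(n\), which is exactly the static density the simulator shows.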
Content: Under regularity conditions, \(\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N\left(0, \frac{1}{I(\theta)}\right)\) for the MLE.
Rigorous Focus: Asymptotic variance vs. exact variance. An estimator is asymptotically efficient if its asymptotic variance achieves the Cramér-Rao Lower Bound (CRLB).
Content: Comparing two consistent estimators, \(\hat{\theta}_1\) and \(\hat{\theta}_2\), with asymptotic variances \(v_1(\theta)/n\) and \(v_2(\theta)/n\). The ARE is \(ARE(\hat{\theta}_1, \hat{\theta}_2) = \frac{v_2(\theta)}{v_1(\theta)}\).
Worked Example: Sample Mean vs. Sample Median for Normal and Double Exponential distributions.
Interactive Resource: The ARE Arena
Design: A split-screen race track. Two estimators (e.g., Mean and Median) are competing to converge to \(\theta\).
Interaction: The user selects the underlying population (Normal vs. Heavy-tailed). A simulation runs, plotting the MSE of both estimators against \(1/n\).
Rigor Check: For the Normal distribution, the Mean’s MSE drops faster; the tool calculates \(ARE(Median, Mean) = 2/\pi \approx 0.64\), showing the median needs \(\sim 57\%\) more data to achieve the same precision. Switching to Double Exponential flips the ratio: \(ARE(Mean, Median) = 1/2\), visually proving the median’s superiority under heavy tails.
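The two ARE values above can be reproduced by simulation. A minimal sketch (sample size, replication count, and the standard Normal/Laplace parameterizations are illustrative choices), estimating each MSE about the true location \(0\) and taking the ratio as defined in the text, \(ARE(\hat{\theta}_1, \hat{\theta}_2) = v_2/v_1\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 4000

def mse(estimator, sampler):
    """Monte Carlo MSE of a location estimator; the true location is 0."""
    est = np.array([estimator(sampler(n)) for _ in range(reps)])
    return np.mean(est ** 2)

normal = lambda m: rng.normal(0, 1, m)
laplace = lambda m: rng.laplace(0, 1, m)   # double exponential

# ARE(median, mean) under Normality: v_mean / v_median -> 2/pi ~ 0.64
are_normal = mse(np.mean, normal) / mse(np.median, normal)
# ARE(mean, median) under the Laplace: v_median / v_mean -> 1/2
are_laplace = mse(np.median, laplace) / mse(np.mean, laplace)
print(f"ARE(median, mean | Normal)  ~ {are_normal:.2f}")
print(f"ARE(mean, median | Laplace) ~ {are_laplace:.2f}")
```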
Content: The nonparametric bootstrap procedure:
Draw \(B\) resamples \(\mathbf{x}^{*1}, \dots, \mathbf{x}^{*B}\) with replacement from the observed data.
Calculate the statistic \(\hat{\theta}^{*b}\) for each resample.
Estimate the standard error as \(SE_{boot} = \sqrt{\frac{1}{B-1} \sum (\hat{\theta}^{*b} - \bar{\hat{\theta}}^*)^2}\).
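The three-step procedure above can be sketched in a few lines. The Exponential sample, the choice of the median as the statistic, and \(B = 2000\) are illustrative assumptions; the \(SE_{boot}\) formula is the one given in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=50)   # illustrative observed sample
B = 2000

# Step 1 & 2: resample with replacement, recompute the statistic each time
boot = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])

# Step 3: SE_boot is the (B-1)-denominator SD of the bootstrap replicates
se_boot = boot.std(ddof=1)
print(f"sample median = {np.median(x):.3f}, bootstrap SE = {se_boot:.3f}")
```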
Interactive Resource: The Resampling Engine
Design: A histogram of the original sample. A “slot machine” style resampler. A dynamic histogram of the bootstrap distribution \(\hat{\theta}^*\).
Interaction: The user defines a complex statistic (e.g., sample Trimmed Mean). The engine rapidly draws resamples, updating the bootstrap histogram in real-time. The tool overlays a Normal curve with the bootstrap SE, demonstrating how the central limit theorem applies even to analytically difficult statistics.
Content: Sensitivity of estimators to deviations from model assumptions (e.g., outliers).
Rigorous Focus:
Breakdown Point: The fraction of contaminated data an estimator can tolerate before becoming arbitrarily wrong. (Mean = 0%, Median = 50%).
Influence Function: Measures the infinitesimal sensitivity of an estimator to an outlier at \(x\).
Interactive Resource: The Contamination and Influence Sandbox
Design: A dot plot of a sample from \(N(0,1)\). Two vertical lines tracking the value of the Sample Mean and Sample Median.
Interaction: The user clicks on the far right tail of the dot plot to inject an extreme outlier. The Mean line rapidly chases the outlier, while the Median line barely moves. The tool plots the empirical Influence Function (change in estimate vs. outlier location), showing the unbounded influence of the mean and the bounded influence of the median.
Content: Estimators defined as the minimizer of \(\sum_{i=1}^n \rho(x_i - \theta)\), or the root of \(\sum_{i=1}^n \psi(x_i - \theta) = 0\).
Mean: \(\rho(x-\theta) = (x-\theta)^2\) (\(\psi\) is unbounded).
Median: \(\rho(x-\theta) = |x-\theta|\) (\(\psi\) is bounded but discontinuous).
Huber Estimator: A hybrid that is quadratic near zero and linear in the tails, providing both efficiency (like the mean) and robustness (like the median).
Interactive Resource: The Huber Tuning Knob
Design: A plot of the \(\psi\)-function \(\psi(x) = x\) for \(|x| \le k\) and \(\psi(x) = k \cdot \text{sign}(x)\) for \(|x| > k\).
Interaction: The user adjusts the tuning constant \(k\). As \(k \to \infty\), \(\psi\) becomes the mean (unbounded). As \(k \to 0\), \(\psi\) becomes the median. The user sets a mixed dataset (mostly Normal with \(10\%\) extreme outliers) and adjusts \(k\) to find the optimal bias-variance tradeoff, visualizing how M-estimators tame the influence of outliers.
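The Huber \(\psi\)-function and its IRWLS solution for a location estimate can be sketched directly; this mirrors what the Tuning Knob animates. The contaminated dataset (90 standard Normals plus ten gross outliers at 50), the default \(k = 1.345\), and the unit scale are illustrative assumptions.

```python
import numpy as np

def huber_psi(r, k=1.345):
    """psi(x) = x for |x| <= k, k*sign(x) otherwise."""
    return np.clip(r, -k, k)

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via IRWLS with weights psi(r)/r."""
    theta = np.median(x)                      # robust starting value
    for _ in range(max_iter):
        r = x - theta
        w = np.where(r == 0, 1.0, huber_psi(r, k) / r)
        new = np.sum(w * x) / np.sum(w)       # weighted mean update
        if abs(new - theta) < tol:
            break
        theta = new
    return theta

data = np.concatenate([np.random.default_rng(3).normal(0, 1, 90),
                       [50.0] * 10])          # 10% gross outliers
print(np.mean(data), np.median(data), huber_location(data))
```

As \(k \to \infty\) the clip never activates and the update reproduces the mean; as \(k \to 0\) the estimate approaches the median, matching the tool's two limiting cases.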
#### 10.3.1 Asymptotic Distribution of LRTs (Wilks’ Theorem)
Content: For testing \(H_0: \theta \in \Theta_0\) vs \(H_1: \theta \in \Theta_0^c\), under regularity conditions, the LRT statistic \(-2\log\lambda(\mathbf{X}) \xrightarrow{d} \chi^2_{\nu}\), where \(\nu = \dim(\Theta) - \dim(\Theta_0)\).
Rigorous Focus: This allows chi-squared approximation of p-values when the exact finite-sample distribution of the LRT statistic is intractable, provided \(n\) is reasonably large.
Interactive Resource: The LRT Convergence Fitter
Design: A histogram of simulated \(-2\log\lambda\) values under \(H_0\). Overlay curves for exact distributions (if known) and the \(\chi^2_\nu\) asymptote.
Interaction: The user adjusts the sample size \(n\). For small \(n\), the empirical histogram is skewed relative to the \(\chi^2\) overlay. As \(n\) increases, the histogram seamlessly merges with the \(\chi^2\) curve, validating Wilks’ Theorem empirically.
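The convergence the Fitter displays can be verified by simulation. A minimal sketch, assuming a Poisson model with \(H_0: \lambda = 3\) (so \(\nu = 1\)); the factorial terms of the log-likelihood cancel in the ratio, and under Wilks' Theorem the statistic should have mean \(\approx 1\) and exceed the \(\chi^2_1\) critical value \(3.841\) about \(5\%\) of the time.

```python
import numpy as np

rng = np.random.default_rng(4)
lam0, n, reps = 3.0, 200, 5000

stats = np.empty(reps)
for b in range(reps):
    x = rng.poisson(lam0, n)
    xbar = x.mean()
    # -2 log lambda = 2[l(lambda_hat) - l(lambda_0)] for the Poisson
    stats[b] = 2 * (x.sum() * np.log(xbar / lam0) - n * (xbar - lam0))

# Wilks: statistic ~ chi^2_1, so E ~ 1 and P(stat > 3.841) ~ 0.05
print(stats.mean(), np.mean(stats > 3.841))
```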
Content:
Wald Test: Uses the MLE. \(W = \frac{(\hat{\theta} - \theta_0)^2}{\widehat{Var}(\hat{\theta})} \xrightarrow{d} \chi^2_1\).
Score (Rao) Test: Uses only the estimator under \(H_0\). \(S = \frac{\ell'(\theta_0)^2}{I(\theta_0)} \xrightarrow{d} \chi^2_1\). Rigorous Focus: Geometric interpretations. The Wald test measures distance on the parameter scale; the Score test measures the slope of the log-likelihood at the null.
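For a concrete case, both statistics have closed forms in the Bernoulli model: the Wald test plugs the MLE \(\hat{p}\) into the variance, while the Score test evaluates everything at the null \(p_0\). A minimal sketch (the data-generating values are illustrative assumptions):

```python
import numpy as np
from scipy.stats import chi2

def wald_score(y, p0):
    """Wald and Score (Rao) statistics for H0: p = p0 with Bernoulli data."""
    n, phat = len(y), np.mean(y)
    wald = (phat - p0) ** 2 / (phat * (1 - phat) / n)   # variance at the MLE
    score = (phat - p0) ** 2 / (p0 * (1 - p0) / n)      # variance at the null
    return wald, score

y = np.random.default_rng(5).binomial(1, 0.62, size=400)
w, s = wald_score(y, p0=0.5)
print(w, s, chi2.sf(w, df=1), chi2.sf(s, df=1))
```

Note the two statistics differ only in where the variance is evaluated, which is exactly the geometric distinction drawn above.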
Interactive Resource: The Holy Trinity Visualizer
Design: A plot of the log-likelihood function \(\ell(\theta)\). Markers for \(\theta_0\) (null) and \(\hat{\theta}\) (MLE).
Interaction: The user drags the observed data, shifting the likelihood curve. The tool dynamically draws: * LRT: The vertical drop from the peak \(\ell(\hat{\theta})\) to \(\ell(\theta_0)\). * Wald: The horizontal distance \(|\hat{\theta} - \theta_0|\). * Score: The tangent line slope at \(\theta_0\).
As the user makes \(n\) large, the log-likelihood becomes approximately quadratic, and the tool shows the three test statistics converging to the same value, illustrating their asymptotic equivalence.
Content: Using the asymptotic normality of the MLE: \(\hat{\theta} \pm z_{\alpha/2} \sqrt{\frac{1}{nI(\hat{\theta})}}\).
Rigorous Focus: Replacing the true Fisher Information \(I(\theta)\) with the estimated Fisher Information \(I(\hat{\theta})\) or the observed Fisher Information.
Content:
Inverting large-sample Wald and Score tests.
Bootstrap Percentile Intervals: Using the quantiles of the bootstrap distribution \(\hat{\theta}^*\) directly: \([\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}]\).
Interactive Resource: The Bootstrap CI Constructor
Design: The bootstrap histogram of \(\hat{\theta}^*\). A slider for \(\alpha\).
Interaction: The user selects a highly skewed estimator (e.g., the sample variance). The Wald interval (symmetric around \(\hat{\theta}\)) is plotted, often crossing impossible parameter boundaries (e.g., \(< 0\)). The user then clicks “Percentile Method”, and the tool slices off the \(\alpha/2\) tails of the bootstrap histogram, providing an asymmetric, realistic confidence interval that respects the natural boundaries of the parameter space.
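The percentile method described above amounts to taking empirical quantiles of the bootstrap replicates. A minimal sketch for the sample variance of an Exponential sample (the data and \(B\) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(2.0, size=40)   # illustrative skewed sample
B, alpha = 4000, 0.05

# Bootstrap distribution of the sample variance
boot = np.array([np.var(rng.choice(x, x.size, replace=True), ddof=1)
                 for _ in range(B)])

# Percentile interval: slice alpha/2 off each tail of the bootstrap histogram
lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
print(f"95% percentile CI for the variance: [{lo:.2f}, {hi:.2f}]")
```

Because the interval is built from quantiles of \(\hat{\theta}^*\), it is asymmetric for skewed statistics and cannot cross the boundary \(\theta = 0\), the behavior the Constructor highlights.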
High-Performance Resampling Engine: The Bootstrap and robustness simulations require generating millions of random numbers rapidly. The backend must use vectorized operations (NumPy/NumPyro in Python or a WebAssembly port of a C++ random library). The UI must not freeze; progress bars or streaming updates are mandatory for \(B > 10,000\) resamples.
Automatic Differentiation (AD) Engine: Calculating the Score test and Fisher Information requires taking derivatives of the log-likelihood. Instead of hardcoding derivatives, integrate an AD library (like JAX or PyTorch in Python, or math.js for basic JS). This allows students to input any valid likelihood function and instantly get the Score test statistic and Wald standard errors.
Robustness Outlier Injector: The Contamination Sandbox needs a fluid UI for adding/removing data points. Use D3.js to bind data points to DOM elements that can be dragged along the x-axis, instantly recalculating the Mean, Median, and Huber estimates without re-running the entire simulation.
Chi-Square Quantile Calculator: Wilks’ Theorem relies heavily on the \(\chi^2\) distribution. The frontend needs a fast statistical library (like jStat) to compute CDFs and inverse CDFs (quantiles) of \(\chi^2_\nu\) to calculate asymptotic p-values and critical values dynamically.
Influence Function Plotter: The Influence Function for M-estimators is defined by the derivative of \(\rho\). The tool should allow users to choose \(\rho\) (Quadratic, Absolute, Huber) and dynamically plot \(\psi(x)\), overlaying it on the data dot-plot to visually explain why certain estimators resist outliers (bounded \(\psi\)) and others do not (unbounded \(\psi\)).
This module transitions from univariate statistics to the analysis of structured data. It introduces two of the most foundational linear models in statistics: the Oneway Analysis of Variance (ANOVA) for comparing group means, and Simple Linear Regression (SLR) for modeling the relationship between two continuous variables. The module rigorously develops these models from both an algebraic optimization perspective (Least Squares) and a probabilistic optimality perspective (BLUE and MLE), emphasizing the geometric and distributional properties of the resulting estimators.
Learning Objectives:
Formulate the Oneway ANOVA model and partition the Total Sum of Squares (SST) into explained (SSG) and unexplained (SSE) variation.
Derive and evaluate the ANOVA F test for equality of means, proving its optimality under normality.
Define and test linear contrasts, and apply methods for simultaneous inference (Scheffé’s method).
Derive Ordinary Least Squares (OLS) estimators via calculus and orthogonal projections.
Prove the Gauss-Markov Theorem, establishing OLS as the Best Linear Unbiased Estimator (BLUE).
Derive the distributions of OLS estimators under the assumption of normal errors, and construct t-tests and confidence intervals.
Distinguish rigorously between confidence intervals for the mean response \(E(Y|x_0)\) and prediction intervals for a new observation \(Y_0\).
Content: The cell means model: \(Y_{ij} = \mu_i + \epsilon_{ij}\), or the overparameterized model: \(Y_{ij} = \mu + \tau_i + \epsilon_{ij}\), for \(i=1,\dots,k\) groups and \(j=1,\dots,n_i\). Assumptions: \(\epsilon_{ij} \text{ iid } N(0, \sigma^2)\).
Rigorous Focus: The constraint \(\sum n_i \tau_i = 0\) (or similar) to make the model identifiable.
Content: Testing \(H_0: \mu_1 = \mu_2 = \dots = \mu_k\) vs. \(H_1: \mu_i \neq \mu_j\) for some \(i \neq j\). (Equivalently, \(H_0: \tau_1 = \dots = \tau_k = 0\)).
Content: Partitioning the total variation: \(\text{SST} = \text{SSG} + \text{SSE}\).
\(\text{SST} = \sum \sum (Y_{ij} - \bar{\bar{Y}})^2\)
\(\text{SSG} = \sum n_i (\bar{Y}_i - \bar{\bar{Y}})^2\)
\(\text{SSE} = \sum \sum (Y_{ij} - \bar{Y}_i)^2\)
The F-statistic: \(F = \frac{\text{SSG}/(k-1)}{\text{SSE}/(N-k)}\).
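The decomposition and F-statistic above can be sketched directly from the formulas. The helper name `oneway_f`, the three Normal groups, and their means are illustrative assumptions; the result can be cross-checked against scipy.stats.f_oneway.

```python
import numpy as np
from scipy.stats import f as f_dist, f_oneway

def oneway_f(groups):
    """SST = SSG + SSE decomposition and the Oneway ANOVA F statistic."""
    all_y = np.concatenate(groups)
    grand = all_y.mean()
    ssg = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)   # between
    sse = sum(((g - g.mean()) ** 2).sum() for g in groups)        # within
    k, N = len(groups), len(all_y)
    F = (ssg / (k - 1)) / (sse / (N - k))
    return F, f_dist.sf(F, k - 1, N - k)

rng = np.random.default_rng(7)
groups = [rng.normal(mu, 1, 20) for mu in (0.0, 0.5, 1.0)]
F, p = oneway_f(groups)
print(F, p)
print(f_oneway(*groups))   # library routine for comparison
```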
Interactive Resource: The Variance Partitioner
Design: A dot plot showing \(k\) groups of data points. Three overlapping histograms/density curves representing the distributions of SST, SSG, and SSE.
Interaction: The user clicks and drags the group means (\(\bar{Y}_i\)) left and right. As the group means move further apart, the SSG bar chart grows and the SSE remains constant. The F-statistic slider updates dynamically. If the user moves all means to the same location, SSG drops to 0 and \(F \to 0\). This directly links the geometric spread of the groups to the numerator of the F-test.
Content: A contrast \(C = \sum a_i \mu_i\) where \(\sum a_i = 0\). Estimating \(C\) with \(\hat{C} = \sum a_i \bar{Y}_i\) and constructing t-tests and confidence intervals.
Rigorous Focus: Contrasts allow targeted pairwise or complex comparisons while maintaining the interpretability of the ANOVA structure.
Content: The Multiple Comparisons Problem. If we test \(m\) hypotheses at level \(\alpha\), the family-wise error rate (FWER) inflates.
Rigorous Focus: Scheffé’s method allows all possible contrasts to be tested simultaneously with an overall FWER of \(\alpha\). The critical value is modified: \(\sqrt{(k-1)F_{k-1, N-k, \alpha}}\).
Interactive Resource: The Multiple Comparisons Trap
Design: A dashboard simulating \(k=5\) groups where \(H_0\) is strictly true (all means are equal).
Interaction: The user sets the per-comparison Type I error rate to \(\alpha = 0.05\). The tool runs 10,000 simulations, performing all \(\binom{5}{2} = 10\) pairwise t-tests. A bar chart shows the percentage of simulations where at least one false positive occurred (the FWER inflates to roughly \(30\%\)–\(40\%\), far above the nominal \(5\%\)). The user then switches to Scheffé’s adjustment, and the FWER drops to at most \(0.05\), visually proving the necessity of multiple comparison corrections.
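The FWER inflation can be simulated in a few lines. A minimal sketch (group size, simulation count, and the Normal null model are illustrative choices); note that because the pairwise tests share groups they are correlated, so the empirical FWER lands somewhat below the independence bound \(1 - 0.95^{10} \approx 0.40\) but still far above \(5\%\).

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_ind

rng = np.random.default_rng(8)
k, n, alpha, sims = 5, 20, 0.05, 2000

false_positive = 0
for _ in range(sims):
    groups = [rng.normal(0, 1, n) for _ in range(k)]   # H0 true: all means equal
    pvals = [ttest_ind(a, b).pvalue for a, b in combinations(groups, 2)]
    false_positive += min(pvals) < alpha               # did any of the 10 tests reject?

fwer = false_positive / sims
print(f"empirical FWER over {k * (k - 1) // 2} pairwise tests: {fwer:.3f}")
```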
Content: Model: \(Y_i = \alpha + \beta x_i + \epsilon_i\). Minimizing the sum of squared residuals \(S(\alpha, \beta) = \sum (Y_i - \alpha - \beta x_i)^2\).
Rigorous Focus: Taking partial derivatives to find the normal equations. The geometry of least squares: the observed vector \(\mathbf{Y}\) is orthogonally projected onto the column space of the design matrix \(\mathbf{X}\). The residuals \(\mathbf{e}\) are orthogonal to the fitted values \(\hat{\mathbf{Y}}\).
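Both the calculus and the geometry above are easy to verify numerically: the normal equations give \(\hat{\beta} = S_{xy}/S_{xx}\) and \(\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{x}\), and the resulting residual vector is orthogonal to the fitted values. A minimal sketch on simulated data (the true line and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.7 * x + rng.normal(0, 1, 50)

# Solutions of the normal equations from the partials of S(alpha, beta)
Sxx = ((x - x.mean()) ** 2).sum()
Sxy = ((x - x.mean()) * (y - y.mean())).sum()
beta_hat = Sxy / Sxx
alpha_hat = y.mean() - beta_hat * x.mean()

# Geometry of least squares: residuals are orthogonal to the fitted values
fitted = alpha_hat + beta_hat * x
resid = y - fitted
print(beta_hat, alpha_hat, resid @ fitted)   # inner product is ~ 0
```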
Interactive Resource: The Regression Sandbox
Design: A 2D scatterplot. A movable line with sliders for intercept (\(\alpha\)) and slope (\(\beta\)). Squares representing the squared residuals are drawn between the points and the line.
Interaction: The user manually adjusts \(\alpha\) and \(\beta\) to try and minimize the total area of the squares (SSR). A dynamic “SSR Meter” shows the current error. A “Snap to OLS” button animates the line jumping to the exact mathematical minimum, and the residual squares instantly reshape.
Content: The Gauss-Markov Theorem. Under assumptions \(E[\epsilon_i] = 0\) and \(Var(\epsilon_i) = \sigma^2\) (homoscedasticity, no normality required), the OLS estimators \(\hat{\alpha}\) and \(\hat{\beta}\) have the smallest variance among all linear unbiased estimators.
Rigorous Focus: Proving the theorem by showing that the variance of any other linear unbiased estimator differs from the OLS variance by a positive semi-definite matrix.
Interactive Resource: The Gauss-Markov Variance Smackdown
Design: A simulator generating non-normal errors (e.g., skewed Exponential errors). Two histograms: one for OLS \(\hat{\beta}\), one for an alternative linear unbiased estimator (e.g., a line fit perfectly through the first and last data point).
Interaction: The user runs thousands of simulations. Both histograms are centered at the true \(\beta\) (both unbiased). However, the alternative estimator’s histogram is visibly wider. The tool calculates the empirical variances, confirming \(Var(\hat{\beta}_{OLS}) < Var(\hat{\beta}_{alt})\), demonstrating the power of Gauss-Markov without the crutch of Normality.
Content: Adding the assumption \(\epsilon_i \sim N(0, \sigma^2)\).
OLS estimators coincide with MLEs.
\(\hat{\beta} \sim N(\beta, \frac{\sigma^2}{S_{xx}})\) and \(\hat{\alpha} \sim N(\alpha, \sigma^2 (\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}))\).
\(S^2 = \frac{\sum(Y_i - \hat{Y}_i)^2}{n-2}\) is an unbiased estimator for \(\sigma^2\).
t-tests for \(H_0: \beta = \beta_0\) using the statistic \(T = \frac{\hat{\beta} - \beta_0}{S/\sqrt{S_{xx}}} \sim t_{n-2}\).
Interactive Resource: The Parameter Geometry Engine
Design: A 3D surface plot of the log-likelihood \(\ell(\alpha, \beta, \sigma^2 | \mathbf{x}, \mathbf{y})\).
Interaction: The user rotates the 3D surface. The tool overlays the OLS normal equations as geometric planes slicing through the peak. It highlights the ridge corresponding to \(\sigma^2\), showing how maximizing over \(\alpha\) and \(\beta\) first yields the profile likelihood for \(\sigma^2\).
Content: Two distinct goals:
Confidence Interval for \(E(Y|x_0)\): Estimating the mean of the distribution at \(x_0\). \(Var(\hat{Y}_0) = \sigma^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)\).
Prediction Interval for \(Y_0\): Predicting an individual new observation at \(x_0\). \(Var(Y_0 - \hat{Y}_0) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right)\).
Rigorous Focus: The prediction interval is strictly wider because it must account for the intrinsic variance of the new observation (\(\sigma^2\)) plus the variance of estimating the mean.
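The two variance formulas above translate directly into interval half-widths. A minimal sketch on simulated data (the true line, noise level, and the hypothetical helper `intervals` are illustrative assumptions), showing the strictly wider prediction interval and the "bowtie" widening away from \(\bar{x}\):

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(10)
x = rng.uniform(0, 10, 30)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 30)

n = len(x)
Sxx = ((x - x.mean()) ** 2).sum()
beta = ((x - x.mean()) * (y - y.mean())).sum() / Sxx
a_hat = y.mean() - beta * x.mean()
s2 = ((y - a_hat - beta * x) ** 2).sum() / (n - 2)   # unbiased for sigma^2
tcrit = t_dist.ppf(0.975, n - 2)

def intervals(x0):
    """95% CI for E(Y|x0) and PI for a new Y0 at x0."""
    yhat = a_hat + beta * x0
    lever = 1 / n + (x0 - x.mean()) ** 2 / Sxx
    ci = tcrit * np.sqrt(s2 * lever)          # mean response
    pi = tcrit * np.sqrt(s2 * (1 + lever))    # new observation: extra sigma^2
    return yhat - ci, yhat + ci, yhat - pi, yhat + pi

print(intervals(x.mean()))       # narrowest point
print(intervals(x.mean() + 5))   # both bands widen away from the mean of x
```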
Interactive Resource: The Band Splitter
Design: A scatterplot with the OLS regression line. Two shaded regions surrounding the line.
Interaction: The user clicks along the x-axis to select \(x_0\). A vertical slice appears showing the Normal distribution of \(E(Y|x_0)\) and the wider Normal distribution of \(Y_0\). The user adjusts a slider for \(x_0\). As \(x_0\) moves away from \(\bar{x}\), both bands “bowtie” outward (variance inflation), but the Prediction Interval always maintains a minimum width determined by the irreducible error \(\sigma^2\).
Content: Constructing a confidence band for the entire regression line over an interval of \(x\) values, maintaining a family-wise confidence level of \(1-\alpha\). The Working-Hotelling method: \(E(Y|x) \in \hat{\alpha} + \hat{\beta}x \pm \sqrt{2F_{2, n-2, \alpha}} S\sqrt{\frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}}\).
Interactive Resource: The Working-Hotelling Enveloper
Design: The regression line with individual confidence intervals drawn at many \(x\) points.
Interaction: The user clicks “Simulate”. The tool draws 100 new samples, fitting 100 regression lines, and calculating 100 individual CIs. It highlights that at the edges (extreme \(x\) values), individual CIs fail to capture the true line ~5% of the time. It then overlays the Working-Hotelling band, which smoothly contains the true line in exactly 95% of simulations, demonstrating the geometric tightening required for simultaneous inference.
Matrix/Linear Algebra Backend: While SLR can be formulated with scalar sums (\(S_{xx}, S_{xy}\)), introducing the matrix formulation (\(\hat{\beta} = (X^T X)^{-1} X^T Y\)) under the hood sets the stage for Module 12 and is standard in Casella & Berger. Use NumPy.linalg to handle these operations efficiently and cleanly.
Interactive Geometry Engine: For the Regression Sandbox, the visualization of residuals as squares is a classic teaching tool. Use D3.js or Canvas to render actual squares stretching between the data points and the regression line. As the user drags the line, the squares must skew and resize in real-time, and the total area (SSR) must update instantly.
3D Likelihood Surface Renderer: For the Parameter Geometry Engine, use Plotly.js or Three.js to render the bivariate Normal log-likelihood surface. The surface must be semi-transparent so the user can see the MLE point and the normal equation planes intersecting at the base.
Non-Normal Random Number Generator: To properly demonstrate the Gauss-Markov Variance Smackdown, the simulator must easily draw from non-Normal distributions (e.g., Chi-squared, Exponential, Uniform) to prove that OLS remains BLUE without Normality.
Multiple Comparison Math: Scheffé’s method requires looking up F-distribution quantiles. The backend must dynamically calculate \(F_{k-1, N-k, \alpha}\) to scale the t-intervals correctly for the interactive simulator.
Module 11 established Simple Linear Regression under idealized conditions (exact \(x\), Normal errors, constant variance). This module tackles the messy realities of data that violate these assumptions. It rigorously addresses three major departures: (1) measurement error in the predictors (Errors in Variables), (2) non-Normal, categorical response variables (Logistic Regression), and (3) outliers and heavy-tailed error distributions (Robust Regression). The focus shifts from closed-form OLS solutions to iterative numerical optimization (MLE, IRWLS) and the analysis of estimator sensitivity.
Learning Objectives:
Differentiate between Functional and Structural relationships in Errors-in-Variables (EIV) models.
Prove and visualize the attenuation bias (regression dilution) caused by measurement error in OLS.
Derive the orthogonal/Deming regression solution assuming known error variance ratios.
Formulate the Logistic Regression model using the Bernoulli distribution and the logit link function.
Derive the MLE for logistic regression via Iteratively Reweighted Least Squares (IRWLS) and compute asymptotic inference using the observed Fisher Information.
Define M-estimators for regression, construct robust loss functions (Huber, Tukey), and compute bounded influence weights.
Diagnose leverage points and distinguish between good and bad leverage points in robust regression.
Content: In standard regression, \(x_i\) is fixed and known. In EIV, we observe \(W_i = x_i + U_i\) where \(U_i\) is measurement error.
Functional Relationship: The true \(x_i\) are fixed, unknown constants (nuisance parameters).
Structural Relationship: The true \(X_i\) are random variables \(X_i \sim N(\mu_X, \sigma_X^2)\).
Rigorous Focus: The model is unidentifiable without additional information (e.g., knowing the ratio \(\lambda = \sigma_U^2 / \sigma_V^2\), where \(V_i\) is the error in \(Y\)).
Content:
The Attenuation Effect: If we naively regress \(Y\) on \(W\), the OLS estimator \(\hat{\beta}_{OLS}\) is biased toward 0. \(E[\hat{\beta}_{OLS}] \approx \beta \frac{\sigma_X^2}{\sigma_X^2 + \sigma_U^2}\).
Orthogonal/Deming Regression: Minimizing the perpendicular distances to the line, weighted by the error variance ratio \(\lambda = \sigma_U^2 / \sigma_V^2\).
Interactive Resource: The Attenuation Dilution Simulator
Design: A scatterplot of true \((x, y)\) points forming a tight line, overlaid with a blurred scatterplot of observed \((W, Y)\) points.
Interaction: The user controls the measurement error variance \(\sigma_U^2\) via a slider. As \(\sigma_U^2\) increases, the \(W\) values spread out horizontally. The tool dynamically fits the naive OLS line (regressing \(Y\) on \(W\)). The line visually flattens toward 0, and a numeric readout displays the shrinking slope, proving the attenuation bias. A second button fits the Deming Regression line, which remains stable and accurate regardless of \(\sigma_U^2\).
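The attenuation the simulator shows follows the formula \(E[\hat{\beta}_{OLS}] \approx \beta\,\sigma_X^2/(\sigma_X^2 + \sigma_U^2)\). A minimal sketch with illustrative choices \(\beta = 1.5\) and \(\sigma_X^2 = 1\): the naive slope of \(Y\) on \(W\) shrinks toward 0 as \(\sigma_U^2\) grows.

```python
import numpy as np

rng = np.random.default_rng(11)
n, beta = 2000, 1.5
x = rng.normal(0, 1, n)                    # true predictor, sigma_X^2 = 1
y = 2.0 + beta * x + rng.normal(0, 0.3, n)

def naive_slope(sigma_u):
    """OLS slope of Y on the error-contaminated W = x + U."""
    w = x + rng.normal(0, sigma_u, n)
    return np.cov(w, y, ddof=1)[0, 1] / np.var(w, ddof=1)

for su in (0.0, 0.5, 1.0):
    # Theoretical attenuation factor: 1 / (1 + sigma_U^2) here
    print(f"sigma_U={su}: slope ~ {naive_slope(su):.3f}, "
          f"theory ~ {beta / (1 + su ** 2):.3f}")
```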
Content: Constructing confidence intervals for the slope \(\beta\) in EIV models.
Rigorous Focus: The variance of the orthogonal regression estimator is larger than the naive OLS variance, correctly reflecting the increased uncertainty due to measurement error.
Content: Modeling a binary response \(Y_i \in \{0, 1\}\). \(Y_i \sim \text{Bernoulli}(p_i)\), where the logit of the probability is a linear function of predictors: \(\log\left(\frac{p_i}{1-p_i}\right) = \alpha + \beta x_i\).
Rigorous Focus: Why OLS is inappropriate (predicted probabilities \(<0\) or \(>1\), non-constant variance \(Var(Y_i) = p_i(1-p_i)\)).
Content: The Likelihood function: \(L(\alpha, \beta | \mathbf{x}, \mathbf{y}) = \prod p_i^{y_i} (1-p_i)^{1-y_i}\). Because the score equations are non-linear, we use numerical optimization.
Rigorous Focus: The Newton-Raphson algorithm for MLE simplifies to Iteratively Reweighted Least Squares (IRWLS). The update step solves a weighted least squares problem where the weights are \(W_i = n_i p_i (1-p_i)\) and the working response is \(Z_i = \hat{\eta}_i + \frac{Y_i - \hat{p}_i}{\hat{p}_i(1-\hat{p}_i)}\).
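The IRWLS updates above (with Bernoulli data, so \(n_i = 1\)) fit in a short loop: form the weights \(W_i = \hat{p}_i(1-\hat{p}_i)\) and working response \(Z_i\), then solve a weighted least squares problem. A minimal sketch on simulated data (the true coefficients are illustrative assumptions; moderate coefficients are assumed so the weights stay away from 0):

```python
import numpy as np

def logistic_irwls(x, y, iters=25):
    """Fit logit(p) = a + b*x by Newton-Raphson written as IRWLS."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        eta = X @ beta
        p = 1 / (1 + np.exp(-eta))
        W = p * (1 - p)                 # IRWLS weights (n_i = 1 here)
        z = eta + (y - p) / W           # working response Z_i
        # Weighted least squares step: (X'WX)^{-1} X'Wz
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

rng = np.random.default_rng(12)
x = rng.normal(0, 1, 500)
true = np.array([-0.5, 1.2])            # illustrative true (alpha, beta)
y = rng.binomial(1, 1 / (1 + np.exp(-(true[0] + true[1] * x))))
print(logistic_irwls(x, y))
```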
Interactive Resource: The Logit Bender & IRWLS Tracker
Design: A binary scatterplot (0s and 1s on the y-axis). An S-shaped logistic curve is overlaid.
Interaction: The user drags sliders for \(\alpha\) and \(\beta\) to manually fit the curve. A dynamic “Log-Likelihood Meter” shows how the likelihood changes.
Rigor Check: The user clicks “IRWLS Step”. The tool calculates the working response \(Z_i\) and weights \(W_i\), plots them as a transformed weighted scatterplot behind the curve, and fits a Weighted Least Squares line to it. It then updates the logistic curve. Repeated clicks animate the algorithm converging, visualizing how logistic regression iteratively re-weights the data to handle heteroscedasticity.
Worked Example: Asymptotic Inference
Content: OLS minimizes \(\sum (Y_i - \alpha - \beta x_i)^2\). Because the loss function \(\rho(r) = r^2\) increases rapidly, a single outlier can drag the regression line arbitrarily far.
Rigorous Focus: Differentiating between:
Vertical outliers: Outliers in \(Y\) (OLS handles poorly).
Leverage points: Outliers in \(X\) (OLS handles very poorly).
Bad leverage points: Leverage points that are also vertical outliers (catastrophic for OLS).
Interactive Resource: The Leverage Point Injector
Design: A scatterplot of well-behaved data with an OLS line and a Robust M-estimator line (e.g., Huber) perfectly overlapping.
Interaction: The user clicks anywhere on the canvas to inject a data point. If they inject a vertical outlier, the OLS line deflects slightly. If they inject a point at an extreme \(x\) value (high leverage) with a misfitting \(y\) value, the OLS line wildly pivots to pass through it, while the Robust line remains stable. The tool dynamically calculates Cook’s Distance for the injected point.
Content: Generalizing OLS by minimizing \(\sum \rho((Y_i - \alpha - \beta x_i)/\hat{\sigma})\), where \(\rho\) is a robust loss function, and \(\hat{\sigma}\) is a robust scale estimate (e.g., MAD).
Huber Loss: \(\rho(r) = \frac{1}{2}r^2\) for \(|r| \le c\), and \(c|r| - \frac{1}{2}c^2\) for \(|r| > c\). (Quadratic near 0, linear in the tails).
Tukey’s Biweight: \(\rho(r)\) that flattens out completely for extreme outliers, entirely rejecting their influence.
Interactive Resource: The Loss Function Forge
Design: A dual-panel interface. Left: Plots of different \(\rho(r)\) and their derivatives \(\psi(r)\) (the influence function). Right: A scatterplot with outliers.
Interaction: The user adjusts the tuning constant \(c\) for the Huber loss. The right panel dynamically fits the regression line. As \(c \to \infty\), the influence function becomes unbounded, and the line behaves exactly like OLS (pulled by outliers). As \(c \to 0\), it behaves like \(L_1\) regression (median). The user can switch to Tukey’s loss, where extreme outliers are ignored entirely (their weight drops to 0).
Content: Solving the M-estimator requires solving \(\sum \psi(r_i/\hat{\sigma})\, x_{ij} = 0\). This is achieved via IRWLS, with weights \(w_i = \frac{\psi(r_i/\hat{\sigma})}{r_i/\hat{\sigma}}\).
Rigorous Focus: The weight function \(w(r)\) dictates the influence. For Huber, \(w(r) \to c/|r|\) as \(|r| \to \infty\), meaning outliers are down-weighted proportional to their distance. For Tukey, \(w(r) \to 0\).
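The IRWLS loop for a robust simple linear regression combines the pieces above: Huber weights \(w(r) = \psi(r)/r\) and the MAD scale estimate from the implementation notes. A minimal sketch (the data, the 5% injected vertical outliers, and the default \(c = 1.345\) are illustrative assumptions):

```python
import numpy as np

def huber_weights(r, c=1.345):
    """w(r) = psi(r)/r for the Huber psi: 1 inside [-c, c], c/|r| outside."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / a)

def robust_slr(x, y, c=1.345, iters=50):
    """Huber M-estimator for y = a + b*x via IRWLS with a MAD scale."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS starting values
    for _ in range(iters):
        r = y - X @ beta
        sigma = 1.4826 * np.median(np.abs(r - np.median(r)))   # MAD scale
        w = huber_weights(r / sigma, c)
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta

rng = np.random.default_rng(13)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 100)
y[:5] += 40                                        # five gross vertical outliers
ols = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)[0]
print("OLS:", ols, "Huber:", robust_slr(x, y))
```

The shrinking weights on the five outliers are exactly what the Weight Tracker renders as shrinking circles.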
Interactive Resource: The Weight Tracker
Design: A scatterplot where each data point is a circle. The radius of the circle represents its IRWLS weight \(w_i\).
Interaction: The user runs the robust regression algorithm step-by-step. Inlier points maintain large circles (weight \(\approx 1\)). Outliers visibly shrink as the algorithm iterates, visually demonstrating how robust regression “turns down the volume” on bad data points, allowing the line to fit the majority of the data.
Numerical Optimization Solvers (Logistic & Robust): Module 12 relies heavily on iterative algorithms that have no closed-form solution. The backend must expose robust optimization libraries. For Logistic MLE, use scipy.optimize.minimize (e.g., L-BFGS-B). For Robust M-estimators, implement the IRWLS loop manually, ensuring convergence checks (tolerances) and step-halving to prevent divergence.
Matrix Algebra for IRWLS: Both Logistic Regression and Robust Regression utilize IRWLS. The core operation is Weighted Least Squares: \(\hat{\beta} = (X^T W X)^{-1} X^T W Z\). Use NumPy to compute this efficiently. Ensure the design matrix \(X\) includes a column of 1s for the intercept.
Robust Scale Estimators (MAD): The Robust Regression M-estimator requires a preliminary estimate of scale \(\hat{\sigma}\) that is itself robust to the very outliers we are trying to ignore. Implement the Median Absolute Deviation (MAD): \(\hat{\sigma} = 1.4826 \times \text{median}(|r_i - \text{median}(r)|)\). The constant 1.4826 ensures consistency for the Normal distribution.
Outlier Injection UI Mechanics: For the Leverage Point Injector, the UI must allow precise clicking. Map pixel coordinates directly to the \((x, y)\) data space. After injecting a point, the OLS and Robust models must refit instantly (within \(\sim 100\)ms) to provide satisfying interactive feedback. Use WebGL or highly optimized Canvas rendering if rendering thousands of points.
Information Matrix Calculator for Logistic: To provide inference (standard errors, Wald tests) for Logistic Regression, the engine must calculate the Observed Fisher Information \(I(\hat{\beta}) = X^T \text{diag}(p_i(1-p_i)) X\). The tool should invert this matrix on the fly to return the variance-covariance matrix to the frontend for display.