# Load necessary libraries (install if needed)
# install.packages(c("tidyverse", "gganimate", "igraph", "ggraph", "plotly", "stringdist", "pROC", "HardyWeinberg", "viridisLite", "visNetwork", "networkD3"))
library(tidyverse)
library(gganimate)
library(igraph)
library(ggraph)
library(plotly)
library(stringdist)
library(pROC)
library(HardyWeinberg)
library(viridisLite)
library(visNetwork)
library(networkD3)

1 Overview

This Data Science and Predictive Analytics (DSPA2) Appendix covers Epidemiology, the scientific discipline that investigates the distribution, determinants, and control of health-related states or events (including diseases) in specified populations. It applies this knowledge to control health problems and improve public health outcomes. Historically, epidemiology originated from the study of infectious disease outbreaks, such as John Snow’s investigation of the 1854 cholera epidemic in London, which linked contaminated water sources to disease spread. In modern times, the field has broadened to include non-infectious conditions like chronic diseases (e.g., cancer, diabetes), environmental exposures (e.g., air pollution, toxins), behavioral factors (e.g., smoking, diet), and genetic influences. This helps with understanding that health outcomes arise from complex interactions between genetic predispositions, environmental factors, and social determinants.

A core framework in epidemiology is the “epidemiologic triad” of agent, host, and environment, but contemporary approaches emphasize the “person, place, and time” dimensions:

Person: Characteristics of individuals, such as age, sex, genetics, socioeconomic status, and behaviors that influence susceptibility or exposure.
Place: Geographic variations, including urban vs. rural settings, climate, or access to healthcare, which can reveal environmental or social risk factors.
Time: Temporal patterns, such as seasonal trends, secular changes over years, or sudden outbreaks, helping identify emerging threats or intervention effects.

This appendix expands the Scientific Methods for Health Sciences (SMHS) Ebook Epidemiology Section, delving into genetic epidemiology and bridging molecular biology with population-level analysis to identify risk factors and outcomes. It explores how genetic variations contribute to disease patterns and how computational tools can quantify these relationships.

2 Motivation

This module equips learners with foundational knowledge in genetic epidemiology, enabling them to integrate genetic data into public health practice. By the end of this module, learners should be able to:

Understand Genetic Foundations: Describe the structure and key features of the human genome, explain the types and distributions of mutations, and apply principles of Mendelian inheritance, including segregation (independent assortment of alleles) and linkage (genes on the same chromosome tend to be inherited together unless separated by recombination).
Analyze Population Dynamics: Utilize quantitative genetic concepts to examine the interplay between genetic variation and phenotypic (disease) variation in populations, including calculations of allele and genotype frequencies and testing for Hardy-Weinberg equilibrium to detect deviations indicative of evolutionary pressures.
Evaluate Associations: Identify common gene-disease relationships (e.g., monogenic vs. polygenic traits), interpret results from Genome-Wide Association Studies (GWAS) to pinpoint susceptibility loci, and recognize gene-environment interactions (e.g., how smoking exacerbates genetic risks for lung cancer).
Apply Computational Methods: Conduct basic genetic association analyses using statistical software like R, interpret epidemiological measures such as Number Needed to Treat (NNT), Odds Ratio (OR), and Relative Risk (RR), and understand their implications for clinical and public health decision-making.
Additional Skills: Critically evaluate study designs (e.g., cohort vs. case-control), account for confounders like population stratification in genetic studies, and discuss ethical considerations in genetic epidemiology, such as privacy in genomic data.

These objectives align with real-world applications, such as designing targeted interventions (e.g., pharmacogenomics) or predicting disease outbreaks through genomic surveillance.

3 Theory: The Human Genome and Mutation

The human genome comprises approximately 3 billion base pairs of DNA, encoding around 20,000-25,000 genes that orchestrate cellular functions, development, and responses to the environment. Mutations, changes in this DNA sequence, can arise spontaneously or from external factors (e.g., radiation, chemicals) and may lead to diseases if they disrupt gene function. Understanding genomic structure and mutations is crucial for identifying genetic risk factors in epidemiological studies.

Figure 1: Illustration of human chromosomal structure, highlighting key features like centromeres, telomeres, and banding patterns.

3.1 Chromosomal Structure

Chromosomes are thread-like structures in the cell nucleus that package and organize DNA for efficient replication and segregation during cell division. Each human cell (except gametes) contains 46 chromosomes.

Banding: Cytogenetic staining techniques (e.g., Giemsa staining) produce visible bands on chromosomes. Dark bands (G-bands) are AT-rich, heterochromatic regions with fewer genes, while light bands (R-bands) are GC-rich, euchromatic, and gene-dense. These patterns, spanning millions of nucleotides, aid in identifying chromosomal abnormalities in karyotyping.
Karyotype: The complete set of chromosomes arranged by size and shape. A typical human karyotype includes 22 pairs of autosomes (numbered 1–22) and one pair of sex chromosomes (XX in females, XY in males). Abnormal karyotypes, such as trisomy 21 (Down syndrome), can be detected via techniques like fluorescence in situ hybridization (FISH).
Functional Elements:
- Centromeres: Central constrictions composed of repetitive alpha-satellite DNA sequences. They serve as attachment points for spindle fibers during mitosis and meiosis, ensuring proper chromosome segregation. Centromeric dysfunction can lead to aneuploidy (abnormal chromosome numbers).
- Telomeres: Protective caps at chromosome ends, consisting of repetitive TTAGGG sequences (in humans) bound by shelterin proteins. Telomeres shorten with each somatic cell division due to the “end-replication problem,” contributing to cellular aging (senescence). In germ cells and stem cells, telomerase enzyme maintains telomere length. Shortened telomeres are linked to diseases like dyskeratosis congenita and increased cancer risk.
- Chromatin: The complex of DNA and proteins (histones) that forms chromosomes. It exists in two states:
  - Euchromatin: Loosely packed, transcriptionally active, and rich in genes and regulatory elements.
  - Heterochromatin: Tightly packed, transcriptionally silent, often containing repetitive DNA (e.g., satellites, transposons) near centromeres and telomeres. Epigenetic modifications (e.g., histone methylation) regulate chromatin states.

3.2 Mutations and Abnormalities

Mutations are heritable changes in the DNA sequence, occurring at a rate of about \(10^{-8}\) per nucleotide per generation. They can be somatic (acquired in body cells, e.g., leading to cancer) or germline (in gametes, passed to offspring). Mutations drive genetic diversity but can cause disease if they affect critical genes.

Figure 2: Schematic of common structural chromosomal abnormalities, including deletion, duplication, translocation, and inversion.

Structural Abnormalities (Large-scale changes visible under a microscope, affecting thousands to millions of base pairs):
- Deletion: Removal of a chromosomal segment, leading to loss of genes (e.g., cri-du-chat syndrome from 5p deletion).
- Duplication: Extra copy of a segment, potentially causing gene dosage imbalances (e.g., Charcot-Marie-Tooth disease from PMP22 duplication).
- Translocation: Exchange of segments between non-homologous chromosomes. Balanced translocations may be asymptomatic but increase risks in offspring; unbalanced ones cause disorders (e.g., chronic myeloid leukemia from t(9;22) “Philadelphia chromosome”).
- Inversion: Reversal of a segment within a chromosome, which can disrupt genes or lead to abnormal recombination (e.g., hemophilia A inversions).
- Other Types: Isochromosomes (duplicated arms, e.g., i(Xq) in Turner syndrome) or ring chromosomes (circular formations from deletions at both ends).
Point Mutations (Small-scale changes at the nucleotide level, detected via sequencing):
- Nucleotide Substitution: Replacement of one base with another.
  - Silent: No amino acid change (due to codon degeneracy).
  - Missense: Amino acid change (e.g., sickle cell anemia from GAG to GTG in beta-globin gene).
  - Nonsense: Introduces premature stop codon, truncating the protein (e.g., cystic fibrosis).
- Indels: Insertions or deletions of nucleotides. Small indels can cause frameshifts, altering the reading frame and producing dysfunctional proteins (e.g., Tay-Sachs disease).
- Splice Site Variation: Mutations in intronic regions affecting mRNA splicing, leading to exon skipping or inclusion of introns (e.g., beta-thalassemia).
Epidemiological Relevance: Mutations’ population distribution informs disease prevalence. Rare mutations cause Mendelian disorders (e.g., Huntington’s), while common variants (SNPs) contribute to complex traits via polygenic risk scores. Tools like next-generation sequencing (NGS) enable large-scale mutation detection in cohort studies.

4 Population Genetics

Population genetics examines how genetic variation is maintained, distributed, and evolves in groups, providing the mathematical foundation for genetic epidemiology. It helps predict disease risks based on allele frequencies and detect deviations signaling selection or population structure.

4.1 Allele and Genotype Frequencies

The gene pool is the total collection of alleles in a population at a given time. Frequencies are key metrics for assessing genetic diversity and disease susceptibility.

Allele Frequency: Proportion of a specific allele at a locus. For a biallelic locus, frequencies sum to 1.

\[ \text{Allele Frequency (p for A)} = \frac{2 \times \text{Number of AA} + \text{Number of Aa}}{2 \times \text{Total individuals}}. \]

Example: In a population of 100 people with genotypes: 40 AA, 50 Aa, 10 aa. p (A) = (240 + 50) / 200 = 0.65; q (a) = (210 + 50) / 200 = 0.35.

Genotype Frequency: Proportion of individuals with a specific genotype (e.g., P(AA) = number of AA / total individuals).
Applications: Used in GWAS to compare frequencies between cases and controls; deviations can indicate associations.

4.2 Hardy-Weinberg Equilibrium (HWE)

HWE is a null model assuming no evolutionary forces, predicting genotype frequencies from allele frequencies in a stable population. Assumptions: Large population, random mating, no mutation, no migration, no selection. For a biallelic locus with alleles A (p) and a (q=1-p):

\[ P(AA) = p^2\ (homozygous\ dominant) \]

\[ P(Aa) = 2pq\ (heterozygous) \]

\[ P(aa) = q^2\ (homozygous\ recessive) \]

Deviations from HWE: Indicate forces like inbreeding (excess homozygotes), assortative mating, or genotyping errors. In epidemiology, HWE testing filters SNPs in controls to ensure data quality.
Testing for HWE: Use Chi-squared goodness-of-fit test.

Formula: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}, \] \(df=1\) for biallelic loci.

Critical value: \(>3.84\) for \(p<0.05\) (reject HWE).

Example:

Observed: 400 AA, 500 Aa, 100 aa (total 1000). p=0.65, q=0.35.
Expected: 422.5 AA, 455 Aa, 122.5 aa.
\(\Chi^2\approx 10.5\) (p<0.05, deviation).

R Implementation for HWE: This code uses base R for manual calculation but adds input validation, exact test option (for small samples), and clearer output. For advanced use, consider the HardyWeinberg package.

# Hardy-Weinberg Equilibrium Test
# Inputs: Observed genotype counts as a named vector (AA, Aa, aa)
obs <- c(AA = 400, Aa = 500, aa = 100)  # Example observed counts

# Input validation
if (any(obs < 0) || length(obs) != 3) stop("Invalid genotype counts")

total <- sum(obs)
if (total == 0) stop("No individuals in population")

# Calculate allele frequencies
p <- (2 * obs["AA"] + obs["Aa"]) / (2 * total)  # Frequency of A
q <- 1 - p  # Frequency of a

# Expected counts under HWE
expected <- c(AA = p^2 * total, Aa = 2 * p * q * total, aa = q^2 * total)

# Chi-squared test (ensure expected >5 for validity)
if (any(expected < 5)) warning("Expected counts <5; consider exact test")

chi2 <- sum((obs - expected)^2 / expected)
df <- 1  # Degrees of freedom for biallelic
p_val_chi <- pchisq(chi2, df = df, lower.tail = FALSE)

# Optional: Fisher's exact test for small samples (using contingency table simulation)
# But for simplicity, we use chi-squared here

# Output results
cat("Allele Frequencies: p(A) =", round(p, 3), "q(a) =", round(q, 3), "\n")

## Allele Frequencies: p(A) = 0.65 q(a) = 0.35

cat("Expected Counts: AA =", round(expected["AA"], 1), "Aa =", round(expected["Aa"], 1), "aa =", round(expected["aa"], 1), "\n")

## Expected Counts: AA = NA Aa = NA aa = NA

cat("Chi-squared =", round(chi2, 4), "df =", df, "p-value =", format.pval(p_val_chi), "\n")

## Chi-squared = 9.7814 df = 1 p-value = 0.0017628

if (p_val_chi < 0.05) {
  cat("Reject HWE: Possible deviation due to selection, inbreeding, or error.\n")
} else {
  cat("Fail to reject HWE: Population appears in equilibrium.\n")
}

## Reject HWE: Possible deviation due to selection, inbreeding, or error.

5 Pedigree Analysis and Inheritance

Pedigrees are diagrammatic representations of family histories that illustrate the inheritance patterns of traits or diseases across generations. They are essential tools in genetic epidemiology for identifying modes of inheritance, estimating risks, and guiding genetic counseling. Symbols in pedigrees typically include squares for males, circles for females, filled shapes for affected individuals, and slashes for deceased persons. Arrows may indicate the proband (the individual through whom the family is ascertained). Pedigrees help visualize vertical (parent-to-child) or horizontal (sibling) transmission patterns, skips in generations, and sex-specific effects.

Figure 3: Example pedigree illustrating inheritance patterns, with symbols for affected individuals, carriers, and relationships across generations.

5.1 Modes of Inheritance

Genetic traits and diseases follow specific inheritance patterns based on the chromosomal location and dominance of alleles. Understanding these modes aids in predicting disease risks and designing targeted epidemiological studies. Below, we detail common modes, including characteristics, pedigree patterns, examples, and epidemiological implications.

Autosomal Dominant (AD):
- Definition: A single copy of the dominant allele (D) is sufficient to express the phenotype. Affected individuals are heterozygous (Dd) or homozygous (DD), while dd are unaffected.
- Pedigree Characteristics: Vertical transmission (affected individuals in every generation); no skipped generations if fully penetrant. Affects males and females equally; 50% risk to offspring of an affected parent.
- Example: Huntington’s disease (CAG repeat expansion in HTT gene on chromosome 4). Late-onset, neurodegenerative disorder with high penetrance.
- Epidemiological Notes: Often rare; founder effects in isolated populations increase prevalence. Used in linkage studies to map genes.
- Limitations: Incomplete penetrance or variable expressivity (different symptom severity) can mimic other patterns.
Autosomal Recessive (AR):
- Definition: Requires two copies of the recessive allele (dd) for expression. Heterozygotes (Dd) are asymptomatic carriers.
- Pedigree Characteristics: Horizontal transmission (affected siblings from unaffected parents); skips generations. Consanguinity (e.g., cousin marriages) increases risk; equal male-female incidence.
- Example: Cystic fibrosis (mutations in CFTR gene on chromosome 7). Affects lung and digestive functions; carrier frequency ~1/25 in Caucasians.
- Epidemiological Notes: Higher prevalence in populations with high carrier rates (e.g., due to heterozygote advantage, like sickle cell anemia in malaria-endemic areas). Screening programs target carriers.
- Limitations: Phenocopies (environmental mimics) can complicate diagnosis.
X-Linked Recessive (XR):
- Definition: Mutation on the X chromosome; males (X^c Y) express with one copy due to hemizygosity, while females (X^C X^c) are carriers (often unaffected due to X-inactivation).
- Pedigree Characteristics: No male-to-male transmission; affected males pass to all daughters (carriers) but no sons. Mother-to-son transmission; more males affected.
- Example: Hemophilia A (mutations in F8 gene on Xq28). Bleeding disorder; historical prevalence in European royalty.
- Epidemiological Notes: Skewed sex ratios in affected individuals; carrier testing in families. Lyonization (random X-inactivation) can cause mild symptoms in females.
- Limitations: De novo mutations can appear without family history.
Additional Modes:
- X-Linked Dominant (XD): Rare; affects females more (e.g., Rett syndrome, MECP2 gene). Lethal in males or mild; female-to-offspring transmission.
- Y-Linked: Male-only transmission (e.g., azoospermia factors on Y chromosome). Rare, as Y has few genes.
- Mitochondrial: Maternal inheritance (mtDNA from mother); affects both sexes but no father-to-child. Variable due to heteroplasmy (mixed mtDNA). Example: Leber’s hereditary optic neuropathy.

5.2 Probability in Pedigrees

Probabilistic models quantify inheritance risks, incorporating prior probabilities, transmission, and penetrance. This is crucial for Bayesian risk assessment in genetic counseling.

Key Formula: The likelihood of observing a pedigree under a genetic model is: \[ P(\text{pedigree}) = \prod_{i=1}^{n} P(\text{genotype}_i) \times P(\text{phenotype}_i | \text{genotype}_i), \]

where the product is over all individuals, \(P(\text{genotype}_i)\) is based on Mendelian laws and parental genotypes, and \(P(\text{phenotype}_i | \text{genotype}_i)\) accounts for penetrance.

Penetrance: Probability of phenotypic expression given a genotype (e.g., 100% for complete, <100% for incomplete). Incomplete penetrance (e.g., BRCA1 mutations in breast cancer) leads to unaffected gene carriers.
Variable Expressivity: Variation in phenotype severity among those with the same genotype (e.g., neurofibromatosis type 1).
Example: In an AD pedigree, risk to a child of an affected parent (Dd) is 50% (Dd) × penetrance.
Applications: Risk prediction software like BRCAPRO uses these models. In epidemiology, adjusts for ascertainment bias (families selected via affected probands).
Limitations: Assumes accurate family history; ignores modifiers like environment.

6 Linkage Analysis

Linkage analysis identifies genes by studying co-inheritance of markers and traits in families, leveraging the fact that nearby loci on a chromosome are less likely to separate during meiosis.

Genetic Linkage: Occurs when genes are close on the same chromosome, violating independent assortment.
Recombination Fraction (θ): Proportion of gametes with recombination between two loci. \[ \theta = 0.5: \text{No linkage (independent, >50 cM apart)}. \] \[ \theta < 0.5: \text{Linkage (e.g., θ=0.1 means 10% recombination)}. \] \[ \theta = 0: \text{Complete linkage (no recombination, syntenic loci)}. \] Units: Measured in centimorgans (cM); 1 cM ≈ 1% recombination ≈ 1 Mb DNA.
Phases: Coupling (alleles on same chromosome) vs. repulsion (opposite).
Applications: Parametric (assumes model) for Mendelian diseases; non-parametric for complex traits. Historical in positional cloning (e.g., cystic fibrosis gene).

6.1 LOD Score

The LOD (Logarithm of Odds) score statistically tests for linkage by comparing likelihoods.

Formula: \(Z(\theta) = \log_{10} \frac{L(\text{data} | \theta=\hat{\theta})}{L(\text{data} | \theta=0.5)}\), where \(\hat{\theta}\) is the maximum likelihood estimate of θ, and L is the likelihood function.
Calculation: Involves summing over possible phases and recombinants in pedigrees.
Interpretation: Positive Z favors linkage; threshold Z≥3 for significance (false positive rate 5%); Z≤-2 excludes linkage.

LOD Score	Odds Ratio	Interpretation
Z ≤ -2	≤1:100	Strong evidence against linkage (exclusion)
-2 < Z < 3	Indeterminate	Suggestive but not conclusive
Z ≥ 3	≥1000:1	Strong evidence for linkage (genome-wide significance)

Example: In a family with a disease and marker, if data is 1000 times more likely under θ=0.1 than 0.5, Z=3.
Limitations: Requires large pedigrees; sensitive to model misspecification (e.g., wrong penetrance). Superseded by GWAS for complex diseases.

7 Linkage Disequilibrium (LD) and Association

LD refers to non-random association of alleles at different loci in a population, often due to shared ancestry rather than physical linkage. It decays over generations via recombination and is key in association studies like GWAS, where markers tag causal variants.

Causes of LD: Mutation, selection, genetic drift, population admixture, or bottlenecks. High LD in isolated populations (e.g., Ashkenazi Jews).
Decay: LD decreases with genetic distance and time; measured in haplotype blocks.
Vs. Linkage: Linkage is family-based (meiotic); LD is population-based (historical).
Applications: Imputation in GWAS; fine-mapping; inferring evolutionary history.

7.1 Measures of LD

Common metrics quantify deviation from independence. Assume two biallelic loci: A/a (frequencies p_A, 1-p_A) and B/b (p_B, 1-p_B); haplotype frequency p_AB.

D (Disequilibrium Coefficient):

Formula: \(D_{AB} = p_{AB} - p_A p_B\).

Interpretation: D>0: Excess AB haplotypes; D<0: Deficit. D=0: Equilibrium.

Limitations: Sensitive to allele frequencies; not comparable across loci.

D’ (Normalized D):

Formula: if \(D<0\), \(D' = \frac{D}{\max(-p_A p_B, -(1-p_A)(1-p_B))}\) if \(D>0\), \(D' = \frac{D}{\min((1-p_A)p_B, p_A(1-p_B))}\).

Interpretation: Ranges -1 to 1; |D’|=1: Complete LD (no recombination evidence); |D’|<1: Partial.

Use: Detects historical recombination.

r² (Correlation Coefficient):

Formula: \(r^2 = \frac{D^2}{p_A (1-p_A) p_B (1-p_B)}\).

Interpretation: 0 to 1; r²=1: Perfect correlation (one SNP predicts another); r²>0.8: Strong proxy. Relates to power in association studies (effective sample size reduced by 1/r²).

Use: Preferred in GWAS for tag SNP selection; less biased by rare alleles.

7.2 Additional Considerations

Haplotypes: Combinations of alleles (e.g., estimated via EM algorithm in PHASE software).
Visualization: LD plots (triangular heatmaps) using tools like Haploview.
Best Practices: Account for population structure (e.g., via principal components) to avoid spurious LD. In epidemiology, LD informs polygenic risk scores.
Limitations: LD varies by ancestry (e.g., higher in Europeans than Africans); tagging fails for rare variants.

R Implementation for LD: This code computes \(D\), \(D'\), and \(r^2\) from example haplotype frequencies or counts. It uses base R for simplicity; for real data, consider packages like genetics or snpStats.

# Linkage Disequilibrium Measures
# Input: Haplotype counts (e.g., from population data)
# Assume biallelic loci A/a, B/b
count_AB <- 120  # AB haplotypes
count_Ab <- 80   # Ab
count_aB <- 60   # aB
count_ab <- 140  # ab
total <- sum(c(count_AB, count_Ab, count_aB, count_ab))

# Haplotype frequencies
p_AB <- count_AB / total
p_Ab <- count_Ab / total
p_aB <- count_aB / total
p_ab <- count_ab / total

# Allele frequencies
p_A <- p_AB + p_Ab
p_a <- 1 - p_A
p_B <- p_AB + p_aB
p_b <- 1 - p_B

# D
D <- p_AB - p_A * p_B

# D' (normalized)
D_max_pos <- min(p_A * p_b, p_a * p_B)
D_max_neg <- -min(p_A * p_B, p_a * p_b)
D_prime <- ifelse(D >= 0, D / D_max_pos, D / D_max_neg)

# r-squared
r_sq <- D^2 / (p_A * p_a * p_B * p_b)

# Output
cat("Allele Frequencies: p_A =", round(p_A, 3), "p_B =", round(p_B, 3), "\n")

## Allele Frequencies: p_A = 0.5 p_B = 0.45

cat("D =", round(D, 4), "\n")

## D = 0.075

cat("D' =", round(D_prime, 4), "\n")

## D' = 0.3333

cat("r-squared =", round(r_sq, 4), "\n")

## r-squared = 0.0909

if (abs(D_prime) == 1) cat("Complete LD detected.\n") else cat("Partial or no LD.\n")

## Partial or no LD.

7.3 Genome-Wide Association Studies (GWAS)

GWAS tests for correlation between genetic markers (SNPs) and phenotypes across the entire genome in unrelated individuals.

Statistical Model: Typically uses logistic regression for case-control studies. \[ \ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 \cdot \text{SNP} + \text{Covariates}. \]
Manhattan & QQ Plots: Used to visualize results. Because millions of tests are performed, strict significance thresholds (e.g., \(p < 5 \times 10^{-8}\)) are required to avoid false positives.

8 Gene-Environment Interactions

Disease risk is often modeled as a combination of genetics (\(G\)), environment (\(E\)), and their interaction (\(G \times E\)). \[ Y = \beta_0 + \beta_1 G + \beta_2 E + \beta_3 (G \times E) + \epsilon. \]

Interaction Models: - Synergistic: Genotype exacerbates the risk factor (or vice versa). - Independent: Both factors influence risk but do not interact.

Figure 4: Model: Genotype exacerbates the effect of the risk factor.

9 Core Epidemiological Measures

In addition to genetic metrics, standard epidemiological measures provide essential tools for assessing disease risk, evaluating interventions, and guiding public health decisions. These metrics help quantify associations between exposures and outcomes, estimate treatment effects, and inform policy. Below, we outline key measures, including definitions, formulas, interpretations, and examples. Where relevant, we include R code implementations for practical computation.

9.1 Absolute Risk Reduction (ARR)

Definition: The difference in event rates (incidences) between a control (or unexposed) group and a treatment (or exposed) group. ARR measures the absolute effect of an intervention or exposure on outcome risk.
Formula: \(ARR = I_{\text{control}} - I_{\text{treatment}}\), where \(I\) represents incidence (proportion of events).
Interpretation: A positive ARR indicates risk reduction (benefit); a negative ARR indicates increased risk (harm). It is straightforward but does not account for baseline risk.
Example: If the incidence of heart attacks is 10% in the control group and 7% in the treatment group, ARR = 0.10 - 0.07 = 0.03 (3% absolute reduction).
When to Use: Prospective studies like randomized controlled trials (RCTs); useful for communicating tangible benefits to patients.
Limitations: Sensitive to baseline risk; not ideal for comparing across populations with different event rates.

9.2 Relative Risk (RR)

Definition: The ratio of the incidence of an outcome in the exposed group to that in the unexposed group. RR assesses how much an exposure increases or decreases the probability of an event.
Formula: \(RR = \frac{I_{\text{exposed}}}{I_{\text{unexposed}}}\).
Interpretation: RR > 1 indicates increased risk due to exposure; RR < 1 indicates protective effect; RR = 1 indicates no association. It is multiplicative and accounts for baseline risk.
Example: In a cohort study, smokers have a 20% lung cancer incidence, while non-smokers have 2%. RR = 0.20 / 0.02 = 10 (smokers are 10 times more likely to develop lung cancer).
When to Use: Cohort studies or RCTs; preferred for common outcomes.
Limitations: Can overestimate associations for rare events; not applicable in case-control studies.

9.3 Odds Ratio (OR)

A 2×2 contingency table with odds ratio formula.

	Exposed	Unexposed	Total
Cases (Diseased)	a	c	a + c
Controls (Healthy)	b	d	b + d
Total	a + b	c + d	N

Odds Ratio (OR) Calculation: \[ OR = \frac{a / c}{b / d} = \frac{ad}{bc} \]

Formula: \(OR = \frac{ad}{bc}\)

Odds Ratio = [Exposed Cases × Unexposed Non-cases] / [Exposed Non-cases × Unexposed Cases].

Definition: The ratio of the odds of an outcome in the exposed group to the odds in the unexposed group. OR approximates RR when the outcome is rare.
Formula: From a 2x2 contingency table (a = exposed cases, b = exposed non-cases, c = unexposed cases, d = unexposed non-cases): \(OR = \frac{ad}{bc}\).
Interpretation: OR > 1 suggests positive association; OR < 1 suggests inverse association; OR = 1 suggests no association. It is often used in logistic regression.
Example: In a case-control study of diabetes and obesity: 80 obese diabetics (a), 20 obese non-diabetics (b), 30 non-obese diabetics (c), 70 non-obese non-diabetics (d). OR = (8070) / (2030) = 9.33 (obesity increases odds of diabetes by over 9 times).
When to Use: Case-control studies or when incidence data is unavailable; common in meta-analyses.
Limitations: Not directly interpretable as risk for common outcomes; can differ from RR if events are frequent.

9.4 Number Needed to Treat (NNT) or Harm (NNH)

Definition: The average number of patients who need to be treated (or exposed) to prevent (or cause) one additional outcome compared to the control. NNT is based on ARR and translates statistical effects into clinical relevance.
Formula: \(NNT = \frac{1}{|ARR|}\) (use absolute value for magnitude; sign of ARR determines benefit vs. harm).
Interpretation: Lower NNT indicates greater treatment efficacy. If ARR > 0, it’s NNT (benefit); if ARR < 0, it’s NNH (harm). Infinite NNT means no effect.
Example (Benefit): ARR = 0.03 (as above), NNT = 1 / 0.03 ≈ 33.3 (treat 33 patients to prevent one heart attack).
Example (Harm): If treatment increases incidence from 50% to 80%, ARR = 0.50 - 0.80 = -0.30, NNH = 1 / 0.30 ≈ 3.3 (treat 3 patients to cause one additional bad outcome).
When to Use: RCTs or systematic reviews; helps in shared decision-making and cost-benefit analysis.
Limitations: Assumes constant ARR; sensitive to time frame and baseline risk. Confidence intervals should be reported for real-world application.

9.5 Additional Considerations

Confidence Intervals (CI): Always compute 95% CIs for these measures to assess precision (e.g., using bootstrap methods or formulas in R packages like epitools or epiR).
Attributable Risk (AR): Extends RR; AR = \(I_{\text{exposed}} - I_{\text{unexposed}}\) (absolute risk due to exposure).
Population Attributable Risk (PAR): PAR = \(I_{\text{population}} - I_{\text{unexposed}}\) (proportion of cases attributable to exposure in the population).
Best Practices: Adjust for confounders using multivariable models; interpret in context (e.g., RR may seem large for rare events but have small absolute impact).
Software Tools: R (with packages like epiR, Epi, or survival for advanced metrics like Hazard Ratios) or Python (with scipy or lifelines) are commonly used.

R Implementation for Key Measures: This code snippet computes ARR, RR, OR, NNT/NNH from example data. It includes error handling and supports both benefit and harm scenarios.

# Example data: 2x2 table for OR/RR (cohort study assumption)
# Rows: Exposed (1) vs Unexposed (0); Columns: Cases vs Non-cases
a <- 20  # Exposed cases
b <- 80  # Exposed non-cases
c <- 2   # Unexposed cases
d <- 98  # Unexposed non-cases

# Incidences
I_exposed <- a / (a + b)
I_unexposed <- c / (c + d)

# Absolute Risk Reduction (assuming unexposed = control, exposed = treatment)
ARR <- I_unexposed - I_exposed  # Positive if treatment reduces risk

# Relative Risk
RR <- I_exposed / I_unexposed

# Odds Ratio
OR <- (a * d) / (b * c)

# Number Needed to Treat/Harm
NNT <- ifelse(ARR != 0, 1 / abs(ARR), Inf)
type <- ifelse(ARR > 0, "NNT (Benefit)", ifelse(ARR < 0, "NNH (Harm)", "No Effect"))

# Output
cat("Incidence Exposed:", round(I_exposed, 3), "\n")

## Incidence Exposed: 0.2

cat("Incidence Unexposed:", round(I_unexposed, 3), "\n")

## Incidence Unexposed: 0.02

cat("ARR:", round(ARR, 3), "\n")

## ARR: -0.18

cat("RR:", round(RR, 3), "\n")

## RR: 10

cat("OR:", round(OR, 3), "\n")

## OR: 12.25

cat(type, ":", ifelse(is.finite(NNT), round(NNT, 1), "Infinite"), "\n")

## NNH (Harm) : 5.6

# Harm example (swap for treatment increasing risk)
I_control <- 0.50
I_treatment <- 0.80
ARR_harm <- I_control - I_treatment
NNT_harm <- 1 / abs(ARR_harm)
cat("\nHarm Example - ARR:", round(ARR_harm, 3), "\n")

## 
## Harm Example - ARR: -0.3

cat("NNH:", round(NNT_harm, 1), "\n")

## NNH: 3.3

The following code uses a custom function to handle both benefit (reduction in risk) and harm (increase in risk) scenarios dynamically.

# Function to calculate Epidemiological Metrics
calculate_epi_stats <- function(a, b, c, d) {
# Input validation
if (any(c(a, b, c, d) < 0)) stop("Cell counts must be non-negative")

# Basic Incidences
i_exp <- a / (a + b)
i_unexp <- c / (c + d)

# Absolute Risk Reduction (ARR)
# Defined as Control Risk - Treated Risk
arr <- i_unexp - i_exp

# Relative Risk (RR) and Odds Ratio (OR)
# Adding small constant (0.5) if zero to avoid division by zero errors
rr <- i_exp / i_unexp
or <- (a * d) / (b * c)

# NNT / NNH Logic
nnt_val <- if (arr != 0) 1 / abs(arr) else Inf
nnt_label <- ifelse(arr > 0, "NNT (Benefit)",
ifelse(arr < 0, "NNH (Harm)", "No Effect"))

# Return formatted list
return(list(
Incidence_Exposed = round(i_exp, 4),
Incidence_Unexposed = round(i_unexp, 4),
ARR = round(arr, 4),
RR = round(rr, 2),
OR = round(or, 2),
Measure = nnt_label,
Value = round(nnt_val, 1)
))
}

# --- Example 1: Beneficial Treatment ---
# (e.g., Vaccine trial: Exposed = Vaccinated)

result_benefit <- calculate_epi_stats(a=20, b=80, c=40, d=60)
print(result_benefit)

## $Incidence_Exposed
## [1] 0.2
## 
## $Incidence_Unexposed
## [1] 0.4
## 
## $ARR
## [1] 0.2
## 
## $RR
## [1] 0.5
## 
## $OR
## [1] 0.38
## 
## $Measure
## [1] "NNT (Benefit)"
## 
## $Value
## [1] 5

# --- Example 2: Harmful Exposure ---
# (e.g., Smoking vs Lung Cancer)
result_harm <- calculate_epi_stats(a=50, b=50, c=10, d=90)
print(result_harm)

## $Incidence_Exposed
## [1] 0.5
## 
## $Incidence_Unexposed
## [1] 0.1
## 
## $ARR
## [1] -0.4
## 
## $RR
## [1] 5
## 
## $OR
## [1] 9
## 
## $Measure
## [1] "NNH (Harm)"
## 
## $Value
## [1] 2.5

To understand the output and interpret the results, we use the following logic.

Metric	Interpretation	Calculation
ARR	The actual difference in risk between groups.	\(I_{unexposed} - I_{exposed}\)
RR	How many times more likely the outcome is in the exposed group.	\(I_{exposed} / I_{unexposed}\)
OR	The ratio of the odds of the outcome occurring in the exposed group.	\(\frac{a \times d}{b \times c}\)
NNT	The number of patients needed to treat to prevent one bad outcome.	\(1 / ARR\)

Note on NNT/NNH: If the ARR is positive (risk decreased), we report Number Needed to Treat (NNT). If the ARR is negative (risk increased), we report Number Needed to Harm (NNH).

Confidence Intervals: In a real-world study, we typically report 95% CIs. Using prop.test() in base R to get CIs for risk differences.

Log Transformation: For OR and RR, calculations for CIs are typically done on the log scale to ensure symmetry.

10 `Rmd/R` Epidemiological simulation of a hypothetical viral epidemic

We’ll try to illustrates genomics tracking, prediction, and spatio-temporal quantification of a viral epidemic. For this, we’ll build a high-fidelity epidemiological simulation in R, integrating three distinct domains: compartmental modeling (SIR/SEIR), phylodynamics (genomic tracking), and spatial movement.

To keep this self-contained and realistic, we will use a Stochastic Agent-Based Model (ABM) approach. This allows us to track individual “viral sequences” as they mutate and spread across a coordinate system.

The R simulation requires the following libraries:: tidyverse: For data manipulation and plotting.; gganimate: For spatio-temporal visualization.; adegenet or ape: For handling genomic distances.; sf: for projecting this onto real-world maps.

10.1 The Core Simulation Engine

This script generates a synthetic population, simulates transmission based on proximity, and introduces mutations (modeled as bit-strings or distance matrices).

# --- Setup Parameters ---
set.seed(42)
n_agents <- 200
map_size <- 100
initially_infected <- 3
mutation_rate <- 0.05  # Probability of a genomic shift per transmission
timesteps <- 50

# --- Initialize Population ---
pop <- tibble(
  id = 1:n_agents,
  x = runif(n_agents, 0, map_size),
  y = runif(n_agents, 0, map_size),
  state = "S",
  genome = 0,        # Base viral lineage
  parent_id = NA,
  infect_time = NA
)

# Infect patient zeros
pop$state[1:initially_infected] <- "I"
pop$infect_time[1:initially_infected] <- 0
pop$genome[1:initially_infected] <- sample(1:100, initially_infected)

# --- Run Simulation ---
history <- list()

for (t in 1:timesteps) {
  # 1. Movement (Random Walk)
  pop$x <- pmax(0, pmin(map_size, pop$x + rnorm(n_agents, 0, 2)))
  pop$y <- pmax(0, pmin(map_size, pop$y + rnorm(n_agents, 0, 2)))
  
  # 2. Transmission
  infected <- which(pop$state == "I")
  susceptible <- which(pop$state == "S")
  
  for (i in infected) {
    # Find neighbors in range
    dists <- sqrt((pop$x[susceptible] - pop$x[i])^2 + (pop$y[susceptible] - pop$y[i])^2)
    contacts <- susceptible[dists < 5] # Transmission radius
    
    if (length(contacts) > 0) {
      for (target in contacts) {
        if (runif(1) < 0.3) { # Prob of infection
          pop$state[target] <- "I"
          pop$infect_time[target] <- t
          pop$parent_id[target] <- pop$id[i]
          # Genomic Mutation: Slight drift from parent genome
          pop$genome[target] <- pop$genome[i] + 
            sample(c(-1, 0, 1), 1, prob = c(mutation_rate/2, 1-mutation_rate, mutation_rate/2))
        }
      }
    }
  }
  
  # 3. Recovery
  pop$state[pop$state == "I" & (t - pop$infect_time) > 10] <- "R"
  
  history[[t]] <- pop %>% mutate(time = t)
}

sim_data <- bind_rows(history)

Here is the structure of the simulated dataset.

str(sim_data )

## tibble [10,000 × 8] (S3: tbl_df/tbl/data.frame)
##  $ id         : int [1:10000] 1 2 3 4 5 6 7 8 9 10 ...
##  $ x          : num [1:10000] 89 92.5 28 83.5 65.4 ...
##  $ y          : num [1:10000] 86.5 50.2 85.2 47.1 16.4 ...
##  $ state      : chr [1:10000] "I" "I" "I" "S" ...
##  $ genome     : num [1:10000] 80 53 24 0 0 0 0 0 0 0 ...
##  $ parent_id  : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
##  $ infect_time: num [1:10000] 0 0 0 NA NA NA NA NA NA NA ...
##  $ time       : int [1:10000] 1 1 1 1 1 1 1 1 1 1 ...

10.2 Genomic Tracking & Phylogeny

To visualize the “Genomic Tracking,” we treat the genome value as a proxy for a sequence. In a real scenario, this would be a high-dimensional Hamming distance, but here we can visualize the Clade Divergence over time.

Visualizing the Spatio-Temporal Spread: Using gganimate, we can see how the virus clusters geographically and how the “Genomic Signature” (color) evolves.

anim <- ggplot(sim_data, aes(x = x, y = y, color = as.factor(genome), size = state)) +
  geom_point(alpha = 0.7) +
  scale_size_manual(values = c("S" = 1, "I" = 3, "R" = 1.5)) +
  theme_minimal() +
  labs(title = "Viral Spread & Genomic Drift: Time {frame}",
       color = "Genomic Lineage",
       size = "Infection Status") +
  transition_time(time)

# animate(anim) # Uncomment to render

10.3 Realistic Epidemiological Quantification

To make this “highly realistic,” we should calculate the Effective Reproduction Number (\(R_t\)) and the Spatio-Temporal Kernel 3. Realistic Epidemiological Quantification

\[ R_t=\frac{\text{New Infections at time}\ t}{\text{Infectious Individuals at time}\ t−1}. \]

Quantifying the Clusters: We can use a K-nearest neighbor (KNN) or DBSCAN algorithm on the sim_data to identify hotspots where specific genomic lineages are dominating. This mimics real-world Genomic Surveillance reports.

# Example quantification of lineage prevalence
lineage_summary_plot <- sim_data %>%
    filter(state == "I") %>%
    # Use group_by to organize the data by time and genome
    group_by(time, genome) %>% 
    summarise(count = n(), .groups = "drop") %>%
    ggplot(aes(x = time, y = count, fill = as.factor(genome))) +
    geom_area() +
    labs(title = "Genomic Lineage Succession", x = "Days", y = "Active Cases")

# Display the plot
lineage_summary_plot

To extend this code into an Rmd notebook, we can include

Parameter Sweeping: Wrap the simulation in a function and run it with different values.
Network Analysis: Use the parent_id column to create a transmission tree using the igraph package.
Kernel Density: Add a geom_density_2d() layer to the map to show spatial intensity.

Let’s expand the simulation from a simple proxy to a high-fidelity Phylodynamic Model by simulating a founder DNA sequence and allow it to evolve through point mutations (substitutions) as it spreads across the population.

DNA Sequences & Mutation: Instead of a single number, each agent now carries a character string. When transmission occurs, there is a probability that one or more bases in the sequence will mutate.

seq_length <- 50
bases <- c("A", "C", "G", "T")
mutation_rate <- 0.01 # Probability per base per transmission

# Create a starting sequence (Patient Zero)
founder_seq <- paste(sample(bases, seq_length, replace = TRUE), collapse = "")

# Helper function to mutate DNA
mutate_dna <- function(sequence) {
  seq_vec <- strsplit(sequence, "")[[1]]
  mutation_mask <- runif(seq_length) < mutation_rate
  if(any(mutation_mask)) {
    seq_vec[mutation_mask] <- sample(bases, sum(mutation_mask), replace = TRUE)
  }
  return(paste(seq_vec, collapse = ""))
}

Genetic Distance Matrix: To quantify the “genomic tracking” aspect, we use the Hamming Distance, which counts the number of positions at which the corresponding nucleotides are different.

The Hamming distance \(d(x, y)\) between two strings \(x = (x_1, x_2, \dots, x_n)\) and \(y = (y_1, y_2, \dots, y_n)\) of equal length \(n\) over a finite alphabet \(\Sigma\) is defined as the number of positions at which the corresponding symbols differ:

\[ d(x, y) = \sum_{i=1}^{n} \mathbb{1}_{x_i \neq y_i}, \]

where \(\mathbb{1}\) is the indicator function (1 if true, 0 otherwise). It forms a metric on the space \(\Sigma^n\), satisfying non-negativity, symmetry, and the triangle inequality.

# Extract all unique sequences found in the simulation
unique_genomes <- unique(pop$genome)

# Calculate Hamming Distance Matrix
dist_matrix <- stringdistmatrix(unique_genomes, unique_genomes, method = "hamming")
rownames(dist_matrix) <- unique_genomes
colnames(dist_matrix) <- unique_genomes

# This matrix allows us to see how far 'Variant B' has drifted from 'Variant A'

Integrated Spatio-Temporal & Transmission Model: We will update the simulation to record the transmission pairs to build the igraph tree and add the spatial density layer.

# --- Initialization Fixes ---
# Ensure genome column is a character vector to hold DNA strings
pop$genome <- as.character(pop$genome) 
edges <- data.frame(from = integer(), to = integer())

for (t in 1:timesteps) {
  # 1. Movement
  pop$x <- pmax(0, pmin(map_size, pop$x + rnorm(n_agents, 0, 2)))
  pop$y <- pmax(0, pmin(map_size, pop$y + rnorm(n_agents, 0, 2)))
  
  # 2. Transmission
  infected <- which(pop$state == "I")
  susceptible <- which(pop$state == "S")
  
  for (i in infected) {
    # Check distances against only the susceptible population
    dists <- sqrt((pop$x[susceptible] - pop$x[i])^2 + (pop$y[susceptible] - pop$y[i])^2)
    contacts <- susceptible[dists < 5] 
    
    for (target in contacts) {
      # Important: Check if target is still susceptible (in case they were infected
      # by someone else earlier in this same timestep loop)
      if (pop$state[target] == "S" && runif(1) < 0.3) { 
        
        # Update Status
        pop$state[target] <- "I"
        pop$infect_time[target] <- t
        pop$parent_id[target] <- pop$id[i]
        
        # 1. Expand Genomic Logic: DNA Mutation
        # We replace the numeric addition with our string mutation function
        new_sequence <- mutate_dna(pop$genome[i]) 
        pop$genome[target] <- new_sequence
        
        # 3. Track Transmission for igraph
        edges <- rbind(edges, data.frame(from = pop$id[i], to = pop$id[target]))
      }
    }
  }
  
  # 3. Recovery
  pop$state[pop$state == "I" & (t - pop$infect_time) > 10] <- "R"
  
  history[[t]] <- pop %>% mutate(time = t)
}

sim_data <- bind_rows(history)

# --- Visualization: Spatial Intensity ---
ggplot(sim_data %>% filter(state == "I"), aes(x = x, y = y)) +
  geom_density_2d_filled(
    aes(fill = after_stat(level)),
    alpha = 0.6,
    contour_var = "ndensity"   # optional: normalize for better contrast
  ) +
  geom_point(aes(color = genome), alpha = 0.9) +
  scale_fill_viridis_d() +      # discrete fill for the bands
  theme_minimal() +
  labs(title = "Viral Hotspots and Genomic Clustering") +
  theme(legend.position = "none")

The Transmission Tree (Phylogeny): Using igraph and ggraph, we can visualize the Who-Infects-Whom network. This is the gold standard for contact tracing and understanding super-spreading events.

# Create graph object
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)

# Plot the tree
ggraph(transmission_graph, layout = 'tree') + 
  geom_edge_diagonal(alpha = 0.2, color = "grey") + 
  geom_node_point(color = "steelblue", size = 2) + 
  theme_void() +
  labs(title = "Reconstructed Transmission Tree")

# An alternative plot_ly() SVG infection graph can be created by
# library(plotly)
# library(igraph)
# library(dplyr)  # For piping and joins

# Extract layout coordinates (tree structure)
layout <- layout_as_tree(transmission_graph)

# Prepare node data with safe column names
node_data <- data.frame(
  node_id = V(transmission_graph)$name,  # Renamed from 'id'
  x = layout[,1],
  y = layout[,2]
)

# Calculate in-degree (number of incoming edges) to identify root nodes
in_degree <- degree(transmission_graph, mode = "in")

# Create nodes dataframe with size based on root status
nodes <- node_data %>%
  mutate(
    size = ifelse(in_degree == 0, 10, 7),  # Root nodes (in-degree=0) are larger
    label = ifelse(in_degree == 0, "Root", "Descendant")
  )

# Prepare edge data with directional arrows
edges <- as_data_frame(transmission_graph, what = "edges") %>%
  left_join(nodes, by = c("from" = "node_id")) %>%
  left_join(nodes, by = c("to" = "node_id"), suffix = c("_from", "_to"))

# 1. Calculate children for each node
children_info <- edges %>%
  group_by(from) %>%
  summarise(children = paste(to, collapse = ", ")) %>%
  rename(node_id = from)

# 2. Calculate parents for each node
parent_info <- edges %>%
  group_by(to) %>%
  summarise(parents = paste(from, collapse = ", ")) %>%
  rename(node_id = to)

# 3. Join this info back to your nodes dataframe
nodes <- nodes %>%
  left_join(children_info, by = "node_id") %>%
  left_join(parent_info, by = "node_id") %>%
  mutate(
    children = ifelse(is.na(children), "None", children),
    parents = ifelse(is.na(parents), "None", parents)
  )

# Create interactive plot
transmission_plot <- plot_ly() %>%
  add_segments(
    data = edges,
    x = ~x_from, y = ~y_from,
    xend = ~x_to, yend = ~y_to,
    line = list(
      color = ~ifelse(x_from < x_to, "#1f77b4", "#ff7f0e"),
      width = 2
    ),
    hoverinfo = "none"
  ) %>%
  add_trace(
    data = nodes,
    x = ~x,
    y = ~y,
    type = "scatter",
    mode = "markers",
    marker = list(
      size = ~size,
      opacity = 0.8,
      line = list(width = 1, color = "black")
    ),
    # Added Parents and Children to the hover text
    text = ~paste0(
      "<b>Node ID:</b> ", node_id, 
      "<br><b>Type:</b> ", label,
      "<br><b>Parent(s):</b> ", parents,
      "<br><b>Children:</b> ", children
    ),
    hoverinfo = "text",
    showlegend = FALSE
  ) %>%
  layout(
    title = "Interactive Viral Transmission Tree (Graph)",
    xaxis = list(showgrid = FALSE, showticklabels = FALSE, zeroline = FALSE),
    yaxis = list(showgrid = FALSE, showticklabels = FALSE, zeroline = FALSE),
    hovermode = "closest"
  )

transmission_plot

Genetic Distance: By calculating (distance between sequence and ), we can identify if a new “cluster” appearing in a specific city is a local evolution or a new introduction from outside.
Spatial Kernel: The geom_density_2d visualizes the spatial transmission kernel, identifying where the force of infection is highest.
Tree Topology: A star-like phylogeny suggests rapid exponential growth, while a ladder-like phylogeny suggests a stable, endemic transmission chain.

# Proper Genetic Distance Matrix
unique_strains <- unique(sim_data$genome[sim_data$state == "I"])
dist_mat <- stringdistmatrix(unique_strains, unique_strains, method = "hamming")

# Optional: Visualize distance as a heatmap
image(as.matrix(dist_mat), main = "Genetic Distance Heatmap")

# SVG plot using plot_ly()
# 1. Get the unique viral strains
unique_strains <- unique(sim_data$genome[sim_data$state == "I"])

# 2. Generate clean Short IDs (e.g., "Strain 1", "Strain 2", etc.)
# Or use the first few characters if that's where the ID is hidden
short_ids <- paste0("S-", seq_along(unique_strains)) 

# 3. Calculate the matrix using the full sequences
dist_mat <- stringdistmatrix(unique_strains, unique_strains, method = "hamming")
m <- as.matrix(dist_mat)

# 4. Handle Infinities
m[is.infinite(m)] <- max(m[is.finite(m)], na.rm = TRUE)

# 5. Assign the CLEAN IDs to the matrix
rownames(m) <- short_ids
colnames(m) <- short_ids

# 6. Plot
fig <- plot_ly(
    z = m, 
    x = colnames(m), 
    y = rownames(m), 
    type = "heatmap",
    colorscale = "Viridis",
    # This customdata trick lets you see the FULL sequence when you hover!
    customdata = matrix(unique_strains, nrow=nrow(m), ncol=ncol(m)),
    hovertemplate = "Row: %{y}<br>Col: %{x}<br>Dist: %{z}<extra></extra>"
  ) %>%
  layout(
    xaxis = list(title = "Strain ID", type = "category"),
    yaxis = list(title = "Strain ID", type = "category", autorange = "reversed")
  )

fig

Finally, display the transmission tree graph.

inf_graph <- graph_from_data_frame(edges, directed = TRUE)

# ggraph(inf_graph, layout = 'kk') + 
#   geom_edge_link(alpha = 0.3, arrow = arrow(length = unit(2, 'mm'))) + 
#   geom_node_point(color = "red", size = 1) + 
#   theme_void() + 
#   labs(title = "Epidemiological Transmission Network")

library(dplyr)
library(igraph)
library(visNetwork)


# Create igraph object
inf_graph <- graph_from_data_frame(edges, directed = TRUE)

# Get all node IDs (as character, for consistency)
node_ids <- V(inf_graph)$name   # already character, e.g. "14", "42"

# Convert sim_data id to character to match node_ids
node_metadata <- sim_data %>%
  filter(state == "I") %>%           # only infected individuals (optional)
  group_by(id) %>%
  summarise(
    genome      = first(genome),     # or any other aggregation
    infect_time = min(infect_time),  # earliest infection time
    .groups = "drop"
  ) %>%
  mutate(id = as.character(id))      # match node_ids format

# merge with the complete set of graph nodes
nodes <- data.frame(id = node_ids) %>%
  left_join(node_metadata, by = "id")

# Combine the metadata into an HTML tooltip string
nodes$title <- paste0(
  "Node: ", nodes$id, "<br>",
  "Genome: ", ifelse(is.na(nodes$genome), "unknown", nodes$genome), "<br>",
  "Infection time: ", ifelse(is.na(nodes$infect_time), "not infected", nodes$infect_time)
)

# Optional: color nodes by genome or infection time
nodes$color <- ifelse(is.na(nodes$genome), "green", "red")   # example

# Build the interactive network
visNetwork(nodes, edges) %>%
  visNodes(
    title = nodes$title,      # tooltip
    color = nodes$color,
    size  = 15
  ) %>%
  visEdges(
    arrows = "to",
    color  = list(color = "darkgray", opacity = 0.4),
    smooth = FALSE
  ) %>%
  visOptions(
    highlightNearest = TRUE,
    nodesIdSelection = TRUE
  )

11 Applications and Software

Modern epidemiology relies heavily on computational tools.

Key R Packages: epiR (Epi measures), genetics (HWE, LD), survival (time-to-event), qqman (GWAS visualization).
Online Tools:
- SOCR Distribution Tables.
- SOCR Distributome.

12 Problems

12.1 Problem 1: Linkage Mapping

Scenario: Analyze the pedigree below under a Dominant Inheritance model. We need to estimate the recombination fraction \(\theta\). Placeholder for Figure 8

Calculate LOD Scores:
Using Maximum Likelihood Estimation (MLE), if the phase is known and the data include 4 non-recombinant and 1 recombinant meioses, the likelihood is:
\(L(\theta) = (1-\theta)^4 \theta\).
The LOD score is defined as:
\(Z(\theta) = \log_{10}\left(\frac{L(\theta)}{L(0.5)}\right)\), where \(L(0.5) = (0.5)^5 = 1/32\).
Corrected Result Table: | \(\theta\) | \(Z(\theta)\) | |———-|————-| | 0.0 | \(-\infty\) | | 0.10 | 0.322 | | 0.20 | 0.418 (Max LOD) | | 0.50 | 0.0 |

The maximum LOD score occurs at \(\hat{\theta} = 0.20\), consistent with 1 recombinant out of 5 informative meioses.

Note: A \(LOD > 3\) is conventionally required for significant linkage; this example is illustrative.

12.2 Problem 2: Number Needed to Treat (NNT) and Harm (NNH)

Scenario: A trial reports 800 events among 1000 patients in Treatment Group A and 600 events among 1200 patients in Control Group B.

Event rates: \(p_A = 800/1000 = 0.80\), \(p_B = 600/1200 = 0.50\).
Absolute Risk Increase (ARI): \(\text{ARI} = p_A - p_B = 0.30\).
Since the treatment increases risk, we compute the Number Needed to Harm (NNH):
\(\text{NNH} = \frac{1}{\text{ARI}} = \frac{1}{0.30} \approx 3.33\).

Interpretation: For every 3–4 patients treated with intervention A (vs. control), one additional adverse event occurs. The negative NNT is conventionally reported as NNH.

12.3 Problem 3: GWAS Power Analysis (R)

Scenario: Simulate a case-control GWAS with 500 cases and 500 controls, Minor Allele Frequency (MAF) = 0.2, and Odds Ratio (OR) = 1.5. Estimate statistical power at \(\alpha = 0.05\).

The following R function correctly simulates genotypes separately for cases and controls based on genotype relative risks derived from the OR:

gwas_power_case_control <- function(n_cases = 200, n_controls = 200, 
                                 maf = 0.2, OR = 1.5, 
                                 alpha = 0.05, n_sims = 100) {
# Genotype coding: 0, 1, 2 copies of effect allele
# Relative risks under additive log-odds model: 1, OR, OR^2
rr <- c(1, OR, OR^2)
  
# Genotype frequencies in controls (Hardy-Weinberg)
g_ctrl <- dbinom(0:2, size = 2, prob = maf)
  
# Genotype frequencies in cases (proportional to g_ctrl * rr)
g_case_unscaled <- g_ctrl * rr
g_case <- g_case_unscaled / sum(g_case_unscaled)
  
significant <- numeric(n_sims)
  
for (i in 1:n_sims) {
  # Simulate genotypes
  geno_cases <- sample(0:2, n_cases, replace = TRUE, prob = g_case)
  geno_ctrls <- sample(0:2, n_controls, replace = TRUE, prob = g_ctrl)
    
  geno <- c(geno_cases, geno_ctrls)
  status <- c(rep(1, n_cases), rep(0, n_controls))
    
  # Fit logistic regression
  model <- glm(status ~ geno, family = binomial)
  p_val <- summary(model)$coefficients[2, 4]
  significant[i] <- as.numeric(p_val < alpha)
}
  
return(mean(significant))  # Estimated power
}

# Example usage:
set.seed(2026)
power_est <- gwas_power_case_control()
cat("Estimated power:", round(power_est, 3), "\n")

## Estimated power: 0.69

12.4 Problem 4: Hardy-Weinberg Equilibrium (HWE) Testing

Scenario: In a GWAS control group (\(n = 1000\)), genotype counts for a SNP are:
AA = 240, Aa = 120, aa = 40. Test HWE using an exact test.

# Observed counts: AA, Aa, aa
obs <- c(240, 120, 40)

# Exact HWE test
hwe_test <- HWExact(obs, verbose = FALSE)

## Warning in homozyg(X): Genotypes are not labelled, default labels (AA, AB, BB)
## assumed.

## Warning in heterozyg(X): Genotypes are not labelled, default labels assumed.

p_val <- hwe_test$pval

cat("HWE exact test p-value:", format.pval(p_val, digits = 4), "\n")

## HWE exact test p-value: 9.821e-05

# Interpretation: p < 0.001 suggests deviation from HWE (common QC threshold in GWAS)

12.5 Problem 5: Assessing Confounding in Observational Data

Scenario: Evaluate whether age and sex confound the association between an exposure and disease outcome.

We’ll use the Logistic Distribution function plogis, which provides the standard Density, distribution function, quantile function and random generation for the logistic distribution with parameters location and scale, and rbinom.

set.seed(123)
n <- 5000
# Simulate confounders and exposure

age <- rnorm(n, mean = 50, sd = 10)
sex <- rbinom(n, 1, 0.5)
exposure <- rbinom(n, 1, plogis(-1 + 0.02 * age + 0.3 * sex))

# Simulate disease influenced by exposure and confounders
disease <- rbinom(n, 1, plogis(-2 + 0.9 * exposure + 0.03 * age + 0.4 * sex))


# Crude model
crude_or <- exp(coef(glm(disease ~ exposure, family = binomial))["exposure"])


# Adjusted model
adj_or <- exp(coef(glm(disease ~ exposure + age + sex, family = binomial))["exposure"])

# Percent change due to confounding
confounding_pct <- (crude_or - adj_or) / crude_or * 100
cat("Crude OR:", round(crude_or, 2), "\n")

## Crude OR: 2.38

cat("Adjusted OR:", round(adj_or, 2), "\n")

## Adjusted OR: 2.23

cat("Percent change:", round(confounding_pct, 1), "%\n")

## Percent change: 6.5 %

if (abs(confounding_pct) > 10) {
 cat("→ Substantial confounding detected (change > 10%).\n")
} else cat("→ No substantial confounding detected (change < 10%).\n")

## → No substantial confounding detected (change < 10%).

12.6 Problem 6: Diagnostic Test Accuracy via ROC Analysis

Scenario: A continuous biomarker is measured in 100 diseased and 100 non-diseased individuals. Determine the optimal cutoff, sensitivity, and specificity using Youden’s index.

# Simulate biomarker scores
set.seed(42)
score_healthy <- rnorm(100, mean = 0, sd = 1)
score_diseased <- rnorm(100, mean = 1.5, sd = 1)

score <- c(score_healthy, score_diseased)
status <- c(rep(0, 100), rep(1, 100))

# ROC analysis
roc_curve <- roc(status, score)
auc_val <- auc(roc_curve)

# Optimal threshold (Youden's J statistic)
opt <- coords(roc_curve, "best", ret = c("threshold", "sensitivity", "specificity"))

cat("AUC:", round(auc_val, 3), "\n")

## AUC: 0.844

cat("Optimal cutoff:", round(opt[["threshold"]], 2), "\n")

## Optimal cutoff: 0.83

cat("Sensitivity:", round(opt[["sensitivity"]], 3), "\n")

## Sensitivity: 0.77

cat("Specificity:", round(opt[["specificity"]], 3), "\n")

## Specificity: 0.81

12.7 Problem 7

Try to integrate structural fixes, biological realism (fitness selection), and technical optimizations into the above genetics simulation of Phylodynamics and Genomic Surveillance in R. Try to demonstrates how viral mutations, spatial movement, and natural selection interact to shape an outbreak.

The Core Simulation Engine can initialize a population and simulates a Random Walk movement model, by introducing a Fitness Function; mutations at specific genomic loci now increase the transmission probability, simulating the emergence of a Variant of Concern.

# --- Setup Parameters ---
set.seed(42)
n_agents <- 200
map_size <- 100
initially_infected <- 3
mutation_rate <- 0.02
timesteps <- 60
seq_length <- 50
bases <- c("A", "C", "G", "T")

# --- Helper Functions ---
founder_seq <- paste(sample(bases, seq_length, replace = TRUE), collapse = "")

mutate_dna <- function(sequence) {
  seq_vec <- strsplit(sequence, "")[[1]]
  mutation_mask <- runif(seq_length) < mutation_rate
  if(any(mutation_mask)) {
    seq_vec[mutation_mask] <- sample(bases, sum(mutation_mask), replace = TRUE)
  }
  return(paste(seq_vec, collapse = ""))
}

calc_fitness <- function(sequence) {
  gc_content <- nchar(gsub("[AT]", "", sequence)) / nchar(sequence)
  return(0.3 + (gc_content * 0.2)) 
}

# --- Initialize Population ---
pop <- tibble(
  id = 1:n_agents,
  x = runif(n_agents, 0, map_size),
  y = runif(n_agents, 0, map_size),
  state = "S",
  genome = founder_seq,
  parent_id = NA_integer_,
  infect_time = NA_real_
)

# Set state to "I" for initial cases
pop$state[1:initially_infected] <- "I"
pop$infect_time[1:initially_infected] <- 0

# --- Run Simulation ---
history <- list()
edges <- data.frame(from = integer(), to = integer())

for (t in 1:timesteps) {
  
  # Assignment operator (<-) to update movement
  # loaded cyclical harmonics
  # pop$x <- pop$x * rep(0.1 + cos(10*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
  # pop$y <- pop$y * rep(0.1 + sin(5*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
  
  # simple stochastic dynamics
  pop$x <- pop$x + rnorm(n_agents, 0, 2)
  pop$y <- pop$y + rnorm(n_agents, 0, 2)

  # Optional: Keep agents within map boundaries
  pop$x <- pmax(0, pmin(map_size, pop$x))
  pop$y <- pmax(0, pmin(map_size, pop$y))

  # 2. Transmission
  infected_idx <- which(pop$state == "I")
  
  for (i in infected_idx) {
    susceptible_idx <- which(pop$state == "S")
    if(length(susceptible_idx) == 0) break
    
    dists <- sqrt((pop$x[susceptible_idx] - pop$x[i])^2 + (pop$y[susceptible_idx] - pop$y[i])^2)
    contacts <- susceptible_idx[dists < 6] 
    
    for (target in contacts) {
      if (pop$state[target] == "S" && runif(1) < calc_fitness(pop$genome[i])) {
        pop$state[target] <- "I"
        pop$infect_time[target] <- t
        pop$parent_id[target] <- pop$id[i]
        pop$genome[target] <- mutate_dna(pop$genome[i])
        edges <- rbind(edges, data.frame(from = pop$id[i], to = pop$id[target]))
      }
    }
  }

  # subsetting and assignment for Recovery
  recovery_idx <- which(pop$state == "I" & (t - pop$infect_time) >= 10)
  if(length(recovery_idx) > 0) {
    pop$state[recovery_idx] <- "R"
  }
  
  history[[t]] <- pop %>% mutate(time = t)
}

sim_data <- bind_rows(history)
# str(sim_data)

Spatio-Temporal Visualization: By mapping the genomic strings to colors, we can visualize the “clade emergence” over time.

anim <- ggplot(sim_data, aes(x = x, y = y, color = genome, size = state)) +
geom_point(alpha = 0.6) +
scale_size_manual(values = c("S" = 1, "I" = 3, "R" = 1.5)) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Viral Spread & Selection Pressure: Day {frame}") +
transition_time(time)

# animate(anim)

Genomic Tracking & Distance Matrices: To understand how much the virus has drifted, we use the Hamming Distance. This calculates the number of base substitutions between any two viral samples.

# Extract final strains
final_strains <- sim_data %>%
filter(state %in% c("I", "R"), !is.na(infect_time)) %>%
distinct(genome) %>%
pull(genome)

dist_mat <- stringdistmatrix(final_strains, final_strains, method = "hamming")
rownames(dist_mat) <- paste0("Strain_", 1:length(final_strains))
colnames(dist_mat) <- paste0("Strain_", 1:length(final_strains))

# Visualize Genetic Distance
plot_ly(z = ~dist_mat, x = colnames(dist_mat), y = rownames(dist_mat), type = "heatmap", colorscale = "Viridis") %>%
layout(title = "Genetic Hamming Distance Matrix")

Phylogeny: The Transmission Tree: While a distance matrix shows “how much” change occurred, the Transmission Tree shows “who infected whom.” A “star-like” tree indicates rapid expansion, while a “ladder-like” tree indicates a slow, constant chain of transmission.

# Reconstruct the tree using igraph
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)

ggraph(transmission_graph, layout = 'tree') +
geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
geom_node_point(aes(color = name), size = 3, show.legend = FALSE) +
theme_void() +
labs(title = "Reconstructed Epidemiological Transmission Tree")

Quantifying the Molecular Clock: A key concept in phylodynamics is the Molecular Clock, the idea that mutations accumulate at a relatively constant rate over time.

# Calculate divergence from founder
clock_data <- sim_data %>%
filter(!is.na(infect_time)) %>%
mutate(dist_from_founder = stringdist(genome, founder_seq, method = "hamming"))

ggplot(clock_data, aes(x = infect_time, y = dist_from_founder)) +
geom_jitter(alpha = 0.2, color = "steelblue") +
geom_smooth(method = "lm", color = "red") +
theme_minimal() +
labs(title = "The Molecular Clock: Mutation Accumulation",
x = "Time of Infection (Days)",
y = "Hamming Distance from Founder")

Here are a a few additional plots.

# Create the ggraph plot
gg <- ggraph(transmission_graph, layout = 'tree') +
  geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
  geom_node_point(aes(color = name, 
                      text = paste("ID:", name,
                                   "<br>Click for details")), 
                  size = 5) +
  theme_void() +
  labs(title = "Reconstructed Epidemiological Transmission Tree")

## Warning in geom_node_point(aes(color = name, text = paste("ID:", name,
## "<br>Click for details")), : Ignoring unknown aesthetics: text

# Convert to interactive plotly
ggplotly(gg, tooltip = "text") %>%
  layout(
    hoverlabel = list(bgcolor = "white", font = list(size = 12)),
    xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
    yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)
  )

## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomEdgePath() has yet to be implemented in plotly.
##   If you'd like to see this geom implemented,
##   Please open an issue with your example code at
##   https://github.com/ropensci/plotly/issues

and …

# Convert igraph to networkD3 format
transmission_d3 <- igraph_to_networkD3(transmission_graph)

# Prepare nodes with tooltips
nodes_d3 <- transmission_d3$nodes
if(exists("pop")) {
  nodes_d3 <- nodes_d3 %>%
    left_join(
      pop %>% 
        mutate(name = as.character(id)) %>% 
        select(name, state, infect_time, genome),
      by = "name"
    )
} else {
  nodes_d3$state <- "Unknown"
  nodes_d3$infect_time <- NA
  nodes_d3$genome <- NA
}

# Create custom tooltips
nodes_d3$tooltip <- sprintf(
  "ID: %s<br>State: %s<br>Infection Time: %s<br>Genome: %s",
  nodes_d3$name,
  nodes_d3$state,
  ifelse(is.na(nodes_d3$infect_time), "N/A", nodes_d3$infect_time),
  ifelse(is.na(nodes_d3$genome), "N/A", substr(nodes_d3$genome, 1, 15))
)

# Create the force-directed network
# forceNetwork(
#   Links = transmission_d3$links,
#   Nodes = nodes_d3,
#   Source = 'source',
#   Target = 'target',
#   NodeID = 'name',
#   Group = 'state',
#   Nodesize = 5,
#   opacity = 0.9,
#   zoom = TRUE,
#   fontSize = 14,
#   linkDistance = 150,
#   charge = -100,
#   opacityNoHover = 0.5,
#   legend = TRUE,
#   bounded = TRUE,
#   clickAction = 'alert("Node ID: " + d.name + "\\nGroup: " + d.state)'
# )
forceNetwork(
  Links = transmission_d3$links,
  Nodes = nodes_d3,
  Source = 'source',
  Target = 'target',
  NodeID = 'name',
  Group = 'state',
  Nodesize = 5,
  opacity = 0.9,
  zoom = TRUE,
  fontSize = 14,
  linkDistance = 10,   # now longer
  charge = -10,        # stronger repulsion
  # bounded = FALSE,    # default is FALSE – remove/comment
  opacityNoHover = 0.5,
  legend = TRUE,
  clickAction = 'alert("Node ID: " + d.name + "\\nGroup: " + d.state)'
)