| SOCR ≫ | DSPA ≫ | DSPA2 Topics ≫ |
# Load necessary libraries (install if needed)
# install.packages(c("tidyverse", "gganimate", "igraph", "ggraph", "plotly", "stringdist", "pROC", "HardyWeinberg", "viridisLite", "visNetwork", "networkD3"))
library(tidyverse)
library(gganimate)
library(igraph)
library(ggraph)
library(plotly)
library(stringdist)
library(pROC)
library(HardyWeinberg)
library(viridisLite)
library(visNetwork)
library(networkD3)This Data Science and Predictive Analytics (DSPA2) Appendix covers Epidemiology, the scientific discipline that investigates the distribution, determinants, and control of health-related states or events (including diseases) in specified populations. It applies this knowledge to control health problems and improve public health outcomes. Historically, epidemiology originated from the study of infectious disease outbreaks, such as John Snow’s investigation of the 1854 cholera epidemic in London, which linked contaminated water sources to disease spread. In modern times, the field has broadened to include non-infectious conditions like chronic diseases (e.g., cancer, diabetes), environmental exposures (e.g., air pollution, toxins), behavioral factors (e.g., smoking, diet), and genetic influences. This helps with understanding that health outcomes arise from complex interactions between genetic predispositions, environmental factors, and social determinants.
A core framework in epidemiology is the “epidemiologic triad” of agent, host, and environment, but contemporary approaches emphasize the “person, place, and time” dimensions:
This appendix expands the Scientific Methods for Health Sciences (SMHS) Ebook Epidemiology Section, delving into genetic epidemiology and bridging molecular biology with population-level analysis to identify risk factors and outcomes. It explores how genetic variations contribute to disease patterns and how computational tools can quantify these relationships.
This module equips learners with foundational knowledge in genetic epidemiology, enabling them to integrate genetic data into public health practice. By the end of this module, learners should be able to:
Understand Genetic Foundations: Describe the structure and key features of the human genome, explain the types and distributions of mutations, and apply principles of Mendelian inheritance, including segregation (independent assortment of alleles) and linkage (genes on the same chromosome tend to be inherited together unless separated by recombination).
Analyze Population Dynamics: Utilize quantitative genetic concepts to examine the interplay between genetic variation and phenotypic (disease) variation in populations, including calculations of allele and genotype frequencies and testing for Hardy-Weinberg equilibrium to detect deviations indicative of evolutionary pressures.
Evaluate Associations: Identify common gene-disease relationships (e.g., monogenic vs. polygenic traits), interpret results from Genome-Wide Association Studies (GWAS) to pinpoint susceptibility loci, and recognize gene-environment interactions (e.g., how smoking exacerbates genetic risks for lung cancer).
Apply Computational Methods: Conduct basic genetic association analyses using statistical software like R, interpret epidemiological measures such as Number Needed to Treat (NNT), Odds Ratio (OR), and Relative Risk (RR), and understand their implications for clinical and public health decision-making.
Additional Skills: Critically evaluate study designs (e.g., cohort vs. case-control), account for confounders like population stratification in genetic studies, and discuss ethical considerations in genetic epidemiology, such as privacy in genomic data.
These objectives align with real-world applications, such as designing targeted interventions (e.g., pharmacogenomics) or predicting disease outbreaks through genomic surveillance.
The human genome comprises approximately 3 billion base pairs of DNA, encoding around 20,000-25,000 genes that orchestrate cellular functions, development, and responses to the environment. Mutations, changes in this DNA sequence, can arise spontaneously or from external factors (e.g., radiation, chemicals) and may lead to diseases if they disrupt gene function. Understanding genomic structure and mutations is crucial for identifying genetic risk factors in epidemiological studies.
Chromosomes are thread-like structures in the cell nucleus that package and organize DNA for efficient replication and segregation during cell division. Each human cell (except gametes) contains 46 chromosomes.
Banding: Cytogenetic staining techniques (e.g., Giemsa staining) produce visible bands on chromosomes. Dark bands (G-bands) are AT-rich, heterochromatic regions with fewer genes, while light bands (R-bands) are GC-rich, euchromatic, and gene-dense. These patterns, spanning millions of nucleotides, aid in identifying chromosomal abnormalities in karyotyping.
Karyotype: The complete set of chromosomes arranged by size and shape. A typical human karyotype includes 22 pairs of autosomes (numbered 1–22) and one pair of sex chromosomes (XX in females, XY in males). Abnormal karyotypes, such as trisomy 21 (Down syndrome), can be detected via techniques like fluorescence in situ hybridization (FISH).
Functional Elements:
Mutations are heritable changes in the DNA sequence, occurring at a rate of about \(10^{-8}\) per nucleotide per generation. They can be somatic (acquired in body cells, e.g., leading to cancer) or germline (in gametes, passed to offspring). Mutations drive genetic diversity but can cause disease if they affect critical genes.
Population genetics examines how genetic variation is maintained, distributed, and evolves in groups, providing the mathematical foundation for genetic epidemiology. It helps predict disease risks based on allele frequencies and detect deviations signaling selection or population structure.
The gene pool is the total collection of alleles in a population at a given time. Frequencies are key metrics for assessing genetic diversity and disease susceptibility.
\[ \text{Allele Frequency (p for A)} = \frac{2 \times \text{Number of AA} + \text{Number of Aa}}{2 \times \text{Total individuals}}. \]
Example: In a population of 100 people with genotypes: 40 AA, 50 Aa, 10 aa. p (A) = (240 + 50) / 200 = 0.65; q (a) = (210 + 50) / 200 = 0.35.
HWE is a null model assuming no evolutionary forces, predicting genotype frequencies from allele frequencies in a stable population. Assumptions: Large population, random mating, no mutation, no migration, no selection. For a biallelic locus with alleles A (p) and a (q=1-p):
\[ P(AA) = p^2\ (homozygous\ dominant) \]
\[ P(Aa) = 2pq\ (heterozygous) \]
\[ P(aa) = q^2\ (homozygous\ recessive) \]
Deviations from HWE: Indicate forces like inbreeding (excess homozygotes), assortative mating, or genotyping errors. In epidemiology, HWE testing filters SNPs in controls to ensure data quality.
Testing for HWE: Use Chi-squared goodness-of-fit test.
Formula: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}, \] \(df=1\) for biallelic loci.
Critical value: \(>3.84\) for \(p<0.05\) (reject HWE).
Example:
R Implementation for HWE: This code uses base R for manual calculation but adds input validation, exact test option (for small samples), and clearer output. For advanced use, consider the HardyWeinberg package.
# Hardy-Weinberg Equilibrium Test
# Inputs: Observed genotype counts as a named vector (AA, Aa, aa)
obs <- c(AA = 400, Aa = 500, aa = 100) # Example observed counts
# Input validation
if (any(obs < 0) || length(obs) != 3) stop("Invalid genotype counts")
total <- sum(obs)
if (total == 0) stop("No individuals in population")
# Calculate allele frequencies
p <- (2 * obs["AA"] + obs["Aa"]) / (2 * total) # Frequency of A
q <- 1 - p # Frequency of a
# Expected counts under HWE
expected <- c(AA = p^2 * total, Aa = 2 * p * q * total, aa = q^2 * total)
# Chi-squared test (ensure expected >5 for validity)
if (any(expected < 5)) warning("Expected counts <5; consider exact test")
chi2 <- sum((obs - expected)^2 / expected)
df <- 1 # Degrees of freedom for biallelic
p_val_chi <- pchisq(chi2, df = df, lower.tail = FALSE)
# Optional: Fisher's exact test for small samples (using contingency table simulation)
# But for simplicity, we use chi-squared here
# Output results
cat("Allele Frequencies: p(A) =", round(p, 3), "q(a) =", round(q, 3), "\n")## Allele Frequencies: p(A) = 0.65 q(a) = 0.35
cat("Expected Counts: AA =", round(expected["AA"], 1), "Aa =", round(expected["Aa"], 1), "aa =", round(expected["aa"], 1), "\n")## Expected Counts: AA = NA Aa = NA aa = NA
## Chi-squared = 9.7814 df = 1 p-value = 0.0017628
if (p_val_chi < 0.05) {
cat("Reject HWE: Possible deviation due to selection, inbreeding, or error.\n")
} else {
cat("Fail to reject HWE: Population appears in equilibrium.\n")
}## Reject HWE: Possible deviation due to selection, inbreeding, or error.
Pedigrees are diagrammatic representations of family histories that illustrate the inheritance patterns of traits or diseases across generations. They are essential tools in genetic epidemiology for identifying modes of inheritance, estimating risks, and guiding genetic counseling. Symbols in pedigrees typically include squares for males, circles for females, filled shapes for affected individuals, and slashes for deceased persons. Arrows may indicate the proband (the individual through whom the family is ascertained). Pedigrees help visualize vertical (parent-to-child) or horizontal (sibling) transmission patterns, skips in generations, and sex-specific effects.
Genetic traits and diseases follow specific inheritance patterns based on the chromosomal location and dominance of alleles. Understanding these modes aids in predicting disease risks and designing targeted epidemiological studies. Below, we detail common modes, including characteristics, pedigree patterns, examples, and epidemiological implications.
Probabilistic models quantify inheritance risks, incorporating prior probabilities, transmission, and penetrance. This is crucial for Bayesian risk assessment in genetic counseling.
where the product is over all individuals, \(P(\text{genotype}_i)\) is based on Mendelian laws and parental genotypes, and \(P(\text{phenotype}_i | \text{genotype}_i)\) accounts for penetrance.
Penetrance: Probability of phenotypic expression given a genotype (e.g., 100% for complete, <100% for incomplete). Incomplete penetrance (e.g., BRCA1 mutations in breast cancer) leads to unaffected gene carriers.
Variable Expressivity: Variation in phenotype severity among those with the same genotype (e.g., neurofibromatosis type 1).
Example: In an AD pedigree, risk to a child of an affected parent (Dd) is 50% (Dd) × penetrance.
Applications: Risk prediction software like BRCAPRO uses these models. In epidemiology, adjusts for ascertainment bias (families selected via affected probands).
Limitations: Assumes accurate family history; ignores modifiers like environment.
Linkage analysis identifies genes by studying co-inheritance of markers and traits in families, leveraging the fact that nearby loci on a chromosome are less likely to separate during meiosis.
Genetic Linkage: Occurs when genes are close on the same chromosome, violating independent assortment.
Recombination Fraction (θ): Proportion of gametes with recombination between two loci. \[ \theta = 0.5: \text{No linkage (independent, >50 cM apart)}. \] \[ \theta < 0.5: \text{Linkage (e.g., θ=0.1 means 10% recombination)}. \] \[ \theta = 0: \text{Complete linkage (no recombination, syntenic loci)}. \] Units: Measured in centimorgans (cM); 1 cM ≈ 1% recombination ≈ 1 Mb DNA.
Phases: Coupling (alleles on same chromosome) vs. repulsion (opposite).
Applications: Parametric (assumes model) for Mendelian diseases; non-parametric for complex traits. Historical in positional cloning (e.g., cystic fibrosis gene).
The LOD (Logarithm of Odds) score statistically tests for linkage by comparing likelihoods.
| LOD Score | Odds Ratio | Interpretation |
|---|---|---|
| Z ≤ -2 | ≤1:100 | Strong evidence against linkage (exclusion) |
| -2 < Z < 3 | Indeterminate | Suggestive but not conclusive |
| Z ≥ 3 | ≥1000:1 | Strong evidence for linkage (genome-wide significance) |
LD refers to non-random association of alleles at different loci in a population, often due to shared ancestry rather than physical linkage. It decays over generations via recombination and is key in association studies like GWAS, where markers tag causal variants.
Common metrics quantify deviation from independence. Assume two biallelic loci: A/a (frequencies p_A, 1-p_A) and B/b (p_B, 1-p_B); haplotype frequency p_AB.
Formula: \(D_{AB} = p_{AB} - p_A p_B\).
Interpretation: D>0: Excess AB haplotypes; D<0: Deficit. D=0: Equilibrium.
Limitations: Sensitive to allele frequencies; not comparable across loci.
Formula: if \(D<0\), \(D' = \frac{D}{\max(-p_A p_B, -(1-p_A)(1-p_B))}\) if \(D>0\), \(D' = \frac{D}{\min((1-p_A)p_B, p_A(1-p_B))}\).
Interpretation: Ranges -1 to 1; |D’|=1: Complete LD (no recombination evidence); |D’|<1: Partial.
Use: Detects historical recombination.
Formula: \(r^2 = \frac{D^2}{p_A (1-p_A) p_B (1-p_B)}\).
Interpretation: 0 to 1; r²=1: Perfect correlation (one SNP predicts another); r²>0.8: Strong proxy. Relates to power in association studies (effective sample size reduced by 1/r²).
Use: Preferred in GWAS for tag SNP selection; less biased by rare alleles.
R Implementation for LD: This code computes \(D\), \(D'\), and \(r^2\) from example haplotype frequencies or
counts. It uses base R for simplicity; for real data,
consider packages like genetics or snpStats.
# Linkage Disequilibrium Measures
# Input: Haplotype counts (e.g., from population data)
# Assume biallelic loci A/a, B/b
count_AB <- 120 # AB haplotypes
count_Ab <- 80 # Ab
count_aB <- 60 # aB
count_ab <- 140 # ab
total <- sum(c(count_AB, count_Ab, count_aB, count_ab))
# Haplotype frequencies
p_AB <- count_AB / total
p_Ab <- count_Ab / total
p_aB <- count_aB / total
p_ab <- count_ab / total
# Allele frequencies
p_A <- p_AB + p_Ab
p_a <- 1 - p_A
p_B <- p_AB + p_aB
p_b <- 1 - p_B
# D
D <- p_AB - p_A * p_B
# D' (normalized)
D_max_pos <- min(p_A * p_b, p_a * p_B)
D_max_neg <- -min(p_A * p_B, p_a * p_b)
D_prime <- ifelse(D >= 0, D / D_max_pos, D / D_max_neg)
# r-squared
r_sq <- D^2 / (p_A * p_a * p_B * p_b)
# Output
cat("Allele Frequencies: p_A =", round(p_A, 3), "p_B =", round(p_B, 3), "\n")## Allele Frequencies: p_A = 0.5 p_B = 0.45
## D = 0.075
## D' = 0.3333
## r-squared = 0.0909
## Partial or no LD.
GWAS tests for correlation between genetic markers (SNPs) and phenotypes across the entire genome in unrelated individuals.
Statistical Model: Typically uses logistic regression for case-control studies. \[ \ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 \cdot \text{SNP} + \text{Covariates}. \]
Manhattan & QQ Plots: Used to visualize results. Because millions of tests are performed, strict significance thresholds (e.g., \(p < 5 \times 10^{-8}\)) are required to avoid false positives.
Disease risk is often modeled as a combination of genetics (\(G\)), environment (\(E\)), and their interaction (\(G \times E\)). \[ Y = \beta_0 + \beta_1 G + \beta_2 E + \beta_3 (G \times E) + \epsilon. \]
Interaction Models: - Synergistic: Genotype exacerbates the risk factor (or vice versa). - Independent: Both factors influence risk but do not interact.
In addition to genetic metrics, standard epidemiological measures provide essential tools for assessing disease risk, evaluating interventions, and guiding public health decisions. These metrics help quantify associations between exposures and outcomes, estimate treatment effects, and inform policy. Below, we outline key measures, including definitions, formulas, interpretations, and examples. Where relevant, we include R code implementations for practical computation.
A 2×2 contingency table with odds ratio formula.
| Exposed | Unexposed | Total | |
|---|---|---|---|
| Cases (Diseased) | a | c | a + c |
| Controls (Healthy) | b | d | b + d |
| Total | a + b | c + d | N |
Odds Ratio (OR) Calculation: \[ OR = \frac{a / c}{b / d} = \frac{ad}{bc} \]
Formula: \(OR = \frac{ad}{bc}\)
Odds Ratio = [Exposed Cases × Unexposed Non-cases] / [Exposed Non-cases × Unexposed Cases].
R Implementation for Key Measures: This code snippet computes ARR, RR, OR, NNT/NNH from example data. It includes error handling and supports both benefit and harm scenarios.
# Example data: 2x2 table for OR/RR (cohort study assumption)
# Rows: Exposed (1) vs Unexposed (0); Columns: Cases vs Non-cases
a <- 20 # Exposed cases
b <- 80 # Exposed non-cases
c <- 2 # Unexposed cases
d <- 98 # Unexposed non-cases
# Incidences
I_exposed <- a / (a + b)
I_unexposed <- c / (c + d)
# Absolute Risk Reduction (assuming unexposed = control, exposed = treatment)
ARR <- I_unexposed - I_exposed # Positive if treatment reduces risk
# Relative Risk
RR <- I_exposed / I_unexposed
# Odds Ratio
OR <- (a * d) / (b * c)
# Number Needed to Treat/Harm
NNT <- ifelse(ARR != 0, 1 / abs(ARR), Inf)
type <- ifelse(ARR > 0, "NNT (Benefit)", ifelse(ARR < 0, "NNH (Harm)", "No Effect"))
# Output
cat("Incidence Exposed:", round(I_exposed, 3), "\n")## Incidence Exposed: 0.2
## Incidence Unexposed: 0.02
## ARR: -0.18
## RR: 10
## OR: 12.25
## NNH (Harm) : 5.6
# Harm example (swap for treatment increasing risk)
I_control <- 0.50
I_treatment <- 0.80
ARR_harm <- I_control - I_treatment
NNT_harm <- 1 / abs(ARR_harm)
cat("\nHarm Example - ARR:", round(ARR_harm, 3), "\n")##
## Harm Example - ARR: -0.3
## NNH: 3.3
The following code uses a custom function to handle both benefit (reduction in risk) and harm (increase in risk) scenarios dynamically.
# Function to calculate Epidemiological Metrics
calculate_epi_stats <- function(a, b, c, d) {
# Input validation
if (any(c(a, b, c, d) < 0)) stop("Cell counts must be non-negative")
# Basic Incidences
i_exp <- a / (a + b)
i_unexp <- c / (c + d)
# Absolute Risk Reduction (ARR)
# Defined as Control Risk - Treated Risk
arr <- i_unexp - i_exp
# Relative Risk (RR) and Odds Ratio (OR)
# Adding small constant (0.5) if zero to avoid division by zero errors
rr <- i_exp / i_unexp
or <- (a * d) / (b * c)
# NNT / NNH Logic
nnt_val <- if (arr != 0) 1 / abs(arr) else Inf
nnt_label <- ifelse(arr > 0, "NNT (Benefit)",
ifelse(arr < 0, "NNH (Harm)", "No Effect"))
# Return formatted list
return(list(
Incidence_Exposed = round(i_exp, 4),
Incidence_Unexposed = round(i_unexp, 4),
ARR = round(arr, 4),
RR = round(rr, 2),
OR = round(or, 2),
Measure = nnt_label,
Value = round(nnt_val, 1)
))
}
# --- Example 1: Beneficial Treatment ---
# (e.g., Vaccine trial: Exposed = Vaccinated)
result_benefit <- calculate_epi_stats(a=20, b=80, c=40, d=60)
print(result_benefit)## $Incidence_Exposed
## [1] 0.2
##
## $Incidence_Unexposed
## [1] 0.4
##
## $ARR
## [1] 0.2
##
## $RR
## [1] 0.5
##
## $OR
## [1] 0.38
##
## $Measure
## [1] "NNT (Benefit)"
##
## $Value
## [1] 5
# --- Example 2: Harmful Exposure ---
# (e.g., Smoking vs Lung Cancer)
result_harm <- calculate_epi_stats(a=50, b=50, c=10, d=90)
print(result_harm)## $Incidence_Exposed
## [1] 0.5
##
## $Incidence_Unexposed
## [1] 0.1
##
## $ARR
## [1] -0.4
##
## $RR
## [1] 5
##
## $OR
## [1] 9
##
## $Measure
## [1] "NNH (Harm)"
##
## $Value
## [1] 2.5
To understand the output and interpret the results, we use the following logic.
| Metric | Interpretation | Calculation |
|---|---|---|
| ARR | The actual difference in risk between groups. | \(I_{unexposed} - I_{exposed}\) |
| RR | How many times more likely the outcome is in the exposed group. | \(I_{exposed} / I_{unexposed}\) |
| OR | The ratio of the odds of the outcome occurring in the exposed group. | \(\frac{a \times d}{b \times c}\) |
| NNT | The number of patients needed to treat to prevent one bad outcome. | \(1 / ARR\) |
Note on NNT/NNH: If the ARR is positive (risk decreased), we report Number Needed to Treat (NNT). If the ARR is negative (risk increased), we report Number Needed to Harm (NNH).
Confidence Intervals: In a real-world study, we typically
report 95% CIs. Using prop.test() in base R to get CIs for
risk differences.
Log Transformation: For OR and RR, calculations for CIs are typically done on the log scale to ensure symmetry.
Modern epidemiology relies heavily on computational tools.
R Packages: epiR (Epi measures),
genetics (HWE, LD), survival (time-to-event),
qqman (GWAS visualization).Scenario: Analyze the pedigree below under a Dominant Inheritance model. We need to estimate the recombination fraction \(\theta\). Placeholder for Figure 8
Calculate LOD Scores:
Using Maximum Likelihood Estimation (MLE), if the phase is known and the
data include 4 non-recombinant and 1 recombinant meioses, the likelihood
is:
\(L(\theta) = (1-\theta)^4
\theta\).
The LOD score is defined as:
\(Z(\theta) =
\log_{10}\left(\frac{L(\theta)}{L(0.5)}\right)\), where \(L(0.5) = (0.5)^5 = 1/32\).
Corrected Result Table: | \(\theta\) | \(Z(\theta)\) | |———-|————-| | 0.0 | \(-\infty\) | | 0.10 | 0.322 | | 0.20 | 0.418 (Max LOD) | | 0.50 | 0.0 |
The maximum LOD score occurs at \(\hat{\theta} = 0.20\), consistent with 1 recombinant out of 5 informative meioses.
Note: A \(LOD > 3\) is conventionally required for significant linkage; this example is illustrative.
Scenario: A trial reports 800 events among 1000 patients in Treatment Group A and 600 events among 1200 patients in Control Group B.
Event rates: \(p_A = 800/1000 = 0.80\), \(p_B = 600/1200 = 0.50\).
Absolute Risk Increase (ARI): \(\text{ARI} = p_A - p_B = 0.30\).
Since the treatment increases risk, we compute the Number Needed
to Harm (NNH):
\(\text{NNH} = \frac{1}{\text{ARI}} =
\frac{1}{0.30} \approx 3.33\).
Interpretation: For every 3–4 patients treated with intervention A (vs. control), one additional adverse event occurs. The negative NNT is conventionally reported as NNH.
Scenario: Simulate a case-control GWAS with 500 cases and 500 controls, Minor Allele Frequency (MAF) = 0.2, and Odds Ratio (OR) = 1.5. Estimate statistical power at \(\alpha = 0.05\).
The following R function correctly simulates genotypes separately for cases and controls based on genotype relative risks derived from the OR:
gwas_power_case_control <- function(n_cases = 200, n_controls = 200,
maf = 0.2, OR = 1.5,
alpha = 0.05, n_sims = 100) {
# Genotype coding: 0, 1, 2 copies of effect allele
# Relative risks under additive log-odds model: 1, OR, OR^2
rr <- c(1, OR, OR^2)
# Genotype frequencies in controls (Hardy-Weinberg)
g_ctrl <- dbinom(0:2, size = 2, prob = maf)
# Genotype frequencies in cases (proportional to g_ctrl * rr)
g_case_unscaled <- g_ctrl * rr
g_case <- g_case_unscaled / sum(g_case_unscaled)
significant <- numeric(n_sims)
for (i in 1:n_sims) {
# Simulate genotypes
geno_cases <- sample(0:2, n_cases, replace = TRUE, prob = g_case)
geno_ctrls <- sample(0:2, n_controls, replace = TRUE, prob = g_ctrl)
geno <- c(geno_cases, geno_ctrls)
status <- c(rep(1, n_cases), rep(0, n_controls))
# Fit logistic regression
model <- glm(status ~ geno, family = binomial)
p_val <- summary(model)$coefficients[2, 4]
significant[i] <- as.numeric(p_val < alpha)
}
return(mean(significant)) # Estimated power
}
# Example usage:
set.seed(2026)
power_est <- gwas_power_case_control()
cat("Estimated power:", round(power_est, 3), "\n")## Estimated power: 0.69
Scenario: In a GWAS control group (\(n = 1000\)), genotype counts for a SNP
are:
AA = 240, Aa = 120, aa = 40. Test HWE using an exact test.
# Observed counts: AA, Aa, aa
obs <- c(240, 120, 40)
# Exact HWE test
hwe_test <- HWExact(obs, verbose = FALSE)## Warning in homozyg(X): Genotypes are not labelled, default labels (AA, AB, BB)
## assumed.
## Warning in heterozyg(X): Genotypes are not labelled, default labels assumed.
## HWE exact test p-value: 9.821e-05
Scenario: Evaluate whether age and sex confound the association between an exposure and disease outcome.
We’ll use the Logistic Distribution function plogis, which provides the standard Density, distribution function, quantile function and random generation for the logistic distribution with parameters location and scale, and rbinom.
set.seed(123)
n <- 5000
# Simulate confounders and exposure
age <- rnorm(n, mean = 50, sd = 10)
sex <- rbinom(n, 1, 0.5)
exposure <- rbinom(n, 1, plogis(-1 + 0.02 * age + 0.3 * sex))
# Simulate disease influenced by exposure and confounders
disease <- rbinom(n, 1, plogis(-2 + 0.9 * exposure + 0.03 * age + 0.4 * sex))
# Crude model
crude_or <- exp(coef(glm(disease ~ exposure, family = binomial))["exposure"])
# Adjusted model
adj_or <- exp(coef(glm(disease ~ exposure + age + sex, family = binomial))["exposure"])
# Percent change due to confounding
confounding_pct <- (crude_or - adj_or) / crude_or * 100
cat("Crude OR:", round(crude_or, 2), "\n")## Crude OR: 2.38
## Adjusted OR: 2.23
## Percent change: 6.5 %
if (abs(confounding_pct) > 10) {
cat("→ Substantial confounding detected (change > 10%).\n")
} else cat("→ No substantial confounding detected (change < 10%).\n")## → No substantial confounding detected (change < 10%).
Scenario: A continuous biomarker is measured in 100 diseased and 100 non-diseased individuals. Determine the optimal cutoff, sensitivity, and specificity using Youden’s index.
# Simulate biomarker scores
set.seed(42)
score_healthy <- rnorm(100, mean = 0, sd = 1)
score_diseased <- rnorm(100, mean = 1.5, sd = 1)
score <- c(score_healthy, score_diseased)
status <- c(rep(0, 100), rep(1, 100))
# ROC analysis
roc_curve <- roc(status, score)
auc_val <- auc(roc_curve)
# Optimal threshold (Youden's J statistic)
opt <- coords(roc_curve, "best", ret = c("threshold", "sensitivity", "specificity"))
cat("AUC:", round(auc_val, 3), "\n")## AUC: 0.844
## Optimal cutoff: 0.83
## Sensitivity: 0.77
## Specificity: 0.81
Try to integrate structural fixes, biological realism (fitness selection), and technical optimizations into the above genetics simulation of Phylodynamics and Genomic Surveillance in R. Try to demonstrates how viral mutations, spatial movement, and natural selection interact to shape an outbreak.
The Core Simulation Engine can initialize a population and simulates a Random Walk movement model, by introducing a Fitness Function; mutations at specific genomic loci now increase the transmission probability, simulating the emergence of a Variant of Concern.
# --- Setup Parameters ---
set.seed(42)
n_agents <- 200
map_size <- 100
initially_infected <- 3
mutation_rate <- 0.02
timesteps <- 60
seq_length <- 50
bases <- c("A", "C", "G", "T")
# --- Helper Functions ---
founder_seq <- paste(sample(bases, seq_length, replace = TRUE), collapse = "")
mutate_dna <- function(sequence) {
seq_vec <- strsplit(sequence, "")[[1]]
mutation_mask <- runif(seq_length) < mutation_rate
if(any(mutation_mask)) {
seq_vec[mutation_mask] <- sample(bases, sum(mutation_mask), replace = TRUE)
}
return(paste(seq_vec, collapse = ""))
}
calc_fitness <- function(sequence) {
gc_content <- nchar(gsub("[AT]", "", sequence)) / nchar(sequence)
return(0.3 + (gc_content * 0.2))
}
# --- Initialize Population ---
pop <- tibble(
id = 1:n_agents,
x = runif(n_agents, 0, map_size),
y = runif(n_agents, 0, map_size),
state = "S",
genome = founder_seq,
parent_id = NA_integer_,
infect_time = NA_real_
)
# Set state to "I" for initial cases
pop$state[1:initially_infected] <- "I"
pop$infect_time[1:initially_infected] <- 0
# --- Run Simulation ---
history <- list()
edges <- data.frame(from = integer(), to = integer())
for (t in 1:timesteps) {
# Assignment operator (<-) to update movement
# loaded cyclical harmonics
# pop$x <- pop$x * rep(0.1 + cos(10*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
# pop$y <- pop$y * rep(0.1 + sin(5*pi*t/timesteps), times=n_agents) + rnorm(n_agents, 0, 2)
# simple stochastic dynamics
pop$x <- pop$x + rnorm(n_agents, 0, 2)
pop$y <- pop$y + rnorm(n_agents, 0, 2)
# Optional: Keep agents within map boundaries
pop$x <- pmax(0, pmin(map_size, pop$x))
pop$y <- pmax(0, pmin(map_size, pop$y))
# 2. Transmission
infected_idx <- which(pop$state == "I")
for (i in infected_idx) {
susceptible_idx <- which(pop$state == "S")
if(length(susceptible_idx) == 0) break
dists <- sqrt((pop$x[susceptible_idx] - pop$x[i])^2 + (pop$y[susceptible_idx] - pop$y[i])^2)
contacts <- susceptible_idx[dists < 6]
for (target in contacts) {
if (pop$state[target] == "S" && runif(1) < calc_fitness(pop$genome[i])) {
pop$state[target] <- "I"
pop$infect_time[target] <- t
pop$parent_id[target] <- pop$id[i]
pop$genome[target] <- mutate_dna(pop$genome[i])
edges <- rbind(edges, data.frame(from = pop$id[i], to = pop$id[target]))
}
}
}
# subsetting and assignment for Recovery
recovery_idx <- which(pop$state == "I" & (t - pop$infect_time) >= 10)
if(length(recovery_idx) > 0) {
pop$state[recovery_idx] <- "R"
}
history[[t]] <- pop %>% mutate(time = t)
}
sim_data <- bind_rows(history)
# str(sim_data)anim <- ggplot(sim_data, aes(x = x, y = y, color = genome, size = state)) +
geom_point(alpha = 0.6) +
scale_size_manual(values = c("S" = 1, "I" = 3, "R" = 1.5)) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Viral Spread & Selection Pressure: Day {frame}") +
transition_time(time)
# animate(anim)# Extract final strains
final_strains <- sim_data %>%
filter(state %in% c("I", "R"), !is.na(infect_time)) %>%
distinct(genome) %>%
pull(genome)
dist_mat <- stringdistmatrix(final_strains, final_strains, method = "hamming")
rownames(dist_mat) <- paste0("Strain_", 1:length(final_strains))
colnames(dist_mat) <- paste0("Strain_", 1:length(final_strains))
# Visualize Genetic Distance
plot_ly(z = ~dist_mat, x = colnames(dist_mat), y = rownames(dist_mat), type = "heatmap", colorscale = "Viridis") %>%
layout(title = "Genetic Hamming Distance Matrix")# Reconstruct the tree using igraph
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)
ggraph(transmission_graph, layout = 'tree') +
geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
geom_node_point(aes(color = name), size = 3, show.legend = FALSE) +
theme_void() +
labs(title = "Reconstructed Epidemiological Transmission Tree")# Calculate divergence from founder
clock_data <- sim_data %>%
filter(!is.na(infect_time)) %>%
mutate(dist_from_founder = stringdist(genome, founder_seq, method = "hamming"))
ggplot(clock_data, aes(x = infect_time, y = dist_from_founder)) +
geom_jitter(alpha = 0.2, color = "steelblue") +
geom_smooth(method = "lm", color = "red") +
theme_minimal() +
labs(title = "The Molecular Clock: Mutation Accumulation",
x = "Time of Infection (Days)",
y = "Hamming Distance from Founder")Here are a a few additional plots.
# Create the ggraph plot
gg <- ggraph(transmission_graph, layout = 'tree') +
geom_edge_diagonal(alpha = 0.3, color = "darkgrey") +
geom_node_point(aes(color = name,
text = paste("ID:", name,
"<br>Click for details")),
size = 5) +
theme_void() +
labs(title = "Reconstructed Epidemiological Transmission Tree")## Warning in geom_node_point(aes(color = name, text = paste("ID:", name,
## "<br>Click for details")), : Ignoring unknown aesthetics: text
# Convert to interactive plotly
ggplotly(gg, tooltip = "text") %>%
layout(
hoverlabel = list(bgcolor = "white", font = list(size = 12)),
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE)
)## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomEdgePath() has yet to be implemented in plotly.
## If you'd like to see this geom implemented,
## Please open an issue with your example code at
## https://github.com/ropensci/plotly/issues
and …
# Convert igraph to networkD3 format
transmission_d3 <- igraph_to_networkD3(transmission_graph)
# Prepare nodes with tooltips
nodes_d3 <- transmission_d3$nodes
if(exists("pop")) {
nodes_d3 <- nodes_d3 %>%
left_join(
pop %>%
mutate(name = as.character(id)) %>%
select(name, state, infect_time, genome),
by = "name"
)
} else {
nodes_d3$state <- "Unknown"
nodes_d3$infect_time <- NA
nodes_d3$genome <- NA
}
# Create custom tooltips
nodes_d3$tooltip <- sprintf(
"ID: %s<br>State: %s<br>Infection Time: %s<br>Genome: %s",
nodes_d3$name,
nodes_d3$state,
ifelse(is.na(nodes_d3$infect_time), "N/A", nodes_d3$infect_time),
ifelse(is.na(nodes_d3$genome), "N/A", substr(nodes_d3$genome, 1, 15))
)
# Create the force-directed network
# forceNetwork(
# Links = transmission_d3$links,
# Nodes = nodes_d3,
# Source = 'source',
# Target = 'target',
# NodeID = 'name',
# Group = 'state',
# Nodesize = 5,
# opacity = 0.9,
# zoom = TRUE,
# fontSize = 14,
# linkDistance = 150,
# charge = -100,
# opacityNoHover = 0.5,
# legend = TRUE,
# bounded = TRUE,
# clickAction = 'alert("Node ID: " + d.name + "\\nGroup: " + d.state)'
# )
forceNetwork(
Links = transmission_d3$links,
Nodes = nodes_d3,
Source = 'source',
Target = 'target',
NodeID = 'name',
Group = 'state',
Nodesize = 5,
opacity = 0.9,
zoom = TRUE,
fontSize = 14,
linkDistance = 10, # now longer
charge = -10, # stronger repulsion
# bounded = FALSE, # default is FALSE – remove/comment
opacityNoHover = 0.5,
legend = TRUE,
clickAction = 'alert("Node ID: " + d.name + "\\nGroup: " + d.state)'
)and …
# 1. Create the graph object
transmission_graph <- graph_from_data_frame(edges, directed = TRUE)
# 2. Prepare nodes data
# Get vertex names
node_names <- V(transmission_graph)$name
# Merge with pop data for additional information
nodes <- data.frame(
id = node_names,
label = node_names,
title = NA, # Will be filled with tooltip info
stringsAsFactors = FALSE
)
# Join with pop data to get metadata for tooltips
if(exists("pop")) {
nodes <- nodes %>%
left_join(
pop %>%
mutate(id = as.character(id)) %>%
select(id, state, infect_time, genome),
by = "id"
) %>%
mutate(
title = paste0(
"<b>Agent ID:</b> ", id, "<br>",
"<b>State:</b> ", state, "<br>",
"<b>Infection Time:</b> ", infect_time, "<br>",
"<b>Genome:</b> ", ifelse(!is.na(genome), substr(genome, 1, 20), "N/A")
)
)
} else {
# Fallback if pop doesn't exist
nodes$title <- paste0("<b>Agent ID:</b> ", nodes$id)
}
# 3. Prepare edges data
edges_df <- igraph::as_data_frame(transmission_graph, what = "edges")
edges_vis <- data.frame(
from = edges_df$from,
to = edges_df$to,
arrows = "to",
color = list(color = "gray", opacity = 0.6),
width = 1,
smooth = TRUE
)
# 4. Optional: Add colors based on state or infection time
if("state" %in% colnames(nodes)) {
# Create color mapping for different states
unique_states <- unique(nodes$state)
colors <- viridisLite::viridis(length(unique_states))
color_map <- setNames(colors, unique_states)
nodes$color <- sapply(nodes$state, function(s) {
if(is.na(s)) "#808080" else color_map[s]
})
} else if("infect_time" %in% colnames(nodes)) {
# Color by infection time
min_time <- min(nodes$infect_time, na.rm = TRUE)
max_time <- max(nodes$infect_time, na.rm = TRUE)
# Normalize infection time for color scale
nodes$value <- nodes$infect_time
nodes$color <- viridisLite::viridis(100)[
cut(nodes$infect_time, 100, labels = FALSE)
]
} else {
nodes$color <- "#2C3E50" # Default color
}
# 5. Create the interactive network
visNetwork(nodes, edges_vis,
main = "Interactive Transmission Tree",
submain = "Hover over nodes for details") %>%
visNodes(
shape = "dot",
size = 15,
borderWidth = 2,
borderWidthSelected = 3,
shadow = list(enabled = TRUE, size = 10)
) %>%
visEdges(
shadow = TRUE,
smooth = list(type = "curvedCW", roundness = 0.1)
) %>%
visOptions(
highlightNearest = list(
enabled = TRUE,
degree = 1,
hover = TRUE,
algorithm = "hierarchical"
),
nodesIdSelection = list(
enabled = TRUE,
values = nodes$id,
style = 'width: 150px; height: 26px'
)
) %>%
visInteraction(
hover = TRUE,
hoverConnectedEdges = TRUE,
tooltipDelay = 0,
navigationButtons = TRUE,
dragNodes = TRUE,
dragView = TRUE,
zoomView = TRUE
) %>%
visPhysics(
stabilization = TRUE,
solver = "forceAtlas2Based",
forceAtlas2Based = list(
gravitationalConstant = -50,
centralGravity = 0.01,
springLength = 100,
springConstant = 0.08,
damping = 0.4,
avoidOverlap = 1
)
) %>%
visLayout(randomSeed = 123) # For reproducible layoutAs well as …
# Convert to D3 format
transmission_d3 <- igraph_to_networkD3(transmission_graph)
# Add node attributes
transmission_d3$nodes <- transmission_d3$nodes %>%
left_join(
pop %>%
mutate(name = as.character(id)) %>%
select(name, state, infect_time, genome),
by = "name"
) %>%
mutate(
tooltip = paste0(
"Agent ID: ", name,
"<br>State: ", state,
"<br>Infection Time: ", infect_time,
"<br>Genome: ", ifelse(!is.na(genome), substr(genome, 1, 10), "NA"), "..."
)
)
# Create force directed network
forceNetwork(
Links = transmission_d3$links,
Nodes = transmission_d3$nodes,
Source = 'source',
Target = 'target',
NodeID = 'name',
Group = 'state',
Nodesize = 5,
opacity = 0.8,
zoom = TRUE,
fontSize = 12,
linkDistance = 100,
charge = -30,
opacityNoHover = 0.5,
legend = TRUE
)calc_fitness()
function to give a \(50\%\)
transmission boost to any sequence containing the pattern GAA. How does
this change the speed of the outbreak?