Power analysis is a statistical approach for explicating the relations among the key parameters that affect most experimental designs. Data are proxies for the natural phenomena, or processes, about which we try to make inference, and the size of a sample determines our ability to derive useful information about the underlying process or to make predictions about its past, present, or future states. In some situations, given the sample size and a certain degree of confidence, we can compute the power to statistically detect an effect of interest. Similarly, we can determine the likelihood of detecting an effect of a certain size, subject to a predefined level of confidence and specific sample-size constraints. This power, or probability, to detect the effect of interest may be low, medium, or high, which helps us determine the potential value of the experiment.
In most experimental designs, power analyses establish a relation between 5 quantities:

1. the specific statistical test (inference procedure),
2. the sample size (\(n\)),
3. the effect size,
4. the significance level (\(\alpha\), the probability of a Type I error), and
5. the power (the probability of detecting an effect that is truly present).
In mathematical terms, once the statistical test is fixed, specifying any 3 of the last 4 parameters allows us to estimate the remaining one. Note that there is no general analytical expression providing an exact closed-form relation (e.g., an implicit or explicit function) among all 5 quantities.
The R package pwr provides the core functionality to conduct power analysis for some situations. It includes the following methods:
Function | Corresponding Statistical Inference |
---|---|
cohen.ES | Conventional effect sizes |
ES.h | Effect-size calculation for proportions |
ES.w1 | Effect-size calculation for the chi-squared test of goodness of fit |
ES.w2 | Effect-size calculation for the chi-squared test of association |
pwr.2p.test | Two-proportions test (equal sample sizes, n) |
pwr.2p2n.test | Two-proportions test (unequal n) |
pwr.anova.test | Balanced one-way ANOVA |
pwr.chisq.test | Chi-squared test |
pwr.f2.test | General linear model (GLM) |
pwr.norm.test | Power calculation for the mean of a normal distribution (known variance) |
pwr.p.test | Single-sample proportion test |
pwr.r.test | Correlation test |
pwr.t.test | t-tests (one-sample, two-sample, paired) |
pwr.t2n.test | t-test of means (two samples with unequal n) |
As each method explicitly specifies the statistical inference procedure, we only need to specify 3 of the remaining 4 quantities (effect size, sample size, significance level, and power) to calculate the last parameter. A common practice is to use the default significance level \(\alpha=0.05\), which leaves us specifying 2 of the 3 remaining parameters. For instance, given an effect size (from prior research or an oracle) and a desired power, we can calculate an appropriate sample size for the experimental design.
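For example, with pwr.t.test (a minimal sketch; the numeric values below are purely illustrative), leaving exactly one of the four quantities unspecified makes the function solve for it:

library(pwr)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8)  # solve for the per-group sample size n
pwr.t.test(n = 64, d = 0.5, sig.level = 0.05)       # solve for the achieved power
pwr.t.test(n = 64, sig.level = 0.05, power = 0.8)   # solve for the detectable effect size d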
Determining an appropriate effect size is often a challenge that can be tackled by running simulations, collecting pilot data, or using Cohen's effect-size conventions (developed in the social sciences), which categorize the effect size as small, medium, or large.
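The pwr function cohen.ES returns these conventional benchmarks directly; for instance:

# retrieve Cohen's conventional effect-size values
cohen.ES(test = "r", size = "medium")    # medium effect for correlation tests (r = 0.3)
cohen.ES(test = "anov", size = "large")  # large effect for one-way ANOVA (f = 0.4)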
Let’s look at some examples.
pwr.t.test(n = n, d = d, sig.level = a, power = b, type = c("two.sample", "one.sample", "paired"))
: In this method definition, \(n\) is the sample size, \(d\) is Cohen's effect size, \(a\) is the significance level, \(b\) is the desired power, and type indicates the specific parametric t-test we choose.
pwr.t2n.test(n1 = n1, n2= n2, d = d, sig.level = a, power = b)
: This is a more general call for independent t-tests with unequal sample sizes \(n_1\) and \(n_2\). Cohen's \(d\) conventions designate three benchmark values, \(0.2\), \(0.5\), and \(0.8\), representing small, medium, and large effect sizes, respectively.
pwr.anova.test(k = k, n = n, f = f, sig.level = a, power = b)
: A one-way analysis of variance (ANOVA) test with \(k\) groups, a common sample size \(n\) within each group, and effect size \(f\). Cohen's \(f\) values of 0.1, 0.25, and 0.4 represent small, medium, and large effect sizes, respectively.
pwr.r.test(n = n, r = r, sig.level = a, power = b)
: Correlation coefficient analysis, where \(n\) is the sample size and \(r\) is the correlation, which uses the population correlation coefficient as a measure of the effect size. Cohen’s r values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes, respectively.
pwr.f2.test(u = u, v = v, f2 = f2, sig.level = a, power = b)
: General linear models, including multiple linear regression, with \(u\), \(v\), and \(f2\) representing the numerator and denominator degrees of freedom of the \(F\)-test and the effect-size measure, respectively. Cohen's \(f2\) values of 0.02, 0.15, and 0.35 approximately represent small, medium, and large effect sizes, respectively.
pwr.chisq.test(w = w, N = N, df = df, sig.level = a, power = b)
: A chi-squared test with \(w\) the effect size, \(N\) the total sample size, and \(df\) the degrees of freedom. Cohen's \(w\) values of 0.1, 0.3, and 0.5 represent small, medium, and large effect sizes, respectively. A few of these interfaces are illustrated in the sketch below.
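Here is a minimal sketch (the numerical inputs are arbitrary, chosen only for demonstration) solving for power or sample size with pwr.t2n.test, pwr.r.test, and the ES.w1/pwr.chisq.test pair:

library(pwr)
# Power of an independent t-test with unequal group sizes and a medium effect
pwr.t2n.test(n1 = 30, n2 = 45, d = 0.5, sig.level = 0.05)

# Sample size needed to detect a medium correlation (r = 0.3) with 80% power
pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.8)

# Chi-squared goodness-of-fit test: compute the effect size w from the null and
# alternative cell probabilities, then solve for the total sample size N
P0 <- rep(1/4, 4)            # null distribution over 4 cells
P1 <- c(0.4, 0.3, 0.2, 0.1)  # hypothesized alternative distribution
w <- ES.w1(P0, P1)
pwr.chisq.test(w = w, df = 3, sig.level = 0.05, power = 0.8)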
Let's run a power analysis for a one-way ANOVA comparing 5 groups. Specifically, we are interested in estimating the sample size needed in each group to secure power of at least \(0.80\), given a moderate effect size (\(f=0.25\)) and a significance level of \(\alpha=0.025\).
# install.packages("pwr")
library(pwr)
pwr.anova.test(k=5, f=0.25, sig.level=0.025, power=0.8)
##
## Balanced one-way analysis of variance power calculation
##
## k = 5
## n = 46.12892
## f = 0.25
## sig.level = 0.025
## power = 0.8
##
## NOTE: n is number in each group
This suggests that at least \(47\) participants per group will be required, since the computed \(n=46.12892\) must be rounded up.
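As a quick sanity check, we can plug the rounded-up sample size back in and solve for the achieved power, which should come out slightly above the 0.80 target:

pwr.anova.test(k = 5, n = 47, f = 0.25, sig.level = 0.025)$power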
Would that sample size estimate increase or decrease when we increase or decrease the effect-size? Inspect the following two examples.
# install.packages("pwr")
# library(pwr)
pwr.anova.test(k=5, f=0.1, sig.level=0.025, power=0.8) # small effect-size
##
## Balanced one-way analysis of variance power calculation
##
## k = 5
## n = 282.3918
## f = 0.1
## sig.level = 0.025
## power = 0.8
##
## NOTE: n is number in each group
pwr.anova.test(k=5, f=0.4, sig.level=0.025, power=0.8) # large effect-size
##
## Balanced one-way analysis of variance power calculation
##
## k = 5
## n = 18.71997
## f = 0.4
## sig.level = 0.025
## power = 0.8
##
## NOTE: n is number in each group
For a 1-way ANOVA test, Cohen's effect size \(f\) is categorized as 0.1 (small), 0.25 (medium), and 0.4 (large), and is computed by:
\[f=\sqrt{\frac{\sum_{i=1}^k{p_i\times(\mu_i-\mu)^2}}{\sigma^2}},\] where \(n\) is the total number of observations in all groups, \(n_i\) is the number of observations in group \(i\), \(p_i=\frac{n_i}{n}\), \(\mu_i\) and \(\mu\) are the group \(i\) and overall means, and \(\sigma^2\) is the within-group variance. Similar analytical expressions exist for other statistical tests and there are corresponding sample-driven estimates of these effects that can be used for the practical calculations.
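For instance, here is a minimal sketch computing \(f\) directly from hypothesized group means and a common within-group standard deviation (all values below are assumed, purely for illustration):

mu_i <- c(10, 11, 12, 13, 14)   # hypothesized group means (assumption)
sigma <- 5                      # assumed common within-group standard deviation
p_i <- rep(1/5, 5)              # equal group-size proportions
mu <- sum(p_i * mu_i)           # overall (weighted) mean
f <- sqrt(sum(p_i * (mu_i - mu)^2) / sigma^2)
f                               # ~0.28, a medium effect by Cohen's convention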
Let's run a power analysis for a two-sample, one-sided t-test using a significance level of \(\alpha=0.001\), \(n=30\) participants per group, and a large effect size of \(d=0.8\).
# install.packages("pwr")
# library(pwr)
pwr.t.test(n=30, d=0.8, sig.level=0.001, alternative="greater") # large effect-size
##
## Two-sample t test power calculation
##
## n = 30
## d = 0.8
## sig.level = 0.001
## power = 0.4526868
## alternative = greater
##
## NOTE: n is number in *each* group
This yields a power of only \(0.4527\) to detect the effect.
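Since this power is quite low, a natural follow-up (under the same assumptions) is to solve for the per-group sample size needed to reach 80% power:

pwr.t.test(d = 0.8, sig.level = 0.001, power = 0.8, alternative = "greater")  # solve for n per group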
The pwr package also provides some functions to generate power and sample-size plots.
For instance, we can plot sample-size vs. effect-size curves representing the power to detect different levels of correlation, \(0.1\leq \rho\leq 0.8\), over a range of power values, \(0.3\leq \mathrm{power}\leq 0.85\).
# install.packages("pwr")
# library(pwr)
r <- seq(0.1, 0.8, 0.01)  # define a range of correlations and the sampling rate within this range
nr <- length(r)

p <- seq(0.3, 0.85, 0.1)  # define a range for the power values, and their sampling rate
np <- length(p)

# Compute the corresponding sample sizes for all combinations of correlations and power values
sampleSize <- array(numeric(nr*np), dim=c(nr, np))
for (i in 1:np) {
  for (j in 1:nr) {
    # solve for sample size (n)
    testResult <- pwr.r.test(n = NULL, r = r[j], sig.level = 0.05, power = p[i], alternative = "two.sided")
    sampleSize[j, i] <- ceiling(testResult$n)  # round sample sizes up to nearest integer
    # print(sprintf("sampleSize[%d,%d]=%s", j, i, round(sampleSize[j, i], 2)))
  }
}
# Graph the power plot
xRange <- range(r)
yRange <- round(range(sampleSize))
colors <- rainbow(length(p))
plot(xRange, yRange, type="n", xlab="Correlation Coefficient (r)", ylab="Sample Size (n)")
# Add power curves
for (i in 1:np) lines(r, sampleSize[ , i], type="l", lwd=2, col=colors[i])
# add annotations (grid lines, title, legend)
abline(v=0, h=seq(0, yRange[2], 100), lty=2, col="light grey")
abline(h=0, v=seq(xRange[1], xRange[2], 0.1), lty=2, col="light grey")
title("Effect-size (X) vs. Sample-size (Y) for \n different Power values in
(0.3, 0.85), Significance=0.05 (Two-tailed Correlation Test)")
legend("topright", title="Power", as.character(p), fill=colors)
In this case, we can also plot the power against the sample-size. This graph indicates the minimal sample-size (\(n\approx 20\)) that achieves the desired lower bound on the power (\(0.8\)).
p.out <- pwr.r.test(n = NULL, r = r[50], sig.level = 0.05, power = p[6], alternative = "two.sided")
plot(p.out, lwd=2)
Similarly, we can plot a sample-size vs. effect-size curve for a multivariate linear model of the efficacy of the Argus retinal prosthesis (treatment) to enhance brain plasticity in the visual cortex. Suppose we are trying to determine the smallest possible sample size that can yield sufficient power (sensitivity) to detect a change in the functional connectivity (fMRI data) of Argus II blind patients. Prior articles provide background information and support for the range of effect-sizes used in this example.
The research goal is to model the outcome \(Y\) in terms of eight (8) covariates \(X_i\), \(1\leq i\leq 8\).
As shown in Chapter 9 (Linear Modeling), the matrix form of the linear inference model, \(Y=X\beta+\epsilon\), can be expanded as:
\[Y=\beta_0+\beta_1 X_1+\beta_2 X_2+\cdots+\beta_u X_u+ \epsilon.\]
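To connect this model to the effect-size measure, recall that for the omnibus test of the full model Cohen's \(f^2\) can be expressed via the coefficient of determination, \(f^2=\frac{R^2}{1-R^2}\), and the total sample size relates to the degrees of freedom via \(n=u+v+1\). A minimal sketch, assuming a hypothetical \(R^2=0.3\):

R2 <- 0.3                     # hypothetical proportion of variance explained by the 8 covariates
f2 <- R2 / (1 - R2)           # Cohen's f2, about 0.43 (large by convention)
out <- pwr.f2.test(u = 8, v = NULL, f2 = f2, sig.level = 0.05, power = 0.8)
ceiling(out$v) + 8 + 1        # total sample size: n = u + v + 1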
# install.packages("pwr")
# library(pwr)
f2 <- seq(0.2, 4, 0.02)  # define a range of effect-sizes and the sampling rate within this range
nf2 <- length(f2)

p <- seq(0.3, 0.85, 0.1)  # define a range for the power values, and their sampling rate
np <- length(p)

# `pwr.f2.test(u = u, v = v, f2 = f2, sig.level = a, power = b)`: general linear models,
# including multiple linear regression, with u, v, and f2 representing the numerator and
# denominator degrees of freedom and the effect-size measure.
# Cohen's f2 values of 0.02, 0.15, and 0.35 approximately represent small, medium, and large effect sizes, respectively.

# Compute the corresponding sample sizes for all combinations of effect-sizes and power values
sampleSize <- array(numeric(nf2*np), dim=c(nf2, np))
for (i in 1:np) {
  for (j in 1:nf2) {
    # solve for the denominator degrees of freedom (v), assuming u=8 predictors (X) explaining the outcome (Y)
    testResult <- pwr.f2.test(u = 8, v = NULL, f2 = f2[j], sig.level = 0.05, power = p[i])  # num-covariates=8
    sampleSize[j, i] <- ceiling(testResult$v)  # extract and round sample sizes up to nearest integer
    # print(sprintf("sampleSize[%d,%d]=%s", j, i, round(sampleSize[j, i], 2)))
  }
}
# Graph the power plot
xRange <- range(f2)
yRange <- round(range(sampleSize))
colors <- rainbow(length(p))
plot(xRange, yRange, type="n", xlab="Effect-size", ylab="Sample Size (v)")
# Add power curves
for (i in 1:np) lines(f2, sampleSize[ , i], type="l", lwd=2, col=colors[i])
# add annotations (grid lines, title, legend)
abline(v=0, h=seq(0, yRange[2], 10), lty=2, col="light grey")
abline(h=0, v=seq(xRange[1], xRange[2], 0.5), lty=2, col="light grey")
title("Effect-size vs. Sample-size for \n different Power values in
(0.3, 0.8), Significance=0.05 (MLR)")
legend("topright", title="Power", as.character(p), fill=colors)
For a simpler balanced 5-group ANOVA test, we can also plot the power against the sample-size. This graph indicates the minimal per-group sample-size (\(n=19\), rounding up the earlier estimate of \(18.72\)) that achieves the desired lower bound on the power (\(0.8\)).
p.out <- pwr.anova.test(k=5, f=0.4, sig.level=0.025, power=0.8)  # large effect-size
plot(p.out, lwd=2)
Suppose we are interested in estimating the relation between sample-size and statistical power (power analysis) for a clustered study design within 12 units (sites). Assume an intra-class correlation of \(ICC=0.01\) and a proposed study design involving 2 steps with 40 participants per cluster (site). At each step, conditional on enrolling four sites (clusters), ensuring 76% statistical power (based on a two-sided test, \(\alpha=0.05\)) for detecting between-site effects yields about 80% overall statistical power.
Using the WebPower package (see the WebPower manual/tutorial), we can estimate the sample-size corresponding to the desired power via the R function wp.crt2arm. The interface of this function is:
wp.crt2arm(n=NULL, f=NULL, J=NULL, icc=NULL, power=NULL, alternative = c("two.sided", "one.sided"), alpha=0.05)
The R inputs and outputs for several examples are given below:
## calculate power given sample size and effect size
## install.packages("WebPower", lib="C:/Users/Dinov/Documents/R/R-4.0.2/library")
library(WebPower)
## Warning: package 'WebPower' was built under R version 4.1.2
## Loading required package: MASS
## Loading required package: lme4
## Loading required package: Matrix
## Loading required package: lavaan
## This is lavaan 0.6-9
## lavaan is FREE software! Please report any bugs.
## Loading required package: parallel
## Loading required package: PearsonDS
## Warning: package 'PearsonDS' was built under R version 4.1.1
# wp.crt2arm(f=0.6,n=20,J=10,icc=.1)
wp.crt2arm(f=0.5, n=320, J=4, icc= 0.03)
## Cluster randomized trials with 2 arms
##
## J n f icc power alpha
## 4 320 0.5 0.03 0.3431254 0.05
##
## NOTE: n is the number of subjects per cluster.
## URL: http://psychstat.org/crt2arm
wp.crt2arm(f=0.65, n=320, J=4, icc= 0.01)
## Cluster randomized trials with 2 arms
##
## J n f icc power alpha
## 4 320 0.65 0.01 0.8029552 0.05
##
## NOTE: n is the number of subjects per cluster.
## URL: http://psychstat.org/crt2arm
Note the strong effects of the parameters: the intra-class correlation (icc), the per-cluster sample size (n), the effect-size (f), and the number of sites (J).
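Like the pwr methods, the WebPower functions solve for whichever single parameter is left as NULL. For instance, a minimal sketch (assuming this NULL-parameter convention, per the WebPower documentation) solving for the per-cluster sample size that attains 80% power in the second design above:

wp.crt2arm(f = 0.65, n = NULL, J = 4, icc = 0.01, power = 0.8)  # solve for n per cluster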
Power analysis is an important statistical computing technique for designing effective research studies based on the collection and interrogation of observational data. Power is the sensitivity of a test, i.e., the probability of detecting an effect when a true effect exists. There is no universal analytical expression (formula) determining the relation between sample size, effect size, and power across all possible research situations. In practice, assumptions and empirical evidence may be used to approximate this unknown relation, even for complex study designs.
In practice, all sample-size calculations rest on many assumptions. For instance, calculating the sample-size/power relation for a one-way ANOVA test requires normality within each group as well as homoscedasticity, i.e., equal variances across all the groups. However, reasonable violations of the core assumptions may still yield rough approximations of the expected sensitivity of a test. Exact knowledge of the magnitude of the effect size is also rarely available; however, it can often be estimated from analogous processes, prior observations, or other techniques. Practicing statisticians tend to use conservative assumptions when estimating sample-size/power relations.
There are no closed-form expressions estimating the power-size-effect relationships for non-parametric models and most model-free machine learning techniques. In such situations, two complementary approaches may be utilized to ensure reliable analytics and reproducible inference. The first approach identifies an analogous parametric statistical test, computes its power-size-effect relationship, and uses that cautiously as a rough guide for the corresponding machine learning technique. The alternative approach employs resampling methods to estimate and confirm these effect relationships via internal statistical cross-validation.
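As an illustration of the resampling idea, the following minimal sketch estimates power by Monte Carlo simulation: repeatedly generate data under an assumed effect, apply the test of interest, and record the rejection rate. All numerical settings here are hypothetical.

set.seed(1234)
# Empirical power of a two-sample t-test via simulation
simPower <- function(n, d, alpha = 0.05, B = 2000) {
  rejections <- replicate(B, {
    x <- rnorm(n, mean = 0)   # control group
    y <- rnorm(n, mean = d)   # treatment group, shifted by the assumed effect d
    t.test(x, y)$p.value < alpha
  })
  mean(rejections)            # proportion of simulated datasets where the test rejects
}
simPower(n = 30, d = 0.8)     # should approximate pwr.t.test(n = 30, d = 0.8)$power (~0.86)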