1 Applied Bayesian Modeling, Simulation and Inference

1.1 Background

After Thomas Bayes died in 1761 his unpublished work about the flat-prior posterior binomial distribution was rediscovered and published in the Philosophical Transactions of the Royal Society of London in 1764.

The Monte Carlo (MC) Method is a simple technique used to estimate analytic quantities that may have closed-form expressions, but are difficult to compute exactly, e.g., integration, differentiation, optimization of non-trivial functions. MC simulation may be used to approximate probability distributions, estimate likelihoods, or calculate expectations. The intuition of MC sampling is rooted in the fact that we can estimate any desired (posterior) expectation using ergodic averages. In other words, we can compute any population characteristic, associated with a corresponding sample-driven statistic, of a (posterior) distribution provided we can generate $N$ simulated samples from that distribution:

\[E(f(s))_P = \int {f(s)p(s)ds} \approx \frac{1}{N}\sum_{i=1}^N {f(s^{(i)})},\] where $P$ and $p$ are the probability (posterior) distribution and density functions of interest, $E$ represents the expectation, $f(s)$ is the population characteristic of interest (corresponding to a statistic), and $s^{(i)}$ is the $i^{th}$ simulated sample from the distribution $P$. For instance, we can estimate the population mean (linked to the sample arithmetic average statistic) by:

\[E(x)_P =\int{xp(x)dx} \approx \frac{1}{N}\sum_{i=1}^N {x^{(i)}},\] where $p(x)$ is the density of the distribution function $P$. Similarly, assuming a trivial mean, to estimate the population variance (corresponding to the sample-variance statistic) we compute:

\[E(x^2)_P =\int{x^2p(x)dx} \approx \frac{1}{N-1}\sum_{i=1}^N {\left ( x^{(i)}\right ) ^2}.\]

Many integrals, differentials, or sums that cannot be exactly computed in closed analytic form, can still be expressed as expectations of some random variables with respect to some probability distribution, which are estimable by MC methods.

The Markov Chain Monte Carlo (MCMC) method was proposed by Andrey Markov in the early $20^{th}$ century. Markov Chains refer to one-step dependent sequences of random variables obeying the Markov property (past and future are conditionally independent given the present). For example:

Sampling with replacement vs. without replacement from a space (e.g., an urn) containing a set of identifiable objects represent processes obeying vs. not obeying the Markov property, respectively.
A simple Brownian motion (Wiener process) random walk on an integer lattice $Z^d \subset R^d$ represents a Markov chain whose transition probabilities are defined by $p(x,x\pm e_i)=\frac{1}{2d}$, for each $x\in Z^d$, where $e_i$ are the normalized vectors in $Z^d$. At each step, this random walk process, currently at position $x$, moves to a randomly chosen nearest neighbor, $x\pm e_i$.
Consider cancer staging and the corresponding transitions between stages. The Cancer Tumor-Node-Metastasis (TNM) staging system describes the size of the primary tumor (T), whether the cancer has spread to the lymph nodes (N), and whether it has spread to a different part of the body (M). $T=1(small), 2, 3, 4(large)$ refers to the size of the cancer and how far it has spread into nearby tissue, $N=0$ (no lymph nodes containing cancer cells), $1, 2, 3$ (lots of lymph nodes containing cancer cells), and $M=$ $0$ (no spread), $1$ (cancer has spread), refers to whether the cancer has spread to another part of the body. Most types of cancer have four stages, $CS=1(early), 2, 3,\ to\ 4\ (advanced)$. Stage 1 implies that a cancer is relatively small and contained within the organ it started in. Stage 2 means the cancer has not started to spread into surrounding tissue but tumor is larger than in stage 1, or that cancer cells have spread into lymph nodes close to the tumor. Stage 3 usually means the cancer is larger and may have started to spread into surrounding tissues possibly with cancer cells in the lymph nodes in the area. And stage 4 means the cancer has spread from its point of origination to another body organ indicating metastatic progression.

Let’s use cancer staging to illustrate the fundamentals of MC representation, its properties, and computational characteristics. For each stage, let $X_t$ denote the cancer stage of a patient at time $t$. Cancer onset, progression, prognosis and outcomes have stochastic behavior. We have 5 states, $X_n=0(no\ cancer), 1,2, 3, 4$, and initially $X_o=0$ (cancer occurs), with $P(X_o=0)=1$. Next we can define the marginal probabilities for transition from one cancer stage to another (these depend upon the type of cancer), say $P(X_1=1)=0.6$, $P(X_1=2)=0.4$, $P(X_1=3)=0.2$, $P(X_1=4)=0.1$. Computing the distribution of $X_t$ for $t\geq 2$ requires conditional probabilities.

Suppose that at time $t$, the cancer patient is at stage $s_o$. Then, we can compute the conditional probabilities: $P(X_{t+1}=s_k|X_t=s_o)$ of patient being in a specific stage $s_k$ at the next time point $t+1$. For instance, if at time $t$ the patient is in stage $1$ we may have $P(X_{t+1}=0|X_t=1)=0.15$, $P(X_{t+1}=1|X_t=1)=0.5$, $P(X_{t+1}=2|X_t=1)=0.2$, $P(X_{t+1}=3|X_t=1)=0.1$, and $P(X_{t+1}=4|X_t=1)=0.05$.

Would these probabilities change if we condition further on the oncological history of the same patient up to time $t$? Because of the stochastic coin-toss like mechanism of many biomedical, physiologic and natural processes, these conditional likelihoods may not be affected by knowing a longer staging history of the disease provenance. To predict the likelihood of being in a certain state at the next time period, only the current state of the disease matters. Thus, for any cancer provenance record ($s_o, s_1, \cdots, s_{t-1}$), we may have $P(X_{t+1}=0|X_o=S_o, X_1=s_1, \cdots, X_{t-1}=s_{t-1}, X_t=1)=P(X_{t+1}=0|X_t=1)=0.2$, $P(X_{t+1}=1|X_o=S_o, X_1=s_1, \cdots, X_{t-1}=s_{t-1}, X_t=1)=P(X_{t+1}=1|X_t=1)=0.2$, $P(X_{t+1}=2|X_o=S_o, X_1=s_1, \cdots, X_{t-1}=s_{t-1}, X_t=1)=P(X_{t+1}=2|X_t=1)=0.2$, $P(X_{t+1}=3|X_o=S_o, X_1=s_1, \cdots, X_{t-1}=s_{t-1}, X_t=1)=P(X_{t+1}=3|X_t=1)=0.2$, and $P(X_{t+1}=4|X_o=S_o, X_1=s_1, \cdots, X_{t-1}=s_{t-1}, X_t=1)=P(X_{t+1}=4|X_t=1)=0.2$.

This is because the coin-toss like cancer staging outcome (not necessarily a fair coin) at time $t+1$ is independent of the prior coin flips and therefore it does not depend on $X_o, X_1, \cdots, X_{t-1}$. It only depends on the current cancer stage, $X_t$, at time $t$. This rule, that the conditional distribution of $X_{t+1}$ depends only on the state at the prior time, $X_t$, is referred to as the Markov property, or memoryless property.

Another characteristic of Markov processes, time homogeneity, is that the conditional distribution of $X_{t+1}$ given (say) $X_t=s_o$ is the same for all time points, i.e., $P(X_{t'+1} | X_{t'}=s_o) = P(X_{t''+1} | X_{t''}=s_o)$, for all $t'$ and $t''$. This follows from the fact that albeit dependent on human intervention, the biophysiological mechanism of cancer stage transitioning obeys the same rules independently of the time.

Collectively, the transition probabilities comprise the Markov transition matrix, $P$, where each element $P_{i,j}$ represents the conditional probability of observing cancer stage $s_j$ at time $t+1$ given that the current, time $t$, stage is $s_i$, i.e., $P_{i,j} = P(X_{t'+1} =s_j| X_{t'}=s_i)$.

This a Markov chain model for cancer progression, using the above 5-cancer stages (model states) may be represented mathematically as a transition matrix or a graph (here $k=5$): \[P= \begin{pmatrix} P_{1,1} & P_{1,2} & \cdots & P_{1,k} \\ P_{2,1} & P_{2,2} & \cdots & P_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ P_{k,1} & P_{k,2} & \cdots & P_{k,k} \end{pmatrix} .\] The axioms of probability theory dictate that:

the conditional probabilities are non-negative, $P_{i,j} = P(X_{t+1} =s_j| X_{t}=s_i) \geq 0$, for all $i,j \in \{1, 2, 3, \cdots, k\}$,
the law of total probability is satisfied, $\sum_{j=1}^k{P_{i,j}} = 1$, for all all $i \in \{1, 2, 3, \cdots, k\}$, and
The initial distribution of the Markov Chain $\{X_o, X_1, X_2, \cdots \}$ defining the initial conditions, $t=0$, of starting the chain over its states $\{s_1, s_2, s_3, \cdots, s_k\}$, can be expressed as $\mu^{(o)}=(\mu_1^{(o)}=P(X_o=s_1), \mu_2^{(o)} =P(X_o=s_2), \cdots, \mu_k^{(o)} =P(X_o=s_k))$, where $\sum_{i=1}^k{\mu_i^{(o)}}=1$.

Below we show the mathematical matrix representation of the MC process above as a (directed, cyclic, complete) graphical object illustrating all states and their corresponding transition probabilities.

# install.packages("network")
library('igraph')

nodes <- data.frame(c('Stage_0','Stage_1','Stage_2', 'Stage_3', 'Stage_4'))
node_names <- c('Stage_0','Stage_1',' Stage_2', ' Stage_3', ' Stage_4' )
edges <- data.frame(from=c( 'Stage_0', 'Stage_0', 'Stage_0', 'Stage_0', 'Stage_0',
                        'Stage_1', 'Stage_1', 'Stage_1', 'Stage_1', 'Stage_1',
            'Stage_2', 'Stage_2', 'Stage_2', 'Stage_2', 'Stage_2', 
            'Stage_3', 'Stage_3', 'Stage_3', 'Stage_3', 'Stage_3', 
            'Stage_4', 'Stage_4', 'Stage_4', 'Stage_4', 'Stage_4'),
                        to=c('Stage_0', 'Stage_1', 'Stage_2', 'Stage_3', 'Stage_4',
            'Stage_0', 'Stage_1', 'Stage_2', 'Stage_3', 'Stage_4',
            'Stage_0', 'Stage_1', 'Stage_2', 'Stage_3', 'Stage_4',
            'Stage_0', 'Stage_1', 'Stage_2', 'Stage_3', 'Stage_4',
            'Stage_0', 'Stage_1', 'Stage_2', 'Stage_3', 'Stage_4'),
                            same.dept=c(T,T,T,F, T)
                        )
edge_names <- rep('P{i,j}', 25)
vertex_size <- rep(20, 5)
for (i in 1:5) {
    vertex_size[i] <- 70-5*(i-1)
    for (j in 1:5) {
        edge_names[j + (i-1)*5] <- paste0('P{', i-1, ',', j-1, '}')
        # edge_names[j + (i-1)*5]
    }
}

g <- graph_from_data_frame(edges, directed=TRUE, vertices=nodes)
plot(g, layout=layout_with_fr, vertex.size=vertex_size, vertex.label=node_names, edge.label=edge_names, edge.label.cex = 0.7, edge.curved=0.3, margin=c(0.1, 0.1, 0.1, 0.1), main='Cancer Staging Transition Graph', ylim=c(-1,1),xlim=c(-1,1), asp = 1)

In quantum physics, the Heisenberg uncertainty principle suggests that there is a limit to the process of simultaneously identifying the position and momentum of particles in the universe, which implies that the future is difficult to predict exactly, albeit may be probabilistically quantified. Similarly, in logic, the G?del’s incompleteness theorem suggests that no system (closed system supported by quantitative arithmetic) can be complete and consistent at the same time. This also imposes some theoretical bounds on the precision of our forecasting and inference.

Monte Carlo experiments represent computational algorithms utilizing repeated random sampling of data and calculation of specific statistics that estimate concrete population characteristics. Two examples of using randomness to solve challenging problems involve the estimation of the area of a 2D region in the plane with complex boundary and compressed sensing for rapid estimation of complex signals using (sub-Nyquist frequency) stochastic sampling. The Metropolis-Hastings algorithm and Gibbs sampling represent a couple of specific Markov Chain Monte Carlo (MCMC) methods of simulation and estimation.

1.2 Bayesian Modeling using MCMC

Earlier in Chapter 7, we presented the Bayes conditional probability formula in terms of events. Below, we will expand that formulation of the relation between the posterior, prior, conditional and marginal probabilities.

Gibbs sampling can be applied to estimate Bayesian models facilitating subsequent Bayesian inference. The fundamental idea is to obtain MCMC computationally accurate estimates of complicated or intractable statistical estimations, mathematical operations.

Conditional probability reflects partial knowledge about some observed data. Let’s consider the bivariate density and distribution situations to avoid the limitations of univariate cases and reduce the complexities of more general multivariate processes. For discrete-state processes, $X,Y\in \Omega$, the probability mass function (PMF), $f(.,.)\geq 0$, consists of the complete set of probability values assigned to all possible $(x, y)$ pair states, where $\sum_{(x, y) \in \Omega} f(x, y) = 1$. The PMF values, $f(x, y)$ correspond to the probability of observing the event $X = x$ and $Y = y$. For continuous-state processes, the probability density function (PDF), is a non-negative function $f(x,y)\geq 0$ that integrates to one $\int_{-\infty}^\infty { \int_{-\infty}^{\infty} { f(x, y) \ d x \ d y }} = 1$.

The PDF assigns probabilities, $f(x, y) \, d x \, d y$ , to outcome “areas” corresponding to observing the event $x < X < x + d x$ and $y < Y < y + d y$, where the infinitesimal $d x$ and $d y$ represent the small increments in the two planar directions. In mixed situations where one of the variable may be continuous and the other discrete, probability mass-density functions may be defined in the same spirit yielding probabilities of outcomes, $f(x, y) \, d x$, defined for mixed continuous-discrete events like $x < X < x + d x$ and $Y = y$.

Often we may be able to observe $X$ but not $Y$. The joint distribution represents the probability of the outcome $(x,y)\in A$, $P((X,Y)\in A)$, without any other knowledge and before observing either of the random processes $X$ or $Y$. Suppose we see $X = x$ first, what should be the distribution of $Y$? Given $X$, the probability function has one unknown variable and has a conditional probability function analytic representation. For instance, when $X$ and $Y$ are discrete random variables and $(x,y)\in \Omega$, then the conditional probability mass function of $Y$ given $X$ is:

\[f(y | x) = \frac{h(x, y)}{\sum_{y \in \Omega_x} h(x, y)}, \qquad y \in \Omega _x,\]

where \[\Omega_x = \{\ y : \text{such that } (x, y) \in \Omega \}.\]

Similarly, when $X$ and $Y$ are continuous random variables, then \[ f(y | x) = \frac{h(x, y)}{\int_{- \infty}^\infty h(x, y) \ d y}, \qquad - \infty < y < \infty,\] represents the conditional probability density function of $Y$ given $X$.

The joint multivariate distribution is the product of the marginal and the conditional distributions, $f(x, y)= f_X(x)\times f(y | x)$. In the discrete case, the marginal distribution of $X$ (ignoring $Y$) is $f_X(x) = \sum_{y \in \Omega_x} f(x, y)$, where $\Omega_x$ is the support of $X$, represents the probability of the event $X = x$. And when $X$ and $Y$ are both continuous, the marginal is $f_X(x) = \int_{- \infty}^\infty f(x, y) \ d y$ and for any event $E\subset \Re$, $P(X \in E) = \int_E f_X(x) \ d x = \int_E d x \int_{- \infty}^\infty f(x, y) \ d y$ represents the probability of the event $X \in E$, which is equivalent to $(X, Y) \in E \times \Re$.

The conditional distributions are expressed as: \[\begin{align*} f(y | x) & = \frac{f(x, y)}{f_X(x)} \\ f(x | y) & = \frac{f(x, y)}{f_Y(y)} \end{align*}.\]

Notice that there are separate marginal distributions for each variable. In our bivariate case, one marginal for $X$ and another for $Y$.

This bivariate situation generalizes to $X$ and $Y$ variable vectors where multiple variables can be conditioned upon in the above relation between joint, marginal and conditional probabilities.

Review the many hands-on examples of constructing and validating Bayesian confidence intervals that are included in the SOCR General Confidence Intervals Activity.

1.3 Bayesian Inference

Bayesian inference is based on probability theory as a consistent model for describing uncertainty using probability distributions. For instance, statistical inference frequently involves probability distribution models, whose true parameters are unknown but may be estimated from observations. A prior distribution of the parameter (or parameters, if the model distribution has multiple parameters like in $Normal(\mu, \sigma^2)$) is the initial belief about the parameter distribution before any data are observed. Of course, if there is no reason to believe in a certain distribution without any prior evidence, we can use a uniform prior. In both Bayesian and frequentist inference approaches, the distribution of the data given the parameters ($\theta$) are identically interpreted based on some underlying statistical model. However, frequentist representation of this distribution is expressed as $f^{freq}_{\theta}(x)$ emphasizing $x$ as random (described by a probability distribution) and the parameter $\theta$ as fixed. The corresponding Bayesian distribution representation is written as $f^{Bayes}(x | \theta)$, emphasizing as random both $X$ and $\theta$, and conditioning on the $\theta$ parameter as being given (i.e., known). This conditional distribution describes the probability of observing $X$ for a given fixed value of $\theta$.

Thus, in the Bayesian framework, assuming a prior distribution, $g()$, for the parameter $\theta$, the joint = marginal times conditional, $f(X, \theta) = g(\theta) \times f(x | \theta)$, describes the uncertainty about $X$ and $\theta$ before observing the data.

Taking a data sample, $X = x$, the conditional distribution of the parameter $\theta$ given $X$ can be computed, $ f(| x) = $, and is called the posterior distribution representing the probability distribution describing the post-evidence uncertainty in the parameter $\theta$ after taking a data sample $x$.

Both the (marginal) prior and the (conditional) posterior distributions capture the probability of the parameter of interest. The prior is the initial distribution before seeing any data and the posterior the probability of the same parameter after observing the data. The Bayesian rule or Bayesian theorem provides a mechanism of inverting the conditioning and computing one conditional posterior probability from the other, e.g., having $f(x | \theta)$, we can compute the conditional posterior probability of $\theta$ given $x$:

\[f(\theta |x) = \frac{g(\theta)\times f(x | \theta)}{f_X(x)}.\]

Bayesian inference employs the Bayesian rule to compute the conditional posterior probability from the joint distribution model and the prior belief.

The numerator in the right hand size above represents an unnormalized posterior distribution, $f(\theta |x) \sim g(\theta)\times f(x | \theta)$. In other words, $\text{un-normalized posterior}=\text{unnormalized prior} \times \text{likelihood}$.

Bayesian inference suggests unique answers for each problem, each problem yields a single posterior distribution. On the other hand, frequentist inference always provides ranges of answers, e.g., confidence intervals of certain confidence levels. Bayesian inference is subjective as it heavily depends on prior distributions, which may be difficult to justify objectively. The frequentist inference methods (e.g., parametric hypothesis tests) tend to be simple (e.g., using sampling distributions) to compute using samples of data. Bayesian inference is much more difficult due to general challenges of analytically estimating the likelihoods, $f(x | \theta)$. Hence, MCMC are developed to provide stochastic approaches to compute analytically intractable likelihoods. MCMC transforms analytic challenges into computational problems that are tractable using programming, storage, and compute infrastructure.

Frequentist inference treats parameters as fixed and data as random, even after it is observed (think of the sampling distribution of the mean), whereas Bayesian inference treats both data and parameters as random until data is observed, and then treats the data as fixed to compute the conditional posterior probability. Yet, the two inference strategies tend to agree for large sample sizes.

Suppose $\hat{\theta}_n$ and $\theta_o$ denote the maximum likelihood estimate (MLE) and the true unknown parameter value. Frequentist inference treats $\hat{\theta}_n$ as random and $\theta_o$ as fixed and computes the asymptotic distribution: \[\hat{\theta}_n - \theta_o \sim Normal\left(0, I_n(\theta)^{- 1} \right),\] where the reciprocal of the Fisher’s information, $I(\theta)$, is related to the the lower bound of the Cramer-Rao inequality, see this paper.

This inference may not be useful as frequentist inference has no access to $\theta_o$. However, plugging in a specific value for $\theta_o$ does yield a useful result: \[\hat{\theta}_n - \theta_o \sim Normal\left(0, I_n(\hat{\theta}_n)^{- 1} \right).\]

A corresponding Bayesian inference is based on a random $\theta_o$ and a fixed $\hat{\theta}_n$, not random once the data are observed. Both inference strategies agree that $\hat{\theta}_n - \theta_o$ is random, as a function of a random variable is itself a random variable, and its distribution is independent of the prior distribution, under regularity conditions (assuming smooth prior and locally strictly positive function around the true unknown parameter value).

In practice, having large amounts of data suggests that small perturbations of the prior don’t substantially alter the posterior probability or final inference.

The main downside of Bayesian inference is that two Bayesian scientists that look at the same data could arrive at different conclusions based on their prior values or prior beliefs.

To mediate the Bayesian subjectivity, we can employ change-of-variables for priors. Suppose the prior distribution for the parameter $\theta$ is $g$. Then, the distribution of another parameter $\phi = h(\theta)$, where $h$ is an invertible function with inverse $h^{-1}$. The change-of-variable theorem for integration implies that: \[d \theta = h^{-1}(\phi) d \phi .\]

Thus, \[P(\phi \in A) = \int_{h(\theta) \in A} {g(\theta) \, d \theta } = \int_A {g(h^{-1}(\phi)) h^{-1}(\phi) \ d \phi} .\]

Therefore, we can employ the following prior for $\phi$: $g(h^{-1} (\phi)) h^{-1}(\phi)$.

This suggests that a flat uninformative (diffuse) prior for one parameter may transform to a non-flat, or informative, prior for another parameter.

1.4 Monte Carlo

Monte Carlo simulations may be used to estimate probabilities (likelihoods) or expectations (integrals) involving functions that may be difficult to analytically process (i.e., compute a closed form analytical expression of the desired solutions). Likelihoods are special cases of integrals, and each probability can be expressed as an expectation (e.g., for a characteristic function $g$, $E(g(X)) = P( g(X) = 1)$). Let’s assume we want to compute $\theta = E(g(X)) = \int_{\mathbb{R}} {g(x)P_X(x)dx}$, but the integral can’t be exactly solved using simple analytical terms.

Instead, we can employ Monte Carlo simulation to estimate $\theta$. Let’s generate a random sample of observations $\{ X_1, X_2, \cdots \}$, independent and identically distributed (IID). Then, for each iteration $n$, the average of the random variables is: \[\hat{\theta}_n = \frac{1}{n} \sum_{i = 1}^n g(X_i).\]

If $Y_i = g(X_i)$, then $\hat{\theta}_n$ can be written explicitly as a sample arithmetic average: \[\overline{Y}_n = \frac{1}{n} \sum_{i = 1}^n {Y_i}\]

The law of large numbers (LLN) implies that $\hat{\theta}_n \longrightarrow \theta$ as $n \longrightarrow \infty$.

Having a sufficiently large simulation sample would allow us to calculate a good estimate of $\theta$.

By the central limit theorem (CLT): \[\hat{\theta}_n \sim Normal\left( \theta, \frac{\sigma^2}{n} \right),\] as $n\longrightarrow \infty$ and $\sigma^2 = \mathop{\rm var}( g(X))$ represents the population variance.

Neither $\theta$ nor $\sigma$ are known. However, we can use the sample variance as an estimate \[\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i = 1}^n \left ( g(X_i) - \hat{\theta}_n \right )^2. \]

Thus, \[\hat{\theta}_n \sim Normal\left( \theta, \frac{\hat{\sigma}^2}{n} \right). \]

Let’s construct a 95% confidence interval for $\theta$: \[\hat{\theta}_n \pm 1.96 \times \frac{\hat{\sigma}_n}{\sqrt{n}}. \]

Let’s now go back to Markov chains, sequences $\{X_1$, $X_2$, $\ldots \}$ of random variables obeying the Markov property that the conditional distribution of $X_{n + 1}$ given the entire transition history back to the initial time actually only depends on the current state $X_n$: \[f(x_{n + 1} | x_1, x_2, \cdots, x_n) = f(x_{n + 1} | x_n).\]

Referring back to our 5-state cancer staging Markov process, as the conditional distribution does not depend on $n$, we have stationary transition probabilities. If $g$ represents the chain distribution at the initial state, the Bayesian inference, $ = $, implies that

\[\begin{align*} f(x_1, x_2) &= g(x_1) f(x_2 | x_1) \\ f(x_1, x_2, x_3) &= g(x_1) f(x_3 | x_2, x_1) f(x_2 | x_1) \\ f(x_1, x_2, x_3, x_4) &= g(x_1) f(x_4 | x_3 , x_2, x_1) f(x_3 | x_2, x_1) f(x_2 |x_1) \\ f(x_1, x_2, x_3, x_4, x_5) &= g(x_1) f(x_5, x_4, x_3, x_2 | x_1) \\ & = g(x_1) f(x_5 | x_4, x_3, x_2, x_1) f(x_4 | x_3, x_2, x_1) f(x_3 | x_2, x_1) f(x_2 | x_1). \end{align*} \]

All of the Markov chain probabilities are computed by multiplying the initial state distribution $g$ and the conditional transition probability distribution $f(\,\cdot\, | \,\cdot\,)$.

The marginal distributions of the different variables can also be estimated by: \[\begin{align*} f(x_2) & = \int f(x_1, x_2) \, d x_1 \\ & = \int f(x_2 | x_1) g(x_1) \, d x_1 \\ f(x_3) & = \iint f(x_1, x_2, x_3) \, d x_1 \, d x_2 \\ & = \iint f(x_3 | x_2) f(x_2 | x_1) g(x_1) \, d x_1 \, d x_2 \\ f(x_4) & = \iiint f(x_1, x_2, x_3, x_4) \, d x_1 \, d x_2 \, d x_3 \\ & = \iiint f(x_4 | x_3) f(x_3 \mid x_2) f(x_2 \mid x_1) g(x_1) \, d x_1 \, d x_2 \, d x_3 \\ f(x_5) & = \iiiint f(x_1, x_2, x_3, x_4, x_5) \, d x_1 \, d x_2 \, d x_3 \, d x_4 \\ & = \iiiint f(x_5 | x_4) f(x_4 | x_3) f(x_3 | x_2) f(x_2 | x_1) g(x_1) \, d x_1 \, d x_2 \, d x_3\, d x_4 \end{align*}\]

1.5 Ergodic Averaging

An ergodic law, ergodicity, or ergodic theory represents the idea is that for certain dynamical systems, like Markov chains, the time average of their properties is equal to the average over the entire state space. In physics, ergodic methods are used to study geodesic flows on Riemannian manifolds and in mathematics, there are ergodic applications in harmonic analysis, Lie group representation, the theory of diophantine approximations, etc.

Ergodic theory characterizes the behavior of a dynamical system that runs indefinitely. The Poincaré recurrence theorem suggests that on the long run, almost all points in any subset of the phase-space are eventually revisited. In general, ergodic theorems determine the conditions required to guarantee the almost certain existence of the time average of a function along the trajectories and its relation to the space average. For example, one special class of ergodic systems, this time average is statistically the same for almost all initial points, i.e., the evolving dynamical system forgets its initial state.

One special case of ergodic law for Markov chains suggests that: $\hat{\theta}_n \longrightarrow \theta$ as $n \longrightarrow \infty$.

1.5.1 Batch Means

The CLT suggests that for sufficiently large $b<<n$, we have \[\hat{\theta}_b \approx Normal\left( \theta, \frac{\sigma^2}{b} \right) .\] When $n$ is a multiple of $b$, i.e., $m\times b = n$, for an integer $m$, we can divide the Markov chain into $m$ batches: \[ X_{m, 1}, X_{m, 2}, \ldots, X_{m, b}, \ \forall\ m=1, 2, 3, \cdots, \frac{n}{b}.\] Computing the batch means yields: \[\hat{\theta}_{b, m} = \frac{1}{b} \sum_{i = 1}^b {X_{m, i}}. \] Again, assume that $b$ is sufficiently large, then we have \[\hat{\theta}_{b, m} \sim Normal\left( \theta, \frac{\sigma^2}{b} \right) . \] Therefore, we can use the batches and their batch means, $\hat{\theta}_{b, m}$, to obtain Bayesian simulation based estimates of the unknown process variance, $\frac{\sigma^2}{b}$, using the sample variance statistic:

\[\frac{1}{m} \sum_{j = 1}^m ( \hat{\theta}_{b, m} - \hat{\theta}_n )^2. \] As the last expression estimates $\sigma^2 / b$, we can solve for the numerator and estimate the variance by: \[\hat{\sigma}^2_{b, m} = \frac{b}{m} \sum_{j = 1}^m ( \hat{\theta}_{b, m} - \hat{\theta}_n )^2.\] Therefore, for sufficiently large $b$ and $m$, the MCMC simulation yields a good estimator of $\sigma^2$.

Similarly, we can perform Bayesian inference on the parameter $\theta$ as well as compute confidence intervals for $\theta$. For instance, when we run the MCMC simulation long enough (large sample), we get large $b$ and $m$ such that $n = b m$, the expression below represents a 95% confidence interval for $\theta$: \[\hat{\theta}_n \pm 1.96 \times \frac{\hat{\sigma}_{b, m}}{\sqrt{n}}. \]

Some additional details and demonstrations are provided in the SOCR Bayesian Estimator.

Christou N, Dinov ID. (2011) Confidence Interval Based Parameter Estimation-A New SOCR Applet and Activity. PLoS ONE 6(5): e19178. doi:10.1371/journal.pone.0019178.
SOCR General Confidence Interval Activity.

1.5.2 Random-Walk Metropolis Algorithm

Remember that in Chapter 21, we discussed the importance of randomly sampling from diverse arrays of probability distributions. The Metropolis random walk algorithm provides a protocol for simulating data from the distribution of a continuous random vector for a given PDF $h$. The pseudo-code implementation of the Metropolis algorithm is provided below.

Initiate the process starting at any point $x$ where $h(x) > 0$.
Repeat the following
- Generate a random vector $y_n \sim MultivariateNormal(\mu=0, \Sigma)$ that is independent of $x$. These normal random vectors will be IID.
- Calculate $r = \frac{h(x_n + y_n) }{h(x_n)}$.
- Simulate $u_n \sim {Uniform}(0, 1)$ random variable.
- If $u_n < r$, set $x_{n+1} := x_n + y_n$, otherwise $x_{n+1} := x_n$.
- Return $x_n$.

In the Metropolis algorithm, we can only adjust the proposal variance, $\Sigma$, which in the scalar version corresponds to $\sigma_2$. Too small variance yields MCMC chains taking tiny steps forward and requiring a very long time to equilibrate. On the flip side, large variance would force the algorithm to take very large steps, $y_n$, most of which may be discarded (not accepted or used) in the last step because the calculation of $r$ in step 3 may be too small.

This process generates a positively recurrent Markov chain that obeys the ergodic LLN law and has $h$ as its equilibrium distribution. This algorithm works for any distribution function $h\geq 0$.

1.5.3 MCMC R implementation

The R function mcmc::metrop implements the Random-Walk Metropolis algorithm. It requires a predefined R function to compute $\log (h(x))$. In the degenerate case of $h(x) = 0$ for some $x$, would yield $\log h(x) = - \infty$.

The metrop function parameters include a scale parameter vector and an output function, $g$, coded as an R function.

The scale argument to the metrop function specifies the multivariate normal random vector $y$ in the Metropolis algorithm. If z is a standard multivariate normal random vector, then $y$ in the algorithm is $scale \times z$ (when scale is scalar or vector) and $scale\ \%*\%\ z$ (when scale is a matrix of appropriate dimensions.

The acceptance rate represents the proportion of the time that $u < r$ in step 4 of the algorithm, e.g., acceptance=20%. When $h$ is a continuous distribution function the acceptance rate may be close to one, which reduces the iterative steps and may substantially increase the execution time to transition from one part of the sample space to another. For scale parameters, the algorithm will take large steps. When $h(x)$ is moderately large and $x$ resembles a sample from the equilibrium distribution, $x + y$ will be extremely large, in the tails of the distribution, corresponding to a very small $r = \frac{h(x + y)}{ h(x)}$.

The Metropolis algorithm only requires un-normalized prior distribution, so $h$ can be defined as \[\text{un-normalized posterior} \sim \text{likelihood} \times \text{un-normalized prior}. \]

MCMC-based Bayesian inference is always possible when we can provide specific functions that evaluate the likelihood and unnormalized posterior.

1.6 Hospital Admissions Case-Study

1.6.1 Data

We will demonstrate frequentist and Bayesian inference using a clinical case-study of hospital admissions; N=58,863 (cases) and k=9 (features) including:

ID: Coded Case Identifier
AdmissionLengthDays: Duration of hospital stay (in days)
Death_1: Indicator of Death (1) survival (0)
Admission_Type: Type of admission (categorical)
Insurance_Type: Type of Health insurance (string)
EnglishLanguage_1: Primary Language English Indicator (1), 0 otherwise
Religion_Type: Type of Religious Affiliation (string)
Married_1: Indicator of marital status (Married=1)
Race: Race, categorical (string)
Dx: Diagnosis

First, we will load and inspect the data.

hosAdmissions <- read.csv("https://umich.instructure.com/files/5005926/download?download_frd=1")
head(hosAdmissions)

##      ID AdmissionLengthDays Death_1 Admission_Type Insurance_Type
## 1 42862            4.277778       0       elective        private
## 2  5967           26.122917       0       elective        private
## 3 50952            7.028472       1      emergency       medicare
## 4 17138           34.187500       0       elective       medicare
## 5 27703            6.165972       0       elective     government
## 6 29593            3.286806       0       elective       medicare
##   EnglishLanguage_1       Religion_Type Married_1                   Race
## 1                 1        episcopalian         1                  white
## 2                 0            catholic         1                  white
## 3                 1            catholic         0       asian - japanese
## 4                 1            catholic         1                  white
## 5                 0   protestant quaker         0 black/african american
## 6                 0 christian scientist         0                  white
##                                                 Dx
## 1   mitral stenosis\\mitral valve replacement /sda
## 2                                duodenal mass/sda
## 3                             ? serotonin syndrome
## 4                    abdominal aortic aneurysm/sda
## 5                                      absence/sda
## 6                     adenoid cystic carcinoma/sda

# remove the case ID:
hosAdmissions <- hosAdmissions[, -1]

# Clean the diagnostic variable (free text feature!)
preproc_fun = function(x)                    # text data
{ require("tm")
  x  =  gsub("<.*?>", " ", x)               # regex removing HTML tags
  x  =  iconv(x, "latin1", "ASCII", sub="") # remove non-ASCII characters
  x  =  gsub("[^[:alnum:]]", " ", x)        # remove non-alpha-numeric values
  x  =  tolower(x)                          # convert to lowercase characters
  # x  =  removeNumbers(x)                  # removing numbers
  x  =  stripWhitespace(x)                  # removing white space
  x  =  gsub("^\\s+|\\s+$", "", x)          # remove leading and trailing white space
  return(x)
}

# Clean the DX feature by itself
Dx11 <- preproc_fun(hosAdmissions$Dx); hosAdmissions$Dx <- Dx11

## Loading required package: tm

## Loading required package: NLP

Each row of this data frame represents one hospital admission record. The response variable, $y$ (random outcome), is a binary survival outcome (1 for death, 0 for survival). The frequentist inference will be based on logistic regression modeling.

#Check sizes
cases <- unique(hosAdmissions$ID); length(cases); dim(hosAdmissions)

## [1] 0

## [1] 58863     9

# define the  binary outcome variable
y <- hosAdmissions$Death_1

Let’s fit a logistic prediction model for the clinical outcome $y$ representing the binary variable Death_1.

# model_glm <- glm(y ~ ., data = hosAdmissions, family = binomial)
model_glm <- glm(y ~ AdmissionLengthDays + Married_1 + Insurance_Type + Admission_Type, data = hosAdmissions, family = binomial)
summary(model_glm)

## 
## Call:
## glm(formula = y ~ AdmissionLengthDays + Married_1 + Insurance_Type + 
##     Admission_Type, family = binomial, data = hosAdmissions)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -4.326120   0.131779 -32.829  < 2e-16 ***
## AdmissionLengthDays      0.001042   0.001273   0.819 0.412896    
## Married_1               -0.003940   0.029010  -0.136 0.891965    
## Insurance_Typemedicaid   0.274587   0.121530   2.259 0.023857 *  
## Insurance_Typemedicare   0.920552   0.110573   8.325  < 2e-16 ***
## Insurance_Typeprivate    0.382308   0.113017   3.383 0.000718 ***
## Insurance_Typeself pay   1.017459   0.156722   6.492 8.46e-11 ***
## Admission_Typeemergency  1.707001   0.073893  23.101  < 2e-16 ***
## Admission_Typenewborn   -0.875436   0.148996  -5.876 4.21e-09 ***
## Admission_Typeurgent     1.657608   0.111239  14.901  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 38120  on 58862  degrees of freedom
## Residual deviance: 35470  on 58853  degrees of freedom
## AIC: 35490
## 
## Number of Fisher Scoring iterations: 7

Stratify the population, say on Race=white, refit the logistic model, and report the results.

model_glm_white <- glm(y ~ AdmissionLengthDays + as.factor(Married_1) + as.factor(Insurance_Type) + as.factor(Admission_Type), data = hosAdmissions, family = binomial, subset = hosAdmissions$Race == "white")
summary(model_glm_white)

## 
## Call:
## glm(formula = y ~ AdmissionLengthDays + as.factor(Married_1) + 
##     as.factor(Insurance_Type) + as.factor(Admission_Type), family = binomial, 
##     data = hosAdmissions, subset = hosAdmissions$Race == "white")
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        -4.408236   0.189887 -23.215  < 2e-16 ***
## AdmissionLengthDays                 0.002208   0.001518   1.455 0.145737    
## as.factor(Married_1)1               0.033884   0.034740   0.975 0.329387    
## as.factor(Insurance_Type)medicaid   0.298104   0.186195   1.601 0.109370    
## as.factor(Insurance_Type)medicare   0.972134   0.171160   5.680 1.35e-08 ***
## as.factor(Insurance_Type)private    0.364072   0.173671   2.096 0.036052 *  
## as.factor(Insurance_Type)self pay   0.854595   0.245315   3.484 0.000495 ***
## as.factor(Admission_Type)emergency  1.702102   0.084834  20.064  < 2e-16 ***
## as.factor(Admission_Type)newborn   -0.825623   0.188062  -4.390 1.13e-05 ***
## as.factor(Admission_Type)urgent     1.479738   0.149358   9.907  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26399  on 40926  degrees of freedom
## Residual deviance: 24603  on 40917  degrees of freedom
## AIC: 24623
## 
## Number of Fisher Scoring iterations: 7

Report the variance-covariance matrix of the estimates of the model coefficients:

summ_out <- summary(model_glm_white)
summ_out$coefficients

##                                        Estimate  Std. Error     z value
## (Intercept)                        -4.408236433 0.189887179 -23.2150293
## AdmissionLengthDays                 0.002208128 0.001517869   1.4547558
## as.factor(Married_1)1               0.033884030 0.034740425   0.9753488
## as.factor(Insurance_Type)medicaid   0.298103910 0.186195000   1.6010307
## as.factor(Insurance_Type)medicare   0.972133761 0.171159880   5.6796824
## as.factor(Insurance_Type)private    0.364071672 0.173670514   2.0963355
## as.factor(Insurance_Type)self pay   0.854595497 0.245315401   3.4836602
## as.factor(Admission_Type)emergency  1.702102077 0.084834009  20.0639118
## as.factor(Admission_Type)newborn   -0.825622895 0.188062297  -4.3901564
## as.factor(Admission_Type)urgent     1.479737666 0.149358060   9.9073171
##                                         Pr(>|z|)
## (Intercept)                        3.210399e-119
## AdmissionLengthDays                 1.457369e-01
## as.factor(Married_1)1               3.293873e-01
## as.factor(Insurance_Type)medicaid   1.093701e-01
## as.factor(Insurance_Type)medicare   1.349451e-08
## as.factor(Insurance_Type)private    3.605244e-02
## as.factor(Insurance_Type)self pay   4.946070e-04
## as.factor(Admission_Type)emergency  1.525943e-89
## as.factor(Admission_Type)newborn    1.132692e-05
## as.factor(Admission_Type)urgent     3.868958e-23

var_cov <- vcov(model_glm_white)
var_cov

##                                      (Intercept) AdmissionLengthDays
## (Intercept)                         3.605714e-02       -2.153387e-05
## AdmissionLengthDays                -2.153387e-05        2.303925e-06
## as.factor(Married_1)1              -4.616676e-04       -3.524855e-07
## as.factor(Insurance_Type)medicaid  -2.882354e-02       -3.075147e-06
## as.factor(Insurance_Type)medicare  -2.886613e-02        6.731918e-07
## as.factor(Insurance_Type)private   -2.886567e-02       -5.812371e-07
## as.factor(Insurance_Type)self pay  -2.883529e-02        8.082561e-06
## as.factor(Admission_Type)emergency -7.045865e-03       -2.026943e-06
## as.factor(Admission_Type)newborn   -7.298670e-03       -5.844913e-06
## as.factor(Admission_Type)urgent    -6.945212e-03       -8.437119e-06
##                                    as.factor(Married_1)1
## (Intercept)                                -4.616676e-04
## AdmissionLengthDays                        -3.524855e-07
## as.factor(Married_1)1                       1.206897e-03
## as.factor(Insurance_Type)medicaid           9.691941e-05
## as.factor(Insurance_Type)medicare          -2.370309e-04
## as.factor(Insurance_Type)private           -4.300747e-04
## as.factor(Insurance_Type)self pay           5.090034e-05
## as.factor(Admission_Type)emergency          1.559534e-04
## as.factor(Admission_Type)newborn            8.603678e-04
## as.factor(Admission_Type)urgent             6.969787e-05
##                                    as.factor(Insurance_Type)medicaid
## (Intercept)                                            -2.882354e-02
## AdmissionLengthDays                                    -3.075147e-06
## as.factor(Married_1)1                                   9.691941e-05
## as.factor(Insurance_Type)medicaid                       3.466858e-02
## as.factor(Insurance_Type)medicare                       2.880777e-02
## as.factor(Insurance_Type)private                        2.879171e-02
## as.factor(Insurance_Type)self pay                       2.882194e-02
## as.factor(Admission_Type)emergency                      1.063638e-06
## as.factor(Admission_Type)newborn                        1.113055e-04
## as.factor(Admission_Type)urgent                         6.101094e-05
##                                    as.factor(Insurance_Type)medicare
## (Intercept)                                            -2.886613e-02
## AdmissionLengthDays                                     6.731918e-07
## as.factor(Married_1)1                                  -2.370309e-04
## as.factor(Insurance_Type)medicaid                       2.880777e-02
## as.factor(Insurance_Type)medicare                       2.929570e-02
## as.factor(Insurance_Type)private                        2.890530e-02
## as.factor(Insurance_Type)self pay                       2.882195e-02
## as.factor(Admission_Type)emergency                      9.022303e-05
## as.factor(Admission_Type)newborn                        3.258590e-04
## as.factor(Admission_Type)urgent                         1.185816e-04
##                                    as.factor(Insurance_Type)private
## (Intercept)                                           -2.886567e-02
## AdmissionLengthDays                                   -5.812371e-07
## as.factor(Married_1)1                                 -4.300747e-04
## as.factor(Insurance_Type)medicaid                      2.879171e-02
## as.factor(Insurance_Type)medicare                      2.890530e-02
## as.factor(Insurance_Type)private                       3.016145e-02
## as.factor(Insurance_Type)self pay                      2.879496e-02
## as.factor(Admission_Type)emergency                     1.708531e-04
## as.factor(Admission_Type)newborn                      -8.262959e-04
## as.factor(Admission_Type)urgent                        1.467773e-04
##                                    as.factor(Insurance_Type)self pay
## (Intercept)                                            -2.883529e-02
## AdmissionLengthDays                                     8.082561e-06
## as.factor(Married_1)1                                   5.090034e-05
## as.factor(Insurance_Type)medicaid                       2.882194e-02
## as.factor(Insurance_Type)medicare                       2.882195e-02
## as.factor(Insurance_Type)private                        2.879496e-02
## as.factor(Insurance_Type)self pay                       6.017965e-02
## as.factor(Admission_Type)emergency                     -9.802637e-05
## as.factor(Admission_Type)newborn                        2.327653e-04
## as.factor(Admission_Type)urgent                         7.672538e-05
##                                    as.factor(Admission_Type)emergency
## (Intercept)                                             -7.045865e-03
## AdmissionLengthDays                                     -2.026943e-06
## as.factor(Married_1)1                                    1.559534e-04
## as.factor(Insurance_Type)medicaid                        1.063638e-06
## as.factor(Insurance_Type)medicare                        9.022303e-05
## as.factor(Insurance_Type)private                         1.708531e-04
## as.factor(Insurance_Type)self pay                       -9.802637e-05
## as.factor(Admission_Type)emergency                       7.196809e-03
## as.factor(Admission_Type)newborn                         6.911878e-03
## as.factor(Admission_Type)urgent                          6.877517e-03
##                                    as.factor(Admission_Type)newborn
## (Intercept)                                           -7.298670e-03
## AdmissionLengthDays                                   -5.844913e-06
## as.factor(Married_1)1                                  8.603678e-04
## as.factor(Insurance_Type)medicaid                      1.113055e-04
## as.factor(Insurance_Type)medicare                      3.258590e-04
## as.factor(Insurance_Type)private                      -8.262959e-04
## as.factor(Insurance_Type)self pay                      2.327653e-04
## as.factor(Admission_Type)emergency                     6.911878e-03
## as.factor(Admission_Type)newborn                       3.536743e-02
## as.factor(Admission_Type)urgent                        6.907991e-03
##                                    as.factor(Admission_Type)urgent
## (Intercept)                                          -6.945212e-03
## AdmissionLengthDays                                  -8.437119e-06
## as.factor(Married_1)1                                 6.969787e-05
## as.factor(Insurance_Type)medicaid                     6.101094e-05
## as.factor(Insurance_Type)medicare                     1.185816e-04
## as.factor(Insurance_Type)private                      1.467773e-04
## as.factor(Insurance_Type)self pay                     7.672538e-05
## as.factor(Admission_Type)emergency                    6.877517e-03
## as.factor(Admission_Type)newborn                      6.907991e-03
## as.factor(Admission_Type)urgent                       2.230783e-02

# It may be easier to display the matrix in Rstudio as a table
View(var_cov)

1.6.2 Bayesian Inference

To perform the analogous MCMC Bayesian inference, we will use the R function mcmc::metrop, which takes as input the log unnormalized posterior probability distribution (that we need to generate MCMC samples from). Let’s rewrite the Bayesian rule ($unnormalized\ posterior\ \sim \ likelihood \times \ unnormalized\ prior$) in terms of logs: \[\text{log unnormalized posterior} \sim \text{log likelihood} + \text{log unnormalized prior}. \] Let’s examine the two terms on the right-hand side.

1.6.3 Log Likelihood

The frequentist and Bayesian analyses share the same (log) likelihood function. Thus, the frequentist model matrix can be reused in the Bayesian analysis.

hosAdmissions$Married_1 <- as.factor(hosAdmissions$Married_1)
hosAdmissions$Insurance_Type <- as.factor(hosAdmissions$Insurance_Type)
hosAdmissions$Admission_Type <- as.factor(hosAdmissions$Admission_Type)

model_glm <- glm(y ~ AdmissionLengthDays + Married_1 + Insurance_Type + Admission_Type, 
                 data = hosAdmissions, family = binomial, x=TRUE) # return the design matrix X
design_mat <- model_glm$x
response <- model_glm$y

We need to define a function that computes the log likelihood, $\log(f(x | \beta))$.

log_likelihod <- function(beta) {
    # housekeeping
    stopifnot(is.numeric(beta))
    stopifnot(is.finite(beta))
    stopifnot(length(beta) == ncol(design_mat))
    # Compute the estimates y_hat = X*b
    eta <- drop(design_mat %*% beta)
    # use log1p to calculate the function x --> log(1+x)
    # correctly for small x, avoiding singularity problems
    logp <- ifelse(eta < 0, eta - log1p(exp(eta)), - log1p(exp(-eta)))
    logq <- ifelse(eta < 0, -log1p(exp(eta)), -eta -log1p(exp(-eta)))
    sum(ifelse(response == 1, logp, logq))
}

The reason for computing log likelihoods, $\log(f(x | \beta))$, instead of the raw likelihoods, $f(x | \beta)$, is that it makes the asymptotic calculations more stable. For instance, let $\eta = X\times \beta$, $p=\frac{\exp{(\eta)}}{1+\exp{(\eta)}}$, and $q=1-p=\frac{1}{1+\exp{(\eta)}}$. Then, $\log(p)=\eta-\log(1+\exp{(\eta)})$ and $\log(q)=-\log(1+\exp{(\eta)})$ are computable for large $|\eta|$, either positive or negative.

The mcmc::demo package (vignette(“demo”, “mcmc”)) includes additional vignettes and examples.

Ideally, we would choose informative Bayesian priors, but we also don’t want to introduce strong subjectivity. Conjugate priors assume the prior and posterior distributions are of the same probability distribution family.

\[\prod_{i = 1}^{k} p(\beta_i) q(\beta_i),\] where $p(\beta) = {\rm logit}^{- 1}(\beta)$ is the mean-value parameter corresponding to the linear predictor $\beta$ and $q(\beta) = 1 - p(\beta)$; the $p$ and $q$ functions are used to calculate for various $\beta$ values the log likelihood, see log_likelihod.

log_unnormalized_prior <- function(beta) {
    stopifnot(is.numeric(beta))
    stopifnot(is.finite(beta))
    stopifnot(length(beta) == ncol(design_mat))
    logp <- ifelse(beta < 0, beta - log1p(exp(beta)), - log1p(exp(- beta)))
    logq <- ifelse(beta < 0, - log1p(exp(beta)), - beta - log1p(exp(- beta)))
    sum(logp) + sum(logq)
}
log_unnormalized_posterior <- function(beta) {
  log_likelihod(beta) + log_unnormalized_prior(beta)
}

We are ready to run the MCMC Bayesian inference.

# install.packages("mcmc")
library(mcmc)
logUnnormalizedPDF <- function(x) {
  if (all(x >= 0) && sum(x) <= 1) return(1) 
  else return(-Inf)
}
set.seed(1234) 
mcmc_out <- metrop(log_unnormalized_prior, rep(0, ncol(design_mat)),nbatch = 100, blen = 100)
# Runs a "random-walk" Metropolis algorithm with multivariate normal proposal
# Generates a Markov chain (MC) with equilibrium distribution having the
# specified unnormalized density
mcmc_out$accept

## [1] 0.3907

The main output, mcmc_out, of the MCMC call, metrop, includes:

accept: fraction of Metropolis proposals accepted.
batch: $nbatch \times k$ batch means matrix, where $k$ is the dimension of the result of outfun if outfun is a function, or the dimension of state when outfun is missing.
initial: value of argument initial.
final: final state of Markov chain.
time: running time of Markov chain from system.time().

Sometimes, we need to adjust the proposal to get higher acceptance rates (mcmc_out$accept). In general, an acceptance rate of $\sim 20\%$ is good. Forcing the acceptance rates below 70% may keep the sampler from ever visiting part of the state space. The acceptance rate reflects the proportion of the time that $u < r$ in step 4 of the algorithm.

The metrop function accepts as first argument the output of a previous run, and it uses all the same arguments as the previous run except for any modified in the call for this new run. Of course the initial state of the next run is the final state of the previous run and the random number generation seed at the start of the next are the same as at the end of the previous run. This yields a continuous feed-forward iteration if no modification of other arguments is made.

mcmc_out <- metrop(mcmc_out, scale = 0.1)
mcmc_out$accept

## [1] 0.9311

mcmc_out <- metrop(mcmc_out, scale = 0.3)
mcmc_out$accept

## [1] 0.7905

mcmc_out <- metrop(mcmc_out, scale = 0.5)
mcmc_out$accept

## [1] 0.6642

mcmc_out <- metrop(mcmc_out, scale = 0.4)
mcmc_out$accept

## [1] 0.7219

t.test(mcmc_out$batch)$conf.int

## [1] -0.05332039  0.11673006
## attr(,"conf.level")
## [1] 0.95

mcmc_out$blen  # check the length of batches

## [1] 100

It may be appropriate to check that the batch length is long enough for the validity of this confidence interval for the true unknown acceptance rate.

Similar to diagnostic plots for linear models, analogous MCMC diagnostic plots may be drawn to confirm there are no clear problems with the simulation estimates. For instance, we can display the time series of any component of the Markov function, e.g., acceptance indicators or their batch means. Ideally, the time series should behave stationarily.

# Recall the actual model
# model_glm <- glm(y ~ AdmissionLengthDays + as.factor(Married_1) + as.factor(Insurance_Type) + as.factor(Admission_Type), data = hosAdmissions, family = binomial, x=TRUE)

library(plotly)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:igraph':
## 
##     groups

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

# Batch Means for AdmissionLengthDays
#plot(ts(mcmc_out$batch[ , 1]), main="Batch Means for AdmissionLengthDays Coefficient")
plot_ly(x=c(1:length(ts(mcmc_out$batch[ , 1]))), y=ts(mcmc_out$batch[ , 1]), 
        type="scatter", mode="markers+lines") %>%
  layout(title="Batch Means for AdmissionLengthDays Coefficient", 
         xaxis=list(title="time"), yaxis=list(title="mcmc_out"))

# Married_1
# plot(ts(mcmc_out$batch[ , 2]), main="Batch Means for Married_1 Coefficient")
plot_ly(x=c(1:length(ts(mcmc_out$batch[ , 2]))), y=ts(mcmc_out$batch[ , 2]), 
        type="scatter", mode="markers+lines") %>%
  layout(title="Batch Means for Married_1 Coefficient", 
         xaxis=list(title="time"), yaxis=list(title="mcmc_out"))

Autocorrelation (ACF) plots may also be used to report the timeseries autocorrelation of the time series of batch means. These depict the internal timeseries correlation that should not be significantly nonzero, except for lag zero. Otherwise, it may suggest that the batch length is not long enough.

acf(ts(mcmc_out$batch[ , 1]), main="Autocorrelation of AdmissionLengthDays Coefficient")

acf(ts(mcmc_out$batch[ , 2]), main="Autocorrelation of Married_1 Coefficient")

acf(ts(mcmc_out$batch), main="Autocorrelation of Acceptance Indicators")

It appears as if some of the low-lag autocorrelations may be significantly non-trivial, i.e., the blue dashed error lines indicate the cutoff points for $0.05$ level tests of the null hypothesis that the true unknown autocorrelation is zero. To mediate this problem, we can increase the batch length and re-plot the resulting autocorrelations.

Are there examples of questions that may be answerable using Bayesian inference that may not be within the reach of a frequentist inference? Question of about probability of parameters only make sense in a Bayesian setting, as frequentist inference considers all parameters as static, not random.

Once we applied the Bayesian rule and sampled the posterior probability distribution, we can examine the function of those parameters to answer specific parameter probabilistic questions. In order to calculate the probabilities of specific events, we need to output the indicators of these events.

outputIndicatorFun <- function(beta) {
    stopifnot(is.numeric(beta))
    stopifnot(is.finite(beta))
    stopifnot(length(beta) == ncol(design_mat))
    as.numeric(beta == max(beta))
}

Let’s call the MCMC with a reference to the output indicator function.

mcmc_out <- metrop(mcmc_out, blen = 1000, outfun = outputIndicatorFun)
mcmc_out$time

##    user  system elapsed 
##    0.95    0.00    2.04

mcmc_out$accept

## [1] 0.71766

Notice that the increase of blen drives the computing time higher. However, longer run times yield more precise answers.

colnames(mcmc_out$batch)

## NULL

colnames(mcmc_out$batch) <- names(model_glm_white$coefficients)
# skip over the intercept column and deal with the remaining 10-1=9 coefficient estimates
mcmc_out_ts <- as.ts(mcmc_out$batch[ , 2:10])
# report the overall means for each coefficient by batch averaging
colMeans(mcmc_out_ts)

##                AdmissionLengthDays              as.factor(Married_1)1 
##                            0.08101                            0.10191 
##  as.factor(Insurance_Type)medicaid  as.factor(Insurance_Type)medicare 
##                            0.09991                            0.10775 
##   as.factor(Insurance_Type)private  as.factor(Insurance_Type)self pay 
##                            0.10386                            0.09003 
## as.factor(Admission_Type)emergency   as.factor(Admission_Type)newborn 
##                            0.10722                            0.09921 
##    as.factor(Admission_Type)urgent 
##                            0.10589

Let’s report only the coefficient estimates corresponding to the larger probabilities ($p>0.1$).

mcmc_out_ts_highProb <- mcmc_out_ts[ , colMeans(mcmc_out_ts) > 0.1]
colMeans(mcmc_out_ts_highProb)

##              as.factor(Married_1)1  as.factor(Insurance_Type)medicare 
##                            0.10191                            0.10775 
##   as.factor(Insurance_Type)private as.factor(Admission_Type)emergency 
##                            0.10386                            0.10722 
##    as.factor(Admission_Type)urgent 
##                            0.10589

# Compute the grand variance (variance of batch means)
apply(mcmc_out_ts_highProb, 2, sd) / sqrt(mcmc_out$batch)

##        (Intercept) AdmissionLengthDays as.factor(Married_1)1
##   [1,]   0.1862559           0.3226046             0.3345570
##   [2,]   0.2821392           0.4448249             0.1720344
##   [3,]   0.5178081           0.2188139             0.2388471
##   [4,]   0.2007440           0.2975295             0.2927687
##   [5,]   0.2483227           0.4301076             0.2633860
##   [6,]   0.3803786           0.1634675             0.3900100
##   [7,]   0.1924431           0.2509151             0.3922987
##   [8,]         Inf           0.2767801             0.2315708
##   [9,]   0.1971004           0.2112527             0.1515867
##  [10,]   0.1691363           0.3436282             0.1801522
##  [11,]   0.3377586           0.1605995             0.2104048
##  [12,]   0.2631620           0.2556055             0.2317012
##  [13,]   0.2159409           0.1688904             0.3573216
##  [14,]   0.2422708           0.2371698             0.1532255
##  [15,]   0.3551496           0.2836045             0.7852654
##  [16,]   0.1638370           0.2890539             0.3803786
##  [17,]   0.4886795           0.2299257             0.6579049
##  [18,]   0.4631417           0.2178435             0.1697960
##  [19,]   0.2246482           0.1672693             0.6708174
##  [20,]   0.3148065           0.9617498             0.3205833
##  [21,]   0.4602873           0.2251724             0.3410538
##  [22,]   0.1475739           0.4652090             1.1768961
##  [23,]   0.4300163           0.2375867             0.3971407
##  [24,]   0.2347372           0.2777448             0.2112527
##  [25,]   0.2074163           0.2815715             0.1560164
##  [26,]   0.1800456           0.2261409             0.2223399
##  [27,]   0.5263239           0.2699985             0.3101393
##  [28,]   0.2188139           0.2238680             0.1802777
##  [29,]   0.2130205           0.2236058             0.2078452
##  [30,]   0.1904548                 Inf             0.3331599
##  [31,]   0.1815398           0.3021004             0.7343645
##  [32,]   0.4060677           0.4269051             0.2924022
##  [33,]   0.2829088           0.2270739             0.3661456
##  [34,]   0.3389581           0.1866505             0.2371698
##  [35,]   0.1850888           0.3926327             0.6533803
##  [36,]   0.4177039           0.1830718             0.3118481
##  [37,]   0.2372851           0.5064554             0.1385063
##  [38,]   0.2964961           0.2573009             0.4300163
##  [39,]   0.2951203           0.5999973             0.3985249
##  [40,]   0.1574033           0.4301076             1.0535441
##  [41,]   1.2178058           0.9205747             0.2831341
##  [42,]   0.2344433           0.2821392             0.2497823
##  [43,]   0.1730842           0.1866051             0.4300163
##  [44,]   0.2176419           0.5196129             0.4472116
##  [45,]   0.1718141           0.3148065             0.2525679
##  [46,]   0.2043920           0.9943343             0.4687337
##  [47,]   0.2486647           0.2432933             0.3168098
##  [48,]   0.1916494           0.6188991             0.1597991
##  [49,]   0.2927687           0.2614456             0.2738600
##  [50,]   0.1695739           0.4711592             0.2815715
##  [51,]   0.3951084           0.3068582             0.1709464
##  [52,]   0.4581062           0.7596831             0.3397406
##  [53,]   0.1929757           0.3807001             0.1352851
##  [54,]   0.1238579           0.1672693             0.1971004
##  [55,]   0.1886146           0.2738557             0.2585822
##  [56,]   0.3118481           0.1692867                   Inf
##  [57,]   0.2248344           0.6382615             0.2475619
##  [58,]   0.4159135           0.1890768             0.1978443
##  [59,]   0.2598064           0.1978132             0.2491353
##  [60,]   0.2367664           0.2738557             0.1862420
##  [61,]   0.1701105           0.3170896             0.5078601
##  [62,]   0.4804658           0.2618559             0.2475619
##  [63,]   0.2557272           0.1943299             0.5458177
##  [64,]   0.3543730           0.2838948             0.3319685
##  [65,]   0.2355796           0.3592555             0.2483227
##  [66,]   0.2073330           0.2512142             0.4687337
##  [67,]   0.2699985           0.2177944             0.3685005
##  [68,]   0.3308155           0.2281735             0.1993044
##  [69,]   0.3051272           0.1792835             0.2777448
##  [70,]   0.1910803           0.4808749             0.2944745
##  [71,]         Inf           0.2525612             0.2793838
##  [72,]   0.3123158           0.3880111             0.2821392
##  [73,]   0.2270739           0.2917518             0.1760601
##  [74,]   0.1943242           0.8783061             0.1545755
##  [75,]   0.2027547           0.2367664             0.2005387
##  [76,]   0.2911113           0.6089029             0.1301889
##  [77,]   0.1514327           0.3168098             0.2605692
##  [78,]   0.2829088           0.1750511             0.2228291
##  [79,]   0.2148335           0.2614456             0.2505796
##  [80,]   0.2344105           0.4301076             0.2082249
##  [81,]   0.2022665           0.2998031             0.2581743
##  [82,]   0.1727743           0.3342160             0.1480402
##  [83,]   0.2917518           0.4726920             0.2031011
##  [84,]   1.0392258           0.2927687             0.3674218
##  [85,]   0.2720239           0.3365423             0.1923500
##  [86,]   2.4356116           0.2870396             1.0892386
##  [87,]   0.3059196           0.7596831             0.2871332
##  [88,]   0.6188991                 Inf             0.1473450
##  [89,]   0.2665558           0.2166936             0.3769668
##  [90,]         Inf           0.3235935             0.2107088
##  [91,]   0.3284178           0.1937669             0.1684748
##  [92,]   0.4269051           0.2344433             0.6037349
##  [93,]   0.1658313           0.2238680             0.2710332
##  [94,]   0.2975295           0.2817997             0.1722500
##  [95,]   0.1292911           0.3205833             0.2391949
##  [96,]   0.3951084           0.6509446             0.2411615
##  [97,]   0.2532277           0.4726528             0.2644877
##  [98,]   0.2894636           0.2401278             0.2327375
##  [99,]   0.2195765           0.2449479             0.1936483
## [100,]   0.2757251           0.2570384             0.2256444
##        as.factor(Insurance_Type)medicaid as.factor(Insurance_Type)medicare
##   [1,]                         0.1605995                         0.1943830
##   [2,]                         0.2728862                         0.2743653
##   [3,]                         0.2351246                         1.1578542
##   [4,]                         0.2205634                         0.2777448
##   [5,]                         0.1195974                         0.1455416
##   [6,]                         0.2485836                         0.2447882
##   [7,]                         0.2080478                         0.2281904
##   [8,]                         0.3859514                         0.1163687
##   [9,]                         0.3721025                         0.2505796
##  [10,]                         1.1778981                         0.1722729
##  [11,]                         0.2689683                         0.2261409
##  [12,]                         0.6794813                         0.3581181
##  [13,]                         0.1658313                         0.3308155
##  [14,]                         0.1577484                         0.2007440
##  [15,]                         0.1791079                         0.3176555
##  [16,]                         0.2793838                         0.1762347
##  [17,]                         0.2821392                         0.5263239
##  [18,]                         0.2748240                         0.3971407
##  [19,]                         0.2121311                         0.3820270
##  [20,]                         0.1610389                         0.2601542
##  [21,]                         0.2343668                         0.1271379
##  [22,]                         0.2758687                         0.4269051
##  [23,]                         0.3040674                         0.1476442
##  [24,]                         0.3585670                         0.3191957
##  [25,]                         0.2776332                         0.2922006
##  [26,]                         0.3198115                         0.2553215
##  [27,]                         0.4448249                         0.1613550
##  [28,]                         0.1662582                         0.4093633
##  [29,]                         0.2112527                         0.1592227
##  [30,]                         0.2720239                         0.1760805
##  [31,]                         0.2161256                         0.3479445
##  [32,]                         0.4513190                         0.2134525
##  [33,]                         0.2497093                         0.9453840
##  [34,]                         0.3354087                         0.2422708
##  [35,]                         0.2684679                         0.2332586
##  [36,]                         0.1717948                         0.2512142
##  [37,]                         0.2372851                         0.2871332
##  [38,]                         0.1923091                         0.2259900
##  [39,]                         0.1532255                         0.3464086
##  [40,]                         0.1806813                         0.2601542
##  [41,]                         0.1595622                         0.4177039
##  [42,]                         0.5742665                         0.3342160
##  [43,]                         0.1654077                         0.3067231
##  [44,]                         0.2550679                         0.1326251
##  [45,]                         0.3400299                         0.3266902
##  [46,]                         0.1244539                         0.2365675
##  [47,]                         0.3059196                         0.3967316
##  [48,]                         0.2228291                         0.1819392
##  [49,]                         0.3354087                         0.3319685
##  [50,]                         0.1856627                         0.4374604
##  [51,]                         0.2998031                         0.1696971
##  [52,]                         0.9304180                         0.8772065
##  [53,]                         0.2808209                         0.2710332
##  [54,]                         0.6708174                         0.3464086
##  [55,]                         0.3511814                         0.5022574
##  [56,]                         0.3254723                         0.5740792
##  [57,]                         0.3145387                         0.3059196
##  [58,]                         0.1517071                         0.2557272
##  [59,]                         0.2359441                         0.1722500
##  [60,]                         0.5552665                         0.3205833
##  [61,]                         0.2657470                         0.2975573
##  [62,]                         0.4448249                         0.1661064
##  [63,]                         0.2729089                         0.1943299
##  [64,]                         0.1626971                         0.1432904
##  [65,]                         0.2150538                         0.2570384
##  [66,]                         0.6288722                         0.2641792
##  [67,]                         0.2193016                         0.3426077
##  [68,]                         0.3807001                         0.1808264
##  [69,]                         0.1878664                         0.2062021
##  [70,]                         0.3331599                         0.1796278
##  [71,]                         0.1841149                               Inf
##  [72,]                         0.2532277                         0.2193016
##  [73,]                         0.3274906                         0.2787785
##  [74,]                         0.2738600                         0.2701339
##  [75,]                         0.4164499                         0.2511287
##  [76,]                         0.5740792                         0.3021004
##  [77,]                         1.5193663                         0.4726528
##  [78,]                         0.2989567                         0.5616418
##  [79,]                         0.3721025                         0.2566185
##  [80,]                         0.1606640                         0.4374604
##  [81,]                         0.1835911                         0.1873547
##  [82,]                         0.2453998                         1.8608361
##  [83,]                         0.1654077                         0.2249215
##  [84,]                         0.2384147                         0.1787523
##  [85,]                         0.1976940                         0.2601542
##  [86,]                         1.4062010                         0.1516346
##  [87,]                         0.2432933                         0.2163178
##  [88,]                         0.2710332                         0.1781314
##  [89,]                         0.4107901                         0.1943242
##  [90,]                         0.2344105                         0.4374604
##  [91,]                         0.2740277                         0.2388314
##  [92,]                         0.7596831                         0.2743653
##  [93,]                         0.1414544                         0.3274906
##  [94,]                         0.3162263                         0.1803603
##  [95,]                         0.3176555                         0.3298774
##  [96,]                         0.2998031                         0.2008859
##  [97,]                         0.9304180                         0.1388916
##  [98,]                         0.3452054                         0.5979133
##  [99,]                         0.9486790                         0.2648193
## [100,]                         0.3982022                         0.2027547
##        as.factor(Insurance_Type)private as.factor(Insurance_Type)self pay
##   [1,]                        0.4687337                         0.2281159
##   [2,]                        0.3485665                         0.1688185
##   [3,]                        0.2808209                         0.1123284
##   [4,]                        0.2396796                         0.3985249
##   [5,]                        0.5267720                         0.2991864
##   [6,]                        0.2793838                         0.2332893
##   [7,]                        0.5487306                         0.2871332
##   [8,]                        0.2150081                         0.2388471
##   [9,]                        0.3629134                         0.2860375
##  [10,]                        0.1570531                         0.3772293
##  [11,]                        0.2512142                         0.5078601
##  [12,]                        0.2193016                         0.2979722
##  [13,]                        0.2327375                         0.2388471
##  [14,]                        0.3354087                         0.4173631
##  [15,]                        0.1700149                         0.3120328
##  [16,]                        0.2775637                         0.3226046
##  [17,]                        0.1754413                         0.2335185
##  [18,]                        0.1336975                         0.3452054
##  [19,]                        0.4242621                         0.1672693
##  [20,]                        0.1629538                         0.2141633
##  [21,]                        0.2812402                         0.2831341
##  [22,]                        0.3397406                         0.4213964
##  [23,]                        0.1929757                         0.2000467
##  [24,]                        0.2014972                         0.7006459
##  [25,]                        0.3041320                         0.2367664
##  [26,]                        0.3630795                         0.3170896
##  [27,]                        0.5610632                         0.2821392
##  [28,]                        1.1578542                         0.1726027
##  [29,]                        0.4173631                         0.2566185
##  [30,]                        0.2082249                         0.3148065
##  [31,]                        0.2460339                         0.6755171
##  [32,]                        0.2568198                         0.1972482
##  [33,]                        0.3242640                         0.3807001
##  [34,]                        0.4173631                         0.3426226
##  [35,]                        0.2256444                         0.2332586
##  [36,]                        0.3951084                         0.3068582
##  [37,]                        0.3798416                         0.3922987
##  [38,]                        0.2281735                         0.3531422
##  [39,]                        0.3162263                         0.2777448
##  [40,]                        0.2050457                         0.3872905
##  [41,]                        0.1835911                         0.3068582
##  [42,]                        0.2200671                         0.2353792
##  [43,]                        0.2427522                         0.3014796
##  [44,]                        0.1635006                         0.2797502
##  [45,]                        0.2555220                         0.1796278
##  [46,]                        0.4446798                         0.7343645
##  [47,]                        0.2018360                         0.2888578
##  [48,]                        0.2304216                         0.1745531
##  [49,]                        0.2070187                         0.1771865
##  [50,]                        0.2141633                         0.2187302
##  [51,]                        0.1975542                         0.2641792
##  [52,]                        0.2854392                         0.1598595
##  [53,]                        0.2710332                         0.2808209
##  [54,]                        0.5635993                         0.6708174
##  [55,]                        0.2525679                         0.1917120
##  [56,]                        0.1602515                         0.4116935
##  [57,]                        0.1961493                         0.1671080
##  [58,]                        0.2238680                         0.2315708
##  [59,]                        0.2157575                         0.4173631
##  [60,]                        0.1439032                         0.2050457
##  [61,]                        0.2104048                         0.3851040
##  [62,]                        0.3922987                         0.4060677
##  [63,]                        0.2270739                         0.3067231
##  [64,]                        0.2701339                         0.3629134
##  [65,]                        0.4374604                         0.2344105
##  [66,]                        0.5192741                         0.2301437
##  [67,]                        0.2618559                         0.3581181
##  [68,]                        0.1707163                         0.3122501
##  [69,]                        0.4472116                         0.2070187
##  [70,]                        0.1765744                         0.3235935
##  [71,]                        0.6755171                         0.1556055
##  [72,]                        0.1966965                         1.5193663
##  [73,]                        0.1438911                         0.3859514
##  [74,]                        0.4045181                         0.1713113
##  [75,]                        0.2058269                         0.2776332
##  [76,]                        0.2460339                         0.9205747
##  [77,]                        0.2273373                         0.2193016
##  [78,]                        0.2304216                         0.2691956
##  [79,]                        0.2904724                         0.5070903
##  [80,]                        0.1976940                         0.2299022
##  [81,]                        0.1525239                         0.5192741
##  [82,]                        0.6037349                         0.4652090
##  [83,]                        0.2894636                         0.2207941
##  [84,]                        0.4472116                         0.2701339
##  [85,]                        0.8328997                         0.2815715
##  [86,]                        0.1664950                         0.3377586
##  [87,]                        0.2805316                         0.2273373
##  [88,]                        0.3274906                         0.1280590
##  [89,]                        0.4107901                         0.2359441
##  [90,]                        0.3148065                         0.2216147
##  [91,]                        0.4177039                         0.5192741
##  [92,]                        0.1406660                         1.1768961
##  [93,]                        0.5616418                         0.3342437
##  [94,]                        0.2347372                         0.3721025
##  [95,]                        0.2367664                         0.8328997
##  [96,]                        0.2058468                         0.2058468
##  [97,]                        0.1632061                         0.8321912
##  [98,]                        0.6422620                         0.2589041
##  [99,]                        0.1831395                         0.1860513
## [100,]                        0.1910803                         0.1850888
##        as.factor(Admission_Type)emergency as.factor(Admission_Type)newborn
##   [1,]                          0.2911113                        0.3851040
##   [2,]                          0.3342160                        0.6579049
##   [3,]                          0.4031131                        0.4726920
##   [4,]                          0.2195765                        0.2520493
##   [5,]                          0.5889490                        0.5022574
##   [6,]                          0.1907718                        0.2088520
##   [7,]                          0.2685886                        0.2486647
##   [8,]                          0.3531422                        0.2031011
##   [9,]                          1.0392258                        0.2103851
##  [10,]                          0.2141633                              Inf
##  [11,]                          0.2332893                        0.2223399
##  [12,]                          0.2464738                        0.3581181
##  [13,]                          0.2062997                        0.2710332
##  [14,]                          0.3286320                        0.4845415
##  [15,]                          0.1880130                        0.1713590
##  [16,]                          0.2232722                        0.2932132
##  [17,]                          0.2200671                        0.6794813
##  [18,]                          0.4726920                        0.2573009
##  [19,]                          0.1992624                        0.1777039
##  [20,]                          0.2511287                        0.2226018
##  [21,]                          0.2706235                        0.4004121
##  [22,]                          0.1433535                        0.4886795
##  [23,]                          0.4631417                        0.2989567
##  [24,]                          0.1273423                        0.5196129
##  [25,]                          0.3093312                        0.2899785
##  [26,]                          0.2271220                        0.1815398
##  [27,]                          0.2685886                        0.3168098
##  [28,]                          0.2573009                        0.3491062
##  [29,]                          0.4647559                        0.2520493
##  [30,]                          0.2585822                        0.1614165
##  [31,]                          0.3144361                        0.1937669
##  [32,]                          0.1691669                        0.3880111
##  [33,]                          0.1419852                        0.1711821
##  [34,]                          0.1929794                        1.3416347
##  [35,]                          0.2878063                        0.1751050
##  [36,]                          0.2261409                        0.2311780
##  [37,]                          0.3397406                        0.2672005
##  [38,]                          0.1561250                        0.1813803
##  [39,]                          0.1581132                        0.1903714
##  [40,]                          0.1850888                        0.2288152
##  [41,]                          0.2073330                        0.1672785
##  [42,]                          0.4448249                        0.2087010
##  [43,]                          0.2375867                        0.4159135
##  [44,]                          0.7745932                        0.3162263
##  [45,]                          0.2005387                        0.2404374
##  [46,]                          0.1919531                        0.4239856
##  [47,]                          0.2924022                        0.2326045
##  [48,]                          0.3377808                        1.0356163
##  [49,]                          0.1992624                        0.1766737
##  [50,]                          0.3400299                        0.1949671
##  [51,]                          0.3630795                        0.3044515
##  [52,]                          0.1592729                        0.2924022
##  [53,]                          0.1866051                        1.1578542
##  [54,]                          0.5999973                        0.3426226
##  [55,]                          0.1546656                        0.2456087
##  [56,]                          0.2399879                        0.2472989
##  [57,]                          0.3798416                        0.3038733
##  [58,]                          0.6188991                        0.2259900
##  [59,]                          0.2927687                        0.1999991
##  [60,]                          0.4301076                        0.2050457
##  [61,]                          0.2498885                        0.4374488
##  [62,]                          0.1754413                        0.3038733
##  [63,]                          0.5178081                        0.2000467
##  [64,]                          0.2007440                        0.2166936
##  [65,]                          0.1641361                        0.2540320
##  [66,]                          0.1846432                        0.2740277
##  [67,]                          0.2290531                        0.3168098
##  [68,]                          0.4031131                        0.1666883
##  [69,]                          0.6445005                        0.2614456
##  [70,]                          0.3266902                        0.5267720
##  [71,]                          0.2152797                        0.1988669
##  [72,]                          0.3101393                        0.2127538
##  [73,]                          0.3414326                        0.2964961
##  [74,]                          0.2038089                        0.7745932
##  [75,]                          0.3331599                        0.1775748
##  [76,]                          0.1562446                        0.3630795
##  [77,]                          0.2412402                        0.3649400
##  [78,]                          0.3859514                        0.1355166
##  [79,]                          0.1722500                        0.2078452
##  [80,]                          0.4533732                        0.1839568
##  [81,]                          0.7343645                        0.3254723
##  [82,]                          0.3759457                        0.3426077
##  [83,]                          0.1526933                        0.4159135
##  [84,]                          0.1756612                        0.1532255
##  [85,]                          0.2667414                        0.1304755
##  [86,]                          0.2223399                        0.1487787
##  [87,]                          0.3123158                        0.1989317
##  [88,]                          0.2159409                        1.1578542
##  [89,]                          0.1563143                        0.2053950
##  [90,]                          0.3772293                        0.1314877
##  [91,]                          0.3021004                        0.2205099
##  [92,]                          0.3239300                        0.1754413
##  [93,]                          0.1854051                        0.3914262
##  [94,]                          0.3162263                        0.2701339
##  [95,]                          0.2115567                        0.2483227
##  [96,]                          0.2152797                        0.6089029
##  [97,]                          0.3342160                        0.6202787
##  [98,]                          0.1617361                        1.3369749
##  [99,]                          0.1943242                        0.2648193
## [100,]                          0.3235935                        0.2442847
##        as.factor(Admission_Type)urgent
##   [1,]                       0.2706235
##   [2,]                       0.2924022
##   [3,]                       0.2088004
##   [4,]                       0.1512652
##   [5,]                       0.4301076
##   [6,]                       0.5587676
##   [7,]                       0.3721672
##   [8,]                       0.5979133
##   [9,]                       0.1923174
##  [10,]                       0.6800598
##  [11,]                       0.3851040
##  [12,]                       0.1833536
##  [13,]                       0.5312599
##  [14,]                       0.2614456
##  [15,]                       0.3176555
##  [16,]                       0.2757787
##  [17,]                       0.1851601
##  [18,]                       0.2105189
##  [19,]                       2.3237795
##  [20,]                       0.3093312
##  [21,]                       0.3093230
##  [22,]                       0.3685005
##  [23,]                       0.1750511
##  [24,]                       0.2130205
##  [25,]                       0.1904548
##  [26,]                       0.4239856
##  [27,]                       0.1724031
##  [28,]                       0.2573009
##  [29,]                       0.2435983
##  [30,]                       0.2968024
##  [31,]                       0.1575466
##  [32,]                       0.2906138
##  [33,]                       0.1830728
##  [34,]                       0.1265843
##  [35,]                       0.1917120
##  [36,]                       0.2376914
##  [37,]                       0.4652090
##  [38,]                       0.3274906
##  [39,]                       0.2078452
##  [40,]                       0.3266902
##  [41,]                       0.3170896
##  [42,]                       0.2299257
##  [43,]                       0.2218046
##  [44,]                       0.2166936
##  [45,]                       0.2667414
##  [46,]                       0.2423524
##  [47,]                       0.2382556
##  [48,]                       0.4159135
##  [49,]                       0.2566185
##  [50,]                       0.2469545
##  [51,]                       0.2757787
##  [52,]                       0.2030339
##  [53,]                       0.2497093
##  [54,]                       0.1787523
##  [55,]                       0.3093312
##  [56,]                       0.1757751
##  [57,]                       0.4269051
##  [58,]                       0.3616529
##  [59,]                       0.2757819
##  [60,]                       0.4100915
##  [61,]                       0.1504726
##  [62,]                       0.2888578
##  [63,]                       0.1530263
##  [64,]                       0.7348436
##  [65,]                       0.1983938
##  [66,]                       0.1528239
##  [67,]                       0.2155907
##  [68,]                       0.3377808
##  [69,]                       0.1736874
##  [70,]                       0.1536752
##  [71,]                       0.2223399
##  [72,]                       0.2248344
##  [73,]                       0.1964158
##  [74,]                       0.2103851
##  [75,]                       0.2650478
##  [76,]                       0.2596371
##  [77,]                       0.3967316
##  [78,]                       0.2767801
##  [79,]                       0.1622998
##  [80,]                       0.3066985
##  [81,]                       0.7343645
##  [82,]                       0.2464738
##  [83,]                       0.3414326
##  [84,]                       0.2103851
##  [85,]                       0.5267720
##  [86,]                       0.9205747
##  [87,]                       0.2672005
##  [88,]                       2.3157084
##  [89,]                       0.2139212
##  [90,]                       0.1845202
##  [91,]                       0.1975542
##  [92,]                       0.3215035
##  [93,]                       0.3242640
##  [94,]                       0.1626971
##  [95,]                       0.2483227
##  [96,]                       0.1684748
##  [97,]                       0.3922987
##  [98,]                       0.1183272
##  [99,]                       0.3985249
## [100,]                       0.2310048

The Monte Carlo standard errors (MCSE) are computed for the corresponding Monte Carlo estimates of the posterior probabilities via this command: apply(mcmc_out_ts_highProb, 2, sd) / sqrt(mcmc_out$batch), where the second parameter represents margin=2 indicates the function is computed for each column separately.

The extra term in the denominator, sqrt(mcmc_out$batch) ensures that the batch means have variance $\frac{\sigma^2}{b}$, where $b$ is the batch length ($out\$blen$), whereas the CLT dictates that the overall means, $\mu$, have variance $\frac{\sigma^2}{n}$, where $n$ is the total number of iterations, which is $out\$blen \times out\$nbatch$.

The square root law implies that t guarantee one more significant digit of precision ($10 \times \ accuracy$) we need to run $100$ times as many iterations. Let’s display some diagnostic plots on the mcmc_out_ts_highProb results.

# plot(mcmc_out_ts_highProb)
df <- as.data.frame(mcmc_out_ts_highProb)

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:igraph':
## 
##     crossing

df_long <- gather(df, factors, value, colnames(df), factor_key=TRUE)
df_long$index <- rep(c(1:(dim(df)[1])), times=dim(df)[2])

plot_ly(data=df_long, x=~index, y=~value, color=~factors,
        type="scatter", mode="markers+lines") %>%
  layout(title="MCMC Output (Time-Serties High-Probility)", 
         legend = list(orientation = 'h', y=-0.5))

acf(mcmc_out_ts_highProb)

Most of these autocorrelation plots appear fine and we can generate the final overall summary of the MCMC Bayesian inference.

library(knitr)
bar <- rbind(colMeans(mcmc_out_ts_highProb), apply(mcmc_out_ts_highProb, 2, sd) / sqrt(mcmc_out$nbatch))
rownames(bar) <- c("probability of death", "MCSE")
kable(bar, digits = 2, caption = "Estimated (higherst) Posterior Probability of dying following an admission to the hospital. These probability values may not add to one due to rounding error and the fact that we only chose the high-probability predictors.")

Estimated (higherst) Posterior Probability of dying following an admission to the hospital. These probability values may not add to one due to rounding error and the fact that we only chose the high-probability predictors.
	as.factor(Married_1)1	as.factor(Insurance_Type)medicare	as.factor(Insurance_Type)private	as.factor(Admission_Type)emergency	as.factor(Admission_Type)urgent
probability of death	0.10	0.11	0.10	0.11	0.11
MCSE	0.01	0.01	0.01	0.01	0.01

We can also display histograms of the marginal posteriors for the parameters.

# mcmc_out <- metrop(out, blen = 1, nbatch = 1e5)
theta <- mcmc_out$batch[ , 1]
# hist(theta, freq = FALSE)
plot_ly(x = theta, type = "histogram") %>%
  layout(title="Histogram of Theta", xaxis=list(title="Theta"), yaxis=list(title="Density"))

# Other parameters
beta <- mcmc_out$batch[ , 2]
# hist(beta, freq = FALSE)
plot_ly(x = beta, type = "histogram") %>%
  layout(title="Histogram of Beta", xaxis=list(title="Beta"), yaxis=list(title="Density"))

# display smooth parameter density plots
out <- density(beta)
# plot(out)
plot_ly(x = out$x, y=out$y, type = "scatter", mode="lines") %>%
  layout(title="Beta Density", xaxis=list(title="Beta"), yaxis=list(title="Density"))

Of course, we can also try to forecast one of the clinical outcomes (“Death_1”, “AdmissionLengthDays”, or “Dx”) using the machine learning methods we saw in the previous chapters.

# Define Outcome and features
dim(hosAdmissions)

## [1] 58863     9

hosAdmissions$Married_1 <- as.numeric(hosAdmissions$Married_1)
hosAdmissions$Insurance_Type <- as.numeric(hosAdmissions$Insurance_Type)
hosAdmissions$Admission_Type <- as.numeric(hosAdmissions$Admission_Type)

y <- hosAdmissions[ , "Death_1"]   # outcome
x <- hosAdmissions[ , c("AdmissionLengthDays", "Admission_Type", "Insurance_Type", "Married_1", "Race", "Dx")]

# split training/testing data (80:20)
sub <- sample(nrow(x), floor(nrow(x)*0.8))
x <- x[sub, ]; y <- y[sub]
data_train <- cbind(x[ , c("AdmissionLengthDays","Married_1")], y); dim(data_train)  # training data

## [1] 47090     3

x_test <- hosAdmissions[-sub, c("AdmissionLengthDays", "Admission_Type", "Insurance_Type", "Married_1", "Race", "Dx")]
y_test <- hosAdmissions[-sub, "Death_1"]
data_test <- cbind(x_test[ , c("AdmissionLengthDays","Married_1")], y_test)  # testing data
dim(data_test)

## [1] 11773     3

library(neuralnet)
# hosAdmissions_NNmodel <- neuralnet(y.train ~ AdmissionLengthDays + Admission_Type + Insurance_Type + Married_1 + Race + Dx, data=data_train. hidden=3) # can;t have categorical/unstructured features
set.seed(1234)
hosAdmissions_NNmodel <- neuralnet(y ~ AdmissionLengthDays + Married_1, data=data_train, hidden=1)
plot(hosAdmissions_NNmodel)

# evaluate model on testing data

# Compute the posterior probabilities
hosAdmissions_pred<-compute(hosAdmissions_NNmodel, x_test[, c("AdmissionLengthDays","Married_1")])
# dichotomize the outcome
plot(hosAdmissions_pred$net.result)
# plot(data_test$y_test, hosAdmissions_pred$net.result)

# hist(hosAdmissions_pred$net.result) 
pred_results_bin <- as.factor(ifelse (hosAdmissions_pred$net.result>0.107, 1, 0))
table(pred_results_bin, data_test$y_test)

##                 
## pred_results_bin     0     1
##                0 10581  1192

library(caret)
caret::confusionMatrix(pred_results_bin, as.factor(data_test$y_test))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 10581  1192
##          1     0     0
##                                           
##                Accuracy : 0.8988          
##                  95% CI : (0.8932, 0.9041)
##     No Information Rate : 0.8988          
##     P-Value [Acc > NIR] : 0.5077          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8988          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.8988          
##          Detection Rate : 0.8988          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
##

Let’s also try decision tree classification for the same hospitalization case-study.

library(C50)

# The Dx free text is an important feature, but needs to be cleaned prior to processing .... here is an instance of a "cleaning" function
preproc_fun = function(x)                    # text data
{ require("tm")
  x  =  gsub("<.*?>", " ", x)               # regex removing HTML tags
  x  =  iconv(x, "latin1", "ASCII", sub="") # remove non-ASCII characters
  x  =  gsub("[^[:alnum:]]", " ", x)        # remove non-alpha-numeric values
  x  =  tolower(x)                          # convert to lowercase characters
  # x  =  removeNumbers(x)                  # removing numbers
  x  =  stripWhitespace(x)                  # removing white space
  x  =  gsub("^\\s+|\\s+$", "", x)          # remove leading and trailing white space
  return(x)
}

# Clean the DX feature by itself, if not already pre-cleaned upon loading the hosAdmissions dataset
# Dx11 <- preproc_fun(x[ , "Dx"]); x$Dx <- Dx11

hosAdmissions_DTmodel <- C5.0(x=x[ , c("AdmissionLengthDays", "Admission_Type", "Dx")], y=as.factor(y), control = C5.0Control(winnow = TRUE, minCases= 100))
# hosAdmissions_DTmodel
summary(hosAdmissions_DTmodel)

## 
## Call:
## C5.0.default(x = x[, c("AdmissionLengthDays", "Admission_Type", "Dx")], y
##  = as.factor(y), control = C5.0Control(winnow = TRUE, minCases = 100))
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Jun 20 10:11:13 2024
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 47090 cases (4 attributes) from undefined.data
## 
## No attributes winnowed
## 
## Decision tree:
## 
## AdmissionLengthDays > 1.638194: 0 (44140/3583)
## AdmissionLengthDays <= 1.638194:
## :...Admission_Type <= 1: 0 (290/8)
##     Admission_Type > 1:
##     :...Admission_Type > 2: 0 (321/37)
##         Admission_Type <= 2:
##         :...AdmissionLengthDays <= 0.4736111: 1 (548/132)
##             AdmissionLengthDays > 0.4736111: 0 (1791/616)
## 
## 
## Evaluation on training data (47090 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       5 4376( 9.3%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##   42298   132    (a): class 0
##    4244   416    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% AdmissionLengthDays
##    6.26% Admission_Type
## 
## 
## Time: 1.4 secs

plot(hosAdmissions_DTmodel, type="simple")

# Predict death (survival) based on C50 model using testing data
# Clean Dx, if not previously cleaned!
# Dx11 <- preproc_fun(hosAdmissions[-sub, "Dx"]); hosAdmissions[-sub, "Dx"] <- Dx11

# Extract ONLY the hospital-admission TESTING cases that have the common Dx with the training cases (otherwise, prediction on test cases may fail)
common <- intersect(hosAdmissions[sub, ]$Dx, hosAdmissions[-sub, ]$Dx)
hosAdmission_test <- hosAdmissions[which(hosAdmissions$Dx %in% common), ]
hosAdmission_test <- hosAdmission_test[-sub, c("AdmissionLengthDays", "Admission_Type", "Dx")]
# View(hosAdmission_test)

# Same preprocessing for the TESTING data outcome variable (death)
y_test <- hosAdmissions[which(hosAdmissions$Dx %in% common), ]
y_test <- y_test[-sub, "Death_1"]

hosAdmissions_DTpred <- predict(hosAdmissions_DTmodel, hosAdmission_test)
caret::confusionMatrix(hosAdmissions_DTpred, as.factor(y_test))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7833  729
##          1   26   79
##                                           
##                Accuracy : 0.9129          
##                  95% CI : (0.9068, 0.9187)
##     No Information Rate : 0.9068          
##     P-Value [Acc > NIR] : 0.02536         
##                                           
##                   Kappa : 0.1549          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.99669         
##             Specificity : 0.09777         
##          Pos Pred Value : 0.91486         
##          Neg Pred Value : 0.75238         
##              Prevalence : 0.90677         
##          Detection Rate : 0.90377         
##    Detection Prevalence : 0.98789         
##       Balanced Accuracy : 0.54723         
##                                           
##        'Positive' Class : 0               
##

Lastly, we can employ RandomForest classification of the clinical outcome variable, death.

require(randomForest)
y <- hosAdmissions[sub, "Death_1"]
data_train <- cbind(hosAdmissions[sub, c("AdmissionLengthDays","Married_1")], y); dim(data_train)  # training data

## [1] 47090     3

hosAdmissions_RFmodel_sm <- randomForest(as.factor(y) ~ AdmissionLengthDays + Married_1, data=data_train, importance=TRUE, ntree=3000, mtry=2)
# hosAdmissions_RFmodel <- randomForest(as.factor(y) ~ . , data=data_train, importance=TRUE, ntree=2000, mtry=2)
varImpPlot(hosAdmissions_RFmodel_sm, cex=0.8)

print(hosAdmissions_RFmodel_sm)

## 
## Call:
##  randomForest(formula = as.factor(y) ~ AdmissionLengthDays + Married_1,      data = data_train, importance = TRUE, ntree = 3000, mtry = 2) 
##                Type of random forest: classification
##                      Number of trees: 3000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 13.99%
## Confusion matrix:
##       0    1 class.error
## 0 39812 2618  0.06170163
## 1  3970  690  0.85193133

plot(hosAdmissions_RFmodel_sm,log="x",main="hosAdmissions_RF_model")

hosAdmissions_RFmodel_pred<-predict(hosAdmissions_RFmodel_sm, hosAdmissions[-sub, ], type = 'class')
caret::confusionMatrix(hosAdmissions_RFmodel_pred, as.factor(data_test$y_test))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9956 1022
##          1  625  170
##                                           
##                Accuracy : 0.8601          
##                  95% CI : (0.8537, 0.8663)
##     No Information Rate : 0.8988          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.098           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9409          
##             Specificity : 0.1426          
##          Pos Pred Value : 0.9069          
##          Neg Pred Value : 0.2138          
##              Prevalence : 0.8988          
##          Detection Rate : 0.8457          
##    Detection Prevalence : 0.9325          
##       Balanced Accuracy : 0.5418          
##                                           
##        'Positive' Class : 0               
##

# Use caret::train to fine-tune RF
# control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
# metric <- "Accuracy"
# set.seed(1234)
# tunegrid <- expand.grid(.mtry=c(1:3))
# rf_gridsearch <- train(as.factor(y) ~ AdmissionLengthDays + Race + Admission_Type, data=data_train, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
# print(rf_gridsearch)
# plot(rf_gridsearch)

# Alternative parameter tuning using randomForest::tuneRF()
set.seed(234)
best_mtry <- tuneRF(hosAdmissions[sub, c("AdmissionLengthDays","Admission_Type", "Race")], y, stepFactor=1.5, improve=1e-5, ntree=500)

## mtry = 1  OOB error = 0.07936883 
## Searching left ...
## Searching right ...

print(best_mtry)

##   mtry   OOBError
## 1    1 0.07936883

# Try NLP on the clinical diagnosis "Dx"
library(text2vec)
library(data.table)

# split data into train:test (80:20)
set.seed(1234)
sub <- sample(nrow(hosAdmissions), floor(nrow(hosAdmissions)*0.8))
y_train <- hosAdmissions[sub , ] # Training Dx outcome
y_test <- hosAdmissions[-sub , ] # Testing Dx outcome

# coerce Dx as string/character
y_train$Dx <- as.character(y_train$Dx)
y_test$Dx <- as.character(y_test$Dx)

# either a simple (tolower case) function
preproc_fun = tolower

# or a more elaborate "cleaning" function
preproc_fun = function(x)                    # text data
{ require("tm")
  x  =  gsub("<.*?>", " ", x)               # regex removing HTML tags
  x  =  iconv(x, "latin1", "ASCII", sub="") # remove non-ASCII characters
  x  =  gsub("[^[:alnum:]]", " ", x)        # remove non-alpha-numeric values
  x  =  tolower(x)                          # convert to lowercase characters
  # x  =  removeNumbers(x)                  # removing numbers
  x  =  stripWhitespace(x)                  # removing white space
  x  =  gsub("^\\s+|\\s+$", "", x)          # remove leading and trailing white space
  return(x)
}

# define the tokenization function
token_fun = word_tokenizer

# iterator for both training and testing sets
iter_train = itoken(y_train$Dx, 
             preprocessor = preproc_fun, 
             tokenizer = token_fun, 
             ids = sub, 
             progressbar = TRUE)

iter_test = itoken(y_test$Dx, 
             preprocessor = preproc_fun, 
             tokenizer = token_fun, 
             # ids = -sub, 
             ids = 1:11773,
             progressbar = TRUE)
reviewVocab = create_vocabulary(iter_train)
reviewVocab

## Number of docs: 47090 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##           term term_count doc_count
##         <char>      <int>     <int>
##    1:   020807          1         1
##    2:       10          1         1
##    3:      111          1         1
##    4:       15          1         1
##    5:       17          1         1
##   ---                              
## 4308:  disease       3185      3159
## 4309:   artery       4379      3102
## 4310: coronary       4498      3229
## 4311:      sda       5366      5365
## 4312:  newborn       6221      6221

# Next, compute the document term matrix (DTM)
reviewVectorizer = vocab_vectorizer(reviewVocab)
t0 = Sys.time()
dtm_train = create_dtm(iter_train, reviewVectorizer)
dtm_test = create_dtm(iter_test, reviewVectorizer)
dim(dtm_train); dim(dtm_test)

## [1] 47090  4312

## [1] 11773  4312

# confirm that the training data review DTM dimensions are consistent with training review IDs, i.e., #rows = number of documents, and #columns = number of unique terms (n-grams), dim(dtm_train)[[2]]
identical(dim(dtm_train)[1], length(sub))

## [1] TRUE

# TM/NLP analytics
library(glmnet)
nFolds = 10
t0 = Sys.time()
glmnet_classifier = cv.glmnet(x = dtm_train, 
                              y = y_train$Death_1, 
        family = "binomial", 
        # LASSO L1 penalty
        alpha = 1,
        # interested in the area under ROC curve or MSE
        type.measure = "auc",
        # n-fold internal (training data) stats cross-validation
        nfolds = nFolds,
        # threshold: high value is less accurate / faster training
        thresh = 1e-2,
        # again lower number of iterations for faster training
        maxit = 1e3
      )

lambda.best <- glmnet_classifier$lambda.min
lambda.best

## [1] 0.001270929

t1 = Sys.time()
print(difftime(t1, t0, units = 'sec'))

## Time difference of 13.50766 secs

# some prediction plots
plot(glmnet_classifier)
# plot(glmnet_classifier, xvar="lambda", label="TRUE")
mtext("CV LASSO: Number of Nonzero (Active) Coefficients", side=3, line=2.5)

# report the mean internal cross-validated error
print(paste("max AUC =", round(max(glmnet_classifier$cvm), 4)))

## [1] "max AUC = 0.7509"

# report TESTING data prediction accuracy
xTest = dtm_test
yTest = y_test$Death_1
predLASSO <- predict(glmnet_classifier, 
       s = glmnet_classifier$lambda.1se, newx = xTest)
testMSE_LASSO <- mean((predLASSO - yTest)^2); testMSE_LASSO

## [1] 7.554096

# Binarize the LASSO probability prediction 
binPredLASSO <- ifelse(predLASSO < -2, 0, 1)
table(binPredLASSO, yTest)

##             yTest
## binPredLASSO    0    1
##            0 8267  521
##            1 2371  614

# and testing data AUC
glmnet:::auc(yTest, predLASSO)

## [1] 0.7453573

# predict
predTest <- predict(glmnet_classifier, 
                    s = glmnet_classifier$lambda.1se, 
                    newx = dtm_test, type="response") 
predVSrealDeath <- cbind(predTest, y_test$Death_1)
cor(predTest, y_test$Death_1)

##         [,1]
## s1 0.2784294

table (ifelse(predTest < 0.2, 0, 1), y_test$Death_1)

##    
##        0    1
##   0 9993  873
##   1  645  262

# install.packages("RTextTools")
# library("RTextTools")
# 
# # DxCorpus <- create_corpus(iter_train, reviewVectorizer)
# # DxDTM <- get_dtm(DxCorpus)
# DxDTM_train <- create_dtm(iter_train, reviewVectorizer)
# dim(DxDTM_train)
# image(DxDTM_train[1:dim(DxDTM_train)[2], ], xlab="Terms", ylab="Training records/documents")
# 
# DxDTM_test <- create_dtm(iter_test, reviewVectorizer)
# dim(DxDTM_test)
# image(DxDTM_test[1:dim(DxDTM_test)[2], ], xlab="Terms", ylab="Testing records/documents")

1.7 Summary

Data-driven Bayesian inference treats all unknowns (parameters and data prior to being observed) as random variables. The inference is probabilistic in nature, not deterministic (the way the frequentist inference is), which quantifies the measures of intrinsic process uncertainty.

Bayesian modeling propagates uncertainty from the initial prior to derived posterior distributions using the Bayesian inversion of conditioning theorem. When the posterior is not in simple standard form, e.g., complex or non-conjugate priors, MCMC sample-based simulation provides a computationally-expensive, but scientifically valid, protocol for modeling and inference. MCMC provides a general recipe for obtaining representative samples from any posterior density.

1.8 References

SMHS EBook Bayesian Inference Chapter
Christou N, Dinov ID. (2011) Confidence Interval Based Parameter Estimation-A New SOCR Applet and Activity. PLoS ONE 6(5): e19178. doi:10.1371/journal.pone.0019178.
SOCR General Confidence Interval Activity.

DSPA2: Data Science and Predictive Analytics (UMich HS650)

Appendix: Applied Bayesian Modeling, Simulation and Inference

SOCR/MIDAS (Ivo Dinov)

June 2024