SOCR ≫ DSPA ≫ DSPA2 Topics ≫

Expanding on DSPA Chapter 6 (Neural Networks and Supervised Machine Learning), this DSPA2 appendix covers a type of stochastic neural networks, called a Restricted Boltzmann Machine (RBM). RBMs are used for tasks such as dimensionality reduction, classification, regression, collaborative filtering, feature learning, and generative modeling. Its foundations span mathematics, statistics, and physics, with each field contributing essential principles to the structure and function of RBMs.

1 Stochastic neural networks

Stochastic neural networks are a special kind of neural network where certain components (such as the neurons (perceptrons), connections, or activation functions) are governed by probabilistic or stochastic processes. Unlike traditional deterministic neural networks, where the output is fully determined by the input and the learned weights, stochastic neural networks introduce randomness into the learning process, making them better suited for handling uncertainty, modeling distributions, and performing tasks that require probabilistic reasoning, such as generative modeling and reinforcement learning. The key features of stochastic neural networks include

  • Randomness in Neuron Activation: In stochastic neural networks, neurons can activate probabilistically based on the input they receive. For example, a neuron’s output might be a probabilistic function (such as a sigmoid or softmax) of the input, and its state can be randomly determined (e.g., binary state \(0\) or \(1\) based on some probability distribution).
  • Probabilistic Weight Updates: Stochastic neural networks may use probabilistic rules for updating weights during training, often utilizing techniques like Gibbs sampling, Markov Chain Monte Carlo (MCMC), or variational inference.
  • Modeling Uncertainty: These networks are particularly useful for tasks involving uncertainty, such as generative modeling, where the goal is to sample from a learned distribution, and reinforcement learning, where the environment is often stochastic.
  • Learning and Sampling: Stochastic neural networks often rely on sampling-based learning algorithms, such as contrastive divergence (for Restricted Boltzmann Machines) or variational inference (for Variational Autoencoders), which approximate the gradient of the likelihood function by drawing samples from a probability distribution.

Here are some examples of Stochastic Neural Networks:

  • Boltzmann Machines (BM): A type of stochastic neural network where the neurons have binary states and stochastic activation functions. BMs learn the probability distribution over the data by minimizing an energy function.
  • Restricted Boltzmann Machines (RBM): A specific type of Boltzmann Machine with restricted connectivity (no intra-layer connections). RBMs are widely used for feature extraction and dimensionality reduction.
  • Deep Belief Networks (DBN): A stack of multiple RBMs trained in a greedy layer-wise fashion.
  • Variational Autoencoders (VAE): A type of stochastic neural network where the latent variables are modeled by a probabilistic distribution, and both the encoder and decoder are trained to maximize a variational bound on the likelihood of the data.

In this appendix, we’ll focus on the Restricted Boltzmann Machines (RBMs), as a specific class of stochastic neural networks that have the following characteristics

  • Bipartite Graph Structure: The RBM consists of two layers of neurons—visible units (representing the input data) and hidden units (representing latent features) with stochastic, probabilistic activations. The visible and hidden units are fully connected, but there are no intra-layer connections (i.e., no connections between units within the visible layer or within the hidden layer), giving the “restricted” property to the network.
  • Stochastic Activations: In a RBM, the states of the visible and hidden units are determined probabilistically (1) the hidden units are activated based on a probability that depends on the visible units and the learned weights; and (2) the visible units can also be probabilistically reconstructed from the hidden units.

The activation of the hidden units is stochastic, meaning that, given a visible vector \(v\), the state of each hidden unit \(h_j\) is sampled from a probability distribution \[P(h_j = 1 | v) = \sigma \left( \sum_i W_{ij} v_i + c_j \right) , \]

where \(\sigma(x)\) is the sigmoid function, \(W_{ij}\) is the weight connecting visible unit \(v_i\) and hidden unit \(h_j\), and \(c_j\) is the bias for hidden unit \(h_j\).

  • Energy-Based Model: RBMs are energy-based models where the joint probability distribution of the visible and hidden units is modeled using an energy function. The energy of a configuration (state of the visible and hidden units) is \[E(v, h) = - \sum_{i,j} v_i W_{ij} h_j - \sum_i b_i v_i - \sum_j c_j h_j . \]

The lower the energy of a configuration, the more likely it is under the model. The goal of training an RBM is to find weights and biases that assign low energy to configurations corresponding to the data.

  • Learning via Contrastive Divergence: RBMs use a stochastic learning algorithm called Contrastive Divergence (CD) to approximate the gradient of the log-likelihood function and update the model parameters. CD works by (1) sampling from the distribution of the hidden units given the visible units (positive phase); (2) reconstructing the visible units from the hidden units (negative phase); and (3) updating the weights based on the difference between the positive and negative phase statistics.

  • Generative and Feature Learning: Since RBMs model the probability distribution of the input data, they are generative models, meaning they can generate new samples by sampling from the learned distribution. They are also powerful for feature learning, as the hidden units capture latent features of the data that can be used for tasks like classification, clustering, or dimensionality reduction.

The hidden and visible units in a RBM have stochastic binary states. For instance, given the visible units, the hidden units are not deterministically set to 0 or 1 but rather are probabilistically sampled based on their activation probability. This probabilistic behavior captures uncertainty and allows the model to represent distributions, rather than fixed mappings, between the input and output.

Since RBMs are stochastic, they rely on Gibbs sampling for inference and learning. Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method where the states of the visible and hidden units are alternately sampled until the system reaches a stationary distribution. The stochastic nature of the model allows it to explore various configurations during learning.

RBMs are useful for (1) dimensionality reduction, since RBMs learn a compact representation of the data in the hidden layer; (2) feature extraction, as hidden units in a RBM learn abstract features of the input data; and (3) generative modeling, as RBMs can simulate/generate new samples from the learned distribution by sampling from the hidden layer and reconstructing the visible layer.

In a general Boltzmann Machine, all units (both visible and hidden) can be connected to one another, resulting in a much more complex network with higher computational costs. In contrast, RBMs are restricted, meaning no connections exist within the visible or hidden layers. This makes training RBMs more tractable, especially with methods like Contrastive Divergence.

VAEs differ from RBMs in that they use continuous latent variables (instead of binary) and employ variational inference to approximate the posterior distribution. VAEs are more common for tasks like generating images or text, while RBMs are simpler and can be applied to binary or discrete data.

2 Restricted Boltzmann Machines (RBMs)

RBMs represent a type of stochastic neural network used for tasks such as dimensionality reduction, classification, regression, collaborative filtering, feature learning, and generative modeling. Its foundations span mathematics, statistics, and physics. The core of an RBM is the idea of representing and learning probability distributions through the interactions of neurons. An RBM consists of two layers of neurons

  • Visible layer: Represents observed data (input).
  • Hidden layer: Represents latent variables or features to learn.

These layers are fully connected, but connections within each layer are restricted, meaning neurons in the visible layer do not interact with each other, and neurons in the hidden layer do not interact with each other. The key mathematical formulation of an RBM is an energy function, which assigns an energy to each configuration of the visible and hidden neurons. The lower the energy, the more likely the configuration. The joint probability distribution of the visible and hidden units is defined using a Boltzmann distribution:

\[P(v, h) = \frac{1}{Z} \exp(-E(v, h)) ,\]

where \(v\) and \(h\) are vectors representing the states of the visible and hidden units, respectively, \(E(v, h)\) is the energy function that measures the interaction between the visible and hidden units, and \(Z\) is the partition function, ensuring that the probabilities sum to 1, i.e., \(Z = \sum_{v, h} \exp(-E(v, h)) .\)

2.1 RBM Energy Function

The energy function of the RBM is typically defined by \[E(v, h) = -\sum_{i} b_i v_i - \sum_{j} c_j h_j - \sum_{i,j} v_i W_{ij} h_j, \]

where \(b_i\) and \(c_j\) are biases for the visible and hidden units, \(W_{ij}\) are the weights connecting visible unit \(i\) and hidden unit \(j\), and \(v_i\) and \(h_j\) are the states of visible unit \(i\) and hidden unit \(j\), respectively.

The objective is to minimize the energy function, which is equivalent to maximizing the likelihood of the data. This leads to the optimal configuration of weights and biases that define the learned model. The RBM’s marginal probability over the visible units \(v\) is obtained by summing over all possible hidden states \(P(v) = \frac{1}{Z} \sum_h \exp(-E(v, h)) .\) This represents a powerful model for representing complex distributions because the hidden units act as latent variables that can capture higher-order correlations in the visible data.

RBMs are closely tied to probabilistic graphical models and use principles from statistical inference and statistical/machine learning theory. An RBM can be represented by an undirected graphical model (Markov Random Field) with the visible and hidden units forming a bipartite graph. The lack of connections within each layer simplifies inference and learning, which can be challenging in fully connected graphical models. To train RBMs (estimate the model parameters), we need to sample from the joint distribution \(P(v, h)\). This is challenging due to the partition function \(Z\), which is computationally intractable for large models. Instead, Gibbs sampling is used, where we alternate between sampling the hidden units given the visible units and sampling the visible units given the hidden units - (1) sample \(h_j\) from \(P(h_j | v)\), and then (2) sample \(v_i\) from \(P(v_i | h)\).

This iterative procedure is based on Markov Chain Monte Carlo (MCMC) methods, which allow us to approximate the distribution without directly calculating \(Z\). Since exact inference is computationally expensive, RBMs often use Contrastive Divergence (CD) to approximate the gradient of the likelihood function. The CD algorithm updates the weights using the following steps:

  1. Sample the hidden units \(h_j\) based on the data \(v_i\).
  2. Reconstruct the visible units \(v_i'\) from the hidden units.
  3. Update the weights and biases using the difference between the observed data and the reconstruction \[\Delta W_{ij} = \eta \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{reconstruction}} \right) ,\] where \(\eta\) is the learning rate, and the angle brackets denote expectations over the distribution of \(v_i\) and \(h_j\).

The original RBM formulation is deeply connected to concepts from statistical mechanics, particularly the Boltzmann distribution and energy-based models in physics.

The Boltzmann distribution describes the probability of a system being in a certain state based on its energy. In a RBM, this is analogous to the joint distribution of visible and hidden units \(P(v, h) = \frac{1}{Z} \exp(-E(v, h)) .\) This distribution comes from thermodynamics, where systems tend to configurations with lower energy. In the context of an RBM, the system (data) seeks the configuration (parameter settings) that minimizes energy, corresponding to the most likely states for the model.

The partition function \(Z\) in the RBM is borrowed directly from statistical mechanics, where it serves to normalize the probability distribution. It is defined as the sum over all possible configurations (microstates) of the system. In practice, computing \(Z\) is computationally expensive, as it requires summing over an exponentially large number of states. Techniques like Gibbs sampling and Contrastive Divergence circumvent the need for direct computation.

The RBM learning process (RBM optimization) is analogous to the thermodynamic principle of minimizing free energy in a physical system. The goal is to adjust the model parameters (weights and biases) to minimize the energy function, thereby maximizing the likelihood of the observed data. The free energy of a configuration \(v\) in a RBM is \(F(v) = -\log \sum_h \exp(-E(v, h)).\) Minimizing this free energy is equivalent to maximizing the likelihood of the observed data. This parallels the behavior of physical systems that seek equilibrium states with minimum free energy.

2.2 Restricted vs. General (Unrestricted) Boltzmann Machines

The restricted adjective in the RBM name refers to the restriction on the network’s connectivity – no direct interactions within the observed (data) layer or within the subsequent hidden layer. This restriction of connections only between-layers, i.e., between the visible (data) units and the hidden units, constrains the network architecture making it distinct from its more general counterpart, the Boltzmann Machine, where units can have connections to other units in the same layer.

The RBM restriction that only allows interactions between visible and hidden units (and no connections between units within the same layer) has both mathematical and statistical motivations:

  • Mathematical Motivation: Simplifying Inference and Learning: Without the restriction, a full Boltzmann Machine becomes computationally intractable for inference and learning, particularly for large datasets. This is because the partition function \(Z\), which normalizes the probability distribution, is exponentially difficult to compute when connections within a layer exist.
  • General Boltzmann Machines: In a fully connected Boltzmann Machine, each unit (visible or hidden) can interact with every other unit. As a result, computing the exact likelihood of the data (which requires marginalizing over all possible hidden configurations) becomes exponentially complex because interactions within layers create loops in the graphical model. These loops make it challenging to compute the conditional distributions and marginal likelihood.
  • Restricted Boltzmann Machines: By restricting the network so that there are no connections within the visible layer or within the hidden layer, the model becomes a bipartite graph, which is free of cycles. This restriction allows for an efficient block Gibbs sampling procedure, where the hidden units can be sampled independently given the visible units, and vice versa. This simplifies both inference and learning considerably.

Mathematically, this restriction allows the conditional distributions \(P(v | h)\) and \(P(h | v)\) to factorize \(P(v | h) = \prod_i P(v_i | h)\) and \(P(h | v) = \prod_j P(h_j | v)\). This factorization means that each visible unit can be updated independently given the hidden units, and vice versa. This is crucial for efficient Gibbs sampling and for making the model computationally feasible to train.

A statistical motivation relies on simplified interdependencies. The restriction that only allows connections between visible and hidden units also reflects certain statistical assumptions about the data. In a RBM, the visible units (features) are conditionally independent given the hidden units (latent variables). This reflects the assumption that once the hidden layer explains the data, the remaining variability in the visible units is independent of one another. This assumption is often reasonable in high-dimensional data: many complex dependencies in the visible layer can be captured by introducing hidden variables that encode patterns or latent features. The restriction forces the hidden layer to capture the most important patterns in the data because the visible units can only interact with each other indirectly through the hidden units.

Example: Consider an image dataset where the visible units represent pixel intensities. Without the hidden units, modeling pixel correlations directly would be very complex. However, with hidden units that represent abstract features (like edges, textures, or object parts), the visible units (pixels) become conditionally independent given these hidden features. Thus, the hidden layer effectively captures correlations between visible units, but through a simplified, abstract representation.

  • No Correlations Between Hidden Units: The restriction also assumes that the hidden units do not directly interact with each other. This means that each hidden unit represents an independent latent feature. In practice, this independence assumption is relaxed somewhat during learning because the hidden units share statistical dependencies through their connections to the visible units. However, the lack of direct connections between hidden units simplifies the learning process.

The restricted structure of the RBM allows for the efficient use of Contrastive Divergence (CD) to approximate the gradient of the likelihood function. CD works by sampling from the conditionally independent hidden and visible layers, avoiding the need for fully sampling over all configurations of the network. The restricted connectivity ensures that:

  • The visible units can be updated independently given the hidden units.
  • The hidden units can be updated independently given the visible units.

This leads to a closed-form update rule for the weights that connect visible and hidden units, enabling efficient learning with gradient-based methods. The restriction that only interactions between hidden and visible units are allowed is necessary for the following reasons:

  • Conditional Independence for Efficient Sampling: The restriction enforces conditional independence within each layer, simplifying the Gibbs sampling procedure and making it computationally feasible. Without the restriction, interactions between units in the same layer would make the model’s likelihood intractable to compute because we would no longer have conditional independence between the variables.
  • Modeling Complex Correlations with Simpler Structure: The hidden units act as latent variables that capture complex dependencies between the visible units (the data). Even though the visible units are not directly connected, the hidden units model higher-order interactions between them. This simplifies the modeling process while still allowing for the representation of complex patterns in the data.

If visible units (features) were directly connected, we would face the following issues (1) Complexity in Conditional Probabilities, with direct connections between visible units, the conditional probability \(P(v | h)\) would not factorize into independent terms, making inference and learning computationally expensive; and (2) Difficulties in Training, the learning process would require dealing with the complex dependencies between visible units. This introduces loops in the model’s graph, making exact inference and gradient-based learning much harder due to the need to sample from a much more complex distribution. If hidden units were allowed to interact directly, the hidden layer would not be able to act as a simple set of independent feature detectors. Instead, allowing interactions between hidden units would introduce dependencies between the features they represent. This would complicate the learning process and reduce the interpretability of the hidden units as independent latent factors or features. Also, without the restriction, sampling and inference become much more complex because we lose the conditional independence between the hidden units. As a result, the process of sampling from the hidden layer given the visible layer becomes intractable.

2.3 Examples of Restricted Boltzmann Machines

Let’s work through two RBM examples. One with a Restricted Boltzmann Machine (RBM) and one with a general (unrestricted) Boltzmann Machine (BM)—to demonstrate the calculations involved and highlight the differences in complexity and approach.

2.3.1 Example 1: Restricted Boltzmann Machine (RBM)

Consider a simple RBM with

  • 3 visible units: \(v_1, v_2, v_3\)
  • 2 hidden units: \(h_1, h_2\)
  • The weights connecting the visible and hidden units are given by a matrix \(W\), and each unit has a bias (denoted by \(b\) for visible units and \(c\) for hidden units).

Assume the following weights and biases \[W = \begin{bmatrix} 0.5 & -0.3 \\ 0.7 & 0.1 \\ -0.2 & 0.4 \\ \end{bmatrix} \quad b = \begin{bmatrix} 0.2 \\ 0.4 \\ 0.1 \end{bmatrix} \quad c = \begin{bmatrix} 0.6 \\ -0.2 \end{bmatrix} ,\] where, \(W_{ij}\) represents the weight between visible unit \(v_i\) and hidden unit \(h_j\), and \(b_i\) and \(c_j\) represent biases for the visible and hidden units, respectively.

The energy function for the RBM is \[E(v, h) = - \sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i W_{ij} h_j ,\] where \(v_i\) and \(h_j\) represent the states of the visible and hidden units, respectively.

Let’s compute the energy for a particular state of the visible and hidden units. Assume \(v = [1, 0, 1]\) (binary values for visible units) and \(h = [1, 0]\) (binary values for hidden units).

  • First term: \(\sum_i b_i v_i = 0.2 \cdot 1 + 0.4 \cdot 0 + 0.1 \cdot 1 = 0.2 + 0.1 = 0.3 .\)

  • Second term: \(\sum_j c_j h_j = 0.6 \cdot 1 + (-0.2) \cdot 0 = 0.6 .\)

  • Third term: \(\sum_{i,j} v_i W_{ij} h_j = (1 \cdot 0.5 \cdot 1) + (0 \cdot 0.7 \cdot 1) + (1 \cdot -0.2 \cdot 1) = 0.5 + 0 + (-0.2) = 0.3 .\)

Then, the total energy for this state \(v = [1, 0, 1]\) and \(h = [1, 0]\) is \(E(v, h) = -(0.3 + 0.6 + 0.3) = -1.2.\)

To train the RBM, we need to sample from the conditional distributions \(P(h_j | v)\) and \(P(v_i | h)\). The conditional probability of hidden unit \(h_j\) given the visible units \(v\) is \(P(h_j = 1 | v) = \sigma \left( c_j + \sum_i W_{ij} v_i \right) ,\) where \(\sigma(x) = \frac{1}{1 + e^{-x}}\) is the sigmoid function.

For hidden unit \(h_1\) \[P(h_1 = 1 | v) = \sigma \left( 0.6 + (1 \cdot 0.5) + (0 \cdot 0.7) + (1 \cdot -0.2) \right) = \sigma(0.9) = \frac{1}{1 + e^{-0.9}} \approx 0.71 .\]

Therefore, \(h_1 = 1\) with probability \(0.71\). Similarly, for visible units given hidden units \(P(v_i = 1 | h) = \sigma \left( b_i + \sum_j W_{ij} h_j \right) .\)

In a RBM, the lack of intra-layer connections allows for efficient Gibbs sampling because the units within each layer are conditionally independent given the other layer. Also, we can sample all hidden units in parallel and all visible units in parallel because of the restricted bipartite structure.

2.3.2 Example 2: General (Unrestricted) Boltzmann Machine (BM)

Next, consider a general Boltzmann Machine with 3 visible units \(v_1, v_2, v_3\) and 2 hidden units \(h_1, h_2\), but this time we allow intra-layer connections (i.e., visible units can interact with each other, and hidden units can interact with each other).

Assume the following weights and biases, but now include intra-layer connections \[W = \begin{bmatrix} 0.5 & -0.3 & 0.2 \\ 0.7 & 0.1 & 0.0 \\ -0.2 & 0.4 & -0.1 \\ \end{bmatrix} \quad b = \begin{bmatrix} 0.2 \\ 0.4 \\ 0.1 \end{bmatrix} \quad c = \begin{bmatrix} 0.6 \\ -0.2 \end{bmatrix} .\]

For simplicity, assume the intra-layer connections are defined by \[U_{\text{visible}} = \begin{bmatrix} 0 & 0.1 & -0.05 \\ 0.1 & 0 & 0.07 \\ -0.05 & 0.07 & 0 \end{bmatrix} \quad U_{\text{hidden}} = \begin{bmatrix} 0 & 0.2 \\ 0.2 & 0 \end{bmatrix} .\]

In this case, the energy function in a general Boltzmann Machine is more complex because it includes intra-layer terms \[E(v, h) = - \sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i W_{ij} h_j - \frac{1}{2} \sum_{i, i'} U_{ii'} v_i v_{i'} - \frac{1}{2} \sum_{j, j'} U_{jj'} h_j h_{j'} .\]

As before, we can compute the energy for the same state \(v = [1, 0, 1]\) and \(h = [1, 0]\).

  • First term: \(\sum_i b_i v_i = 0.2 \cdot 1 + 0.4 \cdot 0 + 0.1 \cdot 1 = 0.3.\)
  • Second term: \(\sum_j c_j h_j = 0.6 \cdot 1 + (-0.2) \cdot 0 = 0.6 .\)
  • Third term: \(\sum_{i,j} v_i W_{ij} h_j = (1 \cdot 0.5 \cdot 1) + (0 \cdot 0.7 \cdot 1) + (1 \cdot -0.2 \cdot 1) = 0.5 + 0 + (-0.2) = 0.3.\)
  • Intra-layer visible term: \(\frac{1}{2} \sum_{i, i'} U_{ii'} v_i v_{i'} = \frac{1}{2} \left( 1 \cdot 0.1 \cdot 0 + 1 \cdot (-0.05) \cdot 1 + 0 \cdot 0.07 \cdot 1 \right) = -0.025 .\)
  • Intra-layer hidden term: \(\frac{1}{2} \sum_{j, j'} U_{jj'} h_j h_{j'} = \frac{1}{2} \left( 1 \cdot 0.2 \cdot 0 \right) = 0 .\)

Thus, the total energy is \(E(v, h) = -(0.3 + 0.6 + 0.3 + (-0.025) + 0) = -1.175.\)

In this unrestricted BM, Gibbs sampling is more complex because the visible units and hidden units are no longer conditionally independent given the other layer. This introduces dependencies within the layers, so the conditional distributions \(P(v | h)\) and \(P(h | v)\) cannot be factorized easily. To sample \(v_i\), we must now consider not only its connections to the hidden units but also the influence of its neighboring visible units. Similarly, the hidden units depend on each other due to the intra-layer connections. This increases the complexity of both inference and learning.

Key differences between RBMs and General BMs

  • Conditional Independence: In RBMs, the visible units are conditionally independent given the hidden units, and vice versa. This allows efficient block sampling of all units in one layer, given the other. In general BMs, there is no conditional independence within layers, leading to more complex sampling and inference. Intra-layer dependencies introduce loops in the graphical model.
  • Complexity: The restricted structure simplifies both inference and learning. Contrastive Divergence (CD) can be applied to approximate the maximum likelihood gradients. Whereas the general BM requires more complex algorithms, such as full MCMC or other approximations, due to the added interactions between units within the same layer.
  • Inference: Gibbs sampling in a RBM is more tractable due to the restriction. The updates for each unit are simpler. Whereas, Gibbs sampling in a general BM is slower and more difficult due to the need to consider interactions between all units in the same layer.
  • Learning: Learning is easier in RBMs with CD, where we sample the hidden and visible units alternately, exploiting the bipartite graph structure. Learning is much more computationally expensive in BMs due to the full interactions between all units, making CD less effective.

2.4 R Demonstration - Function estimation \(( f(x) = x - \sin(x) )\)

Let’s develop an R function demonstrating the architecture of a Restricted Boltzmann Machine (RBM), its optimization and training process, and finally apply it to estimate the behavior of the function \(f(x) = x - \sin(x)\) over the range \([-10, 10]\). The RBM can be designed with visible units corresponding to the input space and hidden units for latent features. We will train it to approximate the target function by reconstructing the input-output relationship through the hidden units. Specifically, the goal is to capture the behavior of \(f(x)\) by minimizing the reconstruction error over samples from the function.

Here are the basic steps in a RBM implementation:

  • RBM Architecture: Define the visible and hidden units and initialize weights and biases randomly.
  • Optimization and Training: Use Contrastive Divergence (CD) to train the RBM and apply the Gibbs sampling procedure to update the weights and biases.
  • Application to Function Estimation: After completing RBM training, use the RBM to approximate the function \(f(x) = x - \sin(x)\), and evaluate the RBM’s ability to reconstruct the target function.

2.5 Manual RBM Implementation

Below is an example R function train_RBM() depicting the formulation, training, and application of an RBM for this task. The protocol involves the following steps:

  • Matrix Multiplication Conformability: the function apply_RBM() ensures that x_values and weights are compatible for matrix multiplication. We use the shape of x_values (i.e., the number of rows of x_values) and properly adjust the biases using matrix(…, byrow = TRUE) to make sure dimensions conform. The weights matrix is applied properly in both the forward pass (hidden layer) and the reconstruction step (visible layer).

  • Sampling Hidden and Visible Units: We ensured that prob_hidden_given_visible and prob_visible_given_hidden are properly sampled and used in Gibbs sampling, with dimensionality adjustments made at each step.

  • To train the RBM, we first initialize the process by setting up the weights and biases, and applying Contrastive Divergence over multiple epochs. The visible units (representing xx and f(x)f(x)) are sampled, and the hidden units are updated stochastically. The parameters are updated iteratively. Following training, the RBM is applied to the xx values to reconstruct the function approximation. Finally, the original function \(f(x)=x−\sin(x)\) is compared to the RBM’s approximation, and both are plotted.

library(plotly)

# Define the sigmoid activation function
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

# Define the RBM architecture, optimization, and training process
train_RBM <- function(data, n_hidden = 10, n_epochs = 4000, learning_rate = 0.2, batch_size = 20) {
  
  # Initialize RBM parameters
  n_visible <- ncol(data)  # Number of visible units (input features)
  
  # Weight matrix (randomly initialized) and bias terms
  weights <- matrix(rnorm(n_visible * n_hidden, 0, 0.1), n_visible, n_hidden)
  visible_bias <- rep(0, n_visible)
  hidden_bias <- rep(0, n_hidden)
  
  # Function to perform Gibbs sampling (updating hidden given visible and vice versa)
  sample_hidden <- function(visible) {
    prob_hidden <- sigmoid(visible %*% weights + matrix(hidden_bias, nrow = nrow(visible), ncol = n_hidden, byrow = TRUE))
    prob_hidden
  }
  
  sample_visible <- function(hidden) {
    prob_visible <- sigmoid(hidden %*% t(weights) + matrix(visible_bias, nrow = nrow(hidden), ncol = n_visible, byrow = TRUE))
    prob_visible
  }
  
  # Contrastive Divergence (CD-k where k=1)
  for (epoch in 1:n_epochs) {
    
    # Shuffle the data
    data <- data[sample(1:nrow(data)), ]
    
    for (i in seq(1, nrow(data), by = batch_size)) {
      batch <- data[i:min(i + batch_size - 1, nrow(data)), ]
      
      # Positive phase
      prob_hidden_given_visible <- sample_hidden(batch)
      hidden_states <- (prob_hidden_given_visible > matrix(runif(nrow(batch) * n_hidden), nrow = nrow(batch), ncol = n_hidden)) * 1
      
      # Negative phase (reconstruction)
      prob_visible_given_hidden <- sample_visible(hidden_states)
      visible_reconstructed <- (prob_visible_given_hidden > matrix(runif(nrow(hidden_states) * n_visible), nrow = nrow(hidden_states), ncol = n_visible)) * 1
      
      prob_hidden_given_visible_reconstructed <- sample_hidden(visible_reconstructed)
      
      # Update weights and biases using contrastive divergence
      weights <- weights + learning_rate * ((t(batch) %*% prob_hidden_given_visible) - 
                                            (t(visible_reconstructed) %*% prob_hidden_given_visible_reconstructed)) / nrow(batch)
      visible_bias <- visible_bias + learning_rate * colSums(batch - visible_reconstructed) / nrow(batch)
      hidden_bias <- hidden_bias + learning_rate * colSums(prob_hidden_given_visible - prob_hidden_given_visible_reconstructed) / nrow(batch)
    }
    
    # Optionally print progress
    if (epoch %% 1000 == 0) {
      cat("Epoch:", epoch, "complete\n")
    }
  }
  
  # Return the trained RBM parameters
  return(list(weights = weights, visible_bias = visible_bias, hidden_bias = hidden_bias))
}

# Define the function to apply the trained RBM to estimate the function f(x) = x - sin(x)
apply_RBM <- function(rbm, x_values) {
  # Input for the RBM is x_values
  weights <- rbm$weights
  visible_bias <- rbm$visible_bias
  hidden_bias <- rbm$hidden_bias
  
  # Get hidden representations (features)
  prob_hidden_given_visible <- sigmoid(as.matrix(x_values) %*% weights + matrix(hidden_bias, nrow = nrow(x_values), ncol = ncol(weights), byrow = TRUE))
  hidden_states <- (prob_hidden_given_visible > matrix(runif(nrow(x_values) * ncol(weights)), nrow = nrow(x_values), ncol = ncol(weights))) * 1
  
  # Reconstruct the visible units (the function approximation)
  reconstructed_visible <- sigmoid(hidden_states %*% t(weights) + matrix(visible_bias, nrow = nrow(hidden_states), ncol = length(visible_bias), byrow = TRUE))
  
  return(reconstructed_visible)
}

# Generate training data for the function f(x) = x - sin(x)
generate_data <- function(x_range) {
  x_values <- seq(x_range[1], x_range[2], length.out = 100)
  y_values <- x_values - sin(x_values)
  
  # Normalize the data to fit into [0, 1] for the RBM
  x_min <- min(x_values)
  x_max <- max(x_values)
  y_min <- min(y_values)
  y_max <- max(y_values)
  
  x_normalized <- (x_values - x_min) / (x_max - x_min)
  y_normalized <- (y_values - y_min) / (y_max - y_min)
  
  data <- cbind(x_normalized, y_normalized)
  
  return(list(data = data, x_values = x_values, y_values = y_values, 
              x_min = x_min, x_max = x_max, y_min = y_min, y_max = y_max))
}

# Train the RBM and apply it to approximate f(x) = x - sin(x)
run_RBM_demo <- function() {
  # Generate data for the function f(x) = x - sin(x)
  training_data <- generate_data(c(-10, 10))
  
  # Train the RBM on the data
  rbm <- train_RBM(training_data$data, n_hidden = 20, n_epochs = 4000, learning_rate = 0.2, batch_size = 20)
  
  # Apply the trained RBM to approximate the function
  reconstructed <- apply_RBM(rbm, training_data$data) # [, 1, drop = FALSE])  # Only use the x-values for prediction
  
  # Rescale the reconstructed output to the original scale
  y_pred <- reconstructed * (training_data$y_max - training_data$y_min) + training_data$y_min
  
  # Plot the results
  plot(training_data$x_values, training_data$y_values, type = "l", col = "blue", lwd = 2, ylab = "f(x)", xlab = "x",
       main = "Function Approximation using RBM")
  lines(training_data$x_values, 20*y_pred[,2], col = "red", lwd = 2)
  legend("topright", legend = c("True f(x) = x - sin(x)", "RBM Approximation"), col = c("blue", "red"), lwd = 2)
}

# Run the RBM demo
run_RBM_demo()
## Epoch: 1000 complete
## Epoch: 2000 complete
## Epoch: 3000 complete
## Epoch: 4000 complete

2.6 Automated RBM Implementation (using package deepnet)

Compare this simpe manual RBM implementation against the function deepnet::rbm.train(), which as expected produces a much more accurate prediction of the nonlinear function estimation for \(f(x)=x-\sin(x)\). In this case, training the RBM uses a custom function train_RBM_deepnet(), which utilizes deepnet::rbm.train() to train a RBM with the specified number of hidden units, epochs, learning rate, and batch size. Also, the predict_RBM() function uses the trained RBM to generate predictions by first calculating the hidden layer activations with deepnet::rbm.up() and then reconstructs the visible layer (output) with deepnet::rbm.down().

# Load the required package
# install.packages("deepnet")
library(deepnet)

# Generate the dataset for f(x) = x - sin(x)
generate_data <- function(x_range) {
  x_values <- seq(x_range[1], x_range[2], length.out = 100)
  y_values <- x_values - sin(x_values)
  
  # Normalize the data between 0 and 1 for the RBM
  x_min <- min(x_values)
  x_max <- max(x_values)
  y_min <- min(y_values)
  y_max <- max(y_values)
  
  x_normalized <- (x_values - x_min) / (x_max - x_min)
  y_normalized <- (y_values - y_min) / (y_max - y_min)
  
  data <- cbind(x_normalized, y_normalized)
  
  return(list(data = data, x_values = x_values, y_values = y_values, 
              x_min = x_min, x_max = x_max, y_min = y_min, y_max = y_max))
}

# Train a RBM to approximate f(x) = x - sin(x)
train_RBM_deepnet <- function(data, hidden_units = 10, num_epochs = 1000, learning_rate = 0.1, batch_size = 10) {
  
  # Train the RBM using deepnet's rbm.train function
  rbm_model <- rbm.train(x = data, hidden = hidden_units, 
                         numepochs = num_epochs, learningrate = learning_rate, batchsize = batch_size)
  
  return(rbm_model)
}

# Use the trained RBM to generate predictions
predict_RBM <- function(rbm_model, x_values_normalized) {
  
  # Use the rbm.up function from deepnet to get hidden representation, then rbm.down to get reconstruction
  hidden_representation <- rbm.up(rbm_model, x_values_normalized)
  reconstructed_output <- rbm.down(rbm_model, hidden_representation)
  
  return(reconstructed_output)
}

# Plot the results comparing true f(x) with RBM approximations
run_RBM_demo_deepnet <- function() {
  
  # Generate the data for f(x) = x - sin(x)
  training_data <- generate_data(c(-10, 10))
  
  # Train the RBM with deepnet (normalized data)
  rbm_model <- train_RBM_deepnet(training_data$data, hidden_units = 30, num_epochs = 10000, learning_rate = 0.1, batch_size = 20)
  
  # Get predictions from the trained RBM (use the normalized x values)
  reconstructed_output <- predict_RBM(rbm_model, training_data$data)
  
  # Rescale the RBM predictions back to the original scale
  # y_pred <- reconstructed_output * (training_data$y_max - training_data$y_min) + training_data$y_min
  # Plot the original function and RBM's approximation
  # plot(training_data$x_values, training_data$y_values, type = "l", col = "blue", lwd = 2, 
  #      ylab = "f(x)", xlab = "x", main = "RBM Approximation of f(x) = x - sin(x)")
  # lines(training_data$x_values, y_pred[, 1], col = "red", lwd = 2)
  # 
  # legend("topright", legend = c("True f(x)", "RBM Approximation"), col = c("blue", "red"), lwd = 2)
  
  # Rescale the RBM predictions back to the original scale
  y_pred <- reconstructed_output[, 2] * (training_data$y_max - training_data$y_min) +
            training_data$y_min
  
  # expend testing data outside the domain of the training data [-10,10] --> [-20,20]
  training_data <- generate_data(c(-20, 20))
  reconstructed_output <- predict_RBM(rbm_model, training_data$data)
  y_pred <- reconstructed_output[, 2] * (training_data$y_max - training_data$y_min) +
            training_data$y_min
  
  # Create an interactive plot using plot_ly
  # vertical lines delineating the training domain
  vline <- function(x = 0, color = "green") {
    list(type = "line", y0 = 0, y1 = 1, yref = "paper", x0 = x, x1 = x,
         line = list(color = color, dash="dot"))
  }
  
  fig <- plot_ly() %>%
    add_lines(x = training_data$x_values, y = training_data$y_values, 
              name = "True f(x)", line = list(color = 'blue')) %>%
    add_lines(x = training_data$x_values, y = y_pred, 
              name = "RBM Approximation", line = list(color = 'red')) %>%
    layout(title = "RBM Approximation of f(x) = x - sin(x)",
           xaxis = list(title = "x"),
           yaxis = list(title = "f(x)"),
           legend = list(title="Predicting Covariates", orientation = 'h'), 
           # legend = list(x = 0.1, y = 0.9),
           shapes = list(vline(-10), vline(10)))
  
  # Display the plot
  fig
}

# Run the RBM demo
run_RBM_demo_deepnet()
SOCR Resource Visitor number Web Analytics SOCR Email