Expanding on DSPA Chapter 6 (Neural Networks and Supervised Machine Learning), this DSPA2 appendix covers a type of stochastic neural network called the Restricted Boltzmann Machine (RBM). RBMs are used for tasks such as dimensionality reduction, classification, regression, collaborative filtering, feature learning, and generative modeling. Their foundations span mathematics, statistics, and physics, with each field contributing essential principles to the structure and function of RBMs.
Stochastic neural networks are a special kind of neural network in which certain components (such as the neurons/perceptrons, the connections, or the activation functions) are governed by probabilistic or stochastic processes. Unlike traditional deterministic neural networks, where the output is fully determined by the input and the learned weights, stochastic neural networks introduce randomness into the learning process, making them better suited for handling uncertainty, modeling distributions, and performing tasks that require probabilistic reasoning, such as generative modeling and reinforcement learning. The key features of stochastic neural networks include:
Here are some examples of Stochastic Neural Networks:
In this appendix, we’ll focus on Restricted Boltzmann Machines (RBMs), a specific class of stochastic neural networks with the following characteristics:
The activation of the hidden units is stochastic, meaning that, given a visible vector \(v\), the state of each hidden unit \(h_j\) is sampled from a probability distribution \[P(h_j = 1 | v) = \sigma \left( \sum_i W_{ij} v_i + c_j \right) , \]
where \(\sigma(x)\) is the sigmoid function, \(W_{ij}\) is the weight connecting visible unit \(v_i\) and hidden unit \(h_j\), and \(c_j\) is the bias for hidden unit \(h_j\).
The lower the energy of a configuration, the more likely it is under the model. The goal of training an RBM is to find weights and biases that assign low energy to configurations corresponding to the data.
Learning via Contrastive Divergence: RBMs use a stochastic learning algorithm called Contrastive Divergence (CD) to approximate the gradient of the log-likelihood function and update the model parameters. CD works by (1) sampling from the distribution of the hidden units given the visible units (positive phase); (2) reconstructing the visible units from the hidden units (negative phase); and (3) updating the weights based on the difference between the positive and negative phase statistics.
Generative and Feature Learning: Since RBMs model the probability distribution of the input data, they are generative models, meaning they can generate new samples by sampling from the learned distribution. They are also powerful for feature learning, as the hidden units capture latent features of the data that can be used for tasks like classification, clustering, or dimensionality reduction.
The hidden and visible units in a RBM have stochastic binary states. For instance, given the visible units, the hidden units are not deterministically set to 0 or 1 but rather are probabilistically sampled based on their activation probability. This probabilistic behavior captures uncertainty and allows the model to represent distributions, rather than fixed mappings, between the input and output.
Since RBMs are stochastic, they rely on Gibbs sampling for inference and learning. Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method where the states of the visible and hidden units are alternately sampled until the system reaches a stationary distribution. The stochastic nature of the model allows it to explore various configurations during learning.
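As a minimal illustration of this alternating scheme, here is a toy R sketch of one Gibbs step for a small binary RBM; the parameters W, b, and c_bias below are arbitrary illustrative values (not from a trained model).
# Minimal alternating Gibbs-sampling sketch for a small binary RBM
# (illustrative only; W, b, c_bias are arbitrary toy parameters)
sigmoid <- function(x) 1 / (1 + exp(-x))
set.seed(1)
W      <- matrix(rnorm(3 * 2, 0, 0.1), nrow = 3, ncol = 2)  # 3 visible x 2 hidden weights
b      <- rep(0, 3)                                         # visible biases
c_bias <- rep(0, 2)                                         # hidden biases
gibbs_step <- function(v, W, b, c_bias) {
  p_h <- sigmoid(as.vector(t(W) %*% v) + c_bias)  # P(h_j = 1 | v)
  h   <- rbinom(length(p_h), 1, p_h)              # sample hidden states
  p_v <- sigmoid(as.vector(W %*% h) + b)          # P(v_i = 1 | h)
  v   <- rbinom(length(p_v), 1, p_v)              # sample visible states
  list(v = v, h = h)
}
# Run a short chain starting from an arbitrary visible configuration
v <- c(1, 0, 1)
for (t in 1:5) v <- gibbs_step(v, W, b, c_bias)$v
v
In practice the chain is run until it approaches the stationary distribution; Contrastive Divergence, discussed below, truncates it after only one (or a few) steps.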
RBMs are useful for (1) dimensionality reduction, since RBMs learn a compact representation of the data in the hidden layer; (2) feature extraction, as hidden units in a RBM learn abstract features of the input data; and (3) generative modeling, as RBMs can simulate/generate new samples from the learned distribution by sampling from the hidden layer and reconstructing the visible layer.
In a general Boltzmann Machine, all units (both visible and hidden) can be connected to one another, resulting in a much more complex network with higher computational costs. In contrast, RBMs are restricted, meaning no connections exist within the visible or hidden layers. This makes training RBMs more tractable, especially with methods like Contrastive Divergence.
Variational autoencoders (VAEs) differ from RBMs in that they use continuous latent variables (instead of binary ones) and employ variational inference to approximate the posterior distribution. VAEs are more common for tasks like generating images or text, while RBMs are simpler and can be applied to binary or discrete data.
RBMs represent a type of stochastic neural network used for tasks such as dimensionality reduction, classification, regression, collaborative filtering, feature learning, and generative modeling. Their foundations span mathematics, statistics, and physics. The core of an RBM is the idea of representing and learning probability distributions through the interactions of neurons. An RBM consists of two layers of neurons: a visible layer, which represents the observed data, and a hidden layer, which captures latent features.
These two layers are fully connected to each other, but connections within each layer are prohibited, meaning neurons in the visible layer do not interact with each other, and neurons in the hidden layer do not interact with each other. The key mathematical formulation of an RBM is an energy function, which assigns an energy to each configuration of the visible and hidden neurons. The lower the energy, the more likely the configuration. The joint probability distribution of the visible and hidden units is defined using a Boltzmann distribution:
\[P(v, h) = \frac{1}{Z} \exp(-E(v, h)) ,\]
where \(v\) and \(h\) are vectors representing the states of the visible and hidden units, respectively, \(E(v, h)\) is the energy function that measures the interaction between the visible and hidden units, and \(Z\) is the partition function, ensuring that the probabilities sum to 1, i.e., \(Z = \sum_{v, h} \exp(-E(v, h)) .\)
The energy function of the RBM is typically defined by \[E(v, h) = -\sum_{i} b_i v_i - \sum_{j} c_j h_j - \sum_{i,j} v_i W_{ij} h_j, \]
where \(b_i\) and \(c_j\) are biases for the visible and hidden units, \(W_{ij}\) are the weights connecting visible unit \(i\) and hidden unit \(j\), and \(v_i\) and \(h_j\) are the states of visible unit \(i\) and hidden unit \(j\), respectively.
The objective is to minimize the energy function, which is equivalent to maximizing the likelihood of the data. This leads to the optimal configuration of weights and biases that define the learned model. The RBM’s marginal probability over the visible units \(v\) is obtained by summing over all possible hidden states \(P(v) = \frac{1}{Z} \sum_h \exp(-E(v, h)) .\) This represents a powerful model for representing complex distributions because the hidden units act as latent variables that can capture higher-order correlations in the visible data.
RBMs are closely tied to probabilistic graphical models and use principles from statistical inference and statistical/machine learning theory. An RBM can be represented by an undirected graphical model (Markov Random Field) with the visible and hidden units forming a bipartite graph. The lack of connections within each layer simplifies inference and learning, which can be challenging in fully connected graphical models. To train RBMs (estimate the model parameters), we need to sample from the joint distribution \(P(v, h)\). This is challenging due to the partition function \(Z\), which is computationally intractable for large models. Instead, Gibbs sampling is used, where we alternate between sampling the hidden units given the visible units and sampling the visible units given the hidden units - (1) sample \(h_j\) from \(P(h_j | v)\), and then (2) sample \(v_i\) from \(P(v_i | h)\).
This iterative procedure is based on Markov Chain Monte Carlo (MCMC) methods, which allow us to approximate the distribution without directly calculating \(Z\). Since exact inference is computationally expensive, RBMs often use Contrastive Divergence (CD) to approximate the gradient of the log-likelihood; the CD algorithm updates the weights based on the difference between the data-driven (positive phase) statistics and the reconstruction-driven (negative phase) statistics.
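Concretely, for a single CD-1 step with learning rate \(\varepsilon\), the standard parameter updates (stated here as a clarifying sketch, consistent with the train_RBM() code later in this appendix) are \[\Delta W_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right), \quad \Delta b_i = \varepsilon \left( \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{recon}} \right), \quad \Delta c_j = \varepsilon \left( \langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{recon}} \right),\] where \(\langle \cdot \rangle_{\text{data}}\) denotes the positive-phase statistics computed from the data and \(\langle \cdot \rangle_{\text{recon}}\) the negative-phase statistics computed from the one-step Gibbs reconstruction.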
The original RBM formulation is deeply connected to concepts from statistical mechanics, particularly the Boltzmann distribution and energy-based models in physics.
The Boltzmann distribution describes the probability of a system being in a certain state based on its energy. In a RBM, this is analogous to the joint distribution of visible and hidden units \(P(v, h) = \frac{1}{Z} \exp(-E(v, h)) .\) This distribution comes from thermodynamics, where systems tend to configurations with lower energy. In the context of an RBM, the system (data) seeks the configuration (parameter settings) that minimizes energy, corresponding to the most likely states for the model.
The partition function \(Z\) in the RBM is borrowed directly from statistical mechanics, where it serves to normalize the probability distribution. It is defined as the sum over all possible configurations (microstates) of the system. In practice, computing \(Z\) is computationally expensive, as it requires summing over an exponentially large number of states. Techniques like Gibbs sampling and Contrastive Divergence circumvent the need for direct computation.
The RBM learning process (RBM optimization) is analogous to the thermodynamic principle of minimizing free energy in a physical system. The goal is to adjust the model parameters (weights and biases) to minimize the energy function, thereby maximizing the likelihood of the observed data. The free energy of a configuration \(v\) in a RBM is \(F(v) = -\log \sum_h \exp(-E(v, h)).\) Minimizing this free energy is equivalent to maximizing the likelihood of the observed data. This parallels the behavior of physical systems that seek equilibrium states with minimum free energy.
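For binary hidden units, the sum over \(h\) can be evaluated in closed form (a standard identity, included here as a clarifying step): \[F(v) = -\log \sum_h \exp(-E(v, h)) = -\sum_i b_i v_i - \sum_j \log \left( 1 + \exp\left( c_j + \sum_i W_{ij} v_i \right) \right),\] because the sum over each binary \(h_j \in \{0, 1\}\) factorizes into a product of terms \(1 + \exp(c_j + \sum_i W_{ij} v_i)\). This closed form makes \(F(v)\) and its gradient tractable to evaluate, even though the partition function \(Z\) itself is not.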
The adjective restricted in the RBM name refers to the restriction on the network’s connectivity: no direct interactions are allowed within the observed (data) layer or within the hidden layer. Restricting connections to between-layer links only, i.e., between the visible (data) units and the hidden units, constrains the network architecture and makes it distinct from its more general counterpart, the Boltzmann Machine, where units can also connect to other units in the same layer.
The RBM restriction that only allows interactions between visible and hidden units (and no connections between units within the same layer) has both mathematical and statistical motivations:
Mathematically, this restriction allows the conditional distributions \(P(v | h)\) and \(P(h | v)\) to factorize \(P(v | h) = \prod_i P(v_i | h)\) and \(P(h | v) = \prod_j P(h_j | v)\). This factorization means that each visible unit can be updated independently given the hidden units, and vice versa. This is crucial for efficient Gibbs sampling and for making the model computationally feasible to train.
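To see where this factorization and the sigmoid activation come from (a short derivation added for clarity), note that for a fixed \(v\) the energy is linear in each \(h_j\), so \[P(h | v) \propto \exp\left( \sum_j h_j \left( c_j + \sum_i v_i W_{ij} \right) \right) = \prod_j \exp\left( h_j \left( c_j + \sum_i v_i W_{ij} \right) \right),\] and normalizing each binary factor separately yields \(P(h_j = 1 | v) = \sigma\left( c_j + \sum_i W_{ij} v_i \right)\); the symmetric argument gives \(P(v_i = 1 | h) = \sigma\left( b_i + \sum_j W_{ij} h_j \right)\).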
A statistical motivation relies on simplified interdependencies. The restriction that only allows connections between visible and hidden units also reflects certain statistical assumptions about the data. In a RBM, the visible units (features) are conditionally independent given the hidden units (latent variables). This reflects the assumption that once the hidden layer explains the data, the remaining variability in the visible units is independent of one another. This assumption is often reasonable in high-dimensional data: many complex dependencies in the visible layer can be captured by introducing hidden variables that encode patterns or latent features. The restriction forces the hidden layer to capture the most important patterns in the data because the visible units can only interact with each other indirectly through the hidden units.
Example: Consider an image dataset where the visible units represent pixel intensities. Without the hidden units, modeling pixel correlations directly would be very complex. However, with hidden units that represent abstract features (like edges, textures, or object parts), the visible units (pixels) become conditionally independent given these hidden features. Thus, the hidden layer effectively captures correlations between visible units, but through a simplified, abstract representation.
The restricted structure of the RBM allows for the efficient use of Contrastive Divergence (CD) to approximate the gradient of the likelihood function. CD works by sampling from the conditionally independent hidden and visible layers, avoiding the need to sample over all configurations of the network. The restricted connectivity ensures that the conditional distributions over each layer factorize, so all hidden units (and, in turn, all visible units) can be sampled in a single parallel step given the other layer.
This leads to a closed-form update rule for the weights that connect visible and hidden units, enabling efficient learning with gradient-based methods. The restriction that only interactions between hidden and visible units are allowed is necessary for the following reasons:
If visible units (features) were directly connected, we would face two issues: (1) complexity in conditional probabilities, because with direct connections between visible units the conditional probability \(P(v | h)\) would not factorize into independent terms, making inference and learning computationally expensive; and (2) difficulties in training, because the learning process would have to handle the complex dependencies between visible units, introducing loops in the model's graph and making exact inference and gradient-based learning much harder due to the need to sample from a far more complex distribution. If hidden units were allowed to interact directly, the hidden layer could not act as a simple set of independent feature detectors; interactions between hidden units would introduce dependencies between the features they represent, complicating the learning process and reducing the interpretability of the hidden units as independent latent factors. Moreover, without the restriction, sampling and inference become much more complex because we lose the conditional independence between the hidden units, and sampling from the hidden layer given the visible layer becomes intractable.
Let’s work through two examples, one with a Restricted Boltzmann Machine (RBM) and one with a general (unrestricted) Boltzmann Machine (BM), to demonstrate the calculations involved and highlight the differences in complexity and approach.
Consider a simple RBM with 3 visible units \(v_1, v_2, v_3\) and 2 hidden units \(h_1, h_2\).
Assume the following weights and biases \[W = \begin{bmatrix} 0.5 & -0.3 \\ 0.7 & 0.1 \\ -0.2 & 0.4 \\ \end{bmatrix} \quad b = \begin{bmatrix} 0.2 \\ 0.4 \\ 0.1 \end{bmatrix} \quad c = \begin{bmatrix} 0.6 \\ -0.2 \end{bmatrix} ,\] where, \(W_{ij}\) represents the weight between visible unit \(v_i\) and hidden unit \(h_j\), and \(b_i\) and \(c_j\) represent biases for the visible and hidden units, respectively.
The energy function for the RBM is \[E(v, h) = - \sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i W_{ij} h_j ,\] where \(v_i\) and \(h_j\) represent the states of the visible and hidden units, respectively.
Let’s compute the energy for a particular state of the visible and hidden units. Assume \(v = [1, 0, 1]\) (binary values for visible units) and \(h = [1, 0]\) (binary values for hidden units).
First term: \(\sum_i b_i v_i = 0.2 \cdot 1 + 0.4 \cdot 0 + 0.1 \cdot 1 = 0.2 + 0.1 = 0.3 .\)
Second term: \(\sum_j c_j h_j = 0.6 \cdot 1 + (-0.2) \cdot 0 = 0.6 .\)
Third term: \(\sum_{i,j} v_i W_{ij} h_j = (1 \cdot 0.5 \cdot 1) + (0 \cdot 0.7 \cdot 1) + (1 \cdot -0.2 \cdot 1) = 0.5 + 0 + (-0.2) = 0.3 .\)
Then, the total energy for this state \(v = [1, 0, 1]\) and \(h = [1, 0]\) is \(E(v, h) = -(0.3 + 0.6 + 0.3) = -1.2.\)
To train the RBM, we need to sample from the conditional distributions \(P(h_j | v)\) and \(P(v_i | h)\). The conditional probability of hidden unit \(h_j\) given the visible units \(v\) is \(P(h_j = 1 | v) = \sigma \left( c_j + \sum_i W_{ij} v_i \right) ,\) where \(\sigma(x) = \frac{1}{1 + e^{-x}}\) is the sigmoid function.
For hidden unit \(h_1\) \[P(h_1 = 1 | v) = \sigma \left( 0.6 + (1 \cdot 0.5) + (0 \cdot 0.7) + (1 \cdot -0.2) \right) = \sigma(0.9) = \frac{1}{1 + e^{-0.9}} \approx 0.71 .\]
Therefore, \(h_1 = 1\) with probability \(0.71\). Similarly, for visible units given hidden units \(P(v_i = 1 | h) = \sigma \left( b_i + \sum_j W_{ij} h_j \right) .\)
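These hand calculations can be verified numerically with a short R check (a simple sketch using the toy parameters above):
# Numerical check of the worked RBM example
W     <- matrix(c(0.5, -0.3,
                  0.7,  0.1,
                 -0.2,  0.4), nrow = 3, ncol = 2, byrow = TRUE)
b_vis <- c(0.2, 0.4, 0.1)    # visible biases b
c_hid <- c(0.6, -0.2)        # hidden biases c
v <- c(1, 0, 1); h <- c(1, 0)
sigmoid <- function(x) 1 / (1 + exp(-x))
# Energy E(v, h) = -b'v - c'h - v' W h
E <- -sum(b_vis * v) - sum(c_hid * h) - as.numeric(t(v) %*% W %*% h)
E      # -1.2, matching the hand calculation
# Hidden-unit activation probabilities P(h_j = 1 | v)
p_h <- sigmoid(c_hid + as.vector(t(W) %*% v))
p_h    # approximately (0.71, 0.48); the first entry is P(h_1 = 1 | v)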
In a RBM, the lack of intra-layer connections allows for efficient Gibbs sampling because the units within each layer are conditionally independent given the other layer. Also, we can sample all hidden units in parallel and all visible units in parallel because of the restricted bipartite structure.
Next, consider a general Boltzmann Machine with 3 visible units \(v_1, v_2, v_3\) and 2 hidden units \(h_1, h_2\), but this time we allow intra-layer connections (i.e., visible units can interact with each other, and hidden units can interact with each other).
Assume the same weights and biases as before, \[W = \begin{bmatrix} 0.5 & -0.3 \\ 0.7 & 0.1 \\ -0.2 & 0.4 \\ \end{bmatrix} \quad b = \begin{bmatrix} 0.2 \\ 0.4 \\ 0.1 \end{bmatrix} \quad c = \begin{bmatrix} 0.6 \\ -0.2 \end{bmatrix} ,\] but now also include intra-layer connections.
For simplicity, assume the intra-layer connections are defined by \[U_{\text{visible}} = \begin{bmatrix} 0 & 0.1 & -0.05 \\ 0.1 & 0 & 0.07 \\ -0.05 & 0.07 & 0 \end{bmatrix} \quad U_{\text{hidden}} = \begin{bmatrix} 0 & 0.2 \\ 0.2 & 0 \end{bmatrix} .\]
In this case, the energy function in a general Boltzmann Machine is more complex because it includes intra-layer terms \[E(v, h) = - \sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i W_{ij} h_j - \frac{1}{2} \sum_{i, i'} U_{ii'} v_i v_{i'} - \frac{1}{2} \sum_{j, j'} U_{jj'} h_j h_{j'} .\]
As before, we can compute the energy for the same state \(v = [1, 0, 1]\) and \(h = [1, 0]\). The first three terms are unchanged (\(0.3\), \(0.6\), and \(0.3\)). The intra-visible term is \(\frac{1}{2} \sum_{i, i'} U_{ii'} v_i v_{i'} = \frac{1}{2} (U_{13} + U_{31}) v_1 v_3 = \frac{1}{2}(-0.05 - 0.05) = -0.05\), and the intra-hidden term is \(0\), since \(h_2 = 0\) and the diagonal of \(U_{\text{hidden}}\) is zero.
Thus, the total energy is \(E(v, h) = -(0.3 + 0.6 + 0.3 + (-0.05) + 0) = -1.15.\)
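As a quick numerical check (a small R sketch using the matrices above, with quadratic forms implementing the double sums):
# Numerical check of the unrestricted BM energy for v = (1, 0, 1), h = (1, 0)
W <- matrix(c(0.5, -0.3,
              0.7,  0.1,
             -0.2,  0.4), nrow = 3, byrow = TRUE)
b_vis <- c(0.2, 0.4, 0.1)
c_hid <- c(0.6, -0.2)
U_vis <- matrix(c( 0.00, 0.10, -0.05,
                   0.10, 0.00,  0.07,
                  -0.05, 0.07,  0.00), nrow = 3, byrow = TRUE)
U_hid <- matrix(c(0.0, 0.2,
                  0.2, 0.0), nrow = 2, byrow = TRUE)
v <- c(1, 0, 1); h <- c(1, 0)
E_BM <- -sum(b_vis * v) - sum(c_hid * h) - as.numeric(t(v) %*% W %*% h) -
  0.5 * as.numeric(t(v) %*% U_vis %*% v) -
  0.5 * as.numeric(t(h) %*% U_hid %*% h)
E_BM   # -1.15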
In this unrestricted BM, Gibbs sampling is more complex because the visible units and hidden units are no longer conditionally independent given the other layer. This introduces dependencies within the layers, so the conditional distributions \(P(v | h)\) and \(P(h | v)\) cannot be factorized easily. To sample \(v_i\), we must now consider not only its connections to the hidden units but also the influence of its neighboring visible units. Similarly, the hidden units depend on each other due to the intra-layer connections. This increases the complexity of both inference and learning.
Key differences between RBMs and general BMs include the bipartite (restricted) connectivity of the RBM, the resulting factorization of the conditional distributions \(P(v|h)\) and \(P(h|v)\), and the correspondingly simpler Gibbs sampling and Contrastive Divergence training.
R Demonstration - Function Estimation \(f(x) = x - \sin(x)\)
Let's develop an R function demonstrating the architecture of a Restricted Boltzmann Machine (RBM), its optimization and training process, and finally apply it to estimate the behavior of the function \(f(x) = x - \sin(x)\) over the range \([-10, 10]\). The RBM can be designed with visible units corresponding to the input space and hidden units for latent features. We will train it to approximate the target function by reconstructing the input-output relationship through the hidden units. Specifically, the goal is to capture the behavior of \(f(x)\) by minimizing the reconstruction error over samples from the function.
Here are the basic steps in an RBM implementation: (1) initialize the weights and biases; (2) positive phase: sample the hidden units given the visible (data) units; (3) negative phase: reconstruct the visible units from the hidden units and resample the hidden units; (4) update the weights and biases via Contrastive Divergence; and (5) repeat over multiple epochs until the reconstruction error stabilizes.
Below is an example R function train_RBM() depicting the formulation, training, and application of an RBM for this task. The protocol involves the following steps:
Matrix Multiplication Conformability: the function apply_RBM() ensures that x_values and weights are compatible for matrix multiplication. We use the shape of x_values (i.e., the number of rows of x_values) and properly adjust the biases using matrix(…, byrow = TRUE) to make sure the dimensions conform. The weights matrix is applied properly in both the forward pass (hidden layer) and the reconstruction step (visible layer).
Sampling Hidden and Visible Units: We ensured that prob_hidden_given_visible and prob_visible_given_hidden are properly sampled and used in Gibbs sampling, with dimensionality adjustments made at each step.
To train the RBM, we first initialize the process by setting up the weights and biases, and then apply Contrastive Divergence over multiple epochs. The visible units (representing \(x\) and \(f(x)\)) are sampled, the hidden units are updated stochastically, and the parameters are updated iteratively. Following training, the RBM is applied to the \(x\) values to reconstruct the function approximation. Finally, the original function \(f(x) = x - \sin(x)\) is compared to the RBM’s approximation, and both are plotted.
library(plotly)
# Define the sigmoid activation function
sigmoid <- function(x) {
1 / (1 + exp(-x))
}
# Define the RBM architecture, optimization, and training process
train_RBM <- function(data, n_hidden = 10, n_epochs = 4000, learning_rate = 0.2, batch_size = 20) {
# Initialize RBM parameters
n_visible <- ncol(data) # Number of visible units (input features)
# Weight matrix (randomly initialized) and bias terms
weights <- matrix(rnorm(n_visible * n_hidden, 0, 0.1), n_visible, n_hidden)
visible_bias <- rep(0, n_visible)
hidden_bias <- rep(0, n_hidden)
# Function to perform Gibbs sampling (updating hidden given visible and vice versa)
sample_hidden <- function(visible) {
prob_hidden <- sigmoid(visible %*% weights + matrix(hidden_bias, nrow = nrow(visible), ncol = n_hidden, byrow = TRUE))
prob_hidden
}
sample_visible <- function(hidden) {
prob_visible <- sigmoid(hidden %*% t(weights) + matrix(visible_bias, nrow = nrow(hidden), ncol = n_visible, byrow = TRUE))
prob_visible
}
# Contrastive Divergence (CD-k where k=1)
for (epoch in 1:n_epochs) {
# Shuffle the data
data <- data[sample(1:nrow(data)), ]
for (i in seq(1, nrow(data), by = batch_size)) {
batch <- data[i:min(i + batch_size - 1, nrow(data)), ]
# Positive phase
prob_hidden_given_visible <- sample_hidden(batch)
hidden_states <- (prob_hidden_given_visible > matrix(runif(nrow(batch) * n_hidden), nrow = nrow(batch), ncol = n_hidden)) * 1
# Negative phase (reconstruction)
prob_visible_given_hidden <- sample_visible(hidden_states)
visible_reconstructed <- (prob_visible_given_hidden > matrix(runif(nrow(hidden_states) * n_visible), nrow = nrow(hidden_states), ncol = n_visible)) * 1
prob_hidden_given_visible_reconstructed <- sample_hidden(visible_reconstructed)
# Update weights and biases using contrastive divergence
weights <- weights + learning_rate * ((t(batch) %*% prob_hidden_given_visible) -
(t(visible_reconstructed) %*% prob_hidden_given_visible_reconstructed)) / nrow(batch)
visible_bias <- visible_bias + learning_rate * colSums(batch - visible_reconstructed) / nrow(batch)
hidden_bias <- hidden_bias + learning_rate * colSums(prob_hidden_given_visible - prob_hidden_given_visible_reconstructed) / nrow(batch)
}
# Optionally print progress
if (epoch %% 1000 == 0) {
cat("Epoch:", epoch, "complete\n")
}
}
# Return the trained RBM parameters
return(list(weights = weights, visible_bias = visible_bias, hidden_bias = hidden_bias))
}
# Define the function to apply the trained RBM to estimate the function f(x) = x - sin(x)
apply_RBM <- function(rbm, x_values) {
# Input for the RBM is x_values
weights <- rbm$weights
visible_bias <- rbm$visible_bias
hidden_bias <- rbm$hidden_bias
# Get hidden representations (features)
prob_hidden_given_visible <- sigmoid(as.matrix(x_values) %*% weights + matrix(hidden_bias, nrow = nrow(x_values), ncol = ncol(weights), byrow = TRUE))
hidden_states <- (prob_hidden_given_visible > matrix(runif(nrow(x_values) * ncol(weights)), nrow = nrow(x_values), ncol = ncol(weights))) * 1
# Reconstruct the visible units (the function approximation)
reconstructed_visible <- sigmoid(hidden_states %*% t(weights) + matrix(visible_bias, nrow = nrow(hidden_states), ncol = length(visible_bias), byrow = TRUE))
return(reconstructed_visible)
}
# Generate training data for the function f(x) = x - sin(x)
generate_data <- function(x_range) {
x_values <- seq(x_range[1], x_range[2], length.out = 100)
y_values <- x_values - sin(x_values)
# Normalize the data to fit into [0, 1] for the RBM
x_min <- min(x_values)
x_max <- max(x_values)
y_min <- min(y_values)
y_max <- max(y_values)
x_normalized <- (x_values - x_min) / (x_max - x_min)
y_normalized <- (y_values - y_min) / (y_max - y_min)
data <- cbind(x_normalized, y_normalized)
return(list(data = data, x_values = x_values, y_values = y_values,
x_min = x_min, x_max = x_max, y_min = y_min, y_max = y_max))
}
# Train the RBM and apply it to approximate f(x) = x - sin(x)
run_RBM_demo <- function() {
# Generate data for the function f(x) = x - sin(x)
training_data <- generate_data(c(-10, 10))
# Train the RBM on the data
rbm <- train_RBM(training_data$data, n_hidden = 20, n_epochs = 4000, learning_rate = 0.2, batch_size = 20)
# Apply the trained RBM to approximate the function
reconstructed <- apply_RBM(rbm, training_data$data) # [, 1, drop = FALSE]) # Only use the x-values for prediction
# Rescale the reconstructed output to the original scale
y_pred <- reconstructed * (training_data$y_max - training_data$y_min) + training_data$y_min
# Plot the results
plot(training_data$x_values, training_data$y_values, type = "l", col = "blue", lwd = 2, ylab = "f(x)", xlab = "x",
main = "Function Approximation using RBM")
lines(training_data$x_values, 20*y_pred[,2], col = "red", lwd = 2)
legend("topright", legend = c("True f(x) = x - sin(x)", "RBM Approximation"), col = c("blue", "red"), lwd = 2)
}
# Run the RBM demo
run_RBM_demo()
## Epoch: 1000 complete
## Epoch: 2000 complete
## Epoch: 3000 complete
## Epoch: 4000 complete
deepnet
Compare this simple manual RBM implementation against the function deepnet::rbm.train(), which, as expected, produces a much more accurate prediction for the nonlinear function estimation of \(f(x) = x - \sin(x)\). In this case, training the RBM uses a custom function train_RBM_deepnet(), which utilizes deepnet::rbm.train() to train an RBM with the specified number of hidden units, epochs, learning rate, and batch size. Also, the predict_RBM() function uses the trained RBM to generate predictions by first calculating the hidden layer activations with deepnet::rbm.up() and then reconstructing the visible layer (output) with deepnet::rbm.down().
# Load the required package
# install.packages("deepnet")
library(deepnet)
# Generate the dataset for f(x) = x - sin(x)
generate_data <- function(x_range) {
x_values <- seq(x_range[1], x_range[2], length.out = 100)
y_values <- x_values - sin(x_values)
# Normalize the data between 0 and 1 for the RBM
x_min <- min(x_values)
x_max <- max(x_values)
y_min <- min(y_values)
y_max <- max(y_values)
x_normalized <- (x_values - x_min) / (x_max - x_min)
y_normalized <- (y_values - y_min) / (y_max - y_min)
data <- cbind(x_normalized, y_normalized)
return(list(data = data, x_values = x_values, y_values = y_values,
x_min = x_min, x_max = x_max, y_min = y_min, y_max = y_max))
}
# Train a RBM to approximate f(x) = x - sin(x)
train_RBM_deepnet <- function(data, hidden_units = 10, num_epochs = 1000, learning_rate = 0.1, batch_size = 10) {
# Train the RBM using deepnet's rbm.train function
rbm_model <- rbm.train(x = data, hidden = hidden_units,
numepochs = num_epochs, learningrate = learning_rate, batchsize = batch_size)
return(rbm_model)
}
# Use the trained RBM to generate predictions
predict_RBM <- function(rbm_model, x_values_normalized) {
# Use the rbm.up function from deepnet to get hidden representation, then rbm.down to get reconstruction
hidden_representation <- rbm.up(rbm_model, x_values_normalized)
reconstructed_output <- rbm.down(rbm_model, hidden_representation)
return(reconstructed_output)
}
# Plot the results comparing true f(x) with RBM approximations
run_RBM_demo_deepnet <- function() {
# Generate the data for f(x) = x - sin(x)
training_data <- generate_data(c(-10, 10))
# Train the RBM with deepnet (normalized data)
rbm_model <- train_RBM_deepnet(training_data$data, hidden_units = 30, num_epochs = 10000, learning_rate = 0.1, batch_size = 20)
# Get predictions from the trained RBM (use the normalized x values)
reconstructed_output <- predict_RBM(rbm_model, training_data$data)
# Rescale the RBM predictions back to the original scale
# y_pred <- reconstructed_output * (training_data$y_max - training_data$y_min) + training_data$y_min
# Plot the original function and RBM's approximation
# plot(training_data$x_values, training_data$y_values, type = "l", col = "blue", lwd = 2,
# ylab = "f(x)", xlab = "x", main = "RBM Approximation of f(x) = x - sin(x)")
# lines(training_data$x_values, y_pred[, 1], col = "red", lwd = 2)
#
# legend("topright", legend = c("True f(x)", "RBM Approximation"), col = c("blue", "red"), lwd = 2)
# Rescale the RBM predictions back to the original scale
y_pred <- reconstructed_output[, 2] * (training_data$y_max - training_data$y_min) +
training_data$y_min
  # extend the testing data outside the training domain [-10, 10] --> [-20, 20]
training_data <- generate_data(c(-20, 20))
reconstructed_output <- predict_RBM(rbm_model, training_data$data)
y_pred <- reconstructed_output[, 2] * (training_data$y_max - training_data$y_min) +
training_data$y_min
# Create an interactive plot using plot_ly
# vertical lines delineating the training domain
vline <- function(x = 0, color = "green") {
list(type = "line", y0 = 0, y1 = 1, yref = "paper", x0 = x, x1 = x,
line = list(color = color, dash="dot"))
}
fig <- plot_ly() %>%
add_lines(x = training_data$x_values, y = training_data$y_values,
name = "True f(x)", line = list(color = 'blue')) %>%
add_lines(x = training_data$x_values, y = y_pred,
name = "RBM Approximation", line = list(color = 'red')) %>%
layout(title = "RBM Approximation of f(x) = x - sin(x)",
xaxis = list(title = "x"),
yaxis = list(title = "f(x)"),
legend = list(title="Predicting Covariates", orientation = 'h'),
# legend = list(x = 0.1, y = 0.9),
shapes = list(vline(-10), vline(10)))
# Display the plot
fig
}
# Run the RBM demo
run_RBM_demo_deepnet()