Hilbert space embeddings of distributions, or kernel mean embeddings, map distributions into reproducing kernel Hilbert spaces (RKHS’s) where kernel methods are extended to probability measures. Kernel mean embedding methods have many applications in probabilistic modeling of complex phenomena, statistical inference based on high-dimensional data, and artificial intelligence. In this TCIU Reproducing Kernel Hilbert Spaces and Temporal Distribution Dynamics (TDD) Section we explore the relations between RKHS, kime-phase distributions, and temporal distribution dynamics.
Specifically, in this section we explore repeated spatiotemporal sampling, where \(\forall\ n\), we observe \(K\) (longitudinal or spatiotemporal) sample means \(\{\bar{X}_{n1},\bar{X}_{n2},\cdots, \bar{X}_{nK}\}\). In this case, \(\forall\ 1\leq k_o\leq K\) each IID sample \(\{X_{1k_o},X_{2k_o},\cdots ,X_{nk_o}\}\sim F_X\) (across the kime-phase space) allows us to compute the \(k_o^{th}\) sample mean \(\bar{X}_{nk_o}\).
Computing \(\bar{X}_{nk_o}\) from the \(k_o^{th}\) sample \(\{X_{1k_o},\cdots ,X_{nk_o}\}\) is identical to directly sampling \(\bar{X}_{nk_o}\) from the sampling distribution of the mean, \(F_{\bar{X}_n}\). Kime-phase sampling, at a fixed spatiotemporal index \(k=k_o\), draws from the kime-phase distribution to produce \(\{X_{1k_o},\cdots , X_{nk_o}\}\). The complementary spatiotemporal sampling reflects the varying \(k\) index, where the kime-phase variability is annihilated by using a kime-phase aggregator, in this case sample averages, to generate a sample \(\{\bar{X}_{n1},\cdots , \bar{X}_{nK} \}\).
Prior to data acquisition, \(\bar{X}_{nk_o}\) is a random variable; once the observed data values are plugged in, it becomes a constant. Hence, the sample mean random variable (a scaled sum of \(n\) variables, each \(\sim F_X\)) \(\bar{X}_{nk_o}\), based on \(\{X_{1k_o},\cdots , X_{nk_o}\}\sim F_X\), and the (single!) random variable \(Y\sim F_{\bar{X}_{nk_o}}\) represent exactly the same random variable.
There is a kime-phase connection to quantum fluctuations, statistical inference, and artificial intelligence. This is related to random drawing of a second-order tensor \(X_{n\times K}\) (a random matrix) representing \(K\) (spatiotemporal or longitudinal) samples of \(n\) IID observations \(\{X_{1k_o},\cdots , X_{nk_o}\}_{k_o}\sim F_X\). Computing the \(K\) corresponding means \(\bar{X}_{nk_o}=\frac{1}{n}\sum_{i=1}^n{X_{ik_o}}\) is equivalent to randomly drawing \(K\) samples (a random vector) directly from the sampling distribution of the mean \(F_{\bar{X}_{nk_o}}\).
Kernel functions are positive definite operators that arise as a way to perform inner products \(\langle \cdot, \cdot \rangle\) in high-dimensional feature spaces \(\mathcal{F}\) given observed data points \(x, y \in \mathcal{X}\). These kernel functions guarantee the existence of mappings from the native state spaces of observed data \(\mathcal{X}\) to dot product feature spaces \(\mathcal{F}\), \(\phi: \mathcal{X} \to \mathcal{F}\), such that the kernel inner product \(\kappa(x, y) = \langle \phi(x), \phi(y)\rangle\) can be computed without the need for explicit calculation of the enigmatic feature mapping \(\phi\).
Kernel methods are used by machine learning (ML) algorithms and artificial intelligence (AI) techniques to replace the native-space inner products \(\langle x, y\rangle_{\mathcal{X}}\) with other, possibly nonlinear, distance or similarity measures computed in a higher-dimensional feature space, i.e., the range (output) of a nonlinear transformation \(\phi: \mathcal{X} \to \mathcal{F}\) that maps the native space (domain, \(\mathcal{X}\)) into a higher-dimensional feature space \(\mathcal{F}\). Each feature map \(\phi\) replaces the native-space inner product calculations \(\langle x, y\rangle_{\mathcal{X}}: \mathcal{X}\times \mathcal{X}\to \mathbb{R}\) with kernel function calculations \(\kappa(x, y) = \langle \phi(x), \phi(y)\rangle_{\mathcal{F}}\), where the inner product \(\langle \cdot, \cdot\rangle_{\mathcal{F}}: \mathcal{F}\times \mathcal{F}\to \mathbb{R}\) is taken in the feature space \(\mathcal{F}\). As the elements of the inner product are usually clearly identifiable, we can skip the space-designating subscript and write \(\langle \cdot, \cdot\rangle\) for \(\langle \cdot, \cdot\rangle_{\mathcal{X}}\) and \(\langle \cdot, \cdot\rangle_{\mathcal{F}}\).
In other words, we interpret \(\kappa(x, y)\) as a (possibly nonlinear) similarity measure between the data points \(x\) and \(y\). Many ML/AI algorithms depend only on the inner products (distances) between state-space data points, i.e., on \(\langle x, y\rangle\). Hence, this kernel trick allows us to obtain nonlinear extensions of such algorithms by simply substituting the kernel similarity measures \(\kappa(x, y) = \langle \phi(x), \phi(y)\rangle_{\mathcal{F}}\) for the corresponding native-space inner products \(\langle x, y\rangle\), without altering the algorithm learning, training, and validation steps. For instance, consider a polynomial feature map \(\phi:\mathbb{R}^2\to \mathbb{R}^3\)
\[\phi(x) = (x^2_1,\ x^2_2,\ \sqrt{2}\ x_1 x_2),\ \forall\ x=(x_1,x_2)\in\mathbb{R}^2\ ,\]
which transforms the calculations of the native-space inner product to another feature-space inner product as follows
\[\underbrace{\langle x, y\rangle_{\mathcal{X}} \equiv x_1 y_1 + x_2 y_2}_{\text{native space similarity, }\mathcal{X}} \longrightarrow \underbrace{\kappa(x, y) = \langle \phi(x), \phi(y)\rangle_{\mathcal{F}}\equiv x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 x_2 y_1 y_2}_{\text{(corresponding) RKHS space similarity, }\mathcal{F}}\equiv \langle x, y\rangle_{\mathcal{X}}^2 \ .\]
In this case, the kernel trick suggests that the \(\phi\)-transformed similarity measure, \(\langle \phi(x), \phi(y)\rangle_{\mathcal{F}}\), is just the square of the inner product in the native space \(\mathcal{X}\), i.e., \(\langle x, y\rangle ^2\).
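As a quick numeric sanity check of this identity (an illustrative snippet added here, not part of the original analysis), we can evaluate both sides directly in R:
phi <- function(v) c(v[1]^2, v[2]^2, sqrt(2)*v[1]*v[2])   # explicit quadratic feature map
u <- c(1, 2); w <- c(3, -1)                                # two arbitrary points in R^2
sum(phi(u) * phi(w))    # feature-space inner product of phi(u) and phi(w)
sum(u * w)^2            # kernel trick: squared native-space inner product
# both expressions evaluate to 1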
The more general \(d\)-degree polynomial kernel relies on a map \(\phi:\mathcal{X}\equiv\mathbb{R}^n \to \mathcal{F}\), where \(\phi(x)\) represents all possible \(d^{th}\)-degree ordered products of the coordinates of a data point \(x\in\mathbb{R}^n\), so that for all \(x,y\in\mathbb{R}^n\)
\[\kappa(x, y) = \langle \phi(x), \phi(y)\rangle_{\mathcal{F}} = \langle x, y\rangle_{\mathcal{X}} ^d\ .\]
In DSPA Chapter 6 we show several types of kernels commonly used in support vector machines (SVM) and other ML methods. Here are some common kernel examples.
Linear kernel \[\kappa(x, y)=\langle x,\ y \rangle\ .\]
Polynomial kernel of degree \(d\), in its homogeneous or inhomogeneous form \[\kappa(x, y) = \langle x, y\rangle ^d\ \ {\text{or}}\ \ \kappa(x, y)=(x\cdot y+1)^d \ .\]
Sigmoid (hyperbolic tangent) kernel \[\kappa(x, y)=\tanh(s\ \langle x, y\rangle + o)\ .\]
Gaussian (radial basis function) kernel \[\kappa(x, y)=e^{- \left (\frac{\lVert x-y\rVert^2}{2\sigma^2}\right )} .\]
Note that the hyperbolic tangent is a rescaled logistic sigmoid \(\sigma(\cdot)\), \[\tanh(x)\equiv\frac{e^x-e^{-x}}{e^x+e^{-x}} = \frac{1-e^{-2x}}{1+e^{-2x}}=\frac{2-(1+e^{-2x})}{1+e^{-2x}} = \frac{2}{1+e^{-2x}}-1 = 2\sigma(2x)-1\in(-1,1)\ ,\] so the sigmoid kernel \[\kappa(x, y)=\tanh(s\langle x, y\rangle + o)\] takes values in \((-1,1)\).
Laplacian kernel \[\kappa(x, y) = e^{-\left (\frac{\lVert x-y\rVert}{\sigma}\right )} \ . \]
Matern kernel \[\kappa_{\nu} (x,y)=\sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)}\left (\sqrt{2\nu}\frac{\|x-y\|}{\rho}\right )^{\nu} K_{\nu}\left (\sqrt{2\nu}\frac{\|x-y\|}{\rho} \right )\ ,\]
where \(\sigma^2\) is the variance parameter and \(\rho\) controls the kernel scale (width).
When \(\nu=\frac{1}{2}\), \[\kappa_{\frac{1}{2}}(x,y)=\sigma^2\sqrt{\frac{2}{\pi}\frac{\|x-y\|}{\rho}} \underbrace{K_{\frac{1}{2}}\left (\frac{\|x-y\|}{\rho}\right )}_{\left (2\frac{\|x-y\|}{\rho\pi}\right )^{-1/2} \exp\left (-\frac{\|x-y\|}{\rho}\right )}= \sigma^2 \exp\left (-\frac{\|x-y\|}{\rho}\right )\ .\]
As \(\nu\to\infty\), the Matern kernel converges to the Gaussian kernel,
\[\lim_{\nu\to\infty}\kappa_{\nu}(x,y)=\sigma^2\exp \left (-\frac{\|x-y\|^2}{2\rho^2}\right )\ ,\]
where the Laplace method may be used to approximate the asymptotic behavior of the Bessel and Gamma functions as \(\nu\to\infty\),
\[K_{\nu}(x)\to \sqrt{\frac{\pi}{2\nu}}e^{-\nu}(2\nu/x)^{\nu},\ \Gamma(\nu)\to \sqrt{2\pi}\nu^{\nu-1/2}e^{-\nu}\ ,\]
\[K_{\nu}(\sqrt{\nu}\ x)\to \sqrt{\frac{\pi}{2\nu}}e^{-\nu}\left (2\frac{\sqrt{\nu}}{x}\right )^{\nu}\exp\left(-\frac{x^2}{4}\right)\ ,\]
and \(\sqrt{\nu x^2+\nu^2}\sim \nu+\frac{x^2}{2},\ \nu\to \infty\).
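As a quick numeric sanity check of this \(\nu\to\infty\) Gaussian limit (an illustrative snippet using base R's besselK(), not part of the original analysis), a Matern kernel with a large smoothness parameter is already close to its Gaussian limit:
# illustrative check of the large-nu Gaussian limit of the Matern kernel
matern_check <- function(d, nu, rho = 1, sigma2 = 1) {
  u <- sqrt(2*nu)*d/rho
  sigma2 * 2^(1 - nu)/gamma(nu) * u^nu * besselK(u, nu)
}
d <- 0.7
matern_check(d, nu = 100)    # large-nu Matern kernel value (nu kept below gamma() overflow)
exp(-d^2/2)                  # Gaussian limit sigma^2*exp(-d^2/(2*rho^2)), approximately 0.783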
This unique interpolation of the Matern kernel between the Laplacian (\(\nu=\frac{1}{2}\)) and Gaussian (\(\nu\to\infty\)) kernels is effectively a 1-parameter kernel homotopy, just like the t-distribution is a 1-parameter (probability distribution) homotopy between the Cauchy and Gaussian distributions.
Also, Matern kernels generate classical Sobolev spaces, indicating that reproducing-kernel Hilbert spaces (RKHS) associated with the Green function can be embedded into or are equivalent to a generalized Sobolev space.
Here are the Matern kernels for two specific values of the smoothness parameter, \(\nu\in \left \{\frac{3}{2}, \frac{5}{2} \right \}\):
\[\kappa_{\frac{3}{2}}(x,y)=\sigma^2\left (1+\frac{\sqrt{3}\|x-y\|}{\rho}\right ) \exp\left (-\frac{\sqrt{3}\|x-y\|}{\rho}\right)\]
\[\kappa_{\frac{5}{2}}(x,y)=\sigma^2\left (1+\frac{\sqrt{5}\|x-y\|}{\rho}+\frac{5\|x-y\|^2}{3\rho^2}\right ) \exp \left (-\frac{\sqrt{5}\|x-y\|}{\rho}\right ).\]
Below are plots of the shapes of several Matern kernels.
library(GPBayes)
library(plotly)
# modified Bessel function of the second kind, K_nu(.), as a function of x for a fixed order nu
besselk_nu <- function(nu) {
  return(function(x) { return(GPBayes::BesselK(nu, x)) })
}
# Matern kernel as a function of the distance x, for parameters (sigma, nu, rho)
maternkernel <- function(sigma, nu, rho, x){
  scaled_x = sqrt(2*nu)*x/rho
  return(sapply(scaled_x, besselk_nu(nu)) * scaled_x**nu * sigma**2 * 2**(1-nu)/gamma(nu))
}
x_values = seq(from = -5, to = 5, by = 0.01)
p <- plot_ly(type="scatter", mode="lines")
p <- p %>% add_trace(x=~x_values, y=~maternkernel(1, 1/8, 1, abs(x_values)),
                     name = ~paste("nu=", 1/8), mode="lines")
p <- p %>% add_trace(x=~x_values, y=~maternkernel(1, 1/2, 1, abs(x_values)),
                     name = ~paste("nu=", 1/2), mode="lines")
p <- p %>% add_trace(x=~x_values, y=~maternkernel(1, 1, 1, abs(x_values)),
                     name = ~paste("nu=", 1), mode="lines")
p <- p %>% add_trace(x=~x_values, y=~maternkernel(1, 5, 1, abs(x_values)),
                     name = ~paste("nu=", 5), mode="lines")
p <- p %>% add_trace(x=~x_values, y=~maternkernel(1, 50, 1, abs(x_values)),
                     name = ~paste("nu=", 50), mode="lines")
p = layout(p, title = "Matern kernels for different parameter values",
           xaxis = list(title="X values"), yaxis = list(title="Y values"))
p
Bessel kernel, defined via the Bessel function of the first kind \(J_{n}(z)\) and the generalized Bessel function \(Bessel_{\nu +1}^{n}\), \[J_{n}(z)=\frac{1}{2\pi i} \oint e^{\frac{z}{2}(t-\frac{1}{t})} t^{-n-1}dt\ ,\] \[Bessel_{\nu +1}^{n}(u) = \frac{1}{(4 \pi)^{(\nu +1)/2}\Gamma ((\nu +1)/2)} \int_0^{\infty} {\frac{e^{-\left (\frac{\pi \vert u \vert^2}{s}+\frac{s}{4 \pi}\right )}} {s^{1 + \frac{n - (\nu + 1)}{2}}}\,\mathrm{d}s}\ , \forall\ u\in\mathbb{R}^n\setminus\{0\} ,\]
\[\kappa(x, y) = -Bessel_{\nu +1}^{n} \left ( \sigma ||x-y||^2 \right ) \equiv \frac{J_{\nu +1}(\sigma ||x-y||^2)}{||x-y||^{-n(\nu + 1)}}\ .\]
Spline kernel \[\kappa(x, y) = \prod_{i=1}^n\left (1+ x_i y_i (1+\min(x_i,y_i))- \frac{(x_i+y_i)\min^2(x_i,y_i)}{2} +\frac{\min^3(x_i,y_i)}{3}\right )\ .\]
The R implementations of all kernlab kernels are available in the kernlab package documentation.
In general, for each pair of observations \(x,y\in\mathcal{X}\), evaluating the kernel \(\kappa(x, y)\) involves two computationally expensive steps: (1) explicitly computing the feature maps \(\phi(x)\) and \(\phi(y)\), and (2) computing their inner product \(\langle \phi(x), \phi(y)\rangle_{\mathcal{F}}\) in the (potentially very high-dimensional) feature space.
Note that there is a simplified alternative way to compute the kernel inner product (second step), \(\langle \phi(x), \phi(y)\rangle _{\mathcal{F}}\), directly, without explicitly computing the feature maps (first step). This requires an explicit relation, like the (quadratic and \(d^{th}\)-degree) polynomial kernel relations we saw earlier, e.g., \(\kappa(x, y) = \langle x, y\rangle_{\mathcal{X}}^2\) or \(\kappa(x, y) = \langle x, y\rangle_{\mathcal{X}}^d\).
It turns out that for each positive definite kernel \(\kappa(x, y)\), there always exists a feature map \(\phi: \mathcal{X} \to \mathcal{F}\) such that \(\kappa(x, y) = \langle \phi(x), \phi(y)\rangle _{\mathcal{F}}\). Conversely, since the feature space inner product \(\langle \cdot, \cdot\rangle _{\mathcal{F}}\) is positive definite, any kernel of the form \(\kappa(x, y) = \langle \phi(x), \phi(y)\rangle _{\mathcal{F}}\) is positive definite, \(\forall\ \phi:\mathcal{X} \to \mathcal{F}\).
Definition: A reproducing kernel Hilbert space (RKHS) \(\mathcal{H}\) is defined as the space of feature functions \(\phi:\mathcal{X} \to \mathcal{F}\), corresponding to reproducing positive semi-definite and symmetric kernels \(\kappa\)
\[\kappa: \mathcal{X}\times \mathcal{X} \longrightarrow \mathbb{C}\ \ ({\text{or }} \mathbb{R})\ ,\]
\[\text{Reproducing}: f(x)=\langle \kappa(x,\cdot),f(\cdot)\rangle_{\mathcal{H}}\ ,\ \ \forall f\in\mathcal{H}\ ;\] \[\text{Positive semi-definiteness}:\sum_{i,j=1}^n {c_i^*\ c_j\ \kappa(x_i,x_j)}\geq 0,\ \forall\ n\in\mathbb{N},\ \forall\ \{x_i\}_i \in \mathcal{X}, \forall\ \{c_i\}_i \in \mathbb{C} \ . \]
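As an illustrative check of the positive semi-definiteness condition (a snippet added here for concreteness, not part of the original analysis), the Gram matrix of a Gaussian kernel evaluated on a random point cloud has non-negative eigenvalues:
# Gram matrix of a positive definite kernel is positive semi-definite
set.seed(1)
Xpts <- matrix(rnorm(20), ncol = 2)                       # 10 random points in R^2
rbf <- function(s, t, sigma = 1) exp(-sum((s - t)^2)/(2*sigma^2))
G <- outer(1:nrow(Xpts), 1:nrow(Xpts),
           Vectorize(function(i, j) rbf(Xpts[i, ], Xpts[j, ])))
min(eigen(G, symmetric = TRUE)$values)                    # non-negative (up to numerical error)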
Two important RKHS properties are: (1) for every fixed \(x\in\mathcal{X}\), the kernel section \(\kappa(x,\cdot)\) is itself an element of \(\mathcal{H}\), providing the canonical feature map \(\phi_x=\kappa(x,\cdot)\); and (2) applying the reproducing property to these kernel sections yields
\[\langle \kappa (x, \cdot ),\ \kappa (x_o, \cdot )\rangle_{\mathcal{H}}\equiv \kappa (x, x_o)\ .\]
Thus, we do not need to explicate the feature map \(\phi_x = \kappa(x, \cdot )\) as it can be derived directly from the kernel \(\kappa\) by the above reproducing relation.
Note: Below we will demonstrate several examples of using kernel/RKHS approaches for regularized linear modeling of data (Ridge and LASSO regression) and show the notion of LASSO shrinkage on features (identification of salient variables associated with predicting a specific outcome of interest) and shrinkage on cases (selection of observation instances, i.e., participants, that play the most influential role in deriving model-based predictions of the outcome).
Let’s look at some hands-on parameter estimation examples of RKHS including a regularized linear model fitting (Ridge and LASSO) and simulation cases.
Let’s consider the DSPA amyotrophic lateral sclerosis (ALS) case-study, which consists of \(2,223\) observations and \(131\) numeric variables. We select ALSFRS slope as our outcome variable, as it captures the patients’ clinical decline over a year. In this simple demonstration, we’ll only use \(5\) covariates \(\{x_i\}_{i=1}^5\), \(x_1=\)Age_mean, \(x_2=\)Albumin_max, \(x_3=\)Albumin_median, \(x_4=\)Albumin_min, \(x_5=\)Albumin_range, to predict the outcome \(Y=ALSFRS\_slope\). This ALS case-study is intentionally chosen as it represents a tough modeling, prediction, and classification AI challenge.
We start by importing the raw data and identifying the design matrix \(X\) and the clinical outcome of interest, \(Y=ALSFRS\_slope\).
library(tidyverse)
# devtools::install_github("kupietz/kableExtra")
# See: https://stackoverflow.com/questions/76118194/error-when-loading-kableextra-in-markdown-file-after-updating-to-r-4-3-0
library(kableExtra)
ALS.train <- read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")
# summary(ALS.train); colnames(ALS.train)
y = ALS.train$ALSFRS_slope
X = as.matrix(ALS.train[ , 2:6])
X = apply(X, 2, scales::rescale, to = c(0, 1))
To avoid potential computationally singular results, we define a new function inverse()
, which will be used in place of the more commonly used base::solve()
.
Define a bivariate reproducing linear kernel operator, rk()
,
\(\kappa(s,t) = rk(s,t)=s'\cdot t=\langle s, t \rangle\), in this case a simple inner-product.
For a set of vectors \(S = \{x_1,x_2,\cdots, x_n\}\) representing an observed sample of \(n\) multivariable data points \(\{x_i\}\), the Gram matrix is defined as the kernel-matrix
\[G_{n\times n}=\left (G_{ij} \equiv \langle \phi(x_i), \phi(x_j) \rangle \equiv \underbrace{\kappa (x_i, x_j )}_{kernel} \right )\ ,\]
where the kernel function \(\kappa(\cdot,\cdot)\) computes the inner products in the feature space based on the feature map \(\phi(\cdot)\). Note that \(G_{ij}\equiv G_{ji},\ \forall\ 1\leq i,j\leq n\), and the symmetric Gram matrix, \(G'\equiv G\), contains all the information needed to compute the pairwise distances within the dataset. Indeed, the Gram matrix is a lossy representation of the complete original dataset (set of features). Despite the fact that the Gram matrix loses some information about the original dataset, it has several desirable properties; e.g., it is symmetric and positive semi-definite, and its \(n\times n\) size depends only on the number of observations, not on the (possibly very high) dimension of the feature space.
The power of the Gram matrix comes from its preservation of all pairwise distances between the observed data points, i.e., the vectors \(\{x_i\}_{i=1}^n\), and their \(\phi\)-transformed counterparts, \(\{\phi(x_i)\}_{i=1}^n\).
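For example, the feature-space distance between any two mapped points can be recovered from the Gram matrix entries alone,
\[\|\phi(x_i)-\phi(x_j)\|_{\mathcal{F}}^2 = \langle \phi(x_i), \phi(x_i)\rangle - 2\langle \phi(x_i), \phi(x_j)\rangle + \langle \phi(x_j), \phi(x_j)\rangle = G_{ii}-2G_{ij}+G_{jj}\ .\]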
In our examples, the gram(X, RK) function will represent a kernelized alternative to the standard (square transposed cross-product matrix) function \(tcrossprod(X)=tcrossprod(X,Y\equiv X)= X\times Y'\), which takes the cross-product of a matrix with its own transpose, i.e., \(X\times X'\). The gram() operator generalizes tcrossprod() so that it can be used with alternative kernels (RK) provided as arguments to gram().
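Since the definitions of the helper functions inverse(), rk(), and gram() are not echoed in this section, here is a minimal sketch of what they might look like (an assumption consistent with how they are called below, not necessarily the original implementations):
# assumed sketch of inverse(): a numerically safer stand-in for base::solve(),
# via the SVD-based (Moore-Penrose) pseudo-inverse
inverse <- function(M, tol = 1e-10) {
  s <- svd(M)
  d_inv <- ifelse(s$d > tol, 1/s$d, 0)      # zero-out near-singular directions
  s$v %*% diag(d_inv, nrow = length(d_inv)) %*% t(s$u)
}
# assumed sketch of rk(): bivariate linear reproducing kernel, kappa(s,t) = inner product of s and t
rk <- function(s, t) sum(s * t)
# assumed sketch of gram(): n x n Gram matrix of pairwise kernel evaluations over the rows of X
gram <- function(X, rkfunc = rk) {
  apply(X, 1, function(row_i) apply(X, 1, function(row_j) rkfunc(row_i, row_j)))
}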
Given a linear model \(Y=X\beta +\varepsilon\) and a value of the regularization hyperparameter, \(\lambda\), the GCV statistic is
\[V(\lambda)=\frac{\frac{1}{n}\lVert(I-A(\lambda))y\rVert^2}{ \left (\frac{1}{n}\cdot Tr(I-A(\lambda))\right )^2}\ ,\]
where \(n\) is the number of observations, \(H=X(X^TX)^{-1} X^T\) is the hat matrix, \(A(\lambda)=X(X^TX+n\lambda I)^{-1} X^T\) is the smoothed (regularized) hat matrix, and the fitted values of the linear least-squares regression model are \(\hat{y}=X\hat{\beta} = X(X^TX)^{-1} X^T y = Hy\).
Recall the two alternative forms of the generalized cross-validation (GCV) score, defined in terms of the residual sum of squares (RSS) or the deviance
\[GCV_{RSS} = \frac{RSS/n}{(1 -Tr(H(\lambda))/n)^2}\] \[GCV_{Deviance} = -2\frac{LogLikelihood/n}{(1 -Tr(H(\lambda))/n)^2} \ .\]
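For concreteness (an illustrative sketch, directly following the two formulas above), these GCV forms translate into simple helper functions:
# illustrative GCV helpers matching the RSS- and deviance-based formulas above
gcv_rss      <- function(rss, tr_H, n) (rss/n) / (1 - tr_H/n)^2
gcv_deviance <- function(loglik, tr_H, n) (-2*loglik/n) / (1 - tr_H/n)^2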
Here we will define the ridge(X, y, lambda, kernel) function, which computes the Gram matrix \(K\equiv G_{n\times n}\), forms the regularized (non-singular) matrix \(M=K+\lambda I\), computes its inverse, estimates the kernel ridge parameters, and calculates the GCV measure.
The R package glmnet already provides an efficient method for fitting elastic-net regularized linear models. We will define a de novo ridge() method that demonstrates RKHS approaches, using the Gram matrix to fit a ridge-regularized linear model and estimate the parameters (effect sizes).
Recall that if \(X\) is the design matrix of observed values of predicting-covariates and \(Y\) is the observed outcome vector, the linear model predictions (fitted-values) are expressed as
\[\underbrace{\hat{Y}}_{fitted\ values}=X\hat{\beta}=X(X'X)^{-1}X'Y\ , \]
where the effect-size estimates \(\hat{\beta}=\{\beta_i\}_i\) for all covariates \(\{x_i\}_i\) correspond to the specific objective (cost/loss) function used in the optimization process (e.g., ordinary least squares, or its ridge- or LASSO-penalized counterparts).
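For reference (a brief summary of one standard kernel ridge formulation, consistent with the ridge() implementation that follows; the complete derivation is in Appendix 5.3), the kernelized ridge problem and its closed-form solution can be written in terms of the Gram matrix \(K\),
\[\min_{\omega\in\mathbb{R}^n} \left\{ \|y - K\omega\|^2 + \lambda\, \omega' K \omega \right\} \ \Longrightarrow\ \tilde{\omega} = (K+\lambda I)^{-1}y,\quad \hat{f} = K\tilde{\omega},\quad A(\lambda)=K(K+\lambda I)^{-1}\ ,\]
where \(A(\lambda)\) is the hat matrix whose trace enters the GCV score.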
We implement the kernelized ridge regression as follows (see Appendix 5.3 for additional details):
ridge <- function(X, y, lambda, kernel=rk) {
  # X is assumed to be n*d, with the n observations stacked in the rows
  K = gram(X, kernel)               # n x n Gram matrix (plays the role of phi^T phi)
  n = length(y)
  # no intercept column is needed, since the kernel representation already absorbs it
  M = K + lambda*diag(dim(K)[1])    # regularized matrix (K + lambda*I), protects against singularity
  M_inv = inverse(M)                # invert M
  omega_tilde = M_inv %*% y         # kernel ridge coefficients (K + lambda*I)^{-1} y
  f_hat = K %*% omega_tilde         # fitted values
  A = K %*% M_inv                   # hat matrix mapping the response to the fitted values
  tr_A = sum(diag(A))               # trace of the hat matrix
  rss = crossprod(y - f_hat)        # residual sum of squares
  gcv = n*rss / (n - tr_A)^2        # GCV score
  return(list(f_hat = f_hat, omega_tilde = omega_tilde, gcv = gcv, rss=rss))
}
Using the ALS dataset, we will implement a direct search for the GCV-optimal ridge regularization parameter \(\lambda\) to predict the clinical outcome ALSFRS_slope. We will use the R package furrr and request multiple cores (\(workers = 8\)) in the call to furrr::future_map(). Using multicore processing will significantly expedite this massive search calculation.
We will also use LASSO (alpha = 1) regularization on the same ALS test case; both the ridge and LASSO fits below rely on a Laplacian reproducing kernel operator, rk_Laplace(), \[\kappa(s,t) = rk(s,t)= e^{-\left (\frac{\lVert s - t\rVert}{\sigma}\right )}\ .\]
rk_Laplace <- function(s, t, sigma=1) {
  if (sigma <= 0) sigma=1                    # guard against non-positive bandwidths
  init_len = length(s)
  rk_Lap = 0
  for (i in 1:init_len) { rk_Lap = (s[i] - t[i])^2 + rk_Lap }   # squared Euclidean distance
  rk_Lap = exp(-(sqrt(rk_Lap))/sigma)        # Laplacian kernel exp(-||s-t||/sigma)
  return(rk_Lap)
}
Next we fit the Ridge model and plot the trajectories of the parameter estimates across \(\lambda\).
lambda = 10^seq(-6, 2, by = 0.5)
library(furrr)
# For classical sequential parameter search use:
# gcv_search = map(lambda, function(lam) ridge(X, y, lam))
# For multicore execution use:
plan(multisession, workers = 8)
gcv_search <- future_map(lambda, function(lam) ridge(X, y, lam,rk_Laplace))
V = map_dbl(gcv_search, function(x) x$gcv)
ridge_coefs = map_df(gcv_search, function(x) {
data.frame(value = x$omega_tilde, coef =as.character(1:2223))
},
.id = 'iter') %>%
mutate(lambda = lambda[as.integer(iter)])
library(patchwork)
library(plotly)
# gcv_plot = qplot(lambda, V, geom='line', main='GCV score (log scale)', ylab='GCV') +
# scale_x_log10()
plot_ly(x=lambda, y=V, type="scatter", mode="lines") %>%
layout(title="(Ridge) GCV vs. Log10(Lambda)",
xaxis = list(title="log10(Lambda)", type = "log"))
# beta_plot = ridge_coefs %>%
# ggplot(aes(x = lambda, y = value, color = coef)) +
# geom_line() +
# scale_x_log10() +
# scico::scale_color_scico_d(end = 0.8) +
# labs(title = 'Model Effects (Betas) Across Lambda')
# ggplotly(gcv_plot + beta_plot)
plot_ly(data=ridge_coefs[as.numeric(ridge_coefs$coef)<20,], x=~lambda, y=~value,
type="scatter", mode="lines", color=~coef, name=~coef) %>%
layout(title="(Ridge Estimates) Effects (Beta's) vs. log10(Lambda)",
xaxis = list(title="", type = "log"), # title="log10(Lambda)",
legend = list(title="Predicting Covariates", orientation = 'h'))
The plot allows manual zooming to confirm that instances (case observations) \(4\) and \(18\) appear to be the most dominant contributors to the case-contribution weights in the regularized linear model, among the first \(20\) participants. By zooming in or deselecting these two instances \(\{4,18\}\), we can explore the canonical regularization path trajectories.
The following example uses small numbers to reduce the extreme computational burden and expedite the computational estimates in compiling the HTML output from the Rmd source. In practice, larger numbers will provide more precise estimates. The resulting RSS are reported in tables.
# Matern kernel function (as a function of the distance x)
maternkernel <- function(sigma, nu, rho, x){
  scaled_x = sqrt(2*nu)*x/rho
  return(sapply(scaled_x, besselk_nu(nu)) * scaled_x**nu * sigma**2 * 2**(1-nu)/gamma(nu))
}
# Matern reproducing kernel with a fixed smoothness nu, applied to the Euclidean distance between s and t
maternkernel_nu_mod <- function(nu) {
  return(function(s, t) {
    init_len = length(s)
    rk_dist = 0
    for (i in 1:init_len) { rk_dist = (s[i] - t[i])^2 + rk_dist }   # squared distance
    rk_Mat = maternkernel(1, nu, 1, sqrt(rk_dist) + 1e-6)           # small offset avoids evaluating K_nu at 0
    return(rk_Mat)
  } )
}
# default Matern reproducing kernel (nu = 0.5, the Laplacian/exponential special case)
rk_Matern <- function(s, t, nu=0.5) {
  init_len = length(s)
  rk_dist = 0
  for (i in 1:init_len) { rk_dist = (s[i] - t[i])^2 + rk_dist }
  rk_Mat = maternkernel(1, nu, 1, sqrt(rk_dist) + 1e-6)
  return(rk_Mat)
}
# polynomial reproducing kernel of degree d, (1 + s.t)^d
poly_d_mod <- function(d) {
  return(function(s, t) {
    init_len = length(s)
    rk = 1
    for (i in 1:init_len) { rk = s[i]*t[i] + rk }
    return(rk**d)
  } )
}
Next, we will expand the previous standard Ridge model by using an RKHS and specifying alternative kernels in the estimation of the model coefficients. First, we use the linear kernel over the regularization grid \(\lambda\in \{0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 10\}\).
#Hyperparameter set 1 - Linear Kernel
library(foreach)
library(doParallel)
registerDoParallel(8)
# fitted_ridge1<- ridge(X,y,lambda=10)
# fitted_ridge2<- ridge(X,y,lambda=1)
# fitted_ridge3<- ridge(X,y,lambda=0.1)
# fitted_ridge4<- ridge(X,y,lambda=0.01)
lambda <- c(0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 10)
fitted_ridge <- vector()
fitted_ridge <- foreach (param = 1:length(lambda), .combine = 'c') %dopar% {
ridge(X,y,lambda=lambda[param])
}
stopImplicitCluster()
df <- as.data.frame(fitted_ridge)
df_tbl <- rbind(c(df$gcv[1], df$gcv.1[1], df$gcv.2[1], df$gcv.3[1],
                  df$gcv.4[1], df$gcv.5[1], df$gcv.6[1], df$gcv.7[1]),
                c(df$rss[1], df$rss.1[1], df$rss.2[1], df$rss.3[1],
                  df$rss.4[1], df$rss.5[1], df$rss.6[1], df$rss.7[1]))
colnames(df_tbl) <- c(paste0("lambda=", lambda[1]), paste0("lambda=", lambda[2]),
paste0("lambda=", lambda[3]), paste0("lambda=", lambda[4]),
paste0("lambda=", lambda[5]), paste0("lambda=", lambda[6]),
paste0("lambda=", lambda[7]), paste0("lambda=", lambda[8]))
rownames(df_tbl) <- c("GCV", "RSS")
library(DT)
datatable(df_tbl, caption="Linear Ridge Kernel (parameter estimates)")
# fitted_ridge1$rss
# fitted_ridge2$rss
# fitted_ridge3$rss
# fitted_ridge4$rss
For \(\lambda=0.1\) and \(\nu\in \{0.01, 0.1, 0.5, 1, 1.5, 2, 2.5, 10\}\).
# Hyperparameter set 2 - Matern kernel
# matern1 <- ridge(X,y,lambda=0.1,kernel=rk_Matern)
# matern2 <- ridge(X,y,lambda=0.1,kernel=maternkernel_nu_mod(1))
# matern3 <- ridge(X,y,lambda=0.1,kernel=maternkernel_nu_mod(0.01))
# matern4 <- ridge(X,y,lambda=0.1,kernel=maternkernel_nu_mod(1.5))
# matern5 <- ridge(X,y,lambda=0.1,kernel=maternkernel_nu_mod(2.5))
# matern6 <- ridge(X,y,lambda=0.1,kernel=maternkernel_nu_mod(10))
# matern7 <- ridge(X,y,lambda=0.1,kernel=maternkernel_nu_mod(0.5))
registerDoParallel(8)
nu <- c(0.01, 0.1, 0.5, 1, 1.5, 2, 2.5, 10)
fitted_ridgeMatern <- vector()
fitted_ridgeMatern <-
foreach (param = 1:(length(nu)+1), .combine = 'c') %dopar% {
besselk_nu <- function(nu) {
return(function(x) { return(GPBayes::BesselK(nu, x)) }) }
if (param==1) ridge(X, y, lambda=0.1, kernel=rk_Matern)
else ridge(X, y, lambda=0.1, kernel=maternkernel_nu_mod(nu[param-1]))
}
stopImplicitCluster()
df <- as.data.frame(fitted_ridgeMatern)
df_tbl <- rbind(c(df$gcv[1], df$gcv.1[1], df$gcv.2[1], df$gcv.3[1], df$gcv.4[1],
                  df$gcv.5[1], df$gcv.6[1], df$gcv.7[1], df$gcv.8[1]),
                c(df$rss[1], df$rss.1[1], df$rss.2[1], df$rss.3[1], df$rss.4[1],
                  df$rss.5[1], df$rss.6[1], df$rss.7[1], df$rss.8[1]))
colnames(df_tbl) <- c(paste0("nu=", 0.5), paste0("nu=", nu[1]), paste0("nu=", nu[2]),
paste0("nu=", nu[3]), paste0("nu=", nu[4]),
paste0("nu=", nu[5]), paste0("nu=", nu[6]),
paste0("nu=", nu[7]), paste0("nu=", nu[8]))
rownames(df_tbl) <- c("GCV", "RSS")
library(DT)
datatable(df_tbl, caption="Matern Ridge Kernel (parameter estimates)")
We can also use any of the other kernels (e.g., polynomial) to fit RKHS Ridge models and compare their performance across different kernels, and hyperparameters (e.g., \(\nu\), \(d\)).
#Hyperparameter set 3 - Polynomial kernel
# poly1 <- ridge(X,y,lambda=0.1,kernel=poly_d_mod(3/2))
# poly2 <- ridge(X,y,lambda=0.1,kernel=poly_d_mod(1))
# poly3 <- ridge(X,y,lambda=0.1,kernel=poly_d_mod(0.9))
# poly4 <- ridge(X,y,lambda=0.1,kernel=poly_d_mod(0.7))
# poly5 <- ridge(X,y,lambda=0.1,kernel=poly_d_mod(0.5))
# poly6 <- ridge(X,y,lambda=0.1,kernel=poly_d_mod(0.3))
registerDoParallel(7)
poly_d <- c(0.3, 0.5, 0.7, 0.9, 1.5, 2.5, 10)
fitted_ridgePoly <- vector()
fitted_ridgePoly <-
foreach (param = 1:length(poly_d), .combine = 'c') %dopar% {
ridge(X, y, lambda=0.1, kernel=poly_d_mod(poly_d[param]))
}
stopImplicitCluster()
df <- as.data.frame(fitted_ridgePoly)
df_tbl <- rbind(c(df$gcv[1], df$gcv.1[1], df$gcv.2[1], df$gcv.3[1], df$gcv.4[1],
df$gcv.5[1], df$gcv.6[1]),
c(df$rss[1], df$rss.1[1], df$rss.2[1], df$rss.3[1], df$rss.4[1],
df$rss.5[1], df$rss.6[1]))
colnames(df_tbl) <- c(paste0("d=", poly_d[1]), paste0("d=", poly_d[2]),
paste0("d=", poly_d[3]), paste0("d=", poly_d[4]),
paste0("d=", poly_d[5]), paste0("d=", poly_d[6]),
paste0("d=", poly_d[7]))
rownames(df_tbl) <- c("GCV", "RSS")
library(DT)
datatable(df_tbl, caption="Polynomial Ridge Kernel (parameter estimates)")
Finally, we can select the best model corresponding to the optimal regularization parameter \(\lambda=\lambda_{[which.min(V)]}\) associated with minimization of the GCV search and obtain the corresponding effect size estimates \(\beta\)’s, for example, by refitting the Ridge model with a linear kernel.
plan(multisession, workers = 8)
X_augmented <- cbind(1, X) # augmented design matrix
gcv_search <- future_map(lambda, function(lam) ridge(X_augmented, y, lam))
V = map_dbl(gcv_search, function(x) x$gcv)
ridge_coefs = map_df(gcv_search, function(x) {
data.frame(value = x$omega_tilde, coef =as.character(1:2223))
},
.id = 'iter') %>% mutate(lambda = lambda[as.integer(iter)])
fit_ridge <- ridge(X_augmented, y, lambda[which.min(V)]) # fit optimal model
omega_tilde <- fit_ridge$omega_tilde
beta_hat <- crossprod(omega_tilde, X) # recover the covariate effect sizes, beta_hat = X' omega_tilde (linear kernel)
Next, we will compare our manual ridge-regularized linear model parameter estimation strategy (using RKHS) against the classical glmnet() gold standard. Recall that setting the glmnet hyperparameter alpha = 0 or alpha = 1 corresponds to ridge (\(L_2\)) or LASSO (\(L_1\)) regression, respectively.
library(glmnet)
# c(beta_0, beta_hat)
fit_glmnet = glmnet(X, y, alpha = 0, lambda=lambda, standardize = FALSE)
The following table contrasts the manual RKHS-Ridge and the automated GLM-Net Ridge regression effect size estimates for all \(5\) covariate predictors of the outcome ALSFRS_slope. Theoretically, the manual estimates are not expected to exactly match the GLM-Net Ridge estimates, as \(X^TX\neq I\).
kable_df = function(..., digits =3, full_width = FALSE)
kable(..., digits=digits) %>% kable_styling(full_width = full_width)
rbind(RKHS_Ridge = c(beta_hat), # -1 to skip over the intercept estimation
GLMNet_Ridge = (coef(fit_glmnet)[, which.max(fit_glmnet$dev.ratio)])[-1]) %>%
kable_df()
|  | Age_mean | Albumin_max | Albumin_median | Albumin_min | Albumin_range |
|---|---|---|---|---|---|
| RKHS_Ridge | 0.042 | 1.277 | -0.313 | -0.246 | -4.674 |
| GLMNet_Ridge | 0.042 | 1.304 | -0.319 | -0.261 | -4.747 |
Again, we will define a manual lasso() method de novo using RKHS modeling and the Gram matrix, fitting a LASSO-regularized linear model to estimate the parameters (effect sizes). For the theoretical derivation of kernelized LASSO regression, see Appendix 5.4.
There is a distinction between shrinkage on observations and shrinkage on features. We run some experiments showing a possible way of implementing kernelized LASSO with shrinkage on the features or on the instances. Then, we will compare the results with the classical glmnet approach for efficient elastic-net regularized linear modeling.
To demonstrate participant selection, i.e., instance or case shrinkage, we’ll use the following linear kernel implementation.
library(MASS)
# kernelized LASSO with shrinkage on the observations (cases)
lasso_observation <- function(X, y, lambda=0.1, kernel=rk) {
  # X is assumed to be n*d, with the n observations stacked in the rows
  K = gram(X, kernel)                       # n x n Gram matrix (plays the role of phi^T phi)
  n = length(y)
  # no intercept column is needed, since the kernel representation already absorbs it
  alpha_hat = inverse(K) %*% y              # unpenalized coefficients K^{-1} y
  for (j in 1:n) {                          # componentwise soft-thresholding (explicit LASSO shrinkage)
    alpha_hat[j] = alpha_hat[j]*max(0, 1 - length(alpha_hat)*lambda/abs(alpha_hat[j]))
  }
  M = K%*%K + lambda*ginv(diag(as.list(abs(alpha_hat))))   # K'K + lambda*W^-, with W = diag(|alpha|)
  M_inv = inverse(M)                        # invert M
  f_hat = K %*% alpha_hat                   # fitted values
  A = K %*% M_inv %*% K                     # hat matrix K (K'K + lambda*W^-)^{-1} K
  tr_A = sum(diag(A))                       # trace of the hat matrix
  rss = crossprod(y - f_hat)                # residual sum of squares
  gcv = n*rss / (n - tr_A)^2                # GCV score
  return(list(f_hat = f_hat, alpha_hat = alpha_hat, gcv = gcv, rss=rss))
}
The prediction procedure is essentially \[\hat{y}=K S_{\lambda}(K^{-1}y)\ .\] Notice that when \(\lambda=0\), this degenerates into a perfect reconstruction. As a benchmark, the resulting RSS and GCV values are \(34.57\) and \(11.71\), respectively.
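Here \(S_{\lambda}(\cdot)\) denotes the componentwise soft-thresholding operator; in the multiplicative form used by the implementation above (with \(n\) coefficients), it is
\[S_{\lambda}(\alpha)_j = \alpha_j\ \max\left(0,\ 1-\frac{n\lambda}{|\alpha_j|}\right) = \operatorname{sign}(\alpha_j)\ \max\left(0,\ |\alpha_j|-n\lambda\right)\ , \quad j=1,\ldots,n\ . \]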
# when lambda is 0
fitted_lambda0 <- lasso_observation(X, y, 0,rk_Laplace)
fitted_lambda0$gcv
fitted_lambda0$rss
Next, we run over the grid space of the \(\lambda\) parameter.
lambda = 10^seq(-9, 2, by = 0.5)
library(furrr)
# For classical sequential parameter search use:
# gcv_search = map(lambda, function(lam) ridge(X, y, lam))
# For multicore execution use:
plan(multisession, workers = 12)
gcv_search <- future_map(lambda, function(lam) lasso_observation(X, y, lam,rk_Laplace))
V_lasso = map_dbl(gcv_search, function(x) x$gcv)
lasso_coefs = map_df(gcv_search, function(x) {
data.frame(value = x$alpha_hat, coef =as.character(1:2223))
},
.id = 'iter') %>% mutate(lambda = lambda[as.integer(iter)])
We observe that the GCV score initially decreases from its \(\lambda=0\) baseline value, \(GCV(\lambda=0)= 11.71\), and then alternates between increases and decreases.
library(patchwork)
library(plotly)
RSS_lasso = map_dbl(gcv_search, function(x) x$rss)
plot_ly(x=lambda, y=RSS_lasso, type="scatter", mode="lines") %>%
layout(title="(LASSO) RSS (Observation) vs. Log10(Lambda)",
xaxis = list(title="log10(Lambda)", type = "log"))
# gcv_plot = qplot(lambda, V, geom='line', main='GCV score (log scale)', ylab='GCV') +
# scale_x_log10()
plot_ly(x=lambda, y=V_lasso, type="scatter", mode="lines") %>%
layout(title="(LASSO) GCV (Observation) vs. Log10(Lambda)",
xaxis = list(title="log10(Lambda)", type = "log"))
# beta_plot = ridge_coefs %>%
# ggplot(aes(x = lambda, y = value, color = coef)) +
# geom_line() +
# scale_x_log10() +
# scico::scale_color_scico_d(end = 0.8) +
# labs(title = 'Model Effects (Betas) Across Lambda')
# ggplotly(gcv_plot + beta_plot)
plot_ly(data=lasso_coefs[as.numeric(lasso_coefs$coef)<20,], x=~lambda, y=~value,
type="scatter", mode="lines", color=~coef, name=~coef) %>%
layout(title="LASSO Effect Estimates (Beta's) vs. log10(Lambda)",
xaxis = list(title="", type = "log"), # title="log10(Lambda)",
legend = list(title="Predicting Covariates", orientation = 'h'))
index <- 1
meta_matrix <- matrix(0,nrow=2223,ncol=length(lambda)) # 2223 is the number of rows
for (i in lambda){
meta_matrix[,index] <- (lasso_coefs[lasso_coefs$lambda==i,]$value ==0)
index <- index + 1
}
library(DT)
# image(t(meta_matrix),xlab="Lambda values", yaxt="n",xaxt="n")
# axis(3, at=seq(0,1, length=length(lambda)), labels=as.character(signif(lambda,2)))
plot_ly(z=t(meta_matrix), type="heatmap") %>%
layout(title="Lambda values (x-axis) vs. LASSO Coefficient Estimates (y-axis)",
xaxis = list(tickvals = seq(0, dim(meta_matrix)[1], length=length(lambda)),
ticktext = c(as.character(signif(lambda,2)))))
The large lambda values (on the right) produce trivial (zero) effect size estimates, whereas the smaller lambda values (on the left) yield nonzero effect size estimates reflecting the importance of the observation instances (cases) as contributors to the LASSO model prediction. Of course, due to the shrinkage nature of LASSO, as lambda increases, all observation effect sizes are discounted and eventually thresholded to \(0\).
Selection of salient features (variable selection) is independent from the prior search for influential cases (observations) that impact the LASSO model prediction.
Now we focus on the traditional feature selection using LASSO.
# compute the kernel evaluations kappa(x_i, y) between each row of X and the vector y
kernelized_y <- function(X, y, rkfunc = rk) {
  apply(X, 1, function(Row)
    rkfunc(Row, y)           # rkfunc specifies the reproducing kernel
  )
}
library(MASS)
# kernelized LASSO with shrinkage on the features (variable selection)
lasso_feature <- function(X, y, lambda=0.1, kernel=rk) {
  # X is assumed to be n*d with observations in the rows; the Gram matrix is
  # computed over the d features (columns), i.e., over the rows of t(X)
  K = gram(t(X), kernel)                    # d x d feature Gram matrix
  n = ncol(X)
  kernel_yval <- kernelized_y(t(X), y, kernel)   # kernel evaluations between each feature and y
  beta_hat = inverse(K) %*% kernel_yval     # unpenalized coefficients K^{-1} kappa(X', y)
  for (j in 1:n) {                          # componentwise soft-thresholding (explicit LASSO shrinkage)
    beta_hat[j] = beta_hat[j]*max(0, 1 - length(beta_hat)*lambda/abs(beta_hat[j]))
  }
  M = K%*%K + lambda*ginv(diag(as.list(abs(beta_hat))))    # K'K + lambda*W^-, with W = diag(|beta|)
  M_inv = inverse(M)                        # invert M
  f_hat = K %*% beta_hat                    # fitted (kernelized) values
  A = K %*% M_inv %*% K                     # hat matrix K (K'K + lambda*W^-)^{-1} K
  tr_A = sum(diag(A))                       # trace of the hat matrix
  rss = crossprod(kernel_yval - f_hat)      # residual sum of squares (in the kernelized scale)
  gcv = n*rss / (n - tr_A)^2                # GCV score
  return(list(f_hat = f_hat, beta_hat = beta_hat, gcv = gcv, rss=rss))
}
Let’s examine the GCV and RSS performance metrics.
Fit the LASSO feature-selection model over the \(\lambda\) space partition (grid).
lambda = 10^seq(-9, 2, by = 0.5)
library(furrr)
# For classical sequential parameter search use:
# gcv_search = map(lambda, function(lam) ridge(X, y, lam))
# For multicore execution use:
plan(multisession, workers = 12)
gcv_search <- future_map(lambda, function(lam) lasso_feature(X_augmented, y, lam))
V_lasso = map_dbl(gcv_search, function(x) x$gcv)
lasso_coefs = map_df(gcv_search, function(x) {
data.frame(value = x$beta_hat, coef =as.character(1:6))
},
.id = 'iter') %>% mutate(lambda = lambda[as.integer(iter)])
Plot the resulting effect-size (beta) estimate trajectories.
library(patchwork)
library(plotly)
RSS_lasso = map_dbl(gcv_search, function(x) x$rss)
plot_ly(x=lambda, y=RSS_lasso, type="scatter", mode="lines") %>%
layout(title="(LASSO) RSS (Feature) vs. Log10(Lambda)",
xaxis = list(title="log10(Lambda)", type = "log"))
# gcv_plot = qplot(lambda, V, geom='line', main='GCV score (log scale)', ylab='GCV') +
# scale_x_log10()
plot_ly(x=lambda, y=V_lasso, type="scatter", mode="lines") %>%
layout(title="(LASSO) GCV (Feature) vs. Log10(Lambda)",
xaxis = list(title="log10(Lambda)", type = "log"))
# beta_plot = ridge_coefs %>%
# ggplot(aes(x = lambda, y = value, color = coef)) +
# geom_line() +
# scale_x_log10() +
# scico::scale_color_scico_d(end = 0.8) +
# labs(title = 'Model Effects (Betas) Across Lambda')
# ggplotly(gcv_plot + beta_plot)
plot_ly(data=lasso_coefs, x=~lambda, y=~value,
type="scatter", mode="lines", color=~coef, name=~coef) %>%
layout(title="(LASSO Estimates) Effects (Beta's) vs. log10(Lambda)",
xaxis = list(title="", type = "log"), # title="log10(Lambda)",
legend = list(title="Predicting Covariates", orientation = 'h'))
index <- 1
meta_matrix <- matrix(0,nrow=6,ncol=length(lambda)) # 6 rows: the intercept + 5 covariate features
for (i in lambda){
meta_matrix[,index] <- (lasso_coefs[lasso_coefs$lambda==i,]$value ==0)
index <- index + 1
}
library(DT)
# image(t(meta_matrix),xlab="Lambda values", yaxt="n",xaxt="n")
# axis(3, at=seq(0,1, length=length(lambda)), labels=as.character(signif(lambda,2)))
plot_ly(z=t(meta_matrix), type="heatmap") %>%
layout(title="Lambda values (x-axis) vs. LASSO Coefficient Estimates (y-axis)",
xaxis = list(tickvals = seq(0, dim(meta_matrix)[1], length=length(lambda)),
ticktext = c(as.character(signif(lambda,2)))))
Direct comparisons of RSS measures across different kernels are challenging, as the corresponding feature maps \(\phi(y)\) differ and, hence, the RKHS embeddings live on different scales. For instance, the scale of the Laplace-kernel feature map may be very small compared to the larger scale of the feature map associated with the linear kernel. There is no uniform criterion for interpreting performance metrics such as RSS across different RKHS embedding spaces.
As a sanity check at the end, we can select the best model corresponding to the optimal regularization parameter \(\lambda=\lambda_{[which.min(V)]}\) associated with minimization of the GCV search and obtain the corresponding effect size estimates \(\beta\)’s. This will allow us to contrast our implementation of LASSO kernel feature-selection shrinkage against the linear dot-product kernel, which roughly matches the glmnet() approach. Again, these will not be identical, as \(X^TX\neq I\).
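The refitting step that produces the RKHS-LASSO beta_hat used in the comparison table below is not echoed in this section; a plausible version of this step (an assumption consistent with the objects defined above, not necessarily the original code) is:
# assumed (not echoed above): refit the feature-shrinkage LASSO at the GCV-optimal lambda
fit_lasso <- lasso_feature(X_augmented, y, lambda[which.min(V_lasso)])
beta_hat <- fit_lasso$beta_hat   # 6 components: the intercept + the 5 covariates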
Let’s compare the manual LASSO-regularized linear model parameter estimation strategy (using RKHS) against the classical glmnet(), as a gold standard. Recall that setting the glmnet hyperparameter alpha = 1 corresponds to LASSO (\(L_1\)) regression.
library(glmnet)
# c(beta_0, beta_hat)
fit_glmnet = glmnet(X, y, alpha = 1, lambda=lambda, standardize = FALSE)
The following table compares the manual RKHS-LASSO and the automated GLM-Net LASSO regression effect size estimates for all \(5\) covariate predictors of the outcome ALSFRS_slope.
kable_df = function(..., digits =3, full_width = FALSE)
kable(..., digits=digits) %>% kable_styling(full_width = full_width)
rbind(RKHS_LASSO = c(beta_hat),
GLMNet_LASSO = coef(fit_glmnet)[, which.max(fit_glmnet$dev.ratio)]) %>%
kable_df()
|  | (Intercept) | Age_mean | Albumin_max | Albumin_median | Albumin_min | Albumin_range |
|---|---|---|---|---|---|---|
| RKHS_LASSO | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | -2.861 |
| GLMNet_LASSO | -0.518 | 0.042 | 1.308 | -0.319 | -0.264 | -4.758 |
Compare these LASSO (effect-size) \(\beta\) estimates against their Ridge counterparts, which we computed earlier for the same ALS data.
For this example we’ll simulate a spirals dataset representing 2D point-cloud data of two spirals corrupted with Gaussian noise, using the functions kernlab::spirals() and mlbench::mlbench.spirals().
### Binary Cross-Entropy (BCE) loss function
The binary cross-entropy (BCE) loss function, also called Log Loss or logistic loss, can be used for binary outcome predictions
\[BCE\equiv L_{\log}(y,p) = -\left ( y\ \log(p) + (1-y)\ \log(1-p) \right )\ , \]
where \(y\) is the true binary label (e.g., \(0\) or \(1\)) and \(p\) is the model-based probability estimate, \(0\leq p=Prob(y=1)\leq 1\). A more general Balanced Log Loss includes weight coefficients \(w(y_i)\)
\[Balanced\ BCE\equiv -\frac{1}{n}\sum_{i=1}^n { w(y_i)\left ( y_i\ \log(p(y_i)) + (1-y_i)\ \log(1-p(y_i)) \right )}\ , \]
where \(n\) is the total number of samples, \(y_i\) is the true binary label of the \(i^{th}\) sample (\(0\) or \(1\)), \(p(y_i)\) is the predicted probability of the \(i^{th}\) sample being of class \(1\), and the weights \(w(y_i)\) for class \(y_i\) are appropriately chosen, e.g., the inverse of the class frequencies.
To compute the Gram matrix, we first need to define the spline reproducing kernel function rk_spline(), implementing the special case of a piecewise cubic polynomial
\[\kappa(x, y) = \prod_{i=1}^n\left (1+ x_i y_i (1+\min(x_i,y_i))- \frac{(x_i+y_i)\min^2(x_i,y_i)}{2} +\frac{\min^3(x_i,y_i)}{3}\right )\ .\]
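The definition of rk_spline() is not echoed in this section; a minimal sketch implementing the piecewise-cubic spline kernel above (an assumption, not necessarily the original code) is:
# assumed sketch of rk_spline(): piecewise-cubic (spline) reproducing kernel
rk_spline <- function(s, t) {
  m <- pmin(s, t)
  prod(1 + s*t*(1 + m) - (s + t)*m^2/2 + m^3/3)
}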
Next we define the smoothing_spline() function using the Gram matrix and the RKHS rk_spline() function.
# May need to use BCE loss function (instead of RSS)
# binary_cross_entropy(pred, data.valid,loss.unit=c("individuals","L2 units"), y, L2.unit)
# https://rdrr.io/cran/autoMrP/src/R/utils.R
###########################################################################
# binary cross-entropy ----------------------------------------------------
###########################################################################
binary_cross_entropy <- function(dataProb, trueLabels, positiveClassLabel="1",
w_positive=0.5, w_negative=0.5) {
# positiveClassLabel is the positive class
# Extract true labels and predicted probabilities
truth <- as.numeric(trueLabels == positiveClassLabel)
prob <- dataProb # Probabilities for positive class
# Calculate the BCE balanced log loss
BCE <- -mean(
(truth * log(prob) * w_positive) + ((1 - truth) * log(1 - prob) * w_negative)
)
return(BCE)
}
# Normalize all probability estimates to [0-e, 1-e]
probNormalize <- function(p) {
epsil <- 1e-4 # avoid singularities
m <- min(p)
M <- max(p)
return ( (p-m+epsil)/(M-m+2*epsil) )
}
# # ################# BCE test
# # # smoothing_spline(x, y, lam)
# X = x
# y = as.numeric(labels)
#
# GramMatrix = gram(X, rkfunc = rk_spline) # GramMatrix matrix (nxn)
#
# n = length(y)
# J = cbind(1, X) # matrix with a basis for the null space of the penalty
# Q = cbind(J, GramMatrix) # design matrix
# m = ncol(J) # dimension of the null space of the penalty
#
# S = matrix(0, n + m, n + m) # initialize S
# S[(m + 1):(n + m), (m + 1):(n + m)] = GramMatrix # non-zero part of S
#
# M = crossprod(Q) + lambda[1]*S
# M_inv = inverse(M) # inverse of M
#
# gamma_hat = crossprod(M_inv, crossprod(Q, y))
# f_hat = Q %*% gamma_hat
#
# A = Q %*% M_inv %*% t(Q)
# tr_A = sum(diag(A)) # trace of hat matrix
#
# # rss = crossprod(y - f_hat) # BCE loss
# # pred <- as.factor(ifelse (f_hat > 1.5, "1", "2"))
# pred <- probNormalize(f_hat - 1)
# BCE <- binary_cross_entropy(dataProb=pred[, 1], trueLabels=as.character(labels),
# positiveClassLabel="1")
#########################################
# Numerical output (regression spline)
#########################################
smoothing_spline <- function(X, y, lambda) {
GramMatrix = gram(X, rkfunc = rk_spline) # GramMatrix matrix (nxn)
n = length(y)
J = cbind(1, X) # matrix with a basis for the null space of the penalty
Q = cbind(J, GramMatrix) # design matrix
m = ncol(J) # dimension of the null space of the penalty
S = matrix(0, n + m, n + m) # initialize S
S[(m + 1):(n + m), (m + 1):(n + m)] = GramMatrix # non-zero part of S
M = crossprod(Q) + lambda*S
M_inv = inverse(M) # inverse of M
gamma_hat = crossprod(M_inv, crossprod(Q, y))
f_hat = Q %*% gamma_hat
A = Q %*% M_inv %*% t(Q)
tr_A = sum(diag(A)) # trace of hat matrix
rss = crossprod(y - f_hat) # residual sum of squares (RSS)
gcv = n * rss/(n - tr_A)^2 # compute GCV score
return(list(f_hat = f_hat, gamma_hat = gamma_hat, gcv = gcv))
}
#########################################
# Categorical output (binary cross-entropy spline)
#########################################
smoothing_BCE_spline <- function(X, y, lambda) {
GramMatrix = gram(X, rkfunc = rk_spline) # GramMatrix matrix (nxn)
n = length(y)
J = cbind(1, X) # matrix with a basis for the null space of the penalty
Q = cbind(J, GramMatrix) # design matrix
m = ncol(J) # dimension of the null space of the penalty
S = matrix(0, n + m, n + m) # initialize S
S[(m + 1):(n + m), (m + 1):(n + m)] = GramMatrix # non-zero part of S
M = crossprod(Q) + lambda*S
M_inv = inverse(M) # inverse of M
gamma_hat = crossprod(M_inv, crossprod(Q, y))
f_hat = Q %*% gamma_hat
A = Q %*% M_inv %*% t(Q)
tr_A = sum(diag(A)) # trace of hat matrix
# rss = crossprod(y - f_hat) # BCE loss
# pred <- as.factor(ifelse (f_hat > 1.5, "1", "2"))
# convert fitted values to probabilities in [0, 1] for BCE loss estimation
pred <- probNormalize(f_hat - 1) # 1 <= f_hat <=2
BCE <- binary_cross_entropy(dataProb=pred[, 1], trueLabels=as.character(labels),
positiveClassLabel="1")
return(list(f_hat = f_hat, gamma_hat = gamma_hat, BCE = BCE))
}
Now we can fit the RKHS spline model for multiple \(\lambda\) values and optimize for the GCV metric to select the optimal \(\lambda\) parameter.
lambda = 10^seq(-6, 0, by = 0.5)
# For classical sequential parameter search use:
# gcv_search = map(lambda, function(lam) smoothing_spline(x, y, lam))
# For quicker multicore execution use:
plan(multisession, workers = 8)
gcv_search = future_map(lambda, function(lam) smoothing_spline(x, y, lam))
V = map_dbl(gcv_search, function(x) x$gcv)
## BCE Metric
plan(multisession, workers = 8)
BCE_search = future_map(lambda, function(lam) smoothing_BCE_spline(x, y, lam))
V_BCE = map_dbl(BCE_search, function(x) x$BCE)
Plot of GCV across search space of \(\lambda\) values.
Again, we can compare and contrast the manual spline RKHS model fit against a standard model, in this case a generalized additive model (GAM) estimated using the function mgcv::gam(). See the gam model specification format.
The graph below shows the raw spirals simulated data (as scatter points color coded by their true arc-type class labels), and three different models - Spline-RKHS, GAM, and Binary classifier. The shapes and colors of the data-points and the predictions of the \(3\) models (also rendered as scatter to avoid occlusions/overlaps) represent the true class labels (shapes) and the model-fit (colors). Note how well the binary-GAM model discriminates between the two spiral arches, correctly identifying the majority of the yellow-arch points (high positive values, red-triangle symbols) and black-arch points (low negative values, red-circle symbols).
library(mgcv)
# https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/gam.models.html
fit_rk = smoothing_spline(x, y, lambda[which.min(V)]) # fit optimal model
fit_rk_bin = smoothing_BCE_spline(x, y, lambda[which.min(V_BCE)]) # fit optimal model
gamSpline = gam(y ~ s(x,bs="cr",k=100)) # s(x) = smooth function of x
print(paste0("y ~ s(x) GAM Spline model Log-Likelihood = ", round(logLik(gamSpline), 4)))
## [1] "y ~ s(x) GAM Spline model Log-Likelihood = -340.4828"
gamBin <- gam(labels ~ s(x,y), family = 'binomial')
print(paste0("labels ~ s(x,y) GAM Binary model Log-Likelihood = ", round(logLik(gamBin), 4)))
## [1] "labels ~ s(x,y) GAM Binary model Log-Likelihood = -130.1871"
df <- as.data.frame(cbind(X=spirals$x[ , 1], Y=spirals$x[ , 2],
class=as.factor(spirals$classes)))
# fit_plot = df %>%
# mutate(fit_rk = fit_rk$f_hat[, 1],
# fit_gam = fitted(fit_gam)) %>%
# arrange(X) %>%
# pivot_longer(-c(X, Y), names_to = 'fit', values_to = 'value') %>%
# ggplot(aes(X, Y)) +
# geom_point(color = '#FF55001A') +
# geom_line(aes(y = value, color = fit)) +
# scico::scale_color_scico_d(palette = 'hawaii', begin = .2, end = .8)
# ggplotly(gcv_plot + fit_plot)
# identify the 2 true spirals
# df_reformat$class <- rep(spirals$classes, 2)
# table(fit_rk_bin$f_hat[, 1] > 1.499)
# FALSE TRUE
# 152 148
df_reformat <- df %>%
mutate(RKHS_mod=fit_rk$f_hat[, 1],
RKHS_Spline_Binary=2*as.numeric(fit_rk_bin$f_hat[, 1] > 1.499)-1,
GAM_Spline=fitted(gamSpline), GAM_Binary=3*fitted(gamBin)-1.5) %>%
arrange(X) %>%
pivot_longer(-c(X, Y, class), names_to = 'fit', values_to = 'value')
df_reformat %>%
plot_ly(x=~X, y=~Y, type="scatter", mode="markers", color=~class,
# symbol=~class, symbols=as.character(seq_along(unique(df_reformat$class))),
name=~paste0("True-Class=", class)) %>% hide_colorbar() %>%
add_trace(x=~X, y=~value, color=~fit, colors = c("red", "blue", "green", "purple"),
symbol=~class, symbols=as.character(seq_along(unique(df_reformat$class))),
name=~paste0(" ", fit), mode="markers",
hovertemplate = paste('(Model)')) %>%
# colors = RColorBrewer::brewer.pal(3, "Set2")[1:2]) %>%
layout(title="(Binary & Regressive) GAM vs. RKHS Models of Spirals Data\n
Color=Model Fit, Shape=True-Class",
xaxis = list(title="X", showline=TRUE), yaxis = list(title="Y"),
legend = list(title="Objects", orientation = 'h'), hovermode = 'x')
In this section we will connect spacekime analytics with Reproducing Kernel Hilbert Spaces (RKHS). The basic idea is that repeated measurements of longitudinal processes can be viewed as time-varying probability distributions. The explicit connection is that the kime-phase distribution \(\Phi_{[-\pi,\pi)}^{(t)}\) represents the distribution of the repeated measurements at a fixed spatiotemporal location \(t\). As time (kime-magnitude) and space evolve, these phase distributions change and their dynamics can be modeled using RKHS. Recall the following example shown in the TCIU Chapter 3 appendix (Radon-Nikodym Derivatives, Kime Measures, and Kime Operator).
In this simulation, the \(3\) spatial dimensions \((x,y,z)\in\mathbb{R}^3\) are compressed into \(1D\) along the vertical axis \(z\in\mathbb{R}^1\), the radial displacement represents the time dynamics, \(t\in \mathbb{R}^+\), and the angular scatters of three different processes, representing repeated measurements from \(3\) different circular distributions at a fixed spatiotemporal location, are shown in different colors. Mean-dynamics across time of the three different time-varying distributions are shown as smooth curves color-coded to reflect the color of their corresponding phase distributions, samples, and sampling-distributions.
library(animation)
library(circular)
library(plotly)
epsilon <- 0.1
sampleSize <- 1000 # total number of phases to sample for 3 different processes (x, y, z)
sizePerTime <- 100 # number of phases to use for each fixed time (must divide sampleSize)
circleUniformPhi <- seq(from=-pi, to=pi, length.out=sizePerTime)
oopt = ani.options(interval = 0.2)
set.seed(1234)
# sample the kime-phases for all 3 different processes and the r time points
x <- rvonmises(n=sampleSize, mu=circular(pi/5), kappa=3)
y <- rvonmises(n=sampleSize, mu=circular(-pi/3), kappa=5)
z <- rvonmises(n=sampleSize, mu=circular(0), kappa=10)
r <- seq(from=1, to=sampleSize/sizePerTime, length.out=10)
# Define a function that renormalizes the kime-phase to [-pi, pi)
pheRenormalize <- function (x) {
out <- ifelse(as.numeric(x) <= pi, as.numeric(x)+pi, as.numeric(x)-pi)
return (out)
}
# transform Von Mises samples from [0, 2*pi) to [-pi, pi)
x <- pheRenormalize(x)
y <- pheRenormalize(y)
z <- pheRenormalize(z)
# vectorize the samples
vectorX = as.vector(x)
vectorY = as.vector(y)
vectorZ = as.vector(z)
# Starting phases, set the first phase index=1
plotX = c(vectorX[1])
plotY = c(vectorY[1])
plotZ = c(vectorZ[1])
# pl_list <- list()
pl_scene <- plot_ly(type='scatter3d', mode="markers")
plotX <- list()
plotY <- list()
plotZ <- list()
plotX_df <- list() # need separate data frames to store all time foliations
plotY_df <- list()
plotZ_df <- list()
for (t in 1:length(r)) { # loop over time
# loop over kime-phases
plotX[[t]] <- as.numeric(x[c(( (t-1)*length(r) + 1):((t-1)*length(r) + sizePerTime))])
plotY[[t]] <- as.numeric(y[c(( (t-1)*length(r) + 1):((t-1)*length(r) + sizePerTime))])
plotZ[[t]] <- as.numeric(z[c(( (t-1)*length(r) + 1):((t-1)*length(r) + sizePerTime))])
tempX = circular(unlist(plotX[[t]]))
tempY = circular(unlist(plotY[[t]]))
tempZ = circular(unlist(plotZ[[t]]))
resx <- density(tempX, bw=25, xaxt='n', yaxt='n')
resy <- density(tempY, bw=25, xaxt='n', yaxt='n')
resz <- density(tempZ, bw=25, xaxt='n', yaxt='n')
unifPhi_df <- as.data.frame(cbind(t=t, circleUniformPhi=circleUniformPhi))
plotX_df[[t]] <- as.data.frame(cbind(t=t, plotX=unlist(plotX[[t]])))
plotY_df[[t]] <- as.data.frame(cbind(t=t, plotY=unlist(plotY[[t]])))
plotZ_df[[t]] <- as.data.frame(cbind(t=t, plotZ=unlist(plotZ[[t]])))
pl_scene <- pl_scene %>% add_trace(data=unifPhi_df, showlegend=FALSE,
x = ~((t-epsilon)*cos(circleUniformPhi)),
y = ~((t-epsilon)*sin(circleUniformPhi)), z=0,
name=paste0("Time=",t), line=list(color='gray'),
mode = 'lines', opacity=0.3) %>%
add_markers(data=plotX_df[[t]], x=~(t*cos(plotX)), y=~(t*sin(plotX)), z=0,
type='scatter3d', name=paste0("X: t=",t),
marker=list(color='green'), showlegend=FALSE,
mode = 'markers', opacity=0.3) %>%
add_markers(data=plotY_df[[t]], x=~((t+epsilon)*cos(plotY)),
y=~((t+epsilon)*sin(plotY)), z=0-epsilon, showlegend=FALSE,
type='scatter3d', name=paste0("Y: t=",t),
marker=list(color='blue'),
mode = 'markers', opacity=0.3) %>%
add_markers(data=plotZ_df[[t]], x=~((t+2*epsilon)*cos(plotZ)),
y=~((t+2*epsilon)*sin(plotZ)), z=0+epsilon, showlegend=FALSE,
type='scatter3d', name=paste0("Z: t=",t),
marker=list(color='red'),
mode = 'markers', opacity=0.3)
}
means_df <- as.data.frame(cbind(t = c(1:length(r)),
plotX_means=unlist(lapply(plotX, mean)),
plotY_means=unlist(lapply(plotY, mean)),
plotZ_means=unlist(lapply(plotZ, mean))))
pl_scene <- pl_scene %>%
# add averaged (denoised) phase trajectories
add_trace(data=means_df, x=~(t*cos(plotX_means)),
y=~(t*sin(plotX_means)), z=0,
type='scatter3d', showlegend=FALSE, mode='lines', name="Expected Obs X",
line=list(color='green', width=15), opacity=0.8) %>%
add_trace(data=means_df, x=~(t*cos(plotY_means)),
y=~(t*sin(plotY_means)), z=0-epsilon,
type='scatter3d', showlegend=FALSE, mode='lines', name="Expected Obs X",
line=list(color='blue', width=15), opacity=0.8) %>%
add_trace(data=means_df, x=~(t*cos(plotZ_means)),
y=~(t*sin(plotZ_means)), z=0+epsilon,
type='scatter3d', showlegend=FALSE, mode='lines', name="Expected Obs X",
line=list(color='red', width=15), opacity=0.8) %>%
add_trace(x=0, y=0, z=c(-2,2), name="Space", showlegend=FALSE,
line=list(color='gray', width=15), opacity=0.8) %>%
layout(title="Pseudo Spacekime (1D Space, 2D Kime) Kime-Phase Sampling and Foliation",
scene = list(xaxis=list(title="Kappa1"), yaxis=list(title="Kappa2"),
zaxis=list(title="Space"))) %>% hide_colorbar()
pl_scene
The intrinsic process error \(\varepsilon_m\sim N(0,\sigma^2V)\) reflects classical variability in the underlying process distribution. The more subtle error-tensor \(\varepsilon_g\sim N(0,R_g)\) depends on the variance-covariance matrix of the group-level errors \(R_g\) and can be interpreted as repeated-measures variability due to the kime-phase distribution.
Hence, the kime-phases can be used to model or track repeated-measure variations. Recall this kime-phase simulation we showed above demonstrating alternative kime-phase distributions coupled with corresponding mean-phase-aggregators (solid radial lines) representing real observable (deterministic, not stochastic) measurable quantities. The latter avoid the intrinsic kime-phase distribution noise due to random sampling through the phase-space distribution \(\Phi_{[-\pi,\pi)}^{(t)}\) when taking real measurements depending de facto on some phase-aggregating kernels that smooth the intractable (unobservable) and intrinsically-random kime-phases.
In applied sciences, sample statistics, such as the sample mean, variance, percentiles, range, interquartile range, etc., are always well-defined. Their values are not known at any given time until the actual (finite) sample \(\{x_i\}_{i=1}^n\) is collected (observations are recorded). For instance, the sample mean statistic is the arithmetic average of all observations, \(\bar{x}_n=\frac{1}{n}\sum_{i=1}^n {x_i}\); yet, not knowing the value of the sample mean prior to collecting the data is simply due to lack of evidence, since we have a closed-form expression for the sample average statistic estimating the population (physical system’s) expected mean response, \(\mu_X\ \underbrace{\leftarrow}_{{n\to\infty}}\ \bar{x}_n\).
Quantum systems interact in ways that can be explained with superpositions of different discrete base-states, and quantum system measurements yield statistical results corresponding to any one of the possible states appearing at random.
Similarly, analytical systems interact in ways that can be explained with superpositions of different, discrete, and finite samples. Correspondingly, analytical system measurements yield statistics corresponding to any one of the specific sample-statistic outcomes, which vary between samples and experiments. However, the two fundamental limit theorems of probability theory governing statistical inference, the law of large numbers (LLN) and the central limit theorem (CLT), imply that this between-sample variability of the sample statistics decreases rapidly as the sample size increases. For instance, IID samples \(\{x_i\}_{i=1}^n\sim \mathcal{D}(\mu_X,\sigma_X^2)\), drawn from most well-behaved distributions \(\mathcal{D}\) with mean \(\mu_X\) and variance \(\sigma_X^2\), yield sample means \(\bar{x}_n\) whose sampling distribution converges (in distribution) to a Normal distribution, an asymptotic limiting distribution with rapidly decaying variance as the sample size increases
\[\underbrace{\bar{x}_n }_{Sample\\ average}\ \ \sim\ \underbrace{\bar{\mathcal{D}}(\overbrace{\mu_{\bar{x}_n}}^{mean}, \overbrace{\sigma^2_{\bar{x}_n}}^{variance}) }_{Sampling\\ distribution}\ \ \ \overbrace{\underbrace{\longrightarrow}_{n\to\infty}}^{Convergence\\ in\ distribution} \underbrace{\mathcal{N}\left(\mu_X,\frac{\sigma_X^2}{n}\right)}_{Asymptotic\\ distribution}\ .\]
Let’s try to explicate the duality between sampling from the kime-phase distribution and the complementary sampling of the process state space (spatiotemporal sampling).
Suppose the process \(X\sim F_X\), where \(F_X\) is an arbitrary distribution with a mean \(\mu=\mu_X\) and variance \(\sigma^2=\sigma^2_X\). If \(\{X_1,\cdots ,X_n\}\sim F_X\) is an IID sample the CLT suggests that under mild assumptions
\[\bar{X}\underset{n\to\infty}{\overset{d}{\longrightarrow}} N\left(\mu,\frac{\sigma^2}{n}\right )\ .\]
In repeated spatiotemporal sampling, \(\forall\ n\), we observe \(K\) (longitudinal) sample means \(\{\bar{X}_{n1},\bar{X}_{n2},\cdots, \bar{X}_{nK}\}\), where \(\forall\ 1\leq k_o\leq K\) each IID sample \(\{X_{1k_o},X_{2k_o},\cdots ,X_{nk_o}\}\sim F_X\) (across the kime-phase space) allows us to compute the \(k_o^{th}\) sample mean \(\bar{X}_{nk_o}\).
Observe that computing \(\bar{X}_{nk_o}\) from the \(k_o^{th}\) sample \(\{X_{1k_o},\cdots ,X_{nk_o}\}\) is identical to directly sampling \(\bar{X}_{nk_o}\) from the distribution \(F_{\bar{X}_n}\). Let's reflect on this setup, which involves two complementary, independent sampling strategies.
Note that the distribution \(F_{\bar{X}_n}\) need not be Normal; it may be close, but in general it won't be exactly Normal. For instance, suppose our spatiotemporal sampling scheme involves an exponential distribution, i.e., \(\{X_{1k_o},\cdots , X_{nk_o}\}\sim f_X\equiv {\text{Exp}}(\lambda)\equiv\Gamma(1,\lambda)\) and \(\sum_{i=1}^n{X_{ik_o}}\sim \Gamma\left(n,\lambda\right )\),
\[f_{{\text{Exp}}{(\lambda)}}(x)=\lambda e^{-\lambda x}\ \ ; \ \mathbb{E}(X_{{\text{Exp}}{(\lambda)}})=\frac{1}{\lambda} \ \ ; \ Var(X_{{\text{Exp}}{(\lambda)}})= \frac{1}{\lambda^2}\ ; \\ f_{\Gamma(shape=\alpha,\ rate=\lambda)}(x)=\frac{x^{\alpha-1}\lambda^{\alpha}}{\Gamma(\alpha)}e^{-\lambda x}\ \ ; \ \mathbb{E}(X_{\Gamma(\alpha,\lambda)})=\frac{\alpha}{\lambda} \ \ ; \ Var(X_{\Gamma(\alpha,\lambda)})= \frac{\alpha}{\lambda^2}\ \ .\]
Based on this sample, the sampling distribution of \(\bar{X}_{nk_o}\equiv \frac{1}{n}\sum_{i=1}^n{X_{ik_o}}\) is \(\Gamma\left(n,n\lambda\right )\), since a sum of \(n\) IID \(Exp(\lambda)\) variables is \(\Gamma\)-distributed, \(\sum_{i=1}^n{X_{ik_o}}\sim \Gamma\left(n,\lambda\right )\), and scaling a random variable, \(Y=cX\), yields the density \(f_Y(x)=\frac{1}{c}f_X\left(\frac{x}{c}\right)\); in our case the constant multiplier is \(c=\frac{1}{n}\) and \(Y=\frac{1}{n}\sum_{i=1}^n{X_i}\).
Of course, as \(n\to\infty\), the standardized \(\Gamma\left(n, n\lambda\right )\) distribution converges to \(N(0,1)\), i.e., \(\Gamma(n,n\lambda)\approx N\left(\frac{1}{\lambda},\frac{1}{n\lambda^2}\right)\) for large \(n\), yet for any fixed \(n\) the (gamma) \(\Gamma\) distribution is similar, but not identical, to the normal.
Prior to data acquisition, \(\bar{X}_{nk_o}\) is a random variable, once the observed data values are plugged in, it’s a constant. Hence, the sample mean random variable (a sum of \(n\) \(\sim F_X\) variables) \(\bar{X}_{nk_o}\) based on \(\{X_{1k_o},\cdots , X_{nk_o}\}\sim F_X\), and the (single!) random variable \(Y\sim F_{\bar{X}_{nk_o}}\) represent exactly the same random variable.
Corollary: Random drawing of a second-order tensor \(X_{n\times K}\) (a random matrix) representing \(K\) (spatiotemporal or longitudinal) samples of \(n\) IID observations \(\{X_{1k_o},\cdots , X_{nk_o}\}_{k_o}\sim F_X\) and then computing the \(K\) corresponding means \(\bar{X}_{nk_o}=\frac{1}{n}\sum_{i=1}^n{X_{ik_o}}\) is equivalent to randomly drawing \(K\) samples (a random vector) directly from the sampling distribution of the mean \(F_{\bar{X}_{nk_o}}\).
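To illustrate both the \(\Gamma(n, n\lambda)\) form of the sampling distribution and this corollary numerically, here is a minimal R sketch (an illustration only; the rate \(\lambda\), \(n\), \(K\), and the seed are arbitrary assumptions). It draws a random \(n\times K\) matrix of \({\text{Exp}}(\lambda)\) observations, computes the \(K\) column means, and compares them against \(K\) direct draws from \(\Gamma(n, rate = n\lambda)\) and the CLT Normal approximation.
library(plotly)
set.seed(1234)
n <- 10; K <- 5000; lambda <- 2    # assumed illustration parameters
# spatiotemporal sampling: an n x K random matrix of IID Exp(lambda) observations
X <- matrix(rexp(n*K, rate = lambda), nrow = n, ncol = K)
xbar <- colMeans(X)                # K kime-phase-aggregated sample means
# direct draws from the sampling distribution of the mean, Gamma(n, rate = n*lambda)
ybar <- rgamma(K, shape = n, rate = n*lambda)
ks.test(xbar, ybar)                # compare the two samples (same underlying distribution)
xx <- seq(0, 2/lambda, length.out = 300)
plot_ly(x = xbar, type = "histogram", histnorm = "probability density",
        name = "Column means of X (n x K)", opacity = 0.5) %>%
  add_lines(x = xx, y = dgamma(xx, shape = n, rate = n*lambda), name = "Gamma(n, n*lambda)") %>%
  add_lines(x = xx, y = dnorm(xx, 1/lambda, 1/(lambda*sqrt(n))), name = "Normal CLT approximation") %>%
  layout(title = "Duality: column means vs. direct draws from the sampling distribution")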
Below is a simulation of a 3D Brownian motion (Wiener) process, a canonical stochastic random walk. In this example, the Wiener process is intentionally disturbed by random Poisson noise, which leads to occasional abrupt disruptions of the already stochastic process.
library(plotly)
# 3D Wiener Process
N=500 # Number of random walk steps
# Define the X, Y, and Z, coordinate displacements independently
# xdis = rnorm(N, 0 , 1)
# ydis = rnorm(N, 0 , 2)
# zdis = rnorm(N, 0 , 3)
# xdis = cumsum(xdis)
# ydis = cumsum(ydis)
# zdis = cumsum(zdis)
# To use simulated 3D MVN Distribution
dis <- mixtools::rmvnorm(N, mu=c(0,0,0),
sigma=matrix(c(1,0,0, 0,2,0, 0,0,3), ncol=3))
# aggregate the displacements to get the actual 3D Cartesian Coordinates
xdis = cumsum(dis[,1])
ydis = cumsum(dis[,2])
zdis = cumsum(dis[,3])
# add Poisson noise
at = rpois(N, 0.1)
for(i in c(1:N)) {
if(at[i] != 0) {
xdis[i] = xdis[i]*at[i]
ydis[i] = ydis[i]*at[i]
zdis[i] = zdis[i]*at[i]
}
}
# plot(xdis, ydis, type="l",
# main ="Brownian Motion in Two Dimension with Poisson Arrival Process",
# xlab="x displacement", ylab = "y displacement")
plot_ly(x=xdis, y=ydis, z=zdis, type="scatter3d", mode="markers+lines",
text=~c(1:N), hoverinfo='text',
marker=list(color='gray'), showlegend=F) %>%
# Emphasize the starting and ending points
add_markers(x=xdis[1], y=ydis[1], z=zdis[1], marker=list(size=20,color="green"),
text=paste0("Starting Node 1")) %>%
add_markers(x=xdis[N], y=ydis[N], z=zdis[N],
marker=list(size=20,color="red"), text=paste0("Ending Node ", N)) %>%
layout(title="3D Brownian Motion Simulation (MVN & Poisson)",
scene=list(xaxis=list(title="X displacement"),
yaxis=list(title="Y displacement"),
zaxis=list(title="Z displacement")))
# 1D Wiener Process
# dis = rnorm(N, 0, 1);
# at = rpois(N,1)
# for(i in 1:N) {
# if(at[i] != 0){
# dis[i]= dis[i]*at[i]
# }
# }
# dis = cumsum(dis)
# plot(dis, type= "l",
# main= "Brownian Motion in One Dimension with Poisson Arrival Process",
# xlab="time", ylab="displacement")
# ub = 20; lb = -20
# xdis = rnorm(N, 0 ,1)
# xdis1 = rep(1,N)
# xdis1[1] = xdis[1]
# for(i in c(1:(N-1))){
# if(xdis1[i] + xdis[i+1] > ub) { xdis1[i+1] <- ub }
# else if(xdis1[i] + xdis[i+1] < lb) { xdis[i+1] = lb }
# else { xdis1[i+1] = xdis1[i] + xdis[i+1] }
# }
#
# plot(xdis1, type="l",main="Brownian Motion with bound in 1-dim", xlab="displacement",ylab="time")
# Compute the pairwise Euclidean distances between the 3D positions (rows)
df <- data.frame(cbind(xdis, ydis, zdis))
rowEuclidDistance <- dist(df)
plot_ly(z=as.matrix(rowEuclidDistance), type="heatmap") %>%
layout(title="Heatmap of Pairwise Euclidean Distances between Wiener Process Steps")
plot_ly(x=as.vector(as.matrix(rowEuclidDistance)), type = "histogram") %>%
layout(title="Histogram of Pairwise Euclidean Distances between Wiener Process Steps")
In our spacekime representation, we effectively have a (repeated-measurement) spatiotemporal process consisting of a 3D spatial Gaussian model that is dynamic in time. In other words, it is a 3D Gaussian process whose mean vector and variance-covariance matrix both depend on time, \(\mu=\mu(t)=(\mu_x(t),\mu_y(t),\mu_z(t))'\) and
\[\Sigma=\Sigma(t)=\left ( \begin{array}{ccc} \Sigma_{xx}(t) & \Sigma_{xy}(t) & \Sigma_{xz}(t)\\ \Sigma_{yx}(t) & \Sigma_{yy}(t) & \Sigma_{yz}(t) \\ \Sigma_{zx}(t) & \Sigma_{zy}(t) & \Sigma_{zz}(t) \end{array}\right )\ .\]
The process distribution \(\mathcal{D}(x,y,z,t)\) is specified by \(\mu=\mu(t)\) and \(\Sigma=\Sigma(t)\). Given a spatial location, e.g., a brain voxel, the probability density function at \((x,y,z)\in\mathbb{R}^3\) depends on the time localization, \(t\in \mathbb{R}^+\). Actual repeated sample observations draw phases from the phase distribution, \(\{\phi_i\}_i\in\Phi_{[-\pi,\pi)}^{(t)}\), which are associated with the fixed spatiotemporal location \((x,y,z,t)\in\mathbb{R}^3\times \mathbb{R}^+\). The repeated spatiotemporal samples are \[\left\{\left (\underbrace{x_i}_{x(\phi_i)}, \underbrace{y_i}_{y(\phi_i)}, \underbrace{z_i}_{z(\phi_i)}, \underbrace{t_i}_{t(\phi_i)}\right )\right\}_i\in\mathbb{R}^3\times \mathbb{R}^+\ .\]
When the mean vector and the variance-covariance matrix vary with time \(t\), proper inference may require use of Wiener processes (Brownian motion), Ito calculus, Heston models, or stochastic differential equation models.
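A minimal sketch of such a time-varying 3D Gaussian model (the specific mean trajectory \(\mu(t)\) and covariance scaling below are hypothetical choices, used only for illustration):
library(mixtools)
set.seed(11)
# hypothetical time-varying mean vector and covariance matrix (illustration only)
mu_t    <- function(t) c(cos(t), sin(t), 0.1*t)
Sigma_t <- function(t) diag(c(1, 2, 3)) * (1 + 0.2*t)
K <- 50                                    # repeated (kime-phase) samples per time point
samples <- lapply(1:10, function(t)        # 10 time points
  mixtools::rmvnorm(K, mu = mu_t(t), sigma = Sigma_t(t)))
# the empirical means per time point track the deterministic mean trajectory mu(t)
t(sapply(samples, colMeans))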
More details about stochastic differential equations, Radon-Nikodym derivatives, and kime phase-distribution random sampling are available in TCIU Chapter 3 Appendix (Radon-Nikodym Derivatives, Kime Measures, and Kime Operator).
A Hilbert space \(\mathcal{H}\) is a vector space over the complex field \(\mathbb{C}\) (or the reals \(\mathbb{R}\)) that is equipped with an inner product \(\langle \cdot, \cdot\rangle\) and is complete with respect to the induced norm \(\|f\|=\sqrt{\langle f,f\rangle}\).
The base scalar field of the Hilbert space could also be the reals, \(\mathbb{R}\). Hilbert spaces are important as for any pair of square-integrable real-valued functions \(f,g\in \mathcal{H}\equiv L^2[a,b]\), their function inner product defined by \(\langle f,g\rangle =\int _{a}^{b}f(x)g(x)\,\mathrm {d} x\) has analogous properties as the classical Euclidean dot product and leads to defining function-bases as orthonormal functions spanning the vector (Hilbert) space \(\mathcal{H}\). This (function) inner product \(\langle f,g\rangle\) induces a spectral decomposition of an operator of the form
\[f(x)\longrightarrow \int _{a}^{b}\kappa(x,y)f(y)\,\mathrm {d} y\ ,\]
where the continuous and symmetric kernel \(\kappa(x,y)\) permits (spectral) eigenfunction decomposition as a series
\[\kappa(x,y)=\sum _{n}\lambda_{n}\varphi _{n}(x)\varphi_{n}(y), \]
in terms of the orthonormal eigenfunctions \(\langle \varphi_n,\varphi_k\rangle=\begin{cases} 1\quad {\text{ , if }}n = k\\ 0\quad {\text{ , if }}n\not= k\end{cases}\) and the corresponding (scalar) eigenvalues \(\{\lambda_{n}\}_n\in\mathbb{C}\). The completeness requirement of Hilbert spaces ensures that such eigenfunction expansions converge to square-integrable functions!
Spectral Theorem: If \(A\) is a bounded self-adjoint operator on a Hilbert space \(\mathcal{H}\), then there exists a measure space \((X, \Sigma, \mu)\), a real-valued bounded measurable function \(f:X\to\mathbb{R}\), and a unitary operator \(U:\mathcal{H}\to L^2(X, \mu)\) such that \(A=U^*TU\), where \(T\) is the multiplication operator \((T\phi )(x)=f(x)\phi (x)\) and \(\|T\|=\|f\|_{\infty }\).
The spectral theorem guarantees that all bounded self-adjoint operators are unitarily equivalent to multiplication operators, which are separable and easier to represent, compute, and interpret.
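In finite dimensions, both the Mercer eigenfunction expansion and the spectral theorem reduce to the eigendecomposition of a symmetric (Gram) matrix. The minimal R sketch below (an illustration only; the Gaussian kernel, grid, and truncation level are arbitrary assumptions) diagonalizes a kernel Gram matrix and checks that a truncated eigen-expansion already reconstructs the kernel accurately.
# Gaussian kernel Gram matrix on a 1D grid
xg <- seq(-3, 3, length.out = 100)
Kmat <- outer(xg, xg, function(s, t) exp(-(s - t)^2 / 2))
eig <- eigen(Kmat, symmetric = TRUE)      # K = U diag(lambda) U', with U orthogonal (unitary)
m <- 20                                   # number of leading eigenpairs to keep
Khat <- eig$vectors[, 1:m] %*% diag(eig$values[1:m]) %*% t(eig$vectors[, 1:m])
max(abs(Kmat - Khat))                     # small: a few eigenpairs capture the kernel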
Computational Task: The goal here is to implement the time-varying kime-phase probability distribution modeling scheme described in the arXiv article Predicting the Future Behavior of a Time-Varying Probability Distribution …
One approach for modeling the temporal distribution dynamics (TDD) requires the following operations:
At a fixed spatiotemporal point we can build the desired repeated sample vector representations in a Hilbert space by using reproducing kernel Hilbert space (RKHS) embedding of the corresponding kime-phase probability distributions. Denote by \(\mathcal{P}=\{\Phi_{[-\pi,\pi)}\}\) the set of all kime-phase probability distributions on \(\mathcal{X}=[-\pi,\pi)\) with respect to a specific \(\sigma\)-algebra. Assume \(\kappa: \mathcal{X}\times \mathcal{X}\to \mathbb{R}\) is a positive definite symmetric kernel corresponding to an RKHS \(\mathcal{H}\) and a feature map \(\phi: \mathcal{X}\to \mathcal{H}\), such that \(||\phi(x)||\leq 1\), \(\forall\ x\in \mathcal{X}\). The kernel mean embedding \(\mu :\mathcal{P}\to \mathcal{H}\) associated with \(\kappa\) is defined by
\[\mathcal{P}\ni p \to \mu(p) \equiv \mathbb{E}_{x\sim p(x)}(\phi(x))\in \mathcal{H}\ .\]
Effectively, for a fixed kernel \(\kappa\), feature map \(\phi\), and corresponding RKHS \(\mathcal{H}\), we refer to \(\mu(p)\) as the RKHS embedding of \(p\). Let \(\mathcal{M}\) be the \(\mu\)-image of \(\mathcal{P}\).
Then, \(\mu(\mathcal{P})\equiv \mathcal{M}\subseteq \mathcal{H}\) and the image \(\mathcal{M}\) represents the set of repeated sample vectors corresponding to embedded probability distributions. For special characteristic kernels, such as the Gaussian, the kernel mean map is a lossless bijection \(\mu: \mathcal{P}\longleftrightarrow \mathcal{M}\) where no information is lost by the embedding operation. Hence, we will refer to all objects \(\Phi\in \mathcal{P}\) and \(\mu\in\mathcal{M}\) as distributions.
An important kernel mean embedding property allows computing expected values of RKHS functions via inner products with the embedding: for any \(f\in\mathcal{H}\),
\[\mathbb{E}_p(f)\equiv \mathbb{E}_{x\sim p(x)}(f(x)) = \langle \mu(p), f\rangle_{\mathcal{H}}\ .\]
The theoretical distribution kernel mean embedding \(\mu :\mathcal{P}\to \mathcal{M}\) has a discrete sample counterpart - the empirical sample kernel mean embedding defined for any observed IID sample of kime-phases \(S=\{x_1, x_2, \cdots, x_n\}\)
\[S\to \hat{\mu}_n(S)\equiv\frac{1}{n}\sum_{i=1}^n {\phi(x_i)} \ .\]
Of course, the empirical sample kernel mean embedding is just one of many kime-phase aggregators we can consider, see Spacekime Analytics/TCIU Book. Other examples of kime-phase aggregators include geometric-mean, median, kurtosis, scrambling phase, etc.
Subject to some RKHS constraints on \(\mathcal{H}\), the empirical sample kernel mean embedding converges in probability to the theoretical distribution kernel mean embedding (the true embedding), see this article for details.
\[\underbrace{\hat{\mu}_n(S)\underset{n\to\infty}{\overset{d}{\longrightarrow}} \mu(p)}_{\text {for any fixed time, } t},\ \ \ \hat{\mu}_n(S)\approx \mu(p) + O\left (\frac{1}{\sqrt{n}}\right )\ .\]
Observe that the sample kernel mean embedding \(\hat{\mu}_n(S)\) implicitly depends on the time \(t\). The first step of the TDD approach involves computing the series of sample kernel mean embeddings for all time points,
\[\{ \hat{\mu}_{1}=\hat{\mu}_{n,t=1}(S_1), \hat{\mu}_{2}=\hat{\mu}_{n,t=2}(S_2), \cdots, \hat{\mu}_{T}=\hat{\mu}_{n,t=T}(S_T)\}\]
corresponding to the \(T\) longitudinally observed repeated sample sets \(\{ S_1, S_2, \cdots, S_T \}\). The vectors \(\{\hat{\mu}_{t}\}\) may not always be explicitly computable since the kernel feature map \(\phi\) is generally unknown. This is why RKHS representations are important: kernel methods do not require explicit knowledge of the embedding vectors. In many algorithms it is sufficient to compute the corresponding vector inner products, which can be accomplished by evaluating the kernel function.
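For instance, the RKHS inner product between two empirical embeddings never requires \(\phi\) explicitly; it reduces to averaging pairwise kernel evaluations, \(\langle \hat{\mu}(S_s), \hat{\mu}(S_t)\rangle_{\mathcal{H}}=\frac{1}{n_s n_t}\sum_i\sum_j{\kappa(z_i^{(s)},z_j^{(t)})}\). A minimal sketch, assuming a Gaussian kernel and two synthetic phase samples (both are assumptions made only for illustration):
# Gaussian (RBF) kernel and the implicit inner product of two empirical embeddings
rbf <- function(s, t, sigma = 1) exp(-(s - t)^2 / (2*sigma^2))
embed_inner <- function(S1, S2, kernel = rbf) mean(outer(S1, S2, kernel))
set.seed(2024)
S1 <- rnorm(200, mean = 0.0, sd = 1)    # hypothetical sample at time t = 1
S2 <- rnorm(200, mean = 0.3, sd = 1)    # hypothetical sample at time t = 2
embed_inner(S1, S2)                      # <mu_hat(S1), mu_hat(S2)>_H without computing phi
# squared maximum mean discrepancy (MMD) between the two embedded samples
embed_inner(S1, S1) - 2*embed_inner(S1, S2) + embed_inner(S2, S2)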
Next we use the kime-phase distributions of the observed repeated-measurement vectors at fixed spatiotemporal locations to derive the global time dynamics of the system.
Vector-valued regression or other strategies can be employed to model the process temporal dynamics represented by the RKHS embedded distributions evolving with time.
We follow the vector-valued RKHS regression approaches in the following papers: Two-Sample Test Statistics for Measuring Discrepancies Between Two Multivariate Probability Density Functions Using Kernel-Based Density Estimates, A Hilbert Space Embedding for Distributions, Predicting the Future Behavior of a Time-Varying Probability Distribution, and Nonlinear functional models for functional responses in reproducing kernel Hilbert spaces.
Note that learning the RKHS operators requires that both the regression independent variables (inputs) and the dependent outcome (output) are vectors. Let's first define the space, \(\mathcal{F}\), of linear operators \(\mathcal{H}\longrightarrow \mathcal{H}\) that we will search over for an optimal regression operator. Suppose \(\mathcal{L}(\mathcal{H})=\{\mathcal{F}:\mathcal{H}\longrightarrow \mathcal{H}\}\) is the space of all bounded linear operators on \(\mathcal{H}\) and \(L:\mathcal{H}\times \mathcal{H}\to \mathcal{L}(\mathcal{H})\) is the non-negative \(\mathcal{L}(\mathcal{H})\)-valued kernel defined by
\[L(f, g) \equiv \underbrace{\langle f, g\rangle _{\mathcal{H}}}_{scalar}\ \underbrace{\mathbb{I}_{\mathcal{H}}}_{operator} :\mathcal{H}\times \mathcal{H}\to \mathcal{L}(\mathcal{H}),\]
where \(f,g\in \mathcal{H}\) and \(\mathbb{I}_{\mathcal{H}}\) is the identity operator on \(\mathcal{H}\) such that \(\mathbb{I}_{\mathcal{H}}(h)=h,\ \forall\ h\in \mathcal{H}\).
Definition: The rank-1 operators \(A\in\mathcal{L}(\mathcal{H})=\{\mathcal{F}:\mathcal{H}\longrightarrow \mathcal{H}\}\) are such that \(\forall\ \phi\in\mathcal{H}\), \(A\phi\equiv \langle \psi, \phi\rangle\nu\), where \(\nu\in Rg(A)\subseteq \mathcal{H}\) and \(\psi\in Dom(A)\subseteq \mathcal{H}\), so that \(\langle \psi,\cdot\rangle\) is a bounded linear functional on \(\mathcal{H}\). Similarly, finite-rank (rank-\(n\)) operators \(A\) have a finite-dimensional range \(Rg(A)\subseteq\mathcal{H}\) and
\[\forall\ \phi\in\mathcal{H},\ \ A\phi\equiv \sum_{i=1}^{n} { \underbrace{\langle \psi_i, \phi\rangle}_{scalar,\ in\ \mathbb{C}} \overbrace{\nu_i}^{basis}},\]
where \(\nu_i\in Rg(A)\subseteq \mathcal{H}\) and \(\psi_i\in Dom(A)\subseteq \mathcal{H}\), so that each \(\langle \psi_i,\cdot\rangle\) is a bounded linear functional on \(\mathcal{H}\).
Definition: A functional RKHS \(\mathcal{H}\) is a space of operators \(\mathcal{F}: \mathcal{H}\to \mathcal{H}\) that is itself a Hilbert space equipped with the inner product \(\langle\cdot,\cdot\rangle_\mathcal{H}\), such that \(\forall\ h\in \mathcal{H}\) the point-evaluation operator \(L_h : \mathcal{F} \mapsto \mathcal{F}(h)\) is a bounded linear operator. This is a non-constructive definition.
By the Riesz representation theorem applied to the Hilbert space \(\mathcal{H}\), \(\exists\ K_h^g\in \mathcal{H}\), such that \(\langle K_h^g, F \rangle_{\mathcal{H}} = \langle F(h), g\rangle _{H}\). Since the mapping \(g\to K_h^g\) is linear and the point evaluation is bounded we have that
\[\langle K_h^g, F \rangle_{\mathcal{H}}= \langle F(h), g \rangle_H \leq c||F||_{\mathcal{H}}||g||_{H}\ .\]
This implies that the mapping \(g\to K_h^g\) is also bounded and, \(\forall\ h'\in H\), \(g\to K_h^g(h')\) defines a bounded linear operator \(K(h,h')\), i.e., \(K(h,h')(g) := K_h^g(h')\).
Definition: The bounded linear operator \(K(\cdot,\cdot): \mathcal{H}\times\mathcal{H}\to \mathbb{C}\) defined by \(g\to K_h^g(h')=:K(h,h')\) is called the reproducing kernel associated with \(\mathcal{H}\).
Definition: The kernel reproducing property is \(\langle K(h, \cdot)(g)\ ,\ F \rangle_{\mathcal{H}} = \langle F (h), g\rangle_H,\ \forall\ h,g \in H\).
Reproducing kernel Properties
\[\langle K(h, h')g, f \rangle_H = \langle K_h^g(h'), f\rangle_H= \langle K_{h'}^f(h), g\rangle_{H}=\langle g, K(h',h)f\rangle_{H}\ .\]
\[\sum_{i,j}\langle K(h_i,h_j)(f_i), f_j\rangle_H= \sum_{i,j}\langle K_{h_i}^{f_i}, K_{h_j}^{f_j}\rangle_{\mathcal{H}}= \left\|\sum_{i} K_{h_i}^{f_i}\right\|^2_{\mathcal{H}}, \]
we have
\[\sum_{i,j}\langle K(h_i,h_j)(f_i), f_j\rangle_H\geq 0\ .\]
The kernel \(L(\cdot, \cdot):\mathcal{H}\times \mathcal{H}\to \mathcal{L}(\mathcal{H})\) is the reproducing kernel of an operator-valued RKHS \(\mathcal{F}\subseteq \mathcal{L}(\mathcal{H})\), where \(\mathcal{F}\) contains at least the span of all rank-1 operators, i.e., operators whose range is one-dimensional.
Let \(f,g\in \mathcal{H}\) and \[g^*\equiv \langle g,\cdot \rangle_{\mathcal{H}}: \mathcal{H}\to \mathbb{C}.\]
Then \(L(f, g) \equiv \langle f, g\rangle _{\mathcal{H}} \mathbb{I}_{\mathcal{H}}\) is the reproducing kernel of the operator \(g^*\equiv \langle g,\cdot \rangle_{\mathcal{H}}\), since \(\forall\ f_1,f_2,g_1,g_2\in \mathcal{H}\) the feature-space (\(\mathcal{F}\)) inner product is separable in the Hilbert-space (\(\mathcal{H}\)),
\[\langle f_1g_1^*, f_2g_2^* \rangle_{\mathcal{F}}= \langle f_1, g_1 \rangle_{\mathcal{H}} \langle f_2, g_2 \rangle_{\mathcal{H}}\ ,\] where
\[fg^*\in \mathcal{L}(\mathcal{H}):\mathcal{H}\to \mathcal{H}, \ {\text{s.t.}}\ \ \ \forall\ f,g\in \mathcal{H},\ \ g^*\equiv \langle g,\cdot \rangle_{\mathcal{H}}, \ \ fg^*= f\langle g,\cdot \rangle_{\mathcal{H}}\ .\]
All bounded rank-1 operators \(A\in\mathcal{L}(\mathcal{H})=\{\mathcal{F}:\mathcal{H}\longrightarrow \mathcal{H}\}\) have a 1D range, \(\dim(Rg(A))\equiv 1\), and \(\forall\ h\in \mathcal{H}\) \(\exists!\ \phi\in \mathcal{H}\) and \(\exists\ \psi\in Rg(A)\setminus\{0\}\) so that \(Ah=\langle h, \phi\rangle \psi\). Observe that since \(\psi\in Rg(A)\setminus\{0\}\), \(\exists\ y\in \mathcal{H}\), such that \(Ay=\psi\) and \(\langle y, \phi\rangle\equiv 1\), since
\[Ah=\langle h, \phi\rangle \psi= \langle h, \phi\rangle \overbrace{\underbrace{\langle y, \phi\rangle}_{1} Ay}^{\psi=Ay}\ .\]
By linearity of the inner product and the completeness of the Hilbert space, the inner product of all finite-rank \((n)\) operators in the feature-space \(\mathcal{F}\subseteq \mathcal{L}(\mathcal{H})\) can be expressed as superpositions, or linear combinations, of rank-1 operator inner products relative to some basis.
Next we can define vector-valued regression to learn a predictive model of the dynamics of the kime-phase distribution and explicate the time dynamics of the repeated-measurement vectors. Suppose the changes of the phase distributions from one time point to the next can be approximated by an autoregressive process \(\mu_{t+1} = A\mu_t +\varepsilon_t,\ \forall\ t\geq 0\), for some bounded operator \(A:\mathcal{H}\to \mathcal{H}\) where the residual errors \(\{\varepsilon_t\}_{t=0}^T \sim\mathcal{D}\) are independent zero-mean random variables (IID) from the same distribution \(\mathcal{D}\). To approximate the unknown autoregressive operator \(A\) we can use classical least-squares where the objective (cost) function is a mixture of a fidelity term and a regularization term with some unknown regularization hyperparameter \(\lambda\geq 0\):
\[C(A)=\min_{A\in \mathcal{F}} \left ( \underbrace{\sum_{t=0}^{T-1} {||\overbrace{\hat{\mu}_{t+1}}^{observed}- \overbrace{A\hat{\mu}_t}^{model\ est.}||_{\mathcal{H}}^2}}_{fidelity} + \lambda \underbrace{||A||_{\mathcal{F}}^2}_{regularizer} \right ) \ ,\]
The closed-form solution minimizing this cost function \(C(A)\) is
\[\hat{A}= \sum_{t=0}^{T-1} {\left ( \hat{\mu}_{t+1} \sum_{s=0}^{T-1} {w_{ts}\hat{\mu}_s^*} \right )}\ ,\]
where the weight-coefficient matrix \(W = (w_{ts})=(K + \lambda I)^{-1}\), \(I=I_{(T-1)\times(T-1)}\) is the identity matrix, and the kernel matrix \(K=\left (\{\langle \hat{\mu}_s , \hat{\mu}_t \rangle_{\mathcal{H}} \}_{s,t} \right )\in \mathbb{R}^{(T-1)\times(T-1)}\), or in \(\mathbb{C}^{(T-1)\times(T-1)}\).
This optimal solution is the result of the following theorem (Lian):
Theorem: Given an observed dataset \(\{(x_i, y_i)\}^n_{i=1}\), the solution to the optimization problem \(\min_A C(A)\) has the representation \(\hat{A}=\sum_{i=1}^n {K(x_i,\cdot)\alpha_i}\), where the functional coefficients \(\alpha_i \in H\).
Proof: Let's denote by \(\mathcal{H}_o\subseteq \mathcal{H}\) the subspace spanned by the kernel centered at the observed covariates \[\mathcal{H}_o = \left \{\sum_{i=1}^n { K(x_i,\cdot )\alpha_i, \ \ \alpha_i \in H }\right \}.\]
Any \(A\in \mathcal{H}\) can be decomposed as \(A = A_o + G\), where \(A_o\in \mathcal{H}_o\) and \(G\perp \mathcal{H}_o\). Then, by the reproducing property, \(\forall\ 1\leq j\leq n\) and \(\forall\ h \in H\), \(\langle G(x_j), h\rangle_H = \langle K(x_j , \cdot)(h), G\rangle_{\mathcal{H}} = 0\). As this holds \(\forall\ h \in H\), \(G(x_j) = 0\). Since \(G \perp A_o\), \(\forall\ G\not= 0\)
\[||A||^2_{\mathcal{H}} = ||A_o||^2_{\mathcal{H}} + ||G||^2_{\mathcal{H}} > ||A_o||^2_{\mathcal{H}}\ .\]
Therefore, the minimizer \(\hat{A}\in \mathcal{H}_o\) since
\[\sum_{i=1}^n {||y_i - A(x_i)||_2^2} + \lambda ||A||^2_{\mathcal{H}} \gt \sum_{i=1}^n {||y_i - A_o(x_i)||_2^2} + \lambda ||A_o||^2_{\mathcal{H}}\ .\]
Hence, when \(A\in \mathcal{F}\), the estimated operator \(\hat{A}\) will converge to the true operator \(A\) as the number of sample sets (replicated samples) \(n\to\infty\) and the number of samples per set (time) \(T\to\infty\). This is true under some minor requirements (topological separability for embedded distributions in reproducing kernel Hilbert spaces).
To model and extrapolate the temporal dynamics of the kime-phase distribution forward we can apply the learned operator \(\hat{A}\) to the last time point \(t\) where we observed the kime-phase distribution \(\hat{\mu}_T\). In reality, the kime-phases are random, i.e., they may not be directly observable as sample phases. However, their distribution \(\Phi_{[-\pi,\pi)}\) is generally symmetric and zero-mean, and it can be modeled using some priors (e.g., truncated Laplace distribution) or inferred via analytical time-to-kime transformations, e.g., Laplace Transform \(\mathcal{L}:\mathbb{R}^+ \to \mathbb{C}\).
The forward prediction \(\hat{\mu}_{T+1}=\hat{A}\hat{\mu}_{T}\) approximates the unknown phase distribution \(\mu_{T+1}\) as a mixture (weighted linear combination) of the previously observed kime-phase distributions at times \(1\leq t\leq T\). Specifically,
\[\hat{\mu}_{T+1}= \sum_{t=2}^T {\beta_t \hat{\mu}_{t}}\ , \ \ {\text{where }}\\ \underbrace{\beta_{t+1}}_{time\ effects} =\ \sum_{s=1}^{T-1} {w_{ts}\langle \hat{\mu}_{s}, \hat{\mu}_{T}\rangle_{\mathcal{H}} } \ , \ \forall\ 1\leq t\leq T-1\ , \\ \langle \hat{\mu}_{s}, \hat{\mu}_{t}\rangle_{\mathcal{H}}= \frac{1}{n_s\ n_t} \sum_{i=1}^{n_s}{\sum_{j=1}^{n_t}{ \kappa\left(z_i^{(s)}, z_j^{(t)}\right )}}, \\ \underbrace{W = (\{w_{ts}\}_{ts})}_{weights}=(\underbrace{K}_{kernel} + \lambda I)^{-1},\ \ \underbrace{I=I_{(T-1)\times(T-1)}}_{identity\ matrix},\\ K=\left (\{\langle \hat{\mu}_s , \hat{\mu}_t \rangle_{\mathcal{H}} \}_{s,t} \right )\in \mathbb{C}^{(T-1)\times(T-1)}\ .\]
Thus, \(\hat{\mu}_{T+1}\) can be expressed via the RKHS relation, \(\langle \kappa (x, \cdot ),\ \kappa (x_o, \cdot )\rangle_{\mathcal{H}}\equiv \kappa (x, x_o)\), in terms of the originally observed repeated sample measurements \[\underbrace{S_t \equiv \left \{z_i^{(t)}\right \}}_{\text{Data}},\ \underbrace{\forall\ \ 1\leq i\leq n_t}_{\text{repeated phase sampling}}, \ \underbrace{\forall\ \ 1\leq t\leq T}_{\text{time points}}\ .\]
Employing the estimates of the time-effect values \(\beta_t\in \mathbb{C}\), we can extrapolate the kime-phase aggregate statistics \(\hat{\mu}_{T+1}\) one time-step forward beyond the observed time points \(1\leq t\leq T\). That is, we can estimate various kime-phase population distribution characteristics (e.g., mean phase) outside of the convex hull of the observed kime-phase distributions. However, the estimate \(\hat{\mu}_{T+1}\) is guaranteed to lie in the subspace spanned by \(\{\hat{\mu}_{2}, \hat{\mu}_{3}, \cdots, \hat{\mu}_{T}\}\), computed using the observed samples \(\{S_t\}_t\).
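A compact numerical sketch of these weight formulas (a simplified illustration that parallels the lampert() implementation further below; the synthetic drifting samples, the Gaussian kernel, and the regularization value \(\lambda\) are assumptions): build the Gram matrix \(K\) of embedding inner products, form \(W=(K+\lambda I)^{-1}\), and obtain the mixture weights \(\beta\).
rbf <- function(s, t) exp(-(s - t)^2)
embed_inner <- function(S1, S2, kernel = rbf) mean(outer(S1, S2, kernel))
set.seed(99)
Tn <- 6; n_t <- 100
S <- lapply(1:Tn, function(t) rnorm(n_t, mean = -0.2*t, sd = 1))   # hypothetical drifting samples
# Gram matrix over the first T-1 embeddings and the vector of inner products with mu_hat_T
Kmat <- outer(1:(Tn-1), 1:(Tn-1), Vectorize(function(s, t) embed_inner(S[[s]], S[[t]])))
kvec <- sapply(1:(Tn-1), function(s) embed_inner(S[[s]], S[[Tn]]))
lambda <- 0.1                           # assumed regularization hyperparameter
W <- solve(Kmat + lambda*diag(Tn-1))
beta <- as.vector(W %*% kvec)           # mixture weights for the extrapolated embedding
beta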
Since \(\mathcal{H}\) is the RKHS of the kernel \(\kappa(\cdot,\cdot)\) corresponding to the feature map \(\phi(\cdot)\), \(\forall\ z\in \mathcal{Z},\ {\text{ and }}\ \forall\ f\in \mathcal{H},\ \langle \phi(z), f\rangle_{\mathcal{H}}\equiv f(z)\). Hence, for any element in the (vector) Hilbert space, \(\forall\ f\in \mathcal{H}\), we can estimate the pseudo expected value of \(f\), \(\widehat{\mathbb{E}}(f)\), with respect to the forward prediction \(\hat{\mu}_{T+1}\) as follows
\[\widehat{\langle f\rangle}_{\hat{\mu}_{T+1}} \equiv {\widehat{\mathbb{E}}}_{\hat{\mu}_{T+1}}(f)= \langle \hat{\mu}_{T+1}, f\rangle_{\mathcal{H}}= \sum_{t=2}^T {\beta_t \langle \hat{\mu}_{t}, f\rangle_{\mathcal{H}}}=\\ \sum_{t=2}^T {\left (\beta_t \frac{1}{n_t}\sum_{i=1}^{n_t}\langle \phi(z_i^t), f\rangle_{\mathcal{H}}\right )}= \sum_{t=2}^T \sum_{i=1}^{n_t}{\left (\frac{\beta_t}{n_t}f(z_i^t) \right )}\ .\]
Recall that the kernel mean embedding from the set \(\mathcal{P}\) of all kime-phase probability distributions on \(\mathcal{Z}\) into the RKHS \(\mathcal{H}\), \(\mu: \mathcal{P}\to \mathcal{H}\) associated with the kernel \(\kappa:\mathcal{Z}\times \mathcal{Z}\underset{p\mapsto \mu(p)}{\longrightarrow} \mathbb{C}\), is defined by \(\mu(p) = \mathbb{E}_{z\sim p(z)}(\phi(z))\). Thus, the pseudo expected value \(\widehat{\mathbb{E}}(f)\) is just an estimate of the corresponding expected value \({\mathbb{E}}(f)\). This is because \(\hat{\mu}_{T+1}\) is not guaranteed to have a pre-image \(p\in \mathcal{P}\) in the space of kime-phase probability distributions \(\mathcal{P}\). However, this lemma shows that \(\hat{\mu}_{T+1}\) is a reasonable proxy measure for the unknown kernel mean embedding at time \(T+1\), i.e., \(\hat{\mu}_{T+1}\approx {\mu}_{T+1}\).
Lemma: Let \(\mathcal{M}\) represent the space of all kernel mean embedding from the set \(\mathcal{P}\) of all kime-phase probability distributions on \(\mathcal{Z}\) over time points \(1\leq t\leq T\). Suppose \(\exists\ \mu_T\in \mathcal{M}\), \(\exists\ \hat{\mu}_T, \varepsilon_T\in \mathcal{M}\), and \(A,\hat{A}\in \mathcal{F}\) are bounded linear operators such that \(\hat{\mu}_{T+1}=\hat{A}\hat{\mu}_T\) and \({\mu}_{T+1}=A\mu_T+\varepsilon_T\). Then, \(\forall\ f\in \mathcal{H}\) with \(||f||_{\mathcal{H}}\lt 1\) we have the following upper bound on the distance between the corresponding exact and pseudo expected values
\[|{\mathbb{E}}_{{\mu}_{T+1}}(f) - {\widehat{\mathbb{E}}}_{\hat{\mu}_{T+1}}(f)|\leq \underbrace{\underbrace{||A||_{\mathcal{F}}}_{constant\ in\ time}\ \underbrace{||\mu_T - \hat{\mu}_T||_{\mathcal{H}}}_{\underset{n_T\to\infty} {\ \longrightarrow\ 0}}}_{\longrightarrow\ 0} + \underbrace{||A - \hat{A}||_{\mathcal{F}}}_{\underset{T \to \infty\\ n_T \to \infty}{\longrightarrow 0\ ,}} + \underbrace{||\varepsilon_T||_{\mathcal{F}}}_{{\text{Autoregressive Model}}\\ {\text{Error }}\ \sim\ \mathcal{D}(mean=0)}\ .\]
Proof: Derive the upper bound by expanding the left hand side and using the inner product properties and the RKHS embedding definition.
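A sketch of this expansion, using the autoregressive relations \({\mu}_{T+1}=A\mu_T+\varepsilon_T\) and \(\hat{\mu}_{T+1}=\hat{A}\hat{\mu}_T\), the Cauchy-Schwarz inequality, and the assumptions \(||f||_{\mathcal{H}}\lt 1\) and \(||\phi(x)||\leq 1\) (which implies \(||\hat{\mu}_T||_{\mathcal{H}}\leq 1\)):
\[|{\mathbb{E}}_{{\mu}_{T+1}}(f) - {\widehat{\mathbb{E}}}_{\hat{\mu}_{T+1}}(f)| = |\langle {\mu}_{T+1}-\hat{\mu}_{T+1},\ f\rangle_{\mathcal{H}}| \leq ||{\mu}_{T+1}-\hat{\mu}_{T+1}||_{\mathcal{H}} = ||A{\mu}_{T}+\varepsilon_T-\hat{A}\hat{\mu}_{T}||_{\mathcal{H}} \\ = ||A({\mu}_{T}-\hat{\mu}_{T})+(A-\hat{A})\hat{\mu}_{T}+\varepsilon_T||_{\mathcal{H}} \leq ||A||_{\mathcal{F}}\ ||{\mu}_{T}-\hat{\mu}_{T}||_{\mathcal{H}} + ||A-\hat{A}||_{\mathcal{F}}\ \underbrace{||\hat{\mu}_{T}||_{\mathcal{H}}}_{\leq\ 1} + ||\varepsilon_T||\ .\]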
This Lemma quantifies the approximation \(\hat{\mu}_{T+1}\approx {\mu}_{T+1}\) of the mean kernel embedding at the future time point \(T+1\). The approximation is good since each term in the upper bound is controlled: \(||\mu_T - \hat{\mu}_T||_{\mathcal{H}}\to 0\) as the number of phase samples \(n_T\to\infty\), \(||A - \hat{A}||_{\mathcal{F}}\to 0\) as \(T\to\infty\) and \(n_T\to\infty\), and the autoregressive model error \(\varepsilon_T\) has zero mean.
Hence, given sufficiently large and representative data \(\{S_t\}\), the time distribution dynamic estimation \(\hat{\mu}_{T+1}\) of the next distribution time step (\(T+1\)) is expected to be highly accurate approximation of the unknown true time distribution \({\mu}_{T+1}\).
The explicit expectation estimate we derived above
\[\widehat{\langle f\rangle}_{\hat{\mu}_{T+1}} \equiv {\widehat{\mathbb{E}}}_{\hat{\mu}_{T+1}}(f) = \sum_{t=2}^T \sum_{i=1}^{n_t}{\left (\frac{\beta_t}{n_t}f(z_i^t) \right )}\]
suggests an explicit scheme to draw new IID (random) samples from the approximate mean kernel \(\hat{\mu}_{T+1}\), as proxy of prospective true random samples from the unknown kime-phase distribution at time \(T+1\), \({\mu}_{T+1}\). The explicit random sampling scheme representing draws from \(\hat{\mu}_{T+1}\) is:
\[\hat{S}_{T+1} = \cup _{t=2}^{T} \left \{ \frac{\beta_t}{n_t}\cdot z_1^{(t)}\ , \ \frac{\beta_t}{n_t}\cdot z_2^{(t)}\ ,\ \cdots \ ,\ \underbrace{\overbrace{\frac{\beta_t}{n_t}}^{weight}\cdot \overbrace{z_i^{(t)}}^{sample}}_{{\text{Each sample } z_i^{(t)}\sim \mathcal{Z}}\\ {\text{is weighted by }} \frac{\beta_t}{n_t}} \ , \ \cdots \ ,\ \frac{\beta_t}{n_t}\cdot z_{n_t}^{(t)}\right \}\ .\]
Both weighted and uniformly (equally) weighted sampling schemes are possible. Above we show the weighted sampling approach; alternatively, a uniformly weighted sample \(\bar{S}_{T+1}\) can be generated so that its arithmetic-average empirical embedding approximates \(\hat{\mu}_{T+1}\),
\[\bar{S}_{T+1}=\{\bar{z}_{1}, \bar{z}_{2}, \cdots, \bar{z}_{m}\},\ \ \hat{\mu}_{T+1}\approx\mu(\bar{S}_{T+1})=\frac{1}{m}\sum_{i=1}^m {\phi(\bar{z}_i)}\ .\]
RKHS kernel herding may be used to draw random samples using an algorithmic approach approximating the unknown kime-phase probability distribution based on observed samples. Each embedded phase distribution \(\eta\in\mathcal{M}\) can be used to generate a sequence of RKHS kernel herding samples \(\{\bar{z}_i\}_i\) as follows
\[\bar{z}_1=\arg\max_{z\in\mathcal{Z}} {\langle \phi (z), \eta\rangle_{\mathcal{H}}}\ ,\\ \bar{z}_2=\arg\max_{z\in\mathcal{Z}} {\left \langle \phi (z), \eta - \frac{1}{2}{\phi(\bar{z}_1)} \right \rangle_{\mathcal{H}}}\ ,\\ \vdots\\ \bar{z}_n=\arg\max_{z\in\mathcal{Z}} {\left \langle \phi (z), \eta - \frac{1}{n}\sum_{i=1}^{n-1} {\phi(\bar{z}_i)} \right \rangle_{\mathcal{H}}}\ ,\ \forall\ n\gt 2\ .\]
The RKHS kernel herding sampling scheme represents an iterative optimization for drawing examples \(\{\bar{z}_i\}\) by minimizing the objective function
\[\{\bar{z}_i\}=\arg\min_{\{\bar{z}_i\}}{\left \|\eta - \frac{1}{n}\sum_{i=1}^n {\phi(\bar{z}_i)}\right \|}_{\mathcal{H}}\ .\]
In this optimization, the target vector \(\eta\) may or may not be the embedding of a kime-phase distribution, which makes RKHS herding applicable to any vector element of the RKHS \(\mathcal{H}\).
For instance, letting \(\eta = \hat{\mu}_{T+1}\) allows us to draw random samples \(\bar{S}_{T+1}=\{\bar{z}_1, \bar{z}_2, \cdots, \bar{z}_{n_{T+1}} \}\) as proxy samples from the true phase distribution \({S}_{T+1}=\{{z}_1, {z}_2, \cdots, {z}_{n_{T+1}} \}\), which is unknown.
In practice, computing \(\bar{S}_{T+1}\) requires explicit protocols for solving multiple inverse (preimage) problems, which may be difficult in many situations. RKHS kernel herding represents an approximate projection \(\mathcal{H}\longrightarrow \mathcal{M}\), since \(\forall\ \eta\in \mathcal{H}\) \(\exists\ p_{\eta}\in \mathcal{P}\), a preimage in the family of symmetric phase distributions, associated with the sample \(\{\bar{z}_1, \bar{z}_2, \cdots, \bar{z}_n\}\).
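To make the herding iteration concrete, here is a minimal, self-contained R sketch over a 1D grid (the Gaussian kernel, the synthetic target sample defining \(\eta\), the grid, and the number of herded points are all assumptions; this mirrors, in simplified form, the herding loop inside the lampert() function below):
rbf <- function(s, t) exp(-(s - t)^2)
set.seed(7)
target_sample <- rnorm(500, mean = -1, sd = 1)     # hypothetical sample defining the target eta
grid <- seq(-5, 5, by = 0.01)                      # candidate points z
# <phi(z), eta>_H on the grid: the mean kernel value against the target sample
eta_inner <- sapply(grid, function(z) mean(rbf(z, target_sample)))
herd <- c()
n_herd <- 50
for (k in 1:n_herd) {
  # subtract the running-average embedding of the points herded so far
  correction <- if (length(herd) == 0) 0 else
    sapply(grid, function(z) mean(rbf(z, herd))) * length(herd)/(length(herd) + 1)
  herd <- c(herd, grid[which.max(eta_inner - correction)])
}
c(mean(herd), mean(target_sample))                  # the herded points track the target mean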
The following algorithm for extrapolating the kime-phase distribution dynamics was proposed by Lampert in 2014.
Algorithm for extrapolating the kime-phase distribution
Task: Pilot test this with real fMRI data…
We characterize the failure modes of the kernel embedding algorithm in terms of predicting a toy Fokker-Planck equation. In the linear potential scenario \(U(x)=cx\), the 1D Fokker-Planck equation is the Smoluchowski equation \[\begin{equation} \partial_t P(x,t\mid x_0,t_0) = \partial_x D (\partial_x+\beta c) P(x,t\mid x_0,t_0) \end{equation}\] with initial impulse condition \(P(x,t_0\mid x_0,t_0)=\delta(x-x_0)\), where \(D\) is the diffusion constant and \(\beta=\frac{1}{k_BT}\), \(k_B\) being the Boltzmann constant. The solution is \[\begin{equation} P(x,t\mid x_0,t_0)=\frac{1}{\sqrt{4\pi D(t-t_0)}}\exp\left(-\frac{(x-x_0+D\beta c(t-t_0))^2}{4D(t-t_0)}\right) \end{equation}\]
which, assuming \(D=1\) and \(\beta c=1\), and setting \(x_0=t_0=0\), simplifies to \[\begin{equation} p_t\sim \mathcal{N}(\mu = -t,\ \sigma = \sqrt{2t}) \end{equation}\]
diffusion_sol <- function(x,t,x_init,t_init,D=1) {
return(1/sqrt(4*pi*D*(t-t_init))*exp(-(x-x_init+D*(t-t_init))^2/(4*D*(t-t_init))))
}
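As a quick check, for \(D=1\) and \(x_0=t_0=0\) this closed form coincides with the \(\mathcal{N}(-t,\sqrt{2t})\) density:
# diffusion_sol(x, t, 0, 0) should match dnorm(x, mean = -t, sd = sqrt(2*t))
xx <- seq(-10, 10, 0.1)
all.equal(diffusion_sol(xx, 1.4, 0, 0), dnorm(xx, mean = -1.4, sd = sqrt(2*1.4)))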
Assume we are given access to the temporal samples \(S_{0.2}\sim p_{0.2},S_{0.4}\sim p_{0.4},\ldots,S_{1.2}\sim p_{1.2}\). Can we predict \(p_{1.4}\) forward, or generate samples from it effectively? A visualization of these distributions follows:
library(plotly)
# legend is false
fig <- plotly::subplot(
plot_ly(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),0.01,0,0), type = 'scatter', mode = 'lines'),
plot_ly(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),0.8,0,0), type = 'scatter', mode = 'lines'),
plot_ly(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),0.2,0,0), type = 'scatter', mode = 'lines'),
plot_ly(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),1,0,0), type = 'scatter', mode = 'lines'),
plot_ly(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),0.4,0,0), type = 'scatter', mode = 'lines'),
plot_ly(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),1.2,0,0), type = 'scatter', mode = 'lines'),
plot_ly(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),0.6,0,0), type = 'scatter', mode = 'lines'),
plot_ly(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),1.4,0,0), type = 'scatter', mode = 'lines'),
nrows = 4, margin = 0.05
)
# move legends to annotations
annotations = list(
list(
x = 0.2,
y = 1.0,
text = TeX("\\textbf{t=0.01}, N(-0.01,\\sqrt{0.02})"),
xref = "paper",
yref = "paper",
xanchor = "center",
font = list(size = 12),
yanchor = "bottom",
showarrow = FALSE
),
list(
x = 0.2,
y = 0.7,
text = TeX("\\textbf{t=0.2}, N(-0.2,\\sqrt{0.4})"),
xref = "paper",
yref = "paper",
xanchor = "center",
font = list(size = 15),
yanchor = "bottom",
showarrow = FALSE
),
list(
x = 0.2,
y = 0.45,
text = TeX("\\textbf{t=0.4}, N(-0.4,\\sqrt{0.8})"),
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
font = list(size = 15),
showarrow = FALSE
),
list(
x = 0.2,
y = 0.2,
text = TeX("\\textbf{t=0.6}, N(-0.6,\\sqrt{1.2})"),
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
font = list(size = 15),
showarrow = FALSE
),
list(
x = 0.8,
y = 1,
text = TeX("\\textbf{t=0.8}, N(-0.8,\\sqrt{1.6})"),
xref = "paper",
yref = "paper",
font = list(size = 12),
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = 0.8,
y = 0.7,
text = TeX("\\textbf{t=1}, N(-1,\\sqrt{2})"),
xref = "paper",
yref = "paper",
font = list(size = 15),
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = 0.8,
y = 0.45,
text = TeX("\\textbf{t=1.2}, N(-1.2,\\sqrt{2.4})"),
xref = "paper",
yref = "paper",
font = list(size = 15),
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = 0.8,
y = 0.2,
text = TeX("\\textbf{t=1.4}, N(-1.4,\\sqrt{2.8})"),
xref = "paper",
yref = "paper",
font = list(size = 15),
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
))
fig <- fig %>% layout(title = "Fokker Planck solution p(x,t)",
annotations = annotations,showlegend=FALSE) %>% config(mathjax = 'cdn')
fig
**Initialize the Data**
The following implements the algorithm for extrapolating the kime-phase distribution dynamics proposed by Lampert (2014).
Remark: Since all the samples within the same batch \(S_t\) are weighted equally, the batch sample means are essentially preserved by the mixture. For example, with \(\bar{S}_{1.2}=-1.2\) and \(\bar{S}_{0.6}=-0.6\), it is impossible for \(\hat{S}_{T+1}\) to reach a first moment of \(-1.4\), as the smallest achievable first moment from this weighted average is \(-1.2\). In other words, since \(\hat{\mu}_{T+1}\) is a weighted average of the prior temporal sample means, each of which is at least \(-1.2\), one cannot extrapolate outside the sample-mean range \([\min_t\bar{S_t},\max_t\bar{S_t}]=[-1.2,0]\). Also, negative weights \(\beta_t<0\) correspond to unphysical sampling. In these cases, herding needs to be applied.
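Note that the batches X0, X1, ..., X6 and the batch size n used by the code below are assumed to have been initialized earlier; a plausible initialization, consistent with the Smoluchowski solution \(p_t\sim\mathcal{N}(-t,\sqrt{2t})\) and with the commented-out definitions in the DTW chunk further below (the random seed is an arbitrary assumption), is:
set.seed(1234)          # assumed seed; not shown in the original chunk
n <- 100                # batch sample size per time point
# IID batches S_t ~ N(mean = -t, sd = sqrt(2t)) at t = 0.01, 0.2, 0.4, ..., 1.2
X0 <- rnorm(n, -0.01, sqrt(2*0.01))
X1 <- rnorm(n, -0.2,  sqrt(2*0.2))
X2 <- rnorm(n, -0.4,  sqrt(2*0.4))
X3 <- rnorm(n, -0.6,  sqrt(2*0.6))
X4 <- rnorm(n, -0.8,  sqrt(2*0.8))
X5 <- rnorm(n, -1.0,  sqrt(2*1.0))
X6 <- rnorm(n, -1.2,  sqrt(2*1.2))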
#Kernels
library(GPBayes)
# Matern kernel helper functions
# Note: besselK() is called with expon.scaled=TRUE, so maternkernel() returns the
# exponentially rescaled Bessel form exp(x)*K_nu(x), not the plain Matern kernel
besselk_nu <- function(nu){
return(function(x){
return (besselK(x,nu,expon.scaled=TRUE))
})
}
maternkernel <- function(nu,x,sigma=1,rho=1){
scaled_x = sqrt(2*nu)*x/rho
return (sapply(scaled_x,besselk_nu(nu))*scaled_x**nu*sigma**2*2**(1-nu)/gamma(nu))
}
linear_exp<- function (s,t){
return (exp(-abs(s-t)))
}
quad_exp <- function (s,t){
return (exp(-(s-t)^2))
}
poly_exp <- function (s,t){
return (exp(-s*t))
}
matern <- function(nu){
return(function(s,t){
return (maternkernel(nu,s-t))
})
}
# Implement the estimation for kappa kernel sum
kappa <- function(s,t,f=linear_exp){
ns <- length(s)
nt <- length(t)
tot <- 0
for (i in 1:ns){
for (j in 1:nt){
tot <- tot+ f(s[i],t[j])
}
}
return (tot/(ns*nt))
}
K_st <- function(X, func = linear_exp){
ncol = dim(X)[2]
K_matrix <- matrix(0, ncol, ncol)
for (i in 1:ncol){
for (j in 1:ncol){
K_matrix[i,j] <- kappa(X[,i],X[,j], func)
}
}
return(K_matrix)
}
K_t <- function(X,Y, func = linear_exp){
ncol = dim(X)[2]
K_matrix <- matrix(0, ncol, 1)
for (i in 1:ncol){
K_matrix[i,1] <- kappa(X[,i],Y,func)
}
return(K_matrix)
}
sum_kernels <- function(val,beta, func=quad_exp){
XX <- list(X0,X1,X2,X3,X4,X5,X6)
tot <- 0
for (i in 1:7){
tot <- tot+ beta[i]*sum(func(val,XX[[i]]))
}
return(tot)
}
lampert <- function(n,lambda, func = quad_exp){
# Kst <- gram(t(cbind(X0,X1,X2,X3,X4,X5,X6)))
Kst <- K_st(cbind(X0,X1,X2,X3,X4,X5),func)
kt <- K_t(cbind(X0,X1,X2,X3,X4,X5),X6,func)
beta <- solve(Kst+lambda*diag(6))%*%kt
beta <- c(beta,beta[6])
print(beta)
# sample 1 to 6 according to the weights in beta
# sample <- sample(1:7,1000,prob=beta,replace=TRUE)
# clist <- cbind(X0,X1,X2,X3,X4,X5,X6)
# # drew according to X0,X1,X2,X3,..,X6 according to sample
# sampled <- c()
# for (i in 1:1000){
# bucket <- clist[,sample[i]]
# # sample 1,2,.., n uniformly
# s <- length(bucket)
# # sample 1,2,3,4.., s uniformly
# sampled <- c(sampled,bucket[sample(1:s,1)])
# }
# Herding
tar <- seq(-10,10,0.01)
largest_x <- -10
largest_y <- -100
for (val in tar){
curr_y <- sum_kernels(val,beta, func)
if (curr_y > largest_y){
largest_y <- curr_y
largest_x <- val
}
}
largest_xs <- c(largest_x)
# append the largest largest_x in largest_xs
for (i in 1:(n-1)){
largest_x <- -10
largest_y <- -100
for (val in tar){
curr_y <- sum_kernels(val,beta,func)
n_av <- length(largest_xs)
curr_y <- 1/100*curr_y - sum(func(val,largest_xs))/(n_av+1)
if (curr_y > largest_y){
largest_y <- curr_y
largest_x <- val
}
}
largest_xs <- c(largest_xs,largest_x)
}
return (largest_xs)
}
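The herded proxy sample x used in the density comparison below is presumably generated by a call of the following form; the number of herded points and the regularization value \(\lambda\) are assumptions, since the original call is not shown.
x <- lampert(n = 500, lambda = 0.1, func = quad_exp)   # assumed call; prints the beta weights below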
## [1] 0.02171327 -0.19034616 -0.02568082 0.10171712 0.26794725 0.74750785
## [7] 0.74750785
fit <- density(x)
plot_ly(x = x) %>%
# probability density
add_histogram(histnorm ="probability" ,name="Empirical sampled histogram") %>%
add_lines(x = fit$x, y = fit$y, fill = "tozeroy", yaxis = "y2", name = "Fitted density") %>%
add_lines(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),1.4,0,0), type = 'scatter', mode = 'lines', name = 't=1.4') %>%
layout(yaxis2 = list(overlaying = "y", side = "right",name="Ground truth density"))
The following heuristic transforms the distribution-prediction problem into pointwise time-series prediction on order statistics. The obvious drawback is that this does not generalize easily beyond low-dimensional settings (see the remark after the simulation).
We sample \(n\) data points each from \(p_{0},p_{0.2},p_{0.4},\cdots,p_{1.2}\) obtaining \(S_0,\cdots,S_{1.2}.\)
For each group, obtain the order statistics by sorting the \(n\) data points.
Fit \(n\) ARIMA models to predict the next time step for each order statistic. For example, pick the smallest value from each of \(S_{0},\ldots,S_{1.2}\) and perform time-series prediction (for more efficient computation, we can alternatively run ARIMA on selected percentiles only, e.g., min, max, median, …).
Aggregate the generated sample data.
order_stats_reg <- function(n){
#sort the data X0 from small to large
X0d <- sort(X0)
X1d <- sort(X1)
X2d <- sort(X2)
X3d <- sort(X3)
X4d <- sort(X4)
X5d <- sort(X5)
X6d <- sort(X6)
X7 <- rep(0,n)
for (i in 1:n){
# fit a arima model given X0(i),X1(i),X2(i),X3(i),X4(i),X5(i),X6(i) fit to X7(i)
# predict X7(i) given X0(i),X1(i),X2(i),X3(i),X4(i),X5(i),X6(i)
# sample X7(i) according to the predicted distribution
arma_model <- arima(c(X0d[i],X1d[i],X2d[i],X3d[i],X4d[i],X5d[i],X6d[i]),order=c(0,1,1))
X7[i] <- predict(arma_model,n.ahead=1)$pred[1]
}
return (X7)
}
x <- order_stats_reg(n)
fit <- density(x)
plot_ly(x = x) %>%
# probability density
add_histogram(histnorm ="probability" ,name="Empirical sampled histogram") %>%
add_lines(x = fit$x, y = fit$y, fill = "tozeroy", yaxis = "y2", name = "Fitted density") %>%
add_lines(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),1.4,0,0), type = 'scatter', mode = 'lines', name = 't=1.4') %>%
layout(yaxis2 = list(overlaying = "y", side = "right",name="Ground truth density"))
This only works for low-dimensional examples; the methodology does not generalize straightforwardly to higher dimensions.
A highly related problem arises in speech recognition (dynamic time warping), which seeks to optimize \[\begin{equation} \min_{\psi}\left\{\int_{0}^1\|x(t)-y(\psi(t))\|dt: \ \psi:[0,1]\to[0,1],\ \text{increasing}\right\} \end{equation}\] This flexibly models our scenario, since the time-varying process is just a reparametrization of time that can be implemented by \(\psi\). It is analogous to the quantile representation of random variables.
library(dtw);
# X0 <- rnorm(100,-0.01,1)*sqrt(2*0.01)
# Y0 <- rep(0,10)
# X1 <- rnorm(100,-0.2,1)*sqrt(2*0.2)
# X2 <- rnorm(100,-0.4,1)*sqrt(2*0.4)
# X3 <- rnorm(100,-0.6,1)*sqrt(2*0.6)
# X4 <- rnorm(100,-0.8,1)*sqrt(2*0.8)
# X5 <- rnorm(100,-1,1)*sqrt(2*1)
# X6 <- rnorm(100,-1.2,1)*sqrt(2*1.2)
alignment<-dtw(sort(X0),sort(X1),keep=TRUE);
## Display the warping curve, i.e. the alignment curve
# jpeg(file="three-way.jpg")
alignment2<-dtw(sort(X1),sort(X2),keep=TRUE);
alignment3<-dtw(sort(X2),sort(X3),keep=TRUE);
alignment4<-dtw(sort(X3),sort(X4),keep=TRUE);
alignment5<-dtw(sort(X4),sort(X5),keep=TRUE);
alignment6<-dtw(sort(X5),sort(X6),keep=TRUE);
plot(alignment,type="threeway")
#plot(alignment2)
# superimpose plot(alignment) and plot(alignment2)
# jpeg(file="comparisons.jpg")
plot(alignment,label="S0 and S0.2",style="l")
lines(alignment2$index1,alignment2$index2,col="red",label="S0.2 and S0.4")
lines(alignment3$index1,alignment3$index2,col="blue",label="S0.4 and S0.6")
lines(alignment4$index1,alignment4$index2,col="green",label="S0.6 and S0.8")
lines(alignment5$index1,alignment5$index2,col="yellow",label="S0.8 and S1")
lines(alignment6$index1,alignment6$index2,col="purple",label="S1 and S1.2")
legend("bottomright",col=c("black","red","blue","green","yellow","purple"),lty=1,legend=c("S0 and S0.2","S0.2 and S0.4","S0.4 and S0.6","S0.6 and S0.8","S0.8 and S1","S1 and S1.2"))
As a crude approximation, we will use \(\hat{\psi}_{1;1.2}\) to approximate \(p_{1.4}\). The optimization problem is now \[\begin{equation} \min_{Q_{p_{1.4}}}\|Q_{p_{1.4}}-Q_{p_{1.2}}(\hat{\psi}_{1;1.2})\| \end{equation}\] where \(Q_{p_{1.4}}\) is the quantile function. The protocol is to estimate the order matching \(\hat{\psi}_{1;1.2}\) (the alignment6 variable) and use the sampled data points at the last time step (\(S_{1.2}\)) to forward-match the samples, which achieves a mean of approximately \(-1.4\) (see the DTW result below).
# Use the alignment score (estimated psi) to predict the new quantile function
sorted_X6 <- sort(X6)
list_out <- c()
for(i in 1:length(alignment6$index1)){
list_out <- c(list_out,sorted_X6[alignment6$index1[i]])
}
hist(list_out)
fit <- density(list_out)
plot_ly(x = list_out) %>%
# probability density
add_histogram(histnorm ="probability" ,name="Empirical sampled histogram") %>%
add_lines(x = fit$x, y = fit$y, fill = "tozeroy", yaxis = "y2", name = "Fitted density") %>%
add_lines(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),1.4,0,0), type = 'scatter', mode = 'lines', name = 't=1.4') %>%
layout(yaxis2 = list(overlaying = "y", side = "right",name="Ground truth density"))
We visualize the Fourier random-features method for encoding the probability distribution via finite samples, and then aim to find a strategy to capture the dynamics of \(\kappa_t\).
d <- n
# number of points in frequency
D <- 100
# sample for x and y
x <- matrix(runif(D,-5,5))
y <- matrix(runif(D,-5,5))
bias_x0 <- t(matrix(rep(runif(D,0,2*pi),n),nrow=D,ncol=n))
z_x0 <- sqrt(2)*cos(X0%*%t(x)+bias_x0)
z_y0 <- sqrt(2)*cos(X0%*%t(y)+bias_x0)
bias_x1 <- t(matrix(rep(runif(D,0,2*pi),n),nrow=D,ncol=n))
z_x1 <- sqrt(2)*cos(X1%*%t(x)+bias_x1)
z_y1 <- sqrt(2)*cos(X1%*%t(y)+bias_x1)
bias_x2 <- t(matrix(rep(runif(D,0,2*pi),n),nrow=D,ncol=n))
z_x2 <- sqrt(2)*cos(X2%*%t(x)+bias_x2)
z_y2 <- sqrt(2)*cos(X2%*%t(y)+bias_x2)
bias_x3 <- t(matrix(rep(runif(D,0,2*pi),n),nrow=D,ncol=n))
z_x3 <- sqrt(2)*cos(X3%*%t(x)+bias_x3)
z_y3 <- sqrt(2)*cos(X3%*%t(y)+bias_x3)
bias_x4 <- t(matrix(rep(runif(D,0,2*pi),n),nrow=D,ncol=n))
z_x4 <- sqrt(2)*cos(X4%*%t(x)+bias_x4)
z_y4 <- sqrt(2)*cos(X4%*%t(y)+bias_x4)
bias_x5 <- t(matrix(rep(runif(D,0,2*pi),n),nrow=D,ncol=n))
z_x5 <- sqrt(2)*cos(X5%*%t(x)+bias_x5)
z_y5 <- sqrt(2)*cos(X5%*%t(y)+bias_x5)
bias_x6 <- t(matrix(rep(runif(D,0,2*pi),n),nrow=D,ncol=n))
z_x6 <- sqrt(2)*cos(X6%*%t(x)+bias_x6)
z_y6 <- sqrt(2)*cos(X6%*%t(y)+bias_x6)
k_x0 <- z_x0*z_y0
# average over the n samples (column-wise), yielding one kernel estimate per frequency pair
k_x0 <- apply(k_x0,2,mean)
k_x1 <- z_x1*z_y1
k_x1 <- apply(k_x1,2,mean)
k_x2 <- z_x2*z_y2
k_x2 <- apply(k_x2,2,mean)
k_x3 <- z_x3*z_y3
k_x3 <- apply(k_x3,2,mean)
k_x4 <- z_x4*z_y4
k_x4 <- apply(k_x4,2,mean)
k_x5 <- z_x5*z_y5
k_x5 <- apply(k_x5,2,mean)
k_x6 <- z_x6*z_y6
k_x6 <- apply(k_x6,2,mean)
k_x7 <- rep(0,D)
t <- y[,1]-x[,1]
#reorder k_x0 according to sorting order in t
k_x0 <- k_x0[order(t)]
# plot t versus k_x0,k_x1,k_x2,k_x3,k_x4,k_x5,k_x6 in plotly
plot_ly(x = sort(t), y = k_x0[order(t)], type = 'scatter', mode = 'lines', name = 'k_x0') %>%
add_trace(y = k_x1[order(t)], type = 'scatter', mode = 'lines', name = 'k_x1') %>%
add_trace(y = k_x2[order(t)], type = 'scatter', mode = 'lines', name = 'k_x2') %>%
add_trace(y = k_x3[order(t)], type = 'scatter', mode = 'lines', name = 'k_x3') %>%
add_trace(y = k_x4[order(t)], type = 'scatter', mode = 'lines', name = 'k_x4') %>%
add_trace(y = k_x5[order(t)], type = 'scatter', mode = 'lines', name = 'k_x5') %>%
add_trace(y = k_x6[order(t)], type = 'scatter', mode = 'lines', name = 'k_x6') %>%
# add_trace(y = t, type = 'scatter', mode = 'markers', name = 't') %>%
layout(yaxis2 = list(overlaying = "y", side = "right",name="Ground truth density"))
# fit a arima model
# for (i in 1:D){
# # fit a arima model given X0(i),X1(i),X2(i),X3(i),X4(i),X5(i),X6(i) fit to X7(i)
# arma_model <- arima(c(k_x0[i],k_x1[i],k_x2[i],k_x3[i],k_x4[i],k_x5[i],k_x6[i]),order=c(0,1,1))
# k_x7[i] <- predict(arma_model,n.ahead=1)$pred[1]
# }
# return (list(k_x7, y-x))
# }
# use a smoothing spline to smooth K_x1
spar_coe <- 0.9
lowpass.spline1 <- smooth.spline(sort(t),k_x1[order(t)],spar=spar_coe)
lowpass.spline2 <- smooth.spline(sort(t),k_x2[order(t)],spar=spar_coe)
lowpass.spline3 <- smooth.spline(sort(t),k_x3[order(t)],spar=spar_coe)
lowpass.spline4 <- smooth.spline(sort(t),k_x4[order(t)],spar=spar_coe)
lowpass.spline5 <- smooth.spline(sort(t),k_x5[order(t)],spar=spar_coe)
lowpass.spline6 <- smooth.spline(sort(t),k_x6[order(t)],spar=spar_coe)
# # plot t versus lowpass.spline1, ..., lowpass.spline6 in plotly
plot_ly(x = sort(t), y = lowpass.spline1$y, type = 'scatter', mode = 'lines', name = 'k_x1') %>%
add_trace(x = sort(t),y = lowpass.spline2$y, type = 'scatter', mode = 'lines', name = 'k_x2') %>%
add_trace(x = sort(t),y = lowpass.spline3$y, type = 'scatter', mode = 'lines', name = 'k_x3') %>%
add_trace(x = sort(t),y = lowpass.spline4$y, type = 'scatter', mode = 'lines', name = 'k_x4') %>%
add_trace(x = sort(t),y = lowpass.spline5$y, type = 'scatter', mode = 'lines', name = 'k_x5') %>%
add_trace(x = sort(t),y = lowpass.spline6$y, type = 'scatter', mode = 'lines', name = 'k_x6') %>%
layout(yaxis = list(overlaying = "y", side = "right",name="Ground truth density"))
The oscillations at the ends are due to the instability of the endpoints: since \(x,y\sim Unif(-5,5)\), the extreme lags \(t=y-x\) arise only from pairs near the boundary values \(x=-5, y=5\), i.e., from very few Monte Carlo draws.
t_h <- sort(t)
kx66 <- lowpass.spline6$y
kx6_alt <- k_x6[order(t)]
empirical_x7 <- function(t){
t_h_diff <- diff(t_h)
t_h_diff <- c(t_h_diff,median(t_h_diff))
return (abs(sum(exp(1i*t_h*t)*kx6_alt*t_h_diff)))
}
# first order difference of t_h
empiri_out <- c()
for (i in seq(-10,10,0.05)){
empiri_out <- c(empiri_out,empirical_x7(i))
}
tot_den <- sum(empiri_out)*0.05
empiri_out <- empiri_out/tot_den
lowpass.spline1 <- smooth.spline(seq(-10,10,0.05),empiri_out,spar=1)
plot_ly(x = x) %>%
add_lines(x = seq(-10,10,0.05), y = empiri_out, fill = "tozeroy", yaxis = "y2", name = "Fitted density") %>%
add_lines(x = seq(-10,10,0.05), y = lowpass.spline1$y, fill = "tozeroy", yaxis = "y2", name = "Smoothed Fitted density") %>%
add_lines(x = seq(-10,10,0.1), y = diffusion_sol(seq(-10,10,0.1),1.4,0,0), yaxis = "y2", type = 'scatter', mode = 'lines', name = 't=1.4') %>%
layout(yaxis2 = list(overlaying = "y", side = "right",name="Ground truth density"))
This numerical strategy makes sense since \[ \mathcal{F}(e^{-ax^2})(k) = \sqrt{\frac{\pi}{a}}e^{-\pi^2k^2/a}\ . \] As our diffusion process has a gradually increasing \(a\), it becomes more evident in the smoothed-kernel figure above that the maximum value (\(\sqrt{\frac{\pi}{a}}\)) decreases. However, when we try to do the decoding, the recovered distribution is almost symmetric.
This approach, like Lampert's, may generalize flexibly to higher dimensions. However, the main problem with the (real-only) cosine features is that the translation information (the phase factor) is completely lost in the purely real treatment. Namely,
\[ \mathcal{F}\{g(t-a)\}(\omega) = e^{-i\omega a}G(\omega)\ ,\]
where the factor \(e^{-i\omega a}\) is essential for recovering the translational (location) information of the distribution. The full prediction result is much better if we incorporate the full complex-domain information.
Input: batch samples \(S_0,S_1,\cdots,S_{T}\), each \(S_t\) sampled IID from the time-varying probability distribution \(p_t\)
Output: the predicted batch \(S_{T+1}\) and distribution \(p_{T+1}\)
(A) Bundling the observations into fixed Fourier-transformed points
Sample the corresponding coupled frequency-domain points \(\omega\in\mathbb{R}^d\), \(\pmb{\omega}=(\omega_1,\omega_2,\ldots,\omega_N)\)
Compute the empirical average of the Fourier kernel \(\hat{\kappa}_t(\omega_j)=\frac{1}{|S_t|}\sum_{x\in S_t}e^{i x^T \omega_j},\ 0\leq t\leq T,\ 1\leq j\leq N\), estimating \(\kappa_t(\omega_j)=\mathbb{E}_{x\sim p_t}[e^{i x^T\omega_j}]\)
(B) Forward prediction of the bundled frequency points
For each \(\omega_j\)
Forward predict \(\hat{\kappa}_{T+1}(\omega_j)\) (using a suitable ARIMA model, with separate real- and imaginary-part predictions)
(C) Inverse transform
Approximate \(p_{T+1}(x)=\frac{1}{(2\pi)^d}\int_{\mathbb{R}^d}\hat{\kappa}_{T+1}(\omega)e^{-i \omega^T x}d\omega\)
In 1 dimension, this can be effectively computed
\[\hat{p}_{T+1}(x)=\frac{1}{C}\sum_{j=2}^{N}\hat{\kappa}_{T+1}(\omega_{(j)})e^{-i\omega_{(j)}^T x}\Big(\omega_{(j)}-\omega_{(j-1)}\Big)\]
where \(\omega_{(j)}\) denotes the sorted sequence of \(\pmb{\omega}\) and \(C\) is a normalization factor.
(D) Post process smoothing
library(forecast)
d <- n
D <- 100
# sample x and constrain to low freuency modes
x <- matrix(runif(D,-2,2))
# y <- matrix(runif(D,-5,5))
bias_x0 <- t(matrix(rep(runif(D,0,2*pi),n),nrow=D,ncol=n))
z_x0 <- exp(1i*X0%*%t(x))
z_y1 <- exp(1i*X1%*%t(x))
z_y2 <- exp(1i*X2%*%t(x))
z_y3 <- exp(1i*X3%*%t(x))
z_y4 <- exp(1i*X4%*%t(x))
z_y5 <- exp(1i*X5%*%t(x))
z_y6 <- exp(1i*X6%*%t(x))
mean0 <- apply(z_x0,2,mean)
mean1 <- apply(z_y1,2,mean)
mean2 <- apply(z_y2,2,mean)
mean3 <- apply(z_y3,2,mean)
mean4 <- apply(z_y4,2,mean)
mean5 <- apply(z_y5,2,mean)
mean6 <- apply(z_y6,2,mean)
# perform arima regression on Re(mean0), Re(mean1), Re(mean2), Re(mean3), Re(mean4), Re(mean5), Re(mean6)
output_val <- c()
for (j in 1:D){
# fit a arima model given X0(i),X1(i),X2(i),X3(i),X4(i),X5(i),X6(i) fit to X7(i)
arima_model <- auto.arima(c(Re(mean0[j]),Re(mean1[j]),Re(mean2[j]),Re(mean3[j]),Re(mean4[j]),Re(mean5[j]),Re(mean6[j])), ic = "aic")
#arma_model <- arima(c(Re(mean0[i]),mean1[i],mean2[i],mean3[i],mean4[i],mean5[i],mean6[i]),order=c(0,1,1))
real_val<-forecast( arima_model)$mean[1]
# fit the imaginary part
arima_model <- auto.arima(c(Im(mean0[j]),Im(mean1[j]),Im(mean2[j]),Im(mean3[j]),Im(mean4[j]),Im(mean5[j]),Im(mean6[j])), ic = "aic")
imag_val<- forecast( arima_model)$mean[1]
output_val <- c(output_val,real_val+1i*imag_val)
}
t_h <- sort(x[,1])
empirical_kime_x7 <- function(t){
t_h_diff <- diff(t_h)
t_h_diff <- c(t_h_diff,median(t_h_diff))
return (abs(sum(exp(-1i*t_h*t)*output_val[order(x[,1])]*t_h_diff)))
}
# first order difference of t_h
empiri_out <- c()
for (i in seq(-10,5,0.05)){
empiri_out <- c(empiri_out,empirical_kime_x7(i))
}
tot_den <- sum(empiri_out)*0.05
empiri_out <- empiri_out/tot_den
lowpass.spline1 <- smooth.spline(seq(-10,5,0.05),empiri_out,spar=0.8)
plot_ly(x = x) %>%
add_lines(x = seq(-10,5,0.05), y = empiri_out, fill = "tozeroy", yaxis = "y2", name = "Fitted density") %>%
add_lines(x = seq(-10,5,0.05), y = lowpass.spline1$y, fill = "tozeroy", yaxis = "y2", name = "Smoothed Fitted density") %>%
add_lines(x = seq(-10,5,0.1), y = diffusion_sol(seq(-10,5,0.1),1.4,0,0), yaxis = "y2", type = 'scatter', mode = 'lines', name = 't=1.4') %>%
layout(yaxis2 = list(overlaying = "y", side = "right",name="Ground truth density"))
In general, the predicted density is still slightly off in its mean, since the auto.arima hyperparameter search can be unstable.
Note: All methodologies are trained on batches of n = 100 samples from each distribution.
… Improve this simulation example …
We consider a sample of size \(n = 100\), \((y_1, y_2, y_3, \ldots, y_{100})\), from the model \(y_i = \cos(\pi x_i) + \varepsilon_i\), where \(\varepsilon_i \sim N\left (\mu=0, \sigma^2=(3/2)^2\right )\). Let’s simulate \(x\) and \(y\), and then run a direct search for the optimal smoothing parameter \(\lambda\) using the GCV criterion.
We start with a function that fits cubic smoothing splines using the RKHS framework.
###### Data ########
set.seed(3)
n <- 100
x <- matrix(runif(n, min = -2, max = 2), ncol = 1)
x.star <- matrix(sort(x), ncol = 1) # sorted x, used by plot
y <- cos(pi * x.star) + rnorm(n, sd = 3/2)
#### Reproducing Kernel for <f,g>=int_0^1 { f’’(x) g’’(x) dx } #####
rk.1 <- function(s, t) {
  # reproducing kernel for the cubic smoothing-spline penalty <f,g> = int f''(x) g''(x) dx
  return(0.5 * min(s, t) ^ 2 * max(s, t) - (1 / 6) * min(s, t) ^ 3)
}
get.GramMatrix.1 <- function(X) {
n <- dim(X)[1]
GramMatrix <- matrix(0, n, n) #initializes GramMatrix array
#i=index for rows
#j=index for columns
GramMatrix <- as.matrix(GramMatrix) # GramMatrix matrix
for (i in 1:n) {
for (j in 1:n) {
GramMatrix[i, j] <- rk.1(X[i, ], X[j, ])
}
}
return(GramMatrix)
}
smoothing.spline <- function(X, y, lambda) {
GramMatrix <- get.GramMatrix.1(X) # Gram matrix (nxn)
n <- dim(X)[1] # n=length of y
J <- matrix(1, n, 1) # vector of ones dim
T <- cbind(J, X) # matrix with a basis for the null space of the penalty
Q <- cbind(T, GramMatrix) # design matrix
m <- dim(T)[2] # dimension of the null space of the penalty
S <- matrix(0, n+m, n+m) #initialize S
S[(m + 1):(n + m), (m + 1):(n + m)] <- GramMatrix # non-zero part of S
M <- (t(Q) %*% Q + lambda * S)
M.inv <- solve(M) # inverse of M
gamma.hat <- crossprod(M.inv, crossprod(Q, y))
f.hat <- Q %*% gamma.hat
A <- Q %*% M.inv %*% t(Q)
tr.A <- sum(diag(A)) # trace of hat matrix
rss <- t(y - f.hat) %*% (y - f.hat) # residual sum of squares
gcv <- n * rss / (n - tr.A) ^ 2 # obtain GCV score
return(list(f.hat = f.hat, gamma.hat = gamma.hat, gcv = gcv))
}
### Now we have to find an optimal lambda using GCV...
### Plot of GCV
lambda <- 1e-8
n <- 100
V <- rep(0, n)
for (i in 1:n) {
V[i] <- smoothing.spline(x.star, y, lambda)$gcv # obtain GCV score
lambda <- lambda * 1.5 # increase lambda
}
lambda.grid <- 1e-8 * 1.5 ^ (0:(n - 1))  # the lambda values visited in the loop above
plot_ly(x=lambda.grid, y=V, type="scatter", mode="lines") %>%
  layout(title="GCV vs. Lambda", xaxis = list(title="Lambda", type = "log"),
         yaxis = list(title="GCV"))
i <- (1:n)[V == min(V)] # extract index of min(V)
fit_rk <- smoothing.spline(x.star, y, 1.5 ^ (i - 1) * 1e-8)
# Graph (Trigonometric Signal)
plot_ly(x=x.star, y=fit_rk$f.hat[,1], type="scatter", mode="lines",
name="Ridge RKHS Model") %>%
add_trace(x=x.star, y=y[,1], name="Data", mode="markers",
hovertemplate = paste('(%{x},%{y})')) %>%
add_trace(x=x.star, y=cos(pi * x.star)[,1], name="Exact Source") %>%
layout(title="Exact Source, Noisy data, and RKHS Model",
xaxis = list(title="X", showline=TRUE), yaxis = list(title="Y"),
legend = list(title="Objects", orientation = 'h'), hovermode = 'x')
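As an optional sanity check (not part of the original example), we can compare the RKHS ridge fit against the base-R smooth.spline, which with cv = FALSE also selects its smoothing parameter by GCV; the two penalty parameterizations differ, so only qualitative agreement should be expected.
# Hedged sanity check: base-R smooth.spline (cv = FALSE selects lambda by GCV) vs. the RKHS fit
fit_ss <- smooth.spline(x = x.star[, 1], y = y[, 1], cv = FALSE)
plot_ly(x = x.star[, 1], y = fit_rk$f.hat[, 1], type = "scatter", mode = "lines",
        name = "RKHS ridge fit") %>%
  add_trace(x = fit_ss$x, y = fit_ss$y, mode = "lines", name = "smooth.spline (GCV)") %>%
  add_trace(x = x.star[, 1], y = cos(pi * x.star[, 1]), mode = "lines", name = "Exact source") %>%
  layout(title = "RKHS fit vs. base-R smooth.spline",
         xaxis = list(title = "X"), yaxis = list(title = "Y"))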
(Multivariate Kernels) The RKHSs of Matérn kernels are Sobolev spaces. The Matérn kernel is the Green’s function of the \(d\)-dimensional modified Helmholtz operator
\[(-\nabla^2+ \epsilon^{2} I)^m \]
This can be derived from a few key facts. First, the inner product of the Sobolev space is (see p. 176)
\[\langle f,g\rangle_{W_2^{m}(\mathbb{R}^d)}=\int_{\mathbb{R}^d}(1+\|\omega\|^2)^m \hat{f}(\omega) \overline{\hat{g}(\omega)}\,d\omega \ .\] Second, inserting a translation-invariant kernel ansatz and using the reproducing property \(f(x)=\langle f,K(x-\cdot)\rangle_{W_2^m(\mathbb{R}^d)}\) leads to the Fourier translation property \(\widehat{K(x-\cdot)}(\omega)=e^{-ix^T\omega}\hat{K}(\omega)\).
We also note that the inverse Fourier transform of \((\epsilon^2+\|\omega\|^2)^{-m}\),
\[K(x-y) = (2\pi)^{-d/2}\int_{\mathbb{R}^d}(\epsilon^2+\|\omega\|_2^2)^{-m}e^{i(x-y)^T\omega}d\omega= \frac{2^{1-m}}{\Gamma(m)}\Big(\frac{\|x-y\|}{\epsilon}\Big)^{m-\frac{d}{2}}K_{\frac{d}{2}-m}(\epsilon\|x-y\|),\] is the Matérn kernel; see the explicit calculation here, p. 76 of Wendland's Scattered Data Approximation. A rough sketch of the proof: write out the definition of the gamma function, apply a scaling to insert the \(1/(1+\|x\|^2)^{m}\) form, plug the result into the Fourier transform, and finally apply three changes of variables to arrive at the integral representation of the modified Bessel function of the second kind.
Finally, the correspondence between the differential operator and the Fourier multiplier \((\epsilon^2+\|\omega\|^2)^{m}\) follows from integration by parts: \[\begin{equation} -\nabla^2\ \to\ -i^2\|\omega\|^2 = \|\omega\|^2, \qquad (-\nabla^2+\epsilon^2 I)^m\ \to\ (\epsilon^2+\|\omega\|^2)^m . \end{equation}\]
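To make the Green's-function relation concrete, the following minimal numerical check (our own illustration, assuming \(d=1\), \(m=1\), \(\epsilon=1\)) inverts the Fourier multiplier \((1+\omega^2)^{-1}\) by quadrature and compares it with the closed form \(\sqrt{\pi/2}\,e^{-|r|}\), i.e., the \(\nu=1/2\) (exponential) Matérn profile implied by the formula above.
# Minimal numerical check of the Green's-function relation (d = 1, m = 1, epsilon = 1):
# (2*pi)^(-1/2) * integral of cos(w*r)/(1 + w^2) dw should equal sqrt(pi/2) * exp(-|r|)
inv_ft <- function(r) {
  (1 / sqrt(2 * pi)) * integrate(function(w) cos(w * r) / (1 + w^2),
                                 lower = -Inf, upper = Inf)$value
}
r_grid <- seq(0.1, 3, by = 0.5)
round(cbind(r = r_grid,
            numeric_inverse_FT = sapply(r_grid, inv_ft),
            closed_form = sqrt(pi / 2) * exp(-r_grid)), 4)  # the two columns should agree closely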
Matérn kernel shapes
library(GPBayes)
besselk_nu <- function(nu) {
return(function(x) {
return(GPBayes::BesselK(nu, x))
})
}
maternkernel <- function(sigma,nu,rho,x){
scaled_x = sqrt(2*nu)*x/rho
return (sapply(scaled_x,besselk_nu(nu))*scaled_x**nu*sigma**2*2**(1-nu)/gamma(nu))
}
library(plotly)
p = plot_ly()
x_values = seq(from = -5, to = 5, by = 0.01)
p <- p %>%add_trace( x = ~x_values, y = ~maternkernel(1,1/2,1,abs(x_values)), name = ~paste("Plot for nu =", 1/2), mode="lines")
p <- p %>%add_trace( x = ~x_values, y = ~maternkernel(1,1,1,abs(x_values)), name = ~paste("Plot for nu =", 1), mode="lines")
p <- p %>%add_trace( x = ~x_values, y = ~maternkernel(1,3/2,1,abs(x_values)), name = ~paste("Plot for nu =", 3/2), mode="lines")
p <- p %>%add_trace( x = ~x_values, y = ~maternkernel(1,2,1,abs(x_values)), name = ~paste("Plot for nu =", 2), mode="lines")
p <- p %>%add_trace( x = ~x_values, y = ~maternkernel(1,10,1,abs(x_values)), name = ~paste("Plot for nu =", 10), mode="lines")
p = layout(p, title = "Matern kernels with varying nu",
xaxis = list(title = "X values"), yaxis = list(title = "Y values"))
p
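As a quick check of the plotted shapes (an optional aside, relying on the maternkernel function above), at \(\nu=1/2\) the Matérn kernel should reduce to the exponential kernel \(\sigma^2 e^{-|x|/\rho}\), and for large \(\nu\) it approaches the Gaussian \(\sigma^2 e^{-x^2/(2\rho^2)}\).
# Optional check of the Matern limits (sigma = rho = 1); avoid x = 0, where K_nu(0) diverges
x_chk <- seq(0.1, 3, by = 0.5)
round(cbind(x = x_chk,
            matern_nu_0.5 = maternkernel(1, 1/2, 1, x_chk),
            exponential   = exp(-x_chk)), 6)        # nu = 1/2: exact match expected
round(cbind(x = x_chk,
            matern_nu_10  = maternkernel(1, 10, 1, x_chk),
            gaussian      = exp(-x_chk^2 / 2)), 3)  # nu = 10: close to the Gaussian limit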
We use a similar setup, where \(X=[x_1,x_2,...,x_n]^T\in\mathbb{R}^{n\times d}\). In the kernelized ridge regression case, we establish the differences and synergies between feature modeling and RKHS modeling. Namely, there are two possible starting points
\[\begin{equation} \begin{split} \text{Linear Vector Modeling: }y_i &= \mathbf{w}^T\phi(x_i)=\sum_{j=1}^D w_j \phi(x_i)_j\\ \text{Function Modeling (From representer theorem): }f(\cdot) & = \sum_{j=1}^n \tilde{w_j} \phi_{x_j}(\cdot)= \sum_{j=1}^n \tilde{w_j} \kappa(x_j, \cdot) \end{split} \end{equation}\]
The first model maps data points \(x_i\in\mathbb{R}^d\) to finite-dimensional feature vectors \(\phi(x_i)\in\mathbb{R}^D\), while the second maps data points to feature functions \(\phi_{x_i}(\cdot)\), which are infinite-dimensional objects. Note that in the first model the dimension \(D\) is determined by the nonlinear transformation \(\phi\) and can be arbitrary (it need not equal \(d\), where \(x_i\in \mathbb{R}^d\)). The vector-model optimization problem is
\[\begin{equation} \min_{\mathbf{w}\in\mathbb{R}^D}\|y-\Phi^T \mathbf{w}\|_2^2+\lambda\|\mathbf{w}\|_2^2, \qquad \Phi_{ij}=\phi(x_j)_i,\ \Phi\in \mathbb{R}^{D\times n} \tag{5.1} \end{equation}\]
\[\mathbf{w} = (\lambda I_D+\Phi\Phi^T)^{-1}\Phi y\] If we instead set up the problem using the reproducing kernel property, where \(K_{ij}=\kappa(x_i,x_j)\), and penalize the raw weights \(\mathbf{\tilde{w}}\), the solution will be different:
\[\begin{align} \min_{\mathbf{\tilde{w}}\in \mathbb{R}^n}\|y-K \mathbf{\tilde{w}}\|_2^2 + \lambda\|\mathbf{\tilde{w}}\|_2^2, \mathbf{\tilde{w}} = (KK+\lambda I_n)^{-1}Ky \end{align}\]
However, the correspondence between the two approaches can be realized by instead penalizing the norm of the weights for the “features”
\[\begin{align} \min_{\mathbf{\tilde{w}}\in \mathbb{R}^n}\|y-K \mathbf{\tilde{w}}\|_2^2 + \lambda\|\Phi\mathbf{\tilde{w}}\|_2^2 \tag{5.2} \end{align}\]
Hence, the two optimization problems (5.1) and (5.2) are equivalent under the fixed linear transformation \(\Phi\mathbf{\tilde{w}}=\mathbf{w}\).
Lemma: In \(L_2\) ridge regression, the fitted parameter can equivalently be expressed as \[\begin{equation} \mathbf{w} = (\lambda I_d+\Phi\Phi^T)^{-1}\Phi y = \Phi(\Phi^T\Phi+\lambda I_n)^{-1}y \tag{5.3} \end{equation}\]
Proof: A neat trick that performs the matrix inverse in (5.1) in a smaller space is \((P^{-1}+B^TR^{-1}B)^{-1}B^TR^{-1}=PB^T(BPB^T+R)^{-1}\) for a non-square matrix \(B\) (see Max Welling’s notes here). Namely, if \(B\in \mathbb{R}^{n\times d}\), then the left-hand side inverts a \(d\times d\) matrix while the right-hand side inverts an \(n \times n\) matrix; the right-hand side is therefore preferable when \(n < d\), and vice versa. Setting \(\Phi=B^T\in \mathbb{R}^{d\times n}\), \(P^{-1}=\lambda I_d\), \(R^{-1}= I_n\), we establish the equivalence \[\begin{equation} \mathbf{w} = (\lambda I_d+\Phi\Phi^T)^{-1}\Phi y = \Phi(\Phi^T\Phi+\lambda I_n)^{-1}y . \end{equation}\]
Note: From this derivation we see that the exact calculation of \(\mathbf{w}\) is intractable, and it is hard to interpret its value precisely since \(\Phi\) is not explicitly available. However, since \(\Phi\mathbf{\tilde{w}}=\mathbf{w}\), the observation-wise weights are computable: \(\mathbf{\tilde{w}} = (\Phi^T\Phi+\lambda I_n)^{-1}y=(K+\lambda I_n)^{-1}y\).
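The identity in (5.3) is easy to verify numerically; the following sketch (our own illustration with a random feature matrix) confirms that the two expressions for \(\mathbf{w}\) agree to machine precision.
# Numerical verification of (5.3) with a random Phi in R^{d x n}:
# (lambda*I_d + Phi Phi^T)^{-1} Phi y  ==  Phi (Phi^T Phi + lambda*I_n)^{-1} y
set.seed(1)
d_chk <- 5; n_chk <- 8; lam_chk <- 0.3
Phi_chk <- matrix(rnorm(d_chk * n_chk), d_chk, n_chk)
y_chk <- rnorm(n_chk)
w_left  <- solve(lam_chk * diag(d_chk) + Phi_chk %*% t(Phi_chk)) %*% Phi_chk %*% y_chk
w_right <- Phi_chk %*% solve(t(Phi_chk) %*% Phi_chk + lam_chk * diag(n_chk)) %*% y_chk
max(abs(w_left - w_right))  # on the order of machine precision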
Throughout, we use \(d\) to denote the number of features and \(n\) the number of samples. The dual weights \(\mathbf{\tilde{w}}\) encode the contributions of the individual data cases; at test time, the prediction for a new input \(x_{test}\) is interpolated using Equation (5.3), \[\begin{equation} \hat{y}(x_{test}) = f(x_{test}) = \mathbf{w}^T \Phi(x_{test}) = y^T(K+\lambda I_n)^{-1}\kappa(x_{test}), \end{equation}\] where \(K(x_i,x_j)=\Phi(x_i)^T\Phi(x_j)\), \(\kappa(x_i,x_{test})=\Phi(x_i)^T\Phi(x_{test})\), and \(\kappa(x_{test})=[\kappa(x_1,x_{test}),\ldots,\kappa(x_n,x_{test})]^T\).
Recall that the hat matrix maps the raw observations \(\mathbf{y}\) to the fitted values \(\hat{\mathbf{y}}\), i.e., \(\hat{\mathbf{y}}= H\mathbf{y}\). In this derivation, the hat matrix is \[\begin{equation} \hat{y} = \Phi^T \mathbf{w} = \underbrace{K(K+\lambda I_n)^{-1}}_{H}\mathbf{y}. \end{equation}\] The corresponding approximation of the generalized cross-validation (GCV) score is \[\begin{equation} GCV = \frac{n\, RSS}{(n-tr(H))^2}. \end{equation}\]
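A minimal sketch of these formulas on synthetic data (our own toy example, assuming a Gaussian kernel, which is not prescribed by the text): fit kernel ridge regression, form the hat matrix \(H=K(K+\lambda I_n)^{-1}\), compute the GCV score, and predict at a new point.
# Toy kernel ridge regression with a Gaussian kernel (illustrative assumption)
set.seed(2)
n_krr <- 60
x_krr <- sort(runif(n_krr, -2, 2))
y_krr <- cos(pi * x_krr) + rnorm(n_krr, sd = 0.5)
gauss_kernel <- function(a, b, ell = 0.5) exp(-(outer(a, b, "-"))^2 / (2 * ell^2))
K_krr <- gauss_kernel(x_krr, x_krr)
lambda_krr <- 0.1
H_krr <- K_krr %*% solve(K_krr + lambda_krr * diag(n_krr))  # hat matrix H = K (K + lambda I)^{-1}
y_fit <- H_krr %*% y_krr
n_krr * sum((y_krr - y_fit)^2) / (n_krr - sum(diag(H_krr)))^2  # GCV = n*RSS / (n - tr(H))^2
# Prediction at a new point: y^T (K + lambda I)^{-1} kappa(x_test)
x_test <- 0.3
as.numeric(t(y_krr) %*% solve(K_krr + lambda_krr * diag(n_krr)) %*% gauss_kernel(x_krr, x_test))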
The canonical setup for LASSO is
\[\begin{equation} \min_{\beta\in \mathbb{R}^d}\|y-X\beta\|_2^2 + \lambda\|\beta\|_1 \end{equation}\]
Unlike in ridge regression, it is especially important to distinguish whether the shrinkage is enforced on the instances or on the features. Here \(X=[x_1,x_2,...,x_n]^T\in\mathbb{R}^{n\times d}\) corresponds to \(n\) samples, and each \(x_i\in \mathbb{R}^d\) has \(d\)-dimensional features. There are two ways of applying a nonlinear feature mapping, leading to different interpretations of the modeling and shrinkage.
Applying a nonlinear feature map to the instances gives \(\Phi^T=[\varphi(x_1),\varphi(x_2),...,\varphi(x_n)]^T\in\mathbb{R}^{n\times d'}\). The corresponding kernel is
\[\begin{equation} \kappa(x_i,x_j)=\varphi(x_i)^T\varphi(x_j) \end{equation}\]
We can explicitly expand the RKHS property
\[\begin{equation} f(\cdot) = \sum_{i=1}^n \alpha_i \kappa(\cdot, x_i)=\sum_{i=1}^n\alpha_i \varphi_{x_i}(\cdot) \end{equation}\]
The linear superposition corresponds to a weighted sum over the \(n\) observations. The linear system is obtained by interpolating at the \(x_j\)'s: \[\begin{equation} y_j = f(x_j) = \sum_{i=1}^n \alpha_i \kappa(x_i,x_j), \quad \forall\ j=1,2,3,...,n \end{equation}\]
Crucially, the LASSO problem is then \[\begin{equation} \min_{\alpha\in \mathbb{R}^n}\|y-A\alpha\|_2^2 + \lambda\|\alpha\|_1 \tag{5.4} \end{equation}\] where \(A_{ij} = \kappa(x_i,x_j)\), \(A=\Phi^T\Phi\in \mathbb{R}^{n\times n}\). The matrix \(A\) is typically obtained by plugging in the parametric form of the kernel \(\kappa\). This optimization problem enforces shrinkage on the observation instances.
\(\blacktriangleright\) For example, for the linear kernel \(\kappa(x_i,x_j)=x_i\cdot x_j\), problem (5.4) translates into
\[\begin{align} \min_{\alpha\in \mathbb{R}^n}\|y-XX^T \alpha\|_2^2 + \lambda\|\alpha\|_1 \end{align}\]
The soft-thresholding solution (exact when the design matrix has orthonormal columns) is then
\[\begin{equation} \hat{\alpha_j}^{(LASSO)} = S_{\lambda}(\hat{\alpha_j}^{(OLS)})=\hat{\alpha_j}^{(OLS)}\max(0,1-\frac{\lambda}{|\hat{\alpha_j}^{(OLS)}|}), \end{equation}\]
\[\begin{equation} \hat{\alpha}^{OLS}= (XX^TXX^T)^{-1}XX^Ty \end{equation}\]
The more general nonlinear version is
\[\begin{equation} \hat{\alpha}^{OLS}= (\Phi^T\Phi\Phi^T\Phi)^{-1}\Phi^T\Phi y=(KK)^{-1}Ky=K^{-1}y \end{equation}\]
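In practice, problem (5.4) can be handed to a standard LASSO solver by treating the Gram matrix \(A\) as the design matrix. Below is a hedged sketch using glmnet on simulated data (the solver and the simulated data are our own illustration, not part of the text), with the linear kernel \(A=XX^T\).
# Hedged sketch: observation-shrinkage kernel LASSO (5.4) with a linear kernel,
# solved by handing the Gram matrix A = X X^T to glmnet as the design matrix
library(glmnet)
set.seed(4)
n_obs <- 50; d_obs <- 5
X_obs <- matrix(rnorm(n_obs * d_obs), n_obs, d_obs)
y_obs <- X_obs %*% c(2, -1, 0, 0, 0) + rnorm(n_obs, sd = 0.5)
A_obs <- X_obs %*% t(X_obs)                            # A_ij = kappa(x_i, x_j), linear kernel
fit_obs <- glmnet(A_obs, y_obs, alpha = 1, lambda = 0.5,
                  intercept = FALSE, standardize = FALSE)
alpha_hat <- as.numeric(as.matrix(coef(fit_obs)))[-1]  # instance weights alpha_i
sum(alpha_hat != 0)                                    # number of "active" observations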
However, as we saw in the ridge regression case, we can adjust optimization problem (5.4) (analogously to (5.2)) to instead enforce sparsity in the transformed feature space, where \(\Phi\alpha\in \mathbb{R}^{d'}\):
\[\min_{\alpha\in \mathbb{R}^n}\|y-\Phi^T\Phi\alpha\|_2^2 + \lambda\|\Phi\alpha\|_1 .\]
This becomes
\[\min_{\alpha'\in \mathbb{R}^{d'}}\|y-\Phi^T\alpha' \|_2^2 + \lambda\|\alpha'\|_1 .\] Here \(\hat{\alpha'}^{OLS}=(\Phi\Phi^T)^{-1}\Phi y\); however, since \(\Phi\) is not known explicitly and there is no trick analogous to the ridge-regression case for evaluating it, this optimization is hard to realize from knowledge of \(K\) alone.
Alternatively, we can apply a nonlinear feature map to the features (columns), \(X=[x^1,x^2,...,x^d]\in\mathbb{R}^{n\times d}\to [\phi(x^1),...,\phi(x^d)]\in\mathbb{R}^{n'\times d}\). Crucially, the outcome \(y\) also needs to be transformed, \(\phi(y)\in\mathbb{R}^{n'}\). We can again expand the RKHS property
\[\begin{equation} f(\cdot) = \sum_{i=1}^d \beta_i \kappa(\cdot, x^i)=\sum_{i=1}^d\beta_i \phi_{x^i}(\cdot) \end{equation}\]
Thus, the linear system is
\[\begin{equation} \phi_{y}(x^j)=\kappa(x^j,y)=\sum_{i=1}^d\beta_i \phi_{x^i}(x^j)=\sum_{i=1}^d\beta_i\kappa(x^i,x^j) \end{equation}\]
The LASSO problem is
\[\begin{equation} \min_{\beta\in \mathbb{R}^d}\|\phi(y)-C\beta\|_2^2 + \lambda\|\beta\|_1 \end{equation}\]
where \(\phi(y)=[\kappa(x^1,y),...,\kappa(x^d,y)]^T\in \mathbb{R}^d\) and \(C_{ij}=\kappa(x^i,x^j), C\in\mathbb{R}^{d\times d}\).
\(\blacktriangleright\) For example, for the linear kernel \(\kappa(x^i,x^j)=x^i\cdot x^j\), we have \(\phi(y)=X^Ty\) and \(C=X^TX\). The solution is then
\[\begin{equation} \hat{\beta_j}^{(LASSO)} = S_{\lambda}(\hat{\beta_j}^{(OLS)})=\hat{\beta_j}^{(OLS)}\max(0,1-\frac{\lambda}{|\hat{\beta_j}^{(OLS)}|}), \end{equation}\]
\[\begin{equation} \hat{\beta}^{OLS}=(C^TC)^{-1}C^T\phi(y)= (X^TXX^TX)^{-1}X^TXX^Ty \end{equation}\]
Depending on the optimization objective, a factor of \(N\) may couple to \(\lambda\), depending on whether one minimizes \(\|y-X\beta\|_2^2\) or \(\frac{1}{N}\|y-X\beta\|_2^2\). In the general nonlinear case,
\[\begin{equation} \hat{\beta}^{OLS}=C^{-1}C_y, \quad \text{where } C_y = [\kappa(x^1,y),\kappa(x^2,y),...,\kappa(x^d,y)]^T=\phi(y) \end{equation}\]
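The feature-shrinkage variant with a linear kernel can be sketched the same way: regress \(\phi(y)=X^Ty\) on \(C=X^TX\) with an \(\ell_1\) penalty (again illustrated with glmnet on simulated data; note that glmnet internally scales the squared loss by \(1/(2N)\), which is exactly the \(N\)-\(\lambda\) coupling remarked on above).
# Hedged sketch: feature-shrinkage kernel LASSO with a linear kernel,
# regressing phi(y) = X^T y on C = X^T X with an l1 penalty via glmnet
library(glmnet)
set.seed(5)
n_f <- 50; d_f <- 8
X_f <- matrix(rnorm(n_f * d_f), n_f, d_f)
y_f <- X_f %*% c(3, 0, 0, -2, rep(0, d_f - 4)) + rnorm(n_f, sd = 0.5)
C_f   <- t(X_f) %*% X_f                                # C_ij = kappa(x^i, x^j) = (x^i)^T x^j
phi_y <- t(X_f) %*% y_f                                # phi(y) = X^T y
fit_feat <- glmnet(C_f, phi_y, alpha = 1, lambda = 1,
                   intercept = FALSE, standardize = FALSE)
beta_hat <- as.numeric(as.matrix(coef(fit_feat)))[-1]  # feature weights beta_i
round(beta_hat, 3)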
Remark: In both cases, the OLS estimate generally differs from the raw OLS estimate used by glmnet (\(\alpha=1\), \(\hat{\beta}=(X^TX)^{-1}X^Ty\)); nevertheless, if we stick with the second (feature-shrinkage) formulation, we obtain similar shrinkage behavior.
Also note that the observation-shrinkage approach lacks feature attribution: given the fitted \(\alpha_i\), the model predicts a test outcome as a linear combination of (kernel similarities to) the previous \(n\) observations.
In the feature-shrinkage approach there is good feature attribution, with \(\beta_i\) weighting the importance of each feature; however, at test time it is hard to invert \(\phi(y)\) to recover the outcome variable \(y\).
Computing the GCV: Since the LASSO estimate is non-linear and non-differentiable, the hat matrix is not directly computable. Following Tibshirani, the non-differentiable penalty \(\sum_{i=1}^d |\beta_i|\) may be approximated by \(\sum_{i=1}^d \frac{\beta_i^2}{|\beta_i|}\), and hence the hat matrix for the GCV computation may be approximated, analogously to ridge regression, by
\[\hat{H}=X(X^TX+\lambda W^{-})^{-1}X^T,\]
where \(W=diag(|\hat{\beta}_1^{(LASSO)}|,...,|\hat{\beta}_d^{(LASSO)}|)\) and \(W^-\) is the generalized inverse of \(W\).
Note: The hat matrix is a linear transformation; thus, in the feature-shrinkage approach we define the hat matrix in the transformed space, \[\widehat{\phi(y)}=\hat{H}\,\phi(y)=C(CC+\lambda W^{-})^{-1}C\,\phi(y),\] with \(C\) the feature Gram matrix defined above.
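A short sketch of this approximation on a plain linear design (our own illustration; the LASSO fit itself comes from glmnet): given \(\hat{\beta}\), form \(W=diag(|\hat{\beta}_j|)\), the approximate hat matrix \(\hat{H}=X(X^TX+\lambda W^{-})^{-1}X^T\), and the corresponding GCV score.
# Illustrative approximate GCV for the LASSO (plain linear design):
# W = diag(|beta_hat_j|), H_hat = X (X^T X + lambda * W^-)^{-1} X^T
library(glmnet)
library(MASS)                                    # for ginv(), the generalized inverse
set.seed(6)
n_g <- 80; d_g <- 10
X_g <- matrix(rnorm(n_g * d_g), n_g, d_g)
y_g <- X_g %*% c(2, -3, rep(0, d_g - 2)) + rnorm(n_g)
lambda_g <- 0.2
fit_g <- glmnet(X_g, y_g, alpha = 1, lambda = lambda_g,
                intercept = FALSE, standardize = FALSE)
beta_g <- as.numeric(as.matrix(coef(fit_g)))[-1]
W_minus <- ginv(diag(abs(beta_g)))               # generalized inverse of W
H_g <- X_g %*% solve(t(X_g) %*% X_g + lambda_g * W_minus) %*% t(X_g)
rss_g <- sum((y_g - X_g %*% beta_g)^2)
n_g * rss_g / (n_g - sum(diag(H_g)))^2           # approximate GCV score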