1 Information Theory and Statistical Learning

1.1 Summary

Machine learning relies heavily on entropy-based (e.g., Renyi-entropy) information theory and kernel-based methods. For instance, Parzen-kernel windows may be used for estimation of various probability density functions, which facilitates the expression of information theoretic concepts as kernel matrices or statistics, e.g., mean vectors, in a Mercer kernel feature space. The parallels between machine learning and information theory allows the interpretation and understanding of computational methods from one field in terms of their dual representations in the other.

Machine learning (ML) is the process of data-driven estimation (quantitative evidence-based learning) of optimal parameters of a model, network or system, that lead to output prediction, classification, regression or forecasting based on a specific input (prospective, validation or testing data, which may or may not be related to the original training data). Parameter optimality is tracked and assessed iteratively by a learning criterion depending on the specific type of ML problem. Classical learning assessment criteria, including mean squared error (MSE), accuracy, $R^2$, see Chapter 9, may only capture low-order statistics of the data, e.g., first or second order. Higher-order learning criteria enable solving problems where sensitivity to higher-moments is important (e.g., matching skewness or kurtosis for non-linear clustering, classification, dimensionality reduction).

This Figure provides a schematic workflow description of machine learning.

Figure: Prior knowledge (e.g., initialization parameters) are fed into the machine learning system, along with batches of data (input samples) to generate outputs (e.g., regression, classification, clustering, etc.). At each iteration, the outputs, which depend on the current parameters and input data, are compared against known or expected results and the discrepancies are evaluated leading to a feedback loop which recursively attempts to improve these assessment quality measures. When more (training) data is available, multiple batches of data may be provided to fine-tune the system and reduce the in-bag error rate, i.e., fit the model better to the data. Learning the optimal parameters is equivalent to minimizing the error rate of the model at each iteration. The correction feedback guides the optimization of the parameters.

Information-theoretic learning (ITL) utilizes assessment criteria that are defined in terms of information theoretic functions defined on the probability densities (pdf’s) of the input or output datasets, which describe completely the statistical properties of the data.

An example of an information-theoretic criteria is the entropy of a pdf, $H(p)$, where $p(x)$ is a specific pdf. If the expectation of a process is $E$, then the entropy is the expected value of the logarithm of the probability

\[H(p)=E(-\ln (p(x)).\]

Widely varying pdf’s, like the Gaussian distribution, have large entropy, and tight pdfs, like Cauchy distribution, tend to have lower entropy measures.

For example, the Gaussian density is:

\[p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}.\]

Then, \[H(p)= E(-\ln (p(x)))=-\int_{-\infty}^{\infty}{p(x)\ln(p(x))dx}=\] \[=\frac{1}{2}\ln(2\pi\sigma^2) \int_{-\infty}^{\infty}{\frac{1}{\sqrt{2\pi\sigma^2}} e^{\frac{-(x-\mu)^2}{2\sigma^2}}dx} + \frac{1}{2 \sigma^2}\int_{-\infty}^{\infty}{(x-\mu)^2\frac{1}{\sqrt{2\pi\sigma^2}} e^{\frac{-(x-\mu)^2}{2\sigma^2}}dx} =\] \[=\frac{1}{2}\ln(2\pi\sigma^2) + \frac{1}{2} = \frac{1}{2}\ln(2 e \pi\sigma^2).\]

Whereas the entropy of the Cauchy distribution, $q(x) = \frac{1}{1+x^2}$, is constant:

\[H(q)= \int_{-\infty}^{\infty}{\frac{\ln(1+x^2)}{1+x^2}dx}.\]

Making the substitution $x=\tan(\theta)$ yields $dx=\sec^2(\theta)d\theta$, and

\[H(q) = -2 \int_{-\pi/2 }^{\pi/2}{\ln(\cos(\theta))d\theta} = -4 \int_{0}^{\pi/2}{\ln(\cos(\theta))d\theta} = -4A,\] where $A=\int_{0}^{\pi/2}{\ln(\cos(\theta))d\theta}$.

However, if $A'=\int_{0}^{\pi/2}{\ln(\sin(\theta))d\theta}$, then $A==A'$.

Thus, $2A = -\frac{\pi}{2}\ln(2)+ \int_{0}^{\pi/2}{\ln(\sin(2\theta))d\theta}$.

A change of variables in the last integral, $\rho = 2\theta$, leads to:

\[2A = -\frac{\pi}{2}\ln(2)+{\frac{1}{2}(A+A)} = -\frac{\pi}{2}\ln(2)+A.\]

Therefore, $A= -\frac{\pi}{2}\ln(2)$, and thus,

\[H(q)=-4A = 2\pi\ln(2).\]

See more details here.

Similarly, for an n-dimensional Gaussian \[ P(X) = \prod_{i=1}^n{\sqrt{2 e \pi\sigma_i^2} \left (e^{\frac{-(x_i-\mu_i)^2}{2\sigma_i^2}} \right ) }, \]

the corresponding entropy is

\[H(P) = \frac{n}{2}\ln\left (2 e \pi \prod_{i=1}^n{\sigma_i^2}\right ).\]

The Renyi entropy is a generalized form of the entropy, which is not unique, but represents a one-parameter operator (indexed by $\alpha$)

\[H_{\alpha}(p) = \frac{1}{1-\alpha}\ln \left (\int{p^{\alpha}(x)dx}\right ).\]

For $\alpha=0$, $H_o$ is the Hartley entropy representing the logarithm of the cardinality of $X$.

\[H_o(X) = \ln(n) = \ln(| X |).\]

For $\alpha=1$, this reduces to the classical Shannon’s entropy,
For $\alpha=2$, the quadratic Renyi entropy corresponds to information-theoretic learning (ITL) and can be used as an objective function in supervised neural network training. For samples from the set D = $\{x_1,..., x_N\}$, generated from a pdf p(x), the pdf may be approximated using Parzen window density estimation.

Parzen-window density estimates are defined by:

\[P(x)=\frac{1}{n}\sum_{i=1}^n{\frac{1}{h_{d,n}\times K\left ( \frac{x-x_i}{h_n}\right )}}, \]

Where $K(y)\ge 0$ is an n-dimensional smoothing kernel (window) function or kernel, i.e., $\int_{R^d}{K(y)dy}$, $h_n$ is the bandwidth (window width) parameter that corresponds to the full-width at half-maximum of the kernel, which typically depends on the number of observations, $n$. Note that the kernel function is also itself a pdf, guaranteeing that the estimated function $P(x)$ remains a pdf, as well.

The quadratic Renyi entropy may be used as an objective function in supervised neural networks learning, see Chapter 14. Briefly, in supervised learning, the input training data is paired with labels, $\{x_t, d_t\},\ t = 1,...,n$, where $d_t$ is the class label associated with the input $x_t$ . The output label generated by the neural network, $y_t$ for each input $x_t$ is then compared to $d_t$, based on the residuals $e_t = d_t - y_t$ , using some assessment criterion (e.g., MSE).

Rather than minimizing the mean residual square error, like in MSE assessment, the Renyi quadratic entropy minimizes the error pdf, $p(e)$ with respect to the (current) neural network weights. A gradient descent optimization may be used to approximate $p(e)$ by a Parzen window estimation.

1.2 Comparing PDFs

There are many statistical approaches to compare two pdfs. e.g., quantile-quantile plots, Kolmogorov-Smirnov test (KS test).

The divergence measure provides an information theoretic approach to quantity is the similarity between two or more pdf’s. Let $\{p_1(x),..., p_k (x)\}$ represent $k$ pdf functions. Then their divergence measure is:

\[ D(p_1,..., p_k )\] And it measures the distance between the pdf’s. $D=0$ only when $p_1(x)= ??? = p_k (x)$ and $D$ increases with the increase of the similarity of the pdf functions. There are many alternative divergence measures.

The most well-known is Kullback-Leibler (KL) divergence, which is based on the Shannon entropy. There are several divergence measures based on Renyi’s quadratic entropy that utilize the Parzen window estimation.

The Cauchy-Schwarz (CS) divergence is an example of a Renyi’s quadratic entropy and is successfully applied for non-linear clustering, or partitioning of data into groups, or clusters, which include objects that are more similar within a group and the differences between objects in different groups are amplified. The CS clustering partitions the input data to maximize the between-cluster divergence relative to within-cluster variability. That is, CS divergence aims to increase $D(p_1(x),..., p_k (x))$, where $p_i(x)$ represents the pdf associated with the $k$ clusters $D_i$. The cluster partitions represent the group weights, the cluster pdfs encode the output labels, and the divergence, $D$, represents the learning criterion.

The Cauchy-Schwarz Inequality for symmetric kernels $K(x,z)=\langle \Phi(x), \Phi(z)\rangle =\langle\Phi(z), \Phi(x)\rangle=K(z,x)$:

\[K^2(x,z) =\langle\Phi(x), \Phi(z)\rangle^2 \leq ||\Phi(x)||^2 ||\Phi(z)||^2=\] \[ = \langle \Phi(x), \Phi(x)\rangle \times \langle \Phi(z), \Phi(z)\rangle = K(x,x)\times K(z,z).\]

The output pdfs are approximated by Parzen window estimates $p_1(x),..., p_k(x)$ to enable evaluation of the CS divergence measure by $D_{CS}( p_1(x),..., p_k(x))$. The process maximizes the dissimilarity between the clusters by increasing the dissimilarity between the corresponding cluster pdfs:

\[\arg \max_{D_1,...,D_k } D_{CS}( p_1,..., p_k).\]

Kernel methods essentially allow non-linear mapping of the input data $x_t \in D$ into a higher (could be infinite) dimensional space, where the kernel-filtered data are represented by $\Phi(x_t ),\ t = 1,..., n$.

Next, the machine learning algorithm is trained on the kernel-mapped data. The Cover’s theorem suggests that the mapped data set is more likely to be linearly separable compared to the original data, which may only be non-linearly separable.

Suppose we are trying to optimize an objective function: \[W(\alpha) = \sum_{i=1}^m{\alpha_i} -\frac{1}{2}\sum_{i,j=1}^m{\alpha_i \alpha_j y_i y_j \langle x_i,x_j\rangle},\]

Subject to $\alpha_i\ge 0$ and $\sum_{i=1}^m{\alpha_i y_i} =0$.

When the input data is not-linearly separable, we can $\Phi$-transform the data into a new space where the data becomes linearly separable.

\[\Phi:\mathbb{R}^2 \longrightarrow \mathbb{R}^3\] \[(x_1,x_2) \longrightarrow \Phi(x_1,x_2) = (\phi_1, \phi_2, \phi_3)= (x_1^2, \sqrt{2}x_1x_2, x_2^2)\]

The following hyper-plane linearly separates the clusters in the data. \[\langle w,\Phi\rangle (x_1,x_2)=(w_1\phi_1+w_2\phi_2+w_3\phi_3)(x_1,x_2)=0\]

This we can complete the data clustering challenge by solving the corresponding maximization problem based on the $\Phi$-mapped data:

\[W(\alpha) = \sum_{i=1}^m{\alpha_i} -\frac{1}{2}\sum_{i,j=1}^m{\alpha_i \alpha_j y_i y_j \langle\Phi(x_i), \Phi(x_j)\rangle },\]

And the Lagrangian dual representation of the hyper-plane is:

\[f(x)=\langle w,x\rangle +b=\sum{\alpha_iy_i\langle x_i,x_j \rangle},\ \text{with} \ w=\sum_{i=1}^m{\alpha_i y_i x_i}.\]

When $x=(x_1, x_2)$ and $z=(z_1,z_2)$, the kernel-trick refers to $K(x,z)=\langle \Phi(x),\Phi(z)\rangle$:

\[K(x,z)=\langle x,z\rangle ^2=(x_1z_1 + x_2z_2)^2=(x_1^2 z_1^2 + 2x_1 z_1 x_2z_2+x_2^2 z_2^2)=\]

\[=\langle (x_1^2,\sqrt{2}x_1x_2, x_2^2), (z_1^2,\sqrt{2}z_1z_2, z_2^2)\rangle =\langle \Phi(x),\Phi(z)\rangle .\]

Note that: (1) the weight vector ($w$) representation includes only data points, (2) the coordinates of the data points are not used, only the inner-products are necessary, (3) the kernel function $K(x_1, x_2)=\langle \Phi(x_i),\Phi(x_j)\rangle$.

Machine learning classification algorithms based on kernel estimation methods that rely on $\Phi$-transformed data are linear and can be expressed as inner-products. Inner-product representation of computational algorithms is beneficial since in high-dimensions they may be efficiently computed using a positive semi-definite Mercer kernel function:

\[k(x_t, x_{t'} )= \langle \Phi(x_t ),\Phi(_{t'} )) \rangle.\]

Examples of Mercer kernels are:

Radial basis function: $K(x,z)=e^{-\frac{||x-z||^2}{2\sigma^2}}, \ ||x||=\sqrt{\langle x,x \rangle},$$

where $\sigma$ is a scale parameter.

Polynomial kernel: \[K(x,z)=(\langle x,z\rangle +\theta)^2,\ d\ge 0.\]
Sigmoid kernel: \[K(x,z)=\tanh(\eta\times \langle x,z\rangle +\theta).\]

Note that the kernel-trick facilitates the learning algorithm without operating directly on the $\Phi$-transformed data. It only computes inner-products via the kernel function. The learning algorithm operates implicitly on the kernel feature space, avoiding the $\Phi$ mapping.

As the kernel feature space is non-linearly related to the original input data, kernel methods are considered non-linear operations.

Support vector machines (SVM), see Chapter 6, represents an example of kernel-based machine learning methods.

Information theoretic machine learning utilizes the second-order (quadratic) Renyi entropy, $\alpha = 2$:

\[H_{\alpha=2}=-\ln\left ( \int_{\mathbb{R}} {p^2(x)dx} \right ) .\]

As the logarithm function, $\ln()$, is monotonic, optimizing $H_2(p)$ is equivalent to optimizing $V (p)= \int{ p^2(x)dx}$.

The usefulness of the Renyi entropy, i.e., $V (p)$ as a machine learning assessment criterion, relies on its effective estimation using real data samples $D = \{x_1,..., x_N\}$ generated from the distribution $p(x)$.

For example, to estimate $V(p)$, we can replace $p(x)$ by a smooth sample-driven kernel density $\hat{p}(x)$. This ensures that we can compute the derivatives of $V (p)$ with respect to system weights and thus find the extrema.

We already saw that the Parzen window provides kernel density estimation:

\[\hat{p}(x)=\frac{1}{n}\sum_{x_t\in D}{W_{\sigma}(x,x_t)},\]

where $W_{\sigma}(x, \ .)$ is the Parzen window centered at $x$ with a width parameter $\sigma$. The smoothness of the density estimator requires smoothness of the window function. In addition, the estimate is assumed to be a density (pdf) which requires $\hat{p}(x)\ge0$ and $\int{ \hat{p}(x)dx}=1$.

Many alternative Parzen windows may be used, e.g., any density function works, and we already saw the Gaussian windowing function

When using kernels with fixed kernel sizes,

\[E(\hat{p}(x)) = \lim_{n\longrightarrow \infty}{\hat{p}(x)} = p(x) \star W_{\sigma}(x),\]

where $\star$ is the convolution operator. Assuming a suitable annealing rate for the kernel size as $n\longrightarrow \infty$, the Parzen density estimator is asymptotically unbiased and consistent.

In practice we always deal with a limited number of samples. Thus, the choice of a kernel size trades-off with estimation bias and precision. Smaller kernel sizes yield low bias higher variance, whereas larger kernel sizes may increase the bias but reduce the variance.

# define support set of X
x = seq(-5, 10, by = 0.01)

# get normal kernel function values
norm1 = dnorm(x, -1, 1)
norm2 = dnorm(x, 2, 1)
norm3 = dnorm(x, 4, 1)
norm4 = dnorm(x, 6,1)

plot(x, norm1, type = 'l', ylim=c(0, 1), ylab = 'f(x)', xlab = 'x', main = 'Gaussian Kernels KDE Simulation', col = 'red')

# add plots of additional kernel functions
lines(x, norm2, col = 'green')
lines(x, norm3, col = 'blue')
lines(x, norm4, col = 'pink')

data_1=c(rnorm(1000, -1, 1), rnorm(1000, 2, 1), rnorm(1000, 4, 1), rnorm(1000, 6, 1))
data_2=c(-1, 2, 4, 6)
data_kde=kde(data_1)
lines(data_kde[1,],4*data_kde[2,],type='l',xlab='x',ylab='density f(x)', col = 'black', lwd=4)

data2_kde=kde(data_2)
lines(data2_kde[1,],data2_kde[2,], type='l',xlab='x', col = 'gray', lwd=2)

# legend; use 'lty' to discriminate between kernels
legend("top", c('Norm(-1,1)', 'Norm(-2,1)', 'Norm(4,1)', 'Norm(6,1)', 'KDE Estimate'), lty = c(1,2,3,4,5), lwd=c(1,1,1,1,4), col = c('red', 'green', 'blue', 'pink', 'black'), cex=0.5, box.lwd = 0)

In the above plot, gray lines indicate the sample data points, the color curves represent the individual (Gaussian) kernel functions at the sampling points, and the thick black line represents the kernel-density estimate.

1.3 Divergence Measures based on Renyi Entropy

Let’s start with $k$ groups, each of size $n_i$ representing the data set $D$ composed of subgroups $D_1,..., D_k$ including the data points corresponding to the pdf’s $p_i(x)$.

The Cauchy-Schwarz Divergence is:

\[ D_{CS}(p_i,p_j)=-\ln \frac{\int{p_i(x)p_j(x)dx}}{\sqrt{\int{p_i^2(x)dx}\times \int{p_j^2(x)dx}}} \ \ \in [0, \infty].\]

Note that $D_{CS}(p_i,p_j)=0 \iff p_i(x)= p_ j (x)$, it’s symmetric, and its magnitude indicates the discrepancy between the two pdfs.

It is based on the Renyi entropy:

\[D_{CS}(p_i,p_j)=-\ln\int{p_i(x)p_j(x)dx} -\frac{1}{2}H_2(p_i) -\frac{1}{2}H_2(p_j).\]

The monotonicity of the logarithm function allows us to use Parzen window estimators on the simpler $V_{CS}(p_i, p_ j )$.

\[\hat{p}_i(x)=\frac{1}{n_i}\sum_{x_n\in D_i}{W_{\sigma}(x,x_n)}\]

\[\hat{p}_j(x)=\frac{1}{n_j}\sum_{x_l\in D_j}{W_{\sigma}(x,x_l)}\]

These yield the following estimator:

\[V_{CS}(\hat{p}_i,\hat{ p}_ j )= \frac{\frac{1}{n_in_j}\sum_{x_n\in D_i}{ \sum_{x_l\in D_j}{W_{\sigma}(x_n,x_l)}}} {\sqrt{\frac{1}{n_i^2}\sum_{x_n\in D_i}{ \sum_{x_{n'} \in D_j}{W_{\sigma}(x_n,x_{n'})}} \times \frac{1}{n_j^2}\sum_{x_l\in D_j}{ \sum_{x_{l'} \in D_j}{W_{\sigma}(x_l,x_{l'})} }}}. \]

Finally, the CS-divergence estimator is:

\[ D_{CS}(\hat{p}_i,\hat{ p}_ j )= -\ln\left (V_{CS}(\hat{p}_i,\hat{ p}_ j )\right ).\]

The Integrated Squared Error (ISE) Divergence is an alternative measure that is based on Renyi entropy capturing the aggregate square error between the pdfs:

\[D_{ISE}(p_i,p_j)=\int{[p_i(x)-p_j(x)]^2dx}.\]

1.4 Parzen windows

A common approach for estimating non-parametric density functions based on using Parzen windows. Suppose we are trying to estimate a process density function $f(x)$, which generates the observed random sample $\{x_1, \cdots, x_N\}$. Then, the Parzen window estimator of this distribution is defined by \[\hat{f}(x) \equiv \frac{1}{N} \sum_{l=1}^N {W_{\sigma^2} (x, x_l)}.\] The Parzen window, $W_{\sigma^2}$, is a kernel whose extent (width) is controlled by the parameter $\sigma^2$. Typically, the Parzen window is normalized to integrate to one, similarly to the unknown pdf function $f$. For instance, for a $d$-dimensional process, we can use a Gaussian kernel \[W_{\sigma^2} (x, x_l ) = \frac{1}{(2\pi\sigma^2)^{\frac{d}{2}}} e^{−\frac{||x − x_l||^2}{2\sigma^2}}.\]

1.5 Relation between Information Theoretic Learning and Kernel Methods

Based on the Parzen window kernel trick, the information theoretic learning criteria may be expressed in terms of mean vectors in a Mercer kernel feature space.

We can express every Renyi entropy-based information-theoretic measure in terms of sums of Parzen windows, $W_{\sigma} (\cdot, \cdot)$, by positive semi-definite functions – Parzen window estimates.

Parzen window estimates satisfy the Mercer’s conditions, and therefore compute inner products in some kernel induced feature space

\[W_{\sigma} (\cdot,\cdot)= k(\cdot,\cdot)= (\Phi(\cdot),\Phi(\cdot)).\]

Machine learning clustering may be achieved by maximizing the divergence measure.

Various objective functions may be estimated using Parzen windowing to obtain non-linear clustering of classification. To optimize with respect to the data partitioning, i.e., optimize the clusters, we maximize the Cauchy-Schwarz (CS) divergence:

\[\max_{D_1, D_k}{D_{CS}(\hat{p}_1, \hat{p}_k)}.\]

Using gradient descent and Lagrange multipliers we can obtain an optimal solution corresponding to a partition of the data.

1.6 Some notes

Question: Given a specific probability distribution, is there an analytical expression of the entropy as an inner product of a function multiplied by the imaginary unit $i$?

Approach: The Dirac notation used in quantum mechanics conveniently represents kets (state vectors), $|\psi\rangle$, bras (dual vectors) $\langle \psi |$, and operators. The inner product of two state vectors $|\psi\rangle$) and $|\varphi\rangle$ is $\langle \psi| \varphi \rangle$. Then the states are normalized, the inner product corresponds to the probability of finding the system in state $| \varphi \rangle$ if it was initially in state $|\psi\rangle$. The entropy of a discrete probability distribution can be expressed using Dirac notation as follows.

COnsider a discrete process with a probability distribution with probabilities $\{p_1, p_2, \cdots, p_n\}$. The entropy $S$ of this distribution is

\[S = - \sum_{i} {(p_i \times \log(p_i))}\]

Using Dirac notation, express $p_i$ as $|\psi\rangle$ and the normalization condition as $\sum_i |\psi_i\rangle \langle \psi_i|=1$. Then, the entropy is

\[S = - \sum_i {\left [\langle \psi_i | \psi_i \rangle \times \log(\langle \psi_i | \psi_i \rangle)\right ]}\equiv -Tr(\rho \log(\rho))\ ,\]

where $\rho=\sum_i p_i |\psi_i\rangle \langle \psi_i|$ is the density matrix representing the probability distribution, and $Tr$ is the trace of a matrix.

The Dirac notation provides a compact way to express the entropy of a probability distribution. The entropy is still the negative expectation of the logarithm of the probabilities.

2 References

Information Theory and Statistical Learning, ed. Frank Emmert-Streib and Matthias Dehmer, Springer, 2009, ISBN: 978-0-387-84815-0.
Jose C. Principe, 2010, Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives. Springer, Information Science and Statistics, ISBN 9781441915702.
Renyi A. (1976). On measures of entropy and information. Selected Papers of Alfred Renyi, Akademiai Kiado, Budapest, 2:565-580.
Olaf Sporns (2007) Complexity. Scholarpedia, 2(10):1623.

DSPA2: Data Science and Predictive Analytics (UMich HS650)

Appendix: Information Theory and Statistical Learning

SOCR/MIDAS (Ivo Dinov)

June 2024