Joint Gaussian Processes

Explanation, for the interested few, of what is going on in the paper Joint Gaussian Processes for Biophysical Parameter Retrieval

(almost finished)

Gaussian Processes

GPs are state-of-the-art tools for regression and function approximation, and have recently been shown to excel in biophysical variable retrieval, following both statistical and hybrid approaches. Let us consider a set of $n$ pairs of observations or measurements, ${\mathcal D}_n:=\{\mathbf{x}_i,y_i\}_{i=1}^n$, perturbed by additive, independent noise. The input data pairs $(\mathbf{x},y)$ used to fit the inverse machine learning model $f(\cdot)$ come either from {\em in situ} field campaign data (statistical approach) or from simulations by means of a radiative transfer model (RTM) (hybrid approach).
We assume the following model,

$$y_i = f(\mathbf{x}_i) + e_i,~e_i \sim\mathcal{N}(0,\sigma_n^2),$$

where $f(\mathbf{x})$ is an unknown latent function, $\mathbf{x} \in \mathbb{R}^d$, and $\sigma_n^2$ stands for the noise variance. Defining $\mathbf{y} = [y_1, \ldots, y_n]^\intercal$ and $\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\intercal$, the conditional distribution of $\mathbf{y}$ given $\mathbf{f}$ becomes $p(\mathbf{y} | \mathbf{f}) = \mathcal{N}(\mathbf{f}, \sigma_n^2\mathbf{I})$, where $\mathbf{I}$ is the $n\times n$ identity matrix. Now, in the GP approach, we assume that $\mathbf{f}$ follows an $n$-dimensional Gaussian distribution, $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$ \cite{bishop2006pattern}.

The covariance matrix ${\bf K}$ of this distribution is determined by a kernel function with entries $\mathbf{K}_{ij}=k(\mathbf{x}_i,\mathbf{x}_j)=\exp(-\|\mathbf{x}_i-\mathbf{x}_j\|^2/(2\sigma^2))$, encoding similarity between the input points \cite{Rasmussen06}. The intuition is the following: the more similar inputs $i$ and $j$ are, according to some metric, the more correlated outputs $i$ and $j$ ought to be. Thus, the marginal distribution of $\mathbf{y}$ can be written as

$$p(\mathbf{y}) = \int p(\mathbf{y} | \mathbf{f}) p(\mathbf{f}) \, d\mathbf{f} = \mathcal{N}(\mathbf{0},\mathbf{C}_n),$$

where $\mathbf{C}_n=\mathbf{K} + \sigma_n^2\mathbf{I}$.
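As a concrete aside, here is a minimal NumPy sketch of the squared exponential kernel matrix and of $\mathbf{C}_n$; the function name `sq_exp_kernel` and the single length-scale $\sigma$ shared across input dimensions are choices made for this illustration, not prescribed by the paper.

```python
# Minimal sketch: squared exponential kernel matrix K and C_n = K + sigma_n^2 I.
import numpy as np

def sq_exp_kernel(X1, X2, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for rows of X1 and X2."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Toy example: n = 50 random d-dimensional inputs (not data from the paper)
X = np.random.randn(50, 4)
sigma, sigma_n = 1.0, 0.1
K = sq_exp_kernel(X, X, sigma)
C_n = K + sigma_n**2 * np.eye(len(X))   # covariance of the marginal p(y)
```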
Now, what we are really interested in is predicting a new output $y_\ast$, given an input $x_\ast$. The GP framework handles this by constructing a joint distribution over the training and test points,
$$\begin{bmatrix}\mathbf{y} \\ y_\ast\end{bmatrix}\sim\mathcal{N} \left( \mathbf{0},\begin{bmatrix}\mathbf{C}_n & \mathbf{K}_\ast \\ \mathbf{K}_\ast^\intercal & c_\ast\end{bmatrix} \right),$$

where $\mathbf{K}_{\ast} = [k(\mathbf{x}_\ast,\mathbf{x}_1), \ldots, k(\mathbf{x}_\ast,\mathbf{x}_n)]^\intercal$ is an $n\times 1$ vector and $c_\ast = k(\mathbf{x}_\ast,\mathbf{x}_\ast) + \sigma_n^2$. Then, using standard GP manipulations, we can find the distribution over $y_\ast$ conditioned on the training data, which is a normal distribution with predictive mean and variance given by

$$\begin{aligned} \mu_{\text{GP}} (\mathbf{x}_\ast) &= \mathbf{K}_{\ast}^\intercal (\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y}, \\ \sigma^2_{\text{GP}} (\mathbf{x}_\ast) &= c_\ast - \mathbf{K}_{\ast}^\intercal (\mathbf{K} + \sigma_n^2\mathbf{I})^{-1} \mathbf{K}_{\ast}. \end{aligned}$$
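These predictive equations translate almost line by line into code. Below is a minimal sketch (reusing the `sq_exp_kernel` helper from the example above); the Cholesky factorization is a standard numerical choice for solving the linear system, not something specific to the paper.

```python
# Sketch of the GP predictive mean and variance at test inputs X_star,
# assuming the sq_exp_kernel helper defined in the earlier snippet.
import numpy as np

def gp_predict(X, y, X_star, sigma=1.0, sigma_n=0.1):
    """Return predictive mean and variance of the GP at the rows of X_star."""
    n = X.shape[0]
    K = sq_exp_kernel(X, X, sigma)              # n x n training covariance
    K_star = sq_exp_kernel(X, X_star, sigma)    # n x m cross-covariance
    C = K + sigma_n**2 * np.eye(n)              # C_n = K + sigma_n^2 I
    # Solve C alpha = y via Cholesky for numerical stability
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = K_star.T @ alpha                       # predictive mean
    v = np.linalg.solve(L, K_star)
    c_star = 1.0 + sigma_n**2                   # k(x*, x*) = 1 for this kernel
    var = c_star - np.sum(v**2, axis=0)         # predictive variance
    return mu, var
```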

Thus, GPs yield not only predictions $\mu_{\text{GP}}(\mathbf{x}_\ast)$ for test data, but also the so-called ``error bars'', $\sigma_{\text{GP}}(\mathbf{x}_\ast)$, assessing the uncertainty of the mean prediction. The hyperparameters $\mathbf{\theta}=[\sigma, \sigma_n]$ to be tuned in the GP determine the width of the squared exponential kernel function and the noise variance of the observations. Tuning can be done by maximizing the marginal likelihood of the observations, or by a simple grid search that minimizes the squared prediction errors.

Jointly modeling real and simulated data (Joint GP)

In the JGP, the training set consists of $r$ real ({\em in situ}) data pairs and $s$ RTM-simulated pairs, stacked so that the real samples come first. The same additive noise model is assumed for both sources, but the noise variance of the simulated samples is scaled to $\sigma_n^2/\gamma$, where the parameter $\gamma>0$ accounts for the importance of the two sources of information relative to each other.

The resulting distribution of $\mathbf{y}$ given $\mathbf{f}$ is only slightly different from that of the regular GP, namely $p(\mathbf{y}|\mathbf{f}) = \mathcal{N}(\mathbf{f}, \sigma_n^2 \mathbf{V})$ where $\mathbf{V}$ is an $n\times n$ diagonal matrix in which the first $r$ diagonal elements are equal to $1$ and the remaining $s$ are equal to $\gamma^{-1}$: $\mathbf{V}=\text{diag}(1,\ldots,1,\gamma^{-1},\ldots, \gamma^{-1})$.
The predictive mean and variance of a test output $y_\ast$, conditioned on the training data, then become

$$\begin{aligned} \mu_{\text{JGP}} (\mathbf{x}_\ast) &= \mathbf{K}_{\ast}^\intercal (\mathbf{K} + \sigma_n^2\mathbf{V})^{-1}\mathbf{y}, \\ \sigma^2_{\text{JGP}} (\mathbf{x}_\ast) &= c_\ast - \mathbf{K}_{\ast}^\intercal (\mathbf{K} + \sigma_n^2\mathbf{V})^{-1} \mathbf{K}_{\ast}. \end{aligned}$$

Note that when $\gamma=1$ the standard GP formulation is obtained. Otherwise $\gamma$ acts as an extra regularization term accounting for the relative importance of the real and the simulated data points. The hyperparameters of the JGP are $\mathbf{\theta} = [\sigma,\sigma_n,\gamma],$ which can be selected by maximizing the marginal likelihood of the observations as usual in the GP framework. It is important to note that hyperparameter fitting should be performed with respect to real data, so that the method learns the mapping from \textit{real} input to output.
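To make the difference with the standard GP explicit in code, here is a minimal sketch of the JGP predictive equations: the only change with respect to `gp_predict` above is that $\sigma_n^2\mathbf{I}$ is replaced by $\sigma_n^2\mathbf{V}$, with $\mathbf{V}$ built from the numbers of real ($r$) and simulated ($s$) samples. The function and argument names are ours, not the authors' code.

```python
# Sketch of the JGP prediction: same as gp_predict, except sigma_n^2 I becomes
# sigma_n^2 V, with V = diag(1,...,1, 1/gamma,...,1/gamma) and the first r
# entries for the real samples, the remaining s for the simulated ones.
import numpy as np

def jgp_predict(X, y, X_star, r, s, sigma=1.0, sigma_n=0.1, gamma=1.0):
    K = sq_exp_kernel(X, X, sigma)
    K_star = sq_exp_kernel(X, X_star, sigma)
    V = np.diag(np.concatenate([np.ones(r), np.full(s, 1.0 / gamma)]))
    C = K + sigma_n**2 * V
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = K_star.T @ alpha                              # predictive mean
    v = np.linalg.solve(L, K_star)
    var = (1.0 + sigma_n**2) - np.sum(v**2, axis=0)    # c_* = k(x*,x*) + sigma_n^2
    return mu, var
```

Setting `gamma=1.0` recovers `gp_predict` exactly, which mirrors the remark above that $\gamma=1$ gives back the standard GP formulation.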