Explanation, for the interested few, of what is going on in the paper Joint Gaussian Processes for Biophysical Parameter Retrieval
Gaussian processes (GPs) are state-of-the-art tools for regression and function approximation, and have recently been shown to excel in biophysical variable retrieval following both statistical and hybrid approaches. Let us consider a set of $n$ pairs of observations or measurements, ${\mathcal D}_n:=\{\mathbf{x}_i,y_i\}_{i=1}^n$, where the outputs are perturbed by additive, independent noise. The input-output pairs $(\mathbf{x},y)$ used to fit the inverse machine learning model $f(\cdot)$ come either from {\em in situ} field campaign data (statistical approach) or from simulations by means of a radiative transfer model (RTM) (hybrid approach).
We assume the following model,
$$y_i = f(\mathbf{x}_i) + e_i, \qquad e_i \sim \mathcal{N}(0,\sigma_n^2), \qquad i=1,\ldots,n,$$
where $f(\mathbf{x})$ is an unknown latent function, $\mathbf{x}\in\mathbb{R}^d$, and $\sigma_n^2$ stands for the noise variance. Defining $\mathbf{y} = [y_1, \ldots, y_n]^\intercal$ and $\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\intercal$, the conditional distribution of $\mathbf{y}$ given $\mathbf{f}$ becomes $p(\mathbf{y} | \mathbf{f}) = \mathcal{N}(\mathbf{f}, \sigma_n^2\mathbf{I})$, where $\mathbf{I}$ is the $n\times n$ identity matrix. Now, in the GP approach, we assume that $\mathbf{f}$ follows an $n$-dimensional Gaussian distribution, $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$ \cite{bishop2006pattern}.
The covariance matrix $\mathbf{K}$ of this distribution is determined by a kernel function with entries $\mathbf{K}_{ij}=k(\mathbf{x}_i,\mathbf{x}_j)=\exp(-\|\mathbf{x}_i-\mathbf{x}_j\|^2/(2\sigma^2))$, which encodes the similarity between input points \cite{Rasmussen06}. The intuition is the following: the more similar inputs $i$ and $j$ are, according to some metric, the more correlated outputs $i$ and $j$ ought to be.
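As a concrete illustration, the following minimal NumPy sketch builds the squared-exponential covariance matrix $\mathbf{K}$, draws a latent vector $\mathbf{f}\sim\mathcal{N}(\mathbf{0},\mathbf{K})$, and adds Gaussian noise to obtain $\mathbf{y}$. The sample size, lengthscale $\sigma$ and noise level $\sigma_n$ are illustrative choices, not values from the paper.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X1, X2, sigma=1.0):
    """Squared-exponential kernel: k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

# Illustrative sizes and hyperparameters (not taken from the paper).
n, d = 50, 1
sigma, sigma_n = 0.5, 0.1

X = rng.uniform(0.0, 5.0, size=(n, d))    # inputs x_i
K = rbf_kernel(X, X, sigma)               # prior covariance, K_ij = k(x_i, x_j)
f = rng.multivariate_normal(np.zeros(n), K + 1e-10 * np.eye(n))  # f ~ N(0, K)
y = f + sigma_n * rng.standard_normal(n)  # y_i = f(x_i) + e_i, e_i ~ N(0, sigma_n^2)
\end{verbatim}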
Thus, the marginal distribution of $\mathbf{y}$ can be written as $p(\mathbf{y}) = \mathcal{N}(\mathbf{0}, \mathbf{K}+\sigma_n^2\mathbf{I})$, and, for a new test input $\mathbf{x}_\ast$ with unknown output $y_\ast$, the joint marginal distribution of training and test outputs is
$$\begin{bmatrix}\mathbf{y}\\ y_\ast\end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\begin{bmatrix}\mathbf{K}+\sigma_n^2\mathbf{I} & \mathbf{K}_\ast\\ \mathbf{K}_\ast^\intercal & c_\ast\end{bmatrix}\right),$$
where $\mathbf{K}_{\ast} = [k(\mathbf{x}_\ast,\mathbf{x}_1), \ldots, k(\mathbf{x}_\ast,\mathbf{x}_n)]^\intercal$ is an $n\times 1$ vector and $c_\ast = k(\mathbf{x}_\ast,\mathbf{x}_\ast) + \sigma_n^2$. Then, using standard GP manipulations, we can find the distribution over $y_\ast$ conditioned on the training data, which is a normal distribution with predictive mean and variance given by
$$\mu_{\mathrm{GP}\ast} = \mathbf{K}_\ast^\intercal(\mathbf{K}+\sigma_n^2\mathbf{I})^{-1}\mathbf{y},\qquad \sigma_{\mathrm{GP}\ast}^2 = c_\ast - \mathbf{K}_\ast^\intercal(\mathbf{K}+\sigma_n^2\mathbf{I})^{-1}\mathbf{K}_\ast.$$
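Continuing the sketch above (and reusing rbf_kernel, X and y), these predictive equations amount to linear solves against $\mathbf{K}+\sigma_n^2\mathbf{I}$. The function name gp_predict and the hyperparameter values are ours, not the paper's.
\begin{verbatim}
def gp_predict(X_train, y_train, X_test, sigma=0.5, sigma_n=0.1):
    """Standard GP predictive mean and variance at each test input."""
    K = rbf_kernel(X_train, X_train, sigma)
    K_star = rbf_kernel(X_train, X_test, sigma)    # columns are K_* for each test point
    C = K + sigma_n**2 * np.eye(len(X_train))      # K + sigma_n^2 I
    solved = np.linalg.solve(C, K_star)            # (K + sigma_n^2 I)^{-1} K_*
    mu = K_star.T @ np.linalg.solve(C, y_train)    # K_*^T (K + sigma_n^2 I)^{-1} y
    c_star = 1.0 + sigma_n**2                      # k(x_*, x_*) = 1 for this kernel
    var = c_star - np.einsum('ij,ij->j', K_star, solved)
    return mu, var

# Example usage on a grid of test inputs.
X_test = np.linspace(0.0, 5.0, 100)[:, None]
mu, var = gp_predict(X, y, X_test, sigma, sigma_n)
\end{verbatim}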
In the joint GP (JGP), the training set combines the two sources of data: $r$ in situ (real) pairs and $s$ RTM-simulated pairs, with $n = r + s$. The real samples keep the noise variance $\sigma_n^2$, while the simulated samples are assigned a noise variance $\sigma_n^2/\gamma$, where the parameter $\gamma>0$ accounts for the importance of the two sources of information relative to each other.
The resulting distribution of $\mathbf{y}$ given $\mathbf{f}$ is only slightly different from that of the regular GP, namely $p(\mathbf{y}|\mathbf{f}) = \mathcal{N}(\mathbf{f}, \sigma_n^2 \mathbf{V})$, where $\mathbf{V}$ is an $n\times n$ diagonal matrix in which the first $r$ diagonal elements are equal to $1$ and the remaining $s$ are equal to $\gamma^{-1}$: $\mathbf{V}=\text{diag}(1,\ldots,1,\gamma^{-1},\ldots, \gamma^{-1})$.
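Building $\mathbf{V}$ is a one-liner; the split sizes $r$, $s$ and the value of $\gamma$ below are purely illustrative.
\begin{verbatim}
# Hypothetical split: the first r training samples are in situ, the last s are RTM simulations.
r, s, gamma = 20, 200, 0.5    # gamma is an illustrative value, not from the paper
V = np.diag(np.concatenate([np.ones(r), np.full(s, 1.0 / gamma)]))
\end{verbatim}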
The predictive mean and variance of a test output $y_\ast$, conditioned on the training data, then become
$$\mu_{\mathrm{JGP}\ast} = \mathbf{K}_\ast^\intercal(\mathbf{K}+\sigma_n^2\mathbf{V})^{-1}\mathbf{y},\qquad \sigma_{\mathrm{JGP}\ast}^2 = c_\ast - \mathbf{K}_\ast^\intercal(\mathbf{K}+\sigma_n^2\mathbf{V})^{-1}\mathbf{K}_\ast.$$
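A sketch of the corresponding predictor, again reusing rbf_kernel from the earlier snippet: the only change with respect to gp_predict is that $\sigma_n^2\mathbf{I}$ is replaced by $\sigma_n^2\mathbf{V}$, so the training samples must be ordered with the $r$ real pairs first, matching $\mathbf{V}$. The variable names below are our own.
\begin{verbatim}
def jgp_predict(X_train, y_train, V, X_test, sigma=0.5, sigma_n=0.1):
    """JGP prediction: identical to gp_predict, with sigma_n^2 I replaced by sigma_n^2 V."""
    K = rbf_kernel(X_train, X_train, sigma)
    K_star = rbf_kernel(X_train, X_test, sigma)
    C = K + sigma_n**2 * V                          # K + sigma_n^2 V
    solved = np.linalg.solve(C, K_star)
    mu = K_star.T @ np.linalg.solve(C, y_train)     # K_*^T (K + sigma_n^2 V)^{-1} y
    var = (1.0 + sigma_n**2) - np.einsum('ij,ij->j', K_star, solved)
    return mu, var

# Hypothetical usage: stack real data first, then simulated data, matching V.
# X_joint = np.vstack([X_real, X_sim]); y_joint = np.concatenate([y_real, y_sim])
# mu, var = jgp_predict(X_joint, y_joint, V, X_test)
\end{verbatim}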