I was recently asked to introduce my team to Gaussian processes. In my experience, Gaussian processes are often introduced in a very abstract manner, and it took me some time to understand exactly what was going on behind the scenes.

As such, I prepared the following notebook which walks you through what is actually going on when you train a Gaussian process model. If you want to run or modify the notebook, you can download it here.

When you start to learn about Gaussian processes, you come across the update equations fairly early on. The update equations are fairly intimidating to look at, and are typically dismissed as trivial to derive (for example, Rasmussen and Williams simply point you towards a statistics book from the 1930's, which is neither available online nor in our university library...). I had a go at the derivation, and promptly realised it wasn't trivial at all from a cold start.

Fortuitously, at the PEN emulators workshop I recently attended, there was an introductory lecture from Jochen Voss, where he went through a univariate case, and then gave us the structure of the derivation for the full multivariate case. Luckily, this gave me the push I needed to try the derivation again, so I went away and filled in the gaps.

So here it is, in all it's glory; the derivation of the update equations for Gaussian processes.

The overall endgame of Gaussian process regression is to write down a conditional distribution \(P(y_p | y_t)\) for a set of predicted outputs \(y_p\) given a set of training (observed) outputs \(y_t\). By the product rule

$$ P(y_p|y_t) = \frac{P(y_p,y_t)}{P(y_t)} $$

Since we have a constant set of data, \(P(y_t)\) is just a constant in this expression.

The underlying assumption in Gaussian process regression is that outputs are jointly Gaussian distributed, so that

$$ P(y_p|y_t) \propto P(y_p,y_t) \propto \exp\left[-\frac{1}{2} \left(\begin{array}{c} y_t \ y_p \end{array}\right)^T\Sigma^{-1}\left(\begin{array}{c} y_t \ y_p \end{array}\right)\right] $$

Where \(\Sigma\) is the joint covariance matrix. Remember that under the Gaussian process model we have trained a function which computes the elements of the Covariance matrix purely as a function of the inputs, it is only a function of the outputs \(y_p\) that we're trying to find. We can define the covariance matrix blockwise

$$ \Sigma = \left(\begin{array}{cc} T & C^T \\ C & P \end{array}\right) $$

Where \(T\) is the covariance matrix computed using only the training inputs \(x_t\), \(P\) is the covariance matrix computed using the prediction inputs \(x_p\), and \(C\) is the cross terms (i.e. the covariance \emph{between} \(y_t\) and \(y_p\), computed using \(x_t\) and \(x_p\)). It is a well known result (it's in numerical recipes, or on Wikipedia) that you can blockwise invert a matrix;

$$ \Sigma^{-1} = \left(\begin{array}{cc}T^{-1} + T^{-1}C^T M CT^{-1} & -T^{-1}C^TM \\ -MCT^{-1} & M\end{array}\right) $$

Where \(M = (P-CT^{-1}C^T)^{-1}\). So, we can directly compute our Gaussian density

$$ P(y_p|y_t) \propto \exp\left[-\frac{1}{2} y_t^T(T^{-1} + T^{-1}C^T M CT^{-1})y_t + \frac{1}{2}y_t^T (T^{-1}C^TM)y_p + \frac{1}{2}y_p^T (MCT^{-1})y_t - \frac{1}{2}y_p^TMy_p\right] $$

However, the only thing that isn't a constant here is \(y_p\), so we can drop a bunch of terms (since we're only interested in the density, not absolute values)

$$ P(y_p|y_t) \propto \exp\left[\frac{1}{2}y_t^T (T^{-1}C^TM)y_p + \frac{1}{2}y_p^T (MCT^{-1})y_t - \frac{1}{2}y_p^TMy_p\right] $$

If we take the transpose of the middle term, we can group the terms together a bit more

$$ P(y_p|y_t) \propto \exp\left[\frac{1}{2}y_t^T (T^{-1}C^TM + (MCT^{-1})^T)y_p - \frac{1}{2}y_p^TMy_p\right] $$

Now, in general, a multivariate Gaussian has the form;

$$ \mathcal{N}(\tilde{y},\tilde{\Sigma}) \propto \exp\left[-\frac{1}{2}(y-\tilde{y})^T\tilde{\Sigma}^{-1}(y-\tilde{y})\right] $$

If we remember that covariance matrices are symmetric, we can expand, drop some constant terms and then rearrange this to

$$ \mathcal{N}(\tilde{y},\tilde{\Sigma}) \propto \exp\left[-\frac{1}{2}y^T\tilde{\Sigma}^{-1}y + \tilde{y}^T\tilde{\Sigma}^{-1}y\right] $$

we can therefore see that both \(P(y_p|y_t)\) and \(\mathcal{N}(\tilde{y},\tilde{\Sigma})\) have exactly the same form. We can therefore straightforwardly match expressions for \(\tilde{\Sigma}\).

$$ \tilde{\Sigma} = M^{-1} = P-CT^{-1}C^T $$

The expression for \(\tilde{y}\) requires a little more work. We start by matching terms

$$ \tilde{y}^T\tilde{\Sigma}^{-1} = \frac{1}{2}y_t^T (T^{-1}C^TM + (MCT^{-1})^T) $$

We can rearrange this a little bit

$$ \tilde{y}^T\tilde{\Sigma}^{-1} = \frac{1}{2}y_t^T T^{-1}C^T (M + M^T) $$

We know that \(M\) is a symmetric matrix (we just showed that its inverse is the covariance matrix). So, if we right multiply by the covariance matrix \(\tilde{\Sigma}\) and take the transpose, we finally arrive at

$$\tilde{y} = CT^{-1}y_t $$

And, so, in conclusion we know that

$$P(y_p|y_t) \sim \mathcal{N}(CT^{-1}y_t, P-CT^{-1}C^T) $$

So, not quite as trivial as the textbooks claim!

A subject of debate (mostly in the pub) over the last few months has been about the predictability of sports results. My hypothesis was that the total number of points scored in a team-vs-team sport should be roughly Poisson distributed, and as such you would expect greater statistical fluctuations in the scores of sports where the score-lines are typically low (e.g. football), compared to sports with a high scoreline (e.g. Basketball). My thinking was that the "better" team should win more consistently in high scoreline sports.

Over the Christmas break, I decided to put my money where my mouth is, and actually investigate it. You can find the Jupyter notebook here. I downloaded a bunch of historical results from football, ice hockey and basketball, together with bookmakers pre-game odds. The data took a bit of cleaning, but I ultimately looked at how often the bookie's favourite lost for each sport. The tl;dr conclusion is that I was wrong. Football is the most accurately predicted, and has the lowest average scoreline.

I had intended to clean the notebook up a little bit, and make some more investigations (e.g. was I wrong because the score-lines aren't Poisson distributed? Or because each team's individual scoreline are too highly correlated with each other?) but whilst I was clearing out some of my external hard-drives I accidentally formatted my internal hard-drive too. That was a lot of fun to fix.... Anyway, I lost all my squeaky clean data, so I couldn't make prettier plots. Apologies.

During my work today, at some point I had to draw samples uniformly from a sphere, and made an observation I hadn't really appreciated before. The name of the game is to make sure that every infinitesimal shell has the same density of samples, on average.

$$ \frac{4\pi r^2}{\left< N_{samp} \right>} = \mathrm{constant} $$

Or in other words, the volume gets more spread out, so we need more samples for higher r. The density is simply constant with respect to direction (angles). So, to get the correct distribution we simply need to sample from

$$ \theta \sim \mathcal{U}(0,2\pi) $$

$$ \phi \sim \mathcal{U}(0,\pi) $$

$$ r \sim r^2 $$

Or in other words, just sample from a power law distribution with index 2. Elementary observation? Maybe, but it makes it easier to code up!

A couple of weeks ago I put my newest paper on to the arXiv, which you can find here; https://arxiv.org/abs/1711.06287

The paper describes how much we can hope to learn from gravitational-wave observations of black holes, focusing in particular on how well we can constrain our models of the evolution of binary stars. In order to calculate this I used COMPAS, the binary evolution simulation software we are developing here in Birmingham, and a technique called Fisher information matrices.

If you have a model, which depends on some parameters ({\lambda}), and makes predictions about some observable random variable, then the Fisher information quantifies how sensitive the expected observations are to changes in the underlying parameters. If some parameters have a large effect on our expected observations, then it makes sense that we can learn a lot about them.

I calculated the Fisher matrix by running COMPAS (a lot!) and seeing how the output changed with respect to four different bits of physics; the strength of supernova kicks, the ferocity of stellar winds during two different stellar evolutionary phases, and the frictional energetics during the common envelope phase. Below is the money plot from the paper.

The plot shows how accurately we will be able to measure each of these parameters after 1000 observations of binary black holes. The fuzziness comes from a quantification of the various sources of uncertainty in this process. It also shows how the different parameters behave with respect to each other (their correlation). The way I like to think about this plot is in terms of null hypothesis testing. If we find the truth is a massive outlier with respect to these distributions, then we reject the null hypothesis that our model is correct. This plot shows that the model doesn't have to be wrong by more than a few percent before we can rule out our model.

This morning I participated in the Astronomy and Astrophysics group Science Jamboree 2016, where I managed to win a prize (chocolates!) for the best talk. I'm not quite sure how I managed it, since in the midst of a series of informative and engaging talks, I performed a silly little poem;

*For my science Jamboree
I'll tell you a bit about me
My journey through physics
in a few clumsy limericks
Right through to my PhD*

*I was born in July '91
In the blistering east Norfolk sun
And... oh wait a minute,
there's a strict time limit
So I had best be getting along*

*We'll skip ahead 18 odd years
Whilst I, still wet 'round the ears
kicked off my degree,
eager to see,
How I felt after 2 dozen beers*

*But I did learn some physics as well
Einsten, Newton, Maxwell
But as some of you know,
not a lot of astro
Although I think you can probably all tell*

*To be honest I found space a bit dreary
I preferred condensed matter theory
my vague proclivity
for superconductivity
"I should do a PhD in this clearly!"*

*Alas, I left those low temperatures
And became a software developer
but the real world is crap
I resolved to come back
And I somehow became an Astronomer....*

*My first project involved an obsession
with continuous autoregression
take an impulse response
to stochastic nonsense
et voila! banded matrix compression!*

*And with that O(n) inference
Time to write a paper of significance!
Except instabilities
destroy all utility
Try selling that at a conference!*

*So my PhD project, take 2!
But what was I going to do?
I felt it was prudent
to become Ilya's student
since bless him, he's only got a few*

*Binary population synthesis;
i.e how did it end up like this?
which hyperparameters
make a chirp that glamorous?
The combinations seem virtually limitless*

*So here's where I come in
When parameter space coverage is thin
we can interpolate
perhaps extrapolate
what happens when things spiral in*

*Now this work isn't finished yet
Ive tried random forests and neural nets
but despite all my labour
it seems nearest neighbours
may well be our safest bet*

*To conclude my science Jamboree
I'll make a short summary
my project got switched
because CARMA's a bitch
and I write really shit poetry!*