Gaussian Process-based function learning

Gaussian Process Regression

How can we learn about and manipulate an unknown underlying function as quickly and well as possible?

One answer to this problem comes from the field of non-parametric Bayesian statistics and is called Gaussian Process Regression. A Gaussian Process is a stochastic process for which any finite combination of points is jointly Gaussian (hence the name). It is called non-parametric not because it contains no parameters (in that sense it is kind of a misnomer), but rather because it can potentially incorporate an infinite number of parameters (it should maybe rather be called “potentially infinite statistics”).

Without getting too much into the technical details, the gist of non-parametric models is the following question: “Why should we worry about over-fitting given that we are Bayesian?”. Indeed, this is a good question to ask and a non-parametric answer could be: “Well, we shouldn’t.”. Therefore, a non-parametric model just assumes an unfixed number of parameters and lets the data speak by the means of Bayesian regularization. Almost always the infinite prior assumptions result in a finite weighted sum for predictions at a newly given point; see Gershman & Blei (2012) for an introduction to non-parametric methods.

Gaussian Process (henceforth GP) Regression is the non-parametric approach to standard regression problems, that is we are given an input X and a continuous output y and have to predict new ys based on new (and old) Xs. Now, instead of fixing the number of parameters used a priori, we simply use a Gaussian Process and let a covariance function do the work for us. This makes Gaussian Process regression a really powerful tool, but of course it also comes with some drawbacks, for example the results are not easily interpretable and the computation can take a while, especially for big data sets.

 

Actively learning a function

One thing that I personally find exciting about this approach is that it can also be used in the context of my research, which is centred around the topic of active learning. Here the question is basically how we should select the next observation within an input domain as well as possible in order to learn about an underlying relation.

Imagine a simple regression, where y=f(x)+epsilon and where we can observe x within a given range on different points, let’s say from 0 to 10 in steps of 0.1; that is, we can always choose on which point we want to see the next output of the function, given what we have observed so far. A rather silly approach here would be to select points completely at random and observe the outcome based on that, thereby ignoring what we have seen so far and potentially will see in the future. “Silly.”, you say? As it turns out, this is a very common approach in many parts of the life sciences, where designing experiments (close to-)optimally is rarely seen.

Another thing we could do, however, would be to approach this problem in a more systematic fashion, for example trying to observe the output for points about which we are currently most uncertain. This approach seems quite intuitive as it just means observing the function at points, where we currently know the least about the potential output. Luckily, it turns out that this can be done by by using GP Regression.

The method applied here is a greedy algorithm that tries to minimize σ_{t-1}(x) by picking the next observation x^*_{t}=argmax σ_{t-1}(x). This simply means picking as the next observation the input point for which currently the variance is maximum.

The algorithm then looks as follows:

1. With your current GP: estimate the variances at each input point

2. Determine the point with the greatest variance

3. Choose this point as your next observation point

4. Observe the outcome

5. Update your GP and start from 1

This method is not only incredibly cool, but also provides a solution to the problem that on average returns around 1/3 of the total achievable information gain in a function learning scenario. This result is based on a characteristic called submodularity and can be looked up in a paper by Krause & Guestrin published in 2012.

Ok, now that we have defined the algorithm, let the fun begin. Let’s take a rather simple function and we can try to learn it actively. Let’s take the sine function below as an example:

f(x)=xsin(x)+ε

This function looks like the one below with a variance that was fixed to be 1.

true

If we now apply the above algorithm to this function, we can see how a GP Regression model learns actively.

functionlearning

The red line is the current interpolation line, the grey lines are sample from the GP and the black dashed line is the true underlying function. The black error bars are the chosen input points.

You can see that the GP-approach learns his function very quickly and only needs a couple of observations until it interpolates well. This is already pretty neat, but it is not the only thing we can do with a GP-approach.

 

Maximizing output

Another question could be to exploit a function, rather than only exploring it as we have done above.  Instead of finding points in order to simply learn a function, we are interested in getting the maximum out of it. This is actually a quite common scenario for many things we do in daily life, for example putting the right amount of information into a blog post to keep the highest number of readers excited…

Mathematically, this can be defined as follows:

If we have a function f(x), then we want to find the point x^* such that:

x^*=argmax_{x \in D}f(x)

This means basically finding the input for which the maximum output is produced. As we start out not knowing the underlying function, maximizing the output involves a constant trade-off between exploration and exploitation, that is learning the function while at the same time finding the maximum as quickly as we can. Normally, tasks like this are assessed by the cumulative regret R (the sum of the produced regret per round) an algorithm produces.

Regret r at time point t is defined as:

r_t=f(x^*)-f(x_t)

So basically regret simply quantifies how much we lost on each trial due to not knowing where the maximum is. Obviously, a regret of 0 would be brilliant, but -as said before- we have to learn the function first.

GPs can again provide a simple yet powerful tool for this situation. This approach is actually quite similar to the one applied in the pure learning task, but instead of picking the point where the current variance is maximum, we rather pick a point where a weighted sum between the currently expected outcome and the current variance is maximum. More exactly, we can optimistically pick points according to their current upper confidence bound, which is a natural trade-off between variance and expectation. Just think about it for a moment: When is an upper confidence interval high? Well, obviously it is high if the underlying mean is high, because it is constructed around the mean. However, it is also high if the variance at that point is high as we have less certainty about our estimate. This makes this approach, called the Upper Confidence Bound method, or (for the cool folks) UCB, an intuitive method to apply for exploration-exploitation scenarios.

More formally, we have the following algorithm:

1. With your current GP: estimate the variances and the mean at each input point.

2. Calculate the UCB for each input point: UCB=\mu_{t-1}+\sqrt{β}σ_{t-1}

2. Determine the point with the greatest UCB.

3. Choose this point as your next observation point

4. Observe the outcome.

5. Update your GP and start from 1.

The β-parameter is a trade-off parameter and normally depends on the current step as well as the design space. Traditionally, it is scaled up or down by cross-validation.

Again, submodularity can be used to establish a regret bound for this algorithm, but I will spare you the details for this and rather refer the interested reader to Srinivas et al. (2006).

Instead, let us see this algorithm in action (everyone likes animations, right?). Let’s take the same function as the one above and let us try to maximize its output.

The animation below shows how the GP-UCB would do that.

exploit

 

As we can see, the GP-UCB quickly learns where the maximum is and continues to exploit this point efficiently.

 

Why should psychologists care?

This is a good question indeed and I don’t want to give away too much just now, but I have run a couple of experiments where I confronted participants with scenarios just like the two above. Based on what I have found so far, it looks as if participants behave quite similar to the GP-approach when asked to either learn or exploit actively. This makes the GP approach (at least in my opinion) a valid and -even more interesting- currently the only model that is able to predict step-by-step learning behaviour in an active function learning task.

Now, I don’t think that we have something like a GP represented in our minds, but I do, however, believe that human learning behaviour can be quite sensitive towards the requirements of different environments and that we (quite like a GP) seem to have a mechanism that allows us to quickly adapt to many different scenarios.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s