Quantifying mismatch in Bayesian optimization

Our paper on quantifying mismatch in Bayesian optimization has just been accepted to this year’s Bayesian Optimization workshop at NIPS. In it, we assess how encoding wrong prior smoothness assumptions about the underlying target function affects different acquisition functions. We found that mismatch can be a severe problem for the optimization routine, that it gets even worse in higher dimensions, and that it remains even if hyper-parameters are optimized. The paper is therefore a short cautionary note: thinking hard about prior assumptions can sometimes be more important than choosing one particular acquisition function over another.
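
To give a flavour of what we mean by mismatch, here is a minimal sketch (not the code from the paper; the RBF kernel, the length-scales, the grid, and the expected-improvement loop below are all purely illustrative choices): the target is one draw from a GP with a short length-scale, and the same optimization is run once with a matched surrogate and once with one that assumes a much smoother function.

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, ls):
    """Squared-exponential kernel with length-scale `ls` (1-d inputs)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(x_train, y_train, x_test, ls, jitter=1e-5):
    """GP posterior mean and standard deviation on `x_test`."""
    K = rbf(x_train, x_train, ls) + jitter * np.eye(len(x_train))
    Ks = rbf(x_train, x_test, ls)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - np.sum(v ** 2, axis=0), 1e-12, None)  # k(x, x) = 1 for this kernel
    return mu, np.sqrt(var)

def expected_improvement(mu, sd, best):
    """Expected improvement for maximization."""
    z = (mu - best) / sd
    return (mu - best) * norm.cdf(z) + sd * norm.pdf(z)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)

# "True" target: one draw from a GP with a short length-scale (wiggly).
f = rng.multivariate_normal(np.zeros(len(grid)),
                            rbf(grid, grid, 0.05) + 1e-8 * np.eye(len(grid)))

for surrogate_ls in (0.05, 0.5):  # matched prior vs. overly smooth prior
    idx = list(rng.choice(len(grid), size=3, replace=False))  # random initial design
    for _ in range(20):  # plain BO loop with expected improvement
        mu, sd = gp_posterior(grid[idx], f[idx], grid, surrogate_ls)
        ei = expected_improvement(mu, sd, f[idx].max())
        idx.append(int(np.argmax(ei)))
    print(f"surrogate length-scale {surrogate_ls}: "
          f"best found {f[idx].max():.3f} vs. true max {f.max():.3f}")
```

Playing with the seed and the length-scales gives a feel for how an overly smooth surrogate under-explores a wiggly target; the paper studies this much more systematically, across acquisition functions, dimensions, and with and without hyper-parameter optimization.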

Where do hypotheses come from?

Here’s the pre-print of our latest work (mostly done by Ishita Dasgupta) on a rational model of hypothesis generation. The main idea is to cast hypothesis generation as a Markov chain Monte Carlo algorithm that locally traverses the space of hypotheses using only a finite number of samples. It turns out that this algorithm can explain many “biases” of hypothesis generation and evaluation, such as super- and subadditivity, anchoring, the crowd-within effect, and the dud-alternative effect, to name just a few. We also test the predictions of this model in two experiments.
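
To illustrate the core idea with a toy example of my own (this is not the paper’s hypothesis space, posterior, or implementation): judgments come from a short, locally moving Metropolis-Hastings chain over hypotheses rather than from the full posterior, and subjective probabilities are read off from that finite chain.

```python
import numpy as np

rng = np.random.default_rng(1)

hypotheses = np.arange(10)              # toy discrete hypothesis space
posterior = rng.dirichlet(np.ones(10))  # some "true" posterior over the hypotheses

def proposal(h):
    """Local move: step to one of the two neighbouring hypotheses."""
    return int((h + rng.choice([-1, 1])) % len(hypotheses))

def sample_hypotheses(start, n_samples):
    """Short Metropolis-Hastings chain starting from an anchor hypothesis."""
    chain, current = [], start
    for _ in range(n_samples):
        candidate = proposal(current)
        # symmetric proposal, so the acceptance probability is just the posterior ratio
        if rng.random() < posterior[candidate] / posterior[current]:
            current = candidate
        chain.append(current)
    return np.array(chain)

# A "judged" probability is the fraction of the finite chain spent on each hypothesis.
chain = sample_hypotheses(start=0, n_samples=20)
judged = np.bincount(chain, minlength=len(hypotheses)) / len(chain)
print("true posterior:", np.round(posterior, 2))
print("judged (20 samples, anchored at hypothesis 0):", np.round(judged, 2))
```

With only a handful of samples and purely local moves, the judged probabilities stay close to the starting hypothesis and under-weight hypotheses the chain never reaches, which gives a flavour of the anchoring and subadditivity effects the model is meant to capture; the pre-print works this out properly.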

LJDM talk

Here’s the link to the screencast of my LJDM talk about our idea of compositional inductive biases of intuitive functions:

Thanks to everyone for coming and for the really useful feedback. Here are the slides in PDF format (please note that there are a lot of animations in there, which only work with a decently up-to-date PDF viewer).

Also, re-listening to it (why is it always so weird to hear your own voice?), I’ve created a little addendum:

More serious:

  • The guy who showed the connection between GPs and neural nets is called Neal and not McNeal.
  • Wertheimer worked in New York, not in Massachusetts. It’s still true that some of the Gestaltists struggled to find positions in the US though.
  • For the spectral mixture kernel, the means of the mixture of Gaussians encode the periodicity and the inverse standard deviations encode the length-scales, not the other way around as I said in the talk (see the formula below).
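
For reference, the one-dimensional spectral mixture kernel (as in Wilson & Adams, 2013, whose notation I use here rather than the notation from my slides) is

\[
k(\tau) \;=\; \sum_{q=1}^{Q} w_q \,\exp\!\left(-2\pi^2 \tau^2 v_q\right)\cos\!\left(2\pi \tau \mu_q\right),
\]

so the spectral mean \(\mu_q\) sets the frequency (and hence the period), while the length-scale of the q-th component goes with \(1/\sqrt{v_q}\), i.e. with the inverse standard deviation.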

Less serious:

  • I definitely say “basically” a lot.
  • Same goes for “right”.
  • I shouldn’t have said I “hate” Wham! when I really just don’t like them that much.