One debate within Cognitive Science concerns different assumptions about how complex thinking really is. On the one hand, there are people using sophisticated methods such as MCMC-sampling or Bayesian Networks to model human reasoning. On the other hand, there are scientists who believe that simplicity pays. In today’s post I will focus on the simplicity side of this debate. By doing so, I will go back to one of the traditionally (over-)used data sets and see how one of the most frequently studied simple models, a heuristic called Take-The-Best, performs compared to other classical machine learning algorithms (spoiler alert: not too well).

**Short historic preamble**

<skip this section, if you know the debate>

During the 1970s Kahneman and Tversky came up with their idea of heuristics and biases. They postulated that the human decision maker is not acting as rationally as described by traditional economic theories. Moreover, they developed a whole set of heuristics that bias our decision making towards mistakes and systematic errors, most notably the availability heuristic, the anchoring heuristic, and the representativeness bias.

Starting from the 1980s onwards, Gerd Gigerenzer and colleagues went on a mission to show that heuristics -although sometimes leading to errors- are actually well-adapted and simple short-cuts that allow “satisficingly” good decisions within natural environments. One such heuristic is the Take-The-Best heuristic. TTB is a compensatory strategy, which means that if one orders the used variables in terms of their importance, the first variable has a stronger effect than all the other variables, the second variable has a stronger effect than all the remaining variables, and so forth. In fact, it is postulated that within a choice scenario once a variable is able to decide between the options, the other variables are not even looked up (or processed) any more.

Take the decision about which of two cities has a larger population size as an example. Let us assume that the two to be compared cities can be described by the following variables:

1. Is it a capital city?

2. Does it have an airport?

3. Does it have a major league football team?

4. Is it a trade fair town?

Now, to compare two cities, city A and city B, one only has to look up the questions in the order they are presented above. As soon as one criterion weighs favourably for one city, the comparison stops and the response is that this city is said to be bigger.

For example, comparing London and Manchester, one would ask “Is one of them a capital city?”, which is true for London, but false for Manchester and at which point the comparison stops and the response is that London is the bigger city. Comparing Manchester and Oxford, one would ask Q1, which is true for both, then Q2, again true for both, and then Q3, which is only true for Manchester. Thus, the final response is that Manchester is the bigger city. If none of the questions can distinguish between the options (for example, comparing London and Paris), then the final response would be to simply guess.

Normally, it is assumed that the cues are already sorted in terms of their ecological validity, which means that questions which distinguish well between the options are asked at the beginning.

Starting in 1996, Gigerenzer & Goldstein published a series of papers showing that this seemingly simple rule is able to outperform other, more sophisticated machine learning algorithms such as linear regression and Naive Bayesian classifiers. Up until today one of the major points stressed by Gigerenzer is that simple methods can outperform other more sophisticated strategies.

**The revenge of the city size**

A couple of days ago I was lucky enough to get hold of some of the old data that were used by Gigerenzer . One thing I noticed is that the old comparisons seem to have used a continuous response variable (i.e., population size was coded as the number of people living in a city), instead of creating binary comparisons as a training set. This means that each model was fitted to the continuous case and then used to predict the city size for each city individually. So, in a normal comparison between city A and city B, the population size for city A and for city B where estimated by the given model separately and then the city with the higher estimate was predicted to win the comparison. This is valid for regression models, but it is not the only way to approach a task like this.

Predicting the data as I just described results into two different predictions of a model with continuous outputs, that are then dichotomized by saying which city has the higher predicted number of inhabitants. The TTB heuristic is not suited for this approach as it can only predict binary outcomes by its nature (bigger or smaller). So I guess that TTB’s cue validity was either determined by correlations or -and this is much more likely- by generating potential pairwise comparisons out of the data set. Now, if one is already using pairwise comparisons, it should be valid to fit other models directly to those comparisons as well.

To stress the difference between the two methods a little further, I have shortly described them below:

* Continous approach: *

*– fits each item’s response separately; in the final comparison task, the response is estimated for both options and the option with the higher response is said to win.*

*Pairwise approach: *

*– generates pairwise comparisons, which means subtracting the values of each case from each other and setting the response to 1 (win, city is bigger) or 0 (loose, city is smaller). Models are fitted to the comparison directly.*

In what follows, I will use the pairwise approach, leading to the use of models for binary outputs such as logistic regression or random forest. I will use the original data set of city sizes for model comparison.

**The contestants**

** TTB**: Classic TTB. Fitting TTB I tried to make the algorithm as efficient as possible by only evaluating cues that are actually necessary (as stated in the original paper).

**Logistic Regression:** A generalized linear model using the binomial family, modelling the log-odds of a city to be bigger than its competitor. In Gigerenzer’s original paper the regression approach was said to be the “rational” approach.

**Naive Bayes:** Naive Bayes models the probability for a city to be bigger given the values for each item comparison. It is called naive, because it assumes that the used variables are not correlated with each other.

**Support Vector Machine: **Maps the comparisons to points in space, so that the examples of the separate categories (city lost or won) are divided by a gap that is as wide as possible.

**Random Forest:** Grows an ensemble of decision trees up until a very deep depth and uses the mode of all trees to predict whether or not a city will win a comparison against its competitor. Random Forest trees are normally known (and hated) for their ability to predict all sorts of things very well and are therefore regularly used as benchmark models in prediction tasks.

**Method**

I took the whole city size data set, removed two randomly drawn cities, and generated all possible non-redundant comparisons out of the remaining cities. These comparisons where then used as a training set, which means that all the models where fitted to this set, trying to capture the conditions under which a city would loose or win a comparison. The remaining two cities where used as a test set, meaning that their variables were subtracted from each other and the dependent variable was set to 1 if the first city was bigger, 0 otherwise, thereby creating a single comparison as a test. Each model was fitted to the learning set and used to predict the test set. If the prediction was correct, it counted as a success, otherwise as a failure.

This process was repeated 500 times for each model and the average proportion of correct predictions was calculated at the end.

**Prediction results**

As can be seen in the figure below, TTB did not perform too badly, predicting around 72% of all comparisons correctly.

However, it is not as good or better as the other models. Most notably the logistic regression predicts around 80% of all comparisons correctly and is therefore performing better than TTB. Random Forest outperforms TTB by a margin of 10%. The support vector machine seems to perform poorly due to over-fitting and the high inter-variable correlation.

**Conclusions**

Having chosen a pairwise comparison for all the candidate models and not only for TTB, I have found that TTB is not outperforming the other models as has consistently been claimed in some of the classic papers.

However, TTB did not perform too badly, so one could say that simple models can be a good choice, especially if an agent has to make a trade-off between accuracy and simplicity. However, if simple heuristics are really as simple as is usually claimed remains to be seen.

Nice work, Eric!

As a side note, I think the relative performance of TTB also crucially depends on what cross-validation method you are using. Using leave one-one-out cross-validation, there is plenty of data for fitting the model’s parameters. I’d be curious to see what happens if you systematically vary the size of the training set, so that there is less data available for training the models. (as usually done in the GG et al studies)

Björn

Hi Bjoern,

Thanks a lot. Actually, when I showed my blog entry to Paula Parpart, she said that she is currently doing something very similar and that she has tried different proportions of learning and test sets. I think she has found the same result for the leave-one-out case, but also that TTB performs better when the learning sample is rather small (just like in the original studies). I actually do think that TTB performed rather well compared to the other models. However, I am curious to see how it would scale, i.e. when I generate data and add more and more independent variables. It would be especially interesting to see how the time to learn TTB’s structure would scale in that scenario. Alas, I am not 100% sure if my current implementation is the fastest (and thereby perhaps fairest) way to algorithmically represent TTB.