Take-The-Beast: A brute force C++ algorithm to establish ground truth for classic data sets

Preamble: Finding Take-The-Best’s cue order:

Over the last years I have established a sort of love-hate-relationship with Take-The-Best (TTB), a simple heuristic from the heuristic toolbox.

If you have to make a binary comparison (for example, deciding which of two cities is bigger), then order the cues in decreasing validity, and then chose one option over the other as soon as one cue distinguishes, then stop.

On the one hand, I do not really believe that this particular heuristic is the best way to describe human behaviour (for example, there might be millions of other ways to ignore information, some of which work like TTB). On the other hand, TTB (at least in my opinion) seems to be one of only a few heuristics that we can actually define mathematically and test in different data sets. Or is it?

The only element that one could have an argument about is the actual cue order and how it can be learned or should be estimated. Originally, the cue order was supposed to just depend on how well a cue can distinguish between different options, given that it can be used. However, there are some obvious problems with this approach. For example, if one compares different German cities according to their population size, then “capital city” is a perfect cue according to this definition as Berlin is bigger than any other German city, i.e. it always wins. Yet, this particular cue can only be used in a scenarios where Berlin is one of the two compared cities, which makes it a cue that leads to trees that are not very frugal overall. Thus, a better approach might be to adjust for a cue’s frequency, that is how often it can actually be applied overall. Another approach could be to use the predictive validity in terms of correlation coefficients. However, all of the available approaches cannot really be compared against each other, if one doesn’t know the underlying ground truth.

And this is exactly what I’m going to do now, finding the right Take-The-Best cue order by brute force, and I will post the R-code step by step in the hope others might use it, too (sharing is caring after all).

Step 1: Loading up data and creating pairwise comparisons:

I am going to use the city size data set as an example here. It has 9 cues such has “has a univeristy” or “is a state capital”. These cues are then used to predict which of two cities is bigger.

Here’s the Code:

#house keeping
rm(list=ls())
 
#load packages
 
packages <- c(plyr,Rcpp,RCurl)
 
lapply(packages, library, character.only = TRUE)
 
#load data from my drop box, you can repeat this at home
 
mysource <- getURL(“https://dl.dropboxusercontent.com/u/4213788/citysize.csv&#8221;)
city<- read.csv(text = mysource, header=TRUE)
 
#generate all possible pairwise comparisons
combos <- combn(nrow(city), 2)
 
#function to get differences of pairwise comparisons
mycomfn<-function(x){
outframe<-data.frame(city[x[1],] – city[x[2],])
return(outframe)
}
 
#generate data set with all pairwise comparisons from city data
city<-adply(combos, 2, mycomfn)
 
#Delete useless variables
city$X1<-NULL
city$X<-NULL
 
#dichotomize dependent variable
city$y<-ifelse(city$y>0, 1, -1)

Step 2: Creating all possible cue orders:

So, now we have a data set that contains all possible pairwise comparisons of cities. The next thing we need is a matrix that contains ALL possible permutations of the (in total) 9 cues that can be used to compare different cities.

#Function to create all possible permutations of numbers up to n
permutations <- function(n){
if(n==1){
return(matrix(1))
} else {
sp<-permutations(n-1)
p <-nrow(sp)
A <-matrix(nrow=n*p,ncol=n)
for(i in 1:n){
A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
}
return(A)
}
}
 
#Matrix that contains one possible permutation per row
 
allperms<-matrix((1:9)[permutations(9)], ncol=9)

Step 3: Looping through (with C++):

So now we have the complete data, and all possible cue orders. The only thing we have to do now is to loop through all possible cue orders, classify the data by TTB, given that cue order, check how good the classification rate is, and track the rate over all cues. Simple! Done! Or so I thought.

The problem is that R, in all its statistical beauty, can be notoriously slow. For example, when I tried to set this algorithm up for a simple toy example (where I knew the order) containing 8 cues, I went for lunch and when I came back my set up progress bar told me that R had managed to compute around 43% of the task. What now? Compute it on a cluster? Run it over night? Parallelize? None of the options sounded convincing (and I am lazy), so I went for something completely different, which is to set up the code in (amateurish) C++-code and then load it up to R via the Rcpp-package.

Below is my brute force code for Take-The-Beast:

//The following have to be included

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]

//We want a data.frame as output containing the performance

DataFrame takethebeast(NumericMatrix validities, NumericVector weights,

NumericMatrix x, NumericVector y) {

//Initialize all variables

int nrow = x.nrow();
int ncol = x.ncol();
int validitiesnrow=validities.nrow();
NumericVector checkvector(validitiesnrow); //performance array

//loop over all validities
for (int h=0;h<validitiesnrow;h++)

{
NumericVector prediction(nrow); //store predictions

for (int i=0;i<nrow;i++)//loop over whole data
{

for (int j=0;j<ncol;j++){//loop over cue order
if(prediction[i]==0){
prediction[i]=x(i,validities(h,j))*weights(j);//calculate prediction
}

}

}

for (int k=0;k<nrow;k++)
{if (abs(prediction[k]-y[k])<0.01)//check if prediction is right 
checkvector[h]++;//update counter
}
}

return
//return data frame with number of correct predictions.
Rcpp::DataFrame::create(Named(checks) = checkvector);

}

Loop through:

Now let’s give it everything we have:

#load C++ source, this is the code from above that you have saved with a ~.cpp-ending
 
sourceCpp("cppfiles/takethebeast.cpp")
 
#Get the used format
xttb<-city[,-ncol(city)]
yttb<-city[,ncol(city)]
 
#set obvious seed
 
set.seed(1)
 
#do the dance!
 
out<-takethebeast(validities = allperms-1, weights = rep(1,9), x = as.matrix(xttb), y = yttb)
 
#print the order
 
print(names(xttb)[allvs[which.max(out$checks),]])
 
#Save the distribution over reasonably good performance
 
png("allcomps.png")
truehist(out$checks[out$checks>1500], col = "darkred", xlab="Score", main="Performance")
dev.off()

Results:

And that is it. We find that the best cue order by our brute force algorithm is:

Soccer Team > State Capital > Industrial Belt > Intercity trainline > License Plate > Former East Germany > Exposition Site > National Capital

This is already pretty interesting as it puts the national capital cue on the last place, so the truth is the complete opposite from the original proposal regarding the first cue.

The histogram over the high performer is shown below:

allcomps

We can see that a lot of the different cue combinations perform reasonably well. This also tells us something very important about machine learning in general: normally it is the prior selection of variables that is important. Once we have a good set of predictive variables, we can easily predict the dependent variable. The different algorithms do matter less in that case. Garbage in, garbage out. Great cues in, great predictions out. Therefore, it would be nicer to compare TTB and other models in data sets, where some cues are unimportant and the models have to decide on which cues to use for future studies.

Conclusion:

I have developed Take-The-Beast, a brute force algorithm to find the best cue order in a given data set. This algorithm could be easily adapted to all of the other data sets and I have provided all necessary code above.

Testing this algorithm within the city size data set, I found the cue order presented above. This cue order contradicts the traditionally estimated cue order as it puts the city size cue on the last place of the tree.

More importantly, many of the different cue orders lead to reasonably good performance, which means that the cue order does not matter so much in the end. This teaches us something on how arbitrary the commonly used data sets might be. In a sense, the most interesting part, selecting important variables, was done by a human before the model comparison even started. In the future it would therefore be nice to either compare TTB with actual human behaviour or to develop an algorithm that learns to distinguish between important and unimportant cues directly. We are currently working on the second part, so stay tuned.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s