Can You Predict Which Games Will Get the Spiel des Jahres Seal of Approval?

Last month I looked at characteristics of games that have won the Spiel des Jahres. There were some interesting patterns, but because the jury relies on qualitative judgments, the data offered little insight into how a game actually wins.

Every year, the Spiel des Jahres jury selects a handful of games for its Empfehlungsliste (recommendation list) and, from there, the nominees for the overall award. Could an algorithm detect patterns that successfully predict these recommended games?

In order to explore that, I needed more variables to compare these recommended games with other titles.

The Data

The data comes from BoardGameGeek, which I discovered has an excellent XML API, and the Spiel des Jahres’ database of past nominees and winners. I needed data on more games to “train” my prediction models. Since BoardGameGeek identifies games by unique numeric ids as well as names, I took the sequence of integers from 1 to 150,000, drew 1,500 random samples from it, and pulled data on the games with those ids from the site.

After eliminating bad ids (those not attached to a game) as well as ids for video games, consoles, and other non-board game entries, I ended up with 375 board games. For the winners, nominees and recommended games, I scraped the Spiel des Jahres pages for every year between 1979 and 2015. I then used BoardGameGeek’s API to acquire the ids and data for 271 of these games.
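The id-sampling step can be sketched in Python. This is an illustration of the approach, not the post's actual script: the batch size and variable names are my choices, though the `/xmlapi2/thing` endpoint (which accepts comma-separated ids) is BoardGameGeek's real API.

```python
import random

random.seed(42)  # arbitrary seed, just for reproducibility

# Draw 1,500 random candidate ids from 1..150,000.
candidate_ids = random.sample(range(1, 150001), 1500)

# The API accepts comma-separated ids, so batch the requests.
BATCH = 100  # assumed batch size
urls = [
    "https://boardgamegeek.com/xmlapi2/thing?id="
    + ",".join(map(str, candidate_ids[i:i + BATCH]))
    for i in range(0, len(candidate_ids), BATCH)
]
# Each URL would then be fetched (e.g. with urllib) and the XML parsed,
# discarding ids that are not attached to a board game.
```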


I reused some variables from my last post, like Recommended Age and Number of Players, but split some of them into individual columns. That turned factor variables (e.g. “2-4 players”) into discrete ones (minimum equals 2, maximum equals 4). There are also some additional variables, described below.

  • Min.Players: The minimum number of players for a game.
  • Max.Players: The maximum number of players for a game.
  • Min.Age: The minimum age recommended by the manufacturer.
  • Min.Play.Time: The minimum amount of time the manufacturer recommends for playing the game.
  • Avg.Scores: The average score the game has received from users on BoardGameGeek.
  • N.Char: The number of characters in the game’s title.
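The factor-to-discrete split mentioned above is a simple parsing step. A minimal sketch (the function name is mine, and in practice the BGG API already reports minimum and maximum players as separate fields):

```python
import re

def split_player_range(factor):
    """Turn a factor like '2-4 players' into (min, max) integers.

    A single value like '2 players' yields (2, 2)."""
    numbers = [int(n) for n in re.findall(r"\d+", factor)]
    return (numbers[0], numbers[-1])

split_player_range("2-4 players")  # (2, 4)
split_player_range("2 players")    # (2, 2)
```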

The number of characters in the game’s title was just for fun. In the past, I’ve tested how email open rates vary with subject-line length, and with one series of emails I found a strong correlation. My hunch is that games with simple titles like Dixit perform better than games with elaborate titles like Weimar: German Politics 1929-1933. (Who doesn’t want to play that??)

The boxplot below shows the range of the number of characters in the game titles for recommended games and games that were part of the random sample.
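Computing N.Char itself is trivial: count the characters in each title. A toy illustration (these example titles are mine; the boxplot uses the full scraped data):

```python
# N.Char is just the length of the title string.
titles = ["Dixit", "Weimar: German Politics 1929-1933", "Colt Express"]
n_char = {t: len(t) for t in titles}
# {'Dixit': 5, 'Weimar: German Politics 1929-1933': 33, 'Colt Express': 12}
```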


Picking an Algorithm

When it comes to classification, two methods in particular came to mind: a CART model (Classification and Regression Trees) and Random Forests. The whole tree theme will become clear shortly.

If you’re not familiar with supervised machine learning, here’s a crash course. After selecting and preparing data, you have to split your data into two sets, the training and test sets. Think of it as actual teaching. You teach a set of students (the training set) and give them an exam to see how well they’ve done. If everything looks cool, you assume the same method will have a similar success rate on another group of students (the test set). When you give that group the exam, you can see how right or wrong you were. Did you overfit (base the exam too much on the strengths and weaknesses of your first group of students)? Rinse, repeat. Rinse, repeat.
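The split into training and test sets can be sketched in plain Python. The 70/30 fraction and the seed here are my choices for illustration, not the post's actual settings:

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into a training set and a
    held-out test set. Plain-Python sketch of the step described above."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

games = list(range(646))  # 375 random-sample games + 271 recommended games
train, test = train_test_split(games)
len(train), len(test)  # (452, 194)
```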

CART Model

If you have ever seen a decision tree, the CART model will look familiar. Below is the CART model fit for my training data set.


You start at the top and work your way down. Does the game’s title have 12 or more characters? If yes, move to the left; if no, move to the right, and so on until you reach the very bottom. If the leaf says “Yes,” we have a potential Spiel des Jahres seal of approval. Is it accurate, though? Now I use the model on the test set to find out.
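Traversing a fitted tree is just nested if/else logic. In the toy version below, only the first split (12 characters in the title) comes from the plot described above; the score thresholds are hypothetical stand-ins, since the full tree isn't reproduced here:

```python
def classify(n_char, avg_score):
    """Walk a toy two-level decision tree in the style of the CART plot.

    Only the 12-character split is from the post; the score
    thresholds below are hypothetical."""
    if n_char >= 12:                                  # yes -> left branch
        return "Yes" if avg_score >= 7.0 else "No"    # hypothetical split
    else:                                             # no -> right branch
        return "Yes" if avg_score >= 6.5 else "No"    # hypothetical split

classify(n_char=5, avg_score=7.3)   # "Yes"
classify(n_char=20, avg_score=6.0)  # "No"
```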

The table below shows the prediction (rows) and the actual classification (columns) after applying the model to the test set.

Predicted (Row) vs. Actual (Column) Classification

            No   Yes
    No     130    46
    Yes     20    55
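The accuracy figure comes straight from that table: correct predictions (the diagonal) over all predictions. A quick check in Python (the dict layout is my own):

```python
def accuracy(confusion):
    """Accuracy from a 2x2 confusion table stored as
    {(predicted, actual): count}: diagonal counts over the total."""
    correct = confusion[("No", "No")] + confusion[("Yes", "Yes")]
    return 100 * correct / sum(confusion.values())

cart = {("No", "No"): 130, ("No", "Yes"): 46,
        ("Yes", "No"): 20, ("Yes", "Yes"): 55}
round(accuracy(cart), 2)  # 73.71
```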

It’s not a great fit, as it is only 73.71% accurate. One culprit is the fancy tree plot up above. It’s pretty elaborate, right? That’s called “overfitting”: the model fits the training data set so closely that when you apply it to another data set, it’s prone to misclassification.

Random Forests

Now I’ll try applying Random Forests to the data. I’m not going to spend a lot of time explaining the algorithm, so here’s the gist: instead of the single tree in the CART model above, we grow a multitude of trees (a forest!!), each on a random subset of the data, and take the classification that appears most often. It’s an incredibly accurate method, but it’s really slow if your data set is large.
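The aggregation step, taking the classification that appears most often across the trees, is a simple majority vote. A sketch (the bootstrap sampling and per-tree training are elided; the tree counts here are made up):

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Majority vote across many trees' classifications -- the
    aggregation step of Random Forests described above."""
    return Counter(tree_predictions).most_common(1)[0][0]

# e.g. 500 trees, 310 voting "Yes" and 190 voting "No":
forest_vote(["Yes"] * 310 + ["No"] * 190)  # "Yes"
```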

Random Forests also lets us know which variables have the most influence on the classification, as shown in the plot below.


Looks like the opinions of BoardGameGeek users have a pretty big influence, which makes sense when you think about it: how many horribly rated games get nominated for a game-of-the-year award? Interestingly, the number of characters in the game title also has an influence on the classification.

But how does the model do on a different data set?

After applying the Random Forests model to my test set, I end up with a much higher accuracy (80.88%) than the CART model achieved. You can see in the table below how the model performed.

Predicted (Row) vs. Actual (Column) Classification

            No   Yes
    No     135    33
    Yes     15    68

Crystal Ball or Bawl?

Right now you’re probably thinking, “OK, being able to predict with 80.88% accuracy is pretty cool!” Not so fast. Let’s be real: short of near-perfect accuracy, a prediction method is still flawed in some way, and this one misses roughly one game in five.

While Random Forests eliminates some of the overfitting from the CART model, it still has an “out-of-bag” error rate of 24.6%. And as we saw on the test set, the misclassification rate was about 19%.

As Nate Silver says about predictions, you should always recognize your model’s flaws and strive to improve it. Are there variables I’m leaving out, or am I using too many? Is this something we can actually predict at all, given the many qualitative aspects of the nominations? Do I need more data? As 2016 continues, I’ll have an opportunity to actually use my model and see how well it does.