How I Used Machine Learning to Predict Soccer Games for 24 Months Straight

For better or worse, machines do not place bets with their hearts.

This is a guest post by Ola Lidmark Eriksson.

Can machine learning make you rich from sports betting?

Two years ago, I asked myself if it would be possible to use machine learning to predict soccer games’ outcomes better.

I decided to give it a serious try, and today, two years and contextual data from 30,000 soccer games later, I’ve gained many interesting insights.

Here we go:

The Big Data Challenge: Let the Data Mining Begin

Step 1: To begin with, I harvested as many data points as possible. I mined old game data from every source and API I could find. Some of the more important ones were Football-data, Everysport, and Betfair.

Step 2: I merged these data points with their corresponding results, quantified them, and put everything into one database.

Step 3: Finally, I used the data to train a machine learning model, to be used as my software for predicting upcoming soccer games.
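As a rough illustration of Step 2, here is a minimal sketch of merging two sources into training rows. The field names, the match key, and the sample rows are assumptions made for this example, not the actual schema of Football-data, Everysport, or Betfair.

```python
from datetime import date

# Hypothetical rows from two sources; real feeds will look different.
odds_rows = [
    {"date": date(2015, 3, 1), "home": "Team A", "away": "Team B", "home_odds": 1.80},
    {"date": date(2015, 3, 2), "home": "Team C", "away": "Team D", "home_odds": 2.50},
]
result_rows = [
    {"date": date(2015, 3, 1), "home": "Team A", "away": "Team B",
     "home_goals": 2, "away_goals": 0},
]


def match_key(row):
    """Identify a game by date and teams (an assumed, simplified key)."""
    return (row["date"], row["home"], row["away"])


def outcome_label(row):
    """Quantify the result as H (home win), D (draw) or A (away win)."""
    if row["home_goals"] > row["away_goals"]:
        return "H"
    if row["home_goals"] < row["away_goals"]:
        return "A"
    return "D"


def merge_sources(odds_rows, result_rows):
    """Join odds and results on the match key; skip games without a result."""
    results = {match_key(r): r for r in result_rows}
    merged = []
    for odds in odds_rows:
        result = results.get(match_key(odds))
        if result is None:
            continue
        merged.append({
            "key": match_key(odds),
            "features": [odds["home_odds"]],
            "outcome": outcome_label(result),
        })
    return merged


training_rows = merge_sources(odds_rows, result_rows)
```

Each merged row pairs a feature list with a known outcome, which is the shape most training routines expect.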

How To Measure Predictions of the Unpredictable

Of course, the nature of a soccer match is that it is unpredictable. I guess that’s why we love the game, right?

Still, I was somewhat obsessed with the naïve notion that I, armed with a data-driven machine-learning model, could predict games better than I usually would. At that point, I based most of my sports bets on emotions (“gut feelings”) rather than actual data.

The first challenge was to find out how to measure whether or not my model was succeeding. I quickly realized that measuring the actual percentage of correctly guessed games didn’t add much value, at least not without some form of context.

I decided to compare the model’s output with the best guesses of the actual market. The easiest way to assess such data was to harvest market-regulated odds. Therefore, I started comparing how my model would perform against Betfair, simply because their odds are adjusted based on real people betting real money against each other.
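To compare a model with the market, the odds first have to be turned into probabilities. The sketch below assumes decimal odds and a simple proportional rescaling; the function name and the sample odds are made up for illustration.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds for all outcomes of one game into probabilities.

    The raw inverses usually sum to slightly more than 1 (margin or
    commission), so they are rescaled proportionally. That rescaling is an
    assumption of this sketch; other normalizations exist.
    """
    raw = [1.0 / odds for odds in decimal_odds]
    total = sum(raw)
    return [p / total for p in raw]


# Hypothetical home / draw / away odds for a single game.
market_probs = implied_probabilities([1.90, 3.60, 4.50])
```

The resulting probabilities sum to 1 and can be set side by side with the model’s own output for the same game.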

The Results: Did My Model Make Me Rich?

Fast-forward to today: two years have passed. Has the model made me a rich man?

Well, no.

I soon realized that my predictions, for the most part, were aligned with the market’s own predictions.

Since I used a regression-based model, I could estimate the probability of a specific game outcome. At the highest probability levels, my model predicts roughly 70% of the games correctly. Since the market performs just as well, making serious money from my bets is difficult.
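One way to check a figure like this is to group predictions by how confident the model was and measure the share of correct calls in each group. This is a minimal sketch; the function name, the bucket scheme, and the sample data are assumptions, not the model’s actual evaluation code.

```python
from collections import defaultdict


def accuracy_by_confidence(predictions, n_buckets=10):
    """Group predictions by the model's top probability and score each group.

    predictions: iterable of (top_probability, was_correct) pairs.
    Returns {bucket_index: share_correct}; with 10 buckets, bucket 7
    covers probabilities from 0.7 up to 0.8.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for prob, was_correct in predictions:
        bucket = min(int(prob * n_buckets), n_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(was_correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Made-up predictions, only to show the shape of the input.
sample = [(0.72, True), (0.75, True), (0.78, False), (0.45, True), (0.41, False)]
buckets = accuracy_by_confidence(sample)
```

Running the same function on market-implied probabilities gives a like-for-like comparison between model and market.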

But, to be honest, I never thought I would create a “money machine,” either. Instead, I came to several rather exciting insights about the possibilities (and limitations!) of big data and machine learning:

Learning 1: Machine Learning and Diminishing Gains

In theory, machine learning should be able to improve over time. As the amount of data the model has to learn from grows, the predictions should get better.

Well, this wasn’t my experience at all.

Two years ago, I started with about 2,000 games in my database and relatively limited data sets attached to them. Today, I have almost 30,000 games in the database, with metadata covering everything from weather and distances between the teams’ home grounds to shots and corners.

Despite all this added data, and even though the model has been able to “learn” over time, it still didn’t improve its predictions. Big data and machine learning will only take you so far in predicting the unpredictable.
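A simple way to make diminishing gains visible is to record accuracy at a few database sizes and compute the gain per 1,000 added games. The sketch below uses placeholder accuracy values that are not measured results; only the idea of the calculation matters.

```python
def gain_per_thousand_games(curve):
    """Compute the accuracy gained per 1,000 added games between checkpoints.

    curve: list of (games_in_database, accuracy) pairs, sorted by games.
    Returns a list of (games_at_checkpoint, gain_per_1000_games).
    """
    gains = []
    for (n_prev, acc_prev), (n_next, acc_next) in zip(curve, curve[1:]):
        gain = (acc_next - acc_prev) / (n_next - n_prev) * 1000
        gains.append((n_next, gain))
    return gains


# Placeholder accuracies for illustration only, not measured results.
curve = [(2_000, 0.50), (10_000, 0.52), (30_000, 0.52)]
gains = gain_per_thousand_games(curve)
```

When the gain per 1,000 games drops to zero, more data of the same kind is no longer helping.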

Learning 2: The Power of Unbiased Generalizations

The power of machine learning seems closely tied to its ability to make unbiased generalizations.

Over the past two years, I was curious to see whether my model could predict when winning or losing streaks would be broken. Could it, for instance, anticipate that Barcelona would finally lose after winning ten games straight? Could my model prove certain anomalies to be significant?

Well, it has proven not to be that good at it.

Instead, I found that the model was surprisingly good at betting against overvalued teams over time.

Last season, I saw how my soccer prediction machine often predicted against Borussia Dortmund while the market predicted otherwise. Dortmund had a lousy season, which gave my model an advantage over the market’s predictions. I have seen the same with teams like Liverpool and Chelsea this season.

So the lesson learned is that some people tend to place sports bets based on emotions. Liverpool and Dortmund are teams liked by lots of people, and at times, people make predictions with their hearts instead of their brains. My machine learning model, well, it does not.
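Betting against an overvalued team comes down to finding outcomes where the model’s probability is high relative to the odds on offer. Here is a minimal sketch of that check, assuming decimal odds; the threshold, names, and numbers are hypothetical.

```python
def expected_value(model_prob, decimal_odds):
    """Expected profit per unit staked, if model_prob were the true probability."""
    return model_prob * decimal_odds - 1.0


def find_value_bets(candidates, min_edge=0.05):
    """Keep outcomes whose expected value is at least min_edge per unit staked."""
    return [
        c for c in candidates
        if expected_value(c["model_prob"], c["odds"]) >= min_edge
    ]


# Hypothetical game: the market favours a popular team more than the model does.
candidates = [
    {"outcome": "popular team wins", "model_prob": 0.45, "odds": 1.90},
    {"outcome": "opponent wins", "model_prob": 0.40, "odds": 3.20},
]
value_bets = find_value_bets(candidates)
```

In this made-up game, only the bet against the popular team clears the threshold, because the market has priced the favourite too generously.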

Learning 3: Machine Learning and Easy Gains

If nothing else, I learned that making predictions that outperform the market is hard. Still, when I started looking at what I had achieved (instead of just obsessing over what I hadn’t), I found one quite surprising fact:

With a simple Python program and fewer than 10,000 lines of code, I had still made something that performed just as well as the market. Just think of how many person-hours must lie behind the bookies’ odds models and predictions. The model can pick out attractive bets weekly, just as any newspaper or expert would. By making generalizations, you might not be able to find that one bet that will make you rich, but in the proper context, it may save you lots of time.

Applying Machine Learning to Wide Ideas

With these insights in mind, I started to look at another project I’ve been involved in for the last five years: Wide Ideas, a platform for companies to crowdsource ideas and creativity.

What I wanted to do was to look at the ideas companies gathered from their employees and try to predict whether they would implement the idea or not.

The team and I quantified the data, but instead of shots on goal and weather forecasts, we looked at how many people had interacted with an idea, and in what way. And lo and behold, the outcome was on par with the soccer predictions:

We can now make decent predictions on whether or not a company will implement a creative idea. We can visualize this to encourage more great ideas through gamification.

Can we find a good idea that doesn’t follow the general patterns of a good idea? No, not yet, at least.

Still, for an organization that can harvest 10,000 ideas per year, finding ways to highlight and encourage particular ideas can save time and resources. Just narrowing 10,000 ideas down to perhaps 100 good ones, and visualizing the result, saves lots of time.
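Narrowing a large pool down to a shortlist is a plain top-k selection once every idea has a predicted score. The sketch below assumes each idea already carries such a score; the field names and the generated scores are hypothetical stand-ins for a trained model’s output.

```python
import heapq


def shortlist(scored_ideas, k=100):
    """Return the k ideas with the highest predicted score, best first."""
    return heapq.nlargest(k, scored_ideas, key=lambda idea: idea["score"])


# Hypothetical scores; in practice they would come from a trained model.
ideas = [{"id": i, "score": (i * 37 % 100) / 100} for i in range(1000)]
top = shortlist(ideas, k=100)
```

Using `heapq.nlargest` avoids sorting the whole pool when only a small shortlist is needed.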

The Gap Between Making Machines Just as Good as Humans and Making Them Better Than We Are

Big data and machine learning might do anything from detecting early-stage cancer to helping self-driving cars anticipate potential dangers. Models like this will probably prove most useful where generalizations save time.

Take medical implementations, for example. Sifting through thousands of birthmark pictures, a model could help pick out the ones most likely to be cancerous, thus saving doctors valuable time and resources.

However, human behaviour may prove to be tricky. In what way is human behaviour predictable? We’re rationally irrational. We can generalize, placing people into different categories based on what they like to eat, watch or do, but there might be too many factors that set us apart as individuals.

Will big data and machine learning detect the anomalies, or will they just be superb at generalizations?

I hope we’ll experience a future where companies focus on actual data analysis instead of thinking that “big data” by default equals “better data.”

So, until someone proves me wrong (or Arnold Schwarzenegger returns from the future, whichever comes first!), we should put machine learning to use where generalizations can best save real humans time.

Otherwise, the risk is that we’ll end up with so many metrics that the sheer amount would suffocate any possibility of making sense of them.

About the writer: Ola Lidmark Eriksson is CTO at Wide Ideas.

