Can data science predict the results of an election?
Dr Chris Hanretty, Reader in Politics at the University of East Anglia, thinks so. Working with Dr Ben Lauderdale of the London School of Economics and Dr Nick Vivyan of Durham University, Dr Hanretty has spent several years decoding national polling information – often seen as a useful, if imperfect, indicator of voter intentions – and developing a model to forecast the results on election night.
Most forecasting models use a select number of national polls to get an idea of how voters' minds have changed since the last election, then extrapolate to predict the picture at a local level. Polling information goes in, seat tallies come out.
Dr Hanretty’s method is different. He set out to create, for the first time, a model that makes predictions on a constituency-by-constituency basis. In doing so, he flipped the traditional input/output on its head, by instead starting with constituency predictions and working up to the national picture.
What follows is a summary of the process underpinning this unique forecasting model. For those who want the complete and unabridged version – beta distributions, logit transforms and all – visit electionforecast.co.uk.
For everyone else, here's how it was done.
National and constituency polls are the most valuable sources of information for any would-be election forecaster. They're essentially surveys that ask a representative sample of the country, or a given constituency, how they intend to vote in the next election. Of course, they can be wildly inaccurate – but the gap between intentions and results becomes dramatically smaller as an election approaches.
Constituency polling is a particularly important way to identify constituency-specific swings. To win a general election outright, a party must win a majority of the 650 seats in the UK parliament. Constituency polling therefore helps to predict the winners of these individual seats.
Some forecasting models assume that swings in each constituency will broadly follow the national trend - known as uniform national swing. In other words, if a party's national vote share rises by two per cent, then its vote share in each constituency will also rise by two per cent.
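The arithmetic of uniform national swing is simple enough to sketch directly. The seats and figures below are entirely hypothetical, purely to show the mechanics:

```python
# A minimal sketch of uniform national swing (UNS): a party's national
# change in vote share is added, unchanged, to its share in every seat.

def apply_uniform_swing(constituency_shares, national_swing):
    """Add each party's national swing (in points) to its share in every seat."""
    return {
        seat: {party: share + national_swing.get(party, 0.0)
               for party, share in shares.items()}
        for seat, shares in constituency_shares.items()
    }

# Hypothetical baseline shares (percentage points) in two seats.
baseline = {
    "Seat A": {"Con": 40.0, "Lab": 35.0},
    "Seat B": {"Con": 30.0, "Lab": 45.0},
}

# Suppose national polls imply Con +2, Lab -1 since the last election:
projected = apply_uniform_swing(baseline, {"Con": 2.0, "Lab": -1.0})
# Under UNS, every seat moves by exactly the national amount.
```

The assumption's weakness is visible in the code itself: there is no room for a seat to move differently from the country as a whole.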
However, Dr Hanretty believed this assumption wouldn't work as well in the 2015 election, which promised to be a rather different beast to those that had come before. In Scotland, the SNP campaign was drawing votes away from Labour, while elsewhere support for UKIP was on the rise – but generally just in a handful of key areas, rather than nationally. An alternative to uniform national swing had to be found.
Dedicated constituency polls, however, covered only a fraction of the 650 seats. The first problem, then, was how to fill the gaps for the remaining constituencies – and for this Hanretty, Lauderdale and Vivyan turned to market research specialist YouGov.
YouGov does not offer constituency polls per se, but its national polling data helpfully includes the constituencies of individual respondents. With a little legwork – and a lot of complex statistical modelling – Dr Hanretty believed they could extrapolate from YouGov's relatively small sample sizes to build a reliable picture of each constituency's voting intentions.
The next problem was how to ensure respondents from a given constituency were indeed representative of that constituency. For example: if ten of your respondents are from Tunbridge Wells, eight of them happen to be women, and nine of them intend to vote Labour, you might conclude that a) Tunbridge Wells has a lot more women than men and b) women in Tunbridge Wells are more likely to vote Labour.
The above may well be true – or it could be a fluke caused by treating national polls as constituency-specific data, which can create all sorts of headaches in a statistical model. The solution was to use Census data on the gender, age, qualifications and social grades in each constituency to re-weight the data, creating a more accurate picture wherever one characteristic or another was over-represented.
This use of sample weights is common in statistics: essentially, it's a way to mould raw data (which can be riddled with hidden biases) into something more consistent and useful.
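As a small illustration, here is the Tunbridge Wells example above re-weighted on a single characteristic, gender. All the numbers are made up, and the real model weighted on several census variables at once, but the principle is the same:

```python
# A minimal sketch of census re-weighting on one characteristic.
# Each respondent is weighted by (census share of their group) /
# (sample share of their group), so over-represented groups count for less.

from collections import Counter

def reweight(respondents, census_shares, key):
    """Return one weight per respondent, matching the sample to the census."""
    counts = Counter(r[key] for r in respondents)
    n = len(respondents)
    sample_shares = {group: c / n for group, c in counts.items()}
    return [census_shares[r[key]] / sample_shares[r[key]] for r in respondents]

# Ten hypothetical respondents: 8 women (7 Labour, 1 Conservative), 2 men.
respondents = (
    [{"gender": "F", "vote": "Lab"}] * 7
    + [{"gender": "F", "vote": "Con"}]
    + [{"gender": "M", "vote": "Con"}] * 2
)

# The census says the constituency is roughly 50/50, so women are
# over-represented in the sample and get down-weighted (0.5/0.8 = 0.625 each).
weights = reweight(respondents, {"F": 0.5, "M": 0.5}, "gender")

# Weighted Labour share: 43.75%, versus a raw (unweighted) 70%.
lab_share = sum(w for w, r in zip(weights, respondents)
                if r["vote"] == "Lab") / sum(weights)
```

The weighted estimate is still only as good as the assumption that women who answered the poll vote like women who didn't – but it removes the most obvious distortion.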
Just like predicting the weather, observing the patterns of the past can give us an insight into the future. Historical polling data for past elections can also reveal a lot of useful information – not only about how people vote, and which constituencies tend to swing, but also about how reliable an indicator the polls themselves are. All of this data was factored into the model.
Pooling a wealth of polling data from 1979 onwards, and adjusting for a number of factors – such as the way some polls skew toward certain parties – the team estimated the level of public support for the Conservatives, Labour and the Liberal Democrats on each day leading up to every election since 1979.
Rather than focusing on the actual vote shares achieved in elections and suggested by the polls, the team was interested in comparing two changes: the change in each party's actual vote share since the previous election, and the change implied by the polls.
Through this, the team learned two important things: first, that polls become more accurate predictors of an outcome the closer they are held to an election. This makes sense – people's voting intentions are less likely to change six days before an election, but they may well change over six months.
However, the pattern is interesting: the predictive power of polls rises steeply in the 40 days prior to an election, suggesting that this is when voting intentions start to become "locked in". Another striking point is that any gains or losses predicted by the polls are only ever, at best, 79% accurate - even from polls held a day before the election.
Second, swings in actual vote share don't always follow the swings implied by the polls, and tend to be more moderate – to be smaller gains, and smaller losses. In other words, polls have a tendency to exaggerate. Once again, this needed to be factored in to the overall model.
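The simplest form of this adjustment is to shrink the poll-implied swing toward zero. The factor below is purely illustrative (the team estimated theirs from the 1979 onwards data; it is not stated here):

```python
# A minimal sketch of the "polls exaggerate" adjustment, with a made-up
# shrinkage factor: actual swings are assumed to be a fixed fraction of
# the swing the polls imply.

SHRINKAGE = 0.8  # hypothetical; the real figure comes from historical fitting

def adjust_swing(poll_implied_swing):
    """Moderate a poll-implied swing (in points) toward zero."""
    return SHRINKAGE * poll_implied_swing

# If polls imply Labour are up 5 points on the last election,
# the adjusted model would expect a more moderate gain of 4 points.
adjusted = adjust_swing(5.0)
```

The same factor moderates losses as well as gains, which is exactly the pattern the historical comparison revealed.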
The team only had data on the three major UK political parties – smaller parties historically tend to be ignored by the polls. In predicting the 2015 election, the team had to assume that this relationship also applied to relative newcomers, such as UKIP and the SNP – something Dr Hanretty describes as "an important, and unfortunately untestable, assumption".
The team already knew that the 2015 election would be different. But would it be so different that the past would no longer be a reliable guide to the future?
We mentioned earlier that Dr Hanretty's model does something very different: instead of taking polls as input and producing vote-share estimates as output, it runs simulations to find which eventual vote shares best line up with the polls. This rather tricky feat of reverse-engineering can be explained with an analogy.
If you roll a single marble down a slope covered with pits, you'll see it roll slightly to one side or another as it approaches each indentation. By rolling lots of marbles and recording observations, you can build up a good estimation of the path any marble will take from a given starting-point.
The marbles are parties, the pits are polls, and the journey down the slope is the time leading up to election day. Since actual party support can only be known on the day of an election, the model simulates lots of different levels of party support, then ensures the trajectories of the "marbles" are drawn to the poll information available. From this, the peaks and troughs in party support in the run-up to an election can be estimated.
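One simple way to make the marble analogy concrete is importance sampling: simulate many random trajectories of party support, then weight each one by how closely it passes through the observed polls. This is a toy sketch with invented numbers – the team's actual model, with its beta distributions and logit transforms, is considerably more sophisticated:

```python
# A toy "marbles" simulation: random walks of party support, weighted by
# how well each walk agrees with the polls (assumed Gaussian poll error).

import math
import random

random.seed(0)

DAYS = 30
POLLS = {5: 34.0, 12: 35.5, 20: 36.0, 28: 36.5}  # day -> reported support (%)
POLL_SD = 1.5  # assumed polling error, in points

def simulate_path(start, step_sd=0.4):
    """One marble: a random walk of daily support levels."""
    path, level = [], start
    for _ in range(DAYS):
        level += random.gauss(0, step_sd)
        path.append(level)
    return path

def poll_weight(path):
    """Likelihood of the observed polls given this trajectory."""
    log_w = sum(-((path[day] - obs) / POLL_SD) ** 2 / 2
                for day, obs in POLLS.items())
    return math.exp(log_w)

paths = [simulate_path(random.gauss(35, 2)) for _ in range(5000)]
weights = [poll_weight(p) for p in paths]
total = sum(weights)

# Weighted average over all "marbles": a smoothed estimate of support
# on the final simulated day, pulled toward the late polls.
final_day = sum(w * p[-1] for w, p in zip(weights, paths)) / total
```

Trajectories that wander far from the polls receive near-zero weight, so the weighted average traces a smooth path through the noisy poll readings.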
If this all still sounds a little esoteric, it's all in the name of "smoothing" the data garnered from large numbers of opinion polls. All polls have a reported margin of error: for instance, a poll reporting Conservative support at 36% might carry a 3% margin of error, meaning – due to the limitations of sampling – that if you really interviewed the entire population, you'd most likely find support lying somewhere between 33% and 39%.
Since a few percentage points could mean the difference between a correct or incorrect prediction, all these many separate margins of error needed to be factored in to the model as a whole.
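A reported margin of error is usually just the 95% confidence interval for a sample proportion, which can be computed directly. The poll sizes below are typical examples, not figures from any specific poll:

```python
# The standard 95% margin of error for a sample proportion:
# z * sqrt(p * (1 - p) / n), with z = 1.96 for 95% confidence.

import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion p estimated from a sample of n."""
    return z * math.sqrt(p * (1 - p) / n)

# A typical ~1,000-person national poll putting one party at 36%
# carries the familiar "plus or minus 3 points":
moe = margin_of_error(0.36, 1000)       # roughly 0.03

# The same share estimated from a 174-person constituency sample
# is far noisier:
moe_small = margin_of_error(0.36, 174)  # roughly 0.07
```

The square root in the formula is why small constituency samples are so much less precise than national ones: halving the error requires quadrupling the sample.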
Refining constituency data
The team came to a crucial and defining step of its model: modelling constituency outcomes. We've already described how YouGov's national polls, once properly processed and combined with Census information, can be used to approximate constituency-specific data.
On average, the model uses a sample of 174 voters from each constituency – some 113,000 people across the 650 seats: a good cross-section nationally, but a small sample in each constituency compared with the actual size of its electorate.
Even assuming the voters were indeed representative of their constituencies, the team estimated that they would still be working within a 10% margin of error in each constituency. This random variation, known as noise, could potentially present a very misleading picture overall.
To gain a more accurate insight, another layer was needed. The solution: for each constituency, add extra information which has been a good historical predictor of voting outcomes (such as each party's vote share in the last election), and factor in additional demographic data.
The graph opposite shows what happens when the constituencies with the 100 youngest and 100 oldest populations are plotted against their level of support for the Conservative party. While not a perfect relationship, the rising line through the middle of the dots indicates that older constituencies (and older constituents) are more likely to vote Tory.
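A relationship like the one in the graph can be captured with a simple least-squares fit through the points. The data below is synthetic, generated with a known slope purely to illustrate the mechanics – it is not the team's data or method:

```python
# A minimal sketch of fitting a line through constituency points:
# ordinary least squares with one predictor (median age), on synthetic data.

import random

random.seed(1)

n = 200
age = [random.uniform(30, 55) for _ in range(n)]          # synthetic median ages
# Synthetic "truth": support rises 0.6 points per year of median age, plus noise.
support = [10 + 0.6 * a + random.gauss(0, 2) for a in age]

mean_a = sum(age) / n
mean_s = sum(support) / n

# Closed-form OLS: slope = cov(age, support) / var(age).
slope = (sum((a - mean_a) * (s - mean_s) for a, s in zip(age, support))
         / sum((a - mean_a) ** 2 for a in age))
intercept = mean_s - slope * mean_a

# With enough points, the fitted slope recovers the value used to
# generate the data, despite the noise in each individual "constituency".
```

The real model does this with many indicators at once rather than one line at a time, but the principle – pooling noisy points to extract a shared trend – is the same.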
This process was repeated for each constituency with 13 more indicators, including demographic information like education level, gender and income, as well as some more specific factors – such as the percentage of people who supported leaving the EU. Combined with historical voting data, an abstract "score" was calculated for each party, which would ultimately be used in calculating vote shares.
In itself, this process turned up some surprising results. For instance, one might expect UKIP to do well in a constituency with high levels of Euroscepticism, but – once all the other indicators on the list were controlled for – the reverse turned out to be true.
Onwards to May 7th
All the above broadly describes Dr Hanretty's model (there are one or two more book-keeping steps after the last one, but they're mostly of interest to the statisticians among us). Let's move ahead to the predictions the model made – and how they compared with the reality on May 7th.
“Our current prediction is that there will be no overall majority, but that the Conservatives will be the largest party with 278 seats. However, based on the historical relationships between the sources of information we are using in our forecast and the outcome of UK elections, we know there is substantial uncertainty in our forecast.”
As we now know, in the wake of an extremely dramatic night, the 2015 general election ended with the Conservatives winning 330 seats – a slim outright majority that avoided another coalition government. Labour support in Scotland gave way to an overwhelming SNP victory, while Liberal Democrat seats across the nation dwindled away to almost nothing. Meanwhile, UKIP and the Greens ended up with a seat each, despite the former coming third in overall vote share.
As Dr Hanretty puts it: "A model can be sophisticated, produce consistent and plausible estimates – and still be wrong."
In fact, there are many things the model got exactly right – including the single seat each for UKIP and the Green Party, and the three for Plaid Cymru. It also predicted the SNP would win 54 seats, just two away from the actual total of 56.
Clearly, the model's predictions for the Conservatives (in particular), Labour and the Lib Dems were substantially off. Yet the constituency modelling itself worked well: in areas where Labour was forecast to do well, it did – and the same was true for the Conservatives.
A crucial note, however, is that in this election the Conservatives outperformed the model's forecasts in marginal seats (as opposed to safe seats). This led to a much greater number of seats for the party, and ultimately, a majority in parliament.
The model was by no means alone in failing to predict a Conservative majority. In fact, almost every UK poll failed to foresee the outcome in 2015.
The model might not have predicted the outcome, but its author at least correctly predicted the reason it didn't. In an internal document produced for media partners, Dr Hanretty wrote:
"We have to assume that polling companies are on average right… if the polling industry suffers a catastrophic failure – as it did in 1992 – then we'll also be wrong. Unfortunately, there's nothing we can do about this."
We all like to learn about the future, and so forecasts like Dr Hanretty's will continue to be useful. Sometimes these techniques can even be used to tell us about the present. One off-shoot of Dr Hanretty's work looks at how we can "remap" the results of the 2016 EU membership referendum – which was counted at local authority level – on to Westminster constituencies. These figures have been used by different media organisations and by the House of Commons Library.

Dr Hanretty admits this exercise was less stressful. "When you're trying to learn from the data, it's useful to have something to anchor yourself to. With the EU referendum, I had a handful of constituencies where the true results were known. I could use these bits and pieces to make sure I was on the right lines. Unfortunately, the future is much less co-operative!"