Can data science predict the results of an election?
Dr Chris Hanretty, former Reader in Politics at the University of East Anglia, thinks so.
Working with Dr Ben Lauderdale of the London School of Economics and Dr Nick Vivyan of Durham University, Dr Hanretty is developing a model to forecast the 2017 UK general election results.
Back in 2015 the team used a similar forecasting methodology which – like almost every UK poll – produced a flawed forecast, failing to foresee the outcome.
Most forecasting models use a select number of national polls to get an idea of how voters' minds have changed since the last election, then extrapolate to predict the picture at a local level. Polling information goes in, seat tallies come out.
Dr Hanretty’s method was different. He set out to create, for the first time, a model that makes predictions on a constituency-by-constituency basis. In doing so, he flipped the traditional input/output on its head, by instead starting with constituency predictions and working up to the national picture.
What follows is a summary of the process underpinning this unique 2015 forecasting model. For those who want the complete and unabridged version - beta distributions, logit transforms and all - visit http://electionforecast.co.uk.
For everyone else - here's how it was done and what lessons have been learned.
National and constituency polls are the most valuable sources of information for any would-be election forecaster. They're essentially surveys that ask a representative sample of the country, or a given constituency, how they intend to vote in the next election. Of course, they can be wildly inaccurate – but the gap between intentions and results becomes dramatically smaller as an election approaches.
Constituency polling is a particularly important way to identify constituency-specific swings. To win a general election outright, a party must win a majority of the 650 seats in the UK parliament. Constituency polling therefore helps to predict the winners of these individual seats.
Some forecasting models assume that swings in each constituency will broadly follow the national trend - known as uniform national swing. In other words, if a party's national vote share rises by two per cent, then its vote share in each constituency will also rise by two per cent.
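As a minimal sketch of the idea – with entirely made-up vote shares – uniform national swing simply adds the same national change to every constituency:

```python
# Uniform national swing: add the same national change in vote
# share to every constituency. All figures are illustrative.
national_swing = 0.02  # party's national share up two points

shares_2010 = {"Seat A": 0.41, "Seat B": 0.28, "Seat C": 0.35}

# Projected share in each seat under uniform national swing
projected = {seat: share + national_swing for seat, share in shares_2010.items()}
```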
However, Dr Hanretty believed this assumption wouldn't work as well in the 2015 election, which promised to be a rather different beast to those that had come before. In Scotland, the SNP campaign was drawing votes away from Labour, while elsewhere support for UKIP was on the rise – but generally just in a handful of key areas, rather than nationally. An alternative to uniform national swing had to be found.
Constituency polls, however, are commissioned for only a fraction of the 650 seats. The first problem, then, was how to fill the gaps for the remaining constituencies, and for this Hanretty, Lauderdale and Vivyan turned to market research specialist YouGov.
YouGov does not offer constituency polls per se, but its national polling data helpfully includes the constituencies of individual respondents. With a little legwork – and a lot of complex statistical modelling – Dr Hanretty believed they could extrapolate from YouGov's relatively small sample sizes to build a reliable picture of each constituency's voting intentions.
The next problem was how to ensure respondents from a given constituency were indeed representative of that constituency. For example: if ten of your respondents are from Tunbridge Wells, eight of them happen to be women, and nine of them intend to vote Labour, you might conclude that a) Tunbridge Wells has a lot more women than men and b) women in Tunbridge Wells are more likely to vote Labour.
The above may well be true – or it could be a fluke caused by treating national polls as constituency-specific data, which can create all sorts of headaches in a statistical model. The solution was to use Census data on the gender, age, qualifications and social grades in each constituency to re-weight the data, creating a more accurate picture wherever one characteristic or another was over-represented.
This use of sample weights is common in statistics: essentially, it's a way to mould raw data (which can be riddled with hidden biases) into something more consistent and useful.
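To make the re-weighting idea concrete, here is a toy version of the Tunbridge Wells example above, with made-up Census shares. Each respondent is weighted by population share divided by sample share for their group, which pulls the raw 90% Labour figure back toward something more plausible:

```python
# Hypothetical ten-person sample from one constituency: eight
# women, nine Labour intenders (the Tunbridge Wells example).
sample = [("F", "Lab")] * 8 + [("M", "Lab"), ("M", "Other")]

census_share = {"F": 0.51, "M": 0.49}  # assumed Census gender shares
sample_share = {"F": 0.80, "M": 0.20}  # gender shares in the raw sample

# Weight each respondent so the weighted sample matches the Census
weights = [census_share[g] / sample_share[g] for g, _ in sample]

lab_weight = sum(w for (g, v), w in zip(sample, weights) if v == "Lab")
lab_share = lab_weight / sum(weights)  # roughly 0.755, down from a raw 0.9
```

The two men now count for far more than the eight women, because men were badly under-represented in the raw sample.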
Just as with predicting the weather, observing the patterns of the past can give us an insight into the future. Historical polling data for past elections can also reveal a lot of useful information – not only about how people vote, and which constituencies tend to swing, but also about how reliable an indicator the polls themselves are. All of this data was factored into the model.
Pooling a wealth of polling data from 1979 onwards, and making adjustments for a number of factors – such as the ways some polls skew toward certain parties – the team estimated the daily level of public support for the Conservatives, Labour and the Liberal Democrats for each day leading up to every election since 1979.
Rather than focusing on the actual vote shares achieved in elections and suggested by the polls, the team was interested in the change in actual vote share since the previous election, and how this compared with the change in share implied by the polls.
Through this, the team learned two important things: first, that polls become more accurate predictors of an outcome the closer they are held to an election. This makes sense – people's voting intentions are less likely to change six days before an election, but they may well change over six months.
However, the pattern is interesting: the predictive power of polls rises steeply in the 40 days before an election – which is when it becomes worth paying close attention to them. Another striking point is that the correlation between the polls and the outcome – a measure of the strength of the association between the two, which ranges from zero to 100% – reaches a maximum of just 80%.
Second, swings in actual vote share don't always follow the swings implied by the polls, and tend to be more moderate – to be smaller gains, and smaller losses. In other words, polls have a tendency to exaggerate. Once again, this needed to be factored in to the overall model.
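One simple way such exaggeration could be corrected is to shrink the poll-implied swing toward zero before adding it to the previous result. The sketch below does exactly that; the 0.7 shrinkage factor is purely illustrative, not the team's estimate:

```python
# Polls tend to overstate swings, so damp the poll-implied
# change toward zero. The 0.7 factor is illustrative only.
def adjusted_share(previous_share, poll_share, shrink=0.7):
    implied_swing = poll_share - previous_share
    return previous_share + shrink * implied_swing

# A six-point poll swing from a 30% base is damped to 4.2 points
adjusted_share(0.30, 0.36)  # roughly 0.342
```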
The team only had data on the three major UK political parties – smaller parties historically tend to be ignored by the polls. In predicting the 2015 election, the team had to assume that this relationship also applied to relative newcomers, such as UKIP and the SNP – something Dr Hanretty describes as "an important, and unfortunately untestable, assumption".
The team already knew that the 2015 election would be different. But would it be so different that the past would no longer be a reliable guide to the future?
We mentioned earlier that Dr Hanretty's model does something very different: instead of taking polls as input and estimates of a vote share as output, it instead runs simulations to find which eventual vote shares best line up with the polls. This rather tricky feat of reverse-engineering can be explained with an analogy.
If you roll a single marble down a slope covered with pits, you'll see it roll slightly to one side or another as it approaches each indentation. By rolling lots of marbles and recording observations, you can build up a good estimation of the path any marble will take from a given starting-point.
The marbles are parties, the pits are polls, and the journey down the slope is the time leading up to election day. Since actual party support can only be known on the day of an election, the model simulates lots of different levels of party support, then ensures the trajectories of the "marbles" are drawn to the poll information available. From this, the peaks and troughs in party support in the run-up to an election can be estimated.
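One way to sketch this marble-rolling idea in code is importance weighting: simulate many random-walk trajectories of latent party support, then weight each trajectory by how well it fits the observed polls. All the numbers below are illustrative, not the team's actual model:

```python
import math
import random

random.seed(0)

polls = {10: 0.34, 25: 0.36, 40: 0.35}  # day -> observed poll share
poll_sd = 0.02                           # assumed poll noise

def simulate(days=50, start=0.33, step_sd=0.005):
    """Roll one 'marble': a random walk of latent party support."""
    path, level = [], start
    for _ in range(days):
        level += random.gauss(0, step_sd)
        path.append(level)
    return path

def weight(path):
    """How well does this trajectory fit the polls? (Gaussian likelihood)"""
    return math.prod(
        math.exp(-((path[day] - share) ** 2) / (2 * poll_sd ** 2))
        for day, share in polls.items()
    )

paths = [simulate() for _ in range(2000)]
weights = [weight(p) for p in paths]

# Weighted average of final-day support across all "marbles"
estimate = sum(w * p[-1] for w, p in zip(weights, paths)) / sum(weights)
```

Trajectories that drift far from the polls receive tiny weights, so the final estimate is dominated by the poll-consistent "marbles".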
If this all still sounds a little esoteric, it's all in the name of "smoothing" the data garnered from large numbers of opinion polls. All polls have a reported margin of error: for instance, a poll reporting Conservative support at 36% might report a 3% margin of error, meaning – due to the limitations of the polling process – that if you really interviewed the entire population, you'd likely find support lies somewhere between 33% and 39%.
Since a few percentage points could mean the difference between a correct or incorrect prediction, all these many separate margins of error needed to be factored in to the model as a whole.
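Those margins follow from basic sampling theory: for a simple random sample, the 95% margin of error is roughly 1.96 × √(p(1 − p)/n). A back-of-envelope check shows that a ~1,000-person poll putting a party on 36% carries close to the three-point margin described above:

```python
import math

# 95% margin of error for a simple random sample of size n,
# where p is the reported share of support
def margin_of_error(p, n, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

# ~1,000 respondents, 36% support: close to a 3-point margin
margin_of_error(0.36, 1000)  # about 0.03
```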
Refining constituency data
The team came to a crucial and defining step of its model: modelling constituency outcomes. We've already described how YouGov's national polls, once properly processed and combined with Census information, can be used to approximate constituency-specific data.
On average, the model uses a sample of 174 voters from each constituency, or roughly 113,000 people in total: a good cross-section, but a small sample compared with the actual size of the electorate.
Even assuming the voters were indeed representative of their constituencies, the team estimated that they would still be working within a 10% margin of error in each constituency. This random variation, known as noise, could potentially present a very misleading picture overall.
To gain a more accurate insight, another layer was needed. The solution: for each constituency, add extra information which has been a good historical predictor of voting outcomes (such as each party's vote share in the last election), and factor in additional demographic data.
The graph opposite shows what happens when the constituencies with the 100 youngest and 100 oldest populations are plotted against their level of support for the Conservative party. While not a perfect relationship, the rising line through the middle of the dots indicates that older constituencies (and older constituents) are more likely to vote Tory.
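The relationship in that graph can be captured with an ordinary least-squares line. The sketch below fits one to entirely invented age and support figures, just to show the mechanics:

```python
# Ordinary least-squares fit of Conservative share on median age.
# The data points are invented purely to illustrate the method.
ages = [32, 36, 40, 44, 48, 52]
shares = [0.25, 0.30, 0.34, 0.37, 0.42, 0.46]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(shares) / n

slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, shares))
    / sum((x - mean_x) ** 2 for x in ages)
)
intercept = mean_y - slope * mean_x
# A positive slope: older constituencies show higher Conservative support
```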
This process was repeated for each constituency with 13 more indicators, including demographic information like education level, gender and income, as well as some more specific factors – such as the percentage of people who supported Brexit. Combined with historical voting data, an abstract "score" was calculated for each party, which would ultimately be used in calculating vote shares.
In itself, this process turned up some surprising results. For instance, one might expect UKIP to do well in a constituency with high levels of Euroscepticism, but – once all the other indicators on the list were controlled for – the reverse turned out to be true.
All the above broadly describes Dr Hanretty's model (there are one or two more book-keeping steps after the last one, but they're mostly of interest to the statisticians among us). Let's move ahead to the predictions the model made – and how they compared with the reality on May 7th 2015.
“Our current prediction is that there will be no overall majority, but that the Conservatives will be the largest party with 278 seats. However, based on the historical relationships between the sources of information we are using in our forecast and the outcome of UK elections, we know there is substantial uncertainty in our forecast.”
As we now know, in the wake of an extremely dramatic night, the 2015 general election ended with the Conservatives winning 330 seats – a slim overall majority – narrowly avoiding another coalition government. Labour support in Scotland gave way to an overwhelming SNP victory, while Liberal Democrat seats across the nation dwindled away to almost nothing. Meanwhile, UKIP and the Greens ended up with a seat each, despite the former coming third in vote share overall.
As Dr Hanretty puts it: "A model can be sophisticated, produce consistent and plausible estimates – and still be wrong."
In fact, there are many things the model got exactly right – including the single seat each for UKIP and the Green Party, and the three for Plaid Cymru. It also predicted the SNP would win 54 seats, just two away from the actual total of 56.
Clearly, the model predictions for the Conservatives (in particular), Labour and the Lib Dems were substantially off. Yet in general, the constituency modelling actually worked well: in areas where Labour was forecast to do well, they did, as did the Conservatives.
A crucial note, however, is that in this election the Conservatives outperformed the model's forecasts in marginal seats (as opposed to safe seats). This led to a much greater number of seats for the party, and ultimately, a majority in parliament.
The model was by no means alone in failing to predict a Conservative majority. In fact, almost every UK poll failed to foresee the outcome in 2015.
Onwards to 2017
The 2015 model might not have predicted the outcome, but its author at least correctly predicted the reason it didn't. In an internal document produced for media partners, Dr Hanretty wrote:
"We have to assume that polling companies are on average right… if the polling industry suffers a catastrophic failure – as it did in 1992 – then we'll also be wrong. Unfortunately, there's nothing we can do about this."
In predicting the 2015 election, a key problem was not a lack of data, but recruiting the right people to take part in the opinion poll samples.
With the increasing amount of big data available, so much is known about the UK population that it ought to be fairly straightforward to anticipate how it will vote. But statisticians point out that it's actually a question of matching sample data to the true values in the population.
We may know from sources such as the Census how many people are aged 18-25 in particular areas, but we don't know, for example, what types of media they trust: these are things which might affect political attitudes, yet pollsters can't tell what the true population value is. To predict accurately, therefore, some baseline truths are needed against which you can measure your sample.
But how do you recruit people who don't really care about politics to answer questions about politics? Polling companies are working harder to recruit the right types of people, but we won't know until after the 2017 election whether this will improve the accuracy of their polls.
For more information on this research visit http://www.electionforecast.co.uk
Since publication of this case study Dr Chris Hanretty has left the University of East Anglia.