Thursday, November 22, 2012

Big data, Obama’s campaign, and social and political analysis

There have been a few news stories over the past couple of weeks with respect to President Obama’s use of big data in his election campaign (references linked in the text below).  Having just finished Sasha Issenberg’s recent book, The Victory Lab, I’ve been reflecting on the use and potential of big data for social and political analysis, and in particular what it might mean if the Obama campaign’s database were made available for social scientists and policy makers to analyze.  Rather than simply being used to help a candidate get elected, could the data be put to other productive uses?  Here are my initial thoughts.

The use of very large datasets to underpin social and political analysis is nothing new.  Censuses are enormous undertakings, involving the surveying of whole populations with respect to a diverse range of factual information.  The data produced typically consists of dozens of large tables consisting of hundreds of variables relating to millions of people and thousands of locations.  Similarly, other social research instruments, such as household surveys and political polls, generate data with respect to a large, representative sample of the population and typically ask the respondent a range of topical questions.  Such datasets provide valuable, detailed and representative data with respect to people and places.  In the case of censuses, such data are used to underpin a wide variety of studies, to create other derived data and information, and to guide the formulation of public policy and shape company marketing and expansion strategies.  Given their size, complexity and cost, such surveys are only conducted on a periodic basis.  For example, censuses are typically held every ten years.  Political polls are usually only conducted prior to and during elections. 

The difference between these kinds of surveys and their resulting data, and so-called big data, is not principally volume, but rather the ability to conduct such surveys on a rolling basis, coping with issues of velocity (the speed at which data are generated) and variety (diverse kinds of data).  A good example of the production and use of big data which is political and social in nature is that of Barack Obama’s campaign team in the 2008 and 2012 US elections.

As detailed by Issenberg (2012), Obama’s team sought to quantify and track all aspects of their campaigns in 2008 and 2012, devising a whole series of metrics that were continuously mined for useful information, patterns and trends.  This included monitoring their own actions, such as placing ads across different media, undertaking mail shots, ringing up potential voters, knocking on doors and canvassing areas, and organising meetings and rallies; tracking who they’d spoken to or who had attended events and what those people had said or committed to; as well as trying to quantify the more ineffable elements, such as the relative value or effect of being approached by a neighbour, stranger or automated system, the extent to which potential voters were undecided, or the ways in which members of the populace could be persuaded to change their mind on an issue or candidate or be motivated to get out and vote.  They supplemented this information with hundreds of randomized, large-scale experiments designed to test the effectiveness of different ways of persuading voters to back Obama, such as comparing different modes of contact and how the message was phrased.  This experimentation also included tests of the layout and design of the campaign’s websites and how effective different tweaks to a site were at increasing engagement, volunteering and donations.  For example, one test evaluated the effects of changing the ‘sign up’ button to ‘learn more’, ‘join us now’ or ‘sign up now’; over the course of 300,000 visits it became clear that ‘join us now’ led to a twenty percent increase in people registering with the site (Issenberg 2012).
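The button test described above is, at heart, a standard comparison of two conversion rates.  As a rough sketch of how such a result would be checked for statistical significance (the visit and sign-up counts below are hypothetical; Issenberg reports only the overall 300,000 visits and the twenty percent lift), a two-proportion z-test looks like this:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both buttons convert equally
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p_value

# Hypothetical counts: 'sign up' vs 'join us now', 150,000 visits each,
# with an 8.0% vs 9.6% conversion rate (a twenty percent relative lift)
z, p = two_proportion_ztest(12000, 150000, 14400, 150000)
print(f"z = {z:.1f}, p = {p:.3g}")
```

At sample sizes like these, even a modest relative lift yields an overwhelming z-statistic, which is why the campaign could be confident in the winning variant after a single test.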

Obama’s team combined all the information they generated with respect to voters with registration data, census and other government data, polling surveys, and data bought from a whole range of suppliers, including general, commercial data aggregators, credit ratings agencies, and cable TV companies.  The result was a set of massive databases about every voter in the country consisting of a minimum of eighty variables (Crovitz 2012), and often many more, relating to a potential voter’s demographic characteristics, their voting history, every instance in which they had been approached by the Obama campaign and their reaction, their social and economic history, their patterns of behaviour and consumption, and their expressed views and opinions, with the databases updated daily during the campaign as new data were produced or bought.  The resulting databases ended up containing billions of pieces of data.  In cases where Obama’s analysts did not know the political affiliation of a voter, and could not establish it through direct contact, they used a sophisticated algorithm to predict a person’s likely voting preference from the variables they did have, much in the same way that a recommender system predicts what books people might like based on what other people with a similar purchasing profile bought (Issenberg 2012).  In this way they could individually profile voters, assess if they were likely to vote and how, and how they might react to different policies and stories.  This was complemented by a highly detailed knowledge of what forms of communication worked best for different kinds of voters.
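The campaign’s actual predictive model is not public, but the similarity-based logic described above can be illustrated with a minimal k-nearest-neighbours sketch: predict an unknown voter’s leaning from the most similar already-profiled voters.  The feature vectors and labels below are toy examples, not real campaign variables:

```python
import math
from collections import Counter

def predict_affiliation(unknown, known_voters, k=3):
    """Predict a voter's likely preference from the k most similar
    profiled voters (a simple k-nearest-neighbours sketch)."""
    # Rank known voters by Euclidean distance over their numeric variables
    nearest = sorted(known_voters, key=lambda v: math.dist(unknown, v[0]))[:k]
    # Majority vote among the nearest neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy feature vectors: (age, income decile, turnout-history score)
known = [
    ((34, 4, 0.9), "lean Obama"),
    ((37, 5, 0.8), "lean Obama"),
    ((61, 8, 0.7), "lean Romney"),
    ((58, 9, 0.6), "lean Romney"),
]
print(predict_affiliation((35, 4, 0.85), known))  # → lean Obama
```

A real system would weight and normalise eighty-plus variables and use far more sophisticated models, but the underlying principle, inferring missing attributes from similar profiles, is the same.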

For the 2012 election, Obama’s data analytics group was five times larger than in 2008 and included leading technologists hired from industry (Scherer 2012).  The team improved the relationality of data collected through different sources and residing in different databases so that they could be more effectively linked together.  They developed campaign apps and used social media such as Facebook to encourage peer pressure to register and to get out the vote, and dropped their own and third-party cookies onto the computers of those who visited their websites to track online habits (Crovitz 2012; Kaye 2012).  They also improved their profiling and predictive modelling and how the information from their analytics was used to direct the campaign, as well as testing and honing ways to raise finance to fund the campaign (Scherer 2012).  And they continuously added and processed new data and ran simulations to predict outcomes and the best responses.  As one campaign official stated: “We ran the election 66,000 times every night” to determine the odds of winning each swing state. “And every morning we got the spit-out — here are your chances of winning these states. And that is how we allocated resources” (quoted in Scherer 2012).
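Those nightly 66,000 runs are, in essence, a Monte Carlo simulation: draw each swing state as won or lost according to its modelled probability, and count how often the electoral-vote total clears 270.  A minimal sketch, assuming entirely hypothetical state probabilities and a hypothetical safe-state baseline (the campaign’s actual model and inputs are not public):

```python
import random

def simulate_elections(swing_states, safe_ev, needed=270, runs=66_000, seed=42):
    """Monte Carlo estimate of the probability of winning the election.

    swing_states: list of (electoral_votes, win_probability) pairs.
    safe_ev: electoral votes assumed already secure.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible estimate
    wins = 0
    for _ in range(runs):
        ev = safe_ev
        for electoral_votes, p_win in swing_states:
            if rng.random() < p_win:  # draw this state as won or lost
                ev += electoral_votes
        if ev >= needed:
            wins += 1
    return wins / runs

# Hypothetical inputs: (electoral votes, modelled win probability)
swing = [(18, 0.7), (29, 0.6), (13, 0.55), (9, 0.5), (20, 0.75)]
print(simulate_elections(swing, safe_ev=237))
```

Re-running this each night with freshly updated state probabilities is what produced the morning “spit-out” of win chances that guided resource allocation.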

By continuously running their evolving datasets through sophisticated algorithms, Obama’s team gained significant advantages over their rivals both in gaining the nomination in 2008 and winning the elections in 2008 and 2012.  First, they were able to micro-manage the running of their campaign across all states, directing resources to where they were needed and analytically assessing the effectiveness of those resources.  If outside agencies were used, such as phone vendors, the services they offered were monitored against agreed targets and let go if they were not performing sufficiently well (half of the ten companies were dropped in the 2008 campaign: Issenberg 2012).  Second, they could monitor unfolding events and conditions in particular locales and respond quickly if necessary.  Third, they could micro-target approaches to individuals and general advertising.  With respect to the latter, they could tailor adverts to specific demographics and places.  For example, in several cities Obama’s campaign bought advertising on selected bus routes based on the profile of who travelled those specific routes, or for particular sporting events, or for specific non-primetime television slots, or online sites popular with certain youth segments.  Such micro-targeting of individuals, locales and events was unheard of a decade earlier, when advertising consisted of mass broadcasts on radio and TV in peak slots or mass mail shots.  Fourth, they could use their resources efficiently, directing attention at floating and new voters, minimising the effects of alienating or annoying the electorate who were committed to Obama and other candidates or who had already voted (taking advantage of early voting), and on election day tracking who had voted and making sure the remaining likely Obama voters got to the polls.  As Issenberg (2012: 246) argues, Obama’s 2008 campaign was “the perfect political corporation: a well-funded, data-driven, empirically rigorous institution”.
It was no different in 2012.

What is noteworthy is that the Obama team’s use of big data is highly resource-intensive, involving the work of thousands of people in a huge crowdsourcing effort, bought databases, sophisticated software, networked infrastructure, and a lot of organisational skill and finance capital to make it all happen.  Indeed, the estimated bill for the 2008 presidential campaign across all parties was $2.8bn and for 2012 $2.6bn (Center for Responsive Politics 2012).  It is perhaps no surprise therefore that such a big data project only arises in such a well-resourced campaign, or with respect to nationwide, large-scale, profitable commercial endeavours such as credit ratings.

Given the political value of the data assembled by Obama’s team, and the commercial origins of much of it, as far as I am aware it has not been made available, in part or full, as open data in aggregated form (to avoid issues of privacy infringement) for others to mine and analyze.  This is a shame as it is no doubt one of the richest social and political datasets in the world, given the diverse and rich range of individual variables included from a variety of sources.  Rather than simply being used to help get a candidate elected, the data could be put to productive use for analysing a range of social, demographic and economic issues, and be used to underpin and evaluate data-driven policy analysis and formulation at local, state and national scales.  I’ve little doubt that it could keep an army of social scientists occupied for a number of years and lead to detailed secondary analyses that have to date been difficult to undertake, fresh empirical and theoretical insights, and new policy suggestions across a diverse set of issues.

Despite Obama’s success at harnessing big data, at present the use of big data for social and political analysis is limited for a number of reasons, the prime ones being resourcing and focus.  There is no doubt that if academics could afford to buy access to commercially generated data, and to combine them in different ways with public and other commercial datasets, they could tackle a whole range of interesting and valuable questions about contemporary society.  The same could be said if they were able to run very large-scale, rolling experiments, as Obama’s team were able to do.  Social science research budgets are, however, small - much smaller than Obama’s campaign budget - spread across thousands of academics and research teams, and under increasing pressure with cutbacks in public sector spending.  Moreover, the data available through social media APIs, whilst useful, were never designed to answer social science questions, are riddled with anonymous and dirty data, and at best provide proxy data.  Nevertheless, there is much emerging potential in crowdsourcing, open data, and mining new social media to reveal insights into social and political phenomena, and such research looks set to expand rapidly over the next decade or so.  It would certainly be given a significant boost if Obama’s big data machine could be made available to social scientists and policy makers and not just used for electioneering.

1 comment:

pattinase (abbott) said...

And why was Romney's data so wrong? Or, for that matter, why were a lot of the polls, like Gallup and Rasmussen, showing a significant Romney lead up until the end? With cell phone use so high and not calculable by pollsters, they cannot be as accurate as in the past without more sophisticated data sets.