Monday, November 24, 2014

Review of Dataclysm: Who We Are (when we think no one’s looking) by Christian Rudder (Fourth Estate, 2014)

The stated aim of Dataclysm is to introduce lay readers to the era of big data, the possibilities of such data, and the types of analysis used to make sense of them.  The problems with the book start with the title and subtitle, neither of which, I think, makes much sense.  Dataclysm, Rudder explains, is a play on cataclysm: the wiping away of one era to be replaced by a new one.  However, big data are set to complement small data, not wipe them away: small data are generated to answer specific questions rather than being a by-product that is then repurposed, and most big data are held by private corporations or government and are not readily open to researchers.  As for the subtitle, all of the data that Rudder analyzes are from social media; they are data produced precisely because we think someone is looking (for a date, for conversation, for information, to provoke a reaction, etc.).  Uploading information to the internet is largely a process of the presentation of the self, as Goffman’s famous theory would frame the activity.  Even if other people cannot see the answers to direct questions, as when filling in questions on a dating site, the answers shape the user profile and the process of matching -- something that users are aware of, consider and present to.

The book then proceeds by discussing social media data and what they might reveal about human behaviour and society.  Crucially, however, there is no systematic discussion of big data per se, their forms and characteristics, no discussion of data analytics, and only a cursory discussion of the many ethical, social and political implications of such data.  There is no discussion of statistics or of statistical tests performed on the data presented, nor of data mining, machine learning, pattern recognition, profiling, prediction, etc.  The irony here is that Rudder’s company – OkCupid – employs these techniques to process and match potential partners, yet he never explains how this is achieved.

Instead, the entire analysis is rooted in the empiricist form of data science, rather than data-driven science, and never proceeds beyond description.  As such, the analyses of gender and race he presents are based on a ‘letting the data speak for themselves’ approach and constitute armchair interpretation.  He barely engages with the vast academic literature on quantitative analysis of race and gender that has been produced over several decades using large data sets such as the census or public administration data.  Rudder has access to an enormous set of very interesting data that could be used to conduct some fascinating sociological and psychological analysis.  Instead what we get is a series of descriptive statistics and banal revelations, most of which are already well established.

The result is a book that hints at the potential of big data and data science but undersells it substantially.  In my view it also underestimates its potential readership by never progressing beyond the mathematics and data visualisations used in junior school.  In contrast, books such as The Signal and the Noise by Nate Silver provide a much wider and deeper discussion.  This is a shame, as Rudder is an engaging writer with privileged access to extremely rich social data that could be used to conduct some wonderful and sophisticated social science research.  Such rich research and its policy implications are barely hinted at.
