Context is not just a variable — the case for data ethnographies
In his essay ‘Data before the Fact’, Daniel Rosenberg traces the etymology of the word ‘data’. He shows that its meaning has moved from the 17th-century understanding of ‘data’ as an axiomatic premise, the tenet of an experiment, to the modern interpretation of ‘data’ as the outcome of an experiment. In the 17th century, data wasn’t really related to the ‘real world’; it was contextualised by the other data around it. But as science developed sophisticated ways of measuring the physical and metaphysical, the word ‘data’ developed an inalienable relationship to the idea of tangible fact and truth.
Data is now an incredibly powerful tool for describing the world around us, and modern technologies are getting progressively better at sifting the relevant from the irrelevant, finding patterns the empirical scientific method couldn’t. If used well, data has the capacity to create untold social and economic value. Most technological developments today depend on access to enough of the right kind of data.
The success of the scientific and technological revolution of the 20th century has led to a highly pervasive impression of science, and the data it produces, as having a ‘god’s eye view’. The objectivity of the scientific method implies that anything quantitative sits apart from the messy, physical world in a clean, Euclidean realm which can be used to interpret and reveal unseen truths. When working with AI engineers using MRI scans to make diagnostic tools, I have been told that ‘context is just a variable’. These engineers were confident that they could translate medical knowledge into a machine-readable taxonomy, and overwrite any tensions, gaps or issues created by different bodies, in different machines, in different places, by using data at scale and machine learning to develop statistical certainty.
And they can. Their diagnostic tool was very effective at interpreting the available training data for diagnosing lung cancer. This could be a pretty useful thing to have in the world, particularly in regions where there is a lot more lung cancer than there are radiologists. The engineers regularly pointed out that their application’s error rate was well below that of a doctor. But Rosenberg draws a fundamental distinction between fact and data:
‘Facts are ontological, evidence is epistemological, data is rhetorical. A datum may also be a fact, just as a fact may be evidence. But… the existence of a datum has been independent of any consideration of corresponding ontological truth. When a fact is proven false, it ceases to be a fact. False data is data nonetheless’.
Data needs context to be anything other than an axiomatic premise. Just like the word that represents it, data is a cultural object. It is conceived of, collected, categorised and used to create things by messy social animals, and is therefore subject to all the complexities of any other social object. And that’s where data ethnographies come in.
Ethnography is a methodology commonly used in anthropology. It is the systematic study of people and cultures, designed to explore cultural phenomena: the researcher analyses a society from the point of view of the subject of the study, through the collection of detailed observations and interviews. Much has been written about the value of combining so-called ‘thick’ (qualitative) and ‘thin’ (quantitative) data, particularly when understanding behavioural patterns with emerging technologies. A great example of this is Netflix’s work with anthropologist Grant McCracken, which led to the identification of binge-watching habits. These were evident in the data but no one was looking for them. Netflix needed situated, contextually dependent data to interpret the quantitative data they had.
Ethnographic techniques can do more than help us understand the development and use of emerging technology: by observing the data and data infrastructure used to develop a technology, we may find new ways of using it. Material culture is a specific ethnographic approach which “analyses objects that humans use to survive, define social relationships, represent facets of identity, or benefit peoples’ state of mind, social, or economic standing” (Victor Buchli, 2004). Treating a data set or data standard as a material object and ethnographically analysing its collection, categorisation and use can reveal the historical, social and political contexts which are written into it in a far more powerful way than the usual metadata (data that describes and gives information about other data, such as a timestamp on a photo).
An example of a data ethnography is Geoffrey Bowker and Susan Leigh Star’s analysis of the International Classification of Diseases in ‘Sorting Things Out’. The ICD is an international standard for medical diagnosis produced by the WHO. Bowker and Star characterise it as a “treaty”, a “bloodless set of numbers obscuring the behind-the-scenes battles informing its creation”. They highlight the political consequences of medical categorisation, such as the distinction between a legal and an illegal abortion. They demonstrate how the social, cultural and political negotiations involved in the creation of such a distinction are obscured once the categorisation is published. It becomes accepted, taken for granted as a set of “natural facts”, creating a standardised, naturalised body based on an informational structure, not the other way around.
However, Bowker and Star then go on to observe how doctors use the ICD in situ, showing how they find the spaces between the categories to bring context to their diagnoses. The designers of the ICD acknowledge this need directly, claiming they have attempted to “paint a fluid picture of the world of disease, one that is sensitive to changes in the world, to socio-technical conditions, and to the work practices of statisticians and record keepers”. The analysis shows the tension between the development of a standard and its actual use. I’m sure most doctors understand these tensions intimately, but analysing the standard ethnographically makes them explicit. By contrast, there is a high likelihood that engineers building a machine learning application which draws on the ICD will not have the implicit knowledge gained from using it in practice. They are more likely to treat it as a set of hard definitions which can be used as the basis for statistical modelling.
The gap between metadata and contextually situated data is where algorithmic bias creeps in. Current machine learning techniques rely on the common conception that data is the same as fact, at least when aggregated at scale to create statistical certainty, and that this results in objective, unbiased decision making. Increasingly, however, these techniques can be seen to take the existing social bias embedded in historical training data as “natural facts”, and amplify it.
An example of encoded bias can be seen with the Google Translate service. If you translate “he is a nurse, she is a doctor” from English to Hungarian and then back again, the gender of the nurse is reversed from male to female and the doctor is converted to male. Neither language is gendered in the way French or Spanish is, and Hungarian uses a single third-person pronoun, ‘ő’, for both ‘he’ and ‘she’, so the gender in the original sentence is lost on the way in and has to be guessed on the way back. Google Translate has not decided to be biased against female doctors in Hungary, but the existing social structures which equate the role of ‘nurse’ with female and ‘doctor’ with male are amplified and reinforced by production and then recursively through use. When creating and using a tool that draws on historical data and develops reinforcing learning loops through interactions with the actual, socially complex world, it is naive to expect that the scale of any data will automatically resolve the paradoxes and politics which are present. Remember poor old Tay, Microsoft’s chatbot, which was taught to be racist by trolls?
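The translation round trip described above can be imitated with a deliberately crude sketch. It does not describe how Google Translate works; it only assumes a system that, when the source gives no gender, picks whichever pronoun appeared most often next to an occupation in its historical text. The counts are invented for illustration, and the names `CORPUS_COUNTS`, `resolve_pronoun`, `round_trip` and `share` are hypothetical.

```python
from collections import Counter

# Invented, illustrative co-occurrence counts standing in for a historical
# corpus. These are NOT measurements of any real data set.
CORPUS_COUNTS = {
    "nurse": Counter({"she": 90, "he": 10}),
    "doctor": Counter({"she": 30, "he": 70}),
}


def share(occupation, pronoun, counts=CORPUS_COUNTS):
    """Fraction of corpus mentions of an occupation that use this pronoun."""
    total = sum(counts[occupation].values())
    return counts[occupation][pronoun] / total


def resolve_pronoun(occupation, counts=CORPUS_COUNTS):
    """Pick the pronoun seen most often with this occupation."""
    return counts[occupation].most_common(1)[0][0]


def round_trip(clauses, counts=CORPUS_COUNTS):
    """Simulate translating into a language with one third-person pronoun
    and back again.

    The original pronoun is discarded on the way in, so the way back can
    only consult the corpus statistics.
    """
    return [(resolve_pronoun(role, counts), role) for _pronoun, role in clauses]


original = [("he", "nurse"), ("she", "doctor")]
print(round_trip(original))
# The corpus is only 70% "he" for doctors, yet the output is "he" every time.
print(share("doctor", "he"))
```

The point of the sketch is the gap between the two printed values: a skew in the historical record becomes a hard rule in the output, which is one simple way a bias can be amplified and not merely reproduced.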
MIT anthropologist Stefan Helmreich concludes in his ethnography of biotechnology developers that “when form is decoupled from life, we are left with free-floating form”. We need to situate data in order to root it in the physical world and not perpetuate the illusion that there is a magic maths land where everything makes sense, however sad that may be. In his etymological history, Daniel Rosenberg pretty crushingly goes on to conclude that ‘[d]ata has no truth… It may be that the data we collect and transmit has no relation to truth or reality whatsoever beyond the reality that data helps us to construct’. Truth is not a simple thing. Building a tangible, practical understanding of data’s context by bringing ethnographic practice into technology development may help us make more conscious decisions about the worlds we choose to create, as opposed to having to live in whatever falls out of the echo chamber.