Data Ethnography – An introduction


I’m cooking up some ideas while I’m away on sabbatical at the Hans Bredow Institute. My core focus at this stage is the ‘how to’ of researching automation and algorithms. My current approach integrates reverse engineering through ethnography and design thinking. At this stage, I’m calling it Data Ethnography, and what follows sets out a guideline for what I think that should be.

No doubt this is the skeleton for a journal article, but what is here is the early development of a new method I am currently working on.

If you think this methodology could be useful, or you have any feedback or suggestions, please leave them below in the comments.

Why Data Ethnography?

Humanities and social science digital research methods have been interrupted by prominent privacy and surveillance concerns about platform interoperability, which produces large quantities of personalised data. The Facebook Cambridge Analytica scandal, especially the revelation of its ability to construct predictive models of its users’ behaviours, brought to public attention concerns over how platform user data is harvested, shared and manipulated by third party providers. The global pushback against platform providers’ use of these data resulted in platforms closing down some access to application programming interfaces (APIs) to inhibit data manipulation. However, these restrictions also affect how public benefit research is conducted, providing a useful prompt to rethink how humanities scholars, social scientists and human-computer interaction scholars research the digital.

While the datafication of our digital lives has provided us with new insights, the digital methods that enable us to research our digital selves have always been mixed in order to understand the field of enquiry, along with its surrounding political, cultural and economic constructs. Increased digital practices built on sophisticated calculations, for example the use of algorithmic recommendations, connected devices, the internet of things, and the like, have changed our research environments, prompting the question: how do we research what we can’t see?

Drawing on an investigation of the visual cultures that surround YouTube, this article provides evidence that a new methodology, data ethnography, is required to research the apparent ‘black boxes’ that operate alongside our digital selves. Data ethnography is the combination of stakeholder consultation, top level data analysis, persona construction, fine data analysis and finally topic or genre analysis. Data ethnography not only enables us to research what we cannot see, but also provides a useful way to understand government interoperability mandates and inform appropriate policy development.

Overview of Data Ethnography

The Five-Stage Process of Data Ethnography


This methodology emerged from asking the question, what does the Australian YouTube visual culture look like? Building on the long-term participant observation that is synonymous with ethnography, a researcher is able to understand the norms, cultural affordances, communication practices, and so on. The researcher is required to both produce and consume videos on the platform to understand how users will create content to suit the platform constraints. Simultaneously, viewing the content provides insights into how viewing publics are constructed, how they communicate, what is considered important, norms and languages. In the context of YouTube, this included the platform, but also the intermediaries such as digital agencies, multichannel networks and other digital intermediaries such as influencers to highlight publication strategies. The combination of this ethnographic data provides a compelling starting point for the additional methods that emerge.

The video content was analysed using discourse analysis reflective of Jakobson (1960) to understand the video language function as referential, poetic, emotive, conative, phatic, and/or metalingual. As such, the discourse was categorised in one of four ways: contact enunciation, which is looking into the camera and addressing the audience; emotive enunciation, which is the expressive or affective dimension relating to the style of the YouTuber; genre, including thematic content, style and compositional structure; and enunciative contract, which is the reading contract (Véron, 1985) between the enunciator (addressor) and enunciatee (addressee). The discourse analysis enabled the vast number of YouTubers to be categorised into a smaller, more manageable group of users.

Building on the discourse analysis, I asked the users of the platform the following questions:

  1. What is your gender?
  2. What is your age?
  3. How often do you use YouTube in a week?
  4. What is your favourite category of YouTube video?
  5. Are you likely to watch the next video that YouTube suggests for you?
  6. Do you ever watch the trending videos?
  7. When you enter this address into your web browser, what is the URL of the “up next” video that it suggests for you:

The responses to these seven questions then guided the snowballing process of the additional methods that follow.

Top Level Data Analysis

Before undertaking comprehensive data scraping processes that rely on platform data availability, it is useful to observe which incidental metrics are available. In the case of YouTube, these include likes, comments, views and similar measures, which provide insights into what people are watching, how they engage with the content, and how they talk about the content. These top level metric observations enable the researcher to direct the research or focus on areas of interest that are not otherwise obvious through the consultation phase of data ethnography. The top level metrics further support an understanding of user practices: how content is produced, published, shared, and consumed amongst a wide variety of users. Finally, the top level data analysis enables the researcher to ask questions such as what data are available, which processes might be automated, and how these data might be repurposed for other sorts of measurements.

For YouTube, the top level data analysis translated to the following areas of interest:

  3. Dislikes
  4. Published On
  5. Comment Numbers
  6. Reaction to those comments
  7. Comments on comments

On the YouTube platform, these are the metrics that are available without a specialised data scraping process: researchers with no programming skills are able to extract these data.
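As a minimal sketch of how these hand-collected observations might be recorded, the snippet below stores the top level metrics for one video and derives two simple ratios. The class name, field names, ratios and example figures are all hypothetical illustrations, not part of the method described above and not platform-defined metrics.

```python
from dataclasses import dataclass


@dataclass
class VideoMetrics:
    """Top level metrics noted by hand from a public video page."""
    title: str
    views: int
    likes: int
    dislikes: int
    published_on: str  # date as a string, e.g. "2019-03-01"
    comment_count: int


def engagement_summary(video: VideoMetrics) -> dict:
    """Derive two illustrative ratios (assumed measures, not YouTube's own)."""
    reactions = video.likes + video.dislikes
    return {
        # Share of like/dislike reactions that are likes.
        "like_share": video.likes / reactions if reactions else 0.0,
        # Number of comments for every 1000 views.
        "comments_per_1000_views": (
            1000 * video.comment_count / video.views if video.views else 0.0
        ),
    }


# Made-up figures, purely for illustration.
example = VideoMetrics(
    title="Example video",
    views=20000,
    likes=900,
    dislikes=100,
    published_on="2019-03-01",
    comment_count=50,
)
summary = engagement_summary(example)
print(summary)
```

Keeping the observations in a structure like this makes it easier to ask the questions raised above: which data are available, and which steps might later be automated.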

Persona Construction

Persona construction is a research approach that is based in human-computer interaction (HCI), user-centred design (UCD) and user-experience (UX). Emerging from the Design Thinking field, which takes a human-centred approach to solving problems, persona construction is useful for understanding how problems in human and machine interaction can be addressed. “Design Thinking is an iterative process in which knowledge is constantly being questioned and acquired so it can help us redefine a problem in an attempt to identify alternative strategies and solutions that might not be instantly apparent with our initial level of understanding” (Interaction Design, n.p.). It can have between three and seven stages; these stages are not sequential or hierarchical but iterative, and the process typically does not abide by the dominant or common approaches to problem solving.

A common formulation of Design Thinking has five phases:

  1. Empathise – with your users
  2. Define – your user’s needs, their problem, and your insights
  3. Ideate – by challenging assumptions and creating ideas for innovative solutions
  4. Prototype – to start creating solutions
  5. Test – solutions

Persona Construction sits in the second phase of the Design Thinking process, which enables the researcher to define user needs and problems alongside one’s insights. There are four types of personas: Goal-directed, Role-based, Engaging, and Fictional personas. The data ethnography methodology uses Fictional Personas: “The personas in the fiction-based perspective are often used to explore design and generate discussion and insights in the field” (Nielsen, 2013, p.16). In this environment, a persona “is represented through a fictional individual, who in turn represents a group of real consumers with similar characteristics” (Miaskiewicz & Kozar, 2011, p. 419). Similarly to ethnography, a persona is also described in narrative form. This narrative has two goals: (1) to make the persona seem like a real person, and (2) to provide a vivid story concerning the needs of the persona in the context of the product being designed.

In the context of YouTube research, the key criteria for the fictional personas were:

  1. Name
  2. Age, gender
  3. Marital status
  4. Occupation
  5. Hobbies
  6. Technology familiarity
  7. Devices used

To ensure the accuracy of the process, the research was conducted from behind the university firewall, which has a large range of IP addresses. The research was conducted using Firefox under a new username for each persona; the researcher was not in a signed-in state for Google or YouTube; a new Google account was created for each persona; and the location of the user was set by providing a phone area code matching their country. Their interests (Hobbies) became the search terms, and the algorithmically generated results were recorded in a pre-trained and a post-trained state.
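The persona criteria and the pre-trained and post-trained recordings can be kept together in one record. The sketch below is one hypothetical way to do that. The class, its method names and the placeholder values are assumptions for illustration. The hobbies and the pre-trained channels are borrowed from the 14-year-old example later in this post, and the post-trained channel is an invented placeholder, not a research result.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    """A fictional persona plus the recommendations recorded for it."""
    name: str
    age: int
    gender: str
    marital_status: str
    occupation: str
    hobbies: list
    technology_familiarity: str
    devices_used: list
    # Recommendations noted by hand, keyed by algorithm state.
    recommendations: dict = field(
        default_factory=lambda: {"pre_trained": [], "post_trained": []}
    )

    def search_terms(self):
        """Hobbies double as the search terms entered on the platform."""
        return list(self.hobbies)

    def record(self, state, channels):
        """Store the channels recommended in the given algorithm state."""
        if state not in self.recommendations:
            raise ValueError(f"unknown state: {state}")
        self.recommendations[state].extend(channels)

    def new_after_training(self):
        """Channels that only appear once the algorithm has been trained."""
        before = set(self.recommendations["pre_trained"])
        return [c for c in self.recommendations["post_trained"] if c not in before]


persona = Persona(
    name="Persona A",  # placeholder
    age=14,
    gender="male",
    marital_status="single",  # placeholder
    occupation="student",  # placeholder
    hobbies=["bike riding", "Lego", "Playstation", "Phil and Dan"],
    technology_familiarity="high",  # placeholder
    devices_used=["laptop"],  # placeholder
)
persona.record("pre_trained", ["Family Guy", "Talk Shows", "Trailers"])
persona.record("post_trained", ["Trailers", "Placeholder channel"])  # invented
print(persona.search_terms())
print(persona.new_after_training())
```

Comparing the two states in this way shows which recommendations appear only after the persona’s searches have trained the algorithm.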

Fine Grained Data Scrape

By engaging the persona construction method, which reveals insights into how an algorithm will treat its users, or within the context of this research the sorts of results it will recommend, it is then possible to engage in a fine-grained data scrape. A fine grained data scrape is defined as ….[ref]. In this research, it became possible to understand which were the top related videos, which channels were the most viewed, and the sorts of networks that emerge around those videos. This process is most useful for identifying not only specific nodes or videos, but also clusters which can be translated into thematic areas, issue publics (Burgess and Matamoros-Fernández, 2016), and audience clusters. I have previously written about the specific social network analysis (SNA) method so I will not go into that detail here, but in order to find these thematic clusters there is a process of data extraction, cleaning and processing which can be followed. SNA is defined as a computational practice “which draws on computational technologies to help identify, aggregate, visualise and interpret people’s social networking and social media data traces” (p.1). In the first instance, I engaged the YouTube Network Analysis Tool (Ref) to extract the network data of videos related to those which returned as popular in the persona construction method, that is, in a post-trained algorithm state. This digital method tool extracts the data as a Gephi file which can then be manipulated to provide a social network analysis (SNA) across the dataset.
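To make the idea of top related videos, most prominent channels and clusters concrete, here is a small sketch over a made-up edge list. It does not read the tool’s export or reproduce Gephi’s analysis. The edge format, video and channel names are hypothetical, and connected components are used only as a crude stand-in for the community detection one would normally run in Gephi.

```python
from collections import Counter, defaultdict

# Hypothetical edges: (source video, related video, channel of related video).
edges = [
    ("video_a", "video_b", "channel_1"),
    ("video_a", "video_c", "channel_2"),
    ("video_b", "video_c", "channel_2"),
    ("video_d", "video_c", "channel_2"),
    ("video_d", "video_b", "channel_1"),
    ("video_e", "video_f", "channel_3"),
]


def in_degree(edges):
    """Count how often each video is listed as related to another video."""
    return Counter(target for _, target, _ in edges)


def channel_weight(edges):
    """Count how often each channel's videos appear as related videos."""
    return Counter(channel for _, _, channel in edges)


def clusters(edges):
    """Group videos into connected components, ignoring edge direction."""
    neighbours = defaultdict(set)
    for source, target, _ in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        group, queue = set(), [start]
        while queue:
            node = queue.pop()
            if node in group:
                continue
            group.add(node)
            queue.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


top_related = in_degree(edges).most_common(2)
top_channels = channel_weight(edges).most_common(1)
video_clusters = clusters(edges)
print(top_related)
print(top_channels)
print(video_clusters)
```

In this toy network, video_c is the most frequently related video and the videos fall into two separate groups. On real data the groups would then be read qualitatively as thematic areas or audience clusters.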

Topic Modelling

The final method to understand how users congregate around popular content on YouTube, and how they communicate about the material, was to engage in topic modelling.

Topic modelling attempts to understand how users talk about certain things in particular ways. Specifically, I was trying to understand how certain topics emerged in relationship to other topics, which can be understood through the Latent Dirichlet Allocation (LDA) topic modelling approach. Smith and Graham note, “Informally, LDA represents a set of text documents in terms of a mixture of topics that generate words with particular probabilities” through a predetermined number of topics. This provides the researchers with a “heuristic approach that aim[s] to maximise the interpretability and usefulness of the topics”.
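For readers who want to see the mechanics, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is not the implementation used in this research or by Smith and Graham; in practice an established library would be the sensible choice. The function name, parameter defaults and the tiny tokenised “comments” are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(topics)
    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    # Smoothed topic mixture for each document.
    doc_mix = [
        [(n + alpha) / (len(doc) + alpha * num_topics) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # Most frequent words currently assigned to each topic.
    top_words = [
        [w for w, c in counts.most_common(3) if c > 0] for counts in topic_word
    ]
    return doc_mix, top_words


# Hypothetical tokenised comments, invented for illustration.
comments = [
    ["lego", "build", "set"],
    ["lego", "set", "bricks"],
    ["bike", "ride", "trail"],
    ["bike", "trail", "helmet"],
]
doc_mix, top_words = lda_gibbs(comments, num_topics=2)
print(doc_mix)
print(top_words)
```

Each row of `doc_mix` is one comment’s mixture over the predetermined number of topics, and `top_words` lists the words most associated with each topic. Interpreting what those topics mean remains a qualitative step for the researcher.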

For example, if we wanted to find out which popular topics are discussed by a 14-year-old Australian boy, we would construct the persona with interests, which in turn become the search terms: bike riding, Lego, Playstation, and Phil and Dan. The top YouTube Channel recommendations for this user before the algorithm training were:

  1. Family Guy
  2. Talk Shows
  3. Trailers
  4. Gordon Ramsay
  5. Joe Rogan