The problem of reliability in the design of ergonomics methods…and a potential solution

At the Centre for HF & STS, much of our work focuses on the design of new ergonomics methods to better understand safety and performance in sociotechnical systems.

An “ergonomics method” is essentially a procedure for coding and interpreting qualitative data through the lens of a particular theory. For example, Rasmussen’s Accimap is a procedure for analysing accidents using systems theory. The Event Analysis of Systemic Teamwork (EAST) is a procedure for analysing system, group or team processes from the perspective of distributed situation awareness. In both cases, the method describes what pieces of data are important, how to code or label them, and then how to combine this information into a visual representation of the system under analysis.

Consequently, methods are central to the discipline of ergonomics as they provide a theoretically driven approach to understanding the world. We develop a method when we want to apply a theory to a new domain, or when we think that current methods are inadequate for fully understanding a problem using a particular theory.

In other words, we are trying to better understand the world by developing new methods.

In order to be useful for understanding the world, methods must be both valid and reliable. Neville Stanton, an adjunct Professor in our Centre, has discussed this requirement at length for more than 20 years: essentially methods must produce an accurate representation of the situation under analysis (valid) and repeated analyses of the same data should produce the same results when completed by the same person and by different people (reliable).

It sounds so simple when you put it like that! But in practice, designing valid and reliable methods is very difficult. Even worse – designing robust studies to test whether a method is valid and reliable is also very hard. In the remainder of this article, I’m going to discuss a few of the tricky issues in designing reliability studies, and leave the (even trickier) problem of validity for another time.

To start with, designing a reliability study seems quite straightforward. To evaluate whether a method produces the same results on repeated occasions you can use a “test-retest” design, where you ask participants to analyse the same data twice. You then compare the results from the same participant across the two occasions, and from different participants analysing the same data.
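
To make this comparison concrete, here is a minimal sketch in Python. The participants (P1, P2), the ten data items and the codes (A, B, C) are entirely made up for illustration; the sketch simply computes percentage agreement for the two comparisons a test-retest design yields, i.e. the same participant across the two occasions, and different participants on the same occasion.

```python
# Minimal sketch of a test-retest comparison (hypothetical data and labels).
# Each participant codes the same ten data items on two occasions; we compare
# the same participant across occasions and different participants on the
# same occasion, using simple percentage agreement.

from itertools import combinations

codings = {
    # participant: {occasion: [code assigned to each of 10 data items]}
    "P1": {"t1": ["A", "B", "A", "C", "B", "A", "A", "C", "B", "A"],
           "t2": ["A", "B", "A", "C", "A", "A", "A", "C", "B", "A"]},
    "P2": {"t1": ["A", "B", "C", "C", "B", "A", "B", "C", "B", "A"],
           "t2": ["A", "B", "C", "C", "B", "A", "A", "C", "B", "A"]},
}

def percent_agreement(codes_x, codes_y):
    """Proportion of items given the same code by two analyses."""
    matches = sum(x == y for x, y in zip(codes_x, codes_y))
    return matches / len(codes_x)

# Same participant, occasion 1 vs occasion 2
for participant, occasions in codings.items():
    score = percent_agreement(occasions["t1"], occasions["t2"])
    print(f"{participant} agreement across occasions: {score:.0%}")

# Different participants, first occasion
for p1, p2 in combinations(codings, 2):
    score = percent_agreement(codings[p1]["t1"], codings[p2]["t1"])
    print(f"{p1} vs {p2} agreement: {score:.0%}")
```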

But within this design, there are many factors that have an impact on whether your method will “pass” or “fail”.

Who are the participants? In our research the intended end users are often “practitioners” in a particular domain. However, even if those practitioners have all worked in the same area for 10+ years, they are likely to have diverse backgrounds and potentially very different understandings of the system under analysis. This will likely influence how they code and analyse data. Even if the intended end users are other ergonomics researchers, they too are likely to be diverse in terms of experience and background. So this leads to the requirement of standardisation through training…

What training in the method should be provided? Any study design where participants are allowed to discuss the analyses, ask the method creators questions, or receive feedback on coding will contaminate the results, as it allows for calibration between participants. For this reason, written manuals are often used. However, applying ergonomics methods is often not easy and the theory can be difficult to understand. It is questionable whether people can ever get a good understanding of a method without asking questions and getting feedback. So potentially participants need to receive face-to-face training to a certain level of expertise before any evaluation is conducted. While this sounds reasonable, access to participants is usually limited, and asking participants to contribute even one hour of time is considered a big ask…let alone the days or weeks required to reach an “expert” level.

When should the analysis be repeated? It is likely that reliability will change based on analyst familiarity and experience levels. Despite this, few studies have examined reliability and validity levels over a significant period, or considered how intervening experiences in analysing data may influence the results.

How should the analysis be undertaken? Do participants start with the raw data (e.g. transcripts, observations, documents) and conduct the analysis from scratch? A single analysis usually involves many different sources of data, and the analyst needs to make many decisions about what is important. To overcome this problem, reliability studies typically only examine a few steps in the analysis process. But what are the “important bits” of the process?

Perhaps most importantly – when is a method reliable? Surprisingly, there is no universally accepted “level of agreement” that must be reached. Studies typically treat around 75% agreement as indicating “satisfactory” reliability. But why is this the case? What does it mean to say our methods are 75% reliable? Another concern is that percentage agreement is highly influenced by sample size: it is much easier to reach a high level of agreement in a study involving four participants than in one involving many more.
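
As a rough illustration of why a fixed percentage threshold is hard to interpret on its own, the sketch below uses hypothetical codings from two coders (not data from any of our studies) to contrast raw percentage agreement with Cohen’s kappa, a statistic that corrects for the agreement two coders would reach by chance alone.

```python
# Minimal sketch contrasting raw percentage agreement with a chance-corrected
# statistic (Cohen's kappa) for two hypothetical coders. The codes and the
# numbers are illustrative only.

from collections import Counter

coder_1 = ["A", "A", "B", "A", "A", "B", "A", "A", "A", "A"]
coder_2 = ["A", "A", "A", "A", "A", "B", "A", "B", "A", "A"]

n = len(coder_1)
observed = sum(a == b for a, b in zip(coder_1, coder_2)) / n

# Agreement expected by chance, from each coder's marginal code frequencies
counts_1, counts_2 = Counter(coder_1), Counter(coder_2)
expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / (n * n)

kappa = (observed - expected) / (1 - expected)

print(f"Percentage agreement: {observed:.0%}")   # 80% – looks 'satisfactory'
print(f"Cohen's kappa:        {kappa:.2f}")      # much lower once chance is removed
```

In this toy example, 80% raw agreement corresponds to a kappa of roughly 0.38 once chance agreement is stripped out, which underlines how little a bare percentage figure tells us about a method’s reliability.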

All of these issues highlight the need for the replication of reliability studies under different conditions. One study showing that a method is reliable with a specific, small group of participants completing a specific task is hardly sufficient…

So what should be done? Surely I’m not going to conclude this blog by just arguing for more reliability studies.

One way forward would be to require authors to include an assessment of the reliability of analyses and training requirements in all manuscripts reporting applications of ergonomics methods. Currently, journal articles rarely include these details, even though published analyses are usually undertaken by two or more people and reliability is usually calibrated in the initial phases of the analysis. Over time, this would allow for a more accurate assessment of the reliability of ergonomics methods, especially those that are most commonly used.

Our new paper discusses these issues in more detail, and provides a case study of the hurdles faced in developing a reliable and valid method for safety practitioners:

Goode, N., Salmon, P.M., Taylor, N.Z., Lenne, M., & Finch, C.F. (2017). Developing a contributing factor classification scheme for Rasmussen’s AcciMap: reliability and validity evaluation. Applied Ergonomics, 64, 14-26.

Dr Natassia Goode is a Senior Research Fellow and is leader of the Organisational Safety theme.

