This is an HTML version of Stafford et al. (2020):
Quantifying the benefits of using decision models with response time and accuracy data
Pre-print, Shiny app here: sheffield-university.shinyapps.io/decision_power/

# 1 Abbreviations

DDM - Drift diffusion model

# 2 Introduction

Speed and accuracy of responding are fundamental measures of performance, collected by behavioural scientists across diverse domains in an attempt to track participants’ underlying capacities. As well as being affected by the capacity of participants to respond quickly and accurately, the two measures are also related by participants’ strategic choices of a speed-accuracy trade-off (SATO; for reviews see Wickelgren 1977; Heitz 2014).

The SATO confounds measurement of participant capacity - it means that we cannot directly read either speed or accuracy as an index of participant ability. The SATO is inherent to decision making — it arises whenever we wish to respond as fast and as accurately as possible based on uncertain incoming information. More accurate responses require more information, which takes longer to accumulate; faster responses forgo collecting additional information at the cost of higher error rates. Importantly, because the SATO is unavoidable it is also necessary that all decision-making processes are positioned with respect to the trade-off. This does not need to be done deliberately or explicitly, but any decision process can be characterised as adopting some trade-off between speed and accuracy. For the tasks studied by psychologists, it is important to recognise that there will be individual differences, as well as task and group-related differences, in how participants position themselves on the SATO.

Outside of research focused on SATOs explicitly, different practices have been adopted to account for SATOs or potential SATOs in behavioural data. One approach is to ignore either speed or accuracy. For example, ignoring speed of response is common in psychophysics, whereas some domains of cognitive psychology, where high accuracy is assumed, focus only on response times (e.g. Stafford, Ingram, and Gurney 2011)1, albeit sometimes after a cursory check that standard null-hypothesis tests do not reveal significant differences in error rates. Another approach is to combine speed and accuracy. For example, in the domain of visual search it is common to calculate ‘efficiency’ scores by dividing search time by search accuracy as a proportion (e.g. Yates and Stafford 2018). Despite being widespread, there is evidence that this practice is unlikely to add clarity to analysis (Bruyer and Brysbaert 2011). We also note that the researchers who initially formulated the efficiency score explicitly counselled against using it in the case of SATOs (Townsend and Ashby 1983).

The efficiency score shares the property with other recent suggestions for accounting for SATOs (Davidson and Martin 2013; Seli et al. 2013) that it assumes a linear relation between response time and accuracy. While such approaches may be better than focusing on a single behavioural variable, the assumption of linearity is at odds with work which has explicitly characterised the SATO (Fitts 1966; Wickelgren 1977; Heitz 2014) and has shown a distinctly curvilinear relation between response time and accuracy. As such, although linear correction methods may work for some portions of the SATO curve, they are likely to be misleading, or at least fail to add clarity, where accuracy and/or speed approach the upper or lower limits of those variables. Recently Liesefeld and Janczyk (2019) showed that several current methods for combining speed and accuracy to correct for SATOs are in fact sensitive to the very SATOs they are designed to account for. These authors advocate the balanced integration score (BIS; Liesefeld, Fu, and Zimmer 2015) as an alternative, but it seems likely that the combination of speed and accuracy remains an estimation problem of some delicacy, especially in the presence of SATOs.

## 2.2 Context

The unprincipled combination of speed and accuracy measures becomes an urgent issue when considered in the context of widespread questions surrounding the reliability of the literature in psychology. Established results fail to replicate, or replicate with substantially reduced effect sizes (Open Science Collaboration 2015; Pashler and Wagenmakers 2012).

Low statistical power has been a persistent problem across many areas of psychology and cognitive neuroscience (Button et al. 2013; Szucs and Ioannidis 2017; Stanley, Carter, and Doucouliagos 2017; Maxwell 2004; Sedlmeier and Gigerenzer 1989; Lovakov and Agadullina 2017), including, but not limited to, research areas which are bound by costly methods or hard-to-reach populations (Geuter et al. 2018; Bezeau and Graves 2001; J. Cohen 1962). This, combined with factors such as analytic flexibility (Simmons, Nelson, and Simonsohn 2011; Silberzahn et al. 2017) — which can only be increased by the lack of a single standard method for accounting for SATOs — has led to a widespread loss of faith in many published results (Ioannidis 2005).

Statistical power is defined with respect to the variability and availability of data, as well as the analysis proposed. For a set experimental design, an obvious candidate for increasing statistical power is to increase sample size, but this is not always easy. Each additional participant costs additional time, money and resources. This is especially true in the case of expensive methods, such as fMRI, or special populations which may be hard to recruit. More sensitive measures also increase statistical power: lower measurement error will tend to reduce variability so that the same mean differences produce larger observed effect sizes.

A motivation for the present work is to demonstrate the practical utility, in terms of increased statistical power, of combining speed and accuracy information in a principled manner using decision models. Such an innovation has the appeal of making the most of data which is normally collected, even if not analysed, whilst not requiring more participants (which is costly), or more trials per participant (which also has costs in terms of participant fatigue which may be especially high for some populations, e.g. children).

## 2.3 Decision modelling

Models of the decision making process provide the foundation for the principled combination of speed and accuracy data, and thus afford experimenters access to considerable statistical power gains.

Many models exist in which decision making is represented by the accumulation of sensory evidence over time. When the accumulated evidence surpasses some threshold (also called a boundary) then a decision is triggered. The accuracy of the decision depends on which accumulator crosses which boundary, the speed is given by the time this takes, and thus such models can be used to fit speed and accuracy data within the same framework.

A prominent instance of such accumulator models is the so-called drift-diffusion model developed by Roger Ratcliff (DDM, Ratcliff 1978; Ratcliff and Rouder 1998). In these models the rate at which evidence is accumulated is represented by the drift rate parameter, which can be thought of as co-determined by the sensitivity of the perceiver and the strength of the stimulus. After a long and successful period of development and application on purely behavioural data, the DDM was at the centre of an important theoretical confluence. Neurophysiologists found evidence for accumulation-like processes in neurons critical to sensory decision making (P. L. Smith and Ratcliff 2004; Gold and Shadlen 2001), whilst theoreticians recognised that accumulator models could be related to statistical methods of uncertain information integration. Under certain parameterisations many different decision models, all in the family of accumulator models, can be shown to be equivalent to the DDM, and thus in turn equivalent to a statistical method which is optimal for making the fastest decision with a given error rate, or the most accurate decision within a fixed time (Bogacz et al. 2006; Gold and Shadlen 2002).

While debate continues around the exact specification of the decision model which best reflects human decision making, there is a consensus that the DDM captures many essential features of decision processing (but see Pirrone et al. 2018; Pirrone, Stafford, and Marshall 2014; Teodorescu, Moran, and Usher 2016). As you would expect, the DDM has also shown considerable success modelling decision data across many different domains (Ratcliff, Smith, and McKoon 2015; Ratcliff et al. 2016), and in particular at separating out response thresholds from stimulus perception (Ratcliff and McKoon 2008), and in estimating these reliably (Lerche and Voss 2017). In the sense that the DDM implements a statistically optimal algorithm for the accumulation of uncertain information, we would expect our neural machinery to implement the same algorithm in the absence of other constraints (Pirrone, Stafford, and Marshall 2014). The basic mechanism of the DDM is that of a single accumulator, similar to that shown in Figure 2.1, with the following key parameters: $$v$$, the drift rate, which reflects the rate of evidence accumulation; $$a$$, the boundary separation, which defines the threshold which must be crossed to trigger a decision and so reflects response conservativeness; $$z$$, the starting point of accumulation (either equidistant between the two decision thresholds, or closer to one than the other), which biases the response based on pre-stimulus expectations; and $$T_{er}$$, non-decision time, a fixed delay which does not vary with stimulus information. Additional parameters define noise factors, such as the trial-to-trial variability in drift rate.
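
The accumulator mechanism described above can be illustrated with a minimal simulation sketch. It is not the toolbox code used for the simulations reported in this paper: the function names, the Euler-Maruyama time step `dt`, the within-trial noise scale `s = 1`, and the convention that the upper boundary is the correct response are all assumptions made for illustration.

```python
import math
import random


def simulate_ddm_trial(v, a, z=0.5, t_er=0.3, s=1.0, dt=0.001, rng=random):
    """Simulate one two-boundary diffusion trial by Euler-Maruyama stepping.

    v: drift rate; a: boundary separation; z: relative starting point
    (0.5 = unbiased); t_er: non-decision time in seconds; s: within-trial
    noise scale (assumed to be 1 here); dt: step size in seconds.
    Returns (response_time, correct), treating the upper boundary as the
    correct response (an illustrative convention).
    """
    x = z * a                      # evidence starts between boundaries 0 and a
    t = 0.0
    noise_sd = s * math.sqrt(dt)   # standard deviation of the noise per step
    while 0.0 < x < a:
        x += v * dt + noise_sd * rng.gauss(0.0, 1.0)
        t += dt
    return t_er + t, x >= a


def simulate_participant(v, a, n_trials=40, seed=None):
    """Return (accuracy, mean response time over all trials) for one participant."""
    rng = random.Random(seed)
    trials = [simulate_ddm_trial(v, a, rng=rng) for _ in range(n_trials)]
    accuracy = sum(correct for _, correct in trials) / n_trials
    mean_rt = sum(rt for rt, _ in trials) / n_trials
    return accuracy, mean_rt
```

For example, `simulate_participant(2.0, 2.0, seed=1)` returns the accuracy and mean response time of one simulated participant over 40 trials. Raising `a` while holding `v` fixed should make responses slower and more accurate, which is the speed-accuracy trade-off expressed in model terms.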

For our purposes, the value of these decision models is that they provide a principled reconciliation of speed and accuracy data. Within this framework these observed behavioural measures reflect the hidden parameters of the decision model, most important of which are the drift rate (reflecting the rate of evidence accumulation) and the decision boundary separation (reflecting the conservativeness of the participant’s decision criterion; higher boundaries produce slower but more accurate responses).

By fitting the DDM to our data we can deconfound the observed behavioural variables (speed and accuracy) and recover the putative generating parameters of the decision (drift and boundary separation). In principle, this allows a more sensitive measure of participant capability (reflected in the drift parameter). Drift is a more sensitive measure because a) it is estimated using both speed and accuracy, b) this estimation takes account of both mean response time and the distribution of response times for correct and error responses, and c) the estimation of the drift parameter is isolated from the effect of different participants’ SATOs (which are reflected in the boundary parameter).

## 2.4 Prior work

Previous authors have established the principled benefits of this approach (Ratcliff and McKoon 2008). Within a psychophysics framework, Stone (2014) extended the decision model of Palmer, Huk, and Shadlen (2005) to show that response time and accuracy contain different, but possibly overlapping, components of Shannon information about the perceived stimulus. If these components do not overlap (as suggested by Stone, in preparation) then combining response time and accuracy data should provide better estimates of key parameters which govern the decision process than relying on either response time or accuracy alone. However, our purpose here is not to make a theoretical innovation in decision modelling, but to use established decision models to demonstrate and quantify the benefits of decision modelling for experimentalists.

Previous authors have shown for specific paradigms and decisions that using decision models confers benefits beyond relying on speed, accuracy or some sub-optimal combination of the two, especially in the case of speed-accuracy trade-offs (Zhang and Rowe 2014; Park and Starns 2015). These results use data collected from participants in single experiments. Park and Starns (2015) show that for their data using decision models to estimate a drift parameter allows participant ability to be gauged separately from speed-accuracy trade-offs, and that these estimates consequently have higher predictive value. Zhang and Rowe (2014) used decision modelling to show that, for their data, it was possible to dissociate behavioural changes due to learning from those due to speed-accuracy trade-offs (revealing the distinct mechanisms of these two processes). In contrast to these studies, our approach is to use simulated data of multiple experiments so as to interrogate the value of decision models across a wide range of possibilities.

Ravenzwaaij, Donkin, and Vandekerckhove (2017, henceforth vRDV) have considerable sympathy with the approach we adopt here. They show that the EZ model, across variations in participant number, trial number and effect size, has higher sensitivity to group differences than the full diffusion model, which they ascribe to its relative simplicity (a striking illustration of the bias/variance trade-off in model fitting; Yarkoni and Westfall 2017).

## 2.5 Contribution of the current work

Our work extends prior work in a number of ways. Our fundamental comparison is in the sensitivity of model parameters compared to behaviourally observed measures (RT, accuracy). Our purpose is not to compare different ‘measurement models’ (Ravenzwaaij, Donkin, and Vandekerckhove 2017), but to illustrate the benefits for experimentalists of using any decision model over analysing a single behavioural measure (reaction time or accuracy in isolation). We use the EZ model, for reasons of computational efficiency, and because prior work has shown that in most circumstances it preserves the benefits of fuller decision modelling approaches. We also confirm that the basic pattern of results holds for other model fitting methods, the HDDM (Wiecki, Sofer, and Frank 2013) and fast-dm (Voss and Voss 2007). We simulate null group effects and so can show false alarm rates as well as calculate results in terms of d’. Our use of d’ allows quantitative comparison and estimation of the size of the benefit across different speed-accuracy conditions. We explore the combined effects of group shifts in both drift and boundary, and so can show implications of speed-accuracy trade-offs between groups, alongside drift differences. As with all modelling work, the results we present have always been latent in existing models. Our focus is not on theoretical innovation, but on drawing out the implications of established models in a way that reveals the extent of their value and so promotes their uptake. For a discussion of the contribution of elaborating the consequences of existing models see Stafford (2009, 2010).

Our results are translated into the power-sample size space, which is familiar to experimental psychologists. Our results are accompanied by an interactive data explorer to aid in the translation of the value of decision models into a form most easily comprehensible by experimentalists. For these reasons we hope that the current work can make a contribution in allowing experimentalists with less model fitting experience to readily apprehend the large benefits of model fitting for decision making data.

# 3 Method

The broad approach is to consider a simple standard experimental design: a between groups comparison, where each group contains a number of participants who complete a number of decision trials, providing both response time and accuracy data. We simulate data for true and null differences in drift rate between the groups, as well as true and null differences in boundary between the groups. By varying the number of simulated participants we generate a fixed number of ‘scenarios’ defined by true/null effects in ability (drift) between groups, true/null SATOs (boundary) between groups and experiment sample size. We keep the number of decision trials per participant constant for all these analyses. For each scenario we simulate many virtual experiments and inspect the behavioural measures to see how sensitive and specific they are to true group differences. We also fit the DDM and estimate the participant drift parameters, similarly asking how sensitive and specific estimates of drift are to true group differences. An overview of the method is illustrated in Figure 3.1.

## 3.1 Decision modelling

To generate simulated response data, we use the Hierarchical Drift Diffusion Model (HDDM, Wiecki, Sofer, and Frank 2013). The toolbox can also perform model fitting, which uses Bayesian estimation methods to simultaneously fit individual decision parameters and the group distributions from which they are drawn.

While the HDDM offers a principled and comprehensive model fitting approach, it is computationally expensive. An alternative model fitting method, the EZ-DDM (E.-J. Wagenmakers, Van Der Maas, and Grasman 2007), offers a simple approximation, fitting a decision model with a smaller number of parameters, assuming no bias towards either of the two options and no inter-trial variability. This allows an analytic solution which is computationally cheap. Furthermore, the EZ-DDM has been shown to match the full DDM for a range of situations (Ravenzwaaij, Donkin, and Vandekerckhove 2017).
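
The analytic solution can be sketched as follows, using the closed-form equations published with the EZ-diffusion model, which take the proportion correct and the mean and variance of correct response times as inputs. The function and argument names are illustrative, and this is not the code used for the results reported here. The scaling parameter `s` is a convention: 0.1 is the value used in the original EZ publication, and passing `s=1.0` is assumed to give estimates on the scale used by toolboxes that fix within-trial noise at 1.

```python
import math


def ez_ddm(prop_correct, rt_var, rt_mean, s=0.1):
    """Closed-form EZ-diffusion estimates from three summary statistics.

    prop_correct: proportion of correct responses (must not be 0, 0.5 or 1)
    rt_var, rt_mean: variance and mean of correct response times, in seconds
    s: scaling parameter (a convention, see the text above)
    Returns (drift rate v, boundary separation a, non-decision time t_er).
    """
    if prop_correct in (0.0, 0.5, 1.0):
        # The logit and the division by v below are undefined at these
        # values, so an edge correction must be applied beforehand.
        raise ValueError("accuracy of 0, 0.5 or 1 needs an edge correction")
    logit = math.log(prop_correct / (1.0 - prop_correct))
    x = logit * (logit * prop_correct**2 - logit * prop_correct
                 + prop_correct - 0.5) / rt_var
    v = math.copysign(1.0, prop_correct - 0.5) * s * x**0.25
    a = s**2 * logit / v
    y = -v * a / s**2
    mean_decision_time = (a / (2.0 * v)) * (1.0 - math.exp(y)) / (1.0 + math.exp(y))
    return v, a, rt_mean - mean_decision_time
```

Because each estimate is a direct function of three summary statistics, fitting one participant costs a handful of arithmetic operations, which is what makes many thousands of simulated experiments tractable.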

For the model fitting presented here (Figures 4.1 - 4.4), we use the EZ-DDM, although initial exploration using both the HDDM and the fast-dm (Voss and Voss 2007, a third model fitting framework) found qualitatively similar results, so our current belief is that these results do not depend on the particular decision model deployed from the broad class of accumulator models2.

Obviously, where we wish to simulate many thousands of independent experiments there are significant speed gains from parallelisation. Parallelisation was done by Mike Croucher, and the code was run on the University of Sheffield High Performance Computing cluster. A sense of the value of parallelisation can be had by noting that the data shown in, for example, Figure 4.4 would have taken around 1 calendar month to generate on a single high performance machine, even though they use the ‘computationally cheap’ EZ-DDM method. Python code for running the simulations, as well as the output data, figures and manuscript preparation files, is here http://doi.org/10.5281/zenodo.2648995.

## 3.2 Analysis

Because we are not generating a comprehensive analytic solution for the full DDM we cannot claim that our findings are true for all situations. Our aim is merely to show that, for some reasonable choices of DDM parameters, using decision modelling is a superior approach to analysing response time or accuracy alone, and to quantify the gain in statistical power.

To claim that our simulations are relevant to typical psychology experiments, we need to justify that our parameter choices are plausible for such experiments. To establish this, we pick parameters which generate response times of the order of 1 second and accuracy of the order of 90%. Each participant contributes 40 trials (decisions) to each experiment. Parameters for drift and boundary separation are defined for the group, and individual participant values for these parameters are drawn from the group parameters with some level of variability (and, in the case of true effects, a mean difference between the group values, see below for details).

To illustrate this, we show in Figure 3.2 a direct visualisation of the speed-accuracy trade-off, by taking the base parameters we use in our simulated experiments and generating a single participant’s average response time and accuracy, using 1000 different boundary separation values. This shows the effect of varying boundary separation alone, while all other decision parameters are stable.
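
The shape of this trade-off can also be sketched without simulation, using the standard closed-form expressions for the accuracy and mean decision time of an unbiased diffusion process with no inter-trial variability. This is a swapped-in analytic shortcut rather than the simulation procedure behind Figure 3.2. The drift of 2 and non-decision time of 0.3 seconds are the base values given in the simulation settings, while the function name, the noise scale `s = 1` and the range of boundary values swept are illustrative assumptions.

```python
import math


def sato_point(v, a, t_er=0.3, s=1.0):
    """Expected (accuracy, mean response time) for an unbiased diffusion process."""
    k = v * a / s**2
    accuracy = 1.0 / (1.0 + math.exp(-k))
    mean_decision_time = (a / (2.0 * v)) * math.tanh(k / 2.0)
    return accuracy, t_er + mean_decision_time


# Sweep boundary separation from 0.1 to 3.0 (an illustrative range)
# while holding drift at the base value of 2.
boundaries = [0.1 + 0.01 * i for i in range(291)]
curve = [sato_point(2.0, a) for a in boundaries]
```

Along `curve`, accuracy saturates towards 1 while mean response time keeps increasing, which is the curvilinear relation that linear corrections for the SATO fail to capture.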

### 3.2.1 Simulating experimental data

For each scenario we simulate a large number of experiments, testing a group (‘A’) of participants against another group (‘B’), with each participant contributing 40 trials. Participant parameters (most importantly the drift rate and boundary parameters) are sampled each time from distributions defined for each of the two simulated experimental groups, A and B. For the simulations with no true difference in sensitivity between A and B the drift rate of each group has a mean of 2 and within-group standard deviation of 0.05. For the simulations with a true difference in drift group B has a mean of $$2+\delta$$, where $$\delta$$ defines an increase in the mean drift rate; the within-group standard deviations remain the same. For the simulations where there is no SATO the mean boundary parameter is 2, with a within-group standard deviation of 0.05. For the simulations where there is a SATO, the boundary parameter of group B has an average of $$2-\delta$$, where $$\delta$$ defines the size of the decrease in the mean boundary; the within-group standard deviations remain the same.

All simulations assume a non-decision time of 0.3 seconds, no initial starting bias towards either decision threshold and the inter-trial variability parameters for starting point, drift and non-decision time set to 0. Sample sizes between 10 and 400 participants were tested, moving in steps of 10 participants for sample sizes below 150 and steps of 50 for sample sizes above 150. For each sample size 10,000 simulated experiments were run (each with 40 simulated trials per participant, in each of two groups).
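
As a sketch of this sampling scheme, participant parameters for one simulated experiment could be drawn as below. The group means, the within-group standard deviation of 0.05 and the sample-size steps come from the description above. The use of normal distributions, the function names, the treatment of sample size as participants per group, and the inclusion of 150 itself in the grid are assumptions.

```python
import random


def sample_group(n, drift_mean, boundary_mean, sd=0.05, rng=random):
    """Draw (drift, boundary) pairs for n participants (normality is assumed)."""
    return [(rng.gauss(drift_mean, sd), rng.gauss(boundary_mean, sd))
            for _ in range(n)]


def make_experiment(n_per_group, drift_delta=0.0, boundary_delta=0.0, seed=None):
    """Return participant parameters for groups A and B of one experiment.

    drift_delta raises the mean drift of group B; boundary_delta lowers the
    mean boundary of group B (a speed-accuracy trade-off between groups).
    """
    rng = random.Random(seed)
    group_a = sample_group(n_per_group, 2.0, 2.0, rng=rng)
    group_b = sample_group(n_per_group, 2.0 + drift_delta,
                           2.0 - boundary_delta, rng=rng)
    return group_a, group_b


# Steps of 10 below 150, then steps of 50 up to 400
# (whether 150 itself is included is an assumption).
sample_sizes = list(range(10, 150, 10)) + list(range(150, 401, 50))
```

Setting both deltas to 0 gives a null scenario, a non-zero `drift_delta` gives a true difference in ability, and a non-zero `boundary_delta` adds a SATO between the groups.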

### 3.2.2 Effect sizes, observed and declared

The difference between two groups can be expressed in terms of Cohen’s d effect size — the mean difference between the groups standardised by the within group standard deviation. For the observed variables, response time and accuracy, effect sizes can only be observed since these arise from the interaction of the DDM parameters and the DDM model which generates responses. For drift rate, the difference between groups is declared (by how we define the group means, see above). The declared group difference in drift rate produces the observed effect size in response time and accuracy (which differ from each other), depending on both the level of noise in each simulated experiment, and the experiment design - particularly on the number of trials per participant. Experiment designs which have a higher number of trials per participant effectively sample the true drift rate more accurately, and so have effect sizes for response time and accuracy which are closer to the ‘true’, declared, effect size in drift rate.
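
The distinction between declared and observed effect sizes can be made concrete with a short sketch. The function names are illustrative, and the pooled standard deviation is one common way of standardising by the within-group standard deviation; it is assumed here rather than taken from the simulation code.

```python
import math
import statistics


def declared_d(delta, within_group_sd=0.05):
    """Effect size fixed by design: group mean difference over within-group SD."""
    return delta / within_group_sd


def observed_d(group_a, group_b):
    """Cohen's d estimated from data, using the pooled within-group SD."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b)
                          / (n_a + n_b - 2))
    return (statistics.mean(group_b) - statistics.mean(group_a)) / pooled_sd
```

`declared_d` applies to drift rate, where the group means are set directly. `observed_d` is what would be computed on each participant's mean response time or accuracy, so its value depends on how much trial-level noise remains after averaging over 40 trials.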

This issue sheds light on why decision modelling is more effective than analysing response time or accuracy alone (because it recovers the generating parameter, drift, which is more sensitive to group differences), and why there are differences in power between measuring response time and accuracy (because these variables show different observed effect sizes when generated by the same true difference in drift rates). Figure 3.3 shows how declared differences in drift translate into observed effect sizes for response time and accuracy.

### 3.2.3 Hits (power) and False Alarms (alpha)

For each simulated experiment any difference between groups is gauged with a standard two-sample t-test3. Statistical power is the probability of your measure reporting a group difference when there is a true group difference, analogous to the ‘hit rate’ in a signal detection paradigm. Conventional power analysis assumes a standard false positive (alpha) rate of 0.05. For our simulations we can measure the actual false alarm rate, rather than assume it remains at the intended 0.05 rate.
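
Measuring hit and false alarm rates by simulation can be sketched as follows. To keep the example short, a generic normally distributed measure stands in for response time, accuracy or estimated drift, so this illustrates the counting logic rather than reproducing the simulations reported here. The function name and defaults are illustrative, and NumPy and SciPy are assumed to be available.

```python
import numpy as np
from scipy import stats


def rejection_rate(mean_a, mean_b, sd, n_per_group,
                   n_experiments=2000, alpha=0.05, seed=0):
    """Proportion of simulated experiments in which a two-sample t-test
    reports a significant group difference at the given alpha level."""
    rng = np.random.default_rng(seed)
    # One row per simulated experiment, one column per participant.
    group_a = rng.normal(mean_a, sd, size=(n_experiments, n_per_group))
    group_b = rng.normal(mean_b, sd, size=(n_experiments, n_per_group))
    p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue
    return float(np.mean(p_values < alpha))


# Hit rate: a true difference exists (here an effect size of 0.35).
hit_rate = rejection_rate(0.0, 0.35, 1.0, n_per_group=140)
# False alarm rate: no true difference between the groups.
false_alarm_rate = rejection_rate(0.0, 0.0, 1.0, n_per_group=140)
```

When the measure is unbiased, as in this stand-in, the false alarm rate should sit near the nominal alpha. The point of measuring it in the full simulations is that a SATO between groups can push it away from that value.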

For situations where only the drift differs between two groups we would not expect any significant variations in false alarm rate. However, when considering speed-accuracy trade-off changes between groups (with or without drift rate differences as well) the situation is different: it is possible to get false positives in tests of a difference in drifts between groups because of SATOs. Most obviously, if a SATO means one group prioritises speed over accuracy, analysis of response time alone will mimic an enhanced drift rate, but analysis of accuracy alone will mimic a degraded drift rate. Ideally the DDM will be immune to any distortion of estimates of drift rates, but that is what we have set out to demonstrate, so we should not assume it.

The consequence of this is that it makes sense to calculate the overall sensitivity, accounting for both the false alarm rate and the hit rate. A principled way of combining false alarm and hit rate into a single metric is d’ (“d prime”), which gives an overall sensitivity of the test, much as we would calculate the sensitivity independent of bias for an observer in a psychophysics experiment (Green 1966).
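
The standard signal detection definition of d’ is the difference between the z-transformed hit rate and the z-transformed false alarm rate. A minimal sketch, with an illustrative function name:

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: z(hit rate) minus z(false alarm rate).

    Both rates must lie strictly between 0 and 1; rates of exactly 0 or 1
    need a correction first, and the choice of correction is left open here.
    """
    z = NormalDist().inv_cdf   # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)
```

For example, a test with the conventional 80% power and a false alarm rate of 0.05 has a d’ of about 2.49, and the same hit rate with an inflated false alarm rate gives a lower d’.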

# 4 Results

The results shown here support our central claim that decision modelling can have substantial benefits. To explore the interaction of power, sample size, effect size and measure sensitivity we have prepared an interactive data explorer which can be found here https://sheffield-university.shinyapps.io/decision_power/ (Krystalli and Stafford 2019).

For an idea of the main implications, it is sufficient to plot a slice of the data when the true difference in drift is a Cohen’s d of 2. Recall, from Figure 3.3 above, that although this is a large difference in terms of the generating parameter, drift, this translates into small observed effect sizes in accuracy and response time (approximately 0.3 - 0.4, reflecting ‘medium’ effect sizes).

Figure 4.1, left, shows how sample size and hit rate interact for the different measures. The results will be depressingly familiar to any experimentalist who has taken power analysis seriously — a sample size far larger than that conventionally recruited is required to reach adequate power levels for small/medium group differences.

From this figure we can read off the number of participants per group required to reach the conventional 80% power level (equivalent to a hit rate of 0.8, if we assume a constant false positive rate). For this part of the parameter space, for this size of difference between groups in drift, and no speed-accuracy trade-off, ~140 participants per group are required to achieve 80% power if the difference between groups is tested on the speed of correct responses only. If the difference between groups is tested on the accuracy rate only then ~115 participants per group are required. If speed and accuracy are combined using decision modelling, and the difference between groups is tested on the recovered drift parameters, then we estimate that ~55 participants per group are required for 80% power. An experimentalist who might otherwise have had to recruit 280 (or 230) participants in total could therefore save herself (and her participants) significant trouble, effort and cost by deploying decision modelling, recruiting half that sample size and still enjoying an increase in statistical power to detect group differences.

Figure 4.1, right, shows the false alarm rate. When the difference in drifts is a Cohen’s d of 0, i.e. no true difference, the t-tests on response time and accuracy both generate false alarm rates at around the standard alpha level of $$0.05$$.

Figure 4.2 shows the measure sensitivity, d’ for each sample size. In effect, this reflects the hit rate (Figure 4.1, left) corrected for fluctuations in false alarm rate (Figure 4.1, right). This correction will be more important when there are systematic variations in false positive rate due to SATOs. Note that the exact value of d’ is sensitive to small fluctuations in the proportions of hits and false alarms observed in the simulations, and hence the d’ curves are visibly kinked despite being derived from the apparently smooth hit and false alarm curves.

## 4.2 With SATOs

The superiority of parameter recovery via a decision model becomes even more stark if there are systematic speed-accuracy trade-offs. To see this, we re-run the simulations above, but with a shift in the boundary parameter between group A and group B, such that individuals from group B have a lower boundary, and so tend to make faster but less accurate decisions compared to group A. On top of this difference, we simulate different sizes of superiority of drift rate of group B over group A.

For the plots below the drift rate difference is, as above in the non-SATO case, 0.1 (which, given the inter-individual variability, translates into an effect size of 2). The boundary parameter difference is also 0.1, a between-group effect size of 2.

Unlike the case where there are no SATOs, the response time measure is now superior for detecting a group difference over the drift measure; Figure 4.3, left.