Teaching: “How reliable is cognitive neuroscience?”

This spring I taught my MSc module ‘PSY6316 Current Issues in Cognitive Neuroscience’ on the topic “How reliable is cognitive neuroscience?”. Here’s the module outline:

What has been called The Replication Crisis has sparked widespread introspection about the standards and protocols of science, particularly within the behavioural sciences. This course, through reading a series of landmark papers and class discussion, will consider the extent to which doubts about the reliability of empirical work affect cognitive neuroscience. Can we trust the published papers in this field? Are the effects which we investigate reliable? If not, how can work in cognitive neuroscience be made more trustworthy?

The basic idea was to read material on robust science and scandals of unreliability in psychology, and to ask the students to consider the extent to which these applied to cognitive neuroscience.

I asked students a set of questions by anonymous questionnaire before they took the course, and after. The responses indicate that the course did at least induce some skepticism in the students:

Here are their before-vs-after responses to the statement ‘If you read about a finding that has been demonstrated across multiple papers in multiple journals by multiple authors, how likely do you think that finding is to be reliable?’

Here are the responses for ‘If PSYCHOLOGY continues as it has, significant progress will be made in understanding in the next 50 years’ and ‘If COGNITIVE NEUROSCIENCE continues as it has, significant progress will be made in understanding in the next 50 years’.

Note that optimism is reduced for both fields, but started higher for cognitive neuroscience (perhaps unsurprising since many of the students are on the cogneuro MSc).

The full list of questions I asked, the responses and the plots are available here. Perhaps most importantly, the reading list is also available; it contains landmark papers on replicability/reproducibility in psychology, as well as relevant readings concerning reliability in neuroimaging.

I have always run this course as a discussion class rather than lecture class, and I have always based it around controversies in cognitive neuroscience. Last year it was ‘sex differences in the brain‘. You can read a bit more about the thinking behind the course in

Stafford, T. (2008). A fire to be lighted: a case-study in enquiry based learning. Practice and Evidence of the Scholarship of Teaching and Learning in Higher Education, 3(1), 20-42.

Related: materials from the “Open Science and Robust Research Practices” symposium held in Sheffield on 7/6/18.


2017 review

Things that have consumed my attention in 2017…

Teaching & Public Engagement

At the beginning of the year I taught my graduate seminar class on cognitive neuroscience, and we reviewed Cordelia Fine’s “Delusions of Gender” and the literature on sex differences in cognition. I blogged about some of the topics covered (linked here), and gave talks about the topic at Leeds Beckett and the University of Sheffield. It’s a great example of a situation that is common to so much of psychology: strong intuitions guide interpretation as much as reliable evidence.

In the autumn I helped teach undergraduate cognitive psychology, and took part in the review of our entire curriculum as the lead of the “cognition stream”. It’s interesting to ask exactly what a psychology student should be taught about cognitive psychology over three years.

In January I gave a lecture at the University of Greenwich on how cognitive science informs my teaching practice, which you can watch here: “Experiments in Learning”.

We organised a series of public lectures on psychology research, Mind Matters. These included Gustav Kuhn and Megan Freeth talking about the science of magic, and Sophie Scott (who gave this year’s Christmas Lectures at the Royal Institution) talking about the science of laughter. You can read about the full programme on the Mind Matters website. Joy Durrant did all the hard work for these talks – thanks Joy!



Using big data to test cognitive theories

Our paper “Many analysts, one dataset: Making transparent how variations in analytical choices affect results” is now in press at (the new journal) Advances in Methods and Practices in Psychological Science. See previous coverage in Nature (‘Crowdsourced research: Many hands make tight work’) and 538 (Science Isn’t Broken: It’s just a hell of a lot harder than we give it credit for.). This paper is already more cited than many of mine which have been published for years.

On the way to looking at chess players’ learning curves I got distracted by sex differences: surely, I thought, chess would be a good domain to look for the controversial ‘stereotype threat’ effect? It turns out Female chess players outperform expectations when playing men (in press at Psychological Science).

Wayne Gray edited a special issue of Topics in Cognitive Science: Game XP: Action Games as Experimental Paradigms for Cognitive Science, which features our paper Testing sleep consolidation in skill learning: a field study using an online game.

I presented this work at a related symposium at CogSci17 in London (along with our work on learning in the game Destiny), at a Psychonomics workshop in Madison, WI (Beyond the Lab: Using Big Data to Discover Principles of Cognition), and at a Pint of Science event in Sheffield (video here).

Our map of implicit racial bias in Europe sparked lots of discussion (and the article was read nearly 2 million times at The Conversation).


Trust and reason

I read Hugo Mercier and Dan Sperber’s ‘The Enigma of Reason: A New Theory of Human Understanding’ and it had a huge effect on me, influencing a lot of the new work I’ve been planning this year. (My review in the Times Higher is here.)

In April I went to a British Academy roundtable meeting on ‘Trust in Experts’. Presumably I was invited because of this research on why we don’t trust the experts. Again, this has influenced lots of future plans, but there’s nothing to show yet.

Related, we have AHRC funding for our project Cyberselves: How Immersive Technologies Will Impact Our Future Selves. Come to the workshop on the effects of teleoperation and telepresence, in Oxford in February.


Decision making

Our Leverhulme project on implicit bias and blame wound up. Outputs in press or preparation:

My old PhD students Maria Panagiotidi, Angelo Pirrone and Cigar Kalfaoglu have also published papers, with me as co-author, making me look more prolific than I am. See the publications page.


The highlight of the year has been getting to speak to and work with so many generous, interesting, committed people. Thanks and best wishes to all.

Previous years’ reviews: 2016 review, 2015 review.

A hierarchy of critique

Paul Graham has a hierarchy of disagreement. He’s obviously spent his fair share of time watching debates unfold on internet forums, and has categorised the quality of points people make. At the bottom are distraction and name calling. To get to the top you need to identify and refute the central point. Obviously we should aim to produce disagreements from the top of the hierarchy if we want to have a productive debate.

The hierarchy has been expressed in this handy graphic:


I think some students would find it useful to have a ‘hierarchy of critique’ to identify the most valuable points to make in an essay. I’ve written before about how to criticise a psychology study. The essential idea is the same as Graham’s: not all criticisms are equal – there are more and less interesting flaws in a study which you can point out.

In brief, like the top levels of Graham’s hierarchy, the best criticisms of a study engage with the propositions that the study authors are trying to establish. Every study will have flaws, but the critical flaws are the ones which break the links between what the experiment shows and what the authors are trying to claim based on it.

So, without further ado, here is my hierarchy of critique:


The exact contents aren’t as important as the fact that there is a hierarchy, and we should always be asking ourselves how high up the hierarchy the point we’re trying to make is. If it is near the bottom, maybe there are better criticisms to spend time and words on.

For more on this, read my post on what it means to be critical of a psychology study, or watch this video I made saying the same thing.

Individualised student feedback

My Cognitive Psychology course is structured around activities which occur before and after the lectures, many of them online. This year I wrote a Python script which emailed each student an analysis and personalised graph of their engagement with the course. Here’s what it looked like:

———- Forwarded message ———-
From: me
Date: 4 December 2015 at 10:53
Subject: engagement with PSY243
To: student@sheffield.ac.uk

This is an automatically generated email, containing feedback on your engagement with PSY243 course activities. Nobody but you (not even me) has seen these results, and they DO NOT AFFECT OR REFLECT your grade for this course. They have been prepared merely as feedback on how you have engaged with activities as part of PSY243.

Here is a record of your activities:
Weeks 1-9, concept-checking quizzes completed (out of 7):  4
Week 1-10, asked question via wiki or discussion group:  NO
Week 3, submitted practice answer:  NO
Week 7, submitted answer for peer review (compulsory):  YES
Week 8, number of peer reviews submitted (out of 3, compulsory):  3
Week 10, attended seminar discussion :  NO

We can combine these records to create a MODULE ACTIVITY ENGAGEMENT SCORE.

* * * Your score is 57% * * *

This puts you in the TOP half of the course. Obviously this score does not include activities for which I do not have records. This includes things like lecture attendance, asking questions in lectures, private study, etc.

If we plot the engagement scores for the whole year against the number of people who get that engagement score or lower we get a graph showing the spread of engagement across the course. This graph, and your position on it, are attached to this email. People who have done the least will be towards the left, people who have done the most will appear towards the right of the curve. You can see that there is a spread of engagement scores. Very few people have not done anything, very few have done everything.

I hope you find this feedback useful. PSY243 is designed as a course where the activities structure your private study, rather than as a course where a fixed set of knowledge is conveyed in lectures. This is why I put such emphasis on these extra activities, and provide feedback on your engagement with them. Next week you have the chance to give feedback on PSY243 as part of the course evaluation, so please do say if you can think how the course might be improved.

Tom, PSY243 MO


I designed this course to be structured around a single editable webpage, a wiki, which would provide all the information needed to understand the course from day one. My ambition was to use the lectures to focus on two things you can’t get from a textbook. The first being live exposure to a specialist explaining how they approach a problem or topic in their area. The second being an opportunity to discuss the material (a so-called ‘flipped classroom’). This year I added pre-lecture quizzes to the range of activities available on the course (you can see these here). These were designed so students could test their understanding of the foundational material upon which each lecture drew, and are part of this wider plan to provide clear structure for students’ engagement with the course around the lectures.
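The heart of the script is just turning each student's activity records into a single percentage and a class position. Here is a hypothetical sketch of that calculation: the field names and the equal weighting are illustrative, not taken from the actual script.

```python
# Hypothetical sketch of the engagement score. Field names and the
# equal weighting of activities are illustrative assumptions, not the
# real script's logic.
def engagement_score(record):
    """Combine a student's activity records into a percentage score."""
    parts = [
        record["quizzes_done"] / 7,                 # concept-checking quizzes
        1.0 if record["asked_question"] else 0.0,   # wiki/discussion question
        1.0 if record["practice_answer"] else 0.0,  # week 3 practice answer
        1.0 if record["peer_review_answer"] else 0.0,
        record["peer_reviews_done"] / 3,
        1.0 if record["attended_seminar"] else 0.0,
    ]
    return round(100 * sum(parts) / len(parts))

def class_position(score, all_scores):
    """Fraction of the class scoring at or below this score."""
    return sum(s <= score for s in all_scores) / len(all_scores)
```

Each student's score and position, plus a cumulative plot of the whole class's scores, then went into the emailed feedback.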

If you’re the sort of person who wants to see the code, it is here. At your own risk.

Power analysis for a between-sample experiment

Understanding statistical power is essential if you want to avoid wasting your time in psychology. The power of an experiment is its sensitivity – the likelihood that, if the effect tested for is real, your experiment will be able to detect it.

Statistical power is determined by the type of statistical test you are doing, the number of people you test and the effect size. The effect size is, in turn, determined by the reliability of the thing you are measuring, and how much it is pushed around by whatever you are manipulating.

Since it is a common test, I’ve done a power analysis for a two-sample (two-sided) t-test, for small, medium and large effects (as conventionally defined). The results should worry you.


This graph shows you how many people you need in each group for your test to have 80% power (a standard desirable level of power – meaning that if your effect is real you’ve an 80% chance of detecting it).

Things to note:

  • even for a large (0.8) effect you need close to 30 people (total n = 60) to have 80% power
  • for a medium effect (0.5) this is more like 70 people (total n = 140)
  • the required sample size increases dramatically as effect size drops
  • for small effects, the sample required for 80% power is around 400 in each group (total n = 800).
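The numbers behind the graph can be roughly reconstructed in a few lines. The original analysis used R's pwr package (see the technical note at the end of this post); this Python sketch uses the normal approximation to the two-sample t-test, which slightly underestimates the exact required n:

```python
# Rough reconstruction of the sample-size calculation, using the normal
# approximation to the two-sided, two-sample t-test (slightly
# underestimates the exact required n from the pwr package).
from statistics import NormalDist

def n_per_group(d, power=0.8, alpha=0.05):
    """Approximate n per group to detect effect size d."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # 1.96 for alpha = .05
    z_beta = z(power)           # 0.84 for 80% power
    return 2 * ((z_alpha + z_beta) / d) ** 2

for label, d in [("large", 0.8), ("medium", 0.5), ("small", 0.2)]:
    print(f"{label} effect (d = {d}): ~{n_per_group(d):.0f} per group")
```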

What this means is that if you don’t have a large effect, studies with a between-groups analysis and an n of less than 60 aren’t worth running. Even if you are studying a real phenomenon, you aren’t using a statistical lens with enough sensitivity to be able to tell. You’ll get to the end and won’t know whether the phenomenon you are looking for isn’t real or whether you just got unlucky with who you tested.

Implications for anyone planning an experiment:

  • Is your effect very strong? If so, you may rely on a smaller sample. (For illustrative purposes, the effect size of the male-female height difference is ~1.7, so large enough to detect with a small sample. But if your effect is this obvious, why do you need an experiment?)
  • You really should prefer within-sample analysis whenever possible (the power analysis of this is left as an exercise)
  • You can get away with smaller samples if you make your measure more reliable, or if you make your manipulation more impactful. Both of these will increase your effect size, the first by narrowing the variance within each group, the second by increasing the distance between them

Technical note: I did this cribbing code from Rob Kabacoff’s helpful page on power analysis. Code for the graph shown here is here. I use and recommend Rstudio.

Teaching: what it means to be critical

We often ask students to ‘critically assess’ research, but we probably don’t explain what we mean by this as well as we could. Being ‘critical’ doesn’t mean merely criticising, just as skepticism isn’t the same as cynicism. A cynic thinks everything is worthless, regardless of the evidence; a skeptic wants to be persuaded of the value of things, but needs to understand the evidence first.

When we ask students to critically assess something we want them to do it as skeptics. You’re allowed to praise, as well as blame, a study, but it is important that you explain why.

As a rule of thumb, I distinguish three levels of criticism. These are the kinds of critical thinking that you might include at the end of a review or a final year project, under a “flaws and limitations” type heading. Taking the least valuable first (and the one that will win you the fewest marks), let’s go through the three types one by one:

General criticisms: These are the sorts of flaws that we’re taught to look out for from the very first moment we start studying psychology. Things like too few participants, lack of ecological validity or the study being carried out on a selective population (such as university psychology students). The problem isn’t that these aren’t flaws of many studies, but rather that they are flaws of too many studies. Because these things are almost always true – we’d always like to have more people in our study! we’re never certain if our results will generalise to other populations – it isn’t very interesting to point this out. Far better if you can make …

Specific criticisms: These are specific weaknesses of the study you are critiquing. Things which you might offer as a general criticism become specific criticisms if you can show how they relate to particular weaknesses of a study. So, for example, almost all studies would benefit from more participants (a general criticism), but if you are looking at a study where the experimental and control groups differed on the dependent variable, yet the result was non-significant (p=0.09, say), then you can make the specific criticism that the study is under-powered. The numbers tested, and the statistics used, mean that it isn’t possible to say whether there probably is or probably isn’t an effect. It’s simply uncertain. So, they need to try again with more people (or less noise in their measures).

Finding specific criticisms means thinking hard about the logic of how the measures taken relate to psychological concepts (operationalisation) and what the comparisons made (control groups) really mean. A good specific criticism will be particular to the details of the study, showing that you’ve thought about the logic of how an experiment relates to the theoretical claims being considered (that’s why you get more credit for making this kind of criticism). Specific criticisms are good, but even better are…

Specific criticisms with crucial tests or suggestions: This means identifying a flaw in the experiment, or a potential alternative explanation, and simultaneously suggesting how the flaw can be remedied or the alternative explanation can be assessed for how likely it is. This is the hardest to do, because it is the most interesting. If you can do this well you can use existing information (the current study, and its results) to enhance our understanding of what is really true, and to guide our research so we can ask more effective questions next time. Exciting stuff!

Let me give an example. A few years ago I ran a course which used a wiki (reader edited webpages) to help the students organise their study. At the end of the course I thought I’d compare the final exam scores of people who used the wiki against those who hadn’t. Surprise: people who used the wiki got better exam scores. An interesting result, I thought, which could suggest that using the wiki helped people understand the material. Next, I imagined I’d written this up as a study and then imagined the criticisms you could make of it. Obviously the major one is that it is observational rather than experimental (there is no control group), but why is this a problem? It’s a problem because there could be all sorts of differences between students which might mean they both score well on the exam and use the wiki more. One way this could manifest is that diligent students used the wiki more, but they also studied harder, and so got better marks because of that. But this criticism can be tested using the existing data. We can look and see if only high-scoring students use the wiki. They don’t – there is a spread of students who score well and who score badly, independently of whether they use the wiki or not. In both groups, the ones who use the wiki more score better. This doesn’t settle the matter (we still need to run a randomised control study), but it allows us to finesse our assessment of one criticism (that only good students used the wiki). There are other criticisms (and other checks); you can read about it in the paper we eventually published on the topic.

Overall, you get credit in a critical assessment for showing that you are able to assess the plausibility of the various flaws a study has. You don’t get marks just for identifying as many flaws as possible without balancing them against the merits of the study. All studies have flaws, the interesting thing is to make positive suggestions about what can be confidently learnt from a study, whilst noting the most important flaws, and – if possible – suggesting how they could be dismissed or corrected.

Addenda: I made a video of this post. And a postscript: A hierarchy of critique

New paper: wiki users get higher exam scores

Just out in Research in Learning Technology, is our paper Students’ engagement with a collaborative wiki tool predicts enhanced written exam performance. This is an observational study which tries to answer the question of how students on my undergraduate cognitive psychology course can improve their grades.

One of the great misconceptions about studying is that you just need to learn the material. Courses and exams which encourage regurgitation don’t help. In fact, as well as memorising content, you also need to understand it and reflect that understanding in writing. That is what the exam tests (and what an undergraduate education should test, in my opinion). A few years ago I realised, marking exams, that many students weren’t fulfilling their potential to understand and explain, and were relying too much on simply recalling the lecture and textbook content.

To address this, I got rid of the textbook for my course and introduced a wiki – an editable set of webpages, using which the students would write their own textbook. An inspiration for this was a quote from Francis Bacon:

Reading maketh a full man,
conference a ready man,
and writing an exact man.

(the reviewers asked that I remove this quote from the paper, so it has to go here!)

Each year I cleared the wiki and encouraged the people who took the course to read, write and edit using the wiki. I also kept a record of who edited the wiki, and their final exam scores.

The paper uses this data to show that people who made more edits to the wiki scored more highly on the exam. The obvious confound is that people who score more highly on exams will also be the ones who edit the wiki more. We tried to account for this statistically by including students’ scores on their other psychology exams in our analysis. This has the effect – we argue – of removing the general effect of students’ propensity to enjoy psychology and study hard and isolate the additional effect of using the wiki on my particular course.

The result, pleasingly, is that students who used the wiki more scored better on the final exam, even accounting for their general tendency to score well on exams (as measured by grades for other courses). This means that even among people who generally do badly in exams, and did badly on my exam, those who used the wiki more did better. This is evidence that the wiki is beneficial for everyone, not just people who are good at exams and/or highly motivated to study.
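To give a flavour of the analysis strategy, here is a sketch with simulated data: regress final-exam score on wiki edits while controlling for grades on other exams. All the numbers below are invented for illustration; they are not the study's data.

```python
# Sketch of the analysis strategy with simulated (not real) data:
# regress exam score on wiki edits, controlling for other exam grades.
import numpy as np

rng = np.random.default_rng(42)
n = 100
ability = rng.normal(60, 10, n)           # unobserved general ability
edits = rng.poisson(5, n).astype(float)   # wiki edits made
exam = ability + 1.0 * edits + rng.normal(0, 5, n)  # assumed 1 pt/edit benefit
other_exams = ability + rng.normal(0, 5, n)         # noisy proxy for ability

# OLS: exam ~ intercept + edits + other_exams
X = np.column_stack([np.ones(n), edits, other_exams])
coefs, *_ = np.linalg.lstsq(X, exam, rcond=None)
print(f"estimated benefit per edit, controlling for other exams: {coefs[1]:.2f}")
```

Because the simulated edits are independent of ability, the edit coefficient stays close to the built-in benefit even though ability is only measured through a noisy proxy; that is the logic of using other exam grades as a statistical control.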

Here’s the graph, Figure 1 from our paper:


This is a large effect – the benefit is around 5 percentage points, easily enough to lift you from a mid 2:2 to a 2:1, or a mid 2:1 to a first.

Fans of wiki research should check out this recent paper Wikipedia Classroom Experiment: bidirectional benefits of students’ engagement in online production communities, which explores potential wider benefits of using wiki editing in the classroom. Our paper is unique for focussing on the bottom line of final course grades, and for trying to address the confound that students who work harder at psychology are likely to both get higher exam scores and use the wiki more.

The true test of the benefit of the wiki would be an experimental intervention where one group of students used a wiki and another did something else. For a discussion of this, and discussion of why we believe editing a wiki is so useful for learning, you’ll have to read the paper.

Thanks go to my collaborators. Harriet reviewed the literature and Herman installed the wiki for me, and did the analysis. Together we discussed the research and wrote the paper.

Full citation:
Stafford, T., Elgueta, H., Cameron, H. (2014). Students’ engagement with a collaborative wiki tool predicts enhanced written exam performance. Research in Learning Technology, 22, 22797. doi:10.3402/rlt.v22.22797

First visualise, then test

My undergraduate project students are in the final stages of their writing up. We’ve had a lot of meetings over the last few weeks about the correct way to analyse their data. It struck me that there was something I wish I’d emphasised more before they started analysing the data – you should visualise your data first, and only then run your statistical test.

It’s all too easy to approach statistical tests as a kind of magic black box which you apply to the data and – cher-ching! – a result comes out (hopefully p<0.05). We teach our students all about the right kinds of tests, and the technical details of reporting them (F values, p values, degrees of freedom and all that). These last few weeks it has felt to me that our focus on teaching these details can obscure the big picture – you need to understand your data before you can understand the statistical test. And understanding the data means first you want to see the shape of the distributions and the tendency for any difference between groups. This means histograms of the individual scores (how are they distributed? outliers?), scatterplots of variables against each other (any correlation?) and a simple eye-balling of the means for different experimental conditions (how big is the difference? Is it in the direction you expected?).
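As a minimal illustration of this eyeballing stage, with invented scores for two groups, you might start with nothing fancier than this:

```python
# "Look first": summarise centre, spread and direction of any difference
# before reaching for a test. Scores here are invented for illustration.
import statistics

group_a = [54, 61, 58, 65, 70, 59, 62, 66, 57, 63]
group_b = [50, 55, 52, 60, 49, 58, 53, 56, 51, 54]

for name, scores in [("A", group_a), ("B", group_b)]:
    print(f"Group {name}: mean={statistics.mean(scores):.1f}, "
          f"sd={statistics.stdev(scores):.1f}, "
          f"range={min(scores)}-{max(scores)}")
# Only once you know the direction and rough size of the difference
# (and have seen the histograms) should you run the t-test itself.
```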

Without this preparatory stage where you get an appreciation for the form of the data, you risk running an inappropriate test, or running the appropriate test but not knowing what it means (for example, you get a significant difference between the groups, but you haven’t checked first whether it is in the direction predicted or not). These statistical tests are not a magic black box to meaning, they are props for our intuition. You look at the graph and think that Group A scored higher on average than Group B. Now your t-test tells you something about whether your intuition is reliable, or whether you have been fooling yourself through wishful thinking (all too easy to do).

The technical details of running and reporting statistical tests are important, but they are not as important as making an argument about the patterns in the data. Your tests support this argument – they don’t determine it.

Further reading:

Abelson, R. P. (1995). Statistics as principled argument. Psychology Press.

6 clear writing tips for exam success

I just sent something like this to the people who took my module (PSY241 Cognitive Psychology) last semester. In case you wonder, the Tim Minchin video I mention is this one

Dear PSY241 students from last semester

Having just marked your exam scripts, I thought that many of you could improve your grades with some simple advice about the use of written English. Remember, in University everything you write will be read by someone who is trying their hardest to understand exactly what you mean, but who also cares deeply about the way ideas are expressed in writing. When you leave University what you write may or may not be read by people who care so deeply about the exact use of words, but it is also likely that they won’t be trying so hard to understand you. So for University work, and after, it is very important that you say what you mean in the clearest possible way.

Some examples of improvements you should check you make to what you write, from the exams I’ve just marked:

1. Be specific

Bad: “a study shows”, “the timing must be right”, “Pavlov’s work is relevant”

Whose study shows? How must the timing be right? How is Pavlov’s work relevant? We want to give you marks for showing off that you understand exactly what is known and how. If you can’t remember, it is okay to write ‘one study shows’, but it is so much more impressive if you can say which study.

Improve by saying, e.g., “Shanks’ study shows”, “The timing must be as close to zero delay as possible, but no less”, “Pavlov’s theory is relevant because it shows how cause and effect relations can be learned”