New paper: “The path to learning: Action acquisition is impaired when visual reinforcement signals must first access cortex”

Using a cunning experimental design, we provide evidence supporting a theory of how the brain learns new actions. Back in 2006, our professors Redgrave and Gurney proposed this theory, centred on a subcortical brain area, the basal ganglia, and the function of the neurotransmitter dopamine. This was exciting for two reasons: it proposed what these parts of the brain might do, based on our understanding of the pathways involved and the computations they might support; and it flatly contradicted the most popular theory of dopamine function, the reward prediction error hypothesis.

We set out to test this theory. We used a novel task to assess action-outcome learning, in which human subjects moved a joystick around until they could identify a target movement. We didn’t record dopamine directly – a tall order in human subjects – but instead used our knowledge of what triggers dopamine release to compare two learning conditions: one in which dopamine would be triggered as normal, and one in which we reasoned the dopamine signal would be weakened.

We did this by using two different kinds of reinforcement signal: either a simple luminance change (i.e. a white flash) or a specifically calibrated change in colour properties (visual psychophysics fans: a shift along the tritan line). The colour change is visible only to one class of cells in the eye, the s-cone photoreceptors. Importantly for our purposes, this means that although the signal travels the cortical visual pathways, it does not enter the subcortical visual pathway to the superior colliculus. And the colliculus is the main, if not the only, route for triggering dopamine release in the basal ganglia.

So by manipulating the stimulus properties we can control which pathways the stimulus information travels. Either the reinforcement signal goes directly to the colliculus, and so to the dopamine system (luminance change condition), or it must travel through visual cortex first and then to the colliculus – ‘the long way round’ – to reach the dopamine system (s-cone condition).
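For the curious, here’s a rough sketch in Python of the principle behind an s-cone isolating (tritan-line) stimulus: shift a colour’s s-cone excitation while leaving the L- and M-cone excitations fixed. This uses the standard sRGB and Hunt-Pointer-Estevez matrices and ignores display calibration and individual observer differences, which a real tritan-line calibration has to handle psychophysically – an illustration of the principle, not our actual calibration procedure.

```python
import numpy as np

# Linear sRGB -> XYZ (D65), then XYZ -> cone excitations (Hunt-Pointer-Estevez).
RGB_TO_XYZ = np.array([[0.4124, 0.3576, 0.1805],
                       [0.2126, 0.7152, 0.0722],
                       [0.0193, 0.1192, 0.9505]])
XYZ_TO_LMS = np.array([[ 0.4002, 0.7076, -0.0808],
                       [-0.2263, 1.1653,  0.0457],
                       [ 0.0,    0.0,     0.9182]])
RGB_TO_LMS = XYZ_TO_LMS @ RGB_TO_XYZ
LMS_TO_RGB = np.linalg.inv(RGB_TO_LMS)

def s_cone_shift(rgb, delta_s):
    """Shift a linear-RGB colour along the tritan line.

    L- and M-cone excitations are left untouched, so in principle only
    the s-cones (and hence only the cortical pathway) see the change.
    """
    lms = RGB_TO_LMS @ np.asarray(rgb, dtype=float)
    lms[2] += delta_s  # perturb the s-cone excitation only
    return LMS_TO_RGB @ lms  # may leave the monitor gamut for large shifts

grey = [0.5, 0.5, 0.5]
print(s_cone_shift(grey, 0.05))  # the reinforcement signal, s-cone version
```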

The result is a validation of the action-learning hypothesis: when reinforcement signals are invisible to the colliculus, learning new action-outcome associations is harder. We also ran an important control experiment, which showed that the impairment due to the s-cone signals couldn’t be reproduced by simply delaying the stimulus information; this suggests the s-cone signal is weaker, not just slower, in its dopaminergic action. You can read the full thing here.

The results aren’t conclusive – no behavioural experiment that didn’t record dopamine directly could be – but we think it is a strong result. Popper said there are two kinds of results to be most interested in. One is the experiment which proves a theory wrong. The other – which we believe this is – is the experiment which confirms a bold hypothesis. No other theory would suggest this experiment, and only the Redgrave and Gurney theory predicted the result we got, before we got it. This makes it a striking validation of the theory, and that is why we’re really proud of the paper.

This work was funded by our European project, IM-CLeVeR, and the difficult work was all done by Martin Thirkettle, building on foundations laid by Tom Walton.

Thirkettle, M., Walton, T., Shah, A., Gurney, K., Redgrave, P., & Stafford, T. (2013). The path to learning: Action acquisition is impaired when visual reinforcement signals must first access cortex. Behavioural Brain Research, 243, 267–272. doi:10.1016/j.bbr.2013.01.023

New paper: Memory Enhances the Mere Exposure Effect

This research used a novel testing strategy to overturn a long-standing claim in the literature. The mere exposure effect is the finding that simply experiencing something inclines you to like it. Back in the days of behaviourism, this provided a marked contrast to reward-induced preferences. A landmark paper by Bob Zajonc showed that the effect could hold even if you weren’t aware of the original exposure. (Incidentally, it was this paper, as far as I can tell, which reignited interest in subliminal perception after the topic had fallen into ‘hidden persuader’ ignominy.)

For a long time, based partly on the influence of this seminal paper, it has been reported that explicit memory for stimuli will reduce the mere exposure effect. The logic is that explicit memory will allow people to use a deliberate discounting strategy (something along the lines of “I know I’ve seen that before, so maybe I just feel positive about it because I’ve seen it before”). This isn’t implausible, but does conflict with a large marketing literature which suggests that sustained engagement with marketing materials is more likely to lead to preference (and it is just such engagement with adverts which you would expect to be accompanied by explicit memory).

I put test stimuli in my PSY101 lectures, and then weeks later tested the students on their preferences for these stimuli and for a matched set which they hadn’t seen. This allowed me to collect a high number of participants for an experiment with high ecological validity (while keeping many elements of experimental control).

Frontiers special issue on intrinsic motivation and open-ended development

Our special issue in Frontiers in Cognitive Science is now accepting submissions: Intrinsic motivations and open-ended development in animals, humans, and robots

This call stems from the EU FP7 project “IM-CLeVeR”, a programme of work that involved computer scientists, neuroscientists, psychologists and roboticists in developing controllers that can guide a robot to learn by exploring the world.

The special issue will gather together work related to this task. ‘Intrinsic motivations’ are those that guide exploration – things like curiosity, play, or the desire for mastery. The emphasis is on learning systems which are more than the simple stimulus-response or response-reward learning that has dominated learning theory for so long. ‘Open-ended development’ means learning that doesn’t have a fixed goal or limit, but is instead designed to produce skills and abilities which can be built on to produce ever more complex skills and abilities. The call welcomes papers from experimental, theoretical and engineering perspectives. The full text of the call is here.

Brain network: social media and the cognitive scientist

This was just published in Trends in Cognitive Sciences. Abstract:

Cognitive scientists are increasingly using online social media, such as blogging and Twitter, to gather information and disseminate opinion, while linking to primary articles and data. Because of this, internet tools are driving a change in the scientific process, where communication is characterised by rapid scientific discussion, wider access to specialist debates, and increased cross-disciplinary interaction. This article serves as an introduction to and overview of this transformation.

Reference: Stafford, T., & Bell, V. (2012). Brain network: social media and the cognitive scientist. Trends in Cognitive Sciences, 16(10), 489–490. doi:10.1016/j.tics.2012.08.001

I’m on Twitter as @tomstafford, btw

Fundamentals of learning: the exploration-exploitation trade-off

The exploration-exploitation trade-off is a fundamental dilemma whenever you learn about the world by trying things out. The dilemma is between choosing what you know and getting something close to what you expect (‘exploitation’) and choosing something you aren’t sure about and possibly learning more (‘exploration’). For example, suppose you are in a restaurant and you look at the menu:

  • Fish and Chips
  • Chole Poori
  • Paneer Uttappam
  • Khara Dosa

Assuming for the sake of example that you’re not very good with south Indian food, you’ve now got a choice. You can ‘exploit’ – go with the fish and chips, which will probably be alright – or you can ‘explore’ – try something you haven’t had before and see what you get. Obviously which you decide to do will depend on many things: how hungry you are, how good the restaurant reviews are, how adventurous you are, how often you reckon you’ll be coming back, etc. What’s important is that the study of the best way to make these kinds of choices – called reinforcement learning – has shown that optimal learning requires you sometimes to make bad choices. This means that sometimes you have to avoid the action you think will be most rewarding, and take one you think will be less rewarding. The rationale is that these ‘sub-optimal’ actions are necessary for your long-term benefit – you need to go off track sometimes to learn more about the environment. The exploration-exploitation dilemma is really a trade-off: enjoy more now vs learn more now and enjoy later. You can’t avoid it; all you can do is position yourself somewhere along the spectrum.
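To make the trade-off concrete, here is a minimal sketch of one classic strategy from reinforcement learning, ‘epsilon-greedy’ choice, applied to a toy version of the menu problem. The enjoyment values are invented for illustration, and epsilon-greedy is just one simple way of handling the dilemma, not anything from our experiments.

```python
import random

# Each dish has a 'true' average enjoyment, hidden from the diner,
# who only ever observes noisy samples of it.
TRUE_REWARD = {"fish and chips": 0.6, "chole poori": 0.5,
               "paneer uttappam": 0.8, "khara dosa": 0.7}

def eat(dish):
    """One noisy taste of the chosen dish."""
    return TRUE_REWARD[dish] + random.gauss(0, 0.1)

def epsilon_greedy(epsilon=0.1, n_visits=500):
    estimates = {dish: 0.0 for dish in TRUE_REWARD}
    counts = {dish: 0 for dish in TRUE_REWARD}
    for _ in range(n_visits):
        if random.random() < epsilon:
            # Explore: order at random - possibly a bad meal now,
            # but informative for every later visit.
            dish = random.choice(list(TRUE_REWARD))
        else:
            # Exploit: order the dish currently believed to be best.
            dish = max(estimates, key=estimates.get)
        counts[dish] += 1
        # Incremental update of the running mean enjoyment estimate.
        estimates[dish] += (eat(dish) - estimates[dish]) / counts[dish]
    return estimates

print(epsilon_greedy())
```

Setting epsilon to zero gives pure exploitation (you risk eating fish and chips forever); raising it shifts you along the spectrum towards exploration.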

Because the trade-off is fundamental we would expect to be able to see it in all learning domains, not just restaurant food choices. In work just published, we’ve been using a new task to look at how actions are learnt. Using a joystick we asked people to explore the space of all possible movements, giving them a signal when they made a particular target movement. This task – which we’re pretty keen on – gives us a lens to look at the relation between how people explore the possible movements they can make and which particular movements they learn to rely on to generate predictable outcomes (which we call ‘actions’).

Using data gathered from this task, it is possible to see the exploration-exploitation trade-off in action. With each target, people get 10 attempts to identify the right movement to make. Obviously some successful movements will be more efficient than others, because it is possible to hit the target after going all “round the houses” first, adding lots of extraneous movement and taking longer than needed. If you had a success like this, you could repeat it exactly (‘exploit’), or try to cut out some of the extraneous movement and risk missing the target (‘explore’). This refinement of action through trial and error is of critical interest to anyone who cares about how we learn skilled movements.
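To make the scoring concrete, here is a minimal sketch of the kind of efficiency measure involved: the distance the joystick travels before the target movement is hit. The trajectory format and the exact measure are assumptions for illustration, not necessarily the scoring used in the paper.

```python
import numpy as np

def path_length_score(trajectory):
    """Total distance travelled along a 2D joystick trajectory.

    `trajectory` is an (n, 2) array of x, y samples, ending at the
    point where the target movement was hit; lower scores mean a
    more efficient attempt.
    """
    steps = np.diff(np.asarray(trajectory, dtype=float), axis=0)
    return np.linalg.norm(steps, axis=1).sum()

# A direct hit vs one that goes 'round the houses' to the same endpoint.
direct = [(0, 0), (1, 1), (2, 2)]
meandering = [(0, 0), (1, -1), (0, 1), (2, 0), (2, 2)]
print(path_length_score(direct), path_length_score(meandering))
```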

I calculated an average performance score for the first 50% and the second 50% of attempts (basically a measure of distance travelled before hitting the target – so lower scores mean better performance). I also calculated how variable these performance scores were in the first 50% and the second 50%. Normally we would expect people who perform best in the first half of a test to perform best in the second half (depressingly, people who start out ahead usually stay there!). But this analysis showed up something interesting: a strong correlation between variability in the first half and performance in the second half. You can see this in the graph.

This shows that people who are most inconsistent when they start to learn perform best towards the end of learning. Usually inconsistency is a bad sign, so it is somewhat surprising that it predicts better performance later on. The obvious interpretation is in terms of the exploration-exploitation trade-off: the inconsistent people are trying out more things at the beginning, learning more about what works and what doesn’t, and this gives them the foundation to perform well later on. The pattern holds when comparing across individuals, but it also holds across trials (for the same individual, later performance is better on targets on which they were most inconsistent early in learning).
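For anyone who wants to try the split-half analysis on their own data, here is a minimal sketch using made-up scores (on random data like this the correlation will hover around zero; the effect described above is in the real joystick data):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Fake per-attempt scores: 40 participants x 10 attempts at a target
# (lower = better). Real scores would come from the joystick task.
scores = rng.gamma(shape=2.0, scale=1.0, size=(40, 10))

first, second = scores[:, :5], scores[:, 5:]
early_variability = first.std(axis=1)   # inconsistency over attempts 1-5
late_performance = second.mean(axis=1)  # average score over attempts 6-10

r, p = pearsonr(early_variability, late_performance)
print(f"r = {r:.2f}, p = {p:.3f}")
```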

You can read about this, and more, in our new paper, which is open access over at PLoS One: A novel task for the investigation of action acquisition.

New paper: A novel task for the investigation of action acquisition

Our new paper, A novel task for the investigation of action acquisition, has been published in PLoS One today. The paper describes a new paradigm we’ve been using to investigate how actions are learnt.

It’s a curious fact that although psychologists have thoroughly investigated how actions are valued (i.e. how you figure out how good or bad a thing is to do), and how actions are trained (i.e. shaped and refined over time), the same effort has not gone into investigating how a behaviour is first identified and stored as a part of our repertoire. We hope this task provides a useful tool for opening up this area for investigation.

As well as the basic description of the task, the paper contains a section outlining how the form of learning the task makes available for inspection differs from the forms of learning tapped by other ‘action learning’ tasks (such as, for example, operant conditioning tasks). In addition to serving an under-investigated area of learning research, the task has a number of practical benefits. It is scalable in difficulty, suitable for repeated-measures designs (meaning you can do it again and again – it isn’t something you learn once and then can’t be tested on any more), and adaptable for different species (meaning you can test humans and non-human animals on the task).

The paper is based on work done as part of the EU robotics project I’m on (‘IM-CLeVeR’) and on Tom Walton’s PhD thesis, The Discovery of Novel Actions.