We have been asking the public if they can tell the difference between two cups of tea: one made with milk added first, the other with tea added first.
This experiment is run by Kristian Glenny and Anna Galloway as part of a summer studentship at the University of Sheffield. It was originally performed in the 1920s by one of the founding fathers of modern statistics, Ronald Fisher. We chose to recreate Fisher’s study because we wanted to take a topic that everyone can relate to and use it to illustrate some key features of scientific investigation.
What follows are some of the key considerations that went into designing this experiment.
Okay, so it might seem like we’ve gone a bit overboard on the tea front, but we promise that we haven’t had one cup too many. There is method in our madness: scientific method, that is. We’ve been looking at the public’s beliefs about how tea should be made, and at their actual ability to distinguish different brews.
As scientists, we know that the most effective way to get reliable information is to design a method of systematic measurement. When systematic measurements are combined with interesting comparisons, we have a scientific experiment.
There is an art to designing good experiments. We hope that the tea taste test has tested your beliefs and abilities whilst illustrating some essential principles of experimental design which are used by all scientists.
The key to the tea taste test and all modern experiments is the use of a branch of mathematics called statistics. The tea taste test was used as an example by one of the founding fathers of modern statistics, Ronald Fisher. Fisher taught us that if we can create groups that are different from each other in just one respect (for example, milk first or tea first) then we can use simple mathematics to compare the groups. Any difference can then be clearly attributed to the single thing that is different between the two groups.
If you dropped a kilogram weight off the top of your house and at exactly the same time a friend dropped a ten kilogram weight, which one do you think would land first? We might have all sorts of good reasons for answering this question one way or the other but if we really want to know, at some point we have to try it out and take some measurements. Measurements are at the heart of science. Rather than making assumptions about what is true, we’re investigating people’s beliefs and abilities by making measurements. In this case it is a simple test: can you tell us which cup has milk in first out of a choice of two?
The answers we received were recorded and then analysed for the overall pattern. This data is kept anonymous. In this research we are most interested in how the group responds, not in the differences between individuals. We will use the data from the experiment to check whether people really are able to identify which tea is which, and how this fits with what they say they prefer. One important question is how many correct answers we would need before we were convinced that people really can tell the difference. Anyone has a 50-50 chance of getting it right simply by guessing, so the answer depends on how many people you test. If 1 out of 2 people get it right, that is no evidence that it is possible to tell the difference. But what if 7 out of 10 get it right? Or 65 out of 100? The branch of mathematics dealing with these sorts of questions is called statistics. Getting people’s tea-tasting abilities wrong may seem fairly unimportant, but the same techniques are used in scientific tests of food safety, drug effectiveness and medical procedures: all areas in which mistakes can be vastly expensive, if not fatal.
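To make the "7 out of 10" and "65 out of 100" questions concrete, here is a minimal Python sketch of our own (not part of the original analysis). It assumes every participant is purely guessing with a 50% chance of being right, and uses the binomial distribution to work out how often guessing alone would produce at least that many correct answers. The function name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Chance of k or more correct answers out of n when every answer
    is a pure guess with success probability p (binomial upper tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


for k, n in [(1, 2), (7, 10), (65, 100)]:
    print(f"{k} out of {n} correct by guessing alone: p = {prob_at_least(k, n):.4f}")
```

Under these assumptions, 7 or more correct out of 10 happens by guessing alone 176 times in 1,024 (about 17%), so it would not be convincing, whereas 65 or more out of 100 happens well under 1% of the time.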
No matter how respectable our tea tasters looked, we couldn’t tell them which type of tea they were tasting. We had to be sure that their decision was based on taste alone. If this were all we had done to prevent cheating, our experiment would be described as single-blind, but we were much craftier than that. It has been shown that experimenters can inadvertently change the results of an experiment by influencing the choices made by participants. To avoid this problem, the person handing you the cups of tea also did not know which cup was ‘milk first’ and which was ‘tea first’; our experiment was therefore double-blind.
Imagine if we had given everyone the milk-first tea in a cup with a picture of kittens on it and the tea-first tea in a cup with a picture of houseflies on it. How would we know whether the tea preference was based on when the milk was added or on whether you had drunk out of the horrible fly cup? Now, admittedly, having cups with different pictures on them is a pretty obvious mistake, but there are loads of other things we had to take into account in order to run this experiment under controlled conditions. For example, we always used the same brands of milk and tea, and the tea was always served at the same temperature, with the same ratio of milk to tea, and so on. By going to these lengths we have tried to ensure that, for every participant, the only difference between the two cups was whether the milk was added first or last.
Our prediction is that people won’t be able to tell the difference between the two drinks. However, because the tea in the two cups is so similar, people may make their choice based on something irrelevant, like ‘the cup on the right’. If everyone did this and we always put the milk-first tea on the right, we would wrongly conclude that everyone prefers tea with the milk added first! To avoid problems like this, we used a technique known as counterbalancing, in which we vary the position and tasting order of the cups from one person to the next. If we’ve done this properly, we should end up with equal numbers of people who tasted the milk-first tea first and who tasted the tea-first tea first.
If we had wanted to (and had been persuasive enough), we could have stayed inside our department and invited a bunch of expert wine tasters, chefs and food critics to take part in our tea experiment. Maybe these people could use their expert taste buds to detect differences that most people wouldn’t notice. But this just wouldn’t have made us happy scientists. The problem is that we would have no idea how the rest of the population would respond, and so no idea whether people in general can tell the difference between the two cups of tea. Deciding who to test in an experiment is known as sampling. We used a method known as opportunity sampling: we recruited participants who were conveniently available, in the hope that, on average, the people drinking the tea are fairly typical of the general population. Sampling is extremely important in science, and the sample we choose depends on what we want to find out. Sometimes we might want to test only women, sometimes children, sometimes cancer patients, but in this case we wanted to test the Great British public!
We've now given 94 people the tea taste test and can announce some provisional results.
On average, it seems that people could not tell the difference between tea made with the milk first and tea made with the tea first. Of our sample, 42 people got it right and 52 got it wrong. This is not significantly different from what we would expect if people were just guessing!
Interestingly, people who believed that they would be able to tell the difference were no better than those who believed they would not be able to or were unsure. Both groups performed at chance levels.
Psychologists have found that people are sometimes able to make decisions using information they do not know they have. Was this the case in our taste test? Did people with strong preferences prefer their traditional cuppa even though they didn't believe they could consciously tell the difference?
People's stated preferences were not reflected in the cup of tea that they said they preferred during the taste test.
Overall, our findings suggest that not only can most people not tell the difference, but even those who have strong beliefs or preferences are unable to tell the difference. This is very interesting because it suggests that people have developed beliefs about how they like their tea which are out of step with reality, and that these beliefs persist because people systematically fail to test them.
The methods we've used in the tea taste test can be used in other domains to reveal similar biases and unfounded beliefs.
The results were analysed using SPSS, a program that we use to run statistical tests on our experimental data. In our case we were looking for a non-effect; that is, we predicted that participants would not be able to tell the difference between a cup of tea made with the milk first and a cup of tea made with the tea first. A 'non-significant' result would be consistent with this prediction.
In psychology, we refer to a result as significant when the probability of obtaining a result at least as extreme by chance alone is less than .05, or 5%. If this probability (p) is greater than .05, the result is said to be 'non-significant'. That is, the observed results could easily have occurred by chance, so they provide no evidence of a relationship between the variables of interest.
This 5% convention (often described as 95% confidence) underpins most modern statistics, especially in psychological research. However, there is little formal justification for this particular threshold beyond the fact that Fisher (the original tea taste tester) decided upon it. He regarded it as a convenient cut-off: if a result would occur by chance alone less than 5% of the time, one could be reasonably confident that the variables of interest were related in some way. Even so, when there is no real relationship, about 1 in 20 tests will still come out significant at this threshold, so psychologists are careful to guard against what is termed a ‘Type 1’ error: finding a relationship when in fact none exists. This is why replication and peer review are such important components of psychological research.
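The point that a 5% threshold still produces occasional Type 1 errors can be checked by simulation. The Python sketch below is our own illustration, not part of the study's analysis: it assumes every participant is purely guessing (a 50% chance of being right), simulates many experiments with 94 such guessers, and applies an exact two-sided binomial test to each one (a simpler stand-in for the chi-square test the study actually used). Because nobody in the simulation can really tell the difference, every 'significant' result is a Type 1 error. The function names are hypothetical.

```python
import random
from math import comb


def two_sided_binomial_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for k correct out of n against a
    50% guessing rate. With p = 0.5 the distribution is symmetric, so we
    double the smaller tail (capped at 1)."""
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2**n
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * min(lower, upper))


def type1_error_rate(n_participants: int = 94, n_experiments: int = 10_000,
                     alpha: float = 0.05, seed: int = 1) -> float:
    """Fraction of simulated all-guessing experiments that come out 'significant'."""
    rng = random.Random(seed)
    # Pre-compute the p-value for every possible number of correct answers.
    p_for_k = [two_sided_binomial_p(k, n_participants) for k in range(n_participants + 1)]
    false_positives = 0
    for _ in range(n_experiments):
        correct = sum(rng.random() < 0.5 for _ in range(n_participants))
        if p_for_k[correct] < alpha:
            false_positives += 1
    return false_positives / n_experiments


print(f"Share of 'significant' results when everyone guesses: {type1_error_rate():.3f}")
```

Because an exact test on whole-number counts is discrete, the simulated false-positive rate should come out at or a little below 5% (give or take simulation noise) rather than exactly 5%.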
The statistical test we used to analyse the data is called a chi-square test. It was chosen because it is designed for comparing categorical variables. But what is a categorical variable? A categorical variable sorts people or things into distinct groups, where each one belongs to exactly one group. Eye colour is a good example of this. Another type of variable is known as continuous: these are measured on a scale and so can take an infinite number of possible values. Good examples of this are height and weight. As mentioned previously, the chi-square test examines the relationship between two categorical variables. It calculates the number of people expected in each category if the results were due to chance alone, and compares this with what was actually observed. If the observed counts are significantly different from the expected ones, then the experimental hypothesis is supported and a relationship is proposed.
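The expected-versus-observed comparison can be sketched in a few lines of Python. This is our own illustration using only the standard library, not the SPSS analysis: it applies a chi-square goodness-of-fit test to the provisional counts of 42 correct and 52 incorrect answers from 94 people, where guessing predicts an equal split. The statistic reported for the full sample of 95 was produced in SPSS, so the numbers here will not match it exactly. The function name is hypothetical, and the closed-form p-value used is valid only for one degree of freedom (two categories).

```python
from math import erfc, sqrt


def chi_square_goodness_of_fit(observed: list[int]) -> tuple[float, float]:
    """Chi-square test of two observed counts against an equal split
    (what pure guessing predicts). Returns (statistic, p-value)."""
    if len(observed) != 2:
        raise ValueError("this sketch only handles two categories (df = 1)")
    n = sum(observed)
    expected = n / len(observed)  # equal counts expected under guessing
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


stat, p = chi_square_goodness_of_fit([42, 52])  # provisional counts: correct, incorrect
print(f"chi-square(1, N = 94) = {stat:.2f}, p = {p:.3f}")
```

With an expected 47 people in each category, the statistic is 50/47 (about 1.06) and p is roughly .30, comfortably above .05.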
Requiring people to make a forced-choice decision, even though we thought there would be no actual difference between the tea made with the milk first and the tea made with the tea first, was an important component of our experimental design. This was because we wanted a strong test of people's explicit and implicit abilities. Had we not used a forced-choice paradigm, there may have been some people who would have guessed the correct answer but, lacking confidence in it, would have said they were unsure. By including a forced choice we could fully investigate whether people were truly able to tell the difference between the two methods of tea making, whether this was an implicit or an explicit skill. Explicit skills are skills or knowledge which one is consciously aware of having, whereas implicit skills are skills or knowledge which one possesses but is not consciously aware of having.
Whilst the graph may suggest that more people were wrong than right, there is actually no significant difference between the two groups [χ²(1, N = 95) = .93, p = .334].
In other words, people were as likely to correctly identify the cup of tea made with the milk first as they were to mistakenly choose the cup of tea made with the tea first, meaning that the rate of correct responses was consistent with chance. This suggests that people were unable to tell the difference between the two cups of tea and responded at the rate we would expect if they were simply guessing!
Psychologists are often interested in whether one gender outperforms the other on particular tasks or skills. Whilst the graph above may appear to suggest that females made more incorrect responses than males, this mainly reflects the larger number of female participants in our sample.
In fact, there was no significant effect of gender on the ability to identify the cup of tea made with the milk first [χ²(1, N = 95) = .97, p = .325]. The non-significant p-value provides no evidence that either gender was better or worse than the other, suggesting that tea-tasting ability may not be gender specific.
One question we were particularly interested in was whether those who were more confident in their tea-tasting abilities would be better at identifying the cup of tea made with the milk first than those who were unsure or didn’t think they would be able to. As demonstrated by the graph above, belief had no significant effect on discrimination [χ²(2, N = 95) = .228, p = .892]. Those who considered themselves able to tell the difference between the two cups of tea still performed at chance level, similar to those who thought they wouldn't be able to tell the difference and those who were unsure.
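The gender, belief and preference analyses each cross two categorical variables, which calls for a chi-square test of independence: the count expected in each cell, if the two variables were unrelated, is its row total times its column total divided by the overall total. The Python sketch below is our own illustration of that calculation, not the study's SPSS output. The counts in `toy_table` are invented purely to show the mechanics and are not the study's data, the function name is hypothetical, and the closed-form p-value only covers tables with 1 or 2 degrees of freedom (2×2 or 3×2), matching the tests reported here.

```python
from math import erfc, exp, sqrt


def chi_square_independence(table: list[list[int]]) -> tuple[float, int, float]:
    """Pearson chi-square test of independence for a table of counts
    (rows = groups, columns = outcomes). Returns (statistic, df, p-value)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if group membership and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    # Closed-form upper-tail probabilities; this sketch only covers df = 1 or 2.
    if df == 1:
        p_value = erfc(sqrt(statistic / 2))
    elif df == 2:
        p_value = exp(-statistic / 2)
    else:
        raise ValueError("this sketch only handles 1 or 2 degrees of freedom")
    return statistic, df, p_value


# Toy counts for illustration only (NOT the study's data):
# rows = belief (can tell / cannot tell / unsure), columns = (correct, incorrect)
toy_table = [[10, 12], [15, 17], [14, 18]]
stat, df, p = chi_square_independence(toy_table)
print(f"chi-square({df}) = {stat:.3f}, p = {p:.3f}")
```

A table whose rows are exact multiples of one another gives a statistic of zero and p = 1, which is a handy sanity check on the expected-count formula.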
Another point of interest was whether people’s previously stated preferences would translate into the same choice in our experimental situation. If this were the case, it would suggest that the way the two cups of tea were made mattered. However, as demonstrated by the graph above, there appears to be little relationship between people’s previous preferences and the cup of tea they chose as tasting better during the experiment [χ²(2, N = 95) = .63, p = .728]. In line with our previous findings, this suggests that people were unable to tell the difference between the cup of tea made with the milk first and the cup of tea made with the tea first. This is important because it suggests that people have developed beliefs that are not in line with reality, and that these persist because people fail to test them systematically.
As we thought that, on average, people would be unable to taste the difference between the two cups of tea, there were a number of variables we controlled for to ensure that they did not affect the results. The main biases we considered potentially problematic were that people would implicitly prefer the cup they drank from first, because the first sip is always the most refreshing, or that they would exhibit a left/right bias, as has been found in other studies. In light of this, we used random number lists to ensure that the cup of tea made with the milk first was placed on the left-hand side for 50% of the participants and on the right-hand side for the other 50%. We also told each individual which cup to drink from first, such that 50% of our sample drank from the left-hand cup first and the other 50% drank from the right-hand cup first. This is also known as counterbalancing.
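A random number list like the one described above could be generated as follows. This is a minimal Python sketch under our own assumptions, not the list the study used: it keys each assignment to a hypothetical raffle-ticket number, and it crosses the two factors (which side the milk-first cup goes on, and which cup is drunk first) by shuffling all four combinations within blocks of four. That keeps both factors balanced as participants are recruited; exact 50/50 balance is guaranteed only when the number of participants is a multiple of four.

```python
import random
from itertools import product
from typing import Optional


def counterbalanced_schedule(n_participants: int,
                             seed: Optional[int] = None) -> dict[int, tuple[str, str]]:
    """Map hypothetical raffle-ticket numbers 1..n to a pair
    (side of the milk-first cup, cup to drink from first).

    The four combinations of the two factors are shuffled within blocks
    of four; only the final block can be partial.
    """
    rng = random.Random(seed)
    # ("left", "left"), ("left", "right"), ("right", "left"), ("right", "right")
    combinations = list(product(("left", "right"), repeat=2))
    schedule: list[tuple[str, str]] = []
    while len(schedule) < n_participants:
        block = combinations[:]  # one copy of each combination
        rng.shuffle(block)
        schedule.extend(block)
    schedule = schedule[:n_participants]
    return {ticket: combo for ticket, combo in enumerate(schedule, start=1)}


schedule = counterbalanced_schedule(96, seed=2024)
milk_left = sum(side == "left" for side, _ in schedule.values())
left_first = sum(first == "left" for _, first in schedule.values())
print(f"milk-first cup on the left: {milk_left}/96")  # 48/96
print(f"left-hand cup drunk first: {left_first}/96")  # 48/96
print("ticket 1:", schedule[1])
```

Crossing the two factors means that side and drinking order cannot be confounded with each other, which is a slightly stronger guarantee than balancing each one separately.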
As demonstrated by the graph above, there was no significant relationship between the first cup that participants drank from and the preference they then stated [χ²(1, N = 95) = .55, p = .457]. This suggests that participants did not base their decision on which cup they drank from first.
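As a quick consistency check on the reported statistics, the p-values can be recomputed from the chi-square values and degrees of freedom quoted in the results above. The Python sketch below is our own, using the closed-form chi-square tail probabilities for 1 and 2 degrees of freedom rather than SPSS, and the function name is hypothetical. Small discrepancies in the third decimal place are expected because the published statistics are rounded.

```python
from math import erfc, exp, sqrt


def chi_square_p(statistic: float, df: int) -> float:
    """Upper-tail p-value of a chi-square statistic, closed form for df = 1 or 2."""
    if df == 1:
        return erfc(sqrt(statistic / 2))
    if df == 2:
        return exp(-statistic / 2)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")


# (analysis, reported chi-square, df, reported p) as quoted in the results
reported = [
    ("correct vs incorrect", 0.93, 1, 0.334),
    ("gender", 0.97, 1, 0.325),
    ("belief", 0.228, 2, 0.892),
    ("stated preference", 0.63, 2, 0.728),
    ("first cup drunk", 0.55, 1, 0.457),
]
for label, stat, df, p_reported in reported:
    print(f"{label}: recomputed p = {chi_square_p(stat, df):.3f} (reported {p_reported})")
```

All five recomputed values agree with the reported ones to within about .002, consistent with rounding of the published statistics.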
This project was funded by the SURE summer studentship scheme run by CILASS at the University of Sheffield. It has ethical approval from the department's ethics committee.
It was run by:
Having extolled the virtue and importance of testing our beliefs and biases, we strongly suggest you take the tea taste test wherever you are and let us know your results! You can apply the same method to other things (can you tell the difference between different brands of cola, or different kinds of flapjack, for example?). The taste test is a template for one way in which we can investigate whether our beliefs are supported by reality, or whether we have allowed spurious beliefs and biases to arise.
To create your very own tea taste test all you will need is:
Before the tea tasting starts it may be useful to explain to those who are taking part why the taste test is relevant to psychology and briefly explain the importance of the original experiment. The experimenter should also ask the first three questions on the questionnaire, which can be found on the website as part of the results section:
As each participant is given their two cups of tea, you must tell them which cup to drink from first. To find this out, match the number the participant pulled out of the hat with the same number on your list in the random number tables below. You should also tell the participants that you need them to think about which cup they preferred and which one they think had the milk added first.
Once the participants have reached a decision, they should give their answers to the final two questions: which cup of tea they preferred and which they believe was made with the milk first. Following this, the tea maker is at liberty to reveal to each participant which of their cups had the milk added first. The experimenter should then explain what the aim of this experiment was (to test our everyday biases and find out whether they are based on reality), explain what they expected to find (that there is no difference in taste between the two methods of making tea, and so, on average, people are correct only 50% of the time, which is chance level) and ask whether anyone has questions about the experiment.
The tea maker should choose a place to make the tea that is not visible to either the participants or the experimenter, in order to ensure a double-blind set-up. This way the experimenter cannot affect the choices of the participants, and the participants have no idea which cup of tea is made with the milk first and which is made with the tea first. Once each participant has picked a raffle ticket, the tea maker should note down the numbers the participants pulled out. The tea maker must then match each number to one on their list, which tells them on which side to place the cup of tea made with the milk first.
The protocol for making the tea can be decided between the experimenter and the tea maker, but it must remain constant throughout the trials. This is the method we used as part of our experiment:
This experiment provides an excellent opportunity to discuss some key concepts in scientific investigation, namely single- and double-blind experiments, controlled conditions, sampling techniques and counterbalancing. Throughout the tea tasting, the experimenter may feel able to explain what these concepts are and why they are so vital to scientific study. A more in-depth explanation of these concepts can be found in 'The Science' section of the website.
Department of Psychology
University of Sheffield