Understanding Statistical Inference

Inference is THE big idea of statistics. This is where people come unstuck. Most people can accept the use of summary descriptive statistics and graphs. They can understand why data is needed. They can see that the way a sample is taken may affect how things turn out. They often understand the need for control groups. Most statistical concepts or ideas are readily explainable. But inference is a tricky, tricky idea. Well actually – it doesn’t need to be tricky, but the way it is generally taught makes it tricky.

Procedural competence with zero understanding

I cast my mind back to my first encounter with confidence intervals and hypothesis tests. I learned how to calculate them (by hand  – yes I am that old) but had not a clue what their point was. Not a single clue. I got an A in that course. This is a common occurrence. It is possible to remain blissfully unaware of what inference is all about, while answering procedural questions in exams correctly.

But, thanks to the research and thinking of a lot of really smart and dedicated statistics teachers, we are able to put a stop to that. And we must.

We need to explicitly teach what statistical inference is. Students do not learn to understand inference by doing calculations. We need to revisit the ideas behind inference frequently. The process of hypothesis testing is counter-intuitive, and so confusing that it spills its confusion over into the concept of inference. Confidence intervals are less confusing, so they are a better intermediate point for understanding statistical inference. But we need to start with the concept of inference.

What is statistical inference?

The idea of inference is actually not that tricky if you unbundle the concept from the application or process.

The concept of statistical inference is this –

We want to know stuff about a large group of people or things (a population). We can’t ask or test them all so we take a sample. We use what we find out from the sample to draw conclusions about the population.

That is it. Now was that so hard?

Developing understanding of statistical inference in children

I have found the paper by Makar and Rubin, presenting a “framework for thinking about informal statistical inference”, particularly helpful. In this paper they summarise studies done with children learning about inference. They suggest that “ three key principles … appeared to be essential to informal statistical inference: (1) generalization, including predictions, parameter estimates, and conclusions, that extend beyond describing the given data; (2) the use of data as evidence for those generalizations; and (3) employment of probabilistic language in describing the generalization, including informal reference to levels of certainty about the conclusions drawn.” This can be summed up as Generalisation, Data as evidence, and Probabilistic Language.

We can lead into informal inference early on in the school curriculum. The key ideas in the NZ curriculum suggest that “teachers should be encouraging students to read beyond the data, e.g. ‘If a new student joined our class, how many children do you think would be in their family?’” In other words, though we don’t specifically use the terms population and sample, we can conversationally draw attention to what we learn from this set of data, and how that might relate to other sets of data.

Explaining directly to adults

When teaching adults we may use a more direct approach, explaining explicitly alongside experiential learning, to develop understanding of inference. We have just made a video: Understanding Inference. In the video we present three basic ideas condensed from the Five Big Ideas in the very helpful book published by NCTM, “Developing Essential Understanding of Statistics, Grades 9–12” by Peck, Gould, Miller and Zbiek.

Ideas underlying inference

  • A sample is likely to be a good representation of the population.
  • There is an element of uncertainty as to how well the sample represents the population.
  • The way the sample is taken matters.

These ideas help to provide a rationale for thinking about inference, and allow students to justify what has often been assumed or taught mathematically. In addition several memorable examples involving apples, chocolate bars and opinion polls are provided. This is available for free use on YouTube. If you wish to have access to more of our videos than are available there, do email me at n.petty@statslc.com.

Please help us develop more great resources

We are currently developing exciting, innovative materials to help students at all levels of the curriculum to understand and enjoy statistical analysis. We would REALLY appreciate it if any readers here today would help us out by answering this survey about fast food and dessert. It will take 10 minutes at most. We don’t mind what country you are from, and will do the currency conversions. In a few months I will let you know how we got on. We would also love you to forward it to your friends and students to fill out – the more the merrier! It is an example of a well-designed questionnaire, with a meaningful purpose.



Don’t teach significance testing – Guest post

The following is a guest post by Tony Hak of Rotterdam School of Management. I know Tony would love some discussion about it in the comments. I remain undecided either way, so would like to hear arguments.


It is now well understood that p-values are not informative and are not replicable. Soon null hypothesis significance testing (NHST) will be obsolete and will be replaced by the so-called “new” statistics (estimation and meta-analysis). This means that undergraduate courses in statistics must already be teaching estimation and meta-analysis as the preferred way to present and analyze empirical results. If not, the statistical skills of graduates from these courses will be outdated on the day they leave school. But it is less evident whether or not NHST (though not preferred as an analytic tool) should still be taught. Because estimation is already routinely taught as a preparation for the teaching of NHST, the necessary reform in teaching will not require the addition of new elements to current programs, but rather the removal of the current emphasis on NHST, or the complete removal of the teaching of NHST from the curriculum. The current trend is to continue the teaching of NHST. In my view, however, teaching of NHST should be discontinued immediately because it is (1) ineffective, (2) dangerous, and (3) it serves no aim.

1. Ineffective: NHST is difficult to understand and it is very hard to teach it successfully

We know that even good researchers often do not appreciate the fact that NHST outcomes are subject to sampling variation, and believe that a “significant” result obtained in one study almost guarantees a significant result in a replication, even one with a smaller sample size. Is it then surprising that our students also do not understand what NHST outcomes do and do not tell us? In fact, statistics teachers know that the principles and procedures of NHST are not well understood by undergraduate students who have successfully passed their courses on NHST. Courses on NHST fail to achieve their self-stated objectives, assuming that these objectives include a correct understanding of the aims, assumptions, and procedures of NHST as well as a proper interpretation of its outcomes. It is very hard indeed to find a comment on NHST in any student paper (an essay, a thesis) that is close to a correct characterization of NHST or its outcomes. There are many reasons for this failure, but obviously the most important one is that NHST is a very complicated and counterintuitive procedure. It requires students and researchers to understand that a p-value is attached to an outcome (an estimate) based on its location in (or relative to) an imaginary distribution of sample outcomes around the null. Another reason, connected to their failure to understand what NHST is and does, is that students believe that NHST “corrects for chance” and hence cannot cognitively accept that p-values themselves are subject to sampling variation (i.e. chance).

2. Dangerous: NHST thinking is addictive

One might argue that there is no harm in adding a p-value to an estimate in a research report and, hence, that there is no harm in teaching NHST in addition to teaching estimation. However, the mixed experience with statistics reform in clinical and epidemiological research suggests that a more radical change is needed. Reports of clinical trials and of studies in clinical epidemiology now usually report estimates and confidence intervals, in addition to p-values. However, as Fidler et al. (2004) have shown, and contrary to what one would expect, authors continue to discuss their results in terms of significance. Fidler et al. therefore concluded that “editors can lead researchers to confidence intervals, but can’t make them think”. This suggests that a successful statistics reform requires a cognitive change that should be reflected in how results are interpreted in the Discussion sections of published reports.

The stickiness of dichotomous thinking can also be illustrated by the results of a more recent study by Coulson et al. (2010). They presented estimates and confidence intervals obtained in two studies to a group of researchers in psychology and medicine, and asked them to compare the results of the two studies and to interpret the difference between them. It appeared that a considerable proportion of these researchers first used the information about the confidence intervals to make a decision about the significance of the results (in one study) or the non-significance of the results (of the other study), and then drew the incorrect conclusion that the results of the two studies were in conflict. Note that no NHST information was provided and that participants were not asked in any way to “test” or to use dichotomous thinking. The results of this study suggest that NHST thinking can (and often will) be used by those who are familiar with it.

The fact that it appears to be very difficult for researchers to break the habit of thinking in terms of “testing” is, as with every addiction, a good reason to prevent future researchers from coming into contact with it in the first place and, if contact cannot be avoided, to provide them with robust resistance mechanisms. The implication for statistics teaching is that students should first learn estimation as the preferred way of presenting and analyzing research information, and should be introduced to NHST, if at all, only after estimation has become their routine statistical practice.

3. It serves no aim: Relevant information can be found in research reports anyway

Our experience that the teaching of NHST consistently fails its own aims (because NHST is too difficult to understand), and the fact that NHST appears to be dangerous and addictive, are two good reasons to stop teaching NHST immediately. But there is a seemingly strong argument for continuing to introduce students to NHST, namely that a new generation of graduates will not be able to read the (past and current) academic literature in which authors themselves routinely focus on the statistical significance of their results. It is suggested that someone who does not know NHST cannot correctly interpret outcomes of NHST practices. This argument has no value, for the simple reason that it assumes that NHST outcomes are relevant and should be interpreted. But the reason we are having the current discussion about teaching is precisely that NHST outcomes are at best uninformative (beyond the information already provided by estimation) and at worst misleading or plain wrong. The point all along is that nothing is lost by simply ignoring the NHST-related information in a research report and focusing only on the information provided about the observed effect size and its confidence interval.


Coulson, M., Healy, M., Fidler, F., & Cumming, G. (2010). Confidence Intervals Permit, But Do Not Guarantee, Better Inference than Statistical Significance Testing. Frontiers in Quantitative Psychology and Measurement, 20(1), 37-46.

Fidler, F., Thomason, N., Finch, S., & Leeman, J. (2004). Editors Can Lead Researchers to Confidence Intervals, But Can’t Make Them Think: Statistical Reform Lessons from Medicine. Psychological Science, 15(2), 119-126.

This text is a condensed version of the paper “After Statistics Reform: Should We Still Teach Significance Testing?” published in the Proceedings of ICOTS9.


The silent dog – null results matter too!

Recently I was discussing the process we use in a statistical enquiry. The ideal is that we start with a problem and follow the statistical enquiry cycle through the steps Problem, Plan, Data collection, Analysis and Conclusion, which then may lead to other enquiries. 
I have previously written a post suggesting that the cyclical nature of the process was overstated.

The context of our discussion was a video I am working on, that acknowledges that often we start, not at the beginning, but in the middle, with a set of data. This may be because in an educational setting it is too expensive and time consuming to require students to collect their own data. Or it may be that as statistical consultants we are brought into an investigation once the data has been collected, and are needed to make some sense out of it. Whatever the reason, it is common to start with the data, and then loop backwards to the Problem and Plan phases, before performing the analysis and writing the conclusions.

Looking for relationships

We, a group of statistical educators, were suggesting what we would do with a data set, which included looking at the level of measurement, the origins of the data, and the possible intentions of the people who collected it. One teacher suggests to her students that they do exploratory scatter plots of all the possible pairings, as well as comparative dotplots and boxplots. The students can then choose a problem that is likely to show a relationship – because they have already seen that there is a relationship in the data.

I have a bit of a problem with this. It is fine to get an overview of the relationships in the data – that is one of the beauties of statistical packages. And I can see that for an assignment it is more rewarding for students to have a result they can discuss. If they get a null result there is a tendency to think that they have failed. Yet the lack of evidence of a relationship may be more important than evidence of one. The problem is that we value positive results over null results. This is a known problem in academic journals, and many words have been written about the over-occurrence of type 1 errors, or publication bias.

Let me illustrate. A drug manufacturer hopes that drug X is effective in treating depression. In reality drug X is no more effective than a placebo. The manufacturer keeps funding different tests by different scientists. If all the experiments use a significance level of 0.05, then about 5% of the experiments will produce a type 1 error and say that there is an effect attributable to drug X. These (false) positive results get published, because academic journals prefer positive results to null results. Conversely, the much larger number of researchers who correctly concluded that there is no relationship do not get published, and the abundance of evidence to the contrary is invisible. To be fair, it is hoped that these researchers will be able to refute the false positive paper.
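As a rough illustration of the arithmetic involved, here is a minimal simulation sketch (my own invented set-up, not anything from the manufacturer example): when a drug truly does nothing, about 5% of independent trials will still come out “significant” at the 0.05 level – the pool from which positive-only publication draws.

```python
# Sketch: simulate many trials of a drug with no real effect and count the false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n_patients = 1000, 50

false_positives = 0
for _ in range(n_trials):
    drug = rng.normal(0, 1, n_patients)      # drug X: no real effect
    placebo = rng.normal(0, 1, n_patients)   # placebo: same distribution
    _, p = stats.ttest_ind(drug, placebo)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_trials} null experiments were 'significant'")
# Expect roughly 50, i.e. about 5% of experiments, even though nothing is going on.
```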

Let them see null results

So where does this leave us as teachers of statistics? Awareness is a good start. We need to show null effects and why they are important. For every example we give that ends up rejecting the null hypothesis, we need to have an example that does not. Textbooks tend to over-include results that reject the null, so that when a student meets a non-significant result they are left wondering whether they have made a mistake. In my preparation of learning materials, I endeavour to keep a good spread of results – strongly positive, weakly positive, inconclusive, weakly negative and strongly negative. This way students are accepting of a null result, and know what to say when they get one.

Another example is in the teaching of time series analysis. We love to show series with strong seasonality. It tells a story. (See my post about time series analysis as storytelling.) Retail sales nearly all peak in December, and various goods have other peaks. Jewellery retail sales in the US have small peaks in February and May, and it is fun working out why. Seasonal patterns seem like magic. However, we also need to let students analyse data that does not have a strong seasonal pattern, so that they learn that such series exist too!

My final research project before leaving the world of academia involved an experiment on the students in my class of over 200. It was difficult to get through the human ethics committee, but it made it through in the end. The students were divided into two groups, and half were followed up by tutors weekly if they were not keeping up with assignments and testing. The other half were left to their own devices, as had previously been the case. The interesting result was that it made no difference to the pass rate of the students. In fact the proportion of passes was almost identical. This was a null result. I had supposed that following up and helping students to keep up would increase their chances of passing the course. But it didn’t. This important result saved us money in terms of tutor input in following years. Though it felt good to be helping our students more, it didn’t actually help them pass, so was not justifiable in straitened financial times.

I wonder if it would have made it into a journal.

By the way, my reference to the silent dog in the title is to the famous Sherlock Holmes story, Silver Blaze, where the fact that the dog did not bark was important as it showed that the person was known to it.

Proving causation

Aeroplanes cause hot weather

In Christchurch we have a weather phenomenon known as the “Nor-wester”, which is a warm dry wind, preceding a cold southerly change. When the wind is from this direction, aeroplanes make their approach to the airport over the city. Our university is close to the airport in the direct flightpath, so we are very aware of the planes. A new colleague from South Africa drew the amusing conclusion that the unusual heat of the day was caused by all the planes flying overhead.

Statistics experts and educators spend a lot of time refuting claims of causation. “Correlation does not imply causation” has become a catch cry of people trying to avoid the common trap. This is a great advance in understanding that even journalists (notoriously math-phobic) seem to have caught on to. My own video on important statistical concepts ends with the causation issue. (You can jump to it at 3:51.)

So we are aware that it is not easy to prove causation.

In order to prove causation we need a randomised experiment. We need to randomise any possible factor that could be associated with, and thus cause or contribute to, the effect.

There is also the related problem of generalisability. If we do have a randomised experiment, we can prove causation. But unless the sample is also a random, representative sample of the population in question, we cannot infer that the results will transfer to that wider population. This is nicely illustrated in this matrix from The Statistical Sleuth by Fred L. Ramsey and Daniel W. Schafer.

The relationship between the type of sample and study and the conclusions that may be drawn.


The top left-hand quadrant is the one in which we can draw causal inferences for the population.

Causal claims from observational studies

A student posed this question:  Is it possible to prove a causal link based on an observational study alone?

It would be very useful if we could. It is not always possible to use a randomised trial, particularly when people are involved. Before we became more aware of human rights, experiments were performed on unsuspecting human lab rats. A classic example is the Vipeholm experiments where patients at a mental hospital were the unknowing subjects. They were given large quantities of sweets in order to determine whether sugar caused cavities in teeth. This happened into the early 1950s. These days it would not be acceptable to randomly assign people to groups who are made to smoke or drink alcohol or consume large quantities of fat-laden pastries. We have to let people make those lifestyle choices for themselves. And observe. Hence observational studies!

There is a call for “evidence-based practice” in education to follow the philosophy in medicine. But getting educational experiments through ethics committee approval is very challenging, and it is difficult to use rats or fruit-flies to impersonate the higher learning processes of humans. The changing landscape of the human environment makes it even more difficult to perform educational experiments.

To find out the criteria for justifying causal claims in an observational study I turned to one of my favourite statistics text-books, Chance Encounters by Wild and Seber  (page 27). They cite the Surgeon General of the United States. The criteria for the establishment of a cause and effect relationship in an epidemiological study are the following:

  1. Strong relationship: For example, illness is four times as likely among people exposed to a possible cause as it is for those who are not exposed.
  2. Strong research design.
  3. Temporal relationship: The cause must precede the effect.
  4. Dose-response relationship: Higher exposure leads to a higher proportion of people affected.
  5. Reversible association: Removal of the cause reduces the incidence of the effect.
  6. Consistency: Multiple studies in different locations produce similar effects.
  7. Biological plausibility: There is a supportable biological mechanism.
  8. Coherence with known facts.

Teaching about causation

In high school, and in entry-level statistics courses, the focus is often on statistical literacy. The concept of causation is pivotal to a correct understanding of what statistics can and cannot claim. It is worth spending some time in the classroom discussing what would constitute reasonable proof and what would not. In particular it is worthwhile to come up with alternative explanations for common fallacies, or even truths, in causation. Some examples for discussion might be drink-driving and accidents, smoking and cancer, gender and success in any number of areas, home game advantage in sport, and the use of lucky charms, socks and undies. This also links nicely with probability theory, helping to tie the year’s curriculum together.

Parts and whole

The whole may be greater than the sum of the parts, but the whole still needs those parts. A reflective teacher will think carefully about when to concentrate on the whole, and when on the parts.


If you were teaching someone golf, you wouldn’t spend days on a driving range, never going out on a course. Your student would not get the idea of what the game is, or why they need to be able to drive straight and to a desired length. Nor would it be much fun! Similarly, if the person only played games of golf, it would be difficult for them to develop their game. Practice at driving and putting is needed. A serious student of golf would also read about the game and watch expert golfers.


Learning music is similar. Anyone who is serious about developing as a musician will spend a considerable amount of time developing their technique and their knowledge by practicing scales, chords and drills. But at the same time they need to be playing full pieces of music so that they feel the joy of what they are doing. As they play music, as opposed to drill, they will see how their less-interesting practice has helped them to develop their skills. However, as they practice a whole piece, they may well find a small part that is tripping them up, and focus for a while on that. If they play only the piece as a whole, it is not efficient use of time. A serious student of music will also listen to and watch great musicians, in order to develop their own understanding and knowledge.

Benefits of study of the whole and of the parts

In each of these examples we can see that there are aspects of working with the whole, and aspects of working with the parts. Study of the whole contributes perspective and meaning, and helps to tie things together. It also helps students to see where they have made progress. Study of the parts isolates areas of weakness, develops skills and saves time in practice, and is thus more efficient.

It is very important for students to get an idea of the purpose of their study, and where they are going. For this reason I have written earlier about the need to see the end when starting out in a long procedure such as a regression or linear programming model.

It is also important to develop “statistical muscle memory” by repeating small parts of the exercise over and over until it is mastered. Practice helps people to learn what is general and what is specific in the different examples.

Teaching conditional probability

We are currently developing a section on probability as part of our learning materials. A fundamental understanding of probability and uncertainty is essential to a full understanding of inference. When we look at statistical evidence from data, we are holding it up against what we could reasonably expect to happen by chance, which involves a probability model. Probability lies in the more mathematical area of the study of statistics, and has some fun problem-solving aspects to it.

A popular exam question involves conditional probability. We like to use a table approach to this as it avoids many of the complications of terminology. I still remember my initial confusion over the counter-intuitive expression P(A|B), which means the probability that an object from subset B has the property A. There are several places where students can come unstuck in Bayesian revision, and the problems can take a long time. We can liken solving a conditional probability problem to a round of golf, or a long piece of music. So what we do in teaching is first take the students step by step through the whole problem. This includes working out what the words are saying, putting the known values into a table, calculating the unknown values in the table, and then using the table to answer the questions involving conditional probability.
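As a concrete illustration of the table approach, here is a tiny sketch with made-up numbers (mine, purely for illustration): fill a table of counts, then read the conditional probability straight off it.

```python
# Invented scenario: 10% of widgets come from machine B, 20% of machine B's widgets
# are faulty, versus 5% of the rest. Build a table of counts per 1000 widgets,
# then read off P(from B | faulty).
total = 1000
from_B = 0.10 * total                  # 100 widgets from machine B
from_other = total - from_B            # 900 widgets from other machines
faulty_B = 0.20 * from_B               # 20 faulty from B
faulty_other = 0.05 * from_other       # 45 faulty from the rest

p_B_given_faulty = faulty_B / (faulty_B + faulty_other)
print(round(p_B_given_faulty, 3))      # 0.308 -- the conditional probability, straight from the table
```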

Then we work on the individual steps, isolating them so that students can get sufficient practice to find out what is general and what is specific to different examples. As we do this we endeavour to provide enough variety that students do not work out a heuristic based on the wording of the question that actually stops them from understanding. An example of this is that if we use the same template each time, students will work out that the first number stated goes in a certain place in the table, the second in another place, and so on. This is a short-term strategy that we need to protect them from through careful generation of questions.

As it turns out students should already have some of the necessary skills. When we review probability at the start of the unit, we get students to calculate probabilities from tables of values, including conditional probabilities. Then when they meet them again as part of the greater whole, there is a familiar ring.

Once the parts are mastered, the students can move on to a set of full questions, using each of the steps they have learned, and putting them back into the whole. Because they are fluent in the steps, it becomes more intuitive to put the whole back together, and when they meet something unusual they are better able to deal with it.

Starting a course in Operations Research/Management Science

It is interesting to contemplate what “the whole” is with regard to any subject. In operations research we used to begin our first class, like many first classes, talking about what management science/operations research is. It was a pretty passive sort of class, and I felt it didn’t help, as first-year university students had little relevant knowledge to pin the ideas on. So we changed to an approach that put them straight into the action and taught several weeks of techniques first. We started with project management and taught the critical path method. Then we taught identifying fixed and variable costs, and break-even analysis. The next week was discounting and the analysis of financial projects. Then, for a softer example, we looked at multi-criteria decision-making (MCDM). It tied back to the previous week by taking a different approach to a decision regarding a landfill. Then we introduced OR/MS and the concept of mathematical modelling. By then we could give real examples of how mathematical models could be used to inform real-world problems. It was helpful to go from the concrete to the abstract. This was a much more satisfactory approach.

So the point is not that you should always start with the whole and then do the parts and then go back to the whole. The point is that a teacher needs to think carefully about the relationship between the parts and the whole, and teach in a way that is most helpful.

Why engineers and poets need to know about statistics

I’m kidding about poets. But lots of people need to understand the three basic areas of statistics, Chance, Data and Evidence.

Recently Tony Greenfield, an esteemed applied statistician, (with his roots in Operations Research) posted the following request on a statistics email list:

“I went this week to the exhibition and conference in the NEC run by The Engineer magazine. There were CEOs of engineering companies of all sizes, from small to massive. I asked a loaded question:  “Why should every engineer be a competent applied statistician?” Only one, from more than 100 engineers, answered: “We need to analyse any data that comes along.” They all seemed bewildered when I asked if they knew about, or even used, SPC and DoE. I shall welcome one paragraph responses to my question. I could talk all day about it but it would be good to have a succinct and powerful few words to use at such a conference.”

For now I will focus on civil engineers, as they are often what people think of as engineers. I’m not sure about the “succinct and powerful” nature of the words to follow, but here goes…

The subject of statistics can be summarised as three areas – chance, data and evidence (CDE!)

Chance includes the rules and perceptions of probability, and emphasises the uncertainty in our world. I suspect engineers are more at home in a deterministic world, but determinism is just a model of reality. The strength of a bar of steel is not exact, but will be modelled with a probability distribution. An understanding of probability is necessary before using terms such as “one hundred year flood”. Expected values are used for making decisions on improving roads and intersections. The capacity of stadiums and malls, and the provision of toilets and exits all require modelling that relies on probability distributions. It is also necessary to have some understanding of our human fallibility in estimating and communicating probability. Statistical process control accounts for acceptable levels of variation, and indicates when they have been exceeded.
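As a small aside, a quick calculation (my own illustrative numbers) shows why a term like “one hundred year flood” needs this probabilistic care: a 1-in-100 chance each year does not mean one flood, neatly on schedule, per century.

```python
# Sketch: probability of at least one "hundred year" flood over an assumed 30-year design life.
p_per_year = 1 / 100
years = 30  # assumed design life, purely for illustration

p_at_least_one = 1 - (1 - p_per_year) ** years
print(round(p_at_least_one, 2))   # about 0.26 -- roughly a one-in-four chance within 30 years
```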

The Data aspect of the study of statistics embraces the collection, summary and communication of data. In order to make decisions, data must be collected. Correct summary measures must be used, often the median, rather than the more popular mean. Summary measures should preferably be expressed as confidence intervals, thus communicating the level of precision inherent in the data. Appropriate graphs are needed, which seldom includes pictograms or pie charts.

Evidence refers to the inferential aspects of statistical analysis. The theories of probability are used to evaluate whether a certain set of data provides sufficient evidence to draw conclusions. An engineer needs to understand the use of hypothesis testing and the p-value in order to make informed decisions regarding data. Any professional in any field should be using evidence-based practice, and journal articles providing evidence will almost always refer to the p-value. They should also be wary of claims of causation, and understand the difference between strength of effect and strength of evidence. Our video provides a gentle introduction to these concepts.

Design of Experiments also incorporates the Chance, Data and Evidence aspects of the discipline of statistics. By randomising the units in an experiment we can control for the extraneous elements that would otherwise affect the outcome, as they do in an observational study. Engineers should be at home with these concepts.

So, Tony, how was that? Not exactly succinct, and four paragraphs rather than one. I think the Chance, Data, Evidence framework helps provide structure to the explanation.

So what about the poets?

I borrow the term from Peter Bell of Richard Ivey School of Business, who teaches operations research to MBA students, and wrote a paper, Operations Research For Everyone (including poets). If it is difficult to get the world to recognise the importance of statistics, how much harder is it to convince them that Operations Research is vital to their well-being!

Bell uses the term, “poet” to refer to students who are not naturally at home with mathematics. In conversation Bell explained how many of his poets, who were planning to work in the area of human resource management found their summer internships were spent elbow-deep in data, in front of a spreadsheet, and were grateful for the skills they had resisted gaining.

An understanding of chance, data and evidence is useful/essential for “efficient citizenship”, to paraphrase the often paraphrased H. G. Wells. I have already written on the necessity for journalists to have an understanding of statistics. The innovative New Zealand curriculum recognises the importance of an understanding of statistics for all. There are numerous courses dedicated to making sure that medical practitioners have a good understanding.

So really, there are few professions or trades that would not benefit from a grounding in Chance, Data and Evidence. And Operations Research too, but for now that may be a bridge too far.

Confidence Intervals: informal, traditional, bootstrap

Confidence Intervals

Confidence intervals are needed because there is variation in the world. Nearly all natural, human or technological processes result in outputs which vary to a greater or lesser extent. Examples of this are people’s heights, students’ scores in a well written test and weights of loaves of bread. Sometimes our inability or lack of desire to measure something down to the last microgram will leave us thinking that there is no variation, but it is there. For example we would check the weights of chocolate bars to the nearest gram, and may well find that there is no variation. However if we were to weigh them to the nearest milligram, there would be variation. Drug doses have a much smaller range of variation, but it is there all the same.

You can see a video about some of the main sources of variation – natural, explainable, sampling and due to bias.

When we wish to find out about a phenomenon, the ideal would be to measure all instances. For example we can find out the heights of all students in one class at a given time. However it is impossible to find out the heights of all people in the world at a given time. It is even impossible to know how many people there are in the world at a given time. Whenever it is impossible or too expensive or too destructive or dangerous to measure all instances in a population, we need to take a sample. Ideally we will take a sample that gives each object in the population an equal likelihood of being chosen.

You can see a video here about ways of taking a sample.

When we take a sample there will always be error. It is called sampling error. We may, by chance, get exactly the same value for our sample statistic as the “true” value that exists in the population. However, even if we do, we won’t know that we have.

The sample mean is the best estimate for the population mean, but we need to say how well it is estimating the population mean. For example, say we wish to know the mean (or average) weight of apples in an orchard. We take a sample and find that the mean weight of the apples in the sample  is 153g. If we only took a few apples, it is only a rough idea and we might say we are pretty sure the mean weight of the apples in the orchard is between 143g and 163g. If someone else took a bigger sample, they might be able to say that they are pretty sure that the mean weight of apples in the orchard is between 158g and 166g. You can tell that the second confidence interval is giving us better information as the range of the confidence interval is smaller.

There are two things that affect the width of a confidence interval. The first is the sample size. If we take a really large sample we are getting a lot more information about the population, so our confidence interval will be more exact, or narrower. It is not a one-to-one relationship, but a square-root relationship. If we wish to reduce the width of the confidence interval by a factor of two, we will need to increase our sample size by a factor of four.
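A quick sketch of that square-root relationship is shown below, using an invented standard deviation for apple weights; the exact numbers don’t matter, only the pattern that quadrupling the sample size roughly halves the width.

```python
# Sketch: approximate 95% interval width 2 * z * s / sqrt(n) for increasing sample sizes.
import math

s = 20.0            # assumed standard deviation of apple weights, in grams (invented)
z = 1.96            # approximate 95% multiplier

for n in (25, 100, 400):
    width = 2 * z * s / math.sqrt(n)
    print(n, round(width, 1))   # 15.7, 7.8, 3.9 -- each quadrupling of n halves the width
```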

The second thing to affect the width of a confidence interval is the amount of variation in the population. If all the apples in the orchard are about the same weight, then we will be able to estimate that weight quite accurately. However, if the apples are all different sizes, then it will be harder to be sure that the sample represents the population, and we will have a larger confidence interval as a result.

Three ways to find confidence intervals

Traditional (old-fashioned?) Approach

The standard way of calculating confidence intervals is by using formulas developed on the assumptions of normality and the Central Limit Theorem. These formulas are used to calculate the confidence intervals of means, proportions and slopes, but not for medians or standard deviations. That is because there aren’t nice straight-forward formulas for these. The formulas were developed when there were no computers, and analytical methods were needed in the absence of computational power.

In terms of teaching, these formulas are straight-forward, and also include the concept of level of confidence, which is part of the paradigm. You can see a video teaching the traditional approach to confidence intervals, using Excel to calculate the confidence interval for a mean.
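For readers who prefer code to a spreadsheet, here is a minimal sketch of the traditional formula for a mean (mean plus or minus t times the standard error), with invented apple weights; it is an illustration only, not a substitute for the video.

```python
# Sketch of the traditional confidence interval for a mean: mean +/- t * s / sqrt(n).
import numpy as np
from scipy import stats

weights = np.array([148, 155, 161, 149, 153, 158, 150, 147, 156, 152])  # grams, invented
n = len(weights)
mean = weights.mean()
se = weights.std(ddof=1) / np.sqrt(n)        # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)        # t multiplier for a 95% interval

print(f"95% CI: {mean - t_crit * se:.1f}g to {mean + t_crit * se:.1f}g")
```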

Rule of Thumb

In the New Zealand curriculum at year 12, students are introduced to the concept of inference using an informal method for calculating a confidence interval. The formula is median +/- 1.5 times the interquartile range, divided by the square root of the sample size. There is a similar formula for proportions.
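Here is a small sketch of that informal interval in code, using the same invented apple weights as above (the proportion version is not shown).

```python
# Sketch of the year 12 informal interval: median +/- 1.5 * IQR / sqrt(n).
import numpy as np

def informal_interval(x):
    x = np.asarray(x)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    half_width = 1.5 * (q3 - q1) / np.sqrt(len(x))
    return median - half_width, median + half_width

print(informal_interval([148, 155, 161, 149, 153, 158, 150, 147, 156, 152]))
```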


Bootstrapping

Bootstrapping is a very versatile way to find a confidence interval. It has three strengths:

  1. It can be used to calculate the confidence interval for a large range of different parameters.
  2. It uses ALL the information the sample gives us, rather than just the summary values.
  3. It has been found to aid understanding of the concepts of inference better than the traditional methods do.

There are also some disadvantages:

  1. Old fogeys don’t like it. (Just kidding) What I mean is that teachers who have always taught using the traditional approach find it difficult to trust what seems like a hit-and-miss method without the familiar theoretical underpinning.
  2. Universities don’t teach bootstrapping as much as the traditional methods.
  3. The common software packages do not include bootstrap confidence intervals.

The idea behind a bootstrap confidence interval is that we make use of the whole sample to represent the population. We take lots and lots of samples of the same size from the original sample. Obviously we need to sample with replacement, or the samples would all be identical. Then we use these repeated samples to get an idea of the distribution of the estimates of the population parameter. We chop the tails off at a given point, and that gives us the confidence interval. Voila!
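A minimal sketch of that percentile bootstrap, again with invented apple weights, might look like this; note that it works just as happily for a median, which the traditional formulas do not handle.

```python
# Sketch: percentile bootstrap confidence interval -- resample with replacement,
# collect the statistic, then chop the tails off.
import numpy as np

rng = np.random.default_rng(7)
sample = np.array([148, 155, 161, 149, 153, 158, 150, 147, 156, 152])  # invented apple weights

boot_medians = [np.median(rng.choice(sample, size=len(sample), replace=True))
                for _ in range(10_000)]

lower, upper = np.percentile(boot_medians, [2.5, 97.5])
print(f"95% bootstrap CI for the median: {lower:.1f}g to {upper:.1f}g")
```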

Answers to the disadvantages (burn the straw man?)

  1. There is a sound theoretical underpinning for bootstrap confidence intervals. A good place to start is a previous blog post about George Cobb’s work. Either that or – “Trust me, I’m a Doctor!” (This would also include trusting far more knowledgeable people such as Chris Wild and Maxine Pfannkuch, and the team of statistical educators led by Joan Garfield.)
  2. We have to start somewhere. Bootstrap methods aren’t used at universities because of inertia. As an academic of twenty years I can say that there is NO PAY OFF for teaching new stuff. It takes up valuable research time and you don’t get promoted, and sometimes you even get made redundant. If students understand what confidence intervals are, and the concept of inference, then learning to use the traditional formulas is trivial. Eventually the universities will shift. I am aware that the University of Auckland now teaches the bootstrap approach.
  3. There are ways to deal with the software package problem. There is a free software interface called “iNZight” that you can download. I believe Fathom also uses bootstrapping. There may be other software. Please let me know of any and I will add them to this post.

In Summary

Confidence intervals involve the concepts of variation, sampling and inference. They are a great way to teach these really important concepts, and to help students be critical of single value estimates. They can be taught informally, traditionally or using bootstrapping methods. Any of the approaches can lead to rote use of formula or algorithm and it is up to teachers to aim for understanding. I’m working on a set of videos around this topic. Watch this space.

Make journalists learn statistics

All journalists should be required to pass a course in basic statistics before they are let loose on the unsuspecting public.

I am not talking about the kind of statistics course that mathematical statisticians are talking about. This does not involve calculus, R or anything tricky requiring a post-graduate degree. I am talking about a statistics course for citizens. And journalists. :)

I have thought about this for some years. My father was a journalist, and fairly innumerate unless there was a dollar sign involved. But he was of the old school, who worked their way up the ranks. These days most media people have degrees, and I am adamant that the degree should contain basic numeracy and statistics. The course I devised (which has now been taken over by the maths and stats department and will be shut down later this year, but am I bitter…?) would have been ideal. It included basic number skills, including percentages (which are harder than you think), graphing, data, chance and evidence. It required students to understand the principles behind what they were doing rather than the mechanics.

Here is what journalists should know about statistics:


Variability and Chance

One of the key concepts in statistics is that of variability and chance. Too often a chance event is invested with unnecessary meaning. A really good example of this is the road toll. In New Zealand the road toll over the Easter break has fluctuated between 21 (in 1971) and 3 (in 1998, 2002 and 2003). Then in 2012 the toll was zero, a cause of great celebration. I was happy to see one report say “There was no one reason for the zero toll this Easter, and good fortune may have played a part.” This was a refreshing change, as normally the police seem to take the credit for good news, and blame bad news on us. Rather like economists.

With any random process you will get variability. The human mind looks for patterns and meanings even where there are none. Sadly the human mind often finds patterns and imbues meaning erroneously. Astrology is a perfect example of this – and watching Deal or No Deal is inspiring in the meaning people can find in random variation.

All journalists should have a good grasp of the concepts of variability, so they stop drawing unfounded conclusions.

Data Display

There are myriad examples of graphs in the media that are misleading, badly constructed, incorrectly specified, or just plain wrong. There was a wonderful one in the Herald Sun recently, which has had considerable publicity. We hope it was just an error, and nothing more sinister. But good subediting (what my father used to do, but I think ceased with the advent of the computer) would have picked this up.

There is a very nice website dedicated to this: StatsChat. It unfortunately misquotes H. G. Wells, but has a wonderful array of examples of good and bad statistics in the media. This post gives links to all sorts of sites with bad graphs, many of which were either produced or promulgated by journalists. But not all – scientific literature also has its culprits.

Just a little aside here – why does NO-ONE ever report the standard deviation? I was writing questions involving the normal distribution for practice by students. I am a strong follower of Cobb’s view that all data should be real, so I went looking for some interesting results I could use, with a mean and standard deviation. Heck I couldn’t even find uninteresting results! The mean and the median rule supreme, and confidence intervals are getting a little look in. Percentages are often reported with a “margin of error” (does anyone understand that?). But the standard deviation is invisible. I don’t think the standard deviation is any harder to understand than the mean. (Mainly because the mean is very hard to understand!) So why is the standard deviation not mentioned?


Evidence

One of the main ideas in inferential statistics is that of evidence: the data is here; do we have evidence that this is an actual effect rather than one caused by random variation and sampling error? In traditional statistics this is about understanding the p-value. In resampling the idea is very similar to that of a p-value – we ask “could we have got this result by chance?” You do not have to be a mathematician to grasp this idea if it is presented in an accessible way. (See my video “Understanding the p-value” for an example.)
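For the statistically curious, here is a minimal resampling sketch (my own invented data) of that question, “could we have got this result by chance?”: shuffle the group labels many times and see how often the shuffled difference is at least as large as the observed one.

```python
# Sketch of a randomisation test: estimate a p-value by reshuffling group labels.
import numpy as np

rng = np.random.default_rng(3)
group_a = np.array([23, 27, 31, 26, 30, 29])   # invented data
group_b = np.array([22, 24, 25, 23, 26, 21])

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])

count = 0
for _ in range(10_000):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:len(group_a)].mean() - shuffled[len(group_a):].mean()
    if abs(diff) >= abs(observed):
        count += 1

print("estimated p-value:", count / 10_000)
```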

One very exciting addition to the New Zealand curriculum is the set of Achievement Standards at Years 12 and 13 involving reading and understanding statistical reports. I have great hopes that as teachers embrace these standards, the level of understanding in the general population will increase, and there will be less tolerance for statistically unsound conclusions.

Another source of hope for me is “The Panel”, an afternoon radio programme hosted by Jim Mora on Radio New Zealand National. Each day different guests are invited to comment on current events in a moderately erudite and often amusing way. Sometimes they even have knowledge about the topic, and usually an expert is interviewed. It is as talkback radio really could be. I think. I’ve never listened long enough to talk-back radio to really judge as it always makes me SO ANGRY! Breathe, breathe…

I digress. I have been gratified to hear people on The Panel making worthwhile comments about sample size, sampling method, bias, association and causation. (Not usually using those exact terms, but the concepts are there.) It gives me hope that critical response to pseudo-scientific, and even scientific research is possible in the general populace. My husband thinks that should be “informed populace”, but I can dream.

It is possible for journalists to understand the important ideas of statistics without a mathematically-based and alienating course. I feel an app coming on… (Or should that be a nap?)

Which type of error do you prefer?

Mayor Bloomberg is avoiding a Type 2 error

As I write this, Hurricane Sandy is bearing down on the east coast of the United States. Mayor Bloomberg has ordered evacuations from various parts of New York City. All over the region people are stocking up on food and other essentials and waiting for Sandy to arrive. And if Sandy doesn’t turn out to be the worst storm ever, will people be relieved or disappointed? Either way there is a lot of money involved. And more importantly, risk of human injury and death. Will the forecasters be blamed for over-predicting?

Types of error

There are two ways to get this sort of decision wrong. We can do something and find out it was a waste of time, or we can do nothing and wish that we had done something. In the subject of statistics these are known as Type 1 and Type 2 errors. Teaching about Type 1 and Type 2 errors is quite tricky and students often get confused. Does it REALLY matter if they get them around the wrong way? Possibly not, but what really does matter is that students are aware of their existence. We would love to be able to make decisions under certainty, but most decisions involve uncertainty, or risk. We have to choose between the possibility of taking an opportunity and finding out that it was a mistake, and the possibility of turning down an opportunity and missing out on something.

Earthquake prediction

In another recent event, Italian scientists have been convicted of manslaughter for failing to predict a catastrophic earthquake. This has particular resonance in Christchurch, as our city has recently been shaken by several large quakes and a multitude of smaller aftershocks. You can see a graph of the Christchurch shakes at this site. For the most part the people of Christchurch understand that it is not possible to predict the occurrence of earthquakes. However it seems that the scientists in Italy may have overstated the lack of risk. Just because you can’t accurately predict an earthquake, it doesn’t mean it won’t happen. Here is a link to a story by Nature about the Italian earthquake.

Tornado warnings

Laura McLay wrote a very interesting post entitled “What is the optimal false alarm rate for tornado warnings?”. A high rate of false alarms is likened to the “boy who cried wolf”, to whom nobody listens any more. You would think that there is no harm in warning unnecessarily, but in the long term there is potential loss of life because people fail to heed subsequent warnings.

Operations Research and errors

Pure mathematicians tend not to like statistics much as it isn’t exact. It’s a little bit sullied by its contact with the real world. However Operations Research goes a step further into the messy world of reality and evaluates the cost of each type of error. Decisions are often converted into dollar terms within decision analysis. Like it or not, the dollar is the usual measure of worth, even for a human life, though sometimes a measure called “utility” is employed.

Costs of Errors

Sometimes there is very little cost to a type 2 error. A bank manager refusing to fund a new business is avoiding the risk of a type 1 error, which would result in a loss of money. They then become open to a type 2 error: missing out on funding a winner. The balance is very much on the side of avoiding a type 1 error. In terms of choosing a life partner, some people are happy to risk a type 1 error and marry, while others hold back, perhaps invoking a type 2 error by missing out on a “soul-mate”. Or it may be that we make this decision under the illusion of certainty and perfect information, and the possible errors do not cross our minds.

Cancer screening is a common illustration of type 1 and type 2 errors. With a type 1 error, we get a false positive and are told we have a cancer when we do not. With type 2, the test fails to detect a cancer. In this example the cost of a type 2 error seems to be much worse than type 1. Surely we would rather know if we have cancer? However in the case of prostate cancer, a type 1 error can lead to awful side-effects from unnecessary tests. Conversely a large number of men die from other causes, happily unaware that they have early stages of prostate cancer.
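A small worked sketch (my own round numbers, not real screening data) shows why the trade-off is not obvious: when a condition is rare, even a fairly accurate test produces mostly false positives.

```python
# Invented numbers: a condition with 1% prevalence, a test with 90% sensitivity
# and 90% specificity, applied to 10,000 people.
people = 10_000
prevalence, sensitivity, specificity = 0.01, 0.90, 0.90

with_condition = people * prevalence                      # 100 people
without_condition = people - with_condition               # 9900 people

true_positives = with_condition * sensitivity             # 90 detected
false_positives = without_condition * (1 - specificity)   # 990 false alarms (type 1 errors)
false_negatives = with_condition * (1 - sensitivity)      # 10 missed cases (type 2 errors)

p_condition_given_positive = true_positives / (true_positives + false_positives)
print(round(p_condition_given_positive, 2))               # about 0.08 -- most positives are false
```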

The point is that there is no easy answer when making such decisions.

Teaching about type 1 and type 2 errors

I have found the following helpful when teaching about type 1 and type 2 errors in statistics. Think first about the action that was taken. If the null hypothesis was rejected, we have said that there is an effect. After rejecting the null only two outcomes are possible. We have made the correct decision, or we have made a type 1 error. Conversely if we do not reject the null hypothesis, and do nothing, we have either been correct or made a type 2 error. You cannot make a type 1 error and a type 2 error in the same decision.

  • Decision: Reject the null. Outcome is:
    • Correct, or
    • Type 1 error.
  • Decision: Do not reject the null. Outcome is:
    • Correct, or
    • Type 2 error.

Or another way of looking at it is:

  • Do something and get it wrong – Type 1 error
  • Do nothing and regret it – Type 2 error

Avoid error

Students may wonder why we have to have any kind of error. Can we not do something to remove error? In some cases we can – we can spend more money and take a larger sample, thus reducing the likelihood of error. However, that too has its cost. The costs of type 1 errors, of type 2 errors, and of reducing them are all important aspects of decision-making, and helping students to understand this will help them to make and understand decisions.

Judgment Calls in Statistics and O.R.

The one-armed operations researcher

My mentor, Hans Daellenbach told me a story about a client asking for a one-armed Operations Researcher. The client was sick of getting answers that went, “On the one hand, the best decision would be to proceed, but on the other hand…”

People like the correct answer. They like certainty. They like to know they got it right.

I tease my husband that he has to find the best picnic spot or the best parking place, which involves us driving around considerably longer than I (or the children) were happy with. To be fair, we do end up in very nice picnic spots. However, several of the other places would have been just fine too!

In a different context I too am guilty of this – the reason I loved mathematics at school was because you knew whether you were right or wrong and could get a satisfying row of little red ticks (checkmarks) down the page. English and other arts subjects, I found too mushy as you could never get it perfect. Biology was annoying as plants were so variable, except in their ability to die. Chemistry was ok, so long as we stuck to the nice definite stuff like drawing organic molecules and balancing redox equations.

I think most mathematics teachers are mathematics teachers because they like things to be right or wrong. They like to be able to look at an answer and tell whether it is correct, or if it should get half marks for correct working. They do NOT want to mark essays, which are full of mushy judgements.

Again I am sympathetic. I once did a course in basketball refereeing. I enjoyed learning all the rules, and where to stand, and the hand signals etc, but I hated being a referee. All those decisions were just too much for me. I could never tell who had put the ball out, and was unhappy with guessing. I think I did referee two games at a church league and ended up with an angry player bashing me in the face with the ball. Looking back I think it didn’t help that I wasn’t much of a player either.

I also used to find marking exam papers very challenging, as I wanted to get it right every time. I would agonise over every mark, thinking it could be the difference between passing and failing for some poor student. However as the years went by, I realised that the odd mistake or inconsistency here or there was just usual, and within the range of error. To someone who failed by one mark, my suggestion is not to be borderline. I’m pretty sure we passed more people than we should have, rather than the other way around.

Life is not deterministic

The point is, that life in general is not deterministic and certain and rule-based. This is where the great divide lies between the subject of mathematics and the practice of statistics. Generally in mathematics you can find an answer and even check that it is correct. Or you can show that there is no answer (as happened in one of our national exams in 2012!). But often in statistics there is no clear answer. Sometimes it even depends on the context. This does not sit well with some mathematics teachers.

In operations research there is an interesting tension between optimisers and people who use heuristics. Optimisers love to say that they have the optimal solution to the problem. The non-optimisers like to point out that the problem solved optimally, is so far removed from the actual problem, that all it provides is an upper or lower bound to a practical solution to the actual real-life problem situation.

Judgment calls occur all through the mathematical decision sciences. They include:

  • What method to use – Linear programming or heuristic search?
  • Approximations – How do we model a stochastic input in a deterministic model?
  • Assumptions – Is it reasonable to assume that the observations are independent?
  • P-value cutoff – Does a p-value of exactly 0.05 constitute evidence against the null hypothesis?
  • Sample size – Is it reasonable to draw any inferences at all from a sample of 6?
  • Grouping – How do we group by age? by income?
  • Data cleaning – Do we remove the outlier or leave it in?

A comment from a maths teacher on my post regarding the Central Limit Theorem included the following: “The questions that continue to irk me are i) how do you know when to make the call? ii) What are the errors involved in making such a call? I suppose that Hypothesis testing along with p-values took care of such issues and offered some form of security in accepting or rejecting such a hypothesis. I am just a little worried that objectivity is being lost, with personal interpretation being the prevailing arbiter which seems inadequate.”

These are very real concerns, and reflect the mathematical desire for correctness and security. But I propose that the security was an illusion in the first place. There has always been personal interpretation. Informal inference is a nice introduction to help us understand that. And in fact it would be a good opportunity for lively discussion in a statistics class.

With bootstrapping methods we don’t have any less information than we did using the Central Limit Theorem. We just haven’t assumed normality or independence. There was no security. There was the idea that with a 95% confidence interval, for example, we are 95% sure that the interval contains the true population value. I wonder how often we realised that 1 in 20 times we were just plain wrong, and that in quite a few instances the population parameter would be far from the centre of the interval.
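A quick simulation sketch (my own, with an invented population) makes that “1 in 20” concrete: generate many samples from a known population and count how often the 95% interval misses the true mean.

```python
# Sketch: coverage of a 95% confidence interval for the mean under repeated sampling.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, sd, n, reps = 100, 15, 30, 10_000   # invented population and sample size

misses = 0
for _ in range(reps):
    x = rng.normal(true_mean, sd, n)
    half = stats.t.ppf(0.975, n - 1) * x.std(ddof=1) / np.sqrt(n)
    if not (x.mean() - half <= true_mean <= x.mean() + half):
        misses += 1

print(f"missed the true mean in {misses / reps:.1%} of samples")  # close to 5%
```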

The hopeful thing about teaching statistics via bootstrapping, is that by demystifying it we may be able to inject some more healthy scepticism into the populace.