# The Central Limit Theorem – with Dragons

To quote Willy Wonka, “A little magic now and then is relished by the best of men [and women].” Any frequent reader of this blog will know that I am of a pragmatic nature when it comes to using statistics. For most people the Central Limit Theorem can remain in the realms of magic. I have never taught it, though at times I have waved my hands past it.

Sometimes you don’t need to know.

Students who want that sort of thing can read about it in their textbooks or look it up online. The New Zealand school curriculum does not include it, as I explained in 2012.

But – there are many curricula and introductory statistics courses that include the Central Limit Theorem, so I have chosen to blog about it, in preparation for making a video. In this post I will cover what the Central Limit Theorem does. Maybe my approach will give teachers ideas on how they might teach it.

## Sampling distribution of a mean

First let me explain what a sampling distribution is. (And let me add the term to Dr Nic’s long list of statistics terms that cause unnecessary confusion.) A sampling distribution of a mean is the distribution of the means of samples of the same size taken from the same population. The distribution of the means will be different from the distribution of values in the original population.  The Central Limit Theorem tells us useful things about the sampling distribution and its relationship to the distribution of the values in the population.

## Example using dragons

We have a population of 720 dragons, and each dragon has a strength value from 1 to 8. The distribution of strengths runs from 1 to 8 and has a population mean somewhere around 4.5. We take a sample of four dragons from the population. (Dragons are difficult to catch and measure, so it will be just four.)

We find the mean. Then we think about what other values we might have got for samples that size. In real life, that is all we can do. But to understand what is happening, we will take multiple samples using cards, and then a spreadsheet, to explore what happens.

## Important aspects of the Central Limit Theorem

Aspect 1: The sampling distribution will be less spread than the population from which it is drawn.

### Dragon example

What do you think is the largest value the mean strength of the four dragons will take? Theoretically you could have a sample of four dragons, each with strength 8, giving us a sample mean of 8. But it isn’t very likely. The chance that all four values are greater than the population mean is pretty small (about 6%, since 0.5 to the power of 4 is 0.0625). If there are equal numbers of dragons with each strength value, then the probability of getting all four dragons with strength 8 is (1/8) to the power of 4, or roughly 0.0002.
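A quick check of these two probabilities, assuming the equal-numbers population just described (this sketch is in Python; exact fractions keep the arithmetic honest):

```python
from fractions import Fraction

# Assumed population: strengths 1 to 8 in equal numbers, so each strength
# has probability 1/8, and half the strengths exceed the mean of 4.5.
p_all_above_mean = Fraction(1, 2) ** 4   # all four dragons above the mean
p_all_eights = Fraction(1, 8) ** 4       # all four dragons have strength 8

print(float(p_all_above_mean))  # 0.0625, about a 6% chance
print(float(p_all_eights))      # 0.000244140625, roughly 0.0002
```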

So already we have worked out that the distribution of the sample means is going to be less spread than the distribution of the original population.

Aspect 2: The sampling distribution will be well-modelled by a normal distribution.

Now isn’t that amazing – and really useful! And even more amazing, it doesn’t even matter what the underlying population distribution is, the sampling distribution will still (in most cases) look like a normal distribution.

If you think about it, it does make sense. I like to see practical examples – so here is one!

### Dragon example

We worked out that it was really unlikely to get a sample of four dragons with a mean strength of 8. Similarly it is really unlikely to get a sample of four dragons with a mean strength of 1.
Say we assume that the strength of dragons is uniform – that there are equal numbers of dragons with each strength. Then we can find all the possible combinations of strengths in samples of four dragons. Bearing in mind there are eight different strengths, that gives us 8 to the power of 4, or 4096, possible combinations. We can use a spreadsheet to enumerate all these equally likely combinations. Then we find the mean strength of each combination, and we get this distribution.
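For those who prefer code to a spreadsheet, the enumeration can be sketched in a few lines of Python, assuming (as above) that the eight strengths are equally likely:

```python
from collections import Counter
from itertools import product
from statistics import mean

strengths = range(1, 9)
# All 8**4 = 4096 equally likely ordered samples of four strengths.
means = [mean(combo) for combo in product(strengths, repeat=4)]

counts = Counter(means)
print(len(means))                      # 4096 combinations
print(means.count(1), means.count(8))  # a mean of 1 or of 8 occurs only once each
print(counts.most_common(1))           # the most common mean is 4.5, in the middle
```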

Or we could take some samples of four dragons and see what happens. We can do this with our cards, or with a handy spreadsheet, and here is what we get.

Four samples of four dragons each

The sample mean values are 4.25, 5.25, 4.75 and 6. Even with really small samples we can see that the values of the means are clustering around some central point.

Here is what the means of 1000 samples of size 4 look like:
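A sketch of a simulation that produces such a picture, assuming a uniform population of 720 dragons (90 of each strength, which is an illustrative assumption rather than the actual Dragonistics deck):

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

# Assumed population: 720 dragons, 90 of each strength from 1 to 8.
population = [s for s in range(1, 9) for _ in range(90)]

# 1000 samples of four dragons, keeping the mean of each sample.
sample_means = [mean(random.sample(population, 4)) for _ in range(1000)]

print(round(mean(sample_means), 2))   # close to the population mean of 4.5
print(round(stdev(sample_means), 2))  # well under the population sd of about 2.29
```

Plotting a histogram of `sample_means` gives the picture.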

And hey presto – it resembles a normal distribution! By that I mean that the distribution is symmetric, with a bulge in the middle and tails in either direction. A normal distribution is useful for modelling just about anything that is the result of a large number of chance effects.

The bigger the sample size and the more samples we take, the more the distribution of the means (the sampling distribution) looks like a normal distribution. The Central Limit Theorem gives a mathematical explanation for this. I put this in the “magic” category unless you are planning to become a theoretical statistician.

Aspect 3: The spread of the sampling distribution is related to the spread of the population.

If you think about it, this also makes sense. If there is very little variation in the population, then the sample means will all be about the same.  On the other hand, if the population is really spread out, then the sample means will be more spread out too.

### Dragon example

Say the strengths of the dragons occur equally from 1 to 5 instead of from 1 to 8. The means of teams of four dragons will also range from 1 to 5, though most of the values will be near the middle.

Aspect 4: Bigger samples lead to a smaller spread in the sampling distribution.

As we increase the size of the sample, the means become less varied. We reduce the effect of one extreme value. Similarly the chance of getting all high values in our sample or all low values gets smaller and smaller. Consequently the spread of the sample means will decrease. However, the reduction is not linear. By that I mean that the effect achieved by adding one more to the sample decreases, depending on how big the sample is in the first place. Say you have a sample of size n = 4, and you increase it to n = 5, that is a 25% increase in information. If you have a sample n = 100 and increase it to size n=101, that is only a 1% increase in information.

Now here is the coolest thing! The spread of the sampling distribution is the standard deviation of the population, divided by the square root of the sample size. As we do not know the standard deviation of the population (σ), we use the standard deviation of the sample (s) to approximate it. The spread of the sampling distribution is usually called the standard error, or s.e.
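This relationship is easy to check by simulation. The sketch below samples with replacement from the same hypothetical uniform dragon population (so the theory applies exactly), and prints the observed spread of the sample means next to σ divided by the square root of n:

```python
import math
import random
from statistics import mean, pstdev, stdev

random.seed(2)

# Assumed population: 720 dragons, 90 of each strength from 1 to 8.
population = [s for s in range(1, 9) for _ in range(90)]
sigma = pstdev(population)  # population standard deviation, about 2.29

results = {}
for n in (4, 16, 36):
    # 2000 samples of size n, drawn with replacement.
    means = [mean(random.choices(population, k=n)) for _ in range(2000)]
    results[n] = (stdev(means), sigma / math.sqrt(n))
    print(n, round(results[n][0], 2), round(results[n][1], 2))
```

Note that quadrupling the sample size only halves the standard error – the non-linear effect described above.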

## Implications of the Central Limit Theorem

The properties listed above underpin most traditional statistical inference. When we find a confidence interval of a mean, we use the standard error in the formula. If we used the sample standard deviation we would be finding the values between which most of the values in the sample lie. By using the standard error, we are finding the values between which most of the sample means lie.
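As a sketch, here is how a 95% confidence interval for a mean uses the standard error. The sample of strengths is made up for illustration, and with n = 30 the normal critical value of about 1.96 is a reasonable stand-in for the t value:

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(4)
# A made-up sample of 30 dragon strengths; in practice this is your real data.
sample = random.choices(range(1, 9), k=30)

n = len(sample)
se = stdev(sample) / math.sqrt(n)  # standard error: s divided by root n
z = NormalDist().inv_cdf(0.975)    # about 1.96 for a 95% interval

lower = mean(sample) - z * se
upper = mean(sample) + z * se
print(f"95% CI for the mean strength: ({lower:.2f}, {upper:.2f})")
```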

## Sample size

The Central Limit Theorem applies best with large samples. A rule of thumb is that the sample should be 30 or more. For smaller samples we need to use the t distribution rather than the normal distribution in our testing or confidence intervals. If the sample is very small, such as less than 15, then we can still use the t-distribution if the underlying population has a normal shape. If the underlying population is not normal, and the sample is small, then other methods, such as resampling should be used, as the Central Limit Theorem does not hold.

## Reminder!

We do not take multiple samples from the same population in real life. This simulation is just that – a pretend example to show how the Central Limit Theorem plays out. When we perform inferential statistics we have one sample, and we use what we know about it to make inferences about the population from which it was drawn.

## Teaching suggestion

Data cards are extremely useful tools to help understand sampling and other aspects of inference. I would suggest getting the class to take multiple small samples (n = 4), using cards, and finding the means. Plot the means. Then take larger samples (n = 9) and similarly plot the means. Compare the shape and spread of the distributions of the means.

The Dragonistics data cards used in this post can be purchased at The StatsLC shop.

# Why Journalists need to understand statistics – Sensational Listener article about midwifery risks

The recent article in the Listener highlights again the need for all citizens to  be statistically literate. In particular I believe statistical literacy should be a compulsory part of all journalists’ training. I have written before about this. I was happy to see letters to the Editor in the 22 October issue of the Listener condemning the sensationalist cover, which was not supported in the article, and even less supported in the original research. I like the Listener, and subscribe, but this was badly done!

The following was written by a fellow statistician, John Maindonald and published here with his permission.

## Midwife-led vs medical-led models of care

A just-published major observational study, comparing midwife-led with medical-led models of care, has attracted extensive media attention. The front cover of the NZ Listener (October 8) presented the “results” in particularly sensationalist terms (“ALARMING MATERNITY RESEARCH …”).

http://www.listener.co.nz/archive/october-8-2016-2/

Much more alarming is what this sensationalist cover page has made of results that are at an optimistic best suggestive.

Adjustments, inevitably simplistic, were made for 8 factors in which the groups differed.  There is, with so many factors operating, no good way to be sure that the inevitably simple forms of adjustment were adequate.  Additionally, there will have been differences in mothers’ circumstances that the deprivation index used was too crude to capture.  Substance abuse was not taken into consideration.

The Otago University press release: http://www.otago.ac.nz/news/news/otago622928.html

The paper: http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002134

I am disappointed that in its response to criticism of its presentation in Letters to the Editor, the Listener (October 22) continues to defend its reporting.

John Maindonald.

# Understanding Statistical Inference

Inference is THE big idea of statistics. This is where people come unstuck. Most people can accept the use of summary descriptive statistics and graphs. They can understand why data is needed. They can see that the way a sample is taken may affect how things turn out. They often understand the need for control groups. Most statistical concepts or ideas are readily explainable. But inference is a tricky, tricky idea. Well actually – it doesn’t need to be tricky, but the way it is generally taught makes it tricky.

## Procedural competence with zero understanding

I cast my mind back to my first encounter with confidence intervals and hypothesis tests. I learned how to calculate them (by hand  – yes I am that old) but had not a clue what their point was. Not a single clue. I got an A in that course. This is a common occurrence. It is possible to remain blissfully unaware of what inference is all about, while answering procedural questions in exams correctly.

But, thanks to the research and thinking of a lot of really smart and dedicated statistics teachers, we are able to put a stop to that. And we must.

We need to explicitly teach what statistical inference is. Students do not learn to understand inference by doing calculations. We need to revisit the ideas behind inference frequently. The process of hypothesis testing is counter-intuitive and so confusing that it spills its confusion over into the concept of inference. Confidence intervals are less confusing, so they are a better intermediate point for understanding statistical inference. But we need to start with the concept of inference.

## What is statistical inference?

The idea of inference is actually not that tricky if you unbundle the concept from the application or process.

The concept of statistical inference is this –

We want to know stuff about a large group of people or things (a population). We can’t ask or test them all so we take a sample. We use what we find out from the sample to draw conclusions about the population.

That is it. Now was that so hard?

## Developing understanding of statistical inference in children

I have found the paper by Makar and Rubin, presenting a “framework for thinking about informal statistical inference”, particularly helpful. In this paper they summarise studies done with children learning about inference. They suggest that “three key principles … appeared to be essential to informal statistical inference: (1) generalization, including predictions, parameter estimates, and conclusions, that extend beyond describing the given data; (2) the use of data as evidence for those generalizations; and (3) employment of probabilistic language in describing the generalization, including informal reference to levels of certainty about the conclusions drawn.” This can be summed up as generalisation, data as evidence, and probabilistic language.

We can lead into informal inference early in the school curriculum. The key ideas in the NZ curriculum suggest that “teachers should be encouraging students to read beyond the data, e.g. ‘If a new student joined our class, how many children do you think would be in their family?’” In other words, though we don’t specifically use the terms population and sample, we can conversationally draw attention to what we learn from this set of data, and how that might relate to other sets of data.

When teaching adults we may use a more direct approach, explaining inference explicitly alongside experiential learning. We have just made a video: Understanding Inference. Within the video we present three basic ideas condensed from the five big ideas in the very helpful book published by NCTM, “Developing Essential Understanding of Statistics, Grades 9–12”, by Peck, Gould, Miller and Zbiek.

## Ideas underlying inference

- A sample is likely to be a good representation of the population.
- There is an element of uncertainty as to how well the sample represents the population.
- The way the sample is taken matters.

These ideas help to provide a rationale for thinking about inference, and allow students to justify what has often been assumed or taught mathematically. In addition several memorable examples involving apples, chocolate bars and opinion polls are provided. This is available for free use on YouTube. If you wish to have access to more of our videos than are available there, do email me at n.petty@statslc.com.

# Don’t teach significance testing – Guest post

The following is a guest post by Tony Hak of Rotterdam School of Management. I know Tony would love some discussion about it in the comments. I remain undecided either way, so would like to hear arguments.

## GOOD REASONS FOR NOT TEACHING SIGNIFICANCE TESTING

It is now well understood that p-values are not informative and are not replicable. Soon null hypothesis significance testing (NHST) will be obsolete and will be replaced by the so-called “new” statistics (estimation and meta-analysis). This requires that undergraduate courses in statistics must already teach estimation and meta-analysis as the preferred way to present and analyse empirical results. If not, the statistical skills of the graduates from these courses will be outdated on the day they leave school. But it is less evident whether or not NHST (though not preferred as an analytic tool) should still be taught. Because estimation is already routinely taught as a preparation for the teaching of NHST, the necessary reform in teaching will not require the addition of new elements to current programs, but rather the removal of the current emphasis on NHST, or the complete removal of the teaching of NHST from the curriculum. The current trend is to continue the teaching of NHST. In my view, however, the teaching of NHST should be discontinued immediately because it is (1) ineffective and (2) dangerous, and (3) it serves no aim.

### 1. Ineffective: NHST is difficult to understand and it is very hard to teach successfully

We know that even good researchers often do not appreciate the fact that NHST outcomes are subject to sampling variation, and believe that a “significant” result obtained in one study almost guarantees a significant result in a replication, even one with a smaller sample size. Is it then surprising that our students also fail to understand what NHST outcomes do and do not tell us? In fact, statistics teachers know that the principles and procedures of NHST are not well understood by undergraduate students who have successfully passed their courses on NHST. Courses on NHST fail to achieve their self-stated objectives, assuming that these objectives include achieving a correct understanding of the aims, assumptions, and procedures of NHST, as well as a proper interpretation of its outcomes. It is very hard indeed to find a comment on NHST in any student paper (an essay, a thesis) that comes close to a correct characterization of NHST or its outcomes. There are many reasons for this failure, but the most important one is obviously that NHST is a very complicated and counterintuitive procedure. It requires students and researchers to understand that a p-value is attached to an outcome (an estimate) based on its location in (or relative to) an imaginary distribution of sample outcomes around the null. Another reason, connected to their failure to understand what NHST is and does, is that students believe that NHST “corrects for chance”, and hence they cannot cognitively accept that p-values themselves are subject to sampling variation (i.e. chance).

### 2. Dangerous: NHST thinking is addictive

One might argue that there is no harm in adding a p-value to an estimate in a research report and, hence, that there is no harm in teaching NHST, additionally to teaching estimation. However, the mixed experience with statistics reform in clinical and epidemiological research suggests that a more radical change is needed. Reports of clinical trials and of studies in clinical epidemiology now usually report estimates and confidence intervals, in addition to p-values. However, as Fidler et al. (2004) have shown, and contrary to what one would expect, authors continue to discuss their results in terms of significance. Fidler et al. therefore concluded that “editors can lead researchers to confidence intervals, but can’t make them think”. This suggests that a successful statistics reform requires a cognitive change that should be reflected in how results are interpreted in the Discussion sections of published reports.

The stickiness of dichotomous thinking can also be illustrated with the results of a more recent study by Coulson et al. (2010). They presented estimates and confidence intervals obtained in two studies to a group of researchers in psychology and medicine, and asked them to compare the results of the two studies and to interpret the difference between them. It appeared that a considerable proportion of these researchers, first, used the information about the confidence intervals to make a decision about the significance of the results (in one study) or the non-significance of the results (of the other study) and, then, drew the incorrect conclusion that the results of the two studies were in conflict. Note that no NHST information was provided and that participants were not asked in any way to “test” or to use dichotomous thinking. The results of this study suggest that NHST thinking can (and often will) be used by those who are familiar with it.

The fact that it appears to be very difficult for researchers to break the habit of thinking in terms of “testing” is, as with every addiction, a good reason to prevent future researchers from coming into contact with it in the first place and, if contact cannot be avoided, to provide them with robust resistance mechanisms. The implication for statistics teaching is that students should first learn estimation as the preferred way of presenting and analyzing research information, and should be introduced to NHST, if at all, only after estimation has become their routine statistical practice.

### 3. It serves no aim: relevant information can be found in research reports anyway

Our experience that the teaching of NHST consistently fails its own aims (because NHST is too difficult to understand), and the fact that NHST appears to be dangerous and addictive, are two good reasons to stop teaching NHST immediately. But there is a seemingly strong argument for continuing to introduce students to NHST, namely that a new generation of graduates will not be able to read the (past and current) academic literature in which authors routinely focus on the statistical significance of their results. It is suggested that someone who does not know NHST cannot correctly interpret the outcomes of NHST practices. This argument has no value, for the simple reason that it assumes that NHST outcomes are relevant and should be interpreted. But the reason we are having the current discussion about teaching is precisely that NHST outcomes are at best uninformative (beyond the information already provided by estimation) and at worst misleading or plain wrong. The point all along is that nothing is lost by simply ignoring the information related to NHST in a research report and focusing only on the information provided about the observed effect size and its confidence interval.

## Bibliography

Coulson, M., Healy, M., Fidler, F., & Cumming, G. (2010). Confidence Intervals Permit, But Do Not Guarantee, Better Inference than Statistical Significance Testing. Frontiers in Quantitative Psychology and Measurement, 20(1), 37-46.

Fidler, F., Thomason, N., Finch, S., & Leeman, J. (2004). Editors Can Lead Researchers to Confidence Intervals, But Can’t Make Them Think: Statistical Reform Lessons from Medicine. Psychological Science, 15(2), 119-126.

This text is a condensed version of the paper “After Statistics Reform: Should We Still Teach Significance Testing?” published in the Proceedings of ICOTS9.

# The silent dog – null results matter too!

Recently I was discussing the process we use in a statistical enquiry. The ideal is that we start with a problem and follow the statistical enquiry cycle through the steps Problem, Plan, Data collection, Analysis and Conclusion, which then may lead to other enquiries.
I have previously written a post suggesting that the cyclical nature of the process was overstated.

The context of our discussion was a video I am working on, that acknowledges that often we start, not at the beginning, but in the middle, with a set of data. This may be because in an educational setting it is too expensive and time consuming to require students to collect their own data. Or it may be that as statistical consultants we are brought into an investigation once the data has been collected, and are needed to make some sense out of it. Whatever the reason, it is common to start with the data, and then loop backwards to the Problem and Plan phases, before performing the analysis and writing the conclusions.

## Looking for relationships

We, a group of statistical educators, were suggesting what we would do with a data set, which included looking at the level of measurement, the origins of the data, and the possible intentions of the people who collected it. One teacher suggests to her students that they do exploratory scatter plots of all the possible pairings, as well as comparative dotplots and boxplots. The students can then choose a problem that is likely to show a relationship – because they have already seen that there is a relationship in the data.

I have a bit of a problem with this. It is fine to get an overview of the relationships in the data – that is one of the beauties of statistical packages. And I can see that for an assignment, it is more rewarding for students to have a result they can discuss. If they get a null result there is a tendency to think that they have failed. Yet the lack of evidence of a relationship may be more important than evidence of one. The problem is that we value positive results over null results. This is a known problem in academic journals, and many words have been written about the over-occurrence of type 1 errors, or publication bias.

Let me illustrate. A drug manufacturer hopes that drug X is effective in treating depression. In reality drug X is no more effective than a placebo. The manufacturer keeps funding different tests by different scientists. If all the experiments use a significance level of 0.05, then about 5% of the experiments will produce a type 1 error and say that there is an effect attributable to drug X. The (false) positive results can be published, because academic journals prefer positive results to null results. Conversely, the much larger number of researchers who correctly concluded that there is no relationship do not get published, and the abundance of evidence to the contrary is invisible. To be fair, it is hoped that these researchers will be able to refute the false positive paper.
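This 5% false-positive rate is easy to demonstrate by simulation. The sketch below runs a plain two-sample z-test on made-up data where no real effect exists, and rejects the null in roughly 5% of trials:

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(3)

trials = 1000
false_positives = 0
for _ in range(trials):
    # Both groups come from the same distribution: the drug does nothing.
    control = [random.gauss(0, 1) for _ in range(50)]
    treated = [random.gauss(0, 1) for _ in range(50)]
    se = math.sqrt(stdev(control) ** 2 / 50 + stdev(treated) ** 2 / 50)
    z = (mean(treated) - mean(control)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    if p < 0.05:  # a type 1 error, since the null is true by construction
        false_positives += 1

print(false_positives / trials)  # close to 0.05
```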

## Let them see null results

So where does this leave us as teachers of statistics? Awareness is a good start. We need to show null effects and why they are important. For every example we give that ends up rejecting the null hypothesis, we need an example that does not. Textbooks tend to over-include results that reject the null, so that when students meet a non-significant result they are left wondering whether they have made a mistake. In my preparation of learning materials, I endeavour to keep a good spread of results – strongly positive, weakly positive, inconclusive, weakly negative and strongly negative. This way students are accepting of a null result, and know what to say when they get one.

Another example is in the teaching of time series analysis. We love to show series with strong seasonality. It tells a story. (See my post about time series analysis as storytelling.) Retail sales nearly all peak in December, and various goods have other peaks. Jewellery retail sales in the US have small peaks in February and May, and it is fun working out why. Seasonal patterns seem like magic. However, we also need to let students analyse data that does not have a strong seasonal pattern, so that they learn such series exist too!

My final research project before leaving the world of academia involved an experiment on the students in my class of over 200. It was difficult to get through the human ethics committee, but made it in the end. The students were divided into two groups, and half were followed up by tutors weekly if they were not keeping up with assignments and testing. The other half were left to their own devices, as had previously been the case. The interesting result was that it made no difference to the pass rate of the students. In fact the proportion of passes was almost identical. This was a null result. I had supposed that following up and helping students to keep up would increase their chances of passing the course. But they didn’t. This important result saved us money in terms of tutor input in following years. Though it felt good to be helping our students more, it didn’t actually help them pass, so was not justifiable in straitened financial times.

I wonder if it would have made it into a journal.

By the way, my reference to the silent dog in the title is to the famous Sherlock Holmes story, Silver Blaze, where the fact that the dog did not bark was important, as it showed that the intruder was known to it.

# Proving causation

## Aeroplanes cause hot weather

In Christchurch we have a weather phenomenon known as the “Nor-wester”, which is a warm dry wind, preceding a cold southerly change. When the wind is from this direction, aeroplanes make their approach to the airport over the city. Our university is close to the airport in the direct flightpath, so we are very aware of the planes. A new colleague from South Africa drew the amusing conclusion that the unusual heat of the day was caused by all the planes flying overhead.

Statistics experts and educators spend a lot of time refuting claims of causation. “Correlation does not imply causation” has become a catch cry of people trying to avoid the common trap. This is a great advance in understanding that even journalists (notoriously math-phobic) seem to have caught onto. My own video on important statistical concepts ends with the causation issue. (You can jump to it at 3:51)

So we are aware that it is not easy to prove causation.

In order to prove causation we need a randomised experiment. We need to randomise any possible factor that could be associated with, and thus cause or contribute to, the effect.

There is also the related problem of generalisability. If we do have a randomised experiment, we can prove causation. But unless the sample is also a random, representative sample of the population in question, we cannot infer that the results will transfer to that population. This is nicely illustrated in the matrix from The Statistical Sleuth by Fred L. Ramsey and Daniel W. Schafer.

The relationship between the type of sample and study and the conclusions that may be drawn.

The top left-hand quadrant is the one in which we can draw causal inferences for the population.

## Causal claims from observational studies

A student posed this question:  Is it possible to prove a causal link based on an observational study alone?

It would be very useful if we could. It is not always possible to use a randomised trial, particularly when people are involved. Before we became more aware of human rights, experiments were performed on unsuspecting human lab rats. A classic example is the Vipeholm experiments where patients at a mental hospital were the unknowing subjects. They were given large quantities of sweets in order to determine whether sugar caused cavities in teeth. This happened into the early 1950s. These days it would not be acceptable to randomly assign people to groups who are made to smoke or drink alcohol or consume large quantities of fat-laden pastries. We have to let people make those lifestyle choices for themselves. And observe. Hence observational studies!

There is a call for “evidence-based practice” in education to follow the philosophy in medicine. But getting educational experiments through ethics committee approval is very challenging, and it is difficult to use rats or fruit-flies to impersonate the higher learning processes of humans. The changing landscape of the human environment makes it even more difficult to perform educational experiments.

To find out the criteria for justifying causal claims in an observational study I turned to one of my favourite statistics textbooks, Chance Encounters by Wild and Seber (page 27). They cite the Surgeon General of the United States. The criteria for establishing a cause-and-effect relationship in an epidemiological study are the following:

1. Strong relationship: for example, illness is four times as likely among people exposed to a possible cause as among those who are not exposed.
2. Strong research design.
3. Temporal relationship: the cause must precede the effect.
4. Dose-response relationship: higher exposure leads to a higher proportion of people affected.
5. Reversible association: removal of the cause reduces the incidence of the effect.
6. Consistency: multiple studies in different locations produce similar effects.
7. Biological plausibility: there is a supportable biological mechanism.
8. Coherence with known facts.

In high school and entry-level statistics courses, the focus is often on statistical literacy. The concept of causation is pivotal to a correct understanding of what statistics can and cannot claim. It is worth spending some time in the classroom discussing what would constitute reasonable proof and what would not. In particular it is worthwhile to come up with alternative explanations for common fallacies, or even truths, in causation. Some examples for discussion might be drink-driving and accidents, smoking and cancer, gender and success in any number of areas, home-game advantage in sport, and the use of lucky charms, socks and undies. This also connects nicely with probability theory, helping to tie the year's curriculum together.

# Parts and whole

The whole may be greater than the sum of the parts, but the whole still needs those parts. A reflective teacher will think carefully about when to concentrate on the whole, and when on the parts.

## Golf

If you were teaching someone golf, you wouldn't spend days on a driving range without ever going out on a course. Your student would not get the idea of what the game is, or why they need to be able to drive straight and to a desired length. Nor would it be much fun! Similarly, if the person only played games of golf, it would be difficult for them to develop their game. Practice at driving and putting is needed. A serious student of golf will also read about and watch experts at golf.

## Music

Learning music is similar. Anyone who is serious about developing as a musician will spend a considerable amount of time developing their technique and their knowledge by practising scales, chords and drills. But at the same time they need to be playing full pieces of music so that they feel the joy of what they are doing. As they play music, as opposed to drills, they will see how their less interesting practice has helped them to develop their skills. However, as they practise a whole piece, they may well find a small part that is tripping them up, and focus for a while on that. Playing only the piece as a whole is not an efficient use of time. A serious student of music will also listen to and watch great musicians, in order to develop their own understanding and knowledge.

## Benefits of study of the whole and of the parts

In each of these examples we can see that there are aspects of working with the whole, and aspects of working with the parts. Study of the whole contributes perspective and meaning, and helps to tie things together. It also helps learners to see where they have made progress. Study of the parts isolates areas of weakness, develops skills and saves time in practice, thus being more efficient.

It is very important for students to get an idea of the purpose of their study, and where they are going. For this reason I have written earlier about the need to see the end when starting out in a long procedure such as a regression or linear programming model.

It is also important to develop “statistical muscle memory” by repeating small parts of the exercise over and over until it is mastered. Practice helps people to learn what is general and what is specific in the different examples.

# Teaching conditional probability

We are currently developing a section on probability as part of our learning materials. A fundamental understanding of probability and uncertainty are essential to a full understanding of inference. When we look at statistical evidence from data, we are holding it up against what we could reasonably expect to happen by chance, which involves a probability model. Probability lies in the more mathematical area of the study of statistics, and has some fun problem-solving aspects to it.

A popular exam question involves conditional probability. We like to use a table approach, as it avoids many of the complications of terminology. I still remember my initial confusion over the counter-intuitive expression P(A|B), which means the probability that an object from subset B has the property A. There are several places where students can come unstuck in Bayesian review, and the problems can take a long time. We can liken solving a conditional probability problem to a round of golf, or a long piece of music. So in teaching we first take the students step by step through the whole problem. This includes working out what the words are saying, putting the known values into a table, calculating the unknown values in the table, and then using the table to answer the questions involving conditional probability.
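The table approach can be sketched in code. This is a minimal illustration with invented numbers (a hypothetical screening test for 200 people); the helper function `p` simply sums the relevant cells of the table:

```python
# A sketch of the two-way-table approach to conditional probability.
# The counts are invented for illustration:
#                 condition  no condition
# test positive        18            28
# test negative         2           152

table = {
    ("positive", "condition"): 18,
    ("positive", "no condition"): 28,
    ("negative", "condition"): 2,
    ("negative", "no condition"): 152,
}

total = sum(table.values())

def p(test=None, status=None):
    """Probability summed over the cells matching the given test result and/or status."""
    return sum(
        count for (t, s), count in table.items()
        if (test is None or t == test) and (status is None or s == status)
    ) / total

# P(condition | positive): restrict attention to the "positive" row,
# then take the share of that row belonging to people with the condition.
p_condition_given_positive = p(test="positive", status="condition") / p(test="positive")

print(round(p_condition_given_positive, 3))  # 18/46, about 0.391
```

The point of the table is visible in the last calculation: conditioning on B just means restricting attention to B's row before counting.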

Then we work on the individual steps, isolating them so that students can get sufficient practice to find out what is general and what is specific to different examples. As we do this we endeavour to provide variety, so that students do not work out some heuristic based on the wording of the question that actually stops them from understanding. For example, if we use the same template each time, students will work out that the first number stated always goes in a certain place in the table, the second in another place, and so on. This is a short-term strategy that we need to protect them from through careful generation of questions.

As it turns out students should already have some of the necessary skills. When we review probability at the start of the unit, we get students to calculate probabilities from tables of values, including conditional probabilities. Then when they meet them again as part of the greater whole, there is a familiar ring.

Once the parts are mastered, the students can move on to a set of full questions, using each of the steps they have learned, and putting them back into the whole. Because they are fluent in the steps, it becomes more intuitive to put the whole back together, and when they meet something unusual they are better able to deal with it.

## Starting a course in Operations Research/Management Science

It is interesting to contemplate what "the whole" is, with regard to any subject. In operations research we used to begin our first class, like many first classes, talking about what management science/operations research is. It was a pretty passive sort of class, and I felt it didn't help, as first-year university students had little relevant knowledge to pin the ideas on. So we changed to an approach that put them straight into the action and taught several weeks of techniques first. We started with project management and taught the critical path method. Then we taught identifying fixed and variable costs, and break-even analysis. The next week was discounting and analysis of financial projects. Then for a softer example we looked at multi-criteria decision-making (MCDM). It tied back to the previous week by taking a different approach to a decision regarding a landfill. Then we introduced OR/MS and the concept of mathematical modelling. By then we could give real examples of how mathematical models could be used to inform real-world problems. It was helpful to go from the concrete to the abstract. This was a much more satisfactory approach.
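Break-even analysis, one of the early techniques in that sequence, fits in a few lines. The figures here are invented purely for illustration:

```python
# Break-even analysis with invented numbers: how many units must be sold
# before revenue covers both fixed and variable costs?
fixed_cost = 50_000.0   # e.g. annual plant and equipment cost
variable_cost = 12.0    # cost to produce one unit
price = 20.0            # selling price per unit

# Revenue equals total cost when price * q = fixed_cost + variable_cost * q,
# so q = fixed_cost / (price - variable_cost), the contribution per unit.
break_even_units = fixed_cost / (price - variable_cost)
print(break_even_units)  # 6250.0
```

Small as it is, it gives first-year students a concrete model before the word "model" is ever introduced.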

So the point is not that you should always start with the whole and then do the parts and then go back to the whole. The point is that a teacher needs to think carefully about the relationship between the parts and the whole, and teach in a way that is most helpful.

# Why engineers and poets need to know about statistics

I’m kidding about poets. But lots of people need to understand the three basic areas of statistics, Chance, Data and Evidence.

Recently Tony Greenfield, an esteemed applied statistician, (with his roots in Operations Research) posted the following request on a statistics email list:

“I went this week to the exhibition and conference in the NEC run by The Engineer magazine. There were CEOs of engineering companies of all sizes, from small to massive. I asked a loaded question:  “Why should every engineer be a competent applied statistician?” Only one, from more than 100 engineers, answered: “We need to analyse any data that comes along.” They all seemed bewildered when I asked if they knew about, or even used, SPC and DoE. I shall welcome one paragraph responses to my question. I could talk all day about it but it would be good to have a succinct and powerful few words to use at such a conference.”

For now I will focus on civil engineers, as they are often what people think of as engineers. I’m not sure about the “succinct and powerful” nature of the words to follow, but here goes…

The subject of statistics can be summarised as three areas – chance, data and evidence (CDE!).

Chance includes the rules and perceptions of probability, and emphasises the uncertainty in our world. I suspect engineers are more at home in a deterministic world, but determinism is just a model of reality. The strength of a bar of steel is not exact, but will be modelled with a probability distribution. An understanding of probability is necessary before using terms such as “one hundred year flood”. Expected values are used for making decisions on improving roads and intersections. The capacity of stadiums and malls, and the provision of toilets and exits all require modelling that relies on probability distributions. It is also necessary to have some understanding of our human fallibility in estimating and communicating probability. Statistical process control accounts for acceptable levels of variation, and indicates when they have been exceeded.
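The "one hundred year flood" point rewards a quick calculation. Assuming an invented 50-year design life for a structure, the chance of seeing at least one such flood is far from negligible:

```python
# A "one hundred year flood" has probability 0.01 of occurring in any given
# year; it is not a guarantee of exactly one flood per century. Over a
# 50-year design life (an invented figure), the chance of at least one
# such flood is the complement of seeing none at all.
p_per_year = 0.01
years = 50

p_at_least_one = 1 - (1 - p_per_year) ** years
print(round(p_at_least_one, 3))  # 0.395
```

Nearly a 40% chance, which is why the term misleads anyone who reads "one hundred year" as "not in my lifetime".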

The Data aspect of the study of statistics embraces the collection, summary and communication of data. In order to make decisions, data must be collected. Correct summary measures must be used, often the median, rather than the more popular mean. Summary measures should preferably be expressed as confidence intervals, thus communicating the level of precision inherent in the data. Appropriate graphs are needed, which seldom includes pictograms or pie charts.

Evidence refers to the inferential aspects of statistical analysis. The theories of probability are used to evaluate whether a certain set of data provides sufficient evidence to draw conclusions. An engineer needs to understand the use of hypothesis testing and the p-value in order to make informed decisions regarding data. Any professional in any field should be using evidence-based practice, and journal articles providing evidence will almost always refer to the p-value. They should also be wary of claims of causation, and understand the difference between strength of effect and strength of evidence. Our video provides a gentle introduction to these concepts.

Design of Experiments also incorporates the Chance, Data and Evidence aspects of the discipline of statistics. By randomising the units in an experiment we can control for extraneous elements that, in an observational study, might otherwise affect the outcome. Engineers should be at home with these concepts.

So, Tony, how was that? Not exactly succinct, and four paragraphs rather than one. I think the Chance, Data, Evidence framework helps provide structure to the explanation.

## So what about the poets?

I borrow the term from Peter Bell of Richard Ivey School of Business, who teaches operations research to MBA students, and wrote a paper, Operations Research For Everyone (including poets). If it is difficult to get the world to recognise the importance of statistics, how much harder is it to convince them that Operations Research is vital to their well-being!

Bell uses the term "poet" to refer to students who are not naturally at home with mathematics. In conversation Bell explained how many of his poets, who were planning to work in the area of human resource management, found that their summer internships were spent elbow-deep in data in front of a spreadsheet, and were grateful for the skills they had resisted gaining.

An understanding of chance, data and evidence is useful/essential for “efficient citizenship”, to paraphrase the often paraphrased H. G. Wells. I have already written on the necessity for journalists to have an understanding of statistics. The innovative New Zealand curriculum recognises the importance of an understanding of statistics for all. There are numerous courses dedicated to making sure that medical practitioners have a good understanding.

So really, there are few professions or trades that would not benefit from a grounding in Chance, Data and Evidence. And Operations Research too, but for now that may be a bridge too far.

# Confidence Intervals

(This post was updated to include newer videos in May 2018)

Confidence intervals are needed because there is variation in the world. Nearly all natural, human or technological processes result in outputs which vary to a greater or lesser extent. Examples of this are people’s heights, students’ scores in a well written test and weights of loaves of bread. Sometimes our inability or lack of desire to measure something down to the last microgram will leave us thinking that there is no variation, but it is there. For example we would check the weights of chocolate bars to the nearest gram, and may well find that there is no variation. However if we were to weigh them to the nearest milligram, there would be variation. Drug doses have a much smaller range of variation, but it is there all the same.

You can see a video about some of the main sources of variation – natural, explainable, sampling and due to bias.

When we wish to find out about a phenomenon, the ideal would be to measure all instances. For example we can find out the heights of all students in one class at a given time. However it is impossible to find out the heights of all people in the world at a given time. It is even impossible to know how many people there are in the world at a given time. Whenever it is impossible or too expensive or too destructive or dangerous to measure all instances in a population, we need to take a sample. Ideally we will take a sample that gives each object in the population an equal likelihood of being chosen.

You can see a video here about ways of taking a sample.

When we take a sample there will always be error. It is called sampling error. We may, by chance, get exactly the same value for our sample statistic as the “true” value that exists in the population. However, even if we do, we won’t know that we have.

The sample mean is the best estimate for the population mean, but we need to say how well it is estimating the population mean. For example, say we wish to know the mean (or average) weight of apples in an orchard. We take a sample and find that the mean weight of the apples in the sample is 153g. If we took only a few apples, this is only a rough estimate, and we might say we are pretty sure the mean weight of the apples in the orchard is between 143g and 163g. If someone else took a bigger sample, they might be able to say that they are pretty sure that the mean weight of apples in the orchard is between 158g and 166g. You can tell that the second confidence interval is giving us better information, as the range of the confidence interval is smaller.

There are two things that affect the width of a confidence interval. The first is the sample size. If we take a really large sample we are getting a lot more information about the population, so our confidence interval will be more exact, or narrower. It is not a one-to-one relationship, but a square-root relationship. If we wish to halve the width of the confidence interval, we will need to increase our sample size by a factor of four.
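The square-root relationship can be checked directly. The standard deviation and multiplier below are invented round numbers, used only to show the pattern:

```python
import math

# The width of a normal-theory confidence interval is proportional to
# 1/sqrt(n). Invented figures: standard deviation of 20 g for apple
# weights, and 1.96 as the familiar 95% multiplier.
sigma = 20.0
z = 1.96

def ci_width(n):
    """Full width of the interval: mean +/- z * sigma / sqrt(n)."""
    return 2 * z * sigma / math.sqrt(n)

# Quadrupling the sample size halves the width.
print(round(ci_width(25), 2))   # 15.68
print(round(ci_width(100), 2))  # 7.84
```

Going from 25 apples to 100 apples, four times the work, only doubles the precision.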

The second thing to affect the width of a confidence interval is the amount of variation in the population. If all the apples in the orchard are about the same weight, then we will be able to estimate that weight quite accurately. However, if the apples are all different sizes, then it will be harder to be sure that the sample represents the population, and we will have a larger confidence interval as a result.

## Three ways to find confidence intervals

The standard way of calculating confidence intervals is to use formulas developed from the assumptions of normality and the Central Limit Theorem. These formulas are used to calculate the confidence intervals of means, proportions and slopes, but not of medians or standard deviations, because there are no straightforward formulas for those. The formulas were developed when there were no computers, and analytical methods were needed in the absence of computational power.

In terms of teaching, these formulas are straightforward, and they also include the concept of level of confidence, which is part of the paradigm. You can see a video teaching the traditional approach to confidence intervals, using a formula to calculate the confidence interval for a mean.
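As a sketch of the formula approach, here is a 95% interval for a mean, using invented apple weights. Strictly, for a sample of ten the multiplier should come from the t distribution; the large-sample value 1.96 is used here only to keep the sketch to the standard library:

```python
import math
import statistics

# A normal-theory confidence interval for a mean. The weights (grams)
# are invented to echo the orchard example. For a sample this small the
# multiplier should really be a t value (about 2.26 for 9 degrees of
# freedom); 1.96 is the large-sample approximation.
weights = [148, 151, 155, 160, 149, 157, 153, 152, 158, 150]

n = len(weights)
mean = statistics.mean(weights)
s = statistics.stdev(weights)        # sample standard deviation
margin = 1.96 * s / math.sqrt(n)     # z * s / sqrt(n)

print(f"{mean - margin:.1f} g to {mean + margin:.1f} g")  # 150.8 g to 155.8 g
```

Everything the formula needs is a mean, a standard deviation and a sample size, which is exactly why it was so practical in the pre-computer era.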

## Rule of Thumb

In the New Zealand curriculum at year 12, students are introduced to the concept of inference using an informal method for calculating a confidence interval. The formula is median +/-  1.5 times the interquartile range divided by the square-root of the sample size. There is a similar formula for proportions.
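The rule of thumb is easy to sketch with the same kind of invented apple weights. Note that software packages compute quartiles in slightly different ways, so the exact interval can vary a little:

```python
import math
import statistics

# The informal year-12 interval: median +/- 1.5 * IQR / sqrt(n).
# The weights (grams) are invented for illustration.
weights = [148, 151, 155, 160, 149, 157, 153, 152, 158, 150]

n = len(weights)
median = statistics.median(weights)
q1, _, q3 = statistics.quantiles(weights, n=4)  # the three quartile cut points
iqr = q3 - q1
margin = 1.5 * iqr / math.sqrt(n)

print(f"{median - margin:.1f} to {median + margin:.1f}")  # 148.9 to 156.1
```

Because it uses the median and interquartile range, students can read everything they need straight off a box plot.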

## Bootstrapping

Bootstrapping is a very versatile way to find a confidence interval. It has three strengths:

1. It can be used to calculate the confidence interval for a large range of different parameters.
2. It uses ALL the information the sample gives us, rather than just the summary values.
3. It has been found to aid understanding of the concepts of inference better than the traditional methods.

It also has three weaknesses:

1. Old fogeys don’t like it. (Just kidding.) What I mean is that teachers who have always taught using the traditional approach find it difficult to trust what seems like a hit-and-miss method without the familiar theoretical underpinning.
2. Universities don’t teach bootstrapping as much as the traditional methods.
3. The common software packages do not include bootstrap confidence intervals.

The idea behind a bootstrap confidence interval is that we make use of the whole sample to represent the population. We take lots and lots of samples of the same size from the original sample. Obviously we need to sample with replacement, or the samples would all be identical. Then we use these repeated samples to get an idea of the distribution of the estimates of the population parameter. We chop the tails off at a given point, and that gives us the confidence interval. Voilà!
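The whole procedure can be sketched in a few lines, again with invented apple weights:

```python
import random
import statistics

# A sketch of a bootstrap confidence interval for a mean. Resample WITH
# replacement from the original sample, record each resample's mean,
# then chop 2.5% off each tail. The weights (grams) are invented.
random.seed(1)  # fixed seed so the sketch is repeatable
weights = [148, 151, 155, 160, 149, 157, 153, 152, 158, 150]

boot_means = sorted(
    statistics.mean(random.choices(weights, k=len(weights)))
    for _ in range(10_000)
)

lower = boot_means[249]    # 2.5th percentile of the 10,000 resample means
upper = boot_means[9749]   # 97.5th percentile
print(f"{lower:.1f} g to {upper:.1f} g")
```

The same loop works unchanged for a median or a standard deviation: just swap the statistic being computed, which is exactly the versatility claimed above.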

Each weakness has an answer:

1. There is a sound theoretical underpinning for bootstrap confidence intervals. A good place to start is a previous blog about George Cobb’s work. Either that or – “Trust me, I’m a Doctor!” (This would also include trusting far more knowledgeable people such as Chris Wild and Maxine Pfannkuch, and the team of statistical educators led by Joan Garfield.)
2. We have to start somewhere. Bootstrap methods aren’t used at universities because of inertia. As an academic of twenty years I can say that there is NO PAY OFF for teaching new stuff. It takes up valuable research time and you don’t get promoted, and sometimes you even get made redundant. If students understand what confidence intervals are, and the concept of inference, then learning to use the traditional formulas is trivial. Eventually the universities will shift. I am aware that the University of Auckland now teaches the bootstrap approach.
3. There are ways to deal with the software package problem. There is a free software interface called “iNZight” that you can download. I believe Fathom also uses bootstrapping. There may be other software. Please let me know of any and I will add them to this post.

## In Summary

Confidence intervals involve the concepts of variation, sampling and inference. They are a great way to teach these really important concepts, and to help students be critical of single value estimates. They can be taught informally, traditionally or using bootstrapping methods. Any of the approaches can lead to rote use of formula or algorithm and it is up to teachers to aim for understanding. I’m working on a set of videos around this topic. Watch this space.

# Make journalists learn statistics

All journalists should be required to pass a course in basic statistics before they are let loose on the unsuspecting public.

I am not talking about the kind of statistics course that mathematical statisticians are talking about. This does not involve calculus, R or anything tricky requiring a post-graduate degree. I am talking about a statistics course for citizens. And journalists. 🙂

I have thought about this for some years. My father was a journalist, and fairly innumerate unless there was a dollar sign involved. But he was of the old school, who worked their way up the ranks. These days most media people have degrees, and I am adamant that the degree should contain basic numeracy and statistics. The course I devised (which has now been taken over by the maths and stats department and will be shut down later this year, but am I bitter…?) would have been ideal. It included basic number skills, including percentages (which are harder than you think), graphing, data, chance and evidence. It required students to understand the principles behind what they were doing rather than the mechanics.

Here is what journalists should know about statistics:

## Chance

One of the key concepts in statistics is that of variability and chance. Too often a chance event is invested with unnecessary meaning. A really good example of this is the road toll. In New Zealand the road toll over the Easter break has fluctuated between 21 (in 1971) and 3 (in 1998, 2002 and 2003). Then in 2012 the toll was zero, a cause of great celebration. I was happy to see one report say “There was no one reason for the zero toll this Easter, and good fortune may have played a part.” This was a refreshing change, as normally the police seem to take the credit for good news and blame bad news on us. Rather like economists.

With any random process you will get variability. The human mind looks for patterns and meanings even where there are none. Sadly the human mind often finds patterns and imbues meaning erroneously. Astrology is a perfect example of this – and watching Deal or No Deal is inspiring in the meaning people can find in random variation.

All journalists should have a good grasp of the concepts of variability, so that they stop drawing unfounded conclusions.

## Data Display

There are myriad examples of graphs in the media that are misleading, badly constructed, incorrectly specified, or just plain wrong. There was a wonderful one in the Herald Sun recently, which has had considerable publicity. We hope it was just an error, and nothing more sinister. But good subediting (what my father used to do, a role which I think ceased with the advent of the computer) would have picked this up.

There is a very nice website dedicated to this: StatsChat. It unfortunately misquotes H. G. Wells, but has a wonderful array of examples of good and bad statistics in the media. This post gives links to all sorts of sites with bad graphs, many of which were either produced or promulgated by journalists. But not all – scientific literature also has its culprits.

Just a little aside here – why does NO-ONE ever report the standard deviation? I was writing questions involving the normal distribution for practice by students. I am a strong follower of Cobb’s view that all data should be real, so I went looking for some interesting results I could use, with a mean and standard deviation. Heck, I couldn’t even find uninteresting results! The mean and the median rule supreme, and confidence intervals are getting a little look-in. Percentages are often reported with a “margin of error” (does anyone understand that?). But the standard deviation is invisible. I don’t think the standard deviation is any harder to understand than the mean. (Mainly because the mean is very hard to understand!) So why is the standard deviation not mentioned?

## Evidence

One of the main ideas in inferential statistics is that of evidence: The data is here; do we have evidence that this is an actual effect rather than caused by random variation and sampling error? In traditional statistics this is about understanding the p-value. In resampling the idea is very similar to that of a p-value – we ask “could we have got this result by chance?” You do not have to be a mathematician to grasp this idea if it is presented in an accessible way. (See my video “Understanding the p-value” for an example.)
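The resampling version of that question can be sketched with a small randomisation test. The two groups below are invented data, used only to show the logic:

```python
import random
import statistics

# A sketch of the resampling idea behind a p-value: if there were really
# no difference between two groups, how often would shuffling the group
# labels produce a gap at least as big as the one we observed?
# The data are invented for illustration.
random.seed(1)
group_a = [23, 27, 31, 25, 28, 30]
group_b = [20, 22, 24, 21, 26, 23]

observed = statistics.mean(group_a) - statistics.mean(group_b)
combined = group_a + group_b

count = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(combined)  # pretend the labels carry no information
    gap = statistics.mean(combined[:6]) - statistics.mean(combined[6:])
    if gap >= observed:
        count += 1

# "Could we have got this result by chance?" A small proportion says:
# rarely - so the observed gap is evidence of a real difference.
p_value = count / trials
print(p_value)
```

No calculus, no distribution tables: just shuffling and counting, which is what makes the idea accessible to non-mathematicians.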

One very exciting addition to the New Zealand curriculum is the set of Achievement Standards at Years 12 and 13 involving reading and understanding statistical reports. I have great hopes that as teachers embrace these standards, the level of understanding in the general population will increase, and there will be less tolerance for statistically unsound conclusions.

Another source of hope for me is “The Panel”, an afternoon radio programme hosted by Jim Mora on Radio New Zealand National. Each day different guests are invited to comment on current events in a moderately erudite and often amusing way. Sometimes they even have knowledge about the topic, and usually an expert is interviewed. It is as talkback radio really could be. I think. I’ve never listened long enough to talk-back radio to really judge as it always makes me SO ANGRY! Breathe, breathe…

I digress. I have been gratified to hear people on The Panel making worthwhile comments about sample size, sampling method, bias, association and causation. (Not usually using those exact terms, but the concepts are there.) It gives me hope that critical response to pseudo-scientific, and even scientific research is possible in the general populace. My husband thinks that should be “informed populace”, but I can dream.

It is possible for journalists to understand the important ideas of statistics without a mathematically-based and alienating course. I feel an app coming on… (Or should that be a nap?)