Confidence Intervals: informal, traditional, bootstrap

Confidence Intervals

Confidence intervals are needed because there is variation in the world. Nearly all natural, human or technological processes result in outputs which vary to a greater or lesser extent. Examples of this are people’s heights, students’ scores in a well written test and weights of loaves of bread. Sometimes our inability or lack of desire to measure something down to the last microgram will leave us thinking that there is no variation, but it is there. For example we would check the weights of chocolate bars to the nearest gram, and may well find that there is no variation. However if we were to weigh them to the nearest milligram, there would be variation. Drug doses have a much smaller range of variation, but it is there all the same.

You can see a video about some of the main sources of variation – natural, explainable, sampling and due to bias.

When we wish to find out about a phenomenon, the ideal would be to measure all instances. For example we can find out the heights of all students in one class at a given time. However it is impossible to find out the heights of all people in the world at a given time. It is even impossible to know how many people there are in the world at a given time. Whenever it is impossible or too expensive or too destructive or dangerous to measure all instances in a population, we need to take a sample. Ideally we will take a sample that gives each object in the population an equal likelihood of being chosen.

You can see a video here about ways of taking a sample.

When we take a sample there will always be error. It is called sampling error. We may, by chance, get exactly the same value for our sample statistic as the “true” value that exists in the population. However, even if we do, we won’t know that we have.

The sample mean is the best estimate for the population mean, but we need to say how well it is estimating the population mean. For example, say we wish to know the mean (or average) weight of apples in an orchard. We take a sample and find that the mean weight of the apples in the sample  is 153g. If we only took a few apples, it is only a rough idea and we might say we are pretty sure the mean weight of the apples in the orchard is between 143g and 163g. If someone else took a bigger sample, they might be able to say that they are pretty sure that the mean weight of apples in the orchard is between 158g and 166g. You can tell that the second confidence interval is giving us better information as the range of the confidence interval is smaller.

There are two things that affect the width of a confidence interval. The first is the sample size. If we take a really large sample we are getting a lot more information about the population, so our confidence interval will be more precise, or narrower. It is not a one-to-one relationship, but a square-root relationship: the width is proportional to one over the square root of the sample size. If we wish to halve the width of the confidence interval, we will need to increase our sample size by a factor of 4.
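The square-root relationship can be seen in a few lines of Python. This is only a sketch: the standard deviation of 20g is a made-up figure, and the multiplier 1.96 assumes an approximate 95% interval based on the normal distribution.

```python
import math

def margin_of_error(sd, n, z=1.96):
    """Half-width of an approximate 95% interval for a mean: z * sd / sqrt(n)."""
    return z * sd / math.sqrt(n)

sd = 20.0  # hypothetical spread of apple weights, in grams
for n in (25, 100, 400):
    # Each fourfold increase in n halves the margin of error.
    print(n, round(margin_of_error(sd, n), 2))
```

Going from 25 apples to 100 apples halves the margin of error, and it takes 400 apples to halve it again.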

The second thing to affect the width of a confidence interval is the amount of variation in the population. If all the apples in the orchard are about the same weight, then we will be able to estimate that weight quite accurately. However, if the apples are all different sizes, then it will be harder to be sure that the sample represents the population, and we will have a larger confidence interval as a result.

Three ways to find confidence intervals

Traditional (old-fashioned?) Approach

The standard way of calculating confidence intervals is by using formulas developed on the assumptions of normality and the Central Limit Theorem. These formulas are used to calculate the confidence intervals of means, proportions and slopes, but not for medians or standard deviations. That is because there aren’t nice straight-forward formulas for these. The formulas were developed when there were no computers, and analytical methods were needed in the absence of computational power.

In terms of teaching, these formulas are straight-forward, and also include the concept of level of confidence, which is part of the paradigm. You can see a video teaching the traditional approach to confidence intervals, using Excel to calculate the confidence interval for a mean.
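For readers who prefer code to Excel, here is a rough Python sketch of the traditional calculation for a mean. The apple weights are invented for illustration, and the sketch uses the normal multiplier rather than the t-distribution, which is an approximation that is only reasonable for larger samples.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_ci_for_mean(sample, confidence=0.95):
    """Approximate confidence interval for a mean: mean +/- z * sd / sqrt(n)."""
    n = len(sample)
    # z is about 1.96 for 95% confidence (normal approximation, an assumption here)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = stdev(sample) / math.sqrt(n)
    centre = mean(sample)
    return centre - z * standard_error, centre + z * standard_error

# Made-up apple weights in grams
weights = [148, 155, 160, 151, 149, 158, 153, 147, 156, 152]
low, high = normal_ci_for_mean(weights)
print(round(low, 1), round(high, 1))
```

Changing the `confidence` argument shows the level of confidence at work: asking for 99% confidence gives a wider interval from the same data.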

Rule of Thumb

In the New Zealand curriculum at year 12, students are introduced to the concept of inference using an informal method for calculating a confidence interval. The formula is median +/-  1.5 times the interquartile range divided by the square-root of the sample size. There is a similar formula for proportions.
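The informal formula for the median can be sketched in Python as below. The data are invented, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quartile convention used in a particular classroom or software package.

```python
import math
from statistics import median, quantiles

def informal_ci_for_median(sample):
    """Informal interval: median +/- 1.5 * IQR / sqrt(n)."""
    q1, _, q3 = quantiles(sample, n=4)  # lower quartile, median, upper quartile
    half_width = 1.5 * (q3 - q1) / math.sqrt(len(sample))
    centre = median(sample)
    return centre - half_width, centre + half_width

# Made-up heights in centimetres
heights = [158, 162, 165, 167, 168, 170, 171, 173, 175, 178, 181, 184]
print(informal_ci_for_median(heights))
```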

Bootstrapping

Bootstrapping is a very versatile way to find a confidence interval. It has three strengths:

  1. It can be used to calculate the confidence interval for a large range of different parameters.
  2. It uses ALL the information the sample gives us, rather than the summary values.
  3. It has been found to aid in understanding the concepts of inference better than the traditional methods.

There are also some disadvantages:

  1. Old fogeys don’t like it. (Just kidding) What I mean is that teachers who have always taught using the traditional approach find it difficult to trust what seems like a hit-and-miss method without the familiar theoretical underpinning.
  2. Universities don’t teach bootstrapping as much as the traditional methods.
  3. The common software packages do not include bootstrap confidence intervals.

The idea behind a bootstrap confidence interval is that we make use of the whole sample to represent the population. We take lots and lots of samples of the same size from the original sample. Obviously we need to sample with replacement, or the samples would all be identical. Then we use these repeated samples to get an idea of the distribution of the estimates of the population parameter. We chop the tails off at a given point, and we give the confidence interval.  Voila!
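The resampling idea described above can be sketched in Python. This is a minimal version of the simple percentile method, not the procedure used by any particular package; the data, the number of resamples and the function name are all my own choices for illustration.

```python
import random
from statistics import median

def bootstrap_ci(sample, statistic=median, n_resamples=2000, confidence=0.95, seed=1):
    """Percentile bootstrap interval for any statistic of a single sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Resample WITH replacement, same size as the original sample,
    # and record the statistic for each resample.
    estimates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Chop off an equal share of the tails at each end.
    tail = (1 - confidence) / 2
    lower = estimates[round(tail * n_resamples)]
    upper = estimates[round((1 - tail) * n_resamples) - 1]
    return lower, upper

# Made-up apple weights in grams
weights = [148, 155, 160, 151, 149, 158, 153, 147, 156, 152]
print(bootstrap_ci(weights))
```

Because the statistic is passed in as a function, the same code gives an interval for a median, a mean or a standard deviation, which is the versatility listed as the first strength.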

Answers to the disadvantages (burn the straw man?)

  1. There is a sound theoretical underpinning for bootstrap confidence intervals. A good place to start is a previous blog about George Cobb’s work. Either that or – “Trust me, I’m a Doctor!” (This would also include trusting far more knowledgeable people such as Chris Wild and Maxine Pfannkuch, and the team of statistical educators led by Joan Garfield.)
  2. We have to start somewhere. Bootstrap methods aren’t used at universities because of inertia. As an academic of twenty years I can say that there is NO PAY OFF for teaching new stuff. It takes up valuable research time and you don’t get promoted, and sometimes you even get made redundant. If students understand what confidence intervals are, and the concept of inference, then learning to use the traditional formulas is trivial. Eventually the universities will shift. I am aware that the University of Auckland now teaches the bootstrap approach.
  3. There are ways to deal with the software package problem. There is a free software interface called “iNZight” that you can download. I believe Fathom also uses bootstrapping. There may be other software. Please let me know of any and I will add them to this post.

In Summary

Confidence intervals involve the concepts of variation, sampling and inference. They are a great way to teach these really important concepts, and to help students be critical of single value estimates. They can be taught informally, traditionally or using bootstrapping methods. Any of the approaches can lead to rote use of formula or algorithm and it is up to teachers to aim for understanding. I’m working on a set of videos around this topic. Watch this space.

Make journalists learn statistics

All journalists should be required to pass a course in basic statistics before they are let loose on the unsuspecting public.

I am not talking about the kind of statistics course that mathematical statisticians are talking about. This does not involve calculus, R or anything tricky requiring a post-graduate degree. I am talking about a statistics course for citizens. And journalists. :)

I have thought about this for some years. My father was a journalist, and fairly innumerate unless there was a dollar sign involved. But he was of the old school, who worked their way up the ranks. These days most media people have degrees, and I am adamant that the degree should contain basic numeracy and statistics. The course I devised (which has now been taken over by the maths and stats department and will be shut down later this year, but am I bitter…?) would have been ideal. It included basic number skills, including percentages (which are harder than you think), graphing, data, chance and evidence. It required students to understand the principles behind what they were doing rather than the mechanics.

Here is what journalists should know about statistics:

Chance

One of the key concepts in statistics is that of variability and chance. Too often a chance event is invested with unnecessary meaning. A really good example of this is the road toll. In New Zealand the road toll over the Easter break has fluctuated between 21 (in 1971) and 3 (in 1998, 2002 and 2003). Then in 2012 the toll was zero, a cause of great celebration. I was happy to see one report say “There was no one reason for the zero toll this Easter, and good fortune may have played a part.” However this was a refreshing change as normally the police seem to take the credit for good news, and blame bad news on us. Rather like Economists.

With any random process you will get variability. The human mind looks for patterns and meanings even where there are none. Sadly the human mind often finds patterns and imbues meaning erroneously. Astrology is a perfect example of this – and watching Deal or No Deal is inspiring in the meaning people can find in random variation.

All journalists should have a good grasp of the concepts of variability so they stop drawing unfounded conclusions.

Data Display

There are myriad examples of graphs in the media that are misleading, badly constructed, incorrectly specified, or just plain wrong. There was a wonderful one in the Herald Sun recently, which has had considerable publicity. We hope it was just an error, and nothing more sinister. But good subediting (what my father used to do, but I think ceased with the advent of the computer) would have picked this up.

There is a very nice website dedicated to this: StatsChat. It unfortunately misquotes H. G. Wells, but has a wonderful array of examples of good and bad statistics in the media. This post gives links to all sorts of sites with bad graphs, many of which were either produced or promulgated by journalists. But not all – scientific literature also has its culprits.

Just a little aside here – why does NO-ONE ever report the standard deviation? I was writing questions involving the normal distribution for practice by students. I am a strong follower of Cobb’s view that all data should be real, so I went looking for some interesting results I could use, with a mean and standard deviation. Heck I couldn’t even find uninteresting results! The mean and the median rule supreme, and confidence intervals are getting a little look in. Percentages are often reported with a “margin of error” (does anyone understand that?). But the standard deviation is invisible. I don’t think the standard deviation is any harder to understand than the mean. (Mainly because the mean is very hard to understand!) So why is the standard deviation not mentioned?

Evidence

One of the main ideas in inferential statistics is that of evidence: The data is here; do we have evidence that this is an actual effect rather than caused by random variation and sampling error? In traditional statistics this is about understanding the p-value. In resampling the idea is very similar to that of a p-value – we ask “could we have got this result by chance?” You do not have to be a mathematician to grasp this idea if it is presented in an accessible way. (See my video “Understanding the p-value” for an example.)
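The resampling question “could we have got this result by chance?” can be put into a short Python sketch. This is one common form of resampling test (a permutation test on the difference in means); the scores and the function name are invented for illustration.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=1):
    """Proportion of random regroupings with a difference in means
    at least as large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # If the group labels made no difference, any shuffle is equally likely.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Made-up test scores for two groups
group_a = [12, 15, 14, 10, 13, 16]
group_b = [18, 17, 21, 16, 19, 20]
print(permutation_p_value(group_a, group_b))
```

A small proportion says that chance alone rarely produces a difference this big, which is the same reading we give to a small p-value.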

One very exciting addition to the New Zealand curriculum is the set of Achievement Standards at Years 12 and 13 involving reading and understanding statistical reports. I have great hopes that as teachers embrace these standards, the level of understanding in the general population will increase, and there will be less tolerance for statistically unsound conclusions.

Another source of hope for me is “The Panel”, an afternoon radio programme hosted by Jim Mora on Radio New Zealand National. Each day different guests are invited to comment on current events in a moderately erudite and often amusing way. Sometimes they even have knowledge about the topic, and usually an expert is interviewed. It is as talkback radio really could be. I think. I’ve never listened long enough to talk-back radio to really judge as it always makes me SO ANGRY! Breathe, breathe…

I digress. I have been gratified to hear people on The Panel making worthwhile comments about sample size, sampling method, bias, association and causation. (Not usually using those exact terms, but the concepts are there.) It gives me hope that critical response to pseudo-scientific, and even scientific research is possible in the general populace. My husband thinks that should be “informed populace”, but I can dream.

It is possible for journalists to understand the important ideas of statistics without a mathematically-based and alienating course. I feel an app coming on… (Or should that be a nap?)

Which type of error do you prefer?

Mayor Bloomberg is avoiding a Type 2 error

As I write this, Hurricane Sandy is bearing down on the east coast of the United States. Mayor Bloomberg has ordered evacuations from various parts of New York City. All over the region people are stocking up on food and other essentials and waiting for Sandy to arrive. And if Sandy doesn’t turn out to be the worst storm ever, will people be relieved or disappointed? Either way there is a lot of money involved. And more importantly, risk of human injury and death. Will the forecasters be blamed for over-predicting?

Types of error

There are two ways to get this sort of decision wrong. We can do something and find out it was a waste of time, or we can do nothing and wish that we had done something. In the subject of statistics these are known as Type 1 and Type 2 errors. Teaching about Type 1 and Type 2 errors is quite tricky and students often get confused. Does it REALLY matter if they get them around the wrong way? Possibly not, but what really does matter is that students are aware of their existence. We would love to be able to make decisions under certainty, but most decisions involve uncertainty, or risk. We have to choose between the possibility of taking an opportunity and finding out that it was a mistake, and the possibility of turning down an opportunity and missing out on something.

Earthquake prediction

In another recent event, Italian scientists have been convicted of manslaughter for failing to predict a catastrophic earthquake. This has particular resonance in Christchurch as our city has recently been shaken by several large quakes and a multitude of smaller aftershocks. You can see a graph of the Christchurch shakes at this site. For the most part the people of Christchurch understand that it is not possible to predict the occurrence of earthquakes. However it seems that the scientists in Italy may have overstated the lack of risk. Just because you can’t accurately predict an earthquake, it doesn’t mean it won’t happen. Here is a link to a story in Nature about the Italian earthquake.

Tornado warnings

Laura McLay wrote a very interesting post entitled “what is the optimal false alarm rate for tornado warnings?” A high rate of false alarms is likened to the “boy who cried wolf”, to whom nobody listens any more. You would think that there is no harm in warning unnecessarily, but in the long term there is potential loss of life because people fail to heed subsequent warnings.

Operations Research and errors

Pure mathematicians tend not to like statistics much as it isn’t exact. It’s a little bit sullied by its contact with the real world. However Operations Research goes a step further into the messy world of reality and evaluates the cost of each type of error. Decisions are often converted into dollar terms within decision analysis. Like it or not, the dollar is the usual measure of worth, even for a human life, though sometimes a measure called “utility” is employed.
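As a rough illustration of converting a decision into dollar terms, here is a tiny expected-cost comparison in Python. Every number in it is hypothetical, and real decision analysis would consider far more than two actions and one probability.

```python
def expected_costs(p_event, cost_of_acting, cost_if_unprepared):
    """Expected dollar cost of each action, given the chance the event happens.

    Acting costs the same whether or not the event happens (a wasted effort
    if it does not). Doing nothing costs nothing unless the event happens.
    """
    return {
        "act": cost_of_acting,
        "do nothing": p_event * cost_if_unprepared,
    }

# Hypothetical figures: a 10% chance of a storm, $2 million to prepare,
# $50 million of damage if the storm arrives and nothing was done.
costs = expected_costs(p_event=0.10, cost_of_acting=2_000_000,
                       cost_if_unprepared=50_000_000)
best_action = min(costs, key=costs.get)
print(costs, best_action)
```

With these made-up figures it is cheaper on average to act, even though nine times out of ten the preparation will look like a waste.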

Costs of Errors

Sometimes there is very little cost to a type 2 error. A bank manager refusing to fund a new business is avoiding the risk of a type 1 error, which would result in a loss of money. They then become open to a type 2 error, that they missed out on funding a winner. The balance is very much on the side of avoiding a type 1 error. In terms of choosing a life partner, some people are happy to risk a type 1 error, and marry, while others hold back, perhaps making a type 2 error by missing out on a “soul-mate”. Or it may be that we make this decision under the illusion of certainty and perfect information, and the possible errors do not cross our minds.

Cancer screening is a common illustration of type 1 and type 2 errors. With a type 1 error, we get a false positive and are told we have a cancer when we do not. With type 2, the test fails to detect a cancer. In this example the cost of a type 2 error seems to be much worse than type 1. Surely we would rather know if we have cancer? However in the case of prostate cancer, a type 1 error can lead to awful side-effects from unnecessary tests. Conversely a large number of men die from other causes, happily unaware that they have early stages of prostate cancer.

The point is that there is no easy answer when making such decisions.

Teaching about type 1 and type 2 errors

I have found the following helpful when teaching about type 1 and type 2 errors in statistics. Think first about the action that was taken. If the null hypothesis was rejected, we have said that there is an effect. After rejecting the null only two outcomes are possible. We have made the correct decision, or we have made a type 1 error. Conversely if we do not reject the null hypothesis, and do nothing, we have either been correct or made a type 2 error. You cannot make a type 1 error and a type 2 error in the same decision.

  • Decision: Reject the Null. Outcome is:
    • Correct or
    • Type 1 error
  • Decision: Do not reject the Null. Outcome is:
    • Correct or
    • Type 2 error.

Or another way of looking at it is:

  • Do something and get it wrong – Type 1 error
  • Do nothing and regret it – Type 2 error

Avoid error

Students may wonder why we have to have any kind of error. Can we not do something to remove error? In some cases we can – we can spend more money and take a larger sample, thus reducing the likelihood of error. However, that too has its cost. The three costs (of a type 1 error, of a type 2 error, and of collecting more data) are important aspects of decision-making, and helping students to understand this will help them to make and understand decisions.
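The two kinds of error, and the effect of a larger sample, can be explored with a small simulation. This sketch makes several simplifying assumptions of my own: the null hypothesis is that the mean is 0, the standard deviation is known to be 1, the null is rejected when the sample mean is more than 1.96 standard errors from 0, and the "real" effect of 0.3 is invented.

```python
import math
import random

def rejection_rate(true_mean, n, trials=4000, seed=1):
    """Fraction of simulated samples in which H0: mean = 0 is rejected
    (population standard deviation assumed known and equal to 1)."""
    rng = random.Random(seed)
    cutoff = 1.96 / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(true_mean, 1) for _ in range(n)) / n
        if abs(xbar) > cutoff:
            rejections += 1
    return rejections / trials

# Null is true: every rejection is a Type 1 error.
type1_rate = rejection_rate(true_mean=0.0, n=25)
# Null is false: every failure to reject is a Type 2 error.
type2_rate = 1 - rejection_rate(true_mean=0.3, n=25)
# Same real effect, four times the sample size.
type2_rate_bigger_sample = 1 - rejection_rate(true_mean=0.3, n=100)

print(type1_rate, type2_rate, type2_rate_bigger_sample)
```

Each simulated decision lands in exactly one branch, so it can never be both types of error at once. The larger sample leaves the Type 1 rate where the cutoff put it, and buys a lower Type 2 rate at the price of collecting more data.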

Judgment Calls in Statistics and O.R.

The one-armed operations researcher

My mentor, Hans Daellenbach, told me a story about a client asking for a one-armed Operations Researcher. The client was sick of getting answers that went, “On the one hand, the best decision would be to proceed, but on the other hand…”

People like the correct answer. They like certainty. They like to know they got it right.

I tease my husband that he has to find the best picnic spot or the best parking place, which involves us driving around considerably longer than I (or the children) would like. To be fair, we do end up in very nice picnic spots. However, several of the other places would have been just fine too!

In a different context I too am guilty of this – the reason I loved mathematics at school was because you knew whether you were right or wrong and could get a satisfying row of little red ticks (checkmarks) down the page. English and other arts subjects, I found too mushy as you could never get it perfect. Biology was annoying as plants were so variable, except in their ability to die. Chemistry was ok, so long as we stuck to the nice definite stuff like drawing organic molecules and balancing redox equations.

I think most mathematics teachers are mathematics teachers because they like things to be right or wrong. They like to be able to look at an answer and tell whether it is correct, or if it should get half marks for correct working. They do NOT want to mark essays, which are full of mushy judgements.

Again I am sympathetic. I once did a course in basketball refereeing. I enjoyed learning all the rules, and where to stand, and the hand signals etc, but I hated being a referee. All those decisions were just too much for me. I could never tell who had put the ball out, and was unhappy with guessing. I think I did referee two games at a church league and ended up with an angry player bashing me in the face with the ball. Looking back I think it didn’t help that I wasn’t much of a player either.

I also used to find marking exam papers very challenging, as I wanted to get it right every time. I would agonise over every mark, thinking it could be the difference between passing and failing for some poor student. However as the years went by, I realised that the odd mistake or inconsistency here or there was just usual, and within the range of error. To someone who failed by one mark, my suggestion is not to be borderline. I’m pretty sure we passed more people that we shouldn’t have, than the other way around.

Life is not deterministic

The point is that life in general is not deterministic and certain and rule-based. This is where the great divide lies between the subject of mathematics and the practice of statistics. Generally in mathematics you can find an answer and even check that it is correct. Or you can show that there is no answer (as happened in one of our national exams in 2012!). But often in statistics there is no clear answer. Sometimes it even depends on the context. This does not sit well with some mathematics teachers.

In operations research there is an interesting tension between optimisers and people who use heuristics. Optimisers love to say that they have the optimal solution to the problem. The non-optimisers like to point out that the problem solved optimally is so far removed from the actual problem that all it provides is an upper or lower bound to a practical solution to the actual real-life problem situation.

Judgment calls occur all through the mathematical decision sciences. They include

  • What method to use – Linear programming or heuristic search?
  • Approximations – How do we model a stochastic input in a deterministic model?
  • Assumptions – Is it reasonable to assume that the observations are independent?
  • P-value cutoff – Does a p-value of exactly 0.05 constitute evidence against the null hypothesis?
  • Sample size – Is it reasonable to draw any inferences at all from a sample of 6?
  • Grouping – How do we group by age? by income?
  • Data cleaning – Do we remove the outlier or leave it in?

A comment from a maths teacher on my post regarding the Central Limit Theorem included the following: “The questions that continue to irk me are i) how do you know when to make the call? ii) What are the errors involved in making such a call? I suppose that Hypothesis testing along with p-values took care of such issues and offered some form of security in accepting or rejecting such a hypothesis. I am just a little worried that objectivity is being lost, with personal interpretation being the prevailing arbiter which seems inadequate.”

These are very real concerns, and reflect the mathematical desire for correctness and security. But I propose that the security was an illusion in the first place. There has always been personal interpretation. Informal inference is a nice introduction to help us understand that. And in fact it would be a good opportunity for lively discussion in a statistics class.

With bootstrapping methods we don’t have any less information than we did using the Central Limit Theorem. We just haven’t assumed normality or independence. There was no security. There was the idea that with a 95% confidence interval, for example, we are 95% sure that the interval contains the true population value. I wonder how often we realised that 1 in 20 times we were just plain wrong, and in quite a few instances the population parameter would be far from the centre of the interval.
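The “1 in 20” idea is easy to check by simulation. In this sketch the population is normal with a mean of 100 and a standard deviation of 15, both invented, and the interval uses the 1.96 multiplier with the sample standard deviation, which is an approximation.

```python
import math
import random
from statistics import mean, stdev

def coverage(true_mean=100.0, true_sd=15.0, n=50, trials=2000, seed=1):
    """Fraction of simulated 95% intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        half_width = 1.96 * stdev(sample) / math.sqrt(n)
        # The interval "works" if the true mean is within half_width of the sample mean.
        if abs(mean(sample) - true_mean) <= half_width:
            hits += 1
    return hits / trials

print(coverage())
```

The result should be close to 0.95, which means that roughly one interval in twenty misses the true value altogether, and nothing in any single interval tells us whether it is one of the misses.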

The hopeful thing about teaching statistics via bootstrapping, is that by demystifying it we may be able to inject some more healthy scepticism into the populace.

Teaching experimental design

Teaching Experimental Design – a cross-curricular opportunity

The elements that make up a statistics, operations research or quantitative methods course cover three different dimensions (and more). There are:

  • techniques we wish students to master,
  • concepts we wish students to internalise, and
  • attitudes and emotions we wish the students to adopt.

Techniques, concepts and attitudes interact in how a student learns and perceives the subject. Sadly it is possible (and not uncommon) for students to master techniques, while staying oblivious to many of the concepts, and with an attitude of resignation or even antipathy towards the discipline.

Techniques

Often, and less than ideally, course design begins with techniques. The backbone is a list of tests, graphs and procedures that students need to master in order to pass the course. The course outline includes statements like:

  • Students will be able to calculate a confidence interval for a mean.
  • Students will be able to formulate a linear programming model from data.
  • Students will use Excel to make correct histograms. (Good luck with this one!)

Textbooks are organised around techniques, which usually appear in a given sequence, relying on the authors’ perception of how difficult each technique is. Textbooks within a given field are remarkably similar in the techniques they cover in an introductory course.

Concepts

Concepts are more difficult to articulate. In a first course in statistics we wish students to gain an appreciation of the effects of variation. They need to understand how data from a sample differs from population data. In all of the mathematical decision sciences students struggle to understand the nature of a model. The concept of a mathematical model is far from intuitive, but essential.

Attitudes

You can’t explicitly teach attitudes. “Today class, you are going to learn to love statistics!” These are absorbed and formed and reformed as part of the learning process, as a result of prior experiences and attitudes. I have written a post on Anxiety, fear and antipathy for maths, stats and OR, which describes the importance of perseverance, relevance, borrowed self-efficacy and love in the teaching of these subjects. Content and problem context choices can go a long way towards improving attitudes. The instructor should know whether his or her class is more interested in the trajectories of gummy bears, or the more serious topics of cancer screening and crime prevention. Classes in business schools will use different examples than classes in psychology or forestry. Whatever the context, the data should be real, so that students can really engage with it.

I was both amused and a little saddened at this quote from a very good book, “Succeed – how we can reach our goals”. The author (Heidi Grant Halvorson) has described the outcomes of some interesting experiments regarding motivation. She then says, “At this point, you may be wondering if social psychologists get a particular pleasure out of asking people to do really odd things, like eating Cheerios with chopsticks, or eating raw radishes, or not laughing at Robin Williams. The short answer is yes, we do. It makes up for all those hours spent learning statistics.” Hmmm

Experimental Design

So what does this have to do with experimental design?

I have a little confession. I’ve never taught experimental design. I wish I had. I didn’t know as much then as I do now about teaching statistics, and I also taught business students. That’s my excuse, but I regret it. My reasoning was that businesses usually use observational data, not experimental data. And it’s true, except perhaps in marketing research, and process control and possibly several other areas. Oh.

George Cobb, whom I have quoted in several previous posts, proposed that experimental design is a mechanism by which students may learn important concepts. The technique is experimental design, but taught well, it is a way to convey important concepts in statistics and decision science. The pivotal concept is that of variation. If there were no variation, there would be no need for statistics or experimentation. It would be a sad, boring deterministic world. But variation exists, some of which is explainable, and some of which is natural, some of which is due to sampling and some of which is due to bad sampling or experimental practices. I have a YouTube video that explains these four sources of variation. Because variation exists, experiments need to be designed in such a way that we can uncover as best we can the explainable variation, without confounding it with the other types of variation.

The new New Zealand curriculum for Mathematics and Statistics includes experimental design at levels 2 and 3 of the National Certificate of Educational Achievement (the last two years of secondary school). The assessments are internal, and teachers help students set up, execute and analyse small experiments. At level two (implemented this year) the experiments generally involve two groups which are given two treatments, or a treatment and a control. The analysis involves boxplots and informal inference. Some schools used paired samples, but found the type of analysis to be limited as a result. At level three (to be implemented in 2013) this is taken a step further, but I haven’t been able to work out what this step is from the curriculum documents. I was hoping it might be things like randomised block design, or even Taguchi methods, but I don’t think so.
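One practical detail in a two-group experiment is deciding who goes in which group. A common way to avoid confounding is to let chance decide, and a minimal Python sketch of that is below. Random assignment is my own suggestion here rather than something the curriculum documents specify, and the names are invented.

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two groups of
    (nearly) equal size: treatment and control."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical class members
students = ["Aroha", "Ben", "Chen", "Dina", "Eli", "Fetu", "Grace", "Hemi"]
groups = randomly_assign(students, seed=2013)
print(groups)
```

Because the shuffle ignores everything about the students, any differences between them (age, fitness, enthusiasm) end up as natural variation spread across both groups rather than lined up with the treatment.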

Subjects for Experimentation

Bearing in mind the number of students, many of whom wish to use other members of the class, there can be issues of time and fatigue. Here are some possibilities. It would be great if other suggestions could be added as comments to this post.

Behavioural

Some teachers are reluctant to use psychological experiments as it can be a bit worrying to use our students as guinea pigs. However, this is probably the easiest option, and provided informed and parental consent is received, it should be acceptable. All sorts have been suggested such as effects of various distractions (and legal stimulants) on task completion. There are possible experiments in Physical Education (Evaluate the effectiveness of a performance enhancing programme). Or in Music – how do people respond to different music?

I’d love to see some experiments done on the time taken to solve Rogo puzzles, and on what effect route length, number choice, size or age has.

Biology

Anything that involves growing things takes a while and can be fraught. (My own recollection of High School biology is that all my plants died.) But things like water uptake could be possible. Use sticks of celery of different lengths and see how much water they take up in a given time. Germination times or strike rates under different circumstances using cress or mustard?  Talk to the Biology teacher. There are assessment standards in NZ NCEA at levels 2 and 3 which mesh well with the statistics standards.

Technology

Baking. There are various ingredients that could have two or three levels of inclusion – making muffins with and without egg – does it affect the height? Pretty tricky to control, but fun – maybe use uniform amounts of mixture. Talk to the Food tech teacher.

Barbie bungee jumping. How does Barbie’s weight affect how far she falls. By having Barbie with and without a backpack, you get the two treatments. The bungee cords can be made out of rubber bands or elastic.

Things flying through the air from catapults. This has been shown to work as a teaching example. There are a number of variables to alter, such as the weight of the object, the slope of the launchpad, and the person firing.

Inject statistical ideas in application areas

John Maindonald from ANU made the following comment on a previous post: “I am increasingly attracted to the idea that the place to start injecting statistical ideas is in application areas of the curriculum.  This will however work only if the teaching and learning model changes, in ways that are arguably anyway necessary in order to make effective use of those teachers who have really good and effective mathematics and statistics and computing skills.”

How exciting is that? Teachers from different discipline areas work together! There may well be logistical issues and even problems of “turf”. But wouldn’t it be great for mathematics teachers to help students with experiments and analysis in other areas of the curriculum. The students will gain from the removal of “compartments” in their learning, which will help them to integrate their knowledge. The worth of what they are doing would be obvious.

(Note for teachers in NZ: a quick look through the “assessment matrices” for other subjects uncovered a multitude of possibilities for curricular integration if the logistics and NZQA allow.)

The Central Limit Theorem: To teach or not to teach

The question of whether to teach explicitly the Central Limit Theorem seems to divide instructors along philosophical lines. Let us look first at these lines.

There are at least three different areas of activity within the discipline of statistics. These are

  • Theory of statistics and research into statistics
  • Practice of statistics
  • Teaching statistics and related research

Theory and research in statistics

The theory of statistics is mathematical. It is taught and practised in Mathematics and Statistics Departments of Universities. It is possible to be an expert on the theory and mathematics of statistics while having little contact with real data. The theory provides underpinnings to the practice of statistics. It is vital that some people know this – but not most of us. One would hope that people employed as statisticians would have a sound understanding of both the theoretical and applied aspects of statistics. This relates strongly to the research into statistics, which seems to be very mathematical, from my perusal of journals. This research advances the theory and use of statistical methods and philosophy.

Practice of statistics

The practice of statistics occurs in many, many areas, particularly in universities. Most postgraduate courses require some proficiency in the application of statistical methods. Researchers in areas as diverse as psychology, genetics, market research, education, geography, speech therapy, physiotherapy, mechanics, management, economics and medicine all use statistical methods. Some researchers have a deep understanding of the theory of statistics, but most aim to be safe and competent practitioners. When they get to the tricky bits they know to ask a statistician, but most of the day-to-day data generation, collection and analysis is within their capability.

Teaching of statistics and related research

Then there is the teaching of statistics. The level of applicability and theory taught will depend on the context. An instructor in statistics (in a non-service course) in a Department of Mathematics would tend towards the mathematical aspects, as that is most appropriate to the audience. However in just about every other setting the emphasis will be on the practical aspects of data collection and inference. This treatment of statistics is explicable, accessible and interesting to just about anyone, whereas only the mathematically inclined are likely to get excited about the theory of statistics.

There is another growing area, which is the research into the teaching and learning of statistics. This informs and is informed by the other areas, as well as general educational research and cognitive psychology. Much of my thinking comes from this background. An overview of some of the material relating to college level can be found in this literature review. The general topic of How Students Learn Statistics is introduced in this early paper by Joan Garfield (1995), a leader in the field of statistics education research.

Statistics in the school curriculum

Statistics is gradually making its way into the school curriculum internationally, and in New Zealand has become a separate subject in the final year of schooling. There are philosophical issues arising as most of the teachers of statistics are mathematicians, and some tend towards the beauty and elegance of the formulas, proofs etc. The aim of the curriculum, however, is more towards statistical investigations and statistical literacy. There are fuzzy, dirty, ambiguous, context driven explorations with sometimes extensive write-ups. There is discussion and critique of statistical reports. There are experiments which may or may not produce usable results. Some of this is well into the realms of social science and well away from what mathematicians find appealing or even comfortable. In another life I can hear myself saying, “I didn’t become a maths teacher to mark essay questions!” There is a bit of a mismatch between the skill-set and attitudes of the teachers and the curriculum.

Teaching the Central Limit Theorem

One place where this is particularly evident is in the question of teaching the Central Limit Theorem. Mathematicians like the Central Limit Theorem and it seems that they like to teach it. One teacher states “The fact that the CLT is to be de-emphasised in Yr 13 is a major disappointment to me…” This statement prompted this post. I agree that the CLT is neat. It is really handy. And it makes confidence interval calculation almost trivial. There are cool little exercises you can do to illustrate it. It is the backbone of traditional statistical theory.

However, teaching and learning do not always go hand in hand. I wonder how many students really do internalise the Central Limit Theorem. Evidence says not many. Chance, Delmas and Garfield, in “The challenge of developing statistical literacy, reasoning and thinking” (Ben-Zvi and Garfield 2004) state: “Sampling distributions is a difficult topic for students to learn. A complete understanding of sampling distributions requires students to integrate and apply several concepts from different parts of a statistics course and to be able to reason about the hypothetical behavior of many samples – a distinct, intangible thought process for most students. The Central Limit Theorem provides a theoretical model of the behavior of sampling distributions, but students often have difficulty mapping this model to applied contexts. As a result students fail to develop a deep understanding of the concept of sampling distributions and therefore often develop only a mechanical knowledge of statistical inference. Students may learn how to compute confidence intervals and carry out tests of significance, but they are not able to understand and explain related concepts, such as interpreting a p-value.”

I have a confession to make. I didn’t teach the Central Limit Theorem. It never seemed as if it were going to help my students understand what was going on. For a few years I made them do a little simulation exercise which helped them to see why the square-root of n occurred in the denominator of the formula for the standard error. That was fun and seemed to help. But the words “Central Limit Theorem” seldom passed my lips in my twenty years of instruction.
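A simulation exercise of the kind described above can be sketched in a few lines of Python. This is only an illustration, not the exercise actually used in class: the population is invented (an exponential distribution with mean 10), and the helper name `sd_of_sample_means` is hypothetical. It compares the spread of many sample means with the population standard deviation divided by the square root of n.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical right-skewed population (invented values, mean about 10)
population = [rng.expovariate(1 / 10) for _ in range(20_000)]
sigma = statistics.pstdev(population)

def sd_of_sample_means(n, repeats=2000):
    """Take `repeats` random samples of size n and return the SD of their means."""
    means = [statistics.fmean(rng.choices(population, k=n)) for _ in range(repeats)]
    return statistics.stdev(means)

# Map each sample size to (simulated SD of means, sigma / sqrt(n))
results = {n: (sd_of_sample_means(n), sigma / n ** 0.5) for n in (4, 16, 64)}

for n, (simulated, predicted) in results.items():
    print(f"n = {n:2d}: simulated {simulated:.2f}, sigma/sqrt(n) {predicted:.2f}")
```

Each time the sample size is multiplied by four, the spread of the sample means should roughly halve, which is where the square root in the denominator comes from.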

What has helped immeasurably is the use of videos, beginning with “Understanding the p-value”, along with plenty of different examples and exercises using confidence intervals and hypothesis tests. (Another confession – I taught traditional statistical inference, not resampling. My excuse was that I didn’t know any better, and I had to stay in parallel with the course provided by the maths department.) What I have found from my own experience as a learner and as a teacher is that students learn to understand statistics by DOING statistics.

Definition of the Central Limit Theorem

The Central Limit Theorem states that, regardless of the shape of the population distribution (provided its variance is finite), the distribution of sample means is approximately normal if the sample size is large. This was a really brilliant model for when simulation and resampling were impossible. The Central Limit Theorem makes it possible to calculate confidence intervals for population means from sample data. It is the reason why most statistical procedures either assume normality at some point, or take steps to correct for the lack thereof. (See the paper by Cobb I referred to extensively in last week’s post.)
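For readers who want to see the theorem in action, here is a minimal sketch under stated assumptions: the population is an invented exponential distribution (strongly right-skewed), and skewness is used as a rough stand-in for "how far from normal" the distribution of sample means is. The function names are hypothetical.

```python
import random
import statistics

rng = random.Random(42)

def skewness(values):
    """Moment-based skewness; close to 0 for a symmetric, bell-shaped distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def sample_means(n, repeats=5000):
    """Means of `repeats` samples of size n from an exponential (right-skewed) population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]

skew_by_n = {n: skewness(sample_means(n)) for n in (1, 5, 30, 100)}

for n, skew in skew_by_n.items():
    print(f"sample size {n:3d}: skewness of the sample means = {skew:.2f}")
```

The skewness of the sample means should shrink towards zero as the sample size grows, even though the population itself stays just as skewed.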

In a curriculum that develops from informal inference to formal inference using resampling, there is no need to call on the Central Limit Theorem. With resampling we use the distribution of the sample as the best estimate of the distribution of the population. True, it is quicker to use the old method of plugging the values into the formula. However, it isn’t much quicker than using the free iNZight software for resampling.
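To make the resampling idea concrete, here is a rough sketch of a percentile bootstrap interval for a mean. It is not a description of how iNZight works internally; the function name `bootstrap_interval`, the number of repeats and the apple weights are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_interval(sample, repeats=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement from the sample itself, recording each mean
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(repeats)
    )
    lower = means[int(repeats * (1 - level) / 2)]
    upper = means[int(repeats * (1 + level) / 2) - 1]
    return lower, upper

# Invented apple weights in grams, for illustration only
apples = [148, 155, 160, 149, 152, 158, 151, 147, 163, 154, 150, 156]
low, high = bootstrap_interval(apples)
print(f"sample mean {statistics.fmean(apples):.1f} g, interval {low:.1f} g to {high:.1f} g")
```

The sample is treated as a stand-in for the population: we resample from it many times, and the middle 95% of the resampled means gives the interval. No formula for the standard error is needed.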

At high school level we want students to get an understanding of what inference is. (I would suggest my Pinkie Bar lesson as a good way of introducing the rejection part of Cobb’s mantra, Randomise, Repeat, Reject.) I’m not convinced that teaching the Central Limit Theorem and formula-based confidence intervals for means and proportions leads to understanding. Research suggests that it doesn’t. I agree that statistical theorists, and educators and researchers should all understand the Central Limit Theorem. I just don’t think that it has a vital place in an innovative curriculum based on resampling.
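The "Randomise, Repeat, Reject" idea can be sketched as a randomisation test for a two-group experiment, such as the Barbie bungee example with and without a backpack. The data below are invented, and `randomisation_test` is a hypothetical helper, so treat this as one possible illustration rather than a prescribed method.

```python
import random
import statistics

def randomisation_test(group_a, group_b, repeats=10_000, seed=0):
    """Return the observed difference in means and a two-sided randomisation p-value."""
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(repeats):
        # Randomise: shuffle the group labels. Repeat: do it many times.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Reject: a small proportion means chance alone rarely gives such a difference
    return observed, extreme / repeats

# Invented fall distances in cm, for illustration only
with_backpack = [92, 95, 88, 97, 94, 91, 96, 93]
without_backpack = [85, 88, 84, 90, 86, 83, 89, 87]

difference, p_value = randomisation_test(with_backpack, without_backpack)
print(f"observed difference {difference:.2f} cm, p-value {p_value:.4f}")
```

If shuffling the labels almost never produces a difference as large as the observed one, chance alone is a poor explanation and we reject it.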

Concern for students

I suspect that teachers fear that if their students are not taught the Central Limit Theorem and traditional confidence intervals at high school they will be at a disadvantage at university. I’d like to reassure them that it just isn’t true. All first year university statistics courses that I know of assume no prior knowledge of statistics. (The same is true of some second year courses as well!) The greatest gift a high school statistics teacher can give their students is an attitude of excitement and success, with a healthy helping of scepticism, and an idea of what inference is – that we can draw conclusions about a population from a sample. If my first year students had started from that point, half our work would have been done.

Seductive Causation

Causation is a seductive notion. We want to make meaning out of our world.

I love playing “the beeping nose” with little children. I press their nose and it beeps. I press my nose and it whirrs. It fascinates them. They have discovered cause and effect. They can make cool sounds by pressing noses. You can keep them amused for quite some time.

Cause and effect implies control. If we know what causes things we are better able to control them. Scientific endeavor is largely a search for causes.

History is littered with examples of misplaced cause and effect theories. Many of them apply to medicine, and still do. Gerd Gigerenzer cites the example of Rudy Giuliani claiming victory over socialized medicine. Giuliani points out that life expectancy for men diagnosed with prostate cancer is longer in the US than in the UK. Gigerenzer points out that Giuliani omits to mention that they all live about the same length of time from contracting the disease but that because of screening, American men are aware of their illness for longer. And many would not see that awareness as a plus, especially combined with the high rate of false positives and consequent nasty side-effects.

Association can imply several different explanations

In the early days of autism research the blame was often placed on “refrigerator mothers” – mothers who did not show warmth to their babies. This was a result of doctors’ observations of the mothers of children with autism. This has since been discredited as a cause. It is suggested that the mothers were acting that way in response to their baby. It takes two to bond.

It is difficult to prove causation. In any identified statistical correlation or association there are multiple explanations. Effects A and B are found to be related. It could be that A causes B. But maybe B causes A. Or a third possibility is that C causes both A and B. Passage of time may be the universal factor.
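The third possibility, where C causes both A and B, is easy to demonstrate with made-up data. In this sketch all three variables are simulated, A and B never influence each other, and the helper `pearson_r` is written out by hand. Because we built the data ourselves, we can subtract C's contribution exactly, which would not be possible with real observational data.

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(7)
n = 5000
c = [rng.gauss(0, 1) for _ in range(n)]   # hidden common cause C
a = [ci + rng.gauss(0, 1) for ci in c]    # A depends on C only
b = [ci + rng.gauss(0, 1) for ci in c]    # B depends on C only

r_ab = pearson_r(a, b)

# Remove C's contribution (known here only because the data are simulated)
r_ab_without_c = pearson_r(
    [ai - ci for ai, ci in zip(a, c)],
    [bi - ci for bi, ci in zip(b, c)],
)

print(f"correlation of A and B: {r_ab:.2f}")
print(f"correlation once C is removed: {r_ab_without_c:.2f}")
```

A and B are clearly correlated, yet the association disappears once the common cause is accounted for.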

Granny cures the common cold

This reminds me of a Beverly Hillbillies episode where Granny has a cure for the common cold. One spoonful is enough and you are sure to be cured! Miss Hathaway gets all excited about this entrepreneurial opportunity until Granny explains that it takes about ten days to get better. But her patients always do! I think a control group might have been helpful in this instance.

The prevalence of misplaced causation is one of the most important concepts that a teacher of statistics can teach. We need to make sure the citizens of the world take a critical approach to claims of causation.

So how do we teach this?

I don’t have the answer to this one, other than that I know we should try. Stories and more stories, I suspect. Have them identify types of data. Placing labels on things helps. Make sure they can identify observational, epidemiological and experimental data. Get them to think up alternative explanations, and identify misplaced claims of causation. I find True/False questions remarkably useful in challenging students’ thinking.

Unfortunately causation is probably an example where “school learning” and real learning may part company. The students will give the correct answer using their own dysfunctional rules, such as “If the statement includes the word “causes”, it must be false”. But then again, maybe that’s not such a bad rule!

If students laugh at this I think they "get" causation.

You’re teaching it wrong!

“Every year I teach them this and every year they get it wrong!”. This is a phrase I’ve heard from colleagues and from my own mouth. Then it dawned on me – if the students keep getting it wrong, maybe I’m teaching it wrong!

Example of Linear regression analysis

Here’s an example. In linear regression I found that students often had trouble interpreting the slope. They would get it the wrong way around, or just not get it. Every year it was the same and I repeatedly groaned at incorrect interpretations in their work. Then it struck me that maybe it was my fault. Maybe I needed to think harder about why they were not getting it, and implement some changes.

It can be frustrating when students don't seem to get it.

So I did. Consequently we now have a stronger emphasis earlier in the course on fitting lines and interpreting them correctly within multiple real-life contexts. Then later when we have addressed the concepts of hypothesis testing in multiple contexts, we introduce regression. Students are now equipped to bring together their learning on line-fitting and on inference and hypothesis testing. And, happy day, they do! As a result of the redesigned syllabus, their final written reports are much improved and I am spared the annoyance of repeatedly grading incorrect statements.
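As a small illustration of interpreting a slope in context, the sketch below fits a least-squares line to invented data (hours of study and test scores) and prints the interpretation as a sentence. The function `fit_line` is a hypothetical helper. It also fits the line with the variables swapped, which is one possible reading of getting the slope "the wrong way around".

```python
import statistics

def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting y from x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope

# Invented data: hours of study and test score for six students
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 71]

intercept, slope = fit_line(hours, score)
print(f"Each extra hour of study is associated with about {slope:.1f} more marks, on average.")

# Swapping the roles of the variables answers a different question
_, slope_reversed = fit_line(score, hours)
print(f"Each extra mark is associated with about {slope_reversed:.2f} more hours, on average.")
```

The two slopes are not simply reciprocals of each other unless the points lie exactly on a line, so which variable is the response matters for the interpretation.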

What do I mean by “teaching it wrong”?

There are many ways we can teach something poorly (see, I do know about adverbs, but “teaching it wrong” is a more memorable phrase). Some of them are given below, with suggested remedies.

We can assume prior understanding

We can make incorrect assumptions about students’ prior understanding. I assumed students understood the meaning of a slope. They probably should have. But they didn’t, and there is no point in berating students or their previous teachers for their deficiency. It doesn’t help. If students need prior knowledge and don’t have it, then we need to teach it. (And not grudgingly!) We may think we don’t have time to teach the earlier material, but it is pointless pressing on if they are not prepared. A quick pre-test can help us assess when students are ready for the new knowledge.

We can miss what the truly tricky aspects are.

When we really understand something, it can be difficult to remember what was difficult. How many of us can remember learning to change gears in a car, a task that becomes automatic? One of the best ways to work out what is difficult is to be there when the students are learning. At university level it is customary to leave the one-on-one  or small group teaching to graduate assistants. The problem is the professors miss out on understanding what is happening to the students in their class. For this reason I always take at least one tutorial group in any class I teach. Grading papers can also help identify what is causing problems.

We can fail to problem solve – staying in our own rut

Teachers need to reflect and experiment. Teachers are smart people, but sometimes we don’t use our smarts well enough in our teaching. It is not good enough to just keep doing what we have always done, even if everybody, including the textbook, does it too. It is a source of interest to me how many statistics courses and textbooks still teach the normal approximation to the binomial distribution. Fair enough, show that the binomial approaches the normal (if you must), but Excel will solve binomial examples just fine for any parameters I’ve given it. There is no need to approximate.
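To show how little is lost by dropping the approximation, here is a short sketch that computes exact binomial probabilities and compares them with the normal approximation. Python stands in for the spreadsheet here; the function names and the three example cases are invented for illustration.

```python
import math
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k), with a continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return NormalDist(mean, sd).cdf(k + 0.5)

# Invented cases: one symmetric, two with a small probability of success
for n, p, k in [(20, 0.5, 10), (20, 0.1, 0), (200, 0.02, 1)]:
    exact = binomial_cdf(k, n, p)
    approx = normal_approx_cdf(k, n, p)
    print(f"n={n}, p={p}, P(X <= {k}): exact {exact:.4f}, normal approx {approx:.4f}")
```

The exact calculation is a one-line sum, and it stays correct when p is small, which is exactly where the approximation tends to be least reliable.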

We can fail to give enough good examples

We can fail to give enough examples for students to generalise – or have the examples create incorrect generalisations. A previous post talks about the need for repetition or practice in the construction of knowledge. Given the opportunity, students will entrench wrong interpretations by finding spurious rules and patterns. For example, if all the minimizing Linear Program examples have only greater-than constraints, students will form the idea that this is what must happen. Or if all examples testing means of weight loss are paired, students will use the context to judge, often erroneously. Well-thought-out sets of examples and exercises can really help, and give the students a sense of unfolding understanding.

The fun part is when you teach it right

Along with many wrong ways, there are many wonderful, right ways we can teach something well. Our task is to find or create these ways, and when we do, the result for learner and teacher is joyful.

Drill and Rote in teaching LP and Hypothesis Testing

Drill and rote-learning are derogatory terms in many education settings. They have the musty taint of “old-fashioned” ways of teaching. They evoke images of wooden classrooms and tight-lipped spinsters dressed in grey looming over trembling pupils as they recite their times-tables. Drill and rote-learning imply mindless repetition, devoid of understanding.

Much more attractive educational terms are “discovery”, “exploration”, “engagement”. Constructivism requires that learners engage with their materials and create learning by building on existing knowledge and experiences.

But (and I’m sure you could see this coming) I think there is a place for something not far from drill or rote-learning when teaching statistics and operations research. However I like to call it “well-designed repetitive practice”, rather than drill or rote-learning. With another name it smells a little sweeter.

Students need repeated exposure to and exploration of spreadsheet Linear Programming models in order to generalize and construct their own understanding correctly. Students benefit from repeated exposure to hypothesis testing in different contexts in order to discern the general from the specific. But this is not “mindless repetition” of similar examples where wrong generalizations can (and will) be constructed. The different examples should be carefully managed to make effective use of students’ time, and avoid reinforcement of incorrect concepts.

Reason for well-designed repetitive practice

A single instance of a phenomenon does not provide enough information to transfer to another instance. It is only by being exposed to multiple instances that learners can decide which aspects are in common or general, and which are specific to that particular example. Exploring one instance of a linear program (LP) in a standard format gives an initial understanding, but in order to generalize, there must be multiple examples.

Learners, in general, endeavor to make sense of the material by making generalizations about the different examples they are given. If the common elements they perceive are not relevant, the learners make incorrect generalizations. If the first three examples of an LP spreadsheet have all decision variables in the same units, students can reasonably assume that LPs require decision variables to use the same units. To avoid this, the set of examples used must be carefully constructed. If all the hypothesis testing examples result in rejecting the null hypothesis, students gain an incorrect generalization that this is the usual result.

It is popular practice in entry-level statistics courses to require students to collect their own data, analyse and report on it. This is a wonderful way for students to learn and engage with the process of statistical analysis. My concern is that it gives only one example from which the student can construct their understanding of the process. Ideally students would have exposure to many different examples before embarking on their own project.

A learning management system is invaluable. We have a bank of very carefully constructed examples which students work through, to help them gradually develop understanding. The data is real – from questionnaires they or earlier classes completed. There is immediate feedback on submission of their answers, again to reinforce correct concepts. We explain to students that they should not wait until they understand the process completely before they begin, but rather that the understanding comes with doing. There are many parallels for this kind of learning. Chess, sports, driving and speaking a language all develop through practice. Understanding follows practice.

What’s more, this method seems to work. Students are motivated to work through multiple examples so that they internalize the process and improve their understanding. And they gain a sense of accomplishment and confidence at correctly completing the examples.

Statistics for all

Let’s start with a question. Please answer it now before you read any further!

Statistics, like Operations Research, is a mathematical science. However people can be intelligent consumers of statistical analysis without having to use mathematics. The statement in the box above is false.

Often statistics is taught by mathematics teachers, who understand the mathematical aspects of statistics, but may never have dirtied their hands with real data. They teach the mechanics of calculating the values of standard deviations and confidence intervals, intending that this will lead to understanding. Unfortunately many of their pupils do not gain understanding from the application of formulas. A high school maths teacher in a Masters in Education course I taught was excited to understand at last what a confidence interval was. He had taught his students how to calculate one, and the textbook interpretation, but he hadn’t really “got it” until then.

Statistical analysis is like detective work.

Statistics is not just mathematics with context. Statistics is magical and exciting, like a treasure hunt or a detective story. You start with an idea, and collect some data and then explore the data for its secrets. You uncover relationships and effects, and have to decide whether they constitute real evidence for your ideas. You then need to work out how to express your findings in sentences and in graphs in ways that your audience will understand.

Statistical analysis is needed for most research. Research in areas such as psychology, marketing, sociology, astronomy, medicine, political science, forensics and education, all rely on statistical analysis.

My belief is that there are a few main concepts behind statistics, and if you can understand them, most analysis will be comprehensible.

The key ideas are:

  • variability,
  • sampling and
  • the p-value (inference).

The aim of this blog is to help people learn statistics and the allied discipline, operations research. It also aims to provide ideas and insights to teachers of statistics and operations research. Each of the key ideas will be addressed, and techniques explained.

I hope that people will sometimes disagree with what I say, and let me know. Debate without rancour leads to improved thinking. There is also room for contributions from other teachers of statistics and operations research.

Debate can help understanding