# The silent dog – null results matter too!

Recently I was discussing the process we use in a statistical enquiry. The ideal is that we start with a problem and follow the statistical enquiry cycle through the steps Problem, Plan, Data collection, Analysis and Conclusion, which then may lead to other enquiries. We have recently published a video outlining this process.

I have also previously written a post suggesting that the cyclical nature of the process was overstated.

The context of our discussion was another video I am working on, that acknowledges that often we start, not at the beginning, but in the middle, with a set of data. This may be because in an educational setting it is too expensive and time consuming to require students to collect their own data. Or it may be that as statistical consultants we are brought into an investigation once the data has been collected, and are needed to make some sense out of it. Whatever the reason, it is common to start with the data, and then loop backwards to the Problem and Plan phases, before performing the analysis and writing the conclusions.

## Looking for relationships

We, a group of statistical educators, were discussing what we would do with a data set: looking at the level of measurement, the origins of the data, and the possible intentions of the people who collected it. One teacher suggested that her students do exploratory scatter plots of all the possible pairings, as well as comparative dotplots and boxplots. The students can then choose a problem that is likely to show a relationship – because they have already seen that there is a relationship in the data.

I have a bit of a problem with this. It is fine to get an overview of the relationships in the data – that is one of the beauties of statistical packages. And I can see that for an assignment, it is more rewarding for students to have a result they can discuss. If they get a null result there is a tendency to think that they have failed. Yet the lack of evidence of a relationship may be more important than evidence of one. The problem is that we value positive results over null results. This is a well-known problem in academic journals, and many words have been written about the over-occurrence of Type I errors, or publication bias.

Let me illustrate. A drug manufacturer hopes that drug X is effective in treating depression. In reality drug X is no more effective than a placebo. The manufacturer keeps funding different tests by different scientists. If all the experiments use a significance level of 0.05, then about 5% of the experiments will produce a Type I error and conclude that there is an effect attributable to drug X. These (false) positive results get published, because academic journals prefer positive results to null results. Conversely, the much larger number of researchers who correctly concluded that there is no effect do not get published, and the abundance of evidence to the contrary is invisible. To be fair, it is hoped that these researchers will be able to refute the false positive paper.
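The arithmetic of this illustration is easy to check by simulation. The sketch below (in Python, with invented trial sizes and effect values) runs many experiments on a drug that truly does nothing, and counts how often a two-sided test at the 0.05 level declares an effect:

```python
import numpy as np

rng = np.random.default_rng(42)

n_experiments = 10_000   # independently funded trials of the useless drug
n_per_group = 100        # patients per arm (an invented number)

false_positives = 0
for _ in range(n_experiments):
    # Drug X is truly no better than placebo: both arms share one distribution
    drug = rng.normal(loc=50, scale=10, size=n_per_group)
    placebo = rng.normal(loc=50, scale=10, size=n_per_group)
    # Two-sample z statistic (samples are large, so this approximation is fine)
    se = np.sqrt(drug.var(ddof=1) / n_per_group
                 + placebo.var(ddof=1) / n_per_group)
    z = (drug.mean() - placebo.mean()) / se
    if abs(z) > 1.96:        # two-sided test at the 0.05 significance level
        false_positives += 1

print(false_positives / n_experiments)  # close to 0.05, as expected
```

About one trial in twenty "finds" an effect that is not there, and those are the trials a journal is most likely to see.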

## Let them see null results

So where does this leave us as teachers of statistics? Awareness is a good start. We need to show null effects and why they are important. For every example we give that ends up rejecting the null hypothesis, we need an example that does not. Textbooks tend to over-include results that reject the null, so that when students meet a non-significant result they are left wondering whether they have made a mistake. In my preparation of learning materials, I endeavour to keep a good spread of results – strongly positive, weakly positive, inconclusive, weakly negative and strongly negative. This way students are accepting of a null result, and know what to say when they get one.

Another example is in the teaching of time series analysis. We love to show series with strong seasonality. It tells a story. (See my post about time series analysis as storytelling.) Retail sales nearly all peak in December, and various goods have other peaks. Jewellery retail sales in the US have small peaks in February and May, and it is fun working out why. Seasonal patterns seem like magic. However, we also need to let students analyse data that does not have a strong seasonal pattern, so that they learn that such series exist too!

My final research project before leaving the world of academia involved an experiment on the students in my class of over 200. It was difficult to get through the human ethics committee, but made it in the end. The students were divided into two groups, and half were followed up by tutors weekly if they were not keeping up with assignments and testing. The other half were left to their own devices, as had previously been the case. The interesting result was that it made no difference to the pass rate of the students. In fact the proportion of passes was almost identical. This was a null result. I had supposed that following up and helping students to keep up would increase their chances of passing the course. But it didn’t. This important result saved us money in terms of tutor input in following years. Though it felt good to be helping our students more, it didn’t actually help them pass, so was not justifiable in straitened financial times.

I wonder if it would have made it into a journal.

By the way, my reference to the silent dog in the title is to the famous Sherlock Holmes story, Silver Blaze, where the fact that the dog did not bark was important as it showed that the person was known to it.

# Proving causation

## Aeroplanes cause hot weather

In Christchurch we have a weather phenomenon known as the “Nor-wester”, which is a warm dry wind, preceding a cold southerly change. When the wind is from this direction, aeroplanes make their approach to the airport over the city. Our university is close to the airport in the direct flightpath, so we are very aware of the planes. A new colleague from South Africa drew the amusing conclusion that the unusual heat of the day was caused by all the planes flying overhead.

Statistics experts and educators spend a lot of time refuting claims of causation. “Correlation does not imply causation” has become a catch cry of people trying to avoid the common trap. This is a great advance in understanding that even journalists (notoriously math-phobic) seem to have caught onto. My own video on important statistical concepts ends with the causation issue. (You can jump to it at 3:51)

So we are aware that it is not easy to prove causation.

In order to prove causation we need a randomised experiment. We need to randomise every possible factor that could be associated with, and thus cause or contribute to, the effect. This next video, about experimental design, addresses this concept. It is possible to prove that one factor causes an effect by using a randomised design.

There is also the related problem of generalisability. If we have a randomised experiment, we can prove causation. But unless the sample is also a random, representative sample of the population in question, we cannot infer that the results will transfer to that population. This is nicely illustrated in this matrix from The Statistical Sleuth by Fred L. Ramsey and Daniel W. Schafer.

The relationship between the type of sample and study and the conclusions that may be drawn.

The top left-hand quadrant is the one in which we can draw causal inferences for the population.

## Causal claims from observational studies

A student posed this question:  Is it possible to prove a causal link based on an observational study alone?

It would be very useful if we could. It is not always possible to use a randomised trial, particularly when people are involved. Before we became more aware of human rights, experiments were performed on unsuspecting human lab rats. A classic example is the Vipeholm experiments where patients at a mental hospital were the unknowing subjects. They were given large quantities of sweets in order to determine whether sugar caused cavities in teeth. This happened into the early 1950s. These days it would not be acceptable to randomly assign people to groups who are made to smoke or drink alcohol or consume large quantities of fat-laden pastries. We have to let people make those lifestyle choices for themselves. And observe. Hence observational studies!

There is a call for “evidence-based practice” in education to follow the philosophy in medicine. But getting educational experiments through ethics committee approval is very challenging, and it is difficult to use rats or fruit-flies to impersonate the higher learning processes of humans. The changing landscape of the human environment makes it even more difficult to perform educational experiments.

To find out the criteria for justifying causal claims in an observational study I turned to one of my favourite statistics textbooks, Chance Encounters by Wild and Seber (page 27). They cite the Surgeon General of the United States. The criteria for the establishment of a cause-and-effect relationship in an epidemiological study are the following:

1. Strong relationship: For example, illness is four times as likely among people exposed to a possible cause as it is for those who are not exposed.
2. Strong research design.
3. Temporal relationship: The cause must precede the effect.
4. Dose-response relationship: Higher exposure leads to a higher proportion of people affected.
5. Reversible association: Removal of the cause reduces the incidence of the effect.
6. Consistency: Multiple studies in different locations produce similar effects.
7. Biological plausibility: There is a supportable biological mechanism.
8. Coherence with known facts.

## Teaching about causation

In high school, and entry-level statistics courses, the focus is often on statistical literacy. This concept of causation is pivotal to correct understanding of what statistics can and cannot claim. It is worth spending some time in the classroom discussing what would constitute reasonable proof and what would not. In particular it is worthwhile to come up with alternative explanations for common fallacies, or even truths, in causation. Some examples for discussion might be drink-driving and accidents, smoking and cancer, gender and success in any number of areas, home game advantage in sport, the use of lucky charms, socks and undies. This also links nicely with probability theory, helping to tie the year’s curriculum together.

# Parts and whole

The whole may be greater than the sum of the parts, but the whole still needs those parts. A reflective teacher will think carefully about when to concentrate on the whole, and when on the parts.

## Golf

If you were teaching someone golf, you wouldn’t spend days on a driving range, never going out on a course. Your student would not get the idea of what the game is, or why they need to be able to drive straight and to a desired length. Nor would it be much fun! Similarly, if the person only played games of golf it would be difficult for them to develop their game. Practice at driving and putting is needed. A serious student of golf would also read about the game and watch expert players.

## Music

Learning music is similar. Anyone who is serious about developing as a musician will spend a considerable amount of time developing their technique and their knowledge by practising scales, chords and drills. But at the same time they need to be playing full pieces of music so that they feel the joy of what they are doing. As they play music, as opposed to drill, they will see how their less-interesting practice has helped them to develop their skills. However, as they practise a whole piece, they may well find a small part that is tripping them up, and focus for a while on that. If they play only the piece as a whole, it is not an efficient use of time. A serious student of music will also listen to and watch great musicians, in order to develop their own understanding and knowledge.

## Benefits of study of the whole and of the parts

In each of these examples we can see that there are aspects of working with the whole, and aspects of working with the parts. Study of the whole contributes perspective and meaning to study, and helps to tie things together. It helps students to see where they have made progress. Study of the parts isolates areas of weakness, develops skills and saves time in practice, thus being more efficient.

It is very important for students to get an idea of the purpose of their study, and where they are going. For this reason I have written earlier about the need to see the end when starting out in a long procedure such as a regression or linear programming model.

It is also important to develop “statistical muscle memory” by repeating small parts of the exercise over and over until it is mastered. Practice helps people to learn what is general and what is specific in the different examples.

# Teaching conditional probability

We are currently developing a section on probability as part of our learning materials. A fundamental understanding of probability and uncertainty are essential to a full understanding of inference. When we look at statistical evidence from data, we are holding it up against what we could reasonably expect to happen by chance, which involves a probability model. Probability lies in the more mathematical area of the study of statistics, and has some fun problem-solving aspects to it.

A popular exam question involves conditional probability. We like to use a table approach to this as it avoids many of the complications of terminology. I still remember my initial confusion over the counter-intuitive expression P(A|B), which means the probability that an object from subset B has the property A. There are several places where students can come unstuck in Bayesian reasoning, and the problems can take a long time. We can liken solving a conditional probability problem to a round of golf, or a long piece of music. So in teaching we first take the students step by step through the whole problem. This includes working out what the words are saying, putting the known values into a table, calculating the unknown values in the table, and then using the table to answer the questions involving conditional probability.
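As a sketch of the table approach, here is a worked example in Python. The scenario and all the numbers are invented (a screening-test problem, not one from our materials): the known values go into the table, the remaining cells are filled in by subtraction, and the conditional probability is read off one column.

```python
# Hypothetical problem: 10% of 1000 people have a condition; a test is
# positive for 90% of those who have it and 5% of those who do not.
total = 1000
have = 100                  # 10% of 1000
not_have = total - have     # 900

# Known cell values from the wording of the (made-up) question
pos_and_have = 90                   # 90% of the 100 who have it
pos_and_not = 0.05 * not_have       # 5% of the 900 who do not

# Fill in the remaining cells of the two-way table by subtraction
neg_and_have = have - pos_and_have          # 10
neg_and_not = not_have - pos_and_not        # 855
total_pos = pos_and_have + pos_and_not      # 135

# P(have condition | positive test): restrict attention to the
# "positive" column of the table
p = pos_and_have / total_pos
print(round(p, 3))   # 90/135, which is about 0.667
```

The counter-intuitive part is exactly the one the table makes visible: the denominator is the whole "positive" column, not the whole table.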

Then we work on the individual steps, isolating them so that students can get sufficient practice to find out what is general and what is specific to different examples. As we do this we endeavour to provide enough variety that students do not work out some heuristic based on the wording of the question that actually stops them from understanding. An example of this is that if we use the same template each time, students will work out that the first number stated always goes in a certain place in the table, the second in another place, and so on. This is a short-term strategy that we need to protect them from through careful generation of questions.

As it turns out students should already have some of the necessary skills. When we review probability at the start of the unit, we get students to calculate probabilities from tables of values, including conditional probabilities. Then when they meet them again as part of the greater whole, there is a familiar ring.

Once the parts are mastered, the students can move on to a set of full questions, using each of the steps they have learned, and putting them back into the whole. Because they are fluent in the steps, it becomes more intuitive to put the whole back together, and when they meet something unusual they are better able to deal with it.

## Starting a course in Operations Research/Management Science

It is interesting to contemplate what “the whole” is, with regard to any subject. In operations research we used to begin our first class, like many first classes, talking about what management science/operations research is. It was a pretty passive sort of class, and I felt it didn’t help, as first-year university students had little relevant knowledge to pin the ideas on. So we changed to an approach that put them straight into the action and taught several weeks of techniques first. We started with project management and taught critical path. Then we taught identifying fixed and variable costs and break-even analysis. The next week was discounting and analysis of financial projects. Then for a softer example we looked at multi-criteria decision-making (MCDM). It tied back to the previous week by taking a different approach to a decision regarding a landfill. Then we introduced OR/MS and the concept of mathematical modelling. By then we could give real examples of how mathematical models could be used to inform real-world problems. It was helpful to go from the concrete to the abstract. This was a much more satisfactory approach.

So the point is not that you should always start with the whole and then do the parts and then go back to the whole. The point is that a teacher needs to think carefully about the relationship between the parts and the whole, and teach in a way that is most helpful.

# Why engineers and poets need to know about statistics

I’m kidding about poets. But lots of people need to understand the three basic areas of statistics: Chance, Data and Evidence.

Recently Tony Greenfield, an esteemed applied statistician, (with his roots in Operations Research) posted the following request on a statistics email list:

“I went this week to the exhibition and conference in the NEC run by The Engineer magazine. There were CEOs of engineering companies of all sizes, from small to massive. I asked a loaded question:  “Why should every engineer be a competent applied statistician?” Only one, from more than 100 engineers, answered: “We need to analyse any data that comes along.” They all seemed bewildered when I asked if they knew about, or even used, SPC and DoE. I shall welcome one paragraph responses to my question. I could talk all day about it but it would be good to have a succinct and powerful few words to use at such a conference.”

For now I will focus on civil engineers, as they are often what people think of as engineers. I’m not sure about the “succinct and powerful” nature of the words to follow, but here goes…

The subject of statistics can be summarised as three areas – Chance, Data and Evidence (CDE!).

Chance includes the rules and perceptions of probability, and emphasises the uncertainty in our world. I suspect engineers are more at home in a deterministic world, but determinism is just a model of reality. The strength of a bar of steel is not exact, but will be modelled with a probability distribution. An understanding of probability is necessary before using terms such as “one hundred year flood”. Expected values are used for making decisions on improving roads and intersections. The capacity of stadiums and malls, and the provision of toilets and exits all require modelling that relies on probability distributions. It is also necessary to have some understanding of our human fallibility in estimating and communicating probability. Statistical process control accounts for acceptable levels of variation, and indicates when they have been exceeded.
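The “one hundred year flood” point can be made concrete with two lines of arithmetic. Assuming an annual exceedance probability of 1/100 and a 50-year design life (both numbers invented for illustration), the chance of seeing at least one such flood is surprisingly high:

```python
# A "one hundred year flood" has probability 1/100 in any given year.
# Over a 50-year design life (a made-up figure), the probability of
# at least one such flood is the complement of none occurring.
p_annual = 1 / 100
years = 50
p_at_least_one = 1 - (1 - p_annual) ** years
print(round(p_at_least_one, 3))  # about 0.395 -- nearly a 40% chance
```

This is the kind of calculation that separates a sound use of the term from a misleading one.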

The Data aspect of the study of statistics embraces the collection, summary and communication of data. In order to make decisions, data must be collected. Correct summary measures must be used, often the median, rather than the more popular mean. Summary measures should preferably be expressed as confidence intervals, thus communicating the level of precision inherent in the data. Appropriate graphs are needed, which seldom includes pictograms or pie charts.

Evidence refers to the inferential aspects of statistical analysis. The theories of probability are used to evaluate whether a certain set of data provides sufficient evidence to draw conclusions. An engineer needs to understand the use of hypothesis testing and the p-value in order to make informed decisions regarding data. Any professional in any field should be using evidence-based practice, and journal articles providing evidence will almost always refer to the p-value. They should also be wary of claims of causation, and understand the difference between strength of effect and strength of evidence. Our video provides a gentle introduction to these concepts.

Design of Experiments also incorporates the Chance, Data and Evidence aspects of the discipline of statistics. By randomising the units in an experiment we can control for the extraneous factors that would confound an observational study. Engineers should be at home with these concepts. We have two new videos that illustrate this.

So, Tony, how was that? Not exactly succinct, and four paragraphs rather than one. The videos are optional. I think the Chance, Data, Evidence framework helps provide structure to the explanation.

# So what about the poets?

I borrow the term from Peter Bell of Richard Ivey School of Business, who teaches operations research to MBA students, and wrote a paper, Operations Research For Everyone (including poets). If it is difficult to get the world to recognise the importance of statistics, how much harder is it to convince them that Operations Research is vital to their well-being!

Bell uses the term, “poet” to refer to students who are not naturally at home with mathematics. In conversation Bell explained how many of his poets, who were planning to work in the area of human resource management found their summer internships were spent elbow-deep in data, in front of a spreadsheet, and were grateful for the skills they had resisted gaining.

An understanding of chance, data and evidence is useful/essential for “efficient citizenship”, to paraphrase the often paraphrased H. G. Wells. I have already written on the necessity for journalists to have an understanding of statistics. The innovative New Zealand curriculum recognises the importance of an understanding of statistics for all. There are numerous courses dedicated to making sure that medical practitioners have a good understanding.

So really, there are few professions or trades that would not benefit from a grounding in Chance, Data and Evidence. And Operations Research too, but for now that may be a bridge too far.

# Confidence Intervals

Confidence intervals are needed because there is variation in the world. Nearly all natural, human or technological processes result in outputs which vary to a greater or lesser extent. Examples of this are people’s heights, students’ scores in a well written test and weights of loaves of bread. Sometimes our inability or lack of desire to measure something down to the last microgram will leave us thinking that there is no variation, but it is there. For example we would check the weights of chocolate bars to the nearest gram, and may well find that there is no variation. However if we were to weigh them to the nearest milligram, there would be variation. Drug doses have a much smaller range of variation, but it is there all the same.

You can see a video about some of the main sources of variation – natural, explainable, sampling and due to bias.

When we wish to find out about a phenomenon, the ideal would be to measure all instances. For example we can find out the heights of all students in one class at a given time. However it is impossible to find out the heights of all people in the world at a given time. It is even impossible to know how many people there are in the world at a given time. Whenever it is impossible or too expensive or too destructive or dangerous to measure all instances in a population, we need to take a sample. Ideally we will take a sample that gives each object in the population an equal likelihood of being chosen.

You can see a video here about ways of taking a sample.

When we take a sample there will always be error. It is called sampling error. We may, by chance, get exactly the same value for our sample statistic as the “true” value that exists in the population. However, even if we do, we won’t know that we have.

The sample mean is the best estimate for the population mean, but we need to say how well it is estimating the population mean. For example, say we wish to know the mean (or average) weight of apples in an orchard. We take a sample and find that the mean weight of the apples in the sample is 153g. If we only took a few apples, this is only a rough idea, and we might say we are pretty sure the mean weight of the apples in the orchard is between 143g and 163g. If someone else took a bigger sample, they might be able to say that they are pretty sure that the mean weight of apples in the orchard is between 158g and 166g. You can tell that the second confidence interval is giving us better information, as the range of the confidence interval is smaller.

There are two things that affect the width of a confidence interval. The first is the sample size. If we take a really large sample we are getting a lot more information about the population, so our confidence interval will be narrower. It is not a one-to-one relationship, but a square-root relationship. If we wish to reduce the width of the confidence interval by a factor of two, we will need to increase our sample size by a factor of four.
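The square-root relationship is easy to demonstrate by simulation. In this Python sketch (using a made-up population of apple weights) quadrupling the sample size roughly halves the half-width of the interval:

```python
import numpy as np

rng = np.random.default_rng(1)
# An invented population of 100,000 apple weights, in grams
population = rng.normal(150, 20, size=100_000)

def ci_halfwidth(n):
    """Half-width of an approximate 95% confidence interval for the mean,
    from a random sample of size n."""
    sample = rng.choice(population, size=n, replace=False)
    return 1.96 * sample.std(ddof=1) / np.sqrt(n)

w_small = ci_halfwidth(100)
w_large = ci_halfwidth(400)   # four times the sample size
print(round(w_small / w_large, 2))  # roughly 2: quadrupling n halves the width
```

The ratio is not exactly two, because each sample gives a slightly different estimate of the population spread, but the square-root pattern shows through clearly.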

The second thing to affect the width of a confidence interval is the amount of variation in the population. If all the apples in the orchard are about the same weight, then we will be able to estimate that weight quite accurately. However, if the apples are all different sizes, then it will be harder to be sure that the sample represents the population, and we will have a larger confidence interval as a result.

# Traditional (old-fashioned?) Approach

The standard way of calculating confidence intervals is by using formulas developed from the assumptions of normality and the Central Limit Theorem. These formulas are used to calculate the confidence intervals of means, proportions and slopes, but not of medians or standard deviations, because there are no straightforward formulas for these. The formulas were developed when there were no computers, and analytical methods were needed in the absence of computational power.
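For the mean, the traditional formula is the sample mean plus or minus a multiplier times the standard error. A minimal sketch in Python, with invented data and the usual 1.96 multiplier for 95% confidence:

```python
import numpy as np

# Made-up sample of ten apple weights (grams), for illustration only
sample = np.array([148, 152, 155, 149, 160, 151, 147, 158, 153, 150])
n = len(sample)

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# Traditional 95% confidence interval: mean +/- 1.96 standard errors
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(round(lower, 1), round(upper, 1))
```

(With a sample this small, a t multiplier would strictly be more appropriate than 1.96; the point here is only the shape of the formula.)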

In terms of teaching, these formulas are straightforward, and also include the concept of the level of confidence, which is part of the paradigm. You can see a video teaching the traditional approach to confidence intervals, using Excel to calculate the confidence interval for a mean.

# Rule of Thumb

In the New Zealand curriculum at year 12, students are introduced to the concept of inference using an informal method for calculating a confidence interval. The formula is median ± 1.5 × IQR ÷ √n, where IQR is the interquartile range and n is the sample size. There is a similar formula for proportions.
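The rule of thumb fits in a few lines of Python. The data here are made up; only the formula comes from the curriculum:

```python
import numpy as np

# An invented sample of 16 values, just to show the arithmetic
sample = np.array([12, 15, 15, 16, 17, 18, 18, 19, 20,
                   20, 21, 22, 23, 25, 27, 30])
n = len(sample)

median = np.median(sample)
iqr = np.percentile(sample, 75) - np.percentile(sample, 25)

# Informal interval: median +/- 1.5 * IQR / sqrt(n)
half_width = 1.5 * iqr / np.sqrt(n)
print(median - half_width, median + half_width)
```

The appeal for teaching is clear: every quantity in the formula is something students can read off a boxplot.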

# Bootstrapping

Bootstrapping is a very versatile way to find a confidence interval. It has three strengths:

1. It can be used to calculate the confidence interval for a large range of different parameters.
2. It uses ALL the information the sample gives us, rather than just summary values.
3. It has been found to aid understanding of the concepts of inference better than the traditional methods do.

There are also some disadvantages:

1. Old fogeys don’t like it. (Just kidding) What I mean is that teachers who have always taught using the traditional approach find it difficult to trust what seems like a hit-and-miss method without the familiar theoretical underpinning.
2. Universities don’t teach bootstrapping as much as the traditional methods.
3. The common software packages do not include bootstrap confidence intervals.

The idea behind a bootstrap confidence interval is that we make use of the whole sample to represent the population. We take lots and lots of samples of the same size from the original sample. Obviously we need to sample with replacement, or the samples would all be identical. Then we use these repeated samples to get an idea of the distribution of the estimates of the population parameter. We chop the tails off at a given point, and that gives us the confidence interval. Voila!
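The whole procedure fits in a few lines of code. This Python sketch (with an invented sample of apple weights) bootstraps a confidence interval for the median, one of the parameters the traditional formulas do not cover:

```python
import numpy as np

rng = np.random.default_rng(7)
# An invented original sample of 50 apple weights in grams
sample = rng.normal(150, 20, size=50)

# Resample WITH replacement many times, recording the statistic each time.
# The same code works for the mean, a proportion, or almost anything else.
boot_medians = [
    np.median(rng.choice(sample, size=len(sample), replace=True))
    for _ in range(10_000)
]

# Chop 2.5% off each tail for an approximate 95% confidence interval
lower, upper = np.percentile(boot_medians, [2.5, 97.5])
print(round(lower, 1), round(upper, 1))
```

Swapping `np.median` for any other statistic is the versatility referred to above: no new formula is needed, only a new line of code.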

# Answers to the disadvantages (burn the straw man?)

1. There is a sound theoretical underpinning for bootstrap confidence intervals. A good place to start is a previous blog post about George Cobb’s work. Either that or – “Trust me, I’m a Doctor!” (This would also include trusting far more knowledgeable people such as Chris Wild and Maxine Pfannkuch, and the team of statistical educators led by Joan Garfield.)
2. We have to start somewhere. Bootstrap methods aren’t used at universities because of inertia. As an academic of twenty years I can say that there is NO PAY OFF for teaching new stuff. It takes up valuable research time and you don’t get promoted, and sometimes you even get made redundant. If students understand what confidence intervals are, and the concept of inference, then learning to use the traditional formulas is trivial. Eventually the universities will shift. I am aware that the University of Auckland now teaches the bootstrap approach.
3. There are ways to deal with the software package problem. There is a free software interface called “iNZight” that you can download. I believe Fathom also uses bootstrapping. There may be other software. Please let me know of any and I will add them to this post.

# In Summary

Confidence intervals involve the concepts of variation, sampling and inference. They are a great way to teach these really important concepts, and to help students be critical of single value estimates. They can be taught informally, traditionally or using bootstrapping methods. Any of the approaches can lead to rote use of formula or algorithm and it is up to teachers to aim for understanding. I’m working on a set of videos around this topic. Watch this space.

# Make journalists learn statistics

All journalists should be required to pass a course in basic statistics before they are let loose on the unsuspecting public.

I am not talking about the kind of statistics course that mathematical statisticians are talking about. This does not involve calculus, R or anything tricky requiring a post-graduate degree. I am talking about a statistics course for citizens. And journalists. :)

I have thought about this for some years. My father was a journalist, and fairly innumerate unless there was a dollar sign involved. But he was of the old school, who worked their way up the ranks. These days most media people have degrees, and I am adamant that the degree should contain basic numeracy and statistics. The course I devised (which has now been taken over by the maths and stats department and will be shut down later this year, but am I bitter…?) would have been ideal. It included basic number skills, including percentages (which are harder than you think), graphing, data, chance and evidence. It required students to understand the principles behind what they were doing rather than the mechanics.

Here is what journalists should know about statistics:

# Chance

One of the key concepts in statistics is that of variability and chance. Too often a chance event is invested with unnecessary meaning. A really good example of this is the road toll. In New Zealand the road toll over the Easter break has fluctuated between 21 (in 1971) and 3 (in 1998, 2002 and 2003). Then in 2012 the toll was zero, a cause of great celebration. I was happy to see one report say “There was no one reason for the zero toll this Easter, and good fortune may have played a part.” This was a refreshing change, as normally the police seem to take the credit for good news, and blame bad news on us. Rather like economists.

With any random process you will get variability. The human mind looks for patterns and meanings even where there are none, and sadly it often finds them, imbuing random variation with meaning erroneously. Astrology is a perfect example of this – and watching Deal or No Deal shows just how much meaning people can find in pure chance.

All journalists should have a good grasp of the concept of variability, so they stop drawing unfounded conclusions.

# Data Display

There are myriad examples of graphs in the media that are misleading, badly constructed, incorrectly specified, or just plain wrong. There was a wonderful one in the Herald Sun recently, which has had considerable publicity. We hope it was just an error, and nothing more sinister. But good subediting (what my father used to do, a practice that I think ceased with the advent of the computer) would have picked this up.

There is a very nice website dedicated to this: StatsChat. It unfortunately misquotes H. G. Wells, but has a wonderful array of examples of good and bad statistics in the media. This post gives links to all sorts of sites with bad graphs, many of which were either produced or promulgated by journalists. But not all – scientific literature has its culprits too.

Just a little aside here – why does NO-ONE ever report the standard deviation? I was writing questions involving the normal distribution for practice by students. I am a strong follower of Cobb’s view that all data should be real, so I went looking for some interesting results I could use, with a mean and standard deviation. Heck, I couldn’t even find uninteresting results! The mean and the median rule supreme, and confidence intervals are getting a little look-in. Percentages are often reported with a “margin of error” (does anyone understand that?). But the standard deviation is invisible. I don’t think the standard deviation is any harder to understand than the mean. (Mainly because the mean is very hard to understand!) So why is the standard deviation not mentioned?

# Evidence

One of the main ideas in inferential statistics is that of evidence: the data is here; do we have evidence that this is an actual effect, rather than one caused by random variation and sampling error? In traditional statistics this is about understanding the p-value. In resampling the idea is very similar – we ask “could we have got this result by chance?” You do not have to be a mathematician to grasp this idea if it is presented in an accessible way. (See my video “Understanding the p-value” for an example.)
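The resampling version of that question can be sketched in a few lines. The data here are invented; the logic is simply to shuffle the group labels many times and see how often chance alone produces a difference at least as large as the one observed:

```python
import random

random.seed(42)

# Invented scores for two groups of six.
treatment = [23, 25, 28, 31, 32, 35]
control = [20, 21, 24, 26, 27, 29]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(treatment) - mean(control)

# Shuffle the labels repeatedly; count how often a difference at least
# this large turns up purely by chance.
pooled = treatment + control
n, reps, count = len(treatment), 10_000, 0
for _ in range(reps):
    random.shuffle(pooled)
    if mean(pooled[:n]) - mean(pooled[n:]) >= observed:
        count += 1

p_value = count / reps
print(f"observed difference: {observed:.1f}, resampling p-value: {p_value:.3f}")
```

The p-value is just the proportion of shuffles that match or beat the observed difference – no distribution tables required.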

One very exciting addition to the New Zealand curriculum is the set of Achievement Standards at Years 12 and 13 involving reading and understanding statistical reports. I have great hopes that as teachers embrace these standards, the level of understanding in the general population will increase, and there will be less tolerance for statistically unsound conclusions.

Another source of hope for me is “The Panel”, an afternoon radio programme hosted by Jim Mora on Radio New Zealand National. Each day different guests are invited to comment on current events in a moderately erudite and often amusing way. Sometimes they even have knowledge about the topic, and usually an expert is interviewed. It is what talkback radio really could be. I think. I’ve never listened to talkback radio long enough to judge, as it always makes me SO ANGRY! Breathe, breathe…

I digress. I have been gratified to hear people on The Panel making worthwhile comments about sample size, sampling method, bias, association and causation. (Not usually using those exact terms, but the concepts are there.) It gives me hope that critical response to pseudo-scientific, and even scientific research is possible in the general populace. My husband thinks that should be “informed populace”, but I can dream.

It is possible for journalists to understand the important ideas of statistics without a mathematically-based and alienating course. I feel an app coming on… (Or should that be a nap?)

# Mayor Bloomberg is avoiding a Type 2 error

As I write this, Hurricane Sandy is bearing down on the east coast of the United States. Mayor Bloomberg has ordered evacuations from various parts of New York City. All over the region people are stocking up on food and other essentials and waiting for Sandy to arrive. And if Sandy doesn’t turn out to be the worst storm ever, will people be relieved or disappointed? Either way there is a lot of money involved. And more importantly, risk of human injury and death. Will the forecasters be blamed for over-predicting?

## Types of error

There are two ways to get this sort of decision wrong. We can do something and find out it was a waste of time, or we can do nothing and wish that we had done something. In the subject of statistics these are known as Type 1 and Type 2 errors. Teaching about Type 1 and Type 2 errors is quite tricky and students often get confused. Does it REALLY matter if they get them around the wrong way? Possibly not, but what really does matter is that students are aware of their existence. We would love to be able to make decisions under certainty, but most decisions involve uncertainty, or risk. We have to choose between the possibility of taking an opportunity and finding out that it was a mistake, and the possibility of turning down an opportunity and missing out on something.
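The 5%-of-experiments point can be simulated directly. In this sketch a fair coin stands in for “no real effect”, so every rejection is by definition a type 1 error – and a test run at the 5% level produces them about 5% of the time:

```python
import random

random.seed(0)

def one_experiment(n=100):
    """Test a fair coin (so the null is true) at the 5% level.

    Returns True when the test wrongly rejects - a type 1 error.
    """
    heads = sum(random.random() < 0.5 for _ in range(n))
    p_hat = heads / n
    se = (0.25 / n) ** 0.5            # standard error under the null
    return abs(p_hat - 0.5) > 1.96 * se

experiments = 10_000
type1 = sum(one_experiment() for _ in range(experiments))
print(f"false positive rate: {type1 / experiments:.3f}")  # near the 5% level
```

Run enough experiments on an ineffective drug and some “discoveries” are guaranteed, which is why publication bias matters.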

## Earthquake prediction

In another recent event, Italian scientists have been convicted of manslaughter for failing to predict a catastrophic earthquake. This has particular resonance in Christchurch as our city has recently been shaken by several large quakes and a multitude of smaller aftershocks. You can see a graph of the Christchurch shakes at this site. For the most part the people of Christchurch understand that it is not possible to predict the occurrence of earthquakes. However it seems that the scientists in Italy may have overstated the lack of risk. Just because you can’t accurately predict an earthquake doesn’t mean it won’t happen. Here is a link to Nature’s story on the Italian earthquake.

Laura McLay wrote a very interesting post entitled “What is the optimal false alarm rate for tornado warnings?”. A high rate of false alarms is likened to the “boy who cried wolf”, to whom nobody listens any more. You would think that there is no harm in warning unnecessarily, but in the long term there is potential loss of life because people fail to heed subsequent warnings.

# Operations Research and errors

Pure mathematicians tend not to like statistics much as it isn’t exact. It’s a little bit sullied by its contact with the real world. However Operations Research goes a step further into the messy world of reality and evaluates the cost of each type of error. Decisions are often converted into dollar terms within decision analysis. Like it or not, the dollar is the usual measure of worth, even for a human life, though sometimes a measure called “utility” is employed.

## Costs of Errors

Sometimes there is very little cost to a type 2 error. A bank manager refusing to fund a new business is avoiding the risk of a type 1 error, which would result in a loss of money. They then become open to a type 2 error: missing out on funding a winner. The balance is very much on the side of avoiding a type 1 error. In terms of choosing a life partner, some people are happy to risk a type 1 error and marry, while others hold back, perhaps invoking a type 2 error by missing out on a “soul-mate”. Or it may be that we make this decision under the illusion of certainty and perfect information, and the possible errors do not cross our minds.

Cancer screening is a common illustration of type 1 and type 2 errors. With a type 1 error, we get a false positive and are told we have a cancer when we do not. With type 2, the test fails to detect a cancer. In this example the cost of a type 2 error seems to be much worse than type 1. Surely we would rather know if we have cancer? However in the case of prostate cancer, a type 1 error can lead to awful side-effects from unnecessary tests. Conversely a large number of men die from other causes, happily unaware that they have early stages of prostate cancer.

The point is that there is no easy answer when making such decisions.

# Teaching about type 1 and type 2 errors

I have found the following helpful when teaching about type 1 and type 2 errors in statistics. Think first about the action that was taken. If the null hypothesis was rejected, we have said that there is an effect. After rejecting the null only two outcomes are possible. We have made the correct decision, or we have made a type 1 error. Conversely if we do not reject the null hypothesis, and do nothing, we have either been correct or made a type 2 error. You cannot make a type 1 error and a type 2 error in the same decision.

• Decision: Reject the Null. Outcome is:
• Correct, or
• Type 1 error
• Decision: Do not reject the Null. Outcome is:
• Correct, or
• Type 2 error.

Or another way of looking at it is:

• Do something and get it wrong – Type 1 error
• Do nothing and regret it – Type 2 error

## Avoid error

Students may wonder why we have to have any kind of error. Can we not do something to remove it? In some cases we can – we can spend more money and take a larger sample, thus reducing the likelihood of error. However, that too has its cost. These three costs – of a type 1 error, of a type 2 error, and of sampling – are important aspects of decision-making, and helping students to understand this will help them to make and understand decisions.
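That trade-off is easy to sketch. Assuming a coin whose true probability of heads is 0.6, tested against the null value of 0.5 at the 5% level (hypothetical numbers throughout), the chance of missing the real effect – the type 2 error rate – falls away as the costly sample grows:

```python
import random

random.seed(7)

def type2_rate(n, true_p=0.6, reps=2_000):
    """Proportion of experiments that fail to detect a real effect."""
    failed = 0
    for _ in range(reps):
        heads = sum(random.random() < true_p for _ in range(n))
        p_hat = heads / n
        se = (0.25 / n) ** 0.5             # standard error under the null
        if abs(p_hat - 0.5) <= 1.96 * se:  # fail to reject the null
            failed += 1
    return failed / reps

# A larger (more expensive) sample makes the real effect much harder to miss.
for n in (25, 100, 400):
    print(f"n = {n:3d}: type 2 error rate ~ {type2_rate(n):.2f}")
```

The error never disappears entirely; we only buy it down, at the price of a bigger sample.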

# The one-armed operations researcher

My mentor, Hans Daellenbach told me a story about a client asking for a one-armed Operations Researcher. The client was sick of getting answers that went, “On the one hand, the best decision would be to proceed, but on the other hand…”

People like the correct answer. They like certainty. They like to know they got it right.

I tease my husband that he has to find the best picnic spot or the best parking place, which involves us driving around considerably longer than I (or the children) were happy with. To be fair, we do end up in very nice picnic spots. However, several of the other places would have been just fine too!

In a different context I too am guilty of this – the reason I loved mathematics at school was because you knew whether you were right or wrong and could get a satisfying row of little red ticks (checkmarks) down the page. English and other arts subjects, I found too mushy as you could never get it perfect. Biology was annoying as plants were so variable, except in their ability to die. Chemistry was ok, so long as we stuck to the nice definite stuff like drawing organic molecules and balancing redox equations.

I think most mathematics teachers are mathematics teachers because they like things to be right or wrong. They like to be able to look at an answer and tell whether it is correct, or if it should get half marks for correct working. They do NOT want to mark essays, which are full of mushy judgements.

Again I am sympathetic. I once did a course in basketball refereeing. I enjoyed learning all the rules, and where to stand, and the hand signals etc, but I hated being a referee. All those decisions were just too much for me. I could never tell who had put the ball out, and was unhappy with guessing. I think I did referee two games at a church league and ended up with an angry player bashing me in the face with the ball. Looking back I think it didn’t help that I wasn’t much of a player either.

I also used to find marking exam papers very challenging, as I wanted to get it right every time. I would agonise over every mark, thinking it could be the difference between passing and failing for some poor student. However as the years went by, I realised that the odd mistake or inconsistency here or there was just usual, and within the range of error. To someone who failed by one mark, my suggestion is not to be borderline. I’m pretty sure we passed more people whom we shouldn’t have than the other way around.

# Life is not deterministic

The point is that life in general is not deterministic, certain and rule-based. This is where the great divide lies between the subject of mathematics and the practice of statistics. Generally in mathematics you can find an answer and even check that it is correct. Or you can show that there is no answer (as happened in one of our national exams in 2012!). But often in statistics there is no clear answer. Sometimes it even depends on the context. This does not sit well with some mathematics teachers.

In operations research there is an interesting tension between optimisers and people who use heuristics. Optimisers love to say that they have the optimal solution to the problem. The non-optimisers like to point out that the problem solved optimally is so far removed from the actual problem that all it provides is an upper or lower bound on a practical solution to the real-life problem situation.

Judgment calls occur all through the mathematical decision sciences. They include

• What method to use – Linear programming or heuristic search?
• Approximations – How do we model a stochastic input in a deterministic model?
• Assumptions – Is it reasonable to assume that the observations are independent?
• P-value cutoff – Does a p-value of exactly 0.05 constitute evidence against the null hypothesis?
• Sample size – Is it reasonable to draw any inferences at all from a sample of 6?
• Grouping – How do we group by age? by income?
• Data cleaning – Do we remove the outlier or leave it in?

A comment from a maths teacher on my post regarding the Central Limit Theorem included the following: “The questions that continue to irk me are i) how do you know when to make the call? ii) What are the errors involved in making such a call? I suppose that Hypothesis testing along with p-values took care of such issues and offered some form of security in accepting or rejecting such a hypothesis. I am just a little worried that objectivity is being lost, with personal interpretation being the prevailing arbiter which seems inadequate.”

These are very real concerns, and reflect the mathematical desire for correctness and security. But I propose that the security was an illusion in the first place. There has always been personal interpretation. Informal inference is a nice introduction to help us understand that. And in fact it would be a good opportunity for lively discussion in a statistics class.

With bootstrapping methods we don’t have any less information than we did using the Central Limit Theorem. We just haven’t assumed normality or independence. There was no security. There was the idea that with a 95% confidence interval, for example, we are 95% sure that the interval contains the true population value. I wonder how often we realised that 1 in 20 times we were just plain wrong, and in quite a few instances the population parameter would be far from the centre of the interval.

The hopeful thing about teaching statistics via bootstrapping, is that by demystifying it we may be able to inject some more healthy scepticism into the populace.

# Teaching Experimental Design – a cross-curricular opportunity

The elements that make up a statistics, operations research or quantitative methods course cover three different dimensions (and more). These are:

• techniques we wish students to master,
• concepts we wish students to internalise, and
• attitudes and emotions we wish the students to adopt.

Techniques, concepts and attitudes interact in how a student learns and perceives the subject. Sadly it is possible (and not uncommon) for students to master techniques, while staying oblivious to many of the concepts, and with an attitude of resignation or even antipathy towards the discipline.

# Techniques

Often, and less than ideally, course design begins with techniques. The backbone is a list of tests, graphs and procedures that students need to master in order to pass the course. The course outline includes statements like:

• Students will be able to calculate a confidence interval for a mean.
• Students will be able to formulate a linear programming model from data.
• Students will use Excel to make correct histograms. (Good luck with this one!)

Textbooks are organised around techniques, which usually appear in a given sequence, relying on the authors’ perception of how difficult each technique is. Textbooks within a given field are remarkably similar in the techniques they cover in an introductory course.

# Concepts

Concepts are more difficult to articulate. In a first course in statistics we wish students to gain an appreciation of the effects of variation. They need to understand how data from a sample differs from population data. In all of the mathematical decision sciences students struggle to understand the nature of a model. The concept of a mathematical model is far from intuitive, but essential.

# Attitudes

You can’t explicitly teach attitudes. “Today class, you are going to learn to love statistics!” These are absorbed, formed and reformed as part of the learning process, as a result of prior experiences and attitudes. I have written a post on Anxiety, fear and antipathy for maths, stats and OR, which describes the importance of perseverance, relevance, borrowed self-efficacy and love in the teaching of these subjects. Content and problem-context choices can go a long way towards improving attitudes. The instructor should know whether his or her class is more interested in the trajectories of gummy bears, or the more serious topics of cancer screening and crime prevention. Classes in business schools will use different examples from classes in psychology or forestry. Whatever the context, the data should be real, so that students can really engage with it.

I was both amused and a little saddened at this quote from a very good book, “Succeed – how we can reach our goals”. The author (Heidi Grant Halvorson) has described the outcomes of some interesting experiments regarding motivation. She then says, “At this point, you may be wondering if social psychologists get a particular pleasure out of asking people to do really odd things, like eating Cheerios with chopsticks, or eating raw radishes, or not laughing at Robin Williams. The short answer is yes, we do. It makes up for all those hours spent learning statistics.” Hmmm

# Experimental Design

So what does this have to do with experimental design?

I have a little confession. I’ve never taught experimental design. I wish I had. I didn’t know as much then as I do now about teaching statistics, and I also taught business students. That’s my excuse, but I regret it. My reasoning was that businesses usually use observational data, not experimental data. And it’s true, except perhaps in marketing research, and process control and possibly several other areas. Oh.

George Cobb, whom I have quoted in several previous posts, proposed that experimental design is a mechanism by which students may learn important concepts. The technique is experimental design, but taught well, it is a way to convey important concepts in statistics and decision science. The pivotal concept is that of variation. If there were no variation, there would be no need for statistics or experimentation. It would be a sad, boring, deterministic world. But variation exists: some of it is explainable, some is natural, some is due to sampling, and some is due to poor sampling or experimental practices. I have a YouTube video that explains these four sources of variation. Because variation exists, experiments need to be designed in such a way that we can uncover the explainable variation as best we can, without confounding it with the other types.

The new New Zealand curriculum for Mathematics and Statistics includes experimental design at levels 2 and 3 of the National Certificate of Educational Achievement (the last two years of secondary school). The assessments are internal, and teachers help students set up, execute and analyse small experiments. At level two (implemented this year) the experiments generally involve two groups which are given two treatments, or a treatment and a control. The analysis involves boxplots and informal inference. Some schools used paired samples, but found the type of analysis to be limited as a result. At level three (to be implemented in 2013) this is taken a step further, but I haven’t been able to work out from the curriculum documents what this step is. I was hoping it might be things like randomised block design, or even Taguchi methods, but I don’t think so.

# Subjects for Experimentation

Bearing in mind the number of students, many of whom wish to use other members of the class, there can be issues of time and fatigue. Here are some possibilities. It would be great if other suggestions could be added as comments to this post.

# Behavioural

Some teachers are reluctant to use psychological experiments, as it can be a bit worrying to use our students as guinea pigs. However, this is probably the easiest option, and provided informed and parental consent is received, it should be acceptable. All sorts of experiments have been suggested, such as the effects of various distractions (and legal stimulants) on task completion. There are possible experiments in Physical Education (evaluate the effectiveness of a performance-enhancing programme). Or in Music – how do people respond to different music?

I’d love to see some experiments done on the time taken to solve Rogo puzzles – and on the effect of route length, number choice, size or age.

# Biology

Anything that involves growing things takes a while and can be fraught. (My own recollection of High School biology is that all my plants died.) But things like water uptake could be possible. Use sticks of celery of different lengths and see how much water they take up in a given time. Germination times or strike rates under different circumstances using cress or mustard?  Talk to the Biology teacher. There are assessment standards in NZ NCEA at levels 2 and 3 which mesh well with the statistics standards.

# Technology

Baking. There are various ingredients that could have two or three levels of inclusion – making muffins with and without egg – does it affect the height? Pretty tricky to control, but fun – maybe use uniform amounts of mixture. Talk to the Food tech teacher.

Barbie bungee jumping. How does Barbie’s weight affect how far she falls? By having Barbie with and without a backpack, you get the two treatments. The bungee cords can be made out of rubber bands or elastic.

Things flying through the air from catapults. This has been shown to work as a teaching example. There are a number of variables to alter, such as the weight of the object, the slope of the launchpad, and the person firing.

# Inject statistical ideas in application areas

John Maindonald from ANU made the following comment on a previous post: “I am increasingly attracted to the idea that the place to start injecting statistical ideas is in application areas of the curriculum.  This will however work only if the teaching and learning model changes, in ways that are arguably anyway necessary in order to make effective use of those teachers who have really good and effective mathematics and statistics and computing skills.”

How exciting is that? Teachers from different discipline areas working together! There may well be logistical issues and even problems of “turf”. But wouldn’t it be great for mathematics teachers to help students with experiments and analysis in other areas of the curriculum? The students will gain from the removal of “compartments” in their learning, which will help them to integrate their knowledge. The worth of what they are doing would be obvious.

(Note for teachers in NZ: a quick look through the “assessment matrices” for other subjects uncovered a multitude of possibilities for curricular integration, if the logistics and NZQA allow.)

# The Central Limit Theorem: To teach or not to teach

The question of whether to teach explicitly the Central Limit Theorem seems to divide instructors along philosophical lines. Let us look first at these lines.

There are at least three different areas of activity within the discipline of statistics. These are

• Theory of statistics and research into statistics
• Practice of statistics
• Teaching statistics and related research

## Theory and research in statistics

The theory of statistics is mathematical. It is taught and practised in Mathematics and Statistics Departments of Universities. It is possible to be an expert on the theory and mathematics of statistics while having little contact with real data. The theory provides underpinnings to the practice of statistics. It is vital that some people know this – but not most of us. One would hope that people employed as statisticians would have a sound understanding of both the theoretical and applied aspects of statistics. This relates strongly to the research into statistics, which seems to be very mathematical, from my perusal of journals. This research advances the theory and use of statistical methods and philosophy.

## Practice of statistics

The practice of statistics occurs in many, many areas, particularly in universities. Most postgraduate courses require some proficiency in the application of statistical methods. Researchers in areas as diverse as psychology, genetics, market research, education, geography, speech therapy, physiotherapy, mechanics, management, economics and medicine all use statistical methods. Some researchers have a deep understanding of the theory of statistics, but most aim to be safe and competent practitioners. When they get to the tricky bits they know to ask a statistician, but most of the day-to-day data generation, collection and analysis is within their capability.

## Teaching of statistics and related research

Then there is the teaching of statistics. The level of applicability and theory taught will depend on the context. An instructor in statistics (in a non-service course) in a Department of Mathematics would tend towards the mathematical aspects, as that is most appropriate to the audience. However in just about every other setting the emphasis will be on the practical aspects of data collection and inference. This treatment of statistics is explicable, accessible and interesting to just about anyone, whereas only the mathematically inclined are likely to get excited about the theory of statistics.

There is another growing area, which is the research into the teaching and learning of statistics. This informs and is informed by the other areas, as well as general educational research and cognitive psychology. Much of my thinking comes from this background. An overview of some of the material relating to college level can be found in this literature review. The general topic of How Students Learn Statistics is introduced in this early paper by Joan Garfield (1995), a leader in the field of statistics education research.

# Statistics in the school curriculum

Statistics is gradually making its way into the school curriculum internationally, and in New Zealand has become a separate subject in the final year of schooling. There are philosophical issues arising as most of the teachers of statistics are mathematicians, and some tend towards the beauty and elegance of the formulas, proofs etc. The aim of the curriculum, however, is more towards statistical investigations and statistical literacy. There are fuzzy, dirty, ambiguous, context driven explorations with sometimes extensive write-ups. There is discussion and critique of statistical reports. There are experiments which may or may not produce usable results. Some of this is well into the realms of social science and well away from what mathematicians find appealing or even comfortable. In another life I can hear myself saying, “I didn’t become a maths teacher to mark essay questions!” There is a bit of a mismatch between the skill-set and attitudes of the teachers and the curriculum.

## Teaching the Central Limit Theorem

One place where this is particularly evident is in the question of teaching the Central Limit Theorem. Mathematicians like the Central Limit Theorem and it seems that they like to teach it. One teacher states “The fact that the CLT is to be de-emphasised in Yr 13 is a major disappointment to me…” This statement prompted this post. I agree that the CLT is neat. It is really handy. And it makes confidence interval calculation almost trivial. There are cool little exercises you can do to illustrate it. It is the backbone of traditional statistical theory.

However, teaching and learning do not always go hand in hand. I wonder how many students really do internalise the Central Limit Theorem. Evidence says not many. Chance, delMas and Garfield, in “The Challenge of Developing Statistical Literacy, Reasoning and Thinking” (Ben-Zvi and Garfield 2004), state: “Sampling distributions is a difficult topic for students to learn. A complete understanding of sampling distributions requires students to integrate and apply several concepts from different parts of a statistics course and to be able to reason about the hypothetical behavior of many samples – a distinct, intangible thought process for most students. The Central Limit Theorem provides a theoretical model of the behavior of sampling distributions, but students often have difficulty mapping this model to applied contexts. As a result students fail to develop a deep understanding of the concept of sampling distributions and therefore often develop only a mechanical knowledge of statistical inference. Students may learn how to compute confidence intervals and carry out tests of significance, but they are not able to understand and explain related concepts, such as interpreting a p-value.”

I have a confession to make. I didn’t teach the Central Limit Theorem. It never seemed as if it were going to help my students understand what was going on. For a few years I made them do a little simulation exercise which helped them to see why the square-root of n occurred in the denominator of the formula for the standard error. That was fun and seemed to help. But the words “Central Limit Theorem” seldom passed my lips in my twenty years of instruction.

What has helped immeasurably has been videos, beginning with “Understanding the p-value”, and plenty of different examples and exercises using confidence intervals and hypothesis tests. (Another confession – I taught traditional statistical inference, not resampling. My excuse was that I didn’t know any better, and I had to stay in parallel with the course provided by the maths department.) What I have found from my own experience as a learner and as a teacher is that students learn to understand statistics by DOING statistics.

## Definition of the Central Limit Theorem

The Central Limit Theorem states that, regardless of the shape of the population distribution, the distribution of sample means is approximately normal when the sample size is large. This was a really brilliant model for when simulation and resampling were impossible. The Central Limit Theorem makes it possible to calculate confidence intervals for population means from sample data. It is the reason why most statistical procedures either assume normality at some point, or take steps to correct for the lack of it. (See the paper by Cobb I referred to extensively in last week’s post.)
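The theorem is easy to watch in action. A sketch using a strongly skewed (exponential) population with mean 1: for tiny samples the sample means inherit the skew, but by the time n reaches 30 they sit nearly symmetrically around the population mean:

```python
import random

random.seed(11)

def sample_means(n, reps=5_000):
    """Means of `reps` samples of size n from an exponential population (mean 1)."""
    return [sum(random.expovariate(1.0) for _ in range(n)) / n
            for _ in range(reps)]

for n in (2, 30):
    means = sample_means(n)
    grand = sum(means) / len(means)
    below = sum(m < grand for m in means) / len(means)
    # In a symmetric (normal-looking) distribution about half the values fall
    # below the mean; skew pushes this proportion noticeably above one half.
    print(f"n = {n:2d}: proportion of sample means below their average: {below:.2f}")
```

The proportion drifts towards 0.5 as n grows – the skew of the parent population washes out of the sampling distribution.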

In a curriculum that develops from informal inference to formal inference using resampling, there is no need to call on the Central Limit Theorem. With resampling we use the distribution of the sample as the best estimate of the distribution of the population. True, it is quicker to use the old method of plugging the values into the formula. However it isn’t much quicker than using the free iNZight software for resampling.
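For comparison, the resampling calculation really is only a few lines. A sketch with invented data: treat the sample itself as the best available stand-in for the population, resample from it with replacement, and read a 95% interval straight off the middle of the bootstrap means:

```python
import random

random.seed(5)

sample = [12, 15, 9, 20, 14, 17, 11, 16, 13, 18]   # invented data

def bootstrap_means(data, reps=10_000):
    """Means of `reps` resamples drawn from the data with replacement."""
    n = len(data)
    return sorted(sum(random.choice(data) for _ in range(n)) / n
                  for _ in range(reps))

means = bootstrap_means(sample)
lo = means[int(0.025 * len(means))]        # 2.5th percentile
hi = means[int(0.975 * len(means)) - 1]    # 97.5th percentile
print(f"95% bootstrap interval for the mean: ({lo:.1f}, {hi:.1f})")
```

No normality assumption is needed – the spread of the bootstrap means stands in for the sampling distribution.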

At high school level we want students to get an understanding of what inference is. (I would suggest my Pinkie Bar lesson as a good way of introducing the rejection part of Cobb’s mantra: Randomise, Repeat, Reject.) I’m not convinced that teaching the Central Limit Theorem, and formula-based confidence intervals for means and proportions, leads to understanding. Research suggests that it doesn’t. I agree that statistical theorists, educators and researchers should all understand the Central Limit Theorem. I just don’t think that it has a vital place in an innovative curriculum based on resampling.

## Concern for students

I suspect that teachers fear that if their students are not taught the Central Limit Theorem and traditional confidence intervals at high school they will be at a disadvantage at university. I’d like to reassure them that it just isn’t true. All first year university statistics courses that I know of assume no prior knowledge of statistics. (The same is true of some second year courses as well!) The greatest gift a high school statistics teacher can give their students is an attitude of excitement and success, with a healthy helping of scepticism, and an idea of what inference is – that we can draw conclusions about a population from a sample. If my first year students had started from that point, half our work would have been done.