The Myth of Random Sampling

I feel a slight quiver of trepidation as I begin this post – a little like the boy who pointed out that the emperor has no clothes.

Random sampling is a myth. Practical researchers know this and deal with it. Theoretical statisticians live in a theoretical world where random sampling is possible and ubiquitous – which is just as well really. But teachers of statistics live in a strange half-real-half-theoretical world, where no one likes to point out that real-life samples are seldom random.

The problem in general

In order for most inferential statistical conclusions to be valid, the sample we are using must obey certain rules. In particular, each member of the population must have an equal probability of being chosen. In this way we reduce the opportunity for systematic error, or bias. When a truly random sample is taken, it is almost miraculous how well we can draw conclusions about the source population, even with a modest sample of a thousand. On a side note, if the general population understood this, and the opportunity for bias and corruption were eliminated, general elections and referenda could be done at much less cost, by taking a good random sample.
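
To see just how well a modest sample does, here is a minimal simulation in Python. All the numbers are assumptions for illustration: a population in which 52% hold some opinion, and repeated random samples of a thousand.

```python
import numpy as np

rng = np.random.default_rng(7)

p_population = 0.52   # assumed "true" level of support in the population
n = 1000              # a modest random sample

# Simulate 2000 separate random samples of 1000 people each
sample_props = rng.binomial(n, p_population, size=2000) / n

print(sample_props.mean())                                  # very close to 0.52
print(np.mean(np.abs(sample_props - p_population) > 0.03))  # only roughly 1 in 20 samples is more than 3 points out
```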

However! It is actually quite difficult to take a random sample of people. Random sampling is doable in biology, I suspect, where seeds or plots of land can be chosen at random. It is also quite feasible in manufacturing processes. Medical research relies on random samples, though they are seldom drawn from the total population. Really it is more about randomisation, which can be used to support causal claims.

But the area of most interest to most people is people. We actually want to know about how people function, what they think, their economic activity, sport and many other areas. People find people interesting. To get a really good sample of people takes a lot of time and money, and is outside the reach of many researchers. In my own PhD research I approximated a random sample by taking a stratified, cluster semi-random almost convenience sample. I chose representative schools of different types throughout three diverse regions in New Zealand. At each school I asked all the students in a class at each of three year levels. The classes were meant to be randomly selected, but in fact were sometimes just the class that happened to have a teacher away, as my questionnaire was seen as a good way to keep them quiet. Was my data of any worth? I believe so, of course. Was it random? Nope.

Problems people have in getting a good sample include cost, time and response rate. Much of the data that is cited in papers is far from random.

The problem in teaching

The wonderful thing about teaching statistics is that we can actually collect real data and do analysis on it, and get a feel for the detective nature of the discipline. The problem with sampling is that we seldom have access to truly random data. By random I do not mean just simple random sampling – which is, in practice, the least simple method! Even cluster, systematic and stratified sampling can be a challenge in a classroom setting. And sometimes if we think too hard we realise that what we have is actually a population, and not a sample at all.

It is a great experience for students to collect their own data. They can write a questionnaire and find out all sorts of interesting things, through their own trial and error. But mostly students do not have access to enough subjects to take a random sample. Even if we go to secondary sources, the data is seldom random, and the students do not get the opportunity to take the sample. It would be a pity not to use some interesting data, just because the collection method was dubious (or even realistic). At the same time we do not want students to think that seriously dodgy data has the same value as a carefully collected random sample.

Possible solutions

These are more suggestions than solutions, but the essence is to do the best you can and make sure the students learn to be critical of their own methods.

Teach the best way, pretend and look for potential problems.

Teach the ideal and also teach the reality. Teach about the different ways of taking random samples. Use my video if you like!

Get students to think about the pros and cons of each method, and where problems could arise. Also get them to think about the kinds of data they are using in their exercises, and what biases they may have.

We also need to teach that, used judiciously, a convenience sample can still be of value. For example I have collected data from students in my class about how far they live from university, and whether or not they have a car. This data is not a random sample of any population. However, it is still reasonable to suggest that it may represent all the students at the university – or maybe just the first year students. It possibly represents students in the years preceding and following my sample, unless something has happened to change the landscape. It has worth in terms of inference. Realistically, I am never going to take a truly random sample of all university students, so this may be the most suitable data I ever get. I have no doubt that it is better than no information.

Not all questions are of equal worth. Knowing whether students who own cars live further from university, in general, is interesting but not of great importance. Were I to be researching topics of great importance, such as safety features in roads or medicine, I would have a greater need for rigorous sampling.

So generally, I see no harm in pretending. I use the data collected from my class, and I say that we will pretend that it comes from a representative random sample. We talk about why it isn’t, but then we move on. It is still interesting data, it is real and it is there. When we write up the analysis we include critical comments, with provisos about how the sample may be biased.

What is important is for students to experience the excitement of discovering real effects (or lack thereof) in real data. What is important is for students to be critical of these discoveries, through understanding the limitations of the data collection process. Consequently I see no harm in using non-random, realistically sampled real data, with a healthy dose of scepticism.

Deterministic and Probabilistic models and thinking

The way we understand and make sense of variation in the world affects decisions we make.

Part of understanding variation is understanding the difference between deterministic and probabilistic (stochastic) models. The NZ curriculum specifies the following learning outcome: “Selects and uses appropriate methods to investigate probability situations including experiments, simulations, and theoretical probability, distinguishing between deterministic and probabilistic models.” This is at level 8 of the curriculum, the highest level of secondary schooling. Deterministic and probabilistic models are not familiar to all teachers of mathematics and statistics, so I’m writing about it today.

Model

The term “model” is itself challenging. There are many ways to use the word, two of which are particularly relevant for this discussion. The first meaning is “mathematical model, as a decision-making tool”. This is the one I am familiar with from years of teaching Operations Research. The second way is “way of thinking or representing an idea”. Or something like that. It seems to come from psychology.

When teaching mathematical models in entry level operations research/management science we would spend some time clarifying what we mean by a model. I have written about this in the post, “All models are wrong.”

In a simple, concrete incarnation, a model is a representation of another object. A simple example is that of a model car or a Lego model of a house. There are aspects of the model that are the same as the original, such as the shape and ability to move or not. But many aspects of the real-life object are missing in the model. The car does not have an internal combustion engine, and the house has no soft-furnishings. (And very bumpy floors). There is little purpose for either of these models, except entertainment and the joy of creation or ownership. (You might be interested in the following video of the Lego Parisian restaurant, which I am coveting. Funny way to say Parisian!)

Many models perform useful functions. My husband works as a land-surveyor, and his work involves making models, on paper or in the computer, of phenomena on the land, and making sure that specified marks on the model correspond to the marks placed in the ground. The purpose of the model relates to ownership and making sure the sewers run in the right direction. (As a result of several years of earthquakes in Christchurch, his models are less deterministic than they used to be, and unfortunately many of our sewers ended up running the wrong way.)

Our world is full of models:

  • a map is a model of a location, which can help us get from place to place
  • sheet music is a written model of the sound which can make a song
  • a bus timetable is a model of where buses should appear
  • a company’s financial reports are a model of one aspect of the company

Deterministic models

A deterministic model assumes certainty in all aspects. Examples of deterministic models are timetables, pricing structures, linear programming models, the economic order quantity model, maps and accounting.
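
As a concrete illustration of a deterministic model, here is the economic order quantity in a few lines of Python. The figures are made up; the point is that with demand and costs assumed known, the model gives a single, certain answer.

```python
from math import sqrt

def eoq(annual_demand, order_cost, holding_cost):
    """Economic order quantity: the order size minimising total ordering
    plus holding cost, with demand assumed known and constant."""
    return sqrt(2 * annual_demand * order_cost / holding_cost)

# Made-up figures: 1200 units a year, $50 per order, $2.40 to hold a unit for a year
print(round(eoq(1200, 50.0, 2.40), 1))   # about 223.6 units per order
```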

Probabilistic or stochastic models

Most models really should be stochastic or probabilistic rather than deterministic, but this is often too complicated to implement. Representing uncertainty is fraught with difficulty. Some of the more common stochastic models are queueing models, Markov chains, and most simulations.

For example when planning a school formal, there are some elements of the model that are deterministic and some that are probabilistic. The cost to hire the venue is deterministic, but the number of students who will come is probabilistic. A GPS unit uses a deterministic model to decide on the most suitable route and gives a predicted arrival time. However we know that the actual arrival time is contingent upon all sorts of aspects including road, driver, traffic and weather conditions.
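
Here is a sketch of the probabilistic side of that school formal example, with assumed numbers. Each invited student attends with some probability, so the attendance, and therefore the profit, is a distribution rather than a single figure.

```python
import numpy as np

rng = np.random.default_rng(1)

invited, p_attend = 120, 0.8   # assumed: 120 invitations, each accepted with probability 0.8
venue_hire = 1500.00           # deterministic cost
ticket_price = 40.00

# Simulate 10,000 possible formals
attendance = rng.binomial(invited, p_attend, size=10_000)
profit = attendance * ticket_price - venue_hire

print(attendance.mean(), attendance.std())   # around 96 students, give or take 4 or 5
print(np.percentile(profit, [5, 50, 95]))    # a range of plausible profits, not one number
```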

Model as a way of thinking about something

The term “model” is also used to describe the way that people make sense out of their world. Some people have a more deterministic world model than others, contributed to by age, culture, religion, life experience and education. People ascribe meaning to anything from star patterns, tea leaves and moon phases to ease in finding a parking spot and not being in a certain place when a coconut falls. This is a way of turning a probabilistic world into a more deterministic and more meaningful world. Some people are happy with a probabilistic world, where things really do have a high degree of randomness. But often we are less happy when the randomness goes against us. (I find it interesting that farmers hit with bad fortune such as a snowfall or drought are happy to ask for government help, yet when there is a bumper crop, I don’t see them offering to give back some of their windfall voluntarily.)

Let us say the All Blacks win a rugby game against Australia. There are several ways we can draw meaning from this. If we are of a deterministic frame of mind, we might say that the All Blacks won because they are the best rugby team in the world.  We have assigned cause and effect to the outcome. Or we could take a more probabilistic view of it, deciding that the probability that they would win was about 70%, and that on the day they were fortunate.  Or, if we were Australian, we might say that the Australian team was far better and it was just a 1 in 100 chance that the All Blacks would win.

I developed the following scenarios for discussion in a classroom. The students can put them in order or categories according to their own criteria. After discussing their results, we could then talk about a deterministic and a probabilistic meaning for each of the scenarios.

  1. The All Blacks won the Rugby World Cup.
  2. Eri did better on a test after getting tuition.
  3. Holly was diagnosed with cancer, had a religious experience and the cancer was gone.
  4. A pet was given a homeopathic remedy and got better.
  5. Bill won $20 million in Lotto.
  6. You got five out of five right in a true/false quiz.

The regular mathematics teacher is now a long way from his or her comfort zone. The numbers have gone, along with the red tick, and there are no correct answers. This is an important aspect of understanding probability – that many things are the result of randomness. But with this idea we are pulling mathematics teachers into unfamiliar territory. Social studies, science and English teachers have had to deal with the murky area of feelings, values and ethics forever.  In terms of preparing students for a random world, I think it is territory worth spending some time in. And it might just help them find mathematics/statistics relevant!

Those who can, teach statistics

The phrase I despise more than any in popular use (and believe me there are many contenders) is “Those who can, do, and those who can’t, teach.” I like many of the sayings of George Bernard Shaw, but this one is dismissive, ignorant and born of jealousy. To me, the ability to teach something is a step higher than being able to do it. The PhD, the highest qualification in academia, is a doctorate. The word “doctor” comes from the Latin word for teacher.

Teaching is a noble profession, on which all other noble professions rest. Teachers are generally motivated by altruism, and often go well beyond the requirements of their job description to help students. Teachers are derided as unimportant and their job dismissed as easy. Yet at the same time teachers are expected to undo the ills of society. Everyone “knows” what teachers should do better. Teachers are judged on their output, as if they were the only factor in the mix. Yet how many people really believe their success or failure is due only to the efforts of their teacher?

For some people, teaching comes naturally. But even then, there is the need for pedagogical content knowledge. Teaching is not a generic skill that transfers seamlessly between disciplines. You must be a thinker to be a good teacher. It is not enough to perpetuate the methods you were taught with. Reflection is a necessary part of developing as a teacher. I wrote in an earlier post, “You’re teaching it wrong”, about the process of reflection. Teachers need to know their material, and keep up-to-date with ways of teaching it. They need to be aware of ways that students will have difficulties. Teachers, by sharing ideas and research, can be part of a communal endeavour to increase both content knowledge and pedagogical content knowledge.

There is a difference between being an explainer and being a teacher. Sal Khan, maker of the Khan Academy videos, is a very good explainer. Consequently many students who view the videos are happy that elements of maths and physics that they couldn’t do have been explained in such a way that they can solve homework problems. This is great. Explaining is an important element in teaching. My own videos aim to explain in such a way that students make sense of difficult concepts, though some videos also illustrate procedure.

Teaching is much more than explaining. Teaching includes awakening a desire to learn and providing the experiences that will help a student to learn.  In these days of ever-expanding knowledge, a content-driven approach to learning and teaching will not serve our citizens well in the long run. Students need to be empowered to seek learning, to criticize, to integrate their knowledge with their life experiences. Learning should be a transformative experience. For this to take place, the teachers need to employ a variety of learner-focussed approaches, as well as explaining.

It cracks me up, the way sugary cereals are advertised as “part of a healthy breakfast”. It isn’t exactly lying, but the healthy breakfast would do pretty well without the sugar-filled cereal. Explanations really are part of a good learning experience, but need to be complemented by discussion, participation, practice and critique.  Explanations are like porridge – healthy, but not a complete breakfast on their own.

Why statistics is so hard to teach

“I’m taking statistics in college next year, and I can’t wait!” said nobody ever!

Not many people actually want to study statistics. Fortunately many people have no choice but to study statistics, as they need it. How much nicer it would be to think that people were studying your subject because they wanted to, rather than because it is necessary for psychology/medicine/biology etc.

In New Zealand, with the changed school curriculum that gives greater focus to statistics, there is a possibility that one day students will be excited to study stats. I am impressed at the way so many teachers have embraced the changed curriculum, despite limited resources, and late changes to assessment specifications. In a few years as teachers become more familiar with and start to specialise in statistics, the change will really take hold, and the rest of the world will watch in awe.

In the meantime, though, let us look at why statistics is difficult to teach.

  1. Students generally take statistics out of necessity.
  2. Statistics is a mixture of quantitative and communication skills.
  3. It is not clear which are right and wrong answers.
  4. Statistical terminology is both vague and specific.
  5. It is difficult to get good resources, using real data in meaningful contexts.
  6. One of the basic procedures, hypothesis testing, is counter-intuitive.
  7. Because the teaching of statistics is comparatively recent, there is little developed pedagogical content knowledge. (Though this is growing)
  8. Technology is forever advancing, requiring regular updating of materials and teaching approaches.

On the other hand, statistics is also a fantastic subject to teach.

  1. Statistics is immediately applicable to life.
  2. It links in with interesting and diverse contexts, including subjects students themselves take.
  3. Studying statistics enables class discussion and debate.
  4. Statistics is necessary and does good.
  5. The study of data and chance can change the way people see the world.
  6. Technological advances have put the power for real statistical analysis into the hands of students.
  7. Because the teaching of statistics is new, individuals can make a difference in the way statistics is viewed and taught.

I love to teach. These days many of my students are scattered over the world, watching my videos (for free) on YouTube. It warms my heart when they thank me for making something clear, that had been confusing. I realise that my efforts are small compared to what their teacher is doing, but it is great to be a part of it.

Statistics is not beautiful (sniff)

Statistics is not really elegant or even fun in the way that a mathematics puzzle can be. But statistics is necessary, and enormously rewarding. I like to think that we use statistical methods and principles to extract truth from data.

This week many of the high school maths teachers in New Zealand were exhorted to take part in a Stanford MOOC about teaching mathematics. I am not a high school maths teacher, but I do try to provide worthwhile materials for them, so I thought I would take a look. It is also an opportunity to look at how people with an annual budget of more than 4 figures produce on-line learning materials. So I enrolled and did the first lesson, which is about people’s attitudes to math(s) and their success or trauma that has led to those attitudes. I’m happy to say that none of this was new to me. I am rather unhappy that it would be new to anyone! Surely all maths teachers know by now that how we deal with students’ small successes and failures in mathematics will create future attitudes leading to further success or failure. If they don’t, they need to take this course. And that makes me happy – that there is such a course, on-line and free for all maths teachers. (As a side note, I loved that Jo, the teacher, switched between the American “math” and the British/Australian/NZ “maths”).

I’ve only done the first lesson so far, and intend to do some more, but it seems to be much more about mathematics than statistics, and I am not sure how relevant it will be. And that makes me a bit sad again. (It was an emotional journey!)

Mathematics in its pure form is about thinking. It is problem solving and it can be elegant and so much fun. It is a language that transcends nationality. (Though I have always thought the Greeks get a rough deal as we steal all their letters for the scary stuff.) I was recently asked to present an enrichment lesson to a class of “gifted and talented” students. I found it very easy to think of something mathematical to do – we are going to work around our Rogo puzzle, which has some fantastic mathematical learning opportunities. But thinking up something short and engaging and realistic in the statistics realm is much harder. You can’t do real statistics quickly.

On my run this morning I thought a whole lot more about this mathematics/statistics divide. I have written about it before, but more in defense of statistics, and warning the mathematics teachers to stay away or get with the programme. Understanding commonalities and differences can help us teach better. Mathematics is pure and elegant, and borders on art. It is the purest science. There is little beautiful about statistics. Even the graphs are ugly, with their scattered data and annoying outliers messing it all up. The only way we get symmetry is by assuming away all the badly behaved bits. Probability can be a bit more elegant, but with that we are creeping into the mathematical camp.

English language and English literature

I like to liken. I’m going to liken maths and stats to English language and English literature. I was good at English at school, and loved the spelling and grammar aspects especially. I have in my library a very large book about the English language (The Cambridge Encyclopedia of the English Language, by David Crystal) and one day I hope to read it all. It talks about sounds and letters, words, grammar, syntax, origins, meanings. Even to dip into, it is fascinating. On the other hand I have recently finished reading “The End of Your Life Book Club” by Will Schwalbe, which is a biography of his amazing mother, set around the last two years of her life as she struggles with cancer. Will and his mother are avid readers, and use her time in treatment to talk about books. This book has been an epiphany for me. I had forgotten how books can change your way of thinking, and how important fiction is. At school I struggled with the literature side of English, as I wanted to know what the author meant, and could not see how it was right to take my own meaning from a book, poem or work of literature. I have since discovered post-modernism and am happy drawing my own meaning.

So what does this all have to do with maths and statistics? Well I liken maths to English language. In order to be good at English you need to be able to read and write in a functional way. You need to know the mechanisms. You need to be able to DO, not just observe. In mathematics, you need to be able to approach a problem in a mathematical way. Conversely, to be proficient in literature, you do not need to be able to produce literature. You need to be able to read literature with a critical mind, and appreciate the ideas, the words, the structure. You do need to be able to write enough to express your critique, but that is a different matter from writing a novel. This, to me, is like being statistically literate – you can read a statistical report, and ask the right questions. You can make sense of it, and not be at the mercy of poorly executed or mendacious research. You can even write a summary or a critique of a statistical analysis. But you do not need to be able to perform the actual analysis yourself, nor do you need to know the exact mathematical theory underlying it.

Statistical Literacy?

Maybe there is a problem with the term “statistical literacy”. The traditional meaning of literacy includes being able to read and write – to consume and to produce – to take meaning and to create meaning. I’m not convinced that what is called statistical literacy is the same.

Where I’m heading with this, is that statistics is a way to win back the mathematically disenfranchised. If I were teaching statistics to a high school class I would spend some time talking about what statistics involves and how it overlaps with, but is not mathematics. I would explain that even people who have had difficulty in the past with mathematics, can do well at statistics.

The following table outlines the different emphasis of the two disciplines.

Mathematics                                                         | Statistics
Proficiency with numbers is important                               | Proficiency with numbers is helpful
Abstract ideas are important                                        | Concrete applications are important
Context is to be removed so that we can model the underlying ideas  | Context is crucial to all statistical analysis
You don’t need to write very much                                   | Written expression in English is important

Another idea related to this is that of “magic formulas” or the cookbook approach. I don’t have a problem with cookbooks and knitting patterns. They help me to make things I could not otherwise. However, the more I use recipes and patterns, the more I understand the principles on which they are based. But this is a thought for another day.

The importance of being wrong

We don’t like to think we are wrong

One of the key ideas in statistics is that sometimes we will be wrong. When we report a 95% confidence interval, we will be wrong 5% of the time. Or in other words, about 1 in 20 of 95% confidence intervals will not contain the population parameter we are attempting to estimate. That is how they are defined. The thing is, we always think we are part of the 95% rather than the 5%. Mostly we will be correct, but if we do enough statistical analysis, we will almost certainly be wrong at some point. However, human nature is such that we tend to think it will be someone else. There is also a feeling of blame associated with being wrong. The feeling is that if we have somehow missed the true value with our confidence interval, it must be because we have made a mistake. However, this is not true. In fact we MUST be wrong about 5% of the time, or our intervals are too wide, and not really 95% confidence intervals.

The term “margin of error” appears with increasing regularity as elections approach and polling companies are keen to make money out of soothsaying. The common meaning of the margin of error is half the width of a 95% confidence interval. So if we say the margin of error is 3%, then about one time in twenty, the true value of the proportion will actually be more than 3% away from the reported sample value.
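
A quick simulation, with made-up numbers, illustrates both claims: that about one 95% interval in twenty misses, and that a sample of 1000 gives roughly a 3% margin of error.

```python
import numpy as np

rng = np.random.default_rng(42)

p_true = 0.5     # the population proportion (unknown in real life; assumed here)
n = 1000         # sample size of a typical poll
polls = 10_000   # number of repeated polls

p_hat = rng.binomial(n, p_true, size=polls) / n   # sample proportions
moe = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)     # half-width of each 95% interval

covered = (p_hat - moe <= p_true) & (p_true <= p_hat + moe)
print(covered.mean())   # close to 0.95 - about one poll in twenty misses
print(moe.mean())       # about 0.03, the familiar "3% margin of error"
```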

What doesn’t help is that we seldom know if we are correct or not. If we knew the real population value we wouldn’t be estimating it. We can contrive situations where we do know the population but pretend we don’t. If we do this in our teaching, we need to be very careful to point out that this doesn’t normally happen, but does in “classroom world” only. (Thanks to MD for this useful term.) General elections can give us an idea of being right or wrong after the event, but even then the problem of non-sampling error is conflated with sampling error. When opinion polls turn out to miss the mark, we tend to think of the cause as being due to poor sampling, or people changing their minds, or any number of imaginative explanations rather than simple, unavoidable sampling error.

So how do we teach this in such a way that it goes beyond school learning and is internalised for future use as effective citizens?

Teaching suggestions

I have two suggestions. The first is a series of True/False statements that can be used in a number of ways. I have them as part of on-line assessment, so that the students are challenged by them regularly. They would also work well in the classroom as part of a warm-up exercise at the start of a lesson. Students can write their answers down or vote using hands.

Here are some examples of True/False statements (some of which could lead to discussion):

  1. You never know if your confidence interval contains the true population value.
  2. If you make your confidence interval wide enough you can be sure that it contains the true population value.
  3. A confidence interval tells us where we are pretty sure the sample statistic lies.
  4. It is better to have a narrow confidence interval than a wide one, as it gives us more certain information, even though it is more likely to be wrong.
  5. If your study involves twenty confidence intervals, then you know that exactly one of them will be wrong.
  6. If a confidence interval doesn’t contain the true population value, it is because it is one of the 5% that was calculated incorrectly.

You can check your answers at the end of this post.

Experiential exercise

The other teaching suggestion is for an experiential exercise. It requires a little set up time.

Make a set of cards for students with numbers on them that correspond to the point estimate of a proportion, or a score that will lead to that. (Specifications for a set of 35 cards, representing the results from a proportion of 0.54 and 25 trials, are given below.)

Introduce the exercise as follows:
“I have a computer game, and have set the probability of winning at a certain value. Each of you has played 25 times, and the number of wins you have obtained will be on your card. It is really important that you don’t look at other people’s cards.”

Hand them out to the students. (If you have fewer than 35 in your class, it might be a good idea to make sure you include the cards with 8 and 19 in the set you use – sometimes it is ok to fudge slightly to teach a point.)
“Without getting information from anyone else, write down your best estimate of the true proportion of wins in the game. Do you think you are correct? How close do you think you are to the true value?”

They will need to divide the number of wins by 25, which should not lead to any computational errors! The point is that they really can’t know how close their estimate is to the true value – and what does “correct” mean?

Then work out the margin of error for a sample of size 25; using the 1/√n rule of thumb this is estimated at 1/√25 = 0.2, or 20%. Get the students to calculate their 95% confidence intervals, and decide if they have an interval that contains the true population value. Get them to commit one way or the other.
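
For the teacher, here is a small Python check of that set-up, using the card counts listed at the end of this post and the 1/√n margin of error. It confirms that only the students holding the 8 and the 19 end up with intervals that miss 0.54 – which is why it is worth making sure those two cards are handed out.

```python
import numpy as np

values = np.arange(8, 20)                                # card values 8 to 19
counts = np.array([1, 1, 2, 3, 5, 5, 6, 5, 3, 2, 1, 1])  # how many of each (35 cards in total)
wins = np.repeat(values, counts)

p_hat = wins / 25              # each student's point estimate
moe = 1 / np.sqrt(25)          # rule-of-thumb margin of error = 0.2
contains = (p_hat - moe <= 0.54) & (0.54 <= p_hat + moe)

print(wins[~contains])         # [ 8 19] - the only two intervals that miss the true 0.54
print((~contains).mean())      # 2 out of 35, or about 6%, close to the advertised 5%
```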

Now they can talk to each other about the values they have.

There are several ways you can go from here. You can tell them what the population proportion was from which the numbers were drawn (0.54). They can then see that most of them had confidence intervals that included the true value, and some didn’t. Or you can leave them wondering, which is a better lesson about real life. Or you can do one exercise where you do tell them and one where you don’t.

This is an area where probability and statistics meet. You could make a nice little binomial distribution problem out of being correct in a number of confidence intervals. There are potential problems with independence, so you need to be a bit careful with the wording. For example: Fifteen students undertake separate statistical analyses on the topics of their choice, and construct 95% confidence intervals. What is the probability that all the confidence intervals are correct, in that they do contain the population parameter being estimated? This is well modelled by a binomial distribution with n = 15 and p = 0.05. P(X = 0) = 0.46. And another interesting idea – what is the probability that two or more are incorrect? 0.17 is the answer. So there is a 17% chance that more than one of the confidence intervals does not contain the population parameter of interest.
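
The binomial arithmetic behind those two numbers, as a minimal Python check:

```python
from math import comb

n, p = 15, 0.05   # fifteen intervals, each missing with probability 0.05

def binom_pmf(k):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(binom_pmf(0), 2))                      # 0.46 - all fifteen intervals correct
print(round(1 - binom_pmf(0) - binom_pmf(1), 2))   # 0.17 - two or more incorrect
```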

This is an area that needs careful teaching, and I suspect that some teachers have only a sketchy understanding of the idea of confidence intervals and margins of error. It is so important to know that statistical results are meant to be wrong some of the time.

Answers: T, T, F, debatable, F, F.

Data for the 35 cards:

Number on card    Number of cards
8                 1
9                 1
10                2
11                3
12                5
13                5
14                6
15                5
16                3
17                2
18                1
19                1

A dearth of raw data

The desired outcome of this post is to be proved wrong.

Here is my assertion: It is really difficult to find appropriate sets of data to use for teaching and assessing statistical analysis.

This is a problem; one of the key factors in teaching statistics effectively is to use real data. I have written about the need for real data (not faked) in my post Stop faking it, data should be real. I’d like to apologise here and now for my arrogant assertion that “The internet abounds with data. We can just about drown in it.” I feel like the ancient mariner staring at the data abounding, with no drop fit to drink, let alone drown in.

Recently a teacher contacted me to help her find a set of data for an assessment task in Year 13 statistics. The data set needs to have the following characteristics:

  • It must be real
  • A sample (not a population)
  • Multivariate so that the students have a choice of variables to model
  • Have at least one variable of interval/ratio data
  • Have at least one way of dividing the sample into two groups
  • It should not be a set that has previously been used for assessment in the public domain in New Zealand.
  • It should be of interest to the students
  • It should be open to background research
  • Ideally it should be randomly sampled
  • It should preferably be from New Zealand (Australia is near enough), and not too old.

How hard could that be? (I joke of course – it is very hard.)

I fancy I am pretty good at ferreting things out on the internet, but though I found wonderful sites with lots of sets of data, I could not find one set to fit the criteria. And the problem is, this will need to happen every year in every school in New Zealand, often more than once.

This is not a unique problem, I suspect. When I taught at university I was challenged to come up with appropriate data sets each year for assessment exercises. Consequently we would sometimes rotate data sets in a three year cycle, or (oh the shame) make fake data.

All over the world people are collecting data and doing analysis. Why is it so difficult to find raw data?

One issue is that of privacy – in New Zealand we have strict laws with regard to privacy and informed consent, which means that it is easier to keep the data hidden rather than try to anonymise it for general consumption. Surely that is not the case in non-human research, though. It takes a bit of work to make data available, and academics and researchers do not have time to spare. Some data is commercially sensitive, forbidding its release to the public domain. Often what look like promising data sets are not at a unit level, but are summarised into tables for the reader.

I went searching for links to data sets, and found the following. So I guess there is data out there, but it is time-consuming to find appropriate sets. And very little of it relates to NZ, sadly. And baseball, basketball and medical sets abound.

http://www.statsci.org/datasets.html looks promising, and I am grateful for the efforts. However very few of the sets meet the criteria.

http://iase-web.org/Links.php?p=Datasets has links to other sources

http://www.amstat.org/publications/jse/jse_data_archive.htm This one has the most informative layout, in terms of finding out whether the database is likely to be useful.

So in a way I have proved myself wrong already. There are datasets out there, but it is difficult to find one that is just right! I feel for teachers having to trawl through so many sites to find something, though. I had hoped that there would be sets of data along with PhD dissertations, but even in the area of statistics education, I couldn’t find any.

I don’t have an answer to this problem. As a uni lecturer I solved it for my own class by collecting data from them, pretending that it was a random sample of first year university students, and giving it back to them to play with. Obviously not ideal, but fun!

Please share suggestions in the comments.

Probability and Deity

Our perception of chance affects our worldview

There are many reasons that I am glad that I majored in Operations Research rather than mathematics or statistics. My view of the world has been affected by the OR way of thinking, which combines hard and soft aspects. Hard aspects are the mathematics and the models, the stuff of the abstract world. Soft aspects relate to people and the reality of the concrete world.  It is interesting that concrete is soft! Operations Research uses a combination of approaches to aid in decision making.

My mentor was Hans Daellenbach, who was born and grew up in Switzerland, did his postgrad study in California, and then stepped back in time several decades to make his home in Christchurch, New Zealand. Hans was ahead of his time in so many ways. The way I am writing about today was his teaching about probability and our subjective views on the likelihood of events.

Thanks to Daniel Kahneman’s publishing and 2002 Nobel prize, the work by him and Amos Tversky is reaching into the popular realm and is even in the high school mathematics curriculum, in a diluted form. Hans Daellenbach required first year students to read a paper by Tversky and Kahneman in the late 1980s, over a decade earlier. This was not popular, either with the students or the tutors who were trying to make sense of the paper. Eventually we made up some interesting exercises in tutorials, and structured the approach enough for students to catch on. (Sometimes nearly half our students were from a non-English speaking background, so reading the paper was extremely challenging for them.) As a tutor and later a lecturer, I internalised the thinking, and it changed the way I see the world and chance.

People’s understanding of probability and chance events has an impact on how they see the world as well as the decisions they make.

For example, Tversky and Kahneman introduced the idea of the availability heuristic. This means that if someone we know has been affected by a catastrophic (or wonderful) unlikely event, we will perceive such an event as more likely. For example if someone we know has had their house broken into, then we feel less secure, as we perceive the likelihood of burglary as increased. Someone we know wins the lottery, and suddenly it seems possible for us. Nothing has changed in the world, but our perception has changed.

Another easily understood concept is confirmation bias. We notice and remember events and sequences of events that reinforce or confirm our preconceived notions. “Bad things come in threes” is a wonderful example. One or two bad things happen, so we look for or wait for the third, and then stop counting. Similarly we remember the times when our lucky number is lucky, and do not remember the unlucky times. We mentally record the times our hunches pay off, and quietly forget the times they don’t.

So how does this affect us as teachers of statistics? Are there ethical issues involved in how we teach statistics?

I believe in God and I believe that He guides me in my decisions in life. However I do not perceive God as a “micro-manager”. I do not believe that He has time in His day to help me to find carparks, and to send me to bargains in the supermarket. I may be wrong, and I am prepared to be proven wrong, but this is my current belief. There are many people who believe in God (or in that victim-blaming book, “The Secret”), who would disagree with me. When they see good things happen, they attribute them to the hand of God, or karma or The Secret. There are people in some cultures who do not believe in chance at all. Everything occurs as God’s will, hence the phrase “insha’Allah”, or “God willing”. If they are delayed in traffic, or run into a friend, or lose their job, it is because God willed it so. This is undoubtedly a simplistic explanation, but you get the idea.

Now along comes the statistics teacher and teaches probability. Mathematically there are some things for which the probability is easily modelled. Dice, cards, counters, balls in urns, socks in drawers can all have their probability modelled, using the ratio of the number of favourable outcomes to the number of possible outcomes. There are also probabilities estimated using historic frequencies, and there are subjective estimates of probabilities. Tversky and Kahneman’s work showed how flawed humans are at the subjective estimates.

For some (most?) students probability remains “school-knowledge” and makes no difference to their behaviour and view of the world. It is easy to see this on game-shows such as “Deal or No Deal”, my autistic son’s favourite. It is clear that except for the decision to take the deal or not, there is no skill whatsoever in this game. In the Australian version, members of the audience hold the different cases and can guess what their case holds. If they get it right they win $500. When this happens they are praised – well done! When the main player is choosing cases, he or she is warned that they will need to be careful to avoid the high value cases. This is clearly impossible, as there is no way of knowing which cases contain which values. Yet they are praised, “Well done!” for cases that contain low values. Sometimes they even ask the audience members what they think they are holding in the case. This makes for entertaining television – with loud shouting at times to “Take the Deal!”. But it doesn’t imbue me with any confidence that people understand probability.

Having said that, I know that I act irrationally as well. In the 1990s there were toys called Tamagotchis which were electronic pets. To keep your pet happy you had to “play” with it, which involved guessing which way the pet would turn. I KNEW that it made NO difference which way I chose and that I would do just as well by always choosing the same direction. Yet when the pet had turned to the left four times in succession, I would choose turning to the right. Assuming a good random number generator in the pet, this was pointless. But it also didn’t matter!

So if I, who have a fairly sound understanding of probability distributions and chance, still think about which way my tamagotchi is going to turn, I suspect truly rational behaviour in the general populace with regard to probabilistic events is a long time coming! Astrologers, casinos, weather forecasters, economists, lotteries and the like will never go broke.

However there are other students for whom a better understanding of the human tendency to find patterns and confirm beliefs could provide a challenge. Their parents may strongly believe that God intervenes often or that there is no uncertainty, only lack of knowledge. (In a way this is true, but that’s a topic for another day.) Like the child who has just discovered the real source of Christmas bounty, probability models are something to ponder, and can be disturbing.

We do need to be sensitive in how we teach probability. Not only can we shake people’s beliefs, but we can also use insensitive examples. I used to joke about how car accidents are a Poisson process with batching, which leads to a very irregular series. Then for the last two and a half years I have been affected by the Christchurch earthquakes. I have no sense of humour when it comes to earthquakes. None of us do. When I saw in a textbook an example about the probability of a building falling down as a result of an earthquake, I found it upsetting. A friend was in such a building and, though she physically survived, it will be a long time before she has a full recovery, if ever. Since then I have never used earthquakes as an example of a probabilistic event when teaching in Christchurch. I also refrain as far as possible from using other examples that may stir up pain, or try to treat them in a sober manner. Breast cancer, car accidents and tornadoes kill people and may well have affected our pupils. Just a thought.

Teaching statistical report-writing

Teaching how to write statistical reports

It is difficult to write statistical reports and it is difficult to teach how to write statistical reports.

When statistics is taught in the traditional way, with emphasis on the underlying mathematics, the process of statistics is truncated at both ends. When we concentrate on the sterile analysis, the messy “writing stuff” is avoided. Students do not devise their own investigative questions, and they do not write up the results.

Here’s the thing though – in reality, the analysis step of a statistical investigation is a very small part of the whole, and performed at the click of a button or two.

Ultimately the embedding of the analysis back into an investigation should not be a problem. The really interesting part of statistics happens all around the analysis. Understanding the context enriches the learning, transforming the discipline from mathematics to statistics. We can help students embrace the excitement of a true statistical investigation. But in this time of transition, the report-writing aspects are a problem. They are a problem for the learner and for the teacher.

The new New Zealand curriculum for statistics requires report-writing as an essential component of the majority of assessment, particularly in the final year of high school. This is causing understandable concern among teachers, who come predominantly from a mathematical background. I can imagine myself a few years ago saying, “I became a maths teacher so I wouldn’t have to teach and mark essays!” In addition, the results from the students are less than stellar, even from capable students. Teachers do not like their students to perform poorly.

All statistics courses should have a component of report-writing, unless they are courses in the mathematics of statistics. The problem here is that, like the secondary school teachers in New Zealand, many statistics instructors are dealing with the mathematics more than the application of statistics, and are not confident in their own report-writing ability. Normal human behaviour is to avoid it. Having taught service statistics courses in a business school for two decades, I have gradually made the transition to more emphasis on report-writing and am convinced that statistical report-writing needs to be taught explicitly, and taught well.

Report-writing is a fundamental and useful skill

For teachers who are uncomfortable with teaching and marking reports, it would be nice to dismiss the process of report-writing as “not important”. Much of statistics teaching is in a service course, as discussed in my previous blog. It is unlikely that any of these students will ever have to write a report on a statistical analysis, other than as part of the assessment for the course. So why do we put them and ourselves through this?

You don’t realise whether you understand or not until you try to write it down.

The written word requires a higher level of precision than a thought or a spoken explanation. Your sentences look at you from the page and mock you with their vagueness and ambiguity. I find this out time and again as I blog. What seems like a well thought out argument in my head as I do my morning run, falls to shreds on paper, before being mustered into some semblance of order. It is in writing that we identify the flaws in our understanding. As we try to write our findings we become more aware of fuzzy thinking and gaps in reasoning. As we write we are required to organise our thoughts.

Better critics of other reports

A student who has been required to produce a report of a good standard will be exposed to examples of good and bad reports and will be better able to identify incorrect thinking in reports they read themselves. This is perhaps the most important purpose of a terminal course in statistics. Having said that, it is both heart-warming and alarming to hear from past-students the wonderful things they are doing with the statistics they learned in my one-semester course.

Useful skill for employment

Students need to be able to read and write as part of empowered citizenship. The skill of writing a coherent report in good English is highly sought after by employers, and of great use at university in just about every discipline. It is a transferable skill to many endeavours.

Reports are needed for assessment

On a practical level, if the teacher is going to evaluate understanding they need evidence to work from. A written report provides one form of evidence of understanding.

Report-writing is difficult to teach

Some maths teachers may feel inadequate in teaching “English”, which is how they see report-writing. They do not have the pedagogical content knowledge for teaching writing that they do for teaching algebra or percentages, for instance. Pedagogical content knowledge is more than the intersection of knowing a subject and being able to teach in a general sort of way. It is the knowledge of how to teach a certain discipline, what is difficult for learners, and how to help them learn.

Some basic ideas for teaching report-writing

To write a good report you need to understand what is going on, have the appropriate vocabulary, and use a clear structure. Good teaching will emphasise understanding. Getting students to write sentences about output, and sharing them with their peers, is a great way to identify misunderstandings. As these sentences are shared, the teacher can model the use of correct technical language. They can say, for instance, “You have the essence correct here, but there are some more precise terms you could use, such as …” Teachers can either give students outlines for reports, or they can give them several good reports and get the students to identify the underlying structure. I am a firm believer in the generous use of headings within a report. They provide signposts for writer and reader alike.

You can see this in my video, Writing up a Time Series Report.

Report-writing requires practice. The assessment report should not be the first report of that type that a student writes. In the world of motivated students with no other demands on their time, it would be great to have them write up one assignment for the practice and then learn from that to produce a better one. I am aware that students tend not to do the work unless there is a grade attached to it, so it can be difficult to get a student to do a “practice report” ahead of the “real assessment.”  There are other alternatives that approximate this, however, which require less input from the teacher. One of these, the use of templates, is explained in an earlier post, Templates for statistical reports – spoon-feeding?

There is nothing wrong with using templates and “sensible sentences”. (not to be confused with “sensible sentencing”, which seems devoid of sense.) There are only so many ways to say that “the median number of pairs of shoes owned by women is ten.” It is also a difficult sentence to make sound elegant. Good reports will look similar. This is not creative-writing – it is report-writing. Sure the marking may be boring when all the reports seem very similar, but it is a small price to pay when you avoid banging your head against the desk at the bizarre and disorganised offerings.

This is but a musing on the teaching of report-writing. Glenda Francis, in “An approach to report writing in statistics courses”, identifies similar issues, and provides a fuller background to the problem. She also indicates that there is much to be done in developing this area of teaching and research. I will be providing professional development in this area over the next month to at least three groups of teachers, and I look forward to learning a great deal from them, as we explore these issues together.

Context – if it isn’t fun…

The role of context in statistical analysis

The wonderful advantage of teaching statistics is the real-life context within which any application must exist. This can also be one of the difficulties. Statistics without context is merely the mathematics of statistics, and is sterile and theoretical. The teaching of statistics requires real data. And real data often comes with a fairly solid back-story.

One of the interesting aspects for practising statisticians is that they can find out about a wide range of applications, by working in partnership with specialists. In my statistical and operations research advising I have learned about a range of subjects, including the treatment of hand injuries, children’s developmental understanding of probability, bed occupancy in public hospitals, the educational needs of blind students, growth rates of vegetables, texted comments on service at supermarkets, killing methods of chickens, rogaine route choice, co-ordinating scientific expeditions to Antarctica and the cost of care for neonates in intensive care. I found most of these really interesting and was keen to work with the experts on these projects. Statisticians tend to work in teams with specialists in related disciplines.

Learning a context can take time

When one is part of a long-term project, time spent learning the intricacies of the context is well spent. Without that, the meaning from the data can be lost. However, it is difficult to replicate this in the teaching of statistics, particularly in a general high school or service course. The amount of time required to become familiar with the context takes away from the time spent learning statistics. Too much time spent on one specific project or area of interest can mean that the students are unable to generalise. You need several different examples in order to know what is specific to the context and what is general to all or most contexts.

One approach is to try to have contexts with which students are already familiar. This can be enabled by collecting the data from the students themselves. The Census at School project provides international data for students to use in just this way. This is ideal, in that the context is familiar, and yet the data is “dirty” enough to provide challenges and judgment calls.

Some teachers find that this is too low-level and would prefer to use biological data, or dietary or sports data from other sources. I have some reservations about this. In New Zealand the new statistics curriculum is in its final year of introduction, and understandably there are some bedding-in issues. One I perceive is the relative importance of the context in the students’ reports. As these reports have high-stakes grades attached to them, this is an issue. I will use as an example the time series “standard”. The assessment specification states, among other things, “Using the statistical enquiry cycle to investigate time series data involves: using existing data sets, selecting a variable to investigate, selecting and using appropriate display(s), identifying features in the data and relating this to the context, finding an appropriate model, using the model to make a forecast, communicating findings in a conclusion.”

The full “standard” is given here: Investigate Time Series Data. This would involve about five weeks of teaching and assessment, in parallel with four other subjects. (The final three years of schooling in NZ are assessed through the National Certificate of Educational Achievement (NCEA). Each year students usually take five subject areas, each of which consists of about six “achievement standards” worth between 3 and 6 credits. There is a mixture of internally and externally assessed standards.)
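
As an aside, here is a bare-bones sketch of what “identify the seasonal pattern, describe the trend, make a forecast” looks like, done in Python with made-up quarterly data rather than in iNZight or whatever package the class actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up quarterly series: upward trend + seasonal pattern + noise
quarters = np.arange(24)   # six years of quarterly observations
y = 100 + 1.5 * quarters + np.tile([8, -3, -9, 4], 6) + rng.normal(0, 2, 24)

# Fit a straight-line trend, then average the leftovers by quarter (a crude decomposition)
slope, intercept = np.polyfit(quarters, y, 1)
seasonal = np.array([(y - (intercept + slope * quarters))[q::4].mean() for q in range(4)])

# Forecast the next four quarters: extend the trend and add the seasonal effects back
future = np.arange(24, 28)
forecast = intercept + slope * future + seasonal
print(np.round(forecast, 1))
```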

In this specification I see that there is a requirement for the model to be related to the context. This is a great opportunity for teachers to show how models are useful, and their limitations. I would be happy with a few sentences indicating that the student could identify a seasonal pattern and make some suggestions as to why this might relate to the context, followed by a similar analysis of the shape of the trend. However there are some teachers who are requiring students to do independent literature exploration into the area, with references, while forbidding the referencing of Wikipedia.

This concerns me, and I call for robust discussion.

Statistics is not research methods any more than statistics is mathematics. Research methods and standards of evidence vary between disciplines. Clearly the evidence required in medical research will differ from that of marketing research. I do not think it is the place of the statistics teacher to be covering this. Mathematics teachers are already being stretched to teach the unfamiliar material of statistics, and I think asking them and the students to become expert in research methods is going too far.

It is also taking out all the fun.

Keep the fun

Statistics should be fun for the teacher and the students. The context needs to be accessible or you are just putting in another opportunity for antipathy and confusion. If you aren’t having fun, you aren’t doing it right. Or, more to the point, if your students aren’t having fun, you aren’t doing it right.

Some suggestions about the role of context in teaching statistics and operations research

  • Use real data.
  • If the context is difficult to understand, you are losing the point.
  • The results should not be obvious. It is not interesting that year 12 boys weigh more than year 9 boys.
  • Null results are still results. (We aren’t trying for academic publications!)
  • It is okay to clean up data so you don’t confuse students before they are ready for it.
  • Sometimes you should use dirty data – a bit of confusion is beneficial.
  • Various contexts are better than one long project.
  • Avoid the plodding parts of research methods.
  • Avoid boring data. Who gives a flying fish about the relative sizes of dolphin jaws?
  • Wikipedia is a fine place to find out about the context for most high school statistics analyses. That is where I look; it’s a good starting place for anyone.

Excel, SPSS, Minitab or R?

I often hear this question: Should I use Excel to teach my class? Or should I use R? Which package is the best?

It depends on the class

The short answer is: it depends on your class. You have to ask yourself what attitudes, skills and knowledge you want the students to gain from the course. What is it that you want them to feel and do and understand?

If the students are never likely to do any more statistics, what matters most is that they understand the elementary ideas, feel happy about what they have done, and recognise the power of statistical analysis, so they can later employ a statistician.

If the students are strong in programming, such as engineering or computer science students, then they are less likely to find the programming a barrier, and will want to explore the versatility of the package.

If they are research students and need to take the course as part of a research methods paper, then they should be taught on the package they are most likely to use in their research.

Over the years I have taught statistics using Excel, Minitab and SPSS. These days I am preparing materials for courses using iNZight, a purpose-built user interface with an R engine. I have dabbled in R, but have never had students for whom R was a suitable teaching choice.

Here are my pros and cons for each of these, and when each is most suitable.

Excel

I have already written somewhat about the good and bad aspects of Excel, and the evils of Excel histograms. There are many problems with statistical analysis in Excel. I am told there are parts of the Analysis ToolPak which are wrong, though I have never found them myself. There is no straightforward way to do a hypothesis test for a mean. The data-handling capabilities of the spreadsheet are fantastic, but the ToolPak cannot even deal well with missing values. The output is idiosyncratic and not at all intuitive. There are programming quirks which should have been eliminated many years ago. For example, when you click the radio button to say where you want the output to go, the entry box for the data is activated rather than the one for the output. It would take only elementary Visual Basic to correct this, but it has never happened. Each time Excel is upgraded I look for this small fix, and I have repeatedly been disappointed.
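To make the missing hypothesis test concrete: the one-sample test of a mean is a single line in R, and below it is roughly the arithmetic you end up assembling from spreadsheet formulas instead. The weights and the hypothesised mean of 50 are invented for illustration.

    # One-sample t-test of a mean: one line in R, but pieced together by hand
    # in Excel from AVERAGE, STDEV.S and the t-distribution functions.
    weights <- c(48, 52, 47, 55, 51, 49, 53, 50, 46, 54)

    t.test(weights, mu = 50)    # two-sided test of H0: population mean = 50

    # The same arithmetic spelled out, roughly what a spreadsheet forces on you:
    n      <- length(weights)
    t_stat <- (mean(weights) - 50) / (sd(weights) / sqrt(n))
    p_val  <- 2 * pt(-abs(t_stat), df = n - 1)
    c(t = t_stat, p = p_val)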

So, given these shortcomings, why would you use Excel? Because it is there; because you are helping students gain spreadsheet skills at the same time; because it is less daunting to use a familiar interface. These reasons may not apply to all students, but for first-year business students Excel is the best package, for so many reasons.

PivotTables in Excel are nasty to get your head around, but once you do, they are fantastic. I resisted teaching PivotTables for some years, but I was wrong: they may well be one of the most useful things I have ever taught at university. I made my students create comparative bar charts in Excel using PivotTables. One day Helen and I will make a video about PivotTables.
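For readers who live outside Excel, the idea underneath a PivotTable is just a cross-tabulation. A rough equivalent in R, with an invented survey data frame, looks like this:

    # What a PivotTable does under the hood: cross-tabulate two categorical
    # variables, then draw a comparative (side-by-side) bar chart.
    # The survey data frame and its variables are invented for illustration.
    set.seed(1)
    survey <- data.frame(
      year   = sample(c("Year 9", "Year 12"), 100, replace = TRUE),
      travel = sample(c("Walk", "Bus", "Car"), 100, replace = TRUE)
    )

    counts <- table(survey$travel, survey$year)   # the "pivot" step
    counts

    barplot(counts, beside = TRUE, legend.text = rownames(counts),
            main = "Travel to school by year level")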

Minitab

Minitab is a lovely little package with very nice output. Its roots as a teaching package are obvious from the user-friendly presentation of results. It has been some years since I taught with Minitab, though; the main reason is that students are unlikely ever to have access to Minitab again, and there is a lot of extra learning required just to get up and running with it.

SPSS

Most of my teaching at second-year undergraduate, MBA and Master of Education level has been with SPSS, and much of the analysis for my PhD research was done in SPSS. It is a useful package, with its own peculiarities. I really like the data handling, in terms of excluding data, transforming variables and dealing with missing values. It has a much larger suite of analysis tools, including factor analysis, discriminant analysis, clustering and multi-dimensional scaling, which I taught to second-year business students and research students. SPSS shows its origins as a suite of barely related packages in the way it does things differently in different areas. But it’s pretty good really.
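Those data-handling steps, declaring missing values, excluding cases and computing new variables, translate roughly into R like this; the data frame, variables and cut-off are invented for illustration.

    # Rough R equivalents of the SPSS data handling mentioned above.
    survey <- data.frame(
      age    = c(15, 17, 16, 99, 14),       # 99 used here as a missing-value code
      income = c(200, NA, 150, 300, 120)
    )

    survey$age[survey$age == 99] <- NA       # declare 99 as missing, as SPSS would
    teens_only <- subset(survey, age < 18)   # exclude cases (select/filter rows)
    survey$log_income <- log(survey$income)  # compute (transform) a new variable

    colSums(is.na(survey))                   # how many missing values per variable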

R

R is what you expect from a command-line open-source program. It is extremely versatile, and pretty daunting for an arts or business major. I can see that R is brilliant for second-level and up in statistics, preferably for students who have already mastered similar packages/languages like MatLab or Maple. It is probably also a good introduction to high-level programming for Operations Research students.
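To give a feel for what “command-line” means for a beginner, here is the sort of minimal first session I have in mind, using the built-in ToothGrowth data as a stand-in for a class survey:

    # A first R session: load data, summarise, plot, test - everything is typed.
    data("ToothGrowth")

    str(ToothGrowth)       # what variables are there, and of what type?
    summary(ToothGrowth)   # five-number summaries and counts

    hist(ToothGrowth$len, main = "Tooth growth", xlab = "Length")

    t.test(len ~ supp, data = ToothGrowth)   # compare the means of two groups

    # Versatile, certainly, but every step is typed at a prompt, which is
    # exactly what daunts an arts or business major meeting it for the first time.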

iNZight

This brings us to iNZight, which is a suite of routines using R, set in a semi-friendly user interface. It was specifically written to support the innovative New Zealand school curriculum in statistics, and has a strong emphasis on visual representation of data and results. It includes alternatives that use bootstrapping as well as traditional hypothesis testing. The time series package allows only one kind of seasonal model. I like iNZight. If I were teaching at university still, I would think very hard about using it. I certainly would use it for Time Series analysis at first year level. For high school teachers in New Zealand, there is nothing to beat it.

It has some issues. The interface is clunky and takes a long time to unzip if you have a dodgy computer (as I do). The graphics are unattractive. Sorry guys, I HATE the eyeball, and the colours don’t do it for me either. I think they need to employ a professional designer. SOON! The data has to be just right before the interface will accept it. It is a little bit buggy in a non-disastrous sort of way. It can have dimensionality/rounding issues. (I got a zero slope coefficient for a linear regression with an r of 0.07 the other day.)

But – iNZight does exactly what you want it to do, with lots of great graphics and routines to help with understanding. It is FREE. It isn’t crowded with all the extras that you don’t really need. It covers all of the New Zealand statistics curriculum, so the students need only to learn one interface.
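As for the bootstrapping alternatives mentioned above: iNZight’s routines are its own, but the underlying idea fits in a few lines of base R, shown here with an invented sample of student heights.

    # The idea behind the bootstrap alternative: resample the data with
    # replacement many times and look at the spread of the recalculated statistic.
    # The heights are invented for illustration; iNZight's own routines differ.
    heights <- c(158, 162, 171, 165, 180, 169, 174, 160, 177, 168)

    boot_means <- replicate(10000, mean(sample(heights, replace = TRUE)))

    quantile(boot_means, c(0.025, 0.975))   # a 95% bootstrap interval for the mean
    hist(boot_means, main = "Bootstrap distribution of the mean",
         xlab = "Resampled mean height (cm)")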

There are other packages such as Genstat, Fathom and TinkerPlots, aimed at different purposes. My university did not have any of these, so I didn’t learn them. They may well be fantastic, but I haven’t the time to do a critique just now. Feel free to add one as a comment below!