# Use real data when teaching statistics

In statistical analysis the context of the data is integral, not a story added on afterwards to make it more interesting. It is not like algebra, where “making it real” means you make up a reason for the equation and require the students to give the correct units for the answer. In statistics the analysis involves understanding what is happening in the data.

For this reason, as much as possible, data must be real.

In a previous incarnation I have been guilty of making up data. I was even quite proud of being able to make sure my fake multivariate data displayed heteroscedasticity and multicollinearity. That was fine for an assessment item, I reasoned at the time, as I wanted to make sure that students could recognise those effects.

I recently reviewed a case which had been submitted for publication. The case story was great, with some interesting soft aspects, based on a real-life scenario. Then the second part of the case involved analysing data, which was openly fake. I decided to see how I would go, downloaded the data and started playing around in it. I found it disturbing that there was an R-squared value of more than 99%. Then the more I explored, the worse it got, and the more convinced I was that the problem lay in the generation of the data. This would have caused perplexity for students who really wanted to understand what was going on. It is not acceptable to have badly faked data in a case.

## What is so great about real data?

With appropriate topics, the outcome interests the students. It can cause them to think, and realise that there is a use for statistics. It can be exciting! You can have discussions about why this result might have happened.

An interesting bonus, which you can choose to use or not, is that the data is dirty! (See my post about dirty data.) Students learn that data does not arrive beautifully sanitised like the pristine textbook sets. They meet the problems of real data in class, so they are better prepared for real data in the real world.

## The failings of fake data

1. Effects may seem really interesting, but they were put there by the instructor (sometimes by mistake), so there is no basis in reality. I see this as rather the equivalent of the movie “The Truman Show”, where a whole world is generated for Truman Burbank with exactly the events needed to make a television series interesting. Sure, you may find a relationship in the data, but only because you put it there in the first place!
2. You can get odd artefacts of the generation process. Some interesting pattern shows up when a student looks at the data a different way from what you expect. This pattern could be just because you didn’t think to get rid of it.
3. Generating good fake data is actually quite tricky to do if you want to get it right.
4. Using fake data trivialises the statistical process to mechanistic algorithm application. Fake data may be better than numeric data with no context, but not by much.
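To make the second and third points concrete, here is a minimal sketch of one way a generation process can leave its fingerprints on fake data. It is only an illustration under my own assumptions: the linear model, the coefficients, the noise levels and the function names are all hypothetical, and it is not a reconstruction of the case described earlier. It simply shows that if the random noise is made too small relative to the built-in effect, the R-squared comes out implausibly close to 100%.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation, which equals R-squared
    for a simple linear regression of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


def make_fake_data(n, noise_sd, seed=1):
    """Generate y = 20 + 3x + noise. The formula and its
    coefficients are invented purely for this illustration."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 100) for _ in range(n)]
    ys = [20 + 3 * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


# Too little noise: the "effect" the instructor built in dominates everything.
xs, ys = make_fake_data(200, noise_sd=2)
too_clean = r_squared(xs, ys)

# Much more noise: the relationship is still there, but far less tidy.
xs, ys = make_fake_data(200, noise_sd=60)
more_plausible = r_squared(xs, ys)

print(f"R-squared with small noise: {too_clean:.4f}")
print(f"R-squared with large noise: {more_plausible:.4f}")
```

Even the second version is only "plausible" in its overall fit. The noise is still perfectly normal with constant spread, which is exactly the kind of tidiness an inquisitive student may notice and be puzzled by.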

## Sources of real data

The internet abounds with data. We can just about drown in it. This is one source of data, but it is mostly clean, which removes one of the advantages of real data.

However, I prefer to get the data from the students themselves. Each year I have a questionnaire which the students fill out anonymously online at the start of the course. Then I use this as a source of data for class examples, exercises and testing. Over the years I have found some interesting effects among the data from our students. An important thing to remember is to make sure you have a range of levels of data. It is very easy to collect nominal/categorical data, but it’s not much use for teaching regression. Data suitable for a paired difference of two means can also be difficult to collect, so you have to think ahead on that one. Here are some example questions for each level of measurement.

## Nominal

• What type of chocolate do you prefer?
• What kind of mobile phone do you own?
• Sex?
• Nationality?
• How did you travel to university today?
• What subject are you majoring in?

## Ordinal

• How useful do you think this course will be in your future career? (Very useful, somewhat useful, not useful)
• How successful have you been in mathematics in the past? (Very successful, somewhat successful, not successful)
• How often do you check Facebook? (More than once a day, about once a day, several times a week, about once a week, less often than once a week.)

## Interval/ratio

• How many pairs of trousers do you own?
• What is the most you have ever paid for a pair of trousers?
• What annual income do you expect to be earning in ten years’ time?
• What do you think the average income for the class will be in ten years’ time?
• How many children would you like to have?
• What is the ideal age to get married?
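The reason for collecting all three levels is that each one supports different summaries and analyses. The sketch below shows one way this might look once the questionnaire responses are in. The response values are placeholders I have typed in purely to make the code run (the variable names are hypothetical too); in class they would be replaced by the students' own answers.

```python
import statistics

# Placeholder responses only: swap in the real questionnaire data.
travel = ["bus", "car", "bus", "walk", "bicycle", "bus"]      # nominal
usefulness = ["very useful", "somewhat useful", "not useful",
              "somewhat useful", "very useful", "somewhat useful"]  # ordinal
trousers = [4, 7, 5, 10, 6, 4]                                # numeric

# Nominal: categories have no order, so counts and the mode are
# the only sensible summaries.
travel_mode = statistics.mode(travel)

# Ordinal: order matters but distances do not, so convert each
# response to its rank on the scale and report the median category.
scale = ["not useful", "somewhat useful", "very useful"]
ranks = [scale.index(response) for response in usefulness]
usefulness_median = scale[statistics.median_low(ranks)]

# Numeric: differences are meaningful, so means and standard
# deviations (and later, regression) become available.
trousers_mean = statistics.mean(trousers)
trousers_sd = statistics.stdev(trousers)

print("Most common way to travel:", travel_mode)
print("Median usefulness rating:", usefulness_median)
print(f"Mean pairs of trousers: {trousers_mean} (sd {trousers_sd:.2f})")
```

Using `median_low` rather than `median` is a deliberate choice here: with an even number of responses it returns an actual category rather than a point halfway between two ranks, which would have no meaning on an ordinal scale.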

## Real data in Operations Research

Unfortunately it is more difficult to find real-life problems in OR which can be solved in the classroom. One possible approach is to start with a real-life case, and then provide a cut-down version for the students to work on. When we make up exercises for OR, we search the web to make sure that the figures used are realistic estimations of real costs.

In a lesson on Multi-criteria Decision Making we had the case of locating a landfill. This was especially pertinent as our city had recently gone through the political process to set up a new landfill. A helpful website gave ballpark figures on costs for many of the aspects. With the internet at our fingertips there is no excuse for unrealistic figures.

There is work involved in collecting real data, but if we want students to accept that statistics and operations research are relevant, it must be done.