There is a push for teachers and students to use real data in learning statistics. In this post I am going to address the benefits and drawbacks of different sources of real data, and make a case for the use of good fictional data as part of a statistical programme.
Here is a video introducing our fictional data set of 180 or 240 dragons, so you know what I am referring to.
Real collected, real database, trivial, fictional
There are two main types of real data: data that students collect themselves, and data in an existing dataset, collected by someone else and available in its entirety. There are also two main types of unreal data. The first is trivial, lacks context, and is useful only for teaching mathematical manipulation. The second is what I call fictional data, which is usually based on real-life data but has some extra advantages, so long as it is skilfully generated. Poorly generated fictional data, as often found in case studies, is very bad for teaching.
Focus
When deciding what data to use for teaching statistics, it matters what you are trying to teach. If you are simply teaching how to add up 8 numbers and divide the result by 8, then you are not actually doing statistics, and trivial fake data will suffice. Statistics only exists when there is a context. If you want to teach about the statistical enquiry process, then having the students genuinely involved at each stage of the process is a good idea. If you particularly want to teach about fitting a regression line, you generally want multiple examples for students to use, and it helps if at least one of them contains a linear relationship.
I read a very interesting article in “Teaching Children Mathematics” entitled “Practical Problems: Using Literature to Teach Statistics”. The authors, Hourigan and Leavy, used a children’s book to generate data on the number of times different characters appeared. But what I liked most was that they addressed the need for a “driving question”. In this case the question was provided by a preschool teacher who could only afford to buy one puppet for the book, and wanted to know which character appears the most in the story. The children practised collecting data as the story was read aloud. They collected their own data to analyse.
Let’s have a look at the different pros and cons of student-collected data, provided real data, and high-quality fictional data.
Collecting data
When we want students to experience the process of collecting real data, they need to collect real data. However, real-time data collection is time-consuming, and probably not necessary every year. Student data collection can be simulated by a program such as The Islands, which I wrote about previously. Data that students collect themselves is much more likely to have errors in it, or be “dirty” (which is a good thing). When students are only given clean datasets, such as those usually provided with textbooks, they do not learn the skills of deciding what to do with an errant data point. Fictional databases can also have dirty data generated into them. The fictional inhabitants of The Islands sometimes lie, and often refuse to give consent for data collection on them.
Motivation
I have heard that after a few years of school, graphs about cereal preference, number of siblings and type of pet get a little old. These topics, relating to the students, are motivating at first, but often there is no purpose to the investigation other than to get data for a graph. Students need to move beyond their own experience and are keen to try something new. Data provided in a database can be motivating, if carefully chosen. There are opportunities to use databases that encourage awareness of social justice, the environment and politics. Fictional data must be motivating or there is no point! We chose dragons as a topic for our first set of fictional data, as dragons are interesting to boys and girls of most ages.
A meaningful question
Here I refer again to that excellent article that talks about a driving question. There needs to be a reason for analysing the data. Maybe there is concern about the food provided at the tuck shop and the availability of healthy alternatives. Or can the question be tied into another area of the curriculum, such as which type of bean plant grows faster? Or can we increase the germination rate of seeds? The Census@school data has the potential for driving questions, but they probably need to be helped along. For existing datasets the driving question used by students might not be the same as the one (if any) driving the original collection of data. Sometimes that is because the original purpose is not ‘motivating’ for the students or not at an appropriate level. If you can’t find or make up a motivating, meaningful question, the database is not appropriate. For our fictional dragon data, we have developed two scenarios: vaccinating for Pacific Draconian flu, and building shelters to make up for the deforestation of the island. With the vaccination scenario, we need to know about behaviour and size. For the shelter scenario we need to make decisions based on size, strength, behaviour and breath type. There is potential for a number of other scenarios that will also create driving questions.
Getting enough data
It can be difficult to get enough data for effects to show up. When students are limited to their class or family, this limits the number of observations. Only some databases have enough observations in them. There is no such problem with fictional databases, as you can just generate as much data as you need! There are special issues with regard to teaching about sampling, where you would want a large database with constrained access, like the Islands data, or the use of cards.
Variables
A problem with the data students collect is that it tends to be categorical, which limits the types of analysis that can be used. In databases, it can also be difficult to find measurement level data. In our fictional dragon database, we have height, strength and age, which all take numerical values. There are also four categorical variables. The Islands database has a large number of variables, both categorical and numerical.
Interesting Effects
Though it is good for students to understand that quite often there is no interesting effect, we would like students to have the satisfaction of finding interesting effects in the data, especially at the start. Interesting effects can be particularly exciting if the data is real, and students can apply their findings to the real-world context. Student-collected data is risky in terms of finding any noticeable relationships. It can be disappointing to do a long and involved study and find no effects. Databases from known studies can provide good effects, but unfortunately the variables with no effect tend to be left out of the databases, giving a false sense that there will always be effects. When we generate our fictional data, we make sure that the relationships we would like are there, with enough interaction and noise. This is a highly skilled process, honed by decades of making up data for student assessment at university. (Guilty admission)
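To give a flavour of what planting a relationship with interaction and noise can look like, here is a minimal Python sketch. It is not the process used for our dragon data. The value ranges, the two breath types, the effect sizes and the function names are all assumptions made up for this illustration.

```python
import random
import statistics


def generate_dragons(n, seed=0):
    """Generate n fictional dragons (all numbers here are hypothetical)."""
    rng = random.Random(seed)
    dragons = []
    for _ in range(n):
        age = rng.uniform(1, 100)
        breath = rng.choice(["fire", "ice"])
        # Planted effect: height increases with age.
        # Interaction: fire breathers grow faster with age than ice breathers.
        slope = 0.05 if breath == "fire" else 0.03
        # Noise: random scatter so the relationship is not perfect.
        height = 2.0 + slope * age + rng.gauss(0, 0.5)
        # A deliberate "no effect" variable: strength ignores age and breath.
        strength = rng.gauss(50, 10)
        dragons.append(
            {"age": age, "breath": breath, "height": height, "strength": strength}
        )
    return dragons


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


dragons = generate_dragons(240, seed=1)
ages = [d["age"] for d in dragons]
heights = [d["height"] for d in dragons]
strengths = [d["strength"] for d in dragons]

print("age vs height:  ", round(pearson(ages, heights), 2))
print("age vs strength:", round(pearson(ages, strengths), 2))
```

Including a variable with no planted effect (strength, in this sketch) is one way to avoid giving students the false sense that there will always be something to find.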
Ethics
There are ethical issues to be addressed in the collection of real data from people the students know. Informed consent should be granted, and there needs to be thorough vetting. Young students (and not so young) can be damagingly direct in their questions. You may need to explain that it can be upsetting for people to be asked if they have been beaten or bullied. When using fictional data that may appear real, such as the Islands data, it is important for students to be aware that the data is not real, even though it is based on real effects. This was one of the reasons we chose to build our first database on dragons, as we hope that will remove any concerns about whether the data is real or not!
The following table summarises the post.
| | Real data collected by the students | Real existing database | Fictional data (The Islands, Kiwi Kapers, Dragons, Desserts) |
|---|---|---|---|
| Data collection | Real experience | Nil | Sometimes |
| Dirty data | Always | Seldom | Can be controlled |
| Motivating | Can be | Can be | Must be! |
| Enough data | Time consuming, difficult | Hard to find | Always |
| Meaningful question | Sometimes. Can be trivial | Can be difficult | Part of the fictional scenario |
| Variables | Tend towards nominal | Often too few variables | Generate as needed |
| Ethical issues | Often | Usually fine | Need to manage reality |
| Effects | Unpredictable | Can be obvious or trivial, or difficult | Can be managed |
Another advantage of fictional data is the ability to explore different sampling methods on the exact same population (or repeats of the same sampling method with different random-number outcomes), compare their behaviour, and compare selections to the entire population to help understand why the samples behave as they do.
This goes well with the ability to control relationships between variables; we can do things like create a street where the households on one side of the road are wealthier than those on the other, and then look at what happens when we run systematic sampling by street number with an even-numbered skip.
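A minimal Python sketch of the street example might look like the following. The street length, the income figures, the choice of which side is wealthier and the function names are all illustrative assumptions, not taken from any real dataset.

```python
import random
import statistics


def build_street(n_houses=200, seed=0):
    """Fictional street: even-numbered houses are on the wealthier side."""
    rng = random.Random(seed)
    street = []
    for number in range(1, n_houses + 1):
        base = 90_000 if number % 2 == 0 else 50_000  # hypothetical incomes
        street.append({"number": number, "income": base + rng.gauss(0, 5_000)})
    return street


def systematic_sample(population, skip, start):
    """Take every skip-th unit, beginning at 0-based position start."""
    return population[start::skip]


def mean_income(units):
    return statistics.mean(u["income"] for u in units)


street = build_street()
rng = random.Random(1)

# An even skip keeps landing on the same side of the road.
even_skip = systematic_sample(street, skip=10, start=rng.randrange(10))
# An odd skip alternates between the two sides.
odd_skip = systematic_sample(street, skip=9, start=rng.randrange(9))
# Simple random sample of the same size as the even-skip sample.
srs = rng.sample(street, len(even_skip))

print("population mean:", round(mean_income(street)))
print("even skip mean: ", round(mean_income(even_skip)))
print("odd skip mean:  ", round(mean_income(odd_skip)))
print("random mean:    ", round(mean_income(srs)))
```

Because the whole population is known, the even-skip sample can be compared directly with the street as a whole: every house it selects has the same parity, so it only ever sees one side of the road.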
I find that generating fictional data is also very helpful as a way to test whether I’m using an analysis technique correctly. For example, back in my early SAS days I was using logistic regression to analyse real data, but the results weren’t making much sense. I generated a fictional data set, ran PROC REG on that, and from looking at the results I was able to understand that I’d been misinterpreting the coefficients generated by PROC REG.
I have used simulated data, but I have also used real data and data from examples found in classical textbooks. Sometimes, real data is too messy for beginners. Simulated data might be useful to demonstrate the application of particular techniques. Examples from textbooks encourage the students to borrow the books and read the related material on their own time. Usually, I cite where the data came from, except in assignments when the students collect their own data.
I do wonder. I think people with a PhD in Mathematical Statistics are very comfortable with the idea of fake data because they spent several years of their life justifying themselves with simulation studies and have had to think very, very, very hard about how they are using fake data to challenge assumptions.
However, I worry that the more intuitive computers get, the less we should use fake data. To anyone but me, smartphones seem utterly intuitive, so intuitive you don’t need to think about how they work. For all you know, a phone fairy is sitting on your shoulder waving a wand. The trouble with that engagement with technology is that I’m not sure you’re trained to think about how fake data are constructed. You just click a few buttons, type a few commands and hey presto, the data fairy simulates some data.
That’s not to say I disagree that fake data is useful, just that I think in 2016 we need to think a lot harder about computer-human interactions than we might have done.
That’s a very interesting view on the subject. I think it always depends on how well the simulated data is made. From a student’s point of view I can say that it is sometimes very confusing when teachers make up data/samples that are not really suitable for the method but they think it is easier to understand. But I don’t want to say that this opposes fictional data because they also do it with real data, which makes it even worse for students to understand the problem (thought process: “data is real and method is real, therefore this application must show how it should be done”).
The dragon example is really cool and shows how fictional data can have “real” characteristics.
Thanks for sharing!
I agree with many of your points. I detest trivial examples such as “a statistics professor wants to compare grades on a test from two different classes, etc.” I typically go to the internet for ideas of variables to use, like Gapminder and various government agencies, and I pull real data when I can. I do often manipulate the data a bit or shrink the dataset so that it demonstrates the point I want to make, such as effects of outliers, etc. Since the issues are real, it looks and feels more real than completely fake data, and I can use scenarios that really do interest my students (community college). I also love The Islands data and your Dragon data. Interesting problems can still be evaluated while using fictional data. My students also read many of the Gallup and Pew Research articles.