There is a push for teachers and students to use real data in learning statistics. In this post I am going to address the benefits and drawbacks of different sources of real data, and make a case for the use of good fictional data as part of a statistical programme.
Here is a video introducing our fictional data set of 180 or 240 dragons, so you know what I am referring to.
Real collected, real database, trivial, fictional
There are two main types of real data. There is the real data that students themselves collect and there is real data in a dataset, collected by someone else, and available in its entirety. There are also two main types of unreal data. The first is trivial and lacking in context and useful only for teaching mathematical manipulation. The second is what I call fictional data, which is usually based on real-life data, but with some extra advantages, so long as it is skilfully generated. Poorly generated fictional data, as often found in case studies, is very bad for teaching.
When deciding what data to use for teaching statistics, it matters what it is that you are trying to teach. If you are simply teaching how to add up 8 numbers and divide the result by 8, then you are not actually doing statistics, and trivial fake data will suffice. Statistics only exists when there is a context. If you want to teach about the statistical enquiry process, then having the students genuinely involved at each stage of the process is a good idea. If you are particularly wanting to teach about fitting a regression line, you generally want to have multiple examples for students to use. And it would be helpful for there to be at least one linear relationship.
I read a very interesting article in “Teaching Children Mathematics” entitled, “Practıcal Problems: Using Literature to Teach Statistics”. The authors, Hourigan and Leavy, used a children’s book to generate the data on the number of times different characters appeared. But what I liked most, was that they addressed the need for a “driving question”. In this case the question was provided by a pre-school teacher who could only afford to buy one puppet for the book, and wanted to know which character appears the most in the story. The children practised collecting data as the story is read aloud. They collected their own data to analyse.
Let’s have a look at the different pros and cons of student-collected data, provided real data, and high-quality fictional data.
When we want students to experience the process of collecting real data, they need to collect real data. However real time data collection is time consuming, and probably not necessary every year. Student data collection can be simulated by a program such as The Islands, which I wrote about previously. Data students collect themselves is much more likely to have errors in it, or be “dirty” (which is a good thing). When students are only given clean datasets, such as those usually provided with textbooks, they do not learn the skills of deciding what to do with an errant data point. Fictional databases can also have dirty data, generated into it. The fictional inhabitants of The Islands sometimes lie, and often refuse to give consent for data collection on them.
One of the species of dragons included in our database
I have heard that after a few years of school, graphs about cereal preference, number of siblings and type of pet get a little old. These topics, relating to the students, are motivating at first, but often there is no purpose to the investigation other than to get data for a graph. Students need to move beyond their own experience and are keen to try something new. Data provided in a database can be motivating, if carefully chosen. There are opportunities to use databases that encourage awareness of social justice, the environment and politics. Fictional data must be motivating or there is no point! We chose dragons as a topic for our first set of fictional data, as dragons are interesting to boys and girls of most ages.
A meaningful question
Here I refer again to that excellent article that talks about a driving question. There needs to be a reason for analysing the data. Maybe there is concern about food provided at the tuck shop, with healthy alternatives. Or can the question be tied into another area of the curriculum, such as which type of bean plant grows faster? Or can we increase the germination rate of seeds. The Census@school data has the potential for driving questions, but they probably need to be helped along. For existing datasets the driving question used by students might not be the same as the one (if any) driving the original collection of data. Sometimes that is because the original purpose is not ‘motivating’ for the students or not at an appropriate level. If you can’t find or make up a motivating meaningful question, the database is not appropriate. For our fictional dragon data, we have developed two scenarios – vaccinating for Pacific Draconian flu, and building shelters to make up for the deforestation of the island. With the vaccination scenario, we need to know about behaviour and size. For the shelter scenario we need to make decisions based on size, strength, behaviour and breath type. There is potential for a number of other scenarios that will also create driving questions.
Getting enough data
It can be difficult to get enough data for effects to show up. When students are limited to their class or family, this limits the number of observations. Only some databases have enough observations in them. There is no such problem with fictional databases, as you can just generate as much data as you need! There are special issues with regard to teaching about sampling, where you would want a large database with constrained access, like the Islands data, or the use of cards.
A problem with the data students collect is that it tends to be categorical, which limits the types of analysis that can be used. In databases, it can also be difficult to find measurement level data. In our fictional dragon database, we have height, strength and age, which all take numerical values. There are also four categorical variables. The Islands database has a large number of variables, both categorical and numerical.
Though it is good for students to understand that quite often there is no interesting effect, we would like students to have the satisfaction of finding interesting effects in the data, especially at the start. Interesting effects can be particularly exciting if the data is real, and they can apply their findings to the real world context. Student-collected-data is risky in terms of finding any noticeable relationships. It can be disappointing to do a long and involved study and find no effects. Databases from known studies can provide good effects, but unfortunately the variables with no effect tend to be left out of the databases, giving a false sense that there will always be effects. When we generate our fictional data, we make sure that there are the relationships we would like there, with enough interaction and noise. This is a highly skilled process, honed by decades of making up data for student assessment at university. (Guilty admission)
There are ethical issues to be addressed in the collection of real data from people the students know. Informed consent should be granted, and there needs to be thorough vetting. Young students (and not so young) can be damagingly direct in their questions. You may need to explain that it can be upsetting for people to be asked if they have been beaten or bullied. When using fictional data, that may appear real, such as the Islands data, it is important for students to be aware that the data is not real, even though it is based on real effects. This was one of the reasons we chose to build our first database on dragons, as we hope that will remove any concerns about whether the data is real or not!
The following table summarises the post.
||Real data collected by the students
||Real existing database
(The Islands, Kiwi Kapers, Dragons, Desserts)
||Can be controlled
||Time consuming, difficult
||Hard to find
||Sometimes. Can be trivial
||Can be difficult
||Part of the fictional scenario
||Tend towards nominal
||Often too few variables
||Generate as needed
||Need to manage reality
||Can be obvious or trivial, or difficult
||Can be managed