The desired outcome of this post is to be proved wrong.
Here is my assertion: It is really difficult to find appropriate sets of data to use for teaching and assessing statistical analysis.
This is a problem, because one of the key factors in teaching statistics effectively is to use real data. I have written about the need for real (not faked) data in my post “Stop faking it, data should be real”. I’d like to apologise here and now for my arrogant assertion that “The internet abounds with data. We can just about drown in it.” I feel like the Ancient Mariner staring at the data abounding, with no drop fit to drink, let alone drown in.
Recently a teacher contacted me to help her find a set of data for an assessment task in Year 13 statistics. The data set needs to have the following characteristics:
- It must be real
- It must be a sample (not a population)
- It must be multivariate, so that the students have a choice of variables to model
- It must have at least one variable of interval/ratio data
- It must have at least one way of dividing the sample into two groups
- It should not be a set that has previously been used for assessment in the public domain in New Zealand
- It should be of interest to the students
- It should be open to background research
- Ideally it should be randomly sampled
- It should preferably be from New Zealand (Australia is near enough), and not too old
How hard could that be? (I joke, of course – it is very hard.)
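Only a few of the criteria above can be checked mechanically: whether a file is multivariate, whether it has a numeric variable, and whether it has a variable that splits the sample into two groups. As a rough aid when trawling through candidate files, here is a minimal Python sketch of such a first-pass screen. The function names `load_rows` and `screen_dataset`, and the threshold of three variables for "multivariate", are my own assumptions for illustration. A numeric column is not necessarily interval/ratio data (it could be an ID code), so the results still need a human eye, and the remaining criteria (real, interesting, randomly sampled, local) cannot be automated at all.

```python
import csv


def is_number(value):
    """Return True if the text value can be read as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def load_rows(path):
    """Read a CSV file with a header row into a list of dicts, one per unit."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def screen_dataset(rows, min_variables=3):
    """First-pass screen of unit-level data against the mechanical criteria.

    min_variables is an assumed threshold for "multivariate"; adjust to taste.
    """
    result = {
        "multivariate": False,
        "numeric_variables": [],
        "two_group_variables": [],
    }
    if not rows:
        return result

    columns = list(rows[0].keys())
    result["multivariate"] = len(columns) >= min_variables

    for column in columns:
        # Ignore missing entries when judging a column.
        values = [row.get(column) for row in rows]
        values = [v for v in values if v not in ("", None)]
        if not values:
            continue
        if len(set(values)) == 2:
            # Exactly two distinct values: a ready-made grouping variable.
            result["two_group_variables"].append(column)
        elif all(is_number(v) for v in values):
            # Numeric, but only possibly interval/ratio: check by hand.
            result["numeric_variables"].append(column)
    return result
```

Calling `screen_dataset(load_rows("candidate.csv"))` (the filename is hypothetical) returns the lists of candidate variables. The two-group check is deliberately conservative: a numeric variable can also be split into two groups with a cut-off, so an empty list there does not rule a data set out.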
I fancy I am pretty good at ferreting things out on the internet, but though I found wonderful sites with lots of sets of data, I could not find one set to fit the criteria. And the problem is, this will need to happen every year in every school in New Zealand, often more than once.
This is not a unique problem, I suspect. When I taught at university I was challenged to come up with appropriate data sets each year for assessment exercises. Consequently we would sometimes rotate data sets in a three-year cycle, or (oh the shame) make fake data.
All over the world people are collecting data and doing analysis. Why is it so difficult to find raw data?
One issue is that of privacy – in New Zealand we have strict laws with regard to privacy and informed consent, which means that it is easier to keep the data hidden than to try to anonymise it for general consumption. Surely that is not the case in non-human research, though. Another is effort: it takes a bit of work to make data available, and academics and researchers do not have time to spare. Some data is commercially sensitive, which prevents its release to the public domain. And often what look like promising data sets are not at unit level, but are summarised into tables for the reader.
I went searching for links to data sets, and found the following. So I guess there is data out there, but it is time-consuming to find appropriate sets. And very little of it relates to NZ, sadly. And baseball, basketball and medical sets abound.
http://www.statsci.org/datasets.html looks promising, and I am grateful for the efforts. However, very few of the sets meet the criteria.
http://iase-web.org/Links.php?p=Datasets has links to other sources
http://www.amstat.org/publications/jse/jse_data_archive.htm This one has the most informative layout, in terms of finding out whether a data set is likely to be useful.
So in a way I have proved myself wrong already. There are datasets out there. But it is difficult to find one that is just right! I feel for teachers having to trawl through so many sites to find something, though. I had hoped that there would be sets of data published along with PhD theses, but even in the area of statistics education, I couldn’t find any.
I don’t have an answer to this problem. As a uni lecturer I solved it for my own class by collecting data from them, pretending that it was a random sample of first year university students, and giving it back to them to play with. Obviously not ideal, but fun!
Please share suggestions in the comments.