Dirty data is real data as it is collected before someone gets hold of it and takes out the tricky bits. You won’t find dirty data in textbooks. Dirty data is what real researchers have to deal with. And even amateur researchers and students doing real-life projects will have to deal with dirty data. Yet not much is said about dirty data, and what to do with it.
Elements of dirty data
Mistakes – people put down the current year for their date of birth, give their weight in the wrong unit, put an extra decimal point.
Missing data – people leave gaps, possibly by mistake and possibly intentionally, or give up before the end.
Mindless response – people just tick all the middle responses to Likert scales, or answer “no” to everything.
Silly answers – people state that they expect to earn over 1 billion dollars next year, say they want to have 28 children or are 105 years old and weigh 500 pounds.
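The four kinds of dirt listed above can each be turned into a simple automated check. The following is only a minimal sketch in Python: the field names (`birth_year`, `q1` to `q5`, `children_wanted`) and the threshold of 15 children are hypothetical, invented for illustration, and any real survey would need its own rules.

```python
from datetime import date

# Hypothetical Likert item names; a real questionnaire will differ.
LIKERT_FIELDS = ["q1", "q2", "q3", "q4", "q5"]


def flag_record(record, current_year=None):
    """Return a list of human-readable flags for one survey record."""
    if current_year is None:
        current_year = date.today().year
    flags = []

    # Missing data: any field left as None or as an empty string.
    missing = [k for k, v in record.items() if v is None or v == ""]
    if missing:
        flags.append("missing: " + ", ".join(sorted(missing)))

    # Mistake: the current year entered as the year of birth.
    if record.get("birth_year") == current_year:
        flags.append("birth_year equals current year")

    # Mindless response: the same answer given to every Likert item.
    likert = [record.get(f) for f in LIKERT_FIELDS]
    if all(isinstance(v, int) for v in likert) and len(set(likert)) == 1:
        flags.append("identical Likert answers")

    # Silly answer: outside a plausible range (threshold is an assumption).
    children = record.get("children_wanted")
    if isinstance(children, (int, float)) and children > 15:
        flags.append("implausible children_wanted")

    return flags
```

A flag is only a prompt to look at the record. Someone who ticks the middle box every time may genuinely be neutral about everything, so the decision still belongs to a person.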
Detecting dirty data
First of all, assume your data is dirty, particularly if humans have been involved, and even more so if students have been involved. To find the problem areas you need to make tables, graphs and summary statistics of all the variables, and look for outliers. Look for values that are consistently missing. Look at the highest and lowest values. Scatter-charts are also good for identifying anomalous data.
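As a rough illustration of that kind of first pass, here is a sketch that summarises one numeric variable using only the Python standard library. The function name `summarise` is made up for this example, and the outlier rule (more than 1.5 times the interquartile range beyond the quartiles) is one common convention that I am assuming here, not the only reasonable choice.

```python
import statistics


def summarise(values):
    """Summarise one numeric variable; None marks a missing value."""
    present = sorted(v for v in values if v is not None)

    # Quartiles from the standard library (needs at least two values).
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "min": present[0],
        "max": present[-1],
        "median": statistics.median(present),
        # Candidates for a closer look, not automatic deletions.
        "outliers": [v for v in present if v < low or v > high],
    }


# Made-up ages from a class survey, with one gap and one 105-year-old.
ages = [19, 20, 21, 22, 20, None, 21, 105, 23, 19]
summary = summarise(ages)
```

With these made-up ages the summary reports one missing value and singles out 105 as the only outlier, which is exactly the sort of value that the highest-and-lowest check is meant to surface.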
Dealing with dirty data
Well – this is where mathematics and statistics irrevocably part company. There is no single right answer. It all depends! (Students hate that phrase.) Sometimes you should take the response out. Sometimes you should make it a missing value. Sometimes you should correct it. Sometimes you should remove a complete record or observation. You should always document and justify your decisions, and be aware of any possible implications. There is a fine line between cleaning data and massaging it into something that will give the results you are seeking. Some actions are insupportable.
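One way to keep yourself honest about documenting decisions is to record every change as it is made. The sketch below is one possible approach, not a standard method: the field `weight_kg`, the function `clean_weight` and the cut-off of 300 are all hypothetical, and the only decision it implements is "make it a missing value".

```python
def clean_weight(records, log):
    """Apply one documented decision to a hypothetical 'weight_kg' field.

    Returns cleaned copies of the records and appends one entry to `log`
    for every value that is changed.
    """
    cleaned = []
    for i, rec in enumerate(records):
        rec = dict(rec)  # work on a copy so the raw data is never altered
        w = rec.get("weight_kg")
        if w is not None and w > 300:
            # Decision (assumed cut-off): implausible, so treat as missing.
            log.append((i, "weight_kg", w, None,
                        "implausible value set to missing"))
            rec["weight_kg"] = None
        cleaned.append(rec)
    return cleaned


raw = [{"weight_kg": 70}, {"weight_kg": 500}, {"weight_kg": None}]
log = []
cleaned = clean_weight(raw, log)
```

Because the raw records are left untouched and the log lists the record, the field, the old value, the new value and the reason, anyone reading the report can see what was done and argue with it.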
Teaching with dirty data
If students do their own projects they will need to deal with dirty data. It is a wonderful opportunity to make them suffer, or rather, help them learn. Don’t give them the answers, but get them to make the judgment calls – that’s what real researchers have to do.
However, not all statistics courses include student projects. (Our first year course doesn’t, for reasons I will cover in a later post.) I do give postgraduate business students a set of data as it was collected, raw from the students. Part of their assignment is to clean it up before they start, and provide a report on what they have done and why.
For the introductory course for undergraduate students I clean up the data, so that the missing and spurious values don’t injure their fragile confidence. The point in this particular instance is to practise multiple examples of different types of testing, in order to generalise the principles of hypothesis testing. Excel, which I use with reservations (another later post), doesn’t cope well with missing values and would provide barriers too early in their learning. Whether the data is given to them clean or dirty depends on the learning objective of the exercise – and the nature of the students.
I would like to get them using the original data, but the course is not quite long enough. I’m still mulling over that one. Having written this post, I am convinced I need to do something about it. I’ll get back to you.