Patterns, vocab and practice, practice, practice
An important part of statistical analysis is being able to look at a graphical representation of data, extract meaning and comment on it, particularly in relation to the context. Graph interpretation is a difficult skill to teach: there is no clear algorithm of the kind mathematics teachers are used to teaching, and the answers are far from clear-cut.
This post is about the challenges of teaching scatterplot interpretation, with some suggestions.
When undertaking an investigation of bivariate measurement data, a scatterplot is the graph to use. On a scatterplot we can see what shape the data seems to have, what direction the relationship goes in, how close the points are to a fitted line, whether there are clear groups and whether there are unusual observations.
The problem is that when you know what to look for, spurious effects don’t get in the way, but when you don’t know what to look for, you don’t know what is spurious. This can be likened to a master chess player who can look at a game in play and see at a glance what is happening, whereas the novice sees only the individual pieces, and cannot easily tell where the action is taking place. What is needed is pattern recognition.
In addition, there is considerable room for argument in interpreting scatterplots. What one person sees as a non-linear relationship, another person might see as a line with some unusual observations. My experience is that people tend to try for more complicated models than is sensible. A few unusual observations can affect how we see the graph. Context also matters. The nature of the individual observations and of the sample can make a big difference to the meaning drawn from the graph. For example, we would not really expect a scatterplot of sodium content against energy content in food to show a strong relationship. However, if the sample of food taken is predominantly fast food, high sodium content goes with high fat content (salt on fries!), and this can appear as a relationship between sodium and energy. In the graph below, is there really a linear relationship, or is it just because of the choice of sample?
Students need to be exposed to a large number of different scatterplots. Fortunately this is now possible, thanks to computers. Students should not be drawing graphs by hand.
So how do we teach this? I think about how I learned to interpret graphs, and the answer is practice, practice, practice. This is actually quite tricky for teachers to arrange, as you need to have lots of sets of data for students to look at, and you need to make sure they are giving correct answers. Practice without feedback and correction can lead to entrenched mistakes.
Because graph interpretation is about pattern recognition, we need to have patterns that students can try to match the new graphs to. It helps to have some examples that aren’t beautifully behaved. The reality of data is that quite often the nature of measurement and rounding means that the graph looks quite different from the classic scatterplot. The following graph has a strangely ordered look to it because the x-axis variable takes only whole numbers, and the prices are nearly always close to the nearest thousand.
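The effect of rounding is easy to reproduce. The following is a minimal sketch in Python using made-up numbers, not the data behind the graph: the variables (an age in years and an asking price), the function names and every figure in it are assumptions chosen only for illustration. It generates smooth values, rounds them the way advertised values tend to be rounded, and counts how many distinct positions the points can occupy.

```python
import random

def simulate_listings(n=300, seed=2):
    """Return made-up (age, price) pairs. These are not real data."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        age = rng.uniform(1, 15)
        price = 30000 - 1500 * age + rng.gauss(0, 2000)
        pairs.append((age, max(price, 1000.0)))  # keep prices positive
    return pairs

def as_advertised(pairs):
    """Round age to whole years and price to the nearest thousand."""
    return [(round(age), round(price / 1000) * 1000) for age, price in pairs]

raw = simulate_listings()
rounded = as_advertised(raw)

# Rounding piles points on top of one another on a coarse grid,
# which is what gives the plot its ordered, striped appearance.
distinct_raw = len(set(raw))
distinct_rounded = len(set(rounded))
print(distinct_raw, "distinct positions before rounding")
print(distinct_rounded, "distinct positions after rounding")
```

Plotting the two versions side by side gives students a pair of graphs with the same underlying relationship but a very different look.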
Students also need examples of the different aspects that you would comment on in a graph, using appropriate vocabulary. Just as musicians need to label different types of scales in order to communicate their musical ideas to each other, there is a specific vocabulary for describing graphs. Unfortunately the art of describing scatterplots is not as developed as music, and at times the terms are unclear and even used in different ways by different people.
Materials produced for teacher development, available on Census @ School, suggest the following things to comment on: Trend, Association, Strength, Groups and unusual observations.
The following uses the framework provided by R. Kaniuk and R. Parsonage.
Trend covers the idea of whether the graph is linear or non-linear, in the sense of a general tendency. I don’t really like the use of the word “trend” here, as to me it should be used for time-series data only. I would use the word “shape” in preference.
Association is about the direction. Is the relationship positive or negative? For example, “as the distance a car has travelled increases, the asking price tends to decrease.” The term “tends to” is very useful here.
Strength is about how close the dots are to the fitted line. In a linear model we can use correlation to quantify the strength. My experience is that students often confuse strength with slope.
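A small simulation can help separate the two ideas. The sketch below is only an illustration under stated assumptions: the data are randomly generated, and the helper names `pearson_r`, `fitted_slope` and `simulate` are hypothetical. It builds three made-up data sets and reports the slope of the fitted line and the correlation for each.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fitted_slope(xs, ys):
    """Slope of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def simulate(slope, noise_sd, n=200, seed=1):
    """Made-up data: a straight line plus random scatter."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

# Steep line with little scatter, steep line with a lot of scatter,
# and a shallow line with little scatter.
steep_tight = simulate(slope=2.0, noise_sd=1.0)
steep_loose = simulate(slope=2.0, noise_sd=8.0)
shallow_tight = simulate(slope=0.5, noise_sd=0.25)

for name, (xs, ys) in [("steep, tight", steep_tight),
                       ("steep, loose", steep_loose),
                       ("shallow, tight", shallow_tight)]:
    print(name, "slope:", round(fitted_slope(xs, ys), 2),
          "r:", round(pearson_r(xs, ys), 2))
```

The two steep data sets share roughly the same slope but differ in strength, while the shallow data set has a much smaller slope and yet a stronger relationship than the loosely scattered steep one.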
Groups can appear in the data, and it is much more relevant if the appearance of groups is related to an attribute of the observations. For example, in some data about food values in fast food, the dessert and salad items were quite separate from the other menu items. You can see that in the graph of food items above.
Unusual observations are a challenging feature of real-world data. Is it a mistake? Is it someone being silly, or misinterpreting a question? Is it not really from this population? Is it the result of a one-off rare occurrence (such as my redundancy payment earlier this year)? And what should you do with unusual observations? I’ve written a bit more about this in my post on dirty data. There is also uneven scatter, or heteroscedasticity, which affects prediction intervals more than it affects the definition of the model.
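To see why uneven scatter matters for prediction rather than for the fitted line, here is a minimal sketch in Python. It assumes made-up data in which the scatter grows in proportion to x; the function names and all the numbers are hypothetical. It fits a least-squares line and then compares the spread of the residuals for small and large values of x.

```python
import random
import statistics

def simulate_uneven(n=400, seed=3):
    """Made-up data whose scatter grows with x. These are not real measurements."""
    rng = random.Random(seed)
    xs = [rng.uniform(1, 10) for _ in range(n)]
    ys = [5 + 2 * x + rng.gauss(0, 0.5 * x) for x in xs]
    return xs, ys

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

xs, ys = simulate_uneven()
intercept, slope = least_squares(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Split the residuals at the median x and compare their spread.
cutoff = statistics.median(xs)
low_spread = statistics.pstdev([r for x, r in zip(xs, residuals) if x <= cutoff])
high_spread = statistics.pstdev([r for x, r in zip(xs, residuals) if x > cutoff])

print("fitted slope:", round(slope, 2))
print("residual spread for small x:", round(low_spread, 2))
print("residual spread for large x:", round(high_spread, 2))
```

The fitted slope stays close to the value used to generate the data, but the residuals are far more spread out for large x, so a single interval width would be too wide at one end of the graph and too narrow at the other.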
On-line practice works
An effective way to give students practice, with timely feedback, is through on-line materials. Graphs take up a lot of room on paper, so textbooks cannot easily provide the number of examples that are needed to develop fluency. With our on-line materials we provide many examples of graphs, both standard and not so well-behaved. Students choose from statements about the graphs. Most of the questions provide two graphs, as pattern recognition is easier to develop when looking at comparisons. For example, if you give one graph and ask “How strong is this relationship?”, it can be difficult to quantify. This is made easier when you ask which of two graphs has a stronger relationship.
Students get immediate feedback in a “low-jeopardy” situation. When a tutor is working one-on-one with a student, it can be worrying to the student if they get wrong answers. The computer is infinitely patient and the student can keep trying over and over until they get their answers correct, thus reinforcing correct understanding.
This system and set of questions is part of our on-line resources for teaching Bivariate investigations, which occurs within the NZ Stats 3 course. You can find out more about our resources at www.statslc.com, and any teachers who wish to explore the materials for free should email me at n.petty(at)statslc.com.