Graphs – beauty and truth

Graphs – beauty and truth (with apologies to Keats)

A good graph is elegant

I really like graphs. I like the way graphs turn numbers into pictures. A good graph is elegant. It uses a few well-placed lines to communicate what would take a paragraph of text. And like a good piece of literature or art, a good graph continues to give, beyond the first reading. I love looking at my YouTube and WordPress graphs. These graphs tell me stories. The WordPress analytics tell me that when I put up a new post, I get more hits, but that everyday more than 1000 people read one of my posts. The YouTube analytics tell me stories about when people want to know about different aspects of statistics. It is currently the end of the North American school year, and the demand is for my video on Choosing which statistical test to use. Earlier in the year, the video about levels of measurement is the most popular. And not many people view videos about statistics on the 25th of December. I’m happy to report that the YouTube and WordPress graphs are good graphs.

Spreadsheets have made it possible for anyone and everyone to create graphs. I like that graphs are easier to make. Drawing graphs by hand is a laborious task and fraught with error. But sometimes my heart aches when I see a graph used badly. I suspect that this is when a graphic artist has taken control, and the search for beauty has over-ridden the need for truth.

Three graphs spurred me to write this post.

Graph One: Bad-tasting Donut on house occupation

The first was on a website to find out about property values. I must have clicked onto something to find out about the property values in my area, and was taken to the qv website. And this is the graph that disturbed me.

Graphs named after food are seldom a good idea

Sure it is pretty – uses pretty colours and shading, and you can find out what it is saying by looking at the key – with the numbers beside it. But a pie or donut chart should not be used for data which has inherent order. The result here is that the segments are not in order. Or rather they are ordered from most frequent to least frequent, which is not intuitive. Ordinal data is best represented in a bar or column chart. To be honest, most data is best represented in a bar or column chart. My significant other suggested that bar charts aren’t as attractive as pie charts. Circles are prettier than rectangles. Circles are curvy and seem friendlier than straight lines and rectangles. So prettiness has triumphed over truth.

Graph Two: Misleading pictogram (a tautology?)

It may be a little strong to call bad communication lack of truth. Let’s look at another example. In a way it is cheating to cite a pictogram in a post like this. Pictograms are the lowest form of graph and are so often incorrect, that finding a bad one is easier than finding a good one. In the graph below of fatalities it is difficult to work out what one little person represents.

What does one little person represent?

A quick glance, ignoring the numbers, suggests that the road toll in 2014 is just over half what it was in 2012. However, the truth, calculated from the numbers, is that the relative size is 80%. 2012 has 12 people icons, representing 280 fatalities. One icon is removed for 2013, representing a drop of 9 fatalities. 2011 has one icon fewer again, representing a drop of 2 fatalities. There is so much wrong in the reporting of road fatalities, that I will stop here. Perhaps another day…

Graph Three: Mysterious display on Household income

And here is the other graph that perplexed me for some time. It came in the Saturday morning magazine from our newspaper, as part of an article about inequality in New Zealand. Anyone who reads my blog will be aware that my politics place me well left of centre, and I find inequality one of the great ills of the modern day. So I was keen to see what this graph would tell me. And the answer is…

See how long it takes for you to find where you appear on the graph. (Pretending you live in NZ)

I have no idea. Now, I have expertise in the promulgation of statistics, and this graph stumped me for some time. Take a good look now, before I carry on.

I did work out in the end, what was going on in the graph, but it took far longer than it should. This article is aimed at an educated but not particularly statistically literate audience, and I suspect there will be very few readers who spent long enough working out what was going on here. This graph is probably numerically correct. I had a quick flick back to the source of the data (who, by the way, are not to be blamed for the graph, as the data was presented in a table) and the graph seems to be an accurate depiction of the data. However, the graph is so confusing as to be worse than useless. Please post critiques in the comments. This graph commits several crimes. It is difficult to understand. It poses a question and then fails to help the reader find the answer. And it does not provide insights that an educated reader could not get from a table. In fact, I believe it has obscured the data.

Graphs are the main way that statistical analysts communicate with the outside world. Graphs like these ones do us no favours, even if they are not our fault. We need to do better, and make sure that all students learn about graphs.

Teaching suggestion – a graph a day

Here is a suggestion for teachers at all levels. Have a “graph a day” display – maybe for a month? Students can contribute graphs from the news media. Each day discuss what the graph is saying, and critique the way the graph is communicating. I have a helpful structure for reading graphs in my post: There’s more to reading graphs than meets the eye; 

Here is a summary of what I’ve said and what else I could say on the topic.

Thoughts about Statistical Graphs

  • The choice of graph depends on the purpose
  • The text should state the purpose of the graph
  • There is not a graph for everything you wish to communicate
  • Sometimes a table communicates better than a graph
  • Graphs are part of the analysis as well as part of the reporting. But some graphs are better to stay hidden.
  • If it takes more than a few seconds to work out what a graph is communicating it should either be dumped or have an explanation in the text
  • Truth (or communication) is more important than beauty
  • There is beauty in simplicity
  • Be aware than many people are colour-blind, or cannot easily differentiate between different shades.

Feedback from previous post on which graph to use

Late last year I posted four graphs of the same data and asked for people’s opinions. You can link back to the post here and see the responses: Which Graph to Use.

The interesting thing is not which graph was selected as the most popular, but rather that each graph had a considerable number of votes. My response is that it depends.  It depends on the question you are answering or the message you are sending. But yes – I agree with the crowd that Graph A is the one that best communicates the various pieces of information. I think it would be improved by ordering the categories differently. It is not very pretty, but it communicates.

I recently posted a new video on YouTube about graphs. It is a quick once-over of important types of graphs, and can help to clarify what they are about. There are examples of good graphs in there.


I have written about graphs previously and you can find them here on the Collected Works page.

I’m interested in your thoughts. And I’d love to see some beautiful and truthful graphs in the comments.

Advertisements

There’s more to reading graphs than meets the eye

There’s more to reading graphs than meets the eye

For those of us who know how to read a graph, it can be difficult to imagine what another person could find difficult. But then when I am presented with an unusual style of graph, or one where the data has been presented badly, I suddenly feel empathy for those who are less graph-literate.

Graphs are more common now as we have Excel to make them for us – for better or worse. An important skill for the citizens of tomorrow and today is to be able to read a graph or table and to be critical of how well it accomplishes its goals.

Here are some stages of reading a graph, much of which also applies to reading a table.

Reading about the graph

When one is familiar with graphs, and the graph is well made, we can become oblivious to the conventions. Just as readers know that English is written from left to right, graph readers understand that the height of a bar chart corresponds to the quantity of something. When people familiar with graphs look at a graph, they take in information unconsciously. This would include what type of graph it is – bar chart, line graph, scatterplot…and what it is about – the title, axis labels and legend tell us this. And they are also able to ignore unimportant aspects. For example if someone has made a 3-D bar chart, experienced graph-readers know that the thickness of the bar does not express information. Colours are generally used to distinguish different elements, but the choice of which colour is used is seldom part of the message. Other aspects about graphs, which may or may not be apparent, include the purpose of the graph and the source of the data.

Beginner graph readers need to learn how to use the various conventions to read ABOUT the data or graph. Any exploration of a graph needs to start with the question, “What is this graph about?”

Identifying one piece of data

When children start making and reading graphs, it is good for them to start with data about themselves, often represented in a picture graph, where each individual observation is shown.  A picture graph is concrete. Each child may point out their particular piece of data – the one that says that they like Wheaties, or prefer mushrooms on their pizza. This is an early stage in  the process of abstraction, that leads eventually to understanding less intuitive graphs such as the box and whisker or a time series chart. It is also important for all graph readers to be aware what each piece of data, or observation, represents and how it is represented.

Identifying one piece of data may help avoid the confusion of graphs which show raw data rather than summary data. For an example, a class may have data about the number of people in households. If this data is entered raw into a spreadsheet, and a graph created, we can end up with something like the graph immediately below (Graph 1).

This is not a good graph, but is what a naive user may well get out of Excel

Graph 1: This is not a good graph, but is what a naive user may well get out of Excel

In this we can identify that each member is represented by a bar, and the height gives the number of people in their family. I usually call this a value graph, as it shows only the individual values, with no aggregation.

A more useful representation of this same data is a summary bar chart, as shown below. (Graph 2) There are two dimensions operating. Horizontally we have the number of people in a household, and vertically we have the number of class members that have the corresponding number of people in their household. Note that it is less intuitive seeing where each class member is. Dividing the bar up into individual blocks can help with that.

Household size

Graph 2: A summary of the size of household for a group of people

Reading off the graph

In order to make sense of a graph, we often need to look at two dimensions simultaneously. If we wish to know how many people in the class come from a household of 5, we need to select along the horizontal axis, the value 5. Then we follow the bar up to the top and take our eye back to the vertical axis to see how high this value is. A ruler can help with this process.  When we read off a graph, our statements tend to be summaries of a single attribute, such as “There are 2 people who come from households of 6.”  “There are 17 dragons that breathe fire.”

Reading within the graph (comparisons, relationships)

Reading within the graph is a more complex task, even with simple graphs. When we read within a graph we are interested in comparisons and relationships. For example we may wish to see which breath type is most common among our herd of dragons. In order to answer this using the graph below, we first need to find the highest bar, by drawing our eye along the top, or drawing a ruler down the page. Then we look down that bar, and read of the name of the breath type. There are many more complex relationships, such as whether green dragons tend to be taller or shorter than red dragons, and which are more likely to be friendly. By introducing another attribute, we are in fact adding a dimension to our analysis.

This is a column chart (or bar chart) summarising the breath types.

This is a column chart (or bar chart) summarising the breath types.

Reading beyond the graph, beyond the data

This idea of reading beyond the data has been suggested as a step towards informal and then formal inference. We can perceive that our data does not represent all existing instances, and can make predictions or suppositions about what might happen in the other instances. For example, for our sample of dragons, we have seen that the green dragons tend to be more likely to be friendly than the red dragons. We could surmise that this holds over the other dragons as well. We can introduce this idea by asking the students, “I wish to have a new dragon join the herd and would prefer it to be friendly. Would I be better to get a green dragon or a red dragon?”

Judging the graph

The advantage of programs like Excel is that many people can make graphs without too much trouble. This is also a problem, as often the graph Excel produces is not really suitable for the task, and can have all sorts of visual clutter which obscures the information displayed. Learners need to think about the graph, either their own, or one they are reading and ask whether it is successful in communicating correctly the information that needs to be communicated. Does the graph serve the purpose it was created for?

I suggest that the steps listed here are a worthwhile structure to use in reading graphs, particularly for beginners. This then leads into another process, summarised as OSEM. You can read about this here in this post, A helpful structure for analysing graphs.

Summarising with Box and Whisker plots

In the Northern Hemisphere, it is the start of the school year, and thousands of eager students are beginning their study of statistics. I know this because this is the time of year when lots of people watch my video, Types of Data. On 23rd August the hits on the video bounced up out of their holiday slumber, just as they do every year. They gradually dwindle away until the end of January when they have a second jump in popularity, I suspect at the start of the second semester.

One of the first topics in many statistics courses is summary statistics. The greatest hits of summary statistics tend to be the mean and the standard deviation. I’ve written previously about what a difficult concept a mean is, and then another post about why the median is often preferable to the mean. In that one I promised a video. Over two years ago – oops. But we have now put these ideas into a video on summary statistics. Enjoy! In 5 minutes you can get a conceptual explanation on summary measures of position. (Also known as location or central tendency)

 

I was going to follow up with a video on spread and started to think about range, Interquartile range, mean absolute deviation, variance and standard deviation. So I decided instead to make a video on the wonderful boxplot, again comparing the shoe- owning habits of male and female students in a university in New Zealand.

Boxplots are great. When you combine them with dotplots as done in iNZIght and various other packages, they provide a wonderful way to get an overview of the distribution of a sample. More importantly, they provide a wonderful way to compare two samples or two groups within a sample. A distribution on its own has little meaning.

John Tukey was the first to make a box and whisker plot out of the 5-number summary way back in 1969. This was not long before I went to High School, so I never really heard about them until many years later. Drawing them by hand is less tedious than drawing a dotplot by hand, but still time consuming. We are SO lucky to have computers to make it possible to create graphs at the click of a mouse.

Sample distributions and summaries are not enormously interesting on their own, so I would suggest introducing boxplots as a way to compare two samples. Their worth then is apparent.

A colleague recently pointed out an interesting confusion and distinction. The interquartile range is the distance between the upper quartile and the lower quartile. The box in the box plot contains the middle 50% of the values in the sample. It is tempting for people to point this out and miss the point that the interquartile range is a good resistant measure of spread for the WHOLE sample. (Resistant means that it is not unduly affected by extreme values.) The range is a poor summary statistic as it is so easily affected by extreme values.

And now we come to our latest video, about the boxplot. This one is four and a half minutes long, and also uses the shoe sample as an example. I hope you and your students find it helpful. We have produced over 40 statistics videos, some of which are available for free on YouTube. If you are interested in using our videos in your teaching, do let us know and we will arrange access to the remainder of them.

A helpful structure for analysing graphs

Mathematicians teaching English

“I became a maths teacher so I wouldn’t have to mark essays”
“I’m having trouble getting the students to write down their own ideas”
“When I give them templates I feel as if it’s spoon-feeding them”

These are comments I hear as I visit mathematics teachers who are teaching the new statistics curriculum in New Zealand. They have a point. It is difficult for a mathematics teacher to teach in a different style. But – it can also be rewarding and interesting, and you never get asked, “Where is this useful?”

The statistical enquiry cycle provides a structure for all statistical investigations and learning.

We start with a problem or question, and undergo an investigation, either using extant data, an experiment or observational study to answer the question. Writing skills are key in several stages of the cycle. We need to be able to write an investigative question (or hypotheses). We need to write down a plan, and sometimes an entire questionnaire. We need to write down what we find in the analysis and we need to write a conclusion to answer the original question. That’s a whole heap of writing!

And for teachers who may not be all that happy about writing themselves, and students who chose mathematical subjects to avoid writing, it can be a bridge too far.
In previous posts on teaching report writing I promote the use of templates, and give some teaching suggestions.

In this post I am concentrating on analysing graphs, using a handy acronym, OSEM. OSEM was developed by Jeremy Brocklehurst from Lincoln High School near Christchurch NZ. There are other acronyms that would work just as well, but we like this one, not the least for its link with kiwi culture. We think it is awesome (OSEM). You could Google “o for awesome”, to get the background. OSEM stands for Obvious, Specific, Evidence and Meaning. It is a process to follow, rather than a checklist.

I like the use of O for obvious. I think students can be scared to say what they think might be too obvious, and look for tricky things. By including “obvious” in the process, it allows them to write about the important, and usually obvious features of a graph. I also like the emphasis on meaning, Unless the analysis of the data links back to the context and purpose of the investigation, it is merely a mathematical exercise.

Is this spoon-feeding? Far from it. We are giving students a structure that will help them to analyse any graph, including timeseries, scatter plots, and histograms, as well as boxplots and dotplots. It emphasises the use of quantitative information, linked with context. There is nothing revolutionary about it, but I think many statistics teachers may find it helpful as a way to breakdown and demystify the commenting process.

Class use of OSEM

In a class setting, OSEM is a helpful framework for students to work in groups. Students individually (perhaps on personal whiteboards) write down something obvious about the graph. Then they share answers in pairs, and decide which one to carry on with. In the pair they specify and give evidence for their “obvious” statement. Then the pairs form groups of four, and they come up with statements of meaning, that are then shared with the class as a whole.

Spoon feeding has its place

On a side-note – spoon-feeding is a really good way to make sure children get necessary nutrition until they learn to feed themselves. It is preferable to letting them starve before they get the chance to develop sufficient skills and co-ordination to get the food to their mouths independently.

Difficult concepts in statistics

Recently someone asked: “I don’t suppose you’d like to blog a little on the pedagogical knowledge relevant to statistics teaching, would you? A ‘top five statistics student misconceptions (and what to do about them)’ would be kind of a nice thing to see …”

I wish it were that easy. Here goes:

Things that I have found students find difficult to understand and what I have done about them.

Observations

When I taught second year regression we would get students to collect data and fit their own multiple regressions. The interesting thing was that quite often students would collect unrelated data. The columns of the data would not be of the same observations. These students had made it all the way through first year statistics without really understanding about multivariate data.

So from them on when I taught about regression I would specifically begin by talking about observations (or data points) and explain how they were connected. It doesn’t hurt to be explicit. In the NZ curriculum materials for high school students are exercises using data cards which correspond to individuals from a database. This helps students to see that each card, which corresponds to a line of data, is one person or thing. In my video about Levels of measurement, I take the time to show this.

First suggestion is “Don’t assume”.  This applies to so much!

And this is also why it is vital that instructors do at least some of their own marking (grading). High school teachers are going, “Of course”. College professors – you know you ought to! The only way you find out what the students don’t understand, or misunderstand, or replicate by rote from your own notes, is by reading what they write. This is tedious, painful and sometimes funny in a head-banging sort of way, but necessary. I also check the prevalence of answers to multiple choice questions in my on-line materials. If there is a distracter scoring highly it is worthwhile thinking about either the question or the teaching that is leading to incorrect responses.

Inference

Well duh! Inference is a really, really difficult concept and is the key to inferential statistics. The basic idea, that we use information from a sample to draw conclusions about the population seems straight-forward. But it isn’t. Students need lots and lots of practice at identifying what is the population and what is the sample in any given situation. This needs to be done with different types of observations, such as people, commercial entities, plants or animals, geographical areas, manufactured products, instances of a physical experiment (Barbie bungee jumping), and times.

Second suggestion is “Practice”. And given the choice between one big practical project and a whole lot of small applied exercises, I would go with the exercises. A big real-life project is great for getting an idea of the big picture, and helping students to learn about the process of statistical analysis. But the problem with one big project is that it is difficult to separate the specific from the general. Context is at the core of any analysis in statistics, and makes every analysis different. Learning occurs through experiencing many different contexts and from them extracting what is general to all analysis, what is common to many analyses and what is specific to that example. The more different examples a student is exposed to, the better opportunity they have for constructing that learning. An earlier post extols the virtues of practice, even drill!

Connections

One of the most difficult things is for students to make connections between parts of the curriculum. A traditional statistics course can seem like a toolbox of unrelated but confusingly different techniques. It takes a high level of understanding to link the probability, data and evidence aspects together in a meaningful way. It is good to have exercises that hep students to make these connections. I wrote about this with regard to Operations Research and Statistics. But students need also to be making connections before they get to the end of the course.

The third suggestion is “get students to write”

Get students to write down what is the same and what is different between chi-sq analysis and correlation. Get them to write down how a poisson distribution is similar to and different from a binomial distribution. Get them to write down how bar charts and histograms are similar and different. The reason students must write is that it is in the writing that they become aware of what they know or don’t know. We even teach ourselves things as we write.

Graphs and data

Another type of connection that students have trouble with is that between the data and the graph, and in particular identifying variation and distribution in a histogram or similar. There are many different graphs, that can look quite similar, and students have problems identifying what is going on. The “value graph” which is produced so easily in Excel does nothing to help with these problems. I wrote a full post on the problems of interpreting graphs.

The fourth suggestion is “think hard”. (or borrow)

Teaching statistics is not for wusses. We need to think really hard about what students are finding difficult, and come up with solutions. We need to experiment with different ways of explaining and teaching. One thing that has helped my teaching is the production of my videos. I wish to use both visual and text (verbal) inputs as best as possible to make use of the medium. I have to think of ways of representing concepts visually, that will help both understanding and memory. This is NOT easy, but is extremely rewarding. And if you are not good at thinking up new ideas, borrow other people’s ideas. A good idea collector can be as good as or better than a good creator of ideas.

To think of a fifth suggestion I turned to my favourite book , “The Challenge of Developing Statistical Literacy, Reasoning and Thinking”, edited by Dani Ben-Zvi and Joan Garfield. I feel somewhat inadequate in the suggestions given above. The book abounds with studies that have shown areas of challenge or students and teachers. It is exciting that so many people are taking seriously the development of pedagogical content knowledge regarding the discipline of statistics. Some statisticians would prefer that the general population leave statistics to the experts, but they seem to be in the minority. And of course it depends on what you define “doing statistics” to mean.

But the ship of statistical protectionism has sailed, and it is up to statisticians and statistical educators to do our best to teach statistics in such a way that each student can understand and apply their knowledge confidently, correctly and appropriately.

Interpreting Scatterplots

Patterns, vocab and practice, practice, practice

An important part of statistical analysis is being able to look at graphical representation of data, extract  meaning and make comments about it, particularly related to the context. Graph interpretation is a difficult skill to teach as there is no clear algorithm, such as mathematics teachers are used to teaching, and the answers are far from clear-cut.

This post is about the challenges of teaching scatterplot interpretation, with some suggestions.

When undertaking an investigation of bivariate measurement data, a scatterplot is the graph to use. On a scatterplot we can see what shape the data seems to have, what direction the relationship goes in, how close the points are to the line, if there are clear groups and if there are unusual observations.

The problem is that when you know what to look for, spurious effects don’t get in the way, but when you don’t know what to look for, you don’t know what is spurious. This can be likened to a master chess player who can look at a game in play and see at a glance what is happening, whereas the novice sees only the individual pieces, and cannot easily tell where the action is taking place. What is needed is pattern recognition.

In addition, there is considerable room  for argument in interpreting scatterplots. What one person sees as a non-linear relationship, another person might see as a line with some unusual observations. My experience is that people tend to try for more complicated models than is sensible. A few unusual observations can affect how we see the graph. There is also a contextual content to the discussion. The nature of the individual observations, and the sample can make a big difference to the meaning drawn from the graph. For example, a scatterplot of the sodium content vs the energy content in food should not really have a strong relationship. However, if the sample of food taken is predominantly fast food, high sodium content is related to high fat content (salt on fries!) and this can appear to be a relationship. In the graph below, is there really a linear relationship, or is it just because of the choice of sample?

In a set of data about fast food, there appears to be a relationship between sodium content and energy.

In a set of data about fast food, there appears to be a relationship between sodium content and energy.

Students need to be exposed to a large number of different scatterplots, Fortunately this is now possible, thanks to computers. Students should not be drawing graphs by hand.

So how do we teach this? I think about how I learned to interpret graphs, and the answer is practice, practice, practice. This is actually quite tricky for teachers to arrange, as you need to have lots of sets of data for students to look at, and you need to make sure they are giving correct answers. Practice without feedback and correction can lead to entrenched mistakes.

Because graph interpretation is about pattern recognition, we need to have patterns that students can try to match the new graphs to. It helps to have some examples that aren’t beautifully behaved. The reality of data is that quite often the nature of measurement and rounding means that the graph appears quite different from the classic scatter-plot. The following graph has a strangely ordered look to it because the x-axis variable takes only whole numbers, and the prices are nearly always close to the nearest thousand.

The asking price of used Toyota sedans against the year of manufacture.

The asking price of used Toyota sedans against the year of manufacture.

Students also need examples of the different aspects that you would comment on in a graph, using appropriate vocabulary. Just as musicians need to label different types of scales in order to communicate with each other their musical ideas, there is a specific vocabulary for describing graphs. Unfortunately the art of describing scatterplots is not as developed as music, and at times the terms are unclear and even used in different ways by different people.

Materials produced for teacher development , available on Census @ School suggest the following things to comment on: Trend, Association, Strength, Groups and unusual observations.

The following uses the framework provided by R. Kaniuk, R. Parsonage

Trend covers the idea of whether the graph is linear or non-linear. I don’t really like the use of the word “trend” here, as to me it should be used for time-series data only. I would use the word “shape” in preference. It means a general tendency.

Association is about the direction. Is the relationship positive or negative? For example, “as the distance a car has travelled increases, the asking price tends to decrease.” The term “tends to” is very useful here.

Strength is about how close the dots are to the fitted line. In a linear model we can use correlation to quantify the strength. My experience is that students often confuse strength with slope.

Groups can appear in the data, and it is much more relevant if the appearance of groups is related to an attribute of the observations. For example in some data about food values in fast food, the dessert and salad items were quite separate from the other menu items. You can see that in the graph above of food items.

Unusual observations are a challenging feature of real-world data. Is it a mistake? Is it someone being silly, or misinterpreting a question? Is it not really from this population? Is it the result of a one-off rare occurrence (such as my redundancy payment earlier this year)? And what should you do with unusual observations? I’ve written a bit more about this in my post on dirty data. And there is uneven scatter, or heteroscedastiticity, which does not affect model definition, so much as prediction intervals.

On line practice works

An effective way to give students practice,  with timely feedback, is through on-line materials. Graphs take up a lot of room on paper, so textbooks cannot easily provide the number of examples that are needed to develop fluency. With our on-line materials we provide many examples of graphs, both standard, and not so well-behaved. Students choose from statements about the graphs. Most of the questions provide two graphs, as pattern recognition is easier to develop when looking at comparisons. For example if you give one graph and say “How strong is this relationship?”, it can be difficult to quantify. This is made easier when you ask which of two graphs has a  stronger relationship.

Students get immediate feedback in a “low-jeopardy” situation. When a tutor is working one-on-one with a student, it can be worrying to the student if they get wrong answers. The computer is infinitely patient and the student can keep trying over and over until they get their answers correct, thus reinforcing correct understanding.

This system and set of questions is part of our on-line resources for teaching Bivariate investigations, which occurs within the NZ Stats 3 course. You can find out more about our resources at www.statslc.com, and any teachers who wish to explore the materials for free should email me at n.petty(at)statslc.com.

Statistical Story-telling with time series data

Statistics is about story-telling.

For people who understand them, graphs tell a story. To the initiated, even a p-value, and some summary statistics can help to tell a story. Part of the role of a statistician is to extract the story from the data. The role of a statistics teacher is to enable students first to recognise that there is a story, then to enable them to tell the story through the tools of analysis and communication.

This idea of statistics as story-telling is explained in an award-winning paper byPfannkuch, Regan, Wild and Horton,Telling Data Stories: Essential Dialogues for Comparative Reasoning, which won  the inaugural Journal of Statistics Education Best Paper Award.

Time series data, especially seasonal time series data, yields its story abundantly. For this reason I changed my mind about the teaching of time series analysis at high school. I used to think that it was far too complex for high school students and should be left to higher education. In a way that is true, but if you stick to the basic concepts, it is a contextually rich area of study.

Time series data is full of little hazards, not the least being auto-correlation. We can use moving averages to take out the bumps and exponential smoothing to be more responsive to more recent data. We can deseasonalise and fit a trend line, predict and then put the seasonality back in. There are weighty (in more ways than one) volumes dedicated to time series analysis and the various discoveries and inventions that have helped us draw meaning from the past and forecast the future.

Because of the inherent complexity of time series analysis, I used to think that time series was not an appropriate part of the high school curriculum.

However, if a storytelling approach is used, backed up by appropriate software, then time series is a wonderful introduction to statistics. It is a good example of modelling, it has clear purpose, and the contexts can be fascinating.

Time series analysis is a clear example of the concept of a model, as there are so many different ways that it is possible to model a set of time series data. In contrast, when you teach linear regression with only one possible predictor variable, on data that is nicely behaved, there is generally one sensible model to use. This gives students the idea that you are trying to find “the right model”. This is not the case with time series, as models change, depending on how we choose to define the model.

Another selling-point for time series analysis is that its main function is forecasting. We all want to have crystal balls that can predict the future. The main reason we study a time series is to understand the patterns of data so that we can project into the future, usually for economic reasons. There is no question of “Why are we doing this, Miss?”, as the purpose of the analysis is self-evident.

There are numerous economic time series available from official statistics sites. In New Zealand I went to Infoshare and in the US there is Economagic.  Some of the series are fascinating. (I like the three peaks per year in jewellery sales in the US – December, February and May.)

Analysis can be difficult, and Excel is hideous for time series graphing and deseasonalising. There has been a free front end for R set up, called iNZight, which enables straight-forward time series analysis. One drawback is that it only allows for one model, which I fear perpetuates the “there is one model” mindset.

But the opportunities for storytelling are there. You can talk about trend, seasonality, variation, the relative contribution of each. As teachers and students are exposed to more and more time series graphs, they are better able to tell stories. The graphs of the seasonal shape are rich with story-telling potential.

To support this we have made four videos about time series analysis, and an app, which is still in the pipeline. We hope that these will help develop the confidence of teachers and students to tell stories about time series data. We also have further quizzes and step-by-step guide to writing up a time series analysis.

For teachers where there is limited access to computer resources, I have an earlier post with some ideas of how to overcome this problem and emphasise the story in time series data: Teaching Time Series with Limited Computer access.

Excel, SPSS, Minitab or R?

I often hear this question: Should I use Excel to teach my class? Or should I use R? Which package is the best?

It depends on the class

The short answer is: It depends on your class. You have to ask yourself, what are the attitudes, skills and knowledge that you wish the students to gain in the course. What is it that you want them to feel and do and understand?

If the students are never likely to do any more statistics, what matters most is that they understand the elementary ideas, feel happy about what they have done, and recognise the power of statistical analysis, so they can later employ a statistician.

If the students are strong in programming, such as engineering or computer science students, then they are less likely to find the programming a barrier, and will want to explore the versatility of the package.

If they are research students and need to take the course as part of a research methods paper, then they should be taught on the package they are most likely to use in their research.

Over the years I have taught statistics using Excel, Minitab and SPSS. These days I am preparing materials for courses using iNZight, which is a specifically designed user interface with an R engine. I have dabbled in R, but never had students who are suitable to be taught using R.

Here are my pros and cons for each of these, and when are they most suitable.

Excel

I have already written somewhat about the good and bad aspects of Excel, and the evils of Excel histograms. There are many problems with statistical analysis with Excel. I am told there are parts of the analysis toolpak which are wrong, though I’ve never found them myself. There is no straight-forward way to do a hypothesis test for a mean. The data-handling capabilities of the spreadsheet are fantastic, but the toolpak cannot even deal well with missing values. The output is idiosyncratic, and not at all intuitive. There are programming quirks which should have been eliminated many years ago. For example when you click on a radio button to say where you wish the output to go, the entry box for the data is activated, rather than the one for the output. It requires elementary Visual Basic to correct this, but has never happened. Each time Excel upgrades I look for this small fix, and have repeatedly been disappointed.

So, given these shortcomings, why would you use Excel? Because it is there, because you are helping students gain other skills in spreadsheeting at the same time, because it is less daunting to use a familiar interface. These reasons may not apply to all students. Excel is the best package for first year business students for so many reasons.

PivotTables in Excel are nasty to get your head around, but once you do, they are fantastic. I resisted teaching PivotTables for some years, but I was wrong. They may well be one of the most useful things I have ever taught at university. I made my students create comparative bar charts on Excel, using Pivot-Tables. One day Helen and I will make a video about PivotTables.

Minitab

Minitab is a lovely little package, and has very nice output. Its roots as a teaching package are obvious from the user-friendly presentation of results. It has been some years since I taught with Minitab. The main reason for this is that the students are unlikely ever to have access to Minitab again, and there is a lot of extra learning required in order to make it run.

SPSS

Most of my teaching at second year undergraduate and MBA and Masters of Education level has been with SPSS. Much of the analysis for my PhD research was done on SPSS. It’s a useful package, with its own peculiarities. I really like the data-handling in terms of excluding data, transforming variables and dealing with missing values. It has a much larger suite of analysis tools, including factor analysis, discriminant analysis, clustering and multi-dimensional scaling, which I taught to second year business students and research students.  SPSS shows its origins as a suite of barely related packages, in the way it does things differently between different areas. But it’s pretty good really.

R

R is what you expect from a command-line open-source program. It is extremely versatile, and pretty daunting for an arts or business major. I can see that R is brilliant for second-level and up in statistics, preferably for students who have already mastered similar packages/languages like MatLab or Maple. It is probably also a good introduction to high-level programming for Operations Research students.

iNZight

This brings us to iNZight, which is a suite of routines using R, set in a semi-friendly user interface. It was specifically written to support the innovative New Zealand school curriculum in statistics, and has a strong emphasis on visual representation of data and results. It includes alternatives that use bootstrapping as well as traditional hypothesis testing. The time series package allows only one kind of seasonal model. I like iNZight. If I were teaching at university still, I would think very hard about using it. I certainly would use it for Time Series analysis at first year level. For high school teachers in New Zealand, there is nothing to beat it.

It has some issues. The interface is clunky and takes a long time to unzip if you have a dodgy computer (as I do). The graphics are unattractive. Sorry guys, I HATE the eyeball, and the colours don’t do it for me either. I think they need to employ a professional designer. SOON! The data has to be just right before the interface will accept it. It is a little bit buggy in a non-disastrous sort of way. It can have dimensionality/rounding issues. (I got a zero slope coefficient for a linear regression with an r of 0.07 the other day.)

But – iNZight does exactly what you want it to do, with lots of great graphics and routines to help with understanding. It is FREE. It isn’t crowded with all the extras that you don’t really need. It covers all of the New Zealand statistics curriculum, so the students need only to learn one interface.

There are other packages such as Genstat, Fathom and TinkerPlots, aimed at different purposes. My university did not have any of these, so I didn’t learn them. They may well be fantastic, but I haven’t the time to do a critique just now. Feel free to add one as a comment below!

Beware of Excel Histograms

Excel histograms are a disgrace. Microsoft should be embarrassed to have them associated with their ubiquitous and generally wonderful spreadsheet, Excel. I have previously posted on how useful and versatile Excel is for enabling people to bypass the number crunching, and get to the ideas behind statistics. This is mostly true. But the histogram add-in should come with a health and safety warning.

To start with, the default look for the histogram is outrageously poor. I have some data on times a person takes to solve a Rogo puzzle. (Collected as part of our research on what factors affect solution time.) I put the data in the spreadsheet, and use the data analysis toolpak to create a histogram using the default settings. Voila!

Histogram produced using default settings in Excel

I’ll stretch it out a bit so you can see it in all its glory:

Histogram using Excel defaults, stretched out.

And here is what can be produced from this with a fair degree of manipulation:

Excel Histogram whipped into shape

This is not just a question of cosmetics. The way the horizontal axis is labelled makes it very difficult to read correctly an Excel-produced histogram unless the adaptation shown above is used. And sometimes, an Excel-produced histogram is just plain incorrect.

Bad histogram in NCEA sample exam

What prompted this tirade is a question from the sample external examination question in NCEA level 2 “Apply probability methods in solving problems”. This is an exam that over 15,000 students are likely to take. Fortunately this is only an examplar and not the real thing. It includes a badly labelled histogram, which I am almost certain was made in Excel.

Histogram taken from an exam exemplar

The introduction says: “Ali has a farm in Southland. She records the weights of 32 lambs born on her farm. The results are shown on the histogram above.”

The first question asks: “What proportion of the lambs weighs less than 1.25 kg?”

Go on – work it out.

I can’t answer this for certain. The labelling of the horizontal axis renders this question unanswerable. I suspect the desired answer is one out of 32. To get this answer I assume that the lambs are weighed to a precision of one decimal place, and the numbers under the graph are the inclusive upper bounds for the area above. I make this assumption because I know that is what Excel does. That is two too many assumptions for students in an exam. This is too many assumptions for any graph. Graphs exist in order to communicate, not confuse.

A histogram always has “bins” which cover a range of values. If you went to school last century and learnt to draw them by hand, you would put the boundary number between the bins on the tick mark on the graph that was the boundary between the bins. Intuitive!

A Google search on the word histogram shows most of the histograms with the tick marks at the boundaries, and quite a few using the Excel work around shown above. That is because the only way you can get the number and the tick mark to line up, is to move the tick mark to the centre. An Excel column chart is designed to be a value graph for nominal data, and it is being pressed into service in an unnatural way.

Another example

This is a simple mockup to illustrate

Example A

The question is, how many people scored 3 or less in the test?

It isn’t clear. Did one person score between 0 and 3, and then three between 3 and 6? The data is actually:  0, 0.5, 1, 2, 5, 5.5, 7, 8, 9,10 and the answer is that four people scored three or less. The following histogram shows this.

Example B (same data as Example A)

All it takes is some relabelling and the meaning is clear.

Teaching implications

We thought long and hard about the teaching of histograms within a Business Statistics course. We concluded that any student who is likely to need to produce a histogram in the future, is likely to (ought to?) have a better statistics package than Excel to use. Teaching them this bizarre work-around in Excel is a waste of student time (We decided this after we made students do this in a course.) It is more important for students to be able to interpret histograms correctly, and be aware of the pitfalls of badly labelled histograms. Consequently we taught students to interpret and then critique histograms, rather than construct them themselves, and assessed the same way.

If you ever use a histogram yourself, make sure you do not fall into the pit shown above!

And for those of you who persist in teaching histograms in Excel (or need help yourself in knowing how to do it – hence avoiding said pit), here is a pdf handout.

Drawing_a_histogram_2007

Good luck.

The best outcome would be that Microsoft get their very poor data analysis add-in fixed up, and the world would be a better place. Any chance of that?

What Mathematics teachers need to know about statistics

My post suggesting that statistics is more vital for efficient citizens than algebra has led to some interesting discussions on Twitter and elsewhere. Currently I am beginning an exciting venture to provide support materials for teachers and students of statistics, starting with New Zealand. These two circumstances have led me to ponder about why maths teachers think that statistics is a subset of mathematics, and what knowledge and attitudes will help them make the transition to teaching statistics as a subject.

An earlier post called for mathematics to leave statistics alone. This post builds on that by providing some ways of thinking that might be helpful to mathematics teachers who have no choice but to teach statistics.

Statistics is not a subset of mathematics

Let me quote a forum post from a teacher of mathematics in New Zealand:

  • “It seems strange to me that Statistics is a small part of Mathematics (which also includes Trigonometry, Algebra, Geometry, Calculus … but for our year 13s and now our year 12s, it’s attained equal parity as one area with all the other branches of maths put together as another area.”

This is very helpful as it lets us see where the writer is coming from. To him, statistics is a subset of mathematics – a small part, and somehow it has managed to push its way to become on equal footing with “mathematics.”

I disagree.

I think we need to take a look at the role of compulsory schooling. It is popular among people who go to university, and even more so among those who never leave (having become academics themselves) to think that the main, if not only role of school is to prepare students for university. If the students somehow have not gained the skills and knowledge that the university lecturers believe are necessary, then the schools have failed – or worse still, the system has failed. Again I disagree.

The vision for the young people of New Zealand is stated in the official curriculum.

“Our vision is for young people:

  • who will be creative, energetic, and enterprising
  • who will seize the opportunities offered by new knowledge and technologies to secure a sustainable social, cultural, economic, and environmental future for our country
  • who will work to create an Aotearoa New Zealand in which Māori and Pākehā recognise each other as full Treaty partners, and in which all cultures are valued for the contributions they bring
  • who, in their school years, will continue to develop the values, knowledge, and competencies that will enable them to live full and satisfying lives
  • who will be confident, connected, actively involved, and lifelong learners.”

It doesn’t actually mention preparing people for university.

My view is that school is about preparing young people for life, while helping them to enjoy the journey.

What teachers need to know about statistics

Statistics for life can be summed up as C, D, E, standing for Chance, Data and Evidence.

Chance

Students need to understand about the variability in their world. Probability is a mathematical way of modelling the inherent uncertainty around us. The mathematical part of probability includes combinatorics and the ability to manipulate tables. You can use Venn diagrams and trees if you like, and tables can be really useful too. The most difficult part for many students is converting the ideas into mathematical terminology and making sense of that.

Bear in mind that perceptions of chance have cultural implications. Some cultures play board games and other games of chance from a young age, and gain an inherent understanding of the nature of uncertainty as provided by dice. However there are other cultures for whom all things are decided by God, and nothing is by chance. There are many philosophical discussions which can be had regarding the nature of uncertainty and variability. The work of Tversky and Kahnemann and others have alerted us to the misconceptions we all have about chance.

An area where the understanding of probabilities and relative risk is vital is that of medical screening. Studies among medical practitioners have shown that many of them cannot correctly estimate the probability of a false positive, or the probability of a true positive, given that the result of a test is positive. This is easily conveyed through contingency tables, which are now part of the NZ curriculum.

Data

When people talk about “statistics”, more often they are talking about data and information than the discipline of statistical analysis. Just about everyone is interested in some area of statistics. Note the obsession of the media for reporting the road toll and comparing with previous years or holiday periods. Sports statistics occupy many people’s thoughts, and can fill in the (long) gaps between the action in a cricket commentary. Weather statistics are vital to farmers, planners, environmentalists. Hospitals are now required to report various statistics. The web is full of statistics. It is difficult to think of an area of life which does not use statistics.The second thing we want to know about a new-born baby is its weight.

Just because data contains numbers does not make it mathematics. There are arithmetic skills, such as adding and dividing, which can be practised using data. But that’s about it when it comes to mathematics and data. These days we have computer packages which can calculate all sorts of summary values, and create graphs for better or worse, so the need for mathematical or numeracy skills is much diminished. What is needed is the ability to communicate ideas using numbers and diagrams; by communication I mean production and interpretation of reports and diagrams.

The area of data also includes the collection of data. This is taught at all levels of the NZ curriculum. Students are taught to think about measurement, both physical and through questionnaires. Eventually students learn to design experiments to explore new ideas. Some might see this as science or biology, social studies or psychology, technology and business. There are even applications in music where students explore people’s music preferences. Data occurs in all subjects, and really the skills of data analysis should be taught in context. But until the current generation of students become the teachers, we may need to rely on the teachers of statistics to provide support. There are wonderful opportunities for collaboration between disciplines, if our compartmentalised school system would allow them.

Evidence

Much data is population data and conclusions can easily be drawn from it. However we also use samples to draw conclusions about populations. Inferential statistics has been developed using theoretical probability distributions to help us use samples to draw conclusions about populations. Unfortunately the most popular form of inference, hypothesis testing, is counter-intuitive at best. Many teachers do not truly understand the application of inferential statistics – and why should they – they may never have performed a real statistical analysis. It is only through repeated application of techniques to multiple contexts that most people can start to feel comfortable and get some understanding of what is happening. The beauty is that today the technology makes it possible for students to perform multiple analyses so that they can learn the specific from the general.

The New Zealand school system has taken the courageous* step to introduce the use of resampling, also known as bootstrapping or randomisation, for the generation of confidence intervals. This is contentious and is causing teachers concern. I will dedicate a whole post to the ideas of resampling and why they may be preferable to more traditional approaches. I empathise with the teachers who are feeling out of their depth, and hope that our materials, along with the excellent ones provided by “Census at School” can be of help.

I have no doubt that educators all over the world are watching to see how this goes before attempting similar moves in their own countries. Yet again New Zealand gets to lead the world. Watch this space!

*In the popular British television show, “Yes Minister”, the public servant, Sir Humphrey, would use the term “courageous” to describe a proposal which was probably right, but also likely to lose votes.