The median suffers from poor marketing.
All my time at school the “average” was always calculated as the arithmetic mean, by adding up all the scores and then dividing by the number of scores. When we were taught about the median, it seemed like an inferior version of the mean. It was the thing you worked out when you weren’t smart enough to add and divide. It was used for house prices, and that was about it. Of course the mean was the superior product! Why wouldn’t you use the mean?
I’ve been preparing resources for teaching the fabulous new New Zealand curriculum, and have been brought face-to-face with my prejudices. It strikes me that the median has had very poor representation.
Public opinion of the median and mean
I put a question on Facebook and Twitter to see what people felt about the mean and the median. I briefly explained what each was, then asked which one they thought was better. Some people had no idea what I was talking about, but most felt that the mean was the superior statistic. The following is a selection of responses:
The mean, but I don’t know why.. maybe that’s just what we were taught to use when I was back in school (a long time ago!) lol
When I think of “average” I always think of the mean. I don’t know if it’s actually better though
well the median is a real pain to work out. you have to make a list of all the numbers, in order, and then count how many they are and then go to the middle. PAIN IN THE BUM. the average… well that is somewhat quicker to do, no? and i don’t see the point in the median at all. unless well no, there is just no need for it. who cares what the15th person in the class got on a test? the lowes and highes is much more interesting. As i remember it, the mode is the most commonly occuring number out of a set of numbers… i think of this as the “mode” or in English (not French), the ‘fashionable” number. oh and it stresses me how all 3 start with Ms cos that is confusing. which is why i like to use the word average.
The mean, which I’m guessing is the same as the average? When the media refer to real estate stats they always use median price, which can distort reality, we would prefer the average price. (From a real estate agent)
I don’t really think it’s a case of which is better. They’re two different things aren’t they? I think it’s usually easier to work out the average.
A number of my Facebook friends did know about statistics, and responded in favour of the median in most cases. This was an interesting comment:
“It depends. Everyone who proof read my thesis was like why on earth are you using the median – no one uses it. And most of the other similar primate studies I’ve read use the mean (except one, that was published by my associate supervisor). But my means were off their rocker, and I’m pretty sure my medians were a much better representation of reality in this case. It makes making comparisons between studies a little awkward though.”
Why NOT use the median all the time?
I am hard pressed to find an instance where the mean is actually a better measure of central tendency than the median. The purpose of the mean or median (or mode) is to provide a one-number summary of a set of data. The whole idea of the mean is actually quite tricky, as you can read in one of my early posts about explaining what the mean is. Generally the summary value is used to compare with another sample or population.
In my lectures I often illustrated times when the median is a better summary measure of a sample or population than the mean. This is quite common in notes and YouTube videos. Never once did I show where the mean was preferred to the median! So why were/are we so loyal to the mean, bringing out the median for special occasions and real estate?
I think there are two answers, both of them no longer valid. It is a question of legacy.
Time and ease to calculate
Despite first appearances, for anything larger than a trivial sample the mean is actually easier to calculate than the median. Putting a set of 100 values in order by hand is no easy task. (Pain in the bum, as my friend so elegantly expressed it.) Adding up scores and dividing by 100 is a walk in the park in comparison. In the early 1980s when I learned programming (in Fortran, Pascal and Cobol), writing a sorting program was far from trivial and a large set of numbers would take a large amount of time to sort. Only in later years, as computing power has expanded, has it been possible to get a computer to calculate a median.
Formulas for confidence intervals
Means behave nicely and give nice mathematical results when manipulated. Because of this we can calculate confidence intervals using a nifty little formula and statistical tables. Until bootstrapping by computer became do-able on a large and small scale, there was no practical way to perform inference on a number of very useful statistics, including the median and the inter-quartile range.
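To make the bootstrapping idea concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a median. The function name, the number of resamples and the scores are all made up for illustration, and the percentile method is just one of several ways to build a bootstrap interval.

```python
import random
import statistics


def bootstrap_median_ci(data, n_resamples=5000, level=0.95, seed=1):
    """Percentile bootstrap confidence interval for the median (a sketch)."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    n = len(data)
    # Resample with replacement, take the median each time, then sort them.
    medians = sorted(
        statistics.median(rng.choices(data, k=n))
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = medians[int(tail * n_resamples)]
    upper = medians[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical test scores, invented for this example.
scores = [52, 55, 58, 61, 63, 64, 67, 70, 72, 75, 78, 94]
low, high = bootstrap_median_ci(scores)
print("Sample median:", statistics.median(scores))
print("Bootstrap interval:", low, "to", high)
```

Nothing here needs a formula or a statistical table, only a computer willing to sort a few thousand small samples.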
Conclusion: the median is better
A median is intrinsically understandable. It is the middle number when the values are put in order. End of story. – Well not quite – you do have that slightly tricky thing where the sample size is even and you have to average the middle two values, but apart from that it is easy!
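The definition, including the slightly tricky even case, fits in a few lines. This is only an illustrative sketch; in practice Python's built-in statistics.median does the same job.

```python
def median(values):
    """Middle value of the ordered data; average the middle two if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: a single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two


print(median([3, 1, 2]))     # odd number of values
print(median([4, 1, 3, 2]))  # even number of values
```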
A median is not affected by outliers. I learned a new term for this when I was reading up in preparation for writing this post. The term is “resistant” and I learned it from one of Mr Tarrou’s videos for AP Statistics. I found these videos after my tirade against videos on confidence intervals. Tarrou’s videos are long and a bit more mathematical than I would like. (He can’t help it – he is a maths teacher and the AP Statistics syllabus seems to have been devised by mathematical statisticians trying to put students off ever taking the subject again.) But they are GOOD. Tarrou’s videos are sound, interesting and well put together. I will be recommending them as complementary to my own offerings. (Because I sure as heck don’t want to have to do all that icky mathsy stuff).
But I digress. The median is “resistant” because it is not at the mercy of outliers. There are lots of great examples, including in Mr Tarrou’s video. If you have a median of 5 and then add another observation of 80, the median is unlikely to stray far from the 5. However a mean is a fickle beast, and easily swayed by a flashy outlier.
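A quick check of the “add an observation of 80” example, using a small data set invented for the purpose:

```python
import statistics

# Made-up values whose median and mean are both 5.
values = [3, 4, 5, 5, 6, 7]
with_outlier = values + [80]

median_before = statistics.median(values)
median_after = statistics.median(with_outlier)
mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)

# The median stays at 5; the mean jumps from 5 to about 15.7.
print("Median:", median_before, "->", median_after)
print("Mean:  ", mean_before, "->", round(mean_after, 1))
```

One flashy observation triples the mean and leaves the median where it was.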
The main disadvantage I can see for the median is that it can be a bit jumpy in small samples made up of discrete values. I guess if you have two well-behaved populations that are very similar and you want to see precise differences then the means might just be better – but even then you would possibly be over-interpreting small differences.
I have found it very interesting observing the behaviour of confidence intervals for the difference of two medians, compared with confidence intervals for the difference of two means. While I was preparing materials for our on-line resource, I performed nine such tests on different real data taken from students at university. The scores are very jumpy, and the differences between the medians are often exactly zero. Consequently the confidence intervals for the difference of two medians quite often have zero as their lower bound. This provides a challenge in interpretation, as I had not met this often when looking at the differences between means. However, it also illuminates the odd relationship we have with zero. When a confidence interval for a difference of two means is (-0.13, 3.98) and so includes zero, it is tempting to conclude that there is no significant difference. But is -0.13 really any different from zero in practical terms? The other point is that we should be leaving the confidence interval as it is, rather than stretching it into further inference.
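Here is a rough sketch of how such an interval can be produced, and why it is jumpy. The two groups are invented whole-number scores, not the student data described above, and the function name is hypothetical. Because each group has an odd number of whole-number scores, every resampled median is itself a whole number, so the interval endpoints can only land on whole numbers, exactly zero included.

```python
import random
import statistics


def bootstrap_median_difference_ci(a, b, n_resamples=5000, level=0.95, seed=1):
    """Percentile bootstrap interval for median(a) - median(b) (a sketch)."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    # Resample each group separately and record the difference in medians.
    diffs = sorted(
        statistics.median(rng.choices(a, k=len(a)))
        - statistics.median(rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    return diffs[int(tail * n_resamples)], diffs[int((1 - tail) * n_resamples) - 1]


# Hypothetical discrete scores, invented for this example.
group_a = [4, 5, 5, 6, 6, 6, 7, 7, 8, 9, 10]
group_b = [3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8]

observed = statistics.median(group_a) - statistics.median(group_b)
low, high = bootstrap_median_difference_ci(group_a, group_b)
print("Observed difference in medians:", observed)
print("Bootstrap interval:", low, "to", high)
```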
Word on the web
I did a little surfing to see what the word on the web was. To find out who said what, drop the entire phrase into Google. (Ah ‘tis a wonderful world we live in, indeed)
- “The mean is the one to use with symmetrically distributed data; otherwise, use the median.” Hmm – but if the data is symmetric, surely the mean = the median?
- “An important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.” Ok – hard to argue with that.
- “Calculation of medians is a popular technique in summary statistics and summarizing statistical data, since it is simple to understand and easy to calculate, while also giving a measure that is more robust in the presence of outlier values than is the mean.” Totally!
- “However, when the sample size is large and does not include outliers, the mean score usually provides a better measure of central tendency.” (Then goes on to give an example of when the median is better.)
- “Use the median to describe the middle of a set of data that does have an outlier. Advantages of the median: Extreme values (outliers) do not affect the median as strongly as they do the mean, useful when comparing sets of data, it is unique – there is only one answer.
Disadvantages of the median: Not as popular as mean.” (Not as popular??!)
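Two of the claims quoted above are easy to check on small made-up data sets: deviations from the mean sum to zero (deviations from the median generally do not), and for symmetric data the mean and the median coincide. Whole numbers are used so that floating-point rounding does not muddy the comparison.

```python
import statistics

# Made-up, slightly lopsided data.
values = [2, 4, 4, 5, 7, 8]
mean = statistics.mean(values)
median = statistics.median(values)

# Sum of (value - centre) for each choice of centre.
deviations_from_mean = sum(v - mean for v in values)
deviations_from_median = sum(v - median for v in values)
print("Sum of deviations from the mean:  ", deviations_from_mean)
print("Sum of deviations from the median:", deviations_from_median)

# Made-up symmetric data: the mean and the median agree.
symmetric = [1, 3, 5, 7, 9]
print(statistics.mean(symmetric), statistics.median(symmetric))
```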
Sorry median – you do not win X-Factor for summary statistics. You may be more robust, and less fickle, not to mention easier to understand, but you just aren’t as popular!
I can feel a video coming on – the median has been relegated to the periphery long enough!
Update in 2018
Here is our video about different summary statistics, which also addresses the relative merits of mean and median, and why they even matter!