# Why do we teach about random variables, and why is it so difficult to understand?

Probability and statistics go together pretty well, and basic probability is included in most introductory statistics courses. Often maths teachers prefer the probability section as it is more mathematical than inference or exploratory data analysis. Both probability and statistics deal with the idea of uncertainty and chance, statistics mostly being about what has happened, and probability about what might happen. Probability can be, and often is, reduced to fun little algebraic puzzles, with little link to reality. But a sound understanding of the concepts of probability and distribution is essential to H.G. Wells’s “efficient citizen”.

When I first started on our series of probability videos, I wrote about the worth of probability. Now we are going a step further into the probability topic abyss, with random variables. For an introductory statistics course, whether to include random variables is an interesting question. Is it necessary for the future marketing managers of the world, the medical practitioners, the speech therapists, the primary school teachers, the lawyers to understand what a random variable is? Actually, I think it is. Maybe it is not as important as understanding concepts like risk and sampling error, but random variables are still important.

## Random variables

Like many concepts in our area, once you get what a random variable is, it can be hard to explain. Now that I understand what a random variable is, it is difficult to remember what was difficult to understand about it. But I do remember feeling perplexed, trying to work out what exactly a random variable was. The lecturers used the term freely, but I remember (many decades ago) just not being able to pin down what a random variable was, or why it needed to exist.

To start with, the words “random variable” are difficult on their own. I have dedicated an entire post to the problems with “random”, and in the writing of it, discovered another inconsistency in the way that we use the word. When we are talking about a random sample, random implies equal likelihood. Yet when we talk about things happening randomly, they are not always equally likely. The word “variable” is also a problem. Surely all variables vary? Students may wonder what a non-random variable is – I know I did.

I like to introduce the idea of variables as part of mathematical modelling. We can have a simple model:

**Cost of event = hall hire + per capita charge × number of guests.**

In this model, the **hall hire** and **per capita charge** are both constants, and the **number of guests** is a variable. The **cost of the event** is also a variable, and can be expressed as a function of the number of guests. And vice versa! Now if we know the number of guests, we can then calculate the cost of the event. But the number of guests may be uncertain – it could be something between 100 and 120. It is thus a **random variable**.
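For readers who like to see this in code, here is a minimal sketch of the model above. The dollar amounts for the hall hire and per capita charge are made up for illustration, and treating every guest count from 100 to 120 as equally likely is my assumption for the sketch, not something the model requires.

```python
import random

HALL_HIRE = 500.0          # hypothetical constant, in dollars
PER_CAPITA_CHARGE = 40.0   # hypothetical constant, in dollars per guest


def event_cost(guests: int) -> float:
    """Cost of event = hall hire + per capita charge x number of guests."""
    return HALL_HIRE + PER_CAPITA_CHARGE * guests


def simulate_costs(n_trials: int, seed: int = 1) -> list[float]:
    """Draw the uncertain guest count many times and cost each draw."""
    rng = random.Random(seed)
    # Assumption: each count from 100 to 120 inclusive is equally likely.
    return [event_cost(rng.randint(100, 120)) for _ in range(n_trials)]


costs = simulate_costs(10_000)
print(min(costs), max(costs))    # bounded by the 100-guest and 120-guest costs
print(sum(costs) / len(costs))   # close to the cost for 110 guests
```

Because the cost is a function of a random variable, the cost is itself a random variable: each simulated event gives a different cost, but the costs follow a pattern we can describe.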

Another way to look at a random variable is to come from the other direction – start with the random part and add the variable part. When something random happens, sometimes the outcome is discrete and non-numerical, such as the sex of a baby, the colour of a tulip, or the type of fruit in a lunchbox. But when the random outcome is given a value, then it becomes a **random variable**.

## Distributions

Then we come to distributions. I fear that too often distributions are taught in such a way that students believe that the normal or bell curve is a property guiding the universe, rather than a useful model that works in many different circumstances. (Rather like Adam Smith’s invisible hand that economists worship.) I’m pretty sure that is what I believed for many years, in my fog of disconnected statistical concepts. Somewhat telling is the tendency for examples to begin with the words, “The life expectancy of a particular brand of lightbulb is normally distributed with a mean of …” or similar. Worse still, some questions don’t even mention the normal distribution, and simply say “The mean income per household in a certain state is $9500 with a standard deviation of $1750. The middle 95% of incomes are between what two values?” Students are left to assume that the normal distribution will apply, which in the second case is only a very poor approximation as incomes are likely to be skewed. This sloppy question-writing perpetuates the idea of the normal distribution as the rule that guides the universe.
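To make the hidden assumption visible, here is a short sketch of the calculation the income question seems to expect. It only works once we choose a model, and the choice of a normal model in the code is exactly the unstated assumption; the function name `middle_interval` is mine.

```python
from statistics import NormalDist

MEAN_INCOME = 9500.0   # from the question
SD_INCOME = 1750.0     # from the question


def middle_interval(mean: float, sd: float, coverage: float = 0.95) -> tuple[float, float]:
    """Central interval holding `coverage` of the probability under a normal model."""
    # Assumption: incomes are normally distributed. The question never says so.
    model = NormalDist(mu=mean, sigma=sd)
    tail = (1.0 - coverage) / 2.0
    return model.inv_cdf(tail), model.inv_cdf(1.0 - tail)


low, high = middle_interval(MEAN_INCOME, SD_INCOME)
print(round(low), round(high))
```

Under the normal model the interval runs from roughly $6070 to $12930. With a skewed model for incomes the same mean and standard deviation would give different endpoints, which is why the question is incomplete as written.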

Take a look at the textbook you use, and see what language it uses when asking questions about the normal distribution. The two examples above are from a popular AP statistics test preparation text.

I thought I’d better take a look at what Khan Academy did to random variables. I started watching the first video and immediately got hit with the flipping coin and rolling dice. No, people – this is not the way to introduce random variables! No one cares how many coins are heads. And even worse he starts with a zero/one random variable because we are only flipping one coin. And THEN he says that he could define a head as 100 and tail as 703 and…. Sorry, I can’t take it anymore.

## A good way to introduce random variables

After LOTS of thinking and explaining, and trying stuff out, I have come up with what I think is a revolutionary and fabulous way to introduce random variables and distributions. To begin with we use a discrete empirical distribution to illustrate the idea of a random variable. The random variable models the number of ice creams per customer.

Then we use that discrete distribution to teach about expected value and standard deviation, and combining random variables. The third video introduces the idea of families of distributions, and shows how different distributions can be used to model the same random process.
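As a rough sketch of the kind of calculation involved, here is a discrete distribution for ice creams per customer with its expected value and standard deviation. The probabilities are invented for illustration and are not the figures used in the videos, and the combining step assumes the two customers buy independently.

```python
from math import sqrt

# Hypothetical empirical distribution: P(X = x) for ice creams per customer.
ICE_CREAM_PMF = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}


def expected_value(pmf: dict[int, float]) -> float:
    """Probability-weighted average of the values."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf: dict[int, float]) -> float:
    """Probability-weighted average squared distance from the expected value."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def sum_of_two_independent(pmf: dict[int, float]) -> dict[int, float]:
    """Distribution of the total bought by two customers, assumed independent."""
    total: dict[int, float] = {}
    for x1, p1 in pmf.items():
        for x2, p2 in pmf.items():
            total[x1 + x2] = total.get(x1 + x2, 0.0) + p1 * p2
    return total


mean_one = expected_value(ICE_CREAM_PMF)
sd_one = sqrt(variance(ICE_CREAM_PMF))
two_customers = sum_of_two_independent(ICE_CREAM_PMF)
mean_two = expected_value(two_customers)
sd_two = sqrt(variance(two_customers))
print(mean_one, sd_one, mean_two, sd_two)
```

With these made-up probabilities, one customer buys 2 ice creams on average with a standard deviation of 1. For two independent customers the means add and the variances add, so the standard deviation of the total is the square root of 2 rather than 2.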

Another unusual feature is the introduction of the triangular distribution, which is part of the New Zealand curriculum. You can read here about the benefits of teaching the triangular distribution.

I’m pretty excited about this approach to teaching random variables and distributions. I’d love some feedback about it!

Hi Dr Nic. I like the way your videos combine formulas with explanation and visual cues. Just wondering though – why are categorical random variables ignored (the ethnicity of the next customer, for example)?

Hi Anna

Thanks. To be honest I’ve always taught that something had to take numeric values to be classed as a “random variable”. At introductory level, it works. You’ve got me wondering now.

It is indeed something to wonder about, even just considering terminology. Most stats courses have a section about variables in data, classifying them as quantitative/numerical or qualitative/categorical. When we study probability we have questions that ask about the probabilities of certain events, that often are described in words with no reference to numbers. However, when we come to RANDOM variables, they suddenly are only allowed to be numbers. This is highly confusing!

I think the reason is that when doing statistical modeling you need to have equations and expected values, which don’t work with things that aren’t numbers. If you think carefully about how statistical modelling is done, the non-number variable is always converted into one or several variables that are 0 or 1, so that the mathematical formulas work. Another point worth making is that the only summary statistic that makes sense for a word-variable is the mode.

Where exactly do I find the videos?! I seem to be overlooking them…

Hi Cheryl

Thanks for drawing my attention to this. Here is a link to the first one. https://www.youtube.com/watch?v=lHCpYeFvTs0 The other two videos have been made private in order to make enough money to keep our business going. However, if you email me at n.petty@statsLC.com, I’ll happily give you access.

Great Video!, would love the others

Why is the number of ice cream cones the next person buys a random variable? Isn’t that number based on a lot of factors and not simply based on chance? I do not see the randomness of this.

Hi

Randomness means that it can take a number of different values, and we don’t know ahead of time which one it will take. There is very little that is simply based on chance. Some would say that nothing is based on chance, but rather that we do not know the contributing factors. For example, the air temperature can be modelled as a random variable, even though there is a definite cause for it. Remember that probability is a model to help explain reality. I hope this explanation helps, as it is a very good question, and one that continues to vex philosophers, mathematicians and statisticians.

Nic