I feel a slight quiver of trepidation as I begin this post – a little like the boy who pointed out that the emperor has no clothes.

Random sampling is a myth. Practical researchers know this and deal with it. Theoretical statisticians live in a theoretical world where random sampling is possible and ubiquitous – which is just as well really. But teachers of statistics live in a strange half-real-half-theoretical world, where no one likes to point out that real-life samples are seldom random.

## The problem in general

In order for most inferential statistical conclusions to be valid, the sample we are using must obey certain rules. In particular, each member of the population must have equal possibility of being chosen. In this way we reduce the opportunity for systematic error, or bias. When a truly random sample is taken, it is almost miraculous how well we can make conclusions about the source population, with even a modest sample of a thousand. On a side note, if the general population understood this, and the opportunity for bias and corruption were eliminated, general elections and referenda could be done at much less cost, through taking a good random sample.
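To put a rough number on "a modest sample of a thousand": under simple random sampling, the 95% margin of error for a proportion is approximately 1.96 × √(p(1 − p)/n). The sketch below is only an illustration. The population of one million voters, the 52% support figure and the function name `margin_of_error` are all made up for the example.

```python
import math
import random


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)


# Worst case (p = 0.5) for a sample of 1000: roughly +/- 3 percentage points.
worst_case = margin_of_error(0.5, 1000)

# Hypothetical population: 1,000,000 voters, 52% in favour (made-up numbers).
rng = random.Random(1)
population = [1] * 520_000 + [0] * 480_000
sample = rng.sample(population, 1000)
estimate = sum(sample) / len(sample)

print(f"worst-case margin of error: {worst_case:.3f}")
print(f"sample estimate: {estimate:.3f} (true value 0.520)")
```

The margin of error depends on the sample size, not on the size of the population, which is why a thousand well-chosen people can stand in for a million.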

However! It is actually quite difficult to take a random sample of people. Random sampling is doable in biology, I suspect, where seeds or plots of land can be chosen at random. It is also fairly possible in manufacturing processes. Medical research relies on the use of a random sample, though it is seldom of the total population. Really it is more about randomisation, which can be used to support causal claims.

But the area of most interest to most people is people. We actually want to know about how people function, what they think, their economic activity, sport and many other areas. People find people interesting. To get a really good sample of people takes a lot of time and money, and is outside the reach of many researchers. In my own PhD research I approximated a random sample by taking a stratified, cluster semi-random almost convenience sample. I chose representative schools of different types throughout three diverse regions in New Zealand. At each school I asked all the students in a class at each of three year levels. The classes were meant to be randomly selected, but in fact were sometimes just the class that happened to have a teacher away, as my questionnaire was seen as a good way to keep them quiet. Was my data of any worth? I believe so, of course. Was it random? Nope.

Problems people have in getting a good sample include cost, time and also response rate. Much of the data that is cited in papers is far from random.

## The problem in teaching

The wonderful thing about teaching statistics is that we can actually collect real data and do analysis on it, and get a feel for the detective nature of the discipline. The problem with sampling is that we seldom have access to truly random data. By random I am not meaning just simple random sampling, the least simple method! Even cluster, systematic and stratified sampling can be a challenge in a classroom setting. And sometimes if we think too hard we realise that what we have is actually a population, and not a sample at all.

It is a great experience for students to collect their own data. They can write a questionnaire and find out all sorts of interesting things, through their own trial and error. But mostly students do not have access to enough subjects to take a random sample. Even if we go to secondary sources, the data is seldom random, and the students do not get the opportunity to take the sample. It would be a pity not to use some interesting data, just because the collection method was dubious (or even realistic). At the same time we do not want students to think that seriously dodgy data has the same value as a carefully collected random sample.

## Possible solutions

These are more suggestions than solutions, but the essence is to do the best you can and make sure the students learn to be critical of their own methods.

Teach the best way, pretend and look for potential problems.

Teach the ideal and also teach the reality. Teach about the different ways of taking random samples. Use my video if you like!

Get students to think about the pros and cons of each method, and where problems could arise. Also get them to think about the kinds of data they are using in their exercises, and what biases they may have.

We also need to teach that, used judiciously, a convenience sample can still be of value. For example I have collected data from students in my class about how far they live from university, and whether or not they have a car. This data is not a random sample of any population. However, it is still reasonable to suggest that it may represent all the students at the university – or maybe just the first year students. It possibly represents students in the years preceding and following my sample, unless something has happened to change the landscape. It has worth in terms of inference. Realistically, I am never going to take a truly random sample of all university students, so this may be the most suitable data I ever get. I have no doubt that it is better than no information.

Not all questions are of equal worth. Knowing whether students who own cars live further from university, in general, is interesting but not of great importance. Were I to be researching topics of great importance, such as safety features in roads or medicine, I would have a greater need for rigorous sampling.

So generally, I see no harm in pretending. I use the data collected from my class, and I say that we will pretend that it comes from a representative random sample. We talk about why it isn’t, but then we move on. It is still interesting data, it is real and it is there. When we write up analysis we include critical comments with provisos on how the sample may have possible bias.

What is important is for students to experience the excitement of discovering real effects (or lack thereof) in real data. What is important is for students to be critical of these discoveries, through understanding the limitations of the data collection process. Consequently I see no harm in using non-random, realistically sampled real data, with a healthy dose of scepticism.

In my sampling class (which was modeled after the class I took from Bill Kalsbeek, and he in turn borrowed the idea from a class taught by his student Ginny Lesser), we had students take their own samples and write up term projects with them. Some were rather silly samples: estimate the total number of bricks used in sidewalks. Some were rather contentious samples: one student of mine took a snowball sample of her friends and their friends, and really had trouble explaining why it made scientific sense. Some were public policy samples: estimate the proportion of people carpooling to campus. A computer science student in my class worked on estimating the proportion of ISBNs that correspond to an actual book, using an equivalent of random digit dialing and asking Amazon and Google Books whether a book with this random ISBN exists. Nick Horton gave a CAUSEWeb webinar a couple of years ago in which students estimated the area of the land in the US that is further than 1 mile from a road using Google Maps. The great thing about teaching is that you can make your students do whatever; it is your responsibility as a teacher to set the students up so that by doing this “whatever” they learn practical skills… along with (infinitely practical!) theorems about the unbiasedness of the Horvitz-Thompson estimator and the non-existence of an unbiased variance estimator for systematic samples.

Hi,

I like this, and I generally agree with your ideas. Two small points for you to consider.

I disagree with the notion that censuses are different from samples. I’ve worked extensively with cancer registration data, which is certainly intended to be a census, but in reality it’s a sample, with unknown bias and a high sampling fraction!

I also think that you should emphasize selection bias much more strongly. In my work what messes up inference is almost never not knowing the sampling probability. Rather it is differential non-response. Non-responders are wildly different from responders, often in very interesting and relevant ways.
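The effect of differential non-response can be illustrated with a small simulation. This is a minimal sketch with made-up numbers: a hypothetical population in which 30% have some trait, and assumed response rates of 80% for people with the trait and 40% for people without it. Even though the initial draw is a genuine simple random sample, the responders-only estimate heads towards 0.24 / 0.52 ≈ 0.46 rather than 0.30.

```python
import random

rng = random.Random(42)

# Hypothetical population of 100,000 people; 30% have the trait of interest.
population = [1] * 30_000 + [0] * 70_000

# Draw a genuine simple random sample of 5,000 people.
sample = rng.sample(population, 5_000)

# Assumed response rates (illustrative only): people with the trait respond
# 80% of the time, people without it only 40% of the time.
response_rate = {1: 0.8, 0: 0.4}
responders = [y for y in sample if rng.random() < response_rate[y]]

true_proportion = sum(population) / len(population)
full_sample_estimate = sum(sample) / len(sample)
responder_estimate = sum(responders) / len(responders)

print(f"true proportion:     {true_proportion:.3f}")
print(f"full random sample:  {full_sample_estimate:.3f}")
print(f"responders only:     {responder_estimate:.3f}")
```

Taking a larger sample does not help here: the bias comes from who answers, not from how many were asked.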

Thanks for another interesting post!

I think that it is worth pointing out that there is theory for non-i.i.d. data sampling too. In fact, a major research direction in modern probability theory is the field of “measure concentration”, which considers, amongst other things, under what conditions we can have CLT-like behaviour when we relax the i.i.d. assumption of classical statistics. It turns out that provided observations don’t depend on each other “too much” (which can be made precise in several ways), CLT-like behaviour is a very general phenomenon. This is certainly a partial theoretical explanation for why, oftentimes, breaking the i.i.d. assumption doesn’t matter too much.

Dear Dr Nic, Nothing to do with statistics: I just want to point out that the person who commented that the Emperor had no clothes was a “child”. Somehow in translation into English this has become a “boy”! I have always seen myself as that child so it is helpful to know that, given I am female!! Keep up your good work. It is OK to have more than one such child in the population! Actually we need a lot of them in statistics!

Brilliant. It makes so much more sense that she was a girl. Nice to know there are others around too.

Dear Dr Nic,

I couldn’t agree more about non-randomness. When I worked as a statistician in a medical school a phrase that often featured in the write up of the latest research project was “the control group consisted of normal healthy adults” which I discovered was code for “anyone who came into the lab that day”.

Dick Brown

I think Vic Barnett’s book on sampling has a real example that will interest most students. Arrange the students in their rows accordingly and compare the results of random, stratified and cluster sampling for estimating proportions or means, e.g. of their height, weight or BMI.

Basilio

Hi, I stumbled when reading your discussion of what you call Random (near to ransom, by the way).

Why did I stumble? Because:

a. Probability sampling is the widely used term, and its definition is far from pure randomness (whatever this might be): if one knows the probability of being selected to be part of a sample and this probability is not zero, then you get a probability sample that allows you to infer from the sample to the whole population. From this follows the definition of a quota sample: selection probability in a quota sample can be zero, because if a quota cell is full, the next respondent fitting into that cell will not be selected, i.e. her selection probability is zero. Thus stratified probability samples have nothing in common with quota samples. However, identifying each respondent’s selection probability can be cumbersome.

b. A strict random sample could be defined as one that uses only one characteristic of a respondent for selection: the person must belong to the target population. This means in turn that if you could enumerate your target population completely, then you could use random numbers or an equivalent to select respondents “at random from your target population” (in-patients at a hospital at a specified point in time could be called such a target population).

c. This leads to the issue of defining a target population, a quite peculiar one. Think of a definition common for General Social Surveys: the target population consists of all people living in a defined administrative region (say United States mainland (!)), in private households there (do not ask me for a household definition, that could become extremely lengthy), who are 18+ years of age and capable of answering questions; this excludes prison inmates, military in barracks, people in homes for the elderly or convents, etc. but it definitely could include hermits.

d. Finally, a census differs quite substantially from a sample because mean, variance etc. are known, i.e. they need not be estimated as in sampling. This also means that one does not infer from a sample but one computes frequencies, patterns, etc. straightforwardly.

So much for now.

PM

Dr. Nic, you wrote: “each member of the population must have equal possibility of being chosen.” I infer from this that you are trying to squeeze the issue of inference with survey samples into the i.i.d. paradigm. Yet there are many epsem (equal probability of selection method) sampling designs that cannot be analyzed by the i.i.d. methods. Of course the simple random sample (which is sort of the gold standard you are implicitly advocating for) is the design for which i.i.d. inference methods are appropriate (sans the finite population correction though). However, a proportional stratified sample is an example where the i.i.d. analysis is wrong, and a systematic sample is yet another example. In the first case, the standard errors are smaller than for the i.i.d. case, and in the second case, the standard errors cannot even be computed in an unbiased way, technically speaking, and survey statisticians fall back onto variance estimators that are known to be biased.
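The point about proportional stratified samples having smaller standard errors can be checked empirically. The following is a minimal sketch, not a general proof: it assumes a hypothetical population made of two strata with quite different means (all numbers and function names are invented for the illustration), draws many samples of size 100 under each design, and compares the spread of the resulting sample means.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: two strata with very different means.
stratum_a = [rng.gauss(50, 5) for _ in range(6_000)]
stratum_b = [rng.gauss(80, 5) for _ in range(4_000)]
population = stratum_a + stratum_b


def srs_mean(n):
    """Mean of a simple random sample of size n from the whole population."""
    return statistics.fmean(rng.sample(population, n))


def stratified_mean(n):
    """Mean of a proportionally allocated stratified sample of size n."""
    n_a = round(n * len(stratum_a) / len(population))
    n_b = n - n_a
    draws = rng.sample(stratum_a, n_a) + rng.sample(stratum_b, n_b)
    # With proportional allocation every unit has the same selection
    # probability, so the plain sample mean estimates the population mean.
    return statistics.fmean(draws)


reps = 2_000
srs_se = statistics.stdev([srs_mean(100) for _ in range(reps)])
strat_se = statistics.stdev([stratified_mean(100) for _ in range(reps)])

print(f"empirical SE, simple random sample:    {srs_se:.2f}")
print(f"empirical SE, proportional stratified: {strat_se:.2f}")
```

Both designs give every unit the same chance of selection, yet applying the i.i.d. standard error formula to the stratified sample would overstate its uncertainty, because stratification removes the between-strata part of the variance.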

Yet there are designs that are perfectly valid which don’t have the property of equal probabilities of selection. Probability-proportional-to-size is a very obvious class of such designs where the probability of selecting a school is proportional to the number of students in that school; these designs can work wonders on estimating the totals that are correlated with the measure of size used in sampling (e.g., the number of minority students). Stratified designs with a required margin of error for every subpopulation of interest are another example: think about a sample with a required 500 observations per province; larger provinces will have smaller probabilities of having a unit selected from them. The phone calling (random digit dialing) designs in which you need to select one person per household may have wildly different probabilities of selection which have to be determined post-data-collection by counting the number of people in the household, the number of landline and cell phones in the household, and accounting for potentially different probabilities of selection from the landline frame, from the cell phone frame, or from both. So as Peter Mohler wrote, what matters is that the probability of selection and response is known, and is compensated for by proper weighting.
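The "500 observations per province" design can be sketched in a few lines to show why weighting matters. The province sizes and proportions below are hypothetical, chosen only so that the large province differs from the small one. Each sampled unit is weighted by the inverse of its selection probability, N_h / n_h, which is the usual design weight for a stratified simple random sample.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical provinces: (population size, true proportion with the trait).
provinces = {"large": (200_000, 0.20), "small": (20_000, 0.60)}
n_per_province = 500  # fixed sample size in every province

populations = {}
for name, (size, p) in provinces.items():
    ones = round(size * p)
    populations[name] = [1] * ones + [0] * (size - ones)

total_size = sum(len(units) for units in populations.values())
true_proportion = sum(sum(units) for units in populations.values()) / total_size

unweighted_values = []
weighted_total = 0.0
for name, units in populations.items():
    sample = rng.sample(units, n_per_province)
    # Selection probability is n_h / N_h, so each sampled unit
    # stands for N_h / n_h members of its province.
    weight = len(units) / n_per_province
    unweighted_values.extend(sample)
    weighted_total += weight * sum(sample)

unweighted_estimate = statistics.fmean(unweighted_values)
weighted_estimate = weighted_total / total_size

print(f"true proportion:     {true_proportion:.3f}")
print(f"unweighted estimate: {unweighted_estimate:.3f}")
print(f"weighted estimate:   {weighted_estimate:.3f}")
```

The unweighted mean treats the two provinces as if they were the same size and lands near 0.40, while the weighted estimate recovers something close to the true overall proportion of about 0.24.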

Thank you for highlighting the myth of random sampling. Once a year, I teach statistics as an adjunct but in my “real job” I run a group of technology companies. When I speak to researchers about our work creating educational games, they always insist that we MUST have random sampling and blithely ignore the fact that there is no way for us to force schools to use our games, nor does it make any economic sense to turn away schools so we can collect data on them. Don’t even get me started on the low probability that a school we have told cannot have our games would then be willing to turn around and help us by collecting pretest and posttest data.

Can I just make a comment here, the same comment as I made to Dr Nic a while back: in the original Hans Christian Andersen story, it was a CHILD who observed the emperor had no clothes – sex not specified.
