Statistics Saturdays: Sampling
I’ve felt for a long time that statistics ought to be taught from the earliest levels of the US educational system. A large part of the way that people get cheated, fail to make good decisions, and get duped by politicians and propagandists is a basic lack of understanding of statistics.
To that end, and to help my own education, I’ve decided to start doing a post on stats every Saturday. Since I’ll probably have to be TAing this stuff in a couple of years, it’s a good idea for me to get used to writing it out in a legible and easy-to-read fashion. So feel free to give feedback.
Today’s post is on the bedrock of stats: Sampling.
Sampling is how you decide how you’re going to get the sample of whatever or whoever you’re going to survey/count/lick/stare at creepily or otherwise measure for your study. What you’re striving for is for your sampled population—which, in statistical meaning, can be people, or cancer cells, or toasters—to match the total population as closely as possible.
There are five main methods of sampling, and I’m going to talk a bit about the virtues and evils of each.
Convenience Sampling: You stick your head out the window and holler your questions to the nearest people. You post a survey on LGF. You ask random guys at the bar, you harass cab drivers and then write long articles about it. You buy five hot dogs from the carts near where you work.
Pros: Convenience. The most you can do with a convenience sample is find a trend that you want to investigate more systematically, or find something significant about that sample. For example, if you sample four random guys at the bar about military service and find that they all served 20+ years, that’d be enough on its own to say that that bar had an unusually high number of long-serving military members drinking at it.
Cons: Everything. No real statistics can be done with this. Even in something like an LGF poll where it might seem like meaning emerges, it’s really just having fun and a very, very general idea of things—people might not have seen the poll, might not have cared to vote, might see the results of the voting and decide to or not to vote because of that, might vote ironically, etc. Many polls you see on pundit sites or, sadly, in newspapers these days, and all random website polls, are of this nature. Anecdotal data is all basically convenience sampling.
Simple random: Every member of the larger group has an equal chance of being polled. So, if you’ve lured a thousand people to a room with a promise of free Quiznos and have them all draw a ping-pong ball from a jar with a number on it, and then use 1-100 as your sample population, that’s a simple random sample of that thousand people. It is not a random sample of the population of the US as a whole, or the town, or anything, just of those thousand people. (I edited this, because as a rule of thumb you want a sample size of 100 in statistics as your ground floor of statistical significance. I earlier had 10 as the sample size.) For non-people, you have to have a way of really getting a ‘random’ sample, which is quite hard for something like animals in the wild, since by catching them they form a subset: “animals that were dumb enough to get caught”. It’s important to note that a scientist doing a survey of all bears on a mountain, using the sample of bears he could catch, is still aiming for a ‘simple random’ sample, but he’ll acknowledge in his study that the method means there may be a non-random pattern.
Pros: Simple, random. All other things being equal, this is the ideal way to sample a population. Other sampling methods are judged on how close they get to a simple random sample.
Cons: As with any sampling method, there will always be factors making this non-random. If you poll people by going door-to-door, you’ll miss the homeless and people who just aren’t home a lot. If you call phones, you get only the people who answer phones: people with limited minutes, bigamist tax-dodging agoraphobics who dodge calls of an unknown nature, and people who are just non-social will all be overlooked.
Here, the ideal is fine, but the problem is in the method: True randomness in the real world is hard. However, if you have a captive population (every employee at a company, every cop on a police force, every animal in captivity), then this worry goes away and simple random sampling really is ideal.
This is one of the main reasons why social science is a ‘soft’ science. It is not just that subjective things are studied; even with the objective things, obtaining a perfectly random sample is difficult.
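For the captive-population case, the ping-pong-ball draw from the Quiznos room is easy to mimic in code. This is only a rough sketch in Python: the function name simple_random_sample, the numbered room, and the seed are all made up for illustration, and it assumes you really do have a complete list of the population to draw from.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct members; every member is equally likely to be picked."""
    rng = random.Random(seed)  # seed only makes the illustration repeatable
    return rng.sample(population, k)


# Hypothetical room of 1,000 people, identified by ping-pong ball number.
room = list(range(1, 1001))
room_sample = simple_random_sample(room, 100, seed=42)

print(len(room_sample))  # 100 people, nobody picked twice
```

Note what the code cannot do: it gives a random sample of the list you hand it, so anyone missing from the list (the homeless, the phone-dodgers) is still missing from the sample.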
Systematic sample: You take every nth member for your sample.
Pros: Systematic, highly automatable, in some circumstances identical to simple random. For example, most manufactured goods have every nth item pulled off the line for testing. You couldn’t possibly systematically sample every nth rabbit in a warren, but you could sample every nth rabbit to come out of a particular run in that warren.
Cons: Most populations don’t come pre-sorted, so it’s got limited applicability. (What is the ‘nth’ person in Manhattan, where do you start counting, how do you count?) Systematic sampling will also miss or miscount any error that is itself systematic. If you are sampling every 10th computer chip but the machine has a flaw that recurs every 10th chip starting 5 chips out of step with your tests, you’ll test chips 10, 20, 30, but the error will occur at 15, 25, 35, and you will literally never notice. If the error occurs every nth chip with a different n than your system, then you’ll only encounter that error where the two cycles coincide. So, if you test every 10th chip and an error occurs every 11th chip, it’ll take until the 110th chip to find the error, and then the 220th for you to see it again. If you’re not watching for a pattern, you’d conclude that a problem occurred every 110th chip.
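The chip arithmetic above is easy to check for yourself. Here is a small sketch in Python; the helper every_nth and the run length of 1,000 chips are my own assumptions for the sake of the example, not anything from a real production line.

```python
def every_nth(total, step, first=None):
    """1-based positions first, first + step, ... up to and including total."""
    if first is None:
        first = step
    return list(range(first, total + 1, step))


TOTAL = 1000  # hypothetical length of the production run
tested = set(every_nth(TOTAL, 10))  # chips 10, 20, 30, ...

# Flaw on the same 10-chip cycle, but 5 chips out of step: 15, 25, 35, ...
offset_defects = every_nth(TOTAL, 10, first=15)
caught_offset = [chip for chip in offset_defects if chip in tested]

# Flaw on an 11-chip cycle: 11, 22, 33, ...
drifting_defects = every_nth(TOTAL, 11)
caught_drifting = [chip for chip in drifting_defects if chip in tested]

print(caught_offset)        # [] : the out-of-step flaw is never tested
print(caught_drifting[:2])  # [110, 220] : looks like a 110-chip problem
```

In this run there are 90 flawed chips on the 11-chip cycle, and the tester sees only 9 of them.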
Cluster Sampling: Divide the population into clusters and randomly select some of those clusters, which are then fully investigated or have random members of those random clusters sampled. Wee!
Pros: Solves real-world scalability problems. Otherwise, it is inferior to all other sampling methods except convenience sampling. Say you want to test blood levels of LSD among accountants in small-town America, because hey, why not. You could travel to every small town and then interview one out of ten of those accountants, but that’s a lot of driving, a lot of long hours on the lonely road with only crackling radio stations playing the same 7 pop-country songs. Instead, you can take all the small towns in the US, and randomly pick one out of every ten, and then go to those towns and take blood from every accountant in the town, revealing what you suspected: They’re all tripping balls.
This is a method that has become less important in the internet age. In my example above, you needed a physical sample; most statistics is about things that can be done over the phone, or over the internet, and the demographics necessary to design a simple random or a stratified sample are easier to access. Cluster sampling is a valid method, but it is less precise than other survey methods, even when it is just as accurate.
Cons: As above, it is less precise: It cannot give as many significant digits of response. A scale that measures things down to pounds is more precise than one that measures things down to tens of pounds, even if it is less accurate. This is stats terminology only, not common usage, obviously. Most populations have differences between their clusters, even if these differences average out, and this is what accounts for the lower precision. If you’re studying the popularity of pulled pork, and divide the US up into clusters of counties and sample randomly among those, you’ll probably be broadly accurate: you’ll get a rough idea of how much people are into pulled pork. But if you randomly select a few more counties in the south than in the north, this’ll pull your numbers off a bit. Knowing this, you can’t claim that the survey is very precise. (How we calculate the precision of a survey I’ll get into in another post.)
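To make the accountant road trip concrete, here is a rough Python sketch of the simplest version of cluster sampling: pick one town in ten at random, then test everybody in the towns you picked. The 50 towns, the accountant names, and the function cluster_sample are invented for the example.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every member of each one."""
    rng = random.Random(seed)  # seed only makes the illustration repeatable
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}


# Hypothetical data: 50 small towns with 1 to 4 accountants apiece.
towns = {
    f"town_{i:02d}": [f"accountant_{i:02d}_{j}" for j in range(1 + i % 4)]
    for i in range(50)
}

# One town out of every ten, and then every accountant in those towns.
picked = cluster_sample(towns, n_clusters=len(towns) // 10, seed=1)

for town, accountants in picked.items():
    print(town, accountants)
```

Run it with a few different seeds and the number of accountants you end up testing changes from draw to draw, which is the cluster-to-cluster difference that costs you precision.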
Stratified sampling: People get this confused with cluster sampling, but it’s very different: Stratified sampling divides the population up into subgroups, or strata, and then randomly selects members from each stratum. How is this different from cluster sampling? In cluster sampling you pick only some of the groups and skip the rest; in stratified sampling every group is represented. Really, stratified sampling is just doing two (or more) samples at one time of closely related things. For example, we have both DOs and MDs as ‘doctors’ in this country, practicing medicine. In a survey of doctors, you might want to separate out DO answers from MD answers, even if you’ll sometimes just be lumping them together.
The states are another example of strata: If you wanted to conduct a poll to see how likely a constitutional amendment mandating that clowns dressed as beauty pageant winners stay out of town after sunset (The No-Gown-Clown-After-Sundown-in-Our-Town Amendment) was to pass, only stratified sampling would give you anything close to precision in delivering the answer, since the responses of each state matter.
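The doctor survey can be sketched the same way. In this Python example the head counts (900 MDs, 100 DOs), the 10% sampling fraction, and the function stratified_sample are all assumptions I made up to have something to run; taking the same fraction from each stratum is just one common choice, not the only way to do it.

```python
import random


def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)  # seed only makes the illustration repeatable
    result = {}
    for name, members in strata.items():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        result[name] = rng.sample(members, k)
    return result


# Hypothetical population of doctors, split into two strata.
doctors = {
    "MD": [f"md_{i}" for i in range(900)],
    "DO": [f"do_{i}" for i in range(100)],
}

doctor_sample = stratified_sample(doctors, fraction=0.1, seed=7)

print({name: len(members) for name, members in doctor_sample.items()})
```

Unlike the cluster version, every stratum shows up in the result, so the DO answers can be reported on their own or pooled with the MD answers.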