Statistics Saturdays (most of the time): Normal Distributions
A little late, here’s your slice o’ statistics for this Saturday that looks suspiciously like a Sunday:
The Normal Curve!
(Image: the normal curve, from Wiki)
What is the normal curve? The Normal Curve or normal distribution is a continuous distribution that forms a ‘bell curve’ pattern. There are a lot of things that tend to form a normal distribution: the heights of people, the scores in basketball games, the weights of commemorative memorial plates from the Franklin Mint with pictures of Abe Lincoln saluting an American flag being clutched by an eagle. Most things that are produced by a highly similar process with some element of randomness will tend to produce a normal curve.
What is special about the normal curve? It’s so normal! It’s so very normal that you can do all sorts of good math with it. For mathophobes, hold on to your jimmies, there are going to be a few actual equations.
The first thing I should say here is that to get a real normal curve you generally have to have a very high N. As covered in previous posts, an N below 100 is generally going to have too much randomness in it to be very valuable.
But let’s say we’ve got a good solid N of 1000 weights of Ron Paul/Prince R3volution Commemorative Figurines from the Frank Zappa mint. It’s reasonable to expect these weights to be ‘normal’ because machine stamping will sometimes give a little more or less weight, and weight can be measured on a continuous scale.
First we find the mean, which is just adding up the total weight of all the figurines and dividing it by the number of figurines. So:
Total weight/N.
This is where you start being very glad you live in an age of computers and calculators. The entire calculation for the standard deviation is this:
$$ s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N - 1}} $$
That means:
First you get your mean, $\bar{X}$.
Then for every number in your N, you subtract the mean and then square the result.
Then you add all these numbers together.
Then you divide by N-1.
Then you take the square root of that number.
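The steps above can be sketched in a few lines of Python. This is only an illustration, not something from the original post; the function name `sample_std_dev` and its variable names are made up for the example.

```python
import math

def sample_std_dev(values):
    """Sample standard deviation, following the steps listed above."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to divide by N-1")
    mean = sum(values) / n                             # get the mean
    squared_diffs = [(x - mean) ** 2 for x in values]  # subtract the mean, then square
    total = sum(squared_diffs)                         # add all these numbers together
    variance = total / (n - 1)                         # divide by N-1
    return math.sqrt(variance)                         # take the square root
```

For the values 1, 3, and 5, `sample_std_dev([1, 3, 5])` returns 2.0.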
What you’re left with is the standard deviation, which is roughly the typical distance of a number in the data set from the mean.
Squaring each difference also turns all the negative numbers positive, so values below the mean don’t cancel out values above it. Taking the square root at the end brings the result back to the original units.
So very simply, if you had 1, 3, and 5 as your values, your mean is 3. Plugging this into the equation, we get ((1-3) squared + (3-3) squared + (5-3) squared). This is 4, 0, 4. Summing those together, we get 8. Dividing 8 by N-1, we get 8/2, which is 4. Taking the square root of that, we get 2. The standard deviation of this set of numbers is 2, which we can verify with our naked eyes.
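If you would rather let the computer do it, Python's standard `statistics` module can check the hand calculation. A small sketch, assuming you have Python 3 handy; the variable names are just for illustration.

```python
import statistics

values = [1, 3, 5]
mean = statistics.mean(values)      # total divided by N
std_dev = statistics.stdev(values)  # sample standard deviation (divides by N-1)
print(mean, std_dev)
```

Note that `statistics.stdev` uses the N-1 version of the formula, which is the one worked through here.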
But of course, that’s not that useful for a small set like that. So what is the use of the standard deviation of a normal curve? The most useful thing about it is that, because the curve is regular, we can say with confidence that in a normally distributed set approximately 68.3% of the values will be within one standard deviation of the mean, 95.4% will be within two, and 99.7% will be within three standard deviations. Now, the normal curve can be very broad, so this can cover a very wide area, and the standard deviation itself may be enormous. But whether you’re talking about a normal distribution of things that tend to be very, very similar, like the scales used at the department of Weights and Measures, or a normal distribution of things that have a large range, like adult tree height, the rule still stands.
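Those percentages are not magic numbers; they fall out of the shape of the curve. As a rough sketch (the helper name `fraction_within` is invented for this example), the share of a normal distribution within k standard deviations of the mean can be computed with the error function from Python's `math` module:

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.2%}")
```

The two-standard-deviation figure comes out at about 95.45%, which is why you will see it quoted as either 95.4% or 95.5%.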
Okay, so what’s the takeaway? The main one is that, in any quantitative statistical survey (whether or not it is a normal curve), the standard deviation is critically important! There are a lot of cases where consistency is something that is really important, and standard deviation is a measure of that. Standard deviation is most useful for comparing the variability of two similar things.
For example, if you are the hottest new celebrity chef on Knives of Horror, the brand new cooking/assassination program on TLC, and you have your choice of nanothermite shakers and a limited amount of nanothermite, you want the nanothermite shaker that shows the smallest standard deviation, even if their mean is the same.
So if shaker A has a mean of 11 grams of nanothermite per shake with a standard deviation of 2 grams, with a lethal dose being 9.01 grams, then the lethal dose sits about one standard deviation below the mean. About 68.3% of shakes land within one standard deviation, which leaves roughly 16% in each tail. So you know that roughly 16% of the time you’ll be giving a non-lethal dose with your shaker, and similarly roughly 16% of the time you’ll be giving more than 13 grams, wasting your precious precious nanothermite which you bought off the internet and is totally not made up.
Shaker B, on the other hand, made by actual Shakers, has a mean of 11 grams of nanothermite per shake, but a standard deviation of .67 grams. Since 3 times .67 = 2.01, the lethal dose sits about 3 standard deviations below the mean, and you know that about 99.85% of results fall above that point. So only about .15% of the time will your Shaker Shaker fail to deliver a lethal dose, and you also know that you won’t be wasting that nanothermite.
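The shaker comparison can be checked with `NormalDist` from Python's standard `statistics` module (available from Python 3.8). This is a sketch that assumes each shaker's output really is normally distributed; the variable names are made up for the example.

```python
from statistics import NormalDist

LETHAL_DOSE = 9.01  # grams, from the example above

shaker_a = NormalDist(mu=11, sigma=2)     # same mean, wide spread
shaker_b = NormalDist(mu=11, sigma=0.67)  # same mean, narrow spread

for name, shaker in (("A", shaker_a), ("B", shaker_b)):
    p_fail = shaker.cdf(LETHAL_DOSE)  # chance a shake falls below the lethal dose
    print(f"Shaker {name}: {p_fail:.2%} of shakes are non-lethal")
```

Run as written, shaker A comes out at roughly 16% non-lethal shakes and shaker B at roughly 0.15%, even though both have the same mean.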
So whenever you see a quantitative statistical study purporting to show a normal curve, the first and immediate question should be “What is the standard deviation?” For some things, this is just a matter of interest; for other things, when you want consistency, standard deviation is the whole ball game.