Probability and measurement spread ($\sigma$)

Download this spreadsheet with height and weight data for a sample of adolescents.

One of the columns is height, $h$. Use a spreadsheet function to calculate the average height, which I'll denote as $\overline h$. Check your answer against mine: I found $\overline h\approx 67.85$ inches.

Browser issues!

Your textbook uses "$w$ with a line above it" to mean the average value of $w$. Do the following two items look the same in your browser?

  • $\leftarrow$ A picture of the variable $w$ with a horizontal line above it.
  • $\overline{w}$ $\leftarrow$ MathJax rendering of $\LaTeX$ code \overline w.

If not (e.g. Chrome on a PC), try the Firefox browser. In class, Firefox on a PC appeared to display the overline correctly.

Probability of a discrete variable: weight, $w$

The weights, $w$, of 27 individual adolescents have been measured in pounds, and rounded to the nearest integer. Let $N(w)$ mean "the number of individuals weighing (rounded) $w$ pounds".

Sort the data in the spreadsheet by weight. What are the values of...
$N(143)$=?
$N(144)$=?
$N(145)$=?

Let $P(w)$ represent the probability that someone from this dataset, chosen at random, will have a rounded weight of $w$. Calculate the following probabilities:
$P(143)$=?
$P(144)$=?
$P(145)$=?
$P(w\geq 143)$=
$P(w \geq 80)$=

Which of these relationships make sense or don't make sense...? $$N=\sum_{w=0}^\infty N(w).$$ $$P(w)=\frac{N(w)}{N}.$$ $$\sum_{w=0}^\infty P(w)=1.$$
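
One way to test these relationships is to compute both sides for a small data set. The following is a minimal Python sketch, assuming a short made-up list of rounded weights; the list `weights` is hypothetical and is not the spreadsheet's data.

```python
from collections import Counter

# Hypothetical rounded weights in pounds (illustrative only, not the class data).
weights = [120, 136, 136, 143, 144, 144, 144, 150]

counts = Counter(weights)                    # N(w): number of individuals with rounded weight w
N = sum(counts.values())                     # N = sum over w of N(w)
P = {w: n / N for w, n in counts.items()}    # P(w) = N(w) / N

total_probability = sum(P.values())          # should be 1, up to floating-point rounding
p_at_least_143 = sum(p for w, p in P.items() if w >= 143)   # P(w >= 143)
```

Weights that never occur simply have $N(w)=0$, so they contribute nothing to the sums and can be left out of the dictionary.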

Characterizing a data set

Figure out these...

  1. Most probable weight?
  2. Median weight? (Just as many individuals with weights above as below)
  3. Average weight, written as "$\overline w$"? You will need to carry out a calculation, which you can do in the spreadsheet. Show your result in the row of "averages".
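
If you want to cross-check your spreadsheet results, the same three quantities can be computed with Python's standard `statistics` module. This is only a sketch on a made-up list of weights (hypothetical values, not the class data).

```python
import statistics

# Hypothetical rounded weights in pounds (illustrative only, not the class data).
weights = [120, 136, 136, 143, 144, 144, 144, 150]

most_probable = statistics.mode(weights)     # the weight that occurs most often
median_weight = statistics.median(weights)   # middle value; averages the two middle values when N is even
mean_weight = statistics.mean(weights)       # sum of the weights divided by N
```

For this made-up list the three values differ from one another, which is a reminder that "most probable", "median", and "average" are distinct ways of characterizing a data set.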

Statistics of functions of $w$

See if you can calculate these. For some of these you will need to add an extra column or two and do some spreadsheet calculations.

  1. What is the probability that the square of the weight will be $(144)^2$?
  2. What is the probability that the square of the weight will be $(136)^2 = 18{,}496$?
  3. What is the average value of ($w^2$)? Write this as $\overline{w^2}$.
  4. What is $\overline{ \ln w}$?
  5. What is $(\overline{w})^2$? This should *not* be the same as your answer to #3...

Do these make sense...? (sums, if not written out completely, are implied from 0 to $\infty$)
$$\overline w = \frac{\sum w\,N(w)}{N}=\sum_{w=0}^\infty w\,P(w).$$ $$\overline{ \ln w} = \sum_w \ln(w) \,P(w)$$ $$\overline{w^2} = \sum_w w^2\,P(w)$$

And, in general... $$\overline{ f(w)} = \sum_w f(w)\,P(w)\label{fof}.$$

And in particular, for a constant value, $k$... $$\overline{ k} = \sum_w k\,P(w)=k\sum_w P(w)=k\cdot 1=k\label {fok},$$ because:

  • $k*P(1)+k*P(2)+k*P(3)+\ldots = k*\left(P(1)+P(2)+P(3)+\ldots\right)$
  • and, if we're summing over all possible $w$'s, then the sum of probabilities of all possible weights must be: $\sum_{w=0}^{\infty} P(w)=1.$
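
The general rule $\overline{f(w)}=\sum_w f(w)\,P(w)$ translates almost word for word into code. Here is a minimal Python sketch, assuming a made-up list of weights; the helper name `average` and the constant value 5 are illustrative choices, not part of the worksheet.

```python
import math
from collections import Counter

# Hypothetical rounded weights in pounds (illustrative only, not the class data).
weights = [120, 136, 136, 143, 144, 144, 144, 150]
N = len(weights)
P = {w: n / N for w, n in Counter(weights).items()}   # P(w) = N(w) / N


def average(f, P):
    """Return the average of f(w), computed as the sum over w of f(w) * P(w)."""
    return sum(f(w) * p for w, p in P.items())


mean_w = average(lambda w: w, P)               # average of w
mean_w_squared = average(lambda w: w ** 2, P)  # average of w^2
mean_ln_w = average(math.log, P)               # average of ln(w)
mean_of_constant = average(lambda w: 5.0, P)   # average of a constant k = 5 is just 5
```

Comparing `mean_w_squared` with `mean_w ** 2` shows that the average of the square and the square of the average are different numbers.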

    Measuring the spread of values

    One important descriptive statistic for any data set is how "spread out" the data is about its average value.

    The "deviation" of a particular value $w$ from its average value $\overline w$ is usually written as $$\Delta w = w-\overline w.$$

    There's a column in the spreadsheet for $\Delta w$. Calculate this for each adolescent, and then find $\overline{\Delta w}$.

    Perhaps it is not too surprising that $\overline{\Delta w}$ is fairly small. Since the mean is somewhere in the middle of the values, both positive and negative deviations will occur, and they cancel each other when calculating the average.

    You can show analytically that $\overline{\Delta w}$ should vanish, as follows: $$\begineq \overline{\Delta w}=\overline{ w-\overline w} &= \sum\left(w-\overline{ w }\right)P(w)&\text{ using (\ref{fof})} \\ &= \sum w P(w) -\sum \overline{ w } P(w)&\\ &= \overline{ w } -\overline{ w } \sum P(w)&\ \text{ using (\ref{fok}) since } \overline w \text{ is constant}\\ &= \overline{ w } -\overline{ w } = 0 \endeq$$ So, $\overline{\Delta w}$ is not very useful for measuring the spread of values!
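
The cancellation of positive and negative deviations can also be seen numerically. A short Python sketch, again on a made-up list of weights (hypothetical values, not the class data):

```python
# Hypothetical rounded weights in pounds (illustrative only, not the class data).
weights = [120, 136, 136, 143, 144, 144, 144, 150]

mean_w = sum(weights) / len(weights)

# Deviation of each weight from the average: some positive, some negative.
deviations = [w - mean_w for w in weights]

# Average deviation: zero, apart from possible floating-point rounding.
mean_deviation = sum(deviations) / len(deviations)
```

In a spreadsheet the result may show up as a tiny number such as something times $10^{-15}$ rather than exactly zero; that is rounding error, not a real spread.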




    But $(\Delta w)^2=(w-\overline{ w })^2$ is never negative. So, we might try finding the average value of that, $$\overline{ (\Delta w)^2 } = \overline{ (w - \overline{ w })^2 } \equiv \sigma^2,$$ and then taking the square root to get back a number, $\sigma$, which has the same units as our data. This is called the standard deviation: $$ \sigma=\sqrt{\overline{(w-\overline{w})^2}}.$$ Finally, there is a simpler way to calculate this. Consider: $$\begineq \sigma^2\equiv \overline{ (w - \overline{ w })^2 } &=\overline{ w^2-2w\overline w +\overline{w}^2} \\ &=\overline{w^2}-\overline{2w\overline w} +\overline{\overline{w}^2} \\ &=\overline{w^2}-2\overline w\cdot \overline{w} +\overline{w}^2 \\ &=\overline{w^2}-\overline{w}^2 \endeq$$ In conclusion:

    The standard deviation can be calculated as: $$\sigma = \sqrt{\overline{w^2}-\overline{w}^2 }.$$

    Verify this using your weight data: First, calculate the average value of the weights: $\overline{ w }$. Now we can calculate a value for $\sigma$ in two ways in your spreadsheet, and they should agree:

    1. In a new column, calculate $(w-\overline{ w})^2$ for each weight. At the bottom of that column, find the average value. We call this $\sigma^2$.
    2. In another column, calculate $w^2$ for each weight, and at the bottom calculate the average value. This is $\overline{ w^2}$. Use that number and $\overline{ w }$ to calculate $\overline{ w^2}-(\overline{ w})^2$. This should give you the same number as the $\sigma^2$ you found in step 1. Are they the same?
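
The two routes to $\sigma^2$ can be compared outside the spreadsheet as well. This is a minimal Python sketch, assuming a made-up list of weights (hypothetical values, not the class data); both routes divide by $N$, matching the definition of the average used here.

```python
import math
import statistics

# Hypothetical rounded weights in pounds (illustrative only, not the class data).
weights = [120, 136, 136, 143, 144, 144, 144, 150]
N = len(weights)
mean_w = sum(weights) / N

# Way 1: average of the squared deviations from the mean.
variance_from_deviations = sum((w - mean_w) ** 2 for w in weights) / N

# Way 2: average of the squares minus the square of the average.
mean_w_squared = sum(w ** 2 for w in weights) / N
variance_from_squares = mean_w_squared - mean_w ** 2

sigma = math.sqrt(variance_from_deviations)

# Cross-check: pstdev is the "population" standard deviation, which divides by N.
sigma_check = statistics.pstdev(weights)
```

If you compare against a built-in spreadsheet function, note that some of them divide by $N-1$ rather than $N$. In Excel, for example, `STDEV.P` divides by $N$ and `STDEV.S` divides by $N-1$, so only the former should match the $\sigma$ defined here.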

    In a course on probability, you would find that for a 'Gaussian random variable'...

    • 68% of measurements will fall in the range $\overline{ w } \pm \sigma$
    • 95% of measurements will fall in the range $\overline{ w } \pm 2\sigma$
    • 99.7% of measurements will fall in the range $\overline{ w } \pm 3\sigma$

    ("68-95-99.7" rule).
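
These percentages are not arbitrary. For a Gaussian, the fraction of values within $k$ standard deviations of the mean is $\operatorname{erf}(k/\sqrt 2)$, where erf is the error function. A short Python sketch that evaluates it (the function name `fraction_within` is an illustrative choice):

```python
import math


def fraction_within(k):
    """Fraction of a Gaussian distribution lying within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


# Fractions for 1, 2, and 3 standard deviations.
fractions = {k: fraction_within(k) for k in (1, 2, 3)}
```

The three values come out to approximately 0.683, 0.954, and 0.997.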

    Continuous variables

    Picking a random person and weighing them as precisely as possible, we could talk about the probability that their weight is in the range from $w$ to $w+dw$. This probability is $$\rho(w)\,dw$$ where $\rho(w)$ is the probability density.

    The probability that we will find someone with a weight between $a$ and $b$ pounds: $$P_{ab}=\int_a^b \rho(w)\,dw.$$

    We can generalize our results for discrete variables: $$1=\int_{-\infty}^{+\infty}\rho(w)\,dw,$$ $$\overline{ f(w) } =\int_{-\infty}^{+\infty}f(w)\rho(w)\,dw.$$ It is still the case that $$\sigma^2 = \overline{ (\Delta w)^2 } = \overline{ w^2 } - \overline{ w }^2.$$
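
The continuous formulas can be checked by numerical integration. The sketch below assumes, purely for illustration, a Gaussian probability density with a mean of 140 pounds and a standard deviation of 15 pounds. These parameters, the helper `integrate`, and the limits 130 and 150 are hypothetical choices, not values from the data.

```python
import math

# Assumed (hypothetical) parameters of a Gaussian weight density, in pounds.
MEAN = 140.0
SIGMA = 15.0


def rho(w):
    """Gaussian probability density with the assumed mean and standard deviation."""
    return math.exp(-0.5 * ((w - MEAN) / SIGMA) ** 2) / (SIGMA * math.sqrt(2 * math.pi))


def integrate(g, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of g from a to b."""
    dw = (b - a) / steps
    return sum(g(a + (i + 0.5) * dw) for i in range(steps)) * dw


# The density is negligible beyond 10 sigma, so finite limits stand in for +/- infinity.
lo, hi = MEAN - 10 * SIGMA, MEAN + 10 * SIGMA

normalization = integrate(rho, lo, hi)                        # should be close to 1
mean_w = integrate(lambda w: w * rho(w), lo, hi)              # average of w
mean_w_squared = integrate(lambda w: w ** 2 * rho(w), lo, hi) # average of w^2
variance = mean_w_squared - mean_w ** 2                       # should be close to SIGMA**2
p_ab = integrate(rho, 130.0, 150.0)                           # P_ab for a = 130, b = 150
```

Here the sum over discrete weights has been replaced by a sum over many narrow slices of width $dw$, each contributing $\rho(w)\,dw$, which is exactly the picture behind the integrals above.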