-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathstats-measures.xml
387 lines (273 loc) · 17.5 KB
/
stats-measures.xml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
<?xml version="1.0" encoding="UTF-8" ?>
<!--********************************************************************
Copyright 2015 Robert A. Beezer
This file is part of MathBook XML.
MathBook XML is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 2 or version 3 of the
License (at your option).
MathBook XML is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with MathBook XML. If not, see <http://www.gnu.org/licenses/>.
*********************************************************************-->
<!-- This file is part of the book -->
<!-- -->
<!-- Inquiries into Probability and Statistics - Math S–304 -->
<!-- -->
<!-- Copyright (C) 2016 Thomas W. Judson -->
<chapter xml:id="stats-measures" xmlns:xi="http://www.w3.org/2001/XInclude">
<title>Measuring Central Tendency and Spread</title>
<introduction>
<p>The learning objectives for this chapter are:
<ul>
<li>To understand and be able to analyze data using measures of central tendency<mdash />mean, median, and mode.</li>
<li>To understand and be able to analyze data using percentiles and the five-number summary.</li>
<li>To understand and be able to analyze data using measures of spread<mdash />range, variance, and standard deviation. To be able to compute the standard deviation.</li>
</ul></p>
</introduction>
<section xml:id="section-stats-measures-center">
<title>Measuring the Center</title>
<paragraphs>
<title>The Mean <m>\overline{x}</m></title>
<p>To find the <term>mean</term> of a set of observations, add their values and divide by the number of observations. If the <m>n</m>observations are <m>x_1, x_2, \ldots, x_n</m>, then their mean is
<me>\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}.</me>
For example, the mean of the data set <m>\{ 3, 4, 6, 7\}</m> is
<me>\overline{x} = \frac{3 + 4 + 6 + 7}{4} = 5.</me></p>
</paragraphs>
<paragraphs>
<title>The Median <m>M</m></title>
<p>The <term>median</term> <m>M</m> is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger. To find the median:
<ol>
<li>Arrange all of the observations in order of size, from smallest to largest.</li>
<li>If the number of observations <m>n</m> is odd, the median is the center observation in the list. For example, the median of the data set <m>\{ 3, 4, 6, 7, 10\}</m> is 6.</li>
<li>If the number of observations <m>n</m> in the list is even, then the median is the average of the to center observations in the list. For example the median of the data set <m>\{ 3, 4, 6, 7\}</m> is
<me>\frac{4 + 6}{2} = 5.</me></li>
</ol></p>
</paragraphs>
<exercise>
<statement>
<p>A household's <q>net worth</q> is the total value of the household's possessions and investments less the total of its debts. In 1997, the mean and the median of net worth of American households were <dollar />51,000 and <dollar />212,000. Which of these numbers is the median and which is the mean? Explain your reasoning.</p>
</statement>
</exercise>
<exercise>
<statement>
<p>Which measure of center, the mean or the median, should you use in each of the following situations?
<ol>
<li>Middletown is considering imposing an income tax on its citizens. The city government wants to know the average income of citizens so that it can estimate the total tax base.</li>
<li>In a study of the standard of living of typical families in Middletown, a sociologist estimates the average family income in that city.</li>
</ol></p>
</statement>
</exercise>
<paragraphs>
<title>The Mode</title>
<p>The <term>mode</term> of a distribution is the most frequently occurring observation. For example, the mode of the data set <m>\{ 1, 2, 4, 5, 5, 5, 6, 6, 9\}</m> is 5.</p>
</paragraphs>
<exercise>
<statement>
<p>Find the mean, median, and mode of the following data sets.
<ol>
<li><m>\{0, 50, 50, 50, 50, 50, 50, 100\}</m></li>
<li><m>\{ 0, 50, 100000 \}</m></li>
<li><m>\{ 0, 50, 50, 50, 100000 \}</m></li>
</ol></p>
</statement>
</exercise>
<paragraphs>
<title>Percentiles and the Five-Number Summary</title>
<p>A <term>percentile</term> is the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20<percent /> of the observations may be found. The <term>five-number summary</term> consists of the sample minimum (smallest observation) the lower quartile or first quartile (25th percentile), the median, the upper quartile or third quartile (75th percentile), and the sample maximum (largest observation).</p>
</paragraphs>
<example>
<p>The data in Table<nbsp /><xref ref="table-stats-measures-car-thefts" /> show the number of car thefts in Central City for a period of 20 days. The five-number summary of the data is
<me>(\min = 29, Q_1 = 52.5, \text{median} = 63.5, Q_3 = 72, \max = 82),</me>
and it appears that 29 is an outlier. If the five-number summary of data for car thefts in Starling City over the same 20 day period is
<me>(\min = 30, Q_1 = 40, \text{median} = 55, Q_3 = 70, \max = 90),</me>
we can compare the number of car thefts in the two cities drawing two box-and-whisker plots (Figure<nbsp /><xref ref="figure-stats-measures-car-thefts" />). It appears that car theft is a greater problem in Central City.</p>
</example>
<table xml:id="table-stats-measures-car-thefts">
<tabular halign="center" top="medium">
<row>
<cell>29</cell><cell>48</cell><cell>49</cell><cell>50</cell><cell>52</cell><cell>53</cell><cell>53</cell><cell>55</cell><cell>56</cell><cell>62</cell>
</row>
<row bottom="medium">
<cell>65</cell><cell>66</cell><cell>66</cell><cell>69</cell><cell>72</cell><cell>72</cell><cell>75</cell><cell>78</cell><cell>78</cell><cell>82</cell>
</row>
</tabular>
<caption>Car thefts in Central City</caption>
</table>
<figure xml:id="figure-stats-measures-car-thefts">
<image width="80%" xml:id="stats-measures-car-thefts">
<latex-image-code><![CDATA[\begin{tikzpicture}[scale=1]
\draw (4,0) -- (7,0) -- (7,1) -- (4,1) -- cycle;
\draw (5.25,2) -- (7.2,2) -- (7.2,3) -- (5.25,3) -- cycle;
\draw (5.5,0) -- (5.5,1);
\draw (6.35,2) -- (6.35,3);
\draw (3,0.5) -- (4,0.5);
\draw (7,0.5) -- (9,0.5);
\draw (2.9,2.5) -- (5.25,2.5);
\draw (7.2,2.5) -- (8.2,2.5);
\node [right] at (10,0.5) {Starling City};
\node [right] at (10,2.5) {Central City};
\draw (0,-1) -- (10,-1);
\draw (0,-1) -- (0,3);
\node [below] at (0,-1) {0};
\node [below] at (2,-1) {20};
\node [below] at (4,-1) {40};
\node [below] at (6,-1) {60};
\node [below] at (8,-1) {80};
\node [below] at (10,-1) {100};
\foreach \x in {0,...,10} \filldraw[fill=black, draw=black] (\x,-1) circle (0.05cm);
\end{tikzpicture}]]>
</latex-image-code>
</image>
</figure>
<exercise>
<statement>
<p>Matilda intends to go to graduate school and just completed the GRE exam (similar to the SAT exam but for students applying to graduate school). She scored in the 90th percentile on the GRE exam. Which of the following is true about Matilda's performance on the GRE?
<ol>
<li>Matilda's score is 90<percent /> of the average score.</li>
<li>Matilda got 90<percent /> of the questions right.</li>
<li>90<percent /> of test takers scored at or below Matilda's score.</li>
<li>Matilda scored 90<percent /> better than the average student who took the GRE.</li>
</ol>
</p>
</statement>
</exercise>
<exercise>
<statement>
<p>The following data set is the number of home runs that Babe Ruth hit during his 15-year career in major league baseball.</p>
<p>22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60</p>
<p><ol>
<li>Find the mean number of home runs hit by Ruth during his career.</li>
<li>Find the median number of home runs hit by Ruth.</li>
<li>What is the mode of the number of home runs hit by Ruth?</li>
<li>What is the five-number summary of home runs hit by Ruth?</li>
<li>The career home run data for Barry Bonds and Joe DiMaggio are given in Table<nbsp /><xref ref="table-stats-measures-homeruns" />. Compare the career homeruns of all three players using side-by-side box-and-wisker plots.</li>
<li>Who was Joe DiMaggio's second wife?</li>
</ol></p>
</statement>
</exercise>
<table xml:id="table-stats-measures-homeruns">
<tabular halign="center" top="medium">
<row>
<cell>Barry Bonds</cell><cell>5</cell><cell>16</cell><cell>19</cell><cell>24</cell><cell>25</cell><cell>25</cell><cell>26</cell><cell>28</cell><cell>33</cell><cell>33</cell><cell>32</cell><cell>34</cell><cell>34</cell>
</row>
<row bottom="medium">
<cell></cell><cell>37</cell><cell>37</cell><cell>40</cell><cell>42</cell><cell>45</cell><cell>45</cell><cell>46</cell><cell>46</cell><cell>49</cell><cell>73</cell><cell></cell><cell></cell><cell></cell>
</row>
<row bottom="medium">
<cell>Joe DiMaggio</cell><cell>12</cell><cell>14</cell><cell>20</cell><cell>21</cell><cell>25</cell><cell>29</cell><cell>30</cell><cell>30</cell><cell>31</cell><cell>32</cell><cell>32</cell><cell>39</cell><cell>46</cell>
</row>
</tabular>
<caption>Career homeruns for Barry Bonds and Joe DiMaggio</caption>
</table>
<exercise>
<statement>
<p>The data below show the number of car thefts in Central City for a period of 20 days.</p>
<p>29 48 49 50 52 53 53 55 56 62 65 66 66 69 72 72 75 78 78 82</p>
<p><ol>
<li>Construct a stem-and-leaf plot of the data.</li>
<li>Are there any outliers?</li>
<li>What is the five-number summary of the data?</li>
<li>The five-number summary of data for car thefts in Starling City over the same 20 day period is
<me>(\min = 30, Q_1 = 40, \text{median} = 55, Q_3 = 70, \max = 90).</me>
Compare the number of car thefts in the two cities drawing two box-and-whisker plots.</li>
<li>How do car thefts compare in the two cities?</li>
</ol></p>
</statement>
</exercise>
<exercise>
<statement>
<p>The final medal standings of the top 20 countries participating in the 2008 Summer Olympics are given below:</p>
<p>110, 100, 72, 47, 46, 41, 40, 31, 28, 27, 25, 24, 19, 18, 18, 16, 15, 14, 13, 11</p>
<p><ol>
<li>Find the five-number summary for the 2008 Summer Olympics (min, <m>Q_1</m>, median, <m>Q_3</m>, max).</li>
<li>Are there any outliers? If so, what are they?</li>
<li>The 2012 Summer Olympics has a five-number summary of (12, 17, 24, 40, 104). Draw side-by-side a box-and-whisker plots for these two data sets.</li>
<li>How do the 2008 and 2012 Summer Olympics compare?</li>
</ol></p>
</statement>
</exercise>
<exercise>
<statement>
<p>Jill was 80th in a class of 200, whereas Nathan, who is in the same class, has a percentile rank of 80. Which student has the higher standing in the class?</p>
</statement>
</exercise>
</section>
<section xml:id="section-stats-measures-spread">
<title>Measuring Spread</title>
<p>The mean and median do not tell the entire story about a data set. For example, the data sets <m>\{ 0, 50, 100 \}</m> and <m>\{49, 50, 51\}</m> both have a mean and a median of 50, yet these data sets are vey different. The first is spread out while the second is much more compact. There are at least two ways of measuring how much a data set is spread out. The first is the <term>range</term> of data set. The range is simply the difference between the maximum and minimum values. The range of the first data set is 100, while the range of the second data set is 2.</p>
<p>It might be the case that range does not adequately describe the spread of a data set. Consider the two data sets <m>\{ 0, 1, 50, 99, 100 \}</m> and <m>\{0, 49, 50, 51, 100 \}</m>. Both have a rangle of 100, but data in the first set is concentrated near 0 and 100, while data in the second set is concentrated near the mean of 50. A better way to quantify the spread is to compute the standard deviation. Given a data set <m>\{x_1, x_2, \ldots, x_n\}</m> the <term>standard deviation</term> of this data set is
<me>s = \sqrt{ \frac{(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2 + \cdots (x_n - \overline{x})^2}{n-1}}</me>
Essentially, we are just averaging the squares of the <term>deviations</term> <m>x_1 - \overline{x}</m> and then taking the square root of this average. A high standard deviation means that the data is spread out. A low standard deviation means that the data is centered around the mean. Measuring the spread using the standard deviation works best if the data is distributed symmetrically.</p>
<exercise>
<statement>
<p>In our definition of standard deviation, we averaged the squares of the deviations. Wny don't we just average the devitations and why do we divide by <m>n-1</m> instead of <m>n</m>?</p>
</statement>
<hint>
<p>What is the sum of <m>(x_1 - \overline{x}) + (x_2 - \overline{x}) + \cdots + (x_n - \overline{x})</m>?</p>
</hint>
</exercise>
<exercise>
<statement>
<p>Jack and Jill are students in two different sections of MTH 128. Both took exams last week on the data analysis chapter, although the exams were different for the two sections. In Jack's class, the mean score of all of the exams was 78 with standard deviation 8. Jill's class did not do quite as well. The mean score in her class was only 70 with standard deviation 5. If Jack scored an 90 and Jill scored an 80, who did better on the exam compared to the other students in their respective sections?</p>
</statement>
</exercise>
<exercise>
<statement>
<p>In the <em>2014 Michelin Guide to Pizza Restaurants in Texas</em>, each restaurant has ten of their best selling pizzas rated on a scale of 0 to 100, with 100 be a perfect score. The <em>Michelin Guide</em> publishes the mean and standard deviation of top ten pizzas for each restaurant. For example, <em>Three Aces Pizza</em> has a mean of 70 and a standard deviation of 15. If your friend asks you for a recommendation, you should select a restaurant with
<ol>
<li>a low mean and a low standard deviation.</li>
<li>a high mean and a low standard deviation.</li>
<li>a low mean and a high standard deviation.</li>
<li>a high mean and a high standard deviation.</li>
</ol></p>
</statement>
</exercise>
<exercise>
<statement>
<p>Each of the following set of numbers had a mean of 50.
<md>
<mrow>\text{Set } A: 0, 20, 40, 50, 60, 80, 100</mrow>
<mrow>\text{Set } B: 0, 48, 49, 50, 51, 52, 100</mrow>
<mrow>\text{Set } C: 0, 1, 2, 50, 98, 99, 100</mrow>
</md>
<ol>
<li>Without computing the standard deviation $s$, tell which set has the largest standard deviation and which set has the smallest standard deviation.</li>
<li>How did you arrive at your answer?</li>
<li>Compute the standard deviation for each set.</li>
</ol></p>
</statement>
</exercise>
</section>
<section>
<title>Using R in Sage Cells for Basic Statistics</title>
<p>As before, all you have to do is evaluate the cell. You can change the data and evaluate the cell once again to get a new plot. If you make a mistake, just reload the web page.</p>
<p>Suppose that we want the five-number summary for Babe Ruth's homeruns.</p>
<sage language="r">
<input>
<![CDATA[ruth <- c(22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60)
summary(ruth)]]>
</input>
</sage>
<p>We can also compute the mean and standard deviation.</p>
<sage language="r">
<input>
<![CDATA[ruth <- c(22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60)
mean(ruth)
sd(ruth)]]>
</input>
</sage>
<p>It's also possible to do side-by-side boxplots to compare data sets.</p>
<sage language="r">
<input>
<![CDATA[ruth <- c(22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60)
bonds <- c(16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28)
dimaggio <- c(12, 14, 20, 21, 25, 29, 30, 30, 31, 32, 32, 39, 46)
boxplot(ruth, bonds, dimaggio, names=c("Ruth", "Bonds", "DiMaggio"), main="Comparing Homerun Seasons")]]>
</input>
</sage>
</section>
</chapter>