stats-descript.xml

<?xml version="1.0" encoding="UTF-8" ?>

<!--********************************************************************
Copyright 2015 Robert A. Beezer

This file is part of MathBook XML.

MathBook XML is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 2 or version 3 of the
License (at your option).

MathBook XML is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with MathBook XML.  If not, see <http://www.gnu.org/licenses/>.
*********************************************************************-->
<!-- This file is part of the book    							-->
<!--                                               				-->
<!--   Inquiries into Probability and Statistics - Math S–304	-->
<!--                                                			-->
<!-- Copyright (C) 2016  Thomas W. Judson      					-->

<chapter xml:id="stat-descript" xmlns:xi="http://www.w3.org/2001/XInclude">
	<title>Introduction to Statistics and Descriptive Statistics</title>

    <introduction>
        <p>The learning objectives for this chapter are:
        	<ul>
        		<li>To understand how data collection methods differ and how they affect the nature of the data set.</li>

        		<li>To understand and be able to evaluate different design studies.</li>

                <li>To understand and be able to use sampling.</li>

                <li>To understand observations and experiments.</li>

        		<li>To understand and be able to analyze data using appropriate graphs and tables.</li>

        		<li>To understand how graphs and tables of data can sometimes be misleading.</li>

        	</ul></p>
    </introduction>

    <section xml:id="section-stat-populations">
        <title>Populations and Samples</title>

        <p>In statistics, we study an entire <term>population</term> by taking a subset of the population called a <term>sample</term>.  For example, we may wish to know the outcome of an election.  The population might be all likely voters.  Since it isn't practical to ask everyone how they will vote, we take a subset of likely voters and poll them.  To be useful, a sample must be carefully selected and represent the entire population.  For example, if we want to poll, all Harvard Summer School students on there experience with the 2016 Summer Session, we should probably do mre than just ask the opinion of all students enrolled in mathematics courses.  Students who are studying French or Chemistry might have an entirely different perception of their experience.</p>

        <p>A better method of finding a representative sample would be to take a <term>simple random sample</term> of size <m>n</m>.  That is, we would put the names of all of the students in a hat and then draw <m>n</m> names.  Each sample of size <n>n</n> has an equal chance of being selected as well as each student.</p>

        <p>A variation on the simple random sample is a <term>stratified random sample</term>.  For example, we might divide the population of summer school students into those studying science and mathematics, social sciences, humanities, business and economics, and the arts.  We would then take a simple random sample from each group.  This way we are assured that all groups are represented.</p>

        <p>Of course, we could get a sample simply by asking the first <m>n</m> students that we see.  A sample obtained in this manner is called a <term>convenience sample</term>.  Or we could email everyone and ask them to fill out a questionnaire.  This type of sample is called a <term>voluntary response sample</term>.</p>

        <paragraphs>
            <title>Random Rectangles</title>

            <p>Why do we insist on random samples?  For this activity, you will see a page containing 100 rectangles of different areas.  Each rectangle is composed of small squares.  For example, a rectangle that is 4 squares wide and 3 squares wide has area 12.  The idea of this activity is to compare different methods of estimating the mean area of the 100 rectangles.<fn>This activity can be found in <em>Activity-Based Statistics</em> by R. L. Scheaffer et. al., Spring-Verlag, New York, 1996.</fn></p>
              
        </paragraphs>

        <exercise>
            <statement>
                <p><ol>

                    <li>Your instructor will show you the sheet of random rectangles for a few moments.  Make a guess as to what the mean size of a rectangle is for all 100 rectangles.</li>

                    <li>Select 5 rectangles that you think are representative of the rectangles on the page.  Compute the mean of the 5 rectangles that you selected and report your computation.</li>

                    <li>Using a random number table or a random number generator, select 5 distinct random numbers between 00 and 99.  Find the five corresponding rectangles on the sheet (use 00 for rectangle 100).  Compute the mean of these five rectangles and report your results.</li>

                    <li>How do the results in (1)<ndash />(3) compare?</li>

                    <li>Which works best<mdash />a convenience sample, a random sample, or a wild guess?  Why?</li>
                </ol></p>
            </statement>
        </exercise>

        <!--The mean is 7.420 and the median is 6-->

                <p>It's also possible to do side-by-side boxplots to compare data sets.</p>

        <sage language="r">
            <input>
            <![CDATA[guess <- c(22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60)
            estimate <- c(16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28)
            rand <- c(12, 14, 20, 21, 25, 29, 30, 30, 31, 32, 32, 39, 46)
            boxplot(guess, estimate, rand)]]>
            </input>
        </sage>


        <exercise>
        	<statement>
        		<p>Lily Robinson is running for the U.S. Senate in Utah.  In a poll of 1,400 voters, 428 voters support her.
        			<ol>
        				<li>What is the population for this poll?</li>

        				<li>If <m>P</m> people vote in the election, what is a good prediction of the number that will vote for Ms.Robinson?</li>
        			</ol></p>
    		</statement>
    	</exercise>

    	<exercise>
        	<statement>
        		<p>A survey of shoppers shows that 20 prefer whole milk, 42 prefer low-fat milk, 28 prefer skim milk, and 10 prefer soy milk.
        			<ol>
        				<li>If 600 shoppers come to the store tomorrow, predict how many will buy each kind of milk.</li>

        				<li>Explain how you made your prediction.</li>
        			</ol></p>
    		</statement>
    	</exercise>

    	<exercise>
        	<statement>
        		<p>Classify each of the following as simple random sampling, stratified sampling, or convenience sampling.
        			<ol>
        				<li>A student newspaper selects 10 names at random from the entire student body.</li>

        				<li>A student newspaper interviews proportional numbers of freshmen, sophomores, juniors, and seniors.</li>

        				<li>A researcher interviews all AIDS patients in the hospital closest to the researcher's home.</li>

        			</ol></p>
    		</statement>
    	</exercise>

    	<exercise>
        	<statement>
        		<p>Describe how to take a sample of 30 high school seniors from  300 high school seniors using
        			<ol>
        				<li>a simple random sample.</li>

        				<li>stratified sampling.</li>

        				<li>voluntary response sampling.</li>
        				
        			</ol></p>
    		</statement>
    	</exercise>

        <p>We also have to worry about <term>bias</term> when we sample.  In a biased sample, everyone does not have an equal chance of being selected.  Convenience and voluntary response samples are inherently biased.  Other sources of bias are <term>undercoverage</term> where some part of the population is excluded, <term>nonresponse</term> where someone in a survey refuses to participate, and <term>response</term> where someone in the sample give false or inaccurate information.</p>

    	<exercise>
        	<statement>
        		<p>A sample of 50 people is obtained to rate <em>Frosted-Chocolate Sugar Bombs</em>, a breakfast cereal.  Tell whether each of the following is an example of undercoverage, nonresponse, or response bias.
        			<ol>
        				<li>An interviewer visits a randomly selected set of homes and asks the first 50 people who answer the door whether they eat <em>Frosted-Chocolate Sugar Bombs</em> for breakfast.</li>

        				<li>An interviewer goes to the local Whole Paycheck Food Market and observes what breakfast foods 50 shoppers select.</li>
        				
        			</ol></p>
    		</statement>
    	</exercise>

    	<exercise>
    		<statement>
    			<p>A school finds that 75<percent /> of the students in detention are boys even though only 45<percent /> of the students are boys.  Does this show that the school is biased against boys?  Tell why or why not.</p>
    		</statement>
    	</exercise>

        <p>We have two types of statistical studies<mdash />experiments and observations.  In an <term>observational study</term>, we simple observe an existing population.  For example, we may take a poll of likely voters.  In an <term>experimental study</term>, we actually do something to the sample.  For example, we might test a new drug.  For example, a researcher might treat a group of patients with a new drug and compare the result to a <term>control group</term>.  Patients in the control group would either recieve a placebo or the standard existing treatment.</p>

   		<exercise>
        	<statement>
        		<p>Tell whether each of the following is an observational study or an experiment.
        			<ol>
        				<li>A researcher randomly assigns 30 people to a statistics class using the lecture method and 30 students to an online statistics class.  Their average scores on the final exam are compared.</li>

        				<li>A researcher finds 50 people who are overweight and 50 who are not.  She compares the average number of hours they spend each week watching television.</li>

                        <li>A researcher finds 100 people who talk on their cell phones while driving and 100 people who do not.  The resercher compares the driving records for one year.</li>
        				
        			</ol></p>
    		</statement>
    	</exercise>

        <exercise>
            <statement>
                <p>Suppose you want to know whether or not chicken soup cures the common cold.  How would you proceed?</p>
            </statement>
        </exercise>


    </section>

    <section xml:id="section-stat-graphs-tables">
        <title>Analyzing Data with Graphs and Tables</title>

        <exercise>
        	<statement>
        		<p>The following data set is the number of home runs that Babe Ruth hit during his 15-year career in major league baseball.</p>

        		<p>22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60</p>

        		<p><ol>

        			<li>Draw a dot plot on the number of home runs that Babe Ruth hit during is career.</li>

        			<li>Draw a stem-and-leaf plot of Ruth's home runs.</li>

        			<li>Discuss your answers.</li>

        		</ol></p>
        	</statement>
        </exercise>

        <exercise>
        	<statement>
        		<p>The following data is  the number of home runs that Barry Bonds hit during his  major league career.</p>

        		<p> 16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46, 45, 45, 5, 26, 28</p>

        		<p><ol>

        			<li>Draw back-to-back stem-and-leaf plots of Bond's and Ruth's home runs.</li>

        			<li>Who was the better home run hitter during his career?</li>

        		</ol></p>
        	</statement>
        </exercise>

        <exercise>
        	<statement>
        		<p>Table<nbsp /><xref ref="table-stats-descript-grades" /> shows the grade distribution for the final examination in a statistics course.  Draw an appropriate graph for the data.</p>
        	</statement>
        </exercise>

        <table xml:id="table-stats-descript-grades">
           <tabular halign="center" top="medium">
                <row bottom="medium">
                    <cell>Grade</cell><cell>A</cell><cell>B</cell><cell>C</cell><cell>D</cell><cell>F</cell>
                </row>
                <row bottom="medium">
                    <cell>Frequency</cell><cell>4</cell><cell>10</cell><cell>37</cell><cell>8</cell><cell>1</cell>
                </row>
           </tabular>
            <caption>Final Exam Grades for a Statistics Course</caption>
        </table>

        <exercise>
        	<statement>
        		<p>A cheetah can run 70 mph, a lion can run 50 mph, and grizzly bear can run 30 mph, and a human can run 28 mph.  Draw an appropriate graph to represent this data set.</p>
        	</statement>
        </exercise>

        <exercise>
        	<statement>
        		<p>The following are amounts (to the nearest dollar) paid by 25 students for textbooks during the Fall 1990 term.
        			<me>\begin{array}{ccccc}
					35 \amp 42 \amp 37 \amp 60 \amp 50 \\
					42 \amp 50 \amp 16 \amp 58 \amp 39 \\
					33 \amp 39 \amp 23 \amp 53 \amp 51 \\
					48 \amp 41 \amp 49 \amp 62 \amp 40 \\
					45 \amp 37 \amp 62 \amp 30 \amp 23
					\end{array}</me>
					<ol>

						<li>Construct a grouped frequency table for the data, starting the first class at <dollar />15 with intervals <dollar />5 each.</li>

						<li>Draw a histogram of the data.</li>
						
					</ol></p>
        	</statement>
        </exercise>


        <exercise>
        	<statement>
        		<p>The 1918 flu pandemic (commonly referred to as the Spanish flu) was an influenza pandemic that spread to nearly every part of the world.   The Spanish flu lasted from March 1918 to June 1920.  It is estimated that anywhere from 20 to 100 million people were killed worldwide.  The data on the number of new influenza cases and the number of deaths from the epidemic in San Francisco week by week from October 5, 1918 to January 25, 1919 is given in Table<nbsp /><xref ref="table-stats-descript-flu" />.  The data is given on the last day of the week.<ol>

        			<li>Make a line graph of weekly new cases.  Based on your plot, describe the progress of the epidemic.</li>

        			<li>We would like to compare the patterns over time of the number of new cases with the number of deaths.  To make the new variables similar in size for easier comparison, plot the number of deaths against time for October 5 to January 25, then plot the number of cases divided by 10 on the same graph using a different color.  What do you see?  In particular, how long is the lag between the changes in the number of cases and corresponding changes in deaths?</li>

        			<li>Why was the 1918 flu epidemic called <q>the Spanish flu?</q></li>

        		</ol></p>
        	</statement>
        </exercise>

        <table xml:id="table-stats-descript-flu">
           <tabular halign="center" top="medium">
                <row bottom="medium">
                    <cell>Date</cell><cell>Oct 5</cell><cell>Oct 12</cell><cell>Oct 19</cell><cell>Oct 26</cell><cell>Nov 2</cell><cell>Nov 9</cell>
                </row>
                <row>
                    <cell>Cases</cell><cell>36</cell><cell>531</cell><cell>4233</cell><cell>8682</cell><cell>7164</cell><cell>2229</cell>
                </row>
                <row bottom="medium">
                    <cell>Deaths</cell><cell>0</cell><cell>0</cell><cell>130</cell><cell>522</cell><cell>738</cell><cell>414</cell>
                </row>
                <row bottom="medium">
                    <cell>Date</cell><cell>Nov 16</cell><cell>Nov 23</cell><cell>Nov 30</cell><cell>Dec 7</cell><cell>Dec 14</cell><cell>Dec 21</cell>
                </row>
                <row>
                    <cell>Cases</cell><cell>600</cell><cell>164</cell><cell>57</cell><cell>722</cell><cell>1517</cell><cell>1828</cell>
                </row>
                <row bottom="medium">
                    <cell>Deaths</cell><cell>198</cell><cell>90</cell><cell>56</cell><cell>50</cell><cell>71</cell><cell>137</cell>
                </row>
                <row bottom="medium">
                    <cell>Date</cell><cell>Dec 28</cell><cell>Jan 4</cell><cell>Jan 11</cell><cell>Jan 18</cell><cell>Jan 25</cell><cell></cell>
                </row>
                <row>
                    <cell>Cases</cell><cell>1539</cell><cell>2416</cell><cell>3148</cell><cell>3465</cell><cell>1440</cell><cell></cell>
                </row>
                <row bottom="medium">
                    <cell>Deaths</cell><cell>178</cell><cell>194</cell><cell>290</cell><cell>310</cell><cell>149</cell><cell></cell>
                </row>
           </tabular>
            <caption>Data for the 1918 flu epidemic</caption>
        </table>


		<exercise>
			<statement>
				<p>For each of the following data sets, determine whether the best choice to represent the data graphically is a line graph, a bar graph, or a circle graph (pie chart).
					<ol>

						<li>The average trade-in value of a 2000 Honda Civic for the years 2000<ndash />2014 as a percentage of the original purchase price of the car.</li>

						<li>Car sales by percentage of vehicle type (small, midsize, large, and luxury) in 2011.</li>

						<li>The lengths of selected rivers in the U.S. (Arkansas, Columbia, Mississippi, Missouri, Ohio, and Yukon).</li>

					</ol></p>
			</statement>
		</exercise>


	</section>

    <section>
        <title>Using R in Sage Cells to Plot Graphs</title>

        <p>R is an open source statisical programming language.  While R has a reputation of being difficult to learn, we can program R into Sage cells.  As before, all you have to do is evaluate the cell.  You can change the data and evaluate the cell once again to get a new plot.  If you make a mistake, just reload the web page.  <em>Before evaluating the cell, make sure to select the language to be R from the pull down menu to the right and below the cell.</em></p>

        <p>Suppose that we want to a make a stem-and-leaf plot of Ruth's homeruns.</p>

        <sage language="r">
            <input>
            <![CDATA[ruth <- c(22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60)
            stem(ruth)]]>
            </input>
        </sage>

        <p>We can create histograms. Note that R decides on the width of the bin.</p>

        <sage language="r">
            <input>
            <![CDATA[textbooks <- c( 35, 42, 37, 60, 50, 42, 50, 16, 58, 39, 33, 39, 23, 53, 51, 48, 41, 49, 62, 40, 45, 37, 62, 30, 23)
            hist(textbooks)]]>
            </input>
        </sage>

        <p>We can create pie charts.</p>

        <sage language="r">
            <input>
            <![CDATA[slices <- c(4, 10 , 37, 8, 1)
            lbls <- c("A", "B", "C", "D", "F")
            pie(slices, labels = lbls, main="Final Exam Grades for a Statistics Course")]]>
            </input>
        </sage>

        <p>We can plot the time series for the 1918<ndash />1919 influenze epidemic.  The plot for the number of cases is in red, and the plot for 10 times the number of deaths is in blue.</p>

        <sage language="r">
            <input>
            <![CDATA[cases <- c(36, 531, 4233, 8682, 7164, 2229, 600, 164, 57, 722, 1517, 1828, 1539, 2416, 3148, 3465, 1440)
            deaths <- c(0, 0, 130, 552, 738, 414, 198, 90, 56, 50, 71, 137, 178, 194, 290, 310, 149) 
            plot(ts(cases), col="red", main="Influenza Epidemic")
            lines(ts(10*deaths),col="blue")]]>
            </input>
        </sage>

    </section>


</chapter>