
STATISTICS Notes
Notes Link for: HamburgerData
Notes Link for: rCalculation
Notes Link for: FoodData
Notes Link for videos of: PinkSlime
http://www.cc.com/videoclips/0mg8t8/thecolbertreportstephenssoundadvicehowtoacethesats
Year
National (Public & Private Combined)
California (Public & Private Combined)
California (Public Schools Only)
2011-12
Percent of seniors tested
39%
Critical Reading
496
495
491
Math
514
512
510
Writing
488
496
491
Total
1,498
1,503
1,492
2010-11
Percent of seniors tested
53%
38%
Critical Reading
497
499
495
Math
514
515
513
Writing
489
499
494
Total
1,500
1,513
1,502
2009-10
Percent of seniors tested
47%
50%
33%
Critical Reading
501
501
501
Math
516
516
520
Writing
492
500
500
Total
1,509
1,517
1,521
2008-09
Percent of seniors tested
46%
49%
35%
Critical Reading
501
500
495
Math
515
513
513
Writing
493
498
494
Total
1,509
1,511
1,502
2007-08
Percent of seniors tested
45%
48%
36%
Critical Reading
502
499
494
Math
515
515
513
Writing
494
498
493
Total
1,511
1,512
1,500
2006-07
Percent of seniors tested
48%
49%
37%
Critical Reading
502
499
493
Math
515
516
513
Writing
494
498
491
Total
1,511
1,513
1,497
2005-06
Percent of seniors tested
48%
49%
37%
Critical Reading
503
501
495
Math
518
518
516
Writing
497
501
495
Total
1,518
1,520
1,506
2004-05
Percent of seniors tested
49%
50%
36%
Verbal
508
504
499
Math
520
522
521
Total
1,028
1,026
1,020
2003-04
Percent of seniors tested
48%
49%
Verbal
508
501
496
Math
518
519
519
Total
1,026
1,020
1,015
2002-03
Percent of seniors tested
48%
54%
46%
Verbal
507
499
494
Math
519
519
518
Total
1,026
1,018
1,012
2001-02
Percent of seniors tested
46%
52%
37%
Verbal
504
496
490
Math
516
517
516
Total
1,020
1,013
1,006
2000-01
Percent of seniors tested
37%
Verbal
506
498
492
Math
514
517
516
Total
1,020
1,015
1,008
1999-00
Percent of seniors tested
44%
49%
37%
Verbal
505
497
492
Math
514
518
517
Total
1,019
1,015
1,009
1998-99
Percent of seniors tested
43%
49%
37%
Verbal
505
497
492
Math
511
514
513
Total
1,016
1,011
1,005
1997-98
Percent of seniors tested
43%
41%
36%
Verbal
505
497
491
Math
512
516
516
Total
1,017
1,013
1,007
1996-97
Percent of seniors tested
42%
41%
36%
Verbal
505
496
490
Math
511
514
514
Total
1,016
1,010
1,004
1995-96
Percent of seniors tested
41%
42%
37%
Verbal
505
495
490
Math
508
511
511
Total
1,013
1,006
1,001
1994-95
Percent of seniors tested
41%
41%
36%
Verbal
504
492
488
Math
506
509
509
Total
1,010
1,001
997
1993-94
Percent of seniors tested
42%
42%
37%
Verbal
499
489
484
Math
504
506
507
Total
1,003
995
991
1992-93
Percent of seniors tested
43%
41%
36%
Verbal
500
491
486
Math
503
508
508
Total
1,003
999
994
^{1} In 2005-06, the total possible score changed from 1,600 to 2,400.
Source: California Department of Education, Policy and Evaluation Division
SAT Correlations
Which color is the Math section and which is the Verbal section?
Which correlations are graphed below? (Choose from IQ, Math, Verbal, and SAT total.)
Pass the cursor over each graph to see the answer.
Which is more closely correlated to IQ: Math or Verbal SAT score?
Now let's combine them:
Can the combined SAT Verbal-Math score be used to calculate IQ scores? Should it?
Calculate the z-scores and make a scatterplot.
Z-scores plot and slope
How are these correlated (SAT, Math, Verbal, IQ)?
SAT Distributions
Standardized Tests by Major
IQ Distributions
Cultural biases in IQ tests: http://www.psychpage.com/learning/library/intell/biased.html
Enter data for height into L1 and spare change into L2.
Make a histogram, box plot, then graph both together.
Period 3 Height Distributions:
n = 21 students, mean = 67.6", population Standard Dev is 3.427, sample StdDev is 2.34.
min = 59, Q1 = 65, Med = 68, Q3 = 70.5, and Max = 73 inches
This was the spare change set of data from L2:
and the related graphs:
Here's Period 5:
heights
cash on hand
Standard Deviation notes link:
Standard Deviation Basic Example
For a finite set of numbers, the standard deviation is found by taking the square root of the average of the squared differences of the values from their average value. For example, consider a population consisting of the following eight values:
These eight data points have the mean (average) of 5:
First, calculate the difference of each data point from the mean, and square the result of each:
Next, calculate the mean of these values, and take the square root:
This quantity is the population standard deviation, and is equal to the square root of the variance. This formula is valid only if the eight values with which we began form the complete population. If the values instead were a random sample drawn from some larger parent population, then we would have divided by 7 (which is n−1) instead of 8 (which is n) in the denominator of the last formula, and then the quantity thus obtained would be called the sample standard deviation. Dividing by n−1 gives a better estimate of the population standard deviation than dividing by n.
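The eight values themselves did not survive in these notes. As a minimal sketch, assume the data set 2, 4, 4, 4, 5, 5, 7, 9 (an assumption; it is a common textbook set whose mean is 5, which matches the text). The Python below walks through the same steps, dividing by n for the population figure and by n − 1 for the sample figure.

```python
import math

# ASSUMED data: the original eight values are missing from the notes.
# This set is only an illustration whose mean happens to be 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Squared difference of each data point from the mean.
squared_diffs = [(v - mean) ** 2 for v in values]

# Population standard deviation: average the squares (divide by n).
population_sd = math.sqrt(sum(squared_diffs) / n)

# Sample standard deviation: divide by n - 1 instead.
sample_sd = math.sqrt(sum(squared_diffs) / (n - 1))

print("mean:", mean)
print("population SD:", population_sd)
print("sample SD:", round(sample_sd, 3))
```

With these assumed values the sample figure comes out slightly larger than the population figure, which is the effect of the smaller denominator.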
As a slightly more complicated real-life example, the average height for adult men in the United States is about 70 inches, with a standard deviation of around 3 inches. This means that most men (about 68 percent, assuming a normal distribution) have a height within 3 inches of the mean (67–73 inches) – one standard deviation – and almost all men (about 95%) have a height within 6 inches of the mean (64–76 inches) – two standard deviations. If the standard deviation were zero, then all men would be exactly 70 inches tall. If the standard deviation were 20 inches, then men would have much more variable heights, with a typical range of about 50–90 inches. Three standard deviations account for 99.7 percent of the sample population being studied, assuming the distribution is normal (bell-shaped).
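The 68, 95 and 99.7 percent figures can be checked against a normal model. This is a small sketch using Python's standard-library `statistics.NormalDist`, with the mean of 70 inches and standard deviation of 3 inches taken from the paragraph above; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

# Figures from the paragraph above: mean 70 in, standard deviation 3 in.
heights = NormalDist(mu=70, sigma=3)

def share_within(k):
    """Fraction of the normal model within k standard deviations of the mean."""
    low = heights.mean - k * heights.stdev
    high = heights.mean + k * heights.stdev
    return heights.cdf(high) - heights.cdf(low)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```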
Country | Average male height | Average female height | Stature ratio (male to female) | Sample population / age range | Share of pop. over 15 covered^{[51]} | Methodology | Year | Source
U.S. | 177.6 cm (5 ft 10 in) | 163.2 cm (5 ft 4 1/2 in) | 1.09 | All Americans, 20–29 | 17.4% | Measured | 2003–2006 | ^{[151]}
U.S. | 178 cm (5 ft 10 in) | 163.2 cm (5 ft 4 1/2 in) | 1.09 | Black Americans, 20–39 | N/A | Measured | 2003–2006 | ^{[151]}
U.S. | 170.6 cm (5 ft 7 in) | 158.7 cm (5 ft 2 1/2 in) | 1.07 | Mexican Americans, 20–39 | N/A | Measured | 2003–2006 | ^{[151]}
U.S. | 178.9 cm (5 ft 10 1/2 in) | 164.8 cm (5 ft 5 in) | 1.09 | White Americans, 20–39 | N/A | Measured | 2003–2006 | ^{[151]}
For more on standard deviation check out this link:
Standard Deviation notes link:
Variance
The Variance is defined as:
The average of the squared differences from the Mean.
Example
You and your friends have just measured the heights of your dogs (in millimeters):
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Your first step is to find the Mean:
Answer:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394
so the mean (average) height is 394 mm. Let's plot this on the chart:
Now we calculate each dog's difference from the Mean:
To calculate the Variance, take each difference, square it, and then average the result:
So, the Variance is 21,704.
And the Standard Deviation is just the square root of Variance, so:
Standard Deviation: σ = √21,704 = 147.32... = 147 (to the nearest mm)
And the good thing about the Standard Deviation is that it is useful. Now we can show which heights are within one Standard Deviation (147mm) of the Mean:
So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.
Rottweilers are tall dogs. And Dachshunds are a bit short ... but don't tell them!
Now try the Standard Deviation Calculator.
But ... there is a small change with Sample Data
Our example was for a Population (the 5 dogs were the only dogs we were interested in).
But if the data is a Sample (a selection taken from a bigger Population), then the calculation changes!
When you have "N" data values that are:
 The Population: divide by N when calculating Variance (like we did)
 A Sample: divide by N−1 when calculating Variance
All other calculations stay the same, including how we calculated the mean.
Example: if our 5 dogs were just a sample of a bigger population of dogs, we would divide by 4 instead of 5 like this:
Sample Variance = 108,520 / 4 = 27,130
Sample Standard Deviation = √27,130 = 164.71... = 165 (to the nearest mm)
Think of it as a "correction" when your data is only a sample.
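The dog-height example can be reproduced step by step. The sketch below uses only the five heights given above and shows the one place where the population and sample calculations differ (the denominator); the variable names are illustrative.

```python
import math

# Shoulder heights of the five dogs, in millimeters (from the example above).
heights_mm = [600, 470, 170, 430, 300]

n = len(heights_mm)
mean = sum(heights_mm) / n

# Each dog's difference from the mean, then the sum of the squared differences.
differences = [h - mean for h in heights_mm]
sum_of_squares = sum(d ** 2 for d in differences)

# Population: divide by N.
population_variance = sum_of_squares / n
population_sd = math.sqrt(population_variance)

# Sample: divide by N - 1.
sample_variance = sum_of_squares / (n - 1)
sample_sd = math.sqrt(sample_variance)

print("differences:", differences)
print("population variance:", population_variance, "SD:", round(population_sd, 2))
print("sample variance:", sample_variance, "SD:", round(sample_sd, 2))
```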
Formulas
Here are the two formulas, explained at Standard Deviation Formulas if you want to know more:
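The "Population Standard Deviation": $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
The "Sample Standard Deviation": $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$
Looks complicated, but the important change is to divide by N−1 (instead of N) when calculating a Sample Variance.
*Footnote: Why square the differences?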
If we just added up the differences from the mean ... the negatives would cancel the positives:
(4 + 4 − 4 − 4) / 4 = 0
So that won't work. How about we use absolute values?
(|4| + |4| + |−4| + |−4|) / 4 = (4 + 4 + 4 + 4) / 4 = 4
That looks good (and is the Mean Deviation), but what about this case:
(|7| + |1| + |−6| + |−2|) / 4 = (7 + 1 + 6 + 2) / 4 = 4
Oh no! It also gives a value of 4, even though the differences are more spread out!
So let us try squaring each difference (and taking the square root at the end):
√((4^{2} + 4^{2} + 4^{2} + 4^{2}) / 4) = √(64 / 4) = 4
√((7^{2} + 1^{2} + 6^{2} + 2^{2}) / 4) = √(90 / 4) = 4.74...
That is nice! The Standard Deviation is bigger when the differences are more spread out ... just what we want!
In fact this method is a similar idea to distance between points, just applied in a different way.
And it is easier to use algebra on squares and square roots than absolute values, which makes the standard deviation easy to use in other areas of mathematics.
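The footnote's comparison can be run directly. This short sketch takes the two sets of differences from the footnote and computes both the mean deviation (absolute values) and the root of the mean square; the function names are illustrative.

```python
import math

def mean_deviation(diffs):
    """Average of the absolute differences."""
    return sum(abs(d) for d in diffs) / len(diffs)

def root_mean_square(diffs):
    """Square each difference, average, then take the square root."""
    return math.sqrt(sum(d ** 2 for d in diffs) / len(diffs))

even_spread = [4, 4, -4, -4]      # first case in the footnote
uneven_spread = [7, 1, -6, -2]    # second, more spread-out case

for name, diffs in (("even", even_spread), ("uneven", uneven_spread)):
    print(name, mean_deviation(diffs), round(root_mean_square(diffs), 2))
```

Both sets share a mean deviation of 4, but only the squared version separates them.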
Inconvenient Graphs
Population Growth (timeplot)
Hottest years on record
(Note these are bar graphs of categorical data even though the horizontal axis is time.)
Bar graphs and histograms (which are essentially adjacent bar graphs) provide a visual representation of quantitative data.
The bars are called "bins" and the height of each depends on the quantity that falls into that range.
Here is a histogram of all the scores scored by a basketball team in NBA history:
What is the mode (most common score)?
Is the distribution symmetric or skewed?
Here is the graph for baseball runs in MLB history:
How does it compare to the NBA graph?
What is the mode?
Where is the tail, right or left (this is the skewed direction)?
An interesting difference pops up in the most common scores in NFL history:
What is different about this histogram distribution compared to the previous ones (NBA, MLB)? Why?
What is the mode?
What score is never possible?
Which other score, while possible, appears to have never happened?
Here's a graph of the perceived versus actual scores of NFL games. Note the differences:
When I looked up NHL history scores, I could only find one for the 2010-2011 season:
How is this graph skewed?
What is its mode?
Interestingly, its distribution follows a familiar law:
Calculate the percentage of each bin above (total games = 787) and draw a histogram of its relative frequency.
How does it compare to this?
Do you think other sports scores follow Benford's Law?
Why or why not?
How could you find out from these graphs?
Here is a look at the age of hockey players in NHL history:
Compare that to teacher age distribution over the years:
Why is it changing?
Contrast that with the years of experience a public teacher has over the years:
What is the mode in each of the three graphs below?
What is a better descriptor of the 'middle' of each graph: mean, median, or mode?
April Fool's Day Joke - Complex Numbers video
The two scatter plots below have different viewing windows: (300 < y < 700) vs. (700 < y < 2000).
We wish to standardize both sets of data onto the same viewing window, roughly −4 < x < 4 by −4 < y < 4.
Do this by entering (L1 − mean(L1)) / stdDev(L1) and storing the results into L4.
To calculate the calorie z-scores, enter (L3 − mean(L3)) / stdDev(L3) and store the results into L5.
Then make a scatterplot of calorie vs. fat z-scores by pressing 2nd, Y=, Plot1 "On", Type: select scatter (1st one), XList: L4, YList: L5, ZOOM 9.
To calculate the linear regression equation press STAT, select "CALC", press 4 for LinReg(ax+b) L4, L5, then ENTER.
To graph that equation press Y=, VARS, 5: Statistics, highlight "EQ", press 1 "RegEQ" and ENTER. Then GRAPH.
Do the same for the sodium mg vs. fat g regression plot by repeating this process on list L2.
Note how the y-intercept b is really close to zero (in fact, b = 0).
This is because we subtracted the mean (center) of the x-data and the mean of the y-data.
This translation shifts the center of the data to the origin.
Also note how the correlation factor r is the same as a, the slope of the z-scores (so a = r).
April 29, 2013
Find the mean (measure of center) and standard deviation (measure of spread) for this sandwich data:
(L1 is fat grams, L2 is sodium milligrams, L3 is calorie content)
Find the z-score for the KFC Double Down sandwich, then locate its data point on the scatter plot.
Find the z-score for the KFC Double Down sandwich's sodium content.
Where does 1.8 standard deviations fit in the 68-95-99.7 rule?
April 23, 2013
The Rollie Egg Master:
In the Super Size Me DVD bonus feature "The Smoking Fry" the intern threw out the McDonald's fries after ten weeks before they showed any sign of decomposition.
The director challenged viewers to continue the experiment.
Happy Meal decomposition:
April 22, 2013
Finding summations:
Break down the formula by first calculating the numerator, then each radicand in the denominator.
Since r = 0.788, there is a positive correlation and a fairly linear relationship.
March 26th, 2013
Prudential commercial featuring live-time dot plot:
What are the modes?
Which is higher: mean or median?
Which do you think is the best measure of central tendency?
Is this a normal bell curve?
Is it symmetrical or skewed?
March 7th, 2013
"Statistical Analysis of HR Kings"
The text had an exercise involving Roger Maris' home run record of 61*. Let's analyze data for the other home run kings. Here are the number of home runs by season for Babe Ruth, Barry Bonds, Mark McGwire, and Sammy Sosa:
Directions:
To enter 3 into a stem-and-leaf plot, list it as "03" with a '0' in the left column and a '3' to the right.
If there are no entries in a decade range (e.g. no 50s), still write a '5' in the left column and nothing to the right. Why?
So that the outlying values are easier to identify. Remember, outlining a stem-and-leaf plot makes it appear as a histogram.
The median and mean are central measures. Outlying values tend to drag the mean off center. The median is often better centered.
Measures of spread are IQR and standard deviation. Which measures are larger in these examples?
A box plot provides a visual representation of both center and spread. When there are multiple box plots, draw them onto the same graph vertically instead of horizontally. (Note: Graphing calculators plot them horizontally.)
May 8th, 2012
"The Birthday Problem"
What's the probability that two people in a room have the same birthday? How many people does it take before the odds are even?
For the 1st person, the chances are 100% that there is no match because there is no one with whom to match. So P(No Match) = 100%.
The event of a match is the complementary event, so P(M) = 1 − P(not M) = 0%.
The next person has a 364/365 chance of not matching the first, and a 1/365 chance of matching. The third person has a 363/365 chance of not matching the previous two. Treating birthdays as independent events, we multiply: (365*364)/365^2 gives 0.997 for no match, or about 0.3% for a match. Multiplying this answer by 363/365 gives the chances of the third person not matching the other two: 365*364*363/(365^3) = 99.18% for no match, or 0.8% for a match among three. For 4 it's (365*364*363*362) / (365*365*365*365).
The numerator is the permutation formula 365 Permute 4, or 365 nPr 4 on a graphing calculator. The graphing calculator gives up after 40 since the numbers involved exceed a googol (a 1 followed by 100 zeros!). However, we can continue by multiplying the previous answer by (365 − 40)/365, or 325/365, and fill in the rest of the table:
So after 23 people, chances are about 50-50 of two having the same birthday. Indeed, both statistics classes had two people with matching birthdays, November 11th in 1st period and November 7th in 3rd.
When are you 100% guaranteed of two having a match?
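The table, and the question just posed, can be explored numerically without running into the calculator's overflow. The sketch below multiplies the running "no match" probability by (365 − k)/365 one person at a time, exactly as described above; it assumes 365 equally likely birthdays (no leap day), and the function name `prob_match` is made up for illustration.

```python
def prob_match(people, days=365):
    """Probability that at least two of `people` share a birthday.

    Assumes `days` equally likely birthdays and ignores February 29.
    """
    p_no_match = 1.0
    for k in range(people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_no_match *= (days - k) / days
    return 1.0 - p_no_match

# First room size at which a match is at least as likely as not.
first_even_odds = next(n for n in range(1, 366) if prob_match(n) >= 0.5)

for n in (1, 2, 3, 4, 23, 40):
    print(n, f"{prob_match(n):.4%}")
print("even odds reached at", first_even_odds, "people")
```

Because the running product is updated one factor at a time, no intermediate value ever gets large, so the loop can continue well past 40 people.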
Strictly speaking, a match is guaranteed only once there are 366 people in the room (367 if February 29 birthdays count), since there are only 365 possible birthdays; but rounded to two decimal places, the probability reaches 100% at 83 people.
The graph illustrates how quickly the chances climb:
That means of the 80 teachers at Tam, odds are excellent that two have the same birthday.
May 3, 2012
"The Monty Hall Problem"
There are two goats and a car behind three curtains.
Choose a curtain number to win the car.
What are your chances of winning the car?
You made your pick and the host shows you a goat behind one of the curtains you did not select.
(Can he always do this?)
Then he offers you the chance to stick with your pick, or switch to the other curtain.
What should you do?
Most people think switching doesn't matter because there are now two possibilities, so it's 50/50.
However, there are nine possibilities: your choice of curtain (1, 2, or 3) combined with where the car is (1, 2, or 3).
Since 3 by 3 is 9 equally likely cases, there is no way of having a 50% chance, because 9/2 = 4.5 and you can't play this game four and a half times. But what actually is the chance of winning when you switch, if not 50%?
Of the 9 possibilities, 6 result in a win by switching, and 3 in a loss. Therefore, the probability of winning if you switch is 6/9 = 2/3, or about 66.7%.
Think of it this way. Your chance of initially winning the car is 1 out of 3. Therefore, when the host asks you later if you want to switch picks, he's offering a chance to get out of a 2/3 losing proposition. Taking it doubles your chances of winning.
Scene from Numb3rs episode:
Lewis Black summarizes the food issues in the news of late:
April 24, 2012
Group quiz Ch. 7-8
StarBugs Frappuccino, infinite Hot Dog-Pizza continuum, and Pharmaceutical Poultry
April 23, 2012
Take-Home Reading Quiz Ch. 4-8
Correlation factor computation on the circle x^2 + y^2 = 25 and why r = 0 matches no line of best fit.
April 16-20, 2012
Super Size Me video (STAR Testing week)
April 13, 2012
Bill Maher covers the "pink slime" issue and you might recognize some of the other food images along the way.
(Also, why Del Taco's drive-thru is more advanced than the Korean rocket program) in his New Rules segment during Friday of our Spring Break.
Bill Maher and tube sox compared to the KFC Double Down, and hot dog stuffed crust pizza
Bill Maher and pink slime
April 5, 2012
The Simpsons' Little Lisa's Animal Slurry, the Krusty Ribwich and Stephen Colbert's apology:
"Linear Models"
April 3, 2012
The night of the previous lesson on hamburger statistics, Stephen Colbert aired this segment:
Let's add the data for KFC's Double Down sandwich (8th entry) of 37 fat g, 1880 sodium mg, and 610 calories to the list, along with the 10 fat g, 760 sodium mg, and 330 calories of a veggie burger.
Note the slope changed from 6 to 15.7 and the old y-intercept of 930 became 678.
The new correlation factor is:
This is an improvement over the old factor of r = 0.199 (based just on hamburgers), or 0.23 (with the Double Down added), though it is still not a strong correlation. This is primarily due to the chicken sandwich's sodium amount being a huge outlier of the data set. Why did it change so much? To answer that, we must look at the z-scores:
The mean fat grams was 31.9, and its standard deviation 10.67, so z = (fat − 31.9)/10.67.
For sodium, the mean was 1178.9 mg with a standard deviation of 357.9 mg.
Applying the formula to the chicken sandwich, z = (1880 − 1178.9) / 357.9 = 1.96.
The Double Down is nearly two standard deviations above the mean in sodium, which under a normal model puts it near the 97.5th percentile (the top 2.5%). From its 37 fat grams, the linear model would have predicted Y(37) = 15.7(37) + 678 = 1259 mg of sodium, so the residual (actual minus predicted) is 1880 − 1259 = 621. The model underpredicts because the actual data point was 621 above the line. Calculating a linear regression equation for these z-scores gives:
Note how the slope a is the correlation factor r. Can you spot the veggie burger point? Although it is a minimum of both the sodium and the fat data, it falls very close to the linear model's prediction, z-score-wise.
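The Double Down arithmetic above, and the claim that the regression slope on z-scores equals r, can both be checked in a few lines. The first part uses only figures quoted in the notes. The second part uses a small made-up data set (the class sandwich data is not reproduced here, so `fat` and `sodium` below are hypothetical), and the helper names are illustrative.

```python
import math
from statistics import NormalDist

# --- Part 1: figures quoted in the notes above ---
MEAN_SODIUM, SD_SODIUM = 1178.9, 357.9   # mg
SLOPE, INTERCEPT = 15.7, 678             # sodium = 15.7 * fat + 678
dd_fat, dd_sodium = 37, 1880             # KFC Double Down

z_sodium = (dd_sodium - MEAN_SODIUM) / SD_SODIUM
predicted = SLOPE * dd_fat + INTERCEPT
residual = dd_sodium - predicted          # actual minus predicted
share_below = NormalDist().cdf(z_sodium)  # assumes a normal model

print(round(z_sodium, 2), round(predicted), round(residual), f"{share_below:.1%}")

# --- Part 2: slope of the z-score regression equals r ---
def z_scores(data):
    """Standardize with the sample standard deviation (divide by n - 1)."""
    m = sum(data) / len(data)
    s = math.sqrt(sum((x - m) ** 2 for x in data) / (len(data) - 1))
    return [(x - m) / s for x in data]

def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# HYPOTHETICAL data, invented for illustration only.
fat = [12, 20, 25, 31, 38, 44]
sodium = [700, 980, 900, 1250, 1100, 1500]

zx, zy = z_scores(fat), z_scores(sodium)
r = sum(p * q for p, q in zip(zx, zy)) / (len(fat) - 1)
a, b = fit_line(zx, zy)
print("r =", round(r, 4), "slope =", round(a, 4), "intercept =", round(b, 4))
```

Whatever data are used in the second part, the fitted intercept on z-scores comes out at zero (up to rounding) and the slope matches r.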
Let's take a look at the other stat plot: fat g versus calories.
The point (10, 330) is in the lower left, a minimum value for both x and y; however, it falls very close to the prediction point (10, 327). Consequently, its residual is small. However, on this plot even the KFC Double Down falls close to the line, right in the middle of the densest cluster of hamburger data. Its calorie level is neither a maximum nor an outlier. Therefore, we should expect a high correlation factor. Indeed, it is:
The correlation is 0.98, very close to 1, indicating a strong positive correlation between fat grams and calories. Since 0.98 is the slope of the z-scores, how do we convert that to the slope for the linear model? Moving one standard deviation over rises 0.98 standard deviations up. Multiplying r by the ratio of the actual standard deviations gives the slope of the linear model, a = r * Sy / Sx. The standard deviation, Sy, for calories was 117.26 (recall that Sx for the fat g was 10.67).
0.98 * (117.26 / 10.67) = 10.8
Since the z-score plot passes through the origin, the linear model passes through the point of means (mean(x), mean(y)), which in this case is (31.9, 563.3), because the mean calories is 563.3.
So the y-intercept is calculated from y = ax + b as
b = y − ax = 563.3 − 10.8*31.9 = 218.78
Therefore, the equation is y = 10.8x + 219 and the linear model reads:
calories = 10.8 (fat) + 219 calories
where 10.8 is the rate of calories per fat gram, i.e. each additional gram of fat goes with about 10.8 more calories in the sandwich, whether vegetarian, bunless fried chicken, or hamburger. If the sandwich has 0 fat grams, the model predicts 219 calories from the rest of the fatless ingredients. This is very similar to our older model from just the hamburger data:
April 2, 2012
Whereas in algebra we say x is the independent variable and y the dependent variable (i.e. y depends on x), in statistics we say x is the explanatory variable and y the response variable.
"If I change x, how does that affect y?"
Sometimes it does, sometimes it doesn't, and there may be a lurking variable hiding somewhere.
Enter the data above into lists L1, L2, and L3.
Then make two scatter plots: the first of sodium mg against fat grams, the second of calories vs. fat grams.
Which plot satisfies the "straight enough" condition? Which plot has too many outliers scattered?
Can you match each equation above with its proper scatter plot?
Plot both sets of data together to see if there is any correlation.
Clearly, one line is a better fit to its data than the other. We can mathematically describe its fit with a formula.
You can calculate by hand using a calculator step by step:
And then combine the results to calculate the correlation factor r:
Once you calculate the regression equation (STAT, CALC, 4: LinReg L1, L2), you can recall the value of r by pressing VARS, 5: Statistics, "EQ", 7: r, without doing the complicated formula.
By replacing L2 (sodium) with L3 (calories) and repeating the calculations above, show that the correlation between fat grams and calories is r = 0.96.
Since calories are strongly related to fat grams, while sodium is only weakly related, what do you think the correlation factor between sodium and calories is? Would it be closer to 0.96 or to 0.199? Make a scatter plot and find the regression equation between sodium mg in L2 and calories in L3.
By way of contrast, KFC's Double Down sandwich has 1880 mg of sodium, 610 calories, and 37 fat grams. Tomorrow we'll look at how that fits in with the hamburgers. So if you do not wish to eat "pink slime", there's always this:
March 13th, 2012
"Normal Models - IQ and SAT scores"
"Why do I keep doing this? Is 70 not a good IQ score?!" - 30 Rock, "St. Patrick's Day", aired March 15, 2012
This model can be applied to IQ scores with a mean parameter of 100 and a standard deviation of 15.
Note that 115 = 100 + 15 is one standard deviation (+15) above the mean (100).
So a z-score of 2 now corresponds to 130.
Also, applying the 68-95-99.7 rule allows a percentile or top percentage to be computed. For example, a z-score of 1, or one standard deviation above the mean, translates to half of 68%, or 34%, above the middle. 50% + 34% = 84%. The top 16% is the 84th percentile (better than 84% of the population).
This explains why they can make the minimum SAT score 200 instead of 0. Only 0.15% of the population scores more than 3 standard deviations below the mean, so any score that might have been 0-199 can be given a 200 without loss of much statistical significance. This makes the SAT normal model symmetric to the high score of 800. To find what percentile an SAT score of 680 is:
100 − 3.6 = 96.4, so an SAT score of 680 is in the 96.4th percentile.
A year ago last March, Tim Shriver, Chairman of the Special Olympics, gave an interview on the Stephen Colbert show talking about the language used to describe people with intellectual differences. In the 1970s, when the public started to use these descriptors in a hurtful manner, psychologists shifted away from that terminology. Think of the word "cretin." At one point, its official definition was to describe someone whose IQ fell between 85 and 100, one standard deviation below the mean.
Here is the interview:
March 8th, 2012
"Zeno's Paradoxes"
On The Daily Show last night, Jon Stewart used Zeno's Dichotomy Paradox to describe the race for the Republican Presidential Primary after Super Tuesday. I had referred to Zeno on Monday when describing how Karl Gauss used calculus to find the area under the normal bell curve.
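The area under the normal curve is what turns a z-score into a percentile, so the IQ and SAT figures above can be checked with Python's standard-library `statistics.NormalDist`. The IQ parameters (mean 100, SD 15) come from the notes; the SAT section parameters (mean 500, SD 100) are an assumption implied by the 200-to-800 scale being three standard deviations either side of the middle.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)    # parameters stated in the notes
sat = NormalDist(mu=500, sigma=100)  # ASSUMED: implied by the 200-800 scale

# IQ score at two standard deviations above the mean.
iq_at_z2 = iq.mean + 2 * iq.stdev

# Percentile for one standard deviation above the mean.
pct_z1 = iq.cdf(115) * 100

# SAT score of 680: z-score, share above, and percentile.
z_680 = (680 - sat.mean) / sat.stdev
top_share_680 = (1 - sat.cdf(680)) * 100
pct_680 = sat.cdf(680) * 100

# Share of the model more than 3 SDs below the mean (scores under 200).
below_200 = sat.cdf(200) * 100

print(iq_at_z2, round(pct_z1, 1), z_680, round(top_share_680, 1), round(pct_680, 1))
print(round(below_200, 2))
```

The exact model gives slightly finer values than the 68-95-99.7 rule of thumb, but they agree with the notes after rounding.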
On that same night, Stephen Colbert opened The Colbert Report with a segment on the CERN Hadron Collider. Refer to Zeno's Stadium Paradox.
"Normal Bell Curve"
Recall from Advanced Algebra the graph of e^x where e = 2.718281828459..., the natural log base.
What happens on the right? Where does the graph tend on the left?
Now modify the equation like so: y = e^(−x^2)
This appears to be an initial bell curve. Note its y-intercept is (0,1) and that it is an even, symmetrical function.
Gauss needed to find the area under it so he could make the normal bell curve have an area of 1 to match 100%.
To find it with our graphing calculator press 2nd, TRACE "CALC", 7: ∫f(x)dx,
and then enter (−) 5 for the lower limit and 5 for the upper limit.
This was the number by which Gauss needed to divide his expression.
It appears to be a random decimal until you square it:
The area under this bell curve is the square root of pi! Compare this curve to the calculator's normalpdf function:
Press 2nd VARS to access the normal probability distribution function:
Notes #5
"Pythagorean Expectation"
Compare each baseball team's winning percentage to the formula S^2/(S^2 + A^2), where S is runs Scored and A is runs Allowed. Enter the data into lists L4, L1, and L2.
Enter the answers into list L3. Then press 2nd, Y=, Plot1, On, scatter, Xlist: L3, Ylist: L4, Zoom 9 to get:
(Hold cursor over calculator pic to obtain menu keystrokes.)
The equation y = 0.9157x + 0.0355 predicts the actual winning percentage y from the Pythagorean expectation x.
Since x is (runs scored^2) / (scored^2 + allowed^2), and we multiply x by 0.9157, nearly 91.6% of a team's actual winning percentage stems from this ratio of squares of runs on offense and defense. Because it's not 100%, it's possible for a team to score more runs than they allow and yet have a losing (sub-.500) record (e.g. Texas Rangers, see photo). How is this possible?
If they win one game by more than 10 runs, and then lose the next nine by 1 run each, they've scored more runs than they allowed but have a .100 or 10% winning percentage. That is why some baseball statisticians prefer to look at run differentials to predict a team's winning percentage.
Notes #4
Notice anything unusual about this roster of Canadian hockey all-stars?
The following histograms are of the birth months of rosters of hockey and soccer teams.
The first bar is January, the second February, and so on until the last one, December.
Canadian Hockey, Czechoslovakian Hockey, Czechoslovakian Soccer
Notice anything peculiar or out of the ordinary? Whether in Canada or Czechoslovakia, whether it's hockey or soccer, the players tend to be born in the first few months of the year.
They were graphed on this window.
Here are all teams on the same graph:
Now do you see it? If not, here is the graph adjusted with a ZoomStat window:
Most of the players are born in January and February. Why is that? It's not astrology. It's that the cutoff dates of eligibility in soccer and hockey in both countries are January 1st. So a player who turns 10 on January 2nd plays the sport alongside someone who won't turn ten until December. And at the ages of 10 and 9, who is going to have an advantage in a physical contact sport? The older players are going to appear bigger and stronger than the preadolescents 11 months younger. So the coaches select those players based on those observations when really they are just picking the oldest kids. And those children get better coaching, more practice and experience, and a tougher playing field, all of which makes them truly better players by the time they are 13 or 14. What was once a tiny edge becomes larger and larger because they are consistently given more opportunities and experiences to develop their talents. If you are born in the latter half of the year and want to play soccer or hockey, the deck is stacked against you.
The system discourages you from pursuing the sport further because the very month in which you are born becomes the obstacle.
Here is the roster for the Czechoslovakian teams. Notice the same phenomenon.
In statistics, when there are outliers, anomalies, or spikes in a graph, there is usually a story to be told as to why that occurs. This particular example is from Malcolm Gladwell's book, Outliers. Think about the implications for this in non-sports examples that also select based on age criteria, for example, education in school systems.
Notes #3
"Histograms"
The following histogram is from 50 data points of mg of a drug left in the bloodstream after a certain number of hours.
Xmin = 11 means that the leftmost "bin" starts at 11 milligrams.
The y-axis is the number of data points that fall in each category.
In this graph, the tallest bar is 17. The other tallies are 1, 2, 3, 5, 10, and 12.
You can confirm that by counting the vertical dots on the y-axis.
The next two graphs have slightly adjusted windows: the first has an Xmin of 11.25, and the second changes the bin width from 0.5 to 0.25.
Note how the two graphs differ not only from the first, but from each other.
Yet, they are graphs of the same data. Now they appear bimodal instead of unimodal.
The ZOOM window offers an 'optimal' window to view based on the data.
Activate it by pressing ZOOM 9: ZoomStat and then ENTER.
Note how the Xmax and Xscl (bin width) are slightly different.
The Y values are calculated to offer an optimal view of the data.
Seeing how the same data can produce three different graphs of the same type offers insight into the quotation attributed to Samuel L. Clemens on the three kinds of lies: "There are lies, damn lies, and statistics."
Notes #2
Sabermetrics is the use of data to analyze baseball statistics. The word comes from the acronym SABR, which stands for the Society for American Baseball Research. It became popular when Billy Beane, general manager of the Oakland Athletics, used it to win 103 games in the 2002 season after losing all-star players to free agency. It became known as "Moneyball", which was the title of the book and later became a movie. Bill James pioneered the work and is now employed by the Boston Red Sox. This lesson focuses on the Pythagorean Expectation, an equation to predict win-loss records based on runs scored on offense and allowed on defense.
Data from http://www.baseballreference.com . Also check out http://www.sabr.org
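As a rough sketch of the Pythagorean Expectation, the Python below computes S^2/(S^2 + A^2) and applies the class regression line y = 0.9157x + 0.0355 quoted in the Notes #5 entry. The ten-game stretch is hypothetical (one 12-1 win followed by nine 2-3 losses), chosen to mirror the lopsided-win scenario described in these notes; the function names are illustrative.

```python
def pythagorean_expectation(scored, allowed):
    """Expected winning percentage from runs scored and runs allowed."""
    return scored ** 2 / (scored ** 2 + allowed ** 2)

def predicted_win_pct(scored, allowed, slope=0.9157, intercept=0.0355):
    """Apply the regression line from the notes to the expectation."""
    return slope * pythagorean_expectation(scored, allowed) + intercept

# HYPOTHETICAL stretch: one blowout win, then nine one-run losses.
games = [(12, 1)] + [(2, 3)] * 9          # (runs scored, runs allowed)

runs_scored = sum(s for s, _ in games)
runs_allowed = sum(a for _, a in games)
actual_win_pct = sum(s > a for s, a in games) / len(games)

print("scored", runs_scored, "allowed", runs_allowed)
print("actual:", actual_win_pct)
print("expectation:", round(pythagorean_expectation(runs_scored, runs_allowed), 3))
print("regression prediction:", round(predicted_win_pct(runs_scored, runs_allowed), 3))
```

In this made-up stretch the team outscores its opponents yet wins only one game in ten, so the formula overestimates badly; over a full season such extremes tend to average out.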
Chapter 2 Data
The book chapter discusses the information that Amazon collects on its customers.
"Andrew Pole had just started working as a statistician for Target in 2002, when two colleagues from the marketing department stopped by his desk to ask an odd question: “If we wanted to figure out if a customer is pregnant, even if she didn’t want us to know, can you do that?”"
Here is a video of what Target did with the data they collected:
"The retailer had analyzed the girl’s purchases and knew about her condition before her closest family members"
and other articles covering the story:
from the New York Times:
"Because birth records are usually public, the moment a couple have a new baby, they are almost instantaneously barraged with offers and incentives and advertisements from all sorts of companies. Which means that the key is to reach them earlier, before any other retailers know a baby is on the way."
"“If you use a credit card or a coupon, or fill out a survey, or mail in a refund, or call the customer help line, or open an email we’ve sent you or visit our Web site, we’ll record it and link it to your Guest ID,” Pole said. “We want to know everything we can.”"
Here's the excerpt relating to Target's predictive analytics from the article:
Simpson's Paradox - The 1998 NBA Scoring Title
Read pp. 5-6 of the PDF attachment above and answer these questions:
Enter the following equations into a graphing calculator to provide a visual for Jane's grade:
Read pp. 6-7 and answer these questions regarding the 1998 NBA scoring title:
Now write expressions for each player's percentage and set them equal to each other.
Solve for s in terms of j so Shaq knows how many points he must score to win the title given MJ's points.
Graph this equation and locate the point representing Shaq's 39 points and Jordan's 44.
Graph each percentage equation and find the point where they tie. Is a tie possible?
The record number of points scored in an NBA game is 100, by Wilt Chamberlain.
MJ's career high was 69 points and Shaq's was 61. So while not probable, it is mathematically possible for Shaq to score fewer points than Michael Jordan and still win the NBA scoring title despite going into the final game with a lower percentage. This counterintuitive result is an example of Simpson's Paradox.
Notes #1
With a graphing calculator, enter:
That last decimal is the square root of 5.
Press Y= and enter the formula Y1 = (F^x − G^x)/(F − G),
then set the table (press 2nd WINDOW) and record the first thirty Fibonacci numbers.
Press 2nd GRAPH for TABLE.
Then use the VARS menu to compute Fibonacci numbers F31 to F40:
Count the number of Fibonacci numbers whose leading digit is 1, 2, 3, 4, 5, 6, 7, 8 and 9.
You can generate a list of them by pressing 2nd STAT for LIST.
Of the first forty Fibonacci numbers, 12 lead with the digit "1", 7 with "2", 6 with "3", etc., and 3 with "8" and 2 with "9".
List L4 gives the percentages below left, with Benford's Law log(1+1/d) predicted on the right.
Digit | Observed | Benford's Law prediction
1 | 30% | log(1+1/1) = 0.30
2 | 17.5% | log(1+1/2) = 0.176
3 | 15% | log(1+1/3) = 0.125
4 | 5% | log(1+1/4) = 0.097
5 | 10% | log(1+1/5) = 0.08
6 | 7.5% | log(1+1/6) = 0.07
7 | 2.5% | log(1+1/7) = 0.06
8 | 7.5% | log(1+1/8) = 0.05
9 | 5% | log(1+1/9) = 0.05
Note that the digits are not equally likely; lower digits occur more often. This pattern is called Benford's Law. Here is a graph of their frequencies:
Notice how much overlapping there is on the graphs.
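The leading-digit tally can be reproduced off the calculator. This sketch generates the first forty Fibonacci numbers by addition, counts leading digits, and prints the observed share next to Benford's log(1 + 1/d). It also includes a closed-form version matching the Y1 formula, under the assumption that F and G in the notes stand for (1 + √5)/2 and (1 − √5)/2; the function names are illustrative.

```python
import math
from collections import Counter

def fibonacci_numbers(count):
    """First `count` Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    numbers, a, b = [], 1, 1
    for _ in range(count):
        numbers.append(a)
        a, b = b, a + b
    return numbers

def closed_form(n):
    """(F^n - G^n)/(F - G), ASSUMING F, G = (1 ± sqrt(5))/2."""
    f = (1 + math.sqrt(5)) / 2
    g = (1 - math.sqrt(5)) / 2
    return round((f ** n - g ** n) / (f - g))

fibs = fibonacci_numbers(40)
leading_digit_counts = Counter(int(str(value)[0]) for value in fibs)

for digit in range(1, 10):
    observed = leading_digit_counts[digit] / len(fibs)
    benford = math.log10(1 + 1 / digit)
    print(digit, leading_digit_counts[digit], f"{observed:.1%}", f"{benford:.3f}")
```

Extending `fibonacci_numbers` to a few hundred terms is an easy way to see the observed shares settle closer to the Benford predictions.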