     STATISTICS Notes
     
     
    Notes Link for: HamburgerData
     
    Notes Link for:  rCalculation
     
    Notes Link for: FoodData
     
    Notes Link for videos of:  PinkSlime
      
     
     
     

     

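    SAT Average Scores¹

    Year    | Measure                   | National (Public & Private Combined) | California (Public & Private Combined) | California (Public Schools Only)
    2011-12 | Percent of seniors tested |       |       | 39%
    2011-12 | Critical Reading          | 496   | 495   | 491
    2011-12 | Math                      | 514   | 512   | 510
    2011-12 | Writing                   | 488   | 496   | 491
    2011-12 | Total                     | 1,498 | 1,503 | 1,492
    2010-11 | Percent of seniors tested |       | 53%   | 38%
    2010-11 | Critical Reading          | 497   | 499   | 495
    2010-11 | Math                      | 514   | 515   | 513
    2010-11 | Writing                   | 489   | 499   | 494
    2010-11 | Total                     | 1,500 | 1,513 | 1,502
    2009-10 | Percent of seniors tested | 47%   | 50%   | 33%
    2009-10 | Critical Reading          | 501   | 501   | 501
    2009-10 | Math                      | 516   | 516   | 520
    2009-10 | Writing                   | 492   | 500   | 500
    2009-10 | Total                     | 1,509 | 1,517 | 1,521
    2008-09 | Percent of seniors tested | 46%   | 49%   | 35%
    2008-09 | Critical Reading          | 501   | 500   | 495
    2008-09 | Math                      | 515   | 513   | 513
    2008-09 | Writing                   | 493   | 498   | 494
    2008-09 | Total                     | 1,509 | 1,511 | 1,502
    2007-08 | Percent of seniors tested | 45%   | 48%   | 36%
    2007-08 | Critical Reading          | 502   | 499   | 494
    2007-08 | Math                      | 515   | 515   | 513
    2007-08 | Writing                   | 494   | 498   | 493
    2007-08 | Total                     | 1,511 | 1,512 | 1,500
    2006-07 | Percent of seniors tested | 48%   | 49%   | 37%
    2006-07 | Critical Reading          | 502   | 499   | 493
    2006-07 | Math                      | 515   | 516   | 513
    2006-07 | Writing                   | 494   | 498   | 491
    2006-07 | Total                     | 1,511 | 1,531 | 1,497
    2005-06 | Percent of seniors tested | 48%   | 49%   | 37%
    2005-06 | Critical Reading          | 503   | 501   | 495
    2005-06 | Math                      | 518   | 518   | 516
    2005-06 | Writing                   | 497   | 501   | 495
    2005-06 | Total                     | 1,518 | 1,520 | 1,506
    2004-05 | Percent of seniors tested | 49%   | 50%   | 36%
    2004-05 | Verbal                    | 508   | 504   | 499
    2004-05 | Math                      | 520   | 522   | 521
    2004-05 | Total                     | 1,028 | 1,026 | 1,020
    2003-04 | Percent of seniors tested | 48%   | 49%   |
    2003-04 | Verbal                    | 508   | 501   | 496
    2003-04 | Math                      | 518   | 519   | 519
    2003-04 | Total                     | 1,026 | 1,020 | 1,015
    2002-03 | Percent of seniors tested | 48%   | 54%   | 46%
    2002-03 | Verbal                    | 507   | 499   | 494
    2002-03 | Math                      | 519   | 519   | 518
    2002-03 | Total                     | 1,026 | 1,018 | 1,012
    2001-02 | Percent of seniors tested | 46%   | 52%   | 37%
    2001-02 | Verbal                    | 504   | 496   | 490
    2001-02 | Math                      | 516   | 517   | 516
    2001-02 | Total                     | 1,020 | 1,013 | 1,006
    2000-01 | Percent of seniors tested |       |       | 37%
    2000-01 | Verbal                    | 506   | 498   | 492
    2000-01 | Math                      | 514   | 517   | 516
    2000-01 | Total                     | 1,020 | 1,015 | 1,008
    1999-00 | Percent of seniors tested | 44%   | 49%   | 37%
    1999-00 | Verbal                    | 505   | 497   | 492
    1999-00 | Math                      | 514   | 518   | 517
    1999-00 | Total                     | 1,019 | 1,015 | 1,009
    1998-99 | Percent of seniors tested | 43%   | 49%   | 37%
    1998-99 | Verbal                    | 505   | 497   | 492
    1998-99 | Math                      | 511   | 514   | 513
    1998-99 | Total                     | 1,016 | 1,011 | 1,005
    1997-98 | Percent of seniors tested | 43%   | 41%   | 36%
    1997-98 | Verbal                    | 505   | 497   | 491
    1997-98 | Math                      | 512   | 516   | 516
    1997-98 | Total                     | 1,017 | 1,013 | 1,007
    1996-97 | Percent of seniors tested | 42%   | 41%   | 36%
    1996-97 | Verbal                    | 505   | 496   | 490
    1996-97 | Math                      | 511   | 514   | 514
    1996-97 | Total                     | 1,016 | 1,010 | 1,004
    1995-96 | Percent of seniors tested | 41%   | 42%   | 37%
    1995-96 | Verbal                    | 505   | 495   | 490
    1995-96 | Math                      | 508   | 511   | 511
    1995-96 | Total                     | 1,013 | 1,006 | 1,001
    1994-95 | Percent of seniors tested | 41%   | 41%   | 36%
    1994-95 | Verbal                    | 504   | 492   | 488
    1994-95 | Math                      | 506   | 509   | 509
    1994-95 | Total                     | 1,010 | 1,001 | 997
    1993-94 | Percent of seniors tested | 42%   | 42%   | 37%
    1993-94 | Verbal                    | 499   | 489   | 484
    1993-94 | Math                      | 504   | 506   | 507
    1993-94 | Total                     | 1,003 | 995   | 991
    1992-93 | Percent of seniors tested | 43%   | 41%   | 36%
    1992-93 | Verbal                    | 500   | 491   | 486
    1992-93 | Math                      | 503   | 508   | 508
    1992-93 | Total                     | 1,003 | 999   | 994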

    1 In 2005-06, the total possible score changed from 1,600 to 2,400.


    Source: California Department of Education, Policy and Evaluation Division

    http://www.ed-data.k12.ca.us/App_Resx/EdDataClassic/fsTwoPanelPopup.aspx?#!bottom=/_layouts/EdDataClassic/StudentTrendsNew.asp?reportNumber=128&fyr=2013&level=04&report=sat

    SAT Correlations
     SAT IQ overlap
    What color is Math and which is Verbal sections?
    RED-Verbal & BLUE-Math  
    Which correlations are graphed below? (choose from IQ, Math, Verbal, and SAT total)
    SAT math by verbal
     
    SAT to verbal SAT to Math
    Pass cursor over each graph to see answer.
    IQ to Verbal IQ to Math
    Which is more closely correlated to IQ--Math or Verbal SAT score?
     Now let's combine them:
    SAT to IQ  
    Can the SAT combined Verbal-Math score be used to calculate IQ scores? Should it be?
     SAT IQ line SAT IQ EQ plot SAT IQ EQ
    Calculate the z-scores and make a scatterplot.
    SAT IQ z-score slope
    Z-Scores plot and slope
     
     
    How are these correlated (SAT-Math-Verbal-IQ)?
     
     
     SAT Distributions
     
    Standardized Tests by Major
     SAT combined by major
     Normal models IQ SAT
     
    GRE Score by major
     
    IQ Distributions
     
     
    IQ by Major  
     
    Enter data for height into L1 and spare change into L2.
    per3 height and money list per3 money height list
     
    Make a histogram, box plot, then graph both together.
    Period 3 Height Distributions:
    histogram per 3 heights per3 box plot height per3 box plot histograms
     
    n = 21 students, mean = 67.6", sample standard deviation is 3.427, population standard deviation is 3.34.
    min = 59, Q1 = 65, Med = 68, Q3 = 70.5, and Max = 73 inches
     
    This was the spare change set of data from L2:
    per3 deviations per3 five number summary
     
    and the related graphs:
     
    per3 money histogram per3 money box plot per3 money plots
     
    Here's Period 5:
    heights
    per5 heights per5 height box per5 height plots
     
    cash on hand
    per5 money histogram per5 money box per5 money plots
     
     
     
    Standard Deviation notes link:

    Standard Deviation - Basic Example

    For a finite set of numbers, the standard deviation is found by taking the square root of the average of the squared differences of the values from their average value. For example, consider a population consisting of the following eight values:

    
    2,\  4,\  4,\  4,\  5,\  5,\  7,\  9.

    These eight data points have the mean (average) of 5:

        \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = 5.

    First, calculate the difference of each data point from the mean, and square the result of each:

    
    \begin{array}{lll}
    (2-5)^2 = (-3)^2 = 9  &&  (5-5)^2 = 0^2 = 0 \\
    (4-5)^2 = (-1)^2 = 1  &&  (5-5)^2 = 0^2 = 0 \\
    (4-5)^2 = (-1)^2 = 1  &&  (7-5)^2 = 2^2 = 4 \\
    (4-5)^2 = (-1)^2 = 1  &&  (9-5)^2 = 4^2 = 16. \\
    \end{array}

    Next, calculate the mean of these values, and take the square root:

    
    \sqrt{ \frac{9 + 1 + 1 + 1 + 0 + 0 + 4 + 16}{8} } = 2.

    This quantity is the population standard deviation, and is equal to the square root of the variance. This formula is valid only if the eight values with which we began form the complete population. If the values instead were a random sample drawn from some larger parent population, then we would have divided by 7 (which is n−1) instead of 8 (which is n) in the denominator of the last formula, and then the quantity thus obtained would be called the sample standard deviation. Dividing by n−1 gives a better estimate of the population standard deviation than dividing by n.
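
    As a quick check of the example above, here is a minimal Python sketch. It assumes only the standard-library statistics module, where pstdev divides by n and stdev divides by n - 1; the variable names are ours, not part of the original notes.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population standard deviation: sum of squared deviations divided by n = 8.
population_sd = statistics.pstdev(values)

# Sample standard deviation: same sum divided by n - 1 = 7.
sample_sd = statistics.stdev(values)

print(population_sd)        # 2.0
print(round(sample_sd, 3))  # 2.138
```

    The sample value is slightly larger because the same sum of squares is divided by 7 rather than 8.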

    std dev formula anatomy  

    As a slightly more complicated real-life example, the average height for adult men in the United States is about 70 inches, with a standard deviation of around 3 inches. This means that most men (about 68 percent, assuming a normal distribution) have a height within 3 inches of the mean (67–73 inches)  – one standard deviation – and almost all men (about 95%) have a height within 6 inches of the mean (64–76 inches) – two standard deviations. If the standard deviation were zero, then all men would be exactly 70 inches tall. If the standard deviation were 20 inches, then men would have much more variable heights, with a typical range of about 50–90 inches. Three standard deviations account for 99.7 percent of the sample population being studied, assuming the distribution is normal (bell-shaped).

    Country | Average male height | Average female height | Stature ratio (male to female) | Sample population / age range | Share of pop. over 15 covered[51] | Methodology | Year | Source
    U.S. | 177.6 cm (5 ft 10 in) | 163.2 cm (5 ft 4 1/2 in) | 1.09 | All Americans, 20–29 | 17.4% | Measured | 2003–2006 | [151]
    U.S. | 178 cm (5 ft 10 in) | 163.2 cm (5 ft 4 1/2 in) | 1.09 | Black Americans, 20–39 | N/A | Measured | 2003–2006 | [151]
    U.S. | 170.6 cm (5 ft 7 in) | 158.7 cm (5 ft 2 1/2 in) | 1.07 | Mexican Americans, 20–39 | N/A | Measured | 2003–2006 | [151]
    U.S. | 178.9 cm (5 ft 10 1/2 in) | 164.8 cm (5 ft 5 in) | 1.09 | White Americans, 20–39 | N/A | Measured | 2003–2006 | [151]
     
    std dev heights  
     
    For more on standard deviation check out this link: 
     

    Standard Deviation notes link:
     

    Variance

    The Variance is defined as:

    The average of the squared differences from the Mean.

    Example

    You and your friends have just measured the heights of your dogs (in millimeters):

    The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.

    Find out the Mean, the Variance, and the Standard Deviation.

    Your first step is to find the Mean:

    Answer:

    Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

    so the mean (average) height is 394 mm. Let's plot this on the chart:

    Now we calculate each dog's difference from the Mean:

    To calculate the Variance, take each difference, square it, and then average the result:

    So, the Variance is 21,704.

    And the Standard Deviation is just the square root of Variance, so:

    Standard Deviation: σ = √21,704 = 147.32... = 147 (to the nearest mm)

     

    And the good thing about the Standard Deviation is that it is useful. Now we can show which heights are within one Standard Deviation (147mm) of the Mean:

    So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.

    Rottweilers are tall dogs. And Dachshunds are a bit short ... but don't tell them!

    But ... there is a small change with Sample Data

    Our example was for a Population (the 5 dogs were the only dogs we were interested in).

    But if the data is a Sample (a selection taken from a bigger Population), then the calculation changes!

    When you have "N" data values that are:

    • The Population: divide by N when calculating Variance (like we did)
    • A Sample: divide by N-1 when calculating Variance

    All other calculations stay the same, including how we calculated the mean.

    Example: if our 5 dogs were just a sample of a bigger population of dogs, we would divide by 4 instead of 5 like this:

    Sample Variance = 108,520 / 4 = 27,130
    Sample Standard Deviation = √27,130 = 164.71... = 165 (to the nearest mm)

    Think of it as a "correction" when your data is only a sample.
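
    The dog-height steps can be retraced in a few lines of Python. This is only a sketch of the arithmetic described above; the variable names are our own.

```python
import math

heights_mm = [600, 470, 170, 430, 300]

n = len(heights_mm)
mean = sum(heights_mm) / n                             # 394.0
squared_diffs = [(h - mean) ** 2 for h in heights_mm]  # each difference, squared
total = sum(squared_diffs)                             # 108520.0

population_variance = total / n       # divide by N:     21704.0
sample_variance = total / (n - 1)     # divide by N - 1: 27130.0

print(round(math.sqrt(population_variance), 2))  # 147.32
print(round(math.sqrt(sample_variance), 2))      # 164.71
```

    Only the divisor changes between the two versions; the mean and the squared differences are identical.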

    Formulas

    Here are the two formulas, explained at Standard Deviation Formulas if you want to know more:


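    The "Population Standard Deviation":

        \sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 }

    The "Sample Standard Deviation":

        s = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2 }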

    Looks complicated, but the important change is to
    divide by N-1 (instead of N) when calculating a Sample Variance.

     

     

    *Footnote: Why square the differences?

    If we just added up the differences from the mean ... the negatives would cancel the positives:

     
    (4 + 4 - 4 - 4) / 4 = 0 / 4 = 0

    So that won't work. How about we use absolute values?

     
    (|4| + |4| + |-4| + |-4|) / 4 = (4 + 4 + 4 + 4) / 4 = 4

    That looks good (and is the Mean Deviation), but what about this case:

     
    (|7| + |1| + |-6| + |-2|) / 4 = (7 + 1 + 6 + 2) / 4 = 4

    Oh no! It also gives a value of 4, even though the differences are more spread out!

    So let us try squaring each difference (and taking the square root at the end):

     
    √( (4² + 4² + 4² + 4²) / 4 ) = √( 64 / 4 ) = 4

    √( (7² + 1² + 6² + 2²) / 4 ) = √( 90 / 4 ) = 4.74...

    That is nice! The Standard Deviation is bigger when the differences are more spread out ... just what we want!
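
    To see the footnote's point numerically, here is a small Python sketch comparing the two ways of averaging the differences. The function names mean_deviation and root_mean_square are ours, chosen for illustration.

```python
import math

def mean_deviation(diffs):
    """Average of the absolute differences."""
    return sum(abs(d) for d in diffs) / len(diffs)

def root_mean_square(diffs):
    """Square each difference, average, then take the square root."""
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

tight = [4, 4, -4, -4]
spread = [7, 1, -6, -2]

print(mean_deviation(tight), mean_deviation(spread))      # 4.0 4.0
print(root_mean_square(tight))                            # 4.0
print(round(root_mean_square(spread), 2))                 # 4.74
```

    The mean deviation cannot tell the two sets apart, while the squared version reports the larger spread of the second set.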

    In fact this method is a similar idea to distance between points, just applied in a different way.

    And it is easier to use algebra on squares and square roots than absolute values, which makes the standard deviation easy to use in other areas of mathematics.

     

     

     

     
    Inconvenient Graphs
     

    Population Growth (timeplot)

     

    Hottest years on record

    (Note these are bar graphs of categorical data even though the horizontal axis is time.)
     
    Bar graphs display categorical data, while histograms (which are essentially bar graphs with adjacent bars) provide a visual representation of quantitative data.
    In a histogram, each bar covers a range of values called a "bin", and its height depends on how many values fall into that range.
     
    Here is a histogram of all the final scores posted by basketball teams in NBA history:
    nba common scores
    What is the mode (most common score)? 
    Is the distribution symmetric or skewed?
     
    Here is the graph for baseball runs in MLB history:
    common scores MLB
    How does it compare to the NBA graph?
    What is the mode?
    Where is the tail, right or left (this is the skewed direction)?
     
    An interesting difference pops up in the most common scores in NFL history:
    NFL common scores
    How does this histogram's distribution differ from the previous ones (NBA, MLB)?  Why?
    What is the mode?
    What score is never possible?
    Which other score, while possible, appears to have never happened?
     
    Here's a graph of the perceived versus actual scores of NFL games.  Note the differences:
    NFL actual vs perceived
     
    When I looked up NHL history scores, I could only find one for the 2010-2011 season:
    NHL scores  
    How is this graph skewed?
    What is its mode?
     
    Interestingly, its distribution follows a familiar law:
    NHL Benford's Law
    Calculate the percentage of each bin above (total games = 787) and draw a histogram of its relative frequency.
    How does it compare to this?
    Benford's Law NHL  
    Do you think other sports scores follow Benford's Law?
    Why or why not?
    How could you find out from these graphs?
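
    For the comparison asked for above, the sketch below computes the proportions Benford's Law predicts for leading digits 1 through 9, using the standard formula log10(1 + 1/d). The helper relative_frequencies is a hypothetical convenience for turning bin counts read off the NHL graph into proportions; no actual counts are included here.

```python
import math

def benford_expected(digit):
    """Proportion Benford's Law predicts for a leading digit (1 through 9)."""
    return math.log10(1 + 1 / digit)

def relative_frequencies(counts):
    """Convert raw bin counts into proportions of the total."""
    total = sum(counts)
    return [c / total for c in counts]

for d in range(1, 10):
    print(d, f"{benford_expected(d):.1%}")
```

    Passing the bin counts from the graph (which should total 787) to relative_frequencies gives the values to draw next to the Benford proportions.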
     
     Here is a look at the age of hockey players in NHL history:
    NHL age
     
    Compare that to teacher age distribution over the years:
    Teacher Age Distribution  
    Why is it changing?
     
    Contrast that with how public school teachers' years of experience have been distributed over the years:
    What is the mode in each of the three graphs below? 
     
    Teacher Experience  
    What is a better descriptor of the 'middle' of each graph--mean, median, or mode?
     
     

     


     
    April Fool's Day Joke -- Complex Numbers video 
     
    FOOD DATA
     
      Notes on correlation factor
    The two scatter plots below have different viewing windows (300 < y < 700) vs. (700 < y < 2000)
    fat calorie plot fat sodium plot  
    We wish to standardize both sets of data onto the same viewing window, roughly -4 < x < 4 by -4 < y < 4.
    z score data
     
    Do this by entering (L1-mean(L1)) / stdDev(L1) and store results into L4.
     
    calorie fat plots  
     
    To calculate the calorie z scores enter (L3-mean(L3)) / stdDev(L3) and store results into L5.
    Then make a scatterplot of calorie vs. fat z scores by pressing
    2nd, Y=, Plot1 "On", Type: select scatter (1st one), XList: L4, YList: L5, ZOOM 9.
     
    cal fat plot cal fat z graph  
    To calculate the linear regression EQ press STAT, select "CALC", press 4 for LinReg(ax+b) L4, L5 then ENTER.
    z score slope
    To graph that equation press Y=, VARS, 5: Statistics, hilite "EQ", press 1 "RegEQ" and ENTER. Then GRAPH.
     
    Do the same for sodium mg vs. fat g regression plot by repeating this process on List2.
    sodium z score and graph  
    Note how the y-intercept b is really close to zero (in fact, b = 0).
    This is because we subtracted the mean of the x-data from every x-value and the mean of the y-data from every y-value.
    This translation shifts the center of the data to the origin.
    Also note how the correlation factor r is the same as a, the slope of the z-score regression line (so a = r).
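
    The same two observations (b = 0 and a = r) can be checked off the calculator. The Python sketch below uses made-up illustrative values, not the class sandwich data, and the helper names z_scores and least_squares are our own. It uses the sample standard deviation, which we take to be what stdDev( returns on the calculator.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize a list: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def least_squares(xs, ys):
    """Return slope a and intercept b of the least-squares line y = ax + b."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

# Made-up illustrative values, NOT the class sandwich data.
fat = [10, 20, 30, 40, 50]
calories = [300, 420, 480, 650, 700]

zx, zy = z_scores(fat), z_scores(calories)
a, b = least_squares(zx, zy)

# r computed as the sum of z-score products divided by n - 1.
r = sum(p * q for p, q in zip(zx, zy)) / (len(fat) - 1)

print(a, b, r)
```

    Up to floating-point rounding, the printed slope equals r and the intercept is zero, whatever data is used.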
     
    April 29, 2013
     
    Find the mean (a measure of center) and the standard deviation (a measure of spread) for this sandwich data:
    (L1 is fat grams, L2 is sodium milligrams, L3 is calorie content)
     sandwich measures
    Find the z-score for the KFC Double Down sandwich, then locate its data point on the scatter plot.
    sandwich residuals  
    Find the z-score for the KFC Double Down sandwich's sodium content.
    KFC z-score  
    Where does 1.8 standard deviations fit in the 68-95-99.7 rule?
     
    April 23, 2013
     
    The Rollie Egg Master:
     
    In the Super Size Me DVD bonus feature "The Smoking Fry", the intern threw out the McDonald's fries after ten weeks, before they had shown any sign of decomposition.  The director challenged viewers to continue the experiment.
    Happy Meal decomposition:
     
     April 22, 2013
     
    table calculations
     
    Finding summations:
    Summations
     
    Break down the formula by first calculating the numerator, then each radicand in the denominator.
    Formula breakdown  final correlation factor r
     
     Since r = 0.788, there is a fairly strong, positive linear correlation.
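
    The breakdown into a numerator and two radicands can be sketched in Python as follows. We assume the summation form of the formula, r = (nΣxy - ΣxΣy) / (√(nΣx² - (Σx)²) · √(nΣy² - (Σy)²)); the data values are made up for illustration and are not the class table.

```python
import math

# Made-up illustrative values, NOT the class table.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

numerator = n * sum_xy - sum_x * sum_y   # 5*66 - 15*20 = 30
radicand_x = n * sum_x2 - sum_x ** 2     # 5*55 - 225   = 50
radicand_y = n * sum_y2 - sum_y ** 2     # 5*86 - 400   = 30

r = numerator / (math.sqrt(radicand_x) * math.sqrt(radicand_y))
print(round(r, 4))  # 0.7746
```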
     
    2nd, Y=, Plot1, ZOOM 9: ZoomStat
    Scatter Plot
     
     
    March 26th, 2013
     
    Prudential commercial featuring live time dot plot:
     
     
    What are the modes?  
    Which is higher--mean or median?
    Which do you think is the best measure of central tendency?
    Is this a normal bell curve?
    Is it symmetrical or skewed?
     
     March 7th, 2013
     "Statistical Analysis of HR Kings"
     The text had an exercise involving Roger Maris' home run record of 61*.  Let's analyze data for the other home run kings.  Here is the number of home runs by season for Babe Ruth, Barry Bonds, Mark McGwire, and Sammy Sosa:
     
    Home Runs
     
    Directions:
    Stat HR Directions
     
    To enter 3 into a stem-and-leaf plot, list it as "03" with a '0' in the left column and a '3' to the right.
     
     Stem and Leaf Babe Bonds  stem and leaf mark sosa
     
    If there are no entries in a decade range (e.g. no 50s), still write a '5' in the left column and nothing to the right.  Why?
    So that the outlying values are easier to identify.  Remember, outlining a stem-and-leaf plot makes it appear as a histogram.
     
    1 Var Stats
     
    The median and mean are central measures.  Outlying values tend to drag the mean off center.  The median is often better centered.
    Measures of spread are IQR and standard deviation.  Which measures are larger in these examples?
     
    HR Box Plots
     
     A box plot provides a visual representation of both center and spread.  When there are multiple box plots, draw them onto the same graph vertically instead of horizontally (Note:  Graphing Calculators plot them horizontally.)
     
     May 8th, 2012
    "The Birthday Problem"
     What's the probability that two people in a room have the same birthday?  How many people does it take before the odds are even?
     
    For the 1st person, the chances are 100% that there is no match because there is no one with whom to match.  So P(No Match) = 100%.  The event of a match is the complementary event, so P(M) = 1-P(not M) = 0%.
    The next person has a 364/365 chance of not matching the first, and a 1/365 chance of matching.  Treating birthdays as independent and equally likely, we multiply: (365*364)/365^2 = 0.9973, which leaves about a 0.27% chance of a match between two people.  The third person has a 363/365 chance of not matching either of the previous two, so multiplying the previous answer by 363/365 gives 365*364*363/(365^3) = 99.18% for no match, or a 0.82% chance of a match among three. For 4 it's (365*364*363*362) / (365*365*365*365).
     
    The numerator is the permutation 365 Permute 4, or 365 nPr 4 on a graphing calculator.  The graphing calculator gives up after 40 since the numbers involved are more than a googol (that's a 1 followed by 100 zeros!).  However, we can continue by multiplying the previous answer by (365-40)/365 or 325/365 and fill in the rest of the table:
     Birthday spreadsheet
    So after 23 people, chances are 50-50 of two having the same birthday.  Indeed, both statistics classes had two people with matching birthdays, November 11th in 1st period and November 7th in 3rd.  
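    When are you 100% guaranteed of two having a match?  Strictly, only with 366 people (ignoring leap days), since 365 people could all have different birthdays.  Rounded to two places, however, the probability already reaches 100% at 83 people.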
    The graph illustrates how quickly the chances climb:
    Birthday Problem  
    That means of the 80 teachers at Tam, odds are excellent that two have the same birthday.
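
    The table can also be rebuilt without a graphing calculator. The Python sketch below multiplies the running "no match" probability by (365 - k)/365 one person at a time, the same step described above. It assumes 365 equally likely birthdays, and the function name match_probability is ours.

```python
def match_probability(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    no_match = 1.0
    for k in range(people):
        # Person k + 1 must avoid the k birthdays already taken.
        no_match *= (days - k) / days
    return 1.0 - no_match

# Smallest group with at least an even chance of a match.
first_even = next(n for n in range(1, 366) if match_probability(n) >= 0.5)

# Smallest group whose percentage rounds to 100.00.
first_rounded = next(n for n in range(1, 366)
                     if round(match_probability(n) * 100, 2) == 100.0)

print(first_even, first_rounded)                 # 23 83
print(round(match_probability(80) * 100, 2))     # 99.99
```

    Multiplying one factor at a time keeps every intermediate value between 0 and 1, so the overflow that stops the nPr approach never arises.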
     
     
     May 3, 2012
    "The Monty Hall Problem"
     let's make a deal
     
    There are two goats and a car behind three curtains. 
     choose door number?
    Choose a curtain number to win the car.
    Curtain Number  
    What are your chances of winning the car?
    car  
    You made your pick and the host shows you a goat behind one of the curtains you did not select. 
    goat door  
    (Can he always do this?)
    Then he offers you the chance to stick with your pick, or switch to the other curtain. 
    switch  
    What should you do?
    9 possibilities  
    Most people think switching doesn't matter because there are now two possibilities so it's 50/50.
    goats  
    However, there are nine possibilities: your choice of curtain (1, 2, or 3) combined with where the car is (1, 2, or 3).
    Since these 3-by-3 = 9 outcomes are equally likely, there is no way of having a 50% chance, because 9/2 = 4.5 and you can't win four and a half of the nine outcomes.  But what actually is the chance of winning when you switch if not 50%?
     
     stick or switch
     
    Of the 9 possibilities, 6 result in a Win by switching, and 3 in a Loss.  Therefore, the probability of winning if you switch is 6/9 = 2/3 or about 66.7%.
     
     monty hall solution
    Think of it this way.  Your chance of initially picking the car is 1 out of 3.  Therefore, when the host asks you later if you want to switch picks, he's offering a chance to get out of a 2/3 losing proposition.  Taking it doubles your chances of winning.
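
    The nine possibilities can be listed by brute force. This Python sketch enumerates every (pick, car) pair, has the host open a goat curtain, and counts the wins for each strategy. When the first pick is the car, the host has two goats to choose from; the code takes the lower-numbered one, which does not affect the counts.

```python
from itertools import product

curtains = (1, 2, 3)
switch_wins = 0
stick_wins = 0

for pick, car in product(curtains, repeat=2):
    # The host opens a curtain that is neither the contestant's pick nor the car.
    opened = next(c for c in curtains if c != pick and c != car)
    # Switching means taking the one curtain that is neither picked nor opened.
    switched = next(c for c in curtains if c != pick and c != opened)
    if switched == car:
        switch_wins += 1
    if pick == car:
        stick_wins += 1

print(switch_wins, stick_wins)  # 6 3
```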
     
    Scene from Numb3rs episode:
    charlie  
     
    May 1, 2012
     
     Lewis Black summarizes the food issues in the news of late:
     
     
     
    April 24, 2012
     group quiz Ch. 7-8
     
    StarBugs frappacino, infinite Hot Dog-Pizza continuum, and Pharmaceutical Poultry
     
    April 23, 2012
    TakeHome Reading Quiz Ch. 4-8
     Correlation factor computation on the circle x^2 + y^2 = 25 and why r = 0 means there is no useful line of best fit.
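
    One way to reproduce the circle computation is sketched below in Python. As an assumption for illustration, it uses the twelve integer points on x^2 + y^2 = 25 (the quiz itself may have used different points) and the summation form of the formula for r.

```python
import math

# The twelve integer points on the circle x^2 + y^2 = 25 (an illustrative choice).
points = [(x, y) for x in range(-5, 6) for y in range(-5, 6)
          if x * x + y * y == 25]

n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_xy = sum(x * y for x, y in points)
sum_x2 = sum(x * x for x, _ in points)
sum_y2 = sum(y * y for _, y in points)

numerator = n * sum_xy - sum_x * sum_y
r = numerator / (math.sqrt(n * sum_x2 - sum_x ** 2) *
                 math.sqrt(n * sum_y2 - sum_y ** 2))

print(n, numerator, r)  # 12 0 0.0
```

    By symmetry every positive product xy is cancelled by a negative one, so the numerator and r are both zero even though x and y are perfectly related by the circle's equation.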
     
    April  16-20, 2012
    Super Size Me video (STAR Testing week)
     
    April 13, 2012
     Bill Maher covers the "pink slime" issue and you might recognize some of the other food images along the way.  (Also, why Del Taco's drive thru is more advanced than the Korean rocket program) in his New Rules segment during Friday of our Spring Break.
     
    Bill Maher and tube sox compared to the KFC Double Down, and hot dog stuffed crust pizza
     
    Bill Maher and pink slime
     
    April 5, 2012
     
    The Simpsons' Little Lisa's Animal Slurry, the Krusty Ribwich and Stephen Colbert's apology:
     
     
     
    "Linear Models"
     April 3, 2012
     
     The night of the previous lesson on hamburger statistics, Stephen Colbert aired this segment:
     
     
     Let's add the data for KFC's Double Down sandwich (8th entry) of 37 fat g, 1880 sodium mg, and 610 calories to the list, along with the 10 fat g, 760 sodium mg, and 330 calories of a veggie burger.
    new sandwich data  new sandwich plot new sandwich equation
    Note the slope changed from 6 to 15.7 and the old y-intercept of 930 became 678.
    Doubling Down
     Sodium model
     
    The new correlation factor is: new correlation
    This is an improvement over the old factor of r = 0.199 (based just on hamburgers), or 0.23 (with the Double Down added), though it is still not a strong correlation.  This is primarily due to the chicken sandwich's sodium amount being a huge outlier of the data set.  Why did it change so much? To answer that, we must look at the z-scores:
    fat z score
     
     fat z scores   sodium z scores   fat sodium z plot
     
    The mean fat grams was 31.9, and its standard deviation 10.67.  z = (fat - 31.9)/10.67.
    For sodium, the mean was 1178.9 mg with a standard deviation of 357.9 mg of sodium.
    Applying the formula on the chicken sandwich, z = (1880-1178.9) / 357.9 = 1.96.
    Graph of residuals  Double down's residual
    The Double Down is nearly two standard deviations above the mean in sodium, which puts it at about the 97.5th percentile (the top 2.5%).  For its 37 g of fat, the linear model would have predicted Y(37) = 15.7(37) + 678 = 1259 mg of sodium, so the residual (actual minus predicted) is 1880 - 1259 = 621.  The model underpredicts because the actual data point is 621 mg above the line.  Calculating a linear regression equation for these z-scores gives:
     
    z score regression  z score graph
    Note how the slope a is the correlation factor r.  Can you spot the veggie burger point?  Although it is a minimum of both the sodium and the fat data, it falls very close to the linear model's prediction, z-score-wise. Let's take a look at the other stat plot--fat g versus calories.
    calorie stat plot
     
     fat calories plot fat calories graph fat calories graph
     
    The point (10, 330) is in the lower left, a minimum value for both x and y; however it falls very close to the prediction point (10, 327).  Consequently, its residual is small. However, on this plot even the KFC Double Down falls close to the line, right in the middle of the densest cluster of hamburger data.  Its calorie level is neither a maximum nor an outlier.  Therefore, we should expect a high correlation factor.  Indeed, it is:
     
     fat calorie correlation
     
    The correlation is 0.98, very close to 1, indicating a strong positive correlation between fat grams and calories.  Since 0.98 is the slope of the z-scores, how do we convert that to the slope for the linear model?  Moving one standard deviation to the right in fat raises the predicted calories by 0.98 standard deviations.  Multiplying r by the ratio of the actual standard deviations gives the slope of the linear model, a = r * Sy / Sx. The standard deviation, Sy, for calories was 117.26 (recall that Sx for the fat g was 10.67).
    0.98 * (117.26 / 10.67) = 10.8
    Since the z-score plot passes through the origin, the linear model passes through the means (mean(x), mean(y)), which in this case is (31.9, 563.3), because the mean calories is 563.3.
    So the y-intercept is calculated from y = ax + b as,
    b = y - ax = 563.3 - 10.8*31.9 = 218.78
    Therefore, the equation is y = 10.8x + 219 and the linear model reads:
     
    calories = 10.8 (fat) + 219 calories
     
    where 10.8 is the rate of calories per fat gram, i.e. each additional gram of fat adds about 10.8 calories to the sandwich, whether vegetarian, bun-less fried chicken, or hamburger.  If the sandwich has 0 fat grams, the model attributes 219 calories to the rest of the (fat-free) ingredients.  This is very similar to our older model from just the hamburger data:
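
    The conversion from the z-score slope to the linear model can be retraced in Python using only the summary statistics quoted above. This is a sketch; the variable names are ours, and the slope is rounded to one decimal place before computing the intercept, as in the notes.

```python
r = 0.98                              # correlation, also the slope of the z-scores
mean_fat, sd_fat = 31.9, 10.67        # x: fat grams
mean_cal, sd_cal = 563.3, 117.26      # y: calories

# Slope of the linear model: a = r * Sy / Sx, rounded as in the notes.
slope = round(r * sd_cal / sd_fat, 1)

# The line passes through (mean(x), mean(y)), so b = mean(y) - a * mean(x).
intercept = mean_cal - slope * mean_fat

def predicted_calories(fat_grams):
    return slope * fat_grams + intercept

print(slope, round(intercept, 2))         # 10.8 218.78
print(round(predicted_calories(10)))      # 327
```

    The last line reproduces the prediction point (10, 327) for the veggie burger.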
    calorie model  
     
    "Hamburger Data"
    April 2, 2012
     
    Whereas in algebra we say x is the independent variable and y the dependent variable (i.e. y depends on x),
    in Statistics we say that x is the explanatory variable and y the response variable.  "If I change x, how does that affect y?"
    Sometimes it does, sometimes it doesn't, and there may be a lurking variable hiding somewhere.
     
    Chapter 7 concepts
     
    Enter the data above into lists L1, L2, and L3.
     
     
     L1 L2 L3
    Then make two scatter plots--the first of sodium mg against fat grams, the second calories vs fat grams.
     
    L1 L2 L3 data  scatter L1 L2 scatter L1 L3
     
    Which plot satisfies the "straight enough" condition?  Which plot has too many outliers scattered?
     
    LinReg fat v sodium  LinReg Fat v Calories both equations
     
    Can you match each equation above with its proper scatter plot?
     
    fat vs sodium  fat vs calorie
     
    Plot both sets of data together to see if there is any correlation.
    both plots  both lines
     
     Clearly, one line is a better fit to its data than the other.  We can mathematically describe its fit with a formula.
     
     formula for r correlation
     
    You can calculate by hand using a calculator step by step:
     
    r numerator   denominator root 1 denominator root2
     
    And then combine the results to calculate the correlation factor r:
     
    correlation denominator   fat sodium correlation   correlation fat to sodium
     
    Once you calculate the regression equation (STAT, CALC, 4: LinReg L1, L2),
    you can recall the value of r (by pressing VARS, 5: Statistics, "EQ", 7: r) without working through the complicated formula.
     
     
     By replacing L2 (sodium) with L3 (calories) and repeating the calculations above, show that the correlation between fat grams and calories is r = 0.96.
     
    Since calories are strongly related to fat grams, while sodium is only weakly related, what do you think the correlation factor between sodium and calories is?  Would it be closer to 0.96 or to 0.199?  Make a scatter plot and find the regression equation between sodium mg in L2 and calories in L3.
     
    By way of contrast, KFC's Double Down sandwich has 1880 mg of sodium,  610 calories, and 37 fat grams.  Tomorrow we'll look at how that fits in with the hamburgers.  So if you do not wish to eat "pink slime", there's always this:
    Where's the bread?  
     
     March 13th, 2012
     "Normal Models -- IQ and SAT scores"
     
    "Why do I keep doing this? Is 70 not a good IQ score?!" -- 30 Rock, "St. Patrick's Day", aired March 15, 2012
     
    Normal Bell Curve of z-scores
     
    This model can be applied to IQ scores with a mean parameter of 100 and a standard deviation of 15.
     
    IQ scores
     
    Note that 115 =100+15 is one standard deviation (+15) above the mean (100). So a z-score of 2 now corresponds to 130.
    Also, applying the 68-95-99.7 rule allows a percentile or top percentage to be computed.  For example, a z-score of 1, or one standard deviation above the mean, translates to half of 68% or 34% above the middle.  50% + 34% = 84%.  The top 16% is the 84th percentile (better than 84% of the population.)
     
    SAT scores
     
    This explains why they can make the minimum SAT section score 200 instead of 0.  With a mean of 500 and a standard deviation of 100, a score of 200 is 3 standard deviations below the mean, and only 0.15% of the population scores lower than that, so any score that might have been 0-199 can be given a 200 without much loss of statistical information.  This also makes the SAT normal model symmetric about the mean, with the high score of 800 sitting 3 standard deviations above it.  To find what percentile an SAT score of 680 is:
     
    percentile  
     
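    A score of 680 has a z-score of (680 - 500)/100 = 1.8, and about 3.6% of the normal curve lies above z = 1.8.  Since 100 - 3.6 = 96.4, an SAT score of 680 is at the 96.4th percentile.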
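    Here is a sketch of the same percentile calculation using Python's standard library, assuming each SAT section follows a normal model with mean 500 and standard deviation 100 (so that 200 and 800 sit three standard deviations from the mean). The variable names are illustrative.

```python
from statistics import NormalDist

# Assumed model for one SAT section: mean 500, standard deviation 100.
sat = NormalDist(mu=500, sigma=100)

score = 680
below = sat.cdf(score)          # fraction scoring below 680
above = 1 - below               # fraction scoring above 680

print(f"z-score: {(score - sat.mean) / sat.stdev:.1f}")
print(f"percent above: {100 * above:.1f}")
print(f"percentile: {100 * below:.1f}")

# Fraction of the model that falls below the 200 floor (3 SDs below the mean).
print(f"below 200: {100 * sat.cdf(200):.2f}%")
```

    The exact model puts about 0.13% below 200, close to the 0.15% that the 68-95-99.7 rule gives.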
    A year ago last March, Tim Shriver, Chairman of the Special Olympics, gave an interview on The Colbert Report about the language used to describe people with intellectual differences.  In the 1970s, when the public started to use these descriptors in a hurtful manner, psychologists shifted away from that terminology.  Think of the word "cretin."  At one point, its official definition described someone whose IQ fell between 85 and 100, within one standard deviation below the mean.
     
    Here is the interview:
     
     
      March 8th, 2012
    "Zeno's Paradoxes"
     
     On The Daily Show last night, Jon Stewart used Zeno's Dichotomy Paradox to describe the race for the Republican Presidential Primary after Super Tuesday.  I had referred to Zeno on Monday when describing how Karl Gauss used calculus to find the area under the normal bell curve.
    Zeno's Paradox
    On that same night, Stephen Colbert opened The Colbert Report with a segment on CERN's Large Hadron Collider.  Refer to Zeno's Stadium Paradox.
    Zeno's Paradoxes
     
     
     
     
     "Normal Bell Curve"
     
    Recall from Advanced Algebra the graph of e^x where e = 2.718281828459..., the base of the natural logarithm.
     
    Y1 = e^x Window size graph of e to the x  
     
    What happens on the right?  Where does the graph tend on the left?
     
    Now modify the equation like so:
     
     
    Y1 = e^(-x^2)   Graph of e^(-x^2)   New window
     
    This appears to be an initial bell curve.  Note that its y-intercept is (0,1) and that it is an even (symmetric) function.
    Gauss needed to find the area under it so he could make the normal bell curve have an area of 1 to match 100%.
    To find it with our graphing calculator press 2nd, TRACE "CALC", 7: ∫f(x)dx,
    and then enter -5 (using the (-) key) for the lower limit and 5 for the upper limit.
     
     2nd CALC 7: lower limit = -5 upper limit = 5 Area = 1.7724539
     
    This was the number by which Gauss needed to divide his expression.
    It appears to be a random decimal until you square it: 1.7724539^2 ≈ 3.14159, which is pi.
     
    So the area under e^(-x^2) is the square root of pi, and dividing by that number gives a curve with an area of 1.  Compare this curve to the calculator's normalpdf function:
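    The calculator's numeric integral can be reproduced with a short Python sketch. The midpoint rule and the step count below are choices made for this illustration, not necessarily the method the calculator uses.

```python
import math

def bell(x):
    """The unnormalized bell curve y = e^(-x^2)."""
    return math.exp(-x * x)

def midpoint_integral(f, lower, upper, steps=100_000):
    """Approximate the area under f from lower to upper with the midpoint rule."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(steps))

area = midpoint_integral(bell, -5, 5)
print(f"area     = {area:.7f}")
print(f"area^2   = {area ** 2:.7f}")
print(f"sqrt(pi) = {math.sqrt(math.pi):.7f}")
```

    Limits of -5 and 5 are enough because the curve is practically zero beyond them.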
     
    Normalized Bell Curve   standard window   Gaussian Bell Curve
     
    Press 2nd VARS (DISTR) to access the normal probability density function:
     
    probability density function   Y1 = normalpdf(x)
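    For comparison, here is a hedged sketch of how the two curves relate. It assumes that normalpdf(x) with no extra arguments is the standard normal density e^(-x^2/2)/sqrt(2*pi) (mean 0, standard deviation 1). It also checks that dividing e^(-x^2) by sqrt(pi) gives a normal density too, but one with standard deviation 1/sqrt(2). The function names are illustrative.

```python
import math
from statistics import NormalDist

def normalpdf(x):
    """Standard normal density: mean 0, standard deviation 1."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def scaled_bell(x):
    """e^(-x^2) divided by sqrt(pi), so the total area is 1."""
    return math.exp(-x * x) / math.sqrt(math.pi)

standard = NormalDist(mu=0, sigma=1)
narrow = NormalDist(mu=0, sigma=1 / math.sqrt(2))

for x in (-2, -1, 0, 1, 2):
    print(f"x = {x:2d}  normalpdf = {normalpdf(x):.4f}  scaled bell = {scaled_bell(x):.4f}")
```

    Both curves enclose an area of 1. The scaled bell is taller and narrower because its standard deviation is smaller.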
     
     
     
     
     Notes #5
     
       "Pythagorean Expectation"
     
    Compare each baseball team's winning percentage to the formula S^2/(S^2 + A^2),
    where S is runs Scored and A is runs Allowed.  Enter the data into lists L4, L1, and L2.
     
    Sabermetrics 2005
     
    Enter the formula's results into list L3.  Then press 2nd, Y=, Plot1, On, scatter, Xlist: L3, Ylist: L4, Zoom 9 to get:
     
          Scatter Plot        ZoomStat'd Window       STAT-CALC-LinReg
            Linear Model    VARS-Statistics-EQ-RegEQ        Line of Best Fit
     
    The equation y = 0.9157x + 0.0355 predicts the actual winning percentage y from the Pythagorean expectation x.
     Since x is scored^2 / (scored^2 + allowed^2) and the slope is 0.9157, each additional 0.100 of Pythagorean expectation predicts about 0.092 more actual winning percentage.  Because the fit is not perfect, it's possible for a team to score more runs than it allows and yet have a losing (sub .500) record (e.g. Texas Rangers--see photo).  How is this possible?  If a team wins one game by more than 10 runs and then loses the next nine by 1 run each, it has scored more runs than it allowed but has a .100, or 10%, winning percentage.  That is why some baseball statisticians prefer to look at run differentials to predict a team's winning percentage.
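    The formula and the fitted line can be sketched in Python. The slope 0.9157 and intercept 0.0355 come from the regression above. The function names and the ten game scores are hypothetical, chosen only to mirror the example of one blowout win followed by nine one-run losses.

```python
def pythagorean_expectation(scored, allowed):
    """Expected winning percentage S^2 / (S^2 + A^2)."""
    return scored ** 2 / (scored ** 2 + allowed ** 2)

def predicted_win_pct(expectation, slope=0.9157, intercept=0.0355):
    """Linear model from the notes: y = 0.9157x + 0.0355."""
    return slope * expectation + intercept

# Hypothetical ten games as (runs scored, runs allowed):
# one 11-0 win followed by nine 1-2 losses.
games = [(11, 0)] + [(1, 2)] * 9

scored = sum(s for s, a in games)
allowed = sum(a for s, a in games)
wins = sum(1 for s, a in games if s > a)

actual = wins / len(games)
expected = pythagorean_expectation(scored, allowed)

print(f"runs scored {scored}, runs allowed {allowed}")
print(f"actual winning percentage: {actual:.3f}")
print(f"Pythagorean expectation:   {expected:.3f}")
print(f"linear model prediction:   {predicted_win_pct(expected):.3f}")
```

    The team outscores its opponents 20 to 18, so the expectation is above .500 even though the record is 1-9.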
     
     Notes #4
     
     Notice anything unusual about this roster of Canadian Hockey all stars?
    Canadian Hockey roster  
    The following histograms are of the birth months of rosters of hockey and soccer teams.
    The first bar is January, the second February and so on until the last one of December.
     
    Canadian Hockey                                  Czechoslovakian Hockey                        Czechoslovakian Soccer
     
    Notice anything peculiar or out of the ordinary?  Whether in Canada or Czechoslovakia,
    whether it's hockey or soccer, their players tend to be born in  the first few months of the year.
    They were graphed on this window.
     
    Birth month window   Here are all teams on the same graph:   All 3 teams
     
    Now do you see it?  If not, here is the graph adjusted with a ZoomStat window:
    ZoomStat all 3
     
    Most of the players are born in January and February.  Why is that?  It's not astrology.  It's that the eligibility cutoff date for soccer and hockey in both countries is January 1st.  So a player who turns 10 on January 2nd plays the sport alongside someone who won't turn 10 until December.  And at the ages of 10 and 9, who is going to have an advantage in a physical contact sport?  The older players are going to appear bigger and stronger than the pre-adolescents 11 months younger.  So the coaches select those players based on those observations when really they are just picking the oldest kids.  And those children get better coaching, more practice and experience, and tougher competition, all of which makes them truly better players by the time they are 13 or 14.  What was once a tiny edge becomes larger and larger because they are consistently given more opportunities and experiences to develop their talents.  If you are born in the latter half of the year and want to play soccer or hockey, the deck is stacked against you.  The system discourages you from pursuing the sport further because the very month in which you are born becomes the obstacle.
     
    Here are the rosters for the Czechoslovakian teams.  Notice the same phenomenon.
    Czech hockey and soccer rosters  
     
    In statistics, when there are outliers, anomalies, or spikes in a graph, there is usually a story to be told as to why they occur.  This particular example is from Malcolm Gladwell's book, Outliers.  Think about the implications of this for non-sports settings that also select by age, for example, school systems.
     
     
     
     Notes #3
     "Histograms"
     
    The following histogram is from 50 data points of mg of a drug left in the bloodstream after a certain number of hours.
     
    Window1  Histogram1
     
    Xmin = 11 means that the leftmost "bin" starts at 11 milligrams.
    The y-axis is the number of data points that fall in each bin.
    In this graph, the tallest bar has a count of 17.  The other tallies are 1, 2, 3, 5, 10, and 12.
    You can confirm that by counting the vertical dots on the y-axis.
     
    The next two graphs have slightly adjusted windows: the first has an Xmin of 11.25,
    and the second changes the bin width from 0.5 to 0.25.
     
    Window2           Histogram2
    Histogram2          Histogram3
     
    Note how the two graphs differ not only from the first, but from each other.
    Yet, they are graphs of the same data.  Now they appear bimodal instead of unimodal.
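    The effect of Xmin and bin width can be sketched in Python. The data below are made-up values, not the 50 drug measurements from the notes, and bin_counts is a helper name introduced here. It mimics a histogram by counting how many values fall in each bin, where a bin includes its left edge but not its right edge.

```python
import math

def bin_counts(data, xmin, width, bins):
    """Count the values falling in each bin [xmin + k*width, xmin + (k+1)*width)."""
    counts = [0] * bins
    for value in data:
        k = math.floor((value - xmin) / width)
        if 0 <= k < bins:
            counts[k] += 1
    return counts

# Made-up sample (NOT the drug data from the notes).
data = [11.6, 11.9, 12.1, 12.2, 12.3, 12.4, 12.6, 12.7, 12.8, 12.9,
        13.1, 13.2, 13.2, 13.3, 13.4, 13.6, 13.7, 13.9, 14.2, 14.4]

print(bin_counts(data, xmin=11.0, width=0.5, bins=8))    # original window
print(bin_counts(data, xmin=11.25, width=0.5, bins=8))   # shifted Xmin
print(bin_counts(data, xmin=11.0, width=0.25, bins=16))  # narrower bins
```

    Every row counts the same 20 values. Only the bin edges move, and that alone changes the shape of the bars.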
     
    ZoomStat  ZoomWindow ZoomHistogram
     
    The ZOOM window offers an 'optimal' window to view based on the data.
    Activate it by pressing ZOOM 9: ZoomStat and then ENTER.
    Note how the Xmax and Xscl (bin width) are slightly different.
    The Y values are calculated to offer an optimal view of the data.
     
    Seeing how the same data can produce three different graphs of the same type offers insight into the quotation attributed to Samuel L. Clemens: "There are three kinds of lies: lies, damned lies, and statistics."
     
     
     Notes #2
     
     
    Sabermetrics is the use of data analysis to evaluate baseball performance.  The word comes from the acronym SABR, which stands for the Society for American Baseball Research.  It became popular when Billy Beane, general manager of the Oakland Athletics, used it to win 103 games in the 2002 season after losing All-Star players to free agency.  The approach became known as "Moneyball", the title of the book about it, which later became a movie.  Bill James pioneered the work and is now employed by the Boston Red Sox.  This lesson focuses on the Pythagorean Expectation, an equation to predict win-loss records based on runs scored on offense and allowed on defense.
     
     
     


    Pythagorean Expectation
     
     
     
    Chapter 2 Data
     
    The book chapter discusses the information that Amazon collects on its customers.
     "Andrew Pole had just started working as a statistician for Target in 2002, when two colleagues from the marketing department stopped by his desk to ask an odd question: “If we wanted to figure out if a customer is pregnant, even if she didn’t want us to know, can you do that? ” " 
    Here is a video of what Target did with the data they collected:
     
      "The retailer had analyzed the girl’s purchases and knew about her condition before her closest family members"
     and other articles covering the story:
     
    from the New York Times:
     
     "Because birth records are usually public, the moment a couple have a new baby, they are almost instantaneously barraged with offers and incentives and advertisements from all sorts of companies. Which means that the key is to reach them earlier, before any other retailers know a baby is on the way. "
     
    "“If you use a credit card or a coupon, or fill out a survey, or mail in a refund, or call the customer help line, or open an e-mail we’ve sent you or visit our Web site, we’ll record it and link it to your Guest ID,” Pole said. “We want to know everything we can.” "
     
    Here's the excerpt relating to Target's predictive analytics from the article:
     

     
    Simpson's Paradox -- The 1998 NBA Scoring Title 
     
     
    Read pp. 5-6 of the PDF attachment above and answer these questions:
     Jane's tests
     
    Enter the following equations into a graphing calculator to provide a visual for Jane's grade:
    jane's tests graph
     
    Read pp. 6-7 and answer these questions regarding the 1998 NBA scoring title :
    NBA questions  
     
    Now write expressions for each player's percentage and set them equal to each other.
    solve s for j  
     
    Solve s for j so Shaq knows how many points he must score to win the title given MJ's points.
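    The "solve s for j" step can be sketched generically in Python. The season totals below are hypothetical placeholders for two unnamed players, not the 1998 figures from the PDF, and points_needed is a name introduced here. The idea is to set (Ps + s)/(Gs + 1) equal to (Pj + j)/(Gj + 1), where P and G are each player's points and games before the final game, and then solve for s.

```python
def average_after(points, games, final_game_points):
    """Points-per-game average once the final game is included."""
    return (points + final_game_points) / (games + 1)

def points_needed(points_s, games_s, points_j, games_j, j):
    """Final-game points s that tie the two averages, given the rival scores j.

    Solves (points_s + s) / (games_s + 1) == (points_j + j) / (games_j + 1) for s.
    """
    return (games_s + 1) * (points_j + j) / (games_j + 1) - points_s

# Hypothetical totals before the final game (placeholders, not real data).
POINTS_S, GAMES_S = 1400, 49     # player S: fewer games played
POINTS_J, GAMES_J = 2300, 79     # player J: more games played

for j in (20, 30, 40):
    s = points_needed(POINTS_S, GAMES_S, POINTS_J, GAMES_J, j)
    print(f"if J scores {j}, S ties with s = {s:.1f} and wins with any more")
```

    Because the two players have played different numbers of games, one extra point moves their averages by different amounts, which is why the break-even line does not have a slope of 1.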
     
    shaq v mj  
     
    Graph this equation and locate the point representing Shaq's 39 points and Jordan's 44.
    shaq v mj graph
     
    Graph each percentage equation and find the point where they tie.  Is a tie possible?
    NBA scoring title graph
     
     The record for points scored by one player in an NBA game is 100, by Wilt Chamberlain.
     
    NBA scoring table NBA scoring tables  
     
     MJ's career high was 69 points and Shaq's was 61.  So while not probable, it is mathematically possible for Shaq to score fewer points than Michael Jordan and still win the NBA scoring title despite going into the final game with a lower percentage.  This counter-intuitive result is an example of Simpson's Paradox.
     
    Notes #1
     
     
    With a graphing calculator, enter:
     
     STO→ ALPHA "F" and "G"   That last decimal is the square root of 5.
    Press   Y=   and enter the formula Y1 = (F^x - G^x)/(F - G)
    Binet's formula
     
    then set the table (press  2nd  WINDOW ) and record the first thirty Fibonacci numbers. Press 2nd GRAPH TABLE.
     
    2nd WINDOW "TBLSET"  Fibonacci #'s 1-7 Fib7-13
    F13-F19  F20-26 F25-30
     
    Then use the VARS menu
    VARS  Y-Vars Function Y1
    to compute Fibonacci #'s F31 to F40:
    F31-33  F34-37 F38-40
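    The calculator steps above can be mirrored in Python. This sketch assumes F and G are the two roots (1 + sqrt(5))/2 and (1 - sqrt(5))/2, so that F - G is the square root of 5, and it evaluates Y1 = (F^x - G^x)/(F - G) for x = 1 to 40. Rounding is needed because the floating-point results are only approximately whole numbers.

```python
import math

F = (1 + math.sqrt(5)) / 2   # golden ratio, about 1.618
G = (1 - math.sqrt(5)) / 2   # its conjugate, about -0.618

def y1(x):
    """Binet's formula: the x-th Fibonacci number."""
    return (F ** x - G ** x) / (F - G)

fibonacci = [round(y1(x)) for x in range(1, 41)]

print(F - G)              # the square root of 5
print(fibonacci[:10])     # F1 through F10
print(fibonacci[30:])     # F31 through F40
```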
    Count the number of Fibonacci numbers whose leading digit is 1, 2, 3, 4, 5, 6, 7, 8 and 9.
    You can generate a list of them by pressing 2nd STAT for LIST.
    List Ops     Fibonacci sequence      Store into list L1.
     
    Of the first forty Fibonacci numbers, 12 lead with the digit "1", 7 with "2", 6 with "3", and so on, down to 3 with "8" and 2 with "9".
    Fibonacci frequency
    List L4 gives the percentages below left, with the Benford's Law prediction log(1+1/d) on the right.
     
       1   30%      log(1+1/1) = 0.301
       2   17.5%    log(1+1/2) = 0.176
       3   15%      log(1+1/3) = 0.125
       4   5%       log(1+1/4) = 0.097
       5   10%      log(1+1/5) = 0.079
       6   7.5%     log(1+1/6) = 0.067
       7   2.5%     log(1+1/7) = 0.058
       8   7.5%     log(1+1/8) = 0.051
       9   5%       log(1+1/9) = 0.046
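    The tally and the Benford prediction can be reproduced with a short Python sketch. It builds the first forty Fibonacci numbers by repeated addition (rather than Binet's formula), counts leading digits, and prints each observed share beside log(1 + 1/d). The function names are illustrative.

```python
import math
from collections import Counter

def first_fibonacci(n):
    """Return the first n Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    numbers = []
    a, b = 1, 1
    for _ in range(n):
        numbers.append(a)
        a, b = b, a + b
    return numbers

def leading_digit(value):
    """First digit of a positive whole number."""
    return int(str(value)[0])

fib = first_fibonacci(40)
counts = Counter(leading_digit(f) for f in fib)

for d in range(1, 10):
    observed = counts[d] / len(fib)
    benford = math.log10(1 + 1 / d)
    print(f"{d}   observed {100 * observed:5.1f}%   Benford {benford:.3f}")
```

    Raising the 40 in first_fibonacci(40) shows how the observed shares behave with a longer list.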
     
    Note that the digits are not equally likely: lower digits occur more often.  This pattern is called Benford's law.  Here is a graph of their frequencies:
     Fibonacci leading digit frequency    Benford's Law graphed    Fibonacci's Law
     
    Notice how closely the two graphs overlap.
     
     
Last Modified on March 5, 2018