a set of data identified with particular circumstances. Typically displayed in tables with rows as the individuals and columns as the variables
Quantitative vs Categorical/Qualitative variables
Quantitative: Numerical values. Represent a measurement.
Categorical: category or label values into which individuals are grouped.
Three steps in Exploratory Data Analysis
1. Organize and SUMMARIZE raw data
2. DISCOVER important features and patterns and striking deviations.
3. INTERPRET findings in the context of the problem
exploring data obtained from one variable at a time
exploring data obtained from two variables at a time
what values the variable takes, how often
Three types of graphical displays of categorical distributions
1. Pie Charts
2. Bar Charts
ranges of data to make charting easier, like a bar chart where each bar shows a range like 70-80%
category counts and percentages
Four types of Graphical displays of Quantitative Variables
like a bar chart but the x axis is numerical, in order. Eg: x axis is years, y axis is Men’s income and Women’s income. Or, the x axis is number of hours studied, and y axis is number of students falling into each number of hours studied category.
4 ways to interpret a histogram
1. Shape – Symmetry/Skewness, Peakedness (Modality)
2. Center – midpoint
3. Spread – approx range covered by all the data
4. Outliers – observations that fall outside overall pattern
Symmetric distributions (on a histogram)
look symmetric. can be multi-peaked, but symmetrical
Skewness (on a histogram)
data is skewed to the right or left when extreme values (often outliers) stretch a long tail in that direction. (Careful, because the histogram looks heavy on the side opposite to the one it is skewed toward. Think of the outliers as pulling a long tail out from the main data, making it not symmetrical.)
Peakedness (on a histogram) (three types)
1. Unimodal (single peaked) distribution
2. Bimodal (double peaked) distribution
3. Uniform distribution (no clear peak; all bars about the same height)
Stemplot (or stem and leaf plot)
1. Write all the “stems” down in a list, in ascending numerical order. (The stems are all the numbers but the right most number. Eg: dataset 34 35 36 347 367 the stems are 3, 3, 3, 34, 36, but you only use each identical stem once, so it would be 3, 34, 36)
2. Draw a line to the right of the list
3. Write all the leaves next to the stem, and rearrange them in increasing order
two Virtues of a stemplot
1. preserves the data while sorting it
2. when rotated looks like a histogram
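A minimal sketch of the stemplot steps above, assuming non-negative whole numbers so that the leaf is simply the last digit. The function name `stemplot` is made up for illustration, and the sketch lists only stems that actually occur, as in the example dataset from these notes.

```python
from collections import defaultdict

def stemplot(values):
    """Build stemplot rows: stem = all digits but the last, leaf = last digit."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):          # sorting first keeps the leaves in increasing order
        stem, leaf = divmod(v, 10)    # e.g. 347 -> stem 34, leaf 7
        leaves_by_stem[stem].append(leaf)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(leaves_by_stem.items())
    ]

rows = stemplot([34, 35, 36, 347, 367])
print("\n".join(rows))
# 3 | 4 5 6
# 34 | 7
# 36 | 7
```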
a stemplot with dots instead of leaves
Shows the “five number summary”: min, Q1, Median, Q3, Max
Y axis is range
Drawn box is interquartile range
Points for outliers, minimum and maximum
Is most useful for showing side by side comparisons
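The Five Number Summary
1. Maximum – the largest value
2. Q3 – the 75th percentile (median of the top half)
3. Median – the 50th percentile
4. Q1 – the 25th percentile (median of the bottom half)
5. Minimum – the smallest value
Each gap between neighboring numbers (Min to Q1, Q1 to Median, Median to Q3, Q3 to Max) holds about 25% of the data.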
3 Measures of Center
1. Mode – the value most often found (not sensitive to outliers)
2. Median – the center value (or average of the two center values) (not sensitive to outliers)
3. Mean – the average (sensitive to outliers)
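A small sketch of why the mean is called sensitive to outliers while the mode and median are not. The numbers are made up: the second list is the first one with a single value mistyped as 50 instead of 5.

```python
import statistics

scores = [2, 3, 3, 4, 5]
with_outlier = [2, 3, 3, 4, 50]   # same data, but the last value is an outlier

for name, fn in [("mode", statistics.mode),
                 ("median", statistics.median),
                 ("mean", statistics.mean)]:
    print(name, fn(scores), fn(with_outlier))
# mode 3 3
# median 3 3
# mean 3.4 12.4
```

Only the mean moves when the outlier appears.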
3 measures of spread
1. Range – the distance between max and minimum values
2. Inter-Quartile Range – the range of the middle 50%
3. Standard Deviation – roughly how far the observations typically are from their mean. (The mean may be 9, while the observations are typically about 4 away from 9.)
Calculate inter-quartile range
1. Find median (by arranging data in increasing order)
2. Find median of bottom 50% (Q1, “The first quartile”)
3. Find median of top 50% (Q3, “The third quartile”)
4. IQR = Q3 – Q1
The 1.5(IQR) criterion for outliers
1. Lower cutoff = Q1 – 1.5(IQR)
2. Upper cutoff = Q3 + 1.5(IQR)
3. Any data points outside of these two cutoffs are possible outliers.
Outliers – when to keep, when to discard?
1. Keep if could happen again, produced by essentially same process.
2. Discard if produced by a different process and your purpose is to understand the process which produced most of the data.
3. Discard if produced by an error or typo that cannot be fixed.
Notations for Standard Deviation
SD, s, Sd, St Dev
Calculate Standard Deviation
1. Find the mean
2. Find distances between observations and the mean
3. Square each deviation
4. Add up the squares of each deviation and divide by the number of deviations minus 1
5. Find square root of result
We can’t average the deviations because they add up to zero.
The reason we divide the sum of the squared deviations by n – 1 rather than by n is beyond the scope of this course to explain.
The average of the squared deviations (dividing by n – 1) is called the variance of the data.
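The five steps can be followed literally in code. This is a sketch on a made-up dataset; the last line checks the hand calculation against Python's built-in `statistics.stdev`, which also divides by n – 1.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)                  # step 1: the mean (5.0)
deviations = [x - mean for x in data]         # step 2: distance of each observation from the mean
squared = [d ** 2 for d in deviations]        # step 3: square each deviation
variance = sum(squared) / (len(data) - 1)     # step 4: sum of squares divided by n - 1
sd = math.sqrt(variance)                      # step 5: square root

print(sum(deviations))                        # 0.0, which is why we can't just average the deviations
print(round(sd, 3), round(statistics.stdev(data), 3))
```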
Is the “standard deviation” or “variance of the data” influenced by outliers?
The “standard deviation rule”
Approx 68% of observations fall within 1 standard deviation of the mean
Approx 95% of observations fall within 2 standard deviations of the mean
Approx 99.7% of observations fall within 3 standard deviations of the mean
(3 standard deviations = the standard deviation x 3)
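A quick sketch of how the rule turns into ranges. The mean of 100 and standard deviation of 15 are made-up numbers, and the rule only applies to roughly bell-shaped (normal) distributions. The function name is hypothetical.

```python
def sd_rule_intervals(mean, sd):
    """Ranges expected to hold about 68%, 95% and 99.7% of the observations."""
    return {
        pct: (mean - k * sd, mean + k * sd)
        for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]
    }

intervals = sd_rule_intervals(mean=100, sd=15)
print(intervals)
# {'68%': (85, 115), '95%': (70, 130), '99.7%': (55, 145)}
```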
Notation for mean
an x with a line over it
Choose between using mean and standard deviation versus the five number summary
1. use mean and SD for relatively symmetrical distributions with no outliers
2. use five number summary for all others
Steps to choose which data display and numerical summary is best
1. Identify the explanatory/independent variable (x) and the response/dependent variable (y)
2. Is the explanatory variable categorical or quantitative?
3. Is the response variable categorical or quantitative?
4. Notate it C-C, C-Q, Q-C, or Q-Q
5. Select approach based on above
Select data display and numerical summary approach for case C-C, C-Q, Q-C, or Q-Q
1. Case C-C: Two way table or double bar chart using conditional percents.
2. Case C-Q: Box plots and five number summary
3. Case Q-C: Not covered in the text
4. Case Q-Q: Scatterplot (explanatory on x, response on y) or labelled scatterplot
Measures the strength and direction of a linear relationship between two quantitative variables. Does not tell you IF a relationship is linear. A curvilinear relationship may or may not include a linear relationship. The correlation coefficient tells you the strength of the linear relationship, not of the curvilinear relationship.
Notation of the correlation coefficient
Correlation Coefficient and Outliers
Outliers strongly affect the r-value, so the correlation coefficient should only be used after seeing the scatterplot.
Range of values in the correlation coefficient
-1 to 1
-1 is the strongest negative linear relationship
+1 is the strongest positive linear relationship
Close to zero is a weaker linear relationship
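A sketch of how r behaves at the extremes. The formula used here is the common textbook one (the average product of standardized x and y values, dividing by n – 1). It is not taken from these notes, so treat it as an assumption. The datasets are made up.

```python
import statistics

def correlation(xs, ys):
    """r = sum of (z-score of x) * (z-score of y), divided by n - 1."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]      # perfect positive linear relationship
falling = [10, 8, 6, 4, 2]     # perfect negative linear relationship
curved = [4, 1, 0, 1, 4]       # y = (x - 3)^2: a perfect curve, but not linear

print(round(correlation(x, rising), 3))    # 1.0
print(round(correlation(x, falling), 3))   # -1.0
print(round(correlation(x, curved), 3))    # 0.0
```

The last case is why r cannot tell you IF a relationship is linear: the pattern is perfectly predictable, yet r is zero.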
Regression and Linear Regression
The technique that specifies the dependence of the response variable on the explanatory variable. If it’s a linear dependence, then it’s linear regression. It’s finding the line that best fits the pattern of the linear relationship.
Calculate linear regression or the “least squares regression line”
1. Ŷ = a + bX
2. b = r(Sy / Sx)
3. a = ȳ – b(x̄)
r = the correlation coefficient
Sx = standard deviation of the explanatory variable’s values
Sy = standard deviation of the response variable’s values
x̄ (x with a line over it) = the mean of the explanatory variable’s values
ȳ (y with a line over it) = the mean of the response variable’s values
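A sketch of the calculation using the standard least-squares formulas: slope b = r(Sy / Sx) and intercept a = ȳ – b(x̄). The correlation coefficient is computed inline with the usual textbook formula, which these notes do not give, so that part is an assumption. The data and function name are made up.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # correlation coefficient r (standard formula, assumed here)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx        # slope: average change in y per 1-unit increase in x
    a = my - b * mx        # intercept: value of y-hat when x = 0
    return a, b

hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
a, b = least_squares_line(hours, score)
print(round(a, 2), round(b, 2))     # 2.2 0.6
print(round(a + b * 3, 2))          # predicted score for 3 hours: 4.0
```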
Find the slope of the “least squares regression line”.
(Just as the standard line equation, y=a+bX, helps you find the slope, or the change in y when x changes by 1, the “least squares regression line” formula helps you find the average change in the response variable when the explanatory variable increases by 1 unit.) It’s called the “least squares regression line” because it’s the line which results in the smallest sum of squared vertical deviations.
a set of points that obey a particular relationship between x and y
Equation of the Line (Algebra Review)
a=the y-intercept, or the value that y takes when x =zero
b=the slope, or the change in y when x changes by 1
prediction for ranges of the explanatory variable that are not in the data
Causation and lurking variables
Association does not imply causation
Lurking variables are not among the variables in a study but could substantially affect your interpretation of the relationship among those variables
Whenever a lurking variable causes us to rethink the direction of an association
Correlation between quantitative variables vs. correlation between category variables
There can only be correlation between quantitative variables, not category variables
10 sampling types and terms
(Remember S,V,V,C,S, S,P,C,M,S)
“Some very very cute samples. Some pleasing, cute, magnificent samples.”
1. Sampling Frame
2. Volunteer Sample
3. Volunteer Response
4. Convenience Sample
5. Systematic Sampling
6. Simple Random Sample
7. Probability Sampling Plan/Technique
8. Cluster Sampling
9. Multi-Stage Sampling
10. Stratified Sampling
The study should be designed so that the sampling frame is the entire population being studied. (My notes just say “should be the population studied”. May want to double check meaning.)
Probability Sampling Plan/Technique
Any sampling plan or technique that relies on random selection
Volunteer Sample and Volunteer Response
1. Volunteer Sample: Participants include themselves in the study. Biased because only people with strong opinions volunteer, but sometimes it’s the only ethical method. (Eg medical)
2. Volunteer Response: Participants are not required to respond. Biased because you don’t hear from those not interested in responding.
Individuals happen to be there at researcher’s convenience, like standing outside the arts building to catch students to question.
Cluster, Multi-Stage, and Stratified sampling.
CLUSTER: Select random sample of natural clusters (5 out of 40 majors) and use all the individuals within the selected clusters (all the students with those 5 majors).
MULTI-STAGE: Select random sample of clusters (5 out of 40 majors) and select random individuals within each selected cluster (random students within the five majors).
STRATIFIED: Use all the clusters/strata (all 40 majors). Randomly select individuals from each of the strata. (Random students within all 40 majors.)
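The difference between the three plans is easier to see side by side. This is a sketch on a made-up population of 40 majors with 30 students each. The function names, the sample sizes, and the fixed random seed are all assumptions for illustration.

```python
import random

rng = random.Random(0)   # fixed seed so the sketch is repeatable

# Hypothetical population: 40 majors, 30 students in each.
students_by_major = {
    f"major{m}": [f"m{m}-s{i}" for i in range(30)] for m in range(40)
}

def cluster_sample(groups, n_clusters):
    """Random clusters, then EVERY individual in the chosen clusters."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return [s for g in chosen for s in groups[g]]

def multistage_sample(groups, n_clusters, per_cluster):
    """Random clusters, then RANDOM individuals within each chosen cluster."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return [s for g in chosen for s in rng.sample(groups[g], per_cluster)]

def stratified_sample(groups, per_stratum):
    """ALL strata, then random individuals within each stratum."""
    return [s for g in sorted(groups) for s in rng.sample(groups[g], per_stratum)]

cluster = cluster_sample(students_by_major, 5)            # 5 majors x 30 students
multistage = multistage_sample(students_by_major, 5, 4)   # 5 majors x 4 students
stratified = stratified_sample(students_by_major, 4)      # 40 majors x 4 students
print(len(cluster), len(multistage), len(stratified))     # 150 20 160
```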
eg: Send to every 50th address. (Would exclude siblings because same last name. Might have other effects that need to be thought of depending on the system.)
Simple Random Sample
Select names out of a hat. The only sampling system with no bias.
3 Types of studies
1. Observational – no interference
2. Experiment – Researchers control inputs
3. Sample Survey – individuals report
(A study can’t be both observational and experimental)
Prospective vs retrospective studies
forward vs backward in time
the explanatory variable in a study
Imposed values of the explanatory variable in a study. (Four quitting smoking techniques.)
Randomized Controlled Experiment – what is it and can you draw causal conclusions from it?
Researchers control value of explanatory variable with a randomized procedure. (Subjects are randomly assigned to different treatments.) Can draw causal conclusions from this kind of study.
Causal Conclusions (when can you draw them?)
you can draw causal conclusions if the researchers randomly assigned the values of the explanatory variable to individuals
Segment of studied individuals who received no treatment (or a placebo, such as a sugar pill). Not always necessary, and sometimes ethically questionable.
“Blind” and “Double Blind”
Blind – participants don’t know what they’re getting
Double Blind – researchers and participants don’t know who is getting what. Prevents “experimenter effect”
prevented by double blind studies
Lack of realism (lack of ecological validity) (in a study)
when study participants don’t do what they are asked to do which skews the data
Not imposing complete randomization in a study, but blocking individuals into groups like male and female
1 individual in a study gets 2 treatments or 2 similar individuals get 2 treatments
Open vs Closed Questions on a survey
What is your favorite kind of food vs. Which of these five foods is your favorite?
6 types of survey questions to be aware of
1. Open vs. Closed questions
2. Unbalanced response options
3. Leading questions
4. Planting ideas with questions
5. complicated questions
6. sensitive questions
Leading questions vs. planting ideas with questions
Leading question: “how long have you been beating your wife?”
Planting ideas with questions: “Given the huge deficit, are you in favor of universal health care?”
P(it will rain) or P(it will not rain)
P(A) or P(not A)
P(B), P(C) and so on
Probability Rule #1: MEASUREMENT OF PROBABILITY (Made up term for memory tool. No title given to the rule in the text.)
between 0-1 (which means between 0-100% chance). So if the solution is above 1 it’s wrong.
Theoretical (Classical) vs. Empirical (Observational) Probability
Theoretical (Classical) : flipping coin, rolling dice. Outcomes can be predicted by the nature of the situation.
Empirical (Observational) : series of trials with outcomes that can’t be predicted
The probability of an event is the relative frequency with which it occurs in a long series of trials.
Relative Frequency of event A = number of times A occurred / total number of repetitions
Law of large numbers
As the number of trials increases the empirical probability gets closer and closer to the theoretical probability
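A small simulation sketch of the law of large numbers for a fair coin, where the theoretical probability of heads is 0.5. The function name and the seed are made up, and the exact frequencies printed depend on the random number generator, so none are shown.

```python
import random

def relative_frequency_of_heads(n_flips, seed=0):
    """Empirical probability of heads: number of heads / number of flips."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With only 10 flips the relative frequency can be far from 0.5. With 100,000 flips it should land very close to it.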
Sample Space vs. “Possible Outcomes for the Event”
Sample Space: The list of all possible outcomes
Possible Outcomes for the Event: outcomes which match the “event” being looked for
The complement of event A is
not A: the event that A does not occur. Its probability is written P(not A).
Overlapping circles to help visualize relationships between probabilities of events
Probability Rule #2 SUM OF PROBABILITIES (Made up term for memory tool. The rule was given no title in the text.)
The sum of the probabilities of all possible outcomes is 1
Probability Rule #3: THE COMPLEMENT RULE
P(not A) = 1 – P(A) or P(A) = 1 – P(not A)
The probability that an event does not occur is 1 minus the probability that it does occur, or vice versa. This makes sense when you remember that the sum of all the probabilities is 1. So the likelihood of something not happening is 1 minus the likelihood of it happening. Often, it is easier to find the complement, which is why we can use this formula either way.
Use for problems like, “At least one of several events occur”
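A worked sketch of an "at least one" problem: the chance of at least one head in three tosses of a fair coin. The complement of "at least one head" is "no heads at all". The tosses are independent, so the three tail probabilities multiply (Rule #5 in these notes). `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

p_tails = Fraction(1, 2)
p_no_heads = p_tails ** 3                  # tails on all three tosses: 1/8
p_at_least_one_head = 1 - p_no_heads       # complement rule: 1 - 1/8

print(p_at_least_one_head)                 # 7/8
```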
Probability Rule #4: THE ADDITION RULE FOR DISJOINT EVENTS
If A and B are disjoint events, then
P(A or B) = P(A) + P(B).
In other words, for disjoint events, “or” means “+”.
Probability Rule #5: THE MULTIPLICATION RULE FOR INDEPENDENT EVENTS
If A and B are independent events, then
P(A and B) = P(A) x P(B).
In other words, for independent events, “and” means “x” (multiply). (This may seem counterintuitive because you’re expecting that multiplying will make a larger number, but you’re always multiplying numbers between 0 and 1, so the result is smaller.)
Independent vs Disjoint events
IF TWO EVENTS ARE DISJOINT, THEY CAN’T BE INDEPENDENT (as long as each has a nonzero probability). There can be all other combos of the two.
DISJOINT = mutually exclusive. One happening means the other can’t happen. PART OF “OR” QUESTIONS.
INDEPENDENT = one happening doesn’t affect the probability of the other happening. PART OF “AND” QUESTIONS.
(Note: if the group from which individuals are chosen is very large, then one being chosen barely affects the probability that the next one chosen will be any certain type. In a small set, the first selection does affect the next selection.)
In probability, “or” means ________ and “and” means _______.
1.addition (more chance of)
2.multiplication (less chance of)
Probability Rule #6: GENERAL ADDITION RULE
P(A or B) = P(A) + P(B) – P(A and B)
Think of a Venn diagram with overlapping circles. You subtract the overlapping part because you included it twice, once as part of A and once as part of B. Problems like this can be interpreted as “at least one of two events”. Indeed you can use the complement rule for them to get the same results, but the general addition rule is easier. The complement rule is best for “at least ___ of many events”.
P(A or B) How do you solve?
1. Are the events disjoint?
2. If disjoint, use Addition Rule for Disjoint events:
P(A or B) = P(A) + P(B)
3. If not disjoint, use general addition rule:
P(A or B) = P(A) + P(B) – P(A and B)
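A sketch of both cases with one card drawn from a standard 52-card deck. The helper `p_or` is a made-up name. For disjoint events the overlap P(A and B) is 0, so the general rule reduces to the simple sum.

```python
from fractions import Fraction

def p_or(p_a, p_b, p_a_and_b=0):
    """General addition rule; leave p_a_and_b at 0 for disjoint events."""
    return p_a + p_b - p_a_and_b

# Not disjoint: a card can be both a heart and a king (the king of hearts).
p_heart_or_king = p_or(Fraction(13, 52), Fraction(4, 52), Fraction(1, 52))

# Disjoint: a card cannot be both a heart and a spade.
p_heart_or_spade = p_or(Fraction(13, 52), Fraction(13, 52))

print(p_heart_or_king)    # 4/13
print(p_heart_or_spade)   # 1/2
```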
How to solve: Two categorical variables each with two possible values
Two way table
Notation of conditional probability
P(B|A)
Probability of B, given A, or
Probability of B on the condition that A happened
The “definition of conditional probability” formula.
P(B|A) = P(A and B) / P(A)
Similar to how we say something has a 30 out of 100 chance of happening by saying 30/100, to find the probability of B happening given that A has happened, we take the probability of A and B happening and divide it by the probability of just A happening. Most common test question for this is “Side effect A, Side effect B, and both”. What is the probability that the patient who has suffered side effect A will also suffer side effect B? P(B|A) We take the chance of A and B and divide it by the chance of just A. You might think you can use a two way table for these problems, but if the question is, given that the patient got A, what is the chance he got B, then it’s not a simple matter of using the given info for the chance of getting both at the same time. You have to take that “both” figure and divide it by the “given side effect” figure. However it’s very useful to make a two way table to get the figures to plug into the “definition” formula.
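A sketch of the side-effect problem using a two way table to get the figures and then the definition P(B|A) = P(A and B) / P(A). The counts for 100 patients are made up for illustration.

```python
from fractions import Fraction

# Hypothetical two way table of 100 patients (counts).
table = {
    ("A", "B"): 5,            # suffered both side effects
    ("A", "not B"): 15,
    ("not A", "B"): 10,
    ("not A", "not B"): 70,
}
total = sum(table.values())

p_a = Fraction(table[("A", "B")] + table[("A", "not B")], total)   # P(A) = 20/100
p_a_and_b = Fraction(table[("A", "B")], total)                     # P(A and B) = 5/100

p_b_given_a = p_a_and_b / p_a     # the "both" figure divided by the "given" figure
print(p_b_given_a)                # 1/4
```

Note that the answer is 1/4, not the 5/100 chance of getting both: once we know the patient had A, only the 20 patients with A count.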
Perform an independence check
Events are independent if:
Method 1: P(A|B) = P(A)
Method 2: P(B|A) = P(B)
Method 3: P(B|A) = P(B|not A)
Method 4: P(A and B) = P(A) x P(B)
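A sketch of Method 4 with one card drawn from a standard 52-card deck. The function name is made up. Exact fractions are used so the equality check is safe. With real data (relative frequencies) the two sides are rarely exactly equal, so you would check whether they are close instead.

```python
from fractions import Fraction

def independent(p_a, p_b, p_a_and_b):
    """Method 4: events are independent if P(A and B) = P(A) x P(B)."""
    return p_a_and_b == p_a * p_b

# Heart and king: P(heart and king) = 1/52, and 13/52 x 4/52 = 1/52.
hearts_vs_kings = independent(Fraction(13, 52), Fraction(4, 52), Fraction(1, 52))

# Heart and red card: P(heart and red) = 13/52, but 13/52 x 26/52 = 1/8.
hearts_vs_red = independent(Fraction(13, 52), Fraction(26, 52), Fraction(13, 52))

print(hearts_vs_kings, hearts_vs_red)   # True False
```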
Probability Rule #7: (I gave it a number, text did not. Earlier referred to it as a version of rule #5) THE GENERAL MULTIPLICATION RULE
P(A and B) = P(A) x P(B|A)
Draw diagram where possibilities emerge from events. (My words, not the text)
When to use a Probability Tree
For scenarios where there are stages or conditional probabilities.
Bayes’ Rule or Bayes’ Theorem
P(A|B) = [P(A) x P(B|A)] / [P(A) x P(B|A) + P(not A) x P(B|not A)]
The denominator is P(B), written out using “The Law of Total Probability”.
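A worked sketch of Bayes' rule in the form P(A|B) = P(A) x P(B|A) / [P(A) x P(B|A) + P(not A) x P(B|not A)]. The scenario and all numbers are made up: A = "has the condition" (1% of people), B = "tests positive", with a 95% chance of a positive test given the condition and a 5% chance of a positive test without it.

```python
from fractions import Fraction

def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A)."""
    p_not_a = 1 - p_a                                    # complement rule
    p_b = p_a * p_b_given_a + p_not_a * p_b_given_not_a  # law of total probability
    return p_a * p_b_given_a / p_b

posterior = bayes(Fraction(1, 100), Fraction(95, 100), Fraction(5, 100))
print(posterior, float(posterior))    # 19/118, about 0.161
```

Even with a positive test, the chance of having the condition is only about 16%, because the condition is rare to begin with.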
The “definition” of conditional probability vs.
The General Multiplication Rule
P(B|A) = P(A and B)/P(A)
General Multiplication Rule:
P(A and B) = P(A) x P(B|A)
See how they are the same equation?
Linear Regression vs Correlation Coefficient
Linear Regression is finding the line that matches the way the data falls on the scatterplot. (If it’s not linear then it’s just called regression.)
Correlation Coefficient is calculating the strength of the linear relationship. (Can’t tell you IF there’s a linear relationship though.)
The range of the Correlation Coefficient vs. the range of probability
Range of Correlation Coefficient is -1 to 1. Close to zero is a weaker linear relationship.
Range of probability is 0-1, which can be translated into 0-100% chance.
Calculate the Correlation Coefficient
Text says you don’t need to know the formula. (It has lots of symbols I don’t know.) But it is part of calculating the linear regression. Perhaps you solve for the correlation coefficient.