2. EXPLORATORY DATA ANALYSIS (Summarize data.)

3. PROBABILITY ANALYSIS (Determine how the sample may differ from the population.)

4. INFERENCE (Draw conclusions.)

Categorical: category or label values into which individuals are grouped.

2. DISCOVER important features and patterns and striking deviations.

3. INTERPRET findings in the context of the problem

2. Bar Charts

3. Pictogram

2. Stemplot

3. Dotplot

4. Boxplot

2. Center – midpoint

3. Spread – approx range covered by all the data

4. Outliers – observations that fall outside overall pattern

2. Bimodal (double peaked) distribution

3. Uniform distribution (no clear peak; all values occur about equally often)

2. Draw a line to the right of the list

3. Write all the leaves next to the stem, and rearrange them in increasing order

2. When rotated, looks like a histogram

Y axis is range

Drawn box is interquartile range

Points for outliers, minimum and maximum

Is most useful for showing side by side comparisons

2. Third quarter of the data = Median through Q3 (Q3 is the 75th percentile)

3. Second quarter of the data = Q1 through Median (the Median is the 50th percentile)

4. 25th percentile = Q1

5. Lowest quarter of the data = Minimum through Q1

2. Median – the center value (or average of the two center values) (not sensitive to outliers)

3. Mean – the average (sensitive to outliers)

2. Inter-Quartile Range – the range of the middle 50%

3. Standard Deviation – roughly how far, on average, the observations are from their mean. (The mean may be 9, but a typical observation may be about 4 away from 9.)

2. Find median of bottom 50% (Q1, “The first quartile”)

3. Find median of top 50% (Q3, “The third quartile”)

4. Q3-Q1=IQR

2. Q3 + 1.5(IQR)

3. Any data points outside of these two values are possible outliers.
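
The quartile and outlier steps above can be sketched in Python. This is only an illustration: the data are made up, the function names are hypothetical, and the lower cutoff Q1 - 1.5(IQR) is assumed to be the first step of the rule, since only the upper cutoff is written out in these notes.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as the medians of the bottom and top halves."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]    # bottom 50% (middle value excluded when n is odd)
    upper = values[-half:]   # top 50%
    return median(lower), median(upper)

def possible_outliers(data):
    """Flag observations outside the two 1.5 x IQR cutoffs."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr    # assumed first step of the rule
    high_cutoff = q3 + 1.5 * iqr
    return [x for x in data if x < low_cutoff or x > high_cutoff]

scores = [1, 3, 4, 5, 5, 6, 7, 8, 9, 30]  # made-up data
print(quartiles(scores))
print(possible_outliers(scores))
```

Software packages often compute quartiles with slightly different conventions, so their Q1 and Q3 may not match the "median of each half" method exactly.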

2. Discard if produced by a different process and your purpose is to understand the process which produced most of the data.

3. Discard if produced by an error or typo that cannot be fixed.

2. Find distances between observations and the mean

3. Square each deviation

4. Add up the squares of each deviation and divide by the number of deviations minus 1

5. Find square root of result
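
The steps above can be sketched in Python. The data are made up, the function name is hypothetical, and the first step is assumed to be "find the mean", which is not written out in these notes.

```python
import math

def sample_standard_deviation(data):
    mean = sum(data) / len(data)                # assumed step 1: find the mean
    deviations = [x - mean for x in data]       # step 2: distances from the mean
    squared = [d ** 2 for d in deviations]      # step 3: square each deviation
    variance = sum(squared) / (len(data) - 1)   # step 4: divide by (count - 1)
    return math.sqrt(variance)                  # step 5: square root

observations = [5, 9, 9, 13]  # made-up data
print(sample_standard_deviation(observations))
```

Python's built-in `statistics.stdev` uses the same "divide by n - 1" definition, so it can be used to check the result.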

EXPLANATION

We can’t average the deviations because they add up to zero.

The reason we divide the sum of the squared deviations by the number of deviations minus 1 (rather than by the number of deviations) is beyond the scope of this course to explain.

The average of the squared deviations is called the variance of the data.

Approx 95% of observations fall within 2 standard deviations of the mean

Approx 99.7% of observations fall within 3 standard deviations of the mean

(3 standard deviations = the standard deviation x 3)

2. Use five-number summary for all others

2. Is the explanatory variable categorical or quantitative?

3. Is the response variable categorical or quantitative?

4. Notate it C-C, C-Q, Q-C, or Q-Q

5. Select approach based on above

2. Case C-Q: Side-by-side boxplots and five-number summaries

3. Case Q-C: Not covered in the text

4. Case Q-Q: Scatterplot (explanatory on x, response on y) or labelled scatterplot

-1 is the strongest negative linear relationship

+1 is the strongest positive linear relationship

Close to zero is a weaker linear relationship

2. b = r(Sy/Sx)

3. a = ȳ − b(x̄)

Key:

r = the correlation coefficient

Sx = standard deviation of the explanatory variable’s values

Sy = standard deviation of the response variable’s values

x̄ (x with a line over it) = the mean of the explanatory variable’s values

ȳ (y with a line over it) = the mean of the response variable’s values

EXPLANATION

Find the slope of the “least squares regression line”.

Just as the standard line equation, y = a + bx, gives you the slope (the change in y when x changes by 1), the “least squares regression line” formula gives you the average change in the response variable when the explanatory variable increases by 1 unit. It’s called the “least squares regression line” because it’s the line which results in the smallest sum of squared vertical deviations.

a=the y-intercept, or the value that y takes when x =zero

b=the slope, or the change in y when x changes by 1
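
The slope and intercept formulas above can be sketched in Python. The data and function name are hypothetical. The notes do not give a formula for r, so the sketch assumes the standard one: the sum of the products of the paired deviations, divided by (n - 1) times Sx times Sy.

```python
from statistics import mean, stdev

def least_squares_line(x, y):
    """Return (a, b, r) for the line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    n = len(x)
    # Assumed standard formula for the correlation coefficient r
    r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    b = r * (s_y / s_x)      # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b, r

hours = [1, 2, 3, 4, 5]        # made-up explanatory values
score = [52, 55, 61, 64, 68]   # made-up response values
a, b, r = least_squares_line(hours, score)
print(f"predicted score = {a:.2f} + {b:.2f} * hours (r = {r:.3f})")
```

Here b is read as the average change in score for each additional hour, and a is the predicted score at zero hours.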

Lurking variables are not among the variables in a study but could substantially affect your interpretation of the relationship among those variables.

“Some very very cute samples. Some pleasing, cute, magnificent samples.”

1. Sampling Frame

2. Volunteer Sample

3. Volunteer Response

4. Convenience Sample

5. Systematic Sampling

6. Simple Random Sample

7. Probability Sampling Plan/Technique

8. Cluster Sampling

9. Multi-Stage Sampling

10. Stratified Sampling

2. Participants are not required to respond. Biased because you don’t hear from those not interested in responding.

MULTI-STAGE: Select a random sample of clusters (5 out of 40 majors) and select random individuals within each chosen cluster (random students within the five majors).

STRATIFIED: Use all the clusters/strata (all 40 majors). Randomly select individuals from each of the strata. (Random students within all 40 majors.)
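
The difference between the two plans can be sketched in Python. Everything here is hypothetical: the 40 majors with 30 students each, the sample sizes, and the function names are made up for illustration.

```python
import random

def multi_stage_sample(groups, n_clusters, n_per_cluster, rng):
    """Randomly pick some clusters, then random individuals within each."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return {name: rng.sample(groups[name], n_per_cluster) for name in chosen}

def stratified_sample(groups, n_per_stratum, rng):
    """Use every stratum, picking random individuals from each one."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in groups.items()}

# Hypothetical population: 40 majors with 30 students each
majors = {
    f"major_{m:02d}": [f"student_{m:02d}_{s:02d}" for s in range(30)]
    for m in range(40)
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
multi = multi_stage_sample(majors, n_clusters=5, n_per_cluster=3, rng=rng)
strat = stratified_sample(majors, n_per_stratum=3, rng=rng)
print(len(multi), "majors sampled in the multi-stage plan")
print(len(strat), "majors sampled in the stratified plan")
```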

2. Experiment – Researchers control inputs

3. Sample Survey – individuals report

(A study can’t be both observational and experimental)

Double Blind – researchers and participants don’t know who is getting what. Prevents the “experimenter effect”.

2. Unbalanced response options

3. Leading questions

4. Planting ideas with questions

5. Complicated questions

6. Sensitive questions

Planting ideas with questions: “Given the huge deficit, are you in favor of universal health care?”

P(A) or P(not A)

P(B), P(C) and so on

Empirical (Observational): a series of trials with outcomes that can’t be predicted

Relative Frequency of event A = number of times A occurred / total number of repetitions
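
The relative frequency formula can be sketched in Python with a simulated series of coin flips. The simulation, the seed, and the function name are assumptions made for illustration.

```python
import random

def relative_frequency(outcomes, event):
    """Number of times the event occurred / total number of repetitions."""
    occurrences = sum(1 for outcome in outcomes if outcome == event)
    return occurrences / len(outcomes)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
flips = [rng.choice(["heads", "tails"]) for _ in range(10_000)]
rel_freq_heads = relative_frequency(flips, "heads")
print(rel_freq_heads)
```

With many repetitions of a fair coin, the relative frequency of heads should land close to 0.5.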

Possible Outcomes for the Event: outcomes which match the “event” being looked for

The sum of the probabilities of all possible outcomes is 1

The probability that an event does not occur is 1 minus the probability that it does occur, and vice versa. This makes sense when you remember that the sum of all the probabilities is 1. So the likelihood of something not happening is 1 minus the likelihood of it happening. Often, it is easier to find the complement, which is why we can use this formula either way.

Use for problems like, “At least one of several events occurs”
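
An "at least one" problem can be sketched in Python. The dice example and function name are made up, and the sketch assumes the trials are independent so that the chance of "none" is found by multiplying.

```python
from fractions import Fraction

def prob_at_least_one(p_single, n_trials):
    """P(at least one) = 1 - P(none), assuming independent trials."""
    p_none = (1 - p_single) ** n_trials
    return 1 - p_none

# Made-up example: at least one six in four rolls of a fair die
p = prob_at_least_one(Fraction(1, 6), 4)
print(p, float(p))
```

Finding P(none) takes one multiplication, while finding "exactly one six, or exactly two, or..." directly would take several cases.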

P(A or B) = P(A) + P(B).

In other words, for disjoint events, “or” means “+”.

In other words, for independent events, “and” means “x” (multiply). (This may seem counterintuitive because you’re expecting that multiplying will make a larger number, but probabilities are between 0 and 1, so multiplying them never makes a larger result.)

DISJOINT = mutually exclusive. One happening means another can’t happen. PART OF “OR” QUESTIONS.

INDEPENDENT = one happening doesn’t affect the probability of the other happening. PART OF “AND” QUESTIONS.

(Note: if the group from which individuals are chosen is very large, then one being chosen does not noticeably affect the probability that the next one chosen will be any certain type. In a small set, the first selection does affect the next selection.)

2. Multiplication (less chance of)

Think of a Venn diagram with overlapping circles. You subtract the overlapping part because you included it twice, once as part of A and once as part of B. Problems like this can be interpreted as “at least one of two events”. Indeed, you can use the complement rule for them to get the same results, but the general addition rule is easier. The complement rule is best for “at least ___ of many events”.

2. If disjoint, use Addition Rule for Disjoint events:

P(A)+P(B).

3. If not disjoint, use general addition rule:

P(A)+P(B)-P(A and B)
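
Both addition rules can be sketched with one small Python function. The card examples and the function name are made up for illustration.

```python
from fractions import Fraction

def prob_a_or_b(p_a, p_b, p_a_and_b=0):
    """General addition rule; for disjoint events P(A and B) is 0."""
    return p_a + p_b - p_a_and_b

# Not disjoint: drawing a heart or a king (the king of hearts is both)
p_heart_or_king = prob_a_or_b(Fraction(13, 52), Fraction(4, 52), Fraction(1, 52))

# Disjoint: drawing a heart or a spade (no card is both)
p_heart_or_spade = prob_a_or_b(Fraction(13, 52), Fraction(13, 52))

print(p_heart_or_king, p_heart_or_spade)
```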

Probability of B, given A or

Probability of B on the condition that A happened

Just as we express a 30 out of 100 chance as 30/100, we find the probability of B happening given that A has happened by taking the probability of both A and B happening and dividing it by the probability of A happening.

The most common test question for this is “Side effect A, Side effect B, and both”: what is the probability that a patient who has suffered side effect A will also suffer side effect B? That is P(B|A), so we take the chance of A and B and divide it by the chance of A.

You might think you can read the answer straight off a two-way table, but the “both” figure alone is not the answer. You have to take that “both” figure and divide it by the “given side effect” figure. However, it’s very useful to make a two-way table to get the figures to plug into the “definition” formula.
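
The side-effect question can be sketched in Python. The patient counts below are invented purely for illustration, and the function name is hypothetical.

```python
from fractions import Fraction

def conditional_probability(p_a_and_b, p_a):
    """P(B|A) = P(A and B) / P(A)."""
    if p_a == 0:
        raise ValueError("P(A) must be greater than 0")
    return p_a_and_b / p_a

# Invented two-way table for 100 patients
both = 5        # side effect A and side effect B
a_only = 15     # side effect A only
b_only = 10     # side effect B only
neither = 70
total = both + a_only + b_only + neither

p_a = Fraction(both + a_only, total)   # P(A)
p_a_and_b = Fraction(both, total)      # P(A and B)
p_b_given_a = conditional_probability(p_a_and_b, p_a)
print(p_b_given_a)
```

Note that the answer is not the "both" figure (5/100) on its own; it is that figure divided by the chance of side effect A (20/100).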

Method 1: P(A|B) = P(A)

Method 2: P(B|A) = P(B)

Method 3: P(B|A) = P(B|not A)

Method 4: P(A and B) = P(A) x P(B)
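
The four methods can be checked side by side in a short Python sketch. The probabilities are made up, the function name is hypothetical, and the sketch assumes P(A) is strictly between 0 and 1 and P(B) is greater than 0 so that none of the divisions fail.

```python
from fractions import Fraction

def independence_checks(p_a, p_b, p_a_and_b):
    """Apply the four methods; pass Fractions so comparisons are exact."""
    p_a_given_b = p_a_and_b / p_b
    p_b_given_a = p_a_and_b / p_a
    p_b_given_not_a = (p_b - p_a_and_b) / (1 - p_a)
    return {
        "method_1": p_a_given_b == p_a,
        "method_2": p_b_given_a == p_b,
        "method_3": p_b_given_a == p_b_given_not_a,
        "method_4": p_a_and_b == p_a * p_b,
    }

# Made-up figures: P(A) = 1/5, P(B) = 3/20, P(A and B) = 1/20
dependent = independence_checks(Fraction(1, 5), Fraction(3, 20), Fraction(1, 20))

# Made-up figures: P(A) = 1/2, P(B) = 1/3, P(A and B) = 1/6
independent = independence_checks(Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))

print(dependent)
print(independent)
```

In both examples the four methods agree with each other, so in practice any one of them is enough; pick whichever uses the figures you already have.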

Also known as “The Law of Total Probability”. (Not sure I wrote this formula down correctly.)

The General Multiplication Rule

P(B|A) = P(A and B)/P(A)

General Multiplication Rule:

P(A and B) = P(A) x P(B|A)

See how they are the same equation?
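
The link between the two forms can be sketched in Python. The card example is made up and the function name is hypothetical.

```python
from fractions import Fraction

def prob_a_and_b(p_a, p_b_given_a):
    """General multiplication rule: P(A and B) = P(A) x P(B|A)."""
    return p_a * p_b_given_a

# Made-up example: drawing two aces in a row without replacement
p_first_ace = Fraction(4, 52)
p_second_ace_given_first = Fraction(3, 51)   # one ace is already gone
p_two_aces = prob_a_and_b(p_first_ace, p_second_ace_given_first)
print(p_two_aces)

# Rearranging gives back the conditional probability formula
print(p_two_aces / p_first_ace == p_second_ace_given_first)
```

Because the deck is small, the first draw changes the chance for the second draw, which is why P(B|A) is used instead of P(B).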

The correlation coefficient measures the strength of a linear relationship. (It can’t tell you IF the relationship is linear, though.)

Range of probability is 0-1, which can be translated into 0-100% chance.