graph TD
A[Sampling Distributions] --> B[For Sample Means]
A --> C[For Sample Proportions]
A --> D[Sampling Procedures]
B --> B1[Sampling<br/>Error]
B --> B2[Mean of<br/>Sample Means]
B --> B3[Standard<br/>Error]
B --> B4[Applications for<br/>Normal Distribution]
B --> B5[Central Limit<br/>Theorem]
B --> B6[Finite Population<br/>Correction Factor]
C --> C1[Sampling<br/>Error]
C --> C2[Standard<br/>Error]
C --> C3[Applications for<br/>Normal Distribution]
C --> C4[Central Limit<br/>Theorem]
C --> C5[Finite Population<br/>Correction Factor]
D --> D1[Errors and<br/>Bias]
D --> D2[Sampling<br/>Methods]
D2 --> D2a[Simple Random<br/>Sampling]
D2 --> D2b[Systematic<br/>Sampling]
D2 --> D2c[Stratified<br/>Sampling]
D2 --> D2d[Cluster<br/>Sampling]
style A fill:#000,stroke:#000,color:#fff
style B fill:#000,stroke:#000,color:#fff
style C fill:#000,stroke:#000,color:#fff
style D fill:#000,stroke:#000,color:#fff
7 Sampling Distributions
The Bridge Between Probability and Statistical Inference
8 Chapter 6: Sampling Distributions
By the end of this chapter, you will be able to:
- Understand sampling distributions and how they differ from population distributions
- Calculate sampling error and interpret its business implications
- Apply the Central Limit Theorem to analyze sample means
- Compute standard error with and without finite population correction
- Determine probabilities for sample means using normal distribution
- Design effective sampling procedures (random, systematic, stratified, cluster)
- Recognize and minimize sampling bias in business research
- Make informed decisions based on sample statistics
8.1 Opening Scenario: Investment Portfolio Analysis
8.2 Introduction: The Foundation of Statistical Inference
Populations are typically too large to study in their entirety. Imagine trying to:
- Survey all 150 million U.S. consumers about product preferences
- Test every smartphone produced for quality (destructive testing!)
- Measure income of all 7 billion people on Earth
Solution: Select a representative sample of manageable size, then use it to draw conclusions about the population.
Population Parameter (μ, σ, π):
- Numerical characteristic of the entire population
- Usually unknown (that’s why we sample!)
- Examples: μ = average income of all U.S. households
Sample Statistic (\bar{X}, s, p):
- Numerical characteristic of the sample
- Calculated from sample data
- Used as estimator of population parameter
Statistical Inference:
The process of using a sample statistic to draw conclusions about a population parameter.
Critical insight: Different samples from the same population will produce different statistics!
Example: Fortune 500 companies
- Population: N = 500 companies
- Sample: n = 50 companies randomly selected
- Statistic: \bar{X} = average return rate for the 50
- Parameter: μ = average return rate for all 500
- Inference: Use \bar{X} to estimate μ
8.3 Sampling Distributions - The Heart of Inference
8.3.1 The Fundamental Concept
From any population of size N, we can select many different samples of size n. Each sample will likely have a different mean.
We can create a distribution of all possible sample means - this is the sampling distribution!
Sampling Distribution: A listing of all possible values for a statistic (like \bar{X}) and the probability associated with each value.
Purpose: Allows us to calculate probabilities about sample statistics and quantify sampling error.
8.3.2 Example 6.1: College Student Incomes - Building a Sampling Distribution
Solution:
Number of possible samples:
{}_{4}C_{2} = \frac{4!}{2!2!} = 6 \text{ different samples}
All Possible Samples:
| Sample | Students Selected | Incomes | Sample Mean \bar{X} |
|---|---|---|---|
| 1 | A, B | $100, $200 | $150 |
| 2 | A, C | $100, $300 | $200 |
| 3 | A, D | $100, $400 | $250 |
| 4 | B, C | $200, $300 | $250 |
| 5 | B, D | $200, $400 | $300 |
| 6 | C, D | $300, $400 | $350 |
{.striped .hover}
Key observation: Only samples 3 and 4 yield \bar{X} = \mu = 250!
Sampling Distribution of \bar{X}:
| \bar{X} | Frequency | Probability P(\bar{X}) |
|---|---|---|
| $150 | 1 | 1/6 = 0.167 |
| $200 | 1 | 1/6 = 0.167 |
| $250 | 2 | 2/6 = 0.333 |
| $300 | 1 | 1/6 = 0.167 |
| $350 | 1 | 1/6 = 0.167 |
| Total | 6 | 1.000 |
{.striped .hover}
Probability of “perfect” estimate: Only 33.3% chance (2 out of 6 samples) that \bar{X} = \mu
Sampling error will occur: 66.7% of samples produce some estimation error!
Sampling Error: (\bar{X} - \mu)
- Sample 1: $150 - $250 = -$100 (underestimate)
- Sample 6: $350 - $250 = +$100 (overestimate)
Reality check: You’ll never know actual sampling error because μ is unknown (that’s why you’re sampling!). But you must acknowledge it exists.
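The full enumeration for Example 6.1 can be reproduced with a few lines of Python; this stdlib-only sketch lists all {}_{4}C_{2} = 6 samples of the four incomes, each sample's mean, and each sample's sampling error:

```python
from itertools import combinations
from statistics import mean

# The four student incomes (A, B, C, D) from Example 6.1
incomes = [100, 200, 300, 400]
mu = mean(incomes)                       # population mean = 250

# Every possible sample of size n = 2 (4C2 = 6 samples)
sample_means = [mean(s) for s in combinations(incomes, 2)]

# Mean of all sample means equals the population mean
grand_mean = mean(sample_means)

# Standard error: root of the average squared deviation from mu
variance = sum((x - mu) ** 2 for x in sample_means) / len(sample_means)
std_error = variance ** 0.5

print(sample_means)                       # [150, 200, 250, 250, 300, 350]
print(grand_mean, round(std_error, 2))
```

The same script verifies the two results derived in the next subsections: the grand mean comes out to exactly 250 = μ, and the standard error to about 64.55.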
8.3.3 The Mean of Sample Means (\bar{\bar{X}})
The sampling distribution itself has a mean, called the “mean of the means” or “grand mean”.
\bar{\bar{X}} = \frac{\sum \bar{X}}{K}
Where:
- K = number of samples in the sampling distribution
- \bar{X} = individual sample means
Remarkable Property:
\bar{\bar{X}} = \mu
The mean of all possible sample means always equals the population mean!
Calculating for our example:
\bar{\bar{X}} = \frac{150 + 200 + 250 + 250 + 300 + 350}{6} = \frac{1500}{6} = 250 = \mu
This is NOT a coincidence - it’s a fundamental property of sampling distributions!
8.3.4 Standard Error - Quantifying Sampling Variability
Just as the population has variance σ², the sampling distribution has variance σ²_{\bar{X}}.
Variance of Sample Means:
\sigma^2_{\bar{X}} = \frac{\sum(\bar{X} - \mu)^2}{K}
Standard Error (SE):
\sigma_{\bar{X}} = \sqrt{\sigma^2_{\bar{X}}}
Interpretation: Standard error measures how much sample means tend to deviate from the population mean (μ).
Larger SE = greater sampling variability = less precise estimates
Smaller SE = less sampling variability = more precise estimates
Calculating for our example:
\sigma^2_{\bar{X}} = \frac{(150-250)^2 + (200-250)^2 + (250-250)^2 + (250-250)^2 + (300-250)^2 + (350-250)^2}{6}
= \frac{10,000 + 2,500 + 0 + 0 + 2,500 + 10,000}{6} = \frac{25,000}{6} = 4,167 \text{ dollars}^2
\sigma_{\bar{X}} = \sqrt{4,167} = \$64.55
Business interpretation: On average, sample means deviate from true population mean by about $65.
8.3.5 Example 6.2: East Coast Manufacturing Sales
Solution:
Step 1: Determine number of samples
{}_{5}C_{3} = \frac{5!}{3!2!} = 10 \text{ possible samples}
Step 2: List all samples and calculate means
| Sample | Months Selected | \bar{X} | Sample | Months Selected | \bar{X} |
|---|---|---|---|---|---|
| 1 | 68, 73, 65 | 68.67 | 6 | 68, 80, 72 | 73.33 |
| 2 | 68, 73, 80 | 73.67 | 7 | 73, 65, 80 | 72.67 |
| 3 | 68, 73, 72 | 71.00 | 8 | 73, 65, 72 | 70.00 |
| 4 | 68, 65, 80 | 71.00 | 9 | 73, 80, 72 | 75.00 |
| 5 | 68, 65, 72 | 68.33 | 10 | 65, 80, 72 | 72.33 |
{.striped .hover}
Step 3: Create sampling distribution
| \bar{X} | Probability |
|---|---|
| 68.33 | 1/10 = 0.10 |
| 68.67 | 1/10 = 0.10 |
| 70.00 | 1/10 = 0.10 |
| 71.00 | 2/10 = 0.20 |
| 72.33 | 1/10 = 0.10 |
| 72.67 | 1/10 = 0.10 |
| 73.33 | 1/10 = 0.10 |
| 73.67 | 1/10 = 0.10 |
| 75.00 | 1/10 = 0.10 |
{.striped .hover}
Step 4: Calculate mean of sampling distribution
\bar{\bar{X}} = \frac{68.67 + 73.67 + ... + 72.33}{10} = 71.6 = \mu \quad ✓
Step 5: Calculate standard error
\sigma^2_{\bar{X}} = \frac{(68.67-71.6)^2 + (73.67-71.6)^2 + ... + (72.33-71.6)^2}{10}
= \frac{43.07}{10} = 4.31 \text{ (thousand dollars)}^2
\sigma_{\bar{X}} = \sqrt{4.31} = 2.08 \text{ thousand dollars}
Expected sampling error: ± $2,080 on average
With 10 possible samples:
- No sample mean equals μ = 71.6 exactly, so every sample carries some sampling error
- The closest estimates (samples 3 and 4, \bar{X} = 71.00) still miss μ by $600
Managerial decision: Is ±$2,080 error “relatively small” for ECM’s purposes?
- Relative to mean monthly sales of $71,600, that’s about a 2.9% error - acceptable for routine planning
- If making critical budget decisions: may need a larger sample
8.3.6 Shortcut Formula for Standard Error (When σ is Known)
The manual calculation of \sigma_{\bar{X}} is tedious. There’s a shortcut:
Basic Formula (Infinite Population or Sampling with Replacement):
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
With Finite Population Correction (FPC):
Use when sampling without replacement AND n > 0.05N
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}
Where \sqrt{\frac{N-n}{N-1}} is the finite population correction factor
When FPC is unnecessary:
If n/N < 0.05 (sample is less than 5% of population), FPC ≈ 1, so skip it!
Example: Quality control sampling
- Population: N = 10,000 units, σ = 5 grams
- Sample: n = 100 units
- Check: n/N = 100/10,000 = 0.01 < 0.05 → No FPC needed
\sigma_{\bar{X}} = \frac{5}{\sqrt{100}} = 0.5 \text{ grams}
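The decision rule can be wrapped in a small helper; a sketch that applies the FPC only when n exceeds 5% of the population (the N = 500 case is an added illustration, not from the text):

```python
from math import sqrt

def standard_error(sigma, n, N=None):
    """Standard error of the mean; applies the finite population
    correction only when the sample exceeds 5% of the population."""
    se = sigma / sqrt(n)
    if N is not None and n / N > 0.05:
        se *= sqrt((N - n) / (N - 1))
    return se

# Quality-control example: N = 10,000, sigma = 5 g, n = 100
# n/N = 0.01 < 0.05, so no correction is applied
print(standard_error(5, 100, N=10_000))   # 0.5 grams

# Same sigma and n, but a small population of N = 500 -> FPC kicks in
print(standard_error(5, 100, N=500))
```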
8.3.7 The Impact of Sample Size on Standard Error
Critical relationship: As n ↑, σ_{\bar{X}} ↓
Example: Population with σ = 100
| Sample Size (n) | Standard Error | Reduction |
|---|---|---|
| 25 | 100/√25 = 20.0 | baseline |
| 100 | 100/√100 = 10.0 | 50% smaller |
| 400 | 100/√400 = 5.0 | 75% smaller |
| 1,600 | 100/√1600 = 2.5 | 87.5% smaller |
{.striped .hover}
Doubling sample size does NOT halve the standard error!
To cut SE in half, you must quadruple the sample size (because of the √n in the denominator).
Business trade-off:
- Larger samples → more precision (smaller SE)
- Larger samples → higher cost, more time
Optimal sample size balances precision needs against resource constraints.
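The quadrupling rule from the table is easy to confirm directly; a quick sketch:

```python
from math import sqrt

sigma = 100
ses = {n: sigma / sqrt(n) for n in (25, 100, 400, 1600)}

# Quadrupling n halves the standard error; merely doubling it does not
assert ses[100] == ses[25] / 2
assert ses[400] == ses[100] / 2

print(ses)   # {25: 20.0, 100: 10.0, 400: 5.0, 1600: 2.5}
```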
8.4 The Central Limit Theorem - The Most Important Theorem in Statistics
This is where the magic happens!
Statement:
For a population with any distribution shape, as sample size n increases, the sampling distribution of \bar{X} approaches a normal distribution with:
- Mean: \bar{\bar{X}} = \mu
- Standard Error: \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
Rule of Thumb: If n ≥ 30, sampling distribution is approximately normal regardless of population shape.
If population is already normal: Sampling distribution is normal for any sample size (even n = 2)!
8.4.1 Case 1: Population is Normal → Sampling Distribution is Normal
Example: Adult heights
- Population: Normally distributed, μ = 67 inches, σ = 3 inches
- Samples: n = 25
Sampling Distribution Properties:
- Shape: Normal (because population is normal)
- Mean: \bar{\bar{X}} = \mu = 67 inches
- Standard Error: \sigma_{\bar{X}} = \frac{3}{\sqrt{25}} = 0.6 inches
Key insight: Individual heights vary ±3 inches, but sample means vary only ±0.6 inches!
8.4.2 Case 2: Population is NOT Normal → CLT Creates Normality (if n ≥ 30)
Example: Income distribution (right-skewed)
- Population: Skewed right, μ = $50,000, σ = $20,000
- Samples: n = 50
Sampling Distribution Properties:
- Shape: Approximately Normal (CLT magic, since n = 50 ≥ 30)
- Mean: \bar{\bar{X}} = \mu = 50{,}000
- Standard Error: \sigma_{\bar{X}} = \frac{20{,}000}{\sqrt{50}} = 2{,}828
Even though population is skewed, sample means distribute normally!
8.4.3 Visual Demonstration: The Power of CLT
Scenario: Population is uniform (rectangular, definitely NOT normal)
- μ = 1000
- σ = 100
What happens as n increases?
| Sample Size | SE | Distribution Shape |
|---|---|---|
| n = 5 | 100/√5 = 44.7 | Still somewhat uniform |
| n = 10 | 100/√10 = 31.6 | Starting to mound |
| n = 30 | 100/√30 = 18.3 | Nearly normal |
| n = 50 | 100/√50 = 14.1 | Normal ✓ |
| n = 100 | 100/√100 = 10.0 | Very normal, tight |
{.striped .hover}
Practical implication: You can use normal distribution tables and methods for sample means even when:
- You don’t know the population distribution shape
- The population is definitely not normal
As long as n ≥ 30!
This is why surveys typically use samples of 30+ respondents: it helps ensure the standard statistical tools (confidence intervals, hypothesis tests) behave as expected.
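The table above can be reproduced by simulation; a stdlib-only sketch (the seed and the 5,000-replication count are arbitrary choices) that draws repeated samples from a uniform population with μ = 1000, σ = 100 and compares the empirical spread of sample means against the theoretical σ/√n:

```python
import random
from statistics import mean, stdev

random.seed(1)

# Uniform population with mu = 1000; half-width 100*sqrt(3) gives sigma = 100
lo, hi = 1000 - 100 * 3 ** 0.5, 1000 + 100 * 3 ** 0.5

def draw_means(n, trials=5_000):
    """Empirical sampling distribution of the mean for sample size n."""
    return [mean(random.uniform(lo, hi) for _ in range(n)) for _ in range(trials)]

results = {}
for n in (5, 30):
    results[n] = stdev(draw_means(n))          # empirical standard error
    print(n, round(results[n], 1), round(100 / n ** 0.5, 1))  # vs. sigma/sqrt(n)
```

The empirical standard errors land close to 44.7 (n = 5) and 18.3 (n = 30), matching the table, and a histogram of the n = 30 means would already look bell-shaped despite the flat population.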
8.5 Applications: Using Sampling Distributions for Business Decisions
Now we put theory into action!
8.5.1 The Z-Score for Sample Means
In Chapter 5, we calculated probabilities for individual observations using:
Z = \frac{X - \mu}{\sigma}
For sample means, the formula changes:
Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}
Interpretation: Number of standard errors \bar{X} is from μ
8.5.2 Example 6.3: TelCom Satellite - Comparing Probabilities
Solution:
a) Single transmission: P(150 ≤ X ≤ 155)
Z = \frac{X - \mu}{\sigma} = \frac{155 - 150}{15} = 0.33
From Table E: Area = 0.1293
P(150 ≤ X ≤ 155) = 0.1293 (12.93%)
b) Mean of n = 50: P(150 ≤ \bar{X} ≤ 155)
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{50}} = 2.12
Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{155 - 150}{2.12} = 2.36
From Table E: Area = 0.4909
P(150 ≤ \bar{X} ≤ 155) = 0.4909 (49.09%)
Single call: Only 12.93% chance between 150-155 seconds
Average of 50 calls: 49.09% chance the mean is between 150-155 seconds
Why? Sample means cluster much more tightly around μ than individual observations!
- Individual observations: σ = 15 seconds spread
- Sample means (n=50): σ_{\bar{X}} = 2.12 seconds spread
Business application: If you need to forecast total time for 50 calls, you can be much more confident in your estimate than for a single call.
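Both parts of Example 6.3 can be checked numerically; a sketch using Python's math.erf as an exact standard normal CDF (exact values differ in the third decimal from the table-rounded figures above, since Table E rounds Z to two places):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function (no tables needed)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 150, 15, 50

# a) single transmission: P(150 <= X <= 155), Z = 5/15
p_single = norm_cdf((155 - mu) / sigma) - norm_cdf(0)

# b) mean of n = 50 calls: P(150 <= Xbar <= 155), SE = 15/sqrt(50)
se = sigma / sqrt(n)
p_mean = norm_cdf((155 - mu) / se) - norm_cdf(0)

print(round(p_single, 4), round(p_mean, 4))
```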
8.5.3 Example 6.4: TelCom Equipment Investment Decision
Solution:
Calculate standard error first:
\sigma_{\bar{X}} = \frac{15}{\sqrt{35}} = 2.54 \text{ seconds}
a) P(145 ≤ \bar{X} ≤ 150)
Z = \frac{145 - 150}{2.54} = -1.97
Area = 0.4756
P(145 ≤ \bar{X} ≤ 150) = 0.4756 (47.56%)
b) P(\bar{X} ≥ 145)
Z = -1.97 \text{ (same as part a)}
P(\bar{X} ≥ 145) = 0.4756 + 0.5000 = 0.9756 (97.56%)
c) P(\bar{X} ≤ 155)
Z = \frac{155 - 150}{2.54} = +1.97
P(\bar{X} ≤ 155) = 0.5000 + 0.4756 = 0.9756 (97.56%)
d) P(145 ≤ \bar{X} ≤ 155)
From parts a and c: Both give area = 0.4756
P(145 ≤ \bar{X} ≤ 155) = 0.4756 + 0.4756 = 0.9512 (95.12%)
e) P(\bar{X} > 155)
Z = +1.97
P(\bar{X} > 155) = 0.5000 - 0.4756 = 0.0244 (2.44%)
Key findings:
✅ 97.56% chance mean time ≥ 145 seconds (very reliable lower bound)
✅ 95.12% chance mean time between 145-155 seconds (excellent precision)
⚠️ Only 2.44% chance mean time > 155 seconds (low risk of extreme delays)
Managerial implications:
- Current system is predictable: 95% of the time, average of 35 calls will be within ±5 seconds of 150
- New equipment decision: If it promises to reduce μ from 150 to 140 seconds, that’s a 10-second improvement - highly significant given SE = 2.54!
- Budget confidently: Can plan capacity around 145-155 second range with 95% confidence
Recommendation: Proceed with equipment investment if improvement > 2 standard errors (2 × 2.54 = 5.08 seconds), which ensures statistically detectable improvement.
8.6 Sampling Distributions for Proportions
Many business decisions involve proportions rather than means:
- Will a customer buy or not buy? (Marketing)
- Will a depositor default or not default? (Banking)
- Will a project generate positive return or not? (Capital budgeting)
- Is a unit defective or not defective? (Quality control)
We use sample proportion p to estimate population proportion π.
Properties:
E(p) = \pi
The mean of all possible sample proportions equals the population proportion!
Standard Error of Proportions:
\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}
With Finite Population Correction (if n > 0.05N):
\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} \sqrt{\frac{N-n}{N-1}}
Z-Score for Proportions:
Z = \frac{p - \pi}{\sigma_p}
8.6.1 Example 6.5: Lugget Furniture - Advertising Effectiveness
Solution:
| Sample | Customers | Number of “Yes” | Sample Proportion p |
|---|---|---|---|
| 1 | S₁, N₂ | 1 | 0.50 |
| 2 | S₁, N₃ | 1 | 0.50 |
| 3 | S₁, S₄ | 2 | 1.00 |
| 4 | N₂, N₃ | 0 | 0.00 |
| 5 | N₂, S₄ | 1 | 0.50 |
| 6 | N₃, S₄ | 1 | 0.50 |
{.striped .hover}
Expected value:
E(p) = \frac{\sum p}{K} = \frac{0.50 + 0.50 + 1.00 + 0.00 + 0.50 + 0.50}{6} = \frac{3.00}{6} = 0.50 = \pi \quad ✓
Standard error (with FPC):
\sigma_p = \sqrt{\frac{0.50(1-0.50)}{2}} \sqrt{\frac{4-2}{4-1}} = \sqrt{0.125} \sqrt{0.667} = 0.289
Interpretation: Sample proportions vary ±0.29 around true proportion π = 0.50
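The enumeration for Example 6.5 mirrors the one for means; a sketch coding each customer as 1 ("yes") or 0 ("no") and verifying both E(p) = π and the FPC-corrected standard error:

```python
from itertools import combinations
from math import sqrt

# 1 = customer recalled the ad, 0 = did not (S1, N2, N3, S4)
responses = [1, 0, 0, 1]
N, n = len(responses), 2
pi = sum(responses) / N                      # population proportion = 0.50

# All 4C2 = 6 samples and their sample proportions
props = [sum(s) / n for s in combinations(responses, n)]
expected_p = sum(props) / len(props)         # E(p) = pi

# Shortcut formula with the finite population correction (n > 0.05N here)
se = sqrt(pi * (1 - pi) / n) * sqrt((N - n) / (N - 1))

print(props, expected_p, round(se, 3))
```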
8.6.2 Example 6.6: BelLabs Component Quality - Multiple Decision Thresholds
Solution:
Assumptions: Population N is very large (many components), so n/N < 0.05 → No FPC needed
Standard error:
\sigma_p = \sqrt{\frac{0.10(0.90)}{200}} = \sqrt{0.00045} = 0.021
a) P(p > 0.12) - Seek new supplier
Z = \frac{0.12 - 0.10}{0.021} = 0.95
From Table E: Area = 0.3289
P(p > 0.12) = 0.5000 - 0.3289 = 0.1711
17.11% probability of seeking new supplier
b) P(0.10 ≤ p ≤ 0.12) - Consider new supplier
From part a: Area between 0.10 and 0.12 = 0.3289 (32.89%)
c) P(0.05 ≤ p ≤ 0.10) - Stay with supplier
Z = \frac{0.05 - 0.10}{0.021} = -2.38
From Table E: Area = 0.4913
P(0.05 \leq p \leq 0.10) = 0.4913
49.13% probability of staying with supplier ✓ HIGHEST!
d) P(p < 0.05) - Increase orders
P(p < 0.05) = 0.5000 - 0.4913 = 0.0087
Only 0.87% probability of increasing orders
Most likely outcome: BelLabs will keep current supplier (49.13% probability).
Business rationale:
- Nearly 50% chance defects stay in acceptable 5-10% range
- Only 17% chance situation is bad enough to require supplier change
- Less than 1% chance quality improves enough to justify order increase
Risk assessment:
- 17% chance of >12% defects is significant risk
- BelLabs should monitor next several shipments closely
- Consider negotiating quality improvement clause with supplier
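The four decision-threshold probabilities of Example 6.6 can be computed in one pass; a sketch using an exact normal CDF (results agree with the table-based figures above to about the third decimal):

```python
from math import erf, sqrt

def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

pi, n = 0.10, 200
se = sqrt(pi * (1 - pi) / n)     # ~ 0.021

z12 = (0.12 - pi) / se           # threshold p = 0.12
z05 = (0.05 - pi) / se           # threshold p = 0.05

p_new_supplier = 1 - norm_cdf(z12)              # a) p > 0.12
p_consider     = norm_cdf(z12) - norm_cdf(0)    # b) 0.10 <= p <= 0.12
p_stay         = norm_cdf(0) - norm_cdf(z05)    # c) 0.05 <= p <= 0.10
p_more_orders  = norm_cdf(z05)                  # d) p < 0.05

# The four decision branches partition all possible outcomes
print(round(p_new_supplier, 4), round(p_consider, 4),
      round(p_stay, 4), round(p_more_orders, 4))
```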
8.6.3 Example 6.7: Tax Referendum - Political Decision
Solution:
Standard error:
\sigma_p = \sqrt{\frac{0.82(0.18)}{1000}} = \sqrt{0.0001476} = 0.0121
P(p > 0.85):
Z = \frac{0.85 - 0.82}{0.0121} = 2.48
From Table E: Area = 0.4934
P(p > 0.85) = 0.5000 - 0.4934 = 0.0066
Only 0.66% probability the referendum appears on the ballot!
Despite 82% actual support (strong majority!), there’s less than 1% chance the sample will show the required 85% threshold.
Why? The 85% requirement is 2.48 standard errors above the true mean - an extreme outcome.
Implications:
- The 85% threshold is too high given sampling variability
- City council should lower threshold to 80-82% for fairness
- With current rule, genuinely popular measures may fail to reach ballot due to random sampling error
Better policy: Use an 82% threshold (the actual population value) or add a confidence interval around the survey result.
8.7 Sampling Methods and Procedures
Selecting a representative sample is critical. A biased sample produces unreliable estimates, even with large n!
8.7.1 Sources of Sampling Error
1. Random Chance (“Bad Luck”)
- By pure chance, sample may include atypical elements
- Sample might have unusually high or low values
- Cannot be eliminated but can be quantified with standard error
2. Sample Bias (Systematic Error)
- Tendency to favor certain samples over others
- Arises from flawed data collection procedures
- CAN and MUST be minimized through proper sampling design
8.7.2 Historical Example: The Literary Digest Debacle (1936)
1936 Presidential Election:
- Literary Digest predicted: Alf Landon (Republican) wins overwhelmingly
- Actual result: Franklin D. Roosevelt (Democrat) wins in landslide
What went wrong?
Sampling method:
- Drew names from telephone directories
- Drew names from magazine subscriber lists
The fatal flaw: In 1936, during the Great Depression:
- Only wealthy people could afford telephones and magazine subscriptions
- Wealthy people blamed Republicans less for economic hardship
- Sample was NOT representative of the general voting population
Result: Magazine went out of business. The prediction error destroyed credibility.
Lesson: A biased sample of 2 million is worse than a random sample of 1,000!
8.7.3 Sampling Method 1: Simple Random Sampling (SRS)
Every possible sample of size n has an equal probability of being selected.
Example: Select 5 states from 50 for consumer taste testing
- Total possible samples: {}_{50}C_5 = 2,118,760
- SRS ensures each combination has probability = 1/2,118,760
Implementation:
1. Manual: Write names on identical papers, draw from hat
2. Random number table: Use pre-generated random digits (Table A)
3. Computer: Use random number generators (Excel RANDBETWEEN, Python random.sample)
Advantages:
✓ Unbiased
✓ Simple to understand
✓ Allows calculation of sampling error
Disadvantages:
✗ Requires complete list of population (sampling frame)
✗ May miss important subgroups by chance
✗ Can be expensive if population is geographically dispersed
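Python's random.sample, mentioned above, implements SRS directly; a sketch with hypothetical placeholder labels standing in for the 50 states:

```python
import random

random.seed(7)

# Sampling frame: placeholder labels for the 50 states
states = [f"state_{i:02d}" for i in range(1, 51)]

# Simple random sample of n = 5: each of the 2,118,760
# possible combinations is equally likely
taste_test_sites = random.sample(states, 5)

print(taste_test_sites)
```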
8.7.4 Sampling Method 2: Systematic Sampling
Select every ith element from an ordered population.
Steps:
1. Determine sampling interval: i = \frac{N}{n}
2. Randomly select starting point between 1 and i
3. Select every ith element thereafter
Example: Sample 100 from population of 1,000
- Sampling interval: i = 1000/100 = 10
- Random start: 7 (randomly chosen between 1-10)
- Sample: 7, 17, 27, 37, 47, …, 997
Advantages:
✓ Easy to implement (no expertise needed)
✓ Spreads sample evenly across population
✓ Less expensive than SRS for large populations
Disadvantages:
✗ Danger of hidden patterns! If population has cyclical pattern matching the interval, severe bias results
✗ Example: Sampling every 7th day might always hit Mondays (different from other days)
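The three steps above translate directly into code; a sketch (the helper name and seed are illustrative choices):

```python
import random

def systematic_sample(N, n, seed=None):
    """Every i-th element after a random start, with interval i = N // n."""
    i = N // n
    start = random.Random(seed).randint(1, i)   # random start in 1..i
    return list(range(start, N + 1, i))

# Sample 100 from a population of 1,000: interval i = 10
sample = systematic_sample(1000, 100, seed=3)
print(sample[:5], len(sample))
```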
8.7.5 Sampling Method 3: Stratified Sampling
Divide heterogeneous population into homogeneous subgroups (strata), then sample from each stratum.
Proportional stratification: Sample from each stratum proportionally to its size in population.
Example: USDA Drought Impact Study
Population: Farmers in 4 states (Kansas, Oklahoma, Nebraska, South Dakota)
- Kansas: 30% of all farmers → 30% of sample from Kansas
- Oklahoma: 25% of all farmers → 25% of sample from Oklahoma
- Nebraska: 28% of all farmers → 28% of sample from Nebraska
- South Dakota: 17% of all farmers → 17% of sample from South Dakota
Advantages:
✓ Ensures representation of important subgroups
✓ More precise than SRS of same size
✓ Can compare subgroups (stratified analysis)
✓ Reduces sampling error when strata are internally homogeneous
Disadvantages:
✗ Requires knowledge of population structure
✗ More complex to implement
✗ Need separate sampling frame for each stratum
When to use: Population is heterogeneous but contains identifiable homogeneous subgroups.
Example application: Income survey
- Strata: Age groups (18-30, 31-45, 46-60, 61+)
- Income varies greatly between age groups but is more similar within groups
- Stratified sampling ensures all age groups represented
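Proportional allocation for the USDA example is a one-line computation per stratum; a sketch with hypothetical farmer frames (note that in general, rounding the allocations may require a small adjustment so they sum to n):

```python
import random

random.seed(11)

# Stratum shares from the USDA drought-study example
shares = {"Kansas": 0.30, "Oklahoma": 0.25, "Nebraska": 0.28, "South Dakota": 0.17}
total_n = 200

# Proportional allocation: each stratum's sample size matches its share
allocation = {state: round(total_n * share) for state, share in shares.items()}
print(allocation)   # Kansas 60, Oklahoma 50, Nebraska 56, South Dakota 34

# Draw the allocated number of farmers from each stratum's (hypothetical) frame
frames = {s: [f"{s}_farm_{i}" for i in range(1000)] for s in shares}
sample = {s: random.sample(frames[s], k) for s, k in allocation.items()}
```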
8.7.6 Sampling Method 4: Cluster Sampling
Divide population into clusters (groups), randomly select some clusters, include ALL elements in selected clusters.
Key difference from stratified:
- Stratified: Sample from ALL strata
- Cluster: Sample only SOME clusters (but all elements within selected clusters)
Example: USDA Drought Study (Cluster Approach)
Clusters: Counties within each state
1. Randomly select 15 counties from all counties in the 4 states
2. Include ALL farmers in the 15 selected counties
Advantages:
✓ Cost-effective when population is geographically dispersed
✓ No need for complete population list (only cluster list)
✓ Easier field logistics (visit all farms in selected counties)
Disadvantages:
✗ Higher sampling error than SRS of same size (elements within cluster may be similar)
✗ If clusters are internally homogeneous, efficiency decreases
When to use: Geographically concentrated populations, high travel costs, or when complete population list unavailable.
Combining methods: Can use stratified cluster sampling
- Divide 4 states into strata
- Within each state, use cluster sampling of counties
- Sample proportional number of counties from each state
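The cluster procedure contrasts neatly with stratified sampling in code; a sketch with hypothetical county clusters (cluster sizes are randomly generated for illustration):

```python
import random

random.seed(5)

# Hypothetical county clusters, each listing all of its farmers
counties = {f"county_{c:02d}": [f"farmer_{c:02d}_{k}"
                                for k in range(random.randint(20, 60))]
            for c in range(1, 81)}

# Randomly select 15 clusters, then include EVERY farmer in them
chosen = random.sample(sorted(counties), 15)
cluster_sample = [farmer for county in chosen for farmer in counties[county]]

print(len(chosen), len(cluster_sample))
```

Unlike stratified sampling, no farmer outside the 15 chosen counties can appear in the sample, which is what makes the method cheap in the field but riskier if the chosen clusters are atypical.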
8.7.7 Potential Problems in Cluster Sampling
Risk: If selected clusters are atypical, bias results.
Example: If randomly selected counties all have:
- Unusually high irrigation usage → Overestimate crop productivity
- Unusually low irrigation usage → Underestimate crop productivity
Mitigation: Select more clusters (increases cost but reduces risk of unrepresentative clusters)
8.8 Solved Problems
8.8.1 Problem 1: Investment Industry Returns
Solution:
\sigma_{\bar{X}} = \frac{0.12}{\sqrt{250}} = 0.0076
Z = \frac{0.31 - 0.30}{0.0076} = 1.32
From Table E: Area = 0.4066
P(\bar{X} > 0.31) = 0.5000 - 0.4066 = 0.0934
9.34% probability of exceeding 31% return
Investment implication: 31% return threshold is 1.32 standard errors above mean - achievable but not common. Set realistic expectations!
8.8.2 Problem 2: Direct Marketing Proportion
Solution:
\sigma_p = \sqrt{\frac{0.22(0.78)}{250}} = 0.0262
Z = \frac{0.20 - 0.22}{0.0262} = -0.76
From Table E: Area = 0.2764
P(p > 0.20) = 0.5000 + 0.2764 = 0.7764
P(p \leq 0.20) = 1 - 0.7764 = 0.2236
22.36% probability you’ll shop elsewhere
Business insight: Most likely (77.64%) you’ll find adequate direct marketing representation and purchase from this industry.
8.8.3 Problem 3: Sampling Error Tolerance
Solution:
Ramsey wants: P(|error| ≤ 1) = P(35.7 ≤ \bar{X} ≤ 37.7)
\sigma_{\bar{X}} = \frac{3.5}{\sqrt{36}} = 0.583
Z = \frac{37.7 - 36.7}{0.583} = 1.71
From Table E: Area = 0.4564
P(35.7 \leq \bar{X} \leq 37.7) = 0.4564 \times 2 = 0.9128
91.28% probability estimate is within ±1 hour (exceeds her 90% requirement ✓)
Managerial decision: The sample size n = 36 is adequate to meet Ramsey’s precision requirements!
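Problem 3's symmetric-interval calculation can be verified with an exact normal CDF; a sketch (the exact value differs slightly from the table-based 0.9128 because Table E rounds Z to two places):

```python
from math import erf, sqrt

def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 36.7, 3.5, 36
se = sigma / sqrt(n)              # = 3.5/6 ~ 0.583 hours

# P(|Xbar - mu| <= 1): symmetric interval, so double the one-sided area
z = 1 / se
p_within = 2 * (norm_cdf(z) - 0.5)

print(round(p_within, 4))         # ~ 0.913, clearing the 90% requirement
```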
8.9 Formula Reference
8.9.1 Sampling Distribution of Means
Mean of Sample Means: \bar{\bar{X}} = \frac{\sum \bar{X}}{K}
Variance of Sampling Distribution: \sigma^2_{\bar{X}} = \frac{\sum(\bar{X} - \mu)^2}{K}
Standard Error: \sigma_{\bar{X}} = \sqrt{\sigma^2_{\bar{X}}}
Standard Error (Shortcut): \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
With Finite Population Correction: \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}
Z-Score for Sample Means: Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}
8.9.2 Sampling Distribution of Proportions
Expected Value: E(p) = \frac{\sum p}{K} = \pi
Standard Error: \sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}
With Finite Population Correction: \sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} \sqrt{\frac{N-n}{N-1}}
Z-Score for Sample Proportions: Z = \frac{p - \pi}{\sigma_p}
8.9.3 Central Limit Theorem
For n ≥ 30: Sampling distribution of \bar{X} is approximately normal with:
- Mean: μ
- Standard Error: σ/√n
8.9.4 Sampling Error
Definition: \text{Sampling Error} = \bar{X} - \mu \quad \text{or} \quad p - \pi
8.10 Chapter Summary
This chapter bridged probability theory and statistical inference by introducing sampling distributions:
Key Concepts Mastered:
Sampling Distribution - The distribution of all possible sample statistics (means or proportions)
Sampling Error - Inevitable difference between sample statistic and population parameter
Standard Error - Measure of sampling variability (analogous to standard deviation)
Central Limit Theorem - Magic that makes normal distribution applicable even when population isn’t normal (for n ≥ 30)
Sampling Methods:
- Simple Random: Every sample equally likely
- Systematic: Every ith element
- Stratified: Proportional sampling from subgroups
- Cluster: Sample entire groups
Critical Insights:
✓ Larger samples → smaller standard error (but with diminishing returns - need 4x sample for half the error)
✓ Sample means cluster more tightly around μ than individual observations (σ_{\bar{X}} < σ)
✓ Bias is worse than random error - proper sampling method prevents bias
✓ Different samples yield different statistics - sampling distributions quantify this variability
Business Applications:
- Quality control decisions based on sample defect rates
- Marketing decisions based on sample purchase proportions
- Investment analysis using sample average returns
- Policy decisions informed by survey samples
Next Chapter: We’ll use these sampling distribution concepts to build confidence intervals and conduct hypothesis tests - the core tools of statistical inference!