7  Sampling Distributions

The Bridge Between Probability and Statistical Inference

8 Chapter 6: Sampling Distributions

graph TD
    A[Sampling Distributions] --> B[For Sample Means]
    A --> C[For Sample Proportions]
    A --> D[Sampling Procedures]
    
    B --> B1[Sampling<br/>Error]
    B --> B2[Mean of<br/>Sample Means]
    B --> B3[Standard<br/>Error]
    B --> B4[Applications for<br/>Normal Distribution]
    B --> B5[Central Limit<br/>Theorem]
    B --> B6[Finite Population<br/>Correction Factor]
    
    C --> C1[Sampling<br/>Error]
    C --> C2[Standard<br/>Error]
    C --> C3[Applications for<br/>Normal Distribution]
    C --> C4[Central Limit<br/>Theorem]
    C --> C5[Finite Population<br/>Correction Factor]
    
    D --> D1[Errors and<br/>Bias]
    D --> D2[Sampling<br/>Methods]
    
    D2 --> D2a[Simple Random<br/>Sampling]
    D2 --> D2b[Systematic<br/>Sampling]
    D2 --> D2c[Stratified<br/>Sampling]
    D2 --> D2d[Cluster<br/>Sampling]
    
    style A fill:#000,stroke:#000,color:#fff
    style B fill:#000,stroke:#000,color:#fff
    style C fill:#000,stroke:#000,color:#fff
    style D fill:#000,stroke:#000,color:#fff

Note: Learning Objectives

By the end of this chapter, you will be able to:

  • Understand sampling distributions and how they differ from population distributions
  • Calculate sampling error and interpret its business implications
  • Apply the Central Limit Theorem to analyze sample means
  • Compute standard error with and without finite population correction
  • Determine probabilities for sample means using normal distribution
  • Design effective sampling procedures (random, systematic, stratified, cluster)
  • Recognize and minimize sampling bias in business research
  • Make informed decisions based on sample statistics

8.1 Opening Scenario: Investment Portfolio Analysis

Note: Client Investment Decision

Several wealthy clients have selected you as their investment analyst to evaluate three distinct industries:

1. Sports & Recreation Industry
- Thrives during economic downturns (people seek relief from economic stress)
- Anticipated recession makes this sector attractive

2. Healthcare Industry
- Aging population = increasing medical assistance demand
- Social Security system threats create opportunities
- Demographic trends favor long-term growth

3. Environmental Protection Industry
- Wetlands preservation and environmental protection
- Potential for both financial returns and social contribution

Your challenge: Analyze investment portfolios using sampling distributions to assess risk and probability of success for each industry sector.

Key question: Can you sample 50 companies from each industry (instead of analyzing all companies) and still make confident recommendations?

8.2 6.1 Introduction: The Foundation of Statistical Inference

Populations are typically too large to study in their entirety. Imagine trying to:
- Survey all 150 million U.S. consumers about product preferences
- Test every smartphone produced for quality (destructive testing!)
- Measure income of all 7 billion people on Earth

Solution: Select a representative sample of manageable size, then use it to draw conclusions about the population.

Tip: Key Terminology

Population Parameter (μ, σ, π):
- Numerical characteristic of the entire population
- Usually unknown (that’s why we sample!)
- Examples: μ = average income of all U.S. households

Sample Statistic (\bar{X}, s, p):
- Numerical characteristic of the sample
- Calculated from sample data
- Used as estimator of population parameter

Statistical Inference:
The process of using a sample statistic to draw conclusions about a population parameter.

Critical insight: Different samples from the same population will produce different statistics!

Example: Fortune 500 companies
- Population: N = 500 companies
- Sample: n = 50 companies randomly selected
- Statistic: \bar{X} = average return rate for the 50
- Parameter: μ = average return rate for all 500
- Inference: Use \bar{X} to estimate μ

8.3 6.2 Sampling Distributions - The Heart of Inference

8.3.1 The Fundamental Concept

From any population of size N, we can select many different samples of size n. Each sample will likely have a different mean.

We can create a distribution of all possible sample means - this is the sampling distribution!

Tip: Sampling Distribution Definition

Sampling Distribution: A listing of all possible values for a statistic (like \bar{X}) and the probability associated with each value.

Purpose: Allows us to calculate probabilities about sample statistics and quantify sampling error.

8.3.2 Example 6.1: College Student Incomes - Building a Sampling Distribution

Note: Simple Population

Four college students have annual incomes:
- Student A: $100
- Student B: $200
- Student C: $300
- Student D: $400

Population mean: μ = $250
Population size: N = 4

Instead of calculating μ from all 4 observations, you decide to estimate μ using a sample of n = 2 students.

How many different samples are possible?

Solution:

Number of possible samples:

{}_{4}C_{2} = \frac{4!}{2!2!} = 6 \text{ different samples}

All Possible Samples:

| Sample | Students Selected | Incomes | Sample Mean \bar{X} |
|---|---|---|---|
| 1 | A, B | $100, $200 | $150 |
| 2 | A, C | $100, $300 | $200 |
| 3 | A, D | $100, $400 | $250 |
| 4 | B, C | $200, $300 | $250 |
| 5 | B, D | $200, $400 | $300 |
| 6 | C, D | $300, $400 | $350 |

Key observation: Only samples 3 and 4 yield \bar{X} = \mu = 250!

Sampling Distribution of \bar{X}:

| \bar{X} | Frequency | Probability P(\bar{X}) |
|---|---|---|
| $150 | 1 | 1/6 = 0.167 |
| $200 | 1 | 1/6 = 0.167 |
| $250 | 2 | 2/6 = 0.333 |
| $300 | 1 | 1/6 = 0.167 |
| $350 | 1 | 1/6 = 0.167 |
| Total | 6 | 1.000 |

Important: Critical Insights

Probability of “perfect” estimate: Only 33.3% chance (2 out of 6 samples) that \bar{X} = \mu

Sampling error will occur: 66.7% of samples produce some estimation error!

Sampling Error: (\bar{X} - \mu)
- Sample 1: $150 - $250 = -$100 (underestimate)
- Sample 6: $350 - $250 = +$100 (overestimate)

Reality check: You’ll never know actual sampling error because μ is unknown (that’s why you’re sampling!). But you must acknowledge it exists.
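The enumeration in Example 6.1 is small enough to reproduce mechanically. The following is a minimal sketch, using only the Python standard library, that lists every sample of n = 2 from the four incomes and tabulates the resulting sampling distribution and sampling errors. The variable names are illustrative choices, not part of the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

incomes = {"A": 100, "B": 200, "C": 300, "D": 400}  # population from Example 6.1
n = 2                                               # sample size

# Every distinct sample of n students (no replacement, order ignored)
samples = list(combinations(incomes, n))
sample_means = [sum(incomes[s] for s in sample) / n for sample in samples]

# Sampling distribution: each distinct sample mean with its probability
counts = Counter(sample_means)
distribution = {
    mean: Fraction(count, len(samples)) for mean, count in sorted(counts.items())
}

mu = sum(incomes.values()) / len(incomes)  # population mean
for mean, prob in distribution.items():
    print(f"mean = {mean:5.0f}   P = {prob}   sampling error = {mean - mu:+.0f}")
```

The probabilities print as reduced fractions, so 2/6 appears as 1/3.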

8.3.3 The Mean of Sample Means (\bar{\bar{X}})

The sampling distribution itself has a mean, called the “mean of the means” or “grand mean”.

Tip: Mean of Sampling Distribution

\bar{\bar{X}} = \frac{\sum \bar{X}}{K}

Where:
- K = number of samples in the sampling distribution
- \bar{X} = individual sample means

Remarkable Property:
\bar{\bar{X}} = \mu

The mean of all possible sample means always equals the population mean!

Calculating for our example:

\bar{\bar{X}} = \frac{150 + 200 + 250 + 250 + 300 + 350}{6} = \frac{1500}{6} = 250 = \mu

This is NOT coincidence - it’s a fundamental property of sampling distributions!

8.3.4 Standard Error - Quantifying Sampling Variability

Just as the population has variance σ², the sampling distribution has variance σ²_{\bar{X}}.

Tip: Variance and Standard Error of Sampling Distribution

Variance of Sample Means:

\sigma^2_{\bar{X}} = \frac{\sum(\bar{X} - \mu)^2}{K}

Standard Error (SE):

\sigma_{\bar{X}} = \sqrt{\sigma^2_{\bar{X}}}

Interpretation: Standard error measures how much sample means tend to deviate from the population mean (μ).

Larger SE = greater sampling variability = less precise estimates
Smaller SE = less sampling variability = more precise estimates

Calculating for our example:

\sigma^2_{\bar{X}} = \frac{(150-250)^2 + (200-250)^2 + (250-250)^2 + (250-250)^2 + (300-250)^2 + (350-250)^2}{6}

= \frac{10,000 + 2,500 + 0 + 0 + 2,500 + 10,000}{6} = \frac{25,000}{6} = 4,167 \text{ dollars}^2

\sigma_{\bar{X}} = \sqrt{4,167} = \$64.55

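Business interpretation: A typical sample mean misses the true population mean by about $65. (Strictly, the standard error is a root-mean-square deviation rather than a simple average of the errors.)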
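As a check on the hand calculation, the grand mean, variance, and standard error of the six sample means can be computed directly from their definitions. This is a sketch with illustrative variable names; it divides by K because the six means are the complete sampling distribution, not a sample from it.

```python
from itertools import combinations
from math import sqrt

population = [100, 200, 300, 400]  # incomes from Example 6.1
n = 2

means = [sum(sample) / n for sample in combinations(population, n)]
K = len(means)                              # number of possible samples
mu = sum(population) / len(population)      # population mean

grand_mean = sum(means) / K                           # mean of the sample means
variance = sum((m - mu) ** 2 for m in means) / K      # variance of the sample means
standard_error = sqrt(variance)

print(f"grand mean = {grand_mean}, variance = {variance:.0f}, SE = {standard_error:.2f}")
```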

8.3.5 Example 6.2: East Coast Manufacturing Sales

Note: Monthly Sales Analysis

East Coast Manufacturing (ECM) sales (in thousands) for the last 5 months:
68, 73, 65, 80, 72

Population mean: μ = 71.6 thousand

Your task (Marketing Director): Estimate μ using a sample of n = 3 months. How large is the likely sampling error?

Solution:

Step 1: Determine number of samples

{}_{5}C_{3} = \frac{5!}{3!2!} = 10 \text{ possible samples}

Step 2: List all samples and calculate means

| Sample | Months Selected | \bar{X} |
|---|---|---|
| 1 | 68, 73, 65 | 68.67 |
| 2 | 68, 73, 80 | 73.67 |
| 3 | 68, 73, 72 | 71.00 |
| 4 | 68, 65, 80 | 71.00 |
| 5 | 68, 65, 72 | 68.33 |
| 6 | 68, 80, 72 | 73.33 |
| 7 | 73, 65, 80 | 72.67 |
| 8 | 73, 65, 72 | 70.00 |
| 9 | 73, 80, 72 | 75.00 |
| 10 | 65, 80, 72 | 72.33 |

Step 3: Create sampling distribution

| \bar{X} | Probability |
|---|---|
| 68.33 | 1/10 = 0.10 |
| 68.67 | 1/10 = 0.10 |
| 70.00 | 1/10 = 0.10 |
| 71.00 | 2/10 = 0.20 |
| 72.33 | 1/10 = 0.10 |
| 72.67 | 1/10 = 0.10 |
| 73.33 | 1/10 = 0.10 |
| 73.67 | 1/10 = 0.10 |
| 75.00 | 1/10 = 0.10 |

Step 4: Calculate mean of sampling distribution

\bar{\bar{X}} = \frac{68.67 + 73.67 + ... + 72.33}{10} = 71.6 = \mu \quad ✓

Step 5: Calculate standard error

\sigma^2_{\bar{X}} = \frac{(68.67-71.6)^2 + (73.67-71.6)^2 + ... + (72.33-71.6)^2}{10}

= 4.31 \text{ thousand}^2

\sigma_{\bar{X}} = \sqrt{4.31} = 2.08 \text{ thousand dollars}

Important: Business Interpretation

Expected sampling error: about ±$2,080 (one standard error)

With 10 possible samples:
- No sample yields \bar{X} exactly equal to μ = 71.6; the closest value, \bar{X} = 71.00, occurs in 2 of the 10 samples and still misses by 0.6 thousand
- 100% chance of some sampling error

Managerial decision: Is ±$2,080 error “relatively small” for ECM’s purposes?
- Relative to mean monthly sales of about $71,600: probably yes (roughly 2.9% error)
- If making critical budget decisions: Maybe need larger sample
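The ECM figures can be recomputed the same way. The sketch below enumerates the 10 samples of three months and derives the grand mean, the standard error, and the standard error as a share of mean sales; the names are illustrative.

```python
from itertools import combinations
from math import sqrt

sales = [68, 73, 65, 80, 72]  # monthly sales in $ thousands (Example 6.2)
n = 3

mu = sum(sales) / len(sales)
means = [sum(sample) / n for sample in combinations(sales, n)]

grand_mean = sum(means) / len(means)
standard_error = sqrt(sum((m - mu) ** 2 for m in means) / len(means))
relative_error = standard_error / mu  # SE as a fraction of mean monthly sales

print(f"{len(means)} samples, grand mean = {grand_mean:.1f}")
print(f"SE = {standard_error:.2f} thousand ({relative_error:.1%} of mean sales)")
```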

8.3.6 Shortcut Formula for Standard Error (When σ is Known)

The manual calculation of \sigma_{\bar{X}} is tedious. There’s a shortcut:

Tip: Standard Error Formulas

Basic Formula (Infinite Population or Sampling with Replacement):

\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}

With Finite Population Correction (FPC):
Use when sampling without replacement AND n > 0.05N

\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}

Where \sqrt{\frac{N-n}{N-1}} is the finite population correction factor

When FPC is unnecessary:
If n/N < 0.05 (sample is less than 5% of population), FPC ≈ 1, so skip it!

Example: Quality control sampling
- Population: N = 10,000 units, σ = 5 grams
- Sample: n = 100 units
- Check: n/N = 100/10,000 = 0.01 < 0.05 → No FPC needed

\sigma_{\bar{X}} = \frac{5}{\sqrt{100}} = 0.5 \text{ grams}
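Both formulas and the 5% rule fit in one small helper. The function below is a sketch, not a library routine: `standard_error` and its `threshold` parameter are names chosen here, and the correction is applied only when a population size N is supplied and n/N exceeds the threshold.

```python
from math import sqrt
from statistics import pstdev

def standard_error(sigma, n, N=None, threshold=0.05):
    """Return sigma/sqrt(n), applying the FPC when n/N exceeds the threshold."""
    se = sigma / sqrt(n)
    if N is not None and n / N > threshold:
        se *= sqrt((N - n) / (N - 1))  # finite population correction factor
    return se

# Quality control example: n/N = 0.01, so no correction is applied
qc_se = standard_error(5, 100, N=10_000)

# Example 6.1 incomes: n/N = 0.5, so the correction matters
sigma_incomes = pstdev([100, 200, 300, 400])
income_se = standard_error(sigma_incomes, 2, N=4)

print(f"quality control SE = {qc_se} grams, income SE = {income_se:.2f} dollars")
```

With the correction, the shortcut gives the same $64.55 obtained earlier by listing all six samples.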

8.3.7 The Impact of Sample Size on Standard Error

Critical relationship: As n ↑, σ_{\bar{X}} ↓

Example: Population with σ = 100

| Sample Size (n) | Standard Error | Reduction |
|---|---|---|
| 25 | 100/√25 = 20.0 | baseline |
| 100 | 100/√100 = 10.0 | 50% smaller |
| 400 | 100/√400 = 5.0 | 75% smaller |
| 1,600 | 100/√1600 = 2.5 | 87.5% smaller |

Important: The Law of Diminishing Returns

Doubling sample size does NOT halve the standard error!

To cut SE in half, you must quadruple the sample size (because of the √n in denominator).

Business trade-off:
- Larger samples → more precision (smaller SE)
- Larger samples → higher cost, more time

Optimal sample size balances precision needs against resource constraints.
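The trade-off can be made concrete by inverting the standard error formula: solving SE = σ/√n for n gives n = (σ/SE)². The sketch below assumes an effectively infinite population (no FPC); `required_n` is a hypothetical helper name.

```python
from math import ceil, sqrt

SIGMA = 100  # population standard deviation from the table above

def required_n(sigma, target_se):
    """Smallest n with sigma/sqrt(n) <= target_se (no finite population correction)."""
    return ceil((sigma / target_se) ** 2)

for n in (25, 100, 400, 1600):
    print(f"n = {n:5d}  SE = {SIGMA / sqrt(n):5.1f}")

n_for_10 = required_n(SIGMA, 10)  # sample size for SE = 10
n_for_5 = required_n(SIGMA, 5)    # halving the SE target
print(f"SE = 10 needs n = {n_for_10}; SE = 5 needs n = {n_for_5}")
```

Halving the target standard error quadruples the required sample size, which is the diminishing-returns pattern described above.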

8.4 6.3 The Central Limit Theorem - The Most Important Theorem in Statistics

This is where the magic happens!

Tip: Central Limit Theorem (CLT)

Statement:
For a population with any distribution shape, as sample size n increases, the sampling distribution of \bar{X} approaches a normal distribution with:
- Mean: \bar{\bar{X}} = \mu
- Standard Error: \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}

Rule of Thumb: If n ≥ 30, sampling distribution is approximately normal regardless of population shape.

If population is already normal: Sampling distribution is normal for any sample size (even n = 2)!

8.4.1 Case 1: Population is Normal → Sampling Distribution is Normal

Example: Adult heights
- Population: Normally distributed, μ = 67 inches, σ = 3 inches
- Samples: n = 25

Sampling Distribution Properties:
- Shape: Normal (because population is normal)
- Mean: \bar{\bar{X}} = \mu = 67 inches
- Standard Error: \sigma_{\bar{X}} = \frac{3}{\sqrt{25}} = 0.6 inches

Key insight: Individual heights vary ±3 inches, but sample means vary only ±0.6 inches!

8.4.2 Case 2: Population is NOT Normal → CLT Creates Normality (if n ≥ 30)

Example: Income distribution (right-skewed)
- Population: Skewed right, μ = $50,000, σ = $20,000
- Samples: n = 50

Sampling Distribution Properties:
- Shape: Approximately Normal (CLT magic, since n = 50 ≥ 30)
- Mean: \bar{\bar{X}} = \mu = 50{,}000
- Standard Error: \sigma_{\bar{X}} = \frac{20{,}000}{\sqrt{50}} = 2{,}828

Even though population is skewed, sample means distribute normally!

8.4.3 Visual Demonstration: The Power of CLT

Scenario: Population is uniform (rectangular, definitely NOT normal)
- μ = 1000
- σ = 100

What happens as n increases?

| Sample Size | SE | Distribution Shape |
|---|---|---|
| n = 5 | 100/√5 = 44.7 | Still somewhat uniform |
| n = 10 | 100/√10 = 31.6 | Starting to mound |
| n = 30 | 100/√30 = 18.3 | Nearly normal |
| n = 50 | 100/√50 = 14.1 | Normal |
| n = 100 | 100/√100 = 10.0 | Very normal, tight |

Important: Why CLT Matters for Business

Practical implication: You can use normal distribution tables and methods for sample means even when:
- You don’t know the population distribution shape
- The population is definitely not normal

As long as n ≥ 30!

This is one reason many studies aim for samples of 30+ observations. With n ≥ 30, the normal-based tools (confidence intervals, hypothesis tests) are usually a good approximation, although heavily skewed populations may need larger samples.
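A short simulation illustrates the theorem for the uniform population used above (μ = 1000, σ = 100). This is a sketch under stated assumptions: the seed, the 5,000 repetitions, and the function name `simulate` are arbitrary choices, and the simulated standard errors will only approximate σ/√n.

```python
import random
from math import sqrt
from statistics import pstdev

random.seed(42)  # fixed seed so the run is repeatable

MU, SIGMA = 1000, 100
HALF_WIDTH = SIGMA * sqrt(3)  # a uniform on [MU - h, MU + h] has sd h / sqrt(3)

def simulate(n, reps=5000):
    """Draw `reps` samples of size n; return (mean, sd) of the sample means."""
    means = [
        sum(random.uniform(MU - HALF_WIDTH, MU + HALF_WIDTH) for _ in range(n)) / n
        for _ in range(reps)
    ]
    return sum(means) / reps, pstdev(means)

results = {n: simulate(n) for n in (5, 30, 100)}
for n, (grand_mean, se) in results.items():
    print(f"n = {n:3d}  mean of means = {grand_mean:7.1f}  "
          f"simulated SE = {se:5.1f}  sigma/sqrt(n) = {SIGMA / sqrt(n):5.1f}")
```

Plotting a histogram of the simulated means for each n would show the shape change as well; that step is omitted to keep the sketch dependency-free.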

8.5 6.4 Applications: Using Sampling Distributions for Business Decisions

Now we put theory into action!

8.5.1 The Z-Score for Sample Means

In Chapter 5, we calculated probabilities for individual observations using:

Z = \frac{X - \mu}{\sigma}

For sample means, the formula changes:

Tip: Z-Score for Sampling Distributions

Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}

Interpretation: Number of standard errors \bar{X} is from μ

8.5.2 Example 6.3: TelCom Satellite - Comparing Probabilities

Note: Communication Service Analysis

TelCom Satellite Data:
- μ = 150 seconds (mean transmission time)
- σ = 15 seconds
- Normal distribution

Two Questions:
a) Probability one transmission is between 150-155 seconds?
b) Probability mean of 50 transmissions is between 150-155 seconds?

Solution:

a) Single transmission: P(150 ≤ X ≤ 155)

Z = \frac{X - \mu}{\sigma} = \frac{155 - 150}{15} = 0.33

From Table E: Area = 0.1293

P(150 ≤ X ≤ 155) = 0.1293 (12.93%)

b) Mean of n = 50: P(150 ≤ \bar{X} ≤ 155)

\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{50}} = 2.12

Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{155 - 150}{2.12} = 2.36

From Table E: Area = 0.4909

P(150 ≤ \bar{X} ≤ 155) = 0.4909 (49.09%)

Important: Dramatic Difference!

Single call: Only 12.93% chance between 150-155 seconds
Average of 50 calls: 49.09% chance the mean is between 150-155 seconds

Why? Sample means cluster much more tightly around μ than individual observations!
- Individual observations: σ = 15 seconds spread
- Sample means (n=50): σ_{\bar{X}} = 2.12 seconds spread

Business application: If you need to forecast total time for 50 calls, you can be much more confident in your estimate than for a single call.
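The two TelCom probabilities can also be computed without a table, using `statistics.NormalDist` from the Python standard library. This is a sketch; the results differ slightly from the Table E values because the table lookup rounds Z to two decimals.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 150, 15, 50  # TelCom transmission times (Example 6.3)

single = NormalDist(mu, sigma)                # one transmission
mean_of_n = NormalDist(mu, sigma / sqrt(n))   # mean of n transmissions

p_single = single.cdf(155) - single.cdf(150)
p_mean = mean_of_n.cdf(155) - mean_of_n.cdf(150)

print(f"P(150 <= X <= 155)    = {p_single:.4f}")
print(f"P(150 <= Xbar <= 155) = {p_mean:.4f}")
```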

8.5.3 Example 6.4: TelCom Equipment Investment Decision

Note: Strategic Investment Analysis

TelCom considers new equipment to improve efficiency.
Before deciding, executives need probabilities for mean of n = 35 transmissions:

a) Between 145 and 150 seconds?
b) Greater than 145 seconds?
c) Less than 155 seconds?
d) Between 145 and 155 seconds?
e) Greater than 155 seconds?

Given: μ = 150 seconds, σ = 15 seconds, n = 35

Solution:

Calculate standard error first:

\sigma_{\bar{X}} = \frac{15}{\sqrt{35}} = 2.54 \text{ seconds}

a) P(145 ≤ \bar{X} ≤ 150)

Z = \frac{145 - 150}{2.54} = -1.97

Area = 0.4756
P(145 ≤ \bar{X} ≤ 150) = 0.4756 (47.56%)

b) P(\bar{X} ≥ 145)

Z = -1.97 \text{ (same as part a)}

P(\bar{X} ≥ 145) = 0.4756 + 0.5000 = 0.9756 (97.56%)

c) P(\bar{X} ≤ 155)

Z = \frac{155 - 150}{2.54} = +1.97

P(\bar{X} ≤ 155) = 0.5000 + 0.4756 = 0.9756 (97.56%)

d) P(145 ≤ \bar{X} ≤ 155)

From parts a and c: Both give area = 0.4756

P(145 ≤ \bar{X} ≤ 155) = 0.4756 + 0.4756 = 0.9512 (95.12%)

e) P(\bar{X} > 155)

Z = +1.97

P(\bar{X} > 155) = 0.5000 - 0.4756 = 0.0244 (2.44%)

Important: Equipment Investment Decision

Key findings:
97.56% chance mean time ≥ 145 seconds (very reliable lower bound)
95.12% chance mean time between 145-155 seconds (excellent precision)
⚠️ Only 2.44% chance mean time > 155 seconds (low risk of extreme delays)

Managerial implications:
- Current system is predictable: 95% of the time, average of 35 calls will be within ±5 seconds of 150
- New equipment decision: If it promises to reduce μ from 150 to 140 seconds, that’s a 10-second improvement - highly significant given SE = 2.54!
- Budget confidently: Can plan capacity around 145-155 second range with 95% confidence

Recommendation: Proceed with the equipment investment if the expected improvement exceeds 2 standard errors (2 × 2.54 = 5.08 seconds); a shift that large would be statistically detectable in a sample of 35 transmissions.
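All five probabilities in Example 6.4 come from one sampling distribution, so they can be computed together. The sketch below uses `statistics.NormalDist`; the dictionary keys mirror parts a) through e) of the solution, and small differences from the table values come from rounding Z to -1.97 and +1.97.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 150, 15, 35
xbar = NormalDist(mu, sigma / sqrt(n))  # sampling distribution of the mean

answers = {
    "a": xbar.cdf(150) - xbar.cdf(145),  # between 145 and 150
    "b": 1 - xbar.cdf(145),              # greater than 145
    "c": xbar.cdf(155),                  # less than 155
    "d": xbar.cdf(155) - xbar.cdf(145),  # between 145 and 155
    "e": 1 - xbar.cdf(155),              # greater than 155
}

for part, prob in answers.items():
    print(f"{part}) {prob:.4f}")
```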

8.6 6.5 Sampling Distributions for Proportions

Many business decisions involve proportions rather than means:
- Will a customer buy or not buy? (Marketing)
- Will a depositor default or not default? (Banking)
- Will a project generate positive return or not? (Capital budgeting)
- Is a unit defective or not defective? (Quality control)

We use sample proportion p to estimate population proportion π.

Tip: Sampling Distribution of Proportions

Properties:
E(p) = \pi

The mean of all possible sample proportions equals the population proportion!

Standard Error of Proportions:

\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}

With Finite Population Correction (if n > 0.05N):

\sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} \sqrt{\frac{N-n}{N-1}}

Z-Score for Proportions:

Z = \frac{p - \pi}{\sigma_p}

8.6.1 Example 6.5: Lugget Furniture - Advertising Effectiveness

Note: Small Population Demonstration

Lugget Furniture asks N = 4 customers if they saw today’s newspaper ad.

Responses: Yes, No, No, Yes (notation: S₁, N₂, N₃, S₄, where S = saw the ad and N = did not)

Population proportion: π = 2/4 = 0.50 (50% saw the ad)

All possible samples of n = 2:

Solution:

| Sample | Customers | Number of “Yes” | Sample Proportion p |
|---|---|---|---|
| 1 | S₁, N₂ | 1 | 0.50 |
| 2 | S₁, N₃ | 1 | 0.50 |
| 3 | S₁, S₄ | 2 | 1.00 |
| 4 | N₂, N₃ | 0 | 0.00 |
| 5 | N₂, S₄ | 1 | 0.50 |
| 6 | N₃, S₄ | 1 | 0.50 |

Expected value:

E(p) = \frac{\sum p}{K} = \frac{0.50 + 0.50 + 1.00 + 0.00 + 0.50 + 0.50}{6} = \frac{3.00}{6} = 0.50 = \pi \quad ✓

Standard error (with FPC):

\sigma_p = \sqrt{\frac{0.50(1-0.50)}{2}} \sqrt{\frac{4-2}{4-1}} = \sqrt{0.125} \sqrt{0.667} = 0.289

Interpretation: Sample proportions vary ±0.29 around true proportion π = 0.50
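The Lugget example is small enough to confirm that the formula with the finite population correction matches a direct enumeration. In this sketch the responses are coded 1 for "saw the ad" and 0 otherwise, which is an encoding chosen here for convenience.

```python
from itertools import combinations
from math import sqrt

responses = [1, 0, 0, 1]  # 1 = saw the ad, 0 = did not (Example 6.5)
N, n = len(responses), 2
pi = sum(responses) / N   # population proportion

# Sample proportion for each of the 6 possible samples
proportions = [sum(sample) / n for sample in combinations(responses, n)]

expected_p = sum(proportions) / len(proportions)
se_enumerated = sqrt(sum((p - pi) ** 2 for p in proportions) / len(proportions))
se_formula = sqrt(pi * (1 - pi) / n) * sqrt((N - n) / (N - 1))

print(f"E(p) = {expected_p}, enumerated SE = {se_enumerated:.3f}, formula SE = {se_formula:.3f}")
```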

8.6.2 Example 6.6: BelLabs Component Quality - Multiple Decision Thresholds

Note: Supplier Evaluation Decision

BelLabs purchases cell phone components in lots of n = 200 units.
Components have a 10% defect rate (π = 0.10).

Decision policy based on next shipment’s defect rate:
a) > 12% defects → Definitely seek new supplier
b) 10-12% defects → Consider new supplier
c) 5-10% defects → Definitely stay with current supplier
d) < 5% defects → Increase orders

Which decision is most likely?

Solution:

Assumptions: Population N is very large (many components), so n/N < 0.05 → No FPC needed

Standard error:

\sigma_p = \sqrt{\frac{0.10(0.90)}{200}} = \sqrt{0.00045} = 0.021

a) P(p > 0.12) - Seek new supplier

Z = \frac{0.12 - 0.10}{0.021} = 0.95

From Table E: Area = 0.3289

P(p > 0.12) = 0.5000 - 0.3289 = 0.1711

17.11% probability of seeking new supplier

b) P(0.10 ≤ p ≤ 0.12) - Consider new supplier

From part a: Area between 0.10 and 0.12 = 0.3289 (32.89%)

c) P(0.05 ≤ p ≤ 0.10) - Stay with supplier

Z = \frac{0.05 - 0.10}{0.021} = -2.38

From Table E: Area = 0.4913

P(0.05 \leq p \leq 0.10) = 0.4913

49.13% probability of staying with supplier ✓ HIGHEST!

d) P(p < 0.05) - Increase orders

P(p < 0.05) = 0.5000 - 0.4913 = 0.0087

Only 0.87% probability of increasing orders

Important: Decision Recommendation

Most likely outcome: BelLabs will keep current supplier (49.13% probability).

Business rationale:
- Nearly 50% chance defects stay in acceptable 5-10% range
- Only 17% chance situation is bad enough to require supplier change
- Less than 1% chance quality improves enough to justify order increase

Risk assessment:
- 17% chance of >12% defects is significant risk
- BelLabs should monitor next several shipments closely
- Consider negotiating quality improvement clause with supplier
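The four decision probabilities partition the whole distribution, so they should sum to 1. The sketch below computes them with `statistics.NormalDist`; the dictionary keys are shorthand labels chosen here, and the values differ from the table-based answers in the third decimal because the solution rounds the standard error to 0.021 and Z to two decimals.

```python
from math import sqrt
from statistics import NormalDist

pi, n = 0.10, 200  # defect rate and lot size (Example 6.6)
p_dist = NormalDist(pi, sqrt(pi * (1 - pi) / n))  # sampling distribution of p

outcomes = {
    "seek":     1 - p_dist.cdf(0.12),                 # a) more than 12% defects
    "consider": p_dist.cdf(0.12) - p_dist.cdf(0.10),  # b) 10-12% defects
    "stay":     p_dist.cdf(0.10) - p_dist.cdf(0.05),  # c) 5-10% defects
    "increase": p_dist.cdf(0.05),                     # d) fewer than 5% defects
}
most_likely = max(outcomes, key=outcomes.get)

for decision, prob in outcomes.items():
    print(f"{decision:9s} {prob:.4f}")
print("most likely:", most_likely)
```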

8.6.3 Example 6.7: Tax Referendum - Political Decision

Note: Sports Stadium Funding

A large city surveys n = 1,000 residents about a tax increase for a new sports stadium.

Rule: If > 85% support the tax, it goes on the ballot.

Actual population support: π = 0.82 (82%)

Question: What’s the probability the referendum makes it onto the ballot?

(This requires sample proportion p > 0.85 despite true π = 0.82)

Solution:

Standard error:

\sigma_p = \sqrt{\frac{0.82(0.18)}{1000}} = \sqrt{0.0001476} = 0.0121

P(p > 0.85):

Z = \frac{0.85 - 0.82}{0.0121} = 2.48

From Table E: Area = 0.4934

P(p > 0.85) = 0.5000 - 0.4934 = 0.0066

Only 0.66% probability the referendum appears on the ballot!

Important: Political Reality Check

Despite 82% actual support (strong majority!), there’s less than 1% chance the sample will show the required 85% threshold.

Why? The 85% requirement is 2.48 standard errors above the true proportion - an extreme outcome.

Implications:
- The 85% threshold is too high given sampling variability
- City council should lower threshold to 80-82% for fairness
- With current rule, genuinely popular measures may fail to reach ballot due to random sampling error

Better policy: Use 82% threshold (actual population value) or add confidence interval around survey result.
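To see how sensitive the outcome is to the threshold, the ballot probability can be written as a function of the cutoff. This is a sketch: `p_reaches_ballot` is a hypothetical helper, and it assumes the true support of 0.82 is known, which it would not be in a real survey.

```python
from math import sqrt
from statistics import NormalDist

pi, n = 0.82, 1000  # true support and survey size (Example 6.7)
p_dist = NormalDist(pi, sqrt(pi * (1 - pi) / n))

def p_reaches_ballot(threshold):
    """Probability that the sample proportion exceeds the threshold."""
    return 1 - p_dist.cdf(threshold)

p_at_85 = p_reaches_ballot(0.85)
for threshold in (0.80, 0.82, 0.85):
    print(f"threshold {threshold:.2f}: P(p > threshold) = {p_reaches_ballot(threshold):.4f}")
```

A threshold equal to the true proportion gives exactly a 50% chance, so even the 82% cutoff leaves the outcome to a coin flip.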

8.7 6.6 Sampling Methods and Procedures

Selecting a representative sample is critical. A biased sample produces unreliable estimates, even with large n!

8.7.1 Sources of Sampling Error

1. Random Chance (“Bad Luck”)
- By pure chance, sample may include atypical elements
- Sample might have unusually high or low values
- Cannot be eliminated but can be quantified with standard error

2. Sample Bias (Systematic Error)
- Tendency to favor certain samples over others
- Arises from flawed data collection procedures
- CAN and MUST be minimized through proper sampling design

8.7.2 Historical Example: The Literary Digest Debacle (1936)

Warning: Classic Sampling Failure

1936 Presidential Election:
- Literary Digest predicted: Alf Landon (Republican) wins overwhelmingly
- Actual result: Franklin D. Roosevelt (Democrat) wins in landslide

What went wrong?

Sampling method:
- Drew names from telephone directories
- Drew names from magazine subscriber lists

The fatal flaw: In 1936, during the Great Depression:
- Telephones and magazine subscriptions were concentrated among wealthier households
- Wealthier voters leaned Republican and were more likely to favor Landon
- Sample was NOT representative of the general voting population

Result: Magazine went out of business. The prediction error destroyed credibility.

Lesson: A biased sample of 2 million is worse than a random sample of 1,000!

8.7.3 Sampling Method 1: Simple Random Sampling (SRS)

Tip: Simple Random Sample Definition

Every possible sample of size n has an equal probability of being selected.

Example: Select 5 states from 50 for consumer taste testing
- Total possible samples: {}_{50}C_5 = 2,118,760
- SRS ensures each combination has probability = 1/2,118,760

Implementation:
1. Manual: Write names on identical papers, draw from hat
2. Random number table: Use pre-generated random digits (Table A)
3. Computer: Use random number generators (Excel RANDBETWEEN, Python random.sample)

Advantages:
✓ Unbiased
✓ Simple to understand
✓ Allows calculation of sampling error

Disadvantages:
✗ Requires complete list of population (sampling frame)
✗ May miss important subgroups by chance
✗ Can be expensive if population is geographically dispersed
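The computer-based option can be sketched with the standard library. `math.comb` counts the possible samples and `random.sample` draws one without replacement; the state labels and the seed below are placeholders, not real data.

```python
import random
from math import comb

states = [f"State {i}" for i in range(1, 51)]  # placeholder labels for the 50 states

total_samples = comb(len(states), 5)  # number of distinct samples of size 5
rng = random.Random(2024)             # seeded generator so the draw is repeatable
chosen = rng.sample(states, 5)        # one simple random sample, no repeats

print(f"{total_samples:,} possible samples; this draw: {chosen}")
```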

8.7.4 Sampling Method 2: Systematic Sampling

Tip: Systematic Sampling Procedure

Select every ith element from an ordered population.

Steps:
1. Determine sampling interval: i = \frac{N}{n}
2. Randomly select starting point between 1 and i
3. Select every ith element thereafter

Example: Sample 100 from population of 1,000
- Sampling interval: i = 1000/100 = 10
- Random start: 7 (randomly chosen between 1-10)
- Sample: 7, 17, 27, 37, 47, …, 997

Advantages:
✓ Easy to implement (no expertise needed)
✓ Spreads sample evenly across population
✓ Less expensive than SRS for large populations

Disadvantages:
✗ Danger of hidden patterns! If population has cyclical pattern matching the interval, severe bias results
✗ Example: Sampling every 7th day might always hit Mondays (different from other days)
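The three steps translate into a few lines of code. The function below is a sketch with a hypothetical name; it assumes N is an exact multiple of n and returns 1-based positions in the ordered population.

```python
import random

def systematic_sample(N, n, start=None, rng=random):
    """Return n positions (1-based) taken at a fixed interval from a random start."""
    interval = N // n                       # step 1: sampling interval i = N/n
    if start is None:
        start = rng.randint(1, interval)    # step 2: random start between 1 and i
    return [start + k * interval for k in range(n)]  # step 3: every i-th element

positions = systematic_sample(1000, 100, start=7)  # the example above
print(positions[:5], "...", positions[-1])
```

Leaving `start` unset draws the starting point at random, which is how the method would normally be used.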

8.7.5 Sampling Method 3: Stratified Sampling

Tip: Stratified Sampling Design

Divide heterogeneous population into homogeneous subgroups (strata), then sample from each stratum.

Proportional stratification: Sample from each stratum proportionally to its size in population.

Example: USDA Drought Impact Study

Population: Farmers in 4 states (Kansas, Oklahoma, Nebraska, South Dakota)
- Kansas: 30% of all farmers → 30% of sample from Kansas
- Oklahoma: 25% of all farmers → 25% of sample from Oklahoma
- Nebraska: 28% of all farmers → 28% of sample from Nebraska
- South Dakota: 17% of all farmers → 17% of sample from South Dakota

Advantages:
✓ Ensures representation of important subgroups
✓ More precise than SRS of same size
✓ Can compare subgroups (stratified analysis)
✓ Reduces sampling error when strata are internally homogeneous

Disadvantages:
✗ Requires knowledge of population structure
✗ More complex to implement
✗ Need separate sampling frame for each stratum

When to use: Population is heterogeneous but contains identifiable homogeneous subgroups.

Example application: Income survey
- Strata: Age groups (18-30, 31-45, 46-60, 61+)
- Income varies greatly between age groups but is more similar within groups
- Stratified sampling ensures all age groups represented
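Proportional allocation for the USDA example can be sketched as follows. The total sample size of 200 and the farmer ID lists are hypothetical; only the state shares come from the example. Rounding each stratum independently happens to sum to 200 here, but in general the rounded counts may need a small adjustment.

```python
import random

rng = random.Random(11)  # seeded for repeatability

shares = {"Kansas": 0.30, "Oklahoma": 0.25, "Nebraska": 0.28, "South Dakota": 0.17}
SAMPLE_SIZE = 200        # hypothetical total sample size

# Proportional allocation: each stratum gets its population share of the sample
allocation = {state: round(share * SAMPLE_SIZE) for state, share in shares.items()}

# Hypothetical sampling frames: 1,000 farmer IDs per state
frames = {state: [f"{state}-{i:04d}" for i in range(1000)] for state in shares}

# Simple random sample within every stratum
stratified_sample = {
    state: rng.sample(frames[state], k) for state, k in allocation.items()
}

print(allocation)
```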

8.7.6 Sampling Method 4: Cluster Sampling

Tip: Cluster Sampling Design

Divide population into clusters (groups), randomly select some clusters, include ALL elements in selected clusters.

Key difference from stratified:
- Stratified: Sample from ALL strata
- Cluster: Sample only SOME clusters (but all elements within selected clusters)

Example: USDA Drought Study (Cluster Approach)

Clusters: Counties within each state
1. Randomly select 15 counties from all counties in the 4 states
2. Include ALL farmers in the 15 selected counties

Advantages:
✓ Cost-effective when population is geographically dispersed
✓ No need for complete population list (only cluster list)
✓ Easier field logistics (visit all farms in selected counties)

Disadvantages:
✗ Higher sampling error than SRS of same size (elements within cluster may be similar)
✗ If clusters are internally homogeneous, efficiency decreases

When to use: Geographically dispersed populations that fall into natural groups, high travel costs, or when complete population list unavailable.

Combining methods: Can use stratified cluster sampling
- Divide 4 states into strata
- Within each state, use cluster sampling of counties
- Sample proportional number of counties from each state

8.7.7 Potential Problems in Cluster Sampling

Risk: If selected clusters are atypical, bias results.

Example: If randomly selected counties all have:
- Unusually high irrigation usage → Overestimate crop productivity
- Unusually low irrigation usage → Underestimate crop productivity

Mitigation: Select more clusters (increases cost but reduces risk of unrepresentative clusters)
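The one-stage cluster design from the USDA example can be sketched in a few lines. Everything about the frame here is hypothetical (120 counties with 20 to 60 farmers each); the point is the mechanics: draw clusters at random, then keep every element inside them.

```python
import random

rng = random.Random(7)  # seeded for repeatability

# Hypothetical frame: county name -> list of farmers in that county
counties = {
    f"County {c}": [f"County {c} / farmer {i}" for i in range(rng.randint(20, 60))]
    for c in range(1, 121)
}

N_CLUSTERS = 15
selected = rng.sample(sorted(counties), N_CLUSTERS)  # step 1: random counties

# Step 2: include ALL farmers in the selected counties
cluster_sample = [farmer for county in selected for farmer in counties[county]]

print(f"{len(selected)} counties selected, {len(cluster_sample)} farmers surveyed")
```

Raising `N_CLUSTERS` is the mitigation described above: more clusters cost more to visit but make an unrepresentative draw less likely.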


8.8 Solved Problems

8.8.1 Problem 1: Investment Industry Returns

Note: Consumer Goods Industry Analysis

Investment records show consumer goods firms average 30% return rate with 12% standard deviation.

A sample of n = 250 firms is selected.

Find: Probability mean return exceeds 31%

Solution:

\sigma_{\bar{X}} = \frac{0.12}{\sqrt{250}} = 0.0076

Z = \frac{0.31 - 0.30}{0.0076} = 1.32

From Table E: Area = 0.4066

P(\bar{X} > 0.31) = 0.5000 - 0.4066 = 0.0934

9.34% probability of exceeding 31% return

Investment implication: 31% return threshold is 1.32 standard errors above mean - achievable but not common. Set realistic expectations!


8.8.2 Problem 2: Direct Marketing Proportion

Note: Marketing Channel Decision

Only 22% of consumer goods firms market directly to final consumers.

Decision rule: If sample of 250 firms shows > 20% direct marketing, you’ll purchase from this industry.

Find: Probability you’ll spend your money elsewhere (p ≤ 0.20)

Solution:

\sigma_p = \sqrt{\frac{0.22(0.78)}{250}} = 0.0262

Z = \frac{0.20 - 0.22}{0.0262} = -0.76

From Table E: Area = 0.2764

P(p > 0.20) = 0.5000 + 0.2764 = 0.7764

P(p \leq 0.20) = 1 - 0.7764 = 0.2236

22.36% probability you’ll shop elsewhere

Business insight: Most likely (77.64%) you’ll find adequate direct marketing representation and purchase from this industry.


8.8.3 Problem 3: Sampling Error Tolerance

Note: The Paper House Employee Hours

The Paper House (party supplies store):
- μ = 36.7 hours/week average employee work time
- σ = 3.5 hours
- Sample: n = 36 weeks

Owner Jill Ramsey’s requirement: At least 90% confidence estimate is within ±1 hour of true mean.

Find: Probability Ramsey won’t be disappointed

Solution:

Ramsey wants: P(|error| ≤ 1) = P(35.7 ≤ \bar{X} ≤ 37.7)

\sigma_{\bar{X}} = \frac{3.5}{\sqrt{36}} = 0.583

Z = \frac{37.7 - 36.7}{0.583} = 1.71

From Table E: Area = 0.4564

P(35.7 \leq \bar{X} \leq 37.7) = 0.4564 \times 2 = 0.9128

91.28% probability estimate is within ±1 hour (exceeds her 90% requirement ✓)

Managerial decision: The sample size n = 36 is adequate to meet Ramsey’s precision requirements!
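The three solved problems in this section share one pattern: build the sampling distribution, then read off a tail or interval probability. The sketch below reworks them with `statistics.NormalDist`; the variable names are illustrative, and the answers differ from the table-based ones in the third decimal because the solutions round Z to two places.

```python
from math import sqrt
from statistics import NormalDist

# Problem 1: P(mean return > 31%) with mu = 0.30, sigma = 0.12, n = 250
returns = NormalDist(0.30, 0.12 / sqrt(250))
p1 = 1 - returns.cdf(0.31)

# Problem 2: P(p <= 0.20) with pi = 0.22, n = 250
direct = NormalDist(0.22, sqrt(0.22 * 0.78 / 250))
p2 = direct.cdf(0.20)

# Problem 3: P(35.7 <= mean hours <= 37.7) with mu = 36.7, sigma = 3.5, n = 36
hours = NormalDist(36.7, 3.5 / sqrt(36))
p3 = hours.cdf(37.7) - hours.cdf(35.7)

print(f"Problem 1: {p1:.4f}  Problem 2: {p2:.4f}  Problem 3: {p3:.4f}")
```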


8.9 Formula Reference

8.9.1 Sampling Distribution of Means

Mean of Sample Means: \bar{\bar{X}} = \frac{\sum \bar{X}}{K}

Variance of Sampling Distribution: \sigma^2_{\bar{X}} = \frac{\sum(\bar{X} - \mu)^2}{K}

Standard Error: \sigma_{\bar{X}} = \sqrt{\sigma^2_{\bar{X}}}

Standard Error (Shortcut): \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}

With Finite Population Correction: \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}}

Z-Score for Sample Means: Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}

8.9.2 Sampling Distribution of Proportions

Expected Value: E(p) = \frac{\sum p}{K} = \pi

Standard Error: \sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}}

With Finite Population Correction: \sigma_p = \sqrt{\frac{\pi(1-\pi)}{n}} \sqrt{\frac{N-n}{N-1}}

Z-Score for Sample Proportions: Z = \frac{p - \pi}{\sigma_p}

8.9.3 Central Limit Theorem

For n ≥ 30: Sampling distribution of \bar{X} is approximately normal with:
- Mean: μ
- Standard Error: σ/√n

8.9.4 Sampling Error

Definition: \text{Sampling Error} = \bar{X} - \mu \quad \text{or} \quad p - \pi


8.10 Chapter Summary

This chapter bridged probability theory and statistical inference by introducing sampling distributions:

Key Concepts Mastered:

  1. Sampling Distribution - The distribution of all possible sample statistics (means or proportions)

  2. Sampling Error - Inevitable difference between sample statistic and population parameter

  3. Standard Error - Measure of sampling variability (analogous to standard deviation)

  4. Central Limit Theorem - Magic that makes normal distribution applicable even when population isn’t normal (for n ≥ 30)

  5. Sampling Methods:

    • Simple Random: Every sample equally likely
    • Systematic: Every ith element
    • Stratified: Proportional sampling from subgroups
    • Cluster: Sample entire groups

Critical Insights:

  • Larger samples → smaller standard error (but with diminishing returns - need 4x sample for half the error)
  • Sample means cluster more tightly around μ than individual observations (σ_{\bar{X}} < σ)
  • Bias is worse than random error - proper sampling method prevents bias
  • Different samples yield different statistics - sampling distributions quantify this variability

Business Applications:

  • Quality control decisions based on sample defect rates
  • Marketing decisions based on sample purchase proportions
  • Investment analysis using sample average returns
  • Policy decisions informed by survey samples

Next Chapter: We’ll use these sampling distribution concepts to build confidence intervals and conduct hypothesis tests - the core tools of statistical inference!

