4 Measures of Central Tendency and Dispersion

graph TD
    A[Measures of Central Tendency<br/>and Dispersion] --> B[Central Tendencies]
    A --> C[Dispersion]
    
    B --> D[Ungrouped Data]
    B --> E[Grouped Data]
    
    D --> F[Mean]
    D --> G[Median]
    D --> H[Mode]
    D --> I[Weighted Mean]
    D --> J[Geometric Mean]
    
    E --> K[Mean]
    E --> L[Median]
    E --> M[Mode]
    
    C --> N[Ungrouped Data]
    C --> O[Grouped Data]
    
    N --> P[Range]
    N --> Q[Variance and<br/>Standard Deviation]
    N --> R[Percentiles]
    
    O --> S[Variance and<br/>Standard Deviation]
    
    T[Related Concepts] --> U[Chebyshev's<br/>Theorem]
    T --> V[Empirical Rule]
    T --> W[Skewness]
    T --> X[Coefficient of<br/>Variation]
    
    style A fill:#2E86AB,stroke:#A23B72,stroke-width:3px,color:#fff
    style B fill:#A23B72,stroke:#F18F01,stroke-width:2px,color:#fff
    style C fill:#A23B72,stroke:#F18F01,stroke-width:2px,color:#fff
    style T fill:#2E86AB,stroke:#A23B72,stroke-width:3px,color:#fff
    style D fill:#F18F01,stroke:#C73E1D,stroke-width:2px,color:#fff
    style E fill:#F18F01,stroke:#C73E1D,stroke-width:2px,color:#fff
    style N fill:#F18F01,stroke:#C73E1D,stroke-width:2px,color:#fff
    style O fill:#F18F01,stroke:#C73E1D,stroke-width:2px,color:#fff

Chapter 3 Overview: Measures of Central Tendency and Dispersion

4.1 Business Scenario: Investment Analysis at Boggs Financial Services

Robert Boggs operates a successful investment advisory firm serving individual and institutional clients. His reputation depends on providing accurate, data-driven recommendations about stocks, bonds, mutual funds, and other securities.

Daily, Boggs faces critical questions from nervous investors:
- “What’s the average return on this mutual fund?”
- “How consistent are the quarterly earnings of this company?”
- “Is this stock more volatile than that one?”
- “Which investment has less risk for the same expected return?”

To answer these questions professionally, Boggs cannot simply list raw numbers; he must summarize and describe the data using precise statistical measures. A list of 50 daily stock prices is hard for an investor to interpret, whereas the mean price and standard deviation condense it into two numbers that capture its level and its variability.

This chapter introduces the fundamental tools for describing datasets: measures of central tendency (where the data is centered) and measures of dispersion (how spread out the data is). These concepts form the foundation of all statistical analysis in business, economics, finance, quality control, and scientific research.

The Fundamental Challenge

Raw data in bulk is overwhelming and hard to interpret. A dataset of 1,000 observations cannot be communicated effectively as a list. Statistical measures reduce thousands of numbers to a handful of meaningful summaries that support intelligent decision-making.

Central Tendency: Where is the “typical” value?
Dispersion: How much variability exists around that typical value?

Together, these two concepts provide a compact numerical description of a dataset.

4.2 3.1 Measures of Central Tendency for Ungrouped Data

Central tendency measures identify the “center” or “typical value” of a dataset. The three primary measures are:

Mean (\bar{X} or \mu) — Arithmetic average
Median — Middle value when data is ordered
Mode — Most frequently occurring value

Each measure has specific use cases, advantages, and limitations.

4.2.1 A. The Mean (La Media)

The mean is the arithmetic average—the sum of all values divided by the count of observations. It is the most commonly used measure of central tendency in statistical analysis.

Definition: Mean

Sample Mean: \bar{X} = \frac{\sum X_i}{n}

Population Mean: \mu = \frac{\sum X_i}{N}

where:
- X_i = individual observations
- n = sample size
- N = population size
- \sum = summation symbol (add all values)

Key Characteristics of the Mean:

Uses every observation in the dataset
Sensitive to extreme values (outliers)
Has desirable mathematical properties (totals, weighted averages)
Represents the “center of gravity” or “balance point” of the data

4.2.2 Example 3.1: Mean Sales Revenue

A small retail store recorded monthly sales revenue (in thousands of dollars) for 5 months:

56, 67, 52, 45, 67

Calculate the mean sales revenue.

Solution:

\bar{X} = \frac{\sum X_i}{n} = \frac{56 + 67 + 52 + 45 + 67}{5} = \frac{287}{5} = 57.4 \text{ thousand dollars}

Interpretation: The average monthly sales revenue is $57,400. This represents the typical monthly performance. Notice that the value 67 appears twice (the mode), but the mean (57.4) falls below this repeated value because the lower values (45, 52, 56) pull the average down.
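The calculation in Example 3.1 can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original example.

```python
from statistics import mean

# Monthly sales revenue from Example 3.1, in thousands of dollars
sales = [56, 67, 52, 45, 67]

total = sum(sales)        # numerator: sum of all observations
n = len(sales)            # denominator: number of observations
mean_sales = total / n    # sample mean, X-bar

print(f"Sum = {total}, n = {n}, mean = {mean_sales:.1f} thousand dollars")

# statistics.mean applies the same formula in one call
print(f"statistics.mean gives {mean(sales):.1f}")
```

Writing out the numerator and denominator separately mirrors the formula; in practice a single call to `statistics.mean` or `numpy.mean` is the usual choice.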

Business Application: Using the Mean

The mean is ideal for:
- Budgeting: Average monthly expenses for annual planning
- Performance metrics: Average sales per employee
- Quality control: Average product weight or dimension
- Financial analysis: Average stock return over time

When NOT to use the mean:
- Data contains extreme outliers (CEO salaries, real estate prices)
- Data is highly skewed (income distributions)
- In these cases, the median provides a better “typical” value

Code

import matplotlib.pyplot as plt
import numpy as np

# Sales data
sales = [45, 52, 56, 67, 67]
mean_sales = np.mean(sales)

fig, ax = plt.subplots(figsize=(12, 6))

# Plot data points as weights on a balance beam
y_positions = [0.5] * len(sales)
colors = ['#3498db' if x < mean_sales else '#e74c3c' for x in sales]

ax.scatter(sales, y_positions, s=300, alpha=0.7, c=colors, edgecolors='black', linewidths=2, zorder=3)

# Draw balance beam
ax.plot([40, 75], [0.5, 0.5], 'k-', linewidth=3, alpha=0.3)

# Draw fulcrum at mean
fulcrum_x = [mean_sales - 1, mean_sales, mean_sales + 1, mean_sales - 1]
fulcrum_y = [0, 0.3, 0, 0]
# Use facecolor (not color) so the black edgecolor is not overridden
ax.fill(fulcrum_x, fulcrum_y, facecolor='gold', edgecolor='black', linewidth=2, zorder=2)

# Vertical line at mean
ax.axvline(mean_sales, color='red', linestyle='--', linewidth=2, alpha=0.7, label=f'Mean = ${mean_sales:.1f}K')

# Annotations
for i, (x, y) in enumerate(zip(sales, y_positions)):
    ax.annotate(f'${x}K', xy=(x, y), xytext=(0, 20), 
               textcoords='offset points', ha='center', fontweight='bold', fontsize=10)

# Add deviation arrows
for x in sales:
    if x < mean_sales:
        ax.annotate('', xy=(mean_sales, 0.5), xytext=(x, 0.5),
                   arrowprops=dict(arrowstyle='<->', color='blue', lw=1.5, alpha=0.5))
    elif x > mean_sales:
        ax.annotate('', xy=(x, 0.5), xytext=(mean_sales, 0.5),
                   arrowprops=dict(arrowstyle='<->', color='red', lw=1.5, alpha=0.5))

ax.set_xlim(40, 75)
ax.set_ylim(-0.2, 1.2)
ax.set_xlabel('Sales Revenue ($1000s)', fontsize=12, fontweight='bold')
ax.set_title('Mean as Balance Point: Sales Data\n(Deviations below mean balance deviations above mean)', 
            fontsize=13, fontweight='bold', pad=15)
ax.legend(loc='upper right', fontsize=11)
ax.set_yticks([])
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

Figure 4.1: The Mean as Center of Gravity: Balancing Point of Data

4.2.3 B. The Median (La Mediana)

The median is the middle value when observations are arranged in order from smallest to largest. Half of the observations fall at or below the median and half fall at or above it.

Definition: Median

Calculation Steps:

Order the data from smallest to largest
Find the position: L_{median} = \frac{n+1}{2}
Determine the median:
- If n is odd: Median = value at position L_{median}
- If n is even: Median = average of the two middle values

Key Characteristics of the Median:

Not affected by extreme values (outliers)
Robust measure for skewed distributions
Represents the 50th percentile
Better than mean for income, prices, and other skewed data

4.2.4 Example 3.2: Median Manufacturing Times

A chip manufacturer tests processing times (in seconds) for 20 processors:

3.2, 4.1, 6.3, 1.9, 0.6, 5.4, 5.2, 3.2, 4.9, 6.2, 1.8, 1.7, 3.6, 1.5, 2.6, 4.3, 6.1, 2.4, 2.2, 3.3

Calculate the median processing time.

Solution:

Step 1: Order the data:
0.6, 1.5, 1.7, 1.8, 1.9, 2.2, 2.4, 2.6, 3.2, 3.2, 3.3, 3.6, 4.1, 4.3, 4.9, 5.2, 5.4, 6.1, 6.2, 6.3

Step 2: Find position:
L_{median} = \frac{20+1}{2} = 10.5

The median is between the 10th and 11th observations.

Step 3: Calculate median:
- 10th observation: 3.2
- 11th observation: 3.3

\text{Median} = \frac{3.2 + 3.3}{2} = 3.25 \text{ seconds}

Interpretation: Half the processors complete in 3.25 seconds or less; half take 3.25 seconds or more. This “typical” value is not influenced by the extreme low value (0.6 seconds) or high value (6.3 seconds).
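The position rule used in Example 3.2 can be sketched as a small function. The helper name `median_by_position` is hypothetical and chosen for illustration; the result is compared against the standard library's `statistics.median`.

```python
from statistics import median

def median_by_position(values):
    """Median using the (n + 1) / 2 position rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                 # 1-based position of the median
    if n % 2 == 1:
        return ordered[int(position) - 1]  # odd n: a single middle value
    lower = ordered[int(position) - 1]     # even n: value just below the position
    upper = ordered[int(position)]         # even n: value just above the position
    return (lower + upper) / 2

# Processing times from Example 3.2, in seconds
times = [3.2, 4.1, 6.3, 1.9, 0.6, 5.4, 5.2, 3.2, 4.9, 6.2,
         1.8, 1.7, 3.6, 1.5, 2.6, 4.3, 6.1, 2.4, 2.2, 3.3]

result = median_by_position(times)
print(f"Median by position rule: {result:.2f} seconds")
print(f"statistics.median:       {median(times):.2f} seconds")
```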

When to Use Median vs. Mean

Use MEDIAN when:
- Data contains outliers (extremely high or low values)
- Distribution is skewed (income, house prices, CEO compensation)
- You want the “typical middle” value unaffected by extremes

Use MEAN when:
- Data is symmetric (approximately normally distributed)
- All observations are equally important
- You need mathematical properties (totals, weighted averages)
- You are combining averages from several groups (weighted by group size)

Classic Example: In a neighborhood with 9 homes worth $200K each and 1 mansion worth $5M:
- Mean price = $680K (misleading—no home sells near this price!)
- Median price = $200K (accurate representation of typical home)
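The neighborhood example is easy to check numerically. The sketch below assumes prices are stored in thousands of dollars; the variable names are illustrative.

```python
from statistics import mean, median

# Nine homes at $200K and one mansion at $5M (values in thousands of dollars)
prices = [200] * 9 + [5000]

mean_price = mean(prices)
median_price = median(prices)
print(f"All ten homes:   mean = ${mean_price:,.0f}K, median = ${median_price:,.0f}K")

# Dropping the mansion leaves the median unchanged but moves the mean sharply
without_mansion = prices[:-1]
print(f"Without mansion: mean = ${mean(without_mansion):,.0f}K, "
      f"median = ${median(without_mansion):,.0f}K")
```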

Code

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel 1: Symmetric data (Mean ≈ Median)
np.random.seed(42)
symmetric_data = np.random.normal(50, 10, 100)
mean_sym = np.mean(symmetric_data)
median_sym = np.median(symmetric_data)

axes[0].hist(symmetric_data, bins=20, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].axvline(mean_sym, color='red', linestyle='--', linewidth=2.5, label=f'Mean = {mean_sym:.1f}')
axes[0].axvline(median_sym, color='green', linestyle=':', linewidth=2.5, label=f'Median = {median_sym:.1f}')
axes[0].set_title('Symmetric Distribution\n(Mean ≈ Median)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Value', fontsize=11)
axes[0].set_ylabel('Frequency', fontsize=11)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Panel 2: Skewed data with outlier (Mean >> Median)
skewed_data = np.concatenate([
    np.random.normal(30, 5, 90),
    np.array([80, 85, 90, 95, 100, 120, 140, 160, 180, 200])  # Outliers
])
mean_skew = np.mean(skewed_data)
median_skew = np.median(skewed_data)

axes[1].hist(skewed_data, bins=25, color='coral', alpha=0.7, edgecolor='black')
axes[1].axvline(mean_skew, color='red', linestyle='--', linewidth=2.5, label=f'Mean = {mean_skew:.1f}')
axes[1].axvline(median_skew, color='green', linestyle=':', linewidth=2.5, label=f'Median = {median_skew:.1f}')
axes[1].set_title('Right-Skewed Distribution\n(Mean > Median due to outliers)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Value', fontsize=11)
axes[1].set_ylabel('Frequency', fontsize=11)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

# Add annotation arrows
axes[1].annotate('Outliers pull\nmean upward', 
                xy=(mean_skew, 5), xytext=(100, 15),
                arrowprops=dict(arrowstyle='->', lw=2, color='red'),
                fontsize=10, fontweight='bold',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.7))

plt.tight_layout()
plt.show()

Figure 4.2: Mean vs. Median: Effect of Outliers

4.2.5 C. The Mode (La Moda)

The mode is the value that appears most frequently in the dataset. Unlike the mean and median, it need not exist or be unique; a dataset can have:

No mode: All values appear once (uniform frequency)
One mode (unimodal): One value appears most often
Two modes (bimodal): Two values tied for highest frequency
Multimodal: Three or more values tied

Key Characteristics of the Mode:

Only measure of central tendency available for nominal (unordered categorical) data
Can be used with any type of data (numerical or categorical)
Not affected by extreme values
May not exist or may not be unique
Most useful for discrete data (shoe sizes, defect counts)

4.2.6 Example 3.3: Mode in Profit Analysis

An analyst at Acme, Inc. examines monthly profits (in thousands of dollars) for the past 12 months:

56, 67, 52, 45, 67, 52, 58, 67, 62, 70, 58, 45

Identify the mode.

Solution:

Frequency count:
- 45 appears 2 times
- 52 appears 2 times
- 56 appears 1 time
- 58 appears 2 times
- 62 appears 1 time
- 67 appears 3 times ← Maximum frequency
- 70 appears 1 time

Mode = $67,000 (appears 3 times)

Interpretation: The most common monthly profit level is $67,000. This occurred in 25% of months (3 out of 12). However, three other values (45, 52, and 58) each appear twice, so 67 leads by only one occurrence and is not a strongly dominant value.
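A frequency count like the one in Example 3.3 can be produced with the standard library. This sketch uses `collections.Counter` for the tally and `statistics.multimode`, which returns every value tied for the highest frequency, so it also handles bimodal data.

```python
from collections import Counter
from statistics import multimode

# Monthly profits from Example 3.3, in thousands of dollars
profits = [56, 67, 52, 45, 67, 52, 58, 67, 62, 70, 58, 45]

counts = Counter(profits)
for value, freq in sorted(counts.items()):
    print(f"{value}: appears {freq} time(s)")

modes = multimode(profits)                   # list of all most-frequent values
share = counts[modes[0]] / len(profits)      # fraction of months at the mode
print(f"Mode(s): {modes}, share of months: {share:.0%}")
```

Returning a list rather than a single value is a deliberate choice here: it avoids silently picking one winner when two or more values are tied.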

Business Applications of the Mode

Inventory Management:
A shoe store needs to know which sizes to stock heavily. If Size 9 is the mode, order more Size 9 inventory.

Quality Control:
If “3 defects per unit” is the mode in production data, investigate what process conditions lead to this specific defect count.

Market Research:
If “Very Satisfied” is the modal response in customer surveys, this is the most common customer sentiment.

Categorical Data:
The mode is the ONLY measure of central tendency for nominal data (categories with no natural order):
- Most popular car color: “Silver” (mode)
- Most common complaint type: “Billing error” (mode)
- Most frequent transaction method: “Credit card” (mode)

Code

import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel 1: Unimodal distribution
profits = [56, 67, 52, 45, 67, 52, 58, 67, 62, 70, 58, 45]
counter = Counter(profits)
values = sorted(counter.keys())
frequencies = [counter[v] for v in values]

bars1 = axes[0].bar(values, frequencies, color='steelblue', alpha=0.7, edgecolor='black', linewidth=1.5)

# Highlight mode
mode_value = max(counter, key=counter.get)
mode_index = values.index(mode_value)
bars1[mode_index].set_color('#e74c3c')
bars1[mode_index].set_edgecolor('black')
bars1[mode_index].set_linewidth(3)

axes[0].set_title('Unimodal Distribution\n(One clear mode at $67K)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Monthly Profit ($1000s)', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Frequency (Number of Months)', fontsize=11, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')

# Add annotation for mode
axes[0].annotate(f'Mode = ${mode_value}K\n(appears {counter[mode_value]} times)', 
                xy=(mode_value, counter[mode_value]), xytext=(mode_value + 5, counter[mode_value] + 0.3),
                arrowprops=dict(arrowstyle='->', lw=2, color='red'),
                fontsize=10, fontweight='bold',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.7))

# Panel 2: Bimodal distribution
bimodal_data = [20, 25, 25, 25, 30, 35, 40, 60, 65, 65, 65, 70, 75]
counter2 = Counter(bimodal_data)
values2 = sorted(counter2.keys())
frequencies2 = [counter2[v] for v in values2]

bars2 = axes[1].bar(values2, frequencies2, color='coral', alpha=0.7, edgecolor='black', linewidth=1.5)

# Highlight both modes
modes = [k for k, v in counter2.items() if v == max(counter2.values())]
for mode in modes:
    mode_idx = values2.index(mode)
    bars2[mode_idx].set_color('#2ecc71')
    bars2[mode_idx].set_edgecolor('black')
    bars2[mode_idx].set_linewidth(3)

axes[1].set_title('Bimodal Distribution\n(Two modes: 25 and 65)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Value', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

# Add annotations for both modes
for mode in modes:
    axes[1].annotate(f'Mode = {mode}', 
                    xy=(mode, counter2[mode]), xytext=(mode, counter2[mode] + 0.5),
                    ha='center', fontsize=9, fontweight='bold',
                    bbox=dict(boxstyle='round,pad=0.3', facecolor='lightgreen', alpha=0.8))

plt.tight_layout()
plt.show()

Figure 4.3: Mode Identification: Unimodal vs. Bimodal Distributions

4.3 3.2 Weighted Mean (Media Ponderada)

When observations have different levels of importance, the simple arithmetic mean is inappropriate. The weighted mean accounts for the relative importance of each value.

Definition: Weighted Mean

\bar{X}_W = \frac{\sum X_i W_i}{\sum W_i}

where:
- X_i = individual observations
- W_i = weight (importance) of each observation
- Numerator: Sum of (value × weight)
- Denominator: Sum of all weights

Common Business Applications:

Student grades: Different assignments worth different percentages
Investment portfolios: Different investments with different dollar amounts
Quality control: Defect rates from suppliers with different volumes
Economic indices: Components weighted by importance (CPI, stock indices)

4.3.1 Example 3.4: Weighted Mean for Student Grades

A student receives the following grades with specified weights:

Assessment	Grade	Weight
Homework	85%	20%
Midterm	78%	30%
Final Exam	92%	50%

Calculate the final course grade using weighted mean.

Solution:

\bar{X}_W = \frac{\sum X_i W_i}{\sum W_i} = \frac{(85 \times 0.20) + (78 \times 0.30) + (92 \times 0.50)}{0.20 + 0.30 + 0.50}

= \frac{17.0 + 23.4 + 46.0}{1.00} = \frac{86.4}{1.00} = 86.4\%

Comparison with simple mean:

\bar{X}_{simple} = \frac{85 + 78 + 92}{3} = \frac{255}{3} = 85.0\%

Interpretation: The weighted mean (86.4%) properly reflects that the Final Exam counts for 50% of the grade. The simple mean (85.0%) incorrectly treats all assessments equally. Since the student performed best on the most important exam (92% on Final), the weighted average is higher.

Final Grade: B+ (86.4%)
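The weighted-mean formula translates directly into code. The function below is a minimal sketch; the name `weighted_mean` and its error checks are illustrative choices, not part of the original example.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight

grades = [85, 78, 92]           # homework, midterm, final exam (percent)
weights = [0.20, 0.30, 0.50]    # share of the course grade

course_grade = weighted_mean(grades, weights)
simple_grade = sum(grades) / len(grades)
print(f"Weighted mean: {course_grade:.1f}%   Simple mean: {simple_grade:.1f}%")
```

Dividing by the sum of the weights means the function also works when the weights are not normalized, for example credit hours or dollar amounts.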

Critical Error to Avoid

NEVER use simple mean when data has different importance levels!

Wrong: Averaging percentages directly
Right: Weight each percentage by its base amount

Example: Investment returns
- Stock A: 12% return on $10,000 investment
- Stock B: 8% return on $90,000 investment

❌ WRONG: (12\% + 8\%) / 2 = 10\% average return
✅ CORRECT: Weighted mean = (0.12 \times 10000 + 0.08 \times 90000) / 100000 = 8.4\%

The $90K investment dominates the portfolio—its 8% return matters far more than the 12% on $10K!
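One way to see why the dollar-weighted figure is the right one is to compute the portfolio's total dollar gain and divide by the total invested. The sketch below does exactly that; the dictionary layout is an assumption made for illustration.

```python
# Two-stock portfolio from the example above
positions = [
    {"name": "Stock A", "invested": 10_000, "ret": 0.12},
    {"name": "Stock B", "invested": 90_000, "ret": 0.08},
]

total_invested = sum(p["invested"] for p in positions)
dollar_gain = sum(p["invested"] * p["ret"] for p in positions)

portfolio_return = dollar_gain / total_invested                    # dollar-weighted mean
naive_return = sum(p["ret"] for p in positions) / len(positions)   # ignores position size

print(f"Dollar gain: ${dollar_gain:,.0f} on ${total_invested:,} invested")
print(f"Portfolio return: {portfolio_return:.1%}   Naive average: {naive_return:.1%}")
```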

3.3 Geometric Mean (Media Geométrica)

The geometric mean measures the average rate of change or growth rate over time. It is essential for financial returns, population growth, inflation rates, and any data involving compounding or multiplicative changes.

Definition: Geometric Mean

For raw values: GM = \sqrt[n]{X_1 \times X_2 \times \cdots \times X_n} = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}

For growth rates (percentage changes): GM = \left[\sqrt[n]{(1+r_1)(1+r_2)\cdots(1+r_n)}\right] - 1

where r_i represents the growth rate for period i (expressed as decimal).

Critical Difference from Arithmetic Mean:

Arithmetic mean: Sum values and divide (appropriate for additive data)
Geometric mean: Multiply values and take nth root (appropriate for multiplicative/growth data)

When to use geometric mean:

Investment returns over multiple periods
Population growth rates
Inflation rates
Any data involving compounding or percentage changes over time

Common Mistake: Using Arithmetic Mean for Growth Rates

This is mathematically incorrect and will overstate actual growth!

Example: Investment grows 50% in Year 1, then declines 20% in Year 2.

❌ WRONG (Arithmetic Mean):
(50\% + (-20\%)) / 2 = 15\% average annual return

✅ CORRECT (Geometric Mean):
\sqrt{(1.50)(0.80)} - 1 = \sqrt{1.20} - 1 = 1.095 - 1 = 9.5\% average annual return

Verification:
\$1000 \times 1.50 \times 0.80 = \$1200 (final value)
\$1000 \times (1.095)^2 = \$1199 ✓ (geometric mean correctly compounds)
\$1000 \times (1.15)^2 = \$1322 ✗ (arithmetic mean overstates growth!)
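The same comparison can be sketched in Python. The helper `geometric_mean_return` is a hypothetical name; it takes per-period returns as decimals and assumes every growth factor (1 + r) is positive.

```python
import math

def geometric_mean_return(returns):
    """Average compound growth rate for per-period returns given as decimals."""
    factors = [1 + r for r in returns]
    if any(f <= 0 for f in factors):
        raise ValueError("each growth factor (1 + r) must be positive")
    return math.prod(factors) ** (1 / len(factors)) - 1

returns = [0.50, -0.20]                  # +50% in Year 1, then -20% in Year 2
gm = geometric_mean_return(returns)
am = sum(returns) / len(returns)

start = 1000
actual_end = start * math.prod(1 + r for r in returns)
print(f"Geometric mean: {gm:.2%}   Arithmetic mean: {am:.2%}")
print(f"Actual ending value:         ${actual_end:,.2f}")
print(f"Compounding geometric mean:  ${start * (1 + gm) ** 2:,.2f}")
print(f"Compounding arithmetic mean: ${start * (1 + am) ** 2:,.2f}")
```

Compounding at the geometric mean reproduces the actual ending value, while compounding at the arithmetic mean does not.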

4.3.2 Example 3.5: Geometric Mean for Investment Returns

Boggs Financial Services analyzes two mutual funds with identical arithmetic average returns of 11% per year:

Megabucks Fund (5-year returns): 12%, 10%, 13%, 9%, 11%
Dynamics Fund (5-year returns): 13%, 12%, 14%, 10%, 6%

Megabucks arithmetic mean = (12+10+13+9+11)/5 = 11\%
Dynamics arithmetic mean = (13+12+14+10+6)/5 = 11\%

Question: Which fund actually performed better over the 5-year period?

Solution:

Megabucks Fund (geometric mean): GM_{Megabucks} = \sqrt[5]{(1.12)(1.10)(1.13)(1.09)(1.11)} - 1 = \sqrt[5]{1.6844} - 1 = 1.1099 - 1 = 0.1099 = 10.99\%

Dynamics Fund (geometric mean): GM_{Dynamics} = \sqrt[5]{(1.13)(1.12)(1.14)(1.10)(1.06)} - 1 = \sqrt[5]{1.6823} - 1 = 1.1096 - 1 = 0.1096 = 10.96\%

Interpretation:
Although both funds have the same arithmetic average return (11%), Megabucks slightly outperformed Dynamics on a compound annual growth rate (CAGR) basis: 10.99% vs. 10.96%.

Verification: If you invested $10,000 in each fund:

Megabucks: $10,000 × (1.1099)^5 ≈ $10,000 × 1.6844 = $16,844 ✓
Dynamics: $10,000 × (1.1096)^5 ≈ $10,000 × 1.6823 = $16,823 ✓

Megabucks earned about $21 more over 5 years due to less volatility in returns.

Financial Principle: Volatility Reduces Compound Returns

Even with the same arithmetic average, more volatile returns (Dynamics: 6% to 14%) produce lower compound growth than stable returns (Megabucks: 9% to 13%).

This is why financial advisors emphasize risk-adjusted returns, not just average returns. The geometric mean captures this volatility penalty.
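The two-fund comparison can be recomputed with the standard library's `statistics.geometric_mean`, applied to growth factors rather than raw percentages. This is a sketch; the dictionary layout and the $10,000 starting balance follow the example, and the variable names are illustrative.

```python
from statistics import geometric_mean, mean

funds = {
    "Megabucks": [12, 10, 13, 9, 11],   # annual returns in percent
    "Dynamics": [13, 12, 14, 10, 6],
}

results = {}
for name, pct_returns in funds.items():
    factors = [1 + r / 100 for r in pct_returns]   # growth factor per year
    cagr = geometric_mean(factors) - 1             # compound annual growth rate
    ending = 10_000 * (1 + cagr) ** len(factors)   # value of $10,000 after 5 years
    results[name] = (mean(pct_returns), cagr, ending)
    print(f"{name}: arithmetic mean = {mean(pct_returns):.2f}%, "
          f"geometric mean = {cagr:.2%}, ending value = ${ending:,.0f}")
```

Running the sketch shows identical arithmetic means but a lower geometric mean for the fund whose returns vary more.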

4.3.3 Example 3.6: White-Knuckle Airlines Revenue Growth

White-Knuckle Airlines reports annual revenues over 5 years:

Year	Revenue	Growth from Previous Year
1992	$50,000	—
1993	$55,000	10% increase
1994	$66,000	20% increase
1995	$60,000	9.1% decrease
1996	$78,000	30% increase

Question: What is the average annual growth rate (CAGR) from 1992 to 1996?

Solution:

Growth factors:
- 1993: $55,000 / 50,000 = 1.10 (10% growth)
- 1994: $66,000 / 55,000 = 1.20 (20% growth)
- 1995: $60,000 / 66,000 = 0.909 (9.1% decline)
- 1996: $78,000 / 60,000 = 1.30 (30% growth)

Geometric mean: GM = \sqrt[4]{(1.10)(1.20)(0.9091)(1.30)} - 1 = \sqrt[4]{1.56} - 1 = 1.1176 - 1 = 0.1176 = 11.76\%

Because the intermediate years cancel, the product of the growth factors is simply $78,000 / $50,000 = 1.56.

Verification:
$50,000 × (1.1176)^4 ≈ $50,000 × 1.56 = $78,000 ✓

Interpretation: Despite the 1995 decline, White-Knuckle Airlines achieved an 11.76% compound annual growth rate over the 4-year period. This exceeds the industry average of 10%, indicating strong performance.

For investor presentations: “We’ve grown revenue at a compound annual rate of 11.8% over the past 4 years, outpacing industry growth by roughly 180 basis points.”
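A compound annual growth rate can be computed either from the year-over-year growth factors or from the first and last revenue figures alone, since the intermediate values cancel in the product. The sketch below computes both from the revenue table; the variable names are illustrative.

```python
# Annual revenue from the table in Example 3.6
revenue = {1992: 50_000, 1993: 55_000, 1994: 66_000, 1995: 60_000, 1996: 78_000}

years = sorted(revenue)
# Growth factor for each year: that year's revenue divided by the prior year's
factors = [revenue[y] / revenue[prev] for prev, y in zip(years, years[1:])]

periods = len(factors)        # 5 data points give 4 growth periods
product = 1.0
for f in factors:
    product *= f

cagr_from_factors = product ** (1 / periods) - 1
cagr_from_endpoints = (revenue[years[-1]] / revenue[years[0]]) ** (1 / periods) - 1

print("Growth factors:", [round(f, 4) for f in factors])
print(f"CAGR from factors:   {cagr_from_factors:.2%}")
print(f"CAGR from endpoints: {cagr_from_endpoints:.2%}")
print(f"Check: ${revenue[years[0]] * (1 + cagr_from_endpoints) ** periods:,.0f}")
```

The endpoint form is the more robust of the two in practice because it avoids accumulating rounding error from intermediate growth factors.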

Code

import matplotlib.pyplot as plt
import numpy as np

# Investment scenarios
years = np.arange(0, 6)
initial = 10000

# Scenario 1: Steady growth (10% per year)
steady_values = initial * (1.10) ** years

# Scenario 2: Volatile returns with the same 10% arithmetic mean
# Year 0 is the starting point, so its return is 0
annual_returns = [0, 30, -15, 15, 5, 15]  # Percentage returns
growth_factors = [1 + r / 100 for r in annual_returns[1:]]
volatile_values = initial * np.concatenate([[1.0], np.cumprod(growth_factors)])

# Compute both averages from the data instead of hard-coding them
arith_mean = np.mean(annual_returns[1:])
geom_mean = (np.prod(growth_factors) ** (1 / len(growth_factors)) - 1) * 100

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel 1: Investment growth comparison
axes[0].plot(years, steady_values, 'g-', linewidth=3, marker='o', markersize=8,
            label='Steady 10% growth\n(Geometric Mean = 10%)', alpha=0.8)
axes[0].plot(years, volatile_values, 'r--', linewidth=3, marker='s', markersize=8,
            label=f'Volatile returns\n(Arithmetic Mean = {arith_mean:.0f}%,\n'
                  f'Geometric Mean = {geom_mean:.1f}%)', alpha=0.8)

axes[0].set_title('Investment Growth: Steady vs. Volatile Returns\n(Both have 10% arithmetic average)',
                 fontsize=12, fontweight='bold')
axes[0].set_xlabel('Year', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Portfolio Value ($)', fontsize=11, fontweight='bold')
axes[0].legend(fontsize=10, loc='upper left')
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(9000, 18000)

# Add final value annotations
axes[0].annotate(f'${steady_values[-1]:,.0f}',
                xy=(5, steady_values[-1]), xytext=(4.2, steady_values[-1] + 500),
                fontsize=10, fontweight='bold', color='green',
                bbox=dict(boxstyle='round,pad=0.3', facecolor='lightgreen', alpha=0.7))
axes[0].annotate(f'${volatile_values[-1]:,.0f}\n(Loss due to volatility)',
                xy=(5, volatile_values[-1]), xytext=(4.2, volatile_values[-1] - 1500),
                fontsize=10, fontweight='bold', color='red',
                arrowprops=dict(arrowstyle='->', lw=2, color='red'),
                bbox=dict(boxstyle='round,pad=0.3', facecolor='pink', alpha=0.7))

# Panel 2: Annual returns bar chart for volatile scenario
colors = ['gray' if r == 0 else 'green' if r > 0 else 'red' for r in annual_returns]

bars = axes[1].bar(years, annual_returns, color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)
axes[1].axhline(y=arith_mean, color='blue', linestyle='--', linewidth=2,
               label=f'Arithmetic Mean = {arith_mean:.0f}%')

# Show the geometric mean computed above
axes[1].axhline(y=geom_mean, color='orange', linestyle=':', linewidth=2.5,
               label=f'Geometric Mean = {geom_mean:.1f}%')

axes[1].set_title('Annual Returns for Volatile Portfolio\n(Why Arithmetic Mean Overstates Growth)', 
                 fontsize=12, fontweight='bold')
axes[1].set_xlabel('Year', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Annual Return (%)', fontsize=11, fontweight='bold')
axes[1].legend(fontsize=10, loc='upper right')
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].set_xticks(years)

# Add value labels on bars
for bar, ret in zip(bars, annual_returns):
    if ret != 0:
        height = bar.get_height()
        axes[1].text(bar.get_x() + bar.get_width()/2., height,
                    f'{ret:+.0f}%',
                    ha='center', va='bottom' if ret > 0 else 'top', 
                    fontweight='bold', fontsize=9)

plt.tight_layout()
plt.show()

Figure 4.4: Arithmetic vs. Geometric Mean for Investment Returns

4.4 3.4 Measures of Dispersion for Ungrouped Data

Dispersion (or variability or spread) measures how scattered or spread out observations are around the central tendency. Two datasets can have the same mean but vastly different dispersions.

Why Dispersion Matters

Central tendency alone is insufficient!

Consider two manufacturing processes:
- Process A: Mean diameter = 10.0 mm, values range from 9.8 to 10.2 mm
- Process B: Mean diameter = 10.0 mm, values range from 5.0 to 15.0 mm

Both have identical means, but Process B is catastrophically inconsistent and would produce massive quality problems. Dispersion measures reveal this critical difference.

Primary Measures of Dispersion:

Range — Difference between maximum and minimum
Variance (s^2 or \sigma^2) — Average of squared deviations
Standard Deviation (s or \sigma) — Square root of variance
Interquartile Range (IQR) — Range of middle 50% of data
Percentiles — Values below which certain percentages fall

4.4.1 A. Range and Interquartile Range

The range is the simplest measure of dispersion:

\text{Range} = X_{max} - X_{min}

Advantages: Quick and easy to calculate
Disadvantages: Uses only two values (ignores all others), extremely sensitive to outliers

The interquartile range (IQR) measures the spread of the middle 50% of data:

IQR = Q_3 - Q_1 = P_{75} - P_{25}

where Q_1 is the first quartile (25th percentile) and Q_3 is the third quartile (75th percentile).

Advantages: Resistant to outliers (ignores extreme 25% on each end)
Disadvantages: Ignores outer 50% of data
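The contrast between the two measures shows up clearly when a single extreme value is added to a dataset. The sketch below uses illustrative numbers that do not come from the text, and it relies on NumPy's default linear-interpolation percentiles; other quartile conventions can give slightly different values.

```python
import numpy as np

def spread_summary(values):
    """Return (range, IQR) using NumPy's default percentile interpolation."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    return arr.max() - arr.min(), q3 - q1

base = [42, 45, 47, 48, 50, 51, 53, 55, 58]   # illustrative values (assumption)
with_outlier = base + [120]                   # same data plus one extreme value

for label, data in [("base", base), ("with outlier", with_outlier)]:
    data_range, iqr = spread_summary(data)
    print(f"{label:>12}: range = {data_range:.1f}, IQR = {iqr:.2f}")
```

The range grows several-fold when the outlier is added, while the IQR changes only modestly.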

Code

import matplotlib.pyplot as plt
import numpy as np

# Generate sample data with outliers
np.random.seed(42)
data = np.concatenate([
    np.random.normal(50, 5, 45),  # Main data
    [20, 25, 75, 80, 85]  # Outliers
])
data = np.sort(data)

# Calculate statistics
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
median = np.median(data)
iqr = q3 - q1
data_range = np.max(data) - np.min(data)

fig, ax = plt.subplots(figsize=(12, 6))

# Box plot
bp = ax.boxplot([data], vert=False, widths=0.6, patch_artist=True,
                boxprops=dict(facecolor='lightblue', edgecolor='black', linewidth=2),
                whiskerprops=dict(color='black', linewidth=2),
                capprops=dict(color='black', linewidth=2),
                medianprops=dict(color='red', linewidth=3),
                flierprops=dict(marker='o', markerfacecolor='red', markersize=10, 
                               markeredgecolor='black', alpha=0.7))

# Add scatter of all points below box plot
y_jitter = np.random.normal(0.8, 0.05, len(data))
ax.scatter(data, y_jitter, alpha=0.5, s=30, color='steelblue', edgecolors='black', linewidths=0.5)

# Annotate range
ax.annotate('', xy=(np.max(data), 1.3), xytext=(np.min(data), 1.3),
           arrowprops=dict(arrowstyle='<->', lw=3, color='purple'))
ax.text((np.max(data) + np.min(data))/2, 1.4, 
       f'Range = {data_range:.1f}\n(Uses only 2 values:\nmin and max)',
       ha='center', fontsize=11, fontweight='bold',
       bbox=dict(boxstyle='round,pad=0.5', facecolor='lavender', alpha=0.8))

# Annotate IQR
ax.annotate('', xy=(q3, 0.3), xytext=(q1, 0.3),
           arrowprops=dict(arrowstyle='<->', lw=3, color='green'))
ax.text((q1 + q3)/2, 0.15, 
       f'IQR = {iqr:.1f}\n(Middle 50% of data)',
       ha='center', fontsize=11, fontweight='bold',
       bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgreen', alpha=0.8))

# Annotate quartiles
ax.axvline(q1, color='green', linestyle='--', linewidth=2, alpha=0.7)
ax.axvline(q3, color='green', linestyle='--', linewidth=2, alpha=0.7)
ax.axvline(median, color='red', linestyle='--', linewidth=2, alpha=0.7)

ax.text(q1, 1.6, f'Q1 = {q1:.1f}\n(25th percentile)', ha='center', fontsize=9,
       bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.9))
ax.text(median, 1.6, f'Median = {median:.1f}\n(50th percentile)', ha='center', fontsize=9,
       bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.9))
ax.text(q3, 1.6, f'Q3 = {q3:.1f}\n(75th percentile)', ha='center', fontsize=9,
       bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.9))

ax.set_ylim(0, 1.8)
ax.set_xlabel('Value', fontsize=12, fontweight='bold')
ax.set_title('Range vs. Interquartile Range (IQR)\nIQR is Robust to Outliers', 
            fontsize=13, fontweight='bold', pad=15)
ax.set_yticks([])
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

Figure 4.5: Range vs. Interquartile Range (IQR)

4.4.2 B. Variance and Standard Deviation

The variance and standard deviation are the most important and widely used measures of dispersion in statistics.

Definitions: Variance and Standard Deviation

Population Variance: \sigma^2 = \frac{\sum (X_i - \mu)^2}{N}

Sample Variance: s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}

Population Standard Deviation: \sigma = \sqrt{\sigma^2}

Sample Standard Deviation: s = \sqrt{s^2}

where:
- X_i = individual observations
- \mu or \bar{X} = mean (population or sample)
- N or n = size (population or sample)
- (X_i - \bar{X}) = deviation from mean
- (X_i - \bar{X})^2 = squared deviation

Key Characteristics:

Variance: Average of squared deviations from the mean
Standard Deviation: Square root of variance (same units as original data!)
Both use all observations in the dataset
Larger values indicate greater dispersion
SD is easier to interpret (same units as data)

Why Divide by (n-1) Instead of n?

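When calculating sample variance, we divide by n-1 (called degrees of freedom) instead of n for two related reasons:

Bessel’s Correction: Deviations are measured from the sample mean \bar{X}, which by construction sits closer to the sample values than the unknown population mean \mu does. The sum of squared deviations is therefore too small on average, and dividing by n-1 (a smaller divisor) inflates the variance slightly to correct this underestimation bias.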
Degrees of Freedom: We “lose” one degree of freedom because we first calculated \bar{X} from the same data. Once we know n-1 deviations and \bar{X}, the final deviation is predetermined.

Example: Choose 4 numbers that average 10. You can freely choose the first 3 (say: 5, 8, 12), but the 4th is forced to be 15 to make the average equal 10.

Result: Sample variance (s^2) with (n-1) is an unbiased estimator of population variance (\sigma^2).
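The unbiasedness claim can be checked on a tiny case. The sketch below assumes a hypothetical four-value population (chosen only for illustration) and enumerates every possible sample of size 2 drawn with replacement, then averages the two competing estimators over all samples.

```python
import itertools
import numpy as np

# Hypothetical toy population (illustrative values, not from the chapter)
population = np.array([2, 4, 6, 8])
sigma2 = np.var(population)  # population variance: divides by N

# Enumerate every possible sample of size n = 2 drawn with replacement
n = 2
samples = [np.array(s) for s in itertools.product(population, repeat=n)]

# Average each estimator over all 16 equally likely samples
mean_var_n = np.mean([np.var(s, ddof=0) for s in samples])          # divide by n
mean_var_n_minus_1 = np.mean([np.var(s, ddof=1) for s in samples])  # divide by n - 1

print(f"Population variance:            {sigma2:.2f}")
print(f"Average of divide-by-n:         {mean_var_n:.2f}")
print(f"Average of divide-by-(n - 1):   {mean_var_n_minus_1:.2f}")
```

In this enumeration the divide-by-(n-1) estimator averages out to the population variance, while the divide-by-n version comes out too small by the factor (n-1)/n.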

4.4.3 Example 3.7: Investment Risk Comparison (Variance and SD)

Boggs compares two mutual funds with identical 5-year average returns (11%):

Megabucks Fund: 12%, 10%, 13%, 9%, 11%
Dynamics Fund: 13%, 12%, 14%, 10%, 6%

Both have \bar{X} = 11\%, but which is riskier (more volatile)?

Solution:

Megabucks Fund:

Megabucks Variance Calculation
Return (X_i)	Deviation (X_i - \bar{X})	Squared Deviation
12%	1%	1
10%	-1%	1
13%	2%	4
9%	-2%	4
11%	0%	0
Sum		10

s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1} = \frac{10}{5-1} = \frac{10}{4} = 2.5

s = \sqrt{2.5} = 1.58\%

Dynamics Fund:

Dynamics Variance Calculation
Return (X_i)	Deviation (X_i - \bar{X})	Squared Deviation
13%	2%	4
12%	1%	1
14%	3%	9
10%	-1%	1
6%	-5%	25
Sum		40

s^2 = \frac{40}{4} = 10.0

s = \sqrt{10.0} = 3.16\%

Interpretation:
Dynamics Fund has 4 times the variance (10.0 vs. 2.5) and 2 times the standard deviation (3.16% vs. 1.58%) compared to Megabucks.

Investment Recommendation: For risk-averse investors, choose Megabucks—same average return (11%), but half the volatility. Dynamics is suitable only for investors willing to accept higher risk for potential upside.

Financial Principle: In finance, standard deviation is a standard proxy for risk. A lower SD means more predictable returns and, by this measure, a less risky investment.
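The hand calculations for both funds can be reproduced in a few lines. This is a minimal sketch using NumPy; the variable names are illustrative, and `ddof=1` is what selects the n-1 divisor of the sample formulas.

```python
import numpy as np

# Five-year returns (%) from Example 3.7
megabucks = np.array([12, 10, 13, 9, 11])
dynamics = np.array([13, 12, 14, 10, 6])

for name, returns in [("Megabucks", megabucks), ("Dynamics", dynamics)]:
    mean = returns.mean()
    s2 = returns.var(ddof=1)  # sample variance: divide by n - 1
    s = returns.std(ddof=1)   # sample standard deviation
    print(f"{name}: mean = {mean:.1f}%, s^2 = {s2:.2f}, s = {s:.2f}%")

# Ratio of the two standard deviations (Dynamics relative to Megabucks)
sd_ratio = dynamics.std(ddof=1) / megabucks.std(ddof=1)
print(f"Dynamics SD is {sd_ratio:.1f} times the Megabucks SD")
```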

Example 3.8: Stock Price Volatility (Boggs Financial Services)

Robert Boggs analyzes the volatility of a client’s stock portfolio over 7 trading days to assess investment risk.

Daily Closing Prices: $87, $120, $54, $92, $73, $80, $63

Calculate: Mean, variance, and standard deviation.

Solution:

Step 1: Calculate the mean

\bar{X} = \frac{87 + 120 + 54 + 92 + 73 + 80 + 63}{7} = \frac{569}{7} \approx 81.29

Step 2: Calculate deviations and squared deviations

Stock Price Variance Calculation (squared deviations computed from unrounded deviations)
Day	Price (X_i)	Deviation (X_i - \bar{X})	Squared Deviation (X_i - \bar{X})^2
1	$87	$5.71	32.65
2	$120	$38.71	1498.80
3	$54	-$27.29	744.51
4	$92	$10.71	114.80
5	$73	-$8.29	68.65
6	$80	-$1.29	1.65
7	$63	-$18.29	334.37
Sum		$0.00 ✓	2,795.43

Step 3: Calculate variance and standard deviation

s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1} = \frac{2795.43}{7-1} = \frac{2795.43}{6} = 465.91

s = \sqrt{465.91} = 21.58

Interpretation:
The stock has a mean price of $81.29 with a standard deviation of $21.58 (26.5% coefficient of variation). This indicates high volatility—the stock price fluctuates substantially from day to day.

Investment Implication: This level of volatility (±$21.58 or roughly 26%) is typical for growth stocks or small-cap stocks. Conservative investors seeking stable income should avoid this stock. Aggressive investors comfortable with risk might find it acceptable.

Code

import matplotlib.pyplot as plt
import numpy as np

# Stock price data
days = np.arange(1, 8)
prices = np.array([87, 120, 54, 92, 73, 80, 63])
mean_price = np.mean(prices)
std_price = np.std(prices, ddof=1)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel 1: Line plot with mean and ±1 SD bands
axes[0].plot(days, prices, 'o-', linewidth=3, markersize=10, color='steelblue', 
            label='Daily Closing Price', markeredgecolor='black', markeredgewidth=1.5)
axes[0].axhline(y=mean_price, color='red', linestyle='--', linewidth=2.5, 
               label=f'Mean = ${mean_price:.2f}')
axes[0].axhline(y=mean_price + std_price, color='orange', linestyle=':', linewidth=2,
               label=f'Mean + 1 SD = ${mean_price + std_price:.2f}')
axes[0].axhline(y=mean_price - std_price, color='orange', linestyle=':', linewidth=2,
               label=f'Mean - 1 SD = ${mean_price - std_price:.2f}')

# Shade ±1 SD region
axes[0].fill_between(days, mean_price - std_price, mean_price + std_price,
                     alpha=0.2, color='orange', label='±1 SD Region (68% for normal data)')

# Add vertical deviation lines
for day, price in zip(days, prices):
    axes[0].plot([day, day], [mean_price, price], 'k--', alpha=0.3, linewidth=1)

axes[0].set_title('Stock Price Volatility: Daily Prices vs. Mean ± SD', 
                 fontsize=12, fontweight='bold')
axes[0].set_xlabel('Trading Day', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Stock Price ($)', fontsize=11, fontweight='bold')
axes[0].legend(fontsize=9, loc='upper left')
axes[0].grid(True, alpha=0.3)
axes[0].set_xticks(days)

# Panel 2: Bar chart of squared deviations (variance components)
deviations = prices - mean_price
squared_devs = deviations ** 2

colors = ['red' if d > 0 else 'blue' for d in deviations]
bars = axes[1].bar(days, squared_devs, color=colors, alpha=0.7, edgecolor='black', linewidth=1.5)

# Add variance line: sum of squared deviations divided by (n - 1)
variance = np.var(prices, ddof=1)
axes[1].axhline(y=variance, color='green', linestyle='--', linewidth=2.5,
               label=f'Variance = {variance:.2f}\n(Sum of squared deviations / (n - 1))')

axes[1].set_title('Squared Deviations from Mean\n(Components of Variance)', 
                 fontsize=12, fontweight='bold')
axes[1].set_xlabel('Trading Day', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Squared Deviation $(X_i - \\bar{X})^2$', fontsize=11, fontweight='bold')
axes[1].legend(fontsize=10, loc='upper right')
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].set_xticks(days)

# Add value labels on bars
for bar, sq_dev, price in zip(bars, squared_devs, prices):
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height,
                f'${price}\n({sq_dev:.0f})',
                ha='center', va='bottom', fontweight='bold', fontsize=8)

plt.tight_layout()
plt.show()

Figure 4.6: Visualizing Variance: Squared Deviations from the Mean

Business Application: Using Standard Deviation

Standard deviation is the universal language of risk:

Finance: Portfolio volatility, Value-at-Risk (VaR), Sharpe ratio
Manufacturing: Process capability indices (Cp, Cpk), Six Sigma
Quality Control: Control charts (±3 SD limits)
Healthcare: Normal ranges for lab tests (mean ± 2 SD)
Marketing: Customer lifetime value variability

Rule of Thumb: For normally distributed data, approximately:
- 68% of observations fall within ±1 SD of the mean
- 95% fall within ±2 SD
- 99.7% fall within ±3 SD

We’ll formalize this as the Empirical Rule in Section 3.7.

4.5 3.5 Measures of Central Tendency and Dispersion for Grouped Data

When data are presented in frequency distributions (grouped into class intervals), we cannot compute exact statistics without the raw data. Instead, we use the class midpoint (M) to represent all values in that class.

Key Assumption: All observations in a class are assumed to be located at the class midpoint.

4.5.1 A. Mean for Grouped Data

Formula: Mean for Grouped Data

\bar{X}_g = \frac{\sum f \cdot M}{n} = \frac{\sum f \cdot M}{\sum f}

where:
- f = frequency (number of observations in each class)
- M = class midpoint
- n = \sum f = total number of observations
- \bar{X}_g = grouped data mean (approximation)

Computational Steps:

Calculate class midpoint: M = \frac{\text{Lower Limit} + \text{Upper Limit}}{2}
Multiply each midpoint by its frequency: f \cdot M
Sum all products: \sum f \cdot M
Divide by total frequency: \bar{X}_g = \frac{\sum f \cdot M}{n}

4.5.2 Example 3.9: Mean Passengers per Flight (P&P Airlines)

P&P Airlines analyzes passenger loads on 40 regional flights:

Grouped Data Mean Calculation for P&P Airlines
Passengers (Class Interval)	Frequency (f)	Midpoint (M)	f \cdot M
50 – 59	3	54.5	163.5
60 – 69	5	64.5	322.5
70 – 79	9	74.5	670.5
80 – 89	14	84.5	1,183.0
90 – 99	7	94.5	661.5
100 – 109	2	104.5	209.0
Total	40		3,210.0

Solution: \bar{X}_g = \frac{\sum f \cdot M}{n} = \frac{3210.0}{40} = 80.25 \text{ passengers}

Interpretation: The average passenger load per flight is approximately 80 passengers. Because every observation is treated as if it sat at its class midpoint, this is an estimate: it is usually close to the mean of the raw data, but the two need not match exactly.
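The four computational steps translate directly into array operations. The following is a minimal sketch with illustrative variable names, using the class limits and frequencies from the P&P Airlines table.

```python
import numpy as np

# Class limits and frequencies from Example 3.9
lower = np.array([50, 60, 70, 80, 90, 100])
upper = np.array([59, 69, 79, 89, 99, 109])
freq = np.array([3, 5, 9, 14, 7, 2])

midpoints = (lower + upper) / 2        # Step 1: class midpoints M
products = freq * midpoints            # Step 2: f * M for each class
n = freq.sum()                         # total number of flights
grouped_mean = products.sum() / n      # Steps 3 and 4: sum, then divide by n

print(f"n = {n}, sum of f*M = {products.sum():.1f}, grouped mean = {grouped_mean:.2f}")
```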

4.5.3 B. Median for Grouped Data

Formula: Median for Grouped Data

\text{Median} = L_{md} + \left[\frac{\frac{n}{2} - F}{f_{md}}\right] \cdot C

where:
- L_{md} = lower limit of the median class (class containing the median)
- n = total number of observations
- F = cumulative frequency before the median class
- f_{md} = frequency of the median class
- C = class width (interval size)

Steps to Find Grouped Data Median:

Calculate n/2 (position of median)
Find median class: First class where cumulative frequency \geq n/2
Identify L_{md}, F, f_{md}, and C
Apply formula

4.5.4 Example 3.10: Median Passengers per Flight (P&P Airlines)

Using the same P&P Airlines data:

Finding the Median Class
Passengers	Frequency (f)	Cumulative Frequency
50 – 59	3	3
60 – 69	5	8
70 – 79	9	17
80 – 89	14	31 ← Contains median
90 – 99	7	38
100 – 109	2	40

Solution:

n/2 = 40/2 = 20 (median position)
Median class: 80–89 (cumulative frequency 31 ≥ 20)
Parameters:
- L_{md} = 80 (lower limit)
- F = 17 (cumulative frequency before median class)
- f_{md} = 14 (frequency of median class)
- C = 10 (class width: 89 - 80 + 1 = 10)
Calculate: \text{Median} = 80 + \left[\frac{20 - 17}{14}\right] \cdot 10 = 80 + \left[\frac{3}{14}\right] \cdot 10 = 80 + 2.14 = 82.14

Interpretation: The median passenger load is 82.14 passengers. Half the flights carry fewer than 82 passengers, half carry more.

Comparison: Mean = 80.25, Median = 82.14. The median is slightly higher than the mean, suggesting a slight left skew: a few flights with unusually low passenger counts pull the mean down.
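The median steps can be sketched as a small function. The name `grouped_median` and its signature are hypothetical, and the function assumes equal class widths and the lower-limit convention used in the formula above.

```python
def grouped_median(lower_limits, frequencies, class_width):
    """Grouped-data median: L_md + ((n/2 - F) / f_md) * C."""
    n = sum(frequencies)
    if n <= 0:
        raise ValueError("frequencies must contain at least one observation")
    half = n / 2
    cumulative = 0  # F: cumulative frequency before the current class
    for lower, f in zip(lower_limits, frequencies):
        # The median class is the first class whose cumulative frequency reaches n/2
        if cumulative + f >= half:
            return lower + (half - cumulative) / f * class_width
        cumulative += f


# P&P Airlines data from Example 3.10
median = grouped_median([50, 60, 70, 80, 90, 100], [3, 5, 9, 14, 7, 2], 10)
print(f"Grouped median = {median:.2f} passengers")
```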

4.5.5 C. Mode for Grouped Data

Formula: Mode for Grouped Data

\text{Mode} = L_{mo} + \left[\frac{D_1}{D_1 + D_2}\right] \cdot C

where:
- L_{mo} = lower limit of the modal class (class with highest frequency)
- D_1 = difference between modal class frequency and preceding class frequency
- D_2 = difference between modal class frequency and following class frequency
- C = class width

4.5.6 Example 3.11: Mode for P&P Airlines

Modal class: 80–89 (frequency = 14, highest)

Parameters:
- L_{mo} = 80
- D_1 = 14 - 9 = 5 (modal frequency minus preceding frequency)
- D_2 = 14 - 7 = 7 (modal frequency minus following frequency)
- C = 10

Solution: \text{Mode} = 80 + \left[\frac{5}{5 + 7}\right] \cdot 10 = 80 + \left[\frac{5}{12}\right] \cdot 10 = 80 + 4.17 = 84.17

Interpretation: The most common passenger load is approximately 84 passengers.

Summary for P&P Airlines:
- Mean = 80.25 passengers
- Median = 82.14 passengers
- Mode = 84.17 passengers

These are reasonably close, indicating a fairly symmetric distribution with slight left skew.
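The grouped mode formula can be sketched the same way. The helper `grouped_mode` is hypothetical. Two choices in it are assumptions not stated in the text: a missing neighbouring class (when the modal class is first or last) is treated as having frequency 0, and a perfectly flat neighbourhood falls back to the class midpoint.

```python
def grouped_mode(lower_limits, frequencies, class_width):
    """Grouped-data mode: L_mo + (D1 / (D1 + D2)) * C."""
    # Index of the modal class (first class with the highest frequency)
    i = max(range(len(frequencies)), key=lambda k: frequencies[k])
    # Assumption: a missing neighbour counts as frequency 0
    f_prev = frequencies[i - 1] if i > 0 else 0
    f_next = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    d1 = frequencies[i] - f_prev
    d2 = frequencies[i] - f_next
    if d1 + d2 == 0:
        # Assumption: with no peak to locate, use the class midpoint
        return lower_limits[i] + class_width / 2
    return lower_limits[i] + d1 / (d1 + d2) * class_width


# P&P Airlines data from Example 3.11
mode = grouped_mode([50, 60, 70, 80, 90, 100], [3, 5, 9, 14, 7, 2], 10)
print(f"Grouped mode = {mode:.2f} passengers")
```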

Code

import matplotlib.pyplot as plt
import numpy as np

# P&P Airlines data
class_midpoints = np.array([54.5, 64.5, 74.5, 84.5, 94.5, 104.5])
frequencies = np.array([3, 5, 9, 14, 7, 2])
class_lower = np.array([50, 60, 70, 80, 90, 100])
class_upper = np.array([59, 69, 79, 89, 99, 109])

# Calculated statistics
mean_grouped = 80.25
median_grouped = 82.14
mode_grouped = 84.17

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel 1: Histogram with central tendency measures
width = 10
bars = axes[0].bar(class_midpoints, frequencies, width=width*0.8, 
                   color='steelblue', alpha=0.7, edgecolor='black', linewidth=1.5,
                   label='Frequency')

# Add mean, median, mode lines
axes[0].axvline(x=mean_grouped, color='red', linestyle='--', linewidth=2.5,
               label=f'Mean = {mean_grouped:.2f}')
axes[0].axvline(x=median_grouped, color='green', linestyle='-.', linewidth=2.5,
               label=f'Median = {median_grouped:.2f}')
axes[0].axvline(x=mode_grouped, color='purple', linestyle=':', linewidth=3,
               label=f'Mode = {mode_grouped:.2f}')

axes[0].set_title('P&P Airlines: Passenger Distribution\n(Grouped Data with Mean, Median, Mode)', 
                 fontsize=12, fontweight='bold')
axes[0].set_xlabel('Number of Passengers', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Frequency (Number of Flights)', fontsize=11, fontweight='bold')
axes[0].legend(fontsize=10, loc='upper right')
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].set_xlim(45, 115)

# Add frequency labels on bars
for bar, freq in zip(bars, frequencies):
    height = bar.get_height()
    axes[0].text(bar.get_x() + bar.get_width()/2., height,
                f'{int(freq)}',
                ha='center', va='bottom', fontweight='bold', fontsize=10)

# Panel 2: Cumulative frequency plot (ogive)
cumulative_freq = np.cumsum(frequencies)
total_n = cumulative_freq[-1]

# Plot cumulative frequency curve
axes[1].plot(class_upper, cumulative_freq, 'o-', linewidth=3, markersize=10,
            color='darkblue', markeredgecolor='black', markeredgewidth=1.5,
            label='Cumulative Frequency')

# Add median reference
axes[1].axhline(y=total_n/2, color='green', linestyle='--', linewidth=2,
               label=f'n/2 = {total_n/2:.0f} (Median Position)')
axes[1].axvline(x=median_grouped, color='green', linestyle='-.', linewidth=2,
               alpha=0.7)

# Highlight median finding
axes[1].plot([50, median_grouped], [total_n/2, total_n/2], 'g:', linewidth=2)
axes[1].plot([median_grouped, median_grouped], [0, total_n/2], 'g:', linewidth=2)
axes[1].plot(median_grouped, total_n/2, 'go', markersize=12, 
            markeredgecolor='black', markeredgewidth=2)

axes[1].annotate(f'Median ≈ {median_grouped:.1f}', 
                xy=(median_grouped, total_n/2), 
                xytext=(median_grouped + 8, total_n/2 - 5),
                fontsize=11, fontweight='bold',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgreen', alpha=0.8),
                arrowprops=dict(arrowstyle='->', lw=2, color='green'))

axes[1].set_title('Cumulative Frequency Distribution (Ogive)\nFinding the Median Graphically', 
                 fontsize=12, fontweight='bold')
axes[1].set_xlabel('Number of Passengers (Upper Class Limit)', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Cumulative Frequency', fontsize=11, fontweight='bold')
axes[1].legend(fontsize=10, loc='upper left')
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(45, 115)
axes[1].set_ylim(0, 45)

plt.tight_layout()
plt.show()

Figure 4.7: P&P Airlines Passenger Distribution with Mean, Median, and Mode

4.5.7 D. Variance and Standard Deviation for Grouped Data

Formula: Variance for Grouped Data

s^2 = \frac{\sum f \cdot M^2 - n \cdot \bar{X}_g^2}{n - 1}

where:
- f = frequency
- M = class midpoint
- n = \sum f = total observations
- \bar{X}_g = grouped data mean

Standard Deviation: s = \sqrt{s^2}

4.5.8 Example 3.12: Standard Deviation for P&P Airlines

Using the same passenger data:

Grouped Data Variance Calculation
Passengers	f	M	f \cdot M	M^2	f \cdot M^2
50 – 59	3	54.5	163.5	2,970.25	8,910.75
60 – 69	5	64.5	322.5	4,160.25	20,801.25
70 – 79	9	74.5	670.5	5,550.25	49,952.25
80 – 89	14	84.5	1,183.0	7,140.25	99,963.50
90 – 99	7	94.5	661.5	8,930.25	62,511.75
100 – 109	2	104.5	209.0	10,920.25	21,840.50
Total	40		3,210.0		263,980.00

Solution:

s^2 = \frac{\sum f \cdot M^2 - n \cdot \bar{X}_g^2}{n - 1} = \frac{263,980 - 40(80.25)^2}{39}

= \frac{263,980 - 40(6,440.0625)}{39} = \frac{263,980 - 257,602.50}{39} = \frac{6,377.50}{39} = 163.53

s = \sqrt{163.53} = 12.79 \text{ passengers}

Interpretation:
The standard deviation is 12.79 passengers, indicating moderate variability in passenger loads. If loads are roughly bell-shaped, about two-thirds of flights carry between 67 and 93 passengers (mean ± 1 SD: 80.25 ± 12.79).
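As a cross-check, the shortcut formula can be compared with the definitional form \sum f (M - \bar{X}_g)^2 / (n - 1). Both should give the same result. The sketch below uses illustrative variable names and the midpoints and frequencies from the table.

```python
import numpy as np

# Midpoints and frequencies from Example 3.12
midpoints = np.array([54.5, 64.5, 74.5, 84.5, 94.5, 104.5])
freq = np.array([3, 5, 9, 14, 7, 2])

n = freq.sum()
mean_g = (freq * midpoints).sum() / n

# Shortcut form used in the text
var_shortcut = ((freq * midpoints**2).sum() - n * mean_g**2) / (n - 1)
# Definitional form: frequency-weighted squared deviations from the grouped mean
var_definition = (freq * (midpoints - mean_g) ** 2).sum() / (n - 1)

sd = np.sqrt(var_shortcut)
print(f"s^2 (shortcut) = {var_shortcut:.2f}, s^2 (definition) = {var_definition:.2f}")
print(f"s = {sd:.2f} passengers")
```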

4.6 3.6 Percentiles, Quartiles, and Deciles

Percentiles divide a dataset into 100 equal parts. The Pth percentile is the value below which P percent of the observations fall.

Special Cases:
- Quartiles: Divide data into 4 parts (Q1 = 25th, Q2 = 50th = median, Q3 = 75th)
- Deciles: Divide data into 10 parts (D1 = 10th, D2 = 20th, …, D9 = 90th)
- Median: 50th percentile (P₅₀ = Q2 = D5)

Formula: Percentile Position (Ungrouped Data)

L_P = (n + 1) \times \frac{P}{100}

where:
- L_P = position (location) of the Pth percentile
- n = number of observations
- P = percentile (1 to 99)

If L_P is not an integer, interpolate:
Value = Value at floor(L_P) + fractional part × (next value - value at floor)

4.6.1 Example 3.13: Executive Salaries (Percentiles)

A consulting firm surveys 12 executives’ annual salaries (in thousands):

Data (sorted): 42, 47, 51, 54, 58, 60, 62, 65, 68, 72, 78, 85

Find: 25th percentile (Q1), 50th percentile (Median), and 75th percentile (Q3).

Solution:

Q1 (25th percentile): L_{25} = (12 + 1) \times \frac{25}{100} = 13 \times 0.25 = 3.25

Position 3.25 means: 25% of way between 3rd value (51) and 4th value (54).
Q_1 = 51 + 0.25(54 - 51) = 51 + 0.75 = 51.75 \text{ thousand}

Q2 (50th percentile, Median): L_{50} = 13 \times 0.50 = 6.5

\text{Median} = 60 + 0.5(62 - 60) = 60 + 1 = 61 \text{ thousand}

Q3 (75th percentile): L_{75} = 13 \times 0.75 = 9.75

Q_3 = 68 + 0.75(72 - 68) = 68 + 3 = 71 \text{ thousand}

Interpretation:
- 25% of executives earn less than $51,750
- 50% of executives earn less than $61,000 (median)
- 75% of executives earn less than $71,000

Interquartile Range (IQR): IQR = Q_3 - Q_1 = 71 - 51.75 = 19.25 \text{ thousand}

The middle 50% of executive salaries span a range of $19,250.
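The position-and-interpolate procedure can be written out as a short function. The name `percentile_n_plus_1` is hypothetical, and clamping the position to the range 1 to n for very small or very large P is an assumption added here, not part of the formula in the text.

```python
import math


def percentile_n_plus_1(sorted_values, p):
    """Percentile by the (n + 1) * P / 100 position rule with linear interpolation."""
    n = len(sorted_values)
    position = (n + 1) * p / 100            # 1-based location L_P
    position = min(max(position, 1), n)     # assumption: clamp to the data range
    lower_index = math.floor(position)      # 1-based index of the value at floor(L_P)
    fraction = position - lower_index
    if lower_index >= n:
        return sorted_values[-1]
    lower_value = sorted_values[lower_index - 1]
    upper_value = sorted_values[lower_index]
    return lower_value + fraction * (upper_value - lower_value)


# Executive salaries (in thousands) from Example 3.13, already sorted
salaries = [42, 47, 51, 54, 58, 60, 62, 65, 68, 72, 78, 85]
q1 = percentile_n_plus_1(salaries, 25)
median = percentile_n_plus_1(salaries, 50)
q3 = percentile_n_plus_1(salaries, 75)
print(f"Q1 = {q1}, Median = {median}, Q3 = {q3}, IQR = {q3 - q1}")
```

Statistical software often uses a different interpolation rule by default, so quartiles reported by a library may differ slightly from these hand-calculated values.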

Code

import matplotlib.pyplot as plt
import numpy as np

# Executive salary data (in thousands)
salaries = np.array([42, 47, 51, 54, 58, 60, 62, 65, 68, 72, 78, 85])

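# Calculate percentiles with the textbook (n + 1) * P/100 position rule.
# NumPy calls this method 'weibull' (requires NumPy >= 1.22); the default
# 'linear' method would give Q1 = 53.25 and Q3 = 69.00 instead of 51.75 and 71.00.
# Note: Matplotlib's boxplot below computes its own quartiles with the default
# method, so the box edges can differ slightly from the dashed quartile lines.
q1 = np.percentile(salaries, 25, method='weibull')
median = np.percentile(salaries, 50, method='weibull')
q3 = np.percentile(salaries, 75, method='weibull')
iqr = q3 - q1
p10 = np.percentile(salaries, 10, method='weibull')
p90 = np.percentile(salaries, 90, method='weibull')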

fig, axes = plt.subplots(2, 1, figsize=(12, 10))

# Panel 1: Box plot with annotations
bp = axes[0].boxplot([salaries], vert=False, widths=0.5, patch_artist=True,
                     boxprops=dict(facecolor='lightblue', edgecolor='black', linewidth=2),
                     whiskerprops=dict(color='black', linewidth=2),
                     capprops=dict(color='black', linewidth=2),
                     medianprops=dict(color='red', linewidth=3),
                     flierprops=dict(marker='o', markerfacecolor='red', markersize=10))

# Add individual points below box plot
y_jitter = np.random.normal(0.8, 0.03, len(salaries))
axes[0].scatter(salaries, y_jitter, alpha=0.6, s=80, color='darkblue', 
               edgecolors='black', linewidths=1, zorder=3)

# Annotate quartiles
axes[0].axvline(q1, color='green', linestyle='--', linewidth=2, alpha=0.7)
axes[0].axvline(median, color='red', linestyle='--', linewidth=2, alpha=0.7)
axes[0].axvline(q3, color='green', linestyle='--', linewidth=2, alpha=0.7)

axes[0].text(q1, 1.35, f'Q1 = ${q1:.2f}K\n(25th percentile)', ha='center', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.4', facecolor='lightgreen', alpha=0.8))
axes[0].text(median, 1.35, f'Median = ${median:.2f}K\n(50th percentile)', ha='center', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.4', facecolor='pink', alpha=0.8))
axes[0].text(q3, 1.35, f'Q3 = ${q3:.2f}K\n(75th percentile)', ha='center', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.4', facecolor='lightgreen', alpha=0.8))

# Annotate IQR
axes[0].annotate('', xy=(q3, 0.3), xytext=(q1, 0.3),
                arrowprops=dict(arrowstyle='<->', lw=2.5, color='purple'))
axes[0].text((q1 + q3)/2, 0.15, f'IQR = ${iqr:.2f}K\n(Middle 50% of data)',
            ha='center', fontsize=11, fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.4', facecolor='lavender', alpha=0.9))

axes[0].set_ylim(0, 1.6)
axes[0].set_xlabel('Annual Salary ($ thousands)', fontsize=12, fontweight='bold')
axes[0].set_title('Executive Salaries: Box Plot Showing Quartiles and IQR', 
                 fontsize=13, fontweight='bold', pad=15)
axes[0].set_yticks([])
axes[0].grid(True, alpha=0.3, axis='x')

# Panel 2: Percentile reference chart
percentiles = [10, 25, 50, 75, 90]
percentile_values = [p10, q1, median, q3, p90]  # reuse the values computed above so both panels agree
percentile_labels = ['P10\n(10th)', 'Q1\n(25th)', 'Median\n(50th)', 'Q3\n(75th)', 'P90\n(90th)']

colors_bars = ['orange', 'green', 'red', 'green', 'orange']
bars = axes[1].bar(percentile_labels, percentile_values, color=colors_bars, 
                   alpha=0.7, edgecolor='black', linewidth=2)

# Add value labels
for bar, val, pct in zip(bars, percentile_values, percentiles):
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height,
                f'${val:.1f}K\n({pct}%)',
                ha='center', va='bottom', fontweight='bold', fontsize=11)

axes[1].set_title('Key Percentiles for Executive Salary Distribution', 
                 fontsize=13, fontweight='bold', pad=15)
axes[1].set_ylabel('Annual Salary ($ thousands)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Percentile', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].set_ylim(0, 95)

# Add interpretation text
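# Build the interpretation from the computed percentiles so the text matches the bars
interpretation = (
    "Interpretation:\n"
    f"• 10% of executives earn less than ${p10:.1f}K (P10)\n"
    f"• 25% earn less than ${q1:.1f}K (Q1)\n"
    f"• 50% earn less than ${median:.1f}K (Median)\n"
    f"• 75% earn less than ${q3:.1f}K (Q3)\n"
    f"• 90% earn less than ${p90:.1f}K (P90)"
)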
axes[1].text(0.98, 0.97, interpretation, transform=axes[1].transAxes,
            fontsize=10, verticalalignment='top', horizontalalignment='right',
            bbox=dict(boxstyle='round,pad=0.6', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()

Figure 4.8: Executive Salaries: Box Plot Showing Quartiles and Percentiles

Business Applications of Percentiles

Percentiles are essential for:

Salary Benchmarking: “Our median salary (P50) is at the 75th percentile of industry standards”
Performance Metrics: “Top 10% of sales reps (P90) exceed $2M in annual revenue”
Quality Control: “Reject parts above P95 or below P5 in dimension distribution”
Financial Analysis: “Value-at-Risk (VaR) uses P5 to estimate worst 5% scenarios”
Student Assessment: “Your test score is at the 85th percentile (better than 85% of students)”
Healthcare: “Blood pressure above P90 for age group indicates hypertension risk”

Percentiles provide context that means and standard deviations cannot—they show relative position in a distribution.

4.7 3.7 Uses of the Standard Deviation

Standard deviation is far more than just a dispersion measure—it provides powerful insights into data distribution patterns and enables probabilistic statements about where observations are likely to fall.

4.7.1 A. Chebyshev’s Theorem

Chebyshev’s Theorem provides a conservative lower bound on the proportion of observations within K standard deviations of the mean, for any distribution (no assumptions about shape).

Chebyshev’s Theorem

For any distribution (symmetric, skewed, bimodal, etc.), at least this proportion of observations lies within K standard deviations of the mean:

\text{Proportion} \geq 1 - \frac{1}{K^2}

where K > 1 is the number of standard deviations from the mean.

Common values:

K	Interval	Minimum Proportion	Percentage
2	\mu \pm 2\sigma	1 - 1/4 = 3/4	≥ 75%
3	\mu \pm 3\sigma	1 - 1/9 = 8/9	≥ 88.9%
4	\mu \pm 4\sigma	1 - 1/16 = 15/16	≥ 93.75%

Key Point: This is a conservative estimate. Most real distributions have MORE observations within K standard deviations than Chebyshev predicts.

Critical Distinction:
- Chebyshev’s Theorem: Works for ANY distribution (very general, but conservative)
- Empirical Rule: Works ONLY for normal distributions (specific, but much more precise)

4.7.2 Example 3.14: P&P Airlines Passenger Load Distribution (Chebyshev)

P&P Airlines has:
- Mean passenger load: \mu = 80 passengers
- Standard deviation: \sigma = 12 passengers

No assumption is made about the shape of the distribution.

Question: Using Chebyshev’s Theorem, what can we say about the proportion of flights within ±2 standard deviations of the mean?

Solution:

For K = 2: \text{Proportion} \geq 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 0.75 = 75\%

Interval: \mu \pm 2\sigma = 80 \pm 2(12) = 80 \pm 24 = [56, 104] \text{ passengers}

Interpretation:
At least 75% of P&P flights carry between 56 and 104 passengers, regardless of the distribution shape. This is a guaranteed minimum—the actual proportion could be much higher (and usually is).

For K = 3: \text{Proportion} \geq 1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 0.889 = 88.9\%

\mu \pm 3\sigma = 80 \pm 3(12) = 80 \pm 36 = [44, 116] \text{ passengers}

At least 88.9% of flights carry between 44 and 116 passengers.
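The bound and the interval can be computed for any K > 1. The following is a minimal sketch; the function names `chebyshev_bound` and `chebyshev_interval` are hypothetical helpers introduced for illustration.

```python
def chebyshev_bound(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2


def chebyshev_interval(mean, sd, k):
    """Interval mean ± k * sd to which the bound applies."""
    return (mean - k * sd, mean + k * sd)


# P&P Airlines values from Example 3.14
mu, sigma = 80, 12
for k in (2, 3, 4):
    low, high = chebyshev_interval(mu, sigma, k)
    print(f"K = {k}: at least {chebyshev_bound(k):.2%} of flights carry {low} to {high} passengers")
```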

When to Use Chebyshev’s Theorem

Use Chebyshev when:
- Distribution shape is unknown or non-normal
- Data are heavily skewed
- You need a conservative, guaranteed lower bound
- Working with small samples where normality is uncertain

Don’t use Chebyshev when:
- You know data are normally distributed (use the Empirical Rule instead; it’s much more precise)

4.7.3 B. The Empirical Rule (Normal Distribution)

When data follow a normal distribution (bell-shaped, symmetric), we can make much more precise statements using the Empirical Rule.

The Empirical Rule (68-95-99.7 Rule)

For data that are approximately normally distributed:

68.3% of observations fall within ±1 SD of the mean: \mu \pm \sigma
95.5% fall within ±2 SD: \mu \pm 2\sigma
99.7% fall within ±3 SD: \mu \pm 3\sigma

Implication: Nearly all observations (99.7%) lie within 3 standard deviations of the mean.

Visual representation: The normal curve is bell-shaped and symmetric around the mean, with areas under the curve representing these percentages. The Python visualization below demonstrates this distribution.

Comparison: Chebyshev vs. Empirical Rule

Chebyshev’s Theorem vs. Empirical Rule
Interval	Chebyshev (ANY distribution)	Empirical Rule (NORMAL only)
\mu \pm 1\sigma	≥ 0% (not useful)	68.3%
\mu \pm 2\sigma	≥ 75%	95.5%
\mu \pm 3\sigma	≥ 88.9%	99.7%

The Empirical Rule is much more precise, but requires normality.

4.7.4 Example 3.15: Ski Resort Wait Times (Empirical Rule)

A ski resort tracks wait times for the main chairlift on Saturdays. Historical data show wait times are normally distributed with:

Mean: \mu = 10 minutes
Standard deviation: \sigma = 2 minutes

On a typical Saturday, 1,000 skiers use the lift.

Question: How many skiers wait:
1. Between 8 and 12 minutes? (\mu \pm 1\sigma)
2. Between 6 and 14 minutes? (\mu \pm 2\sigma)
3. More than 16 minutes? (beyond \mu + 3\sigma)

Solution:

1. Within ±1 SD (8 to 12 minutes): \mu \pm \sigma = 10 \pm 2 = [8, 12] \text{ minutes}

By the Empirical Rule, 68.3% of skiers wait 8–12 minutes.
Number of skiers: 1000 \times 0.683 = 683 skiers

2. Within ±2 SD (6 to 14 minutes): \mu \pm 2\sigma = 10 \pm 4 = [6, 14] \text{ minutes}

95.5% of skiers wait 6–14 minutes.
Number of skiers: 1000 \times 0.955 = 955 skiers

3. More than 16 minutes (beyond \mu + 3\sigma): \mu + 3\sigma = 10 + 3(2) = 16 \text{ minutes}

The Empirical Rule says 99.7% fall within ±3 SD, so 0.3% fall outside.
Of that 0.3%, half are above (right tail) and half below (left tail).

Proportion above 16 minutes: 0.003 / 2 = 0.0015 = 0.15\%
Number of skiers: 1000 \times 0.0015 = 1.5 \approx 1 or 2 skiers

Interpretation:
- About 683 skiers (68.3%) wait 8–12 minutes (typical experience)
- About 955 skiers (95.5%) wait 6–14 minutes (almost everyone)
- Only 1–2 skiers wait more than 16 minutes (extremely rare)

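Business Decision: The resort can confidently advertise “Average wait time: 10 minutes, with more than 95% of skiers waiting less than 14 minutes.” Under the Empirical Rule, about 95.5% of skiers wait 6–14 minutes and roughly another 2.25% wait less than 6 minutes, so close to 98% wait under 14 minutes.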
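The Empirical Rule percentages are rounded areas under the normal curve. They can be recomputed from the error function in Python's standard library, since P(|Z| ≤ k) = erf(k/√2) for a standard normal variable. The sketch below uses illustrative names and the ski resort figures from Example 3.15.

```python
import math


def proportion_within(k):
    """P(|Z| <= k) for a standard normal variable, via the error function."""
    return math.erf(k / math.sqrt(2))


# Ski resort parameters from Example 3.15
mu, sigma, skiers = 10, 2, 1000

for k in (1, 2, 3):
    p = proportion_within(k)
    low, high = mu - k * sigma, mu + k * sigma
    print(f"±{k} SD ({low}-{high} min): {p:.2%}, about {skiers * p:.0f} skiers")

# Upper tail beyond mu + 3 sigma: half of what lies outside ±3 SD
tail_above_3sd = (1 - proportion_within(3)) / 2
print(f"More than {mu + 3 * sigma} min: {tail_above_3sd:.3%}, about {skiers * tail_above_3sd:.2f} skiers")
```

The unrounded areas are slightly different from the rule's round numbers, but they lead to the same conclusion: only one or two skiers in a thousand wait longer than 16 minutes.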

Code

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Parameters
mu = 10
sigma = 2
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
y = stats.norm.pdf(x, mu, sigma)

fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Panel 1: Empirical Rule regions
axes[0].plot(x, y, 'k-', linewidth=2.5, label='Normal Distribution')

# Fill regions
x_1sd = x[(x >= mu - sigma) & (x <= mu + sigma)]
y_1sd = stats.norm.pdf(x_1sd, mu, sigma)
axes[0].fill_between(x_1sd, y_1sd, alpha=0.3, color='green', 
                     label='±1 SD: 68.3%')

x_2sd = x[(x >= mu - 2*sigma) & (x <= mu + 2*sigma)]
y_2sd = stats.norm.pdf(x_2sd, mu, sigma)
axes[0].fill_between(x_2sd, y_2sd, alpha=0.2, color='blue',
                     label='±2 SD: 95.5%')

x_3sd = x[(x >= mu - 3*sigma) & (x <= mu + 3*sigma)]
y_3sd = stats.norm.pdf(x_3sd, mu, sigma)
axes[0].fill_between(x_3sd, y_3sd, alpha=0.1, color='purple',
                     label='±3 SD: 99.7%')

# Add mean line
axes[0].axvline(mu, color='red', linestyle='--', linewidth=2, label=f'Mean = {mu}')

# Add SD markers
for k in range(1, 4):
    axes[0].axvline(mu - k*sigma, color='gray', linestyle=':', linewidth=1.5, alpha=0.7)
    axes[0].axvline(mu + k*sigma, color='gray', linestyle=':', linewidth=1.5, alpha=0.7)

# Annotations
axes[0].text(mu, max(y)*1.05, f'μ = {mu}', ha='center', fontsize=11, fontweight='bold',
            bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7))

axes[0].text(mu, max(y)*0.5, '68.3%', ha='center', fontsize=13, fontweight='bold', color='green')
axes[0].text(mu - 1.5*sigma, max(y)*0.15, '13.6%', ha='center', fontsize=10, fontweight='bold')
axes[0].text(mu + 1.5*sigma, max(y)*0.15, '13.6%', ha='center', fontsize=10, fontweight='bold')
axes[0].text(mu - 2.5*sigma, max(y)*0.05, '2.1%', ha='center', fontsize=9)
axes[0].text(mu + 2.5*sigma, max(y)*0.05, '2.1%', ha='center', fontsize=9)

axes[0].set_title('The Empirical Rule: Distribution of Values in a Normal Distribution', 
                 fontsize=13, fontweight='bold', pad=15)
axes[0].set_xlabel('Value (in standard deviation units from mean)', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Probability Density', fontsize=11, fontweight='bold')
axes[0].legend(fontsize=10, loc='upper left')
axes[0].grid(True, alpha=0.3)
axes[0].set_xticks([mu - 3*sigma, mu - 2*sigma, mu - sigma, mu, mu + sigma, mu + 2*sigma, mu + 3*sigma])
axes[0].set_xticklabels(['μ-3σ\n(4)', 'μ-2σ\n(6)', 'μ-σ\n(8)', 'μ\n(10)', 'μ+σ\n(12)', 'μ+2σ\n(14)', 'μ+3σ\n(16)'])

# Panel 2: Ski resort wait times application
x_ski = np.linspace(2, 18, 1000)
y_ski = stats.norm.pdf(x_ski, 10, 2)

axes[1].plot(x_ski, y_ski, 'b-', linewidth=3, label='Wait Time Distribution')
axes[1].fill_between(x_ski, y_ski, alpha=0.2, color='lightblue')

# Highlight regions
x_normal = x_ski[(x_ski >= 8) & (x_ski <= 12)]
y_normal = stats.norm.pdf(x_normal, 10, 2)
axes[1].fill_between(x_normal, y_normal, alpha=0.5, color='green',
                    label='8-12 min (68.3%): 683 skiers')

x_acceptable = x_ski[(x_ski >= 6) & (x_ski <= 14)]
y_acceptable = stats.norm.pdf(x_acceptable, 10, 2)
axes[1].fill_between(x_acceptable, y_acceptable, alpha=0.3, color='yellow',
                    label='6-14 min (95.5%): 955 skiers')

# Mark extremes
x_extreme = x_ski[x_ski > 16]
y_extreme = stats.norm.pdf(x_extreme, 10, 2)
axes[1].fill_between(x_extreme, y_extreme, alpha=0.7, color='red',
                    label='>16 min (0.15%): ~1-2 skiers')

axes[1].axvline(10, color='red', linestyle='--', linewidth=2.5, label='Mean = 10 min')

axes[1].set_title('Ski Resort Wait Times: Empirical Rule Application\n(1000 skiers on typical Saturday)', 
                 fontsize=13, fontweight='bold', pad=15)
axes[1].set_xlabel('Wait Time (minutes)', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Probability Density', fontsize=11, fontweight='bold')
axes[1].legend(fontsize=10, loc='upper right')
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(2, 18)

# Add interpretation box
interpretation = (
    "Business Insight:\n"
    "• Most skiers (68%) wait 8-12 min\n"
    "• Nearly all (95%) wait 6-14 min\n"
    "• Extreme waits (>16 min) are rare (<1%)\n"
    "• Can advertise: '10 min average,\n  about 97.5% wait under 14 min'"
)
axes[1].text(0.02, 0.97, interpretation, transform=axes[1].transAxes,
            fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round,pad=0.6', facecolor='lightyellow', alpha=0.9))

plt.tight_layout()
plt.show()

Figure 4.9: The Empirical Rule (68-95-99.7 Rule) for Normal Distributions

When Does the Empirical Rule Apply?

Requirements:
1. Data must be approximately normally distributed (bell-shaped, symmetric)
2. No extreme outliers
3. Sufficient sample size (n ≥ 30 is a rough guideline)

How to check:
- Create a histogram: does it look bell-shaped?
- Calculate skewness (Section 3.7.C): is it close to 0?
- Use a normal probability plot (Q-Q plot): do points follow a straight line?

If data are NOT normal: Use Chebyshev’s Theorem instead (conservative but always valid).
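
The checks above can also be done numerically. The following is a minimal sketch, assuming a simulated sample: the `wait_times` array is hypothetical (drawn with mean 10 and standard deviation 2 to mirror the ski-resort figure), and the helper names are illustrative. It compares the observed share of values within 1, 2, and 3 standard deviations against the Empirical Rule and against Chebyshev's lower bound.

```python
import numpy as np

def coverage_within_k_sd(data, k):
    """Share of observations within k sample standard deviations of the mean."""
    data = np.asarray(data, dtype=float)
    mean = data.mean()
    sd = data.std(ddof=1)  # sample standard deviation (divides by n - 1)
    return float(np.mean(np.abs(data - mean) <= k * sd))

def chebyshev_bound(k):
    """Chebyshev's guaranteed minimum share within k SDs (informative for k > 1)."""
    return 1 - 1 / k**2

# Hypothetical data: simulated wait times, NOT real observations
rng = np.random.default_rng(42)
wait_times = rng.normal(loc=10, scale=2, size=1000)

empirical_rule = {1: 0.683, 2: 0.954, 3: 0.997}
for k, expected in empirical_rule.items():
    observed = coverage_within_k_sd(wait_times, k)
    print(f"k={k}: observed {observed:.3f}, "
          f"Empirical Rule {expected:.3f}, "
          f"Chebyshev at least {chebyshev_bound(k):.3f}")
```

If the observed shares sit close to 68%, 95%, and 99.7%, the Empirical Rule is a reasonable description of the data. If they differ noticeably, only the Chebyshev bounds can be relied on.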

4.7.5 C. Skewness (Asymmetry)

Skewness measures the asymmetry of a distribution. Symmetric distributions (like the normal distribution) have skewness near zero. Asymmetric distributions are skewed left or skewed right.

Pearson’s Coefficient of Skewness

P = \frac{3(\bar{X} - \text{Median})}{s}

Interpretation:
- P \approx 0: Symmetric distribution (mean ≈ median)
- P > 0: Right-skewed (positive skew, long right tail, mean > median)
- P < 0: Left-skewed (negative skew, long left tail, mean < median)

Rule of Thumb:
- |P| < 0.5: Approximately symmetric
- 0.5 \leq |P| < 1.0: Moderately skewed
- |P| \geq 1.0: Highly skewed
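
The formula and thresholds above translate directly into code. This is a minimal sketch, assuming raw observations are available: the `delivery_days` values are hypothetical, and the function names `pearson_skewness` and `describe_skew` are illustrative rather than part of any library.

```python
import numpy as np

def pearson_skewness(data):
    """Pearson's coefficient P = 3 * (mean - median) / s, using the sample SD."""
    data = np.asarray(data, dtype=float)
    s = data.std(ddof=1)  # sample standard deviation (divides by n - 1)
    return 3 * (data.mean() - np.median(data)) / s

def describe_skew(p):
    """Apply the rule-of-thumb thresholds to a skewness coefficient."""
    size = abs(p)
    if size < 0.5:
        return "approximately symmetric"
    direction = "right" if p > 0 else "left"
    degree = "moderately" if size < 1.0 else "highly"
    return f"{degree} {direction}-skewed"

# Hypothetical delivery times in days (illustrative values only)
delivery_days = [2, 3, 3, 4, 4, 4, 5, 5, 6, 14]
p = pearson_skewness(delivery_days)
print(f"P = {p:.2f} -> {describe_skew(p)}")
```

Here the single 14-day delivery pulls the mean (5.0) above the median (4.0), so P comes out positive.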

Visual Guide:

Code

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Generate x values
x = np.linspace(0, 10, 1000)

# Panel 1: Left-skewed (Negative skew)
# Use beta distribution with parameters that create left skew
left_skew = stats.beta(5, 2)
y_left = left_skew.pdf(x / 10)

axes[0].plot(x, y_left, 'b-', linewidth=3)
axes[0].fill_between(x, y_left, alpha=0.4, color='blue')

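# Mark mean, median, mode for left-skewed
# (derived from Beta(5, 2) and rescaled to the 0-10 axis, not hand-picked)
mode_left = 10 * (5 - 1) / (5 + 2 - 2)  # Beta mode = (a-1)/(a+b-2) -> 8.0
median_left = 10 * left_skew.median()
mean_left = 10 * left_skew.mean()       # Beta mean = a/(a+b) -> about 7.14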

axes[0].axvline(mode_left, color='red', linestyle='--', linewidth=2, label='Mode')
axes[0].axvline(median_left, color='orange', linestyle='--', linewidth=2, label='Median')
axes[0].axvline(mean_left, color='green', linestyle='--', linewidth=2, label='Mean')

axes[0].set_title('LEFT-SKEWED (Negative)\nMean < Median < Mode', 
                 fontsize=12, fontweight='bold')
axes[0].set_xlabel('Values', fontsize=10, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=10, fontweight='bold')
axes[0].legend(loc='upper left', fontsize=9)
axes[0].grid(True, alpha=0.3)
axes[0].text(0.5, 0.95, '← Long tail\nto the LEFT',
            transform=axes[0].transAxes, ha='center', va='top', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.3', facecolor='lightblue', alpha=0.8))

# Panel 2: Symmetric (Normal distribution)
symmetric = stats.norm(5, 1)
y_symmetric = symmetric.pdf(x)

axes[1].plot(x, y_symmetric, 'g-', linewidth=3)
axes[1].fill_between(x, y_symmetric, alpha=0.4, color='green')

# For symmetric, all three are the same
center = 5.0
axes[1].axvline(center, color='purple', linestyle='--', linewidth=3, 
               label='Mean = Median = Mode')

axes[1].set_title('SYMMETRIC\nMean = Median = Mode', 
                 fontsize=12, fontweight='bold')
axes[1].set_xlabel('Values', fontsize=10, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=10, fontweight='bold')
axes[1].legend(loc='upper right', fontsize=9)
axes[1].grid(True, alpha=0.3)
axes[1].text(0.5, 0.95, 'Balanced\n(P ≈ 0)',
            transform=axes[1].transAxes, ha='center', va='top', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.3', facecolor='lightgreen', alpha=0.8))

# Panel 3: Right-skewed (Positive skew)
# Use beta distribution with parameters that create right skew
right_skew = stats.beta(2, 5)
y_right = right_skew.pdf(x / 10)

axes[2].plot(x, y_right, 'r-', linewidth=3)
axes[2].fill_between(x, y_right, alpha=0.4, color='red')

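# Mark mean, median, mode for right-skewed
# (derived from Beta(2, 5) and rescaled to the 0-10 axis, not hand-picked)
mode_right = 10 * (2 - 1) / (2 + 5 - 2)  # Beta mode = (a-1)/(a+b-2) -> 2.0
median_right = 10 * right_skew.median()
mean_right = 10 * right_skew.mean()      # Beta mean = a/(a+b) -> about 2.86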

axes[2].axvline(mode_right, color='red', linestyle='--', linewidth=2, label='Mode')
axes[2].axvline(median_right, color='orange', linestyle='--', linewidth=2, label='Median')
axes[2].axvline(mean_right, color='green', linestyle='--', linewidth=2, label='Mean')

axes[2].set_title('RIGHT-SKEWED (Positive)\nMean > Median > Mode', 
                 fontsize=12, fontweight='bold')
axes[2].set_xlabel('Values', fontsize=10, fontweight='bold')
axes[2].set_ylabel('Frequency', fontsize=10, fontweight='bold')
axes[2].legend(loc='upper right', fontsize=9)
axes[2].grid(True, alpha=0.3)
axes[2].text(0.5, 0.95, 'Long tail\nto the RIGHT →',
            transform=axes[2].transAxes, ha='center', va='top', fontsize=10,
            bbox=dict(boxstyle='round,pad=0.3', facecolor='lightcoral', alpha=0.8))

# Overall title
fig.suptitle('Distribution Skewness: Understanding Asymmetry and the Relationship Between Mean, Median, and Mode',
            fontsize=14, fontweight='bold', y=1.02)

plt.tight_layout()
plt.show()

Figure 4.10: Distribution Skewness Types: Left-Skewed, Symmetric, and Right-Skewed

Key Relationship (typical for unimodal distributions):
- Symmetric: Mean = Median = Mode
- Right-skewed: Mean > Median > Mode (mean pulled right by extreme values)
- Left-skewed: Mean < Median < Mode (mean pulled left by extreme values)

4.7.6 Example 3.16: P&P Airlines Skewness Analysis

P&P Airlines passenger data:
- Mean: \bar{X} = 80.25 passengers
- Median: 82.14 passengers
- Standard Deviation: s = 12.79 passengers

Calculate skewness:

P = \frac{3(80.25 - 82.14)}{12.79} = \frac{3(-1.89)}{12.79} = \frac{-5.67}{12.79} = -0.44

Interpretation:
P = -0.44 indicates a slight left skew. Because |P| < 0.5, the distribution still counts as approximately symmetric under the rule of thumb above. The mean (80.25) is pulled slightly below the median (82.14) by some flights with unusually low passenger counts.

Practical meaning: Most flights are well-loaded (around 80–90 passengers), but occasional flights with very few passengers (50–60 range) pull the average down slightly. This is typical for regional airlines with inconsistent demand.

4.7.7 Example 3.17: Income Distribution (Right-Skewed)

A company’s employee salaries:
- Mean: \bar{X} = \$68,000
- Median: \$52,000
- Standard Deviation: s = \$24,000

Calculate skewness:

P = \frac{3(68,000 - 52,000)}{24,000} = \frac{3(16,000)}{24,000} = \frac{48,000}{24,000} = 2.0

Interpretation:
P = 2.0 indicates strong right skew. The distribution has a long right tail (a few executives earning very high salaries pull the mean far above the median).

Practical meaning: The median ($52K) is more representative of a “typical” employee salary than the mean ($68K). Half the employees earn less than $52K, but the high earners inflate the average. This is common for salary distributions.
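
The arithmetic in Examples 3.16 and 3.17 can be checked with a few lines of code. This is a minimal sketch that uses only the summary statistics quoted in the two examples; the helper name `pearson_p` is illustrative.

```python
def pearson_p(mean, median, sd):
    """Pearson's coefficient of skewness from summary statistics."""
    return 3 * (mean - median) / sd

# (mean, median, standard deviation) as given in Examples 3.16 and 3.17
examples = {
    "P&P Airlines passengers": (80.25, 82.14, 12.79),
    "Employee salaries": (68_000, 52_000, 24_000),
}

results = {name: pearson_p(*summary) for name, summary in examples.items()}
for name, p in results.items():
    print(f"{name}: P = {p:.2f}")
```

The sign of each result gives the direction of the skew, and its size can be read against the rule-of-thumb thresholds.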

Code

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Left-skewed distribution (Beta distribution)
x_left = np.linspace(0, 1, 1000)
y_left = stats.beta.pdf(x_left, 8, 2)
mean_left = 0.8
median_left = 0.82
mode_left = 0.875

axes[0].plot(x_left, y_left, 'b-', linewidth=3)
axes[0].fill_between(x_left, y_left, alpha=0.3, color='blue')
axes[0].axvline(mean_left, color='red', linestyle='--', linewidth=2.5, label=f'Mean = {mean_left:.2f}')
axes[0].axvline(median_left, color='green', linestyle='-.', linewidth=2.5, label=f'Median = {median_left:.2f}')
axes[0].axvline(mode_left, color='purple', linestyle=':', linewidth=3, label=f'Mode = {mode_left:.2f}')

axes[0].set_title('Left-Skewed (Negative Skew)\nMean < Median < Mode\nP < 0', 
                 fontsize=12, fontweight='bold')
axes[0].set_xlabel('Value', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[0].legend(fontsize=10, loc='upper left')
axes[0].grid(True, alpha=0.3)
axes[0].text(0.3, max(y_left)*0.7, 'Long left tail\n(low values\npull mean down)',
            fontsize=10, bbox=dict(boxstyle='round,pad=0.5', facecolor='lightyellow', alpha=0.8))

# Symmetric distribution (Normal)
x_sym = np.linspace(-4, 4, 1000)
y_sym = stats.norm.pdf(x_sym, 0, 1)
mean_sym = 0
median_sym = 0
mode_sym = 0

axes[1].plot(x_sym, y_sym, 'g-', linewidth=3)
axes[1].fill_between(x_sym, y_sym, alpha=0.3, color='green')
axes[1].axvline(mean_sym, color='red', linestyle='--', linewidth=2.5, 
               label=f'Mean = Median = Mode = {mean_sym:.2f}')

axes[1].set_title('Symmetric (No Skew)\nMean = Median = Mode\nP ≈ 0', 
                 fontsize=12, fontweight='bold')
axes[1].set_xlabel('Value', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[1].legend(fontsize=10, loc='upper right')
axes[1].grid(True, alpha=0.3)
axes[1].text(0, max(y_sym)*0.7, 'Perfectly balanced\n(normal distribution)',
            fontsize=10, ha='center',
            bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgreen', alpha=0.8))

# Right-skewed distribution (Gamma with shape 2, scale 1)
x_right = np.linspace(0, 10, 1000)
y_right = stats.gamma.pdf(x_right, 2, scale=1)
mean_right = 2.0
median_right = 1.68
mode_right = 1.0

axes[2].plot(x_right, y_right, 'r-', linewidth=3)
axes[2].fill_between(x_right, y_right, alpha=0.3, color='red')
axes[2].axvline(mode_right, color='purple', linestyle=':', linewidth=3, label=f'Mode = {mode_right:.2f}')
axes[2].axvline(median_right, color='green', linestyle='-.', linewidth=2.5, label=f'Median = {median_right:.2f}')
axes[2].axvline(mean_right, color='red', linestyle='--', linewidth=2.5, label=f'Mean = {mean_right:.2f}')

axes[2].set_title('Right-Skewed (Positive Skew)\nMode < Median < Mean\nP > 0', 
                 fontsize=12, fontweight='bold')
axes[2].set_xlabel('Value', fontsize=11, fontweight='bold')
axes[2].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[2].legend(fontsize=10, loc='upper right')
axes[2].grid(True, alpha=0.3)
axes[2].text(6, max(y_right)*0.7, 'Long right tail\n(high values\npull mean up)',
            fontsize=10, bbox=dict(boxstyle='round,pad=0.5', facecolor='lightcoral', alpha=0.8))

plt.tight_layout()
plt.show()

Figure 4.11: Types of Skewness: Left-Skewed, Symmetric, and Right-Skewed Distributions

Business Examples of Skewed Distributions

Right-Skewed (Positive Skew), most common:
- Income/Salary: Most people earn moderate salaries, few earn very high
- Home prices: Most homes are modestly priced, luxury homes create long right tail
- Customer spending: Most customers spend little, a few “whales” spend heavily
- Insurance claims: Most claims are small, catastrophic claims are rare but large
- Website traffic: Most visitors view few pages, some “power users” view many

Left-Skewed (Negative Skew), less common:
- Test scores with ceiling effect: Most students score high (80–100%), few fail
- Product lifetimes with minimum guarantee: Most products last long, few fail early
- Age at retirement: Most retire around 65, few retire very early

Symmetric:
- Heights, weights (in homogeneous populations)
- Measurement errors
- Many natural phenomena (follow normal distribution)

4.7.8 D. Coefficient of Variation (CV)

The coefficient of variation measures relative variability—it expresses standard deviation as a percentage of the mean. This allows comparison of variability across datasets with different units or scales.

Coefficient of Variation

CV = \frac{s}{\bar{X}} \times 100\%

where:
- s = standard deviation
- \bar{X} = mean

Interpretation:
- Low CV (< 15%): Low variability relative to mean (consistent, homogeneous data)
- Moderate CV (15–30%): Moderate variability
- High CV (> 30%): High variability relative to mean (inconsistent, heterogeneous data)

Advantages:
- Unit-free (always a percentage)
- Allows comparison of variability across different scales (e.g., dollars vs. percentages)
- Identifies which variable is more variable relative to its size
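
To make the formula concrete, here is a minimal sketch that computes the CV from raw observations. The two datasets are hypothetical and deliberately measured in different units; the function name `coefficient_of_variation` is illustrative, and the guard against a non-positive mean is an added assumption about how one might handle that case.

```python
import numpy as np

def coefficient_of_variation(data):
    """CV = s / mean * 100, using the sample standard deviation (n - 1)."""
    data = np.asarray(data, dtype=float)
    mean = data.mean()
    if mean <= 0:
        raise ValueError("CV is only meaningful for a positive mean")
    return data.std(ddof=1) / mean * 100

# Hypothetical data in two different units (illustrative values only)
daily_orders = [48, 52, 50, 47, 53]          # orders per day
delivery_cost = [210, 340, 150, 400, 275]    # dollars per day

for name, values in [("orders", daily_orders), ("cost", delivery_cost)]:
    print(f"{name}: CV = {coefficient_of_variation(values):.1f}%")
```

Because both results are percentages, they can be compared directly even though one series counts orders and the other is in dollars.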

4.7.9 Example 3.18: P&P Airlines Variability Comparison

P&P Airlines wants to determine which metric is more variable: passenger count or miles flown.

Passenger Data:
- Mean: \bar{X}_{\text{passengers}} = 80 passengers
- Standard Deviation: s_{\text{passengers}} = 12.35 passengers

Miles Flown Data:
- Mean: \bar{X}_{\text{miles}} = 520 miles
- Standard Deviation: s_{\text{miles}} = 62.7 miles

Question: Which metric shows greater relative variability?

Solution:

CV for Passengers: CV_{\text{passengers}} = \frac{12.35}{80} \times 100\% = 15.4\%

CV for Miles: CV_{\text{miles}} = \frac{62.7}{520} \times 100\% = 12.1\%

Interpretation:
Although miles has a larger standard deviation (62.7 vs. 12.35), passengers show greater relative variability (15.4% vs. 12.1%).

Business Implication:
Passenger counts are less predictable than flight distances. This makes sense: flight routes (miles) are fixed, but passenger demand fluctuates based on day of week, season, pricing, and competition. P&P should focus capacity planning efforts on passenger forecasting, not route planning.

4.7.10 Example 3.19: Investment Portfolio Comparison

An investor compares two portfolios:

Portfolio A (Bonds):
- Mean annual return: \bar{X}_A = 5\%
- Standard deviation: s_A = 1.2\%

Portfolio B (Stocks):
- Mean annual return: \bar{X}_B = 12\%
- Standard deviation: s_B = 4.8\%

Question: Which portfolio is riskier relative to its return?

Solution:

CV_A = \frac{1.2}{5} \times 100\% = 24\%

CV_B = \frac{4.8}{12} \times 100\% = 40\%

Interpretation:
Portfolio B (stocks) has higher relative variability (CV = 40% vs. 24%). Per unit of expected return, stocks carry about 1.67 times as much risk as bonds (40% / 24% ≈ 1.67).

Investment Decision:
- Risk-averse investor: Choose Portfolio A (bonds)—lower CV means more consistent returns
- Risk-tolerant investor: Choose Portfolio B (stocks)—higher mean return despite higher CV

This is why the Sharpe ratio (excess return / SD) is popular in finance: it is essentially the inverse of the CV, computed on returns in excess of the risk-free rate.
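
The CV calculations in Examples 3.18 and 3.19, and the link to the Sharpe ratio, can be reproduced as follows. This is a minimal sketch using only the summary statistics from the two examples. The risk-free rate of 0% is an assumption chosen to show the inverse relationship cleanly; a realistic analysis would substitute an actual rate.

```python
def cv_percent(sd, mean):
    """Coefficient of variation as a percentage."""
    return sd / mean * 100

def sharpe_ratio(mean_return, sd, risk_free):
    """Excess return per unit of standard deviation."""
    return (mean_return - risk_free) / sd

# Example 3.18: P&P Airlines
cv_passengers = cv_percent(12.35, 80)
cv_miles = cv_percent(62.7, 520)

# Example 3.19: portfolios (returns and SDs in percent)
cv_bonds = cv_percent(1.2, 5)
cv_stocks = cv_percent(4.8, 12)
risk_ratio = cv_stocks / cv_bonds

RISK_FREE = 0.0  # ASSUMPTION: zero risk-free rate, for illustration only
sharpe_bonds = sharpe_ratio(5, 1.2, RISK_FREE)
sharpe_stocks = sharpe_ratio(12, 4.8, RISK_FREE)

print(f"Passengers CV = {cv_passengers:.1f}%, Miles CV = {cv_miles:.1f}%")
print(f"Bonds CV = {cv_bonds:.0f}%, Stocks CV = {cv_stocks:.0f}%")
print(f"Stocks-to-bonds CV ratio = {risk_ratio:.2f}")
print(f"Sharpe (bonds) = {sharpe_bonds:.2f}, 100 / CV = {100 / cv_bonds:.2f}")
print(f"Sharpe (stocks) = {sharpe_stocks:.2f}, 100 / CV = {100 / cv_stocks:.2f}")
```

With a zero risk-free rate the Sharpe ratio equals 100 divided by the CV (in percent), so the portfolio with the lower CV has the higher Sharpe ratio. A positive risk-free rate lowers both Sharpe ratios and can change how far apart they are.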

Code

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Example 1: P&P Airlines (Passengers vs. Miles)
categories_pp = ['Passengers', 'Miles Flown']
means_pp = [80, 520]
stds_pp = [12.35, 62.7]
cvs_pp = [(stds_pp[i] / means_pp[i]) * 100 for i in range(2)]

# Panel 1a: Standard Deviations (misleading comparison)
bars1 = axes[0, 0].bar(categories_pp, stds_pp, color=['steelblue', 'coral'], 
                       alpha=0.7, edgecolor='black', linewidth=2)
axes[0, 0].set_title('Standard Deviations (Different Units)\nMisleading Comparison!', 
                     fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('Standard Deviation', fontsize=11, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3, axis='y')

for bar, std in zip(bars1, stds_pp):
    height = bar.get_height()
    axes[0, 0].text(bar.get_x() + bar.get_width()/2., height,
                   f'{std:.1f}',
                   ha='center', va='bottom', fontweight='bold', fontsize=11)

axes[0, 0].text(0.5, 0.97, 'X Cannot compare directly\n(different units: passengers vs. miles)',
               transform=axes[0, 0].transAxes, ha='center', va='top', fontsize=10,
               bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.8))

# Panel 1b: Coefficients of Variation (valid comparison)
bars2 = axes[0, 1].bar(categories_pp, cvs_pp, color=['steelblue', 'coral'],
                       alpha=0.7, edgecolor='black', linewidth=2)
axes[0, 1].set_title('Coefficients of Variation (Unit-Free %)\nValid Comparison ✓', 
                     fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('CV (%)', fontsize=11, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3, axis='y')
axes[0, 1].axhline(y=15, color='orange', linestyle='--', linewidth=2, alpha=0.7,
                  label='Moderate variability threshold (15%)')

for bar, cv in zip(bars2, cvs_pp):
    height = bar.get_height()
    axes[0, 1].text(bar.get_x() + bar.get_width()/2., height,
                   f'{cv:.1f}%',
                   ha='center', va='bottom', fontweight='bold', fontsize=11)

axes[0, 1].legend(fontsize=9)
axes[0, 1].text(0.5, 0.97, 'OK - Valid comparison\nPassengers (15.4%) more variable than Miles (12.1%)',
               transform=axes[0, 1].transAxes, ha='center', va='top', fontsize=10,
               bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgreen', alpha=0.8))

# Example 2: Investment Portfolio Comparison
categories_inv = ['Bonds\n(Portfolio A)', 'Stocks\n(Portfolio B)']
means_inv = [5, 12]
stds_inv = [1.2, 4.8]
cvs_inv = [(stds_inv[i] / means_inv[i]) * 100 for i in range(2)]

# Panel 2a: Returns with error bars
x_pos = np.arange(len(categories_inv))
bars3 = axes[1, 0].bar(x_pos, means_inv, yerr=stds_inv, capsize=10,
                       color=['green', 'red'], alpha=0.6, edgecolor='black', linewidth=2,
                       error_kw={'linewidth': 3, 'ecolor': 'black'})

axes[1, 0].set_title('Portfolio Returns with Variability (±1 SD)', 
                     fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('Annual Return (%)', fontsize=11, fontweight='bold')
axes[1, 0].set_xticks(x_pos)
axes[1, 0].set_xticklabels(categories_inv)
axes[1, 0].grid(True, alpha=0.3, axis='y')

for bar, mean, std in zip(bars3, means_inv, stds_inv):
    height = bar.get_height()
    axes[1, 0].text(bar.get_x() + bar.get_width()/2., height/2,
                   f'μ = {mean}%\nσ = {std}%',
                   ha='center', va='center', fontweight='bold', fontsize=10,
                   color='white')

# Panel 2b: Risk-Adjusted Variability (CV)
bars4 = axes[1, 1].bar(categories_inv, cvs_inv, color=['green', 'red'],
                       alpha=0.6, edgecolor='black', linewidth=2)

axes[1, 1].set_title('Risk-Adjusted Variability (CV)\nStocks Have Higher Relative Risk', 
                     fontsize=12, fontweight='bold')
axes[1, 1].set_ylabel('Coefficient of Variation (%)', fontsize=11, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3, axis='y')
axes[1, 1].axhline(y=30, color='red', linestyle='--', linewidth=2, alpha=0.7,
                  label='High variability threshold (30%)')

for bar, cv in zip(bars4, cvs_inv):
    height = bar.get_height()
    axes[1, 1].text(bar.get_x() + bar.get_width()/2., height,
                   f'{cv:.0f}%',
                   ha='center', va='bottom', fontweight='bold', fontsize=11)

axes[1, 1].legend(fontsize=9)

# Add interpretation
interpretation = (
    "Interpretation:\n"
    "Bonds: Lower return (5%) but\n"
    "  more consistent (CV=24%)\n"
    "Stocks: Higher return (12%) but\n"
    "  more volatile (CV=40%)\n"
    "→ Stocks have 67% more risk\n"
    "   per unit of return"
)
axes[1, 1].text(0.02, 0.97, interpretation, transform=axes[1, 1].transAxes,
               fontsize=9, verticalalignment='top',
               bbox=dict(boxstyle='round,pad=0.5', facecolor='lightyellow', alpha=0.9))

plt.tight_layout()
plt.show()

Figure 4.12: Coefficient of Variation: Comparing Relative Variability Across Different Scales

When to Use Coefficient of Variation

Use CV when:
- Comparing variability of different variables (different units or scales)
- Comparing risk-adjusted returns in finance
- Determining which process/product has more consistent quality relative to its target
- Variables have different means (CV adjusts for this)

Don’t use CV when:
- Mean is zero or near zero (CV becomes undefined or misleadingly large)
- Data can take negative values or lack a true zero point (CV assumes a positive mean on a ratio scale)
- Absolute variability is more important than relative (use SD instead)
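
The near-zero-mean problem is easy to demonstrate. The sketch below uses hypothetical winter temperatures expressed in Celsius and in Kelvin. The spread is identical in both scales, but the CV changes dramatically because the Celsius mean is close to zero. All values and names are illustrative.

```python
import numpy as np

def cv_percent(data):
    """CV = s / mean * 100 with the sample standard deviation."""
    data = np.asarray(data, dtype=float)
    return data.std(ddof=1) / data.mean() * 100

# Hypothetical winter temperatures (illustrative values only)
temps_celsius = np.array([1.0, -2.0, 3.0, 0.5, -1.5, 2.0])
temps_kelvin = temps_celsius + 273.15  # same temperatures, different zero point

sd_celsius = temps_celsius.std(ddof=1)
sd_kelvin = temps_kelvin.std(ddof=1)
cv_celsius = cv_percent(temps_celsius)
cv_kelvin = cv_percent(temps_kelvin)

print(f"SD: {sd_celsius:.2f} (Celsius) vs {sd_kelvin:.2f} (Kelvin)")
print(f"CV: {cv_celsius:.1f}% (Celsius) vs {cv_kelvin:.2f}% (Kelvin)")
```

The standard deviation is the same in both scales, while the Celsius CV runs to several hundred percent. In a case like this the standard deviation is the safer summary.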

4.8 Chapter 3 Summary

4.8.1 Key Concepts

Measures of Central Tendency:

Mean (\bar{X} or \mu): Arithmetic average, uses all values, sensitive to outliers
Median: Middle value when sorted, resistant to outliers, best for skewed data
Mode: Most frequent value, useful for categorical data
Weighted Mean: Accounts for different importance/weights
Geometric Mean: Average growth rate for multiplicative data

Measures of Dispersion:

Range: Max - Min (simple but sensitive to outliers)
Variance (s^2 or \sigma^2): Average squared deviation
Standard Deviation (s or \sigma): Square root of variance (same units as data)
Interquartile Range (IQR): Q3 - Q1 (middle 50%, robust to outliers)
Coefficient of Variation (CV): Relative variability as percentage

Distribution Shape:

Chebyshev’s Theorem: At least 1 - 1/K^2 within K SD, for K > 1 (any distribution)
Empirical Rule: 68%-95%-99.7% within 1-2-3 SD (normal distributions only)
Skewness: Asymmetry measure (Pearson’s coefficient)
- Right-skewed: Mean > Median (most common for income, prices)
- Left-skewed: Mean < Median
- Symmetric: Mean ≈ Median

4.8.2 Essential Formulas Reference

Formula Reference Table
Measure	Population	Sample
Mean	\mu = \frac{\sum X_i}{N}	\bar{X} = \frac{\sum X_i}{n}
Variance	\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}	s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}
Std Dev	\sigma = \sqrt{\sigma^2}	s = \sqrt{s^2}
Weighted Mean	—	\bar{X}_W = \frac{\sum X_i W_i}{\sum W_i}
Geometric Mean	—	GM = \sqrt[n]{X_1 \times \cdots \times X_n}
CV	—	CV = \frac{s}{\bar{X}} \times 100\%
Skewness	—	P = \frac{3(\bar{X} - \text{Median})}{s}
Percentile Position	—	L_P = (n+1) \times \frac{P}{100}
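
The formulas in the table map onto a handful of NumPy calls. The following is a minimal sketch that applies them to the monthly sales figures from Example 3.1. The inputs for the weighted and geometric means (`grades`, `credits`, `growth_factors`) are hypothetical and only illustrate the calculation.

```python
import numpy as np

# Monthly sales (in $1000s) from Example 3.1
sales = np.array([56, 67, 52, 45, 67], dtype=float)
n = sales.size

mean = sales.mean()                     # sum of X_i divided by n
pop_var = sales.var(ddof=0)             # population formula: divide by N
sample_var = sales.var(ddof=1)          # sample formula: divide by n - 1
sample_sd = np.sqrt(sample_var)
cv = sample_sd / mean * 100
pearson_p = 3 * (mean - np.median(sales)) / sample_sd
median_position = (n + 1) * 50 / 100    # L_P for the 50th percentile

# Hypothetical inputs for the two remaining formulas (illustrative only)
grades = np.array([80.0, 90.0, 70.0])
credits = np.array([3.0, 4.0, 2.0])
weighted_mean = np.sum(grades * credits) / np.sum(credits)

growth_factors = np.array([1.10, 1.20, 0.90])
geometric_mean = np.prod(growth_factors) ** (1 / growth_factors.size)

print(f"Mean = {mean:.1f}")
print(f"Population variance = {pop_var:.2f}, sample variance = {sample_var:.2f}")
print(f"Sample SD = {sample_sd:.2f}, CV = {cv:.1f}%")
print(f"Pearson P = {pearson_p:.2f}")
print(f"Median position L_50 = {median_position:.1f}")
print(f"Weighted mean = {weighted_mean:.2f}")
print(f"Geometric mean = {geometric_mean:.4f}")
```

The `ddof` argument is the only difference between the population and sample columns of the table: `ddof=0` divides by N, `ddof=1` divides by n - 1.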

4.8.3 Decision Framework: Which Measure to Use?

Central Tendency:

Normal/Symmetric data: Use Mean (most efficient, uses all information)
Skewed data or outliers: Use Median (robust, more representative)
Categorical data: Use Mode (only option for non-numeric data)
Growth rates/returns: Use Geometric Mean (compounds correctly)
Different importance: Use Weighted Mean (accounts for weights)

Dispersion:

Normal data, absolute variability: Use Standard Deviation
Compare across different scales: Use Coefficient of Variation
Outlier-resistant measure: Use IQR
Quick rough estimate: Use Range (least informative)

Distribution Assessment:

Unknown distribution: Use Chebyshev’s Theorem (always valid)
Normal distribution: Use Empirical Rule (precise)
Assess symmetry: Calculate Skewness

4.8.4 Business Applications Recap

Real-World Applications

Finance:
- Portfolio risk (standard deviation = volatility)
- Sharpe ratio (excess return / SD)
- Value-at-Risk (percentiles)
- CAGR (geometric mean)

Manufacturing:
- Process capability (mean, SD, specifications)
- Six Sigma (defects per million using SD)
- Quality control charts (mean ± 3 SD limits)

Human Resources:
- Salary benchmarking (percentiles, quartiles)
- Performance ratings (mean, mode)
- Compensation equity (CV across departments)

Marketing:
- Customer lifetime value variability (SD, CV)
- Market segmentation (percentiles)
- Campaign performance consistency (CV)

Operations:
- Demand forecasting (mean, SD)
- Service level targets (percentiles)
- Capacity planning (mean ± 2 SD)

4.9 Glossary

Central Tendency: A single value that represents the “center” or “typical” value of a dataset.

Chebyshev’s Theorem: For any distribution and any K > 1, at least 1 - 1/K^2 of observations lie within K standard deviations of the mean.

Coefficient of Variation (CV): Relative measure of dispersion expressed as percentage of mean: CV = (s/\bar{X}) \times 100\%.

Deciles: Nine values that divide an ordered dataset into 10 equal parts.

Degrees of Freedom: The number of independent pieces of information available to estimate a parameter (n-1 for sample variance).

Dispersion: The spread or variability of observations around the central tendency.

Empirical Rule: For normal distributions, approximately 68%-95%-99.7% of observations fall within 1-2-3 standard deviations of the mean.

Geometric Mean: The nth root of the product of n values; appropriate for growth rates and multiplicative data.

Interquartile Range (IQR): The range of the middle 50% of data: IQR = Q3 - Q1.

Mean: The arithmetic average: sum of values divided by count.

Median: The middle value when data are sorted (50th percentile).

Mode: The most frequently occurring value(s) in a dataset.

Percentile: The Pth percentile is the value below which P percent of observations fall.

Quartiles: Three values (Q1, Q2, Q3) that divide an ordered dataset into four equal parts.

Range: The difference between maximum and minimum values.

Skewness: A measure of asymmetry in a distribution.

Standard Deviation: The square root of variance; measures the typical distance of observations from the mean, in the same units as the data.

Variance: The average of squared deviations from the mean (dividing by N for a population, by n - 1 for a sample).

Weighted Mean: An average that accounts for different importance or weights of values.

END OF CHAPTER 3

You have now covered:
✅ All measures of central tendency (mean, median, mode, weighted mean, geometric mean)
✅ All measures of dispersion (range, variance, SD, IQR, CV)
✅ Grouped data formulas
✅ Percentiles and quartiles
✅ Chebyshev’s Theorem
✅ Empirical Rule
✅ Skewness
✅ Coefficient of Variation
✅ Comprehensive business examples throughout
✅ 9+ Python visualizations
✅ Formulas reference table
✅ Decision framework
✅ Glossary

# Measures of Central Tendency and Dispersion {#sec-central-tendency-dispersion} ```{mermaid} %%| fig-width: 10 %%| fig-cap: "Chapter 3 Overview: Measures of Central Tendency and Dispersion" graph TD A[Measures of Central Tendency and Dispersion] --> B[Central Tendencies] A --> C[Dispersion] B --> D[Ungrouped Data] B --> E[Grouped Data] D --> F[Mean] D --> G[Median] D --> H[Mode] D --> I[Weighted Mean] D --> J[Geometric Mean] E --> K[Mean] E --> L[Median] E --> M[Mode] C --> N[Ungrouped Data] C --> O[Grouped Data] N --> P[Range] N --> Q[Variance and Standard Deviation] N --> R[Percentiles] O --> S[Variance and Standard Deviation] T[Related Concepts] --> U[Chebyshev's Theorem] T --> V[Empirical Rule] T --> W[Skewness] T --> X[Coefficient of Variation] style A fill:#2E86AB,stroke:#A23B72,stroke-width:3px,color:#fff style B fill:#A23B72,stroke:#F18F01,stroke-width:2px,color:#fff style C fill:#A23B72,stroke:#F18F01,stroke-width:2px,color:#fff style T fill:#2E86AB,stroke:#A23B72,stroke-width:3px,color:#fff style D fill:#F18F01,stroke:#C73E1D,stroke-width:2px,color:#fff style E fill:#F18F01,stroke:#C73E1D,stroke-width:2px,color:#fff style N fill:#F18F01,stroke:#C73E1D,stroke-width:2px,color:#fff style O fill:#F18F01,stroke:#C73E1D,stroke-width:2px,color:#fff ``` ## Business Scenario: Investment Analysis at Boggs Financial Services Robert Boggs operates a successful investment advisory firm serving individual and institutional clients. His reputation depends on providing accurate, data-driven recommendations about stocks, bonds, mutual funds, and other securities. Daily, Boggs faces critical questions from nervous investors: - "What's the *average* return on this mutual fund?" - "How *consistent* are the quarterly earnings of this company?" - "Is this stock more *volatile* than that one?" - "Which investment has *less risk* for the same expected return?" 
To answer these questions professionally, Boggs cannot simply list raw numbers—he must **summarize and describe** the data using precise statistical measures. A list of 50 daily stock prices tells an investor nothing; the **mean price** and **standard deviation** tell the complete story in two numbers. This chapter introduces the fundamental tools for describing datasets: **measures of central tendency** (where the data is centered) and **measures of dispersion** (how spread out the data is). These concepts form the foundation of all statistical analysis in business, economics, finance, quality control, and scientific research. ::: {.callout-important icon="🎯"} ## The Fundamental Challenge Raw data is overwhelming and meaningless. A dataset of 1,000 observations cannot be communicated effectively. Statistical measures reduce thousands of numbers to a handful of meaningful summaries that support intelligent decision-making. **Central Tendency:** Where is the "typical" value? **Dispersion:** How much variability exists around that typical value? These two concepts together provide a complete numerical description of any dataset. ::: --- ## 3.1 Measures of Central Tendency for Ungrouped Data **Central tendency** measures identify the "center" or "typical value" of a dataset. The three primary measures are: 1. **Mean** ($\bar{X}$ or $\mu$) — Arithmetic average 2. **Median** — Middle value when data is ordered 3. **Mode** — Most frequently occurring value Each measure has specific use cases, advantages, and limitations. --- ### A. The Mean (La Media) The **mean** is the arithmetic average—the sum of all values divided by the count of observations. It is the most commonly used measure of central tendency in statistical analysis. 
::: {.callout-note icon="📐"} ## Definition: Mean **Sample Mean:** $\bar{X} = \frac{\sum X_i}{n}$ **Population Mean:** $\mu = \frac{\sum X_i}{N}$ where: - $X_i$ = individual observations - $n$ = sample size - $N$ = population size - $\sum$ = summation symbol (add all values) ::: **Key Characteristics of the Mean:** - Uses **every observation** in the dataset - Sensitive to **extreme values** (outliers) - Has desirable **mathematical properties** (totals, weighted averages) - Represents the "center of gravity" or "balance point" of the data ### Example 3.1: Mean Sales Revenue A small retail store recorded monthly sales revenue (in thousands of dollars) for 5 months: $$56, 67, 52, 45, 67$$ **Calculate the mean sales revenue.** **Solution:** $$ \bar{X} = \frac{\sum X_i}{n} = \frac{56 + 67 + 52 + 45 + 67}{5} = \frac{287}{5} = 57.4 \text{ thousand dollars} $$ **Interpretation:** The average monthly sales revenue is **$57,400**. This represents the typical monthly performance. Notice that the value 67 appears twice (the mode), but the mean (57.4) falls below this repeated value because the lower values (45, 52, 56) pull the average down. 
::: {.callout-tip icon="💡"} ## Business Application: Using the Mean The mean is ideal for: - **Budgeting:** Average monthly expenses for annual planning - **Performance metrics:** Average sales per employee - **Quality control:** Average product weight or dimension - **Financial analysis:** Average stock return over time **When NOT to use the mean:** - Data contains extreme outliers (CEO salaries, real estate prices) - Data is highly skewed (income distributions) - In these cases, the **median** provides a better "typical" value ::: ```{python} #| label: fig-mean-concept #| fig-cap: "The Mean as Center of Gravity: Balancing Point of Data" #| code-fold: true import matplotlib.pyplot as plt import numpy as np # Sales data sales = [45, 52, 56, 67, 67] mean_sales = np.mean(sales) fig, ax = plt.subplots(figsize=(12, 6)) # Plot data points as weights on a balance beam y_positions = [0.5] * len(sales) colors = ['#3498db' if x < mean_sales else '#e74c3c' for x in sales] ax.scatter(sales, y_positions, s=300, alpha=0.7, c=colors, edgecolors='black', linewidths=2, zorder=3) # Draw balance beam ax.plot([40, 75], [0.5, 0.5], 'k-', linewidth=3, alpha=0.3) # Draw fulcrum at mean fulcrum_x = [mean_sales - 1, mean_sales, mean_sales + 1, mean_sales - 1] fulcrum_y = [0, 0.3, 0, 0] ax.fill(fulcrum_x, fulcrum_y, color='gold', edgecolor='black', linewidth=2, zorder=2) # Vertical line at mean ax.axvline(mean_sales, color='red', linestyle='--', linewidth=2, alpha=0.7, label=f'Mean = ${mean_sales:.1f}K') # Annotations for i, (x, y) in enumerate(zip(sales, y_positions)): ax.annotate(f'${x}K', xy=(x, y), xytext=(0, 20), textcoords='offset points', ha='center', fontweight='bold', fontsize=10) # Add deviation arrows for x in sales: if x < mean_sales: ax.annotate('', xy=(mean_sales, 0.5), xytext=(x, 0.5), arrowprops=dict(arrowstyle='<->', color='blue', lw=1.5, alpha=0.5)) elif x > mean_sales: ax.annotate('', xy=(x, 0.5), xytext=(mean_sales, 0.5), arrowprops=dict(arrowstyle='<->', color='red', 
lw=1.5, alpha=0.5)) ax.set_xlim(40, 75) ax.set_ylim(-0.2, 1.2) ax.set_xlabel('Sales Revenue ($1000s)', fontsize=12, fontweight='bold') ax.set_title('Mean as Balance Point: Sales Data\n(Deviations below mean balance deviations above mean)', fontsize=13, fontweight='bold', pad=15) ax.legend(loc='upper right', fontsize=11) ax.set_yticks([]) ax.grid(True, alpha=0.3, axis='x') plt.tight_layout() plt.show() ``` --- ### B. The Median (La Mediana) The **median** is the middle value when observations are arranged in order from smallest to largest. Exactly **50% of observations fall below the median** and **50% fall above it**. ::: {.callout-note icon="📐"} ## Definition: Median **Calculation Steps:** 1. **Order the data** from smallest to largest 2. **Find the position:** $L_{median} = \frac{n+1}{2}$ 3. **Determine the median:** - If $n$ is **odd**: Median = value at position $L_{median}$ - If $n$ is **even**: Median = average of the two middle values ::: **Key Characteristics of the Median:** - **Not affected by extreme values** (outliers) - Robust measure for **skewed distributions** - Represents the **50th percentile** - Better than mean for income, prices, and other skewed data ### Example 3.2: Median Manufacturing Times A chip manufacturer tests processing times (in seconds) for 20 processors: $$3.2, 4.1, 6.3, 1.9, 0.6, 5.4, 5.2, 3.2, 4.9, 6.2, 1.8, 1.7, 3.6, 1.5, 2.6, 4.3, 6.1, 2.4, 2.2, 3.3$$ **Calculate the median processing time.** **Solution:** **Step 1:** Order the data: $$0.6, 1.5, 1.7, 1.8, 1.9, 2.2, 2.4, 2.6, 3.2, 3.2, 3.3, 3.6, 4.1, 4.3, 4.9, 5.2, 5.4, 6.1, 6.2, 6.3$$ **Step 2:** Find position: $$L_{median} = \frac{20+1}{2} = 10.5$$ The median is between the 10th and 11th observations. **Step 3:** Calculate median: - 10th observation: 3.2 - 11th observation: 3.3 $$\text{Median} = \frac{3.2 + 3.3}{2} = 3.25 \text{ seconds}$$ **Interpretation:** Half the processors complete in 3.25 seconds or less; half take 3.25 seconds or more. 
This "typical" value is **not influenced** by the extreme low value (0.6 seconds) or high value (6.3 seconds). ::: {.callout-important icon="⚖️"} ## When to Use Median vs. Mean **Use MEDIAN when:** - Data contains **outliers** (extremely high or low values) - Distribution is **skewed** (income, house prices, CEO compensation) - You want the "typical middle" value unaffected by extremes **Use MEAN when:** - Data is **symmetric** (approximately normally distributed) - All observations are equally important - You need **mathematical properties** (totals, weighted averages) - Calculating averages of averages **Classic Example:** In a neighborhood with 9 homes worth $200K each and 1 mansion worth $5M: - **Mean price** = $680K (misleading—no home sells near this price!) - **Median price** = $200K (accurate representation of typical home) ::: ```{python} #| label: fig-mean-vs-median #| fig-cap: "Mean vs. Median: Effect of Outliers" #| code-fold: true import matplotlib.pyplot as plt import numpy as np fig, axes = plt.subplots(1, 2, figsize=(14, 6)) # Panel 1: Symmetric data (Mean ≈ Median) np.random.seed(42) symmetric_data = np.random.normal(50, 10, 100) mean_sym = np.mean(symmetric_data) median_sym = np.median(symmetric_data) axes[0].hist(symmetric_data, bins=20, color='steelblue', alpha=0.7, edgecolor='black') axes[0].axvline(mean_sym, color='red', linestyle='--', linewidth=2.5, label=f'Mean = {mean_sym:.1f}') axes[0].axvline(median_sym, color='green', linestyle=':', linewidth=2.5, label=f'Median = {median_sym:.1f}') axes[0].set_title('Symmetric Distribution\n(Mean ≈ Median)', fontsize=12, fontweight='bold') axes[0].set_xlabel('Value', fontsize=11) axes[0].set_ylabel('Frequency', fontsize=11) axes[0].legend(fontsize=10) axes[0].grid(True, alpha=0.3) # Panel 2: Skewed data with outlier (Mean >> Median) skewed_data = np.concatenate([ np.random.normal(30, 5, 90), np.array([80, 85, 90, 95, 100, 120, 140, 160, 180, 200]) # Outliers ]) mean_skew = np.mean(skewed_data) 
median_skew = np.median(skewed_data) axes[1].hist(skewed_data, bins=25, color='coral', alpha=0.7, edgecolor='black') axes[1].axvline(mean_skew, color='red', linestyle='--', linewidth=2.5, label=f'Mean = {mean_skew:.1f}') axes[1].axvline(median_skew, color='green', linestyle=':', linewidth=2.5, label=f'Median = {median_skew:.1f}') axes[1].set_title('Right-Skewed Distribution\n(Mean > Median due to outliers)', fontsize=12, fontweight='bold') axes[1].set_xlabel('Value', fontsize=11) axes[1].set_ylabel('Frequency', fontsize=11) axes[1].legend(fontsize=10) axes[1].grid(True, alpha=0.3) # Add annotation arrows axes[1].annotate('Outliers pull\nmean upward', xy=(mean_skew, 5), xytext=(100, 15), arrowprops=dict(arrowstyle='->', lw=2, color='red'), fontsize=10, fontweight='bold', bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.7)) plt.tight_layout() plt.show() ``` --- ### C. The Mode (La Moda) The **mode** is the value that appears **most frequently** in the dataset. Unlike mean and median, a dataset can have: - **No mode:** All values appear once (uniform frequency) - **One mode (unimodal):** One value appears most often - **Two modes (bimodal):** Two values tied for highest frequency - **Multimodal:** Three or more values tied **Key Characteristics of the Mode:** - Only measure of central tendency for **categorical data** - Can be used with **any type of data** (numerical or categorical) - Not affected by extreme values - May not exist or may not be unique - Most useful for **discrete data** (shoe sizes, defect counts) ### Example 3.3: Mode in Profit Analysis An analyst at Acme, Inc. 
examines monthly profits (in thousands of dollars) for the past 12 months: $$56, 67, 52, 45, 67, 52, 58, 67, 62, 70, 58, 45$$ **Identify the mode.** **Solution:** **Frequency count:** - 45 appears 2 times - 52 appears 2 times - 56 appears 1 time - 58 appears 2 times - 62 appears 1 time - **67 appears 3 times** ← Maximum frequency - 70 appears 1 time **Mode = $67,000** (appears 3 times) **Interpretation:** The most common monthly profit level is $67,000. This occurred in 25% of months (3 out of 12). However, note that multiple values (45, 52, 58) appear twice, suggesting the distribution has some variability in frequently occurring values. ::: {.callout-note icon="📊"} ## Business Applications of the Mode **Inventory Management:** A shoe store needs to know which sizes to stock heavily. If Size 9 is the mode, order more Size 9 inventory. **Quality Control:** If "3 defects per unit" is the mode in production data, investigate what process conditions lead to this specific defect count. **Market Research:** If "Very Satisfied" is the modal response in customer surveys, this is the most common customer sentiment. **Categorical Data:** Mode is the ONLY measure of central tendency for non-numerical data: - Most popular car color: "Silver" (mode) - Most common complaint type: "Billing error" (mode) - Most frequent transaction method: "Credit card" (mode) ::: ```{python} #| label: fig-mode-distribution #| fig-cap: "Mode Identification: Unimodal vs. 
Bimodal Distributions" #| code-fold: true import matplotlib.pyplot as plt import numpy as np from collections import Counter fig, axes = plt.subplots(1, 2, figsize=(14, 6)) # Panel 1: Unimodal distribution profits = [56, 67, 52, 45, 67, 52, 58, 67, 62, 70, 58, 45] counter = Counter(profits) values = sorted(counter.keys()) frequencies = [counter[v] for v in values] bars1 = axes[0].bar(values, frequencies, color='steelblue', alpha=0.7, edgecolor='black', linewidth=1.5) # Highlight mode mode_value = max(counter, key=counter.get) mode_index = values.index(mode_value) bars1[mode_index].set_color('#e74c3c') bars1[mode_index].set_edgecolor('black') bars1[mode_index].set_linewidth(3) axes[0].set_title('Unimodal Distribution\n(One clear mode at $67K)', fontsize=12, fontweight='bold') axes[0].set_xlabel('Monthly Profit ($1000s)', fontsize=11, fontweight='bold') axes[0].set_ylabel('Frequency (Number of Months)', fontsize=11, fontweight='bold') axes[0].grid(True, alpha=0.3, axis='y') # Add annotation for mode axes[0].annotate(f'Mode = ${mode_value}K\n(appears {counter[mode_value]} times)', xy=(mode_value, counter[mode_value]), xytext=(mode_value + 5, counter[mode_value] + 0.3), arrowprops=dict(arrowstyle='->', lw=2, color='red'), fontsize=10, fontweight='bold', bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.7)) # Panel 2: Bimodal distribution bimodal_data = [20, 25, 25, 25, 30, 35, 40, 60, 65, 65, 65, 70, 75] counter2 = Counter(bimodal_data) values2 = sorted(counter2.keys()) frequencies2 = [counter2[v] for v in values2] bars2 = axes[1].bar(values2, frequencies2, color='coral', alpha=0.7, edgecolor='black', linewidth=1.5) # Highlight both modes modes = [k for k, v in counter2.items() if v == max(counter2.values())] for mode in modes: mode_idx = values2.index(mode) bars2[mode_idx].set_color('#2ecc71') bars2[mode_idx].set_edgecolor('black') bars2[mode_idx].set_linewidth(3) axes[1].set_title('Bimodal Distribution\n(Two modes: 25 and 65)', fontsize=12, 
fontweight='bold') axes[1].set_xlabel('Value', fontsize=11, fontweight='bold') axes[1].set_ylabel('Frequency', fontsize=11, fontweight='bold') axes[1].grid(True, alpha=0.3, axis='y') # Add annotations for both modes for mode in modes: axes[1].annotate(f'Mode = {mode}', xy=(mode, counter2[mode]), xytext=(mode, counter2[mode] + 0.5), ha='center', fontsize=9, fontweight='bold', bbox=dict(boxstyle='round,pad=0.3', facecolor='lightgreen', alpha=0.8)) plt.tight_layout() plt.show() ``` --- ## 3.2 Weighted Mean (Media Ponderada) When observations have **different levels of importance**, the simple arithmetic mean is inappropriate. The **weighted mean** accounts for the relative importance of each value. ::: {.callout-note icon="📐"} ## Definition: Weighted Mean $$\bar{X}_W = \frac{\sum X_i W_i}{\sum W_i}$$ where: - $X_i$ = individual observations - $W_i$ = weight (importance) of each observation - Numerator: Sum of (value × weight) - Denominator: Sum of all weights ::: **Common Business Applications:** - **Student grades:** Different assignments worth different percentages - **Investment portfolios:** Different investments with different dollar amounts - **Quality control:** Defect rates from suppliers with different volumes - **Economic indices:** Components weighted by importance (CPI, stock indices) ### Example 3.4: Weighted Mean for Student Grades A student receives the following grades with specified weights: | Assessment | Grade | Weight | |:-----------|:-----:|:------:| | Homework | 85% | 20% | | Midterm | 78% | 30% | | Final Exam | 92% | 50% | **Calculate the final course grade using weighted mean.** **Solution:** $$ \bar{X}_W = \frac{\sum X_i W_i}{\sum W_i} = \frac{(85 \times 0.20) + (78 \times 0.30) + (92 \times 0.50)}{0.20 + 0.30 + 0.50} $$ $$ = \frac{17.0 + 23.4 + 46.0}{1.00} = \frac{86.4}{1.00} = 86.4\% $$ **Comparison with simple mean:** $$ \bar{X}_{simple} = \frac{85 + 78 + 92}{3} = \frac{255}{3} = 85.0\% $$ **Interpretation:** The weighted mean (86.4%) 
properly reflects that the Final Exam counts for 50% of the grade. The simple mean (85.0%) incorrectly treats all assessments equally. Since the student performed best on the most important exam (92% on Final), the weighted average is higher.

**Final Grade: B+ (86.4%)**

::: {.callout-warning icon="⚠️"}
## Critical Error to Avoid

**NEVER use simple mean when data has different importance levels!**

**Wrong:** Averaging percentages directly

**Right:** Weight each percentage by its base amount

**Example:** Investment returns

- Stock A: 12% return on $10,000 investment
- Stock B: 8% return on $90,000 investment

❌ **WRONG:** $(12\% + 8\%) / 2 = 10\%$ average return

✅ **CORRECT:** Weighted mean = $(0.12 \times 10000 + 0.08 \times 90000) / 100000 = 8.4\%$

The $90K investment dominates the portfolio: its 8% return matters far more than the 12% on $10K!
:::

---

## 3.3 Geometric Mean (Media Geométrica)

The **geometric mean** measures the **average rate of change** or **growth rate** over time. It is essential for financial returns, population growth, inflation rates, and any data involving compounding or multiplicative changes.

::: {.callout-note icon="📐"}
## Definition: Geometric Mean

**For raw values:**

$$GM = \sqrt[n]{X_1 \times X_2 \times \cdots \times X_n} = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$

**For growth rates (percentage changes):**

$$GM = \left[\sqrt[n]{(1+r_1)(1+r_2)\cdots(1+r_n)}\right] - 1$$

where $r_i$ represents the growth rate for period $i$ (expressed as a decimal).
:::

**Critical Difference from Arithmetic Mean:**

- **Arithmetic mean:** Sum values and divide (appropriate for **additive** data)
- **Geometric mean:** Multiply values and take nth root (appropriate for **multiplicative/growth** data)

**When to use geometric mean:**

- Investment returns over multiple periods
- Population growth rates
- Inflation rates
- Any data involving **compounding** or **percentage changes over time**

::: {.callout-warning icon="⚠️"}
## Common Mistake: Using Arithmetic Mean for Growth Rates

**This is mathematically incorrect and overstates actual growth whenever returns vary from period to period!**

**Example:** Investment grows 50% in Year 1, then declines 20% in Year 2.

❌ **WRONG (Arithmetic Mean):** $(50\% + (-20\%)) / 2 = 15\%$ average annual return

✅ **CORRECT (Geometric Mean):** $\sqrt{(1.50)(0.80)} - 1 = \sqrt{1.20} - 1 = 1.095 - 1 = 9.5\%$ average annual return

**Verification:**

- $\$1000 \times 1.50 \times 0.80 = \$1200$ (final value)
- $\$1000 \times (1.095)^2 \approx \$1199$ ✓ (geometric mean correctly compounds; the \$1 gap is rounding)
- $\$1000 \times (1.15)^2 = \$1322.50$ ✗ (arithmetic mean overstates growth!)
:::

### Example 3.5: Geometric Mean for Investment Returns

Boggs Financial Services analyzes two mutual funds with identical **arithmetic average returns** of 11% per year:

**Megabucks Fund (5-year returns):** 12%, 10%, 13%, 9%, 11%

**Dynamics Fund (5-year returns):** 13%, 12%, 14%, 10%, 6%

- Megabucks: arithmetic mean = $(12+10+13+9+11)/5 = 11\%$
- Dynamics: arithmetic mean = $(13+12+14+10+6)/5 = 11\%$

**Question:** Which fund actually performed better over the 5-year period?
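Before working the solution by hand, the growth-rate formula from the definition above can be checked with a short sketch. It uses the +50% / -20% example from the warning callout; the function name `geometric_mean_return` is a hypothetical helper introduced here for illustration, not part of any library.

```python
import math

def geometric_mean_return(rates):
    """Average compound growth rate for periodic returns given as decimals.

    Hypothetical helper for illustration: converts each rate r to a growth
    factor (1 + r), multiplies the factors, takes the n-th root, subtracts 1.
    """
    growth_factors = [1 + r for r in rates]
    if any(g <= 0 for g in growth_factors):
        raise ValueError("each growth factor (1 + r) must be positive")
    return math.prod(growth_factors) ** (1 / len(growth_factors)) - 1

# +50% in Year 1, then -20% in Year 2
rates = [0.50, -0.20]

arithmetic = sum(rates) / len(rates)
geometric = geometric_mean_return(rates)
final_value = 1000 * math.prod(1 + r for r in rates)

print(f"Arithmetic mean return: {arithmetic:.1%}")
print(f"Geometric mean return:  {geometric:.1%}")
print(f"Final value of $1,000:  ${final_value:,.2f}")
print(f"Compounding the geometric mean: ${1000 * (1 + geometric) ** 2:,.2f}")
```

Compounding the geometric mean for two periods reproduces the actual final value, which is the property that makes it the appropriate average for growth rates.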
**Solution:**

**Megabucks Fund (geometric mean):**

$$GM_{Megabucks} = \sqrt[5]{(1.12)(1.10)(1.13)(1.09)(1.11)} - 1$$

$$= \sqrt[5]{1.6844} - 1 = 1.1099 - 1 = 0.1099 = 10.99\%$$

**Dynamics Fund (geometric mean):**

$$GM_{Dynamics} = \sqrt[5]{(1.13)(1.12)(1.14)(1.10)(1.06)} - 1$$

$$= \sqrt[5]{1.6823} - 1 = 1.1096 - 1 = 0.1096 = 10.96\%$$

**Interpretation:** Although both funds have the same arithmetic average return (11%), **Megabucks** slightly outperformed **Dynamics** on a compound annual growth rate (CAGR) basis: **10.99% vs. 10.96%**.

**Verification:** If you invested $10,000 in each fund, the final value is the initial amount times the five-year growth factor:

- **Megabucks:** $10,000 × 1.6844 = $16,844 ✓
- **Dynamics:** $10,000 × 1.6823 = $16,823 ✓

Megabucks earned about $21 more over 5 years because its returns were less volatile.

::: {.callout-tip icon="💡"}
## Financial Principle: Volatility Reduces Compound Returns

Even with the same arithmetic average, **more volatile returns** (Dynamics: 6% to 14%) produce **lower compound growth** than **stable returns** (Megabucks: 9% to 13%).

This is why financial advisors emphasize **risk-adjusted returns**, not just average returns. The geometric mean captures this volatility penalty.
:::

### Example 3.6: White-Knuckle Airlines Revenue Growth

White-Knuckle Airlines reports annual revenues over 5 years:

| Year | Revenue | Growth from Previous Year |
|:----:|:--------:|:-------------------------:|
| 1992 | $50,000 | n/a |
| 1993 | $55,000 | 10% increase |
| 1994 | $66,000 | 20% increase |
| 1995 | $60,000 | 9% decrease |
| 1996 | $78,000 | 30% increase |

**Question:** What is the average annual growth rate (CAGR) from 1992 to 1996?
**Solution:**

**Growth factors:**

- 1993: $\$55{,}000 / \$50{,}000 = 1.10$ (10% growth)
- 1994: $\$66{,}000 / \$55{,}000 = 1.20$ (20% growth)
- 1995: $\$60{,}000 / \$66{,}000 = 0.9091$ (9.1% decline)
- 1996: $\$78{,}000 / \$60{,}000 = 1.30$ (30% growth)

**Geometric mean:**

$$GM = \sqrt[4]{(1.10)(1.20)(0.9091)(1.30)} - 1$$

$$= \sqrt[4]{1.56} - 1 = 1.1176 - 1 = 0.1176 = 11.76\%$$

The product of the growth factors equals the overall ratio $\$78{,}000 / \$50{,}000 = 1.56$, so only the first and last revenues are needed to compute the CAGR.

**Verification:** $\$50{,}000 \times (1.1176)^4 = \$50{,}000 \times 1.5601 \approx \$78{,}000$ ✓

**Interpretation:** Despite the 1995 decline, White-Knuckle Airlines achieved an **11.76% compound annual growth rate** over the 4-year period. Against an industry average of 10%, this indicates strong performance.

**For investor presentations:** "We've grown revenue at a **compound annual rate of 11.8%** over the past 4 years, outpacing industry growth by roughly 180 basis points."

```{python}
#| label: fig-geometric-vs-arithmetic
#| fig-cap: "Arithmetic vs. Geometric Mean for Investment Returns"
#| code-fold: true

import matplotlib.pyplot as plt
import numpy as np

# Investment scenarios
years = np.arange(0, 6)
initial = 10000

# Volatile scenario: annual returns in percent (year 0 is the starting point)
annual_returns = [0, 30, -15, 15, 5, 12]
growth_factors = [1 + r / 100 for r in annual_returns[1:]]

# Summary statistics are computed from the data rather than hard-coded
arith_mean = np.mean(annual_returns[1:])
geom_mean = (np.prod(growth_factors) ** (1 / len(growth_factors)) - 1) * 100

# Scenario 1: Steady growth at the arithmetic mean rate every year
steady_values = initial * (1 + arith_mean / 100) ** years

# Scenario 2: Volatile returns compounded year by year
volatile_values = [initial]
for factor in growth_factors:
    volatile_values.append(volatile_values[-1] * factor)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Panel 1: Investment growth comparison
axes[0].plot(years, steady_values, 'g-', linewidth=3, marker='o', markersize=8,
             label=f'Steady {arith_mean:.1f}% growth\n(Geometric Mean = {arith_mean:.1f}%)',
             alpha=0.8)
axes[0].plot(years, volatile_values, 'r--', linewidth=3, marker='s', markersize=8,
             label=(f'Volatile returns\n(Arithmetic Mean = {arith_mean:.1f}%,\n'
                    f'Geometric Mean = {geom_mean:.1f}%)'),
             alpha=0.8)
axes[0].set_title('Investment Growth: Steady vs. Volatile Returns\n'
                  f'(Both have {arith_mean:.1f}% arithmetic average)',
                  fontsize=12, fontweight='bold')
axes[0].set_xlabel('Year', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Portfolio Value ($)', fontsize=11, fontweight='bold')
axes[0].legend(fontsize=10, loc='upper left')
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(9000, 18000)

# Add final value annotations
axes[0].annotate(f'${steady_values[-1]:,.0f}',
                 xy=(5, steady_values[-1]), xytext=(4.2, steady_values[-1] + 500),
                 fontsize=10, fontweight='bold', color='green',
                 bbox=dict(boxstyle='round,pad=0.3', facecolor='lightgreen', alpha=0.7))
axes[0].annotate(f'${volatile_values[-1]:,.0f}\n(Loss due to volatility)',
                 xy=(5, volatile_values[-1]), xytext=(4.2, volatile_values[-1] - 1500),
                 fontsize=10, fontweight='bold', color='red',
                 arrowprops=dict(arrowstyle='->', lw=2, color='red'),
                 bbox=dict(boxstyle='round,pad=0.3', facecolor='pink', alpha=0.7))

# Panel 2: Annual returns bar chart for volatile scenario
colors = ['gray' if r == 0 else 'green' if r > 0 else 'red' for r in annual_returns]
bars = axes[1].bar(years, annual_returns, color=colors, alpha=0.7,
                   edgecolor='black', linewidth=1.5)
axes[1].axhline(y=arith_mean, color='blue', linestyle='--', linewidth=2,
                label=f'Arithmetic Mean = {arith_mean:.1f}%')
axes[1].axhline(y=geom_mean, color='orange', linestyle=':', linewidth=2.5,
                label=f'Geometric Mean = {geom_mean:.1f}%')
axes[1].set_title('Annual Returns for Volatile Portfolio\n(Why Arithmetic Mean Overstates Growth)',
                  fontsize=12, fontweight='bold')
axes[1].set_xlabel('Year', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Annual Return (%)', fontsize=11, fontweight='bold')
axes[1].legend(fontsize=10, loc='upper right')
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].set_xticks(years)

# Add value labels on bars
for bar, ret in zip(bars, annual_returns):
    if ret != 0:
        height = bar.get_height()
        axes[1].text(bar.get_x() + bar.get_width()/2., height, f'{ret:+.0f}%',
                     ha='center', va='bottom' if ret > 0 else 'top',
                     fontweight='bold', fontsize=9)

plt.tight_layout()
plt.show()
```

---

## 3.4 Measures of Dispersion for Ungrouped Data

**Dispersion** (or **variability** or **spread**) measures how scattered or spread out observations are around the central tendency. Two datasets can have the same mean but vastly different dispersions.

::: {.callout-important icon="📊"}
## Why Dispersion Matters

**Central tendency alone is insufficient!**

Consider two manufacturing processes:

- **Process A:** Mean diameter = 10.0 mm, values range from 9.8 to 10.2 mm
- **Process B:** Mean diameter = 10.0 mm, values range from 5.0 to 15.0 mm

Both have identical means, but **Process B is catastrophically inconsistent** and would produce massive quality problems. Dispersion measures reveal this critical difference.
:::

**Primary Measures of Dispersion:**

1. **Range:** Difference between maximum and minimum
2. **Variance ($s^2$ or $\sigma^2$):** Average of squared deviations
3. **Standard Deviation ($s$ or $\sigma$):** Square root of variance
4. **Interquartile Range (IQR):** Range of middle 50% of data
5. **Percentiles:** Values below which certain percentages fall

---

### A. Range and Interquartile Range

The **range** is the simplest measure of dispersion:

$$\text{Range} = X_{max} - X_{min}$$

**Advantages:** Quick and easy to calculate

**Disadvantages:** Uses only two values (ignores all others), extremely sensitive to outliers

The **interquartile range (IQR)** measures the spread of the **middle 50%** of data:

$$IQR = Q_3 - Q_1 = P_{75} - P_{25}$$

where $Q_1$ is the first quartile (25th percentile) and $Q_3$ is the third quartile (75th percentile).
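As a minimal sketch of the two formulas above, the range and IQR can be computed for the processor times from Example 3.2. This assumes NumPy's default percentile method (linear interpolation between order statistics); other textbook quartile conventions can give slightly different $Q_1$ and $Q_3$ values for the same data.

```python
import numpy as np

# Processing times (seconds) for the 20 processors in Example 3.2
times = np.array([3.2, 4.1, 6.3, 1.9, 0.6, 5.4, 5.2, 3.2, 4.9, 6.2,
                  1.8, 1.7, 3.6, 1.5, 2.6, 4.3, 6.1, 2.4, 2.2, 3.3])

# Range uses only the two extreme observations
data_range = times.max() - times.min()

# Quartiles via NumPy's default (linear interpolation) percentile method
q1, q3 = np.percentile(times, [25, 75])
iqr = q3 - q1

print(f"Range = {data_range:.2f} seconds")
print(f"Q1 = {q1:.3f}, Q3 = {q3:.3f}, IQR = {iqr:.2f} seconds")
```

Changing the largest time to an extreme value would change the range immediately, while the IQR would stay the same because the quartiles do not depend on the extremes.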
**Advantages:** Resistant to outliers (ignores extreme 25% on each end) **Disadvantages:** Ignores outer 50% of data ```{python} #| label: fig-range-iqr-concept #| fig-cap: "Range vs. Interquartile Range (IQR)" #| code-fold: true import matplotlib.pyplot as plt import numpy as np # Generate sample data with outliers np.random.seed(42) data = np.concatenate([ np.random.normal(50, 5, 45), # Main data [20, 25, 75, 80, 85] # Outliers ]) data = np.sort(data) # Calculate statistics q1 = np.percentile(data, 25) q3 = np.percentile(data, 75) median = np.median(data) iqr = q3 - q1 data_range = np.max(data) - np.min(data) fig, ax = plt.subplots(figsize=(12, 6)) # Box plot bp = ax.boxplot([data], vert=False, widths=0.6, patch_artist=True, boxprops=dict(facecolor='lightblue', edgecolor='black', linewidth=2), whiskerprops=dict(color='black', linewidth=2), capprops=dict(color='black', linewidth=2), medianprops=dict(color='red', linewidth=3), flierprops=dict(marker='o', markerfacecolor='red', markersize=10, markeredgecolor='black', alpha=0.7)) # Add scatter of all points below box plot y_jitter = np.random.normal(0.8, 0.05, len(data)) ax.scatter(data, y_jitter, alpha=0.5, s=30, color='steelblue', edgecolors='black', linewidths=0.5) # Annotate range ax.annotate('', xy=(np.max(data), 1.3), xytext=(np.min(data), 1.3), arrowprops=dict(arrowstyle='<->', lw=3, color='purple')) ax.text((np.max(data) + np.min(data))/2, 1.4, f'Range = {data_range:.1f}\n(Uses only 2 values:\nmin and max)', ha='center', fontsize=11, fontweight='bold', bbox=dict(boxstyle='round,pad=0.5', facecolor='lavender', alpha=0.8)) # Annotate IQR ax.annotate('', xy=(q3, 0.3), xytext=(q1, 0.3), arrowprops=dict(arrowstyle='<->', lw=3, color='green')) ax.text((q1 + q3)/2, 0.15, f'IQR = {iqr:.1f}\n(Middle 50% of data)', ha='center', fontsize=11, fontweight='bold', bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgreen', alpha=0.8)) # Annotate quartiles ax.axvline(q1, color='green', linestyle='--', linewidth=2, alpha=0.7) 
ax.axvline(q3, color='green', linestyle='--', linewidth=2, alpha=0.7) ax.axvline(median, color='red', linestyle='--', linewidth=2, alpha=0.7) ax.text(q1, 1.6, f'Q1 = {q1:.1f}\n(25th percentile)', ha='center', fontsize=9, bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.9)) ax.text(median, 1.6, f'Median = {median:.1f}\n(50th percentile)', ha='center', fontsize=9, bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.9)) ax.text(q3, 1.6, f'Q3 = {q3:.1f}\n(75th percentile)', ha='center', fontsize=9, bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.9)) ax.set_ylim(0, 1.8) ax.set_xlabel('Value', fontsize=12, fontweight='bold') ax.set_title('Range vs. Interquartile Range (IQR)\nIQR is Robust to Outliers', fontsize=13, fontweight='bold', pad=15) ax.set_yticks([]) ax.grid(True, alpha=0.3, axis='x') plt.tight_layout() plt.show() ``` --- ### B. Variance and Standard Deviation The **variance** and **standard deviation** are the most important and widely used measures of dispersion in statistics. ::: {.callout-note icon="📐"} ## Definitions: Variance and Standard Deviation **Population Variance:** $$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$$ **Sample Variance:** $$s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$$ **Population Standard Deviation:** $$\sigma = \sqrt{\sigma^2}$$ **Sample Standard Deviation:** $$s = \sqrt{s^2}$$ where: - $X_i$ = individual observations - $\mu$ or $\bar{X}$ = mean (population or sample) - $N$ or $n$ = size (population or sample) - $(X_i - \bar{X})$ = deviation from mean - $(X_i - \bar{X})^2$ = squared deviation ::: **Key Characteristics:** - **Variance:** Average of squared deviations from the mean - **Standard Deviation:** Square root of variance (same units as original data!) - Both use **all observations** in the dataset - Larger values indicate greater dispersion - SD is easier to interpret (same units as data) ::: {.callout-important icon="❓"} ## Why Divide by (n-1) Instead of n? 
When calculating **sample variance**, we divide by $n-1$ (called **degrees of freedom**) instead of $n$ for two reasons: 1. **Bessel's Correction:** Samples tend to be slightly less dispersed than populations. Dividing by $n-1$ (smaller divisor) inflates the variance slightly, correcting this underestimation bias. 2. **Degrees of Freedom:** We "lose" one degree of freedom because we first calculated $\bar{X}$ from the same data. Once we know $n-1$ deviations and $\bar{X}$, the final deviation is predetermined. **Example:** Choose 4 numbers that average 10. You can freely choose the first 3 (say: 5, 8, 12), but the 4th is forced to be 15 to make the average equal 10. **Result:** Sample variance ($s^2$) with $(n-1)$ is an **unbiased estimator** of population variance ($\sigma^2$). ::: ### Example 3.7: Investment Risk Comparison (Variance and SD) Boggs compares two mutual funds with identical 5-year average returns (11%): **Megabucks Fund:** 12%, 10%, 13%, 9%, 11% **Dynamics Fund:** 13%, 12%, 14%, 10%, 6% Both have $\bar{X} = 11\%$, but which is riskier (more volatile)? **Solution:** **Megabucks Fund:** | Return ($X_i$) | Deviation ($X_i - \bar{X}$) | Squared Deviation | |:--------------:|:---------------------------:|:-----------------:| | 12% | 1% | 1 | | 10% | -1% | 1 | | 13% | 2% | 4 | | 9% | -2% | 4 | | 11% | 0% | 0 | | **Sum** | | **10** | : Megabucks Variance Calculation {.striped .hover} $$s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1} = \frac{10}{5-1} = \frac{10}{4} = 2.5$$ $$s = \sqrt{2.5} = 1.58\%$$ **Dynamics Fund:** | Return ($X_i$) | Deviation ($X_i - \bar{X}$) | Squared Deviation | |:--------------:|:---------------------------:|:-----------------:| | 13% | 2% | 4 | | 12% | 1% | 1 | | 14% | 3% | 9 | | 10% | -1% | 1 | | 6% | -5% | 25 | | **Sum** | | **40** | : Dynamics Variance Calculation {.striped .hover} $$s^2 = \frac{40}{4} = 10.0$$ $$s = \sqrt{10.0} = 3.16\%$$ **Interpretation:** **Dynamics Fund** has **4 times the variance** (10.0 vs. 
2.5) and **2 times the standard deviation** (3.16% vs. 1.58%) compared to Megabucks.

**Investment Recommendation:** For **risk-averse investors**, choose **Megabucks**: it offers the same average return (11%) with **half the volatility**. Dynamics is suitable only for investors willing to accept higher risk for potential upside.

**Financial Principle:** In finance, **standard deviation is the standard proxy for risk**. Lower SD = more predictable = less risky.

---

### Example 3.8: Stock Price Volatility (Boggs Financial Services)

Robert Boggs analyzes the volatility of a client's stock portfolio over 7 trading days to assess investment risk.

**Daily Closing Prices:** $87, $120, $54, $92, $73, $80, $63

**Calculate:** Mean, variance, and standard deviation.

**Solution:**

**Step 1: Calculate mean**

$$\bar{X} = \frac{87 + 120 + 54 + 92 + 73 + 80 + 63}{7} = \frac{569}{7} = 81.29$$

**Step 2: Calculate deviations and squared deviations**

Deviations are shown rounded to two decimals; squared deviations are computed from the unrounded mean ($569/7$).

| Day | Price ($X_i$) | Deviation ($X_i - \bar{X}$) | Squared Deviation $(X_i - \bar{X})^2$ |
|:---:|:-------------:|:---------------------------:|:-------------------------------------:|
| 1 | $87 | $5.71 | 32.65 |
| 2 | $120 | $38.71 | 1,498.80 |
| 3 | $54 | -$27.29 | 744.51 |
| 4 | $92 | $10.71 | 114.80 |
| 5 | $73 | -$8.29 | 68.65 |
| 6 | $80 | -$1.29 | 1.65 |
| 7 | $63 | -$18.29 | 334.37 |
| **Sum** | | **$0.00** ✓ | **2,795.43** |

: Stock Price Variance Calculation {.striped .hover}

**Step 3: Calculate variance and standard deviation**

$$s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1} = \frac{2795.43}{7-1} = \frac{2795.43}{6} = 465.90$$

$$s = \sqrt{465.90} = 21.58$$

**Interpretation:** The stock has a mean price of **$81.29** with a standard deviation of **$21.58** (a coefficient of variation of about 26.6%).
This indicates **high volatility**—the stock price fluctuates substantially from day to day. **Investment Implication:** This level of volatility (±$21.58 or roughly 26%) is typical for **growth stocks** or **small-cap stocks**. Conservative investors seeking stable income should avoid this stock. Aggressive investors comfortable with risk might find it acceptable. ```{python} #| label: fig-variance-concept #| fig-cap: "Visualizing Variance: Squared Deviations from the Mean" #| code-fold: true import matplotlib.pyplot as plt import numpy as np # Stock price data days = np.arange(1, 8) prices = np.array([87, 120, 54, 92, 73, 80, 63]) mean_price = np.mean(prices) std_price = np.std(prices, ddof=1) fig, axes = plt.subplots(1, 2, figsize=(14, 6)) # Panel 1: Line plot with mean and ±1 SD bands axes[0].plot(days, prices, 'o-', linewidth=3, markersize=10, color='steelblue', label='Daily Closing Price', markeredgecolor='black', markeredgewidth=1.5) axes[0].axhline(y=mean_price, color='red', linestyle='--', linewidth=2.5, label=f'Mean = ${mean_price:.2f}') axes[0].axhline(y=mean_price + std_price, color='orange', linestyle=':', linewidth=2, label=f'Mean + 1 SD = ${mean_price + std_price:.2f}') axes[0].axhline(y=mean_price - std_price, color='orange', linestyle=':', linewidth=2, label=f'Mean - 1 SD = ${mean_price - std_price:.2f}') # Shade ±1 SD region axes[0].fill_between(days, mean_price - std_price, mean_price + std_price, alpha=0.2, color='orange', label='±1 SD Region (68% for normal data)') # Add vertical deviation lines for day, price in zip(days, prices): axes[0].plot([day, day], [mean_price, price], 'k--', alpha=0.3, linewidth=1) axes[0].set_title('Stock Price Volatility: Daily Prices vs. 
Mean ± SD', fontsize=12, fontweight='bold') axes[0].set_xlabel('Trading Day', fontsize=11, fontweight='bold') axes[0].set_ylabel('Stock Price ($)', fontsize=11, fontweight='bold') axes[0].legend(fontsize=9, loc='upper left') axes[0].grid(True, alpha=0.3) axes[0].set_xticks(days) # Panel 2: Bar chart of squared deviations (variance components) deviations = prices - mean_price squared_devs = deviations ** 2 colors = ['red' if d > 0 else 'blue' for d in deviations] bars = axes[1].bar(days, squared_devs, color=colors, alpha=0.7, edgecolor='black', linewidth=1.5) # Add variance line (mean of squared deviations) variance = np.var(prices, ddof=1) axes[1].axhline(y=variance, color='green', linestyle='--', linewidth=2.5, label=f'Variance = {variance:.2f}\n(Mean of squared deviations)') axes[1].set_title('Squared Deviations from Mean\n(Components of Variance)', fontsize=12, fontweight='bold') axes[1].set_xlabel('Trading Day', fontsize=11, fontweight='bold') axes[1].set_ylabel('Squared Deviation $(X_i - \\bar{X})^2$', fontsize=11, fontweight='bold') axes[1].legend(fontsize=10, loc='upper right') axes[1].grid(True, alpha=0.3, axis='y') axes[1].set_xticks(days) # Add value labels on bars for bar, sq_dev, price in zip(bars, squared_devs, prices): height = bar.get_height() axes[1].text(bar.get_x() + bar.get_width()/2., height, f'${price}\n({sq_dev:.0f})', ha='center', va='bottom', fontweight='bold', fontsize=8) plt.tight_layout() plt.show() ``` ::: {.callout-tip icon="💡"} ## Business Application: Using Standard Deviation **Standard deviation is the universal language of risk:** - **Finance:** Portfolio volatility, Value-at-Risk (VaR), Sharpe ratio - **Manufacturing:** Process capability indices (Cp, Cpk), Six Sigma - **Quality Control:** Control charts (±3 SD limits) - **Healthcare:** Normal ranges for lab tests (mean ± 2 SD) - **Marketing:** Customer lifetime value variability **Rule of Thumb:** For **normally distributed data**, approximately: - **68%** of observations fall 
within **±1 SD** of the mean - **95%** fall within **±2 SD** - **99.7%** fall within **±3 SD** We'll formalize this as the **Empirical Rule** in Section 3.7. ::: --- ## 3.5 Measures of Central Tendency and Dispersion for Grouped Data When data are presented in **frequency distributions** (grouped into class intervals), we cannot compute exact statistics without the raw data. Instead, we use the **class midpoint** ($M$) to represent all values in that class. **Key Assumption:** All observations in a class are assumed to be located at the **class midpoint**. ### A. Mean for Grouped Data ::: {.callout-note icon="📐"} ## Formula: Mean for Grouped Data $$\bar{X}_g = \frac{\sum f \cdot M}{n} = \frac{\sum f \cdot M}{\sum f}$$ where: - $f$ = frequency (number of observations in each class) - $M$ = class midpoint - $n = \sum f$ = total number of observations - $\bar{X}_g$ = grouped data mean (approximation) ::: **Computational Steps:** 1. Calculate class midpoint: $M = \frac{\text{Lower Limit} + \text{Upper Limit}}{2}$ 2. Multiply each midpoint by its frequency: $f \cdot M$ 3. Sum all products: $\sum f \cdot M$ 4. Divide by total frequency: $\bar{X}_g = \frac{\sum f \cdot M}{n}$ ### Example 3.9: Mean Passengers per Flight (P&P Airlines) P&P Airlines analyzes passenger loads on 40 regional flights: | Passengers (Class Interval) | Frequency ($f$) | Midpoint ($M$) | $f \cdot M$ | |:---------------------------:|:---------------:|:--------------:|:-----------:| | 50 – 59 | 3 | 54.5 | 163.5 | | 60 – 69 | 5 | 64.5 | 322.5 | | 70 – 79 | 9 | 74.5 | 670.5 | | 80 – 89 | 14 | 84.5 | 1,183.0 | | 90 – 99 | 7 | 94.5 | 661.5 | | 100 – 109 | 2 | 104.5 | 209.0 | | **Total** | **40** | | **3,210.0** | : Grouped Data Mean Calculation for P&P Airlines {.striped .hover} **Solution:** $$\bar{X}_g = \frac{\sum f \cdot M}{n} = \frac{3210.0}{40} = 80.25 \text{ passengers}$$ **Interpretation:** The average passenger load per flight is approximately **80 passengers**. 
Because every observation is treated as if it sat at its class midpoint, this value approximates the mean that the raw data would give; the two generally differ slightly.

---

### B. Median for Grouped Data

::: {.callout-note icon="📐"}
## Formula: Median for Grouped Data

$$\text{Median} = L_{md} + \left[\frac{\frac{n}{2} - F}{f_{md}}\right] \cdot C$$

where:

- $L_{md}$ = lower limit of the **median class** (class containing the median)
- $n$ = total number of observations
- $F$ = cumulative frequency **before** the median class
- $f_{md}$ = frequency **of** the median class
- $C$ = class width (interval size)
:::

**Steps to Find Grouped Data Median:**

1. Calculate $n/2$ (position of median)
2. Find **median class**: First class where cumulative frequency $\geq n/2$
3. Identify $L_{md}$, $F$, $f_{md}$, and $C$
4. Apply formula

### Example 3.10: Median Passengers per Flight (P&P Airlines)

Using the same P&P Airlines data:

| Passengers | Frequency ($f$) | Cumulative Frequency |
|:----------:|:---------------:|:--------------------:|
| 50 – 59 | 3 | 3 |
| 60 – 69 | 5 | 8 |
| 70 – 79 | 9 | 17 |
| **80 – 89**| **14** | **31** ← Contains median |
| 90 – 99 | 7 | 38 |
| 100 – 109 | 2 | 40 |

: Finding the Median Class {.striped .hover}

**Solution:**

1. $n/2 = 40/2 = 20$ (median position)
2. **Median class:** 80–89 (the first class whose cumulative frequency, 31, reaches 20)
3. Parameters:
   - $L_{md} = 80$ (lower limit)
   - $F = 17$ (cumulative frequency before median class)
   - $f_{md} = 14$ (frequency of median class)
   - $C = 10$ (class width: 89 - 80 + 1 = 10)
4. Calculate:

$$\text{Median} = 80 + \left[\frac{20 - 17}{14}\right] \cdot 10 = 80 + \left[\frac{3}{14}\right] \cdot 10 = 80 + 2.14 = 82.14$$

**Interpretation:** The median passenger load is **82.14 passengers**. About half the flights carry fewer than 82 passengers and about half carry more.

**Comparison:** Mean = 80.25, Median = 82.14. The median is slightly higher, suggesting a slight **left skew** (a few flights with low passenger counts pull the mean down).

---

### C. 
Mode for Grouped Data ::: {.callout-note icon="📐"} ## Formula: Mode for Grouped Data $$\text{Mode} = L_{mo} + \left[\frac{D_1}{D_1 + D_2}\right] \cdot C$$ where: - $L_{mo}$ = lower limit of the **modal class** (class with highest frequency) - $D_1$ = difference between modal class frequency and **preceding** class frequency - $D_2$ = difference between modal class frequency and **following** class frequency - $C$ = class width ::: ### Example 3.11: Mode for P&P Airlines **Modal class:** 80–89 (frequency = 14, highest) **Parameters:** - $L_{mo} = 80$ - $D_1 = 14 - 9 = 5$ (modal frequency minus preceding frequency) - $D_2 = 14 - 7 = 7$ (modal frequency minus following frequency) - $C = 10$ **Solution:** $$\text{Mode} = 80 + \left[\frac{5}{5 + 7}\right] \cdot 10 = 80 + \left[\frac{5}{12}\right] \cdot 10 = 80 + 4.17 = 84.17$$ **Interpretation:** The most common passenger load is approximately **84 passengers**. **Summary for P&P Airlines:** - **Mean** = 80.25 passengers - **Median** = 82.14 passengers - **Mode** = 84.17 passengers These are reasonably close, indicating a fairly symmetric distribution with slight left skew. 
```{python} #| label: fig-grouped-data-distribution #| fig-cap: "P&P Airlines Passenger Distribution with Mean, Median, and Mode" #| code-fold: true import matplotlib.pyplot as plt import numpy as np # P&P Airlines data class_midpoints = np.array([54.5, 64.5, 74.5, 84.5, 94.5, 104.5]) frequencies = np.array([3, 5, 9, 14, 7, 2]) class_lower = np.array([50, 60, 70, 80, 90, 100]) class_upper = np.array([59, 69, 79, 89, 99, 109]) # Calculated statistics mean_grouped = 80.25 median_grouped = 82.14 mode_grouped = 84.17 fig, axes = plt.subplots(1, 2, figsize=(14, 6)) # Panel 1: Histogram with central tendency measures width = 10 bars = axes[0].bar(class_midpoints, frequencies, width=width*0.8, color='steelblue', alpha=0.7, edgecolor='black', linewidth=1.5, label='Frequency') # Add mean, median, mode lines axes[0].axvline(x=mean_grouped, color='red', linestyle='--', linewidth=2.5, label=f'Mean = {mean_grouped:.2f}') axes[0].axvline(x=median_grouped, color='green', linestyle='-.', linewidth=2.5, label=f'Median = {median_grouped:.2f}') axes[0].axvline(x=mode_grouped, color='purple', linestyle=':', linewidth=3, label=f'Mode = {mode_grouped:.2f}') axes[0].set_title('P&P Airlines: Passenger Distribution\n(Grouped Data with Mean, Median, Mode)', fontsize=12, fontweight='bold') axes[0].set_xlabel('Number of Passengers', fontsize=11, fontweight='bold') axes[0].set_ylabel('Frequency (Number of Flights)', fontsize=11, fontweight='bold') axes[0].legend(fontsize=10, loc='upper right') axes[0].grid(True, alpha=0.3, axis='y') axes[0].set_xlim(45, 115) # Add frequency labels on bars for bar, freq in zip(bars, frequencies): height = bar.get_height() axes[0].text(bar.get_x() + bar.get_width()/2., height, f'{int(freq)}', ha='center', va='bottom', fontweight='bold', fontsize=10) # Panel 2: Cumulative frequency plot (ogive) cumulative_freq = np.cumsum(frequencies) total_n = cumulative_freq[-1] # Plot cumulative frequency curve axes[1].plot(class_upper, cumulative_freq, 'o-', linewidth=3, 
markersize=10, color='darkblue', markeredgecolor='black', markeredgewidth=1.5, label='Cumulative Frequency') # Add median reference axes[1].axhline(y=total_n/2, color='green', linestyle='--', linewidth=2, label=f'n/2 = {total_n/2:.0f} (Median Position)') axes[1].axvline(x=median_grouped, color='green', linestyle='-.', linewidth=2, alpha=0.7) # Highlight median finding axes[1].plot([50, median_grouped], [total_n/2, total_n/2], 'g:', linewidth=2) axes[1].plot([median_grouped, median_grouped], [0, total_n/2], 'g:', linewidth=2) axes[1].plot(median_grouped, total_n/2, 'go', markersize=12, markeredgecolor='black', markeredgewidth=2) axes[1].annotate(f'Median ≈ {median_grouped:.1f}', xy=(median_grouped, total_n/2), xytext=(median_grouped + 8, total_n/2 - 5), fontsize=11, fontweight='bold', bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgreen', alpha=0.8), arrowprops=dict(arrowstyle='->', lw=2, color='green')) axes[1].set_title('Cumulative Frequency Distribution (Ogive)\nFinding the Median Graphically', fontsize=12, fontweight='bold') axes[1].set_xlabel('Number of Passengers (Upper Class Limit)', fontsize=11, fontweight='bold') axes[1].set_ylabel('Cumulative Frequency', fontsize=11, fontweight='bold') axes[1].legend(fontsize=10, loc='upper left') axes[1].grid(True, alpha=0.3) axes[1].set_xlim(45, 115) axes[1].set_ylim(0, 45) plt.tight_layout() plt.show() ``` --- ### D. 
Variance and Standard Deviation for Grouped Data

::: {.callout-note icon="📐"}
## Formula: Variance for Grouped Data

$$s^2 = \frac{\sum f \cdot M^2 - n \cdot \bar{X}_g^2}{n - 1}$$

where:

- $f$ = frequency
- $M$ = class midpoint
- $n = \sum f$ = total observations
- $\bar{X}_g$ = grouped data mean

**Standard Deviation:**

$$s = \sqrt{s^2}$$
:::

### Example 3.12: Standard Deviation for P&P Airlines

Using the same passenger data:

| Passengers | $f$ | $M$ | $f \cdot M$ | $M^2$ | $f \cdot M^2$ |
|:----------:|:---:|:-----:|:-----------:|:--------:|:-------------:|
| 50 – 59 | 3 | 54.5 | 163.5 | 2,970.25 | 8,910.75 |
| 60 – 69 | 5 | 64.5 | 322.5 | 4,160.25 | 20,801.25 |
| 70 – 79 | 9 | 74.5 | 670.5 | 5,550.25 | 49,952.25 |
| 80 – 89 | 14 | 84.5 | 1,183.0 | 7,140.25 | 99,963.50 |
| 90 – 99 | 7 | 94.5 | 661.5 | 8,930.25 | 62,511.75 |
| 100 – 109 | 2 | 104.5 | 209.0 | 10,920.25| 21,840.50 |
| **Total** | **40** | | **3,210.0** | | **263,980.00**|

: Grouped Data Variance Calculation {.striped .hover}

**Solution:**

$$s^2 = \frac{\sum f \cdot M^2 - n \cdot \bar{X}_g^2}{n - 1} = \frac{263,980 - 40(80.25)^2}{39}$$

$$= \frac{263,980 - 40(6,440.0625)}{39} = \frac{263,980 - 257,602.50}{39} = \frac{6,377.50}{39} = 163.53$$

$$s = \sqrt{163.53} = 12.79 \text{ passengers}$$

**Interpretation:** The standard deviation is **12.79 passengers**, a moderate spread relative to a mean of about 80. The interval mean ± 1 SD runs from about **67 to 93 passengers**; if passenger loads are roughly bell-shaped, about two-thirds of flights fall in that range.

---

## 3.6 Percentiles, Quartiles, and Deciles

**Percentiles** divide a dataset into **100 equal parts**. The **Pth percentile** is the value below which **P percent** of the observations fall.

**Special Cases:** - **Quartiles:** Divide data into 4 parts (Q1 = 25th, Q2 = 50th = median, Q3 = 75th) - **Deciles:** Divide data into 10 parts (D1 = 10th, D2 = 20th, ..., D9 = 90th) - **Median:** 50th percentile (P₅₀ = Q2 = D5) ::: {.callout-note icon="📐"} ## Formula: Percentile Position (Ungrouped Data) $$L_P = (n + 1) \times \frac{P}{100}$$ where: - $L_P$ = **position** (location) of the Pth percentile - $n$ = number of observations - $P$ = percentile (1 to 99) **If $L_P$ is not an integer, interpolate:** Value = Value at floor($L_P$) + fractional part × (next value - value at floor) ::: ### Example 3.13: Executive Salaries (Percentiles) A consulting firm surveys 12 executives' annual salaries (in thousands): **Data (sorted):** 42, 47, 51, 54, 58, 60, 62, 65, 68, 72, 78, 85 **Find:** 25th percentile (Q1), 50th percentile (Median), and 75th percentile (Q3). **Solution:** **Q1 (25th percentile):** $$L_{25} = (12 + 1) \times \frac{25}{100} = 13 \times 0.25 = 3.25$$ Position 3.25 means: 25% of way between 3rd value (51) and 4th value (54). $$Q_1 = 51 + 0.25(54 - 51) = 51 + 0.75 = 51.75 \text{ thousand}$$ **Q2 (50th percentile, Median):** $$L_{50} = 13 \times 0.50 = 6.5$$ $$\text{Median} = 60 + 0.5(62 - 60) = 60 + 1 = 61 \text{ thousand}$$ **Q3 (75th percentile):** $$L_{75} = 13 \times 0.75 = 9.75$$ $$Q_3 = 68 + 0.75(72 - 68) = 68 + 3 = 71 \text{ thousand}$$ **Interpretation:** - **25% of executives** earn less than **$51,750** - **50% of executives** earn less than **$61,000** (median) - **75% of executives** earn less than **$71,000** **Interquartile Range (IQR):** $$IQR = Q_3 - Q_1 = 71 - 51.75 = 19.25 \text{ thousand}$$ The **middle 50%** of executive salaries span a range of **$19,250**. 
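The position formula and the interpolation step can be written out directly. The following is a minimal sketch under stated assumptions: the function name `percentile_n_plus_1` is hypothetical, the data must already be sorted, and clamping positions that fall outside 1..n is a choice made here, not part of the formula.

```python
import math


def percentile_n_plus_1(sorted_values, p):
    """Value at position L_P = (n + 1) * P / 100 in already-sorted data."""
    n = len(sorted_values)
    if n == 0:
        raise ValueError("need at least one observation")
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    position = (n + 1) * p / 100
    # Assumption of this sketch: clamp positions that fall outside 1..n
    position = min(max(position, 1), n)
    lower_rank = math.floor(position)   # 1-based rank of the lower neighbour
    fraction = position - lower_rank    # how far to move toward the next value
    lower_value = sorted_values[lower_rank - 1]
    if fraction == 0:
        return lower_value
    upper_value = sorted_values[lower_rank]
    return lower_value + fraction * (upper_value - lower_value)


# Executive salaries from Example 3.13 (in thousands, sorted)
salaries = [42, 47, 51, 54, 58, 60, 62, 65, 68, 72, 78, 85]

q1 = percentile_n_plus_1(salaries, 25)
q2 = percentile_n_plus_1(salaries, 50)
q3 = percentile_n_plus_1(salaries, 75)

print(q1, q2, q3)   # 51.75 61.0 71.0
print(q3 - q1)      # 19.25
```

Statistical software does not always use this rule. NumPy's `np.percentile`, for example, defaults to a different interpolation scheme and returns 53.25 and 69.0 for Q1 and Q3 on these data, so small differences from the hand calculation are expected rather than a sign of error.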
```{python}
#| label: fig-percentiles-boxplot
#| fig-cap: "Executive Salaries: Box Plot Showing Quartiles and Percentiles"
#| code-fold: true

import matplotlib.pyplot as plt
import numpy as np

# Executive salary data (in thousands), already sorted
salaries = np.array([42, 47, 51, 54, 58, 60, 62, 65, 68, 72, 78, 85])


def pct(p):
    """Percentile at position (n + 1) * P / 100, matching Example 3.13.

    method='weibull' requires NumPy >= 1.22. NumPy's default 'linear'
    method would give Q1 = 53.25 and Q3 = 69.00 instead of 51.75 and 71.00.
    """
    return float(np.percentile(salaries, p, method='weibull'))


q1, median, q3 = pct(25), pct(50), pct(75)
iqr = q3 - q1

fig, axes = plt.subplots(2, 1, figsize=(12, 10))

# Panel 1: Box plot drawn from the quartiles computed above, so the box
# agrees with the worked example. No salary lies more than 1.5 * IQR
# beyond the quartiles, so the whiskers end at the minimum and maximum.
box_stats = [dict(med=median, q1=q1, q3=q3,
                  whislo=float(salaries.min()), whishi=float(salaries.max()),
                  fliers=[])]
bp = axes[0].bxp(box_stats, vert=False, widths=0.5, patch_artist=True,
                 boxprops=dict(facecolor='lightblue', edgecolor='black', linewidth=2),
                 whiskerprops=dict(color='black', linewidth=2),
                 capprops=dict(color='black', linewidth=2),
                 medianprops=dict(color='red', linewidth=3),
                 flierprops=dict(marker='o', markerfacecolor='red', markersize=10))

# Add individual points below the box plot (seeded jitter for reproducibility)
rng = np.random.default_rng(42)
y_jitter = rng.normal(0.8, 0.03, len(salaries))
axes[0].scatter(salaries, y_jitter, alpha=0.6, s=80, color='darkblue',
                edgecolors='black', linewidths=1, zorder=3)

# Annotate quartiles
axes[0].axvline(q1, color='green', linestyle='--', linewidth=2, alpha=0.7)
axes[0].axvline(median, color='red', linestyle='--', linewidth=2, alpha=0.7)
axes[0].axvline(q3, color='green', linestyle='--', linewidth=2, alpha=0.7)

axes[0].text(q1, 1.35, f'Q1 = ${q1:.2f}K\n(25th percentile)', ha='center', fontsize=10,
             bbox=dict(boxstyle='round,pad=0.4', facecolor='lightgreen', alpha=0.8))
axes[0].text(median, 1.35, f'Median = ${median:.2f}K\n(50th percentile)', ha='center', fontsize=10,
             bbox=dict(boxstyle='round,pad=0.4', facecolor='pink', alpha=0.8))
axes[0].text(q3, 1.35, f'Q3 = ${q3:.2f}K\n(75th percentile)', ha='center', fontsize=10,
             bbox=dict(boxstyle='round,pad=0.4', facecolor='lightgreen', alpha=0.8))

# Annotate IQR
axes[0].annotate('', xy=(q3, 0.3), xytext=(q1, 0.3),
                 arrowprops=dict(arrowstyle='<->', lw=2.5, color='purple'))
axes[0].text((q1 + q3)/2, 0.15, f'IQR = ${iqr:.2f}K\n(Middle 50% of data)',
             ha='center', fontsize=11, fontweight='bold',
             bbox=dict(boxstyle='round,pad=0.4', facecolor='lavender', alpha=0.9))

axes[0].set_ylim(0, 1.6)
axes[0].set_xlabel('Annual Salary ($ thousands)', fontsize=12, fontweight='bold')
axes[0].set_title('Executive Salaries: Box Plot Showing Quartiles and IQR',
                  fontsize=13, fontweight='bold', pad=15)
axes[0].set_yticks([])
axes[0].grid(True, alpha=0.3, axis='x')

# Panel 2: Percentile reference chart
percentiles = [10, 25, 50, 75, 90]
percentile_values = [pct(p) for p in percentiles]
percentile_labels = ['P10\n(10th)', 'Q1\n(25th)', 'Median\n(50th)', 'Q3\n(75th)', 'P90\n(90th)']
colors_bars = ['orange', 'green', 'red', 'green', 'orange']

bars = axes[1].bar(percentile_labels, percentile_values, color=colors_bars,
                   alpha=0.7, edgecolor='black', linewidth=2)

# Add value labels
for bar, val, level in zip(bars, percentile_values, percentiles):
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height,
                 f'${val:.1f}K\n({level}%)', ha='center', va='bottom',
                 fontweight='bold', fontsize=11)

axes[1].set_title('Key Percentiles for Executive Salary Distribution',
                  fontsize=13, fontweight='bold', pad=15)
axes[1].set_ylabel('Annual Salary ($ thousands)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Percentile', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].set_ylim(0, 95)

# Interpretation text built from the computed values, so it cannot drift
# from the chart. Dollar signs are escaped to avoid mathtext parsing.
interpretation = "Interpretation:\n" + "\n".join(
    f"• {level}% of executives earn less than \\${val:.2f}K"
    for level, val in zip(percentiles, percentile_values)
)
axes[1].text(0.98, 0.97, interpretation, transform=axes[1].transAxes, fontsize=10,
             verticalalignment='top', horizontalalignment='right',
             bbox=dict(boxstyle='round,pad=0.6', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()
```

::: {.callout-tip 
icon="💼"}
## Business Applications of Percentiles

**Percentiles are essential for:**

1. **Salary Benchmarking:** "Our median salary (P50) is at the 75th percentile of industry standards"
2. **Performance Metrics:** "Top 10% of sales reps (P90) exceed $2M in annual revenue"
3. **Quality Control:** "Reject parts above P95 or below P5 in dimension distribution"
4. **Financial Analysis:** "Value-at-Risk (VaR) uses P5 to estimate worst 5% scenarios"
5. **Student Assessment:** "Your test score is at the 85th percentile (better than 85% of students)"
6. **Healthcare:** "Blood pressure above P90 for age group indicates hypertension risk"

**Percentiles provide context** that means and standard deviations cannot—they show **relative position** in a distribution.
:::

---

## 3.7 Uses of the Standard Deviation

Standard deviation is far more than just a dispersion measure—it provides **powerful insights** into data distribution patterns and enables **probabilistic statements** about where observations are likely to fall.

### A. Chebyshev's Theorem

**Chebyshev's Theorem** provides a **conservative lower bound** on the proportion of observations within K standard deviations of the mean, **for any distribution** (no assumptions about shape).

::: {.callout-important icon="🎯"}
## Chebyshev's Theorem

For **any distribution** (symmetric, skewed, bimodal, etc.), **at least** this proportion of observations lies within K standard deviations of the mean:

$$\text{Proportion} \geq 1 - \frac{1}{K^2}$$

where $K > 1$ is the number of standard deviations from the mean.

**Common values:** | K | Interval | Minimum Proportion | Percentage | |:-:|:-----------------:|:------------------:|:----------:| | 2 | $\mu \pm 2\sigma$ | $1 - 1/4 = 3/4$ | **≥ 75%** | | 3 | $\mu \pm 3\sigma$ | $1 - 1/9 = 8/9$ | **≥ 89%** | | 4 | $\mu \pm 4\sigma$ | $1 - 1/16 = 15/16$ | **≥ 94%** | **Key Point:** This is a **conservative estimate**. Most real distributions have MORE observations within K standard deviations than Chebyshev predicts. ::: **Critical Distinction:** - **Chebyshev's Theorem:** Works for **ANY** distribution (very general, but conservative) - **Empirical Rule:** Works ONLY for **normal** distributions (specific, but much more precise) ### Example 3.14: P&P Airlines Passenger Load Distribution (Chebyshev) P&P Airlines has: - Mean passenger load: $\mu = 80$ passengers - Standard deviation: $\sigma = 12$ passengers **No assumption** is made about the shape of the distribution. **Question:** Using Chebyshev's Theorem, what can we say about the proportion of flights within ±2 standard deviations of the mean? **Solution:** For $K = 2$: $$\text{Proportion} \geq 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 0.75 = 75\%$$ **Interval:** $$\mu \pm 2\sigma = 80 \pm 2(12) = 80 \pm 24 = [56, 104] \text{ passengers}$$ **Interpretation:** **At least 75%** of P&P flights carry between **56 and 104 passengers**, regardless of the distribution shape. This is a guaranteed minimum—the actual proportion could be much higher (and usually is). **For $K = 3$:** $$\text{Proportion} \geq 1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 0.889 = 89\%$$ $$\mu \pm 3\sigma = 80 \pm 3(12) = [44, 116] \text{ passengers}$$ **At least 89%** of flights carry between **44 and 116 passengers**. 
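The bound and the matching interval in Example 3.14 take only a couple of lines to compute. The sketch below is illustrative; the function names `chebyshev_bound` and `chebyshev_interval` are hypothetical, and the inputs are the P&P Airlines figures used above.

```python
def chebyshev_bound(k):
    """Minimum proportion of observations within k standard deviations."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2


def chebyshev_interval(mean, sd, k):
    """Interval mean - k*sd to mean + k*sd that the bound refers to."""
    return (mean - k * sd, mean + k * sd)


mu, sigma = 80, 12  # P&P Airlines, Example 3.14

for k in (2, 3, 4):
    low, high = chebyshev_interval(mu, sigma, k)
    share = chebyshev_bound(k)
    print(f"K = {k}: at least {share:.1%} of flights between {low} and {high} passengers")
```

For K = 2 this gives at least 75.0% between 56 and 104 passengers, and for K = 3 at least 88.9% between 44 and 116, in line with the worked solution.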
::: {.callout-tip icon="💡"} ## When to Use Chebyshev's Theorem **Use Chebyshev when:** - Distribution shape is **unknown** or **non-normal** - Data are **heavily skewed** - You need a **conservative, guaranteed** lower bound - Working with small samples where normality is uncertain **Don't use Chebyshev when:** - You know data are **normally distributed** (use Empirical Rule instead—it's much more precise) ::: --- ### B. The Empirical Rule (Normal Distribution) When data follow a **normal distribution** (bell-shaped, symmetric), we can make much more precise statements using the **Empirical Rule**. ::: {.callout-note icon="📊"} ## The Empirical Rule (68-95-99.7 Rule) For data that are **approximately normally distributed**: - **68.3%** of observations fall within **±1 SD** of the mean: $\mu \pm \sigma$ - **95.5%** fall within **±2 SD**: $\mu \pm 2\sigma$ - **99.7%** fall within **±3 SD**: $\mu \pm 3\sigma$ **Implication:** Nearly **all** observations (99.7%) lie within 3 standard deviations of the mean. **Visual representation:** The normal curve is bell-shaped and symmetric around the mean, with areas under the curve representing these percentages. The Python visualization below demonstrates this distribution. ::: **Comparison: Chebyshev vs. Empirical Rule** | Interval | Chebyshev (ANY distribution) | Empirical Rule (NORMAL only) | |:--------:|:----------------------------:|:----------------------------:| | $\mu \pm 1\sigma$ | ≥ 0% (not useful) | **68.3%** | | $\mu \pm 2\sigma$ | ≥ 75% | **95.5%** | | $\mu \pm 3\sigma$ | ≥ 89% | **99.7%** | : Chebyshev's Theorem vs. Empirical Rule {.striped .hover} The Empirical Rule is **much more precise**, but requires normality. ### Example 3.15: Ski Resort Wait Times (Empirical Rule) A ski resort tracks wait times for the main chairlift on Saturdays. 
Historical data show wait times are **normally distributed** with: - Mean: $\mu = 10$ minutes - Standard deviation: $\sigma = 2$ minutes On a typical Saturday, **1,000 skiers** use the lift. **Question:** How many skiers wait: 1. Between 8 and 12 minutes? ($\mu \pm 1\sigma$) 2. Between 6 and 14 minutes? ($\mu \pm 2\sigma$) 3. More than 16 minutes? (beyond $\mu + 3\sigma$) **Solution:** **1. Within ±1 SD (8 to 12 minutes):** $$\mu \pm \sigma = 10 \pm 2 = [8, 12] \text{ minutes}$$ By the Empirical Rule, **68.3%** of skiers wait 8–12 minutes. Number of skiers: $1000 \times 0.683 = 683$ skiers **2. Within ±2 SD (6 to 14 minutes):** $$\mu \pm 2\sigma = 10 \pm 4 = [6, 14] \text{ minutes}$$ **95.5%** of skiers wait 6–14 minutes. Number of skiers: $1000 \times 0.955 = 955$ skiers **3. More than 16 minutes (beyond $\mu + 3\sigma$):** $$\mu + 3\sigma = 10 + 3(2) = 16 \text{ minutes}$$ The Empirical Rule says **99.7%** fall within ±3 SD, so **0.3%** fall outside. Of that 0.3%, half are above (right tail) and half below (left tail). 
Proportion above 16 minutes: $0.003 / 2 = 0.0015 = 0.15\%$ Number of skiers: $1000 \times 0.0015 = 1.5 \approx 1$ or 2 skiers **Interpretation:** - About **683 skiers** (68%) wait 8–12 minutes (typical experience) - About **955 skiers** (96%) wait 6–14 minutes (almost everyone) - Only **1–2 skiers** wait more than 16 minutes (extremely rare) **Business Decision:** The resort can confidently advertise **"Average wait time: 10 minutes, with 95% of skiers waiting less than 14 minutes."** ```{python} #| label: fig-empirical-rule #| fig-cap: "The Empirical Rule (68-95-99.7 Rule) for Normal Distributions" #| code-fold: true import matplotlib.pyplot as plt import numpy as np from scipy import stats # Parameters mu = 10 sigma = 2 x = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000) y = stats.norm.pdf(x, mu, sigma) fig, axes = plt.subplots(2, 1, figsize=(14, 10)) # Panel 1: Empirical Rule regions axes[0].plot(x, y, 'k-', linewidth=2.5, label='Normal Distribution') # Fill regions x_1sd = x[(x >= mu - sigma) & (x <= mu + sigma)] y_1sd = stats.norm.pdf(x_1sd, mu, sigma) axes[0].fill_between(x_1sd, y_1sd, alpha=0.3, color='green', label='±1 SD: 68.3%') x_2sd = x[(x >= mu - 2*sigma) & (x <= mu + 2*sigma)] y_2sd = stats.norm.pdf(x_2sd, mu, sigma) axes[0].fill_between(x_2sd, y_2sd, alpha=0.2, color='blue', label='±2 SD: 95.5%') x_3sd = x[(x >= mu - 3*sigma) & (x <= mu + 3*sigma)] y_3sd = stats.norm.pdf(x_3sd, mu, sigma) axes[0].fill_between(x_3sd, y_3sd, alpha=0.1, color='purple', label='±3 SD: 99.7%') # Add mean line axes[0].axvline(mu, color='red', linestyle='--', linewidth=2, label=f'Mean = {mu}') # Add SD markers for k in range(1, 4): axes[0].axvline(mu - k*sigma, color='gray', linestyle=':', linewidth=1.5, alpha=0.7) axes[0].axvline(mu + k*sigma, color='gray', linestyle=':', linewidth=1.5, alpha=0.7) # Annotations axes[0].text(mu, max(y)*1.05, f'μ = {mu}', ha='center', fontsize=11, fontweight='bold', bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7)) 
axes[0].text(mu, max(y)*0.5, '68.3%', ha='center', fontsize=13, fontweight='bold', color='green') axes[0].text(mu - 1.5*sigma, max(y)*0.15, '13.6%', ha='center', fontsize=10, fontweight='bold') axes[0].text(mu + 1.5*sigma, max(y)*0.15, '13.6%', ha='center', fontsize=10, fontweight='bold') axes[0].text(mu - 2.5*sigma, max(y)*0.05, '2.1%', ha='center', fontsize=9) axes[0].text(mu + 2.5*sigma, max(y)*0.05, '2.1%', ha='center', fontsize=9) axes[0].set_title('The Empirical Rule: Distribution of Values in a Normal Distribution', fontsize=13, fontweight='bold', pad=15) axes[0].set_xlabel('Value (in standard deviation units from mean)', fontsize=11, fontweight='bold') axes[0].set_ylabel('Probability Density', fontsize=11, fontweight='bold') axes[0].legend(fontsize=10, loc='upper left') axes[0].grid(True, alpha=0.3) axes[0].set_xticks([mu - 3*sigma, mu - 2*sigma, mu - sigma, mu, mu + sigma, mu + 2*sigma, mu + 3*sigma]) axes[0].set_xticklabels(['μ-3σ\n(4)', 'μ-2σ\n(6)', 'μ-σ\n(8)', 'μ\n(10)', 'μ+σ\n(12)', 'μ+2σ\n(14)', 'μ+3σ\n(16)']) # Panel 2: Ski resort wait times application x_ski = np.linspace(2, 18, 1000) y_ski = stats.norm.pdf(x_ski, 10, 2) axes[1].plot(x_ski, y_ski, 'b-', linewidth=3, label='Wait Time Distribution') axes[1].fill_between(x_ski, y_ski, alpha=0.2, color='lightblue') # Highlight regions x_normal = x_ski[(x_ski >= 8) & (x_ski <= 12)] y_normal = stats.norm.pdf(x_normal, 10, 2) axes[1].fill_between(x_normal, y_normal, alpha=0.5, color='green', label='8-12 min (68.3%): 683 skiers') x_acceptable = x_ski[(x_ski >= 6) & (x_ski <= 14)] y_acceptable = stats.norm.pdf(x_acceptable, 10, 2) axes[1].fill_between(x_acceptable, y_acceptable, alpha=0.3, color='yellow', label='6-14 min (95.5%): 955 skiers') # Mark extremes x_extreme = x_ski[x_ski > 16] y_extreme = stats.norm.pdf(x_extreme, 10, 2) axes[1].fill_between(x_extreme, y_extreme, alpha=0.7, color='red', label='>16 min (0.15%): ~1-2 skiers') axes[1].axvline(10, color='red', linestyle='--', linewidth=2.5, 
label='Mean = 10 min') axes[1].set_title('Ski Resort Wait Times: Empirical Rule Application\n(1000 skiers on typical Saturday)', fontsize=13, fontweight='bold', pad=15) axes[1].set_xlabel('Wait Time (minutes)', fontsize=11, fontweight='bold') axes[1].set_ylabel('Probability Density', fontsize=11, fontweight='bold') axes[1].legend(fontsize=10, loc='upper right') axes[1].grid(True, alpha=0.3) axes[1].set_xlim(2, 18) # Add interpretation box interpretation = ( "Business Insight:\n" "• Most skiers (68%) wait 8-12 min\n" "• Nearly all (96%) wait 6-14 min\n" "• Extreme waits (>16 min) are rare (<1%)\n" "• Can advertise: '10 min average,\n 95% wait less than 14 min'" ) axes[1].text(0.02, 0.97, interpretation, transform=axes[1].transAxes, fontsize=10, verticalalignment='top', bbox=dict(boxstyle='round,pad=0.6', facecolor='lightyellow', alpha=0.9)) plt.tight_layout() plt.show() ``` ::: {.callout-warning icon="⚠️"} ## When Does the Empirical Rule Apply? **Requirements:** 1. Data must be **approximately normally distributed** (bell-shaped, symmetric) 2. No extreme outliers 3. Sufficient sample size (n ≥ 30 is a rough guideline) **How to check:** - Create a histogram—does it look bell-shaped? - Calculate skewness (Section 3.7.C)—is it close to 0? - Use a **normal probability plot** (Q-Q plot)—do points follow a straight line? **If data are NOT normal:** Use Chebyshev's Theorem instead (conservative but always valid). ::: --- ### C. Skewness (Asymmetry) **Skewness** measures the **asymmetry** of a distribution. Symmetric distributions (like the normal distribution) have skewness near zero. Asymmetric distributions are **skewed left** or **skewed right**. 
::: {.callout-note icon="📐"}
## Pearson's Coefficient of Skewness

$$P = \frac{3(\bar{X} - \text{Median})}{s}$$

**Interpretation:**

- $P \approx 0$: **Symmetric** distribution (mean ≈ median)
- $P > 0$: **Right-skewed** (positive skew, long right tail, mean > median)
- $P < 0$: **Left-skewed** (negative skew, long left tail, mean < median)

**Rule of Thumb:**

- $|P| < 0.5$: Approximately symmetric
- $0.5 \leq |P| < 1.0$: Moderately skewed
- $|P| \geq 1.0$: Highly skewed
:::

**Visual Guide:**

```{python}
#| label: fig-skewness-types
#| fig-cap: "Distribution Skewness Types: Left-Skewed, Symmetric, and Right-Skewed"
#| code-fold: true

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Common x grid on a 0-10 scale
x = np.linspace(0, 10, 1000)


def beta_mode(a, b, scale):
    """Mode of a Beta(a, b) distribution (a, b > 1) stretched to [0, scale]."""
    return scale * (a - 1) / (a + b - 2)


# Panel 1: Left-skewed (negative skew), Beta(5, 2) stretched to 0-10
left_skew = stats.beta(5, 2, scale=10)
y_left = left_skew.pdf(x)
axes[0].plot(x, y_left, 'b-', linewidth=3)
axes[0].fill_between(x, y_left, alpha=0.4, color='blue')

# Mean, median, and mode taken from the plotted distribution itself
mode_left = beta_mode(5, 2, 10)      # 8.0
median_left = left_skew.median()     # about 7.4
mean_left = left_skew.mean()         # about 7.1

axes[0].axvline(mode_left, color='red', linestyle='--', linewidth=2, label='Mode')
axes[0].axvline(median_left, color='orange', linestyle='--', linewidth=2, label='Median')
axes[0].axvline(mean_left, color='green', linestyle='--', linewidth=2, label='Mean')
axes[0].set_title('LEFT-SKEWED (Negative)\nMean < Median < Mode', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Values', fontsize=10, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=10, fontweight='bold')
axes[0].legend(loc='upper left', fontsize=9)
axes[0].grid(True, alpha=0.3)
axes[0].text(0.5, 0.95, '← Long tail\nto the LEFT', transform=axes[0].transAxes,
             ha='center', va='top', fontsize=10,
             bbox=dict(boxstyle='round,pad=0.3', facecolor='lightblue', alpha=0.8))

# Panel 2: Symmetric (normal distribution)
symmetric = stats.norm(5, 1)
y_symmetric = symmetric.pdf(x)
axes[1].plot(x, y_symmetric, 'g-', linewidth=3)
axes[1].fill_between(x, y_symmetric, alpha=0.4, color='green')

# For a symmetric unimodal distribution all three measures coincide
center = 5.0
axes[1].axvline(center, color='purple', linestyle='--', linewidth=3, label='Mean = Median = Mode')
axes[1].set_title('SYMMETRIC\nMean = Median = Mode', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Values', fontsize=10, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=10, fontweight='bold')
axes[1].legend(loc='upper right', fontsize=9)
axes[1].grid(True, alpha=0.3)
axes[1].text(0.5, 0.95, 'Balanced\n(P ≈ 0)', transform=axes[1].transAxes,
             ha='center', va='top', fontsize=10,
             bbox=dict(boxstyle='round,pad=0.3', facecolor='lightgreen', alpha=0.8))

# Panel 3: Right-skewed (positive skew), Beta(2, 5) stretched to 0-10
right_skew = stats.beta(2, 5, scale=10)
y_right = right_skew.pdf(x)
axes[2].plot(x, y_right, 'r-', linewidth=3)
axes[2].fill_between(x, y_right, alpha=0.4, color='red')

mode_right = beta_mode(2, 5, 10)     # 2.0
median_right = right_skew.median()   # about 2.6
mean_right = right_skew.mean()       # about 2.9

axes[2].axvline(mode_right, color='red', linestyle='--', linewidth=2, label='Mode')
axes[2].axvline(median_right, color='orange', linestyle='--', linewidth=2, label='Median')
axes[2].axvline(mean_right, color='green', linestyle='--', linewidth=2, label='Mean')
axes[2].set_title('RIGHT-SKEWED (Positive)\nMean > Median > Mode', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Values', fontsize=10, fontweight='bold')
axes[2].set_ylabel('Frequency', fontsize=10, fontweight='bold')
axes[2].legend(loc='upper right', fontsize=9)
axes[2].grid(True, alpha=0.3)
axes[2].text(0.5, 0.95, 'Long tail\nto the RIGHT →', transform=axes[2].transAxes,
             ha='center', va='top', fontsize=10,
             bbox=dict(boxstyle='round,pad=0.3', facecolor='lightcoral', alpha=0.8))

# Overall title
fig.suptitle('Distribution Skewness: Understanding Asymmetry and the Relationship Between Mean, Median, and Mode',
             fontsize=14, fontweight='bold', y=1.02)

plt.tight_layout()
plt.show()
```

**Key Relationship** (typical for unimodal distributions):

- **Symmetric:** Mean = Median = Mode
- **Right-skewed:** Mean > Median > Mode (mean pulled right by extreme values)
- **Left-skewed:** Mean < Median < Mode (mean pulled left by extreme values)

### Example 3.16: P&P Airlines Skewness Analysis

P&P Airlines passenger data:

- Mean: $\bar{X} = 80.25$ passengers
- Median: $82.14$ passengers
- Standard Deviation: $s = 12.79$ passengers

**Calculate skewness:**

$$P = \frac{3(80.25 - 82.14)}{12.79} = \frac{3(-1.89)}{12.79} = \frac{-5.67}{12.79} = -0.44$$

**Interpretation:** $P = -0.44$ indicates a **slight left skew**. Because $|P| < 0.5$, the rule of thumb still classifies the distribution as approximately symmetric. The mean (80.25) is pulled slightly below the median (82.14) by some flights with unusually low passenger counts.

**Practical meaning:** Most flights are well-loaded (around 80–90 passengers), but occasional flights with few passengers (50–60 range) pull the average down slightly, a pattern consistent with uneven demand on some flights.

---

### Example 3.17: Income Distribution (Right-Skewed)

A company's employee salaries:

- Mean: $\bar{X} = \$68,000$
- Median: $\$52,000$
- Standard Deviation: $s = \$24,000$

**Calculate skewness:**

$$P = \frac{3(68,000 - 52,000)}{24,000} = \frac{3(16,000)}{24,000} = \frac{48,000}{24,000} = 2.0$$

**Interpretation:** $P = 2.0$ indicates **strong right skew** ($|P| \geq 1.0$, highly skewed). The distribution has a long right tail: a few executives earning very high salaries pull the mean far above the median.

**Practical meaning:** The **median (\$52K) is more representative** of a "typical" employee salary than the mean (\$68K). Half the employees earn less than \$52K, but the high earners inflate the average. This is common for salary distributions.

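Pearson's coefficient and the rule-of-thumb categories are easy to apply programmatically. The following is a minimal sketch using the summary statistics from Examples 3.16 and 3.17; the function names `pearson_skewness` and `describe_skew` are hypothetical, and the wording of the labels is a choice made here.

```python
def pearson_skewness(mean, median, sd):
    """P = 3 * (mean - median) / s."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return 3 * (mean - median) / sd


def describe_skew(p):
    """Label a coefficient using the rule-of-thumb cut-offs 0.5 and 1.0."""
    size = abs(p)
    if size < 0.5:
        strength = "approximately symmetric"
    elif size < 1.0:
        strength = "moderately skewed"
    else:
        strength = "highly skewed"
    if p > 0:
        direction = "right"
    elif p < 0:
        direction = "left"
    else:
        direction = "none"
    return f"{strength} (direction: {direction})"


p_airline = pearson_skewness(80.25, 82.14, 12.79)     # Example 3.16
p_salary = pearson_skewness(68_000, 52_000, 24_000)   # Example 3.17

print(f"P&P Airlines: P = {p_airline:.2f}, {describe_skew(p_airline)}")
print(f"Salaries:     P = {p_salary:.2f}, {describe_skew(p_salary)}")
```

The airline data come out at about -0.44, inside the approximately symmetric band, while the salary data reach 2.00, well into the highly skewed range.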
```{python}
#| label: fig-skewness-types-income
#| fig-cap: "Types of Skewness: Left-Skewed, Symmetric, and Right-Skewed Distributions"
#| code-fold: true

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Left-skewed distribution: Beta(8, 2)
left_dist = stats.beta(8, 2)
x_left = np.linspace(0, 1, 1000)
y_left = left_dist.pdf(x_left)
mean_left = left_dist.mean()          # 0.80
median_left = left_dist.median()      # about 0.82
mode_left = (8 - 1) / (8 + 2 - 2)     # 0.875

axes[0].plot(x_left, y_left, 'b-', linewidth=3)
axes[0].fill_between(x_left, y_left, alpha=0.3, color='blue')
axes[0].axvline(mean_left, color='red', linestyle='--', linewidth=2.5, label=f'Mean = {mean_left:.2f}')
axes[0].axvline(median_left, color='green', linestyle='-.', linewidth=2.5, label=f'Median = {median_left:.2f}')
axes[0].axvline(mode_left, color='purple', linestyle=':', linewidth=3, label=f'Mode = {mode_left:.2f}')
axes[0].set_title('Left-Skewed (Negative Skew)\nMean < Median < Mode\nP < 0', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Value', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[0].legend(fontsize=10, loc='upper left')
axes[0].grid(True, alpha=0.3)
axes[0].text(0.3, max(y_left)*0.7, 'Long left tail\n(low values\npull mean down)', fontsize=10,
             bbox=dict(boxstyle='round,pad=0.5', facecolor='lightyellow', alpha=0.8))

# Symmetric distribution: standard normal
x_sym = np.linspace(-4, 4, 1000)
y_sym = stats.norm.pdf(x_sym, 0, 1)
mean_sym = 0  # mean, median, and mode all coincide

axes[1].plot(x_sym, y_sym, 'g-', linewidth=3)
axes[1].fill_between(x_sym, y_sym, alpha=0.3, color='green')
axes[1].axvline(mean_sym, color='red', linestyle='--', linewidth=2.5,
                label=f'Mean = Median = Mode = {mean_sym:.2f}')
axes[1].set_title('Symmetric (No Skew)\nMean = Median = Mode\nP ≈ 0', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Value', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[1].legend(fontsize=10, loc='upper right')
axes[1].grid(True, alpha=0.3)
axes[1].text(0, max(y_sym)*0.7, 'Perfectly balanced\n(normal distribution)', fontsize=10, ha='center',
             bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgreen', alpha=0.8))

# Right-skewed distribution: Gamma with shape 2 and scale 1
right_dist = stats.gamma(2, scale=1)
x_right = np.linspace(0, 10, 1000)
y_right = right_dist.pdf(x_right)
mean_right = right_dist.mean()        # 2.00
median_right = right_dist.median()    # about 1.68
mode_right = (2 - 1) * 1              # 1.00, (shape - 1) * scale

axes[2].plot(x_right, y_right, 'r-', linewidth=3)
axes[2].fill_between(x_right, y_right, alpha=0.3, color='red')
axes[2].axvline(mode_right, color='purple', linestyle=':', linewidth=3, label=f'Mode = {mode_right:.2f}')
axes[2].axvline(median_right, color='green', linestyle='-.', linewidth=2.5, label=f'Median = {median_right:.2f}')
axes[2].axvline(mean_right, color='red', linestyle='--', linewidth=2.5, label=f'Mean = {mean_right:.2f}')
axes[2].set_title('Right-Skewed (Positive Skew)\nMode < Median < Mean\nP > 0', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Value', fontsize=11, fontweight='bold')
axes[2].set_ylabel('Frequency', fontsize=11, fontweight='bold')
axes[2].legend(fontsize=10, loc='upper right')
axes[2].grid(True, alpha=0.3)
axes[2].text(6, max(y_right)*0.7, 'Long right tail\n(high values\npull mean up)', fontsize=10,
             bbox=dict(boxstyle='round,pad=0.5', facecolor='lightcoral', alpha=0.8))

plt.tight_layout()
plt.show()
```

::: {.callout-tip icon="💼"}
## Business Examples of Skewed Distributions

**Right-Skewed (Positive Skew), Most Common:**

- **Income/Salary:** Most people earn moderate salaries, few earn very high
- **Home prices:** Most homes are modestly priced, luxury homes create long right tail
- **Customer spending:** Most customers spend little, a few "whales" spend heavily
- **Insurance claims:** Most claims are small, catastrophic claims are rare but large
- **Website traffic:** Most visitors view few pages, some "power users" view many

**Left-Skewed (Negative Skew), Less Common:**

- **Test scores** with ceiling effect: Most students score high 
(80–100%), few fail
- **Product lifetimes** with minimum guarantee: Most products last long, few fail early
- **Age at retirement:** Most retire around 65, few retire very early

**Symmetric:**

- **Heights, weights** (in homogeneous populations)
- **Measurement errors**
- **Many natural phenomena** (approximately follow a normal distribution)
:::

---

### D. Coefficient of Variation (CV)

The **coefficient of variation** measures **relative variability**: it expresses the standard deviation as a **percentage of the mean**. This allows comparison of variability across datasets with different units or scales.

::: {.callout-note icon="📐"}
## Coefficient of Variation

$$CV = \frac{s}{\bar{X}} \times 100\%$$

where:

- $s$ = standard deviation
- $\bar{X}$ = mean

**Interpretation (rules of thumb):**

- **Low CV (< 15%):** Low variability relative to mean (consistent, homogeneous data)
- **Moderate CV (15–30%):** Moderate variability
- **High CV (> 30%):** High variability relative to mean (inconsistent, heterogeneous data)

**Advantages:**

- **Unit-free** (always a percentage)
- Allows comparison of variability across **different scales** (e.g., dollars vs. percentages)
- Identifies which variable is **more variable relative to its size**
:::

### Example 3.18: P&P Airlines Variability Comparison

P&P Airlines wants to determine which metric is **more variable**: passenger count or miles flown.

**Passenger Data:**

- Mean: $\bar{X}_{\text{passengers}} = 80$ passengers
- Standard Deviation: $s_{\text{passengers}} = 12.35$ passengers

**Miles Flown Data:**

- Mean: $\bar{X}_{\text{miles}} = 520$ miles
- Standard Deviation: $s_{\text{miles}} = 62.7$ miles

**Question:** Which metric shows greater **relative variability**?

**Solution:**

**CV for Passengers:**

$$CV_{\text{passengers}} = \frac{12.35}{80} \times 100\% = 15.4\%$$

**CV for Miles:**

$$CV_{\text{miles}} = \frac{62.7}{520} \times 100\% = 12.1\%$$

**Interpretation:** Although **miles flown has a larger standard deviation** (62.7 vs.
12.35), **passengers show greater relative variability** (15.4% vs. 12.1%).

**Business Implication:** Passenger counts are **less predictable** than flight distances. This makes sense: flight routes (miles) are fixed, but passenger demand fluctuates based on day of week, season, pricing, and competition. P&P should focus **capacity planning efforts on passenger forecasting**, not route planning.

---

### Example 3.19: Investment Portfolio Comparison

An investor compares two portfolios:

**Portfolio A (Bonds):**

- Mean annual return: $\bar{X}_A = 5\%$
- Standard deviation: $s_A = 1.2\%$

**Portfolio B (Stocks):**

- Mean annual return: $\bar{X}_B = 12\%$
- Standard deviation: $s_B = 4.8\%$

**Question:** Which portfolio is **riskier relative to its return**?

**Solution:**

$$CV_A = \frac{1.2}{5} \times 100\% = 24\%$$

$$CV_B = \frac{4.8}{12} \times 100\% = 40\%$$

**Interpretation:** **Portfolio B (stocks)** has higher **risk-adjusted variability** (CV = 40% vs. 24%). Per unit of expected return, stocks carry **1.67 times as much risk** as bonds (40% / 24% = 1.67), that is, 67% more.

**Investment Decision:**

- **Risk-averse investor:** Choose **Portfolio A** (bonds); its lower CV means more consistent returns
- **Risk-tolerant investor:** Choose **Portfolio B** (stocks); it offers a higher mean return despite the higher CV

This is why the **Sharpe ratio** (excess return / SD) is popular in finance: it is essentially the inverse of CV, with the mean return replaced by the return in excess of the risk-free rate.

```{python}
#| label: fig-coefficient-variation
#| fig-cap: "Coefficient of Variation: Comparing Relative Variability Across Different Scales"
#| code-fold: true

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Example 1: P&P Airlines (Passengers vs. Miles)
categories_pp = ['Passengers', 'Miles Flown']
means_pp = [80, 520]
stds_pp = [12.35, 62.7]
# CV = (standard deviation / mean) * 100
cvs_pp = [s / m * 100 for s, m in zip(stds_pp, means_pp)]

# Panel 1a: Standard Deviations (misleading comparison)
bars1 = axes[0, 0].bar(categories_pp, stds_pp, color=['steelblue', 'coral'],
                       alpha=0.7, edgecolor='black', linewidth=2)
axes[0, 0].set_title('Standard Deviations (Different Units)\nMisleading Comparison!',
                     fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('Standard Deviation', fontsize=11, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3, axis='y')
for bar, std in zip(bars1, stds_pp):
    height = bar.get_height()
    axes[0, 0].text(bar.get_x() + bar.get_width()/2., height, f'{std:.1f}',
                    ha='center', va='bottom', fontweight='bold', fontsize=11)
axes[0, 0].text(0.5, 0.97,
                'X Cannot compare directly\n(different units: passengers vs. miles)',
                transform=axes[0, 0].transAxes, ha='center', va='top', fontsize=10,
                bbox=dict(boxstyle='round,pad=0.5', facecolor='yellow', alpha=0.8))

# Panel 1b: Coefficients of Variation (valid comparison)
bars2 = axes[0, 1].bar(categories_pp, cvs_pp, color=['steelblue', 'coral'],
                       alpha=0.7, edgecolor='black', linewidth=2)
axes[0, 1].set_title('Coefficients of Variation (Unit-Free %)\nValid Comparison ✓',
                     fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('CV (%)', fontsize=11, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3, axis='y')
axes[0, 1].axhline(y=15, color='orange', linestyle='--', linewidth=2, alpha=0.7,
                   label='Moderate variability threshold (15%)')
for bar, cv in zip(bars2, cvs_pp):
    height = bar.get_height()
    axes[0, 1].text(bar.get_x() + bar.get_width()/2., height, f'{cv:.1f}%',
                    ha='center', va='bottom', fontweight='bold', fontsize=11)
axes[0, 1].legend(fontsize=9)
axes[0, 1].text(0.5, 0.97,
                'OK - Valid comparison\nPassengers (15.4%) more variable than Miles (12.1%)',
                transform=axes[0, 1].transAxes, ha='center', va='top', fontsize=10,
                bbox=dict(boxstyle='round,pad=0.5', facecolor='lightgreen', alpha=0.8))

# Example 2: Investment Portfolio Comparison
categories_inv = ['Bonds\n(Portfolio A)', 'Stocks\n(Portfolio B)']
means_inv = [5, 12]
stds_inv = [1.2, 4.8]
cvs_inv = [s / m * 100 for s, m in zip(stds_inv, means_inv)]

# Panel 2a: Returns with error bars
x_pos = np.arange(len(categories_inv))
bars3 = axes[1, 0].bar(x_pos, means_inv, yerr=stds_inv, capsize=10,
                       color=['green', 'red'], alpha=0.6, edgecolor='black', linewidth=2,
                       error_kw={'linewidth': 3, 'ecolor': 'black'})
axes[1, 0].set_title('Portfolio Returns with Variability (±1 SD)',
                     fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('Annual Return (%)', fontsize=11, fontweight='bold')
axes[1, 0].set_xticks(x_pos)
axes[1, 0].set_xticklabels(categories_inv)
axes[1, 0].grid(True, alpha=0.3, axis='y')
for bar, mean, std in zip(bars3, means_inv, stds_inv):
    height = bar.get_height()
    axes[1, 0].text(bar.get_x() + bar.get_width()/2., height/2,
                    f'μ = {mean}%\nσ = {std}%',
                    ha='center', va='center', fontweight='bold', fontsize=10, color='white')

# Panel 2b: Risk-Adjusted Variability (CV)
bars4 = axes[1, 1].bar(categories_inv, cvs_inv, color=['green', 'red'],
                       alpha=0.6, edgecolor='black', linewidth=2)
axes[1, 1].set_title('Risk-Adjusted Variability (CV)\nStocks Have Higher Relative Risk',
                     fontsize=12, fontweight='bold')
axes[1, 1].set_ylabel('Coefficient of Variation (%)', fontsize=11, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3, axis='y')
axes[1, 1].axhline(y=30, color='red', linestyle='--', linewidth=2, alpha=0.7,
                   label='High variability threshold (30%)')
for bar, cv in zip(bars4, cvs_inv):
    height = bar.get_height()
    axes[1, 1].text(bar.get_x() + bar.get_width()/2., height, f'{cv:.0f}%',
                    ha='center', va='bottom', fontweight='bold', fontsize=11)
axes[1, 1].legend(fontsize=9)

# Add interpretation
interpretation = (
    "Interpretation:\n"
    "Bonds: Lower return (5%) but\n"
    "  more consistent (CV=24%)\n"
    "Stocks: Higher return (12%) but\n"
    "  more volatile (CV=40%)\n"
    "→ Stocks have 67% more risk\n"
    "  per unit of return"
)
axes[1, 1].text(0.02, 0.97, interpretation, transform=axes[1, 1].transAxes,
                fontsize=9, verticalalignment='top',
                bbox=dict(boxstyle='round,pad=0.5', facecolor='lightyellow', alpha=0.9))

plt.tight_layout()
plt.show()
```

::: {.callout-tip icon="💡"}
## When to Use Coefficient of Variation

**Use CV when:**

- Comparing variability of **different variables** (different units or scales)
- Comparing **risk-adjusted returns** in finance
- Determining which process/product has **more consistent quality relative to its target**
- Variables have **different means** (CV adjusts for this)

**Don't use CV when:**

- Mean is zero or near zero (CV becomes undefined or misleading)
- Variables have **negative values** (CV requires positive mean)
- Absolute variability is more important than relative (use SD instead)
:::

---

## Chapter 3 Summary

### Key Concepts

**Measures of Central Tendency:**

1. **Mean ($\bar{X}$ or $\mu$):** Arithmetic average, uses all values, sensitive to outliers
2. **Median:** Middle value when sorted, resistant to outliers, best for skewed data
3. **Mode:** Most frequent value, useful for categorical data
4. **Weighted Mean:** Accounts for different importance/weights
5. **Geometric Mean:** Average growth rate for multiplicative data

**Measures of Dispersion:**

1. **Range:** Max - Min (simple but sensitive to outliers)
2. **Variance ($s^2$ or $\sigma^2$):** Average squared deviation
3. **Standard Deviation ($s$ or $\sigma$):** Square root of variance (same units as data)
4. **Interquartile Range (IQR):** Q3 - Q1 (middle 50%, robust to outliers)
5. **Coefficient of Variation (CV):** Relative variability as percentage

**Distribution Shape:**

1. **Chebyshev's Theorem:** At least $1 - 1/K^2$ of observations lie within $K$ SD of the mean (any distribution, $K > 1$)
2. **Empirical Rule:** 68%-95%-99.7% within 1-2-3 SD (normal distributions only)
3. **Skewness:** Asymmetry measure (Pearson's coefficient)
    - Right-skewed: Mean > Median (most common for income, prices)
    - Left-skewed: Mean < Median
    - Symmetric: Mean ≈ Median

### Essential Formulas Reference

| Measure | Population | Sample |
|:--------|:-----------|:-------|
| **Mean** | $\mu = \frac{\sum X_i}{N}$ | $\bar{X} = \frac{\sum X_i}{n}$ |
| **Variance** | $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$ | $s^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$ |
| **Std Dev** | $\sigma = \sqrt{\sigma^2}$ | $s = \sqrt{s^2}$ |
| **Weighted Mean** | — | $\bar{X}_W = \frac{\sum X_i W_i}{\sum W_i}$ |
| **Geometric Mean** | — | $GM = \sqrt[n]{X_1 \times \cdots \times X_n}$ |
| **CV** | — | $CV = \frac{s}{\bar{X}} \times 100\%$ |
| **Skewness** | — | $P = \frac{3(\bar{X} - \text{Median})}{s}$ |
| **Percentile Position** | — | $L_P = (n+1) \times \frac{P}{100}$ |

: Formula Reference Table {.striped .hover}

### Decision Framework: Which Measure to Use?

**Central Tendency:**

- **Normal/Symmetric data:** Use **Mean** (most efficient, uses all information)
- **Skewed data or outliers:** Use **Median** (robust, more representative)
- **Categorical data:** Use **Mode** (only option for non-numeric data)
- **Growth rates/returns:** Use **Geometric Mean** (compounds correctly)
- **Different importance:** Use **Weighted Mean** (accounts for weights)

**Dispersion:**

- **Normal data, absolute variability:** Use **Standard Deviation**
- **Compare across different scales:** Use **Coefficient of Variation**
- **Outlier-resistant measure:** Use **IQR**
- **Quick rough estimate:** Use **Range** (least informative)

**Distribution Assessment:**

- **Unknown distribution:** Use **Chebyshev's Theorem** (always valid)
- **Normal distribution:** Use **Empirical Rule** (precise)
- **Assess symmetry:** Calculate **Skewness**

### Business Applications Recap

::: {.callout-note icon="💼"}
## Real-World Applications

**Finance:**

- Portfolio risk (standard deviation = volatility)
- Sharpe ratio (excess return / SD)
- Value-at-Risk (percentiles)
- CAGR (geometric mean)

**Manufacturing:**

- Process capability (mean, SD, specifications)
- Six Sigma (defects per million using SD)
- Quality control charts (mean ± 3 SD limits)

**Human Resources:**

- Salary benchmarking (percentiles, quartiles)
- Performance ratings (mean, mode)
- Compensation equity (CV across departments)

**Marketing:**

- Customer lifetime value variability (SD, CV)
- Market segmentation (percentiles)
- Campaign performance consistency (CV)

**Operations:**

- Demand forecasting (mean, SD)
- Service level targets (percentiles)
- Capacity planning (mean ± 2 SD)
:::

---

## Glossary

**Central Tendency:** A single value that represents the "center" or "typical" value of a dataset.

**Chebyshev's Theorem:** For any distribution and any $K > 1$, at least $1 - 1/K^2$ of observations lie within $K$ standard deviations of the mean.

**Coefficient of Variation (CV):** Relative measure of dispersion expressed as percentage of mean: $CV = (s/\bar{X}) \times 100\%$.

**Deciles:** Nine values that divide an ordered dataset into 10 equal parts.

**Degrees of Freedom:** The number of independent pieces of information available to estimate a parameter ($n-1$ for sample variance).

**Dispersion:** The spread or variability of observations around the central tendency.

**Empirical Rule:** For normal distributions, approximately 68%-95%-99.7% of observations fall within 1-2-3 standard deviations of the mean.

**Geometric Mean:** The nth root of the product of n values; appropriate for growth rates and multiplicative data.

**Interquartile Range (IQR):** The range of the middle 50% of data: IQR = Q3 - Q1.

**Mean:** The arithmetic average: sum of values divided by count.

**Median:** The middle value when data are sorted (50th percentile).

**Mode:** The most frequently occurring value(s) in a dataset.

**Percentile:** The Pth percentile is the value below which P percent of observations fall.

**Quartiles:** Three values (Q1, Q2, Q3) that divide an ordered dataset into four equal parts.

**Range:** The difference between maximum and minimum values.

**Skewness:** A measure of asymmetry in a distribution.

**Standard Deviation:** The square root of variance; measures the typical distance of observations from the mean.

**Variance:** The average of squared deviations from the mean.

**Weighted Mean:** An average that accounts for different importance or weights of values.

---

**END OF CHAPTER 3**

You have now covered:

- ✅ All measures of central tendency (mean, median, mode, weighted mean, geometric mean)
- ✅ All measures of dispersion (range, variance, SD, IQR, CV)
- ✅ Grouped data formulas
- ✅ Percentiles and quartiles
- ✅ Chebyshev's Theorem
- ✅ Empirical Rule
- ✅ Skewness
- ✅ Coefficient of Variation
- ✅ Comprehensive business examples throughout
- ✅ 9+ Python visualizations
- ✅ Formulas reference table
- ✅ Decision framework
- ✅ Glossary