12 Simple Regression and Correlation

12.1 Opening Scenario: The Cola Wars

Competition in the soft drink industry has always been intense. Recently, the battle between Coca-Cola and Pepsi-Cola has escalated as both giants fight to increase their shares of the $27 billion domestic beverage market. Each company has offered its own brand of promotional flair in a continuous effort to reorganize its marketing mix and promote its products. Coca-Cola currently holds a 21.7% market share, followed by Pepsi at 18.9%.

Without a doubt, marketing executives, management specialists, and statisticians work tirelessly at both companies trying to outmaneuver their competitive-minded counterparts. So far, they have agreed on very little, except that sales seem to increase with higher summer temperatures.

Predicting market share trends is an especially arduous task. Many executives have ruined their careers in frustrated attempts to anticipate the behavior of fickle consumers.

Regression and correlation analysis are among the most powerful and useful tools that analysts of all types have at their disposal to peer into the shadowy future. In this chapter, we will analyze these procedures and learn how they can guide business professionals in their pursuit of successful careers.

12.2 Learning Objectives

After completing this chapter, you will be able to:

Distinguish between dependent and independent variables in regression analysis
Differentiate between simple and multiple regression models
Understand the difference between linear and curvilinear relationships
Apply the ordinary least squares (OLS) method to determine the best-fit line
Interpret regression coefficients and their business implications
Evaluate regression model assumptions (normality, homoscedasticity, independence, linearity)
Calculate and interpret the standard error of estimation
Compute and interpret the correlation coefficient and coefficient of determination
Conduct hypothesis tests for regression coefficients and correlation
Make predictions using regression models with confidence intervals

12.3 Chapter Structure

graph TD
    A[Regression and Correlation] --> B[Regression Analysis]
    A --> C[Correlation Analysis]
    
    B --> B1[Model development:<br/>Ordinary least squares]
    B --> B2[Model assumptions]
    B --> B3[Standard error<br/>of estimation]
    B --> B4[Inferential tests]
    
    B4 --> B4a[Hypothesis tests]
    B4 --> B4b[Confidence intervals]
    
    C --> C1[Correlation<br/>coefficient]
    C --> C2[Coefficient of<br/>determination]
    C --> C3[Hypothesis test<br/>for correlation]

Chapter 12 Conceptual Structure

12.4 Introduction to Regression and Correlation

Regression and correlation are among the most powerful and versatile statistical tools for solving common business problems. Many studies are based on the belief that we can identify and quantify some functional relationship between two or more variables. We say that one variable depends on another.

We can say that Y depends on X, where Y and X are any two variables. This can be written as:

Y = f(X)

This is read as “Y is a function of X.”

Because Y depends on X, Y is the dependent variable and X is the independent variable.

Key Definitions

Dependent Variable: The variable we wish to explain or predict; also called the regressand or response variable. It is the outcome we’re interested in understanding.
Independent Variable: The variable used to explain Y; also called the explanatory variable, predictor, or regressor. It is the variable we believe influences the dependent variable.

12.4.1 Example: Student Performance

The university dean wishes to analyze the relationship between student grades and the time they spend studying. Data on both variables were collected. It is logical to presume that grades depend on the quantity and quality of time students spend with their books. Therefore:

Dependent variable (Y): Grades
Independent variable (X): Study time

We say that “Y is regressed on X” or that “we are regressing Y on X.”

12.4.2 Historical Context

The first to develop regression analysis was English scientist Sir Francis Galton (1822-1911). His initial experiments with regression began with an attempt to analyze hereditary growth patterns of peas. Encouraged by the results, Sir Francis extended his study to include hereditary patterns in the height of adult humans.

He discovered that the children of very tall or very short parents tended to “regress” toward the average height of the adult population. From this modest beginning, regression analysis spread and has become one of the most powerful statistical tools available today.

Why “Regression”?

The term “regression” comes from Galton’s observation that extreme values tend to “regress” (return) toward the mean. This phenomenon is called regression to the mean.

12.5 Types of Regression Models

12.5.1 Simple vs. Multiple Regression

We must differentiate between simple regression and multiple regression:

Simple Regression (Bivariate Regression): Y is a function of only one independent variable: Y = f(X). It’s called “bivariate” because there are only two variables: one dependent and one independent.
Multiple Regression: Y is a function of two or more independent variables: Y = f(X_1, X_2, X_3, \ldots, X_k) where X_1, X_2, X_3, \ldots, X_k are independent variables that help explain Y.

12.5.2 Linear vs. Curvilinear Regression

We must also distinguish between linear regression and curvilinear (nonlinear) regression:

Linear Regression: The relationship between X and Y can be represented by a straight line. For each one-unit change in X, Y changes by a constant amount.
Curvilinear Regression: Uses a curve to express the relationship between X and Y. The amount by which Y changes for a one-unit change in X varies with the level of X.

12.5.3 Scatter Diagrams

Some of these relationships appear in scatter diagrams (or scatter plots) that represent paired observations for X and Y. It is customary to place the independent variable on the horizontal axis.

Figure 12.1: Types of Relationships in Scatter Diagrams

Interpreting Scatter Diagrams

Figure a): Positive linear relationship - as X increases (decreases), Y increases (decreases)
Figure b): Negative linear relationship - as X increases, Y decreases
Figures c) and d): Curvilinear relationships - cannot be well described by a straight line
Figure e): No relationship - no detectable pattern between X and Y
Figure f): Perfect linear relationship - all points fall exactly on a line (rare in real data)

12.6 The Equation of a Straight Line

Before diving into regression analysis, let’s review the mathematical equation of a straight line. Only two points are needed to draw a straight line representing a linear relationship.

12.6.1 Basic Form

The equation of a straight line can be expressed as:

Equation of a Line

Y = b_0 + b_1 X \quad [11.1]

where:

b_0 is the intercept (where the line crosses the Y-axis)
b_1 is the slope of the line

12.6.2 Example: Understanding Slope and Intercept

Consider the equation: Y = 5 + 2X

Intercept (b_0 = 5): The line crosses the Y-axis at 5
Slope (b_1 = 2): The slope is calculated as:

b_1 = \text{slope} = \frac{\text{vertical change}}{\text{horizontal change}} = \frac{2}{1} = 2

This means: For every one-unit change in X, Y changes by 2 units.

Figure 12.2: Straight Line with Positive Slope: Y = 5 + 2X

Notice that as X increases from 2 to 3 (an increase of 1 unit), Y increases from 9 to 11 (an increase of 2 units).
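The slope and intercept of the example line can be checked with a few lines of Python. This is a minimal sketch; the function name `line` is an illustrative choice and not part of the text.

```python
def line(x, b0=5.0, b1=2.0):
    """Evaluate Y = b0 + b1 * X for the example line Y = 5 + 2X."""
    return b0 + b1 * x

# Two points on the line, using the X values from the text.
x1, x2 = 2, 3
y1, y2 = line(x1), line(x2)

# Slope = vertical change / horizontal change.
slope = (y2 - y1) / (x2 - x1)

# Intercept = value of Y where the line crosses the Y-axis (X = 0).
intercept = line(0)

print(f"Y at X={x1}: {y1}; Y at X={x2}: {y2}")
print(f"slope = {slope}, intercept = {intercept}")
```

Any two distinct points on the line give the same slope, which is what makes the relationship linear.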

12.6.3 Different Slope Values

The slope b_1 determines the nature of the relationship:

1. Positive Slope (b_1 > 0): Variables move in the same direction; Example: Y = 5 + 2X
2. Negative Slope (b_1 < 0): Variables move in opposite directions; Example: Y = 10 - 3X; For every one-unit increase in X, Y decreases by 3 units
3. Zero Slope (b_1 = 0): No linear relationship; Example: Y = 10 + 0X = 10; Changes in X have no effect on Y

12.7 Deterministic vs. Stochastic Relationships

12.7.1 Deterministic Relationships

A deterministic relationship can be expressed by a formula that provides exact conversions.

Example: Converting miles per hour (mph) to kilometers per hour (kph), using the rounded factor 1 \text{ mph} \approx 1.6 \text{ kph} (the exact factor is 1.609344)

Therefore: 5 \text{ mph} = 5(1.6) = 8.0 \text{ kph}

This is a deterministic model because the relationship is exact and there is no error (except for rounding).

12.7.2 Stochastic (Random) Relationships

Unfortunately, very few relationships in the business world are this exact. When using one variable to explain another, there is usually some variation in the relationship.

Example: Vita+Plus, Inc. (health product distributors) wants to develop a regression model using advertising to explain sales revenue.

They will likely find that:

When advertising is set at amount X_i, sales have value Y_i
The next time advertising is set at the same amount, sales may produce a different value
The dependent variable (sales) exhibits some degree of randomness

Therefore, there will be some error in attempting to explain or predict sales. Such a model is stochastic due to the presence of random variation:

Population (True) Regression Model

Y = \beta_0 + \beta_1 X + \varepsilon \quad [11.4]

where:

\beta_0 + \beta_1 X is the deterministic portion
\varepsilon (epsilon) is the error term (random component)
\beta_0 and \beta_1 are population parameters (usually unknown)
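To make the stochastic model concrete, the sketch below simulates repeated sales observations at one fixed advertising level. The parameter values (`BETA_0`, `BETA_1`, `SIGMA`) are hypothetical and chosen only for illustration; in practice the population parameters are unknown, and the normal distribution for the error term is an assumption of this sketch.

```python
import random

# Hypothetical population parameters (illustration only; not from the text).
BETA_0 = 100.0   # intercept
BETA_1 = 6.0     # slope
SIGMA = 15.0     # assumed standard deviation of the error term

rng = random.Random(42)  # fixed seed so the run is repeatable

def deterministic_part(x):
    """The deterministic portion: beta_0 + beta_1 * X."""
    return BETA_0 + BETA_1 * x

def observe_sales(x):
    """One observation of Y = beta_0 + beta_1 * X + epsilon."""
    epsilon = rng.gauss(0.0, SIGMA)
    return deterministic_part(x) + epsilon

# Same advertising level five times: the observed sales differ each time.
advertising = 50
draws = [observe_sales(advertising) for _ in range(5)]
errors = [y - deterministic_part(advertising) for y in draws]

print("deterministic part:", deterministic_part(advertising))
for y, e in zip(draws, errors):
    print(f"observed Y = {y:8.2f}   error = {e:7.2f}")
```

The deterministic portion is identical on every draw; only the error term changes, which is why Y cannot be predicted exactly from X.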

12.7.3 Sample-Based Estimation

The parameters \beta_0 and \beta_1 remain unknown and can only be estimated using sample data:

Sample Regression Model

Y = b_0 + b_1 X + e \quad [11.5]

where:

b_0 and b_1 are estimates of \beta_0 and \beta_1
e is the residual (observed error in the sample)

The residual e recognizes that not all observations fall exactly on a straight line. If we knew the exact value of e, we could calculate Y precisely. However, because e is random, Y can only be estimated.

12.7.4 The Estimated Regression Model

The regression model therefore takes the form:

Estimated Regression Equation

\hat{Y} = b_0 + b_1 X \quad [11.6]

where:

\hat{Y} (read as “Y-hat”) is the estimated value of Y
b_0 is the estimated intercept
b_1 is the estimated slope

Notation Summary

Symbol	Meaning	Type
Y	Actual observed value	Data
\hat{Y}	Predicted/estimated value	Calculated from model
\beta_0, \beta_1	True population parameters	Unknown
b_0, b_1	Sample estimates of parameters	Calculated from data
\varepsilon	True population error	Unknown/theoretical
e	Sample residual	Calculated: e = Y - \hat{Y}

12.8 Section Exercises

Exercise 12.1 (Understanding Regression Concepts)

What is the difference between simple regression and multiple regression?
What is the difference between linear regression and curvilinear regression? How does Y change when X changes in each case?
Differentiate between the deterministic and stochastic components of a regression model.
Why is the ordinary least squares method called “least squares”? What role does error play in this analysis?

Exercise 12.2 (Identifying Variables) Identify the dependent and independent variables in each of these cases:

Time spent working on an assignment and the grade received
Son’s height and father’s height
A woman’s age and the cost of her life insurance
Price of a product and the number of units sold
Demand for a product and the number of consumers in the market

Exercise 12.3 (Creating Scatter Diagrams) Given the following data for X and Y:

\begin{aligned} X: & \quad 28, 54, 67, 37, 41, 69, 76 \\ Y: & \quad 14, 21, 36, 39, 18, 54, 52 \end{aligned}

Create a scatter diagram for the data
What do the data suggest about a relationship between X and Y?
Draw a line to approximate the relationship

Exercise 12.4 (Regression Terminology)

What is the difference between \hat{Y}_i and Y_i in regression analysis?
What is the term \varepsilon in the regression model and why does it occur?
Explain the meaning of “regression to the mean” in the historical context of Sir Francis Galton’s work

12.9 Section Summary

Key Takeaways

Regression analysis studies relationships between variables to make predictions
Dependent variable (Y): The outcome we want to explain or predict
Independent variable (X): The predictor we use to explain Y
Linear relationship: Y changes by a constant amount when X changes
Scatter diagrams reveal the type and strength of relationships visually
Stochastic models include an error term because real-world relationships aren’t perfect
Historical origin: Sir Francis Galton developed regression studying hereditary patterns

12.10 Ordinary Least Squares: The Best-Fit Line

The purpose of regression analysis is to determine a line that fits the data better than any other line that could be drawn. To illustrate this, let’s assume that Vita+Plus, Inc. collected data on advertising expenditures and sales revenue for 5 months, as shown in Table 11.1.

Table 11.1: Sales Data for Vita+Plus, Inc.

Month	Sales (in $1,000s)	Advertising (in $100s)
1	$450	$50
2	380	40
3	540	65
4	500	55
5	420	45

Although a sample of only 5 observations would probably be insufficient in practice, it will serve our purposes for now.

12.10.1 Understanding the Error Term

These five data points and the line that best fits them appear in a scatter diagram. This line is determined by estimating b_0 and b_1. A mathematical procedure used to estimate these values is called Ordinary Least Squares (OLS).

OLS produces a line that extends through the center of the scatter plot, approximating all data points better than any other line.

The five points Y_i in the scatter diagram are the actual observed values of Y from Table 11.1. The values \hat{Y}_i are obtained from the regression line and represent the estimated sales.

The difference between what Y actually was (Y_i) and what we estimate it to be (\hat{Y}_i) is the error.

The Error Term

\text{Error} = (Y_i - \hat{Y}_i) \quad [11.7]

If the actual value Y_i is greater than the estimate \hat{Y}_i, then (Y_i - \hat{Y}_i) > 0 and the error is positive (we underestimated)
If the actual value Y_i is less than the estimate \hat{Y}_i, then (Y_i - \hat{Y}_i) < 0 and the error is negative (we overestimated)

12.10.2 The Principle of Least Squares

Because some errors are negative and some are positive, they offset one another. OLS produces a line for which they cancel exactly:

Sum of Errors Equals Zero

\sum (Y_i - \hat{Y}_i) = 0

More importantly, OLS ensures that the sum of squared errors is minimized. That is, if we:

1. Take the differences (all vertical) between actual Y values and the regression line
2. Square these differences
3. Sum them

the resulting number will be smaller than what we would obtain with any other line.

OLS Minimizes Sum of Squared Errors

\sum (Y_i - \hat{Y}_i)^2 = \min \quad [11.8]

This is why it’s called Ordinary Least Squares – it produces a line such that the sum of squared errors is the minimum possible.
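Both properties can be checked numerically on the Vita+Plus data from Table 11.1. The sketch below fits the least-squares line using the deviation-based slope and intercept formulas (taken as given here), then compares its sum of squared errors with a few nearby lines. The competing lines are arbitrary choices made for illustration.

```python
# Vita+Plus data from Table 11.1.
advertising = [50, 40, 65, 55, 45]     # X, in $100s
sales = [450, 380, 540, 500, 420]      # Y, in $1,000s
n = len(advertising)

x_bar = sum(advertising) / n
y_bar = sum(sales) / n

# Least-squares slope and intercept from deviations about the means.
sum_cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales))
sum_sq_x = sum((x - x_bar) ** 2 for x in advertising)
b1 = sum_cross / sum_sq_x
b0 = y_bar - b1 * x_bar

def errors(intercept, slope):
    """Vertical differences Y_i - Y-hat_i for a candidate line."""
    return [y - (intercept + slope * x) for x, y in zip(advertising, sales)]

def sum_squared_errors(intercept, slope):
    return sum(e ** 2 for e in errors(intercept, slope))

ols_sse = sum_squared_errors(b0, b1)

# Hypothetical competing lines: nudge the intercept or the slope.
candidates = {
    "intercept + 5": (b0 + 5, b1),
    "intercept - 5": (b0 - 5, b1),
    "slope + 0.2": (b0, b1 + 0.2),
    "slope - 0.2": (b0, b1 - 0.2),
}

print(f"OLS line: Y-hat = {b0:.4f} + {b1:.4f} X")
print(f"sum of errors  = {sum(errors(b0, b1)):.10f}")
print(f"OLS SSE        = {ols_sse:.4f}")
for label, (intercept, slope) in candidates.items():
    print(f"SSE with {label:14s} = {sum_squared_errors(intercept, slope):.4f}")
```

Every competing line produces a larger sum of squared errors than the least-squares line, and the least-squares errors sum to zero up to floating-point rounding.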

Why Square the Errors?

Eliminates sign issues: Squaring makes all values positive, preventing cancellation
Penalizes large errors: Larger deviations are penalized more heavily
Mathematical tractability: Squared terms have nice derivative properties for optimization
Statistical properties: Under the standard regression assumptions, least squares estimates are unbiased and have the smallest variance among linear unbiased estimators

12.10.3 Computing Sums of Squares and Cross Products

To determine the best-fit line, OLS requires calculating:

Sum of Squares for X

SCx = \sum(X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n} \quad [11.9]

Sum of Squares for Y

SCy = \sum(Y_i - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} \quad [11.10]

Sum of Cross Products

SCxy = \sum(X_i - \bar{X})(Y_i - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n} \quad [11.11]

Computational Forms

The first portion of each formula:

SCx = \sum(X_i - \bar{X})^2
SCy = \sum(Y_i - \bar{Y})^2
SCxy = \sum(X_i - \bar{X})(Y_i - \bar{Y})

illustrates how the OLS line is based on deviations from the mean. However, these are tedious to calculate manually. The second versions (computational formulas) are much easier for hand calculations.
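The equivalence of the two forms can be verified directly. The sketch below computes each quantity both ways for the Vita+Plus data in Table 11.1; the variable names mirror the symbols SCx, SCy, and SCxy, with `_dev` marking the deviation form and `_comp` the computational form (the suffixes are naming choices made here).

```python
# Vita+Plus data from Table 11.1.
advertising = [50, 40, 65, 55, 45]     # X, in $100s
sales = [450, 380, 540, 500, 420]      # Y, in $1,000s
n = len(advertising)

x_bar = sum(advertising) / n
y_bar = sum(sales) / n

# Deviation forms: built from distances to the means.
sc_x_dev = sum((x - x_bar) ** 2 for x in advertising)
sc_y_dev = sum((y - y_bar) ** 2 for y in sales)
sc_xy_dev = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, sales))

# Computational forms: built from raw sums only.
sum_x, sum_y = sum(advertising), sum(sales)
sc_x_comp = sum(x * x for x in advertising) - sum_x ** 2 / n
sc_y_comp = sum(y * y for y in sales) - sum_y ** 2 / n
sc_xy_comp = sum(x * y for x, y in zip(advertising, sales)) - sum_x * sum_y / n

print(f"SCx : {sc_x_dev:.4f} (deviation)  {sc_x_comp:.4f} (computational)")
print(f"SCy : {sc_y_dev:.4f} (deviation)  {sc_y_comp:.4f} (computational)")
print(f"SCxy: {sc_xy_dev:.4f} (deviation)  {sc_xy_comp:.4f} (computational)")
```

Both columns agree, so the choice between the two forms is purely one of convenience.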

12.10.4 Calculating the Regression Coefficients

Given the sums of squares and cross products, it’s straightforward to calculate:

The Slope (Regression Coefficient)

b_1 = \frac{SCxy}{SCx} \quad [11.12]

The Intercept (Regression Constant)

b_0 = \bar{Y} - b_1\bar{X} \quad [11.13]

where \bar{Y} and \bar{X} are the means of the Y and X values.

Precision Warning

These calculations are extremely sensitive to rounding. This is especially true for the coefficient of determination (discussed later). Therefore, it’s advisable to carry calculations to five or six decimal places for accuracy.

12.11 Example: Hop Scotch Airlines

The management of Hop Scotch Airlines, the world’s smallest carrier airline, believes there’s a direct relationship between advertising expenditures and the number of passengers who choose to fly Hop Scotch.

To determine if this relationship exists and what its exact nature might be, statisticians employed by Hop Scotch decided to use OLS procedures to determine the regression model.

Monthly values for advertising expenses and number of passengers were collected for the n = 15 most recent months. The data appear in Table 12.1, along with other necessary calculations for finding the regression model.

Table 12.1: Regression Data for Hop Scotch Airlines

Observation (Month)	Advertising (X) (in $1,000s)	Passengers (Y) (in 1,000s)	XY	X^2	Y^2
1	10	15	150	100	225
2	12	17	204	144	289
3	8	13	104	64	169
4	17	23	391	289	529
5	10	16	160	100	256
6	15	21	315	225	441
7	10	14	140	100	196
8	14	20	280	196	400
9	19	24	456	361	576
10	10	17	170	100	289
11	11	16	176	121	256
12	13	18	234	169	324
13	16	23	368	256	529
14	10	15	150	100	225
15	12	16	192	144	256
Totals	187	268	3,490	2,469	4,960

12.11.1 Step-by-Step Solution

With this dataset and the calculations for XY, X^2, and Y^2, it’s easy to determine the regression model.

Step 1: Calculate Sums of Squares and Cross Products

\begin{aligned} SCx &= \sum X^2 - \frac{(\sum X)^2}{n} \\ &= 2,469 - \frac{(187)^2}{15} \\ &= 2,469 - \frac{34,969}{15} \\ &= 2,469 - 2,331.266667 \\ &= 137.733333 \end{aligned}

\begin{aligned} SCy &= \sum Y^2 - \frac{(\sum Y)^2}{n} \\ &= 4,960 - \frac{(268)^2}{15} \\ &= 4,960 - \frac{71,824}{15} \\ &= 4,960 - 4,788.266667 \\ &= 171.733333 \end{aligned}

\begin{aligned} SCxy &= \sum XY - \frac{(\sum X)(\sum Y)}{n} \\ &= 3,490 - \frac{(187)(268)}{15} \\ &= 3,490 - \frac{50,116}{15} \\ &= 3,490 - 3,341.066667 \\ &= 148.933333 \end{aligned}

Step 2: Calculate the Slope (Regression Coefficient)

\begin{aligned} b_1 &= \frac{SCxy}{SCx} \\ &= \frac{148.933333}{137.733333} \\ &= 1.081317 \approx 1.08 \end{aligned}

Step 3: Calculate the Means

\bar{Y} = \frac{\sum Y}{n} = \frac{268}{15} = 17.866667

\bar{X} = \frac{\sum X}{n} = \frac{187}{15} = 12.466667

Step 4: Calculate the Intercept

\begin{aligned} b_0 &= \bar{Y} - b_1\bar{X} \\ &= 17.866667 - 1.081317(12.466667) \\ &= 17.866667 - 13.480419 \\ &= 4.386248 \approx 4.39 \end{aligned}
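The four steps can be reproduced in Python as a cross-check on the hand calculation. This is a sketch that works from the raw sums, just as the computational formulas do; the data are the 15 monthly observations from the Hop Scotch table. Because the code carries full floating-point precision, its last digits can differ slightly from a hand calculation that rounds intermediate values.

```python
# Hop Scotch Airlines data (15 months).
advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]  # X, $1,000s
passengers = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]  # Y, 1,000s
n = len(advertising)

# Column totals from the table.
sum_x = sum(advertising)
sum_y = sum(passengers)
sum_xy = sum(x * y for x, y in zip(advertising, passengers))
sum_x2 = sum(x * x for x in advertising)
sum_y2 = sum(y * y for y in passengers)

# Step 1: sums of squares and cross products (computational forms).
sc_x = sum_x2 - sum_x ** 2 / n
sc_y = sum_y2 - sum_y ** 2 / n
sc_xy = sum_xy - sum_x * sum_y / n

# Step 2: slope.
b1 = sc_xy / sc_x

# Step 3: means.
x_bar = sum_x / n
y_bar = sum_y / n

# Step 4: intercept.
b0 = y_bar - b1 * x_bar

print(f"SCx = {sc_x:.6f}, SCy = {sc_y:.6f}, SCxy = {sc_xy:.6f}")
print(f"b1 = {b1:.6f}, b0 = {b0:.6f}")
print(f"Y-hat = {b0:.2f} + {b1:.2f} X")
```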

12.11.2 The Regression Model

Hop Scotch Airlines Regression Equation

\hat{Y}_i = 4.39 + 1.08X_i

where \hat{Y}_i is the predicted number of passengers (in thousands) for a given advertising expenditure X_i (in thousands of dollars).

12.11.3 Interpreting the Results

Intercept (b_0 = 4.39): When advertising is zero (X = 0), the model predicts 4,390 passengers. This can be read as the “baseline” demand without advertising, but with caution: the observed advertising values range from $8,000 to $19,000, so X = 0 lies outside the data and the prediction is an extrapolation.
Slope (b_1 = 1.08): For every $1,000 increase in advertising expenditure, the predicted passenger count increases by approximately 1,080 passengers (1.08 thousand).

This positive slope is consistent with management’s belief: advertising expenditures are positively related to passenger volume.

12.11.4 Making Predictions

Example 1: If advertising expenditure is $10,000 (X = 10):

\hat{Y}_i = 4.39 + 1.08(10) = 4.39 + 10.8 = 15.19

Prediction: Approximately 15,190 passengers

Example 2: If advertising increases to $11,000 (X = 11):

\hat{Y}_i = 4.39 + 1.08(11) = 4.39 + 11.88 = 16.27

Prediction: Approximately 16,270 passengers

Marginal Effect: Increasing advertising by $1,000 raises the prediction by 16.27 - 15.19 = 1.08 thousand = 1,080 additional passengers, confirming our slope interpretation.
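The two predictions can be wrapped in a small helper. The sketch below uses the rounded coefficients from the regression equation; the function names and the range check are illustrative additions, with the range limits (8 to 19) taken from the smallest and largest advertising values in the data table.

```python
# Rounded coefficients from the Hop Scotch regression equation.
B0 = 4.39   # intercept, thousands of passengers
B1 = 1.08   # slope, thousands of passengers per $1,000 of advertising

# Smallest and largest advertising values observed in the data ($1,000s).
X_MIN, X_MAX = 8, 19

def predict_passengers(advertising_thousands):
    """Predicted passengers (in thousands) for a given advertising level."""
    return B0 + B1 * advertising_thousands

def is_extrapolation(advertising_thousands):
    """True when the input lies outside the observed range of X."""
    return not (X_MIN <= advertising_thousands <= X_MAX)

for x in (10, 11, 0):
    note = " (outside observed range)" if is_extrapolation(x) else ""
    print(f"X = {x:2d}: Y-hat = {predict_passengers(x):.2f} thousand{note}")

marginal = predict_passengers(11) - predict_passengers(10)
print(f"marginal effect of +$1,000: {marginal:.2f} thousand passengers")
```

Flagging inputs outside the observed range is a design choice of this sketch: a fitted line says little about advertising levels that never appeared in the sample.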

Figure 12.3: Regression Line for Hop Scotch Airlines: Advertising vs. Passengers

12.11.5 Computer Output Verification

Statistical software, including Python’s statistics libraries, produces regression output automatically. Here’s what typical output would show:

Regression Equation:

PASS = 4.39 + 1.08 ADV

Coefficients:

Predictor    Coef     SE Coef    t-ratio    p-value
Constant    4.3863    0.9913      4.42      0.001
ADV         1.0813    0.0773     13.99      0.000

Model Summary:

s = 0.9068    R-sq = 93.8%    R-sq(adj) = 93.3%

The regression line passes through the middle of the scatter plot, minimizing the sum of squared vertical distances from each point to the line.
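Most entries in the output table can be reproduced with the standard library alone. The sketch below recomputes the coefficients, their standard errors, the t-ratios, s, and both R-squared values for the Hop Scotch data. The standard-error formulas for the coefficients are the usual simple-regression formulas and are assumed here rather than derived in this section; p-values are omitted because they require the t distribution.

```python
import math

# Hop Scotch Airlines data (15 months).
advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]  # ADV
passengers = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]  # PASS
n = len(advertising)

x_bar = sum(advertising) / n
y_bar = sum(passengers) / n

sc_x = sum((x - x_bar) ** 2 for x in advertising)
sc_y = sum((y - y_bar) ** 2 for y in passengers)
sc_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, passengers))

b1 = sc_xy / sc_x
b0 = y_bar - b1 * x_bar

# Unexplained variation and its standard deviation (the "s" in the output).
sce = sc_y - sc_xy ** 2 / sc_x
s = math.sqrt(sce / (n - 2))

# Standard errors of the coefficients (assumed standard formulas).
se_b1 = s / math.sqrt(sc_x)
se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / sc_x)

# t-ratio = coefficient / its standard error.
t_b0 = b0 / se_b0
t_b1 = b1 / se_b1

# R-sq and R-sq(adj), with one predictor.
r_sq = 1 - sce / sc_y
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - 2)

print(f"{'Predictor':<10}{'Coef':>9}{'SE Coef':>10}{'t-ratio':>10}")
print(f"{'Constant':<10}{b0:>9.4f}{se_b0:>10.4f}{t_b0:>10.2f}")
print(f"{'ADV':<10}{b1:>9.4f}{se_b1:>10.4f}{t_b1:>10.2f}")
print(f"s = {s:.4f}    R-sq = {r_sq:.1%}    R-sq(adj) = {r_sq_adj:.1%}")
```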

Business Interpretation

For Hop Scotch Airlines Management:

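Advertising and Passengers: Each additional $1,000 spent on advertising is associated with approximately 1,080 additional passengers. Whether that represents a positive return depends on the revenue per passenger, which these data do not include
Strong Relationship: The model explains 93.8% of the variation in passenger volume (R-sq = 93.8%)
Statistical Significance: The p-value for advertising, reported as 0.000 (that is, below 0.0005), indicates the relationship is highly significant
Predictive Power: Within the observed range of advertising ($8,000 to $19,000), management can use this model for budgeting and forecasting
Strategic Insight: The positive intercept suggests some baseline demand even with zero advertising (4,390 passengers), perhaps from brand recognition or repeat customers, although X = 0 lies outside the observed data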

12.12 Section Exercises

Exercise 12.5 (GPA and Job Offers) The career placement center at State University wants to determine if student grade point averages (GPAs) can explain the number of job offers they receive after graduation. Data for 10 recent graduates:

Student	1	2	3	4	5	6	7	8	9	10
GPA	3.25	2.35	1.02	0.36	3.69	2.65	2.15	1.25	3.88	3.37
Offers	3	3	1	0	5	4	2	2	6	2

Create a scatter diagram for the data
Calculate and interpret the regression model. What does this model tell you about the relationship between GPA and job offers?
If Steve has a GPA of 3.22, how many job offers do you predict he will receive?

Exercise 12.6 (Income and Consumption) An economist at the Florida Department of Human Resources is preparing a study on consumer behavior. Data were collected (in thousands of dollars) to determine if a relationship exists between consumer income and consumption levels.

Consumer	1	2	3	4	5	6	7	8	9	10	11	12
Income	24.3	12.5	31.2	28.0	35.1	10.5	23.2	10.0	8.5	15.9	14.7	15.0
Consumption	16.2	8.5	15.0	17.0	24.2	11.2	15.0	7.1	3.5	11.5	10.7	9.2

Create a scatter diagram for the data
Calculate and interpret the regression model. What does this model tell you about the relationship between consumption and income? What proportion of each additional dollar earned is spent on consumption?
What consumption would the model predict for someone earning $27,500?

Exercise 12.7 (Interest Rates and Housing Sales) A bank in Atlanta specializing in home loans is trying to analyze the real estate market by measuring how well interest rates explain the number of houses sold in the area. Data were compiled for a 10-month period:

Month	1	2	3	4	5	6	7	8	9	10
Interest	12.3	10.5	15.6	9.5	10.5	9.3	8.7	14.2	15.2	12.0
Houses	196	285	125	225	248	303	265	102	105	114

Create a scatter diagram for the data
Calculate and interpret the regression model. What does this model tell you about the relationship between interest rates and housing sales?
If the interest rate is 9.5%, how many houses would be sold according to the model?

Exercise 12.8 (Production Costs) Overland Group produces truck parts used in semi-trailers. The chief accountant wants to develop a regression model to predict costs. Units produced is selected as the predictor variable. Costs are in thousands of dollars, units in hundreds.

Units	12.3	8.3	6.5	4.8	14.6	14.6	14.6	6.5
Cost	6.2	5.3	4.1	4.4	5.2	4.8	5.9	4.2

Create a scatter diagram for the data
Calculate and interpret the regression model. What does the model tell the accountant about the relationship between production and costs?
According to the model, how much would it cost to produce 750 units?

Exercise 12.9 (Distance and Class Absences) Professor Mundane has noticed many students have been absent from class this semester. He believes he can explain this lack of attendance by the distances students live from campus. Eleven students were surveyed about how many miles they must travel to attend class and the number of classes they’ve missed.

Miles	5	6	2	0	9	12	16	5	7	0	8
Absences	2	2	4	5	4	2	5	2	3	1	4

Create a scatter diagram for the data
Calculate and interpret the regression model. What does the professor discover?
How many classes would you miss if you lived 3.2 miles from campus, according to the model?

Exercise 12.10 (Employment Test Scores and Performance Ratings) The personnel director at Bupkus, Inc. obtained data on 100 employees regarding entrance tests administered at hiring and subsequent ratings employees received from supervisors one year later. Test scores ranged from 0 to 10, and ratings used a 5-point system. The director wants to use the regression model to predict rating (R) based on test score (S). The results are:

\begin{aligned} \sum S &= 522 \quad \sum R = 326 \quad \sum SR = 17,325 \\ \sum S^2 &= 28,854 \quad \sum R^2 = 10,781 \end{aligned}

Develop and interpret the regression model. What can the director predict about the rating of an employee who scored 7 on the test?

Note for Students

Keep your calculations from Exercises 12.5 through 12.10 for use throughout the rest of this chapter. By reusing the same data, you will avoid having to recalculate SCx, SCy, and SCxy each time. You will gain additional experience with other problems at the end of the chapter.

12.13 Section Summary

Key Takeaways: Ordinary Least Squares

OLS minimizes the sum of squared vertical distances from data points to the regression line
Regression slope (b_1): Measures the change in Y for each one-unit change in X
Regression intercept (b_0): The predicted value of Y when X = 0
Computational formulas make hand calculations manageable
Precision matters: Carry calculations to 5-6 decimal places to avoid rounding errors
Residuals: The differences between actual and predicted values reveal model fit
Business value: Regression provides quantitative relationships for decision-making

12.14 Standard Error of Estimation

Now that we have a regression model, the next logical question is: How good is it? How well does the model fit the data? The standard error of estimation provides a critical measure of this goodness of fit.

12.14.1 Defining the Standard Error

The standard error of estimation, denoted S_e, measures the typical distance that observed data points fall from the regression line. It’s conceptually similar to the standard deviation, but instead of measuring dispersion around the mean of a single variable, it measures dispersion around the regression line.

Standard Error of Estimation (Conceptual Form)

S_e = \sqrt{\frac{\sum(Y_i - \hat{Y}_i)^2}{n-2}} \quad [11.14]

where:

Y_i = actual observed value
\hat{Y}_i = predicted value from the regression line
n - 2 = degrees of freedom (we lose 2 degrees of freedom because we estimate b_0 and b_1)

Why n-2 Degrees of Freedom?

In regression analysis, we estimate two parameters: the intercept (b_0) and the slope (b_1). Each parameter estimated “uses up” one degree of freedom from our sample of n observations, leaving us with n - 2 degrees of freedom for the error term.

12.14.2 Computational Formula

While the conceptual formula clearly shows what S_e measures, it’s tedious to calculate manually because we’d need to:

1. Calculate \hat{Y}_i for each observation
2. Find each difference (Y_i - \hat{Y}_i)
3. Square each difference
4. Sum all squared differences

A more efficient computational approach uses the sum of squared errors (SCE) and mean squared error (CME):

Sum of Squared Errors

SCE = SCy - \frac{(SCxy)^2}{SCx} \quad [11.15]

Mean Squared Error (CME)

CME = \frac{SCE}{n-2} \quad [11.16]

Standard Error of Estimation (Computational Form)

S_e = \sqrt{CME} \quad [11.17]

12.14.3 Application to Hop Scotch Airlines

Let’s calculate the standard error for the Hop Scotch Airlines example, where:

SCy = 171.733333
SCxy = 148.933333
SCx = 137.733333
n = 15

Step 1: Calculate SCE

\begin{aligned} SCE &= SCy - \frac{(SCxy)^2}{SCx} \\ &= 171.733333 - \frac{(148.933333)^2}{137.733333} \\ &= 171.733333 - \frac{22,181.14}{137.733333} \\ &= 171.733333 - 161.044 \\ &= 10.6893 \end{aligned}

Step 2: Calculate CME

\begin{aligned} CME &= \frac{SCE}{n-2} \\ &= \frac{10.6893}{15-2} \\ &= \frac{10.6893}{13} \\ &= 0.82226 \end{aligned}

Step 3: Calculate S_e

\begin{aligned} S_e &= \sqrt{CME} \\ &= \sqrt{0.82226} \\ &= 0.90678 \approx 0.907 \end{aligned}
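As a check, the standard error can be computed both ways: from the individual residuals (the conceptual form) and from the sums of squares (the computational form). The sketch below does both for the Hop Scotch data; the variable names follow the symbols SCE, CME, and S_e used above.

```python
import math

# Hop Scotch Airlines data (15 months).
advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]  # X
passengers = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]  # Y
n = len(advertising)

x_bar = sum(advertising) / n
y_bar = sum(passengers) / n

sc_x = sum((x - x_bar) ** 2 for x in advertising)
sc_y = sum((y - y_bar) ** 2 for y in passengers)
sc_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, passengers))

b1 = sc_xy / sc_x
b0 = y_bar - b1 * x_bar

# Conceptual form: predict each Y, take the difference, square, and sum.
fitted = [b0 + b1 * x for x in advertising]
sce_conceptual = sum((y - y_hat) ** 2 for y, y_hat in zip(passengers, fitted))

# Computational form: SCE = SCy - (SCxy)^2 / SCx.
sce_computational = sc_y - sc_xy ** 2 / sc_x

# Mean squared error and standard error of estimation.
cme = sce_computational / (n - 2)
s_e = math.sqrt(cme)

print(f"SCE (conceptual)    = {sce_conceptual:.6f}")
print(f"SCE (computational) = {sce_computational:.6f}")
print(f"CME = {cme:.6f}")
print(f"S_e = {s_e:.6f}")
```

The two versions of SCE agree up to floating-point rounding, which is why the shorter computational form is preferred for hand work.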

12.14.4 Interpreting the Standard Error

The standard error is always expressed in the same units as the dependent variable Y. For Hop Scotch Airlines:

S_e = 0.907 thousand passengers = 907 passengers

This means that the typical prediction error is about 907 passengers. The actual number of passengers typically deviates from the regression line prediction by approximately 907 passengers.

12.14.5 The Empirical Rule Applied to Regression

Just as the empirical rule applies to standard deviation for a single variable, it applies to the standard error in regression analysis:

Empirical Rule for Regression

If the errors are normally distributed:

68.3% of observations fall within \pm 1S_e of the regression line
95.5% of observations fall within \pm 2S_e of the regression line
99.7% of observations fall within \pm 3S_e of the regression line

12.14.6 Practical Application

Let’s apply this to a specific prediction. When advertising expenditure is X = 10 thousand dollars:

\hat{Y}_i = 4.39 + 1.08(10) = 15.19 \text{ thousand passengers}

This represents the average number of passengers we’d expect when X = 10.

One Standard Error Band:

Upper bound: 15.19 + 0.907 = 16.10 thousand (16,100 passengers)
Lower bound: 15.19 - 0.907 = 14.28 thousand (14,280 passengers)

Interpretation: When Hop Scotch spends $10,000 on advertising, and assuming the errors are normally distributed:

About 68.3% of the time, passenger count will be between 14,280 and 16,100
About 31.7% of the time, passenger count will fall outside this range (either below 14,280 or above 16,100)
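The empirical rule describes what to expect in the long run, so it is worth seeing how the 15 sample observations actually behave. The sketch below counts how many observed months fall within one and two standard errors of the fitted line; with so few observations the shares need not match the theoretical percentages.

```python
import math

# Hop Scotch Airlines data (15 months).
advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]  # X
passengers = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]  # Y
n = len(advertising)

x_bar = sum(advertising) / n
y_bar = sum(passengers) / n
sc_x = sum((x - x_bar) ** 2 for x in advertising)
sc_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advertising, passengers))

b1 = sc_xy / sc_x
b0 = y_bar - b1 * x_bar

# Residuals and the standard error of estimation.
residuals = [y - (b0 + b1 * x) for x, y in zip(advertising, passengers)]
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Count observations inside the +/- 1 S_e and +/- 2 S_e bands.
within_1 = sum(1 for e in residuals if abs(e) <= s_e)
within_2 = sum(1 for e in residuals if abs(e) <= 2 * s_e)

print(f"S_e = {s_e:.4f}")
print(f"within 1 S_e: {within_1} of {n} ({within_1 / n:.1%})")
print(f"within 2 S_e: {within_2} of {n} ({within_2 / n:.1%})")
```

For these data, 10 of the 15 months (about 67%) fall within one standard error of the line and all 15 fall within two, broadly in line with the rule.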

Figure 12.4: Standard Error Bands Around Regression Line for Hop Scotch Airlines

12.14.7 Standard Error as a Goodness-of-Fit Measure

The standard error provides a quantifiable measure of how well the regression model fits the data:

Smaller S_e → Better Fit: Data points cluster tightly around the regression line. Predictions are more accurate.
Larger S_e → Poorer Fit: Data points are more scattered around the regression line. Predictions are less reliable.

As shown in the visualization above:

When data are tightly clustered around the line, S_e is small (good fit)
When data are widely dispersed, S_e is large (poor fit)

Context Matters for Interpretation

Whether S_e = 0.907 represents a “good” or “poor” fit depends on context:

For Hop Scotch: A typical error of 907 passengers might be acceptable given that monthly passenger volumes range from 13,000 to 24,000
Relative measure: Compare S_e to the range of Y values. Here, the range is (24 - 13) = 11 thousand, so S_e = 0.907 is about 8% of the range
Business decision: Is this level of prediction accuracy sufficient for planning purposes?

12.14.8 Verification with Computer Output

From the computer output shown earlier:

s = 0.9068    R-sq = 93.8%    R-sq(adj) = 93.3%

The computer-calculated standard error (s = 0.9068) matches our hand calculation (S_e = 0.907), confirming our work.

12.15 Section Exercises

Exercise 12.11 (Standard Error for State University) Using your calculations from the GPA and job offers exercise (Exercise 12.5), calculate and interpret the standard error of estimation for State University. Create a graph illustrating the interpretation. How can this be used as a measure of goodness of fit?

Exercise 12.12 (Standard Error for Florida Department of Human Resources) Based on the data from the income and consumption exercise (Exercise 12.6), what is the standard error of estimation for the Florida Department of Human Resources? How would you interpret the results? Use a graph in your interpretation.

Exercise 12.13 (Standard Error for Atlanta Bank) Calculate and interpret the standard error of estimation for the interest rates and housing sales exercise (Exercise 12.7) about the Atlanta bank.

Exercise 12.14 (Standard Error for Overland Group) The Overland Group from the production costs exercise (Exercise 12.8) now wants to know the standard error of estimation for their cost prediction model.

Exercise 12.15 (Standard Error for Professor Mundane) What is the standard error of estimation that Professor Mundane will experience in the distance and absences exercise (Exercise 12.9)?

12.16 Correlation Analysis: Measuring Relationship Strength

The regression model has provided a clear picture of the relationship between Hop Scotch Airlines’ advertising expenditures and the number of brave travelers who line up at the ticket counter. The positive value for b_1 indicates a direct relationship: as advertising increases, so does the number of passengers.

Now it’s useful to obtain a measure of the strength of that relationship. This is the function of the correlation coefficient, developed by Karl Pearson at the end of the 19th century. Sometimes called the Pearson product-moment correlation coefficient, it’s represented by r.

12.16.1 The Correlation Coefficient

Range of the Correlation Coefficient

-1 \leq r \leq +1 \quad [11.18]

The correlation coefficient can assume any value between -1 and +1:

r = -1 (Perfect Negative Correlation): All observations fall on a straight line with negative slope. X and Y move in opposite directions perfectly.
r = 0 (No Linear Correlation): No linear relationship exists between X and Y. A straight line in X gives no help in predicting Y, although a curvilinear relationship may still be present.
r = +1 (Perfect Positive Correlation): All observations fall on a straight line with positive slope. X and Y move in the same direction perfectly.
-1 < r < 0 (Negative Correlation): X and Y tend to move in opposite directions, but the relationship isn’t perfect.
0 < r < +1 (Positive Correlation): X and Y tend to move in the same direction, but the relationship isn’t perfect.

Interpreting Correlation Strength

General guidelines (though context-dependent):

|r| \geq 0.9: Very strong relationship
0.7 \leq |r| < 0.9: Strong relationship
0.5 \leq |r| < 0.7: Moderate relationship
0.3 \leq |r| < 0.5: Weak relationship
|r| < 0.3: Very weak or no relationship

The closer |r| is to 1, the stronger the linear relationship.
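As a small illustration, the guideline thresholds above can be written as a lookup. This is only a sketch: the function name `describe_correlation` is hypothetical, and the cut-offs are the rule-of-thumb values listed above, not a formal standard.

```python
def describe_correlation(r: float) -> tuple[str, str]:
    """Map r to the guideline strength label and its direction."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size >= 0.9:
        strength = "very strong"
    elif size >= 0.7:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    elif size >= 0.3:
        strength = "weak"
    else:
        strength = "very weak or none"
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    return strength, direction

print(describe_correlation(0.9683))   # Hop Scotch value reported in this chapter
print(describe_correlation(-0.42))    # hypothetical value for illustration
```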

Figure 12.5: Examples of Different Correlation Coefficients

12.16.2 Understanding What r Measures

To fully understand what the correlation coefficient measures, we need to develop three measures of deviation. Consider observation 13 from the Hop Scotch data (where X = 16 and Y = 23):

Total Deviation of Y: The amount by which individual Y values vary from their mean \bar{Y}:

(Y_i - \bar{Y}) = 23 - 17.87 = 5.13

Explained Deviation: The difference between what the regression model predicts (\hat{Y}_i) and the mean value of Y (\bar{Y}):

(\hat{Y}_i - \bar{Y}) = 21.67 - 17.87 = 3.80

where \hat{Y}_i = 4.39 + 1.08(16) = 21.67

Unexplained Deviation (Residual): The portion of total deviation that is NOT explained by the regression model (the error):

(Y_i - \hat{Y}_i) = 23 - 21.67 = 1.33

Fundamental Relationship:

\text{Total Deviation} = \text{Explained Deviation} + \text{Unexplained Deviation}

(Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)

12.16.3 Sums of Squares Decomposition

When we square these deviations and sum across all observations, we get:

Sum of Squares Total (SCT)

SCT = \sum(Y_i - \bar{Y})^2 \quad [11.19]

Measures total variation in Y around its mean.

Sum of Squares Regression (SCR)

SCR = \sum(\hat{Y}_i - \bar{Y})^2 \quad [11.20]

Measures variation explained by the regression model.

Sum of Squares Error (SCE)

SCE = \sum(Y_i - \hat{Y}_i)^2 \quad [11.21]

Measures variation NOT explained by the model (residual variation).

Fundamental ANOVA Identity:

SCT = SCR + SCE
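
Squared deviations do not add up like this in general; the identity holds because the cross-product term vanishes for a least-squares line. A short derivation, writing e_i = Y_i - \hat{Y}_i for the residual (a shorthand introduced here for convenience):

\begin{aligned} \sum(Y_i - \bar{Y})^2 &= \sum\left[(\hat{Y}_i - \bar{Y}) + e_i\right]^2 \\ &= \sum(\hat{Y}_i - \bar{Y})^2 + \sum e_i^2 + 2\sum(\hat{Y}_i - \bar{Y})e_i \\ &= SCR + SCE + 2b_1\sum(X_i - \bar{X})e_i \\ &= SCR + SCE \end{aligned}

The third line uses \hat{Y}_i - \bar{Y} = b_1(X_i - \bar{X}), and the last line uses two properties of OLS residuals that follow from the least-squares normal equations: \sum e_i = 0 and \sum X_i e_i = 0.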

12.16.4 Calculating the Correlation Coefficient

The correlation coefficient compares explained variation to total variation:

Correlation Coefficient (Conceptual)

r = \sqrt{\frac{\text{Explained Variation}}{\text{Total Variation}}} = \sqrt{\frac{SCR}{SCT}} \quad [11.22]

This form gives only the magnitude of r; its sign is taken from the slope b_1. It is also tedious to calculate manually. A more convenient computational formula is:

Correlation Coefficient (Computational Form)

r = \frac{SCxy}{\sqrt{(SCx)(SCy)}} \quad [11.23]

This formula uses the sums of squares we’ve already calculated for the regression coefficients!

12.16.5 Application to Hop Scotch Airlines

For the Hop Scotch example:
- SCxy = 148.933333
- SCx = 137.733333
- SCy = 171.733333

\begin{aligned} r &= \frac{SCxy}{\sqrt{(SCx)(SCy)}} \\ &= \frac{148.933333}{\sqrt{(137.733333)(171.733333)}} \\ &= \frac{148.933333}{\sqrt{23,653.40}} \\ &= \frac{148.933333}{153.797} \\ &\approx 0.968 \end{aligned}

Interpretation: r \approx 0.968 indicates a very strong positive linear relationship between advertising expenditures and passenger volume. Months with higher advertising almost always show higher passenger counts, and the points lie close to a straight line.
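
The computational form is easy to check by machine. A minimal sketch, using the three sums quoted above; the helper name `correlation_from_sums` is an assumption for illustration. Because SCxy carries the sign of the relationship, this form returns the correct sign of r automatically.

```python
import math

def correlation_from_sums(sc_xy: float, sc_x: float, sc_y: float) -> float:
    """Computational form: r = SCxy / sqrt(SCx * SCy)."""
    return sc_xy / math.sqrt(sc_x * sc_y)

r = correlation_from_sums(148.933333, 137.733333, 171.733333)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```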

12.17 Coefficient of Determination (r^2)

Perhaps the most important measure of goodness of fit is the coefficient of determination, denoted r^2.

12.17.1 Definition and Calculation

Coefficient of Determination (Conceptual)

r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SCR}{SCT} \quad [11.24]

Computational Formula

r^2 = \frac{(SCxy)^2}{(SCx)(SCy)} \quad [11.25]

Or simply:

r^2 = (r)^2

Square the correlation coefficient!

12.17.2 Application to Hop Scotch Airlines

Method 1: Using the formula

\begin{aligned} r^2 &= \frac{(SCxy)^2}{(SCx)(SCy)} \\ &= \frac{(148.933333)^2}{(137.733333)(171.733333)} \\ &= \frac{22,181.14}{23,653.40} \\ &= 0.93776 \approx 0.938 \end{aligned}

Method 2: Squaring the correlation coefficient

r^2 = (0.9683)^2 = 0.9376 \approx 0.938

Both methods give us r^2 = 0.938 or 93.8%.

12.17.3 Interpreting r^2

The coefficient of determination reveals what percentage of the variation in Y is explained by the regression on X.

For Hop Scotch Airlines:

Business Interpretation of r^2 = 0.938

93.8% of the variation in passenger volume is explained by advertising expenditure.

This means:
- 93.8% of the variation in passenger numbers is accounted for by the linear relationship with advertising spending
- 6.2% of passenger variation is due to other factors (e.g., seasonality, competitors, economic conditions, random variation)
- The regression model captures nearly all the systematic variation in the data

This high r^2 indicates an excellent fit. The model is highly effective for prediction and decision-making.

12.17.4 Important Cautions About r^2

Critical Limitations of r^2

Linear Relationships Only: r^2 measures only linear relationships. Two variables can have r^2 = 0 yet still be strongly related in a curvilinear way.
Correlation ≠ Causation: The 93.8% does NOT mean advertising causes 93.8% of the passenger changes. It only means they are associated.
Context Dependent: What constitutes a “good” r^2 varies by field:
- Physical sciences: Often expect r^2 > 0.95
- Social sciences: r^2 > 0.60 may be considered strong
- Business/economics: r^2 > 0.70 typically indicates good fit
Sample Size Matters: With very large samples, even weak relationships can yield statistically significant r^2 values
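
The first caution can be seen with a small constructed example. The sketch below uses hypothetical data in which Y is determined exactly by X through a parabola, yet the linear correlation is zero.

```python
import math

# Hypothetical data: Y depends exactly on X, but through a parabola
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sc_x = sum((xi - x_bar) ** 2 for xi in x)
sc_y = sum((yi - y_bar) ** 2 for yi in y)
sc_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = sc_xy / math.sqrt(sc_x * sc_y)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

The positive and negative cross-products cancel because the X values are symmetric about their mean, so a straight line explains none of the variation even though the relationship is perfect.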

12.17.5 Computer Output Verification

From the Python output shown earlier:

s = 0.9068    R-sq = 93.8%    R-sq(adj) = 93.3%

The computer-calculated r^2 = 93.8\% matches our hand calculation exactly!

Note: “R-sq(adj)” is the adjusted r^2, which adjusts for the number of predictors in the model. This becomes important in multiple regression (covered in a later chapter).

12.18 Section Exercises

Exercise 12.16 (Coefficient of Determination as Goodness of Fit) How can the coefficient of determination be used as a measure of goodness of fit? Create a graph to illustrate your explanation.

Exercise 12.17 (GPA and Job Offers Correlation) What is the strength of the relationship between GPA and job offers in Exercise 9? Calculate and interpret both r and r^2.

Exercise 12.18 (Income and Consumption Correlation) Calculate and interpret the correlation coefficient and coefficient of determination for the Florida Department of Human Resources data in Exercise 10.

Exercise 12.19 (Housing Sales Explained by Interest Rates) How much of the change in houses sold can be explained by the interest rate in Exercise 11? What does this tell you about the strength of the relationship?

Exercise 12.20 (Professor Mundane’s Model Strength) What is the strength of Professor Mundane’s model used in Exercise 13 to explain student absences? How much of the variation in absences is explained by distance from campus?

12.19 Section Summary

Key Takeaways: Model Evaluation

Standard Error of Estimation (S_e):
- Measures typical prediction error in the same units as Y
- Smaller S_e indicates better fit
- Can be interpreted using the empirical rule (68%-95%-99.7%)
- Provides absolute measure of scatter around regression line

Correlation Coefficient (r):
- Measures strength and direction of linear relationship
- Range: -1 \leq r \leq +1
- Sign indicates direction (positive/negative)
- Magnitude indicates strength (closer to ±1 is stronger)

Coefficient of Determination (r^2):
- Most important goodness-of-fit measure
- Percentage of variation in Y explained by X
- Range: 0 \leq r^2 \leq 1 (often expressed as percentage)
- Higher r^2 means better predictive power
- Remember: Correlation does NOT imply causation!

For Hop Scotch Airlines:
- S_e = 0.907 (typical error of 907 passengers)
- r = 0.9683 (very strong positive relationship)
- r^2 = 0.938 (93.8% of passenger variation explained)
- Conclusion: Excellent model for business decision-making

12.20 Hypothesis Testing for Regression Parameters

The statistical results suggest a relationship between passengers and advertising for Hop Scotch Airlines. The non-zero values for the regression coefficient (b_1 = 1.08) and correlation coefficient (r = 0.968) indicate that as advertising expenditures change, the number of passengers changes.

However, these results are based on a sample of only n = 15 observations. As always, we must ask: Does a relationship exist at the population level? It could be that due to sampling error, the population parameters are actually zero. We must test the population parameters to ensure that our sample findings differ significantly from zero.

12.20.1 A. Testing the Regression Slope (\beta_1)

If the slope of the true but unknown population regression line is zero, there is no relationship between passengers and advertising—contrary to our sample results.

If we were to create a scatter diagram for the population of all possible (X, Y) data points, it might look like the figure below, showing no pattern. When collecting our sample, we might have included only 15 observations from within a particular region that falsely suggests a positive relationship.

We must test the hypothesis:

\begin{aligned} H_0: \beta_1 &= 0 \quad \text{(No relationship)} \\ H_A: \beta_1 &\neq 0 \quad \text{(Relationship exists)} \end{aligned}

The t-Test for the Regression Coefficient

t = \frac{b_1 - \beta_1}{S_{b_1}} \quad [11.26]

with n - 2 degrees of freedom.

where S_{b_1} is the standard error of the regression coefficient, which recognizes that different samples produce different values for b_1.

Standard Error of the Regression Coefficient

S_{b_1} = \frac{S_e}{\sqrt{SCx}} \quad [11.27]

Why Does b_1 Have Sampling Variability?

If we took different samples of n = 15 months and calculated a regression equation for each, we’d likely get different values for b_0 and b_1 each time. The standard error S_{b_1} measures this sampling variability in the slope estimate.

If \beta_1 = 0 in the population, the sample values b_1 would be distributed around zero as shown below.

12.20.2 Application to Hop Scotch Airlines

Step 1: Calculate the Standard Error

Given:
- S_e = 0.907
- SCx = 137.733333

\begin{aligned} S_{b_1} &= \frac{S_e}{\sqrt{SCx}} \\ &= \frac{0.907}{\sqrt{137.733333}} \\ &= \frac{0.907}{11.735} \\ &= 0.07726 \end{aligned}

Step 2: Calculate the Test Statistic

\begin{aligned} t &= \frac{b_1 - \beta_1}{S_{b_1}} \\ &= \frac{1.0813 - 0}{0.07726} \\ &= 13.995 \end{aligned}

Step 3: Determine the Critical Value

For \alpha = 0.05 (5% significance level) with n - 2 = 15 - 2 = 13 degrees of freedom:

t_{0.025, 13} = \pm 2.160

(Two-tailed test, so we split \alpha in half)

Step 4: Make Decision

Decision Rule: Do not reject H_0 if t is between \pm 2.160; otherwise reject H_0.

Conclusion: Since t = 13.995 > 2.160, we reject the null hypothesis. At the 5% significance level, there appears to be a significant relationship between passengers and advertising.

This is confirmed by the Python output (shown earlier), which reports a p-value of 0.000 (that is, less than 0.0005 after rounding), indicating extremely strong evidence against H_0.
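
The four steps can also be traced in a few lines of code. This is a sketch under stated assumptions: it works from the summary values quoted in this chapter, takes S_e = 0.9068 from the computer output, and uses the tabulated critical value 2.160 instead of computing it.

```python
import math

n = 15
sc_x = 137.733333
sc_xy = 148.933333
s_e = 0.9068            # standard error of estimation from the computer output
t_crit = 2.160          # t(0.025, 13), taken from the t-table as in Step 3

b1 = sc_xy / sc_x                    # sample slope
s_b1 = s_e / math.sqrt(sc_x)         # standard error of the slope
t_stat = (b1 - 0) / s_b1             # H0: beta_1 = 0

reject_h0 = abs(t_stat) > t_crit
print(f"b1 = {b1:.4f}, S_b1 = {s_b1:.5f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```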

Figure 12.6: Hypothesis Test for Regression Slope β₁

What If We Fail to Reject H_0?

If we had not rejected the null hypothesis, we would conclude that advertising and passengers are not significantly related. In that case, we would discard this model and use a different explanatory variable instead of advertising.

Since we did reject H_0, we have strong evidence that the relationship is real (not due to chance), and advertising is a useful predictor of passenger volume.

12.20.3 Confidence Interval for \beta_1

Since we’ve rejected the hypothesis that \beta_1 = 0, the natural question is: “What IS its value?”

We can answer this by calculating a confidence interval for \beta_1:

Confidence Interval for the Regression Slope

\text{C.I. for } \beta_1 = b_1 \pm t_{\alpha/2, n-2} \cdot S_{b_1} \quad [11.28]

For a 95% Confidence Level:

\begin{aligned} \text{C.I. for } \beta_1 &= 1.08 \pm (2.160)(0.07726) \\ &= 1.08 \pm 0.167 \\ 0.913 &\leq \beta_1 \leq 1.247 \end{aligned}

Interpretation: We can be 95% confident that the true population regression coefficient is between 0.913 and 1.247. This means each $1,000 increase in advertising expenditure increases passenger volume by somewhere between 913 and 1,247 passengers.
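
A minimal sketch of the same interval, using the rounded values from the text. The function name `slope_confidence_interval` is hypothetical, and the critical value is read from the t-table, not computed.

```python
def slope_confidence_interval(b1: float, s_b1: float, t_crit: float) -> tuple[float, float]:
    """Return (lower, upper) for b1 +/- t_crit * S_b1."""
    margin = t_crit * s_b1
    return b1 - margin, b1 + margin

# Values as rounded in the text; t_crit = t(0.025, 13) from the t-table
lower, upper = slope_confidence_interval(b1=1.08, s_b1=0.07726, t_crit=2.160)
print(f"{lower:.3f} <= beta_1 <= {upper:.3f}")
```

Zero lies well outside this interval, which agrees with the decision to reject H_0: \beta_1 = 0 at the 5% level.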

12.20.4 B. Testing the Correlation Coefficient (\rho)

Much of the work done for testing the regression coefficient can be applied to testing the correlation coefficient. The purpose and rationale are very similar.

Since our correlation analysis is based on sample data, sampling error might lead to inappropriate conclusions. The sample data produced a non-zero correlation coefficient of r = 0.9683, but this could be due to sampling error. Perhaps the population correlation is actually zero, and a misleading sample caused us to assume a relationship incorrectly.

Therefore, we must test:

\begin{aligned} H_0: \rho &= 0 \quad \text{(No correlation in population)} \\ H_A: \rho &\neq 0 \quad \text{(Correlation exists in population)} \end{aligned}

where \rho (Greek letter rho) is the population correlation coefficient.

The t-Test for the Correlation Coefficient

t = \frac{r - \rho}{S_r} \quad [11.29]

where S_r is the standard error of the correlation coefficient.

Standard Error of the Correlation Coefficient

S_r = \sqrt{\frac{1 - r^2}{n - 2}} \quad [11.30]

12.20.5 Application to Hop Scotch Airlines

Step 1: Calculate the Standard Error

\begin{aligned} S_r &= \sqrt{\frac{1 - r^2}{n - 2}} \\ &= \sqrt{\frac{1 - 0.93776}{15 - 2}} \\ &= \sqrt{\frac{0.06224}{13}} \\ &= \sqrt{0.004787} \\ &= 0.0692 \end{aligned}

Step 2: Calculate the Test Statistic

\begin{aligned} t &= \frac{r - \rho}{S_r} \\ &= \frac{0.9683 - 0}{0.0692} \\ &\approx 13.995 \quad \text{(carrying unrounded intermediate values)} \end{aligned}

Step 3: Decision

Using \alpha = 0.05 with df = 13: Critical value t_{0.025, 13} = \pm 2.160

Decision Rule: Do not reject if t is between \pm 2.160; otherwise reject.

Conclusion: Since t = 13.995 > 2.160, we reject the null hypothesis. At the 5% significance level, we conclude that the population correlation coefficient is not zero—passengers and advertising are significantly correlated.
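
The test for \rho can be sketched the same way. The code below is an illustration that recomputes r from the sums quoted earlier so that no rounding is carried into the test statistic; the critical value is again taken from the t-table.

```python
import math

n = 15
r = 148.933333 / math.sqrt(137.733333 * 171.733333)   # sample correlation
t_crit = 2.160                                        # t(0.025, 13) from the t-table

s_r = math.sqrt((1 - r ** 2) / (n - 2))   # standard error of r
t_stat = (r - 0) / s_r                    # H0: rho = 0

print(f"r = {r:.4f}, S_r = {s_r:.4f}, t = {t_stat:.3f}, reject H0: {abs(t_stat) > t_crit}")
```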

Important Observation: t Values Are Identical!

Notice that the t-value of 13.995 is the same for both:
- Testing \beta_1 = 0
- Testing \rho = 0

This is not a coincidence. In simple linear regression (one predictor), these two tests are mathematically equivalent. They will always give identical results.

However, in multiple regression (covered in a later chapter), this equivalence does not hold. That’s why it’s important to become familiar with both tests.

12.20.6 Equivalence of the Two Tests

For simple linear regression:

F = t^2

where F is the ANOVA F-ratio developed in Section 12.24.

Also:
- Testing H_0: \beta_1 = 0 is equivalent to testing H_0: \rho = 0
- Both test whether there’s a significant linear relationship between X and Y
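
A brief algebraic sketch of why the two t statistics coincide. It assumes the standard definition S_e^2 = SCE/(n-2) and uses SCE = SCy - (SCxy)^2/SCx = SCy(1 - r^2); the subscripts on t are added here only to tell the two statistics apart.

\begin{aligned} t_{b_1} = \frac{b_1}{S_{b_1}} &= \frac{SCxy/SCx}{S_e/\sqrt{SCx}} = \frac{SCxy}{S_e\sqrt{SCx}} \\ &= \frac{SCxy}{\sqrt{SCx}\,\sqrt{SCy(1 - r^2)/(n-2)}} \\ &= \frac{SCxy}{\sqrt{(SCx)(SCy)}} \cdot \sqrt{\frac{n-2}{1-r^2}} \\ &= \frac{r}{\sqrt{(1-r^2)/(n-2)}} = \frac{r}{S_r} = t_r \end{aligned}

Squaring the second expression gives (SCxy)^2/(SCx \cdot S_e^2) = SCR/CME, which is the F-ratio, so F = t^2.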

12.21 Confidence and Prediction Intervals

Regression analysis can forecast and predict values for the dependent variable. Once we’ve determined the regression equation, we can develop a point estimate for the dependent variable by substituting a given value for X and solving for Y.

However, as we’ve seen throughout this textbook, interval estimates are often preferred over point estimates. There are at least two types of interval estimates commonly used in regression:

Confidence Interval for the Mean \mu_{Y|X}: The average value of Y for all cases where X equals a specific value
Prediction Interval for an Individual Y_X: A single value of Y when X equals a specific value

12.21.1 A. Confidence Interval for the Mean of Y Conditional on X

Suppose we want to develop an interval estimate for the conditional mean of Y, denoted \mu_{Y|X}. This is the population mean for all values of Y, given that X equals a specific value.

Example: If advertising is held at X = 10 ($10,000) in many different months, we’d obtain many different values of Y (passenger counts). The interval we’re calculating estimates the mean of all those Y values.

Two Interpretations (for a 95% confidence interval):

First interpretation: If we set X equal to the same amount many times, we’d obtain many different values of Y. We can be 95% confident that the mean of those Y values (\mu_{Y|X}) will fall within the specified interval.
Second interpretation: If we took many samples of (X, Y) values and constructed a confidence interval based on each sample, 95% of them would contain the true mean value \mu_{Y|X}.

To calculate this interval, we must first determine S_{\hat{Y}}, the standard error of the conditional mean.

Standard Error of the Conditional Mean

S_{\hat{Y}} = S_e \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \quad [11.31]

where:
- S_e = standard error of estimation
- X_i = the given value for the independent variable
- \bar{X} = mean of X values in the sample
- SCx = sum of squares for X

The confidence interval for the conditional mean is:

Confidence Interval for \mu_{Y|X}

\text{C.I. for } \mu_{Y|X} = \hat{Y}_i \pm t_{\alpha/2, n-2} \cdot S_{\hat{Y}} \quad [11.32]

where \hat{Y}_i is the point estimate from the regression equation.

12.21.2 Application: Hop Scotch Airlines (Confidence Interval)

Question: What is the average passenger count when advertising = $10,000?

Given:
- X_i = 10
- \bar{X} = 12.47
- S_e = 0.907
- SCx = 137.733333
- n = 15

Step 1: Calculate Standard Error of the Mean

\begin{aligned} S_{\hat{Y}} &= S_e \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \\ &= 0.907 \sqrt{\frac{1}{15} + \frac{(10 - 12.47)^2}{137.733333}} \\ &= 0.907 \sqrt{0.06667 + \frac{6.1009}{137.733333}} \\ &= 0.907 \sqrt{0.06667 + 0.04430} \\ &= 0.907 \sqrt{0.11097} \\ &= 0.907(0.3331) \\ &= 0.302 \end{aligned}

Step 2: Calculate Point Estimate

\hat{Y}_i = 4.39 + 1.08(10) = 15.19

Step 3: Construct 95% Confidence Interval

For 95% confidence with df = 13: t_{0.025, 13} = 2.160

\begin{aligned} \text{C.I. for } \mu_{Y|X} &= 15.19 \pm (2.160)(0.302) \\ &= 15.19 \pm 0.652 \\ 14.54 &\leq \mu_{Y|X} \leq 15.84 \end{aligned}

Interpretation: Hop Scotch can be 95% confident that if they spend $10,000 on advertising many times, the average passenger count across all those occasions will be between 14,540 and 15,840 passengers.
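
The three steps can be collected into one function. This is a sketch, not a library routine: `mean_response_interval` is a hypothetical name, the inputs are the rounded Hop Scotch values from the text, and the critical value comes from the t-table.

```python
import math

def mean_response_interval(x0, b0, b1, s_e, n, x_bar, sc_x, t_crit):
    """Confidence interval for the conditional mean of Y at X = x0."""
    y_hat = b0 + b1 * x0
    s_yhat = s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sc_x)
    margin = t_crit * s_yhat
    return y_hat - margin, y_hat + margin

# Rounded Hop Scotch values from the text; t_crit = t(0.025, 13) from the t-table
low, high = mean_response_interval(
    x0=10, b0=4.39, b1=1.08, s_e=0.907, n=15,
    x_bar=12.47, sc_x=137.733333, t_crit=2.160,
)
print(f"{low:.2f} <= mu_Y|X <= {high:.2f}")
```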

12.21.3 B. Prediction Interval for a Single Value of Y

The confidence interval developed above estimates the mean of all Y values when X equals a given amount. Often, it’s more useful to construct an interval for a single value of Y obtained when X is set to a given value just once.

Example: Hop Scotch might be interested in predicting the number of customers next month if they invest $10,000 in advertising. This differs from predicting the average across many months.

Key Insight: Individual values are more dispersed than means. Means tend to cluster around the center of the data, making them easier to predict. Individual values scatter more widely, making them harder to predict.

Therefore, a 95% prediction interval for a single value of Y must be wider than the 95% confidence interval for the conditional mean.

Two Interpretations (for a 95% prediction interval):

First interpretation: If we set X equal to some amount just once, we’ll obtain a single resulting value of Y. We can be 95% confident that this single value of Y falls within the specified interval.
Second interpretation: If we took many samples and used each to construct a prediction interval, 95% of them would contain the true value for Y.

To calculate the prediction interval, we need the standard error of the forecast, S_{Y_X}.

Standard Error of the Forecast

S_{Y_X} = S_e \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \quad [11.33]

Notice the extra “1” under the square root compared to formula [11.31]. This accounts for the additional variability of individual values versus means.

Prediction Interval for Y_X

\text{P.I. for } Y_X = \hat{Y}_i \pm t_{\alpha/2, n-2} \cdot S_{Y_X} \quad [11.34]

12.21.4 Application: Hop Scotch Airlines (Prediction Interval)

Question: What will passenger count be next month if advertising = $10,000?

Step 1: Calculate Standard Error of the Forecast

\begin{aligned} S_{Y_X} &= S_e \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \\ &= 0.907 \sqrt{1 + \frac{1}{15} + \frac{(10 - 12.47)^2}{137.733333}} \\ &= 0.907 \sqrt{1 + 0.06667 + 0.04430} \\ &= 0.907 \sqrt{1.11097} \\ &= 0.907(1.054) \\ &= 0.956 \end{aligned}

Step 2: Construct 95% Prediction Interval

\begin{aligned} \text{P.I. for } Y_X &= 15.19 \pm (2.160)(0.956) \\ &= 15.19 \pm 2.065 \\ 13.13 &\leq Y_X \leq 17.26 \end{aligned}

Interpretation: Hop Scotch can be 95% confident that if they spend $10,000 on advertising next month, the passenger count for that specific month will be between 13,130 and 17,260 passengers.
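
The prediction interval differs from the confidence interval only by the extra 1 under the square root. A minimal sketch, with the hypothetical function name `prediction_interval` and the same rounded inputs as in the text:

```python
import math

def prediction_interval(x0, b0, b1, s_e, n, x_bar, sc_x, t_crit):
    """Prediction interval for a single Y at X = x0 (note the extra 1)."""
    y_hat = b0 + b1 * x0
    s_forecast = s_e * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sc_x)
    margin = t_crit * s_forecast
    return y_hat - margin, y_hat + margin

# Rounded Hop Scotch values from the text; t_crit = t(0.025, 13) from the t-table
low, high = prediction_interval(
    x0=10, b0=4.39, b1=1.08, s_e=0.907, n=15,
    x_bar=12.47, sc_x=137.733333, t_crit=2.160,
)
print(f"{low:.3f} <= Y_X <= {high:.3f}, width = {high - low:.2f}")
```

Because the code carries full precision between steps, its endpoints can differ from the hand calculation in the last displayed digit.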

12.21.5 Comparing the Two Intervals

For X = 10:
- Confidence interval for mean: [14.54, 15.84] (width = 1.30)
- Prediction interval for individual: [13.13, 17.26] (width = 4.13)

The prediction interval is much wider because individual values are less predictable than means!

Figure 12.7: Confidence Bands vs. Prediction Bands for Hop Scotch Airlines

Key Differences: Confidence vs. Prediction Intervals

Feature	Confidence Interval	Prediction Interval
Estimates	Mean of Y (average)	Individual Y (single value)
Width	Narrower	Wider
Formula	S_{\hat{Y}} = S_e\sqrt{\frac{1}{n} + \frac{(X_i-\bar{X})^2}{SCx}}	S_{Y_X} = S_e\sqrt{1 + \frac{1}{n} + \frac{(X_i-\bar{X})^2}{SCx}}
Use Case	“What’s the average outcome?”	“What’s one specific outcome?”
Example	Average passengers across many months	Passengers next month
Certainty	More certain (means are stable)	Less certain (individuals vary)

12.21.6 C. Factors Influencing Interval Width

Given a confidence level, we prefer to minimize the interval width. The narrower the interval, the more precise our prediction. Three factors influence width:

1. Dispersion of Original Data (S_e): More dispersed data → Larger S_e → Wider interval
2. Sample Size (n): Larger sample → Smaller standard error → Narrower interval
3. Distance from Mean (|X_i - \bar{X}|): Farther from \bar{X} → Wider interval (the fitted line always passes through (\bar{X}, \bar{Y}), so any error in the estimated slope is magnified as X_i moves away from \bar{X})
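
The third factor can be made concrete by evaluating both half-widths at several values of X. This is an illustrative sketch: the helper `half_widths` is hypothetical, the Hop Scotch summary values come from the text, and the X values other than 10 are chosen only to show the pattern.

```python
import math

def half_widths(x0, s_e, n, x_bar, sc_x, t_crit):
    """Return (confidence half-width, prediction half-width) at X = x0."""
    spread = 1 / n + (x0 - x_bar) ** 2 / sc_x
    ci = t_crit * s_e * math.sqrt(spread)
    pi = t_crit * s_e * math.sqrt(1 + spread)
    return ci, pi

# Hop Scotch summary values from the text; X values chosen for illustration
for x0 in (12.47, 10, 5, 20):
    ci, pi = half_widths(x0, s_e=0.907, n=15, x_bar=12.47, sc_x=137.733333, t_crit=2.160)
    print(f"X = {x0:>5}: CI half-width = {ci:.3f}, PI half-width = {pi:.3f}")
```

Both half-widths are smallest at X = \bar{X} and grow as X moves away in either direction, which is why the bands in a plot of this kind bow outward.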

12.22 Section Exercises

Exercise 12.21 (GPA as Predictor of Job Offers) Using the appropriate hypothesis test at the 5% level, is GPA a significant explanatory variable for job offers in Exercise 9? Be sure to show all four hypothesis testing steps.

Exercise 12.22 (Income and Consumption Significance) In Exercise 10, is the relationship between income and consumption significant? Test the hypothesis at a 1% significance level.

Exercise 12.23 (Interest Rate Significance (Two Tests)) In Exercise 11, is the interest rate significant at the 10% level? a. Test the significance of the regression coefficient at 10% b. Test the significance of the correlation coefficient at 10% c. How do these two tests differ?

Exercise 12.24 (Confidence Interval for Average Job Offers) The career placement center at State University (Exercise 9) wants a 95% interval estimate for the average number of job offers that many graduates will receive if they have a GPA of 2.69. Calculate and interpret the appropriate interval.

Exercise 12.25 (Prediction Interval for Fred’s Job Offers) Fred has a GPA of 2.69 (see Exercise 9 and the previous exercise). Calculate the 95% interval for the number of job offers Fred will receive. Why does this differ from your answer to the previous exercise?

Exercise 12.26 (Confidence Interval for Average Consumption) If the economist at the Florida Department of Human Resources (Exercise 10) identifies many consumers with incomes of $14,200, what is the 99% interval for the average consumption of all those consumers?

Exercise 12.27 (Prediction Interval for Individual Consumer) If the economist from Exercise 10 identifies one consumer with an income of $14,500: a. What is the point estimate of their consumption? b. What is the 99% interval estimate of their consumption?

12.23 Section Summary

Key Takeaways: Hypothesis Testing and Intervals

Hypothesis Testing for \beta_1:
- Tests whether slope is significantly different from zero
- Uses t = \frac{b_1 - \beta_1}{S_{b_1}} with df = n-2
- If we reject H_0: \beta_1 = 0, the relationship is statistically significant

Hypothesis Testing for \rho:
- Tests whether correlation is significantly different from zero
- Uses t = \frac{r - \rho}{S_r} with df = n-2
- In simple regression, gives identical result to testing \beta_1

Confidence Interval for \mu_{Y|X} (Mean):
- Estimates average Y when X equals a specific value
- Narrower interval (means are more predictable)
- Formula: \hat{Y}_i \pm t \cdot S_{\hat{Y}}

Prediction Interval for Y_X (Individual):
- Estimates a single Y value when X equals a specific value
- Wider interval (individuals more variable)
- Formula: \hat{Y}_i \pm t \cdot S_{Y_X}

For Hop Scotch at X = 10:
- Point estimate: 15.19 thousand passengers
- 95% C.I. for mean: [14.54, 15.84]
- 95% P.I. for individual: [13.13, 17.26]
- Both intervals widen as we move away from \bar{X}

12.24 ANOVA for Regression

The regression model presents a description of the nature of the relationship between dependent and independent variables. We used a t-test to test the hypothesis that \beta_1 = 0. A similar test can be performed using Analysis of Variance (ANOVA) based on the F-test.

The ANOVA procedure measures the amount of variation in the regression model. As noted earlier, there are three sources of variation in a regression model:
- Variation explained by regression (SCR)
- Unexplained variation due to error (SCE)
- Total variation (SCT), which is the sum of the first two

This can be summarized in an ANOVA table.

12.24.1 General ANOVA Table for Regression

Table 11.5: General ANOVA Table

Source of Variation	Sum of Squares	Degrees of Freedom	Mean Square	F-Ratio
Regression	SCR	k	CMR = \frac{SCR}{k}	\frac{CMR}{CME}
Error	SCE	n - k - 1	CME = \frac{SCE}{n-k-1}
Total	SCT	n - 1

where k is the number of independent variables (predictors) in the model.

For Simple Linear Regression

In simple linear regression (one predictor), k = 1, so:
- Regression degrees of freedom: df_R = 1
- Error degrees of freedom: df_E = n - 2
- Total degrees of freedom: df_T = n - 1

12.24.2 Interpreting the F-Ratio

The ratio \frac{CMR}{CME} provides a measure of model accuracy because it compares:
- Numerator (CMR): Average squared deviation explained by the model
- Denominator (CME): Average squared deviation that remains unexplained

Higher F-ratio → Better Model: The model has significant explanatory power.
Lower F-ratio → Poorer Model: The model explains little more than random variation.

To determine if the F-ratio is “high enough,” we compare it with a critical value from the F-distribution table.

12.24.3 Computational Formulas

Sum of Squares for Regression

SCR = \frac{(SCxy)^2}{SCx} \quad [11.35]

Sum of Squares for Error (from formula 11.15)

SCE = SCy - \frac{(SCxy)^2}{SCx}

Sum of Squares Total

SCT = SCR + SCE

12.24.4 Application to Hop Scotch Airlines

Using the Hop Scotch data:

Step 1: Calculate SCR

\begin{aligned} SCR &= \frac{(SCxy)^2}{SCx} \\ &= \frac{(148.933333)^2}{137.733333} \\ &= \frac{22,181.14}{137.733333} \\ &= 161.044 \end{aligned}

Step 2: Calculate SCE (already computed earlier)

SCE = 10.689

Step 3: Calculate SCT

SCT = SCR + SCE = 161.044 + 10.689 = 171.733

Step 4: Calculate Mean Squares

CMR = \frac{SCR}{1} = \frac{161.044}{1} = 161.044

CME = \frac{SCE}{13} = \frac{10.689}{13} = 0.822

Step 5: Calculate F-Ratio

F = \frac{CMR}{CME} = \frac{161.044}{0.822} = 195.89

12.24.5 ANOVA Table for Hop Scotch Airlines

Table 11.6: ANOVA Table for Hop Scotch Airlines

Source of Variation	Sum of Squares	Degrees of Freedom	Mean Square	F-Ratio
Regression	161.04	1	161.04	195.89
Error	10.69	13	0.82
Total	171.73	14
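
The table can be rebuilt from the three sums of squares. The sketch below is an illustration using the Hop Scotch values quoted in this chapter; it carries full precision, so its F-ratio comes out near 195.86, while the 195.89 in the table reflects rounding of intermediate values.

```python
n, k = 15, 1                 # observations, number of predictors
sc_x = 137.733333
sc_y = 171.733333            # equals SCT
sc_xy = 148.933333

scr = sc_xy ** 2 / sc_x      # explained variation
sce = sc_y - scr             # unexplained variation
cmr = scr / k                # mean square regression
cme = sce / (n - k - 1)      # mean square error
f_ratio = cmr / cme

print(f"{'Source':<12}{'SS':>10}{'df':>5}{'MS':>10}{'F':>10}")
print(f"{'Regression':<12}{scr:>10.2f}{k:>5}{cmr:>10.2f}{f_ratio:>10.2f}")
print(f"{'Error':<12}{sce:>10.2f}{n - k - 1:>5}{cme:>10.2f}")
print(f"{'Total':<12}{sc_y:>10.2f}{n - 1:>5}")
```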

12.24.6 Hypothesis Test Using F

Hypotheses:

\begin{aligned} H_0: \beta_1 &= 0 \quad \text{(Model has no explanatory power)} \\ H_A: \beta_1 &\neq 0 \quad \text{(Model is significant)} \end{aligned}

Decision Rule (at \alpha = 0.05):

Critical value: F_{0.05, 1, 13} = 4.67

Do not reject H_0 if F \leq 4.67; otherwise reject H_0.

Conclusion: Since F = 195.89 > 4.67, we reject the null hypothesis. At the 5% significance level, we conclude that advertising has significant explanatory power.

Relationship Between F-test and t-test

In simple linear regression, the F-test and t-test are mathematically equivalent:

F = t^2

For Hop Scotch:
- t = 13.995
- t^2 = (13.995)^2 = 195.86 \approx 195.89 = F (the small gap comes from rounding intermediate values)

Both tests produce the same conclusion. However, in multiple regression (covered in a later chapter):
- The F-test provides a general test of whether any independent variables have explanatory power
- Individual t-tests determine which specific variables are significant

12.25 Solved Problems

12.25.1 Problem 1: Keynes’ Consumption Function

In 1936, British economist John Maynard Keynes published his famous book, The General Theory of Employment, Interest and Money. Keynes proposed a theoretical relationship between income and personal consumption expenditures, arguing that as income increases, consumption increases by a smaller amount.

Milton Friedman, Nobel Prize-winning economist from the University of Chicago, collected data on income and consumption in the United States over an extended period. Here are 10 observations on annual consumption and income levels (in billions of current dollars):

Year	Income (I)	Consumption (C)
1950	284.8	191.0
1951	328.4	206.3
1952	345.5	216.7
1953	364.6	230.0
1954	364.8	236.5
1955	398.0	254.4
1956	419.2	266.7
1957	441.1	281.4
1958	447.3	290.1
1959	483.7	311.2

Required: Derive a consumption function assuming a linear relationship between consumption and income.

12.25.2 Solution

a. Regression Model

Since consumption depends on income, consumption is the dependent variable (Y) and income is the independent variable (X). Friedman sought a consumption function of the form:

\hat{C} = b_0 + b_1 I

Calculations:

\begin{aligned} \sum X &= 3,877.4 \quad \sum Y = 2,484.3 \\ \sum XY &= 984,615.32 \quad \sum X^2 = 1,537,084.88 \quad \sum Y^2 = 630,869.49 \\ n &= 10 \end{aligned}

Sums of Squares and Cross Products:

\begin{aligned} SCx &= \sum X^2 - \frac{(\sum X)^2}{n} \\ &= 1,537,084.88 - \frac{(3,877.4)^2}{10} \\ &= 1,537,084.88 - 1,503,423.076 \\ &= 33,661.804 \end{aligned}

\begin{aligned} SCy &= \sum Y^2 - \frac{(\sum Y)^2}{n} \\ &= 630,869.49 - \frac{(2,484.3)^2}{10} \\ &= 630,869.49 - 617,174.649 \\ &= 13,694.841 \end{aligned}

\begin{aligned} SCxy &= \sum XY - \frac{(\sum X)(\sum Y)}{n} \\ &= 984,615.32 - \frac{(3,877.4)(2,484.3)}{10} \\ &= 984,615.32 - 963,262.482 \\ &= 21,352.838 \end{aligned}

Regression Coefficients:

\begin{aligned} b_1 &= \frac{SCxy}{SCx} \\ &= \frac{21,352.838}{33,661.804} \\ &= 0.634 \end{aligned}

\begin{aligned} b_0 &= \bar{Y} - b_1\bar{X} \\ &= 248.43 - (0.634)(387.74) \\ &= 248.43 - 245.827 \\ &= 2.603 \end{aligned}

Consumption Function:

\hat{C} = 2.603 + 0.63I

Interpretation:

Slope (b_1 = 0.63): For every $1 billion increase in income, consumption increases by an estimated $0.63 billion ($630 million). Economics students will recognize this as the marginal propensity to consume (MPC).
Intercept (b_0 = 2.603): The estimated consumption level when income is zero ($2.603 billion). Economists often argue that this interpretation is not valid because an economic system always generates positive income.

Example Prediction: For 1952, when I = 345.5:

\hat{C} = 2.603 + 0.63(345.5) = 220.27

Actual consumption was 216.7, so the error is Y - \hat{Y} = 216.7 - 220.27 = -3.57: the model overestimates 1952 consumption by $3.57 billion.
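As a cross-check on the hand calculation, here is a short sketch that fits the same consumption function directly from the ten observations. The variable names and the `predict` helper are illustrative only. Because the code carries full precision, its intercept and 1952 prediction differ somewhat from the hand-rounded values above, which shows how sensitive these calculations are to rounding.

```python
# Income (I) and consumption (C), billions of current dollars, 1950-1959
income = [284.8, 328.4, 345.5, 364.6, 364.8, 398.0, 419.2, 441.1, 447.3, 483.7]
consumption = [191.0, 206.3, 216.7, 230.0, 236.5, 254.4, 266.7, 281.4, 290.1, 311.2]
n = len(income)

# Sums of squares and cross products (computational forms)
sc_x = sum(v ** 2 for v in income) - sum(income) ** 2 / n
sc_xy = (sum(i * c for i, c in zip(income, consumption))
         - sum(income) * sum(consumption) / n)

b1 = sc_xy / sc_x                                   # slope (MPC)
b0 = sum(consumption) / n - b1 * sum(income) / n    # intercept


def predict(i):
    """Estimated consumption for income level i (hypothetical helper)."""
    return b0 + b1 * i


pred_1952 = predict(345.5)
print(f"b1 = {b1:.4f}, b0 = {b0:.3f}")
print(f"1952 prediction = {pred_1952:.2f}, residual = {216.7 - pred_1952:.2f}")
```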

b. Coefficient of Determination

\begin{aligned} r^2 &= \frac{(SCxy)^2}{(SCx)(SCy)} \\ &= \frac{(21,352.838)^2}{(33,661.804)(13,694.841)} \\ &= \frac{455,943,690.7}{460,993,053.6} \\ &= 0.989 \end{aligned}

Interpretation: About 98.9% of the variation in consumption is explained by variation in income. This extremely high r^2 is consistent with Keynes’ theory and indicates a strong linear relationship between income and consumption, which is vital information for policymakers advising Congress and the President on economic policy.
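The coefficient of determination can also be obtained from the deviation (definitional) forms of the sums of squares rather than the computational shortcuts. The sketch below does that for the same data; the variable names are illustrative.

```python
import math

income = [284.8, 328.4, 345.5, 364.6, 364.8, 398.0, 419.2, 441.1, 447.3, 483.7]
consumption = [191.0, 206.3, 216.7, 230.0, 236.5, 254.4, 266.7, 281.4, 290.1, 311.2]
n = len(income)
mean_x = sum(income) / n
mean_y = sum(consumption) / n

# Deviation forms: sums of squared deviations and cross products
sc_x = sum((x - mean_x) ** 2 for x in income)
sc_y = sum((y - mean_y) ** 2 for y in consumption)
sc_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, consumption))

r = sc_xy / math.sqrt(sc_x * sc_y)   # correlation coefficient
r_squared = r ** 2                   # coefficient of determination

print(f"r = {r:.4f}, r^2 = {r_squared:.3f}")
```

Both routes give the same r^2 up to floating-point error; the deviation form is simply less convenient by hand.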

Figure 12.8: Keynes’ Consumption Function: U.S. Income vs. Consumption (1950-1959)

12.25.3 Problem 2: Federal Reserve Discount Rate Analysis

After approximately six years of continuous expansion, the U.S. economy began showing signs of inflationary pressure in fall 1988. The Federal Reserve attempted to cool inflation by restricting money supply through increasing the discount rate that commercial banks must pay to borrow from the Federal Reserve.

Manuel H. Johnson, Vice Chairman of the Federal Reserve, stated that Fed actions regarding the discount rate could be predicted based on the federal funds rate (the cost banks charge each other for overnight loans). However, Fed watchers argued that the federal funds rate was not serving as an adequate predictor, making it difficult for investors to anticipate interest rate levels.

Data from mid-1987 to mid-1988:

Date	Federal Funds Rate (%)	Discount Rate (%)
June 1987	8.0	7.5
July 1987	7.5	7.5
Aug 1987	7.0	7.0
Sept 1987	6.5	6.5
Oct 1987	6.0	6.0
Nov 1987	6.0	5.5
Dec 1987	7.0	5.5
Jan 1988	6.0	5.5
Feb 1988	7.0	5.5
Mar 1988	7.5	5.5
Apr 1988	7.0	6.0
May 1988	7.5	6.5
Totals	83.0	74.5

12.25.4 Solution

Since Johnson argued that federal funds rate could explain discount rate behavior, federal funds is the independent variable (X).

a. Regression and Correlation Analysis

Given calculations: \begin{aligned} \sum X &= 83.0 \quad \sum Y = 74.5 \quad n = 12 \\ \sum XY &= 518.5 \quad \sum X^2 = 579 \quad \sum Y^2 = 469.25 \\ \bar{X} &= 6.92 \quad \bar{Y} = 6.21 \end{aligned}

\begin{aligned} SCx &= 4.9167 \quad SCy = 6.7292 \quad SCxy = 3.2083 \\ b_1 &= 0.6525 \quad b_0 = 1.6949 \end{aligned}

Regression Model:

\hat{Y} = 1.69 + 0.653X

Coefficient of Determination:

\begin{aligned} r^2 &= \frac{(SCxy)^2}{(SCx)(SCy)} \\ &= \frac{(3.2083)^2}{(4.9167)(6.7292)} \\ &= 0.3111 \text{ or } 31.11\% \end{aligned}

r = \sqrt{0.3111} = 0.56

Conclusion: The Fed watchers are correct in their criticism. Only 31% of the changes in the discount rate are explained by changes in the federal funds rate. This is a weak relationship, not suitable for reliable prediction.
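The summary figures for this problem are given without intermediate steps, so the following sketch reproduces them from the twelve monthly observations in the table. Variable names are illustrative.

```python
import math

# Federal funds rate (X) and discount rate (Y), June 1987 to May 1988, in percent
fed_funds = [8.0, 7.5, 7.0, 6.5, 6.0, 6.0, 7.0, 6.0, 7.0, 7.5, 7.0, 7.5]
discount = [7.5, 7.5, 7.0, 6.5, 6.0, 5.5, 5.5, 5.5, 5.5, 5.5, 6.0, 6.5]
n = len(fed_funds)

sc_x = sum(x * x for x in fed_funds) - sum(fed_funds) ** 2 / n
sc_y = sum(y * y for y in discount) - sum(discount) ** 2 / n
sc_xy = (sum(x * y for x, y in zip(fed_funds, discount))
         - sum(fed_funds) * sum(discount) / n)

b1 = sc_xy / sc_x
b0 = sum(discount) / n - b1 * sum(fed_funds) / n
r_squared = sc_xy ** 2 / (sc_x * sc_y)
r = math.copysign(math.sqrt(r_squared), sc_xy)   # r takes the sign of SCxy

print(f"SCx = {sc_x:.4f}, SCy = {sc_y:.4f}, SCxy = {sc_xy:.4f}")
print(f"b1 = {b1:.4f}, b0 = {b0:.4f}, r^2 = {r_squared:.4f}, r = {r:.2f}")
```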

b. Standard Error of Estimation

\begin{aligned} SCE &= SCy - \frac{(SCxy)^2}{SCx} \\ &= 6.7292 - \frac{(3.2083)^2}{4.9167} \\ &= 6.7292 - 2.0936 \\ &= 4.6356 \end{aligned}

CME = \frac{SCE}{n-2} = \frac{4.6356}{10} = 0.4636

S_e = \sqrt{0.4636} = 0.6808

Interpretation: Typically, the estimate of the discount rate is in error by 0.68 percentage points—a substantial margin for financial planning.
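The shortcut formula for SCE should agree with the definition of S_e as the root of the average squared residual. The sketch below computes SCE both ways for the Fed data; names are illustrative.

```python
import math

fed_funds = [8.0, 7.5, 7.0, 6.5, 6.0, 6.0, 7.0, 6.0, 7.0, 7.5, 7.0, 7.5]
discount = [7.5, 7.5, 7.0, 6.5, 6.0, 5.5, 5.5, 5.5, 5.5, 5.5, 6.0, 6.5]
n = len(fed_funds)

sc_x = sum(x * x for x in fed_funds) - sum(fed_funds) ** 2 / n
sc_y = sum(y * y for y in discount) - sum(discount) ** 2 / n
sc_xy = (sum(x * y for x, y in zip(fed_funds, discount))
         - sum(fed_funds) * sum(discount) / n)
b1 = sc_xy / sc_x
b0 = sum(discount) / n - b1 * sum(fed_funds) / n

# Route 1: shortcut formula SCE = SCy - SCxy^2 / SCx
sce_shortcut = sc_y - sc_xy ** 2 / sc_x

# Route 2: sum of squared residuals (Y - Y-hat)^2
sce_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(fed_funds, discount))

cme = sce_shortcut / (n - 2)   # mean square error
s_e = math.sqrt(cme)           # standard error of estimation

print(f"SCE (shortcut) = {sce_shortcut:.4f}, SCE (residuals) = {sce_direct:.4f}")
print(f"CME = {cme:.4f}, S_e = {s_e:.4f}")
```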

c. Testing Correlation Significance (95% confidence, df = 10)

\begin{aligned} H_0&: \rho = 0 \\ H_A&: \rho \neq 0 \end{aligned}

Critical value: t_{0.025, 10} = \pm 2.228

\begin{aligned} t &= \frac{r}{S_r} = \frac{r}{\sqrt{(1-r^2)/(n-2)}} \\ &= \frac{0.56}{\sqrt{(1-0.31)/10}} \\ &= \frac{0.56}{0.2627} \\ &= 2.13 \end{aligned}

Decision: Since t = 2.13 < 2.228, we cannot reject H_0. Despite the sample showing a positive relationship, we cannot reject the hypothesis of zero correlation at the 5% significance level.
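The test for the correlation coefficient can be sketched in a few lines. The critical value 2.228 is copied from the t-table entry quoted above rather than computed, and the variable names are illustrative. Using the unrounded r gives a t statistic slightly below the hand value of 2.13, with the same decision.

```python
import math

fed_funds = [8.0, 7.5, 7.0, 6.5, 6.0, 6.0, 7.0, 6.0, 7.0, 7.5, 7.0, 7.5]
discount = [7.5, 7.5, 7.0, 6.5, 6.0, 5.5, 5.5, 5.5, 5.5, 5.5, 6.0, 6.5]
n = len(fed_funds)

sc_x = sum(x * x for x in fed_funds) - sum(fed_funds) ** 2 / n
sc_y = sum(y * y for y in discount) - sum(discount) ** 2 / n
sc_xy = (sum(x * y for x, y in zip(fed_funds, discount))
         - sum(fed_funds) * sum(discount) / n)

r = sc_xy / math.sqrt(sc_x * sc_y)
s_r = math.sqrt((1 - r ** 2) / (n - 2))   # standard error of r
t_stat = (r - 0) / s_r                    # H0: rho = 0

T_CRITICAL = 2.228   # t_{0.025, 10}, taken from the t-table (not computed here)
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"r = {r:.4f}, S_r = {s_r:.4f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```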

d. Testing Regression Slope Significance (99% confidence, df = 10)

\begin{aligned} H_0&: \beta_1 = 0 \\ H_A&: \beta_1 \neq 0 \end{aligned}

Critical value: t_{0.005, 10} = \pm 3.169

\begin{aligned} S_{b_1} &= \frac{S_e}{\sqrt{SCx}} = \frac{0.681}{\sqrt{4.92}} = 0.307 \\ t &= \frac{b_1}{S_{b_1}} = \frac{0.6525}{0.307} = 2.126 \end{aligned}

Decision: Since t = 2.126 < 3.169, we cannot reject H_0. The value of b_1 is not significantly different from zero at the 1% level.

Conclusion: Neither test finds a statistically significant relationship, so these data give little support for using the federal funds rate as a predictor of the discount rate. It would be imprudent for investors to rely on federal funds as an indicator of discount rate behavior.
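The slope test in part d can be sketched the same way, this time starting from b_1 and S_{b_1}. The critical values are copied from the t-table entries quoted in the solution, and the names are illustrative.

```python
import math

fed_funds = [8.0, 7.5, 7.0, 6.5, 6.0, 6.0, 7.0, 6.0, 7.0, 7.5, 7.0, 7.5]
discount = [7.5, 7.5, 7.0, 6.5, 6.0, 5.5, 5.5, 5.5, 5.5, 5.5, 6.0, 6.5]
n = len(fed_funds)

sc_x = sum(x * x for x in fed_funds) - sum(fed_funds) ** 2 / n
sc_y = sum(y * y for y in discount) - sum(discount) ** 2 / n
sc_xy = (sum(x * y for x, y in zip(fed_funds, discount))
         - sum(fed_funds) * sum(discount) / n)

b1 = sc_xy / sc_x
s_e = math.sqrt((sc_y - sc_xy ** 2 / sc_x) / (n - 2))
s_b1 = s_e / math.sqrt(sc_x)    # standard error of the slope
t_slope = (b1 - 0) / s_b1       # H0: beta_1 = 0

# Critical values for df = 10, taken from the t-table (not computed here)
T_CRIT_05 = 2.228   # two-tailed, alpha = 0.05
T_CRIT_01 = 3.169   # two-tailed, alpha = 0.01

print(f"S_b1 = {s_b1:.4f}, t = {t_slope:.3f}")
print(f"significant at 5%: {abs(t_slope) > T_CRIT_05}")
print(f"significant at 1%: {abs(t_slope) > T_CRIT_01}")
```

The statistic equals the one from the correlation test, apart from rounding, as expected in simple regression where the two tests address the same hypothesis.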

12.26 Formula Summary

Essential Regression Formulas

Basic Regression Line:

Y = b_0 + b_1X \quad [11.3]

Sums of Squares:

\begin{aligned} SCx &= \sum X^2 - \frac{(\sum X)^2}{n} \quad [11.9] \\ SCy &= \sum Y^2 - \frac{(\sum Y)^2}{n} \quad [11.10] \\ SCxy &= \sum XY - \frac{(\sum X)(\sum Y)}{n} \quad [11.11] \end{aligned}

Regression Coefficients:

\begin{aligned} b_1 &= \frac{SCxy}{SCx} \quad [11.12] \\ b_0 &= \bar{Y} - b_1\bar{X} \quad [11.13] \end{aligned}

Standard Error of Estimation:

\begin{aligned} S_e &= \sqrt{\frac{\sum(Y_i - \hat{Y}_i)^2}{n-2}} \quad [11.14] \\ SCE &= SCy - \frac{(SCxy)^2}{SCx} \quad [11.15] \\ CME &= \frac{SCE}{n-2} \quad [11.16] \\ S_e &= \sqrt{CME} \quad [11.17] \end{aligned}

Correlation and Determination:

\begin{aligned} r &= \frac{SCxy}{\sqrt{(SCx)(SCy)}} \quad [11.23] \\ r^2 &= \frac{(SCxy)^2}{(SCx)(SCy)} \quad [11.25] \end{aligned}

Hypothesis Tests:

\begin{aligned} t &= \frac{b_1 - \beta_1}{S_{b_1}} \quad [11.26] \\ S_{b_1} &= \frac{S_e}{\sqrt{SCx}} \quad [11.27] \\ t &= \frac{r - \rho}{S_r} \quad [11.29] \\ S_r &= \sqrt{\frac{1-r^2}{n-2}} \quad [11.30] \end{aligned}

Confidence and Prediction Intervals:

\begin{aligned} S_{\hat{Y}} &= S_e\sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \quad [11.31] \\ \text{C.I. for } \mu_{Y|X} &= \hat{Y}_i \pm t \cdot S_{\hat{Y}} \quad [11.32] \\ S_{Y_X} &= S_e\sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \quad [11.33] \\ \text{P.I. for } Y_X &= \hat{Y}_i \pm t \cdot S_{Y_X} \quad [11.34] \end{aligned}

ANOVA:

\begin{aligned} SCR &= \frac{(SCxy)^2}{SCx} \quad [11.35] \\ SCT &= SCR + SCE \\ F &= \frac{CMR}{CME} \end{aligned}

12.27 Chapter Summary

This chapter introduced simple linear regression and correlation analysis, two of the most powerful statistical tools for understanding relationships between variables.

12.27.1 Key Concepts Mastered

1. Regression Analysis

Purpose: Quantify the relationship between dependent (Y) and independent (X) variables
Regression line: \hat{Y} = b_0 + b_1X
Slope (b_1): Change in Y for each one-unit change in X
Intercept (b_0): Value of Y when X = 0
Method: Ordinary Least Squares (OLS) minimizes \sum(Y_i - \hat{Y}_i)^2

2. Model Evaluation

Standard Error (S_e): Typical prediction error (in units of Y)
Correlation coefficient (r): Strength and direction of linear relationship (-1 \leq r \leq +1)
Coefficient of determination (r^2): Percentage of variation in Y explained by X
ANOVA F-test: Overall significance of the regression model

3. Statistical Inference

Hypothesis tests: Determine if \beta_1 and \rho are significantly different from zero
Confidence intervals: Estimate population parameters (\beta_1, \mu_{Y|X})
Prediction intervals: Predict individual future values of Y

12.27.2 Critical Distinctions

Concept	Purpose	Formula	Interval width	Example (X=10)
Point Estimate	Single prediction	\hat{Y} = b_0 + b_1X	None (single value)	15.19
Confidence Interval	Estimate mean of Y	\hat{Y}_i \pm t \cdot S_{\hat{Y}}	Narrower	[14.54, 15.84]
Prediction Interval	Estimate individual Y	\hat{Y}_i \pm t \cdot S_{Y_X}	Wider	[13.13, 17.26]
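The example values in the table appear to correspond to the Hop Scotch Airlines data at X = 10. Under that assumption, and assuming a 95% level with the table value t_{0.025, 13} = 2.160, the sketch below applies formulas [11.31] to [11.34]. Names are illustrative, and the last digit may differ from the table because the code does not round intermediate results.

```python
import math

# Hop Scotch Airlines: advertising X (in $1,000s) and passengers Y (in 1,000s)
x = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]
y = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sc_x = sum(v * v for v in x) - sum(x) ** 2 / n
sc_y = sum(v * v for v in y) - sum(y) ** 2 / n
sc_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

b1 = sc_xy / sc_x
b0 = y_bar - b1 * x_bar
s_e = math.sqrt((sc_y - sc_xy ** 2 / sc_x) / (n - 2))

X0 = 10          # advertising level of interest
T_CRIT = 2.160   # t_{0.025, 13} from the t-table (assumed 95% level)

y_hat = b0 + b1 * X0
leverage = 1 / n + (X0 - x_bar) ** 2 / sc_x
s_mean = s_e * math.sqrt(leverage)        # [11.31] standard error for the mean of Y
s_indiv = s_e * math.sqrt(1 + leverage)   # [11.33] standard error for an individual Y

ci = (y_hat - T_CRIT * s_mean, y_hat + T_CRIT * s_mean)     # [11.32]
pi = (y_hat - T_CRIT * s_indiv, y_hat + T_CRIT * s_indiv)   # [11.34]

print(f"point estimate: {y_hat:.2f}")
print(f"confidence interval: [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"prediction interval: [{pi[0]:.2f}, {pi[1]:.2f}]")
```

The extra 1 under the square root in [11.33] is what makes the prediction interval wider: it adds the variability of a single observation to the uncertainty in the estimated mean.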

12.27.3 Important Cautions

Limitations and Warnings

Correlation ≠ Causation: High correlation doesn’t prove one variable causes another
Linear relationships only: r and r^2 only measure linear associations
Extrapolation danger: Don’t predict beyond the range of observed X values
Outlier sensitivity: Regression is sensitive to extreme values
Spurious correlation: Unrelated variables can show correlation by chance
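The second caution is easy to demonstrate with invented data. In this sketch Y is completely determined by X through Y = X^2, yet the correlation coefficient is zero because the relationship is not linear. The numbers are made up purely for illustration.

```python
import math

# A perfect but nonlinear relationship: Y = X^2 on values symmetric around zero
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]
n = len(x)

sc_x = sum(v * v for v in x) - sum(x) ** 2 / n
sc_y = sum(v * v for v in y) - sum(y) ** 2 / n
sc_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

r = sc_xy / math.sqrt(sc_x * sc_y)
print(f"r = {r:.3f}")   # zero linear correlation despite an exact relationship
```

A scatter diagram would reveal the curve immediately, which is why plotting the data before fitting a line is good practice.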

12.27.4 Decision-Making Applications

Regression and correlation analysis enable managers to:

Forecast: Predict sales, costs, demand based on drivers
Allocate resources: Optimize spending on advertising, production
Evaluate performance: Measure relationships between inputs and outputs
Test theories: Validate economic and business hypotheses with data
Quantify risk: Understand variability in predictions (via S_e and intervals)

12.27.5 Closing the Cola Wars Scenario

Returning to our opening scenario: Coca-Cola and Pepsi can now use regression to:

Model sales vs. temperature to optimize production and distribution
Analyze advertising vs. market share to maximize ROI
Predict consumption vs. price for pricing strategies
Forecast seasonal demand patterns for inventory management

With the tools from this chapter, beverage industry managers can make data-driven decisions worth millions of dollars in a $27 billion market.

12.27.6 Looking Ahead

The next chapter extends these concepts to multiple regression, where we can use several independent variables simultaneously to predict Y. This provides even more powerful and realistic models for complex business environments.

Congratulations! You’ve mastered simple linear regression and correlation—essential tools that form the foundation for advanced statistical modeling and data science applications in business.

Chapter Mastery Checklist

You should now be able to:

- ✅ Distinguish between dependent and independent variables
- ✅ Create and interpret scatter diagrams
- ✅ Calculate regression coefficients using OLS
- ✅ Interpret slope and intercept in business context
- ✅ Compute and interpret S_e, r, and r^2
- ✅ Test hypotheses about \beta_1 and \rho
- ✅ Construct confidence intervals for means
- ✅ Construct prediction intervals for individuals
- ✅ Perform ANOVA for regression
- ✅ Recognize limitations and avoid common pitfalls

# Simple Regression and Correlation {#sec-regression}

## Opening Scenario: The Cola Wars

The competition in the soft drink industry has always been intense. Recently, the battle between Coca-Cola and Pepsi-Cola has intensified as both giants fight to increase their respective shares of the $27 billion domestic beverage market. Each company has offered its own brand of promotional flair in a continuous effort to reorganize their marketing mix and promote their respective products. Coca-Cola currently enjoys a 21.7% market share, followed by Pepsi at 18.9%.

Without a doubt, marketing executives, management specialists, and statisticians work tirelessly at both companies trying to outmaneuver their competitive-minded counterparts. So far, they have agreed on very little, except that **sales seem to increase with higher summer temperatures**.

Predicting market share trends is an especially arduous and difficult task. Many executives have ruined their careers in frustrated attempts to correctly anticipate the behavior of fickle consumers.

**Regression and correlation analysis** are the two most powerful and useful tools that analysts of all types have at their disposal to peer into the shadowy future. In this chapter, we will analyze these procedures and learn how they can guide business professionals in their pursuit of successful careers.

## Learning Objectives

After completing this chapter, you will be able to:

1. **Distinguish** between dependent and independent variables in regression analysis
2. **Differentiate** between simple and multiple regression models
3. **Understand** the difference between linear and curvilinear relationships
4. **Apply** ordinary least squares (OLS) method to determine the best-fit line
5. **Interpret** regression coefficients and their business implications
6. **Evaluate** regression model assumptions (normality, homoscedasticity, independence, linearity)
7. **Calculate** and interpret the standard error of estimation
8. **Compute** and interpret the correlation coefficient and coefficient of determination
9. **Conduct** hypothesis tests for regression coefficients and correlation
10. **Make** predictions using regression models with confidence intervals

## Chapter Structure

```{mermaid}
%%| fig-width: 10
%%| fig-cap: "Chapter 11 Conceptual Structure"
graph TD
    A[Regression and Correlation] --> B[Regression Analysis]
    A --> C[Correlation Analysis]
    B --> B1[Model development: Ordinary least squares]
    B --> B2[Model assumptions]
    B --> B3[Standard error of estimation]
    B --> B4[Inferential tests]
    B4 --> B4a[Hypothesis tests]
    B4 --> B4b[Confidence intervals]
    C --> C1[Correlation coefficient]
    C --> C2[Coefficient of determination]
    C --> C3[Hypothesis test for correlation]
```

## 11.1 Introduction to Regression and Correlation

Regression and correlation are the two most powerful and versatile statistical tools that can be used to solve common business problems. Many studies are based on the belief that we can identify and quantify some functional relationship between two or more variables.

We say that **one variable depends on another**. We can say that $Y$ depends on $X$, where $Y$ and $X$ are any two variables. This can be written as:

$$Y \text{ is a function of } X \quad Y = f(X)$$

This is read as "$Y$ is a function of $X$." Because $Y$ depends on $X$, $Y$ is the **dependent variable** and $X$ is the **independent variable**.

::: {.callout-note icon="📋"}
## Key Definitions

**Dependent Variable**
: The variable we wish to explain or predict; also called the **regressand** or **response variable**. It is the outcome we're interested in understanding.

**Independent Variable**
: The variable used to explain $Y$; also called the **explanatory variable**, **predictor**, or **regressor**. It is the variable we believe influences the dependent variable.
:::

### Example: Student Performance

The university dean wishes to analyze the relationship between student grades and the time they spend studying. Data on both variables were collected. It is logical to presume that **grades depend on** the quantity and quality of time students spend with their books. Therefore:

- **Dependent variable (Y)**: Grades
- **Independent variable (X)**: Study time

We say that "$Y$ is regressing on $X$" or "we are regressing $Y$ on $X$."

### Historical Context

The first to develop regression analysis was English scientist **Sir Francis Galton (1822-1911)**. His initial experiments with regression began with an attempt to analyze hereditary growth patterns of peas. Encouraged by the results, Sir Francis extended his study to include hereditary patterns in the height of adult humans.

He discovered that children who have very tall or very short parents tended to "**regress**" toward the average height of the adult population. With this modest beginning, the use of regression analysis became known and has become one of the most powerful statistical tools available today.

::: {.callout-tip icon="🎓"}
## Why "Regression"?

The term "regression" comes from Galton's observation that extreme values tend to "regress" (return) toward the mean. This phenomenon is called **regression to the mean**.
:::

## 11.2 Types of Regression Models

### Simple vs. Multiple Regression

We must differentiate between **simple regression** and **multiple regression**:

**Simple Regression (Bivariate Regression)**
: $Y$ is a function of **only one** independent variable:

$$Y = f(X)$$

It's called "bivariate" because there are only two variables: one dependent and one independent.

**Multiple Regression**
: $Y$ is a function of **two or more** independent variables:

$$Y = f(X_1, X_2, X_3, \ldots, X_k)$$

where $X_1, X_2, X_3, \ldots, X_k$ are independent variables that help explain $Y$.

### Linear vs. Curvilinear Regression

We must also distinguish between **linear regression** and **curvilinear (nonlinear) regression**:

**Linear Regression**
: The relationship between $X$ and $Y$ can be represented by a **straight line**. It holds that as $X$ changes, $Y$ changes by a **constant amount**.

**Curvilinear Regression**
: Uses a **curve** to express the relationship between $X$ and $Y$. It holds that as $X$ changes, $Y$ changes by a **different amount** each time.

### Scatter Diagrams

Some of these relationships appear in **scatter diagrams** (or scatter plots) that represent paired observations for $X$ and $Y$. It is customary to place the independent variable on the horizontal axis.

```{python}
#| echo: false
#| label: fig-scatter-types
#| fig-cap: "Types of Relationships in Scatter Diagrams"

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 3, figsize=(14, 9))

# a) Positive Linear
np.random.seed(42)
x1 = np.linspace(0, 10, 30)
y1 = 2 + 1.5*x1 + np.random.normal(0, 1.5, 30)
axes[0, 0].scatter(x1, y1, alpha=0.6, s=50, color='steelblue')
z = np.polyfit(x1, y1, 1)
p = np.poly1d(z)
axes[0, 0].plot(x1, p(x1), "r-", linewidth=2)
axes[0, 0].set_title('a) Positive Linear Relationship', fontweight='bold')
axes[0, 0].set_xlabel('X (Independent)')
axes[0, 0].set_ylabel('Y (Dependent)')
axes[0, 0].grid(alpha=0.3)
axes[0, 0].annotate('As X ↑, Y ↑', xy=(7, 12), fontsize=10,
                    bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))

# b) Negative Linear
x2 = np.linspace(0, 10, 30)
y2 = 15 - 1.2*x2 + np.random.normal(0, 1, 30)
axes[0, 1].scatter(x2, y2, alpha=0.6, s=50, color='coral')
z = np.polyfit(x2, y2, 1)
p = np.poly1d(z)
axes[0, 1].plot(x2, p(x2), "r-", linewidth=2)
axes[0, 1].set_title('b) Negative Linear Relationship', fontweight='bold')
axes[0, 1].set_xlabel('X (Independent)')
axes[0, 1].set_ylabel('Y (Dependent)')
axes[0, 1].grid(alpha=0.3)
axes[0, 1].annotate('As X ↑, Y ↓', xy=(7, 6), fontsize=10,
                    bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.7))

# c) Curvilinear (Quadratic)
x3 = np.linspace(0, 10, 30)
y3 = 0.3*x3**2 - 2*x3 + 8 + np.random.normal(0, 1, 30)
axes[0, 2].scatter(x3, y3, alpha=0.6, s=50, color='green')
z = np.polyfit(x3, y3, 2)
p = np.poly1d(z)
x_curve = np.linspace(0, 10, 100)
axes[0, 2].plot(x_curve, p(x_curve), "r-", linewidth=2)
axes[0, 2].set_title('c) Curvilinear Relationship', fontweight='bold')
axes[0, 2].set_xlabel('X (Independent)')
axes[0, 2].set_ylabel('Y (Dependent)')
axes[0, 2].grid(alpha=0.3)
axes[0, 2].annotate('Curved pattern', xy=(6, 10), fontsize=10,
                    bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.7))

# d) Curvilinear (Exponential)
x4 = np.linspace(0, 10, 30)
y4 = 3*np.exp(0.15*x4) + np.random.normal(0, 1, 30)
axes[1, 0].scatter(x4, y4, alpha=0.6, s=50, color='purple')
z = np.polyfit(x4, np.log(y4), 1)
axes[1, 0].plot(x4, np.exp(z[1])*np.exp(z[0]*x4), "r-", linewidth=2)
axes[1, 0].set_title('d) Exponential Relationship', fontweight='bold')
axes[1, 0].set_xlabel('X (Independent)')
axes[1, 0].set_ylabel('Y (Dependent)')
axes[1, 0].grid(alpha=0.3)
axes[1, 0].annotate('Accelerating growth', xy=(5, 8), fontsize=10,
                    bbox=dict(boxstyle='round', facecolor='plum', alpha=0.7))

# e) No Relationship
x5 = np.random.uniform(0, 10, 30)
y5 = np.random.uniform(0, 10, 30)
axes[1, 1].scatter(x5, y5, alpha=0.6, s=50, color='gray')
axes[1, 1].axhline(y=5, color='r', linestyle='--', linewidth=2)
axes[1, 1].set_title('e) No Relationship', fontweight='bold')
axes[1, 1].set_xlabel('X (Independent)')
axes[1, 1].set_ylabel('Y (Dependent)')
axes[1, 1].grid(alpha=0.3)
axes[1, 1].annotate('Random scatter', xy=(6, 8), fontsize=10,
                    bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.7))

# f) Perfect Linear
x6 = np.linspace(0, 10, 20)
y6 = 3 + 1.8*x6  # Perfect relationship, no noise
axes[1, 2].scatter(x6, y6, alpha=0.6, s=50, color='darkblue')
axes[1, 2].plot(x6, y6, "r-", linewidth=2)
axes[1, 2].set_title('f) Perfect Linear Relationship', fontweight='bold')
axes[1, 2].set_xlabel('X (Independent)')
axes[1, 2].set_ylabel('Y (Dependent)')
axes[1, 2].grid(alpha=0.3)
axes[1, 2].annotate('All points on line', xy=(5, 12), fontsize=10,
                    bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.7))

plt.tight_layout()
plt.show()
```

::: {.callout-important icon="📊"}
## Interpreting Scatter Diagrams

- **Figure a)**: Positive linear relationship - as $X$ increases (decreases), $Y$ increases (decreases)
- **Figure b)**: Negative linear relationship - as $X$ increases, $Y$ decreases
- **Figures c) and d)**: Curvilinear relationships - cannot be well described by a straight line
- **Figure e)**: No relationship - no detectable pattern between $X$ and $Y$
- **Figure f)**: Perfect linear relationship - all points fall exactly on a line (rare in real data)
:::

## 11.3 The Equation of a Straight Line

Before diving into regression analysis, let's review the mathematical equation of a straight line. Only **two points** are needed to draw a straight line representing a linear relationship.

### Basic Form

The equation of a straight line can be expressed as:

**Equation of a Line**

$$Y = b_0 + b_1 X \quad [11.1]$$

where:

- $b_0$ is the **intercept** (where the line crosses the Y-axis)
- $b_1$ is the **slope** of the line

### Example: Understanding Slope and Intercept

Consider the equation:

$$Y = 5 + 2X$$

- **Intercept** ($b_0 = 5$): The line crosses the Y-axis at 5
- **Slope** ($b_1 = 2$): The slope is calculated as:

$$b_1 = \text{slope} = \frac{\text{vertical change}}{\text{horizontal change}} = \frac{2}{1} = 2$$

This means: **For every one-unit change in $X$, $Y$ changes by 2 units**.

```{python}
#| echo: false
#| label: fig-line-slope
#| fig-cap: "Straight Line with Positive Slope: Y = 5 + 2X"

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(10, 7))

# Create line
x = np.linspace(0, 5, 100)
y = 5 + 2*x
ax.plot(x, y, 'b-', linewidth=3, label='Y = 5 + 2X')

# Mark intercept
ax.plot(0, 5, 'ro', markersize=12, label=f'Intercept b₀ = 5')
ax.annotate('Intercept (0, 5)', xy=(0, 5), xytext=(0.5, 6),
            fontsize=11, fontweight='bold',
            arrowprops=dict(arrowstyle='->', lw=2, color='red'))

# Show slope
x1, x2 = 2, 3
y1, y2 = 5 + 2*x1, 5 + 2*x2
ax.plot([x1, x2], [y1, y1], 'g--', linewidth=2)
ax.plot([x2, x2], [y1, y2], 'g--', linewidth=2)
ax.annotate('Δx = 1', xy=(2.5, y1-0.3), fontsize=10, ha='center',
            color='green', fontweight='bold')
ax.annotate('Δy = 2', xy=(x2+0.15, 10), fontsize=10,
            color='green', fontweight='bold')
ax.annotate(f'Slope = Δy/Δx = 2/1 = 2', xy=(3.5, 11.5), fontsize=11,
            bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))

# Demonstrate with specific points
points_x = [2, 3, 4]
points_y = [5 + 2*px for px in points_x]
for px, py in zip(points_x, points_y):
    ax.plot(px, py, 'ko', markersize=8)
    ax.annotate(f'({px}, {py})', xy=(px, py), xytext=(px-0.3, py+0.8), fontsize=9)

ax.set_xlabel('X', fontsize=12, fontweight='bold')
ax.set_ylabel('Y', fontsize=12, fontweight='bold')
ax.set_title('Linear Relationship: Y = 5 + 2X\n(Positive Slope)',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=10, loc='upper left')
ax.grid(alpha=0.3)
ax.set_xlim(-0.5, 5)
ax.set_ylim(0, 17)

plt.tight_layout()
plt.show()
```

Notice that as $X$ increases from 2 to 3 (an increase of 1 unit), $Y$ increases from 9 to 11 (an increase of 2 units).

### Different Slope Values

The slope $b_1$ determines the nature of the relationship:

**1. Positive Slope ($b_1 > 0$)**
: Variables move in the same direction
: Example: $Y = 5 + 2X$

**2. Negative Slope ($b_1 < 0$)**
: Variables move in opposite directions
: Example: $Y = 10 - 3X$
: For every one-unit increase in $X$, $Y$ decreases by 3 units

**3. Zero Slope ($b_1 = 0$)**
: No linear relationship
: Example: $Y = 10 + 0X = 10$
: Changes in $X$ have no effect on $Y$

## 11.4 Deterministic vs. Stochastic Relationships

### Deterministic Relationships

A **deterministic relationship** can be expressed by a formula that provides exact conversions.

**Example**: Converting miles per hour (mph) to kilometers per hour (kph):

$$1 \text{ mph} = 1.6 \text{ kph}$$

Therefore: $5 \text{ mph} = 5(1.6) = 8.0 \text{ kph}$

This is a deterministic model because the relationship is **exact** and there is **no error** (except for rounding).

### Stochastic (Random) Relationships

Unfortunately, very few relationships in the business world are this exact. When using one variable to explain another, there is usually **some variation** in the relationship.

**Example**: Vita+Plus, Inc. (health product distributors) wants to develop a regression model using advertising to explain sales revenue. They will likely find that:

- When advertising is set at amount $X_i$, sales have value $Y_i$
- The next time advertising is set at the same amount, sales may produce a **different value**
- The dependent variable (sales) exhibits some degree of **randomness**

Therefore, there will be some **error** in attempting to explain or predict sales.
Such a model is **stochastic** due to the presence of random variation:

**Population (True) Regression Model**

$$Y = \beta_0 + \beta_1 X + \varepsilon \quad [11.4]$$

where:

- $\beta_0 + \beta_1 X$ is the **deterministic portion**
- $\varepsilon$ (epsilon) is the **error term** (random component)
- $\beta_0$ and $\beta_1$ are population parameters (usually unknown)

### Sample-Based Estimation

The parameters $\beta_0$ and $\beta_1$ remain unknown and can only be **estimated** using sample data:

**Sample Regression Model**

$$Y = b_0 + b_1 X + e \quad [11.5]$$

where:

- $b_0$ and $b_1$ are **estimates** of $\beta_0$ and $\beta_1$
- $e$ is the **residual** (observed error in the sample)

The residual $e$ recognizes that not all observations fall exactly on a straight line. If we knew the exact value of $e$, we could calculate $Y$ precisely. However, because $e$ is random, $Y$ can only be **estimated**.

### The Estimated Regression Model

The regression model therefore takes the form:

**Estimated Regression Equation**

$$\hat{Y} = b_0 + b_1 X \quad [11.6]$$

where:

- $\hat{Y}$ (read as "Y-hat") is the **estimated value** of $Y$
- $b_0$ is the estimated intercept
- $b_1$ is the estimated slope

::: {.callout-note}
## Notation Summary

| Symbol | Meaning | Type |
|:------:|:--------|:-----|
| $Y$ | Actual observed value | Data |
| $\hat{Y}$ | Predicted/estimated value | Calculated from model |
| $\beta_0, \beta_1$ | True population parameters | Unknown |
| $b_0, b_1$ | Sample estimates of parameters | Calculated from data |
| $\varepsilon$ | True population error | Unknown/theoretical |
| $e$ | Sample residual | Calculated: $e = Y - \hat{Y}$ |
:::

## Section Exercises

::: {#exr-regression-concepts}
## Understanding Regression Concepts

1. What is the difference between simple regression and multiple regression?
2. What is the difference between linear regression and curvilinear regression? How does $Y$ change when $X$ changes in each case?
3. Differentiate between the deterministic and stochastic components of a regression model.
4. Why is the ordinary least squares method called "least squares"? What role does error play in this analysis?
:::

::: {#exr-variable-identification}
## Identifying Variables

Identify the dependent and independent variables in each of these cases:

a. Time spent working on an assignment and the grade received
b. Son's height and father's height
c. A woman's age and the cost of her life insurance
d. Price of a product and the number of units sold
e. Demand for a product and the number of consumers in the market
:::

::: {#exr-scatter-plot}
## Creating Scatter Diagrams

Given the following data for $X$ and $Y$:

$$
\begin{aligned}
X: & \quad 28, 54, 67, 37, 41, 69, 76 \\
Y: & \quad 14, 21, 36, 39, 18, 54, 52
\end{aligned}
$$

a. Create a scatter diagram for the data
b. What do the data suggest about a relationship between $X$ and $Y$?
c. Draw a line to approximate the relationship
:::

::: {#exr-terminology}
## Regression Terminology

1. What is the difference between $\hat{Y}_i$ and $Y_i$ in regression analysis?
2. What is the term $\varepsilon$ in the regression model and why does it occur?
3. Explain the meaning of "regression to the mean" in the historical context of Sir Francis Galton's work
:::

## Section Summary

::: {.callout-tip icon="📚"}
## Key Takeaways

1. **Regression analysis** studies relationships between variables to make predictions
2. **Dependent variable (Y)**: The outcome we want to explain or predict
3. **Independent variable (X)**: The predictor we use to explain Y
4. **Linear relationship**: Y changes by a constant amount when X changes
5. **Scatter diagrams** reveal the type and strength of relationships visually
6. **Stochastic models** include an error term because real-world relationships aren't perfect
7. **Historical origin**: Sir Francis Galton developed regression studying hereditary patterns
:::

## 11.5 Ordinary Least Squares: The Best-Fit Line

The purpose of regression analysis is to determine a line that **fits the data better** than any other line that could be drawn. To illustrate this, let's assume that Vita+Plus, Inc. collected data on advertising expenditures and sales revenue for 5 months, as shown in Table 11.1.

**Table 11.1: Sales Data for Vita+Plus, Inc.**

| Month | Sales (in $1,000s) | Advertising (in $100s) |
|:-----:|:------------------:|:----------------------:|
| 1 | $450 | $50 |
| 2 | 380 | 40 |
| 3 | 540 | 65 |
| 4 | 500 | 55 |
| 5 | 420 | 45 |

Although a sample of only 5 observations would probably be insufficient in practice, it will serve our purposes for now.

### Understanding the Error Term

These five data points and the line that best fits them appear in a scatter diagram. This line is determined by estimating $b_0$ and $b_1$. A mathematical procedure used to estimate these values is called **Ordinary Least Squares (OLS)**. OLS produces a line that extends through the center of the scatter plot, approximating all data points better than any other line.

For the 5 data points $Y_i$ in the scatter diagram, these are the actual observed values for $Y$ from Table 11.1. The values $\hat{Y}_i$ are obtained from the regression line and represent the **estimated** sales. The difference between what $Y$ actually was ($Y_i$) and what we estimate it to be ($\hat{Y}_i$) is the **error**.
**The Error Term**

$$\text{Error} = (Y_i - \hat{Y}_i) \quad [11.7]$$

- If the actual value $Y_i$ is **greater than** the estimate $\hat{Y}_i$, then $(Y_i - \hat{Y}_i) > 0$ and the error is **positive** (we underestimated)
- If the actual value $Y_i$ is **less than** the estimate $\hat{Y}_i$, then $(Y_i - \hat{Y}_i) < 0$ and the error is **negative** (we overestimated)

### The Principle of Least Squares

Because some errors are negative and some are positive, OLS produces a line such that:

**Sum of Errors Equals Zero**

$$\sum (Y_i - \hat{Y}_i) = 0$$

More importantly, OLS ensures that the **sum of squared errors is minimized**. That is, if we:

1. Take the differences (all vertical) between actual $Y$ values and the regression line
2. Square these differences
3. Sum them

The resulting number will be **smaller** than what we would obtain with any other line.

**OLS Minimizes Sum of Squared Errors**

$$\sum (Y_i - \hat{Y}_i)^2 = \min \quad [11.8]$$

This is why it's called **Ordinary Least Squares** – it produces a line such that the sum of squared errors is the minimum possible.

::: {.callout-important icon="🎯"}
## Why Square the Errors?

1. **Eliminates sign issues**: Squaring makes all values positive, preventing cancellation
2. **Penalizes large errors**: Larger deviations are penalized more heavily
3. **Mathematical tractability**: Squared terms have nice derivative properties for optimization
4. **Statistical properties**: Produces unbiased estimates under certain assumptions
:::

### Computing Sums of Squares and Cross Products

To determine the best-fit line, OLS requires calculating:

**Sum of Squares for X**

$$SCx = \sum(X_i - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{n} \quad [11.9]$$

**Sum of Squares for Y**

$$SCy = \sum(Y_i - \bar{Y})^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} \quad [11.10]$$

**Sum of Cross Products**

$$SCxy = \sum(X_i - \bar{X})(Y_i - \bar{Y}) = \sum XY - \frac{(\sum X)(\sum Y)}{n} \quad [11.11]$$

::: {.callout-note}
## Computational Forms

The first portion of each formula:

- $SCx = \sum(X_i - \bar{X})^2$
- $SCy = \sum(Y_i - \bar{Y})^2$
- $SCxy = \sum(X_i - \bar{X})(Y_i - \bar{Y})$

illustrates how the OLS line is based on **deviations from the mean**. However, these are tedious to calculate manually. The second versions (computational formulas) are much easier for hand calculations.
:::

### Calculating the Regression Coefficients

Given the sums of squares and cross products, it's straightforward to calculate:

**The Slope (Regression Coefficient)**

$$b_1 = \frac{SCxy}{SCx} \quad [11.12]$$

**The Intercept (Regression Constant)**

$$b_0 = \bar{Y} - b_1\bar{X} \quad [11.13]$$

where $\bar{Y}$ and $\bar{X}$ are the means of the $Y$ and $X$ values.

::: {.callout-warning icon="⚠️"}
## Precision Warning

These calculations are **extremely sensitive** to rounding. This is especially true for the coefficient of determination (discussed later). Therefore, it's advisable to carry calculations to **five or six decimal places** for accuracy.
:::

## 11.6 Example: Hop Scotch Airlines

The management of Hop Scotch Airlines, the world's smallest carrier airline, believes there's a direct relationship between advertising expenditures and the number of passengers who choose to fly Hop Scotch.
To determine if this relationship exists and what its exact nature might be, statisticians employed by Hop Scotch decided to use OLS procedures to determine the regression model.

Monthly values for advertising expenses and number of passengers were collected for the **n = 15 most recent months**. The data appear in Table 11.2, along with other necessary calculations for finding the regression model.

| Observation (Month) | Advertising $(X)$ (in $1,000s) | Passengers $(Y)$ (in 1,000s) | $XY$ | $X^2$ | $Y^2$ |
|:-------------------:|:------------------------------:|:----------------------------:|:------:|:-----:|:-----:|
| 1 | 10 | 15 | 150 | 100 | 225 |
| 2 | 12 | 17 | 204 | 144 | 289 |
| 3 | 8 | 13 | 104 | 64 | 169 |
| 4 | 17 | 23 | 391 | 289 | 529 |
| 5 | 10 | 16 | 160 | 100 | 256 |
| 6 | 15 | 21 | 315 | 225 | 441 |
| 7 | 10 | 14 | 140 | 100 | 196 |
| 8 | 14 | 20 | 280 | 196 | 400 |
| 9 | 19 | 24 | 456 | 361 | 576 |
| 10 | 10 | 17 | 170 | 100 | 289 |
| 11 | 11 | 16 | 176 | 121 | 256 |
| 12 | 13 | 18 | 234 | 169 | 324 |
| 13 | 16 | 23 | 368 | 256 | 529 |
| 14 | 10 | 15 | 150 | 100 | 225 |
| 15 | 12 | 16 | 192 | 144 | 256 |
| **Totals** | **187** | **268** | **3,490** | **2,469** | **4,960** |

: Regression Data for Hop Scotch Airlines {#tbl-hopscotch-data .striped .hover}

### Step-by-Step Solution

With this dataset and the calculations for $XY$, $X^2$, and $Y^2$, it's easy to determine the regression model.
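
Before using the column totals, it can be worth confirming them. The short sketch below recomputes the totals row of Table 11.2 from the raw observations. The dictionary keys are illustrative names, not notation from the chapter.

```python
import numpy as np

# Raw observations from Table 11.2
advertising = np.array([10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12])
passengers = np.array([15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16])

# Recompute every entry of the "Totals" row
totals = {
    "n": len(advertising),
    "sum_x": int(advertising.sum()),
    "sum_y": int(passengers.sum()),
    "sum_xy": int((advertising * passengers).sum()),
    "sum_x2": int((advertising ** 2).sum()),
    "sum_y2": int((passengers ** 2).sum()),
}

for name, value in totals.items():
    print(f"{name:>7}: {value:,}")
```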
**Step 1: Calculate Sums of Squares and Cross Products**

$$
\begin{aligned}
SCx &= \sum X^2 - \frac{(\sum X)^2}{n} \\
&= 2,469 - \frac{(187)^2}{15} \\
&= 2,469 - \frac{34,969}{15} \\
&= 2,469 - 2,331.266667 \\
&= 137.733333
\end{aligned}
$$

$$
\begin{aligned}
SCy &= \sum Y^2 - \frac{(\sum Y)^2}{n} \\
&= 4,960 - \frac{(268)^2}{15} \\
&= 4,960 - \frac{71,824}{15} \\
&= 4,960 - 4,788.266667 \\
&= 171.733333
\end{aligned}
$$

$$
\begin{aligned}
SCxy &= \sum XY - \frac{(\sum X)(\sum Y)}{n} \\
&= 3,490 - \frac{(187)(268)}{15} \\
&= 3,490 - \frac{50,116}{15} \\
&= 3,490 - 3,341.066667 \\
&= 148.933333
\end{aligned}
$$

**Step 2: Calculate the Slope (Regression Coefficient)**

$$
\begin{aligned}
b_1 &= \frac{SCxy}{SCx} \\
&= \frac{148.933333}{137.733333} \\
&= 1.081317 \approx 1.08
\end{aligned}
$$

**Step 3: Calculate the Means**

$$\bar{Y} = \frac{\sum Y}{n} = \frac{268}{15} = 17.866667$$

$$\bar{X} = \frac{\sum X}{n} = \frac{187}{15} = 12.466667$$

**Step 4: Calculate the Intercept**

$$
\begin{aligned}
b_0 &= \bar{Y} - b_1\bar{X} \\
&= 17.866667 - 1.081317(12.466667) \\
&= 17.866667 - 13.480419 \\
&= 4.386248 \approx 4.39
\end{aligned}
$$

### The Regression Model

**Hop Scotch Airlines Regression Equation**

$$\hat{Y}_i = 4.39 + 1.08X_i$$

where $\hat{Y}_i$ is the predicted number of passengers (in thousands) for a given advertising expenditure $X_i$ (in thousands of dollars).

### Interpreting the Results

**Intercept ($b_0 = 4.39$)**
: When advertising is zero ($X = 0$), the model predicts 4,390 passengers. This represents the "baseline" demand without advertising. Because no observed month had advertising anywhere near zero, this baseline is an extrapolation and should be read with caution.

**Slope ($b_1 = 1.08$)**
: For every \$1,000 increase in advertising expenditure, the predicted passenger count increases by approximately 1,080 passengers (1.08 thousand).

This positive slope is consistent with management's belief: **advertising expenditures are positively related to passenger volume**.
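
The four steps above can be reproduced in a few lines. The sketch below applies the computational formulas [11.9] to [11.13] to the Table 11.2 data and cross-checks them against the deviation-from-the-mean forms. Variable names such as `sc_x` are illustrative.

```python
import numpy as np

# Data from Table 11.2
x = np.array([10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12], dtype=float)
y = np.array([15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16], dtype=float)
n = len(x)

# Step 1: computational formulas for the sums of squares and cross products
sc_x = (x ** 2).sum() - x.sum() ** 2 / n
sc_y = (y ** 2).sum() - y.sum() ** 2 / n
sc_xy = (x * y).sum() - x.sum() * y.sum() / n

# Cross-check against the deviation forms of the same quantities
assert np.isclose(sc_x, ((x - x.mean()) ** 2).sum())
assert np.isclose(sc_y, ((y - y.mean()) ** 2).sum())
assert np.isclose(sc_xy, ((x - x.mean()) * (y - y.mean())).sum())

# Steps 2-4: slope, means, intercept
b1 = sc_xy / sc_x
b0 = y.mean() - b1 * x.mean()

print(f"SCx  = {sc_x:.6f}")
print(f"SCy  = {sc_y:.6f}")
print(f"SCxy = {sc_xy:.6f}")
print(f"b1   = {b1:.6f}")
print(f"b0   = {b0:.6f}")
```

Because the code carries full floating-point precision, the intercept it prints can differ from a hand calculation in the last few decimal places. Both round to 4.39.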
### Making Predictions

**Example 1**: If advertising expenditure is \$10,000 ($X = 10$):

$$\hat{Y}_i = 4.39 + 1.08(10) = 4.39 + 10.8 = 15.19$$

**Prediction**: Approximately **15,190 passengers**

**Example 2**: If advertising increases to \$11,000 ($X = 11$):

$$\hat{Y}_i = 4.39 + 1.08(11) = 4.39 + 11.88 = 16.27$$

**Prediction**: Approximately **16,270 passengers**

**Marginal Effect**: Increasing advertising by \$1,000 raised the prediction by $16.27 - 15.19 = 1.08$ thousand = **1,080 additional passengers**, confirming our slope interpretation.

```{python}
#| echo: false
#| label: fig-hopscotch-regression
#| fig-cap: "Regression Line for Hop Scotch Airlines: Advertising vs. Passengers"

import matplotlib.pyplot as plt
import numpy as np

# Data from Table 11.2
advertising = np.array([10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12])
passengers = np.array([15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16])

# Regression line parameters
b0 = 4.39
b1 = 1.08

# Create figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Left panel: Scatter plot with regression line
ax1.scatter(advertising, passengers, s=100, alpha=0.6, color='steelblue',
            edgecolors='black', linewidth=1.5, label='Actual Data')

# Regression line
x_line = np.linspace(7, 20, 100)
y_line = b0 + b1 * x_line
ax1.plot(x_line, y_line, 'r-', linewidth=3, label=f'Ŷ = {b0:.2f} + {b1:.2f}X')

# Highlight a specific point to show residual
highlight_idx = 8  # Point where X=19, Y=24
x_point = advertising[highlight_idx]
y_actual = passengers[highlight_idx]
y_pred = b0 + b1 * x_point

ax1.plot([x_point, x_point], [y_actual, y_pred], 'g--', linewidth=2, alpha=0.7)
ax1.plot(x_point, y_actual, 'go', markersize=12, label=f'Actual: Y = {y_actual}')
ax1.plot(x_point, y_pred, 'ro', markersize=12, label=f'Predicted: Ŷ = {y_pred:.2f}')
ax1.annotate(f'Residual = {y_actual - y_pred:.2f}',
             xy=(x_point + 0.3, (y_actual + y_pred)/2),
             fontsize=10, fontweight='bold', color='green',
             bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))

# Mark intercept
ax1.plot(0, b0, 'mo', markersize=10, zorder=5)
ax1.annotate(f'Intercept = {b0:.2f}', xy=(0, b0), xytext=(2, 6),
             fontsize=10, fontweight='bold',
             arrowprops=dict(arrowstyle='->', lw=2, color='purple'))

ax1.set_xlabel('Advertising Expenditure (in $1,000s)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Number of Passengers (in 1,000s)', fontsize=12, fontweight='bold')
ax1.set_title('Hop Scotch Airlines: Regression Analysis\nAdvertising Effect on Passenger Volume',
              fontsize=13, fontweight='bold')
ax1.legend(loc='upper left', fontsize=9)
ax1.grid(alpha=0.3)
ax1.set_xlim(0, 20)
ax1.set_ylim(0, 26)

# Right panel: Residual plot
residuals = passengers - (b0 + b1 * advertising)
ax2.scatter(advertising, residuals, s=100, alpha=0.6, color='coral',
            edgecolors='black', linewidth=1.5)
ax2.axhline(y=0, color='red', linestyle='--', linewidth=2, label='Zero Line')

# Add reference bands
ax2.axhspan(-2, 2, alpha=0.1, color='green', label='±2 range')

ax2.set_xlabel('Advertising Expenditure (in $1,000s)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Residuals (Y - Ŷ)', fontsize=12, fontweight='bold')
ax2.set_title('Residual Plot\n(Checking for Patterns)', fontsize=13, fontweight='bold')
ax2.legend(loc='upper right', fontsize=9)
ax2.grid(alpha=0.3)
ax2.set_xlim(7, 20)

# Add annotation about residuals
ax2.annotate('Random scatter around zero\nsuggests good model fit',
             xy=(15, 1.5), fontsize=10,
             bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))

plt.tight_layout()
plt.show()
```

### Computer Output Verification

Statistical software, including Python's statistical libraries, produces regression output automatically.
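
One way such a coefficient table might be assembled is sketched below. It uses only NumPy together with the standard textbook formulas for the standard errors of $b_0$ and $b_1$, which are not derived in this section. The function name `simple_ols_table` and the labels `Constant` and `ADV` are hypothetical, and the p-value column is omitted because it needs a t-distribution routine.

```python
import numpy as np

def simple_ols_table(x, y):
    """Return coefficient, standard error and t-ratio for a simple regression.

    Hypothetical helper for illustration; not part of any library.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    sc_x = ((x - x.mean()) ** 2).sum()
    sc_xy = ((x - x.mean()) * (y - y.mean())).sum()

    b1 = sc_xy / sc_x
    b0 = y.mean() - b1 * x.mean()

    # Residual standard deviation with n - 2 degrees of freedom
    residuals = y - (b0 + b1 * x)
    s = np.sqrt((residuals ** 2).sum() / (n - 2))

    # Standard results (assumed here, not derived in the text)
    se_b1 = s / np.sqrt(sc_x)
    se_b0 = s * np.sqrt(1 / n + x.mean() ** 2 / sc_x)

    return {
        "Constant": {"coef": b0, "se": se_b0, "t": b0 / se_b0},
        "ADV": {"coef": b1, "se": se_b1, "t": b1 / se_b1},
        "s": s,
    }

advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]
passengers = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]

table = simple_ols_table(advertising, passengers)
print(f"{'Predictor':<10}{'Coef':>10}{'SE Coef':>10}{'t-ratio':>10}")
for name in ("Constant", "ADV"):
    row = table[name]
    print(f"{name:<10}{row['coef']:>10.4f}{row['se']:>10.4f}{row['t']:>10.2f}")
print(f"s = {table['s']:.4f}")
```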
Here's what the output would show:

**Regression Equation:**

```
PASS = 4.39 + 1.08 ADV
```

**Coefficients:**

```
Predictor    Coef      SE Coef   t-ratio   p-value
Constant     4.3863    0.9913     4.42     0.001
ADV          1.0813    0.0773    13.99     0.000
```

**Model Summary:**

```
s = 0.9068    R-sq = 93.8%    R-sq(adj) = 93.3%
```

The regression line passes through the middle of the scatter plot, minimizing the sum of squared vertical distances from each point to the line.

::: {.callout-tip icon="💡"}
## Business Interpretation

**For Hop Scotch Airlines Management:**

1. **Positive Advertising Effect**: Each additional \$1,000 spent on advertising is associated with approximately 1,080 additional passengers. Whether this is a positive return depends on the revenue each passenger brings in
2. **Strong Relationship**: The model explains 93.8% of the variation in passenger volume (R-sq = 93.8%)
3. **Statistical Significance**: The reported p-value of 0.000 for advertising (that is, below 0.0005) indicates the relationship is highly significant
4. **Predictive Power**: Management can use this model for budgeting and forecasting within the observed advertising range (\$8,000 to \$19,000 per month)
5. **Strategic Insight**: Even with zero advertising, the model predicts baseline demand (4,390 passengers), suggesting brand recognition or repeat customers, although this figure is an extrapolation
:::

## Section Exercises

::: {#exr-gpa-job-offers}
## GPA and Job Offers

The career placement center at State University wants to determine if student grade point averages (GPAs) can explain the number of job offers they receive after graduation. Data for 10 recent graduates:

| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|:-------:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| GPA | 3.25 | 2.35 | 1.02 | 0.36 | 3.69 | 2.65 | 2.15 | 1.25 | 3.88 | 3.37 |
| Offers | 3 | 3 | 1 | 0 | 5 | 4 | 2 | 2 | 6 | 2 |

a. Create a scatter diagram for the data
b. Calculate and interpret the regression model. What does this model tell you about the relationship between GPA and job offers?
c. If Steve has a GPA of 3.22, how many job offers do you predict he will receive?
:::

::: {#exr-income-consumption}
## Income and Consumption

An economist at the Florida Department of Human Resources is preparing a study on consumer behavior. Data were collected (in thousands of dollars) to determine if a relationship exists between consumer income and consumption levels.

| Consumer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|:--------:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:---:|:----:|:----:|:----:|
| Income | 24.3 | 12.5 | 31.2 | 28.0 | 35.1 | 10.5 | 23.2 | 10.0 | 8.5 | 15.9 | 14.7 | 15.0 |
| Consumption | 16.2 | 8.5 | 15.0 | 17.0 | 24.2 | 11.2 | 15.0 | 7.1 | 3.5 | 11.5 | 10.7 | 9.2 |

a. Create a scatter diagram for the data
b. Calculate and interpret the regression model. What does this model tell you about the relationship between consumption and income? What proportion of each additional dollar earned is spent on consumption?
c. What consumption would the model predict for someone earning $27,500?
:::

::: {#exr-interest-housing}
## Interest Rates and Housing Sales

A bank in Atlanta specializing in home loans is trying to analyze the real estate market by measuring how well interest rates explain the number of houses sold in the area. Data were compiled for a 10-month period:

| Month | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|:-----:|:----:|:----:|:----:|:---:|:----:|:---:|:---:|:----:|:----:|:----:|
| Interest | 12.3 | 10.5 | 15.6 | 9.5 | 10.5 | 9.3 | 8.7 | 14.2 | 15.2 | 12.0 |
| Houses | 196 | 285 | 125 | 225 | 248 | 303 | 265 | 102 | 105 | 114 |

a. Create a scatter diagram for the data
b. Calculate and interpret the regression model. What does this model tell you about the relationship between interest rates and housing sales?
c. If the interest rate is 9.5%, how many houses would be sold according to the model?
:::

::: {#exr-production-costs}
## Production Costs

Overland Group produces truck parts used in semi-trailers. The chief accountant wants to develop a regression model to predict costs.
Units produced is selected as the predictor variable. Costs are in thousands of dollars, units in hundreds.

| Units | 12.3 | 8.3 | 6.5 | 4.8 | 14.6 | 14.6 | 14.6 | 6.5 |
|:-----:|:----:|:---:|:---:|:---:|:----:|:----:|:----:|:---:|
| Cost | 6.2 | 5.3 | 4.1 | 4.4 | 5.2 | 4.8 | 5.9 | 4.2 |

a. Create a scatter diagram for the data
b. Calculate and interpret the regression model. What does the model tell the accountant about the relationship between production and costs?
c. According to the model, how much would it cost to produce 750 units?
:::

::: {#exr-distance-absences}
## Distance and Class Absences

Professor Mundane has noticed many students have been absent from class this semester. He believes he can explain this lack of attendance by the distances students live from campus. Eleven students were surveyed about how many miles they must travel to attend class and the number of classes they've missed.

| Miles | 5 | 6 | 2 | 0 | 9 | 12 | 16 | 5 | 7 | 0 | 8 |
|:-----:|:-:|:-:|:-:|:-:|:-:|:--:|:--:|:-:|:-:|:-:|:-:|
| Absences | 2 | 2 | 4 | 5 | 4 | 2 | 5 | 2 | 3 | 1 | 4 |

a. Create a scatter diagram for the data
b. Calculate and interpret the regression model. What does the professor discover?
c. How many classes would you miss if you lived 3.2 miles from campus, according to the model?
:::

::: {#exr-test-ratings}
## Employment Test Scores and Performance Ratings

The personnel director at Bupkus, Inc. obtained data on 100 employees regarding entrance tests administered at hiring and subsequent ratings employees received from supervisors one year later. Test scores ranged from 0 to 10, and ratings used a 5-point system. The director wants to use the regression model to predict rating (R) based on test score (S). The results are:

$$
\begin{aligned}
\sum S &= 522 \quad \sum R = 326 \quad \sum SR = 17,325 \\
\sum S^2 &= 28,854 \quad \sum R^2 = 10,781
\end{aligned}
$$

Develop and interpret the regression model.
What can the director predict about the rating of an employee who scored 7 on the test?
:::

::: {.callout-note}
## Note for Students

**Keep your calculations from exercises 1-6** for use throughout the rest of this chapter. Using the same data, you will avoid having to recalculate $SCx$, $SCy$, and $SCxy$ each time. You will gain additional experience with other problems at the end of the chapter.
:::

## Section Summary

::: {.callout-tip icon="📚"}
## Key Takeaways: Ordinary Least Squares

1. **OLS minimizes** the sum of squared vertical distances from data points to the regression line
2. **Regression slope ($b_1$)**: Measures the change in Y for each one-unit change in X
3. **Regression intercept ($b_0$)**: The predicted value of Y when X = 0
4. **Computational formulas** make hand calculations manageable
5. **Precision matters**: Carry calculations to 5-6 decimal places to avoid rounding errors
6. **Residuals**: The differences between actual and predicted values reveal model fit
7. **Business value**: Regression provides quantitative relationships for decision-making
:::

## 11.7 Standard Error of Estimation

Now that we have a regression model, the next logical question is: **How good is it?** How well does the model fit the data?

The **standard error of estimation** provides a critical measure of this goodness of fit.

### Defining the Standard Error

The standard error of estimation, denoted $S_e$, measures the **typical distance** that observed data points fall from the regression line. It's conceptually similar to the standard deviation, but instead of measuring dispersion around the mean of a single variable, it measures dispersion around the regression line.
**Standard Error of Estimation (Conceptual Form)**

$$S_e = \sqrt{\frac{\sum(Y_i - \hat{Y}_i)^2}{n-2}} \quad [11.14]$$

where:

- $Y_i$ = actual observed value
- $\hat{Y}_i$ = predicted value from regression line
- $n - 2$ = degrees of freedom (we lose 2 degrees because we estimate $b_0$ and $b_1$)

::: {.callout-note}
## Why n-2 Degrees of Freedom?

In regression analysis, we estimate **two parameters**: the intercept ($b_0$) and the slope ($b_1$). Each parameter estimated "uses up" one degree of freedom from our sample of $n$ observations, leaving us with $n - 2$ degrees of freedom for the error term.
:::

### Computational Formula

While the conceptual formula clearly shows what $S_e$ measures, it's tedious to calculate manually because we'd need to:

1. Calculate $\hat{Y}_i$ for each observation
2. Find each difference $(Y_i - \hat{Y}_i)$
3. Square each difference
4. Sum all squared differences

A more efficient computational approach uses the **sum of squared errors (SCE)** and **mean squared error (CME)**:

**Sum of Squared Errors**

$$SCE = SCy - \frac{(SCxy)^2}{SCx} \quad [11.15]$$

**Mean Squared Error (CME)**

$$CME = \frac{SCE}{n-2} \quad [11.16]$$

**Standard Error of Estimation (Computational Form)**

$$S_e = \sqrt{CME} \quad [11.17]$$

### Application to Hop Scotch Airlines

Let's calculate the standard error for the Hop Scotch Airlines example, where:

- $SCy = 171.733333$
- $SCxy = 148.933333$
- $SCx = 137.733333$
- $n = 15$

**Step 1: Calculate SCE**

$$
\begin{aligned}
SCE &= SCy - \frac{(SCxy)^2}{SCx} \\
&= 171.733333 - \frac{(148.933333)^2}{137.733333} \\
&= 171.733333 - \frac{22,181.14}{137.733333} \\
&= 171.733333 - 161.044 \\
&= 10.6893
\end{aligned}
$$

**Step 2: Calculate CME**

$$
\begin{aligned}
CME &= \frac{SCE}{n-2} \\
&= \frac{10.6893}{15-2} \\
&= \frac{10.6893}{13} \\
&= 0.82225
\end{aligned}
$$

**Step 3: Calculate $S_e$**

$$
\begin{aligned}
S_e &= \sqrt{CME} \\
&= \sqrt{0.82225} \\
&= 0.90678 \approx 0.907
\end{aligned}
$$

### Interpreting the Standard Error

The standard error is **always expressed in the same units** as the dependent variable $Y$. For Hop Scotch Airlines:

**$S_e = 0.907$ thousand passengers = 907 passengers**

This means that the actual number of passengers typically deviates from the regression line's prediction by about 907 passengers.

### The Empirical Rule Applied to Regression

Just as the empirical rule applies to standard deviation for a single variable, it applies to the standard error in regression analysis:

::: {.callout-important icon="📊"}
## Empirical Rule for Regression

If the errors are normally distributed, approximately:

- **68.3%** of observations fall within **$\pm 1S_e$** of the regression line
- **95.5%** of observations fall within **$\pm 2S_e$** of the regression line
- **99.7%** of observations fall within **$\pm 3S_e$** of the regression line
:::

### Practical Application

Let's apply this to a specific prediction. When advertising expenditure is $X = 10$ thousand dollars:

$$\hat{Y}_i = 4.39 + 1.08(10) = 15.19 \text{ thousand passengers}$$

This represents the **average** number of passengers we'd expect when $X = 10$.
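
Before placing a band around this prediction, it can help to confirm $S_e$ numerically. The sketch below checks that the conceptual form [11.14] and the computational form [11.15] to [11.17] give the same value, then evaluates $\hat{Y}$ at $X = 10$ with the rounded coefficients used in the text. Variable names are illustrative.

```python
import numpy as np

# Data from Table 11.2
x = np.array([10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12], dtype=float)
y = np.array([15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16], dtype=float)
n = len(x)

sc_x = ((x - x.mean()) ** 2).sum()
sc_y = ((y - y.mean()) ** 2).sum()
sc_xy = ((x - x.mean()) * (y - y.mean())).sum()
b1 = sc_xy / sc_x
b0 = y.mean() - b1 * x.mean()

# Conceptual form [11.14]: square and sum the individual residuals
residuals = y - (b0 + b1 * x)
se_conceptual = np.sqrt((residuals ** 2).sum() / (n - 2))

# Computational form [11.15]-[11.17]: work from the sums of squares
sce = sc_y - sc_xy ** 2 / sc_x
cme = sce / (n - 2)
se_computational = np.sqrt(cme)

# Prediction at X = 10 with the rounded coefficients from the text
y_hat_10 = 4.39 + 1.08 * 10
lower = y_hat_10 - se_computational
upper = y_hat_10 + se_computational

print(f"S_e (conceptual)    = {se_conceptual:.5f}")
print(f"S_e (computational) = {se_computational:.5f}")
print(f"Y-hat at X=10       = {y_hat_10:.2f}")
print(f"+/- 1 S_e           = [{lower:.2f}, {upper:.2f}]")
```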
**One Standard Error Band:**

- Upper bound: $15.19 + 0.907 = 16.10$ thousand (16,100 passengers)
- Lower bound: $15.19 - 0.907 = 14.28$ thousand (14,280 passengers)

**Interpretation**: When Hop Scotch spends \$10,000 on advertising, and assuming normally distributed errors:

- **About 68.3% of the time**, passenger count will be between 14,280 and 16,100
- **About 31.7% of the time**, passenger count will fall outside this range (either below 14,280 or above 16,100)

```{python}
#| echo: false
#| label: fig-standard-error-bands
#| fig-cap: "Standard Error Bands Around Regression Line for Hop Scotch Airlines"

import matplotlib.pyplot as plt
import numpy as np

# Data from Table 11.2
advertising = np.array([10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12])
passengers = np.array([15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16])

# Regression parameters
b0 = 4.39
b1 = 1.08
Se = 0.907

# Create figure with two panels
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Left panel: Standard error bands
x_line = np.linspace(7, 20, 100)
y_pred = b0 + b1 * x_line

# Plot standard error bands
ax1.fill_between(x_line, y_pred - 3*Se, y_pred + 3*Se, alpha=0.1, color='red',
                 label='±3Se (99.7%)')
ax1.fill_between(x_line, y_pred - 2*Se, y_pred + 2*Se, alpha=0.15, color='orange',
                 label='±2Se (95.5%)')
ax1.fill_between(x_line, y_pred - Se, y_pred + Se, alpha=0.2, color='green',
                 label='±1Se (68.3%)')

# Regression line
ax1.plot(x_line, y_pred, 'b-', linewidth=3, label=f'Ŷ = {b0:.2f} + {b1:.2f}X')

# Scatter plot
ax1.scatter(advertising, passengers, s=100, alpha=0.7, color='steelblue',
            edgecolors='black', linewidth=1.5, label='Actual Data', zorder=5)

# Highlight the X=10 example
x_example = 10
y_example_pred = b0 + b1 * x_example
ax1.plot([x_example, x_example], [y_example_pred - Se, y_example_pred + Se],
         'r-', linewidth=4, alpha=0.7)
ax1.plot(x_example, y_example_pred + Se, 'ro', markersize=10)
ax1.plot(x_example, y_example_pred - Se, 'ro', markersize=10)
ax1.annotate(f'X = {x_example}: Ŷ = {y_example_pred:.2f}\n±1Se = [{y_example_pred-Se:.2f}, {y_example_pred+Se:.2f}]',
             xy=(x_example, y_example_pred),
             xytext=(x_example + 2, y_example_pred + 3),
             fontsize=9, fontweight='bold',
             bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7),
             arrowprops=dict(arrowstyle='->', lw=2, color='red'))

ax1.set_xlabel('Advertising Expenditure (in $1,000s)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Number of Passengers (in 1,000s)', fontsize=12, fontweight='bold')
ax1.set_title('Standard Error Bands\nEmpirical Rule Applied to Regression',
              fontsize=13, fontweight='bold')
ax1.legend(loc='upper left', fontsize=9)
ax1.grid(alpha=0.3)
ax1.set_xlim(7, 20)
ax1.set_ylim(8, 28)

# Right panel: Comparison of different dispersions
# Simulate two datasets with different standard errors
np.random.seed(42)
x_sim = np.linspace(8, 18, 15)
y_sim_tight = 5 + 1.5*x_sim + np.random.normal(0, 0.5, 15)  # Low Se
y_sim_loose = 5 + 1.5*x_sim + np.random.normal(0, 2.5, 15)  # High Se

# Reference lines: the true generating line (not refitted to the simulated points)
b1_tight = 1.5
b0_tight = 5
b1_loose = 1.5
b0_loose = 5

x_line_sim = np.linspace(7, 19, 100)
y_line_tight = b0_tight + b1_tight * x_line_sim
y_line_loose = b0_loose + b1_loose * x_line_sim

# Plot tight fit
ax2.scatter(x_sim, y_sim_tight, s=80, alpha=0.6, color='green',
            edgecolors='black', linewidth=1.5, label='Low Se (Good Fit)')
ax2.plot(x_line_sim, y_line_tight, 'g-', linewidth=2.5, alpha=0.7)

# Plot loose fit
ax2.scatter(x_sim + 0.3, y_sim_loose, s=80, alpha=0.6, color='red',
            edgecolors='black', linewidth=1.5, label='High Se (Poor Fit)', marker='s')
ax2.plot(x_line_sim, y_line_loose, 'r--', linewidth=2.5, alpha=0.7)

ax2.annotate('More scattered data\nresults in higher Se',
             xy=(15, 28), fontsize=11, fontweight='bold', color='darkred',
             bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.7))
ax2.annotate('Tightly clustered data\nresults in lower Se',
             xy=(10, 15), fontsize=11, fontweight='bold', color='darkgreen',
             bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))

ax2.set_xlabel('X Variable', fontsize=12, fontweight='bold')
ax2.set_ylabel('Y Variable', fontsize=12, fontweight='bold')
ax2.set_title('Comparing Model Fit Quality\nStandard Error as Goodness of Fit',
              fontsize=13, fontweight='bold')
ax2.legend(loc='upper left', fontsize=10)
ax2.grid(alpha=0.3)
ax2.set_xlim(7, 19)

plt.tight_layout()
plt.show()
```

### Standard Error as a Goodness-of-Fit Measure

The standard error provides a quantifiable measure of how well the regression model fits the data:

**Smaller $S_e$** → **Better Fit**
: Data points cluster tightly around the regression line. Predictions are more accurate.

**Larger $S_e$** → **Poorer Fit**
: Data points are more scattered around the regression line. Predictions are less reliable.

As shown in the visualization above:

- When data are **tightly clustered** around the line, $S_e$ is small (good fit)
- When data are **widely dispersed**, $S_e$ is large (poor fit)

::: {.callout-warning icon="⚠️"}
## Context Matters for Interpretation

Whether $S_e = 0.907$ represents a "good" or "poor" fit depends on context:

- **For Hop Scotch**: A typical error of 907 passengers might be acceptable given that monthly passenger volumes range from 13,000 to 24,000
- **Relative measure**: Compare $S_e$ to the range of $Y$ values. Here, the range is $(24 - 13) = 11$ thousand, so $S_e = 0.907$ is about 8% of the range
- **Business decision**: Is this level of prediction accuracy sufficient for planning purposes?
:::

### Verification with Computer Output

From the Python output shown earlier:

```
s = 0.9068    R-sq = 93.8%    R-sq(adj) = 93.3%
```

The computer-calculated standard error ($s = 0.9068$) matches our hand calculation ($S_e = 0.907$), confirming our work.

## Section Exercises

::: {#exr-se-state-university}
## Standard Error for State University

Using your calculations from the GPA and job offers exercise (Exercise 9), calculate and interpret the standard error of estimation for State University.
Create a graph illustrating the interpretation. How can this be used as a measure of goodness of fit?
:::

::: {#exr-se-florida-resources}
## Standard Error for Florida Department of Human Resources

Based on the data from the income and consumption exercise (Exercise 10), what is the standard error of estimation for the Florida Department of Human Resources? How would you interpret the results? Use a graph in your interpretation.
:::

::: {#exr-se-atlanta-bank}
## Standard Error for Atlanta Bank

Calculate and interpret the standard error of estimation for the interest rates and housing sales exercise (Exercise 11) about the Atlanta bank.
:::

::: {#exr-se-overland-group}
## Standard Error for Overland Group

The Overland Group from the production costs exercise (Exercise 12) now wants to know the standard error of estimation for their cost prediction model.
:::

::: {#exr-se-professor-mundane}
## Standard Error for Professor Mundane

What is the standard error of estimation that Professor Mundane will experience in the distance and absences exercise (Exercise 13)?
:::

## 11.8 Correlation Analysis: Measuring Relationship Strength

The regression model has provided a clear picture of the relationship between Hop Scotch Airlines' advertising expenditures and the number of brave travelers who line up at the ticket counter. The positive value for $b_1$ indicates a **direct relationship**: as advertising increases, so does the number of passengers.

Now it's useful to obtain a measure of the **strength** of that relationship. This is the function of the **correlation coefficient**, developed by Karl Pearson at the end of the 19th century. Sometimes called the **Pearson product-moment correlation coefficient**, it's represented by $r$.
### The Correlation Coefficient

**Range of the Correlation Coefficient**

$$-1 \leq r \leq +1 \quad [11.18]$$

The correlation coefficient can assume any value between $-1$ and $+1$:

**$r = -1$** (Perfect Negative Correlation)
: All observations fall on a straight line with negative slope. $X$ and $Y$ move in **opposite directions** perfectly.

**$r = 0$** (No Linear Correlation)
: No linear relationship exists between $X$ and $Y$. A straight line in $X$ gives no help in predicting $Y$, although a nonlinear relationship may still exist.

**$r = +1$** (Perfect Positive Correlation)
: All observations fall on a straight line with positive slope. $X$ and $Y$ move in the **same direction** perfectly.

**$-1 < r < 0$** (Negative Correlation)
: $X$ and $Y$ tend to move in opposite directions, but the relationship isn't perfect.

**$0 < r < +1$** (Positive Correlation)
: $X$ and $Y$ tend to move in the same direction, but the relationship isn't perfect.

::: {.callout-tip icon="💡"}
## Interpreting Correlation Strength

General guidelines (though context-dependent):

- $|r| \geq 0.9$: Very strong relationship
- $0.7 \leq |r| < 0.9$: Strong relationship
- $0.5 \leq |r| < 0.7$: Moderate relationship
- $0.3 \leq |r| < 0.5$: Weak relationship
- $|r| < 0.3$: Very weak or no relationship

The closer $|r|$ is to 1, the stronger the linear relationship.
:::

```{python}
#| echo: false
#| label: fig-correlation-examples
#| fig-cap: "Examples of Different Correlation Coefficients"

import matplotlib.pyplot as plt
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Create figure with 6 panels
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

# Generate data for different correlation scenarios
n = 50

# 1. Perfect positive (r = +1)
x1 = np.linspace(0, 10, n)
y1 = 2 + 1.5 * x1
axes[0].scatter(x1, y1, s=60, alpha=0.6, color='darkgreen', edgecolors='black')
axes[0].plot(x1, y1, 'g-', linewidth=2)
axes[0].set_title('Perfect Positive Correlation\nr = +1.00', fontsize=12, fontweight='bold')
axes[0].set_xlabel('X')
axes[0].set_ylabel('Y')
axes[0].grid(alpha=0.3)

# 2. Strong positive (r ≈ +0.9)
x2 = np.random.uniform(0, 10, n)
y2 = 2 + 1.5 * x2 + np.random.normal(0, 1.5, n)
axes[1].scatter(x2, y2, s=60, alpha=0.6, color='green', edgecolors='black')
# Add regression line (slope = r * sd(y) / sd(x))
b1_2 = np.corrcoef(x2, y2)[0,1] * np.std(y2) / np.std(x2)
b0_2 = np.mean(y2) - b1_2 * np.mean(x2)
axes[1].plot(x2, b0_2 + b1_2 * x2, 'g-', linewidth=2)
r_2 = np.corrcoef(x2, y2)[0,1]
axes[1].set_title(f'Strong Positive Correlation\nr = {r_2:.2f}', fontsize=12, fontweight='bold')
axes[1].set_xlabel('X')
axes[1].set_ylabel('Y')
axes[1].grid(alpha=0.3)

# 3. Moderate positive (r ≈ +0.6)
x3 = np.random.uniform(0, 10, n)
y3 = 2 + 1.5 * x3 + np.random.normal(0, 4, n)
axes[2].scatter(x3, y3, s=60, alpha=0.6, color='lightgreen', edgecolors='black')
b1_3 = np.corrcoef(x3, y3)[0,1] * np.std(y3) / np.std(x3)
b0_3 = np.mean(y3) - b1_3 * np.mean(x3)
axes[2].plot(x3, b0_3 + b1_3 * x3, 'g-', linewidth=2)
r_3 = np.corrcoef(x3, y3)[0,1]
axes[2].set_title(f'Moderate Positive Correlation\nr = {r_3:.2f}', fontsize=12, fontweight='bold')
axes[2].set_xlabel('X')
axes[2].set_ylabel('Y')
axes[2].grid(alpha=0.3)

# 4. Perfect negative (r = -1)
x4 = np.linspace(0, 10, n)
y4 = 20 - 1.5 * x4
axes[3].scatter(x4, y4, s=60, alpha=0.6, color='darkred', edgecolors='black')
axes[3].plot(x4, y4, 'r-', linewidth=2)
axes[3].set_title('Perfect Negative Correlation\nr = -1.00', fontsize=12, fontweight='bold')
axes[3].set_xlabel('X')
axes[3].set_ylabel('Y')
axes[3].grid(alpha=0.3)

# 5. Strong negative (r ≈ -0.85)
x5 = np.random.uniform(0, 10, n)
y5 = 20 - 1.5 * x5 + np.random.normal(0, 2, n)
axes[4].scatter(x5, y5, s=60, alpha=0.6, color='red', edgecolors='black')
b1_5 = np.corrcoef(x5, y5)[0,1] * np.std(y5) / np.std(x5)
b0_5 = np.mean(y5) - b1_5 * np.mean(x5)
axes[4].plot(x5, b0_5 + b1_5 * x5, 'r-', linewidth=2)
r_5 = np.corrcoef(x5, y5)[0,1]
axes[4].set_title(f'Strong Negative Correlation\nr = {r_5:.2f}', fontsize=12, fontweight='bold')
axes[4].set_xlabel('X')
axes[4].set_ylabel('Y')
axes[4].grid(alpha=0.3)

# 6. No correlation (r ≈ 0)
x6 = np.random.uniform(0, 10, n)
y6 = np.random.uniform(0, 20, n)
axes[5].scatter(x6, y6, s=60, alpha=0.6, color='gray', edgecolors='black')
r_6 = np.corrcoef(x6, y6)[0,1]
axes[5].axhline(y=np.mean(y6), color='gray', linestyle='--', linewidth=2)
axes[5].set_title(f'No Correlation\nr = {r_6:.2f}', fontsize=12, fontweight='bold')
axes[5].set_xlabel('X')
axes[5].set_ylabel('Y')
axes[5].grid(alpha=0.3)

plt.tight_layout()
plt.show()
```

### Understanding What r Measures

To fully understand what the correlation coefficient measures, we need to develop three measures of deviation.
Consider observation 13 from the Hop Scotch data (where $X = 16$ and $Y = 23$):

**Total Deviation of Y**
: The amount by which individual $Y$ values vary from their mean $\bar{Y}$:

$$(Y_i - \bar{Y}) = 23 - 17.87 = 5.13$$

**Explained Deviation**
: The difference between what the regression model predicts ($\hat{Y}_i$) and the mean value of $Y$ ($\bar{Y}$):

$$(\hat{Y}_i - \bar{Y}) = 21.67 - 17.87 = 3.80$$

where $\hat{Y}_i = 4.39 + 1.08(16) = 21.67$

**Unexplained Deviation (Residual)**
: The portion of total deviation that is NOT explained by the regression model (the error):

$$(Y_i - \hat{Y}_i) = 23 - 21.67 = 1.33$$

**Fundamental Relationship**:

$$\text{Total Deviation} = \text{Explained Deviation} + \text{Unexplained Deviation}$$

$$(Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$

For observation 13, $5.13 = 3.80 + 1.33$.

### Sums of Squares Decomposition

When we square these deviations and sum across all observations, we get:

**Sum of Squares Total (SCT)**

$$SCT = \sum(Y_i - \bar{Y})^2 \quad [11.19]$$

Measures total variation in $Y$ around its mean. This is the same quantity as $SCy$.

**Sum of Squares Regression (SCR)**

$$SCR = \sum(\hat{Y}_i - \bar{Y})^2 \quad [11.20]$$

Measures variation **explained** by the regression model.

**Sum of Squares Error (SCE)**

$$SCE = \sum(Y_i - \hat{Y}_i)^2 \quad [11.21]$$

Measures variation **NOT explained** by the model (residual variation).

**Fundamental ANOVA Identity**:

$$SCT = SCR + SCE$$

### Calculating the Correlation Coefficient

The correlation coefficient compares explained variation to total variation:

**Correlation Coefficient (Conceptual)**

$$r = \sqrt{\frac{\text{Explained Variation}}{\text{Total Variation}}} = \sqrt{\frac{SCR}{SCT}} \quad [11.22]$$

The square root gives only the magnitude of $r$; its sign is the sign of the slope $b_1$. This formula is also difficult to calculate manually. A more convenient computational formula is:

**Correlation Coefficient (Computational Form)**

$$r = \frac{SCxy}{\sqrt{(SCx)(SCy)}} \quad [11.23]$$

This formula uses the sums of squares we've already calculated for the regression coefficients!
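
The decomposition and both forms of $r$ can be checked numerically. The sketch below computes $SCT$, $SCR$, and $SCE$ for the Hop Scotch data using unrounded coefficients, confirms that $SCT = SCR + SCE$, and compares the conceptual form [11.22] with the computational form [11.23]. Variable names are illustrative.

```python
import numpy as np

# Data from Table 11.2
x = np.array([10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12], dtype=float)
y = np.array([15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16], dtype=float)

sc_x = ((x - x.mean()) ** 2).sum()
sc_y = ((y - y.mean()) ** 2).sum()
sc_xy = ((x - x.mean()) * (y - y.mean())).sum()

# Fitted values from the unrounded OLS coefficients
b1 = sc_xy / sc_x
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# Sums of squares [11.19]-[11.21]
sct = ((y - y.mean()) ** 2).sum()       # total
scr = ((y_hat - y.mean()) ** 2).sum()   # explained
sce = ((y - y_hat) ** 2).sum()          # unexplained

# Conceptual form [11.22]: magnitude from the ratio, sign from the slope
r_conceptual = np.sign(b1) * np.sqrt(scr / sct)

# Computational form [11.23]
r_computational = sc_xy / np.sqrt(sc_x * sc_y)

print(f"SCT = {sct:.4f}, SCR = {scr:.4f}, SCE = {sce:.4f}")
print(f"SCR + SCE = {scr + sce:.4f}")
print(f"r (conceptual)    = {r_conceptual:.4f}")
print(f"r (computational) = {r_computational:.4f}")
```

NumPy's `np.corrcoef(x, y)[0, 1]` should return the same value and offers a quick independent check.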
### Application to Hop Scotch Airlines

For the Hop Scotch example:

- $SCxy = 148.933333$
- $SCx = 137.733333$
- $SCy = 171.733333$

$$
\begin{aligned}
r &= \frac{SCxy}{\sqrt{(SCx)(SCy)}} \\
&= \frac{148.933333}{\sqrt{(137.733333)(171.733333)}} \\
&= \frac{148.933333}{\sqrt{23,653.40}} \\
&= \frac{148.933333}{153.797} \\
&\approx 0.9683
\end{aligned}
$$

**Interpretation**: $r = 0.9683$ indicates a **very strong positive linear relationship** between advertising expenditures and passenger volume: months with higher advertising spending tend to have correspondingly higher passenger counts.

## 11.9 Coefficient of Determination ($r^2$)

Perhaps the **most important measure of goodness of fit** is the **coefficient of determination**, denoted $r^2$.

### Definition and Calculation

**Coefficient of Determination (Conceptual)**

$$r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SCR}{SCT} \quad [11.24]$$

**Computational Formula**

$$r^2 = \frac{(SCxy)^2}{(SCx)(SCy)} \quad [11.25]$$

Or simply:

$$r^2 = (r)^2$$

Square the correlation coefficient!

### Application to Hop Scotch Airlines

**Method 1: Using the formula**

$$
\begin{aligned}
r^2 &= \frac{(SCxy)^2}{(SCx)(SCy)} \\
&= \frac{(148.933333)^2}{(137.733333)(171.733333)} \\
&= \frac{22,181.14}{23,653.40} \\
&= 0.93776 \approx 0.938
\end{aligned}
$$

**Method 2: Squaring the correlation coefficient**

$$r^2 = (0.9683)^2 = 0.9376 \approx 0.938$$

Both methods give us $r^2 = 0.938$ or **93.8%**.

### Interpreting $r^2$

The coefficient of determination reveals **what percentage of the variation in $Y$ is explained by the variation in $X$**.

**For Hop Scotch Airlines**:

::: {.callout-important icon="📊"}
## Business Interpretation of $r^2 = 0.938$

**93.8% of the variation in passenger volume is explained by advertising expenditure.**

This means:

- **93.8%** of the variation in passenger numbers is accounted for by the linear relationship with advertising spending
- **6.2%** of passenger variation is due to other factors (e.g., seasonality, competitors, economic conditions, random variation)
- The regression model captures **nearly all** the systematic variation in the data
:::

This high $r^2$ indicates an **excellent fit**. The model is highly effective for prediction and decision-making.

### Important Cautions About $r^2$

::: {.callout-warning icon="⚠️"}
## Critical Limitations of $r^2$

1. **Linear Relationships Only**: $r^2$ measures only **linear** relationships. Two variables can have $r^2 = 0$ yet still be strongly related in a curvilinear way.
2. **Correlation ≠ Causation**: The 93.8% does **NOT** mean advertising *causes* 93.8% of the passenger changes. It only means they are associated.
3. **Context Dependent**: What constitutes a "good" $r^2$ varies by field:
    - Physical sciences: Often expect $r^2 > 0.95$
    - Social sciences: $r^2 > 0.60$ may be considered strong
    - Business/economics: $r^2 > 0.70$ typically indicates good fit
4. **Sample Size Matters**: With very large samples, even weak relationships can yield statistically significant $r^2$ values.
:::

### Computer Output Verification

From the Python output shown earlier:

```
s = 0.9068    R-sq = 93.8%    R-sq(adj) = 93.3%
```

The computer-calculated $r^2 = 93.8\%$ matches our hand calculation.

**Note**: "R-sq(adj)" is the **adjusted** $r^2$, which adjusts for the number of predictors in the model. This becomes important in multiple regression (Chapter 12).

## Section Exercises

::: {#exr-r2-goodness-fit}
## Coefficient of Determination as Goodness of Fit

How can the coefficient of determination be used as a measure of goodness of fit?
Create a graph to illustrate your explanation.
:::

::: {#exr-correlation-gpa-jobs}
## GPA and Job Offers Correlation

What is the strength of the relationship between GPA and job offers in Exercise 9? Calculate and interpret both $r$ and $r^2$.
:::

::: {#exr-correlation-income-consumption}
## Income and Consumption Correlation

Calculate and interpret the correlation coefficient and coefficient of determination for the Florida Department of Human Resources data in Exercise 10.
:::

::: {#exr-r2-housing-interest}
## Housing Sales Explained by Interest Rates

How much of the change in houses sold can be explained by the interest rate in Exercise 11? What does this tell you about the strength of the relationship?
:::

::: {#exr-correlation-distance-absences}
## Professor Mundane's Model Strength

What is the strength of Professor Mundane's model used in Exercise 13 to explain student absences? How much of the variation in absences is explained by distance from campus?
:::

## Section Summary

::: {.callout-tip icon="📚"}
## Key Takeaways: Model Evaluation

**Standard Error of Estimation ($S_e$)**:

- Measures **typical prediction error** in the same units as $Y$
- Smaller $S_e$ indicates better fit
- Can be interpreted using the empirical rule (68%-95%-99.7%)
- Provides absolute measure of scatter around regression line

**Correlation Coefficient ($r$)**:

- Measures **strength and direction** of linear relationship
- Range: $-1 \leq r \leq +1$
- Sign indicates direction (positive/negative)
- Magnitude indicates strength (closer to ±1 is stronger)

**Coefficient of Determination ($r^2$)**:

- **Most important goodness-of-fit measure**
- Percentage of variation in $Y$ explained by $X$
- Range: $0 \leq r^2 \leq 1$ (often expressed as percentage)
- Higher $r^2$ means better predictive power
- **Remember**: Correlation does NOT imply causation!

**For Hop Scotch Airlines**:

- $S_e = 0.907$ (typical error of 907 passengers)
- $r = 0.9683$ (very strong positive relationship)
- $r^2 = 0.938$ (93.8% of passenger variation explained)
- **Conclusion**: Excellent model for business decision-making
:::

## 11.10 Hypothesis Testing for Regression Parameters

The statistical results suggest a relationship between passengers and advertising for Hop Scotch Airlines. The non-zero values for the regression coefficient ($b_1 = 1.08$) and correlation coefficient ($r = 0.968$) indicate that as advertising expenditures change, the number of passengers changes.

However, these results are based on a sample of only $n = 15$ observations. As always, we must ask: **Does a relationship exist at the population level?**

It could be that due to sampling error, the population parameters are actually zero. We must test the population parameters to ensure that our sample findings differ significantly from zero.

### A. Testing the Regression Slope ($\beta_1$)

If the slope of the **true but unknown** population regression line is zero, there is no relationship between passengers and advertising, contrary to our sample results.

If we were to create a scatter diagram for the **population** of all possible $(X, Y)$ data points, it might look like the figure below, showing no pattern. When collecting our sample, we might have included only 15 observations from within a particular region that falsely suggests a positive relationship.

We must test the hypothesis:

$$
\begin{aligned}
H_0: \beta_1 &= 0 \quad \text{(No relationship)} \\
H_A: \beta_1 &\neq 0 \quad \text{(Relationship exists)}
\end{aligned}
$$

**The t-Test for the Regression Coefficient**

$$t = \frac{b_1 - \beta_1}{S_{b_1}} \quad [11.26]$$

with $n - 2$ degrees of freedom, where $S_{b_1}$ is the **standard error of the regression coefficient**, which recognizes that different samples produce different values for $b_1$.

**Standard Error of the Regression Coefficient**

$$S_{b_1} = \frac{S_e}{\sqrt{SCx}} \quad [11.27]$$

::: {.callout-note}
## Why Does $b_1$ Have Sampling Variability?

If we took different samples of $n = 15$ months and calculated a regression equation for each, we'd likely get different values for $b_0$ and $b_1$ each time. The standard error $S_{b_1}$ measures this sampling variability in the slope estimate.

If $\beta_1 = 0$ in the population, the sample values $b_1$ would be distributed around zero as shown below.
:::

### Application to Hop Scotch Airlines

**Step 1: Calculate the Standard Error**

Given:

- $S_e = 0.907$
- $SCx = 137.733333$

$$
\begin{aligned}
S_{b_1} &= \frac{S_e}{\sqrt{SCx}} \\
&= \frac{0.907}{\sqrt{137.733333}} \\
&= \frac{0.907}{11.736} \\
&\approx 0.0773
\end{aligned}
$$

Carrying unrounded values for $S_e$ gives $S_{b_1} = 0.07726$, which is the value used below.

**Step 2: Calculate the Test Statistic**

$$
\begin{aligned}
t &= \frac{b_1 - \beta_1}{S_{b_1}} \\
&= \frac{1.0813 - 0}{0.07726} \\
&= 13.995
\end{aligned}
$$

**Step 3: Determine the Critical Value**

For $\alpha = 0.05$ (5% significance level) with $n - 2 = 15 - 2 = 13$ degrees of freedom:

$$t_{0.025, 13} = \pm 2.160$$

(Two-tailed test, so we split $\alpha$ in half)

**Step 4: Make Decision**

**Decision Rule**: Do not reject $H_0$ if $t$ is between $\pm 2.160$; otherwise reject $H_0$.

**Conclusion**: Since $t = 13.995 > 2.160$, we **reject the null hypothesis**. At the 5% significance level, there appears to be a significant relationship between passengers and advertising.

This is confirmed by the Python output (shown earlier), which reports a **p-value of 0.000** (that is, below 0.0005 when rounded to three decimals), indicating extremely strong evidence against $H_0$.

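The four steps above can also be carried out in code. This is a minimal sketch, assuming the same 15 Hop Scotch observations used in the chapter's figures; the variable names are illustrative, and `scipy.stats.linregress` is used only as an independent cross-check of the hand-built slope standard error.

```python
import numpy as np
from scipy import stats

# Hop Scotch Airlines data (advertising in $1,000s, passengers in 1,000s)
advertising = np.array([10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12], dtype=float)
passengers = np.array([15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16], dtype=float)
n = advertising.size
alpha = 0.05

# Slope and intercept from the sums of squares
sc_x = np.sum((advertising - advertising.mean()) ** 2)
sc_xy = np.sum((advertising - advertising.mean()) * (passengers - passengers.mean()))
b1 = sc_xy / sc_x
b0 = passengers.mean() - b1 * advertising.mean()

# Standard error of estimation, then standard error of the slope [11.27]
residuals = passengers - (b0 + b1 * advertising)
s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))
s_b1 = s_e / np.sqrt(sc_x)

# Test statistic for H0: beta_1 = 0 [11.26], critical value, and two-tailed p-value
t_stat = (b1 - 0) / s_b1
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"S_b1 = {s_b1:.5f}, t = {t_stat:.3f}, critical t = ±{t_crit:.3f}")
print(f"p-value = {p_value:.3g}; reject H0: {abs(t_stat) > t_crit}")

# Cross-check against SciPy's built-in simple regression
check = stats.linregress(advertising, passengers)
print(f"linregress: slope = {check.slope:.4f}, stderr = {check.stderr:.5f}")
```

Computing the p-value directly shows why the earlier output displays it as 0.000: it is far smaller than three decimal places can show.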
```{python}
#| echo: false
#| label: fig-hypothesis-test-beta
#| fig-cap: "Hypothesis Test for Regression Slope β₁"

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Create figure with two panels
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Left panel: t-distribution with rejection regions
df = 13
x = np.linspace(-5, 5, 1000)
y = stats.t.pdf(x, df)

# Plot the distribution
ax1.plot(x, y, 'b-', linewidth=2, label='t-distribution (df=13)')
ax1.fill_between(x, 0, y, where=(x <= -2.160), alpha=0.3, color='red', label='Rejection Region (α/2 = 0.025)')
ax1.fill_between(x, 0, y, where=(x >= 2.160), alpha=0.3, color='red')
ax1.fill_between(x, 0, y, where=((x > -2.160) & (x < 2.160)), alpha=0.2, color='green', label='Non-rejection Region')

# Mark critical values
ax1.axvline(x=-2.160, color='red', linestyle='--', linewidth=2, alpha=0.7)
ax1.axvline(x=2.160, color='red', linestyle='--', linewidth=2, alpha=0.7)
ax1.text(-2.160, -0.01, '-2.160', ha='center', fontsize=10, fontweight='bold')
ax1.text(2.160, -0.01, '+2.160', ha='center', fontsize=10, fontweight='bold')

# Mark test statistic
ax1.axvline(x=13.995, color='darkred', linestyle='-', linewidth=3, alpha=0.9)
ax1.annotate('Test statistic\nt = 13.995', xy=(13.995, 0.05), xytext=(10, 0.15),
             fontsize=11, fontweight='bold', color='darkred',
             bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.8),
             arrowprops=dict(arrowstyle='->', lw=2, color='darkred'))

ax1.set_xlabel('t-value', fontsize=12, fontweight='bold')
ax1.set_ylabel('Probability Density', fontsize=12, fontweight='bold')
ax1.set_title('Hypothesis Test: H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0\nα = 0.05 (Two-tailed)', fontsize=13, fontweight='bold')
ax1.legend(loc='upper right', fontsize=9)
ax1.grid(alpha=0.3)
ax1.set_xlim(-5, 15)
ax1.set_ylim(-0.02, 0.4)

# Add decision box
ax1.text(0, 0.35, 'Decision: REJECT H₀\nt = 13.995 > 2.160\np-value ≈ 0.000',
         ha='center', fontsize=11, fontweight='bold',
         bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.9))

# Right panel: Sampling distribution of b₁
# If β₁ = 0, b₁ is distributed around 0 with SE = 0.07726
se_b1 = 0.07726
x_b1 = np.linspace(-0.3, 0.3, 1000)
y_b1 = stats.norm.pdf(x_b1, 0, se_b1)

ax2.plot(x_b1, y_b1, 'b-', linewidth=2, label=f'Sampling distribution\nif β₁ = 0 (SE = {se_b1:.4f})')
ax2.fill_between(x_b1, 0, y_b1, where=(x_b1 <= -2.160*se_b1), alpha=0.3, color='red')
ax2.fill_between(x_b1, 0, y_b1, where=(x_b1 >= 2.160*se_b1), alpha=0.3, color='red', label='Rejection Regions')
ax2.fill_between(x_b1, 0, y_b1, where=((x_b1 > -2.160*se_b1) & (x_b1 < 2.160*se_b1)), alpha=0.2, color='green', label='Non-rejection Region')

# Mark critical values
critical_lower = -2.160 * se_b1
critical_upper = 2.160 * se_b1
ax2.axvline(x=critical_lower, color='red', linestyle='--', linewidth=2, alpha=0.7)
ax2.axvline(x=critical_upper, color='red', linestyle='--', linewidth=2, alpha=0.7)
ax2.text(critical_lower, -0.3, f'{critical_lower:.3f}', ha='center', fontsize=9, fontweight='bold')
ax2.text(critical_upper, -0.3, f'{critical_upper:.3f}', ha='center', fontsize=9, fontweight='bold')

# Mark observed b₁ (lies beyond the right edge of the plotted range)
observed_b1 = 1.0813
ax2.axvline(x=observed_b1, color='darkred', linestyle='-', linewidth=3, alpha=0.9)
ax2.annotate(f'Observed b₁ = {observed_b1:.4f}\n(Way off the chart!)', xy=(0.25, 3), xytext=(0.15, 4),
             fontsize=10, fontweight='bold', color='darkred',
             bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.8),
             arrowprops=dict(arrowstyle='->', lw=2, color='darkred'))

ax2.set_xlabel('Sample Regression Coefficient (b₁)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Probability Density', fontsize=12, fontweight='bold')
ax2.set_title('Sampling Distribution of b₁ Under H₀: β₁ = 0\n(Our observed b₁ is extremely unlikely if H₀ were true)', fontsize=12, fontweight='bold')
ax2.legend(loc='upper right', fontsize=9)
ax2.grid(alpha=0.3)
ax2.set_xlim(-0.3, 0.3)

plt.tight_layout()
plt.show()
```

::: {.callout-important icon="🎯"}
## What If We Fail to Reject $H_0$?

If we had **not** rejected the null hypothesis, we would conclude that the sample provides insufficient evidence of a linear relationship between advertising and passengers. In that case, we would discard this model and use a different explanatory variable instead of advertising.

Since we **did reject** $H_0$, we have strong evidence that the relationship is real (not due to chance), and advertising is a useful predictor of passenger volume.
:::

### Confidence Interval for $\beta_1$

Since we've rejected the hypothesis that $\beta_1 = 0$, the natural question is: **"What IS its value?"**

We can answer this by calculating a confidence interval for $\beta_1$:

**Confidence Interval for the Regression Slope**

$$\text{C.I. for } \beta_1 = b_1 \pm t_{\alpha/2, n-2} \cdot S_{b_1} \quad [11.28]$$

**For a 95% Confidence Level**:

$$
\begin{aligned}
\text{C.I. for } \beta_1 &= 1.08 \pm (2.160)(0.07726) \\
&= 1.08 \pm 0.167 \\
0.913 &\leq \beta_1 \leq 1.247
\end{aligned}
$$

**Interpretation**: We can be **95% confident** that the true population regression coefficient is between 0.913 and 1.247. This means each $1,000 increase in advertising expenditure is associated with an increase in passenger volume of between 913 and 1,247 passengers.

### B. Testing the Correlation Coefficient ($\rho$)

Much of the work done for testing the regression coefficient can be applied to testing the correlation coefficient. The purpose and rationale are very similar.

Since our correlation analysis is based on sample data, sampling error might lead to inappropriate conclusions. The sample data produced a non-zero correlation coefficient of $r = 0.9683$, but this could be due to sampling error.
Perhaps the **population correlation** is actually zero, and a misleading sample caused us to assume a relationship incorrectly.

Therefore, we must test:

$$
\begin{aligned}
H_0: \rho &= 0 \quad \text{(No correlation in population)} \\
H_A: \rho &\neq 0 \quad \text{(Correlation exists in population)}
\end{aligned}
$$

where $\rho$ (Greek letter rho) is the population correlation coefficient.

**The t-Test for the Correlation Coefficient**

$$t = \frac{r - \rho}{S_r} \quad [11.29]$$

where $S_r$ is the standard error of the correlation coefficient.

**Standard Error of the Correlation Coefficient**

$$S_r = \sqrt{\frac{1 - r^2}{n - 2}} \quad [11.30]$$

### Application to Hop Scotch Airlines

**Step 1: Calculate the Standard Error**

$$
\begin{aligned}
S_r &= \sqrt{\frac{1 - r^2}{n - 2}} \\
&= \sqrt{\frac{1 - 0.93776}{15 - 2}} \\
&= \sqrt{\frac{0.06224}{13}} \\
&= \sqrt{0.004787} \\
&= 0.0692
\end{aligned}
$$

**Step 2: Calculate the Test Statistic**

$$
\begin{aligned}
t &= \frac{r - \rho}{S_r} \\
&= \frac{0.9683 - 0}{0.0692} \\
&\approx 13.995
\end{aligned}
$$

**Step 3: Decision**

Using $\alpha = 0.05$ with $df = 13$:

Critical value $t_{0.025, 13} = \pm 2.160$

**Decision Rule**: Do not reject if $t$ is between $\pm 2.160$; otherwise reject.

**Conclusion**: Since $t = 13.995 > 2.160$, we **reject the null hypothesis**. At the 5% significance level, we conclude that the population correlation coefficient is not zero: passengers and advertising are significantly correlated.

::: {.callout-tip icon="💡"}
## Important Observation: $t$ Values Are Identical!

Notice that the $t$-value of **13.995** is **the same** for both:

- Testing $\beta_1 = 0$
- Testing $\rho = 0$

This is **not a coincidence**. In **simple linear regression** (one predictor), these two tests are mathematically equivalent. They will **always** give identical results.

**However**, in multiple regression (Chapter 12), this equivalence does not hold. That's why it's important to become familiar with both tests.
:::

### Equivalence of the Two Tests

For simple linear regression:

$$F = t^2$$

Also:

- Testing $H_0: \beta_1 = 0$ is equivalent to testing $H_0: \rho = 0$
- Both test whether there's a significant **linear** relationship between $X$ and $Y$

## 11.11 Confidence and Prediction Intervals

Regression analysis can forecast and predict values for the dependent variable. Once we've determined the regression equation, we can develop a **point estimate** for the dependent variable by substituting a given value for $X$ and solving for $Y$.

However, as we've seen throughout this textbook, **interval estimates** are often preferred over point estimates. There are at least two types of interval estimates commonly used in regression:

1. **Confidence Interval for the Mean** $\mu_{Y|X}$: The average value of $Y$ for all cases where $X$ equals a specific value
2. **Prediction Interval for an Individual** $Y_X$: A single value of $Y$ when $X$ equals a specific value

### A. Confidence Interval for the Mean of $Y$ Conditional on $X$

Suppose we want to develop an interval estimate for the **conditional mean** of $Y$, denoted $\mu_{Y|X}$. This is the population mean for all values of $Y$, given that $X$ equals a specific value.

**Example**: If we set $X = 10$ (advertising = $10,000) many times, we'd obtain many different values of $Y$ (passenger counts). The interval we're calculating estimates the **mean** of all those $Y$ values.

**Two Interpretations** (for a 95% confidence interval):

1. **First interpretation**: If we set $X$ equal to the same amount many times, we'd obtain many different values of $Y$. We can be 95% confident that the **mean** of those $Y$ values ($\mu_{Y|X}$) will fall within the specified interval.
2. **Second interpretation**: If we took many samples of $(X, Y)$ values and constructed a confidence interval based on each sample, 95% of them would contain the true mean value $\mu_{Y|X}$.

To calculate this interval, we must first determine $S_{\hat{Y}}$, the **standard error of the conditional mean**.

**Standard Error of the Conditional Mean**

$$S_{\hat{Y}} = S_e \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \quad [11.31]$$

where:

- $S_e$ = standard error of estimation
- $X_i$ = the given value for the independent variable
- $\bar{X}$ = mean of $X$ values in the sample
- $SCx$ = sum of squares for $X$

The confidence interval for the conditional mean is:

**Confidence Interval for $\mu_{Y|X}$**

$$\text{C.I. for } \mu_{Y|X} = \hat{Y}_i \pm t_{\alpha/2, n-2} \cdot S_{\hat{Y}} \quad [11.32]$$

where $\hat{Y}_i$ is the point estimate from the regression equation.

### Application: Hop Scotch Airlines (Confidence Interval)

**Question**: What is the **average** passenger count when advertising = $10,000?

Given:

- $X_i = 10$
- $\bar{X} = 12.47$
- $S_e = 0.907$
- $SCx = 137.733333$
- $n = 15$

**Step 1: Calculate Standard Error of the Mean**

$$
\begin{aligned}
S_{\hat{Y}} &= S_e \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \\
&= 0.907 \sqrt{\frac{1}{15} + \frac{(10 - 12.47)^2}{137.733333}} \\
&= 0.907 \sqrt{0.06667 + \frac{6.1009}{137.733333}} \\
&= 0.907 \sqrt{0.06667 + 0.04430} \\
&= 0.907 \sqrt{0.11097} \\
&= 0.907(0.3331) \\
&= 0.302
\end{aligned}
$$

**Step 2: Calculate Point Estimate**

$$\hat{Y}_i = 4.39 + 1.08(10) = 15.19$$

**Step 3: Construct 95% Confidence Interval**

For 95% confidence with $df = 13$: $t_{0.025, 13} = 2.160$

$$
\begin{aligned}
\text{C.I. for } \mu_{Y|X} &= 15.19 \pm (2.160)(0.302) \\
&= 15.19 \pm 0.652 \\
14.54 &\leq \mu_{Y|X} \leq 15.84
\end{aligned}
$$

**Interpretation**: Hop Scotch can be **95% confident** that if they spend $10,000 on advertising many times, the **average** passenger count across all those occasions will be between 14,540 and 15,840 passengers.

### B. Prediction Interval for a Single Value of $Y$

The confidence interval developed above estimates the **mean** of all $Y$ values when $X$ equals a given amount. Often, it's more useful to construct an interval for a **single value** of $Y$ obtained when $X$ is set to a given value **just once**.

**Example**: Hop Scotch might be interested in predicting the number of customers **next month** if they invest $10,000 in advertising. This differs from predicting the **average** across many months.

**Key Insight**: Individual values are **more dispersed** than means. Means tend to cluster around the center of the data, making them easier to predict. Individual values scatter more widely, making them harder to predict.

Therefore, a 95% prediction interval for a **single value** of $Y$ must be **wider** than the 95% confidence interval for the conditional mean.

**Two Interpretations** (for a 95% prediction interval):

1. **First interpretation**: If we set $X$ equal to some amount just once, we'll obtain a single resulting value of $Y$. We can be 95% confident that this single value of $Y$ falls within the specified interval.
2. **Second interpretation**: If we took many samples and used each to construct a prediction interval, 95% of them would contain the true value for $Y$.

To calculate the prediction interval, we need the **standard error of the forecast**, $S_{Y_X}$.

**Standard Error of the Forecast**

$$S_{Y_X} = S_e \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \quad [11.33]$$

Notice the extra "1" under the square root compared to formula [11.31]. This accounts for the additional variability of individual values versus means.

**Prediction Interval for $Y_X$**

$$\text{P.I. for } Y_X = \hat{Y}_i \pm t_{\alpha/2, n-2} \cdot S_{Y_X} \quad [11.34]$$

### Application: Hop Scotch Airlines (Prediction Interval)

**Question**: What will passenger count be **next month** if advertising = $10,000?

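Both interval questions posed for $X = 10$ can also be answered numerically. The sketch below is one possible implementation of formulas [11.31] through [11.34]; the function name `regression_intervals` and its arguments are hypothetical, chosen only for illustration, and the data are the 15 Hop Scotch observations used in the chapter's figures.

```python
import numpy as np
from scipy import stats

def regression_intervals(x, y, x_new, confidence=0.95):
    """Return (point estimate, confidence interval, prediction interval) at x_new.

    Illustrative helper for simple linear regression; not a library function.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_bar = x.mean()

    # OLS fit from the sums of squares
    sc_x = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y.mean())) / sc_x
    b0 = y.mean() - b1 * x_bar

    # Standard error of estimation and critical t with n - 2 degrees of freedom
    s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 2)

    # Term shared by [11.31] and [11.33]; [11.33] adds the extra 1
    shared = 1 / n + (x_new - x_bar) ** 2 / sc_x
    y_hat = b0 + b1 * x_new
    ci_margin = t_crit * s_e * np.sqrt(shared)
    pi_margin = t_crit * s_e * np.sqrt(1 + shared)

    return (y_hat,
            (y_hat - ci_margin, y_hat + ci_margin),
            (y_hat - pi_margin, y_hat + pi_margin))

advertising = [10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12]
passengers = [15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16]

y_hat, ci, pi = regression_intervals(advertising, passengers, x_new=10)
print(f"Point estimate: {y_hat:.2f}")
print(f"95% C.I. for the mean: [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"95% P.I. for one month: [{pi[0]:.2f}, {pi[1]:.2f}]")
```

Because the function carries unrounded coefficients, its results can differ from the hand calculations in the second decimal place. The hand calculations round $b_0$, $b_1$, and $S_e$ before substituting.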
**Step 1: Calculate Standard Error of the Forecast**

$$
\begin{aligned}
S_{Y_X} &= S_e \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \\
&= 0.907 \sqrt{1 + \frac{1}{15} + \frac{(10 - 12.47)^2}{137.733333}} \\
&= 0.907 \sqrt{1 + 0.06667 + 0.04430} \\
&= 0.907 \sqrt{1.11097} \\
&= 0.907(1.054) \\
&= 0.956
\end{aligned}
$$

**Step 2: Construct 95% Prediction Interval**

$$
\begin{aligned}
\text{P.I. for } Y_X &= 15.19 \pm (2.160)(0.956) \\
&= 15.19 \pm 2.065 \\
13.13 &\leq Y_X \leq 17.26
\end{aligned}
$$

**Interpretation**: Hop Scotch can be **95% confident** that if they spend $10,000 on advertising **next month**, the passenger count for that specific month will be between 13,130 and 17,260 passengers.

### Comparing the Two Intervals

For $X = 10$:

- **Confidence interval for mean**: $[14.54, 15.84]$ (width = 1.30)
- **Prediction interval for individual**: $[13.13, 17.26]$ (width = 4.13)

The prediction interval is **much wider** because individual values are less predictable than means!

```{python}
#| echo: false
#| label: fig-confidence-prediction-bands
#| fig-cap: "Confidence Bands vs. Prediction Bands for Hop Scotch Airlines"

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Data from Table 11.2
advertising = np.array([10, 12, 8, 17, 10, 15, 10, 14, 19, 10, 11, 13, 16, 10, 12])
passengers = np.array([15, 17, 13, 23, 16, 21, 14, 20, 24, 17, 16, 18, 23, 15, 16])

# Regression parameters
b0 = 4.39
b1 = 1.08
Se = 0.907
n = 15
X_bar = np.mean(advertising)
SCx = np.sum((advertising - X_bar)**2)

# Create x values for plotting bands
x_values = np.linspace(7, 20, 100)
y_pred = b0 + b1 * x_values

# Calculate confidence and prediction bands
t_critical = 2.160  # for 95% confidence, df=13

# Arrays for bands
conf_bands_upper = []
conf_bands_lower = []
pred_bands_upper = []
pred_bands_lower = []

for x_val in x_values:
    # Confidence band
    S_y = Se * np.sqrt(1/n + (x_val - X_bar)**2 / SCx)
    margin_conf = t_critical * S_y
    conf_bands_upper.append(b0 + b1*x_val + margin_conf)
    conf_bands_lower.append(b0 + b1*x_val - margin_conf)

    # Prediction band
    S_yx = Se * np.sqrt(1 + 1/n + (x_val - X_bar)**2 / SCx)
    margin_pred = t_critical * S_yx
    pred_bands_upper.append(b0 + b1*x_val + margin_pred)
    pred_bands_lower.append(b0 + b1*x_val - margin_pred)

# Create figure
fig, ax = plt.subplots(1, 1, figsize=(12, 8))

# Plot prediction bands (wider, outer)
ax.fill_between(x_values, pred_bands_lower, pred_bands_upper, alpha=0.15, color='orange',
                label='95% Prediction Interval (individual Y)')

# Plot confidence bands (narrower, inner)
ax.fill_between(x_values, conf_bands_lower, conf_bands_upper, alpha=0.25, color='green',
                label='95% Confidence Interval (mean of Y)')

# Regression line
ax.plot(x_values, y_pred, 'b-', linewidth=3, label=f'Ŷ = {b0:.2f} + {b1:.2f}X')

# Scatter plot
ax.scatter(advertising, passengers, s=120, alpha=0.7, color='steelblue', edgecolors='black',
           linewidth=2, label='Actual Data', zorder=5)

# Highlight X=10 example
x_example = 10
y_example = b0 + b1 * x_example

# Confidence interval at X=10
S_y_10 = Se * np.sqrt(1/n + (x_example - X_bar)**2 / SCx)
conf_margin = t_critical * S_y_10
conf_lower = y_example - conf_margin
conf_upper = y_example + conf_margin

# Prediction interval at X=10
S_yx_10 = Se * np.sqrt(1 + 1/n + (x_example - X_bar)**2 / SCx)
pred_margin = t_critical * S_yx_10
pred_lower = y_example - pred_margin
pred_upper = y_example + pred_margin

# Draw vertical lines at X=10
ax.plot([x_example, x_example], [pred_lower, pred_upper], 'r-', linewidth=3, alpha=0.7)
ax.plot([x_example, x_example], [conf_lower, conf_upper], 'g-', linewidth=4, alpha=0.9)

# Mark the intervals
ax.plot(x_example, conf_lower, 'go', markersize=12, zorder=6)
ax.plot(x_example, conf_upper, 'go', markersize=12, zorder=6)
ax.plot(x_example, pred_lower, 'ro', markersize=10, zorder=6)
ax.plot(x_example, pred_upper, 'ro', markersize=10, zorder=6)
ax.plot(x_example, y_example, 'bo', markersize=14, zorder=7)

# Annotations
ax.annotate(f'Point Estimate\nŶ = {y_example:.2f}', xy=(x_example, y_example),
            xytext=(x_example + 2, y_example + 1), fontsize=10, fontweight='bold',
            bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8),
            arrowprops=dict(arrowstyle='->', lw=2, color='blue'))

ax.annotate(f'95% C.I. for μ_Y|X:\n[{conf_lower:.2f}, {conf_upper:.2f}]\nWidth = {conf_upper-conf_lower:.2f}',
            xy=(x_example, (conf_lower + conf_upper)/2), xytext=(x_example - 3.5, 13),
            fontsize=9, fontweight='bold', color='darkgreen',
            bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8),
            arrowprops=dict(arrowstyle='->', lw=2, color='green'))

ax.annotate(f'95% P.I. for Y_X:\n[{pred_lower:.2f}, {pred_upper:.2f}]\nWidth = {pred_upper-pred_lower:.2f}',
            xy=(x_example, (pred_lower + pred_upper)/2 + 1), xytext=(x_example + 3.5, 19),
            fontsize=9, fontweight='bold', color='darkred',
            bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8),
            arrowprops=dict(arrowstyle='->', lw=2, color='red'))

# Note about band width
ax.annotate('Bands widen away from X̄ = 12.47\n(less precise predictions)', xy=(18, 25),
            fontsize=10, fontweight='bold',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# Mark mean of X
ax.axvline(x=X_bar, color='purple', linestyle=':', linewidth=2, alpha=0.5)
ax.text(X_bar, 27, f'X̄ = {X_bar:.2f}', ha='center', fontsize=10, fontweight='bold', color='purple',
        bbox=dict(boxstyle='round', facecolor='lavender', alpha=0.7))

ax.set_xlabel('Advertising Expenditure (in $1,000s)', fontsize=13, fontweight='bold')
ax.set_ylabel('Number of Passengers (in 1,000s)', fontsize=13, fontweight='bold')
ax.set_title('95% Confidence Interval vs. 95% Prediction Interval\nConfidence bands (mean) are narrower than prediction bands (individual)',
             fontsize=14, fontweight='bold')
ax.legend(loc='upper left', fontsize=10)
ax.grid(alpha=0.3)
ax.set_xlim(7, 20)
ax.set_ylim(8, 28)

plt.tight_layout()
plt.show()
```

::: {.callout-important icon="📊"}
## Key Differences: Confidence vs. Prediction Intervals

| Feature | Confidence Interval | Prediction Interval |
|:--------|:-------------------|:--------------------|
| **Estimates** | Mean of $Y$ (average) | Individual $Y$ (single value) |
| **Width** | Narrower | Wider |
| **Formula** | $S_{\hat{Y}} = S_e\sqrt{\frac{1}{n} + \frac{(X_i-\bar{X})^2}{SCx}}$ | $S_{Y_X} = S_e\sqrt{1 + \frac{1}{n} + \frac{(X_i-\bar{X})^2}{SCx}}$ |
| **Use Case** | "What's the average outcome?" | "What's one specific outcome?" |
| **Example** | Average passengers across many months | Passengers next month |
| **Certainty** | More certain (means are stable) | Less certain (individuals vary) |
:::

### C. Factors Influencing Interval Width

Given a confidence level, we prefer to **minimize** the interval width. The narrower the interval, the more precise our prediction. Three factors influence width:

**1. Dispersion of Original Data ($S_e$)**
: More dispersed data → Larger $S_e$ → Wider interval

**2. Sample Size ($n$)**
: Larger sample → Smaller standard error → Narrower interval

**3. Distance from Mean ($|X_i - \bar{X}|$)**
: Farther from $\bar{X}$ → Wider interval (regression is based on means, so predictions are less reliable far from the center)

## Section Exercises

::: {#exr-hypothesis-gpa}
## GPA as Predictor of Job Offers

Using the appropriate hypothesis test at the 5% level, is GPA a significant explanatory variable for job offers in Exercise 9? Be sure to show all four hypothesis testing steps.
:::

::: {#exr-hypothesis-income}
## Income and Consumption Significance

In Exercise 10, is the relationship between income and consumption significant? Test the hypothesis at a 1% significance level.
:::

::: {#exr-hypothesis-interest}
## Interest Rate Significance (Two Tests)

In Exercise 11, is the interest rate significant at the 10% level?

a. Test the significance of the regression coefficient at 10%
b. Test the significance of the correlation coefficient at 10%
c. How do these two tests differ?
:::

::: {#exr-ci-mean-gpa}
## Confidence Interval for Average Job Offers

The career placement center at State University (Exercise 9) wants a 95% interval estimate for the **average** number of job offers that many graduates will receive if they have a GPA of 2.69. Calculate and interpret the appropriate interval.
:::

::: {#exr-pi-individual-gpa}
## Prediction Interval for Fred's Job Offers

Fred has a GPA of 2.69 (see Exercise 9 and the previous exercise). Calculate the 95% interval for the number of job offers **Fred** will receive. Why does this differ from your answer to the previous exercise?
:::

::: {#exr-ci-consumption}
## Confidence Interval for Average Consumption

If the economist at the Florida Department of Human Resources (Exercise 10) identifies many consumers with incomes of $14,200, what is the 99% interval for the **average** consumption of all those consumers?
:::

::: {#exr-pi-consumption}
## Prediction Interval for Individual Consumer

If the economist from Exercise 10 identifies **one consumer** with an income of $14,500:

a. What is the point estimate of their consumption?
b. What is the 99% interval estimate of their consumption?
:::

## Section Summary

::: {.callout-tip icon="📚"}
## Key Takeaways: Hypothesis Testing and Intervals

**Hypothesis Testing for $\beta_1$**:

- Tests whether slope is significantly different from zero
- Uses $t = \frac{b_1 - \beta_1}{S_{b_1}}$ with $df = n-2$
- If we reject $H_0: \beta_1 = 0$, the relationship is statistically significant

**Hypothesis Testing for $\rho$**:

- Tests whether correlation is significantly different from zero
- Uses $t = \frac{r - \rho}{S_r}$ with $df = n-2$
- In simple regression, gives **identical result** to testing $\beta_1$

**Confidence Interval for $\mu_{Y|X}$** (Mean):

- Estimates **average** $Y$ when $X$ equals a specific value
- Narrower interval (means are more predictable)
- Formula: $\hat{Y}_i \pm t \cdot S_{\hat{Y}}$

**Prediction Interval for $Y_X$** (Individual):

- Estimates a **single** $Y$ value when $X$ equals a specific value
- Wider interval (individuals more variable)
- Formula: $\hat{Y}_i \pm t \cdot S_{Y_X}$

**For Hop Scotch at $X = 10$**:

- Point estimate: 15.19 thousand passengers
- 95% C.I. for mean: [14.54, 15.84]
- 95% P.I. for individual: [13.13, 17.26]
- Both intervals widen as we move away from $\bar{X}$
:::

## 11.12 ANOVA for Regression

The regression model presents a description of the nature of the relationship between dependent and independent variables. We used a t-test to test the hypothesis that $\beta_1 = 0$.
A similar test can be performed using **Analysis of Variance (ANOVA)** based on the F-test. The ANOVA procedure measures the amount of variation in the regression model. As noted earlier, there are three sources of variation in a regression model:

- **Variation explained by regression** (SCR)
- **Unexplained variation due to error** (SCE)
- **Total variation** (SCT), which is the sum of the first two

This can be summarized in an ANOVA table.

### General ANOVA Table for Regression

**Table 11.5: General ANOVA Table**

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-Ratio |
|:-------------------|:--------------|:------------------|:-----------|:--------|
| Regression | SCR | $k$ | $CMR = \frac{SCR}{k}$ | $\frac{CMR}{CME}$ |
| Error | SCE | $n - k - 1$ | $CME = \frac{SCE}{n-k-1}$ | |
| Total | SCT | $n - 1$ | | |

where $k$ is the number of independent variables (predictors) in the model.

::: {.callout-note}
## For Simple Linear Regression

In simple linear regression (one predictor), $k = 1$, so:

- Regression degrees of freedom: $df_R = 1$
- Error degrees of freedom: $df_E = n - 2$
- Total degrees of freedom: $df_T = n - 1$
:::

### Interpreting the F-Ratio

The ratio $\frac{CMR}{CME}$ provides a measure of model accuracy because it compares:

- **Numerator (CMR)**: Average squared deviation **explained** by the model
- **Denominator (CME)**: Average squared deviation that remains **unexplained**

**Higher F-ratio** → **Better Model**
:   The model has significant explanatory power.

**Lower F-ratio** → **Poorer Model**
:   The model explains little more than random variation.

To determine if the F-ratio is "high enough," we compare it with a critical value from the F-distribution table.
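
One way to see why a higher F-ratio signals a better model is to rewrite it in terms of the coefficient of determination. The short derivation below is an illustrative sketch, not part of the original procedure; it assumes the standard identity $r^2 = SCR/SCT$ (so that $SCE = (1 - r^2)\,SCT$).

$$
\begin{aligned}
F &= \frac{CMR}{CME} = \frac{SCR/k}{SCE/(n-k-1)} \\
&= \frac{r^2 \cdot SCT/k}{(1-r^2) \cdot SCT/(n-k-1)} \\
&= \frac{r^2/k}{(1-r^2)/(n-k-1)}
\end{aligned}
$$

For a fixed $n$ and $k$, $F$ grows without bound as $r^2$ approaches 1 and shrinks toward 0 as $r^2$ approaches 0, which matches the "higher is better" reading above.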
### Computational Formulas

**Sum of Squares for Regression**

$$SCR = \frac{(SCxy)^2}{SCx} \quad [11.35]$$

**Sum of Squares for Error** (from formula 11.15)

$$SCE = SCy - \frac{(SCxy)^2}{SCx}$$

**Sum of Squares Total**

$$SCT = SCR + SCE$$

### Application to Hop Scotch Airlines

Using the Hop Scotch data:

**Step 1: Calculate SCR**

$$
\begin{aligned}
SCR &= \frac{(SCxy)^2}{SCx} \\
&= \frac{(148.933333)^2}{137.733333} \\
&= \frac{22,181.14}{137.733333} \\
&= 161.044
\end{aligned}
$$

**Step 2: Calculate SCE** (already computed earlier)

$$SCE = 10.689$$

**Step 3: Calculate SCT**

$$SCT = SCR + SCE = 161.044 + 10.689 = 171.733$$

**Step 4: Calculate Mean Squares**

$$CMR = \frac{SCR}{1} = \frac{161.044}{1} = 161.044$$

$$CME = \frac{SCE}{13} = \frac{10.689}{13} = 0.822$$

**Step 5: Calculate F-Ratio**

$$F = \frac{CMR}{CME} = \frac{161.044}{0.822} = 195.89$$

### ANOVA Table for Hop Scotch Airlines

**Table 11.6: ANOVA Table for Hop Scotch Airlines**

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-Ratio |
|:-------------------|:--------------|:------------------|:-----------|:--------|
| Regression | 161.04 | 1 | 161.04 | 195.89 |
| Error | 10.69 | 13 | 0.82 | |
| Total | 171.73 | 14 | | |

### Hypothesis Test Using F

**Hypotheses**:

$$
\begin{aligned}
H_0: \beta_1 &= 0 \quad \text{(Model has no explanatory power)} \\
H_A: \beta_1 &\neq 0 \quad \text{(Model is significant)}
\end{aligned}
$$

**Decision Rule** (at $\alpha = 0.05$):

Critical value: $F_{0.05, 1, 13} = 4.67$

Do not reject $H_0$ if $F \leq 4.67$; otherwise reject $H_0$.

**Conclusion**: Since $F = 195.89 > 4.67$, we **reject the null hypothesis**. At the 5% significance level, we conclude that advertising has significant explanatory power.
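
The five steps above can be reproduced in a few lines of Python. The sketch below takes $SCxy$, $SCx$, and $SCE$ from the worked example as given and infers $n = 15$ from the 14 total degrees of freedom; the variable names are illustrative, and the critical value 4.67 is the tabled value quoted above rather than something the code computes.

```python
# Sketch: ANOVA quantities for the Hop Scotch example, using only the
# summary values quoted in the text (variable names are illustrative).
sc_xy = 148.933333   # sum of cross products, SCxy
sc_x = 137.733333    # sum of squares of X, SCx
sce = 10.689         # error sum of squares, SCE (computed earlier in the chapter)
n = 15               # sample size, inferred from df_total = 14
k = 1                # one predictor in simple regression

scr = sc_xy ** 2 / sc_x      # regression sum of squares, formula 11.35
sct = scr + sce              # total sum of squares
cmr = scr / k                # mean square for regression
cme = sce / (n - k - 1)      # mean square for error
f_ratio = cmr / cme

f_critical = 4.67            # tabled F(0.05; 1, 13), taken from the text
reject_h0 = f_ratio > f_critical

print(f"SCR = {scr:.3f}, SCE = {sce:.3f}, SCT = {sct:.3f}")
print(f"CMR = {cmr:.3f}, CME = {cme:.3f}, F = {f_ratio:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

With full precision carried through, the ratio comes out near 195.86, close to the 195.89 in the table; small differences of this kind come from rounding intermediate values.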
::: {.callout-important icon="🎯"}
## Relationship Between F-test and t-test

In **simple linear regression**, the F-test and t-test are mathematically equivalent:

$$F = t^2$$

For Hop Scotch:

- $t = 13.995$
- $t^2 = (13.995)^2 = 195.86 \approx 195.89 = F$ (the small gap is due to rounding)

Both tests produce the same conclusion.

However, in **multiple regression** (Chapter 12):

- The **F-test** provides a general test of whether *any* independent variables have explanatory power
- Individual **t-tests** determine which specific variables are significant
:::

## Solved Problems

### Problem 1: Keynes' Consumption Function

In 1936, British economist **John Maynard Keynes** published his famous book, *The General Theory of Employment, Interest and Money*. Keynes proposed a theoretical relationship between income and personal consumption expenditures, arguing that as income increases, consumption increases by a smaller amount.

**Milton Friedman**, Nobel Prize-winning economist from the University of Chicago, collected data on income and consumption in the United States over an extended period. Here are 10 observations on annual consumption and income levels (in billions of current dollars):

| Year | Income (I) | Consumption (C) |
|:----:|:----------:|:--------------:|
| 1950 | 284.8 | 191.0 |
| 1951 | 328.4 | 206.3 |
| 1952 | 345.5 | 216.7 |
| 1953 | 364.6 | 230.0 |
| 1954 | 364.8 | 236.5 |
| 1955 | 398.0 | 254.4 |
| 1956 | 419.2 | 266.7 |
| 1957 | 441.1 | 281.4 |
| 1958 | 447.3 | 290.1 |
| 1959 | 483.7 | 311.2 |

**Required**: Derive a consumption function assuming a linear relationship between consumption and income.

### Solution

**a. Regression Model**

Since consumption depends on income, consumption is the dependent variable ($Y$) and income is the independent variable ($X$).
Friedman sought a consumption function of the form:

$$\hat{C} = b_0 + b_1 I$$

**Calculations**:

$$
\begin{aligned}
\sum X &= 3,877.4 \quad \sum Y = 2,484.3 \\
\sum XY &= 984,615.32 \quad \sum X^2 = 1,537,084.88 \quad \sum Y^2 = 630,869.49 \\
n &= 10
\end{aligned}
$$

**Sums of Squares and Cross Products**:

$$
\begin{aligned}
SCx &= \sum X^2 - \frac{(\sum X)^2}{n} \\
&= 1,537,084.88 - \frac{(3,877.4)^2}{10} \\
&= 1,537,084.88 - 1,503,423.076 \\
&= 33,661.804
\end{aligned}
$$

$$
\begin{aligned}
SCy &= \sum Y^2 - \frac{(\sum Y)^2}{n} \\
&= 630,869.49 - \frac{(2,484.3)^2}{10} \\
&= 630,869.49 - 617,174.649 \\
&= 13,694.841
\end{aligned}
$$

$$
\begin{aligned}
SCxy &= \sum XY - \frac{(\sum X)(\sum Y)}{n} \\
&= 984,615.32 - \frac{(3,877.4)(2,484.3)}{10} \\
&= 984,615.32 - 963,262.482 \\
&= 21,352.838
\end{aligned}
$$

**Regression Coefficients**:

$$
\begin{aligned}
b_1 &= \frac{SCxy}{SCx} \\
&= \frac{21,352.838}{33,661.804} \\
&= 0.634
\end{aligned}
$$

$$
\begin{aligned}
b_0 &= \bar{Y} - b_1\bar{X} \\
&= 248.43 - (0.634)(387.74) \\
&= 248.43 - 245.827 \\
&= 2.603
\end{aligned}
$$

**Consumption Function** (slope rounded to two decimals):

$$\hat{C} = 2.603 + 0.63I$$

**Interpretation**:

- **Slope ($b_1 = 0.63$)**: For every $1 billion increase in income, consumption increases by $0.63 billion (or $630 million). Economics students will recognize this as the **marginal propensity to consume (MPC)**.
- **Intercept ($b_0 = 2.603$)**: The estimated consumption level when income is zero ($2.603 billion). Economists often argue this interpretation isn't valid because an economic system always generates positive income.

**Example Prediction**: For 1952, when $I = 345.5$:

$$\hat{C} = 2.603 + 0.63(345.5) = 220.27$$

Actual consumption was 216.7, so the fitted line overpredicts by $3.57 billion.

**b. Coefficient of Determination**

$$
\begin{aligned}
r^2 &= \frac{(SCxy)^2}{(SCx)(SCy)} \\
&= \frac{(21,352.838)^2}{(33,661.804)(13,694.841)} \\
&= \frac{455,943,690.7}{460,993,053.6} \\
&= 0.989
\end{aligned}
$$

**Interpretation**: **98.9%** of the variation in consumption is explained by changes in income. This extremely high $r^2$ is consistent with Keynes' theory and demonstrates the strong relationship between income and consumption, which is vital information for policymakers advising Congress and the President on economic policy.

```{python}
#| echo: false
#| label: fig-consumption-function
#| fig-cap: "Keynes' Consumption Function: U.S. Income vs. Consumption (1950-1959)"

import matplotlib.pyplot as plt
import numpy as np

# Data
years = np.array([1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959])
income = np.array([284.8, 328.4, 345.5, 364.6, 364.8, 398.0, 419.2, 441.1, 447.3, 483.7])
consumption = np.array([191.0, 206.3, 216.7, 230.0, 236.5, 254.4, 266.7, 281.4, 290.1, 311.2])

# Regression line
b0 = 2.603
b1 = 0.63

# Create figure with two panels
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Left panel: Regression scatter plot
x_line = np.linspace(280, 490, 100)
y_line = b0 + b1 * x_line
ax1.scatter(income, consumption, s=120, alpha=0.7, color='darkgreen',
            edgecolors='black', linewidth=2, label='Actual Data', zorder=5)
ax1.plot(x_line, y_line, 'r-', linewidth=3,
         label=f'Ĉ = {b0:.2f} + {b1:.2f}I\n(Marginal Propensity to Consume = {b1:.2f})')

# Add year labels to points
for i, year in enumerate(years):
    ax1.annotate(f'{year}', xy=(income[i], consumption[i]), xytext=(5, 5),
                 textcoords='offset points', fontsize=8, alpha=0.7)

# Highlight 1952
idx_1952 = 2
actual_1952 = consumption[idx_1952]
pred_1952 = b0 + b1 * income[idx_1952]
ax1.plot([income[idx_1952], income[idx_1952]], [actual_1952, pred_1952],
         'b--', linewidth=2, alpha=0.7)
ax1.annotate(f'1952 Error:\n${actual_1952 - pred_1952:.2f}B',
             xy=(income[idx_1952], (actual_1952 + pred_1952)/2), xytext=(-60, 0),
             textcoords='offset points', fontsize=9, fontweight='bold',
             bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7),
             arrowprops=dict(arrowstyle='->', lw=2, color='blue'))

ax1.set_xlabel('National Income (Billions of Dollars)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Personal Consumption (Billions of Dollars)', fontsize=12, fontweight='bold')
ax1.set_title('Keynesian Consumption Function\nUnited States 1950-1959',
              fontsize=13, fontweight='bold')
ax1.legend(loc='upper left', fontsize=10)
ax1.grid(alpha=0.3)

# Add MPC interpretation box
ax1.text(300, 300,
         'MPC = 0.63\nFor every $1B increase in income,\nconsumption rises by $0.63B',
         fontsize=10, fontweight='bold',
         bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))

# Right panel: Time series
ax2.plot(years, income, 'b-o', linewidth=2, markersize=8,
         label='National Income', alpha=0.7)
ax2.plot(years, consumption, 'r-s', linewidth=2, markersize=8,
         label='Personal Consumption', alpha=0.7)

# Fill area between
ax2.fill_between(years, consumption, income, alpha=0.2, color='gray',
                 label='Savings\n(Income - Consumption)')

ax2.set_xlabel('Year', fontsize=12, fontweight='bold')
ax2.set_ylabel('Billions of Dollars', fontsize=12, fontweight='bold')
ax2.set_title('Income and Consumption Trends\n1950-1959', fontsize=13, fontweight='bold')
ax2.legend(loc='upper left', fontsize=10)
ax2.grid(alpha=0.3)
ax2.set_xticks(years)
ax2.set_xticklabels(years, rotation=45)

# Add R² annotation
ax2.text(1951, 420, f'r² = 0.989 (98.9%)\nStrong relationship confirmed',
         fontsize=10, fontweight='bold',
         bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

plt.tight_layout()
plt.show()
```

### Problem 2: Federal Reserve Discount Rate Analysis

After approximately six years of continuous expansion, the U.S. economy began showing signs of inflationary pressure in fall 1988.
The Federal Reserve attempted to cool inflation by restricting the money supply, raising the discount rate that commercial banks must pay to borrow from the Federal Reserve.

Manuel H. Johnson, Vice Chairman of the Federal Reserve, stated that Fed actions regarding the discount rate could be predicted based on the **federal funds rate** (the rate banks charge each other for overnight loans). However, Fed watchers argued that the federal funds rate was not serving as an adequate predictor, making it difficult for investors to anticipate interest rate levels.

Data from mid-1987 to mid-1988:

| Date | Federal Funds Rate (%) | Discount Rate (%) |
|:-----|:---------------------:|:----------------:|
| June 1987 | 8.0 | 7.5 |
| July 1987 | 7.5 | 7.5 |
| Aug 1987 | 7.0 | 7.0 |
| Sept 1987 | 6.5 | 6.5 |
| Oct 1987 | 6.0 | 6.0 |
| Nov 1987 | 6.0 | 5.5 |
| Dec 1987 | 7.0 | 5.5 |
| Jan 1988 | 6.0 | 5.5 |
| Feb 1988 | 7.0 | 5.5 |
| Mar 1988 | 7.5 | 5.5 |
| Apr 1988 | 7.0 | 6.0 |
| May 1988 | 7.5 | 6.5 |
| **Totals** | **83.0** | **74.5** |

### Solution

Since Johnson argued that the federal funds rate could explain discount rate behavior, the federal funds rate is the independent variable ($X$) and the discount rate is the dependent variable ($Y$).

**a. Regression and Correlation Analysis**

Given calculations:

$$
\begin{aligned}
\sum X &= 83.0 \quad \sum Y = 74.5 \quad n = 12 \\
\sum XY &= 518.5 \quad \sum X^2 = 579 \quad \sum Y^2 = 469.25 \\
\bar{X} &= 6.92 \quad \bar{Y} = 6.21
\end{aligned}
$$

$$
\begin{aligned}
SCx &= 4.9167 \quad SCy = 6.7292 \quad SCxy = 3.2083 \\
b_1 &= 0.6525 \quad b_0 = 1.6949
\end{aligned}
$$

**Regression Model**:

$$\hat{Y} = 1.69 + 0.653X$$

**Coefficient of Determination**:

$$
\begin{aligned}
r^2 &= \frac{(SCxy)^2}{(SCx)(SCy)} \\
&= \frac{(3.2083)^2}{(4.9167)(6.7292)} \\
&= 0.3111 \text{ or } 31.11\%
\end{aligned}
$$

$$r = \sqrt{0.3111} = 0.56$$

**Conclusion**: The Fed watchers are correct in their criticism. **Only 31% of the changes in the discount rate are explained by changes in the federal funds rate**.
This is a weak relationship, not suitable for reliable prediction.

**b. Standard Error of Estimation**

$$
\begin{aligned}
SCE &= SCy - \frac{(SCxy)^2}{SCx} \\
&= 6.7292 - \frac{(3.2083)^2}{4.9167} \\
&= 6.7292 - 2.0935 \\
&= 4.6357
\end{aligned}
$$

$$CME = \frac{4.6357}{10} = 0.4636$$

$$S_e = \sqrt{0.4636} = 0.6809$$

**Interpretation**: Typically, the estimate of the discount rate is in error by **0.68 percentage points**, a substantial margin for financial planning.

**c. Testing Correlation Significance** (95% confidence, $df = 10$)

$$
\begin{aligned}
H_0&: \rho = 0 \\
H_A&: \rho \neq 0
\end{aligned}
$$

Critical value: $t_{0.025, 10} = \pm 2.228$

$$
\begin{aligned}
t &= \frac{r}{S_r} = \frac{r}{\sqrt{(1-r^2)/(n-2)}} \\
&= \frac{0.56}{\sqrt{(1-0.31)/10}} \\
&= \frac{0.56}{0.2627} \\
&= 2.13
\end{aligned}
$$

**Decision**: Since $t = 2.13 < 2.228$, we **cannot reject** $H_0$. Despite the sample showing a positive relationship, we cannot reject the hypothesis of zero correlation at the 5% significance level.

**d. Testing Regression Slope Significance** (99% confidence, $df = 10$)

$$
\begin{aligned}
H_0&: \beta_1 = 0 \\
H_A&: \beta_1 \neq 0
\end{aligned}
$$

Critical value: $t_{0.005, 10} = \pm 3.169$

$$
\begin{aligned}
S_{b_1} &= \frac{S_e}{\sqrt{SCx}} = \frac{0.681}{\sqrt{4.9167}} = 0.307 \\
t &= \frac{b_1}{S_{b_1}} = \frac{0.6525}{0.307} = 2.125
\end{aligned}
$$

**Decision**: Since $t = 2.125 < 3.169$, we **cannot reject** $H_0$. The value of $b_1$ is not significantly different from zero at the 1% level. (The test statistic matches part c up to rounding, as expected in simple regression.)

**Conclusion**: There is little to no confidence in the federal funds rate as a predictor of the discount rate. It would be imprudent for investors to rely on federal funds as an indicator of discount rate behavior.
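
The calculations in this problem can be checked directly from the data table. The sketch below is a plain-Python illustration (the function name `simple_regression` and the dictionary keys are hypothetical, not from the text); it recomputes the sums of squares, the coefficients, $r^2$, $S_e$, and the slope $t$ statistic at full precision, so results may differ from the hand calculation in the third decimal. The critical values are the tabled ones quoted above.

```python
import math

# Federal funds rate (X) and discount rate (Y), June 1987 to May 1988,
# copied from the table in Problem 2.
fed_funds = [8.0, 7.5, 7.0, 6.5, 6.0, 6.0, 7.0, 6.0, 7.0, 7.5, 7.0, 7.5]
discount = [7.5, 7.5, 7.0, 6.5, 6.0, 5.5, 5.5, 5.5, 5.5, 5.5, 6.0, 6.5]


def simple_regression(x, y):
    """Return OLS summary statistics using the chapter's computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sc_x = sum(v * v for v in x) - sum_x ** 2 / n                  # formula 11.9
    sc_y = sum(v * v for v in y) - sum_y ** 2 / n                  # formula 11.10
    sc_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n   # formula 11.11
    b1 = sc_xy / sc_x                                              # formula 11.12
    b0 = sum_y / n - b1 * sum_x / n                                # formula 11.13
    r2 = sc_xy ** 2 / (sc_x * sc_y)                                # formula 11.25
    sce = sc_y - sc_xy ** 2 / sc_x                                 # formula 11.15
    se = math.sqrt(sce / (n - 2))                                  # formulas 11.16, 11.17
    t_slope = b1 / (se / math.sqrt(sc_x))                          # formulas 11.26, 11.27
    return {"n": n, "b0": b0, "b1": b1, "r2": r2, "sce": sce, "se": se, "t": t_slope}


fit = simple_regression(fed_funds, discount)
for name, value in fit.items():
    print(f"{name:>3} = {value:.4f}")

# Tabled two-tailed critical values for df = 10, taken from the text.
for alpha, t_crit in [(0.05, 2.228), (0.01, 3.169)]:
    verdict = "reject H0" if abs(fit["t"]) > t_crit else "cannot reject H0"
    print(f"alpha = {alpha}: |t| = {abs(fit['t']):.3f} vs {t_crit} -> {verdict}")
```

Because the slope test and the correlation test are the same test in simple regression, a single $t$ value serves for both parts (c) and (d); only the critical value changes with the significance level.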
## Formula Summary

::: {.callout-note icon="📐"}
## Essential Regression Formulas

**Basic Regression Line**:

$$\hat{Y} = b_0 + b_1X \quad [11.3]$$

**Sums of Squares**:

$$SCx = \sum X^2 - \frac{(\sum X)^2}{n} \quad [11.9]$$

$$SCy = \sum Y^2 - \frac{(\sum Y)^2}{n} \quad [11.10]$$

$$SCxy = \sum XY - \frac{(\sum X)(\sum Y)}{n} \quad [11.11]$$

**Regression Coefficients**:

$$b_1 = \frac{SCxy}{SCx} \quad [11.12]$$

$$b_0 = \bar{Y} - b_1\bar{X} \quad [11.13]$$

**Standard Error of Estimation**:

$$S_e = \sqrt{\frac{\sum(Y_i - \hat{Y}_i)^2}{n-2}} \quad [11.14]$$

$$SCE = SCy - \frac{(SCxy)^2}{SCx} \quad [11.15]$$

$$CME = \frac{SCE}{n-2} \quad [11.16]$$

$$S_e = \sqrt{CME} \quad [11.17]$$

**Correlation and Determination**:

$$r = \frac{SCxy}{\sqrt{(SCx)(SCy)}} \quad [11.23]$$

$$r^2 = \frac{(SCxy)^2}{(SCx)(SCy)} \quad [11.25]$$

**Hypothesis Tests**:

$$t = \frac{b_1 - \beta_1}{S_{b_1}} \quad [11.26]$$

$$S_{b_1} = \frac{S_e}{\sqrt{SCx}} \quad [11.27]$$

$$t = \frac{r - \rho}{S_r} \quad [11.29]$$

$$S_r = \sqrt{\frac{1-r^2}{n-2}} \quad [11.30]$$

**Confidence and Prediction Intervals**:

$$S_{\hat{Y}} = S_e\sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \quad [11.31]$$

$$\text{C.I. for } \mu_{Y|X} = \hat{Y}_i \pm t \cdot S_{\hat{Y}} \quad [11.32]$$

$$S_{Y_X} = S_e\sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SCx}} \quad [11.33]$$

$$\text{P.I. for } Y_X = \hat{Y}_i \pm t \cdot S_{Y_X} \quad [11.34]$$

**ANOVA**:

$$SCR = \frac{(SCxy)^2}{SCx} \quad [11.35]$$

$$SCT = SCR + SCE$$

$$F = \frac{CMR}{CME}$$
:::

## Chapter Summary

This chapter introduced **simple linear regression and correlation analysis**, two of the most powerful statistical tools for understanding relationships between variables.

### Key Concepts Mastered

**1. Regression Analysis**

- **Purpose**: Quantify the relationship between dependent ($Y$) and independent ($X$) variables
- **Regression line**: $\hat{Y} = b_0 + b_1X$
- **Slope ($b_1$)**: Change in $Y$ for each one-unit change in $X$
- **Intercept ($b_0$)**: Estimated value of $Y$ when $X = 0$
- **Method**: Ordinary Least Squares (OLS) minimizes $\sum(Y_i - \hat{Y}_i)^2$

**2. Model Evaluation**

- **Standard Error ($S_e$)**: Typical prediction error (in units of $Y$)
- **Correlation coefficient ($r$)**: Strength and direction of linear relationship ($-1 \leq r \leq +1$)
- **Coefficient of determination ($r^2$)**: Percentage of variation in $Y$ explained by $X$
- **ANOVA F-test**: Overall significance of the regression model

**3. Statistical Inference**

- **Hypothesis tests**: Determine if $\beta_1$ and $\rho$ are significantly different from zero
- **Confidence intervals**: Estimate population parameters ($\beta_1$, $\mu_{Y|X}$)
- **Prediction intervals**: Predict individual future values of $Y$

### Critical Distinctions

| Concept | Purpose | Formula | Example ($X = 10$) |
|:--------|:--------|:--------|:--------------|
| **Point Estimate** | Single prediction | $\hat{Y} = b_0 + b_1X$ | 15.19 |
| **Confidence Interval** | Estimate **mean** of $Y$ | $\hat{Y}_i \pm t \cdot S_{\hat{Y}}$ (narrower) | [14.54, 15.84] |
| **Prediction Interval** | Estimate **individual** $Y$ | $\hat{Y}_i \pm t \cdot S_{Y_X}$ (wider) | [13.13, 17.26] |

### Important Cautions

::: {.callout-warning icon="⚠️"}
## Limitations and Warnings

1. **Correlation ≠ Causation**: High correlation doesn't prove one variable causes another
2. **Linear relationships only**: $r$ and $r^2$ only measure linear associations
3. **Extrapolation danger**: Don't predict beyond the range of observed $X$ values
4. **Outlier sensitivity**: Regression is sensitive to extreme values
5. **Spurious correlation**: Unrelated variables can show correlation by chance
:::

### Decision-Making Applications

Regression and correlation analysis enable managers to:

- **Forecast**: Predict sales, costs, demand based on drivers
- **Allocate resources**: Optimize spending on advertising, production
- **Evaluate performance**: Measure relationships between inputs and outputs
- **Test theories**: Validate economic and business hypotheses with data
- **Quantify risk**: Understand variability in predictions (via $S_e$ and intervals)

### Closing the Cola Wars Scenario

Returning to our opening scenario: Coca-Cola and Pepsi can now use regression to:

- Model **sales vs. temperature** to optimize production and distribution
- Analyze **advertising vs. market share** to maximize ROI
- Predict **consumption vs. price** for pricing strategies
- Forecast **seasonal demand** patterns for inventory management

With the tools from this chapter, beverage industry managers can make data-driven decisions worth millions of dollars in a $27 billion market.

### Looking Ahead

Chapter 12 extends these concepts to **multiple regression**, where we can use several independent variables simultaneously to predict $Y$. This provides even more powerful and realistic models for complex business environments.

---

**Congratulations!** You've mastered simple linear regression and correlation, essential tools that form the foundation for advanced statistical modeling and data science applications in business.
::: {.callout-tip icon="🎯"}
## Chapter 11 Mastery Checklist

You should now be able to:

- ✅ Distinguish between dependent and independent variables
- ✅ Create and interpret scatter diagrams
- ✅ Calculate regression coefficients using OLS
- ✅ Interpret slope and intercept in business context
- ✅ Compute and interpret $S_e$, $r$, and $r^2$
- ✅ Test hypotheses about $\beta_1$ and $\rho$
- ✅ Construct confidence intervals for means
- ✅ Construct prediction intervals for individuals
- ✅ Perform ANOVA for regression
- ✅ Recognize limitations and avoid common pitfalls
:::