Outline of Steps in Multiple Regression¶
- Specify the Model
- Fit the Model
- Inferences
- Assumptions
- Use the Model
Key Reminder: You generally need at least one numerical IV for multiple linear regression.
- If the DV is numeric and all predictors are categorical, typically use an ANOVA framework.
- If the DV is binary, typically use logistic regression, which can handle both numeric and categorical IVs.
1. Specify the Model¶
Selecting the Best Model¶
- Keep it simple — avoid overfitting.
- Maximize \(R^2\).
- Minimize SER (Residual Standard Error) — a smaller SER indicates less error on average.
- Use significant predictors — statistical significance suggests they're meaningful.
- Maintain logical relationships — the model should make theoretical and practical sense.
- Check residual assumptions — residuals should meet normality, homoscedasticity, etc.
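A quick sketch of the first few criteria in practice, using R's built-in `mtcars` data purely for illustration:

```r
# Compare two candidate models on the built-in mtcars data (illustrative only)
m1 <- lm(mpg ~ wt, data = mtcars)       # simpler model
m2 <- lm(mpg ~ wt + hp, data = mtcars)  # adds one predictor

s1 <- summary(m1)
s2 <- summary(m2)

c(R2 = s1$r.squared, SER = s1$sigma)
c(R2 = s2$r.squared, SER = s2$sigma)    # higher R^2, lower SER here

# Keep the extra predictor only if it is statistically significant
s2$coefficients["hp", "Pr(>|t|)"] < 0.05
```

Here the added predictor earns its place: \(R^2\) rises, SER falls, and its coefficient is significant. A good model should satisfy all the criteria above, not just one.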
Process Model:
\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon\)
- \(\beta_0\): Intercept (the expected value of \(Y\) when all predictors = 0).
  - Only interpretable if \(X = 0\) makes sense in your context or is near the observed range.
- \(\beta_1, \beta_2, \dots\): Mean change in \(Y\) for a one-unit change in each \(X\), holding the other predictors constant.

Here, \(\varepsilon\) represents random error in the real-world process. (It is still considered a random variable.)
Patterns to Recognize¶
- Log-like patterns may suggest transformations (e.g., log, exponential, power).
- Convex vs. concave shapes in a plot of \(Y\) vs. \(X\) might indicate the need for polynomial terms or log transformations.
  - Convex (e.g., an exponential pattern)
  - Concave (e.g., the manual-laborer earnings example)
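A minimal simulated sketch of the convex case — all numbers and names here are invented for illustration. Data generated from an exponential process fit better after a log transformation:

```r
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- exp(0.4 * x) * exp(rnorm(50, sd = 0.2))  # convex, exponential-looking pattern

fit_lin <- lm(y ~ x)        # a straight line struggles with the curvature
fit_log <- lm(log(y) ~ x)   # the log transform straightens the relationship

summary(fit_lin)$r.squared
summary(fit_log)$r.squared  # noticeably higher after transforming
```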
2. Fit the Model¶
Goal: Minimize the sum of squared errors (SSE) to find the best-fitting regression line or hyperplane.
| Model | Definition | Coefficients | Formula |
|---|---|---|---|
| Process | Describes relationships in the real world (conceptual) | Unknown (Parameters) | \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon\) |
| Fitted | Describes how the model operates on sample data (empirical) | Known (Statistics) | \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots\) |
Mean Point Property¶
The least squares regression line in simple linear regression always passes through \((\bar{X}, \bar{Y})\). In multiple regression, this concept generalizes to the idea that the fitted hyperplane goes through the means of all variables involved.
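This property can be verified numerically; a small sketch using the built-in `mtcars` data (illustrative only):

```r
model <- lm(mpg ~ wt + hp, data = mtcars)

# Predict at the mean of every predictor
at_means <- data.frame(wt = mean(mtcars$wt), hp = mean(mtcars$hp))

# The fitted hyperplane passes through the point of means:
all.equal(unname(predict(model, newdata = at_means)), mean(mtcars$mpg))
```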
3. Inferences (Testing)¶
Hypothesis Testing¶
- Null Hypothesis \((H_0)\): No relationship (all \(\beta\)s = 0).
- Alternative Hypothesis \((H_a)\): At least one \(\beta \neq 0\).

Variation Decomposition:

- Explained Variation: Variation due to the model (the fitted line/hyperplane).
- Unexplained Variation: Residuals/errors.
In software like R:

- `summary(model)` interprets each \(\beta\) controlling for all other predictors.
- `anova(model)` interprets each \(\beta\) in sequence, depending on the order of predictors.
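The distinction is easy to see by reordering predictors (`mtcars` used for illustration): the partial tests behind `summary()` are order-invariant, while `anova()`'s sequential sums of squares are not:

```r
m_ab <- lm(mpg ~ wt + hp, data = mtcars)
m_ba <- lm(mpg ~ hp + wt, data = mtcars)

# summary(): each coefficient controls for the others, so order is irrelevant
all.equal(coef(m_ab)["hp"], coef(m_ba)["hp"])

# anova(): sequential sums of squares depend on entry order, so the same
# predictor can receive very different sums of squares
anova(m_ab)["hp", "Sum Sq"]  # hp entered after wt
anova(m_ba)["hp", "Sum Sq"]  # hp entered first
```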
ANOVA Components¶
This table shows how Analysis of Variance partitions the variability in \(Y\). It also provides metrics to evaluate model quality.
| Category | Component | Term | Formula | Explanation |
|---|---|---|---|---|
| Model Structure | Degrees of Freedom (DF) | Regression | Number of predictors | Number of independent variables in the model. |
| Model Structure | Degrees of Freedom (DF) | Error | \(\text{observations} - \text{predictors} - 1\) | Residual (unexplained) degrees of freedom. |
| Model Structure | Degrees of Freedom (DF) | Total | \(\text{observations} - 1\) | Total degrees of freedom in the dataset. |
| Model Structure | Sum of Squares (SS) | Regression | Explained variation | Portion of total variability explained by the model. |
| Model Structure | Sum of Squares (SS) | Error | Unexplained variation | Portion of total variability not explained by the model. |
| Model Structure | Sum of Squares (SS) | Total | \(\text{SS Regression} + \text{SS Error}\) | Total variability in the dataset. |
| Model Evaluation | Mean Squares (MS) | MS Regression | \(\text{SS Regression} / \text{DF Regression}\) | Average explained variability per degree of freedom. |
| Model Evaluation | Mean Squares (MS) | MS Error | \(\text{SS Error} / \text{DF Error}\) | Average unexplained variability per degree of freedom. |
| Model Evaluation | F-statistic | | \(F = \frac{\text{MS Regression}}{\text{MS Error}}\) | Compares explained vs. unexplained variability; used to test overall model significance. |
| Model Fit Statistics | Standard Error (SER) | | \(s = \sqrt{\text{MS Error}}\) | Standard deviation of the residuals (errors). |
| Model Fit Statistics | \(R^2\) | | \(R^2 = 1 - \frac{\text{SS Error}}{\text{SS Total}}\) | Proportion of the variation in \(Y\) explained by the model. |
| Model Fit Statistics | \(R\) | | \(\sqrt{R^2}\) (or \(\pm \sqrt{R^2}\) for direction) | Strength (and possibly direction) of the linear relationship. |
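These relationships can be checked directly from R's `anova()` output (again with `mtcars` as stand-in data); the \(R^2\) and SER derived from the table match what `summary()` reports:

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
a <- anova(model)

ss_reg <- sum(a$`Sum Sq`[1:2])   # explained variation (wt and hp rows)
ss_err <- a$`Sum Sq`[3]          # residual (error) row
ss_tot <- ss_reg + ss_err        # total variation

R2  <- 1 - ss_err / ss_tot       # proportion of variation explained
SER <- sqrt(ss_err / a$Df[3])    # sqrt(MS Error)

all.equal(R2, summary(model)$r.squared)
all.equal(SER, summary(model)$sigma)
```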
4. Assumptions¶
- Random/Representative Sampling — data should be sampled in a way that is representative of the population.
- Stability Over Time — no major changes in the relationships during the data collection period.
- Errors Normally Distributed — the error term \(\varepsilon\) is normally distributed with mean 0 and constant variance (homoscedasticity).
Standardized Residuals are used to check normality and to identify potential outliers. Calculated as:
- Ordinary Residual: the difference between the actual value (\(y_i\)) and the prediction (\(\hat{y}_i\)).
- Standardized Residual: divide the ordinary residual by the estimated standard error of ALL residuals, converting it to a standard scale.
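A short sketch of both calculations in R (`mtcars` as stand-in data); note that R's `rstandard()` additionally adjusts each residual for its leverage:

```r
model <- lm(mpg ~ wt, data = mtcars)

# Ordinary residuals: actual minus predicted
raw_res <- mtcars$mpg - fitted(model)

# Simple standardization: divide by the residual standard error (SER)
simple_std <- raw_res / summary(model)$sigma

# rstandard() refines this by accounting for each point's leverage
std_res <- rstandard(model)

# Flag potential outliers, e.g. |standardized residual| > 2
which(abs(std_res) > 2)
```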
Multicollinearity¶
- Doesn’t affect overall \(F\)-stat or \(R^2\), but affects the individual \(\beta\) estimates (they can become unstable or imprecise).
| `cor()` | Interpretation |
|---|---|
| < 0.2 | Low correlation, usually no issue. |
| 0.2–0.7 | Moderate correlation, interpret with caution. |
| > 0.7 | High correlation, potential problem. |
5. Using the Model¶
5a. Description¶
- Summarize relationships between variables (direction, strength).
5b. Estimation¶
- Estimate the mean of \(Y\) for given \(X\)-values.
- The tightest (least-variance) point estimate is typically around \(\bar{X}\).
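In R, `predict()` with `interval = "confidence"` gives this estimate of the mean response; a sketch with `mtcars` (illustrative only) shows the interval is narrowest near \(\bar{X}\):

```r
model <- lm(mpg ~ wt, data = mtcars)

# Confidence intervals for the MEAN of Y at two X values
new_x <- data.frame(wt = c(mean(mtcars$wt), 5))  # at the mean, and far from it
ci <- predict(model, newdata = new_x, interval = "confidence")

widths <- ci[, "upr"] - ci[, "lwr"]
widths  # narrower at mean(wt), wider away from it
```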
5c. Prediction¶
- Use a prediction interval to predict an individual \(Y\).
- A simplified form:

\(\hat{Y} \;\pm\; t_{\alpha/2, \text{df}} \times \sqrt{\text{Var}(\hat{Y}) + \text{Var}(\varepsilon)}\)
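In R, the same `predict()` call with `interval = "prediction"` adds the \(\text{Var}(\varepsilon)\) term, so the prediction interval is always wider than the confidence interval at the same point (`mtcars` for illustration):

```r
model <- lm(mpg ~ wt, data = mtcars)
new_x <- data.frame(wt = 3)

conf_int <- predict(model, newdata = new_x, interval = "confidence")  # mean of Y
pred_int <- predict(model, newdata = new_x, interval = "prediction")  # individual Y

# The prediction interval includes Var(epsilon), so it is wider
(pred_int[, "upr"] - pred_int[, "lwr"]) > (conf_int[, "upr"] - conf_int[, "lwr"])
```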
Beta Interpretation¶
- \(t\)-statistic: \(\displaystyle t = \frac{\text{coefficient}}{\text{std error of coefficient}}\).
- Degrees of Freedom: typically \(n - k - 1\), where \(n\) = number of observations and \(k\) = number of predictors.
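Both quantities can be recovered from a fitted model (`mtcars` used illustratively):

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
co <- summary(model)$coefficients

# t = coefficient / its standard error
t_wt <- co["wt", "Estimate"] / co["wt", "Std. Error"]
all.equal(t_wt, co["wt", "t value"])

# Degrees of freedom: n - k - 1
n <- nrow(mtcars)  # 32 observations
k <- 2             # 2 predictors
(n - k - 1) == model$df.residual
```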
Nominal Factors¶
- Same slopes across groups; different intercepts.
- Categorical variable levels shift the regression line vertically.
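A sketch in R: adding a factor to the formula gives each group its own intercept while sharing one slope (using `mtcars`' transmission variable `am` as an illustrative category):

```r
# One wt slope for everyone; am shifts the line vertically
model <- lm(mpg ~ wt + factor(am), data = mtcars)
coef(model)
# "(Intercept)" : intercept for the baseline group (am = 0)
# "factor(am)1" : vertical shift for the am = 1 group
# "wt"          : the common slope shared by both groups
```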
Interactions¶
- Different slopes across groups.
- An interaction term allows the effect of one predictor to change depending on the value of another.
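A sketch in R: the `wt * factor(am)` formula adds an interaction term so each group gets its own slope (`mtcars` as illustrative data):

```r
model <- lm(mpg ~ wt * factor(am), data = mtcars)

# Slope for the am = 0 group
slope_am0 <- coef(model)["wt"]
# Slope for the am = 1 group: base slope plus the interaction coefficient
slope_am1 <- coef(model)["wt"] + coef(model)["wt:factor(am)1"]

c(slope_am0, slope_am1)  # the two groups have different slopes
```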
Transformations¶
| Model | \(\Delta X\) | \(\Delta Y\) |
|---|---|---|
| Y ~ X | unit | unit |
| Y ~ log(X) | % | unit |
| log(Y) ~ X | unit | % |
| log(Y) ~ log(X) | % | % |
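The log(Y) ~ log(X) case is the elasticity interpretation: a 1% change in \(X\) is associated with roughly a \(\beta_1\)% change in \(Y\). A small sketch with `mtcars` (illustrative only):

```r
# log-log model: the slope is an elasticity
model <- lm(log(mpg) ~ log(wt), data = mtcars)
b1 <- unname(coef(model)[2])
b1  # a 1% increase in wt is associated with roughly a b1 % change in mpg
```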
Multicollinearity Plots in R¶
Below is an example of checking correlations in an R workflow. It assumes a numeric dataset (the built-in mtcars is used as a placeholder) and the reshape2 and ggplot2 packages. When filtering rows, use data[good_mask, ] — rows selected, all columns kept — rather than data[good_mask, 0], which would drop every column.

```r
# Example R code for exploring multicollinearity
library(reshape2)  # for melt()
library(ggplot2)

data <- mtcars  # placeholder: substitute your own numeric dataset

# Keep only rows with a positive row sum (example filter)
row_sums <- rowSums(data)
good_mask <- row_sums > 0
good_data <- data[good_mask, ]  # select rows, keep all columns

# Calculate correlation matrix
cormat <- round(cor(good_data), 2)

# Reshape to long format for ggplot
melted_cormat <- melt(cormat)

# Inspect the long-format correlation data
head(melted_cormat)

# Plot heatmap of correlations
ggplot(melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1)) +
  coord_fixed()
```