Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. It is widely used in various fields for prediction and inference.
Introduction to Linear Regression
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is treated as the explanatory (independent) variable, and the other as the dependent variable.
Simple Linear Regression
In simple linear regression, we aim to fit a line that best describes the relationship between a dependent variable y and an independent variable x. The cost function, which here is half the mean squared error (MSE) (the factor of 1/2 simplifies the derivatives taken during optimization), measures how well the model’s predictions match the actual data. The cost function for simple linear regression is defined as:
\[
J(\beta_0, \beta_1) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - (\beta_0 + \beta_1 x_i))^2
\]
where:
\begin{align*}
J(\beta_0, \beta_1) & : \text{cost function for simple linear regression}, \\
N & : \text{number of training examples}, \\
y_i & : \text{observed value for the } i\text{th example}, \\
x_i & : \text{feature value for the } i\text{th example}, \\
\beta_0 & : \text{intercept term}, \\
\beta_1 & : \text{slope term (coefficient for } x_i\text{)}.
\end{align*}
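As a quick illustration, this cost can be evaluated directly in NumPy. The following is a minimal sketch, assuming arbitrary example data and arbitrary candidate coefficients chosen purely for demonstration:
import numpy as np
# Arbitrary example data and candidate coefficients (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 7.0, 11.0])
beta_0, beta_1 = 0.5, 2.0
# J(beta_0, beta_1) = (1 / 2N) * sum((y_i - (beta_0 + beta_1 * x_i))^2)
N = len(x)
residuals = y - (beta_0 + beta_1 * x)
cost = np.sum(residuals ** 2) / (2 * N)
print(f'Cost: {cost}')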
Multiple Linear Regression
In multiple linear regression, we aim to fit a hyperplane that best describes the relationship between a dependent variable \( y \) and multiple independent variables \( \mathbf{x} = (x_1, x_2, \ldots, x_p) \). The cost function, again half the mean squared error (MSE), measures how well the model’s predictions match the actual data. The cost function for multiple linear regression is defined as:
\[
J(\beta_0, \beta_1, \ldots, \beta_p) = \frac{1}{2N} \sum_{i=1}^{N} \left(y_i - \left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right)\right)^2
\]
where:
\begin{align*}
J(\beta_0, \beta_1, \ldots, \beta_p) & : \text{cost function for multiple linear regression}, \\
N & : \text{number of training examples}, \\
y_i & : \text{observed value for the } i\text{th example}, \\
x_{ij} & : \text{value of the } j\text{th feature for the } i\text{th example}, \\
\beta_0 & : \text{intercept term}, \\
\beta_j & : \text{coefficient for the } j\text{th feature (for } j = 1, 2, \ldots, p\text{)}.
\end{align*}
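The multiple-regression cost is easiest to compute in vectorized form, treating the features as a matrix. The following is a small sketch under the same convention as the formula above; the data and coefficient values are arbitrary illustrative choices:
import numpy as np
# Arbitrary example: N = 4 observations, p = 2 features (illustrative values)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([3.0, 4.0, 6.0, 9.0])
beta_0 = 0.5
beta = np.array([1.5, 0.8])  # beta_1, ..., beta_p
# J = (1 / 2N) * sum((y_i - (beta_0 + sum_j beta_j * x_ij))^2)
N = X.shape[0]
predictions = beta_0 + X @ beta
cost = np.sum((y - predictions) ** 2) / (2 * N)
print(f'Cost: {cost}')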
Assumptions of Linear Regression
Linear regression analysis relies on several key assumptions:
- Linearity: The relationship between the dependent and independent variables should be linear.
- Independence: Observations should be independent of each other.
- Homoscedasticity: The residuals (errors) should have constant variance at every level of x.
- Normality: The residuals of the model should be normally distributed.
- No multicollinearity: In multiple regression, the independent variables should not be highly correlated with each other.
Fitting a Linear Regression Model
To fit a linear regression model, we typically use the Ordinary Least Squares (OLS) method, which minimizes the sum of the squared differences between the observed values and the values predicted by the model.
The coefficients \((\beta_0, \beta_1, \ldots, \beta_p)\) are estimated such that the following cost function is minimized:
\[
\text{Cost}(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{N} \left(y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip})\right)^2
\]
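One common way to obtain the OLS estimates numerically is to prepend a column of ones to the feature matrix and solve the least-squares problem directly with NumPy. This is a minimal sketch; the data values are arbitrary and chosen only for illustration:
import numpy as np
# Arbitrary example data (illustrative values)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.0, 4.0, 6.0, 9.0, 11.0])
# Prepend a column of ones so the first estimated coefficient is the intercept beta_0
X_design = np.column_stack([np.ones(len(y)), X])
# The least-squares solution minimizes the sum of squared residuals
coefficients, residuals, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)
print(f'Intercept: {coefficients[0]}')
print(f'Slopes: {coefficients[1:]}')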
Evaluating the Model
Several metrics can be used to evaluate the performance of a linear regression model; a short sketch computing them follows the list.
- R-squared: Measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
- Adjusted R-squared: Adjusts the R-squared value based on the number of predictors in the model.
- Mean Squared Error (MSE): The average of the squares of the errors (i.e., the average squared difference between the observed actual outcomes and the outcomes predicted by the model).
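The sketch below computes these metrics with scikit-learn's r2_score and mean_squared_error; adjusted R-squared is computed by hand since scikit-learn does not provide it directly. The data are arbitrary illustrative values:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Arbitrary example data (illustrative values)
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 3, 5, 7, 11])
model = LinearRegression().fit(x, y)
y_pred = model.predict(x)
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)
# Adjusted R-squared: 1 - (1 - R^2) * (N - 1) / (N - p - 1)
N, p = x.shape
adj_r2 = 1 - (1 - r2) * (N - 1) / (N - p - 1)
print(f'R-squared: {r2}')
print(f'Adjusted R-squared: {adj_r2}')
print(f'MSE: {mse}')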
Interpreting the Coefficients
The coefficients in a linear regression model represent the mean change in the dependent variable for a one-unit change in an independent variable, holding the other variables constant.
- Intercept: The expected value of y when all x variables are zero.
- Slope: The expected change in y for a one-unit change in x.
Practical Example
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Example data
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 3, 5, 7, 11])
# Create and fit the model
model = LinearRegression()
model.fit(x, y)
# Predict using the model
y_pred = model.predict(x)
# Plotting the results
plt.scatter(x, y, color='blue', label='Observed data')
plt.plot(x, y_pred, color='red', label='Fitted line')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()
# Coefficients
print(f'Intercept: {model.intercept_}')
print(f'Slope: {model.coef_[0]}')
This code fits a simple linear regression model to a small dataset and plots the observed data along with the fitted line. The intercept and slope of the model are printed out.
Advanced Topics in Linear Regression
Polynomial Regression
When the relationship between the dependent and independent variables is not linear, polynomial regression can be used. This involves adding polynomial terms to the regression equation:
\[
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \ldots + \beta_n x^n + \epsilon
\]
Interaction Terms
Interaction terms can be included in the regression model to capture the effect of two or more variables interacting with each other. For instance:
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2) + \epsilon
\]
This allows the model to account for situations where the effect of one independent variable depends on the level of another independent variable.
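In scikit-learn, one way to build such interaction features is PolynomialFeatures with interaction_only=True, which adds the product term without squared terms. The following is a sketch under that approach, with arbitrary illustrative data:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Arbitrary example data with two features (illustrative values)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [4.0, 2.5],
              [5.0, 4.0]])
y = np.array([3.0, 4.0, 8.0, 9.0, 14.0])
# interaction_only=True produces the columns x1, x2, and x1*x2 (no squared terms)
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = interactions.fit_transform(X)
model = LinearRegression()
model.fit(X_interact, y)
print(f'Intercept: {model.intercept_}')
print(f'Coefficients (x1, x2, x1*x2): {model.coef_}')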
Regularization Techniques
Regularization techniques such as Ridge Regression and Lasso Regression help prevent overfitting by adding a penalty to the regression coefficients.
Ridge Regression: Adds a penalty equal to the sum of the squared values of the coefficients.
\[
\text{Cost} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
\]
Lasso Regression: Adds a penalty equal to the sum of the absolute values of the coefficients.
\[
\text{Cost} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
\]
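Both penalized models are available in scikit-learn as Ridge and Lasso, where the alpha parameter plays the role of \(\lambda\). The sketch below uses arbitrary data and untuned alpha values purely for illustration:
import numpy as np
from sklearn.linear_model import Ridge, Lasso
# Arbitrary example data (illustrative values)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([3.0, 4.0, 6.0, 9.0, 11.0])
# alpha corresponds to lambda in the penalized cost functions above (values not tuned)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(f'Ridge coefficients: {ridge.coef_}')
print(f'Lasso coefficients: {lasso.coef_}')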
Checking for Multicollinearity
Multicollinearity occurs when independent variables are highly correlated, which can inflate the variance of the coefficient estimates and make the model unstable. Variance Inflation Factor (VIF) is commonly used to detect multicollinearity.
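One way to compute VIFs is with statsmodels' variance_inflation_factor. The sketch below uses a small, arbitrary dataset in which the second feature is deliberately close to a multiple of the first, so its VIF should come out large:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Arbitrary example data with two highly correlated features (illustrative values)
X = pd.DataFrame({
    'x1': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'x2': [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],  # roughly 2 * x1
})
# Add a constant column so VIFs are computed against a model with an intercept
X_const = sm.add_constant(X)
# A VIF well above roughly 5-10 for a feature is commonly read as problematic multicollinearity
for i, name in enumerate(X_const.columns):
    print(f'{name}: VIF = {variance_inflation_factor(X_const.values, i):.2f}')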
Residual Analysis
Analyzing the residuals of the model can provide insights into the goodness of fit (a sketch producing the first two plots appears after the list). Key plots include:
- Residual vs Fitted Plot: Checks for non-linearity and homoscedasticity.
- Normal Q-Q Plot: Checks for normality of residuals.
- Scale-Location Plot: Checks for homoscedasticity.
- Residual vs Leverage Plot: Identifies influential cases.
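The following sketch produces the residuals-vs-fitted plot with matplotlib and the normal Q-Q plot with scipy.stats.probplot, using arbitrary illustrative data:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression
# Arbitrary example data (illustrative values)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y = np.array([2.1, 2.9, 5.2, 6.8, 9.1, 11.2, 12.8, 15.3])
model = LinearRegression().fit(x, y)
fitted = model.predict(x)
residuals = y - fitted
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Residuals vs fitted: look for curvature (non-linearity) or a funnel shape (heteroscedasticity)
ax1.scatter(fitted, residuals)
ax1.axhline(0, color='red', linestyle='--')
ax1.set_xlabel('Fitted values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residuals vs Fitted')
# Normal Q-Q plot: points should lie close to the reference line if residuals are roughly normal
stats.probplot(residuals, dist='norm', plot=ax2)
ax2.set_title('Normal Q-Q')
plt.tight_layout()
plt.show()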
Cross-Validation
Cross-validation is used to assess how the results of a statistical analysis will generalize to an independent data set. Common techniques include k-fold cross-validation and leave-one-out cross-validation.
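Both techniques are straightforward with scikit-learn's cross_val_score. The sketch below scores a linear regression by negative MSE (scikit-learn maximizes scores, so the sign is flipped when reporting); the data are arbitrary illustrative values:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
# Arbitrary example data (illustrative values)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2.1, 2.9, 5.2, 6.8, 9.1, 11.2, 12.8, 15.3, 16.9, 19.2])
model = LinearRegression()
# 5-fold cross-validation
kfold_scores = cross_val_score(model, x, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=0),
                               scoring='neg_mean_squared_error')
print(f'5-fold mean MSE: {-kfold_scores.mean():.3f}')
# Leave-one-out cross-validation: each observation is held out exactly once
loo_scores = cross_val_score(model, x, y, cv=LeaveOneOut(),
                             scoring='neg_mean_squared_error')
print(f'Leave-one-out mean MSE: {-loo_scores.mean():.3f}')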
Model Selection Criteria
There are several criteria for model selection beyond R-squared (a short sketch computing two of them appears after the list), including:
- Akaike Information Criterion (AIC): Balances model fit and complexity.
- Bayesian Information Criterion (BIC): Similar to AIC but imposes a greater penalty for models with more parameters.
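Both criteria are reported by a fitted statsmodels OLS model via its aic and bic attributes. The following is a minimal sketch with arbitrary illustrative data; lower values indicate a better trade-off between fit and complexity:
import numpy as np
import statsmodels.api as sm
# Arbitrary example data (illustrative values)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y = np.array([2.1, 2.9, 5.2, 6.8, 9.1, 11.2, 12.8, 15.3])
# statsmodels requires the constant (intercept) column to be added explicitly
X_const = sm.add_constant(x)
results = sm.OLS(y, X_const).fit()
print(f'AIC: {results.aic:.3f}')
print(f'BIC: {results.bic:.3f}')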
Practical Example of Polynomial Regression
Let’s extend our practical example to include polynomial regression using Python:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Example data
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 3, 5, 7, 11])
# Transforming data to include polynomial terms (degree 2)
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
# Create and fit the model
model = LinearRegression()
model.fit(x_poly, y)
# Predict using the model
y_pred = model.predict(x_poly)
# Plotting the results
plt.scatter(x, y, color='blue', label='Observed data')
plt.plot(x, y_pred, color='red', label='Fitted polynomial line')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()
# Coefficients
print(f'Intercept: {model.intercept_}')
print(f'Coefficients: {model.coef_}')