Pearson Correlation

Introduction

Pearson correlation is a statistical measure that quantifies the linear relationship between two continuous variables. It assesses how strongly variables are related and the direction (positive or negative) of their relationship. This lesson covers the definition, calculation, interpretation, assumptions, and practical applications of Pearson correlation in data science.

The Pearson correlation coefficient, denoted \( r \), ranges from -1 to +1:

  • \( r = +1 \): perfect positive linear relationship; the points lie exactly on a straight line with positive slope.
  • \( r = -1 \): perfect negative linear relationship; the points lie exactly on a straight line with negative slope.
  • \( r = 0 \): no linear relationship between the variables (a non-linear relationship may still exist).

To compute the Pearson correlation coefficient between two variables \( X \) and \( Y \), use the following formula:

\[
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
\]

Where:

\begin{align*}
X_i & : \text{individual data points of variable } X, \\
Y_i & : \text{individual data points of variable } Y, \\
\bar{X} & : \text{mean of } X, \\
\bar{Y} & : \text{mean of } Y, \\
n & : \text{number of data points}.
\end{align*}

Interpretation
  • Strength: The absolute value of r indicates the strength of the correlation. Values closer to 1 (either positive or negative) indicate stronger relationships.
  • Direction: A positive r indicates a positive correlation (both variables increase together), while a negative r indicates a negative correlation (one variable increases as the other decreases).
  • Significance: Evaluate the significance of r using hypothesis testing (e.g., computing a p-value) to determine whether the observed correlation is statistically significant; a brief sketch using SciPy follows this list.
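
As a quick illustration of the significance check above, the sketch below uses scipy.stats.pearsonr, which returns both the coefficient and a two-sided p-value. The data are generated here purely for illustration.

import numpy as np
from scipy.stats import pearsonr

# Illustrative data: a noisy positive linear relationship
rng = np.random.default_rng(0)
X = rng.normal(size=30)
Y = 0.8 * X + rng.normal(scale=0.5, size=30)

# pearsonr returns the coefficient and a two-sided p-value
r, p_value = pearsonr(X, Y)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
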
Assumptions of Pearson Correlation
  • Linearity: Pearson correlation measures only linear relationships; a strong non-linear relationship can still produce a misleading coefficient, as the sketch after this list shows.
  • Normality: Variables should follow a normal distribution. For large sample sizes, deviations from normality may not severely impact the validity of the correlation coefficient.
  • Homoscedasticity: Variability of data points around the regression line should be consistent.
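
To illustrate the linearity assumption, the following sketch (with made-up data) computes r for a symmetric quadratic relationship: Y is completely determined by X, yet the Pearson coefficient comes out near zero because the relationship is not linear.

import numpy as np

# Y is a deterministic function of X, but the relationship is quadratic, not linear
X = np.linspace(-5, 5, 101)
Y = X**2

# Over a range symmetric about zero, the linear correlation is essentially zero
r = np.corrcoef(X, Y)[0, 1]
print(f"Pearson r for Y = X^2: {r:.3f}")
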
Practical Applications
  1. Data Exploration: Analyze correlations between features to identify potential relationships and dependencies in datasets.
  2. Feature Selection: Determine which features are strongly correlated with the target variable in predictive modeling tasks (see the sketch after this list).
  3. Quality Control: Assess correlations between process variables to improve manufacturing processes and detect anomalies.
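
As a small sketch of correlation-based feature screening, the example below builds a toy pandas DataFrame (the column names feature_a, feature_b, and target are hypothetical) and ranks columns by their Pearson correlation with the target; pandas uses Pearson correlation by default in DataFrame.corr().

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
# Hypothetical target that depends mainly on feature_a
df["target"] = 2.0 * df["feature_a"] + rng.normal(scale=0.5, size=n)

# Pearson correlation of every column with the target, strongest first
print(df.corr()["target"].sort_values(ascending=False))
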
Example Calculation
import numpy as np

# Example data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 6, 8, 10])

# Calculate mean of X and Y
mean_X = np.mean(X)
mean_Y = np.mean(Y)

# Calculate Pearson correlation coefficient
numerator = np.sum((X - mean_X) * (Y - mean_Y))
denominator = np.sqrt(np.sum((X - mean_X)**2) * np.sum((Y - mean_Y)**2))
r = numerator / denominator

print(f"Pearson correlation coefficient: {r}")
Conclusion

Pearson correlation is a fundamental statistical measure that quantifies the strength and direction of linear relationships between continuous variables. By understanding how to calculate and interpret r, data scientists can gain insights into data relationships, validate hypotheses, and make informed decisions in various domains. Mastery of Pearson correlation supports rigorous data analysis, enhances predictive modeling capabilities, and facilitates effective communication of analytical findings within data science and beyond.