Summary Statistics

Introduction

Summary statistics provide a concise overview of the main characteristics of a dataset. They summarize the central tendency, dispersion, and shape of a data distribution, which supports data exploration, comparison, and analysis. This lesson covers the main types of summary statistics: measures of central tendency, measures of dispersion, and other key metrics used in descriptive statistics.

Measures of Central Tendency

Measures of central tendency indicate the central or typical value around which data points tend to cluster.

Mean

The mean (average) is the sum of all values divided by the number of values in the dataset.

import numpy as np

# Example data
data = [10, 20, 30, 40, 50]

# Calculating the mean
mean_value = np.mean(data)
print(f"Mean: {mean_value:.2f}")
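
Median

The median is the middle value of a sorted dataset: half of the values lie at or below it and half at or above it. When the dataset has an even number of values, the median is the average of the two middle values. Unlike the mean, it is barely affected by extreme values.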

# Calculating the median
median_value = np.median(data)
print(f"Median: {median_value}")
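To illustrate how the mean and median can differ, the sketch below replaces one value of the example data with a hypothetical outlier and also shows the even-length case. It uses Python's standard-library `statistics` module instead of NumPy so that it runs without extra packages; the variable names are made up for this illustration.

```python
import statistics

# Hypothetical data: the lesson's example with the last value replaced by an outlier
with_outlier = [10, 20, 30, 40, 500]

mean_with_outlier = statistics.mean(with_outlier)      # pulled toward the outlier
median_with_outlier = statistics.median(with_outlier)  # still the middle value

print(f"Mean with outlier: {mean_with_outlier}")
print(f"Median with outlier: {median_with_outlier}")

# With an even number of values, the median is the average of the two middle values
even_data = [10, 20, 30, 40]
even_median = statistics.median(even_data)
print(f"Median of even-length data: {even_median}")
```

The single outlier moves the mean from 30 to 120, while the median stays at 30.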
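
Mode

The mode is the most frequently occurring value in the dataset. A dataset can have more than one mode, and when every value occurs equally often the mode is not informative.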

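from scipy import stats

# Calculating the mode
# Every value in `data` occurs once, so SciPy returns the smallest value (10)
mode_result = stats.mode(data)
# np.atleast_1d handles both older SciPy (array result) and newer SciPy (scalar result)
print(f"Mode: {np.atleast_1d(mode_result.mode)[0]}")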
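When a dataset may have several modes, it can help to list all of them rather than a single value. The following is a minimal sketch using `statistics.multimode` from Python's standard library (available from Python 3.8); the `bimodal` list is hypothetical data chosen for this illustration.

```python
import statistics

# Hypothetical data with two values tied for most frequent
bimodal = [1, 2, 2, 3, 3, 4]
modes = statistics.multimode(bimodal)
print(f"Modes: {modes}")

# In the lesson's example data every value occurs exactly once,
# so every value is returned as a mode
data = [10, 20, 30, 40, 50]
all_modes = statistics.multimode(data)
print(f"Modes of the example data: {all_modes}")
```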

Measures of Dispersion

Measures of dispersion quantify the spread or variability of data points around the central tendency.

Range

The range is the difference between the maximum and minimum values in the dataset.

# Calculating the range
data_range = np.max(data) - np.min(data)
print(f"Range: {data_range}")

Interquartile Range (IQR)

The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1). It covers the middle 50% of the data and is less sensitive to extreme values than the range.

# Calculating the interquartile range (IQR)
q3, q1 = np.percentile(data, [75, 25])
iqr = q3 - q1
print(f"IQR: {iqr}")
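The sketch below compares how the range and the IQR react to a single extreme value. It assumes the standard-library `statistics.quantiles` function with `method="inclusive"`, which for this data yields the same quartiles as the NumPy call above; the helper names `value_range` and `iqr` and the outlier are hypothetical.

```python
import statistics

def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

def iqr(values):
    """Q3 - Q1, using the 'inclusive' method (linear interpolation between data points)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 500]  # hypothetical outlier

print(f"Range: {value_range(data)} vs {value_range(with_outlier)}")
print(f"IQR:   {iqr(data)} vs {iqr(with_outlier)}")
```

The range grows from 40 to 490, while the IQR stays at 20 for both lists.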
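
Variance and Standard Deviation

Variance measures the average squared deviation of the data points from the mean. Standard deviation is the square root of the variance, so it is expressed in the same units as the data. NumPy's np.var and np.std compute the population versions by default (dividing by N); pass ddof=1 to get the sample versions (dividing by N - 1).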

# Calculating the variance (population variance, since ddof=0 by default)
variance_value = np.var(data)
print(f"Variance: {variance_value:.2f}")

# Calculating the standard deviation (population version, ddof=0 by default)
std_deviation = np.std(data)
print(f"Standard Deviation: {std_deviation:.2f}")
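To make the definition concrete, here is a small sketch that computes the variance by hand for the example data, in both its population form (divide by N) and its sample form (divide by N - 1). The variable names are illustrative, and the standard-library `statistics` module is used only as a cross-check.

```python
import math
import statistics

data = [10, 20, 30, 40, 50]
n = len(data)
mean_value = sum(data) / n

# Sum of squared deviations from the mean
squared_devs = sum((x - mean_value) ** 2 for x in data)

population_variance = squared_devs / n    # divide by N
sample_variance = squared_devs / (n - 1)  # divide by N - 1 (Bessel's correction)

print(f"Population variance: {population_variance:.2f}")
print(f"Sample variance: {sample_variance:.2f}")
print(f"Population std dev: {math.sqrt(population_variance):.2f}")
print(f"Sample std dev: {math.sqrt(sample_variance):.2f}")

# Cross-check against the standard library
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))
```

For this data the population variance is 200 and the sample variance is 250; the gap between the two shrinks as the number of values grows.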
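
Other Key Summary Statistics

Skewness and Kurtosis

Skewness measures the asymmetry of the data distribution: positive values indicate a longer right tail, negative values a longer left tail. Kurtosis measures the “tailedness” of the distribution. SciPy's kurtosis function reports excess kurtosis by default, so a normal distribution scores 0 rather than 3.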

from scipy.stats import skew, kurtosis

# Calculating skewness
skewness_value = skew(data)
print(f"Skewness: {skewness_value:.2f}")

# Calculating kurtosis
kurtosis_value = kurtosis(data)
print(f"Kurtosis: {kurtosis_value:.2f}")
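Both quantities can be written in terms of central moments, which the sketch below spells out in plain Python. It uses the biased (population) estimators, which correspond to the default settings of the SciPy functions above; the helper names and the `right_skewed` list are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mean_value = sum(values) / len(values)
    return sum((x - mean_value) ** k for x in values) / len(values)

def skewness(values):
    """Third central moment divided by variance ** 1.5 (biased estimator)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment divided by variance ** 2, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [10, 20, 30, 40, 50]
right_skewed = [1, 2, 2, 3, 10]  # hypothetical data with a long right tail

print(f"Skewness (symmetric): {skewness(symmetric):.2f}")
print(f"Skewness (right-skewed): {skewness(right_skewed):.2f}")
print(f"Excess kurtosis (symmetric): {excess_kurtosis(symmetric):.2f}")
```

The symmetric example data has zero skewness, while the list with a long right tail has positive skewness.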
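
Percentiles and Quantiles

Percentiles divide a sorted dataset into 100 equal parts; quantiles generalize the idea to any number of equal parts. Quartiles (4 parts) and deciles (10 parts) are common special cases, and the 50th percentile is the median.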

# Calculating percentiles and quartiles
percentile_25 = np.percentile(data, 25)
median = np.percentile(data, 50)
percentile_75 = np.percentile(data, 75)
print(f"25th Percentile (Q1): {percentile_25}")
print(f"Median (50th Percentile): {median}")
print(f"75th Percentile (Q3): {percentile_75}")
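As a sketch of the general case, the standard-library function `statistics.quantiles` returns the cut points for any number of equal parts. The example assumes `method="inclusive"`, which treats the data as the full population and interpolates linearly between data points.

```python
import statistics

data = [10, 20, 30, 40, 50]

# n=4 returns the three cut points that split the data into quartiles
quartiles = statistics.quantiles(data, n=4, method="inclusive")
print(f"Quartiles: {quartiles}")

# n=10 returns the nine cut points that split the data into deciles
deciles = statistics.quantiles(data, n=10, method="inclusive")
print(f"Number of decile cut points: {len(deciles)}")
```

Splitting data into n parts always needs n - 1 cut points, which is why quartiles come as three values.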

Applications of Summary Statistics

  1. Data Exploration: Gain insights into data distributions and characteristics.
  2. Comparison: Compare datasets or subsets within a dataset.
  3. Modeling and Analysis: Inform modeling assumptions and parameter estimation.
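As an illustration of the comparison use case, the sketch below collects several statistics into one dictionary and applies it to two groups. The `summarize` helper and both groups are hypothetical; they are chosen so that the groups share a mean but differ in spread.

```python
import statistics

def summarize(values):
    """Return a small dictionary of summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
    }

# Hypothetical samples with the same mean but different spread
group_a = [28, 29, 30, 31, 32]
group_b = [10, 20, 30, 40, 50]

for name, group in [("A", group_a), ("B", group_b)]:
    summary = summarize(group)
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Looking only at the mean or median would suggest the groups are alike; the standard deviation, minimum, and maximum show that they are not.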

Conclusion

Summary statistics are the core of descriptive statistics: they condense a data distribution into a few numbers. Measures of central tendency, measures of dispersion, and metrics such as skewness, kurtosis, percentiles, and quantiles let analysts describe, interpret, and compare data across domains. They also provide the basis for validating assumptions and making informed decisions in later analysis.