Summary Statistics

Overview

Summary statistics condense a set of observations into a few numbers that communicate as much information as possible, as simply as possible. They are usually the first step in a data analysis, providing a snapshot of the data’s central tendency, dispersion, and shape.

Common summary statistics fall into three categories:

  1. Central Tendency: Where is the “center” of the data? (Mean, Median, Mode)
  2. Dispersion: How spread out is the data? (Variance, Standard Deviation, Interquartile Range)
  3. Shape: Is the distribution symmetric or skewed? Peaked or flat? (Skewness, Kurtosis)

Figure 1: Central Tendency in Skewed Data. In a symmetric distribution, the mean and median align. In a right-skewed distribution (the case shown in the figure), the mean is pulled toward the long tail, while the median remains closer to the peak.
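
The effect described in Figure 1 can be checked on a small sample. The sketch below uses only the Python standard library; the `incomes` values are made up for illustration and are not taken from the figure.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is very large.
incomes = [28, 30, 31, 33, 35, 36, 40, 250]

mean_value = statistics.mean(incomes)      # pulled toward the long right tail
median_value = statistics.median(incomes)  # stays near the bulk of the data

print(f"mean   = {mean_value}")    # mean   = 60.375
print(f"median = {median_value}")  # median = 34.0
```

The single large value moves the mean well above every other observation, while the median barely reacts to it.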

Key Functions

  • DESCRIBE: Computes several descriptive statistics (nobs, minmax, mean, variance, skewness, kurtosis) in a single call.
  • SKEWNESS and KURTOSIS: Quantify the asymmetry and “tailedness” of the distribution.
  • MODE: Identifies the most frequent value(s) in a dataset.
  • SEM: Calculates the standard error of the mean, essential for error bars.
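
As a rough illustration of what these functions report, the sketch below calls the underlying SciPy routines directly. It assumes that SEM corresponds to `scipy.stats.sem`, which the text above does not confirm. The mode is computed with NumPy and returns the smallest of the most frequent values, so the result does not depend on the SciPy version. The sample values are made up.

```python
import numpy as np
from scipy import stats

# Hypothetical sample, chosen so the results are easy to verify by hand.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# describe() returns a named tuple with the fields listed above.
summary = stats.describe(data)
print(summary.nobs, summary.minmax)
print(summary.mean, summary.variance)    # variance uses ddof=1 by default
print(summary.skewness, summary.kurtosis)

# Standard error of the mean: sample standard deviation / sqrt(n).
sem = stats.sem(data)

# Most frequent value. np.unique returns sorted values, and argmax picks
# the first maximum, so ties resolve to the smallest value.
values, counts = np.unique(data, return_counts=True)
mode_value = values[np.argmax(counts)]
print(sem, mode_value)
```

Here the sample variance is 32/7 and the standard error is the square root of 4/7, which is the half-width an error bar of one standard error would use.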

Native Excel Capabilities

Excel computes basic statistics easily (AVERAGE, MEDIAN, STDEV.S, VAR.S). However, Python provides enhanced capabilities:

  • Higher-Order Moments: Excel provides SKEW and KURT, but scipy.stats adds optional bias correction and a choice of kurtosis definition (Fisher, where a normal distribution scores 0, vs Pearson, where it scores 3).
  • Multi-dimensional Data: Python functions easily operate along specific axes of matrix data (e.g., “compute the mean of every column”).
  • Comprehensive Reporting: The DESCRIBE function returns a full statistical summary in a single call. Getting the same information in Excel requires running the Analysis ToolPak “Descriptive Statistics” tool, which produces static output that does not update when the data changes.
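
The first two points can be sketched directly with NumPy and SciPy. The example below is illustrative only: the `matrix` values are made up, and it calls `scipy.stats.skew` and `scipy.stats.kurtosis` rather than the spreadsheet functions themselves.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 5 observations (rows) of 2 variables (columns).
# The second column is the first scaled by 10.
matrix = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
    [10.0, 100.0],
])

# axis=0 computes one statistic per column.
col_means = matrix.mean(axis=0)

# Fisher kurtosis subtracts 3 so that a normal distribution scores 0;
# Pearson kurtosis leaves the raw fourth standardized moment.
fisher = stats.kurtosis(matrix, axis=0, fisher=True)
pearson = stats.kurtosis(matrix, axis=0, fisher=False)

# bias=False applies the small-sample correction to the skewness estimate.
skew_biased = stats.skew(matrix, axis=0, bias=True)
skew_corrected = stats.skew(matrix, axis=0, bias=False)

print(col_means)
print(fisher, pearson)
print(skew_biased, skew_corrected)
```

Because skewness and kurtosis are scale-invariant, both columns yield the same values, and the two kurtosis definitions differ by exactly 3.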

Tools

Tool         | Description
DESCRIBE     | Computes descriptive statistics using scipy.stats.describe.
EFFECT_SIZES | Computes effect size measures for comparing two groups.
EXPECTILE    | Calculates the expectile of a dataset using scipy.stats.expectile.
GMEAN        | Computes the geometric mean of the input data, flattening the input and ignoring non-numeric values.
HMEAN        | Calculates the harmonic mean of the input data, flattening the input and ignoring non-numeric values.
KURTOSIS     | Computes the kurtosis (Fisher or Pearson) of a dataset.
MODE         | Returns the modal (most common) value in the passed array. Wraps scipy.stats.mode to flatten the input, ignore non-numeric values, and always return a single mode (the smallest if there are several). If no mode is found (all values occur only once), returns an error.
MOMENT       | Calculates the nth moment about the mean for a sample.
PMEAN        | Computes the power mean (generalized mean) of the input data for a given power p.
SKEWNESS     | Calculates the skewness of a dataset.
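
To make the relationship between the mean-type tools and MOMENT concrete, the sketch below uses `scipy.stats.gmean`, `scipy.stats.hmean`, and `scipy.stats.moment`. The `power_mean` helper is hypothetical and written by hand for illustration; it is not the PMEAN implementation, and the data values are made up.

```python
import numpy as np
from scipy import stats

# Hypothetical positive values (geometric and harmonic means require them).
growth_factors = np.array([1.0, 4.0, 16.0])

arithmetic = growth_factors.mean()        # (1 + 4 + 16) / 3
geometric = stats.gmean(growth_factors)   # cube root of 1 * 4 * 16
harmonic = stats.hmean(growth_factors)    # 3 / (1/1 + 1/4 + 1/16)


def power_mean(values, p):
    """Hypothetical helper: generalized mean for a nonzero power p."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values ** p) ** (1.0 / p))


# p = 1 reproduces the arithmetic mean, p = -1 the harmonic mean.
print(power_mean(growth_factors, 1), power_mean(growth_factors, -1))

# The 2nd moment about the mean equals the population variance (ddof=0).
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
second_central = stats.moment(sample, 2)
print(second_central, np.var(sample))
```

For positive data the three classical means are ordered harmonic ≤ geometric ≤ arithmetic, which this sample illustrates with values 16/7, 4, and 7.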