Frequency Statistics

Overview

Frequency statistics quantify how often values occur within a dataset. This includes grouping data into bins (histograms), calculating cumulative counts, and determining the position of values relative to the entire distribution (percentiles). These tools are fundamental for visualizing distributions and detecting outliers.

Binning and Histograms

Aggregating continuous data into discrete categories is known as binning. It transforms raw data into a frequency table.

  • BINNED_STATISTIC: A powerful generalization of a histogram. Instead of just counting points in a bin, it can compute a statistic (mean, sum, median) for the values falling into that bin. Ideally suited for summarizing noisy scatter plot data.
  • RELFREQ: Computes the relative frequency (probability estimate) for each bin.
  • CUMFREQ: Computes the cumulative frequency, useful for plotting Empirical CDFs.
Figure 1: Binned Statistics: Raw noisy data (blue dots) is summarized by dividing the x-axis into bins and computing the mean y-value (red line) for each bin. This reveals the underlying trend that is obscured by noise.

Percentiles and Scores

Understanding the rank of a value within a distribution is critical for standardized testing, growth charts, and outlier detection.

  • PERCENTILEOFSCORE: Calculates the percentile rank of a given score (e.g., “A score of 75 is in the 90th percentile”).
  • SCOREATPERCENTILE: The inverse operation; finds the value at a specific percentile (e.g., “The 90th percentile is 75”).

Native Excel Capabilities

Excel has robust tools for this: - FREQUENCY: An array formula for counting values in bins. - PERCENTILE.EXC / PERCENTRANK.EXC: Handle ranking. - Pivot Tables: Great for grouping and summarizing.

However, Python provides: - Advanced Binning: BINNED_STATISTIC offers more than just counts. You can bin y values based on x coordinates and compute specific stats (std dev, median, custom function) per bin, which is complex in native Excel. - Multi-dimensional Binning: BINNED_STATISTIC_2D allows for summarizing data on a 2D grid (heatmaps), which is not natively supported in Excel formulas.

Tools

Tool Description
BINNED_STATISTIC Computes a binned statistic (mean, sum, median, etc.) for the input data.
BINNED_STATISTIC_2D Computes a bidimensional binned statistic (mean, sum, median, etc.) for the input data.
CUMFREQ Compute the cumulative frequency histogram for the input data.
PERCENTILEOFSCORE Computes the percentile rank of a score relative to the input data.
RELFREQ Returns the relative frequency histogram for the input data.
SCOREATPERCENTILE Calculates the score at the given percentile of the input data.