Frequency Statistics
Overview
Frequency statistics quantify how often values occur within a dataset. This includes grouping data into bins (histograms), calculating cumulative counts, and determining the position of values relative to the entire distribution (percentiles). These tools are fundamental for visualizing distributions and detecting outliers.
Binning and Histograms
Aggregating continuous data into discrete categories is known as binning. It transforms raw data into a frequency table.
- BINNED_STATISTIC: A powerful generalization of a histogram. Instead of just counting points in a bin, it can compute a statistic (mean, sum, median) for the values falling into that bin. Ideally suited for summarizing noisy scatter plot data.
- RELFREQ: Computes the relative frequency (probability estimate) for each bin.
- CUMFREQ: Computes the cumulative frequency, useful for plotting Empirical CDFs.
Percentiles and Scores
Understanding the rank of a value within a distribution is critical for standardized testing, growth charts, and outlier detection.
- PERCENTILEOFSCORE: Calculates the percentile rank of a given score (e.g., “A score of 75 is in the 90th percentile”).
- SCOREATPERCENTILE: The inverse operation; finds the value at a specific percentile (e.g., “The 90th percentile is 75”).
Native Excel Capabilities
Excel has robust tools for this: - FREQUENCY: An array formula for counting values in bins. - PERCENTILE.EXC / PERCENTRANK.EXC: Handle ranking. - Pivot Tables: Great for grouping and summarizing.
However, Python provides: - Advanced Binning: BINNED_STATISTIC offers more than just counts. You can bin y values based on x coordinates and compute specific stats (std dev, median, custom function) per bin, which is complex in native Excel. - Multi-dimensional Binning: BINNED_STATISTIC_2D allows for summarizing data on a 2D grid (heatmaps), which is not natively supported in Excel formulas.
Tools
| Tool | Description |
|---|---|
| BINNED_STATISTIC | Computes a binned statistic (mean, sum, median, etc.) for the input data. |
| BINNED_STATISTIC_2D | Computes a bidimensional binned statistic (mean, sum, median, etc.) for the input data. |
| CUMFREQ | Compute the cumulative frequency histogram for the input data. |
| PERCENTILEOFSCORE | Computes the percentile rank of a score relative to the input data. |
| RELFREQ | Returns the relative frequency histogram for the input data. |
| SCOREATPERCENTILE | Calculates the score at the given percentile of the input data. |