Statistics

Overview

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. It provides the mathematical framework for understanding uncertainty, making predictions, and drawing conclusions from incomplete information. Whether in science, economics, or engineering, statistical methods are essential for distinguishing signal from noise.

The field is broadly divided into two major categories:

  1. Descriptive Statistics: Summarizes and organizes data to highlight its main features (e.g., mean, median, variance).
  2. Inferential Statistics: Uses sample data to make generalizations, estimates, predictions, or decisions about a larger population.

Figure 1: The standard normal distribution, the foundation of many statistical tests. The red areas represent the critical regions (alpha = 0.05) often used in hypothesis testing to reject the null hypothesis.
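
The critical regions described in the caption can be reproduced numerically. The following is a minimal sketch, assuming a two-sided test at alpha = 0.05 and calling scipy.stats.norm directly; the variable names are illustrative only.

```python
from scipy import stats

alpha = 0.05  # significance level from the figure caption

# Two-sided test: alpha is split evenly between the two tails.
z_crit = stats.norm.ppf(1 - alpha / 2)  # upper critical value, about 1.96

# Probability mass in both tails beyond +/- z_crit; this recovers alpha.
tail_mass = 2 * stats.norm.sf(z_crit)

# A hypothetical observed z-score falls in the critical region
# when its absolute value exceeds z_crit.
z_observed = 2.3
in_critical_region = abs(z_observed) > z_crit
```

A one-sided test would instead place all of alpha in a single tail, giving a critical value of stats.norm.ppf(1 - alpha).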

Key Sub-Disciplines

  • Hypothesis Testing: A formal procedure for evaluating claims about a population using sample data. It compares the observed data against a null hypothesis to judge whether deviations are statistically significant.
  • Models: Statistical models (like regression or GLMs) mathematically represent relationships between variables, allowing for prediction and, under suitable assumptions, causal inference.
  • Probability Distributions: Mathematical functions that describe the possible values of a random variable and the probability associated with each.
  • Summary Statistics: Tools for quantitatively describing the main features of a dataset.
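
To make the distinction between summarizing data and testing a hypothesis concrete, here is a small sketch on synthetic data. The sample sizes, means, and the 0.05 threshold are assumptions chosen for illustration, and Welch's t-test is just one possible choice of test.

```python
import numpy as np
from scipy import stats

# Synthetic data: two hypothetical groups with slightly different means.
rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=2.0, size=50)
b = rng.normal(loc=11.0, scale=2.0, size=50)

# Summary statistics: describe the sample itself.
mean_a = np.mean(a)
median_a = np.median(a)
var_a = np.var(a, ddof=1)  # sample variance (n - 1 in the denominator)

# Hypothesis test: is the difference in means larger than chance would explain?
# equal_var=False selects Welch's t-test, which does not assume equal variances.
result = stats.ttest_ind(a, b, equal_var=False)
reject_null = result.pvalue < 0.05
```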

Python Libraries

The functions in this category are built on two widely used statistical libraries in the Python ecosystem:

  • SciPy Stats: The workhorse for probability distributions, summary statistics, and standard hypothesis tests.
  • Statsmodels: Provides classes and functions for the estimation of many different statistical models (regression, GLM, time series analysis) and conducting statistical tests.

Native Excel Capabilities

Excel has a strong set of basic statistical functions (AVERAGE, STDEV, T.TEST, NORM.DIST) and the Analysis ToolPak for regression and ANOVA. However, it lacks:

  • Advanced Modeling: No native support for Generalized Linear Models (GLM), Mixed Effects models, or Survival Analysis.
  • Comprehensive Distributions: Many specialized distributions (e.g., Multivariate Normal, Dirichlet, Von Mises) are missing.
  • Advanced Testing: Robust tests, post-hoc analysis, and specialized non-parametric tests are often absent.
  • Unified API: There is no object-oriented interface to distributions. In Python, distributions are objects with consistent .pdf, .cdf, and .rvs methods, which simplifies simulation and analysis workflows.
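
The consistent distribution interface mentioned in the last point can be sketched as follows. This calls SciPy directly rather than any tool in this category; the parameter values are arbitrary.

```python
from scipy import stats

# A "frozen" distribution object: parameters are fixed once, methods reused.
normal = stats.norm(loc=0.0, scale=1.0)

density_at_zero = normal.pdf(0.0)   # probability density at x = 0
prob_below_zero = normal.cdf(0.0)   # P(X <= 0), which is 0.5 by symmetry
samples = normal.rvs(size=1000, random_state=42)  # reproducible random draws

# A different distribution exposes the same method names.
expo = stats.expon(scale=2.0)
expo_cdf_at_scale = expo.cdf(2.0)   # equals 1 - exp(-1)
expo_samples = expo.rvs(size=1000, random_state=42)
```

Because every distribution shares these method names, code written for one distribution can usually be reused for another by swapping the object.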

Frequency Statistics

Tool Description
BINNED_STATISTIC Computes a binned statistic (mean, sum, median, etc.) for the input data.
BINNED_STATISTIC_2D Computes a bidimensional binned statistic (mean, sum, median, etc.) for the input data.
CUMFREQ Computes the cumulative frequency histogram for the input data.
PERCENTILEOFSCORE Computes the percentile rank of a score relative to the input data.
RELFREQ Returns the relative frequency histogram for the input data.
SCOREATPERCENTILE Calculates the score at the given percentile of the input data.
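
As a rough illustration of what these frequency tools compute, the sketch below calls the SciPy functions whose names match the tools above. The signatures of the tools themselves are not given here, so the direct SciPy calls are an assumption about the underlying behavior.

```python
import numpy as np
from scipy import stats

data = np.arange(1, 11, dtype=float)  # 1.0, 2.0, ..., 10.0

# Percentile rank of the score 7 within the data (default kind="rank").
rank_of_7 = stats.percentileofscore(data, 7)

# Inverse question: which score sits at the 50th percentile?
median_score = stats.scoreatpercentile(data, 50)

# Binned statistic: the mean of `values` within equal-width bins of `x`.
x = np.array([0.5, 1.5, 1.7, 2.5])
values = np.array([10.0, 20.0, 30.0, 40.0])
binned = stats.binned_statistic(x, values, statistic="mean", bins=3, range=(0, 3))
bin_means = binned.statistic  # one mean per bin: [0, 1), [1, 2), [2, 3]
```

Here the two middle observations share the second bin, so its mean is the average of 20 and 30.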

Hypothesis Tests

Association and Correlation

Tool Description
BARNARD_EXACT Perform Barnard’s exact test on a 2x2 contingency table.
BOSCHLOO_EXACT Perform Boschloo’s exact test on a 2x2 contingency table.
CHI2_CONTINGENCY Perform the chi-square test of independence for variables in a contingency table.
FISHER_EXACT Perform Fisher’s exact test on a 2x2 contingency table.
KENDALLTAU Calculate Kendall’s tau, a correlation measure for ordinal data.
PAGE_TREND_TEST Perform Page’s L trend test for monotonic trends across treatments.
PEARSONR Calculate the Pearson correlation coefficient and p-value for two datasets.
POINTBISERIALR Calculate a point biserial correlation coefficient and its p-value.
SIEGELSLOPES Compute the Siegel repeated medians estimator for robust linear regression using scipy.stats.siegelslopes.
SOMERSD Calculate Somers’ D, an asymmetric measure of ordinal association between two variables.
SPEARMANR Calculate a Spearman rank-order correlation coefficient with associated p-value.
THEILSLOPES Compute the Theil-Sen estimator for a set of points (robust linear regression).
WEIGHTEDTAU Compute a weighted version of Kendall’s tau correlation coefficient.
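
A brief sketch of a few of these measures, using the SciPy functions that share their names with the tools above. The data and the contingency table are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements; y increases strictly with x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.3])

r, p_r = stats.pearsonr(x, y)        # linear correlation
rho, p_rho = stats.spearmanr(x, y)   # rank correlation
tau, p_tau = stats.kendalltau(x, y)  # ordinal association

# Fisher's exact test on a hypothetical 2x2 contingency table.
table = np.array([[8, 2],
                  [1, 5]])
odds_ratio, p_fisher = stats.fisher_exact(table)
```

Because y is strictly increasing in x, both rank-based measures equal 1 even though the relationship is not perfectly linear, while Pearson's r is slightly below 1.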

Independent Sample

Tool Description
ALEXANDERGOVERN Performs the Alexander-Govern test for equality of means across multiple independent samples with possible heterogeneity of variance.
ANDERSON_KSAMP Performs the k-sample Anderson-Darling test to determine if samples are drawn from the same population.
ANSARI Performs the Ansari-Bradley test for equal scale parameters (non-parametric) using scipy.stats.ansari.
BRUNNERMUNZEL Computes the Brunner-Munzel nonparametric test for two independent samples.
BWS_TEST Performs the Baumgartner-Weiss-Schindler test on two independent samples.
CVM_2SAMP Performs the two-sample Cramér-von Mises test using scipy.stats.cramervonmises_2samp.
DUNNETT Performs Dunnett’s test for multiple comparisons of means against a control group.
EPPS_SINGLE_2SAMP Compute the Epps-Singleton test statistic and p-value for two samples.
F_ONEWAY Performs a one-way ANOVA test for two or more independent samples.
FLIGNER Performs the Fligner-Killeen test for equality of variances across multiple samples.
FRIEDMANCHISQUARE Computes the Friedman test for repeated samples.
KRUSKAL Computes the Kruskal-Wallis H-test for independent samples.
KS_2SAMP Performs the two-sample Kolmogorov-Smirnov test for goodness of fit.
LEVENE Performs the Levene test for equality of variances across multiple samples.
MANNWHITNEYU Performs the Mann-Whitney U rank test on two independent samples using scipy.stats.mannwhitneyu.
MEDIAN_TEST Performs Mood’s median test to determine if two or more independent samples come from populations with the same median.
MOOD Perform Mood’s two-sample test for scale parameters.
POISSON_MEANS_TEST Performs the Poisson means test (E-test) to compare the means of two Poisson distributions.
RANKSUMS Computes the Wilcoxon rank-sum statistic and p-value for two independent samples.
TTEST_IND Performs the independent two-sample t-test for the means of two groups.
TTEST_IND_STATS Perform a t-test for means of two independent samples using summary statistics.
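
The sketch below runs several of these tests on synthetic groups by calling SciPy directly. The group sizes and mean shifts are assumptions chosen only to give the tests something to detect.

```python
import numpy as np
from scipy import stats

# Three synthetic independent groups with increasing means.
rng = np.random.default_rng(1)
g1 = rng.normal(loc=0.0, scale=1.0, size=30)
g2 = rng.normal(loc=0.5, scale=1.0, size=30)
g3 = rng.normal(loc=1.0, scale=1.0, size=30)

# Two groups, no normality assumption: Mann-Whitney U rank test.
mwu = stats.mannwhitneyu(g1, g2, alternative="two-sided")

# Three groups, parametric: one-way ANOVA on the means.
anova = stats.f_oneway(g1, g2, g3)

# Three groups, rank-based alternative: Kruskal-Wallis H-test.
kw = stats.kruskal(g1, g2, g3)

# Check the equal-variance assumption behind ANOVA with Levene's test.
lev = stats.levene(g1, g2, g3)

p_values = {
    "mannwhitneyu": mwu.pvalue,
    "f_oneway": anova.pvalue,
    "kruskal": kw.pvalue,
    "levene": lev.pvalue,
}
```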

One Sample

Tool Description
BINOMTEST Perform a binomial test for the probability of success in a Bernoulli experiment.
JARQUE_BERA Perform the Jarque-Bera goodness of fit test for normality.
KSTEST Performs the one-sample Kolmogorov-Smirnov test for goodness of fit.
KURTOSISTEST Test whether the kurtosis of a sample is different from that of a normal distribution.
NORMALTEST Test whether a sample differs from a normal distribution (omnibus test).
QUANTILE_TEST Perform a quantile test to determine if a population quantile equals a hypothesized value.
SHAPIRO Perform the Shapiro-Wilk test for normality.
SKEWTEST Test whether the skewness of a sample is different from that of a normal distribution.
TTEST_1SAMP Perform a one-sample t-test for the mean of a group of scores.
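
A minimal sketch of three one-sample tests, again calling SciPy directly. The sample is synthetic and the hypothesized values are assumptions for illustration; scipy.stats.binomtest requires a reasonably recent SciPy release.

```python
import numpy as np
from scipy import stats

# Synthetic sample drawn from a normal distribution with mean 5.
rng = np.random.default_rng(2)
sample = rng.normal(loc=5.0, scale=1.5, size=40)

# Is the population mean equal to 5.0?
t_res = stats.ttest_1samp(sample, popmean=5.0)

# Is the sample compatible with a normal distribution?
sw_res = stats.shapiro(sample)

# Are 7 successes in 20 trials compatible with a success probability of 0.5?
binom_res = stats.binomtest(k=7, n=20, p=0.5)

p_values = (t_res.pvalue, sw_res.pvalue, binom_res.pvalue)
```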

Models

Count

Tool Description
HURDLE_COUNT_MODEL Fits a Hurdle model for count data with a two-stage process (zero vs. positive counts).
ZINB_MODEL Fits a Zero-Inflated Negative Binomial (ZINB) model for overdispersed count data with excess zeros.
ZIP_MODEL Fits a Zero-Inflated Poisson (ZIP) model for count data with excess zeros.

Discrete Choice

Tool Description
LOGIT_MODEL Fits a binary logistic regression model to predict binary outcomes using maximum likelihood estimation.
MULTINOMIAL_LOGIT Fits a multinomial logistic regression model for multi-category outcomes.
ORDERED_LOGIT Fits an ordered logistic regression model for ordinal outcomes.
PROBIT_MODEL Fits a binary probit regression model using maximum likelihood estimation.

Generalized Linear

Tool Description
GLM_BINOMIAL Fits a Generalized Linear Model with binomial family for binary or proportion data.
GLM_GAMMA Fits a Generalized Linear Model with Gamma family for positive continuous data.
GLM_INV_GAUSS Fits a Generalized Linear Model with Inverse Gaussian family for right-skewed positive data.
GLM_NEG_BINOM Fits a Generalized Linear Model with Negative Binomial family for overdispersed count data.
GLM_POISSON Fits a Generalized Linear Model with Poisson family for count data.
GLM_TWEEDIE Fits a Generalized Linear Model with Tweedie family for flexible distribution modeling.

Mixed Effects

Tool Description
GEE_MODEL Fits a Generalized Estimating Equations (GEE) model for correlated data.
GLMM_BINOMIAL Fits a Generalized Linear Mixed Model (GLMM) with binomial family for clustered binary data.
GLMM_POISSON Fits a Generalized Linear Mixed Model (GLMM) with Poisson family for clustered count data.
MIXED_LINEAR_MODEL Fits a Linear Mixed Effects Model (LMM) with random intercepts and slopes.

Regression

Tool Description
GLS_REGRESSION Fits a Generalized Least Squares (GLS) regression model.
INFLUENCE_DIAG Computes regression influence diagnostics for identifying influential observations.
OLS_DIAGNOSTICS Performs diagnostic tests on OLS regression residuals.
OLS_REGRESSION Fits an Ordinary Least Squares (OLS) regression model.
QUANTILE_REGRESSION Fits a quantile regression model to estimate conditional quantiles of the response distribution.
REGRESS_DIAG Performs comprehensive regression diagnostic tests.
ROBUST_LINEAR_MODEL Fits a robust linear regression model using M-estimators.
SPECIFICATION_TESTS Performs regression specification tests to detect model misspecification.
WLS_REGRESSION Fits a Weighted Least Squares (WLS) regression model.

Survival

Tool Description
COX_HAZARDS Fits a Cox Proportional Hazards regression model for survival data.
EXP_SURVIVAL_REG Fits a parametric exponential survival regression model.
KAPLAN_MEIER Computes the Kaplan-Meier survival function estimate for time-to-event data.

Multivariate Analysis

Tool Description
CANCORR Performs Canonical Correlation Analysis (CCA) between two sets of variables.
FACTOR_ANALYSIS Performs exploratory factor analysis with rotation.
MANOVA_TEST Performs Multivariate Analysis of Variance (MANOVA) for multiple dependent variables.
PCA_ANALYSIS Performs Principal Component Analysis (PCA) for dimensionality reduction.

Probability Distributions

Continuous Distributions

Tool Description
BETA Wrapper for scipy.stats.beta distribution providing multiple statistical methods.
CAUCHY Wrapper for scipy.stats.cauchy distribution providing multiple statistical methods.
CHISQ Compute various statistics and functions for the chi-squared distribution from scipy.stats.chi2.
EXPON Exponential distribution function wrapping scipy.stats.expon.
F_DIST Unified interface to the main methods of the F-distribution, including PDF, CDF, inverse CDF, survival function, and distribution statistics.
LAPLACE Laplace distribution function supporting multiple methods.
LOGNORM Compute lognormal distribution statistics and evaluations.
NORM Normal (Gaussian) distribution function supporting multiple methods.
PARETO Generalized Pareto distribution function supporting multiple methods.
T_DIST Student’s t distribution function supporting multiple methods from scipy.stats.t.
UNIFORM Uniform distribution function supporting multiple methods.
WEIBULL_MIN Compute various functions of the Weibull minimum distribution using scipy.stats.weibull_min.
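
The methods these wrappers expose (PDF, CDF, inverse CDF, survival function) correspond to methods on the SciPy distribution objects. A short sketch with direct SciPy calls and arbitrary parameter values:

```python
from scipy import stats

# Inverse CDF: the two-sided 95% critical value of Student's t with 10 df.
t_crit = stats.t.ppf(0.975, df=10)  # about 2.228

# Survival function: P(X > 3.841459) for chi-squared with 1 df, about 0.05.
chi2_tail = stats.chi2.sf(3.841459, df=1)

# CDF of a Weibull minimum distribution with shape c = 1.5 at x = 1,
# which equals 1 - exp(-1) for any shape when the scale is 1.
weibull_cdf = stats.weibull_min.cdf(1.0, c=1.5)

# The survival function and the CDF always sum to one.
norm_check = stats.norm.cdf(1.0) + stats.norm.sf(1.0)
```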

Discrete Distributions

Tool Description
BERNOULLI Calculates properties of a Bernoulli discrete random variable.
BETABINOM Compute Beta-binomial distribution values from scipy.stats.betabinom.
BETANBINOM Compute Beta-negative-binomial distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
BINOM Compute Binomial distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
BOLTZMANN Compute Boltzmann distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
DLAPLACE Compute Discrete Laplace distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
GEOM Compute Geometric distribution values using scipy.stats.geom.
HYPERGEOM Compute Hypergeometric distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
LOGSER Compute Log-Series distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
NBINOM Compute Negative Binomial distribution values using scipy.stats.nbinom.
NHYPERGEOM Compute Negative Hypergeometric distribution values using scipy.stats.nhypergeom.
PLANCK Compute Planck distribution values using scipy.stats.planck.
POISSON_DIST Compute Poisson distribution values using scipy.stats.poisson.
RANDINT Compute discrete uniform distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
SKELLAM Compute Skellam distribution values using scipy.stats.skellam.
YULESIMON Compute Yule-Simon distribution values using scipy.stats.yulesimon.
ZIPF Compute Zipf distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
ZIPFIAN Compute Zipfian distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
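
The quantities named in these descriptions (PMF, CDF, SF, ICDF, ISF, mean, variance) map onto methods of the SciPy distribution objects. A sketch for a binomial distribution with n = 10 and p = 0.3, values chosen arbitrarily:

```python
from scipy import stats

dist = stats.binom(n=10, p=0.3)

pmf_3 = dist.pmf(3)      # P(X = 3)
cdf_3 = dist.cdf(3)      # P(X <= 3)
sf_3 = dist.sf(3)        # P(X > 3), the survival function
median = dist.ppf(0.5)   # inverse CDF (ICDF) at 0.5
isf_half = dist.isf(0.5) # inverse survival function (ISF) at 0.5
mean = dist.mean()       # n * p
variance = dist.var()    # n * p * (1 - p)
std = dist.std()
```

For this distribution the mean is 3.0 and the variance is 2.1, matching n * p and n * p * (1 - p).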

Multivariate Distributions

Tool Description
DIRICHLET Computes the PDF, log-PDF, mean, variance, covariance, entropy, or draws random samples from a Dirichlet distribution.
DIRICHLET_MULTINOM Computes the probability mass function, log probability mass function, mean, variance, or covariance of the Dirichlet multinomial distribution.
MATRIX_NORMAL Computes the PDF, log-PDF, or draws random samples from a matrix normal distribution.
MULTINOMIAL Compute the probability mass function, log-PMF, entropy, covariance, or draw random samples from a multinomial distribution.
MULTIVARIATE_NORMAL Computes the PDF, CDF, log-PDF, log-CDF, entropy, or draws random samples from a multivariate normal distribution.
MULTIVARIATE_T Computes the PDF, CDF, or draws random samples from a multivariate t-distribution.
MV_HYPERGEOM Computes probability mass function, log-PMF, mean, variance, covariance, or draws random samples from a multivariate hypergeometric distribution.
ORTHO_GROUP Draws random samples of orthogonal matrices from the O(N) Haar distribution using scipy.stats.ortho_group.
RANDOM_CORRELATION Generates a random correlation matrix with specified eigenvalues.
SPECIAL_ORTHO_GROUP Draws random samples from the special orthogonal group SO(N), returning orthogonal matrices with determinant +1.
UNIFORM_DIRECTION Draws random unit vectors uniformly distributed on the surface of a hypersphere in the specified dimension.
UNITARY_GROUP Generate a random unitary matrix of dimension N from the Haar distribution.
VONMISES_FISHER Computes the PDF, log-PDF, entropy, or draws random samples from a von Mises-Fisher distribution on the unit hypersphere.
WISHART Computes the PDF, log-PDF, or draws random samples from the Wishart distribution using scipy.stats.wishart.
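
Two of these distributions are sketched below using SciPy directly. The mean vector, covariance matrix, and concentration parameters are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Bivariate normal with correlation 0.5 between the two components.
mvn = stats.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.0]])
pdf_at_origin = mvn.pdf([0.0, 0.0])      # 1 / (2 * pi * sqrt(det(cov)))
draws = mvn.rvs(size=5, random_state=0)  # array of shape (5, 2)

# Dirichlet with all concentration parameters equal to 1 is uniform on the
# simplex, so its density is constant (2.0 in three dimensions).
alpha = np.array([1.0, 1.0, 1.0])
dirichlet_pdf = stats.dirichlet.pdf([0.2, 0.3, 0.5], alpha)
dirichlet_mean = stats.dirichlet.mean(alpha)  # alpha / alpha.sum()
```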

Summary Statistics

Tool Description
DESCRIBE Compute descriptive statistics using scipy.stats.describe.
EFFECT_SIZES Computes effect size measures for comparing two groups.
EXPECTILE Calculates the expectile of a dataset using scipy.stats.expectile.
GMEAN Compute the geometric mean of the input data, flattening the input and ignoring non-numeric values.
HMEAN Calculates the harmonic mean of the input data, flattening the input and ignoring non-numeric values.
KURTOSIS Compute the kurtosis (Fisher or Pearson) of a dataset.
MODE Returns the modal (most common) value in the passed array. Wraps scipy.stats.mode to flatten the input, ignore non-numeric values, and always return a single mode (the smallest if multiple). If no mode is found (all values occur only once), returns an error.
MOMENT Calculates the nth moment about the mean for a sample.
PMEAN Computes the power mean (generalized mean) of the input data for a given power p.
SKEWNESS Calculate the skewness of a dataset.
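
A small sketch of several of these summaries on a four-point dataset, calling SciPy directly. The tools above additionally flatten their input and ignore non-numeric values, which this sketch does not attempt to reproduce.

```python
import numpy as np
from scipy import stats

data = np.array([1.0, 2.0, 4.0, 8.0])

geometric_mean = stats.gmean(data)   # (1 * 2 * 4 * 8) ** (1 / 4)
harmonic_mean = stats.hmean(data)    # 4 / (1/1 + 1/2 + 1/4 + 1/8)
second_moment = stats.moment(data, 2)  # second central moment (divides by n)
skewness = stats.skew(data)
excess_kurtosis = stats.kurtosis(data, fisher=True)  # Fisher: normal gives 0

# describe() bundles size, min/max, mean, variance, skewness, and kurtosis.
summary = stats.describe(data)
```

Note that summary.variance uses n - 1 in the denominator, so it differs from the second central moment, which divides by n.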