Statistics
Overview
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. It provides the mathematical framework for understanding uncertainty, making predictions, and drawing conclusions from incomplete information. Whether in science, economics, or engineering, statistical methods are essential for distinguishing signal from noise.
The field is broadly divided into two major categories:
- Descriptive Statistics: Summarizes and organizes data to highlight its main features (e.g., mean, median, variance).
- Inferential Statistics: Uses sample data to make generalizations, estimates, predictions, or decisions about a larger population.
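
The distinction between the two categories can be sketched in a few lines. The example below is only an illustration, assuming NumPy and SciPy are installed; the measurements are made-up values, not data from this documentation.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 12 measurements (illustrative values only)
sample = np.array([4.9, 5.1, 5.0, 5.3, 4.8, 5.2, 5.1, 4.7, 5.0, 5.4, 4.9, 5.2])

# Descriptive statistics: summarize the sample itself
mean = sample.mean()
median = np.median(sample)
variance = sample.var(ddof=1)  # sample variance (n - 1 denominator)

# Inferential statistics: a 95% confidence interval for the population mean,
# based on the t distribution with n - 1 degrees of freedom
sem = stats.sem(sample)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print(f"mean={mean:.3f}, median={median:.3f}, variance={variance:.4f}")
print(f"95% CI for the population mean: ({ci_low:.3f}, {ci_high:.3f})")
```

The first three values describe only the 12 observations; the interval is a statement about the unobserved population they were drawn from.
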
Key Sub-Disciplines
- Hypothesis Testing: A formal procedure for judging whether data are consistent with a stated assumption. A test statistic computed from the sample is compared against its distribution under a null hypothesis to decide whether the observed deviation is statistically significant.
- Models: Statistical models (like regression or GLMs) mathematically represent relationships between variables, supporting prediction and, under suitable assumptions, causal inference.
- Probability Distributions: Mathematical functions that describe the possible values of a random variable and how likely each value (or range of values) is.
- Summary Statistics: Tools for quantitatively describing the main features of a collection of information.
Python Libraries
The functions in this category are built on two widely used statistical libraries in the Python ecosystem:
- SciPy Stats: The workhorse for probability distributions, summary statistics, and standard hypothesis tests.
- Statsmodels: Provides classes and functions for the estimation of many different statistical models (regression, GLM, time series analysis) and conducting statistical tests.
Native Excel Capabilities
Excel has a strong set of basic statistical functions (AVERAGE, STDEV, T.TEST, NORM.DIST) and the Analysis ToolPak for regression and ANOVA. However, it lacks:
- Advanced Modeling: No native support for Generalized Linear Models (GLM), Mixed Effects models, or Survival Analysis.
- Comprehensive Distributions: Many specialized distributions (e.g., Multivariate Normal, Dirichlet, von Mises) are missing.
- Advanced Testing: Robust tests, post-hoc analysis, and specialized non-parametric tests are often absent.
- Unified API: Excel has no equivalent of Python’s object-oriented approach, which treats distributions as objects with consistent .pdf, .cdf, and .rvs methods and so simplifies simulation and analysis workflows.
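
The distribution-as-object idea mentioned in the last item can be sketched with SciPy directly. This is a minimal illustration, assuming SciPy is installed; the parameter values are arbitrary.

```python
from scipy import stats

# A "frozen" distribution object: normal with mean 10 and standard deviation 2
dist = stats.norm(loc=10, scale=2)

density = dist.pdf(10)                       # probability density at x = 10
prob_below = dist.cdf(12)                    # P(X <= 12)
draws = dist.rvs(size=1000, random_state=0)  # 1000 simulated values

# The same method names work unchanged for a different distribution
expo = stats.expon(scale=3)
expo_prob = expo.cdf(3)                      # P(X <= 3) = 1 - exp(-1)

print(density, prob_below, draws.mean(), expo_prob)
```

Because every distribution exposes the same methods, code written against one distribution object can usually be reused with another by swapping the constructor.
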
Frequency Statistics
| Tool | Description |
|---|---|
| BINNED_STATISTIC | Computes a binned statistic (mean, sum, median, etc.) for the input data. |
| BINNED_STATISTIC_2D | Computes a bidimensional binned statistic (mean, sum, median, etc.) for the input data. |
| CUMFREQ | Compute the cumulative frequency histogram for the input data. |
| PERCENTILEOFSCORE | Computes the percentile rank of a score relative to the input data. |
| RELFREQ | Returns the relative frequency histogram for the input data. |
| SCOREATPERCENTILE | Calculates the score at the given percentile of the input data. |
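
As a rough illustration of what two of these tools compute, the sketch below calls the corresponding SciPy functions directly. The argument names and return shapes of the tools themselves are not shown on this page and may differ; the data are made up.

```python
import numpy as np
from scipy import stats

# Hypothetical data: x positions and a value observed at each position
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0])

# Mean of `values` inside each of 2 equal-width bins of x over [0, 8]
result = stats.binned_statistic(x, values, statistic="mean", bins=2, range=(0, 8))
bin_means = result.statistic   # one mean per bin
bin_edges = result.bin_edges   # edges 0, 4, 8

# Percentile rank of the score 4: percentage of x values <= 4
rank = stats.percentileofscore(x, 4, kind="weak")

print(bin_means, bin_edges, rank)
```
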
Hypothesis Tests
Association Correlation
| Tool | Description |
|---|---|
| BARNARD_EXACT | Perform Barnard’s exact test on a 2x2 contingency table. |
| BOSCHLOO_EXACT | Perform Boschloo’s exact test on a 2x2 contingency table. |
| CHI2_CONTINGENCY | Perform the chi-square test of independence for variables in a contingency table. |
| FISHER_EXACT | Perform Fisher’s exact test on a 2x2 contingency table. |
| KENDALLTAU | Calculate Kendall’s tau, a correlation measure for ordinal data. |
| PAGE_TREND_TEST | Perform Page’s L trend test for monotonic trends across treatments. |
| PEARSONR | Calculate the Pearson correlation coefficient and p-value for two datasets. |
| POINTBISERIALR | Calculate a point biserial correlation coefficient and its p-value. |
| SIEGELSLOPES | Compute the Siegel repeated medians estimator for robust linear regression using scipy.stats.siegelslopes. |
| SOMERSD | Calculate Somers’ D, an asymmetric measure of ordinal association between two variables. |
| SPEARMANR | Calculate a Spearman rank-order correlation coefficient with associated p-value. |
| THEILSLOPES | Compute the Theil-Sen estimator for a set of points (robust linear regression). |
| WEIGHTEDTAU | Compute a weighted version of Kendall’s tau correlation coefficient. |
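
The difference between linear and rank-based correlation measures can be seen on a small example. The sketch below uses the SciPy functions that PEARSONR, SPEARMANR, and KENDALLTAU are named after; it assumes SciPy is installed and uses made-up data.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y increases monotonically with x, but not linearly
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = x ** 3

r, r_p = stats.pearsonr(x, y)        # linear association
rho, rho_p = stats.spearmanr(x, y)   # rank-based (monotonic) association
tau, tau_p = stats.kendalltau(x, y)  # rank-based, from concordant pairs

print(f"Pearson r={r:.3f}, Spearman rho={rho:.3f}, Kendall tau={tau:.3f}")
```

Spearman's rho and Kendall's tau both equal 1 here because the ordering of y matches the ordering of x exactly, while Pearson's r is below 1 because the relationship is not a straight line.
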
Independent Sample
| Tool | Description |
|---|---|
| ALEXANDERGOVERN | Performs the Alexander-Govern test for equality of means across multiple independent samples with possible heterogeneity of variance. |
| ANDERSON_KSAMP | Performs the k-sample Anderson-Darling test to determine if samples are drawn from the same population. |
| ANSARI | Performs the Ansari-Bradley test for equal scale parameters (non-parametric) using scipy.stats.ansari. |
| BRUNNERMUNZEL | Computes the Brunner-Munzel nonparametric test for two independent samples. |
| BWS_TEST | Performs the Baumgartner-Weiss-Schindler test on two independent samples. |
| CVM_2SAMP | Performs the two-sample Cramér-von Mises test using scipy.stats.cramervonmises_2samp. |
| DUNNETT | Performs Dunnett’s test for multiple comparisons of means against a control group. |
| EPPS_SINGLE_2SAMP | Compute the Epps-Singleton test statistic and p-value for two samples. |
| F_ONEWAY | Performs a one-way ANOVA test for two or more independent samples. |
| FLIGNER | Performs the Fligner-Killeen test for equality of variances across multiple samples. |
| FRIEDMANCHISQUARE | Computes the Friedman test for repeated samples. |
| KRUSKAL | Computes the Kruskal-Wallis H-test for independent samples. |
| KS_2SAMP | Performs the two-sample Kolmogorov-Smirnov test of whether two samples are drawn from the same distribution. |
| LEVENE | Performs the Levene test for equality of variances across multiple samples. |
| MANNWHITNEYU | Performs the Mann-Whitney U rank test on two independent samples using scipy.stats.mannwhitneyu. |
| MEDIAN_TEST | Performs Mood’s median test to determine if two or more independent samples come from populations with the same median. |
| MOOD | Perform Mood’s two-sample test for scale parameters. |
| POISSON_MEANS_TEST | Performs the Poisson means test (E-test) to compare the means of two Poisson distributions. |
| RANKSUMS | Computes the Wilcoxon rank-sum statistic and p-value for two independent samples. |
| TTEST_IND | Performs the independent two-sample t-test for the means of two groups. |
| TTEST_IND_STATS | Perform a t-test for means of two independent samples using summary statistics. |
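
A parametric and a non-parametric two-sample comparison can be run side by side. The following is a sketch using SciPy directly, with simulated groups whose means, spread, and sizes are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Simulated groups (hypothetical): same spread, different means
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50.0, scale=5.0, size=40)
group_b = rng.normal(loc=58.0, scale=5.0, size=40)

# Welch's t-test: compares means without assuming equal variances
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U: rank-based alternative with no normality assumption
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t={t_stat:.2f}, p={t_p:.2e}")
print(f"Mann-Whitney U={u_stat:.1f}, p={u_p:.2e}")
```

A small p-value from either test is evidence against the null hypothesis that the two groups share the same location; it does not by itself measure how large the difference is.
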
One Sample
| Tool | Description |
|---|---|
| BINOMTEST | Perform a binomial test for the probability of success in a Bernoulli experiment. |
| JARQUE_BERA | Perform the Jarque-Bera goodness of fit test for normality. |
| KSTEST | Performs the one-sample Kolmogorov-Smirnov test for goodness of fit. |
| KURTOSISTEST | Test whether the kurtosis of a sample is different from that of a normal distribution. |
| NORMALTEST | Test whether a sample differs from a normal distribution (omnibus test). |
| QUANTILE_TEST | Perform a quantile test to determine if a population quantile equals a hypothesized value. |
| SHAPIRO | Perform the Shapiro-Wilk test for normality. |
| SKEWTEST | Test whether the skewness of a sample is different from that of a normal distribution. |
| TTEST_1SAMP | Perform a one-sample t-test for the mean of a group of scores. |
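
One-sample tests compare a single sample against a reference value or a reference distribution. The sketch below calls the SciPy functions behind three of the names above; the scores and counts are invented for illustration, and `binomtest` assumes SciPy 1.7 or later.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores (illustrative values only)
scores = np.array([102, 98, 105, 110, 99, 101, 97, 104, 108, 103], dtype=float)

# Is the population mean different from 100?
t_stat, t_p = stats.ttest_1samp(scores, popmean=100.0)

# Is the sample plausibly drawn from a normal distribution?
w_stat, w_p = stats.shapiro(scores)

# Are 7 successes in 20 trials consistent with a success probability of 0.5?
binom_result = stats.binomtest(k=7, n=20, p=0.5)

print(f"t={t_stat:.3f}, p={t_p:.3f}")
print(f"Shapiro-Wilk W={w_stat:.3f}, p={w_p:.3f}")
print(f"binomial test p={binom_result.pvalue:.4f}")
```
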
Models
Count
| Tool | Description |
|---|---|
| HURDLE_COUNT_MODEL | Fits a Hurdle model for count data with two-stage process (zero vs. positive counts). |
| ZINB_MODEL | Fits a Zero-Inflated Negative Binomial (ZINB) model for overdispersed count data with excess zeros. |
| ZIP_MODEL | Fits a Zero-Inflated Poisson (ZIP) model for count data with excess zeros. |
Discrete Choice
| Tool | Description |
|---|---|
| LOGIT_MODEL | Fits a binary logistic regression model to predict binary outcomes using maximum likelihood estimation. |
| MULTINOMIAL_LOGIT | Fits a multinomial logistic regression model for multi-category outcomes. |
| ORDERED_LOGIT | Fits an ordered logistic regression model for ordinal outcomes. |
| PROBIT_MODEL | Fits a binary probit regression model using maximum likelihood estimation. |
Generalized Linear
| Tool | Description |
|---|---|
| GLM_BINOMIAL | Fits a Generalized Linear Model with binomial family for binary or proportion data. |
| GLM_GAMMA | Fit a Generalized Linear Model with Gamma family for positive continuous data. |
| GLM_INV_GAUSS | Fits a Generalized Linear Model with Inverse Gaussian family for right-skewed positive data. |
| GLM_NEG_BINOM | Fits a Generalized Linear Model with Negative Binomial family for overdispersed count data. |
| GLM_POISSON | Fits a Generalized Linear Model with Poisson family for count data. |
| GLM_TWEEDIE | Fits a Generalized Linear Model with Tweedie family for flexible distribution modeling. |
Mixed Effects
| Tool | Description |
|---|---|
| GEE_MODEL | Fits a Generalized Estimating Equations (GEE) model for correlated data. |
| GLMM_BINOMIAL | Fits a Generalized Linear Mixed Model (GLMM) with binomial family for binary clustered data. |
| GLMM_POISSON | Fits a Generalized Linear Mixed Model (GLMM) with Poisson family for count clustered data. |
| MIXED_LINEAR_MODEL | Fits a Linear Mixed Effects Model (LMM) with random intercepts and slopes. |
Regression
| Tool | Description |
|---|---|
| GLS_REGRESSION | Fits a Generalized Least Squares (GLS) regression model. |
| INFLUENCE_DIAG | Computes regression influence diagnostics for identifying influential observations. |
| OLS_DIAGNOSTICS | Performs diagnostic tests on OLS regression residuals. |
| OLS_REGRESSION | Fits an Ordinary Least Squares (OLS) regression model. |
| QUANTILE_REGRESSION | Fits a quantile regression model to estimate conditional quantiles of the response distribution. |
| REGRESS_DIAG | Performs comprehensive regression diagnostic tests. |
| ROBUST_LINEAR_MODEL | Fits a robust linear regression model using M-estimators. |
| SPECIFICATION_TESTS | Performs regression specification tests to detect model misspecification. |
| WLS_REGRESSION | Fits a Weighted Least Squares (WLS) regression model. |
Survival
| Tool | Description |
|---|---|
| COX_HAZARDS | Fits a Cox Proportional Hazards regression model for survival data. |
| EXP_SURVIVAL_REG | Fits a parametric exponential survival regression model. |
| KAPLAN_MEIER | Computes the Kaplan-Meier survival function estimate for time-to-event data. |
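
The Kaplan-Meier (product-limit) estimate can be written out by hand to show what it computes. The function below is a hand-rolled NumPy sketch for illustration, not the implementation behind KAPLAN_MEIER; the function name and the toy data are hypothetical.

```python
import numpy as np

def kaplan_meier(durations, event_observed):
    """Return (event_times, survival) for right-censored durations.

    Illustrative helper: at each distinct event time t the running survival
    probability is multiplied by (1 - deaths_at_t / number_at_risk_at_t).
    """
    durations = np.asarray(durations, dtype=float)
    event_observed = np.asarray(event_observed, dtype=bool)
    event_times = np.unique(durations[event_observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                    # still under observation
        deaths = np.sum((durations == t) & event_observed)  # events exactly at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Toy data: 1 = event observed, 0 = censored at that time
durations = [2, 3, 3, 5, 7, 8]
events = [1, 1, 0, 1, 0, 1]
times, surv = kaplan_meier(durations, events)
print(times)  # event times 2, 3, 5, 8
print(surv)   # 5/6, 2/3, 4/9, 0
```

Censored subjects never trigger a drop in the curve, but they do count in the risk set up to the time they leave the study.
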
Multivariate Analysis
| Tool | Description |
|---|---|
| CANCORR | Performs Canonical Correlation Analysis (CCA) between two sets of variables. |
| FACTOR_ANALYSIS | Performs exploratory factor analysis with rotation. |
| MANOVA_TEST | Performs Multivariate Analysis of Variance (MANOVA) for multiple dependent variables. |
| PCA_ANALYSIS | Performs Principal Component Analysis (PCA) for dimensionality reduction. |
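
Principal Component Analysis can be sketched with a singular value decomposition of the centered data. This is a NumPy-only illustration on simulated data, not the implementation behind PCA_ANALYSIS.

```python
import numpy as np

# Simulated data (hypothetical): 3 variables driven by one latent factor plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
data = np.hstack([latent, 2 * latent, -latent]) + rng.normal(scale=0.1, size=(200, 3))

# Center each column, then decompose
centered = data - data.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Variance captured by each component and its share of the total
explained_variance = singular_values**2 / (len(data) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project the observations onto the principal axes
scores = centered @ components.T
print(explained_ratio)
```

Since a single latent factor drives all three variables, the first component accounts for nearly all of the variance.
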
Probability Distributions
Continuous Distributions
| Tool | Description |
|---|---|
| BETA | Wrapper for scipy.stats.beta distribution providing multiple statistical methods. |
| CAUCHY | Wrapper for scipy.stats.cauchy distribution providing multiple statistical methods. |
| CHISQ | Compute various statistics and functions for the chi-squared distribution from scipy.stats.chi2. |
| EXPON | Exponential distribution function wrapping scipy.stats.expon. |
| F_DIST | Unified interface to the main methods of the F-distribution, including PDF, CDF, inverse CDF, survival function, and distribution statistics. |
| LAPLACE | Laplace distribution function supporting multiple methods. |
| LOGNORM | Compute lognormal distribution statistics and evaluations. |
| NORM | Normal (Gaussian) distribution function supporting multiple methods. |
| PARETO | Generalized Pareto distribution function supporting multiple methods. |
| T_DIST | Student’s t distribution function supporting multiple methods from scipy.stats.t. |
| UNIFORM | Uniform distribution function supporting multiple methods. |
| WEIBULL_MIN | Compute various functions of the Weibull minimum distribution using scipy.stats.weibull_min. |
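
Methods such as the inverse CDF and the survival function are commonly used to look up critical values and tail probabilities. The sketch below uses SciPy directly, where the inverse CDF is called `ppf` and the survival function `sf`; the method names accepted by the tools above may differ.

```python
from scipy import stats

# Critical values via the inverse CDF
z_crit = stats.norm.ppf(0.975)          # two-sided 5% critical value, about 1.96
t_crit = stats.t.ppf(0.975, df=10)      # heavier tails give a larger value
chi2_crit = stats.chi2.ppf(0.95, df=1)  # upper 5% point of chi-squared(1)

# Survival function = 1 - CDF; evaluated at the critical value it returns 0.05
upper_tail = stats.chi2.sf(chi2_crit, df=1)

print(z_crit, t_crit, chi2_crit, upper_tail)
```
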
Discrete Distributions
| Tool | Description |
|---|---|
| BERNOULLI | Calculates properties of a Bernoulli discrete random variable. |
| BETABINOM | Compute Beta-binomial distribution values from scipy.stats.betabinom. |
| BETANBINOM | Compute Beta-negative-binomial distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| BINOM | Compute Binomial distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| BOLTZMANN | Compute Boltzmann distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| DLAPLACE | Compute Discrete Laplace distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| GEOM | Compute Geometric distribution values using scipy.stats.geom. |
| HYPERGEOM | Compute Hypergeometric distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| LOGSER | Compute Log-Series distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| NBINOM | Compute Negative Binomial distribution values using scipy.stats.nbinom. |
| NHYPERGEOM | Compute Negative Hypergeometric distribution values using scipy.stats.nhypergeom. |
| PLANCK | Compute Planck distribution values using scipy.stats.planck. |
| POISSON_DIST | Compute Poisson distribution values using scipy.stats.poisson. |
| RANDINT | Compute Uniform discrete distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| SKELLAM | Compute Skellam distribution values using scipy.stats.skellam. |
| YULESIMON | Compute Yule-Simon distribution values using scipy.stats.yulesimon. |
| ZIPF | Compute Zipf distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| ZIPFIAN | Compute Zipfian distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
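
The quantities listed for the discrete distributions map onto SciPy methods as follows: PMF is `pmf`, CDF is `cdf`, SF is `sf`, ICDF is `ppf`, and ISF is `isf`. The sketch below evaluates them for a binomial distribution with arbitrary parameters; it calls SciPy directly rather than the tools above.

```python
from scipy import stats

n, p = 10, 0.5  # 10 trials, success probability 0.5 (arbitrary example values)

pmf_5 = stats.binom.pmf(5, n, p)       # P(X = 5) = 252/1024
cdf_5 = stats.binom.cdf(5, n, p)       # P(X <= 5) = 638/1024
sf_5 = stats.binom.sf(5, n, p)         # P(X > 5) = 1 - CDF
median = stats.binom.ppf(0.5, n, p)    # smallest k with CDF(k) >= 0.5
mean, var = stats.binom.stats(n, p, moments="mv")  # n*p and n*p*(1-p)

print(pmf_5, cdf_5, sf_5, median, mean, var)
```
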
Multivariate Distributions
| Tool | Description |
|---|---|
| DIRICHLET | Computes the PDF, log-PDF, mean, variance, covariance, entropy, or draws random samples from a Dirichlet distribution. |
| DIRICHLET_MULTINOM | Computes the probability mass function, log probability mass function, mean, variance, or covariance of the Dirichlet multinomial distribution. |
| MATRIX_NORMAL | Computes the PDF, log-PDF, or draws random samples from a matrix normal distribution. |
| MULTINOMIAL | Compute the probability mass function, log-PMF, entropy, covariance, or draw random samples from a multinomial distribution. |
| MULTIVARIATE_NORMAL | Computes the PDF, CDF, log-PDF, log-CDF, entropy, or draws random samples from a multivariate normal distribution. |
| MULTIVARIATE_T | Computes the PDF, CDF, or draws random samples from a multivariate t-distribution. |
| MV_HYPERGEOM | Computes probability mass function, log-PMF, mean, variance, covariance, or draws random samples from a multivariate hypergeometric distribution. |
| ORTHO_GROUP | Draws random samples of orthogonal matrices from the O(N) Haar distribution using scipy.stats.ortho_group. |
| RANDOM_CORRELATION | Generates a random correlation matrix with specified eigenvalues. |
| SPECIAL_ORTHO_GROUP | Draws random samples from the special orthogonal group SO(N), returning orthogonal matrices with determinant +1. |
| UNIFORM_DIRECTION | Draws random unit vectors uniformly distributed on the surface of a hypersphere in the specified dimension. |
| UNITARY_GROUP | Generate a random unitary matrix of dimension N from the Haar distribution. |
| VONMISES_FISHER | Computes the PDF, log-PDF, entropy, or draws random samples from a von Mises-Fisher distribution on the unit hypersphere. |
| WISHART | Computes the PDF, log-PDF, or draws random samples from the Wishart distribution using scipy.stats.wishart. |
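
Multivariate distributions follow the same pattern as univariate ones, with vector or matrix arguments. The sketch below uses SciPy directly with arbitrary parameters, as an illustration of the kinds of outputs described in the table.

```python
import numpy as np
from scipy import stats

# Bivariate normal with correlation 0.5 (arbitrary example parameters)
mvn = stats.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.0]])
density_at_origin = mvn.pdf([0.0, 0.0])       # 1 / (2*pi*sqrt(det(cov)))
samples = mvn.rvs(size=500, random_state=0)   # array of shape (500, 2)

# Mean of a Dirichlet distribution: alpha / sum(alpha)
dirichlet_mean = stats.dirichlet.mean([2.0, 3.0, 5.0])

# Random 3x3 rotation matrix: orthogonal with determinant +1
rotation = stats.special_ortho_group.rvs(3, random_state=0)

print(density_at_origin, samples.shape, dirichlet_mean)
print(np.round(rotation @ rotation.T, 6))
```
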
Summary Statistics
| Tool | Description |
|---|---|
| DESCRIBE | Compute descriptive statistics using scipy.stats.describe. |
| EFFECT_SIZES | Computes effect size measures for comparing two groups. |
| EXPECTILE | Calculates the expectile of a dataset using scipy.stats.expectile. |
| GMEAN | Compute the geometric mean of the input data, flattening the input and ignoring non-numeric values. |
| HMEAN | Calculates the harmonic mean of the input data, flattening the input and ignoring non-numeric values. |
| KURTOSIS | Compute the kurtosis (Fisher or Pearson) of a dataset. |
| MODE | Returns the modal (most common) value in the passed array. Wraps scipy.stats.mode to flatten the input, ignore non-numeric values, and always return a single mode (the smallest if multiple). If no mode is found (all values occur only once), returns an error. |
| MOMENT | Calculates the nth moment about the mean for a sample. |
| PMEAN | Computes the power mean (generalized mean) of the input data for a given power p. |
| SKEWNESS | Calculate the skewness of a dataset. |
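
Several of these summaries can be compared on one small dataset. The sketch below calls the underlying SciPy functions directly; unlike the tools described above, it does no flattening or filtering of non-numeric input, and the data are made up.

```python
import numpy as np
from scipy import stats

data = np.array([1.0, 2.0, 4.0, 8.0])  # illustrative values only

summary = stats.describe(data)   # nobs, minmax, mean, variance, skewness, kurtosis
geometric = stats.gmean(data)    # (1*2*4*8) ** (1/4) = 2 ** 1.5
harmonic = stats.hmean(data)     # 4 / (1 + 1/2 + 1/4 + 1/8)
skewness = stats.skew(data)
excess_kurtosis = stats.kurtosis(data, fisher=True)  # 0 for a normal distribution

print(summary)
print(f"arithmetic={summary.mean:.4f}, geometric={geometric:.4f}, harmonic={harmonic:.4f}")
```

For positive data the arithmetic mean is never smaller than the geometric mean, which in turn is never smaller than the harmonic mean, as the printed values show.
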