Statistics
Overview
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. It provides the mathematical framework for understanding uncertainty, making predictions, and drawing conclusions from incomplete information. Whether in science, economics, or engineering, statistical methods are essential for distinguishing signal from noise and extracting meaningful insights from data.
Background and Importance: Statistics is fundamental to science, business, engineering, and policy-making. It enables researchers and practitioners to quantify uncertainty, test hypotheses rigorously, model complex relationships in data, and make evidence-based decisions. The field has roots in probability theory and has evolved to encompass sophisticated computational methods for analyzing everything from large-scale datasets to small, carefully designed experiments.
Two Core Pillars: The field divides into Descriptive Statistics, which summarizes and organizes data to reveal main features (mean, median, variance, distribution shape), and Inferential Statistics, which uses sample data to generalize about populations, make predictions, and test theories. These complementary approaches work together: descriptive statistics summarize what you observe, while inferential statistics help you understand what it means.
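The split between the two pillars can be illustrated with a minimal sketch. The sample below is synthetic (the seed, mean, and spread are arbitrary assumptions); the descriptive step summarizes the sample itself, while the inferential step uses it to bound the unknown population mean.

```python
import numpy as np
from scipy import stats

# Synthetic sample: 50 hypothetical measurements (values are illustrative only)
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=50)

# Descriptive: summarize what was observed
summary = stats.describe(sample)
sample_mean = summary.mean
sample_var = summary.variance  # unbiased estimate (ddof=1)

# Inferential: 95% confidence interval for the population mean,
# based on the t-distribution with n - 1 degrees of freedom
sem = stats.sem(sample)
ci_low, ci_high = stats.t.interval(
    0.95, df=len(sample) - 1, loc=sample_mean, scale=sem
)

print(f"mean={sample_mean:.3f}, variance={sample_var:.3f}")
print(f"95% CI for the population mean: ({ci_low:.3f}, {ci_high:.3f})")
```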
Implementation Framework: Statistical analysis in Python is powered primarily by SciPy, NumPy, and Statsmodels. SciPy provides the foundational distributions and statistical tests, NumPy enables efficient numerical computations on arrays, and Statsmodels offers higher-level model fitting and diagnostics. These libraries are built on decades of research and are widely used across academia and industry.
Hypothesis Testing and Inference: One of statistics’ most powerful applications is hypothesis testing—a formal process for assessing whether observed differences or patterns could arise by chance. Tests range from simple comparisons of two groups using t-tests to sophisticated approaches for categorical data (chi-square tests), non-parametric alternatives (Mann-Whitney U test), and specialized tests for association and correlation. Choosing the right test depends on data type, sample size, and assumptions about underlying distributions.
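As a rough sketch of how the choice of test follows the data type, the example below runs a Welch t-test and a Mann-Whitney U test on two synthetic continuous samples, and a chi-square test of independence on a made-up 2x2 table. All data here are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(0.0, 1.0, size=40)  # synthetic control group
group_b = rng.normal(0.8, 1.0, size=40)  # synthetic treatment group

# Parametric comparison of means (Welch's variant, no equal-variance assumption)
t_res = stats.ttest_ind(group_a, group_b, equal_var=False)

# Non-parametric alternative based on ranks
u_res = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Categorical data: chi-square test of independence on a 2x2 table
table = np.array([[30, 10],
                  [20, 20]])
chi2_stat, chi2_p, dof, expected = stats.chi2_contingency(table)

print(f"t-test p={t_res.pvalue:.4f}, Mann-Whitney p={u_res.pvalue:.4f}")
print(f"chi-square p={chi2_p:.4f}, dof={dof}")
```

The expected counts returned by the chi-square test are what the table would look like under independence, given the same row and column totals.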
Probability Distributions: The foundation of statistical inference rests on probability distributions. Continuous distributions like the normal (Gaussian) distribution, t-distribution, F-distribution, and exponential distribution model continuous measurements. Discrete distributions like binomial, Poisson, and negative binomial handle count data. Multivariate distributions extend these concepts to multiple variables simultaneously, essential for modeling joint behavior in multivariate datasets. Understanding the properties of these distributions—their shape, tail behavior, and moments—is crucial for selecting appropriate models.
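A brief sketch of querying distribution properties with scipy.stats follows. The specific arguments (1.96, 5 degrees of freedom, a Poisson rate of 4) are arbitrary choices for illustration.

```python
from scipy import stats

# Normal distribution: CDF and its inverse (percent point function)
p = stats.norm.cdf(1.96)    # probability mass below 1.96
q = stats.norm.ppf(0.975)   # quantile for cumulative probability 0.975

# Tail behavior: the t-distribution puts more mass in the tails than the normal
t_tail = stats.t.sf(3.0, df=5)
n_tail = stats.norm.sf(3.0)

# Moments of a discrete distribution: Poisson mean and variance both equal mu
pois_mean, pois_var = stats.poisson.stats(mu=4.0, moments="mv")

# Binomial PMF: probability of exactly 3 successes in 10 fair trials
binom_pmf = stats.binom.pmf(3, n=10, p=0.5)

print(p, q, t_tail, n_tail, pois_mean, pois_var, binom_pmf)
```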
Advanced Statistical Modeling: Beyond basic hypothesis tests, regression models investigate relationships between variables: ordinary least squares (OLS) for linear relationships, quantile regression for conditional quantiles, and robust methods for data with outliers. Generalized Linear Models (GLMs) extend regression to non-normal response variables, handling binary outcomes (logistic regression), counts (Poisson regression), and survival times. Mixed effects models incorporate random variation from multiple sources, which is essential when data has a hierarchical or clustered structure. Survival analysis offers specialized tools such as Kaplan-Meier estimators and Cox proportional hazards models for time-to-event data, which is common in medicine and engineering.
Multivariate and Dimensionality Reduction: When analyzing many variables simultaneously, techniques like Principal Component Analysis (PCA) reduce dimensions while preserving variance, factor analysis uncovers latent structures, canonical correlation analysis (CCA) finds relationships between variable sets, and MANOVA tests differences across multiple outcomes. These methods reveal hidden patterns and simplify interpretation of high-dimensional data.
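To make the idea of "reducing dimensions while preserving variance" concrete, here is a small PCA sketch built directly on NumPy's singular value decomposition. It is not the implementation behind any tool listed on this page, and the synthetic data (one latent factor driving three variables) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Three observed variables driven by a single latent factor plus small noise
latent = rng.normal(size=(300, 1))
noise = 0.1 * rng.normal(size=(300, 3))
data = latent @ np.array([[1.0, 2.0, -1.0]]) + noise

# PCA: center the data, then take the SVD of the centered matrix
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Variance captured by each principal component
explained_variance = s**2 / (len(data) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project onto the first two components
scores = centered @ Vt[:2].T

print("explained variance ratio:", explained_ratio)
```

Because a single factor generates almost all of the variation, the first component should dominate the explained variance ratio.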
Frequency Statistics
| Tool | Description |
|---|---|
| BINNED_STATISTIC | Computes a binned statistic (mean, sum, median, etc.) for the input data. |
| BINNED_STATISTIC_2D | Computes a bidimensional binned statistic (mean, sum, median, etc.) for the input data. |
| CUMFREQ | Compute the cumulative frequency histogram for the input data. |
| PERCENTILEOFSCORE | Computes the percentile rank of a score relative to the input data. |
| RELFREQ | Returns the relative frequency histogram for the input data. |
| SCOREATPERCENTILE | Calculates the score at the given percentile of the input data. |
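The sketch below calls the similarly named scipy.stats functions directly. That the tools above wrap these functions is an assumption, and the tools' own call signatures are not shown on this page.

```python
from scipy import stats

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Percentage of values less than or equal to the score
rank = stats.percentileofscore(data, 7, kind="weak")

# Mean of `values` within two equal-width bins of `x`
x = [1, 2, 3, 4, 5, 6]
values = [10, 20, 30, 40, 50, 60]
binned = stats.binned_statistic(x, values, statistic="mean", bins=2)

# Cumulative frequency histogram; the last entry counts every observation
cum = stats.cumfreq(data, numbins=5)

print(rank, binned.statistic, cum.cumcount)
```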
Hypothesis Tests
Association and Correlation
| Tool | Description |
|---|---|
| BARNARD_EXACT | Perform Barnard’s exact test on a 2x2 contingency table. |
| BOSCHLOO_EXACT | Perform Boschloo’s exact test on a 2x2 contingency table. |
| CHI2_CONTINGENCY | Perform the chi-square test of independence for variables in a contingency table. |
| FISHER_EXACT | Perform Fisher’s exact test on a 2x2 contingency table. |
| KENDALLTAU | Calculate Kendall’s tau, a correlation measure for ordinal data. |
| PAGE_TREND_TEST | Perform Page’s L trend test for monotonic trends across treatments. |
| PEARSONR | Calculate the Pearson correlation coefficient and p-value for two datasets. |
| POINTBISERIALR | Calculate a point biserial correlation coefficient and its p-value. |
| SIEGELSLOPES | Compute the Siegel repeated medians estimator for robust linear regression using scipy.stats.siegelslopes. |
| SOMERSD | Calculate Somers’ D, an asymmetric measure of ordinal association between two variables. |
| SPEARMANR | Calculate a Spearman rank-order correlation coefficient with associated p-value. |
| THEILSLOPES | Compute the Theil-Sen estimator for a set of points (robust linear regression). |
| WEIGHTEDTAU | Compute a weighted version of Kendall’s tau correlation coefficient. |
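As an illustration of how these measures differ, the following sketch uses scipy.stats directly (an assumption: the tools' own signatures are not documented here) on a small, perfectly monotonic but non-linear dataset, plus a made-up 2x2 table for Fisher's exact test.

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # y = x**2: monotonic but not linear

r, r_p = stats.pearsonr(x, y)        # linear association
rho, rho_p = stats.spearmanr(x, y)   # rank-based association
tau, tau_p = stats.kendalltau(x, y)  # concordance of pairs

# Fisher's exact test on an illustrative 2x2 contingency table
odds_ratio, fisher_p = stats.fisher_exact([[8, 2], [1, 5]])

print(f"pearson={r:.3f}, spearman={rho:.3f}, kendall={tau:.3f}")
print(f"odds ratio={odds_ratio}, p={fisher_p:.4f}")
```

The rank-based coefficients reach 1 because the relationship is perfectly monotonic, while Pearson's r stays below 1 because it is not linear.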
Independent Sample
| Tool | Description |
|---|---|
| ALEXANDERGOVERN | Performs the Alexander-Govern test for equality of means across multiple independent samples with possible heterogeneity of variance. |
| ANDERSON_KSAMP | Performs the k-sample Anderson-Darling test to determine if samples are drawn from the same population. |
| ANSARI | Performs the Ansari-Bradley test for equal scale parameters (non-parametric) using scipy.stats.ansari. |
| BRUNNERMUNZEL | Computes the Brunner-Munzel nonparametric test for two independent samples. |
| BWS_TEST | Performs the Baumgartner-Weiss-Schindler test on two independent samples. |
| CVM_2SAMP | Performs the two-sample Cramér-von Mises test using scipy.stats.cramervonmises_2samp. |
| DUNNETT | Performs Dunnett’s test for multiple comparisons of means against a control group. |
| EPPS_SINGLE_2SAMP | Compute the Epps-Singleton test statistic and p-value for two samples. |
| F_ONEWAY | Performs a one-way ANOVA test for two or more independent samples. |
| FLIGNER | Performs the Fligner-Killeen test for equality of variances across multiple samples. |
| FRIEDMANCHISQUARE | Computes the Friedman test for repeated samples. |
| KRUSKAL | Computes the Kruskal-Wallis H-test for independent samples. |
| KS_2SAMP | Performs the two-sample Kolmogorov-Smirnov test for goodness of fit. |
| LEVENE | Performs the Levene test for equality of variances across multiple samples. |
| MANNWHITNEYU | Performs the Mann-Whitney U rank test on two independent samples using scipy.stats.mannwhitneyu. |
| MEDIAN_TEST | Performs Mood’s median test to determine if two or more independent samples come from populations with the same median. |
| MOOD | Perform Mood’s two-sample test for scale parameters. |
| POISSON_MEANS_TEST | Performs the Poisson means test (E-test) to compare the means of two Poisson distributions. |
| RANKSUMS | Computes the Wilcoxon rank-sum statistic and p-value for two independent samples. |
| TTEST_IND | Performs the independent two-sample t-test for the means of two groups. |
| TTEST_IND_STATS | Perform a t-test for means of two independent samples using summary statistics. |
One Sample
| Tool | Description |
|---|---|
| BINOMTEST | Perform a binomial test for the probability of success in a Bernoulli experiment. |
| JARQUE_BERA | Perform the Jarque-Bera goodness of fit test for normality. |
| KSTEST | Performs the one-sample Kolmogorov-Smirnov test for goodness of fit. |
| KURTOSISTEST | Test whether the kurtosis of a sample is different from that of a normal distribution. |
| NORMALTEST | Test whether a sample differs from a normal distribution (omnibus test). |
| QUANTILE_TEST | Perform a quantile test to determine if a population quantile equals a hypothesized value. |
| SHAPIRO | Perform the Shapiro-Wilk test for normality. |
| SKEWTEST | Test whether the skewness of a sample is different from that of a normal distribution. |
| TTEST_1SAMP | Perform a one-sample t-test for the mean of a group of scores. |
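A short sketch of three one-sample tests using the underlying scipy.stats functions follows. The normal sample is synthetic, and the binomial counts (7 successes in 10 trials) are chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=5.0, scale=1.0, size=100)  # synthetic data

# Shapiro-Wilk: is the sample consistent with a normal distribution?
shapiro_stat, shapiro_p = stats.shapiro(sample)

# One-sample t-test: is the population mean equal to 5?
t_stat, t_p = stats.ttest_1samp(sample, popmean=5.0)

# Exact binomial test: 7 successes in 10 trials against p = 0.5
binom_res = stats.binomtest(k=7, n=10, p=0.5)

print(f"Shapiro-Wilk W={shapiro_stat:.4f}, p={shapiro_p:.4f}")
print(f"t-test p={t_p:.4f}, binomial test p={binom_res.pvalue:.5f}")
```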
Models
Count
| Tool | Description |
|---|---|
| HURDLE_COUNT_MODEL | Fits a Hurdle model for count data with a two-stage process (zero vs. positive counts). |
| ZINB_MODEL | Fits a Zero-Inflated Negative Binomial (ZINB) model for overdispersed count data with excess zeros. |
| ZIP_MODEL | Fits a Zero-Inflated Poisson (ZIP) model for count data with excess zeros. |
Discrete Choice
| Tool | Description |
|---|---|
| LOGIT_MODEL | Fits a binary logistic regression model to predict binary outcomes using maximum likelihood estimation. |
| MULTINOMIAL_LOGIT | Fits a multinomial logistic regression model for multi-category outcomes. |
| ORDERED_LOGIT | Fits an ordered logistic regression model for ordinal outcomes. |
| PROBIT_MODEL | Fits a binary probit regression model using maximum likelihood estimation. |
Generalized Linear
| Tool | Description |
|---|---|
| GLM_BINOMIAL | Fits a Generalized Linear Model with binomial family for binary or proportion data. |
| GLM_GAMMA | Fit a Generalized Linear Model with Gamma family for positive continuous data. |
| GLM_INV_GAUSS | Fits a Generalized Linear Model with Inverse Gaussian family for right-skewed positive data. |
| GLM_NEG_BINOM | Fits a Generalized Linear Model with Negative Binomial family for overdispersed count data. |
| GLM_POISSON | Fits a Generalized Linear Model with Poisson family for count data. |
| GLM_TWEEDIE | Fits a Generalized Linear Model with Tweedie family for flexible distribution modeling. |
Mixed Effects
| Tool | Description |
|---|---|
| GEE_MODEL | Fits a Generalized Estimating Equations (GEE) model for correlated data. |
| GLMM_BINOMIAL | Fits a Generalized Linear Mixed Model (GLMM) with binomial family for binary clustered data. |
| GLMM_POISSON | Fits a Generalized Linear Mixed Model (GLMM) with Poisson family for count clustered data. |
| MIXED_LINEAR_MODEL | Fits a Linear Mixed Effects Model (LMM) with random intercepts and slopes. |
Regression
| Tool | Description |
|---|---|
| GLS_REGRESSION | Fits a Generalized Least Squares (GLS) regression model. |
| INFLUENCE_DIAG | Computes regression influence diagnostics for identifying influential observations. |
| OLS_DIAGNOSTICS | Performs diagnostic tests on OLS regression residuals. |
| OLS_REGRESSION | Fits an Ordinary Least Squares (OLS) regression model. |
| QUANTILE_REGRESSION | Fits a quantile regression model to estimate conditional quantiles of the response distribution. |
| REGRESS_DIAG | Performs comprehensive regression diagnostic tests. |
| ROBUST_LINEAR_MODEL | Fits a robust linear regression model using M-estimators. |
| SPECIFICATION_TESTS | Performs regression specification tests to detect model misspecification. |
| WLS_REGRESSION | Fits a Weighted Least Squares (WLS) regression model. |
Survival
| Tool | Description |
|---|---|
| COX_HAZARDS | Fits a Cox Proportional Hazards regression model for survival data. |
| EXP_SURVIVAL_REG | Fits a parametric exponential survival regression model. |
| KAPLAN_MEIER | Computes the Kaplan-Meier survival function estimate for time-to-event data. |
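To show what a Kaplan-Meier estimate computes, here is a hand-written sketch in NumPy. The function name `kaplan_meier` and the toy data are hypothetical; this is not the implementation behind the KAPLAN_MEIER tool.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Product-limit estimate of the survival function.

    durations: time until the event or censoring for each subject
    observed:  1/True if the event occurred, 0/False if censored
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    event_times = np.unique(durations[observed])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                # still under observation at t
        events = np.sum((durations == t) & observed)    # events occurring exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Five subjects; the second subject at time 3 and the one at time 8 are censored
times, surv = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print(times, surv)
```

Censored subjects contribute to the at-risk count until they drop out, but never trigger a step down in the curve.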
Multivariate Analysis
| Tool | Description |
|---|---|
| CANCORR | Performs Canonical Correlation Analysis (CCA) between two sets of variables. |
| FACTOR_ANALYSIS | Performs exploratory factor analysis with rotation. |
| MANOVA_TEST | Performs Multivariate Analysis of Variance (MANOVA) for multiple dependent variables. |
| PCA_ANALYSIS | Performs Principal Component Analysis (PCA) for dimensionality reduction. |
Probability Distributions
Continuous Distributions
| Tool | Description |
|---|---|
| BETA | Wrapper for scipy.stats.beta distribution providing multiple statistical methods. |
| CAUCHY | Wrapper for scipy.stats.cauchy distribution providing multiple statistical methods. |
| CHISQ | Compute various statistics and functions for the chi-squared distribution from scipy.stats.chi2. |
| EXPON | Exponential distribution function wrapping scipy.stats.expon. |
| F_DIST | Unified interface to the main methods of the F-distribution, including PDF, CDF, inverse CDF, survival function, and distribution statistics. |
| LAPLACE | Laplace distribution function supporting multiple methods. |
| LOGNORM | Compute lognormal distribution statistics and evaluations. |
| NORM | Normal (Gaussian) distribution function supporting multiple methods. |
| PARETO | Generalized Pareto distribution function supporting multiple methods. |
| T_DIST | Student’s t distribution function supporting multiple methods from scipy.stats.t. |
| UNIFORM | Uniform distribution function supporting multiple methods. |
| WEIBULL_MIN | Compute various functions of the Weibull minimum distribution using scipy.stats.weibull_min. |
Discrete Distributions
| Tool | Description |
|---|---|
| BERNOULLI | Calculates properties of a Bernoulli discrete random variable. |
| BETABINOM | Compute Beta-binomial distribution values from scipy.stats.betabinom. |
| BETANBINOM | Compute Beta-negative-binomial distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| BINOM | Compute Binomial distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| BOLTZMANN | Compute Boltzmann distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| DLAPLACE | Compute Discrete Laplace distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| GEOM | Compute Geometric distribution values using scipy.stats.geom. |
| HYPERGEOM | Compute Hypergeometric distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| LOGSER | Compute Log-Series distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| NBINOM | Compute Negative Binomial distribution values using scipy.stats.nbinom. |
| NHYPERGEOM | Compute Negative Hypergeometric distribution values using scipy.stats.nhypergeom. |
| PLANCK | Compute Planck distribution values using scipy.stats.planck. |
| POISSON_DIST | Compute Poisson distribution values using scipy.stats.poisson. |
| RANDINT | Compute Uniform discrete distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| SKELLAM | Compute Skellam distribution values using scipy.stats.skellam. |
| YULESIMON | Compute Yule-Simon distribution values using scipy.stats.yulesimon. |
| ZIPF | Compute Zipf distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
| ZIPFIAN | Compute Zipfian distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median. |
Multivariate Distributions
| Tool | Description |
|---|---|
| DIRICHLET | Computes the PDF, log-PDF, mean, variance, covariance, entropy, or draws random samples from a Dirichlet distribution. |
| DIRICHLET_MULTINOM | Computes the probability mass function, log probability mass function, mean, variance, or covariance of the Dirichlet multinomial distribution. |
| MATRIX_NORMAL | Computes the PDF, log-PDF, or draws random samples from a matrix normal distribution. |
| MULTINOMIAL | Compute the probability mass function, log-PMF, entropy, covariance, or draw random samples from a multinomial distribution. |
| MULTIVARIATE_NORMAL | Computes the PDF, CDF, log-PDF, log-CDF, entropy, or draws random samples from a multivariate normal distribution. |
| MULTIVARIATE_T | Computes the PDF, CDF, or draws random samples from a multivariate t-distribution. |
| MV_HYPERGEOM | Computes probability mass function, log-PMF, mean, variance, covariance, or draws random samples from a multivariate hypergeometric distribution. |
| ORTHO_GROUP | Draws random samples of orthogonal matrices from the O(N) Haar distribution using scipy.stats.ortho_group. |
| RANDOM_CORRELATION | Generates a random correlation matrix with specified eigenvalues. |
| SPECIAL_ORTHO_GROUP | Draws random samples from the special orthogonal group SO(N), returning orthogonal matrices with determinant +1. |
| UNIFORM_DIRECTION | Draws random unit vectors uniformly distributed on the surface of a hypersphere in the specified dimension. |
| UNITARY_GROUP | Generate a random unitary matrix of dimension N from the Haar distribution. |
| VONMISES_FISHER | Computes the PDF, log-PDF, entropy, or draws random samples from a von Mises-Fisher distribution on the unit hypersphere. |
| WISHART | Computes the PDF, log-PDF, or draws random samples from the Wishart distribution using scipy.stats.wishart. |
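A minimal sketch of a few of the corresponding scipy.stats multivariate distributions is shown below. The parameters are arbitrary, and calling scipy directly rather than the tools listed above is an assumption of this example.

```python
import numpy as np
from scipy import stats

# Standard bivariate normal: density at the origin is 1 / (2 * pi)
mvn = stats.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.0], [0.0, 1.0]])
pdf_at_origin = mvn.pdf([0.0, 0.0])

# Dirichlet mean: alpha normalized by its sum
dir_mean = stats.dirichlet.mean([1.0, 2.0, 3.0])

# Random 3x3 orthogonal matrix drawn from the Haar distribution on O(3)
Q = stats.ortho_group.rvs(3, random_state=0)

print(pdf_at_origin, dir_mean)
print(np.round(Q @ Q.T, 6))  # should be close to the identity matrix
```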
Summary Statistics
| Tool | Description |
|---|---|
| DESCRIBE | Compute descriptive statistics using the scipy.stats.describe function. |
| EFFECT_SIZES | Computes effect size measures for comparing two groups. |
| EXPECTILE | Calculates the expectile of a dataset using scipy.stats.expectile. |
| GMEAN | Compute the geometric mean of the input data, flattening the input and ignoring non-numeric values. |
| HMEAN | Calculates the harmonic mean of the input data, flattening the input and ignoring non-numeric values. |
| KURTOSIS | Compute the kurtosis (Fisher or Pearson) of a dataset. |
| MODE | Returns the modal (most common) value in the passed array. |
| MOMENT | Calculates the nth moment about the mean for a sample. |
| PMEAN | Computes the power mean (generalized mean) of the input data for a given power p. |
| SKEWNESS | Calculate the skewness of a dataset. |
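The following sketch uses the underlying scipy.stats functions (assumed here, not the tools' own interfaces) to compare the different means and the two kurtosis conventions on small illustrative inputs.

```python
import numpy as np
from scipy import stats

data = [1.0, 2.0, 4.0]

arithmetic = np.mean(data)
geometric = stats.gmean(data)  # cube root of 1 * 2 * 4
harmonic = stats.hmean(data)   # 3 / (1/1 + 1/2 + 1/4)

# Kurtosis conventions: Fisher (normal -> 0) vs. Pearson (normal -> 3)
rng = np.random.default_rng(5)
x = rng.normal(size=500)  # synthetic sample
fisher_kurt = stats.kurtosis(x, fisher=True)
pearson_kurt = stats.kurtosis(x, fisher=False)

print(f"harmonic={harmonic:.4f} <= geometric={geometric:.4f} <= arithmetic={arithmetic:.4f}")
print(f"Fisher kurtosis={fisher_kurt:.4f}, Pearson kurtosis={pearson_kurt:.4f}")
```

For positive data the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean; the two kurtosis conventions differ by exactly 3.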