Statistics

Overview

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. It provides the mathematical framework for understanding uncertainty, making predictions, and drawing conclusions from incomplete information. Whether in science, economics, or engineering, statistical methods are essential for distinguishing signal from noise and extracting meaningful insights from data.

Background and Importance: Statistics is fundamental to science, business, engineering, and policy-making. It enables researchers and practitioners to quantify uncertainty, test hypotheses rigorously, model complex relationships in data, and make evidence-based decisions. The field has roots in probability theory and has evolved to encompass sophisticated computational methods for analyzing everything from large-scale datasets to small, carefully designed experiments.

Two Core Pillars: The field divides into Descriptive Statistics, which summarizes and organizes data to reveal main features (mean, median, variance, distribution shape), and Inferential Statistics, which uses sample data to generalize about populations, make predictions, and test theories. These complementary approaches work together: descriptive statistics summarize what you observe, while inferential statistics help you understand what it means.
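
The two pillars can be illustrated with a short sketch (the sample values below are invented for illustration): descriptive statistics summarize the sample itself, while a one-sample t-test draws an inference about the population the sample came from.

```python
import numpy as np
from scipy import stats

# Descriptive: summarize what we observe in the sample
sample = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0])
mean = sample.mean()
median = np.median(sample)
var = sample.var(ddof=1)          # unbiased sample variance

# Inferential: could the population mean plausibly be 5.0?
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
```

A large p-value here means the data are consistent with the hypothesized population mean; it does not prove the mean equals 5.0.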

Implementation Framework: Statistical analysis in Python is powered primarily by SciPy, NumPy, and Statsmodels. SciPy provides the foundational distributions and statistical tests, NumPy enables efficient numerical computations on arrays, and Statsmodels offers higher-level model fitting and diagnostics. These libraries are built on decades of research and are widely used across academia and industry.

Hypothesis Testing and Inference: One of statistics’ most powerful applications is hypothesis testing—a formal process for assessing whether observed differences or patterns could arise by chance. Tests range from simple comparisons of two groups using t-tests to sophisticated approaches for categorical data (chi-square tests), non-parametric alternatives (Mann-Whitney U test), and specialized tests for association and correlation. Choosing the right test depends on data type, sample size, and assumptions about underlying distributions.
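
As a hedged sketch of this choice, the same two simulated groups can be compared with a parametric Welch's t-test and its non-parametric Mann-Whitney U alternative (the data here are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.5, scale=2.0, size=30)

# Parametric: Welch's t-test (does not assume equal variances)
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Non-parametric alternative when normality is doubtful
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)
```

When both tests agree, the conclusion is robust to the normality assumption; when they disagree, the assumptions deserve a closer look.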

Probability Distributions: The foundation of statistical inference rests on probability distributions. Continuous distributions like the normal (Gaussian) distribution, t-distribution, F-distribution, and exponential distribution model continuous measurements. Discrete distributions like binomial, Poisson, and negative binomial handle count data. Multivariate distributions extend these concepts to multiple variables simultaneously, essential for modeling joint behavior in multivariate datasets. Understanding the properties of these distributions—their shape, tail behavior, and moments—is crucial for selecting appropriate models.
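
A minimal illustration using SciPy's distribution objects (the specific parameter values are arbitrary): continuous distributions expose a density, CDF, and quantile function, while discrete distributions expose a probability mass function.

```python
from scipy import stats

# Continuous: standard normal density, CDF, and quantile
pdf_at_0 = stats.norm.pdf(0)          # density at the mean
p_below = stats.norm.cdf(1.96)        # P(Z <= 1.96)
z_crit = stats.norm.ppf(0.975)        # 97.5% quantile, used in two-sided tests

# Discrete: probability of exactly 3 events when the Poisson rate is 2
p_three = stats.poisson.pmf(3, mu=2)
```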

Advanced Statistical Modeling: Beyond basic hypothesis tests, regression models investigate relationships between variables: ordinary least squares (OLS) for linear relationships, quantile regression for conditional quantiles, and robust methods for data with outliers. Generalized Linear Models (GLMs) extend regression to non-normal response variables, handling binary outcomes (logistic regression), counts (Poisson regression), and survival times. Mixed effects models incorporate random variation from multiple sources, essential when data has a hierarchical or clustered structure. Survival analysis provides specialized tools, such as Kaplan-Meier estimators and Cox proportional hazards models, for the time-to-event data common in medicine and engineering.
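
The core OLS computation can be sketched with NumPy alone; Statsmodels adds standard errors, p-values, and diagnostics on top of this. The simulated data below are for illustration only:

```python
import numpy as np

# Simulated data following y = 2 + 3x + noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=100)

# OLS: solve for the coefficients minimizing the sum of squared residuals
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta
```

With 100 observations and modest noise, the estimates land close to the true intercept of 2 and slope of 3.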

Multivariate and Dimensionality Reduction: When analyzing many variables simultaneously, techniques like Principal Component Analysis (PCA) reduce dimensions while preserving variance, factor analysis uncovers latent structures, canonical correlation analysis (CCA) finds relationships between variable sets, and MANOVA tests differences across multiple outcomes. These methods reveal hidden patterns and simplify interpretation of high-dimensional data.
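
PCA can be sketched directly via the singular value decomposition of centered data. This minimal NumPy example uses synthetic data in which a single latent factor drives three observed variables, so the first component should capture nearly all the variance:

```python
import numpy as np

rng = np.random.default_rng(1)
# 200 observations of 3 correlated variables sharing one latent factor
latent = rng.normal(size=(200, 1))
data = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

# PCA: center the data, then take the SVD; rows of Vt are principal axes
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
explained_var_ratio = S**2 / np.sum(S**2)
scores = centered @ Vt.T   # data projected onto the principal components
```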

Figure 1: Statistical Foundations. (A) The standard normal distribution showing the probability density function and critical regions used in hypothesis testing. The shaded red areas (beyond ±1.96) represent rejection regions at the 0.05 significance level. (B) Distribution comparison showing how different continuous distributions (normal, exponential, chi-squared) with similar means can have drastically different shapes and tail behavior, illustrating the importance of choosing the right distribution for your data.

Bayesian

Conjugate Priors

Tool Description
BB_LOGBETA Compute the log-Beta term used in conjugate posterior calculations.
BB_POST_UPDATE Update Beta-Binomial posterior hyperparameters from observed counts.
BB_QBETA Compute a Beta posterior quantile for Beta-Binomial models.
GAMMA_POST_Q Compute a Gamma posterior quantile from shape-rate parameters.
INVGAMMA_POST_Q Compute an inverse-Gamma posterior quantile.
NIG_POST_UPDATE Update Normal-Inverse-Gamma posterior hyperparameters from sample summaries.
NN_POST_UPDATE Update Normal posterior parameters for unknown mean with known variance.
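
A minimal sketch of the conjugate mathematics these tools presumably implement, for the Beta-Binomial case: an update in the style of BB_POST_UPDATE followed by a posterior quantile in the style of BB_QBETA (the prior hyperparameters and observed counts are invented):

```python
from scipy import stats

# Beta(a, b) prior on a success probability, updated with observed counts
a_prior, b_prior = 1.0, 1.0          # uniform prior
successes, failures = 27, 13

# Conjugate update: the posterior is Beta(a + successes, b + failures)
a_post = a_prior + successes
b_post = b_prior + failures

post_mean = a_post / (a_post + b_post)
post_median = stats.beta.ppf(0.5, a_post, b_post)   # posterior quantile
```

Conjugacy is what makes the update a simple addition of counts to hyperparameters; no numerical integration is needed.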

Credible Intervals

Tool Description
BAYES_MVS_CI Compute Bayesian credible intervals for mean, variance, and standard deviation from sample data.
BETA_CI_BOUNDS Compute an equal-tailed Bayesian credible interval for a proportion using a Beta posterior.
GAMMA_CI_BOUNDS Compute an equal-tailed Bayesian credible interval for a positive rate parameter using Gamma quantiles.
INVGAMMA_CI_BOUNDS Compute an equal-tailed Bayesian credible interval for a positive scale or variance parameter using Inverse-Gamma quantiles.
MVSDIST_CI Compute Bayesian credible intervals from posterior distributions of mean, variance, and standard deviation.
SAMPLE_EQTAIL_CI Compute an equal-tailed credible interval from posterior samples using empirical quantiles.
SAMPLE_HPD_CI Approximate a highest posterior density interval from posterior samples using the narrowest empirical window.
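
Equal-tailed intervals of the kind computed by BETA_CI_BOUNDS and SAMPLE_EQTAIL_CI can be sketched with SciPy and NumPy; the first uses exact posterior quantiles, the second empirical quantiles of posterior draws (the posterior parameters are invented):

```python
import numpy as np
from scipy import stats

# Equal-tailed 95% credible interval for a proportion with a Beta(28, 14) posterior
lo, hi = stats.beta.ppf([0.025, 0.975], 28, 14)

# The same interval estimated from posterior samples via empirical quantiles
rng = np.random.default_rng(0)
samples = rng.beta(28, 14, size=100_000)
s_lo, s_hi = np.quantile(samples, [0.025, 0.975])
```

With enough draws the sample-based bounds converge to the exact quantiles; the sample route is what remains available when the posterior has no closed form.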

Dirichlet Multinomial

Tool Description
DM_CRED_INT Compute category-wise credible intervals from posterior Dirichlet parameters.
DM_DIRICHLET_SUM Compute Dirichlet density and moments for a category-probability vector.
DM_LOGBETA Compute the Dirichlet log-normalization term using log-gamma values.
DM_LOGSUM_NORM Compute a stable log normalizer and normalized probabilities from log-values.
DM_POST_UPDATE Update Dirichlet posterior parameters from prior hyperparameters and observed counts.
DM_PREDICTIVE Compute posterior predictive category probabilities from Dirichlet parameters.
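
A hedged sketch of the Dirichlet-Multinomial update and predictive step, in the style of DM_POST_UPDATE and DM_PREDICTIVE (the prior and counts are invented):

```python
import numpy as np

# Dirichlet(alpha) prior over 3 category probabilities, updated with counts
alpha_prior = np.array([1.0, 1.0, 1.0])
counts = np.array([50, 30, 20])

# Conjugate update: the posterior is Dirichlet(alpha + counts)
alpha_post = alpha_prior + counts

# Posterior predictive probability of each category (the posterior mean)
predictive = alpha_post / alpha_post.sum()
```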

Posterior Summarization

Tool Description
POSTERIOR_BMV Compute Bayesian posterior summaries for mean, variance, and standard deviation.
POSTERIOR_ENTROPY Compute Shannon or relative entropy for posterior probability tables.
POSTERIOR_LOGSUMEXP Compute stable log-sum-exp aggregates for posterior normalization and evidence calculations.
POSTERIOR_MAP Extract the MAP estimate from a tabulated posterior distribution.
POSTERIOR_TAILPROB Compute posterior tail probabilities relative to a decision threshold.
POSTERIOR_WMEANVAR Compute posterior weighted mean and variance summaries from values and weights.
POSTERIOR_XLOGY Compute numerically stable x times log y terms for posterior information calculations.
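
The stable log-sum-exp trick behind tools like POSTERIOR_LOGSUMEXP can be sketched as follows; subtracting the maximum before exponentiating avoids the underflow that would otherwise zero out every term:

```python
import numpy as np

# Unnormalized log-posterior values that would underflow if exponentiated naively
log_post = np.array([-1000.0, -1001.0, -1002.0])

# Stable log-sum-exp: shift by the maximum before exponentiating
m = log_post.max()
log_evidence = m + np.log(np.sum(np.exp(log_post - m)))

# Normalized posterior probabilities, computed entirely in log space
probs = np.exp(log_post - log_evidence)
```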

Frequency Statistics

Tool Description
BINNED_STATISTIC Compute a binned statistic (mean, sum, median, etc.) for the input data.
BINNED_STATISTIC_2D Compute a two-dimensional binned statistic (mean, sum, median, etc.) for the input data.
CUMFREQ Compute the cumulative frequency histogram for the input data.
PERCENTILEOFSCORE Compute the percentile rank of a score relative to the input data.
RELFREQ Compute the relative frequency histogram for the input data.
SCOREATPERCENTILE Compute the score at the given percentile of the input data.
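
The percentile-rank and score-at-percentile operations are inverses of one another; a short SciPy sketch (the sample data are invented):

```python
from scipy import stats

data = [55, 62, 68, 70, 73, 75, 78, 81, 85, 92]

# Percentile rank of a score: what fraction of the data falls at or below it
rank = stats.percentileofscore(data, 75)

# Score at a given percentile: here the 50th percentile, i.e. the median
score = stats.scoreatpercentile(data, 50)
```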

Hypothesis Tests

Tool Description
ANOVA Perform one-way ANOVA on tabular data using Pingouin.
GAMESHOWELL Run Games-Howell pairwise comparisons using Pingouin.
HOMOSCEDASTICITY Test equality of variances across groups using Pingouin.
MIXED_ANOVA Perform mixed ANOVA with within- and between-subject factors using Pingouin.
NORMALITY Test normality by group or overall using Pingouin.
PAIRWISE_TUKEY Run Tukey HSD pairwise comparisons using Pingouin.
RM_ANOVA Perform repeated-measures ANOVA on tabular data using Pingouin.
WELCH_ANOVA Perform Welch ANOVA for unequal variances using Pingouin.
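
The underlying one-way ANOVA computation can be sketched with SciPy's f_oneway; the Pingouin-based ANOVA tool returns a richer results table, but the F statistic and p-value are the same quantity (the three groups below are simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group1 = rng.normal(10, 2, size=25)
group2 = rng.normal(12, 2, size=25)
group3 = rng.normal(11, 2, size=25)

# One-way ANOVA: do the group means differ more than chance would explain?
f_stat, p_value = stats.f_oneway(group1, group2, group3)
```

If the homoscedasticity assumption fails, the Welch variant is the appropriate fallback, mirroring the WELCH_ANOVA tool above.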

Models

Tool Description
MEDIATION_ANALYSIS Perform causal mediation analysis with bootstrap confidence intervals.

Multivariate Analysis

Tool Description
CANCORR Performs Canonical Correlation Analysis (CCA) between two sets of variables.
FACTOR_ANALYSIS Performs exploratory factor analysis with rotation.
MANOVA_TEST Performs Multivariate Analysis of Variance (MANOVA) for multiple dependent variables.
PCA_ANALYSIS Performs Principal Component Analysis (PCA) for dimensionality reduction.

Probability Distributions

Continuous Distributions

Tool Description
BETA Compute Beta distribution values using scipy.stats.beta, supporting multiple methods.
CAUCHY Compute Cauchy distribution values using scipy.stats.cauchy, supporting multiple methods.
CHISQ Compute chi-squared distribution values using scipy.stats.chi2.
EXPON Compute Exponential distribution values using scipy.stats.expon.
F_DIST Compute F-distribution values, including PDF, CDF, inverse CDF, survival function, and distribution statistics.
LAPLACE Compute Laplace distribution values, supporting multiple methods.
LOGNORM Compute Lognormal distribution statistics and evaluations.
NORM Compute Normal (Gaussian) distribution values, supporting multiple methods.
PARETO Compute Pareto distribution values, supporting multiple methods.
T_DIST Compute Student’s t distribution values using scipy.stats.t.
UNIFORM Compute Uniform distribution values, supporting multiple methods.
WEIBULL_MIN Compute Weibull minimum distribution values using scipy.stats.weibull_min.

Discrete Distributions

Tool Description
BERNOULLI Compute properties of a Bernoulli discrete random variable.
BETABINOM Compute Beta-binomial distribution values from scipy.stats.betabinom.
BETANBINOM Compute Beta-negative-binomial distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
BINOM Compute Binomial distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
BOLTZMANN Compute Boltzmann distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
DLAPLACE Compute Discrete Laplace distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
GEOM Compute Geometric distribution values using scipy.stats.geom.
HYPERGEOM Compute Hypergeometric distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
LOGSER Compute Log-Series distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
NBINOM Compute Negative Binomial distribution values using scipy.stats.nbinom.
NHYPERGEOM Compute Negative Hypergeometric distribution values using scipy.stats.nhypergeom.
PLANCK Compute Planck distribution values using scipy.stats.planck.
POISSON_DIST Compute Poisson distribution values using scipy.stats.poisson.
RANDINT Compute Uniform discrete distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
SKELLAM Compute Skellam distribution values using scipy.stats.skellam.
VAL_DISCRETE Select a value from a list based on a discrete probability distribution.
YULESIMON Compute Yule-Simon distribution values using scipy.stats.yulesimon.
ZIPF Compute Zipf distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.
ZIPFIAN Compute Zipfian distribution values: PMF, CDF, SF, ICDF, ISF, mean, variance, std, or median.

Multivariate Distributions

Tool Description
DIRICHLET Compute the PDF, log-PDF, mean, variance, covariance, entropy, or draw random samples from a Dirichlet distribution.
DIRICHLET_MULTINOM Compute the probability mass function, log-PMF, mean, variance, or covariance of the Dirichlet-multinomial distribution.
MATRIX_NORMAL Compute the PDF, log-PDF, or draw random samples from a matrix normal distribution.
MULTINOMIAL Compute the probability mass function, log-PMF, entropy, covariance, or draw random samples from a multinomial distribution.
MULTIVARIATE_NORMAL Compute the PDF, CDF, log-PDF, log-CDF, entropy, or draw random samples from a multivariate normal distribution.
MULTIVARIATE_T Compute the PDF, CDF, or draw random samples from a multivariate t-distribution.
MV_HYPERGEOM Compute the probability mass function, log-PMF, mean, variance, covariance, or draw random samples from a multivariate hypergeometric distribution.
ORTHO_GROUP Draw random samples of orthogonal matrices from the O(N) Haar distribution using scipy.stats.ortho_group.
RANDOM_CORRELATION Generate a random correlation matrix with specified eigenvalues.
SPECIAL_ORTHO_GROUP Draw random samples from the special orthogonal group SO(N), returning orthogonal matrices with determinant +1.
UNIFORM_DIRECTION Draw random unit vectors uniformly distributed on the surface of a hypersphere in the specified dimension.
UNITARY_GROUP Generate a random unitary matrix of dimension N from the Haar distribution.
VONMISES_FISHER Compute the PDF, log-PDF, entropy, or draw random samples from a von Mises-Fisher distribution on the unit hypersphere.
WISHART Compute the PDF, log-PDF, or draw random samples from the Wishart distribution using scipy.stats.wishart.

Summary Statistics

Tool Description
CRONBACH_ALPHA Compute Cronbach’s alpha reliability coefficient for a set of items.
DESCRIBE Compute descriptive statistics using scipy.stats.describe.
DISTANCE_CORR Compute distance correlation between two numeric variables.
EXPECTILE Compute the expectile of a dataset using scipy.stats.expectile.
GMEAN Compute the geometric mean of the input data, flattening the input and ignoring non-numeric values.
HMEAN Compute the harmonic mean of the input data, flattening the input and ignoring non-numeric values.
KURTOSIS Compute the kurtosis (Fisher or Pearson) of a dataset.
MODE Return the modal (most common) numeric value in the input data, returning the smallest if there are multiple modes.
MOMENT Compute the nth moment about the mean for a sample.
PARTIAL_CORR Compute partial or semi-partial correlation between two variables.
PMEAN Compute the power mean (generalized mean) of the input data for a given power p.
SKEWNESS Compute the skewness of a dataset.

Time Series

Autocorrelation And Stationarity Tests

Tool Description
ACF Compute autocorrelation values across lags with optional confidence intervals and Ljung-Box statistics.
ACOVF Estimate autocovariance values of a time series across lags.
ADFULLER Run the Augmented Dickey-Fuller unit-root test for stationarity diagnostics.
CCF Compute cross-correlation between two time series across nonnegative lags.
CCOVF Estimate cross-covariance values between two time series across lags.
KPSS Run the KPSS stationarity test under level or trend null hypotheses.
PACF Compute partial autocorrelation values across lags for lag-order diagnostics.
Q_STAT Compute Ljung-Box Q statistics and p-values from autocorrelation coefficients.
RURTEST Run the range unit-root test as an alternative stationarity diagnostic.
ZIVOT_ANDREWS Run the Zivot-Andrews unit-root test allowing one endogenous structural break.
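
The sample autocorrelation these diagnostics build on can be sketched in plain NumPy for a simulated AR(1) series; a strongly persistent series shows large positive autocorrelation at lag 1:

```python
import numpy as np

rng = np.random.default_rng(3)
# AR(1) series x[t] = phi * x[t-1] + noise, with strong persistence
n, phi = 500, 0.8
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

# Sample autocorrelation at lags 0..5
xc = x - x.mean()
denom = np.sum(xc**2)
acf = np.array([np.sum(xc[: n - k] * xc[k:]) / denom for k in range(6)])
```

For a true AR(1) process the theoretical autocorrelation at lag k is phi**k, so the estimates should decay geometrically.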

Decomposition And Seasonality

Tool Description
DETREND Remove linear or constant trend from input data.
MSTL Perform multi-seasonal STL decomposition on a time series.
PERIODOGRAM Estimate the power spectral density of a time series using a periodogram.
SEASDECOMP Decompose a time series into trend, seasonal, and residual components.
STL Perform STL decomposition of a univariate time series.
WELCH Estimate the power spectral density of a time series using Welch’s method.

Forecasting Models

Tool Description
ARIMA_FORECAST Fit an ARIMA model and return out-of-sample forecasts.
ARMA_ORDER_IC Select ARMA order using an information criterion.
HANNAN_RISSANEN Estimate ARMA parameters using the Hannan-Rissanen procedure.
HOLT_FORECAST Fit Holt trend exponential smoothing and return forecasts.
HW_FORECAST Fit Holt-Winters exponential smoothing and return forecasts.
INNOVATIONS_MLE Estimate SARIMA parameters using innovations maximum likelihood.
SARIMAX_FORECAST Fit a SARIMAX model and return out-of-sample forecasts.
SES_FORECAST Fit simple exponential smoothing and return forecasts.

Moving Averages

Tool Description
EMA_LFILTER Compute an exponential moving average using recursive linear filtering.
EMA_PERIOD Compute an exponential moving average using a period-derived smoothing constant.
SMA_CONV Compute a simple moving average using discrete convolution with a uniform window.
SMA_CUMSUM Compute a simple moving average using cumulative-sum differencing.
WINMA_CONV Compute a weighted moving average by convolving data with a user-defined weight window.
WMA Compute a rolling weighted moving average using user-supplied weights.
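
The SMA_CONV and SMA_CUMSUM approaches can be sketched as follows (the price series is invented). Both produce identical results; the cumulative-sum form avoids the convolution and scales better for long series:

```python
import numpy as np

prices = np.array([10.0, 11.0, 12.0, 11.5, 12.5, 13.0, 12.0, 13.5])
window = 3

# SMA via convolution with a uniform window ('valid' keeps fully covered points)
sma_conv = np.convolve(prices, np.ones(window) / window, mode="valid")

# The same SMA via cumulative-sum differencing
csum = np.cumsum(np.concatenate([[0.0], prices]))
sma_cumsum = (csum[window:] - csum[:-window]) / window
```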