Models

Overview

Statistical models are mathematical representations of the process that generates data. They allow us to estimate relationships between variables, predict future outcomes, and understand causal mechanisms. Unlike simple summary statistics, models can account for confounding variables and quantify uncertainty in predictions.

Figure 1: Linear Regression: The most fundamental statistical model. It estimates the linear relationship between a dependent variable Y and independent variable X, minimizing the sum of squared errors.

Key Sub-Disciplines

  • Regression: The cornerstone of modeling. Includes OLS (Ordinary Least Squares) for continuous outcomes and WLS for weighted data.
  • Generalized Linear Models (GLM): Extends linear regression to non-normal response variables.
    • Logistic Regression (LOGIT): For binary outcomes (Yes/No).
    • Poisson Regression: For count data.
  • Discrete Choice: Models for choosing between categorical options (e.g., MNLOGIT for multinomial choices).
  • Survival Analysis: Models the time until an event occurs (e.g., KAPLAN_MEIER).
  • Mixed Effects: Handles hierarchical or clustered data (e.g., students within schools) by including random effects.

Native Excel Capabilities

Excel’s modeling capabilities are primarily limited to: - Linear Regression: LINEST and Analysis ToolPak handle simple and multiple linear regression well. - Visualization: Trendlines in charts are useful for quick checks.

However, significant gaps exist: - No Native GLM: Excel has no built-in function for Logistic Regression or Poisson Regression, forcing users to use Solver manually or buy add-ins. - No Survival Analysis: Kaplan-Meier curves and Cox Proportional Hazards models are widely used in medicine/engineering but absent in Excel. - No Mixed Effects: Analyzing repeated measures or nested data is not possible with standard formulas. - Python Power: The functions here wrap statsmodels, estimating complex models with proper standard errors, p-values, and diagnostic statistics directly in the spreadsheet.

Count

Tool Description
HURDLE_COUNT_MODEL Fits a Hurdle model for count data with two-stage process (zero vs. positive counts).
ZINB_MODEL Fits a Zero-Inflated Negative Binomial (ZINB) model for overdispersed count data with excess zeros.
ZIP_MODEL Fits a Zero-Inflated Poisson (ZIP) model for count data with excess zeros.

Discrete Choice

Tool Description
LOGIT_MODEL Fits a binary logistic regression model to predict binary outcomes using maximum likelihood estimation.
MULTINOMIAL_LOGIT Fits a multinomial logistic regression model for multi-category outcomes.
ORDERED_LOGIT Fits an ordered logistic regression model for ordinal outcomes.
PROBIT_MODEL Fits a binary probit regression model using maximum likelihood estimation.

Generalized Linear

Tool Description
GLM_BINOMIAL Fits a Generalized Linear Model with binomial family for binary or proportion data.
GLM_GAMMA Fit a Generalized Linear Model with Gamma family for positive continuous data.
GLM_INV_GAUSS Fits a Generalized Linear Model with Inverse Gaussian family for right-skewed positive data.
GLM_NEG_BINOM Fits a Generalized Linear Model with Negative Binomial family for overdispersed count data.
GLM_POISSON Fits a Generalized Linear Model with Poisson family for count data.
GLM_TWEEDIE Fits a Generalized Linear Model with Tweedie family for flexible distribution modeling.

Mixed Effects

Tool Description
GEE_MODEL Fits a Generalized Estimating Equations (GEE) model for correlated data.
GLMM_BINOMIAL Fits a Generalized Linear Mixed Model (GLMM) with binomial family for binary clustered data.
GLMM_POISSON Fits a Generalized Linear Mixed Model (GLMM) with Poisson family for count clustered data.
MIXED_LINEAR_MODEL Fits a Linear Mixed Effects Model (LMM) with random intercepts and slopes.

Regression

Tool Description
GLS_REGRESSION Fits a Generalized Least Squares (GLS) regression model.
INFLUENCE_DIAG Computes regression influence diagnostics for identifying influential observations.
OLS_DIAGNOSTICS Performs diagnostic tests on OLS regression residuals.
OLS_REGRESSION Fits an Ordinary Least Squares (OLS) regression model.
QUANTILE_REGRESSION Fits a quantile regression model to estimate conditional quantiles of the response distribution.
REGRESS_DIAG Performs comprehensive regression diagnostic tests.
ROBUST_LINEAR_MODEL Fits a robust linear regression model using M-estimators.
SPECIFICATION_TESTS Performs regression specification tests to detect model misspecification.
WLS_REGRESSION Fits a Weighted Least Squares (WLS) regression model.

Survival

Tool Description
COX_HAZARDS Fits a Cox Proportional Hazards regression model for survival data.
EXP_SURVIVAL_REG Fits a parametric exponential survival regression model.
KAPLAN_MEIER Computes the Kaplan-Meier survival function estimate for time-to-event data.