GLM_NEG_BINOM
Overview
The GLM_NEG_BINOM function fits a Generalized Linear Model (GLM) using the Negative Binomial family, designed specifically for modeling overdispersed count data. Overdispersion occurs when the observed variance in count data exceeds the mean—a common scenario in real-world datasets such as insurance claims, hospital visits, species counts, and event frequencies where the Poisson assumption of equal mean and variance does not hold.
This implementation uses the statsmodels library’s GLM module with the NegativeBinomial family. The Negative Binomial distribution in statsmodels corresponds to the NB2 parameterization, where the variance is a quadratic function of the mean. For a count variable Y with mean \mu, the variance is given by:
\text{Var}(Y) = \mu + \alpha \mu^2
where \alpha is the dispersion parameter (alpha_param). When \alpha = 0, the model reduces to Poisson regression. Larger values of \alpha indicate greater overdispersion relative to the Poisson distribution.
The probability mass function for the Negative Binomial distribution is:
f(y) = \frac{\Gamma(y + 1/\alpha)}{y! \, \Gamma(1/\alpha)} \left(\frac{1}{1 + \alpha\mu}\right)^{1/\alpha} \left(\frac{\alpha\mu}{1 + \alpha\mu}\right)^y
The function supports three link functions: log (default), negative binomial, and complementary log-log. The log link is most commonly used, as it ensures predicted counts remain positive and allows coefficients to be interpreted as incidence rate ratios (IRR) when exponentiated. The model returns coefficient estimates, standard errors, z-statistics, p-values, confidence intervals, and IRRs for each predictor, along with model fit statistics including deviance, Pearson chi-squared, AIC, BIC, and log-likelihood.
For more information, see the statsmodels GLM documentation and the statsmodels GitHub repository.
This example function is provided as-is without any representation of accuracy.
Excel Usage
=GLM_NEG_BINOM(y, x, alpha_param, glm_neg_binom_link, fit_intercept, alpha)
y(list[list], required): Dependent variable containing overdispersed count data (column vector).x(list[list], required): Independent variables (predictors). Each column represents a different predictor.alpha_param(float, optional, default: 1): Dispersion parameter controlling overdispersion. Higher values indicate greater overdispersion.glm_neg_binom_link(str, optional, default: “log”): Link function for the model.fit_intercept(bool, optional, default: true): If TRUE, an intercept term is added to the model.alpha(float, optional, default: 0.05): Significance level for confidence intervals. Must be between 0 and 1.
Returns (list[list]): 2D list with GLM results and statistics, or error string.
Examples
Example 1: Demo case 1
Inputs:
| y | x |
|---|---|
| 2 | 1 |
| 3 | 1.5 |
| 5 | 2 |
| 7 | 2.5 |
| 11 | 3 |
| 13 | 3.5 |
| 17 | 4 |
| 19 | 4.5 |
Excel formula:
=GLM_NEG_BINOM({2;3;5;7;11;13;17;19}, {1;1.5;2;2.5;3;3.5;4;4.5})
Expected output:
| parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper | incidence_rate_ratio |
|---|---|---|---|---|---|---|---|
| intercept | 0.232 | 1.036 | 0.2239 | 0.8228 | -1.799 | 2.263 | 1.261 |
| x1 | 0.6518 | 0.3371 | 1.933 | 0.0532 | -0.008994 | 1.313 | 1.919 |
| deviance | 0.1327 | ||||||
| pearson_chi2 | 0.131 | ||||||
| aic | 53.64 | ||||||
| bic | 53.8 | ||||||
| log_likelihood | -24.82 | ||||||
| dispersion | 1 |
Example 2: Demo case 2
Inputs:
| y | x | fit_intercept |
|---|---|---|
| 10 | 1 | false |
| 20 | 2 | |
| 30 | 3 | |
| 40 | 4 | |
| 50 | 5 | |
| 60 | 6 |
Excel formula:
=GLM_NEG_BINOM({10;20;30;40;50;60}, {1;2;3;4;5;6}, FALSE)
Expected output:
| parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper | incidence_rate_ratio |
|---|---|---|---|---|---|---|---|
| x1 | 0.9435 | 0.106 | 8.9 | 5.592e-19 | 0.7357 | 1.151 | 2.569 |
| deviance | 6.502 | ||||||
| pearson_chi2 | 11.1 | ||||||
| aic | 61.53 | ||||||
| bic | 61.32 | ||||||
| log_likelihood | -29.77 | ||||||
| dispersion | 1 |
Example 3: Demo case 3
Inputs:
| y | x | alpha_param |
|---|---|---|
| 5 | 1 | 0.5 |
| 8 | 2 | |
| 12 | 3 | |
| 15 | 4 | |
| 20 | 5 | |
| 25 | 6 | |
| 30 | 7 |
Excel formula:
=GLM_NEG_BINOM({5;8;12;15;20;25;30}, {1;2;3;4;5;6;7}, 0.5)
Expected output:
| parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper | incidence_rate_ratio |
|---|---|---|---|---|---|---|---|
| intercept | 1.5 | 0.6673 | 2.247 | 0.02462 | 0.1917 | 2.807 | 4.48 |
| x1 | 0.2873 | 0.1451 | 1.98 | 0.04771 | 0.002905 | 0.5717 | 1.333 |
| deviance | 0.1054 | ||||||
| pearson_chi2 | 0.1024 | ||||||
| aic | 50.79 | ||||||
| bic | 50.68 | ||||||
| log_likelihood | -23.39 | ||||||
| dispersion | 1 |
Example 4: Demo case 4
Inputs:
| y | x | alpha_param | glm_neg_binom_link | fit_intercept | alpha |
|---|---|---|---|---|---|
| 3 | 1 | 2 | log | true | 0.1 |
| 5 | 1.5 | ||||
| 8 | 2 | ||||
| 10 | 2.5 | ||||
| 14 | 3 | ||||
| 18 | 3.5 | ||||
| 22 | 4 | ||||
| 26 | 4.5 |
Excel formula:
=GLM_NEG_BINOM({3;5;8;10;14;18;22;26}, {1;1.5;2;2.5;3;3.5;4;4.5}, 2, "log", TRUE, 0.1)
Expected output:
| parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper | incidence_rate_ratio |
|---|---|---|---|---|---|---|---|
| intercept | 0.7404 | 1.357 | 0.5455 | 0.5854 | -1.492 | 2.973 | 2.097 |
| x1 | 0.5962 | 0.4506 | 1.323 | 0.1857 | -0.1449 | 1.337 | 1.815 |
| deviance | 0.0576 | ||||||
| pearson_chi2 | 0.05489 | ||||||
| aic | 65.17 | ||||||
| bic | 65.33 | ||||||
| log_likelihood | -30.59 | ||||||
| dispersion | 1 |
Python Code
import math
import statsmodels.api as sm
from statsmodels.genmod.families import NegativeBinomial as sm_NegativeBinomial
from statsmodels.genmod.families.links import Log as sm_Log, NegativeBinomial as sm_NBinomLink, CLogLog as sm_CLogLog
def glm_neg_binom(y, x, alpha_param=1, glm_neg_binom_link='log', fit_intercept=True, alpha=0.05):
"""
Fits a Generalized Linear Model with Negative Binomial family for overdispersed count data.
See: https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html
This example function is provided as-is without any representation of accuracy.
Args:
y (list[list]): Dependent variable containing overdispersed count data (column vector).
x (list[list]): Independent variables (predictors). Each column represents a different predictor.
alpha_param (float, optional): Dispersion parameter controlling overdispersion. Higher values indicate greater overdispersion. Default is 1.
glm_neg_binom_link (str, optional): Link function for the model. Valid options: Log, Negative Binomial, Complementary Log-Log. Default is 'log'.
fit_intercept (bool, optional): If TRUE, an intercept term is added to the model. Default is True.
alpha (float, optional): Significance level for confidence intervals. Must be between 0 and 1. Default is 0.05.
Returns:
list[list]: 2D list with GLM results and statistics, or error string.
"""
def to2d(value):
"""Convert scalar or list to 2D list."""
return [[value]] if not isinstance(value, list) else value
# Normalize inputs
y = to2d(y)
x = to2d(x)
# Validate y is a column vector
if not isinstance(y, list) or len(y) == 0:
return "Invalid input: y must be a non-empty 2D list."
for row in y:
if not isinstance(row, list) or len(row) != 1:
return "Invalid input: y must be a column vector (2D list with single column)."
# Validate x is a 2D matrix
if not isinstance(x, list) or len(x) == 0:
return "Invalid input: x must be a non-empty 2D list."
num_rows = len(x)
num_cols = len(x[0]) if isinstance(x[0], list) else 0
if num_cols == 0:
return "Invalid input: x must be a 2D list with at least one column."
for row in x:
if not isinstance(row, list) or len(row) != num_cols:
return "Invalid input: x must have consistent number of columns."
# Check dimensions match
if len(y) != num_rows:
return "Invalid input: y and x must have the same number of rows."
# Validate alpha_param
if not isinstance(alpha_param, (int, float)):
return "Invalid input: alpha_param must be a number."
if math.isnan(alpha_param) or math.isinf(alpha_param):
return "Invalid input: alpha_param must be finite."
if alpha_param <= 0:
return "Invalid input: alpha_param must be positive."
# Validate link function
valid_links = ['log', 'nbinom', 'cloglog']
if not isinstance(glm_neg_binom_link, str):
return "Invalid input: glm_neg_binom_link must be a string."
if glm_neg_binom_link not in valid_links:
return f"Invalid input: glm_neg_binom_link must be one of {valid_links}."
# Validate fit_intercept
if not isinstance(fit_intercept, bool):
return "Invalid input: fit_intercept must be a boolean."
# Validate alpha
if not isinstance(alpha, (int, float)):
return "Invalid input: alpha must be a number."
if math.isnan(alpha) or math.isinf(alpha):
return "Invalid input: alpha must be finite."
if alpha <= 0 or alpha >= 1:
return "Invalid input: alpha must be between 0 and 1."
# Convert y and x to numeric arrays
try:
y_values = [float(row[0]) for row in y]
except (ValueError, TypeError):
return "Invalid input: y must contain numeric values."
try:
x_values = [[float(val) for val in row] for row in x]
except (ValueError, TypeError):
return "Invalid input: x must contain numeric values."
# Check for non-finite values
for val in y_values:
if math.isnan(val) or math.isinf(val):
return "Invalid input: y must contain finite values."
for row in x_values:
for val in row:
if math.isnan(val) or math.isinf(val):
return "Invalid input: x must contain finite values."
# Add intercept if requested
if fit_intercept:
x_values = [[1.0] + row for row in x_values]
# Select link function
if glm_neg_binom_link == 'log':
link = sm_Log()
elif glm_neg_binom_link == 'nbinom':
link = sm_NBinomLink()
elif glm_neg_binom_link == 'cloglog':
link = sm_CLogLog()
# Create NegativeBinomial family
family = sm_NegativeBinomial(alpha=alpha_param, link=link)
# Fit GLM
try:
model = sm.GLM(y_values, x_values, family=family)
results = model.fit()
except Exception as e:
return f"statsmodels.genmod.generalized_linear_model.GLM error: {e}"
# Extract results
params = results.params
std_errors = results.bse
z_stats = results.tvalues
p_values = results.pvalues
conf_int = results.conf_int(alpha=alpha)
# Build results table
output = [['parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper', 'incidence_rate_ratio']]
# Add parameter results
for i in range(len(params)):
if fit_intercept and i == 0:
param_name = 'intercept'
else:
predictor_idx = i if not fit_intercept else i - 1
param_name = f'x{predictor_idx + 1}'
coef = float(params[i])
std_err = float(std_errors[i])
z_stat = float(z_stats[i])
p_val = float(p_values[i])
ci_low = float(conf_int[i, 0])
ci_high = float(conf_int[i, 1])
irr = math.exp(coef)
output.append([param_name, coef, std_err, z_stat, p_val, ci_low, ci_high, irr])
# Add model statistics
output.append(['deviance', float(results.deviance), '', '', '', '', '', ''])
output.append(['pearson_chi2', float(results.pearson_chi2), '', '', '', '', '', ''])
output.append(['aic', float(results.aic), '', '', '', '', '', ''])
output.append(['bic', float(results.bic_llf), '', '', '', '', '', ''])
output.append(['log_likelihood', float(results.llf), '', '', '', '', '', ''])
# Calculate dispersion (phi)
dispersion = results.scale
output.append(['dispersion', float(dispersion), '', '', '', '', '', ''])
return output