ZINB_MODEL
Overview
The ZINB_MODEL function fits a Zero-Inflated Negative Binomial (ZINB) regression model for count data that exhibits both overdispersion and excess zeros. Zero-inflated models are appropriate when the observed number of zeros exceeds what would be expected from a standard count distribution, a situation that arises in many real-world scenarios such as insurance claims, healthcare utilization, ecological surveys, and manufacturing defects.
The ZINB model is a mixture distribution that combines two data-generating processes. The first process, modeled with a binary outcome (logit or probit), determines whether an observation is a “structural zero” (always zero) or comes from the count process. The second process uses a negative binomial distribution to generate counts, which naturally accommodates overdispersion where the variance exceeds the mean. This two-part structure allows the model to handle both the excess zeros and the overdispersion commonly observed in count data.
This implementation uses the statsmodels library, specifically the ZeroInflatedNegativeBinomialP class. The function estimates separate coefficients for the count process (predicting the magnitude of non-zero counts) and the inflation process (predicting the probability of structural zeros). The negative binomial component includes a dispersion parameter that controls the variance structure, with the P-parameterization allowing flexibility in modeling different types of overdispersion. For more information, see the statsmodels ZINB documentation and the statsmodels GitHub repository.
The mathematical formulation for the ZINB model is:
\Pr(Y = 0) = \pi + (1 - \pi) \left(\frac{\alpha}{\mu + \alpha}\right)^{\alpha}
\Pr(Y = y_i) = (1 - \pi) \frac{\Gamma(y_i + \alpha)}{\Gamma(\alpha) \cdot y_i!} \left(\frac{\alpha}{\mu + \alpha}\right)^{\alpha} \left(\frac{\mu}{\mu + \alpha}\right)^{y_i}, \quad y_i = 1, 2, 3, \ldots
where \pi is the probability of being in the zero-inflated group (structural zero), \mu is the expected count from the negative binomial process, and \alpha is the dispersion parameter.
This example function is provided as-is without any representation of accuracy.
Excel Usage
=ZINB_MODEL(y, x, x_inflate, fit_intercept, alpha)
y(list[list], required): Dependent variable containing overdispersed count data (non-negative integers).x(list[list], required): Independent variables for the count process. Each column represents a predictor.x_inflate(list[list], optional, default: null): Independent variables for the zero-inflation process. If not specified, defaults to the same variables as x.fit_intercept(bool, optional, default: true): If true, adds an intercept term to both the count and zero-inflation processes.alpha(float, optional, default: 0.05): Significance level for confidence intervals (must be between 0 and 1).
Returns (list[list]): 2D list with model results and statistics, or error string.
Examples
Example 1: Basic ZINB model with single predictor
Inputs:
| x | y |
|---|---|
| 0 | 0 |
| 0.5 | 0 |
| 1 | 0 |
| 1.5 | 1 |
| 2 | 2 |
| 2.5 | 3 |
| 3 | 0 |
| 3.5 | 1 |
| 4 | 4 |
| 4.5 | 5 |
| 5 | 0 |
| 5.5 | 2 |
| 6 | 3 |
| 6.5 | 6 |
| 7 | 0 |
| 7.5 | 1 |
| 8 | 2 |
| 8.5 | 7 |
| 9 | 8 |
| 9.5 | 4 |
Excel formula:
=ZINB_MODEL({0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0;0;0;1;2;3;0;1;4;5;0;2;3;6;0;1;2;7;8;4})
Expected output:
| count_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|---|
| intercept | 0.3785 | 1.3471 | 0.281 | 0.7787 | -2.2618 | 3.0188 | |
| x1 | -0.4067 | 0.3391 | -1.1993 | 0.2304 | -1.0714 | 0.258 | |
| inflate_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
| intercept | 0.2241 | 0.515 | 0.4351 | 0.6635 | -0.7854 | 1.2336 | |
| x1 | 0.1573 | 0.0723 | 2.1749 | 0.0296 | 0.0155 | 0.299 | |
| model_statistics | |||||||
| log_likelihood | -36.4553 | ||||||
| aic | 82.9105 | ||||||
| bic | 87.8892 | ||||||
| dispersion | 0.0657 |
Example 2: ZINB with separate zero-inflation predictors
Inputs:
| x | x_inflate | y |
|---|---|---|
| 0 | 0.5 | 0 |
| 0.5 | 0.8 | 0 |
| 1 | 1.1 | 1 |
| 1.5 | 1.4 | 0 |
| 2 | 1.7 | 2 |
| 2.5 | 2 | 3 |
| 3 | 2.3 | 0 |
| 3.5 | 2.6 | 1 |
| 4 | 2.9 | 4 |
| 4.5 | 3.1999999999999997 | 5 |
| 5 | 3.5 | 0 |
| 5.5 | 3.8 | 2 |
| 6 | 4.1 | 3 |
| 6.5 | 4.4 | 0 |
| 7 | 4.7 | 6 |
| 7.5 | 5 | 1 |
| 8 | 5.3 | 2 |
| 8.5 | 5.6 | 7 |
| 9 | 5.8999999999999995 | 0 |
| 9.5 | 6.2 | 8 |
Excel formula:
=ZINB_MODEL({0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0.5;0.8;1.1;1.4;1.7;2;2.3;2.6;2.9;3.1999999999999997;3.5;3.8;4.1;4.4;4.7;5;5.3;5.6;5.8999999999999995;6.2}, {0;0;1;0;2;3;0;1;4;5;0;2;3;0;6;1;2;7;0;8})
Expected output:
| count_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|---|
| intercept | 0.6525 | 1.2652 | 0.5157 | 0.6061 | -1.8274 | 3.1323 | |
| x1 | -0.3937 | 0.3451 | -1.1406 | 0.254 | -1.0701 | 0.2828 | |
| inflate_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
| intercept | 0.186 | 0.4655 | 0.3995 | 0.6895 | -0.7265 | 1.0984 | |
| x1 | 0.1724 | 0.0676 | 2.551 | 0.0107 | 0.04 | 0.3049 | |
| model_statistics | |||||||
| log_likelihood | -36.4821 | ||||||
| aic | 82.9642 | ||||||
| bic | 87.9428 | ||||||
| dispersion | 0.0105 |
Example 3: ZINB model without intercept term
Inputs:
| fit_intercept | x | y |
|---|---|---|
| false | 0 | 0 |
| 0.5 | 0 | |
| 1 | 0 | |
| 1.5 | 1 | |
| 2 | 2 | |
| 2.5 | 3 | |
| 3 | 0 | |
| 3.5 | 1 | |
| 4 | 4 | |
| 4.5 | 5 | |
| 5 | 0 | |
| 5.5 | 2 | |
| 6 | 3 | |
| 6.5 | 6 | |
| 7 | 0 | |
| 7.5 | 1 | |
| 8 | 2 | |
| 8.5 | 7 | |
| 9 | 8 | |
| 9.5 | 4 |
Excel formula:
=ZINB_MODEL(FALSE, {0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0;0;0;1;2;3;0;1;4;5;0;2;3;6;0;1;2;7;8;4})
Expected output:
| count_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|---|
| x1 | -0.3585 | 0.228 | -1.5725 | 0.1158 | -0.8053 | 0.0883 | |
| inflate_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
| x1 | 0.1866 | 0.0265 | 7.0511 | 0 | 0.1347 | 0.2385 | |
| model_statistics | |||||||
| log_likelihood | -36.5492 | ||||||
| aic | 79.0985 | ||||||
| bic | 82.0857 | ||||||
| dispersion | 0.0907 |
Example 4: ZINB model with 90% confidence intervals
Inputs:
| alpha | x | y |
|---|---|---|
| 0.1 | 0 | 0 |
| 0.5 | 0 | |
| 1 | 0 | |
| 1.5 | 1 | |
| 2 | 2 | |
| 2.5 | 3 | |
| 3 | 0 | |
| 3.5 | 1 | |
| 4 | 4 | |
| 4.5 | 5 | |
| 5 | 0 | |
| 5.5 | 2 | |
| 6 | 3 | |
| 6.5 | 6 | |
| 7 | 0 | |
| 7.5 | 1 | |
| 8 | 2 | |
| 8.5 | 7 | |
| 9 | 8 | |
| 9.5 | 4 |
Excel formula:
=ZINB_MODEL(0.1, {0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0;0;0;1;2;3;0;1;4;5;0;2;3;6;0;1;2;7;8;4})
Expected output:
| count_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|---|
| intercept | 0.3785 | 1.3471 | 0.281 | 0.7787 | -1.8373 | 2.5943 | |
| x1 | -0.4067 | 0.3391 | -1.1993 | 0.2304 | -0.9645 | 0.1511 | |
| inflate_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
| intercept | 0.2241 | 0.515 | 0.4351 | 0.6635 | -0.6231 | 1.0713 | |
| x1 | 0.1573 | 0.0723 | 2.1749 | 0.0296 | 0.0383 | 0.2762 | |
| model_statistics | |||||||
| log_likelihood | -36.4553 | ||||||
| aic | 82.9105 | ||||||
| bic | 87.8892 | ||||||
| dispersion | 0.0657 |
Python Code
import math
import warnings
import numpy as np
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP as statsmodels_zinb
def zinb_model(y, x, x_inflate=None, fit_intercept=True, alpha=0.05):
"""
Fits a Zero-Inflated Negative Binomial (ZINB) model for overdispersed count data with excess zeros.
See: https://www.statsmodels.org/stable/generated/statsmodels.discrete.count_model.ZeroInflatedNegativeBinomialP.html
This example function is provided as-is without any representation of accuracy.
Args:
y (list[list]): Dependent variable containing overdispersed count data (non-negative integers).
x (list[list]): Independent variables for the count process. Each column represents a predictor.
x_inflate (list[list], optional): Independent variables for the zero-inflation process. If not specified, defaults to the same variables as x. Default is None.
fit_intercept (bool, optional): If true, adds an intercept term to both the count and zero-inflation processes. Default is True.
alpha (float, optional): Significance level for confidence intervals (must be between 0 and 1). Default is 0.05.
Returns:
list[list]: 2D list with model results and statistics, or error string.
"""
def to2d(arr):
return [[arr]] if not isinstance(arr, list) else arr
def validate_numeric_2d(arr, name):
if not isinstance(arr, list):
return f"Invalid input: {name} must be a 2D list."
if not arr:
return f"Invalid input: {name} cannot be empty."
if not all(isinstance(row, list) for row in arr):
return f"Invalid input: {name} must be a 2D list."
for i, row in enumerate(arr):
if not row:
return f"Invalid input: {name} row {i} cannot be empty."
for j, val in enumerate(row):
if isinstance(val, bool) or not isinstance(val, (int, float)):
return f"Invalid input: {name}[{i}][{j}] must be numeric."
if math.isnan(val) or math.isinf(val):
return f"Invalid input: {name}[{i}][{j}] must be finite."
return None
# Normalize inputs
y = to2d(y)
x = to2d(x)
if x_inflate is not None:
x_inflate = to2d(x_inflate)
# Validate inputs
error = validate_numeric_2d(y, "y")
if error:
return error
error = validate_numeric_2d(x, "x")
if error:
return error
if x_inflate is not None:
error = validate_numeric_2d(x_inflate, "x_inflate")
if error:
return error
# Validate y is column vector
if len(y[0]) != 1:
return "Invalid input: y must be a column vector (single column)."
# Check dimensions
n_obs = len(y)
if len(x) != n_obs:
return f"Invalid input: y and x must have same number of rows ({n_obs} vs {len(x)})."
if x_inflate is not None and len(x_inflate) != n_obs:
return f"Invalid input: y and x_inflate must have same number of rows ({n_obs} vs {len(x_inflate)})."
# Validate alpha
if not isinstance(alpha, (int, float)):
return "Invalid input: alpha must be numeric."
if math.isnan(alpha) or math.isinf(alpha):
return "Invalid input: alpha must be finite."
if alpha <= 0 or alpha >= 1:
return "Invalid input: alpha must be between 0 and 1."
# Validate fit_intercept
if not isinstance(fit_intercept, bool):
return "Invalid input: fit_intercept must be boolean."
# Convert to flat lists
y_vec = [row[0] for row in y]
# Validate y contains non-negative integers
for i, val in enumerate(y_vec):
if val < 0:
return f"Invalid input: y[{i}] must be non-negative."
if val % 1 != 0:
return f"Invalid input: y[{i}] must be an integer count."
# Build design matrices
try:
y_arr = np.array(y_vec)
x_arr = np.array(x)
if fit_intercept:
x_arr = np.column_stack([np.ones(n_obs), x_arr])
if x_inflate is None:
x_inflate_arr = x_arr
else:
x_inflate_arr = np.array(x_inflate)
if fit_intercept:
x_inflate_arr = np.column_stack([np.ones(n_obs), x_inflate_arr])
# Fit model
model = statsmodels_zinb(y_arr, x_arr, exog_infl=x_inflate_arr)
# Try fitting with different methods, suppressing convergence warnings
result = None
for method in ['bfgs', 'newton', 'nm']:
try:
with warnings.catch_warnings():
warnings.simplefilter("ignore")
result = model.fit(disp=False, method=method, maxiter=5000, warn_convergence=False)
# Check if we got valid standard errors
if result.bse is not None and not np.any(np.isnan(result.bse)):
break
except Exception:
continue
if result is None:
return "Model fitting error: Unable to fit model with any optimization method."
# Extract parameters
n_count = x_arr.shape[1]
count_params = result.params[:n_count]
inflate_params = result.params[n_count:-1] # Exclude dispersion
dispersion = result.params[-1]
# Get confidence intervals - handle case where Hessian inversion failed
try:
with warnings.catch_warnings():
warnings.simplefilter("ignore")
conf_int = result.conf_int(alpha=alpha)
except Exception:
conf_int = np.full((len(result.params), 2), '')
# Get standard errors, z-stats, p-values - handle NaN
def safe_extract(arr, idx):
if arr is None:
return ''
if idx >= len(arr):
return ''
val = arr[idx]
if isinstance(val, (int, float)):
if math.isnan(val) or math.isinf(val):
return ''
return float(val)
return ''
bse = result.bse if result.bse is not None else []
tvalues = result.tvalues if result.tvalues is not None else []
pvalues = result.pvalues if result.pvalues is not None else []
# Build output
output = []
# Section 1: Count process header
output.append(['count_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper'])
# Section 1: Count process rows
for i in range(n_count):
param_name = f'x{i+1}' if not fit_intercept or i > 0 else 'intercept'
if fit_intercept and i > 0:
param_name = f'x{i}'
ci_lower = ''
ci_upper = ''
if len(conf_int) > i:
ci_lower = float(conf_int[i, 0]) if isinstance(conf_int[i, 0], (int, float)) and not math.isnan(conf_int[i, 0]) else ''
ci_upper = float(conf_int[i, 1]) if isinstance(conf_int[i, 1], (int, float)) and not math.isnan(conf_int[i, 1]) else ''
output.append([
'',
param_name,
float(count_params[i]),
safe_extract(bse, i),
safe_extract(tvalues, i),
safe_extract(pvalues, i),
ci_lower,
ci_upper
])
# Section 2: Inflate process header
output.append(['inflate_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper'])
# Section 2: Inflate process rows
n_inflate = len(inflate_params)
for i in range(n_inflate):
param_name = f'x{i+1}' if not fit_intercept or i > 0 else 'intercept'
if fit_intercept and i > 0:
param_name = f'x{i}'
idx = n_count + i
ci_lower = ''
ci_upper = ''
if len(conf_int) > idx:
ci_lower = float(conf_int[idx, 0]) if isinstance(conf_int[idx, 0], (int, float)) and not math.isnan(conf_int[idx, 0]) else ''
ci_upper = float(conf_int[idx, 1]) if isinstance(conf_int[idx, 1], (int, float)) and not math.isnan(conf_int[idx, 1]) else ''
output.append([
'',
param_name,
float(inflate_params[i]),
safe_extract(bse, idx),
safe_extract(tvalues, idx),
safe_extract(pvalues, idx),
ci_lower,
ci_upper
])
# Model statistics
output.append(['model_statistics', '', '', '', '', '', '', ''])
output.append(['log_likelihood', float(result.llf) if hasattr(result, 'llf') else '', '', '', '', '', '', ''])
output.append(['aic', float(result.aic) if hasattr(result, 'aic') else '', '', '', '', '', '', ''])
output.append(['bic', float(result.bic) if hasattr(result, 'bic') else '', '', '', '', '', '', ''])
output.append(['dispersion', float(dispersion), '', '', '', '', '', ''])
return output
except Exception as e:
return f"Model fitting error: {str(e)}"