ZIP_MODEL
Overview
The ZIP_MODEL function fits a Zero-Inflated Poisson (ZIP) regression model for count data that exhibits an excess of zero observations beyond what a standard Poisson distribution would predict. Zero-inflated models are commonly used when analyzing data such as insurance claims (where non-policyholders always have zero claims), fish counts in lakes (where some lakes cannot support fish), or publication counts (where some individuals systematically do not publish).
The ZIP model, originally developed by Diane Lambert in 1992, combines two separate data-generating processes through a mixture distribution. The first process models the probability of being in a “structural zero” state, where zeros always occur. The second process follows a standard Poisson distribution that can generate zero or positive counts. The model is defined as:
$$\text{Pr}(Y = 0) = \pi + (1 - \pi)e^{-\lambda}$$
$$\text{Pr}(Y = y_i) = (1 - \pi)\frac{\lambda^{y_i}e^{-\lambda}}{y_i!}, \quad y_i = 1, 2, 3, \ldots$$
where $\pi$ is the probability of a structural (excess) zero, also called the inflation parameter, and $\lambda$ is the mean of the Poisson count process. The overall mean is $(1 - \pi)\lambda$ and the variance is $\lambda(1 - \pi)(1 + \pi\lambda)$. The variance exceeds the mean whenever $\pi > 0$, so the model is overdispersed relative to a standard Poisson model.
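As a quick numerical check of the formulas above, the following sketch evaluates the ZIP probability mass function directly and compares the summed mean and variance with the closed-form expressions. The helper name `zip_pmf` and the values `pi = 0.3` and `lam = 2.5` are illustrative assumptions, not part of ZIP_MODEL.

```python
import math

def zip_pmf(y, pi, lam):
    """ZIP probability of observing count y (hypothetical helper)."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # Zeros come from the structural-zero state or from the Poisson process.
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson

pi, lam = 0.3, 2.5       # illustrative values only
support = range(60)      # the tail beyond 60 is negligible for lam = 2.5

total = sum(zip_pmf(y, pi, lam) for y in support)
mean = sum(y * zip_pmf(y, pi, lam) for y in support)
variance = sum((y - mean) ** 2 * zip_pmf(y, pi, lam) for y in support)

closed_form_mean = (1 - pi) * lam
closed_form_variance = lam * (1 - pi) * (1 + pi * lam)
print(total, mean, closed_form_mean, variance, closed_form_variance)
```

With these values the variance is larger than the mean, which is the overdispersion described above.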
This implementation uses the statsmodels library, which estimates the count process (a Poisson regression) and the zero-inflation process (a logit or probit regression) jointly by maximum likelihood. The function does not set the inflation link, so the statsmodels default (logit) is used. For technical details, see the statsmodels ZeroInflatedPoisson documentation. The function allows separate predictor variables for each process, providing flexibility to model different mechanisms generating zeros versus non-zero counts.
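To make the two linked regressions concrete, here is a minimal sketch of the ZIP log-likelihood that maximum likelihood estimation would maximize. It assumes a logit link for the inflation process and a log link for the count process. The function name `zip_loglike` and its argument layout are hypothetical and are not the statsmodels API; each design row is assumed to already contain any intercept column.

```python
import math

def zip_loglike(y, x_rows, z_rows, beta, gamma):
    """Sum of ZIP log-probabilities (hypothetical helper, logit inflation assumed).

    x_rows / beta  : design rows and coefficients for the count process
    z_rows / gamma : design rows and coefficients for the inflation process
    """
    total = 0.0
    for yi, xi, zi in zip(y, x_rows, z_rows):
        # Count process: log link, lambda_i = exp(x_i . beta)
        lam = math.exp(sum(b * v for b, v in zip(beta, xi)))
        # Inflation process: logit link, pi_i = 1 / (1 + exp(-z_i . gamma))
        pi = 1.0 / (1.0 + math.exp(-sum(g * v for g, v in zip(gamma, zi))))
        if yi == 0:
            total += math.log(pi + (1 - pi) * math.exp(-lam))
        else:
            log_poisson = -lam + yi * math.log(lam) - math.lgamma(yi + 1)
            total += math.log(1 - pi) + log_poisson
    return total

# Two observations with intercept-only designs and all coefficients at zero,
# so lambda = 1 and pi = 0.5 for both rows.
ll = zip_loglike([0, 2], [[1.0], [1.0]], [[1.0], [1.0]], beta=[0.0], gamma=[0.0])
print(ll)
```

An optimizer would search over `beta` and `gamma` together, which is why the two processes can use different predictor sets.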
This example function is provided as-is without any representation of accuracy.
Excel Usage
=ZIP_MODEL(y, x, x_inflate, fit_intercept, alpha)
y (list[list], required): Dependent variable as a column vector of count data (non-negative integers).
x (list[list], required): Independent variables for the count process. Each column is a predictor.
x_inflate (list[list], optional, default: null): Independent variables for the zero-inflation process. If omitted, the same predictors as x are used.
fit_intercept (bool, optional, default: true): If true, adds an intercept term to both processes.
alpha (float, optional, default: 0.05): Significance level for confidence intervals (strictly between 0 and 1).
Returns (list[list]): 2D list with model results and statistics, or error string.
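The returned 2D list mixes section header rows, parameter rows (whose first cell is empty), and single-value statistic rows. The sketch below shows one way to split it into dictionaries for downstream use. The helper `parse_zip_output` is hypothetical, and the numbers in `sample` are placeholders, not fitted results.

```python
FIELDS = ["coefficient", "std_error", "z_statistic", "p_value", "ci_lower", "ci_upper"]

def parse_zip_output(table):
    """Split a ZIP_MODEL result into per-process parameters and model statistics."""
    if isinstance(table, str):
        # The function reports failures as a plain error string.
        raise ValueError(table)
    parsed = {"count_process": {}, "inflate_process": {}, "stats": {}}
    section = None
    for row in table:
        label = row[0]
        if label in ("count_process", "inflate_process"):
            section = label                      # header row starts a new section
        elif label == "" and section is not None:
            parsed[section][row[1]] = dict(zip(FIELDS, row[2:8]))
        else:
            parsed["stats"][label] = row[1]      # log_likelihood, aic, bic
    return parsed

header = ["parameter"] + FIELDS
sample = [  # placeholder values for illustration only
    ["count_process"] + header,
    ["", "intercept", 0.1, 0.2, 0.5, 0.6, -0.3, 0.5],
    ["inflate_process"] + header,
    ["", "intercept", -1.0, 0.4, -2.5, 0.01, -1.8, -0.2],
    ["log_likelihood", -10.0, "", "", "", "", "", ""],
]
parsed = parse_zip_output(sample)
print(parsed["inflate_process"]["intercept"]["p_value"])
```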
Examples
Example 1: Basic ZIP with required arguments only
Inputs:
| y | x |
|---|---|
| 0 | 0.1 |
| 2 | 0.23 |
| 0 | 0.29 |
| 0 | 0.7 |
| 2 | 0.78 |
| 2 | 0.78 |
| 0 | 0.91 |
| 0 | 0.92 |
| 0 | 1 |
| 0 | 1.06 |
| 0 | 1.46 |
| 1 | 1.46 |
| 0 | 1.52 |
| 3 | 1.83 |
| 0 | 1.87 |
| 1 | 2.16 |
| 3 | 2.28 |
| 2 | 2.57 |
| 0 | 2.62 |
| 4 | 2.96 |
| 0 | 2.99 |
| 6 | 3.01 |
| 0 | 3.06 |
| 5 | 3.54 |
| 5 | 3.66 |
| 5 | 3.93 |
| 4 | 4.16 |
| 9 | 4.33 |
| 2 | 4.75 |
| 7 | 4.85 |
Excel formula:
=ZIP_MODEL({0;2;0;0;2;2;0;0;0;0;0;1;0;3;0;1;3;2;0;4;0;6;0;5;5;5;4;9;2;7}, {0.1;0.23;0.29;0.7;0.78;0.78;0.91;0.92;1;1.06;1.46;1.46;1.52;1.83;1.87;2.16;2.28;2.57;2.62;2.96;2.99;3.01;3.06;3.54;3.66;3.93;4.16;4.33;4.75;4.85})
Expected output:
| count_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|---|
| | intercept | 0.1648 | 0.3958 | 0.4165 | 0.6771 | -0.6109 | 0.9405 |
| | x1 | 0.3526 | 0.1077 | 3.275 | 0.001056 | 0.1416 | 0.5637 |
| inflate_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
| | intercept | 1.11 | 0.8968 | 1.237 | 0.2159 | -0.648 | 2.867 |
| | x1 | -0.7928 | 0.3937 | -2.014 | 0.04404 | -1.564 | -0.02117 |
| log_likelihood | -46.63 | | | | | | |
| aic | 101.3 | | | | | | |
| bic | 106.9 | | | | | | |
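The reported AIC and BIC are consistent with the standard definitions, AIC = 2k - 2·logL and BIC = k·ln(n) - 2·logL, taking k = 4 estimated parameters (two per process) and n = 30 observations. The sketch below redoes that arithmetic from the rounded log-likelihood shown above. It is a sanity check under those assumptions, not a description of statsmodels internals.

```python
import math

log_likelihood = -46.63   # rounded value from the table above
n_params = 4              # intercept and x1 for each of the two processes
n_obs = 30                # rows in the example data

aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

print(round(aic, 1), round(bic, 1))
```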
Example 2: ZIP with separate zero-inflation predictors
Inputs:
| y | x | x_inflate |
|---|---|---|
| 0 | 0.1 | 0.05 |
| 2 | 0.23 | 0.12 |
| 0 | 0.29 | 0.15 |
| 0 | 0.7 | 0.35 |
| 2 | 0.78 | 0.39 |
| 2 | 0.78 | 0.39 |
| 0 | 0.91 | 0.46 |
| 0 | 0.92 | 0.46 |
| 0 | 1 | 0.5 |
| 0 | 1.06 | 0.53 |
| 0 | 1.46 | 0.73 |
| 1 | 1.46 | 0.73 |
| 0 | 1.52 | 0.76 |
| 3 | 1.83 | 0.92 |
| 0 | 1.87 | 0.94 |
| 1 | 2.16 | 1.08 |
| 3 | 2.28 | 1.14 |
| 2 | 2.57 | 1.29 |
| 0 | 2.62 | 1.31 |
| 4 | 2.96 | 1.48 |
| 0 | 2.99 | 1.5 |
| 6 | 3.01 | 1.51 |
| 0 | 3.06 | 1.53 |
| 5 | 3.54 | 1.77 |
| 5 | 3.66 | 1.83 |
| 5 | 3.93 | 1.97 |
| 4 | 4.16 | 2.08 |
| 9 | 4.33 | 2.17 |
| 2 | 4.75 | 2.38 |
| 7 | 4.85 | 2.43 |
Excel formula:
=ZIP_MODEL({0;2;0;0;2;2;0;0;0;0;0;1;0;3;0;1;3;2;0;4;0;6;0;5;5;5;4;9;2;7}, {0.1;0.23;0.29;0.7;0.78;0.78;0.91;0.92;1;1.06;1.46;1.46;1.52;1.83;1.87;2.16;2.28;2.57;2.62;2.96;2.99;3.01;3.06;3.54;3.66;3.93;4.16;4.33;4.75;4.85}, {0.05;0.12;0.15;0.35;0.39;0.39;0.46;0.46;0.5;0.53;0.73;0.73;0.76;0.92;0.94;1.08;1.14;1.29;1.31;1.48;1.5;1.51;1.53;1.77;1.83;1.97;2.08;2.17;2.38;2.43})
Expected output:
| count_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|---|
| | intercept | 0.165 | 0.3957 | 0.4169 | 0.6768 | -0.6107 | 0.9406 |
| | x1 | 0.3526 | 0.1077 | 3.275 | 0.001057 | 0.1416 | 0.5636 |
| inflate_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
| | intercept | 1.113 | 0.8976 | 1.24 | 0.2151 | -0.6465 | 2.872 |
| | x1 | -1.586 | 0.7872 | -2.014 | 0.04396 | -3.128 | -0.04289 |
| log_likelihood | -46.63 | | | | | | |
| aic | 101.3 | | | | | | |
| bic | 106.9 | | | | | | |
Example 3: ZIP with custom significance level (alpha=0.1)
Inputs:
| y | x | alpha |
|---|---|---|
| 0 | 0.1 | 0.1 |
| 2 | 0.23 | |
| 0 | 0.29 | |
| 0 | 0.7 | |
| 2 | 0.78 | |
| 2 | 0.78 | |
| 0 | 0.91 | |
| 0 | 0.92 | |
| 0 | 1 | |
| 0 | 1.06 | |
| 0 | 1.46 | |
| 1 | 1.46 | |
| 0 | 1.52 | |
| 3 | 1.83 | |
| 0 | 1.87 | |
| 1 | 2.16 | |
| 3 | 2.28 | |
| 2 | 2.57 | |
| 0 | 2.62 | |
| 4 | 2.96 | |
| 0 | 2.99 | |
| 6 | 3.01 | |
| 0 | 3.06 | |
| 5 | 3.54 | |
| 5 | 3.66 | |
| 5 | 3.93 | |
| 4 | 4.16 | |
| 9 | 4.33 | |
| 2 | 4.75 | |
| 7 | 4.85 |
Excel formula:
=ZIP_MODEL({0;2;0;0;2;2;0;0;0;0;0;1;0;3;0;1;3;2;0;4;0;6;0;5;5;5;4;9;2;7}, {0.1;0.23;0.29;0.7;0.78;0.78;0.91;0.92;1;1.06;1.46;1.46;1.52;1.83;1.87;2.16;2.28;2.57;2.62;2.96;2.99;3.01;3.06;3.54;3.66;3.93;4.16;4.33;4.75;4.85}, , , 0.1)
Expected output:
| count_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|---|
| | intercept | 0.1648 | 0.3958 | 0.4165 | 0.6771 | -0.4862 | 0.8158 |
| | x1 | 0.3526 | 0.1077 | 3.275 | 0.001056 | 0.1755 | 0.5297 |
| inflate_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
| | intercept | 1.11 | 0.8968 | 1.237 | 0.2159 | -0.3653 | 2.585 |
| | x1 | -0.7928 | 0.3937 | -2.014 | 0.04404 | -1.44 | -0.1452 |
| log_likelihood | -46.63 | | | | | | |
| aic | 101.3 | | | | | | |
| bic | 106.9 | | | | | | |
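The effect of alpha can be reproduced by hand. Assuming the intervals are normal-based Wald intervals (coefficient ± z·std_error, with z the 1 - alpha/2 standard normal quantile), the sketch below recomputes the interval for the x1 row with coefficient 0.3526 and standard error 0.1077. The helper `wald_interval` is hypothetical, and small differences in the last digit come from using rounded inputs.

```python
from statistics import NormalDist

def wald_interval(coef, std_error, alpha=0.05):
    """Two-sided (1 - alpha) normal-based interval (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return coef - z * std_error, coef + z * std_error

lower_90, upper_90 = wald_interval(0.3526, 0.1077, alpha=0.1)
lower_95, upper_95 = wald_interval(0.3526, 0.1077, alpha=0.05)
print(lower_90, upper_90)
print(lower_95, upper_95)
```

A larger alpha gives a narrower interval, while the coefficient, standard error, z statistic, and p-value are unchanged.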
Example 4: ZIP with all arguments specified
Inputs:
| y | x | x_inflate | fit_intercept | alpha |
|---|---|---|---|---|
| 0 | 0.1 | 0.05 | true | 0.1 |
| 2 | 0.23 | 0.12 | ||
| 0 | 0.29 | 0.15 | ||
| 0 | 0.7 | 0.35 | ||
| 2 | 0.78 | 0.39 | ||
| 2 | 0.78 | 0.39 | ||
| 0 | 0.91 | 0.46 | ||
| 0 | 0.92 | 0.46 | ||
| 0 | 1 | 0.5 | ||
| 0 | 1.06 | 0.53 | ||
| 0 | 1.46 | 0.73 | ||
| 1 | 1.46 | 0.73 | ||
| 0 | 1.52 | 0.76 | ||
| 3 | 1.83 | 0.92 | ||
| 0 | 1.87 | 0.94 | ||
| 1 | 2.16 | 1.08 | ||
| 3 | 2.28 | 1.14 | ||
| 2 | 2.57 | 1.29 | ||
| 0 | 2.62 | 1.31 | ||
| 4 | 2.96 | 1.48 | ||
| 0 | 2.99 | 1.5 | ||
| 6 | 3.01 | 1.51 | ||
| 0 | 3.06 | 1.53 | ||
| 5 | 3.54 | 1.77 | ||
| 5 | 3.66 | 1.83 | ||
| 5 | 3.93 | 1.97 | ||
| 4 | 4.16 | 2.08 | ||
| 9 | 4.33 | 2.17 | ||
| 2 | 4.75 | 2.38 | ||
| 7 | 4.85 | 2.43 |
Excel formula:
=ZIP_MODEL({0;2;0;0;2;2;0;0;0;0;0;1;0;3;0;1;3;2;0;4;0;6;0;5;5;5;4;9;2;7}, {0.1;0.23;0.29;0.7;0.78;0.78;0.91;0.92;1;1.06;1.46;1.46;1.52;1.83;1.87;2.16;2.28;2.57;2.62;2.96;2.99;3.01;3.06;3.54;3.66;3.93;4.16;4.33;4.75;4.85}, {0.05;0.12;0.15;0.35;0.39;0.39;0.46;0.46;0.5;0.53;0.73;0.73;0.76;0.92;0.94;1.08;1.14;1.29;1.31;1.48;1.5;1.51;1.53;1.77;1.83;1.97;2.08;2.17;2.38;2.43}, TRUE, 0.1)
Expected output:
| count_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|---|
| | intercept | 0.165 | 0.3957 | 0.4169 | 0.6768 | -0.486 | 0.8159 |
| | x1 | 0.3526 | 0.1077 | 3.275 | 0.001057 | 0.1755 | 0.5297 |
| inflate_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
| | intercept | 1.113 | 0.8976 | 1.24 | 0.2151 | -0.3636 | 2.589 |
| | x1 | -1.586 | 0.7872 | -2.014 | 0.04396 | -2.88 | -0.2909 |
| log_likelihood | -46.63 | | | | | | |
| aic | 101.3 | | | | | | |
| bic | 106.9 | | | | | | |
Python Code
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson as sm_ZeroInflatedPoisson


def zip_model(y, x, x_inflate=None, fit_intercept=True, alpha=0.05):
    """
    Fits a Zero-Inflated Poisson (ZIP) model for count data with excess zeros.

    See: https://www.statsmodels.org/stable/generated/statsmodels.discrete.count_model.ZeroInflatedPoisson.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector of count data (non-negative integers).
        x (list[list]): Independent variables for the count process. Each column is a predictor.
        x_inflate (list[list], optional): Independent variables for zero-inflation process. If omitted, uses same as x. Default is None.
        fit_intercept (bool, optional): If true, adds an intercept term to both processes. Default is True.
        alpha (float, optional): Significance level for confidence intervals (between 0 and 1). Default is 0.05.

    Returns:
        list[list]: 2D list with model results and statistics, or error string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    # Normalize inputs
    y = to2d(y)
    x = to2d(x)
    if x_inflate is not None:
        x_inflate = to2d(x_inflate)

    # Validate y is a column vector
    try:
        y_array = np.array(y, dtype=float)
    except Exception:
        return "Invalid input: y must be a numeric 2D list."
    if y_array.ndim != 2:
        return "Invalid input: y must be a 2D list."
    if y_array.shape[1] != 1:
        return "Invalid input: y must be a column vector (single column)."
    y_vec = y_array.flatten()

    # Check for NaN or Inf in y
    if np.any(np.isnan(y_vec)) or np.any(np.isinf(y_vec)):
        return "Invalid input: y contains NaN or Inf values."

    # Check that y values are non-negative integers (count data)
    if np.any(y_vec < 0):
        return "Invalid input: y must contain non-negative values (count data)."

    # Check that y values are integers
    if not np.all(np.equal(np.mod(y_vec, 1), 0)):
        return "Invalid input: y must contain integer count values."

    # Validate x is a matrix
    try:
        x_array = np.array(x, dtype=float)
    except Exception:
        return "Invalid input: x must be a numeric 2D list."
    if x_array.ndim != 2:
        return "Invalid input: x must be a 2D list."

    # Check for NaN or Inf in x
    if np.any(np.isnan(x_array)) or np.any(np.isinf(x_array)):
        return "Invalid input: x contains NaN or Inf values."

    # Check dimensions match
    if x_array.shape[0] != y_vec.shape[0]:
        return "Invalid input: x and y must have the same number of rows."

    # Validate x_inflate if provided
    if x_inflate is not None:
        try:
            x_inflate_array = np.array(x_inflate, dtype=float)
        except Exception:
            return "Invalid input: x_inflate must be a numeric 2D list."
        if x_inflate_array.ndim != 2:
            return "Invalid input: x_inflate must be a 2D list."

        # Check for NaN or Inf in x_inflate
        if np.any(np.isnan(x_inflate_array)) or np.any(np.isinf(x_inflate_array)):
            return "Invalid input: x_inflate contains NaN or Inf values."

        # Check dimensions match
        if x_inflate_array.shape[0] != y_vec.shape[0]:
            return "Invalid input: x_inflate and y must have the same number of rows."
    else:
        # Use same predictors for inflation process as count process
        x_inflate_array = x_array

    # Validate fit_intercept
    if not isinstance(fit_intercept, bool):
        return "Invalid input: fit_intercept must be a boolean."

    # Validate alpha
    if not isinstance(alpha, (int, float)):
        return "Invalid input: alpha must be a number."
    alpha_float = float(alpha)
    if np.isnan(alpha_float) or np.isinf(alpha_float):
        return "Invalid input: alpha must be finite."
    if not (0 < alpha_float < 1):
        return "Invalid input: alpha must be between 0 and 1."

    # Add intercept if requested
    if fit_intercept:
        x_with_const = sm.add_constant(x_array, has_constant='add')
        x_inflate_with_const = sm.add_constant(x_inflate_array, has_constant='add')
    else:
        x_with_const = x_array
        x_inflate_with_const = x_inflate_array

    # Fit ZIP model
    try:
        model = sm_ZeroInflatedPoisson(y_vec, x_with_const, exog_infl=x_inflate_with_const)
        results = model.fit(disp=False)
    except Exception as exc:
        return f"statsmodels ZeroInflatedPoisson error: {exc}"

    # Extract results
    params = results.params
    std_err = results.bse
    z_stats = results.tvalues
    p_values = results.pvalues
    conf_int = results.conf_int(alpha=alpha_float)

    # Check for NaN or Inf in results
    if (np.any(np.isnan(params)) or np.any(np.isinf(params)) or
            np.any(np.isnan(std_err)) or np.any(np.isinf(std_err)) or
            np.any(np.isnan(z_stats)) or np.any(np.isinf(z_stats)) or
            np.any(np.isnan(p_values)) or np.any(np.isinf(p_values)) or
            np.any(np.isnan(conf_int)) or np.any(np.isinf(conf_int))):
        return "statsmodels ZeroInflatedPoisson error: results contain NaN or Inf values."

    # Build output table.
    # statsmodels orders the fitted parameter vector with the zero-inflation
    # coefficients first, followed by the count-process coefficients.
    num_count_params = x_with_const.shape[1]
    num_inflate_params = x_inflate_with_const.shape[1]
    count_offset = num_inflate_params

    # Section 1: Count process
    output = [['count_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper']]

    # Add count process parameter rows
    for i in range(num_count_params):
        param_idx = count_offset + i
        if fit_intercept and i == 0:
            param_name = 'intercept'
        else:
            predictor_idx = i if not fit_intercept else i - 1
            param_name = f'x{predictor_idx + 1}'
        output.append([
            '',
            param_name,
            float(params[param_idx]),
            float(std_err[param_idx]),
            float(z_stats[param_idx]),
            float(p_values[param_idx]),
            float(conf_int[param_idx, 0]),
            float(conf_int[param_idx, 1])
        ])

    # Section 2: Inflate process
    output.append(['inflate_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper'])

    # Add inflate process parameter rows (stored at the start of the vector)
    for i in range(num_inflate_params):
        param_idx = i
        if fit_intercept and i == 0:
            param_name = 'intercept'
        else:
            predictor_idx = i if not fit_intercept else i - 1
            param_name = f'x{predictor_idx + 1}'
        output.append([
            '',
            param_name,
            float(params[param_idx]),
            float(std_err[param_idx]),
            float(z_stats[param_idx]),
            float(p_values[param_idx]),
            float(conf_int[param_idx, 0]),
            float(conf_int[param_idx, 1])
        ])

    # Add model statistics
    output.append(['log_likelihood', float(results.llf), '', '', '', '', '', ''])
    output.append(['aic', float(results.aic), '', '', '', '', '', ''])
    output.append(['bic', float(results.bic), '', '', '', '', '', ''])
    return output