HURDLE_COUNT_MODEL
Overview
The HURDLE_COUNT_MODEL function fits a hurdle model for count data, a two-stage regression approach designed to handle datasets with excess zeros. Hurdle models are widely used in econometrics, ecology, and healthcare analytics where the outcome variable represents counts (e.g., number of doctor visits, species occurrences, or product purchases) and zeros occur more frequently than standard count distributions predict.
A hurdle model decomposes the data-generating process into two distinct components. First, a zero hurdle process (binary model) determines whether an observation “crosses the hurdle” from zero to a positive count. Second, a truncated count process models the magnitude of positive counts conditional on crossing the hurdle. Mathematically, the probability structure is:
\Pr(Y = 0) = \theta
\Pr(Y = k \mid Y > 0) = \frac{p(k)}{1 - p(0)} \quad \text{for } k = 1, 2, 3, \ldots
where \theta is the probability of a zero count from the hurdle process and p(k) is the probability mass function of the underlying count distribution (Poisson or Negative Binomial), truncated at zero.
This implementation uses the statsmodels library’s HurdleCountModel class, introduced in version 0.14.0. The function supports both Poisson and Negative Binomial distributions for the positive count component, with a Poisson distribution for the zero hurdle. The Negative Binomial option is useful when count data exhibit overdispersion (variance exceeding the mean).
The hurdle model differs from zero-inflated models in its treatment of zeros. While zero-inflated models assume zeros arise from a mixture of a point mass at zero and the count distribution, hurdle models treat all zeros as originating solely from the binary hurdle process. This distinction is important when the mechanism generating zeros is conceptually different from the count process—for example, whether someone makes any purchase versus how many items they buy.
The function returns coefficient estimates, standard errors, z-statistics, p-values, and confidence intervals for both processes, along with model fit statistics including log-likelihood, AIC, and BIC. The hurdle model concept was introduced by Cragg (1971) and extended for count data by Mullahy (1986).
This example function is provided as-is without any representation of accuracy.
Excel Usage
=HURDLE_COUNT_MODEL(y, x, x_zero, hurdle_dist, fit_intercept, alpha)
y(list[list], required): Dependent variable containing count data (column vector of non-negative integers).x(list[list], required): Independent variables (predictors) for both zero and count processes.x_zero(list[list], optional, default: null): Independent variables for zero process (must be same as x or None).hurdle_dist(str, optional, default: “poisson”): Distribution for positive counts.fit_intercept(bool, optional, default: true): If true, adds an intercept term to both processes.alpha(float, optional, default: 0.05): Significance level for confidence intervals (between 0 and 1).
Returns (list[list]): 2D list with model results and statistics, or error string.
Examples
Example 1: Basic Hurdle Poisson model with single predictor
Inputs:
| y | x |
|---|---|
| 0 | 1 |
| 0 | 2 |
| 0 | 1.5 |
| 1 | 3 |
| 2 | 4 |
| 0 | 2.5 |
| 3 | 5 |
| 0 | 1 |
| 1 | 3.5 |
| 0 | 2 |
Excel formula:
=HURDLE_COUNT_MODEL({0;0;0;1;2;0;3;0;1;0}, {1;2;1.5;3;4;2.5;5;1;3.5;2})
Expected output:
| zero_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|---|
| intercept | -99.57 | 4537000 | -0.00002195 | 1 | -8892000 | 8892000 | |
| x1 | 35.72 | 1815000 | 0.00001968 | 1 | -3557000 | 3557000 | |
| count_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
| intercept | -5.586 | 4.406 | -1.268 | 0.2048 | -14.22 | 3.049 | |
| x1 | 1.341 | 0.9308 | 1.441 | 0.1495 | -0.4829 | 3.166 | |
| log_likelihood | -3.109 | ||||||
| aic | 14.22 | ||||||
| bic | 15.43 |
Example 2: Hurdle model with larger dataset
Inputs:
| y | x |
|---|---|
| 0 | 1 |
| 0 | 2 |
| 1 | 1.5 |
| 2 | 3 |
| 0 | 4 |
| 3 | 2.5 |
| 1 | 5 |
| 0 | 1 |
| 2 | 3.5 |
| 0 | 2 |
| 1 | 1.5 |
| 4 | 4.5 |
| 0 | 2 |
| 0 | 1 |
| 2 | 3 |
Excel formula:
=HURDLE_COUNT_MODEL({0;0;1;2;0;3;1;0;2;0;1;4;0;0;2}, {1;2;1.5;3;4;2.5;5;1;3.5;2;1.5;4.5;2;1;3})
Expected output:
"non-error"
Example 3: Hurdle model with 90% confidence intervals
Inputs:
| y | x | alpha |
|---|---|---|
| 0 | 1 | 0.1 |
| 0 | 2 | |
| 0 | 1.5 | |
| 1 | 3 | |
| 2 | 4 | |
| 0 | 2.5 | |
| 3 | 5 | |
| 0 | 1 | |
| 1 | 3.5 | |
| 0 | 2 |
Excel formula:
=HURDLE_COUNT_MODEL({0;0;0;1;2;0;3;0;1;0}, {1;2;1.5;3;4;2.5;5;1;3.5;2}, 0.1)
Expected output:
| zero_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|---|
| intercept | -99.57 | 4537000 | -0.00002195 | 1 | -7469000 | 7469000 | |
| x1 | 35.72 | 1815000 | 0.00001968 | 1 | -2987000 | 2987000 | |
| count_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
| intercept | -5.586 | 4.406 | -1.268 | 0.2048 | -12.84 | 1.672 | |
| x1 | 1.341 | 0.9308 | 1.441 | 0.1495 | -0.1894 | 2.872 | |
| log_likelihood | -3.109 | ||||||
| aic | 14.22 | ||||||
| bic | 15.43 |
Example 4: Hurdle model with all arguments specified
Inputs:
| y | x | hurdle_dist | fit_intercept | alpha |
|---|---|---|---|---|
| 0 | 1 | poisson | true | 0.1 |
| 0 | 2 | |||
| 0 | 1.5 | |||
| 1 | 3 | |||
| 2 | 4 | |||
| 0 | 2.5 | |||
| 3 | 5 | |||
| 0 | 1 | |||
| 1 | 3.5 | |||
| 0 | 2 |
Excel formula:
=HURDLE_COUNT_MODEL({0;0;0;1;2;0;3;0;1;0}, {1;2;1.5;3;4;2.5;5;1;3.5;2}, "poisson", TRUE, 0.1)
Expected output:
| zero_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
|---|---|---|---|---|---|---|---|
| intercept | -99.57 | 4537000 | -0.00002195 | 1 | -7469000 | 7469000 | |
| x1 | 35.72 | 1815000 | 0.00001968 | 1 | -2987000 | 2987000 | |
| count_process | parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper |
| intercept | -5.586 | 4.406 | -1.268 | 0.2048 | -12.84 | 1.672 | |
| x1 | 1.341 | 0.9308 | 1.441 | 0.1495 | -0.1894 | 2.872 | |
| log_likelihood | -3.109 | ||||||
| aic | 14.22 | ||||||
| bic | 15.43 |
Python Code
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.truncated_model import HurdleCountModel as statsmodels_hurdle_count_model
def hurdle_count_model(y, x, x_zero=None, hurdle_dist='poisson', fit_intercept=True, alpha=0.05):
"""
Fits a Hurdle model for count data with two-stage process (zero vs. positive counts).
See: https://www.statsmodels.org/stable/generated/statsmodels.discrete.truncated_model.HurdleCountModel.html
This example function is provided as-is without any representation of accuracy.
Args:
y (list[list]): Dependent variable containing count data (column vector of non-negative integers).
x (list[list]): Independent variables (predictors) for both zero and count processes.
x_zero (list[list], optional): Independent variables for zero process (must be same as x or None). Default is None.
hurdle_dist (str, optional): Distribution for positive counts. Valid options: Poisson, Negative Binomial. Default is 'poisson'.
fit_intercept (bool, optional): If true, adds an intercept term to both processes. Default is True.
alpha (float, optional): Significance level for confidence intervals (between 0 and 1). Default is 0.05.
Returns:
list[list]: 2D list with model results and statistics, or error string.
"""
def to2d(val):
return [[val]] if not isinstance(val, list) else val
# Normalize inputs
y = to2d(y)
x = to2d(x)
if x_zero is not None:
x_zero = to2d(x_zero)
# Validate y is a column vector
try:
y_array = np.array(y, dtype=float)
except Exception:
return "Invalid input: y must be a numeric 2D list."
if y_array.ndim != 2:
return "Invalid input: y must be a 2D list."
if y_array.shape[1] != 1:
return "Invalid input: y must be a column vector (single column)."
y_vec = y_array.flatten()
# Check for NaN or Inf in y
if np.any(np.isnan(y_vec)) or np.any(np.isinf(y_vec)):
return "Invalid input: y contains NaN or Inf values."
# Check that y values are non-negative integers (count data)
if np.any(y_vec < 0):
return "Invalid input: y must contain non-negative values (count data)."
# Check that y values are integers
if not np.all(np.equal(np.mod(y_vec, 1), 0)):
return "Invalid input: y must contain integer count values."
# Validate x is a matrix
try:
x_array = np.array(x, dtype=float)
except Exception:
return "Invalid input: x must be a numeric 2D list."
if x_array.ndim != 2:
return "Invalid input: x must be a 2D list."
# Check for NaN or Inf in x
if np.any(np.isnan(x_array)) or np.any(np.isinf(x_array)):
return "Invalid input: x contains NaN or Inf values."
# Check dimensions match
if x_array.shape[0] != y_vec.shape[0]:
return "Invalid input: x and y must have the same number of rows."
# Note: statsmodels HurdleCountModel uses the same exog for both processes
# If x_zero is provided and different from x, return an error
if x_zero is not None:
try:
x_zero_array = np.array(x_zero, dtype=float)
except Exception:
return "Invalid input: x_zero must be a numeric 2D list."
if x_zero_array.ndim != 2:
return "Invalid input: x_zero must be a 2D list."
# Check for NaN or Inf in x_zero
if np.any(np.isnan(x_zero_array)) or np.any(np.isinf(x_zero_array)):
return "Invalid input: x_zero contains NaN or Inf values."
# Check dimensions match
if x_zero_array.shape[0] != y_vec.shape[0]:
return "Invalid input: x_zero and y must have the same number of rows."
# Check if x_zero is different from x
if not np.array_equal(x_zero_array, x_array):
return "Invalid input: statsmodels HurdleCountModel does not support different predictors for zero and count processes. x_zero must be the same as x or None."
# Validate hurdle_dist
if not isinstance(hurdle_dist, str):
return "Invalid input: hurdle_dist must be a string."
dist_lower = hurdle_dist.lower()
if dist_lower not in ['poisson', 'negbin']:
return "Invalid input: hurdle_dist must be 'poisson' or 'negbin'."
# Validate fit_intercept
if not isinstance(fit_intercept, bool):
return "Invalid input: fit_intercept must be a boolean."
# Validate alpha
if not isinstance(alpha, (int, float)):
return "Invalid input: alpha must be a number."
alpha_float = float(alpha)
if np.isnan(alpha_float) or np.isinf(alpha_float):
return "Invalid input: alpha must be finite."
if not (0 < alpha_float < 1):
return "Invalid input: alpha must be between 0 and 1."
# Add intercept if requested
if fit_intercept:
x_with_const = sm.add_constant(x_array, has_constant='add')
else:
x_with_const = x_array
# Fit Hurdle model (uses same exog for both zero and count processes)
try:
model = statsmodels_hurdle_count_model(y_vec, x_with_const, dist=dist_lower, zerodist='poisson')
results = model.fit(disp=0)
except Exception as exc:
return f"statsmodels HurdleCountModel error: {exc}"
# Extract results
params = results.params
std_err = results.bse
z_stats = results.tvalues
p_values = results.pvalues
conf_int = results.conf_int(alpha=alpha_float)
# Check for NaN or Inf in results
if (np.any(np.isnan(params)) or np.any(np.isinf(params)) or
np.any(np.isnan(std_err)) or np.any(np.isinf(std_err)) or
np.any(np.isnan(z_stats)) or np.any(np.isinf(z_stats)) or
np.any(np.isnan(p_values)) or np.any(np.isinf(p_values)) or
np.any(np.isnan(conf_int)) or np.any(np.isinf(conf_int))):
return "statsmodels HurdleCountModel error: results contain NaN or Inf values."
# Build output table
# Section 1: Zero process (binary model for zeros)
output = [['zero_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper']]
# Get number of parameters (same for both processes)
num_params = x_with_const.shape[1]
# Add zero process parameter rows
for i in range(num_params):
if fit_intercept and i == 0:
param_name = 'intercept'
else:
predictor_idx = i if not fit_intercept else i - 1
param_name = f'x{predictor_idx + 1}'
output.append([
'',
param_name,
float(params[i]),
float(std_err[i]),
float(z_stats[i]),
float(p_values[i]),
float(conf_int[i, 0]),
float(conf_int[i, 1])
])
# Section 2: Count process (for positive counts)
output.append(['count_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper'])
# Add count process parameter rows
for i in range(num_params):
param_idx = num_params + i
if fit_intercept and i == 0:
param_name = 'intercept'
else:
predictor_idx = i if not fit_intercept else i - 1
param_name = f'x{predictor_idx + 1}'
output.append([
'',
param_name,
float(params[param_idx]),
float(std_err[param_idx]),
float(z_stats[param_idx]),
float(p_values[param_idx]),
float(conf_int[param_idx, 0]),
float(conf_int[param_idx, 1])
])
# Add model statistics
output.append(['log_likelihood', float(results.llf), '', '', '', '', '', ''])
output.append(['aic', float(results.aic), '', '', '', '', '', ''])
output.append(['bic', float(results.bic), '', '', '', '', '', ''])
return output