HURDLE_COUNT_MODEL

Overview

The HURDLE_COUNT_MODEL function fits a hurdle model for count data, a two-stage regression approach designed to handle datasets with excess zeros. Hurdle models are widely used in econometrics, ecology, and healthcare analytics where the outcome variable represents counts (e.g., number of doctor visits, species occurrences, or product purchases) and zeros occur more frequently than standard count distributions predict.

A hurdle model decomposes the data-generating process into two distinct components. First, a zero hurdle process (binary model) determines whether an observation “crosses the hurdle” from zero to a positive count. Second, a truncated count process models the magnitude of positive counts conditional on crossing the hurdle. Mathematically, the probability structure is:

\Pr(Y = 0) = \theta
\Pr(Y = k \mid Y > 0) = \frac{p(k)}{1 - p(0)} \quad \text{for } k = 1, 2, 3, \ldots

where \theta is the probability of a zero count from the hurdle process and p(k) is the probability mass function of the underlying count distribution (Poisson or Negative Binomial), truncated at zero.

This implementation uses the statsmodels library’s HurdleCountModel class, introduced in version 0.14.0. The function supports both Poisson and Negative Binomial distributions for the positive count component, with a Poisson distribution for the zero hurdle. The Negative Binomial option is useful when count data exhibit overdispersion (variance exceeding the mean).

The hurdle model differs from zero-inflated models in its treatment of zeros. While zero-inflated models assume zeros arise from a mixture of a point mass at zero and the count distribution, hurdle models treat all zeros as originating solely from the binary hurdle process. This distinction is important when the mechanism generating zeros is conceptually different from the count process—for example, whether someone makes any purchase versus how many items they buy.

The function returns coefficient estimates, standard errors, z-statistics, p-values, and confidence intervals for both processes, along with model fit statistics including log-likelihood, AIC, and BIC. The hurdle model concept was introduced by Cragg (1971) and extended for count data by Mullahy (1986).

This example function is provided as-is without any representation of accuracy.

Excel Usage

=HURDLE_COUNT_MODEL(y, x, x_zero, hurdle_dist, fit_intercept, alpha)
  • y (list[list], required): Dependent variable containing count data (column vector of non-negative integers).
  • x (list[list], required): Independent variables (predictors) for both zero and count processes.
  • x_zero (list[list], optional, default: null): Independent variables for zero process (must be same as x or None).
  • hurdle_dist (str, optional, default: “poisson”): Distribution for positive counts.
  • fit_intercept (bool, optional, default: true): If true, adds an intercept term to both processes.
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals (between 0 and 1).

Returns (list[list]): 2D list with model results and statistics, or error string.

Examples

Example 1: Basic Hurdle Poisson model with single predictor

Inputs:

y x
0 1
0 2
0 1.5
1 3
2 4
0 2.5
3 5
0 1
1 3.5
0 2

Excel formula:

=HURDLE_COUNT_MODEL({0;0;0;1;2;0;3;0;1;0}, {1;2;1.5;3;4;2.5;5;1;3.5;2})

Expected output:

zero_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -99.57 4537000 -0.00002195 1 -8892000 8892000
x1 35.72 1815000 0.00001968 1 -3557000 3557000
count_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -5.586 4.406 -1.268 0.2048 -14.22 3.049
x1 1.341 0.9308 1.441 0.1495 -0.4829 3.166
log_likelihood -3.109
aic 14.22
bic 15.43

Example 2: Hurdle model with larger dataset

Inputs:

y x
0 1
0 2
1 1.5
2 3
0 4
3 2.5
1 5
0 1
2 3.5
0 2
1 1.5
4 4.5
0 2
0 1
2 3

Excel formula:

=HURDLE_COUNT_MODEL({0;0;1;2;0;3;1;0;2;0;1;4;0;0;2}, {1;2;1.5;3;4;2.5;5;1;3.5;2;1.5;4.5;2;1;3})

Expected output:

"non-error"

Example 3: Hurdle model with 90% confidence intervals

Inputs:

y x alpha
0 1 0.1
0 2
0 1.5
1 3
2 4
0 2.5
3 5
0 1
1 3.5
0 2

Excel formula:

=HURDLE_COUNT_MODEL({0;0;0;1;2;0;3;0;1;0}, {1;2;1.5;3;4;2.5;5;1;3.5;2}, 0.1)

Expected output:

zero_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -99.57 4537000 -0.00002195 1 -7469000 7469000
x1 35.72 1815000 0.00001968 1 -2987000 2987000
count_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -5.586 4.406 -1.268 0.2048 -12.84 1.672
x1 1.341 0.9308 1.441 0.1495 -0.1894 2.872
log_likelihood -3.109
aic 14.22
bic 15.43

Example 4: Hurdle model with all arguments specified

Inputs:

y x hurdle_dist fit_intercept alpha
0 1 poisson true 0.1
0 2
0 1.5
1 3
2 4
0 2.5
3 5
0 1
1 3.5
0 2

Excel formula:

=HURDLE_COUNT_MODEL({0;0;0;1;2;0;3;0;1;0}, {1;2;1.5;3;4;2.5;5;1;3.5;2}, "poisson", TRUE, 0.1)

Expected output:

zero_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -99.57 4537000 -0.00002195 1 -7469000 7469000
x1 35.72 1815000 0.00001968 1 -2987000 2987000
count_process parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -5.586 4.406 -1.268 0.2048 -12.84 1.672
x1 1.341 0.9308 1.441 0.1495 -0.1894 2.872
log_likelihood -3.109
aic 14.22
bic 15.43

Python Code

import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.truncated_model import HurdleCountModel as statsmodels_hurdle_count_model

def hurdle_count_model(y, x, x_zero=None, hurdle_dist='poisson', fit_intercept=True, alpha=0.05):
    """
    Fits a Hurdle model for count data with two-stage process (zero vs. positive counts).

    See: https://www.statsmodels.org/stable/generated/statsmodels.discrete.truncated_model.HurdleCountModel.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable containing count data (column vector of non-negative integers).
        x (list[list]): Independent variables (predictors) for both zero and count processes.
        x_zero (list[list], optional): Independent variables for zero process (must be same as x or None). Default is None.
        hurdle_dist (str, optional): Distribution for positive counts. Valid options: Poisson, Negative Binomial. Default is 'poisson'.
        fit_intercept (bool, optional): If true, adds an intercept term to both processes. Default is True.
        alpha (float, optional): Significance level for confidence intervals (between 0 and 1). Default is 0.05.

    Returns:
        list[list]: 2D list with model results and statistics, or error string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    # Normalize inputs
    y = to2d(y)
    x = to2d(x)
    if x_zero is not None:
        x_zero = to2d(x_zero)

    # Validate y is a column vector
    try:
        y_array = np.array(y, dtype=float)
    except Exception:
        return "Invalid input: y must be a numeric 2D list."

    if y_array.ndim != 2:
        return "Invalid input: y must be a 2D list."

    if y_array.shape[1] != 1:
        return "Invalid input: y must be a column vector (single column)."

    y_vec = y_array.flatten()

    # Check for NaN or Inf in y
    if np.any(np.isnan(y_vec)) or np.any(np.isinf(y_vec)):
        return "Invalid input: y contains NaN or Inf values."

    # Check that y values are non-negative integers (count data)
    if np.any(y_vec < 0):
        return "Invalid input: y must contain non-negative values (count data)."

    # Check that y values are integers
    if not np.all(np.equal(np.mod(y_vec, 1), 0)):
        return "Invalid input: y must contain integer count values."

    # Validate x is a matrix
    try:
        x_array = np.array(x, dtype=float)
    except Exception:
        return "Invalid input: x must be a numeric 2D list."

    if x_array.ndim != 2:
        return "Invalid input: x must be a 2D list."

    # Check for NaN or Inf in x
    if np.any(np.isnan(x_array)) or np.any(np.isinf(x_array)):
        return "Invalid input: x contains NaN or Inf values."

    # Check dimensions match
    if x_array.shape[0] != y_vec.shape[0]:
        return "Invalid input: x and y must have the same number of rows."

    # Note: statsmodels HurdleCountModel uses the same exog for both processes
    # If x_zero is provided and different from x, return an error
    if x_zero is not None:
        try:
            x_zero_array = np.array(x_zero, dtype=float)
        except Exception:
            return "Invalid input: x_zero must be a numeric 2D list."

        if x_zero_array.ndim != 2:
            return "Invalid input: x_zero must be a 2D list."

        # Check for NaN or Inf in x_zero
        if np.any(np.isnan(x_zero_array)) or np.any(np.isinf(x_zero_array)):
            return "Invalid input: x_zero contains NaN or Inf values."

        # Check dimensions match
        if x_zero_array.shape[0] != y_vec.shape[0]:
            return "Invalid input: x_zero and y must have the same number of rows."

        # Check if x_zero is different from x
        if not np.array_equal(x_zero_array, x_array):
            return "Invalid input: statsmodels HurdleCountModel does not support different predictors for zero and count processes. x_zero must be the same as x or None."

    # Validate hurdle_dist
    if not isinstance(hurdle_dist, str):
        return "Invalid input: hurdle_dist must be a string."

    dist_lower = hurdle_dist.lower()
    if dist_lower not in ['poisson', 'negbin']:
        return "Invalid input: hurdle_dist must be 'poisson' or 'negbin'."

    # Validate fit_intercept
    if not isinstance(fit_intercept, bool):
        return "Invalid input: fit_intercept must be a boolean."

    # Validate alpha
    if not isinstance(alpha, (int, float)):
        return "Invalid input: alpha must be a number."

    alpha_float = float(alpha)
    if np.isnan(alpha_float) or np.isinf(alpha_float):
        return "Invalid input: alpha must be finite."

    if not (0 < alpha_float < 1):
        return "Invalid input: alpha must be between 0 and 1."

    # Add intercept if requested
    if fit_intercept:
        x_with_const = sm.add_constant(x_array, has_constant='add')
    else:
        x_with_const = x_array

    # Fit Hurdle model (uses same exog for both zero and count processes)
    try:
        model = statsmodels_hurdle_count_model(y_vec, x_with_const, dist=dist_lower, zerodist='poisson')
        results = model.fit(disp=0)
    except Exception as exc:
        return f"statsmodels HurdleCountModel error: {exc}"

    # Extract results
    params = results.params
    std_err = results.bse
    z_stats = results.tvalues
    p_values = results.pvalues
    conf_int = results.conf_int(alpha=alpha_float)

    # Check for NaN or Inf in results
    if (np.any(np.isnan(params)) or np.any(np.isinf(params)) or
        np.any(np.isnan(std_err)) or np.any(np.isinf(std_err)) or
        np.any(np.isnan(z_stats)) or np.any(np.isinf(z_stats)) or
        np.any(np.isnan(p_values)) or np.any(np.isinf(p_values)) or
        np.any(np.isnan(conf_int)) or np.any(np.isinf(conf_int))):
        return "statsmodels HurdleCountModel error: results contain NaN or Inf values."

    # Build output table
    # Section 1: Zero process (binary model for zeros)
    output = [['zero_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper']]

    # Get number of parameters (same for both processes)
    num_params = x_with_const.shape[1]

    # Add zero process parameter rows
    for i in range(num_params):
        if fit_intercept and i == 0:
            param_name = 'intercept'
        else:
            predictor_idx = i if not fit_intercept else i - 1
            param_name = f'x{predictor_idx + 1}'

        output.append([
            '',
            param_name,
            float(params[i]),
            float(std_err[i]),
            float(z_stats[i]),
            float(p_values[i]),
            float(conf_int[i, 0]),
            float(conf_int[i, 1])
        ])

    # Section 2: Count process (for positive counts)
    output.append(['count_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper'])

    # Add count process parameter rows
    for i in range(num_params):
        param_idx = num_params + i
        if fit_intercept and i == 0:
            param_name = 'intercept'
        else:
            predictor_idx = i if not fit_intercept else i - 1
            param_name = f'x{predictor_idx + 1}'

        output.append([
            '',
            param_name,
            float(params[param_idx]),
            float(std_err[param_idx]),
            float(z_stats[param_idx]),
            float(p_values[param_idx]),
            float(conf_int[param_idx, 0]),
            float(conf_int[param_idx, 1])
        ])

    # Add model statistics
    output.append(['log_likelihood', float(results.llf), '', '', '', '', '', ''])
    output.append(['aic', float(results.aic), '', '', '', '', '', ''])
    output.append(['bic', float(results.bic), '', '', '', '', '', ''])

    return output

Online Calculator