HURDLE_COUNT_MODEL

Overview

The HURDLE_COUNT_MODEL function fits a hurdle model for count data, a two-stage regression approach designed to handle datasets with excess zeros. Hurdle models are widely used in econometrics, ecology, and healthcare analytics where the outcome variable represents counts (e.g., number of doctor visits, species occurrences, or product purchases) and zeros occur more frequently than standard count distributions predict.

A hurdle model decomposes the data-generating process into two distinct components. First, a zero hurdle process (binary model) determines whether an observation “crosses the hurdle” from zero to a positive count. Second, a truncated count process models the magnitude of positive counts conditional on crossing the hurdle. Mathematically, the probability structure is:

\Pr(Y = 0) = \theta \Pr(Y = k \mid Y > 0) = \frac{p(k)}{1 - p(0)} \quad \text{for } k = 1, 2, 3, \ldots

where \theta is the probability of a zero count from the hurdle process and p(k) is the probability mass function of the underlying count distribution (Poisson or Negative Binomial), truncated at zero.

This implementation uses the statsmodels library’s HurdleCountModel class, introduced in version 0.14.0. The function supports both Poisson and Negative Binomial distributions for the positive count component, with a Poisson distribution for the zero hurdle. The Negative Binomial option is useful when count data exhibit overdispersion (variance exceeding the mean).

The hurdle model differs from zero-inflated models in its treatment of zeros. While zero-inflated models assume zeros arise from a mixture of a point mass at zero and the count distribution, hurdle models treat all zeros as originating solely from the binary hurdle process. This distinction is important when the mechanism generating zeros is conceptually different from the count process—for example, whether someone makes any purchase versus how many items they buy.

The function returns coefficient estimates, standard errors, z-statistics, p-values, and confidence intervals for both processes, along with model fit statistics including log-likelihood, AIC, and BIC. The hurdle model concept was introduced by Cragg (1971) and extended for count data by Mullahy (1986).

This example function is provided as-is without any representation of accuracy.

Excel Usage

=HURDLE_COUNT_MODEL(y, x, x_zero, hurdle_dist, fit_intercept, alpha)

y (list[list], required): Dependent variable containing count data (column vector of non-negative integers).
x (list[list], required): Independent variables (predictors) for both zero and count processes.
x_zero (list[list], optional, default: null): Independent variables for zero process (must be same as x or None).
hurdle_dist (str, optional, default: “poisson”): Distribution for positive counts.
fit_intercept (bool, optional, default: true): If true, adds an intercept term to both processes.
alpha (float, optional, default: 0.05): Significance level for confidence intervals (between 0 and 1).

Returns (list[list]): 2D list with model results and statistics, or error string.

Example 1: Basic Hurdle Poisson model with single predictor

Inputs:

y	x
0	1
0	2
0	1.5
1	3
2	4
0	2.5
3	5
0	1
1	3.5
0	2

Excel formula:

=HURDLE_COUNT_MODEL({0;0;0;1;2;0;3;0;1;0}, {1;2;1.5;3;4;2.5;5;1;3.5;2})

Expected output:

zero_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	-99.5737	4538010	-0.0000219421	0.999982	-8894450	8894250
	x1	35.7216	1815210	0.000019679	0.999984	-3557720	3557790
count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	-5.58607	4.40564	-1.26794	0.204821	-14.221	3.04883
	x1	1.34145	0.930821	1.44115	0.149543	-0.482925	3.16583
log_likelihood	-3.10921
aic	14.2184
bic	15.4288

Example 2: Hurdle model with larger dataset

Inputs:

y	x
0	1
0	2
1	1.5
2	3
0	4
3	2.5
1	5
0	1
2	3.5
0	2
1	1.5
4	4.5
0	2
0	1
2	3

Excel formula:

=HURDLE_COUNT_MODEL({0;0;1;2;0;3;1;0;2;0;1;4;0;0;2}, {1;2;1.5;3;4;2.5;5;1;3.5;2;1.5;4.5;2;1;3})

Expected output:

zero_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	-1.72246	0.968735	-1.77805	0.0753949	-3.62115	0.176222
	x1	0.568418	0.312651	1.81806	0.0690548	-0.044366	1.1812
count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	-0.392018	1.02402	-0.382823	0.701851	-2.39906	1.61502
	x1	0.260199	0.275153	0.945652	0.344326	-0.279091	0.79949
log_likelihood	-18.6209
aic	45.2418
bic	48.074

Example 3: Hurdle model with 90% confidence intervals

Inputs:

y	x	alpha
0	1	0.1
0	2
0	1.5
1	3
2	4
0	2.5
3	5
0	1
1	3.5
0	2

Excel formula:

=HURDLE_COUNT_MODEL({0;0;0;1;2;0;3;0;1;0}, {1;2;1.5;3;4;2.5;5;1;3.5;2}, 0.1)

Expected output:

zero_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	-99.5737	4538010	-0.0000219421	0.999982	-7464470	7464270
	x1	35.7216	1815210	0.000019679	0.999984	-2985730	2985800
count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	-5.58607	4.40564	-1.26794	0.204821	-12.8327	1.66056
	x1	1.34145	0.930821	1.44115	0.149543	-0.189614	2.87252
log_likelihood	-3.10921
aic	14.2184
bic	15.4288

Example 4: Hurdle model with all arguments specified

Inputs:

y	x	hurdle_dist	fit_intercept	alpha
0	1	poisson	true	0.1
0	2
0	1.5
1	3
2	4
0	2.5
3	5
0	1
1	3.5
0	2

Excel formula:

=HURDLE_COUNT_MODEL({0;0;0;1;2;0;3;0;1;0}, {1;2;1.5;3;4;2.5;5;1;3.5;2}, "poisson", TRUE, 0.1)

Expected output:

zero_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	-99.5737	4538010	-0.0000219421	0.999982	-7464470	7464270
	x1	35.7216	1815210	0.000019679	0.999984	-2985730	2985800
count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	-5.58607	4.40564	-1.26794	0.204821	-12.8327	1.66056
	x1	1.34145	0.930821	1.44115	0.149543	-0.189614	2.87252
log_likelihood	-3.10921
aic	14.2184
bic	15.4288

Python Code

Show Code

import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.truncated_model import HurdleCountModel as statsmodels_hurdle_count_model

def hurdle_count_model(y, x, x_zero=None, hurdle_dist='poisson', fit_intercept=True, alpha=0.05):
    """
    Fits a Hurdle model for count data with two-stage process (zero vs. positive counts).

    See: https://www.statsmodels.org/stable/generated/statsmodels.discrete.truncated_model.HurdleCountModel.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable containing count data (column vector of non-negative integers).
        x (list[list]): Independent variables (predictors) for both zero and count processes.
        x_zero (list[list], optional): Independent variables for zero process (must be same as x or None). Default is None.
        hurdle_dist (str, optional): Distribution for positive counts. Valid options: Poisson, Negative Binomial. Default is 'poisson'.
        fit_intercept (bool, optional): If true, adds an intercept term to both processes. Default is True.
        alpha (float, optional): Significance level for confidence intervals (between 0 and 1). Default is 0.05.

    Returns:
        list[list]: 2D list with model results and statistics, or error string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    try:
        # Normalize inputs
        y = to2d(y)
        x = to2d(x)
        if x_zero is not None:
            x_zero = to2d(x_zero)

        # Validate y is a column vector
        try:
            y_array = np.array(y, dtype=float)
        except Exception:
            return "Error: Invalid input: y must be a numeric 2D list."

        if y_array.ndim != 2:
            return "Error: Invalid input: y must be a 2D list."

        if y_array.shape[1] != 1:
            return "Error: Invalid input: y must be a column vector (single column)."

        y_vec = y_array.flatten()

        # Check for NaN or Inf in y
        if np.any(np.isnan(y_vec)) or np.any(np.isinf(y_vec)):
            return "Error: Invalid input: y contains NaN or Inf values."

        # Check that y values are non-negative integers (count data)
        if np.any(y_vec < 0):
            return "Error: Invalid input: y must contain non-negative values (count data)."

        # Check that y values are integers
        if not np.all(np.equal(np.mod(y_vec, 1), 0)):
            return "Error: Invalid input: y must contain integer count values."

        # Validate x is a matrix
        try:
            x_array = np.array(x, dtype=float)
        except Exception:
            return "Error: Invalid input: x must be a numeric 2D list."

        if x_array.ndim != 2:
            return "Error: Invalid input: x must be a 2D list."

        # Check for NaN or Inf in x
        if np.any(np.isnan(x_array)) or np.any(np.isinf(x_array)):
            return "Error: Invalid input: x contains NaN or Inf values."

        # Check dimensions match
        if x_array.shape[0] != y_vec.shape[0]:
            return "Error: Invalid input: x and y must have the same number of rows."

        # Note: statsmodels HurdleCountModel uses the same exog for both processes
        # If x_zero is provided and different from x, return an error
        if x_zero is not None:
            try:
                x_zero_array = np.array(x_zero, dtype=float)
            except Exception:
                return "Error: Invalid input: x_zero must be a numeric 2D list."

            if x_zero_array.ndim != 2:
                return "Error: Invalid input: x_zero must be a 2D list."

            # Check for NaN or Inf in x_zero
            if np.any(np.isnan(x_zero_array)) or np.any(np.isinf(x_zero_array)):
                return "Error: Invalid input: x_zero contains NaN or Inf values."

            # Check dimensions match
            if x_zero_array.shape[0] != y_vec.shape[0]:
                return "Error: Invalid input: x_zero and y must have the same number of rows."

            # Check if x_zero is different from x
            if not np.array_equal(x_zero_array, x_array):
                return "Error: Invalid input: statsmodels HurdleCountModel does not support different predictors for zero and count processes. x_zero must be the same as x or None."

        # Validate hurdle_dist
        if not isinstance(hurdle_dist, str):
            return "Error: Invalid input: hurdle_dist must be a string."

        dist_lower = hurdle_dist.lower()
        if dist_lower not in ['poisson', 'negbin']:
            return "Error: Invalid input: hurdle_dist must be 'poisson' or 'negbin'."

        # Validate fit_intercept
        if not isinstance(fit_intercept, bool):
            return "Error: Invalid input: fit_intercept must be a boolean."

        # Validate alpha
        if not isinstance(alpha, (int, float)):
            return "Error: Invalid input: alpha must be a number."

        alpha_float = float(alpha)
        if np.isnan(alpha_float) or np.isinf(alpha_float):
            return "Error: Invalid input: alpha must be finite."

        if not (0 < alpha_float < 1):
            return "Error: Invalid input: alpha must be between 0 and 1."

        # Add intercept if requested
        if fit_intercept:
            x_with_const = sm.add_constant(x_array, has_constant='add')
        else:
            x_with_const = x_array

        # Fit Hurdle model (uses same exog for both zero and count processes)
        try:
            model = statsmodels_hurdle_count_model(y_vec, x_with_const, dist=dist_lower, zerodist='poisson')
            results = model.fit(disp=0)
        except Exception as exc:
            return f"Error: statsmodels HurdleCountModel error: {exc}"

        # Extract results
        params = results.params
        std_err = results.bse
        z_stats = results.tvalues
        p_values = results.pvalues
        conf_int = results.conf_int(alpha=alpha_float)

        # Check for NaN or Inf in results
        if (np.any(np.isnan(params)) or np.any(np.isinf(params)) or
            np.any(np.isnan(std_err)) or np.any(np.isinf(std_err)) or
            np.any(np.isnan(z_stats)) or np.any(np.isinf(z_stats)) or
            np.any(np.isnan(p_values)) or np.any(np.isinf(p_values)) or
            np.any(np.isnan(conf_int)) or np.any(np.isinf(conf_int))):
            return "Error: statsmodels HurdleCountModel error: results contain NaN or Inf values."

        # Build output table
        # Section 1: Zero process (binary model for zeros)
        output = [['zero_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper']]

        # Get number of parameters (same for both processes)
        num_params = x_with_const.shape[1]

        # Add zero process parameter rows
        for i in range(num_params):
            if fit_intercept and i == 0:
                param_name = 'intercept'
            else:
                predictor_idx = i if not fit_intercept else i - 1
                param_name = f'x{predictor_idx + 1}'

            output.append([
                '',
                param_name,
                float(params[i]),
                float(std_err[i]),
                float(z_stats[i]),
                float(p_values[i]),
                float(conf_int[i, 0]),
                float(conf_int[i, 1])
            ])

        # Section 2: Count process (for positive counts)
        output.append(['count_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper'])

        # Add count process parameter rows
        for i in range(num_params):
            param_idx = num_params + i
            if fit_intercept and i == 0:
                param_name = 'intercept'
            else:
                predictor_idx = i if not fit_intercept else i - 1
                param_name = f'x{predictor_idx + 1}'

            output.append([
                '',
                param_name,
                float(params[param_idx]),
                float(std_err[param_idx]),
                float(z_stats[param_idx]),
                float(p_values[param_idx]),
                float(conf_int[param_idx, 0]),
                float(conf_int[param_idx, 1])
            ])

        # Add model statistics
        output.append(['log_likelihood', float(results.llf), '', '', '', '', '', ''])
        output.append(['aic', float(results.aic), '', '', '', '', '', ''])
        output.append(['bic', float(results.bic), '', '', '', '', '', ''])

        return output
    except Exception as e:
        return f"Error: {str(e)}"

Online Calculator

y *

Dependent variable containing count data (column vector of non-negative integers).

x *

Independent variables (predictors) for both zero and count processes.

x_zero

Independent variables for zero process (must be same as x or None).

hurdle_dist

Distribution for positive counts.

fit_intercept

If true, adds an intercept term to both processes.

alpha

Significance level for confidence intervals (between 0 and 1).