ZINB_MODEL

Overview

The ZINB_MODEL function fits a Zero-Inflated Negative Binomial (ZINB) regression model for count data that exhibits both overdispersion and excess zeros. Zero-inflated models are appropriate when the observed number of zeros exceeds what would be expected from a standard count distribution, a situation that arises in many real-world scenarios such as insurance claims, healthcare utilization, ecological surveys, and manufacturing defects.

The ZINB model is a mixture distribution that combines two data-generating processes. The first process, modeled with a binary outcome (logit or probit), determines whether an observation is a “structural zero” (always zero) or comes from the count process. The second process uses a negative binomial distribution to generate counts, which naturally accommodates overdispersion where the variance exceeds the mean. This two-part structure allows the model to handle both the excess zeros and the overdispersion commonly observed in count data.

This implementation uses the statsmodels library, specifically the ZeroInflatedNegativeBinomialP class. The function estimates separate coefficients for the count process (predicting the magnitude of non-zero counts) and the inflation process (predicting the probability of structural zeros). The negative binomial component includes a dispersion parameter that controls the variance structure, with the P-parameterization allowing flexibility in modeling different types of overdispersion. For more information, see the statsmodels ZINB documentation and the statsmodels GitHub repository.

The mathematical formulation for the ZINB model is:

\Pr(Y = 0) = \pi + (1 - \pi) \left(\frac{\alpha}{\mu + \alpha}\right)^{\alpha}

\Pr(Y = y_i) = (1 - \pi) \frac{\Gamma(y_i + \alpha)}{\Gamma(\alpha) \cdot y_i!} \left(\frac{\alpha}{\mu + \alpha}\right)^{\alpha} \left(\frac{\mu}{\mu + \alpha}\right)^{y_i}, \quad y_i = 1, 2, 3, \ldots

where \pi is the probability of being in the zero-inflated group (structural zero), \mu is the expected count from the negative binomial process, and \alpha is the dispersion parameter.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=ZINB_MODEL(y, x, x_inflate, fit_intercept, alpha)

y (list[list], required): Dependent variable containing overdispersed count data (non-negative integers).
x (list[list], required): Independent variables for the count process. Each column represents a predictor.
x_inflate (list[list], optional, default: null): Independent variables for the zero-inflation process. If not specified, defaults to the same variables as x.
fit_intercept (bool, optional, default: true): If true, adds an intercept term to both the count and zero-inflation processes.
alpha (float, optional, default: 0.05): Significance level for confidence intervals (must be between 0 and 1).

Returns (list[list]): 2D list with model results and statistics, or error string.

Example 1: Basic ZINB model with single predictor

Inputs:

x	y
0	0
0.5	0
1	0
1.5	1
2	2
2.5	3
3	0
3.5	1
4	4
4.5	5
5	0
5.5	2
6	3
6.5	6
7	0
7.5	1
8	2
8.5	7
9	8
9.5	4

Excel formula:

=ZINB_MODEL({0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0;0;0;1;2;3;0;1;4;5;0;2;3;6;0;1;2;7;8;4})

Expected output:

count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	0.378496	1.34714	0.280963	0.778739	-2.26184	3.01884
	x1	-0.406692	0.339118	-1.19927	0.230424	-1.07135	0.257966
inflate_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	0.224119	0.515049	0.43514	0.663461	-0.78536	1.2336
	x1	0.15726	0.0723068	2.1749	0.0296376	0.0155413	0.298979
model_statistics
log_likelihood	-36.4553
aic	82.9105
bic	87.8892
dispersion	0.0657159

Example 2: ZINB with separate zero-inflation predictors

Inputs:

x	x_inflate	y
0	0.5	0
0.5	0.8	0
1	1.1	1
1.5	1.4	0
2	1.7	2
2.5	2	3
3	2.3	0
3.5	2.6	1
4	2.9	4
4.5	3.1999999999999997	5
5	3.5	0
5.5	3.8	2
6	4.1	3
6.5	4.4	0
7	4.7	6
7.5	5	1
8	5.3	2
8.5	5.6	7
9	5.8999999999999995	0
9.5	6.2	8

Excel formula:

=ZINB_MODEL({0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0.5;0.8;1.1;1.4;1.7;2;2.3;2.6;2.9;3.1999999999999997;3.5;3.8;4.1;4.4;4.7;5;5.3;5.6;5.8999999999999995;6.2}, {0;0;1;0;2;3;0;1;4;5;0;2;3;0;6;1;2;7;0;8})

Expected output:

count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	0.652489	1.26525	0.515701	0.606063	-1.82735	3.13233
	x1	-0.393667	0.345144	-1.14059	0.254041	-1.07014	0.282803
inflate_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	0.185989	0.46554	0.399512	0.689516	-0.726453	1.09843
	x1	0.17244	0.0675974	2.55099	0.0107417	0.0399519	0.304929
model_statistics
log_likelihood	-36.4821
aic	82.9642
bic	87.9428
dispersion	0.0105111

Example 3: ZINB model without intercept term

Inputs:

fit_intercept	x	y
false	0	0
	0.5	0
	1	0
	1.5	1
	2	2
	2.5	3
	3	0
	3.5	1
	4	4
	4.5	5
	5	0
	5.5	2
	6	3
	6.5	6
	7	0
	7.5	1
	8	2
	8.5	7
	9	8
	9.5	4

Excel formula:

=ZINB_MODEL(FALSE, {0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0;0;0;1;2;3;0;1;4;5;0;2;3;6;0;1;2;7;8;4})

Expected output:

count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	x1	-0.358501	0.227977	-1.57253	0.115828	-0.805328	0.0883261
inflate_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	x1	0.186602	0.026464	7.05115	1.77446e-12	0.134733	0.238471
model_statistics
log_likelihood	-36.5492
aic	79.0985
bic	82.0857
dispersion	0.0907271

Example 4: ZINB model with 90% confidence intervals

Inputs:

alpha	x	y
0.1	0	0
	0.5	0
	1	0
	1.5	1
	2	2
	2.5	3
	3	0
	3.5	1
	4	4
	4.5	5
	5	0
	5.5	2
	6	3
	6.5	6
	7	0
	7.5	1
	8	2
	8.5	7
	9	8
	9.5	4

Excel formula:

=ZINB_MODEL(0.1, {0;0.5;1;1.5;2;2.5;3;3.5;4;4.5;5;5.5;6;6.5;7;7.5;8;8.5;9;9.5}, {0;0;0;1;2;3;0;1;4;5;0;2;3;6;0;1;2;7;8;4})

Expected output:

count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	0.378496	1.34714	0.280963	0.778739	-1.83735	2.59434
	x1	-0.406692	0.339118	-1.19927	0.230424	-0.964491	0.151107
inflate_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	0.224119	0.515049	0.43514	0.663461	-0.623062	1.0713
	x1	0.15726	0.0723068	2.1749	0.0296376	0.0383259	0.276194
model_statistics
log_likelihood	-36.4553
aic	82.9105
bic	87.8892
dispersion	0.0657159

Python Code

Show Code

import math
import warnings
import numpy as np
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP as statsmodels_zinb

def zinb_model(y, x, x_inflate=None, fit_intercept=True, alpha=0.05):
    """
    Fits a Zero-Inflated Negative Binomial (ZINB) model for overdispersed count data with excess zeros.

    See: https://www.statsmodels.org/stable/generated/statsmodels.discrete.count_model.ZeroInflatedNegativeBinomialP.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable containing overdispersed count data (non-negative integers).
        x (list[list]): Independent variables for the count process. Each column represents a predictor.
        x_inflate (list[list], optional): Independent variables for the zero-inflation process. If not specified, defaults to the same variables as x. Default is None.
        fit_intercept (bool, optional): If true, adds an intercept term to both the count and zero-inflation processes. Default is True.
        alpha (float, optional): Significance level for confidence intervals (must be between 0 and 1). Default is 0.05.

    Returns:
        list[list]: 2D list with model results and statistics, or error string.
    """
    def to2d(arr):
        return [[arr]] if not isinstance(arr, list) else arr

    def validate_numeric_2d(arr, name):
        if not isinstance(arr, list):
            return f"Error: Invalid input: {name} must be a 2D list."
        if not arr:
            return f"Error: Invalid input: {name} cannot be empty."
        if not all(isinstance(row, list) for row in arr):
            return f"Error: Invalid input: {name} must be a 2D list."
        for i, row in enumerate(arr):
            if not row:
                return f"Error: Invalid input: {name} row {i} cannot be empty."
            for j, val in enumerate(row):
                if isinstance(val, bool) or not isinstance(val, (int, float)):
                    return f"Error: Invalid input: {name}[{i}][{j}] must be numeric."
                if math.isnan(val) or math.isinf(val):
                    return f"Error: Invalid input: {name}[{i}][{j}] must be finite."
        return None

    try:
        # Normalize inputs
        y = to2d(y)
        x = to2d(x)
        if x_inflate is not None:
            x_inflate = to2d(x_inflate)

        # Validate inputs
        error = validate_numeric_2d(y, "y")
        if error:
            return error
        error = validate_numeric_2d(x, "x")
        if error:
            return error
        if x_inflate is not None:
            error = validate_numeric_2d(x_inflate, "x_inflate")
            if error:
                return error

        # Validate y is column vector
        if len(y[0]) != 1:
            return "Error: Invalid input: y must be a column vector (single column)."

        # Check dimensions
        n_obs = len(y)
        if len(x) != n_obs:
            return f"Error: Invalid input: y and x must have same number of rows ({n_obs} vs {len(x)})."
        if x_inflate is not None and len(x_inflate) != n_obs:
            return f"Error: Invalid input: y and x_inflate must have same number of rows ({n_obs} vs {len(x_inflate)})."

        # Validate alpha
        if not isinstance(alpha, (int, float)):
            return "Error: Invalid input: alpha must be numeric."
        if math.isnan(alpha) or math.isinf(alpha):
            return "Error: Invalid input: alpha must be finite."
        if alpha <= 0 or alpha >= 1:
            return "Error: Invalid input: alpha must be between 0 and 1."

        # Validate fit_intercept
        if not isinstance(fit_intercept, bool):
            return "Error: Invalid input: fit_intercept must be boolean."

        # Convert to flat lists
        y_vec = [row[0] for row in y]

        # Validate y contains non-negative integers
        for i, val in enumerate(y_vec):
            if val < 0:
                return f"Error: Invalid input: y[{i}] must be non-negative."
            if val % 1 != 0:
                return f"Error: Invalid input: y[{i}] must be an integer count."

        # Build design matrices
        try:
            y_arr = np.array(y_vec)
            x_arr = np.array(x)

            if fit_intercept:
                x_arr = np.column_stack([np.ones(n_obs), x_arr])

            if x_inflate is None:
                x_inflate_arr = x_arr
            else:
                x_inflate_arr = np.array(x_inflate)
                if fit_intercept:
                    x_inflate_arr = np.column_stack([np.ones(n_obs), x_inflate_arr])

            # Fit model
            model = statsmodels_zinb(y_arr, x_arr, exog_infl=x_inflate_arr)

            # Try fitting with different methods, suppressing convergence warnings
            result = None
            for method in ['bfgs', 'newton', 'nm']:
                try:
                    with warnings.catch_warnings():
                        warnings.simplefilter("ignore")
                        result = model.fit(disp=False, method=method, maxiter=5000, warn_convergence=False)
                    # Check if we got valid standard errors
                    if result.bse is not None and not np.any(np.isnan(result.bse)):
                        break
                except Exception:
                    continue

            if result is None:
                return "Error: Model fitting error: Unable to fit model with any optimization method."

            # Extract parameters
            n_count = x_arr.shape[1]
            count_params = result.params[:n_count]
            inflate_params = result.params[n_count:-1]  # Exclude dispersion
            dispersion = result.params[-1]

            # Get confidence intervals - handle case where Hessian inversion failed
            try:
                with warnings.catch_warnings():
                    warnings.simplefilter("ignore")
                    conf_int = result.conf_int(alpha=alpha)
            except Exception:
                conf_int = np.full((len(result.params), 2), '')

            # Get standard errors, z-stats, p-values - handle NaN
            def safe_extract(arr, idx):
                if arr is None:
                    return ''
                if idx >= len(arr):
                    return ''
                val = arr[idx]
                if isinstance(val, (int, float)):
                    if math.isnan(val) or math.isinf(val):
                        return ''
                    return float(val)
                return ''

            bse = result.bse if result.bse is not None else []
            tvalues = result.tvalues if result.tvalues is not None else []
            pvalues = result.pvalues if result.pvalues is not None else []

            # Build output
            output = []

            # Section 1: Count process header
            output.append(['count_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper'])

            # Section 1: Count process rows
            for i in range(n_count):
                param_name = f'x{i+1}' if not fit_intercept or i > 0 else 'intercept'
                if fit_intercept and i > 0:
                    param_name = f'x{i}'
                ci_lower = ''
                ci_upper = ''
                if len(conf_int) > i:
                    ci_lower = float(conf_int[i, 0]) if isinstance(conf_int[i, 0], (int, float)) and not math.isnan(conf_int[i, 0]) else ''
                    ci_upper = float(conf_int[i, 1]) if isinstance(conf_int[i, 1], (int, float)) and not math.isnan(conf_int[i, 1]) else ''
                output.append([
                    '',
                    param_name,
                    float(count_params[i]),
                    safe_extract(bse, i),
                    safe_extract(tvalues, i),
                    safe_extract(pvalues, i),
                    ci_lower,
                    ci_upper
                ])

            # Section 2: Inflate process header
            output.append(['inflate_process', 'parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper'])

            # Section 2: Inflate process rows
            n_inflate = len(inflate_params)
            for i in range(n_inflate):
                param_name = f'x{i+1}' if not fit_intercept or i > 0 else 'intercept'
                if fit_intercept and i > 0:
                    param_name = f'x{i}'
                idx = n_count + i
                ci_lower = ''
                ci_upper = ''
                if len(conf_int) > idx:
                    ci_lower = float(conf_int[idx, 0]) if isinstance(conf_int[idx, 0], (int, float)) and not math.isnan(conf_int[idx, 0]) else ''
                    ci_upper = float(conf_int[idx, 1]) if isinstance(conf_int[idx, 1], (int, float)) and not math.isnan(conf_int[idx, 1]) else ''
                output.append([
                    '',
                    param_name,
                    float(inflate_params[i]),
                    safe_extract(bse, idx),
                    safe_extract(tvalues, idx),
                    safe_extract(pvalues, idx),
                    ci_lower,
                    ci_upper
                ])

            # Model statistics
            output.append(['model_statistics', '', '', '', '', '', '', ''])
            output.append(['log_likelihood', float(result.llf) if hasattr(result, 'llf') else '', '', '', '', '', '', ''])
            output.append(['aic', float(result.aic) if hasattr(result, 'aic') else '', '', '', '', '', '', ''])
            output.append(['bic', float(result.bic) if hasattr(result, 'bic') else '', '', '', '', '', '', ''])
            output.append(['dispersion', float(dispersion), '', '', '', '', '', ''])

            return output

        except Exception as e:
            return f"Error: Model fitting error: {str(e)}"
    except Exception as e:
        return f"Error: {str(e)}"

Online Calculator

y *

Dependent variable containing overdispersed count data (non-negative integers).

x *

Independent variables for the count process. Each column represents a predictor.

x_inflate

Independent variables for the zero-inflation process. If not specified, defaults to the same variables as x.

fit_intercept

If true, adds an intercept term to both the count and zero-inflation processes.

alpha

Significance level for confidence intervals (must be between 0 and 1).