ZIP_MODEL

Overview

The ZIP_MODEL function fits a Zero-Inflated Poisson (ZIP) regression model for count data that exhibits an excess of zero observations beyond what a standard Poisson distribution would predict. Zero-inflated models are commonly used when analyzing data such as insurance claims (where non-policyholders always have zero claims), fish counts in lakes (where some lakes cannot support fish), or publication counts (where some individuals systematically do not publish).

The ZIP model, originally developed by Diane Lambert in 1992, combines two separate data-generating processes through a mixture distribution. The first process models the probability of being in a “structural zero” state, where zeros always occur. The second process follows a standard Poisson distribution that can generate zero or positive counts. The model is defined as:

\text{Pr}(Y = 0) = \pi + (1 - \pi)e^{-\lambda}

\text{Pr}(Y = y_i) = (1 - \pi)\frac{\lambda^{y_i}e^{-\lambda}}{y_i!}, \quad y_i = 1, 2, 3, \ldots

where \pi represents the probability of excess zeros (the inflation parameter), and \lambda is the expected Poisson count. The overall mean is (1 - \pi)\lambda and the variance is \lambda(1 - \pi)(1 + \pi\lambda), which exhibits overdispersion relative to a standard Poisson model.
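As a quick numerical check of the formulas above, the following minimal sketch evaluates the ZIP probability mass function in plain Python and compares the mean and variance computed from it with the closed-form expressions. The helper name zip_pmf and the values pi = 0.3 and lam = 2.5 are illustrative assumptions and are not part of ZIP_MODEL.

```python
import math

def zip_pmf(y, pi, lam):
    """ZIP probability of observing count y (hypothetical helper)."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    # Zeros come from both the structural-zero state and the Poisson process.
    return pi + (1 - pi) * poisson if y == 0 else (1 - pi) * poisson

pi, lam = 0.3, 2.5     # illustrative values, not estimates from any data
support = range(100)   # the Poisson tail beyond 100 is negligible for lam = 2.5
probs = [zip_pmf(y, pi, lam) for y in support]

total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
var = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

print(total)                                   # close to 1
print(mean, (1 - pi) * lam)                    # both close to 1.75
print(var, lam * (1 - pi) * (1 + pi * lam))    # both close to 3.0625
```

The variance (3.0625) exceeds the mean (1.75), which is the overdispersion described above.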

This implementation uses the statsmodels library, which estimates the count process (a Poisson regression with a log link) and the zero-inflation process (a binary regression for the structural-zero probability) jointly by maximum likelihood. statsmodels supports a logit or probit link for the inflation process; this function does not set the link, so the statsmodels default (logit) applies. For technical details, see the statsmodels ZeroInflatedPoisson documentation. The function accepts separate predictor variables for each process, so the mechanism that generates structural zeros can be modeled differently from the one that generates counts.
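To make the two linked regressions concrete, here is a minimal sketch of the log-likelihood that maximum likelihood estimation would maximize, assuming a log link for the Poisson mean and a logit link for the inflation probability. The function zip_loglike, its argument names, and the small data set are hypothetical; statsmodels performs the equivalent computation internally with its own optimizer.

```python
import math

def zip_loglike(y, x, x_inflate, beta, gamma):
    """Log-likelihood of a ZIP model with one predictor per process.

    beta  = (intercept, slope) for the count process (log link).
    gamma = (intercept, slope) for the inflation process (logit link).
    All names here are hypothetical, for illustration only.
    """
    total = 0.0
    for yi, xi, zi in zip(y, x, x_inflate):
        lam = math.exp(beta[0] + beta[1] * xi)
        pi = 1.0 / (1.0 + math.exp(-(gamma[0] + gamma[1] * zi)))
        if yi == 0:
            # A zero is either structural or a Poisson zero.
            total += math.log(pi + (1 - pi) * math.exp(-lam))
        else:
            # Positive counts can only come from the Poisson process.
            total += (math.log(1 - pi) - lam + yi * math.log(lam)
                      - math.lgamma(yi + 1))
    return total

y = [0, 2, 0, 3]            # made-up counts
x = [0.1, 0.8, 1.5, 2.2]    # made-up predictor, reused for both processes
ll = zip_loglike(y, x, x, beta=(0.0, 0.5), gamma=(0.5, -0.5))
print(ll)
```

A fitting routine would search over beta and gamma for the values that make this quantity as large as possible.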

This example function is provided as-is without any representation of accuracy.

Excel Usage

=ZIP_MODEL(y, x, x_inflate, fit_intercept, alpha)

y (list[list], required): Dependent variable as a column vector of count data (non-negative integers).
x (list[list], required): Independent variables for the count process. Each column is a predictor.
x_inflate (list[list], optional, default: null): Independent variables for the zero-inflation process. If omitted, the predictors in x are used.
fit_intercept (bool, optional, default: true): If true, adds an intercept term to both processes.
alpha (float, optional, default: 0.05): Significance level for confidence intervals (between 0 and 1).

Returns (list[list]): 2D list with model results and statistics, or error string.
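Because the return value is a flat 2D list, downstream code usually has to unpack it. The sketch below shows one way to do that, based on the layout visible in the examples on this page: a header row per process, parameter rows with an empty first cell, and one row per model statistic. The helper parse_zip_result and the abbreviated table with made-up numbers are assumptions for illustration only.

```python
def parse_zip_result(table):
    """Convert the 2D result list into nested dictionaries (hypothetical helper)."""
    if isinstance(table, str):
        raise ValueError(table)  # an error string is returned instead of a table
    parsed = {"count_process": {}, "inflate_process": {}, "stats": {}}
    section, columns = None, []
    for row in table:
        label = row[0]
        if label in ("count_process", "inflate_process"):
            section, columns = label, row[2:]    # header row opens a section
        elif label == "":
            parsed[section][row[1]] = dict(zip(columns, row[2:]))
        else:
            parsed["stats"][label] = row[1]      # e.g. log_likelihood, aic, bic
    return parsed

# Abbreviated table with made-up values, only to show the layout.
example = [
    ["count_process", "parameter", "coefficient", "std_error"],
    ["", "intercept", 0.5, 0.1],
    ["inflate_process", "parameter", "coefficient", "std_error"],
    ["", "intercept", -1.0, 0.2],
    ["log_likelihood", -12.3, "", ""],
]
result = parse_zip_result(example)
print(result["inflate_process"]["intercept"]["coefficient"])  # -1.0
print(result["stats"]["log_likelihood"])                       # -12.3
```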

Example 1: Basic ZIP with required arguments only

Inputs:

y	x
0	0.1
2	0.23
0	0.29
0	0.7
2	0.78
2	0.78
0	0.91
0	0.92
0	1
0	1.06
0	1.46
1	1.46
0	1.52
3	1.83
0	1.87
1	2.16
3	2.28
2	2.57
0	2.62
4	2.96
0	2.99
6	3.01
0	3.06
5	3.54
5	3.66
5	3.93
4	4.16
9	4.33
2	4.75
7	4.85

Excel formula:

=ZIP_MODEL({0;2;0;0;2;2;0;0;0;0;0;1;0;3;0;1;3;2;0;4;0;6;0;5;5;5;4;9;2;7}, {0.1;0.23;0.29;0.7;0.78;0.78;0.91;0.92;1;1.06;1.46;1.46;1.52;1.83;1.87;2.16;2.28;2.57;2.62;2.96;2.99;3.01;3.06;3.54;3.66;3.93;4.16;4.33;4.75;4.85})

Expected output:

count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	0.164819	0.395764	0.416459	0.677074	-0.610863	0.940502
	x1	0.352635	0.107672	3.27508	0.00105632	0.141601	0.563668
inflate_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	1.10973	0.896781	1.23746	0.215918	-0.647931	2.86738
	x1	-0.792803	0.393698	-2.01374	0.0440372	-1.56444	-0.02117
log_likelihood	-46.6346
aic	101.269
bic	106.874
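The aic and bic rows can be reproduced from the log-likelihood using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The sketch below assumes k = 4 estimated coefficients (an intercept and a slope for each process) and n = 30 observations, matching the inputs of this example.

```python
import math

log_likelihood = -46.6346  # from the output above
n_obs = 30                 # rows in the input data
n_params = 4               # two coefficients per process

aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

print(round(aic, 3))  # 101.269
print(round(bic, 3))  # 106.874
```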

Example 2: ZIP with separate zero-inflation predictors

Inputs:

y	x	x_inflate
0	0.1	0.05
2	0.23	0.12
0	0.29	0.15
0	0.7	0.35
2	0.78	0.39
2	0.78	0.39
0	0.91	0.46
0	0.92	0.46
0	1	0.5
0	1.06	0.53
0	1.46	0.73
1	1.46	0.73
0	1.52	0.76
3	1.83	0.92
0	1.87	0.94
1	2.16	1.08
3	2.28	1.14
2	2.57	1.29
0	2.62	1.31
4	2.96	1.48
0	2.99	1.5
6	3.01	1.51
0	3.06	1.53
5	3.54	1.77
5	3.66	1.83
5	3.93	1.97
4	4.16	2.08
9	4.33	2.17
2	4.75	2.38
7	4.85	2.43

Excel formula:

=ZIP_MODEL({0;2;0;0;2;2;0;0;0;0;0;1;0;3;0;1;3;2;0;4;0;6;0;5;5;5;4;9;2;7}, {0.1;0.23;0.29;0.7;0.78;0.78;0.91;0.92;1;1.06;1.46;1.46;1.52;1.83;1.87;2.16;2.28;2.57;2.62;2.96;2.99;3.01;3.06;3.54;3.66;3.93;4.16;4.33;4.75;4.85}, {0.05;0.12;0.15;0.35;0.39;0.39;0.46;0.46;0.5;0.53;0.73;0.73;0.76;0.92;0.94;1.08;1.14;1.29;1.31;1.48;1.5;1.51;1.53;1.77;1.83;1.97;2.08;2.17;2.38;2.43})

Expected output:

count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	0.164971	0.395745	0.416862	0.676779	-0.610674	0.940616
	x1	0.352597	0.107668	3.27484	0.0010572	0.141571	0.563623
inflate_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	1.11271	0.897563	1.2397	0.215085	-0.64648	2.8719
	x1	-1.58568	0.787153	-2.01445	0.0439628	-3.12847	-0.0428852
log_likelihood	-46.6318
aic	101.264
bic	106.868
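The z_statistic and p_value columns follow from the coefficient and its standard error. A minimal sketch, assuming a two-sided test against the standard normal distribution (consistent with the z_statistic column name), applied to the x1 row with coefficient -1.58568:

```python
from statistics import NormalDist

coefficient = -1.58568   # x1 row with standard error 0.787153 in the output above
std_error = 0.787153

z_statistic = coefficient / std_error
# Two-sided p-value under the standard normal reference distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z_statistic)))

print(z_statistic)  # about -2.014
print(p_value)      # about 0.044
```

Small differences in the last digits are expected because the table values are rounded.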

Example 3: ZIP with custom significance level (alpha=0.1)

Inputs:

y	x	alpha
0	0.1	0.1
2	0.23
0	0.29
0	0.7
2	0.78
2	0.78
0	0.91
0	0.92
0	1
0	1.06
0	1.46
1	1.46
0	1.52
3	1.83
0	1.87
1	2.16
3	2.28
2	2.57
0	2.62
4	2.96
0	2.99
6	3.01
0	3.06
5	3.54
5	3.66
5	3.93
4	4.16
9	4.33
2	4.75
7	4.85

Excel formula:

=ZIP_MODEL({0;2;0;0;2;2;0;0;0;0;0;1;0;3;0;1;3;2;0;4;0;6;0;5;5;5;4;9;2;7}, {0.1;0.23;0.29;0.7;0.78;0.78;0.91;0.92;1;1.06;1.46;1.46;1.52;1.83;1.87;2.16;2.28;2.57;2.62;2.96;2.99;3.01;3.06;3.54;3.66;3.93;4.16;4.33;4.75;4.85}, , , 0.1)

Expected output:

count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	0.164819	0.395764	0.416459	0.677074	-0.486154	0.815793
	x1	0.352635	0.107672	3.27508	0.00105632	0.17553	0.52974
inflate_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	1.10973	0.896781	1.23746	0.215918	-0.365346	2.5848
	x1	-0.792803	0.393698	-2.01374	0.0440372	-1.44038	-0.145228
log_likelihood	-46.6346
aic	101.269
bic	106.874
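The only columns that change with alpha are ci_lower and ci_upper. The sketch below shows how the interval responds to alpha, assuming the usual normal-approximation (Wald) form coefficient ± z × std_error. The helper wald_interval is hypothetical, and the inputs are taken from the row with coefficient 1.10973.

```python
from statistics import NormalDist

def wald_interval(coefficient, std_error, alpha):
    """Two-sided normal-approximation interval (hypothetical helper)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return coefficient - z_crit * std_error, coefficient + z_crit * std_error

# Row with coefficient 1.10973 and standard error 0.896781.
lower_90, upper_90 = wald_interval(1.10973, 0.896781, alpha=0.1)
lower_95, upper_95 = wald_interval(1.10973, 0.896781, alpha=0.05)

print(lower_90, upper_90)  # about -0.3653 and 2.5848 (this example)
print(lower_95, upper_95)  # about -0.6479 and 2.8674 (Example 1)
```

A larger alpha gives a narrower interval; the coefficient, standard error, z statistic, and p-value are unaffected.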

Example 4: ZIP with all arguments specified

Inputs:

y	x	x_inflate	fit_intercept	alpha
0	0.1	0.05	true	0.1
2	0.23	0.12
0	0.29	0.15
0	0.7	0.35
2	0.78	0.39
2	0.78	0.39
0	0.91	0.46
0	0.92	0.46
0	1	0.5
0	1.06	0.53
0	1.46	0.73
1	1.46	0.73
0	1.52	0.76
3	1.83	0.92
0	1.87	0.94
1	2.16	1.08
3	2.28	1.14
2	2.57	1.29
0	2.62	1.31
4	2.96	1.48
0	2.99	1.5
6	3.01	1.51
0	3.06	1.53
5	3.54	1.77
5	3.66	1.83
5	3.93	1.97
4	4.16	2.08
9	4.33	2.17
2	4.75	2.38
7	4.85	2.43

Excel formula:

=ZIP_MODEL({0;2;0;0;2;2;0;0;0;0;0;1;0;3;0;1;3;2;0;4;0;6;0;5;5;5;4;9;2;7}, {0.1;0.23;0.29;0.7;0.78;0.78;0.91;0.92;1;1.06;1.46;1.46;1.52;1.83;1.87;2.16;2.28;2.57;2.62;2.96;2.99;3.01;3.06;3.54;3.66;3.93;4.16;4.33;4.75;4.85}, {0.05;0.12;0.15;0.35;0.39;0.39;0.46;0.46;0.5;0.53;0.73;0.73;0.76;0.92;0.94;1.08;1.14;1.29;1.31;1.48;1.5;1.51;1.53;1.77;1.83;1.97;2.08;2.17;2.38;2.43}, TRUE, 0.1)

Expected output:

count_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	0.164971	0.395745	0.416862	0.676779	-0.485971	0.815913
	x1	0.352597	0.107668	3.27484	0.0010572	0.175498	0.529695
inflate_process	parameter	coefficient	std_error	z_statistic	p_value	ci_lower	ci_upper
	intercept	1.11271	0.897563	1.2397	0.215085	-0.363649	2.58907
	x1	-1.58568	0.787153	-2.01445	0.0439628	-2.88043	-0.290925
log_likelihood	-46.6318
aic	101.264
bic	106.868

Python Code

import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson as sm_ZeroInflatedPoisson

def zip_model(y, x, x_inflate=None, fit_intercept=True, alpha=0.05):
    """
    Fits a Zero-Inflated Poisson (ZIP) model for count data with excess zeros.

    See: https://www.statsmodels.org/stable/generated/statsmodels.discrete.count_model.ZeroInflatedPoisson.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector of count data (non-negative integers).
        x (list[list]): Independent variables for the count process. Each column is a predictor.
        x_inflate (list[list], optional): Independent variables for the zero-inflation process. If omitted, the predictors in x are used. Default is None.
        fit_intercept (bool, optional): If true, adds an intercept term to both processes. Default is True.
        alpha (float, optional): Significance level for confidence intervals (between 0 and 1). Default is 0.05.

    Returns:
        list[list]: 2D list with model results and statistics, or error string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    try:
        # Normalize inputs
        y = to2d(y)
        x = to2d(x)
        if x_inflate is not None:
            x_inflate = to2d(x_inflate)

        # Validate y is a column vector
        try:
            y_array = np.array(y, dtype=float)
        except Exception:
            return "Error: Invalid input: y must be a numeric 2D list."

        if y_array.ndim != 2:
            return "Error: Invalid input: y must be a 2D list."

        if y_array.shape[1] != 1:
            return "Error: Invalid input: y must be a column vector (single column)."

        y_vec = y_array.flatten()

        # Check for NaN or Inf in y
        if np.any(np.isnan(y_vec)) or np.any(np.isinf(y_vec)):
            return "Error: Invalid input: y contains NaN or Inf values."

        # Check that y values are non-negative integers (count data)
        if np.any(y_vec < 0):
            return "Error: Invalid input: y must contain non-negative values (count data)."

        # Check that y values are integers
        if not np.all(np.equal(np.mod(y_vec, 1), 0)):
            return "Error: Invalid input: y must contain integer count values."

        # Validate x is a matrix
        try:
            x_array = np.array(x, dtype=float)
        except Exception:
            return "Error: Invalid input: x must be a numeric 2D list."

        if x_array.ndim != 2:
            return "Error: Invalid input: x must be a 2D list."

        # Check for NaN or Inf in x
        if np.any(np.isnan(x_array)) or np.any(np.isinf(x_array)):
            return "Error: Invalid input: x contains NaN or Inf values."

        # Check dimensions match
        if x_array.shape[0] != y_vec.shape[0]:
            return "Error: Invalid input: x and y must have the same number of rows."

        # Validate x_inflate if provided
        if x_inflate is not None:
            try:
                x_inflate_array = np.array(x_inflate, dtype=float)
            except Exception:
                return "Error: Invalid input: x_inflate must be a numeric 2D list."

            if x_inflate_array.ndim != 2:
                return "Error: Invalid input: x_inflate must be a 2D list."

            # Check for NaN or Inf in x_inflate
            if np.any(np.isnan(x_inflate_array)) or np.any(np.isinf(x_inflate_array)):
                return "Error: Invalid input: x_inflate contains NaN or Inf values."

            # Check dimensions match
            if x_inflate_array.shape[0] != y_vec.shape[0]:
                return "Error: Invalid input: x_inflate and y must have the same number of rows."
        else:
            # Use same predictors for inflation process as count process
            x_inflate_array = x_array

        # Validate fit_intercept
        if not isinstance(fit_intercept, bool):
            return "Error: Invalid input: fit_intercept must be a boolean."

        # Validate alpha
        if not isinstance(alpha, (int, float)):
            return "Error: Invalid input: alpha must be a number."

        alpha_float = float(alpha)
        if np.isnan(alpha_float) or np.isinf(alpha_float):
            return "Error: Invalid input: alpha must be finite."

        if not (0 < alpha_float < 1):
            return "Error: Invalid input: alpha must be between 0 and 1."

        # Add intercept if requested
        if fit_intercept:
            x_with_const = sm.add_constant(x_array, has_constant='add')
            x_inflate_with_const = sm.add_constant(x_inflate_array, has_constant='add')
        else:
            x_with_const = x_array
            x_inflate_with_const = x_inflate_array

        # Fit ZIP model
        try:
            model = sm_ZeroInflatedPoisson(y_vec, x_with_const, exog_infl=x_inflate_with_const)
            results = model.fit(disp=False)
        except Exception as exc:
            return f"Error: statsmodels ZeroInflatedPoisson error: {exc}"

        # Extract results
        params = results.params
        std_err = results.bse
        z_stats = results.tvalues
        p_values = results.pvalues
        conf_int = results.conf_int(alpha=alpha_float)

        # Check for NaN or Inf in results
        if (np.any(np.isnan(params)) or np.any(np.isinf(params)) or
            np.any(np.isnan(std_err)) or np.any(np.isinf(std_err)) or
            np.any(np.isnan(z_stats)) or np.any(np.isinf(z_stats)) or
            np.any(np.isnan(p_values)) or np.any(np.isinf(p_values)) or
            np.any(np.isnan(conf_int)) or np.any(np.isinf(conf_int))):
            return "Error: statsmodels ZeroInflatedPoisson error: results contain NaN or Inf values."

        # Build output table.
        # statsmodels orders ZeroInflatedPoisson results with the inflation
        # coefficients first, followed by the count coefficients.
        columns = ['parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper']
        num_count_params = x_with_const.shape[1]
        num_inflate_params = x_inflate_with_const.shape[1]

        def param_rows(offset, num_params):
            # Build one row per coefficient, reading results from index offset onward.
            rows = []
            for i in range(num_params):
                if fit_intercept and i == 0:
                    param_name = 'intercept'
                else:
                    predictor_idx = i if not fit_intercept else i - 1
                    param_name = f'x{predictor_idx + 1}'
                idx = offset + i
                rows.append([
                    '',
                    param_name,
                    float(params[idx]),
                    float(std_err[idx]),
                    float(z_stats[idx]),
                    float(p_values[idx]),
                    float(conf_int[idx, 0]),
                    float(conf_int[idx, 1])
                ])
            return rows

        # Section 1: Count process (stored after the inflation coefficients)
        output = [['count_process'] + columns]
        output.extend(param_rows(num_inflate_params, num_count_params))

        # Section 2: Inflate process (stored first in the results arrays)
        output.append(['inflate_process'] + columns)
        output.extend(param_rows(0, num_inflate_params))

        # Add model statistics
        output.append(['log_likelihood', float(results.llf), '', '', '', '', '', ''])
        output.append(['aic', float(results.aic), '', '', '', '', '', ''])
        output.append(['bic', float(results.bic), '', '', '', '', '', ''])

        return output
    except Exception as e:
        return f"Error: {str(e)}"
