PROBIT_MODEL

Overview

The PROBIT_MODEL function fits a binary probit regression model using maximum likelihood estimation (MLE). Probit regression is a type of discrete choice model used to analyze binary outcomes—situations where the dependent variable takes one of two values (0 or 1), such as purchase decisions, event occurrences, or pass/fail outcomes.

Unlike logistic regression, which uses the logistic cumulative distribution function, probit regression models the probability of the binary outcome using the cumulative distribution function (CDF) of the standard normal distribution. The probability that the outcome equals 1 is given by:

P(Y = 1 | X) = \Phi(X\beta)

where \Phi(\cdot) is the standard normal CDF, X is the matrix of independent variables, and \beta is the vector of coefficients to be estimated.
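
As a quick numerical illustration of this formula, the sketch below evaluates \Phi(X\beta) with SciPy's standard normal CDF, plugging in the intercept and slope fitted in Example 1 below; the predictor value of 3 is chosen purely for illustration.

from scipy.stats import norm

beta0, beta1 = -4.466, 1.704             # intercept and slope fitted in Example 1 below
x_value = 3.0                            # illustrative predictor value

linear_index = beta0 + beta1 * x_value   # X @ beta
prob = norm.cdf(linear_index)            # P(Y = 1 | X) = Phi(X @ beta)
print(round(prob, 2))                    # roughly 0.74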

This implementation uses the statsmodels library, specifically the Probit class from the statsmodels.discrete.discrete_model module. For source code and additional details, see the statsmodels GitHub repository.
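
For reference, the core fitting step reduces to constructing a Probit model and calling its fit method. Below is a minimal sketch with a small hypothetical dataset; note that statsmodels does not add an intercept automatically, so add_constant prepends a column of ones.

import numpy as np
import statsmodels.api as sm

# Hypothetical binary response and single predictor
y = np.array([0, 0, 1, 0, 1, 1, 1])
X = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]).reshape(-1, 1)

X_const = sm.add_constant(X)                # prepend the intercept column
result = sm.Probit(y, X_const).fit(disp=0)  # maximum likelihood fit, quiet

print(result.params)      # estimated coefficients (intercept first)
print(result.prsquared)   # McFadden's pseudo R-squared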

The function returns parameter estimates (coefficients), standard errors, z-statistics, p-values, and confidence intervals for each predictor. It also provides model fit statistics including McFadden’s pseudo R-squared, log-likelihood, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), and the likelihood ratio test p-value. McFadden’s pseudo R-squared is calculated as:

\text{Pseudo } R^2 = 1 - \frac{\log L_{\text{full}}}{\log L_{\text{null}}}

where \log L_{\text{full}} is the log-likelihood of the fitted model and \log L_{\text{null}} is the log-likelihood of the null model (intercept only).
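
The pseudo R-squared reported in Example 1 below can be reproduced from this definition. For a binary outcome, the intercept-only (null) model simply predicts the sample proportion of ones (8 out of 15 observations in that example), so its log-likelihood follows directly; a small sketch:

import numpy as np

ll_full = -4.374             # fitted model log-likelihood from Example 1 below
n, n_ones = 15, 8            # observations and ones in Example 1's y
p = n_ones / n               # proportion predicted by the intercept-only model

ll_null = n_ones * np.log(p) + (n - n_ones) * np.log(1 - p)
pseudo_r2 = 1 - ll_full / ll_null
print(round(pseudo_r2, 3))   # roughly 0.578, matching Example 1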

Probit models are widely used in economics, social sciences, and biostatistics for analyzing binary choice behavior. For comprehensive references on discrete choice models, see Cameron & Trivedi’s Regression Analysis of Count Data (1998) and Greene’s Econometric Analysis (2003).

This example function is provided as-is without any representation of accuracy.

Excel Usage

=PROBIT_MODEL(y, x, fit_intercept, alpha)
  • y (list[list], required): Column vector of binary dependent variable values (0 or 1)
  • x (list[list], required): Matrix of independent variables (predictors), one column per predictor
  • fit_intercept (bool, optional, default: true): If true, adds an intercept term to the model
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals (between 0 and 1)

Returns (list[list]): 2D list with probit results and statistics, or error string.

Examples

Example 1: Basic single predictor model

Inputs:

y x
0 1
0 1.2
0 1.5
0 1.8
0 2
1 2.2
0 2.5
1 2.8
0 3
1 3.2
1 3.5
1 3.8
1 4
1 4.2
1 4.5

Excel formula:

=PROBIT_MODEL({0;0;0;0;0;1;0;1;0;1;1;1;1;1;1}, {1;1.2;1.5;1.8;2;2.2;2.5;2.8;3;3.2;3.5;3.8;4;4.2;4.5})

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -4.466 2.137 -2.09 0.0366 -8.654 -0.2784
x1 1.704 0.7943 2.145 0.03196 0.1469 3.261
pseudo_r_squared 0.5779
log_likelihood -4.374
aic 12.75
bic 14.16
llr_pvalue 0.000538

Example 2: Model without intercept

Inputs:

y x fit_intercept
0 1 false
0 1.2
0 1.5
0 1.8
0 2
1 2.2
0 2.5
1 2.8
0 3
1 3.2
1 3.5
1 3.8
1 4
1 4.2
1 4.5

Excel formula:

=PROBIT_MODEL({0;0;0;0;0;1;0;1;0;1;1;1;1;1;1}, {1;1.2;1.5;1.8;2;2.2;2.5;2.8;3;3.2;3.5;3.8;4;4.2;4.5}, FALSE)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
x1 0.158 0.1199 1.318 0.1875 -0.07696 0.393
pseudo_r_squared 0.08651
log_likelihood -9.467
aic 20.93
bic 21.64
llr_pvalue

Example 3: Custom alpha for confidence intervals

Inputs:

y x alpha
0 1 0.1
0 1.5
0 2
1 2
0 2.5
1 2.5
1 3
0 3
1 3.5
1 4
1 4.5
1 5

Excel formula:

=PROBIT_MODEL({0;0;0;1;0;1;1;0;1;1;1;1}, {1;1.5;2;2;2.5;2.5;3;3;3.5;4;4.5;5}, 0.1)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -2.976 1.746 -1.704 0.0883 -5.849 -0.104
x1 1.193 0.6663 1.79 0.07338 0.09701 2.289
pseudo_r_squared 0.3937
log_likelihood -4.942
aic 13.88
bic 14.85
llr_pvalue 0.0113

Example 4: Multiple predictors with all arguments

Inputs:

y x1 x2 fit_intercept alpha
0 1 1.5 true 0.05
0 1.5 2
0 2 1
0 2.5 2.5
0 2 3
1 2.2 2.8
0 3 2
1 2.8 3.2
0 3.2 2.5
1 3 3.5
0 3.5 3
1 3.5 3.8
1 4 3.5
1 4 4
1 4.5 4.2
1 4.2 4.5
1 5 4.8
1 5.5 5

Excel formula:

=PROBIT_MODEL({0;0;0;0;0;1;0;1;0;1;0;1;1;1;1;1;1;1}, {1,1.5;1.5,2;2,1;2.5,2.5;2,3;2.2,2.8;3,2;2.8,3.2;3.2,2.5;3,3.5;3.5,3;3.5,3.8;4,3.5;4,4;4.5,4.2;4.2,4.5;5,4.8;5.5,5}, TRUE, 0.05)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -11.02 6.889 -1.6 0.1095 -24.53 2.477
x1 -0.5517 1.011 -0.5459 0.5851 -2.533 1.429
x2 4.185 2.624 1.595 0.1107 -0.9579 9.327
pseudo_r_squared 0.75
log_likelihood -3.091
aic 12.18
bic 14.85
llr_pvalue 0.00009379

Python Code

import numpy as np
from statsmodels.discrete.discrete_model import Probit as statsmodels_probit

def probit_model(y, x, fit_intercept=True, alpha=0.05):
    """
    Fits a binary probit regression model using maximum likelihood estimation.

    See: https://www.statsmodels.org/stable/generated/statsmodels.discrete.discrete_model.Probit.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Column vector of binary dependent variable values (0 or 1)
        x (list[list]): Matrix of independent variables (predictors), one column per predictor
        fit_intercept (bool, optional): If true, adds an intercept term to the model. Default is True.
        alpha (float, optional): Significance level for confidence intervals (between 0 and 1). Default is 0.05.

    Returns:
        list[list]: 2D list with probit results and statistics, or error string.
    """
    def to2d(arr):
        return [[arr]] if not isinstance(arr, list) else arr

    def validate_numeric_2d(arr, name):
        arr = to2d(arr)
        if not isinstance(arr, list):
            return f"Invalid input: {name} must be a 2D list."
        if not all(isinstance(row, list) for row in arr):
            return f"Invalid input: {name} must be a 2D list."
        flat = []
        for row in arr:
            for val in row:
                if not isinstance(val, (int, float)):
                    return f"Invalid input: {name} must contain only numeric values."
                if np.isnan(val) or np.isinf(val):
                    return f"Invalid input: {name} contains non-finite values."
                flat.append(float(val))
        return arr, flat

    def validate_alpha(alpha_val):
        if not isinstance(alpha_val, (int, float)):
            return "Invalid input: alpha must be a number."
        alpha_val = float(alpha_val)
        if np.isnan(alpha_val) or np.isinf(alpha_val):
            return "Invalid input: alpha must be finite."
        if alpha_val <= 0 or alpha_val >= 1:
            return "Invalid input: alpha must be between 0 and 1."
        return alpha_val

    # Validate inputs
    y_result = validate_numeric_2d(y, "y")
    if isinstance(y_result, str):
        return y_result
    y_2d, y_flat = y_result

    x_result = validate_numeric_2d(x, "x")
    if isinstance(x_result, str):
        return x_result
    x_2d, x_flat = x_result

    if not isinstance(fit_intercept, bool):
        return "Invalid input: fit_intercept must be a boolean."

    alpha_val = validate_alpha(alpha)
    if isinstance(alpha_val, str):
        return alpha_val

    # Check y is a column vector
    if len(y_2d[0]) != 1:
        return "Invalid input: y must be a column vector (single column)."

    # Check y contains only 0s and 1s
    for val in y_flat:
        if val not in (0.0, 1.0):
            return "Invalid input: y must contain only binary values (0 or 1)."

    # Get number of observations and predictors
    n_obs = len(y_flat)
    n_rows_x = len(x_2d)
    if n_rows_x != n_obs:
        return "Invalid input: x and y must have the same number of rows."

    # Check all rows in x have the same number of columns
    n_cols_x = len(x_2d[0]) if x_2d else 0
    if not all(len(row) == n_cols_x for row in x_2d):
        return "Invalid input: all rows in x must have the same length."

    # Convert to matrix form
    try:
        y_array = np.array(y_flat).reshape(-1, 1)
        x_array = np.array(x_2d)

        # Add intercept if requested
        if fit_intercept:
            x_array = np.column_stack([np.ones(n_obs), x_array])

        # Fit probit model
        model = statsmodels_probit(y_array, x_array)
        result = model.fit(disp=0)

        # Get confidence intervals
        conf_int = result.conf_int(alpha=alpha_val)

        # Build output table
        output = [['parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper']]

        # Parameter names
        param_names = []
        if fit_intercept:
            param_names.append('intercept')
        for i in range(n_cols_x):
            param_names.append(f'x{i+1}')

        # Helper to convert NaN/inf to empty string
        def safe_float(val):
            fval = float(val)
            return '' if (np.isnan(fval) or np.isinf(fval)) else fval

        # Add parameter results
        for i, param_name in enumerate(param_names):
            output.append([
                param_name,
                safe_float(result.params[i]),
                safe_float(result.bse[i]),
                safe_float(result.tvalues[i]),
                safe_float(result.pvalues[i]),
                safe_float(conf_int[i, 0]),
                safe_float(conf_int[i, 1])
            ])

        # Add model statistics
        output.append(['pseudo_r_squared', safe_float(result.prsquared), '', '', '', '', ''])
        output.append(['log_likelihood', safe_float(result.llf), '', '', '', '', ''])
        output.append(['aic', safe_float(result.aic), '', '', '', '', ''])
        output.append(['bic', safe_float(result.bic), '', '', '', '', ''])
        output.append(['llr_pvalue', safe_float(result.llr_pvalue), '', '', '', '', ''])

        return output

    except Exception as exc:
        return f"statsmodels.discrete.discrete_model.Probit error: {exc}"
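
A brief usage sketch, calling probit_model above with the data from Example 1 (run in the same module so the function is in scope). The returned 2D list starts with the header row, followed by one row per parameter and the model fit statistics.

y_data = [[0], [0], [0], [0], [0], [1], [0], [1], [0], [1], [1], [1], [1], [1], [1]]
x_data = [[1], [1.2], [1.5], [1.8], [2], [2.2], [2.5], [2.8], [3],
          [3.2], [3.5], [3.8], [4], [4.2], [4.5]]

table = probit_model(y_data, x_data)   # defaults: fit_intercept=True, alpha=0.05
for row in table:
    print(row)
# First data row should be close to:
# ['intercept', -4.466, 2.137, -2.09, 0.0366, -8.654, -0.2784]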

Online Calculator