PROBIT_MODEL

Overview

The PROBIT_MODEL function fits a binary probit regression model using maximum likelihood estimation (MLE). Probit regression is a type of discrete choice model used to analyze binary outcomes—situations where the dependent variable takes one of two values (0 or 1), such as purchase decisions, event occurrences, or pass/fail outcomes.

Unlike logistic regression, which uses the logistic cumulative distribution function, probit regression models the probability of the binary outcome using the cumulative distribution function (CDF) of the standard normal distribution. The probability that the outcome equals 1 is given by:

P(Y = 1 | X) = \Phi(X\beta)

where \Phi(\cdot) is the standard normal CDF, X is the matrix of independent variables, and \beta is the vector of coefficients to be estimated.
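
As a minimal sketch of this formula, the snippet below evaluates \Phi(X\beta) for a single observation using only the Python standard library. The helper name probit_probability is hypothetical (it is not part of PROBIT_MODEL or statsmodels), and the coefficients are the rounded estimates reported in Example 1, used purely for illustration.

```python
from statistics import NormalDist


def probit_probability(x_row, beta):
    """Return P(Y = 1 | x) = Phi(x . beta) for one observation.

    x_row and beta are equal-length sequences; include a leading 1.0 in
    x_row when the model has an intercept.
    """
    linear_index = sum(x_i * b_i for x_i, b_i in zip(x_row, beta))
    return NormalDist().cdf(linear_index)  # standard normal CDF


# Rounded estimates from Example 1: intercept, then the slope for x1.
beta = [-4.46639, 1.70371]

for x1 in (1.0, 3.0, 4.5):
    p = probit_probability([1.0, x1], beta)
    print(f"x1 = {x1}: P(Y = 1) = {p:.4f}")
```

Because \Phi is monotone increasing, a positive coefficient means the predicted probability rises with the predictor, but the size of the change depends on where on the curve the observation sits.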

This implementation uses the statsmodels library, specifically the Probit class from the statsmodels.discrete.discrete_model module. For source code and additional details, see the statsmodels GitHub repository.

The function returns parameter estimates (coefficients), standard errors, z-statistics, p-values, and confidence intervals for each predictor. It also provides model fit statistics including McFadden’s pseudo R-squared, log-likelihood, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), and the likelihood ratio test p-value. McFadden’s pseudo R-squared is calculated as:

\text{Pseudo } R^2 = 1 - \frac{\log L_{\text{full}}}{\log L_{\text{null}}}

where \log L_{\text{full}} is the log-likelihood of the fitted model and \log L_{\text{null}} is the log-likelihood of the null model (intercept only).
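
The fit statistics can be checked by hand. The sketch below is an illustration, not a description of what the function does internally. It recomputes the null log-likelihood for the Example 1 outcomes from the share of ones (in an intercept-only binary model the fitted probability equals the sample proportion), takes the fitted log-likelihood reported in Example 1 as given, and applies the textbook formulas for McFadden's pseudo R-squared, AIC, BIC, and the likelihood ratio test. All variable names are illustrative.

```python
import math

# Outcome vector from Example 1 and the fitted log-likelihood it reports.
y = [0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
log_l_full = -4.37423  # copied from the Example 1 output (rounded)
n_params = 2           # intercept + one slope

n_obs = len(y)
n_ones = sum(y)
p_bar = n_ones / n_obs

# Intercept-only model: every observation gets probability p_bar.
log_l_null = n_ones * math.log(p_bar) + (n_obs - n_ones) * math.log(1 - p_bar)

pseudo_r2 = 1 - log_l_full / log_l_null
aic = 2 * n_params - 2 * log_l_full
bic = n_params * math.log(n_obs) - 2 * log_l_full

# Likelihood ratio test against the null model with 1 degree of freedom
# (one slope). For 1 df the chi-square survival function reduces to
# erfc(sqrt(statistic / 2)).
lr_stat = 2 * (log_l_full - log_l_null)
llr_pvalue = math.erfc(math.sqrt(lr_stat / 2))

print(f"log L_null = {log_l_null:.5f}")
print(f"pseudo R^2 = {pseudo_r2:.6f}")
print(f"AIC = {aic:.4f}, BIC = {bic:.4f}")
print(f"LLR p-value = {llr_pvalue:.6g}")
```

The values should agree with the Example 1 output up to the rounding of the log-likelihood that was typed in.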

Probit models are widely used in economics, social sciences, and biostatistics for analyzing binary choice behavior. For comprehensive references on discrete choice models, see Cameron & Trivedi’s Regression Analysis of Count Data (1998) and Greene’s Econometric Analysis (2003).

This example function is provided as-is without any representation of accuracy.

Excel Usage

=PROBIT_MODEL(y, x, fit_intercept, alpha)
  • y (list[list], required): Column vector of binary dependent variable values (0 or 1)
  • x (list[list], required): Matrix of independent variables (predictors), one column per predictor
  • fit_intercept (bool, optional, default: true): If true, adds an intercept term to the model
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals (between 0 and 1)
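
To make the role of alpha concrete, the following sketch rebuilds the z-statistic, two-sided p-value, and a (1 - alpha) normal-approximation (Wald) interval from a coefficient and its standard error. It uses the standard formulas and reproduces the x1 row of Example 3 up to rounding; wald_summary is a hypothetical helper name, not something PROBIT_MODEL exposes.

```python
from statistics import NormalDist


def wald_summary(coefficient, std_error, alpha=0.05):
    """Return (z_statistic, p_value, ci_lower, ci_upper).

    Uses the standard normal distribution: the interval is
    coefficient +/- z_crit * std_error with z_crit = Phi^{-1}(1 - alpha/2).
    """
    std_normal = NormalDist()
    z_stat = coefficient / std_error
    p_value = 2 * (1 - std_normal.cdf(abs(z_stat)))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    half_width = z_crit * std_error
    return z_stat, p_value, coefficient - half_width, coefficient + half_width


# x1 row of Example 3 (rounded values), with alpha = 0.1 -> 90% interval.
z, p, lower, upper = wald_summary(1.19292, 0.666264, alpha=0.1)
print(f"z = {z:.5f}, p = {p:.6f}, CI = [{lower:.6f}, {upper:.5f}]")
```

A smaller alpha gives a wider interval: with alpha = 0.05 the same coefficient and standard error yield an interval that contains the 90% one.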

Returns (list[list]): 2D list with probit results and statistics, or error string.

Example 1: Basic single predictor model

Inputs:

y x
0 1
0 1.2
0 1.5
0 1.8
0 2
1 2.2
0 2.5
1 2.8
0 3
1 3.2
1 3.5
1 3.8
1 4
1 4.2
1 4.5

Excel formula:

=PROBIT_MODEL({0;0;0;0;0;1;0;1;0;1;1;1;1;1;1}, {1;1.2;1.5;1.8;2;2.2;2.5;2.8;3;3.2;3.5;3.8;4;4.2;4.5})

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -4.46639 2.13678 -2.09024 0.036596 -8.65441 -0.278377
x1 1.70371 0.794304 2.14491 0.0319603 0.146901 3.26051
pseudo_r_squared 0.577934
log_likelihood -4.37423
aic 12.7485
bic 14.1646
llr_pvalue 0.000537964

Example 2: Model without intercept

Inputs:

y x fit_intercept
0 1 false
0 1.2
0 1.5
0 1.8
0 2
1 2.2
0 2.5
1 2.8
0 3
1 3.2
1 3.5
1 3.8
1 4
1 4.2
1 4.5

Excel formula:

=PROBIT_MODEL({0;0;0;0;0;1;0;1;0;1;1;1;1;1;1}, {1;1.2;1.5;1.8;2;2.2;2.5;2.8;3;3.2;3.5;3.8;4;4.2;4.5}, FALSE)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
x1 0.158044 0.1199 1.31813 0.187459 -0.0769553 0.393043
pseudo_r_squared 0.0865097
log_likelihood -9.46728
aic 20.9346
bic 21.6426
llr_pvalue

Example 3: Custom alpha for confidence intervals

Inputs:

y x alpha
0 1 0.1
0 1.5
0 2
1 2
0 2.5
1 2.5
1 3
0 3
1 3.5
1 4
1 4.5
1 5

Excel formula:

=PROBIT_MODEL({0;0;0;1;0;1;1;0;1;1;1;1}, {1;1.5;2;2;2.5;2.5;3;3;3.5;4;4.5;5}, TRUE, 0.1)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -2.9764 1.74629 -1.70441 0.0883049 -5.8488 -0.103999
x1 1.19292 0.666264 1.79046 0.0733798 0.0970132 2.28883
pseudo_r_squared 0.393651
log_likelihood -4.94193
aic 13.8839
bic 14.8537
llr_pvalue 0.0113048

Example 4: Multiple predictors with all arguments

Inputs:

y x fit_intercept alpha
0 1 1.5 true 0.05
0 1.5 2
0 2 1
0 2.5 2.5
0 2 3
1 2.2 2.8
0 3 2
1 2.8 3.2
0 3.2 2.5
1 3 3.5
0 3.5 3
1 3.5 3.8
1 4 3.5
1 4 4
1 4.5 4.2
1 4.2 4.5
1 5 4.8
1 5.5 5

Excel formula:

=PROBIT_MODEL({0;0;0;0;0;1;0;1;0;1;0;1;1;1;1;1;1;1}, {1,1.5;1.5,2;2,1;2.5,2.5;2,3;2.2,2.8;3,2;2.8,3.2;3.2,2.5;3,3.5;3.5,3;3.5,3.8;4,3.5;4,4;4.5,4.2;4.2,4.5;5,4.8;5.5,5}, TRUE, 0.05)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -11.0244 6.88853 -1.6004 0.109509 -24.5257 2.47684
x1 -0.551699 1.01064 -0.545893 0.585139 -2.53251 1.42911
x2 4.18473 2.62384 1.59489 0.110738 -0.957905 9.32737
pseudo_r_squared 0.750034
log_likelihood -3.0909
aic 12.1818
bic 14.8529
llr_pvalue 0.0000937943

Python Code

import numpy as np
from statsmodels.discrete.discrete_model import Probit as statsmodels_probit

def probit_model(y, x, fit_intercept=True, alpha=0.05):
    """
    Fits a binary probit regression model using maximum likelihood estimation.

    See: https://www.statsmodels.org/stable/generated/statsmodels.discrete.discrete_model.Probit.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Column vector of binary dependent variable values (0 or 1)
        x (list[list]): Matrix of independent variables (predictors), one column per predictor
        fit_intercept (bool, optional): If true, adds an intercept term to the model. Default is True.
        alpha (float, optional): Significance level for confidence intervals (between 0 and 1). Default is 0.05.

    Returns:
        list[list]: 2D list with probit results and statistics, or error string.
    """
    def to2d(arr):
        return [[arr]] if not isinstance(arr, list) else arr

    def validate_numeric_2d(arr, name):
        arr = to2d(arr)
        if not isinstance(arr, list):
            return f"Error: Invalid input: {name} must be a 2D list."
        if not all(isinstance(row, list) for row in arr):
            return f"Error: Invalid input: {name} must be a 2D list."
        flat = []
        for row in arr:
            for val in row:
                if not isinstance(val, (int, float)):
                    return f"Error: Invalid input: {name} must contain only numeric values."
                if np.isnan(val) or np.isinf(val):
                    return f"Error: Invalid input: {name} contains non-finite values."
                flat.append(float(val))
        return arr, flat

    def validate_alpha(alpha_val):
        if not isinstance(alpha_val, (int, float)):
            return "Error: Invalid input: alpha must be a number."
        alpha_val = float(alpha_val)
        if np.isnan(alpha_val) or np.isinf(alpha_val):
            return "Error: Invalid input: alpha must be finite."
        if alpha_val <= 0 or alpha_val >= 1:
            return "Error: Invalid input: alpha must be between 0 and 1."
        return alpha_val

    try:
        # Validate inputs
        y_result = validate_numeric_2d(y, "y")
        if isinstance(y_result, str):
            return y_result
        y_2d, y_flat = y_result

        x_result = validate_numeric_2d(x, "x")
        if isinstance(x_result, str):
            return x_result
        x_2d, x_flat = x_result

        if not isinstance(fit_intercept, bool):
            return "Error: Invalid input: fit_intercept must be a boolean."

        alpha_val = validate_alpha(alpha)
        if isinstance(alpha_val, str):
            return alpha_val

        # Check y is a non-empty column vector (every row holds exactly one value)
        if not y_2d or any(len(row) != 1 for row in y_2d):
            return "Error: Invalid input: y must be a column vector (single column)."

        # Check y contains only 0s and 1s
        for val in y_flat:
            if val not in (0.0, 1.0):
                return "Error: Invalid input: y must contain only binary values (0 or 1)."

        # Get number of observations and predictors
        n_obs = len(y_flat)
        n_rows_x = len(x_2d)
        if n_rows_x != n_obs:
            return "Error: Invalid input: x and y must have the same number of rows."

        # Check all rows in x have the same number of columns
        n_cols_x = len(x_2d[0]) if x_2d else 0
        if not all(len(row) == n_cols_x for row in x_2d):
            return "Error: Invalid input: all rows in x must have the same length."

        # Convert to matrix form
        try:
            y_array = np.array(y_flat)
            x_array = np.array(x_2d)

            # Add intercept if requested
            if fit_intercept:
                x_array = np.column_stack([np.ones(n_obs), x_array])

            # Fit probit model
            model = statsmodels_probit(y_array, x_array)
            result = model.fit(disp=0)

            # Get confidence intervals
            conf_int = result.conf_int(alpha=alpha_val)

            # Build output table
            output = [['parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper']]

            # Parameter names
            param_names = []
            if fit_intercept:
                param_names.append('intercept')
            for i in range(n_cols_x):
                param_names.append(f'x{i+1}')

            # Helper to convert NaN/inf to empty string
            def safe_float(val):
                fval = float(val)
                return '' if (np.isnan(fval) or np.isinf(fval)) else fval

            # Add parameter results
            for i, param_name in enumerate(param_names):
                output.append([
                    param_name,
                    safe_float(result.params[i]),
                    safe_float(result.bse[i]),
                    safe_float(result.tvalues[i]),
                    safe_float(result.pvalues[i]),
                    safe_float(conf_int[i, 0]),
                    safe_float(conf_int[i, 1])
                ])

            # Add model statistics
            output.append(['pseudo_r_squared', safe_float(result.prsquared), '', '', '', '', ''])
            output.append(['log_likelihood', safe_float(result.llf), '', '', '', '', ''])
            output.append(['aic', safe_float(result.aic), '', '', '', '', ''])
            output.append(['bic', safe_float(result.bic), '', '', '', '', ''])
            output.append(['llr_pvalue', safe_float(result.llr_pvalue), '', '', '', '', ''])

            return output

        except Exception as exc:
            return f"Error: statsmodels.discrete.discrete_model.Probit error: {exc}"
    except Exception as exc:
        return f"Error: {exc}"
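
Because the returned 2D list mixes per-parameter rows with single-value fit statistics, it can be convenient to split it before further use. The sketch below shows one way to do that. split_probit_output is a hypothetical helper, not part of the function above; it relies on the row layout produced by that code (header row first, then parameter rows, then the five statistics rows), and the sample table is simply the Example 1 output typed in by hand.

```python
STAT_ROWS = {"pseudo_r_squared", "log_likelihood", "aic", "bic", "llr_pvalue"}


def split_probit_output(table):
    """Split a probit_model result into (coefficients, statistics) dicts.

    Returns (None, None) when the function returned an error string.
    """
    if isinstance(table, str):
        return None, None
    header, *rows = table
    coefficients, statistics = {}, {}
    for row in rows:
        name = row[0]
        if name in STAT_ROWS:
            statistics[name] = row[1]  # value sits in the second column
        else:
            coefficients[name] = dict(zip(header[1:], row[1:]))
    return coefficients, statistics


# Example 1 output, entered manually for illustration.
example_table = [
    ["parameter", "coefficient", "std_error", "z_statistic", "p_value", "ci_lower", "ci_upper"],
    ["intercept", -4.46639, 2.13678, -2.09024, 0.036596, -8.65441, -0.278377],
    ["x1", 1.70371, 0.794304, 2.14491, 0.0319603, 0.146901, 3.26051],
    ["pseudo_r_squared", 0.577934, "", "", "", "", ""],
    ["log_likelihood", -4.37423, "", "", "", "", ""],
    ["aic", 12.7485, "", "", "", "", ""],
    ["bic", 14.1646, "", "", "", "", ""],
    ["llr_pvalue", 0.000537964, "", "", "", "", ""],
]

coefs, stats = split_probit_output(example_table)
print(coefs["x1"]["coefficient"], coefs["x1"]["p_value"])
print(stats["pseudo_r_squared"], stats["aic"])
```

Statistics that could not be computed arrive as empty strings (see the blank llr_pvalue in Example 2), so callers should check for '' before treating a value as a number.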
