PROBIT_MODEL

Overview

The PROBIT_MODEL function fits a binary probit regression model using maximum likelihood estimation (MLE). Probit regression is a type of discrete choice model used to analyze binary outcomes—situations where the dependent variable takes one of two values (0 or 1), such as purchase decisions, event occurrences, or pass/fail outcomes.

Unlike logistic regression, which uses the logistic cumulative distribution function, probit regression models the probability of the binary outcome using the cumulative distribution function (CDF) of the standard normal distribution. The probability that the outcome equals 1 is given by:

P(Y = 1 | X) = \Phi(X\beta)

where \Phi(\cdot) is the standard normal CDF, X is the matrix of independent variables, and \beta is the vector of coefficients to be estimated.
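
As a quick numerical illustration of this formula, the sketch below evaluates \Phi(X\beta) with SciPy's standard normal CDF, plugging in the intercept and slope fitted in Example 1 below; the predictor value of 3 is chosen purely for illustration.

from scipy.stats import norm

beta0, beta1 = -4.466, 1.704             # intercept and slope fitted in Example 1 below
x_value = 3.0                            # illustrative predictor value

linear_index = beta0 + beta1 * x_value   # X @ beta
prob = norm.cdf(linear_index)            # P(Y = 1 | X) = Phi(X @ beta)
print(round(prob, 2))                    # roughly 0.74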

This implementation uses the statsmodels library, specifically the Probit class from the statsmodels.discrete.discrete_model module. For source code and additional details, see the statsmodels GitHub repository.
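
For reference, the core fitting step reduces to constructing a Probit model and calling its fit method. Below is a minimal sketch with a small hypothetical dataset; note that statsmodels does not add an intercept automatically, so add_constant prepends a column of ones.

import numpy as np
import statsmodels.api as sm

# Hypothetical binary response and single predictor
y = np.array([0, 0, 1, 0, 1, 1, 1])
X = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]).reshape(-1, 1)

X_const = sm.add_constant(X)                # prepend the intercept column
result = sm.Probit(y, X_const).fit(disp=0)  # maximum likelihood fit, quiet

print(result.params)      # estimated coefficients (intercept first)
print(result.prsquared)   # McFadden's pseudo R-squared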

The function returns parameter estimates (coefficients), standard errors, z-statistics, p-values, and confidence intervals for each predictor. It also provides model fit statistics including McFadden’s pseudo R-squared, log-likelihood, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), and the likelihood ratio test p-value. McFadden’s pseudo R-squared is calculated as:

\text{Pseudo } R^2 = 1 - \frac{\log L_{\text{full}}}{\log L_{\text{null}}}

where \log L_{\text{full}} is the log-likelihood of the fitted model and \log L_{\text{null}} is the log-likelihood of the null model (intercept only).
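
The pseudo R-squared reported in Example 1 below can be reproduced from this definition. For a binary outcome, the intercept-only (null) model simply predicts the sample proportion of ones (8 out of 15 observations in that example), so its log-likelihood follows directly; a small sketch:

import numpy as np

ll_full = -4.374             # fitted model log-likelihood from Example 1 below
n, n_ones = 15, 8            # observations and ones in Example 1's y
p = n_ones / n               # proportion predicted by the intercept-only model

ll_null = n_ones * np.log(p) + (n - n_ones) * np.log(1 - p)
pseudo_r2 = 1 - ll_full / ll_null
print(round(pseudo_r2, 3))   # roughly 0.578, matching Example 1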

Probit models are widely used in economics, social sciences, and biostatistics for analyzing binary choice behavior. For comprehensive references on discrete choice models, see Cameron & Trivedi’s Regression Analysis of Count Data (1998) and Greene’s Econometric Analysis (2003).

This example function is provided as-is without any representation of accuracy.

Excel Usage

=PROBIT_MODEL(y, x, fit_intercept, alpha)
  • y (list[list], required): Column vector of binary dependent variable values (0 or 1)
  • x (list[list], required): Matrix of independent variables (predictors), one column per predictor
  • fit_intercept (bool, optional, default: true): If true, adds an intercept term to the model
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals (between 0 and 1)

Returns (list[list]): 2D list with probit results and statistics, or error string.

Examples

Example 1: Basic single predictor model

Inputs:

y x
0 1
0 1.2
0 1.5
0 1.8
0 2
1 2.2
0 2.5
1 2.8
0 3
1 3.2
1 3.5
1 3.8
1 4
1 4.2
1 4.5

Excel formula:

=PROBIT_MODEL({0;0;0;0;0;1;0;1;0;1;1;1;1;1;1}, {1;1.2;1.5;1.8;2;2.2;2.5;2.8;3;3.2;3.5;3.8;4;4.2;4.5})

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -4.466 2.137 -2.09 0.0366 -8.654 -0.2784
x1 1.704 0.7943 2.145 0.03196 0.1469 3.261
pseudo_r_squared 0.5779
log_likelihood -4.374
aic 12.75
bic 14.16
llr_pvalue 0.000538

Example 2: Model without intercept

Inputs:

y x fit_intercept
0 1 false
0 1.2
0 1.5
0 1.8
0 2
1 2.2
0 2.5
1 2.8
0 3
1 3.2
1 3.5
1 3.8
1 4
1 4.2
1 4.5

Excel formula:

=PROBIT_MODEL({0;0;0;0;0;1;0;1;0;1;1;1;1;1;1}, {1;1.2;1.5;1.8;2;2.2;2.5;2.8;3;3.2;3.5;3.8;4;4.2;4.5}, FALSE)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
x1 0.158 0.1199 1.318 0.1875 -0.07696 0.393
pseudo_r_squared 0.08651
log_likelihood -9.467
aic 20.93
bic 21.64
llr_pvalue

Example 3: Custom alpha for confidence intervals

Inputs:

y x alpha
0 1 0.1
0 1.5
0 2
1 2
0 2.5
1 2.5
1 3
0 3
1 3.5
1 4
1 4.5
1 5

Excel formula:

=PROBIT_MODEL({0;0;0;1;0;1;1;0;1;1;1;1}, {1;1.5;2;2;2.5;2.5;3;3;3.5;4;4.5;5}, 0.1)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -2.976 1.746 -1.704 0.0883 -5.849 -0.104
x1 1.193 0.6663 1.79 0.07338 0.09701 2.289
pseudo_r_squared 0.3937
log_likelihood -4.942
aic 13.88
bic 14.85
llr_pvalue 0.0113

Example 4: Multiple predictors with all arguments

Inputs:

y x1 x2 fit_intercept alpha
0 1 1.5 true 0.05
0 1.5 2
0 2 1
0 2.5 2.5
0 2 3
1 2.2 2.8
0 3 2
1 2.8 3.2
0 3.2 2.5
1 3 3.5
0 3.5 3
1 3.5 3.8
1 4 3.5
1 4 4
1 4.5 4.2
1 4.2 4.5
1 5 4.8
1 5.5 5

Excel formula:

=PROBIT_MODEL({0;0;0;0;0;1;0;1;0;1;0;1;1;1;1;1;1;1}, {1,1.5;1.5,2;2,1;2.5,2.5;2,3;2.2,2.8;3,2;2.8,3.2;3.2,2.5;3,3.5;3.5,3;3.5,3.8;4,3.5;4,4;4.5,4.2;4.2,4.5;5,4.8;5.5,5}, TRUE, 0.05)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -11.02 6.889 -1.6 0.1095 -24.53 2.477
x1 -0.5517 1.011 -0.5459 0.5851 -2.533 1.429
x2 4.185 2.624 1.595 0.1107 -0.9579 9.327
pseudo_r_squared 0.75
log_likelihood -3.091
aic 12.18
bic 14.85
llr_pvalue 0.00009379

Python Code

import numpy as np
from statsmodels.discrete.discrete_model import Probit as statsmodels_probit

def probit_model(y, x, fit_intercept=True, alpha=0.05):
    """
    Fits a binary probit regression model using maximum likelihood estimation.

    See: https://www.statsmodels.org/stable/generated/statsmodels.discrete.discrete_model.Probit.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Column vector of binary dependent variable values (0 or 1)
        x (list[list]): Matrix of independent variables (predictors), one column per predictor
        fit_intercept (bool, optional): If true, adds an intercept term to the model. Default is True.
        alpha (float, optional): Significance level for confidence intervals (between 0 and 1). Default is 0.05.

    Returns:
        list[list]: 2D list with probit results and statistics, or error string.
    """
    def to2d(arr):
        return [[arr]] if not isinstance(arr, list) else arr

    def validate_numeric_2d(arr, name):
        arr = to2d(arr)
        if not isinstance(arr, list):
            return f"Invalid input: {name} must be a 2D list."
        if not all(isinstance(row, list) for row in arr):
            return f"Invalid input: {name} must be a 2D list."
        flat = []
        for row in arr:
            for val in row:
                if not isinstance(val, (int, float)):
                    return f"Invalid input: {name} must contain only numeric values."
                if np.isnan(val) or np.isinf(val):
                    return f"Invalid input: {name} contains non-finite values."
                flat.append(float(val))
        return arr, flat

    def validate_alpha(alpha_val):
        if not isinstance(alpha_val, (int, float)):
            return "Invalid input: alpha must be a number."
        alpha_val = float(alpha_val)
        if np.isnan(alpha_val) or np.isinf(alpha_val):
            return "Invalid input: alpha must be finite."
        if alpha_val <= 0 or alpha_val >= 1:
            return "Invalid input: alpha must be between 0 and 1."
        return alpha_val

    # Validate inputs
    y_result = validate_numeric_2d(y, "y")
    if isinstance(y_result, str):
        return y_result
    y_2d, y_flat = y_result

    x_result = validate_numeric_2d(x, "x")
    if isinstance(x_result, str):
        return x_result
    x_2d, x_flat = x_result

    if not isinstance(fit_intercept, bool):
        return "Invalid input: fit_intercept must be a boolean."

    alpha_val = validate_alpha(alpha)
    if isinstance(alpha_val, str):
        return alpha_val

    # Check y is a column vector
    if len(y_2d[0]) != 1:
        return "Invalid input: y must be a column vector (single column)."

    # Check y contains only 0s and 1s
    for val in y_flat:
        if val not in (0.0, 1.0):
            return "Invalid input: y must contain only binary values (0 or 1)."

    # Get number of observations and predictors
    n_obs = len(y_flat)
    n_rows_x = len(x_2d)
    if n_rows_x != n_obs:
        return "Invalid input: x and y must have the same number of rows."

    # Check all rows in x have the same number of columns
    n_cols_x = len(x_2d[0]) if x_2d else 0
    if not all(len(row) == n_cols_x for row in x_2d):
        return "Invalid input: all rows in x must have the same length."

    # Convert to matrix form
    try:
        y_array = np.array(y_flat).reshape(-1, 1)
        x_array = np.array(x_2d)

        # Add intercept if requested
        if fit_intercept:
            x_array = np.column_stack([np.ones(n_obs), x_array])

        # Fit probit model
        model = statsmodels_probit(y_array, x_array)
        result = model.fit(disp=0)

        # Get confidence intervals
        conf_int = result.conf_int(alpha=alpha_val)

        # Build output table
        output = [['parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper']]

        # Parameter names
        param_names = []
        if fit_intercept:
            param_names.append('intercept')
        for i in range(n_cols_x):
            param_names.append(f'x{i+1}')

        # Helper to convert NaN/inf to empty string
        def safe_float(val):
            fval = float(val)
            return '' if (np.isnan(fval) or np.isinf(fval)) else fval

        # Add parameter results
        for i, param_name in enumerate(param_names):
            output.append([
                param_name,
                safe_float(result.params[i]),
                safe_float(result.bse[i]),
                safe_float(result.tvalues[i]),
                safe_float(result.pvalues[i]),
                safe_float(conf_int[i, 0]),
                safe_float(conf_int[i, 1])
            ])

        # Add model statistics
        output.append(['pseudo_r_squared', safe_float(result.prsquared), '', '', '', '', ''])
        output.append(['log_likelihood', safe_float(result.llf), '', '', '', '', ''])
        output.append(['aic', safe_float(result.aic), '', '', '', '', ''])
        output.append(['bic', safe_float(result.bic), '', '', '', '', ''])
        output.append(['llr_pvalue', safe_float(result.llr_pvalue), '', '', '', '', ''])

        return output

    except Exception as exc:
        return f"statsmodels.discrete.discrete_model.Probit error: {exc}"
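
A brief usage sketch, calling probit_model above with the data from Example 1 (run in the same module so the function is in scope). The returned 2D list starts with the header row, followed by one row per parameter and the model fit statistics.

y_data = [[0], [0], [0], [0], [0], [1], [0], [1], [0], [1], [1], [1], [1], [1], [1]]
x_data = [[1], [1.2], [1.5], [1.8], [2], [2.2], [2.5], [2.8], [3],
          [3.2], [3.5], [3.8], [4], [4.2], [4.5]]

table = probit_model(y_data, x_data)   # defaults: fit_intercept=True, alpha=0.05
for row in table:
    print(row)
# First data row should be close to:
# ['intercept', -4.466, 2.137, -2.09, 0.0366, -8.654, -0.2784]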

Online Calculator