GLM_BINOMIAL

Overview

The GLM_BINOMIAL function fits a Generalized Linear Model (GLM) with a binomial family distribution, designed for modeling binary outcomes (0/1) or proportion data (values between 0 and 1). This type of model is fundamental in fields such as epidemiology, marketing, and social sciences where the response variable represents a probability or binary classification.

GLMs extend ordinary linear regression by allowing the response variable to follow distributions from the exponential family—including binomial, Poisson, and gamma distributions. For binomial data, the model relates the expected probability \mu to the linear predictor \eta = X\beta through a link function g:

g(\mu) = X\beta \quad \text{or equivalently} \quad \mu = g^{-1}(X\beta)

This implementation supports multiple link functions. The default logit link is the most common choice for binomial regression (logistic regression):

\text{logit}(\mu) = \log\left(\frac{\mu}{1-\mu}\right)

Other supported links include probit (based on the standard normal CDF), cloglog (complementary log-log), log, and cauchy. Each link function provides a different transformation between the probability scale and the linear predictor, which can be useful depending on the nature of the data.

Model parameters are estimated via Iteratively Reweighted Least Squares (IRLS), a maximum likelihood method. The function returns coefficient estimates, standard errors, z-statistics, p-values, and confidence intervals. For logit models, odds ratios are also calculated—representing the multiplicative change in odds for a one-unit increase in each predictor. Model fit statistics include deviance, Pearson chi-squared, AIC, BIC, and log-likelihood.

This implementation uses the statsmodels library. For more details, see the GLM documentation and binomial family reference. The statsmodels GitHub repository provides source code and additional examples.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=GLM_BINOMIAL(y, x, glm_binomial_link, fit_intercept, alpha)
  • y (list[list], required): Dependent variable as a column vector. For binary data, values should be 0 or 1. For proportion data, values should be between 0 and 1.
  • x (list[list], required): Independent variables (predictors) as a matrix. Each column is a predictor variable, and each row corresponds to an observation.
  • glm_binomial_link (str, optional, default: “logit”): Link function to use for the binomial GLM.
  • fit_intercept (bool, optional, default: true): If True, includes an intercept term in the model.
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals (between 0 and 1).

Returns (list[list]): 2D list with GLM results and statistics, or error string.

Examples

Example 1: Demo case 1

Inputs:

y x
0 1
0 1.5
0 2
1 2.5
0 3
1 3.5
1 4
1 4.5

Excel formula:

=GLM_BINOMIAL({0;0;0;1;0;1;1;1}, {1;1.5;2;2.5;3;3.5;4;4.5})

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper odds_ratio
intercept -7.0526 4.8673 -1.449 0.1473 -16.5924 2.4872 0.0009
x1 2.5646 1.7208 1.4903 0.1361 -0.8082 5.9373 12.9953
deviance 5.0061
pearson_chi2 4.1931
aic 9.0061
bic 9.165
log_likelihood -2.503

Example 2: Demo case 2

Inputs:

y x
0.1 1 2
0.2 1.5 2.5
0.35 2 3
0.45 2.5 2.5
0.55 3 3.5
0.65 3.5 3
0.75 4 4
0.85 4.5 4.5

Excel formula:

=GLM_BINOMIAL({0.1;0.2;0.35;0.45;0.55;0.65;0.75;0.85}, {1,2;1.5,2.5;2,3;2.5,2.5;3,3.5;3.5,3;4,4;4.5,4.5})

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper odds_ratio
intercept -2.9587 4.1871 -0.7066 0.4798 -11.1652 5.2478 0.0519
x1 0.9781 1.598 0.612 0.5405 -2.154 4.1101 2.6593
x2 0.0656 2.3438 0.028 0.9777 -4.5282 4.6593 1.0678
deviance 0.0287
pearson_chi2 0.028
aic 12.0468
bic 12.2852
log_likelihood -3.0234

Example 3: Demo case 3

Inputs:

y x glm_binomial_link
0 1 probit
0 1.5
0 2
1 2.5
0 3
1 3.5
1 4
1 4.5

Excel formula:

=GLM_BINOMIAL({0;0;0;1;0;1;1;1}, {1;1.5;2;2.5;3;3.5;4;4.5}, "probit")

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper odds_ratio
intercept -4.2877 2.67 -1.6059 0.1083 -9.5209 0.9455
x1 1.5592 0.941 1.657 0.0975 -0.2851 3.4035
deviance 4.8514
pearson_chi2 4.0761
aic 8.8514
bic 9.0103
log_likelihood -2.4257

Example 4: Demo case 4

Inputs:

y x fit_intercept alpha
0 1 false 0.1
0 1.5
0 2
1 2.5
0 3
1 3.5
1 4
1 4.5

Excel formula:

=GLM_BINOMIAL({0;0;0;1;0;1;1;1}, {1;1.5;2;2.5;3;3.5;4;4.5}, FALSE, 0.1)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper odds_ratio
x1 0.2065 0.2543 0.8117 0.4169 -0.2119 0.6248 1.2293
deviance 10.3844
pearson_chi2 7.8764
aic 12.3844
bic 12.4638
log_likelihood -5.1922

Python Code

import math
import statsmodels.api as sm
from statsmodels.genmod.families import Binomial as sm_Binomial
from statsmodels.genmod.generalized_linear_model import SET_USE_BIC_LLF

SET_USE_BIC_LLF(True)

def glm_binomial(y, x, glm_binomial_link='logit', fit_intercept=True, alpha=0.05):
    """
    Fits a Generalized Linear Model with binomial family for binary or proportion data.

    See: https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector. For binary data, values should be 0 or 1. For proportion data, values should be between 0 and 1.
        x (list[list]): Independent variables (predictors) as a matrix. Each column is a predictor variable, and each row corresponds to an observation.
        glm_binomial_link (str, optional): Link function to use for the binomial GLM. Valid options: Logit, Probit, CLogLog, Log, Cauchy. Default is 'logit'.
        fit_intercept (bool, optional): If True, includes an intercept term in the model. Default is True.
        alpha (float, optional): Significance level for confidence intervals (between 0 and 1). Default is 0.05.

    Returns:
        list[list]: 2D list with GLM results and statistics, or error string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    def validate_numeric_2d(data, name):
        if not isinstance(data, list):
            return f"Invalid input: {name} must be a 2D list."
        if len(data) == 0:
            return f"Invalid input: {name} cannot be empty."
        for i, row in enumerate(data):
            if not isinstance(row, list):
                return f"Invalid input: {name} must be a 2D list."
            if len(row) == 0:
                return f"Invalid input: {name} rows cannot be empty."
            for j, val in enumerate(row):
                if not isinstance(val, (int, float)):
                    return f"Invalid input: {name}[{i}][{j}] must be numeric."
                if math.isnan(val) or math.isinf(val):
                    return f"Invalid input: {name}[{i}][{j}] must be finite."
        return None

    # Normalize inputs
    y = to2d(y)
    x = to2d(x)

    # Validate inputs
    err = validate_numeric_2d(y, 'y')
    if err:
        return err
    err = validate_numeric_2d(x, 'x')
    if err:
        return err

    # Check y is a column vector
    if len(y[0]) != 1:
        return "Invalid input: y must be a column vector (single column)."

    # Check dimensions match
    n_obs_y = len(y)
    n_obs_x = len(x)
    if n_obs_y != n_obs_x:
        return "Invalid input: y and x must have the same number of rows."

    # Validate alpha
    if not isinstance(alpha, (int, float)):
        return "Invalid input: alpha must be numeric."
    if math.isnan(alpha) or math.isinf(alpha):
        return "Invalid input: alpha must be finite."
    if alpha <= 0 or alpha >= 1:
        return "Invalid input: alpha must be between 0 and 1."

    # Validate link function
    valid_links = ['logit', 'probit', 'cloglog', 'log', 'cauchy']
    if not isinstance(glm_binomial_link, str):
        return "Invalid input: glm_binomial_link must be a string."
    if glm_binomial_link not in valid_links:
        return f"Invalid input: glm_binomial_link must be one of {valid_links}."

    # Convert to flat list for y
    y_flat = [row[0] for row in y]

    # Check y values are in valid range [0, 1]
    for i, val in enumerate(y_flat):
        if val < 0 or val > 1:
            return f"Invalid input: y[{i}] = {val} must be between 0 and 1."

    # Get number of columns in x
    n_cols = len(x[0])
    for i, row in enumerate(x):
        if len(row) != n_cols:
            return f"Invalid input: all rows in x must have the same number of columns."

    # Convert x to list of columns
    x_data = []
    for col_idx in range(n_cols):
        x_data.append([x[row_idx][col_idx] for row_idx in range(n_obs_x)])

    # Add intercept if requested
    if fit_intercept:
        x_data.insert(0, [1.0] * n_obs_x)

    # Transpose to get design matrix
    design_matrix = []
    for row_idx in range(n_obs_x):
        design_matrix.append([col[row_idx] for col in x_data])

    # Create link object
    try:
        if glm_binomial_link == 'logit':
            link = sm.families.links.Logit()
        elif glm_binomial_link == 'probit':
            link = sm.families.links.Probit()
        elif glm_binomial_link == 'cloglog':
            link = sm.families.links.CLogLog()
        elif glm_binomial_link == 'log':
            link = sm.families.links.Log()
        elif glm_binomial_link == 'cauchy':
            link = sm.families.links.Cauchy()
    except Exception as e:
        return f"Invalid input: unable to create link function: {e}"

    # Fit GLM
    try:
        family = sm_Binomial(link=link)
        model = sm.GLM(y_flat, design_matrix, family=family)
        result = model.fit()
    except Exception as e:
        return f"statsmodels.GLM error: {e}"

    # Extract results
    try:
        params = result.params
        bse = result.bse
        tvalues = result.tvalues
        pvalues = result.pvalues
        conf_int = result.conf_int(alpha=alpha)

        # Build parameter names
        param_names = []
        if fit_intercept:
            param_names.append('intercept')
        for i in range(n_cols):
            param_names.append(f'x{i+1}')

        # Build results table
        results = [['parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper', 'odds_ratio']]

        for i, name in enumerate(param_names):
            coef = float(params[i])
            stderr = float(bse[i])
            zstat = float(tvalues[i])
            pval = float(pvalues[i])
            ci_low = float(conf_int[i][0])
            ci_high = float(conf_int[i][1])
            odds_ratio = math.exp(coef) if glm_binomial_link == 'logit' else None

            results.append([
                name,
                coef,
                stderr,
                zstat,
                pval,
                ci_low,
                ci_high,
                odds_ratio if odds_ratio is not None else ''
            ])

        # Add model statistics
        results.append(['deviance', float(result.deviance), '', '', '', '', '', ''])
        results.append(['pearson_chi2', float(result.pearson_chi2), '', '', '', '', '', ''])
        results.append(['aic', float(result.aic), '', '', '', '', '', ''])
        results.append(['bic', float(result.bic_llf), '', '', '', '', '', ''])
        results.append(['log_likelihood', float(result.llf), '', '', '', '', '', ''])

        return results

    except Exception as e:
        return f"statsmodels.GLM error: unable to extract results: {e}"

Online Calculator