GLM_BINOMIAL
Overview
The GLM_BINOMIAL function fits a Generalized Linear Model (GLM) with a binomial family distribution, designed for modeling binary outcomes (0/1) or proportion data (values between 0 and 1). This type of model is fundamental in fields such as epidemiology, marketing, and social sciences where the response variable represents a probability or binary classification.
GLMs extend ordinary linear regression by allowing the response variable to follow distributions from the exponential family—including binomial, Poisson, and gamma distributions. For binomial data, the model relates the expected probability \mu to the linear predictor \eta = X\beta through a link function g:
g(\mu) = X\beta \quad \text{or equivalently} \quad \mu = g^{-1}(X\beta)
This implementation supports multiple link functions. The default logit link is the most common choice for binomial regression (logistic regression):
\text{logit}(\mu) = \log\left(\frac{\mu}{1-\mu}\right)
Other supported links include probit (based on the standard normal CDF), cloglog (complementary log-log), log, and cauchy. Each link function provides a different transformation between the probability scale and the linear predictor, which can be useful depending on the nature of the data.
Model parameters are estimated via Iteratively Reweighted Least Squares (IRLS), a maximum likelihood method. The function returns coefficient estimates, standard errors, z-statistics, p-values, and confidence intervals. For logit models, odds ratios are also calculated—representing the multiplicative change in odds for a one-unit increase in each predictor. Model fit statistics include deviance, Pearson chi-squared, AIC, BIC, and log-likelihood.
This implementation uses the statsmodels library. For more details, see the GLM documentation and binomial family reference. The statsmodels GitHub repository provides source code and additional examples.
This example function is provided as-is without any representation of accuracy.
Excel Usage
=GLM_BINOMIAL(y, x, glm_binomial_link, fit_intercept, alpha)
y(list[list], required): Dependent variable as a column vector. For binary data, values should be 0 or 1. For proportion data, values should be between 0 and 1.x(list[list], required): Independent variables (predictors) as a matrix. Each column is a predictor variable, and each row corresponds to an observation.glm_binomial_link(str, optional, default: “logit”): Link function to use for the binomial GLM.fit_intercept(bool, optional, default: true): If True, includes an intercept term in the model.alpha(float, optional, default: 0.05): Significance level for confidence intervals (between 0 and 1).
Returns (list[list]): 2D list with GLM results and statistics, or error string.
Examples
Example 1: Demo case 1
Inputs:
| y | x |
|---|---|
| 0 | 1 |
| 0 | 1.5 |
| 0 | 2 |
| 1 | 2.5 |
| 0 | 3 |
| 1 | 3.5 |
| 1 | 4 |
| 1 | 4.5 |
Excel formula:
=GLM_BINOMIAL({0;0;0;1;0;1;1;1}, {1;1.5;2;2.5;3;3.5;4;4.5})
Expected output:
| parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper | odds_ratio |
|---|---|---|---|---|---|---|---|
| intercept | -7.0526 | 4.8673 | -1.449 | 0.1473 | -16.5924 | 2.4872 | 0.0009 |
| x1 | 2.5646 | 1.7208 | 1.4903 | 0.1361 | -0.8082 | 5.9373 | 12.9953 |
| deviance | 5.0061 | ||||||
| pearson_chi2 | 4.1931 | ||||||
| aic | 9.0061 | ||||||
| bic | 9.165 | ||||||
| log_likelihood | -2.503 |
Example 2: Demo case 2
Inputs:
| y | x | |
|---|---|---|
| 0.1 | 1 | 2 |
| 0.2 | 1.5 | 2.5 |
| 0.35 | 2 | 3 |
| 0.45 | 2.5 | 2.5 |
| 0.55 | 3 | 3.5 |
| 0.65 | 3.5 | 3 |
| 0.75 | 4 | 4 |
| 0.85 | 4.5 | 4.5 |
Excel formula:
=GLM_BINOMIAL({0.1;0.2;0.35;0.45;0.55;0.65;0.75;0.85}, {1,2;1.5,2.5;2,3;2.5,2.5;3,3.5;3.5,3;4,4;4.5,4.5})
Expected output:
| parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper | odds_ratio |
|---|---|---|---|---|---|---|---|
| intercept | -2.9587 | 4.1871 | -0.7066 | 0.4798 | -11.1652 | 5.2478 | 0.0519 |
| x1 | 0.9781 | 1.598 | 0.612 | 0.5405 | -2.154 | 4.1101 | 2.6593 |
| x2 | 0.0656 | 2.3438 | 0.028 | 0.9777 | -4.5282 | 4.6593 | 1.0678 |
| deviance | 0.0287 | ||||||
| pearson_chi2 | 0.028 | ||||||
| aic | 12.0468 | ||||||
| bic | 12.2852 | ||||||
| log_likelihood | -3.0234 |
Example 3: Demo case 3
Inputs:
| y | x | glm_binomial_link |
|---|---|---|
| 0 | 1 | probit |
| 0 | 1.5 | |
| 0 | 2 | |
| 1 | 2.5 | |
| 0 | 3 | |
| 1 | 3.5 | |
| 1 | 4 | |
| 1 | 4.5 |
Excel formula:
=GLM_BINOMIAL({0;0;0;1;0;1;1;1}, {1;1.5;2;2.5;3;3.5;4;4.5}, "probit")
Expected output:
| parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper | odds_ratio |
|---|---|---|---|---|---|---|---|
| intercept | -4.2877 | 2.67 | -1.6059 | 0.1083 | -9.5209 | 0.9455 | |
| x1 | 1.5592 | 0.941 | 1.657 | 0.0975 | -0.2851 | 3.4035 | |
| deviance | 4.8514 | ||||||
| pearson_chi2 | 4.0761 | ||||||
| aic | 8.8514 | ||||||
| bic | 9.0103 | ||||||
| log_likelihood | -2.4257 |
Example 4: Demo case 4
Inputs:
| y | x | fit_intercept | alpha |
|---|---|---|---|
| 0 | 1 | false | 0.1 |
| 0 | 1.5 | ||
| 0 | 2 | ||
| 1 | 2.5 | ||
| 0 | 3 | ||
| 1 | 3.5 | ||
| 1 | 4 | ||
| 1 | 4.5 |
Excel formula:
=GLM_BINOMIAL({0;0;0;1;0;1;1;1}, {1;1.5;2;2.5;3;3.5;4;4.5}, FALSE, 0.1)
Expected output:
| parameter | coefficient | std_error | z_statistic | p_value | ci_lower | ci_upper | odds_ratio |
|---|---|---|---|---|---|---|---|
| x1 | 0.2065 | 0.2543 | 0.8117 | 0.4169 | -0.2119 | 0.6248 | 1.2293 |
| deviance | 10.3844 | ||||||
| pearson_chi2 | 7.8764 | ||||||
| aic | 12.3844 | ||||||
| bic | 12.4638 | ||||||
| log_likelihood | -5.1922 |
Python Code
import math
import statsmodels.api as sm
from statsmodels.genmod.families import Binomial as sm_Binomial
from statsmodels.genmod.generalized_linear_model import SET_USE_BIC_LLF
SET_USE_BIC_LLF(True)
def glm_binomial(y, x, glm_binomial_link='logit', fit_intercept=True, alpha=0.05):
"""
Fits a Generalized Linear Model with binomial family for binary or proportion data.
See: https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html
This example function is provided as-is without any representation of accuracy.
Args:
y (list[list]): Dependent variable as a column vector. For binary data, values should be 0 or 1. For proportion data, values should be between 0 and 1.
x (list[list]): Independent variables (predictors) as a matrix. Each column is a predictor variable, and each row corresponds to an observation.
glm_binomial_link (str, optional): Link function to use for the binomial GLM. Valid options: Logit, Probit, CLogLog, Log, Cauchy. Default is 'logit'.
fit_intercept (bool, optional): If True, includes an intercept term in the model. Default is True.
alpha (float, optional): Significance level for confidence intervals (between 0 and 1). Default is 0.05.
Returns:
list[list]: 2D list with GLM results and statistics, or error string.
"""
def to2d(val):
return [[val]] if not isinstance(val, list) else val
def validate_numeric_2d(data, name):
if not isinstance(data, list):
return f"Invalid input: {name} must be a 2D list."
if len(data) == 0:
return f"Invalid input: {name} cannot be empty."
for i, row in enumerate(data):
if not isinstance(row, list):
return f"Invalid input: {name} must be a 2D list."
if len(row) == 0:
return f"Invalid input: {name} rows cannot be empty."
for j, val in enumerate(row):
if not isinstance(val, (int, float)):
return f"Invalid input: {name}[{i}][{j}] must be numeric."
if math.isnan(val) or math.isinf(val):
return f"Invalid input: {name}[{i}][{j}] must be finite."
return None
# Normalize inputs
y = to2d(y)
x = to2d(x)
# Validate inputs
err = validate_numeric_2d(y, 'y')
if err:
return err
err = validate_numeric_2d(x, 'x')
if err:
return err
# Check y is a column vector
if len(y[0]) != 1:
return "Invalid input: y must be a column vector (single column)."
# Check dimensions match
n_obs_y = len(y)
n_obs_x = len(x)
if n_obs_y != n_obs_x:
return "Invalid input: y and x must have the same number of rows."
# Validate alpha
if not isinstance(alpha, (int, float)):
return "Invalid input: alpha must be numeric."
if math.isnan(alpha) or math.isinf(alpha):
return "Invalid input: alpha must be finite."
if alpha <= 0 or alpha >= 1:
return "Invalid input: alpha must be between 0 and 1."
# Validate link function
valid_links = ['logit', 'probit', 'cloglog', 'log', 'cauchy']
if not isinstance(glm_binomial_link, str):
return "Invalid input: glm_binomial_link must be a string."
if glm_binomial_link not in valid_links:
return f"Invalid input: glm_binomial_link must be one of {valid_links}."
# Convert to flat list for y
y_flat = [row[0] for row in y]
# Check y values are in valid range [0, 1]
for i, val in enumerate(y_flat):
if val < 0 or val > 1:
return f"Invalid input: y[{i}] = {val} must be between 0 and 1."
# Get number of columns in x
n_cols = len(x[0])
for i, row in enumerate(x):
if len(row) != n_cols:
return f"Invalid input: all rows in x must have the same number of columns."
# Convert x to list of columns
x_data = []
for col_idx in range(n_cols):
x_data.append([x[row_idx][col_idx] for row_idx in range(n_obs_x)])
# Add intercept if requested
if fit_intercept:
x_data.insert(0, [1.0] * n_obs_x)
# Transpose to get design matrix
design_matrix = []
for row_idx in range(n_obs_x):
design_matrix.append([col[row_idx] for col in x_data])
# Create link object
try:
if glm_binomial_link == 'logit':
link = sm.families.links.Logit()
elif glm_binomial_link == 'probit':
link = sm.families.links.Probit()
elif glm_binomial_link == 'cloglog':
link = sm.families.links.CLogLog()
elif glm_binomial_link == 'log':
link = sm.families.links.Log()
elif glm_binomial_link == 'cauchy':
link = sm.families.links.Cauchy()
except Exception as e:
return f"Invalid input: unable to create link function: {e}"
# Fit GLM
try:
family = sm_Binomial(link=link)
model = sm.GLM(y_flat, design_matrix, family=family)
result = model.fit()
except Exception as e:
return f"statsmodels.GLM error: {e}"
# Extract results
try:
params = result.params
bse = result.bse
tvalues = result.tvalues
pvalues = result.pvalues
conf_int = result.conf_int(alpha=alpha)
# Build parameter names
param_names = []
if fit_intercept:
param_names.append('intercept')
for i in range(n_cols):
param_names.append(f'x{i+1}')
# Build results table
results = [['parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper', 'odds_ratio']]
for i, name in enumerate(param_names):
coef = float(params[i])
stderr = float(bse[i])
zstat = float(tvalues[i])
pval = float(pvalues[i])
ci_low = float(conf_int[i][0])
ci_high = float(conf_int[i][1])
odds_ratio = math.exp(coef) if glm_binomial_link == 'logit' else None
results.append([
name,
coef,
stderr,
zstat,
pval,
ci_low,
ci_high,
odds_ratio if odds_ratio is not None else ''
])
# Add model statistics
results.append(['deviance', float(result.deviance), '', '', '', '', '', ''])
results.append(['pearson_chi2', float(result.pearson_chi2), '', '', '', '', '', ''])
results.append(['aic', float(result.aic), '', '', '', '', '', ''])
results.append(['bic', float(result.bic_llf), '', '', '', '', '', ''])
results.append(['log_likelihood', float(result.llf), '', '', '', '', '', ''])
return results
except Exception as e:
return f"statsmodels.GLM error: unable to extract results: {e}"