GEE_MODEL

Overview

The GEE_MODEL function fits a Generalized Estimating Equations (GEE) model for analyzing correlated or clustered data, such as longitudinal studies, repeated measures, or hierarchical designs. GEE extends generalized linear models (GLMs) to handle situations where observations within clusters may be correlated, while observations between clusters are assumed to be independent.

GEE was introduced by Liang and Zeger (1986) as a method for estimating regression parameters in marginal models when the data have a grouped structure. Unlike mixed-effects models that estimate subject-specific effects, GEE focuses on estimating population-averaged (marginal) effects, making it particularly suitable for epidemiological and clinical studies where population-level interpretations are desired.

This implementation uses the statsmodels library. For detailed documentation, see the GEE class reference and the GEE user guide.

The function supports multiple distribution families (Gaussian, Binomial, Poisson, Gamma) and working correlation structures:

  • Exchangeable: Assumes constant correlation between all pairs of observations within a cluster
  • AR(1): First-order autoregressive structure, appropriate for equally-spaced longitudinal data
  • Independence: Assumes no within-cluster correlation (equivalent to GLM with robust standard errors)
  • Unstructured: Estimates a separate correlation for each pair of observations

GEE uses an iterative algorithm that alternates between estimating regression coefficients (using weighted least squares) and updating the working correlation structure. A key advantage is that parameter estimates remain consistent even if the working correlation structure is misspecified, though efficiency improves with a correctly specified structure. The function returns robust “sandwich” standard errors following Liang and Zeger (1986), which provide valid inference regardless of the true correlation structure.

For more information on the theoretical foundations, see the original paper: Liang, K-Y. and Zeger, S.L. (1986). “Longitudinal data analysis using generalized linear models.” Biometrika, 73(1): 13-22. Additional background is available on Wikipedia.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=GEE_MODEL(y, x, groups, family, cov_struct, fit_intercept, alpha)
  • y (list[list], required): Dependent variable as a column vector.
  • x (list[list], required): Independent variables (predictors) as a matrix where each column is a predictor.
  • groups (list[list], required): Group or cluster identifiers as a column vector with integer group codes.
  • family (str, optional, default: “gaussian”): Distribution family for the response variable.
  • cov_struct (str, optional, default: “exchangeable”): Within-group covariance structure.
  • fit_intercept (bool, optional, default: true): If True, adds an intercept term to the model.
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals.

Returns (list[list]): 2D list with GEE results, or error message string.

Examples

Example 1: Gaussian with exchangeable correlation

Inputs:

y x groups
1.2 1 1
1.8 2 1
2.3 3 2
2.9 4 2
3.4 5 3
4.1 6 3

Excel formula:

=GEE_MODEL({1.2;1.8;2.3;2.9;3.4;4.1}, {1;2;3;4;5;6}, {1;1;2;2;3;3})

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept 0.6541 0.008694 75.24 0 0.6371 0.6712
x1 0.5607 0.002461 227.8 0 0.5559 0.5655

Example 2: Poisson with AR1 correlation and multiple predictors

Inputs:

y x groups family cov_struct
1 1 0.5 1 poisson ar1
2 2 0.6 1
3 3 0.7 1
2 2 0.8 2
4 4 0.9 2
5 5 1 2

Excel formula:

=GEE_MODEL({1;2;3;2;4;5}, {1,0.5;2,0.6;3,0.7;2,0.8;4,0.9;5,1}, {1;1;1;2;2;2}, "poisson", "ar1")

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -0.07104 0.0547 -1.299 0.1941 -0.1783 0.03618
x1 0.3376 0.06257 5.395 6.833e-8 0.2149 0.4602
x2 0.06159 0.3222 0.1911 0.8484 -0.5699 0.6931

Example 3: Binomial with independence correlation (no intercept)

Inputs:

y x groups family cov_struct fit_intercept
0 1 1 binomial independence false
1 2 1
0 1.5 2
1 2.5 2
1 3 3
1 3.5 3

Excel formula:

=GEE_MODEL({0;1;0;1;1;1}, {1;2;1.5;2.5;3;3.5}, {1;1;2;2;3;3}, "binomial", "independence", FALSE)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
x1 0.6031 0.2233 2.701 0.006908 0.1655 1.041

Example 4: Gaussian with custom alpha for 90% confidence intervals

Inputs:

y x groups family cov_struct fit_intercept alpha
5.2 2 1 gaussian exchangeable true 0.1
7.1 3 1
6.3 2.5 2
8.2 4 2
9.1 4.5 3
10.8 5 3

Excel formula:

=GEE_MODEL({5.2;7.1;6.3;8.2;9.1;10.8}, {2;3;2.5;4;4.5;5}, {1;1;2;2;3;3}, "gaussian", "exchangeable", TRUE, 0.1)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept 1.779 0.2431 7.319 2.506e-13 1.379 2.179
x1 1.715 0.05682 30.19 2.919e-200 1.622 1.809

Python Code

import statsmodels.api as sm
from statsmodels.genmod.generalized_estimating_equations import GEE as statsmodels_gee
from statsmodels.genmod import families

def gee_model(y, x, groups, family='gaussian', cov_struct='exchangeable', fit_intercept=True, alpha=0.05):
    """
    Fits a Generalized Estimating Equations (GEE) model for correlated data.

    See: https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_estimating_equations.GEE.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector.
        x (list[list]): Independent variables (predictors) as a matrix where each column is a predictor.
        groups (list[list]): Group or cluster identifiers as a column vector with integer group codes.
        family (str, optional): Distribution family for the response variable. Valid options: Gaussian, Binomial, Poisson, Gamma. Default is 'gaussian'.
        cov_struct (str, optional): Within-group covariance structure. Valid options: Exchangeable, AR1, Independence, Unstructured. Default is 'exchangeable'.
        fit_intercept (bool, optional): If True, adds an intercept term to the model. Default is True.
        alpha (float, optional): Significance level for confidence intervals. Default is 0.05.

    Returns:
        list[list]: 2D list with GEE results, or error message string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    def validate_float(val, name):
        try:
            converted = float(val)
        except Exception:
            return f"Invalid input: {name} must be a number."
        if converted != converted or converted == float('inf') or converted == float('-inf'):
            return f"Invalid input: {name} must be finite."
        return converted

    def validate_2d_list(data, name):
        if not isinstance(data, list):
            return f"Invalid input: {name} must be a 2D list."
        if len(data) == 0:
            return f"Invalid input: {name} cannot be empty."
        for row in data:
            if not isinstance(row, list):
                return f"Invalid input: {name} must be a 2D list."
        return None

    # Normalize 2D list inputs
    y = to2d(y)
    x = to2d(x)
    groups = to2d(groups)

    # Validate 2D list structures
    err = validate_2d_list(y, "y")
    if err:
        return err
    err = validate_2d_list(x, "x")
    if err:
        return err
    err = validate_2d_list(groups, "groups")
    if err:
        return err

    # Extract column vectors
    y_flat = []
    for row in y:
        if len(row) != 1:
            return "Invalid input: y must be a column vector (single column)."
        val = validate_float(row[0], "y element")
        if isinstance(val, str):
            return val
        y_flat.append(val)

    groups_flat = []
    for row in groups:
        if len(row) != 1:
            return "Invalid input: groups must be a column vector (single column)."
        val = validate_float(row[0], "groups element")
        if isinstance(val, str):
            return val
        groups_flat.append(int(val))

    # Extract x matrix
    n_rows = len(x)
    if n_rows != len(y_flat):
        return "Invalid input: x and y must have the same number of rows."
    if n_rows != len(groups_flat):
        return "Invalid input: x and groups must have the same number of rows."

    n_cols = len(x[0])
    x_flat = []
    for row in x:
        if len(row) != n_cols:
            return "Invalid input: x must have consistent number of columns."
        row_vals = []
        for val in row:
            converted = validate_float(val, "x element")
            if isinstance(converted, str):
                return converted
            row_vals.append(converted)
        x_flat.append(row_vals)

    # Validate alpha
    alpha_val = validate_float(alpha, "alpha")
    if isinstance(alpha_val, str):
        return alpha_val
    if alpha_val <= 0 or alpha_val >= 1:
        return "Invalid input: alpha must be between 0 and 1."

    # Validate family
    family_map = {
        'gaussian': families.Gaussian(),
        'binomial': families.Binomial(),
        'poisson': families.Poisson(),
        'gamma': families.Gamma()
    }
    if family.lower() not in family_map:
        return f"Invalid input: family must be one of {list(family_map.keys())}."
    family_obj = family_map[family.lower()]

    # Validate covariance structure
    cov_struct_map = {
        'exchangeable': sm.cov_struct.Exchangeable(),
        'ar1': sm.cov_struct.Autoregressive(),
        'independence': sm.cov_struct.Independence(),
        'unstructured': sm.cov_struct.Unstructured()
    }
    if cov_struct.lower() not in cov_struct_map:
        return f"Invalid input: cov_struct must be one of {list(cov_struct_map.keys())}."
    cov_struct_obj = cov_struct_map[cov_struct.lower()]

    # Add intercept if requested
    if fit_intercept:
        x_flat = [[1.0] + row for row in x_flat]

    # Fit GEE model
    try:
        model = statsmodels_gee(y_flat, x_flat, groups=groups_flat, family=family_obj, cov_struct=cov_struct_obj)
        results = model.fit()
    except Exception as exc:
        return f"GEE fitting error: {exc}"

    # Extract results
    params = results.params
    std_errors = results.bse
    z_stats = results.tvalues
    p_values = results.pvalues
    conf_int = results.conf_int(alpha=alpha_val)

    # Build output table
    output = [['parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper']]

    # Add parameter rows
    for i, (param, se, z, p, ci) in enumerate(zip(params, std_errors, z_stats, p_values, conf_int)):
        if fit_intercept and i == 0:
            param_name = 'intercept'
        else:
            param_idx = i if not fit_intercept else i - 1
            param_name = f'x{param_idx + 1}'
        output.append([param_name, float(param), float(se), float(z), float(p), float(ci[0]), float(ci[1])])

    return output

Online Calculator