GLM_TWEEDIE

Overview

The GLM_TWEEDIE function fits a Generalized Linear Model (GLM) using the Tweedie family of distributions, providing a flexible framework for modeling data with non-normal response distributions. The Tweedie distribution is a special case of exponential dispersion models that unifies several common distributions including Poisson, Gamma, and compound Poisson-Gamma (useful for data with exact zeros and positive continuous values).

This implementation uses the statsmodels library’s GLM module. For detailed documentation, see the statsmodels GLM documentation and the Tweedie family reference.

The key parameter is the variance power p, which controls the relationship between the variance and mean of the response variable:

\text{Var}(Y) = \phi \cdot \mu^p

where \mu is the mean, \phi is the dispersion parameter, and p is the variance power. Different values of p correspond to different distributions:

  • p = 1: Poisson (count data)
  • 1 < p < 2: Compound Poisson-Gamma (insurance claims, rainfall with zeros)
  • p = 2: Gamma (positive continuous data)

The function supports Log and Power link functions. The log link (default) is commonly used and ensures positive predictions:

\log(\mu) = X\beta \quad \Rightarrow \quad \mu = e^{X\beta}

Model output includes coefficient estimates, standard errors, z-statistics, p-values, and confidence intervals for each predictor. Goodness-of-fit statistics such as deviance, Pearson chi-squared, AIC, BIC, and log-likelihood are also provided. Tweedie GLMs are widely applied in actuarial science for insurance claim modeling, ecology for species abundance data, and any domain where data exhibits a mean-variance power relationship.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=GLM_TWEEDIE(y, x, var_power, glm_tweedie_link, fit_intercept, alpha)
  • y (list[list], required): Dependent variable as a column vector (2D list with one column).
  • x (list[list], required): Independent variables (predictors) as a matrix. Each column represents a predictor variable.
  • var_power (float, optional, default: 1.5): Variance power parameter controlling the Tweedie distribution (1 <= var_power <= 2). 1.0 = Poisson, 1.5 = Compound Poisson-Gamma, 2.0 = Gamma.
  • glm_tweedie_link (str, optional, default: “log”): Link function to use for the Tweedie GLM model.
  • fit_intercept (bool, optional, default: true): If True, adds an intercept term to the model.
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals (0 < alpha < 1).

Returns (list[list]): 2D list with GLM results and statistics, or error string.

Examples

Example 1: Demo case 1

Inputs:

y x
1.2 1
2.3 2
3.8 3
4.5 4
6.1 5
7.2 6

Excel formula:

=GLM_TWEEDIE({1.2;2.3;3.8;4.5;6.1;7.2}, {1;2;3;4;5;6})

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept 0.16737052107241146 0.1777598466210029 0.9415541487794922 0.3464209539226616 -0.1810323762021182 0.5157734183469411
x1 0.3205527363026178 0.04136592541451977 7.74919775371691 9.247490982061944e-15 0.23947701230298896 0.4016284603022467
deviance 0.23257537880695267
pearson_chi2 0.22600578760047332
aic 13.393097469193265
bic 13.39
log_likelihood -4.696548734596632
scale 0.05650144690011844

Example 2: Demo case 2

Inputs:

y x var_power
1 1 1
3 2
2 2.5
5 4
4 3.5
8 5

Excel formula:

=GLM_TWEEDIE({1;3;2;5;4;8}, {1;2;2.5;4;3.5;5}, 1)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept -0.17824085881745044 0.27172165173113344 -0.6559685534144274 0.5118443577545689 -0.7108055100302076 0.35432379239530676
x1 0.45011040836640037 0.06918247257596571 6.5061335856731235 7.710966198964227e-11 0.3145152537560776 0.5857055629767232
deviance 0.6001504010077837
pearson_chi2 0.6362839140174792
aic 122.75327413912515
bic 122.33679307758129
log_likelihood -59.37663706956258
scale 0.15907097850436994

Example 3: Demo case 3

Inputs:

y x var_power glm_tweedie_link
2.1 1 2 power
3.9 2
6.2 3
7.8 4
10.5 5

Excel formula:

=GLM_TWEEDIE({2.1;3.9;6.2;7.8;10.5}, {1;2;3;4;5}, 2, "power")

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept 0.06614066513969515 0.13362379156684015 0.4949767130848912 0.6206165584728542 -0.19575715380899855 0.3280384840883888
x1 2.0031265943221577 0.07147363948224718 28.026089182427878 7.81611292159438e-173 1.8630408350929533 2.143212353551362
deviance 0.00536930480034218
pearson_chi2 0.005342950892998718
aic 1.218643907218393
bic 0.43751973209023154
log_likelihood 1.3906780463908035
scale 0.0017809836309995734

Example 4: Demo case 4

Inputs:

y x var_power glm_tweedie_link fit_intercept alpha
3.5 1.5 1.5 log true 0.1
5.2 2.5
7.1 3.5
9.5 4.5
11.8 5.5
13.2 6.5

Excel formula:

=GLM_TWEEDIE({3.5;5.2;7.1;9.5;11.8;13.2}, {1.5;2.5;3.5;4.5;5.5;6.5}, 1.5, "log", TRUE, 0.1)

Expected output:

parameter coefficient std_error z_statistic p_value ci_lower ci_upper
intercept 0.9954692897509494 0.11096471007931333 8.971043938558719 2.937172544485591e-19 0.8129485839133723 1.1779899955885265
x1 0.2590961324508474 0.023679261440779684 10.941900916075031 7.266022592467453e-28 0.2201472133864488 0.298045051515246
deviance 0.1098369952938425
pearson_chi2 0.1077084267312591
aic 15.671587705582624
bic 15.255106644038733
log_likelihood -5.835793852791312
scale 0.026927106682814846

Python Code

import math
import warnings
import numpy as np
import statsmodels.api as sm
from statsmodels.genmod.families import Tweedie
from statsmodels.genmod.families.links import Log, Power
from statsmodels.genmod.generalized_linear_model import SET_USE_BIC_LLF

def glm_tweedie(y, x, var_power=1.5, glm_tweedie_link='log', fit_intercept=True, alpha=0.05):
    """
    Fits a Generalized Linear Model with Tweedie family for flexible distribution modeling.

    See: https://www.statsmodels.org/stable/generated/statsmodels.genmod.generalized_linear_model.GLM.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector (2D list with one column).
        x (list[list]): Independent variables (predictors) as a matrix. Each column represents a predictor variable.
        var_power (float, optional): Variance power parameter controlling the Tweedie distribution (1 <= var_power <= 2). 1.0 = Poisson, 1.5 = Compound Poisson-Gamma, 2.0 = Gamma. Default is 1.5.
        glm_tweedie_link (str, optional): Link function to use for the Tweedie GLM model. Valid options: Log, Power. Default is 'log'.
        fit_intercept (bool, optional): If True, adds an intercept term to the model. Default is True.
        alpha (float, optional): Significance level for confidence intervals (0 < alpha < 1). Default is 0.05.

    Returns:
        list[list]: 2D list with GLM results and statistics, or error string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    def validate_2d_array(arr, name):
        if not isinstance(arr, list):
            return f"Invalid input: {name} must be a 2D list."
        if len(arr) == 0:
            return f"Invalid input: {name} cannot be empty."
        if not all(isinstance(row, list) for row in arr):
            return f"Invalid input: {name} must be a 2D list."
        return None

    def validate_numeric(value, name, min_val=None, max_val=None):
        try:
            num = float(value)
        except Exception:
            return f"Invalid input: {name} must be a number."
        if math.isnan(num) or math.isinf(num):
            return f"Invalid input: {name} must be finite."
        if min_val is not None and num < min_val:
            return f"Invalid input: {name} must be >= {min_val}."
        if max_val is not None and num > max_val:
            return f"Invalid input: {name} must be <= {max_val}."
        return num

    # Normalize 2D list inputs
    y = to2d(y)
    x = to2d(x)

    # Validate 2D list inputs
    err = validate_2d_array(y, 'y')
    if err:
        return err
    err = validate_2d_array(x, 'x')
    if err:
        return err

    # Convert to numpy arrays and validate
    try:
        y_arr = np.array(y, dtype=float)
        x_arr = np.array(x, dtype=float)
    except Exception:
        return "Invalid input: y and x must contain numeric values."

    if y_arr.ndim != 2 or x_arr.ndim != 2:
        return "Invalid input: y and x must be 2D arrays."

    # Validate y is a column vector
    if y_arr.shape[1] != 1:
        return "Invalid input: y must be a column vector (single column)."

    # Validate matching dimensions
    if y_arr.shape[0] != x_arr.shape[0]:
        return "Invalid input: y and x must have the same number of rows."

    # Flatten y to 1D
    y_arr = y_arr.flatten()

    # Validate var_power
    var_power_val = validate_numeric(var_power, 'var_power', min_val=1.0, max_val=2.0)
    if isinstance(var_power_val, str):
        return var_power_val

    # Validate alpha
    alpha_val = validate_numeric(alpha, 'alpha', min_val=0.0, max_val=1.0)
    if isinstance(alpha_val, str):
        return alpha_val

    # Validate glm_tweedie_link
    if not isinstance(glm_tweedie_link, str):
        return "Invalid input: glm_tweedie_link must be a string."

    link_lower = glm_tweedie_link.lower()
    if link_lower == 'log':
        link_func = Log()
    elif link_lower == 'power':
        link_func = Power()
    else:
        return f"Invalid input: glm_tweedie_link must be 'log' or 'power', got '{glm_tweedie_link}'."

    # Validate fit_intercept
    if not isinstance(fit_intercept, bool):
        return "Invalid input: fit_intercept must be a boolean."

    # Add intercept if requested
    if fit_intercept:
        x_arr = sm.add_constant(x_arr, has_constant='add')

    # Check for valid data
    if np.any(np.isnan(y_arr)) or np.any(np.isinf(y_arr)):
        return "Invalid input: y contains non-finite values."
    if np.any(np.isnan(x_arr)) or np.any(np.isinf(x_arr)):
        return "Invalid input: x contains non-finite values."

    # Fit the model
    try:
        SET_USE_BIC_LLF(True)
        family = Tweedie(var_power=var_power_val, link=link_func)
        model = sm.GLM(y_arr, x_arr, family=family)
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            results = model.fit()
    except Exception as e:
        return f"GLM fitting error: {e}"

    # Extract results
    try:
        params = results.params
        std_errors = results.bse
        z_stats = results.tvalues
        p_values = results.pvalues
        conf_int = results.conf_int(alpha=alpha_val)

        # Build parameter names
        n_predictors = x_arr.shape[1]
        if fit_intercept:
            param_names = ['intercept'] + [f'x{i}' for i in range(1, n_predictors)]
        else:
            param_names = [f'x{i}' for i in range(1, n_predictors + 1)]

        # Build output table
        output = [['parameter', 'coefficient', 'std_error', 'z_statistic', 'p_value', 'ci_lower', 'ci_upper']]

        for i, name in enumerate(param_names):
            row = [
                name,
                float(params[i]),
                float(std_errors[i]),
                float(z_stats[i]),
                float(p_values[i]),
                float(conf_int[i, 0]),
                float(conf_int[i, 1])
            ]
            output.append(row)

        # Add model statistics - check for non-finite values
        deviance = float(results.deviance)
        pearson_chi2 = float(results.pearson_chi2)
        aic = float(results.aic)
        bic = float(results.bic_llf)
        llf = float(results.llf)
        scale = float(results.scale)

        # Check if any critical values are non-finite
        if math.isinf(aic) or math.isnan(aic):
            return "GLM error: AIC is non-finite, model may be degenerate."
        if math.isinf(llf) or math.isnan(llf):
            return "GLM error: log-likelihood is non-finite, model may be degenerate."

        output.append(['deviance', deviance, '', '', '', '', ''])
        output.append(['pearson_chi2', pearson_chi2, '', '', '', '', ''])
        output.append(['aic', aic, '', '', '', '', ''])
        output.append(['bic', bic, '', '', '', '', ''])
        output.append(['log_likelihood', llf, '', '', '', '', ''])
        output.append(['scale', scale, '', '', '', '', ''])

        return output
    except Exception as e:
        return f"Error extracting results: {e}"

Online Calculator