REGRESS_DIAG

Overview

The REGRESS_DIAG function performs a comprehensive suite of diagnostic tests on regression residuals to evaluate model assumptions. It returns results for six statistical tests across three categories: heteroskedasticity, autocorrelation, and normality. This implementation uses the statsmodels library’s diagnostic module.

Heteroskedasticity Tests check whether the variance of residuals is constant across observations—a key assumption of ordinary least squares (OLS) regression. The function includes:

Breusch-Pagan test: A Lagrange Multiplier test that regresses squared residuals on the independent variables. Significant results (p < 0.05) indicate heteroskedasticity.
White test: A more general test that includes cross-products and squared terms of regressors, making it sensitive to both heteroskedasticity and model misspecification.

Autocorrelation Tests detect serial correlation in residuals, which violates the independence assumption and can lead to inefficient coefficient estimates:

Ljung-Box test: A portmanteau test that examines whether any of the first k autocorrelations are non-zero. The function uses k = \min(10, n/5) lags.
Durbin-Watson statistic: A test specifically for first-order autocorrelation. Values near 2 indicate no autocorrelation; values below 1.5 or above 2.5 suggest potential problems.

Normality Tests assess whether residuals follow a normal distribution, which affects the validity of confidence intervals and hypothesis tests:

Jarque-Bera test: Examines whether residual skewness and kurtosis match a normal distribution. The test statistic follows a chi-squared distribution with 2 degrees of freedom.
Omnibus test: D’Agostino’s K-squared test, which combines measures of skewness and kurtosis to test for normality.

Each test returns a statistic, p-value (except Durbin-Watson), and an interpretation based on a 0.05 significance threshold. For more details on the underlying algorithms, see the statsmodels documentation and GitHub repository.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=REGRESS_DIAG(residuals, exog, endog)

residuals (list[list], required): A column vector of regression residuals from the fitted model.
exog (list[list], required): A matrix of independent variables (predictors) used in the regression model, including a constant column if applicable.
endog (list[list], required): A column vector of the dependent variable from the regression model.

Returns (list[list]): 2D list with regression diagnostics

Example 1: Diagnostics on a small residual sample

Inputs:

residuals	exog		endog
0.1	1	1	2.1
0.2	1	2	4.2
-0.1	1	3	6.1
0.05	1	4	8.05
-0.15	1	5	9.85
0.08	1	6	12.08
-0.05	1	7	13.95
0.12	1	8	16.12
-0.08	1	9	17.92
0.03	1	10	20.03

Excel formula:

=REGRESS_DIAG({0.1;0.2;-0.1;0.05;-0.15;0.08;-0.05;0.12;-0.08;0.03}, {1,1;1,2;1,3;1,4;1,5;1,6;1,7;1,8;1,9;1,10}, {2.1;4.2;6.1;8.05;9.85;12.08;13.95;16.12;17.92;20.03})

Expected output:

test_category	test_name	statistic	p_value	conclusion
Heteroskedasticity	Breusch-Pagan	2.30625	0.128855	No problem detected
Heteroskedasticity	White	2.30686	0.315553	No problem detected
Autocorrelation	Ljung-Box	4.81858	0.0898789	No problem detected
Autocorrelation	Durbin-Watson	2.71021		Problem detected
Normality	Jarque-Bera	0.502584	0.777795	Normal
Normality	Omnibus	0.523743	0.76961	Normal

Example 2: Diagnostics with near-zero residuals

Inputs:

residuals	exog		endog
0.01	1	2	4.01
0.02	1	3	6.02
-0.01	1	4	7.99
0.03	1	5	10.03
-0.02	1	6	11.98
0.015	1	7	14.015
-0.015	1	8	15.985
0.025	1	9	18.025

Excel formula:

=REGRESS_DIAG({0.01;0.02;-0.01;0.03;-0.02;0.015;-0.015;0.025}, {1,2;1,3;1,4;1,5;1,6;1,7;1,8;1,9}, {4.01;6.02;7.99;10.03;11.98;14.015;15.985;18.025})

Expected output:

test_category	test_name	statistic	p_value	conclusion
Heteroskedasticity	Breusch-Pagan	0.636254	0.42507	No problem detected
Heteroskedasticity	White	0.916397	0.632422	No problem detected
Autocorrelation	Ljung-Box	6.67588	0.00977264	Problem detected
Autocorrelation	Durbin-Watson	2.96639		Problem detected
Normality	Jarque-Bera	0.867162	0.648184	Normal
Normality	Omnibus	2.27487	0.320641	Normal

Example 3: Diagnostics on a larger sample

Inputs:

residuals	exog		endog
0.5	1	1	2.5
-0.3	1	2	3.7
0.2	1	3	6.2
-0.1	1	4	7.9
0.4	1	5	10.4
-0.2	1	6	11.8
0.3	1	7	14.3
-0.4	1	8	15.6
0.1	1	9	18.1
-0.5	1	10	19.5
0.2	1	11	22.2
-0.1	1	12	23.9

Excel formula:

=REGRESS_DIAG({0.5;-0.3;0.2;-0.1;0.4;-0.2;0.3;-0.4;0.1;-0.5;0.2;-0.1}, {1,1;1,2;1,3;1,4;1,5;1,6;1,7;1,8;1,9;1,10;1,11;1,12}, {2.5;3.7;6.2;7.9;10.4;11.8;14.3;15.6;18.1;19.5;22.2;23.9})

Expected output:

test_category	test_name	statistic	p_value	conclusion
Heteroskedasticity	Breusch-Pagan	0.592395	0.441494	No problem detected
Heteroskedasticity	White	0.746539	0.68848	No problem detected
Autocorrelation	Ljung-Box	13.0993	0.00143061	Problem detected
Autocorrelation	Durbin-Watson	3.06087		Problem detected
Normality	Jarque-Bera	0.731205	0.693778	Normal
Normality	Omnibus	1.25139	0.534889	Normal

Example 4: Diagnostics with multiple predictors

Inputs:

residuals	exog			endog
0.1	1	1.5	2	5.1
0.15	1	2.5	3	8.15
0.12	1	3.5	4	11.12
0.18	1	4.5	5	14.18
0.14	1	5.5	6	17.14
0.16	1	6.5	7	20.16
0.13	1	7.5	8	23.13
0.17	1	8.5	9	26.17

Excel formula:

=REGRESS_DIAG({0.1;0.15;0.12;0.18;0.14;0.16;0.13;0.17}, {1,1.5,2;1,2.5,3;1,3.5,4;1,4.5,5;1,5.5,6;1,6.5,7;1,7.5,8;1,8.5,9}, {5.1;8.15;11.12;14.18;17.14;20.16;23.13;26.17})

Expected output:

test_category	test_name	statistic	p_value	conclusion
Heteroskedasticity	Breusch-Pagan	1.87035	0.392517	No problem detected
Heteroskedasticity	White	2.42786	0.297028	No problem detected
Autocorrelation	Ljung-Box	1.95736	0.161795	No problem detected
Autocorrelation	Durbin-Watson	0.0675279		Problem detected
Normality	Jarque-Bera	0.393101	0.82156	Normal
Normality	Omnibus	0.266869	0.875085	Normal

Python Code

Show Code

import math
import numpy as np
from statsmodels.stats.diagnostic import het_breuschpagan, het_white, acorr_ljungbox
from statsmodels.stats.stattools import durbin_watson, jarque_bera, omni_normtest

def regress_diag(residuals, exog, endog):
    """
    Performs comprehensive regression diagnostic tests.

    See: https://www.statsmodels.org/stable/stats.html#residual-diagnostics-and-specification-tests

    This example function is provided as-is without any representation of accuracy.

    Args:
        residuals (list[list]): A column vector of regression residuals from the fitted model.
        exog (list[list]): A matrix of independent variables (predictors) used in the regression model, including a constant column if applicable.
        endog (list[list]): A column vector of the dependent variable from the regression model.

    Returns:
        list[list]: 2D list with regression diagnostics
    """
    def to2d(x):
        return [[x]] if not isinstance(x, list) else x

    def validate_2d_numeric(data, name):
        # Validate 2D list structure
        if not isinstance(data, list):
            return f"Error: {name} must be a 2D list."
        if not data:
            return f"Error: {name} cannot be empty."
        for i, row in enumerate(data):
            if not isinstance(row, list):
                return f"Error: {name} row {i} must be a list."
            if not row:
                return f"Error: {name} row {i} cannot be empty."
            for j, val in enumerate(row):
                if not isinstance(val, (int, float)):
                    return f"Error: {name}[{i}][{j}] must be numeric."
                if math.isnan(val) or math.isinf(val):
                    return f"Error: {name}[{i}][{j}] must be finite."
        return None

    def interpret_pvalue(p_value, test_category):
        # Interpretation based on p-value threshold of 0.05
        if test_category in ('Heteroskedasticity', 'Autocorrelation'):
            return 'Problem detected' if p_value < 0.05 else 'No problem detected'
        elif test_category == 'Normality':
            return 'Non-normal' if p_value < 0.05 else 'Normal'
        return 'N/A'

    try:
        # Normalize inputs
        residuals_2d = to2d(residuals)
        exog_2d = to2d(exog)
        endog_2d = to2d(endog)

        # Validate inputs
        error = validate_2d_numeric(residuals_2d, 'residuals')
        if error:
            return error

        error = validate_2d_numeric(exog_2d, 'exog')
        if error:
            return error

        error = validate_2d_numeric(endog_2d, 'endog')
        if error:
            return error

        # Convert to numpy arrays
        resid_array = np.array(residuals_2d, dtype=float)
        exog_array = np.array(exog_2d, dtype=float)
        endog_array = np.array(endog_2d, dtype=float)

        # Flatten residuals to 1D array
        if resid_array.shape[1] != 1:
            return "Error: residuals must be a column vector (2D list with 1 column)."
        resid_1d = resid_array.flatten()

        # Flatten endog to 1D array
        if endog_array.shape[1] != 1:
            return "Error: endog must be a column vector (2D list with 1 column)."
        endog_1d = endog_array.flatten()

        # Validate dimensions
        n_obs = len(resid_1d)
        if exog_array.shape[0] != n_obs:
            return f"Error: number of residuals ({n_obs}) must match number of observations in exog ({exog_array.shape[0]})."
        if len(endog_1d) != n_obs:
            return f"Error: number of residuals ({n_obs}) must match number of observations in endog ({len(endog_1d)})."

        if n_obs < 3:
            return "Error: insufficient data for diagnostic tests (need at least 3 observations)."

        # Initialize results
        results = [['test_category', 'test_name', 'statistic', 'p_value', 'conclusion']]

        # Run diagnostic tests
        # Heteroskedasticity tests
        # Breusch-Pagan test
        lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid_1d, exog_array)
        interp = interpret_pvalue(lm_pvalue, 'Heteroskedasticity')
        results.append(['Heteroskedasticity', 'Breusch-Pagan', float(lm_stat), float(lm_pvalue), interp])

        # White test
        white_lm, white_lm_pvalue, white_f, white_f_pvalue = het_white(resid_1d, exog_array)
        interp = interpret_pvalue(white_lm_pvalue, 'Heteroskedasticity')
        results.append(['Heteroskedasticity', 'White', float(white_lm), float(white_lm_pvalue), interp])

        # Autocorrelation tests
        # Ljung-Box test (using lag=min(10, n_obs//5))
        max_lag = min(10, max(1, n_obs // 5))
        lb_result = acorr_ljungbox(resid_1d, lags=max_lag, return_df=True)
        # Use the last lag result
        last_row = lb_result.iloc[-1]
        lb_stat = float(last_row['lb_stat'])
        lb_pvalue = float(last_row['lb_pvalue'])
        interp = interpret_pvalue(lb_pvalue, 'Autocorrelation')
        results.append(['Autocorrelation', 'Ljung-Box', lb_stat, lb_pvalue, interp])

        # Durbin-Watson test (note: DW doesn't have a p-value, only statistic)
        dw_stat = durbin_watson(resid_1d)
        # DW statistic interpretation: values around 2 indicate no autocorrelation
        # < 2: positive autocorrelation, > 2: negative autocorrelation
        if dw_stat < 1.5 or dw_stat > 2.5:
            dw_interp = 'Problem detected'
        else:
            dw_interp = 'No problem detected'
        results.append(['Autocorrelation', 'Durbin-Watson', float(dw_stat), '', dw_interp])

        # Normality tests
        # Jarque-Bera test
        jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(resid_1d)
        interp = interpret_pvalue(jb_pvalue, 'Normality')
        results.append(['Normality', 'Jarque-Bera', float(jb_stat), float(jb_pvalue), interp])

        # Omnibus test
        omnibus_stat, omnibus_pvalue = omni_normtest(resid_1d)
        interp = interpret_pvalue(omnibus_pvalue, 'Normality')
        results.append(['Normality', 'Omnibus', float(omnibus_stat), float(omnibus_pvalue), interp])

        # Validate results
        for row in results[1:]:
            if not isinstance(row[2], (int, float)) or math.isnan(row[2]) or math.isinf(row[2]):
                return "Error: test produced invalid statistic."
            # p_value can be empty for Durbin-Watson
            if row[3] != '' and row[3] is not None:
                if not isinstance(row[3], (int, float)) or math.isnan(row[3]) or math.isinf(row[3]):
                    return "Error: test produced invalid p-value."

        return results
    except Exception as exc:
        return f"Error: {exc}"

Online Calculator

residuals *

A column vector of regression residuals from the fitted model.

exog *

A matrix of independent variables (predictors) used in the regression model, including a constant column if applicable.

endog *

A column vector of the dependent variable from the regression model.