INFLUENCE_DIAG

Overview

The INFLUENCE_DIAG function computes regression influence diagnostics to identify observations that have unusual influence on an ordinary least squares (OLS) regression model. Influential observations can distort regression estimates, making their detection critical for building reliable statistical models.

This implementation uses the OLSInfluence class from the statsmodels library, which provides a comprehensive suite of influence measures for regression analysis. The function returns four key diagnostics for each observation:

Leverage measures how far an observation’s predictor values are from the mean of all predictors. It is the diagonal element of the hat matrix H = X(X^TX)^{-1}X^T. High-leverage points have the potential to strongly influence the regression fit. A common threshold flags observations where leverage exceeds 2k/n, where k is the number of parameters and n is the number of observations. For more background, see Leverage (statistics) on Wikipedia.
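
As a quick cross-check of this definition (independent of the function below), the leverage values reported in Example 1 can be reproduced directly from the hat matrix with numpy:

```python
import numpy as np

# Design matrix for Example 1: predictor x = 1..5 plus an intercept column.
x = np.arange(1, 6, dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^{-1} X'; its diagonal entries are the leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print(np.round(leverage, 4))  # [0.6 0.3 0.2 0.3 0.6]
```

Here 2k/n = 2·2/5 = 0.8, so none of these observations exceed the leverage threshold.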

Cook’s Distance quantifies the overall influence of an observation by measuring how much all fitted values change when that observation is removed. It combines both leverage and residual information:

D_i = \frac{e_i^2}{p \cdot s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}

where e_i is the residual, p is the number of parameters, s^2 is the mean squared error, and h_{ii} is the leverage. Observations with D_i > 4/n are typically flagged as influential. See Cook’s distance on Wikipedia.
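
The formula can be checked by hand against Example 1's data; the numpy sketch below refits the regression independently of the function in this page and applies the definition term by term:

```python
import numpy as np

# Example 1: y regressed on x = 1..5 with an intercept (p = 2 parameters).
y = np.array([1.1, 2.3, 2.9, 4.2, 4.8])
X = np.column_stack([np.ones(5), np.arange(1, 6, dtype=float)])
n, p = X.shape

# OLS fit, residuals, mean squared error, and leverages
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
s2 = e @ e / (n - p)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Cook's distance per the formula above
D = e**2 / (p * s2) * h / (1 - h)**2
print(np.round(D, 4))  # [0.4573 0.2158 0.0976 0.3293 0.6585]
```

With n = 5 the 4/n cutoff is 0.8, so no observation in this example is flagged as influential.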

DFFITS measures the change in the i-th fitted value when observation i is deleted from the regression, scaled by its estimated standard error: \text{DFFITS}_i = t_i^{*} \sqrt{h_{ii} / (1 - h_{ii})}, where t_i^{*} is the externally studentized residual. It is closely related to Cook's distance but focuses on the impact on an individual fitted value rather than on the overall model fit.

Studentized Residuals are residuals standardized by an estimate of their standard deviation that accounts for leverage, computed as t_i = e_i / (\hat{\sigma}\sqrt{1 - h_{ii}}). This function reports the internally studentized form, in which \hat{\sigma} is estimated from the full sample. Observations with |t_i| > 2 may indicate outliers in the response variable.
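
Both quantities can be reproduced for Example 1 with numpy alone; the sketch below recomputes the internally studentized residuals from the definition above and DFFITS from the leave-one-out variance estimate (it mirrors, but does not call, the statsmodels implementation):

```python
import numpy as np

# Example 1 data, refit independently of the function below
y = np.array([1.1, 2.3, 2.9, 4.2, 4.8])
X = np.column_stack([np.ones(5), np.arange(1, 6, dtype=float)])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
sse = e @ e

# Internally studentized residuals: e_i / (sigma_hat * sqrt(1 - h_ii))
sigma = np.sqrt(sse / (n - p))
t_int = e / (sigma * np.sqrt(1 - h))

# Leave-one-out variance, externally studentized residuals, then DFFITS
s2_loo = (sse - e**2 / (1 - h)) / (n - p - 1)
t_ext = e / np.sqrt(s2_loo * (1 - h))
dffits = t_ext * np.sqrt(h / (1 - h))

print(np.round(t_int, 4))   # [-0.7809  1.0035 -0.8835  1.2396 -0.937 ]
print(np.round(dffits, 4))  # [-0.8748  0.6581 -0.4193  0.9487 -1.1142]
```

These match the student_resid and dffits columns of Example 1's output table, which rounds to four significant figures.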

The function automatically classifies each observation as influential, high_leverage, outlier, or normal based on these diagnostic thresholds, enabling analysts to quickly identify problematic data points that warrant further investigation.
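
The classification amounts to three threshold checks applied in order of severity; the sketch below replays that logic on Example 1's diagnostics (values copied from its output table):

```python
# Thresholds for n = 5 observations and k = 2 parameters (intercept + slope)
n, k = 5, 2
leverage = [0.6, 0.3, 0.2, 0.3, 0.6]
cooks_d = [0.4573, 0.2158, 0.09756, 0.3293, 0.6585]
t = [-0.7809, 1.003, -0.8835, 1.24, -0.937]

flags = []
for lev, d, ti in zip(leverage, cooks_d, t):
    if abs(d) > 4 / n:          # Cook's distance cutoff
        flags.append('influential')
    elif lev > 2 * k / n:       # leverage cutoff
        flags.append('high_leverage')
    elif abs(ti) > 2.0:         # studentized-residual cutoff
        flags.append('outlier')
    else:
        flags.append('normal')

print(flags)  # ['normal', 'normal', 'normal', 'normal', 'normal']
```

Because Cook's distance is checked first, an observation that trips several cutoffs is reported as influential rather than high_leverage or outlier.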

This example function is provided as-is without any representation of accuracy.

Excel Usage

=INFLUENCE_DIAG(y, x, fit_intercept)
  • y (list[list], required): A column vector containing the dependent variable values.
  • x (list[list], required): A matrix where each column represents a predictor variable.
  • fit_intercept (bool, optional, default: true): If True, adds an intercept term to the model.

Returns (list[list]): 2D list with influence diagnostics, or error message string.

Examples

Example 1: Demo case 1

Inputs:

y x
1.1 1
2.3 2
2.9 3
4.2 4
4.8 5

Excel formula:

=INFLUENCE_DIAG({1.1;2.3;2.9;4.2;4.8}, {1;2;3;4;5})

Expected output:

observation leverage cooks_distance dffits student_resid influence_flag
1 0.6 0.4573 -0.8748 -0.7809 normal
2 0.3 0.2158 0.6581 1.003 normal
3 0.2 0.09756 -0.4193 -0.8835 normal
4 0.3 0.3293 0.9487 1.24 normal
5 0.6 0.6585 -1.114 -0.937 normal

Example 2: Demo case 2

Inputs:

y x fit_intercept
2.1 1 false
3.9 2
6.2 3
7.8 4
10.3 5

Excel formula:

=INFLUENCE_DIAG({2.1;3.9;6.2;7.8;10.3}, {1;2;3;4;5}, FALSE)

Expected output:

observation leverage cooks_distance dffits student_resid influence_flag
1 0.01818 0.002815 0.04685 0.3899 normal
2 0.07273 0.04261 -0.1923 -0.7371 normal
3 0.1636 0.1034 0.2989 0.727 normal
4 0.2909 1.166 -1.738 -1.686 influential
5 0.4545 1.36 1.312 1.277 influential

Example 3: Demo case 3

Inputs:

y x1 x2
5.2 1 2
9.8 2 4
15.1 3 6
19.9 4 8
25.3 5 10
29.7 6 12

Excel formula:

=INFLUENCE_DIAG({5.2;9.8;15.1;19.9;25.3;29.7}, {1,2;2,4;3,6;4,8;5,10;6,12})

Expected output:

observation leverage cooks_distance dffits student_resid influence_flag
1 0.5238 0.1551 0.6246 0.6503 normal
2 0.2952 0.1931 -0.815 -1.176 normal
3 0.181 0.009518 0.1488 0.3595 normal
4 0.181 0.009518 -0.1488 -0.3595 normal
5 0.2952 0.3773 1.617 1.644 normal
6 0.5238 0.5452 -1.397 -1.219 normal

Example 4: Demo case 4

Inputs:

y x
2.1 1
4 2
5.9 3
8.1 4
20 5

Excel formula:

=INFLUENCE_DIAG({2.1;4;5.9;8.1;20}, {1;2;3;4;5})

Expected output:

observation leverage cooks_distance dffits student_resid influence_flag
1 0.6 0.5964 1.04 0.8917 normal
2 0.3 0.00002065 -0.005247 -0.009816 normal
3 0.2 0.05263 -0.2857 -0.6489 normal
4 0.3 0.3508 -1.015 -1.279 normal
5 0.6 2.248 66.67 1.731 influential

Python Code

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence as statsmodels_ols_influence

def influence_diag(y, x, fit_intercept=True):
    """
    Computes regression influence diagnostics for identifying influential observations.

    See: https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.OLSInfluence.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): A column vector containing the dependent variable values.
        x (list[list]): A matrix where each column represents a predictor variable.
        fit_intercept (bool, optional): If True, adds an intercept term to the model. Default is True.

    Returns:
        list[list]: 2D list with influence diagnostics, or error message string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    # Normalize inputs to 2D lists
    y = to2d(y)
    x = to2d(x)

    # Validate inputs
    if not isinstance(y, list) or not all(isinstance(row, list) for row in y):
        return "Invalid input: y must be a 2D list."

    if not isinstance(x, list) or not all(isinstance(row, list) for row in x):
        return "Invalid input: x must be a 2D list."

    if not isinstance(fit_intercept, bool):
        return "Invalid input: fit_intercept must be a boolean."

    # Convert to numpy arrays
    try:
        y_array = np.array(y, dtype=float).flatten()
        x_array = np.array(x, dtype=float)
    except (ValueError, TypeError):
        return "Invalid input: y and x must contain numeric values."

    # Check for invalid values
    if np.any(np.isnan(y_array)) or np.any(np.isinf(y_array)):
        return "Invalid input: y contains NaN or infinite values."

    if np.any(np.isnan(x_array)) or np.any(np.isinf(x_array)):
        return "Invalid input: x contains NaN or infinite values."

    # Check dimensions
    n = len(y_array)
    if n < 3:
        return "Invalid input: need at least 3 observations."

    if x_array.ndim == 1:
        x_array = x_array.reshape(-1, 1)

    if x_array.shape[0] != n:
        return "Invalid input: y and x must have the same number of observations."

    # Add intercept if requested
    if fit_intercept:
        x_array = sm.add_constant(x_array)

    # Check for sufficient predictors
    k = x_array.shape[1]
    if n <= k:
        return "Invalid input: number of observations must be greater than number of predictors."

    # Fit OLS model
    try:
        model = sm.OLS(y_array, x_array)
        results = model.fit()
    except Exception as e:
        return f"Model fitting error: {str(e)}"

    # Get influence diagnostics
    try:
        influence = statsmodels_ols_influence(results)

        # Extract each measure. cooks_distance and dffits return
        # (values, auxiliary) tuples, so take the first element.
        leverage = influence.hat_matrix_diag
        cooks_d = influence.cooks_distance[0]
        dffits = influence.dffits[0]
        student_resid = influence.resid_studentized_internal

    except Exception as e:
        return f"Influence diagnostics error: {str(e)}"

    # Define thresholds for flags
    leverage_threshold = 2 * k / n
    cooks_threshold = 4 / n
    outlier_threshold = 2.0

    # Create output with headers
    output = [['observation', 'leverage', 'cooks_distance', 'dffits', 'student_resid', 'influence_flag']]

    # Add diagnostics for each observation
    for i in range(n):
        # Determine influence flag
        if abs(cooks_d[i]) > cooks_threshold:
            flag = 'influential'
        elif leverage[i] > leverage_threshold:
            flag = 'high_leverage'
        elif abs(student_resid[i]) > outlier_threshold:
            flag = 'outlier'
        else:
            flag = 'normal'

        row = [
            float(i + 1),  # observation number (1-indexed)
            float(leverage[i]),
            float(cooks_d[i]),
            float(dffits[i]),
            float(student_resid[i]),
            flag
        ]
        output.append(row)

    return output
