INFLUENCE_DIAG

Overview

The INFLUENCE_DIAG function computes regression influence diagnostics to identify observations that have unusual influence on an ordinary least squares (OLS) regression model. Influential observations can distort regression estimates, making their detection critical for building reliable statistical models.

This implementation uses the OLSInfluence class from the statsmodels library, which provides a comprehensive suite of influence measures for regression analysis. The function returns four key diagnostics for each observation:

Leverage measures how far an observation’s predictor values are from the mean of all predictors. It is the diagonal element of the hat matrix H = X(X^TX)^{-1}X^T. High-leverage points have the potential to strongly influence the regression fit. A common threshold flags observations where leverage exceeds 2k/n, where k is the number of parameters and n is the number of observations. For more background, see Leverage (statistics) on Wikipedia.
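
As a quick cross-check of this definition (independent of the function below), the leverage values reported in Example 1 can be reproduced directly from the hat matrix with numpy:

```python
import numpy as np

# Design matrix for Example 1: predictor x = 1..5 plus an intercept column.
x = np.arange(1, 6, dtype=float)
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^{-1} X'; its diagonal entries are the leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print(np.round(leverage, 4))  # [0.6 0.3 0.2 0.3 0.6]
```

Here 2k/n = 2·2/5 = 0.8, so none of these observations exceed the leverage threshold.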

Cook’s Distance quantifies the overall influence of an observation by measuring how much all fitted values change when that observation is removed. It combines both leverage and residual information:

D_i = \frac{e_i^2}{p \cdot s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}

where e_i is the residual, p is the number of parameters, s^2 is the mean squared error, and h_{ii} is the leverage. Observations with D_i > 4/n are typically flagged as influential. See Cook’s distance on Wikipedia.
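
The formula can be checked by hand against Example 1's data; the numpy sketch below refits the regression independently of the function in this page and applies the definition term by term:

```python
import numpy as np

# Example 1: y regressed on x = 1..5 with an intercept (p = 2 parameters).
y = np.array([1.1, 2.3, 2.9, 4.2, 4.8])
X = np.column_stack([np.ones(5), np.arange(1, 6, dtype=float)])
n, p = X.shape

# OLS fit, residuals, mean squared error, and leverages
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
s2 = e @ e / (n - p)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Cook's distance per the formula above
D = e**2 / (p * s2) * h / (1 - h)**2
print(np.round(D, 4))  # [0.4573 0.2158 0.0976 0.3293 0.6585]
```

With n = 5 the 4/n cutoff is 0.8, so no observation in this example is flagged as influential.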

DFFITS measures the change in the i-th fitted value when observation i is deleted from the regression, scaled by its estimated standard error: \text{DFFITS}_i = t_i^{*} \sqrt{h_{ii} / (1 - h_{ii})}, where t_i^{*} is the externally studentized residual. It is closely related to Cook's distance but focuses on the impact on an individual fitted value rather than on the overall model fit.

Studentized Residuals are residuals standardized by an estimate of their standard deviation that accounts for leverage, computed as t_i = e_i / (\hat{\sigma}\sqrt{1 - h_{ii}}). This function reports the internally studentized form, in which \hat{\sigma} is estimated from the full sample. Observations with |t_i| > 2 may indicate outliers in the response variable.
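
Both quantities can be reproduced for Example 1 with numpy alone; the sketch below recomputes the internally studentized residuals from the definition above and DFFITS from the leave-one-out variance estimate (it mirrors, but does not call, the statsmodels implementation):

```python
import numpy as np

# Example 1 data, refit independently of the function below
y = np.array([1.1, 2.3, 2.9, 4.2, 4.8])
X = np.column_stack([np.ones(5), np.arange(1, 6, dtype=float)])
n, p = X.shape

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
sse = e @ e

# Internally studentized residuals: e_i / (sigma_hat * sqrt(1 - h_ii))
sigma = np.sqrt(sse / (n - p))
t_int = e / (sigma * np.sqrt(1 - h))

# Leave-one-out variance, externally studentized residuals, then DFFITS
s2_loo = (sse - e**2 / (1 - h)) / (n - p - 1)
t_ext = e / np.sqrt(s2_loo * (1 - h))
dffits = t_ext * np.sqrt(h / (1 - h))

print(np.round(t_int, 4))   # [-0.7809  1.0035 -0.8835  1.2396 -0.937 ]
print(np.round(dffits, 4))  # [-0.8748  0.6581 -0.4193  0.9487 -1.1142]
```

These match the student_resid and dffits columns of Example 1's output table, which rounds to four significant figures.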

The function automatically classifies each observation as influential, high_leverage, outlier, or normal based on these diagnostic thresholds, enabling analysts to quickly identify problematic data points that warrant further investigation.
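
The classification amounts to three threshold checks applied in order of severity; the sketch below replays that logic on Example 1's diagnostics (values copied from its output table):

```python
# Thresholds for n = 5 observations and k = 2 parameters (intercept + slope)
n, k = 5, 2
leverage = [0.6, 0.3, 0.2, 0.3, 0.6]
cooks_d = [0.4573, 0.2158, 0.09756, 0.3293, 0.6585]
t = [-0.7809, 1.003, -0.8835, 1.24, -0.937]

flags = []
for lev, d, ti in zip(leverage, cooks_d, t):
    if abs(d) > 4 / n:          # Cook's distance cutoff
        flags.append('influential')
    elif lev > 2 * k / n:       # leverage cutoff
        flags.append('high_leverage')
    elif abs(ti) > 2.0:         # studentized-residual cutoff
        flags.append('outlier')
    else:
        flags.append('normal')

print(flags)  # ['normal', 'normal', 'normal', 'normal', 'normal']
```

Because Cook's distance is checked first, an observation that trips several cutoffs is reported as influential rather than high_leverage or outlier.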

This example function is provided as-is without any representation of accuracy.

Excel Usage

=INFLUENCE_DIAG(y, x, fit_intercept)
  • y (list[list], required): A column vector containing the dependent variable values.
  • x (list[list], required): A matrix where each column represents a predictor variable.
  • fit_intercept (bool, optional, default: true): If True, adds an intercept term to the model.

Returns (list[list]): 2D list with influence diagnostics, or error message string.

Examples

Example 1: Demo case 1

Inputs:

y x
1.1 1
2.3 2
2.9 3
4.2 4
4.8 5

Excel formula:

=INFLUENCE_DIAG({1.1;2.3;2.9;4.2;4.8}, {1;2;3;4;5})

Expected output:

observation leverage cooks_distance dffits student_resid influence_flag
1 0.6 0.4573 -0.8748 -0.7809 normal
2 0.3 0.2158 0.6581 1.003 normal
3 0.2 0.09756 -0.4193 -0.8835 normal
4 0.3 0.3293 0.9487 1.24 normal
5 0.6 0.6585 -1.114 -0.937 normal

Example 2: Demo case 2

Inputs:

y x fit_intercept
2.1 1 false
3.9 2
6.2 3
7.8 4
10.3 5

Excel formula:

=INFLUENCE_DIAG({2.1;3.9;6.2;7.8;10.3}, {1;2;3;4;5}, FALSE)

Expected output:

observation leverage cooks_distance dffits student_resid influence_flag
1 0.01818 0.002815 0.04685 0.3899 normal
2 0.07273 0.04261 -0.1923 -0.7371 normal
3 0.1636 0.1034 0.2989 0.727 normal
4 0.2909 1.166 -1.738 -1.686 influential
5 0.4545 1.36 1.312 1.277 influential

Example 3: Demo case 3

Inputs:

y x1 x2
5.2 1 2
9.8 2 4
15.1 3 6
19.9 4 8
25.3 5 10
29.7 6 12

Excel formula:

=INFLUENCE_DIAG({5.2;9.8;15.1;19.9;25.3;29.7}, {1,2;2,4;3,6;4,8;5,10;6,12})

Expected output:

observation leverage cooks_distance dffits student_resid influence_flag
1 0.5238 0.1551 0.6246 0.6503 normal
2 0.2952 0.1931 -0.815 -1.176 normal
3 0.181 0.009518 0.1488 0.3595 normal
4 0.181 0.009518 -0.1488 -0.3595 normal
5 0.2952 0.3773 1.617 1.644 normal
6 0.5238 0.5452 -1.397 -1.219 normal

Example 4: Demo case 4

Inputs:

y x
2.1 1
4 2
5.9 3
8.1 4
20 5

Excel formula:

=INFLUENCE_DIAG({2.1;4;5.9;8.1;20}, {1;2;3;4;5})

Expected output:

observation leverage cooks_distance dffits student_resid influence_flag
1 0.6 0.5964 1.04 0.8917 normal
2 0.3 0.00002065 -0.005247 -0.009816 normal
3 0.2 0.05263 -0.2857 -0.6489 normal
4 0.3 0.3508 -1.015 -1.279 normal
5 0.6 2.248 66.67 1.731 influential

Python Code

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence as statsmodels_ols_influence

def influence_diag(y, x, fit_intercept=True):
    """
    Computes regression influence diagnostics for identifying influential observations.

    See: https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.OLSInfluence.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): A column vector containing the dependent variable values.
        x (list[list]): A matrix where each column represents a predictor variable.
        fit_intercept (bool, optional): If True, adds an intercept term to the model. Default is True.

    Returns:
        list[list]: 2D list with influence diagnostics, or error message string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    # Normalize inputs to 2D lists
    y = to2d(y)
    x = to2d(x)

    # Validate inputs
    if not isinstance(y, list) or not all(isinstance(row, list) for row in y):
        return "Invalid input: y must be a 2D list."

    if not isinstance(x, list) or not all(isinstance(row, list) for row in x):
        return "Invalid input: x must be a 2D list."

    if not isinstance(fit_intercept, bool):
        return "Invalid input: fit_intercept must be a boolean."

    # Convert to numpy arrays
    try:
        y_array = np.array(y, dtype=float).flatten()
        x_array = np.array(x, dtype=float)
    except (ValueError, TypeError):
        return "Invalid input: y and x must contain numeric values."

    # Check for invalid values
    if np.any(np.isnan(y_array)) or np.any(np.isinf(y_array)):
        return "Invalid input: y contains NaN or infinite values."

    if np.any(np.isnan(x_array)) or np.any(np.isinf(x_array)):
        return "Invalid input: x contains NaN or infinite values."

    # Check dimensions
    n = len(y_array)
    if n < 3:
        return "Invalid input: need at least 3 observations."

    if x_array.ndim == 1:
        x_array = x_array.reshape(-1, 1)

    if x_array.shape[0] != n:
        return "Invalid input: y and x must have the same number of observations."

    # Add intercept if requested
    if fit_intercept:
        x_array = sm.add_constant(x_array)

    # Check for sufficient predictors
    k = x_array.shape[1]
    if n <= k:
        return "Invalid input: number of observations must be greater than number of predictors."

    # Fit OLS model
    try:
        model = sm.OLS(y_array, x_array)
        results = model.fit()
    except Exception as e:
        return f"Model fitting error: {str(e)}"

    # Get influence diagnostics
    try:
        influence = statsmodels_ols_influence(results)

        # Extract each measure. cooks_distance and dffits return
        # (values, auxiliary) tuples, so take the first element.
        leverage = influence.hat_matrix_diag
        cooks_d = influence.cooks_distance[0]
        dffits = influence.dffits[0]
        student_resid = influence.resid_studentized_internal

    except Exception as e:
        return f"Influence diagnostics error: {str(e)}"

    # Define thresholds for flags
    leverage_threshold = 2 * k / n
    cooks_threshold = 4 / n
    outlier_threshold = 2.0

    # Create output with headers
    output = [['observation', 'leverage', 'cooks_distance', 'dffits', 'student_resid', 'influence_flag']]

    # Add diagnostics for each observation
    for i in range(n):
        # Determine influence flag
        if abs(cooks_d[i]) > cooks_threshold:
            flag = 'influential'
        elif leverage[i] > leverage_threshold:
            flag = 'high_leverage'
        elif abs(student_resid[i]) > outlier_threshold:
            flag = 'outlier'
        else:
            flag = 'normal'

        row = [
            float(i + 1),  # observation number (1-indexed)
            float(leverage[i]),
            float(cooks_d[i]),
            float(dffits[i]),
            float(student_resid[i]),
            flag
        ]
        output.append(row)

    return output
