INFLUENCE_DIAG
Overview
The INFLUENCE_DIAG function computes regression influence diagnostics to identify observations that have unusual influence on an ordinary least squares (OLS) regression model. Influential observations can distort regression estimates, making their detection critical for building reliable statistical models.
This implementation uses the OLSInfluence class from the statsmodels library, which provides a comprehensive suite of influence measures for regression analysis. The function returns four key diagnostics for each observation:
Leverage measures how far an observation’s predictor values are from the mean of all predictors. It is the i-th diagonal element h_{ii} of the hat matrix H = X(X^TX)^{-1}X^T. High-leverage points have the potential to strongly influence the regression fit. A common threshold flags observations where leverage exceeds 2k/n, where k is the number of parameters and n is the number of observations. For more background, see Leverage (statistics) on Wikipedia.
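As a concrete check, the leverages in Example 1 below can be reproduced from the hat matrix in a few lines of numpy (a minimal sketch, assuming the Example 1 design of an intercept column plus the single predictor x = 1..5):

```python
import numpy as np

# Design matrix for Example 1: an intercept column plus x = 1..5.
X = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverages h_ii.
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.diag(H).round(1))          # [0.6 0.3 0.2 0.3 0.6]
print(2 * X.shape[1] / X.shape[0])  # 2k/n threshold = 0.8
```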
Cook’s Distance quantifies the overall influence of an observation by measuring how much all fitted values change when that observation is removed. It combines both leverage and residual information:
D_i = \frac{e_i^2}{p \cdot s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}
where e_i is the residual, p is the number of parameters (the same count as k above), s^2 is the mean squared error, and h_{ii} is the leverage. Observations with D_i > 4/n are typically flagged as influential. See Cook’s distance on Wikipedia.
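The formula can be checked against statsmodels directly. The sketch below recomputes Cook’s distance for the Example 1 data from the residuals, leverages, and mean squared error, then compares it with OLSInfluence.cooks_distance (which returns the distances as the first element of a tuple):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

# Example 1 data: y regressed on x = 1..5 with an intercept.
y = np.array([1.1, 2.3, 2.9, 4.2, 4.8])
X = sm.add_constant(np.arange(1.0, 6.0))
results = sm.OLS(y, X).fit()
infl = OLSInfluence(results)

e = results.resid           # residuals e_i
h = infl.hat_matrix_diag    # leverages h_ii
p = X.shape[1]              # number of parameters (here 2)
s2 = results.mse_resid      # mean squared error s^2
D = (e**2 / (p * s2)) * (h / (1 - h) ** 2)

print(np.allclose(D, infl.cooks_distance[0]))  # True
print(D.round(4))                              # [0.4573 0.2158 0.0976 0.3293 0.6585]
```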
DFFITS measures the change in the predicted value for the i-th observation when that observation is deleted from the regression, scaled by its estimated standard error. It is closely related to Cook’s distance but focuses on the impact on individual fitted values rather than overall model fit.
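In the same notation, DFFITS has the standard closed form

\mathrm{DFFITS}_i = t_i^{*} \sqrt{\frac{h_{ii}}{1 - h_{ii}}}

where t_i^{*} is the externally studentized residual, i.e., the residual studentized with the error variance re-estimated after deleting observation i. A common rule of thumb flags observations with |\mathrm{DFFITS}_i| > 2\sqrt{k/n}.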
Studentized Residuals are residuals standardized by an estimate of their standard deviation that accounts for leverage, computed as t_i = e_i / (\hat{\sigma}\sqrt{1 - h_{ii}}). This is the internally studentized form (\hat{\sigma} is estimated from the full sample), which is what the function reports. Observations with |t_i| > 2 may indicate outliers in the response variable.
The function automatically classifies each observation as influential, high_leverage, outlier, or normal based on these diagnostic thresholds, checked in that order of precedence (Cook’s distance first, then leverage, then the studentized residual), enabling analysts to quickly identify problematic data points that warrant further investigation.
This example function is provided as-is without any representation of accuracy.
Excel Usage
=INFLUENCE_DIAG(y, x, fit_intercept)
y (list[list], required): A column vector containing the dependent variable values.
x (list[list], required): A matrix where each column represents a predictor variable.
fit_intercept (bool, optional, default: true): If True, adds an intercept term to the model.
Returns (list[list]): 2D list with a header row (observation, leverage, cooks_distance, dffits, student_resid, influence_flag) followed by one diagnostics row per observation, or an error message string.
Examples
Example 1: Demo case 1
Inputs:
| y | x |
|---|---|
| 1.1 | 1 |
| 2.3 | 2 |
| 2.9 | 3 |
| 4.2 | 4 |
| 4.8 | 5 |
Excel formula:
=INFLUENCE_DIAG({1.1;2.3;2.9;4.2;4.8}, {1;2;3;4;5})
Expected output:
| observation | leverage | cooks_distance | dffits | student_resid | influence_flag |
|---|---|---|---|---|---|
| 1 | 0.6 | 0.4573 | -0.8748 | -0.7809 | normal |
| 2 | 0.3 | 0.2158 | 0.6581 | 1.003 | normal |
| 3 | 0.2 | 0.09756 | -0.4193 | -0.8835 | normal |
| 4 | 0.3 | 0.3293 | 0.9487 | 1.24 | normal |
| 5 | 0.6 | 0.6585 | -1.114 | -0.937 | normal |
Example 2: Demo case 2
Inputs:
| y | x | fit_intercept |
|---|---|---|
| 2.1 | 1 | false |
| 3.9 | 2 | |
| 6.2 | 3 | |
| 7.8 | 4 | |
| 10.3 | 5 | |
Excel formula:
=INFLUENCE_DIAG({2.1;3.9;6.2;7.8;10.3}, {1;2;3;4;5}, FALSE)
Expected output:
| observation | leverage | cooks_distance | dffits | student_resid | influence_flag |
|---|---|---|---|---|---|
| 1 | 0.01818 | 0.002815 | 0.04685 | 0.3899 | normal |
| 2 | 0.07273 | 0.04261 | -0.1923 | -0.7371 | normal |
| 3 | 0.1636 | 0.1034 | 0.2989 | 0.727 | normal |
| 4 | 0.2909 | 1.166 | -1.738 | -1.686 | influential |
| 5 | 0.4545 | 1.36 | 1.312 | 1.277 | influential |
Example 3: Demo case 3
Inputs:
| y | x1 | x2 |
|---|---|---|
| 5.2 | 1 | 2 |
| 9.8 | 2 | 4 |
| 15.1 | 3 | 6 |
| 19.9 | 4 | 8 |
| 25.3 | 5 | 10 |
| 29.7 | 6 | 12 |
Excel formula:
=INFLUENCE_DIAG({5.2;9.8;15.1;19.9;25.3;29.7}, {1,2;2,4;3,6;4,8;5,10;6,12})
Expected output:
| observation | leverage | cooks_distance | dffits | student_resid | influence_flag |
|---|---|---|---|---|---|
| 1 | 0.5238 | 0.1551 | 0.6246 | 0.6503 | normal |
| 2 | 0.2952 | 0.1931 | -0.815 | -1.176 | normal |
| 3 | 0.181 | 0.009518 | 0.1488 | 0.3595 | normal |
| 4 | 0.181 | 0.009518 | -0.1488 | -0.3595 | normal |
| 5 | 0.2952 | 0.3773 | 1.617 | 1.644 | normal |
| 6 | 0.5238 | 0.5452 | -1.397 | -1.219 | normal |
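A side note on this demo: the two predictor columns are perfectly collinear (the second is exactly twice the first), so the design matrix is rank-deficient. statsmodels' default pinv-based solver still returns a fit, which is why the leverages above sum to the effective rank of 2 rather than to the parameter count of 3. A minimal sketch:

```python
import numpy as np
import statsmodels.api as sm

# Example 3 design: intercept plus two perfectly collinear predictor columns.
x1 = np.arange(1.0, 7.0)
X = sm.add_constant(np.column_stack([x1, 2 * x1]))
print(np.linalg.matrix_rank(X))  # 2, not 3

y = np.array([5.2, 9.8, 15.1, 19.9, 25.3, 29.7])
results = sm.OLS(y, X).fit()     # default pinv solver handles the rank deficiency
print(results.get_influence().hat_matrix_diag.sum())  # ~2.0 (trace of H equals the rank)
```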
Example 4: Demo case 4
Inputs:
| y | x |
|---|---|
| 2.1 | 1 |
| 4 | 2 |
| 5.9 | 3 |
| 8.1 | 4 |
| 20 | 5 |
Excel formula:
=INFLUENCE_DIAG({2.1;4;5.9;8.1;20}, {1;2;3;4;5})
Expected output:
| observation | leverage | cooks_distance | dffits | student_resid | influence_flag |
|---|---|---|---|---|---|
| 1 | 0.6 | 0.5964 | 1.04 | 0.8917 | normal |
| 2 | 0.3 | 0.00002065 | -0.005247 | -0.009816 | normal |
| 3 | 0.2 | 0.05263 | -0.2857 | -0.6489 | normal |
| 4 | 0.3 | 0.3508 | -1.015 | -1.279 | normal |
| 5 | 0.6 | 2.248 | 66.67 | 1.731 | influential |
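Example 4 shows why the table reports both a studentized residual and DFFITS. Observation 5 is a response outlier at a high-leverage point: its internally studentized residual (the student_resid column) stays at 1.731 because that statistic is bounded above by sqrt(n - k), while DFFITS is built from the externally studentized residual, whose leave-one-out error variance is tiny here because the remaining four points lie almost exactly on a line. A minimal sketch reproducing the effect:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

# Example 4 data: the last observation is an outlier at a high-leverage x value.
y = np.array([2.1, 4.0, 5.9, 8.1, 20.0])
X = sm.add_constant(np.arange(1.0, 6.0))
infl = OLSInfluence(sm.OLS(y, X).fit())

print(infl.resid_studentized_internal[-1])  # ~1.731, near the sqrt(n - k) bound
print(infl.resid_studentized_external[-1])  # ~54.4: the leave-one-out sigma is tiny
print(infl.dffits[0][-1])                   # ~66.67 = 54.4 * sqrt(0.6 / 0.4)
```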
Python Code
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence
def influence_diag(y, x, fit_intercept=True):
"""
Computes regression influence diagnostics for identifying influential observations.
See: https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.OLSInfluence.html
This example function is provided as-is without any representation of accuracy.
Args:
y (list[list]): A column vector containing the dependent variable values.
x (list[list]): A matrix where each column represents a predictor variable.
fit_intercept (bool, optional): If True, adds an intercept term to the model. Default is True.
Returns:
list[list]: 2D list with influence diagnostics, or error message string.
"""
def to2d(val):
return [[val]] if not isinstance(val, list) else val
# Normalize inputs to 2D lists
y = to2d(y)
x = to2d(x)
# Validate inputs
if not isinstance(y, list) or not all(isinstance(row, list) for row in y):
return "Invalid input: y must be a 2D list."
if not isinstance(x, list) or not all(isinstance(row, list) for row in x):
return "Invalid input: x must be a 2D list."
if not isinstance(fit_intercept, bool):
return "Invalid input: fit_intercept must be a boolean."
# Convert to numpy arrays
try:
y_array = np.array(y, dtype=float).flatten()
x_array = np.array(x, dtype=float)
except (ValueError, TypeError):
return "Invalid input: y and x must contain numeric values."
# Check for invalid values
if np.any(np.isnan(y_array)) or np.any(np.isinf(y_array)):
return "Invalid input: y contains NaN or infinite values."
if np.any(np.isnan(x_array)) or np.any(np.isinf(x_array)):
return "Invalid input: x contains NaN or infinite values."
# Check dimensions
n = len(y_array)
if n < 3:
return "Invalid input: need at least 3 observations."
if x_array.ndim == 1:
x_array = x_array.reshape(-1, 1)
if x_array.shape[0] != n:
return "Invalid input: y and x must have the same number of observations."
# Add intercept if requested
if fit_intercept:
x_array = sm.add_constant(x_array)
# Check for sufficient predictors
k = x_array.shape[1]
if n <= k:
return "Invalid input: number of observations must be greater than number of predictors."
# Fit OLS model
try:
model = sm.OLS(y_array, x_array)
results = model.fit()
except Exception as e:
return f"Model fitting error: {str(e)}"
# Get influence diagnostics
try:
        influence = OLSInfluence(results)
# Get diagnostics
leverage = influence.hat_matrix_diag
cooks_d = influence.cooks_distance[0]
dffits = influence.dffits[0]
student_resid = influence.resid_studentized_internal
except Exception as e:
return f"Influence diagnostics error: {str(e)}"
# Define thresholds for flags
leverage_threshold = 2 * k / n
cooks_threshold = 4 / n
outlier_threshold = 2.0
# Create output with headers
output = [['observation', 'leverage', 'cooks_distance', 'dffits', 'student_resid', 'influence_flag']]
# Add diagnostics for each observation
for i in range(n):
# Determine influence flag
if abs(cooks_d[i]) > cooks_threshold:
flag = 'influential'
elif leverage[i] > leverage_threshold:
flag = 'high_leverage'
elif abs(student_resid[i]) > outlier_threshold:
flag = 'outlier'
else:
flag = 'normal'
row = [
float(i + 1), # observation number (1-indexed)
float(leverage[i]),
float(cooks_d[i]),
float(dffits[i]),
float(student_resid[i]),
flag
]
output.append(row)
return output
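A quick way to exercise the function is to replay Demo case 1 from the Examples section; a minimal sketch (printed floats truncated here):

```python
# Reproduce Demo case 1: a header row plus one diagnostics row per observation.
result = influence_diag([[1.1], [2.3], [2.9], [4.2], [4.8]],
                        [[1], [2], [3], [4], [5]])
for row in result:
    print(row)
# ['observation', 'leverage', 'cooks_distance', 'dffits', 'student_resid', 'influence_flag']
# [1.0, 0.6, 0.4573..., -0.8748..., -0.7809..., 'normal']
# ...
```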