SPECIFICATION_TESTS
Overview
The SPECIFICATION_TESTS function performs a suite of diagnostic tests on an Ordinary Least Squares (OLS) regression to detect model misspecification. Model specification errors occur when the functional form of the regression model does not adequately capture the true relationship between variables, leading to biased or inefficient estimates.
This implementation uses the statsmodels library’s diagnostic module to run three complementary tests:
Ramsey’s RESET Test (Regression Equation Specification Error Test) detects neglected nonlinearity by augmenting the original regression with powers of the fitted values. The null hypothesis is that the model is correctly specified as linear. A significant result (p-value < 0.05) suggests the model may be missing polynomial or interaction terms. See the linear_reset documentation for details.
Rainbow Test evaluates linearity by comparing the fit of the full model against the fit obtained on a central subset of the observations (optionally ordered by an explanatory variable). Under the null hypothesis the relationship is linear, so the two fits should be similar; rejection indicates potential nonlinearity or structural breaks in the data. The test has power against many forms of nonlinearity. See the linear_rainbow documentation.
Harvey-Collier Test is a t-test on the mean of recursive OLS residuals. Under correct specification, these residuals should have zero mean. The test computes residuals by sequentially adding observations and re-estimating the model, detecting systematic patterns that indicate specification errors. See the linear_harvey_collier documentation.
The RESET and Rainbow tests return an F-statistic and the Harvey-Collier test returns a t-statistic, each paired with a p-value. A p-value greater than 0.05 (the conventional threshold) means the null hypothesis of correct specification cannot be rejected, while a smaller p-value indicates potential misspecification. The function returns a conclusion and an interpretation for each test to guide remedial action.
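The same three diagnostics can be run directly in Python on a fitted statsmodels OLS results object. The snippet below is a minimal sketch; the synthetic data, seed, and variable names (`x`, `y`, `X`) are illustrative only and not part of this function:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset, linear_rainbow, linear_harvey_collier

# Illustrative data: a roughly linear relationship with a little noise
rng = np.random.default_rng(0)
x = np.arange(1, 31, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

X = sm.add_constant(x)          # add an intercept column
results = sm.OLS(y, X).fit()    # fit the baseline linear model

reset = linear_reset(results, use_f=True)          # RESET: powers of fitted values
rainbow_stat, rainbow_p = linear_rainbow(results)  # Rainbow: central-subset fit
hc = linear_harvey_collier(results)                # Harvey-Collier: recursive residuals

for name, p in [("RESET", reset.pvalue),
                ("Rainbow", rainbow_p),
                ("Harvey-Collier", hc.pvalue)]:
    # p > 0.05: fail to reject correct specification (conventional threshold)
    verdict = "no evidence of misspecification" if p > 0.05 else "possible misspecification"
    print(f"{name}: p = {p:.4f} -> {verdict}")
```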
This example function is provided as-is without any representation of accuracy.
Excel Usage
=SPECIFICATION_TESTS(y, x, fit_intercept)
- y (list[list], required): Dependent variable as a column vector (Nx1 matrix)
- x (list[list], required): Independent variables (predictors) as a matrix where each column is a predictor
- fit_intercept (bool, optional, default: true): Whether to add an intercept term to the model
Returns (list[list]): 2D list with specification tests, or error message string.
Examples
Example 1: Demo case 1
Inputs:
| y | x |
|---|---|
| 3.397 | 1 |
| 4.889 | 2 |
| 7.518 | 3 |
| 10.22 | 4 |
| 10.81 | 5 |
| 12.81 | 6 |
| 16.26 | 7 |
| 17.61 | 8 |
| 18.62 | 9 |
| 21.43 | 10 |
| 22.63 | 11 |
| 24.63 | 12 |
Excel formula:
=SPECIFICATION_TESTS({3.397;4.889;7.518;10.22;10.81;12.81;16.26;17.61;18.62;21.43;22.63;24.63}, {1;2;3;4;5;6;7;8;9;10;11;12})
Expected output:
| test_name | statistic | p_value | conclusion | interpretation |
|---|---|---|---|---|
| RESET | 0.6257 | 0.5591 | No misspecification | Model appears linear |
| Rainbow | 0.2573 | 0.9317 | Linear | Model specification appears adequate |
| Harvey-Collier | -0.9325 | 0.3784 | Correctly specified | No evidence of specification error |
Example 2: Demo case 2
Inputs:
| y | x |
|---|---|
| 2.1 | 1 |
| 3.9 | 2 |
| 6.2 | 3 |
| 7.8 | 4 |
| 10.1 | 5 |
| 12 | 6 |
| 13.9 | 7 |
| 16.2 | 8 |
Excel formula:
=SPECIFICATION_TESTS({2.1;3.9;6.2;7.8;10.1;12;13.9;16.2}, {1;2;3;4;5;6;7;8})
Expected output:
| test_name | statistic | p_value | conclusion | interpretation |
|---|---|---|---|---|
| RESET | 0.3827 | 0.7046 | No misspecification | Model appears linear |
| Rainbow | 0.428 | 0.7873 | Linear | Model specification appears adequate |
| Harvey-Collier | 0.2913 | 0.7853 | Correctly specified | No evidence of specification error |
Example 3: Demo case 3
Inputs:
| y | x | fit_intercept |
|---|---|---|
| 2.496 | 1 | false |
| 4.178 | 2 | |
| 6.037 | 3 | |
| 8.435 | 4 | |
| 9.625 | 5 | |
| 11.63 | 6 | |
| 14.84 | 7 | |
| 15.84 | 8 | |
| 17.25 | 9 | |
| 19.87 | 10 | |
| 21.86 | 11 | |
| 23.63 | 12 | |
Excel formula:
=SPECIFICATION_TESTS({2.496;4.178;6.037;8.435;9.625;11.63;14.84;15.84;17.25;19.87;21.86;23.63}, {1;2;3;4;5;6;7;8;9;10;11;12}, FALSE)
Expected output:
| test_name | statistic | p_value | conclusion | interpretation |
|---|---|---|---|---|
| RESET | 0.6662 | 0.5372 | No misspecification | Model appears linear |
| Rainbow | 0.1716 | 0.9734 | Linear | Model specification appears adequate |
| Harvey-Collier | -1.037 | 0.33 | Correctly specified | No evidence of specification error |
Example 4: Demo case 4
Inputs:
| y | x | fit_intercept |
|---|---|---|
| 2.148 | 1 | true |
| 3.889 | 2 | |
| 5.518 | 3 | |
| 7.761 | 4 | |
| 9.883 | 5 | |
| 11.88 | 6 | |
| 14.79 | 7 | |
| 16.38 | 8 | |
| 17.77 | 9 | |
| 20.27 | 10 | |
Excel formula:
=SPECIFICATION_TESTS({2.148;3.889;5.518;7.761;9.883;11.88;14.79;16.38;17.77;20.27}, {1;2;3;4;5;6;7;8;9;10}, TRUE)
Expected output:
| test_name | statistic | p_value | conclusion | interpretation |
|---|---|---|---|---|
| RESET | 1.589 | 0.2794 | No misspecification | Model appears linear |
| Rainbow | 1.539 | 0.3836 | Linear | Model specification appears adequate |
| Harvey-Collier | 1.069 | 0.3263 | Correctly specified | No evidence of specification error |
Python Code
import numpy as np
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.diagnostic import linear_reset, linear_rainbow, linear_harvey_collier
def specification_tests(y, x, fit_intercept=True):
"""
Performs regression specification tests to detect model misspecification.
See: https://www.statsmodels.org/stable/stats.html#residual-diagnostics-and-specification-tests
This example function is provided as-is without any representation of accuracy.
Args:
y (list[list]): Dependent variable as a column vector (Nx1 matrix)
x (list[list]): Independent variables (predictors) as a matrix where each column is a predictor
fit_intercept (bool, optional): Whether to add an intercept term to the model Default is True.
Returns:
list[list]: 2D list with specification tests, or error message string.
"""
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    def validate_2d_numeric(arr, name):
        arr_2d = to2d(arr)
        if not isinstance(arr_2d, list):
            return f"Invalid input: {name} must be a 2D list."
        if not arr_2d:
            return f"Invalid input: {name} cannot be empty."
        for i, row in enumerate(arr_2d):
            if not isinstance(row, list):
                return f"Invalid input: {name} row {i} must be a list."
            if not row:
                return f"Invalid input: {name} row {i} cannot be empty."
            for j, val in enumerate(row):
                if not isinstance(val, (int, float)):
                    return f"Invalid input: {name}[{i}][{j}] must be a number."
                if np.isnan(val) or np.isinf(val):
                    return f"Invalid input: {name}[{i}][{j}] must be finite."
        return arr_2d

    # Validate inputs
    y_2d = validate_2d_numeric(y, "y")
    if isinstance(y_2d, str):
        return y_2d
    x_2d = validate_2d_numeric(x, "x")
    if isinstance(x_2d, str):
        return x_2d
    if not isinstance(fit_intercept, bool):
        return "Invalid input: fit_intercept must be a boolean."

    # Convert to numpy arrays
    try:
        y_array = np.array(y_2d, dtype=float)
        x_array = np.array(x_2d, dtype=float)
    except Exception as e:
        return f"Invalid input: unable to convert inputs to arrays: {e}"

    # Ensure y is a column vector
    if y_array.ndim != 2 or y_array.shape[1] != 1:
        return "Invalid input: y must be a column vector (2D list with one column)."

    n_obs = y_array.shape[0]
    if x_array.shape[0] != n_obs:
        return "Invalid input: y and x must have the same number of rows."
    if n_obs < 3:
        return "Invalid input: need at least 3 observations."

    # Flatten y to 1D for statsmodels
    y_flat = y_array.flatten()

    # Add intercept if requested
    if fit_intercept:
        x_with_const = np.column_stack([np.ones(n_obs), x_array])
    else:
        x_with_const = x_array

    if x_with_const.shape[1] >= n_obs:
        return "Invalid input: number of parameters must be less than number of observations."

    # Fit OLS model
    try:
        model = OLS(y_flat, x_with_const)
        results = model.fit()
    except Exception as e:
        return f"Error fitting OLS model: {e}"

    # Initialize output
    output = [
        ['test_name', 'statistic', 'p_value', 'conclusion', 'interpretation']
    ]

    # RESET test (tests for omitted nonlinear terms)
    try:
        reset_result = linear_reset(results, use_f=True)
        reset_stat = float(reset_result.fvalue)
        reset_p = float(reset_result.pvalue)
        if np.isnan(reset_stat) or np.isnan(reset_p):
            output.append(['RESET', 'N/A', 'N/A', 'Unable to compute', 'Insufficient variation in data'])
        else:
            reset_conclusion = "No misspecification" if reset_p > 0.05 else "Misspecification detected"
            reset_interp = "Model appears linear" if reset_p > 0.05 else "Model may be missing nonlinear terms"
            output.append(['RESET', reset_stat, reset_p, reset_conclusion, reset_interp])
    except Exception as e:
        output.append(['RESET', 'Error', 'Error', str(e), 'Unable to compute'])

    # Rainbow test (tests for linearity)
    try:
        rainbow_stat, rainbow_p = linear_rainbow(results)
        rainbow_stat = float(rainbow_stat)
        rainbow_p = float(rainbow_p)
        if np.isnan(rainbow_stat) or np.isnan(rainbow_p):
            output.append(['Rainbow', 'N/A', 'N/A', 'Unable to compute', 'Insufficient variation in data'])
        else:
            rainbow_conclusion = "Linear" if rainbow_p > 0.05 else "Non-linear"
            rainbow_interp = "Model specification appears adequate" if rainbow_p > 0.05 else "Model may be non-linear"
            output.append(['Rainbow', rainbow_stat, rainbow_p, rainbow_conclusion, rainbow_interp])
    except Exception as e:
        output.append(['Rainbow', 'Error', 'Error', str(e), 'Unable to compute'])

    # Harvey-Collier test (tests for specification errors)
    try:
        hc_result = linear_harvey_collier(results)
        hc_stat = float(hc_result.statistic)
        hc_p = float(hc_result.pvalue)
        if np.isnan(hc_stat) or np.isnan(hc_p):
            output.append(['Harvey-Collier', 'N/A', 'N/A', 'Unable to compute', 'Insufficient variation in data'])
        else:
            hc_conclusion = "Correctly specified" if hc_p > 0.05 else "Specification error"
            hc_interp = "No evidence of specification error" if hc_p > 0.05 else "Model may have specification errors"
            output.append(['Harvey-Collier', hc_stat, hc_p, hc_conclusion, hc_interp])
    except Exception as e:
        output.append(['Harvey-Collier', 'Error', 'Error', str(e), 'Unable to compute'])

    return output
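For a quick check from Python, the function can be called directly with the data from Example 1. The driver below is illustrative only (the `__main__` guard is not part of the function); the printed rows should approximately match the expected output table in Example 1, subject to floating-point rounding:

```python
if __name__ == "__main__":
    # Data from Example 1: y against a single predictor x = 1..12
    y = [[3.397], [4.889], [7.518], [10.22], [10.81], [12.81],
         [16.26], [17.61], [18.62], [21.43], [22.63], [24.63]]
    x = [[i] for i in range(1, 13)]

    for row in specification_tests(y, x):
        print(row)
```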