ROBUST_LINEAR_MODEL

Overview

The ROBUST_LINEAR_MODEL function fits a robust linear regression model using M-estimators, which are designed to be less sensitive to outliers than ordinary least squares (OLS) regression. This approach is particularly valuable when working with real-world data that may contain anomalous observations that would otherwise disproportionately influence regression coefficients.

The function implements the Robust Linear Model (RLM) class from the statsmodels library. For detailed documentation, see the RLM class reference and the Robust Linear Models guide. The statsmodels source code is available on GitHub.

M-estimators generalize maximum likelihood estimation by replacing the squared residuals used in OLS with a robust loss function \rho. The objective is to minimize:

\sum_{i=1}^{n} \rho\left(\frac{r_i}{s}\right)

where r_i are the residuals and s is a robust scale estimate. The model is fitted using Iteratively Reweighted Least Squares (IRLS), which assigns weights to observations based on their residuals—downweighting outliers while preserving the influence of typical observations.

This implementation supports four M-estimator norm functions:

  • Huber (default): Combines squared loss for small residuals with linear loss for large residuals, providing a balance between efficiency and robustness. See HuberT norm.
  • Bisquare (Tukey): Completely rejects extreme outliers by assigning zero weight beyond a threshold. See TukeyBiweight norm.
  • Andrew (Andrew’s Wave): Uses a sinusoidal weighting function that smoothly downweights outliers. See AndrewWave norm.
  • Hampel: A three-part redescending function that provides a smooth transition from full weight to zero. See Hampel norm.

The function returns coefficient estimates, standard errors, t-statistics, p-values, and 95% confidence intervals for each predictor, along with a robust scale estimate. The theoretical foundations of robust regression are detailed in Huber’s seminal work Robust Statistics (John Wiley and Sons, 1981).

This example function is provided as-is without any representation of accuracy.

Excel Usage

=ROBUST_LINEAR_MODEL(y, x, rlm_m_estimator, fit_intercept)
  • y (list[list], required): Dependent variable as a column vector (n_obs, 1).
  • x (list[list], required): Independent variables matrix with dimensions (n_obs, n_predictors). Each column is a predictor.
  • rlm_m_estimator (str, optional, default: “huber”): M-estimator norm function to use for robust regression.
  • fit_intercept (bool, optional, default: true): If True, adds an intercept term to the model.

Returns (list[list]): 2D list with robust regression results, or error string.

Examples

Example 1: Basic linear regression with default Huber estimator

Inputs:

y x
1 1
2 2
3 3
4 4
5 5

Excel formula:

=ROBUST_LINEAR_MODEL({1;2;3;4;5}, {1;2;3;4;5})

Expected output:

"non-error"

Example 2: Regression with outlier using Huber estimator

Inputs:

y x rlm_m_estimator
1 1 huber
2 2
3 3
4 4
100 5

Excel formula:

=ROBUST_LINEAR_MODEL({1;2;3;4;100}, {1;2;3;4;5}, "huber")

Expected output:

"non-error"

Example 3: Multiple predictors with Bisquare estimator

Inputs:

y x rlm_m_estimator fit_intercept
2 1 0.5 bisquare true
4 2 1.2
6 3 1.8
8 4 2.5
10 5 3.1
12 6 3.9

Excel formula:

=ROBUST_LINEAR_MODEL({2;4;6;8;10;12}, {1,0.5;2,1.2;3,1.8;4,2.5;5,3.1;6,3.9}, "bisquare", TRUE)

Expected output:

"non-error"

Example 4: No intercept model with Hampel estimator

Inputs:

y x rlm_m_estimator fit_intercept
3.1 1 hampel false
5.9 2
9.2 3
11.8 4
15.1 5
17.9 6

Excel formula:

=ROBUST_LINEAR_MODEL({3.1;5.9;9.2;11.8;15.1;17.9}, {1;2;3;4;5;6}, "hampel", FALSE)

Expected output:

"non-error"

Python Code

import math
import warnings
from statsmodels.robust.robust_linear_model import RLM as statsmodels_RLM
from statsmodels.robust import norms as statsmodels_norms
import numpy as np

def robust_linear_model(y, x, rlm_m_estimator='huber', fit_intercept=True):
    """
    Fits a robust linear regression model using M-estimators.

    See: https://www.statsmodels.org/stable/generated/statsmodels.robust.robust_linear_model.RLM.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector (n_obs, 1).
        x (list[list]): Independent variables matrix with dimensions (n_obs, n_predictors). Each column is a predictor.
        rlm_m_estimator (str, optional): M-estimator norm function to use for robust regression. Valid options: Huber, Andrew, Bisquare, Hampel. Default is 'huber'.
        fit_intercept (bool, optional): If True, adds an intercept term to the model. Default is True.

    Returns:
        list[list]: 2D list with robust regression results, or error string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    def validate_2d_array(arr, name):
        if not isinstance(arr, list):
            return f"Invalid input: {name} must be a 2D list."
        if len(arr) == 0:
            return f"Invalid input: {name} must not be empty."
        for i, row in enumerate(arr):
            if not isinstance(row, list):
                return f"Invalid input: {name} row {i} must be a list."
            if len(row) == 0:
                return f"Invalid input: {name} row {i} must not be empty."
            for j, val in enumerate(row):
                if not isinstance(val, (int, float)):
                    return f"Invalid input: {name}[{i}][{j}] must be a number."
                if math.isnan(val) or math.isinf(val):
                    return f"Invalid input: {name}[{i}][{j}] must be finite."
        return None

    # Normalize inputs to 2D lists
    y = to2d(y)
    x = to2d(x)

    # Validate inputs
    y_error = validate_2d_array(y, "y")
    if y_error:
        return y_error

    x_error = validate_2d_array(x, "x")
    if x_error:
        return x_error

    # Check y is a column vector
    if len(y[0]) != 1:
        return "Invalid input: y must be a column vector (single column)."

    # Check dimensions match
    n_obs_y = len(y)
    n_obs_x = len(x)
    if n_obs_y != n_obs_x:
        return f"Invalid input: y has {n_obs_y} observations but x has {n_obs_x} observations."

    # Check all rows in x have the same number of columns
    n_predictors = len(x[0])
    for i, row in enumerate(x):
        if len(row) != n_predictors:
            return f"Invalid input: x row {i} has {len(row)} columns but expected {n_predictors}."

    # Check we have enough observations
    min_obs = n_predictors + (1 if fit_intercept else 0)
    if n_obs_y <= min_obs:
        return f"Invalid input: need more than {min_obs} observations for {n_predictors} predictors."

    # Validate rlm_m_estimator
    valid_estimators = ['huber', 'andrew', 'bisquare', 'hampel']
    if not isinstance(rlm_m_estimator, str):
        return "Invalid input: rlm_m_estimator must be a string."
    if rlm_m_estimator.lower() not in valid_estimators:
        return f"Invalid input: rlm_m_estimator must be one of {valid_estimators}."

    # Validate fit_intercept
    if not isinstance(fit_intercept, bool):
        return "Invalid input: fit_intercept must be a boolean."

    # Convert to numpy arrays
    try:
        y_array = np.array([row[0] for row in y], dtype=float)
        x_array = np.array(x, dtype=float)
    except Exception as e:
        return f"Invalid input: could not convert to numpy arrays: {e}"

    # Add intercept column if needed
    if fit_intercept:
        x_array = np.column_stack([np.ones(n_obs_y), x_array])

    # Map rlm_m_estimator to norm class
    estimator_lower = rlm_m_estimator.lower()
    if estimator_lower == 'huber':
        norm = statsmodels_norms.HuberT()
    elif estimator_lower == 'andrew':
        norm = statsmodels_norms.AndrewWave()
    elif estimator_lower == 'bisquare':
        norm = statsmodels_norms.TukeyBiweight()
    elif estimator_lower == 'hampel':
        norm = statsmodels_norms.Hampel()

    # Fit the model (suppress warnings for numerical edge cases)
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            model = statsmodels_RLM(y_array, x_array, M=norm)
            results = model.fit()
    except Exception as e:
        return f"statsmodels.robust.robust_linear_model.RLM error: {e}"

    # Extract results
    try:
        params = results.params
        std_errors = results.bse
        t_stats = results.tvalues
        p_values = results.pvalues
        conf_int = results.conf_int()
        scale = results.scale

        # Build output
        output = [['parameter', 'coefficient', 'std_error', 't_statistic', 'p_value', 'ci_lower', 'ci_upper']]

        # Add parameter rows
        for i in range(len(params)):
            if fit_intercept and i == 0:
                param_name = 'intercept'
            else:
                predictor_idx = i if not fit_intercept else i - 1
                param_name = f'x{predictor_idx + 1}'

            output.append([
                param_name,
                float(params[i]),
                float(std_errors[i]),
                float(t_stats[i]),
                float(p_values[i]),
                float(conf_int[i, 0]),
                float(conf_int[i, 1])
            ])

        # Add scale estimate
        output.append(['scale', float(scale), None, None, None, None, None])

        # Validate output values
        for i, row in enumerate(output[1:], 1):  # Skip header row
            for j, val in enumerate(row):
                if val is not None and isinstance(val, float):
                    if math.isnan(val) or math.isinf(val):
                        return f"statsmodels.robust.robust_linear_model.RLM error: result contains non-finite value at row {i}, column {j}."

        return output

    except Exception as e:
        return f"statsmodels.robust.robust_linear_model.RLM error: {e}"

Online Calculator