OLS_REGRESSION

Overview

The OLS_REGRESSION function fits an Ordinary Least Squares (OLS) regression model to estimate the relationship between a dependent variable and one or more independent variables (predictors). OLS is the most widely used estimation method for linear regression and is foundational to econometrics, statistics, and predictive modeling.

OLS works by finding the coefficient values that minimize the sum of squared residuals (the differences between observed and predicted values). Given a linear model of the form:

y = X\beta + \varepsilon

where y is the vector of observed responses, X is the matrix of predictors, \beta is the vector of unknown coefficients, and \varepsilon represents random errors, the OLS estimator is computed as:

\hat{\beta} = (X^T X)^{-1} X^T y

This closed-form solution yields the best linear unbiased estimator (BLUE) under the Gauss-Markov assumptions: linearity in the parameters, exogeneity of the predictors, spherical errors (constant variance and no serial correlation), and no perfect multicollinearity among predictors. For more theoretical background, see the Wikipedia article on Ordinary Least Squares.
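
The closed-form estimator can be sketched in a few lines of NumPy. This is an illustration only (using the data from Example 1 below), not the production code path: statsmodels solves the least-squares problem with a numerically stabler pseudoinverse rather than an explicit matrix inverse.

```python
import numpy as np

# Solve the normal equations (X^T X) beta = X^T y directly,
# with an intercept column of ones prepended by hand.
X = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])
y = np.array([1.2, 2.1, 2.9, 4.2, 4.8])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# beta_hat[0] is the intercept, beta_hat[1] the slope
# -> approximately [0.25, 0.93] for this data
```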

This implementation uses the statsmodels Python library, which provides comprehensive statistical modeling capabilities. The statsmodels OLS documentation describes the underlying implementation, and the source code is available on GitHub.

The function returns a structured table containing coefficient estimates, standard errors, t-statistics, p-values, and confidence intervals for each parameter. It also provides key model diagnostics including:

  • R-squared and Adjusted R-squared: measures of how well the model explains variance in the response
  • F-statistic and its p-value: tests whether the model has overall explanatory power
  • AIC/BIC: information criteria for model comparison
  • Condition number: indicates potential multicollinearity issues

When fit_intercept is set to True, a constant term is automatically added to the model. The alpha parameter controls the significance level for confidence intervals (default 0.05 for 95% confidence).

This example function is provided as-is without any representation of accuracy.

Excel Usage

=OLS_REGRESSION(y, x, fit_intercept, alpha)
  • y (list[list], required): Dependent variable as a column vector (N rows x 1 column) of float values.
  • x (list[list], required): Independent variables (predictors) as a matrix (N rows x P columns) of float values.
  • fit_intercept (bool, optional, default: true): If true, adds an intercept (constant) term to the model.
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals, between 0 and 1.

Returns (list[list]): 2D list with OLS results, or error message string.

Examples

Example 1: Demo case 1

Inputs:

y x
1.2 1
2.1 2
2.9 3
4.2 4
4.8 5

Excel formula:

=OLS_REGRESSION({1.2;2.1;2.9;4.2;4.8}, {1;2;3;4;5})

Expected output:

"non-error"

Example 2: Demo case 2

Inputs:

y x1 x2
2.5 1 0.5
3.7 2 1
4.2 3 1.5
5.8 4 2
6.1 5 2.5

Excel formula:

=OLS_REGRESSION({2.5;3.7;4.2;5.8;6.1}, {1,0.5;2,1;3,1.5;4,2;5,2.5})

Expected output:

"non-error"

Example 3: Demo case 3

Inputs:

y x fit_intercept
2.1 1 false
3.9 2
6.2 3
7.8 4
10.1 5

Excel formula:

=OLS_REGRESSION({2.1;3.9;6.2;7.8;10.1}, {1;2;3;4;5}, FALSE)

Expected output:

"non-error"

Example 4: Demo case 4

Inputs:

y x1 x2 alpha
2.5 1 0.5 0.1
3.7 2 1
4.2 3 1.5
5.8 4 2
6.1 5 2.5

Excel formula:

=OLS_REGRESSION({2.5;3.7;4.2;5.8;6.1}, {1,0.5;2,1;3,1.5;4,2;5,2.5}, 0.1)

Expected output:

"non-error"

Python Code

import math
import pandas as pd
import statsmodels.api as sm

def ols_regression(y, x, fit_intercept=True, alpha=0.05):
    """
    Fits an Ordinary Least Squares (OLS) regression model.

    See: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector (N rows x 1 column) of float values.
        x (list[list]): Independent variables (predictors) as a matrix (N rows x P columns) of float values.
        fit_intercept (bool, optional): If true, adds an intercept (constant) term to the model. Default is True.
        alpha (float, optional): Significance level for confidence intervals, between 0 and 1. Default is 0.05.

    Returns:
        list[list]: 2D list with OLS results, or error message string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    def validate_numeric_2d(arr, name):
        if not isinstance(arr, list):
            return f"Invalid input: {name} must be a 2D list."
        if not arr:
            return f"Invalid input: {name} cannot be empty."
        for i, row in enumerate(arr):
            if not isinstance(row, list):
                return f"Invalid input: {name} must be a 2D list."
            if not row:
                return f"Invalid input: {name} cannot have empty rows."
            for j, val in enumerate(row):
                if not isinstance(val, (int, float)):
                    return f"Invalid input: {name}[{i}][{j}] must be numeric."
                if math.isnan(val) or math.isinf(val):
                    return f"Invalid input: {name}[{i}][{j}] must be finite."
        return None

    def validate_scalar(val, name, val_type, check_range=None):
        if not isinstance(val, val_type):
            type_name = val_type.__name__ if hasattr(val_type, '__name__') else str(val_type)
            return f"Invalid input: {name} must be {type_name}."
        # Check if val_type includes numeric types
        is_numeric = (val_type in (int, float)) or (isinstance(val_type, tuple) and any(t in (int, float) for t in val_type))
        if is_numeric and isinstance(val, (int, float)):
            if math.isnan(val) or math.isinf(val):
                return f"Invalid input: {name} must be finite."
            if check_range and not (check_range[0] <= val <= check_range[1]):
                return f"Invalid input: {name} must be in range {check_range}."
        return None

    # Normalize inputs
    y = to2d(y)
    x = to2d(x)

    # Validate inputs
    err = validate_numeric_2d(y, "y")
    if err:
        return err
    err = validate_numeric_2d(x, "x")
    if err:
        return err
    err = validate_scalar(fit_intercept, "fit_intercept", bool)
    if err:
        return err
    err = validate_scalar(alpha, "alpha", (int, float), (0.0, 1.0))
    if err:
        return err

    # Check dimensions
    n_obs_y = len(y)
    n_cols_y = len(y[0])
    if n_cols_y != 1:
        return "Invalid input: y must be a column vector (single column)."

    # Check consistent column length in y
    for i, row in enumerate(y):
        if len(row) != n_cols_y:
            return f"Invalid input: y row {i} has inconsistent number of columns."

    n_obs_x = len(x)
    n_predictors = len(x[0])

    # Check consistent column length in x
    for i, row in enumerate(x):
        if len(row) != n_predictors:
            return f"Invalid input: x row {i} has inconsistent number of columns."

    if n_obs_y != n_obs_x:
        return f"Invalid input: y and x must have the same number of observations ({n_obs_y} vs {n_obs_x})."

    # Check sufficient observations
    min_obs = n_predictors + (1 if fit_intercept else 0)
    if n_obs_y <= min_obs:
        return f"Invalid input: insufficient observations ({n_obs_y}) for {min_obs} parameters."

    # Convert to pandas for proper parameter names
    y_series = pd.Series([row[0] for row in y])

    # Create column names for X
    x_col_names = [f'x{i+1}' for i in range(n_predictors)]
    x_df = pd.DataFrame(x, columns=x_col_names)

    # Prepare X matrix
    try:
        if fit_intercept:
            x_with_const = sm.add_constant(x_df, has_constant='add')
        else:
            x_with_const = x_df

        # Fit OLS model
        model = sm.OLS(y_series, x_with_const)
        results = model.fit()

        # Get confidence intervals
        conf_int = results.conf_int(alpha=alpha)

        # Build output table
        output = []

        # Header row
        output.append(['parameter', 'coefficient', 'std_error', 't_statistic', 'p_value', 'ci_lower', 'ci_upper'])

        # Parameter rows
        param_names = results.params.index.tolist()
        for i, param_name in enumerate(param_names):
            row = [
                param_name,
                float(results.params.iloc[i]),
                float(results.bse.iloc[i]),
                float(results.tvalues.iloc[i]),
                float(results.pvalues.iloc[i]),
                float(conf_int.iloc[i, 0]),
                float(conf_int.iloc[i, 1])
            ]
            output.append(row)

        # Model statistics rows (pad to 7 columns for consistency with header)
        output.append(['r_squared', float(results.rsquared), None, None, None, None, None])
        output.append(['adj_r_squared', float(results.rsquared_adj), None, None, None, None, None])
        output.append(['f_statistic', float(results.fvalue), None, None, None, None, None])
        output.append(['f_pvalue', float(results.f_pvalue), None, None, None, None, None])
        output.append(['aic', float(results.aic), None, None, None, None, None])
        output.append(['bic', float(results.bic), None, None, None, None, None])
        output.append(['log_likelihood', float(results.llf), None, None, None, None, None])
        output.append(['condition_number', float(results.condition_number), None, None, None, None, None])

        return output

    except Exception as e:
        return f"statsmodels.api.OLS error: {e}"
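
The to2d helper in the listing above wraps a bare scalar in a 2D list, so a single Excel cell and a spilled range flow through the same validation path. A standalone illustration:

```python
# Standalone copy of the scalar-to-2D normalization used by ols_regression.
def to2d(val):
    return [[val]] if not isinstance(val, list) else val

print(to2d(3.5))         # [[3.5]] -- a single cell becomes a 1x1 matrix
print(to2d([[1], [2]]))  # [[1], [2]] -- lists pass through unchanged
```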
