OLS_REGRESSION

Overview

The OLS_REGRESSION function fits an Ordinary Least Squares (OLS) regression model to estimate the relationship between a dependent variable and one or more independent variables (predictors). OLS is the most widely used estimation method for linear regression and is foundational to econometrics, statistics, and predictive modeling.

OLS works by finding the coefficient values that minimize the sum of squared residuals (the differences between observed and predicted values). Given a linear model of the form:

y = X\beta + \varepsilon

where y is the vector of observed responses, X is the matrix of predictors, \beta is the vector of unknown coefficients, and \varepsilon represents random errors, the OLS estimator is computed as:

\hat{\beta} = (X^T X)^{-1} X^T y

This closed-form solution yields the best linear unbiased estimator (BLUE) under the Gauss-Markov assumptions: linearity in the parameters, exogeneity of the predictors, spherical errors (constant variance and no serial correlation), and no perfect multicollinearity among predictors. For more theoretical background, see the Wikipedia article on Ordinary Least Squares.
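
The closed-form estimator can be sketched in a few lines of NumPy. This is an illustration only (using the data from Example 1 below), not the production code path: statsmodels solves the least-squares problem with a numerically stabler pseudoinverse rather than an explicit matrix inverse.

```python
import numpy as np

# Solve the normal equations (X^T X) beta = X^T y directly,
# with an intercept column of ones prepended by hand.
X = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])
y = np.array([1.2, 2.1, 2.9, 4.2, 4.8])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# beta_hat[0] is the intercept, beta_hat[1] the slope
# -> approximately [0.25, 0.93] for this data
```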

This implementation uses the statsmodels Python library, which provides comprehensive statistical modeling capabilities. The statsmodels OLS documentation describes the underlying implementation, and the source code is available on GitHub.

The function returns a structured table containing coefficient estimates, standard errors, t-statistics, p-values, and confidence intervals for each parameter. It also provides key model diagnostics including:

  • R-squared and Adjusted R-squared: measures of how well the model explains variance in the response
  • F-statistic and its p-value: tests whether the model has overall explanatory power
  • AIC/BIC: information criteria for model comparison
  • Condition number: indicates potential multicollinearity issues

When fit_intercept is set to True, a constant term is automatically added to the model. The alpha parameter controls the significance level for confidence intervals (default 0.05 for 95% confidence).

This example function is provided as-is without any representation of accuracy.

Excel Usage

=OLS_REGRESSION(y, x, fit_intercept, alpha)
  • y (list[list], required): Dependent variable as a column vector (N rows x 1 column) of float values.
  • x (list[list], required): Independent variables (predictors) as a matrix (N rows x P columns) of float values.
  • fit_intercept (bool, optional, default: true): If true, adds an intercept (constant) term to the model.
  • alpha (float, optional, default: 0.05): Significance level for confidence intervals, between 0 and 1.

Returns (list[list]): 2D list with OLS results, or error message string.

Examples

Example 1: Demo case 1

Inputs:

y x
1.2 1
2.1 2
2.9 3
4.2 4
4.8 5

Excel formula:

=OLS_REGRESSION({1.2;2.1;2.9;4.2;4.8}, {1;2;3;4;5})

Expected output:

"non-error"

Example 2: Demo case 2

Inputs:

y x1 x2
2.5 1 0.5
3.7 2 1
4.2 3 1.5
5.8 4 2
6.1 5 2.5

Excel formula:

=OLS_REGRESSION({2.5;3.7;4.2;5.8;6.1}, {1,0.5;2,1;3,1.5;4,2;5,2.5})

Expected output:

"non-error"

Example 3: Demo case 3

Inputs:

y x fit_intercept
2.1 1 false
3.9 2
6.2 3
7.8 4
10.1 5

Excel formula:

=OLS_REGRESSION({2.1;3.9;6.2;7.8;10.1}, {1;2;3;4;5}, FALSE)

Expected output:

"non-error"

Example 4: Demo case 4

Inputs:

y x1 x2 alpha
2.5 1 0.5 0.1
3.7 2 1
4.2 3 1.5
5.8 4 2
6.1 5 2.5

Excel formula:

=OLS_REGRESSION({2.5;3.7;4.2;5.8;6.1}, {1,0.5;2,1;3,1.5;4,2;5,2.5}, 0.1)

Expected output:

"non-error"

Python Code

import math
import pandas as pd
import statsmodels.api as sm

def ols_regression(y, x, fit_intercept=True, alpha=0.05):
    """
    Fits an Ordinary Least Squares (OLS) regression model.

    See: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        y (list[list]): Dependent variable as a column vector (N rows x 1 column) of float values.
        x (list[list]): Independent variables (predictors) as a matrix (N rows x P columns) of float values.
        fit_intercept (bool, optional): If true, adds an intercept (constant) term to the model. Default is True.
        alpha (float, optional): Significance level for confidence intervals, between 0 and 1. Default is 0.05.

    Returns:
        list[list]: 2D list with OLS results, or error message string.
    """
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    def validate_numeric_2d(arr, name):
        if not isinstance(arr, list):
            return f"Invalid input: {name} must be a 2D list."
        if not arr:
            return f"Invalid input: {name} cannot be empty."
        for i, row in enumerate(arr):
            if not isinstance(row, list):
                return f"Invalid input: {name} must be a 2D list."
            if not row:
                return f"Invalid input: {name} cannot have empty rows."
            for j, val in enumerate(row):
                if not isinstance(val, (int, float)):
                    return f"Invalid input: {name}[{i}][{j}] must be numeric."
                if math.isnan(val) or math.isinf(val):
                    return f"Invalid input: {name}[{i}][{j}] must be finite."
        return None

    def validate_scalar(val, name, val_type, check_range=None):
        if not isinstance(val, val_type):
            type_name = val_type.__name__ if hasattr(val_type, '__name__') else str(val_type)
            return f"Invalid input: {name} must be {type_name}."
        # Check if val_type includes numeric types
        is_numeric = (val_type in (int, float)) or (isinstance(val_type, tuple) and any(t in (int, float) for t in val_type))
        if is_numeric and isinstance(val, (int, float)):
            if math.isnan(val) or math.isinf(val):
                return f"Invalid input: {name} must be finite."
            if check_range and not (check_range[0] <= val <= check_range[1]):
                return f"Invalid input: {name} must be in range {check_range}."
        return None

    # Normalize inputs
    y = to2d(y)
    x = to2d(x)

    # Validate inputs
    err = validate_numeric_2d(y, "y")
    if err:
        return err
    err = validate_numeric_2d(x, "x")
    if err:
        return err
    err = validate_scalar(fit_intercept, "fit_intercept", bool)
    if err:
        return err
    err = validate_scalar(alpha, "alpha", (int, float), (0.0, 1.0))
    if err:
        return err

    # Check dimensions
    n_obs_y = len(y)
    n_cols_y = len(y[0])
    if n_cols_y != 1:
        return "Invalid input: y must be a column vector (single column)."

    # Check consistent column length in y
    for i, row in enumerate(y):
        if len(row) != n_cols_y:
            return f"Invalid input: y row {i} has inconsistent number of columns."

    n_obs_x = len(x)
    n_predictors = len(x[0])

    # Check consistent column length in x
    for i, row in enumerate(x):
        if len(row) != n_predictors:
            return f"Invalid input: x row {i} has inconsistent number of columns."

    if n_obs_y != n_obs_x:
        return f"Invalid input: y and x must have the same number of observations ({n_obs_y} vs {n_obs_x})."

    # Check sufficient observations
    min_obs = n_predictors + (1 if fit_intercept else 0)
    if n_obs_y <= min_obs:
        return f"Invalid input: insufficient observations ({n_obs_y}) for {min_obs} parameters."

    # Convert to pandas for proper parameter names
    y_series = pd.Series([row[0] for row in y])

    # Create column names for X
    x_col_names = [f'x{i+1}' for i in range(n_predictors)]
    x_df = pd.DataFrame(x, columns=x_col_names)

    # Prepare X matrix
    try:
        if fit_intercept:
            x_with_const = sm.add_constant(x_df, has_constant='add')
        else:
            x_with_const = x_df

        # Fit OLS model
        model = sm.OLS(y_series, x_with_const)
        results = model.fit()

        # Get confidence intervals
        conf_int = results.conf_int(alpha=alpha)

        # Build output table
        output = []

        # Header row
        output.append(['parameter', 'coefficient', 'std_error', 't_statistic', 'p_value', 'ci_lower', 'ci_upper'])

        # Parameter rows
        param_names = results.params.index.tolist()
        for i, param_name in enumerate(param_names):
            row = [
                param_name,
                float(results.params.iloc[i]),
                float(results.bse.iloc[i]),
                float(results.tvalues.iloc[i]),
                float(results.pvalues.iloc[i]),
                float(conf_int.iloc[i, 0]),
                float(conf_int.iloc[i, 1])
            ]
            output.append(row)

        # Model statistics rows (pad to 7 columns for consistency with header)
        output.append(['r_squared', float(results.rsquared), None, None, None, None, None])
        output.append(['adj_r_squared', float(results.rsquared_adj), None, None, None, None, None])
        output.append(['f_statistic', float(results.fvalue), None, None, None, None, None])
        output.append(['f_pvalue', float(results.f_pvalue), None, None, None, None, None])
        output.append(['aic', float(results.aic), None, None, None, None, None])
        output.append(['bic', float(results.bic), None, None, None, None, None])
        output.append(['log_likelihood', float(results.llf), None, None, None, None, None])
        output.append(['condition_number', float(results.condition_number), None, None, None, None, None])

        return output

    except Exception as e:
        return f"statsmodels.api.OLS error: {e}"
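
The to2d helper in the listing above wraps a bare scalar in a 2D list, so a single Excel cell and a spilled range flow through the same validation path. A standalone illustration:

```python
# Standalone copy of the scalar-to-2D normalization used by ols_regression.
def to2d(val):
    return [[val]] if not isinstance(val, list) else val

print(to2d(3.5))         # [[3.5]] -- a single cell becomes a 1x1 matrix
print(to2d([[1], [2]]))  # [[1], [2]] -- lists pass through unchanged
```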
