PCA_ANALYSIS

Overview

The PCA_ANALYSIS function performs Principal Component Analysis (PCA), a linear dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible. PCA was invented by Karl Pearson in 1901 and independently developed by Harold Hotelling in the 1930s. It is widely used in exploratory data analysis, visualization, feature extraction, and data preprocessing across fields including finance, genetics, image processing, and machine learning.

This implementation uses the statsmodels library’s PCA class, which supports multiple decomposition methods: Singular Value Decomposition (SVD), eigenvalue decomposition, and the NIPALS algorithm. The function returns three key outputs: loadings (the relationship between original variables and principal components), variance explained (the proportion of total variance captured by each component), and scores (the transformed data in the principal component space).
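As a rough sketch of the underlying statsmodels call (assuming statsmodels is installed; attribute names follow its PCA class documentation, and the toy data matrix is arbitrary):

```python
import numpy as np
from statsmodels.multivariate.pca import PCA

# 5 observations x 2 variables (illustrative data only)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

pc = PCA(X, ncomp=2, standardize=True, method='svd')

loadings = pc.loadings                            # variables x components
eigenvals = pc.eigenvals                          # variance along each component
scores = pc.factors                               # observations x components
var_explained = eigenvals / eigenvals.sum()       # proportion of total variance
```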

PCA works by finding orthogonal directions (principal components) that maximize variance in the data. Given a centered data matrix X, PCA computes the eigenvectors of the covariance matrix:

C = XᵀX / (n − 1)

The eigenvectors form the principal component directions, and the corresponding eigenvalues represent the variance explained by each component. The first principal component captures the direction of maximum variance, the second captures the maximum remaining variance orthogonal to the first, and so on.
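The procedure above can be sketched in a few lines of NumPy (a minimal illustration of the math, not the statsmodels implementation; the toy matrix is arbitrary):

```python
import numpy as np

# 5 observations x 2 variables (illustrative data only)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

Xc = X - X.mean(axis=0)                # center each column
C = Xc.T @ Xc / (len(X) - 1)           # covariance matrix C = X^T X / (n - 1)

eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending order for symmetric C
order = np.argsort(eigvals)[::-1]      # reorder so PC1 has the largest variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                  # project data onto the component directions
```

Because the eigenvectors diagonalize C, the resulting score columns are uncorrelated, and the eigenvalues sum to the total variance (the trace of C).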

When the standardize parameter is set to True (the default), variables are scaled to have zero mean and unit variance before analysis, which is equivalent to performing PCA on the correlation matrix rather than the covariance matrix. This is recommended when variables are measured on different scales.
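A quick numerical check of this equivalence (a NumPy sketch; ddof=1 matches the n − 1 denominator above, though the correlation matrix itself is scale-invariant):

```python
import numpy as np

# Two variables on very different scales (illustrative data only)
X = np.array([[1.0, 50.0], [2.0, 40.0], [3.0, 70.0], [4.0, 60.0], [5.0, 90.0]])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize: mean 0, variance 1
cov_of_Z = Z.T @ Z / (len(X) - 1)                 # covariance of standardized data
corr_of_X = np.corrcoef(X, rowvar=False)          # correlation matrix of raw data

# The two matrices coincide, so PCA on standardized data equals
# PCA on the correlation matrix of the original data.
```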

This example function is provided as-is without any representation of accuracy.

Excel Usage

=PCA_ANALYSIS(data, n_components, standardize, pca_analysis_method)
  • data (list[list], required): A matrix where rows represent observations and columns represent variables.
  • n_components (int, optional, default: null): Number of principal components to extract. If None, extracts all components.
  • standardize (bool, optional, default: true): If True, standardizes variables to mean 0 and variance 1.
  • pca_analysis_method (str, optional, default: "svd"): The decomposition method to use: "svd", "eig", or "nipals".

Returns (list[list]): 2D list with PCA components, or error message string.

Examples

Example 1: Demo case 1

Inputs:

data
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3
2.3 2.7
2 1.6
1 1.1
1.5 1.6
1.1 0.9

Excel formula:

=PCA_ANALYSIS({2.5,2.4;0.5,0.7;2.2,2.9;1.9,2.2;3.1,3;2.3,2.7;2,1.6;1,1.1;1.5,1.6;1.1,0.9})

Expected output:

"non-error"

Example 2: Demo case 2

Inputs:

data
1 2 3
2 3 4
3 4 5
4 5 6

n_components
2

Excel formula:

=PCA_ANALYSIS({1,2,3;2,3,4;3,4,5;4,5,6}, 2)

Expected output:

"non-error"

Example 3: Demo case 3

Inputs:

data
1 10
2 20
3 30
4 40
5 50

standardize
false

Excel formula:

=PCA_ANALYSIS({1,10;2,20;3,30;4,40;5,50}, , FALSE)

Expected output:

"non-error"

Example 4: Demo case 4

Inputs:

data
5 3 2
3 6 4
2 4 7
4 5 5

n_components
2

standardize
true

pca_analysis_method
eig

Excel formula:

=PCA_ANALYSIS({5,3,2;3,6,4;2,4,7;4,5,5}, 2, TRUE, "eig")

Expected output:

"non-error"

Python Code

import math
from statsmodels.multivariate.pca import PCA as statsmodels_PCA

def pca_analysis(data, n_components=None, standardize=True, pca_analysis_method='svd'):
    """
    Performs Principal Component Analysis (PCA) for dimensionality reduction.

    See: https://www.statsmodels.org/stable/generated/statsmodels.multivariate.pca.PCA.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        data (list[list]): A matrix where rows represent observations and columns represent variables.
        n_components (int, optional): Number of principal components to extract. If None, extracts all components. Default is None.
        standardize (bool, optional): If True, standardizes variables to mean 0 and variance 1. Default is True.
        pca_analysis_method (str, optional): The decomposition method to use. Valid options: 'svd' (singular value decomposition), 'eig' (eigenvalue decomposition), or 'nipals'. Default is 'svd'.

    Returns:
        list[list]: 2D list with PCA components, or error message string.
    """
    def to2d(x):
        return [[x]] if not isinstance(x, list) else x

    # Normalize data to 2D list
    data_2d = to2d(data)

    # Validate data is a proper 2D list with at least 2 rows and 2 columns
    if not isinstance(data_2d, list) or len(data_2d) < 2:
        return "Invalid input: data must be a 2D list with at least 2 rows."

    if not all(isinstance(row, list) for row in data_2d):
        return "Invalid input: data must be a 2D list."

    row_lengths = [len(row) for row in data_2d]
    if len(set(row_lengths)) > 1:
        return "Invalid input: all rows in data must have the same length."

    if row_lengths[0] < 2:
        return "Invalid input: data must have at least 2 columns."

    # Convert all elements to float and validate
    try:
        numeric_data = []
        for i, row in enumerate(data_2d):
            numeric_row = []
            for j, val in enumerate(row):
                try:
                    num_val = float(val)
                except (ValueError, TypeError):
                    return f"Invalid input: data[{i}][{j}] is not a valid number."
                if math.isnan(num_val) or math.isinf(num_val):
                    return f"Invalid input: data[{i}][{j}] must be finite."
                numeric_row.append(num_val)
            numeric_data.append(numeric_row)
    except Exception as e:
        return f"Invalid input: unable to convert data to numeric values: {e}"

    # Validate standardize
    if not isinstance(standardize, bool):
        return "Invalid input: standardize must be a boolean."

    # Validate pca_analysis_method
    valid_methods = ['svd', 'eig', 'nipals']
    if not isinstance(pca_analysis_method, str):
        return "Invalid input: pca_analysis_method must be a string."
    if pca_analysis_method not in valid_methods:
        return f"Invalid input: pca_analysis_method must be one of {valid_methods}."

    # Validate n_components
    n_vars = len(numeric_data[0])
    n_obs = len(numeric_data)
    max_components = min(n_obs, n_vars)

    if n_components is None:
        n_comp_to_use = None
    else:
        try:
            n_comp_to_use = int(n_components)
        except (ValueError, TypeError):
            return "Invalid input: n_components must be an integer or None."

        if n_comp_to_use < 1:
            return "Invalid input: n_components must be at least 1."
        if n_comp_to_use > max_components:
            return f"Invalid input: n_components cannot exceed {max_components} (min of observations and variables)."

    # Perform PCA
    try:
        pca_model = statsmodels_PCA(
            numeric_data,
            ncomp=n_comp_to_use,
            standardize=standardize,
            method=pca_analysis_method
        )
    except Exception as e:
        return f"statsmodels.multivariate.pca.PCA error: {e}"

    # Extract results
    try:
        # Get loadings (variables x components)
        loadings = pca_model.loadings.tolist()

        # Get variance explained information
        eigenvalues = pca_model.eigenvals.tolist()
        variance_explained = (pca_model.eigenvals / pca_model.eigenvals.sum()).tolist()
        cumulative_variance = []
        cumsum = 0.0
        for var_exp in variance_explained:
            cumsum += var_exp
            cumulative_variance.append(cumsum)

        # Get scores (observations x components)
        scores = pca_model.scores.tolist()

    except Exception as e:
        return f"Error extracting PCA results: {e}"

    # Build output as 2D list with three sections
    # First, determine the maximum width
    # Loadings/Scores sections have n_comps + 1 columns
    # Variance section has 4 columns
    n_comps = len(loadings[0])
    max_width = max(n_comps + 1, 4)

    result = []

    # Section 1: Component loadings with header
    loadings_header = ['Loadings'] + [f'PC{i+1}' for i in range(n_comps)]
    loadings_header += [""] * (max_width - len(loadings_header))
    result.append(loadings_header)

    for i, loading_row in enumerate(loadings):
        row = [f'Var{i+1}'] + loading_row
        row += [""] * (max_width - len(row))
        result.append(row)

    # Add blank row separator (padded)
    result.append([""] * max_width)

    # Section 2: Variance explained with header
    variance_header = ['Component', 'Eigenvalue', 'Variance_Explained', 'Cumulative_Variance']
    variance_header += [""] * (max_width - len(variance_header))
    result.append(variance_header)

    # For the svd/eig methods, statsmodels reports eigenvalues for every
    # component even when ncomp limits the loadings/scores, so this table may
    # list more rows than there are PC columns above. List every entry; the
    # section's fixed 4 columns still fit within max_width.

    for i in range(len(eigenvalues)):
        # If we have more eigenvalues than components (scores/loadings), we still list them
        # The columns are fixed at 4 for this section.
        variance_row = [
            f'PC{i+1}',
            eigenvalues[i],
            variance_explained[i],
            cumulative_variance[i]
        ]
        variance_row += [""] * (max_width - len(variance_row))
        result.append(variance_row)

    # Add blank row separator (padded)
    result.append([""] * max_width)

    # Section 3: Component scores with header
    scores_header = ['Scores'] + [f'PC{i+1}' for i in range(n_comps)]
    scores_header += [""] * (max_width - len(scores_header))
    result.append(scores_header)

    for i, score_row in enumerate(scores):
        row = [f'Obs{i+1}'] + score_row
        row += [""] * (max_width - len(row))
        result.append(row)

    return result

Online Calculator