PCA_ANALYSIS

Overview

The PCA_ANALYSIS function performs Principal Component Analysis (PCA), a linear dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible. PCA was invented by Karl Pearson in 1901 and independently developed by Harold Hotelling in the 1930s. It is widely used in exploratory data analysis, visualization, feature extraction, and data preprocessing across fields including finance, genetics, image processing, and machine learning.

This implementation uses the statsmodels library’s PCA class, which supports multiple decomposition methods: Singular Value Decomposition (SVD), eigenvalue decomposition, and the NIPALS algorithm. The function returns three key outputs: loadings (the relationship between original variables and principal components), variance explained (the proportion of total variance captured by each component), and scores (the transformed data in the principal component space).
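As a rough sketch of the underlying statsmodels call (assuming statsmodels is installed; attribute names follow its PCA class documentation, and the toy data matrix is arbitrary):

```python
import numpy as np
from statsmodels.multivariate.pca import PCA

# 5 observations x 2 variables (illustrative data only)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

pc = PCA(X, ncomp=2, standardize=True, method='svd')

loadings = pc.loadings                            # variables x components
eigenvals = pc.eigenvals                          # variance along each component
scores = pc.factors                               # observations x components
var_explained = eigenvals / eigenvals.sum()       # proportion of total variance
```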

PCA works by finding orthogonal directions (principal components) that maximize variance in the data. Given a centered data matrix X, PCA computes the eigenvectors of the covariance matrix:

C = XᵀX / (n − 1)

The eigenvectors form the principal component directions, and the corresponding eigenvalues represent the variance explained by each component. The first principal component captures the direction of maximum variance, the second captures the maximum remaining variance orthogonal to the first, and so on.
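The procedure above can be sketched in a few lines of NumPy (a minimal illustration of the math, not the statsmodels implementation; the toy matrix is arbitrary):

```python
import numpy as np

# 5 observations x 2 variables (illustrative data only)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

Xc = X - X.mean(axis=0)                # center each column
C = Xc.T @ Xc / (len(X) - 1)           # covariance matrix C = X^T X / (n - 1)

eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending order for symmetric C
order = np.argsort(eigvals)[::-1]      # reorder so PC1 has the largest variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                  # project data onto the component directions
```

Because the eigenvectors diagonalize C, the resulting score columns are uncorrelated, and the eigenvalues sum to the total variance (the trace of C).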

When the standardize parameter is set to True (the default), variables are scaled to have zero mean and unit variance before analysis, which is equivalent to performing PCA on the correlation matrix rather than the covariance matrix. This is recommended when variables are measured on different scales.
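A quick numerical check of this equivalence (a NumPy sketch; ddof=1 matches the n − 1 denominator above, though the correlation matrix itself is scale-invariant):

```python
import numpy as np

# Two variables on very different scales (illustrative data only)
X = np.array([[1.0, 50.0], [2.0, 40.0], [3.0, 70.0], [4.0, 60.0], [5.0, 90.0]])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize: mean 0, variance 1
cov_of_Z = Z.T @ Z / (len(X) - 1)                 # covariance of standardized data
corr_of_X = np.corrcoef(X, rowvar=False)          # correlation matrix of raw data

# The two matrices coincide, so PCA on standardized data equals
# PCA on the correlation matrix of the original data.
```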

This example function is provided as-is without any representation of accuracy.

Excel Usage

=PCA_ANALYSIS(data, n_components, standardize, pca_analysis_method)
  • data (list[list], required): A matrix where rows represent observations and columns represent variables.
  • n_components (int, optional, default: null): Number of principal components to extract. If None, extracts all components.
  • standardize (bool, optional, default: true): If True, standardizes variables to mean 0 and variance 1.
  • pca_analysis_method (str, optional, default: "svd"): The decomposition method to use: "svd", "eig", or "nipals".

Returns (list[list]): 2D list with PCA components, or error message string.

Examples

Example 1: Demo case 1

Inputs:

data
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3
2.3 2.7
2 1.6
1 1.1
1.5 1.6
1.1 0.9

Excel formula:

=PCA_ANALYSIS({2.5,2.4;0.5,0.7;2.2,2.9;1.9,2.2;3.1,3;2.3,2.7;2,1.6;1,1.1;1.5,1.6;1.1,0.9})

Expected output:

"non-error"

Example 2: Demo case 2

Inputs:

data
1 2 3
2 3 4
3 4 5
4 5 6

n_components
2

Excel formula:

=PCA_ANALYSIS({1,2,3;2,3,4;3,4,5;4,5,6}, 2)

Expected output:

"non-error"

Example 3: Demo case 3

Inputs:

data
1 10
2 20
3 30
4 40
5 50

standardize
false

Excel formula:

=PCA_ANALYSIS({1,10;2,20;3,30;4,40;5,50}, , FALSE)

Expected output:

"non-error"

Example 4: Demo case 4

Inputs:

data
5 3 2
3 6 4
2 4 7
4 5 5

n_components
2

standardize
true

pca_analysis_method
eig

Excel formula:

=PCA_ANALYSIS({5,3,2;3,6,4;2,4,7;4,5,5}, 2, TRUE, "eig")

Expected output:

"non-error"

Python Code

import math
from statsmodels.multivariate.pca import PCA as statsmodels_PCA

def pca_analysis(data, n_components=None, standardize=True, pca_analysis_method='svd'):
    """
    Performs Principal Component Analysis (PCA) for dimensionality reduction.

    See: https://www.statsmodels.org/stable/generated/statsmodels.multivariate.pca.PCA.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        data (list[list]): A matrix where rows represent observations and columns represent variables.
        n_components (int, optional): Number of principal components to extract. If None, extracts all components. Default is None.
        standardize (bool, optional): If True, standardizes variables to mean 0 and variance 1. Default is True.
        pca_analysis_method (str, optional): The decomposition method to use. Valid options: 'svd' (singular value decomposition), 'eig' (eigenvalue decomposition), or 'nipals'. Default is 'svd'.

    Returns:
        list[list]: 2D list with PCA components, or error message string.
    """
    def to2d(x):
        return [[x]] if not isinstance(x, list) else x

    # Normalize data to 2D list
    data_2d = to2d(data)

    # Validate data is a proper 2D list with at least 2 rows and 2 columns
    if not isinstance(data_2d, list) or len(data_2d) < 2:
        return "Invalid input: data must be a 2D list with at least 2 rows."

    if not all(isinstance(row, list) for row in data_2d):
        return "Invalid input: data must be a 2D list."

    row_lengths = [len(row) for row in data_2d]
    if len(set(row_lengths)) > 1:
        return "Invalid input: all rows in data must have the same length."

    if row_lengths[0] < 2:
        return "Invalid input: data must have at least 2 columns."

    # Convert all elements to float and validate
    try:
        numeric_data = []
        for i, row in enumerate(data_2d):
            numeric_row = []
            for j, val in enumerate(row):
                try:
                    num_val = float(val)
                except (ValueError, TypeError):
                    return f"Invalid input: data[{i}][{j}] is not a valid number."
                if math.isnan(num_val) or math.isinf(num_val):
                    return f"Invalid input: data[{i}][{j}] must be finite."
                numeric_row.append(num_val)
            numeric_data.append(numeric_row)
    except Exception as e:
        return f"Invalid input: unable to convert data to numeric values: {e}"

    # Validate standardize
    if not isinstance(standardize, bool):
        return "Invalid input: standardize must be a boolean."

    # Validate pca_analysis_method
    valid_methods = ['svd', 'eig', 'nipals']
    if not isinstance(pca_analysis_method, str):
        return "Invalid input: pca_analysis_method must be a string."
    if pca_analysis_method not in valid_methods:
        return f"Invalid input: pca_analysis_method must be one of {valid_methods}."

    # Validate n_components
    n_vars = len(numeric_data[0])
    n_obs = len(numeric_data)
    max_components = min(n_obs, n_vars)

    if n_components is None:
        n_comp_to_use = None
    else:
        try:
            n_comp_to_use = int(n_components)
        except (ValueError, TypeError):
            return "Invalid input: n_components must be an integer or None."

        if n_comp_to_use < 1:
            return "Invalid input: n_components must be at least 1."
        if n_comp_to_use > max_components:
            return f"Invalid input: n_components cannot exceed {max_components} (min of observations and variables)."

    # Perform PCA
    try:
        pca_model = statsmodels_PCA(
            numeric_data,
            ncomp=n_comp_to_use,
            standardize=standardize,
            method=pca_analysis_method
        )
    except Exception as e:
        return f"statsmodels.multivariate.pca.PCA error: {e}"

    # Extract results
    try:
        # Get loadings (variables x components)
        loadings = pca_model.loadings.tolist()

        # Get variance explained information
        eigenvalues = pca_model.eigenvals.tolist()
        variance_explained = (pca_model.eigenvals / pca_model.eigenvals.sum()).tolist()
        cumulative_variance = []
        cumsum = 0.0
        for var_exp in variance_explained:
            cumsum += var_exp
            cumulative_variance.append(cumsum)

        # Get scores (observations x components)
        scores = pca_model.scores.tolist()

    except Exception as e:
        return f"Error extracting PCA results: {e}"

    # Build output as 2D list with three sections
    # First, determine the maximum width
    # Loadings/Scores sections have n_comps + 1 columns
    # Variance section has 4 columns
    n_comps = len(loadings[0])
    max_width = max(n_comps + 1, 4)

    result = []

    # Section 1: Component loadings with header
    loadings_header = ['Loadings'] + [f'PC{i+1}' for i in range(n_comps)]
    loadings_header += [""] * (max_width - len(loadings_header))
    result.append(loadings_header)

    for i, loading_row in enumerate(loadings):
        row = [f'Var{i+1}'] + loading_row
        row += [""] * (max_width - len(row))
        result.append(row)

    # Add blank row separator (padded)
    result.append([""] * max_width)

    # Section 2: Variance explained with header
    variance_header = ['Component', 'Eigenvalue', 'Variance_Explained', 'Cumulative_Variance']
    variance_header += [""] * (max_width - len(variance_header))
    result.append(variance_header)

    # For the svd/eig methods, statsmodels reports eigenvalues for every
    # component even when ncomp limits the loadings/scores, so this table may
    # list more rows than there are PC columns above. List every entry; the
    # section's fixed 4 columns still fit within max_width.

    for i in range(len(eigenvalues)):
        # If we have more eigenvalues than components (scores/loadings), we still list them
        # The columns are fixed at 4 for this section.
        variance_row = [
            f'PC{i+1}',
            eigenvalues[i],
            variance_explained[i],
            cumulative_variance[i]
        ]
        variance_row += [""] * (max_width - len(variance_row))
        result.append(variance_row)

    # Add blank row separator (padded)
    result.append([""] * max_width)

    # Section 3: Component scores with header
    scores_header = ['Scores'] + [f'PC{i+1}' for i in range(n_comps)]
    scores_header += [""] * (max_width - len(scores_header))
    result.append(scores_header)

    for i, score_row in enumerate(scores):
        row = [f'Obs{i+1}'] + score_row
        row += [""] * (max_width - len(row))
        result.append(row)

    return result

Online Calculator