ANDERSON_KSAMP

Overview

The ANDERSON_KSAMP function performs the k-sample Anderson-Darling test, a non-parametric statistical test that determines whether two or more samples are drawn from the same population distribution. Unlike parametric tests, it does not require specifying the underlying distribution, making it particularly versatile for exploratory data analysis.

The k-sample test extends the classic one-sample Anderson-Darling goodness-of-fit test to the comparison of multiple samples. It tests the null hypothesis that all k samples originate from a common, unspecified distribution. If the test statistic exceeds a critical value (or the p-value falls below the chosen significance level), the null hypothesis is rejected, suggesting that at least one sample comes from a different distribution.

This implementation uses SciPy’s anderson_ksamp function from the scipy.stats module. The test is based on the methodology described by Scholz and Stephens (1987) in their paper “K-Sample Anderson-Darling Tests” published in the Journal of the American Statistical Association.

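The function returns a normalized test statistic, an approximate p-value, and critical values corresponding to significance levels of 25%, 10%, 5%, 2.5%, 1%, 0.5%, and 0.1%. The p-value is interpolated from tabulated values and is floored at 0.1% and capped at 25%, so a reported value of 0.25 means only that the true p-value is at least 0.25, and a reported value of 0.001 means it is at most 0.001. To interpret the results, compare the test statistic against the critical values: if the statistic exceeds the critical value for a given significance level, the null hypothesis can be rejected at that level. Because the statistic is normalized, it can be negative, as in the examples below; a negative statistic lies below every critical value and gives no grounds for rejection.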
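
The comparison described above can be sketched in a few lines. This is a minimal illustration rather than part of the function: the helper name rejected_levels and the SIGNIFICANCE_LEVELS constant are hypothetical, and the input row is assumed to have the layout statistic, p-value, then the seven critical values.

```python
# Significance levels matching the seven critical values, in order.
SIGNIFICANCE_LEVELS = [0.25, 0.10, 0.05, 0.025, 0.01, 0.005, 0.001]

def rejected_levels(row):
    """Return the significance levels at which the null hypothesis is rejected.

    `row` is assumed to be [statistic, p_value, cv_25, cv_10, ..., cv_0.1].
    """
    statistic, critical_values = row[0], row[2:]
    return [
        level
        for level, cv in zip(SIGNIFICANCE_LEVELS, critical_values)
        if statistic > cv
    ]

# Row from the first worked example: the statistic is below every critical value.
example_row = [-0.939886, 0.25, 0.325, 1.226, 1.961, 2.718, 3.752, 4.592, 6.546]
print(rejected_levels(example_row))  # []

# A made-up statistic of 3.0 with the same critical values; the p-value slot
# (0.02) is a placeholder, not a computed result.
print(rejected_levels([3.0, 0.02] + example_row[2:]))  # [0.25, 0.1, 0.05, 0.025]
```

An empty list means the null hypothesis is not rejected at any tabulated level.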

The midrank parameter controls which variant of the test is applied. When set to TRUE (the default), the midrank empirical distribution function is used, which is appropriate for both continuous and discrete data. When set to FALSE, the right-side empirical distribution function is used instead. The two variants differ in how tied observations are weighted, and they generally give slightly different statistics even for untied data, as Examples 1 and 3 show.
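
To make the difference between the two variants concrete, the sketch below evaluates both empirical distribution functions on a small sample with ties. It is illustrative only: the helper names are hypothetical, the midrank form is assumed to be the average of the left and right limits of the usual step function, and this is not how SciPy computes the test statistic internally.

```python
def ecdf_right(sample, x):
    """Right-continuous ECDF: fraction of observations <= x."""
    return sum(v <= x for v in sample) / len(sample)

def ecdf_midrank(sample, x):
    """Midrank ECDF (assumed form): observations tied with x count one half each."""
    below = sum(v < x for v in sample)
    tied = sum(v == x for v in sample)
    return (below + 0.5 * tied) / len(sample)

data = [1, 2, 2, 3]  # hypothetical sample with a tie at 2

print(ecdf_right(data, 2))    # 0.75
print(ecdf_midrank(data, 2))  # 0.5

# Away from observed values the two definitions agree.
print(ecdf_right(data, 2.5), ecdf_midrank(data, 2.5))  # 0.75 0.75
```

The two functions disagree only at observed values, which is where the choice of variant affects the statistic.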

This example function is provided as-is without any representation of accuracy.

Excel Usage

=ANDERSON_KSAMP(samples, midrank)
  • samples (list[list], required): Collection of sample groups where each inner list is one group with at least two observations.
  • midrank (bool, optional, default: true): If TRUE, uses the midrank test (recommended for continuous and discrete data). If FALSE, uses the right-side empirical distribution.

Returns (list[list]): 2D list [[stat, p, critical_values…]], or error string.

Example 1: Two-group test using midrank method

Inputs:

samples        midrank
1.1 2.2 3.3    true
1.2 2.1 3.4

Excel formula:

=ANDERSON_KSAMP({1.1,2.2,3.3;1.2,2.1,3.4}, TRUE)

Expected output:

Result
-0.939886 0.25 0.325 1.226 1.961 2.718 3.752 4.592 6.546

Example 2: Three-group test using midrank method

Inputs:

samples        midrank
1.1 2.2 3.3    true
1.2 2.1 3.4
1.3 2.3 3.1

Excel formula:

=ANDERSON_KSAMP({1.1,2.2,3.3;1.2,2.1,3.4;1.3,2.3,3.1}, TRUE)

Expected output:

Result
-1.30617 0.25 0.449259 1.30528 1.94342 2.57697 3.41635 4.0721 5.56419

Example 3: Two-group test using right-side empirical distribution

Inputs:

samples        midrank
1.1 2.2 3.3    false
1.2 2.1 3.4

Excel formula:

=ANDERSON_KSAMP({1.1,2.2,3.3;1.2,2.1,3.4}, FALSE)

Expected output:

Result
-0.867258 0.25 0.325 1.226 1.961 2.718 3.752 4.592 6.546

Example 4: Three-group test using right-side empirical distribution

Inputs:

samples        midrank
1.1 2.2 3.3    false
1.2 2.1 3.4
1.3 2.3 3.1

Excel formula:

=ANDERSON_KSAMP({1.1,2.2,3.3;1.2,2.1,3.4;1.3,2.3,3.1}, FALSE)

Expected output:

Result
-1.23885 0.25 0.449259 1.30528 1.94342 2.57697 3.41635 4.0721 5.56419
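
The critical values in the examples are identical for the midrank and right-side variants and change only with the number of groups. The sketch below reproduces them, assuming the interpolation formula b0 + b1/sqrt(m) + b2/m with m = k - 1 attributed to Scholz and Stephens (1987). The coefficient lists are an assumption based on that scheme, not something stated on this page, and the function name critical_values is hypothetical.

```python
import math

# Coefficients for significance levels 25%, 10%, 5%, 2.5%, 1%, 0.5%, 0.1%.
# Assumed values; they reproduce the critical values shown in the examples.
B0 = [0.675, 1.281, 1.645, 1.960, 2.326, 2.573, 3.085]
B1 = [-0.245, 0.250, 0.678, 1.149, 1.822, 2.364, 3.615]
B2 = [-0.105, -0.305, -0.362, -0.391, -0.396, -0.345, -0.154]

def critical_values(k):
    """Approximate critical values for a k-sample test (k >= 2)."""
    m = k - 1
    return [b0 + b1 / math.sqrt(m) + b2 / m for b0, b1, b2 in zip(B0, B1, B2)]

# Two groups: matches Examples 1 and 3.
print([round(c, 3) for c in critical_values(2)])
# [0.325, 1.226, 1.961, 2.718, 3.752, 4.592, 6.546]

# Three groups: compare against Examples 2 and 4.
print([round(c, 5) for c in critical_values(3)])
```

Only the test statistic itself depends on the data and on the midrank setting.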

Python Code

import math
import warnings
from scipy.stats import anderson_ksamp as scipy_anderson_ksamp

def anderson_ksamp(samples, midrank=True):
    """
    Performs the k-sample Anderson-Darling test to determine if samples are drawn from the same population.

    See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson_ksamp.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        samples (list[list]): Collection of sample groups where each inner list is one group with at least two observations.
        midrank (bool, optional): If True, uses the midrank test (recommended for continuous and discrete data). If False, uses the right-side empirical distribution. Default is True.

    Returns:
        list[list]: 2D list [[stat, p, critical_values...]], or error string.
    """
    try:
        # Validate the outer structure: at least two groups, each with at least two values
        if not isinstance(samples, list) or len(samples) < 2:
            return "Error: samples must be a 2D list with at least two rows (sample groups)."
        if any(not isinstance(group, list) or len(group) < 2 for group in samples):
            return "Error: each sample group must be a list with at least two values."

        # Reject non-numeric values; bool is excluded explicitly because it subclasses int
        for group in samples:
            for v in group:
                if isinstance(v, bool) or not isinstance(v, (int, float)):
                    return "Error: all sample values must be numeric."

        with warnings.catch_warnings():
            # Silence the warnings raised when the p-value hits either end of its
            # tabulated range (capped at 25% or floored at 0.1%)
            warnings.filterwarnings('ignore', message='p-value (capped|floored)')
            # Each inner list is already one sample group, so no transposition is needed
            result = scipy_anderson_ksamp(samples, midrank=midrank)

        # Single output row: statistic, p-value, then the seven critical values
        output = [
            float(result.statistic),
            float(result.pvalue),
            *(float(c) for c in result.critical_values),
        ]
        if not all(math.isfinite(x) for x in output):
            return "Error: statistic or critical values are not finite."
        return [output]
    except Exception as e:
        return f"Error: {e}"
