KS_2SAMP

Overview

The KS_2SAMP function performs the two-sample Kolmogorov-Smirnov (K-S) test, a nonparametric statistical test of whether two independent samples were drawn from the same underlying continuous probability distribution. Named after mathematicians Andrey Kolmogorov and Nikolai Smirnov, the test is particularly valuable because it makes no assumptions about the specific form of that distribution.

The test works by comparing the empirical distribution functions (EDFs) of the two samples. For two samples with sizes n and m, the K-S statistic measures the maximum absolute difference between their cumulative distributions:

D_{n,m} = \sup_x |F_{1,n}(x) - F_{2,m}(x)|

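where F_{1,n} and F_{2,m} are the empirical distribution functions of the first and second samples respectively, and \sup denotes the supremum (least upper bound) over all values of x. Because both EDFs are step functions that change only at observed values, the supremum is attained at one of the pooled data points.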
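The definition above can be sketched directly in Python. This is a minimal illustration, not the code SciPy runs internally: the helper name ks_statistic is hypothetical, and it assumes NumPy is available alongside SciPy. It evaluates both EDFs on the pooled observations and takes the largest absolute gap, then compares the result with scipy.stats.ks_2samp.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(sample_one, sample_two):
    """Largest absolute gap between the two empirical distribution functions."""
    a = np.sort(np.asarray(sample_one, dtype=float))
    b = np.sort(np.asarray(sample_two, dtype=float))
    # The EDFs only change at observed values, so it is enough to
    # evaluate them at every point of the pooled sample.
    grid = np.concatenate([a, b])
    edf_one = np.searchsorted(a, grid, side="right") / a.size
    edf_two = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(edf_one - edf_two)))

d_manual = ks_statistic([1, 2, 3, 4], [2, 3, 4, 5])
d_scipy = ks_2samp([1, 2, 3, 4], [2, 3, 4, 5]).statistic
print(d_manual, d_scipy)
```

Both values should agree with the statistic reported in Example 1.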

The two-sample K-S test is sensitive to differences in both location (central tendency) and shape of the distributions, making it one of the most general nonparametric methods for comparing two samples. The function supports three alternative hypotheses:

  • two-sided: Tests whether the distributions are identical (default)
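  • less: Tests whether the CDF underlying the first sample lies below that of the second for at least one x, which corresponds to values in the first sample tending to be larger
  • greater: Tests whether the CDF underlying the first sample lies above that of the second for at least one x, which corresponds to values in the first sample tending to be smaller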
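In SciPy's convention the one-sided alternatives refer to the distribution functions, not to the raw values. The sketch below makes that concrete under the assumption that NumPy is available; the helper name signed_edf_gaps is hypothetical. It computes the largest amount by which the first EDF exceeds the second (the statistic used for "greater") and the largest amount by which it falls below (the statistic used for "less").

```python
import numpy as np
from scipy.stats import ks_2samp

def signed_edf_gaps(sample_one, sample_two):
    """Return (d_plus, d_minus): max of F1 - F2 and max of F2 - F1."""
    a = np.sort(np.asarray(sample_one, dtype=float))
    b = np.sort(np.asarray(sample_two, dtype=float))
    grid = np.concatenate([a, b])
    diff = (np.searchsorted(a, grid, side="right") / a.size
            - np.searchsorted(b, grid, side="right") / b.size)
    d_plus = float(max(diff.max(), 0.0))    # first EDF above the second
    d_minus = float(max(-diff.min(), 0.0))  # first EDF below the second
    return d_plus, d_minus

x = [1, 2, 3, 4]
y = [2, 3, 4, 5]
d_plus, d_minus = signed_edf_gaps(x, y)
print(d_plus, ks_2samp(x, y, alternative="greater").statistic)
print(d_minus, ks_2samp(x, y, alternative="less").statistic)
```

Here the first sample holds the smaller values, so its EDF sits above the second one and only the "greater" statistic is nonzero.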

This implementation uses the scipy.stats.ks_2samp function from the SciPy library and returns the K-S test statistic together with its p-value. A small p-value (conventionally below 0.05) is evidence against the null hypothesis that the two samples come from the same distribution; a large p-value does not prove that the distributions are identical. For more details on the underlying algorithm, see Hodges (1958), “The Significance Probability of the Smirnov Two-Sample Test” in Arkiv för Matematik.
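As a rough illustration of how the statistic and p-value behave, the following sketch calls scipy.stats.ks_2samp on synthetic data. The sample sizes, seed, and shift of 2.0 are arbitrary choices made for this example, not values from the function above.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=200)
same_dist = rng.normal(loc=0.0, scale=1.0, size=200)
shifted = rng.normal(loc=2.0, scale=1.0, size=200)  # mean moved by 2 standard deviations

res_same = ks_2samp(baseline, same_dist)
res_shift = ks_2samp(baseline, shifted)

print("same distribution:", res_same.statistic, res_same.pvalue)
print("shifted mean:     ", res_shift.statistic, res_shift.pvalue)
```

The shifted pair should produce a much larger statistic and a p-value far below 0.05, while the result for the first pair depends on the random draw.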

This example function is provided as-is without any representation of accuracy.

Excel Usage

=KS_2SAMP(data_one, data_two, ks_twosamp_alt)
  • data_one (list[list], required): First sample of observations as a 2D array.
  • data_two (list[list], required): Second sample of observations as a 2D array.
  • ks_twosamp_alt (str, optional, default: “two-sided”): Defines the alternative hypothesis for the test.

Returns (list[list]): 2D list [[statistic, p_value]], or error message string.

Examples

Example 1: Basic two-sided test comparing similar distributions

Inputs:

data_one | data_two | ks_twosamp_alt
1, 2     | 2, 3     | two-sided
3, 4     | 4, 5     |

Excel formula:

=KS_2SAMP({1,2;3,4}, {2,3;4,5}, "two-sided")

Expected output:

Result
0.25 1

Example 2: One-sided test for stochastically less distribution

Inputs:

data_one | data_two | ks_twosamp_alt
1, 2     | 2, 3     | less
3, 4     | 4, 5     |

Excel formula:

=KS_2SAMP({1,2;3,4}, {2,3;4,5}, "less")

Expected output:

Result
0 1

Example 3: One-sided test for stochastically greater distribution

Inputs:

data_one | data_two | ks_twosamp_alt
1, 2     | 2, 3     | greater
3, 4     | 4, 5     |

Excel formula:

=KS_2SAMP({1,2;3,4}, {2,3;4,5}, "greater")

Expected output:

Result
0.25 0.8
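The p-value of 0.8 can be checked by hand. For two samples of equal size n, the classical exact result (derived assuming no ties) is P(n * D+ >= k) = C(2n, n + k) / C(2n, n), where D+ is the one-sided statistic. The sketch below evaluates it for n = 4 and k = 1, which corresponds to a statistic of 0.25; the function name is hypothetical, and the agreement with SciPy's output is an observation about this example, not a description of SciPy's internals.

```python
from math import comb

def one_sided_exact_pvalue(n, k):
    """P(n * D_plus >= k) for two samples of equal size n, assuming no ties."""
    return comb(2 * n, n + k) / comb(2 * n, n)

p = one_sided_exact_pvalue(n=4, k=1)  # C(8, 5) / C(8, 4) = 56 / 70
print(p)
```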

Example 4: Two-sided test with shifted distributions

Inputs:

data_one | data_two | ks_twosamp_alt
10, 20   | 15, 25   | two-sided
30, 40   | 35, 45   |

Excel formula:

=KS_2SAMP({10,20;30,40}, {15,25;35,45}, "two-sided")

Expected output:

Result
0.25 1

Python Code

from scipy.stats import ks_2samp as scipy_ks_2samp
import math

def ks_2samp(data_one, data_two, ks_twosamp_alt='two-sided'):
    """
    Performs the two-sample Kolmogorov-Smirnov test for goodness of fit.

    See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        data_one (list[list]): First sample of observations as a 2D array.
        data_two (list[list]): Second sample of observations as a 2D array.
        ks_twosamp_alt (str, optional): Defines the alternative hypothesis for the test. Valid options (case-sensitive): 'two-sided', 'less', 'greater'. Default is 'two-sided'.

    Returns:
        list[list]: 2D list [[statistic, p_value]], or error message string.
    """
    def to2d(x):
        return [[x]] if not isinstance(x, list) else x

    data_one = to2d(data_one)
    data_two = to2d(data_two)

    if not all(isinstance(row, list) for row in data_one):
        return "Invalid input: data_one must be a 2D list."
    if not all(isinstance(row, list) for row in data_two):
        return "Invalid input: data_two must be a 2D list."

    try:
        x = [float(item) for row in data_one for item in row]
        y = [float(item) for row in data_two for item in row]
    except (TypeError, ValueError):
        return "Invalid input: data_one and data_two must contain only numeric values."

    if len(x) < 2 or len(y) < 2:
        return "Invalid input: each sample must contain at least two values."

    if ks_twosamp_alt not in ['two-sided', 'less', 'greater']:
        return "Invalid input: ks_twosamp_alt must be 'two-sided', 'less', or 'greater'."

    try:
        result = scipy_ks_2samp(x, y, alternative=ks_twosamp_alt)
        stat = float(result.statistic)
        pvalue = float(result.pvalue)
    except Exception as e:
        return f"scipy.stats.ks_2samp error: {e}"

    if math.isnan(stat) or math.isnan(pvalue) or math.isinf(stat) or math.isinf(pvalue):
        return "Invalid result: statistic or pvalue is nan or inf."

    return [[stat, pvalue]]
