POINTBISERIALR

Overview

The POINTBISERIALR function calculates the point-biserial correlation coefficient, a measure of the strength and direction of association between a binary variable (coded as 0 and 1) and a continuous variable. This statistic is commonly used in psychometrics, educational testing, and social science research to assess relationships such as whether a treatment group (1) versus control group (0) differs on a continuous outcome measure.

The point-biserial correlation is the Pearson correlation coefficient applied to one dichotomous and one continuous variable. Like other correlation coefficients, it ranges from -1 to +1, where 0 indicates no correlation, and values of -1 or +1 indicate a perfect relationship, in which the continuous variable takes a single value within each group.

This implementation uses SciPy’s pointbiserialr function from the scipy.stats module. The function returns both the correlation coefficient and a two-sided p-value based on a t-test with n-2 degrees of freedom.

The point-biserial correlation coefficient is calculated using the formula:

r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{s_y} \sqrt{\frac{N_0 N_1}{N(N-1)}}

where \bar{Y}_0 and \bar{Y}_1 are the means of the continuous variable for observations coded 0 and 1 respectively, N_0 and N_1 are the counts of observations in each group, N is the total sample size, and s_y is the sample standard deviation of the continuous variable over all N observations (computed with an N-1 denominator, which is what the N(N-1) term in the formula assumes).
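As a rough check of the formula, the sketch below evaluates it directly on the data from Example 1 and compares the result with a hand-rolled Pearson correlation. The helper names point_biserial_by_formula and pearson_r are illustrative only and are not part of SciPy or of this function; the sketch uses only the Python standard library.

```python
import math
import statistics

def point_biserial_by_formula(x, y):
    """Evaluate r_pb from group means, sample SD of y, and group counts."""
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    n0, n1, n = len(y0), len(y1), len(y)
    s_y = statistics.stdev(y)  # sample standard deviation (N - 1 denominator)
    mean_diff = statistics.mean(y1) - statistics.mean(y0)
    return mean_diff / s_y * math.sqrt(n0 * n1 / (n * (n - 1)))

def pearson_r(x, y):
    """Plain Pearson correlation, written out for comparison."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Data from Example 1
x = [0, 0, 0, 1, 1, 1, 1]
y = [1, 2, 3, 4, 5, 6, 7]

r_formula = point_biserial_by_formula(x, y)
r_pearson = pearson_r(x, y)
print(round(r_formula, 4), round(r_pearson, 4))  # 0.866 0.866
```

Both routes give the same value, which matches the correlation reported for Example 1.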

A point-biserial correlation that is significantly different from zero (p-value below a chosen threshold such as 0.05) is equivalent to a significant difference in means between the two groups under an independent-samples t-test with pooled (equal) variances and N-2 degrees of freedom. The t-statistic and r_{pb} are related by:

t = \sqrt{N-2} \cdot \frac{r_{pb}}{\sqrt{1 - r_{pb}^2}}
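To illustrate this relationship, the sketch below computes t from r_{pb} for the Example 1 data and compares it with the statistic of a pooled-variance two-sample t-test computed by hand. The helper names t_from_r and pooled_t are hypothetical, and the p-value is left out because the standard library has no t-distribution.

```python
import math
import statistics

def t_from_r(r, n):
    """t-statistic implied by a correlation r on n observations."""
    return math.sqrt(n - 2) * r / math.sqrt(1 - r * r)

def pooled_t(y0, y1):
    """Two-sample t-statistic with pooled (equal) variances."""
    n0, n1 = len(y0), len(y1)
    m0, m1 = statistics.mean(y0), statistics.mean(y1)
    ss0 = sum((v - m0) ** 2 for v in y0)
    ss1 = sum((v - m1) ** 2 for v in y1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))

# Data from Example 1, split by group
y0 = [1, 2, 3]        # observations coded 0
y1 = [4, 5, 6, 7]     # observations coded 1
x = [0] * len(y0) + [1] * len(y1)
y = y0 + y1

# Pearson correlation between the 0/1 codes and y (this is r_pb)
mx, my = statistics.mean(x), statistics.mean(y)
r_pb = sum((a - mx) * (b - my) for a, b in zip(x, y)) / math.sqrt(
    sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
)

t_r = t_from_r(r_pb, len(y))
t_pooled = pooled_t(y0, y1)
print(round(t_r, 4), round(t_pooled, 4))  # 3.873 3.873
```

The two statistics agree, so testing r_{pb} against zero and testing the difference in group means lead to the same conclusion for this data.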

For additional background on the point-biserial correlation, see Tate (1954) and the Wiley StatsRef entry on Point Biserial Correlation.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=POINTBISERIALR(x, y)
  • x (list[list], required): Binary variable (column vector of 0s and 1s)
  • y (list[list], required): Continuous variable (column vector), same length as x

Returns (list[list]): 2D list [[correlation, p_value]], or error message string.
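
As a minimal sketch of the input shape, and assuming a column range such as {0;0;1;1} reaches Python as one single-cell row per worksheet row, the values can be flattened before being correlated. The helper name flatten_column is hypothetical and is used only for illustration.

```python
def flatten_column(range_2d):
    """Flatten an Excel-style 2D range (a list of rows) into a flat list of floats."""
    return [float(cell) for row in range_2d for cell in row]

# Assumed shape of the ranges from Example 2: one single-cell row per worksheet row
x_range = [[0], [0], [1], [1]]
y_range = [[1], [1], [5], [5]]

x_flat = flatten_column(x_range)
y_flat = flatten_column(y_range)
print(x_flat)  # [0.0, 0.0, 1.0, 1.0]
print(y_flat)  # [1.0, 1.0, 5.0, 5.0]
```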

Examples

Example 1: Positive correlation with unequal group sizes

Inputs:

x y
0 1
0 2
0 3
1 4
1 5
1 6
1 7

Excel formula:

=POINTBISERIALR({0;0;0;1;1;1;1}, {1;2;3;4;5;6;7})

Expected output:

Result
0.866 0.0117

Example 2: Perfect positive correlation

Inputs:

x y
0 1
0 1
1 5
1 5

Excel formula:

=POINTBISERIALR({0;0;1;1}, {1;1;5;5})

Expected output:

Result
1 0

Example 3: Strong negative correlation

Inputs:

x y
0 10
0 8
0 9
1 2
1 3
1 1

Excel formula:

=POINTBISERIALR({0;0;0;1;1;1}, {10;8;9;2;3;1})

Expected output:

Result
-0.9739 0.001

Example 4: Zero correlation (equal group means)

Inputs:

x y
0 1
0 5
1 3
1 3

Excel formula:

=POINTBISERIALR({0;0;1;1}, {1;5;3;3})

Expected output:

Result
0 1

Python Code

from scipy.stats import pointbiserialr as scipy_pointbiserialr

def pointbiserialr(x, y):
    """
    Calculate a point biserial correlation coefficient and its p-value.

    See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pointbiserialr.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        x (list[list]): Binary variable (column vector of 0s and 1s)
        y (list[list]): Continuous variable (column vector), same length as x

    Returns:
        list[list]: 2D list [[correlation, p_value]], or error message string.
    """
    # Helper: wrap a scalar in a 2D list; lists are passed through unchanged
    def to2d(val):
        return [[val]] if not isinstance(val, list) else val

    # Normalize inputs
    x = to2d(x)
    y = to2d(y)

    # Flatten 2D lists to 1D and convert to numeric
    try:
        x_flat = []
        for row in x:
            if isinstance(row, list):
                x_flat.extend(row)
            else:
                x_flat.append(row)

        y_flat = []
        for row in y:
            if isinstance(row, list):
                y_flat.extend(row)
            else:
                y_flat.append(row)

        x_array = [float(val) for val in x_flat]
        y_array = [float(val) for val in y_flat]

    except (ValueError, TypeError) as e:
        return f"Invalid input: x and y must contain numeric values. {str(e)}"

    # Check that arrays have the same length
    if len(x_array) != len(y_array):
        return "Invalid input: x and y must have the same length."

    # Check minimum length
    if len(x_array) < 3:
        return "Invalid input: arrays must contain at least 3 elements."

    # Validate that x contains only binary values (0 or 1)
    x_unique = set(x_array)
    if not x_unique.issubset({0.0, 1.0}):
        return "Invalid input: x must contain only binary values (0 or 1)."

    # Check that we have both 0 and 1 values in x
    if len(x_unique) < 2:
        return "Invalid input: x must contain both 0 and 1 values."

    # Check for constant y values
    if len(set(y_array)) == 1:
        return "Invalid input: y must contain varying values (not all identical)."

    try:
        # Calculate point-biserial correlation
        result = scipy_pointbiserialr(x_array, y_array)
        correlation = float(result.statistic)
        pvalue = float(result.pvalue)

        # Return as 2D list (single row, two columns)
        return [[correlation, pvalue]]

    except Exception as e:
        return f"Error calculating point-biserial correlation: {str(e)}"
