KAPLAN_MEIER

Overview

The KAPLAN_MEIER function estimates the survival function from time-to-event data using the Kaplan-Meier estimator, also known as the product-limit estimator. The survival function S(t) = P(T > t) represents the probability that an event (such as death, failure, or recovery) has not yet occurred by time t.

The Kaplan-Meier estimator is a non-parametric statistic introduced by Edward L. Kaplan and Paul Meier in their seminal 1958 paper. It is widely used in medical research to measure patient survival after treatment, in reliability engineering for time-to-failure analysis, and in economics for unemployment duration studies.

This implementation uses the statsmodels library’s SurvfuncRight class, which supports right-censored data—observations where the event has not yet occurred by the end of the study period. The estimator handles censoring by adjusting the risk set at each time point.

The survival probability at time t is computed as:

\hat{S}(t) = \prod_{i: t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)

where t_i represents times when events occurred, d_i is the number of events at time t_i, and n_i is the number of individuals at risk just before time t_i.

The function returns survival estimates at each event time, including the number at risk, events, and censored observations. Confidence intervals are computed using the log-log transformation (Kalbfleisch-Prentice method), which provides better coverage for extreme survival probabilities. The variance is estimated using Greenwood’s formula. Additionally, the function provides the median survival time—the time at which the survival probability crosses 0.5—along with its confidence interval.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=KAPLAN_MEIER(time, event, alpha)

time (list[list], required): Column vector of observed times to event or censoring (non-negative).
event (list[list], required): Column vector of event indicators (1 = event occurred, 0 = censored).
alpha (float, optional, default: 0.05): Significance level for confidence intervals (must be between 0 and 1 exclusive).

Returns (list[list]): 2D list with survival estimates

Example 1: All events (no censoring)

Inputs:

time	event
5	1
10	1
15	1
20	1

Excel formula:

=KAPLAN_MEIER({5;10;15;20}, {1;1;1;1})

Expected output:

time	n_risk	n_event	n_censored	survival_prob	ci_lower	ci_upper
5	4	1	0	0.75	0.127947	0.960549
10	3	1	0	0.5	0.0578471	0.844861
15	2	1	0	0.25	0.00894782	0.665325
20	1	1	0	0	0	0
Median				15	5	20

Example 2: Mix of events and censoring

Inputs:

time	event
6	1
12	0
18	1
24	1
30	0

Excel formula:

=KAPLAN_MEIER({6;12;18;24;30}, {1;0;1;1;0})

Expected output:

time	n_risk	n_event	n_censored	survival_prob	ci_lower	ci_upper
6	5	1	0	0.8	0.203809	0.96918
18	3	1	0	0.533333	0.0683321	0.863071
24	2	1	0	0.266667	0.009677	0.686136
Median				24	6	N/A

Example 3: Custom alpha 0.1

Inputs:

time	event	alpha
3	1	0.1
7	1
11	0
14	1

Excel formula:

=KAPLAN_MEIER({3;7;11;14}, {1;1;0;1}, 0.1)

Expected output:

time	n_risk	n_event	n_censored	survival_prob	ci_lower	ci_upper
3	4	1	0	0.75	0.223409	0.946277
7	3	1	0	0.5	0.103261	0.809283
14	1	1	0	0	0	0
Median				14	3	14

Example 4: Larger dataset with censoring

Inputs:

time	event	alpha
2	1	0.05
4	1
6	0
8	1
10	1
12	0

Excel formula:

=KAPLAN_MEIER({2;4;6;8;10;12}, {1;1;0;1;1;0}, 0.05)

Expected output:

time	n_risk	n_event	n_censored	survival_prob	ci_lower	ci_upper
2	6	1	0	0.833333	0.273123	0.974712
4	5	1	0	0.666667	0.194617	0.904434
8	3	1	0	0.444444	0.0661868	0.784908
10	2	1	0	0.222222	0.00956914	0.614721
Median				8	2	N/A

Python Code

Show Code

import math
import warnings
from statsmodels.duration.survfunc import SurvfuncRight as statsmodels_SurvfuncRight
from scipy import stats

def kaplan_meier(time, event, alpha=0.05):
    """
    Computes the Kaplan-Meier survival function estimate for time-to-event data.

    See: https://www.statsmodels.org/stable/generated/statsmodels.duration.survfunc.SurvfuncRight.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        time (list[list]): Column vector of observed times to event or censoring (non-negative).
        event (list[list]): Column vector of event indicators (1 = event occurred, 0 = censored).
        alpha (float, optional): Significance level for confidence intervals (must be between 0 and 1 exclusive). Default is 0.05.

    Returns:
        list[list]: 2D list with survival estimates
    """
    def to2d(x):
        return [[x]] if not isinstance(x, list) else x

    def _validate_float(value, name, allow_negative=False):
        if not isinstance(value, (int, float)):
            return f"Error: Invalid input: {name} must be a number."
        val = float(value)
        if math.isnan(val) or math.isinf(val):
            return f"Error: Invalid input: {name} must be finite."
        if not allow_negative and val < 0:
            return f"Error: Invalid input: {name} must be non-negative."
        return val

    def _flatten_column(data, name):
        # Flatten 2D column vector to 1D list
        if not isinstance(data, list):
            return f"Error: Invalid input: {name} must be a 2D list."
        flat = []
        for row in data:
            if not isinstance(row, list):
                return f"Error: Invalid input: {name} must be a 2D list."
            if len(row) != 1:
                return f"Error: Invalid input: {name} must be a column vector (single column)."
            flat.append(row[0])
        return flat

    try:
        # Normalize inputs to 2D
        time = to2d(time)
        event = to2d(event)

        # Flatten column vectors
        time_flat = _flatten_column(time, "time")
        if isinstance(time_flat, str):
            return time_flat

        event_flat = _flatten_column(event, "event")
        if isinstance(event_flat, str):
            return event_flat

        # Check equal lengths
        if len(time_flat) != len(event_flat):
            return "Error: Invalid input: time and event must have the same length."

        if len(time_flat) == 0:
            return "Error: Invalid input: time and event must not be empty."

        # Validate time values
        validated_time = []
        for i, t in enumerate(time_flat):
            val_t = _validate_float(t, f"time[{i}]", allow_negative=False)
            if isinstance(val_t, str):
                return val_t
            validated_time.append(val_t)

        # Validate event values (must be 0 or 1)
        validated_event = []
        for i, e in enumerate(event_flat):
            val_e = _validate_float(e, f"event[{i}]", allow_negative=False)
            if isinstance(val_e, str):
                return val_e
            if val_e not in (0, 1):
                return f"Error: Invalid input: event[{i}] must be 0 or 1."
            validated_event.append(int(val_e))

        # Validate alpha
        val_alpha = _validate_float(alpha, "alpha", allow_negative=False)
        if isinstance(val_alpha, str):
            return val_alpha
        if val_alpha <= 0 or val_alpha >= 1:
            return "Error: Invalid input: alpha must be between 0 and 1 (exclusive)."

        # Create SurvfuncRight object
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore", RuntimeWarning)
                sf = statsmodels_SurvfuncRight(validated_time, validated_event, title="KM")
        except Exception as exc:
            return f"Error: statsmodels.duration.survfunc.SurvfuncRight error: {exc}"

        # Build output table
        try:
            # Get survival function data
            surv_times = sf.surv_times
            surv_prob = sf.surv_prob
            surv_se = sf.surv_prob_se

            # Calculate z-score for confidence interval
            z = stats.norm.ppf(1 - val_alpha / 2)

            # Calculate n_risk, n_event, n_censored at each time point
            result = [['time', 'n_risk', 'n_event', 'n_censored', 'survival_prob', 'ci_lower', 'ci_upper']]

            for i, t in enumerate(surv_times):
                # Count at risk, events, and censored at this time
                n_risk = sum(1 for j, tt in enumerate(validated_time) if tt >= t)
                n_event = sum(1 for j, (tt, ee) in enumerate(zip(validated_time, validated_event)) if tt == t and ee == 1)
                n_censored = sum(1 for j, (tt, ee) in enumerate(zip(validated_time, validated_event)) if tt == t and ee == 0)

                # Calculate confidence intervals using normal approximation
                s = surv_prob[i]
                se = surv_se[i]

                # Use log-log transformation for better CI (Kalbfleisch-Prentice method)
                if s > 0 and s < 1 and se > 0:
                    # Calculate CI on log-log scale
                    log_s = math.log(s)
                    if log_s < 0:
                        w = z * se / (s * abs(log_s))
                        ci_l = s ** math.exp(w)
                        ci_u = s ** math.exp(-w)
                        # Clamp to [0, 1]
                        ci_l = max(0.0, min(1.0, ci_l))
                        ci_u = max(0.0, min(1.0, ci_u))
                    else:
                        # Fallback to plain CI
                        ci_l = max(0.0, s - z * se)
                        ci_u = min(1.0, s + z * se)
                elif s == 1.0:
                    ci_l = 1.0
                    ci_u = 1.0
                elif s == 0.0:
                    ci_l = 0.0
                    ci_u = 0.0
                else:
                    # Fallback to plain CI
                    ci_l = max(0.0, s - z * se)
                    ci_u = min(1.0, s + z * se)

                result.append([
                    float(t),
                    float(n_risk),
                    float(n_event),
                    float(n_censored),
                    float(s),
                    float(ci_l),
                    float(ci_u)
                ])

            # Add median survival time row
            with warnings.catch_warnings():
                warnings.simplefilter("ignore", RuntimeWarning)
                median_surv = sf.quantile(0.5)
                median_ci_lower = sf.quantile_ci(0.5, alpha=val_alpha)[0]
                median_ci_upper = sf.quantile_ci(0.5, alpha=val_alpha)[1]

            # Handle potential NaN/inf in median values
            median_val = float(median_surv) if not (math.isnan(median_surv) or math.isinf(median_surv)) else 'N/A'
            median_ci_l = float(median_ci_lower) if not (math.isnan(median_ci_lower) or math.isinf(median_ci_lower)) else 'N/A'
            median_ci_u = float(median_ci_upper) if not (math.isnan(median_ci_upper) or math.isinf(median_ci_upper)) else 'N/A'

            result.append([
                'Median',
                "",
                "",
                "",
                median_val,
                median_ci_l,
                median_ci_u
            ])

        except Exception as exc:
            return f"Error: Failed to compute Kaplan-Meier estimates: {exc}"

        return result
    except Exception as e:
        return f"Error: {str(e)}"

Online Calculator

time *

Column vector of observed times to event or censoring (non-negative).

event *

Column vector of event indicators (1 = event occurred, 0 = censored).

alpha

Significance level for confidence intervals (must be between 0 and 1 exclusive).