Clustering

Overview

Introduction

Clustering is an unsupervised learning approach that groups observations so that items in the same group are more similar to one another than to items in other groups. In practical terms, clustering helps teams discover structure in data when no labeled outcome is available. A broad overview is available at Wikipedia: Cluster analysis. This category focuses specifically on fuzzy clustering, where each observation can belong to multiple clusters with different strengths rather than being forced into one hard assignment. A general reference is Wikipedia: Fuzzy clustering.

In Boardflare’s clustering category, the two calculators are CMEANS and CMEANS_PREDICT. Both are built on the scikit-fuzzy implementation of fuzzy c-means: skfuzzy.cmeans and skfuzzy.cmeans_predict. These functions are widely used when cluster boundaries overlap and decision-makers need membership degrees rather than binary labels.

Traditional hard clustering (for example, k-means) answers: “Which single cluster does this point belong to?” Fuzzy c-means answers a richer question: “To what degree does this point belong to each cluster?” That distinction matters in many business and engineering domains where categories are inherently mixed. A customer can behave partly like a value buyer and partly like a premium buyer. A machine condition can be partly normal and partly degrading. A geographic zone can have mixed demand profiles. Soft membership captures this reality more faithfully.

From a mathematical perspective, fuzzy c-means optimizes cluster centers and a membership matrix simultaneously. The result includes:

  • Cluster centers (cntr) that represent typical patterns.
  • Membership matrix (u) showing each point’s fractional affiliation with each cluster.
  • Distances (d) from points to centers.
  • Objective history (jm) and convergence state (p) used for diagnostics.
  • Fuzzy partition coefficient (fpc) as a compact measure of partition sharpness.

From a practical perspective, these outputs support workflows that go beyond segmentation itself. Teams can rank records by cluster affinity, define soft thresholds for interventions, and monitor how memberships drift over time. This is why fuzzy clustering is useful not only for exploratory analysis but also for operational decision support.

Boardflare’s two calculators map directly to two phases of a clustering lifecycle:

  1. Model establishment with CMEANS: identify centers and baseline memberships from historical data.
  2. Model application with CMEANS_PREDICT: score new observations against fixed, previously trained centers.

This separation is important for controlled analytics. It lets teams train a reference clustering model, freeze its centers for governance, and then score incoming data consistently over time. That pattern is common in production analytics pipelines where model stability and auditability matter as much as raw fit.

When to Use It

Use this category when the job is to uncover latent structure in unlabeled data and preserve nuance in category boundaries. In other words, use fuzzy clustering when real-world entities can plausibly belong to multiple groups at once.

One high-value scenario is customer and account segmentation in go-to-market analytics. Suppose a revenue operations team wants to segment accounts by product usage, support intensity, and spend trajectory. Hard segmentation can force an account into only one bucket, which often hides transition behavior. With CMEANS, analysts produce cluster centers and membership degrees that reveal blended profiles (for example, “70% growth-oriented, 30% cost-sensitive”). As new accounts arrive each week, CMEANS_PREDICT assigns memberships to those same strategic segments without retraining every time. This supports consistent routing, campaign personalization, and quarterly planning.

A second scenario is industrial condition monitoring. Sensors on pumps, compressors, or turbines rarely produce crisp, separable states. A machine can gradually transition from healthy to early-fault behavior. Fuzzy memberships are a natural fit because they quantify partial condition overlap. Engineers train baseline states with CMEANS, then score streaming batches with CMEANS_PREDICT to detect early drift. Rising membership in a “degradation” cluster can trigger preventive maintenance before hard alarm thresholds are reached.

A third scenario is supply and demand zoning in logistics or retail networks. Regional demand patterns are often mixtures: a zone may show both urban and suburban demand signatures depending on daypart and product mix. Hard clustering can produce brittle decisions when zones sit near boundaries. Fuzzy assignments from CMEANS let planners treat zones with mixed identity differently in inventory policies and staffing models. Then CMEANS_PREDICT maps new periods or new locations into the same framework for ongoing network calibration.

This category is also useful in risk analytics, healthcare stratification, and anomaly triage when stakeholders need explainable gradations instead of one-shot labels. The membership matrix becomes a decision surface: high-confidence cases can be auto-routed, while ambiguous cases receive human review.

Situations where fuzzy clustering is especially appropriate:

  • Cluster overlap is expected and meaningful.
  • Decision logic benefits from graded certainty.
  • Teams need stable centers for future scoring.
  • Analysts want both center-level summaries and record-level affinities.

Situations where it may be less appropriate:

  • Data are strongly separable and hard labels are required for compliance logic.
  • There is no tolerance for tuning hyperparameters such as fuzziness exponent m.
  • Business process cannot consume probabilistic or fractional outputs.

In short, this category is best when the job is not just “group records,” but “characterize mixed membership in a reproducible way.”

How It Works

The underlying method is fuzzy c-means optimization. Let x_j \in \mathbb{R}^d be data points, v_i be cluster centers for i=1,\dots,c, and u_{ij} be the membership of point j in cluster i.

The core objective minimized by CMEANS is:

J_m(U, V) = \sum_{j=1}^{N} \sum_{i=1}^{c} u_{ij}^m \lVert x_j - v_i \rVert^2

subject to:

\sum_{i=1}^{c} u_{ij} = 1, \quad 0 \le u_{ij} \le 1

The exponent m > 1 controls fuzziness. As m \to 1, assignments become harder; larger m spreads membership more evenly across clusters.

The alternating updates are conceptually:

v_i = \frac{\sum_{j=1}^{N} u_{ij}^m x_j}{\sum_{j=1}^{N} u_{ij}^m}

and

u_{ij} = \left(\sum_{k=1}^{c} \left(\frac{\lVert x_j-v_i\rVert}{\lVert x_j-v_k\rVert}\right)^{\frac{2}{m-1}}\right)^{-1}

The algorithm iterates until either the membership matrix change falls below error or maxiter is reached. In Boardflare’s CMEANS, inputs are organized as features by samples, so shape conventions should be checked before use.
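The alternating updates above can be sketched directly in NumPy. This is a minimal illustration of the mathematics, not the scikit-fuzzy implementation the calculators wrap; data is oriented features x samples as in CMEANS:

```python
import numpy as np

def fcm_step(x, u, m):
    """One alternating update of fuzzy c-means.

    x: (d, N) data, features x samples; u: (c, N) memberships; m > 1 fuzziness.
    Returns updated centers v (c, d) and memberships u (c, N).
    """
    um = u ** m                                       # u_ij^m
    v = (um @ x.T) / um.sum(axis=1, keepdims=True)    # weighted means, one row per center
    # Distances from every center to every point: shape (c, N)
    d = np.linalg.norm(v[:, None, :] - x.T[None, :, :], axis=2)
    d = np.fmax(d, np.finfo(float).eps)               # avoid division by zero
    # u_ij = d_ij^(-2/(m-1)) / sum_k d_kj^(-2/(m-1)), equivalent to the formula above
    u_new = d ** (-2.0 / (m - 1))
    u_new /= u_new.sum(axis=0, keepdims=True)
    return v, u_new

# Two well-separated 2D groups, features x samples
x = np.array([[0.0, 0.1, 5.0, 5.1],
              [0.0, 0.1, 5.0, 5.1]])
# Deterministic, slightly biased initial memberships (columns sum to 1)
u = np.array([[0.6, 0.6, 0.4, 0.4],
              [0.4, 0.4, 0.6, 0.6]])
for _ in range(50):
    v, u = fcm_step(x, u, m=2.0)
```

After iterating, the centers settle near the two groups and each point's membership concentrates on its nearer center while columns of u still sum to one.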

The key diagnostics include:

  • jm: objective function values across iterations, useful for confirming descent and convergence behavior.
  • p: effective iteration count.
  • fpc: fuzzy partition coefficient, which ranges from 1/c (maximally fuzzy) to 1 (crisp); higher values suggest crisper partitions.
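The fpc diagnostic is simple to compute from the membership matrix alone. A minimal sketch, for intuition only (the calculators return fpc directly):

```python
import numpy as np

def partition_coefficient(u):
    """Fuzzy partition coefficient: mean of squared memberships.
    u is (c, N); the result lies in [1/c, 1] -- higher means crisper."""
    return float((u ** 2).sum() / u.shape[1])

crisp = np.array([[1.0, 0.0], [0.0, 1.0]])   # hard assignment -> fpc of 1.0
fuzzy = np.full((2, 2), 0.5)                 # maximal overlap -> fpc of 0.5 (= 1/c)
```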

For scoring new data, CMEANS_PREDICT keeps trained centers fixed and solves memberships for incoming points under the same fuzzy framework. Conceptually, this is the deployment step: no center re-estimation, only assignment against known prototypes. That distinction makes outputs comparable over time.
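Conceptually, that scoring step is just the membership update with centers held fixed. A minimal NumPy sketch of the idea (illustrative only; CMEANS_PREDICT wraps skfuzzy.cmeans_predict rather than this code):

```python
import numpy as np

def score_against_centers(x, v, m=2.0):
    """Memberships of new points against fixed centers.
    x: (features, samples) new data; v: (c, features) trained centers."""
    d = np.linalg.norm(v[:, None, :] - x.T[None, :, :], axis=2)  # (c, N) distances
    d = np.fmax(d, np.finfo(float).eps)                          # avoid division by zero
    u = d ** (-2.0 / (m - 1))
    return u / u.sum(axis=0, keepdims=True)                      # columns sum to 1

centers = np.array([[0.0, 0.0], [5.0, 5.0]])     # two frozen prototypes
new_points = np.array([[0.2, 4.9], [0.1, 5.2]])  # features x samples
u = score_against_centers(new_points, centers)
```

Because the centers never move, repeated scoring runs stay anchored to the same segment definitions.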

Why this matters operationally:

  1. Governance: fixed centers prevent silent drift from retraining on every batch.
  2. Consistency: weekly or daily scores remain anchored to the same segment definitions.
  3. Speed: prediction is lighter than full retraining when centers are already validated.

The upstream implementation is provided by scikit-fuzzy, with numerics based on NumPy/SciPy arrays. In Boardflare calculators, users provide structured arrays from spreadsheet-like inputs and receive dictionary outputs with both scalar and matrix fields.

Important assumptions and requirements:

  • Distance is Euclidean in standard fuzzy c-means; feature scaling strongly affects results.
  • c (number of clusters) must be set by the analyst; it is not inferred automatically.
  • m controls overlap; common starting value is around 2, then tuned by interpretability and stability.
  • Initialization can affect local minima, so sensitivity checks are good practice.
  • Features should be normalized when units differ materially.
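The scaling requirement can be met with a simple z-score of each feature row before clustering. A minimal sketch, assuming a features x samples matrix:

```python
import numpy as np

def standardize_features(x):
    """Z-score each feature row of a features x samples matrix."""
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard constant features
    return (x - mu) / sigma

# Spend in dollars would dominate a 0-1 usage score until both are standardized
raw = np.array([[120000.0, 95000.0, 310000.0],   # annual spend
                [0.2, 0.9, 0.4]])                # usage intensity
z = standardize_features(raw)
```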

Compared with hard clustering, fuzzy c-means does not force a single segment identity. This is mathematically elegant and practically useful, but it also means downstream systems must know how to consume memberships. Typical strategies include maximum-membership assignment for simple routing, top-two memberships for mixed strategies, or threshold-based triage for uncertain records.

When interpreting results, teams should avoid over-reading tiny membership differences. For example, a record with memberships (0.51, 0.49) is intrinsically ambiguous and may warrant special handling. That ambiguity is a feature, not a bug: it surfaces uncertainty explicitly rather than hiding it.

Practical Example

Consider a B2B SaaS team building account archetypes from three standardized signals: product usage intensity, support ticket load, and expansion propensity. The goal is to discover account segments and then score new accounts weekly for campaign and customer-success actions.

Step 1: Prepare a training matrix for CMEANS.

  • Assemble historical account-level data.
  • Standardize each feature so scale does not dominate distance.
  • Arrange data in the calculator’s expected orientation (features x samples).

Step 2: Choose initial hyperparameters.

  • c = 3 clusters for an initial business hypothesis (for example: growth, stable, at-risk).
  • m = 2.0 for moderate fuzziness.
  • error = 0.005, maxiter = 100 as practical convergence controls.

Step 3: Run CMEANS and inspect outputs.

Suppose the returned cntr suggests:

  • Cluster A: high usage, moderate support, high expansion propensity.
  • Cluster B: moderate usage, low support, stable expansion.
  • Cluster C: low usage, high support, low expansion propensity.

The membership matrix u reveals mixed identities. An account might score 0.62 in Cluster A and 0.34 in Cluster B, indicating strong-but-not-exclusive growth behavior. Another might be 0.40 / 0.38 / 0.22, signaling ambiguity and a candidate for analyst review.

Step 4: Validate usefulness before operationalizing.

  • Check whether clusters map to meaningful business actions.
  • Review fpc and jm trends for quality and convergence.
  • Compare outcomes under nearby settings (for example, c=4 or m=1.8) to test stability.

Step 5: Freeze centers and transition to scoring mode with CMEANS_PREDICT.

  • Store the validated cntr matrix as the reference segment definition.
  • For each new weekly batch, submit test_data and cntr_trained.
  • Receive updated memberships without changing the segment prototypes.

Step 6: Convert memberships into actions.

  • If max membership \ge 0.75, route to segment-specific automation.
  • If top-two memberships are close (for example, difference < 0.1), assign blended messaging.
  • If all memberships are diffuse, route to manual strategy review.
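The three routing rules above can be sketched as a small helper. The thresholds are the illustrative values from this example, not fixed recommendations:

```python
def route_account(memberships, auto_threshold=0.75, blend_gap=0.10):
    """Turn one account's membership vector into a routing decision."""
    ranked = sorted(range(len(memberships)), key=lambda i: -memberships[i])
    top, second = memberships[ranked[0]], memberships[ranked[1]]
    if top >= auto_threshold:
        return f"automation:cluster_{ranked[0]}"          # confident single segment
    if top - second < blend_gap:
        return f"blended:cluster_{ranked[0]}+cluster_{ranked[1]}"  # mixed strategy
    return "manual_review"                                # diffuse: human triage

route_account([0.82, 0.12, 0.06])  # -> "automation:cluster_0"
route_account([0.48, 0.47, 0.05])  # -> "blended:cluster_0+cluster_1"
route_account([0.45, 0.30, 0.25])  # -> "manual_review"
```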

Step 7: Monitor drift and retrain only when needed.

  • Track distribution shifts in incoming feature data.
  • Watch aggregate changes in membership patterns over time.
  • Retrain with CMEANS on a defined cadence or when drift thresholds are breached.
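A lightweight drift signal can be built from aggregate membership shares. A minimal sketch, assuming two membership matrices scored against the same frozen centers:

```python
import numpy as np

def membership_drift(u_baseline, u_current):
    """Largest absolute change in per-cluster mean membership between two batches."""
    base = u_baseline.mean(axis=1)  # baseline share of each segment
    cur = u_current.mean(axis=1)    # current share of each segment
    return float(np.abs(cur - base).max())

baseline = np.array([[0.6, 0.7, 0.5], [0.4, 0.3, 0.5]])
shifted  = np.array([[0.2, 0.3, 0.1], [0.8, 0.7, 0.9]])
membership_drift(baseline, shifted)  # large value -> consider retraining with CMEANS
```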

This workflow is more effective than a traditional hard-assignment spreadsheet approach because it preserves uncertainty, supports transparent governance, and decouples model training from production scoring. The training/scoring split between CMEANS and CMEANS_PREDICT is exactly what enables repeatable operations.

Figure 1: Fuzzy clustering overview: (left) two overlapping clusters with soft memberships, (right) membership profiles for new points scored against fixed centers.

How to Choose

In this category, tool selection is straightforward and should align with lifecycle stage: train versus score.

  • Choose CMEANS when cluster centers are unknown and must be learned from a representative dataset.
  • Choose CMEANS_PREDICT when centers already exist and the task is to assign memberships for new observations consistently.

The decision process can be represented as:

graph TD
    A[Start: Need fuzzy cluster memberships] --> B{Do you already have trained centers?}
    B -- No --> C[Use CMEANS to learn centers and memberships]
    B -- Yes --> D[Use CMEANS_PREDICT to score new data]
    C --> E{Are segment definitions stable and approved?}
    E -- Yes --> D
    E -- No --> C

Comparison guidance:

CMEANS
  • Primary job: train fuzzy clusters.
  • Key inputs: data, c, m, error, maxiter.
  • Main outputs: cntr, u, d, jm, p, fpc.
  • Strengths: learns structure from scratch; provides full diagnostics and centers.
  • Trade-offs: requires hyperparameter tuning; retraining can change segment definitions.

CMEANS_PREDICT
  • Primary job: score new points against fixed centers.
  • Key inputs: test_data, cntr_trained, m, error, maxiter.
  • Main outputs: u, d, jm, p, fpc.
  • Strengths: consistent production scoring; no center drift during inference.
  • Trade-offs: cannot discover new cluster structure; quality depends on trained centers.

A practical rule set for analysts:

  1. Begin with CMEANS during model development and periodic recalibration.
  2. Move to CMEANS_PREDICT for recurring operational scoring.
  3. Revisit CMEANS only when data drift, strategy shifts, or performance diagnostics justify retraining.

Parameter-selection tips across both tools:

  • Cluster count (c): start from business archetypes, then test interpretability and stability.
  • Fuzziness (m): use around 2 as baseline; increase if memberships are too brittle, decrease if too diffuse.
  • Convergence controls (error, maxiter): tighter error improves precision but can increase runtime.
  • Data orientation: ensure matrix shape is features x samples to match function expectations.
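Spreadsheet data usually arrives as samples x features (one row per record), so a transpose is typically needed before calling CMEANS. A minimal sketch:

```python
import numpy as np

# Spreadsheet-style layout: 4 accounts (rows) x 3 features (columns)
records = np.array([[0.8, 2.0, 0.7],
                    [0.3, 9.0, 0.1],
                    [0.9, 1.0, 0.8],
                    [0.2, 8.0, 0.2]])

data = records.T  # features x samples, as the calculators expect
data.shape        # (3, 4): 3 features, 4 samples
```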

Common implementation mistakes and remedies:

  • Mistake: interpreting low fpc alone as model failure. Remedy: pair it with business interpretability and membership usability.
  • Mistake: retraining constantly and losing segment continuity. Remedy: separate training and scoring phases with center governance.
  • Mistake: skipping feature scaling. Remedy: standardize before clustering to avoid unit-dominated distances.

If the business goal is exploratory discovery, prioritize CMEANS. If the goal is repeatable production assignment, prioritize CMEANS_PREDICT with validated centers. Most mature deployments use both in sequence: learn, govern, score, monitor, then retrain on policy.

CMEANS

Fuzzy c-means clustering allows each data point to belong to multiple clusters with varying degrees of membership. This is a soft clustering approach, in contrast to k-means, where each data point belongs to exactly one cluster.

The algorithm minimizes an objective function to find cluster centers (centroids) and the fuzzy partition matrix. It handles high-dimensional datasets and overlapping clusters well, returning properties like the cluster centers, partition matrix, and objective function history.

Excel Usage

=CMEANS(data, c, m, error, maxiter)
  • data (list[list], required): 2D array of data to be clustered, where rows are features and columns are samples (S x N). Note this is transposed relative to the typical scikit-learn layout of samples x features (N x S).
  • c (int, required): Desired number of clusters or classes.
  • m (float, required): Array exponentiation applied to the membership function (fuzziness parameter, typically 2.0).
  • error (float, required): Stopping criterion; stop early if the change in partition matrix is less than this error (e.g., 0.005).
  • maxiter (int, required): Maximum number of iterations allowed.

Returns (dict): Dictionary of clustering results, including cluster centers and the fuzzy partition matrix.

Example 1: Fuzzy c-means clustering of 2D data

Inputs:

  • data: {1, 2; 1, 2}
  • c: 1
  • m: 2
  • error: 0.005
  • maxiter: 2

Excel formula:

=CMEANS({1,2;1,2}, 1, 2, 0.005, 2)

Expected output:

{"type":"Double","basicValue":1,"properties":{"fpc":{"type":"Double","basicValue":1},"cntr":{"type":"Array","elements":[[{"type":"Double","basicValue":1.5},{"type":"Double","basicValue":1.5}]]},"u":{"type":"Array","elements":[[{"type":"Double","basicValue":1},{"type":"Double","basicValue":1}]]},"d":{"type":"Array","elements":[[{"type":"Double","basicValue":0.707107},{"type":"Double","basicValue":0.707107}]]},"jm":{"type":"Array","elements":[[{"type":"Double","basicValue":1}]]},"p":{"type":"Double","basicValue":1}}}

Python Code

import numpy as np
from skfuzzy import cmeans as fuzz_cmeans

def cmeans(data, c, m, error, maxiter):
    """
    Perform fuzzy c-means clustering on data.

    See: https://pythonhosted.org/scikit-fuzzy/api/skfuzzy.html#skfuzzy.cmeans

    This example function is provided as-is without any representation of accuracy.

    Args:
        data (list[list]): 2D array of data to be clustered, where rows are features and columns are samples (S x N). Note this is transposed relative to the typical scikit-learn layout of samples x features (N x S).
        c (int): Desired number of clusters or classes.
        m (float): Array exponentiation applied to the membership function (fuzziness parameter, typically 2.0).
        error (float): Stopping criterion; stop early if the change in partition matrix is less than this error (e.g., 0.005).
        maxiter (int): Maximum number of iterations allowed.

    Returns:
        dict: Dictionary of clustering results, including cluster centers and the fuzzy partition matrix.
    """
    try:
        data_np = np.array(data, dtype=float)
        if data_np.ndim != 2:
            return "Error: data must be a 2D array"

        cntr, u, u0, d, jm, p, fpc = fuzz_cmeans(
            data=data_np,
            c=c,
            m=m,
            error=error,
            maxiter=maxiter
        )

        return {
            "type": "Double",
            "basicValue": float(fpc),
            "properties": {
                "fpc": {"type": "Double", "basicValue": float(fpc)},
                "cntr": {
                    "type": "Array",
                    "elements": [[{"type": "Double", "basicValue": float(val)} for val in row] for row in cntr]
                },
                "u": {
                    "type": "Array",
                    "elements": [[{"type": "Double", "basicValue": float(val)} for val in row] for row in u]
                },
                "d": {
                    "type": "Array",
                    "elements": [[{"type": "Double", "basicValue": float(val)} for val in row] for row in d]
                },
                "jm": {
                    "type": "Array",
                    "elements": [[{"type": "Double", "basicValue": float(val)}] for val in jm]
                },
                "p": {"type": "Double", "basicValue": float(p)}
            }
        }
    except Exception as e:
        return f"Error: {str(e)}"

Online Calculator


CMEANS_PREDICT

Predicts cluster memberships for new data within a previously trained fuzzy c-means framework. The algorithm repeats the clustering iterations with cluster centers held fixed, efficiently finding the fuzzy membership of all new data points.

Excel Usage

=CMEANS_PREDICT(test_data, cntr_trained, m, error, maxiter)
  • test_data (list[list], required): 2D array of new data to predict (features x samples S x N).
  • cntr_trained (list[list], required): Trained cluster centers from a prior CMEANS run.
  • m (float, required): Array exponentiation applied to the membership function.
  • error (float, required): Stopping criterion.
  • maxiter (int, required): Maximum number of iterations allowed.

Returns (dict): Dictionary of prediction results, including the fuzzy partition matrix.

Example 1: Fuzzy c-means prediction on new data

Inputs:

  • test_data: {1.5, 8.5; 1.5, 8.5}
  • cntr_trained: {1.5, 1.5; 8.5, 8.5}
  • m: 2
  • error: 0.005
  • maxiter: 2

Excel formula:

=CMEANS_PREDICT({1.5,8.5;1.5,8.5}, {1.5,1.5;8.5,8.5}, 2, 0.005, 2)

Expected output:

{"type":"Double","basicValue":1,"properties":{"fpc":{"type":"Double","basicValue":1},"u":{"type":"Array","elements":[[{"type":"Double","basicValue":1},{"type":"Double","basicValue":4.93038e-32}],[{"type":"Double","basicValue":4.93038e-32},{"type":"Double","basicValue":1}]]},"d":{"type":"Array","elements":[[{"type":"Double","basicValue":2.22045e-16},{"type":"Double","basicValue":9.89949}],[{"type":"Double","basicValue":9.89949},{"type":"Double","basicValue":2.22045e-16}]]},"jm":{"type":"Array","elements":[[{"type":"Double","basicValue":73.5597}],[{"type":"Double","basicValue":9.76215e-30}]]},"p":{"type":"Double","basicValue":2}}}

Python Code

import numpy as np
from skfuzzy import cmeans_predict as fuzz_cmeans_predict

def cmeans_predict(test_data, cntr_trained, m, error, maxiter):
    """
    Predict cluster membership for new data given a trained fuzzy c-means model.

    See: https://pythonhosted.org/scikit-fuzzy/api/skfuzzy.html#skfuzzy.cmeans_predict

    This example function is provided as-is without any representation of accuracy.

    Args:
        test_data (list[list]): 2D array of new data to predict (features x samples S x N).
        cntr_trained (list[list]): Trained cluster centers from a prior CMEANS run.
        m (float): Array exponentiation applied to the membership function.
        error (float): Stopping criterion.
        maxiter (int): Maximum number of iterations allowed.

    Returns:
        dict: Dictionary of prediction results, including the fuzzy partition matrix.
    """
    try:
        test_data_np = np.array(test_data, dtype=float)
        if test_data_np.ndim != 2:
            return "Error: test_data must be a 2D array"

        cntr_trained_np = np.array(cntr_trained, dtype=float)

        u, u0, d, jm, p, fpc = fuzz_cmeans_predict(
            test_data=test_data_np,
            cntr_trained=cntr_trained_np,
            m=m,
            error=error,
            maxiter=maxiter
        )

        return {
            "type": "Double",
            "basicValue": float(fpc),
            "properties": {
                "fpc": {"type": "Double", "basicValue": float(fpc)},
                "u": {
                    "type": "Array",
                    "elements": [[{"type": "Double", "basicValue": float(val)} for val in row] for row in u]
                },
                "d": {
                    "type": "Array",
                    "elements": [[{"type": "Double", "basicValue": float(val)} for val in row] for row in d]
                },
                "jm": {
                    "type": "Array",
                    "elements": [[{"type": "Double", "basicValue": float(val)}] for val in jm]
                },
                "p": {"type": "Double", "basicValue": float(p)}
            }
        }
    except Exception as e:
        return f"Error: {str(e)}"

Online Calculator
