TEXT_DISTANCE

Overview

The TEXT_DISTANCE function provides advanced fuzzy matching for text data using the Python textdistance library (documentation ). It supports a variety of algorithms, including edit distance, token-based, sequence-based, and phonetic algorithms, to calculate the similarity between strings. This is useful for tasks such as deduplication, record linkage, and finding approximate matches in business data (e.g., customer names, product catalogs, addresses).

For example, the Jaccard similarity between two sets $A$ and $B$ is defined as:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

Other algorithms, such as Levenshtein distance, compute the minimum number of single-character edits required to change one string into another. See the textdistance documentation for more details on supported algorithms.

This example function is provided as-is without any representation of accuracy.

Usage

To use the TEXT_DISTANCE function in Excel, enter it as a formula in a cell, specifying your lookup value(s), lookup array, algorithm, and top_n:


=TEXT_DISTANCE(lookup_value, lookup_array, [algorithm], [top_n])

lookup_value (str or list, required): String or 2D list of strings to compare with the lookup_array.
lookup_array (list, required): 2D list of strings to compare with the lookup_value.
algorithm (string, optional, default=“jaccard”): Similarity algorithm to use. See documentation for options.
top_n (int, optional, default=1): Number of top matches to return for each lookup_value.

The function returns, for each lookup_value, a 2D list of [position, score, …] for the top N matches.

Examples

Example 1: Fuzzy match a product name in a catalog

Find products with names similar to ‘apple’ in your product catalog.

Inputs:

lookup_value
apple

lookup_array
appl
banana
orange
grape

Excel formula:


=TEXT_DISTANCE({"apple"}, {"appl","banana","orange","grape"})

Expected output:

Position	Score
1	0.8

This means the closest match to ‘apple’ is ‘appl’ with a similarity score of 0.8.

Example 2: Find similar customer names using Jaro-Winkler

Find customers with names similar to ‘Johnson’ in your customer database.

Inputs:

lookup_value
Johnson

lookup_array
Johnsen
Jonson
Johanson
Smith
Jonsen

Excel formula:


=TEXT_DISTANCE("Johnson", {"Johnsen","Jonson","Johanson","Smith","Jonsen"}, "jaro_winkler", 3)

Expected output:

Position	Score	Position	Score	Position	Score
1	0.97	2	0.96	3	0.93

This means the top 3 matches for ‘Johnson’ are ‘Johnsen’, ‘Jonson’, and ‘Johanson’ with similarity scores of 0.97, 0.96, and 0.93, respectively.

Example 3: Match multiple product names using Levenshtein distance

Find matches for multiple product names using Levenshtein distance.

Inputs:

lookup_value
aple
banaa

lookup_array
apple
banana
orange
grape

Excel formula:


=TEXT_DISTANCE({"aple","banaa"}, {"apple","banana","orange","grape"}, "levenshtein", 2)

Expected output:

Needle	Position	Score	Position	Score
aple	1	0.8	4	0.4
banaa	2	0.83	3	0.33

This means ‘aple’ matches best with ‘apple’ (0.8) and ‘grape’ (0.4), while ‘banaa’ matches best with ‘banana’ (0.83) and ‘orange’ (0.33).

Example 4: Match addresses between two systems

Match addresses in your CRM with addresses in your billing system.

Inputs:

lookup_value
123 Main St
456 Oak Ave

lookup_array
123 Main Street
456 Oak Avenue
789 Pine Blvd
321 Elm Street

Excel formula:


=TEXT_DISTANCE({"123 Main St","456 Oak Ave"}, {"123 Main Street","456 Oak Avenue","789 Pine Blvd","321 Elm Street"}, "ratcliff_obershelp", 1)

Expected output:

Needle	Position	Score
123 Main St	1	0.85
456 Oak Ave	2	0.88

This means ‘123 Main St’ matches best with ‘123 Main Street’ (0.85), and ‘456 Oak Ave’ matches best with ‘456 Oak Avenue’ (0.88).

Similarity Algorithms

jaccard (default)
jaro_winkler
levenshtein
ratcliff_obershelp
See textdistance documentation for more options.

Python Code


import micropip
await micropip.install('textdistance')
import textdistance
 
def text_distance(lookup_value, lookup_array, algorithm='jaccard', top_n=1):
    """Calculate text similarity scores between lookup_value(s) and lookup_array items using the specified algorithm.
 
    Args:
        lookup_value: String or 2D list of strings to compare with lookup_array.
        lookup_array: 2D list of strings to compare with lookup_value.
        algorithm: Similarity algorithm to use from textdistance library. Default is 'jaccard'.
        top_n: Number of top matches to return for each lookup_value. Default is 1.
 
    Returns:
        2D list: Each row is a flat list of [position, score, ...] for a needle.
 
    This example function is provided as-is without any representation of accuracy.
    """
    try:
        algo_func = getattr(textdistance, algorithm)
    except AttributeError:
        return f"Error: Algorithm '{algorithm}' not found in textdistance."
    # Normalize input to a list of lists
    if isinstance(lookup_value, str):
        needles = [[lookup_value]]
    else:
        needles = lookup_value
    haystack_flat = [item for sublist in lookup_array for item in sublist if item is not None]
    if not haystack_flat:
        return [[] for _ in needles] if needles else []
    results = []
    for needle_group in needles:
        for needle_item in needle_group:
            if not str(needle_item).strip():
                results.append([])
                continue
            scores = [(index + 1, round(algo_func.normalized_similarity(str(needle_item), str(item)), 2))
                      for index, item in enumerate(haystack_flat)]
            scores.sort(key=lambda x: x[1], reverse=True)
            row = []
            for score in scores[:top_n]:
                row.extend(list(score))
            results.append(row)
    return results

Live Notebook

Edit this function in a live notebook .

TEXT_DISTANCE

Overview

Usage

Examples

Similarity Algorithms

Python Code

Live Notebook

Live Demo