Skip to Content

TEXT_DISTANCE

Overview

The TEXT_DISTANCE function provides advanced fuzzy matching for text data using the Python textdistance library (documentation). It supports a variety of algorithms, including edit distance, token-based, sequence-based, and phonetic algorithms, to calculate the similarity between strings. This is useful for tasks such as deduplication, record linkage, and finding approximate matches in business data (e.g., customer names, product catalogs, addresses).

For example, the Jaccard similarity between two sets AA and BB is defined as:

J(A,B)=ABABJ(A, B) = \frac{|A \cap B|}{|A \cup B|}

Other algorithms, such as Levenshtein distance, compute the minimum number of single-character edits required to change one string into another. See the textdistance documentation for more details on supported algorithms.

This example function is provided as-is without any representation of accuracy.

Usage

To use the TEXT_DISTANCE function in Excel, enter it as a formula in a cell, specifying your lookup value(s), lookup array, algorithm, and top_n:

=TEXT_DISTANCE(lookup_value, lookup_array, [algorithm], [top_n])
  • lookup_value (str or list, required): String or 2D list of strings to compare with the lookup_array.
  • lookup_array (list, required): 2D list of strings to compare with the lookup_value.
  • algorithm (string, optional, default=“jaccard”): Similarity algorithm to use. See documentation for options.
  • top_n (int, optional, default=1): Number of top matches to return for each lookup_value.

The function returns, for each lookup_value, a 2D list of [position, score, …] for the top N matches.

Examples

Example 1: Fuzzy match a product name in a catalog

Find products with names similar to ‘apple’ in your product catalog.

Inputs:

lookup_value
apple
lookup_array
appl
banana
orange
grape

Excel formula:

=TEXT_DISTANCE({"apple"}, {"appl","banana","orange","grape"})

Expected output:

PositionScore
10.8

This means the closest match to ‘apple’ is ‘appl’ with a similarity score of 0.8.

Example 2: Find similar customer names using Jaro-Winkler

Find customers with names similar to ‘Johnson’ in your customer database.

Inputs:

lookup_value
Johnson
lookup_array
Johnsen
Jonson
Johanson
Smith
Jonsen

Excel formula:

=TEXT_DISTANCE("Johnson", {"Johnsen","Jonson","Johanson","Smith","Jonsen"}, "jaro_winkler", 3)

Expected output:

PositionScorePositionScorePositionScore
10.9720.9630.93

This means the top 3 matches for ‘Johnson’ are ‘Johnsen’, ‘Jonson’, and ‘Johanson’ with similarity scores of 0.97, 0.96, and 0.93, respectively.

Example 3: Match multiple product names using Levenshtein distance

Find matches for multiple product names using Levenshtein distance.

Inputs:

lookup_value
aple
banaa
lookup_array
apple
banana
orange
grape

Excel formula:

=TEXT_DISTANCE({"aple","banaa"}, {"apple","banana","orange","grape"}, "levenshtein", 2)

Expected output:

NeedlePositionScorePositionScore
aple10.840.4
banaa20.8330.33

This means ‘aple’ matches best with ‘apple’ (0.8) and ‘grape’ (0.4), while ‘banaa’ matches best with ‘banana’ (0.83) and ‘orange’ (0.33).

Example 4: Match addresses between two systems

Match addresses in your CRM with addresses in your billing system.

Inputs:

lookup_value
123 Main St
456 Oak Ave
lookup_array
123 Main Street
456 Oak Avenue
789 Pine Blvd
321 Elm Street

Excel formula:

=TEXT_DISTANCE({"123 Main St","456 Oak Ave"}, {"123 Main Street","456 Oak Avenue","789 Pine Blvd","321 Elm Street"}, "ratcliff_obershelp", 1)

Expected output:

NeedlePositionScore
123 Main St10.85
456 Oak Ave20.88

This means ‘123 Main St’ matches best with ‘123 Main Street’ (0.85), and ‘456 Oak Ave’ matches best with ‘456 Oak Avenue’ (0.88).

Similarity Algorithms

Python Code

import micropip await micropip.install('textdistance') import textdistance def text_distance(lookup_value, lookup_array, algorithm='jaccard', top_n=1): """Calculate text similarity scores between lookup_value(s) and lookup_array items using the specified algorithm. Args: lookup_value: String or 2D list of strings to compare with lookup_array. lookup_array: 2D list of strings to compare with lookup_value. algorithm: Similarity algorithm to use from textdistance library. Default is 'jaccard'. top_n: Number of top matches to return for each lookup_value. Default is 1. Returns: 2D list: Each row is a flat list of [position, score, ...] for a needle. This example function is provided as-is without any representation of accuracy. """ try: algo_func = getattr(textdistance, algorithm) except AttributeError: return f"Error: Algorithm '{algorithm}' not found in textdistance." # Normalize input to a list of lists if isinstance(lookup_value, str): needles = [[lookup_value]] else: needles = lookup_value haystack_flat = [item for sublist in lookup_array for item in sublist if item is not None] if not haystack_flat: return [[] for _ in needles] if needles else [] results = [] for needle_group in needles: for needle_item in needle_group: if not str(needle_item).strip(): results.append([]) continue scores = [(index + 1, round(algo_func.normalized_similarity(str(needle_item), str(item)), 2)) for index, item in enumerate(haystack_flat)] scores.sort(key=lambda x: x[1], reverse=True) row = [] for score in scores[:top_n]: row.extend(list(score)) results.append(row) return results

Live Notebook

Edit this function in a live notebook.

Live Demo

Last updated on