TEXT_DISTANCE
Overview
The TEXT_DISTANCE
function provides advanced fuzzy matching for text data using the Python textdistance library (documentation ). It supports a variety of algorithms, including edit distance, token-based, sequence-based, and phonetic algorithms, to calculate the similarity between strings. This is useful for tasks such as deduplication, record linkage, and finding approximate matches in business data (e.g., customer names, product catalogs, addresses).
For example, the Jaccard similarity between two sets and is defined as:
Other algorithms, such as Levenshtein distance, compute the minimum number of single-character edits required to change one string into another. See the textdistance documentation for more details on supported algorithms.
This example function is provided as-is without any representation of accuracy.
Usage
To use the TEXT_DISTANCE
function in Excel, enter it as a formula in a cell, specifying your lookup value(s), lookup array, algorithm, and top_n:
=TEXT_DISTANCE(lookup_value, lookup_array, [algorithm], [top_n])
lookup_value
(str or list, required): String or 2D list of strings to compare with the lookup_array.lookup_array
(list, required): 2D list of strings to compare with the lookup_value.algorithm
(string, optional, default=“jaccard”): Similarity algorithm to use. See documentation for options.top_n
(int, optional, default=1): Number of top matches to return for each lookup_value.
The function returns, for each lookup_value, a 2D list of [position, score, …] for the top N matches.
Examples
Example 1: Fuzzy match a product name in a catalog
Find products with names similar to ‘apple’ in your product catalog.
Inputs:
lookup_value |
---|
apple |
lookup_array |
---|
appl |
banana |
orange |
grape |
Excel formula:
=TEXT_DISTANCE({"apple"}, {"appl","banana","orange","grape"})
Expected output:
Position | Score |
---|---|
1 | 0.8 |
This means the closest match to ‘apple’ is ‘appl’ with a similarity score of 0.8.
Example 2: Find similar customer names using Jaro-Winkler
Find customers with names similar to ‘Johnson’ in your customer database.
Inputs:
lookup_value |
---|
Johnson |
lookup_array |
---|
Johnsen |
Jonson |
Johanson |
Smith |
Jonsen |
Excel formula:
=TEXT_DISTANCE("Johnson", {"Johnsen","Jonson","Johanson","Smith","Jonsen"}, "jaro_winkler", 3)
Expected output:
Position | Score | Position | Score | Position | Score |
---|---|---|---|---|---|
1 | 0.97 | 2 | 0.96 | 3 | 0.93 |
This means the top 3 matches for ‘Johnson’ are ‘Johnsen’, ‘Jonson’, and ‘Johanson’ with similarity scores of 0.97, 0.96, and 0.93, respectively.
Example 3: Match multiple product names using Levenshtein distance
Find matches for multiple product names using Levenshtein distance.
Inputs:
lookup_value |
---|
aple |
banaa |
lookup_array |
---|
apple |
banana |
orange |
grape |
Excel formula:
=TEXT_DISTANCE({"aple","banaa"}, {"apple","banana","orange","grape"}, "levenshtein", 2)
Expected output:
Needle | Position | Score | Position | Score |
---|---|---|---|---|
aple | 1 | 0.8 | 4 | 0.4 |
banaa | 2 | 0.83 | 3 | 0.33 |
This means ‘aple’ matches best with ‘apple’ (0.8) and ‘grape’ (0.4), while ‘banaa’ matches best with ‘banana’ (0.83) and ‘orange’ (0.33).
Example 4: Match addresses between two systems
Match addresses in your CRM with addresses in your billing system.
Inputs:
lookup_value |
---|
123 Main St |
456 Oak Ave |
lookup_array |
---|
123 Main Street |
456 Oak Avenue |
789 Pine Blvd |
321 Elm Street |
Excel formula:
=TEXT_DISTANCE({"123 Main St","456 Oak Ave"}, {"123 Main Street","456 Oak Avenue","789 Pine Blvd","321 Elm Street"}, "ratcliff_obershelp", 1)
Expected output:
Needle | Position | Score |
---|---|---|
123 Main St | 1 | 0.85 |
456 Oak Ave | 2 | 0.88 |
This means ‘123 Main St’ matches best with ‘123 Main Street’ (0.85), and ‘456 Oak Ave’ matches best with ‘456 Oak Avenue’ (0.88).
Similarity Algorithms
- jaccard (default)
- jaro_winkler
- levenshtein
- ratcliff_obershelp
- See textdistance documentation for more options.
Python Code
import micropip
await micropip.install('textdistance')
import textdistance
def text_distance(lookup_value, lookup_array, algorithm='jaccard', top_n=1):
"""Calculate text similarity scores between lookup_value(s) and lookup_array items using the specified algorithm.
Args:
lookup_value: String or 2D list of strings to compare with lookup_array.
lookup_array: 2D list of strings to compare with lookup_value.
algorithm: Similarity algorithm to use from textdistance library. Default is 'jaccard'.
top_n: Number of top matches to return for each lookup_value. Default is 1.
Returns:
2D list: Each row is a flat list of [position, score, ...] for a needle.
This example function is provided as-is without any representation of accuracy.
"""
try:
algo_func = getattr(textdistance, algorithm)
except AttributeError:
return f"Error: Algorithm '{algorithm}' not found in textdistance."
# Normalize input to a list of lists
if isinstance(lookup_value, str):
needles = [[lookup_value]]
else:
needles = lookup_value
haystack_flat = [item for sublist in lookup_array for item in sublist if item is not None]
if not haystack_flat:
return [[] for _ in needles] if needles else []
results = []
for needle_group in needles:
for needle_item in needle_group:
if not str(needle_item).strip():
results.append([])
continue
scores = [(index + 1, round(algo_func.normalized_similarity(str(needle_item), str(item)), 2))
for index, item in enumerate(haystack_flat)]
scores.sort(key=lambda x: x[1], reverse=True)
row = []
for score in scores[:top_n]:
row.extend(list(score))
results.append(row)
return results
Live Notebook
Edit this function in a live notebook .