TEXT_DISTANCE
Overview
This function demonstrates fuzzy matching techniques using the Python textdistance library. It implements various algorithms including edit distance, token-based, sequence-based, and phonetic algorithms to calculate the similarity between strings.
Usage
Compares a lookup_value
with each item in a lookup_array
and returns the top_n
closest matches along with their normalized similarity scores (between 0 and 1, higher is more similar).
=TEXT_DISTANCE(lookup_value, lookup_array, [algorithm], [top_n])
Arguments:
Argument | Type | Description |
---|---|---|
lookup_value | string or 2D list | String(s) to compare with the strings in the lookup_array . |
lookup_array | 2D list | A list of strings to compare with the lookup_value . |
algorithm | string | Specifies the similarity algorithm to use. Default: ‘jaccard’. |
top_n | int | The number of top matches to return for each lookup_value . Default: 1. |
Returns a 2D list where each inner list contains the top_n
matches for the corresponding lookup_value
. Each match is represented as [index, similarity_score]
. The matches are ordered by similarity score (highest first). The index is 1-based.
Examples
1. Finding Products with Similar Names
Find products with names similar to ‘apple’ in your product catalog.
=TEXT_DISTANCE({"apple"}, {"appl"; "banana"; "orange"; "grape"})
Output: {1, 0.8}
The output indicates that “appl” (at position 1) is the closest match with a similarity score of 0.8 using the default Jaccard algorithm.
2. Customer Name Matching
Find customers with names similar to ‘Johnson’ in your customer database.
=TEXT_DISTANCE("Johnson", {"Johnsen"; "Jonson"; "Johanson"; "Smith"; "Jonsen"}, "jaro_winkler", 3)
Output: A range showing the top 3 matches with their positions and similarity scores.
3. Multiple Product Name Matching
Find matches for multiple product names using Levenshtein distance.
=TEXT_DISTANCE({"aple", "banaa"}, {"apple"; "banana"; "orange"; "grape"}, "levenshtein", 2)
Output: A range with the top 2 matches for each input term, showing positions and similarity scores.
4. Address Fuzzy Matching
Match addresses in your CRM with addresses in your billing system.
=TEXT_DISTANCE({"123 Main St"; "456 Oak Ave"}, {"123 Main Street"; "456 Oak Avenue"; "789 Pine Blvd"; "321 Elm Street"}, "ratcliff_obershelp", 1)
Output: A range showing the best match for each address with position and similarity score.
Similarity Algorithms
The similarity algorithms available in textdistance
are given in the tables below.
Edit Distance
Algorithm | Description |
---|---|
damerau_levenshtein | Similar to Levenshtein but considers transpositions as a single edit. |
hamming | Measures the number of positions at which the corresponding symbols are different. |
levenshtein | Calculates the minimum number of single-character edits required to change one word into the other. |
jaro | Measures similarity between two strings, giving more weight to common prefixes. |
jaro_winkler | An extension of Jaro, giving more weight to strings that match from the beginning. |
lcsseq | Measures the longest common subsequence. |
lcsstr | Measures the longest common substring. |
ratcliff_obershelp | Measures similarity based on the longest common subsequence. |
strcmp95 | A string comparison algorithm developed by the U.S. Census Bureau. |
needleman_wunsch | A dynamic programming algorithm for sequence alignment. |
smith_waterman | A dynamic programming algorithm for local sequence alignment. |
gotoh | An extension of Needleman-Wunsch with affine gap penalties. |
Token
Algorithm | Description |
---|---|
cosine | Measures the cosine of the angle between two non-zero vectors. |
jaccard | Measures similarity between finite sample sets. |
overlap | Measures the overlap coefficient between two sets. |
sorensen | Measures similarity between two sets, based on the size of the intersection divided by the size of the union. |
sorensen_dice | Similar to Sorensen, but uses Dice’s coefficient. |
dice | Another name for Sorensen-Dice coefficient. |
tversky | A generalization of the Jaccard index. |
Sequence
Algorithm | Description |
---|---|
bag | Measures bag similarity between two sequences |
mlipns | Measures similarity using the MLIPNS algorithm |
monge_elkan | A hybrid algorithm combining multiple similarity measures. |
Phonetic
Algorithm | Description |
---|---|
mra | Measures similarity using the MRA algorithm |
editex | Measures similarity using the Editex algorithm |
Source Code
import textdistance
def text_distance(needle, haystack, algorithm='jaccard', top_n=1):
"""Calculate text similarity scores between needle(s) and haystack items.
Args:
needle: String or 2D list of strings to search for
haystack: 2D list of strings to search within
algorithm (str): Algorithm name from textdistance library (default: 'jaccard')
top_n (int): Number of top matches to return (default: 1).
Returns:
list: For each needle, a flat list of [position, score, position, score, ...] for the top N matches (row format).
"""
algo_func = getattr(textdistance, algorithm)
# Handle needle as either string or 2D list
if isinstance(needle, str):
needle_flat = [needle] if needle.strip() else []
else:
# Flatten 2D lists and filter out None values
needle_flat = [item for sublist in needle for item in sublist if item is not None]
haystack_flat = [item for sublist in haystack for item in sublist if item is not None]
if not haystack_flat:
return [[] for _ in needle_flat] if needle_flat else []
results = []
for needle_item in needle_flat:
if not str(needle_item).strip():
results.append([])
continue
scores = [(index + 1, round(algo_func.normalized_similarity(str(needle_item), str(item)), 2))
for index, item in enumerate(haystack_flat)]
scores.sort(key=lambda x: x[1], reverse=True)
# Flatten the top matches into a single row
row = []
for score in scores[:top_n]:
row.extend(list(score)) # [position, score, ...]
results.append(row)
# If only one needle, return just the row for that needle
if len(results) == 1:
return results[0]
return results