Skip to Content

TEXT_DISTANCE

Overview

This function demonstrates fuzzy matching techniques using the Python textdistance library. It implements various algorithms including edit distance, token-based, sequence-based, and phonetic algorithms to calculate the similarity between strings.

Usage

Compares a lookup_value with each item in a lookup_array and returns the top_n closest matches along with their normalized similarity scores (between 0 and 1, higher is more similar).

=TEXT_DISTANCE(lookup_value, lookup_array, [algorithm], [top_n])

Arguments:

ArgumentTypeDescription
lookup_valuestring or 2D listString(s) to compare with the strings in the lookup_array.
lookup_array2D listA list of strings to compare with the lookup_value.
algorithmstringSpecifies the similarity algorithm to use. Default: ‘jaccard’.
top_nintThe number of top matches to return for each lookup_value. Default: 1.

Returns a 2D list where each inner list contains the top_n matches for the corresponding lookup_value. Each match is represented as [index, similarity_score]. The matches are ordered by similarity score (highest first). The index is 1-based.

Examples

1. Finding Products with Similar Names

Find products with names similar to ‘apple’ in your product catalog.

=TEXT_DISTANCE({"apple"}, {"appl"; "banana"; "orange"; "grape"})

Output: {1, 0.8}

The output indicates that “appl” (at position 1) is the closest match with a similarity score of 0.8 using the default Jaccard algorithm.

2. Customer Name Matching

Find customers with names similar to ‘Johnson’ in your customer database.

=TEXT_DISTANCE("Johnson", {"Johnsen"; "Jonson"; "Johanson"; "Smith"; "Jonsen"}, "jaro_winkler", 3)

Output: A range showing the top 3 matches with their positions and similarity scores.

3. Multiple Product Name Matching

Find matches for multiple product names using Levenshtein distance.

=TEXT_DISTANCE({"aple", "banaa"}, {"apple"; "banana"; "orange"; "grape"}, "levenshtein", 2)

Output: A range with the top 2 matches for each input term, showing positions and similarity scores.

4. Address Fuzzy Matching

Match addresses in your CRM with addresses in your billing system.

=TEXT_DISTANCE({"123 Main St"; "456 Oak Ave"}, {"123 Main Street"; "456 Oak Avenue"; "789 Pine Blvd"; "321 Elm Street"}, "ratcliff_obershelp", 1)

Output: A range showing the best match for each address with position and similarity score.

Similarity Algorithms

The similarity algorithms available in textdistance are given in the tables below.

Edit Distance

AlgorithmDescription
damerau_levenshteinSimilar to Levenshtein but considers transpositions as a single edit.
hammingMeasures the number of positions at which the corresponding symbols are different.
levenshteinCalculates the minimum number of single-character edits required to change one word into the other.
jaroMeasures similarity between two strings, giving more weight to common prefixes.
jaro_winklerAn extension of Jaro, giving more weight to strings that match from the beginning.
lcsseqMeasures the longest common subsequence.
lcsstrMeasures the longest common substring.
ratcliff_obershelpMeasures similarity based on the longest common subsequence.
strcmp95A string comparison algorithm developed by the U.S. Census Bureau.
needleman_wunschA dynamic programming algorithm for sequence alignment.
smith_watermanA dynamic programming algorithm for local sequence alignment.
gotohAn extension of Needleman-Wunsch with affine gap penalties.

Token

AlgorithmDescription
cosineMeasures the cosine of the angle between two non-zero vectors.
jaccardMeasures similarity between finite sample sets.
overlapMeasures the overlap coefficient between two sets.
sorensenMeasures similarity between two sets, based on the size of the intersection divided by the size of the union.
sorensen_diceSimilar to Sorensen, but uses Dice’s coefficient.
diceAnother name for Sorensen-Dice coefficient.
tverskyA generalization of the Jaccard index.

Sequence

AlgorithmDescription
bagMeasures bag similarity between two sequences
mlipnsMeasures similarity using the MLIPNS algorithm
monge_elkanA hybrid algorithm combining multiple similarity measures. ME(a,b)ME(a,b)

Phonetic

AlgorithmDescription
mraMeasures similarity using the MRA algorithm
editexMeasures similarity using the Editex algorithm

Source Code

import textdistance def text_distance(needle, haystack, algorithm='jaccard', top_n=1): """Calculate text similarity scores between needle(s) and haystack items. Args: needle: String or 2D list of strings to search for haystack: 2D list of strings to search within algorithm (str): Algorithm name from textdistance library (default: 'jaccard') top_n (int): Number of top matches to return (default: 1). Returns: list: For each needle, a flat list of [position, score, position, score, ...] for the top N matches (row format). """ algo_func = getattr(textdistance, algorithm) # Handle needle as either string or 2D list if isinstance(needle, str): needle_flat = [needle] if needle.strip() else [] else: # Flatten 2D lists and filter out None values needle_flat = [item for sublist in needle for item in sublist if item is not None] haystack_flat = [item for sublist in haystack for item in sublist if item is not None] if not haystack_flat: return [[] for _ in needle_flat] if needle_flat else [] results = [] for needle_item in needle_flat: if not str(needle_item).strip(): results.append([]) continue scores = [(index + 1, round(algo_func.normalized_similarity(str(needle_item), str(item)), 2)) for index, item in enumerate(haystack_flat)] scores.sort(key=lambda x: x[1], reverse=True) # Flatten the top matches into a single row row = [] for score in scores[:top_n]: row.extend(list(score)) # [position, score, ...] results.append(row) # If only one needle, return just the row for that needle if len(results) == 1: return results[0] return results
Last updated on