TEXT_DISTANCE

Overview

This function demonstrates fuzzy matching techniques using the Python textdistance library. It implements various algorithms including edit distance, token-based, sequence-based, and phonetic algorithms to calculate the similarity between strings.

Usage

Compares a lookup_value with each item in a lookup_array and returns the top_n closest matches along with their normalized similarity scores (between 0 and 1, higher is more similar).


=TEXT_DISTANCE(lookup_value, lookup_array, [algorithm], [top_n])

Arguments:

Argument	Type	Description
`lookup_value`	string or 2D list	String(s) to compare with the strings in the `lookup_array`.
`lookup_array`	2D list	A list of strings to compare with the `lookup_value`.
`algorithm`	string	Specifies the similarity algorithm to use. Default: ‘jaccard’.
`top_n`	int	The number of top matches to return for each `lookup_value`. Default: 1.

Returns a 2D list where each inner list contains the top_n matches for the corresponding lookup_value. Each match is represented as [index, similarity_score]. The matches are ordered by similarity score (highest first). The index is 1-based.

Examples

1. Finding Products with Similar Names

Find products with names similar to ‘apple’ in your product catalog.


=TEXT_DISTANCE({"apple"}, {"appl"; "banana"; "orange"; "grape"})

Output: {1, 0.8}

The output indicates that “appl” (at position 1) is the closest match with a similarity score of 0.8 using the default Jaccard algorithm.

2. Customer Name Matching

Find customers with names similar to ‘Johnson’ in your customer database.


=TEXT_DISTANCE("Johnson", {"Johnsen"; "Jonson"; "Johanson"; "Smith"; "Jonsen"}, "jaro_winkler", 3)

Output: A range showing the top 3 matches with their positions and similarity scores.

3. Multiple Product Name Matching

Find matches for multiple product names using Levenshtein distance.


=TEXT_DISTANCE({"aple", "banaa"}, {"apple"; "banana"; "orange"; "grape"}, "levenshtein", 2)

Output: A range with the top 2 matches for each input term, showing positions and similarity scores.

4. Address Fuzzy Matching

Match addresses in your CRM with addresses in your billing system.


=TEXT_DISTANCE({"123 Main St"; "456 Oak Ave"}, {"123 Main Street"; "456 Oak Avenue"; "789 Pine Blvd"; "321 Elm Street"}, "ratcliff_obershelp", 1)

Output: A range showing the best match for each address with position and similarity score.

Similarity Algorithms

The similarity algorithms available in textdistance are given in the tables below.

Edit Distance

Algorithm	Description
`damerau_levenshtein`	Similar to Levenshtein but considers transpositions as a single edit.
`hamming`	Measures the number of positions at which the corresponding symbols are different.
`levenshtein`	Calculates the minimum number of single-character edits required to change one word into the other.
`jaro`	Measures similarity between two strings, giving more weight to common prefixes.
`jaro_winkler`	An extension of Jaro, giving more weight to strings that match from the beginning.
`lcsseq`	Measures the longest common subsequence.
`lcsstr`	Measures the longest common substring.
`ratcliff_obershelp`	Measures similarity based on the longest common subsequence.
`strcmp95`	A string comparison algorithm developed by the U.S. Census Bureau.
`needleman_wunsch`	A dynamic programming algorithm for sequence alignment.
`smith_waterman`	A dynamic programming algorithm for local sequence alignment.
`gotoh`	An extension of Needleman-Wunsch with affine gap penalties.

Token

Algorithm	Description
`cosine`	Measures the cosine of the angle between two non-zero vectors.
`jaccard`	Measures similarity between finite sample sets.
`overlap`	Measures the overlap coefficient between two sets.
`sorensen`	Measures similarity between two sets, based on the size of the intersection divided by the size of the union.
`sorensen_dice`	Similar to Sorensen, but uses Dice’s coefficient.
`dice`	Another name for Sorensen-Dice coefficient.
`tversky`	A generalization of the Jaccard index.

Sequence

Algorithm	Description
`bag`	Measures bag similarity between two sequences
`mlipns`	Measures similarity using the MLIPNS algorithm
`monge_elkan`	A hybrid algorithm combining multiple similarity measures. $ME(a,b)$

Phonetic

Algorithm	Description
`mra`	Measures similarity using the MRA algorithm
`editex`	Measures similarity using the Editex algorithm

Source Code


import textdistance
 
def text_distance(needle, haystack, algorithm='jaccard', top_n=1):
    """Calculate text similarity scores between needle(s) and haystack items.
    
    Args:
        needle: String or 2D list of strings to search for
        haystack: 2D list of strings to search within
        algorithm (str): Algorithm name from textdistance library (default: 'jaccard')
        top_n (int): Number of top matches to return (default: 1).
    
    Returns:
        list: For each needle, a flat list of [position, score, position, score, ...] for the top N matches (row format).
    """
    algo_func = getattr(textdistance, algorithm)
 
    # Handle needle as either string or 2D list
    if isinstance(needle, str):
        needle_flat = [needle] if needle.strip() else []
    else:
        # Flatten 2D lists and filter out None values
        needle_flat = [item for sublist in needle for item in sublist if item is not None]
 
    haystack_flat = [item for sublist in haystack for item in sublist if item is not None]
 
    if not haystack_flat:
        return [[] for _ in needle_flat] if needle_flat else []
 
    results = []
    for needle_item in needle_flat:
        if not str(needle_item).strip():
            results.append([])
            continue
        scores = [(index + 1, round(algo_func.normalized_similarity(str(needle_item), str(item)), 2))
                  for index, item in enumerate(haystack_flat)]
        scores.sort(key=lambda x: x[1], reverse=True)
        # Flatten the top matches into a single row
        row = []
        for score in scores[:top_n]:
            row.extend(list(score))  # [position, score, ...]
        results.append(row)
 
    # If only one needle, return just the row for that needle
    if len(results) == 1:
        return results[0]
    return results