
Text Distance

Overview

This function performs fuzzy string matching using the Python textdistance library. It supports edit distance, token-based, sequence-based, and phonetic algorithms for calculating the similarity between strings.

View Python code on GitHub

Usage

Compares a lookup_value with each item in a lookup_array and returns the top_n closest matches along with their normalized similarity scores (between 0 and 1, higher is more similar).

=TEXT_DISTANCE(lookup_value, lookup_array, [algorithm], [top_n])

Arguments:

| Argument | Type | Description |
| --- | --- | --- |
| lookup_value | string or 2D list | String(s) to compare with the strings in the lookup_array. |
| lookup_array | 2D list | A list of strings to compare with the lookup_value. |
| algorithm | string | Optional. Specifies the similarity algorithm to use. Default: 'jaccard'. |
| top_n | int | Optional. The number of top matches to return for each lookup_value. Default: 1. |

Returns a 2D list where each inner list contains the top_n matches for the corresponding lookup_value. Each match is represented as [index, similarity_score]. The matches are ordered by similarity score (highest first). The index is 1-based.
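
The lookup behaviour described above can be sketched in plain Python. This is an illustration only, not the function's actual source: the names `text_distance_sketch` and `stand_in_scorer` are hypothetical, the scorer uses the standard library's `difflib.SequenceMatcher` instead of a textdistance algorithm so the example runs without extra packages, and flattening each match into consecutive index and score cells is an assumption about the output layout.

```python
from difflib import SequenceMatcher


def stand_in_scorer(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT one of the textdistance algorithms."""
    return SequenceMatcher(None, a, b).ratio()


def text_distance_sketch(lookup_value, lookup_array, top_n=1, scorer=stand_in_scorer):
    """Hypothetical re-creation of the lookup logic described above."""
    # Accept either a single string or a 2D list of strings.
    if isinstance(lookup_value, str):
        values = [lookup_value]
    else:
        values = [cell for row in lookup_value for cell in row]
    candidates = [cell for row in lookup_array for cell in row]

    results = []
    for value in values:
        # Pair each candidate's 1-based index with its similarity score.
        scored = [(i, scorer(value, c)) for i, c in enumerate(candidates, start=1)]
        # Highest score first; the stable sort keeps the earlier index on ties.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        row = []
        for index, score in scored[:top_n]:
            row.extend([index, round(score, 4)])  # assumed flattened layout
        results.append(row)
    return results


matches = text_distance_sketch("apple", [["appl"], ["banana"], ["apple"]], top_n=2)
print(matches)  # [[3, 1.0, 1, 0.8889]]
```

Passing the scorer as a parameter mirrors the role of the algorithm argument: the ranking logic stays the same while the similarity measure changes.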

Similarity Algorithms

The similarity algorithms available in textdistance are given in the tables below.

Edit Distance

| Algorithm | Description |
| --- | --- |
| damerau_levenshtein | Similar to Levenshtein but counts a transposition of two adjacent characters as a single edit. |
| hamming | Measures the number of positions at which the corresponding symbols are different. |
| levenshtein | Calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. |
| jaro | Measures similarity based on the number of matching characters and the number of transpositions between them. |
| jaro_winkler | An extension of Jaro, giving more weight to strings that match from the beginning. |
| lcsseq | Measures the longest common subsequence. |
| lcsstr | Measures the longest common substring. |
| ratcliff_obershelp | Measures similarity by recursively finding the longest common substrings and counting the matching characters. |
| strcmp95 | A string comparison algorithm developed by the U.S. Census Bureau. |
| needleman_wunsch | A dynamic programming algorithm for global sequence alignment. |
| smith_waterman | A dynamic programming algorithm for local sequence alignment. |
| gotoh | An extension of Needleman-Wunsch with affine gap penalties. |
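
To make the edit distance idea concrete, here is a minimal pure-Python sketch of Levenshtein distance and one way to turn it into a score between 0 and 1. The function names are illustrative, and dividing by the length of the longer string is a common normalization that is assumed here rather than taken from the function's source.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(round(normalized_similarity("kitten", "sitting"), 4))  # 0.5714
```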

Token

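| Algorithm | Description |
| --- | --- |
| cosine | Measures the cosine of the angle between the token vectors of the two strings. |
| jaccard | Measures similarity between two sets as the size of the intersection divided by the size of the union. |
| overlap | Measures the overlap coefficient: the size of the intersection divided by the size of the smaller set. |
| sorensen | Measures similarity between two sets as twice the size of the intersection divided by the sum of the sizes of the two sets. |
| sorensen_dice | Another name for the Sorensen-Dice coefficient (same as sorensen). |
| dice | Another name for the Sorensen-Dice coefficient (same as sorensen). |
| tversky | A generalization of the Jaccard index and the Sorensen-Dice coefficient. |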
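
The three set-based coefficients differ only in the denominator, which the following sketch makes explicit. It treats each string as a multiset of single characters; that tokenization, the function names, and the handling of empty strings are assumptions made for illustration.

```python
from collections import Counter


def _sizes(a: str, b: str):
    """Return (shared, size_a, size_b) for the character multisets of a and b."""
    ca, cb = Counter(a), Counter(b)
    shared = sum((ca & cb).values())  # multiset intersection
    return shared, sum(ca.values()), sum(cb.values())


def jaccard(a: str, b: str) -> float:
    shared, na, nb = _sizes(a, b)
    union = na + nb - shared
    return shared / union if union else 1.0


def sorensen(a: str, b: str) -> float:
    shared, na, nb = _sizes(a, b)
    total = na + nb
    return 2 * shared / total if total else 1.0


def overlap(a: str, b: str) -> float:
    shared, na, nb = _sizes(a, b)
    smallest = min(na, nb)
    if smallest == 0:
        # Assumed convention: identical only if both strings are empty.
        return 1.0 if na == nb else 0.0
    return shared / smallest


# "night" and "nacht" share the characters n, h, and t.
print(round(jaccard("night", "nacht"), 4))  # 0.4286  (3 / 7)
print(sorensen("night", "nacht"))           # 0.6     (6 / 10)
print(overlap("night", "nacht"))            # 0.6     (3 / 5)
```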

Sequence

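| Algorithm | Description |
| --- | --- |
| bag | Measures bag similarity, comparing two sequences as bags (multisets) of characters regardless of order. |
| mlipns | Measures similarity using the MLIPNS (Modified Language-Independent Product Name Search) algorithm. |
| monge_elkan | A hybrid algorithm that scores each token against its best match in the other string using an inner similarity measure, then averages the results. |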
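
As a small illustration of the bag measure, the sketch below computes bag distance as the larger of the two multiset differences between the strings' characters. The function names are hypothetical, and normalizing by the length of the longer string is an assumption.

```python
from collections import Counter


def bag_distance(a: str, b: str) -> int:
    """Larger of the two multiset differences between the characters of a and b."""
    ca, cb = Counter(a), Counter(b)
    only_in_a = sum((ca - cb).values())  # characters left over in a
    only_in_b = sum((cb - ca).values())  # characters left over in b
    return max(only_in_a, only_in_b)


def bag_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - bag_distance(a, b) / longest


print(bag_distance("listen", "silent"))  # 0, because character order is ignored
print(bag_distance("abc", "abd"))        # 1
```

Because order is ignored, anagrams score as identical, so this measure works best as a cheap first filter before a stricter comparison.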

Phonetic

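| Algorithm | Description |
| --- | --- |
| mra | Measures similarity using the Match Rating Approach (MRA) algorithm. |
| editex | Measures similarity using the Editex algorithm, which combines edit distance with phonetic letter groups. |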