Text Distance
Overview
This function performs fuzzy string matching using the Python textdistance library. It exposes edit-distance, token-based, sequence-based, and phonetic algorithms for calculating the similarity between strings.
Usage
Compares a lookup_value with each item in a lookup_array and returns the top_n closest matches along with their normalized similarity scores (between 0 and 1, higher is more similar).
=TEXT_DISTANCE(lookup_value, lookup_array, [algorithm], [top_n])
Arguments:
Argument | Type | Description |
---|---|---|
lookup_value | string or 2D list | String(s) to compare with the strings in the lookup_array. |
lookup_array | 2D list | A list of strings to compare with the lookup_value. |
algorithm | string | Specifies the similarity algorithm to use. Default: ‘jaccard’. |
top_n | int | The number of top matches to return for each lookup_value. Default: 1. |
Returns a 2D list where each inner list contains the top_n matches for the corresponding lookup_value. Each match is represented as [index, similarity_score]. The matches are ordered by similarity score (highest first). The index is 1-based.
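The lookup behaviour described above can be sketched in plain Python. This is an illustrative approximation, not the function's actual implementation: the names `text_distance_sketch` and `jaccard_similarity` are hypothetical, the scorer is a hand-written Jaccard over character counts standing in for the library call, and the exact nesting of the returned matches is an assumption.

```python
from collections import Counter


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over character counts, in the range [0, 1]."""
    ca, cb = Counter(a), Counter(b)
    shared = sum((ca & cb).values())   # multiset intersection size
    union = sum((ca | cb).values())    # multiset union size
    return shared / union if union else 1.0  # two empty strings: treated as identical


def text_distance_sketch(lookup_value, lookup_array, top_n=1,
                         similarity=jaccard_similarity):
    """Hypothetical stand-in for TEXT_DISTANCE; returns [index, score] pairs."""
    # Accept either a single string or a 2D list of strings.
    if isinstance(lookup_value, str):
        values = [lookup_value]
    else:
        values = [cell for row in lookup_value for cell in row]
    candidates = [cell for row in lookup_array for cell in row]

    results = []
    for value in values:
        # Pair each candidate's 1-based index with its similarity score.
        scored = [(i + 1, similarity(value, c)) for i, c in enumerate(candidates)]
        # Highest score first; the stable sort keeps ties in original order.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results.append([[i, s] for i, s in scored[:top_n]])
    return results


matches = text_distance_sketch("apple", [["appel"], ["banana"], ["apply"]], top_n=2)
print(matches)
```

Here "appel" ranks first because it contains exactly the same characters as "apple", which a character-count measure such as Jaccard cannot tell apart; an edit-distance algorithm would score the pair below 1.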
Similarity Algorithms
The similarity algorithms available in textdistance are given in the tables below.
Edit Distance
Algorithm | Description |
---|---|
damerau_levenshtein | Similar to Levenshtein but considers transpositions as a single edit. |
hamming | Measures the number of positions at which the corresponding symbols are different. |
levenshtein | Calculates the minimum number of single-character edits required to change one word into the other. |
jaro | Measures similarity between two strings based on the number of matching characters and transpositions. |
jaro_winkler | An extension of Jaro, giving more weight to strings that match from the beginning. |
lcsseq | Measures the longest common subsequence. |
lcsstr | Measures the longest common substring. |
ratcliff_obershelp | Measures similarity by recursively finding the longest common substrings and counting the matching characters. |
strcmp95 | A string comparison algorithm developed by the U.S. Census Bureau. |
needleman_wunsch | A dynamic programming algorithm for sequence alignment. |
smith_waterman | A dynamic programming algorithm for local sequence alignment. |
gotoh | An extension of Needleman-Wunsch with affine gap penalties. |
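To make the edit-distance idea concrete, the following is a minimal sketch of Levenshtein distance and one common way of turning it into a 0 to 1 similarity score (dividing by the length of the longer string). The function names are hypothetical, and the normalization shown is an assumption about how such scores are typically derived rather than a statement of what the library does internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))            # 3 edits
print(round(normalized_similarity("kitten", "sitting"), 4))
```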
Token
Algorithm | Description |
---|---|
cosine | Measures the cosine of the angle between two non-zero vectors. |
jaccard | Measures similarity between two sets as the size of the intersection divided by the size of the union. |
overlap | Measures the overlap coefficient between two sets. |
sorensen | Measures similarity between two sets as twice the size of the intersection divided by the sum of the sizes of the two sets. |
sorensen_dice | Another name for the Sorensen coefficient. |
dice | Another name for the Sorensen coefficient. |
tversky | A generalization of the Jaccard index. |
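The token-based coefficients in the table differ mainly in the denominator applied to the shared-token count. The sketch below compares three of them using character counts as tokens; the function names are hypothetical, and both the choice of single characters as tokens and the handling of empty strings are assumptions made for illustration.

```python
from collections import Counter


def _shared_and_sizes(a: str, b: str):
    """Return (shared token count, size of a, size of b) using character counts."""
    ca, cb = Counter(a), Counter(b)
    shared = sum((ca & cb).values())
    return shared, sum(ca.values()), sum(cb.values())


def jaccard(a: str, b: str) -> float:
    shared, na, nb = _shared_and_sizes(a, b)
    union = na + nb - shared              # multiset union size
    return shared / union if union else 1.0


def sorensen_dice(a: str, b: str) -> float:
    shared, na, nb = _shared_and_sizes(a, b)
    total = na + nb
    return 2 * shared / total if total else 1.0


def overlap(a: str, b: str) -> float:
    shared, na, nb = _shared_and_sizes(a, b)
    smaller = min(na, nb)
    if smaller == 0:
        # Assumption: empty vs. empty is identical, empty vs. non-empty is not.
        return 1.0 if na == nb else 0.0
    return shared / smaller


# "night" and "nacht" share the characters n, h, and t.
for name, fn in [("jaccard", jaccard), ("sorensen_dice", sorensen_dice), ("overlap", overlap)]:
    print(name, round(fn("night", "nacht"), 4))
```

For the same pair of strings, Jaccard gives the lowest score of the three because its denominator is the full union, while Sorensen-Dice and overlap are more forgiving of unshared tokens.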
Sequence
Algorithm | Description |
---|---|
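bag | Measures bag similarity between two sequences, comparing them as multisets (bags) of characters regardless of order. |
mlipns | Measures similarity using the MLIPNS (Modified Language-Independent Product Name Search) algorithm. |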
monge_elkan | A hybrid algorithm combining multiple similarity measures. |
Phonetic
Algorithm | Description |
---|---|
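mra | Measures similarity using the Match Rating Approach (MRA), a phonetic algorithm for comparing names. |
editex | Measures similarity using the Editex algorithm, which combines edit distance with groupings of similar-sounding letters. |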