Fuzzy Matching with Python in Excel
November 4, 2024.
Overview
Fuzzy matching is a technique used to find strings that are approximately equal, e.g. “Microsoft” and “Microsoft Inc”. This is particularly useful in scenarios where exact matches are not possible due to typographical errors, variations in spelling, or other inconsistencies. In this post, we will explore how to perform fuzzy matching with Python in Excel using the NLTK library.
NLTK Library
The Natural Language Toolkit (NLTK) is a powerful library for natural language processing. It provides various tools and algorithms for text processing, including tokenization, stemming, lemmatization, and distance metrics. We will focus on using NLTK to calculate similarity scores between strings for fuzzy matching.
An example of calculating the similarity of two strings using edit_distance is shown below:
from nltk.metrics.distance import edit_distance
string1 = "apple"
string2 = "apples"
distance = edit_distance(string1, string2)
print(distance)
In this example, the edit_distance function calculates the minimum number of single-character edits required to change one string into the other. The output will be 1, indicating that the strings ‘apple’ and ‘apples’ have an edit distance of 1.
NLTK supports various distance metrics for calculating similarity between strings, including:
- Jaccard Distance: Measures the similarity between two sets by comparing the intersection and union of the sets.
- Levenshtein Distance: Calculates the minimum number of single-character edits required to change one word into the other.
- Jaro Similarity: Measures the similarity between two strings by considering the number of matching characters and transpositions.
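As a brief, illustrative sketch (separate from the nltk_distance function introduced below), the snippet compares the three metrics on the same pair of strings; the printed values are approximate:
from nltk.metrics.distance import edit_distance, jaccard_distance, jaro_similarity
from nltk.util import ngrams

s1, s2 = "Microsoft", "Microsoft Inc"

# Levenshtein: raw count of single-character edits (4 here, for the appended " Inc")
print(edit_distance(s1, s2))
# Jaccard distance between the sets of character bigrams (about 0.33 here)
print(jaccard_distance(set(ngrams(s1, 2)), set(ngrams(s2, 2))))
# Jaro similarity in the range 0-1, where higher means more similar (about 0.9 here)
print(jaro_similarity(s1, s2))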
Python Function
We’ve created the nltk_distance function to handle the typical use-case in Excel of finding a match for a lookup_value in a lookup_array, similar to how the XMATCH function works. It normalizes the Levenshtein edit_distance results to a similarity score between 0 and 1 so that all three algorithms return comparable scores. The function is as follows:
import nltk
import pandas as pd
from nltk.metrics.distance import edit_distance, jaccard_distance, jaro_similarity
from nltk.util import ngrams
def nltk_distance(lookup_value, lookup_array_df, algorithm='jaccard'):
    """
    Calculate the similarity between a lookup_value and a lookup_array using various distance algorithms.

    Parameters:
        lookup_value (str or pd.DataFrame): The string or DataFrame to search for.
        lookup_array_df (pd.DataFrame): The DataFrame to search within.
        algorithm (str): The algorithm to use for calculating similarity. Options are 'levenshtein', 'jaccard', and 'jaro'. Default is 'jaccard'.

    Returns:
        list: A list of lists where each sublist contains the index (1-based) and the similarity score of the most similar item in the lookup_array.
    """
    # Define a dictionary to map algorithm names to functions
    algo_funcs = {
        'levenshtein': lambda x, y: 1 - edit_distance(x, y) / max(len(x), len(y)),
        'jaccard': lambda x, y: 1 - jaccard_distance(set(ngrams(x, 2)), set(ngrams(y, 2))),
        'jaro': jaro_similarity
    }

    # Get the algorithm function from the dictionary
    algo_func = algo_funcs.get(algorithm)
    if algo_func is None:
        raise ValueError(f"Unsupported algorithm: {algorithm}")

    # Flatten the DataFrame to a list
    lookup_array = lookup_array_df.values.flatten().tolist()

    # Check if lookup_value is a DataFrame
    if isinstance(lookup_value, pd.DataFrame):
        lookup_value_list = lookup_value.values.flatten().tolist()
    else:
        lookup_value_list = [lookup_value]

    results = []
    for lookup_value_item in lookup_value_list:
        # Calculate similarity scores and round to 2 decimal places
        scores = [(index + 1, round(algo_func(lookup_value_item, item), 2)) for index, item in enumerate(lookup_array)]
        # Sort based on scores in descending order
        scores.sort(key=lambda x: x[1], reverse=True)
        # Append the top index and score to results as a list
        results.append(list(scores[0]))

    # results is a 2D list, e.g. [[1, 0.75], [2, 0.85]]
    return results
nltk_distance(arg1, arg2, arg3)
The nltk_distance function takes three arguments:
- lookup_value: The string or DataFrame to search for, e.g. “apple” or pd.DataFrame(["apple", "banana"]).
- lookup_array_df: The DataFrame to search within, e.g. pd.DataFrame(["apples", "bananas", "cherries"]).
- algorithm: The algorithm to use for calculating similarity. Options are 'levenshtein', 'jaccard', and 'jaro'.
The function returns a list of lists, where each sublist contains the index (1-based), similar to XMATCH, and the similarity score of the most similar item in the lookup_array. For example, [[1, 0.75], [2, 0.85]] indicates that the most similar item to the first lookup_value has an index of 1 and a similarity score of 0.75, and the most similar item to the second lookup_value has an index of 2 and a similarity score of 0.85. This matches the array output expected by Excel.
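As a quick illustration (assuming the nltk_distance function defined above is available in the session), the sketch below shows the output shape for two lookup values using the normalized Levenshtein scores; the exact numbers are only indicative:
import pandas as pd

lookup_value = pd.DataFrame(["apple", "banana"])
lookup_array_df = pd.DataFrame(["apples", "bananas", "cherries"])

# One [index, score] pair is returned per lookup value
print(nltk_distance(lookup_value, lookup_array_df, "levenshtein"))
# [[1, 0.83], [2, 0.86]] -- "apple" best matches item 1, "banana" best matches item 2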
Use in Excel
Two options for running this code in Excel are outlined below. See this demo workbook for an example of both.
EXCEL PY()
In an Excel PY() cell, start by assigning the function arguments arg1, arg2, and arg3 to the appropriate data in the workbook, then paste the function code below them. For example, if the range A6:A10 contains the lookup values, the range B6:B27 contains the lookup array, and the cell C6 contains the algorithm name, we get the following in the Excel PY() cell:
=PY(
arg1 = xl("A6:A10") # lookup_value
arg2 = xl("B6:B27") # lookup_array_df
arg3 = xl("C6") # algorithm
def nltk_distance(lookup_value, lookup_array_df, algorithm):
# ... function code omitted for brevity ...
return results
nltk_distance(arg1, arg2, arg3)
)
BOARDFLARE.RUNPY()
Assuming the code is contained in cell A1, we can get the same result using the BOARDFLARE.RUNPY function as follows:
=BOARDFLARE.RUNPY(A1, A6:A10, B6:B27, C6)
Since nltk_distance is in our functions library, you can call it using the path text/fuzzy-match/nltk_distance.ipynb, and the latest code will be downloaded and executed. For example:
=BOARDFLARE.RUNPY("text/fuzzy-match/nltk_distance.ipynb", A6:A10, B6:B27, C6)
One of the benefits of using the BOARDFLARE.RUNPY function is that we can further customize it in Excel with a LAMBDA function. For example, to set a similarity threshold:
=LAMBDA(lookup_value, lookup_array, [similarity_threshold], [algorithm],
LET(
threshold, IF(ISOMITTED(similarity_threshold), 0.7, similarity_threshold),
algo, IF(ISOMITTED(algorithm), "jaccard", algorithm),
result, BOARDFLARE.RUNPY("text/fuzzy-match/nltk_distance.ipynb", lookup_value, lookup_array, algo),
index, INDEX(result, 1),
score, INDEX(result, 2),
IF(score >= threshold, index, "No match")
)
)
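To reuse this wrapper, the LAMBDA can be assigned a name in the Name Manager, for example FUZZYMATCH (an illustrative name), and then called like a built-in function, e.g. =FUZZYMATCH(A6, B6:B27, 0.8, "jaro"). As written, the threshold logic reads a single [index, score] pair, so the wrapper is most naturally used with one lookup_value per call.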
Conclusion
Fuzzy matching is a powerful tool for handling approximate string matching in various applications. The nltk_distance Python function provides a flexible and easy-to-use interface for performing fuzzy matching using different algorithms. It can be run efficiently in Excel using either the Excel PY() or BOARDFLARE.RUNPY functions, enabling users to leverage the power of Python within their Excel workbooks.