Fuzzy matching is the ability to match non-identical text based on how similar one string is compared to the other.
You may have text that has been entered as the answer to a survey question, in which case you can't control what the respondent types.
This can result in misspelling of words, for example, take this table of companies (named Table1) and their products
It's clear to you and I that Msoft, MICROSOFT. and Micro Soft mean the same company based in Redmond, Washington.
But if we want to clean up that column and replace all the different versions of Microsoft with a single version Microsoft you can't do it with a regular match (query merge) in Power Query. Fuzzy matching to the rescue.
Fuzzy Matching Availability
Fuzzy matching is only available in Excel in Microsoft 365, and in Power BI.
Download Sample Excel Workbook
Enter your email address below to download the sample workbook.
Excel Workbook. Note: This is a .xlsx file please ensure your browser doesn't change the file extension on download.
If I have a second table (named Table2) consisting of the companies names as I want them displayed
I can load both tables into Power Query as connection only queries.
From within the Power Query Advanced Editor, select either query and then from the Ribbon choose Merge Queries -> Merge Queries as New
In the Merge dialog window select Table1 as the first table and Table2 as the second table. Leave Join Kind as Left Outer.
Click on the Company column in both tables to make the join on those columns.
You can see with these settings that only 3 out of 14 rows match, because as it is, PQ is doing an exact match.
Check Use fuzzy matching to perform the merge and you can now see that 10 rows match. The fuzzy matching algorithm has been able to identify a further 7 matches.
Click on Fuzzy matching options to see what options are available to fine tune the matching. You'll need to use the scroll bar to see all the options. Why can't they make the window bigger?
The first thing you'll see is the Similarity threshold which has a default value of 0.8.
A value of 1.0 means only look for exact matches. As you lower the value, the fuzzy matching algorithm gets less strict with how it matches text strings.
If I make the value 0.7, you can see that there are now 11 matches.
Reducing the value further, in this case, doesn't make any more matches. But depending on your own situation, you may find that reducing this threshold value gives you more matches.
Just be careful that you don't reduce it too low or you'll end up with incorrect matches because the algorithm has become very 'relaxed' with matching up strings.
For now, let's click OK to close the Merge dialog and see what our results are.
The merge gives me another column with a table in it
Clicking the icon in the new column header to expand these tables gives this
Rows 1, 9 and 12 don't have a match. Lets open the Source step to adjust the Similarity Threshold and see if that helps.
Gradually lowering the threshold by 0.1 each time, even going all the way down to 0.1 doesn't give me any more matches. Looks like I'm going to have to use a Transformation Table.
A Transformation Table is a table with 2 columns (From and To) that explicitly states what the string in the From column will be changed To.
To create a Transformation Table, I go back into Excel and create this table that I named Transform.
It lists the three companies in Table1 that don't have a match, and specifies in the To column what to use when the query merge is performed.
Load this table into Power Query (connection only).
In the Merge1 query, open the Source step again. In the Fuzzy Matching Options, scroll to the bottom until you see the Transformation table (optional) section.
Click on the drop down and select the Transform table.
The merge now tells me that it can match all 14 rows
With all rows matched in my merge I can get rid of the first column of incomplete or messed up company names, and end up with this nice table.
Leave a Reply