Fuzzy matcher
Match rows from two sets of data based on non-identical values.
The Fuzzy Matcher tool is used to match rows from two sets of data based on non-identical values. Inconsistencies in data entry, such as spelling errors and naming differences, often prevent the joining of two datasets. With fuzzy matching, approximately similar records are identified and the closest match is returned for any given row of data.
In the blueprint example, we are matching a dataset of survey results Fuzzy Match Survey Results (Data) to our database of existing customers Fuzzy Match Customer Lookup (Data). Since the survey results were entered manually, the names may not match the company records exactly (i.e. middle name vs. no middle name, use of nicknames, capitalization differences, etc. ), hence the need for fuzzy matching.
After matching, we filter the results by the confidence level of the match to identify matches with low confidence that require further investigation.
With the output dataset of high confidence matches, we can gather actionable insights from the joined data (i.e. following up with existing customers that gave low ratings, grouping ratings per city, etc.).
Fuzzy Matcher Source: The "answer key" to be matched against.
Fuzzy Match Type: Currently, the only fuzzy matching method available is Edit Distance. The Edit Distance formula compares two words or phrases based on the number of letters they differ by. For example, the words 'test' and 'best' share three of their four letters and therefore have a similarity measure of 75%.
Similarity Measure Column Name: This identifies the column in the output that quantifies the closeness of the match. By default, this column name is "Similarity Measure". This column name can be renamed.
Match Conditions: These are the parameters the tool will use to match the two datasets. Each condition requires corresponding columns from each dataset and the desired Match Level. The Match Level is the minimum similarity between two column entries that is required to identify a possible match between the columns.
Note
The order of Match Conditions does not impact the best match results.
Output
The Fuzzy Matcher maps rows from the input dataset (the dataset you are trying to match) to the "source" dataset (the "answer key" dataset).
When looking at the results, you can think of them as two halves: the first half is from the input dataset, and the second half is from the "source" dataset. If the tool is not able to make a match, there will be an empty row in the "source" dataset.
Fuzzy Matcher also adds a "Similarity Measure" column. The degree of matching between two rows is calculated based on the conditions provided (with 0 being 0% similarity and 1 being an exact match).