A user recently asked for some help with the use of the Fuzzy Join node and I thought it may be useful to the other users in the Dataverse community.
The requirement was to identify potential duplicates in a list of names. The attached example data flow shows one way this can be achieved in Dataverse.
The example data comprises a field with customer names and the corresponding account number. Just to aid with tracking the records through the data flow a record ID field is added to the data.
The Duplicate Detection node is then used to identify any records where there is an exact match in the name for multiple accounts. Obviously, it is not uncommon for customers to have the same name so the duplicate records may be valid and other information such as address could be used to track down any actual duplicates.
The 'Unique' option of the Sort node is then used to reduce the 'exact duplicates' so that there is only one record for each name. These are then merged back with the records that had only a single occurrence before looking for any fuzzy duplicates.
The name data is passed to both inputs of the Fuzzy Join node so that the data can be joined with itself. In this case only the fuzzy expression properties are used in the match process. The Levenshtein Distance fuzzy algorithm is used to find potential duplicates. The node is configured to prefix the fields from the 'Right' data set with 'R_' so that the output field names do not clash.
The Fuzzy Join node's 'uniquematches ' output field comprises the records for which there was only one match (i.e. we did not find any fuzzy duplicates). A Transform node is then used to exclude any fields with the 'R_' prefix from these records with unique names.
The Fuzzy Join node's 'multimatch' output pin comprises the records for which there was more than one matching record. As we used the same data for the both the Left and Right data sets, records will be output for exact matches. A Split node is used to separate records with name matches in the 'Names' field and 'R_Names' field. Another consequence of joining the data with itself is that there are also duplicate records where the fuzzy matches are swapped in the 'Names' field and 'R_Names' field. A Transform node and Python code is used to separate the swapped duplicates to a second output pin. The output of this node comprises the records that may be potential fuzzy duplicates.
I hope you find this useful in your analyses.
Please sign in to leave a comment.