When analyzing or processing data, it is best practice to profile your data to assess its integrity. One way to profile the data is to check for duplicate records, which could skew your analysis. Data360 Analyze provides a number of nodes that can detect and remove duplicates. The easiest way to find the relevant nodes is to use the search functionality with a search term such as 'duplicate' or 'unique':
Duplicate Detection node
The Duplicate Detection node is used to identify records with duplicate values in specified fields. When run, it segregates the data so that records with unique values in the specified fields are output on the node's 'single occurrence' output pin, and records with duplicate values are output on the 'multiple occurrence' pin.
In this example transaction data the key fields that are going to be used to determine whether a record is a duplicate are 'Customer_Account', 'Agreement_Number' and 'Charges_Sequence_Number':
To configure the node, the 'IdentifyDuplicatesBy' property is set to use the key fields:
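The splitting behaviour of the node can be sketched in plain Python. This is a conceptual illustration only, not the node's actual implementation; the field names follow the article's key fields, but the sample values are made up:

```python
from collections import Counter

# Illustrative transaction records (values are hypothetical).
records = [
    {"Customer_Account": "C001", "Agreement_Number": "A10", "Charges_Sequence_Number": 1},
    {"Customer_Account": "C001", "Agreement_Number": "A10", "Charges_Sequence_Number": 1},
    {"Customer_Account": "C002", "Agreement_Number": "A11", "Charges_Sequence_Number": 1},
]

KEY_FIELDS = ("Customer_Account", "Agreement_Number", "Charges_Sequence_Number")

def split_by_duplicates(rows, key_fields):
    """Return (single_occurrence, multiple_occurrence) lists, split on the key fields."""
    key = lambda r: tuple(r[f] for f in key_fields)
    counts = Counter(key(r) for r in rows)
    single = [r for r in rows if counts[key(r)] == 1]
    multiple = [r for r in rows if counts[key(r)] > 1]
    return single, multiple

single, multiple = split_by_duplicates(records, KEY_FIELDS)
print(len(single), len(multiple))  # 1 2
```

Records land on the 'single occurrence' or 'multiple occurrence' side purely on the key fields; any other fields in the record are ignored for the comparison.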
When the node is run the default action is to fail if it detects duplicates:
This action is appropriate if the data should never contain duplicates and you need to ensure duplicates are not propagated downstream. In some circumstances, you may want the node to succeed even when duplicates are present in the data. You can do this by setting the 'ErrorIfDuplicates' property to False. In this case, six records have been identified as duplicates.
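The fail-by-default behaviour can be mimicked with a flag analogous to the 'ErrorIfDuplicates' property. Again, this is a sketch of the concept rather than the node's real code:

```python
from collections import Counter

def detect_duplicates(rows, key_fields, error_if_duplicates=True):
    """Return the duplicate rows; raise if any are found and the flag is set,
    mirroring the node's default ErrorIfDuplicates=True behaviour."""
    key = lambda r: tuple(r[f] for f in key_fields)
    counts = Counter(key(r) for r in rows)
    duplicates = [r for r in rows if counts[key(r)] > 1]
    if duplicates and error_if_duplicates:
        raise ValueError(f"{len(duplicates)} duplicate records detected")
    return duplicates

rows = [{"id": 1}, {"id": 1}, {"id": 2}]

try:
    detect_duplicates(rows, ("id",))            # default behaviour: fail
except ValueError as e:
    print(e)                                    # 2 duplicate records detected

# With the flag set to False, duplicates are reported but not treated as an error.
dups = detect_duplicates(rows, ("id",), error_if_duplicates=False)
print(len(dups))                                # 2
```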
Only the specified key fields were used in determining whether the records are duplicates. You can examine the records in the Data Viewer or export them for further investigation. In this case these are exact duplicate records.
You can also use the node without specifying any fields in the 'IdentifyDuplicatesBy' property - in this case the node will use all input fields when looking for duplicates.
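When no key fields are given, the comparison key simply spans every field of the record. As a sketch (the helper name is my own, not part of the product):

```python
def duplicate_key(row, key_fields=None):
    # With no IdentifyDuplicatesBy fields specified, fall back to all input
    # fields (sorted so the key is stable regardless of field order).
    fields = key_fields or sorted(row)
    return tuple(row[f] for f in fields)

r = {"b": 2, "a": 1}
print(duplicate_key(r))           # (1, 2) - every field participates
print(duplicate_key(r, ("b",)))   # (2,)   - only the named key field
```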
Remove Duplicates node
The Remove Duplicates node enables you to exclude duplicate records from your data. Typically, it is best practice to detect any duplicates before removing them, as this provides transparency in your analysis about where records are being 'dropped' and why. The node is configured similarly to the Duplicate Detection node:
When run, the records with unique values in the specified fields are output on the node's output pin.
You can then easily combine the 'single occurrence' records with the de-duplicated records using a Cat node:
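The detect, remove, and recombine flow can be sketched end to end in Python. The sample data and helper names are illustrative only:

```python
from collections import Counter

records = [
    {"Customer_Account": "C001", "Agreement_Number": "A10", "Charges_Sequence_Number": 1},
    {"Customer_Account": "C001", "Agreement_Number": "A10", "Charges_Sequence_Number": 1},
    {"Customer_Account": "C002", "Agreement_Number": "A11", "Charges_Sequence_Number": 1},
]
KEY_FIELDS = ("Customer_Account", "Agreement_Number", "Charges_Sequence_Number")
key = lambda r: tuple(r[f] for f in KEY_FIELDS)

# Duplicate Detection step: split into the two output pins.
counts = Counter(key(r) for r in records)
single = [r for r in records if counts[key(r)] == 1]    # 'single occurrence' pin
multiple = [r for r in records if counts[key(r)] > 1]   # 'multiple occurrence' pin

def remove_duplicates(rows, key_fn):
    """Keep the first record seen for each key, like the Remove Duplicates node."""
    seen, out = set(), []
    for r in rows:
        if key_fn(r) not in seen:
            seen.add(key_fn(r))
            out.append(r)
    return out

# 'Cat' step: concatenate the single-occurrence records with the de-duplicated ones.
combined = single + remove_duplicates(multiple, key)
print(len(combined))  # 2
```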
See the attached .lna file for the example used in this article. Data360 Analyze version 3.6.0 or above is required.
Comments
I have duplicate records of the same transaction, and want to preserve the most recent one. How can I do that?
After you have used the Duplicate Detection node to find the duplicates, add a Sort node and configure the SortBy fields to be the same key fields used by the Duplicate Detection node, plus the Transaction date field. The Transaction date field should be configured to sort descending, meaning the first record in each duplicate group will be the most recent. You can then use the Remove Duplicates node to remove the earlier duplicate record(s).
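The sort-then-dedupe approach described in this answer can be sketched in Python. The 'Transaction_Date' field name and the sample dates are placeholders:

```python
records = [
    {"Customer_Account": "C001", "Agreement_Number": "A10",
     "Charges_Sequence_Number": 1, "Transaction_Date": "2023-01-05"},
    {"Customer_Account": "C001", "Agreement_Number": "A10",
     "Charges_Sequence_Number": 1, "Transaction_Date": "2023-03-17"},
]
KEY_FIELDS = ("Customer_Account", "Agreement_Number", "Charges_Sequence_Number")
key = lambda r: tuple(r[f] for f in KEY_FIELDS)

# Sort most-recent first (ISO 8601 date strings sort lexicographically),
# then keep the first record per key group - i.e. the latest transaction.
records.sort(key=lambda r: r["Transaction_Date"], reverse=True)
seen, latest = set(), []
for r in records:
    if key(r) not in seen:
        seen.add(key(r))
        latest.append(r)

print(latest[0]["Transaction_Date"])  # 2023-03-17
```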