BACKGROUND:
In Data360 Analyze, the Hash Split node is intended for large data sets, for example those receiving a million or more records per day. It splits the data into parts that can be processed in parallel.
It is useful when you need to join two large data sets on a shared key, such as a phone number.
Use a Hash Split node on each data source to split the data by phone number. The split behaves as follows:
1. All records with the same key value (in this case, a phone number) are guaranteed to go to the same output pin.
2. Records with a given key value go to that same output pin every time the node runs.
3. This holds regardless of the other values in a record, and regardless of how many records share the same key value.
The node splits the input record set into multiple streams to allow parallel processing of subsets of the input.
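The three properties above can be illustrated with a minimal Python sketch. It is not the Hash Split node's actual implementation: the hash function (`zlib.crc32`), the helper names `pin_for` and `hash_split`, and the sample records are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

NUM_PINS = 7  # hypothetical number of output pins


def pin_for(key, num_pins=NUM_PINS):
    """Map a key to a pin number in 1..num_pins.

    crc32 is an illustrative stand-in; the node's real hash is not documented here.
    It is used instead of Python's built-in hash(), which can differ between runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_pins + 1


def hash_split(records, key_field, num_pins=NUM_PINS):
    """Group records by the pin that their key value hashes to."""
    pins = defaultdict(list)
    for record in records:
        pins[pin_for(record[key_field], num_pins)].append(record)
    return pins


# Hypothetical call records; two of them share a phone number.
records = [
    {"phone": "555-0100", "minutes": 3},
    {"phone": "555-0101", "minutes": 12},
    {"phone": "555-0100", "minutes": 7},
]
pins = hash_split(records, "phone")
```

Only the key value is hashed, so the other fields and the number of records sharing a key have no effect on which pin a record goes to.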
NOTE: This example assumes the number of output pins on the Hash Split node stays the same. If you change that number, records will probably be assigned to different output pins, and will then go consistently to those new pins.
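As a rough illustration of why the pin count matters, the sketch below assigns the same keys with 7 pins and then with 8 pins. It assumes a simple hash-modulo scheme using `zlib.crc32`, which may not match what the node does internally; `pin_for` and the generated phone numbers are hypothetical.

```python
import zlib


def pin_for(key, num_pins):
    """Illustrative hash-modulo pin assignment (an assumption, not the node's real hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_pins + 1


# Hypothetical phone numbers 555-0000 .. 555-0999.
keys = [f"555-{n:04d}" for n in range(1000)]

pins_with_7 = {key: pin_for(key, 7) for key in keys}
pins_with_8 = {key: pin_for(key, 8) for key in keys}

# Count the keys whose pin changed when the pin count changed.
moved = sum(1 for key in keys if pins_with_7[key] != pins_with_8[key])
print(f"{moved} of {len(keys)} keys changed pin")
```

Under this scheme, each key still maps to a single pin for a given pin count, so the assignment is consistent again once the new count is in place.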
USE CASE:
A common case where the Hash Split node may be useful is data from a billing system, a telephone switch, or call records (see sample data and example lna attached). If you have a large volume of billing data and switch data that needs to be joined to other switch data, split both data sources by phone number with 7 output pins each. Then join the output of pin 1 of one data source to pin 1 of the other. Because both sources are split on the same key with the same number of pins, any phone number that appears in both is on the same pin in each. Do the same for pin 2 to pin 2, pin 3 to pin 3, and so on through pin 7 to pin 7.
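The pin-to-pin join described above can be sketched in Python as follows. This is an illustration under assumptions, not the contents of the attached example: the `zlib.crc32` hash, the helper names `pin_for`, `split` and `join`, and the billing and switch records are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PINS = 7  # both sources must use the same pin count


def pin_for(phone, num_pins=NUM_PINS):
    """Illustrative pin assignment; the node's real hash is an unknown here."""
    return zlib.crc32(phone.encode("utf-8")) % num_pins + 1


def split(records, num_pins=NUM_PINS):
    """Split records into pins 1..num_pins by phone number."""
    pins = {pin: [] for pin in range(1, num_pins + 1)}
    for record in records:
        pins[pin_for(record["phone"], num_pins)].append(record)
    return pins


def join(left, right):
    """Inner join of two record lists on the phone field."""
    by_phone = defaultdict(list)
    for r in right:
        by_phone[r["phone"]].append(r)
    return [{**l, **r} for l in left for r in by_phone.get(l["phone"], [])]


billing = [
    {"phone": "555-0100", "amount": 12.5},
    {"phone": "555-0101", "amount": 8.0},
    {"phone": "555-0102", "amount": 3.25},
    {"phone": "555-0100", "amount": 1.0},
]
switch = [
    {"phone": "555-0100", "duration": 60},
    {"phone": "555-0101", "duration": 30},
    {"phone": "555-0103", "duration": 45},
]

billing_pins = split(billing)
switch_pins = split(switch)

# Join pin 1 to pin 1, pin 2 to pin 2, and so on. Each per-pin join is
# independent of the others; this loop runs them one after another for simplicity.
joined_by_pin = []
for pin in range(1, NUM_PINS + 1):
    joined_by_pin.extend(join(billing_pins[pin], switch_pins[pin]))

# Joining the unsplit data gives the same rows, possibly in a different order.
joined_full = join(billing, switch)
```

Because matching phone numbers always share a pin number, no matches are lost by joining pin by pin, and each per-pin join works on a smaller subset of the data.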