Sometimes, input data needs to be divided among multiple outputs in order to group the data appropriately. A Transform node and some Python code can accomplish this goal.
You can download a data flow containing the code and sample data at the end of this post.
Let's say you have the following data for shoes:
Color    | Size
Black    | 8
Black    | 10
Black    | 6
Burgundy | 7
Burgundy | 12
Burgundy | 11
Suppose you want to bucket the data into three groups:
- Sizes 11 and above
- Sizes 7 to 10
- Sizes 6 and below
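As a quick check of the grouping rule, here is a minimal plain-Python sketch that applies the three buckets to the sample rows. It runs outside of any data flow and is not Transform node code; the `SHOES` list, the `bucket_for` helper, and the group labels are hypothetical names used only for illustration.

```python
# Plain-Python sketch of the bucketing rule; this is NOT Transform node code.
# SHOES mirrors the sample table; bucket_for is a hypothetical helper.
SHOES = [
    ("Black", 8),
    ("Black", 10),
    ("Black", 6),
    ("Burgundy", 7),
    ("Burgundy", 12),
    ("Burgundy", 11),
]


def bucket_for(size):
    """Return the label of the group that a shoe size falls into."""
    if size >= 11:
        return "11 and above"
    if size > 6:
        return "7 to 10"
    return "6 and below"


# Collect the rows for each group, preserving input order.
groups = {"11 and above": [], "7 to 10": [], "6 and below": []}
for color, size in SHOES:
    groups[bucket_for(size)].append((color, size))

for label, rows in groups.items():
    print(label, rows)
```

With the sample data, two rows land in the first group, three in the second, and one in the third.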
You can do this in two ways.
First, you could use three different Transform nodes with three different where clauses to group the data:
This method has a few advantages: the nodes are easy to understand and clearly labelled. However, if you want to save some screen real estate and make your data flow run faster, you can split the input data with a single Transform node that has three outputs.
Start by placing and naming a Transform node:
In the ConfigureFields parameter of the Transform node, replace the existing code with the following code:
# Configure all the field metadata from input 'in1' to be mapped
# to the corresponding fields on each of the three outputs
outputs[0] += in1
outputs[1] += in1
outputs[2] += in1
In the ProcessRecords parameter of the Transform node, replace the existing code with the following code:
# Output the record to the appropriate output pin
# based on the specified criteria
if fields['Size'] >= 11:
    outputs[0] += in1
if fields['Size'] > 6 and fields['Size'] < 11:
    outputs[1] += in1
if fields['Size'] <= 6:
    outputs[2] += in1
Now add two additional outputs to the Transform node. You can do this by clicking on the node, and then clicking the Define tab in the panel that appears on the right of your screen:
Towards the bottom, you'll see an area that looks like the following:
The first output can be renamed by clicking where it says "out1" and typing a different name. New outputs can be added by clicking where it says "Type to add a new output". Here is how I set up my Transform node:
When you run the node, you'll see that you get the same results as above, just in a single node!
Please be aware that all three if statements are evaluated for each input record. If a record satisfies more than one condition, it is output more than once. The conditions in this example are mutually exclusive, so that does not happen here, but the code is more robust if you replace the separate if statements with an if ... elif ... else statement, as shown in this code:
# Output the record to the appropriate output pin
# based on the specified criteria
if fields['Size'] >= 11:
    outputs[0] += in1
elif fields['Size'] > 6 and fields['Size'] < 11:
    outputs[1] += in1
else:
    outputs[2] += in1
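To see why the distinction matters, here is a minimal plain-Python sketch that simulates both styles. It is not Transform node code: the functions `route_with_ifs` and `route_with_elif` are hypothetical, the returned numbers stand in for output indexes, and the first threshold is deliberately changed to 10 so that two conditions overlap.

```python
# Plain-Python simulation; the returned integers stand in for output indexes.
# The first condition uses 10 (not 11) on purpose so that it overlaps the second.

def route_with_ifs(size):
    """Independent if statements: every matching condition emits the record."""
    hits = []
    if size >= 10:
        hits.append(0)
    if 6 < size < 11:
        hits.append(1)
    if size <= 6:
        hits.append(2)
    return hits


def route_with_elif(size):
    """if/elif/else: only the first matching branch emits the record."""
    if size >= 10:
        return [0]
    elif 6 < size < 11:
        return [1]
    else:
        return [2]


print(route_with_ifs(10))   # [0, 1]: the record is emitted twice
print(route_with_elif(10))  # [0]: the record is emitted once
```

With overlapping conditions, the order of the branches in the if ... elif ... else version decides which output receives the record, so put the most specific condition first.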
In this case, be aware that every record that does not match the criteria for the first two outputs is sent to the third output, whether or not it actually meets the third criterion. If required, you could extend this example with a No Match output that collects records that do not explicitly match the criteria for the other outputs.
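One way to picture the No Match idea is the plain-Python sketch below. It is an illustration under assumptions rather than Transform node code: `output_index` is a hypothetical helper, index 3 is assumed to be a fourth output added for unmatched records, and treating missing or non-numeric sizes as unmatched is a design choice made for this example.

```python
def output_index(size):
    """Return 0-2 for the three size groups, or 3 for a hypothetical No Match output."""
    # Missing or non-numeric values cannot be compared reliably, so they are
    # routed to No Match. bool is excluded because it is a subclass of int.
    if isinstance(size, bool) or not isinstance(size, (int, float)):
        return 3
    if size >= 11:
        return 0
    elif 6 < size < 11:
        return 1
    elif size <= 6:
        return 2
    # Reached only by values that fail every comparison, such as NaN.
    return 3


for value in (12, 8, 6, None, "9", float("nan")):
    print(repr(value), "->", output_index(value))
```

Every branch now tests an explicit condition, so a record reaches the last output only when nothing else matched.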