When working with categorical data, the GroupBy functionality of the Aggregate node can be used to count the number of occurrences of a particular categorical value in a data set. However, the Aggregate node will only generate counts for each category that was actually seen in the current data set.
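The limitation can be illustrated outside the product with plain Python. The sketch below is not the Aggregate node's implementation; it uses the standard library's `collections.Counter` as a stand-in for group-by counting, and the sample values are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: the 'tertiary' category never occurs in this data set.
types = ["primary", "secondary", "primary", "secondary", "secondary"]

# Group-by style counting: only the values actually present get a count.
seen_counts = Counter(types)

print(dict(seen_counts))          # {'primary': 2, 'secondary': 3}
print("tertiary" in seen_counts)  # False
```

No entry is produced for 'tertiary', which is the gap the rest of this article addresses.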
In some situations you know the set of categorical values that can appear in the category field, but not all of them may be present in a specific data set. In this case you can use the Transform node and some Python scripting to produce a count for every expected category, including those with no occurrences.
For example, the default data in the Create Data node includes the 'type' field, which contains categorical data (e.g. 'primary' and 'secondary').
Suppose you wanted to count the number of occurrences of the following categories: 'primary', 'secondary' and 'tertiary'. In this case a Transform node would be connected to the output of the Create Data node and configured with the following scripts:
Transform node: 'ConfigureFields' property:
# Create the Output record metadata
out1.primary = int
out1.secondary = int
out1.tertiary = int
# Set up the local counters
primaryCount = 0
secondaryCount = 0
tertiaryCount = 0
otherTypeCount = 0
Transform node: 'ProcessRecords' property:
# For every record, track whether it is a primary, secondary or tertiary
if in1.type == 'primary':
    primaryCount += 1
elif in1.type == 'secondary':
    secondaryCount += 1
elif in1.type == 'tertiary':
    tertiaryCount += 1
else:
    # Any other value is tallied separately and is not written to the output
    otherTypeCount += 1
# On the last record, output our results
# (assumption: node.lastExec is the Transform node's last-record flag)
if node.lastExec:
    out1.primary = primaryCount
    out1.secondary = secondaryCount
    out1.tertiary = tertiaryCount
The Transform node outputs the counts in the 'primary', 'secondary' and 'tertiary' fields.
In the case where the input data did not contain records with a particular category, the Transform node would still output a count for the missing category. For example, if there were no records containing 'primary', the 'primary' field in the output would contain 0, because its counter is initialized to 0 and never incremented.
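The same counting logic can be tried outside the Transform node. The following is a minimal standalone sketch in plain Python, not the node's scripting API: the function name `count_known_categories` and the sample values are hypothetical. It seeds a counter for each expected category at zero, so a category with no records still appears in the result.

```python
def count_known_categories(values, categories):
    """Count occurrences of each expected category, including zero counts."""
    # Seed every expected category with 0 so missing ones are still reported.
    counts = {category: 0 for category in categories}
    other = 0  # values outside the expected set, like otherTypeCount above
    for value in values:
        if value in counts:
            counts[value] += 1
        else:
            other += 1
    return counts, other


# Hypothetical input containing no 'primary' records.
sample_types = ["secondary", "tertiary", "secondary", "quaternary"]
counts, other = count_known_categories(
    sample_types, ["primary", "secondary", "tertiary"]
)
print(counts)  # {'primary': 0, 'secondary': 2, 'tertiary': 1}
print(other)   # 1
```

Passing the expected categories as a list, rather than hard-coding one counter per category, keeps the sketch short when the set of categories grows.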