The new Transform node was made Generally Available in the v3.1.5 release. This node is intended to be the replacement for the previous Transform node.
This article provides an introduction to the Transform node and provides information on how to access input data and write output data.
The original Transform node used the proprietary scripting language to define custom data transformation logic. The new Transform node leverages the Python language to transform data.
The use of an industry standard programming language reduces the learning curve for new Data360 Analyze users. It allows you to re-use existing coding skills, and enables you to leverage the wealth of publicly available help/resources on the usage of Python. Standard Python functions can also be used in data flows to leverage Python functionality that is not available with the proprietary scripting language.
Finding the Transform Node
The new Transform node is available in the node palette in the ‘Transformation and Aggregation’ category. It is displayed in the ‘Favorites’ view and also in the ‘All Nodes’ view.
The original Transform node has been renamed to Transform (Superseded) and is available in the ‘All Nodes’ view. This node remains available for customers to use but Infogix recommends utilizing the new Transform node for current projects.
Accessing Input Data and Writing Output Data
Unlike with the original Transform node, the new Transform node uses a two-step configuration approach:
- Specify input/ output metadata mappings and the metadata for new output fields
- Specify the processing logic to apply to each record
Specifying the Output Metadata
The metadata for the mapping of input fields to output fields is explicitly defined in the node’s ConfigureFields property:
The ConfigureFields script is run once when the node is first executed. The script is run before any input records are processed.
The ConfigureFields script is primarily used to construct the metadata object for the output record (or the metadata objects for each output record, if multiple output pins are defined on the node). Note: lines starting with ‘#’ are comment lines.
You can specify input and output fields using a number of different notation formats. By default, when the Transform node is added to the canvas it is configured to pass all fields on the node’s input pin ‘in1’ through to the node’s output pin, ‘out1’. The metadata mapping:
out1 += in1
instructs the node to append the metadata for all of the fields in the input data to the output metadata object. Since no output metadata objects are implicitly defined and this is the first statement in ConfigureFields script, the effect is to set the field items in the metadata object to be just those in the input data. The input metadata can also be referenced using the following notation formats:
# Use all fields on the node’s first input pin (zero-indexed)
out1 += inputs
# Use all fields on the input pin named 'in1'
out1 += inputs.in1
Note, the above method cannot be used if the input pin name includes space characters or conflicts with a Python keyword. In this case the pin’s name should be specified in square brackets:
# Use all fields on the input pin named 'in1'
out1 += inputs['in1']
The fields() operator allows you to access the metadata for the fields across all of the node’s inputs pins. Since in this case the Transform node only has one input pin, you can also specify the input metadata using the following:
out1 += fields
The methods of referencing an output metadata object are similar to referencing the input metadata:
# Use the name of the output pin
out1 += in1
# Use first output pin on the node (zero-indexed)
outputs += in1
# Use the name of the output pin 'out1'
outputs['out1'] += in1
Selecting Fields to Output
In many situations you may want to output only a subset of the fields in the input data. You can specify the individual fields to be output:
out1 += in1.color
out1 += in1['type']
Note: The order in which the input metadata is added to the output metadata object determines the order in which the fields are output by the node.
When input data is available on the node’s input pin you can also use the ‘Input Fields’ menu option to insert a field reference:
Note: an input field’s metadata can only appear once in and output metadata object. The node will generate an error if duplicate field metadata references are detected for a particular output metadata object.
You can configure the fields to be output by specifying a pattern to be matched or using a Regular Expression:
# Select the fields containing the string "Invoice"
out1 += '*Invoice*'
# Select the fields matching the regular expression
out1 += patterns.regex('.*?(due|present).*?Date', in1)
If you want to pass through the majority of input fields to the output but exclude specific fields then the corresponding Field references need be removed from the output metadata object. For instance, to exclude the 'junk' field and 'type' field you could set the ConfigureFields script to the following:
# Output all of the input fields
out1 += in1
# Exclude the 'junk' and 'type'fields
out1 -= in1.junk
out1 -= in1['type']
Note: When an input field reference is removed from the metadata object all items that refer to that input field are removed from the metadata object using ‘source-based substraction’. For example:
# Add all input fields on pin ‘in1’ to the output specification
out += in1
# Create a new output field based on an input field
out1.foo = in1.bar
# Remove the input field from the output specification
out1 -= in1.bar
# The fields ‘bar’ and ‘foo’ will not be output!
General rule: All input field metadata subtractions must be performed before the creation of output fields that refer to the removed input fields. See the Renaming Existing Input Fields section below.
Duplicating an input field
You can duplicate an input field by adding the new field to the output metadata object and assigning the metadata for the required input field to the new field, e.g.
# Use the input field’s metadata as the metadata for the new field
out1['My New Field'] = in1.rand
# Alternatively, use the dot notation if the new name is valid
out1.My_New_Field = in1.rand
Note: the equals operator ‘=’ is used in this case rather than the plus-equals operator ‘+=’ as you are defining the metadata for a new field.
Renaming Existing Input Fields
Renaming an input field is a two-step process, first the input field reference needs to be removed from the output metadata object and then the new output field needs to be assigned the metadata from the input field:
# Exclude the 'junk' field
out1 -= in1.junk
# Define the new output field name and assign the metadata
# for the original input field to the new field
out1.newJunk = in1.junk
Creating New Calculated Fields
The process of transforming data typically creates variables that contain calculated values. Often times you want to output selected variables as new fields on a node’s output pin. To do this you must define the metadata for the new fields. The data type for the new fields must be appropriate for the data to be output.
# Define the new output field as an integer data type
out1['doubled_ID'] = int
Note: At this stage you have only defined the metadata for the new node so, if the node was run at this point, the new field in the output records would contain all Null values. To output the actual values you also need to assign the values to the new field in the ProcessRecords script, e.g.
out1['doubled_ID'] = 2* in1.id
Note that Data360 Analyze uses Python v.2.7 which means that when using math division functionality you need to decide whether you want to use the __div__() method or the newer __div__() method.
For instance, assume have defined a new output field in the ConfigureFields script to handle the result of your division calculation:
# Define new output field
out1['halved_ID'] = float
and then in the ProcessRecords script you specify the calculated variable as follows:
# Divide the value of the 'id' field by two
out1['halved_ID'] = in1.id / 2
By default, the division will use integer division giving the following results, which may not be what you expected – especially as the ‘halved_ID’ field was defined as a Python float type, which is the same as Data360 Analyze's double type:
To use the newer __div__() method you need to explicitly instruct the Python language interpreter to use this language feature. This can be accomplished by adding the following statement at the start of the ConfigureFields script:
# Enable the newer division functionality
from __future__ import division
Now when the node is run with this statement in the script the node generates in the following results: