The Transform node was made Generally Available in the v3.1.5 release. This node is intended to be the replacement for the legacy Transform (Superseded) node.
This article provides an introduction to the Transform node and explains how to access input data and write output data.
What’s Different?
The Transform (Superseded) node used a proprietary scripting language to define custom data transformation logic. The Transform node instead uses the Python language to transform data.
The use of an industry-standard programming language reduces the learning curve for new Data360 Analyze users. It allows you to re-use existing coding skills and to draw on the wealth of publicly available help and resources on Python. Standard Python functions can also be used in data flows to provide functionality that is not available with the proprietary scripting language.
Finding the Transform Node
The Transform node is available in the node library in the ‘Transformation and Aggregation’ category. It is displayed in the ‘Favorites’ view and also in the ‘All Nodes’ view.
The legacy Transform node has been renamed to Transform (Superseded) and is available in the ‘All Nodes’ view when the node library's 'Show superseded nodes' option has been checked. This node remains available for customers to use but Infogix recommends utilizing the Transform node for current projects.
Accessing Input Data and Writing Output Data
Unlike the Transform (Superseded) node, the Transform node uses a two-step configuration approach:
- Specify input/output metadata mappings and the metadata for new output fields
- Specify the processing logic to apply to each record
Specifying the Output Metadata
The metadata for the mapping of input fields to output fields is explicitly defined in the node’s ConfigureFields property.
The ConfigureFields script is run once, when the node is first executed and before any input records are processed.
The ConfigureFields script is primarily used to construct the metadata object for the output record (or the metadata objects for each output record, if multiple output pins are defined on the node). Note: lines starting with ‘#’ are comment lines.
You can specify input and output fields using a number of different notation formats. By default, when the Transform node is added to the canvas it is configured to pass all fields on the node’s input pin ‘in1’ through to the node’s output pin, ‘out1’. The metadata mapping:
out1 += in1
instructs the node to append the metadata for all of the fields in the input data to the output metadata object. Since the output metadata object is initially empty and this is the first statement in the ConfigureFields script, the effect is to set the fields in the output metadata object to just those in the input data. The input metadata can also be referenced using the following notation formats:
# Use all fields on the node’s first input pin (zero-indexed)
out1 += inputs[0]
# Use all fields on the input pin named 'in1'
out1 += inputs.in1
Note: the above method cannot be used if the input pin name includes space characters or conflicts with a Python keyword. In this case the pin’s name should be specified in square brackets:
# Use all fields on the input pin named 'in1'
out1 += inputs['in1']
The fields() operator allows you to access the metadata for the fields across all of the node’s input pins. Since in this case the Transform node only has one input pin, you can also specify the input metadata using the following:
out1 += fields
The methods of referencing an output metadata object are similar to referencing the input metadata:
# Use the name of the output pin
out1 += in1
# Use the first output pin on the node (zero-indexed)
outputs[0] += in1
# Use the name of the output pin 'out1'
outputs['out1'] += in1
Selecting Fields to Output
In many situations you may want to output only a subset of the fields in the input data. You can specify the individual fields to be output:
out1 += in1.color
out1 += in1['type']
Note: The order in which the input metadata is added to the output metadata object determines the order in which the fields are output by the node.
When input data is available on the node’s input pin you can also use the ‘Input Fields’ menu option to insert a field reference.
If the input field name contains space characters the field reference will be inserted using the alternative square brackets format, e.g.:
out1 += fields['First Name']
Note: an input field’s metadata can only appear once in an output metadata object. The node will generate an error if duplicate field metadata references are detected for a particular output metadata object.
You can configure the fields to be output by specifying a pattern to be matched or using a Regular Expression:
# Select the fields containing the string "Invoice"
out1 += '*Invoice*'
# Select the fields matching the regular expression
out1 += patterns.regex('.*?(due|present).*?Date', in1)
If you want to pass through the majority of input fields to the output but exclude specific fields, then the corresponding field references need to be removed from the output metadata object. For instance, to exclude the 'junk' field and 'type' field you could set the ConfigureFields script to the following:
# Output all of the input fields
out1 += in1
# Exclude the 'junk' and 'type' fields
out1 -= in1.junk
out1 -= in1['type']
Note: When an input field reference is removed from the metadata object all items that refer to that input field are removed from the metadata object using ‘source-based subtraction’. For example:
# Add all input fields on pin ‘in1’ to the output specification
out1 += in1
# Create a new output field based on an input field
out1.foo = in1.bar
# Remove the input field from the output specification
out1 -= in1.bar
# The fields ‘bar’ and ‘foo’ will not be output!
General rule: All input field metadata subtractions must be performed before the creation of output fields that refer to the removed input fields. See the Renaming Existing Input Fields section below.
Duplicating an input field
You can duplicate an input field by adding the new field to the output metadata object and assigning the metadata for the required input field to the new field, e.g.:
# Use the input field’s metadata as the metadata for the new field
out1['My New Field'] = in1.rand
# Alternatively, use the dot notation if the new name is valid
out1.My_New_Field = in1.rand
Note: the equals operator ‘=’ is used in this case rather than the plus-equals operator ‘+=’ as you are defining the metadata for a new field.
Renaming Existing Input Fields
Renaming an input field is a two-step process: first the input field reference needs to be removed from the output metadata object, and then the new output field needs to be assigned the metadata from the input field:
# Exclude the 'junk' field
out1 -= in1.junk
# Define the new output field name and assign the metadata
# for the original input field to the new field
out1.newJunk = in1.junk
Creating New Calculated Fields
The process of transforming data typically creates variables that contain calculated values. Often you want to output selected variables as new fields on a node’s output pin. To do this you must define the metadata for the new fields. The data type for the new fields must be appropriate for the data to be output.
# Define the new output field as an integer data type
out1['doubled_ID'] = int
Note: At this stage you have only defined the metadata for the new field so, if the node was run at this point, the new field in the output records would contain all Null values. To output the actual values you also need to assign the values to the new field in the ProcessRecords script, e.g.:
out1['doubled_ID'] = 2 * in1.id
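Putting the two steps together, a minimal sketch of the two scripts for this example might be (assuming the default ‘out1 += in1’ statement in ProcessRecords copies the values of the mapped input fields for each record):
# ConfigureFields: pass the input fields through and define the new field's metadata
out1 += in1
out1['doubled_ID'] = int
# ProcessRecords: copy the input values, then populate the new field for each record
out1 += in1
out1['doubled_ID'] = 2 * in1.id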
Note that Data360 Analyze uses Python v2.7, which means that when using division functionality you need to decide whether you want the classic division behavior (the __div__() method) or the newer true division behavior (the __truediv__() method).
For instance, assume you have defined a new output field in the ConfigureFields script to handle the result of your division calculation:
# Define new output field
out1['halved_ID'] = float
and then in the ProcessRecords script you specify the calculated variable as follows:
# Divide the value of the 'id' field by two
out1['halved_ID'] = in1.id / 2
By default, dividing two integers uses integer division, so the ‘halved_ID’ values are truncated to whole numbers. This may not be what you expected, especially as the ‘halved_ID’ field was defined as a Python float type, which is the same as Data360 Analyze's double type.
To use the newer true division behavior you need to explicitly instruct the Python language interpreter to use this language feature. This can be accomplished by adding the following statement at the start of the ConfigureFields script:
# Enable the newer division functionality
from __future__ import division
Now when the node is run with this statement in the script, the division produces fractional results as expected, e.g. an ‘id’ value of 3 gives a ‘halved_ID’ value of 1.5.
Comments
3 comments
Using the Transform node to manipulate field names
To replace spaces in all field names:
ConfigureFields:
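A sketch of what this could look like (assuming the input metadata object can be iterated in ConfigureFields and that each field exposes a ‘name’ attribute):
# Re-create each field under a name with spaces replaced by underscores,
# mapping the new output field to the original input field's metadata
for field in in1:
    out1[field.name.replace(' ', '_')] = in1[field.name]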
Although frankly I'm a little surprised that it works, because in ProcessRecords I don't assign anything to the new out1 fields: it still just has the default "out1 += in1".
Convert field names to lower case
ConfigureFields:
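A similar sketch, with the same iteration assumption:
# Re-create each field under its lower-case name
for field in in1:
    out1[field.name.lower()] = in1[field.name]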
Append a prefix or suffix to field names
ConfigureFields:
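For example (the ‘src_’ prefix and ‘_old’ suffix are just illustrative placeholders; same iteration assumption as above):
# Prefix each field name with 'src_'
for field in in1:
    out1['src_' + field.name] = in1[field.name]
    # or, to append a suffix instead:
    # out1[field.name + '_old'] = in1[field.name]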
Convert camelCase field names to lower_case
ConfigureFields:
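A sketch of the approach described below (same iteration assumption):
import re
# Insert an underscore between each lowercase character and the uppercase
# character that follows it, then convert the whole name to lower case
for field in in1:
    new_name = re.sub(r'([a-z])([A-Z])', r'\g<1>_\g<2>', field.name).lower()
    out1[new_name] = in1[field.name]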
This one needs a little explaining. We are using regular expressions (import re) to look for lowercase characters ([a-z]) that are immediately followed by uppercase characters ([A-Z]).
The ( ) create groups in the search pattern that can be referenced in the replacement pattern using the \g<#> syntax. So in this case we take the first match, the lowercase character, and add an _ between it and the second match, the uppercase character. Finally, since the re module doesn't support the \L switch to convert text to lowercase, we apply .lower() to the result of the regex operation.
Rearrange fields into alphabetical order
ConfigureFields:
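One possible sketch (same iteration assumption):
# Add the input fields to the output in alphabetical order of their names
for name in sorted(field.name for field in in1):
    out1 += in1[name]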
More advanced methods
These methods can be combined, as in this example that converts to lowercase, replaces spaces with underscores and sorts the fields:
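A sketch of such a combined script (same iteration assumption):
# Sort on the original incoming field names, then rename each field
for field in sorted(in1, key=lambda f: f.name):
    out1[field.name.lower().replace(' ', '_')] = in1[field.name]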
Python sort is case-sensitive, however, so all fields starting with uppercase characters will come before all fields starting with lowercase characters. In this example the sorted() function is being applied to the list of incoming fields, so we don't yet have our lowercase names available for sorting there.
One option would be to run one transform node to convert to lowercase, followed by another to sort the fields. But it can also be done in one step - we just need to convert the fields to lowercase while still remembering which original field they identified. We use a dictionary to do this:
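For example:
# Map each new lower_case name back to the original input field name
name_map = {}
for field in in1:
    name_map[field.name.lower().replace(' ', '_')] = field.name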
This can also be done in a single line:
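e.g. with a dictionary comprehension:
name_map = {field.name.lower().replace(' ', '_'): field.name for field in in1}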
Once we have this dictionary, we sort it by its keys and then output the new fields and link them to their identification on in1:
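For example:
for new_name in sorted(name_map):
    out1[new_name] = in1[name_map[new_name]]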
In the end, I'm using this code which combines many of these elements to give me sorted, lower_case fieldnames:
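Roughly, and with the same assumptions as the sketches above:
# Build the new-name-to-original-name mapping, then add the fields in sorted order
name_map = {field.name.lower().replace(' ', '_'): field.name for field in in1}
for new_name in sorted(name_map):
    out1[new_name] = in1[name_map[new_name]]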
Hey,
We are happy that we can use Python without needing classes as in Lavastorm, which is an advantage of this node, but there are still a few things that could be added.
Is there a reason why the Transform node works only with standard Python libraries?
Is it possible to extend the Jython implementation with new libraries like pandas, numpy or cx_Oracle?
Thanks
Jaroslaw
The Transform node is built using a Java implementation of the Python language - Jython. The Jython instance includes equivalents for the functionality included in the Python standard library.
You can leverage third-party Python packages that are written entirely in Python, i.e. they are Pure Python. However, it is not possible to use Python packages that rely on C extensions, which is the case for pandas, numpy and cx_Oracle.
Please see this article for additional details:
https://support.infogix.com/hc/en-us/articles/360051959533-How-to-install-3rd-party-Python-packages-on-Analyze