A common function to use in this case would be the Python replace method within the ProcessRecords script of a Transform node. The replace method allows you to perform a "find and replace" of text and doesn't require you to change your source data.
It takes the following format:
source.replace(find, replace[, max])
Where source is the source string, find is the substring to be found, replace is the replacement substring, and max is an optional argument that specifies the maximum number of replacements that are to be made during the replace operation.
You can view the Help for the replace function by using an online search, for example see the pages here:
https://www.tutorialspoint.com/python/string_replace.htm
https://docs.python.org/2/library/string.html
Example 1:
In my data, the country code field 'Country' contains the value "USA" but I want to replace it with "US" in order to match with a reference table that I intend to join with. As I want to output the corrected data in a field with the same fieldname as the original, the Transform node needs to be configured with the appropriate metadata definition in the ConfigureFields property:
out1 += in1
out1 -= in1.Country
out1.Country = unicode
In the above script, the first statement out1 += in1 is used to map the metadata for all of the fields in the input data 'in1' to corresponding fields on the output 'out1'. The second statement out1 -= in1.Country removes the metadata for the 'Country' field from the output record specification (meaning this input field would not appear in the output data). Finally, the statement out1.Country = unicode adds the 'Country' field in the output metadata specification as a (new) string field - for which the values for this field are not known at this stage.
The ProcessRecords script is used to transform each of the records in the input data:
out1 += in1
out1.Country = fields.Country.replace('USA','US')
The first statement out1 += in1 copies the values of each of the input fields that have been set up in the mapping defined in the ConfigureFields script into the output record.
The second statement out1.Country = fields.Country.replace('USA','US') is then used to perform the replacement operation on the 'Country' input field and assign the resultant string value to the output 'Country' field.
Example 2:
In my data, the country code field 'Country' sometimes contains periods in the country name abbreviations. For example, some country values appear as 'U.S.A.' or 'U.S.' but I want these values without the period i.e. 'USA' and 'US' respectively. I can use the replace function to replace '.' with an empty string ''.
As before, I want the replacement string values to be output in a field called 'Country'. The ConfigureFields script is unchanged:
out1 += in1
out1 -= in1.Country
out1.Country = unicode
The ProcessRecords script is as follows:
out1 += in1
out1.Country = fields.Country.replace('.','')
The first statement is the same as for the first example.
The second statement out1.Country = fields.Country.replace('.','') removes any period '.' characters in the data.
Following from Example 1, what if I needed to remove periods in the data but also replace 'USA' with 'US'? I can apply the replace function twice using the following:
out1.Country = fields.Country.replace('.','').replace('USA','US')
Advanced:
Sometimes a direct string replacement may not be enough if, say, the position of the text you want to find or replace is important e.g. only replace 'Rd' or 'rd' with 'Road' if it is at the end of the string.
A function that is useful in this situation is the sub function in the Python re module, which allows you to perform a find and replace of string patterns using Regular Expressions.
You can use an online search to find information for the python re module and re.sub function or view the page here:
https://docs.python.org/2/library/re.html#module-re
Similarly, an online search for regular expression cheatsheet will return a number of pages/documents such as this:
https://cheatography.com/davechild/cheat-sheets/regular-expressions/
Example:
My address data contains the common abbreviation 'rd' and 'Rd' but I want to replace it with the full word 'Road'.
The address values all end with the street suffix and I want to make sure I'm only replacing instances where there is a space before 'rd' or 'Rd' and where it is at the end of the string. This is important so that I don't change street names that contain 'rd' nor change suffixes like 'Boulevard'.
For the example I'm using the following contact data.
The Python re module is not automatically imported by the Transform node. To do this you need to specify it within your script. Since this needs to be done only once it can be included in the ConfigureFields script.
import re
out1 += in1
out1 -= in1.Address
out1.Address = unicode
The import re statement imports the module. This should be done at the start of your script. Then as before, the output metadata specification adds all the input fields to the output metadata specification, removes the input 'Address' field and then re-defines an new 'Address' field.
The ProcessRecords script is configured as shown below:
out1 += in1
out1.Address = re.sub('(\sRd|\srd)$', ' Road', fields['Address'])
First, all of the values for the input fields configured in the ConfigureFields script are copied to the output record. Then the re.sub() function is used to make the replacements using the specified regular expression match pattern.
In the match pattern '(\sRd|\srd)$' the parentheses specify the group of characters to be matched. In this case there are two alternative sets of characters that can generate a match. In the first pattern, the \s specifies that the string must contain a whitespace character then the characters Rd - the 'pipe' character '|' indicates that the pattern before the | can be matched or the pattern following the | can match and form the group. The second pattern \srd requires a whitespace character then the characters rd. The $ symbol is then used to 'anchor' the pattern to only match if the character group appears at the end of the string.
The results of the replacement are shown below. The abbreviations in records 2 and 5 have been replaced. However, the two instances of 'rd' in record 3 ('Boatyard') and record 6 ('Boulevard') have not been changed.
Regular Expressions can be extremely powerful so I do encourage you to explore their use in Data360 Analyze. If you're not familiar with Regular Expressions, don't fret! There are some great free references online to help you get started.
Also attached below is a data flow containing the examples described above.
Comments
0 comments
Please sign in to leave a comment.