This article is only relevant when using our older propriety scripting language: BrainScript. We recommend using the new Transform node which supports Python. See this article for an example of how to use the new Transform node and the Transform node section of the online documentation for further details. Also see the BRAINScript to Python section of the Help for a mapping of functions to Python.
I often get asked how to extract out particular parts of a string e.g. pulling out "Jan" from "Jan 2016" or "02210" from "100 Summer St, Boston, MA 02210".
There are multiple functions and methods in Data360 Analyze to help you do this and this article is intended to provide just a few examples to help you get started.
The functions referenced in this example include the following:
- left()
- right()
- substr()
- split()
- getItem()
- regexMatch()
You may be familiar with similar functions from your experience with other applications and languages.
The image below shows the sample data that will be referenced in the examples below:
Extract first x characters OR last y characters in a string
You can use the left() function to extract characters starting from left of a string or the right() function the extract characters starting from the right.
For example, I have a column containing month and year in the format MMM YYYY e.g. Jan 2016. I want to split out the month and year into two separate column values:
_month = 'Month'.left(3)
_year = 'Month'.right(4)
emit *, _month, _year
You want to extract all the characters starting from a specific position in the string
For example, the phone number field in your data all start with the text "Phone:” e.g. Phone: 504-621-8927, and you want to only take the phone number without the text "Phone:" but the length of the phone number may differ. We can simply take all the characters that occur after the colon.
You can use the substr() function to specify the position that the first character of interest and it will return all the characters from that position.
phonenumber = 'phone'.substr(6).trim()
emit *, phonenumber
You can use the substr() function to take a subset of the string by specifying the number of characters to include.
Note that the trim() function is used here to also remove the space character that may appear after the colon.
You want to split a string into new columns using a character as a delimiter
The name is in the format: last name, first name. We want to separate this value so that we have last and first name in two separate columns.
We can use the split() function to split a string into a list of values and use the getItem() function to extract a specific value in the list.
last_name = 'name'.split(",").getItem(0).trim()
first_name = 'name'.split(",").getItem(1).trim()
emit *, last_name, first_name
You want to extract part of a string that follows a specific pattern e.g. postal code
Sometimes you know the pattern of the text that you want to search for but you're not sure where in the string it may appear.
For example, we may want to extract the zip code from the address but due to the nature of the address field, the zip code will rarely be in the same position in each record. Instead, we can search for text in this string that is 5 consecutive digits i.e the regular expression [0-9]{5}:
zipcode = 'Address'.regexMatch("[0-9]{5}").getItem(0).getItem(0)
emit *, zipcode
You can view the examples mentioned above in your own instance of Data360 Analyze by downloading and importing the attached .lna file.
Update: The Transform node supports the Python language. You can find many resources on using Python to transform data by performing an online search. To allow you to compare the original example with a Python-based example, an additional data flow .lna file has been appended below.
Comments
0 comments
Please sign in to leave a comment.