Data360 Analyze allows you to leverage the Python language to build and run analytic applications. Depending on your use case, you can use Python scripting within a variety of nodes including the Generate Data node, the Transform node or the Python node.
You may, however, want to utilize the scheduling and orchestration capabilities of Analyze to run a Python script that exists outside of the context of Analyze. This article provides an example of how you can achieve this task using the Generate Data node.
The Generate Data node is often used to create data from a Python script when you want to perform an operation that does not require an input. A simple example of how it can be used is shown below:
The required output metadata is defined in the ConfigureFields script - here a single string field is defined to hold the result. The CreateRecords script is used to define the output value of the field(s) in each output record. In this example only a single record is being output but your script can include a loop construct to allow you to generate multiple output records, if required.
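The Analyze script editors themselves are not reproduced here; the following is a rough standalone sketch, in plain Python, of the same two-step pattern (declare the output metadata, then emit records). The field name 'script_Out' is taken from the later examples; the list-of-dicts representation is illustrative only and is not the Analyze node API:

```python
# Sketch only: plain-Python analogue of the two Generate Data scripts.

# ConfigureFields -- declare the output metadata (a single string field).
fields = [('script_Out', str)]

# CreateRecords -- emit the output record(s). Only one record is produced
# here, but a loop could generate several.
records = []
for _ in range(1):
    records.append({'script_Out': 'Hello World!'})
print(records)
```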
Initiating a Subprocess
The Python subprocess module is part of the Python standard library. It allows you to launch (spawn) a new process from within a running process, connect to the new process's input, output and error pipes and to obtain the return code (indicating the success or failure of the subprocess).
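As a minimal, self-contained illustration, the following launches a child Python interpreter and captures what it writes to standard output (an inline `-c` command is used here so no external file is needed):

```python
import subprocess
import sys

# Launch a child process and capture its standard output.
# check_output() raises CalledProcessError if the child exits with a
# non-zero return code.
output = subprocess.check_output(
    [sys.executable, '-c', 'print("spawned ok")'])
print(output.decode().strip())  # spawned ok
```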
The following script implements a 'Hello World!' program:
When run the script prints the message "Hello World!" to its output. The script is saved to a file named 'Say_Hello_World.py'. This is the external Python script that will be run within the subprocess created by the Generate Data node.
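The script itself amounts to a single print statement:

```python
# Say_Hello_World.py -- the external script that the Generate Data node
# will launch in a subprocess.
message = "Hello World!"
print(message)
```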
The Generate Data node is configured as shown below:
The first step is to import the subprocess module. Then, as for the basic example, the output metadata is defined. The subprocess module provides the subprocess.check_output() function which is used to launch a subprocess. It accepts a list of command arguments and returns the output of running the command (if there is any). The script defines a list that contains the command to be run and the arguments for the command. In this case, the 'cmdArray' list indicates the Python interpreter is to run the script specified in the second item in the list.
Note, the above example assumes the use of an Analyze Server instance and that the 'Say_Hello_World.py' file has been uploaded to the 'Public' folder in the Analyze Directory. If you are using an Analyze Desktop instance the second item in the list would need to be changed to the directory where the 'Say_Hello_World.py' file was saved, e.g.
Note also the need to escape the backslash characters in the file path.
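For example, with a hypothetical Desktop-instance location (the directory shown is illustrative only):

```python
# Each backslash in the Windows path must be escaped (doubled) inside
# the Python string literal.
cmdArray = ['python', 'C:\\Scripts\\Say_Hello_World.py']
print(cmdArray[1])  # prints C:\Scripts\Say_Hello_World.py
```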
When the Generate Data node is run, the CreateRecords script executes the subprocess.check_output() function and passes the information in the cmdArray to the subprocess. The results are then assigned to the 'results' variable, and its value is output on the node's output pin in the 'script_Out' field:
Taking it further
In addition to running an external Python script from within Analyze, the same subprocess mechanism can be used for other tasks. In the following example the Python interpreter is used to run the 'pip' package manager module and check its version number. This can be useful to confirm whether pip is installed in the Python environment.
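A self-contained sketch of the version check (sys.executable is used here so the example runs anywhere; in the node script the interpreter would be named in cmdArray as before):

```python
import subprocess
import sys

# Ask the interpreter to run pip as a module and report its version.
cmdArray = [sys.executable, '-m', 'pip', '--version']
results = subprocess.check_output(cmdArray)
print(results.decode().strip())  # e.g. pip 21.3.1 from ...
```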
Pip is used to install third-party Python packages. Here the subprocess mechanism is being used to install the 'xmltodict' Python package. The '--user' option indicates the package is to be installed for the current user which typically does not require any special privileges.
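The install command takes the same list-of-arguments shape; it is shown here without being executed, since installing a package modifies the Python environment:

```python
# Command to install the xmltodict package for the current user.
# This list would be passed to subprocess.check_output() in the same
# way as the earlier examples.
cmdArray = ['python', '-m', 'pip', 'install', '--user', 'xmltodict']
```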
Note that for Windows systems, Python packages are typically available as binary packages. However, for Linux systems, Python packages are typically only available as source packages that must be compiled for the particular machine, which may require additional operating system packages to be installed, e.g. the gcc 'C' compiler and associated packages. Packages that are written in pure Python can usually be installed as is.
You can use parameters to make external scripts more dynamic. In this example script a function called 'xml()' is defined. The function accepts an argument which specifies the file to be processed. The contents of the file are parsed and converted to JSON data. The name of the file is obtained from the (first) argument of the command supplied to the script.
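The article's script uses the xmltodict package; the following stdlib-only sketch shows the same shape without that dependency: the file to process arrives as the first command-line argument, is parsed, and the result is emitted as JSON. Note that this simplified converter would overwrite repeated sibling tags (xmltodict handles those as lists):

```python
import json
import sys
import xml.etree.ElementTree as ET

def xml_to_json(path):
    """Parse the XML file at 'path' and return its content as a JSON string."""
    def to_dict(el):
        children = list(el)
        if not children:          # leaf element: keep its text
            return el.text
        # NB: repeated sibling tags overwrite each other in this sketch.
        return {c.tag: to_dict(c) for c in children}
    root = ET.parse(path).getroot()
    return json.dumps({root.tag: to_dict(root)})

# The file name is taken from the first command-line argument, as in the
# article's 'XMLtoJSON_parameterized.py' script.
if __name__ == '__main__' and len(sys.argv) > 1:
    print(xml_to_json(sys.argv[1]))
```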
As before, the cmdArray in the Generate Data node's ConfigureFields script contains the path of the Python script to be run in the subprocess and an additional item which specifies the file to be processed by the 'XMLtoJSON_parameterized.py' script.
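The command list therefore has three items (the paths shown are illustrative; adjust them to where the files live on your Analyze instance):

```python
# Interpreter, script path, and the script's first argument (the file
# to process).
cmdArray = ['python', 'XMLtoJSON_parameterized.py', 'books.xml']
```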
The books.xml file contains catalog information on some books:
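An illustrative fragment of the kind of catalog data the file holds (the attached books.xml may differ in its exact fields and values):

```xml
<catalog>
  <book id="bk101">
    <author>Example Author</author>
    <title>Example Title</title>
    <price>9.99</price>
  </book>
</catalog>
```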
When the Generate Data node is run, the corresponding JSON data is output as a string on the node's output pin and can be passed to downstream nodes for subsequent processing, if required:
Handling Errors and Other Considerations
In some situations - depending on the command being run in the subprocess - you may obtain no output, or an output that generates an execution warning or error in Analyze. For example, in the following example the command obtains the version of the Python interpreter.
However, when run, the output of the node is a blank line. The output of the command is being written to the subprocess's stderr stream, resulting in a warning in the Analyze Errors panel:
Note the version information is being output, but it is not returned by the check_output() function.
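One way to recover that text is to redirect the child's stderr onto its stdout so it becomes part of the captured output:

```python
import subprocess
import sys

# stderr=subprocess.STDOUT merges the child's stderr into the captured
# stream, so version text written to stderr (as Python 2's --version
# does) is no longer lost.
results = subprocess.check_output([sys.executable, '--version'],
                                  stderr=subprocess.STDOUT)
print(results.decode().strip())
```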
If the subprocess fails you may receive little, if any, information on the reason for the failure. It is recommended that the Python script is tested in a standalone environment (e.g. the Python interpreter shell) to debug it as much as possible before attempting to run it from within Analyze.
Where possible avoid setting the PYTHONHOME environment variable in your profile. This can cause issues where another Python instance or incompatible Python modules are being identified. It may be necessary to unset the PYTHONHOME variable in the subprocess to prevent unwanted interaction:
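A sketch of unsetting the variable for the child process only, by passing a modified copy of the environment:

```python
import os
import subprocess
import sys

# Copy the parent environment and drop PYTHONHOME (if set) before
# launching the subprocess, so the child resolves its own libraries.
env = os.environ.copy()
env.pop('PYTHONHOME', None)

results = subprocess.check_output(
    [sys.executable, '-c', 'print("env ok")'], env=env)
print(results.decode().strip())  # env ok
```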
In the examples above, the Python environment being used is the Python environment that is bundled with the Analyze install - which is currently version 2.7.15. In the cmdArray only the name of the python command was included as it is on the default PATH. In the case where you want to run a different instance of the Python interpreter (e.g. Python 3.8) in the subprocess you will need to include the absolute path to the Python executable. You may also need to modify the PATH to align with the required locations to be searched for executable libraries, for example by modifying the existing PATH:
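A sketch of both adjustments, using a hypothetical interpreter location (the path below is illustrative, not a real install location):

```python
import os

# Hypothetical absolute path to an alternative Python 3.8 interpreter.
py38 = '/opt/python3.8/bin/python3.8'

# Prepend its directory to the PATH that the subprocess will see.
env = os.environ.copy()
env['PATH'] = os.path.dirname(py38) + os.pathsep + env.get('PATH', '')

# The command now names the interpreter by its absolute path.
cmdArray = [py38, 'Say_Hello_World.py']
# subprocess.check_output(cmdArray, env=env) would then run the script
# under the 3.8 interpreter (not executed here, as the path is illustrative).
```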
The examples described in this article are included in a data flow (.lna) file within the attached zip archive. Analyze 3.6.0 or above is needed to import the .lna file. The zip file also includes the Python script files and the XML data file used by the examples. The .lna file should be imported into the Public folder of the Analyze directory, and the Python script files and XML data file should be uploaded to the Public data directory on the Server instance. If you are using an Analyze Desktop instance, the Python script files and XML data file should instead be saved in an accessible directory, and the file paths specified in the Generate Data node scripts must be modified to match the location where the files have been saved.