The Plugin node allows custom operations to be introduced into the processing pipeline of Data360 DQ+. These operations can implement custom business logic, including interactions with external systems such as file system operations and API invocations.
A custom plugin processes the data flowing through the pipeline (passed from the prior node) and can generate output data whose structure differs from that of the input record. The plugin output can comprise multiple output records for each input record (i.e. a flatMap operation).
The plugins are stateless and each input record is passed iteratively to the plugin class.
Custom node plugins can also have parameters that can be used to influence the processing logic for each instance of the plugin.
Note that installing a plugin requires server-side access to Spark's library directory, which is only accessible for DQ+ deployments within customer environments and not deployments within the Infogix cloud.
SDK for creating custom Plugin nodes
Plugins can be created using an SDK that provides the essential interfaces and examples for building a plugin: Download DQ+ SDK package. The SDK package includes a sample plugin and its source code.
The SDK directory includes the following Interfaces that the plugin class should implement:
Interface | Description
INodeCustomPlugin | The plugin class is a specific implementation of this interface.
IRecordWrapper | This interface allows a record type to be accessed.
ICustomSparkPlugin | Optionally implement this interface to extend Apache Spark functionality.
In addition, the SDK includes a RecordWrapper class, an implementation of IRecordWrapper, to allow construction of the output records returned by the plugin.
The plugins that implement the INodeCustomPlugin are deployed as Jar files in the application server and are dynamically instantiated when a custom node specifying the plugin class is configured by the user.
Each concrete implementation of the INodeCustomPlugin can comprise multiple classes and external dependencies, all of which should be included in the deployment Jar.
Optionally, the plugin can implement the ICustomSparkPlugin interface to extend Apache Spark functionality; this gives the plugin the ability to register custom Spark UDFs for use in Spark SQL executions.
Development requirements
A new Plugin Class can be implemented using any modern Java IDE. Typically the following will be required to create and distribute a new custom node plugin:
- JDK 1.8
- A Java IDE like Eclipse, IntelliJ, NetBeans etc.
- The interface classes that the plugin should implement, available on the classpath.
Plugin class
The plugin class must implement the INodeCustomPlugin interface.
The table below lists the methods defined in INodeCustomPlugin.
Method | Purpose
getName | Returns a descriptive name of the plugin to be displayed if errors are encountered during processing.
execute | The main implementation of the plugin resides here. The parameters to this method are the plugin properties (key-value pairs configured in the UI), the input record (IRecordWrapper) passed from the prior node, and the output field definitions. The method returns a list of output records (zero or more for each input record).
The template below can be used to create a basic plugin object:
package com.infogix.component.sdk.examples;

import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.infogix.component.executionengine.export.spark.operations.RecordWrapper;
import com.infogix.component.bombase.export.customplugin.IRecordWrapper;
import com.infogix.component.bombase.export.customplugin.INodeCustomPlugin;

public class NodeCustomPluginExample implements INodeCustomPlugin {

    public static final String PLUGIN_NAME = "My plugin example";

    /**
     * @return User friendly name for the plugin
     */
    public String getName() {
        return PLUGIN_NAME;
    }

    /**
     * Sample plugin template
     * @param properties: Collection of key-value pairs passed from the UI.
     * @param inputRecord: Input record to be processed by the plugin
     * @param outputFields: Output fields
     * @return: List of output records
     */
    public List execute(Map<String, Object> properties,
                        IRecordWrapper inputRecord,
                        Map<String, Object> outputFields) {
        /*
         * Write your custom business logic here
         */
        List outputRecords = new ArrayList<>();
        return outputRecords;
    }
}
Optionally, the plugin can also implement the ICustomSparkPlugin interface; it should only do so if it needs to extend Spark functionality, for example with UDFs.
The table below lists the methods defined in ICustomSparkPlugin.
Method | Purpose
registerUDFs | Registers custom UDFs for use in Spark SQL executions.
isPassthrough | Returns true if the plugin does not intend to process input records and the input should simply pass through to the next node.
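As a minimal sketch, a plugin that also implements ICustomSparkPlugin might look like the following. The details shown here are assumptions (for example, that registerUDFs receives a SparkSession and that the interface lives in the same package as INodeCustomPlugin); consult the interface definitions shipped in the SDK for the exact contract.

package com.infogix.component.sdk.examples;

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import com.infogix.component.bombase.export.customplugin.ICustomSparkPlugin;

// Sketch only: the registerUDFs signature and import path are assumptions;
// follow the ICustomSparkPlugin definition shipped in the SDK.
public class SparkPluginExample extends NodeCustomPluginExample implements ICustomSparkPlugin {

    public void registerUDFs(SparkSession spark) {
        // Register a simple UDF that upper-cases a string for use in Spark SQL.
        spark.udf().register("to_upper",
                (UDF1<String, String>) value -> value == null ? null : value.toUpperCase(),
                DataTypes.StringType);
    }

    public boolean isPassthrough() {
        // Input records pass unchanged to the next node; this plugin
        // only contributes the registered UDFs.
        return true;
    }
}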
Exception handling
The plugin should validate the input data prior to execution to avoid any undesirable behavior.
Since the plugin parameters are not validated from the UI, the developer should specifically check the parameter values and ranges before proceeding to use them in the business logic.
The UI will not detect a missing parameter or warn the user about an incorrect value or type. These situations should therefore be handled by the plugin code and an appropriate exception raised to the application.
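As a minimal sketch, such checks could be performed at the start of the execute method from the template above; the parameter name "threshold" is hypothetical and only illustrates the pattern.

    /**
     * Illustrative validation at the start of execute(); the parameter name
     * "threshold" is hypothetical and only demonstrates the pattern.
     */
    public List execute(Map<String, Object> properties,
                        IRecordWrapper inputRecord,
                        Map<String, Object> outputFields) {

        Object rawThreshold = (properties == null) ? null : properties.get("threshold");
        if (rawThreshold == null) {
            throw new IllegalArgumentException(
                    PLUGIN_NAME + ": missing required parameter 'threshold'");
        }

        final int threshold;
        try {
            threshold = Integer.parseInt(rawThreshold.toString().trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    PLUGIN_NAME + ": parameter 'threshold' must be an integer, got '" + rawThreshold + "'", e);
        }
        if (threshold < 0) {
            throw new IllegalArgumentException(
                    PLUGIN_NAME + ": parameter 'threshold' must be zero or greater");
        }

        List outputRecords = new ArrayList<>();
        // ... business logic using the validated threshold goes here ...
        return outputRecords;
    }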
Compiling and packaging the plugin class
Execute the commands below to compile the plugin and create the Jar file for deployment. If you use a different directory layout or classpath, the paths will vary.
javac -d bin -cp lib/execution-engine-plugin.jar src/com/infogix/component/sdk/examples/*.java
jar.exe -cvf NodeCustomPluginExample.jar com\infogix\component\sdk\examples\*.class
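The jar command above uses Windows paths and assumes it is run from the directory containing the compiled classes (the bin output directory). On Linux/macOS, or to package directly from the project root, an equivalent pair of commands might look like the following sketch; the directory names are assumptions based on the layout above.

javac -d bin -cp lib/execution-engine-plugin.jar src/com/infogix/component/sdk/examples/*.java
jar -cvf NodeCustomPluginExample.jar -C bin com/infogix/component/sdk/examples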
Plugin lifecycle
DQ+ implements lazy loading for custom node plugins; after installation, no validation or other checks for the plugin's existence are performed until it is configured for use as a custom node in DQ+.
Once a new custom node is defined in the application (of either output type or plugin type), the existence of the plugin and its conformance to the required interface is validated. If the plugin class cannot be located or does not implement the required interface, an exception is raised and the custom node cannot be saved until the exception is resolved.
On successful configuration, the plugin is instantiated using its constructor, and the application calls the execute method when the stage is executed. The plugin object is destroyed once the output is returned, and the next record flowing through the plugin node will load the plugin again.
Each call to the plugin therefore starts with a clean internal state; any state that must persist across records has to be stored in an external data store or file system.
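Because no in-memory state survives between records, any information that must persist across the run (counters, audit entries, and so on) needs to go to an external store. The helper below is a minimal illustrative sketch that appends to a local file; the class, method, and path are hypothetical, and on a multi-node cluster a shared data store is usually the better choice.

package com.infogix.component.sdk.examples;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Illustrative only: appends one line per processed record to an external file.
// The path is hypothetical; on a multi-node cluster a shared store
// (database, object storage, etc.) is usually a better choice.
public class ExternalStateExample {

    private static final String AUDIT_FILE = "/tmp/plugin-audit.log";

    public static void recordProcessed(String recordId) {
        try {
            Files.write(Paths.get(AUDIT_FILE),
                    (recordId + System.lineSeparator()).getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new RuntimeException("Failed to persist plugin state", e);
        }
    }
}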
Plugin deployment, upgrade and uninstall
Once a plugin has been developed and tested, the plugin class may be deployed in DQ+ by placing the distribution Jar in the $SPARK_HOME/jars directory of the server. This directory is only accessible for DQ+ deployments within customer environments and not deployments within the Infogix cloud.
Ensure that the deployed Jars contain all required dependencies and that no conflicts exist with the existing Jars in the deployment location.
Since plugins are instantiated and validated on first use, the installation itself performs no validation and raises no errors.
A plugin can be upgraded by replacing the existing Jars with the new version.
Uninstalling a plugin is achieved by simply removing the installed Jars from the server, after ensuring that no custom node configuration referencing the specific class exists. Any dependencies deployed along with the plugin should also be removed (unless they are shared with other deployed plugins).
UI configuration
The Plugin class can be specified from the UI in one of the following ways:
Input/Output -> Output Plugin
This is a node type that does not pass output fields to other nodes and will typically be the terminal node in the processing pipeline. This type of node is accessed from the “Input/Output” menu.
Enhance -> Plugin
This type of node will allow the plugin to pass output fields to subsequent nodes.
Plugin configuration
Once a custom node plugin is selected, the user has to specify the fully qualified name of the class and the plugin parameters from the properties panel as shown below.
Choosing between Output and Enhance Plugin
The choice between the two types of nodes is determined by whether the node terminates the processing in the pipeline (and consequently has no child node connected), or whether the node performs a data transformation whose output is utilized by subsequent child nodes.
The Enhance plugin node generates an output record that can be specified by the user from the Output tab as shown below. The output fields can be chosen from existing input fields or can be entirely new fields.
Operation modes
The plugin class will be loaded by DQ+ in the following two modes:
- Interactive Mode: This mode is used when the plugin class is specified as a custom node from the UI during the definition stage. In this mode, the plugin is loaded, but the execute method is not called. The input data to and from the plugin are not accessed and therefore the UI will not display the processed output of the plugin.
- Normal Mode: This mode is used during the execution of the analysis where the plugin class is instantiated and will process the record passed from the prior node to generate the output for subsequent nodes. The plugin class in this mode is deployed in the cluster for distributed processing of the records by the cluster manager.
Security
The custom node plugin is executed in the security context of the application. Access to OS and external resources is available with similar permissions as the application user.
The INodeCustomPlugin interface does not provide privileged access to any DQ+ methods; DQ+ functionality should be accessed via the published APIs, and any required authentication must be performed by the plugin class.
Development testing
Prior to deployment in the production environment, the plugin should be tested for the following:
- Check that the plugin implements the required interfaces.
- The plugin validates for missing or invalid parameter values.
- The plugin is tested using a standalone test application that simulates the input record and validates the plugin output for correctness (see the sketch after this list).
- Plugin performance is checked to ensure resource usage and execution time fall within the expected constraints for the business logic implemented by the plugin.
- The plugin is built targeting the Java version required by the application.
- Check that all dependencies are included in the deployment Jar.
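A standalone test driver might look like the sketch below. It exercises only the execute signature from the template above; passing null for the input record is a simplification, and a full test would construct a representative record using the SDK's RecordWrapper class.

package com.infogix.component.sdk.examples;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal standalone driver for pre-deployment checks.
// Passing null for the input record is a simplification; a full test would
// build a representative record using the SDK's RecordWrapper class.
public class NodeCustomPluginExampleTest {

    public static void main(String[] args) {
        NodeCustomPluginExample plugin = new NodeCustomPluginExample();

        // The plugin should report a descriptive name.
        System.out.println("Plugin name: " + plugin.getName());

        // The plugin should reject missing or invalid parameters up front.
        Map<String, Object> emptyProperties = new HashMap<>();
        try {
            List output = plugin.execute(emptyProperties, null, new HashMap<>());
            System.out.println("Output records: " + (output == null ? 0 : output.size()));
        } catch (IllegalArgumentException expected) {
            System.out.println("Parameter validation rejected empty properties as expected: "
                    + expected.getMessage());
        }
    }
}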
Performance considerations
The plugin class is loaded for each record, so there is an overhead for initializing the object. To avoid this becoming a performance bottleneck, the plugin class should be designed to be lightweight, and any resource-heavy business logic should ideally be moved to external services running in a separate process.
The plugins are also executed across multiple nodes in a cluster, and the resource allocation and constraints are similar to those available to other nodes (controlled by the cluster manager, YARN). It is possible to control the partition size from the UI by specifying the # of partitions. This configuration will resize the input records to the configured number of partitions if the parent node had a larger number of partitions.
Changes to this parameter result in redistribution of data across the cluster to adjust to the new partition size and may incur a performance penalty if not applied correctly. Leave this setting at the default value of 512 to retain the original partitioning from the parent node.
The partition size also determines the # of parallel tasks issued to execute the plugin node and is independent of the total number of nodes in the cluster. An incorrectly sized value could result in underutilization of the cluster resources.
Plugin considerations
When developing a plugin, the plugin developer should provide a basic guide that allows users to specify the following information correctly in the UI:
- Fully qualified plugin class name
- List of Parameters, their types and acceptable values.
- Output fields
These values will be unique per plugin and must match the names and values expected by the plugin class.
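For example, the guide for the sample plugin shown earlier might read as follows (the parameter and output field are purely illustrative):
- Plugin class name: com.infogix.component.sdk.examples.NodeCustomPluginExample
- Parameters: threshold (integer, must be zero or greater)
- Output fields: status (string)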