There are many infrastructure variables that affect the performance of Assure, but rules writers typically have no control over them (server memory, processor cores, clustering vs. non-clustering, etc.). The purpose of this page is to share ways to increase the performance of Assure controls through the rules themselves. Note that the improvements may be insignificant with small datasets.
Limit Text Control Field Lengths When Possible
There are two main reasons why it's good to limit the length of control fields in Assure.
- ID fields are indexed in the database, so the smaller the total field size, the better the back-end performance. For example, if a multiple capture detail control point has two text ID fields, and neither field will ever exceed 8 bytes, then both ID fields should be set to a length of 8. If left unset, the default length depends on whether Assure uses an Oracle or SQL Server database, but either way the number is very large, and the database needs more resources to index it. A text ID field created without a length will be stored in Oracle as VARCHAR2(765) and in SQL Server as nvarchar(255).
- For Assure deployments using Oracle as the database, controls with a large number of text fields without a defined length can cause the Oracle driver to hold a large amount of buffer memory during execution. These large Oracle buffers could eventually cause Assure to run out of memory if enough control points for entities with such a large number of text fields execute simultaneously. The following extract, taken from an Oracle JDBC Memory Management white paper, illustrates how the driver allocates memory (a rough calculation of the buffer size follows the quote):
"Character data is stored in the char[] buffer. Java chars take up two bytes per character. A VARCHAR2(10) column will contain a maximum of ten characters, or ten Java chars or 20 bytes per row. A VARCHAR2(4000) column will take 8K bytes per row. What matters is the defined size of the column, not the size of the actual data. A VARCHAR2(4000) column that contains all NULLs will still take 8K bytes per row. The buffers are allocated before the driver sees the query results so the driver must allocate enough memory for the largest possible result. A column defined as VARCHAR2(4000) can contain up to 4000 characters. The buffer must be allocated large enough to hold 4000 chars, even if the actual result data is not so large."
Reduce Function Usage On Large Inputs
When functions are used during a source capture or reconciliation, a small amount of additional processing is needed. When very large volumes of data are being processed, the number of function calls can significantly increase capture/processing times. Sometimes it is inevitable that certain functions need to be used on every record. In this scenario, take some time to ensure you're not using more functions than required.
Scenario: A two million record input source is being captured. There is a date field in the input, and the date values follow no predictable pattern from one transaction to the next. The business requirement is to only capture records where the date field falls on a Saturday or Sunday.
Approach 1 (worse)
@day_of_week ( some.date ) = 6 or @day_of_week ( some.date ) = 7
Approach 2 (better)
@day_of_week ( some.date ) >= 6
The second approach is better because it cuts the number of @day_of_week calls in half: one call per record instead of two, since Assure evaluates both sides of the "or". Assuming the function takes half a millisecond to execute, and assuming a two million record input, the second approach saves almost 17 minutes during capture.
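The 17-minute figure can be reproduced with simple arithmetic; the half-millisecond cost per call is the assumption stated above, not a measured value.

```java
// Back-of-the-envelope check of the savings claimed above.
public class FunctionCallSavings {
    public static void main(String[] args) {
        long records = 2_000_000L;   // size of the input source
        double callMillis = 0.5;     // assumed cost of one @day_of_week call
        long callsSaved = records;   // approach 2 makes one call per record instead of two
        double minutesSaved = (callsSaved * callMillis) / 1000.0 / 60.0;
        System.out.printf("Approximate time saved: %.1f minutes%n", minutesSaved); // ~16.7
    }
}
```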
Use Functions in "On Begin" When Possible
When functions are used during a capture, a small amount of additional processing is needed. When the function is called for every record in the input, the total additional time grows with the number of records. In certain scenarios, the function can instead be placed in the "On Begin" section of the source extract and be calculated only once.
For the two examples below, assume that a control field is populated based on a portion of a runtime field. The part of the runtime field needed can be obtained using the @t_between() function; in this example it's an arbitrary file identifier (we're interested in the "identifier" portion between 'myfile.' and '.txt'):
c:\some\directory\myfile.identifier.txt
Example 1 (worse): The @t_between function would be called _n_ times. If the file had hundreds of thousands of records (or more), this is a lot of additional overhead, especially considering the function returns the same value every time:
On Begin:
When Layout Type
    Select When > true
    Compute > SomeCF = @t_between ( runtime.FilePath , 'myfile.' , '.txt' )
Example 2 (better): The result of the function is stored in a work field, so no additional computation is needed after the function is executed in the "On Begin" section. Instead of executing _n_ times, it executes exactly one time:
On Begin:
    Work Field Compute > MyWorkField = @t_between ( runtime.FilePath , 'myfile.' , '.txt' )
When Layout Type
    Select When > true
    Compute > SomeCF = work.MyWorkField
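The same principle applies in any per-record loop: a value that never changes across records should be computed once before the loop. The sketch below is a generic illustration of the idea, not Assure code; `extractIdentifier` is a hypothetical stand-in for @t_between.

```java
import java.util.List;

// Generic illustration of the "On Begin" pattern: hoist an invariant computation
// out of the per-record loop so it runs once instead of n times.
public class HoistInvariant {

    // Hypothetical stand-in for @t_between ( path , after , before ).
    static String extractIdentifier(String path, String after, String before) {
        int start = path.indexOf(after) + after.length();
        int end = path.indexOf(before, start);
        return path.substring(start, end);
    }

    static void capture(List<String> records, String filePath) {
        // "On Begin": computed once, stored like a work field.
        String identifier = extractIdentifier(filePath, "myfile.", ".txt");
        for (String record : records) {
            String someCf = identifier; // equivalent of Compute > SomeCF = work.MyWorkField
            // ... populate the remaining control fields for this record ...
        }
    }
}
```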
Limit Storing Control Data in Scan Controls
By default, a new scan control point has "Store Control Data" set to "Yes". This means that a write operation occurs for each control data instance retrieved, even if nothing changes. If the control entity has "Control Data History" set to the default "Control Point", even more I/O takes place as history is written to the control data history table. If detailed point-to-point instance history is not needed, it can be turned off.
If a scan control needs to run multiple times each day, looking at thousands or more control data instances each run, then with the default "Store Control Data" behavior millions of new history rows could be written within a single day. Not only does this require substantially more database space, it also causes more CPU and memory consumption while the control is running.
The solution is to only store one point of data for an instance ("Control Data History" set to "None") and to only store control data when needed, for example when certain criteria must pass in order for a change to be made.
Example 1 (worse). The entity containing a scan control has "Control Data History" set to "Control Point", and the scan CP has "Store Control Data" set to "Yes". Example validation rule:
My validation >
    When @notnull ( current.SomeField )
    When Complete > Control Field Change > SomeField = current.SomeField + 1
Example 2 (better). The entity containing a scan control has "Control Data History" set to "None", and the scan CP has "Store Control Data" set to "No". This requires us to explicitly define when a control data instance should be updated, and point-to-point history is not maintained. Example validation rule:
My validation >
    When @notnull ( current.SomeField )
    When Complete > Control Field Change > SomeField = current.SomeField + 1
    Store Control Data
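Stripped of Assure terminology, the difference between the two examples is "write every instance on every run" versus "write only when something actually changed or a criterion passed". Below is a minimal sketch of that pattern, with hypothetical `Instance` and `Store` types standing in for the control data instance and the database write.

```java
import java.util.Objects;

// Illustration of "store control data only when needed": skip the write (and any
// history row) when nothing has changed.
public class ConditionalWrite {

    static class Instance { Integer someField; }
    interface Store { void save(Instance instance); }

    static void update(Instance instance, Integer newValue, Store store) {
        if (Objects.equals(instance.someField, newValue)) {
            return; // unchanged: no write, no history
        }
        instance.someField = newValue;
        store.save(instance); // only now does the I/O happen
    }
}
```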
Consider Nested Evaluations for Multiple Criteria Selection
When Assure is given an equation to evaluate which contains multiple criteria, all criteria will be evaluated. This may seem simple, but consider what that means. Look at the following fictional Select When scenario:
Select When > my.Field1 = true and my.Field2 = false and @t_contains( my.LargeField , 'Some String' )
    **do something**
In the above equation, three separate evaluations are taking place:
1. ( my.Field1 = true ) is true
2. ( my.Field2 = false ) is true
3. ( @t_contains( my.LargeField , 'Some String' ) ) is true
The first two evaluations are very simple in terms of processing. It's almost like asking the CPU to tell us if 1=1. This means they're fast. What about the third evaluation? It's a function that parses a text field to determine if it contains a certain string. Depending on the length of the field and the number of records in the input, performing this function over and over could add significant overhead.
The order that Assure will perform these three evaluations is not guaranteed, and regardless of order, all three parts in this example will be evaluated before the "AND" logic is performed.
The alternative is to use nested evaluation logic when it is available and when there is an obvious piece of the equation that is more resource-intensive than the others. This is accomplished by putting the simple piece at the top level, and then nesting the pieces of the equation that don't need to be evaluated if the _easier_ pieces don't pass. Here is the same logic separated into a nested evaluation:
Select When > my.Field1 = true and my.Field2 = false
    Evaluate >
        When Condition > @t_contains( my.LargeField , 'Some String' )
            **do something**
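The nested form behaves like short-circuit evaluation in most programming languages: the expensive check only runs for records that already passed the cheap checks. A generic sketch of the same logic:

```java
// Generic illustration of why the nested form is cheaper: the expensive text scan
// runs only when the two cheap comparisons have already passed.
public class NestedEvaluation {
    static boolean selectRecord(boolean field1, boolean field2, String largeField) {
        if (field1 && !field2) {                       // cheap comparisons first
            return largeField.contains("Some String"); // expensive check, nested
        }
        return false;
    }
}
```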
Reduce Translation Rule Usage on Sorted Inputs
Translation lookups add overhead during the capture/extract phase of a control execution. If the input source has millions of records, the additional overhead could result in a measurable slowdown. Sometimes this is unavoidable, but other times the usage of a translation rule can be mitigated. Assure extraction does not cache translation inputs/outputs; it will do the translation lookup each time.
Consider this scenario: an input source with millions of transactions needs to be captured. One field (Field_A) needs to be translated. The input source is more or less sorted by the Field_A value, meaning the same Field_A value usually appears on consecutive records. Example:
| Identifier | Field_A | Field_B |
|------------|---------|---------|
| 0010       | AA      | XC      |
| 0320       | AA      | EZ      |
| 0045       | AA      | LL      |
| 0009       | BC      | DP      |
| 0150       | BC      | QA      |
Performing a lookup for each of these consecutively-similar Field_A values can be eliminated via the use of work fields. Store the Field_A value into a work field, and then on the next record, only perform the lookup if Field_A has changed. Here is an example:
Evaluate >
    When Condition > input.Field_A != work.Previous_Field_A
        Translate > From MY_LOOKUP_TABLE
            Save Result As LookupResult
            IN_Something > input.Field_A
        Work Field Compute > Previous_Field_A = input.Field_A
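In general-purpose code the same trick is simply caching the last key and its lookup result while streaming a sorted input. The sketch below is illustrative; `lookup` is a hypothetical stand-in for the translation table.

```java
import java.util.List;
import java.util.function.Function;

// Illustration of the work-field trick: on an input sorted by key, perform the
// lookup only when the key differs from the previous record's key.
public class SortedLookupCache {
    static void translate(List<String> sortedKeys, Function<String, String> lookup) {
        String previousKey = null;
        String lookupResult = null;
        for (String key : sortedKeys) {
            if (!key.equals(previousKey)) {
                lookupResult = lookup.apply(key); // only when Field_A has changed
                previousKey = key;                // equivalent of work.Previous_Field_A
            }
            // ... use lookupResult for this record ...
        }
    }
}
```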
What not to do: Somebody familiar with the @map_put() and @map_get() functions may think that storing indexes of the returned values into maps (hash maps), and then checking the map for a value (Field_A in the above scenario) before translating, would increase performance. Although it would reduce the number of lookups, the memory cost would be roughly double that of using the Translation rule alone. Translation lookups in Assure are already more efficient than in its predecessor (ACR): Assure utilizes the Hypersonic database library for complex embedded translation tables (the table is held in memory) and a hash map for simple translations.
Modify a Control's Execution Profile
In Assure 8.4 and newer, execution profiles can be used to fine-tune the resources used by controls. The setting is found under a Control Point's Advanced settings.
The appropriate setting is dependent on the number of instances a control captures. An explanation of each setting is below:
Default: This option defers to the installation-wide default execution profile, which is set by the DEFAULT_CONTROLENTITY_EXECUTION_PROFILE property within the IA.properties config file (see the snippet after this list). Out of the box, that property is set to Extra Large, which is also the behavior all controls used in Assure versions before execution profiles existed.
Small: All processing happens on the main thread of execution; in other words, sequential processing. Bulk loading is not utilized, and Assure will not trigger database query optimizations. This setting is recommended for controls that typically process no more than a few hundred instances.
Medium: Minimal multi-threading is utilized. Bulk loading is not utilized, but Assure will trigger database query optimizations. This setting is recommended for controls that typically process up to a few thousand instances.
Large: More aggressive multi-threading, roughly twice as much as Medium. Assure will utilize bulk loading if enabled, and Assure will trigger database query optimizations. This setting is recommended for controls that typically process up to tens of thousands of instances.
Extra Large: Aggressive multi-threading, roughly twice as much as Large. Assure will utilize bulk loading if enabled, and Assure will trigger database query optimizations. Any controls regularly capturing over 50,000 instances per execution should use this setting.
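For reference, the installation-wide default mentioned above lives in IA.properties. The property name comes from this page; the exact value string accepted by your version is an assumption here, so confirm it against your installation before changing it.

```
# IA.properties (snippet) -- installation-wide default execution profile.
# Value format is assumed; verify the accepted values for your Assure version.
DEFAULT_CONTROLENTITY_EXECUTION_PROFILE=Extra Large
```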
Avoid Overuse of Control Field Updates
Controls process record-by-record, so when the Control Field Update code is used within a control's logic, just that one record will have its data updated. If your control is capturing 20,000 records and half of them result in a Control Field Update, then Assure will execute 10,000 SQL statements. This can add significant overhead. When Control Field Updates are needed, always ensure that the most common field value is the default value so the frequency of updates is minimized.
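As a rough sense of scale, the sketch below counts how many Control Field Updates (and therefore SQL statements) a run would issue for a given choice of default value; the field values and counts are illustrative.

```java
import java.util.List;

// Illustration: every record whose final value differs from the field's default
// costs one Control Field Update, i.e. one SQL statement.
public class UpdateCount {
    static long updatesNeeded(List<String> finalValues, String defaultValue) {
        return finalValues.stream()
                          .filter(value -> !value.equals(defaultValue))
                          .count();
    }
    // If 18,000 of 20,000 captured records end up with the same value, making that
    // value the default means ~2,000 updates instead of ~18,000.
}
```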
Best Practices for Scheduling Controls and Retention
By default, Assure runs retention for all controls at midnight. This is a very heavy processing time for the database, especially in production environments where a large number of controls run regularly and large amounts of data are stored.
The database load can be lightened by creating additional Calendars within Assure and assigning controls to run their retention on those alternative Calendars.
The additional Calendars can be scheduled for different times of the night to spread the database load across a larger time frame, rather than having all retention start at one point in time. It is also recommended to schedule controls at a time when retention isn't running.
In Assure 8.1 and newer, you can also utilize the "Process & Memory Viewer" within the Assure Admin Console to view when various processes (controls) are running at different times of the day, along with the JVM's memory utilization during that time. This will help determine the times of day with more server availability, which would be the recommended times to schedule new controls.
An example of the Process & Memory viewer is below, with the process (control) names blanked out: