
Performance Questions Data360

Comments

6 comments

  • Adrian Williams

    Hi Mark,

    We are sorry to hear about the performance issues you are experiencing.

    Regarding execution performance, you may want to investigate which nodes account for the longest processing times. You may be able to replace existing nodes with some of the new nodes available in Analyze, as these provide improved performance on large data sets, e.g.:

    • Sort: 2-3x
    • Lookup: 2-9x
    • Transform: ~1.3x
    • Merge: (vs old X-Ref) 3-4x
    • Aggregate: (vs old Agg-Ex) ~1.3x
    • Join: 3-4x
    • Tail: ~10x
    • CSV input: 2-5x
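
    One way to find the nodes that account for the longest processing times is to note each node's execution time after a run and rank them by their share of the total. The sketch below is a minimal illustration in plain Python, run outside Analyze; the function name `rank_nodes_by_time` and the timings are hypothetical placeholders, not measurements from a real data flow.

```python
from typing import Dict, List, Tuple


def rank_nodes_by_time(timings: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Return (node, seconds, share_of_total) tuples, slowest node first."""
    total = sum(timings.values())
    if total <= 0:
        return []
    ranked = sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, secs, secs / total) for name, secs in ranked]


# Hypothetical per-node execution times in seconds (placeholders only).
example_timings = {
    "Sort (Superseded)": 120.0,
    "X-Ref": 300.0,
    "Filter (Superseded)": 60.0,
    "CSV input": 20.0,
}

for name, secs, share in rank_nodes_by_time(example_timings):
    print(f"{name:<22} {secs:>7.1f}s  {share:6.1%}")
```

    The nodes at the top of the ranking are the first candidates for replacement.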

     

    Re. design-time performance: if there are many nodes in the data flow, you may get better performance by using Composites to group related nodes so that they are displayed on different 'planes' in the UI. However, if you are clocking nodes together, you should avoid the implicit many-to-many node clocking that results from clocking one composite to another, as discussed here: https://doc.infogixsaas.com/analyze/Default.htm#j-admin/composite-performance.htm

    Can you confirm what you mean by "Run in aggressive mode"? Are you referring to streaming execution? If so, this mode of operation is not supported by Analyze.

    Enhancements that will improve design-time performance are planned for upcoming Analyze releases.

  • Mark Bergsma

    Hi Adrian,

    Thank you very much for your reply and assistance. Just to confirm:

    - In order to improve processing times, you recommend "manually" replacing the various superseded nodes with the new nodes?

    - By "Run by Aggressive" I mean an option in the LAE functionality whereby the intermediate steps of the analysis flows (i.e. the outputs of the individual nodes) are not saved, and only the output of the last node executed is saved. Is this also supported by Data360?

    In addition: looking at my operating system's Task Manager, I noticed that it is the "OpenJDK Platform binary" process that consumes almost all my CPU capacity when I start Data360 and run nodes. Even when idle (Data360 open, but no data flow open), it takes almost 30% of the CPU. Is this something you have encountered before, and do you have any suggestions to fix it?

    Looking forward to your response!

    Regards,

    Mark 

     

  • Adrian Williams

    There are no tools to automatically replace legacy LAE nodes that use BRAINScript, so it would be necessary to manually replace the nodes with the equivalent Analyze node (e.g. using a Transform node in place of a legacy Transform (Superseded) [aka Filter] node). While it would be 'cleaner' to refactor an entire data flow to use the new nodes, the majority of the performance improvements may be gained by replacing just a few 'bottleneck' nodes.
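
    To judge whether replacing only a few 'bottleneck' nodes is enough, a rough estimate can be made by dividing each replaced node's measured time by an assumed speedup factor and leaving the other nodes unchanged. The sketch below is plain Python run outside Analyze; the function name `estimate_total_seconds`, the timings, and the factor of 3 for X-Ref (the low end of the Merge range quoted earlier in this thread) are illustrative assumptions, and any real gain should be confirmed by measurement.

```python
from typing import Dict


def estimate_total_seconds(timings: Dict[str, float],
                           speedups: Dict[str, float]) -> float:
    """Estimate total run time after replacing some nodes.

    timings:  node name -> measured seconds with the current nodes
    speedups: node name -> assumed speedup factor for nodes to be replaced
              (nodes not listed are assumed to keep their current time)
    """
    total = 0.0
    for node, seconds in timings.items():
        factor = speedups.get(node, 1.0)
        if factor <= 0:
            raise ValueError(f"speedup factor for {node!r} must be positive")
        total += seconds / factor
    return total


# Hypothetical timings in seconds (placeholders, not real measurements).
timings = {"X-Ref": 300.0, "Sort (Superseded)": 120.0, "Other nodes": 80.0}

before = sum(timings.values())
after = estimate_total_seconds(timings, {"X-Ref": 3.0})  # assumed 3x for the join
print(f"before: {before:.0f}s, estimated after: {after:.0f}s, "
      f"overall: {before / after:.2f}x")
```

    Because the untouched nodes still take the same time, the overall gain is always smaller than the speedup of the replaced node itself.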

    I do not know of any configuration options in Analyze that provide an equivalent of the 'Run by aggressive' mode, but I will confer internally to confirm the situation.

    When Analyze is started, the web application runs in Tomcat. This will be the OpenJDK Platform binary process that has the biggest memory footprint. It can consume a large amount of CPU time when it initially starts and continues to consume CPU time more sporadically as it 'settles down', but eventually it should reduce to a low level (as an example, see below for the current Task Manager view of my PC performance). Which antivirus software are you using? There may be an interaction with it when the application is running and writing cache files, etc., that is impacting performance.

  • Mark Bergsma

    Hi, I am still facing a lot of performance problems. It seems that the status of the app/workflow is continuously "Updating Document".

    So far I have done the following:

    - Optimised data flows by replacing superseded nodes with Python-based nodes

    - Excluded Data360.exe from real-time virus detection

    - Excluded OpenJdkPlatform.exe from real-time virus detection

    I'm getting quite desperate here: my laptop is working fine and I only face these problems when using Data360. Every action I perform in Data360 literally takes minutes to complete.

    Do you have any other suggestions on how to improve performance?

    Mark

    P.S.

    I am using Google Chrome version 88.0.4324.104 (Official Build) (64-bit) with Data360 Analyze v3.6.7.6658

  • Christina Costello

    The architecture of the Analyze Desktop product is very different from the LAE Desktop product.

    When you install the Analyze desktop product, the application footprint consists of the Tomcat web application server, the Analyze Server, and an H2 file-based database, whereas LAE Desktop consists of the BRE thick client and the LAE Server.

    This means that, at a high level, the Analyze product requires and uses more system resources than LAE Desktop.

    *** What can I do to improve the overall performance of D360? ***

    In terms of execution processing - a lot of work has been done in Analyze to improve the execution performance – we have rewritten a lot of nodes and these have therefore superseded the old nodes.

    For example:
    - Agg Ex - renamed to "Agg Ex (Superseded)" and replaced by Aggregate
    - Filter - renamed to "Filter (Superseded)" and replaced by Filter for basic filtering, and Transform for scripted transformations (now in Python and not BRAINScript)
    - All the Join nodes have been superseded and replaced with new nodes
    - Sort - renamed to "Sort (Superseded)" and replaced by Sort.

    Therefore a Data Flow in Analyze using the new nodes should perform better than the equivalent Graph using the old nodes in BRE.

    It would be interesting to see your results as you start to migrate your Graphs from .brg files into Analyze and start swapping out the old nodes for the new ones.

    *** Is there an option similar to the LAE "Run in aggressive mode" within D360 to help improve processing speed? ***

    The answer to this is currently "no". Note also that 300 million records is quite a lot to process on a desktop, especially in a larger Data Flow. In Analyze, for ad-hoc runs where you go in and run nodes interactively, interim data is always written to file and is only cleaned up when you re-run or clear the nodes.
    There is a general setting you can use to determine how long to keep ad-hoc runs, although that is specified only in <number of days>.
    For scheduled runs, there are settings for when to delete temporary data, but again these apply to completed runs rather than while a run is in progress.

    - Starting-up D360 Analyze
    Due to the difference in the product architecture, this will generally be slower than in LAE Desktop.

    - Browsing and toggling through the Directory User Interface
    This is something we know about, and will be improving in a future release.

    - Opening a dataflow / Browsing and toggling through a dataflow (e.g. entering and exiting composite nodes) / connecting nodes and/or bending connectors / Toggling through the Properties panel
    These are things that we know about and are currently working on. You should see major improvements here once we move into the 3.8 series of Analyze releases.

    - Opening data in DataViewer
    This is something we know about. Generally, larger and wider datasets will take longer to load. Again, it is something we will be looking at in the future.

    In summary…
    - Execution: should be faster than in BRE if you replace the superseded / deprecated nodes.
    - Design-time performance issues: we know about these, are actively working on them, and expect major improvements in coming releases.
    - Resource usage: due to the architectural changes, however, the overall resource usage will be higher than with LAE.

  • Mark Bergsma

    Hello Christina.

    Thank you for your quick response. 

    Regarding Runtime of LAE vs D360:

    - I used a relatively simple data flow (processing and transforming 2 datasets of approx. 5 million records, with an "x-ref" to join both sets at the end of the flow)

    - For all the superseded nodes I created an alternative node using the Data360/Python alternative

    - First I did a node-by-node comparison of the processing times of the superseded nodes vs. the alternative nodes. In some cases the alternative nodes were faster (esp. the x-ref); in other cases the superseded nodes were faster (esp. for standardizing and normalizing fields using the old "filter" vs. new "transform" nodes, as well as the filter/split nodes)

    - Then I optimized the data flow, taking the fastest node for each step

    - Finally I compared the processing time in LAE with the processing time in Data360 using the optimised data flow. Unfortunately, the LAE graph was processed significantly faster.

    - Furthermore: I realise that the earlier example I gave, processing 300M records, is quite a lot for a desktop. However, although it takes some hours using LAE, it has never given me any problems and I was always able to perform other tasks in other apps (or graphs) while the graph was running. I would expect similar performance from D360, even though the "aggressive mode" is not available
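
    The node-by-node comparison and the "fastest node per step" selection described above can be sketched as follows. This is a minimal illustration in plain Python, run outside both products; the function name `pick_fastest`, the step and variant labels, and the timings are hypothetical placeholders rather than the actual measurements.

```python
from typing import Dict, Tuple


def pick_fastest(measurements: Dict[str, Dict[str, float]]) -> Dict[str, Tuple[str, float]]:
    """For each step, return the (variant, seconds) pair with the lowest time.

    measurements: step name -> {variant name: measured seconds}
    Steps without any measurement are skipped.
    """
    chosen = {}
    for step, variants in measurements.items():
        if not variants:
            continue
        chosen[step] = min(variants.items(), key=lambda kv: kv[1])
    return chosen


# Hypothetical timings in seconds (placeholders only).
measurements = {
    "join": {"superseded": 90.0, "replacement": 30.0},
    "standardize fields": {"superseded": 40.0, "replacement": 55.0},
}

for step, (variant, seconds) in pick_fastest(measurements).items():
    print(f"{step}: use the {variant} node ({seconds:.0f}s)")
```

    Timing each variant several times on the same input helps to rule out caching and start-up effects before choosing.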

    Regarding design-time performance issues: understood. Do you have any idea when the 3.8 series is going to be released?

    Best regards,

    Mark
