This article focuses on the execution of data flows. The web application itself also adds overhead. If the Analyze instance is used for both scheduled data flow execution and development, the CPUs face additional demand from users who are editing data flows and running them ad hoc.
LIMITING CPU USAGE:
1. You can limit CPU usage by adding Run Dependencies, also known as clocks, to the data flows.
This limits the number of threads running in a particular data flow. For more information on clocks and how to implement them, see here.
Please note: direct composite-to-composite clocking is not recommended.
2. Also check what the thread limit is currently set to on your machine. If clocking does not reduce CPU usage enough, consider lowering the number of concurrent threads. The default is 4.
See this document for steps on configuring the thread limit.
A rule of thumb is to have at most 4 nodes running per CPU (as seen by the OS). This limit applies across all data flows running on the system at any one time.
The thread limit configuration specifies the maximum number of nodes that can run simultaneously within a single data flow. So if you have 8 CPUs, you should not attempt to run more than 8 x 4 = 32 nodes at once across the system; with the thread limit set to 4, that corresponds to at most 8 data flows each running at their full limit.
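The arithmetic behind this rule of thumb can be sketched in a few lines of Python. This is an illustrative calculation only and not part of the product: the function names and the NODES_PER_CPU constant are hypothetical, and the factor of 4 is simply the rule of thumb quoted above.

```python
# Hypothetical helpers illustrating the rule of thumb; not a product API.
NODES_PER_CPU = 4  # rule of thumb: at most 4 running nodes per CPU seen by the OS


def node_budget(cpu_count: int, nodes_per_cpu: int = NODES_PER_CPU) -> int:
    """Maximum number of nodes that should run at once across all data flows."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return cpu_count * nodes_per_cpu


def max_full_limit_flows(cpu_count: int, thread_limit: int) -> int:
    """Number of data flows that can each run at their full thread limit
    without exceeding the system-wide node budget."""
    if thread_limit < 1:
        raise ValueError("thread_limit must be at least 1")
    return node_budget(cpu_count) // thread_limit


if __name__ == "__main__":
    print(node_budget(8))              # 32 nodes in total
    print(max_full_limit_flows(8, 4))  # 8 data flows at a thread limit of 4
```

Lowering the thread limit does not change the overall budget; it only changes how many data flows can share that budget before the machine is oversubscribed.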
When it comes to scheduling, you should consider:
- how many data flows are scheduled to run simultaneously. Keep in mind that data flows differ in complexity: a data flow with only one or two nodes places far less load on the server than one with a thousand nodes.
- how many nodes are running in parallel within each data flow. It may be necessary to add run dependencies (clocks) to limit the number of nodes running at any one time within a particular data flow.
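As a rough planning aid, the two considerations above can be combined into a worst-case estimate for a set of data flows scheduled at the same time. The sketch below is a hypothetical calculation, not a product feature: the function names, the data flow names, and the per-flow node counts are invented for illustration, and it assumes each data flow is capped by the thread limit and that the rule of thumb of 4 nodes per CPU applies.

```python
from typing import Mapping


def peak_parallel_nodes(flows: Mapping[str, int], thread_limit: int) -> int:
    """Worst-case number of nodes running at once for flows scheduled together.

    `flows` maps a data flow name to the most nodes it could run in parallel
    (for example, after clocks have been added). Each data flow is capped by
    the thread limit.
    """
    return sum(min(parallel, thread_limit) for parallel in flows.values())


def fits_budget(flows: Mapping[str, int], thread_limit: int,
                cpu_count: int, nodes_per_cpu: int = 4) -> bool:
    """True if the worst case stays within the rule-of-thumb node budget."""
    return peak_parallel_nodes(flows, thread_limit) <= cpu_count * nodes_per_cpu


if __name__ == "__main__":
    # Made-up example schedule: three data flows starting at the same time.
    schedule = {"nightly_load": 6, "small_report": 1, "reconciliation": 3}
    print(peak_parallel_nodes(schedule, thread_limit=4))        # 4 + 1 + 3 = 8
    print(fits_budget(schedule, thread_limit=4, cpu_count=2))   # True: 8 <= 8
    print(fits_budget(schedule, thread_limit=4, cpu_count=1))   # False: 8 > 4
```

If the estimate exceeds the budget, the options are the ones already described: stagger the schedules, add clocks to the widest data flows, or lower the thread limit.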
CPU HEAVY NODES:
The Sort and Agg nodes typically consume the most CPU time. Join nodes can also use significant processing time, and some nodes, such as the Pivot Data to Names node, can take a long time to run.
Some transformations inherently require more CPU cycles.