A generic rule for allocating nodes, clusters, executor memory overhead %, core count, driver memory, and instance count based on the nature of the analysis:
The sizing is based on the number of files, the total size in bytes of all files, and a complexity metric that reflects how resource intensive a node is. The number of files is usually not a significant factor unless there are thousands of them. The complexity metric is highest for analytics nodes when they are used for training, followed by nodes such as join, sort, co-group, group by, group, in, notin, script, distinct, sql, and plugin.
For custom sizing, we should not normally change executor memory overhead, executor core count, executor memory per core, or driver memory. The instance count and instance type are the settings you will want to adjust more often.
A good rule of thumb: take the uncompressed data size in MB (compressed size multiplied by about 10) and divide it by 1.25 GB if caching is on, or by 2 GB if caching is off, to get the number of cores needed to process the data. Then count the number of complex nodes. Assuming about 5 complex operations per core, divide the number of complex operations by 5; if the result is greater than 1, multiply the core count calculated earlier by that factor to get a new core count. Finally, take the memory per core, add the overhead %, and multiply by the number of cores to get the total memory needed.
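The rule of thumb above can be sketched as a small calculation. This is a minimal illustration of the steps as described, not an authoritative sizing tool; the default memory per core (2 GB) and overhead (10%) are assumptions, since the text does not fix those values.

```python
import math

def estimate_cores_and_memory(
    compressed_size_mb: float,
    complex_node_count: int,
    caching_on: bool = True,
    memory_per_core_gb: float = 2.0,   # assumed executor memory per core
    overhead_pct: float = 0.10,        # assumed executor memory overhead %
):
    """Return (cores, total_memory_gb) per the rule of thumb above."""
    # Uncompressed size: compressed size multiplied by roughly 10.
    uncompressed_mb = compressed_size_mb * 10

    # Divide by 1.25 GB per core with caching on, 2 GB with caching off.
    mb_per_core = 1.25 * 1024 if caching_on else 2 * 1024
    cores = math.ceil(uncompressed_mb / mb_per_core)

    # Assume about 5 complex operations per core; scale cores up
    # only when the workload exceeds that ratio.
    complexity_factor = complex_node_count / 5
    if complexity_factor > 1:
        cores = math.ceil(cores * complexity_factor)

    # Total memory: (memory per core + overhead %) times the core count.
    total_memory_gb = memory_per_core_gb * (1 + overhead_pct) * cores
    return cores, total_memory_gb
```

For example, 1 GB of compressed input (about 10 GB uncompressed) with caching on needs 8 cores; with 12 complex nodes the factor 12/5 = 2.4 raises that to 20 cores.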
Then pick an instance type and instance count that cover the total memory and core counts from the calculation above. This gives an initial sizing, and we can go up or down from there based on the results we observe.