This article tells you which components of the application require the most space. As each Analyze environment is different, it is very difficult for us to say exactly how much space is required - this article aims to give you an understanding of how storage is used by the Analyze server.
BACKGROUND:
You can install Analyze wherever you like provided the directories are accessible to the Analyze user.
The application files require about 3 GB of disk space. Make sure your system meets the minimum requirements for both storage capacity and storage type.
STORAGE REQUIREMENTS:
1. Space required for the Analyze installation directory:
Executable application files and associated software libraries are stored in the Analyze installation directory. In addition to the core software binaries, the directory also contains third-party software that is bundled with the Analyze installation (e.g. Java, Python, Jython) plus other platform-specific functionality written in C++. For a particular release, the space required by the Analyze installation directory is static over time.
2. Space required for the 'site' directory (site-7731), which holds:
/conf - the conf directory stores the configuration file - small space requirement
/lib - the lib directory can be larger than the conf directory, as it is used to store custom/third-party Java executables/drivers that have been installed on the system. The system does not add to the contents of the lib directory - required files need to be added manually.
/log - this directory stores the system-generated log files. These require a medium amount of space and will continue to grow as the application is used unless they are purged - you should develop a data retention policy for the log files.
/shared - When users upload data files to the server (via the UI or using file transfer nodes e.g. FTP Get, S3 Get, etc) the data files are stored in a sub-directory of the shared directory - in the Public directory or in a directory aligned with their username/id. Similarly, where a results data file is explicitly generated by a node in a data flow (e.g. Output Excel, Output CSV/Delimited, etc) this file will be stored within the shared directory. Sufficient space should be available on the drive/partition to hold the expected volume of data files. If required, the shared directory could be mounted on a different drive/partition to the parent 'site' directory. The location of the shared directory is set by the ls.lae.shareRootDir property in the site-7731/cust.prop file.
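A retention policy for the /log directory can be enforced with a small scheduled script. The following is a minimal sketch, not a supported Analyze utility: the function names, the 30-day retention period and the example path are assumptions for illustration, and it treats every file under the log directory as a purgeable log. Run it with dry_run=True first and review the list before deleting anything.

```python
import time
from pathlib import Path


def find_old_logs(log_dir, max_age_days, now=None):
    """Return files under log_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    return sorted(
        p for p in Path(log_dir).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def purge_old_logs(log_dir, max_age_days, dry_run=True, now=None):
    """List old log files and delete them unless dry_run is True."""
    old_files = find_old_logs(log_dir, max_age_days, now)
    if not dry_run:
        for path in old_files:
            path.unlink()
    return old_files


# Hypothetical usage (the path is an assumption, adjust to your site directory):
# for f in purge_old_logs("/opt/analyze/site-7731/log", max_age_days=30):
#     print("would delete", f)
```

Keeping the dry run as the default means a misconfigured path only produces an unexpected listing rather than lost files.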
3. Space is required for the 'data' directory (data-7731), which holds:
backups - System backup files are stored in the data-7731 directory. You should consider regularly copying the backup (.lxp) files to another location for safekeeping.
keystores - The keystore files used by Analyze.
Note: Ensure the system is running at the time the system backup is scheduled to run.
It is also recommended that you regularly do a full Linux / Windows system backup.
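Copying the backup (.lxp) files to a second location can be automated. The sketch below is one possible approach under stated assumptions: the function name and both example paths are hypothetical, and a file is re-copied only when it is missing at the destination or its size differs.

```python
import shutil
from pathlib import Path


def copy_backups(backup_dir, dest_dir):
    """Copy .lxp files that are missing or differ in size at dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(backup_dir).glob("*.lxp")):
        target = dest / src.name
        if not target.exists() or target.stat().st_size != src.stat().st_size:
            shutil.copy2(src, target)  # copy2 also preserves timestamps
            copied.append(target)
    return copied


# Hypothetical usage (both paths are assumptions):
# copy_backups("/opt/analyze/data-7731/backups", "/mnt/offsite/analyze-backups")
```

The destination should be on different physical storage from the data directory, otherwise the copy does not protect against a disk failure.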
/pgsql - The pgsql directory contains the Analyze database. Database transactions require low-latency storage; otherwise the overall performance of the application may degrade. For example, when a user clicks to open a data flow the application retrieves information from the database, and this request should not be queued behind other bulk I/O operations (e.g. as could occur when writing temp data).
/webapp - the webapp directory stores the web application caches so it needs to be fast storage in order for the API calls to complete in a timely manner.
If these folders don't have ample room / fast storage then the whole performance of the application could suffer. You may see things like delays in adding nodes to the canvas, updating node properties, etc.
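Whether these folders still have ample room can be checked periodically. The following is a minimal sketch using Python's standard shutil.disk_usage; the function names, the 20% threshold and the example paths are assumptions, not values recommended by the product.

```python
import shutil


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of free space on the volume holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def low_space(paths, min_free_fraction=0.2):
    """Return the paths whose volume has less free space than the threshold."""
    return [p for p in paths if free_fraction(p) < min_free_fraction]


# Hypothetical usage (paths are assumptions):
# print(low_space(["/opt/analyze/data-7731/pgsql",
#                  "/opt/analyze/data-7731/webapp"]))
```

This only measures capacity. It says nothing about latency, so it complements rather than replaces an I/O performance check on the storage.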
4. Space is required for the temporary execution data generated by the system when a data flow is run.
/executions - The /executions directory requires a large amount of space. It's important that reads and writes to this folder (i.e. storing and accessing the intermediate data created and used by the nodes) do not consume all of the available I/O bandwidth. This is one of the reasons a separate drive/partition should be used for the executions directory: sharing a drive/partition may delay access to the /pgsql and /webapp folders.
As the /executions folder stores the intermediate data used by the nodes, this folder in particular needs good I/O bandwidth, and should be tuned to maximize it. The storage should be tuned to optimize the performance given the typical file sizes generated by the data you are processing.
Note: it is strongly recommended that you put the 'executions' directory on a different disk / drive to that used for the 'data' directory.
Using a separate drive/partition is strongly recommended because if the stored temporary execution data grows to the point where the space on the disk is exhausted and the same drive/partition was used to store the database then this could lead to database corruption.
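To confirm that the executions directory really is on a separate drive/partition, the device identifiers of the two directories can be compared. This is a minimal sketch: the function names and example paths are assumptions, and it relies on os.stat reporting a different st_dev for each mounted file system or volume.

```python
import os


def on_same_device(path_a, path_b):
    """Return True if both paths live on the same file system / volume."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def layout_warning(data_dir, executions_dir):
    """Return a warning string if the two directories share a device, else None."""
    if on_same_device(data_dir, executions_dir):
        return (
            f"{executions_dir} and {data_dir} share the same drive/partition; "
            "temporary execution data could exhaust space used by the database."
        )
    return None


# Hypothetical usage (paths are assumptions):
# print(layout_warning("/opt/analyze/data-7731", "/mnt/exec/executions"))
```

Two separate partitions on one physical disk pass this check but still share I/O bandwidth, so the result addresses the space-exhaustion risk only.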
For instructions on how to mount a drive, see here:
IMPORTANT NOTE: do not delete the /executions/cache whilst configuring the folder to a new location (this folder is used for in-node Java compilation tasks & Python module caching)
TYPE OF STORAGE:
We recommend the use of local storage for the /data-7731 directory with the exception of the executions folder. The executions folder should provide sufficient capacity to store the temp data being generated by the application on an ongoing basis.
If the executions, pgsql and webapp folders cannot be read from or written to, this may lead to corruption.
Factors that contribute to disk space issues:
- large (in bytes) input files or databases (e.g. tens of GB to TBs)
- the number of nodes the data passes through
- the number of data flows running at one time
- the number of previous executions retained
If the above are all high, then more space is required.
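The way these factors compound can be illustrated with a back-of-envelope calculation. The sketch below is not a sizing formula from the product: it assumes, pessimistically, that every node writes intermediate data as large as the input, and that each concurrent or retained execution keeps a full copy of it. The function name and the example numbers are hypothetical.

```python
def estimate_executions_gb(input_gb, nodes_per_flow, concurrent_flows, retained_runs):
    """Crude upper bound on temporary execution data, in GB.

    Assumes each node produces output the same size as the input, and that
    every running or retained execution holds all of its node outputs.
    """
    per_run_gb = input_gb * nodes_per_flow
    return per_run_gb * (concurrent_flows + retained_runs)


# Example: 10 GB input, 20 nodes, 3 concurrent flows, 5 retained executions
print(estimate_executions_gb(10, 20, 3, 5))  # 1600
```

Real usage is normally lower, since filters and aggregations shrink the data as it moves through the flow, but the multiplication shows why raising any one factor can quickly exhaust the executions drive.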