10. Best practices
Tip
This is a best practices section to help you achieve maximum performance for your jobs on Spider. On this page you will learn which options are most efficient for:
your software installation on Spider
storing your data on the platform
managing your data on Spider
running a large number of jobs on Spider
10.1. Background
Spider is continuously expanding as a unique data processing and collaboration platform. The growing demand for storage space, the diversity in applications and the various ways the system is used all bring technical challenges. For example, if you perform many IO operations (reading and writing files) in your workflows, you should be mindful of the systems on which these operations are performed, as some choices can significantly affect the performance of your jobs and the load on the system.
Our local file system on Spider is CephFS and is suitable as a staging area for your data before or after analysing it. CephFS hosts both your home and project directories on Spider. It is designed for efficient IO of large files, but performance can degrade when dealing with many small files. This is because CephFS relies on a parallel distributed system that involves many disks to store the data itself and metadata servers to store the files' metadata. As a result, when you operate on many small files, e.g. running code from Python environments, the system response can become slow for you and other users on the platform.
To improve the performance of scientific workflows on Spider's storage, we have prepared this best practices guide with several tips for installing your software and for storing and analysing your data efficiently.
10.1.1. Software installation practices
If your software loads a large number of small files upon execution, your jobs may see poor I/O performance even if the total software size is modest. There are ways to mitigate such bottlenecks, for example CVMFS or container technology.
Here is an overview of the features and suitability of some of the software installation options supported on Spider:
Feature | CephFS (home/project space) | Apptainer | LUMI Container Wrapper | Softdrive (CVMFS)
---|---|---|---|---
Software with many files (e.g. Conda, pip env) | No | Yes | Yes | Yes
Fast execution times & low load on the system | No | Yes | Yes | Yes
Extensively used in production | Yes | Yes | No | Yes
Easy to set up | Yes | Moderate | Moderate | Yes
Portability | No | Moderate | Moderate | Yes
Frequent software updates | No | No | Yes | Moderate
Software access can be restricted | Yes | Yes | Yes | No (repos are public)
10.1.1.1. Which software installation practice should I use?
When your application contains Python environments, or you are looking for a portable way to use your software across multiple (SURF) services, we suggest installing your software on Softdrive. Softdrive relies on the CVMFS technology and uses a caching mechanism on the worker nodes that helps launch your software faster. Here you can find instructions for using Softdrive.
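As an illustration of how this typically looks in a job, here is a minimal sketch that activates a conda environment previously published to Softdrive; the /cvmfs/softdrive.nl path, environment name and script name are placeholders, not actual Spider defaults:

```bash
#!/bin/bash
#SBATCH -N 1 -c 1 -t 01:00:00
# Minimal sketch: use a conda environment that was published to Softdrive.
# The /cvmfs/softdrive.nl/<user> path, env name and script are placeholders.
source /cvmfs/softdrive.nl/myuser/miniconda3/etc/profile.d/conda.sh
conda activate my_env
python my_analysis.py
```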
An alternative is Apptainer. Apptainer relies on container technology, which provides an isolated software environment for each application. Apptainer works best when you have a large software stack that does not change often. The image can be placed anywhere on Spider, as long as the location is accessible to your processing jobs. Examples can be found in our Apptainer section.
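A minimal sketch of running an application from an Apptainer image stored on project space; the image path, bind mount and tool names below are hypothetical:

```bash
# Run a command from an image stored on project space (paths are placeholders):
apptainer exec /project/myproject/containers/my_tools.sif python my_analysis.py

# Bind-mount a data directory (here the job's scratch area) into the container:
apptainer exec --bind "$TMPDIR":/data \
    /project/myproject/containers/my_tools.sif my_tool --input /data/input.dat
```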
For software that changes frequently we suggest the LUMI Container Wrapper (LCW). This is similar to Apptainer, but you can update the software without rebuilding the base container. More about LCW can be found in our LUMI section.
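As a rough sketch of this workflow, assuming the wrapper's conda-containerize tool is available in your environment (the install prefix, environment file and post-install script are placeholders):

```bash
# Build a wrapped conda environment from an environment file (placeholder paths):
conda-containerize new --prefix /project/myproject/lcw_env env.yml

# Make the wrapped executables available in your job scripts:
export PATH="/project/myproject/lcw_env/bin:$PATH"

# Later, update the installation without rebuilding the base container:
conda-containerize update /project/myproject/lcw_env --post-install post_install.sh
```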
If you have to install your software locally on Spider and it loads a limited number of files, it is possible to use CephFS in your home or project space, but take its limitations, such as slow execution times, into account.
10.1.2. Data storage practices
When you work with large volumes of data, or your application writes/reads a large number of files, you may encounter performance bottlenecks depending on where you have stored your data.
Here is an overview of the features and suitability of some of the data storage options supported on Spider:
Feature | CephFS (home/project space) | dCache | Scratch (local SSD)
---|---|---|---
High throughput & low load on the system | No | Yes | Yes
Large volumes of data | No | Yes | Moderate
Data available after jobs end | Yes | Yes | No
Data available outside Spider | No* | Yes | No
Granular access control | Yes | Yes | No
Supports disk | Yes | Yes | Yes
Supports tape | No | Yes | No
Available through an API | No | Yes | No

* unless explicitly placed in a public folder
10.1.2.1. Which data storage practice should I use?
For bulk data storage we recommend dCache. dCache is highly connected to the Spider worker nodes and is designed for high-throughput data processing. This storage system is also available outside of Spider and has highly granular access controls, making data releases or data-uploader roles self-service. dCache is available through a number of interfaces, meaning that it can be used out of the box with WebDAV clients or through a REST API, allowing future data portals to be developed. Another reason to use dCache is that it supports both disk and tape, so it can easily scale to much larger data volumes. Here you can find instructions for using the dCache remote storage.
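For instance, dCache can be reached over WebDAV with a standard client such as curl; the endpoint, project path and token handling below are only indicative and will differ per project:

```bash
# List a dCache directory over WebDAV using a bearer token (macaroon);
# endpoint, path and token are placeholders for your project's settings.
curl --header "Authorization: Bearer $DCACHE_TOKEN" \
     "https://webdav.grid.surfsara.nl:2880/myproject/output/"

# Download a single file the same way:
curl --header "Authorization: Bearer $DCACHE_TOKEN" \
     --output run42.dat \
     "https://webdav.grid.surfsara.nl:2880/myproject/output/run42.dat"
```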
We also advise you to use the scratch file systems as fast temporary storage while running a job. Each of the Spider worker nodes has a large scratch area on local SSD. Any data that you wish to keep should be written to other storage backends, such as dCache, before the end of the job. The scratch areas are ideal for retrieving the input of a job from dCache during execution, for applications that generate lots of intermediate files consumed by other parts of the processing, or for staging the job output before copying it back to dCache. More about how to use the temporary disk space can be found in our section Using scratch.
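A minimal job-script sketch of this pattern, assuming the job's scratch directory is available via $TMPDIR and that an Rclone remote named dcache has been configured (remote name and paths are placeholders):

```bash
#!/bin/bash
#SBATCH -N 1 -c 4 -t 04:00:00
# Sketch: stage input from dCache to local scratch, process it there,
# and copy the results back before the job ends (scratch is cleaned afterwards).
cd "$TMPDIR"                                          # job's scratch area on local SSD
rclone copy dcache:/myproject/input/run42 ./input     # stage input data
./my_analysis --in ./input --out ./output             # intermediate files stay on scratch
rclone copy ./output dcache:/myproject/output/run42   # save results to dCache
```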
If you have multiple jobs that need to access a single set of files that is too large to copy over to scratch, it is possible to use CephFS in your home or project space for temporarily storing your data, but take into account its limitations compared to dCache in terms of throughput and capacity. It is highly recommended that you do not store more than 10,000 files in a single directory on CephFS. In terms of file sizes, CephFS is most efficient with files larger than 4 MB; files smaller than 32 KB can be very inefficient.
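A quick way to check whether a dataset stays within these limits before placing or using it on CephFS (the path is a placeholder):

```bash
# Count all files in a directory tree, and count those smaller than 32 KB:
find /project/myproject/mydataset -type f | wc -l
find /project/myproject/mydataset -type f -size -32k | wc -l
```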
10.1.3. Managing data practices
There are several data management options for all stages of your project lifecycle. Here we focus on the options for transferring and handling your data on Spider.
An overview of the features and suitability of some of the data management options supported on Spider is presented below.
Feature | Rclone | Shared memory | mpifileutils
---|---|---|---
High speed & low load on the system | Moderate | Yes | Yes
Support for parallel operations | Yes | Yes | Yes
Easy setup | Yes | Yes | No
Supports many backends (Object Store, dCache) | Yes | No | No
10.1.3.1. Which practice for managing data should I use?
When transferring data from/to Spider, your experience will vary depending on the client, protocol and parameters you choose. For efficient data transfers we suggest using Rclone. Rclone is a command-line tool that works on many platforms and can talk to many storage systems, including dCache. Some advantages of Rclone are that it can sync directories, like rsync does, and that it uses parallel transfers (4 by default) to achieve better performance when copying directories. More information about using Rclone, for example with dCache, can be found in our ADA interface section.
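For example, assuming a remote named dcache has already been configured with `rclone config` (the remote name and paths are placeholders):

```bash
# Copy a directory to dCache with 8 parallel transfers and progress output:
rclone copy --progress --transfers 8 /project/myproject/results dcache:/myproject/results

# Keep a directory in sync (like rsync), only transferring changed files:
rclone sync /project/myproject/results dcache:/myproject/results
```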
Tarring or zipping many small files directly on the local CephFS file system can be very slow and can take several hours. In such cases it may be better to copy the files temporarily into memory (RAM) first and then run tar/zip, as this speeds up these operations remarkably. The files are copied from CephFS into memory in parallel, whereas tar operates on files one by one. Once the files are in the shared memory of the node, the tar process is much faster. When using this option, keep in mind that memory is limited, shared with other processes and temporary. An example of using shared memory to tar and process a file can be found in Shared memory.
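A minimal sketch of this trick (dataset path and names are placeholders): it copies the top-level entries of a directory into /dev/shm with several concurrent cp processes, creates the tarball from memory, and cleans up afterwards.

```bash
# Copy the dataset from CephFS into RAM-backed storage in parallel
# (8 cp processes; assumes entry names without spaces):
mkdir -p /dev/shm/$USER/mydataset
ls /project/myproject/mydataset | xargs -P 8 -I{} \
    cp -r /project/myproject/mydataset/{} /dev/shm/$USER/mydataset/

# Tar from memory, which is much faster than tarring directly on CephFS:
tar -czf /project/myproject/mydataset.tar.gz -C /dev/shm/$USER mydataset

# Clean up: /dev/shm is limited and shared with other processes on the node.
rm -rf /dev/shm/$USER/mydataset
```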
For advanced users who are familiar with MPI operations, we also offer an MPI-based tool for managing datasets, such as copying files across the different home and project space folders on the local file system. The MPI-based tool is much faster and more efficient than the common cp operations. Example usage for parallel copying of files with this method can be found in the mpifileutils section.
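As a sketch, a parallel copy with the dcp tool from mpifileutils could look as follows, assuming mpifileutils and an MPI runtime are available in your environment (process count and paths are placeholders):

```bash
# Copy a directory tree in parallel with 8 MPI processes using dcp:
mpirun -np 8 dcp /project/myproject/raw_data /project/myproject/work/raw_data
```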
10.1.4. Running a large number of jobs
High-throughput workflows that execute a specific application for many different parameter combinations often require the submission of many jobs. When running a large number of jobs it can be difficult to keep track of their status or to resume failed jobs that were prematurely cancelled (e.g. due to the time limit). Another challenge is reducing the large scheduling overhead and the waiting times in the queue.
An overview of the features and suitability of some of the options for running a large number of jobs on Spider is presented below.
Feature | Slurm job arrays | PiCaS | Snakemake
---|---|---|---
High speed & low load on the system | No | Yes | Moderate
Scales to hundreds, thousands of jobs and more | No | Yes | Moderate
Transcends Spider | No | Yes | No
Easy setup | Yes | Moderate | Moderate
Handles dependencies between tasks easily | No | Moderate | Yes
Error recovery | No | Yes | Moderate
10.1.4.1. Which practice for running a large number of jobs should I use?
The first option to check when running a large number of jobs is whether the software you're using comes with a built-in option for managing your workloads on a Slurm-based cluster. Alternatively, an easy way to submit several independent jobs with one command is to use Slurm job arrays. Job arrays, however, do not scale well beyond a few hundred jobs. In that case, you can use external tools for managing your workloads, such as PiCaS or Snakemake.
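A minimal job-array sketch that runs the same script for 100 parameter sets with a single submission (script name and input naming are placeholders):

```bash
#!/bin/bash
#SBATCH --array=1-100
#SBATCH -N 1 -c 1 -t 01:00:00
# Each array task processes its own input, selected via the array index:
echo "Running task ${SLURM_ARRAY_TASK_ID}"
python my_analysis.py --input "input_${SLURM_ARRAY_TASK_ID}.dat"
```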
PiCaS works as a queue, providing a mechanism to step through the work one task at a time. It is also a pilot-job system, meaning that the client communicates with the PiCaS server to fetch work, instead of having that work specified in a job (or similar) file. As every application needs different parameters, PiCaS has a flexible data structure that allows users to save different types of data. PiCaS can handle thousands or millions of tasks, has an easy query mechanism to search among your tasks, and is accessible from any platform via a RESTful HTTP API. Here you can find instructions for using PiCaS.
When your application involves several steps connected in a workflow that each need to be submitted as independent tasks, you may consider using Snakemake. Snakemake is a Python-based workflow management tool for defining, managing and executing workflows with multiple steps and complex dependencies. It is also possible to combine PiCaS and Snakemake to enable workflow automation and to run many jobs and subtasks efficiently. Please contact our helpdesk if you need help with automating your workloads on Spider.