Service notices list

List of notable changes and incidents on Spider with most recent first:

2026

2026-05-28 Multi-MDS has been rolled out: five servers now handle storage usage, each with a backup.
2026-05-18 at 10:00: To keep the service running safely, maintenance will be carried out during the day to address a CVE. No downtime is expected.
2026-04-30 at 16:00 unscheduled maintenance: Login machines and worker nodes had to be rebooted to apply a fix for a security issue.
2026-04-13 at 03:00 the storage system went down. Investigation started at 09:00 and before noon it was found to be a known bug. Recovery was at 12:00.
2026-03-31 User home folders are limited to 200 GB. If you exceed the limit, you can still log in and move data to /project or remove data, but you cannot write new data to /home/$USER.
2026-02-26 from 09:00 till 15:00: Unplanned maintenance. Users might experience mounting issues on their /project directories. Update 15:20: The situation is improving and we are monitoring it closely.
2026-02-23 from 09:00 till 17:00: Planned maintenance on the generic Spider UIs and the Slurm batch node. All interactive processes on the UIs will be stopped and new jobs will not be accepted. The cluster will remain available for running jobs.
2026-02-11 from 22:00 till 22:30: Planned maintenance on the generic Spider UIs. The UIs will not be reachable during this time and all interactive processes on the UIs will be stopped.
2026-01-19 till 2026-01-20 at 12:00: Unplanned maintenance on the network file system (/home and /project mounts). Some nodes might lose access to the shared file system.
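
For the 2026-03-31 notice on home folder limits, the sketch below shows one way you might check whether a directory is over a size limit before moving data to /project. It is only an illustration: the function name over_limit and the variable HOME_LIMIT_KB are hypothetical, it assumes 1 GB = 1024 * 1024 KB, and the usage reported by du may differ from how the limit is actually enforced on Spider.

```shell
# Hypothetical helper: succeed (exit 0) if a directory uses more than a
# given number of kilobytes, as measured by du.
# Usage: over_limit <directory> <limit in KB>
over_limit() {
    used_kb=$(du -sk "$1" | cut -f1)
    [ "$used_kb" -gt "$2" ]
}

# 200 GB expressed in KB (assumption: 1 GB = 1024 * 1024 KB).
HOME_LIMIT_KB=$(( 200 * 1024 * 1024 ))

# Example usage (commented out because it scans the whole home folder):
# if over_limit "$HOME" "$HOME_LIMIT_KB"; then
#     echo "Home folder is over the limit; move data to /project or remove it."
# fi
```

Scanning a large home folder with du can take a while on a shared file system, so run it sparingly.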

2025

2025-12-08 till 2026-01-30: Routine maintenance. Computing capacity will be reduced during this period, but the cluster will remain available. Jobs in the infinite queue will need to be stopped and restarted after the nodes are back online.
2025-01-15 The issue with longer queue times has been resolved. Job prioritization is now working as expected.
2025-01-10 Queue times are longer than usual due to a bug in the FairShare mechanism in Slurm.

2024

2024-12-16 Spider maintenance has been successfully completed, and the cluster is now fully restored. Upgraded from CentOS Stream 8 to AlmaLinux 9 (binary-compatible with RHEL 9). Upgraded to Slurm version 24.05.
2024-12-05 Spider will undergo maintenance on Dec 12 and 13. The cluster will start draining 5 days beforehand. To keep your jobs scheduling, specify a time limit that still fits before the maintenance: sbatch --time=1-00:00:00 my_job_script.sh
2024-11-25 We experienced issues with the DNS servers, which may have impacted the execution of some jobs. Please review your workflows and let us know if you encounter any problems.
2024-08-27 from 16:40 till 2024-08-28 at 09:00: The service was down due to a network problem during network maintenance.
2024-06-12 from 07:00 till 2024-06-14 at 17:00: The service was down due to a network problem.
2024-05-06 from 09:00 till 10:30: ui-01 crashed and was not accessible until it was rebooted.
2024-04-16 from 06:05 till 12:55: Transient issues with SSH access via ui-01 and ui-02.
2024-03-28: User application causing a lot of Out of Memory events on the following nodes: wn-ca-08, wn-ca-13, wn-ca-15, wn-ca-21, wn-ca-23, wn-ha-01, wn-ha-04, wn-ha-05. The nodes were rebooted and jobs running on them failed.
2024-03-19 from 09:00 till 12:30: Maintenance on the underlying Cloud infrastructure. The compute nodes may experience small disruptions.
2024-02-15 from 11:00 till 17:00: Issues on CephFS with many queued operations. The nodes wn-ca-14, wn-dc-12, wn-hb-04, wn-dc-16, wn-dc-18 had to be rebooted. Jobs affected by the reboots will show up as failed.
2024-02-08 to 2024-02-09: Interruption of service due to CephFS issues. Some running user jobs had to be cancelled or failed.
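
For the 2024-12-05 notice about draining before maintenance, the sketch below shows one way to work out a --time value that still ends before a maintenance window. It is an illustration only: the function name time_before_maintenance is hypothetical, the maintenance start time of 00:00 on Dec 12 is an assumption (the notice gives only the dates), and it relies on GNU date (the -d option).

```shell
# Hypothetical helper: print the largest whole-hour Slurm time limit
# (format days-hours:minutes:seconds) that still ends before a
# maintenance window starts. Requires GNU date.
# Usage: time_before_maintenance "<maintenance start>" ["<now>"]
time_before_maintenance() {
    start=$(date -u -d "$1" +%s) || return 1
    now=$(date -u -d "${2:-now}" +%s) || return 1
    hours=$(( (start - now) / 3600 ))   # rounds down to whole hours
    if [ "$hours" -le 0 ]; then
        echo "less than one hour left before maintenance" >&2
        return 1
    fi
    printf '%d-%02d:00:00\n' $(( hours / 24 )) $(( hours % 24 ))
}

# Example usage (assumed start time; commented out because it needs Slurm):
# sbatch --time="$(time_before_maintenance '2024-12-12 00:00')" my_job_script.sh
```

Rounding down to whole hours leaves a small margin. A job that needs less time than the computed value can simply request its own shorter limit.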

2023

2023-12-07: High load on CephFS triggered by user activity. Some operations are slower but jobs are not affected.
2023-11-24 at 17:00 till 2023-11-27: Problematic hardware causing slow connection and multiple failures on half the cluster. The affected nodes were rebooted and the jobs running on the nodes may have failed.
2023-11-07 from 16:00 till 19:00: CephFS overload triggered by user activity causing connectivity issues to the local filesystem. The nodes were rebooted and the jobs running on the nodes may have failed.
2023-07-13 to 2023-07-14 at 13:15: Problematic hardware causing slow operations on CephFS (/home and /project folders). Some users experience connection issues.
2023-06-05: Reduced capacity for maintenance.
2023-04-28 to 2023-05-04: Spider system downtime to upgrade the storage system. The generic UIs and worker nodes will not be accessible during this time.
2023-01-20: GPUs have been added to the Spider service.

2022

2022-09-25: The cluster has been updated from CentOS 8.4 to CentOS Stream. No expected impact on job processing.
2022-05-03 to 2022-05-04: System downtime to restore services affected by underlying system maintenance.
2022-05-02: Upgrade of the underlying Cloud infrastructure to the next OpenStack release. The upgrade of the services can impact the networking on Spider nodes.
2022-03-11: Nodes wn-hb-01 and wn-hb-03 failed to reconnect to CephFS after an MDS server crash and were restarted. Jobs running on the nodes may have failed.
2022-02-23: Nodes wn-ca-07 and wn-ca-09 down due to a hypervisor crash. Jobs running on the nodes may have failed.
2022-02-11: Node wn-ca-03 down due to a hypervisor crash. Jobs running on the node may have failed.
2022-01-26: Node wn-ca-01 down due to a hypervisor crash. Jobs running on the node may have failed.

2021

2021-12-20: Updated the underlying Ceph cluster of our CephFS system. No expected impact on the service.
2021-11-17: Node wn-ca-04 down due to a hypervisor crash. Jobs running on the node may have failed.
2021-08-26: Added a redundant login node to support live changes and updates. The /scratch space will no longer be available on the login node.
2021-02-23: Updated the OS of the computing nodes and supporting infrastructure to CentOS 8. For more, see Maintenance instructions.
2021-02-15 to 2021-02-23: New Spider release. See Maintenance instructions.

2020

2020-09-25: Issues with the batch machine. Running jobs should not be affected, but new jobs could not be created.