Extended June MSI Maintenance – June 5-7, 2024
Summary
MSI reserves the first Wednesday of each month for maintenance on its computing, storage, and infrastructure systems. The next maintenance period, however, will be extended by 1-2 days so that additional upgrades can be completed, making MSI a more reliable and more capable resource for U of M researchers.
You will not be able to use the HPC clusters from 5 a.m. on Wednesday, June 5, until at least the evening of Thursday, June 6.
MSI June Maintenance
Service | Wed, June 5 | Wed, June 5 | Thu, June 6 | Thu, June 6 | Fri, June 7 | Fri, June 7
Storage - Panasas | Down | Restoring | Up | Up | Up | Up
Storage - VAST | Up | Up | Up | Up | Up | Up
Tier 2 Storage | Up | Up | Up | Up | Up | Up
Stratus | Up | Up | Up | Up | Up | Up
LM&P Pipelines | Down | Down | Down | Restoring | Up | Up
HBCD Pipelines | Down | Down | Down | Restoring | Up | Up
Agate Cluster | Down | Down | Down | Restoring | Up | Up
Mangi Nodes | Down | Down | Down | Down | Down | Restoring
Mesabi Cluster | Down | Down | Retired | Retired | Retired | Retired
Detailed information on the outages
Power Infrastructure
The most time-consuming effort during the next maintenance period will be the replacement of our three data center Uninterruptible Power Supplies (UPS), which have reached their end of life. Planning for their replacements started over a year ago, and construction started in early 2024. Assuming the final testing of the new systems goes well during the last week of May, the switchover to the three new systems will occur sequentially on Wednesday, June 5; Thursday, June 6; and Friday, June 7.
We expect core MSI storage systems to be fully available again by the end of the day on Wednesday, June 5. However, all of our HPC clusters will be unavailable until at least the end of Thursday, June 6; see below.
Agate maintenance
The main Agate cluster, and associated services such as Open OnDemand, will return to service after the second UPS switchover, by the evening of Thursday, June 6.
Mesabi retirement
As previously announced, our nearly decade-old cluster Mesabi will be retired from service. More information on the SLURM partitions to be retired can be found at the Mesabi Retirement webpage.
If you have only done your computing on Mesabi, and/or ignored messages about the new operating system on Agate, here is a guide to help you transition from the CentOS 7 operating system to Rocky8.
Mangi nodes transition
After the final UPS switchover on Friday, June 7, the newer expansion compute nodes attached to Mesabi (known as “Mangi”) will be reassociated with the Agate cluster and loaded with the same Rocky8 operating system so that all nodes are aligned. This operation is targeted for completion by Friday evening.
Movement of all Notebooks applications to Open OnDemand
During this maintenance, the stand-alone Jupyter Notebooks service will be taken down, as previously announced. Tens of thousands of Notebooks sessions have been run on MSI’s deployed Open OnDemand service, which is more capable, making the original service redundant.
Expansion of the Agate cluster
Although not part of the June maintenance plans, this work will help lay the groundwork for a supplemental set of hardware that will expand the Agate cluster later in 2024. Delivery, provisioning, setup, and testing of this new hardware will all take place at later dates. Keep an eye on news from MSI about the new hardware.