Job Partitions

Slurm Job Partitions at MSI

MSI uses the Slurm scheduler to fairly allocate compute resources to users of our systems. The goals of the scheduler are to run people's jobs as soon and as quickly as possible, to prevent jobs from interfering with each other, to treat all users fairly, and to keep the entire supercomputer as highly utilized (or "full") as possible.

Slurm uses partitions to organize jobs with similar resource requirements. "Resources" refers to the compute features required to run the job, for example the number of CPU cores, memory, and expected run time. The job partitions on our systems involve different sets of hardware and have different limits for various compute resources. For example, we have partitions that support long execution times, high memory requirements, or special GPU hardware. When submitting a job, it is important to choose a partition whose hardware and resource limits suit the job.

Available Partitions

The table below gives a summary of the available partitions and the associated limits. The quantities listed are totals or upper limits.

Partition Name        | Max cores per node | Maximum walltime | Max available node memory (GB) [3] | Memory per core (MB) | Local scratch per node (GB) | Max nodes per job | Max jobs per user
msismall              | 128                | 96:00:00         | 248-755                            | 3900                 | 850                         | 1                 | 4000 [4]
msilarge              | 128                | 24:00:00         | 248-755                            | 3900                 | 850                         | 32                | 4000 [4]
msibigmem             | 128                | 24:00:00         | 1995                               | 15950                | 850                         | 1                 | 4000 [4]
msigpu [1]            | 24-128             | 24:00:00         | 374-1002                           | 3900-8000            | 850                         | 4                 | 4000 [4]
msilong               | 32                 | 37-0             | 248-755                            | 3900                 | 850                         | *** [5]           | *** [5]
interactive           | 128                | 24:00:00         | 499                                | 3900                 | 850                         | 2                 | 1
interactive-gpu [1,6] | 64-128             | 24:00:00         | 499-755                            | 5120                 | 850                         | 1                 | 1
interactive-long      | 64-128             | 37-0             | 499                                | 3900                 | 850                         | 1                 | 1
preempt [2]           | 128                | 24:00:00         | 248-755                            | 3900                 | 850                         | 1                 | 4000 [4]
preempt-gpu [1,2]     | 64-128             | 24:00:00         | 499-755                            | 3900-8000            | 850                         | 1                 | 4000 [4]

Notes

  1. In addition to selecting the GPU-enabled partitions, GPUs need to be requested for all GPU jobs. See below under "Requesting specific CPU and GPU types".

  2. Jobs in the preempt and preempt-gpu partitions may be able to start opportunistically sooner than would otherwise be the case, but may also be killed at any time to make room for jobs in the other partitions.

  3. Total available memory per node can vary due to differing hardware configurations. Standard CPU compute nodes have 256, 384, 512, or 768 GB of memory. GPU nodes also have varying amounts of RAM, ranging from 384 GB in the V100 nodes to 1 TB in the 8-way A100 nodes. In both cases, some of the memory is reserved for other uses and can't be requested by compute jobs.

  4. This is a shared, per-user limit across partitions. Additionally, across these partitions, there is a limit of 17,000 cores, 80 TB of memory, 40 A100s, and 40 A40s in running jobs per PI group.

  5. No node or job limit, but there is a total usage limit of 32 cores and 128 GB of memory per user.

  6. The interactive-gpu partition currently has A40 and L40S GPUs available. To use other GPU types, an interactive job can be queued on the msigpu partition instead.

Choosing partitions and resources

Which partition to choose depends largely on the resources your software or script requires. To share resources efficiently, many partitions limit the number of jobs or cores a particular user or group may use simultaneously. If a workflow requires many jobs to complete, it can be helpful to choose partitions that allow many jobs to run at once.

Besides selecting the appropriate partition, it is important to request only the resources your job actually needs. Avoiding excessive requests for walltime, cores, and memory helps your job start as soon as possible and has the least impact on your group's fairshare. If you are uncertain about the required memory or walltime, you might run the first job with generous limits and refine the requests for future runs based on the results. Slurm also sends a "job efficiency" report, which gives additional insight into how efficiently the requested resources were used.
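As a rough illustration, a batch job header combining these choices might look like the sketch below; the partition, walltime, core, and memory values, and the program being run, are placeholders to adapt to your own workload.

#!/bin/bash
#SBATCH -p msismall          # partition suited to a small single-node job
#SBATCH --time=02:00:00      # walltime: well under the msismall 96-hour limit
#SBATCH --ntasks=8           # cores: only as many as the software can use
#SBATCH --mem=16g            # memory: a modest estimate, refined after the first run

./my_analysis input.dat      # placeholder for your own command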

Interactive jobs

The interactive partitions are primarily used for interactive software that is graphical in nature, and for short-term testing of jobs that will eventually run on a regular batch partition. While it is possible to submit "interactive" jobs to any partition, the named interactive partitions have some dedicated resources associated with them to help jobs start with minimal wait time.
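For example, an interactive shell can be started with srun; this is a sketch with illustrative resource values, not a required invocation:

srun -p interactive --time=01:00:00 --ntasks=4 --mem=8g --pty bash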

Job Walltime

The job walltime (--time=) is the time from the start to the finish of a job (as you would measure it using a clock on a wall), not including time spent waiting to run. This is in contrast to cputime, which measures the cumulative time all cores spent working on a job. Different job partitions have different walltime limits, and it is important to choose a partition with a high enough walltime for your job to complete. Jobs that exceed the requested walltime are killed by the system to make room for other jobs. Walltime limits are maximums only; you can always request a shorter walltime, which will reduce the amount of time you wait for your job to start. If you are unsure how much walltime your job will need, start with the partitions with shorter walltime limits and only move to others if needed.
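For example, a job expected to finish within a few hours might include the directive below (the value is illustrative); Slurm accepts walltime as HH:MM:SS or D-HH:MM:SS.

#SBATCH --time=04:00:00      # 4 hours of walltime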

Job nodes and cores

Many calculations can use multiple cores, or (less often) multiple nodes, to improve calculation speed. A job can request specific numbers of nodes (--nodes=) and cores (--ntasks=), but each partition has maximum or minimum values for these parameters.
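As a sketch (the values are illustrative and must stay within the partition limits above), a multi-node request might look like:

#SBATCH --nodes=2              # number of nodes
#SBATCH --ntasks-per-node=32   # cores (tasks) requested on each node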

Job Memory

The memory a job requires (--mem=) is an important factor when choosing a partition. The largest amount of memory (RAM) that can be requested for a job is limited by the memory of the hardware associated with that partition. Nodes in the main CPU partitions have memory sizes ranging from 256 GB to 768 GB, while those in the msibigmem partition have 2 TB.
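For example, the directive below requests 64 GB of memory per node (an illustrative value); --mem-per-cpu= is an alternative that requests memory per allocated core instead.

#SBATCH --mem=64g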

Requesting specific CPU and GPU types

Nodes in Agate have feature tags assigned to them. Each node has been tagged with features based on the CPU manufacturer, CPU name, and GPU name, as well as whether the BeeOND ephemeral scratch is available (see below). Users can select nodes for their jobs based on the feature tags by using the --constraint flag with sbatch or srun.

Nodes         | CPU, memory                   | GPU     | Features
acn[01-244]   | 128 cores AMD Milan, 512GB    |         | beeond,amd,milan
acl[01-100]   | 128 cores AMD Milan, 2TB      |         | beeond,amd,milan
aga[01-50]    | 64 cores AMD Milan            |         | beeond,amd,milan,a100
agb[01-08]    | 128 cores AMD Milan           |         | beeond,amd,milan,a100
agc[01-10]    | 128 cores AMD Milan           | 10x A40 | beeond,amd,milan,a40
agd[01-08]    | 128 cores AMD Genoa, 768GB    | 4x L40S | beeond,amd,milan,l40s
cn[1001-1164] | 128 cores AMD Rome, 256GB     |         | beeond,amd,rome
cn[2101-2115] | 24 cores Intel Skylake, 384GB | 2x V100 | beeond,intel,skylake,v100
n[1-32]       | 128 cores AMD Genoa, 384GB    |         | beeond,amd,genoa
n[33-64]      | 128 cores AMD Genoa, 768GB    |         | beeond,amd,genoa
e[1-8]        | 128 cores AMD Genoa, 768GB    | 4x H100 | beeond,amd,genoa,h100

With the help of the Slurm feature tags associated with each node, you can select nodes with certain features for your job using the -C or --constraint flag. For example, to run an msismall job on a node with an AMD Genoa processor, you might use the following job description:

#SBATCH -p msismall
#SBATCH --constraint=genoa

Or, as another example, if you want to run on either rome or milan processors, then you could use:

#SBATCH -p msismall
#SBATCH --constraint="rome|milan"

Similarly, one or more GPUs of a particular type can be requested by including the following two lines in your submission script:

This example asks for a single A100 GPU, using the msigpu partition:

#SBATCH -p msigpu     
#SBATCH --gres=gpu:a100:1 

This example asks for a single A40 GPU, using the interactive-gpu partition:

#SBATCH -p interactive-gpu
#SBATCH --gres=gpu:a40:1

If you don't have a GPU type preference, you can request any available GPU in the msigpu partition:

#SBATCH -p msigpu
#SBATCH --gres=gpu:1

Available GPU types include v100, a100, and h100 in the msigpu partition, and a40 and l40s in the interactive-gpu and preempt-gpu partitions. You can request up to 2 v100s, 4 h100s, and 8 a100s.
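For example, a sketch requesting two A100 GPUs in the msigpu partition (the count is illustrative and must stay within the limits above):

#SBATCH -p msigpu
#SBATCH --gres=gpu:a100:2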

Preemptible partitions  

The preempt and preempt-gpu partitions are special partitions that allow the use of otherwise idle resources. Jobs submitted to the preempt queue may be killed at any time to make room for a higher-priority job. Care must be taken to use these queues only for jobs that can easily restart after being killed. An example job is shown below:

#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --mem=20gb
#SBATCH -n 12
#SBATCH --requeue
#SBATCH -p preempt-gpu
#SBATCH --gres=gpu:a40:1

module load singularity
singularity exec --nv \
    /home/support/public/singularity/gromacs_2018.2.sif \
    gmx mdrun -s benchMEM.tpr -cpi state.cpi -append

Requesting local scratch storage

The default temporary storage spaces on a node (/tmp, /dev/shm) are private to each job and are normally memory-backed. This means they are backed by your memory request, and any files placed there will consume memory from your allocation.

You can request "disk-backed" storage (allocated on local SSD flash storage) like this:

#SBATCH --tmp 100G
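As a sketch of how the disk-backed scratch might be used (assuming it is exposed at /tmp, as in the BeeOND example below; file and program names are placeholders), a job could stage data into local scratch, work on it there, and copy the results back before the job ends:

cd /tmp                              # job-private local scratch on the node
cp $SLURM_SUBMIT_DIR/input.dat .     # stage input from the submission directory
./my_program input.dat > output.dat  # placeholder for your own command
cp output.dat $SLURM_SUBMIT_DIR/     # copy results back before the job ends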

BeeOND

Compute nodes on MSI systems have access to primary storage, global scratch, and local scratch disk. Occasionally, a workload demands more scratch disk than is available on a single node. In that case, a researcher can use either global scratch or BeeOND. Global scratch has the most space available, but it can suffer from variable performance because it is a shared resource; when performance matters, you might want to consider using BeeOND.


BeeOND establishes an ephemeral BeeGFS distributed filesystem by aggregating the local scratch on all the nodes used by your job. It is generally suggested that you request at least 12 nodes, as in the example below. Please also note that you should request all the resources on the nodes. In the example below, we're using Agate nodes in the msilarge partition and specifying 480GB RAM and 800GB local scratch per node.

#!/bin/bash
#SBATCH -C beeond
#SBATCH -N 12
#SBATCH -n 128
#SBATCH --mem 480G
#SBATCH --tmp 800G
#SBATCH -p msilarge

#SBATCH --mail-type=BEGIN,FAIL
#SBATCH --mail-user=<userid>@umn.edu
#SBATCH -e beeondjob-%j.err
#SBATCH -o beeondjob-%j.out

cd $SLURM_SUBMIT_DIR
WORKING_DIR=$HOME/Slurm/BeeOND
MY_DATA_DIR=$HOME/Slurm/BeeOND/data   # source data to be staged in
BEEOND_DIR=/tmp                       # a BeeGFS-mounted space pooled from all nodes

# Generate a node list file

scontrol show hostname > ${WORKING_DIR}/nodes_list
if [[ $? != 0 ]]; then
    echo "Failed to create nodes_list"
else
    echo "nodes list"
    cat ${WORKING_DIR}/nodes_list
fi

# Move data to beeond space using BeeOND stage in
beeond-cp stagein -n ${WORKING_DIR}/nodes_list -g ${MY_DATA_DIR} -l ${BEEOND_DIR}
if [[ $? != 0 ]]; then
    echo "beeond-cp stagein failed"
else
    echo "beeond-cp stagein completed"
fi

# Run your application code here…

# Move data out to my own space using BeeOND stage out
beeond-cp stageout -n ${WORKING_DIR}/nodes_list -g ${MY_DATA_DIR} -l ${BEEOND_DIR}
if [[ $? != 0 ]]; then
    echo "beeond-cp stageout failed"
else
    echo "beeond-cp stageout completed"
fi

 
