Submitting Jobs with Slurm¶
Processing computational tasks with Cheaha at the terminal requires submitting jobs to the Slurm scheduler. Slurm offers two commands to submit jobs: sbatch and srun. Always use sbatch to submit jobs to the scheduler, unless you need an interactive terminal. Otherwise, use srun only within an sbatch script to launch job steps.
The command sbatch accepts script files as input. Scripts should be written in an available shell language on Cheaha, typically bash, and should include the appropriate Slurm directives at the top of the script telling the scheduler the requested resources. Read on to learn more about how to use Slurm effectively.
Important
Much of the information and examples on this page require a working knowledge of terminal commands and the shell. If you are unfamiliar with the terminal then please see our Shell page for more information and educational resources.
Common Slurm Terminology¶
- Node: A self-contained computing device, forming the basic unit of the cluster. A node has multiple CPUs and memory, and some nodes have GPUs. Jobs requiring multiple nodes must use a protocol such as MPI to communicate between them.
- Login nodes: Gateway for researcher access to computing resources, shared among all users. DO NOT run research computation tasks on the login node.
- Compute nodes: Dedicated nodes for running research computation tasks.
- Core: A single unit of computational processing, not to be confused with a CPU, which may have many cores.
- Partition: A logical subset of nodes sharing computational features. Different partitions have different resource limits, priorities, and hardware.
- Job: A collection of commands that require computational resources to perform. Can be interactive with srun or submitted to the scheduler with srun or sbatch.
- Batch Job: An array of jobs which all have the same plan for execution, but may vary in terms of input and output. Only available in non-interactive batch mode via sbatch.
- Job ID: The unique number representing the job, returned by srun and sbatch. Stored in $SLURM_JOB_ID within a job.
- Job Index Number: For array jobs, the index of the currently running job within the array. Stored in $SLURM_ARRAY_TASK_ID within a job.
Slurm Flags and Environment Variables¶
Slurm has many flags a researcher can use when creating a job, and the most important ones are described below. It is highly recommended to be as explicit as possible with flags and not rely on system defaults. Explicitly using the flags below makes your scripts more portable, shareable and reproducible.
| Flag | Short | Environment Variable | Description | sbatch | srun |
|---|---|---|---|---|---|
| --job-name | -J | SBATCH_JOB_NAME | Name of job stored in records and visible in squeue. | sbatch | srun |
| | | SLURM_JOB_ID | Job ID number of running job or array task. May differ from SLURM_ARRAY_JOB_ID depending on array task index. | sbatch | srun |
| --output | -o | SBATCH_OUTPUT | Path to file storing text output. | sbatch | srun |
| --error | -e | SBATCH_ERROR | Path to file storing error output. | sbatch | srun |
| --partition | -p | SBATCH_PARTITION | Partition to submit job to. More details below. | sbatch | srun |
| --time | -t | SBATCH_TIMELIMIT | Maximum allowed runtime of job. Allowed formats below. | sbatch | srun |
| --nodes | -N | | Number of nodes needed. Set to 1 if your software does not use MPI or if unsure. | sbatch | srun |
| --ntasks | -n | SLURM_NTASKS | Number of tasks planned per node. Mostly used for bookkeeping and calculating total cpus per node. If unsure set to 1. | sbatch | srun |
| --cpus-per-task | -c | SLURM_CPUS_PER_TASK | Number of needed cores per task. Cores per node equals -n times -c. | sbatch | srun |
| | | SLURM_CPUS_ON_NODE | Number of cpus available on this node. | sbatch | srun |
| --mem | | SLURM_MEM_PER_NODE | Amount of RAM needed per node in MB. Can specify 16 GB using 16384 or 16G. | sbatch | srun |
| --gres | | SBATCH_GRES | Used to request GPUs per node. For 2 GPUs per node use --gres=gpu:2. | sbatch | srun |
| --array | | SBATCH_ARRAY_INX | Comma-separated list of similar tasks to run. More details below. | sbatch | n/a |
| | | SLURM_ARRAY_JOB_ID | Parent Job ID number of array task. Same for all array tasks submitted with same script. May differ from SLURM_JOB_ID depending on array task index. | sbatch | n/a |
| | | SLURM_ARRAY_TASK_COUNT | Total number of array tasks. | sbatch | n/a |
| | | SLURM_ARRAY_TASK_ID | Current array task index. | sbatch | n/a |
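To make the relationship between flags and environment variables more concrete, here is a minimal sketch of a job script that prints a few of them. The job name and resource values are arbitrary assumptions for illustration, and the `:-1` fallbacks are only there so the script also runs outside of Slurm, where these variables are unset.

```bash
#!/bin/bash
#SBATCH --job-name=show-env
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH --partition=express
#SBATCH --time=00:05:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

# Slurm sets these inside a job; fall back to 1 when run elsewhere.
ntasks="${SLURM_NTASKS:-1}"
cpus_per_task="${SLURM_CPUS_PER_TASK:-1}"

# For a single-node job, cores per node equals -n times -c.
cores_per_node=$(( ntasks * cpus_per_task ))

echo "Job ID:         ${SLURM_JOB_ID:-not in a Slurm job}"
echo "Tasks:          ${ntasks}"
echo "CPUs per task:  ${cpus_per_task}"
echo "Cores per node: ${cores_per_node}"
```

Submitted with sbatch, the directives above should yield 2 tasks with 4 CPUs each, so 8 cores on the node.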
Available Partitions for --partition¶
Please see Cheaha Hardware for more information. Remember, the smaller your resource request, the sooner your job will get through the queue.
Requesting GPUs¶
Please see the GPUs page for more information.
Dynamic --output and --error File Names¶
The --output and --error flags can use dynamic job information as part of the name:
- %j is the Job ID, equal to $SLURM_JOB_ID.
- %A is the main Array Job ID, equal to $SLURM_ARRAY_JOB_ID.
- %a is the Array job index number, equal to $SLURM_ARRAY_TASK_ID.
- %x is the --job-name, equal to $SLURM_JOB_NAME.

For example if using --job-name=my-job, then to create an output file like my-job-12345678 use --output=%x-%j.
If also using --array=0-4, then to create an output file like my-job-12345678-0 use --output=%x-%A-%a.
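The substitution can be illustrated with a small sketch. Slurm performs this expansion itself when the job starts; the `expand_pattern` helper below is hypothetical and exists only to show what the resulting file names look like for the two examples above.

```bash
#!/bin/bash

# Hypothetical helper that mimics Slurm's filename pattern expansion.
# Arguments: pattern, job name, job ID, array job ID, array task index.
expand_pattern() {
    local pattern="$1" job_name="$2" job_id="$3" array_job_id="$4" task_id="$5"
    printf '%s\n' "$pattern" | sed \
        -e "s/%x/${job_name}/g" \
        -e "s/%j/${job_id}/g" \
        -e "s/%A/${array_job_id}/g" \
        -e "s/%a/${task_id}/g"
}

# --job-name=my-job with --output=%x-%j
expand_pattern "%x-%j" "my-job" "12345678" "" ""        # prints my-job-12345678

# --job-name=my-job, --array=0-4 with --output=%x-%A-%a, for task 0
expand_pattern "%x-%A-%a" "my-job" "" "12345678" "0"    # prints my-job-12345678-0
```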
Batch Jobs with sbatch¶
Important
The following examples assume familiarity with the Linux terminal. If you are unfamiliar with the terminal then please see our Shell page for more information and educational resources.
Batch jobs are typically submitted using scripts with sbatch. Using sbatch this way is the preferred method for submitting jobs to Slurm on Cheaha. Scripts are more portable, shareable, and reproducible, and can be version controlled using Git.
For batch jobs, flags are typically included as directive comments at the top of the script like #SBATCH --job-name=my-job. Read on to see examples of batch jobs using sbatch.
A Simple Batch Job¶
Below is an example batch job script. To test it, copy and paste it into a plain text file testjob.sh in your Home Directory on Cheaha. Run it at the terminal by navigating to your home directory by entering cd ~ and then entering sbatch testjob.sh. Momentarily, two text files with .out and .err suffixes will be produced in your home directory.
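A minimal sketch of testjob.sh, written to be consistent with the line-by-line breakdown in this section (job name test, 1 node, 1 core, 1 GB of memory, the express partition, a 10 minute limit). Treat it as an illustration that matches that description, with the line positions chosen so the line numbers in the breakdown apply.

```bash
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --partition=express
#SBATCH --time=00:10:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

echo "Hello World"
echo "Hello Error" 1>&2
```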
There is a lot going on in the above script, so let's break it down. There are three main chunks of this script:
- Line 1 is the interpreter directive: #!/bin/bash. This tells the shell what application to use to execute this script. All sbatch scripts on Cheaha should start with this line.
- Lines 3-11 are the sbatch flags which tell the scheduler what resources you need and how to manage your job.
    - Line 3: The job name is test.
    - Lines 4-7: The job will have 1 node, with 1 core and 1 GB of memory.
    - Line 8: The job will be on the express partition.
    - Line 9: The job will be no longer than 10 minutes, and will be terminated if it runs over.
    - Line 10: Any standard output (stdout) will be written to the file test_$SLURM_JOB_ID.out in the directory the job was submitted from (here, the same directory as the script), whatever the $SLURM_JOB_ID happens to be when the job is submitted. The name comes from %x equal to test, the --job-name, and %j equal to the Job ID.
    - Line 11: Any error output (stderr) will be written to a different file test_$SLURM_JOB_ID.err in the same directory.
- Lines 13 and 14 are the payload, or tasks to be run. They will be executed in order from top to bottom just like any shell script. In this case, it is simply writing "Hello World" to the --output file and "Hello Error" to the --error file. The 1>&2 means redirect a copy (>&) of stdout to stderr.
Batch Array Jobs With Known Indices¶
Building on the job script above, below is an array job. Array jobs are useful when you need to perform the same analysis on slightly different inputs with no interaction between those analyses. We call this situation "pleasingly parallel". We can take advantage of an array job using the variable $SLURM_ARRAY_TASK_ID, which will have an integer in the set of values we give to the --array flag.
To test the script below, copy and paste it into a plain text file testarrayjob.sh in your Home Directory on Cheaha. Run it at the terminal by navigating to your home directory by entering cd ~ and then entering sbatch testarrayjob.sh. Momentarily, 20 text files with .out and .err suffixes will be produced in your home directory, one of each per array task.
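A sketch of testarrayjob.sh consistent with the description in this section: it reuses the resources of the simple job, adds --array=0-9, and names output files with %x_%A_%a. The `:-0` fallback is an addition for illustration so the script also runs outside of Slurm, where $SLURM_ARRAY_TASK_ID is unset.

```bash
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --partition=express
#SBATCH --time=00:10:00
#SBATCH --output=%x_%A_%a.out
#SBATCH --error=%x_%A_%a.err
#SBATCH --array=0-9

# Index of this task within the array (0 through 9 under Slurm).
task_id="${SLURM_ARRAY_TASK_ID:-0}"

echo "${task_id}"
```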
This script is very similar to the one above, but will submit 10 jobs to the scheduler that all do slightly different things. Each of the 10 jobs will have the same amount and type of resources allocated, and can run in parallel. The 10 jobs come from --array=0-9. The output of each job will be one of the numbers in the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, depending on which job is running. The output files will look like test_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.out or .err. The value of ${SLURM_ARRAY_JOB_ID} is the main Job ID given to the entire array submission.
Scripts can be written to take advantage of the $SLURM_ARRAY_TASK_ID variable as an index. For example, a project could have a list of participants that should be processed in the same way, and the analysis script uses the array task ID as an index to pull out one entry from that list for each job. Many common programming languages can interact with shell variables like $SLURM_ARRAY_TASK_ID, or the values can be passed to a program as an argument.
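The participant scenario can be sketched as follows. The participant labels and job name are hypothetical, and the `:-0` fallback is only there so the script can be tried outside of Slurm. Because bash arrays are 0-based, --array=0-2 lines up with the three entries.

```bash
#!/bin/bash
#SBATCH --job-name=participants
#SBATCH --ntasks=1
#SBATCH --partition=express
#SBATCH --time=00:10:00
#SBATCH --output=%x_%A_%a.out
#SBATCH --error=%x_%A_%a.err
#SBATCH --array=0-2

# Hypothetical participant list; one entry per array task.
participants=("sub-01" "sub-02" "sub-03")

# Use the array task index to pick this task's participant.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
participant="${participants[$task_id]}"

# Replace this line with the real analysis command,
# passing the participant as an argument.
echo "Processing participant: ${participant}"
```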
You can override the --array flag stored in the script when you call sbatch. To do so, pass another --array flag along with the script name like below. This allows you to rerun only subsets of your array script.
# submit jobs with index 0, 3, and 7
sbatch --array=0,3,7 array.sh
# submit jobs with index 0, 2, 4, and 6
sbatch --array=0-6:2 array.sh
For more details on using sbatch please see the official documentation.
Note
If you are using bash or shell arrays, it is crucial to note they use 0-based indexing. Plan your --array flag indices accordingly.
Throttling in Slurm Array Jobs¶
Throttling in Slurm array jobs refers to limiting the number of concurrent jobs that can run simultaneously. This approach prevents the overloading of computing resources and ensures fair distribution of resources among users. From a performance perspective, throttling helps optimize overall job performance by reducing resource contention across the Cheaha cluster. When too many jobs run at the same time, they may compete for CPU, memory, or I/O, which can negatively impact performance. Please contact us if your research needs exceed our capacity.
To limit the number of concurrent jobs in a Slurm array, you can use the % separator. Here's how to use it in the above example:
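As a sketch, assuming the --array=0-9 range from the array script above, the throttled directive would look like this:

```bash
# 10 array tasks (indices 0 through 9), at most 4 running at once.
#SBATCH --array=0-9%4

# The same limit can be given when submitting, overriding the script:
# sbatch --array=0-9%4 testarrayjob.sh
```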
In this example, only 4 jobs will run concurrently, regardless of the total number of jobs (10) in the array.
Batch Array Jobs With Dynamic or Computed Indices¶
For a practical example with dynamic indices, please visit our Practical sbatch Examples.
Interactive Jobs with srun¶
Jobs should be submitted to the Slurm job scheduler either using a batch job or an Open OnDemand (OOD) interactive job.
You can use srun for working on short interactive tasks such as creating an Anaconda environment and running parallel tasks within an sbatch script.
Warning
A limitation of srun is that the job dies if your internet connection drops, and you may have to rerun it.
We recommend against using srun for any scientific or research computing or data analysis. Use a batch job or an Open OnDemand (OOD) interactive job instead.
Let us see how to acquire a compute node quickly using srun. You can run an interactive job using the srun command with the --pty /bin/bash flag. Here is an example:
$ srun --ntasks=2 --time=01:00:00 --mem-per-cpu=8G --partition=medium --job-name=test_srun --pty /bin/bash
srun: job 21648044 queued and waiting for resources
srun: job 21648044 has been allocated resources
The above example allocates resources on a compute node in the medium partition with --ntasks=2 and 8 GB of RAM per CPU, for up to one hour, to run short tasks.
srun for running parallel jobs¶
srun is used to run executables in parallel, and is used within an sbatch script. Let us see an example where srun is used to launch multiple (parallel) instances of a job.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --job-name=srun_test
#SBATCH --partition=long
#SBATCH --time=05:00
#SBATCH --mem=4G
srun hostname
In the script above, we have asked for two nodes with --nodes=2, and each node will run a single instance of hostname because we requested --ntasks-per-node=1. The output of the script is the hostname of each of the two allocated nodes, one per line.
Here is another example of running different independent programs simultaneously on different resources within a batch job. Multiple srun commands can execute simultaneously as long as they do not exceed the resources reserved for that job, i.e., step 1 executes on node 1 with --ntasks=4, and step 2 executes on node 2 with --ntasks=4 simultaneously. Note that --nodes=1 -r1 in step 2 defines the number of nodes and their relative node position within the resources assigned to the job.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --partition=amd-hdr100
#SBATCH --time=05:00
#SBATCH --mem-per-cpu=1G
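#Partitioning of resources for two different tasks
#STEP 1: runs on the first node; the trailing & sends it to the background
srun --nodes=1 --ntasks=4 hostname &
#STEP 2: runs on the second node (-r1 is relative node index 1)
srun --nodes=1 -r1 --ntasks=4 uname -a &
#Wait for both background steps to finish before the job exits
wait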
Here is the output for running multiple srun commands in a single job, i.e., executing the hostname and uname -a tasks simultaneously but on different nodes.
c0203
c0203
c0203
c0203
Linux c0204 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Mar 25 21:21:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux c0204 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Mar 25 21:21:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux c0204 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Mar 25 21:21:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux c0204 3.10.0-1160.24.1.el7.x86_64 #1 SMP Thu Mar 25 21:21:56 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Alternatively, srun can also run MPI, OpenMP, hybrid MPI/OpenMP, and many more parallel jobs. For more details on using srun, please see the official documentation.
Important
srun has been disabled for use with MPI. We have removed this functionality due to an open vulnerability: https://nvd.nist.gov/vuln/detail/CVE-2023-41915. The vulnerability could allow an attacker to escalate privileges to root and/or access data they do not have permissions for.
Instead of srun, please load one of the OpenMPI modules with an appropriate version. Please contact Support with any questions or concerns.
Environment Setup and Module Usage in Job Submission¶
Before submitting a job using sbatch, it's crucial to establish a tailored environment, including software installations and loading necessary modules containing the required software packages. We highly recommend the practice of putting module reset before any module load calls in job scripts. The module system modifies the environment whenever the module list changes, and Slurm jobs inherit the environment from whatever called sbatch or srun. The module reset command normalizes the initial environment for the script, improving repeatability and minimizing the risk of hard-to-diagnose module conflicts. For examples and further information, please see best practice for loading modules.
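The recommended ordering can be sketched as below. The module name and job name are hypothetical placeholders (check module avail for what is actually installed), and the guard around the module calls is an addition so the script does not fail when tried on a machine without the module system.

```bash
#!/bin/bash
#SBATCH --job-name=module-demo
#SBATCH --ntasks=1
#SBATCH --partition=express
#SBATCH --time=00:10:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

# Hypothetical module name; replace with one listed by `module avail`.
MODULE_NAME="Anaconda3"

if command -v module >/dev/null 2>&1; then
    # Start from a clean, known module state before loading anything.
    module reset
    module load "${MODULE_NAME}"
    # Record the loaded modules in the job output for reproducibility.
    module list
else
    echo "module command not found; skipping module setup" >&2
fi

# Payload commands that depend on the loaded software go here.
```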
Graphical Interactive Jobs¶
It is highly recommended to use the Open OnDemand web portal for interactive apps. Interactive sessions for certain software such as MATLAB and RStudio can be created directly from the browser while an HPC Desktop is available to access all of the other software on Cheaha. A terminal is also available through Open OnDemand.
It is possible to use other remote desktop software, such as VNC, to start and interact with jobs. These methods are not officially supported and we do not have the capacity to help with remote desktop connections. Instead, please consider switching your workflow to use the Open OnDemand HPC Desktop. If you are unable to use this method, please contact Support.
Estimating Compute Resources¶
Being able to estimate how many resources a job will need is critical. Requesting many more resources than necessary bottlenecks the cluster by reserving unused resources for an inefficient job, preventing other jobs from using them. However, requesting too few resources will slow down the job or cause it to fail.
Questions to ask yourself when requesting job resources:
- Can my scripts take advantage of multiple CPUs?
    - For instance, RStudio generally works on a single thread. Requesting more than 1 CPU here would not improve performance.
- How large is the data I'm working with?
- Do my pipelines keep large amounts of data in memory?
- How long should my job take?
    - For example, do not request 50 hours for a 15 hour process. Have a reasonable buffer included to account for unexpected processing delays, but do not request the maximum time on a partition if that's unnecessary.
Note
Reasonable overestimation of resources is better than underestimation. However, gross overestimation may cause admins to contact you about adjusting resources for future jobs.
To get the most out of your Cheaha experience and ensure your jobs get through the queue as fast as possible, please read about Job Efficiency.
Faster Queuing with Job Efficiency¶
Please see our page on Job Efficiency for more information on making the best use of cluster resources to minimize your queue wait times.