Checking Job status

To check the status of your job(s), use the myqueue command.

For example,

hpcuser@x3002c0s7b0n0:~>myqueue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             29745   compute sander.M  hpcuser  R       7:33      1 lanta-c-065
             29746   compute    sleep  hpcuser  R       0:09      1 lanta-c-065

Where,

JOBID	job ID of a given job
PARTITION	partition in which a given job runs
NAME	name of a given job
USER	user who submitted the given job
ST	state of a given job (R: running, PD: pending)
TIME	elapsed run time of a given job
NODES	number of nodes used by a given job
NODELIST(REASON)	list of node(s) used by a given job, or the reason why the job is in the PD (pending) state
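As a minimal sketch of how these columns can be used, the helper below lists only the pending jobs together with their reason. The function name pending_reasons is hypothetical, and the sketch assumes that the input has the column layout shown above and that job names contain no spaces.

```shell
# Read myqueue-style output on stdin and print "JOBID REASON"
# for every job whose state (5th column) is PD.
pending_reasons() {
    awk '$5 == "PD" {
        # The reason starts at column 8 and may contain spaces.
        reason = $8
        for (i = 9; i <= NF; i++) reason = reason " " $i
        # Drop the surrounding parentheses, e.g. "(Priority)" becomes "Priority".
        gsub(/^\(|\)$/, "", reason)
        print $1, reason
    }'
}
```

On the cluster it could be used as: myqueue | pending_reasons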

NODELIST (REASON)

Resources

The job is first in the queue, waiting to be executed. It will run automatically once the system has the resources requested in the submitted script.

Priority

The job is in the queue, waiting its turn behind other jobs. It will run after the jobs in the "Resources" state, once the system has the resources requested in the submitted script.

QOSGrpBillingMinutes

The job has not entered the queue because the project does not have enough service unit hours (SHr) for it to run (you can check the project's SU balance with the sbalance command). The job's state changes automatically once enough SHr are available for that job.

FAQ: Why is the job still in the QOSGrpBillingMinutes state when there are available SHr?

Slurm checks whether the remaining SU are enough to cover all the resources requested in the script for the full time limit. For example, if you request 1 GPU node with 4 GPU cards (-p gpu -N 1 --gpus=4) for 1 day (-t 24:00:00), the account must hold more than 72 SHr (3 x 24) to avoid the QOSGrpBillingMinutes state. It is therefore essential to set an appropriate time limit for each job, especially when few SU remain.
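The arithmetic in this example can be sketched as a small shell helper. The function name required_shr is hypothetical, and the rate of 3 SHr per hour is taken from the "(3 x 24)" figure in the example; rates for other partitions or resource requests are not stated here and should be confirmed for your project. Only time limits in HH:MM:SS form are handled.

```shell
# Estimate the SHr a job must have available before it can be queued.
#   $1: charge rate in SHr per hour (assumed; 3 in the GPU example)
#   $2: requested time limit in HH:MM:SS, as passed to -t
required_shr() {
    awk -v rate="$1" -v t="$2" 'BEGIN {
        split(t, p, ":")
        secs = p[1] * 3600 + p[2] * 60 + p[3]
        printf "%.2f\n", rate * secs / 3600
    }'
}

required_shr 3 24:00:00   # prints 72.00
required_shr 3 06:00:00   # prints 18.00
```

Shortening the time limit from 24 hours to 6 hours lowers the estimate from 72 to 18 SHr under the same assumed rate.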

PartitionTimeLimit

The job has not entered the queue because the requested time limit exceeds the maximum allowed for the partition. In this case, you must cancel the job (scancel), change the time limit in the submission script (-t), and submit the job again. The maximum time limit of each partition can be checked with the sinfo command.
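The check behind this state can be sketched in shell as follows. The helper names to_seconds and exceeds_limit are hypothetical, the job ID and script name in the comments are placeholders, and only the HH:MM:SS and D-HH:MM:SS time formats are handled.

```shell
# Convert a time string (HH:MM:SS or D-HH:MM:SS) to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        days = 0
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        split(t, p, ":")
        print days * 86400 + p[1] * 3600 + p[2] * 60 + p[3]
    }'
}

# Succeed (exit status 0) when the requested time ($1) is longer
# than the partition limit ($2).
exceeds_limit() {
    [ "$(to_seconds "$1")" -gt "$(to_seconds "$2")" ]
}

# Typical recovery steps on the cluster (placeholders, not run here):
#   sinfo -o "%P %l"          # show each partition and its time limit
#   scancel <jobid>           # cancel the job held by PartitionTimeLimit
#   sbatch -t 24:00:00 job.sh # resubmit with a time limit within the maximum
```

For example, exceeds_limit 6-00:00:00 5-00:00:00 succeeds, which indicates that a 6-day request would not fit a partition limited to 5 days.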

MaxCpuPerAccount

The job has not entered the queue because the number of CPUs/cores exceeds the limit set by the job submission policy. The job's state changes automatically once the number of CPUs/cores running in the system decreases.

MaxJobPerAccount

The job has not entered the queue because the number of jobs exceeds the limit set by the job submission policy. The job's state changes automatically once the number of jobs running in the system decreases.

LANTA user guide