To check the status of your job(s), use the myqueue command.

...

hpcuser@x3002c0s7b0n0:~>myqueue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             29745   compute sander.M  hpcuser  R       7:33      1 lanta-c-065
             29746   compute    sleep  hpcuser  R       0:09      1 lanta-c-065

Where,

JOBID

job ID of a given job

PARTITION

running partition of a given job

NAME

name of a given job

USER

user who runs the given job

ST

state of a given job (R: running, PD: Pending)

TIME

running time of a given job

NODES

number of nodes used by a given job

NODELIST(REASON)

list of node(s) used by a given job, or the reason why the job is in the PD (Pending) state
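As a rough illustration of reading these columns, the sketch below filters myqueue-style output for pending jobs and prints each job ID with its reason. It assumes the eight-column layout shown above and job names without spaces; the helper name pending_jobs is hypothetical and not part of the system.

```shell
# Hypothetical helper: print "JOBID REASON" for each pending (PD) job.
# Reads myqueue-style output from standard input.
# Assumes the column order JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON).
pending_jobs() {
    awk '$5 == "PD" {
        # Field 8 onward holds the reason; rejoin it in case it contains spaces.
        reason = $8
        for (i = 9; i <= NF; i++) reason = reason " " $i
        print $1, reason
    }'
}
```

A possible use is `myqueue | pending_jobs`, which would list only the jobs that are still waiting.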

...

NODELIST (REASON)

Resources

The job is first in the queue and waiting to be executed. It will run automatically when the system has the resources requested in the submitted script.

Priority

The job is in the queue behind other jobs. It will be executed after jobs with the "Resources" status, once the system has the resources requested in the submitted script.

QOSGrpBillingMinutes

Means the job has not entered the queuing system because there are not enough service unit hours (SHr) available for the job to run (you can check the SU status of the project using the sbalance command). The job's status will automatically change to another state when enough SHr are available for that job to run.

...

Slurm checks whether the remaining SU are enough to cover all the resources specified in the script for the full time limit. For example, if you request 1 GPU node with 4 GPU cards (-p gpu -N 1 --gpus=4) for a duration of 1 day (-t 24:00:00), there must be more than 72 SHr (3 x 24) in the account to avoid the QOSGrpBillingMinutes state. It is therefore essential to set the time limit appropriately for each job, especially when few SU remain.
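The arithmetic above can be sketched as a small helper. This is a minimal sketch, assuming the charge is simply an hourly rate multiplied by the requested time limit; the helper name required_shr is hypothetical, and the hourly rate is an input you must take from your project's charging policy rather than a value defined here.

```shell
# Hypothetical helper: estimate the SHr a job must have available at submission.
# $1 = charge rate in SHr per hour for the requested resources (assumption: supplied by you)
# $2 = time limit as passed to -t, in HH:MM:SS or D-HH:MM:SS form
required_shr() {
    awk -v rate="$1" -v t="$2" 'BEGIN {
        days = 0
        # Split off an optional leading "D-" day count.
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        split(t, p, ":")
        hours = days * 24 + p[1] + p[2] / 60 + p[3] / 3600
        printf "%.2f\n", rate * hours
    }'
}

# Example with an assumed rate of 3 SHr per hour and a one-day limit:
required_shr 3 24:00:00   # prints 72.00
```

Comparing this estimate with the balance reported by sbalance before submitting can help avoid the QOSGrpBillingMinutes state.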

PartitionTimeLimit

The job has not entered the queuing system because the requested time limit exceeds the maximum defined for the partition. In this case, you must cancel the job (scancel), modify the time limit (-t) in the script used for submission, and submit the job again. The maximum time limit of each partition can be checked using the sinfo command.
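The "modify the script" step can be sketched as follows. This is only an illustration, assuming the time limit is set in the job script by a line of the form "#SBATCH -t ..." or "#SBATCH --time=..."; the helper name set_time_limit and the script name job.sh are hypothetical.

```shell
# Hypothetical helper: rewrite the "#SBATCH -t ..." line of a job script in place.
# $1 = path to the job script, $2 = new time limit (for example 12:00:00)
set_time_limit() {
    tmp=$(mktemp) || return 1
    # Replace any "#SBATCH -t <value>" or "#SBATCH --time=<value>" line.
    sed -E "s/^#SBATCH[[:space:]]+(-t|--time)[[:space:]=]+.*/#SBATCH -t $2/" "$1" > "$tmp" || return 1
    # Copy the contents back so the script keeps its original permissions.
    cat "$tmp" > "$1" && rm -f "$tmp"
}
```

A possible sequence would be `sinfo -o "%P %l"` to list each partition with its time limit, `scancel <jobid>` to cancel the blocked job, `set_time_limit job.sh 12:00:00` to shorten the request, and `sbatch job.sh` to submit again.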

MaxCpuPerAccount

Means the job has not entered the queuing system because the number of CPUs/Cores exceeds the policy defined for job submission. The job's status will automatically change to another state when the number of CPUs/Cores running in the system decreases.

MaxJobPerAccount

Means the job has not entered the queuing system because the number of jobs exceeds the policy defined for job submission. The job's status will automatically change to another state when the number of jobs running in the system decreases.