To check the status of your job(s), use the myqueue command.

...

hpcuser@x3002c0s7b0n0:~>myqueue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             29745   compute sander.M  hpcuser  R       7:33      1 lanta-c-065
             29746   compute    sleep  hpcuser  R       0:09      1 lanta-c-065

Where,

JOBID

job ID of a given job

PARTITION

running partition of a given job

NAME

name of a given job

USER

user who ran the given job

ST

state of a given job (R: Running, PD: Pending)

TIME

running time of a given job

NODES

number of nodes used by a given job

NODELIST(REASON)

list of node(s) used by a given job, or, in parentheses, the reason why the job is in the PD (Pending) state
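As a rough illustration of how the columns above can be used, the following sketch filters a listing for pending jobs and prints each job ID with its reason. It assumes the output has exactly the eight columns shown above and that job names contain no spaces. The function name pending_reasons is hypothetical, and the pending line in the sample is invented only to exercise the filter.

```shell
# Print "JOBID REASON" for every pending (ST == PD) job in a listing read
# from standard input. The header line is skipped. Fields 8 and onward are
# joined so that a reason containing spaces is kept whole.
pending_reasons() {
    awk 'NR > 1 && $5 == "PD" {
        reason = $8
        for (i = 9; i <= NF; i++) reason = reason " " $i
        print $1, reason
    }'
}

# Sample input: the two running jobs come from the listing above; the third,
# pending line is hypothetical.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
29745 compute sander.M hpcuser R 7:33 1 lanta-c-065
29746 compute sleep hpcuser R 0:09 1 lanta-c-065
29747 compute sleep hpcuser PD 0:00 1 (Priority)'

printf '%s\n' "$sample" | pending_reasons   # prints: 29747 (Priority)
```

On the cluster, the same filter could be fed directly from the command, for example: myqueue | pending_reasons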

...

NODELIST (REASON)

Resources

Means the job is first in the queue, waiting to be executed. It will run automatically as soon as the system has the resources requested in the submission script.

Priority

Means the job is in the queue behind other jobs. It will run after the jobs with the "Resources" status, once the system has the resources requested in the submission script.

QOSGrpBillingMinutes

Means the job has not entered the queuing system because there are not enough service hours (SHr) available for the job to run (you can check the project's SHr balance using the "sbalance" command). The job's status will automatically change to another state when enough SHr are available for that particular job to run.

FAQ: sbalance - Why is the job still in the QOSGrpBillingMinutes state when there are available SHr?

Slurm checks whether the remaining SHr are enough to cover all the resources requested in the script for the full requested duration. For example, if you request 1 GPU node with 4 GPU cards (-p gpu -N 1 --gpus=4) for 1 day (-t 24:00:00), there must be more than 72 SHr (3 x 24) in the account to avoid the QOSGrpBillingMinutes state. It is therefore essential to set an appropriate time limit for each job, especially when the remaining SHr are limited.
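The arithmetic in the example can be sketched as follows. This is only an illustration: the rate of 3 per hour is taken from the "(3 x 24)" in the example above, the balance of 50 is a made-up value, and the function names required_shr and can_start are hypothetical. The real balance should be read from sbalance.

```shell
# Service hours needed before a job may start:
# charge rate (per wall-clock hour) x requested hours.
required_shr() {
    rate=$1    # charge per hour for the requested resources (assumed value)
    hours=$2   # requested time limit in hours (the -t option)
    echo $(( rate * hours ))
}

# can_start BALANCE RATE HOURS
# Succeeds only when the balance is strictly greater than the requirement,
# matching the "more than" wording in the example.
can_start() {
    [ "$1" -gt "$(required_shr "$2" "$3")" ]
}

required_shr 3 24   # prints 72

# Hypothetical balance of 50: not enough for a 24-hour request at rate 3.
if can_start 50 3 24; then
    echo "enough balance"
else
    echo "job would wait in QOSGrpBillingMinutes"
fi
```

With the same hypothetical balance, shortening the request to 12 hours lowers the requirement to 36, so can_start 50 3 12 succeeds. This is why adjusting the time limit matters when the balance is low.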

PartitionTimeLimit

Means the job has not entered the queuing system because the requested time limit is greater than the maximum allowed for the partition. In this case, you must cancel the job (scancel), change the time limit in the submission script (-t), and submit the job again. The maximum time limit of each partition can be checked using the "sinfo" command.
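Before resubmitting, the requested time can be compared with the partition limit. The sketch below does this for time strings in the HH:MM:SS and D-HH:MM:SS forms only. The function names to_seconds and exceeds_limit are hypothetical, and the 5-day limit used in the example is an assumed value, not the actual limit of any partition.

```shell
# Convert a time string (HH:MM:SS or D-HH:MM:SS) to seconds.
# Other forms are not handled in this sketch and make the function fail.
to_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }
        { exit 1 }'
}

# exceeds_limit REQUESTED LIMIT
# Succeeds when REQUESTED is longer than LIMIT; returns 2 on a parse error.
exceeds_limit() {
    req=$(to_seconds "$1") || return 2
    lim=$(to_seconds "$2") || return 2
    [ "$req" -gt "$lim" ]
}

# Assumed partition limit of 5 days; replace with the value reported by sinfo.
if exceeds_limit 6-00:00:00 5-00:00:00; then
    echo "time limit too long: cancel the job, lower -t, and resubmit"
fi
```

In standard Slurm, sinfo -o "%P %l" lists each partition with its maximum time limit, which can serve as the second argument.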

MaxCpuPerAccount

Means the job has not entered the queuing system because the number of CPUs/Cores exceeds the limit set by the job submission policy. The job's status will automatically change to another state when the number of CPUs/Cores running in the system decreases.

MaxJobPerAccount

Means the job has not entered the queuing system because the number of jobs exceeds the limit set by the job submission policy. The job's status will automatically change to another state when the number of jobs running in the system decreases.