To check the status of your job(s), use the myqueue command.
For example:
hpcuser@x3002c0s7b0n0:~> myqueue
JOBID  PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
29745  compute    sander.M  hpcuser  R   7:33  1      lanta-c-065
29746  compute    sleep     hpcuser  R   0:09  1      lanta-c-065
Where:
JOBID | ID of the job |
PARTITION | partition the job runs in |
NAME | name of the job |
USER | user who submitted the job |
ST | state of the job (R: Running, PD: Pending) |
TIME | time the job has been running |
NODES | number of nodes used by the job |
NODELIST(REASON) | list of node(s) used by the job or, if the job is in the PD (Pending) state, the reason why |
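As a minimal sketch, the ST column can be used to filter the listing by state. The helper name jobs_in_state is hypothetical, and the code assumes that myqueue always prints the header and column order shown in the example and that job names contain no spaces.

```shell
# Print the JOBID of every job whose ST column matches the given state.
# Reads myqueue-style output on stdin; line 1 is assumed to be the header.
jobs_in_state() {
    awk -v st="$1" 'NR > 1 && $5 == st { print $1 }'
}

# Sample listing copied from the example output, so no cluster is needed.
sample_output='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
29745 compute sander.M hpcuser R 7:33 1 lanta-c-065
29746 compute sleep hpcuser R 0:09 1 lanta-c-065'

printf '%s\n' "$sample_output" | jobs_in_state R    # prints 29745 and 29746
printf '%s\n' "$sample_output" | jobs_in_state PD   # prints nothing: no pending jobs
```

On the cluster, the same filter would be fed from the live listing, for example: myqueue | jobs_in_state PD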
NODELIST (REASON)
Resources
The job is first in the queue and is waiting for resources. It starts automatically as soon as the system has the resources requested in the submission script.
Priority
The job is in the queue behind one or more higher-priority jobs. It runs after the jobs in the "Resources" state, once the system has the resources requested in the submission script.
QOSGrpBillingMinutes
The job has not entered the queue because the project does not have enough service units (SU) for the job to run (you can check the project's SU balance with the "sbalance" command). The state changes automatically once enough SU are available for the job.
FAQ: sbalance - Why is the job still in the QOSGrpBillingMinutes state when there are available SU?
Slurm checks whether the remaining SU cover all the resources requested in the script for the full requested duration, not only what the job will actually use. For example, if you request 1 GPU node (-p gpu -N 1) for 1 day (-t 24:00:00), the account must hold more than 187,200 SU (130 x 60 x 24) for the job to avoid the QOSGrpBillingMinutes state. It is therefore essential to set a realistic time limit for each job, especially when few SU remain.
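The arithmetic in this example can be sketched as a small shell function. The name required_su is hypothetical, and reading 130 as the SU charged per GPU node per minute is an assumption based on the formula 130 x 60 x 24; other partitions may be billed at a different rate.

```shell
# Estimate the SU a job must be able to cover before it can leave the
# QOSGrpBillingMinutes state: rate x nodes x requested minutes.
# Arguments: SU per node per minute (assumed), number of nodes, hours requested.
required_su() {
    echo $(( $1 * $2 * $3 * 60 ))
}

required_su 130 1 24   # 1 GPU node for 24 hours: prints 187200
required_su 130 1 6    # same node for 6 hours:   prints 46800
```

Comparing the result with the balance reported by sbalance shows whether a shorter time limit would let the job start.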
PartitionTimeLimit
The job has not entered the queue because the requested time limit exceeds the maximum allowed in the partition. In this case, you must cancel the job, lower the time limit (-t) in the submission script, and submit the job again. The maximum time limit of each partition can be checked using the "sinfo" command.
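A requested time limit can be compared with a partition limit before submission. The sketch below is one possible way to do this. The helper names to_seconds and fits_limit are hypothetical, the limit 5-00:00:00 is an illustrative value rather than an actual partition setting, and only the [D-]HH:MM:SS time form is handled.

```shell
# Convert a time string of the form [D-]HH:MM:SS to seconds.
# Other forms accepted by Slurm (such as "MM" or "infinite") are rejected.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4)      print $1*86400 + $2*3600 + $3*60 + $4   # D-HH:MM:SS
        else if (NF == 3) print $1*3600 + $2*60 + $3              # HH:MM:SS
        else exit 1
    }'
}

# Succeed if the requested time ($1) does not exceed the partition limit ($2).
# Returns 2 if either value is in a form that to_seconds does not handle.
fits_limit() {
    requested="$(to_seconds "$1")" || return 2
    limit="$(to_seconds "$2")" || return 2
    [ "$requested" -le "$limit" ]
}

# Hypothetical limit; read the real one from the TIMELIMIT column of sinfo.
if fits_limit 24:00:00 5-00:00:00; then
    echo "time limit fits"
else
    echo "time limit too long, lower -t"
fi
```

The limit itself would come from the TIMELIMIT column that sinfo prints for each partition.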
MaxCpuPerAccount
Means the job has not entered the queuing system because the number of CPUs/Cores exceeds the policy defined for job submission. The job's status will automatically change to another state when the number of CPUs/Cores running in the system decreases.
MaxJobPerAccount
Means the job has not entered the queuing system because the number of jobs exceeds the policy defined for job submission. The job's status will automatically change to another state when the number of jobs running in the system decreases.