ThaiSC

TARA User Guide

Introduction to SLURM

SLURM is the job scheduler of the TARA system. It performs the following functions:

  1. Allocates resources and manages the queue of pending work.
  2. Provides a framework for starting, executing, and monitoring work on the allocated resources.

Allocating resources

When you submit a job, you can request the following resources:

  • Number of nodes/cores
  • Amount of time
  • Type of machine (partition)
  • Memory
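
The resource requests listed above are usually written as #SBATCH directives at the top of a job script. The following is a minimal sketch, not an official TARA template: the partition name "compute" comes from the partition list in this guide, while the node count, task count, time, memory, and job name are illustrative values chosen for the example.

```shell
#!/bin/bash
# Partition (resource group) to run in.
#SBATCH --partition=compute
# Number of nodes.
#SBATCH --nodes=1
# Number of tasks (cores) on each node.
#SBATCH --ntasks-per-node=4
# Wall-clock time limit, HH:MM:SS.
#SBATCH --time=01:00:00
# Memory per node.
#SBATCH --mem=8G
# Hypothetical job name, used only for this example.
#SBATCH --job-name=example

# SLURM sets these variables inside a job. The fallbacks apply only
# when the script is run outside SLURM, for example while testing it.
job_id="${SLURM_JOB_ID:-none}"
ntasks="${SLURM_NTASKS:-1}"
echo "job ${job_id} running ${ntasks} task(s) on $(hostname)"
```

If the script is saved as job.sh (a hypothetical file name), it can be submitted with "sbatch job.sh".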

PARTITION & QUEUE

  • A partition is like a bus stop served by several bus lines. A queue is the line of people waiting at the stop to board a bus.
  • In the TARA system, a partition is a group of resources. The available partitions are listed below.
Partition        Details
compute          Cluster for CPU jobs
memory           Cluster for high-memory jobs
memory-preempt   Cluster for preemptible high-memory jobs
gpu              Cluster for GPU jobs
interactive      Cluster for interactive jobs

Node Status details

State   Details
idle    The machine has no allocations and is ready to use.
alloc   The machine is fully in use and cannot accept more jobs.
mix     Some of the machine's CPUs are allocated; the remaining CPUs are available.
down    The machine is not available for use.
drain   The machine is not available for use due to a problem with the system.

Basic SLURM Commands

  • Partition status
  • Show job status
  • Cancel job

Partition status

$ sinfo shows the status of each partition, as in the example below.

[tara@tara-frontend-1-node-ib ~]$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
interactive       up    2:00:00     10  alloc tara-c-[051-060]
compute*          up 5-00:00:00      4    mix tara-c-[010,037,039,044]
compute*          up 5-00:00:00     56  alloc tara-c-[001-009,011-036,038,040-043,045-060]
memory            up 5-00:00:00      7    mix tara-m-[002-004,006-008,010]
memory            up 5-00:00:00      3  alloc tara-m-[001,005,009]
memory-preempt    up 5-00:00:00      4    mix tara-m-[006-008,010]
memory-preempt    up 5-00:00:00      1  alloc tara-m-009
gpu               up 5-00:00:00      1  alloc tara-g-001
gpu               up 5-00:00:00      1   idle tara-g-002

Note: If you do not specify a partition when submitting a job, SLURM automatically uses the default partition, which is marked with *.
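
To check which partition is the default without reading the whole listing, the * marker can be picked out with awk. This is a minimal sketch: default_partition is a hypothetical helper name, and it assumes the default sinfo column layout shown above, with the partition name in the first field.

```shell
# Print the name of the default partition, i.e. the one whose
# PARTITION field ends with "*". Reads sinfo output on stdin.
default_partition() {
  awk 'NR > 1 && $1 ~ /\*$/ { sub(/\*$/, "", $1); print $1; exit }'
}

# Sample rows copied from the sinfo listing above.
sample='PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
interactive       up    2:00:00     10  alloc tara-c-[051-060]
compute*          up 5-00:00:00      4    mix tara-c-[010,037,039,044]
gpu               up 5-00:00:00      1   idle tara-g-002'

printf '%s\n' "$sample" | default_partition
```

With the sample rows this prints "compute". On the frontend node the same helper can be fed live data with "sinfo | default_partition".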

The output above contains the following fields.

  • PARTITION is the resource group.
  • TIMELIMIT is the maximum run time allowed for a job in the partition.
  • NODES is the number of machines in the given state.
  • STATE is the status of those machines.
  • NODELIST lists the names of those machines.

sinfo information: https://slurm.schedmd.com/sinfo.html
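
The listing can also be narrowed or summarized. The sketch below first uses the standard sinfo options --partition and --states (skipped when sinfo is not installed), then totals the NODES column per STATE with awk. nodes_by_state is a hypothetical helper name, and it assumes the default column layout shown above. Because a machine can belong to more than one partition, the totals may count the same machine more than once.

```shell
# On a system with SLURM installed, filter the listing directly.
if command -v sinfo >/dev/null 2>&1; then
  sinfo --partition=gpu      # only the gpu partition
  sinfo --states=idle,mix    # only machines with free capacity
fi

# Sum the NODES column (4th field) for each STATE (5th field).
# Reads default-format sinfo output on stdin and skips the header.
nodes_by_state() {
  awk 'NR > 1 { count[$5] += $4 }
       END { for (s in count) print s, count[s] }' | sort
}

# Sample rows copied from the sinfo listing in this guide.
sample='PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
interactive       up    2:00:00     10  alloc tara-c-[051-060]
compute*          up 5-00:00:00      4    mix tara-c-[010,037,039,044]
compute*          up 5-00:00:00     56  alloc tara-c-[001-009,011-036,038,040-043,045-060]
memory            up 5-00:00:00      7    mix tara-m-[002-004,006-008,010]
memory            up 5-00:00:00      3  alloc tara-m-[001,005,009]
memory-preempt    up 5-00:00:00      4    mix tara-m-[006-008,010]
memory-preempt    up 5-00:00:00      1  alloc tara-m-009
gpu               up 5-00:00:00      1  alloc tara-g-001
gpu               up 5-00:00:00      1   idle tara-g-002'

printf '%s\n' "$sample" | nodes_by_state
```

For the sample rows the totals are alloc 71, idle 1, and mix 15.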

Viewing usage and remaining amount of service units (SU)

To see how many service units (SU) you have used and how many remain, use the following command.

$ sbalance

[tara@tara-frontend-1-node-ib ~]$ sbalance
Account balances for user: tara
Account: thaisc
        Description:                     thaisc
        Allocation:            1000000000.00 SU
        Remaining Balance:      999162062.00 SU ( 99.92%)
        Used:                      837938.00 SU
Account: tutorial
        Description:                   tutorial
        Allocation:               1000000.00 SU
        Remaining Balance:         774183.00 SU ( 77.42%)
        Used:                      225817.00 SU
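
For scripting, for example to check a balance before submitting a large job, the remaining SU per account can be extracted from this output. This is a minimal sketch: remaining_balance is a hypothetical helper name, and it assumes that sbalance prints exactly the layout shown above.

```shell
# Print "<account> <remaining SU>" for each account.
# Reads sbalance output on stdin.
remaining_balance() {
  awk '$1 == "Account:" { account = $2 }
       $1 == "Remaining" && $2 == "Balance:" { print account, $3 }'
}

# Sample output copied from the sbalance listing above.
sample='Account balances for user: tara
Account: thaisc
        Description:                     thaisc
        Allocation:            1000000000.00 SU
        Remaining Balance:      999162062.00 SU ( 99.92%)
        Used:                      837938.00 SU
Account: tutorial
        Description:                   tutorial
        Allocation:               1000000.00 SU
        Remaining Balance:         774183.00 SU ( 77.42%)
        Used:                      225817.00 SU'

printf '%s\n' "$sample" | remaining_balance
```

With the sample this prints "thaisc 999162062.00" and "tutorial 774183.00", one per line. On the frontend node the helper can be used as "sbalance | remaining_balance".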

To see the service unit (SU) usage of each user in your project, use the following command.

$ sbalance -d

[tara@tara-frontend-1-node-ib ~]$ sbalance -d
                            Used(%)     Used(SU)
 Account      User
 prexxxx      tutorial             5.53       276652                        
              thaisc               0.10         5207                   
              tara                 1.03        51274