TARA User Guide
Introduction to SLURM
SLURM is the job scheduler of the TARA system. It performs the following functions:
- Allocates resources and manages the queue of pending work.
- Provides a framework for starting, executing, and monitoring work on the allocated resources.
Allocating resources
You can request the following resources:
- Number of nodes/cores
- Amount of time
- Type of machine (partition)
- Memory
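As a rough sketch, each of these resources maps onto an `#SBATCH` directive at the top of a job script. The values below (one node, four tasks, one hour, 8G) are placeholders chosen for illustration, not TARA limits or recommendations; `compute` is one of the TARA partition names.

```shell
#!/bin/bash
#SBATCH --partition=compute      # type of machine (partition)
#SBATCH --nodes=1                # number of nodes (placeholder value)
#SBATCH --ntasks-per-node=4      # tasks (cores) per node (placeholder value)
#SBATCH --time=01:00:00          # wall-clock limit, HH:MM:SS (placeholder value)
#SBATCH --mem=8G                 # memory per node (placeholder value)

# Outside SLURM the #SBATCH lines are ordinary comments, so this script also
# runs directly. SLURM_JOB_ID is only set by SLURM inside a job.
job_label="${SLURM_JOB_ID:-not-under-slurm}"
echo "Job ${job_label} started"
```

Anything not requested explicitly falls back to the site defaults, so it is usually worth stating the time limit and memory rather than relying on them.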
Partition and Queue
- A partition is like a bus stop served by many bus lines. A queue is the line of people waiting at the stop to board a bus.
- In the TARA system, a “partition” is a group of resources, as listed below.
Partition | Details |
---|---|
compute | Cluster for CPU jobs |
memory | Cluster for high-memory jobs |
memory-preempt | Cluster for preemptible high-memory jobs |
gpu | Cluster for GPU jobs |
dgx | Cluster for DGX (GPU) jobs |
interactive | Cluster for interactive jobs |
Node Status details
State | Details |
---|---|
idle | The machine has no reservations and is ready to use. |
alloc | The machine is fully in use and cannot accept more work. |
mix | Some of the machine's CPUs are reserved, but not all. The remaining CPUs can still be used. |
down | The machine is not available for use. |
drain | The machine is not available for use due to a problem with the system. |
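The state table can be read as a simple rule: only `idle` and `mix` nodes have free resources. The sketch below encodes that rule in a hypothetical helper, `node_has_free_resources`. It assumes the short state names exactly as listed in the table; real `sinfo` output may add suffixes to a state name, which this sketch does not handle. To list only such nodes on the cluster itself, `sinfo --states=idle,mix` applies the same filter.

```shell
# Return success (0) if a node in the given state can still take new work,
# following the state table above: idle and mix have free resources.
node_has_free_resources() {
    case "$1" in
        idle|mix) return 0 ;;
        *)        return 1 ;;   # alloc, down, drain, or anything unrecognized
    esac
}

for state in idle alloc mix down drain; do
    if node_has_free_resources "$state"; then
        echo "$state: has free resources"
    else
        echo "$state: no free resources"
    fi
done
```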
Basic SLURM Commands
- Partition status
- Show job status
- Cancel job
Partition status
$ sinfo
This command shows the status of the partitions, as detailed below.
[tara@tara-frontend-1-node-ib ~]$ sinfo
PARTITION       AVAIL  TIMELIMIT   NODES  STATE  NODELIST
interactive     up     2:00:00     10     alloc  tara-c-[051-060]
compute*        up     5-00:00:00  4      mix    tara-c-[010,037,039,044]
compute*        up     5-00:00:00  56     alloc  tara-c-[001-009,011-036,038,040-043,045-060]
memory          up     5-00:00:00  7      mix    tara-m-[002-004,006-008,010]
memory          up     5-00:00:00  3      alloc  tara-m-[001,005,009]
memory-preempt  up     5-00:00:00  4      mix    tara-m-[006-008,010]
memory-preempt  up     5-00:00:00  1      alloc  tara-m-009
gpu             up     5-00:00:00  1      alloc  tara-g-001
gpu             up     5-00:00:00  1      idle   tara-g-002
Note: If you do not specify a partition when submitting a job, SLURM automatically uses the default partition, which is the one marked with *.
The output above contains the following fields.
- PARTITION is the resource group.
- TIMELIMIT is the maximum run time allowed for a job in the partition.
- NODES is the number of machines in the listed state.
- STATE is the status of those machines.
- NODELIST is the names of those machines.
sinfo information: https://slurm.schedmd.com/sinfo.html
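Because each partition appears once per node state, the per-partition node total has to be summed across rows. The sketch below does that with `awk` and flags the default partition. The helper names `summarize_partitions` and `sample_sinfo_output` are hypothetical, and the sample text is copied from the `sinfo` output shown in this section so that the example runs without access to the cluster. It assumes the default column layout, with NODES as the fourth field.

```shell
# Read sinfo-style text on stdin and print one line per partition:
# the name, the total node count, and "(default)" if it was marked with *.
summarize_partitions() {
    awk '
        $1 == "PARTITION" { next }             # skip the header row
        {
            name = $1
            if (sub(/\*$/, "", name)) is_default[name] = 1
            if (!(name in total)) order[++n] = name   # remember first-seen order
            total[name] += $4                          # NODES column
        }
        END {
            for (i = 1; i <= n; i++) {
                name = order[i]
                printf "%s %d%s\n", name, total[name], ((name in is_default) ? " (default)" : "")
            }
        }
    '
}

# Sample data taken from the sinfo output shown in this guide.
sample_sinfo_output() {
    cat <<'EOF'
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
interactive up 2:00:00 10 alloc tara-c-[051-060]
compute* up 5-00:00:00 4 mix tara-c-[010,037,039,044]
compute* up 5-00:00:00 56 alloc tara-c-[001-009,011-036,038,040-043,045-060]
memory up 5-00:00:00 7 mix tara-m-[002-004,006-008,010]
memory up 5-00:00:00 3 alloc tara-m-[001,005,009]
memory-preempt up 5-00:00:00 4 mix tara-m-[006-008,010]
memory-preempt up 5-00:00:00 1 alloc tara-m-009
gpu up 5-00:00:00 1 alloc tara-g-001
gpu up 5-00:00:00 1 idle tara-g-002
EOF
}

sample_sinfo_output | summarize_partitions
```

On the cluster, the same function could be fed live data with `sinfo | summarize_partitions`. Note that a node can belong to more than one partition, so the totals are per partition and should not be added together.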
Viewing usage and remaining amount of service units (SU)
You can see how many service units (SU) you have used and how many remain by using the following command.
$ sbalance
[tara@tara-frontend-1-node-ib ~]$ sbalance
Account balances for user: tara
Account: thaisc
Description: thaisc
Allocation: 1000000000.00 SU
Remaining Balance: 999162062.00 SU ( 99.92%)
Used: 837938.00 SU
Account: tutorial
Description: tutorial
Allocation: 1000000.00 SU
Remaining Balance: 774183.00 SU ( 77.42%)
Used: 225817.00 SU
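The figures in this output are consistent with a simple relationship: remaining balance = allocation - used, with the percentage taken relative to the allocation. The sketch below reproduces both sample accounts under that assumption. It is inferred from the numbers shown, not from how `sbalance` is implemented, and the helper name `su_remaining` is hypothetical. It also assumes a non-zero allocation.

```shell
# Print the remaining balance and its share of the allocation.
# $1 = allocation in SU, $2 = used SU (allocation must be non-zero).
su_remaining() {
    awk -v alloc="$1" -v used="$2" 'BEGIN {
        remaining = alloc - used
        printf "%.2f SU (%.2f%%)\n", remaining, 100 * remaining / alloc
    }'
}

su_remaining 1000000000.00 837938.00   # account thaisc in the sample above
su_remaining 1000000.00 225817.00      # account tutorial in the sample above
```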
You can see the service units (SU) used by each user in your project by using the following command.
$ sbalance -d
[tara@tara-frontend-1-node-ib ~]$ sbalance -d
Account  User      Used(%)  Used(SU)
prexxxx  tutorial     5.53    276652
         thaisc       0.10      5207
         tara         1.03     51274