Introduction

SLURM (Simple Linux Utility for Resource Management) is a highly scalable,
fault-tolerant cluster manager and job scheduling system for large clusters of compute nodes.

Commands

Query the status of partitions and nodes:

(base) xueruini@nico4:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
V100*        up 1-00:00:00      2  alloc nico[1-2]
Hyb          up 1-00:00:00      1   idle nico3

You may encounter:

(base) xueruini@nico4:~/onion_rain/pytorch/code/ssd.pytorch$ sinfo 
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST                  
V100*        up 1-00:00:00      1  drain nico2                     
V100*        up 1-00:00:00      1  alloc nico1                     
Hyb          up 1-00:00:00      1   idle nico3                     

A node whose STATE is drain cannot be allocated. Use the following command to see why:

(base) xueruini@nico4:~/onion_rain/pytorch/code/ssd.pytorch$ sinfo -R 
REASON               USER      TIMESTAMP           NODELIST           
Kill task failed     root      2020-08-18T15:47:15 nico2              
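For scripting, the drained nodes can be pulled out of sinfo's default output. A minimal sketch, assuming the column layout shown above (STATE in field 5, NODELIST in field 6); the `drained_nodes` helper name is hypothetical:

```shell
# Hypothetical helper: list nodes currently in the drain state.
# Assumes sinfo's default whitespace-separated columns as shown above.
drained_nodes() {
    sinfo --noheader | awk '$5 == "drain" {print $6}'
}

# Once the underlying problem is fixed, an administrator can usually
# return the node to service (privileged; shown as a sketch only):
# scontrol update NodeName=nico2 State=RESUME
```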

Query node information:

(base) xueruini@nico4:~$ scontrol show node nico1
NodeName=nico1 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=32 CPUTot=32 CPULoad=0.16
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=nico1 NodeHostName=nico1 Version=18.08
   OS=Linux 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26)
   RealMemory=128000 AllocMem=0 FreeMem=215529 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=V100
   BootTime=2020-07-01T21:40:55 SlurmdStartTime=2020-07-01T21:50:13
   CfgTRES=cpu=32,mem=125G,billing=32
   AllocTRES=cpu=32,mem=125G,billing=32
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
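Because scontrol prints space-separated key=value pairs, individual fields are easy to extract in scripts. A small sketch; the `node_field` helper is hypothetical, and the field names are those in the output above:

```shell
# Hypothetical helper: print one key=value field from scontrol's node output.
# Usage: node_field nico1 FreeMem
node_field() {
    scontrol show node "$1" | tr ' ' '\n' | grep "^$2="
}
```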

Query partition information:

(base) xueruini@nico4:~$ scontrol show partition V100
PartitionName=V100
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=nico[1-2]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=64 TotalNodes=2 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

Query job status:

(base) xueruini@nico4:~$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   27      V100     bash xueruini  R    1:29:41      1 nico1
   33      V100      zsh   heheda  R      15:26      1 nico2
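squeue's output can be narrowed and reformatted with standard flags; the format string below is an illustrative choice, not site configuration:

```shell
# Show only your own jobs:
squeue -u "$USER"

# Customize columns: job id, partition, name, user, state, elapsed time,
# node count, and node list (standard squeue %-format codes):
squeue -o "%.8i %.9P %.12j %.10u %.2t %.10M %.6D %R"
```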

Create an allocation-style job with salloc (reserving resources):

Common salloc options:

--help
# Show help information
-A <account>
# Charge the job to the specified accounting account
-D, --chdir=<directory>
# Set the working directory
--get-user-env
# Capture the current environment variables
--gres=<list>
# Request generic resources such as GPUs, e.g. --gres=gpu:2 for two GPUs
-J, --job-name=<jobname>
# Set the job name
--mail-type=<type>
# Send an email notification on the given events; valid types are NONE, BEGIN, END, FAIL, REQUEUE, ALL
--mail-user=<user>
# Email address to notify
-n, --ntasks=<number>
# Number of tasks to allocate resources for (the allocation itself does not launch tasks);
# by default each task gets one core, which --cpus-per-task can change
-c, --cpus-per-task=<ncpus>
# Number of cores per task; defaults to 1
--ntasks-per-node=<ntasks>
# Number of tasks per node; --ntasks takes precedence, in which case this
# becomes the maximum number of tasks per node
-o, --output=<filename pattern>
# File to which the job's output is written
-p, --partition=<partition_names>
# Submit the job to the given partition
-q, --qos=<qos>
# Request the given QOS
-t, --time=<time>
# Set the time limit
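Putting several of these options together, a typical interactive request on this cluster might look as follows; the partition, task count, GPU count, and time limit are illustrative assumptions:

```shell
# Request 1 node on the V100 partition with 8 tasks and 2 GPUs for 2 hours;
# salloc drops you into a shell once the allocation is granted.
salloc -p V100 -N 1 -n 8 --gres=gpu:2 -t 02:00:00 -J interactive-test
```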

Cancel a job:

(base) xueruini@nico4:~$ scancel 28
salloc: Job allocation 28 has been revoked.
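scancel also accepts filters, which is handy when cleaning up several jobs at once (standard scancel flags):

```shell
# Cancel every job belonging to the current user:
scancel -u "$USER"

# Cancel only pending (not yet running) jobs:
scancel -u "$USER" --state=PENDING
```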

Once you have a job running on a node, you can ssh to it:

(base) xueruini@nico4:~$ ssh nico1
Linux nico1 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64
NICO NICO NI ~~~
Welcome to NICO cluster!
Current Nodes: nico[1-4]
Hardware:
nico1: 8xV100 32G, IB
nico2: 8xV100 32G, IB
nico3: 4xV100 32G, 4xP100              (for reproducing results on P100, contact @huangkz before using)
nico4: 1xP100, 1xGTX1080, 1xRADEON VII (for AMD related research, contact @laekov before using)
Spack is one good west east. We use spack to manage packages.
Use the following command to initialize spack:
source /opt/spack/share/spack/setup-env.sh
And use the following command to manage packages (environment-module not needed any more):
spack load openmpi@3.1.2%intel@19        # for example
spack find --loaded                      # list all loaded packages
spack unload openmpi                     # unload currently loaded package
If you have any questions about spack, please do not hesitate to ask YJP.
If the cluster is down, blame Harry Chen.
Last login: Thu Jul  2 12:03:44 2020 from 172.23.18.4
-bash: pyenv: command not found
(base) xueruini@nico1:~$

Without an active job on a node, you cannot ssh to it:

(base) xueruini@nico1:~$ ssh nico2
Access denied: user xueruini (uid=17987) has no active jobs on this node.
Connection closed by 172.23.18.2 port 22
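If ssh is denied but you do hold an allocation, srun can attach an interactive shell inside the job instead, which sidesteps the per-node ssh restriction (the job id here is the one from the squeue output above):

```shell
# Start an interactive shell inside an existing allocation (job 27):
srun --jobid=27 --pty bash
```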
