The Job Scheduler#
PBS (Portable Batch System) is the software that performs the job scheduling on our cluster.
The HPC cluster is essentially a large number of servers interconnected with each other.
The programs you run on it are called "jobs".
PBS distributes the jobs submitted by users among the available computational resources.
It is called a distributed workload management system because it manages and monitors the workload across this set of servers.
Disclaimer#
Please note that the login (master) nodes are NOT to be used for long jobs.
Any such processes running on the login node will be killed by the admins. To test, debug, or interactively run a program, please request an interactive job.
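A minimal sketch of requesting an interactive session, assuming the standard PBS `-I` (interactive) flag is enabled on this cluster. The project `cc` and the `cpu` queue come from the examples on this page; the chunk size and the one-hour walltime are placeholder assumptions, so adjust them to your needs.

```shell
#!/bin/sh
# Sketch: request an interactive shell on a compute node instead of
# working on the login node. All values below are placeholders.
PROJECT="cc"                     # project name, as in the examples on this page
QUEUE="cpu"                      # "cpu" or "gpu"
RESOURCES="select=1:ncpus=4"     # one chunk with 4 cores (assumed size)
WALLTIME="walltime=01:00:00"     # assumed one-hour limit

INTERACTIVE_CMD="qsub -I -P $PROJECT -q $QUEUE -l $RESOURCES -l $WALLTIME"

if command -v qsub >/dev/null 2>&1; then
    # On the cluster: opens a shell on a compute node once the job starts.
    $INTERACTIVE_CMD
else
    # Elsewhere: only show the command that would be run.
    echo "$INTERACTIVE_CMD"
fi
```

Exiting the interactive shell ends the job and releases the resources.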
Submitting Jobs#
qsub submits an executable script to a batch server. The script is a shell script that is executed by a command shell such as sh or csh. For example:
qsub -P cc myjobscript
This command returns the job identifier in the form:
sequence-number.servername.job_ID
QSub Manual#
qsub -q queue jobscript   # Send the job to a queue. Options: "cpu" or "gpu"
-N jobname                # Set a name for the job (example: "SynthPopv3_21Aug")
-l resource=value         # Request resources. Options: select, ncpus, ngpus, nmics, walltime, place, etc.
-l select=n               # (RESOURCES) Request n slots
-l ncpus=n                # (RESOURCES) Request n CPU cores on a node
-l ngpus=n                # (RESOURCES) Request n GPUs on a node
-l nmics=n                # (RESOURCES) Request n MICs on a node
-l place=scatter          # (RESOURCES) Scatter the processes across nodes
-l place=pack             # (RESOURCES) Place the processes on nodes in a compact way
-l host=nodename          # (RESOURCES) Run the job on a particular node
-P projectname            # Specify the name of the project, e.g. "BharatSim"
-o output                 # File where stdout is saved, e.g. "out.log"
-e error                  # File where the error log is saved, e.g. "err.log"
-M email                  # Email ID to notify (not working right now)
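The same options can also be written inside the job script as `#PBS` directive lines, so that a plain `qsub myjobscript` picks them up. The sketch below assumes the `cpu` queue and project `cc` from this page; the job name, resource sizes, walltime, and log file names are illustrative placeholders, and the final `echo` stands in for your actual program.

```shell
#!/bin/sh
#PBS -N test
#PBS -P cc
#PBS -q cpu
#PBS -l select=1:ncpus=4
#PBS -l walltime=00:10:00
#PBS -o out.log
#PBS -e err.log

# PBS sets PBS_O_WORKDIR to the directory the job was submitted from.
# Fall back to the current directory so the script can also be tried
# outside PBS.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# PBS_JOBID is set by PBS when the job runs; the fallback label is
# only for runs outside the scheduler.
JOB_LABEL="${PBS_JOBID:-not-under-pbs}"
echo "Job $JOB_LABEL running on $(uname -n)"
```

Options given on the `qsub` command line take precedence over the matching directives in the script.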
Samples#
qsub -N test -lselect=20:ncpus=24 -P cc -o out -e err myjobscript
# Requests 20 nodes with 24 CPUs each under project "cc", with
# output file "out" and error file "err"
qsub -N test -lselect=2:ncpus=24:ngpus=2 -P cc -o out -e err myjobscript
# Requests 2 nodes with 24 CPUs and 2 GPUs each
qsub -N test -lselect=4:ncpus=24:nmics=2 -P cc -o out -e err myjobscript
# Requests 4 nodes with 24 CPUs and 2 MIC cards each
qsub -N test -lselect=4:ncpus=12 -P cc -l place=scatter myjobscript
# Requests 4 nodes with 12 CPUs each and scatters the job
# equally across the 4 nodes.
qsub -N test -lselect=1:ncpus=24:host=compute1 -P cc myjobscript
# Requests 1 node with 24 CPUs and runs only on node compute1
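To gauge how large a request is, multiply the number of slots by the cores per slot. The helper below, `total_cores`, is a hypothetical function written for illustration and is not part of PBS. It handles only the simple `select=N:ncpus=M` form used in the samples above, and it assumes a value of 1 for any field that is left out.

```shell
#!/bin/sh
# total_cores: hypothetical helper (not a PBS command) that prints
# slots * ncpus for a spec such as "select=20:ncpus=24".
total_cores() {
    spec=$1
    chunks=1    # assumed default when select= is missing
    ncpus=1     # assumed default when ncpus= is missing
    # Split the spec on ":" and pick out the two fields we care about.
    old_ifs=$IFS
    IFS=:
    for field in $spec; do
        case $field in
            select=*) chunks=${field#select=} ;;
            ncpus=*)  ncpus=${field#ncpus=} ;;
        esac
    done
    IFS=$old_ifs
    echo $((chunks * ncpus))
}

total_cores "select=20:ncpus=24"          # prints 480
total_cores "select=2:ncpus=24:ngpus=2"   # prints 48
total_cores "select=4:ncpus=12"           # prints 48
```

Checking the total before submitting helps avoid asking for more cores than a project is allowed to use.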
Commands#
$ qsub <script_name>   # Submit the job to the scheduler
$ qstat                # Check the status of jobs
$ qstat -n             # Check where the job is running
$ qstat -f <job_id>    # Show full, detailed information about the job
$ qdel <job_id>        # Delete the job from the queue (may take 5-10 seconds)
$ qstat -Q             # Check the queue information
$ qstat -a             # List all jobs and their state
$ qstat -r             # List all running jobs
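These commands are typically used together: submit a job, keep the job ID that `qsub` prints, then query or delete the job by that ID. The following is a minimal sketch of that workflow, assuming project `cc` and the script name `myjobscript` from the earlier examples. The `STATUS` variable is only a convenience for this sketch, and the guard lets it run harmlessly on a machine without PBS.

```shell
#!/bin/sh
# Sketch: submit a job, inspect it, and optionally delete it.
SCRIPT="myjobscript"   # job script name used in the examples on this page

if command -v qsub >/dev/null 2>&1; then
    JOB_ID=$(qsub -P cc "$SCRIPT") || exit 1   # qsub prints the job ID
    echo "Submitted $JOB_ID"
    qstat -f "$JOB_ID"      # full information about the job
    # qdel "$JOB_ID"        # uncomment to remove the job from the queue
    STATUS="submitted"
else
    echo "qsub not found: run this on the cluster login node" >&2
    STATUS="skipped"
fi
```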