scontrol

Language: en

Version: 314494 (ubuntu - 07/07/09)

Section: 1 (User commands)

NAME

scontrol - Used to view and modify Slurm configuration and state.

SYNOPSIS

scontrol [OPTIONS...] [COMMAND...]

DESCRIPTION

scontrol is used to view or modify Slurm configuration including: job, job step, node, partition, and overall system configuration. Most of the commands can only be executed by user root. If an attempt to view or modify configuration information is made by an unauthorized user, an error message will be printed and the requested action will not occur. If no command is entered on the execute line, scontrol will operate in an interactive mode and prompt for input. It will continue prompting for input and executing commands until explicitly terminated. If a command is entered on the execute line, scontrol will execute that command and terminate. All commands and options are case-insensitive, although node names and partition names are case-sensitive (node names "LX" and "lx" are distinct). Commands can be abbreviated to the extent that the specification is unique.

OPTIONS

-a, --all
When the show command is used, display all partitions, their jobs and job steps. This causes information to be displayed about partitions that are configured as hidden and partitions that are unavailable to the user's group.
-h, --help
Print a help message describing the usage of scontrol.
--hide
Do not display information about hidden partitions, their jobs and job steps. This is the default behavior: neither partitions that are configured as hidden nor partitions unavailable to the user's group are displayed.
-o, --oneliner
Print information one line per record.
-q, --quiet
Print no warning or informational messages, only fatal error messages.
-v, --verbose
Print detailed event logging. Multiple -v's will further increase the verbosity of logging. By default only errors will be displayed.
-V, --version
Print version information and exit.
COMMANDS
all
Show all partitions, their jobs and job steps. This causes information to be displayed about partitions that are configured as hidden and partitions that are unavailable to the user's group.
abort
Instruct the Slurm controller to terminate immediately and generate a core file.
checkpoint CKPT_OP ID
Perform a checkpoint activity on the job step(s) with the specified identification. ID can be used to identify a specific job (e.g. "<job_id>", which applies to all of its existing steps) or a specific job step (e.g. "<job_id>.<step_id>"). Acceptable values for CKPT_OP include:
disable (disable future checkpoints)
enable (enable future checkpoints)
able (test if presently not disabled, report start time if checkpoint in progress)
create (create a checkpoint and continue the job step)
vacate (create a checkpoint and terminate the job step)
error (report the result for the last checkpoint request, error code and message)
restart (restart execution of the previously checkpointed job steps)
completing
Display all jobs in a COMPLETING state along with associated nodes in either a COMPLETING or DOWN state.
delete SPECIFICATION
Delete the entry with the specified SPECIFICATION. The only supported SPECIFICATION presently is of the form PartitionName=<name>.
exit
Terminate the execution of scontrol. This is an independent command with no options meant for use in interactive mode.
help
Display a description of scontrol options and commands.
hide
Do not display partition, job or job step information for partitions that are configured as hidden or partitions that are unavailable to the user's group. This is the default behavior.
notify job_id message
Send a message to standard error of the srun command associated with the specified job_id.
oneliner
Print information one line per record.
pidinfo proc_id
Print the Slurm job id and scheduled termination time corresponding to the supplied process id, proc_id, on the current node. This will work only with processes on the node on which scontrol is run, and only for those processes spawned by SLURM and their descendants.
listpids [job_id[.step_id]] [NodeName]
Print a listing of the process IDs in a job step (if job_id.step_id is provided), or all of the job steps in a job (if job_id is provided), or all of the job steps in all of the jobs on the local node (if job_id is not provided or job_id is "*"). This will work only with processes on the node on which scontrol is run, and only for those processes spawned by SLURM and their descendants. Note that some SLURM configurations (ProctrackType value of pgid or aix) are unable to identify all processes associated with a job or job step.

Note that the NodeName option is only really useful when you have multiple slurmd daemons running on the same host machine. Multiple slurmd daemons on one host are, in general, only used by SLURM developers.

ping
Ping the primary and secondary slurmctld daemon and report if they are responding.
quiet
Print no warning or informational messages, only fatal error messages.
quit
Terminate the execution of scontrol.
reconfigure
Instruct all Slurm daemons to re-read the configuration file. This command does not restart the daemons. This mechanism would be used to modify configuration parameters (Epilog, Prolog, SlurmctldLogFile, SlurmdLogFile, etc.), register the physical addition or removal of nodes from the cluster, or recognize the change of a node's configuration, such as the addition of memory or processors. The Slurm controller (slurmctld) forwards the request to all other daemons (slurmd daemon on each compute node). Running jobs continue execution. Most configuration parameters can be changed by just running this command; however, SLURM daemons should be shut down and restarted if any of these parameters are to be changed: AuthType, BackupAddr, BackupController, ControlAddr, ControlMachine, PluginDir, StateSaveLocation, SlurmctldPort or SlurmdPort.
resume job_id
Resume a previously suspended job.
requeue job_id
Requeue a running or pending SLURM batch job.
setdebug LEVEL
Change the debug level of the slurmctld daemon. LEVEL may be an integer value between zero and nine (using the same values as SlurmctldDebug in the slurm.conf file) or the name of the most detailed message type to be printed: "quiet", "fatal", "error", "info", "verbose", "debug", "debug2", "debug3", "debug4", or "debug5". This value is temporary and will be overwritten whenever the slurmctld daemon reads the slurm.conf configuration file (e.g. when the daemon is restarted or "scontrol reconfigure" is executed).
show ENTITY ID
Display the state of the specified entity with the specified identification. ENTITY may be config, daemons, job, node, partition, slurmd, step, hostlist or hostnames (also block or subbp on BlueGene systems). ID can be used to identify a specific element of the identified entity: the configuration parameter name, job ID, node name, partition name, or job step ID for config, job, node, partition, or step respectively. hostnames takes an optional hostlist expression as input and writes a list of individual host names to standard output (one per line). If no hostlist expression is supplied, the value of the SLURM_NODELIST environment variable is used. For example "tux[1-3]" is mapped to "tux1", "tux2" and "tux3" (one hostname per line). hostlist takes a list of host names and prints the hostlist expression for them (the inverse of hostnames). hostlist can also take the absolute pathname of a file (beginning with the character '/') containing a list of hostnames. Multiple node names may be specified using simple node range expressions (e.g. "lx[10-20]"). All other ID values must identify a single element. The job step ID is of the form "job_id.step_id" (e.g. "1234.1"). slurmd reports the current status of the slurmd daemon executing on the same node from which the scontrol command is executed (the local host). It can be useful to diagnose problems. By default, all elements of the entity type specified are printed.
shutdown OPTION
Instruct Slurm daemons to save current state and terminate. By default, the Slurm controller (slurmctld) forwards the request to all other daemons (slurmd daemon on each compute node). An OPTION of slurmctld or controller results in only the slurmctld daemon being shut down and the slurmd daemons remaining active.
suspend job_id
Suspend a running job. Use the resume command to resume its execution. User processes must stop on receipt of SIGSTOP signal and resume upon receipt of SIGCONT for this operation to be effective. Not all architectures and configurations support job suspension.
update SPECIFICATION
Update job, node or partition configuration per the supplied specification. SPECIFICATION is in the same format as the Slurm configuration file and the output of the show command described above. It may be desirable to execute the show command (described above) on the specific entity you wish to update, then use cut-and-paste tools to enter updated configuration values to the update. Note that while most configuration values can be changed using this command, not all can be changed using this mechanism. In particular, the hardware configuration of a node or the physical addition or removal of nodes from the cluster may only be accomplished through editing the Slurm configuration file and executing the reconfigure command (described above).
verbose
Print detailed event logging. This includes time-stamps on data structures, record counts, etc.
version
Display the version number of scontrol being executed.
!!
Repeat the last command executed.
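
The mapping performed by "show hostnames" (for example "tux[1-3]" to "tux1", "tux2" and "tux3") can be illustrated outside of Slurm. The following is a minimal Python sketch, not Slurm's own implementation: the function names expand_hostlist and split_top_level are hypothetical, only a single bracket group per name is handled, and preserving zero padding (as in "lx[0031-0040]") is an assumption based on the example output later in this page.

```python
import re

# One optional prefix, one bracket group, one optional suffix (assumed grammar).
_BRACKET = re.compile(r"^([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)$")


def split_top_level(expr):
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, ""
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def expand_hostlist(expr):
    """Expand a simple node range expression into individual host names."""
    hosts = []
    for part in split_top_level(expr):
        match = _BRACKET.match(part)
        if not match:
            hosts.append(part)  # plain host name, nothing to expand
            continue
        prefix, body, suffix = match.groups()
        for item in body.split(","):
            low, _, high = item.partition("-")
            high = high or low  # a single number such as "7"
            width = len(low)    # assumption: keep the zero padding of the lower bound
            for number in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    for host in expand_hostlist("tux[1-3]"):
        print(host)
```

On a system with Slurm installed, "scontrol show hostnames" remains the authoritative way to perform this expansion; the sketch is only meant to make the notation concrete.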
SPECIFICATIONS FOR SHOW AND UPDATE COMMANDS, JOBS
Account=<account>
Account name to be changed for this job's resource use. Value may be cleared with blank data value, "Account=".
Contiguous=<yes|no>
Set the job's requirement for contiguous (consecutive) nodes to be allocated. Possible values are "YES" and "NO".
Dependency=<job_id>
Defer job's initiation until specified job_id completes. Cancel dependency with job_id value of "0", "Dependency=0".
ExcNodeList=<nodes>
Set the job's list of excluded nodes. Multiple node names may be specified using simple node range expressions (e.g. "lx[10-20]"). Value may be cleared with blank data value, "ExcNodeList=".
ExitCode=<exit>:<sig>
Exit status reported for the job by the wait() function. The first number is the exit code, typically as set by the exit() function. The second number is the signal that caused the process to terminate, if it was terminated by a signal.
Features=<features>
Set the job's required node features to the specified value. Multiple values may be comma separated if all features are required (AND operation) or separated by "|" if any of the specified features are required (OR operation). Value may be cleared with blank data value, "Features=".
JobId=<id>
Identify the job to be updated. This specification is required.
MinCores=<count>
Set the job's minimum number of cores per socket to the specified value.
MinMemory=<megabytes>
Set the job's minimum real memory required per node to the specified value.
MinProcs=<count>
Set the job's minimum number of processors per node to the specified value.
MinSockets=<count>
Set the job's minimum number of sockets per node to the specified value.
MinThreads=<count>
Set the job's minimum number of threads per core to the specified value.
MinTmpDisk=<megabytes>
Set the job's minimum temporary disk space required per node to the specified value.
Name=<name>
Set the job's name to the specified value.
Partition=<name>
Set the job's partition to the specified value.
Priority=<number>
Set the job's priority to the specified value. Note that a job priority of zero prevents the job from ever being scheduled. By setting a job's priority to zero it is held. Set the priority to a non-zero value to permit it to run.
Nice[=delta]
Adjust job's priority by the specified value. Default value is 100.
ReqProcs=<count>
Set the job's count of required processes to the specified value.
ReqNodeList=<nodes>
Set the job's list of required nodes. Multiple node names may be specified using simple node range expressions (e.g. "lx[10-20]"). Value may be cleared with blank data value, "ReqNodeList=".
ReqNodes=<min_count>[-<max_count>]
Set the job's minimum and optionally maximum count of nodes to be allocated.
ReqSockets=<count>
Set the job's count of required sockets to the specified value.
ReqCores=<count>
Set the job's count of required cores to the specified value.
ReqThreads=<count>
Set the job's count of required threads to the specified value.
Shared=<yes|no>
Set the job's ability to share nodes with other jobs. Possible values are "YES" and "NO".
StartTime=<time_spec>
Set the job's earliest initiation time. It accepts times of the form HH:MM:SS to run a job at a specific time of day (seconds are optional). (If that time is already past, the next day is assumed.) You may also specify midnight, noon, or teatime (4pm) and you can have a time-of-day suffixed with AM or PM for running in the morning or the evening. You can also say what day the job will be run, by specifying a date of the form MMDDYY or MM/DD/YY or MM.DD.YY. You can also give times like now + count time-units, where the time-units can be minutes, hours, days, or weeks and you can tell SLURM to run the job today with the keyword today and to run the job tomorrow with the keyword tomorrow.
TimeLimit=<time>
The job's time limit. Output format is [days-]hours:minutes:seconds or "UNLIMITED". Input format (for update command) is minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes or days-hours:minutes:seconds. Time resolution is one minute and second values are rounded up to the next minute.
Connection=<type>
Reset the node connection type. Possible values on Blue Gene are "MESH", "TORUS" and "NAV" (mesh else torus).
Geometry=<geo>
Reset the required job geometry. On Blue Gene the value should be three digits separated by "x" or ",". The digits represent the allocation size in X, Y and Z dimensions (e.g. "2x3x4").
Rotate=<yes|no>
Permit the job's geometry to be rotated. Possible values are "YES" and "NO".
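
The input formats accepted for TimeLimit above (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes and days-hours:minutes:seconds) and the rounding of seconds up to the next minute can be sketched as follows. This is an illustrative Python sketch, not Slurm's parser; the function name parse_time_limit is hypothetical, and the "UNLIMITED" keyword is deliberately not handled.

```python
def parse_time_limit(text):
    """Return a time limit in whole minutes, rounding leftover seconds up."""
    if "-" in text:
        # days-hours[:minutes[:seconds]]
        day_part, rest = text.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError(f"too many fields: {text!r}")
        fields += [0] * (3 - len(fields))  # missing minutes/seconds default to 0
        hours, minutes, seconds = fields
    else:
        days = 0
        fields = [int(f) for f in text.split(":")]
        if len(fields) == 1:      # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:    # minutes:seconds
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:    # hours:minutes:seconds
            hours, minutes, seconds = fields
        else:
            raise ValueError(f"too many fields: {text!r}")
    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    # Resolution is one minute; any remaining seconds round up.
    return (total_seconds + 59) // 60


if __name__ == "__main__":
    for spec in ("30", "30:00", "0:20:00", "1-0", "1-12:30", "1:30"):
        print(spec, "->", parse_time_limit(spec), "minutes")
```

Note that a two-field value is read as minutes:seconds, not hours:minutes, so "TimeLimit=30:00" requests thirty minutes.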
SPECIFICATIONS FOR UPDATE COMMAND, NODES
NodeName=<name>
Identify the node(s) to be updated. Multiple node names may be specified using simple node range expressions (e.g. "lx[10-20]"). This specification is required.
Features=<features>
Identify features to be associated with the specified nodes. Any previously identified features will be overwritten with the new value. NOTE: The Features associated with nodes will be reset to the values specified in slurm.conf (if any) upon slurmctld restart or reconfiguration. Update slurm.conf with any changes meant to be persistent.
Reason=<reason>
Identify the reason the node is in a "DOWN", "DRAINED", "DRAINING", "FAILING" or "FAIL" state. Use quotes to enclose a reason having more than one word.
State=<state>
Identify the state to be assigned to the node. Possible values are "NoResp", "ALLOC", "ALLOCATED", "DOWN", "DRAIN", "FAIL", "FAILING", "IDLE" or "RESUME". "RESUME" is not an actual node state, but will return a DRAINED, DRAINING, or DOWN node to service, in either IDLE or ALLOCATED state as appropriate. Setting a node "DOWN" will cause all running and suspended jobs on that node to be terminated. If you want to remove a node from service, you typically want to set its state to "DRAIN". "FAILING" is similar to "DRAIN" except that some applications will seek to relinquish those nodes before the job completes. The "NoResp" state will only set the "NoResp" flag for a node without changing its underlying state. While all of the above states are valid, some of them are not valid new node states given their prior state. Generally only "DRAIN", "FAIL" and "RESUME" should be used.
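
When update specifications are generated by a script, the quoting rule for Reason above (quotes around a reason of more than one word) is easy to overlook. The following Python sketch assembles a specification string from keyword arguments; the helper name build_update_spec is hypothetical, values that themselves contain double quotes are not handled, and how the resulting text is passed to scontrol (interactive prompt or shell command line, which adds its own quoting layer) is left to the caller.

```python
def build_update_spec(**fields):
    """Join Key=Value pairs, quoting any value that contains whitespace."""
    parts = []
    for key, value in fields.items():
        value = str(value)
        if any(ch.isspace() for ch in value):
            value = '"' + value + '"'  # multi-word values need quotes
        parts.append(f"{key}={value}")
    return " ".join(parts)


if __name__ == "__main__":
    # Drain a range of nodes with a multi-word reason.
    print("update " + build_update_spec(
        NodeName="lx[10-20]", State="DRAIN", Reason="bad disk"))
```

The printed line has the same shape as the update commands shown in the EXAMPLES section.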
SPECIFICATIONS FOR UPDATE AND DELETE COMMANDS, PARTITIONS
AllowGroups=<name>
Identify the user groups which may use this partition. Multiple groups may be specified in a comma separated list. To permit all groups to use the partition specify "AllowGroups=ALL".
Default=<yes|no>
Specify if this partition is to be used by jobs which do not explicitly identify a partition to use. Possible values are "YES" and "NO".
Hidden=<yes|no>
Specify if the partition and its jobs should be hidden from view. Hidden partitions will by default not be reported by SLURM APIs or commands. Possible values are "YES" and "NO".
MaxNodes=<count>
Set the maximum number of nodes which will be allocated to any single job in the partition. Specify a number, "INFINITE" or "UNLIMITED". (On a Bluegene type system this represents a c-node count.)
MaxTime=<time>
The maximum run time for jobs. Output format is [days-]hours:minutes:seconds or "UNLIMITED". Input format (for update command) is minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes or days-hours:minutes:seconds. Time resolution is one minute and second values are rounded up to the next minute.
MinNodes=<count>
Set the minimum number of nodes which will be allocated to any single job in the partition. (On a Bluegene type system this represents a c-node count.)
Nodes=<name>
Identify the node(s) to be associated with this partition. Multiple node names may be specified using simple node range expressions (e.g. "lx[10-20]"). Note that jobs may only be associated with one partition at any time. Specify a blank data value to remove all nodes from a partition: "Nodes=".
PartitionName=<name>
Identify the partition to be updated. This specification is required.
RootOnly=<yes|no>
Specify if only allocation requests initiated by user root will be satisfied. This can be used to restrict control of the partition to some meta-scheduler. Possible values are "YES" and "NO".
Shared=<yes|no|exclusive|force>[:<job_count>]
Specify if nodes in this partition can be shared by multiple jobs. Possible values are "YES", "NO", "EXCLUSIVE" and "FORCE". An optional job count specifies how many jobs can be allocated to use each resource.
State=<up|down>
Specify if jobs can be allocated nodes in this partition. Possible values are "UP" and "DOWN". If a partition has allocated nodes to running jobs, those jobs will continue execution even after the partition's state is set to "DOWN". The jobs must be explicitly canceled to force their termination.
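
The Shared specification above combines a mode with an optional job count, separated by a colon. As an illustration of that syntax only, the Python sketch below splits such a value into its two parts; the function name parse_shared is hypothetical, and returning None when no count is given is a choice of this sketch, since the default count is not stated here.

```python
VALID_SHARED_MODES = {"YES", "NO", "EXCLUSIVE", "FORCE"}


def parse_shared(value):
    """Split a Shared value such as "FORCE:4" into (mode, job_count)."""
    mode, separator, count = value.partition(":")
    mode = mode.upper()  # option values are treated as case-insensitive here
    if mode not in VALID_SHARED_MODES:
        raise ValueError(f"unknown Shared mode: {value!r}")
    # job_count is optional; None means it was not specified.
    return mode, (int(count) if separator else None)


if __name__ == "__main__":
    print(parse_shared("force:4"))
    print(parse_shared("NO"))
```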
SPECIFICATIONS FOR UPDATE, BLOCK
Bluegene systems only!
BlockName=<name>
Identify the bluegene block to be updated. This specification is required.
State=<free|error>
This will update the state of a bluegene block to either FREE or ERROR (e.g. update BlockName=RMP0 STATE=ERROR). The ERROR state will not allow jobs to run on the block. WARNING: This will cancel any running job on the block!
SubBPName=<name>
Identify the bluegene ionodes to be updated (e.g. bg000[0-3]). This specification is required.

ENVIRONMENT VARIABLES

Some scontrol options may be set via environment variables. These environment variables, along with their corresponding options, are listed below. (Note: Command line options will always override these settings.)

SCONTROL_ALL
-a, --all
SLURM_CONF
The location of the SLURM configuration file.

EXAMPLES


# scontrol
scontrol: show part class
PartitionName=class TotalNodes=10 TotalCPUs=20 RootOnly=NO
   Default=NO Shared=NO State=UP MaxTime=0:30:00 Hidden=NO
   MinNodes=1 MaxNodes=2 AllowGroups=students
   Nodes=lx[0031-0040] NodeIndices=31,40,-1
scontrol: update PartitionName=class MaxTime=60:00 MaxNodes=4
scontrol: show job 65539
JobId=65539 UserId=1500 JobState=PENDING TimeLimit=0:20:00
   Priority=100 Partition=batch Name=job01 NodeList=(null)
   StartTime=0 EndTime=0 Shared=0 ReqProcs=1000
   ReqNodes=400 Contiguous=1 MinProcs=4 MinMemory=1024
   MinTmpDisk=2034 ReqNodeList=lx[3000-3003]
   Features=(null) JobScript=/bin/hostname
scontrol: update JobId=65539 TimeLimit=30:00 Priority=500
scontrol: show hostnames tux[1-3]
tux1
tux2
tux3
scontrol: quit
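
Because the show command prints records as whitespace-separated Key=Value pairs, its output is straightforward to post-process. The following Python sketch turns one record, such as the job record in the session above, into a dictionary. The function name parse_record is hypothetical, and the sketch assumes that no value contains whitespace, which does not hold for fields such as a multi-word Reason.

```python
def parse_record(text):
    """Parse whitespace-separated Key=Value tokens into a dictionary."""
    record = {}
    for token in text.split():
        key, separator, value = token.partition("=")
        if separator:  # ignore tokens that are not Key=Value pairs
            record[key] = value
    return record


if __name__ == "__main__":
    sample = """JobId=65539 UserId=1500 JobState=PENDING TimeLimit=0:20:00
       Priority=100 Partition=batch Name=job01 NodeList=(null)"""
    job = parse_record(sample)
    print(job["JobId"], job["JobState"], job["TimeLimit"])
```

Combining this with the -o (--oneliner) option keeps each record on a single line, which makes it easier to separate records before parsing them.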

COPYING

Copyright (C) 2002-2007 The Regents of the University of California. Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER). LLNL-CODE-402394.

This file is part of SLURM, a resource management program. For details, see <https://computing.llnl.gov/linux/slurm/>.

SLURM is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

SLURM is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

FILES

/etc/slurm.conf

SEE ALSO

scancel(1), sinfo(1), squeue(1), slurm_checkpoint(3), slurm_delete_partition(3), slurm_load_ctl_conf(3), slurm_load_jobs(3), slurm_load_node(3), slurm_load_partitions(3), slurm_reconfigure(3), slurm_requeue(3), slurm_resume(3), slurm_shutdown(3), slurm_suspend(3), slurm_update_job(3), slurm_update_node(3), slurm_update_partition(3), slurm.conf(5)