Legal notice   Contact   Internals   Search 
System
Status
News /
Events
Support /
Documentation
Accounts /
Projects
Organisation Public
Relations
 

The HLRN Quickstart Guide

Chapter 8. LoadLeveler, the Batch Queuing System of IBM

Bernd Kallies(1) and Wilhelm Vortisch(2)
Revision History:
Revision 2.4, Published 2006/10/27 10:45:02 (UTC) by Bernd Kallies

8.1. Introduction

The HLRN operates LoadLeveler to run serial and parallel programs in batch. Running programs in batch mode is the preferred way to execute production work. To the opposite, interactive program execution should be done for testing purposes only.

Running a program in batch mode requires the user to write a LoadLeveler command file that defines the resources of the job (number of nodes, number of CPU's on each node, required memory, required network) and the actions the job has to do. The command file is submitted into a LoadLeveler class (queue). The batch queuing system schedules the job for execution depending on the load of the system.

In order to use the system resources efficiently, knowledge about the batch system configuration is important.

8.2. Overview of User Commands

The most important LoadLeveler commands are:

llstatus

Lists the status of all machines of the p690 cluster.

Use llstatus -R to get a listing of the actual resource usage (number of CPU's, memory) of each node in the cluster.

llclass

Lists the available job classes.

Use llclass -l [classname] to get a listing of configured limits of the specified class.

llsubmit

Submits a job command file for execution.

llsubmit - reads a command file from stdin. When invoked interactively, press Ctrl-D to quit input and to submit the generated script.

llq

Check the status of jobs.

The command can be customized. An informative output can be obtained e.g. with llq -f %jn %o %dq %st %h.

llcancel

Cancel a job.

xloadl

A graphical user interface (under X11) that provides access to all commands listed above. The GUI can also be used to build job command files that can be saved or submitted.

llmap

Displays the status of all machines in the HLRN cluster in an easily visible form.

This command is provided by a member of the HLRN staff. Look at http://www.hlrn.de/status/long.html#ll-messages for an impression of the output. Visit the llmap online manpage for usage and interpretation hints.

llacct

Displays various accounting information of a running job step.

This command is provided by a member of the HLRN staff. Visit the llacct online manpage for usage and interpretation hints.

bstat

Lists various information on job steps managed by LoadLeveler and on the configuration of the batch system.

This command is under development by a member of the HLRN staff. Use bstat -h for a short description. Main features are

  • listing of all job steps currently managed by LoadLeveler including a large number of various parameters
  • listing of all nodes and the corresponding LoadLeveler classes
  • listing of details concerning a specified job step.

8.3. Examples of Job Command Files

All LoadLeveler jobs must be submitted as a command file. This file specifies information like job type, executable name, job class, resource requirements or input/output files. The following sections comment the command file basics for different job types. Working examples including simple program sources for illustration purposes you find in Chapter 10, Examples.

8.3.1. Running a serial program

Example 8.1 shows a simple command file, that can serve as template for running a serial executable in batch:

Example 8.1. A simple LoadLeveler command file.

#!/bin/sh
#@ job_type         = serial
#@ output           = loadl_ex1.llout
#@ error            = $(output)
#@ class            = cdev
#@ resources        = ConsumableCpus(1) ConsumableMemory(16 MB)
#@ wall_clock_limit = 0:10
#@ notification     = always
#@ queue

env

This command file resembles a shell script, which will be executed at runtime. It will print the runtime environment of the job via execution of the env command. Therefore it will use 1 processor on a machine that will be selected by LoadLeveler. The wall clock limit (recommended keyword) for the job is 10 seconds. The STDOUT of the job will be redirected into a file loadl_ex1.llout. STDERR will be redirected into the same file. The file will appear on the machine in the directory where the job was submitted. The #@ resources keyword is required because resource usage and resource submission is enforced. The counters specify resources consumed by each task of a job step (in case of the above example the job consists of one job step). LoadLeveler will send a mail to the submitting user on the submit machine when the job starts and ends execution (#@ notification).

8.3.2. Running an MPI program

The job command file shown in Example 8.2 starts an MPI program by using POE (IBM's Parallel Operating Environment, see Chapter 7, Running Parallel Programs). It can be used as template for getting started with MPI programs in batch.

Example 8.2. A LoadLeveler command file for an MPI program.

#!/bin/sh
#@ job_type         = parallel
#@ output           = loadl_ex3.llout
#@ error            = loadl_ex3.llerr
#@ node_usage       = not_shared
#@ resources        = ConsumableCpus(1) ConsumableMemory(1760 mb)
#@ wall_clock_limit = 1:10:0,1:0:0
#@ node             = 2
#@ tasks_per_node   = 32
#@ network.mpi      = sn_all,,us
#@ environment      = MEMORY_AFFINITY=MCM; \
#                     MP_SHARED_MEMORY=yes; MP_WAIT_MODE=poll; \
#                     MP_SINGLE_THREAD=yes; MP_TASK_AFFINITY=MCM
#@ queue

./a.out

The batch job will run the executable a.out, which is assumed to reside in the directory from where the job is submitted. The executable is assumed to became compiled with the compiler command mpxlf (Fortran) or mpcc (C). If so, then a.out is an MPI program. When it starts execution, then poe is started at first, which then starts the true executable.

The number of MPI tasks started by poe are defined by the LoadLeveler keywords #@ node and #@ tasks_per_node (total 64 tasks in this example, spread over two nodes). Each MPI task uses 1 CPU (ConsumableCpus(1)) and 1760 MByte memory (ConsumableMemory(1760 mb)). Alltogether, the application states to consume up to 32 CPUs and up to 55 GByte memory on each of the 2 allocated nodes. See Notes on ConsumableCpus and ConsumableMemory in LoadLeveler - The IBM SP Batch Queuing System to learn more about appropriate values for the #@ resources keyword.

The nodes are given not_shared to the job (#@ node_usage = not_shared), which means that the two nodes are dedicated to the job.

MPI communication between tasks running on different nodes uses the HPS switch network (#@ network.mpi). MPI communication between tasks running on the same node uses shared memory (#@ environment = MP_SHARED_MEMORY=yes). The remaining environment variables MP_xxx are appropriate to get performance for the majority of MPI applications.

The started executable becomes cancelled after 1 hour, if it has not finished up to then (#@ wall_clock_limit, second number = soft limit). The job is killed after 1 hour 10 minutes, when any process belonging to the job is still running (#@ wall_clock_limit, first number = hard limit).

Note

Simple MPI programs always state ConsumableCpus(1) (in contrast to hybrid programs).

Be aware that the keyword ConsumableMemory declares the memory per MPI task. If the number of MPI tasks (#@ tasks_per_node) multiplied by ConsumableMemory exceeds the physical amount of memory available on a node, then the job will wait for execution forever (or until one llcancels it).

8.3.3. Running an SMP program

The job command file shown in Example 8.3 is intended to be used for a SMP program that starts a number of parallel threads.

Example 8.3. A LoadLeveler command file for a SMP program.

#!/bin/sh
#@ job_type         = serial
#@ output           = loadl_ex4.llout
#@ error            = loadl_ex4.llerr
#@ node_usage       = shared
#@ resources        = ConsumableCpus(8) ConsumableMemory(10 GB)
#@ wall_clock_limit = 1:10:0,1:0:0
#@ environment      = OMP_NUM_THREADS=8; AIXTHREAD_SCOPE=S
#@ queue

./a.out

The batch job will run the executable a.out, which is assumed to reside in the directory from where the job is submitted. The executable is assumed to be an SMP program.

The command file sets the environment variable OMP_NUM_THREADS, which determines the number of parallel threads for common SMP programs. Consult the manual of the SMP application you use if it relies on other ways to determine the number of threads. In any event, the number of threads has to match the declaration of ConsumableCpus.

The statement ConsumableMemory declares the amount of memory shared by the parallel threads. It is the total amount of memory the application uses. Alltogether, the application states to use up to 8 CPUs and up to 10 GByte memory on one node. See Notes on ConsumableCpus and ConsumableMemory in LoadLeveler - The IBM SP Batch Queuing System to learn more about appropriate values for the #@ resources keyword.

The node may be shared with other jobs (#@ node_usage = shared). However, the resources reserved by the job (8 CPUs, 10 GByte memory) are not shared with other jobs. The statement #@ node_usage = shared does only mean that other jobs are allowed to run on the node, if there are not-reserved resources left on the node.

The started executable becomes cancelled after 1 hour, if it has not finished up to then (#@ wall_clock_limit, second number = soft limit). The job is killed after 1 hour 10 minutes, when any process belonging to the job is still running (#@ wall_clock_limit, first number = hard limit).

The environment variable AIXTHREAD_SCOPE is set to S for performance reasons.

Note

SMP programs are always of #@ job_type = serial, although they are parallel applications. However, in contrast to MPI programs, SMP programs cannot run on more than one node.

Ensure that the number of threads the application starts matches ConsumableCpus. Otherwise it may happen that some of the threads have to share a CPU, which causes loss of performance.

8.3.4. Running a hybrid program

Hybrid programs are applications that use both MPI and SMP parallel programming paradigms. A hybrid program is mainly an MPI program, that is started under control of the POE. In addition, each MPI task can spawn a number of threads, that share the memory belonging to the parent MPI task.

Example 8.4. LoadLeveler command file for a hybrid program.

#!/bin/sh
#@ job_type         = parallel
#@ output           = loadl_ex5.llout
#@ error            = loadl_ex5.llerr
#@ node_usage       = not_shared
#@ node             = 2
#@ tasks_per_node   = 8
#@ resources        = ConsumableCpus(4) ConsumableMemory(4 GB)
#@ wall_clock_limit = 1:10:0,1:0:0
#@ network.mpi      = sn_all,,us
#@ environment      = OMP_NUM_THREADS=4; AIXTHREAD_SCOPE=S; \
#                     MEMORY_AFFINITY=MCM; \
#                     MP_SHARED_MEMORY=yes; MP_WAIT_MODE=poll; \
#                     MP_SINGLE_THREAD=yes; MP_TASK_AFFINITY=MCM
#@ queue

./a.out

The batch job will run the executable a.out, which is assumed to reside in the directory from where the job is submitted. The executable is assumed to be a hybrid MPI + SMP program.

The command file causes allocation of 2 nodes (#@ node). Once the executable starts, poe becomes started at first. poe starts 8 MPI tasks on each of the 2 nodes (#@ tasks_per_node, together: 16 MPI tasks). The MPI tasks state to allocate up to 4 GByte each (ConsumableMemory). When the executable reaches code parts that can run shared-memory parallel, each MPI task spawns up to 4 threads (OMP_NUM_THREADS). The number of threads per MPI task has to match ConsumableCpus. The 4 threads per MPI task share the 4 GByte memory belonging to the MPI task. Alltogether, the application declares to use up to 32 CPU's (#@ tasks_per_node * ConsumableCpus) and up to 32 GByte memory (#@ tasks_per_node * ConsumableMemory) on each of the 2 allocated nodes (#@ node). In total, the application will use up to 64 CPU's and 64 GByte memory.

The MPI tasks running on different nodes use the IBM HPS switch for communication (#@ network.mpi). MPI tasks running on the same node will communicate via shared memory (MP_SHARED_MEMORY=yes). The SMP threads started by an MPI task communicate via shared memory per definition.

8.4. Further Reading

Please see the following documents for a more detailed description of LoadLeveler at HLRN:




2003-2008 © Norddeutscher Verbund für Hoch- und Höchstleistungsrechnen (HLRN)