|
|
Revision History: Revision 2.4, Published 2006/10/27 10:45:02 (UTC) by Bernd Kallies Table of Contents
The HLRN operates LoadLeveler to run serial and parallel programs in batch. Running programs in batch mode is the preferred way to execute production work. To the opposite, interactive program execution should be done for testing purposes only. Running a program in batch mode requires the user to write a LoadLeveler command file that defines the resources of the job (number of nodes, number of CPU's on each node, required memory, required network) and the actions the job has to do. The command file is submitted into a LoadLeveler class (queue). The batch queuing system schedules the job for execution depending on the load of the system. In order to use the system resources efficiently, knowledge about the batch system configuration is important. The most important LoadLeveler commands are:
All LoadLeveler jobs must be submitted as a command file. This file specifies information like job type, executable name, job class, resource requirements or input/output files. The following sections comment the command file basics for different job types. Working examples including simple program sources for illustration purposes you find in Chapter 10, Examples. Example 8.1 shows a simple command file, that can serve as template for running a serial executable in batch: Example 8.1. A simple LoadLeveler command file. #!/bin/sh #@ job_type = serial #@ output = loadl_ex1.llout #@ error = $(output) #@ class = cdev #@ resources = ConsumableCpus(1) ConsumableMemory(16 MB) #@ wall_clock_limit = 0:10 #@ notification = always #@ queue env This command file resembles a shell script, which will be executed at runtime. It will print the runtime environment of the job via execution of the env command. Therefore it will use 1 processor on a machine that will be selected by LoadLeveler. The wall clock limit (recommended keyword) for the job is 10 seconds. The STDOUT of the job will be redirected into a file loadl_ex1.llout. STDERR will be redirected into the same file. The file will appear on the machine in the directory where the job was submitted. The #@ resources keyword is required because resource usage and resource submission is enforced. The counters specify resources consumed by each task of a job step (in case of the above example the job consists of one job step). LoadLeveler will send a mail to the submitting user on the submit machine when the job starts and ends execution (#@ notification). The job command file shown in Example 8.2 starts an MPI program by using POE (IBM's Parallel Operating Environment, see Chapter 7, Running Parallel Programs). It can be used as template for getting started with MPI programs in batch. Example 8.2. A LoadLeveler command file for an MPI program. #!/bin/sh #@ job_type = parallel #@ output = loadl_ex3.llout #@ error = loadl_ex3.llerr #@ node_usage = not_shared #@ resources = ConsumableCpus(1) ConsumableMemory(1760 mb) #@ wall_clock_limit = 1:10:0,1:0:0 #@ node = 2 #@ tasks_per_node = 32 #@ network.mpi = sn_all,,us #@ environment = MEMORY_AFFINITY=MCM; \ # MP_SHARED_MEMORY=yes; MP_WAIT_MODE=poll; \ # MP_SINGLE_THREAD=yes; MP_TASK_AFFINITY=MCM #@ queue ./a.out The batch job will run the executable a.out, which is assumed to reside in the directory from where the job is submitted. The executable is assumed to became compiled with the compiler command mpxlf (Fortran) or mpcc (C). If so, then a.out is an MPI program. When it starts execution, then poe is started at first, which then starts the true executable. The number of MPI tasks started by poe are defined by the LoadLeveler keywords #@ node and #@ tasks_per_node (total 64 tasks in this example, spread over two nodes). Each MPI task uses 1 CPU (ConsumableCpus(1)) and 1760 MByte memory (ConsumableMemory(1760 mb)). Alltogether, the application states to consume up to 32 CPUs and up to 55 GByte memory on each of the 2 allocated nodes. See Notes on ConsumableCpus and ConsumableMemory in LoadLeveler - The IBM SP Batch Queuing System to learn more about appropriate values for the #@ resources keyword. The nodes are given not_shared to the job (#@ node_usage = not_shared), which means that the two nodes are dedicated to the job. MPI communication between tasks running on different nodes uses the HPS switch network (#@ network.mpi). MPI communication between tasks running on the same node uses shared memory (#@ environment = MP_SHARED_MEMORY=yes). The remaining environment variables MP_xxx are appropriate to get performance for the majority of MPI applications. The started executable becomes cancelled after 1 hour, if it has not finished up to then (#@ wall_clock_limit, second number = soft limit). The job is killed after 1 hour 10 minutes, when any process belonging to the job is still running (#@ wall_clock_limit, first number = hard limit). Note
Simple MPI programs always state ConsumableCpus(1) (in contrast to hybrid programs). Be aware that the keyword ConsumableMemory declares the memory per MPI task. If the number of MPI tasks (#@ tasks_per_node) multiplied by ConsumableMemory exceeds the physical amount of memory available on a node, then the job will wait for execution forever (or until one llcancels it). The job command file shown in Example 8.3 is intended to be used for a SMP program that starts a number of parallel threads. Example 8.3. A LoadLeveler command file for a SMP program. #!/bin/sh #@ job_type = serial #@ output = loadl_ex4.llout #@ error = loadl_ex4.llerr #@ node_usage = shared #@ resources = ConsumableCpus(8) ConsumableMemory(10 GB) #@ wall_clock_limit = 1:10:0,1:0:0 #@ environment = OMP_NUM_THREADS=8; AIXTHREAD_SCOPE=S #@ queue ./a.out The batch job will run the executable a.out, which is assumed to reside in the directory from where the job is submitted. The executable is assumed to be an SMP program. The command file sets the environment variable OMP_NUM_THREADS, which determines the number of parallel threads for common SMP programs. Consult the manual of the SMP application you use if it relies on other ways to determine the number of threads. In any event, the number of threads has to match the declaration of ConsumableCpus. The statement ConsumableMemory declares the amount of memory shared by the parallel threads. It is the total amount of memory the application uses. Alltogether, the application states to use up to 8 CPUs and up to 10 GByte memory on one node. See Notes on ConsumableCpus and ConsumableMemory in LoadLeveler - The IBM SP Batch Queuing System to learn more about appropriate values for the #@ resources keyword. The node may be shared with other jobs (#@ node_usage = shared). However, the resources reserved by the job (8 CPUs, 10 GByte memory) are not shared with other jobs. The statement #@ node_usage = shared does only mean that other jobs are allowed to run on the node, if there are not-reserved resources left on the node. The started executable becomes cancelled after 1 hour, if it has not finished up to then (#@ wall_clock_limit, second number = soft limit). The job is killed after 1 hour 10 minutes, when any process belonging to the job is still running (#@ wall_clock_limit, first number = hard limit). The environment variable AIXTHREAD_SCOPE is set to S for performance reasons. Note
SMP programs are always of #@ job_type = serial, although they are parallel applications. However, in contrast to MPI programs, SMP programs cannot run on more than one node. Ensure that the number of threads the application starts matches ConsumableCpus. Otherwise it may happen that some of the threads have to share a CPU, which causes loss of performance. Hybrid programs are applications that use both MPI and SMP parallel programming paradigms. A hybrid program is mainly an MPI program, that is started under control of the POE. In addition, each MPI task can spawn a number of threads, that share the memory belonging to the parent MPI task. Example 8.4. LoadLeveler command file for a hybrid program. #!/bin/sh #@ job_type = parallel #@ output = loadl_ex5.llout #@ error = loadl_ex5.llerr #@ node_usage = not_shared #@ node = 2 #@ tasks_per_node = 8 #@ resources = ConsumableCpus(4) ConsumableMemory(4 GB) #@ wall_clock_limit = 1:10:0,1:0:0 #@ network.mpi = sn_all,,us #@ environment = OMP_NUM_THREADS=4; AIXTHREAD_SCOPE=S; \ # MEMORY_AFFINITY=MCM; \ # MP_SHARED_MEMORY=yes; MP_WAIT_MODE=poll; \ # MP_SINGLE_THREAD=yes; MP_TASK_AFFINITY=MCM #@ queue ./a.out The batch job will run the executable a.out, which is assumed to reside in the directory from where the job is submitted. The executable is assumed to be a hybrid MPI + SMP program. The command file causes allocation of 2 nodes (#@ node). Once the executable starts, poe becomes started at first. poe starts 8 MPI tasks on each of the 2 nodes (#@ tasks_per_node, together: 16 MPI tasks). The MPI tasks state to allocate up to 4 GByte each (ConsumableMemory). When the executable reaches code parts that can run shared-memory parallel, each MPI task spawns up to 4 threads (OMP_NUM_THREADS). The number of threads per MPI task has to match ConsumableCpus. The 4 threads per MPI task share the 4 GByte memory belonging to the MPI task. Alltogether, the application declares to use up to 32 CPU's (#@ tasks_per_node * ConsumableCpus) and up to 32 GByte memory (#@ tasks_per_node * ConsumableMemory) on each of the 2 allocated nodes (#@ node). In total, the application will use up to 64 CPU's and 64 GByte memory. The MPI tasks running on different nodes use the IBM HPS switch for communication (#@ network.mpi). MPI tasks running on the same node will communicate via shared memory (MP_SHARED_MEMORY=yes). The SMP threads started by an MPI task communicate via shared memory per definition. Please see the following documents for a more detailed description of LoadLeveler at HLRN:
2003-2008 © Norddeutscher Verbund für Hoch- und Höchstleistungsrechnen (HLRN) |
|||||||||||||||||||||||||||||