HLRN FAQ
sorted in reverse chronological order
Last update: 02 May 2007, 12:42
What does the hostname part of a LoadLeveler job step ID mean? (May 02, 2007)
My terminal behaves strangely (Mar 02, 2007)
Why is my output file empty, and appears only when my exe finishes? (Aug 11, 2006)
How to get a login shell other than the Korn shell? (Mar 10, 2006)
What does job state "PREEMPTED" mean? (Jan 26, 2006)
Password protection for some online documentation (Jan 17, 2006)
How do I get the output of an X application onto my desktop? (Nov 18, 2005)
1525-108 Error encountered while attempting to allocate a data object. The program will stop. (Nov 18, 2005)
[hlrn.de #506] MPI "ERROR: 0031-309 Connect failed ... not enough memory available now" (Jan 28, 2005)
[hlrn.de #474] Using the development nodes (cdev, interactive) (Jan 05, 2005)
[hlrn.de #154] Network problems when connecting to HLRN sites (Feb 27, 2004)
[hlrn.de #250] Unable to duplicate owner and mode after move. (Mar 11, 2003)
[hlrn.de #193] Information on system status (Mar 06, 2003)
[hlrn.de #181] Cannot load module /usr/lib/libMagick.a(libMagick.so) (Jan 27, 2003)
[hlrn.de #192] Job execution improved (wall_clock_limit) (Nov 12, 2002)
[hlrn.de #90] Wrong times (Sep 09, 2002)
[hlrn.de #84] IBM diagnostic messages (Aug 21, 2002)
[hlrn.de #32] MPI2 and 64 bit (Jul 29, 2002)
[hlrn.de #34] C: malloc in 64bit-application (Jul 29, 2002)
[hlrn.de #37] MPI in 64bit mode (Jul 29, 2002)
[hlrn.de #49] hpmcount (Jul 29, 2002)
|
What does the hostname part of a LoadLeveler job step ID mean? (May 02, 2007) |
QUESTION:
When I submit several jobs that are to be executed on the same
HLRN complex their IDs sometimes begin with bloadl-en0 while
others begin with hloadl-en0.
What does the hostname part of a LoadLeveler job step ID mean?
ANSWER:
The format of a full LoadLeveler job step identifier is
  host.jobid.stepid
where
  host    is the name of the machine that assigned the job and step
          identifiers when the job step was submitted. This machine is
          responsible for bookkeeping events of the job step during
          its lifetime. There is no relationship between the name of
          this host and the names of nodes that (can) execute the job.
  jobid   is the number assigned to the job when it was submitted.
  stepid  is the number assigned to a step within the job when it
          was submitted.
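The three parts of such an identifier can be separated with plain shell parameter expansion. The following is a minimal sketch; the function name split_step_id and the sample ID are made up for illustration. The fields are taken from the right, so a host name that itself contains dots would still be handled.

```shell
#!/bin/sh
# Split a full LoadLeveler job step ID of the form host.jobid.stepid.
# split_step_id is a hypothetical helper, not part of LoadLeveler.
split_step_id() {
    id=$1
    stepid=${id##*.}     # text after the last dot
    rest=${id%.*}        # everything before the last dot
    jobid=${rest##*.}    # text after the last remaining dot
    host=${rest%.*}      # everything before that dot
    echo "host=$host jobid=$jobid stepid=$stepid"
}

# Sample ID, for illustration only
split_step_id bloadl-en0.12345.0
```

The host field only tells you which machine does the bookkeeping; it says nothing about where the step runs.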
|
|
My terminal behaves strangely (Mar 02, 2007) |
PROBLEM:
When working in an interactive shell through a terminal, sometimes
strange characters appear on the screen, and/or cursor movement
behaves strangely.
DIAGNOSIS:
The capabilities of your terminal are not interpreted correctly.
SOLUTIONS:
1) Use a simple terminal emulator to work with HLRN.
Simple xterm should work in any event.
Exotic terminal emulators like xwsh (IRIX) or modern
things like newer gnome-terminal may cause problems.
Some terminal emulators are also able to provide
different compatibility modes. Try something simple like
vt100.
2) Check the character set you are using on your local
machine. HLRN machines do not provide UTF-8 or other
multibyte character sets. The locales available at HLRN are
> locale -a
C
POSIX
en_US
en_US.8859-15
en_US.ISO8859-1
Try to configure your terminal to use ISO-8859-1 or ISO-8859-15,
and avoid UTF-8.
3) When observing strange things with curses programs
(nmon, llmap, smitty, ...), try to use another terminal info
database at HLRN. The AIX default is /usr/share/lib/terminfo.
Try
export TERMINFO=/opt/freeware/share/terminfo
If this solves your problems, put this line into your ~/.profile.
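To check point 2) on your local machine before logging in, you can test whether the effective character-set locale is a UTF-8 one. This is a minimal sketch; the function name uses_utf8 is made up for illustration, and the check relies only on the standard precedence of LC_ALL over LC_CTYPE over LANG.

```shell
#!/bin/sh
# Return success (0) if the effective character-set locale looks like UTF-8.
# Precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
# uses_utf8 is a hypothetical helper name.
uses_utf8() {
    effective=${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}
    case $effective in
        *[Uu][Tt][Ff]-8*|*[Uu][Tt][Ff]8*) return 0 ;;
        *) return 1 ;;
    esac
}

if uses_utf8; then
    echo "UTF-8 locale in use; consider ISO-8859-1 or ISO-8859-15 for HLRN sessions"
fi
```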
|
|
Why is my output file empty, and appears only when my exe finishes? (Aug 11, 2006) |
PROBLEM:
I run an executable which opens an output file and writes into it
frequently. But the output appears only when the program finishes.
When the program gets aborted, the output remains empty.
SOLUTION:
You probably have a problem with buffered I/O. You have to flush the
I/O buffers or use unbuffered I/O.
Here are some hints on how to do that:
0) Create the files in question in $WORK or $TMPDIR, not in $HOME.
$HOME is NFS-mounted, which is slower than the GPFS file systems.
1) If you have the source code, introduce calls to appropriate
routines to flush the I/O buffer of a connected buffered file after
writing into it.
Within Fortran, use the flush_ service subroutine, or the FLUSH
statement. The latter is an IBM extension to Fortran and can be used
on seekable files only (stdin, stdout, stderr are not seekable).
Within C, use the fflush library function.
2) If you use a Fortran executable and have no sources,
manipulate buffered I/O by setting the XLFRTE environment variable:
export XLFRTE=buffering=disable_all
Note that this unbuffers I/O on all units opened by the program.
This may not be a good idea if some files are used for heavy
scratch I/O. Note also that unbuffered I/O means that output goes
directly to disk character by character. This may degrade performance.
If only preconnected units like stdin, stdout, stderr have to be
unbuffered, use the value disable_preconn instead.
3) If you have the source code, manipulate buffered I/O.
Within Fortran, use the setrteopts subroutine, which does the same
as the XLFRTE environment variable.
Within C, use the setvbuf library function to select the buffering
mode on a per-stream basis (for example _IONBF for unbuffered or
_IOLBF for line-buffered output).
|
|
How to get a login shell other than the Korn shell? (Mar 10, 2006) |
PROBLEM:
I want to use my favourite shell for interactive sessions,
but HLRN does not provide a way to change the login shell.
SOLUTION:
On IBM systems with AIX, only a limited number of (very old) shells are
supported as login shells. The favourites bash and tcsh are not among
them.
There is a workaround to get them as a "login shell". Please see
http://www.hlrn.de/doc/login_shell/index.html
for how to get the shell you want.
|
|
What does job state "PREEMPTED" mean? (Jan 26, 2006) |
When a LoadLeveler batch job has the state "PREEMPTED", then the job
is suspended temporarily. The resources CPU, memory and network
occupied by the job are freed to give them to some high-priority job.
After this high-priority job has finished, the preempted job continues
to run.
During the time a job is in state "PREEMPTED" (shortcut is "E"), the
job does nothing.
So, if one of your jobs seems to do nothing, first examine
the job state using llq or lxq. If it is "PREEMPTED", then
your observation does not indicate an error. There is no reason to
cancel such jobs, or to manipulate the files they own.
These jobs will continue as soon as possible.
Note that tools that calculate wallclock times consumed by a running
job cannot give correct output for jobs that are or were preempted.
These tools are lxq, llacct, and llmap.
The charge given in the HLRN job step receipt takes preemption events
into account, if possible. The true charge stored in the user database
will do so in any event.
/Bka
|
|
Password protection for some online documentation (Jan 17, 2006) |
PROBLEM:
I have no access to some online manuals (e.g. ABAQUS, FLUENT, etc.).
The server asks me for username and password. When I type in my HLRN
username and my password in my web browser I only get an error message.
ANSWER/SOLUTION:
The access to parts of the documentation is restricted. Due to legal
reasons only registered users of HLRN are allowed to read documentation
installed on the HLRN system and provided by IBM and various independent
software vendors (ISVs).
Read the section about "Restricted access: password protection and login hosts"
in the document "HLRN Online Documentation Overview" for information
about the right credentials.
|
|
How do I get the output of an X application onto my desktop? (Nov 18, 2005) |
1) Your local machine runs Unix with a graphical window manager.
1a) Log in to an HLRN machine using ssh -X. This tunnels X11
connections. Executing echo $DISPLAY shows localhost:n.0 in
this case.
Test with xlogo. This should work, but is slow.
1b) Check that your local window manager allows remote X11 connections.
How to do this depends on your window manager.
Check that your local firewall allows incoming TCP connections on
port 6000 (and possibly up to 6020). If you use Linux,
try /sbin/iptables -L to find out. If these ports are not
open, open them.
Check that your X server allows remote connections by executing
xhost. Add an HLRN machine with e.g. xhost +berni.hlrn.de.
Log in to the HLRN machine using ssh -x.
Execute export DISPLAY=<your-machine>:0.0.
Test with xlogo.
2) Your local machine runs Windows.
In this case you have to install additional software (an X server)
on your machine. Installing CYGWIN should be the easiest
and cheapest way. See http://cygwin.com/.
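When an X application does not show up, it helps to know which of the variants 1a) and 1b) the current DISPLAY value corresponds to. This is a minimal sketch to run on the HLRN side; the function name display_kind and its output words are made up for illustration.

```shell
#!/bin/sh
# Classify a DISPLAY value.
#   tunneled : ssh X11 proxy on localhost (variant 1a, ssh -X)
#   direct   : names another host (variant 1b, export DISPLAY=<your-machine>:0.0)
# display_kind is a hypothetical helper name.
display_kind() {
    case $1 in
        "")          echo unset ;;
        localhost:*) echo tunneled ;;
        :*)          echo local ;;
        *:*)         echo direct ;;
        *)           echo invalid ;;
    esac
}

display_kind "${DISPLAY:-}"
```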
|
|
1525-108 Error encountered while attempting to allocate a data object. The program will stop. (Nov 18, 2005) |
Background:
-----------
This error message is produced by the Fortran run time environment,
when an allocate statement fails.
It is similar to the message "Not enough space" from the operating
system when the malloc system call fails with errno ENOMEM.
Diagnosis:
----------
a) You ran a 32-bit executable that is not enabled to use the
large address-space model.
b) You ran a 32-bit executable that is enabled to use the
large address-space model, but you tried to allocate more
than 2 GByte.
c) You tried to allocate a very large amount of memory (many GByte).
Answers:
--------
a) Link with -bmaxdata:0x80000000 to enable a 32-bit executable to
use the large address-space model. Without this option, only 256 MByte
of memory can be allocated in the data segment. With this option, at
most 2 GByte of memory can be allocated.
b) Port your program to 64-bit (compile and link with -q64, and take
care with types in C). Then memory requests have almost no limit.
c) HLRN machines usually have 64 GByte of physical memory and
about 128 GByte of paging space. There are two HLRN nodes with
128 GByte, and two with 256 GByte physical memory.
Allocating up to 55 GByte physical memory on a common node is
possible in batch, provided a correct setup of the batch job.
Requests for more memory are fulfilled from paging space. How much
memory you get beyond physical memory depends on the usage of the
paging space.
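The value given to -bmaxdata in answer a) is a byte count in hexadecimal. The following sketch converts it to MByte and to the number of 256-MByte segments it spans (the 32-bit address space on AIX is divided into segments of that size). The function name maxdata_info is made up for illustration.

```shell
#!/bin/sh
# Convert a -bmaxdata value (bytes, hexadecimal or decimal) to MByte and to
# the number of 256-MByte segments it covers.
# maxdata_info is a hypothetical helper name.
maxdata_info() {
    bytes=$(($1))
    echo "$((bytes / 1048576)) MByte, $((bytes / 268435456)) segments"
}

maxdata_info 0x80000000    # prints "2048 MByte, 8 segments"
```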
|
|
[hlrn.de #506] MPI "ERROR: 0031-309 Connect failed ... not enough memory available now" (Jan 28, 2005) |
PROBLEM:
At the start of my MPI program I get the error message
"ERROR: 0031-309 Connect failed during message passing initialization,
task NN, reason: There is not enough memory available now.
ERROR: 0031-007 Error initializing communication subsystem: return
code -1"
although I request enough memory [ConsumableMemory(1630mb)].
SOLUTION:
The environment variable $MP_BUFFER_MEM specifies the size
of the buffer used for early arrivals. The default setting
is 64m (64 Mbytes). If the ConsumableMemory is set too small
on a batch job such that it does not include extra memory
for MPI buffers, the job will fail with this error message.
Change the ConsumableMemory to be at least 200 Mbytes
more than you expect your program to use.
Or, if you know how big an MPI buffer your program will
need, you can try setting the environment variable
MP_BUFFER_MEM to 32m, 16m, or 8m.
After changing the setting to MP_BUFFER_MEM=16m, the program
ran fine. According to the job step receipt it required:
Resources requested:
Node       Cpus      Mem/MB           Node Usage
--------------------------------------------------
breg05a    32( 32)   52160( 53760)    not shared
Resources used:
Node       Cpus   Real Mem/MB   CPU Eff    Cpu Time/s
---------------------------------------------------
breg05a    32     13774.36      96.26 %    8162.45
That is only about 430 MB per task!
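The per-task figures follow directly from the job step receipt above: divide the node totals by the number of tasks. A minimal sketch of that arithmetic, assuming one task per CPU as in this job; the function name per_task_mb is made up for illustration.

```shell
#!/bin/sh
# Memory per task in MByte, rounded to the nearest integer.
# per_task_mb is a hypothetical helper name.
per_task_mb() {
    awk -v total="$1" -v tasks="$2" 'BEGIN { printf "%.0f\n", total / tasks }'
}

per_task_mb 52160 32       # requested per task: 1630
per_task_mb 13774.36 32    # actually used per task: 430
```

Comparing the two numbers shows how much headroom a ConsumableMemory request leaves for the MPI buffers.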
|
|
[hlrn.de #474] Using the development nodes (cdev, interactive) (Jan 05, 2005) |
PROBLEM:
When I run an application with more tasks than available CPUs on
the development nodes (interactive or LoadLeveler class cdev) the
program somehow starts but does not advance. I observe no output,
but the processes use cpu time. What is wrong?
SOLUTION:
Using more tasks than available CPUs overloads the
development node. Although the configuration allows this to some
extent, the default environment setting for MPI task affinity
does not. Unsetting the task affinity in the job script (for cdev
jobs only!) helps:
(Korn-Shell): unset MP_TASK_AFFINITY
((T)C-Shell): unsetenv MP_TASK_AFFINITY
This setting also helps when starting an interactive application
(via poe), e.g. debugging with TotalView.
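If the same job script is used for several classes, the unset can be made conditional so that it takes effect for cdev only. This is a sketch under one assumption: that LoadLeveler exports the class of the running step in the environment variable LOADL_STEP_CLASS. Check this on your system before relying on it. The function name is made up for illustration.

```shell
#!/bin/sh
# Drop the MPI task affinity setting only when the step runs in class cdev.
# Assumption: LOADL_STEP_CLASS holds the class of the running job step.
# relax_affinity_for_cdev is a hypothetical helper name.
relax_affinity_for_cdev() {
    if [ "${LOADL_STEP_CLASS:-}" = "cdev" ]; then
        unset MP_TASK_AFFINITY
    fi
}

relax_affinity_for_cdev
```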
|
|
[hlrn.de #154] Network problems when connecting to HLRN sites (Feb 27, 2004) |
PROBLEM:
Network connections to certain HLRN sites using addresses like
hanni.hlrn.de or berni.hlrn.de fail
(e.g. ssh/scp: ... Destination Unreachable...).
Or:
Connections to the HLRN web servers (docserv.hlrn.de,
isvdoc.hlrn.de, and ibmdoc.hlrn.de) fail despite using
the right port (8080) and the right username and password
for the online documentation.
SOLUTION:
First consult the HLRN System Status and News web pages
(http://www.hlrn.de/status/ and http://www.hlrn.de/news/)
to learn about known problems.
Temporary workarounds:
For ssh/scp try to use the login nodes explicitly, e.g.
hanni1-ex.hlrn.de or berni1-ex.hlrn.de or, for HLRN-internal
connections between the complexes only, use the HLRN link, e.g.
via hanni1-hl.hlrn.de or berni1-hl.hlrn.de.
For connections to the web servers: use the external interfaces
of the login nodes on BERNI:
berni1-ex.hlrn.de
instead of the regular web server names.
Protected HLRN web pages are currently served by BERNI.
(wwb)
|
|
[hlrn.de #250] Unable to duplicate owner and mode after move. (Mar 11, 2003) |
QUESTION/PROBLEM:
When I compile source code in my home directory I get the message
breg02a> mpxlf90_r -qsuffix=f=f90 -c pointers.f90
** pointers === End of Compilation 1 ===
1501-510 Compilation successful for file pointers.f90.
mv: 0653-404 pointers.o: Unable to duplicate owner and mode after move.
This occurs since the home directory has been moved to a different
file system (JFS on the data server).
ANSWER/SOLUTION:
The compiler works with temporary files in $TMPDIR which is part of the
global parallel file system (GPFS). $HOME is on a JFS file system. The
AIX mv command gives a message as above during the final move of the
object files. This is also noted in the man page of the AIX mv command.
The message is just a warning, not an error.
If you use the mv command from the Linux Toolbox (freeware) the
message does not occur. You can prepend the freeware tools to your
program search path by loading the module "freeware-first":
module load freeware-first
Caution: this may have unwanted side effects.
For a freeware overview see the link provided at
http://www.hlrn.de/doc/overview.html
|
|
[hlrn.de #193] Information on system status (Mar 06, 2003) |
QUESTION:
How can I get information on system status?
ANSWER:
There is a system status report for HLRN at
http://www.hlrn.de/status/index.html
This report is generated automatically and updated every 10 minutes.
Further information on system status of a complex may be
obtained with the command
llstatus
Nodes showing "Down" in column "Startd" are most probably
down. Receiving nothing or error messages on llstatus is likely
to be caused by a serious system failure.
More detailed information on the batch system status can be
obtained with the command
llmap
Executing this command without any options produces the same output
as shown in section LoadLeveler Details of
http://www.hlrn.de/status/long.html
To learn about the many command-line options of
llmap, consult the online man page, or
http://www.hlrn.de/doc/llmap/index.html
|
|
[hlrn.de #181] Cannot load module /usr/lib/libMagick.a(libMagick.so) (Jan 27, 2003) |
QUESTION/PROBLEM:
I cannot use the Image Magick Tools (convert, display) because of the
following error:
$ convert
exec(): 0509-036 Cannot load program convert because of the following errors:
0509-022 Cannot load module /usr/lib/libMagick.a(libMagick.so).
0509-150 Dependent module /usr/lib/libttf.a(libttf.so.2) could not be loaded.
0509-152 Member libttf.so.2 is not found in archive
breg02a$
ANSWER/SOLUTION:
You are using the default HLRN environment which includes the Linux
Toolbox below /opt/freeware/ (module "freeware" is loaded).
However, there are conflicting libraries in the default library search path.
We have provided a modulefile "freewlib" to deal with this situation.
You first have to unload the regular "freeware" module:
module unload freeware
module load freewlib
|
|
[hlrn.de #192] Job execution improved (wall_clock_limit) (Nov 12, 2002) |
QUESTION/PROBLEM:
Why do my batch jobs on berni/hanni wait so long in the input queue
although there are plenty of idle nodes and although I specified a quite
small "job_cpu_limit"?
ANSWER/SOLUTION:
LoadLeveler schedules jobs according to their wall clock limit!
You can help LoadLeveler by specifying
#@ wall_clock_limit = <required wall clock time for the job>
in your LoadLeveler script. If you omit this limit, a default value
(currently 12 hours (!) for classes "csolo" and "cshare") is taken.
See e.g. column "WLimS" of the command
lxq
(for a short description use "lxq -?")
For a short HowTo on LoadLeveler see
http://www.hlrn.de/doc/quickstart/qs_loadl.html
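The limit is commonly written as hh:mm:ss. If you know the expected run time in seconds, the directive line can be generated as in the following sketch; the function name to_hms and the 90-minute value are made up for illustration.

```shell
#!/bin/sh
# Format a run time given in seconds as hh:mm:ss for wall_clock_limit.
# to_hms is a hypothetical helper name.
to_hms() {
    s=$1
    printf '%02d:%02d:%02d\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# 90 minutes of expected run time
echo "#@ wall_clock_limit = $(to_hms 5400)"    # prints "#@ wall_clock_limit = 01:30:00"
```

Choose the limit with some safety margin, but keep it well below the 12-hour default so that the scheduler can fit the job into gaps.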
|
|
[hlrn.de #90] Wrong times (Sep 09, 2002) |
QUESTION/PROBLEM:
At least one place in the system has not yet been set to CEST
(Central European Summer Time):
breg02a$ who
bzfbusch ttyp0 Aug 19 09:20 (130.73.72.56)
(correct)
but the login message shows:
Last login: Mon Aug 19 2002 07:20:57 from 130.73.72.56
ANSWER/SOLUTION:
The clocks of all nodes are set correctly. ssh2 reports the
last login time in UTC.
|
[hlrn.de #84] IBM diagnostic messages (Aug 21, 2002) |
QUESTION/PROBLEM:
Is there, under AIX or on the WWW, something like an equivalent to the
Cray/SGI explain command, or any documents that explain error messages
of the system or of system tools?
For the Parallel Environment I found something:
http://publib.boulder.ibm.com/doc_link/en_US/a_doc_lib/sp34/pe/html/am105mst02.html#ToC
but what about everything else?
ANSWER/SOLUTION:
I have collected a few documents at
http://www.unics.uni-hannover.de/rrzn/gehrke/xlf.html#sonst
There I have also described how you may find more (or other)
material. This is also of interest to colleagues who
do not only want to read online, but also want to print manuals
etc.
|
|
[hlrn.de #32] MPI2 and 64 bit (Jul 29, 2002) |
QUESTION/PROBLEM:
How to work with MPI-2 on p690 (e.g. MPI_Put, MPI_Win_create ... )?
ANSWER/SOLUTION:
For MPI-2, simply use the thread-safe compilers with the 64-bit flag -q64.
Example: optimized compilation of Test.F90 with C preprocessing:
mpxlf90_r -O5 -qstrict -qsuffix=cpp=F90 -q64 -o Test-64.x Test.F90 -lmpi
Suggestion: guideline for generating and naming 64-bit libraries
1. Use the suffix -64.a for the names of libraries (-64.so for shared
libraries).
2. Use the suffix -64 for a separate Makefile include that defines the
64-bit flags for compiling, archiving and linking.
Explanation of (2): For complex software projects it may be useful to
edit the Makefiles and create such an include, so that both the 32-bit
and the 64-bit version can be built in two phases:
make
  standard 32-bit compilation; the Makefile reads its configuration via
  include Makefile.in$(L64), where L64 is undefined (empty).
make clean
  removes the 32-bit objects.
make L64=-64
  includes Makefile.in-64 because of the L64=-64 setting.
In Makefile.in-64 the remaining flags must be set for 64-bit, for example:
AR=ar
ARFLAGS=$(AR64) cr
FOPT=-O5 $(Q64)
LOPT=-O5 $(Q64)
LIBS=-L$(NETCDF) -lnetcdf$(L64)
In Makefile.in-64 also introduce the Makefile variables
Q64=-q64 #for compiler flags
L64=-64 #for library suffix extension
AR64=-X64 #for archiving.
Of course, the names of binary executables should include the -64
extension as well, to remind you and everybody else that 64-bit
compilation and linkage were used.
|
|
[hlrn.de #34] C: malloc in 64bit-application (Jul 29, 2002) |
QUESTION/PROBLEM:
I have a problem with a C part of a program whose main program is F90:
with -q64 a segmentation fault occurs at run time when the
C routine writes to an array allocated in that same part,
namely at the first access (arr[i]=0).
malloc does not seem to work in 64-bit mode.
Test: pointers printed with %ld after malloc. They are not NULL, so
they are allocated somehow. But why can they not be used?
(I need MPI2, hence 64-bit.)
ANSWER/SOLUTION:
Include <stdlib.h> at the top of the source file in which malloc
is called.
Without the declaration from that header, the C compiler assumes that
malloc returns an int. In 64-bit mode the returned pointer is then
truncated to 32 bits and is invalid when it is dereferenced.
|
[hlrn.de #37] MPI in 64bit mode (Jul 29, 2002) |
QUESTION/PROBLEM:
It seems to me that MPI with 64 bit is not possible
on the regatta systems, as mpxlf -q64 -b64 does not find libraries for
that. Is that true? What a pity!
ANSWER/SOLUTION:
You have to compile and link with mpxlf_r!
The reason is that only the threaded MPI library is 64-bit enabled, not
the signal-based one.
|
|
[hlrn.de #49] hpmcount (Jul 29, 2002) |
QUESTION/PROBLEM:
Is hpmcount (and libhpm) installed, and if so, where?
ANSWER/SOLUTION:
hpmcount and everything that belongs to it can be found in the software
distribution under /class/[bin|lib|...]. HPM is described in HLRNdoc:
http://www.hlrn.de/doc/HPM/.
|
2003-2007 © Norddeutscher Verbund für Hoch- und Höchstleistungsrechnen (HLRN)
|