Topic:
Loose and tight integration of the LAM/MPI library into SGE.
Author:
Reuti, reuti__at__staff.uni-marburg.de;
Philipps-University of Marburg, Germany
Version:
1.0a -- 2005-03-23 Initial release, comments and corrections are
welcome
Contents:
- Prerequisites
- Loose integration using rsh
- Modification of the LAM-MPI source for qrsh startup
- Loose integration using qrsh
- Tight integration using qrsh
- Modification of the Startup for other Platforms
- References and Documents
Note:
This HOWTO complements the information contained in the man pages
of SGE and the Administration Guide.
Acknowledgement:
Many thanks to Sean Dilda for pointing out a bug in the
lamd_wrapper and eliminating an ambiguity.
Prerequisites
Configuration of SGE with qconf or the GUI &endash;
Compilation of LAM/MPI &endash; Cluster setup
You should be familiar with modifying the SGE environment such
as the queue definitions and parallel environment (PE)
configuration. Additional information on the SGE parallel
environment is available from the "sge_pe" man pages (make sure
the SGE man pages are defined in your $MANPATH). You should also
have access to the LAM/MPI source code, and know how to build and
install it. More information can be found in the LAM/MPI
Documentation (see References
and Documents). For the "Loose integration using rsh" startup
method, a working (passwordless) login between the nodes in the
cluster is also required.
Target platform
This Howto targets the LAM/MPI 7.1.1 on the Linux platform. Due
to the modifications made to the source code and introduced
startup sequence, these integrations using qrsh will not work on
platforms, where SGE will use the additional group feature for a
proper shutdown by default.
LAM/MPI
The LAM/MPI library from the Indiana University (http://www.lam-mpi.org)
is a MPI implementation. In contrast to other implementations like
MPICH (http://www-unix.mcs.anl.gov/mpi/mpich),
this one is using daemons on all the included nodes in a parallel
job. One advantage of such an approach is, that programs issuing
many mpirun commands during their execution need only one time a
"conventional" communication to the nodes to start the daemons.
All further communication is handled by the already running
daemons, and so additional mpiruns will startup faster.
Available startup methods in SGE
For now there is no startup possibility in SGE, which allows a
program started on a node to vanish into daemon land and still be
under SGE control for correct accounting and proper shutdown of
all slave tasks. Therefore the loose integrations will rely on a
shutdown of the daemons by a conventional LAM/MPI command, while
the tight integration uses a 2-stage startup, which was introduced
by Christopher Duncan for former LAM/MPI versions [1].
Included setups and scripts
All three possibilities to startup the LAM/MPI daemons
introduced here can be installed in parallel, as they have a
unique PE and script directory. This way you can play around with
them, and decide which one to install in your cluster. While
walking through this Howto, you can install all of them in a
shared directory, e.g. your home directory. For the final
installation it could be put in the conventional place like
$SGE_ROOT. The scripts, which are only slight modifications of the
original scripts in the $SGE_ROOT/mpi directory, can be downloaded
here for your convenience (the inserted lines are commented)
[2].
Reading of this document
To avoid explaining the complete LAM/MPI behavior in each of
the possible startup methods, this document should be read from
the beginning onward, although you are e.g. only interested in the
Tight integration. Some of the special settings of the scripts and
PEs will explain itself, when the default behavior is known.
Loose integration using rsh
Characteristic of the Startup Method
Using a simple (passwordless) rsh between the nodes, the
daemons on the slave nodes are started up.
Since LAM/MPI 7.1.1 is SGE-aware, it will create a job specific
directory on the slave nodes of the job of the form
"lam-$USER@$HOSTNAME-sge-$JOB_ID-$SGE_TASK_ID". So they will be
unique for each job and not interfere, if two (or more) LAM/MPI
jobs are running on the same node. They will be located in 1.
$TMPDIR or 2. /tmp. As $TMPDIR is only set by SGE in this case on
the headnode of the job, it will go here into the SGE created and
removed $TMPDIR, and to the /tmp directory on all the slave nodes.
The removal of these directories on the slave nodes will take
place in the stop_proc_args script of the PE, when the
lamhalt command is executed. In most of the cases this will
work as intended, whether the jobs are finished by a normal end of
the program or stopped with qdel. Don't change the $TMPDIR to a
different location in your jobscript, because the command mpirun
can't execute correctly, as it won't find any information of the
current setup. If you must change it for any reason, it must also
be changed in the start_proc_args and stop_proc_args scripts of
the PE.
As there is no environment on the slave nodes by SGE, this
startup will behave most like an interactive startup, and hence
need the path to the LAM/MPI binaries also on the slave nodes
defined in any file, which will be sourced during a
non-interactive login.
Hostfile a.k.a. Machinefile Format
Although LAM/MPI supports a machinefile format, where each node
is suffixed by the number of slots to be used on this machine
(i.e. "node00 cpu=2"), the default format of the generated
machinefile is sufficient, where each machine is repeated times in
this file, equal to the slots to be used there.
Setup of the PE in SGE
We will name this PE "lam_loose_rsh" and set the following
entries:
$ qconf -sp lam_loose_rsh
pe_name lam_loose_rsh
queue_list para21 para22
slots 4
user_lists NONE
xuser_lists NONE
start_proc_args /home/reuti/lam_loose_rsh/startlam.sh $pe_hostfile
stop_proc_args /home/reuti/lam_loose_rsh/stoplam.sh
allocation_rule $round_robin
control_slaves FALSE
job_is_first_task TRUE
This setup is for SGE 5.3, where the queues to be used are
defined in the PE. For SGE 6.x, the PE must be specified in the
cluster queue, in which it should be used in the entry
"pe_list".
Sample job execution
Most likely you have already a MPI program based on LAM/MPI,
which is running successfully in interactive mode. Otherwise you
will find a sample program in the supplied archive to this Howto,
which will run until a qdel of the job, so that the process
distribution to the nodes can easily be checked. Let's take this
job script:
$ cat tester_mpi.sh
#!/bin/sh
export PATH=/home/reuti/local/lam-7.1.1/bin:$PATH
mpirun C /home/reuti/mpihello
Specifying the option "C" to mpirun will use all of the slots
granted to this job, which is most likely what you want all of the
time. If all went right, we can see the job running, and also
check the involved nodes for the running jobs:
$ qsub -pe lam_loose_rsh 2 tester_mpi.sh
your job 10389 ("tester_mpi.sh") has been submitted
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
10389 0 tester_mpi reuti r 02/20/2005 15:25:31 para21 SLAVE
10389 0 tester_mpi reuti r 02/20/2005 15:25:31 para22 MASTER
$ rsh node21 ps -e f -o pid,ppid,pgrp,command --cols=80
PID PPID PGRP COMMAND
...
8386 1 8386 /home/reuti/local/lam-7.1.1/bin/lamd -H 192.168.151.23 -P 4288
8387 8386 8386 \_ /home/reuti/mpihello
$ rsh node22 ps -e f -o pid,ppid,pgrp,command --cols=80
PID PPID PGRP COMMAND
...
24576 1 24576 /home/reuti/local/lam-7.1.1/bin/lamd -H 192.168.151.23 -P 4288
24581 24576 24576 \_ /home/reuti/mpihello
Using this setup, the issued qdel, which will execute the
lamhalt in the stop_proc_args script of the PE, is the only
way to remove the running processes and created directories on the
nodes (except the directory on the headnode of the job, which is
removed by SGE as usual).
Another more sophisticated solution, maybe in "pseudo code" is
to "rsh <node> 'ps some_jobid | grep process_leader | (cut
process_group) * -1 | xargs kill -9 --'" in a loop over all
involved machines. As we are interested in a Tight integration, we
will no further dig into this, and it's up to the reader to
implement it.
Modification of the LAM/MPI source for qrsh
startup
Pitfall during startup of the Daemons
The daemon on the nodes is started by the program hboot
in LAM/MPI, which will fork a child process, which in turn will be
replaced by the lamd. While this will work without any
modification on the headnode of the job, where hboot is
started locally without rsh, we will face a problem on the slave
nodes: when the parent task ends, SGE will discover this and kill
the whole processgroup of the job. This would also kill the just
started lamd, as it is still in the same processgroup as
hboot was before.
Modification
The solution to this problem is the command setsid,
which will enforce a new processgroup for the child task. This has
to be added in line 317 of hboot.c (which can be found in
your lam-7.1.1 sourcecode installation in the directory
tools/hboot), after the child forks. The program
section:
else if (pid == 0) { /* child */
if (ao_taken(ad, "s")) {
should become:
else if (pid == 0) { /* child */
setsid();
if (ao_taken(ad, "s")) {
Then the executable hboot must be build again with the
conventional make inside the lam-7.1.1 sourcecode
installation, and the resulting binary hboot copied just by
hand to the place, where the binaries were installed, e.g.
~/local/lam-7.1.1/bin.
No warranty
As usual, there is no guaranty, that this will work
successfully on all Linux distributions and versions. Just do on
your own risk.
Loose integration using qrsh
Different behavior on the slave nodes compared to
"Loose integration using rsh"
Because we are now using qrsh to start the daemons, no rsh
would be necessary between the nodes. It is just still defined
here to have an easy access to the nodes. The used qrsh will now
also create a temporary directory on the slave nodes and set the
$TMPDIR accordingly. Unfortunately, this will be deleted when the
job (i.e. the executed hboot) ends. So it can't be used to
store the LAM/MPI created information of the daemon setup. As
$TMPDIR is the only environment variable which can't be exported
to the nodes by using the "-v" or "-V" option to qrsh, we have to
use another way. Otherwise you will see the job starting as
intended, and the lamhalt will also look successful, but
you will still have the processes running on the slave nodes in
case that you issued a qdel, as LAM/MPI stores in its special
directory the PIDs of the processes, which need to be killed.
The supplied rsh-wrapper in the mpi subdirectory in $SGE_ROOT
already has the option to supply a prefix command, which should be
executed before the command on the slave node. This can be set in
the start_proc_args script of the PE:
export RCMD_PREFIX="export TMPDIR=/tmp;"
This way, the LAM/MPI created directories are in the same place
as using just the plain rsh. Be sure, that you also changed the
path to the actual location of your rsh-wrapper in the supplied
script startlam.sh of the archive.
Setup of the PE in SGE
We will name this PE "lam_loose_qrsh" and set the following
entries:
$ qconf -sp lam_loose_qrsh
pe_name lam_loose_qrsh
queue_list para21 para22
slots 4
user_lists NONE
xuser_lists NONE
start_proc_args /home/reuti/lam_loose_qrsh/startlam.sh -catch_rsh $pe_hostfile
stop_proc_args /home/reuti/lam_loose_qrsh/stoplam.sh
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task TRUE
To allow the execution of the qrsh commands, the control_slaves
entry must be set to TRUE in this case. On the headnode of the job
is no qrsh necessary, so we can safely set job_is_first_task to
TRUE.
Behavior on platforms with process control by the Additional
Group Feature
Although the started lamd would be in a new process
group, the reliable shutdown of all the processes using the
additional group feature (see: man setgroups) by SGE would also
kill the just started daemon, as it can't escape from this
additional group.
Disadvantages and Advantages of the Loose
Integration
Which ever of the two loose integrations you choose, they share
most of the characteristics in common, as the only difference is
the usage of qrsh instead of the plain rsh:
- - wrong accounting
- - no processes/directories controlled by SGE, removal
relies on the lamhalt command
If all went okay, the result in case of a normal program stop
or an abort with qdel is:
- + no semaphores or shared memory segments left behind
- + processes and directories removed on the slave nodes
As the output is exactly the same as with the "Loose
Integration using rsh", we skip it here, because it will not
present any new information or facts. The scripts for this startup
method are of course also included in the supplied script archive
to this Howto.
Tight integration using qrsh
2-stage Startup for Tight Integration
For now, there is no direct possibility in SGE to start a
program in daemon-mode on a node, which is still under control of
SGE for correct accounting and removal of the processes.
The original idea by Christopher Duncan in [1],
was to make first a conventional qrsh to the slave node, and on
this slave node a local-qrsh to start the final daemon. This is
possible with LAM/MPI, as the daemon is not pushing itself into
the background (like it is mostly done with daemonizing programs),
but the starting hboot makes the forking. At this step, it
is possible to insert the local-qrsh call and achieve the desired
effect.
Additional changes to LAM/MPI
For this to work, it is necessary to use the already patched
hboot, like in the chapter "Loose Integration using qrsh".
Another change is now necessary to insert the local-qrsh.
Therefore the following steps must be executed:
- move to your binary installation of LAM/MPI, e.g.
~/local/lam-7.1.1/bin
- rename lamd to lamd_binary
- create a file lamd_wrapper with the following
content:
#!/bin/sh
#
# Modified startup of lamd to get it working in a controlled
# setup by SGE. To be used for interactive usage like usual.
#
LAMD_BINARY=lamd_binary
if [ "$ENVIRONMENT" = "BATCH" -a "$PE" = "lam_tight_qrsh" ] ; then
COUNTER=5
while (( COUNTER-- > 0 )) ; do
if (( `ps --no-headers -o ppid -p $$` != 1 )) ; then
:
else
:
exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $HOSTNAME $LAMD_BINARY "$@"
fi
done
else
exec $LAMD_BINARY "$@"
fi
#
# If we are lucky, we never get here.
#
exit -1
- make the lamd_wrapper executable, i.e.: chmod +x
lamd_wrapper
- make a symbolic link from lamd to
lamd_wrapper, i.e.: ln -s lamd_wrapper lamd
With this script, the installed LAM/MPI can still be used
interactively or with the loose integration scripts. So nothing
must be changed back, to get the initial behavior.
Setup of the PE in SGE
A new PE "lam_tight_qrsh" will reflect the changes for this
tight integration:
$ qconf -sp lam_tight_qrsh
pe_name lam_tight_qrsh
queue_list para21 para22
slots 4
user_lists NONE
xuser_lists NONE
start_proc_args /home/reuti/lam_tight_qrsh/startlam.sh -catch_rsh $pe_hostfile
stop_proc_args /home/reuti/lam_tight_qrsh/stoplam.sh
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
As we now make a qrsh also on the headnode of the parallel job,
the job_is_first_task must be set to FALSE. It may happen, that we
get only one slot on this machine.
Program flow of the wrapper script
SGE will control the number of qrsh calls made to each node, so
the idea is to wait a small time, until the parent PID of this
lamd_wrapper became 1. This will happen, when the starting
hboot left already the machine.
[Side note: In case, that there are some race conditions,
and SGE still prevents the second local-qrsh to be made (although
the parent PID became already 1), the allocation_rule could be
changed to be just 2 (and so limited to dual CPU machines); this
would give the exact behavior as Christopher Duncan's initial Perl
script. In this case there is no a need to wait at all, and the
entry job_is_first_task could be changed to TRUE (one local qrsh
is at least allowed in this case).]
On some slow systems, there is the chance to increase the
number in the wait loop to a higher value, although it wasn't
observed also under already loaded machines that the waiting loop
in the lamd_wrapper was ever executed. It's just in the script, to
avoid an endless loop, in case anything went wrong.
Again, the startlam.mpi needs to be adjusted, so that the
actual location of the rsh wrapper is used here.
Sample execution
Using the same commands like in the first example, the results
show the tight integration of LAM/MPI:
$ qsub -pe lam_tight_qrsh 2 tester_mpi.sh
your job 10395 ("tester_mpi.sh") has been submitted
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
10395 0 tester_mpi reuti r 02/20/2005 18:39:25 para21 MASTER
0 tester_mpi reuti r 02/20/2005 18:39:25 para21 SLAVE
10395 0 tester_mpi reuti r 02/20/2005 18:39:25 para22 SLAVE
$ rsh node21 ps -e f -o pid,ppid,pgrp,command --cols=80
PID PPID PGRP COMMAND
...
789 1 789 /usr/sge/bin/glinux/sge_execd
8945 789 8945 \_ sge_shepherd-10395 -bg
8993 8945 8993 | \_ /bin/sh /var/spool/sge/node21/job_scripts/10395
8994 8993 8993 | \_ mpirun C /home/reuti/mpihello
8972 789 8972 \_ sge_shepherd-10395 -bg
8973 8972 8973 \_ /usr/sge/utilbin/glinux/rshd -l
8975 8973 8975 \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
8976 8975 8976 \_ lamd_binary -H 192.168.151.22 -P 55513 -n 0 -o
8995 8976 8976 \_ /home/reuti/mpihello
8971 1 8971 /usr/sge/bin/glinux/qrsh -V -inherit -nostdin node21 lamd_bina
8974 8971 8971 \_ /usr/sge/utilbin/glinux/rsh -n -p 55517 node21 exec '/usr/
$ rsh node22 ps -e f -o pid,ppid,pgrp,command --cols=80
PID PPID PGRP COMMAND
...
789 1 789 /usr/sge/bin/glinux/sge_execd
25180 789 25180 \_ sge_shepherd-10395 -bg
25181 25180 25181 \_ /usr/sge/utilbin/glinux/rshd -l
25183 25181 25183 \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sg
25184 25183 25184 \_ lamd_binary -H 192.168.151.22 -P 55513 -n 1 -o
25185 25184 25184 \_ /home/reuti/mpihello
25179 1 25179 /usr/sge/bin/glinux/qrsh -V -inherit -nostdin node22 lamd_bina
25182 25179 25179 \_ /usr/sge/utilbin/glinux/rsh -n -p 49142 node22 exec '/usr/
The escaped processes don't execute any time or resource
consuming program, and can just stay there out of control of SGE.
The real executing lamd and child processes are under SGE
control, which was the goal to be achieved by this setup. These
processes will be killed when they finish on their own, or after
issuing a qdel. In the latter case, it is not possible to get a
proper shutdown by using lamhalt, since lamd was
already shutdown and the directory containing the LAM-MPI specific
information removed SGE.
Advantages and Disadvantages of the Tight
Integration
With this setup we trade some advantages for the
disadvantages:
- + correct accounting
- + processes/directories controlled by SGE, definitely
removed at the end of the job
The drawback is in this case:
- - semaphores or shared memory segments may be left behind,
if the job was aborted with qdel
Depending on the usage of the nodes, there may be a need to
make some cleanup of the semaphores or shared memory segments from
time to time, or in the stop_proc_args of the PE, in case there
are often jobs killed by qdel.
Modification of the Startup for other
Platforms
In case of the shutdown of the processes by the
additional group feature, the other way round may be successful:
usage of the "Tight integration with qrsh" and waiting at the end
of hboot, until the lamd is up and running, instead
of the inserted setsid. Although the starting local-qrsh
will be removed in this case (and so not appear in a process
listing like the above one), the started lamd will continue
to operate and service for the program execution.
References
and Documents
SGE-LAM Integration
[1] Latest version of
the script sge-lam by Christopher Duncan. Because it can only be
found in the mailing list archive at SGE and LAM/MPI, it's
attached here for reference (sge-lam). As
there was a small bug inside, this version was patched to remove
any "-n" for the remote-qrsh call; this was taken from a former
version of this script, where it was included.
[2]
Archive with all the scripts used in this Howto: sge-lam-integration-scripts.tgz.
LAM/MPI
The implementation specific documentation can be found at the
LAM/MPI website on the page "Documentation" (http://www.lam-mpi.org/using/docs/).
MPI documentation in general and tutorials
For some general introduction to MPI and MPI-Programming, you
can study the following documents: