Topic:
Setup MPICH to get all child-processes killed on the slave
nodes.
Author:
Reuti, reuti__at__staff.uni-marburg.de;
Philipps-University of Marburg, Germany
Version:
1.3 -- 2005-02-14 Minor change for MPICH 1.2.6, comments and
corrections are welcome
Contents:
- Symptom of this behaviour
- Explanation
- Solutions
- Number of tasks spread to the nodes
- What rsh command is compiled into a binary?
- Option -nolocal to mpirun
- Uneven distribution of processes to the slave nodes with two
network cards
- Wrong interface selected for the back channel of the
MPICH-tasks with the ch_p4-device
- Hint for ADF
- Hint for Turbomole
Note:
This HOWTO complements the information contained in the
$SGE_ROOT/mpi directory of the Grid Engine distribution.
Symptom of this behaviour
You have parallel jobs using MPICH under SGE on Linux.
Some of these jobs are killed cleanly when you use qdel and leave
no running processes behind on the nodes. Other MPICH jobs
disappear from SGE after qdel, but their calculating tasks are
still present on the nodes and keep consuming CPU time.
Explanation
Every MPICH task on a slave, created by an rsh command,
tries to become a process group leader. If the MPICH task is the
direct child of qrsh_starter, this is fine, as it already is the
process group leader, and you can kill these jobs in the usual
way: the child of qrsh_starter is killed with "kill (-pid)", and
hence the whole process group is killed. You can check the
process groups with the command:
> ps f -eo pid,uid,gid,user,pgrp,command --cols=120
If the startup script for the process on the slave nodes
consists of at least two commands (like the startup with Myrinet
on the nodes), a helping shell will be created, and the MPICH task
will be a child of this shell with a new PID, but still in the
same process group. The odd thing is that MPICH will discover
this and enforce a new process group with this PID, in order to
become the process group leader. So the MPICH task jumps out of
the process group it was created in, and the intended kill of the
started process group will miss it, leaving the calculation tasks
running.
Solutions
To solve this, there are various possibilities available,
from which you may choose the one that best fits your setup of SGE.
1. Replace the helping shell with the MPICH task
In case of e.g. Myrinet, the helping shell is created by one
line in mpirun.ch_gm.pl:
$cmdline = "cd $wdir ; env $varenv $apps_cmd[$i] $apps_flags[$i]";
You can prefix the final call to your program with an "exec",
so that the line looks like:
$cmdline = "cd $wdir ; exec env $varenv $apps_cmd[$i] $apps_flags[$i]";
With the "exec", the started program will replace the existing
shell, and so remains the process group leader.
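The effect of the "exec" can be observed with plain sh, independent of MPICH. The following sketch (the function names are chosen here for illustration only) prints the PID of the helping shell and then the PID of the program it starts:

```shell
#!/bin/sh
# Each function prints two PIDs: first the helping shell, then the program.

# Without exec: the helping shell forks a child, so the PIDs differ.
# The trailing "true" keeps the shell from replacing itself on its own,
# which some shells do for the last command of a -c string.
pids_without_exec() {
    sh -c 'echo "$$"; cd /; sh -c "echo \$\$"; true'
}

# With exec: the program replaces the helping shell and keeps its PID.
pids_with_exec() {
    sh -c 'echo "$$"; cd /; exec sh -c "echo \$\$"'
}
```

With pids_with_exec both lines show the same number, which is the situation in which the started program is still the process group leader.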
2. Define MPICH_PROCESS_GROUP=no
When this environment variable is defined, the startup of the
MPICH task won't create a new process group. Where it must be
defined depends on the MPICH device in use:
ch_p4:     must be set on the master node and the slaves
Myrinet:   has only to be defined on the slaves
So you may define it in any file which is sourced during a
noninteractive login to the slave nodes. To set it on the master
node, define it in the submit script. If you want to propagate it
from there to the slave nodes (and avoid changing any login
file), edit the rsh-wrapper in the mpi directory of SGE, so that
the qrsh command used there includes -V and the variables become
available on the slaves:
echo $SGE_ROOT/bin/$ARC/qrsh -inherit -nostdin $rhost $cmd
exec $SGE_ROOT/bin/$ARC/qrsh -inherit -nostdin $rhost $cmd
else
echo $SGE_ROOT/bin/$ARC/qrsh -inherit $rhost $cmd
exec $SGE_ROOT/bin/$ARC/qrsh -inherit $rhost $cmd
should read:
echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd
else
echo $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit $rhost $cmd
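If you prefer not to edit the wrapper by hand, the same change can be sketched as a sed filter. This assumes that the qrsh lines in your wrapper look exactly like the ones shown above; the function name add_v_option is made up for this example.

```shell
#!/bin/sh
# Filter: insert -V directly after "qrsh" on every line that calls
# ".../qrsh -inherit". Lines that already read "qrsh -V -inherit" are
# left alone, because the pattern no longer matches them.
add_v_option() {
    sed 's|/qrsh -inherit|/qrsh -V -inherit|'
}
```

Feed the wrapper through the filter into a new file and compare both versions with diff before you replace the original.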
3. Recompile MPICH
When you have the source of the programs in use, it is also
possible to disable the creation of a new process group in MPICH
and so remove this behaviour completely.
For MPICH 1.2.5.2 and 1.2.6:
The file to change is session.c in mpich-1.2.5.2/mpid/util.
Change the line:
#if defined(HAVE_SETSID) && defined(HAVE_ISATTY) && defined(SET_NEW_PGRP)
to
#if defined(HAVE_SETSID) && defined(HAVE_ISATTY) && defined(SET_NEW_PGRP) && 0
Appending "&& 0" only disables the block, so you can easily go
back at a later point in time.
Because the routine will be linked into your final program, you
have to recompile all your programs. It is not sufficient just to
install this new version in /usr/lib (or your path to mpich),
unless you have used the shared libraries of MPICH. Whether a
binary delivered by a vendor uses the shared version of MPICH or
has it statically linked in can be checked with the Linux
command ldd.
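A minimal sketch of such an ldd check follows. The library name pattern "libmpich" is an assumption and may differ for your installation, and the function name is made up for this example.

```shell
#!/bin/sh
# Succeeds (exit status 0) if ldd lists an MPICH shared library for the
# given binary. The pattern "libmpich" is an assumption about the name.
uses_shared_mpich() {
    ldd "$1" 2>/dev/null | grep -q 'libmpich'
}
```

If the function fails for an MPICH program, the binary most likely has MPICH statically linked in and must be recompiled to pick up the change.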
Before you run ./configure for MPICH, you should export the
variable RSHCOMMAND=rsh, to get the desired rsh command compiled
into the libraries. To create shared libraries and use them during
the compilation of an MPICH program, please refer to the MPICH
documentation.
Number of tasks spread to the nodes
The number of calls to qrsh differs, depending on the MPICH
version in use. If you use MPICH with the ch_p4 device over
Ethernet, the first task is always started locally without any
qrsh call, so qrsh is called only (n-1) times. Hence you can set
"job_is_first_task" to true in the definition of your PE and allow
only (n-1) calls to qrsh by your job.
If you are using Myrinet, it's different. In this case qrsh is
used exactly n times. So set "job_is_first_task" to false.
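The counting rule can be written down as a small shell function. This is only a restatement of the text above; the function name is made up, and "ch_gm" is used here as the device name for Myrinet, following mpirun.ch_gm.pl.

```shell
#!/bin/sh
# Print how many qrsh calls a job with $2 tasks makes for device $1.
qrsh_calls() {
    case "$1" in
        ch_p4) echo $(( $2 - 1 )) ;;   # first task is started locally
        ch_gm) echo "$2" ;;            # every task is started via qrsh
        *)     echo "unknown device: $1" >&2; return 1 ;;
    esac
}
```

For a job with 4 tasks this gives 3 calls for ch_p4 and 4 calls for Myrinet.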
What rsh command is compiled into a binary?
If you only got the binary from a vendor, and you are in
doubt which rsh command was compiled into the program, you can try
to get some more information with:
> strings <programname> | awk ' /(rsh|ssh)/ { print } '
This may give you more information than you like, but you
should at least get some clue about it.
Option -nolocal to mpirun
You should avoid setting this option, either in any
mpirun.args script or as an option to the command. The first
problem would be that the head node of the MPI job (this is not
the master node, but one of the selected slaves) will not be used
at all, and so the processes will only be spread to the other
nodes in the list (if you got just 2 slots on one machine, the
job can't run at all this way). The second problem is the rsh
command used: because the first rsh command just starts the head
node (which would be on a slave this way), the setting of
P4_RSHCOMMAND is ignored.
Uneven distribution of processes to the slave nodes with two network cards
As Andreas pointed out in http://gridengine.sunsource.net/servlets/ReadMsg?msgId=15741&listName=users,
you have to check whether the first scan of the machines file by
MPICH can remove an entry at all, because `hostname` may give a
different name than the one included in the machines file (because
you are using a host_aliases file). Depending on the setup of your
cluster, it may be necessary to change just one entry back to the
one delivered by `hostname`. If you change all entries back to the
external interface, your program may use the wrong network
for the communication by MPICH. This may, or may not, be the
desired result. The code to change just one entry back to the
`hostname` is a small addition to the PeHostfile2MachineFile
subroutine in startmpi.sh:
PeHostfile2MachineFile()
{
    myhostname=`hostname`
    myalias=`grep $myhostname $SGE_ROOT/default/common/host_aliases | cut -f 1 -d " "`
    cat $1 | while read line; do
        # echo $line
        host=`echo $line|cut -f1 -d" "|cut -f1 -d"."`
        nslots=`echo $line|cut -f2 -d" "`
        i=1
        while [ $i -le $nslots ]; do
            # add here code to map regular hostnames into ATM hostnames
            if [ $host = "$myalias" ] ; then # be sure to include " for the second argument
                echo $myhostname
                unset myalias
            else
                echo $host
            fi
            i=`expr $i + 1`
        done
    done
}
Don't include this if you already changed startmpi.sh for
the handling of Turbomole, as you would get the external name
twice. In general, I prefer having one PE for each application.
Thus copy the mpi folder in $SGE_ROOT and name it e.g. turbo, adf,
linda etc. and create corresponding PEs.
Wrong interface selected for the back channel of the MPICH-tasks with the ch_p4-device
With the hints given above, you may achieve an even
distribution of your MPICH tasks, but the network interface used
for the communication of the slave processes to the master node
may still be the wrong one. You will notice this when you have a
look at the program calls of the slave processes. Assume
'ic...' is the internal network (which should be used) and
'node...' the external one for the NFS traffic. Then the call:
rsh ic002 -l user -x -n /home/user/a.out node001 31052 \-p4amslave \-p4yourname ic002 \-p4rmrank 1
will use the wrong interface for the information sent back to
the master. It should have the form:
rsh ic002 -l user -x -n /home/user/a.out ic001 33813 \-p4amslave \-p4yourname ic002 \-p4rmrank 1
This can be changed by setting and exporting the $MPI_HOST
environment variable by hand before the call to mpirun, so that
it holds the name of the internal interface on the execution node
and not the (default) `hostname` of the external interface. You
can also do it for all users in the mpich.args, where $MPI_HOST is
set if it was unset before the call to mpirun, by changing:
MPI_HOST=`hostname`
to:
MPI_HOST=`grep $(hostname) $SGE_ROOT/default/common/host_aliases | cut -f 1 -d " "`
In case you don't use a host_aliases file at all, but
still want to change to another interface, you can e.g. use a sed
command like:
MPI_HOST=`hostname | sed "s/^node/ic/"`
If you decide to go this way, take care with the "Hint for
Turbomole", because you also have to take the internal hostname
into account there and use it. On the other hand, you will not
have to apply the change from "Uneven distribution of
processes..." if you already changed $MPI_HOST for mpirun. All of
this depends highly on the setup of your cluster and the desired
use of your network interfaces. So these are only the places
where you can change your setup; there is no golden rule which
fits all configurations. Just implement them and have a look at
some test jobs, to see how they are distributed to the nodes and
which interfaces they are using.
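Before touching any MPICH file, both ways of deriving the internal name can be tried out on their own. The sketch below wraps them in two functions with made-up names; it assumes a host_aliases file whose first field is the internal name, as in the grep/cut line above, and adds -w to grep so that node001 does not also match node0010.

```shell
#!/bin/sh
# $1 = name delivered by `hostname`, $2 = path to a host_aliases file.
# Prints the first field of the first line mentioning the name as a word.
internal_name_from_aliases() {
    grep -w -- "$1" "$2" | cut -f 1 -d " " | head -n 1
}

# $1 = name delivered by `hostname`. Replaces a leading "node" by "ic",
# which matches the naming used in the example calls above.
internal_name_by_prefix() {
    echo "$1" | sed "s/^node/ic/"
}
```

Run them with the output of `hostname` on one of your nodes and check that the result is the name of the interface you want MPICH to use.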
Hint for ADF
Some programs (like ADF parallel with MPICH) have a
hard-coded /usr/bin/rsh inside for their rsh command. To access
any wrapper and let SGE take control over the slaves, you must
set:
export P4_RSHCOMMAND=rsh
in the job script. Otherwise, the built-in /usr/bin/rsh will
always be used. For the usage of ssh, please refer to the
appropriate HowTo for ssh, and leave the settings for MPICH at rsh
to access the rsh-wrapper of SGE, which will then use ssh in the
end.
Hint for Turbomole
Turbomole uses the ch_p4 device of MPICH. But because this
program always starts one slave process more (as a server task
without CPU load) than the parallel slots requested, you have to
set "job_is_first_task" to false to allow this additional qrsh
call. As mentioned above, MPICH will make only (n-1) calls to qrsh
for n tasks, but Turbomole requests (n+1) tasks, so you end up
with n calls again.
Because Turbomole makes one more call to qrsh than SGE is
aware of (MPICH starts one task locally without qrsh, but still
removes one line during the first scan of the machines file), the
machines file created on the master node will have one line less
than necessary. This leads to a second scan of the machines file,
starting at the beginning. If the first line in the machines file
is also the master node where the server task for Turbomole is
running, all is well, because the server process running there
doesn't use any CPU time, and another working slave task is
welcome. But in all other cases, the slave processes will be
unevenly distributed to the slave nodes. Therefore just add one
line to the already existing if-clause in startmpi.sh in the
$SGE_ROOT/mpi directory, to create one more entry with the name
of the master node:
if [ $unique = 1 ]; then
PeHostfile2MachineFile $pe_hostfile | uniq >> $machines
else
PeHostfile2MachineFile $pe_hostfile >> $machines
#
# Append own hostname to the machines file.
#
echo `hostname` >> $machines
fi
This has the positive side effect of providing one entry in the
machines file which can definitely be removed during the first
scan by MPICH. It will remove the line with `hostname`. If
there is none, nothing is removed. This may be the case when you
have two network cards in the slaves, and `hostname` gives the
name of the external interface, but the host list created by SGE
contains only the internal names.
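To see what the first scan does to a given machines file, its effect as described above can be imitated with a small awk filter. This is only a simulation for testing your setup, not code taken from MPICH, and the function name is made up.

```shell
#!/bin/sh
# Drop the first line of stdin that is equal to $1 and pass all other
# lines through unchanged. If no line matches, nothing is removed.
remove_first_match() {
    awk -v h="$1" '!done && $0 == h { done = 1; next } { print }'
}
```

Piping your generated machines file through remove_first_match "`hostname`" shows which entries are left for the calls to qrsh.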