Topic:
Tight Integration of the MPICH2 library into SGE.
Author:
Reuti, reuti__at__staff.uni-marburg.de;
Philipps-University of Marburg, Germany
Version:
1.1 -- 2008-11-25 Updated release, comments and corrections are
welcome
1.1.1 -- 2009-03-10 Updated a wrong spelling of $SGE_TASK_ID in the
supplied scripts (thx to Kenneth for mentioning this)
Contents:
- Prerequisites
- Introduction to the MPICH2 family
- Tight Integration of the mpd startup method
- Tight Integration of the daemonsless smpd startup method
- Tight Integration of the daemon-based smpd startup method
- Tight Integration of the gforker startup method
- Nodes with more than one network interface
- References and Documents
Prerequisites
Configuration of SGE with qconf or the GUI
You should already know how to change settings in SGE, like to
setup and change a queue definition or the entries in the PE
configuration. Additional information about queues and parallel
interfaces you can get from the man pages "queue_conf" and "sge_pe" of
SGE (make sure the SGE man pages are defined in your $MANPATH).
Target platform
This Howto targets the MPICH2 version 1.0.8 and SGE 6.2 on Linux.
Most likely it will work under other operating systems in the same way.
Some of the commands will in this case need slight modifications. It
will not work this way for MPICH2 version 1.0, as some things were only
adjusted since 1.0.1, which will allow an easy Tight Integration.
MPICH2
The MPICH2 is a library from the Argonne National Laboratory (http://www.anl.gov) which is an
implementation of the MPI-2 standard. Before you start with the
integration of MPICH2 into SGE, you should already be familiar with the
operation of MPICH2 outside of SGE and know how to compile a parallel
program using MPICH2.
Included setups and scripts
The supplied archive in [1] contains the
necessary scripts for the mpd and smpd startup methods (for the gforker
method only the example shell script is included, as this startup
method needs no scripts to start and stop any daemon). It contains
scripts and programs
similar to the distribution of the PVM and MPICH integration package in
SGE. For installing it for common usage in the whole cluster, you may
like to untar it in $SGE_ROOT to get the new directories
$SGE_ROOT/mpich2_mpd, $SGE_ROOT/mpich2_smpd, $SGE_ROOT/mpich2_smpd_rsh
and $SGE_ROOT/mpich2_gforker.
A short program is provided in [2], which will allow you to
observe the correct distribution of the spawned tasks.
Queue configuration
The supplied jobscripts are to be executed under the bash shell.
As the
default setting in SGE during installation is set to use the csh shell,
you might
need to change either two entries in the queue definition to
read:
$ qconf -sq all.q
...
shell /bin/bash
...
shell_start_mode unix_behavior
...
(please see "man queue_conf" for details about this setting), or submit
the (parallel) jobs with the additional argument:
$ qsub -S /bin/bash ...
Please note, that under the Linux operating system /bin/sh is often a
link to /bin/bash and can be abbreviated this way.
Introduction to the MPICH2 family
This new MPICH2 implementation of the MPI-2 standard was
created to supersede the widely used MPICH(1) implementation. Besides
implementing the MPI-2 standard, another goal was a faster startup. To
give the user a greater flexibility, there are (for now) 3 startup
methods implemented:
- mpd: As the primary startup method mpd
is introduced in MPICH2. It’s based on the script language Python to
startup a so called ring of machines. Giving mpdboot
a list of nodes it will startup daemons on the requested machines,
which
can be used immediately for the execution of parallel programs inside
this ring. This is convenient for the interactive
use of a parallel program, as the only thing which must be prepared is
a list of to be used nodes.
Due to limitations in mpdboot,
it must have a connection to the daemons on the nodes via stdin/stdout,
until the mpd
fork into daemonland. Therefore this Howto will instead launch the mpd in a script
defined in start_proc_args one after the other, so that the mpd are still bound the started sge_shepherd.
- smpd: This startup method can be used in a daemon based
or daemonless mode. The daemon based startup is not creating all the
daemons on the nodes according to a nodelist on its own (like it is
done by the mpdboot command
in the mpd startup method),
but the daemons have to be started
before the execution of the main program, e.g. by a script. In this
Howto, a startup of the daemons will be presented where they are
started by the given procedure to start_proc_args.
A daemonless startup is very similar to the startup of the tasks in the
former MPICH(1). Although it includes the same scripts from the
original
$SGE_ROOT/mpi, it’s included here (with some editing to the templates),
so that it can easily be used with a still installed $SGE_ROOT/mpi
without any side effects.
- gforker: Programs started under gforker are
limited to one machine and supports only forks for additional processes.
Be aware, that for each startup method and chosen way to compile
them, you will get a set of mpirun and/or mpiexec for
each of them. They are not interchangeable! Hence, once you installed mpd
and compiled a program to run in the ring, you can’t switch
to smpd simply by using a different mpirun or mpiexec.
Instead you have to recompile (or at least relink) your program with
the intended libraries to be used with this specific startup method.
This means, that you have to plan carefully your set $PATH during
compilation and execution of the parallel program, to get a correct
behavior. Not doing so will result in strange error messages, which
will not point directly to the cause of trouble. After compiling your
application software, it may be advisable not to rely on the set $PATH
in your interactive shell for the submission, but to set it explicitly
in the submitted script to SGE, as we will do it in this Howto for
demonstration purpose. Also note, that the preferred startup command in
MPICH2 is mpiexec, not mpirun.
Tight Integration of the mpd
startup
method
First we discuss the integration of the preferred startup
method in MPICH2, called mpd.
You can compile MPICH2 after you
configured it; maybe with an alternative path for the installation
of the parallel library:
$ ./configure --prefix=/home/reuti/local/mpich2-1.0.8/mpd
After the usual make and make
install we can compile the short program which is supplied in
[2]
with:
$ mpicc -o mpihello mpihello.c
Similar to the PVM-Integration, we need a small helping
program to start the daemons as a task on the slave nodes using the
qrsh-command. In some way, this start_mpich2 can be seen as a generic
program extending SGE with the ability to run a qrsh-command in the
background, which can easily be modified for similar startup methods.
If you installed the whole package like suggested in
$SGE_ROOT/mpich2_mpd, set the working directory to
$SGE_ROOT/mpich2_mpd/src and compile the included program with:
$ ./aimk
$ ./install.sh
The installation process will put the helping program mpich2_mpd
in a created directory $SGE_ROOT/mpich2_mpd/bin, which is the default
location of the included script startmpich2.sh to look for this
program. This helper program must be compiled for every platform you
have in the cluster, and on which you want to run this startup method.
A parallel environment for this startup method may look like:
$ qconf -sp mpich2_mpd
pe_name mpich2_mpd
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /usr/sge/mpich2_mpd/startmpich2.sh -catch_rsh $pe_hostfile \
/home/reuti/local/mpich2-1.0.8/mpd
stop_proc_args /usr/sge/mpich2_mpd/stopmpich2.sh -catch_rsh \
/home/reuti/local/mpich2-1.0.8/mpd
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
Remember to attach this PE to a cluster queue of your choice and
to adjust the path to your MPICH installation. As the chain of Python
modules used in this startup method will create additonal processes and
processgroups, it’s essential to include in your cluster configuration
a special switch, which will kill the processes at the end of a job or
after an issued qdel by
identifying the associated processes by an additonal group id, which is
attached to all spawned processes on a slave node (by default the
processgroup of the first started process by qrsh_starter is used to kill just
this complete processgroup including its kids):
$ qconf -sconf
...
execd_params ENABLE_ADDGRP_KILL=TRUE
...
Having done so, we can
now submit a job with the ususal sequence:
$ qsub -pe mpich2_mpd 4 mpich2_mpd.sh
This will first start a local mpd
process on the master node of the parallel job. Even this process will
be started by a local qrsh in
the startmpich2.sh, so that the sge_shepherd
will stay alive. After it was launched successfully, its used port will
be queried and the accompanying daemons on the slave nodes of the
parallel job will be started with this information.
The mpich2_mdp.sh will generate a *.po$JOB_ID like:
$ cat mpich2.sh.po628
-catch_rsh /var/spool/sge/pc15381/active_jobs/628.1/pe_hostfile /home/reuti/local/mpich2-1.0.8/mpd
pc15381:2
pc15370:2
startmpich2.sh: check for local mpd daemon (1 of 10)
/usr/sge/bin/lx24-x86/qrsh -inherit -V pc15381 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
startmpich2.sh: check for local mpd daemon (2 of 10)
startmpich2.sh: check for mpd daemons (1 of 10)
/usr/sge/bin/lx24-x86/qrsh -inherit -V pc15370 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
startmpich2.sh: check for mpd daemons (2 of 10)
startmpich2.sh: got all 2 of 2 nodes
-catch_rsh /home/reuti/local/mpich2-1.0.8/mpd
The first check will only look for the local mpd daemon, i.e. whether it can
contact the local mpd daemon
by a mpdtrace -l, as we need
this information to instruct the other daemons to use the port which
was selected by the first started mpd.
The
following loop will start the daemons on all remaining slave nodes, and
waits until all are up and running. After the startmpich2.sh the mpd processes are available, and
the user program started by an mpiexec
(or several) in the job script will spawn all processes to the already
running mpd. On the master
node of the parallel job the following processes can be discovered:
$ ssh pc15381 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
22110 1 22110 /usr/sge/bin/lx24-x86/sge_execd
31712 22110 31712 \_ sge_shepherd-628 -bg
31775 31712 31775 | \_ /bin/sh /var/spool/sge/pc15381/job_scripts/628
31776 31775 31775 | \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpiexec -machinefile /tmp/628.1.all.q/mac
31744 22110 31744 \_ sge_shepherd-628 -bg
31745 31744 31745 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/628.1/1.pc15381
31755 31745 31755 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31777 31755 31777 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31780 31777 31780 | \_ /home/reuti/mpihello
31778 31755 31778 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31779 31778 31779 \_ /home/reuti/mpihello
31736 1 31713 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15381 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd
31759 1 31713 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15370 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -
The distribution of user processes is according to the granted
slot allocation. The two other processes can be found on the other node:
$ ssh pc15370 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
15848 1 15848 /usr/sge/bin/lx24-x86/sge_execd
3146 15848 3146 \_ sge_shepherd-628 -bg
3148 3146 3148 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15370/active_jobs/628.1/1.pc15370
3156 3148 3156 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
3157 3156 3157 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
3159 3157 3159 | \_ /home/reuti/mpihello
3158 3156 3158 \_ python2.5 /home/reuti/local/mpich2-1.0.8/mpd/bin/mpd -h pc15381 -p 14581 -n
3160 3158 3160 \_ /home/reuti/mpihello
On the master node of the parallel job, the console of the mpd-ring will be created in /tmp,
as this is unfortunately hard-coded in several places in the MPICH2
source (it might change in version 1.1 of MPICH2 to honor the $TMPDIR
which is already set by SGE). To destinguish between several consoles
of the same user on a
node, the environment variable MPD_CON_EXT is set to reflect the
jobnumber and task-id of SGE in the name of the console file.
Tight Integration of the daemonless smpd startup
method
To compile MPICH2 for a smpd-based startup, it must first
be configured (after a make distclean,
in case you just walked
through the mpd startup
method before):
$ ./configure --prefix=/home/reuti/local/mpich2-1.0.8/smpd --with-pm=smpd
and to get a Tight Integration we need a PE like the following
(including a -catch_rsh to the start script of the PE):
$ qconf -sp mpich2_smpd_rsh
pe_name mpich2_smpd_rsh
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /usr/sge/mpich2_smpd_rsh/startmpich2.sh -catch_rsh \
$pe_hostfile
stop_proc_args /usr/sge/mpich2_smpd_rsh/stopmpich2.sh
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
Please lookup in the MPICH2 documentation, how to create a .smpd
file with a "phrase" in it. After submitting the job in exact the same
way as before (but this time taking the script mpich2_smpd_rsh.sh in
the qsub command):
$ qsub -pe mpich2_smpd_rsh 4 mpich2_smpd_rsh.sh
you should see a distribution on the master node of your parallel
job like:
$ ssh pc15381 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
22110 1 22110 /usr/sge/bin/lx24-x86/sge_execd
31930 22110 31930 \_ sge_shepherd-630 -bg
31955 31930 31955 | \_ /bin/sh /var/spool/sge/pc15381/job_scripts/630
31956 31955 31955 | \_ mpiexec -rsh -nopm -n 4 -machinefile /tmp/630.1.all.q/machines /home/reuti/mpihello
31957 31956 31955 | \_ mpiexec -rsh -nopm -n 4 -machinefile /tmp/630.1.all.q/machines /home/reuti/mpihello
31958 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15381 env PMI_RANK=0 PMI_SIZE=4 PMI_KVS=359B9A86
31959 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15381 env PMI_RANK=1 PMI_SIZE=4 PMI_KVS=359B9A86
31960 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15370 env PMI_RANK=2 PMI_SIZE=4 PMI_KVS=359B9A86
31961 31956 31955 | \_ /usr/sge/bin/lx24-x86/qrsh -inherit pc15370 env PMI_RANK=3 PMI_SIZE=4 PMI_KVS=359B9A86
31986 22110 31986 \_ sge_shepherd-630 -bg
31987 31986 31987 | \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/630.1/1.pc15381
32004 31987 32004 | \_ /home/reuti/mpihello
31991 22110 31991 \_ sge_shepherd-630 -bg
31992 31991 31992 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/630.1/2.pc15381
32010 31992 32010 \_ /home/reuti/mpihello
The important thing is, that the started script including the mpiexec
and the program mpihello are under full SGE control.
(Side note: the default command compiled into MPICH2 this way is ssh -x. You may replace this by
changing in the MPICH2 source
$MPICH2_ROOT/src/pm/smpd/mpiexec_rsh.c in the routine mpiexec_rsh()
the default value ssh -x to a
plain rsh, or change it each
time
during execution of your application program by setting the environment
variable “MPIEXEC_RSH=rsh; export MPIEXEC_RSH” to get access to a the
rsh-wrapper, like in the original MPICH implementation.)
Tight Integration of the daemon-based smpd startup
method
Like for the mpd startup method, we will nees a small
helping program. As different parameters have to be used, this program
is not identical to the one used in the tight mpd integration.
If you installed the whole package like suggested in
$SGE_ROOT/mpich2_smpd, set the working directory to
$SGE_ROOT/mpich2_smpd/src and compile the included program with:
$ ./aimk
$ ./install.sh
The installation process will put the helping program mpich2_smpd
in a created directory $SGE_ROOT/mpich2_smpd/bin, which is the default
location of the included script startmpich2.sh to look for this
program. A parallel environment for this startup method may look like:
$ qconf -sp mpich2_smpd
pe_name mpich2_smpd
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /usr/sge/mpich2_smpd/startmpich2.sh -catch_rsh $pe_hostfile \
/home/reuti/local/mpich2-1.0.8/smpd
stop_proc_args /usr/sge/mpich2_smpd/stopmpich2.sh -catch_rsh \
/home/reuti/local/mpich2-1.0.8/smpd
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE
If we start the daemons on our own, we have to select a free port.
Although it maybe not safe in all cluster setups, the included formula
in startmpich2.sh, stopmpich2.sh and the demonstration submit script
mpich2_smpd.sh uses “$JOB_ID MOD 5000 + 20000” for the port.
Depending
on your job turnaround in your cluster, you may modify it in all
locations where it’s defined. To force the smpds not to fork themselves
into daemon land, they are started with the additional parameter “-d
0”. According to the MPICH2 team, this will not have any speed impact
(because the level of debugging is set to 0), but only prevent the
daemons from forking. Having this setup in a proper way, we can submit
the demonstration job:
$ qsub -pe mpich2_smpd 4 mpich2_smpd.sh
and observe the distributed tasks on the nodes, after looking at
the selected nodes:
$ qstat -g t
job-ID prior name user state submit/start at queue master ja-task-ID
------------------------------------------------------------------------------------------------------------------
643 0.55500 mpich2_smp reuti r 11/25/2008 13:11:37 all.q@pc15370.Chemie.Uni-Marbu SLAVE
all.q@pc15370.Chemie.Uni-Marbu SLAVE
643 0.55500 mpich2_smp reuti r 11/25/2008 13:11:37 all.q@pc15381.Chemie.Uni-Marbu MASTER
all.q@pc15381.Chemie.Uni-Marbu SLAVE
all.q@pc15381.Chemie.Uni-Marbu SLAVE
On the head node of the MPICH2 job, a process distribution like
the following can be observed:
$ ssh pc15381 ps -e f --cols=120
PID TTY STAT TIME COMMAND
...
22110 ? Sl 1:09 /usr/sge/bin/lx24-x86/sge_execd
2446 ? S 0:00 \_ sge_shepherd-643 -bg
2518 ? Ss 0:00 | \_ /bin/sh /var/spool/sge/pc15381/job_scripts/643
2519 ? S 0:00 | \_ mpiexec -n 4 -machinefile /tmp/643.1.all.q/machines -port 20643 /home/reuti/mpihe
2485 ? Sl 0:00 \_ sge_shepherd-643 -bg
2486 ? Ss 0:00 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/643.1/1.pc1
2495 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
2520 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
2521 ? R 0:12 \_ /home/reuti/mpihello
2522 ? R 0:11 \_ /home/reuti/mpihello
2477 ? Sl 0:00 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15381 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -
2497 ? Sl 0:00 /usr/sge/bin/lx24-x86/qrsh -inherit -V pc15370 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -
On the slave node only the only the daemon and the attached
processes are shown:
$ ssh pc15370 ps -e f --cols=120
PID TTY STAT TIME COMMAND
...
15848 ? Sl 2:06 /usr/sge/bin/lx24-x86/sge_execd
23121 ? Sl 0:00 \_ sge_shepherd-643 -bg
23122 ? Ss 0:00 \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15370/active_jobs/643.1/1.pc1
23130 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
23131 ? S 0:00 \_ /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
23132 ? R 0:32 \_ /home/reuti/mpihello
23133 ? R 0:32 \_ /home/reuti/mpihello
The forked-off qrsh-commands by the startmpich2.sh (and
start_mpich2 program) are no longer bound to the starting script in
start_proc_args, but they are not consuming any CPU time or need to be
shut down during a qdel (they are just waiting for the shutdown of the
spawned daemons on the slave nodes). Important is, that the working
tasks of the mpihello are bound to the process chain, so that
the accounting will be correct, and also a controlled shutdown of the
daemons is possible. To give some feedback to the user of the started
tasks, the *.po$JOB_ID file will contain the check of the started
MPICH2 universe:
$ cat mpich2_smpd.sh.po643
-catch_rsh /var/spool/sge/pc15381/active_jobs/643.1/pe_hostfile /home/reuti/local/mpich2-1.0.8/smpd
pc15381
pc15381
pc15370
pc15370
/usr/sge/bin/lx24-x86/qrsh -inherit pc15381 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
/usr/sge/bin/lx24-x86/qrsh -inherit pc15370 /home/reuti/local/mpich2-1.0.8/smpd/bin/smpd -port 20643 -d 0
startmpich2.sh: check for smpd daemons (1 of 10)
startmpich2.sh: found running smpd on pc15381
startmpich2.sh: found running smpd on pc15370
startmpich2.sh: got all 2 of 2 nodes
-catch_rsh /home/reuti/local/mpich2-1.0.8/smpd
shutdown smpd on pc15370
shutdown smpd on pc15381
If all is running fine, you may comment out these lines to shorten
the output a little bit and avoid any confusion to the user. Depending
of your personal taste, you may put the definition of your MPICH2 path
in a file like .bashrc, which will be sourced during a non-interactive
login.
Note: it is mandatory, that in your jobscript you include a
line "export SMPD_OPTION_NO_DYNAMIC_HOSTS=1" besides the port
identification. Otherwise, the node where the jobscript is running will
be added to your ~/.smpd. This will prevent a proper shutdown, although
this environment variable is already set during the start and stop of
the daemons in the appropriate scripts of the PE. Also the option -V
will be used in the accompanying skripts for this Howto.
Tight Integration of the gforker startup
method
First we discuss the integration of a startup method,
which
is limited to one machine and hence need no network communication at
all. The command line to compile MPICH2 this way is:
$ ./configure --prefix=/home/reuti/local/mpich2-1.0.8/gforker --with-pm=gforker
After the usual make and make install we can
compile the short program which is supplied in [2] with:
$ mpicc -o mpihello mpihello.c
Although we will run only on one machine, we will use a parallel
environment (PE) inside SGE, to stay conform with the idea of SGE to
request more than one slot by requesting a parallel environment in the
submit command. This PE may look like:
$ qconf -sp mpich2_gforker
pe_name mpich2_gforker
slots 8
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $pe_slots
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
accounting_summary FALSE
Remember to add this PE to a cluster queue of your choice.
$ qsub -pe mpich2_gforker 4 mpich2_gforker.sh
And with:
$ ssh pc15370 ps -e f -o pid,ppid,pgrp,command --cols=120
PID PPID PGRP COMMAND
...
15848 1 15848 /usr/sge/bin/lx24-x86/sge_execd
7445 15848 7445 \_ sge_shepherd-647 -bg
7447 7445 7447 \_ /bin/sh /var/spool/sge/pc15370/job_scripts/647
7448 7447 7447 \_ mpiexec -n 4 /home/reuti/mpihello
7449 7448 7447 \_ /home/reuti/mpihello
7450 7448 7447 \_ /home/reuti/mpihello
7451 7448 7447 \_ /home/reuti/mpihello
7452 7448 7447 \_ /home/reuti/mpihello
we already got the proper startup and Tight Integration of all
started processes.
Nodes with more than one network
interface
With the version 1.0.8 of mpich2 it’s possible to direct
the network communication to a dedicated interface. For this to work,
you have to adjust the generated machine file, i.e. the file
$TMPDIR/machines, which is created in the start_proc_args defined
script, to include the interface name after the number of slots. E.g.:
node01:2 ifhn=node01-grid
References
and Documents
SGE-MPICH2 Integration
[1] Archive with
all the scripts used in this Howto: mpich2-62.tgz.
It should be installed in your $SGE_ROOT.
[2] Archive with a small
MPICH2 program to check the correct distribution of all the tasks: mpihello.tgz.
MPICH2
The latest version of MPICH2 and build instructions can be
downloaded from (http://www.mcs.anl.gov/research/projects/mpich2/).
MPI documentation in general and tutorials
For some general introduction to MPI and MPI-Programming, you can
study the following documents: