Topic:
Loose and tight integration of the PVM library into SGE.
Author:
Reuti, reuti__at__staff.uni-marburg.de;
Philipps-University of Marburg, Germany
Version:
1.0 -- 2005-03-26 Initial release, comments and corrections are
welcome
Contents:
- Advantages of a Tight Integration
- Prerequisites
- Loose Integration
- Additions for Tight Integration
- Tight Integration
- Nodes with more than one network interface
- Restrictions and Future Work
- References and Documents
Note:
This HOWTO complements the information contained in the
$SGE_ROOT/pvm directory of the Grid Engine distribution.
Advantages of a Tight Integration
The files originally included for the PVM integration offer
only a loose integration of PVM into SGE. This means that the
necessary daemons (which build up the PVM) are not under the
control of SGE, and that files created on the slave nodes by
PVM for its internal management are not reliably deleted when a
job is aborted. The accounting will also be incorrect, because
the PVM daemons started via rsh on the slave nodes are not related
in any way to SGE.
With Tight Integration, on the other hand, the daemons are under
SGE control, the files are cleaned up and the accounting is
correct. In addition, rsh no longer has to be enabled between the
nodes, as all startups of the daemons on the slave nodes are
handled by the built-in qrsh of SGE.
Prerequisites
Configuration of SGE with qconf or the GUI
You should already know how to change settings in SGE, such as
setting up and changing a queue definition or the entries in the PE
configuration. Additional information about queues and parallel
environments can be found in the SGE man pages "queue_conf" and
"sge_pe" (make sure the SGE man pages are included in your
$MANPATH).
Target platform
This Howto targets PVM version 3.4.5 on SGE 6.0. Some new
platforms were added to the build step that creates
the supporting start_pvm and stop_pvm programs. The
other three included programs are only examples and are not needed
for the integration. For users of SGE 5.3, a backport is also
available.
PVM
The PVM (Parallel Virtual Machine) is a framework from the Oak
Ridge National Laboratory (http://www.csm.ornl.gov/pvm/pvm_home.html)
and provides an interface for parallel programs, which allows the
MPMD (multiple program multiple data) paradigm. Before you start
with the integration of PVM into SGE, you should already be
familiar with the operation of PVM outside SGE, like starting the
daemons and requesting information about the started virtual
machine from the pvm-console. Although this is not strictly
necessary for the integration into SGE, it will make the applied
configurations easier to understand (and failures easier to track
down if something goes wrong). No patch or modification to the
original distribution of PVM is necessary.
Included setups and scripts
The supplied archive in [1]
supersedes the provided $SGE_ROOT/pvm. It contains modified
scripts and programs from the original PVM integration package in
SGE, which now also enable the Tight Integration. You may want to
save the original $SGE_ROOT/pvm before you untar this package in
the same location. If you are using (and will continue to use) the
Loose Integration, you can still use this new package, as by
default it mimics the original behavior and your existing
configurations should continue to work without any change.
Another short running program is provided in [2],
which will allow you to observe the correct distribution of the
spawned tasks and removal of files in /tmp.
Loose Integration
As there is only one distribution of $SGE_ROOT/pvm now,
the steps to install this new package are the same
as for the original integration package. After you have changed to
$SGE_ROOT and untarred the supplied archive there as the SGE admin
user, you have to compile some helper programs. To do this, change
to $SGE_ROOT/pvm/src and set the necessary environment variables,
for example:
$ export PVM_ROOT=/home/reuti/pvm3
$ export PVM_ARCH=LINUX
$ export SGE_ROOT=/usr/sge
Adjust the locations to your own installation. Be aware that
PVM_ARCH follows a different naming convention than the
architecture detected by SGE. On a Macintosh, for example, SGE
refers to the architecture as "darwin" and PVM as "DARWIN" (for
Linux it may be "lx24-x86" for SGE and "LINUX" for PVM). With these
settings you can compile the programs by issuing ./aimk in
$SGE_ROOT/pvm/src. You should now have a directory (e.g. lx24-x86)
containing the compiled programs. The script ./install.sh, executed
in the same location, will copy these files to the correct target
in $SGE_ROOT/pvm/bin/<arch>, where startpvm.sh and
stoppvm.sh expect them. The two programs start_pvm
and stop_pvm (besides the three examples) help the
startpvm.sh and stoppvm.sh scripts to set up the PVM properly.
The start_pvm program forks a child process, which
starts the PVM, while the parent watches for the correct
startup and only succeeds if the requested PVM could be
established. The stop_pvm program just sends a halt command to the
PVM.
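As the two naming schemes differ, PVM_ARCH can be derived from the
SGE architecture string with a small helper. The following is only
a sketch: the function name sge_to_pvm_arch is made up for this
example, and it knows just the two pairs mentioned above, so other
platforms have to be added by hand.

```shell
# Map an SGE architecture string to the matching PVM_ARCH name.
# Only the two pairs named in this HOWTO are covered.
sge_to_pvm_arch() {
    case "$1" in
        lx24-x86) echo "LINUX" ;;
        darwin)   echo "DARWIN" ;;
        *)        echo "unknown SGE architecture: $1" >&2
                  return 1 ;;
    esac
}
```

It could be used as "PVM_ARCH=$(sge_to_pvm_arch lx24-x86); export
PVM_ARCH" before calling ./aimk.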
Using the command line qconf or the qmon GUI, a sample setting
for a PVM PE may look like:
$ qconf -sp pvm
pe_name pvm
slots 46
user_lists NONE
xuser_lists NONE
start_proc_args /usr/sge/pvm/startpvm.sh $pe_hostfile $host /home/reuti/pvm3
stop_proc_args /usr/sge/pvm/stoppvm.sh $pe_hostfile $host
allocation_rule 1
control_slaves FALSE
job_is_first_task TRUE
urgency_slots min
This setup is for SGE 6.x, where the PE must be listed in the
"pe_list" entry of each cluster queue in which it should be used.
For SGE 5.3, the queues to be used are defined in the PE itself.
To have a working example which runs for some seconds, we can
compile hello.c and hello_other.c in your home directory (taken
from the PVM User's Guide; you can find them in the supplied
pvm_hello.tgz archive) with:
$ gcc -o hello hello.c -I/home/reuti/pvm3/include -L/home/reuti/pvm3/lib/LINUX -lpvm3
$ gcc -o hello_other hello_other.c -I/home/reuti/pvm3/include -L/home/reuti/pvm3/lib/LINUX -lpvm3
Prepared with this, we can submit a simple PVM job and observe
the behavior on the nodes:
$ qsub -pe pvm 2 tester_loose.sh
The start script of the PE will start the daemons on the (in
this case two) nodes, and the started program will spawn two
additional processes, one on each pvmd (besides the main
routine). You may implement such a logic if the main
process hello only collects the work of the
hello_other tasks and does not do any active work on its
own.
$ rsh node16 ps -e f -o pid,ppid,pgrp,command --cols=100
PID PPID PGRP COMMAND
787 1 787 /usr/sge/bin/glinux/sge_commd
789 1 789 /usr/sge/bin/glinux/sge_execd
2999 789 2999 \_ sge_shepherd-11233 -bg
5340 789 5340 \_ sge_shepherd-11262 -bg
5359 5340 5359 \_ /bin/sh /var/spool/sge/node16/job_scripts/11262
5360 5359 5359 \_ ./hello
5350 1 5341 /home/reuti/pvm3/lib/LINUX/pvmd3 /tmp/11262.1.para16/hostfile
5361 5350 5341 \_ /home/reuti/hello_other
On the slave node only the escaped pvmd and started
hello_other can be found:
$ rsh node08 ps -e f -o pid,ppid,pgrp,command --cols=100
PID PPID PGRP COMMAND
787 1 787 /usr/sge/bin/glinux/sge_commd
789 1 789 /usr/sge/bin/glinux/sge_execd
25950 1 25940 /home/reuti/pvm3/lib/LINUX/pvmd3 -s -d0x0 -nnode08 1 c0a89711:8110 4080 2 c0a
25951 25950 25940 \_ /home/reuti/hello_other
All files will go in this case to the default directory for
PVM: /tmp.
$ rsh node16 ls -lh /tmp
total 68K
drwxr-xr-x 2 reuti users 4.0K Mar 25 23:21 11262.1.para16
drwx------ 2 root root 16K Apr 23 2004 lost+found
-rw------- 1 reuti users 19 Mar 25 23:21 pvmd.502
-rw------- 1 reuti users 127 Mar 25 23:21 pvml.502
srwxr-xr-x 1 reuti users 0 Mar 25 23:21 pvmtmp005350.0
$ rsh node08 ls -lh /tmp
total 64K
drwx------ 2 root root 16K Apr 23 2004 lost+found
-rw------- 1 reuti users 19 Mar 25 23:21 pvmd.502
-rw------- 1 reuti users 126 Mar 25 23:21 pvml.502
srwxr-xr-x 1 reuti users 0 Mar 25 23:21 pvmtmp025940.0
After a proper shutdown of the job and the PVM, all files
except pvml.502 (where the number reflects your user
ID) will be deleted again. Unless you enable the commented-out
PVM_VMID in the start/stop scripts of this PE (and also set it in
your job script with "PVM_VMID=$JOB_ID; export PVM_VMID"),
you are limited to one PVM per user per machine. Setting
PVM_VMID to the SGE supplied $JOB_ID appends this value to all
PVM generated files and thus makes them unique for each job.
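For the job script side, the setting can be wrapped into a small
guard, so that the script complains when it is started outside of
SGE. This is a sketch only: the function name set_pvm_vmid is
invented here, and the same variable still has to be enabled in the
start/stop scripts of the PE as described above.

```shell
# Job-script fragment: derive PVM_VMID from the SGE job id.
# JOB_ID is set by SGE inside a running job.
set_pvm_vmid() {
    if [ -n "${JOB_ID:-}" ]; then
        PVM_VMID=$JOB_ID
        export PVM_VMID
    else
        echo "JOB_ID is not set; not running under SGE?" >&2
        return 1
    fi
}
```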
If a job is aborted via qdel, the started daemons will
shut down properly, but the started hello_other programs may
continue to run.
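To see what PVM left behind on a node after a job has ended or was
aborted, the files can be listed by name. The sketch below assumes
the file name patterns visible in the listings above (pvmd.<uid>,
pvml.<uid> and pvmtmp*); the function name list_pvm_files is made up
for this example. Note that the pvmtmp* files carry no user ID in
their name, so on a shared node they may belong to other users.

```shell
# List PVM bookkeeping files in a directory for a numeric user id.
#   $1 = directory to inspect, $2 = numeric user id
list_pvm_files() {
    dir=$1
    uid=$2
    for f in "$dir"/pvmd."$uid" "$dir"/pvml."$uid" "$dir"/pvmtmp*; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -e "$f" ]; then
            basename "$f"
        fi
    done
}
```

On a node it could be called as: list_pvm_files /tmp "$(id -u)"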
Additions for Tight Integration
To achieve a Tight Integration, some changes to the already
supplied Loose Integration scripts were necessary:
- The first started pvmd (which is also responsible
for starting the other pvmds on the slave nodes) has to
be started by a local qrsh. For this, start_pvm.c was changed
to distinguish between the two startup modes. In
case of a Tight Integration, it now assembles a call to rsh
instead of exec'ing the pvmd directly.
- The rsh-wrapper from $SGE_ROOT/mpi was taken as a
starting point and modified to redirect the rsh call in
start_pvm to a qrsh. In addition, the temporary
directory for PVM has to be set to the temporary directory
created by SGE. This is done by prefixing the call to
pvmd with "env PVM_TMP=\$TMPDIR". This way PVM_TMP is
set on the target node and not on the source node.
- Any built-in rsh command in PVM is replaced by setting
"PVM_RSH=rsh; export PVM_RSH" in startpvm.sh, so that PVM
picks up the SGE rsh-wrapper.
- The first pvmd, started via qrsh, will in turn
call qrsh again to start the pvmds on the slave nodes.
Because the default behavior is to fork into daemon land, the
rsh-wrapper appends the option "-f" to the pvmd
command if it discovers that the rsh call is not a local one.
So the rsh-wrapper in fact behaves in two different ways.
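The two behaviors of the rsh-wrapper can be illustrated with a few
lines of shell. This is not the wrapper shipped in the archive, only
a sketch of the decision described in the list above: the function
name build_qrsh_cmd and its argument order are invented, and the
command is only printed, not executed.

```shell
# Print the qrsh command line for a pvmd start.
#   $1 = target host, $2 = local host name, remaining args = pvmd command
build_qrsh_cmd() {
    target=$1
    local_name=$2
    shift 2
    # \$TMPDIR stays unexpanded here, so it is resolved on the target node.
    cmd="qrsh -V -inherit $target env PVM_TMP=\$TMPDIR $*"
    if [ "$target" != "$local_name" ]; then
        # Call to a slave node: append -f to the pvmd command.
        cmd="$cmd -f"
    fi
    printf '%s\n' "$cmd"
}
```

Called with the same name for target and local host, it prints the
plain command; with a different target, the same command with "-f"
appended.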
Tight Integration
Having already set up everything for the Loose
Integration, you can switch to the Tight Integration by simply
changing the definition of your PE: add -catch_rsh to
the start and stop procedure arguments of the PE and swap the
settings of control_slaves and job_is_first_task (now TRUE and
FALSE, respectively):
$ qconf -sp pvm
pe_name pvm
slots 46
user_lists NONE
xuser_lists NONE
start_proc_args /usr/sge/pvm/startpvm.sh -catch_rsh $pe_hostfile $host /home/reuti/pvm3
stop_proc_args /usr/sge/pvm/stoppvm.sh -catch_rsh $pe_hostfile $host
allocation_rule 1
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
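Whether a PE definition carries these three settings can be checked
mechanically. The following sketch reads the output of "qconf -sp"
on standard input; the function name check_tight_pe and the printed
words are made up for this example.

```shell
# Read `qconf -sp <pe>` output on stdin and report whether it matches
# the Tight Integration settings (control_slaves TRUE,
# job_is_first_task FALSE, -catch_rsh in start_proc_args).
check_tight_pe() {
    awk '
        $1 == "control_slaves"    { cs = $2 }
        $1 == "job_is_first_task" { jf = $2 }
        $1 == "start_proc_args"   {
            for (i = 2; i <= NF; i++) if ($i == "-catch_rsh") cr = 1
        }
        END {
            if (cs == "TRUE" && jf == "FALSE" && cr) { print "tight"; exit 0 }
            print "not tight"; exit 1
        }
    '
}
```

A typical call would be: qconf -sp pvm | check_tight_pe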
The only thing left to do now is to set PVM_TMP in your job
script, as done in tester_tight.sh, because this is the location
where your program will look for information about the current
PVM:
export PVM_TMP=$TMPDIR
After submitting the job in exactly the same way as before (but
this time using the script tester_tight.sh in the qsub
command):
$ qsub -pe pvm 2 tester_tight.sh
you should see a process distribution on the head node of your job
like:
$ rsh node18 ps -e f -o pid,ppid,pgrp,command --cols=100
PID PPID PGRP COMMAND
788 1 788 /usr/sge/bin/glinux/sge_commd
791 1 791 /usr/sge/bin/glinux/sge_execd
7147 791 7147 \_ sge_shepherd-11238 -bg
7183 7147 7183 | \_ /bin/sh /var/spool/sge/node18/job_scripts/11238
7184 7183 7183 | \_ ./hello
7161 791 7161 \_ sge_shepherd-11238 -bg
7162 7161 7162 \_ /usr/sge/utilbin/glinux/rshd -l
7164 7162 7164 \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sge/node18/active_jobs
7166 7164 7166 \_ /home/reuti/pvm3/lib/LINUX/pvmd3 /tmp/11238.1.para18/hostfile
7185 7166 7166 \_ /home/reuti/hello_other
7158 1 7148 /usr/sge/bin/glinux/qrsh -V -inherit node18 env PVM_TMP=$TMPDIR /home/reuti/pvm3/l
7163 7158 7148 \_ /usr/sge/utilbin/glinux/rsh -p 46806 node18 exec '/usr/sge/utilbin/glinux/qrsh
7165 7163 7148 \_ [rsh <defunct>]
7173 1 7166 /usr/sge/bin/glinux/qrsh -V -inherit node20 env PVM_TMP=$TMPDIR $PVM_ROOT/lib/pvmd
7181 7173 7166 \_ /usr/sge/utilbin/glinux/rsh -p 53007 node20 exec '/usr/sge/utilbin/glinux/qrsh
7182 7181 7166 \_ [rsh <defunct>]
The important thing is that the daemon and the started
programs hello and hello_other are under full SGE
control. Also, the files usually created in /tmp are instead
placed in the private directory which SGE created for this job:
$ rsh node18 ls -h /tmp/11238.1.para18
hostfile
pid.1.node18
pvmd.502
pvml.502
pvmtmp007166.0
qrsh_client_cache
rsh
The same can be observed on the slave nodes (in this case only
one):
$ rsh node20 ps -e f -o pid,ppid,pgrp,command --cols=100
PID PPID PGRP COMMAND
787 1 787 /usr/sge/bin/glinux/sge_commd
790 1 790 /usr/sge/bin/glinux/sge_execd
15397 790 15397 \_ sge_shepherd-11238 -bg
15398 15397 15398 \_ /usr/sge/utilbin/glinux/rshd -l
15399 15398 15399 \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sge/node20/active_jobs
15400 15399 15400 \_ /home/reuti/pvm3/lib/LINUX/pvmd3 -s -d0x0 -nnode20 1 c0a89713:814c
15401 15400 15400 \_ /home/reuti/hello_other
$ rsh node20 ls -h /tmp/11238.1.para20
pid.1.node20
pvmd.502
pvml.502
pvmtmp015400.0
Regardless of whether the job terminates by the intended
program end or is aborted with qdel, all processes and intermediate
files will be cleanly removed. We also don't have to set
PVM_VMID, as all files are already separated into different
temporary directories.
Nodes with more than one network interface
If the nodes of your cluster have two (or more) network
interfaces and you are not using the primary interface for PVM
communication, you also have to add -catch_hostname to the
start_proc_args to catch the call to hostname, as its output is
used by the rsh-wrapper to distinguish between a local qrsh call
and one to the slave nodes. In this hostname-wrapper, you must
change the response of the original hostname call to give the name
of the internal interface, which is also the one used in the SGE
supplied $pe_hostfile. If you want to have the PVM communication
on a completely different interface than the one SGE is aware of,
you may have a look at $SGE_ROOT/mpi/startmpi.sh to see how the
names supplied in $pe_hostfile are mapped, and do this in
$SGE_ROOT/pvm/startpvm.sh in a similar way (in addition to the
necessary -catch_hostname).
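What the changed response of such a hostname-wrapper might look like
depends entirely on how the interfaces are named at your site. The
sketch below assumes a purely hypothetical convention, where the
internal interface of "nodeNN" is called "nodeNN-int"; both the
suffix and the function name internal_hostname are inventions for
this example and have to be replaced by whatever appears in your
$pe_hostfile.

```shell
# Hypothetical naming scheme: internal interface = short name + "-int".
#   $1 = host name to translate (defaults to the output of hostname)
internal_hostname() {
    name=${1:-$(hostname)}
    short=${name%%.*}        # strip any domain part
    printf '%s-int\n' "$short"
}
```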
Restrictions and Future Work
As SGE grants the nodes/slots to your job, you
shouldn't use pvm_addhosts(), pvm_delhosts() or similar calls
in your program, as this would violate the SGE scheduling of jobs
to the cluster.
PVM internally uses a scheduler of its own to distribute the
tasks initiated by pvm_spawn() to the nodes (unless you use the
option to specify the target node directly in pvm_spawn()
yourself [according to the SGE granted nodes/slots, of course]).
The built-in default is just a round robin scheme across all the
granted nodes. So a fixed allocation rule of 1 (or of 2 for
dual CPU machines) is the only safe setting, although you may still
spawn too many tasks on the nodes if your request to the SGE
PE doesn't match the internal logic of your program.
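The effect of a plain round robin placement can be made visible with
a short simulation. This is an illustration of the scheme only, not
PVM's actual code; the function name round_robin is made up, and
tasks are simply numbered from 0.

```shell
# Simulate round robin placement: task i goes to host (i mod nhosts).
#   $1 = number of tasks, remaining args = granted hosts
round_robin() {
    [ "$#" -ge 2 ] || { echo "usage: round_robin ntasks host..." >&2; return 1; }
    ntasks=$1
    shift
    nhosts=$#
    i=0
    while [ "$i" -lt "$ntasks" ]; do
        idx=$(( i % nhosts + 1 ))
        eval "host=\${$idx}"     # pick the idx-th positional argument
        printf 'task %d -> %s\n' "$i" "$host"
        i=$(( i + 1 ))
    done
}
```

With "round_robin 3 node18 node20", the third task lands on node18
again. With an allocation rule of 1 this node was granted only one
slot, which is the oversubscription mentioned above.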
To allow a more flexible allocation of the nodes in SGE, a
scheduler for PVM is in preparation, which will honor the SGE
supplied allocation of slots per node. There will still be only one
daemon running per node, but the replacement scheduler will be
responsible for the correct distribution of the started
processes when the slots granted on the nodes are uneven.
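The information such a scheduler would have to honor is already
contained in the $pe_hostfile, which lists the host name in the first
column and the number of granted slots in the second. The following
sketch is not the scheduler in preparation; it only expands such a
file into one host name per granted slot, and the function name
expand_pe_hostfile is invented for this example.

```shell
# Expand a $pe_hostfile (read from stdin) into one host name per slot.
# Only the first two columns (host, slots) are used.
expand_pe_hostfile() {
    awk 'NF >= 2 { for (i = 0; i < $2; i++) print $1 }'
}
```

Inside a start script it could be fed with:
expand_pe_hostfile < $pe_hostfile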
References and Documents
SGE-PVM Integration
[1] Archive with all the scripts used in this Howto: pvm60.tgz
[for older installations using SGE 5.3: pvm53.tgz].
[2] Archive with a small PVM program from the PVM User's Guide:
pvm_hello.tgz.
PVM
The latest version of PVM and its build instructions can be
downloaded from http://www.netlib.org/pvm3.
PVM documentation in general and tutorials
For some general introduction to PVM and PVM-Programming, you
can study the following documents: