Reducing and Eliminating NFS usage by Grid Engine
The default installation of Grid Engine assumes that the $SGE_ROOT
directory is on a shared filesystem accessible by all hosts in the
cluster. For a large cluster, this could entail significant NFS
traffic. There are various ways to reduce this traffic, including a
way to eliminate entirely the requirement that Grid Engine operate
using shared files. However, for each alternative, there is a
subsequent loss of convenience, and in some cases, functionality.
This HOWTO explains how to implement the different alternatives.
Levels of Grid Engine NFS dependencies
Note: color indicates at each level which part of the file
structure below is moved out of NFS sharing
SGE_ROOT
|
Configuration
|
Description
|
Advantage
|
Disadvantage
|
|
default
|
executables, configuration files, spool directories: all
shared
|
simple to install easy to upgrade easy to debug
|
potentially significant NFS traffic
|
|
local spool directories
|
executables, configuration files: shared.
spool directories: local to each compute host
|
simple to install
easy to upgrade
significant reduction in NFS traffic
|
less convenient to debug (must go to individual host to see
execd messages file)
|
|
local executable files
|
configuration files: shared
executables, spool directories: local to each compute host
|
near-elimination of NFS traffic (NOTE: consequences especially
seen when running massively parallel jobs across many nodes)
|
less convenient to install and upgrade (must modify files on
every host)
less convenient to debug
|
|
local configuration files
|
executables, configuration files, spool directories: all
local to each compute host
|
elimination of NFS requirement
|
less convenient to install and upgrade
less convenient to debug
less convenient to change some configuration parameters (must
modify files on every host)
loss of shadow master functionality; partial loss of qacct
functionality
|
Local Spool Directories
The spool directory for each
execd is the greatest source of NFS traffic for Grid Engine. When
jobs are dispatched to an exec host, the job script gets transferred
via the commd and then written to the spool directory. Each
job gets its own subdirectory, into which additional information is
written by both the execd and the job shepherd process.
Logfiles are also written into the spool directory, for both the
execd as well as the individual jobs.
By configuring local spool
directories, all that traffic can be redirected to the local disk on
each compute host, thus isolating it from the rest of the network as
well as reducing the I/O latency. One disadvantage is that, in order
to view the logfiles for a particular job, you need to log onto the
system where the job ran, instead of simply looking in the shared
directory. But, this would only be necessary for detailed debugging
of a job problem.
The path to the spool
directory controlled by the parameter execd_spool_dir; it
should be set to a directory on the local compute host which is owned
by the admin user and which ideally can handle intensive
reading/writing (e.g., /var/spool/sge). The
execd_spool_dir parameter can be specified when running the
install_qmaster script. However, this directory must already existed
and be owned by the admin user, or else the script will complain and
the execd will not function properly. Alternatively, the
execd_spool_dir parameter can be changed in the cluster
configuration (man sge_conf); the execds need to be halted
before this change can be made. Please make you read sge_conf(5)
Local Executables
In the default setup, all
hosts in a cluster read the binary files for daemons and commands off
the shared directory. For daemons, this only occurs once, when they
start up. When jobs run, other processes are invoked, such as the
shepherd and the rshd (for interactive jobs). In a
high-throughput cluster, or when invoking a massively-parallel job
across many nodes, there is a possibility that many simultaneous NFS
read accesses to these other executables could occur. To counter
this, you could make all executables be local to the compute hosts.
In this configuration,
rather than sharing $SGE_ROOT over NFS to the compute
hosts, you would only share $SGE_ROOT/$SGE_CELL/common
(you would also implement local spool directories as described
above). On each compute host, you would need to install both the
"common" and the architecture-specific binary packages.
Then, you would mount the shared $SGE_ROOT/$SGE_CELL/common
directory before invoking the install_execd script. In
order to prevent confusion, make sure that the path to $SGE_ROOT
is identical on the master host and compute hosts, e.g.,
SGE_ROOT=/opt/sge on all hosts.
For submit and admin hosts, you could choose to either install the
executables locally, or else mount them from some shared version of
$SGE_ROOT, since it is unlikely that NFS traffic on
these types of hosts would be a cause for concern in terms of
performance.
Local Configuration Files
Although the above two setups describe ways to reduce NFS traffic
to almost nil, there might be other reasons why NFS is not desired.
For example, the only available version of NFS for your operating
environment might not be considered reliable enough for production
use. In this case, you can choose not to share the configuration
directory $SGE_ROOT/$SGE_CELL/common, but instead have
it be local to each compute host. This would result in no files being
shared via NFS. However, because you are no longer using a common set
of files shared by all systems, there is some functionality which
requires some extra effort to use, and other functionality which no
longer works.
1) When you modify certain configuration files, the modification
would need to be made manually across all hosts in the cluster. These
files are located in the $SGE_ROOT/$SGE_CELL/common
directory:
sge_request: default submit request flags; man
sge_request(5)
host_aliases: hostname aliasing; man
host_aliases(5)
sge_aliases: file path aliasing; man
sge_aliases(5)
configuration: most of the information in this
file does not need to be used by the exec hosts. However, there are
two parameters, ignore_fqdn and default_domain, which
are used by the commd on all hosts. If the value of these
parameters is changed on the master hosts, then it also needs to be
reflected on all exec hosts in the cluster. Normally, these two
parameters would be set once, when the master host is installed, and
then not changed again. However, in case you experience network
problems, you may need to change these and see if it fixes the
problem.
2) Another consequence is that the qacct command will only
work if executed on the master host. This is because the accounting
file, where all historical information is stored, is only updated on
the master host. Because qacct will by default read information from
the file $SGE_ROOT/$SGE_CELL/common/accounting, it will
only be accurate on the master host. qacct can be directed to read
information from any file, using the -f flag, so one alternative is
to manually copy the accounting file periodically to another system,
where the analysis can take place.
3) Finally, if you do not share the $SGE_ROOT/$SGE_CELL/common
directory, you cannot use the Shadow Master facility. The
Shadow Master feature relies upon a shared filesystem to keep track
of the active master, so without NFS, Shadow Mastering does not
work.
To install with this type of setup, proceed as follows:
unpack/untar the Grid Engine distribution on each system
(common and architecture-specific packages) to the same pathname on
each system
install the master host completely
modify all the configuration files mentioned above to suit
the requirements of your site
on the master hosts make an archive of the directory
$SGE_ROOT/$SGE_CELL/common
on each exec host, unpack the archive created above
on each exec host, run the install_execd script.
It should automatically read in the configurations from the
directory which was unpacked.
Other Considerations
Even though Sun Grid Engine can function perfectly well without
NFS (except the noted functionality), there are other considerations
which might lead to unexpected behavior.
Home directories
Unless otherwise specified, Grid Engine runs jobs in the user's
home directory. If this is not shared, then whatever files are
created will be placed in the home directory on the host where the
job is executed. Also, any configuration given in dot-files, such
as .cshrc and .sge_requests, will be read
out of the home directory on the host where the job is executed .
Finally, if the home directory of the user actually does not exist on
the compute host, the job will go into an error state. You need to
make sure that for every user, and on every compute host, a home
directory is present and contains all the desired dot-file
configurations. Also, for jobs run with the -cwd flag,
the current path will be recorded, and when the job executes on the
compute host, unless the exact same path is accessible to the user
running the job, the job will go into an error state.
Application and data files
Obviously, without NFS there needs to be a way to stage data files
in and out, and the application files (binaries, libraries, config
files, databases, etc.) would also need to be either already present
on each compute host or also staged in. The prolog and epilog script
feature of Grid Engine provides a generic mechanism for implementing
a site-specific stage-in/stage-out facility. Alternatively, these
steps could be embedded into jobs scripts directly.
User virtualization
If application availability and data file staging were accounted
for, one could in principle run Grid Engine without NFS over a WAN.
However, part of the Grid Engine built-in authentication is that the
username of the user submitting a job must be recognized on the
compute host where the job runs. If running across administrative
domains, the username might not exist on the target exec host. In
this case, some of the solutions include:
allow users to log in as a common "grid user" (the
-A submit flag could be used to distinguish the actual
identity of the user).
using a SUID wrapper to submit and administrative commands to
do the same thing transparently