A Prototype of a Multi-Clustering Implementation
using Transfer Queues
last updated: November 27, 2002
Background
Multi-clustering can be defined as the sharing of compute jobs
between two or more independently controlled clustergrids. This could
be desirable, for example, if the owners of the clusters realize that
there are times when one cluster is lightly-loaded while another has
a large pending list. In this case, it makes sense to make use of the
idle resources, at least until the "local" demand for this
cluster rises again.
We define the "local" cluster as the cluster with which
a user would normally interact (qsub, qstat, etc), and "remote"
cluster as the one to which jobs are dispatched (if deemed
appropriate).
One way to implement multi-clustering is with transfer queues. The
idea is that jobs dispatched to a transfer queue get submitted to a
remote cluster; this would happen transparently to the user. To have
jobs going in both directions, simply use transfer queues on both
sides.
Description
The prototype described here implements the following ideas:
a load sensor which gives the number of pending jobs in the
remote cluster
a starter method which takes the job script and submits it to
the remote cluster
suspend, resume, and terminate methods which act upon jobs
sent to the remote cluster
a transfer queue which implements the methods described
above, and which has a load threshold based upon the number of
pending jobs in the remote cluster.
The scripts used to implement the prototype can be downloaded
here:
Restricting Assumptions:
both clusters share a common "namespace" or
"administrative domain". By this we mean: common
usernames/UIDs/GIDs, common filesystem(s), mutually accessible hosts
and hostnames, behind the same firewalls, etc.
resource requests are simply passed along without
modification. This could lead to errors if a particular resource is
not defined in the remote cluster.
once a job is submitted to the remote cluster, it will remain
in the pending list for that cluster until it either runs or is
manually deleted. If meanwhile some local resources are freed, the
jobs which are pending at the remote cluster will not get rerouted
back to the local cluster.
Setup
To implement this prototype, create a new queue (the transfer
queue) on any existing host. Configure it as shown below. Implement
the load sensor on any host (it would make sense to implement it on
the same host where the transfer queue lives, but that's not
necessary). Configure that host as shown below.
NOTE: it is suggested that the transfer queue be implemented on
the local master host, to keep things simple and to reduce
communication latency.
NOTE: make sure this host is both a submit and admin host on the
remote cluster.
NOTE: please modify the scripts to match your local environment.
See the scripts for site-specific variables.
Test it by submitting jobs directly to the transfer queue, eg,
qsub -q transfer1 -o output.log -j y sleeper.sh
In practice, you would not submit jobs directly to this special
queue. Rather, one of the following approaches would be taken:
if all jobs are "generic"
enough, then it might not matter if they run in the local or remote
cluster. In this case, jobs would just be submitted normally, and
those which happen to end up in the transfer queue will run remotely
if certain jobs cannot be run on
the remote cluster for whatever reason (licensing restrictions,
data/binaries inaccessible remotely), then a user-defined complex
with a boolean resource (eg, "remote") should be attached
to the transfer queue(s). Jobs which cannot be run remotely could be
submitted with a resource request "-l remote=false". This
will prevent those jobs from being dispatched to the transfer
queue(s), and hence will not be dispatched remotely.
Setup of transfer queue (only relevant lines shown)
sr3-umpk-01$ qconf -sq transfer1
qname transfer1
hostname tcf-b060
load_thresholds mpk27jobs=5
qtype BATCH
slots 40
prolog NONE
epilog NONE
starter_method /sge/TransferQueues/transfer_starter.sh
suspend_method /sge/TransferQueues/transfer_suspend.sh
resume_method /sge/TransferQueues/transfer_resume.sh
terminate_method /sge/TransferQueues/transfer_terminate.sh
Load sensor implemented on a single host (only relevant lines shown)
sr3-umpk-01$ qconf -sconf tcf-b060
tcf-b060:
load_sensor /sge/TransferQueues/clusterload.sh
load_report_time 00:00:09
Configuration
the total number of "foreign" jobs in a system
(both running and pending) can be specified by setting the number of
slots in the transfer queue. It is recommended that this number be
set to some fraction of the number of remote slots (eg, 1/3 or 1/4),
for a couple of reasons:
if many jobs are dispatched to the remote cluster and are
pending there, and the remote cluster gets flooded with higher
priority jobs, then all those pending jobs could get trapped,
instead of potentially running on free local resources
if there are too many jobs dispatched to the remote cluster,
then the polling mechanism for the transfer queue becomes too much
of a load (but see improvements below)
jobs only get sent to the remote cluster if it is "idle".
The definition of idle is having N or fewer jobs in the remote
pending list; N is configurable but might be something like 10% of
the total number of remote job slots. This number is used in the
load threshold for the transfer queue.
if using Grid Engine Enterprise Edition, "foreign"
jobs can be assigned a particular project category, either
voluntarily in the job request or by giving it an assigned category,
and the share of this project in the local cluster modified
appropriately. The remote project is set in the transfer starter
method.
Discussion
see comments in the scripts to see how certain functionality
was implemented and why other choices were not used.
if this setup is to be used for a high-throughput environment
(large numbers of relatively short jobs, especially array jobs), the
load sensor will not be able to provide a very accurate measure of
the size of the pending list, and might unnecessarily prevent jobs
from being dispatched. In this case, you might consider disabling
the load threshold, and instead setting the number of slots to a
smaller number (eg, 1/10 the number of remote slots). This will
meter out the small jobs in a steady but limited stream, thus
avoiding the possibility of dispatching too many jobs and
overloading the cluster. It would also be useful here to set the
polling time to something lower, like 5 seconds. CAUTION: do not set
the polling time to anything less than 5 seconds, otherwise the
remote cluster will become overloaded with status requests.
One convenient way to do resource mapping, especially if the
remote cluster has systems of widely varying capabilities, is to use
multiple transfer queues for the different types of systems (eg,
different Oses, different memory sizes, etc). By setting the
resources on the local transfer queue to match a specific type of
system on the remote cluster, you can ensure that jobs get
dispatched only when the correct type of system is available.
Improvements
There are some improvements which could be made to the current
implementation of methods:
invalid resource requests: if a request is invalid, have an
error handler which sends the job back to the local cluster and
marks it ineligible to run at the remote cluster. Currently, the job
is just killed.
resource request mapping: if special resources are requested,
have a way to map them to equivalent resources at the remote
cluster.
accounting: have a sensible way to retrieve accounting
information from the remote cluster. Perhaps, have a 'qacct_remote'
command which will get the remote JID from the local accounting
record and then do a remote regular qacct call.
also have a way to map qalter and qstat
commands to the remote cluster. Again, have a *_remote
version of these commands.
Future Directions
The implementation described here is limited in terms of
transfering jobs between administrative domains (eg, different
file/user namespaces, different firewalls,etc). In order to extend
the concept to multiple domains, the following extensions to the
MultiCluster concept would require putting additional functionality
into the Grid Engine code, and most likely, integration with
functionality provided by other software:
Cross domain security and identity.
Firewalls, NATs, etc.
Data transfer between domains.
Brokerage for intelligent routing of jobs based on least
cost, minimal time, best resource, or similar criteria.
Sharing of resources across domains, e.g. software licenses;
policies related to such sharing models.
Cross domain accounting, billing, monitoring.
Co-scheduling across domains for distributed apps (e.g.
computation and visualisation).
Cross domain error tracking.
Utilization of standardized interfaces for global grids (e.g.
OGSA).
In addition, cross-domain issues tend to rely more upon standard
APIs and protocols. Currently, these are not well-defined; much work
towards defining these standards is occuring in standards bodies such
as the Global Grid Forum (GGF).