Distributed Resource Management Application API
This guide is a tutorial for getting started programming with DRMAA. It
assumes that you already know what DRMAA is and know how DRMAA is
supported in the Grid Engine 6.0 release. If you do not already know
these things, try these web sites:
Note that the example programs in this howto can be found in the CVS
source tree.
Starting and Stopping a Session
The following code segment shows the most basic DRMAA C binding program:
Example 1
01: #include <stdio.h>
02: #include "drmaa.h"
03:
04: int main (int argc, char **argv) {
05: char error[DRMAA_ERROR_STRING_BUFFER];
06: int errnum = 0;
07:
08: errnum = drmaa_init (NULL, error, DRMAA_ERROR_STRING_BUFFER);
09:
10: if (errnum != DRMAA_ERRNO_SUCCESS) {
11: fprintf (stderr, "Could not initialize the DRMAA library: %s\n", error);
12: return 1;
13: }
14:
15: printf ("DRMAA library was started successfully\n");
16:
17: errnum = drmaa_exit (error, DRMAA_ERROR_STRING_BUFFER);
18:
19: if (errnum != DRMAA_ERRNO_SUCCESS) {
20: fprintf (stderr, "Could not shut down the DRMAA library: %s\n", error);
21: return 1;
22: }
23:
24: return 0;
25: }
The first thing to notice is that nearly every call to a DRMAA function
returns an error code. If everything goes well, that code is
DRMAA_ERRNO_SUCCESS. If something goes wrong, an appropriate error code is
returned instead. Most DRMAA functions also take at least two parameters:
a string to populate with an error message in case of an error, and an
integer giving the maximum length of that error string.
Now let's look at the functions being called. First, on line 8, we call
drmaa_init(). This function sets up the DRMAA session and must be called
before most other DRMAA functions. Some functions, like
drmaa_get_contact(), can be called before drmaa_init(), but these
functions only provide general information. Any function that does work,
such as drmaa_run_job() or drmaa_wait(), must be called after drmaa_init()
returns. If such a function is called before drmaa_init() returns, it
will return the error code DRMAA_ERRNO_NO_ACTIVE_SESSION.
drmaa_init() creates a session and starts an event client listener thread.
The session is used for organizing jobs submitted through DRMAA, and the
thread is used to receive updates from the queue master about the state
of jobs and the system in general. Once drmaa_init() has been called
successfully, it is the responsibility of the calling application to also
call drmaa_exit() before terminating. If an application does not call
drmaa_exit() before terminating, session state may be left behind in the
user's home directory (under .sge/drmaa), and the queue master may be left
with a dead event client handle, which can decrease queue master
performance.
At the end of our program, on line 17, we call drmaa_exit(). drmaa_exit()
cleans up the session and stops the event client listener thread. Most
other DRMAA functions must be called before drmaa_exit(). Some functions,
like drmaa_get_contact(), can be called after drmaa_exit(), but these
functions only provide general information. Any function that does work,
such as drmaa_run_job() or drmaa_wait(), must be called before drmaa_exit()
is called. If such a function is called after drmaa_exit() is called, it
will return the error code DRMAA_ERRNO_NO_ACTIVE_SESSION.
Example 1_1
01: #include <stdio.h>
02: #include "drmaa.h"
03:
04: int main (int argc, char **argv) {
05: char error[DRMAA_ERROR_STRING_BUFFER];
06: int errnum = 0;
07: char contact[DRMAA_CONTACT_BUFFER];
08:
09: errnum = drmaa_init (NULL, error, DRMAA_ERROR_STRING_BUFFER);
10:
11: if (errnum != DRMAA_ERRNO_SUCCESS) {
12: fprintf (stderr, "Could not initialize the DRMAA library: %s\n", error);
13: return 1;
14: }
15:
16: printf ("DRMAA library was started successfully\n");
17:
18: errnum = drmaa_get_contact (contact, DRMAA_CONTACT_BUFFER, error,
19: DRMAA_ERROR_STRING_BUFFER);
20:
21: if (errnum != DRMAA_ERRNO_SUCCESS) {
22: fprintf (stderr, "Could not get the contact string: %s\n", error);
23: return 1;
24: }
25:
26: errnum = drmaa_exit (error, DRMAA_ERROR_STRING_BUFFER);
27:
28: if (errnum != DRMAA_ERRNO_SUCCESS) {
29: fprintf (stderr, "Could not shut down the DRMAA library: %s\n", error);
30: return 1;
31: }
32:
33: errnum = drmaa_init (contact, error, DRMAA_ERROR_STRING_BUFFER);
34:
35: if (errnum != DRMAA_ERRNO_SUCCESS) {
36: fprintf (stderr, "Could not reinitialize the DRMAA library: %s\n", error);
37: return 1;
38: }
39:
40: printf ("DRMAA library was restarted successfully\n");
41:
42: errnum = drmaa_exit (error, DRMAA_ERROR_STRING_BUFFER);
43:
44: if (errnum != DRMAA_ERRNO_SUCCESS) {
45: fprintf (stderr, "Could not shut down the DRMAA library: %s\n", error);
46: return 1;
47: }
48:
49: return 0;
50: }
This example is very similar to Example 1. The difference is that it uses
the Grid Engine feature of reconnectable sessions. The DRMAA concept of
a session is translated into a session tag in the Grid Engine job
structure. That means that every job knows to which session it belongs.
With reconnectable sessions, it's possible to initialize the DRMAA library
to a previous session, allowing the library access to that session's job
list. The one limitation is that jobs which end between the calls to
drmaa_exit() and drmaa_init() are lost: the reconnected session no longer
sees these jobs, and so won't know about them.
Through line 16, this example is very similar to Example 1. On line 18,
however, we use the drmaa_get_contact() function to get the contact
information for this session. On line 26 we then exit the session. On
line 33, we use the stored contact information to reconnect to the
previous session. Had we submitted jobs before calling drmaa_exit(), those
jobs would now be available again for operations such as drmaa_wait() and
drmaa_synchronize(). Finally, on line 42 we exit the session a second
time.
Running a Job
The following code segment shows how to use the DRMAA C binding to submit
a job to Grid Engine:
Example 2
01: #include <stdio.h>
02: #include "drmaa.h"
03:
04: int main (int argc, char **argv) {
05: char error[DRMAA_ERROR_STRING_BUFFER];
06: int errnum = 0;
07: drmaa_job_template_t *jt = NULL;
08:
09: errnum = drmaa_init (NULL, error, DRMAA_ERROR_STRING_BUFFER);
10:
11: if (errnum != DRMAA_ERRNO_SUCCESS) {
12: fprintf (stderr, "Could not initialize the DRMAA library: %s\n", error);
13: return 1;
14: }
15:
16: errnum = drmaa_allocate_job_template (&jt, error, DRMAA_ERROR_STRING_BUFFER);
17:
18: if (errnum != DRMAA_ERRNO_SUCCESS) {
19: fprintf (stderr, "Could not create job template: %s\n", error);
20: }
21: else {
22: errnum = drmaa_set_attribute (jt, DRMAA_REMOTE_COMMAND, "sleeper.sh",
23: error, DRMAA_ERROR_STRING_BUFFER);
24:
25: if (errnum != DRMAA_ERRNO_SUCCESS) {
26: fprintf (stderr, "Could not set attribute \"%s\": %s\n",
27: DRMAA_REMOTE_COMMAND, error);
28: }
29: else {
30: const char *args[2] = {"5", NULL};
31:
32: errnum = drmaa_set_vector_attribute (jt, DRMAA_V_ARGV, args, error,
33: DRMAA_ERROR_STRING_BUFFER);
34: }
35:
36: if (errnum != DRMAA_ERRNO_SUCCESS) {
37: fprintf (stderr, "Could not set attribute \"%s\": %s\n",
38: DRMAA_REMOTE_COMMAND, error);
39: }
40: else {
41: char jobid[DRMAA_JOBNAME_BUFFER];
42:
43: errnum = drmaa_run_job (jobid, DRMAA_JOBNAME_BUFFER, jt, error,
44: DRMAA_ERROR_STRING_BUFFER);
45:
46: if (errnum != DRMAA_ERRNO_SUCCESS) {
47: fprintf (stderr, "Could not submit job: %s\n", error);
48: }
49: else {
50: printf ("Your job has been submitted with id %s\n", jobid);
51: }
52: } /* else */
53:
54: errnum = drmaa_delete_job_template (jt, error, DRMAA_ERROR_STRING_BUFFER);
55:
56: if (errnum != DRMAA_ERRNO_SUCCESS) {
57: fprintf (stderr, "Could not delete job template: %s\n", error);
58: }
59: } /* else */
60:
61: errnum = drmaa_exit (error, DRMAA_ERROR_STRING_BUFFER);
62:
63: if (errnum != DRMAA_ERRNO_SUCCESS) {
64: fprintf (stderr, "Could not shut down the DRMAA library: %s\n", error);
65: return 1;
66: }
67:
68: return 0;
69: }
The beginning and end of this program are the same as the previous one.
What's different is in lines 16-59. On line 16 we ask DRMAA to allocate a
job template for us. A job template is a structure used to store
information about a job to be submitted. The same template can be reused
for multiple calls to drmaa_run_job() or drmaa_run_bulk_jobs().
On line 22 we set the DRMAA_REMOTE_COMMAND attribute. This
attribute tells DRMAA where to find the program we want to run. Its value
is the path to the executable. The path can be either relative or
absolute. If relative, it is relative to the DRMAA_WD
attribute, which if not set defaults to the user's home directory. For
more information on DRMAA attributes, please see the drmaa_attributes
man page. Note that for this program to work, the script
"sleeper.sh" must be in your default path, i.e. the path set by
your shell script when you log in.
On line 32 we set the DRMAA_V_ARGV attribute. This
attribute tells DRMAA what arguments to pass to the executable. For
more information on DRMAA attributes, please see the drmaa_attributes
man page.
On line 43 we submit the job with drmaa_run_job(). DRMAA will place the
id assigned to the job into the character array we passed to
drmaa_run_job(). The job is now running as though submitted by qsub. At
this point calling drmaa_exit() and/or terminating the program will have
no effect on the job.
To clean things up, we delete the job template on line 54. This frees the
memory DRMAA set aside for the job template, but has no effect on
submitted jobs.
Finally, on line 61, we call drmaa_exit(). The call to drmaa_exit() is
outside of the if structure started on line 18 because regardless of
whether the other commands succeed, once we've called drmaa_init(), we are
obligated to call drmaa_exit() before terminating.
If instead of a single job we had wanted to submit an array job, we could
have replaced the else on lines 40-52 with the following:
Example 2.1
40: else {
41: drmaa_job_ids_t *ids = NULL;
42:
43: errnum = drmaa_run_bulk_jobs (&ids, jt, 1, 30, 2, error, DRMAA_ERROR_STRING_BUFFER);
44:
45: if (errnum != DRMAA_ERRNO_SUCCESS) {
46: fprintf (stderr, "Could not submit job: %s\n", error);
47: }
48: else {
49: char jobid[DRMAA_JOBNAME_BUFFER];
50:
51: while (drmaa_get_next_job_id (ids, jobid, DRMAA_JOBNAME_BUFFER) == DRMAA_ERRNO_SUCCESS) {
52: printf ("A job task has been submitted with id %s\n", jobid);
53: }
54: }
55:
56: drmaa_release_job_ids (ids);
57: }
This code segment submits an array job with 15 tasks numbered 1, 3, 5, 7,
etc. An important difference to note is that drmaa_run_bulk_jobs()
returns the job ids in an opaque structure. On lines 51-53, before we can
print the job ids, we have to extract them from the structure. When we're
done with the job ids, we free the structure on line 56. A more normal
use pattern would be to use the while loop to extract job ids from the
structure and place them into an array for future use. We know when we've
iterated over every element when drmaa_get_next_job_id() returns
DRMAA_ERRNO_INVALID_ATTRIBUTE_VALUE. Note that you can only
iterate through the structure once and only in one direction.
Waiting for a Job
Now we're going to extend our example to include waiting for a job to
finish.
Example 3
001: #include <stdio.h>
002: #include "drmaa.h"
003:
004: int main (int argc, char **argv) {
005: char error[DRMAA_ERROR_STRING_BUFFER];
006: int errnum = 0;
007: drmaa_job_template_t *jt = NULL;
008:
009: errnum = drmaa_init (NULL, error, DRMAA_ERROR_STRING_BUFFER);
010:
011: if (errnum != DRMAA_ERRNO_SUCCESS) {
012: fprintf (stderr, "Could not initialize the DRMAA library: %s\n", error);
013: return 1;
014: }
015:
016: errnum = drmaa_allocate_job_template (&jt, error, DRMAA_ERROR_STRING_BUFFER);
017:
018: if (errnum != DRMAA_ERRNO_SUCCESS) {
019: fprintf (stderr, "Could not create job template: %s\n", error);
020: }
021: else {
022: errnum = drmaa_set_attribute (jt, DRMAA_REMOTE_COMMAND, "sleeper.sh",
023: error, DRMAA_ERROR_STRING_BUFFER);
024:
025: if (errnum != DRMAA_ERRNO_SUCCESS) {
026: fprintf (stderr, "Could not set attribute \"%s\": %s\n",
027: DRMAA_REMOTE_COMMAND, error);
028: }
029: else {
030: const char *args[2] = {"5", NULL};
031:
032: errnum = drmaa_set_vector_attribute (jt, DRMAA_V_ARGV, args, error,
033: DRMAA_ERROR_STRING_BUFFER);
034: }
035:
036: if (errnum != DRMAA_ERRNO_SUCCESS) {
037: fprintf (stderr, "Could not set attribute \"%s\": %s\n",
038: DRMAA_REMOTE_COMMAND, error);
039: }
040: else {
041: char jobid[DRMAA_JOBNAME_BUFFER];
042: char jobid_out[DRMAA_JOBNAME_BUFFER];
043: int status = 0;
044: drmaa_attr_values_t *rusage = NULL;
045:
046: errnum = drmaa_run_job (jobid, DRMAA_JOBNAME_BUFFER, jt, error,
047: DRMAA_ERROR_STRING_BUFFER);
048:
049: if (errnum != DRMAA_ERRNO_SUCCESS) {
050: fprintf (stderr, "Could not submit job: %s\n", error);
051: }
052: else {
053: printf ("Your job has been submitted with id %s\n", jobid);
054:
055: errnum = drmaa_wait (jobid, jobid_out, DRMAA_JOBNAME_BUFFER, &status,
056: DRMAA_TIMEOUT_WAIT_FOREVER, &rusage, error,
057: DRMAA_ERROR_STRING_BUFFER);
058:
059: if (errnum != DRMAA_ERRNO_SUCCESS) {
060: fprintf (stderr, "Could not wait for job: %s\n", error);
061: }
062: else {
063: char usage[DRMAA_ERROR_STRING_BUFFER];
064: int aborted = 0;
065:
066: drmaa_wifaborted(&aborted, status, NULL, 0);
067:
068: if (aborted == 1) {
069: printf("Job %s never ran\n", jobid);
070: }
071: else {
072: int exited = 0;
073:
074: drmaa_wifexited(&exited, status, NULL, 0);
075:
076: if (exited == 1) {
077: int exit_status = 0;
078:
079: drmaa_wexitstatus(&exit_status, status, NULL, 0);
080: printf("Job %s finished regularly with exit status %d\n", jobid, exit_status);
081: }
082: else {
083: int signaled = 0;
084:
085: drmaa_wifsignaled(&signaled, status, NULL, 0);
086:
087: if (signaled == 1) {
088: char termsig[DRMAA_SIGNAL_BUFFER+1];
089:
090: drmaa_wtermsig(termsig, DRMAA_SIGNAL_BUFFER, status, NULL, 0);
091: printf("Job %s finished due to signal %s\n", jobid, termsig);
092: }
093: else {
094: printf("Job %s finished with unclear conditions\n", jobid);
095: }
096: } /* else */
097: } /* else */
098:
099: printf ("Job Usage:\n");
100:
101: while (drmaa_get_next_attr_value (rusage, usage, DRMAA_ERROR_STRING_BUFFER) == DRMAA_ERRNO_SUCCESS) {
102: printf (" %s\n", usage);
103: }
104:
105: drmaa_release_attr_values (rusage);
106: } /* else */
107: } /* else */
108: } /* else */
109:
110: errnum = drmaa_delete_job_template (jt, error, DRMAA_ERROR_STRING_BUFFER);
111:
112: if (errnum != DRMAA_ERRNO_SUCCESS) {
113: fprintf (stderr, "Could not delete job template: %s\n", error);
114: }
115: } /* else */
116:
117: errnum = drmaa_exit (error, DRMAA_ERROR_STRING_BUFFER);
118:
119: if (errnum != DRMAA_ERRNO_SUCCESS) {
120: fprintf (stderr, "Could not shut down the DRMAA library: %s\n", error);
121: return 1;
122: }
123:
124: return 0;
125: }
This example is very similar to Example 2 except for lines 55-106. On
line 55 we call drmaa_wait() to wait for the job to end. We have to give
drmaa_wait() both the id of the job for which we want to wait and a place
to write the id of the job for which we actually waited because the job
id we pass in could be DRMAA_JOB_IDS_SESSION_ANY, in which
case drmaa_wait() must have a way of telling us which job is the one that
made it return. We also have to pass to drmaa_wait() how long we are
willing to wait for the job to finish. This could be a number of seconds,
or it could be either DRMAA_TIMEOUT_WAIT_FOREVER or
DRMAA_TIMEOUT_NO_WAIT. Lastly, aside from the usual error
buffer, we also have to pass in a place to write the exit status and the
usage information. The exit status is an opaque number that is passed to
the drmaa_w...() functions to get information about how the job exited.
The usage information is a list of name=value pairs in a DRMAA values
structure. The values structure works exactly the same as the ids
structure we talked about in Example 2.1.
Assuming the wait worked, we query the job's exit status on lines 66-97
using the drmaa_w...() functions. This if structure is a common usage
pattern for drmaa_wait() and should be encapsulated in a function for
ease of use.
After checking the exit status, we query the job's usage on lines 99-105.
We use the drmaa_get_next_attr_value() function to walk through the usage
information and print out the results. For further processing of the
usage, we'd have to split each string on the '=' character to extract the
name and value of each usage parameter.
An alternative to drmaa_wait() when working with multiple jobs, such as
jobs submitted by drmaa_run_bulk_jobs() or multiple calls to
drmaa_run_job(), is drmaa_synchronize(). drmaa_synchronize() waits for
a set of jobs to finish. To use drmaa_synchronize(), we could replace
lines 40-108 with the following:
Example 3.1
40: else {
41: drmaa_job_ids_t *ids = NULL;
42:
43: errnum = drmaa_run_bulk_jobs (&ids, jt, 1, 30, 2, error, DRMAA_ERROR_STRING_BUFFER);
44:
45: if (errnum != DRMAA_ERRNO_SUCCESS) {
46: fprintf (stderr, "Could not submit job: %s\n", error);
47: }
48: else {
49: char jobid[DRMAA_JOBNAME_BUFFER];
50: const char *jobids[2] = {DRMAA_JOB_IDS_SESSION_ALL, NULL};
51:
52: while (drmaa_get_next_job_id (ids, jobid, DRMAA_JOBNAME_BUFFER) == DRMAA_ERRNO_SUCCESS) {
53: printf ("A job task has been submitted with id %s\n", jobid);
54: }
55:
56: errnum = drmaa_synchronize (jobids, DRMAA_TIMEOUT_WAIT_FOREVER,
57: 1, error, DRMAA_ERROR_STRING_BUFFER);
58:
59: if (errnum != DRMAA_ERRNO_SUCCESS) {
60: fprintf (stderr, "Could not wait for jobs: %s\n", error);
61: }
62: else {
63: printf ("All job tasks have finished.\n");
64: }
65: } /* else */
66:
67: drmaa_release_job_ids (ids);
68: } /* else */
Lines 41-43 now call drmaa_run_bulk_jobs() so that we have several jobs
for which to wait. On line 56, instead of calling drmaa_wait(), we call
drmaa_synchronize(). drmaa_synchronize() takes only three interesting
parameters. The first is the list of ids for which to wait. This list
must be a NULL-terminated array of strings. If the special id,
DRMAA_JOB_IDS_SESSION_ALL, appears in the array,
drmaa_synchronize() will wait for all jobs submitted via DRMAA during this
session, i.e. since drmaa_init() was called. The second is how long to
wait for all the jobs in the list to finish. This is the same as the
timeout parameter for drmaa_wait(). The third is whether this call to
drmaa_synchronize() should clean up after the job. After a job completes,
it leaves behind accounting information, such as exit status and usage,
until either drmaa_wait() or drmaa_synchronize() with dispose set to true
is called. It is the responsibility of the application to make sure one
of these two functions is called for every job. Not doing so creates a
memory leak. Note that calling drmaa_synchronize() with dispose set to
true flushes all accounting information for all jobs in the list. If you
want to use drmaa_synchronize() and still recover the accounting
information, set dispose to false and call drmaa_wait() for each job. To
do this in Example 3, we would replace lines 40-108 with the following:
Example 3.2
040: else {
041: drmaa_job_ids_t *ids = NULL;
042: int start = 1;
043: int end = 30;
044: int step = 2;
045:
046: errnum = drmaa_run_bulk_jobs (&ids, jt, start, end, step, error,
047: DRMAA_ERROR_STRING_BUFFER);
048:
049: if (errnum != DRMAA_ERRNO_SUCCESS) {
050: fprintf (stderr, "Could not submit job: %s\n", error);
051: }
052: else {
053: char jobid[DRMAA_JOBNAME_BUFFER];
054: const char *jobids[2] = {DRMAA_JOB_IDS_SESSION_ALL, NULL};
055:
056: while (drmaa_get_next_job_id (ids, jobid, DRMAA_JOBNAME_BUFFER)
057: == DRMAA_ERRNO_SUCCESS) {
058: printf ("A job task has been submitted with id %s\n", jobid);
059: }
060:
061: errnum = drmaa_synchronize (jobids, DRMAA_TIMEOUT_WAIT_FOREVER,
062: 0, error, DRMAA_ERROR_STRING_BUFFER);
063:
064: if (errnum != DRMAA_ERRNO_SUCCESS) {
065: fprintf (stderr, "Could not wait for jobs: %s\n", error);
066: }
067: else {
068: char jobid[DRMAA_JOBNAME_BUFFER];
069: int status = 0;
070: drmaa_attr_values_t *rusage = NULL;
071: int count = 0;
072:
073: for (count = start; count < end; count += step) {
074: errnum = drmaa_wait (DRMAA_JOB_IDS_SESSION_ANY, jobid,
075: DRMAA_JOBNAME_BUFFER, &status,
076: DRMAA_TIMEOUT_WAIT_FOREVER, &rusage,
077: error, DRMAA_ERROR_STRING_BUFFER);
078:
079: if (errnum != DRMAA_ERRNO_SUCCESS) {
080: fprintf (stderr, "Could not wait for job: %s\n", error);
081: }
082: else {
083: char usage[DRMAA_ERROR_STRING_BUFFER];
084: int aborted = 0;
085:
086: drmaa_wifaborted(&aborted, status, NULL, 0);
087:
088: if (aborted == 1) {
089: printf("Job %s never ran\n", jobid);
090: }
091: else {
092: int exited = 0;
093:
094: drmaa_wifexited(&exited, status, NULL, 0);
095:
096: if (exited == 1) {
097: int exit_status = 0;
098:
099: drmaa_wexitstatus(&exit_status, status, NULL, 0);
100: printf("Job %s finished regularly with exit status %d\n",
101: jobid, exit_status);
102: }
103: else {
104: int signaled = 0;
105:
106: drmaa_wifsignaled(&signaled, status, NULL, 0);
107:
108: if (signaled == 1) {
109: char termsig[DRMAA_SIGNAL_BUFFER+1];
110:
111: drmaa_wtermsig(termsig, DRMAA_SIGNAL_BUFFER, status, NULL, 0);
112: printf("Job %s finished due to signal %s\n", jobid, termsig);
113: }
114: else {
115: printf("Job %s finished with unclear conditions\n", jobid);
116: }
117: } /* else */
118: } /* else */
119:
120: printf ("Job Usage:\n");
121:
122: while (drmaa_get_next_attr_value (rusage, usage, DRMAA_ERROR_STRING_BUFFER)
123: == DRMAA_ERRNO_SUCCESS) {
124: printf (" %s\n", usage);
125: }
126:
127: drmaa_release_attr_values (rusage);
128: } /* else */
129: } /* for */
130: } /* else */
131: } /* else */
132:
133: drmaa_release_job_ids (ids);
134: } /* else */
What's different is that on line 61, we set dispose to false, and then on
lines 68-130 we wait once for each job, printing the exit status and
usage information as we did in Example 3. We pass
DRMAA_JOB_IDS_SESSION_ANY to drmaa_wait() as the job id
because we already know that all the jobs have finished, so we don't
really care in what order we process them. In an interactive system
where we couldn't guarantee that more jobs wouldn't be submitted between
the synchronize and the wait, we would have to store the job ids from the
drmaa_run_bulk_jobs() in an array and then wait for each job specifically.
Otherwise, the drmaa_wait() could end up waiting for a job submitted after
the call to drmaa_synchronize().
Controlling a Job
Now let's look at an example of how to control a job from DRMAA:
Example 4
01: #include <stdio.h>
02: #include "drmaa.h"
03:
04: int main (int argc, char **argv) {
05: char error[DRMAA_ERROR_STRING_BUFFER];
06: int errnum = 0;
07: drmaa_job_template_t *jt = NULL;
08:
09: errnum = drmaa_init (NULL, error, DRMAA_ERROR_STRING_BUFFER);
10:
11: if (errnum != DRMAA_ERRNO_SUCCESS) {
12: fprintf (stderr, "Could not initialize the DRMAA library: %s\n", error);
13: return 1;
14: }
15:
16: errnum = drmaa_allocate_job_template (&jt, error, DRMAA_ERROR_STRING_BUFFER);
17:
18: if (errnum != DRMAA_ERRNO_SUCCESS) {
19: fprintf (stderr, "Could not create job template: %s\n", error);
20: }
21: else {
22: errnum = drmaa_set_attribute (jt, DRMAA_REMOTE_COMMAND, "sleeper.sh",
23: error, DRMAA_ERROR_STRING_BUFFER);
24:
25: if (errnum != DRMAA_ERRNO_SUCCESS) {
26: fprintf (stderr, "Could not set attribute \"%s\": %s\n",
27: DRMAA_REMOTE_COMMAND, error);
28: }
29: else {
30: const char *args[2] = {"60", NULL};
31:
32: errnum = drmaa_set_vector_attribute (jt, DRMAA_V_ARGV, args, error,
33: DRMAA_ERROR_STRING_BUFFER);
34: }
35:
36: if (errnum != DRMAA_ERRNO_SUCCESS) {
37: fprintf (stderr, "Could not set attribute \"%s\": %s\n",
38: DRMAA_REMOTE_COMMAND, error);
39: }
40: else {
41: char jobid[DRMAA_JOBNAME_BUFFER];
42:
43: errnum = drmaa_run_job (jobid, DRMAA_JOBNAME_BUFFER, jt, error,
44: DRMAA_ERROR_STRING_BUFFER);
45:
46: if (errnum != DRMAA_ERRNO_SUCCESS) {
47: fprintf (stderr, "Could not submit job: %s\n", error);
48: }
49: else {
50: printf ("Your job has been submitted with id %s\n", jobid);
51:
52: errnum = drmaa_control (jobid, DRMAA_CONTROL_TERMINATE, error,
53: DRMAA_ERROR_STRING_BUFFER);
54:
55: if (errnum != DRMAA_ERRNO_SUCCESS) {
56: fprintf (stderr, "Could not delete job: %s\n", error);
57: }
58: else {
59: printf ("Your job has been deleted\n");
60: }
61: }
62: } /* else */
63:
64: errnum = drmaa_delete_job_template (jt, error, DRMAA_ERROR_STRING_BUFFER);
65:
66: if (errnum != DRMAA_ERRNO_SUCCESS) {
67: fprintf (stderr, "Could not delete job template: %s\n", error);
68: }
69: } /* else */
70:
71: errnum = drmaa_exit (error, DRMAA_ERROR_STRING_BUFFER);
72:
73: if (errnum != DRMAA_ERRNO_SUCCESS) {
74: fprintf (stderr, "Could not shut down the DRMAA library: %s\n", error);
75: return 1;
76: }
77:
78: return 0;
79: }
This example is very similar to Example 2 except for lines 52-60. On line
52 we use drmaa_control() to delete the job we just submitted. Aside from
deleting the job, we could have also used drmaa_control() to suspend,
resume, hold, or release it. For more information, see the drmaa_control
man page.
Note that drmaa_control() can be used to control jobs not submitted
through DRMAA. Any valid SGE job id could be passed to drmaa_control() as
the id of the job to delete.
Getting Job Status
Here's an example of using DRMAA to query the status of a job:
Example 5
001: #include <stdio.h>
002: #include <unistd.h>
003: #include "drmaa.h"
004:
005: int main (int argc, char **argv) {
006: char error[DRMAA_ERROR_STRING_BUFFER];
007: int errnum = 0;
008: drmaa_job_template_t *jt = NULL;
009:
010: errnum = drmaa_init (NULL, error, DRMAA_ERROR_STRING_BUFFER);
011:
012: if (errnum != DRMAA_ERRNO_SUCCESS) {
013: fprintf (stderr, "Could not initialize the DRMAA library: %s\n", error);
014: return 1;
015: }
016:
017: errnum = drmaa_allocate_job_template (&jt, error, DRMAA_ERROR_STRING_BUFFER);
018:
019: if (errnum != DRMAA_ERRNO_SUCCESS) {
020: fprintf (stderr, "Could not create job template: %s\n", error);
021: }
022: else {
023: errnum = drmaa_set_attribute (jt, DRMAA_REMOTE_COMMAND, "sleeper.sh",
024: error, DRMAA_ERROR_STRING_BUFFER);
025:
026: if (errnum != DRMAA_ERRNO_SUCCESS) {
027: fprintf (stderr, "Could not set attribute \"%s\": %s\n",
028: DRMAA_REMOTE_COMMAND, error);
029: }
030: else {
031: const char *args[2] = {"60", NULL};
032:
033: errnum = drmaa_set_vector_attribute (jt, DRMAA_V_ARGV, args, error,
034: DRMAA_ERROR_STRING_BUFFER);
035: }
036:
037: if (errnum != DRMAA_ERRNO_SUCCESS) {
038: fprintf (stderr, "Could not set attribute \"%s\": %s\n",
039: DRMAA_REMOTE_COMMAND, error);
040: }
041: else {
042: char jobid[DRMAA_JOBNAME_BUFFER];
043:
044: errnum = drmaa_run_job (jobid, DRMAA_JOBNAME_BUFFER, jt, error,
045: DRMAA_ERROR_STRING_BUFFER);
046:
047: if (errnum != DRMAA_ERRNO_SUCCESS) {
048: fprintf (stderr, "Could not submit job: %s\n", error);
049: }
050: else {
051: int status = 0;
052:
053: printf ("Your job has been submitted with id %s\n", jobid);
054:
055: sleep (20);
056:
057: errnum = drmaa_job_ps (jobid, &status, error,
058: DRMAA_ERROR_STRING_BUFFER);
059:
060: if (errnum != DRMAA_ERRNO_SUCCESS) {
061: fprintf (stderr, "Could not get job status: %s\n", error);
062: }
063: else {
064: switch (status) {
065: case DRMAA_PS_UNDETERMINED:
066: printf ("Job status cannot be determined\n");
067: break;
068: case DRMAA_PS_QUEUED_ACTIVE:
069: printf ("Job is queued and active\n");
070: break;
071: case DRMAA_PS_SYSTEM_ON_HOLD:
072: printf ("Job is queued and in system hold\n");
073: break;
074: case DRMAA_PS_USER_ON_HOLD:
075: printf ("Job is queued and in user hold\n");
076: break;
077: case DRMAA_PS_USER_SYSTEM_ON_HOLD:
078: printf ("Job is queued and in user and system hold\n");
079: break;
080: case DRMAA_PS_RUNNING:
081: printf ("Job is running\n");
082: break;
083: case DRMAA_PS_SYSTEM_SUSPENDED:
084: printf ("Job is system suspended\n");
085: break;
086: case DRMAA_PS_USER_SUSPENDED:
087: printf ("Job is user suspended\n");
088: break;
089: case DRMAA_PS_USER_SYSTEM_SUSPENDED:
090: printf ("Job is user and system suspended\n");
091: break;
092: case DRMAA_PS_DONE:
093: printf ("Job finished normally\n");
094: break;
095: case DRMAA_PS_FAILED:
096: printf ("Job finished, but failed\n");
097: break;
098: } /* switch */
099: } /* else */
100: } /* else */
101: } /* else */
102:
103: errnum = drmaa_delete_job_template (jt, error, DRMAA_ERROR_STRING_BUFFER);
104:
105: if (errnum != DRMAA_ERRNO_SUCCESS) {
106: fprintf (stderr, "Could not delete job template: %s\n", error);
107: }
108: } /* else */
109:
110: errnum = drmaa_exit (error, DRMAA_ERROR_STRING_BUFFER);
111:
112: if (errnum != DRMAA_ERRNO_SUCCESS) {
113: fprintf (stderr, "Could not shut down the DRMAA library: %s\n", error);
114: return 1;
115: }
116:
117: return 0;
118: }
Again, this example is very similar to Example 2, this time with the
exception of lines 55-99. First, on line 55, after submitting the job, we
sleep for 20 seconds to give SGE time to schedule the job. Then, on line
57, we use drmaa_job_ps() to get the status of the job. Lines 64-98 determine
what the job status is and report it. This switch is a common usage
pattern for drmaa_job_ps() and should be encapsulated in a function for
ease of use.
Getting DRM Information
Lastly, let's look at how to query the DRMAA library for information about
the DRMS and the DRMAA implementation itself:
Example 6
01: #include <stdio.h>
02: #include "drmaa.h"
03:
04: int main (int argc, char **argv) {
05: char error[DRMAA_ERROR_STRING_BUFFER];
06: int errnum = 0;
07: char contact[DRMAA_CONTACT_BUFFER];
08: char drm_system[DRMAA_DRM_SYSTEM_BUFFER];
09: char drmaa_impl[DRMAA_DRM_SYSTEM_BUFFER];
10: unsigned int major = 0;
11: unsigned int minor = 0;
12:
13: errnum = drmaa_get_contact (contact, DRMAA_CONTACT_BUFFER, error,
14: DRMAA_ERROR_STRING_BUFFER);
15:
16: if (errnum != DRMAA_ERRNO_SUCCESS) {
17: fprintf (stderr, "Could not get the contact string list: %s\n", error);
18: }
19: else {
20: printf ("Supported contact strings: \"%s\"\n", contact);
21: }
22:
23: errnum = drmaa_get_DRM_system (drm_system, DRMAA_DRM_SYSTEM_BUFFER, error,
24: DRMAA_ERROR_STRING_BUFFER);
25:
26: if (errnum != DRMAA_ERRNO_SUCCESS) {
27: fprintf (stderr, "Could not get the DRM system list: %s\n", error);
28: }
29: else {
30: printf ("Supported DRM systems: \"%s\"\n", drm_system);
31: }
32:
33: errnum = drmaa_get_DRMAA_implementation (drmaa_impl, DRMAA_DRM_SYSTEM_BUFFER,
34: error, DRMAA_ERROR_STRING_BUFFER);
35:
36: if (errnum != DRMAA_ERRNO_SUCCESS) {
37: fprintf (stderr, "Could not get the DRMAA implementation list: %s\n", error);
38: }
39: else {
40: printf ("Supported DRMAA implementations: \"%s\"\n", drmaa_impl);
41: }
42:
43: errnum = drmaa_init (NULL, error, DRMAA_ERROR_STRING_BUFFER);
44:
45: if (errnum != DRMAA_ERRNO_SUCCESS) {
46: fprintf (stderr, "Could not initialize the DRMAA library: %s\n", error);
47: return 1;
48: }
49:
50: errnum = drmaa_get_contact (contact, DRMAA_CONTACT_BUFFER, error,
51: DRMAA_ERROR_STRING_BUFFER);
52:
53: if (errnum != DRMAA_ERRNO_SUCCESS) {
54: fprintf (stderr, "Could not get the contact string: %s\n", error);
55: }
56: else {
57: printf ("Connected contact string: \"%s\"\n", contact);
58: }
59:
60: errnum = drmaa_get_DRM_system (drm_system, DRMAA_DRM_SYSTEM_BUFFER, error,
61: DRMAA_ERROR_STRING_BUFFER);
62:
63: if (errnum != DRMAA_ERRNO_SUCCESS) {
64: fprintf (stderr, "Could not get the DRM system: %s\n", error);
65: }
66: else {
67: printf ("Connected DRM system: \"%s\"\n", drm_system);
68: }
69:
70: errnum = drmaa_get_DRMAA_implementation (drmaa_impl, DRMAA_DRM_SYSTEM_BUFFER,
71: error, DRMAA_ERROR_STRING_BUFFER);
72:
73: if (errnum != DRMAA_ERRNO_SUCCESS) {
74: fprintf (stderr, "Could not get the DRMAA implementation: %s\n", error);
75: }
76: else {
77: printf ("Connected DRMAA implementation: \"%s\"\n", drmaa_impl);
78: }
79:
80: errnum = drmaa_version (&major, &minor, error, DRMAA_ERROR_STRING_BUFFER);
81:
82: if (errnum != DRMAA_ERRNO_SUCCESS) {
83: fprintf (stderr, "Could not get the DRMAA version: %s\n", error);
84: }
85: else {
86: printf ("Using DRMAA version %u.%u\n", major, minor);
87: }
88:
89: errnum = drmaa_exit (error, DRMAA_ERROR_STRING_BUFFER);
90:
91: if (errnum != DRMAA_ERRNO_SUCCESS) {
92: fprintf (stderr, "Could not shut down the DRMAA library: %s\n", error);
93: return 1;
94: }
95:
96: return 0;
97: }
On line 13, we get the contact string list. This is the list of contact
strings that will be understood by this DRMAA instance. Normally one of
these strings is used to select to which DRM this DRMAA instance should
be bound. In the Grid Engine 6.0 implementation, the contact string list
is empty because there is only ever one possible DRM to which to bind.
On line 23, we get the list of supported DRM systems. For the Grid Engine
6.0 implementation, this will always be Grid Engine 6.0.
On line 33, we get the list of supported DRMAA implementations. For the
Grid Engine 6.0 implementation, this will always be Grid Engine 6.0.
On line 43, we call drmaa_init(). After drmaa_init() has been called,
drmaa_get_contact() and drmaa_get_DRM_system() report the values in use for
the current session rather than the lists of supported values.
On line 50, we call drmaa_get_contact() again, this time to get the
contact string that was used to bind to a DRMS in drmaa_init(). For the
Grid Engine 6.0 implementation, this will always be an empty string.
On line 60, we call drmaa_get_DRM_system() again, this time to get the
name of the DRMS to which DRMAA is bound. For the Grid Engine 6.0
implementation, this will always be Grid Engine 6.0.
On line 70, we call drmaa_get_DRMAA_implementation() again, this time to
get the name of the DRMAA implementation to which the application is
bound. For the Grid Engine 6.0 implementation, this will always be Grid
Engine 6.0.
On line 80, we get the version number of the DRMAA C binding specification
supported by this DRMAA implementation. For the Grid Engine 6.0
implementation this is currently version 0.8.
Finally, on line 89, we end the session with drmaa_exit().