Grid Engine can be configured to relocate a job. This is useful when
the execution host is a desktop system that should run jobs only while
its owner is not actively working at it. The setup uses the
checkpointing facility of Grid Engine to kill the job and restart it
elsewhere when the owner returns and moves the mouse or presses a key.
To set up this configuration, the following steps are needed:
1) Configure Grid Engine to track interactive idle time
2) Configure the checkpointing interface
3) Add the checkpoint ability to the appropriate queues
4) Set the load threshold in the queues to trigger the relocation
1) Configure Grid Engine to track interactive idle time
Please see the following application note: Tracking Interactive Idle Time
2) Configure the checkpointing interface
The checkpointing interface needs to be created. This can be done in qmon:
Click on the "Checkpoint Configuration" icon
Click "Add"
Name the checkpointing interface (this is the name to use when submitting the job)
Select the interface, USERDEFINED
Select the appropriate queues to attach the checkpointing interface to
Leave all other fields blank (a job that can really be checkpointed would need actual entries here)
Select "On Job Suspend" to relocate the job after it is suspended
Click "Ok"
Click "Done" to close this dialog box
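The same interface can presumably be defined without qmon by loading a
definition file with qconf -Ackpt. The following is a minimal sketch,
not a verified procedure. It assumes the checkpoint(5) file format of
Grid Engine 6.x (other releases may use additional fields such as
queue_list), the example name "reloc", and /tmp as the checkpoint
directory. The "when" value x is assumed to correspond to the "On Job
Suspend" choice in qmon. The script only writes the file unless qconf
is found on the PATH.

```shell
#!/bin/sh
# Sketch: define the checkpointing interface from a file instead of qmon.
# Assumptions: interface name "reloc", ckpt_dir /tmp, 6.x-style field list.
ckpt_file="${TMPDIR:-/tmp}/reloc.ckpt"

# All command fields stay "none" because the job is not really
# checkpointed; it is only killed and requeued.
cat > "$ckpt_file" <<'EOF'
ckpt_name          reloc
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /tmp
signal             none
when               x
EOF

# Talk to the cluster only if qconf is available on this machine.
if command -v qconf >/dev/null 2>&1; then
    # -Ackpt adds the interface from the file, -sckpt prints it back.
    qconf -Ackpt "$ckpt_file" && qconf -sckpt reloc
else
    echo "qconf not found; wrote $ckpt_file only" >&2
fi
```

Before relying on this, compare the file with the output of
qconf -sckpt for an interface created through qmon on the same
installation.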
3) Add the checkpoint ability to the appropriate queues
In qmon, modify the appropriate queues to give them the checkpointing ability:
Click "Queue Control"
Select a queue and click "Modify"
Under "Type" on the "General Configuration" tab, check "Checkpointing"
Click "Ok"
4) Set the load threshold in the queues to trigger the relocation
In qmon "Queue Control", select and modify an appropriate queue. On the
"Load/Suspend Thresholds" tab, add to the currently set load threshold
by clicking on the heading labelled "Load" and selecting the idle time
resource. Enter the desired value under the value column.
When submitting a job that is eligible to be moved, specify the
checkpointing interface. For example, if the interface created above
was named "reloc", the job would be submitted as follows:
qsub -ckpt reloc myjob.sh
The job is eligible to run in any queue that has the checkpointing
ability. If the job is subsequently suspended (for example, when the
queue it is running in is suspended because the owner's activity resets
the interactive idle time), it is killed and then requeued.
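Because the job is killed rather than truly checkpointed, a requeued
job starts again from the beginning unless the script saves its own
progress. The following is a minimal sketch of what myjob.sh could look
like, not part of the configuration above. It relies on the RESTARTED
environment variable described in qsub(1), which Grid Engine sets to 1
for a restarted job. The progress file STATE_FILE, its location, and
the five-step loop are hypothetical; the file is assumed to live on a
filesystem shared by all execution hosts.

```shell
#!/bin/sh
# myjob.sh: sketch of a job that tolerates being killed and requeued.
# STATE_FILE and TOTAL_STEPS are illustrative names, not Grid Engine settings.
STATE_FILE="${STATE_FILE:-${HOME:-/tmp}/myjob.progress}"
TOTAL_STEPS=5

start_step() {
    # Resume only if Grid Engine marks this run as a restart and an
    # earlier run left a progress file; otherwise begin at step 1.
    if [ "${RESTARTED:-0}" = "1" ] && [ -f "$STATE_FILE" ]; then
        cat "$STATE_FILE"
    else
        echo 1
    fi
}

step=$(start_step)
while [ "$step" -le "$TOTAL_STEPS" ]; do
    echo "running step $step"        # placeholder for the real work
    step=$((step + 1))
    echo "$step" > "$STATE_FILE"     # record the next step to run
done

# All steps finished: remove the progress file.
rm -f "$STATE_FILE"
```

A job that cannot save progress this way still works with the setup
described here; it simply repeats all of its work after each
relocation.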