Topic:
Checkpointing of serial jobs using the built-in checkpointing
support in SGE.
Author:
Reuti, reuti__at__staff.uni-marburg.de;
Philipps-University of Marburg, Germany
Version:
1.1a -- 2005-02-14 First updated release, comments and corrections
are welcome
Contents:
- Prerequisites
- The transparent interface
- The userdefined interface
- The application-level interface
- Integration with the Condor library
Note:
This HOWTO complements the information contained in the man pages
of SGE and the Administration Guide.
Acknowledgement:
Many thanks to Lip Kian NG (lkng__at__apstc.sun.com.sg)
from SUN Microsystems Singapore for reviewing this document and
eliminating typos and ambiguities.
Prerequisites
Configuration of SGE with qconf or the GUI
You should be familiar with modifying the SGE environment,
such as the queue definitions and the scheduler configuration.
Additional information on the SGE checkpoint interfaces is
available from the "checkpoint" and "sge_ckpt" man pages (make
sure the SGE man pages are defined in your $MANPATH).
While working through this HOWTO, you may find it convenient to
set the scheduler parameter "schedule_interval" to around
15 seconds, and optionally set:
SGE 5.3: the global configuration "schedd_params" to
"FLUSH_SUBMIT_SEC=0".
SGE 6.x: the scheduler configuration "flush_submit_sec"
to a value other than "0", for example a small value like "4", so
that this setting is honoured.
Available checkpointing interfaces in SGE
You can divide the available checkpointing interfaces into two
groups: they are either kernel based or need some support by the
running application, hence called user-level interfaces. This
document will show some examples on how to implement checkpointing
with the different user-level checkpoint interfaces. These
include:
Transparent interface
Application-level interface
Userdefined interface
Goal of checkpointing implementation in your applications
Some lengthy calculations can run for months. If the
calculation node dies due to a hardware failure or power loss, you
will lose all the calculation time and have to restart from the
beginning. With a checkpoint, you can at least restart
from that point of the calculation.
Location of the created checkpoint files
You should use a shared file system for the location of the
written checkpoint files. This way the created checkpoint files
will be accessible to the restarted job, which will most likely run
on another node. Depending on your application, this can simply be
the home directory of the executing user or, as used in this
document, /home/checkpoint. It can of course be a file
system mounted just for this purpose, but it shouldn't be a shared
scratch space without redundancy, such as one provided by
PVFS1.
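Before relying on a checkpoint directory, a job script can check that
the directory exists and is writable. The following is only a minimal
sketch: the function name ckpt_dir_usable is made up for this
illustration, and the check says nothing about whether the file system
is really shared between the nodes.

```shell
#!/bin/sh
# Sketch: verify that a checkpoint directory exists and is writable.
# ckpt_dir_usable is a hypothetical helper, not part of SGE.
ckpt_dir_usable() {
    dir=$1
    # The directory must exist ...
    [ -d "$dir" ] || return 1
    # ... and we must be able to create a file in it. The subshell keeps
    # a failed redirection from terminating the calling script.
    probe="$dir/.ckpt_probe.$$"
    ( : > "$probe" ) 2>/dev/null || return 1
    rm -f "$probe"
    return 0
}
```

In a job script this could be called as
ckpt_dir_usable "$SGE_CKPT_DIR" || exit 1
before any work is started.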
Interval for the created checkpoints
Depending on your application, you can simply create
checkpoints after each critical step of the calculation without any
timed interval. If it's more suitable, SGE can trigger the
creation of checkpoints after a timed interval. The default
setting in the queue definition for this behaviour is 5 minutes,
which you should consider only as a setting for testing the SGE
setup. For real applications that run for weeks, it may be
sufficient to create a checkpoint every 6 to 24 hours. With
this setting, the amount of data written to the checkpoint
directory will not noticeably slow down other traffic to the
shared file system. Also, you will lose at most the
last 6 to 24 hours of calculations.
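The interval in the queue definition is written as hours, minutes and
seconds (the example in this document uses 00:05:00). As a small
sketch, a chosen interval in seconds can be converted to that form as
follows; the helper name to_hms is an assumption made for this
illustration.

```shell
#!/bin/sh
# Sketch: convert seconds into the HH:MM:SS form used for
# "min_cpu_interval". to_hms is a hypothetical helper name.
to_hms() {
    secs=$1
    printf '%02d:%02d:%02d\n' \
        $(($secs / 3600)) $(($secs % 3600 / 60)) $(($secs % 60))
}

to_hms $((6 * 3600))     # six hours
to_hms $((24 * 3600))    # one day
```

The chosen value is also the upper bound for the amount of calculation
time that is lost when a node fails just before the next checkpoint.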
Supplied files
All scripts and files for the examples can be found in the
following archive: Checkpoint_Howto_Examples.tar.gz.
The paths in the scripts have to be adjusted to your installation
of SGE and to the directory in which you intend to run the scripts.
State transition diagrams
The behaviour of the checkpoint interfaces can be described as
a finite state machine. State transition diagrams can be found in the
Howto "Checkpointing under Linux with Berkeley Lab
Checkpoint/Restart" on the Howto page at
gridengine.sunsource.net.
The transparent interface
Application Characteristic
This interface is used for applications that have built-in
checkpointing support based on listening to a UNIX signal.
Behaviour of this interface
This interface will send a signal to the running job to
initiate a checkpoint, according to the "min_cpu_interval" setting
in the queue definition. Hence you have to define values for
"signal", "ckpt_dir" and "min_cpu_interval". Not all parameters in
the definition of the checkpointing interface will be honoured.
The needed entries for the checkpointing interface:
$ qconf -sckpt check_transparent
ckpt_name check_transparent
interface transparent
ckpt_command none
migr_command none
restart_command none
clean_command none
ckpt_dir /home/checkpoint
queue_list all
signal usr2
when xmr
Here the "when" condition for the creation of a checkpoint file
is set to "xmr". So a checkpoint will be created by sending the
signal usr2 (to the job, i.e. the whole process group) when the
specified time interval (which is defined in the queue definition)
has elapsed, or when the node goes offline. It's not possible to
initiate a checkpoint just before the migration of the job, but we
set the "x" anyway to get the job at least restarted. A checkpoint
just before the migration can only be made in a migration script,
which is available for application-level checkpointing. For
demonstration purposes, we set the time interval in the queue to
5 minutes (for testing you can also lower the value to around
15 seconds). This is the default, but it is not useful for
production, where a checkpoint every few hours is sufficient. The
queue should contain the setting:
$ qconf -sq vast00
...
min_cpu_interval 00:05:00
...
The setting for the availability of the checkpointing interface
depends on the version of SGE in use.
SGE 5.3: The queue type must be set to CHECKPOINTING,
and the intended queues added to the queue list in the definition
of the checkpointing interface.
SGE 6.x: Here the definition works the other way round, so
you have to add the name of the checkpointing interface to the
queue definition. The type of the queue will automatically be set
to support checkpointing.
First script "checkpoint_1.sh" (example 1)
Having this setting, let's start with just a bash script,
demonstrating the checkpoint behaviour. For now we simply write a
file to the directory pointed to by $SGE_CKPT_DIR. This is set by
SGE from the above setting of the checkpointing interface:
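#!/bin/bash
# check_transparent1.sh
# Append a timestamp to the checkpoint file whenever SGE sends SIGUSR2.
trap 'date >> "$SGE_CKPT_DIR/checkpoint_1"' usr2
echo "Script started."
# The arithmetic for loop is a bash feature, hence /bin/bash above.
for ((i=0; i<1000; i++)) ; do
    sleep 1
done
echo "Script finished."
exit 0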
Although no actual checkpoint is created yet, you can
have a look at the created files in $SGE_CKPT_DIR and at the
output of the script. After some time has elapsed, you will find the
following:
$ qsub -ckpt check_transparent check_transparent1.sh
your job 8005 ("check_transparent1.sh") has been submitted
$ ls /home/checkpoint/
checkpoint_1
$ cat /home/checkpoint/checkpoint_1
Tue Dec 21 23:03:35 WET 2004
Tue Dec 21 23:08:35 WET 2004
Okay, this is working. So we may now implement a more
useful version.
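The trap mechanism of example 1 can also be tried without SGE by
sending the signal by hand. The sketch below assumes a POSIX shell;
the function name demo_usr2_trap is made up, and its argument stands
in for the directory which SGE would normally provide in
$SGE_CKPT_DIR.

```shell
#!/bin/sh
# Sketch: exercise a USR2 checkpoint trap without SGE.
# demo_usr2_trap is a hypothetical helper, not part of SGE.
demo_usr2_trap() {
    SGE_CKPT_DIR=$1
    trap 'date >> "$SGE_CKPT_DIR/checkpoint_1"' USR2

    # Signal this shell twice, as SGE would do once per interval.
    # $$ is the PID of the main shell, so do not call this function
    # inside $(...) or any other subshell.
    kill -s USR2 $$
    kill -s USR2 $$

    # Report how many lines the trap has appended so far.
    lines=$(grep -c '' "$SGE_CKPT_DIR/checkpoint_1")
    trap - USR2    # restore the default action
    echo "$lines"
}
```

Each delivered signal should add one timestamp line to the file, just
like the two lines in the listing of checkpoint_1 above.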
Second script "checkpoint_2.sh" (example 2)
For this we have to write some more valuable information to the
checkpoint file.
#!/bin/bash
# check_transparent2.sh
# On SIGUSR2, save the current loop value as the checkpoint.
trap 'echo $ACTUAL_VALUE > "$SGE_CKPT_DIR/checkpoint_2"' usr2
#
# Check whether we are restarted and a checkpoint file is already available.
#
if [ "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_DIR/checkpoint_2" -a -r "$SGE_CKPT_DIR/checkpoint_2" ] ; then
    read ACTUAL_VALUE < "$SGE_CKPT_DIR/checkpoint_2"
    echo "Script restarted with value $ACTUAL_VALUE."
else
    ACTUAL_VALUE=1
    echo "Script started."
fi
#
# Start of the program.
#
while [ "$ACTUAL_VALUE" -le 1000 ] ; do
    echo "Processing $ACTUAL_VALUE."
    let ACTUAL_VALUE++   # bash builtin, hence /bin/bash above
    sleep 1
done
echo "Script finished."
exit 0
This time we write the current value of
the loop into the checkpoint file. This is not foolproof
(if the node crashes at the very moment we write a new
version of the checkpoint file, we are out of luck), but for
now it's sufficient. So let's look at the created file after at
least five minutes:
$ ls ../checkpoint/checkpoint_2
../checkpoint/checkpoint_2
$ cat ../checkpoint/checkpoint_2
292
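The weakness mentioned above, a crash while the checkpoint file is
being rewritten, can be reduced by writing the new state to a
temporary file first and renaming it afterwards. This is only a sketch
of that common technique, not what the example scripts do; the helper
name write_checkpoint is hypothetical, and the temporary file has to
be located on the same file system as the checkpoint file.

```shell
#!/bin/sh
# Sketch: replace the checkpoint file in one step.
# write_checkpoint is a hypothetical helper, not part of SGE.
write_checkpoint() {
    value=$1
    file=$2
    tmp="$file.tmp.$$"
    # Write the new state to a temporary file in the same directory ...
    echo "$value" > "$tmp" || return 1
    # ... and rename it over the old checkpoint. A rename within one
    # file system replaces the old file in a single step, so a reader
    # finds either the complete old or the complete new version.
    mv -f "$tmp" "$file"
}

# In example 2 the trap line could then read (shown as a comment only):
# trap 'write_checkpoint "$ACTUAL_VALUE" "$SGE_CKPT_DIR/checkpoint_2"' usr2
```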
This seems to work, and we can now try to trigger the restart
of the job. Instead of pulling the power cord out of the node, we
can use the qmod command which, if the checkpointing interface is
set up correctly, will kill the running job and restart it
(i.e. migrate the job).
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
8016 0 check_tran reuti r 12/22/2004 00:14:21 vast15 MASTER
$ qmod -s 8016
reuti - suspended job 8016
After a while, you will notice that the job has been restarted and
is possibly running on a different node than before:
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
8016 0 check_tran reuti Rr 12/22/2004 00:24:06 vast11 MASTER
You will notice that this job has the same job number as the
original one, but the status field "Rr" shows that this
job was restarted. When you look at the generated output file, you
will see something like this:
$ cat check_transparent2.sh.o8016
...
Processing 570.
Processing 571.
Script restarted with value 292.
Processing 292.
Processing 293.
Processing 294.
Processing 295.
Processing 296.
...
As expected, processing continues from the last state saved
in the checkpoint file. So we can now move the checkpoint
creation into a real program, and reduce the shell script to
some setup which allows handling more than one
checkpoint file per user.
Third script "checkpoint_3.sh" (example 3)
We should create a directory in $SGE_CKPT_DIR and pass this
information to the program, to support more than one
checkpointing program at a time. Since every signal is sent
to the whole process group, we have to ignore the signal in the
shell and handle it only in the called program.
#!/bin/sh
# check_transparent3.sh
trap '' usr2
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
if [ \! -e "$SGE_CKPT_JOB" ] ; then
    mkdir "$SGE_CKPT_JOB"
fi
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
#
# Check whether we are restarted.
#
if [ "$RESTARTED" -eq "1" ] ; then
    /home/reuti/checkpoint_program3 -r -d "$SGE_CKPT_JOB"
else
    /home/reuti/checkpoint_program3 -d "$SGE_CKPT_JOB"
fi
rm -rf "$SGE_CKPT_JOB"
exit 0
It may not be necessary to give extra options to the program at
all, as all environment variables are also exported to the
program. But handling it in the shell will keep the programs
flexible to run with other queuing systems if necessary and adjust
only the shell scripts. A sample program in C follows in the file
checkpoint_program3.c. Please compile with:
$ gcc -o checkpoint_program3 checkpoint_program3.c -lm
The key points in the program are:
- Check for valid checkpoint directory
- Read the old checkpoint file if requested
- Install a signal handler to process the checkpoint
request
- Processing
- Output of the results
In this case, the program can also run unchanged without SGE
or any checkpointing at all; checkpointing is just an option of
the program. To always have only one checkpoint file in the given
directory, the signal handler routine removes the old one
only if a new version could be written.
Let's submit the job and try to move it to a different machine.
We may get the following output:
$ qsub -ckpt check_transparent check_transparent3.sh
your job 8149 ("check_transparent3.sh") has been submitted
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
8149 0 check_tran reuti r 12/24/2004 11:08:00 vast12 MASTER
The job is running, and we may check the checkpoint directory,
whether any file was written up to now.
$ ls /home/checkpoint/
8149
$ ls /home/checkpoint/8149/
$
Although there is none so far, we may try to move the job to a
different machine anyway, because this situation can also arise
when a node simply crashes.
$ qmod -s 8149
reuti - suspended job 8149
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
8149 0 check_tran reuti Rr 12/24/2004 11:09:09 vast15 MASTER
$ cat check_transparent3.sh.e8149
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
The program discovered the absence of a valid checkpoint file
and therefore just starts at the beginning. We will now wait
until a valid file has been written.
$ ls /home/checkpoint/8149/
checkpoint_3
$ cat check_transparent3.sh.e8149
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 297.
$ qmod -s 8149
reuti - suspended job 8149
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
8149 0 check_tran reuti Rr 12/24/2004 11:14:53 vast12 MASTER
$ cat check_transparent3.sh.e8149
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 297.
I will try to restart...
Checkpoint file read. Recalculation starts at 297.
This time the checkpoint file was really processed. The program
starts again at 297, using the already calculated results. After
the job has finished, we find the whole checkpoint processing log
in the created error file:
$ cat check_transparent3.sh.e8149
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 297.
I will try to restart...
Checkpoint file read. Recalculation starts at 297.
Checkpoint creation initiated.
Checkpoint file created to restart with 594.
Checkpoint creation initiated.
Checkpoint file created to restart with 892.
Although the checkpoints written later weren't used, the whole
procedure is working, and the directory created for the
job-specific checkpoint files was also deleted correctly:
$ ls /home/checkpoint/
$
We may now move on to the next available interface in SGE, with
a slightly different behaviour.
The userdefined interface
Application Characteristic
This interface is used for applications that have built-in
checkpoint support, which checkpoint automatically based on
the application's internal logic and do not depend on external
factors to initiate a checkpoint.
Behaviour of this interface (example 4)
This interface is similar to the transparent interface, but in
this case there is no need to specify a signal at all (specifying
a signal will lead to the same behaviour as the transparent
interface):
$ qconf -sckpt check_userdefined
ckpt_name check_userdefined
interface userdefined
ckpt_command none
migr_command none
restart_command none
clean_command none
ckpt_dir /home/checkpoint
queue_list all
signal none
when xr
The creation of the checkpoints is up to the working program.
There will be no external trigger by SGE to create any checkpoint
file at all (unless you specify the signal). Even with this setup,
using checkpointing is still advantageous, because SGE will
restart your job and pass this information along to the job.
We can use nearly the same script as in example 3 to start the
job (without the trap command for the signal), but the
program checkpoint_program4.c has to decide when to write a
checkpoint. In a real program this may be after each
critical step. This may also be easier to implement than writing
all the data of the running program and its state (maybe in
the form of a finite state machine) to a file.
#!/bin/sh
# check_userdefined4.sh
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
if [ \! -e "$SGE_CKPT_JOB" ] ; then
    mkdir "$SGE_CKPT_JOB"
fi
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
#
# Check whether we are restarted.
#
if [ "$RESTARTED" -eq "1" ] ; then
    /home/reuti/checkpoint_program4 -r -d "$SGE_CKPT_JOB"
else
    /home/reuti/checkpoint_program4 -d "$SGE_CKPT_JOB"
fi
rm -rf "$SGE_CKPT_JOB"
exit 0
You may notice that here the option -d also triggers the
creation of the checkpoint file within the program. As we have
already walked through the created output in example 3, here is
just the output of the issued commands:
$ qsub -ckpt check_userdefined check_userdefined4.sh
your job 8150 ("check_userdefined4.sh") has been submitted
$ qstat -u reuti
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
8150 0 check_user reuti r 12/24/2004 11:43:15 vast12 MASTER
$ ls /home/checkpoint/
8150
$ ls /home/checkpoint/8150/
$ qmod -s 8150
reuti - suspended job 8150
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
8150 0 check_user reuti Rr 12/24/2004 11:47:31 vast22 MASTER
$ cat check_userdefined4.sh.e8150
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
$ ls /home/checkpoint/8150/
checkpoint_4
$ cat check_userdefined4.sh.e8150
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 300.
Checkpoint creation initiated.
Checkpoint file created to restart with 600.
$ qmod -s 8150
reuti - suspended job 8150
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
8150 0 check_user reuti Rr 12/24/2004 11:59:46 vast12 MASTER
$ cat check_userdefined4.sh.e8150
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 300.
Checkpoint creation initiated.
Checkpoint file created to restart with 600.
I will try to restart...
Checkpoint file read. Recalculation starts at 600.
$ qstat
$ cat check_userdefined4.sh.e8150
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Checkpoint creation initiated.
Checkpoint file created to restart with 300.
Checkpoint creation initiated.
Checkpoint file created to restart with 600.
I will try to restart...
Checkpoint file read. Recalculation starts at 600.
Checkpoint creation initiated.
Checkpoint file created to restart with 900.
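The idea of this interface, a program which decides on its own when to
save its state, can be sketched in a few lines of shell. This is not
the code of checkpoint_program4.c; the function run_steps and its
three parameters are assumptions made for this illustration.

```shell
#!/bin/sh
# Sketch of application-driven checkpointing: the loop itself decides
# when to save its state. run_steps is a hypothetical helper.
run_steps() {
    last=$1        # final step number
    every=$2       # write a checkpoint whenever the step is a multiple
    ckpt=$3        # checkpoint file

    # Resume from the checkpoint if one exists, otherwise start at 1.
    if [ -r "$ckpt" ]; then
        read -r step < "$ckpt"
    else
        step=1
    fi

    while [ "$step" -le "$last" ]; do
        # ... one unit of real work would go here ...
        step=$(($step + 1))
        if [ $(($step % $every)) -eq 0 ]; then
            # Save the next step to do; write, then rename.
            echo "$step" > "$ckpt.new" && mv -f "$ckpt.new" "$ckpt"
        fi
    done
    echo "$step"
}
```

Called as run_steps 1000 300 "$SGE_CKPT_JOB/checkpoint_4", the sketch
writes checkpoints for the values 300, 600 and 900, like the log above.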
The application-level interface
Application Characteristics
This interface is usually used for applications that require an
external process to initiate a successful checkpoint, such as
passing a message, moving the checkpoint file to the shared
directory etc.
Behaviour of this interface (example 5)
This interface will execute the defined commands for
ckpt_command, migr_command and clean_command. The restart_command
will not be honoured, since it is reserved for the
kernel-level checkpointing interfaces. You may also specify just
signal names instead of procedures, but in our example we will use
procedures to take care of the checkpoint files. Optionally,
you may also specify a signal, which we do not use here either.
$ qconf -sckpt check_application-level
ckpt_name check_application-level
interface application-level
ckpt_command /home/reuti/check/checkpoint.sh $ckpt_dir $job_id
migr_command /home/reuti/check/migrate.sh $job_pid $ckpt_dir $job_id
restart_command none
clean_command /home/reuti/check/clean.sh $ckpt_dir $job_id
ckpt_dir /home/checkpoint
queue_list all
signal none
when xmr
In this setup, the command defined for "ckpt_command" will be
called to create a checkpoint file each time the
"min_cpu_interval" has elapsed. The "migr_command" procedure will
be executed if you suspend the job or the queue in which the job
is running (remember that this can be done by hand using qmod, or
automatically when a "suspend_thresholds" value set for the
queue is exceeded). Finally, the "clean_command" will remove the
created checkpoint files in case of a qdel or successful completion
of the job.
Having such an interface may still be useful if you only have the
binary of a program which creates checkpoint files or
can use some kind of scratch file for a restart, but always
writes it to a temporary directory like $TMPDIR, and you have no
option to rename it or save it in a different location. Another
use of this interface is the case where
checkpoint or scratch files are written too often or are too big.
In these cases it's advantageous to write to a local file system
first, to minimize the network traffic. The checkpointing procedure
can then copy the local checkpoint file from e.g. $TMPDIR to the
shared checkpoint directory.
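Such a copy step could, as a rough sketch, first copy to a temporary
name in the shared directory and only rename the file once the copy
has been verified, so that a half-written copy never replaces a good
one. The function name publish_checkpoint is hypothetical, and a
simple comparison like this can't prove that the application had
finished writing the source file.

```shell
#!/bin/sh
# Sketch: publish a locally written checkpoint to the shared directory.
# publish_checkpoint is a hypothetical helper, not part of SGE.
publish_checkpoint() {
    src=$1      # e.g. a file below $TMPDIR on the local disk
    dst=$2      # e.g. a file below the shared checkpoint directory
    tmp="$dst.part.$$"

    [ -r "$src" ] || return 1          # nothing to copy yet
    cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }
    # The application may rewrite $src while we copy; accept the copy
    # only if it is still identical to the source afterwards.
    if cmp -s "$src" "$tmp"; then
        mv -f "$tmp" "$dst"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

A caller can simply repeat the call a few times if it returns a
non-zero status because the source changed during the copy.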
For the example, we will reuse a slightly modified version of the
program from example 4, which writes a checkpoint after every 5
steps. Since this is too often to be put on the shared checkpoint
directory, we will use the script defined for the creation of
checkpoints in the application-level interface to copy this file
less often to the shared checkpoint directory. For the restart,
this copied file will be used, because all the files in the scratch
directory $TMPDIR are of course deleted when the job is removed
from the node. Although the copy of the checkpoint file made every
"min_cpu_interval" may be sufficient, we can get a more recent
version when the migrate command is started and performs another
checkpoint copy. In case of a failure of the node, this won't be
possible of course. So the necessary checkpoint.sh script may look
like:
#!/bin/sh
#
# checkpoint.sh
#
#
# Copy a checkpoint file from the scratch directory of the
# job to the checkpoint directory.
#
me=`basename $0`
# test number of args
if [ $# -ne 2 ]; then
    echo "$me: got wrong number of arguments" >&2
    exit 1
fi
SGE_CKPT_JOB=$1/$2
if [ \! -e "$SGE_CKPT_JOB" ] ; then
    mkdir "$SGE_CKPT_JOB"
fi
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
while [ -z "$DONE" ] ; do
    cp "$TMPDIR/checkpoint_5" "$SGE_CKPT_JOB/checkpoint" 2>/dev/null
    if [ "$?" -eq "0" ] ; then
        diff "$TMPDIR/checkpoint_5" "$SGE_CKPT_JOB/checkpoint" 1>/dev/null 2>&1
        if [ "$?" -eq "0" ] ; then
            DONE="1"
        fi
    else
        DONE="1"
    fi
done
exit 0
To get a valid copy of the checkpoint file, we do a simple test
with diff here. You may need a different form of the copy
command for your real application. The two arguments given to the
procedure are used to build the name of the checkpoint
directory, which is created in the first call to the procedure. The
next procedure initiates the migration of the job. With this
type of checkpoint interface, you have to kill the job on your
own. We do this with a kill -9 to the whole process group.
#!/bin/sh
#
# migrate.sh
#
me=`basename $0`
# test number of args
if [ $# -ne 3 ]; then
    echo "$me: got wrong number of arguments" >&2
    exit 1
fi
#
# Get the current checkpoint besides the regular copied one.
#
/home/reuti/check/checkpoint.sh $2 $3
#
# Now kill the job with the whole process group.
#
kill -9 -- -$1
The last step is to remove the created checkpoint directory
after the job. This is done in an appropriate procedure which will
be executed at the end of the job.
#!/bin/sh
#
# clean.sh
#
#
# Delete the checkpoint directory for this job.
#
me=`basename $0`
# test number of args
if [ $# -ne 2 ]; then
    echo "$me: got wrong number of arguments" >&2
    exit 1
fi
SGE_CKPT_JOB=$1/$2
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
rm -rf "$SGE_CKPT_JOB"
exit 0
Having defined this all, we may submit a script for this type
of checkpointing interface with:
#!/bin/sh
# check_application-level5.sh
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
#
# Check whether we are restarted.
#
SGE_CKPT_FILE=$SGE_CKPT_JOB/checkpoint
if [ "$RESTARTED" -eq "2" -a -e "$SGE_CKPT_FILE" -a -r "$SGE_CKPT_FILE" ] ; then
    /home/reuti/checkpoint_program5 -r "$SGE_CKPT_FILE" -d "$TMPDIR"
else
    /home/reuti/checkpoint_program5 -d "$TMPDIR"
fi
exit 0
Please notice that in this case the value of $RESTARTED is 2.
We also take care of the situation where a node crashed before the
first checkpoint file was successfully created. During the
restart, you may look into the defined checkpoint directory and
check whether the job's subdirectory contains a
checkpoint file. After the job completes or gets deleted (by
qdel), the job's checkpoint directory is removed by
the procedure clean.sh. For this example, ensure that the correct
name of the checkpointing interface is specified, as it's different
from the previous examples:
$ qsub -ckpt check_application-level check_application-level5.sh
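The restart decision of the job script can be put into a small
function, which makes it possible to try it out without SGE. This is
only a sketch: the function name program_args is hypothetical, and it
prints the arguments instead of starting the program. It compares
$RESTARTED as a string, so an unset variable (as when the script is
run by hand) does not lead to an error message from the test command.

```shell
#!/bin/sh
# Sketch: decide between a fresh start and a restart.
# program_args is a hypothetical helper; it only prints the arguments.
program_args() {
    restarted=$1     # value of $RESTARTED as seen by the job script
    ckpt_file=$2     # checkpoint copy in the shared directory
    workdir=$3       # scratch directory, $TMPDIR in the example

    # Restart only if SGE reports a restart AND a readable copy exists.
    if [ "$restarted" = "2" ] && [ -r "$ckpt_file" ]; then
        echo "-r $ckpt_file -d $workdir"
    else
        echo "-d $workdir"
    fi
}
```

In the job script it could be called as
program_args "$RESTARTED" "$SGE_CKPT_FILE" "$TMPDIR".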
Integration with the Condor library
Checkpointing with Condor in general
Another queuing system is Condor, which in addition supplies
built-in checkpointing libraries (http://www.cs.wisc.edu/condor).
It is possible to use just the Condor libraries in a
standalone mode, without any queuing system at all or with another
one. This way you can still use SGE and add the checkpointing
facility of Condor. After downloading the appropriate Condor
version for your system, you have to install it in personal mode
just to get access to the compiler. So you may install it in a
subdirectory inside your home directory, e.g. ~/local:
$ ls -d1 condor*
condor-6.6.7-linux-x86-redhat80.tgz
$ tar -xzf condor-6.6.7-linux-x86-redhat80.tgz
$ ls -d1 condor*
condor-6.6.7
condor-6.6.7-linux-x86-redhat80.tgz
$ cd condor-6.6.7/
$ ./condor_configure --install=/home/reuti/condor-6.6.7/release.tar \
--install-dir=/home/reuti/local/condor-6.6.7 --make-personal-condor --verbose
$ cd
Having done this, you can remove the unpacked installation
directory of Condor again. Now you have to execute two commands to
get access to the compiler. These can be put in any of the usual
files which are sourced during login.
$ export CONDOR_CONFIG=/home/reuti/local/condor-6.6.7/etc/condor_config
$ export PATH=/home/reuti/local/condor-6.6.7/bin:$PATH
If everything was installed successfully, it's now already possible
to use the compiler. Although further details can be looked up in
the online Condor manual, we just want to compile a program to see
whether it works this way:
$ cat ever.c
int main(void)
{
    float x;
    long i;
    for (;;)
    {
        for (i=0;i<=100000;i++)
            x=3.1415926*i+i+i*i*2.7182818;
    }
    return 0;
}
$ condor_compile gcc ever.c -o ever
...
$ ls -l ever
-rwxr-xr-x 1 reuti users 15199008 Dec 25 22:18 ever
$ strip ever
$ ls -l ever
-rwxr-xr-x 1 reuti users 1332132 Dec 25 22:18 ever
Usually the executables created by Condor are really big, and
can be reduced in size with the command strip, as used in the above
example. Before integrating this into SGE, we want to execute the
short example to see it working.
$ ./ever
Condor: Notice: Will checkpoint to ./ever.ckpt
Condor: Notice: Remote system calls disabled.
^C
$ ./ever -_condor_restart ever.ckpt
Condor: Notice: Will restart from ever.ckpt
Before we kill the program with Ctrl-C, we log in to another
terminal session on the machine and send a "kill -usr2
<pid>" to the executing "ever" program. SIGUSR2 is the
signal used by Condor to force the creation of a checkpoint
file. The process list can be retrieved with "ps -e f". By default
the checkpoint is written to a file with the name of the
executable with .ckpt appended, but a different name can be
forced with the option "-_condor_ckpt <filename>". In
case of a restart, it is sufficient to give the option
"-_condor_restart <filename>", because this will also set the
filename for further checkpoints. In fact, an additional option
"-_condor_ckpt <filename>" will be ignored in this case.
There are also function calls available in Condor to be used in
a program, which are explained in section 4.2.4
(Checkpoint Library Interface) of the Condor user manual.
Please notice that there are some restrictions on the
programs which can be compiled with Condor; a full list of these
can be found in section 1.4 (Current limitations) of the Condor
user manual.
Integrating Condor with the transparent checkpoint interface
(example 6)
Now we will use a plain C program "condor_program6.c" and
integrate it into SGE. Because we are not using any library
functions of Condor in this case, we can put the whole restart
procedure in the starting script:
#!/bin/sh
# condor_transparent6.sh
trap '' usr2
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
if [ \! -e "$SGE_CKPT_JOB" ] ; then
    mkdir "$SGE_CKPT_JOB"
fi
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
#
# Check whether we are restarted.
#
SGE_CKPT_FILE=$SGE_CKPT_JOB/checkpoint_6
if [ "$RESTARTED" -eq "1" -a -e "$SGE_CKPT_FILE" -a -r "$SGE_CKPT_FILE" ] ; then
    /home/reuti/condor_program6 -_condor_restart "$SGE_CKPT_FILE"
else
    /home/reuti/condor_program6 -_condor_ckpt "$SGE_CKPT_FILE"
fi
rm -rf "$SGE_CKPT_JOB"
exit 0
The corresponding program can be compiled with
$ condor_compile gcc condor_program6.c -o condor_program6 -lm; strip condor_program6
You will notice that this program is really short and without
any reference to checkpointing at all, because this is handled
by the Condor library. Because the "sleep(1)" call used in the
previous examples is not allowed in Condor programs, we
have to replace it with an active loop. We can reuse the
transparent interface already defined for the above examples and
submit the job with:
$ qsub -ckpt check_transparent condor_transparent6.sh
your job 8188 ("condor_transparent6.sh") has been submitted
After the first creation of a checkpoint file, we can try to
suspend this job to move it to a different node.
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
8188 0 condor_tra reuti r 12/25/2004 16:24:32 vast11 MASTER
$ ls ../checkpoint/8188/
checkpoint_6
$ qmod -s 8188
reuti - suspended job 8188
$ qstat
job-ID prior name user state submit/start at queue master ja-task-ID
---------------------------------------------------------------------------------------------
8188 0 condor_tra reuti Rr 12/25/2004 16:30:50 vast12 MASTER
$ cat condor_transparent6.sh.e8188
Condor: Notice: Will checkpoint to /home/checkpoint/8188/checkpoint_6
Condor: Notice: Remote system calls disabled.
Condor: Notice: Will restart from /home/checkpoint/8188/checkpoint_6
This works seamlessly with your applications. Because Condor
will try to use exactly the same environment and files in case of
a restart as during the original run, you can't use any files in
$TMPDIR, because this directory will have a different name when the
job starts on the next node. You are therefore limited to your home
directory, any other directory shared across the nodes, or
a shared scratch space like that supplied by PVFS{1,2}.
Integrating Condor library functions with the transparent
checkpoint interface (example 7)
You are a little more flexible in using Condor when you
integrate the library functions directly into your program. This way
the job script contains no hint of Condor at all:
#!/bin/sh
# condor_transparent7.sh
trap '' usr2
SGE_CKPT_JOB=$SGE_CKPT_DIR/$JOB_ID
if [ \! -e "$SGE_CKPT_JOB" ] ; then
    mkdir "$SGE_CKPT_JOB"
fi
if [ \! -d "$SGE_CKPT_JOB" ] ; then
    echo "Checkpoint subdirectory couldn't be found."
    exit 1
fi
#
# Check whether we are restarted.
#
if [ "$RESTARTED" -eq "1" ] ; then
    /home/reuti/condor_program7 -r -d "$SGE_CKPT_JOB"
else
    /home/reuti/condor_program7 -d "$SGE_CKPT_JOB"
fi
rm -rf "$SGE_CKPT_JOB"
exit 0
This time we have to supply the necessary functions in the
program. The lines with access to the Condor library are
"init_image_with_file_name(checkpoint_read);" and "restart();".
The generation of a checkpoint is still initiated with a SIGUSR2
signal, so we don't have to program it by
hand. A sample execution may give these results in the standard
error file:
$ cat condor_transparent7.sh.e8189
Condor: Notice: Will checkpoint to /home/reuti/condor_program7.ckpt
Condor: Notice: Remote system calls disabled.
Update: Condor: Notice: Will checkpoint to /home/checkpoint/8189/checkpoint_7
Condor: Notice: Will checkpoint to /home/reuti/condor_program7.ckpt
Condor: Notice: Remote system calls disabled.
Update: Condor: Notice: Will checkpoint to /home/checkpoint/8189/checkpoint_7
I will try to restart...
No checkpoint file written up to now. Restart from the beginning.
Condor: Notice: Will checkpoint to /home/reuti/condor_program7.ckpt
Condor: Notice: Remote system calls disabled.
Update: Condor: Notice: Will checkpoint to /home/checkpoint/8189/checkpoint_7
I will try to restart...
Because no checkpoint file had been written at the first suspend
of the job, the program discovers this and starts at the beginning.
The later suspend restarted the program.
Integrating Condor library functions with the userdefined
checkpoint interface (example 8)
This way we can also control the creation
of a checkpoint file in the program instead of relying on the SIGUSR2
method. The necessary call is "ckpt();" in the routine which,
without the Condor library, handled the writing of the
checkpoint file all by itself. After compiling the program
condor_program8.c in the same fashion as in the previous examples,
you have to submit the job with the
checkpointing interface set to check_userdefined, as done in the
examples without the Condor library.