|
My output file for my job says
"Warning: no access to tty; thus no job control in this shell..."
|
|
One or more of your login files contain an stty command. These
commands are only useful if there is a terminal present.
|
Sun Grid Engine batch jobs have no terminal associated with
them. Either remove all stty commands from your login files, or
bracket them with an 'if' statement that checks for a terminal
before running them. For example:
/bin/csh:
# check terminal status; the exit status is 0 only if a terminal is present
stty -g >& /dev/null
if ($status == 0) then
<place all stty commands in here>
endif
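For login files read by Bourne-style shells (such as .profile), a comparable guard might look like the sketch below. It is not part of the original answer: it assumes the POSIX `test -t 0` check in place of `stty -g`, and the helper name `have_tty` and the `stty erase` setting are purely illustrative.

```shell
#!/bin/sh
# have_tty: hypothetical helper; succeeds only when stdin is a terminal.
have_tty() {
    [ -t 0 ]
}

if have_tty; then
    # Place all stty commands in here (illustrative setting only).
    stty erase '^H'
else
    # No terminal, e.g. inside a batch job: skip the stty commands.
    :
fi
```

Under qsub the `else` branch is taken, so no stty command runs and the warning should not be triggered by these lines.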
|
|
The job's standard error log file says:
`tty`: Ambiguous
but there is no reference to tty in the user's shell that is
called in the job script.
|
|
By default, shell_start_mode is posix_compliant, so every job
script runs with the shell specified in the queue definition,
not the one named on the first line of the job script.
|
Use the -S flag to qsub, or change shell_start_mode to
unix_behavior.
|
|
I can run my job script from the command line, but it fails
when run via qsub.
|
|
Process limits may be set for your job. To test this, write a
test script that runs limit and limit -h. Run it both
interactively at the shell prompt and through qsub, then compare
the results.
|
Remove any commands in your shell configuration files that set
limits.
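The test script described above uses the csh built-in limit. For Bourne-style shells, a rough equivalent can be sketched with ulimit. The helper name `report_limits` is an assumption made for this example.

```shell
#!/bin/sh
# report_limits: hypothetical helper that prints the soft and hard
# resource limits of the shell it runs in, labelled with the host name.
report_limits() {
    echo "== soft limits on $(uname -n) =="
    ulimit -S -a
    echo "== hard limits on $(uname -n) =="
    ulimit -H -a
}

report_limits
```

Saved as, say, limits.sh (a hypothetical file name), it could be run once as `sh limits.sh > interactive.txt` and once as `qsub -cwd -j y -o batch.txt limits.sh`, after which the two output files can be compared with diff.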
|
|
Submitting a job with qsub results in the error "can't set
additional group id for job" (seen in administrator or user
mail, or in the shepherd trace file) and puts the queue into
error state.
|
|
Possible reasons:
The error message can occur if the user already has 16 group
IDs set. SGE tries to set one more group ID and fails because
the limit is usually 16.
If you are not running Grid Engine as root, the setgroups()
call fails when it tries to set the unique group ID that is used
to track all the spawned processes of a job.
|
Corresponding solutions:
Check how many group IDs are assigned to the user with 'id -a'.
If there are 16 or more, either reduce that number or increase
the limit in the kernel (NGROUPS_MAX).
Be sure to run the Grid Engine daemons as root.
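The group count can also be checked with a short script. The sketch below assumes the POSIX `id -G` form rather than 'id -a', and takes the limit of 16 from the answer above. The real NGROUPS_MAX on a given system may differ, and `count_groups` is a hypothetical helper name.

```shell
#!/bin/sh
# count_groups: hypothetical helper; prints how many group IDs a user has.
# With no argument it reports on the current user.
count_groups() {
    id -G ${1:+"$1"} | wc -w | tr -d ' '
}

limit=16   # usual limit quoted above; the real NGROUPS_MAX may differ
n=$(count_groups)
if [ "$n" -ge "$limit" ]; then
    echo "user already has $n group IDs; SGE cannot add one more"
else
    echo "user has $n group IDs; room for the additional SGE group ID"
fi
```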
|
|
Jobs work when run from the command line but fail when run via
qsub.
|
|
Data and executables may not be accessible where they are
needed.
|
The job script itself must be accessible from the submit host.
All data and other executables needed by the script must be
accessible on the execute host; they are usually shared via NFS.
|
|
The unlimited stack size set by default by SGE may cause some
applications to crash on some operating systems.
|
In the job script, use “ulimit” to set the stack size
limit before calling the executable that crashes.
Or modify the queue to set a smaller stack size:
qconf -mattr queue h_stack 8389486 <queue_name> (hard limit in bytes)
qconf -mattr queue s_stack 8389486 <queue_name> (soft limit in bytes)
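The job-script variant of this fix might look like the following sketch. The helper name `run_with_stack_limit` and the value of 8192 kilobytes are assumptions chosen for illustration. The `sh -c 'ulimit -s'` call merely stands in for the application that crashes.

```shell
#!/bin/sh
# run_with_stack_limit: hypothetical helper. Lowers the soft stack limit
# (value in kilobytes) inside a subshell, then runs the given command there,
# so the limit of the calling shell is left unchanged.
run_with_stack_limit() {
    kb=$1
    shift
    (
        ulimit -S -s "$kb" || exit 1
        "$@"
    )
}

# Stand-in for the real application: print the stack limit it would see.
run_with_stack_limit 8192 sh -c 'ulimit -s'
```

Only the soft limit is lowered, so an application that really needs a larger stack can still raise it up to the hard limit.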
|
|
Monitoring
|
|
Exec hosts report a load of 99.99; the queue is in “alarm”
and/or “unknown” state.
|
|
Several things can cause your exec hosts to report a load of
99.99:
The execd is not running on the host.
A default domain is incorrectly specified.
The qmaster host sees the exec host under a different name than
the exec host sees itself.
|
Depending on the cause, the appropriate solutions are:
Start the execd as root on the host by running the
$SGE_ROOT/default/common/rcsge script.
Run 'qconf -mconf' as the Sun Grid Engine administrator and
change default_domain to none.
Set IGNORE_FQDN=TRUE in qmaster_params in the cluster
configuration. See the man page sge_h_aliases(5).
See the AppNote/HOWTO loadinfo for more information.
|
|
Miscellaneous Error Messages
|
|
A warning is printed to <cell>/spool/<host>/messages
every 30 seconds. The messages look like this:
Tue Jan 23 21:20:46 2001|execd|meta|W|local configuration meta not defined - using global configuration
But there IS a file for each host in <cell>/common/local_conf/,
each named with the FQDN.
|
|
Hostname resolution on your machine "meta" returns the short
name, while on your master machine the FQDN is returned for
"meta".
|
Make sure that all of your /etc/hosts files and your NIS table
are consistent in this respect. In this example, /etc/hosts on
the host "meta" could contain a line like
168.0.0.1 meta meta.your.domain
when it should instead be
168.0.0.1 meta.your.domain meta
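Entries with this ordering problem can be found mechanically. The sketch below is an illustration only: `short_name_first` is a hypothetical helper, and it treats any name containing a dot as fully qualified, which is a simplifying assumption.

```shell
#!/bin/sh
# short_name_first: hypothetical helper. Prints every hosts-file entry whose
# first name is a short name although a dotted (fully qualified) alias follows.
short_name_first() {
    awk '
        # skip comment lines and entries that have no alias at all
        /^[ \t]*#/ || NF < 3 { next }
        $2 !~ /\./ {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /^#/) break           # stop at a trailing comment
                if ($i ~ /\./) { print; break }
            }
        }
    ' "$1"
}

if [ -r /etc/hosts ]; then
    short_name_first /etc/hosts
fi
```

Any line it prints is a candidate for swapping the short name and the FQDN, as in the corrected entry above.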
|
|
Occasionally I see "CHECKSUM ERROR", "WRITE
ERROR" or "READ ERROR" messages in the "messages"
files of the daemons. Do I need to worry about these?
|
|
|
Unless these messages appear at intervals of about one second
(they typically appear between 1 and 30 times per day), no
action is needed.
|
|
Jobs finish on a particular queue, as logged in qmaster/messages:
Wed Mar 28 10:57:15 2001|qmaster|masterhost|I|job 490.1 finished on host exechost
But then the following errors are seen on the exec host:
exechost/messages:
Wed Mar 28 10:57:15 2001|execd|exechost|E|can't find directory "active_jobs/490.1" for reaping job 490.1
exechost/messages:
Wed Mar 28 10:57:15 2001|execd|exechost|E|can't remove directory "active_jobs/490.1": opendir(active_jobs/490.1) failed: Input/output error
|
|
The $SGE_ROOT directory, which is automounted, is being
unmounted, causing sge_execd to lose its cwd.
|
Use a local spool directory for your execd host. Set the
parameter execd_spool_dir using qmon or qconf.
|
“critical error: can't connect commd”
“critical error: setup failed starting cod_schedd”
|
|
A bug on 32-bit systems: rlim_fd_max > 1024 in /etc/system
|
Set rlim_fd_max to < 1024, or update to SGE 5.3p2 or higher.
|
|
The actual hostname <myhostname> of the machine is an
alias for localhost in /etc/hosts. It looks like this:
127.0.0.1 localhost myhostname
|
Remove <myhostname> as an alias for localhost and put
<myhostname> after the real IP address in /etc/hosts.
|
|
Multiple queues cascade into error state, rendering the grid
unusable.
|
|
Errors in a user's .cshrc/.profile put all queues into error
state.
|
Fix the errors in the user's .cshrc/.profile.
Use the -f option on the first line of the job script
(i.e. use “#!/bin/sh -f”) to bypass the user's .cshrc or
.profile.
|
|
Performance
|
|
Memory leak and huge memory consumption for schedd on large
systems
|
|
Parameter schedd_job_info=true
|
Set schedd_job_info=false or update to release 5.3p3 or higher.
|
|
Configuration
|
|
max_u_jobs doesn't work as expected.
|
|
It doesn't work exactly the same way in all versions of the
product, and it affects scheduling differently depending on
whether the product is used in SGE or SGEEE mode.
|
Update to SGE 5.3p2 (or higher), which contains the latest
implementation.
|
|
Qrsh/Interactive Jobs
|
|
Submitting interactive jobs with qrsh, I get the error:
% qrsh -l mem_free=1G
error: error: no suitable queues
Yet queues are available for batch jobs using qsub, and can be
queried using qhost -l mem_free=1G and qstat -f -l mem_free=1G.
|
|
The message "error: no suitable queues" results from the
"-w e" submit option, which is active by default for
interactive jobs such as qrsh (look for "-w e" in qrsh(1)).
This option causes the submit command to fail if the qmaster
does not know for sure that the job can be dispatched under the
current cluster configuration. The intention of this mechanism
is to decline job requests in advance if they can't be granted.
|
In this case 'mem_free' is configured as a consumable resource,
but you have not specified the amount of memory available on
each host. The memory load values are deliberately not
considered for this check because they vary, so they can't be
seen as part of the cluster configuration. To overcome this you
can do one of the following:
Omit this check altogether by overriding qrsh's default setting
"-w e" explicitly, submitting with "-w n" (this can also be put
into $SGE_ROOT/<cell>/common/sge_request).
If you intend to manage 'mem_free' as a consumable resource,
specify the 'mem_free' capacity for your hosts in
'complex_values' of host_conf(5) by using
'qconf -me <hostname>'.
If you don't intend to manage 'mem_free' as a consumable
resource, make it a non-consumable resource again in the
'consumable' column of complex(5) by using 'qconf -mc host'.
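As a rough illustration of the first two options, the fragment below shows what the request and the host configuration entry might look like. The capacity of 4G is a made-up value, so substitute the memory actually installed on each host.

```text
# Option 1: skip the verification for a single request
qrsh -w n -l mem_free=1G

# Option 2: entry in the host configuration, edited with 'qconf -me <hostname>'
complex_values        mem_free=4G
```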
|
|
qrsh won't dispatch to the same node it is on. From a qsh shell:
host2 [49]% qrsh -inherit host2 hostname
error: executing task of job 1 failed:
host2 [50]% qrsh -inherit host4 hostname
host4
|
|
gid_range is not sufficient. It should be defined as a range,
not a single number. SGEEE assigns each job on a host a distinct
gid.
|
Adjust gid_range using 'qconf -mconf' or qmon. The suggested
range is:
gid_range 20000-20100
|
|
When I run qrsh, I get this error:
% qrsh
error: 1: can't set additional group id for job
|
|
The error message above can occur if the user already has 16
group IDs set. SGE tries to set one more group ID and fails
because the limit is usually 16.
|
Check how many group IDs are assigned to the user with 'id -a'.
If there are 16 or more, either reduce that number or increase
the limit in the kernel.
|
|
qrsh -inherit -V does not work when used inside a parallel job:
cannot get connection to "qlogin_starter"
|
|
This problem occurs with nested qrsh calls and is due to the -V
switch. The first qrsh -inherit call sets the environment
variable TASK_ID (the id of the tightly integrated task within
the parallel job). The second qrsh -inherit call then uses this
environment variable to register its task, which fails because
it tries to start a task with the same id as the first task,
which is already running.
|
You can either:
Unset TASK_ID before calling qrsh -inherit.
Avoid the -V switch and use -v instead, exporting only the
environment variables that are really needed.
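The first workaround can be wrapped in a small function so that TASK_ID is removed only for the nested call. This is a sketch: `without_task_id` is a hypothetical name, and the `sh -c` command stands in for the nested qrsh -inherit call so that the example runs outside a parallel job.

```shell
#!/bin/sh
# without_task_id: hypothetical wrapper. Runs a command in a subshell with
# TASK_ID removed from the environment, leaving the caller's copy untouched.
without_task_id() {
    (
        unset TASK_ID
        "$@"
    )
}

# Stand-in command showing what the child process sees. In a real parallel
# job the arguments would be the nested call: qrsh -inherit <host> <command>
TASK_ID=1
export TASK_ID
without_task_id sh -c 'echo "TASK_ID=${TASK_ID:-<unset>}"'
```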
|
|
qrsh does not seem to work at all:
host2$ qrsh -verbose hostname
local configuration host2 not defined - using global configuration
waiting for interactive job to be scheduled ...
Your interactive job 88 has been successfully scheduled.
Establishing /share/gridware/utilbin/solaris64/rsh session to host exehost ...
rcmd: socket: Permission denied
/share/gridware/utilbin/solaris64/rsh exited with exit code 1
reading exit code from shepherd ...
error: error waiting on socket for client to connect: Interrupted system call
error: error reading returncode of remote command
cleaning up after abnormal exit of /share/gridware/utilbin/solaris64/rsh
host2$
|
|
Permissions for qrsh are not set properly.
|
Check the permissions of the following files, which are located
in $SGE_ROOT/utilbin/<arch>/. Note that rlogin and rsh need
to be setuid and owned by root.
-r-s--x--x 1 root root 28856 Sep 18 06:00 rlogin*
-r-s--x--x 1 root root 19808 Sep 18 06:00 rsh*
-rwxr-xr-x 1 sgeadmin adm 128160 Sep 18 06:00 rshd*
NOTE: the $SGE_ROOT directory also needs to be NFS-mounted with
the "setuid" option. If it is mounted with "nosuid" on your
submit client, then qrsh (and associated commands) will not
work.
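The ownership and setuid requirements can be verified with a short script such as the sketch below. The helper name `check_setuid_root` is hypothetical, and the fallback values /share/gridware and solaris64 are taken from the example session above only as placeholders for an unset SGE_ROOT or ARCH.

```shell
#!/bin/sh
# check_setuid_root: hypothetical helper. Reports whether a file has the
# setuid bit set and is owned by UID 0, as rlogin and rsh must be.
check_setuid_root() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "$f: missing"
        return 1
    fi
    owner=$(ls -ln "$f" | awk '{ print $3 }')
    if [ -u "$f" ] && [ "$owner" = "0" ]; then
        echo "$f: ok"
    else
        echo "$f: NOT setuid root (owner uid $owner)"
        return 1
    fi
}

# SGE_ROOT and ARCH are assumed to be set as in a normal SGE environment.
for prog in rlogin rsh; do
    check_setuid_root "${SGE_ROOT:-/share/gridware}/utilbin/${ARCH:-solaris64}/$prog" || :
done
```

Running it on the submit client as well as on the file server helps to spot a "nosuid" mount, because the setuid bit may be present on the server yet ignored on the client.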
|
|
Interactive jobs fail when run via qsh, without an error
message.
|
|
The DISPLAY variable may be set incorrectly.
|
Set DISPLAY correctly, or upgrade to release 5.3p2 or higher to
get error messages for this situation.
|
Qmake
|
|
When trying to start a distributed make, qmake exits with the
following error message:
qrsh_starter: executing child process qmake failed: No such file or directory
|
|
Grid Engine starts an instance of qmake on the execution host.
If the Grid Engine environment (especially the PATH) is not set
up in the user's shell resource file (.profile/.cshrc), this
qmake call fails.
|
Use the -v option to export the PATH to the qmake job. A typical
qmake call is:
qmake -v PATH -cwd -pe make 2-10 --
|
|
When running qmake, the error seen is:
waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 5
Your "qrsh" request could not be scheduled, try again later.
|
|
The ARCH variable could be set incorrectly in the shell that
called qmake.
|
Set the ARCH variable to a supported value matching a host
available in your cluster, or else specify the correct value at
submit time, e.g.:
qmake -v ARCH=solaris64 ...
|
|
Qmon
|
|
Why am I having refresh problems when running Qmon?
|
|
Your system needs to be updated with the proper X server patch.
|
X server patch 105633-48 fixes the problem of icons not
refreshing properly as well as icons painting over non-qmon
windows. It is important to install the patch in console mode
(no X running); otherwise the patch installation will fail.
|
|
Parallel/Checkpointing
|
|
Parts of Sun HPC ClusterTools parallel jobs (the job script
itself, child processes, etc.) fail to stop when terminated by
the user or by qmaster.
|
|
The user may not have supplied the necessary means (scripts) for
SGE to control the distributed jobs.
|
Follow the complete HOW-TO instructions on integration between
Grid Engine and HPC ClusterTools.
|
|
Bugs in early versions of the loose integration package
|
Update to SGE 5.3p2 (or higher), which includes the latest MPI
loose integration package.
|
|
Parallel jobs that run with the tight integration of SGE 5.3.x
and HPC CT 5 are not terminated if one of the queues has a wall
clock limit set.
|
|
A bug in SGE prevented correct signal delivery to all parallel
processes.
|
SGE 5.3p4 contains the fix; for earlier 5.3.x versions, get the
corresponding patches from SunSolve:
SGE: 113136-04 (pkgadd Solaris 32-bit); 113137-04 (pkgadd
Solaris 64-bit); 113138-04 (pkgadd Solaris X86); 113663-02
(pkgadd common pkg); 113849-03 (tar.gz Solaris 32-bit); 113850-03
(tar.gz Solaris 64-bit); 113851-03 (tar.gz Solaris X86);
113852-04 (tar.gz Linux); 113853-02 (tar.gz common package)
SGEEE: 113139-04 (pkgadd Solaris 32-bit); 113140-04 (pkgadd
Solaris 64-bit); 113636-03 (pkgadd common pkg); 113855-03 (tar.gz
Solaris 32-bit); 113856-03 (tar.gz Solaris 64-bit); 113900-02
(tar.gz Linux); 113857-02 (tar.gz common package)
|
|
Parallel jobs that run with the tight integration of SGE 5.3.x
and HPC CT 5 do not suspend and resume correctly.
|
|
Another bug in SGE prevented STOP and CONT signals from being
correctly delivered to all processes.
|
Set the suspend/resume methods in the queues used for the
parallel jobs to the appropriate scripts. These scripts can
either be downloaded from the Grid Engine Project site or
obtained from Sun support.
Releases beyond 5.3p4 will ship with these two scripts, a
README file and a parallel environment template.
|
|
Shadow Facility
|
|
After failover to the shadow master, the schedd daemon remains
running on the original qmaster.
|
|
This is a bug in earlier versions of SGE.
|
Update to 5.3p2 or higher.
|
|
Shadow host fails to take over mastership of the SGE cluster
|
|
A lock file exists.
|
Remove the $SGE_ROOT/<cell>/spool/qmaster/lock file if
the master host has crashed or can no longer function as
qmaster. NOTE: to force the shadow host to take over
from another master, use the “migrate” option, i.e.
“rcsge -migrate”.
|
|
Root needs read/write access to the $SGE_ROOT directory and its
subdirectories from both the master and the shadow host.
|
Adjust permissions to give root read/write access to the
$SGE_ROOT directory and its subdirectories from the shadow
host.
NOTE: please see the Shadow Master HOWTO.
|