This usually happens on an NFS client in a cluster where rpc.lockd and
rpc.statd (or just lockd and statd on some systems) are not running on both
the NFS server and the NFS client.
Most commercial Unixes install statd and lockd, and since the source for statd
and lockd is widely available, this should not present a problem for most users.
On non-GNU/Linux systems these daemons are required to support reliable NFS file locks.
GNU/Linux does not normally seem to run statd and lockd, which caused a
problem for one user. GNU Queue runs fine on my RedHat boxes, which do seem
to support NFS file locking (it would be most surprising if they didn't!),
so it seems GNU/Linux has moved file-locking support into the kernel.
NFS file locking is required for a large number of commercial and free
applications (SAS, WordPerfect, sendmail, etc.) to run properly, so it's
probably a good idea to be running the daemons anyway.
However, I've suggested a patch that, on systems where fcntl() locking is not
supported, replaces the fcntl() code with lockfile locking (i.e., creating a
file "file.LOCK" in the same directory to indicate that "file" has been
locked). Under NFS, this requires a sleep(4) to ensure synchronization and
safe propagation of the lockfile throughout the cluster, so it is much slower
than using statd and lockd. It is also less reliable: statd and lockd remove
locks when a client reboots, but what's to remove the lockfile if the client
crashes?
The free, popular procmail(1) delivery agent implements lockfile locking over
NFS correctly; anyone wishing to write the patch should consult its source.
Another solution is to put the spooldir on AFS or another high-reliability
network filesystem that supports fcntl() file locking.
A final solution is to eliminate the need for locks and NFS altogether in
Queue and rely only on TCP/IP transmission of job information. This is
planned and in development; please support the developers by offering them
a hand.
werner.krebs@yale.edu