RFA [PATCH] * lock-and-run.sh: Check for process existence rather than timeout.

Message ID 20191014213545.24627-1-jason@redhat.com
State New
Headers show
Series
  • RFA [PATCH] * lock-and-run.sh: Check for process existence rather than timeout.
Related show

Commit Message

Jason Merrill Oct. 14, 2019, 9:35 p.m.
Matthias Klose noted that on less powerful targets, a link might take more
than 5 minutes; he mentions a figure of 3 hours for an LTO link.  So this
patch changes the timeout to a check for whether the locking process still
exists.

Alex, you had a lot of helpful comments when I first wrote this, any thoughts
on this revision?

---
 gcc/lock-and-run.sh | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)


base-commit: aa45db50a034b266c338b55dee1b412178ea84a7
-- 
2.18.1

Comments

Alexandre Oliva Oct. 19, 2019, 4:07 a.m. | #1
Hello, Jason,

On Oct 14, 2019, Jason Merrill <jason@redhat.com> wrote:

> Alex, you had a lot of helpful comments when I first wrote this, any thoughts

> on this revision?


I think the check of the pid file could be made slightly simpler and
cheaper if we created it using:

   echo $$ > $lockdir/pidT && mv $lockdir/pidT $lockdir/pid

instead of

> +touch $lockdir/$$




> +	pid="`(cd $lockdir; echo *)`"


The ""s are implicit in a shell assignment, though there really
shouldn't be more than one PID-named file in the dir.  With the change
suggested above, this would become

        pid=`cat $lockdir/pid 2>/dev/null`

There's a slight possibility of hitting this right between the creation
of the dir and the creation of the pid file, thus the 2>/dev/null.

> +	if ps "$pid" >/dev/null; then


could be tested with much lower overhead:

        if test -z "$pid" || kill -0 $pid ; then

though it might make sense to have a different test and error message
for the case of the absent pid file.

We might also wish to use different lock-breaking logic for that case,
too, e.g. checking that the timestamp of the dir didn't change by
comparing `ls -ld $lockdir` with what we got 30 seconds before.  If it
changed or the output is now empty, we just lost the race again.

It's unlikely that the dir would remain unchanged for several seconds
without the pid file, so if we get the same timestamp after 30 seconds,
it's likely that something went wrong with the lock holder, though it's
not impossible to imagine a scenario in which the lock program that just
succeeded in creating the dir got stopped (^Z) or killed-9 just before
creating the PID file.


Even then, maybe breaking the lock is not such a great idea in
general...

Though mkdir is an operation that forces a synchronization, reading a
file without a filesystem lock isn't.  The rename alleviates that a bit,
but it's not entirely unreasonable for an attempt to read the file to
cache the absence of the file and not notice a creation shortly
afterward.  This would be a lot more of an issue in case of contention
for the lock between different clients of a shared networked filesystem,
though we might imagine highly parallel systems to eventually hit such
issues as well.

But just the possibility of contention across a shared network
filesystem would give me pause, out of realizing that checking for a
local process with the same PID would do no good.  And then, kill -0
would not succeed if the lock holder was started by a different user,
unlike ps.

What if we printed an error message suggesting the command to clean up,
and then errored out, instead of retrying forever or breaking the lock
and proceeding?  Several programs that rely on lock files (git and svn
come to mind) seem to be taking such an approach these days, presumably
because of all the difficulties in automating the double-checking in all
potential scenarios.

-- 
Alexandre Oliva, freedom fighter  he/him   https://FSFLA.org/blogs/lxo
Be the change, be Free!        FSF VP & FSF Latin America board member
GNU Toolchain Engineer                        Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás - Che GNUevara
Jason Merrill Oct. 21, 2019, 4:33 p.m. | #2
On 10/21/19 11:35 AM, Jason Merrill wrote:
> On Sat, Oct 19, 2019 at 12:08 AM Alexandre Oliva <oliva@gnu.org 

> <mailto:oliva@gnu.org>> wrote:

> 

>     Hello, Jason,

> 

>     On Oct 14, 2019, Jason Merrill <jason@redhat.com

>     <mailto:jason@redhat.com>> wrote:

> 

>      > Alex, you had a lot of helpful comments when I first wrote this,

>     any thoughts

>      > on this revision?

> 

>     I think the check of the pid file could be made slightly simpler and

>     cheaper if we created it using:

> 

>         echo $$ > $lockdir/pidT && mv $lockdir/pidT $lockdir/pid

> 

>     instead of

> 

>      > +touch $lockdir/$$

> 

> 

> 

>      > +     pid="`(cd $lockdir; echo *)`"

> 

>     The ""s are implicit in a shell assignment, though there really

>     shouldn't be more than one PID-named file in the dir.  With the change

>     suggested above, this would become

> 

>              pid=`cat $lockdir/pid 2>/dev/null`

> 

> 

> Done.

> 

>     There's a slight possibility of hitting this right between the creation

>     of the dir and the creation of the pid file, thus the 2>/dev/null.

> 

>      > +     if ps "$pid" >/dev/null; then

> 

>     could be tested with much lower overhead:

> 

>              if test -z "$pid" || kill -0 $pid ; then

> 

> 

> Done

> 

>     though it might make sense to have a different test and error message

>     for the case of the absent pid file.

> 

> 

> Done.

> 

>     We might also wish to use different lock-breaking logic for that case,

>     too, e.g. checking that the timestamp of the dir didn't change by

>     comparing `ls -ld $lockdir` with what we got 30 seconds before.  If it

>     changed or the output is now empty, we just lost the race again.

> 

>     It's unlikely that the dir would remain unchanged for several seconds

>     without the pid file, so if we get the same timestamp after 30 seconds,

>     it's likely that something went wrong with the lock holder, though it's

>     not impossible to imagine a scenario in which the lock program that just

>     succeeded in creating the dir got stopped (^Z) or killed-9 just before

>     creating the PID file.

> 

> 

> This patch uses stamps in the lock directory so that they are 

> automatically cleared when the lock changes hands.

> 

>     Even then, maybe breaking the lock is not such a great idea in

>     general...

> 

>     Though mkdir is an operation that forces a synchronization, reading a

>     file without a filesystem lock isn't.  The rename alleviates that a bit,

>     but it's not entirely unreasonable for an attempt to read the file to

>     cache the absence of the file and not notice a creation shortly

>     afterward.  This would be a lot more of an issue in case of contention

>     for the lock between different clients of a shared networked filesystem,

>     though we might imagine highly parallel systems to eventually hit such

>     issues as well.

> 

>     But just the possibility of contention across a shared network

>     filesystem would give me pause, out of realizing that checking for a

>     local process with the same PID would do no good.  And then, kill -0

>     would not succeed if the lock holder was started by a different user,

>     unlike ps.

> 

>     What if we printed an error message suggesting the command to clean up,

>     and then errored out, instead of retrying forever or breaking the lock

>     and proceeding?  Several programs that rely on lock files (git and svn

>     come to mind) seem to be taking such an approach these days, presumably

>     because of all the difficulties in automating the double-checking in all

>     potential scenarios.

> 

> 

> It seems to me that the downside of stealing the lock (greater resource 

> contention) is less than the downside of  erroring out (build fails, 

> user has to intervene manually, possibly many hours later when they notice).

> 

> How's this?


...once more, with less HTML.
commit 157d9c29a21ab0a5d854f53ca51ca9a7986a4e64
Author: Jason Merrill <jason@redhat.com>
Date:   Sat Sep 14 14:02:20 2019 -0400

            * lock-and-run.sh: Check for process existence rather than timeout.
    
    Matthias Klose noted that on less powerful targets, a link might take more
    than 5 minutes; he mentions a figure of 3 hours for an LTO link.  So this
    patch changes the timeout to a check for whether the locking process still
    exists.  If the lock exists in an erroneous state (no pid file or can't
    signal the pid) for 30 sec, steal it.

diff --git a/gcc/lock-and-run.sh b/gcc/lock-and-run.sh
index 3a6a84c253a..c1ae16bbb01 100644
--- a/gcc/lock-and-run.sh
+++ b/gcc/lock-and-run.sh
@@ -1,33 +1,43 @@
 #! /bin/sh
 # Shell-based mutex using mkdir.
 
+self=`basename $0`
 lockdir="$1" prog="$2"; shift 2 || exit 1
 
 # Remember when we started trying to acquire the lock.
 count=0
-touch lock-stamp.$$
 
-trap 'rm -r "$lockdir" lock-stamp.$$' 0
+err () {
+    if test -f $lockdir/lock-$1.$$; then
+	echo "$self: *** (PID $$) removing stale $lockdir" >&2
+	rm -rf $lockdir
+    else
+	touch $lockdir/lock-$1.$$
+    fi
+}
 
 until mkdir "$lockdir" 2>/dev/null; do
     # Say something periodically so the user knows what's up.
     if [ `expr $count % 30` = 0 ]; then
-	# Reset if the lock has been renewed.
-	if [ -n "`find \"$lockdir\" -newer lock-stamp.$$`" ]; then
-	    touch lock-stamp.$$
-	    count=1
-	# Steal the lock after 5 minutes.
-	elif [ $count = 300 ]; then
-	    echo removing stale $lockdir >&2
-	    rm -r "$lockdir"
+	# Check for valid lock.
+	if pid=`cat $lockdir/pid 2>/dev/null` && kill -0 $pid 2>/dev/null; then
+	    echo "$self: (PID $$) waiting $count sec to acquire $lockdir from PID $pid" >&2
+	elif test -z "$pid"; then
+	    echo "$self: (PID $$) cannot read $lockdir/pid" >&2
+	    err nopid
 	else
-	    echo waiting to acquire $lockdir >&2
+	    echo "$self: (PID $$) cannot signal $lockdir owner PID $pid" >&2
+	    err dead
 	fi
     fi
     sleep 1
     count=`expr $count + 1`
 done
 
+trap 'rm -rf "$lockdir"' 0
+echo $$ > $lockdir/pidT && mv $lockdir/pidT $lockdir/pid
+echo "$self: (PID $$) acquired $lockdir after $count seconds" >&2
+
 echo $prog "$@"
 $prog "$@"
Alexandre Oliva Oct. 21, 2019, 11:12 p.m. | #3
On Oct 21, 2019, Jason Merrill <jason@redhat.com> wrote:

> On Sat, Oct 19, 2019 at 12:08 AM Alexandre Oliva <oliva@gnu.org> wrote:


>> We might also wish to use different lock-breaking logic for that case,

>> too, e.g. checking that the timestamp of the dir didn't change by

>> comparing `ls -ld $lockdir` with what we got 30 seconds before.  If it

>> changed or the output is now empty, we just lost the race again.


>> It's unlikely that the dir would remain unchanged for several seconds

>> without the pid file, so if we get the same timestamp after 30 seconds,

>> it's likely that something went wrong with the lock holder,


> This patch uses stamps in the lock directory so that they are automatically

> cleared when the lock changes hands.


Nice!

>>                                                             though it's

>> not impossible to imagine a scenario in which the lock program that just

>> succeeded in creating the dir got stopped (^Z) or killed-9 just before

>> creating the PID file.


One kind of problem is a left-over lock after a kill -9.  That
doesn't provide a chance for the lock cleanup code to run in the
terminated process.  But the current code will eventually steal it
(maybe too soon) and proceed.


Another kind of problem could arise if a lock-needing program got
stopped at an unfortunate time that enabled the lock to be taken from it
(before creating the pid file), and then woke back up believing it still
has the lock.  If it were to carry out anything that really required
mutual exclusion, Bad Things (TM) could happen.

Some systems adopt a STOPITH (that's short for Shoot The Other Process
In The Head) to avoid that scenario, but we don't have that choice:
either we're stealing from a PID that's gone (safe-ish, assuming all
users on a single host, without a shared FS across separate PID
namespaces, but still potentially racy, see below), or we don't know who
we're stealing from (very likely racy, except after a full-system
crash).


>> Even then, maybe breaking the lock is not such a great idea in

>> general...


> It seems to me that the downside of stealing the lock (greater resource

> contention) is less than the downside of  erroring out (build fails, user

> has to intervene manually, possibly many hours later when they notice).


Oh, if that's indeed the only consequence, I tend to agree.  I was
worried about scenarios that actually required mutual exclusion.  For
that, stealing locks opens another can of worms since multiple processes
might decide to steal the lock at the same time, and then each one might
end up stealing the lock from others who'd just stole the lock, thinking
they're stealing from the dead or stopped process, with very
impredictable results.  For short, it's not hard to implement lock
breaking, it's just very hard to make it race-free.


We should probably have comments as to the intended use, to express a
preference for serialization, so that nobody ends up using it in cases
that actually depend on mutual exclusion (e.g. to avoid corrupting some
file by concurrently changing it) without disabling the racy lock
stealing machinery:

# Shell-based mutex using mkdir.  This script is used to prefer
# serialized execution to avoid consuming too much RAM.  If reusing it,
# bear in mind that the lock-breaking logic is not race-free, so disable
# it in err() if concurrent execution could cause more serious problems.

Ok with that change or somesuch.

(Shell functions used to be non-portable, but I think even super
portable autoconf-generated configure scripts use them now)

Thanks!

-- 
Alexandre Oliva, freedom fighter  he/him   https://FSFLA.org/blogs/lxo
Be the change, be Free!        FSF VP & FSF Latin America board member
GNU Toolchain Engineer                        Free Software Evangelist
Hay que enGNUrecerse, pero sin perder la terGNUra jamás - Che GNUevara
Jason Merrill Oct. 22, 2019, 3:24 a.m. | #4
On 10/21/19 7:12 PM, Alexandre Oliva wrote:
> On Oct 21, 2019, Jason Merrill <jason@redhat.com> wrote:

> 

>> On Sat, Oct 19, 2019 at 12:08 AM Alexandre Oliva <oliva@gnu.org> wrote:

> 

>>> We might also wish to use different lock-breaking logic for that case,

>>> too, e.g. checking that the timestamp of the dir didn't change by

>>> comparing `ls -ld $lockdir` with what we got 30 seconds before.  If it

>>> changed or the output is now empty, we just lost the race again.

> 

>>> It's unlikely that the dir would remain unchanged for several seconds

>>> without the pid file, so if we get the same timestamp after 30 seconds,

>>> it's likely that something went wrong with the lock holder,

> 

>> This patch uses stamps in the lock directory so that they are automatically

>> cleared when the lock changes hands.

> 

> Nice!

> 

>>>                                                              though it's

>>> not impossible to imagine a scenario in which the lock program that just

>>> succeeded in creating the dir got stopped (^Z) or killed-9 just before

>>> creating the PID file.

> 

> One kind of problem is a left-over lock after a kill -9.  That

> doesn't provide a chance for the lock cleanup code to run in the

> terminated process.  But the current code will eventually steal it

> (maybe too soon) and proceed.

> 

> 

> Another kind of problem could arise if a lock-needing program got

> stopped at an unfortunate time that enabled the lock to be taken from it

> (before creating the pid file), and then woke back up believing it still

> has the lock.  If it were to carry out anything that really required

> mutual exclusion, Bad Things (TM) could happen.

> 

> Some systems adopt a STOPITH (that's short for Shoot The Other Process

> In The Head) to avoid that scenario, but we don't have that choice:

> either we're stealing from a PID that's gone (safe-ish, assuming all

> users on a single host, without a shared FS across separate PID

> namespaces, but still potentially racy, see below), or we don't know who

> we're stealing from (very likely racy, except after a full-system

> crash).

> 

> 

>>> Even then, maybe breaking the lock is not such a great idea in

>>> general...

> 

>> It seems to me that the downside of stealing the lock (greater resource

>> contention) is less than the downside of  erroring out (build fails, user

>> has to intervene manually, possibly many hours later when they notice).

> 

> Oh, if that's indeed the only consequence, I tend to agree.  I was

> worried about scenarios that actually required mutual exclusion.  For

> that, stealing locks opens another can of worms since multiple processes

> might decide to steal the lock at the same time, and then each one might

> end up stealing the lock from others who'd just stole the lock, thinking

> they're stealing from the dead or stopped process, with very

> impredictable results.


Sure, that could (very rarely) happen if one process managed to break 
the lock and recreate it in between these two lines in another process:

     if test -f $lockdir/lock-$1.$$; then
         rm -rf $lockdir

> We should probably have comments as to the intended use, to express a

> preference for serialization, so that nobody ends up using it in cases

> that actually depend on mutual exclusion (e.g. to avoid corrupting some

> file by concurrently changing it) without disabling the racy lock

> stealing machinery:

> 

> # Shell-based mutex using mkdir.  This script is used to prefer

> # serialized execution to avoid consuming too much RAM.  If reusing it,

> # bear in mind that the lock-breaking logic is not race-free, so disable

> # it in err() if concurrent execution could cause more serious problems.

> 

> Ok with that change or somesuch.


I mostly adopted that comment, thanks.

> (Shell functions used to be non-portable, but I think even super

> portable autoconf-generated configure scripts use them now)

> 

> Thanks!


Here's what I've committed:
commit f71e3aa2a5d614934ecda16987e12202a6b8007d
Author: jason <jason@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Tue Oct 22 03:09:41 2019 +0000

            * lock-and-run.sh: Check for process existence rather than timeout.
    
    Matthias Klose noted that on less powerful targets, a link might take more
    than 5 minutes; he mentions a figure of 3 hours for an LTO link.  So this
    patch changes the timeout to a check for whether the locking process still
    exists.  If the lock exists in an erroneous state (no pid file or can't
    signal the pid) for 30 sec, steal it.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@277277 138bc75d-0d04-0410-961f-82ee72b054a4

diff --git a/gcc/lock-and-run.sh b/gcc/lock-and-run.sh
index 3a6a84c253a..22bc436565c 100644
--- a/gcc/lock-and-run.sh
+++ b/gcc/lock-and-run.sh
@@ -1,33 +1,49 @@
 #! /bin/sh
-# Shell-based mutex using mkdir.
+# Shell-based mutex using mkdir.  This script is used in make to prefer
+# serialized execution to avoid consuming too much RAM.  If reusing it,
+# bear in mind that the lock-breaking logic is not race-free, so disable
+# it in err() if concurrent execution could cause more serious problems.
 
+self=`basename $0`
 lockdir="$1" prog="$2"; shift 2 || exit 1
 
 # Remember when we started trying to acquire the lock.
 count=0
-touch lock-stamp.$$
 
-trap 'rm -r "$lockdir" lock-stamp.$$' 0
+err () {
+    if test -f $lockdir/lock-$1.$$; then
+	echo "$self: *** (PID $$) removing stale $lockdir" >&2
+	rm -rf $lockdir
+	# Possible variant for uses where races are more problematic:
+	#echo "$self: *** (PID $$) giving up, maybe rm -r $lockdir" >&2
+	#exit 42
+    else
+	touch $lockdir/lock-$1.$$
+    fi
+}
 
 until mkdir "$lockdir" 2>/dev/null; do
     # Say something periodically so the user knows what's up.
     if [ `expr $count % 30` = 0 ]; then
-	# Reset if the lock has been renewed.
-	if [ -n "`find \"$lockdir\" -newer lock-stamp.$$`" ]; then
-	    touch lock-stamp.$$
-	    count=1
-	# Steal the lock after 5 minutes.
-	elif [ $count = 300 ]; then
-	    echo removing stale $lockdir >&2
-	    rm -r "$lockdir"
+	# Check for valid lock.
+	if pid=`cat $lockdir/pid 2>/dev/null` && kill -0 $pid 2>/dev/null; then
+	    echo "$self: (PID $$) waiting $count sec to acquire $lockdir from PID $pid" >&2
+	elif test -z "$pid"; then
+	    echo "$self: (PID $$) cannot read $lockdir/pid" >&2
+	    err nopid
 	else
-	    echo waiting to acquire $lockdir >&2
+	    echo "$self: (PID $$) cannot signal $lockdir owner PID $pid" >&2
+	    err dead
 	fi
     fi
     sleep 1
     count=`expr $count + 1`
 done
 
+trap 'rm -rf "$lockdir"' 0
+echo $$ > $lockdir/pidT && mv $lockdir/pidT $lockdir/pid
+echo "$self: (PID $$) acquired $lockdir after $count seconds" >&2
+
 echo $prog "$@"
 $prog "$@"

Patch

diff --git a/gcc/lock-and-run.sh b/gcc/lock-and-run.sh
index 3a6a84c253a..404350b89a4 100644
--- a/gcc/lock-and-run.sh
+++ b/gcc/lock-and-run.sh
@@ -5,29 +5,28 @@  lockdir="$1" prog="$2"; shift 2 || exit 1
 
 # Remember when we started trying to acquire the lock.
 count=0
-touch lock-stamp.$$
 
-trap 'rm -r "$lockdir" lock-stamp.$$' 0
+trap 'rm -rf "$lockdir"' 0
 
 until mkdir "$lockdir" 2>/dev/null; do
     # Say something periodically so the user knows what's up.
     if [ `expr $count % 30` = 0 ]; then
-	# Reset if the lock has been renewed.
-	if [ -n "`find \"$lockdir\" -newer lock-stamp.$$`" ]; then
-	    touch lock-stamp.$$
-	    count=1
-	# Steal the lock after 5 minutes.
-	elif [ $count = 300 ]; then
-	    echo removing stale $lockdir >&2
-	    rm -r "$lockdir"
+	# Check for stale lock.
+	pid="`(cd $lockdir; echo *)`"
+	if ps "$pid" >/dev/null; then
+	    echo PID $$ waiting $count sec to acquire $lockdir from PID $pid>&2
+	    found=$pid
 	else
-	    echo waiting to acquire $lockdir >&2
+	    echo PID $pid is dead, removing stale $lockdir >&2
+	    rm -r "$lockdir"
 	fi
     fi
     sleep 1
     count=`expr $count + 1`
 done
 
+touch $lockdir/$$
+echo PID $$ acquired $lockdir after $count seconds >&2
 echo $prog "$@"
 $prog "$@"