Use fork instead of vfork on Solaris

Message ID yddo8p7mi83.fsf@CeBiTec.Uni-Bielefeld.DE
State New
Headers show
Series
  • Use fork instead of vfork on Solaris
Related show

Commit Message

Rainer Orth June 25, 2020, 12:40 p.m.
The gdb.mi/mi-exec-run.exp test never completed/timed out on Solaris for
quite some time:

FAIL: gdb.mi/mi-exec-run.exp: inferior-tty=main: mi=main: force-fail=1: run failure detected (timeout)

This is for gdb trying to exec mi-exec-run.nox, a copy of mi-exec-run
with execute permissions removed.

The process tree at this point looks like this:

          21254 /vol/gcc/bin/expect -- /vol/gcc/share/dejagnu/runtest.exp GDB_PARALLEL=yes --outdir=outputs/gdb.mi/mi-exec-run-vfork gdb.mi/mi-exec-run.exp
            21300 <defunct>
            21281 <defunct>
            21294 $obj/gdb/testsuite/../../gdb/gdb -nw -nx -data-directory $obj/gdb/testsuite/../data-directory -i=mi
              21297 $obj/gdb/testsuite/../../gdb/gdb -nw -nx -data-directory $obj/gdb/testsuite/../data-directory -i=mi

The parent gdb hangs here:

21294:  $obj/gdb/testsuite/../../gdb/gdb -nw 
------------  lwp# 1 / thread# 1  ---------------
 0000000000000000 SYS#0    ()
 0000000000daeccd procfs_target::create_inferior(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char**, int) () + 97 (procfs.c:2853)
 0000000000ca63a7 run_command_1(char const*, int, run_how) () + 349 (basic_string.h:187)
 0000000000ca6516 start_command(char const*, int) () + 26 (infcmd.c:584)
 0000000000b3ca8e do_const_cfunc(cmd_list_element*, char const*, int) () + f (cli-decode.c:96)
 0000000000b3ed77 cmd_func(cmd_list_element*, char const*, int) () + 32 (cli-decode.c:2113)
 0000000000f2d219 execute_command(char const*, int) () + 455 (top.c:657)
 0000000000d4ad77 mi_execute_cli_command(char const*, int, char const*) () + 242 (basic_string.h:187)
 0000000000d4ae80 mi_cmd_exec_run(char const*, char**, int) () + ba (mi-main.c:473)

with these process flags

21294:	$obj/gdb/testsuite/../../gdb/gdb -nw 
	data model = _LP64  flags = VFORKP|ORPHAN|MSACCT|MSFORK
	sigpend = 0x00004103,0x00000000,0x00000000
 /1:	flags = 0
	sigmask = 0xffbffeff,0xffffffff,0x000000ff
	cursig = SIGKILL
 /2:	flags = DETACH|STOPPED|ASLEEP  lwp_park(0x0,0x0,0x0)
	why = PR_SUSPENDED
	sigmask = 0x000a2002,0x00000000,0x00000000
[...]

while the child sits at

21297:  $obj/gdb/testsuite/../../gdb/gdb -nw 
 00007fffbf078a0b execve   (7fffbffff756, 7fffbfffec58, 7fffbfffec90, 0)
 00007fffbef84cf6 execvpex () + f6
 00007fffbef84f45 execvp () + 15
 0000000000d60a44 fork_inferior(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char**, void (*)(), gdb::function_view<void (int)>, void (*)(), char const*, void (*)(char const*, char* const*, char* const*)) () + 47f (fork-inferior.c:423)
 0000000000daeccd procfs_target::create_inferior(char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char**, int) () + 97 (procfs.c:2853)
 0000000000ca63a7 run_command_1(char const*, int, run_how) () + 349 (basic_string.h:187)
 0000000000ca6516 start_command(char const*, int) () + 26 (infcmd.c:584)
 0000000000b3ca8e do_const_cfunc(cmd_list_element*, char const*, int) () + f (cli-decode.c:96)
 0000000000b3ed77 cmd_func(cmd_list_element*, char const*, int) () + 32 (cli-decode.c:2113)
 0000000000f2d219 execute_command(char const*, int) () + 455 (top.c:657)
 0000000000d4ad77 mi_execute_cli_command(char const*, int, char const*) () + 242 (basic_string.h:187)
 0000000000d4ae80 mi_cmd_exec_run(char const*, char**, int) () + ba (mi-main.c:473)

with

21297:	$obj/gdb/testsuite/../../gdb/gdb -nw 
	data model = _LP64  flags = MSACCT|MSFORK
	exitset  = 0x00000000 0x04000000 0x00000000 0x00000000
	           0x00000000 0x00000000 0x00000000 0x00000000
 /1:	flags = STOPPED|ISTOP  execve(0x7fffbffff756,0x7fffbfffec58,0x7fffbfffec90,0x0)
	why = PR_SYSEXIT  what = execve

We have a deadlock here: the execve in the child cannot return until the
parent has handled the PR_SYSEXIT while the parent cannot run with a
vfork'ed child as documented in proc(4):

       The child of a vfork(2) borrows the  parent's  address  space.  When  a
       vfork(2) is executed by a traced process, all watched areas established
       for the parent are suspended until the child terminates or performs  an
       exec(2).  Any  watched areas established independently in the child are
       cancelled when the parent resumes  after  the  child's  termination  or
       exec(2).  PCWATCH  fails  with  EBUSY  if  applied  to  the parent of a
       vfork(2) before the child has terminated or performed an  exec(2).  The
       PR_VFORKP  flag  is  set  in  the  pstatus  structure for such a parent
       process.

In that situation, the parent cannot be killed even with SIGKILL (as
runtest will attempt once the timeout occurs; the pending signal can be
seen in the pflags output above), so the whole test hangs until one
manually kills the child process.

Fortunately, there's an easy way out: when using fork instead of vfork,
the problem doesn't occur, and this is what the current patch does: it
calls fork_inferior with a dummy pre_trace_fun arg.

Tested on amd64-pc-solaris2.11 and sparcv9-sun-solaris2.11.

While this is enough to avoid the hang and let make check finish, the
test still FAILs:

FAIL: gdb.mi/mi-exec-run.exp: inferior-tty=main: mi=main: force-fail=1: run failure detected (timeout)
FAIL: gdb.mi/mi-exec-run.exp: inferior-tty=main: mi=separate: force-fail=1: run failure detected (timeout)
FAIL: gdb.mi/mi-exec-run.exp: inferior-tty=separate: mi=main: force-fail=1: run failure detected (timeout)
FAIL: gdb.mi/mi-exec-run.exp: inferior-tty=separate: mi=separate: force-fail=1: run failure detected (timeout)

It turns out the output isn't what the testcase expects:

* On Solaris:

-exec-run --start
=breakpoint-created,bkpt={number="1",type="breakpoint",disp="del",enabled="y",addr="0x0000000000400ff6",func="main",file="$src/gdb/testsuite/gdb.mi/mi-start.c",fullname="$src/gdb/testsuite/gdb.mi/mi-start.c",line="21",thread-groups=["i1"],times="0",original-location="-qualified main"}
=thread-group-started,id="i1",pid="26446"
=thread-created,id="1",group-id="i1"
&"Cannot access memory at address 0x334000003347\n"
&"Cannot access memory at address 0x33400000333f\n"
&"Cannot access memory at address 0x334000003347\n"
&"Cannot access memory at address 0x33400000333f\n"
=library-loaded,id="/usr/lib/amd64/ld.so.1",target-name="/usr/lib/amd64/ld.so.1",host-name="/usr/lib/amd64/ld.so.1",symbols-loaded="0",thread-group="i1",ranges=[{from="0x00007fffbf425c00",to="0x00007fffbf45a401"}]
=breakpoint-modified,bkpt={number="1",type="breakpoint",disp="del",enabled="y",addr="0x0000000000400ff2",func="main",file="$src/gdb/testsuite/gdb.mi/mi-start.c",fullname="$src/gdb/testsuite/gdb.mi/mi-start.c",line="20",thread-groups=["i1"],times="0",original-location="-qualified main"}
^running
*running,thread-id="all"
(gdb) 
&"warning: Cannot exec $obj/gdb/testsuite/outputs/gdb.mi/mi-exec-run/mi-exec-run.nox\n"
&"warning: Error: Permission denied\nsaw mi error
"
~"[Inferior 1 (process 26446) exited with code 0177]\n"
=thread-exited,id="1",group-id="i1"
=thread-group-exited,id="i1",exit-code="0177"
*stopped,reason="exited",exit-code="0177"

* On Linux/x86_64:

-exec-run --start
=breakpoint-created,bkpt={number="1",type="breakpoint",disp="del",enabled="y",addr="0x0000000000400482",func="main",file="/vol/src/gnu/gdb/hg/master/dist/gdb/testsuite/gdb.mi/mi-start.c",fullname="/vol/src/gnu/gdb/hg/master/dist/gdb/testsuite/gdb.mi/mi-start.c",line="21",thread-groups=["i1"],times="0",original-location="main"}
&"warning: Cannot exec $obj/gdb/testsuite/outputs/gdb.mi/mi-exec-run/mi-exec-run.nox\n"
&"warning: Error: Permission denied\n"
&"\n"
saw mi error
=thread-group-started,id="i1",pid="63270"
=thread-created,id="1",group-id="i1"
=thread-exited,id="1",group-id="i1"
=thread-group-exited,id="i1"
^error,msg="During startup program exited with code 127."
(gdb) 
saw mi error

The testcase only expects

^error,msg="During startup program exited with code 127."

printed in nat/fork-inferior.c (startup_inferior); similarly, but not
identical windows-nat.c (windows_nat_target::get_windows_debug_event)

while on Solaris one gets

~"[Inferior 1 (process 26446) exited with code 0177]\n"

from infrun.c (print_exited_reason).

I mean to fix this as a followup, but this being a corner case which
only happens when the exec fails can wait while it's quite helpful to
have the actual fix in for the GDB 10 release.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University


2020-06-24  Rainer Orth  <ro@CeBiTec.Uni-Bielefeld.DE>

	* procfs.c (procfs_pre_trace): New function.
	(procfs_target::create_inferior): Pass it to fork_inferior.

Comments

Shahab Vahedi via Gdb-patches June 25, 2020, 3:27 p.m. | #1
On 6/25/20 1:40 PM, Rainer Orth wrote:
> 

> 2020-06-24  Rainer Orth  <ro@CeBiTec.Uni-Bielefeld.DE>

> 

> 	* procfs.c (procfs_pre_trace): New function.

> 	(procfs_target::create_inferior): Pass it to fork_inferior.


LGTM.

Thanks,
Pedro Alves

Patch

# HG changeset patch
# Parent  6ba0177b65b783cba7ef123b5223d740da428fbc
Use fork instead of vfork on Solaris

diff --git a/gdb/procfs.c b/gdb/procfs.c
--- a/gdb/procfs.c
+++ b/gdb/procfs.c
@@ -2759,6 +2759,13 @@  procfs_set_exec_trap (void)
   /*destroy_procinfo (pi);*/
 }
 
+/* Dummy function to be sure fork_inferior uses fork(2) and not vfork(2).
+   This avoids a possible deadlock gdb and its vfork'ed child.  */
+static void
+procfs_pre_trace (void)
+{
+}
+
 /* This function is called BEFORE gdb forks the inferior process.  Its
    only real responsibility is to set things up for the fork, and tell
    GDB which two functions to call after the fork (one for the parent,
@@ -2851,7 +2858,7 @@  procfs_target::create_inferior (const ch
     push_target (this);
 
   pid = fork_inferior (exec_file, allargs, env, procfs_set_exec_trap,
-		       NULL, NULL, shell_file, NULL);
+		       NULL, procfs_pre_trace, shell_file, NULL);
 
   /* We have something that executes now.  We'll be running through
      the shell at this point (if startup-with-shell is true), but the