[2/2] glibc-2.34: Fix internal error when running gdb.base/gdb-sigterm.exp

Message ID 20210710025129.201884-3-kevinb@redhat.com
State New
Headers show
Series
  • glibc-2.34: Fix gdb.base/gdb-sigterm.exp failure/error
Related show

Commit Message

Simon Marchi via Gdb-patches July 10, 2021, 2:51 a.m.
This commit fixes an internal error that occurs when running
gdb.base/gdb-sigterm.exp on a machine with a prerelease of glibc-2.34.

This test case, gdb.base/gdb-sigterm.exp, turns on infrun debugging
(via "set debug infrun 1") and then attempts to do a step at the
source line "for (;;);" Of course, this is an infinite loop, so
there's no way to get to the next line.  When run by hand, GDB busily
prints out the infrun debugging log messages without end.  After a
while, a SIGTERM is sent from an external source - when debugging by
hand, I'm sending it from another terminal, but when running the test,
the test case sends a SIGTERM shortly after one of the infrun debug
messages is seen.  Regardless, GDB is expected to shut down cleanly.
The test is run 50 times; if it shuts down cleanly after 50
iterations, the test passes; otherwise it's supposed to fail.

When running the test by hand, I'm doing:

  file testsuite/outputs/gdb.base/gdb-sigterm/gdb-sigterm
  set height 0
  set width 0
  b main
  run
  tbreak 29
  continue
  set range-stepping off
  set debug infrun 1
  step

At this point, GDB will start generating copious amounts of log
output.  I then send a SIGTERM to the gdb process from another
terminal.

The internal error occurs roughly once in every five tries.  The
backtrace shows that a double internal error occurs - GDB detects
this recursion and bails out on the second one.  The first internal
error is caused by the assert in inferior_thread() in thread.c.

Further up the stack, I see:

  #34 0x0000000000756eaa in do_target_wait_1 (inf=0x1603a60, ptid=...,
      status=0x7ffca2d42578, options=...)
      at worktree-glibc234/gdb/infrun.c:3663

It's calling target_wait() at this point; there is a call chain
eventually leading to inferior_thread() where the assert occurs.  One
of the interesting things is that the call chain goes through the QUIT
mechanism (i.e.  a call to maybe_quit()) in target_read().  If global
sync_quit_force_run is non-zero - which will be the case after GDB
receives a SIGTERM signal - the path through various calls is taken
eventually culiminating in the assert / internal error.

Now for the important bit: At the beginning of do_target_wait_1()
is the following code:

  /* We know that we are looking for an event in the target of inferior
     INF, but we don't know which thread the event might come from.  As
     such we want to make sure that INFERIOR_PTID is reset so that none of
     the wait code relies on it - doing so is always a mistake.  */
  switch_to_inferior_no_thread (inf);

This call causes the variable being tested by the assert to be
NULL (actually nullptr) which ultimately triggers the assert.

So...  it does seem that setting the variable in question
(current_thread_) to nullptr is intentional, but that path to the
triggered assert was not anticipated.

The reason that the problem occurs with glibc-2.34 is that
libthread_db will always be loaded now; thus the wait() machinery
found in linux-thread-db.c is used instead of linux_nat_target::wait
(which is found in linux-nat.c).

gdb/ChangeLog:

	* thread.c (any_thread_of_inferior): Don't call inferior_thread()
	when there is no current thread.
---
 gdb/thread.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

-- 
2.32.0

Comments

Pedro Alves July 10, 2021, 7:07 p.m. | #1
Hi Kevin!

On 2021-07-10 3:51 a.m., Kevin Buettner via Gdb-patches wrote:

> The reason that the problem occurs with glibc-2.34 is that

> libthread_db will always be loaded now; thus the wait() machinery

> found in linux-thread-db.c is used instead of linux_nat_target::wait

> (which is found in linux-nat.c).

> 

> gdb/ChangeLog:

> 

> 	* thread.c (any_thread_of_inferior): Don't call inferior_thread()

> 	when there is no current thread.


This isn't right.  The real bug is that inferior_ptid and current_thread_ got
out of sync.  Where did that happen?  What set inferior_ptid without using
switch_to_thread & friends?
Simon Marchi via Gdb-patches July 13, 2021, 11:35 p.m. | #2
Hi Pedro,

On Sat, 10 Jul 2021 20:07:28 +0100
Pedro Alves <pedro@palves.net> wrote:

> On 2021-07-10 3:51 a.m., Kevin Buettner via Gdb-patches wrote:

> 

> > The reason that the problem occurs with glibc-2.34 is that

> > libthread_db will always be loaded now; thus the wait() machinery

> > found in linux-thread-db.c is used instead of linux_nat_target::wait

> > (which is found in linux-nat.c).

> > 

> > gdb/ChangeLog:

> > 

> > 	* thread.c (any_thread_of_inferior): Don't call inferior_thread()

> > 	when there is no current thread.  

> 

> This isn't right.  The real bug is that inferior_ptid and current_thread_ got

> out of sync.  Where did that happen?  What set inferior_ptid without using

> switch_to_thread & friends?


inferior_ptid is temporarily set without changing the current thread in
ps_xfer_memory in proc-service.c:

static ps_err_e
ps_xfer_memory (const struct ps_prochandle *ph, psaddr_t addr,
		gdb_byte *buf, size_t len, int write)
{
  scoped_restore_current_inferior restore_inferior;
  set_current_inferior (ph->thread->inf);

  scoped_restore_current_program_space restore_current_progspace;
  set_current_program_space (ph->thread->inf->pspace);

  scoped_restore save_inferior_ptid = make_scoped_restore (&inferior_ptid);
  inferior_ptid = ph->thread->ptid;

  CORE_ADDR core_addr = ps_addr_to_core_addr (addr);

  int ret;
  if (write)
    ret = target_write_memory (core_addr, buf, len);
  else
    ret = target_read_memory (core_addr, buf, len);
  return (ret == 0 ? PS_OK : PS_ERR);
}

Under normal circumstances, inferior_ptid is set and then either
target_read_memory() or target_write_memory() are called after
which it'll be reset to its former value when ps_xfer_memory
returns.

However, should GDB receive a SIGTERM while this is happening,
the QUIT processing code in target_read() will be taken, leading
to an eventual call to inferior_thread where the assert occurs.

Here's the relevant portion of the backtrace on my F34 w/ glibc-2.33.9000
machine:

#18 0x0000000000b6abf2 in internal_error (file=<optimized out>, 
    line=<optimized out>, fmt=<optimized out>)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdbsupport/errors.cc:55
#19 0x00000000009b72e2 in inferior_thread ()
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/thread.c:72
#20 0x00000000009b8b2b in any_thread_of_inferior (inf=0x1603a60)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/thread.c:642
#21 0x00000000009c45b8 in kill_or_detach (inf=0x1603a60, from_tty=0)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/top.c:1677
#22 0x00000000009c49e7 in quit_force (exit_arg=0x0, from_tty=0)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/top.c:1779
#23 0x0000000000a37e57 in quit ()
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/utils.c:629
#24 0x0000000000a37eb3 in maybe_quit ()
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/utils.c:653
#25 0x000000000099d8ef in target_read (ops=0x108b9c0 <the_thread_db_target>, 
    object=TARGET_OBJECT_MEMORY, annex=0x0, buf=0x7ffca2d41d90 "", 
    offset=140737354129392, len=8)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/target.c:2025
#26 0x000000000099d2e6 in target_read_memory (memaddr=0x7ffff7ffdff0, 
    myaddr=0x7ffca2d41d90 "", len=8)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/target.c:1810
#27 0x0000000000854fa0 in ps_xfer_memory (During symbol reading: const value length mismatch for 'sigall_set', got 128, expected 0
ph=0x1b086d8, addr=0x7ffff7ffdff0, 
    buf=0x7ffca2d41d90 "", len=8, write=0)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/proc-service.c:90
#28 0x0000000000855160 in ps_pdread (ph=0x1b086d8, addr=0x7ffff7ffdff0, 
    buf=0x7ffca2d41d90, size=8)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/proc-service.c:124
#29 0x00007fc1240151f0 in _td_fetch_value (ta=ta@entry=0x1b08740, 
    desc=desc@entry=0x1b087f0, descriptor_name=descriptor_name@entry=14, 
    idx=idx@entry=0x0, address=<optimized out>, 
    result=result@entry=0x7ffca2d41dc0) at fetch-value.c:115
#30 0x00007fc124011bdd in td_ta_map_lwp2thr (ta_arg=0x1b08740, lwpid=1452077, 
    th=0x7ffca2d41f90) at td_ta_map_lwp2thr.c:194
#31 0x00000000007ab627 in thread_from_lwp (stopped=0x18588b0, ptid=...)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/linux-thread-db.c:416
#32 0x00000000007ada46 in thread_db_target::wait (
    this=0x108b9c0 <the_thread_db_target>, ptid=..., ourstatus=0x7ffca2d42578, 
    options=...)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/linux-thread-db.c:1431
#33 0x000000000099e953 in target_wait (ptid=..., status=0x7ffca2d42578, 
    options=...)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/target.c:2608
#34 0x0000000000756eaa in do_target_wait_1 (inf=0x1603a60, ptid=..., 
    status=0x7ffca2d42578, options=...)
    at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/infrun.c:3663

The key points are as follows:

1) inferior_ptid and current_thread_ are set to null_ptid and nullptr
   respectively via the call to switch_to_inferior_no_thread() in
   do_target_wait_1().
2) Later on, while attempting to figure out the thread corresponding to an
   LWP, ps_xfer_memory is called().
3) inferior_ptid is set (without also setting current_thread_) within
   ps_xfer_memory().  Arrangements are made, however, to have
   inferior_ptid's value restored upon return from ps_xfer_memory().
4) Lower level memory xfer functions are invoked from ps_xfer_memory().
5) A SIGTERM is delivered to GDB; the signal handler is invoked causing
   sync_quit_force_run to be set.
6) The QUIT processing code in target_read_memory notices that sync_quit_force_run
   is set.  It calls quit() which calls quit_force() which calls kill_or_detach().
7) Mayhem (well, an assert) occurs due to inferior_ptid being something other than
   null_ptid while current_thread_ is still nullptr.

Kevin
Pedro Alves July 14, 2021, 12:07 a.m. | #3
On 2021-07-14 12:35 a.m., Kevin Buettner wrote:

> Pedro Alves <pedro@palves.net> wrote:


>> This isn't right.  The real bug is that inferior_ptid and current_thread_ got

>> out of sync.  Where did that happen?  What set inferior_ptid without using

>> switch_to_thread & friends?

> 

> inferior_ptid is temporarily set without changing the current thread in

> ps_xfer_memory in proc-service.c:

> 

> static ps_err_e

> ps_xfer_memory (const struct ps_prochandle *ph, psaddr_t addr,

> 		gdb_byte *buf, size_t len, int write)

> {

>   scoped_restore_current_inferior restore_inferior;

>   set_current_inferior (ph->thread->inf);

> 

>   scoped_restore_current_program_space restore_current_progspace;

>   set_current_program_space (ph->thread->inf->pspace);

> 

>   scoped_restore save_inferior_ptid = make_scoped_restore (&inferior_ptid);

>   inferior_ptid = ph->thread->ptid;

> 

>   CORE_ADDR core_addr = ps_addr_to_core_addr (addr);

> 

>   int ret;

>   if (write)

>     ret = target_write_memory (core_addr, buf, len);

>   else

>     ret = target_read_memory (core_addr, buf, len);

>   return (ret == 0 ? PS_OK : PS_ERR);

> }

> 

> Under normal circumstances, inferior_ptid is set and then either

> target_read_memory() or target_write_memory() are called after

> which it'll be reset to its former value when ps_xfer_memory

> returns.

> 

> However, should GDB receive a SIGTERM while this is happening,

> the QUIT processing code in target_read() will be taken, leading

> to an eventual call to inferior_thread where the assert occurs.

> 

> Here's the relevant portion of the backtrace on my F34 w/ glibc-2.33.9000

> machine:

> 

> #18 0x0000000000b6abf2 in internal_error (file=<optimized out>, 

>     line=<optimized out>, fmt=<optimized out>)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdbsupport/errors.cc:55

> #19 0x00000000009b72e2 in inferior_thread ()

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/thread.c:72

> #20 0x00000000009b8b2b in any_thread_of_inferior (inf=0x1603a60)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/thread.c:642

> #21 0x00000000009c45b8 in kill_or_detach (inf=0x1603a60, from_tty=0)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/top.c:1677

> #22 0x00000000009c49e7 in quit_force (exit_arg=0x0, from_tty=0)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/top.c:1779

> #23 0x0000000000a37e57 in quit ()

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/utils.c:629

> #24 0x0000000000a37eb3 in maybe_quit ()

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/utils.c:653

> #25 0x000000000099d8ef in target_read (ops=0x108b9c0 <the_thread_db_target>, 

>     object=TARGET_OBJECT_MEMORY, annex=0x0, buf=0x7ffca2d41d90 "", 

>     offset=140737354129392, len=8)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/target.c:2025

> #26 0x000000000099d2e6 in target_read_memory (memaddr=0x7ffff7ffdff0, 

>     myaddr=0x7ffca2d41d90 "", len=8)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/target.c:1810

> #27 0x0000000000854fa0 in ps_xfer_memory (During symbol reading: const value length mismatch for 'sigall_set', got 128, expected 0

> ph=0x1b086d8, addr=0x7ffff7ffdff0, 

>     buf=0x7ffca2d41d90 "", len=8, write=0)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/proc-service.c:90

> #28 0x0000000000855160 in ps_pdread (ph=0x1b086d8, addr=0x7ffff7ffdff0, 

>     buf=0x7ffca2d41d90, size=8)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/proc-service.c:124

> #29 0x00007fc1240151f0 in _td_fetch_value (ta=ta@entry=0x1b08740, 

>     desc=desc@entry=0x1b087f0, descriptor_name=descriptor_name@entry=14, 

>     idx=idx@entry=0x0, address=<optimized out>, 

>     result=result@entry=0x7ffca2d41dc0) at fetch-value.c:115

> #30 0x00007fc124011bdd in td_ta_map_lwp2thr (ta_arg=0x1b08740, lwpid=1452077, 

>     th=0x7ffca2d41f90) at td_ta_map_lwp2thr.c:194

> #31 0x00000000007ab627 in thread_from_lwp (stopped=0x18588b0, ptid=...)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/linux-thread-db.c:416

> #32 0x00000000007ada46 in thread_db_target::wait (

>     this=0x108b9c0 <the_thread_db_target>, ptid=..., ourstatus=0x7ffca2d42578, 

>     options=...)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/linux-thread-db.c:1431

> #33 0x000000000099e953 in target_wait (ptid=..., status=0x7ffca2d42578, 

>     options=...)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/target.c:2608

> #34 0x0000000000756eaa in do_target_wait_1 (inf=0x1603a60, ptid=..., 

>     status=0x7ffca2d42578, options=...)

>     at /ironwood1/sourceware-git/f34-2-glibc234/bld/../../worktree-glibc234/gdb/infrun.c:3663


Ah.  This is sooo much clearer explained this way.

> 

> The key points are as follows:

> 

> 1) inferior_ptid and current_thread_ are set to null_ptid and nullptr

>    respectively via the call to switch_to_inferior_no_thread() in

>    do_target_wait_1().

> 2) Later on, while attempting to figure out the thread corresponding to an

>    LWP, ps_xfer_memory is called().

> 3) inferior_ptid is set (without also setting current_thread_) within

>    ps_xfer_memory().  Arrangements are made, however, to have

>    inferior_ptid's value restored upon return from ps_xfer_memory().


OK, this is one case where we can't use switch_to_thread, because we're in
really lower level code, and ps_xfer_memory may be reached even without a thread,
and also it's not a good idea to switch threads here because it flushes frame caches.

> 4) Lower level memory xfer functions are invoked from ps_xfer_memory().

> 5) A SIGTERM is delivered to GDB; the signal handler is invoked causing

>    sync_quit_force_run to be set.

> 6) The QUIT processing code in target_read_memory notices that sync_quit_force_run

>    is set.  It calls quit() which calls quit_force() which calls kill_or_detach().

> 7) Mayhem (well, an assert) occurs due to inferior_ptid being something other than

>    null_ptid while current_thread_ is still nullptr.


It looks to me that this sync_quit_force_run mechanism as is today is too dangerous.  We're
quite deep in the target implementation, with all sorts of state half switched around, we
can't really assume that it's safe to reenter kill_or_detach (and target_kill/target_detach)
like that.

Off hand, it seems to me that the right fix is to let the normal quit exception propagate all
the way to the top level, and then have the top level call quit_force if sync_quit_force_run
is set.

Patch

diff --git a/gdb/thread.c b/gdb/thread.c
index f850f05ad48..3fe81e810c0 100644
--- a/gdb/thread.c
+++ b/gdb/thread.c
@@ -638,7 +638,8 @@  any_thread_of_inferior (inferior *inf)
   gdb_assert (inf->pid != 0);
 
   /* Prefer the current thread, if there's one.  */
-  if (inf == current_inferior () && inferior_ptid != null_ptid)
+  if (inf == current_inferior () && inferior_ptid != null_ptid
+      && !is_current_thread (nullptr))
     return inferior_thread ();
 
   for (thread_info *tp : inf->non_exited_threads ())