GDB overlay support for RISC-V #9

Open
dmi391 wants to merge 1 commit into base: master

@dmi391 commented Sep 14, 2022

  1. Fixed problem with overlay support for RISC-V
    The GDB client now supports overlay debugging in auto mode.

To enable overlay debugging in GDB, the pointer `gdbarch->overlay_update` must be initialized with the function `simple_overlay_update(struct obj_section *osect)`: in `/gdb/riscv-tdep.c`, at the end of the definition of `riscv_gdbarch_init(...)`, call `set_gdbarch_overlay_update(gdbarch, simple_overlay_update)`, in the same way as `/gdb/m32r-tdep.c` does.

Without this fix the GDB client cannot update the overlay table `_ovly_table` from target RAM, and overlay debugging does not work:

(gdb) overlay list
No sections are mapped.
(gdb) overlay load
This target does not know how to read its overlay state.

With this fix the GDB client supports overlay debugging in auto mode (it updates the overlay table `_ovly_table` from target RAM):

(gdb) set verbose on
(gdb) overlay auto
Automatic overlay debugging enabled.
...
(gdb) overlay list
Section .ovly1, loaded at 0x10080444 - 0x100805d0, mapped at 0x10000060 - 0x100001ec
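
To illustrate why the missing registration produces the error shown above, here is a minimal standalone sketch of the hook mechanism. It is a model, not GDB source: `struct arch_model`, `model_set_overlay_update`, `model_overlay_update` and `model_overlay_load` are hypothetical names, and the assumption is that `overlay load` reports the error whenever no `overlay_update` hook has been registered for the architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct obj_section;  /* opaque in this sketch; only passed through */

/* Hypothetical stand-in for GDB's gdbarch: only the overlay hook is modeled. */
struct arch_model {
    void (*overlay_update)(struct obj_section *osect);
};

int updates_seen = 0;

/* Stand-in for simple_overlay_update. The real function refreshes the
   overlay state from the target; this one only counts calls. */
void model_overlay_update(struct obj_section *osect)
{
    (void)osect;
    updates_seen++;
}

/* Models the registration that the fix adds to riscv_gdbarch_init. */
void model_set_overlay_update(struct arch_model *arch,
                              void (*hook)(struct obj_section *))
{
    assert(arch != NULL);
    arch->overlay_update = hook;
}

/* Models "overlay load": returns 0 on success, -1 if no hook is registered. */
int model_overlay_load(const struct arch_model *arch)
{
    assert(arch != NULL);
    if (arch->overlay_update == NULL) {
        fprintf(stderr,
                "This target does not know how to read its overlay state.\n");
        return -1;
    }
    arch->overlay_update(NULL);  /* no specific section in this model */
    return 0;
}
```

With a zero-initialized `struct arch_model`, `model_overlay_load` fails with the message from the transcript; after `model_set_overlay_update(&arch, model_overlay_update)` it succeeds. The one-line fix in `/gdb/riscv-tdep.c` plays the role of that registration call.
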
  2. Fixed problem with output of overlay GDB commands

The GDB commands `overlay auto`, `overlay manual` and `overlay off` print their output messages incorrectly. In `/gdb/symfile.c`, the `printf_filtered(_("..."))` calls in `overlay_auto_command(...)`, `overlay_manual_command(...)` and `overlay_off_command(...)` need a '\n' at the end of the string. Otherwise the message of these commands is not separated from the output of the next GDB command. These messages are displayed only with `set verbose on`.

Without this fix (incorrect, unseparated):

(gdb) set verbose on
(gdb) overlay auto
<nothing>
(gdb) overlay list
Automatic overlay debugging enabled.No sections are mapped.

With this fix ('\n' added, correct):

(gdb) set verbose on
(gdb) overlay auto
Automatic overlay debugging enabled.
(gdb) overlay list
No sections are mapped.
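
The effect of the missing '\n' can be reproduced outside GDB. The sketch below is only an illustration: `emit` is a hypothetical helper that appends each message to one buffer, the way successive writes end up on a terminal. It does not use `printf_filtered` or any other GDB function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Appends msg to the NUL-terminated string in out (total capacity cap),
   mimicking successive writes to the same terminal. Returns out. */
char *emit(char *out, size_t cap, const char *msg)
{
    assert(out != NULL && msg != NULL);
    size_t used = strlen(out);
    if (used < cap)
        snprintf(out + used, cap - used, "%s", msg);
    return out;
}
```

Emitting `"Automatic overlay debugging enabled."` followed by `"No sections are mapped.\n"` leaves the two messages joined on one line, as in the incorrect transcript. Adding '\n' to the first string puts each message on its own line.
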

My overlay demo project for RISC-V: dmi391/overlay_demo
My custom build of the GDB client with these two fixes (it works correctly): release gdb-riscv-ovly

kito-cheng pushed a commit that referenced this pull request Sep 25, 2023
While working on a later patch, which changes gdb.base/foll-vfork.exp,
I noticed that sometimes I would hit this assert:

  x86_linux_update_debug_registers: Assertion `lwp_is_stopped (lwp)' failed.

I eventually tracked it down to a combination of schedule-multiple
mode being on, target-non-stop being off, follow-fork-mode being set
to child, and some bad timing.  The failing case is pretty simple, a
single threaded application performs a vfork, the child process then
execs some other application while the parent process (once the vfork
child has completed its exec) just exits.  As best I understand
things, here's what happens when things go wrong:

  1. The parent process performs a vfork, GDB sees the VFORKED event
  and creates an inferior and thread for the vfork child,

  2. GDB resumes the vfork child process.  As schedule-multiple is on
  and target-non-stop is off, this is translated into a request to
  start all processes (see user_visible_resume_ptid),

  3. In the linux-nat layer we spot that one of the threads we are
  about to start is a vfork parent, and so don't start that
  thread (see resume_lwp), the vfork child thread is resumed,

  4. GDB waits for the next event, eventually entering
  linux_nat_target::wait, which in turn calls linux_nat_wait_1,

  5. In linux_nat_wait_1 we eventually call
  resume_stopped_resumed_lwps, this should restart threads that have
  stopped but don't actually have anything interesting to report.

  6. Unfortunately, resume_stopped_resumed_lwps doesn't check for
  vfork parents like resume_lwp does, so at this point the vfork
  parent is resumed.  This feels like the start of the bug, and this
  is where I'm proposing to fix things, but, resuming the vfork parent
  isn't the worst thing in the world because....

  7. As the vfork child is still alive the kernel holds the vfork
  parent stopped,

  8. Eventually the child performs its exec and GDB is sent an EXECD
  event.  However, because the parent is resumed, as soon as the child
  performs its exec the vfork parent also sends a VFORK_DONE event to
  GDB,

  9. Depending on timing both of these events might seem to arrive in
  GDB at the same time.  Normally GDB expects to see the EXECD or
  EXITED/SIGNALED event from the vfork child before getting the
  VFORK_DONE in the parent.  We know this because it is as a result of
  the EXECD/EXITED/SIGNALED that GDB detaches from the parent (see
  handle_vfork_child_exec_or_exit for details).  Further the comment
  in target/waitstatus.h on TARGET_WAITKIND_VFORK_DONE indicates that
  when we remain attached to the child (not the parent) we should not
  expect to see a VFORK_DONE,

  10. If both events arrive at the same time then GDB will randomly
  choose one event to handle first, in some cases this will be the
  VFORK_DONE.  As described above, upon seeing a VFORK_DONE GDB
  expects that (a) the vfork child has finished, however, in this case
  this is not completely true, the child has finished, but GDB has not
  processed the event associated with the completion yet, and (b) upon
  seeing a VFORK_DONE GDB assumes we are remaining attached to the
  parent, and so resumes the parent process,

  11. GDB now handles the EXECD event.  In our case we are detaching
  from the parent, so GDB calls target_detach (see
  handle_vfork_child_exec_or_exit),

  12. While this has been going on the vfork parent is executing, and
  might even exit,

  13. In linux_nat_target::detach the first thing we do is stop all
  threads in the process we're detaching from, the result of the stop
  request will be cached on the lwp_info object,

  14. In our case the vfork parent has exited though, so when GDB
  waits for the thread, instead of a stop due to signal, we instead
  get a thread exited status,

  15. Later in the detach process we try to resume the threads just
  prior to making the ptrace call to actually detach (see
  detach_one_lwp), as part of the process to resume a thread we try to
  touch some registers within the thread, and before doing this GDB
  asserts that the thread is stopped,

  16. An exited thread is not classified as stopped, and so the assert
  triggers!
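
For reference, the inferior shape described before the numbered steps (a single-threaded program that vforks, with the child exec'ing another program) could look roughly like the sketch below. This is not the gdb.base/foll-vfork.exp test program; `vfork_then_exec` and `wait_for_exit_code` are hypothetical helpers, and the second one exists only so the sketch can be checked without a debugger.

```c
#define _DEFAULT_SOURCE  /* exposes vfork() on glibc under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Performs a vfork; the child immediately execs argv[0].
   Returns the child's pid in the parent, or -1 if vfork failed. */
pid_t vfork_then_exec(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);
    pid_t pid = vfork();
    if (pid == 0) {
        /* vfork child: only exec or _exit are safe here. */
        execv(argv[0], argv);
        _exit(127);  /* reached only if the exec failed */
    }
    /* The kernel lets the parent continue only once the child has
       exec'd or exited. */
    return pid;
}

/* Not part of the scenario (there the parent just exits): reaps the
   child and returns its exit code, or -1 on any failure. */
int wait_for_exit_code(pid_t pid)
{
    int status = 0;
    if (pid <= 0 || waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In the failing case the parent would simply return from `main` right after `vfork_then_exec`, which is what lets it exit while GDB is still in the middle of the detach.
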

So there are two bugs I see here.  The first, and most critical one,
is in step #6.  I think that resume_stopped_resumed_lwps should not
resume a vfork parent, just like resume_lwp doesn't resume a vfork
parent.
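
As a rough model of the proposed change (not the actual linux-nat code), the decision can be reduced to a predicate over a few flags. `struct lwp_model` and both functions are hypothetical; the real lwp_info and inferior structures carry far more state, and how GDB detects a vfork parent is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a thread (LWP); not GDB's lwp_info. */
struct lwp_model {
    bool stopped;            /* currently stopped */
    bool resumed;            /* it was asked to run */
    bool has_pending_event;  /* it has something interesting to report */
    bool is_vfork_parent;    /* its vfork child has not exec'd or exited */
};

/* Before the fix: restart any stopped-but-resumed thread that has
   nothing to report. */
bool should_restart_before(const struct lwp_model *lwp)
{
    assert(lwp != NULL);
    return lwp->stopped && lwp->resumed && !lwp->has_pending_event;
}

/* After the fix: same condition, but a vfork parent is left stopped. */
bool should_restart_after(const struct lwp_model *lwp)
{
    return should_restart_before(lwp) && !lwp->is_vfork_parent;
}
```

For a stopped, resumed vfork parent with no pending event, the first predicate says "restart" and the second says "leave it alone". All other threads are treated identically by both.
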

With this change in place the vfork parent will remain stopped at that
step, and GDB will only see the EXECD/EXITED/SIGNALLED event.  The
problems in #9 and #10 are therefore skipped and we arrive at #11,
handling the EXECD event.  As the parent is still stopped #12 doesn't
apply, and in #13 when we try to stop the process we will see that it
is already stopped, there's no risk of the vfork parent exiting before
we get to this point.  And finally, in #15 we are safe to poke the
process registers because it will not have exited by this point.

However, I did mention two bugs.

The second bug I've not yet managed to actually trigger, but I'm
convinced it must exist: if we forget vforks for a moment, in step #13
above, when linux_nat_target::detach is called, we first try to stop
all threads in the process GDB is detaching from.  If we imagine a
multi-threaded inferior with many threads, and GDB running in non-stop
mode, then, if the user tries to detach there is a chance that a thread
could exit just as linux_nat_target::detach is entered, in which case
we should be able to trigger the same assert.

But, like I said, I've not (yet) managed to trigger this second bug,
and even if I could, the fix would not belong in this commit, so I'm
pointing this out just for completeness.

There's no test included in this commit.  In a couple of commits' time
I will expand gdb.base/foll-vfork.exp which is when this bug would be
exposed.  Unfortunately there are at least two other bugs (separate
from the ones discussed above) that need fixing first, these will be
fixed in the next commits before the gdb.base/foll-vfork.exp test is
expanded.

If you do want to reproduce this failure then you will certainly
need to run the gdb.base/foll-vfork.exp test in a loop as the failures
are all very timing sensitive.  I've found that running multiple
copies in parallel makes the failure more likely to appear, I usually
run ~6 copies in parallel and expect to see a failure within
10 minutes.