Ryzen 3950X test failures #2680

Closed
emmatyping opened this issue Sep 18, 2020 · 16 comments

Comments

@emmatyping

From https://twitter.com/MikeHommey/status/1306816099040673792 I tried rr on my 3950x, and I got a few errors like #2677 and #2678, but I got a couple of others, so I figured I should report them in a new issue.

The following tests FAILED:
        565 - setuid-no-syscallbuf (Failed)
        880 - ignore_nested (Failed)
        881 - ignore_nested-no-syscallbuf (Failed)
        920 - nested_detach_wait (Failed)
        921 - nested_detach_wait-no-syscallbuf (Failed)
        1140 - nested_detach (Failed)
        1141 - nested_detach-no-syscallbuf (Failed)
        1152 - record_replay (Failed)
        1153 - record_replay-no-syscallbuf (Failed)
        2122 - ignore_nested-32 (Failed)
        2123 - ignore_nested-32-no-syscallbuf (Failed)
        2162 - nested_detach_wait-32 (Failed)
        2163 - nested_detach_wait-32-no-syscallbuf (Failed)
        2382 - nested_detach-32 (Failed)
        2383 - nested_detach-32-no-syscallbuf (Failed)
        2394 - record_replay-32 (Failed)
        2395 - record_replay-32-no-syscallbuf (Failed)

The ignore_nested tests seem to be the new ones not in the other issues. I attached all of the err outputs if that is of help. Let me know if I can do anything else to be of assistance.

rrtesterrs.zip

@jix

jix commented Sep 18, 2020

Did you run the zen_workaround.py script? (running ./bin/rr record echo will tell whether that is needed.)

On my 3950x those tests work reliably after running that script (and they seem to be the ones that fail without it).

@emmatyping
Author

I did run that script before running the tests, as I got the warning after running exactly the command you mention :)

@jix

jix commented Sep 18, 2020

Ah sorry, I could have realized that by checking my local error outputs and noticing that this warning would have ended up in the logs in your rrtesterrs.zip. Apart from that, I also get completely different errors w/o the workaround: I get errors during replay, while in your error logs the recording is already failing. (And as I already mentioned, I get none of these errors with the workaround enabled.)

AFAICT these test failures are unrelated to the AMD performance counter issues.

@rocallahan
Collaborator

Try rerunning with the latest master.

@emmatyping
Author

Still seeing these test failures on master.


The following tests FAILED:
        565 - setuid-no-syscallbuf (Failed)
        880 - ignore_nested (Failed)
        881 - ignore_nested-no-syscallbuf (Failed)
        920 - nested_detach_wait (Failed)
        921 - nested_detach_wait-no-syscallbuf (Failed)
        1140 - nested_detach (Failed)
        1141 - nested_detach-no-syscallbuf (Failed)
        1152 - record_replay (Failed)
        1153 - record_replay-no-syscallbuf (Failed)
        2122 - ignore_nested-32 (Failed)
        2123 - ignore_nested-32-no-syscallbuf (Failed)
        2162 - nested_detach_wait-32 (Failed)
        2163 - nested_detach_wait-32-no-syscallbuf (Failed)
        2382 - nested_detach-32 (Failed)
        2383 - nested_detach-32-no-syscallbuf (Failed)
        2394 - record_replay-32 (Failed)
        2395 - record_replay-32-no-syscallbuf (Failed)

@rocallahan
Collaborator

Post the verbose test log?

@jebrosen

Hi, I'm getting a nearly identical set of failures on a Ryzen 3700U (after applying the patch in #2740 for GDB 10.1).

        882 - ignore_nested (Failed)
        883 - ignore_nested-no-syscallbuf (Failed)
        922 - nested_detach_wait (Failed)
        923 - nested_detach_wait-no-syscallbuf (Failed)
        1146 - nested_detach (Failed)
        1147 - nested_detach-no-syscallbuf (Failed)
        1158 - record_replay (Failed)
        1159 - record_replay-no-syscallbuf (Failed)
        1280 - bad_ip-32 (Failed)
        1281 - bad_ip-32-no-syscallbuf (Failed)
        2132 - ignore_nested-32 (Failed)
        2133 - ignore_nested-32-no-syscallbuf (Failed)
        2172 - nested_detach_wait-32 (Failed)
        2173 - nested_detach_wait-32-no-syscallbuf (Failed)
        2396 - nested_detach-32 (Failed)
        2397 - nested_detach-32-no-syscallbuf (Failed)
        2408 - record_replay-32 (Failed)
        2409 - record_replay-32-no-syscallbuf (Failed)

Verbose log file: ctest-verbose-4.log

Most of the failed tests have this same error during recording:

Test 'ignore_nested' FAILED: : error during recording:
--------------------------------------------------
rr: ../src/util.cc:1225: bool rr::running_under_rr(bool): Assertion `ret == 0 || (ret == -1 && (*__errno_location ()) == 38)' failed.
--------------------------------------------------

@rocallahan
Collaborator

Please rerun the tests with c8a468d for a better error

@jebrosen

Those errors now say:

[FATAL ../src/util.cc:1229:running_under_rr() errno: EINVAL] Unexpected result for rrcall_check_presence: -1

On a hunch I changed record_syscall.cc:4072 to return EBADF instead of EINVAL, and they all changed:

      syscall_state.emulate_result(arguments_are_zero ? 0 : (uintptr_t)-EBADF);
[FATAL ../src/util.cc:1229:running_under_rr() errno: EBADF] Unexpected result for rrcall_check_presence: -1

After adding some debug printing, it's reportedly the arg 6 register that has a non-0 value inside SYS_rrcall_check_presence.
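
To make the symptom above concrete, here is a minimal sketch of the kind of check being described: the emulated call succeeds only when all six arguments are zero. The function name `emulate_check_presence` and the array-based signature are hypothetical and chosen for illustration; this is not rr's actual code, only the `arguments_are_zero ? 0 : -EINVAL` behaviour quoted in the comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the check quoted above: succeed only when
 * every one of the six syscall arguments is zero, else fail with EINVAL. */
static long emulate_check_presence(const uint64_t args[6]) {
  for (int i = 0; i < 6; ++i) {
    if (args[i] != 0) {
      return -EINVAL;
    }
  }
  return 0;
}

/* A non-zero value in the arg 6 slot alone is enough to trigger the failure. */
void demo_check_presence(void) {
  const uint64_t all_zero[6] = {0, 0, 0, 0, 0, 0};
  const uint64_t bad_arg6[6] = {0, 0, 0, 0, 0, 0x500000000000ULL};
  assert(emulate_check_presence(all_zero) == 0);
  assert(emulate_check_presence(bad_arg6) == -EINVAL);
}
```

The non-zero value used here is only an example in the range reported by the debug printing.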

@rocallahan
Collaborator

Good hunch!

Hmm, sure looks like we're passing zero:

    int ret = syscall(SYS_rrcall_check_presence, 0, 0, 0, 0, 0, 0);

This is worrying.

If you do something like strace -o /tmp/output rr replay -a, you should get a line like

syscall_0x3f0(0, 0, 0, 0, 0, 0)         = -1 ENOSYS (Function not implemented)

Are all the parameters zero there?

@jebrosen

Interesting - indeed, running that strace command shows syscall_0x3f0(0, 0, 0, 0, 0, 0x55d500000000) = -1 ENOSYS (Function not implemented). 0x55d500000000 also looks like a typical value from the debug printing I was playing with, which were usually somewhere around 0x500000000000.

I checked the disassembly of running_under_rr to see where it was coming from:

  540d2d:	bf f0 03 00 00       	mov    edi,0x3f0
  540d32:	89 c6                	mov    esi,eax
  540d34:	89 c2                	mov    edx,eax
  540d36:	89 c1                	mov    ecx,eax
  540d38:	41 89 c0             	mov    r8d,eax
  540d3b:	41 89 c1             	mov    r9d,eax
  540d3e:	c7 04 24 00 00 00 00 	mov    DWORD PTR [rsp],0x0

...which looks right. But, the disassembly of syscall in my copy of libc-2.32.so reads a QWORD off the stack, not a DWORD:

   fad40:	f3 0f 1e fa          	endbr64 
   fad44:	48 89 f8             	mov    rax,rdi
   fad47:	48 89 f7             	mov    rdi,rsi
   fad4a:	48 89 d6             	mov    rsi,rdx
   fad4d:	48 89 ca             	mov    rdx,rcx
   fad50:	4d 89 c2             	mov    r10,r8
   fad53:	4d 89 c8             	mov    r8,r9
   fad56:	4c 8b 4c 24 08       	mov    r9,QWORD PTR [rsp+0x8]
   fad5b:	0f 05                	syscall 

Changing the 6th 0 to 0L accordingly in a few syscall()s fixes all of my failing tests except bad_ip, although I'm not sure whether this is correct on all architectures or whether it just happens to work on this machine.
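
The mismatch between the two disassemblies can be simulated in plain C without any syscall. The caller's `[rsp]` is the callee's `[rsp+0x8]` once the `call` has pushed the return address, so both instructions refer to the same 8-byte slot. The sketch below assumes a little-endian machine such as x86-64; the helper name and the stale value are made up for illustration, chosen so that the result matches the `0x55d500000000` seen in the strace output.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: models an 8-byte stack slot holding stale data,
 * a 4-byte (DWORD) store into its low half, and an 8-byte (QWORD) load.
 * Assumes little-endian byte order, as on x86-64. */
static uint64_t read_slot_after_dword_store(uint64_t stale, uint32_t value) {
  uint64_t slot = stale;
  memcpy(&slot, &value, sizeof value); /* overwrites only the low 4 bytes */
  return slot;                         /* upper 4 bytes are still stale */
}

void demo_stale_slot(void) {
  /* Storing a 32-bit zero leaves the stale upper half in place. */
  assert(read_slot_after_dword_store(0x55d5deadbeefULL, 0) ==
         0x55d500000000ULL);
}
```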

@rocallahan
Collaborator

Yeah OK. Zero is passed as 'int' by default, but the parameters are actually longs. It doesn't matter for the first five parameters since they're passed in registers and the generated code zero-extends them, but the sixth parameter is in memory so it blows up. Wow, that's a massive footgun I didn't know about.
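
One way to sidestep this footgun is to route such calls through a wrapper whose parameters are all declared `long`, so the compiler converts a literal `0` to full width before it reaches the variadic `syscall()`. This is a sketch only: the name `syscall6` is hypothetical, and this is not necessarily what the eventual fix in rr does.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper (not part of rr). Because every parameter has type
 * long, int arguments are converted at the call site, and the sixth
 * argument is written to the stack as a full-width long. */
static long syscall6(long nr, long a1, long a2, long a3, long a4, long a5,
                     long a6) {
  return syscall(nr, a1, a2, a3, a4, a5, a6);
}

void demo_syscall6(void) {
  /* getpid ignores its arguments, so this only shows the calling pattern. */
  assert(syscall6(SYS_getpid, 0, 0, 0, 0, 0, 0) == (long)getpid());
}
```

Writing the literal as `0L` at each call site has the same effect on this platform, but the typed wrapper does not rely on every caller remembering the suffix.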

@rocallahan
Collaborator

I can't get gcc to generate that code, interestingly enough.

@rocallahan
Collaborator

ah, but clang does.

@jebrosen

That did it - with gcc all tests are passing, even bad_ip-32. bad_ip exhibits a similar (non-?)clobbering issue in one part of a variable being inspected. I can open a separate issue for that one; it seems to have a slightly different cause, and only the -32 variant of that test is failing.

@rocallahan
Collaborator

6658bdd fixes the main bug here.

Please open separate issues for any remaining bugs.
