Test failed: System.Text.Tests.TranscodingStreamTests.Read_Span(bufferLength: 1) #47444
Tagging subscribers to this area: @tarekgh, @krwq
Issue Details
Run: runtime 20210125.57
Failed tests:
Error message:
|
No case conversion occurs as part of these tests. Is this failure reproducible? Are bits of memory randomly being flipped as we succumb to the slow, silent entropy of the universe? |
Having a morbid day, are we? 😄 |
This test runs in this configuration (this queue, at least) on many different machines (TestResults query: dcount_MachineName = 230). In the last 4 months it has failed twice, both times on the same box and in much the same way. Bad memory on ddvsotx2l280? cc @MattGal TestResults
|
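A rough way to put a number on "both on the same box" is sketched below. It assumes, purely for illustration, that failures are independent and equally likely to land on any machine, which ignores the fact that machines run different numbers of work items. The function name is hypothetical; the figures 230 and 2 come from the comment above.

```python
def prob_all_on_same_machine(n_machines: int, n_failures: int) -> float:
    """Chance that n_failures independent failures all land on one machine.

    ASSUMPTION: every failure is equally likely to hit each machine.
    The first failure can land anywhere; each later one must match it,
    which happens with probability 1 / n_machines.
    """
    if n_machines < 1 or n_failures < 1:
        raise ValueError("n_machines and n_failures must both be >= 1")
    return float(n_machines) ** (1 - n_failures)


# 230 machines and 2 failures are the figures quoted in the thread.
print(prob_all_on_same_machine(230, 2))
```

Under that assumption the result is 1/230, roughly 0.4%, so a coincidence is unlikely but not impossible.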
It might be interesting to query failure % of tests in general on ddvsotx2l280 compared to other machines with the same configuration and OS. |
For the record, the machine seems middle of the pack by rate of test failures or segfaults, using this query WorkItems So... who knows? Twice could be a coincidence. What is the chance that this single-bit error would be in roughly the same place in the buffer? Is this the only test that writes a large buffer and reads it back? @GrabYourPitchforks any other ideas? |
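To check whether a mismatch really is a single-bit error, and where in the buffer it sits, the expected and actual bytes can be XORed position by position. This is a minimal sketch, not code from the test itself; `find_bit_flips` is a hypothetical helper and the sample bytes are made up.

```python
from __future__ import annotations


def find_bit_flips(expected: bytes, actual: bytes) -> list[tuple[int, int]]:
    """Return (byte_offset, bit_index) for every bit that differs.

    bit_index 0 is the least significant bit of the byte.
    """
    if len(expected) != len(actual):
        raise ValueError("buffers must have the same length")
    flips = []
    for offset, (e, a) in enumerate(zip(expected, actual)):
        diff = e ^ a  # set bits mark the positions that differ
        bit = 0
        while diff:
            if diff & 1:
                flips.append((offset, bit))
            diff >>= 1
            bit += 1
    return flips


# Hypothetical data: ASCII 'a' (0x61) and 'A' (0x41) differ only in bit 5.
print(find_bit_flips(b"a", b"A"))  # [(0, 5)]
```

Comparing the offsets reported across several failures would show whether the flipped bit really lands in roughly the same place each time.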
Maybe @jkotas has an idea |
I think we should wait and see whether it fails on some other machine too. It could be bad memory on that specific machine. If we see it failing on more than one machine, it could be a dangling pointer bug or a GC hole bug. We would need a process dump taken as close to the failure point as possible to diagnose it. |
OK, let's check back in a month... |
Something easy to do, and that I don't mind doing if you pick the correlation id of a specific run for me, is to send hundreds of clones of just the work item you suspect is a problem using the Helix SDK. We'd get pretty useful data about the general flakiness of the test and would know from this whether to blame that one machine. |
@MattGal thanks for the offer. Can you extract the "correlation ID" from one of the results in the query above? |
Sure, can you confirm somewhere that this log is the exact one to clone a bunch of times? As part of the recent surge (I think you even contributed) we log the correlation id at the top of all these logs now like: |
Ah, I see, I didn't know what correlation ID is -- sounds like it's job ID. Yes, that's one of the ones with the problem. It would be interesting to loop that a zillion times on this machine and grab the console output of any failures. thank you! |
I was thinking a zillion times on all the machines, to isolate the "we think this machine has failing hardware" theory? Job Id is a long in the DB (in the case of the GUID above, "JobId" == 13132067) |
I ran this test 800x in a row across 274/278 of these machines (see JobId: 13167068 Correlation: 7082a89-a973-405f-af71-9babae5ac23e) It ended up passing 3x on that machine:
I am set up to be able to do this over and over at any scale we like, but I'm trying to keep it minimal as PRs still need to work. I may run a few hundred here and there today and add them to my checks. |
Machine's back in the pool. I ran stress-ng loads suggested by Dan both on the machine for 2 hours and inside the container the test failed on, no luck :( |
Failed again in runtime-coreclr libraries-jitstress 20210705.1
Failed tests:
Error message:
|
This is the failure on the same machine. I tried reproducing it on a local Linux/arm machine and it doesn't reproduce. |
Spitballing a few ideas. One option is that we try to log the physical memory address where this failure occurs. From the managed side it'd be easy enough to record the process's virtual memory address, but I don't know how to reliably map that back to a physical memory address short of having some system-level service already running on the box and using IPC. Another option is to query the machine name at the start of the test and to perform a different action if we're on the target box. That could include skipping the test, using absurdly-sized payloads to try to trigger the repro more rapidly, or whatever else we think appropriate. Risks here would be that: (a) there's an opportunity cost for doing this work if we already suspect a hardware problem and have already proposed a resolution of pulling the machine from rotation; and (b) we don't want to risk accidentally leaving this code checked in. A big-hammer option is to run memtest86+ on the physical box. 😈 |
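The machine-name idea above could look roughly like the following. It is written in Python only to illustrate the control flow; the real test is managed code, and the payload sizes, the byte pattern, and both function names are assumptions made up for this sketch. Only the machine name comes from the thread.

```python
import socket

SUSPECT_MACHINE = "ddvsotx2l280"   # machine name taken from this thread
NORMAL_PAYLOAD_BYTES = 1 << 20     # ASSUMPTION: 1 MiB on ordinary machines
STRESS_PAYLOAD_BYTES = 1 << 28     # ASSUMPTION: 256 MiB on the suspect box


def choose_payload_size(hostname=None):
    """Pick a larger payload when running on the suspect machine."""
    if hostname is None:
        hostname = socket.gethostname()
    # ASSUMPTION: compare only the short host name, ignoring case.
    short_name = hostname.split(".")[0].lower()
    if short_name == SUSPECT_MACHINE:
        return STRESS_PAYLOAD_BYTES
    return NORMAL_PAYLOAD_BYTES


def first_mismatch(size, pattern=0xA5):
    """Fill a buffer with one byte value and read it back.

    Returns the offset of the first byte that no longer matches,
    or None if the whole buffer reads back intact.
    """
    buffer = bytearray([pattern]) * size
    if buffer.count(pattern) == size:  # fast path: nothing changed
        return None
    for offset, value in enumerate(buffer):
        if value != pattern:
            return offset
    return None


if __name__ == "__main__":
    size = choose_payload_size()
    print(size, first_mismatch(size))
```

Risk (b) from the comment above applies directly: a hard-coded machine name like this should not stay checked in.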
This is a low-hit, non-deterministic issue, so I am inclined to move it out of .NET 6.0. Let me know if anyone thinks that it should be investigated in .NET 6. |
@MattGal can we ask for someone to run memtest? It's hard to understand otherwise why it's the same box. |
I took the machine offline and will file a ticket to get DDFUN to try this out. Due to my only access being over KVM, I can't seem to get to the boot loader screen to try memtest. Someone with physical access may have to attempt this. |
I ran the query above again, and although this test has run on 34 distinct machines in the last 120 days, the only two failures (new ones since last time) have both been on ddvsotx2l280. Both failures had the same pattern. I'm guessing this test is one of the most stressful on memory, but I don't know off the top of my head how to create a query like "which tests disproportionately fail on ddvsotx2l280". If that threw up other memory-intensive tests, it would be more evidence.
TestResults
| join kind=inner WorkItems on WorkItemId
| join kind=inner Jobs on JobId
| where Finished >= now(-365d)
| where Type == "System.Text.Tests.TranscodingStreamTests"
| where Method == "Read_Span"
| where Result == 'Fail'
| project Started, Duration, Type, Method, FriendlyName, MachineName, QueueName1, Message, StackTrace, Arguments, Branch
|
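One way to approach the "which tests disproportionately fail on ddvsotx2l280" question offline, once results have been exported from a query like the one above, is to compare each test's failure rate on the target machine with its rate everywhere else. This is a sketch under assumptions: the `(test, machine, result)` record shape and both function names are hypothetical, and the sample rows are invented.

```python
from collections import defaultdict


def failure_rates(records, target_machine):
    """Map test name -> (rate on target machine, rate on other machines).

    records: iterable of (test_name, machine_name, result) tuples, where
    result is "Fail" for a failure (matching Result == 'Fail' above).
    A rate is None when the test never ran in that group.
    """
    # Per test: [target_fails, target_runs, other_fails, other_runs]
    tally = defaultdict(lambda: [0, 0, 0, 0])
    for test, machine, result in records:
        base = 0 if machine == target_machine else 2
        tally[test][base + 1] += 1
        if result == "Fail":
            tally[test][base] += 1
    rates = {}
    for test, (t_fail, t_runs, o_fail, o_runs) in tally.items():
        rates[test] = (
            t_fail / t_runs if t_runs else None,
            o_fail / o_runs if o_runs else None,
        )
    return rates


def disproportionate_tests(records, target_machine):
    """Tests whose failure rate on the target exceeds the rate elsewhere.

    A test that never ran elsewhere is compared against a rate of 0.
    """
    flagged = []
    for test, (on_target, elsewhere) in failure_rates(records, target_machine).items():
        if on_target is not None and on_target > (elsewhere or 0.0):
            flagged.append(test)
    return sorted(flagged)


# Invented sample rows, only to show the shape of the input.
sample = [
    ("Read_Span", "ddvsotx2l280", "Fail"),
    ("Read_Span", "ddvsotx2l280", "Pass"),
    ("Read_Span", "other-box", "Pass"),
    ("Other_Test", "ddvsotx2l280", "Pass"),
    ("Other_Test", "other-box", "Fail"),
]
print(disproportionate_tests(sample, "ddvsotx2l280"))  # ['Read_Span']
```

With only a handful of failures per test the rates are noisy, so any test this flags is a lead to look at rather than evidence on its own.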
Assigning to you @MattGal - Feel free to redirect to me once we get the result of memtest. |
@MattGal, any update with memtest? |
@parose1 responded a few days ago on the issue linked. This is failing ~1 time every 2 months, so it will be a long time before we have more data. I suggest either closing this (optimistically) or moving it out of 6.0 |
The query above shows no new failures since 7/6, so I suggest closing optimistically. I still think it's important to re-flash the device, or whatever the equivalent is. |
I agree with @danmoseley . |
Thanks @danmoseley and @kunalspathak. |
It didn't seem to exist for this specific hardware setup, so it got reimaged instead. |
Let's meet back here in 2 months and see whether this test failed again and if so on which machine. |
Hopefully, @MattGal, the machine name didn't change? If it did, what is it now? |