Test failed: System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout #13610
@ahsonkhan, it should have uploaded a dump from the remote process: |
Previously this was blocked by https://github.com/dotnet/core-eng/issues/7950 (https://github.com/dotnet/corefx/issues/41528) .. if dumps work now that's awesome. |
@danmosemsft, the one with the dump is Windows. |
@stephentoub ah, I just saw the RedHat above. |
It looks like https://github.com/dotnet/core-eng/issues/7950 was recently closed, so I would hope to get a dump from that one too. |
The dump that's relevant here is from the RemoteExecutor.Invoke child process; on Windows we now explicitly P/Invoke to MiniDumpWriteDump in order to save out a dump. That doesn't happen on Linux. |
Should we have an issue to find a way to do it on Linux? Presumably deploying and invoking createdump? |
Sure :) |
No, sorry. Will remember to do that next time. |
updated link above (edited above post as the link was wrong)
|
cc @jkotas |
And for learning purposes, to find the uploaded files (i.e. after a retry) you can use the Helix API directly: https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/files which lists all uploaded files. In this case I searched for the ".dmp" file and appended its name to the URL: "/4708.wrrjax4w.c3b.dmp". |
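Editorial sketch of the lookup described above: given the names returned by the work item's files listing, pick out the dumps and append each name to the files URL. The thread does not show the response format of the Helix API, so the sketch takes the names as a plain list; `dump_urls` and the `console.log` entry are made up for illustration, and only the `.dmp` name comes from the comment.

```python
from urllib.parse import quote

FILES_URL = (
    "https://helix.dot.net/api/2019-06-17/jobs/"
    "421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/"
    "System.Text.RegularExpressions.Tests/files"
)


def dump_urls(files_url, file_names):
    """Return a direct URL for every .dmp entry in a work item file listing."""
    base = files_url.rstrip("/")
    return [
        f"{base}/{quote(name)}"  # percent-encode in case a name has odd characters
        for name in file_names
        if name.lower().endswith(".dmp")
    ]


# Hypothetical listing: only the .dmp name is taken from the thread.
names = ["console.log", "4708.wrrjax4w.c3b.dmp"]
for url in dump_urls(FILES_URL, names):
    print(url)
```

Fetching the listing itself is left out on purpose, since how the API shapes its response is an assumption this sketch avoids making.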
Thanks, @ViktorHofer. This is happening early on in the invocation of the remote process, as we're looking up the method to invoke: |
This seems to be stuck on a page-in request in the OS. I do not see anything wrong on the runtime side. The GC is trying to clear memory from 0x000001c600011398 to 0x000001c6000139d8. The clearing is stuck at address 0x000001c600013000 (i.e. a page boundary). We need to get more dumps for this to see the pattern. |
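Editorial note: the page-boundary observation can be checked with a few lines of arithmetic. This is a sketch that assumes 4 KiB pages (the usual size on x64 Windows, but not stated in the thread); the three addresses are the ones quoted in the comment.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages

start = 0x000001C600011398  # start of the range the GC is clearing
end = 0x000001C6000139D8    # end of the range
stuck = 0x000001C600013000  # address where the clearing is stuck


def page_base(addr):
    """Round an address down to the start of its page."""
    return addr & ~(PAGE_SIZE - 1)


length = end - start
# Every page the range touches, in the order a forward clear would reach them.
pages = list(range(page_base(start), page_base(end - 1) + 1, PAGE_SIZE))

print(f"range length: {length} bytes")
print(f"stuck address is page-aligned: {stuck % PAGE_SIZE == 0}")
print(f"stuck address is page #{pages.index(stuck) + 1} of {len(pages)} touched")
```

Under that assumption the range spans three pages and the stuck address is the first byte of the last one, which is consistent with the clear stalling the moment it touches a page that has to be paged in.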
From just looking at this dump, my bet would be on an infrastructure problem: the physical machine is over-subscribed and super slow. |
Note that VS is hiding the part of the stack in coreclr.dll. We are in the middle of GC that happened to be triggered while we're looking up the method to invoke. |
Right, understood. I was pasting the image from VS to show the managed frames that were obscured in the windbg stack trace as well as the arguments to those methods. |
Odd that we've had 3 hangs (in remote executor in Regex cache tests). I'd have thought a slow machine might give a more distributed set of hangs.. |
They can each have a different root cause. Do you happen to have dumps for the other 2 hangs? |
We don't have more dumps yet, and we only produce them on Windows for these kinds of failures. |
This failed again just now on Windows, here's the dump: https://helix.dot.net/api/2019-06-17/jobs/f20c2934-638e-4637-809b-b960d0287af0/workitems/System.Text.RegularExpressions.Tests/files/6112.xqszggzs.qit.dmp |
We are in the middle of JITing So the common theme is:
Keep more dumps coming... |
Another hit in dotnet/coreclr#27375 . System.Transactions tests. The process is stuck early during startup inside The full stress log is here: |
The interesting bits of the stress log are:
|
|
This confirms my theory that the machine is oversubscribed when this hang/timeout happens. We need to find out what else is going on on the machine that is taking all the resources. |
@jkotas I've seen we are hitting this again in some PRs. What thread count and parallelism would you recommend setting if memory is < 1GB? |
Does There is still a chance that we find the machine to be in fine state when the test is starting, and then the expensive background services kick in while the test is running. It may be better to throttle this in the xunit itself to be more agile. It makes me think: We have seen these hangs with remote executor. Maybe we should add the throttling to the remote executor:
|
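Editorial sketch: the code that accompanied the throttling suggestion above is not preserved in this copy of the thread, and nothing below reconstructs it. As a rough, language-neutral illustration of the general idea only (cap how many child processes are in flight at once), a counting semaphore is enough. `LaunchThrottle`, `run_throttled`, and the 60-second timeout are all hypothetical names and values, written in Python rather than in RemoteExecutor's own C#.

```python
import subprocess
import threading
from contextlib import contextmanager


class LaunchThrottle:
    """Hypothetical helper: allow at most max_concurrent launches at a time."""

    def __init__(self, max_concurrent):
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.active = 0  # launches currently holding a slot
        self.peak = 0    # highest value 'active' has reached

    @contextmanager
    def slot(self):
        """Block until a slot is free, hold it for the body, then release it."""
        self._slots.acquire()
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        try:
            yield
        finally:
            with self._lock:
                self.active -= 1
            self._slots.release()


def run_throttled(throttle, argv, timeout_s=60):
    """Run one child process while holding a throttle slot."""
    with throttle.slot():
        return subprocess.run(argv, timeout=timeout_s, check=False)
```

A single static throttle in the test process only helps if one process does all the launching. How the limit should be chosen (fixed, or derived from available memory) is left open here, as it is in the thread.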
@jkotas sounds good. Do you have cycles to submit a PR with the proposed changes? |
Sorry, I do not have cycles to deal with this.
Do you have the list of recent PRs that are hitting this? The problem may have shifted somewhat compared to what was identified earlier in this issue. The one that I have seen recently (#35451) has Memory load: 14 that is pretty reasonable. The signature of the problem identified earlier in this issue was very high memory load (like Memory load: 98). |
I think it's reasonable to try to address this by throttling the number of concurrent RemoteExecutor processes when resources are low. @jkotas's suggestion above seems like a good starting point. @ViktorHofer / @jkotas are we only running a single test process, such that handling in the static RemoteExecutor would be sufficient, or do we need to consider multiple processes launching RemoteExecutors on the machine? We'd want to measure the prevalence of this problem before and after a change. Here's the command I ran to measure: |
Yes, handling this in the static RemoteExecutor should be sufficient. However, I am not convinced that this will address the problem we are seeing recently. If the problem was really caused by an oversubscribed machine, we would see crashes with a variety of stacks, but that's not the case. We are always seeing crash with Trying to gather more data about this failure like @noahfalk suggested in #35451 (comment) may be a better course of action. |
I see what you mean. All these logs show the following callstack hanging:
@noahfalk do you think the number of occurrences we are seeing here indicate that progress is not being made? |
The frequency of this issue seems too high to be an existing issue. I suspect a recent regression. Also, it seems to be observed in WinForms as well, starting around 4/20 when they took a runtime update from 4.20218.3 to 4.20220.15. So we should look in this time window for a commit that could have introduced this hang. |
If we are seeing many occurrences of that stack, I agree it raises suspicion that something else is going on. In terms of progress, I don't yet have a good explanation and we've got contradictory evidence. On the one hand, the dump doesn't indicate that we are waiting on anything other than the OS scheduler to assign a core; however, if that were truly the only issue, then I wouldn't expect to see this particular callstack at high volume. Right now I don't think we've got the data to draw a conclusion in either direction. Additional dumps after a time delay and/or an ETW trace could provide that evidence. |
Based on this info, we were able to track down that the SHAs in between those two package versions are: c409bd0 and 44ed52a. Then we looked at Kusto data to find the first build that hit this issue, and it was: https://dev.azure.com/dnceng/public/_build/results?buildId=608569&view=results -- which used 629dba5 to merge with master to check out the repo for the build. The builds started failing on April 21st. So based on this I can only think of 2 commits in question: 629dba5 and aa5b204 |
The diagnostic server PR 629dba5 is the obvious culprit. It is closely coupled with the code where this hang occurs. It is likely that the diagnostic server eventpipe code is doing something bad during shutdown, and that corruption is leading to this hang. We had similar intermittent shutdown bugs in eventpipe before, so it is not that surprising that a new one popped up after a significant eventpipe change. Let's use this issue to track the original problem with machine oversubscription, and reactivate #35451 to track the hang in cc @josalem |
Indeed, thanks for narrowing to a few possible commits. I still have little idea what exactly the server change would do to produce the particular result we saw in #35451, but it narrows the context enough that we can investigate. |
Triage: will need to trigger runfo to get current data to decide if this issue should be closed or further action is necessary. |
@ViktorHofer I think there are a few different issues being discussed in this thread. The last one in the thread was resolved with the same fixes as in #35451, I believe. I'm not sure about the other issues discussed here though. |
@safern is this test now stable? (Also is there info on the OneNote to remind me how to make such queries myself) If the test is stable we can close this issue. |
I think we could still hit this issue; it just depends on the available memory at test execution, but we don't hit it that often.
I don't think so, we can add it. |
Runfo doesn't list any failures in the last 14 days for the described remote executor hang: https://runfo.azurewebsites.net/search/tests/?bq=definition%3Aruntime++started%3A%7E14&tq=name%3A%22after+60000ms+waiting+for+remote+process%22. Feel free to open an issue to follow up on improvements to RemoteExecutor or the test runner. |
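Editorial note, for anyone wanting to adapt the query above: the link is easier to read once its query string is decoded. A minimal sketch using only the Python standard library follows; `decode_query` is a made-up helper, and reading `bq` and `tq` as the build filter and the test filter is an assumption, not something the thread states.

```python
from urllib.parse import parse_qs, urlsplit

RUNFO_URL = (
    "https://runfo.azurewebsites.net/search/tests/"
    "?bq=definition%3Aruntime++started%3A%7E14"
    "&tq=name%3A%22after+60000ms+waiting+for+remote+process%22"
)


def decode_query(url):
    """Return the query parameters of a URL as a dict of decoded strings."""
    params = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in params.items()}


filters = decode_query(RUNFO_URL)
print("bq:", filters["bq"])
print("tq:", filters["tq"])
```

Decoded, `bq` reads `definition:runtime  started:~14` and `tq` reads `name:"after 60000ms waiting for remote process"`, so the search matches tests whose output contains the remote process timeout message.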
From dotnet/corefx#41753
System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries
netcoreapp-Windows_NT-Debug-x64-(Windows.Nano.1809.Amd64.Open)
https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213282&resultId=145033&paneView=debug
https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/console
System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw
netcoreapp-Linux-Debug-x64-RedHat.7.Amd64.Open
https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213340&paneView=debug
https://helix.dot.net/api/2019-06-17/jobs/82ca2546-9df2-4b32-ab1f-d00835fcbe01/workitems/System.Text.RegularExpressions.Tests/console
cc @ViktorHofer