Local retry behavior keeps previous run's (now duplicate) entries #10916
Comments
Created https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/25935 to change the += to an =.
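For readers landing here first, a minimal sketch of what that one-line change amounts to, using hypothetical names rather than the actual dotnet-helix-machines code:

```python
def report_attempts(attempts, publish):
    """attempts: per-run result lists, oldest first; publish: callable that uploads
    a batch to AzDO. Both names are illustrative, not the real module's API."""
    results_to_publish = []
    for attempt_results in attempts:
        # Proposed change: assign instead of accumulate, so a retry's upload does
        # not re-send the previous attempt's entries.
        results_to_publish = attempt_results   # was: results_to_publish += attempt_results
        publish(results_to_publish)
```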
This is definitely intentional. What bad behavior are you seeing?
We are gathering up all the runs so that we can analyze the set of them together and see, for any given test, whether it passed in at least one of the result sets. If you remove the previous results from the set, flaky test analysis will break.
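Roughly, that analysis depends on a pass-in-at-least-one-attempt check over the gathered result sets; a sketch of that check under assumed names (not the actual Helix analysis code):

```python
from collections import defaultdict

def flaky_passes(result_sets):
    """result_sets: one dict per attempt, mapping test name -> 'Passed' or 'Failed'."""
    outcomes = defaultdict(set)
    for results in result_sets:              # the original run plus every local retry
        for test, outcome in results.items():
            outcomes[test].add(outcome)
    # A test is flagged as flaky-but-passing if it failed in some attempt
    # yet passed in at least one other.
    return {test for test, seen in outcomes.items() if {"Passed", "Failed"} <= seen}
```

Dropping the earlier attempts from `result_sets` would leave only the final outcome for each test, which is why removing previous results breaks the analysis.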
Let me double check the code path to make sure it's doing what I think it's doing. But this code has worked correctly for years, so I'm skeptical it's this fundamentally incorrect.
I double checked. The code is definitely doing what it was intended to do; it is creating a
It's possible they just broke reporting of sub-tests with the flaky attributes on them, since these would be the only tests that do that. We are using a bunch of undocumented behaviors there (in particular, we add "IsTestResultFlaky:True" and "AttemptId:1" to those tests, and set the result_group_type to "rerun"), which is pretty unusual (but it's the only way I could make the AzDO UI do what I wanted). So it sounds like they just broke flaky tests with their latest change. Or, if they want us to report them in some other way, we can do that too, so that it gets the little "flaky" designator in AzDO.
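To make those markers concrete, here is a sketch of how such a rerun sub-result might be tagged. Only the three values quoted above (IsTestResultFlaky, AttemptId, and the "rerun" result group type) come from this thread; the surrounding payload shape is an assumption, not the documented AzDO schema.

```python
# Hypothetical rerun sub-result payload; the field layout is illustrative only.
flaky_subresult = {
    "testCaseTitle": "SomeNamespace.SomeFlakyTest",  # hypothetical test name
    "outcome": "Passed",
    "resultGroupType": "rerun",                      # the undocumented grouping we set
    "properties": {
        "IsTestResultFlaky": "True",                 # markers the AzDO UI keys off of
        "AttemptId": "1",
    },
}
```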
I'm considering a change to the python module that will use the... mutable... nature of python to save the request body if we get a failed response from AzDO. They are blaming us for doing something wrong, but I don't believe that's actually happening, so I need access to one of the requests we are sending.
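The idea, sketched with hypothetical names (this is not the actual module code): wrap whatever callable sends the request, and when AzDO answers with a server error, write the outgoing body somewhere it can be inspected later.

```python
def capture_body_on_failure(send_fn, capture_path="/tmp/failed_azdo_request.json"):
    """Wrap a request-sending callable and save the outgoing body when AzDO
    returns a server error. Signature and path are assumptions for illustration."""
    def wrapped(body):
        response = send_fn(body)
        if getattr(response, "status_code", 200) >= 500:
            with open(capture_path, "w") as f:
                f.write(body)  # assumes the body is already a string at this point
        return response
    return wrapped

# Because Python module attributes are mutable, the client's sender can be swapped
# in place, e.g.:  client.send = capture_body_on_failure(client.send)
```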
I pushed a PR that should save the body of the requests when we get this Exception: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/26163 Hopefully that will gather some data we can use to figure out what's going on.
I tried catching the broken body, but failed because it wasn't a string yet, so I needed to add another PR to capture it correctly: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/26555
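In the same hypothetical sketch, the follow-up fix amounts to serializing the body explicitly before writing it, rather than assuming it is already a string:

```python
import json

def capture_body_on_failure(send_fn, capture_path="/tmp/failed_azdo_request.json"):
    def wrapped(body):
        response = send_fn(body)
        if getattr(response, "status_code", 200) >= 500:
            # The body may still be a dict or list at this point, so serialize it
            # explicitly rather than writing it raw.
            with open(capture_path, "w") as f:
                f.write(body if isinstance(body, str) else json.dumps(body))
        return response
    return wrapped
```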
Once we have the broken body, we can go back to the IcM and show them what we uploaded.
Followed up with Chad on next steps:
- Look up workitems that hit this error with this query:
- Check the logs for
Provided the IcM with logs to show what we're uploading to AzDO, and verified that the tests in that log only contained
AzDO replied on the IcM yesterday: "We have synced the configurations of 'dnceng' org to 'dnceng-public' org, will monitor if this solves the issue or not." We'll have to verify if that actually fixed things or not.
According to our Kusto log, the last instance of the "too many 500" error occurred on 2022-10-19T08:34:32.98Z. If it starts cropping up again, we can reopen this issue.
So it was something on their end the whole time?
It seems that way. Their last message was that they were syncing the config between the 'dnceng' and 'dnceng-public' orgs.
Related to https://portal.microsofticm.com/imp/v3/incidents/details/335170304/home
I first noticed that the problem was, suspiciously, limited to a relatively small number of flaky tests (14 hit this in the past 2 days, out of about 1M work items).
I observed this only happens in tests where a local retry had occurred.
The code is basically:
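A sketch of the pattern being described, with hypothetical names standing in for the real reporting code:

```python
def report_attempts(attempts, publish):
    """attempts: list of result lists, one per local (re)run, oldest first;
    publish: callable that uploads a batch. Names are illustrative only."""
    accumulated = []
    for attempt_results in attempts:
        # On a local retry this keeps the previous run's (now duplicate) entries
        # in every subsequent upload; the one-liner proposed in the first comment
        # above is to use "=" instead of "+=" here.
        accumulated += attempt_results
        publish(accumulated)
```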
I think this is a one-liner fix and will make a PR, but will absolutely wait for @ChadNedzlek's approval before merging. As he's on FR this week, I'm preemptively assigning this to him.