Local retry behavior keeps previous run's (now duplicate) entries #10916
Comments
Created https://dnceng.visualstudio.com/internal/_git/dotnet-helix-machines/pullrequest/25935 to change the += to an =.
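For readers landing here first, a minimal sketch of what that one-line change amounts to, using hypothetical names rather than the actual dotnet-helix-machines code:

```python
def report_attempts(attempts, publish):
    """attempts: per-run result lists, oldest first; publish: callable that uploads
    a batch to AzDO. Both names are illustrative, not the real module's API."""
    results_to_publish = []
    for attempt_results in attempts:
        # Proposed change: assign instead of accumulate, so a retry's upload does
        # not re-send the previous attempt's entries.
        results_to_publish = attempt_results   # was: results_to_publish += attempt_results
        publish(results_to_publish)
```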
This is definitely intentional. What bad behavior are you seeing?
We are gathering up all the runs so that we can analyze the set of them together and see, for any given test, whether it passed in at least one of the result sets. If you remove the previous results from the set, flaky test analysis will break.
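Roughly, that analysis depends on a pass-in-at-least-one-attempt check over the gathered result sets; a sketch of that check under assumed names (not the actual Helix analysis code):

```python
from collections import defaultdict

def flaky_passes(result_sets):
    """result_sets: one dict per attempt, mapping test name -> 'Passed' or 'Failed'."""
    outcomes = defaultdict(set)
    for results in result_sets:              # the original run plus every local retry
        for test, outcome in results.items():
            outcomes[test].add(outcome)
    # A test is flagged as flaky-but-passing if it failed in some attempt
    # yet passed in at least one other.
    return {test for test, seen in outcomes.items() if {"Passed", "Failed"} <= seen}
```

Dropping the earlier attempts from `result_sets` would leave only the final outcome for each test, which is why removing previous results breaks the analysis.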
Let me double check the code path to make sure it's doing what I think it's doing. But this code has worked correctly for years, so I'm skeptical it's this fundamentally incorrect.
I double checked. The code is definitely doing what it was intended to do; it is creating a
It's possible they just broke reporting of sub-tests with the flaky attributes on them, since these would be the only tests that do that. We are using a bunch of undocumented behaviors there (in particular, we add "IsTestResultFlaky:True" and "AttemptId:1" to those tests, and set the result_group_type to "rerun"), which is pretty unusual (but it's the only way I could make the AzDO UI do what I wanted). So it sounds like they just broke flaky tests with their latest change. Or, if they want us to report them in some other way, we can do that too, so that it gets the little "flaky" designator in AzDO.
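To make those markers concrete, here is a sketch of how such a rerun sub-result might be tagged. Only the three values quoted above (IsTestResultFlaky, AttemptId, and the "rerun" result group type) come from this thread; the surrounding payload shape is an assumption, not the documented AzDO schema.

```python
# Hypothetical rerun sub-result payload; the field layout is illustrative only.
flaky_subresult = {
    "testCaseTitle": "SomeNamespace.SomeFlakyTest",  # hypothetical test name
    "outcome": "Passed",
    "resultGroupType": "rerun",                      # the undocumented grouping we set
    "properties": {
        "IsTestResultFlaky": "True",                 # markers the AzDO UI keys off of
        "AttemptId": "1",
    },
}
```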
I'm considering a change to the python module that will use the... mutable... nature of python to save the request body if we get a failed response from AzDO. They are blaming us for doing something wrong, but I don't believe that's actually happening, so I need access to one of the requests we are sending.
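The idea, sketched with hypothetical names (this is not the actual module code): wrap whatever callable sends the request, and when AzDO answers with a server error, write the outgoing body somewhere it can be inspected later.

```python
def capture_body_on_failure(send_fn, capture_path="/tmp/failed_azdo_request.json"):
    """Wrap a request-sending callable and save the outgoing body when AzDO
    returns a server error. Signature and path are assumptions for illustration."""
    def wrapped(body):
        response = send_fn(body)
        if getattr(response, "status_code", 200) >= 500:
            with open(capture_path, "w") as f:
                f.write(body)  # assumes the body is already a string at this point
        return response
    return wrapped

# Because Python module attributes are mutable, the client's sender can be swapped
# in place, e.g.:  client.send = capture_body_on_failure(client.send)
```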
I pushed a PR that should save the body of the requests when we get this Exception: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/26163 Hopefully that will gather some data we can use to figure out what's going on.
I tried catching the broken body, but failed because it wasn't a string yet, so I needed to add another PR to capture it correctly: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/26555
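In the same hypothetical sketch, the follow-up fix amounts to serializing the body explicitly before writing it, rather than assuming it is already a string:

```python
import json

def capture_body_on_failure(send_fn, capture_path="/tmp/failed_azdo_request.json"):
    def wrapped(body):
        response = send_fn(body)
        if getattr(response, "status_code", 200) >= 500:
            # The body may still be a dict or list at this point, so serialize it
            # explicitly rather than writing it raw.
            with open(capture_path, "w") as f:
                f.write(body if isinstance(body, str) else json.dumps(body))
        return response
    return wrapped
```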
Once we have the broken body, we can go back to the IcM and show them what we uploaded.
Followed up with Chad on next steps:
- Look up workitems that hit this error with this query:
- Check the logs for
Provided the IcM with logs to show what we're uploading to AzDO, and verified that the tests in that log only contained
AzDO replied on the IcM yesterday: "We have synced the configurations of 'dnceng' org to 'dnceng-public' org, will monitor if this solves the issue or not." We'll have to verify if that actually fixed things or not.
According to our Kusto log, the last instance of the "too many 500" error occurred on 2022-10-19T08:34:32.98Z. If it starts cropping up again, we can reopen this issue.
So it was something on their end the whole time?
It seems that way. Their last message was that they were syncing the config between the 'dnceng' and 'dnceng-public' orgs.
Related to https://portal.microsofticm.com/imp/v3/incidents/details/335170304/home
I first noticed that the problem was, suspiciously, limited to a relatively small number of flaky tests (14 hit this in the past 2 days, out of about 1M work items).
I observed this only happens in tests where a local retry had occurred.
The code is basically:
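A sketch of the pattern being described, with hypothetical names standing in for the real reporting code:

```python
def report_attempts(attempts, publish):
    """attempts: list of result lists, one per local (re)run, oldest first;
    publish: callable that uploads a batch. Names are illustrative only."""
    accumulated = []
    for attempt_results in attempts:
        # On a local retry this keeps the previous run's (now duplicate) entries
        # in every subsequent upload; the one-liner proposed in the first comment
        # above is to use "=" instead of "+=" here.
        accumulated += attempt_results
        publish(accumulated)
```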
I think this is a one-liner fix and will make a PR, but will absolutely wait for @ChadNedzlek's approval before merging. As he's on FR this week, I'm preemptively assigning this to him.