Test failure System.Net.WebSockets.Tests.WebSocketDeflateTests.PayloadShouldHaveSimilarSizeWhenSplitIntoSegments #52031

Closed
VincentBu opened this issue Apr 29, 2021 · 40 comments · Fixed by #52052
Labels: arch-arm32, area-System.Net, os-linux (Linux OS, any supported distro), test-run-core (Test failures in .NET Core test runs)

@VincentBu
Contributor

Run: runtime 20210428.85

Failed test:

net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.313.Arm32.Open)[email protected]/dotnet-buildtools/prereqs:alpine-3.13-helix-arm32v7-20210414141857-1ea6b0a
 -System.Net.WebSockets.Tests.WebSocketDeflateTests.PayloadShouldHaveSimilarSizeWhenSplitIntoSegments(windowBits: 15)

Error message:

System.Threading.Tasks.TaskCanceledException : A task was canceled.


Stack trace
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, CancellationToken cancellationToken) in /_/src/libraries/System.Net.WebSockets/src/System/Net/WebSockets/ManagedWebSocket.cs:line 564
   at System.Net.WebSockets.Tests.WebSocketDeflateTests.PayloadShouldHaveSimilarSizeWhenSplitIntoSegments(Int32 windowBits) in /_/src/libraries/System.Net.WebSockets/tests/WebSocketDeflateTests.cs:line 444
--- End of stack trace from previous location ---
@VincentBu VincentBu added arch-arm32 area-System.Net os-linux Linux OS (any supported distro) labels Apr 29, 2021
@ghost

ghost commented Apr 29, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.


@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 29, 2021
@zlatanov
Contributor

zlatanov commented Apr 29, 2021

To me this issue seems unrelated to the reverted "fix ber scanf/printf" change. For some reason unknown to me, on the ARM architecture this test takes more than 5 seconds to complete.

I created this test to check that deflate context takeover works as expected, but it makes somewhat speculative assumptions about the compression algorithm. We can either increase the timeout (or even remove it entirely and rely on xunit cancelling the test if something hangs), or delete the test: if the underlying compression ratio ever changes, the test will become outdated and can start to fail.
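To make "context takeover" concrete, here is a rough, hypothetical sketch using DeflateStream rather than the WebSocket internals (the frame size and loop count are illustrative, not the test's actual values): one shared deflate state across frames compresses repeated data far better than a fresh state per frame.

// Rough sketch (hypothetical, using DeflateStream rather than the WebSocket code):
// "context takeover" means the deflate state is shared across frames, so later
// frames can back-reference earlier ones and compress much better.
using System;
using System.IO;
using System.IO.Compression;

var random = new Random(0);
byte[] frame = new byte[16 * 1024];
for (int i = 0; i < frame.Length; ++i)
{
    frame[i] = (byte)random.Next(10); // low-entropy, compressible data
}

// With takeover: one deflate state shared across all frames.
using var shared = new MemoryStream();
using (var deflate = new DeflateStream(shared, CompressionLevel.Optimal, leaveOpen: true))
{
    for (int i = 0; i < 10; ++i)
    {
        deflate.Write(frame, 0, frame.Length);
    }
}

// Without takeover: a fresh deflate state per frame.
long perFrameTotal = 0;
for (int i = 0; i < 10; ++i)
{
    using var output = new MemoryStream();
    using (var deflate = new DeflateStream(output, CompressionLevel.Optimal, leaveOpen: true))
    {
        deflate.Write(frame, 0, frame.Length);
    }
    perFrameTotal += output.Length;
}

Console.WriteLine($"shared state: {shared.Length} bytes, fresh state per frame: {perFrameTotal} bytes");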

//cc @CarnaViire

@CarnaViire
Member

Timeouts like that tend to be unreliable in our CI; if the test is executed on a heavily loaded machine, it may take longer than expected... That being said, we should first gather statistics on how common this test failure is, or whether it was a one-off. I'll search for the stats.

Btw, how could the algorithm change so much that the aggregate of the compressed parts of a message would be much bigger/smaller than the compressed whole message? Wouldn't that be a different/new algorithm then, not DEFLATE as described in the RFC?

@zlatanov
Contributor

Btw, how could the algorithm change so much that the aggregate of the compressed parts of a message would be much bigger/smaller than the compressed whole message? Wouldn't that be a different/new algorithm then, not DEFLATE as described in the RFC?

DEFLATE only describes the structure/format of the data. The algorithm is an implementation detail that might change (for example zlib-intel vs. classic zlib), as might memory constraints and performance optimizations that trade compression ratio for speed.
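A small illustrative sketch of that trade-off (not from this thread; CompressionLevel.SmallestSize requires .NET 6): every output below is valid DEFLATE, but the size depends on how much effort the implementation spends.

// Illustrative sketch: the output is DEFLATE in every case, but its size depends
// on the compression settings the implementation chooses.
using System;
using System.IO;
using System.IO.Compression;

var random = new Random(0);
byte[] payload = new byte[256 * 1024];
for (int i = 0; i < payload.Length; ++i)
{
    payload[i] = (byte)random.Next(10); // compressible, low-entropy data
}

foreach (var level in new[] { CompressionLevel.Fastest, CompressionLevel.Optimal, CompressionLevel.SmallestSize })
{
    using var output = new MemoryStream();
    using (var deflate = new DeflateStream(output, level, leaveOpen: true))
    {
        deflate.Write(payload, 0, payload.Length);
    }
    Console.WriteLine($"{level}: {output.Length} bytes");
}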

@CarnaViire
Member

CarnaViire commented Apr 29, 2021

I see that it is failing quite regularly (53 times in 2 days on main) for 14 and 15 window bits for

alpine.313.arm32
alpine.312.arm64
ubuntu.1804.arm32
ubuntu.1804.arm64

I'm ok with removing the test as we cannot depend on it from the algorithm's point of view, but... Interesting that it is not failing but hanging, and only on ARM. I'm interested in what would happen if we just removed our timeout: whether it would complete after some more time, or keep hanging until killed as a LongRunningTest 🤔

@zlatanov
Contributor

I'm interested in what would happen if we just removed our timeout: whether it would complete after some more time, or keep hanging until killed as a LongRunningTest

There is only one way to find out 😎

However, I have a nagging suspicion that there is some problem with Random on ARM.
There is nothing else in this test that differs from the other deflate tests. Some of them should actually be more expensive than this one.

Is there any way I could run a test on ARM, or is the only way to create a draft PR that I would delete later?

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 29, 2021
@zlatanov
Contributor

@CarnaViire I created a PR; now we wait to see the run times for the test. If I am correct, the fix should work. If, however, you feel we should remove the test, let me know.

@CarnaViire
Member

Would it defeat the purpose of the test to make the message smaller, e.g. frame x5, not x10?
I also wanted to ask about NextBytes, but you've done that already 🙂

And as for running a test on ARM, I'd say CI is the best way I know of.

@zlatanov
Contributor

Would it defeat the purpose of the test to make the message smaller, e.g. frame x5, not x10?

It would not. 5 is as good a number as 10 in this case 😊 Let's see how the CI build turns out; I think the issue here was the usage of Random.

@karelz karelz added this to the 6.0.0 milestone Apr 29, 2021
@karelz karelz removed the untriaged New issue has not been triaged by the area owner label Apr 29, 2021
@BruceForstall
Member

@CarnaViire

And as for running a test on ARM, I'd say CI is the best way I know of.

@karelz should be able to arrange for you to get access to an ARM machine if needed.

@zlatanov
Contributor

@CarnaViire can we run this code in one of the failing environments (alpine.313.arm32, alpine.312.arm64, ubuntu.1804.arm32, ubuntu.1804.arm64):

using System;
using System.Diagnostics;

var random = new Random(0);
var stopwatch = Stopwatch.StartNew();

// Time one million bounded Next() calls with a fixed seed.
for (int i = 0; i < 1_000_000; ++i)
{
    random.Next(maxValue: 10);
}

stopwatch.Stop();
Console.WriteLine(stopwatch.Elapsed);

It shouldn't take more than a few hundred milliseconds on the slowest silicon. On my i7 machine it takes 00:00:00.0074053.

//cc @danmoseley @BruceForstall @karelz

@danmoseley
Member

@BruceForstall this was done in #52086

I agree (this is a reminder to us all 🙂) that when a test fails regularly in CI, we should disable it immediately if we can't fix it immediately. We're good engineers and prefer to investigate and fix rather than disable anything, but the investigation can happen while the test is disabled. 🙂

@MattGal
Member

MattGal commented May 3, 2021

@BruceForstall I have no idea how to get hands on ARM machines -- perhaps infra folks can help here? @ViktorHofer any idea?

I assume Viktor CC'ed me for this. We have a limited number of ARM machines that can be accessed via SSH/RDP from appropriate source IPs; to borrow a machine, you can reach out to @ilyas1974, or to me if he's unavailable.

@CarnaViire
Member

CarnaViire commented May 5, 2021

@zlatanov I was finally able to get access to and set up an ARM machine (I failed to set up Alpine in Docker, so I'm using ubuntu.1804.arm64) to run the Random code you posted here: #52031 (comment). No breakthrough though... it shows around 00:00:00.0308034 on multiple runs.
I'll try to run and measure the whole test. If you have any experiments for this test you'd like to check, just send them and I'll run them for you.

@CarnaViire
Member

For the whole test, too, I see at most 00:00:00.3995667 for 15 bits, nowhere near 5 seconds...

@danmoseley
Member

That's wacky -- we know from Kusto that it was always failing (for 14/15). Not even flaky. That includes Ubuntu.1804.arm64.

It's our own zlib, right? It's not an issue with which zlib is on the machine?

@karelz
Member

karelz commented May 6, 2021

Looks like yet another case where chasing down test failures in CI is difficult without access to the exact CI machine setup :( ... Is there a way to get temporary access to one of the CI machines, to confirm that we can at least reproduce it there manually?

@CarnaViire
Member

As I understand it, I actually did use one of the CI machines, which was excluded from the rotation for me: https://github.com/dotnet/core-eng/issues/13006. But because it's excluded, it's not getting the same load it did while it was in CI, so maybe that's the reason... I also ended up creating my own Docker image for Ubuntu 18.04, but I doubt that could influence it this much? Or could it?
@danmoseley what makes you say it was always failing? We wouldn't have had a green CI when merging the original PR if that were the case, but we did...

@zlatanov
Contributor

zlatanov commented May 7, 2021

Thanks @CarnaViire!

I still don't think this is related to zlib. @danmoseley we don't use the machine-wide installation of zlib. In the case of ARM, we use classic zlib instead of zlib-intel.

I will look at the build/test outputs to see if I can find something interesting. As @CarnaViire suggested, it might be related to how busy the machine is, but going from 0.01 seconds to 7 seconds seems a bit too much to me. I don't expect the xunit runner to launch more concurrent tests than there are CPUs. Also, if that were the case, we should see the total runtime of all the tests combined be an order of magnitude slower than in the other environments.

@zlatanov
Contributor

zlatanov commented May 7, 2021

After looking again at the build logs, I don't think it's load related either. The run times are too consistent: almost 4 seconds for 13 bits, 7 seconds for 14 bits, and 14 seconds for 15 bits.

If it were load related, I would expect deviations, but there aren't any.

@danmoseley
Member

@danmoseley what makes you say it was always failing?

I spoke imprecisely; I meant that when it failed, it was always 14 or 15. Try this query:


https://engsrvprod.kusto.windows.net/engineeringdata

TestResults 
| join kind=inner WorkItems on WorkItemId 
| join kind=inner Jobs on JobId
| where Finished >= datetime(2021-3-1 0:00:00)
and Type == "System.Net.WebSockets.Tests.WebSocketDeflateTests"
and Branch == "refs/heads/main"
| summarize count() by Method, Result, QueueAlias, Arguments, Message

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 7, 2021
@danmoseley
Member

After looking again at the build logs, I don't think it's load related either. The run times are too consistent: almost 4 seconds for 13 bits, 7 seconds for 14 bits, and 14 seconds for 15 bits.

Which logs are these? I didn't think we retained data for individual tests unless they failed, and I thought you couldn't repro locally.

Something else I see from querying the failures: it never fails on 13, and it fails on 15 about twice as often as on 14. When it fails for 14, it always fails for 15. (This is consistent with the "machine mysteriously slow" theory.)

One other idea: roll back #52086 and add a Stopwatch around each part of the test, including the Random loop, to measure how long each part takes. Put a try/catch for TaskCanceledException around the whole test, and if it's hit, add those sub-times to the message. That will tell us where the time was spent. Ideally we'd do this locally on @CarnaViire's ARM machine and just run the test over and over that way?
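A runnable sketch of that instrumentation idea (hypothetical: the phase names, payload size, and the stand-in "send" step are illustrative, not the actual test code):

// Sketch: time each phase and surface the sub-timings if the cancellation fires.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var timings = new Dictionary<string, TimeSpan>();
using var cancellation = new CancellationTokenSource(TimeSpan.FromSeconds(5));
var stopwatch = Stopwatch.StartNew();

try
{
    // Phase 1: payload generation (stand-in for the Random loop in the test).
    var random = new Random(0);
    var payload = new byte[1024 * 1024];
    for (int i = 0; i < payload.Length; ++i)
    {
        payload[i] = (byte)random.Next(maxValue: 10);
    }
    timings["random"] = stopwatch.Elapsed;

    // Phase 2: stand-in for compressing and sending the frames.
    stopwatch.Restart();
    await Task.Delay(100, cancellation.Token);
    timings["send"] = stopwatch.Elapsed;
}
catch (TaskCanceledException ex)
{
    // Surface the sub-timings gathered so far in the failure message.
    var detail = string.Join(", ", timings.Select(t => $"{t.Key}={t.Value}"));
    throw new Exception($"Test timed out. Sub-timings so far: {detail}", ex);
}

Console.WriteLine(string.Join(Environment.NewLine, timings.Select(t => $"{t.Key}: {t.Value}")));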

@danmoseley
Member

This probably can be done with a simple console app, no xunit involved.

@CarnaViire
Member

The simple console app doesn't help: it consistently shows the small numbers I've posted before, like 0.03s for Random and 0.3s for the rest... Now I'm trying to build and run the actual full test suite; possibly the xunit parallel test run influences it somehow.

@CarnaViire
Member

So, even when running all the WebSockets tests with parallel execution, I am getting 0.34-0.35s for 15 window bits for this test.

I guess we have to try executing it in the actual CI pipeline... which I am not sure how to do, because CI seems to skip the test run on ARM, e.g. here #52417 (apparently because isFullMatrix==false), and I don't know how to trigger it 😢

cc @safern @ViktorHofer

@zlatanov
Contributor

Which logs are these? I didn't think we retained data for individual tests unless they failed, and I thought you couldn't repro locally.

The build logs seem to be retained for 60 days. If you go to the tests and filter them, you can find the WebSocket tests for each of the platforms. The run times are too consistent. You can try to find older builds (PRs that mention this issue) and you will see similar run times.

And because I originally used 5 seconds as the cancellation timeout, having the test throw TaskCanceledException after 14 seconds is dubious.

[screenshot: test run times from the CI build logs]
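One possible explanation (an assumption, not something established in this thread): a timed CancellationTokenSource fires after 5 seconds, but the TaskCanceledException only surfaces at the next await that observes the token, so a long synchronous phase can push the observed failure well past the 5-second mark. A minimal runnable sketch:

// Sketch: the 5-second cancellation is observed only at the first await that
// honors the token, so the exception can appear much later than 5 seconds.
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

using var cancellation = new CancellationTokenSource(TimeSpan.FromSeconds(5));
var stopwatch = Stopwatch.StartNew();

Thread.Sleep(TimeSpan.FromSeconds(9)); // stand-in for a slow synchronous phase (e.g. the Random loop)

try
{
    await Task.Delay(100, cancellation.Token); // first await that observes the token
}
catch (TaskCanceledException)
{
    Console.WriteLine($"Canceled after {stopwatch.Elapsed}"); // ~9 seconds, not 5
}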

@safern
Member

safern commented May 10, 2021

and I don't know how to trigger it

You can run it manually from your branch. Just click on "Run Pipeline" here: https://dev.azure.com/dnceng/public/_build?definitionId=686

Then, as the source branch, choose your branch... You can push a branch to the dotnet fork of dotnet/runtime and it will show up as available to run the pipeline from; or, if you have a PR, you can use refs/pull/<PRid>.

@CarnaViire
Member

I've found the difference between what I was running on my machine and what was failing in CI. I was running on a Release runtime, but the test actually ran over time on a Checked runtime.
Here on the Release runtime, the test takes the same 0.3s I was seeing:
[screenshot: Release runtime test durations]
And here on the Checked runtime, it takes more than 5s:
[screenshot: Checked runtime test durations]
The pipeline with these results is here.
I will build a Checked runtime on my machine to confirm I can repro it and to see where exactly the time goes.

@danmoseley
Member

I was racking my brain to think what we were missing, and you figured it out. 👏

@CarnaViire
Member

I've run the measurements multiple times on the Checked runtime, and it was indeed Random taking all the time there: the random part takes 4.8s and the deflate part takes 0.3s. When I apply the fix from #52052, the random part drops to 0.02s 😲.
@danmoseley should we pass word about this to the people who own Random?
I will reopen @zlatanov's PR since the fix works, and I will double-check that in CI 😊
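For reference, a hypothetical before/after sketch of the kind of change #52052 describes (reducing the number of Random.Next calls); the buffer size and the modulo mapping are illustrative, not the actual test code:

// Hypothetical sketch of reducing Random.Next calls when building a compressible
// test payload (illustrative only; not the actual change in #52052).
using System;

var random = new Random(0);
byte[] payload = new byte[256 * 1024];

// Before: one bounded Next() call per byte, which is what dominated the test
// time on the checked ARM runtime.
for (int i = 0; i < payload.Length; ++i)
{
    payload[i] = (byte)random.Next(10);
}

// After: fill the whole buffer in a single call, then map each byte into the
// small alphabet so the data stays compressible.
random.NextBytes(payload);
for (int i = 0; i < payload.Length; ++i)
{
    payload[i] %= 10;
}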

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 11, 2021
@danmoseley
Member

@jkotas does it surprise you that Random is orders of magnitude slower on a checked runtime on ARM?

@jkotas
Member

jkotas commented May 11, 2021

I do not see an obvious reason why it should be that much slower compared to a release build.

CarnaViire pushed a commit that referenced this issue May 11, 2021
Reducing the number of times Random.Next is called to improve runtime performance of test on ARM. 
Fixes #52031
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 11, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 10, 2021