Experiment, parallelize some tests #17662

Closed
wants to merge 237 commits into from

Contributor

majocha commented Sep 4, 2024

To run tests in parallel we must deal with global resources and global state accessed by the test cases.

Out of proc:
Tests running as separate processes share the file system. We must make sure they execute in their own temporary directories and don't overwrite any hardcoded paths. This is mostly done already, in a separate PR.
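
A minimal sketch of the per-test directory idea, not the helper actually used in the test utilities. The function name createTestDirectory and the "fsharp-tests" folder name are hypothetical, chosen only for illustration.

```fsharp
open System
open System.IO

/// Creates a fresh directory that no other test or process shares.
/// (Hypothetical helper; the name and folder layout are assumptions.)
let createTestDirectory (testName: string) =
    let unique = Guid.NewGuid().ToString("N")
    let path = Path.Combine(Path.GetTempPath(), "fsharp-tests", testName, unique)
    Directory.CreateDirectory(path).FullName

// Write marker files into the private directory instead of the current directory.
let dir = createTestDirectory "sample"
File.WriteAllText(Path.Combine(dir, "test.ok"), "ok")
```

Because every call gets its own GUID-named folder, two tests writing a file with the same name can no longer overwrite each other.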

Hosted:
Many tests use the hosted compiler and FsiEvaluationSession, sharing global resources and global state within the runner process:

  • Console streams - this is swept under the rug for now by using a simple AsyncLocal stream splitter.
  • FileSystem, the global mutable of the file system shim - the few tests that mutate it must be excluded from parallelization.
  • Environment.CurrentDirectory - many tests executing in a hosted session were doing a variation of File.WriteAllText("test.ok", "ok"), all in the current directory (i.e. bin), leading to conflicts. This is replaced with a thread-safe mechanism.
  • Environment variables, Path - this mostly applies to DependencyManager, which is excluded from parallelization for now.
  • Async default cancellation token - the few tests that call Async.CancelDefaultToken() must be excluded from parallelization.
  • Global state used in conjunction with the --times option - these tests are excluded from parallelization.
  • Global mutable state in the form of multiple caches implemented as ConcurrentDictionary. This needs further investigation.

I'll add to the above list if I recall anything else.

Problems:
Tests depending on tight timing, orchestrating stuff by combinations of Thread.Sleep, Async.Sleep and wait timeouts.
These are mostly excluded from parallelization, some attempts at fixing things were made.
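
One common way to remove this kind of timing dependence is to wait on an explicit signal rather than sleeping for a guessed duration. The sketch below illustrates that general technique only; it is not taken from the PR, and the names workerStarted and worker are hypothetical.

```fsharp
open System
open System.Threading

// Signal set by the background work once it has reached the point the test cares about.
let workerStarted = new ManualResetEventSlim(false)

let worker =
    async {
        // ... any setup the test needs would go here ...
        workerStarted.Set()
    }

Async.Start worker

// The timeout only bounds the failure case; on the happy path Wait returns as soon as Set() runs.
let signalled = workerStarted.Wait(TimeSpan.FromSeconds(10.0))
if not signalled then failwith "worker did not start within 10 seconds"
```

Unlike Thread.Sleep, the wait does not get slower or flakier when other tests are competing for the CPU.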

Obscure compiler bugs revealed in this PR:

  • Internal error: value cannot be null - this mostly happens on CoreCLR, once or sometimes a few times during a test run.

  • Error creating evaluation session because of an NRE somewhere in TcImports.BuildNonFrameworkTcImports. This is rarer but may be related to the above.

These were related to concurrency issues: modifying frameworkTcImportsCache without a lock, and a bug in the custom lazy implementation in il.fs. Hopefully both are fixed now.

Running in parallel:
Xunit runners are configured with mostly default parallelization settings.

dotnet test .\FSharp.sln -c Release -f net9.0 will run all discovered test assemblies in parallel as soon as they're built.
This can be limited with the -m switch. For example,
dotnet test -m:2 .\FSharp.Compiler.Service.sln
will limit the test run to at most 2 simultaneous processes. Still, each test host process runs its test collections in parallel.

Some test collections are excluded from parallelization with the [<Collection(nameof DoNotRunInParallel)>] attribute.
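
A sketch of how such an exclusion can be declared with xUnit v2, assuming the xunit package is referenced. CollectionDefinition with DisableParallelization is standard xUnit API; the exact shape of the DoNotRunInParallel type in this PR is an assumption, and the test module below is hypothetical.

```fsharp
open Xunit

/// Marker type for the collection. DisableParallelization = true tells xUnit
/// not to run this collection alongside any other collection.
[<CollectionDefinition(nameof DoNotRunInParallel, DisableParallelization = true)>]
type DoNotRunInParallel() = class end

/// Hypothetical module whose tests touch process-wide state.
[<Collection(nameof DoNotRunInParallel)>]
module CancellationTests =

    [<Fact>]
    let ``cancelling the default token does not throw`` () =
        Async.CancelDefaultToken()
```

Tests in modules without the attribute keep running in parallel with each other; only the marked collection is serialized.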

Running in the IDE with "Run tests in parallel" enabled will respect xunit.runner.json settings and the above exclusions.
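
For reference, an xunit.runner.json with explicit parallelization settings might look like the fragment below. The keys are standard xUnit v2 runner options, but the values are illustrative assumptions, not the ones used in this PR.

```json
{
  "$schema": "https://xunit.net/schema/current/xunit.runner.schema.json",
  "parallelizeAssembly": true,
  "parallelizeTestCollections": true,
  "maxParallelThreads": 4
}
```

Setting parallelizeTestCollections to false would bring a single project back to sequential execution.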

TODO:

Contributor

github-actions bot commented Sep 4, 2024

❗ Release notes required


✅ Found changes and release notes in following paths:

| Change path | Release notes path | Description |
| src/Compiler | docs/release-notes/.FSharp.Compiler.Service/9.0.200.md | |

Contributor Author

majocha commented Sep 4, 2024

Yeah this will not do much.

A lot of test cases write to stdout. We do captureConsoleOutputs in CompilerAssert but in a way that does not allow parallel execution. Not to mention in a lot of places there are printfn calls that will mangle outputs even more. All of those tests must run sequentially.

This needs a systemic approach to deal with stdout capture; maybe xUnit has some mechanism to do this in parallel.

Member

psfinaki commented Sep 4, 2024

@majocha guess what, I was going to try the same today :)

Thanks for looking into that. Indeed, it seems like we should just damn stop printing everything to the console; it's likely an artifact of older testing approaches. I cannot see any reason to do this instead of just in-memory processing.

psfinaki changed the title from "Expreriment, parallelize some tests" to "Experiment, parallelize some tests" Sep 4, 2024
Contributor Author

majocha commented Sep 4, 2024

xUnit does not run tests from the same module in parallel. It also does not parallelize Theories.
This can slow things down even with parallel execution enabled. In FSharpSuite we have modules with lots of slow tests.

This can be mitigated by customizing xUnit in code, I think?

Member

psfinaki commented Sep 4, 2024

By customizing xUnit you mean setting up some special runner settings and assembly attributes?

We can do that. However, I am not a fan of this idea. xUnit's philosophy is really to apply good coding practices to tests. As in, write tests as you write code. Hence, compared to e.g. NUnit, it offers very limited test platform voodoo (think fixtures, setup/teardown and so on), instead making as much use as possible of the built-in language capabilities.

And so I would instead prefer keeping up with this philosophy. If, by default, xUnit parallelizes execution on the module level, then we should actually split modules into smaller ones - thereby it will improve code clarity and will generally add to better code organization :)

What do you think?

Contributor Author

majocha commented Sep 5, 2024

By customizing I meant something like https://www.meziantou.net/parallelize-test-cases-execution-in-xunit.htm

But this is not the most pressing thing and probably not needed if splitting modules would do.

The biggest hurdle for now is correctly isolating the console when running tests in parallel. Redirecting with Console.SetOut for each test won't work anymore when another thread can also redirect it unpredictably.

Console writes come from multiple sources:

  1. In process executing fsi, probably also fsc
  2. Various printfn and log calls sprinkled through the test cases and helper code.
  3. Source code compiled and executed in AssemblyLoadContext (or AppDomain in case of net472).

While we can manage 1. and 2., 3. is a bit harder.

Contributor Author

majocha commented Sep 5, 2024

I've been chatting with Bing / Copilot about it, and it actually proposed a not bad idea:

Don't redirect the Console at all for individual tests. Instead, install a custom thread-splitting TextWriter upfront.
That splitting writer will keep a ThreadLocal inner TextWriter and direct all writes to it.
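
A rough sketch of that idea, under the assumption that each test runs on a single thread. The type name ThreadSplittingWriter and the capture helper are hypothetical and are not the implementation from this PR.

```fsharp
open System
open System.IO
open System.Threading

/// Forwards every write to the writer installed for the calling thread.
type ThreadSplittingWriter(fallback: TextWriter) =
    inherit TextWriter()

    // Each thread sees its own inner writer; threads that never installed one use the fallback.
    let inner = new ThreadLocal<TextWriter>(Func<TextWriter>(fun () -> fallback))

    /// Installs a writer for the current thread only.
    member _.Install(writer: TextWriter) = inner.Value <- writer

    /// Restores the fallback for the current thread.
    member _.Reset() = inner.Value <- fallback

    override _.Encoding = fallback.Encoding
    override _.Write(value: char) = inner.Value.Write(value)
    override _.Write(value: string) = inner.Value.Write(value)
    override _.WriteLine(value: string) = inner.Value.WriteLine(value)

// Installed once, up front; individual tests never call Console.SetOut again.
let splitter = new ThreadSplittingWriter(Console.Out)
Console.SetOut(splitter)

/// Runs `work` and returns what it wrote to the console on this thread.
let capture (work: unit -> unit) =
    use buffer = new StringWriter()
    splitter.Install(buffer)
    try
        work ()
    finally
        splitter.Reset()
    buffer.ToString()
```

A ThreadLocal writer loses track of output when a test continues on a different thread after an await, which is presumably why the PR description ends up mentioning an AsyncLocal-based splitter instead; swapping ThreadLocal for AsyncLocal in this sketch would follow the async flow rather than the thread.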

Member

psfinaki commented Sep 5, 2024

Thanks for the further investigations here.

In the spirit of my comment above - I just vote for reinventing as few wheels as possible and removing those we've already reinvented here :) Unit tests rarely need any output at all, but if they do - it's good to use those few means that xUnit provides for this, which are basically "plug in the writer if and when you need to".

I think it aligns with your thoughts above? It's important to make gradual changes here, probably actually in the way you outline it above. The current direction you're taking (removing stuff) looks promising!


Note, I am off until Monday with limited internet connection so cannot play with the code myself. Also, we've discussed this PR internally yesterday and were all very happy that things are moving in this space!

Contributor Author

majocha commented Sep 6, 2024

This is in a state where it can be run locally in the VS Test Explorer or from the console with build -c Release -testCoreClr.
I think it's shaving a few minutes from build -testCoreClr locally, this could be further improved by breaking up large modules in ComponentTests and FSharpSuite. Definitely I see much better CPU utilization.

In the CI there's that weird TaskCanceledException in random tests, no idea where it comes from.

Still, there are some minor fixes here that I'll try to extract to another PR.

majocha force-pushed the parallelize-tests branch from 3befef0 to c632695 on October 6, 2024 17:07
majocha force-pushed the parallelize-tests branch from aca0134 to ae81bd6 on October 7, 2024 15:11
Contributor Author

majocha commented Oct 7, 2024

This started today I think:
https://dev.azure.com/dnceng-public/public/_build/results?buildId=829440&view=logs&j=2f0d093c-1064-5c86-fc5b-b7b1eca8e66a&t=52d0a7a6-39c9-5fa2-86e8-78f84e98a3a2&l=45

./build.sh --ci --configuration Release --restore --build --pack --publish -bl /p:SourceBuildNonPortable= /p:ArcadeBuildFromSource=true /p:DotNetBuildSourceOnly=true /p:DotNetBuildRepo=true /p:AssetManifestFileName=SourceBuild_Managed.xml
./build.sh: line 16: /__w/1/s/eng/build.sh: Permission denied

No idea what it's about.

majocha closed this Oct 7, 2024
majocha reopened this Oct 7, 2024
Contributor Author

majocha commented Oct 10, 2024

I'm not giving up on this. I'll squash this, clean up a bit and post another draft PR.
I have some wins wrt. utilizing xUnit standard output mechanism and parallel execution of theory cases / collection cases.

Member

T-Gro commented Oct 11, 2024

> I'm not giving up on this. I'll squash this, clean up a bit and post another draft PR. I have some wins wrt. utilizing xUnit standard output mechanism and parallel execution of theory cases / collection cases.

This is good news @majocha .
If you think this could be enabled per project/collection, this would be a good alternative to postpone solving some of the problems.

(e.g. FSharpSuite is the slowest one at CI, but likely avoids some of the issues because compilation is via separate .exe invocation. Therefore things like shared state inside the compiler should not matter here that much)

Contributor Author

majocha commented Oct 11, 2024

@T-Gro there are a lot of modifications in FSharp.Test.Utilities that I think are in use in basically all test projects so this cannot be just disabled selectively as in: use previous implementation. It can be selectively throttled down, even down to full sequential execution, per project, per module etc.

I've been timing the test runs locally a bit. What is really problematic, and bottlenecked by something, is the net472 target. The slowest here for me is ComponentTests; throwing additional cores at it does nothing in net4, there's just no CPU utilization. I suspect it's the thousands of AppDomains it creates and unloads.

net9.0 -testCoreClr runs for me locally in around 4 minutes now:
[image]

What I've been struggling with atm is hanging processes because of files getting locked. For example the test run hangs for 5 minutes and the dump indicates ilread.fs waits to read "System.Security.Cryptography.Primitives.dll" in the dotnet sdk folder. wth?

Anyway, I'll just post what I got in another PR. It'd be good to test this locally on different machines.

See #17872

majocha mentioned this pull request Oct 11, 2024
8 tasks
Member

T-Gro commented Oct 11, 2024

Which File IO call was it, was that visible in the stack trace?
We might check if we are using the best set of switches for a read-only operation.

Contributor Author

majocha commented Oct 11, 2024

> Which File IO call was it, was that visible in the stack trace? We might check if we are using the best set of switches for a read-only operation.

It was here IIRC

| Some(start, length) -> stream.ReadBytes(start, length)

I don't think I still have the dump file, but I added --blame-hang-timeout to the script so if it reproduces in CI, should be possible to debug.

Contributor Author

majocha commented Oct 11, 2024

Closing in favor of #17872

majocha closed this Oct 11, 2024
Member

T-Gro commented Oct 11, 2024

There might be a race condition in the way the ilModuleReaderCache works, or in how the flags are set.
But system DLLs should only ever be read during the build, so it would be good to make it work without shadow copying.

5 participants