[Dispatcher] improve api, reduce overhead, improve performances for items > 1k #2083

Merged · 3 commits · Jan 7, 2024

Conversation

Eideren
Collaborator

@Eideren Eideren commented Dec 16, 2023

PR Details

The thread pool now accepts generic work items, allowing the dispatcher and other callers to compose work items out of structures. This in turn enables the JIT to inline the composed call tree into a single function call, from the thread's scope all the way down to the actual work the caller dispatched.

Arguments can now be passed by ref to the dispatcher, which in turn passes them by ref to the threads. This resolves a lot of issues that previously required painful workarounds and allocations, such as letting jobs read and write a shared structure, performing interlocked operations per dispatch, and outputting results from a job.

Added an interface for processing items in batches, allowing for improved throughput for jobs that have thousands of items to process.

I only changed a couple of engine calls to use batching; others may benefit in more demanding games.
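The batched-dispatch idea described above can be sketched roughly as follows. This is an illustrative sketch only: `IBatchJob`, `SumJob`, and `Process` are hypothetical names, not the actual Stride API.

```csharp
using System.Threading;

// Hypothetical sketch of a batched job; not the actual Stride API.
public interface IBatchJob
{
    // Process items in the half-open range [start, end) on one worker thread.
    void Process(int start, int end);
}

// A struct job lets the JIT see the concrete type through the generic
// dispatcher and inline Process into the worker's loop.
public struct SumJob : IBatchJob
{
    public int[] Input;
    public long Total; // meaningful only if the dispatcher shares the job by ref

    public void Process(int start, int end)
    {
        long local = 0;
        for (int i = start; i < end; i++)
            local += Input[i];
        // One interlocked operation per batch instead of one per item.
        Interlocked.Add(ref Total, local);
    }
}
```

Processing a whole range per call is what amortizes the scheduling overhead when there are thousands of items.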

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My change requires a change to the documentation.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have built and run the editor to try this change out.

@Kryptos-FR
Member

Any benchmark?

@Eideren
Collaborator Author

Eideren commented Dec 17, 2023

I'll send it over later this week

@Eideren
Collaborator Author

Eideren commented Dec 20, 2023

Here's a benchmark running on latest changes:

| Method  | ItemsToProcess | Mean       | Error      | StdDev     |
|---------|---------------:|-----------:|-----------:|-----------:|
| New     | 10             | 1.093 us   | 0.0215 us  | 0.0393 us  |
| Old     | 10             | 5.357 us   | 0.1058 us  | 0.2038 us  |
| NewLong | 10             | 1.383 us   | 0.0277 us  | 0.0477 us  |
| OldLong | 10             | 5.525 us   | 0.0521 us  | 0.0435 us  |
| New     | 100            | 1.453 us   | 0.0207 us  | 0.0162 us  |
| Old     | 100            | 2.735 us   | 0.0528 us  | 0.0494 us  |
| NewLong | 100            | 4.312 us   | 0.0857 us  | 0.0880 us  |
| OldLong | 100            | 7.431 us   | 0.1068 us  | 0.0892 us  |
| New     | 1000           | 2.084 us   | 0.0414 us  | 0.0960 us  |
| Old     | 1000           | 7.699 us   | 0.1528 us  | 0.2510 us  |
| NewLong | 1000           | 8.039 us   | 0.0244 us  | 0.0228 us  |
| OldLong | 1000           | 9.355 us   | 0.1729 us  | 0.2123 us  |
| New     | 10000          | 6.890 us   | 0.1352 us  | 0.2761 us  |
| Old     | 10000          | 8.564 us   | 0.0888 us  | 0.0693 us  |
| NewLong | 10000          | 23.778 us  | 0.1935 us  | 0.1715 us  |
| OldLong | 10000          | 23.918 us  | 0.1049 us  | 0.0876 us  |
| New     | 100000         | 29.414 us  | 0.2818 us  | 0.2636 us  |
| Old     | 100000         | 33.156 us  | 0.6590 us  | 1.3162 us  |
| NewLong | 100000         | 193.900 us | 1.9274 us  | 1.5048 us  |
| OldLong | 100000         | 232.131 us | 7.2847 us  | 21.4791 us |

The difference between New and NewLong is that more operations are run per item in the latter. It doesn't matter too much in this case, since both the old and new versions use the same job-splitting tactic, but it does give some insight into how much overhead there is in just scheduling a job.
Benchmark.zip

N.B.: Also included the changes from PR #1190
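The table above reads like BenchmarkDotNet output; a harness along these lines would produce that format. This is an illustrative skeleton only (the actual benchmark code is in Benchmark.zip), with the measured bodies left as comments:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class DispatcherBench
{
    // Each value produces one row per benchmark method in the results table.
    [Params(10, 100, 1000, 10000, 100000)]
    public int ItemsToProcess;

    [Benchmark]
    public void New()
    {
        // Dispatch ItemsToProcess items through the new batched dispatcher.
    }

    [Benchmark]
    public void Old()
    {
        // Dispatch the same work through the previous dispatcher.
    }

    // NewLong / OldLong would do more work per item to show how the
    // scheduling overhead amortizes as per-item cost grows.

    public static void Main() => BenchmarkRunner.Run<DispatcherBench>();
}
```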

@manio143 (Member) left a comment:

Nice work

```csharp
{
using (Profile(action))
var batch = (BatchState<TJob>)obj;
```
Member:

Do we have any guarantees this would only be called with objects of the correct type? A cast like this throws when the type doesn't match, and can be more costly than obj as BatchState<TJob> (or is) when the type is sealed or the JIT knows it has no subclasses.
Not sure it would make much perf difference here anyway.

Eideren (Collaborator Author):

Yep, it's guaranteed:

```csharp
ThreadPool.Instance.QueueUnsafeWorkItem(batch, &TypeAdapter<TJob>, batchCount - 1);
```

This call schedules the function void TypeAdapter<TJob>(object) to run on other threads, with batch passed as the obj parameter. The generic types of TypeAdapter<TJob> and BatchState<TJob> match because both come from the same ForBatched<TJob> generic; it would be a fairly serious bug if there weren't any guarantee.
But as you said, while as is faster than a direct cast, it won't move the needle much compared to how long those jobs tend to take; still, anything helps. I'll change it to as shortly.
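To illustrate the two casting styles being discussed, here is a minimal sketch with a stand-in sealed type (not the actual BatchState<TJob>):

```csharp
sealed class BatchState { }

static class CastDemo
{
    static void Handle(object obj)
    {
        // Direct cast: throws InvalidCastException if obj is not a BatchState.
        var direct = (BatchState)obj;

        // 'as': yields null instead of throwing; on a sealed type the JIT
        // can reduce the type check to a single type-handle comparison.
        var safe = obj as BatchState;
        if (safe == null)
            return; // unreachable when the caller guarantees the type
    }
}
```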

```csharp
private class BatchState
/// <summary>
/// An implementation of a job running in batches.
/// This object is shared across all threads scheduled for the job;
```
Member:

Would be good to mention here as well that it's not shared if it's a struct.

Eideren (Collaborator Author):

Right, forgot about that one when changing things over, good catch!
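A minimal illustration of why the remark matters: a struct job passed by value is copied, so mutations are invisible to the caller, while passing by ref shares the one instance (the names here are illustrative only):

```csharp
struct StructJob
{
    public int Hits;
    public void Run() => Hits++;
}

class Demo
{
    static void RunByValue(StructJob j) => j.Run(); // mutates a copy
    static void RunByRef(ref StructJob j) => j.Run(); // mutates the original

    static void Main()
    {
        var job = new StructJob();

        RunByValue(job);
        System.Console.WriteLine(job.Hits); // prints 0: the copy changed, not job

        RunByRef(ref job);
        System.Console.WriteLine(job.Hits); // prints 1: the original changed
    }
}
```

A class instance, by contrast, is always a single shared object regardless of how it is passed.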

```csharp
{
// For net6+ this should not happen, logging instead of throwing as this is just a performance regression
if(Environment.Version.Major >= 6)
    Console.Out?.WriteLine($"{typeof(ThreadPool).FullName}: Falling back to suboptimal semaphore");
```
Member:

Should we log errors to Stride's logging system instead of Console? A lot of the time on Windows you're not going to have a console attached.


```csharp
public DotnetLifoSemaphore(int spinCount)
{
    Type lifoType = Type.GetType("System.Threading.LowLevelLifoSemaphore");
```
Member:

Would be good to add a comment here that we use reflection to access an internal type.
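A hedged sketch of the pattern such a comment would document: resolving an internal runtime type by name must tolerate the type or its members being absent, since they are implementation details that can change between runtimes. The member name and delegate shape below are assumptions inferred from the field declarations in this PR:

```csharp
using System;
using System.Reflection;

// Reflection over an internal BCL type: every lookup can fail and must be
// null-checked so the caller can fall back to a public semaphore instead.
Type lifoType = Type.GetType("System.Threading.LowLevelLifoSemaphore");
if (lifoType != null)
{
    MethodInfo waitMethod = lifoType.GetMethod(
        "Wait", BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);
    // Binding a delegate once avoids per-call MethodInfo.Invoke overhead;
    // the Func<int, bool, bool> shape mirrors the 'wait' field in this PR:
    // var wait = (Func<int, bool, bool>)waitMethod.CreateDelegate(
    //     typeof(Func<int, bool, bool>), semaphoreInstance);
}
```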

```csharp
private sealed class DotnetLifoSemaphore : ISemaphore
{
    private readonly IDisposable semaphore;
    private readonly Func<int, bool, bool> wait;
```
Member:

Curious: could C# 9.0 function pointers (and GetFunctionPointer()) be useful in this scenario?

(Not sure it would make an actual perf difference, but if they're usable enough for this case, it would mean I can probably use them in some other places.)

Eideren (Collaborator Author):

From my limited testing, yes, although I don't know the implications this has for the JIT and such. I do remember that the address you get when taking a static function pointer is not fixed; if the method is 'moved' after the JIT has compiled it, then we might have to retrieve the function pointer from its runtime method handle on every call to make sure we run the optimal version…
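For reference, the C# 9 function-pointer syntax under discussion looks like this (requires an unsafe context and AllowUnsafeBlocks in the project file; the demo names are illustrative):

```csharp
static class FnPtrDemo
{
    static int Double(int x) => x * 2;

    static unsafe void Main()
    {
        // Taking the address of a static method yields a managed function
        // pointer; calls through it avoid delegate allocation and Invoke.
        delegate*<int, int> fn = &Double;
        System.Console.WriteLine(fn(21)); // prints 42
    }
}
```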

@Eideren Eideren merged commit 81f0f6c into stride3d:master Jan 7, 2024
2 checks passed
@Eideren Eideren deleted the dispatcher branch January 7, 2024 12:42
@Kryptos-FR
Member

#1190 can probably be closed now?

@Eideren
Collaborator Author

Eideren commented Jan 7, 2024

Yep, thanks for the reminder @Kryptos-FR
