
MultiplayerSynchronizer: Improve performance of watched variables #81593

Conversation

@DennisManaa DennisManaa commented Sep 12, 2023

This PR partially resolves a performance issue we encountered while working with the high-level multiplayer system in our multiplayer top-down shooter.

I've thoroughly tested these modifications on both version 4.1.1-stable and the master branch. The accompanying screenshots were captured from the master branch for reference.

Background

We recently encountered performance issues in our game, prompting us to investigate their root causes. During this process, we upgraded our project from version 4.0.3 to 4.1.1 and switched from the "sync" option to "watch"ing variables for data synchronization. We applied this approach wherever it made sense, significantly reducing server-client data traffic; this was particularly effective because many variables are not updated every frame. We greatly appreciated the efficiency gains from this feature of the MultiplayerSynchronizer.

Unfortunately, this adjustment did not completely resolve our performance problems. After some time, we pinpointed the issue to our server's CPU struggling to handle the required workload. Regrettably, the integrated profiler did not provide sufficient insight in this case, so we profiled our game with the Hotspot profiler and a debug build of the Godot Engine, as outlined in the Godot documentation.

We hope that this pull request, if merged, will contribute to partially alleviating our performance issues as soon as version 4.2 is released. If not, we remain optimistic that it will help to discover alternative ways to enhance the High-Level-Multiplayer-System's performance for all users of the engine. :)

Measurements

In our SceneTree, we typically have between 70 to 120 MultiplayerSynchronizers, which varies based on the game state. To optimize network synchronization, we've configured the replication_interval and delta_interval for most variables to approximately 0.016 seconds, which aligns with a frame rate of 60 frames per second (i.e., approximately once per frame).

We conducted a gameplay session that lasted precisely 5 minutes, with around 30 seconds of setup time before starting a game round and an additional 20 seconds after finishing the game round. These timings were meticulously controlled by in-game timers. During our profiling analysis, we observed that the processing of calls to SceneMultiplayer::poll was consuming a significant amount of time in our specific scenario, accounting for 35.9% of the processing load. This was even more time-consuming than the processing of calls to SceneTree::_process, which constituted 23.7% of the workload.

The root cause of this performance issue appears to be the resource-intensive nature of the calls to _verify_synchronizer, which are invoked by both _send_delta and _send_sync.

Profiler screenshots (before the change): Everything, SceneMultiplayer::poll, SceneReplicationInterface::_send_delta, SceneReplicationInterface::_send_sync.

How to Improve

The majority of the performance cost of SceneMultiplayer::poll can be attributed to two methods: _send_delta (responsible for synchronizing variables set to "watch") and _send_sync (responsible for synchronizing variables set to "sync"). At least, that's the understanding I've gathered from the code. Both methods call _verify_synchronizer, which appears to be the bottleneck in our performance analysis. It's worth noting that any improvement to this function could potentially benefit all games using the high-level multiplayer system.

Given my current limited familiarity with the codebase, I'm unable to provide a specific enhancement strategy for _verify_synchronizer. Instead, I suggest an alternative approach for this pull request. It appears that _send_delta invokes the _verify_synchronizer method before checking if there is data that requires updating. I propose to modify the code to perform these checks before calling _verify_synchronizer. This change would ensure that _verify_synchronizer is only invoked when an update is genuinely necessary (similar to what _send_sync already does). Currently, it's being called even when no update is needed. To illustrate the impact of this modification, I've provided screenshots of the results after implementing this change.

Profiler screenshots (after the change): Everything, SceneMultiplayer::poll, SceneReplicationInterface::_send_delta, SceneReplicationInterface::_send_sync.

Summary

The call to _send_delta has undergone a significant reduction in its execution time, which in turn has caused shifts in the relative execution times of various methods within the callstack. Here's an overview of these changes:

| Function | before | after |
| --- | --- | --- |
| Main::iteration | 74.8% (1.674E+11 cycles) | 69.1% (1.294E+11 cycles) |
| SceneTree::process | 60.9% (1.34E+11 cycles) | 53.5% (1.002E+11 cycles) |
| SceneMultiplayer::poll | 35.9% (0.7912E+11 cycles) | 24.7% (0.4632E+11 cycles) |
| SceneReplicationInterface::on_network_process | 34.8% (0.7659E+11 cycles) | 23.4% (0.4389E+11 cycles) |
| SceneReplicationInterface::_send_delta | 20.2% (0.4447E+11 cycles) | 6.13% (0.1148E+11 cycles) |

It's important to note that the data provided is specific to our project, and the extent of improvements or variations in performance could differ significantly for other projects. Regrettably, I am unable to share the specifics of this particular project. However, if necessary, I'd be willing to explore the possibility of creating a sample project to conduct more comprehensive testing and evaluation.

BUT (Important):

I want to emphasize that there could be side effects that I cannot confidently evaluate due to my limited familiarity with Godot's code; assistance and insights from those more experienced with it would be greatly appreciated. In our specific project we encountered no problems and everything functioned as expected, but this may not apply to all use cases. (This is my first contribution to the Godot project.)

DennisManaa commented Sep 13, 2023

TL;DR: The actual workload reduction is larger than the relative execution times (percentages) suggest. In our project the overall improvement exceeds 15%, and exceeds 20% when focusing only on Main::iteration; this is substantiated by the drop in measured cycles for Main::iteration from 1.674E+11 to 1.294E+11. In hindsight, presenting the measured cycle counts directly would have been clearer than relative execution times, so I have now included them in the summary.

Long Version:
I would also like to note that the improvement in Main::iteration isn't a mere 5% performance boost, as a glance at the relative execution times might suggest. To gauge the actual performance gain accurately, we must look at the unaltered segment in our comparisons, rather than the modified one.

Assuming that the required cycle count for the code outside of Main::iteration remained relatively constant, we can establish the following relationship:

$$(1 - 0.748) \cdot \text{[old\_total\_workload]} = (1 - 0.691) \cdot \text{[new\_total\_workload]}$$

To make it clearer, before the change, the relative execution time of Main::iteration was 74.8% (0.748), and after the change, it dropped to 69.1% (0.691). For simplicity, let's assume the old total workload was 1, which leads to the following equation:

$$\text{[new\_total\_workload]} = \frac{1 - 0.748}{1 - 0.691} = \frac{0.252}{0.309} \approx 0.815533981$$

This demonstrates that the overall workload in our application has, in fact, decreased by over 18%. However, this performance gain was specifically within Main::iteration and not across the entire application. Therefore, we need to solve the following equation for x:

$$0.815533981 = (1 - 0.748) + 0.748 \cdot x$$

In other words, the new total workload of 0.815533981 equals the workload for the code outside of Main::iteration, plus the reduced workload for the code inside Main::iteration, where x represents the remaining percentage of the initial workload for Main::iteration (before the change).

$$x = \frac{0.815533981 - (1 - 0.748)}{0.748} = 0.753387675$$

This explanation clarifies why we can anticipate a significantly greater improvement than the initial perception of just 5%. This conclusion gains further support from the observations concerning the "aggregated sample costs," which were 1.674E+11 (before) and 1.294E+11 (after), as illustrated in the provided screenshots.

However, it's important to acknowledge that the assumption that the code outside of Main::iteration remained entirely constant isn't quite accurate; it was slightly higher afterward. Consequently, the measured ratio 1.294E+11 / 1.674E+11 = 0.773, which aligns reasonably well with the estimate when considering measurement uncertainty and other effects.

Next time I will just go with the measured cycles instead of the percentage values. Sorry for the confusion :D

Before — (profiler screenshot)

After — (profiler screenshot)

@YuriSizov YuriSizov changed the title MultiplayerSynchronizer: Performance improvement for "watch"ed variables MultiplayerSynchronizer: Improve performance of watched variables Sep 15, 2023
Faless commented Oct 4, 2023

@DennisManaa thank you very much for the comprehensive report.

As I mentioned in chat, this PR does indeed introduce the side effect of skipping the first (few) delta synchronizations, resulting in an out-of-sync state until those variables change.

I've opened #82777 which should address _verify_synchronizer being so slow among other things (the function itself is removed, the actual optimizations are in send_object_cache, make_object_cache, and is_cache_confirmed).

It would be great if you could give that a try and confirm that it works for you, and that you get the expected performance improvements.

@DennisManaa

Superseded by #82777

@AThousandShips AThousandShips removed this from the 4.x milestone Oct 10, 2023
@DennisManaa DennisManaa deleted the multiplayer-performance-improvement branch October 11, 2023 10:57