
[6.1] Track Steam performance patches #23

Closed
wants to merge 48 commits into from

Conversation

kakra
Owner

@kakra kakra commented Mar 11, 2023

Export patch series: https://github.com/kakra/linux/pull/23.patch

  • winesync: experimental winesync device driver which can be used by some Proton versions
  • hugepages background reclaim: patch cherry-picked from ZEN
  • threaded IRQs by default: cherry-picked from CK
  • always use the bfq IO scheduler by default: although it may benchmark lower in raw throughput, it almost always delivers more consistent desktop IO latency during high IO write loads
  • memory soft-dirty flag: used by Proton to support Windows memory write monitoring with better performance
  • ACS patch: for whoever may need it, patch may be dropped at any time
  • futex backward compatibility patch: properly supports older Proton versions using the latest futex kernel functions
  • lower latency scheduling: CFS patches cherry-picked and combined from TKG, Pop!_OS and PF
  • memory management: improved scheduling for huge memory pages and memory zones, cherry-picked from ZEN
  • readahead patches: IO readahead raised to 2 MB to match huge pages, cherry-picked from XANMOD
  • raised vm.max_map_count: as suggested by Valve (in Steam Deck) and TKG, cherry-picked from TKG

Many patches are enabled unconditionally; e.g., there's no config flag to toggle the ZEN patches as in their original patchset. Otherwise, this patchset would be pointless for me.
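For reference, the defaults this patchset changes can be inspected on a running kernel. A minimal sketch using the standard procfs/sysfs paths; on an unpatched kernel the values will differ from what the patchset sets:

```shell
#!/bin/sh
# vm.max_map_count: stock kernels default to 65530; the patchset raises it
cat /proc/sys/vm/max_map_count

# khugepaged defrag strategy: the patchset defaults this to defer+madvise
# (knob only exists when the kernel is built with THP support)
cat /sys/kernel/mm/transparent_hugepage/defrag 2>/dev/null || echo "THP not available"

# readahead per block device: the patchset raises this to 2 MB (2048 kB)
cat /sys/block/*/queue/read_ahead_kb 2>/dev/null | head -n 1
```

None of these reads require root, so they are safe to run before and after booting the patched kernel to confirm the new defaults took effect.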

@kakra kakra marked this pull request as draft March 11, 2023 16:19
@orbea

orbea commented Mar 12, 2023

@kakra This doesn't apply against a 6.1.18 kernel from kernel.org, am I missing something?

@kakra
Owner Author

kakra commented Mar 12, 2023

@orbea Maybe; my distribution doesn't have kernel 6.1.18 yet, so I didn't try. I'm on 6.1.17. I'll bump the base-6.1 branch when I see conflicts.

But I just checked: 6.1.18 is available since today, so stay tuned.

@kakra
Owner Author

kakra commented Mar 12, 2023

Yep, confirmed: the conflict is in the ACS patch because a new PCIe quirk has been added. Easy fix; the new patchset will be available once I've rebooted with the patched kernel.

@kakra
Owner Author

kakra commented Mar 12, 2023

@orbea Bumped to 6.1.18

@kakra
Owner Author

kakra commented May 20, 2023

New patches added for better scheduling, memory, and IO latency. This also improves compatibility with some demanding games like Detroit: Become Human by raising vm.max_map_count by default, similar to what Valve does on the Steam Deck.

Zebediah Figura added 23 commits October 11, 2023 21:53
Zebediah Figura and others added 25 commits October 11, 2023 21:53
Use [defer+madvise] as default khugepaged defrag strategy:

For some reason, the default strategy to respond to THP fault fallbacks
is still just madvise, meaning stall if the program wants transparent
hugepages, but don't trigger a background reclaim / compaction if THP
begins to fail allocations.  This creates a snowball effect where we
still use the THP code paths, but we almost always fail once a system
has been active and busy for a while.

The option "defer" was created for interactive systems where THP can
still improve performance.  If we have to fall back to a regular page due
to an allocation failure or anything else, we will trigger a background
reclaim and compaction so future THP attempts succeed and previous
attempts eventually have their smaller pages combined without stalling
running applications.

We still want madvise to stall applications that explicitly want THP,
so defer+madvise _does_ make a ton of sense.  Make it the default for
interactive systems, especially if the kernel maintainer left
transparent hugepages on "always".

Reasoning and details in the original patch: https://lwn.net/Articles/711248/

Signed-off-by: Kai Krakow <[email protected]>
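The strategy described above can also be tried at runtime without the patch, since the defrag mode is a writable sysfs knob on any THP-enabled kernel. A sketch, assuming the standard sysfs path; the write requires root and persists only until reboot:

```shell
#!/bin/sh
# Show the available defrag strategies; the active one is shown in brackets,
# e.g.: always defer defer+madvise [madvise] never
cat /sys/kernel/mm/transparent_hugepage/defrag

# Switch to defer+madvise at runtime (needs root); this is what the patch
# makes the compiled-in default
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag
```

The patch only changes the default, so distributions that already set this via a boot script get the same behavior either way.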
Also add ifdefs so that elevator_get_default() remains unchanged with
respect to upstream if CONFIG_IOSCHED_BFQ is disabled.

Signed-off-by: Juuso Alasuutari <[email protected]>
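As with the defrag strategy, the active IO scheduler can be inspected and switched per device at runtime. A sketch assuming a block device named sda (adjust the device name to your hardware); the write needs root:

```shell
#!/bin/sh
# The active scheduler is bracketed, e.g.: [bfq] mq-deadline kyber none
cat /sys/block/sda/queue/scheduler

# Switch to bfq at runtime (needs root); the patch makes bfq the
# compiled-in default so no such udev rule or boot script is needed
echo bfq > /sys/block/sda/queue/scheduler
```

This is handy for comparing latency under write load with and without bfq before committing to the patched default.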
This is an updated version of Alex Williamson's patch from:
https://lkml.org/lkml/2013/5/30/513

Original commit message follows:

PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
allows us to control whether transactions are allowed to be redirected
in various subnodes of a PCIe topology.  For instance, if two
endpoints are below a root port or downstream switch port, the
downstream port may optionally redirect transactions between the
devices, bypassing upstream devices.  The same can happen internally
on multifunction devices.  The transaction may never be visible to the
upstream devices.

One upstream device that we particularly care about is the IOMMU.  If
a redirection occurs in the topology below the IOMMU, then the IOMMU
cannot provide isolation between devices.  This is why the PCIe spec
encourages topologies to include ACS support.  Without it, we have to
assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.

Unfortunately, far too many topologies lack ACS support to make this
a steadfast requirement.  Even the latest chipsets from Intel support
ACS only sporadically.  We have trouble getting interconnect
vendors to include the PCIe spec required PCIe capability, let alone
suggested features.

Therefore, we need to add some flexibility.  The pcie_acs_override=
boot option lets users opt-in specific devices or sets of devices to
assume ACS support.  The "downstream" option assumes full ACS support
on root ports and downstream switch ports.  The "multifunction"
option assumes the subset of ACS features available on multifunction
endpoints and upstream switch ports are supported.  The "id:nnnn:nnnn"
option enables ACS support on devices matching the provided vendor
and device IDs, allowing more strategic ACS overrides.  These options
may be combined in any order.  A maximum of 16 id specific overrides
are available.  It's suggested to use the most limited set of options
necessary to avoid completely disabling ACS across the topology.
Note to hardware vendors, we have facilities to permanently quirk
specific devices which enforce isolation but not provide an ACS
capability.  Please contact me to have your devices added and save
your customers the hassle of this boot option.

Signed-off-by: Mark Weiman <[email protected]>
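For illustration, the boot option described above goes on the kernel command line, typically via the bootloader config. A sketch for GRUB; the vendor:device IDs shown are hypothetical placeholders, and per the commit message the narrowest option that works should be preferred:

```shell
# /etc/default/grub -- append to the existing kernel command line,
# then regenerate the grub config (e.g. grub-mkconfig/update-grub)
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_acs_override=downstream,multifunction"

# Or target specific devices by vendor:device ID (hypothetical IDs),
# up to 16 id: entries, combinable with the other options:
#   pcie_acs_override=id:8086:1234,id:10de:abcd
```

Remember that overriding ACS asserts isolation the hardware has not declared; it is a trade-off, not a fix.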
Add an option to wait on multiple futexes using the old interface, which
uses opcode 31 of the futex() syscall. Do that by simply translating the
old interface to the new code. This allows old and stable versions
of Proton to still use fsync in new kernel releases.

Signed-off-by: André Almeida <[email protected]>
…g delays

The page allocator processes free pages in groups of pageblocks, where
the size of a pageblock is typically quite large (1024 pages without
hugetlbpage support). Pageblocks are processed atomically with the zone
lock held, which can cause severe scheduling delays on both the CPU
going through the pageblock and any other CPUs waiting to acquire the
zone lock. A frequent offender is move_freepages_block(), which is used
by rmqueue() for page allocation.

As it turns out, there's no requirement for pageblocks to be so large,
so the pageblock order can simply be reduced to ease the scheduling
delays and zone lock contention. PAGE_ALLOC_COSTLY_ORDER is used as a
reasonable setting to ensure non-costly page allocation requests can
still be serviced without always needing to free up more than one
pageblock's worth of pages at a time.

This has a noticeable effect on overall system latency when memory
pressure is elevated. The various mm functions which operate on
pageblocks no longer appear in the preemptoff tracer, where previously
they would spend up to 100 ms on a mobile arm64 CPU processing a
pageblock with preemption disabled and the zone lock held.

Signed-off-by: Sultan Alsawaf <[email protected]>
There is noticeable scheduling latency and heavy zone lock contention
stemming from rmqueue_bulk's single hold of the zone lock while doing
its work, as seen with the preemptoff tracer. There's no actual need for
rmqueue_bulk() to hold the zone lock the entire time; it only does so
for supposed efficiency. As such, we can relax the zone lock and even
reschedule when IRQs are enabled in order to keep the scheduling delays
and zone lock contention at bay. Forward progress is still guaranteed,
as the zone lock can only be relaxed after page removal.

With this change, rmqueue_bulk() no longer appears as a serious offender
in the preemptoff tracer, and system latency is noticeably improved.

Signed-off-by: Sultan Alsawaf <[email protected]>
The value is still pretty low, and AMD64-ABI and ELF extended numbering
supports that, so we should be fine on modern x86 systems.

This fixes crashes in some applications using more than 65535 VMAs (this
also affects some Windows games running in Wine, such as Star Citizen).

Signed-off-by: Kai Krakow <[email protected]>
Some games such as Detroit: Become Human tend to be very crash prone with
lower values.

Signed-off-by: Kai Krakow <[email protected]>
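The limit the two commits above raise can be observed directly: when a process's mapping count reaches vm.max_map_count, further mmap() calls fail and mapping-hungry games crash. A minimal sketch using standard procfs:

```shell
#!/bin/sh
# Number of VMAs (memory mappings) currently used by this shell process
wc -l < /proc/self/maps

# System-wide per-process cap on mappings; a process crashes (mmap fails
# with ENOMEM) once it hits this many VMAs
cat /proc/sys/vm/max_map_count
```

Comparing the two numbers while a game such as Star Citizen or Detroit: Become Human runs shows how close the stock 65530 default gets to exhaustion.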
Tejun reported that when he targets workqueues towards a specific LLC
on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
significant idle time.

This is, of course, because of how select_idle_sibling() will not
consider anything outside of the local LLC, and since all these tasks
are short running the periodic idle load balancer is ineffective.

And while it is good to keep work cache local, it is better to not
have significant idle time. Therefore, have select_idle_sibling() try
other LLCs inside the same node when the local one comes up empty.

Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
@kakra kakra added the done To be superseded by next LTS label Nov 26, 2023
@kakra
Owner Author

kakra commented Nov 26, 2023

Rebased to #30

@kakra kakra closed this Nov 26, 2023