Hangs on full load on processors with e-cores #963

Open
SheMelody opened this issue Jul 1, 2024 · 6 comments

@SheMelody

OS: Arch Linux
Using linux-tkg causes the system to hang when running under full load for prolonged periods on processors with e-cores. Disabling e-cores in the BIOS fixes the issue. This is not a hardware issue, as the other kernels I have tested (arch, linux-xanmod, linux-zen) do not have this problem.

  • There are no relevant logs to help diagnose this problem: the system just hangs, and no errors are saved or reported anywhere. If it helps, the kernel was using the intel_cpufreq driver.

@Tk-Glitch
Member

Can we please have more data? Compiler used, uarch optimization selected, CPU scheduler?

I'd try changing _custom_commandline="intel_pstate=passive kernel.split_lock_mitigate=0" in customization.cfg (or your frogminer cfg if you're using that) to _custom_commandline="intel_pstate=active kernel.split_lock_mitigate=0".
If it fixes your issue (which is likely if you're not using a fancy CPU scheduler) it's technically a hardware issue, but to be fair all Intel big.LITTLE CPUs are basically broken hardware from factory so not exactly unexpected, and the workaround is totally acceptable from a user's standpoint imho.
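For anyone following along, here is a minimal sketch of that change and a way to confirm which driver ends up running. It relies on the fact that intel_pstate in passive mode registers itself as the intel_cpufreq scaling driver, which is why the issue mentions intel_cpufreq. The function name `show_cpufreq_driver` and its optional root argument are made up for illustration; only the sysfs path and the `_custom_commandline` values come from the kernel and from this thread.

```shell
# customization.cfg (linux-tkg): only this one line changes.
# Before (passive mode, reported by the kernel as "intel_cpufreq"):
#   _custom_commandline="intel_pstate=passive kernel.split_lock_mitigate=0"
# After (active mode, reported by the kernel as "intel_pstate"):
_custom_commandline="intel_pstate=active kernel.split_lock_mitigate=0"

# Hypothetical helper: print the scaling driver used by cpu0.
# The optional argument overrides the sysfs root (assumption, for testing only).
show_cpufreq_driver() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    driver_file="$cpu_root/cpu0/cpufreq/scaling_driver"
    if [ -r "$driver_file" ]; then
        cat "$driver_file"
    else
        echo "unknown"
    fi
}
```

After rebuilding and rebooting, calling `show_cpufreq_driver` with no argument reads the real sysfs tree; `cat /sys/devices/system/cpu/intel_pstate/status` should also report the mode when the intel_pstate driver is loaded.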

@SheMelody
Author

I used the latest compiled Release package for Arch, although to be honest that no longer matters, since I've found what the problem is.

Both Intel and AMD have been silently changing the x86 standard, and that could be considered a "hardware issue", and this is a blatant problem when it comes to ccx, e-cores and all of that.

I have tested a bit more, and apparently it's an issue with voltages (not on my side, and not a hardware fault). I've been tinkering with a lot of systems for years, and one thing I know for sure is that any processor will randomly hang if it is not given enough voltage when it suddenly transitions to a shallower idle state (i.e. one with a higher frequency and voltage).

Raising the Load Line Calibration to Level 8, which decreases the Vdroop during power transitions, effectively fixes this problem, but it pushes temperatures to almost 90 °C under full load, even with a high-end liquid cooler on my 14700K. And no, this is still not a hardware issue, and I certainly don't want to run the CPU at 90 °C under full load with my computer sounding like a jet.

intel_pstate can still handle these power transitions just fine with all governors down to LLC Level 2, which is a pretty good result. It is reasonable to assume that intel_cpufreq does something wrong with these processors when transitioning between power states, and that disabling e-cores mitigates it by reducing the chance of this happening. The fact that raising LLC also fixes the problem strongly suggests that this is a voltage transition problem. Whatever intel_cpufreq is doing, it's doing it wrong in this kernel.

Once I had confirmed this, I tried the same thing on my other system, which I mainly use for AI and which has a 14900K, and I got exactly the same results.

Now, that being said, if your concern is reducing stutters, you're taking the wrong path. You only need to address the poorly implemented power management on the motherboard's side; everything else can be left alone. Disable ASPM, PCI port power management and Advanced Power Management. Your components will still idle: these settings do not control component idling, they only affect the motherboard's power management. This, together with split_lock_detect=0, kernel.split_lock_mitigate=0 and some other tweaks to other modules, is more than enough to get an extremely stable experience in literally any game.
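As a rough sketch, the settings mentioned above can be inspected from a running system as follows. The two function names and their optional path arguments are hypothetical and only there for illustration; the default paths are the standard ones on recent kernels (the split_lock_mitigate sysctl may be missing on older ones). One caveat, as far as I know: the kernel documentation lists `off`, `warn`, `fatal` and `ratelimit:N` as the values for split_lock_detect, so it is worth double-checking that `=0` is actually treated as "off".

```shell
# Hypothetical helper: print the active PCIe ASPM policy.
# The kernel marks the selected policy with square brackets, for example:
#   default [performance] powersave powersupersave
active_aspm_policy() {
    policy_file="${1:-/sys/module/pcie_aspm/parameters/policy}"
    if [ -r "$policy_file" ]; then
        # Keep only the text between the square brackets.
        sed -n 's/.*\[\([^]]*\)].*/\1/p' "$policy_file"
    else
        echo "unknown"
    fi
}

# Hypothetical helper: print the current kernel.split_lock_mitigate value.
split_lock_mitigate_value() {
    sysctl_file="${1:-/proc/sys/kernel/split_lock_mitigate}"
    if [ -r "$sysctl_file" ]; then
        cat "$sysctl_file"
    else
        echo "unknown"
    fi
}
```

Calling both with no arguments reads the live system; the boot parameters actually in effect can be seen with `cat /proc/cmdline`.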

The following video shows Borderlands 3 flawlessly running on Arch Linux on Ultra settings without stutters:
https://www.youtube.com/watch?v=hDWkD3LoKr0

This is all I had to say and I hope it helps with development.

@Tk-Glitch
Member

We are not touching intel_cpufreq, intel_pstate, nor any form of voltage scaling in any way though. All Intel big.LITTLE CPUs have voltage/frequency transition issues with E-cores enabled, or at least they do on most motherboards (my assumption being it's more of a firmware issue). This is most likely exacerbated by our aggressive ondemand governor, which is the only possible culprit, again as long as you're using the stock CPU scheduler (EEVDF). Using a different governor or the fake governors from intel_pstate will give you the exact same behavior as the stock kernel. Non-big.LITTLE Intel CPUs (tested on Xeons and older mainstream series) aren't affected. AMD CPUs don't have such an issue either.
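To check which governor is actually in use and which logical CPUs are P-cores or E-cores, something like the sketch below should work. The function names and the optional root arguments are hypothetical; the sysfs paths are the standard cpufreq ones, and the `cpu_core`/`cpu_atom` entries are only expected to exist on hybrid Intel CPUs.

```shell
# Hypothetical helper: list the distinct scaling governors across all CPUs.
governor_summary() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    cat "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort -u
}

# Hypothetical helper: print which logical CPUs are P-cores (cpu_core)
# and which are E-cores (cpu_atom). Prints nothing on non-hybrid CPUs.
hybrid_core_lists() {
    dev_root="${1:-/sys/devices}"
    for kind in cpu_core cpu_atom; do
        if [ -r "$dev_root/$kind/cpus" ]; then
            echo "$kind: $(cat "$dev_root/$kind/cpus")"
        fi
    done
}
```

If `governor_summary` prints `ondemand` on a linux-tkg build and the hang goes away after switching to another governor, that would support the explanation above.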

@SheMelody
Author

SheMelody commented Jul 2, 2024

The aggressive ondemand governor does indeed aggravate the problem. Still, intel_cpufreq works really trash on processors with e-cores, in general, even on other kernels; I don't really see a reason to use it on such processors. I understand that, as I said earlier, companies like Intel and AMD have been silently and heavily changing the x86/ACPI standards, but we, as users, even if extremely knowledgeable, can't really do anything about it.

Your kernel works fine when using intel_pstate either way and there's really no other viable alternative on such very recent processors, unless you're using something up to Intel 11th gen (or up to Ryzen 5000 when it comes to other issues that I'm not going to mention because that would lead us off-topic).

Try disabling the motherboard's dumb power management; that's what really causes microstutters and stuttering, notably on MSI, ASRock and Gigabyte boards. Generally, I disable those settings and leave everything else alone (except for a few minor tweaks, of course), and everything is buttery smooth.

@Tk-Glitch
Member

> Still, intel_cpufreq works really trash on processors with e-cores, in general, even on other kernels

You're absolutely right.

I do have a couple of Gigabyte boards around to test with, so I'll try to check this out. I'll need to borrow a 13th/14th gen mainstream CPU though ^^'

@SheMelody
Author

I have plenty of systems on which I have tested quite a lot of kernel parameters. Some motherboards were fine by default when it comes to microstutters, while most of them weren't.

As a small example, a system with an i5-7400 and an MSI board was microstuttering, especially in VKD3D, until I turned all of the motherboard's power management off, while a system with an i7-10700F and an ASUS board had no relevant stutters by default, even with ASPM and power management enabled all along.

The worst case I've seen is my boyfriend's system, which has an i7-14700K and a Gigabyte board. It stutters heavily unless all of the motherboard's power management is disabled; otherwise games are literally unplayable on it. My system, which has a 14700K and an MSI board, microstutters when using the motherboard's power management, just not as badly as my boyfriend's system does.

So yes, I'd definitely give that a try.
