[servers/pci] Investigate PCI probing code crash on 00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21) #134

Open
perlun opened this issue Oct 18, 2018 · 0 comments

perlun commented Oct 18, 2018

I noticed recently that with a very modern Dell Vostro laptop we have at work, chaos crashes completely on startup, which is kind of interesting. I debugged this a bit and concluded that it's actually the pci server that crashes when setting up the SMBus device.

I disabled this device in eab58d6, since "not detected" is much better than "crashing the machine". Someone with a strong love for the PCI hardware (...) would be very welcome to dig in and do a proper fix for this. I can volunteer to test any fix you make on this hardware (I have only seen it manifest on one single PC, ever.)

Steps taken to find this

While debugging this, I first tried the patch below, which limits the scan to functions 0-3 of each device and thereby ignores the remaining functions during scanning and setup.

(We could do what MINIX3 did (it was written after chaos had its peak years) and borrow the PCI scanning code from NetBSD instead of trying to write it on our own. The MINIX3 implementation, which is based on the NetBSD code, can be found here: https://github.com/Stichting-MINIX-Research-Foundation/minix/blame/master/sys/dev/pci/pci_subr.c)

diff --git a/servers/system/pci/pci.c b/servers/system/pci/pci.c
index 53c1de8..7fe5210 100644
--- a/servers/system/pci/pci.c
+++ b/servers/system/pci/pci.c
@@ -528,7 +528,7 @@ static pci_device_type *pci_scan_slot(pci_device_type *input_device)
     bool is_multi = FALSE;
     uint8_t header_type;
 
-    for (function = 0; function < 8; function++, input_device->device_function++)
+    for (function = 0; function < 4 /*8*/; function++, input_device->device_function++)
     {
         if (function != 0 && !is_multi)
         {

This is just a thought, but maybe it's wrong to assume that all PCI hosts support 8 functions per device, and this is causing the problem? There could be a flag that we can read somehow that determines how many functions should be scanned per device. By not honoring that flag, we would be using the hardware incorrectly, which it doesn't like, so it crashes in our face. Maybe worth investigating.
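For reference, the PCI specification allows up to eight functions (0-7) per device. The information a scanner is expected to honor is the multi-function flag (bit 7 of the header type register of function 0), together with a vendor ID of 0xFFFF meaning "no function here". The sketch below shows that rule in isolation. It is not the chaos code: `fake_config`, `fake_clear`, `fake_set_function`, `read_config8`, `read_config16` and `scan_slot_functions` are hypothetical names, and the config space is simulated in memory so the logic can be tried without hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PCI_VENDOR_ID      0x00   /* offset of the 16-bit vendor ID */
#define PCI_HEADER_TYPE    0x0E   /* offset of the 8-bit header type */
#define PCI_VENDOR_NONE    0xFFFF /* read back when no function responds */
#define PCI_HEADER_MULTI   0x80   /* bit 7: device has more than one function */
#define PCI_MAX_FUNCTIONS  8

/* Hypothetical stand-in for real config space access: one 256-byte
   block per function of a single slot. */
static uint8_t fake_config[PCI_MAX_FUNCTIONS][256];

/* Mark every function as absent (all bits set, as on real hardware). */
static void fake_clear(void)
{
    memset(fake_config, 0xFF, sizeof(fake_config));
}

/* Make one function respond with the given vendor ID and header type. */
static void fake_set_function(unsigned function, uint16_t vendor, uint8_t header_type)
{
    assert(function < PCI_MAX_FUNCTIONS);
    fake_config[function][PCI_VENDOR_ID] = (uint8_t)(vendor & 0xFF);
    fake_config[function][PCI_VENDOR_ID + 1] = (uint8_t)(vendor >> 8);
    fake_config[function][PCI_HEADER_TYPE] = header_type;
}

static uint8_t read_config8(unsigned function, unsigned offset)
{
    return fake_config[function][offset];
}

static uint16_t read_config16(unsigned function, unsigned offset)
{
    return (uint16_t)(fake_config[function][offset] |
                      (fake_config[function][offset + 1] << 8));
}

/* Returns a bitmask with bit N set when function N of the slot is present. */
static unsigned scan_slot_functions(void)
{
    unsigned present = 0;
    bool is_multi = false;

    for (unsigned function = 0; function < PCI_MAX_FUNCTIONS; function++)
    {
        /* Functions 1-7 only exist when function 0 sets the multi-function bit. */
        if (function != 0 && !is_multi)
        {
            break;
        }

        if (read_config16(function, PCI_VENDOR_ID) == PCI_VENDOR_NONE)
        {
            if (function == 0)
            {
                break; /* Empty slot: nothing else to look at. */
            }
            continue; /* Gaps between functions are allowed. */
        }

        if (function == 0)
        {
            is_multi = (read_config8(0, PCI_HEADER_TYPE) & PCI_HEADER_MULTI) != 0;
        }

        present |= 1u << function;
    }

    return present;
}
```

With slot 1f from the lspci output further down (functions 0, 2, 3 and 4 present), a multi-function slot set up this way would yield the mask 0x1D. The existing loop already carries an `is_multi` check, so this sketch only spells out the rule; it does not by itself explain the crash.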

Finding the failing device

I continued the investigation and, interestingly enough, it seems to be an SMBus device that doesn't like the way we probe its PCI slot:

diff --git a/servers/system/pci/pci.c b/servers/system/pci/pci.c
index 53c1de8..f0cfbc9 100644
--- a/servers/system/pci/pci.c
+++ b/servers/system/pci/pci.c
@@ -535,6 +535,12 @@ static pci_device_type *pci_scan_slot(pci_device_type *input_device)
             continue;
         }
 
+        // Some specific device 4 causing issues...?
+        if (function == 4)
+        {
+            continue;
+        }
+
         header_type = pci_read_config_uint8_t(input_device, PCI_HEADER_TYPE);
         input_device->header_type = header_type & 0x7F;
         device = pci_scan_device(input_device);

The code above skips function 4 in every slot, which is enough to exclude this device/function from the scanning:

00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
	Subsystem: Dell Sunrise Point-LP SMBus
	Flags: medium devsel, IRQ 16
	Memory at df232000 (64-bit, non-prefetchable) [size=256]
	I/O ports at f040 [size=32]
	Kernel driver in use: i801_smbus
	Kernel modules: i2c_i801

Does this SMBus device need to be probed in some special way or what's the deal here?
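A blanket `function == 4` check also hides function 4 of every other slot. A narrower workaround could match the full bus/device/function address instead. The following is only a sketch of that idea, not what eab58d6 does: `pci_address_type`, `skipped_devices` and `pci_should_skip` are hypothetical names, and the way chaos stores the address in `pci_device_type` may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address triple identifying one PCI function. */
typedef struct
{
    uint8_t bus;
    uint8_t device;
    uint8_t function;
} pci_address_type;

/* Functions that are known to break the setup code and are skipped for now. */
static const pci_address_type skipped_devices[] = {
    { 0x00, 0x1f, 4 }, /* 00:1f.4 Sunrise Point-LP SMBus on the Dell laptop */
};

/* Returns true when the given function is on the skip list. */
static bool pci_should_skip(uint8_t bus, uint8_t device, uint8_t function)
{
    /* A PCI bus has 32 device slots with up to 8 functions each. */
    assert(device < 32 && function < 8);

    for (size_t i = 0; i < sizeof(skipped_devices) / sizeof(skipped_devices[0]); i++)
    {
        const pci_address_type *entry = &skipped_devices[i];

        if (entry->bus == bus && entry->device == device && entry->function == function)
        {
            return true;
        }
    }

    return false;
}
```

In `pci_scan_slot`, a call such as `pci_should_skip(bus, device, function)` would replace the `function == 4` test. Matching on vendor and device ID would be more robust across machines, but those IDs are not part of the `lspci -v` output in this issue, so the sketch keys on the address.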

More details about the PCI subsystem on this machine

For reference, here is the full output of lspci:

$ lspci -v
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 08)
	Subsystem: Dell Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
	Flags: bus master, fast devsel, latency 0
	Capabilities: <access denied>

00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (rev 07) (prog-if 00 [VGA controller])
	Subsystem: Dell UHD Graphics 620
	Flags: bus master, fast devsel, latency 0, IRQ 128
	Memory at de000000 (64-bit, non-prefetchable) [size=16M]
	Memory at c0000000 (64-bit, prefetchable) [size=256M]
	I/O ports at f000 [size=64]
	[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: i915
	Kernel modules: i915

00:04.0 Signal processing controller: Intel Corporation Skylake Processor Thermal Subsystem (rev 08)
	Subsystem: Dell Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem
	Flags: fast devsel, IRQ 16
	Memory at df220000 (64-bit, non-prefetchable) [size=32K]
	Capabilities: <access denied>
	Kernel driver in use: proc_thermal
	Kernel modules: processor_thermal_device

00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21) (prog-if 30 [XHCI])
	Subsystem: Dell Sunrise Point-LP USB 3.0 xHCI Controller
	Flags: bus master, medium devsel, latency 0, IRQ 124
	Memory at df210000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: <access denied>
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci

00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
	Subsystem: Dell Sunrise Point-LP Thermal subsystem
	Flags: fast devsel, IRQ 18
	Memory at df237000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: intel_pch_thermal
	Kernel modules: intel_pch_thermal

00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
	Subsystem: Dell Sunrise Point-LP Serial IO I2C Controller
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at df236000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: intel-lpss
	Kernel modules: intel_lpss_pci

00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
	Subsystem: Dell Sunrise Point-LP CSME HECI
	Flags: bus master, fast devsel, latency 0, IRQ 127
	Memory at df235000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: mei_me
	Kernel modules: mei_me

00:17.0 RAID bus controller: Intel Corporation 82801 Mobile SATA Controller [RAID mode] (rev 21)
	Subsystem: Dell 82801 Mobile SATA Controller [RAID mode]
	Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 125
	Memory at df230000 (32-bit, non-prefetchable) [size=8K]
	Memory at df234000 (32-bit, non-prefetchable) [size=256]
	I/O ports at f090 [size=8]
	I/O ports at f080 [size=4]
	I/O ports at f060 [size=32]
	Memory at df233000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: <access denied>
	Kernel driver in use: ahci
	Kernel modules: ahci

00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 122
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 0000e000-0000efff
	Memory behind bridge: df100000-df1fffff
	Capabilities: <access denied>
	Kernel driver in use: pcieport

00:1c.5 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #6 (rev f1) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 123
	Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
	Memory behind bridge: df000000-df0fffff
	Capabilities: <access denied>
	Kernel driver in use: pcieport

00:1f.0 ISA bridge: Intel Corporation Intel(R) 100 Series Chipset Family LPC Controller/eSPI Controller - 9D4E (rev 21)
	Subsystem: Dell Intel(R) 100 Series Chipset Family LPC Controller/eSPI Controller - 9D4E
	Flags: bus master, fast devsel, latency 0

00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
	Subsystem: Dell Sunrise Point-LP PMC
	Flags: fast devsel
	Memory at df22c000 (32-bit, non-prefetchable) [disabled] [size=16K]

00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
	Subsystem: Dell Sunrise Point-LP HD Audio
	Flags: bus master, fast devsel, latency 32, IRQ 130
	Memory at df228000 (64-bit, non-prefetchable) [size=16K]
	Memory at df200000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: <access denied>
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel, snd_soc_skl

00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
	Subsystem: Dell Sunrise Point-LP SMBus
	Flags: medium devsel, IRQ 16
	Memory at df232000 (64-bit, non-prefetchable) [size=256]
	I/O ports at f040 [size=32]
	Kernel driver in use: i801_smbus
	Kernel modules: i2c_i801

01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
	Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
	Flags: bus master, fast devsel, latency 0, IRQ 16
	I/O ports at e000 [size=256]
	Memory at df104000 (64-bit, non-prefetchable) [size=4K]
	Memory at df100000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: r8169
	Kernel modules: r8169

02:00.0 Network controller: Intel Corporation Wireless 3165 (rev 79)
	Subsystem: Intel Corporation Wireless 3165
	Flags: bus master, fast devsel, latency 0, IRQ 129
	Memory at df000000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: <access denied>
	Kernel driver in use: iwlwifi
	Kernel modules: iwlwifi
perlun changed the title from "[servers/pci] Investigate pci server crashing Dell laptop" to "[servers/pci] Investigate PCI probing code crash on 00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)" on Oct 26, 2018
perlun added a commit that referenced this issue Oct 26, 2018
This is a known "bad device" in terms of our PCI setup code. Or rather, our PCI setup code breaks with this device, likely because it hasn't caught up with the last 20 years of development in the PC world. :-)

This will do for now; I have verified on the machine in question that we don't reboot on startup when this device is exempt from the PCI setup.

Issue for fixing this long-term: #134.