NVMe regression Macbook Air 7,1 (early 2015) - SSD not detected on recent kernels (T7131) #77

Closed
celticmagic opened this issue Aug 16, 2023 · 10 comments

@celticmagic
Collaborator

Florian Piesche (#fpiesche), 2018-10-30 22:29:49 UTC

Running Solus 3 on a Macbook Air 7,1 (early 2015). Recent kernel updates fail to boot; I suspect there might be a regression in the nvme driver.

4.16.15-76.current is the last kernel I've managed to boot on my system. Any updates since fail to boot with dracut unable to find the root partition and no /dev/nvme* devices present at all. The Solus 3.9999 live USB fails to detect the SSD; an earlier version (Solus 3? it's been a while since I installed...) could detect it fine.

It's worth noting that the Fedora 29 and Ubuntu 18.10 live USBs also fail to detect the SSD, so I think this issue might come from upstream. There was an upstream bug in 2015 about working around problems with this exact NVMe SSD/controller: https://bugzilla.kernel.org/show_bug.cgi?id=105621

Note in the following output that Solus does not detect the SSD itself, merely an "Apple NVMe Controller".

~$ uname -a
Linux hyatt 4.16.15-76.current #1 SMP PREEMPT Tue Jun 12 20:51:13 UTC 2018 x86_64 GNU/Linux

~$ lspci -vvkx -s 04:00.0
04:00.0 Mass storage controller: Apple Inc. S1X NVMe Controller (rev 01) (prog-if 02)
	Subsystem: Apple Inc. S1X NVMe Controller
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 256 bytes
	Interrupt: pin A routed to IRQ 48
	NUMA node: 0
	Region 0: Memory at c1300000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable+ Count=1/8 Maskable- 64bit+
		Address: 00000000fee00358  Data: 0000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <4us, L1 <32us
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <2us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR+, OBFF Not Supported
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [148 v1] Latency Tolerance Reporting
		Max snoop latency: 3145728ns
		Max no snoop latency: 3145728ns
	Kernel driver in use: nvme
00: 6b 10 01 20 06 04 10 08 01 02 80 01 40 00 00 00
10: 04 00 30 c1 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 6b 10 01 20
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 00 00

@celticmagic

Beatrice T. Meyers (#DataDrake), 2018-10-31 00:03:42 UTC

So two questions for ya:

  1. Are you sure that it was exactly that version? Looking at the git log there is nothing in release 77 to indicate that NVMe support changed at all.

  2. Where do I get a time machine? That kernel is so old, it was literally the last one published before I took over kernel updates. Seriously though, if you don't mind me asking: why did you wait so long to report and how the hell have you survived on such an old kernel?

@celticmagic

Florian Piesche (#fpiesche), 2018-10-31 00:16:12 UTC

I'm not sure if it was exactly that version, but it is certainly the only one I've got installed that's working, and the one I've been running ever since the problem occurred (a few months now). I have intermittently tried to run the newer kernels as updates have rolled out, but none of the ones I've tried since have worked. What was the next version published after 4.16.15-76? I'll be happy to incrementally install newer versions, release by release, until I find the point where it breaks :)
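Stepping through every release one at a time works, but a bisection needs far fewer reboots. Below is a minimal Python sketch of that idea. It assumes the release numbers between 76 and 96 (the two builds mentioned in this thread) are contiguous, that the oldest one boots and the newest one does not, and that once the regression appears it stays. The function name `first_bad` and the `boots` callback are hypothetical; in practice `boots` would be a manual "install, reboot, report back" step.

```python
from typing import Callable, Sequence


def first_bad(releases: Sequence[int], boots: Callable[[int], bool]) -> int:
    """Return the first release that fails to boot.

    Assumes releases[0] is known good, releases[-1] is known bad,
    and every release after the first bad one is also bad.
    """
    lo, hi = 0, len(releases) - 1  # lo: known good index, hi: known bad index
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if boots(releases[mid]):
            lo = mid  # regression was introduced after this release
        else:
            hi = mid  # regression is at or before this release
    return releases[hi]


# Release numbers 76..96; the intermediate numbers are an assumption.
candidates = list(range(76, 97))

# Hypothetical stand-in for a manual boot test: pretend release 85 broke things.
print(first_bad(candidates, lambda release: release < 85))
```

With 21 candidates this needs about five boot tests instead of up to twenty.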

The main reason I've waited so long before reporting is that I've had A Lot going on life-wise (health stuff, new baby, you name it) so I've only had very intermittent spots of time to try and fix the problem - though I did post on the Solus forums about it a few weeks back, before I noticed all the nvme devices were missing in the dracut shell.

As for survival on that old a kernel, it's not too terrible in day-to-day use? It's only this week or so that I noticed the headers for 4.16.15-76 seem to have aged off my system and I can't compile the FaceTime camera drivers anymore, which is what prompted me to try and make some time to dig into the issue properly...

@celticmagic

Brandon (#bwat47), 2018-10-31 00:39:22 UTC

I was playing with Solus on my 2013 MacBook Air the other day and noticed the SSD not being detected at all, as you're describing. I found that it works and detects the SSD if I boot with intel_iommu=off; might be worth a try.

@celticmagic

Florian Piesche (#fpiesche), 2018-10-31 09:29:53 UTC

I've tried booting 4.18.16-96 with intel_iommu=off, which sadly still ran into the same issue. For what it's worth, the switch to NVMe SSDs came with Apple's 2015 hardware revision, so this is likely why that fix doesn't work here...
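When testing a boot parameter such as intel_iommu=off, it can help to confirm that the parameter actually reached the kernel before concluding that it made no difference. A minimal Python sketch, assuming a Linux system that exposes the boot command line at /proc/cmdline; the helper name `kernel_param` and the sample command line are made up for illustration:

```python
from typing import Optional


def kernel_param(cmdline: str, name: str) -> Optional[str]:
    """Look up a parameter in a kernel command line string.

    Returns the value for "name=value", "" for a bare flag,
    and None if the parameter is absent. If the parameter appears
    more than once, the last occurrence is returned (an assumption
    about which one takes effect).
    """
    found = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            found = value if sep else ""
    return found


# Hypothetical command line, not taken from the reporter's machine.
sample = "root=UUID=hypothetical quiet intel_iommu=off"
print(kernel_param(sample, "intel_iommu"))
```

On a running system, passing the contents of /proc/cmdline instead of `sample` shows what the current boot was started with.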

@celticmagic

Beatrice T. Meyers (#DataDrake), 2018-10-31 11:20:08 UTC

I could really use a dmesg log to debug this. AFAICT the flag for that controller is still enabled in the kernel.

@celticmagic

Florian Piesche (#fpiesche), 2018-10-31 11:30:39 UTC

I'll run the live USB tonight and get a dmesg log off of there. Anything else that might be useful while I'm there?

It only now strikes me: the lspci output above is from the Solus system with it running under 4.16. I'll check if that differs at all on the 3.9999 live USB as well, just in case...

@celticmagic

Beatrice T. Meyers (#DataDrake), 2018-10-31 11:32:04 UTC

sudo journalctl -b wouldn't hurt

@celticmagic

Florian Piesche (#fpiesche), 2018-10-31 20:41:50 UTC

Here you go!

{F4029939}

{F4029940}

{F4029941}

@celticmagic

Florian Piesche (#fpiesche), 2019-09-09 12:00:14 UTC

Just to follow up on this (even though it's been a while): it seems this is related to this issue: Dunedan/mbp-2016-linux#71 (comment) (see also https://twitter.com/mjg59/status/1149060258116956160)

The tl;dr is that some of Apple's NVMe SSDs have wrong PCI device class identifiers, identifying them as controllers without attached storage (or something to that effect), and there may be other subtle differences in behaviour that trip up the nvme module in the kernel itself. Most certainly an upstream issue, in either case.
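The class identifier can be read straight out of the hex dump at the end of the lspci output in the first comment. Here is a minimal Python sketch that decodes the first 16 bytes of that dump; the function name `decode_header` is made up for illustration, and the comparison value 0x010802 is the standard PCI class code for an NVMe controller (mass storage, non-volatile memory subclass, NVMe programming interface).

```python
# First 16 bytes of PCI config space, copied from the lspci -x dump above.
HEADER_HEX = "6b 10 01 20 06 04 10 08 01 02 80 01 40 00 00 00"

# Standard NVMe class code: base class 0x01 (mass storage),
# subclass 0x08 (non-volatile memory), prog-if 0x02 (NVMe).
STANDARD_NVME_CLASS = 0x010802


def decode_header(hex_dump: str) -> dict:
    """Decode vendor, device, revision and class code from a PCI config header."""
    raw = bytes.fromhex(hex_dump)  # spaces between byte pairs are ignored
    return {
        "vendor": int.from_bytes(raw[0:2], "little"),
        "device": int.from_bytes(raw[2:4], "little"),
        "revision": raw[8],
        # Offsets 0x09..0x0b hold prog-if, subclass and base class.
        "class_code": int.from_bytes(raw[9:12], "little"),
    }


info = decode_header(HEADER_HEX)
print(f"vendor={info['vendor']:#06x} device={info['device']:#06x} "
      f"class={info['class_code']:#08x}")
print("matches standard NVMe class:", info["class_code"] == STANDARD_NVME_CLASS)
```

For this dump the decoded class is 0x018002, i.e. subclass 0x80 ("other" mass storage) rather than 0x08, which agrees with lspci labelling the device a generic "Mass storage controller". Anything that identifies NVMe devices purely by the standard class code would not recognise it; whether that is what broke here is not established in this thread.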

@silkeh added this to Solus Aug 23, 2023
@github-project-automation bot moved this to Triage in Solus Aug 23, 2023
@silkeh moved this from Triage to Needs More Info in Solus Aug 23, 2023
@davidjharder
Member

Old, and the original reporter thinks this is an upstream issue. Closing.

@davidjharder closed this as not planned Sep 17, 2023
@github-project-automation bot moved this from Needs More Info to Done in Solus Sep 17, 2023