Intel Arc A770: Kernel panic on kldload i915kms.ko #315

Open
kenrap opened this issue Aug 31, 2024 · 22 comments
Comments

@kenrap

kenrap commented Aug 31, 2024

Describe the bug
With a drm-kmod build from efd9167, loading the i915kms driver causes a kernel panic when an (Acer Predator BiFrost) Intel Arc A770 graphics card is installed. The panic persists even when the Intel onboard GPU is used while the same graphics card is installed.

FreeBSD version

FreeBSD freebsd 15.0-CURRENT FreeBSD 15.0-CURRENT #0 main-n271909-28294dc92476: Fri Aug 30 08:28:17 PDT 2024     root@freebsd:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-NODEBUG amd64 1500023 1500023

DRM KMOD version
My own "custom derived" graphics/drm-66-kmod port with GH_TAGNAME pointing to efd9167.

I also cloned linux-firmware with git-clone(1), copied all of the i915/dg2_* firmware binaries to /boot/modules, and renamed them to match the filename style used there.

To Reproduce
Boot into the system with either i915kms using kld_list inside /etc/rc.conf or kldload it manually.

Additional context
core.txt.0 dump

@kenrap
Author

kenrap commented Aug 31, 2024

My bad, I didn't see in PR #283 that "Intel DG2 GUC/HUC support" is not implemented yet.

Closing therefore.

I'm reopening this to have it serve as a milestone issue.

I figured there was no reason to keep it closed, since the issue is still valid and my report could be useful.

@kenrap kenrap closed this as completed Aug 31, 2024
@kenrap kenrap reopened this Oct 2, 2024
@wulf7
Contributor

wulf7 commented Oct 2, 2024

dg2_dmc_ver2_08.bin: could not load binary firmware /boot/firmware/dg2_dmc_ver2_08.bin either
i915/dg2_dmc_ver2_08.bin: could not load binary firmware /boot/firmware/i915/dg2_dmc_ver2_08.bin either
i915_dg2_dmc_ver2_08.bin: could not load binary firmware /boot/firmware/i915_dg2_dmc_ver2_08.bin either
i915_dg2_dmc_ver2_08_bin: could not load binary firmware /boot/firmware/i915_dg2_dmc_ver2_08_bin either
i915_dg2_dmc_ver2_08_bin: could not load binary firmware /boot/firmware/i915_dg2_dmc_ver2_08_bin either
drmn0: could not load firmware image 'i915/dg2_dmc_ver2_08.bin'
drmn0: [drm] Failed to load DMC firmware i915/dg2_dmc_ver2_08.bin. Disabling runtime power management.
drmn0: [drm] Run pkg install gpu-firmware-kmod to install it

You may start by adding dg2_dmc_ver2_08.bin to the firmware files.
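
For readers following the log above: the loader appears to retry the requested image under several spellings. The following is a minimal sketch of that renaming pattern, inferred only from the log lines and not taken from the LinuxKPI firmware code. The helper names, the 128-byte buffer size, and the exact substitution order are assumptions; the first attempt in the log (the name without the i915/ directory) is not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define FW_NAME_MAX 128	/* assumption: large enough for these names */

/*
 * Hypothetical helper: copy `name` into `out`, replacing every `from`
 * character with `to`. The result is always NUL-terminated.
 */
void
replace_char(const char *name, char from, char to, char *out, size_t outlen)
{
	size_t i;

	assert(name != NULL && out != NULL && outlen > 0);
	for (i = 0; name[i] != '\0' && i + 1 < outlen; i++)
		out[i] = (name[i] == from) ? to : name[i];
	out[i] = '\0';
}

/*
 * Build three spellings seen in the log for one requested image:
 *   out[0]  name as requested        i915/dg2_dmc_ver2_08.bin
 *   out[1]  '/' replaced by '_'      i915_dg2_dmc_ver2_08.bin
 *   out[2]  '.' also replaced by '_' i915_dg2_dmc_ver2_08_bin
 */
void
fw_name_candidates(const char *name, char out[3][FW_NAME_MAX])
{
	snprintf(out[0], FW_NAME_MAX, "%s", name);
	replace_char(out[0], '/', '_', out[1], FW_NAME_MAX);
	replace_char(out[1], '.', '_', out[2], FW_NAME_MAX);
}
```

This is why a firmware file copied by hand has to match one of the spellings the loader actually tries; a file named differently is never found.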

@wulf7
Contributor

wulf7 commented Oct 2, 2024

But I doubt that it will help

@kenrap
Author

kenrap commented Oct 2, 2024

I'll try that out and report back.

I also updated my bug description to be more specific.

@kenrap
Author

kenrap commented Oct 2, 2024

> But I doubt that it will help

And you're right, it didn't.

After trying a couple more ideas, I spent a good amount of time re-learning how to create a new core dump of the kernel panic with the firmware(s) loaded. Sorry for the delay.

core.txt.1 dump

@wulf7
Contributor

wulf7 commented Oct 2, 2024

After taking a look at the code around the faulting line, I got the impression that this can happen due to the missing vmap_pfn() implementation.
It is rather easy to check: just replace the return NULL; line in the i915_gem_object_map_pfn() function, located in the drivers/gpu/drm/i915/gem/i915_gem_pages.c file of drm-kmod, with panic("oops"); or return ERR_PTR(-ENOSUP);
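
To illustrate why a bare return NULL; can be dangerous here, the following is a minimal userland sketch of the Linux error-pointer convention. The helpers are simplified stand-ins for ERR_PTR(), IS_ERR() and PTR_ERR(), and caller_accepts() is a hypothetical caller; that the real i915 caller checks only for error pointers is an assumption, not something confirmed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095	/* error pointers occupy the top 4095 addresses */

/* Encode a negative errno value as a pointer. */
void *
err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return ((void *)(intptr_t)error);
}

/* True only for pointers in the reserved error range. NULL is not one. */
int
is_err(const void *ptr)
{
	return ((uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO);
}

/* Recover the negative errno value from an error pointer. */
long
ptr_err(const void *ptr)
{
	return ((long)(intptr_t)ptr);
}

/*
 * Hypothetical caller: anything that is not an error pointer is treated
 * as a valid mapping and would be dereferenced later.
 */
int
caller_accepts(const void *mapping)
{
	return (!is_err(mapping));
}
```

Under this convention, caller_accepts(NULL) is true, so a NULL mapping would slip through and fault on first use, while an error pointer is rejected up front.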

@kenrap
Author

kenrap commented Oct 2, 2024

It seems you're correct about the missing vmap_pfn() implementation. Using panic("oops"); as you instructed triggered an "oops" panic.

I created this custom patch and put it into my /usr/ports/graphics/drm-66-kmod/files directory:

--- drivers/gpu/drm/i915/gem/i915_gem_pages.c.orig
+++ drivers/gpu/drm/i915/gem/i915_gem_pages.c
@@ -329,7 +329,7 @@ static void *i915_gem_object_map_pfn(struct drm_i915_gem_object *obj,
 {
 #ifdef __FreeBSD__
        // BSDFIXME: Need vmap_pfn() implementation.
-       return NULL;
+       panic("oops");
 #else
        resource_size_t iomap = obj->mm.region->iomap.base -
                obj->mm.region->region.start;

Then I rebuilt and reinstalled the package of my derived port, did my usual testing, and grabbed a new core dump, which shows the "oops" panic. Yay! \o/

core.txt.2

@wulf7
Contributor

wulf7 commented Nov 16, 2024

You may try the following patches (only compile-tested). They are just a quick conversion of the vmap() implementation to vmap_pfn() by replacing struct page with a page frame number.
FreeBSD:

Incomplete patch deleted. See patch three messages below

and drm-kmod:

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
index 931e7f46733..0ba955611df 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
@@ -327,10 +327,6 @@ static void *i915_gem_object_map_page(struct drm_i915_gem_object *obj,
 static void *i915_gem_object_map_pfn(struct drm_i915_gem_object *obj,
 				     enum i915_map_type type)
 {
-#ifdef __FreeBSD__
-	// BSDFIXME: Need vmap_pfn() implementation.
-	return NULL;
-#else
 	resource_size_t iomap = obj->mm.region->iomap.base -
 		obj->mm.region->region.start;
 	unsigned long n_pfn = obj->base.size >> PAGE_SHIFT;
@@ -356,7 +352,6 @@ static void *i915_gem_object_map_pfn(struct drm_i915_gem_object *obj,
 		kvfree(pfns);
 
 	return vaddr ?: ERR_PTR(-ENOMEM);
-#endif
 }
 
 /* get, pin, and map the pages of the object into kernel space */
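
As background for the struct page to page frame number conversion, here is a minimal userland sketch of the arithmetic involved: a page frame number is a physical address shifted right by the page shift, and shifting it back gives the start of the page. The sk_ names are hypothetical, and 4 KiB pages are assumed, as on amd64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Physical address -> page frame number (drops the in-page offset). */
uint64_t
sk_addr_to_pfn(uint64_t paddr)
{
	return (paddr >> SK_PAGE_SHIFT);
}

/* Page frame number -> physical address of the start of that page. */
uint64_t
sk_pfn_to_addr(uint64_t pfn)
{
	return (pfn << SK_PAGE_SHIFT);
}

/*
 * Fill pfns[] for a physically contiguous range of `size` bytes starting
 * at `base`, writing at most `max` entries. Returns the entry count.
 */
size_t
sk_fill_pfns(uint64_t base, size_t size, uint64_t *pfns, size_t max)
{
	size_t i, n;

	assert(pfns != NULL);
	n = size >> SK_PAGE_SHIFT;
	if (n > max)
		n = max;
	for (i = 0; i < n; i++)
		pfns[i] = sk_addr_to_pfn(base) + i;
	return (n);
}
```

The point of the pfn form is that device memory (such as the card's local memory aperture) can be mapped without needing a struct page for every frame.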

@kenrap
Author

kenrap commented Nov 16, 2024

@wulf7 core.txt.0

@kenrap
Author

kenrap commented Nov 16, 2024

I should note that after a lot of experimentation, I was "lucky" enough exactly once to get i915kms.ko to load with my Intel Arc, but I noticed this warning at the time:

drmn0: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

But that only happened once and it wasn't reproducible.

Also, I don't know if I need to get another coredump by using the following in /boot/loader.conf:

hw.i915kms.enable_guc="2"
compat.linuxkpi.i915_disable_power_well="0"

I ask because I changed the first tunable to 1 and commented out the second one. I needed both set as above to load i915kms on my integrated Intel GPU and get back into X11.

But right now I need some rest, because I went through something like 20+ kernel-panic reboots while experimenting.

@wulf7
Contributor

wulf7 commented Nov 17, 2024

Next version of FreeBSD patch:

diff --git a/sys/compat/linuxkpi/common/include/linux/vmalloc.h b/sys/compat/linuxkpi/common/include/linux/vmalloc.h
index 00650a2df9b..30f7e0e6297 100644
--- a/sys/compat/linuxkpi/common/include/linux/vmalloc.h
+++ b/sys/compat/linuxkpi/common/include/linux/vmalloc.h
@@ -35,8 +35,11 @@
 #define	VM_MAP		0x0000
 #define	PAGE_KERNEL	0x0000
 
+#define	vmap_pfn(...)	lkpi_vmap_pfn(__VA_ARGS__)
+
 void *vmap(struct page **pages, unsigned int count, unsigned long flags,
     int prot);
+void *lkpi_vmap_pfn(unsigned long *pfns, unsigned int count, int prot);
 void vunmap(void *addr);
 
 #endif	/* _LINUXKPI_LINUX_VMALLOC_H_ */
diff --git a/sys/compat/linuxkpi/common/src/linux_compat.c b/sys/compat/linuxkpi/common/src/linux_compat.c
index 81d24603d1d..bce3af61516 100644
--- a/sys/compat/linuxkpi/common/src/linux_compat.c
+++ b/sys/compat/linuxkpi/common/src/linux_compat.c
@@ -60,6 +60,9 @@
 #include <vm/vm_page.h>
 #include <vm/vm_pager.h>
 
+#include <vm/uma.h>
+#include <vm/uma_int.h>
+
 #include <machine/stdarg.h>
 
 #if defined(__i386__) || defined(__amd64__)
@@ -1804,6 +1807,24 @@ vmmap_remove(void *addr)
 	return (vmmap);
 }
 
+int
+is_vmalloc_addr(const void *addr)
+{
+	struct vmmap *vmmap;
+	uintptr_t p = (uintptr_t)addr;
+
+	mtx_lock(&vmmaplock);
+	LIST_FOREACH(vmmap, &vmmaphead[VM_HASH(addr)], vm_next)
+		if (p >= trunc_page(vmmap->vm_addr) &&
+		    p < round_page((char *)vmmap->vm_addr + vmmap->vm_size))
+			break;
+	mtx_unlock(&vmmaplock);
+	if (vmmap != NULL)
+		return(1);
+
+	return (vtoslab((vm_offset_t)addr & ~UMA_SLAB_MASK) != NULL);
+}
+
 #if defined(__i386__) || defined(__amd64__) || defined(__powerpc__) || defined(__aarch64__) || defined(__riscv)
 void *
 _ioremap_attr(vm_paddr_t phys_addr, unsigned long size, int attr)
@@ -1849,6 +1870,58 @@ vmap(struct page **pages, unsigned int count, unsigned long flags, int prot)
 	return ((void *)off);
 }
 
+#ifdef __amd64__
+static void
+_lkpi_pmap_qenter_pfn(vm_offset_t sva, vm_pindex_t *pi, int count,
+    vm_memattr_t mode)
+{
+	pt_entry_t *endpte, oldpte, pa, *pte;
+	vm_pindex_t p;
+	int cache_bits;
+	pt_entry_t pg_g;
+
+	pg_g = pti ? 0 : X86_PG_G;
+	oldpte = 0;
+	pte = vtopte(sva);
+	endpte = pte + count;
+	cache_bits = pmap_cache_bits(kernel_pmap, mode, false);
+	while (pte < endpte) {
+		p = *pi++;
+		pa = IDX_TO_OFF(p) | cache_bits;
+		if ((*pte & (PG_FRAME | X86_PG_PTE_CACHE)) != pa) {
+			oldpte |= *pte;
+			pte_store(pte, pa | pg_g | pg_nx | X86_PG_A |
+			    X86_PG_M | X86_PG_RW | X86_PG_V);
+		}
+		pte++;
+	}
+	if (__predict_false((oldpte & X86_PG_V) != 0))
+		pmap_invalidate_range(kernel_pmap, sva, sva + count *
+		    PAGE_SIZE);
+}
+#endif
+
+void *
+lkpi_vmap_pfn(unsigned long *pfns, unsigned int count, int prot)
+{
+#ifdef __amd64__
+	vm_offset_t off;
+	size_t size;
+
+	size = count * PAGE_SIZE;
+	off = kva_alloc(size);
+	if (off == 0)
+		return (NULL);
+	vmmap_add((void *)off, size);
+	_lkpi_pmap_qenter_pfn(off, pfns, count, pgprot2cachemode(prot));
+
+	return ((void *)off);
+#else
+	panic("vmap_pfn is not implemented");
+	return (NULL);
+#endif
+}
+
 void
 vunmap(void *addr)
 {
diff --git a/sys/compat/linuxkpi/common/src/linux_page.c b/sys/compat/linuxkpi/common/src/linux_page.c
index 25243382f9e..1ed8b99cdf3 100644
--- a/sys/compat/linuxkpi/common/src/linux_page.c
+++ b/sys/compat/linuxkpi/common/src/linux_page.c
@@ -53,9 +53,6 @@
 #include <vm/vm_reserv.h>
 #include <vm/vm_extern.h>
 
-#include <vm/uma.h>
-#include <vm/uma_int.h>
-
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/preempt.h>
@@ -287,12 +284,6 @@ lkpi_get_user_pages(unsigned long start, unsigned long nr_pages,
 	    !!(gup_flags & FOLL_WRITE), pages));
 }
 
-int
-is_vmalloc_addr(const void *addr)
-{
-	return (vtoslab((vm_offset_t)addr & ~UMA_SLAB_MASK) != NULL);
-}
-
 vm_fault_t
 lkpi_vmf_insert_pfn_prot_locked(struct vm_area_struct *vma, unsigned long addr,
     unsigned long pfn, pgprot_t prot)
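
The range test added to is_vmalloc_addr() in the patch above can be tried out in userland. This is only a sketch of the page-granular containment check: the sk_ names are hypothetical, 4 KiB pages are assumed, and a plain array scan stands in for the patch's locked hash-bucket lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE ((uintptr_t)4096)	/* assumption: 4 KiB pages */
#define SK_PAGE_MASK (SK_PAGE_SIZE - 1)

/* One recorded mapping: start address and length in bytes. */
struct sk_mapping {
	uintptr_t addr;
	size_t size;
};

/* Round an address down to the start of its page. */
uintptr_t
sk_trunc_page(uintptr_t a)
{
	return (a & ~SK_PAGE_MASK);
}

/* Round an address up to the next page boundary. */
uintptr_t
sk_round_page(uintptr_t a)
{
	return ((a + SK_PAGE_MASK) & ~SK_PAGE_MASK);
}

/* Return 1 if p lies within the page-aligned extent of any mapping. */
int
sk_in_mappings(const struct sk_mapping *maps, size_t n, uintptr_t p)
{
	size_t i;

	assert(maps != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (p >= sk_trunc_page(maps[i].addr) &&
		    p < sk_round_page(maps[i].addr + maps[i].size))
			return (1);
	}
	return (0);
}
```

Because both ends are rounded to page boundaries, an address in the tail of the last page counts as inside the mapping even if it is past addr + size.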

@kenrap
Author

kenrap commented Nov 17, 2024

Will get on it, thanks! 😎

@kenrap
Author

kenrap commented Nov 17, 2024

@wulf7 core.txt.1

@wulf7
Contributor

wulf7 commented Nov 18, 2024

It is strange. This panic looks like no patches have been applied. Did you apply the drm-kmod patch?

@kenrap
Author

kenrap commented Nov 18, 2024

I still kept that same patch in the files directory of my drm-kmod port like last time. But I rebuilt the package in an updated poudriere jail using a non-clean build with your latest patch.

I might want to try this over again with a clean FreeBSD build and redo my steps with more care.

@kenrap
Author

kenrap commented Nov 18, 2024

Actually, you were spot on. I was moving the patch around outside of the files directory and it didn't get applied for the second round. My apologies.

Gonna do a clean build anyway. 😄

@kenrap
Author

kenrap commented Nov 18, 2024

@wulf7 core.txt.2

This time with interesting GPU HANG output.

@wulf7
Contributor

wulf7 commented Nov 18, 2024

Firmware loaded successfully this time.

But unfortunately I have no idea how to debug GPU hangs.

@kenrap
Author

kenrap commented Nov 18, 2024

Understood. And I appreciate all of your help here regardless. Thanks for working on those patches.

@kenrap
Author

kenrap commented Nov 18, 2024

Okay, I got good news and bad news.

Good news:
I got past the GPU hangs and successfully loaded the i915kms driver in tty with just the following in /boot/loader.conf:

hw.i915kms.modeset="1"

Bad news:
X11 never starts and it's likely because of the following errors:

Nov 18 06:41:56 freebsd kernel: drmn0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
Nov 18 06:41:56 freebsd kernel: drmn0: [drm] *ERROR* GT0: Enabling uc failed (-5)
Nov 18 06:41:56 freebsd kernel: drmn0: [drm] *ERROR* GT0: Failed to initialize GPU, declaring it wedged!
Nov 18 06:41:56 freebsd kernel: drmn0: [drm:0xffffffff83f34660s] 0xfffffe0229493808Vsysctl_warn_reuse: can't re-use a leaf (hw.dri.debug)!

It's crucial to be able to use hw.i915kms.enable_guc="2" for the Intel Arc, but sadly, with it enabled (including the "1" and "3" values), the kernel always panics. So I'm in a catch-22 here.
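
For context on the "1", "2" and "3" values: in the upstream Linux i915 driver the enable_guc module parameter is documented as a bitmask (-1 = auto, 0 = disabled, bit 0 = GuC submission, bit 1 = HuC load). The sketch below decodes a value that way; that the hw.i915kms.enable_guc tunable follows the same encoding is an assumption here, and the sk_ names are hypothetical.

```c
#define SK_GUC_SUBMISSION (1 << 0)	/* bit 0: GuC command submission */
#define SK_GUC_LOAD_HUC   (1 << 1)	/* bit 1: load HuC firmware */

/* 1 if the value explicitly requests GuC submission; auto/disabled give 0. */
int
sk_wants_guc_submission(int enable_guc)
{
	return (enable_guc > 0 && (enable_guc & SK_GUC_SUBMISSION) != 0);
}

/* 1 if the value explicitly requests HuC loading; auto/disabled give 0. */
int
sk_wants_huc_load(int enable_guc)
{
	return (enable_guc > 0 && (enable_guc & SK_GUC_LOAD_HUC) != 0);
}
```

Read this way, "2" asks for HuC load only, "1" for GuC submission only, and "3" for both, so all three values exercise the uC firmware path.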

@wulf7
Contributor

wulf7 commented Nov 18, 2024

GUC/HUC is not supported by drm-kmod on DG2 yet. It requires porting the MEI and PXP drivers.

@kenrap
Author

kenrap commented Nov 18, 2024

Ah, yeah, that's right. Thanks for the reminder there.
