Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Driver load BSOD with Aomei driver #384

Open
lundman opened this issue Jun 12, 2024 · 31 comments
Open

Driver load BSOD with Aomei driver #384

lundman opened this issue Jun 12, 2024 · 31 comments

Comments

@lundman
Copy link

lundman commented Jun 12, 2024

I have been trying to find out why loading openfs.sys will BSOD if Aomei's BackUpper is installed. The crash is generally not instant, but a few second (1-10s) later. The stack is indicative of stack overflow, or list corruption and never related to either driver. Usually when receiving network IO.

I do not think there is a bug in Aomei, but rather simply triggered by them querying openzfs.sys.

  1. However, I can ignore all Driver->MajorFunction calls from Aomei, and we will STILL crash. So it isn't anything returned in dispatcher. But what else is there?

  2. If I do not call StorPortInitialize() we do not crash. I do not think there is a storport issue, but rather that it creates a device type that Aomei is interested in, and if we do not have storport, whatever code that triggers the BSOD is just not executed.

  3. It doesn't always BSOD. Of 5 loads, it crashes about 4 times. Sometimes it loads fine and I can use it, benchmark etc. So presumably a race somewhere, but not to calling MajorFunctions.

I am unsure what to do next, a lot of things recommend verifier - but this is a driver load issue, so I do not know if that will help.

I doubt Aomei developers are interested in debugging.

Example stack of BSOD:

!analyze -v

BAD_POOL_CALLER (c2)
The current thread is making a bad pool request.  Typically this is at a bad IRQL level or double freeing the same allocation, etc.
Arguments:
Arg1: 000000000000000d, Attempt to release quota on a corrupted pool allocation.
Arg2: ffffe3086bb6b5c0, Address of pool
Arg3: 00000000fffff807, Pool allocation's tag
Arg4: 7ac30bb7e5c716f2, Quota process pointer (bad).

Debugging Details:
------------------

BUGCHECK_CODE:  c2

BUGCHECK_P1: d

BUGCHECK_P2: ffffe3086bb6b5c0

BUGCHECK_P3: fffff807

BUGCHECK_P4: 7ac30bb7e5c716f2

PROCESS_NAME:  System

STACK_TEXT:  
: ffffbe8b`fd852e00 fffff807`38d12870 ffff8b01`24367180 00000000`00000001 : nt!DbgBreakPointWithStatus
: ffff8b01`00000003 ffffbe8b`fd852e00 fffff807`38e4e9f0 00000000`000000c2 : nt!KiBugCheckDebugBreak+0x12
: 00006504`000a0000 ffffbe8b`fd853680 7ac30bb7`e5c716f2 fffff807`394f42ba : nt!KeBugCheck2+0xba3
: 00000000`000000c2 00000000`0000000d ffffe308`6bb6b5c0 00000000`fffff807 : nt!KeBugCheckEx+0x107
: ffffe308`6bb6b5b0 ffffe308`fffff807 00000000`00000d40 00000000`00000000 : nt!ExReturnPoolQuota+0x222647
: ffffe308`00000000 ffffbe8b`fd853620 ffffe308`00000000 00000000`00000000 : nt!ExFreePoolWithTag+0x1e5
: fffff807`3b4732fa 00000000`00000000 ffffe308`66c08fe0 ffffe308`61886210 : NETIO!WfpPoolFree+0x1a
: ffffe308`6bb6b5c0 ffffe308`6aee2568 ffffe308`6b4fae90 ffffe308`667a3a88 : NETIO!StreamDerefCalloutContext+0x2f
: ffffe308`6b9fb4b0 ffffe308`6aee2568 00000000`00000125 ffffe308`6e62e130 : NETIO!StreamDeleteCalloutContext+0x76
: ffffe308`6b9fb4b0 ffffe308`6aee2568 ffffe308`6b4fae90 00000000`00000000 : NETIO!WfpNotifyFlowContextDelete+0x34
: ffffbcd2`2a00ff00 ffffe308`6b4fae90 ffffe308`6aee2520 ffffbe8b`fd853810 : NETIO!KfdAleNotifyFlowDeletion+0x1ae
: ffffe308`67066010 ffffe308`67066010 ffffe308`66826500 00000000`00000000 : tcpip!TcpCleanupTcbWorkQueueRoutine+0x15a
: ffffe308`66997cc0 fffff807`3b911d80 ffffe308`66997cc0 fffff807`38cf50ff : tcpip!TcpCloseTcbWorkQueueRoutine+0x11b
: ffffe308`66a86260 ffffe308`61f6e650 00000000`00000000 ffffe308`00000000 : NETIO!NetiopIoWorkItemRoutine+0x57
: ffffe308`618a3c90 ffffe308`6b47c040 ffffbe8b`fd853b00 ffffe308`00000000 : nt!IopProcessWorkItem+0x8e
: ffffe308`6b47c040 00000000`0000076e ffffe308`6b47c040 fffff807`38cf49e0 : nt!ExpWorkerThread+0x155
: fffff807`37795180 ffffe308`6b47c040 fffff807`38d1b520 5b5b5b5b`5b5b5b5b : nt!PspSystemThreadStartup+0x57
: ffffbe8b`fd854000 ffffbe8b`fd84e000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34

SYMBOL_NAME:  NETIO!WfpPoolFree+1a

MODULE_NAME: NETIO

IMAGE_NAME:  NETIO.SYS

@lundman
Copy link
Author

lundman commented Jun 12, 2024

BSOD type 2

KERNEL_SECURITY_CHECK_FAILURE (139)
A kernel component has corrupted a critical data structure.  The corruption
could potentially allow a malicious user to gain control of this machine.
Arguments:
Arg1: 0000000000000003, A LIST_ENTRY has been corrupted (i.e. double remove).
Arg2: ffffbe8bfd845600, Address of the trap frame for the exception that caused the BugCheck
Arg3: ffffbe8bfd845558, Address of the exception record for the exception that caused the BugCheck
Arg4: 0000000000000000, Reserved

Debugging Details:
------------------

BUGCHECK_CODE:  139

BUGCHECK_P1: 3

BUGCHECK_P2: ffffbe8bfd845600

BUGCHECK_P3: ffffbe8bfd845558

BUGCHECK_P4: 0

TRAP_FRAME:  ffffbe8bfd845600 -- (.trap 0xffffbe8bfd845600)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=ffffe3086b309310 rbx=0000000000000000 rcx=0000000000000003
rdx=ffffe3086bd4a780 rsi=0000000000000000 rdi=0000000000000000
rip=fffff80738d20b8c rsp=ffffbe8bfd845790 rbp=ffffe308667a3d40
 r8=ade4b591e32e688b  r9=ffffe3086674d920 r10=0000000000000000
r11=0000000000007070 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
nt!RtlRemoveEntryHashTable+0x4c:
fffff807`38d20b8c cd29            int     29h
Resetting default scope

EXCEPTION_RECORD:  ffffbe8bfd845558 -- (.exr 0xffffbe8bfd845558)
ExceptionAddress: fffff80738d20b8c (nt!RtlRemoveEntryHashTable+0x000000000000004c)
   ExceptionCode: c0000409 (Security check failure or stack buffer overrun)
  ExceptionFlags: 00000001
NumberParameters: 1
   Parameter[0]: 0000000000000003
Subcode: 0x3 FAST_FAIL_CORRUPT_LIST_ENTRY 

PROCESS_NAME:  System

ERROR_CODE: (NTSTATUS) 0xc0000409 - The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of this application.

EXCEPTION_CODE_STR:  c0000409

EXCEPTION_PARAMETER1:  0000000000000003

EXCEPTION_STR:  0xc0000409

STACK_TEXT:  
: ffffbe8b`fd844c30 fffff807`38d12870 fffff807`37795180 00000000`00000001 : nt!DbgBreakPointWithStatus
: fffff807`00000003 ffffbe8b`fd844c30 fffff807`38e4e9f0 00000000`00000139 : nt!KiBugCheckDebugBreak+0x12
: 00000000`2e7af43a 00000000`00000000 ffffe308`6bd4a7a0 fffff807`38e9adbc : nt!KeBugCheck2+0xba3
: 00000000`00000139 00000000`00000003 ffffbe8b`fd845600 ffffbe8b`fd845558 : nt!KeBugCheckEx+0x107
: 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x69
: ffffe308`61202140 00000000`00000000 00000000`00000000 fffff807`38c78349 : nt!KiFastFailDispatch+0xb2
: ffffe308`66c06fe0 00000000`00000001 ffffbe8b`fd845850 fffff807`3b473349 : nt!KiRaiseSecurityCheckFailure+0x346
: ffffe308`66c06fe0 badbadfa`badbadfa ffffbe8b`fd845850 fffff807`3b603747 : nt!RtlRemoveEntryHashTable+0x4c
: ffffbe8b`fd845878 ffffbe8b`fd8458a9 00000000`00000000 00000000`00000000 : NETIO!FreeSomeCacheBucketEntries+0x7d
: ffffe308`667a38e0 00000000`00000000 00000000`00000400 00000000`00000000 : NETIO!FreeSomeCacheEntries+0x3e
: ffffe308`66a88640 00000000`0000ff00 00000000`00000000 fffff807`38ce0a70 : NETIO!UpdateCacheLruBucket+0x828
: ffffe308`66000000 ffffe308`667a3de8 00000000`00000000 ffffe308`6688ad80 : NETIO!LruCleanupWorkItemRoutine+0x9b
: ffffe308`66a88640 ffffe308`66fcc520 00000000`00000000 ffffe308`00000000 : NETIO!NetiopIoWorkItemRoutine+0x57
: ffffe308`618a3c90 ffffe308`6aed3040 ffffbe8b`fd845b00 00796c70`00000000 : nt!IopProcessWorkItem+0x8e
: ffffe308`6aed3040 00000000`000004f0 ffffe308`6aed3040 fffff807`38cf49e0 : nt!ExpWorkerThread+0x155
: fffff807`37795180 ffffe308`6aed3040 fffff807`38d1b520 00000000`00000000 : nt!PspSystemThreadStartup+0x57
: ffffbe8b`fd846000 ffffbe8b`fd840000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34


SYMBOL_NAME:  NETIO!FreeSomeCacheBucketEntries+7d

MODULE_NAME: NETIO

IMAGE_NAME:  NETIO.SYS

@lundman
Copy link
Author

lundman commented Jun 12, 2024

verifier spat this out though:

ATTEMPTED_WRITE_TO_READONLY_MEMORY

ATTEMPTED_WRITE_TO_READONLY_MEMORY (be)
An attempt was made to write to readonly memory.  The guilty driver is on the
stack trace (and is typically the current instruction pointer).
When possible, the guilty driver's name (Unicode string) is printed on
the BugCheck screen and saved in KiBugCheckDriver.
Arguments:
Arg1: fffff8073bbfd740, Virtual address for the attempted write.
Arg2: 00000001be48a121, PTE contents.
Arg3: ffffbe8bfc513fd0, (reserved)
Arg4: 000000000000000a, (reserved)

Debugging Details:
------------------


KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 7030

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 40305

    Key  : Analysis.Init.CPU.mSec
    Value: 546

    Key  : Analysis.Init.Elapsed.mSec
    Value: 112129

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 89

    Key  : WER.OS.Branch
    Value: ni_release

    Key  : WER.OS.Timestamp
    Value: 2022-05-06T12:50:00Z

    Key  : WER.OS.Version
    Value: 10.0.22621.1


BUGCHECK_CODE:  be

BUGCHECK_P1: fffff8073bbfd740

BUGCHECK_P2: 1be48a121

BUGCHECK_P3: ffffbe8bfc513fd0

BUGCHECK_P4: a

PROCESS_NAME:  System

TRAP_FRAME:  ffffbe8bfc513fd0 -- (.trap 0xffffbe8bfc513fd0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffff8073bbfd730 rbx=0000000000000000 rcx=48d68b48204f8b48
rdx=fffff8073bbfd73e rsi=0000000000000000 rdi=0000000000000000
rip=fffff8073ae1c049 rsp=ffffbe8bfc514160 rbp=0000000000000008
 r8=fffff8073bbfd730  r9=ffffbe8bfc514240 r10=ffffe3086d890720
r11=ffffbe8bfc514240 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na po nc
storport!StorPortNotification+0x1eea9:
fffff807`3ae1c049 c7401002200708  mov     dword ptr [rax+10h],8072002h ds:fffff807`3bbfd740=4c000044
Resetting default scope

LOCK_ADDRESS:  fffff8073967a7c0 -- (!locks fffff8073967a7c0)

Resource @ nt!PiEngineLock (0xfffff8073967a7c0)    Exclusively owned
    Contention Count = 16
    NumberOfExclusiveWaiters = 1
     Threads: ffffe30861972080-01<*> 

     Threads Waiting On Exclusive Access:
              ffffe308619da080       
1 total locks

PNP_TRIAGE_DATA: 
	Lock address  : 0xfffff8073967a7c0
	Thread Count  : 1
	Thread address: 0xffffe30861972080
	Thread wait   : 0xb7a9

STACK_TEXT:  
: ffffbe8b`fc5137b0 fffff807`38d12870 ffff8b01`24367180 fffff807`3bbfd701 : nt!DbgBreakPointWithStatus
: ffff8b01`00000003 ffffbe8b`fc5137b0 fffff807`38e4ea80 ffffbe8b`fc513d60 : nt!KiBugCheckDebugBreak+0x12
: 00000000`00000000 00000000`00000000 00000001`be48a121 fffff807`38e3a414 : nt!KeBugCheck2+0xba3
: 00000000`000000be fffff807`3bbfd740 00000001`be48a121 ffffbe8b`fc513fd0 : nt!KeBugCheckEx+0x107
: 00000001`be48a121 00000000`00000003 ffffbe8b`fc513f69 00000000`00000000 : nt!MiRaisedIrqlFault+0x17ed10
: 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MmAccessFault+0x363
: 00000000`00000000 00000000`00000000 4c4f565a`00000020 00000000`0000007c : nt!KiPageFault+0x369
: 01dabc84`312c5356 ffffe308`6ddd3580 ffffe308`692b0f80 00000000`000000ff : storport!StorPortNotification+0x1eea9
: fffff801`aae81000 fffff801`00000001 0001a64c`00935000 ffffbe8b`fc5147e0 : OpenZFS!wzvol_HwReportLog+0x2ee [C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol.c @ 605] 
: 00000000`00000000 00000000`00000000 00000000`00000000 ffffbe8b`00000000 : OpenZFS!ScsiOpReportLuns+0x5f [C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol_scsi.c @ 866] 
: ffffe308`6af83040 fffff807`38de2cfb 00000001`bc077ee4 00000000`00989680 : OpenZFS!ScsiExecuteMain+0x249 [C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol_scsi.c @ 346] 
: ffffe308`6d3e51a0 00000000`00000000 ffffe308`6d8901a0 ffffe308`6af83040 : OpenZFS!wzvol_HwStartIo+0x74 [C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol.c @ 721] 
: ffffbe8b`fc514638 ffffe308`6d1c3bd8 ffffe308`6d1c3b60 ffffe308`6b2efc90 : storport!RaidAdapterPostScatterGatherExecute+0x22a
: ffffe308`61879da0 ffffbe8b`00080158 ffffbe8b`fc514670 ffffe308`61dfd100 : storport!RaUnitStartIo+0x310
: ffffe308`6d1c3bd8 ffffe308`6d1c3b60 00000000`00000006 ffffe308`6b2efc00 : storport!RaidStartIoPacket+0x50f
: 00000030`61dfd1a0 00000001`00000054 ffffe308`6d1c3b60 fffff801`aab1f2f5 : storport!RaUnitScsiIrp+0x23e
: ffffe308`61dfd1c0 ffffbe8b`fc514950 ffffe308`61dfd1c0 00000000`00989680 : storport!RaDriverScsiIrp+0x58
: 00000000`000000c0 fffff807`38c79cc7 ffffffff`ffffffff 00000000`00000032 : OpenZFS!dispatcher+0x5f9 [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 7484] 
: ffffe308`6d1c3b60 ffffe308`6d3e5050 ffffe308`6d3e5050 00000000`00000000 : nt!DifIRP_MJ_INTERNAL_DEVICE_CONTROLWrapper+0xdb
: ffff9cce`673c6108 ffff9cce`67339e30 ffff9cf1`8436c098 fffff807`38d3a0ef : nt!IopfCallDriver+0x53
: ffffe308`6d1c3b60 ffffbe8b`fc514b90 ffff9cf1`8436c098 00000000`00000000 : nt!IovCallDriver+0x5f
: fffff807`3ae78800 fffff807`3ae82000 ffffe308`6d936790 ffffe308`6d1c3b60 : nt!IofCallDriver+0x2272df
: ffffe308`6d3e51a0 ffffbe8b`fc514d00 00000000`00000fff ffff8b01`24400000 : storport!RaSendIrpSynchronous+0x62
: ffffbe8b`fc514fb0 ffffbe8b`fc514fa0 ffffbe8b`fc514ff0 00000000`00000002 : storport!RaidBusEnumeratorIssueSynchronousRequest+0x23d
: ffffe308`6b2efc90 00000000`00000000 ffffbe8b`fc514db0 ffffbe8b`fc514fb0 : storport!RaidBusEnumeratorIssueReportLuns+0x70
: 00000000`00000000 00000000`00000000 ffffbe8b`fc514fb0 ffffbe8b`00000000 : storport!RaidBusEnumeratorGetLunListFromTarget+0x76
: 00000000`00000000 00000000`00000000 00000000`00000000 ffffe308`6d8901a0 : storport!RaidBusEnumeratorGetLunList+0x54
: 00000000`00000000 ffffbe8b`fc515000 ffffe308`61dfd710 00000000`00000000 : storport!RaidAdapterEnumerateBus+0x99
: ffffe308`6d8901a0 ffffbe8b`fc5150f0 00000000`00000000 fffff801`aa57b10a : storport!RaidAdapterRescanBus+0xb1
: 00000000`00000000 ffffe308`70006c68 ffffe308`6ba6a010 ffffe308`705d0de8 : storport!RaidAdapterQueryDeviceRelationsIrp+0x190
: ffffe308`61dfd6f0 fffff807`394db950 ffffe308`61dfd710 ffffe308`61dfd710 : storport!RaidAdapterPnpIrp+0x144
: ffffe308`61dfd710 ffffbe8b`fc515380 ffffe308`61dfd710 00000000`00989680 : storport!RaDriverPnpIrp+0x94
: 00000000`00000000 00000000`00000000 00000000`0000ffff 00000000`000000ff : OpenZFS!dispatcher+0x5f9 [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 7484] 
: ffffe308`6ba6a010 ffffe308`6d890050 ffffe308`6d375060 00000000`000002e6 : nt!DifIRP_MJ_PNPWrapper+0xdb
: 00000000`000000b0 00000000`00000600 00000000`000000b0 00000000`00000600 : nt!IopfCallDriver+0x53
: ffffe308`6ba6a1b8 fffff807`3bbfd7e2 00000000`0000043a 00000000`00000000 : nt!IovCallDriver+0x5f
: ffffe308`00000000 00000000`00001001 00000000`00000000 00000000`00000000 : nt!IofCallDriver+0x2272df
: ffffe308`6ba6a010 fffff807`3bbfd7e2 ffffe308`6d375060 ffffe308`6b9c71d0 : ambakdrv+0x13ae
: 00000000`00000002 ffffe308`ffffffff 00000000`00000000 00000000`00000001 : nt!IopfCallDriver+0x53
: 00000000`00000000 00000000`00000001 ffffbe8b`00000439 00000000`00000000 : nt!IovCallDriver+0x5f
: fffff807`39090560 00000000`6d547845 00000000`00000000 fffff807`394c968d : nt!IofCallDriver+0x2272df
: ffffe308`6ba6a010 fffff807`391338fe ffffe308`6b9c7080 ffffe308`6db8cb40 : volsnap!VolSnapPnp+0x142
: ffffe308`6b9c7080 ffffbe8b`fc5157a0 ffffe308`618a0080 fffff807`394e8c35 : nt!IopfCallDriver+0x53
: ffffe308`6ba6a010 ffffbe8b`fc5157a0 ffffe308`6b9c7080 00000000`00000000 : nt!IovCallDriver+0x5f
: ffffe308`618a0080 ffffe308`6a8455f0 00000000`00000000 00000000`00000000 : nt!IofCallDriver+0x2272df
: ffffe308`618a0080 00000000`00000000 ffffe308`6a8455f0 fffff807`394c968d : nt!PnpAsynchronousCall+0xe6
: ffffe308`6b56d520 00000000`00000000 ffffe308`618a0080 fffff807`38cdfd41 : nt!PnpSendIrp+0x9e
: ffffe308`6a8455f0 ffffe308`6b56d548 ffffe308`6b56d520 ffffe308`6b56d520 : nt!PnpQueryDeviceRelations+0x51
: ffffe308`6b56d520 ffffbe8b`fc5158a8 ffffe308`6b56d520 00000000`00000000 : nt!PipEnumerateDevice+0xc8
: ffffe308`6b56d500 ffffe308`6b831e01 ffffbe8b`fc515940 fffff807`00000000 : nt!PipProcessDevNodeTree+0x187
: 00000001`00000003 ffffe308`6b56d520 ffffe308`61972080 00000000`00000000 : nt!PiRestartDevice+0xc94a3
: ffffe308`61972080 ffffe308`6189aaa0 fffff807`39767ac0 fffff807`00000000 : nt!PnpDeviceActionWorker+0x654
: ffffe308`61972080 00000000`000003d0 ffffe308`61972080 fffff807`38cf49e0 : nt!ExpWorkerThread+0x155
: fffff807`37795180 ffffe308`61972080 fffff807`38d1b520 00000000`00000000 : nt!PspSystemThreadStartup+0x57
: ffffbe8b`fc516000 ffffbe8b`fc510000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34


FAULTING_SOURCE_LINE:  C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol.c

FAULTING_SOURCE_FILE:  C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol.c

FAULTING_SOURCE_LINE_NUMBER:  605

FAULTING_SOURCE_CODE:  
   601: 		    pHBAExt,
   602: 		    pWnode,
   603: 		    0xFF);
   604: 
>  605: 		ExFreePoolWithTag(pWnode, MP_TAG_GENERAL);
   606: 	} else {
   607: 	}
   608: #endif
   609: }
   610: 


SYMBOL_NAME:  OpenZFS!wzvol_HwReportLog+2ee

MODULE_NAME: OpenZFS

IMAGE_NAME:  OpenZFS.sys

@lundman
Copy link
Author

lundman commented Jun 12, 2024

unloading driver

 	nt!DbgBreakPointWithStatus	C/C++/ASM
 	nt!KiBugCheckDebugBreak+0x12	C/C++/ASM
 	nt!KeBugCheck2+0xba3	C/C++/ASM
 	nt!KeBugCheckEx+0x107	C/C++/ASM
 	nt!ExReturnPoolQuota+0x222647	C/C++/ASM
 	nt!ExFreePoolWithTag+0x1e5	C/C++/ASM
 	storport!RaidDeleteDeferredQueue+0x1b	C/C++/ASM
 	storport!RaidDeleteAdapter+0x12e	C/C++/ASM
 	storport!RaidAdapterRemoveDeviceIrp+0x8e	C/C++/ASM
 	storport!RaidAdapterPnpIrp+0x16100	C/C++/ASM
 	storport!RaDriverPnpIrp+0x94	C/C++/ASM
>	OpenZFS!dispatcher+0x5f9 [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 7484] 	C/C++/ASM
 	nt!DifIRP_MJ_PNPWrapper+0xdb	C/C++/ASM
 	nt!IopfCallDriver+0x53	C/C++/ASM
 	nt!IovCallDriver+0x5f	C/C++/ASM
 	nt!IofCallDriver+0x2272df	C/C++/ASM
 	nt!ViFilterDispatchPnp+0x121	C/C++/ASM
 	nt!IopfCallDriver+0x53	C/C++/ASM
 	nt!IovCallDriver+0x5f	C/C++/ASM
 	nt!IofCallDriver+0x2272df	C/C++/ASM
 	ambakdrv+0x16da	C/C++/ASM
 	ambakdrv+0x15b4	C/C++/ASM
 	ambakdrv+0x1478	C/C++/ASM
 	nt!DifIRP_MJ_PNPWrapper+0xdb	C/C++/ASM
 	nt!IopfCallDriver+0x53	C/C++/ASM
 	nt!IovCallDriver+0x5f	C/C++/ASM
 	nt!IofCallDriver+0x2272df	C/C++/ASM
 	nt!ViFilterDispatchPnp+0x121	C/C++/ASM
 	nt!IopfCallDriver+0x53	C/C++/ASM
 	nt!IovCallDriver+0x5f	C/C++/ASM
 	nt!IofCallDriver+0x2272df	C/C++/ASM
 	volsnap!VolSnapPnp+0x9e	C/C++/ASM
 	nt!IopfCallDriver+0x53	C/C++/ASM
 	nt!IovCallDriver+0x5f	C/C++/ASM
 	nt!IofCallDriver+0x2272df	C/C++/ASM
 	nt!IopSynchronousCall+0xf8	C/C++/ASM
 	nt!IopRemoveDevice+0x108	C/C++/ASM
 	nt!PnpRemoveLockedDeviceNode+0x1a8	C/C++/ASM
 	nt!PnpDeleteLockedDeviceNode+0x52	C/C++/ASM
 	nt!PnpDeleteLockedDeviceNodes+0xd3	C/C++/ASM
 	nt!PnpProcessQueryRemoveAndEject+0x3b3	C/C++/ASM
 	nt!PnpProcessTargetDeviceEvent+0x109	C/C++/ASM
 	nt!PnpDeviceEventWorker+0x2ca	C/C++/ASM
 	nt!ExpWorkerThread+0x155	C/C++/ASM
 	nt!PspSystemThreadStartup+0x57	C/C++/ASM
 	nt!KiStartSystemThread+0x34	C/C++/ASM


BAD_POOL_CALLER (c2)
The current thread is making a bad pool request.  Typically this is at a bad IRQL level or double freeing the same allocation, etc.
Arguments:
Arg1: 000000000000000d, Attempt to release quota on a corrupted pool allocation.
Arg2: ffffad8f3c788d40, Address of pool
Arg3: 00000000fffff807, Pool allocation's tag
Arg4: a5ff4d454236bb53, Quota process pointer (bad).

Bad previous allocation size @ffffad8f3c788d30, last size was 0

***
*** An error (or corruption) in the pool was detected;
*** Attempting to diagnose the problem.
***
*** Use !poolval ffffad8f3c788000 for more details.


Pool page [ ffffad8f3c788000 ] is INVALID.

Analyzing linked list...
[ ffffad8f3c788000 ]: invalid previous size [ 0x98 ] should be [ 0x0 ]
[ ffffad8f3c788000 --> ffffad8f3c7884e0 (size = 0x4e0 bytes)]: Corrupt region

@sskras
Copy link

sskras commented Jun 12, 2024

@lundman commented 8 hours ago:

I am unsure what to do next, a lot of things recommend verifier - but this is a driver load issue, so I do not know if that will help.

I doubt Aomei developers are interested in debugging.

Maybe it's worth asking on some OSR Developer Community forum?

NTDEV looks like a close match:

Windows System Software Devs Interest List — Windows Driver Developer’s Forum

This is the category to post your questions about designing and developing all types of Windows drivers, particularly device drivers. Got a KMDF driver or UMDF driver question? Got a crash in driver that you need help ana…

Last time I communicated on OSR was circa 2000, so things might have changed since. But the community seemed very friendly. :)

@lundman
Copy link
Author

lundman commented Jun 12, 2024

Ah yeah, I want to chase down a bit further before I post over there

@sskras
Copy link

sskras commented Jun 12, 2024

Oh! Thanks for the personal update :)

@lundman
Copy link
Author

lundman commented Jun 12, 2024

Probably on there still.. perhaps I come across too aristocraty....:)

@lundman
Copy link
Author

lundman commented Jun 13, 2024

This is the earliest crash (in terms of stack depth) I've seen:


Unknown bugcheck code (0)
Unknown bugcheck description
Arguments:
Arg1: 0000000000000000
Arg2: 0000000000000000
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------

BUGCHECK_CODE:  0

BUGCHECK_P1: 0

BUGCHECK_P2: 0

BUGCHECK_P3: 0

BUGCHECK_P4: 0

PROCESS_NAME:  System

EXECUTE_ADDRESS: 40

FAILED_INSTRUCTION_ADDRESS: 
+0
00000000`00000040 ??              ???

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR:  c0000005

EXCEPTION_PARAMETER1:  0000000000000008

EXCEPTION_PARAMETER2:  0000000000000040

STACK_ADDR_RAW_STACK_SYMBOL: ffffef8ba5ce44a8

STACK_COMMAND:  dps ffffef8ba5ce44a8-0x20 ; kb

STACK_TEXT:  
ffffef8b`a5ce4488  00000000`00000000
ffffef8b`a5ce4490  ffffb487`9ac6a060
ffffef8b`a5ce4498  ffffb487`9abdd200
ffffef8b`a5ce44a0  fffff805`6b16d7e2 volsnap!VolSnapPnp+0x142
ffffef8b`a5ce44a8  fffff805`68344247 nt!IopfCallDriver+0x53
ffffef8b`a5ce44b0  ffffb487`9abdd010
ffffef8b`a5ce44b8  ffffb487`970c7be0
ffffef8b`a5ce44c0  ffffb487`9ac6a1b0
ffffef8b`a5ce44c8  00000000`00000000
ffffef8b`a5ce44d0  00000000`00000000
ffffef8b`a5ce44d8  00000000`00000000
ffffef8b`a5ce44e0  00000000`00000002
ffffef8b`a5ce44e8  fffff805`68ad222f nt!IovCallDriver+0x5f
ffffef8b`a5ce44f0  00000000`00000014
ffffef8b`a5ce44f8  ffffb487`ffffffff
ffffef8b`a5ce4500  00000000`00000000


SYMBOL_NAME:  volsnap!VolSnapPnp+142

MODULE_NAME: volsnap

IMAGE_NAME:  volsnap.sys

BUCKET_ID_FUNC_OFFSET:  142

FAILURE_BUCKET_ID:  ACCESS_VIOLATION_VRF_NULL_IP_volsnap!VolSnapPnp

OS_VERSION:  10.0.22621.1

BUILDLAB_STR:  ni_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {e6a35f71-d940-8230-0f7a-a6fc733dd801}

Followup:     MachineOwner
---------

@lundman
Copy link
Author

lundman commented Jun 13, 2024

OK so this could be an indication

: : nt!VerifierIofCompleteRequest+0x10
: : OpenZFS!dispatcher+0x6dc [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 7556] 
: : nt!DifIRP_MJ_INTERNAL_DEVICE_CONTROLWrapper+0xdb
: : nt!IopfCallDriver+0x53
: : nt!IovCallDriver+0x5f
: : nt!IofCallDriver+0x2272df
: : storport!RaSendIrpSynchronous+0x62
: : storport!RaidBusEnumeratorIssueSynchronousRequest+0x23d
: : storport!RaidBusEnumeratorIssueReportLuns+0x70
: : storport!RaidBusEnumeratorGetLunListFromTarget+0x76
: : storport!RaidBusEnumeratorGetLunList+0x54
: : storport!RaidAdapterEnumerateBus+0x99
: : storport!RaidAdapterRescanBus+0xb1
: : storport!RaidAdapterQueryDeviceRelationsIrp+0x190
: : storport!RaidAdapterPnpIrp+0x144
: : storport!RaDriverPnpIrp+0x94
: : OpenZFS!dispatcher+0x639 [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 7545] 

We are called, we pass down to StorPort, which ends up calling us again, and we try to call StorPort again. This is not ideal - it is odd StorPort would call us at all.

@lundman
Copy link
Author

lundman commented Jun 13, 2024

OK so fixing that, and things around there, does not fix the memory corruption - that still exists somewhere - outside of dispatcher

@sskras
Copy link

sskras commented Jun 13, 2024

OK so fixing that, and things around there, does not fix the memory corruption

It' would be interesting to see the changes nevertheless. I guess they will be committed sooner or later.

@lundman
Copy link
Author

lundman commented Jun 13, 2024

heh they are all pretty desperate - trying to narrow the issue down. next is commenting out large chunks in the hopes it either goes away, or not

@lundman
Copy link
Author

lundman commented Jun 14, 2024

OK, I have take out everything ZFS. No SPL, no ZFS, no dispatcher, no dprintf, WPP.

Left is the DriverStart() calling StorPortInitialize().

emptyBSOD

: 00000000`000000c2 00000000`0000000d ffffe308`6e999730 00000000`fffff807 : nt!KeBugCheckEx+0x107
: ffffe308`6e999720 00000000`fffff807 00000000`000004c0 00000000`00000001 : nt!ExReturnPoolQuota+0x222647
: 00000000`00000000 ffffe308`6e5d8720 ffffbe8b`00000000 00000000`00000000 : nt!ExFreePoolWithTag+0x1e5
: ffffe308`6e5d8938 ffffe308`6e5d8050 ffffe308`6d325030 ffffe308`6bc1d550 : storport!RaidDeleteDeferredQueue+0x1b
: ffffe308`6e5d81a0 ffffe308`6e5d8050 ffffe308`6d325030 ffffe308`6bc1d550 : storport!RaidDeleteAdapter+0x12e
: ffffe308`6bc1d550 00000000`00000002 00000000`00000000 00000000`00010000 : storport!RaidAdapterRemoveDeviceIrp+0x8e
: ffffe308`61dfd6f0 fffff807`3a6361f5 ffffe308`61dfd710 ffffbe8b`fc5c4250 : storport!RaidAdapterPnpIrp+0x16100
: ffffe308`61dfd710 ffffbe8b`fc5c4250 ffffe308`61dfd710 fffff807`39766400 : storport!RaDriverPnpIrp+0x94
: ffffe308`6bc1d550 ffffe308`6e5d8050 ffffe308`6b7ddb40 00000000`00000000 : nt!DifIRP_MJ_PNPWrapper+0xdb
: 00000000`00000000 ffffe308`6b7ddca0 00000000`00000000 fffff807`37795180 : nt!IopfCallDriver+0x53
: ffffe308`6bc1d6f8 ffffe308`6bc1d550 ffffe308`6b7ddc98 00000000`00000000 : nt!IovCallDriver+0x5f
: ffffe308`6b7ddca0 ffffbe8b`fc5c4430 ffffbcd2`2b6bddf7 fffff807`38d0ca21 : nt!IofCallDriver+0x2272df
: 00000000`00000000 ffffe308`6b7dd9f0 ffffe308`6e5e0040 ffffe308`6aed4180 : ambakdrv+0x16da
: ffffe308`6bc1d740 fffff807`3bbfd73e ffffe308`6bc1d550 ffffe308`6b7dd9f0 : ambakdrv+0x15b4
: ffffe308`6bc1d550 fffff807`3bbfd73e ffffe308`6b7dd9f0 ffffe308`6e5e0190 : ambakdrv+0x1478
: 00000000`00000000 fffff807`38c20f8d 00000000`00000000 00000000`00000000 : nt!IopfCallDriver+0x53
: ffffe308`6b7dd9f0 00000000`00000001 00000000`00000000 00000000`00000000 : nt!IovCallDriver+0x5f
: ffffe308`6b7dd9f0 00000000`00000001 ffffe308`6bc1d550 fffff807`390dbede : nt!IofCallDriver+0x2272df
: ffffe308`6bc1d550 fffff807`3915ae44 ffffe308`6e5e0040 ffff9f04`636be000 : volsnap!VolSnapPnp+0x9e
: ffffbe8b`fc5c46e0 00000000`00000000 ffffe308`6bc1d550 fffff807`394e8c35 : nt!IopfCallDriver+0x53
: ffffe308`6e5e0040 00000000`00000000 ffffe308`6e5e0040 00000000`00000000 : nt!IovCallDriver+0x5f
: ffffe308`6e5e0040 00000000`00000000 ffffbe8b`fc5c46e0 ffffe308`6bc1d550 : nt!IofCallDriver+0x2272df
: 00000000`00000002 ffffe308`6e5e1300 ffffe308`6a8329f0 ffffe308`6e5e1300 : nt!IopSynchronousCall+0xf8
: ffff9f04`63f0a010 ffffe308`6a8329f0 00000000`00000200 00000000`0000000a : nt!IopRemoveDevice+0x108
: ffffe308`6e5e1300 ffffe308`6a8329f0 00000000`00000000 cb3a4008`00200001 : nt!PnpRemoveLockedDeviceNode+0x1a8
: ffffe308`6a8329f0 ffffbe8b`fc5c4830 ffffe308`6e5e1300 00000000`00000000 : nt!PnpDeleteLockedDeviceNode+0x52
: ffffe308`6e5e1300 00000000`00000002 ffffe308`6e5e1300 00000000`00000000 : nt!PnpDeleteLockedDeviceNodes+0xd3
: ffffbe8b`fc5c49b0 ffffe308`6a832900 ffffe308`6bceb400 ffff9f04`00000001 : nt!PnpProcessQueryRemoveAndEject+0x3b3
: ffff9f04`63f0a010 ffff9f04`63781a10 ffffe308`6189aa00 00000000`00000000 : nt!PnpProcessTargetDeviceEvent+0x109
: ffffe308`6189aaa0 ffffe308`6684c340 ffffbe8b`fc5c4b00 00000000`00000000 : nt!PnpDeviceEventWorker+0x2ca
: ffffe308`6684c340 00000000`00000235 ffffe308`6684c340 fffff807`38cf49e0 : nt!ExpWorkerThread+0x155
: ffff8b01`24367180 ffffe308`6684c340 fffff807`38d1b520 00000000`00000000 : nt!PspSystemThreadStartup+0x57
 : ffffbe8b`fc5c5000 ffffbe8b`fc5bf000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34


SYMBOL_NAME:  ambakdrv+16da

MODULE_NAME: ambakdrv

IMAGE_NAME:  ambakdrv.sys

Using KdPrinEx() to try the callbacks in StorPort:

MpHwCompServReq entered
MpQueueServiceIrp entered
wzvol_HwHandlePnP: entry
MpHwAdapterControl:  ControlType = 1
MpHwAdapterControl:  ScsiStopAdapter
wzvol_StopAdapter: entry
MpHwAdapterControl - OUT
MpHwAdapterControl:  ControlType = 3
MpHwAdapterControl:  ScsiSetBootConfig
MpHwAdapterControl - OUT
MpHwFreeAdapterResources entered, pHBAExt = 0xFFFFE3086DF54770
KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x000000c2

@lundman
Copy link
Author

lundman commented Jun 14, 2024

Then there is this crash:

06 ffffbe8b`fefe00b0 fffff807`390feb62     nt!KiPageFault+0x369
07 ffffbe8b`fefe0240 fffff807`3a3545a9     nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x62
08 ffffbe8b`fefe0280 fffff807`3a354a18     FLTMGR!TargetedIOCtrlGenerateECP+0x8d
09 ffffbe8b`fefe02f0 fffff807`3a354d17     FLTMGR!FltpCreateFile+0xd4
0a ffffbe8b`fefe03e0 fffff807`3b01f4b1     FLTMGR!FltCreateFileEx2+0xe7
0b ffffbe8b`fefe0510 fffff807`391adabc     fileinfo!FIPfInterfaceOpen+0x401

  fffff807`390feb52  call nt!ExAllocateFromNPagedLookasideList (fffff807`38c6f230) 
  fffff807`390feb57  xor ecx,ecx 
  fffff807`390feb59  test rax,rax    # Check if rax is NULL
  fffff807`390feb5c  je nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x1cf635 (fffff807`392ce135)  # jump if it is
* fffff807`390feb62  mov qword ptr [rax+10h],rcx  # deference rax + 16.
(* = PC)

0: kd> r
rax=0000000000000000 rbx=0000000000000003 rcx=0000000000000003
rdx=000000000000008a rsi=fffff8073bbfd701 rdi=0000000000000000
rip=fffff80738e3f520 rsp=ffffbe8bfefdf678 rbp=ffffbe8bfefdf7e0
 r8=0000000000000065  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000003
r14=0000000000000000 r15=0000000000000000

My x64 is a bit rusty, but it seems it checks for rax being NULL, since it isn't, it does not jump. Then the next store to rax+0x10, now rax is NULL, and we crash! Neat. What would be changing my rax on the fly, if I read that correctly.

I believe only preemption/interrupts can execute between those statements, so is it destroying rax somewhere? i386/x64 lists rax as nonvolitile.

UPDATE

rax is not NULL, I had it in mind that the registers rewind when you jump into the callstack, of course they don't. And the error is WRITE_TO_READONLY_MEMORY, which isn't the error given for NULL access.

@sskras
Copy link

sskras commented Jun 14, 2024

it checks for rax being NULL, since it isn't, it does not jump. Then the next store to rax+0x10, now rax is NULL, and we crash! Neat. What would be changing my rax on the fly, if I read that correctly.

Could you step through that code by chance?

My improvised guess is that:

  • rax is actually NULL
  • the execution jumps to FsRtlAllocateExtraCreateParameterFromLookasideList+0x1cf635
  • and then it returns back to fffff807 390feb62

@lundman
Copy link
Author

lundman commented Jun 15, 2024

here is the whole function

nt!FsRtlAllocateExtraCreateParameterFromLookasideList:
fffff807`390feb00  mov qword ptr [rsp+8],rbx 
fffff807`390feb05  mov qword ptr [rsp+10h],rbp 
fffff807`390feb0a  mov qword ptr [rsp+18h],rsi 
fffff807`390feb0f  mov qword ptr [rsp+20h],rdi 
fffff807`390feb14  push r14 
fffff807`390feb16  sub rsp,30h 
fffff807`390feb1a  mov rbx,qword ptr [rsp+60h] 
fffff807`390feb1f  lea esi,[rdx+48h] 
fffff807`390feb22  mov r10d,r8d 
fffff807`390feb25  mov rbp,r9 
fffff807`390feb28  and r10d,2 
fffff807`390feb2c  mov edi,42h 
fffff807`390feb31  mov r9d,2 
fffff807`390feb37  mov r14,rcx 
fffff807`390feb3a  mov ecx,dword ptr [rbx+28h] 
fffff807`390feb3d  cmove edi,r9d 
fffff807`390feb41  cmp esi,dword ptr [rbx+2Ch] 
fffff807`390feb44  ja nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x1cf616 (fffff807`392ce116) 
# ^^ Sets rax and jumps to fffff807`390feb99
fffff807`390feb4a  mov rcx,rbx 
fffff807`390feb4d  test r10d,r10d 
fffff807`390feb50  jne nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0xb5 (fffff807`390febb5) 
fffff807`390feb52  call nt!ExAllocateFromNPagedLookasideList (fffff807`38c6f230) 
fffff807`390feb57  xor ecx,ecx 
fffff807`390feb59  test rax,rax 
fffff807`390feb5c  je nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x1cf635 (fffff807`392ce135) 
# ^^ Sets rax and jumps to fffff807`390feb99
fffff807`390feb62  mov qword ptr [rax+10h],rcx 
fffff807`390feb66  mov qword ptr [rax+8],rcx 
fffff807`390feb6a  mov qword ptr [rax],48706345h 
fffff807`390feb71  movups xmm0,xmmword ptr [r14] 
fffff807`390feb75  mov qword ptr [rax+40h],rcx 
fffff807`390feb79  lea rcx,[rax+48h] 
fffff807`390feb7d  mov qword ptr [rax+28h],rbp 
fffff807`390feb81  movups xmmword ptr [rax+18h],xmm0 
fffff807`390feb85  mov dword ptr [rax+30h],edi 
fffff807`390feb88  mov dword ptr [rax+34h],esi 
fffff807`390feb8b  mov qword ptr [rax+38h],rbx 
fffff807`390feb8f  mov rax,qword ptr [rsp+68h] 
fffff807`390feb94  mov qword ptr [rax],rcx 
fffff807`390feb97  xor eax,eax 
fffff807`390feb99  mov rbx,qword ptr [rsp+40h] 
fffff807`390feb9e  mov rbp,qword ptr [rsp+48h] 
fffff807`390feba3  mov rsi,qword ptr [rsp+50h] 
fffff807`390feba8  mov rdi,qword ptr [rsp+58h] 
fffff807`390febad  add rsp,30h 
fffff807`390febb1  pop r14 
fffff807`390febb3  ret  
fffff807`390febb4  int 3 
fffff807`390febb5  call nt!ExAllocateFromNPagedLookasideList (fffff807`38c6f230) 
fffff807`390febba  jmp nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x57 (fffff807`390feb57) 
fffff807`390febbc  int 3 
fffff807`390febbd  int 3 
fffff807`390febbe  int 3 
fffff807`390febbf  int 3 
fffff807`390febc0  int 3 
fffff807`390febc1  int 3 
fffff807`390febc2  int 3 
fffff807`390febc3  int 3 

nt!AlpcpLockForCachedReferenceBlob:
fffff807`390febc4  push rbx 

~~~snip~~~
fffff807`392ce135  mov rax, qword ptr [rsp+68h] 
fffff807`392ce13a  mov qword ptr [rax],rcx 
fffff807`392ce13d  mov eax,0C000009Ah    # STATUS_INSUFFICIENT_RESOURCES
fffff807`392ce142  jmp nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x99 (fffff807390feb99) 
fffff807`392ce147  int 3 


I don't see any jumps to fffff807390feb62 but it's not like devstudio has cross-reference ability.

@sskras
Copy link

sskras commented Jun 15, 2024

I don't see any jumps to fffff807390feb62 but it's not like devstudio has cross-reference ability.

It looks you're right:

  • nt!ExAllocateFromNPagedLookasideList gets called
  • value of rax gets checked via test rax,rax
  • if it's zero, je makes that STATUS_INSUFFICIENT_RESOURCES is set in eax
  • execution jumps to the fn epilog (errors out)
    (which we are not observing in this case)
  • if it's not zero, it suddenly becomes zero 8 bytes further in code, at fffff807 390feb62

Very strange indeed.

@lundman
Copy link
Author

lundman commented Jun 15, 2024

Which usually means interrupt, either hardware, preemption, DPC or similar. But I've commented out all those things, calls to __cpuidex(), assembly, etc. Could explain why the crashes are random, and often with NETIO (most common interrupts on my system).
But that tends to mean it is not my code, but every time I blame the compiler it ends up being my code. Peculiar. I do wonder if I can build the sample storport project that the zvol stuff came from, and trigger it outside of this project. At least could suggest Aomei issue as well.

@lundman
Copy link
Author

lundman commented Jun 15, 2024

Turned on storport verifylevel 1, and verifier to check ambdrv, volsnap and openzfs, this is one of the quickest crashes so far:

volsnap

wzvol_HwFindAdapter: pHBAExt = 0xFFFFB78D5E81E110, NbBuses/MaxTargets/MaxLuns=1/2/21.
MpHwAdapterControl:  ControlType = 0
MpHwAdapterControl: ScsiQuerySupportedControlTypes
MpHwAdapterControl - OUT
wzvol_HwInitialize: entry
wzvol_HwReportAdapter: entry
wzvol_HwReportLink: entry
wzvol_HwReportLog: entry
Access violation - code c0000005 (!!! second chance !!!)
00000000`00000040 ??              ???

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR:  c0000005

EXCEPTION_PARAMETER1:  0000000000000008

EXCEPTION_PARAMETER2:  0000000000000040

STACK_ADDR_RAW_STACK_SYMBOL: fffff885929a1448

STACK_COMMAND:  dps fffff885929a1448-0x20 ; kb

STACK_TEXT:  
fffff885`929a1428  00000000`00000000
fffff885`929a1430  ffffb78d`5fa21040
fffff885`929a1438  ffffb78d`5e4a5200
fffff885`929a1440  fffff801`8270d7e2 volsnap!VolSnapPnp+0x142
fffff885`929a1448  fffff801`7f93b247 nt!IopfCallDriver+0x53
fffff885`929a1450  ffffb78d`5e4a5010
fffff885`929a1458  ffffb78d`5f672610
fffff885`929a1460  ffffb78d`5fa21190
fffff885`929a1468  00000000`00001001
fffff885`929a1470  ffffb78d`55a86a10
fffff885`929a1478  fffff801`7fbec06b nt!DifKeInitializeEventWrapper+0xbb
fffff885`929a1480  fffff801`8270d785 volsnap!VolSnapPnp+0xe5
fffff885`929a1488  fffff801`800c922f nt!IovCallDriver+0x5f
fffff885`929a1490  00000000`00000fff
fffff885`929a1498  ffffa600`7e200000
fffff885`929a14a0  fffff801`8270d785 volsnap!VolSnapPnp+0xe5


SYMBOL_NAME:  volsnap!VolSnapPnp+142

MODULE_NAME: volsnap

IMAGE_NAME:  volsnap.sys

BUCKET_ID_FUNC_OFFSET:  142

FAILURE_BUCKET_ID:  ACCESS_VIOLATION_VRF_NULL_IP_volsnap!VolSnapPnp

1: kd> r
rax=0000000000000000 rbx=000000200000004c rcx=000000002f7f3938
rdx=00000000001e8480 rsi=000005010000004c rdi=0807060504030201
rip=0000000000000040 rsp=fffff885929a13f0 rbp=004f0056005a0008
 r8=0000000000000000  r9=fffff78000000294 r10=fffff8017f9cfbb0
r11=0000000000000000 r12=0000000a00000000 r13=51b40aadfe995194
r14=498a014bc66015ee r15=ffffb78d5f672610

This looks corrupt, stack maybe, so not sure it adds any more clues

Going up frame #4 in stack is even at address 0x0807060504030201 which seems "suspect".

It comes from volsnap:

fffff801`8270d7ca  lea rcx,[rsp+40h] 
fffff801`8270d7cf  mov qword ptr [rax-8],rcx 
fffff801`8270d7d3  mov byte ptr [rax-45h],0E0h 
fffff801`8270d7d7  mov rcx,qword ptr [rdi+20h] 
fffff801`8270d7db  call qword ptr [volsnap!_imp_IofCallDriver (fffff801`827090d0)]

IofCallDriver(DeviceObject, goes into rcx, which is 000000002f7f3938 (bad), but comes from rdi, which is 0807060504030201 (needs exorcism), and rsp at this frame is:

0xFFFFF885929A13F0  4c 00 00 00 20 00 00 00 08 00 5a 00 56 00 4f 00 4c 00 00 00 01 05 00 00 01 02 03 04 05 06 07 08 10 00 00 00 0a 0b 0c 0d 0e 0f  L... .....Z.V.O.L.........................

Maybe we return something on the stack we shouldn't.

@lundman
Copy link
Author

lundman commented Jun 15, 2024

Interesting that the last call to us wzvol_HwReportLog has

UCHAR myPortWWN[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

@guenther-alka
Copy link

guenther-alka commented Jun 16, 2024

What I have seen on powershell get-volume (Open-ZFS ..f74)
Get-volume

`DriveLetter FriendlyName FileSystemType DriveType HealthStatus OperationalStatus SizeRemaining Size


                      Unknown        Removable Healthy      Unknown                     0 B       0 B

I Volume2 NTFS Fixed Healthy OK 3.88 GB 3.91 GB
C NTFS Fixed Healthy OK 118.94 GB 475.95 GB
Y Speicherplatz NTFS Fixed Healthy OK 19.82 GB 19.87 GB
FAT32 Fixed Healthy OK 71.03 MB 99.21 MB
E Volume1 NTFS Fixed Healthy OK 3.88 GB 3.91 GB
D Unknown Removable Healthy Unknown 0 B 0 B
NTFS Fixed Healthy OK 86.31 MB 786 MB`

Drive D is a ZFS pool and the only entry with a driveletter, unknown filesystem type,, status unknown and size 0
not sure if this is relevant.

@sskras
Copy link

sskras commented Jun 17, 2024

@guenther-alka, your output is so messed up that it took me a half of an hour to reconstruct the structure:

DriveLetter FriendlyName  FileSystemType DriveType HealthStatus OperationalStatus SizeRemaining      Size
----------- ------------  -------------- --------- ------------ ----------------- -------------      ----
D                         Unknown        Removable Healthy      Unknown                     0 B       0 B
DriveLetter FriendlyName  FileSystemType DriveType HealthStatus OperationalStatus SizeRemaining      Size
----------- ------------  -------------- --------- ------------ ----------------- -------------      ----
I           Volume2       NTFS           Fixed     Healthy      OK                      3.88 GB   3.91 GB 
C                         NTFS           Fixed     Healthy      OK                    118.94 GB 475.95 GB
Y           Speicherplatz NTFS           Fixed     Healthy      OK                     19.82 GB  19.87 GB
                          FAT32          Fixed     Healthy      OK                     71.03 MB  99.21 MB 
E           Volume1       NTFS           Fixed     Healthy      OK                      3.88 GB   3.91 GB 
D                         Unknown        Removable Healthy      Unknown                     0 B       0 B
                          NTFS           Fixed     Healthy      OK                     86.31 MB    786 MB

Also I guess that you're posting to the wrong issue.

@sskras
Copy link

sskras commented Jun 17, 2024

@lundman commented 2 days ago:

OK, I have take out everything ZFS. No SPL, no ZFS, no dispatcher, no dprintf, WPP.

Left is the DriverStart() calling StorPortInitialize().

[...]


*** Fatal System Error: 0x000000c2

@lundman commented 2 days ago:

But I've commented out all those things, calls to __cpuidex(), assembly, etc.

Umm, is there anything left to comment out except for DriverStart()?

But that tends to mean it is not my code, but every time I blame the compiler it ends up being my code. Peculiar.

This time I would rather blame logic coming from other drivers.

I do wonder if I can build the sample storport project that the zvol stuff came from, and trigger it outside of this project.

That would be very interesting.

At least could suggest Aomei issue as well.

To me it already suggests that Aomei is buggy.
It also could be due to some NT kernel change in the newer Windows version. This is w11, right?

@sskras
Copy link

sskras commented Jun 17, 2024

Found a forum in which might be lurking folks skilled in NT kernel crashes:
https://www.sysnative.com/forums/forums/bsod-crashes-kernel-debugging.15/

Eg: Storport.sys BSOD case.

@lundman
Copy link
Author

lundman commented Jun 17, 2024

I took out the __attribute__(constructor) today, since it would make sense for that to crash during load and unload. Alas, crashing continues just to vex me further. :)

@guenther-alka
Copy link

If Windows crashes with ZFS for an unknown reason, my point was to check in what items ZFS behaves different from other volumes and undefined/unknown states are such a point.

@lundman
Copy link
Author

lundman commented Jun 17, 2024

What I have seen on powershell get-volume (Open-ZFS ..f74)

Yeah it's interesting the difference there, I do suspect we aren't handling MOUNTMGR correctly, so we will have to return to it. But that issue kicks in when something is mounted. I can crash just loading/unloading the driver, without ever using ZFS parts. But let's put that on the list to look into.

@lundman
Copy link
Author

lundman commented Jun 21, 2024

So, instead of calling our zvol's StorPortInitialize(), I grabbed SFBDVMiniport.c from
https://github.com/waynemunro/service-fabric-2/blob/master/src/prod/src/BlockStore/SFBDMiniport/SFBDMiniport/SFBDVMiniport.c

and call it's StorPortInitialize() instead;

Driver Verifier: Applied for OpenZFS.sys, 0:0x3bb6dbb, 1:0xa, build 22635, key tnVYFWMVqHgmDJiKeQf9FB
OpenZFS: Started
OpenZFS_Fini
OpenZFS: Goodbye.
Driver Verifier: Removed for OpenZFS.sys, 0:0x3bb6dbb, 1:0xa, build 22635, key tnVYFWMVqHgmDJiKeQf9FB
KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x00000139
                       (0x0000000000000003,0xFFFFBE8BFC90E600,0xFFFFBE8BFC90E558,0x0000000000000000)

Break instruction exception - code 80000003 (first chance)
For analysis of this file, run !analyze -v
nt!DbgBreakPointWithStatus:
fffff807`38e3f520 cc              int     3
1: kd> k
 # Child-SP          RetAddr               Call Site
00 ffffbe8b`fc90dac8 fffff807`38f84822     nt!DbgBreakPointWithStatus
01 ffffbe8b`fc90dad0 fffff807`38f83ee3     nt!KiBugCheckDebugBreak+0x12
02 ffffbe8b`fc90db30 fffff807`38e35487     nt!KeBugCheck2+0xba3
03 ffffbe8b`fc90e2a0 fffff807`38e4ace9     nt!KeBugCheckEx+0x107
04 ffffbe8b`fc90e2e0 fffff807`38e4b2b2     nt!KiBugCheckDispatch+0x69
05 ffffbe8b`fc90e420 fffff807`38e49006     nt!KiFastFailDispatch+0xb2
06 ffffbe8b`fc90e600 fffff807`38d20b8c     nt!KiRaiseSecurityCheckFailure+0x346
07 ffffbe8b`fc90e790 fffff807`3b602535     nt!RtlRemoveEntryHashTable+0x4c
08 ffffbe8b`fc90e7c0 ffffe308`6ecd1c90     0xfffff807`3b602535
09 ffffbe8b`fc90e7c8 badbadfa`badbadfa     0xffffe308`6ecd1c90
0a ffffbe8b`fc90e7d0 ffffbe8b`fc90e850     0xbadbadfa`badbadfa
0b ffffbe8b`fc90e7d8 fffff807`3b603747     0xffffbe8b`fc90e850
0c ffffbe8b`fc90e7e0 00000000`00000000     0xfffff807`3b603747

0xbadbadfabadbadfa is certainly amusing.

But it suggests that it isn't the code. I tried added various Registry entries from the .inf file without luck. Now I am curious if that Service-Fabric-2's miniport driver loads with Aomei at all.

@lundman
Copy link
Author

lundman commented Jun 21, 2024

Err the actual stack

1: kd> k
 # Child-SP          RetAddr               Call Site
00 ffffbe8b`fc90dac8 fffff807`38f84822     nt!DbgBreakPointWithStatus
01 ffffbe8b`fc90dad0 fffff807`38f83ee3     nt!KiBugCheckDebugBreak+0x12
02 ffffbe8b`fc90db30 fffff807`38e35487     nt!KeBugCheck2+0xba3
03 ffffbe8b`fc90e2a0 fffff807`38e4ace9     nt!KeBugCheckEx+0x107
04 ffffbe8b`fc90e2e0 fffff807`38e4b2b2     nt!KiBugCheckDispatch+0x69
05 ffffbe8b`fc90e420 fffff807`38e49006     nt!KiFastFailDispatch+0xb2
06 ffffbe8b`fc90e600 fffff807`38d20b8c     nt!KiRaiseSecurityCheckFailure+0x346
07 ffffbe8b`fc90e790 fffff807`3b602535     nt!RtlRemoveEntryHashTable+0x4c
08 ffffbe8b`fc90e7c0 fffff807`3b6013ce     NETIO!FreeSomeCacheBucketEntries+0x7d
09 ffffbe8b`fc90e810 fffff807`3b607218     NETIO!FreeSomeCacheEntries+0x3e
0a ffffbe8b`fc90e840 fffff807`3b60442b     NETIO!UpdateCacheLruBucket+0x828
0b ffffbe8b`fc90e910 fffff807`3b6118f7     NETIO!LruCleanupWorkItemRoutine+0x9b
0c ffffbe8b`fc90e940 fffff807`38ce0afe     NETIO!NetiopIoWorkItemRoutine+0x57
0d ffffbe8b`fc90e990 fffff807`38cf4b35     nt!IopProcessWorkItem+0x8e
0e ffffbe8b`fc90ea00 fffff807`38d1b577     nt!ExpWorkerThread+0x155
0f ffffbe8b`fc90ebf0 fffff807`38e3a414     nt!PspSystemThreadStartup+0x57
10 ffffbe8b`fc90ec40 00000000`00000000     nt!KiStartSystemThread+0x34

@lundman
Copy link
Author

lundman commented Jun 22, 2024

Since it is not actually calling us, we can't skip ambakdrv.sys based on "name of caller". I suppose we can skip loading stoport, if we detect it is running on the system.

@lundman
Copy link
Author

lundman commented Jun 22, 2024

OK to get some progress, what I will do for now is, the default value of windows_zvol_enabled is 1, and if it is 1, and we detect ambakdrv.sys is loaded, zvols are skipped. If you set windows_zvol_enabled to 2, it will force load OpenZFS with zvols. 0 is disabled as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants