Bug 1201 - virt2phys is sometimes causing bluescreen
Summary: virt2phys is sometimes causing bluescreen
Status: CONFIRMED
Alias: None
Product: DPDK
Classification: Unclassified
Component: other
Version: 22.11
Hardware: x86 Windows
Importance: High critical
Target Milestone: ---
Assignee: Tyler Retzlaff
URL:
Depends on:
Blocks:
 
Reported: 2023-03-28 15:24 CEST by Antoine Pollenus
Modified: 2023-09-15 07:06 CEST



Attachments
bluescreen (146.29 KB, image/png)
2023-03-28 15:24 CEST, Antoine Pollenus

Description Antoine Pollenus 2023-03-28 15:24:11 CEST
Created attachment 247
bluescreen

When using DPDK on Windows Server 2019, we sometimes get a bluescreen.

This bluescreen seems to be caused by virt2phys.sys.

The issue has only been reproduced on Windows Server 2019.

On Windows 10, Windows 11, and Windows Server 2022, no bluescreen occurs.

Only test-pmd is needed to reproduce this issue.

Below is the crash analysis:

Microsoft (R) Windows Debugger Version 10.0.22621.755 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.


Loading Dump File [C:\Users\ja\Desktop\BSOD VCS\MEMORY.DMP]
Kernel Bitmap Dump File: Kernel address space is available, User address space may not be available.

WARNING: Whitespace at start of path element

************* Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       .sympath cache*C:\MySymbols
Deferred                                       srv*https://msdl.microsoft.com/download/symbols
WARNING: Whitespace at start of path element
Symbol search path is: .sympath cache*C:\MySymbols;srv*https://msdl.microsoft.com/download/symbols; C:\local\github\dpdk-kmods\windows\virt2phys\x64\Release
Executable search path is: 
WARNING: Whitespace at start of path element
Windows 10 Kernel Version 17763 MP (8 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Edition build lab: 17763.1.amd64fre.rs5_release.180914-1434
Machine Name:
Kernel base = 0xfffff803`07eba000 PsLoadedModuleList = 0xfffff803`082d24d0
Debug session time: Fri Mar 24 13:59:11.850 2023 (UTC + 1:00)
System Uptime: 0 days 0:04:38.551
Loading Kernel Symbols
.............................................................Page 197e55 not present in the dump file. Type ".hh dbgerr004" for details
..
................................................................
..............................
Loading User Symbols
PEB is paged out (Peb.Ldr = 000000cd`0513b018).  Type ".hh dbgerr001" for details
Loading unloaded module list
........
For analysis of this file, run !analyze -v
0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SYSTEM_SERVICE_EXCEPTION (3b)
An exception happened while executing a system service routine.
Arguments:
Arg1: 00000000c0000005, Exception code that caused the BugCheck
Arg2: fffff8030afe2542, Address of the instruction which caused the BugCheck
Arg3: ffffce0bfaf2e3f0, Address of the context record for the exception that caused the BugCheck
Arg4: 0000000000000000, zero.

Debugging Details:
------------------


KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.mSec
    Value: 2280

    Key  : Analysis.DebugAnalysisManager
    Value: Create

    Key  : Analysis.Elapsed.mSec
    Value: 14939

    Key  : Analysis.Init.CPU.mSec
    Value: 4796

    Key  : Analysis.Init.Elapsed.mSec
    Value: 85415

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 126

    Key  : WER.OS.Branch
    Value: rs5_release

    Key  : WER.OS.Timestamp
    Value: 2018-09-14T14:34:00Z

    Key  : WER.OS.Version
    Value: 10.0.17763.1


FILE_IN_CAB:  MEMORY.DMP

BUGCHECK_CODE:  3b

BUGCHECK_P1: c0000005

BUGCHECK_P2: fffff8030afe2542

BUGCHECK_P3: ffffce0bfaf2e3f0

BUGCHECK_P4: 0

CONTEXT:  ffffce0bfaf2e3f0 -- (.cxr 0xffffce0bfaf2e3f0)
rax=ffffe78772ee6000 rbx=a787ff756b60cddd rcx=a787ff756b60cde5
rdx=ffffce0bfaf2f300 rsi=000001f55b600000 rdi=ffffe7876c3cde20
rip=fffff8030afe2542 rsp=ffffce0bfaf2ede0 rbp=00000000000001d4
 r8=0000000000000000  r9=7fffe787726cafd8 r10=7ffffffffffffffc
r11=ffffe78771bd6080 r12=ffffe78772282080 r13=ffffe7876ad78940
r14=0000000000200000 r15=000001f55b600000
iopl=0         nv up ei ng nz na pe nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00010282
virt2phys!virt2phys_process_find_block+0xa [inlined in virt2phys!virt2phys_translate+0xd2]:
fffff803`0afe2542 488b03          mov     rax,qword ptr [rbx] ds:002b:a787ff75`6b60cddd=????????????????
Resetting default scope

BLACKBOXBSD: 1 (!blackboxbsd)


BLACKBOXPNP: 1 (!blackboxpnp)


PROCESS_NAME:  vcs.exe

STACK_TEXT:  
ffffce0b`faf2ede0 fffff803`0afe15c6     : 00001878`8e199398 ffffce0b`faf2eeb9 ffffe787`71e66c60 00001878`8e199398 : virt2phys!virt2phys_translate+0xd2 [C:\local\github\dpdk-kmods\windows\virt2phys\virt2phys_logic.c @ 369] 
ffffce0b`faf2ee80 fffff80e`a28276fb     : ffffe787`6f7fc4a0 00000000`00000000 00000000`00000000 00001878`952876b8 : virt2phys!virt2phys_device_EvtIoInCallerContext+0x1c6 [C:\local\github\dpdk-kmods\windows\virt2phys\virt2phys.c @ 225] 
ffffce0b`faf2ef20 fffff80e`a282704d     : ffffe787`6f7fc4a0 ffffe787`729ebb60 00000000`00000000 00000000`00000001 : Wdf01000!FxPkgIo::DispatchStep1+0x69b [minkernel\wdf\framework\shared\irphandlers\io\fxpkgio.cpp @ 324] 
ffffce0b`faf2efe0 fffff80e`a2821b73     : ffffe787`729ebb60 ffffe787`742d3810 00000000`00000000 fffff803`07f26c16 : Wdf01000!FxPkgIo::Dispatch+0x5d [minkernel\wdf\framework\shared\irphandlers\io\fxpkgio.cpp @ 119] 
ffffce0b`faf2f040 fffff803`07ee6d19     : ffffe787`742d3810 00000000`00000001 00000000`00000000 00000000`00000002 : Wdf01000!FxDevice::DispatchWithLock+0x113 [minkernel\wdf\framework\shared\core\fxdevice.cpp @ 1430] 
ffffce0b`faf2f0a0 fffff803`08493911     : ffffce0b`faf2f440 ffffe787`729ebb60 00000000`00000001 ffffe787`742d3810 : nt!IofCallDriver+0x59
ffffce0b`faf2f0e0 fffff803`08493681     : ffffe787`00000005 ffffe787`742d3810 ffffce0b`20206f49 ffffce0b`faf2f440 : nt!IopSynchronousServiceTail+0x1b1
ffffce0b`faf2f190 fffff803`08493746     : ffffe787`71bd6080 00000000`00000000 00000000`00000000 00000000`00000000 : nt!IopXxxControlFile+0xe61
ffffce0b`faf2f2e0 fffff803`08085bf5     : ffffce0b`faf2f440 fffff803`0853286b 00000000`00000000 ffffe787`73efc700 : nt!NtDeviceIoControlFile+0x56
ffffce0b`faf2f350 00007ffa`8b450144     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x25
000000cd`052fb568 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007ffa`8b450144


FAULTING_SOURCE_LINE:  C:\local\github\dpdk-kmods\windows\virt2phys\virt2phys_logic.c

FAULTING_SOURCE_FILE:  C:\local\github\dpdk-kmods\windows\virt2phys\virt2phys_logic.c

FAULTING_SOURCE_LINE_NUMBER:  369

FAULTING_SOURCE_CODE:  
   113: 	struct virt2phys_block *cur;
   114: 
   115: 	for (node = process->blocks.Next; node != NULL; node = node->Next) {
   116: 		cur = CONTAINING_RECORD(node, struct virt2phys_block, next);
>  117:                 if (cur->mdl->StartVa == virt)
   118: 			return cur;
   119: 	}
   120: 	return NULL;
   121: }
   122: 


SYMBOL_NAME:  virt2phys!virt2phys_translate+d2

MODULE_NAME: virt2phys

IMAGE_NAME:  virt2phys.sys

STACK_COMMAND:  .cxr 0xffffce0bfaf2e3f0 ; kb

BUCKET_ID_FUNC_OFFSET:  d2

FAILURE_BUCKET_ID:  AV_virt2phys!virt2phys_translate

OS_VERSION:  10.0.17763.1

BUILDLAB_STR:  rs5_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {a5b3e3eb-f919-3986-0426-062d62e143e1}

Followup:     MachineOwner
---------
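
A note on the faulting address in the context record above, offered as a hedged observation rather than a diagnosis: rbx holds a787ff756b60cddd, which is not a canonical x64 address if 48-bit virtual addressing is assumed. Dereferencing such a pointer faults regardless of what is mapped, which may indicate that the list node pointer itself was corrupt or stale when virt2phys_process_find_block read it. The helper name `is_canonical_48` below is hypothetical and is not part of virt2phys; it only illustrates the check.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 48-bit virtual addresses (4-level paging), so an address is
 * canonical only when bits 63..47 are all zeros or all ones.
 */
static bool
is_canonical_48(uint64_t addr)
{
	uint64_t upper = addr >> 47; /* the 17 bits 63..47 */

	return upper == 0 || upper == 0x1FFFF;
}
```

Applied to the register values in the dump, `is_canonical_48(0xa787ff756b60cdddULL)` is false for rbx, while it is true for rdi (ffffe7876c3cde20) and rsi (000001f55b600000).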
Comment 1 Antoine Pollenus 2023-03-30 09:36:19 CEST
The crash happens with Intel NICs. Tested with XXV710-DA2 and E810-C.

The crash is easily reproduced with the command:

dpdk-testpmd.exe -l 0-3 -n 4 -- -i --portmask=0x1 --nb-cores=2

Sometimes the issue happens at the first launch; sometimes a few launch/stop cycles of testpmd are needed to trigger it.

The issue seems to occur during rte_eal_init.
Comment 2 Eric Houet 2023-03-30 17:24:59 CEST
Dmitry Kozlyuk, could this behaviour be due to the Windows Server 2019 limitations you mentioned in this message?

https://www.mail-archive.com/dev@dpdk.org/msg169768.html

> Testing revealed Windows Server 2019 does not allow allocating hugepage
> memory at a reserved address, despite advertised API.  So allocator has
> to temporary free the region to be allocated.  This creates in inherent
> race condition. This issue is being discussed with Microsoft privately.


If that is the case, did you receive support from Microsoft to solve this issue?
Or do you know of any workaround?

Thank you!
Comment 3 Dmitry Kozlyuk 2023-04-03 23:25:05 CEST
Marking as confirmed, because I too have seen this crash before and even analyzed a similar crash dump, but gained no insight into why this might happen.

Eric Houet, this crash is unrelated to the limitation.
I'm going to post a patch to remove it, because it's really bogus but at least one user has hit it.

A few months back, Mark Cheatham hit some "memory check failure" in virt2phys. I'm unsure if it was the same bug. He "was able to workaround it by changing line 307 in virt2phys_logic.c from:
if (info.AllocationBase != virt || info.RegionSize != size) {
to:
if (info.AllocationBase != virt || info.RegionSize < size) {".

Can you try this? If it helps, we can probably ask Microsoft experts for help explaining this behavior.
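
To spell out the difference between the two conditions, here is a minimal user-mode sketch. It is not the driver code: `struct region_info`, `rejects_strict`, and `rejects_relaxed` are hypothetical names, and the struct only stands in for the two fields of `info` that the check reads. The strict form rejects any block whose region size differs from the requested size; the relaxed form rejects only when the region is smaller than requested.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two fields of `info` used by the check. */
struct region_info {
	void *allocation_base;
	size_t region_size;
};

/* Original condition: true (reject) unless base and size match exactly. */
static bool
rejects_strict(const struct region_info *info, void *virt, size_t size)
{
	return info->allocation_base != virt || info->region_size != size;
}

/* Workaround condition: true (reject) only if the region is too small. */
static bool
rejects_relaxed(const struct region_info *info, void *virt, size_t size)
{
	return info->allocation_base != virt || info->region_size < size;
}
```

For example, if the reported region is 4 MB and the caller asks for 2 MB at the same base, the strict check rejects the request while the relaxed one accepts it. Both still reject a mismatched base or a region smaller than requested.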
Comment 4 Antoine Pollenus 2023-04-05 10:20:44 CEST
The small patch you sent us does not fix the BSOD.
Comment 5 Antoine Pollenus 2023-04-07 10:06:19 CEST
Hello Dmitry,

Do you know how we could make progress on this issue?
Comment 6 Dmitry Kozlyuk 2023-09-12 13:45:01 CEST
Hello Antoine, please check if this patch solves the issue:
http://patchwork.dpdk.org/project/dpdk/patch/20230912111759.1502806-1-ming3.li@intel.com/
Comment 7 Antoine Pollenus 2023-09-12 13:48:20 CEST
Hi Dmitry,

Thank you for the help on this issue.

I'll take a look at it as soon as I can, most likely this week or next week.
