We have an AMD Threadripper 1950X with 2 NUMA nodes. I cannot initialize our XL710 cards in the slots belonging to NUMA node #0; in the slots belonging to NUMA node #1 they seem to be OK. I get this at initialization:

EAL: Detected 32 lcore(s)
EAL: Debug dataplane logs available - lower performance
EAL: Probing VFIO support...
EAL: PCI device 0000:09:00.0 on NUMA socket 0
EAL:   probe driver: 8086:1583 net_i40e
eth_i40e_dev_init(): Failed to init adminq: -54
EAL: Requested device 0000:09:00.0 cannot be used
EAL: PCI device 0000:09:00.1 on NUMA socket 0
EAL:   probe driver: 8086:1583 net_i40e
eth_i40e_dev_init(): Failed to init adminq: -54
EAL: Requested device 0000:09:00.1 cannot be used
EAL: PCI device 0000:41:00.0 on NUMA socket 1
EAL:   probe driver: 8086:1583 net_i40e
EAL: PCI device 0000:41:00.1 on NUMA socket 1
EAL:   probe driver: 8086:1583 net_i40e

I.e. the two ports of the card belonging to NUMA node #1 were initialized correctly (41:00.x), but the other two fail. This machine can "hide" its NUMA topology via a BIOS setting (in that case I see a single NUMA node with 16 cores), but that feature does not help either. I get the same result whether I have 1 or 2 XL710 cards in the box, and also with the latest DPDK from git. Using the kernel driver, everything seems to be OK. What tests can I do to help find the bug?
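One way to double-check how the kernel maps those slots to NUMA nodes (independently of the BIOS NUMA-hiding setting) is to read the per-device sysfs attribute. A minimal sketch, using the PCI addresses from the log above; the helper name is my own, not part of any tool:

```shell
# pci_numa_node SYSFS_BASE PCI_ADDR: print the NUMA node the kernel
# records for a PCI device. On a real system the base directory is
# /sys/bus/pci/devices; a value of -1 means "no NUMA affinity known".
pci_numa_node() {
    node_file="$1/$2/numa_node"
    if [ -r "$node_file" ]; then
        echo "$2 -> NUMA node $(cat "$node_file")"
    else
        echo "$2 -> no sysfs entry"
    fi
}

# PCI addresses taken from the EAL log above; adjust to your system.
for dev in 0000:09:00.0 0000:09:00.1 0000:41:00.0 0000:41:00.1; do
    pci_numa_node /sys/bus/pci/devices "$dev"
done
```

If the kernel reports the 09:00.x ports on node 0 here but EAL behaves differently, that mismatch itself would be useful debugging information.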
Are you using an X710 (4x10G)? Please write down your firmware version and kernel version. How can you put one card into 2 NUMA nodes?
I'm trying to use two separate XL710 cards in two sockets. But I still cannot initialize the NIC in the slot belonging to NUMA node #0, even if I remove the other card, so this does not really matter. Here is the info you asked for (summary: Ubuntu, 4.13.0-31-generic, firmware 5.05 0x80002924 1.1313.0):

egboeny@leto:~$ uname -a
Linux leto 4.13.0-31-generic #34-Ubuntu SMP Fri Jan 19 16:34:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
egboeny@leto:~$ ethtool -i enp9s0f0
driver: i40e
version: 2.1.14-k
firmware-version: 5.05 0x80002924 1.1313.0
expansion-rom-version:
bus-info: 0000:09:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
egboeny@leto:~$ ethtool -i enp65s0f0
driver: i40e
version: 2.1.14-k
firmware-version: 5.05 0x80002924 1.1313.0
expansion-rom-version:
bus-info: 0000:41:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
egboeny@leto:~$
So in summary, the card can't initialize in socket 0. Could you try other NICs in socket 0 with DPDK? The kernel driver doesn't go through EAL, but DPDK needs EAL initialization. If it's a NUMA-related issue, you may need to check with AMD support.
Do you have any update on this? Do we need to keep this open?
(In reply to Ajit Khaparde from comment #4)
> Do you have any update on this? Do we need to keep this open?

No update; I just checked and the bug is still there with the latest DPDK from git (18.05+). Of course you can close the bug, but that will not fix the problem. Let me know if you need more info, a coredump, or anything else.
(In reply to gabor.sandor.enyedi from comment #5)
> (In reply to Ajit Khaparde from comment #4)
> > Do you have any update on this? Do we need to keep this open?
>
> No update, just checked the bug is still there with the latest DPDK from git
> (18.05+). Of course you can close the bug, but it will not fix the problem.
> Let me know, if you need more info/coredump whatever.

Thanks for confirming. I have assigned the bug to the maintainer of the i40e PMD.
Hi Gabor,

Since we don't have the platform (AMD Threadripper 1950X with 2 NUMA nodes) in our test environment, we can't reproduce or work further on the issue.

Can you please help us here and provide more information?

From the log I can see the admin queue initialization fails:
eth_i40e_dev_init(): Failed to init adminq: -54

The first thing I can think of is the stack trace: can you please send the stack trace of the point where the error happens? Analyzing the parameters may give some idea about the issue.

Btw, just to eliminate a possible issue with the specific NIC, did you test with the NICs swapped?

Thanks,
ferruh
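For reference, one way to capture such a stack trace without patching the driver is to run the application under gdb in batch mode, breaking on the init function from the log. A sketch that only composes the gdb command line; the helper name and the application name are placeholders, not from the report:

```shell
# trace_adminq_cmd APP: compose a gdb batch invocation that breaks on
# eth_i40e_dev_init(), runs APP, and prints a backtrace when the
# breakpoint is hit (from there one can step to the failing return).
trace_adminq_cmd() {
    printf 'gdb --batch -ex "break eth_i40e_dev_init" -ex run -ex backtrace --args %s\n' "$1"
}

# Example (application name is a placeholder):
trace_adminq_cmd ./testpmd
```

Running the printed command against a debug build of the reproducer should give the requested trace.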
Created attachment 21 [details]
coredump at fail

I used DPDK release 18.08. Since the code is not crashing, I made it crash where the problem is by adding

*((int*)0) = 42;

at drivers/net/i40e/base/i40e_adminq.c:1012 (the modified file is attached). I have also attached the core dump, the binary, and a simple main.cpp, which I used to compile a minimal reproducer.
(In reply to Ferruh YIGIT from comment #7)
> Hi Gabor,
>
> Since we don't have the platform, AMD Threadripper 1950X, w/ 2 NUMA nodes,
> in our test environment we can't reproduce or work more on the issue.
>
> Can you please help us here and provide more information?
>
> From the log I can see admin queue initialization fails:
> eth_i40e_dev_init(): Failed to init adminq: -54
>
> First thing I can think of is the stack trace, can you please send the stack
> trace of the point where the error happened. Analyzing the parameters may
> give some idea about the issue.
>
> btw, just to eliminate the possible issue with the specific NIC, did you
> test with swapping the NICs?
>
> Thanks,
> ferruh

Hi,

I added a coredump rather than a stack trace (see my comment above). I think the problem is at drivers/net/i40e/base/i40e_adminq.c:1012, i.e. a timeout; waiting by hand (e.g. while debugging) does not help either. It is as if the device somehow does not process the initialization command. The same code works on the other socket, and the NIC seems to be fine with the kernel driver. I tried another XL710 NIC in the same PCIe slot, without success. I also tried --socket-limit 2048,0 to eliminate the chance of allocating memory from the other NUMA node (I verified the result in /proc too), but the result was exactly the same.

BR,
Gabor
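As an aside, the per-node placement checked in /proc above can be counted mechanically from a process's numa_maps, whose lines carry "N<node>=<pages>" fields. A small sketch; the helper name is mine, and on a live system the file would be /proc/<pid>/numa_maps:

```shell
# count_node_mappings FILE NODE: count mappings in a numa_maps-style file
# that have pages resident on NUMA node NODE.
count_node_mappings() {
    grep -c "N$2=" "$1"
}
```

With all EAL memory limited to node 0, the count for node 1 over the application's hugepage mappings would be expected to stay at zero.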
Hi Gabor,

Since we don't have the same platform as you, I tested a similar case on an Intel platform with 2 NUMA nodes; refer to the following validation info:

----------------------------------------------------------------------------
OS platform:      Linux 4.15.0-128-generic #131-Ubuntu SMP
DPDK version:     dpdk-stable v20.11 (b1d36cf82)
Driver:           i40e
Version:          2.1.14-k
Firmware-version: 6.01 0x800035ef 1.1747.0
----------------------------------------------------------------------------
Test Result: Passed
----------------------------------------------------------------------------
root@igb_uio:/home/ubuntu/repo/dpdk-master/build-out/app# ./dpdk-testpmd -l 12-13 -n 4 -- -i --nb-cores=1 --txd=1024 --rxd=1024
EAL: Detected 112 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: Probing VFIO support...
EAL: Probe PCI driver: rawdev_ioat (8086:2021) device: 0000:00:04.0 (socket 0)
ioat_rawdev_probe(): Init 0000:00:04.0 on NUMA node 0
EAL: Probe PCI driver: rawdev_ioat (8086:2021) device: 0000:00:04.1 (socket 0)
ioat_rawdev_probe(): Init 0000:00:04.1 on NUMA node 0
EAL: Probe PCI driver: net_i40e (8086:1584) device: 0000:18:00.0 (socket 0)
EAL: No legacy callbacks, legacy socket not created
Interactive-mode selected
testpmd: create a new mbuf pool <mb_pool_0>: n=155456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Warning! port-topology=paired and odd forward ports number, the last port will pair with itself.
Configuring Port 0 (socket 0)
Port 0: 3C:FD:FE:E3:9D:88
Checking link statuses...
Done
testpmd>
----------------------------------------------------------------------------

So the issue does not reproduce here, and it does not look like a DPDK bug on this platform.

Thanks & Regards,
Steve Yang.