Bug 13 - Cannot initialize Intel XL710 40G interface
Summary: Cannot initialize Intel XL710 40G interface
Status: RESOLVED INVALID
Alias: None
Product: DPDK
Classification: Unclassified
Component: ethdev
Version: 17.11
Hardware: x86 Linux
Importance: Normal normal
Target Milestone: ---
Assignee: beilei.xing
URL:
Depends on:
Blocks:
 
Reported: 2018-01-31 14:50 CET by gabor.sandor.enyedi
Modified: 2020-12-25 04:12 CET
CC List: 4 users



Attachments
coredump at fail (792.19 KB, application/x-lzma)
2018-12-04 14:35 CET, gabor.sandor.enyedi

Description gabor.sandor.enyedi 2018-01-31 14:50:06 CET
We have an AMD Threadripper 1950X, w/ 2 NUMA nodes. I cannot initialize our XL710 cards in the slots belonging to NUMA node #0. In the slots belonging to NUMA node #1 everything seems to be OK. I get this at initialization:

EAL: Detected 32 lcore(s)
EAL: Debug dataplane logs available - lower performance
EAL: Probing VFIO support...
EAL: PCI device 0000:09:00.0 on NUMA socket 0
EAL:   probe driver: 8086:1583 net_i40e
eth_i40e_dev_init(): Failed to init adminq: -54
EAL: Requested device 0000:09:00.0 cannot be used
EAL: PCI device 0000:09:00.1 on NUMA socket 0
EAL:   probe driver: 8086:1583 net_i40e
eth_i40e_dev_init(): Failed to init adminq: -54
EAL: Requested device 0000:09:00.1 cannot be used
EAL: PCI device 0000:41:00.0 on NUMA socket 1
EAL:   probe driver: 8086:1583 net_i40e
EAL: PCI device 0000:41:00.1 on NUMA socket 1
EAL:   probe driver: 8086:1583 net_i40e

I.e. the 2 ports on the card belonging to NUMA #1 were initialized correctly (41:00.x), but the other two fail.
This machine can "hide" its NUMA behavior via a BIOS setting (in that case I see one single NUMA node w/ 16 cores), but this feature does not help either. I get the same result whether I have 1 or 2 XL710 cards in the box, and also with the latest DPDK from git. Using the kernel driver, everything seems to be OK.
What tests can I do to help find the bug?
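
For reference, one quick way to confirm which NUMA node the kernel assigns to each slot is to read sysfs. A minimal C sketch (the helper name is illustrative; the hard-coded PCI addresses are the ones from the log above):
----------------------------------------------------------------------------
#include <stdio.h>

/* Read the kernel-reported NUMA node of a PCI device from sysfs.
 * Returns -1 if the file is missing or unreadable. */
static int pci_numa_node(const char *bdf)
{
    char path[128];
    int node = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", bdf);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &node) != 1)
        node = -1;
    fclose(f);
    return node;
}

int main(void)
{
    /* The failing and the working ports from the log above. */
    printf("0000:09:00.0 -> node %d\n", pci_numa_node("0000:09:00.0"));
    printf("0000:41:00.0 -> node %d\n", pci_numa_node("0000:41:00.0"));
    return 0;
}
----------------------------------------------------------------------------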
Comment 1 Qian 2018-02-07 11:01:09 CET
Are you using an X710 (4x10G)? Please write down your firmware version and kernel version.
How can you put one card into 2 NUMA nodes?
Comment 2 gabor.sandor.enyedi 2018-02-07 11:11:27 CET
I'm trying to use two separate XL710 cards in two sockets. But I still cannot initialize the NIC in the socket belonging to NUMA node #0, even if I remove the other card, so this does not really matter.
Here is the info you asked for (summary: Ubuntu, 4.13.0-31-generic, firmware: 5.05 0x80002924 1.1313.0):

egboeny@leto:~$ uname -a
Linux leto 4.13.0-31-generic #34-Ubuntu SMP Fri Jan 19 16:34:46 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
egboeny@leto:~$ ethtool -i enp9s0f0
driver: i40e
version: 2.1.14-k
firmware-version: 5.05 0x80002924 1.1313.0
expansion-rom-version: 
bus-info: 0000:09:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
egboeny@leto:~$ ethtool -i enp65s0f0
driver: i40e
version: 2.1.14-k
firmware-version: 5.05 0x80002924 1.1313.0
expansion-rom-version: 
bus-info: 0000:41:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
egboeny@leto:~$
Comment 3 Qian 2018-02-08 04:08:33 CET
So in summary, the card can't initialize in socket 0. Could you try other NICs in socket 0 with DPDK? The kernel doesn't perform an EAL check, but DPDK needs EAL initialization. If it's a NUMA-related issue, maybe you need to check with AMD support.
Comment 4 Ajit Khaparde 2018-07-13 19:06:37 CEST
Do you have any update on this? Do we need to keep this open?
Comment 5 gabor.sandor.enyedi 2018-07-17 10:49:11 CEST
(In reply to Ajit Khaparde from comment #4)
> Do you have any update on this? Do we need to keep this open?

No update; I just checked that the bug is still there with the latest DPDK from git (18.05+). Of course you can close the bug, but that will not fix the problem.
Let me know if you need more info, a coredump, or anything else.
Comment 6 Ajit Khaparde 2018-07-17 19:29:27 CEST
(In reply to gabor.sandor.enyedi from comment #5)
> (In reply to Ajit Khaparde from comment #4)
> > Do you have any update on this? Do we need to keep this open?
> 
> No update; I just checked that the bug is still there with the latest DPDK
> from git (18.05+). Of course you can close the bug, but that will not fix
> the problem.
> Let me know if you need more info, a coredump, or anything else.

Thanks for confirming. I assigned the bug to the maintainer of the i40e PMD.
Comment 7 Ferruh YIGIT 2018-11-29 15:49:58 CET
Hi Gabor,

Since we don't have the platform (AMD Threadripper 1950X w/ 2 NUMA nodes) in our test environment, we can't reproduce the issue or investigate further.

Can you please help us here and provide more information?

From the log I can see admin queue initialization fails:
eth_i40e_dev_init(): Failed to init adminq: -54

The first thing I can think of is the stack trace; can you please send the stack trace from the point where the error happened? Analyzing the parameters may give some idea about the issue.

Btw, just to eliminate a possible issue with the specific NIC, did you test swapping the NICs?

Thanks,
ferruh
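
One low-effort way to capture such a stack trace without killing the process is glibc's backtrace facility. A minimal sketch, assuming Linux/glibc (in practice print_trace() would be called at the failure point rather than from main; it is called from main here only so the sketch runs):
----------------------------------------------------------------------------
#include <execinfo.h>

/* Print the current call chain to stderr. Build with -g -rdynamic so
 * that function names show up in the output. */
static void print_trace(void)
{
    void *frames[32];
    int n = backtrace(frames, 32);

    backtrace_symbols_fd(frames, n, 2 /* stderr */);
}

int main(void)
{
    print_trace();
    return 0;
}
----------------------------------------------------------------------------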
Comment 8 gabor.sandor.enyedi 2018-12-04 14:35:42 CET
Created attachment 21
coredump at fail

I used DPDK release 18.08. Since the code does not crash on its own, I forced a crash at the point of the problem by adding *((int*)0) = 42; at drivers/net/i40e/base/i40e_adminq.c:1012 (the modified file is attached). I have also attached the core dump, the binary, and a simple main.cpp that I used to build a minimal reproducer.
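
A less destructive alternative to the NULL write, sketched here with DPDK's EAL debug helpers (the wrapper function is hypothetical, not the driver's actual code at that line):
----------------------------------------------------------------------------
#include <rte_debug.h>

/* Hypothetical wrapper: at the point where the timeout is detected,
 * dump a backtrace and then abort so a core file is still produced. */
static void report_adminq_timeout(void)
{
    rte_dump_stack();                     /* backtrace to stderr */
    rte_panic("adminq init timed out\n"); /* abort with a core dump */
}
----------------------------------------------------------------------------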
Comment 9 gabor.sandor.enyedi 2018-12-04 14:47:27 CET
(In reply to Ferruh YIGIT from comment #7)
> Hi Gabor,
> 
> Since we don't have the platform (AMD Threadripper 1950X w/ 2 NUMA nodes)
> in our test environment, we can't reproduce the issue or investigate
> further.
> 
> Can you please help us here and provide more information?
> 
> From the log I can see admin queue initialization fails:
> eth_i40e_dev_init(): Failed to init adminq: -54
> 
> The first thing I can think of is the stack trace; can you please send the
> stack trace from the point where the error happened? Analyzing the
> parameters may give some idea about the issue.
> 
> Btw, just to eliminate a possible issue with the specific NIC, did you test
> swapping the NICs?
> 
> Thanks,
> ferruh

Hi,

I added a coredump rather than a stack trace (see my comment above). I think the problem is at drivers/net/i40e/base/i40e_adminq.c:1012, i.e. a timeout. Of course, waiting longer by hand (e.g. while debugging) does not help. It is as if the device somehow does not process the initialization command. The same works OK w/ the other socket.
The NIC seems to be OK w/ the kernel driver. I tried another XL710 NIC in the same PCIe slot, with no success. I also tried --socket-limit 2048,0 to eliminate the chance of allocating memory from the other NUMA node (I checked the result in /proc too), but the result was exactly the same again.
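
For readers following along: -54 corresponds to I40E_ERR_ADMIN_QUEUE_TIMEOUT in the i40e base code, i.e. the firmware never acknowledged the admin queue command. A self-contained sketch of the general poll-until-timeout pattern involved (names, delays, and the retry budget are illustrative, not the driver's real code):
----------------------------------------------------------------------------
#include <stdio.h>
#include <unistd.h>

#define ERR_ADMINQ_TIMEOUT (-54) /* mirrors I40E_ERR_ADMIN_QUEUE_TIMEOUT */
#define MAX_POLLS 10

/* Stand-in for the register read that checks whether firmware has
 * processed the command; always "not done" here, so the sketch takes
 * the timeout path seen on NUMA node #0. */
static int firmware_done(void)
{
    return 0;
}

static int init_adminq_sketch(void)
{
    int i;

    for (i = 0; i < MAX_POLLS; i++) {
        if (firmware_done())
            return 0;  /* firmware answered in time */
        usleep(1000);  /* delay between polls */
    }
    return ERR_ADMINQ_TIMEOUT; /* firmware never responded */
}

int main(void)
{
    printf("init returned %d\n", init_adminq_sketch());
    return 0;
}
----------------------------------------------------------------------------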
BR,

Gabor
Comment 10 Steve Yang 2020-12-25 04:12:54 CET
Hi Gabor,

Since we don't have the same platform as you, I've tested a similar case on an Intel platform with 2 NUMA nodes; refer to the following validation info:
----------------------------------------------------------------------------
     OS platform: Linux 4.15.0-128-generic #131-Ubuntu SMP
    DPDK-version: dpdk-stable v20.11(b1d36cf82)
          Driver: i40e
         Version: 2.1.14-k
Firmware-version: 6.01 0x800035ef 1.1747.0
----------------------------------------------------------------------------

Test Result: Passed
----------------------------------------------------------------------------
root@igb_uio:/home/ubuntu/repo/dpdk-master/build-out/app# ./dpdk-testpmd -l 12-13 -n 4 -- -i --nb-cores=1 --txd=1024 --rxd=1024
EAL: Detected 112 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: Probing VFIO support...
EAL: Probe PCI driver: rawdev_ioat (8086:2021) device: 0000:00:04.0 (socket 0)
ioat_rawdev_probe(): Init 0000:00:04.0 on NUMA node 0
EAL: Probe PCI driver: rawdev_ioat (8086:2021) device: 0000:00:04.1 (socket 0)
ioat_rawdev_probe(): Init 0000:00:04.1 on NUMA node 0
EAL: Probe PCI driver: net_i40e (8086:1584) device: 0000:18:00.0 (socket 0)
EAL: No legacy callbacks, legacy socket not created
Interactive-mode selected
testpmd: create a new mbuf pool <mb_pool_0>: n=155456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc

Warning! port-topology=paired and odd forward ports number, the last port will pair with itself.

Configuring Port 0 (socket 0)
Port 0: 3C:FD:FE:E3:9D:88
Checking link statuses...
Done
testpmd>
----------------------------------------------------------------------------

So this does not appear to be a DPDK bug.

Thanks & Regards,
Steve Yang.
