Bug 1386

Summary: [dpdk-24.03] [ABI][meson test] driver-tests/link_bonding_autotest test failed: Segmentation fault when doing ABI testing
Product: DPDK
Reporter: jiang,yu (yux.jiang)
Component: ethdev
Assignee: Jerin (jerinjacobk)
Status: RESOLVED INVALID
Severity: normal
CC: david.marchand, ferruh.yigit, hailinx.xu, jerinjacobk
Priority: Normal    
Version: unspecified   
Target Milestone: ---   
Hardware: All   
OS: All   

Description jiang,yu 2024-02-28 04:18:48 CET
[Environment]

DPDK version: 92c0ad70ca version: 24.03-rc1
OS: RHEL9.0/5.14.0-70.13.1.el9_0.x86_64
Compiler: gcc version 11.2.1
Hardware platform: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
NIC hardware: Ethernet Controller XL710 for 40GbE QSFP+ 1583
NIC firmware: 
driver: i40e
version: 2.24.6
firmware-version: 9.40 0x8000ece4 1.3429.0

[Test Setup]
Steps to reproduce

1, Build latest main dpdk24.03-rc1
rm -rf x86_64-native-linuxapp-gcc
CC=gcc meson -Denable_kmods=True -Dlibdir=lib  --default-library=shared x86_64-native-linuxapp-gcc
ninja -C x86_64-native-linuxapp-gcc
rm -rf /root/tmp/dpdk_share_lib /root/shared_lib_dpdk
DESTDIR=/root/tmp/dpdk_share_lib ninja -C x86_64-native-linuxapp-gcc -j 110 install
mv /root/tmp/dpdk_share_lib/usr/local/lib /root/shared_lib_dpdk
ll /root/shared_lib_dpdk
cat /root/.bashrc | grep LD_LIBRARY_PATH
sed -i 's#export LD_LIBRARY_PATH=.*#export LD_LIBRARY_PATH=/root/shared_lib_dpdk#g' /root/.bashrc

2, Build LTS dpdk23.11.0
rm /root/dpdk
tar zxvf dpdk_abi.tar.gz -C ~
cd ~/dpdk/
rm -rf x86_64-native-linuxapp-gcc
CC=gcc meson -Denable_kmods=True -Dlibdir=lib  --default-library=shared x86_64-native-linuxapp-gcc
ninja -C x86_64-native-linuxapp-gcc
rm -rf x86_64-native-linuxapp-gcc/lib
rm -rf x86_64-native-linuxapp-gcc/drivers

3, Bind nic
rmmod vfio_pci
rmmod vfio_iommu_type1
rmmod vfio
modprobe vfio
modprobe vfio-pci
usertools/dpdk-devbind.py --force --bind=vfio-pci 0000:18:00.0 0000:1a:00.0

4, Launch dpdk-test and run link_bonding_autotest
x86_64-native-linuxapp-gcc/app/dpdk-test -c 0xff -d /root/shared_lib_dpdk -a 0000:18:00.0 -a 0000:1a:00.0
RTE>>link_bonding_autotest
 
Show the output from the previous commands.
[root@ABI-80 dpdk]# x86_64-native-linuxapp-gcc/app/dpdk-test -c 0xff -d /root/shared_lib_dpdk -a 0000:18:00.0 -a 0000:1a:00.0
EAL: Detected CPU lcores: 112
EAL: Detected NUMA nodes: 2
EAL: Detected shared linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: VFIO support initialized
EAL: Using IOMMU type 1 (Type 1)
EAL: Ignore mapping IO port bar(1)
EAL: Ignore mapping IO port bar(4)
EAL: Probe PCI driver: net_i40e (8086:1583) device: 0000:18:00.0 (socket 0)
i40e_GLQF_reg_init(): i40e device 0000:18:00.0 changed global register [0x002689a0]. original: 0x00000021, new: 0x00000029
EAL: Ignore mapping IO port bar(1)
EAL: Ignore mapping IO port bar(4)
EAL: Probe PCI driver: net_i40e (8086:1583) device: 0000:1a:00.0 (socket 0)
i40e_GLQF_reg_init(): i40e device 0000:1a:00.0 changed global register [0x002689a0]. original: 0x00000021, new: 0x00000029
TELEMETRY: No legacy callbacks, legacy socket not created
APP: HPET is not enabled, using TSC as default timer
RTE>>link_bonding_autotest
 + ------------------------------------------------------- +
 + Test Suite : Link Bonding Unit Test Suite
Segmentation fault (core dumped)

[Expected Result]
Test ok.

[Regression]
Is this issue a regression: (Y/N) Y
The first bad commit:
commit d4b9235f95de4f46f368627af256ed8080f20d65
Author: Jerin Jacob <jerinj@marvell.com>
Date:   Thu Jan 18 15:17:42 2024 +0530

    ethdev: add Tx queue used count query

    Introduce a new API to retrieve the number of used descriptors
    in a Tx queue. Applications can leverage this API in the fast path to
    inspect the Tx queue occupancy and take appropriate actions based on the
    available free descriptors.

    A notable use case could be implementing Random Early Discard (RED)
    in software based on Tx queue occupancy.

    Signed-off-by: Jerin Jacob <jerinj@marvell.com>
    Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
    Acked-by: Morten Brørup <mb@smartsharesystems.com>
    Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
    Reviewed-by: Ferruh Yigit <ferruh.yigit@amd.com>
Comment 1 David Marchand 2024-02-29 10:40:06 CET
Could you provide the backtrace of this crash?
Comment 2 jiang,yu 2024-03-01 03:12:22 CET
[root@ABI-80 dpdk]# dmesg
[321775.556832] vfio-pci 0000:18:00.0: Masking broken INTx support
[321775.556908] vfio-pci 0000:18:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0
[321775.844843] vfio-pci 0000:1a:00.0: Masking broken INTx support
[321775.844947] vfio-pci 0000:1a:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0
[321781.820113] dpdk-test[2298656]: segfault at 50 ip 0000000000453d31 sp 00007ffdeefcda90 error 6 in dpdk-test[416000+1f6000]
[321781.820124] Code: f7 41 89 45 58 49 8d 4d 10 49 8d 44 24 10 49 89 47 58 49 89 4c 24 30 48 89 44 24 10 ba 01 00 00 00 49 8b 47 40 48 89 4c 24 18 <66> 89 50 50 b9 01 00 00 00 31 d2 49 8b 47 40 be 06 00 00 00 66 89
Comment 3 Xu,Hailin 2024-03-11 07:03:01 CET
Is there any progress on this issue?
Comment 5 Jerin 2024-03-11 07:19:14 CET
Looks like an i40e-specific issue. Can you reproduce it with any other vdev or HW PMD?
Comment 6 jiang,yu 2024-03-11 09:53:53 CET
(In reply to Jerin from comment #5)
> Looks like i40e specific issue, can you reproduce with anything vdev or HW
> PMD

This is not an i40e-specific issue; it also reproduces with an ice NIC.

dmesg:
[269453.056342] vfio-pci 0000:4b:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0
[269453.056347] vfio-pci 0000:4b:00.0: vfio_ecap_init: hiding ecap 0x25@0x200
[269453.056349] vfio-pci 0000:4b:00.0: vfio_ecap_init: hiding ecap 0x26@0x210
[269453.056351] vfio-pci 0000:4b:00.0: vfio_ecap_init: hiding ecap 0x27@0x250
[269453.277824] vfio-pci 0000:4b:11.0: enabling device (0000 -> 0002)
[269453.600299] vfio-pci 0000:4b:11.0: enabling device (0000 -> 0002)
[269458.470533] dpdk-test[2956468]: segfault at 50 ip 0000563f05359a25 sp 00007fff5230ead0 error 6 in dpdk-test[563f05313000+220000]
[269458.470544] Code: f7 41 89 45 58 49 8d 4d 10 49 8d 44 24 10 49 89 46 58 49 89 4c 24 30 48 89 44 24 10 ba 01 00 00 00 49 8b 46 40 48 89 4c 24 18 <66> 89 50 50 b9 01 00 00 00 31 d2 49 8b 46 40 be 06 00 00 00 66 89

OS:Ubuntu22.04.3 LTS/5.15.0-91-generic/gcc version 11.4.0
Comment 7 Jerin 2024-03-26 12:49:52 CET
The issue is due to the following change [1].

app/test/virtual_pmd.c uses the internal struct rte_eth_dev via
virtual_ethdev_create(), which in turn is used by app/test/test_link_bonding.c.
So this test case is not valid for ABI testing, as it relies on an internal structure.


[1]
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index b482cd12bb..f05f68a67c 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -58,6 +58,8 @@ struct rte_eth_dev {
        eth_rx_queue_count_t rx_queue_count;
        /** Check the status of a Rx descriptor */
        eth_rx_descriptor_status_t rx_descriptor_status;
+       /** Get the number of used Tx descriptors */
+       eth_tx_queue_count_t tx_queue_count;
        /** Check the status of a Tx descriptor */
        eth_tx_descriptor_status_t tx_descriptor_status;
        /** Pointer to PMD transmit mbufs reuse function */
Comment 8 Ferruh YIGIT 2024-03-26 13:34:56 CET
How can I reproduce the issue? The 'link_bonding_autotest' test passes for me.

And which function/test fails in 'app/test/test_link_bonding.c'?
Comment 9 jiang,yu 2024-04-25 05:06:41 CEST
(In reply to Ferruh YIGIT from comment #8)
> How can I reproduce the issue, 'link_bonding_autotest' test passes for me?
> 
> And which function/test fails in 'app/test/test_link_bonding.c'?

Hi Ferruh, this is ABI testing. For the steps, please refer to the description above.
Comment 10 jiang,yu 2024-04-25 05:08:58 CEST
(In reply to Jerin from comment #7)
> The issue is due to the following change[1]
> 
> app/test/virtual_pmd.c is using internal struct rte_eth_dev struct via
> virtual_ethdev_create() and it's used by app/test/test_link_bonding.c.
> So this test case is not valid for testing the ABI as it is using internal
> structure.
> 
> 
> [1]
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index b482cd12bb..f05f68a67c 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -58,6 +58,8 @@ struct rte_eth_dev {
>         eth_rx_queue_count_t rx_queue_count;
>         /** Check the status of a Rx descriptor */
>         eth_rx_descriptor_status_t rx_descriptor_status;
> +       /** Get the number of used Tx descriptors */
> +       eth_tx_queue_count_t tx_queue_count;
>         /** Check the status of a Tx descriptor */
>         eth_tx_descriptor_status_t tx_descriptor_status;
>         /** Pointer to PMD transmit mbufs reuse function */

Hi Jerin, do you mean that we should close this Bugzilla and, starting from the bad commit, no longer run this case in ABI testing?
Comment 11 Ferruh YIGIT 2024-04-25 13:42:25 CEST
(In reply to jiang,yu from comment #9)
> (In reply to Ferruh YIGIT from comment #8)
> > How can I reproduce the issue, 'link_bonding_autotest' test passes for me?
> > 
> > And which function/test fails in 'app/test/test_link_bonding.c'?
> 
> Hi Ferruh, this is testing ABI. Steps you can refer to the above description.
> 

I overlooked that this is ABI testing.
Then I agree with Jerin: since 'link_bonding_autotest' uses an internal API, it is not suitable for ABI testing. Otherwise, any change to an internal structure, like "struct rte_eth_dev" in this case, will show up as an ABI issue.

+1 to remove this test for ABI testing.
Comment 12 jiang,yu 2024-04-26 03:58:02 CEST
Thanks Jerin and Ferruh.
Closing this Bugzilla according to Jerin's and Ferruh's input.