Bug 1386 - [dpdk-24.03] [ABI][meson test] driver-tests/link_bonding_autotest test failed: Segmentation fault when doing ABI testing
Status: RESOLVED INVALID
Alias: None
Product: DPDK
Classification: Unclassified
Component: ethdev
Version: unspecified
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assignee: Jerin
URL:
Depends on:
Blocks:
 
Reported: 2024-02-28 04:18 CET by jiang,yu
Modified: 2024-04-26 03:58 CEST

Description jiang,yu 2024-02-28 04:18:48 CET
[Environment]

DPDK version: 24.03-rc1 (commit 92c0ad70ca)
OS: RHEL9.0/5.14.0-70.13.1.el9_0.x86_64
Compiler: gcc version 11.2.1
Hardware platform: Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
NIC hardware: Ethernet Controller XL710 for 40GbE QSFP+ 1583
NIC firmware: 
driver: i40e
version: 2.24.6
firmware-version: 9.40 0x8000ece4 1.3429.0

[Test Setup]
Steps to reproduce:

1, Build latest main dpdk24.03-rc1
rm -rf x86_64-native-linuxapp-gcc
CC=gcc meson -Denable_kmods=True -Dlibdir=lib  --default-library=shared x86_64-native-linuxapp-gcc
ninja -C x86_64-native-linuxapp-gcc
rm -rf /root/tmp/dpdk_share_lib /root/shared_lib_dpdk
DESTDIR=/root/tmp/dpdk_share_lib ninja -C x86_64-native-linuxapp-gcc -j 110 install
mv /root/tmp/dpdk_share_lib/usr/local/lib /root/shared_lib_dpdk
ll /root/shared_lib_dpdk
cat /root/.bashrc | grep LD_LIBRARY_PATH
sed -i 's#export LD_LIBRARY_PATH=.*#export LD_LIBRARY_PATH=/root/shared_lib_dpdk#g' /root/.bashrc

2, Build LTS dpdk23.11.0
rm /root/dpdk
tar zxvf dpdk_abi.tar.gz -C ~
cd ~/dpdk/
rm -rf x86_64-native-linuxapp-gcc
CC=gcc meson -Denable_kmods=True -Dlibdir=lib  --default-library=shared x86_64-native-linuxapp-gcc
ninja -C x86_64-native-linuxapp-gcc
rm -rf x86_64-native-linuxapp-gcc/lib
rm -rf x86_64-native-linuxapp-gcc/drivers

3, Bind nic
rmmod vfio_pci
rmmod vfio_iommu_type1
rmmod vfio
modprobe vfio
modprobe vfio-pci
usertools/dpdk-devbind.py --force --bind=vfio-pci 0000:18:00.0 0000:1a:00.0

4, Launch dpdk-test and run link_bonding_autotest
x86_64-native-linuxapp-gcc/app/dpdk-test -c 0xff -d /root/shared_lib_dpdk -a 0000:18:00.0 -a 0000:1a:00.0
RTE>>link_bonding_autotest
 
Output of the previous commands:
[root@ABI-80 dpdk]# x86_64-native-linuxapp-gcc/app/dpdk-test -c 0xff -d /root/shared_lib_dpdk -a 0000:18:00.0 -a 0000:1a:00.0
EAL: Detected CPU lcores: 112
EAL: Detected NUMA nodes: 2
EAL: Detected shared linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: VFIO support initialized
EAL: Using IOMMU type 1 (Type 1)
EAL: Ignore mapping IO port bar(1)
EAL: Ignore mapping IO port bar(4)
EAL: Probe PCI driver: net_i40e (8086:1583) device: 0000:18:00.0 (socket 0)
i40e_GLQF_reg_init(): i40e device 0000:18:00.0 changed global register [0x002689a0]. original: 0x00000021, new: 0x00000029
EAL: Ignore mapping IO port bar(1)
EAL: Ignore mapping IO port bar(4)
EAL: Probe PCI driver: net_i40e (8086:1583) device: 0000:1a:00.0 (socket 0)
i40e_GLQF_reg_init(): i40e device 0000:1a:00.0 changed global register [0x002689a0]. original: 0x00000021, new: 0x00000029
TELEMETRY: No legacy callbacks, legacy socket not created
APP: HPET is not enabled, using TSC as default timer
RTE>>link_bonding_autotest
 + ------------------------------------------------------- +
 + Test Suite : Link Bonding Unit Test Suite
Segmentation fault (core dumped)

[Expected Result]
Test ok.

[Regression]
Is this issue a regression: (Y/N) Y
The first bad commit:
commit d4b9235f95de4f46f368627af256ed8080f20d65
Author: Jerin Jacob <jerinj@marvell.com>
Date:   Thu Jan 18 15:17:42 2024 +0530

    ethdev: add Tx queue used count query

    Introduce a new API to retrieve the number of used descriptors
    in a Tx queue. Applications can leverage this API in the fast path to
    inspect the Tx queue occupancy and take appropriate actions based on the
    available free descriptors.

    A notable use case could be implementing Random Early Discard (RED)
    in software based on Tx queue occupancy.

    Signed-off-by: Jerin Jacob <jerinj@marvell.com>
    Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
    Acked-by: Morten Brørup <mb@smartsharesystems.com>
    Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
    Reviewed-by: Ferruh Yigit <ferruh.yigit@amd.com>
Comment 1 David Marchand 2024-02-29 10:40:06 CET
Could you provide the backtrace of this crash?
Comment 2 jiang,yu 2024-03-01 03:12:22 CET
[root@ABI-80 dpdk]# dmesg
[321775.556832] vfio-pci 0000:18:00.0: Masking broken INTx support
[321775.556908] vfio-pci 0000:18:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0
[321775.844843] vfio-pci 0000:1a:00.0: Masking broken INTx support
[321775.844947] vfio-pci 0000:1a:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0
[321781.820113] dpdk-test[2298656]: segfault at 50 ip 0000000000453d31 sp 00007ffdeefcda90 error 6 in dpdk-test[416000+1f6000]
[321781.820124] Code: f7 41 89 45 58 49 8d 4d 10 49 8d 44 24 10 49 89 47 58 49 89 4c 24 30 48 89 44 24 10 ba 01 00 00 00 49 8b 47 40 48 89 4c 24 18 <66> 89 50 50 b9 01 00 00 00 31 d2 49 8b 47 40 be 06 00 00 00 66 89
Comment 3 Xu,Hailin 2024-03-11 07:03:01 CET
Is there any progress on this issue?
Comment 4 Jerin 2024-03-11 07:18:55 CET
This looks like an i40e-specific issue. Can you reproduce it with any other vdev or HW PMD?
Comment 5 Jerin 2024-03-11 07:19:14 CET
This looks like an i40e-specific issue. Can you reproduce it with any other vdev or HW PMD?
Comment 6 jiang,yu 2024-03-11 09:53:53 CET
(In reply to Jerin from comment #5)
> Looks like i40e specific issue, can you reproduce with anything vdev or HW
> PMD

This is not an i40e-specific issue; it also reproduces with an ice NIC.

dmesg:
[269453.056342] vfio-pci 0000:4b:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0
[269453.056347] vfio-pci 0000:4b:00.0: vfio_ecap_init: hiding ecap 0x25@0x200
[269453.056349] vfio-pci 0000:4b:00.0: vfio_ecap_init: hiding ecap 0x26@0x210
[269453.056351] vfio-pci 0000:4b:00.0: vfio_ecap_init: hiding ecap 0x27@0x250
[269453.277824] vfio-pci 0000:4b:11.0: enabling device (0000 -> 0002)
[269453.600299] vfio-pci 0000:4b:11.0: enabling device (0000 -> 0002)
[269458.470533] dpdk-test[2956468]: segfault at 50 ip 0000563f05359a25 sp 00007fff5230ead0 error 6 in dpdk-test[563f05313000+220000]
[269458.470544] Code: f7 41 89 45 58 49 8d 4d 10 49 8d 44 24 10 49 89 46 58 49 89 4c 24 30 48 89 44 24 10 ba 01 00 00 00 49 8b 46 40 48 89 4c 24 18 <66> 89 50 50 b9 01 00 00 00 31 d2 49 8b 46 40 be 06 00 00 00 66 89

OS: Ubuntu 22.04.3 LTS / kernel 5.15.0-91-generic / gcc version 11.4.0
Comment 7 Jerin 2024-03-26 12:49:52 CET
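The issue is caused by the following change [1].

app/test/virtual_pmd.c accesses the internal struct rte_eth_dev via
virtual_ethdev_create(), which is used by app/test/test_link_bonding.c.
The change adds a member in the middle of that struct, so a test binary
built against the older header no longer matches the layout used by the
newer libraries. This test case is therefore not valid for ABI testing,
as it relies on an internal structure.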


[1]
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index b482cd12bb..f05f68a67c 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -58,6 +58,8 @@ struct rte_eth_dev {
        eth_rx_queue_count_t rx_queue_count;
        /** Check the status of a Rx descriptor */
        eth_rx_descriptor_status_t rx_descriptor_status;
+       /** Get the number of used Tx descriptors */
+       eth_tx_queue_count_t tx_queue_count;
        /** Check the status of a Tx descriptor */
        eth_tx_descriptor_status_t tx_descriptor_status;
        /** Pointer to PMD transmit mbufs reuse function */
Comment 8 Ferruh YIGIT 2024-03-26 13:34:56 CET
How can I reproduce the issue? The 'link_bonding_autotest' test passes for me.

And which function/test fails in 'app/test/test_link_bonding.c'?
Comment 9 jiang,yu 2024-04-25 05:06:41 CEST
(In reply to Ferruh YIGIT from comment #8)
> How can I reproduce the issue, 'link_bonding_autotest' test passes for me?
> 
> And which function/test fails in 'app/test/test_link_bonding.c'?

Hi Ferruh, this is ABI testing. For the steps, please refer to the description above.
Comment 10 jiang,yu 2024-04-25 05:08:58 CEST
(In reply to Jerin from comment #7)
> The issue is due to the following change[1]
> 
> app/test/virtual_pmd.c is using internal struct rte_eth_dev struct via
> virtual_ethdev_create() and it's used by app/test/test_link_bonding.c.
> So this test case is not valid for testing the ABI as it is using internal
> structure.
> 
> 
> [1]
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index b482cd12bb..f05f68a67c 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -58,6 +58,8 @@ struct rte_eth_dev {
>         eth_rx_queue_count_t rx_queue_count;
>         /** Check the status of a Rx descriptor */
>         eth_rx_descriptor_status_t rx_descriptor_status;
> +       /** Get the number of used Tx descriptors */
> +       eth_tx_queue_count_t tx_queue_count;
>         /** Check the status of a Tx descriptor */
>         eth_tx_descriptor_status_t tx_descriptor_status;
>         /** Pointer to PMD transmit mbufs reuse function */

Hi Jerin, do you mean that we should close this Bugzilla entry and, starting from the bad commit, no longer run this case in ABI testing?
Comment 11 Ferruh YIGIT 2024-04-25 13:42:25 CEST
(In reply to jiang,yu from comment #9)
> (In reply to Ferruh YIGIT from comment #8)
> > How can I reproduce the issue, 'link_bonding_autotest' test passes for me?
> > 
> > And which function/test fails in 'app/test/test_link_bonding.c'?
> 
> Hi Ferruh, this is testing ABI. Steps you can refer to the above description.
> 

I overlooked that this is ABI testing.
In that case I agree with Jerin: 'link_bonding_autotest' uses an internal API, so it is not suitable for ABI testing. Any change to an internal structure, such as "struct rte_eth_dev" in this case, will cause ABI issues for it.

+1 to remove this test for ABI testing.
Comment 12 jiang,yu 2024-04-26 03:58:02 CEST
Thanks, Jerin and Ferruh.
Closing this Bugzilla entry according to Jerin's and Ferruh's input.
