When reading xstats via the mlx5 driver *before the port* is started, the application crashes with "stack smashing detected". After the port has been started (and then stopped), the crash no longer occurs.

Terminal 1:

# dpdk-testpmd -a 0000:04:00.0 -- -i --disable-device-start
...
EAL: RTE Version: 'DPDK 20.11.1'
...
EAL: Probe PCI driver: mlx5_pci (15b3:101d) device: 0000:04:00.0 (socket 0)
...
testpmd>

Terminal 2:

# dpdk-proc-info -- --xstats
Device 0000:04:00.1 is not driven by the primary process
mlx5_pci: can not attach rte ethdev
mlx5_pci: probe of PCI device 0000:04:00.1 aborted after encountering an error: Cannot allocate memory
common_mlx5: Failed to load driver = mlx5_pci.
EAL: Requested device 0000:04:00.1 cannot be used
EAL: Cannot find resource for device
EAL: No legacy callbacks, legacy socket not created
...
*** stack smashing detected ***: dpdk-proc-info terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7fd76cdf9697]
/lib64/libc.so.6(+0x118652)[0x7fd76cdf9652]
/usr/lib64/dpdk/pmds-21.0/librte_net_mlx5.so.21.0(+0x1b41e4)[0x7fd7602271e4]
======= Memory map: ========
...
Could you please check whether it reproduces on the latest upstream branch? Are there any details from gdb?
Backtrace:

Program received signal SIGABRT, Aborted.
0x00007ffff552c387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-324.el7_9.x86_64 lbr_libnl3-3.5.0-2.el7.x86_64 libfdt-1.4.6-1.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libibverbs-52mlnx1-1.52104.x86_64 libpcap-1.5.3-12.el7.x86_64 numactl-libs-2.0.12-5.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt full
#0  0x00007ffff552c387 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007ffff552da78 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007ffff556ef67 in __libc_message () from /lib64/libc.so.6
No symbol table info available.
#3  0x00007ffff560e697 in __fortify_fail () from /lib64/libc.so.6
No symbol table info available.
#4  0x00007ffff560e652 in __stack_chk_fail () from /lib64/libc.so.6
No symbol table info available.
#5  0x00007fffe8a3c1e4 in mlx5_xstats_get (dev=0x608540 <rte_eth_devices>, stats=0x7fffffffdfa0, n=8) at ../drivers/net/mlx5/mlx5_stats.c:81
        priv = 0x1003d0200
        i = <optimized out>
        counters = <optimized out>
        xstats_ctrl = 0x1003d0af0
        mlx5_stats_n = <optimized out>
#6  0x0000000000000000 in ?? ()
No symbol table info available.

I will try to check the upstream as well.
Same issue with main be81f77d8077 ("hash: fix tuple adjustment").

dpdk/ $ rm -rf build && CFLAGS="-fstack-protector-strong" meson build && ninja -C build

1 # build/app/dpdk-testpmd -a 0000:04:00.0 -- -i --disable-device-start
2 # build/app/dpdk-proc-info -- --xstats
...
*** stack smashing detected ***: /home/shared/xvikto03/Projects/cisticka/dpdk/build/app/dpdk-proc-info terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7ffff5c99697]
/lib64/libc.so.6(+0x118652)[0x7ffff5c99652]
/home/shared/xvikto03/Projects/cisticka/dpdk/build/app/dpdk-proc-info[0x100bff4]
Using current main 7220632 ("version: 22.11-rc0"), this bug can be reproduced using the same steps as described above. With DPDK built with debug symbols, I was able to get the following backtrace:

#0  0x00007ffff5bda387 in raise () from /lib64/libc.so.6
#1  0x00007ffff5bdba78 in abort () from /lib64/libc.so.6
#2  0x00007ffff5c1ced7 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff5cbc577 in __fortify_fail () from /lib64/libc.so.6
#4  0x00007ffff5cbc532 in __stack_chk_fail () from /lib64/libc.so.6
#5  0x0000000001fc840a in mlx5_xstats_get (dev=0x16, stats=0x49d4, n=0) at ../drivers/net/mlx5/mlx5_stats.c:82
#6  0x0000000000a19c6e in rte_eth_xstats_get (port_id=0, xstats=0x7fffffffdf10, n=16) at ../lib/ethdev/rte_ethdev.c:2985
#7  0x0000000000a19a21 in rte_eth_xstats_get_by_id (port_id=0, ids=0x0, values=0x7a76460, size=16) at ../lib/ethdev/rte_ethdev.c:2940
#8  0x00000000005a711e in nic_xstats_display (port_id=0) at ../app/proc-info/main.c:559
#9  0x00000000005a9994 in main (argc=2, argv=0x7fffffffe4f0) at ../app/proc-info/main.c:1552
Using GDB, I was able to track down the cause of the problem. It appears to be the `counters` variable defined at `mlx5_stats.c:43`. That variable is allocated on the stack with `n` elements. However, `mlx5_os_read_dev_counters()` then clears it with a memset sized by `xstats_ctrl->mlx5_stats_n` instead of `n`. When `mlx5_stats_n` is larger than `n`, the memset writes past the end of the array and overwrites the stack canary, which crashes the application.
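To illustrate the size mismatch, here is a minimal standalone sketch in C. It is not the attached patch and not the actual driver code: `struct xstats_ctrl_sketch`, `read_dev_counters_sketch()` and the `capacity` parameter are hypothetical names, and only the field name `mlx5_stats_n` is taken from the analysis above. It shows one possible way to bound the clear by the size of the caller's buffer, assuming the reader can be told how many elements the buffer holds.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the driver's control structure. Only the field
 * name mlx5_stats_n comes from the report; the rest is illustrative. */
struct xstats_ctrl_sketch {
	uint16_t mlx5_stats_n; /* number of counters the device reports */
};

/*
 * Hypothetical counter reader. The caller passes the capacity of `counters`
 * so the clear is bounded by the buffer, not only by the device's count.
 * Returns 0 on success, -1 if the buffer is too small (nothing is written).
 */
static int
read_dev_counters_sketch(const struct xstats_ctrl_sketch *ctrl,
			 uint64_t *counters, unsigned int capacity)
{
	unsigned int i;

	assert(ctrl != NULL && counters != NULL);
	if (capacity < ctrl->mlx5_stats_n)
		return -1; /* an unchecked memset here would overrun the buffer */
	memset(counters, 0, sizeof(*counters) * ctrl->mlx5_stats_n);
	for (i = 0; i < ctrl->mlx5_stats_n; i++)
		counters[i] = i + 1; /* placeholder values, not real counters */
	return 0;
}
```

With a caller buffer of 8 elements and `mlx5_stats_n` set to 16, this sketch refuses to write instead of clearing 16 elements. Whether the real fix should reject the call, size the array by `mlx5_stats_n`, or clamp the memset is a design choice for the driver maintainers.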
Created attachment 268
Proposed hotfix for the problem

The proposed patch works well for me.