Bug 1086 - Significant TX packet drops with Mellanox NIC (mlx5 PMD)
Summary: Significant TX packet drops with Mellanox NIC (mlx5 PMD)
Status: UNCONFIRMED
Alias: None
Product: DPDK
Classification: Unclassified
Component: ethdev
Version: 21.11
Hardware: x86 Linux
Importance: Normal critical
Target Milestone: ---
Assignee: dev
URL:
Depends on:
Blocks:
 
Reported: 2022-09-28 15:41 CEST by Anton Vanda
Modified: 2022-09-29 11:22 CEST
CC List: 2 users



Attachments
testpmd-fec28ca0e3.log.txt (7.87 KB, text/plain)
2022-09-28 15:41 CEST, Anton Vanda

Description Anton Vanda 2022-09-28 15:41:31 CEST
Created attachment 222
testpmd-fec28ca0e3.log.txt

Given 2 servers with 25G Mellanox 2-port NICs:

# dpdk-devbind.py -s
Network devices using kernel driver
===================================
0000:3b:00.0 'MT27710 Family [ConnectX-4 Lx] 1015' if=ens1f0np0 drv=mlx5_core unused=vfio-pci 
0000:3b:00.1 'MT27710 Family [ConnectX-4 Lx] 1015' if=ens1f1np1 drv=mlx5_core unused=vfio-pci

The two servers are connected directly (back-to-back).


The first server is used as a packet generator, running TRex v2.99 in stateless mode:
./t-rex-64 -c 16 -i
./trex-console
trex>start -f stl/udp_1pkt_range_clients.py -m 17mpps
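
The per-port statistics tables quoted below come from the TRex console's live text UI; assuming the stock stateless console, it can be opened with:

trex>tui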


The second one runs dpdk-testpmd:
OS: Debian GNU/Linux 10 (buster)
uname -r: 4.19.0-21-amd64
ofed_info: MLNX_OFED_LINUX-5.7-1.0.2.0
gcc version 8.3.0 (Debian 8.3.0-6)
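
DPDK itself is built with the standard meson/ninja flow (the exact build options aren't recorded here, so take this as a typical invocation rather than the literal one):

meson setup build
ninja -C build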

With DPDK v21.08 compiled, running testpmd this way:

dpdk-testpmd -l 1-17 -n 4 --log-level=debug -- --nb-ports=2 --nb-cores=16 --portmask=0x3 --rxq=8 --txq=8
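
Note: the forward statistics quoted further below are what testpmd prints when forwarding stops; when running interactively (with -i), the same counters can also be inspected from the console with standard commands, e.g.:

testpmd> start
testpmd> show port stats all
testpmd> stop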

It handles roughly 17 Mpps per port:

trex>start -f stl/udp_1pkt_range_clients.py -m 17mpps

TRex Port Statistics
   port    |         0         |         1         |       total       
-----------+-------------------+-------------------+------------------
owner      |              root |              root |                   
link       |                UP |                UP |                   
state      |      TRANSMITTING |      TRANSMITTING |                   
speed      |           25 Gb/s |           25 Gb/s |                   
CPU util.  |            27.76% |            27.76% |                   
--         |                   |                   |                   
Tx bps L2  |          8.7 Gbps |         8.73 Gbps |        17.43 Gbps 
Tx bps L1  |        11.42 Gbps |        11.46 Gbps |        22.88 Gbps 
Tx pps     |           17 Mpps |        17.05 Mpps |        34.05 Mpps 
Line Util. |            45.7 % |           45.83 % |                   
---        |                   |                   |                   
Rx bps     |          8.7 Gbps |         8.73 Gbps |        17.43 Gbps 
Rx pps     |           17 Mpps |        17.05 Mpps |        34.05 Mpps 
----       |                   |                   |                   
opackets   |         290928398 |         291050836 |         581979234 
ipackets   |         290885740 |         291093159 |         581978899 
obytes     |       18619417472 |       18627254464 |       37246671936 
ibytes     |       18616688080 |       18629962836 |       37246650916 
tx-pkts    |      290.93 Mpkts |      291.05 Mpkts |      581.98 Mpkts 
rx-pkts    |      290.89 Mpkts |      291.09 Mpkts |      581.98 Mpkts 
tx-bytes   |          18.62 GB |          18.63 GB |          37.25 GB 
rx-bytes   |          18.62 GB |          18.63 GB |          37.25 GB 
-----      |                   |                   |                   
oerrors    |                 0 |                 0 |                 0 
ierrors    |                 0 |                 0 |                 0


But if we switch to DPDK v21.11, it becomes much worse:

TRex Port Statistics
   port    |         0         |         1         |       total       
-----------+-------------------+-------------------+------------------
owner      |              root |              root |                   
link       |                UP |                UP |                   
state      |      TRANSMITTING |      TRANSMITTING |                   
speed      |           25 Gb/s |           25 Gb/s |                   
CPU util.  |            26.06% |            26.06% |                   
--         |                   |                   |                   
Tx bps L2  |          8.7 Gbps |         8.72 Gbps |        17.42 Gbps 
Tx bps L1  |        11.42 Gbps |        11.45 Gbps |        22.86 Gbps 
Tx pps     |        16.99 Mpps |        17.04 Mpps |        34.02 Mpps 
Line Util. |           45.66 % |           45.79 % |                   
---        |                   |                   |                   
Rx bps     |         3.75 Gbps |         3.76 Gbps |          7.5 Gbps 
Rx pps     |         7.32 Mpps |         7.34 Mpps |        14.66 Mpps 
----       |                   |                   |                   
opackets   |         190538147 |         190707494 |         381245641 
ipackets   |          82174700 |          82260152 |         164434852 
obytes     |       12194441408 |       12205280936 |       24399722344 
ibytes     |        5259181520 |        5264649728 |       10523831248 
tx-pkts    |      190.54 Mpkts |      190.71 Mpkts |      381.25 Mpkts 
rx-pkts    |       82.17 Mpkts |       82.26 Mpkts |      164.43 Mpkts 
tx-bytes   |          12.19 GB |          12.21 GB |           24.4 GB 
rx-bytes   |           5.26 GB |           5.26 GB |          10.52 GB 
-----      |                   |                   |                   
oerrors    |                 0 |                 0 |                 0 
ierrors    |                 0 |                 0 |                 0

It handles only ~7 Mpps per port instead of ~17 Mpps! testpmd reports huge TX drop counts:
  ---------------------- Forward statistics for port 0  ----------------------
  RX-packets: 1101378001     RX-dropped: 0             RX-total: 1101378001
  TX-packets: 1016776861     TX-dropped: 84576754      TX-total: 1101353615
  ----------------------------------------------------------------------------

  ---------------------- Forward statistics for port 1  ----------------------
  RX-packets: 1101353615     RX-dropped: 0             RX-total: 1101353615
  TX-packets: 1016804108     TX-dropped: 84573893      TX-total: 1101378001
  ----------------------------------------------------------------------------

  +++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
  RX-packets: 2202731616     RX-dropped: 0             RX-total: 2202731616
  TX-packets: 2033580969     TX-dropped: 169150647     TX-total: 2202731616
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
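
To see where the drops are accounted inside the PMD, the extended stats can be dumped as well (which mlx5 counters increment here is an open question; counter names differ per driver):

testpmd> show port xstats all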


Using git bisect, I found the commit (between 21.08 and 21.11) that introduced this regression: https://github.com/DPDK/dpdk/commit/fec28ca0e3a93143829f3b41a28a8da933f28499
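
For reference, a bisect session between those releases goes roughly like this (a sketch, not the literal transcript; the verdict at each step was whether TX-dropped stays at zero in the experiment above):

git bisect start
git bisect bad v21.11
git bisect good v21.08
# rebuild DPDK and rerun the testpmd experiment at each step, then mark:
git bisect good    # or: git bisect bad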

Also, I profiled it with Intel VTune 2021.3.0 (-collect hotspots and -collect memory-access), comparing two revisions:
1. 690b2a88c2 (GOOD)
2. fec28ca0e3 (BAD)
I can try to share the corresponding profiling results if it helps; unfortunately, I cannot attach them here (the VTune stats data is too big).
Comment 1 Anton Vanda 2022-09-28 15:58:05 CEST
# grep Huge /proc/meminfo
AnonHugePages:   1069056 kB
ShmemHugePages:        0 kB
HugePages_Total:   41504
HugePages_Free:    41495
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        85000192 kB
Comment 2 Anton Vanda 2022-09-29 11:13:53 CEST
# lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2802.162
CPU max MHz:         3900.0000
CPU min MHz:         800.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            28160K
NUMA node0 CPU(s):   0-19,40-59
NUMA node1 CPU(s):   20-39,60-79
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Comment 3 Anton Vanda 2022-09-29 11:22:08 CEST
It is important to note that switching to 1 GB hugepages fixed this and even improved PPS: instead of 17 Mpps per port, I now get 22 Mpps (the same experiment, tested on DPDK commit fec28ca0e3).
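
For reference, one way to reserve 1 GB hugepages (a sketch; the page counts are illustrative, and runtime allocation can fail on a fragmented system, in which case the kernel command line route plus a reboot is needed):

# at runtime, per NUMA node:
echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 16 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

# or on the kernel command line:
default_hugepagesz=1G hugepagesz=1G hugepages=32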
