Created attachment 222 [details]
testpmd-fec28ca0e3.log.txt

Given 2 servers with 25G Mellanox 2-port NICs:

# dpdk-devbind.py -s

Network devices using kernel driver
===================================
0000:3b:00.0 'MT27710 Family [ConnectX-4 Lx] 1015' if=ens1f0np0 drv=mlx5_core unused=vfio-pci
0000:3b:00.1 'MT27710 Family [ConnectX-4 Lx] 1015' if=ens1f1np1 drv=mlx5_core unused=vfio-pci

The servers are connected directly. The first server is used as a packet generator, running TRex v2.99 in stateless mode:

./t-rex-64 -c 16 -i
./trex-console
trex>start -f stl/udp_1pkt_range_clients.py -m 17mpps

The second one runs dpdk-testpmd:

OS: Debian GNU/Linux 10 (buster)
uname -r: 4.19.0-21-amd64
ofed_info: MLNX_OFED_LINUX-5.7-1.0.2.0
gcc version 8.3.0 (Debian 8.3.0-6)

With DPDK v21.08 compiled and testpmd run this way:

dpdk-testpmd -l 1-17 -n 4 --log-level=debug -- --nb-ports=2 --nb-cores=16 --portmask=0x3 --rxq=8 --txq=8

it handles roughly 17 Mpps per port:

trex>start -f stl/udp_1pkt_range_clients.py -m 17mpps

TRex Port Statistics

   port    |         0         |         1         |       total
-----------+-------------------+-------------------+------------------
owner      |              root |              root |
link       |                UP |                UP |
state      |      TRANSMITTING |      TRANSMITTING |
speed      |           25 Gb/s |           25 Gb/s |
CPU util.  |            27.76% |            27.76% |
--         |                   |                   |
Tx bps L2  |          8.7 Gbps |         8.73 Gbps |       17.43 Gbps
Tx bps L1  |        11.42 Gbps |        11.46 Gbps |       22.88 Gbps
Tx pps     |           17 Mpps |        17.05 Mpps |       34.05 Mpps
Line Util. |            45.7 % |           45.83 % |
---        |                   |                   |
Rx bps     |          8.7 Gbps |         8.73 Gbps |       17.43 Gbps
Rx pps     |           17 Mpps |        17.05 Mpps |       34.05 Mpps
----       |                   |                   |
opackets   |         290928398 |         291050836 |         581979234
ipackets   |         290885740 |         291093159 |         581978899
obytes     |       18619417472 |       18627254464 |       37246671936
ibytes     |       18616688080 |       18629962836 |       37246650916
tx-pkts    |      290.93 Mpkts |      291.05 Mpkts |      581.98 Mpkts
rx-pkts    |      290.89 Mpkts |      291.09 Mpkts |      581.98 Mpkts
tx-bytes   |          18.62 GB |          18.63 GB |          37.25 GB
rx-bytes   |          18.62 GB |          18.63 GB |          37.25 GB
-----      |                   |                   |
oerrors    |                 0 |                 0 |                 0
ierrors    |                 0 |                 0 |                 0

But if we switch to DPDK v21.11, it becomes much worse:

TRex Port Statistics

   port    |         0         |         1         |       total
-----------+-------------------+-------------------+------------------
owner      |              root |              root |
link       |                UP |                UP |
state      |      TRANSMITTING |      TRANSMITTING |
speed      |           25 Gb/s |           25 Gb/s |
CPU util.  |            26.06% |            26.06% |
--         |                   |                   |
Tx bps L2  |          8.7 Gbps |         8.72 Gbps |       17.42 Gbps
Tx bps L1  |        11.42 Gbps |        11.45 Gbps |       22.86 Gbps
Tx pps     |        16.99 Mpps |        17.04 Mpps |       34.02 Mpps
Line Util. |           45.66 % |           45.79 % |
---        |                   |                   |
Rx bps     |         3.75 Gbps |         3.76 Gbps |         7.5 Gbps
Rx pps     |         7.32 Mpps |         7.34 Mpps |       14.66 Mpps
----       |                   |                   |
opackets   |         190538147 |         190707494 |         381245641
ipackets   |          82174700 |          82260152 |         164434852
obytes     |       12194441408 |       12205280936 |       24399722344
ibytes     |        5259181520 |        5264649728 |       10523831248
tx-pkts    |      190.54 Mpkts |      190.71 Mpkts |      381.25 Mpkts
rx-pkts    |       82.17 Mpkts |       82.26 Mpkts |      164.43 Mpkts
tx-bytes   |          12.19 GB |          12.21 GB |           24.4 GB
rx-bytes   |           5.26 GB |           5.26 GB |          10.52 GB
-----      |                   |                   |
oerrors    |                 0 |                 0 |                 0
ierrors    |                 0 |                 0 |                 0

It handles only ~7 Mpps per port instead of ~17 Mpps!
There are huge TX-dropped counts reported by testpmd:

  ---------------------- Forward statistics for port 0  ----------------------
  RX-packets: 1101378001     RX-dropped: 0             RX-total: 1101378001
  TX-packets: 1016776861     TX-dropped: 84576754      TX-total: 1101353615
  ----------------------------------------------------------------------------

  ---------------------- Forward statistics for port 1  ----------------------
  RX-packets: 1101353615     RX-dropped: 0             RX-total: 1101353615
  TX-packets: 1016804108     TX-dropped: 84573893      TX-total: 1101378001
  ----------------------------------------------------------------------------

  +++++++++++++++ Accumulated forward statistics for all ports +++++++++++++++
  RX-packets: 2202731616     RX-dropped: 0             RX-total: 2202731616
  TX-packets: 2033580969     TX-dropped: 169150647     TX-total: 2202731616
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Using git bisect, I found the commit (between 21.08 and 21.11) that introduced this regression:
https://github.com/DPDK/dpdk/commit/fec28ca0e3a93143829f3b41a28a8da933f28499

I have also profiled it with Intel VTune 2021.3.0 (-collect hotspots and -collect memory-access), comparing two revisions:
1. 690b2a88c2 (GOOD)
2. fec28ca0e3 (BAD)

I can try to share the corresponding profiling results somehow if it helps. Unfortunately, I cannot attach them here (the VTune data is too big).
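For reference, the bisect went roughly like this (a minimal sketch; the per-step build and test commands below are an assumption, not the literal session I ran):

git bisect start
git bisect bad v21.11
git bisect good v21.08
# at each step: rebuild DPDK and testpmd, rerun the 17 Mpps TRex test, check the RX rate
meson setup build && ninja -C build
git bisect good   # if RX stays at ~17 Mpps per port
git bisect bad    # if RX drops to ~7 Mpps per port
# the bisect converges on fec28ca0e3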
grep Huge /proc/meminfo
AnonHugePages:   1069056 kB
ShmemHugePages:        0 kB
HugePages_Total:   41504
HugePages_Free:    41495
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        85000192 kB
# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Stepping:            7
CPU MHz:             2802.162
CPU max MHz:         3900.0000
CPU min MHz:         800.0000
BogoMIPS:            4200.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            28160K
NUMA node0 CPU(s):   0-19,40-59
NUMA node1 CPU(s):   20-39,60-79
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
It is important to note that switching to 1 GB hugepages fixed this and improved PPS: instead of 17 Mpps per port I get 22 Mpps (same experiment, tested on DPDK commit fec28ca0e3).
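For reference, the switch to 1 GB pages was done roughly like this (a sketch of one common way to set it up; the exact page count, mount point, and grub configuration on my servers are assumptions, not copied from the report):

# kernel command line (requires reboot), e.g. in /etc/default/grub:
#   default_hugepagesz=1G hugepagesz=1G hugepages=64   (page count assumed)
mkdir -p /mnt/huge1G
mount -t hugetlbfs -o pagesize=1G nodev /mnt/huge1G
dpdk-testpmd -l 1-17 -n 4 --huge-dir=/mnt/huge1G -- --nb-ports=2 --nb-cores=16 --portmask=0x3 --rxq=8 --txq=8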