Our team has been chasing major performance issues with ConnectX-6 cards.

*Customer challenge:* Flow-stable (symmetric RSS) load-balancing of flows to 8 worker lcores.

*Observation:* Performance is fine up to 100 Gbps using either TCP-only *or* UDP-only traffic profiles. With mixed traffic the load drops down to 50% loss, and all dropped packets show up in the xstats as rx_phy_discard_packets (card info at the end of this email). There appears to be a huge performance issue with mixed UDP/TCP traffic using symmetric load-balancing across multiple workers.

E.g. compiling a DPDK v20.11 or newer testpmd app and running:

{code}
sudo ./dpdk-testpmd -n 8 -l 4,6,8,10,12,14,16,18,20 \
  -a 0000:4b:00.0,rxq_cqe_comp_en=4 -a 0000:4b:00.1,rxq_cqe_comp_en=4 -- \
  --forward-mode=mac --rxq=8 --txq=8 --nb-cores=8 --numa -i -a --disable-rss
{code}

and configuring:

{code}
flow create 0 ingress pattern eth / ipv4 / tcp / end actions rss types ipv4-tcp end queues 0 1 2 3 4 5 6 7 end key 6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A / end
flow create 0 ingress pattern eth / ipv4 / udp / end actions rss types ipv4-udp end queues 0 1 2 3 4 5 6 7 end key 6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A / end
flow create 1 ingress pattern eth / ipv4 / tcp / end actions rss types ipv4-tcp end queues 0 1 2 3 4 5 6 7 end key 6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A / end
flow create 1 ingress pattern eth / ipv4 / udp / end actions rss types ipv4-udp end queues 0 1 2 3 4 5 6 7 end key 6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A / end
{code}

will show *significant* packet drops at loads > 50 Gbps with any type of mixed UDP/TCP traffic, e.g. https://github.com/cisco-system-traffic-generator/trex-core/blob/master/scripts/cap2/sfr3.yaml

Whenever those packet drops occur, I see them in the xstats as "rx_phy_discard_packets". On the other hand, a TCP-only or UDP-only traffic profile scales perfectly up to 100 Gbps without drops.

Thanks for your help!

{code}
ConnectX6DX
<Devices>
  <Device pciName="0000:4b:00.0" type="ConnectX6DX" psid="DEL0000000027" partNumber="0F6FXM_08P2T2_Ax">
    <Versions>
      <FW current="22.31.1014" available="N/A"/>
      <PXE current="3.6.0403" available="N/A"/>
      <UEFI current="14.24.0013" available="N/A"/>
    </Versions>
{code}
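For reference, once traffic is running the installed rules and the per-queue spread can be checked from the testpmd prompt (a small sketch; the per-queue counters appear as rx_qNpackets in the xstats):

{code}
testpmd> flow list 0
testpmd> show port xstats 0
testpmd> show fwd stats all
{code}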
Created attachment 212 [details]
Additional test cases, incl. combinations of additional mlx5 parameters

Added additional test cases, including combinations of additional mlx5 parameters as listed in the DPDK mlx5 documentation.
Can you please attach the output of the xstats when using one of the cqe=4 scenarios?
{code}
testpmd> show port xstats all
###### NIC extended statistics for port 0
rx_good_packets: 425476923
tx_good_packets: 259850189
rx_good_bytes: 126416792933
tx_good_bytes: 358897508382
rx_missed_errors: 14144
rx_errors: 0
tx_errors: 0
rx_mbuf_allocation_errors: 0
rx_q0packets: 53259716
rx_q0bytes: 15621519961
rx_q0errors: 0
rx_q1packets: 53048026
rx_q1bytes: 15814675945
rx_q1errors: 0
rx_q2packets: 53290424
rx_q2bytes: 15703425214
rx_q2errors: 0
rx_q3packets: 53366317
rx_q3bytes: 16286720906
rx_q3errors: 0
rx_q4packets: 53410708
rx_q4bytes: 15838266064
rx_q4errors: 0
rx_q5packets: 52809749
rx_q5bytes: 15622478674
rx_q5errors: 0
rx_q6packets: 53372784
rx_q6bytes: 15885910128
rx_q6errors: 0
rx_q7packets: 52919199
rx_q7bytes: 15643796041
rx_q7errors: 0
tx_q0packets: 32448803
tx_q0bytes: 44802375520
tx_q1packets: 32044006
tx_q1bytes: 44190717419
tx_q2packets: 32821593
tx_q2bytes: 45355762505
tx_q3packets: 32562959
tx_q3bytes: 44955313505
tx_q4packets: 32356256
tx_q4bytes: 44703126508
tx_q5packets: 32324453
tx_q5bytes: 44677002457
tx_q6packets: 32861827
tx_q6bytes: 45414745177
tx_q7packets: 32430292
tx_q7bytes: 44798465291
rx_wqe_err: 0
rx_port_unicast_packets: 425478765
rx_port_unicast_bytes: 128119205827
tx_port_unicast_packets: 259843573
tx_port_unicast_bytes: 358888367522
rx_port_multicast_packets: 2
rx_port_multicast_bytes: 176
tx_port_multicast_packets: 0
tx_port_multicast_bytes: 0
rx_port_broadcast_packets: 0
rx_port_broadcast_bytes: 0
tx_port_broadcast_packets: 0
tx_port_broadcast_bytes: 0
tx_packets_phy: 260039747
rx_packets_phy: 435881051
rx_crc_errors_phy: 0
tx_bytes_phy: 359935880062
rx_bytes_phy: 131043626686
rx_in_range_len_errors_phy: 0
rx_symbol_err_phy: 0
rx_discards_phy: 10403772
tx_discards_phy: 0
tx_errors_phy: 0
rx_out_of_buffer: 14144
txpp_err_miss_int: 0
txpp_err_rearm_queue: 0
txpp_err_clock_queue: 0
txpp_err_ts_past: 0
txpp_err_ts_future: 0
txpp_jitter: 0
txpp_wander: 0
txpp_sync_lost: 0
###### NIC extended statistics for port 1
rx_good_packets: 259924834
tx_good_packets: 425524183
rx_good_bytes: 358999327754
tx_good_bytes: 126430445513
rx_missed_errors: 15723
rx_errors: 0
tx_errors: 0
rx_mbuf_allocation_errors: 0
rx_q0packets: 32469035
rx_q0bytes: 44829708152
rx_q0errors: 0
rx_q1packets: 32048138
rx_q1bytes: 44196336161
rx_q1errors: 0
rx_q2packets: 32843947
rx_q2bytes: 45386141602
rx_q2errors: 0
rx_q3packets: 32573347
rx_q3bytes: 44969467978
rx_q3errors: 0
rx_q4packets: 32362476
rx_q4bytes: 44711635023
rx_q4errors: 0
rx_q5packets: 32328342
rx_q5bytes: 44682430681
rx_q5errors: 0
rx_q6packets: 32865811
rx_q6bytes: 45420306449
rx_q6errors: 0
rx_q7packets: 32433738
rx_q7bytes: 44803301708
rx_q7errors: 0
tx_q0packets: 53265836
tx_q0bytes: 15623503530
tx_q1packets: 53054391
tx_q1bytes: 15816867710
tx_q2packets: 53296179
tx_q2bytes: 15705142067
tx_q3packets: 53372332
tx_q3bytes: 16288342251
tx_q4packets: 53416250
tx_q4bytes: 15840011770
tx_q5packets: 52815233
tx_q5bytes: 15623803938
tx_q6packets: 53379036
tx_q6bytes: 15887764228
tx_q7packets: 52924926
tx_q7bytes: 15645010019
rx_wqe_err: 0
rx_port_unicast_packets: 259935340
rx_port_unicast_bytes: 360053683508
tx_port_unicast_packets: 425515495
tx_port_unicast_bytes: 126427892220
rx_port_multicast_packets: 2
rx_port_multicast_bytes: 176
tx_port_multicast_packets: 0
tx_port_multicast_bytes: 0
rx_port_broadcast_packets: 0
rx_port_broadcast_bytes: 0
tx_port_broadcast_packets: 0
tx_port_broadcast_bytes: 0
tx_packets_phy: 426208080
rx_packets_phy: 801525247
rx_crc_errors_phy: 0
tx_bytes_phy: 128173358039
rx_bytes_phy: 1110006416162
rx_in_range_len_errors_phy: 0
rx_symbol_err_phy: 0
rx_discards_phy: 541590684
tx_discards_phy: 0
tx_errors_phy: 0
rx_out_of_buffer: 15723
txpp_err_miss_int: 0
txpp_err_rearm_queue: 0
txpp_err_clock_queue: 0
txpp_err_ts_past: 0
txpp_err_ts_future: 0
txpp_jitter: 0
txpp_wander: 0
txpp_sync_lost: 0
{code}
(In reply to Asaf Penso from comment #2)
> Can you please attach the output of the xstats when using one of the cqe=4 scenarios?

Hi Asaf, was your team able to reproduce our findings? Thanks
Thanks, I don't see an issue with the xstats dump.
I have another question: can you confirm that the checksums of these packets are correct?
From previous experience, we have seen that in the case of cqe=4 and mixed traffic, a bad csum impacts performance.
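One way to cross-check this on the DPDK side is testpmd's verbose mode, which prints the per-packet offload flags reported by the PMD; packets whose L4 checksum the NIC considers bad should show up with something like PKT_RX_L4_CKSUM_BAD in the ol_flags line (a rough sketch; the exact flag names vary between DPDK releases):

{code}
testpmd> set verbose 1
testpmd> start
  ...
  ol_flags: PKT_RX_RSS_HASH PKT_RX_IP_CKSUM_GOOD PKT_RX_L4_CKSUM_BAD ...
{code}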
(In reply to Asaf Penso from comment #5)
> Thanks, I don't see an issue with the xstats dump.
> I have another question: can you confirm that the checksums of these packets are correct?
> From previous experience, we have seen that in the case of cqe=4 and mixed traffic, a bad csum impacts performance.

The Layer 3 IPv4 checksums and the L4 UDP checksums are valid. However, I can confirm that for the generated TCP traffic (TRex) the checksums are marked as invalid in Wireshark. Is there any mlx5 driver option to disable checksum checks or otherwise avoid the csum impact?
I'll have a look and will let you know. In the meanwhile, could you please try working with valid packets and verify that this is indeed the root cause of the degradation?
Hello Asaf, yes I can confirm that with valid TCP packets the issue is not seen.
Another question, please: do you see an issue if you send only TCP or only UDP (meaning no mixed traffic) with a bad csum? Or does the bad csum only have an impact when the traffic is mixed? That would help to direct my analysis.
Hello @Asaf, I can confirm that the perf issue only occurs with mixed traffic. If we run UDP-only or TCP-only profiles, the issue is not seen.
Hi @Asaf, any updates on this? We are currently using MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64. Is updating worth it? I could not find any RSS-related fixes in the changelogs of newer releases.
Hi, some background on how mixed traffic behaves with CQE zipping: the indication of a good or bad csum is not saved in the mini-CQE format. Instead, whenever a new CQE zipping session starts we save the csum indication for that session, and any subsequent packet whose indication differs breaks the session and causes a perf degradation. So the root cause here is that some of the traffic has a good csum while the rest does not. There is no way to configure the mechanism to work differently.
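For completeness: the rxq_cqe_comp_en devarg that is set to 4 in the original command also accepts 0, which disables CQE compression altogether (per the DPDK mlx5 documentation) rather than reconfiguring the zipping behavior. Whether giving up the compression benefit avoids the degradation in this particular setup would need to be measured; a sketch of the original invocation with compression disabled:

{code}
sudo ./dpdk-testpmd -n 8 -l 4,6,8,10,12,14,16,18,20 \
  -a 0000:4b:00.0,rxq_cqe_comp_en=0 -a 0000:4b:00.1,rxq_cqe_comp_en=0 -- \
  --forward-mode=mac --rxq=8 --txq=8 --nb-cores=8 --numa -i -a --disable-rss
{code}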
Is there a new FW version that can solve this problem? Could a switch be added to directly drop packets with csum errors?
Hi @Asaf, it would be great if you could reply to the remaining questions:

(1) As per your earlier comment, "There is no way to configure the mechanism to work differently.": is the issue related to cqe-zipping only in combination with RSS? We noticed that the card can process the *same* traffic mix (TRex mixed UDP/TCP) without any packet loss when RSS is disabled. Can you confirm?

(2) Can the csum issue be fixed with a new OFED or FW image?

(3) Documentation update: can this behavior / limitation be documented in the mlx5 driver documentation on DPDK (https://doc.dpdk.org/guides/nics/mlx5.html) as well as in the performance reports (https://fast.dpdk.org/doc/perf/DPDK_22_03_NVIDIA_Mellanox_NIC_performance_report.pdf)?

Many thanks in advance,
Toby
Hello Toby,

(1) Yes, you are right, it is due to the combination with RSS.

(2) It's not a SW (OFED/FW) issue; this is how the HW currently works. In future HW there is a plan to extend the zipping mechanism, which might make it possible to solve this. There are no details on the plan yet, however.

(3) We can update the doc with this, I agree. Our performance reports do not include mixed-traffic use cases.
Hi Asaf, thanks for the clarification.
This is the current HW behavior.
Hello Toby, may I ask which command or parameter is used to disable RSS? It's been a year; is there any solution for the mixed-traffic issue?
Hi @Asaf, to rephrase the question above: it has been a year; is there new NIC hardware, or a new OFED/FW release, that addresses the bad-csum performance issue?