Bug 1053 - ConnectX6 / mlx5 DPDK - bad RSS / rte_flow performance on mixed traffic (rxq_cqe_comp_en=4)
Summary: ConnectX6 / mlx5 DPDK - bad RSS / rte_flow performance on mixed traffic (rxq_cqe_comp_en=4)
Status: RESOLVED WONTFIX
Alias: None
Product: DPDK
Classification: Unclassified
Component: ethdev
Version: 21.11
Hardware: x86 Linux
Importance: Normal normal
Target Milestone: ---
Assignee: Asaf Penso
URL:
Depends on:
Blocks:
 
Reported: 2022-07-15 18:06 CEST by Toby
Modified: 2023-10-21 15:03 CEST
CC List: 2 users



Attachments
Additional testcases incl. combinations of additional mlx5 parameters (8.32 KB, text/plain)
2022-07-15 18:23 CEST, Toby

Description Toby 2022-07-15 18:06:58 CEST
Our team has been chasing major performance issues with ConnectX6 cards.

*Customer challenge:*
Flow-stable (symmetric RSS) load-balancing of flows across 8 worker lcores.

*Observation:*
Performance is fine up to 100 Gbps using either TCP-only *or* UDP-only traffic profiles.
Mixed traffic suffers up to 50% loss, with all dropped packets showing up in xstats as rx_phy_discard_packets.

Card info is included at the end of this report.


There appears to be a huge performance issue with mixed UDP/TCP traffic when using symmetric load-balancing across multiple workers.
E.g., running testpmd built from DPDK v20.11 or newer:


    sudo ./dpdk-testpmd -n 8 -l 4,6,8,10,12,14,16,18,20 \
        -a 0000:4b:00.0,rxq_cqe_comp_en=4 -a 0000:4b:00.1,rxq_cqe_comp_en=4 \
        -- --forward-mode=mac --rxq=8 --txq=8 --nb-cores=8 --numa -i -a --disable-rss


and configuring:


    flow create 0 ingress pattern eth / ipv4 / tcp / end actions rss types ipv4-tcp end queues 0 1 2 3 4 5 6 7 end key 6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A / end

    flow create 0 ingress pattern eth / ipv4 / udp / end actions rss types ipv4-udp end queues 0 1 2 3 4 5 6 7 end key 6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A / end

    flow create 1 ingress pattern eth / ipv4 / tcp / end actions rss types ipv4-tcp end queues 0 1 2 3 4 5 6 7 end key 6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A / end

    flow create 1 ingress pattern eth / ipv4 / udp / end actions rss types ipv4-udp end queues 0 1 2 3 4 5 6 7 end key 6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A6D5A / end

will see *significant* packet drops at loads > 50 Gbps with any type of mixed UDP/TCP traffic, e.g. with this TRex profile:

https://github.com/cisco-system-traffic-generator/trex-core/blob/master/scripts/cap2/sfr3.yaml
Whenever those packet drops occur, I see them in the xstats as "rx_phy_discard_packets".


On the other hand, a TCP-only or UDP-only traffic profile scales perfectly up to 100 Gbps without drops.
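
For reference, the symmetric spread requested above with the repeated-0x6D5A key can likely also be requested via testpmd's built-in symmetric Toeplitz hash function. A minimal sketch, assuming the "func symmetric_toeplitz" token is supported by the mlx5 PMD and firmware in use (and likewise for port 1):

    flow create 0 ingress pattern eth / ipv4 / tcp / end actions rss func symmetric_toeplitz types ipv4-tcp end queues 0 1 2 3 4 5 6 7 end / end
    flow create 0 ingress pattern eth / ipv4 / udp / end actions rss func symmetric_toeplitz types ipv4-udp end queues 0 1 2 3 4 5 6 7 end / end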


Thanks for your help!



ConnectX6DX

    <Devices>
      <Device pciName="0000:4b:00.0" type="ConnectX6DX" psid="DEL0000000027"
              partNumber="0F6FXM_08P2T2_Ax">
        <Versions>
          <FW current="22.31.1014" available="N/A"/>
          <PXE current="3.6.0403" available="N/A"/>
          <UEFI current="14.24.0013" available="N/A"/>
        </Versions>
Comment 1 Toby 2022-07-15 18:23:14 CEST
Created attachment 212 [details]
Additional testcases incl. combinations of additional mlx5 parameters

Added additional testcases, incl. combinations of additional mlx5 parameters, as listed in the DPDK mlx5 docs.
Comment 2 Asaf Penso 2022-07-17 08:05:14 CEST
Can you please attach the output of the xstats when using one of the cqe=4 scenarios?
Comment 3 Toby 2022-07-19 01:28:26 CEST
testpmd> show port xstats all 
###### NIC extended statistics for port 0 
rx_good_packets: 425476923
tx_good_packets: 259850189
rx_good_bytes: 126416792933
tx_good_bytes: 358897508382
rx_missed_errors: 14144
rx_errors: 0
tx_errors: 0
rx_mbuf_allocation_errors: 0
rx_q0packets: 53259716
rx_q0bytes: 15621519961
rx_q0errors: 0
rx_q1packets: 53048026
rx_q1bytes: 15814675945
rx_q1errors: 0
rx_q2packets: 53290424
rx_q2bytes: 15703425214
rx_q2errors: 0
rx_q3packets: 53366317
rx_q3bytes: 16286720906
rx_q3errors: 0
rx_q4packets: 53410708
rx_q4bytes: 15838266064
rx_q4errors: 0
rx_q5packets: 52809749
rx_q5bytes: 15622478674
rx_q5errors: 0
rx_q6packets: 53372784
rx_q6bytes: 15885910128
rx_q6errors: 0
rx_q7packets: 52919199
rx_q7bytes: 15643796041
rx_q7errors: 0
tx_q0packets: 32448803
tx_q0bytes: 44802375520
tx_q1packets: 32044006
tx_q1bytes: 44190717419
tx_q2packets: 32821593
tx_q2bytes: 45355762505
tx_q3packets: 32562959
tx_q3bytes: 44955313505
tx_q4packets: 32356256
tx_q4bytes: 44703126508
tx_q5packets: 32324453
tx_q5bytes: 44677002457
tx_q6packets: 32861827
tx_q6bytes: 45414745177
tx_q7packets: 32430292
tx_q7bytes: 44798465291
rx_wqe_err: 0
rx_port_unicast_packets: 425478765
rx_port_unicast_bytes: 128119205827
tx_port_unicast_packets: 259843573
tx_port_unicast_bytes: 358888367522
rx_port_multicast_packets: 2
rx_port_multicast_bytes: 176
tx_port_multicast_packets: 0
tx_port_multicast_bytes: 0
rx_port_broadcast_packets: 0
rx_port_broadcast_bytes: 0
tx_port_broadcast_packets: 0
tx_port_broadcast_bytes: 0
tx_packets_phy: 260039747
rx_packets_phy: 435881051
rx_crc_errors_phy: 0
tx_bytes_phy: 359935880062
rx_bytes_phy: 131043626686
rx_in_range_len_errors_phy: 0
rx_symbol_err_phy: 0
rx_discards_phy: 10403772
tx_discards_phy: 0
tx_errors_phy: 0
rx_out_of_buffer: 14144
txpp_err_miss_int: 0
txpp_err_rearm_queue: 0
txpp_err_clock_queue: 0
txpp_err_ts_past: 0
txpp_err_ts_future: 0
txpp_jitter: 0
txpp_wander: 0
txpp_sync_lost: 0
###### NIC extended statistics for port 1 
rx_good_packets: 259924834
tx_good_packets: 425524183
rx_good_bytes: 358999327754
tx_good_bytes: 126430445513
rx_missed_errors: 15723
rx_errors: 0
tx_errors: 0
rx_mbuf_allocation_errors: 0
rx_q0packets: 32469035
rx_q0bytes: 44829708152
rx_q0errors: 0
rx_q1packets: 32048138
rx_q1bytes: 44196336161
rx_q1errors: 0
rx_q2packets: 32843947
rx_q2bytes: 45386141602
rx_q2errors: 0
rx_q3packets: 32573347
rx_q3bytes: 44969467978
rx_q3errors: 0
rx_q4packets: 32362476
rx_q4bytes: 44711635023
rx_q4errors: 0
rx_q5packets: 32328342
rx_q5bytes: 44682430681
rx_q5errors: 0
rx_q6packets: 32865811
rx_q6bytes: 45420306449
rx_q6errors: 0
rx_q7packets: 32433738
rx_q7bytes: 44803301708
rx_q7errors: 0
tx_q0packets: 53265836
tx_q0bytes: 15623503530
tx_q1packets: 53054391
tx_q1bytes: 15816867710
tx_q2packets: 53296179
tx_q2bytes: 15705142067
tx_q3packets: 53372332
tx_q3bytes: 16288342251
tx_q4packets: 53416250
tx_q4bytes: 15840011770
tx_q5packets: 52815233
tx_q5bytes: 15623803938
tx_q6packets: 53379036
tx_q6bytes: 15887764228
tx_q7packets: 52924926
tx_q7bytes: 15645010019
rx_wqe_err: 0
rx_port_unicast_packets: 259935340
rx_port_unicast_bytes: 360053683508
tx_port_unicast_packets: 425515495
tx_port_unicast_bytes: 126427892220
rx_port_multicast_packets: 2
rx_port_multicast_bytes: 176
tx_port_multicast_packets: 0
tx_port_multicast_bytes: 0
rx_port_broadcast_packets: 0
rx_port_broadcast_bytes: 0
tx_port_broadcast_packets: 0
tx_port_broadcast_bytes: 0
tx_packets_phy: 426208080
rx_packets_phy: 801525247
rx_crc_errors_phy: 0
tx_bytes_phy: 128173358039
rx_bytes_phy: 1110006416162
rx_in_range_len_errors_phy: 0
rx_symbol_err_phy: 0
rx_discards_phy: 541590684
tx_discards_phy: 0
tx_errors_phy: 0
rx_out_of_buffer: 15723
txpp_err_miss_int: 0
txpp_err_rearm_queue: 0
txpp_err_clock_queue: 0
txpp_err_ts_past: 0
txpp_err_ts_future: 0
txpp_jitter: 0
txpp_wander: 0
txpp_sync_lost: 0
Comment 4 Toby 2022-08-04 11:14:40 CEST
(In reply to Asaf Penso from comment #2)
> Can you please attach the output of the xstats when using one of the cqe=4
> scenarios?

Hi Asaf,
Was your team able to reproduce our findings? Thanks.
Comment 5 Asaf Penso 2022-08-16 12:28:35 CEST
Thanks, I don't see an issue in the xstats dump.
I have another question: can you confirm that the checksums of these packets are correct?
From previous experience, we have seen that with cqe=4 and mixed traffic, bad csums impact performance.
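
(One way to check what the NIC itself reports, sketched here under the assumption that the test can be repeated at a low rate: testpmd's verbose mode dumps each received packet with its Rx offload flags, e.g. RTE_MBUF_F_RX_L4_CKSUM_BAD / RTE_MBUF_F_RX_L4_CKSUM_GOOD in DPDK 21.11; older releases print the equivalent PKT_RX_* names.)

    testpmd> set fwd rxonly
    testpmd> set verbose 1
    testpmd> start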
Comment 6 Toby 2022-08-29 18:10:23 CEST
(In reply to Asaf Penso from comment #5)
> Thanks, I don't see an issue in the xstats dump.
> I have another question: can you confirm that the checksums of these packets
> are correct?
> From previous experience, we have seen that with cqe=4 and mixed traffic, bad
> csums impact performance.

Layer-3 IPv4 checksums and L4 UDP checksums are valid.
However, I can confirm that for the generated TCP traffic (TRex) the checksums are marked as invalid in Wireshark. Is there any mlx5 driver option to disable checksum checks or otherwise avoid the csum impact?
Comment 7 Asaf Penso 2022-08-29 22:14:34 CEST
I'll have a look and will let you know.
In the meanwhile, would you please try working with valid packets, to confirm this is the root cause of the degradation?
Comment 8 Toby 2022-08-30 15:52:38 CEST
Hello Asaf, yes I can confirm that with valid TCP packets the issue is not seen.
Comment 9 Asaf Penso 2022-08-31 09:59:17 CEST
Another question, please.
Do you see an issue if you send only TCP or only UDP (i.e. no mixed traffic) with bad csums?
Or does the bad csum only have an impact when the traffic is mixed?
That would help direct my analysis.
Comment 10 Toby 2022-09-07 16:27:12 CEST
Hello @Asaf, I can confirm that the perf issue occurs only with mixed traffic.
If we run UDP-only or TCP-only profiles, the issue is not seen.
Comment 11 Toby 2022-09-23 15:04:12 CEST
Hi @Asaf, any updates on this? We are currently using MLNX_OFED_LINUX-5.3-1.0.5.0-ubuntu20.04-x86_64. Is updating worth it? I could not find any RSS-related fixes in the changelogs for newer releases.
Comment 12 Asaf Penso 2022-09-28 22:35:37 CEST
Hi, to better support mixed traffic with CQE zipping, the good/bad csum indication is not saved in the mini-CQE format.
This means that whenever we start a new CQE zipping session we save the csum indication once, and any following packet with a different indication breaks the session and causes a performance degradation.
So the root cause here is that some of the traffic has a good csum while the rest does not.
There is no way to configure the mechanism to work differently.
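
(A minimal sketch for isolating this behavior on the affected setup, assuming the PCIe-efficiency cost of running without compression is acceptable for a test: rxq_cqe_comp_en=0 disables CQE compression entirely, so mixed-csum traffic no longer breaks zipping sessions.)

    sudo ./dpdk-testpmd -n 8 -l 4,6,8,10,12,14,16,18,20 \
        -a 0000:4b:00.0,rxq_cqe_comp_en=0 -a 0000:4b:00.1,rxq_cqe_comp_en=0 \
        -- --forward-mode=mac --rxq=8 --txq=8 --nb-cores=8 --numa -i -a --disable-rss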
Comment 13 killers 2022-10-08 04:08:02 CEST
Is there a new FW version that can solve this problem?

You could add a switch to directly drop the csum-error packets.
Comment 14 Toby 2022-10-25 03:40:40 CEST
Hi @Asaf,

it would be great if you could reply to the remaining questions:

(1) As per your earlier comment ("There is no way to configure the mechanism to work differently."): is the issue related to CQE zipping only in combination with RSS?
We noticed that the card can process the *same* traffic mix (TRex mixed UDP/TCP) without any packet loss when RSS is disabled (a sketch of that run follows this list). Can you confirm?

(2) Can the csum issue be fixed with a new OFED or FW image?

(3) Documentation update: can this behavior / limitation be documented in the DPDK mlx5 driver documentation at https://doc.dpdk.org/guides/nics/mlx5.html as well as in the performance reports https://fast.dpdk.org/doc/perf/DPDK_22_03_NVIDIA_Mellanox_NIC_performance_report.pdf ?
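
For reference, a plausible minimal sketch of the RSS-disabled comparison run mentioned in (1), assuming the same invocation as in the description and simply not programming the per-protocol rss flow rules, so presumably all packets land on a single queue:

    sudo ./dpdk-testpmd -n 8 -l 4,6,8,10,12,14,16,18,20 \
        -a 0000:4b:00.0,rxq_cqe_comp_en=4 -a 0000:4b:00.1,rxq_cqe_comp_en=4 \
        -- --forward-mode=mac --rxq=8 --txq=8 --nb-cores=8 --numa -i -a --disable-rss
    # no "flow create ... actions rss ..." rules issued in this run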

Many thanks in advance, Toby
Comment 15 Asaf Penso 2022-11-07 18:30:21 CET
Hello Toby,

(1) Yes, you are right, it's due to the combination with RSS.
(2) It's not a SW (OFED/FW) issue; this is how the HW currently works. For future HW there is a plan to extend the zipping mechanism, which might make it possible to solve this, but there are no details on that plan yet.
(3) I agree, we can update the doc with this. Our performance reports do not include mixed-traffic use cases.
Comment 16 Toby 2022-11-07 19:42:36 CET
Hi Asaf, thanks for the clarification.
Comment 17 Asaf Penso 2022-11-08 21:25:15 CET
This is the current HW behavior.
Comment 18 killers 2023-10-21 15:00:16 CEST
Hello Toby,
May I ask which command or parameter is used to disable RSS?
It's been a year; is there any solution for the mixed-traffic issue?
Comment 19 killers 2023-10-21 15:03:57 CEST
Hi @Asaf,

It's been a year. Is there new NIC hardware, or are there new (OFED/FW) releases, that solve the bad-csum performance issue?
