Bug 1367 - net/mlx5 Tx stuck if mbuf has too many segments
Summary: net/mlx5 Tx stuck if mbuf has too many segments
Status: UNCONFIRMED
Alias: None
Product: DPDK
Classification: Unclassified
Component: ethdev
Version: 23.11
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assignee: dev
URL:
Depends on:
Blocks:
 
Reported: 2024-01-18 08:28 CET by Andrew Rybchenko
Modified: 2024-01-19 10:25 CET
CC List: 1 user
Description Andrew Rybchenko 2024-01-18 08:28:37 CET
net/mlx5 Tx stuck if mbuf has too many segments

net/mlx5 reports the maximum number of Tx segments in device info, but it does not check it in Tx prepare and simply does not send such a packet in Tx burst.

As a result, when such a packet is encountered, the application does not know (without extra effort) why it fails to send the packet after a successful Tx prepare. In theory the reason could be a fully occupied Tx queue, in which case the application should simply retry forever.
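For context, a minimal sketch of the ethdev calls involved (the helper name, the port/queue ids and the pre-built mbuf chain "pkt" are illustrative assumptions, not taken from the test): the application reads tx_desc_lim.nb_mtu_seg_max from the device info, passes the packet through rte_eth_tx_prepare(), and then calls rte_eth_tx_burst(); per this report, on mlx5 the prepare step accepts the over-long chain and the burst never sends it.

#include <stdio.h>
#include <rte_errno.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical helper; port/queue ids and the mbuf chain are assumptions. */
static void
send_one(uint16_t port_id, uint16_t queue_id, struct rte_mbuf *pkt)
{
	struct rte_eth_dev_info dev_info;

	/* The limit reported by the PMD, e.g. nb_mtu_seg_max=40 in the test run. */
	if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
		return;
	printf("nb_mtu_seg_max=%u, pkt nb_segs=%u\n",
	       dev_info.tx_desc_lim.nb_mtu_seg_max, pkt->nb_segs);

	/* Expected: Tx prepare rejects a chain longer than nb_mtu_seg_max... */
	if (rte_eth_tx_prepare(port_id, queue_id, &pkt, 1) != 1) {
		printf("rejected by Tx prepare: %s\n", rte_strerror(rte_errno));
		rte_pktmbuf_free(pkt);
		return;
	}

	/* ...observed here: prepare accepts it and Tx burst returns 0, so a
	 * plain retry loop at this point would spin forever. */
	if (rte_eth_tx_burst(port_id, queue_id, &pkt, 1) == 0)
		printf("not sent by Tx burst\n");
}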

Found by test run at UNH IOL:

1. Test checks the reported maximum segment count: tx_desc_lim.nb_mtu_seg_max=40
https://ts-factory.io/bublik/v2/log/477842?focusId=479564&mode=treeAndinfoAndlog&experimental=true&lineNumber=1_25

2. Test splits the packet into 60 segments:
https://ts-factory.io/bublik/v2/log/477842?focusId=479564&mode=treeAndinfoAndlog&experimental=true&lineNumber=1_49

3. Test logs the expectation that the packet will be rejected by Tx prepare:
https://ts-factory.io/bublik/v2/log/477842?focusId=479564&mode=treeAndinfoAndlog&experimental=true&lineNumber=1_60

4. Tx prepare accepts the packet and the test logs an error:
https://ts-factory.io/bublik/v2/log/477842?focusId=479564&mode=treeAndinfoAndlog&experimental=true&lineNumber=1_62

5. Test tries to Tx burst the packet, but it returns 0. One more error is logged.
https://ts-factory.io/bublik/v2/log/477842?focusId=479564&mode=treeAndinfoAndlog&experimental=true&lineNumber=1_62

IMHO a better behaviour here would be to accept the packet but simply drop it in SW on Tx before passing it to HW. Just avoid the Tx stuck.

Of course Tx prepare should reject the packet at step 4.
Comment 1 Slava 2024-01-18 11:04:43 CET
The mlx5 PMD neither uses nor implements the tx_pkt_prepare() API call.

On a packet whose segment count exceeds the HW capabilities (as on any problematic packet), the PMD stops burst processing and increments the tx_queue->oerrors counter.

Normally, the "oerrors" counter is supposed to be zero; otherwise the application violates some PMD sending rules (for example, exceeds the number of segments).
It is not a runtime error, but rather a design error, so adding handling code to tx_burst would impact performance and add no value.

The app should behave as follows (see the sketch below):
- do tx_burst
- if not all packets are sent, check oerrors
- if oerrors is non-zero / has changed, report a critical error

Then debug, find the root cause, and update the design.
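A minimal sketch of that flow, under assumptions (the wrapper name, the cached counter and the error reporting are made up for illustration; the ethdev calls are the standard rte_eth_tx_burst()/rte_eth_stats_get()):

#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Illustrative wrapper: burst the packets and consult oerrors only when
 * fewer packets than requested were accepted. */
static uint16_t
tx_burst_checked(uint16_t port_id, uint16_t queue_id,
		 struct rte_mbuf **pkts, uint16_t nb_pkts,
		 uint64_t *last_oerrors)
{
	uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);

	if (nb_tx < nb_pkts) {
		struct rte_eth_stats stats;

		/* Reading stats goes through the ethdev API and takes time,
		 * so it is done only on a shortfall, not on every iteration. */
		if (rte_eth_stats_get(port_id, &stats) == 0 &&
		    stats.oerrors != *last_oerrors) {
			*last_oerrors = stats.oerrors;
			fprintf(stderr,
				"critical: port %u queue %u dropped a packet, oerrors=%" PRIu64 "\n",
				port_id, queue_id, stats.oerrors);
			/* Stop here, debug, find the root cause, update the design. */
		}
	}
	return nb_tx;
}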
Comment 2 Andrew Rybchenko 2024-01-18 11:22:14 CET
Thanks for the quick feedback.

Is it documented somewhere else? Does testpmd behave this way?

IMHO it is a bit strange that oerrors is incremented for a packet not accepted for Tx. It is one more gray area in DPDK.
Comment 3 Slava 2024-01-18 11:45:08 CET
>>Is it documented somewhere else?

I did a quick search in the docs.
It is not specified explicitly; just some related logic is there:
https://doc.dpdk.org/guides/howto/debug_troubleshoot.html?highlight=oerrors


>> Does testpmd behave this way?
No, testpmd does not. Maybe we should extend its code.

One more reason not to just skip the bad packet: we want to report to the application which packet caused an error (the first unsent packet); this was found useful for debugging user cases.

What is not good: getting oerrors is done via an API call and takes some time, so it does not seem efficient to check oerrors on every iteration.
Comment 4 Andrew Rybchenko 2024-01-18 15:34:05 CET
So, you're sure that it is OK. Definitely up to you.
When I have some time, I'll fill in the expectations for the driver. Is 40 a driver or a HW limitation? If HW, which NICs?
Comment 5 Slava 2024-01-19 10:25:28 CET
>>Is 40 a driver or HW limitation? If HW, which NICs?
The absolute limit on segments per WQE (the hardware descriptor of the ConnectX NIC series) is 63 (6-bit size field width); 61 of them can be data segments, so 61 is the hypothetical limit on the number of mbufs in a chain.

But there is also some data inlining into the WQE (a feature to save some PCIe bandwidth), so the PMD reports a reduced limit, according to the actual inline settings.
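Given that, one way for an application to stay within the limit is to compare the chain length against the reported value before queueing. A minimal sketch under assumptions (the helper name is made up, and in real code the limit would be cached once at init rather than queried per packet):

#include <stdbool.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical application-side pre-check against the reported segment limit.
 * Per comment 5, nb_mtu_seg_max already reflects the inline settings on mlx5. */
static bool
pkt_fits_tx_seg_limit(uint16_t port_id, const struct rte_mbuf *pkt)
{
	struct rte_eth_dev_info dev_info;

	if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
		return false;
	return pkt->nb_segs <= dev_info.tx_desc_lim.nb_mtu_seg_max;
}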
