[PATCH v2 3/8] mbuf: fix Tx checksum offload examples

Morten Brørup mb at smartsharesystems.com
Wed Apr 10 14:20:42 CEST 2024


> From: Konstantin Ananyev [mailto:konstantin.ananyev at huawei.com]
> Sent: Wednesday, 10 April 2024 12.35
> 
> > > From: Konstantin Ananyev [mailto:konstantin.ananyev at huawei.com]
> > > Sent: Tuesday, 9 April 2024 15.39
> > >
> > > > > From: David Marchand [mailto:david.marchand at redhat.com]
> > > > > Sent: Friday, 5 April 2024 16.46
> > > > >
> > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
> > > > > examples.
> > > >
> > > > I strongly disagree with this change!
> > > >
> > > > It will cause a huge performance degradation for shaping applications:
> > > >
> > > > A packet will be processed and finalized at an output or forwarding
> > > pipeline stage, where some other fields might also be written, so
> > > > zeroing e.g. the out_ip checksum at this stage has low cost (no new
> > > cache misses).
> > > >
> > > > Then, the packet might be queued for QoS or similar.
> > > >
> > > > If rte_eth_tx_prepare() must be called at the egress pipeline stage,
> > > it has to write to the packet and cause a cache miss per packet,
> > > > instead of simply passing on the packet to the NIC hardware.
> > > >
> > > > It must be possible to finalize the packet at the output/forwarding
> > > pipeline stage!
> > >
> > > If you can finalize your packet on  output/forwarding, then why you
> > > can't invoke tx_prepare() on the same stage?
> > > There seems to be some misunderstanding about what tx_prepare() does -
> > > in fact it doesn't communicate with HW queue (doesn't update TXD ring,
> > > etc.), what it does - just make changes in mbuf itself.
> > > Yes, it reads some fields in SW TX queue struct (max number of TXDs per
> > > packet, etc.), but AFAIK it is safe
> > > to call tx_prepare() and tx_burst() from different threads.
> > > At least on implementations I am aware about.
> > > Just checked the docs - it seems not stated explicitly anywhere, might
> > > be that's why it causing such misunderstanding.
> > >
> > > >
> > > > Also, how is rte_eth_tx_prepare() supposed to work for cloned packets
> > > egressing on different NIC hardware?
> > >
> > > If you create a clone of full packet (including L2/L3) headers then
> > > obviously such construction might not
> > > work properly with tx_prepare() over two different NICs.
> > > Though In majority of cases you do clone segments with data, while at
> > > least L2 headers are put into different segments.
> > > One simple approach would be to keep L3 header in that separate segment.
> > > But yes, there is a problem when you'll need to send exactly the same
> > > packet over different NICs.
> > > As I remember, for bonding PMD things don't work quite well here - you
> > > might have a bond over 2 NICs with
> > > different tx_prepare() and which one to call might be not clear till
> > > actual PMD tx_burst() is invoked.
> > >
> > > >
> > > > In theory, it might get even worse if we make this opaque instead of
> > > transparent and standardized:
> > > > One PMD might reset out_ip checksum to 0x0000, and another PMD might
> > > reset it to 0xFFFF.
> > >
> > > >
> > > > I can only see one solution:
> > > > We need to standardize on common minimum requirements for how to
> > > prepare packets for each TX offload.
> > >
> > > If we can make each and every vendor to agree here - that definitely
> > > will help to simplify things quite a bit.
> >
> > An API is more than a function name and parameters.
> > It also has preconditions and postconditions.
> >
> > All major NIC vendors are contributing to DPDK.
> > It should be possible to reach consensus for reasonable minimum requirements
> for offloads.
> > Hardware- and driver-specific exceptions can be documented with the offload
> flag, or with rte_eth_rx/tx_burst(), like the note to
> > rte_eth_rx_burst():
> > "Some drivers using vector instructions require that nb_pkts is divisible by
> 4 or 8, depending on the driver implementation."
> 
> If we introduce a rule that everyone supposed to follow and then straightway
> allow people to have a 'documented exceptions',
> for me it means like 'no rule' in practice.
> A 'documented exceptions' approach might work if you have 5 different PMDs to
> support, but not when you have 50+.
> No-one would write an app with possible 10 different exception cases in his
> head.
> Again, with such approach we can forget about backward compatibility.
> I think we already had this discussion before, my opinion remains the same
> here -
> 'documented exceptions' approach is a way to trouble.

The "minimum requirements" should be the lowest common denominator of all NICs.
Exceptions should be extremely few, for outlier NICs that still want to provide an offload and its driver is unable to live up to the minimum requirements.
Any exception should require techboard approval. If a NIC/driver does not support the "minimum requirements" for an offload feature, it is not allowed to claim support for that offload feature, or needs to seek approval for an exception.

As another option for NICs not supporting the minimum requirements of an offload feature, we could introduce offload flags with finer granularity. E.g. one offload flag for "gold standard" TX checksum update (where the packet's checksum field can have any value), and another offload flag for "silver standard" TX checksum update (where the packet's checksum field must have a precomputed value).

For reference, consider RSS, where the feature support flags have very high granularity.

> 
> > You mention the bonding driver, which is a good example.
> > The rte_eth_tx_burst() documentation has a note about the API postcondition
> exception for the bonding driver:
> > "This function must not modify mbufs (including packets data) unless the
> refcnt is 1. An exception is the bonding PMD, [...], mbufs
> > may be modified."
> 
> For me, what we've done for bonding tx_prepare/tx_burst() is a really bad
> example.
> Initial agreement and design choice was that tx_burst() should not modify
> contents of the packets
> (that actually was one of the reasons why tx_prepare() was introduced).
> The only reason I agreed on that exception - because I couldn't come-up with
> something less uglier.
> 
> Actually, these problems with bonding PMD made me to start thinking that
> current
> tx_prepare/tx_burst approach might need to be reconsidered somehow.

In cases where a preceding call to tx_prepare() is required, how is it worse modifying the packet in tx_burst() than modifying the packet in tx_prepare()?

Both cases violate the postcondition that packets are not modified at egress.

> 
> > > Then we can probably have one common tx_prepare() for all vendors ;)
> >
> > Yes, that would be the goal.
> > More realistically, the ethdev layer could perform the common checks, and
> only the non-conforming drivers would have to implement
> > their specific tweaks.
> 
> Hmm,  but that's what we have right now:
> - fields in mbuf and packet data that user has to fill correctly and dev
> specific tx_prepare().
> How what you suggest will differ then?

You're 100 % right here. We could move more checks into the ethdev layer, specifically checks related to the "minimum requirements".

> And how it will help let say with bonding PMD situation, or with TX-ing of the
> same packet over 2 different NICs?

The bonding driver is broken.
It can only be fixed by not violating the egress postcondition in either tx_burst() or tx_prepare().
"Minimum requirements" might help doing that.

> 
> > If we don't standardize the meaning of the offload flags, the application
> developers cannot trust them!
> > I'm afraid this is the current situation - application developers either
> test with specific NIC hardware, or don't use the offload features.
> 
> Well, I have used TX offloads through several projects, it worked quite well.

That is good to hear.
And I don't oppose to that.

In this discussion, I am worried about the roadmap direction for DPDK.
I oppose to the concept of requiring calling tx_prepare() before calling tx_burst() when using offload. I think it is conceptually wrong, and breaks the egress postcondition.
I propose "minimum requirements" as a better solution.

> Though have to admit, never have to use TX offloads together with our bonding
> PMD.
> 



More information about the stable mailing list