[PATCH v2 3/8] mbuf: fix Tx checksum offload examples

Morten Brørup mb at smartsharesystems.com
Fri Apr 12 16:44:14 CEST 2024


> > > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum
> > > > > > > offload examples.
> > > > > >
> > > > > > I strongly disagree with this change!
> > > > > >
> > > > > > It will cause a huge performance degradation for shaping
> > > > > > applications:
> > > > > >
> > > > > > A packet will be processed and finalized at an output or
> > > > > > forwarding pipeline stage, where some other fields might also
> > > > > > be written, so zeroing e.g. the out_ip checksum at this stage
> > > > > > has low cost (no new cache misses).
> > > > > >
> > > > > > Then, the packet might be queued for QoS or similar.
> > > > > >
> > > > > > If rte_eth_tx_prepare() must be called at the egress pipeline
> > > > > > stage, it has to write to the packet and cause a cache miss
> > > > > > per packet, instead of simply passing on the packet to the
> > > > > > NIC hardware.
> > > > > >
> > > > > > It must be possible to finalize the packet at the
> > > > > > output/forwarding pipeline stage!
> > > > >
> > > > > If you can finalize your packet at the output/forwarding stage,
> > > > > then why can't you invoke tx_prepare() at the same stage?
> > > > > There seems to be some misunderstanding about what tx_prepare()
> > > > > does - in fact, it doesn't communicate with the HW queue
> > > > > (doesn't update the TXD ring, etc.); what it does is just make
> > > > > changes in the mbuf itself.
> > > > > Yes, it reads some fields in the SW TX queue struct (max number
> > > > > of TXDs per packet, etc.), but AFAIK it is safe to call
> > > > > tx_prepare() and tx_burst() from different threads.
> > > > > At least on the implementations I am aware of.
> > > > > Just checked the docs - it seems this is not stated explicitly
> > > > > anywhere; that might be why it is causing such misunderstanding.
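
To illustrate the split described above, a minimal sketch - the two
stage wrapper functions are hypothetical, only the two ethdev calls
are real API:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Output/forwarding stage: the mbuf is hot in cache here, so let
 * tx_prepare() make its mbuf modifications now. It does not touch the
 * TXD ring - only the mbuf (and some read-only SW TX queue fields). */
static uint16_t
finalize_stage(uint16_t port_id, uint16_t queue_id,
               struct rte_mbuf **pkts, uint16_t nb_pkts)
{
        /* ... set ol_flags, l2_len/l3_len, etc. ... */
        return rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);
}

/* Egress stage (possibly another thread, after QoS queueing):
 * just hand the already-prepared packets to the NIC. */
static uint16_t
egress_stage(uint16_t port_id, uint16_t queue_id,
             struct rte_mbuf **pkts, uint16_t nb_pkts)
{
        return rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);
}
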
> > > > >
> > > > > >
> > > > > > Also, how is rte_eth_tx_prepare() supposed to work for cloned
> > > > > > packets egressing on different NIC hardware?
> > > > >
> > > > > If you create a clone of the full packet (including L2/L3
> > > > > headers), then obviously such a construction might not work
> > > > > properly with tx_prepare() over two different NICs.
> > > > > Though in the majority of cases you clone the segments with
> > > > > data, while at least the L2 headers are put into different
> > > > > segments.
> > > > > One simple approach would be to keep the L3 header in that
> > > > > separate segment.
> > > > > But yes, there is a problem when you need to send exactly the
> > > > > same packet over different NICs.
> > > > > As I remember, for the bonding PMD things don't work quite well
> > > > > here - you might have a bond over 2 NICs with different
> > > > > tx_prepare(), and which one to call might not be clear until
> > > > > the actual PMD tx_burst() is invoked.
> > > > >
> > > > > >
> > > > > > In theory, it might get even worse if we make this opaque
> > > > > > instead of transparent and standardized:
> > > > > > One PMD might reset out_ip checksum to 0x0000, and another
> > > > > > PMD might reset it to 0xFFFF.
> > > > >
> > > > > >
> > > > > > I can only see one solution:
> > > > > > We need to standardize on common minimum requirements for
> > > > > > how to prepare packets for each TX offload.
> > > > >
> > > > > If we can make each and every vendor agree here - that will
> > > > > definitely help to simplify things quite a bit.
> > > >
> > > > An API is more than a function name and parameters.
> > > > It also has preconditions and postconditions.
> > > >
> > > > All major NIC vendors are contributing to DPDK.
> > > > It should be possible to reach consensus on reasonable minimum
> > > > requirements for offloads.
> > > > Hardware- and driver-specific exceptions can be documented with
> > > > the offload flag, or with rte_eth_rx/tx_burst(), like the note to
> > > > rte_eth_rx_burst():
> > > > "Some drivers using vector instructions require that nb_pkts is
> > > > divisible by 4 or 8, depending on the driver implementation."
> > >
> > > If we introduce a rule that everyone is supposed to follow, and
> > > then straight away allow people to have 'documented exceptions',
> > > for me it means 'no rule' in practice.
> > > A 'documented exceptions' approach might work if you have 5
> > > different PMDs to support, but not when you have 50+.
> > > No-one would write an app with 10 different possible exception
> > > cases in his head.
> > > Again, with such an approach we can forget about backward
> > > compatibility.
> > > I think we already had this discussion before; my opinion remains
> > > the same here -
> > > the 'documented exceptions' approach leads to trouble.
> >
> > The "minimum requirements" should be the lowest common denominator
> > of all NICs.
> > Exceptions should be extremely few, for outlier NICs that still want
> > to provide an offload but whose driver is unable to live up to the
> > minimum requirements.
> > Any exception should require techboard approval. If a NIC/driver
> > does not support the "minimum requirements" for an offload feature,
> > it is not allowed to claim support for that offload feature, or
> > needs to seek approval for an exception.
> >
> > As another option for NICs not supporting the minimum requirements
> > of an offload feature, we could introduce offload flags with finer
> > granularity. E.g. one offload flag for "gold standard" TX checksum
> > update (where the packet's checksum field can have any value), and
> > another offload flag for "silver standard" TX checksum update (where
> > the packet's checksum field must have a precomputed value).
> 
> Actually yes, I was thinking in the same direction - we need some
> extra API to allow the user to distinguish.
> Probably we can do something like that: a new ethdev API call that
> takes a TX offloads bitmap as a parameter and in return specifies
> whether it would need to modify the contents of the packet to support
> these offloads or not.
> Something like:
> int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)
>
> For the majority of the drivers that satisfy these "minimum
> requirements" the corresponding devops entry will be empty and we'll
> always return 0; otherwise the PMD has to provide a proper devop.
> Then again, it would be up to the user to determine whether he can
> pass the same packet to 2 different NICs or not.
>
> I suppose it is similar to what you were talking about?
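
(As I read the proposal, usage would look roughly like the sketch
below; the function is the hypothetical API proposed above, not
something that exists in DPDK today:)

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical API from the proposal above - not part of DPDK.
 * Returns non-zero if the PMD would have to modify packet contents
 * to support the given TX offloads. */
int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads);

static bool
need_per_port_packet_copies(uint64_t tx_offloads)
{
        /* If the driver must modify the packet, the application cannot
         * pass the same (refcnt > 1) mbuf to two different NICs. */
        return rte_ethdev_tx_offload_pkt_mod_required(tx_offloads) != 0;
}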

I was thinking something more simple:

The NIC exposes its RX and TX offload capabilities to the application through the rx/tx_offload_capa and other fields in the rte_eth_dev_info structure returned by rte_eth_dev_info_get().

E.g. tx_offload_capa might have the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM flag set.
These capability flags (or enums) are mostly undocumented in the code. I guess that the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability means that the NIC is able to update the IPv4 header checksum at egress (on the wire, i.e. without modifying the mbuf or packet data), and that the application must set RTE_MBUF_F_TX_IP_CKSUM in the mbufs to utilize this offload.
I would define and document what each capability flag/enum exactly means, the minimum requirements (as defined by the DPDK community) for the driver to claim support for it, and the requirements for an application to use it.
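
To make the current usage concrete, roughly how an application uses
this offload today (a sketch; it assumes the offload was also enabled
in the port's TX offload configuration at rte_eth_dev_configure()
time):

#include <stdbool.h>
#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>

/* At init: does the port claim IPv4 header checksum offload? */
static bool
port_has_ipv4_cksum_offload(uint16_t port_id)
{
        struct rte_eth_dev_info dev_info;

        if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
                return false;
        return (dev_info.tx_offload_capa &
                RTE_ETH_TX_OFFLOAD_IPV4_CKSUM) != 0;
}

/* Per packet: request the offload by flagging the mbuf. */
static void
request_ipv4_cksum_offload(struct rte_mbuf *m)
{
        m->l2_len = sizeof(struct rte_ether_hdr);
        m->l3_len = sizeof(struct rte_ipv4_hdr);
        m->ol_flags |= RTE_MBUF_F_TX_IPV4 | RTE_MBUF_F_TX_IP_CKSUM;
}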

For the sake of discussion, let's say that RTE_ETH_TX_OFFLOAD_IPV4_CKSUM means "gold standard" TX checksum update capability (i.e. no requirements to the checksum field in the packet contents).
If some NIC requires the checksum field in the packet contents to have a precomputed value, the NIC would not be allowed to claim the RTE_ETH_TX_OFFLOAD_IPV4_CKSUM capability.
Such a NIC would need to define and document a new capability, e.g. RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED, for the "silver standard" TX checksum update capability.
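
A sketch of how an application could then pick its TX path; the
_ASSISTED flag and its bit value are of course hypothetical - that is
the point of the proposal:

#include <rte_bitops.h>
#include <rte_ethdev.h>

/* Hypothetical "silver standard" capability bit - not in DPDK;
 * the bit value is illustrative only. */
#define RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED RTE_BIT64(40)

static void
select_ipv4_cksum_tx_path(uint16_t port_id)
{
        struct rte_eth_dev_info dev_info;

        if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
                return;

        if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_IPV4_CKSUM) {
                /* "Gold": checksum field may hold any value;
                 * no packet modification needed. */
        } else if (dev_info.tx_offload_capa &
                   RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ASSISTED) {
                /* "Silver": precompute the required value in the
                 * packet's checksum field before transmitting. */
        } else {
                /* No offload: compute the checksum in software. */
        }
}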

In other words: I would encode variations of offload capabilities directly in the capabilities flags.
Then we don't need additional APIs to help interpret those capabilities.

This way, the application can probe the NIC capabilities to determine what can be offloaded, and how to do it.

The application can be designed to:
1. use a common packet processing pipeline, utilizing only the lowest common denominator of the capabilities of all detected NICs (see the sketch after this list), or
2. use a packet processing pipeline, handling packets differently according to the capabilities of the involved NICs.
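
For option 1, computing that lowest common denominator is
straightforward (a sketch):

#include <stdint.h>
#include <rte_ethdev.h>

/* Intersect the TX offload capabilities of all detected ports;
 * the pipeline may then rely on any capability left in the mask. */
static uint64_t
common_tx_offload_capa(void)
{
        uint64_t capa = UINT64_MAX;
        uint16_t port_id;

        RTE_ETH_FOREACH_DEV(port_id) {
                struct rte_eth_dev_info dev_info;

                if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
                        return 0;
                capa &= dev_info.tx_offload_capa;
        }
        return capa;
}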

NB: There may be other variations than requiring packet contents to be modified, and they might be granular per offload type.
E.g. a NIC might require assistance for TCP/UDP checksum offload, but not for IP checksum offload, so a single function telling whether packet contents require modification would not suffice.
E.g. RTE_ETH_TX_OFFLOAD_MULTI_SEGS is defined, but the rte_eth_dev_info structure doesn't expose information about the max number of segments it can handle.

PS: For backwards compatibility, we might define RTE_ETH_TX_OFFLOAD_IPV4_CKSUM as the "silver standard" offload to support the current "minimum requirements", and add RTE_ETH_TX_OFFLOAD_IPV4_CKSUM_ANY for the "gold standard" offload.


> 
> > For reference, consider RSS, where the feature support flags have
> > very high granularity.
> >
> > >
> > > > You mention the bonding driver, which is a good example.
> > > > The rte_eth_tx_burst() documentation has a note about the API
> > > > postcondition exception for the bonding driver:
> > > > "This function must not modify mbufs (including packets data)
> > > > unless the refcnt is 1. An exception is the bonding PMD, [...],
> > > > mbufs may be modified."
> > >
> > > For me, what we've done for bonding tx_prepare/tx_burst() is a
> > > really bad example.
> > > The initial agreement and design choice was that tx_burst() should
> > > not modify the contents of the packets (that actually was one of
> > > the reasons why tx_prepare() was introduced).
> > > The only reason I agreed on that exception is that I couldn't come
> > > up with anything less ugly.
> > >
> > > Actually, these problems with the bonding PMD made me start
> > > thinking that the current tx_prepare/tx_burst approach might need
> > > to be reconsidered somehow.
> >
> > In cases where a preceding call to tx_prepare() is required, how is
> > it worse modifying the packet in tx_burst() than modifying the
> > packet in tx_prepare()?
> >
> > Both cases violate the postcondition that packets are not modified
> > at egress.
> >
> > >
> > > > > Then we can probably have one common tx_prepare() for all
> > > > > vendors ;)
> > > >
> > > > Yes, that would be the goal.
> > > > More realistically, the ethdev layer could perform the common
> > > > checks, and only the non-conforming drivers would have to
> > > > implement their specific tweaks.
> > >
> > > Hmm, but that's what we have right now:
> > > - fields in the mbuf and packet data that the user has to fill
> > >   correctly, plus a dev-specific tx_prepare().
> > > How will what you suggest differ from that?
> >
> > You're 100 % right here. We could move more checks into the ethdev
> > layer, specifically checks related to the "minimum requirements".
> >
> > > And how will it help, let's say, with the bonding PMD situation,
> > > or with TX-ing the same packet over 2 different NICs?
> >
> > The bonding driver is broken.
> > It can only be fixed by not violating the egress postcondition in
> > either tx_burst() or tx_prepare().
> > "Minimum requirements" might help doing that.
> >
> > >
> > > > If we don't standardize the meaning of the offload flags, the
> > > > application developers cannot trust them!
> > > > I'm afraid this is the current situation - application
> > > > developers either test with specific NIC hardware, or don't use
> > > > the offload features.
> > >
> > > Well, I have used TX offloads in several projects, and they
> > > worked quite well.
> >
> > That is good to hear.
> > And I don't oppose that.
> >
> > In this discussion, I am worried about the roadmap direction for
> > DPDK.
> > I oppose the concept of requiring a call to tx_prepare() before
> > calling tx_burst() when using offloads. I think it is conceptually
> > wrong, and breaks the egress postcondition.
> > I propose "minimum requirements" as a better solution.
> >
> > > Though I have to admit, I never had to use TX offloads together
> > > with our bonding PMD.
> > >


