[PATCH v2 3/8] mbuf: fix Tx checksum offload examples

Konstantin Ananyev konstantin.ananyev at huawei.com
Fri Apr 12 14:46:54 CEST 2024



> > > > > > Mandate use of rte_eth_tx_prepare() in the mbuf Tx checksum offload
> > > > > > examples.
> > > > >
> > > > > I strongly disagree with this change!
> > > > >
> > > > > It will cause a huge performance degradation for shaping applications:
> > > > >
> > > > > A packet will be processed and finalized at an output or forwarding
> > > > > pipeline stage, where some other fields might also be written, so
> > > > > zeroing e.g. the out_ip checksum at this stage has low cost (no new
> > > > > cache misses).
> > > > >
> > > > > Then, the packet might be queued for QoS or similar.
> > > > >
> > > > > If rte_eth_tx_prepare() must be called at the egress pipeline stage,
> > > > > it has to write to the packet and cause a cache miss per packet,
> > > > > instead of simply passing on the packet to the NIC hardware.
> > > > >
> > > > > It must be possible to finalize the packet at the output/forwarding
> > > > > pipeline stage!
> > > >
> > > > If you can finalize your packet at the output/forwarding stage, then why
> > > > can't you invoke tx_prepare() at the same stage?
> > > > There seems to be some misunderstanding about what tx_prepare() does -
> > > > in fact it doesn't communicate with the HW queue (doesn't update the TXD
> > > > ring, etc.); all it does is make changes in the mbuf itself.
> > > > Yes, it reads some fields in the SW TX queue struct (max number of TXDs
> > > > per packet, etc.), but AFAIK it is safe to call tx_prepare() and
> > > > tx_burst() from different threads.
> > > > At least on the implementations I am aware of.
> > > > Just checked the docs - it seems this is not stated explicitly anywhere;
> > > > maybe that's why it is causing such misunderstanding.
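
To illustrate, a minimal sketch of that split (the stage structure and function names are illustrative only, not part of the patch):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/*
 * Output/forwarding stage: the packet is still hot in cache, so fill in the
 * offload fields and call tx_prepare() here. tx_prepare() only modifies the
 * mbuf (and reads SW TX queue limits); it does not touch the TXD ring.
 */
static inline uint16_t
stage_finalize(uint16_t port_id, uint16_t queue_id,
		struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	/* ... set ol_flags, l2_len/l3_len, pseudo-header checksum, etc. ... */
	return rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);
}

/*
 * Egress stage (possibly another thread, e.g. after QoS queueing): the
 * already-prepared packets are just handed to the NIC.
 */
static inline uint16_t
stage_egress(uint16_t port_id, uint16_t queue_id,
		struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	return rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);
}
```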
> > > >
> > > > >
> > > > > Also, how is rte_eth_tx_prepare() supposed to work for cloned packets
> > > > > egressing on different NIC hardware?
> > > >
> > > > If you create a clone of the full packet (including L2/L3 headers), then
> > > > obviously such a construction might not work properly with tx_prepare()
> > > > over two different NICs.
> > > > Though in the majority of cases you clone the segments with data, while
> > > > at least the L2 headers are put into separate segments.
> > > > One simple approach would be to keep the L3 header in that separate
> > > > segment.
> > > > But yes, there is a problem when you need to send exactly the same packet
> > > > over different NICs.
> > > > As I remember, for the bonding PMD things don't work quite well here -
> > > > you might have a bond over 2 NICs with different tx_prepare()
> > > > implementations, and which one to call might not be clear until the
> > > > actual PMD tx_burst() is invoked.
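
A rough, hypothetical sketch of the "headers in a separate segment" idea for sending the same payload over two NICs (the pool names, error handling and two-port loop are assumptions for illustration):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/*
 * Illustrative only: transmit the same payload over two ports by keeping the
 * L2/L3 headers in a private per-port head segment, so each port's
 * tx_prepare() can modify its own header copy without touching the shared
 * payload clone.
 */
static int
send_on_two_ports(struct rte_mbuf *payload, struct rte_mempool *hdr_pool,
		struct rte_mempool *clone_pool, uint16_t port0, uint16_t port1,
		uint16_t queue_id)
{
	uint16_t ports[2] = { port0, port1 };

	for (int i = 0; i < 2; i++) {
		struct rte_mbuf *hdr = rte_pktmbuf_alloc(hdr_pool);
		struct rte_mbuf *clone = rte_pktmbuf_clone(payload, clone_pool);

		if (hdr == NULL || clone == NULL) {
			rte_pktmbuf_free(hdr);
			rte_pktmbuf_free(clone);
			return -1;
		}

		/* ... write the per-port L2/L3 headers into 'hdr' (e.g. via
		 * rte_pktmbuf_append()) and set ol_flags/l2_len/l3_len on this
		 * head segment ... */

		if (rte_pktmbuf_chain(hdr, clone) != 0) {
			rte_pktmbuf_free(hdr);
			rte_pktmbuf_free(clone);
			return -1;
		}

		if (rte_eth_tx_prepare(ports[i], queue_id, &hdr, 1) != 1 ||
				rte_eth_tx_burst(ports[i], queue_id, &hdr, 1) != 1) {
			rte_pktmbuf_free(hdr);
			return -1;
		}
	}
	return 0;
}
```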
> > > >
> > > > >
> > > > > In theory, it might get even worse if we make this opaque instead of
> > > > > transparent and standardized:
> > > > > One PMD might reset out_ip checksum to 0x0000, and another PMD might
> > > > > reset it to 0xFFFF.
> > > >
> > > > >
> > > > > I can only see one solution:
> > > > > We need to standardize on common minimum requirements for how to
> > > > > prepare packets for each TX offload.
> > > >
> > > > If we can make each and every vendor agree here - that will definitely
> > > > help to simplify things quite a bit.
> > >
> > > An API is more than a function name and parameters.
> > > It also has preconditions and postconditions.
> > >
> > > All major NIC vendors are contributing to DPDK.
> > > It should be possible to reach consensus for reasonable minimum
> > > requirements for offloads.
> > > Hardware- and driver-specific exceptions can be documented with the offload
> > > flag, or with rte_eth_rx/tx_burst(), like the note to rte_eth_rx_burst():
> > > "Some drivers using vector instructions require that nb_pkts is divisible
> > > by 4 or 8, depending on the driver implementation."
> >
> > If we introduce a rule that everyone is supposed to follow and then straight
> > away allow people to have 'documented exceptions', to me that means 'no rule'
> > in practice.
> > A 'documented exceptions' approach might work if you have 5 different PMDs to
> > support, but not when you have 50+.
> > No one would write an app with 10 possible exception cases in their head.
> > Again, with such an approach we can forget about backward compatibility.
> > I think we already had this discussion before; my opinion remains the same
> > here - the 'documented exceptions' approach is a recipe for trouble.
> 
> The "minimum requirements" should be the lowest common denominator of all NICs.
> Exceptions should be extremely few, for outlier NICs that still want to provide an offload but whose driver is unable to live up to the
> minimum requirements.
> Any exception should require techboard approval. If a NIC/driver does not support the "minimum requirements" for an offload
> feature, it is not allowed to claim support for that offload feature, or needs to seek approval for an exception.
> 
> As another option for NICs not supporting the minimum requirements of an offload feature, we could introduce offload flags with
> finer granularity. E.g. one offload flag for "gold standard" TX checksum update (where the packet's checksum field can have any
> value), and another offload flag for "silver standard" TX checksum update (where the packet's checksum field must have a
> precomputed value).
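
As a sketch of that idea, finer-grained capability bits could look roughly like this (the flag names are invented for illustration and do not exist in rte_ethdev.h):

```c
#include <rte_bitops.h>
#include <rte_ethdev.h>

/* Hypothetical finer-grained capability bits (invented names, not in ethdev):
 * "gold" = the checksum field may hold any value; "silver" = it must hold a
 * precomputed (pseudo-header) value. */
#define HYP_ETH_TX_OFFLOAD_TCP_CKSUM_ANY_INIT     RTE_BIT64(40)
#define HYP_ETH_TX_OFFLOAD_TCP_CKSUM_PSEUDO_INIT  RTE_BIT64(41)

static void
check_cksum_offload_grade(uint16_t port_id)
{
	struct rte_eth_dev_info dev_info;

	if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
		return;

	if (dev_info.tx_offload_capa & HYP_ETH_TX_OFFLOAD_TCP_CKSUM_ANY_INIT) {
		/* "gold": the TCP checksum field can be left untouched */
	} else if (dev_info.tx_offload_capa &
			HYP_ETH_TX_OFFLOAD_TCP_CKSUM_PSEUDO_INIT) {
		/* "silver": store the pseudo-header checksum before tx */
	}
}
```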

Actually yes, I was thinking in the same direction - we need some extra API to allow the user to distinguish.
Probably we can do something like this: a new ethdev API call that would take a TX offloads bitmap as a
parameter and return whether the PMD needs to modify the contents of the packet to support these offloads or not.
Something like:
int rte_ethdev_tx_offload_pkt_mod_required(uint64_t tx_offloads)

For the majority of drivers that satisfy these "minimum requirements", the corresponding devops entry will be
empty and we'll always return 0; otherwise the PMD has to provide a proper devop.
Then again, it would be up to the user to determine whether the same packet can be passed to 2 different NICs or not.
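
As a sketch of how an application might consume such an API (the prototype below just restates the proposal; a port_id parameter is assumed here, since the answer would come from a per-PMD devop):

```c
#include <stdbool.h>
#include <rte_ethdev.h>

/* Hypothetical prototype following the proposal above; not in ethdev today. */
int rte_ethdev_tx_offload_pkt_mod_required(uint16_t port_id,
		uint64_t tx_offloads);

static bool
can_share_pkt_between(uint16_t port_a, uint16_t port_b)
{
	uint64_t tx_offloads = RTE_ETH_TX_OFFLOAD_IPV4_CKSUM |
			RTE_ETH_TX_OFFLOAD_TCP_CKSUM;

	/*
	 * A return value of 0 would mean the PMD does not need to modify packet
	 * contents for these offloads, so the same (refcnt > 1) mbuf could be
	 * prepared once and transmitted on both ports.
	 */
	return rte_ethdev_tx_offload_pkt_mod_required(port_a, tx_offloads) == 0 &&
		rte_ethdev_tx_offload_pkt_mod_required(port_b, tx_offloads) == 0;
}
```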

I suppose it is similar to what you were talking about?

> For reference, consider RSS, where the feature support flags have very high granularity.
> 
> >
> > > You mention the bonding driver, which is a good example.
> > > The rte_eth_tx_burst() documentation has a note about the API postcondition
> > > exception for the bonding driver:
> > > "This function must not modify mbufs (including packets data) unless the
> > > refcnt is 1. An exception is the bonding PMD, [...], mbufs may be modified."
> >
> > For me, what we've done for the bonding tx_prepare()/tx_burst() is a really
> > bad example.
> > The initial agreement and design choice was that tx_burst() should not modify
> > the contents of the packets
> > (that was actually one of the reasons why tx_prepare() was introduced).
> > The only reason I agreed to that exception is that I couldn't come up with
> > anything less ugly.
> >
> > Actually, these problems with the bonding PMD made me start thinking that the
> > current tx_prepare()/tx_burst() approach might need to be reconsidered somehow.
> 
> In cases where a preceding call to tx_prepare() is required, how is it worse modifying the packet in tx_burst() than modifying the
> packet in tx_prepare()?
> 
> Both cases violate the postcondition that packets are not modified at egress.
> 
> >
> > > > Then we can probably have one common tx_prepare() for all vendors ;)
> > >
> > > Yes, that would be the goal.
> > > More realistically, the ethdev layer could perform the common checks, and
> > > only the non-conforming drivers would have to implement their specific
> > > tweaks.
> >
> > Hmm, but that's what we have right now:
> > - fields in the mbuf and packet data that the user has to fill correctly, and
> > a device-specific tx_prepare().
> > How would what you suggest differ, then?
> 
> You're 100 % right here. We could move more checks into the ethdev layer, specifically checks related to the "minimum
> requirements".
> 
> > And how will it help, let's say, with the bonding PMD situation, or with
> > TX-ing of the same packet over 2 different NICs?
> 
> The bonding driver is broken.
> It can only be fixed by not violating the egress postcondition in either tx_burst() or tx_prepare().
> "Minimum requirements" might help doing that.
> 
> >
> > > If we don't standardize the meaning of the offload flags, the application
> > > developers cannot trust them!
> > > I'm afraid this is the current situation - application developers either
> > > test with specific NIC hardware, or don't use the offload features.
> >
> > Well, I have used TX offloads in several projects, and they worked quite well.
> 
> That is good to hear.
> And I don't oppose that.
> 
> In this discussion, I am worried about the roadmap direction for DPDK.
> I oppose the concept of requiring a call to tx_prepare() before calling tx_burst() when using offloads. I think it is conceptually wrong,
> and breaks the egress postcondition.
> I propose "minimum requirements" as a better solution.
> 
> > Though I have to admit, I never had to use TX offloads together with our
> > bonding PMD.
> >


