[dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation

Ananyev, Konstantin konstantin.ananyev at intel.com
Mon Dec 12 12:51:31 CET 2016
Previous message: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
Next message: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi Olivier and Tomasz,

> -----Original Message-----
> From: Kulasek, TomaszX
> Sent: Friday, December 9, 2016 5:19 PM
> To: Olivier Matz <olivier.matz at 6wind.com>; Ananyev, Konstantin <konstantin.ananyev at intel.com>
> Cc: Thomas Monjalon <thomas.monjalon at 6wind.com>; dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
> 
> Hi Oliver,
> 
> My 5 cents below:
> 
> > -----Original Message-----
> > From: Olivier Matz [mailto:olivier.matz at 6wind.com]
> > Sent: Thursday, December 8, 2016 18:24
> > To: Ananyev, Konstantin <konstantin.ananyev at intel.com>
> > Cc: Thomas Monjalon <thomas.monjalon at 6wind.com>; Kulasek, TomaszX
> > <tomaszx.kulasek at intel.com>; dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
> >
> > Hi Konstantin,
> >
> > On Fri, 2 Dec 2016 16:17:51 +0000, "Ananyev, Konstantin"
> > <konstantin.ananyev at intel.com> wrote:
> > > Hi Olivier,
> > >
> > > > -----Original Message-----
> > > > From: Olivier Matz [mailto:olivier.matz at 6wind.com]
> > > > Sent: Friday, December 2, 2016 8:24 AM
> > > > To: Ananyev, Konstantin <konstantin.ananyev at intel.com>
> > > > Cc: Thomas Monjalon <thomas.monjalon at 6wind.com>; Kulasek, TomaszX
> > > > <tomaszx.kulasek at intel.com>; dev at dpdk.org Subject: Re: [dpdk-dev]
> > > > [PATCH v12 1/6] ethdev: add Tx preparation
> > > >
> > > > Hi Konstantin,
> > > >
> > > > On Fri, 2 Dec 2016 01:06:30 +0000, "Ananyev, Konstantin"
> > > > <konstantin.ananyev at intel.com> wrote:
> > > > > >
> > > > > > 2016-11-23 18:36, Tomasz Kulasek:
> > > > > > > +/**
> > > > > > > + * Process a burst of output packets on a transmit queue of
> > > > > > > an Ethernet device.
> > > > > > > + *
> > > > > > > + * The rte_eth_tx_prepare() function is invoked to prepare
> > > > > > > output packets to be
> > > > > > > + * transmitted on the output queue *queue_id* of the Ethernet
> > > > > > > device designated
> > > > > > > + * by its *port_id*.
> > > > > > > + * The *nb_pkts* parameter is the number of packets to be
> > > > > > > prepared which are
> > > > > > > + * supplied in the *tx_pkts* array of *rte_mbuf* structures,
> > > > > > > each of them
> > > > > > > + * allocated from a pool created with
> > > > > > > rte_pktmbuf_pool_create().
> > > > > > > + * For each packet to send, the rte_eth_tx_prepare() function
> > > > > > > performs
> > > > > > > + * the following operations:
> > > > > > > + *
> > > > > > > + * - Check if packet meets devices requirements for tx
> > > > > > > offloads.
> > > > > > > + *
> > > > > > > + * - Check limitations about number of segments.
> > > > > > > + *
> > > > > > > + * - Check additional requirements when debug is enabled.
> > > > > > > + *
> > > > > > > + * - Update and/or reset required checksums when tx offload
> > > > > > > is set for packet.
> > > > > > > + *
> > > > > > > + * Since this function can modify packet data, provided mbufs
> > > > > > > must be safely
> > > > > > > + * writable (e.g. modified data cannot be in shared
> > > > > > > segment).
> > > > > >
> > > > > > I think we will have to remove this limitation in next releases.
> > > > > > As we don't know how it could affect the API, I suggest to
> > > > > > declare this API EXPERIMENTAL.
> > > > >
> > > > > While I don't really mind to mart it as experimental, I don't
> > > > > really understand the reasoning: Why " this function can modify
> > > > > packet data, provided mbufs must be safely writable" suddenly
> > > > > becomes a problem? That seems like and obvious limitation to me
> > > > > and let say tx_burst() has the same one. Second, I don't see how
> > > > > you are going to remove it without introducing a heavy performance
> > > > > impact. Konstantin
> > > >
> > > > About tx_burst(), I don't think we should force the user to provide
> > > > a writable mbuf. There are many use cases where passing a clone
> > > > already works as of today and it avoids duplicating the mbuf data.
> > > > For instance: traffic generator, multicast, bridging/tap, etc...
> > > >
> > > > Moreover, this requirement would be inconsistent with the model you
> > > > are proposing in case of pipeline:
> > > >  - tx_prepare() on core X, may update the data
> > > >  - tx_burst() on core Y, should not touch the data to avoid cache
> > > > misses
> > >
> > > Probably I wasn't very clear in my previous mail.
> > > I am not saying that we should force the user to pass a writable mbuf.
> > > What I am saying that for tx_burst() current expectation is that after
> > > mbuf is handled to tx_burst() user shouldn't try to modify its buffer
> > > contents till TX engine is done with the buffer (mbuf_free() is called
> > > by TX func for it). For tx_prep(), I think, it is the same though
> > > restrictions are a bit more strict: user should not try to read/write
> > > to the mbuf while tx_prep() is not finished with it. What puzzles me
> > > is that why that should be the reason to mark tx_prep() as
> > > experimental. Konstantin
> >
> > To be sure we're on the same page, let me reword:
> >
> > - mbufs passed to tx_prepare() by the application must have their
> >   headers (l2_len + l3_len + l4_len) writable because the phdr checksum
> >   can be replaced. It could be precised in the api comment.
> >
> > - mbufs passed to tx_burst() must not be modified by the driver/hw, nor
> >   by the application.

Yes, agree with both.

> >
> >
> > About the API itself, I have one more question. I know you've already
> > discussed this a bit with Adrien, I don't want to spawn a new big thread
> > from here ;)
> >
> > The API provides tx_prepare() to check the packets have the proper format,
> > and possibly modify them (ex: csum offload / tso) to match hw
> > requirements. So it does both checks (number of segments) and fixes
> > (csum/tso). What determines things that should be checked and things that
> > should be fixed?
> >

Just to echo Tomasz detailed reply below: I think right now tx_prepare()
doesn't make any fixing.
It just: checks that packet satisfies PMD/HW requirements, if not returns
an error without trying to fix anything/
If the  packet seems valid - calculates and fills L4 cksum to fulfill HW reqs . 
About the question what should be checked - as usual there is a tradeoff between 
additional checks and performance.
As Tomasz stated below right now in non-debug mode we do checks only for things
that can have a critical impact on HW itself (tx hang, etc.).
All extra checks are implemented only in DEBUG mode right now.  

> 
> 1) Performance -- we may assume that packets are created with the common rules (e.g. doesn't tries to count IP checksum for IPv6
> packet, sets all required fields, etc.). For now, additional checks are done only in DEBUG mode.
> 
> 2) Uncommon requirements, such a number of segments for TSO/non-TSO, where invalid values can cause unexpected results (or
> even hang device on burst), so it is critical (at least for ixgbe and i40e) and must be checked.
> 
> 3) Checksum field must be initialized in a hardware specific way to count it properly.
> 
> 4) If packet is invalid its content isn't modified nor restored.

> 
> > The application gets few information from tx_prepare() about what should
> > be done to make the packet accepted by the hw, and the actions will
> > probably be different depending on hardware.

That's true.
I am open to suggestions how in future to provide extra information to the upper layer.
Set rte_errno to different values depending on type of error,
OR extra parameter in tx_prepare() that will provide more detailed error information,
OR something else?

> So could we imagine that in
> > the future the function also tries to fix the packet? I've seen your
> > comment saying that it has to be an application decision, so what about
> > having a parameter saying "fix the packet" or "don't fix it"?
> >

While I am not directly opposed to that idea, I am a bit skeptical
that it could be implemented in useful and effective way.    
As I said before it may be fixed in different ways:
coalesce mbufs, split mbuf chain into multiple ones,
might be some sort of combination of these 2 approaches.
Again it seems a bit ineffective from performance point of view:
form packet first then check and re-adjust it again.
Seems more plausible to have it formed in a right way straightway.
Also it would mean that we'll need to pass to tx_prepare() information
about which mempool to use for new mbufs, probably some extra information. 
That's why I think it might be better to do a proper (re)formatting of the packet
before passing it down to tx_prepare().
Might be some sort of generic helper function in rte_net that would use information
from rte_eth_desc_lim and could be called by upper layer before tx_prepare().
But if you think it could be merged into tx_prepare() into some smart and effective way -     
sure thing, let's look at particular implementation ideas you have. 

> 
> The main question here is how invasive in packet data tx_prepare should be.
> 
> If I understand you correctly, you're talking about the situation when tx_prepare will try to restore packet internally, e.g. linearize data
> for i40e to meet number of segments requirements, split mbufs, when are too large, maybe someone will wanted to have full
> software checksum computation fallback for devices which doesn't support it, and so on? And then this parameter will indicate "do
> the required work for me", or "just check, initialize what should be initialized, if fails, let me decide"?
> 
> Implemented model is quite simple and clear:
> 
> 1) Validate uncommon requirements which may cause a problem and let application to deal with it.
> 2) If packet is valid, initialize required fields according to the hardware requirements.
> 
> In fact, it doesn't fix anything for now, but initializes checksums, and the main reason why it is done here, is the fact that it cannot be
> done in a trivial way in the application itself.
> 
> IMHO, we should to keep this API as simple as it possible and not let it grow in unexpected way (also for performance reason).
> 
> > About rte_eth_desc_lim->nb_seg_max and
> > rte_eth_desc_lim->nb_mtu_seg_max, I'm still quite reserved, especially for
> > the 2nd one, because I don't see how it can be used by the application.
> > Well, it does not hurt to have them, but for me it looks a bit useless.
> >
> 
> * "nb_seg_max": Maximum number of segments per whole packet.
> * "nb_mtu_seg_max": Maximum number of segments per one MTU.
> 
> Application can use provided values in a fallowed way:
> 
> * For non-TSO packet, a single transmit packet may span up to "nb_mtu_seg_max" buffers.
> * For TSO packet the total number of data descriptors is "nb_seg_max", and each segment within the TSO may span up to
> "nb_mtu_seg_max".

I suppose these fields will be useful.
Let say you are about to form and send packets.
Right now, how would you know what are HW limitations on number of segments per packet?  

> 
> > Last thing, I think this API should become the default in the future.
> > For instance, it would prevent the application to calculate a phdr csum
> > that will not be used by the hw. Not calling tx_prepare() would require
> > the user/application to exactly knows the underlying hw and the kind of
> > packet that are generated. So for me it means we'll need to also update
> > other examples (other testpmd engines, l2fwd, ...). Do you agree?
> >
> 
> Most of sample applications doesn't use TX offloads at all and doesn't count checksums. So the question is "should tx_prepare be
> used even if not required?"

I suppose it wouldn't be a big harm to add tx_prepare() into sample apps when appropriate.
In most cases (when simple TS path used) it will be NOP anyway.

Konstantin
Previous message: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
Next message: [dpdk-dev] [PATCH v12 1/6] ethdev: add Tx preparation
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the dev mailing list