[dpdk-dev] [PATCH 0/6] get status of Rx and Tx descriptors

Venkatesan, Venky venky.venkatesan at intel.com
Fri Mar 3 19:46:52 CET 2017


Hi Olivier, 

> -----Original Message-----
> From: Olivier Matz [mailto:olivier.matz at 6wind.com]
> Sent: Friday, March 3, 2017 8:45 AM
> To: Venkatesan, Venky <venky.venkatesan at intel.com>
> Cc: Richardson, Bruce <bruce.richardson at intel.com>; dev at dpdk.org;
> thomas.monjalon at 6wind.com; Ananyev, Konstantin
> <konstantin.ananyev at intel.com>; Lu, Wenzhuo <wenzhuo.lu at intel.com>;
> Zhang, Helin <helin.zhang at intel.com>; Wu, Jingjing
> <jingjing.wu at intel.com>; adrien.mazarguil at 6wind.com;
> nelio.laranjeiro at 6wind.com; Yigit, Ferruh <ferruh.yigit at intel.com>
> Subject: Re: [dpdk-dev] [PATCH 0/6] get status of Rx and Tx descriptors
> 
> Hi Venky,
> 
> On Fri, 3 Mar 2017 16:18:52 +0000, "Venkatesan, Venky"
> <venky.venkatesan at intel.com> wrote:
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> > > Sent: Thursday, March 2, 2017 8:15 AM
> > > To: Richardson, Bruce <bruce.richardson at intel.com>
> > > Cc: dev at dpdk.org; thomas.monjalon at 6wind.com; Ananyev, Konstantin
> > > <konstantin.ananyev at intel.com>; Lu, Wenzhuo
> > > <wenzhuo.lu at intel.com>;
> > > Zhang, Helin <helin.zhang at intel.com>; Wu, Jingjing
> > > <jingjing.wu at intel.com>; adrien.mazarguil at 6wind.com;
> > > nelio.laranjeiro at 6wind.com; Yigit, Ferruh <ferruh.yigit at intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 0/6] get status of Rx and Tx
> > > descriptors
> > >
> > > On Thu, 2 Mar 2017 15:32:15 +0000, Bruce Richardson
> > > <bruce.richardson at intel.com> wrote:
> > > > On Wed, Mar 01, 2017 at 06:19:06PM +0100, Olivier Matz wrote:
> > > > > This patchset introduces a new ethdev API:
> > > > > - rte_eth_rx_descriptor_status()
> > > > > - rte_eth_tx_descriptor_status()
> > > > >
> > > > > The Rx API aims to replace rte_eth_rx_descriptor_done(), which
> > > > > does almost the same thing but does not differentiate the case of
> > > > > a descriptor still held by the driver (not yet returned to the hw).
> > > > >
> > > > > The usage of these functions can be:
> > > > > - on Rx, anticipate that the cpu is not fast enough to process
> > > > >   all incoming packets, and take measures to solve the
> > > > >   problem (add more cpus, drop specific packets, ...)
> > > > > - on Tx, detect that the link is overloaded, and take measures
> > > > >   to solve the problem (notify flow control, drop specific
> > > > >   packets)
> > > > >
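For reference, a minimal sketch of how the two proposed calls might be
used; the signatures and the RTE_ETH_*_DESC_* return constants follow the
patchset as posted here, so the final names may differ:

#include <rte_ethdev.h>

/* Return non-zero if a received packet is already waiting 'offset'
 * entries ahead of the next descriptor the driver will process. */
static inline int
rx_pkt_waiting(uint8_t port, uint16_t queue, uint16_t offset)
{
	return rte_eth_rx_descriptor_status(port, queue, offset) ==
	       RTE_ETH_RX_DESC_DONE;
}

/* Return non-zero if the Tx descriptor 'offset' entries ahead has
 * already been transmitted and recycled, i.e. the ring is not backed
 * up that far. */
static inline int
tx_slot_done(uint8_t port, uint16_t queue, uint16_t offset)
{
	return rte_eth_tx_descriptor_status(port, queue, offset) ==
	       RTE_ETH_TX_DESC_DONE;
}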
> > > > Looking at it from a slightly higher level, are these APIs really
> > > > going to help in these situations? If something is overloaded,
> > > > doing more work by querying the ring status only makes things
> > > > worse. I suspect that in most cases better results can be had by
> > > > just looking at the return values of the RX and TX burst functions.
> > > > For example, if RX burst is always returning a full set of packets,
> > > > chances are you are overloaded, or at least have no scope for an
> > > > unexpected burst or event.
> > > >
> > > > Are these really needed for real applications? I suspect our
> > > > trivial l3fwd power example can be made to work ok without them.
> > >
> > > The l3fwd example uses the rte_eth_rx_descriptor_done() API, which
> > > is very similar to what I'm adding here. The differences are:
> > >
> > > - the new lib provides a Tx counterpart
> > > - it differentiates done/avail/hold descriptors
> > >
> > > The alternative was to update the descriptor_done API, but I think
> > > we agreed on this approach in that thread:
> > > http://www.dpdk.org/ml/archives/dev/2017-January/054947.html
> > >
> > > About the usefulness of the API, I confirm it is useful: for
> > > instance, you can detect that your ring is more than half-full, and
> > > take action to increase your processing power or select the packets
> > > you want to drop first.
> > >
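As a concrete illustration of that threshold idea, a ring-pressure check
might look like the following; this is only a sketch against the proposed
Rx call, and the offsets and three-level classification are application
choices, not part of the patchset:

#include <rte_ethdev.h>

enum ring_pressure { PRESSURE_LOW, PRESSURE_MID, PRESSURE_HIGH };

/* If the descriptor at offset N is DONE, at least N+1 received
 * packets are already waiting in the ring. */
static enum ring_pressure
rx_ring_pressure(uint8_t port, uint16_t queue, uint16_t ring_size)
{
	if (rte_eth_rx_descriptor_status(port, queue,
			(uint16_t)(3 * ring_size / 4)) == RTE_ETH_RX_DESC_DONE)
		return PRESSURE_HIGH;	/* more than 3/4 full */
	if (rte_eth_rx_descriptor_status(port, queue,
			(uint16_t)(ring_size / 2)) == RTE_ETH_RX_DESC_DONE)
		return PRESSURE_MID;	/* more than half full */
	return PRESSURE_LOW;
}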
> > For either of those cases, you could still implement this in your
> > application without any of these APIs. Simply keep reading rx_burst()
> > until it returns zero. You now have all the packets that you want -
> > look at how many and increase your processing power, or drop them.
> 
> In my use case, I may have several thresholds, so it gives fast
> information about the ring status.

The problem is that it costs too much to return that status. Let me explain it this way: when processing a burst, it takes the IXGBE driver ~20-25 cycles to process a packet. Assuming memory latency is 75 ns, a miss to memory costs ~150 cycles at 2.0 GHz, or ~225 cycles at 3.0 GHz. Either way, that is between 7 and 10 packet times if you miss to memory. In the case you are suggesting, where the CPU isn't able to keep up with the packets, all we've really done is make it harder for the CPU to keep up.

Could you use an RX alternative that returns a 0 or 1 if there are more packets remaining in the ring? That would be lower cost to implement (as a separate API, or even as part of the rx_burst API itself). But touching the ring at a random offset is almost always going to be a performance problem.
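One possible shape of that suggestion, sketched here with the existing
descriptor_done API probing only offset 0 (the next descriptor to be
processed, which is typically still hot in cache); an extra return value
from rx_burst() itself would be an alternative:

#include <rte_ethdev.h>

/* Return 1 if at least one more received packet is already waiting,
 * 0 if not (or if the driver does not support the query). */
static inline int
rx_more_pending(uint8_t port, uint16_t queue)
{
	return rte_eth_rx_descriptor_done(port, queue, 0) > 0;
}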

> 
> Calling rx_burst() repeatedly until it returns 0 will not work if the
> packet rate is higher than (or close to) what the cpu is able to eat.
> 

If the packet input rate is higher than what the CPU is capable of handling, reading the packets gives you the option of dropping those that you consider lower priority. If you have an application that takes 400 cycles to process a packet, but the input rate is a packet every 100 cycles, then what you have to do is read the packets, find a lighter-weight way of making a per-packet drop decision (easy suggestion: use the RX filter API to tag packets), and drop them within 10-20 cycles per packet. Ideally, you would do this by draining some percentage of the descriptor ring, then prioritizing and dropping within that set. A second, even better alternative is to use NIC facilities to prioritize packets into different queues. I don't see how adding another 150 cycles to the budget helps you keep up with packets.
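A sketch of that drain-and-triage pattern; the priority test shown (a flow
director tag in mbuf->hash.fdir.hi, set up via the RX filter API) is just
a stand-in for whatever cheap per-packet classification is available:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

/* Pull bursts until the ring is drained, keep high-priority packets
 * (up to keep_max) and free the rest immediately. Returns the number
 * of packets kept. */
static uint16_t
drain_and_triage(uint8_t port, uint16_t queue,
		 struct rte_mbuf **keep, uint16_t keep_max)
{
	struct rte_mbuf *pkts[BURST];
	uint16_t nb, i, kept = 0;

	while ((nb = rte_eth_rx_burst(port, queue, pkts, BURST)) > 0) {
		for (i = 0; i < nb; i++) {
			if (kept < keep_max && pkts[i]->hash.fdir.hi != 0)
				keep[kept++] = pkts[i];	/* high priority */
			else
				rte_pktmbuf_free(pkts[i]);	/* early drop */
		}
		if (nb < BURST)
			break;	/* ring drained for now */
	}
	return kept;
}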

> >
> > The issue I have with this newer instantiation of the API is that it
> > essentially picks up a descriptor at a specified offset. In most cases,
> > if you plan to read far enough ahead with the API (let's say 16 or 32
> > ahead, or even more), you are almost always guaranteed an L1/L2 miss -
> > essentially making it a costly API call. In cases that don't have
> > something like Data Direct I/O (DDIO), you are now going to hit memory
> > and stall the CPU for a long time. In any case, the API becomes pretty
> > useless unless you want to stay within a small look-ahead offset. The
> > rx_burst() methodology simply works better in most circumstances, and
> > allows application-level control.
> >
> > So, NAK. My suggestion would be to go back to the older API.
> 
> I don't understand the reason for your nack.
> The old API is there (for Rx it works the same), and it is illustrated in
> an example. Since your arguments also apply to the old API, why are you
> saying we should keep the older API?
> 

I am not a fan of the old API either. In hindsight, it was a mistake (which we didn't catch in time). As Bruce suggested, the example should be reworked to work without the API, and the API should be deprecated.

> For Tx, I want to know if I have enough room to send my packets before
> doing it. There is no API yet to do that.
> 

Yes. This could be a lightweight API if it returned a count (txq->nb_tx_free) instead of actually touching the ring, which is what I have a problem with. If the implementation changes to that, it may be okay to do. The Tx API has more merit than the Rx API, but it should not be coded around an offset.
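A sketch of what that count-based alternative could look like, which would
also answer the "do I have enough room to send" question above. Note that
rte_eth_tx_free_count() is hypothetical - it does not exist in ethdev, and
a real version would need a matching dev_ops callback in each PMD:

#include <stdint.h>

/* Hypothetical: report the PMD's cached free-descriptor count for a
 * Tx queue (e.g. ixgbe's txq->nb_tx_free) without touching the
 * descriptor ring itself. */
uint16_t rte_eth_tx_free_count(uint8_t port_id, uint16_t queue_id);

/* Usage sketch: check for room before committing a burst. */
static inline int
tx_has_room(uint8_t port, uint16_t queue, uint16_t nb_pkts)
{
	return rte_eth_tx_free_count(port, queue) >= nb_pkts;
}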

> And yes, this could trigger cache misses, but in some situations it's
> preferable to be a bit slower (not every workload is test-iofwd) and be
> able to anticipate that the ring is getting full.
> 
> 
> Regards,
> Olivier

