[dpdk-dev] [PATCH 0/6] get status of Rx and Tx descriptors

Olivier Matz olivier.matz at 6wind.com
Sat Mar 4 21:45:27 CET 2017


On Fri, 3 Mar 2017 18:46:52 +0000, "Venkatesan, Venky" <venky.venkatesan at intel.com> wrote:
> Hi Olivier, 
> 
> > -----Original Message-----
> > From: Olivier Matz [mailto:olivier.matz at 6wind.com]
> > Sent: Friday, March 3, 2017 8:45 AM
> > To: Venkatesan, Venky <venky.venkatesan at intel.com>
> > Cc: Richardson, Bruce <bruce.richardson at intel.com>; dev at dpdk.org;
> > thomas.monjalon at 6wind.com; Ananyev, Konstantin
> > <konstantin.ananyev at intel.com>; Lu, Wenzhuo <wenzhuo.lu at intel.com>;
> > Zhang, Helin <helin.zhang at intel.com>; Wu, Jingjing
> > <jingjing.wu at intel.com>; adrien.mazarguil at 6wind.com;
> > nelio.laranjeiro at 6wind.com; Yigit, Ferruh <ferruh.yigit at intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 0/6] get status of Rx and Tx descriptors
> > 
> > Hi Venky,
> > 
> > On Fri, 3 Mar 2017 16:18:52 +0000, "Venkatesan, Venky"
> > <venky.venkatesan at intel.com> wrote:  
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> > > > Sent: Thursday, March 2, 2017 8:15 AM
> > > > To: Richardson, Bruce <bruce.richardson at intel.com>
> > > > Cc: dev at dpdk.org; thomas.monjalon at 6wind.com; Ananyev, Konstantin
> > > > <konstantin.ananyev at intel.com>; Lu, Wenzhuo  
> > <wenzhuo.lu at intel.com>;  
> > > > Zhang, Helin <helin.zhang at intel.com>; Wu, Jingjing
> > > > <jingjing.wu at intel.com>; adrien.mazarguil at 6wind.com;
> > > > nelio.laranjeiro at 6wind.com; Yigit, Ferruh <ferruh.yigit at intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH 0/6] get status of Rx and Tx
> > > > descriptors
> > > >
> > > > On Thu, 2 Mar 2017 15:32:15 +0000, Bruce Richardson
> > > > <bruce.richardson at intel.com> wrote:  
> > > > > On Wed, Mar 01, 2017 at 06:19:06PM +0100, Olivier Matz wrote:  
> > > > > > This patchset introduces a new ethdev API:
> > > > > > - rte_eth_rx_descriptor_status()
> > > > > > - rte_eth_tx_descriptor_status()
> > > > > >
> > > > > > The Rx API aims to replace rte_eth_rx_descriptor_done(), which
> > > > > > does almost the same, but does not differentiate the case of a
> > > > > > descriptor used by the driver (not returned to the hw).
> > > > > >
> > > > > > The usage of these functions can be:
> > > > > > - on Rx, anticipate that the cpu is not fast enough to process
> > > > > >   all incoming packets, and take dispositions to solve the
> > > > > >   problem (add more cpus, drop specific packets, ...)
> > > > > > - on Tx, detect that the link is overloaded, and take dispositions
> > > > > >   to solve the problem (notify flow control, drop specific
> > > > > >   packets)
> > > > > >  
> > > > > Looking at it from a slightly higher level, are these APIs really
> > > > > going to help in these situations? If something is overloaded,
> > > > > doing more work by querying the ring status only makes things
> > > > > worse. I suspect that in most cases better results can be got by
> > > > > just looking at the results of RX and TX burst functions. For
> > > > > example, if RX burst is always returning a full set of packets,
> > > > > chances are you are overloaded, or at least have no scope for an  
> > unexpected burst or event.  
> > > > >
> > > > > Are these really needed for real applications? I suspect our
> > > > > trivial l3fwd power example can be made to work ok without them.  
> > > >
> > > > The l3fwd example uses the rte_eth_rx_descriptor_done() API, which
> > > > is very similar to what I'm adding here. The differences are:
> > > >
> > > > - the new lib provides a Tx counterpart
> > > > - it differentiates done/avail/hold descriptors
> > > >
> > > > The alternative was to update the descriptor_done API, but I think
> > > > we agreed to do that in this thread:
> > > > http://www.dpdk.org/ml/archives/dev/2017-January/054947.html
> > > >
> > > > About the usefulness of the API, I confirm it is useful: for
> > > > instance, you can detect that your ring is more than half-full, and
> > > > take dispositions to increase your processing power or select the packets  
> > you want to drop first.  
> > > >  
> > > For either of those cases, you could still implement this in your application  
> > without any of these APIs. Simply keep reading rx_burst() till it returns zero.
> > You now have all the packets that you want - look at how many and increase
> > your processing power, or drop them.
> > 
> > In my use case, I may have several thresholds, so it gives fast information
> > about the ring status.  
> 
> The problem is that it costs too much to return that status. Let me explain it this way - when processing a burst, it takes the IXGBE driver ~20 - 25 cycles to process a packet. Assuming memory latency is 75ns, a miss to memory costs ~150 cycles at 2.0 GHz, or 215 cycles at 3.0 GHz. Either way, it is between 7 - 10 packet times if you miss to memory. In the case you are suggesting where the CPU isn't able to keep up with the packets, all we've really done is make it harder for the CPU to keep up. 

Did you do the test to validate what you say? I did it.
Let's add the following patch to testpmd:

--- a/app/test-pmd/iofwd.c
+++ b/app/test-pmd/iofwd.c
@@ -92,6 +92,8 @@ pkt_burst_io_forward(struct fwd_stream *fs)
        start_tsc = rte_rdtsc();
 #endif

+       rte_eth_rx_descriptor_status(fs->rx_port, fs->rx_queue, 128);
+       
        /*
         * Receive a burst of packets and forward them.
         */

If I measure the performance in iofwd (nb_rxd=512), I see no difference
between the results with and without the patch.

To me, your calculation does not apply to a real-life application:

- this function is called once per burst (32 or 64 packets), so the
  penalty of 200 cycles (if any) has to be divided by the number of
  packets in the burst

- you need to take into account the number of cycles for the whole
  processing of a packet, not for the ixgbe driver only. If you are
  doing forwarding, you are at least at 100 cycles/packet for a
  benchmark test case, and I'm not even talking about IPsec.

So the penalty in the worst case (burst of 32, 100 cycles/packet) is
200 / 32, i.e. ~6 cycles per packet, or ~6% of the per-packet budget.
Given the information it provides, that is acceptable to me.

Note that we are talking here about an optional API, which would only
impact the people who use it.


> Could you use an RX alternative that returns a 0 or 1 if there are more packets remaining in the ring? That will be lower cost to implement (as a separate API or even as a part of the Rx_burst API itself). But touching the ring at a random offset is almost always going to be a performance problem. 

About returning 0 or 1 if there are more packets in the ring: it does
not solve my issue, since I want to know whether the number of done
descriptors is above or below a configurable threshold.
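
To illustrate, here is a minimal sketch of such a threshold check, assuming
the API proposed in this patchset (rte_eth_rx_descriptor_status(port, queue,
offset) returning RTE_ETH_RX_DESC_DONE for a descriptor that holds a received
packet not yet processed by the driver); the helper name and types are mine,
not part of the patch:

#include <rte_ethdev.h>

/* Return 1 if at least 'threshold' received packets are waiting in the
 * Rx ring: descriptors are filled in order, so if the descriptor at
 * offset 'threshold' is already DONE, the ones before it are too. */
static inline int
rx_ring_above_threshold(uint16_t port, uint16_t queue, uint16_t threshold)
{
	return rte_eth_rx_descriptor_status(port, queue, threshold) ==
		RTE_ETH_RX_DESC_DONE;
}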

Also, changing the Rx burst function is much more likely to be refused
than adding an optional API.


> > Keeping reading rx_burst() until it returns 0 will not work if the packet rate is
> > higher than (or close to) what the cpu is able to eat.
> >   
> 
> If the packet input rate is higher than what the CPU is capable of handling, reading the packets gives you the option of dropping those that you consider lower priority - if you have an application that takes  400 cycles to process a packet, but the input rate is a packet every 100 cycles, then what you have to look at is to read the packets, figure out a lighter weight way of determining a drop per packet (easy suggestion, use the RX filter API to tag packets) and drop them within 10-20 cycles per packet. Ideally, you would do this by draining some percentage of the descriptor ring and prioritizing and dropping those. A second, even better alternative is to use NIC facilities to prioritize packets into different queues. I don't see how adding another 150 cycles to the budget helps you keep up with packets. 

The RX filter API is not available in all drivers, and it is not as
flexible as a software filter. With this new API, I don't need to call
the software filter when the CPU is not overloaded, saving cycles in the
common case.
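
For example (a sketch only: filter_and_drop() and process_burst() are
hypothetical application hooks, and OVERLOAD_OFFSET is an arbitrary
threshold, not values from the patchset):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define MAX_PKT_BURST   32
#define OVERLOAD_OFFSET 256	/* e.g. half of a 512-descriptor Rx ring */

/* Hypothetical application hooks, not part of DPDK. */
uint16_t filter_and_drop(struct rte_mbuf **pkts, uint16_t nb);
void process_burst(struct rte_mbuf **pkts, uint16_t nb);

static void
rx_poll(uint16_t port, uint16_t queue)
{
	struct rte_mbuf *pkts[MAX_PKT_BURST];
	uint16_t nb_rx = rte_eth_rx_burst(port, queue, pkts, MAX_PKT_BURST);

	/* Pay for the software drop filter only when the ring is filling up. */
	if (rte_eth_rx_descriptor_status(port, queue, OVERLOAD_OFFSET) ==
	    RTE_ETH_RX_DESC_DONE)
		nb_rx = filter_and_drop(pkts, nb_rx);

	process_burst(pkts, nb_rx);
}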

Trying to read all the packets in the ring before processing them is not
an option either, especially if the ring size is large (4096). It would
consume more mbufs (hw ring + sw processing queue), and it would also
break the mbuf prefetch policies done in the drivers.


> > > The issue I have with this newer instantiation of the API is that it is  
> > essentially to pick up a descriptor at a specified offset. In most cases, if you
> > plan to read far enough ahead with the API (let's say 16 or 32 ahead, or even
> > more), you are almost always guaranteed an L1/L2 miss - essentially making it
> > a costly API call. In cases that don't have something like Data Direct I/O
> > (DDIO), you are now going to hit memory and stall the CPU for a long time. In
> > any case, the API becomes pretty useless unless you want to stay within a
> > smaller look ahead offset. The rx_burst() methodology simply works better
> > in most circumstances, and allows application level control.  
> > >
> > > So, NAK. My suggestion would be to go back to the older API.  
> > 
> > I don't understand the reason of your nack.
> > The old API is there (for Rx it works the same), and it is illustrated in an
> > example. Since your arguments also apply to the old API, why are you
> > saying we should keep the older API?  
> >   
> 
> I am not a fan of the old API either. In hindsight, it was a mistake (which we didn't catch in time). As Bruce suggested, the example should be reworked to work without the API, and deprecate it. 

Before deprecating an API, I think we should check if people are using
it and if it can really be replaced. I think there are many things that
could be deprecated before this one.


> > For Tx, I want to know if I have enough room to send my packets before
> > doing it. There is no API yet to do that.
> >   
> 
> Yes. This could be a lightweight API if it returned a count (txq->nb_tx_free) instead of actually touching the ring, which is what I have a problem with. If the implementation changes to that, that may be okay to do. The Tx API has more merit than the Rx API, but not coded around an offset.

Returning txq->nb_tx_free does not work because it is a software view,
which becomes stale as soon as the hardware has sent the packets.
Example:
1. Send many packets at a very high rate, so the tx ring becomes full
2. Wait until the packets are transmitted
3. Read nb_tx_free: it still returns 0 because the software counter has
   not been refreshed, which is not what I want

So in my case there is also a need for a Tx descriptor status API.
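
As a sketch of that Tx-side usage, assuming the proposed
rte_eth_tx_descriptor_status(port, queue, offset) and its
RTE_ETH_TX_DESC_FULL return value (the helper below is hypothetical,
not part of the patch):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Send a burst only if the hardware ring has room for it, regardless of
 * the software nb_tx_free counter: if the descriptor 'burst' slots ahead
 * is no longer held by the hardware, the burst should fit. */
static uint16_t
tx_burst_if_room(uint16_t port, uint16_t queue,
		 struct rte_mbuf **pkts, uint16_t burst)
{
	if (rte_eth_tx_descriptor_status(port, queue, burst) ==
	    RTE_ETH_TX_DESC_FULL)
		return 0;	/* no room yet, retry later */

	return rte_eth_tx_burst(port, queue, pkts, burst);
}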


Thomas, as the maintainer of the ethdev API, do you have an opinion?


Regards,
Olivier



