[dpdk-users] What to do after rte_eth_tx_burst: free or send again remaining packets?

Wiles, Keith keith.wiles at intel.com
Mon Jan 30 21:55:22 CET 2017


> On Jan 30, 2017, at 10:02 AM, Peter Keereweer <peterkeereweer at hotmail.com> wrote:
> 
> Hi Keith,
> 
> Thanks a lot for your response! Based on your information I have tested different burst sizes in the Load Balancer application (and left the TX ring size unchanged). The read/write burst sizes of the NIC and the software queues can be configured as command line options. The default value of all burst sizes is 144. If I configure all read/write burst sizes as 32, every packet is transmitted by the TX core and no packets are dropped. But is this a valid solution? It seems to work, but it feels a little bit strange to decrease the burst size from 144 to 32.

My guess is that 32 is a better fit for the ring size and for how the hardware handles descriptors. Some hardware only frees descriptors in bursts across the PCI bus, which is easier and faster for the hardware because it does not need to send every descriptor back one at a time. You may have just hit a sweet spot between burst size and the hardware. Normally 8 descriptors on Intel hardware fill one 64-byte cache line, so a multiple of 8 gives the best performance: you do not want to write only part of a cache line, because a partial write takes a lot of PCI transactions to complete.

> 
> Another solution is implementing a while loop (like in _send_burst_fast in pktgen), so every packet will be transmitted. This solution seems to work too, but the same question here, is this a valid solution? The strange feeling about this solution is that basically the same happens in the ixgbe driver code (ixgbe_rxtx.c):
> 
> uint16_t
> ixgbe_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
> 		       uint16_t nb_pkts)
> {
> 	uint16_t nb_tx;
> 
> 	/* Try to transmit at least chunks of TX_MAX_BURST pkts */
> 	if (likely(nb_pkts <= RTE_PMD_IXGBE_TX_MAX_BURST))
> 		return tx_xmit_pkts(tx_queue, tx_pkts, nb_pkts);
> 
> 	/* transmit more than the max burst, in chunks of TX_MAX_BURST */
> 	nb_tx = 0;
> 	while (nb_pkts) {
> 		uint16_t ret, n;
> 
> 		n = (uint16_t)RTE_MIN(nb_pkts, RTE_PMD_IXGBE_TX_MAX_BURST);
> 		ret = tx_xmit_pkts(tx_queue, &(tx_pkts[nb_tx]), n);
> 		nb_tx = (uint16_t)(nb_tx + ret);
> 		nb_pkts = (uint16_t)(nb_pkts - ret);
> 		if (ret < n)
> 			break;
> 	}
> 
> 	return nb_tx;
> }
> 
> To be honest, I don't know whether this piece of code is called when I use rte_eth_tx_burst, but I expect something similar happens when rte_eth_tx_burst uses another transmit function in the ixgbe driver code. This while loop in the ixgbe driver code does exactly the same as a while loop around rte_eth_tx_burst. But if I don't use a while loop around rte_eth_tx_burst (with a burst size of 144) it does not work (many packets are dropped), whereas if I implement the while loop it seems to work…

I think it is always safe to do the looping in your application, as not all drivers do the looping for you. In the case above I think they were trying to optimize the transfers to the NIC to 32 packets or descriptors for best performance.

> 
> I hope you can help me again with finding the best solution to solve this problem!
> 
> Peter
> 
> 
> From: Wiles, Keith <keith.wiles at intel.com>
> Sent: Saturday, 28 January 2017 23:43
> To: Peter Keereweer
> CC: users at dpdk.org
> Subject: Re: [dpdk-users] What to do after rte_eth_tx_burst: free or send again remaining packets?
>     
> 
>> On Jan 28, 2017, at 1:57 PM, Peter Keereweer <peterkeereweer at hotmail.com> wrote:
>> 
>> Hi!
>> 
>> Currently I am running some tests with the Load Balancer Sample Application, sending it packets with pktgen.
>> I have a setup of 2 servers, each containing an Intel 10GbE 82599 NIC (connected to each other). I have configured the Load Balancer application to use 1 RX core, 1 worker core and 1 TX core. The TX core sends all packets back to the pktgen application.
>> 
>> With pktgen I send 1024 UDP packets to the Load Balancer. Every packet processed by the worker core is printed to the screen (I added this code myself). If I send 1024 UDP packets, 1008 (= 7 x 144) packets are printed to the screen. This is correct, because the RX core reads packets with a burst size of 144. So if I send 1024 packets, I expect 1008 packets back in the pktgen application. But surprisingly I only receive 224 packets instead of 1008. After some research I found that 224 is not just a random number: it is 7 x 32 (= 224). So if the RX core reads 7 x 144 packets, I get back 7 x 32 packets. After digging into the code of the Load Balancer application, I found this code in the 'app_lcore_io_tx' function in 'runtime.c':
>> 
>> n_pkts = rte_eth_tx_burst(
>>         port,
>>         0,
>>         lp->tx.mbuf_out[port].array,
>>         (uint16_t) n_mbufs);
>> 
>> ...
>> 
>> if (unlikely(n_pkts < n_mbufs)) {
>>         uint32_t k;
>>         for (k = n_pkts; k < n_mbufs; k++) {
>>                 struct rte_mbuf *pkt_to_free = lp->tx.mbuf_out[port].array[k];
>>                 rte_pktmbuf_free(pkt_to_free);
>>         }
>> }
>> 
>> What I understand from this code is that n_mbufs packets are handed to the 'rte_eth_tx_burst' function. This function returns n_pkts, the number of packets that were actually sent. If the number of packets actually sent is smaller than n_mbufs (the packets given to rte_eth_tx_burst), then all remaining unsent packets are freed. In the Load Balancer application, n_mbufs is equal to 144. But in my case 'rte_eth_tx_burst' returns 32, not 144. So 32 packets are actually sent and the remaining packets (144 - 32 = 112) are freed. This is the reason why I get 224 (= 7 x 32) packets back instead of 1008 (= 7 x 144).
>> 
>> But the question is: why are the remaining packets freed instead of being sent again? If I look into 'pktgen.c', there is a function '_send_burst_fast' that retries the remaining packets (in a while loop, until they have all been sent) instead of freeing them (see code below):
>> 
>> static __inline__ void
>> _send_burst_fast(port_info_t *info, uint16_t qid)
>> {
>>          struct mbuf_table   *mtab = &info->q[qid].tx_mbufs;
>>          struct rte_mbuf **pkts;
>>          uint32_t ret, cnt;
>> 
>>          cnt = mtab->len;
>>          mtab->len = 0;
>> 
>>          pkts    = mtab->m_table;
>> 
>>          if (rte_atomic32_read(&info->port_flags) & PROCESS_TX_TAP_PKTS) {
>>                  while (cnt > 0) {
>>                          ret = rte_eth_tx_burst(info->pid, qid, pkts, cnt);
>> 
>>                          pktgen_do_tx_tap(info, pkts, ret);
>> 
>>                          pkts += ret;
>>                          cnt -= ret;
>>                  }
>>          } else {
>>                  while(cnt > 0) {
>>                          ret = rte_eth_tx_burst(info->pid, qid, pkts, cnt);
>> 
>>                          pkts += ret;
>>                          cnt -= ret;
>>                  }
>>          }
>> } 
>> 
>> Why is this while loop (sending packets until they have all been sent) not implemented in the 'app_lcore_io_tx' function in the Load Balancer application? That would make sense, right? It looks like the Load Balancer application assumes that if not all packets have been sent, the remaining packets failed during the sending process and should be freed.
> 
> The TX ring on the hardware is limited in size, but you can adjust that size. In pktgen I attempt to send all packets requested to be sent, but in the load balancer the developer decided to just drop the packets that are not sent because the TX hardware ring, or even a SW ring, is full. This normally means the core is sending packets faster than the HW ring on the NIC can send them.
> 
> It was just a choice of the developer to drop the packets instead of trying again until the packet array is empty. One possible way to fix this is to make the TX ring 2-4 times larger than the RX ring. This still does not truly solve the problem; it just moves it to the RX ring. If the NIC does not have a valid RX descriptor and a place to DMA the packet into memory, the packet gets dropped at the wire. BTW, increasing the TX ring size also means that these packets will not be returned to the free pool, and you can exhaust the packet pool. The packets are stuck on the TX ring as done because the threshold to reclaim the done packets is too high.
> 
> Say you have a ring size of 1024 and the high watermark for flushing the done packets off the ring is 900 packets. If the packet pool is only 512 packets, then when you send 512 packets they will all be on the TX done queue, and now you are in a deadlock, unable to send a packet, as they are all on the TX done ring. This normally does not happen, as the ring sizes are normally much smaller than the number of TX packets or even RX packets.
> 
> In pktgen I attempt to send all of the packets requested, as it does not make any sense for the user to ask to send 10000 packets and have pktgen send fewer because the core sending the packets can overrun the TX queue at some point.
> 
> I hope that helps.
> 
>> 
>> I hope someone can help me with these questions. Thank you in advance!!
>> 
>> Peter
> 
> Regards,
> Keith
> 
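
The pool-exhaustion case quoted above (ring of 1024, reclaim watermark of 900, pool of 512) comes down to one comparison. Below is a minimal sketch under a deliberately simplified model, in which done packets are only reclaimed once `reclaim_mark` of them sit on the ring. The function name `pool_can_stall` and the model itself are assumptions for illustration, not the exact semantics of any driver's free threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model (an assumption, not driver code): completed mbufs stay
 * on the TX ring until reclaim_mark of them have accumulated. If the whole
 * pool fits on the ring below that mark, every mbuf can end up parked on
 * the ring with nothing left to allocate, so the sender stalls. */
static bool
pool_can_stall(uint32_t ring_size, uint32_t reclaim_mark, uint32_t pool_size)
{
	/* The watermark cannot be larger than the ring itself. */
	assert(reclaim_mark <= ring_size);

	/* pool_size < reclaim_mark also implies pool_size < ring_size. */
	return pool_size < reclaim_mark;
}
```

For the numbers above, `pool_can_stall(1024, 900, 512)` is true, whereas a small ring such as `pool_can_stall(128, 32, 512)` is false. That is why the problem rarely shows up with typical ring sizes.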

Regards,
Keith


