[dpdk-dev] IXGBE RX packet loss with 5+ cores

Sanford, Robert rsanford at akamai.com
Tue Oct 13 04:57:46 CEST 2015


I'm hoping that someone (perhaps at Intel) can help us understand
an IXGBE RX packet loss issue we're able to reproduce with testpmd.

We run testpmd with various numbers of cores. We offer line-rate
traffic (~14.88 Mpps) to one ethernet port, and forward all received
packets via the second port.

When we configure 1, 2, 3, or 4 cores (per port, with the same number
of RX queues per port), there is no RX packet loss. When we configure
5 or more cores, we observe the following approximate packet loss:
 5 cores - 3% loss
 6 cores - 7% loss
 7 cores - 11% loss
 8 cores - 15% loss
 9 cores - 18% loss

All of the "lost" packets are accounted for in the device's Rx Missed
Packets Count register (RXMPC[0]). Quoting the datasheet:
 "Packets are missed when the receive FIFO has insufficient space to
 store the incoming packet. This might be caused due to insufficient
 buffers allocated, or because there is insufficient bandwidth on the
 IO bus."

RXMPC, together with our use of the rx_descriptor_done API to verify
that we don't run out of mbufs (discussed below), leads us to theorize
that packets are lost because the device cannot DMA them out of its
internal packet buffer (512 KB, reported by register RXPBSIZE[0]) fast
enough to prevent that buffer from overrunning.

Questions
=========
1. The 82599 device supports up to 128 queues. Why do we see trouble
with as few as 5 queues? What could limit the system (and one port
controlled by 5+ cores) from receiving at line-rate without loss?

2. As far as we can tell, the RX path only touches the device
registers when it updates a Receive Descriptor Tail register (RDT[n]),
roughly every rx_free_thresh packets. Is there a big difference
between one core doing this and N cores doing it 1/N as often?

3. Do CPU reads/writes from/to device registers have a higher priority
than device reads/writes from/to memory? Could the former transactions
(CPU <-> device) significantly impede the latter (device <-> RAM)?

Thanks in advance for any help you can provide.



Testpmd Command Line
====================
Here is an example of how we run testpmd:

# socket 0 lcores: 0-7, 16-23
N_QUEUES=5
N_CORES=10

./testpmd -c 0x003e013e -n 2 \
 --pci-whitelist "01:00.0" --pci-whitelist "01:00.1" \
 --master-lcore 8 -- \
 --interactive --portmask=0x3 --numa --socket-num=0 --auto-start \
 --coremask=0x003e003e \
 --rxd=4096 --txd=4096 --rxfreet=512 --txfreet=512 \
 --burst=128 --mbcache=256 \
 --nb-cores=$N_CORES --rxq=$N_QUEUES --txq=$N_QUEUES


Test machines
=============
* We performed most testing on a system with two E5-2640 v3
(Haswell 2.6 GHz 8 cores) CPUs, 64 GB 1866 MHz RAM, TYAN S7076 mobo.
* We obtained similar results on a system with two E5-2698 v3
(Haswell 2.3 GHz 16 cores) CPUs, 64 GB 2133 MHz RAM, Dell R730.
* DPDK 2.1.0, Linux 2.6.32-504.23.4

Intel 10GbE adapters
====================
All ethernet adapters are 82599_SFP_SF2, vendor 8086, device 154D,
svendor 8086, sdevice 7B11.


Other Details and Ideas we tried
================================
* Make sure that all cores, memory, and ethernet ports in use are on
the same NUMA socket.
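
 The check we used amounts to something like the following (a minimal
 sketch; the helper name is ours, and DPDK 2.1 port ids are uint8_t):

 #include <rte_ethdev.h>
 #include <rte_lcore.h>

 /* Sanity check: does this port's NUMA node match the lcore's socket?
  * rte_eth_dev_socket_id() may return -1 when the node is unknown. */
 static int
 port_lcore_same_socket(uint8_t port, unsigned lcore)
 {
         int port_socket = rte_eth_dev_socket_id(port);

         return port_socket < 0 ||
                (unsigned)port_socket == rte_lcore_to_socket_id(lcore);
 }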

* Modify testpmd to insert CPU delays in the forwarding loop, to
target some average number of RX packets that we reap per rx_pkt_burst
(e.g., 75% of burst).
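
 Schematically, the change was a busy-wait like the one below (a sketch,
 not the actual patch; the cycle count is a knob we tuned by hand until
 the average burst fill was around 75%):

 #include <rte_atomic.h>
 #include <rte_cycles.h>

 #define DELAY_CYCLES 500        /* hand-tuned knob (illustrative value) */

 /* Called once per iteration of the forwarding loop: spin for a tunable
  * number of TSC cycles to emulate per-burst application work, then the
  * loop polls RX again. */
 static inline void
 burst_delay(void)
 {
         uint64_t start = rte_rdtsc();

         while (rte_rdtsc() - start < DELAY_CYCLES)
                 rte_pause();
 }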

* We configured the RSS redirection table such that all packets go to
one RX queue. In this case, there was NO packet loss (with any number
of RX cores), as the ethernet and core activity was very similar to
using only one RX core.
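
 For reference, the RETA change amounts to something like this (a
 sketch assuming the 82599's 128-entry table; the helper name is ours):

 #include <string.h>
 #include <rte_ethdev.h>

 /* Point every RSS redirection-table entry at queue 0, so that all
  * received packets land on a single RX queue. */
 static int
 reta_all_to_queue0(uint8_t port)
 {
         /* 128 RETA entries / RTE_RETA_GROUP_SIZE (64) = 2 groups */
         struct rte_eth_rss_reta_entry64 reta_conf[2];
         int i, j;

         memset(reta_conf, 0, sizeof(reta_conf));
         for (i = 0; i < 2; i++) {
                 reta_conf[i].mask = ~0ULL;      /* update all 64 entries */
                 for (j = 0; j < RTE_RETA_GROUP_SIZE; j++)
                         reta_conf[i].reta[j] = 0;       /* queue 0 */
         }
         return rte_eth_dev_rss_reta_update(port, reta_conf, 128);
 }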

* When rx_pkt_burst returns a full burst, look at the subsequent RX
descriptors, using a binary search of calls to rx_descriptor_done, to
see whether the RX desc array is close to running out of new buffers.
The answer was: No, none of the RX queues had more than 100 additional
packets "done" (when testing with 5+ cores).
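
 Roughly, the probe looks like this (a simplified sketch using the
 public rte_eth_rx_descriptor_done() wrapper; it assumes done
 descriptors are contiguous from the current head of the ring):

 #include <rte_ethdev.h>

 /* After a full burst, binary-search descriptor offsets to estimate how
  * many additional packets are already "done" in the RX ring. */
 static uint16_t
 count_done_after_burst(uint8_t port, uint16_t queue, uint16_t max_probe)
 {
         uint16_t lo = 0, hi = max_probe;

         while (lo < hi) {
                 uint16_t mid = lo + (hi - lo + 1) / 2;

                 if (rte_eth_rx_descriptor_done(port, queue, mid) == 1)
                         lo = mid;       /* offset mid is done; look higher */
                 else
                         hi = mid - 1;   /* not done; look lower */
         }
         return lo;      /* highest offset known to be done (0 if none) */
 }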

* Increase testpmd config params, e.g., --rxd, --rxfreet, --burst,
--mbcache. These resulted in only very small improvements, i.e., a
slight reduction in packet loss.


Other Observations
==================
* Some IXGBE RX/TX code paths do not follow (my interpretation of) the
documented semantics of the rx/tx packet burst APIs. For example,
invoke rx_pkt_burst with nb_pkts=64, and it returns 32, even when more
RX packets are available, because the code path is optimized to handle
a burst of 32. The same thing may be true in the tx_pkt_burst code
path.

To allow us to run testpmd with --burst greater than 32, we worked
around these limitations by wrapping the calls to rx_pkt_burst and
tx_pkt_burst with do-whiles that continue while rx/tx burst returns
32 and we have not yet satisfied the desired burst count.
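
The RX-side wrapper looks roughly like this (a sketch; the TX side is
analogous, and the constant 32 reflects the ixgbe paths we observed):

 #include <rte_ethdev.h>
 #include <rte_mbuf.h>

 /* Keep calling rx burst while the PMD returns its internal maximum
  * (32) and the caller still wants more packets for this burst. */
 static inline uint16_t
 rx_burst_full(uint8_t port, uint16_t queue,
               struct rte_mbuf **pkts, uint16_t nb_pkts)
 {
         uint16_t nb_rx = 0, n;

         do {
                 n = rte_eth_rx_burst(port, queue,
                                      pkts + nb_rx, nb_pkts - nb_rx);
                 nb_rx += n;
         } while (n == 32 && nb_rx < nb_pkts);

         return nb_rx;
 }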

The point here is that IXGBE's rx/tx packet burst API behavior is
misleading! The application developer should not need to know that
certain drivers or driver paths do not always complete an entire
burst, even though they could have.

* We naïvely believed that if a run-to-completion model uses too
many cycles per packet, we could just spread it over more cores.
If there is some inherent limit on the number of cores that can
together receive at line rate with no loss, then we obviously need
to change the software architecture, e.g., have I/O cores distribute
packets to worker cores.
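
 For example, an I/O-core loop that hands packets to workers over a
 ring might look something like this (a sketch only; the ring name and
 burst size are ours, not from testpmd):

 #include <rte_ethdev.h>
 #include <rte_mbuf.h>
 #include <rte_ring.h>

 #define IO_BURST 32

 /* I/O lcore: drain one RX queue and pass the mbufs to worker lcores
  * through a single-producer ring; drop what the workers can't absorb. */
 static void
 io_core_loop(uint8_t port, uint16_t queue, struct rte_ring *work_ring)
 {
         struct rte_mbuf *pkts[IO_BURST];

         for (;;) {
                 uint16_t nb_rx = rte_eth_rx_burst(port, queue,
                                                   pkts, IO_BURST);
                 unsigned sent;

                 if (nb_rx == 0)
                         continue;
                 sent = rte_ring_sp_enqueue_burst(work_ring,
                                                  (void **)pkts, nb_rx);
                 while (sent < nb_rx)
                         rte_pktmbuf_free(pkts[sent++]);
         }
 }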

* A similar problem was discussed here:
http://dpdk.org/ml/archives/dev/2014-January/001098.html



--
Regards,
Robert Sanford
