[dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK

Shihabur Rahman Chowdhury shihab.buet at gmail.com
Thu Apr 13 16:21:21 CEST 2017


On Thu, Apr 13, 2017 at 1:19 AM, Shahaf Shuler <shahafs at mellanox.com> wrote:

> Why did you choose such configuration?
> Such configuration may cause high overhead in snoop cycles, as the first
> cache line of the packet
> Will first be on the Rx lcore and then it will need to be invalidated when
> the Tx lcore swaps the macs.
>
> Since you are using 2 cores anyway, have you tried that each core will do
> both Rx and Tx (run to completion)?
>

To give a bit more context: we are developing a set of packet processors
that can be deployed as independent processes and scaled out
independently as well. A batch of packets travels through a sequence of
these processes until it is either written to a Tx queue or dropped by
some processing decision. The packet processors run as secondary DPDK
processes, while Rx takes place in a primary process (the Mellanox PMD
does not allow Rx from a secondary process). In this example
configuration, one primary process does the Rx and hands packets to a
secondary process through a shared ring; the secondary process swaps the
MAC addresses and writes the packets to the Tx queue. We expected some
performance drop from the cache invalidation across lcores (we also
cannot run different secondary processes on the same lcore, since that
would corrupt the mempool's per-lcore caches), but again 7.3 Mpps is
~30+% overhead.
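
For concreteness, here is a minimal sketch of what our secondary
process's loop does (the ring name, port/queue ids, and burst size are
placeholders rather than our actual configuration, and the calls follow
the DPDK 17.02-era API):

    #include <rte_ring.h>
    #include <rte_mbuf.h>
    #include <rte_ethdev.h>
    #include <rte_ether.h>

    #define BURST 32

    /* Swap source and destination MAC addresses in place. */
    static void swap_mac(struct rte_mbuf *m)
    {
            struct ether_hdr *eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
            struct ether_addr tmp;

            /* First write to the packet: this is where the cache line
             * the Rx lcore touched gets invalidated. */
            ether_addr_copy(&eth->s_addr, &tmp);
            ether_addr_copy(&eth->d_addr, &eth->s_addr);
            ether_addr_copy(&tmp, &eth->d_addr);
    }

    /* Secondary process: pull packets from the shared ring, swap
     * MACs, transmit. */
    static void worker_loop(struct rte_ring *ring, uint8_t port,
                            uint16_t queue)
    {
            struct rte_mbuf *pkts[BURST];

            for (;;) {
                    /* 17.02 signature; 17.05+ adds a 4th "available"
                     * argument. */
                    unsigned int n = rte_ring_dequeue_burst(ring,
                                            (void **)pkts, BURST);
                    unsigned int i;
                    uint16_t sent;

                    for (i = 0; i < n; i++)
                            swap_mac(pkts[i]);

                    sent = n ? rte_eth_tx_burst(port, queue, pkts, n) : 0;
                    /* Free whatever the Tx queue could not take. */
                    while (sent < n)
                            rte_pktmbuf_free(pkts[sent++]);
            }
    }

The secondary finds the ring the primary created by name, e.g.
rte_ring_lookup("rx_to_worker"), after rte_eal_init() has attached it
to the primary's hugepage memory.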

Following your suggestion, we tried run-to-completion processing in the
primary process (i.e., Rx and Tx now happen on the same lcore). We also
configured pktgen to handle Rx and Tx on the same lcore. With that we
now get ~9.9-10 Mpps with 64B packets, and our multi-process setup
drops that to ~8.4 Mpps. So it seems pktgen was not configured properly
before. This is a bit counter-intuitive, though: on pktgen's side,
doing Rx and Tx on different lcores should not cause any cache
invalidation (the sets of Rx and Tx packets are disjoint), so for
pktgen, using different lcores should theoretically be better than
handling both Rx and Tx on the same lcore. Am I missing something here?
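
For reference, the run-to-completion variant we measured looks roughly
like this (again just a sketch, reusing swap_mac() and BURST from the
sketch above; the port and queue ids are placeholders):

    /* Run-to-completion: the same lcore receives, swaps MACs, and
     * transmits, so the packet's first cache line never migrates
     * between cores. */
    static void lcore_main(uint8_t port)
    {
            struct rte_mbuf *pkts[BURST];

            for (;;) {
                    uint16_t n = rte_eth_rx_burst(port, 0, pkts, BURST);
                    uint16_t i, sent;

                    for (i = 0; i < n; i++)
                            swap_mac(pkts[i]);

                    sent = rte_eth_tx_burst(port, 0, pkts, n);
                    /* Free anything the Tx queue rejected. */
                    while (sent < n)
                            rte_pktmbuf_free(pkts[sent++]);
            }
    }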

Thanks

