[dpdk-users] Low Rx throughput when using Mellanox ConnectX-3 card with DPDK

Kyle Larose klarose at sandvine.com
Thu Apr 13 17:49:02 CEST 2017


Hey Shihab,


> -----Original Message-----
> From: users [mailto:users-bounces at dpdk.org] On Behalf Of Shihabur Rahman
> Chowdhury
> Sent: Thursday, April 13, 2017 10:21 AM
> To: Shahaf Shuler
> Cc: Dave Wallace; Olga Shern; Adrien Mazarguil; Wiles, Keith; users at dpdk.org
> Subject: Re: [dpdk-users] Low Rx throughput when using Mellanox ConnectX-3
> card with DPDK
> 
>
> To give a bit more context, we are developing a set of packet processors
> that can be deployed as separate processes and scaled out independently.
> A batch of packets goes through a sequence of processes until, at some
> point, the packets are written to the Tx queue or dropped because of a
> processing decision. These packet processors run as secondary DPDK
> processes, and Rx takes place in a primary process (since the Mellanox
> PMD does not allow Rx from a secondary process). In this example
> configuration, one primary process does the Rx and hands packets over to
> a secondary process through a shared ring; that secondary process swaps
> the MAC addresses and writes the packets to the Tx queue. We expected
> some performance drop because of cache invalidation across lcores (also,
> we cannot use the same lcore for different secondary processes because
> of mempool cache corruption), but 7.3Mpps is still 30+% overhead.
> 
> As you suggested, we tried run-to-completion processing in the primary
> process (i.e., Rx and Tx now happen on the same lcore). We also
> configured pktgen to handle Rx and Tx on the same lcore. With that we
> are now getting ~9.9-10Mpps with 64B packets. With our multi-process
> setup that drops down to ~8.4Mpps. So it seems pktgen was not configured
> properly before. This is a bit counter-intuitive, since on pktgen's side
> doing Rx and Tx on different lcores should not cause any cache
> invalidation (the sets of Rx and Tx packets are disjoint). So using
> different lcores should theoretically be better for pktgen than handling
> both Rx and Tx on the same lcore. Am I missing something here?
> 
> Thanks

It sounds to me like your bottleneck is the primary -- the packet distributor. Consider Shahaf's earlier comment: the best Mellanox was able to achieve with testpmd (which is extremely simple) is 10Mpps per core. I've always found that receiving is more expensive than transmitting, which means that if you're splitting your work along those dimensions, you'll need to allocate more CPU to the receiver than to the transmitter. This may be one of the reasons run-to-completion works out -- the lower Tx load on that core offsets the higher Rx cost.
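
For reference, the run-to-completion loop you measured is roughly the following. This is only a sketch: EAL and port setup, error handling, and the single port/queue are my assumptions, the macswap helper is illustrative rather than taken from your code, and the struct/function names are the ones current in today's DPDK releases (ether_hdr, ether_addr_copy):

    #include <rte_ethdev.h>
    #include <rte_ether.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    /* Swap source and destination MAC addresses in place. */
    static inline void
    macswap(struct rte_mbuf *m)
    {
        struct ether_hdr *eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
        struct ether_addr tmp;

        ether_addr_copy(&eth->s_addr, &tmp);
        ether_addr_copy(&eth->d_addr, &eth->s_addr);
        ether_addr_copy(&tmp, &eth->d_addr);
    }

    /* Rx, process and Tx on the same lcore: no cross-core handoff. */
    static int
    lcore_run_to_completion(void *arg)
    {
        const uint8_t port = *(const uint8_t *)arg;
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t i, nb_rx, nb_tx;

        for (;;) {
            nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
            if (nb_rx == 0)
                continue;
            for (i = 0; i < nb_rx; i++)
                macswap(bufs[i]);
            nb_tx = rte_eth_tx_burst(port, 0, bufs, nb_rx);
            /* Drop whatever the Tx queue could not take. */
            for (i = nb_tx; i < nb_rx; i++)
                rte_pktmbuf_free(bufs[i]);
        }
        return 0;
    }

In your split setup, the primary replaces the macswap/tx with a ring enqueue but still pays the full Rx cost on its one core, which is consistent with it being the bottleneck.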

If you want to continue using the packet distribution model, why don't you try using RSS/multiqueue on the distributor, and allocate two cores to it? You'll need some entropy in the packets for it to distribute well, but hopefully that's not a problem. :)
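
Concretely, that would mean configuring the port for RSS with two Rx queues and having each distributor lcore poll its own queue. A minimal sketch follows; the queue counts, descriptor counts and rss_hf selection are assumptions on my part, and mlx4 may limit which hash fields it supports:

    #include <rte_ethdev.h>

    /* Two Rx queues, hashed on the IP/L4 tuple, one polled per lcore. */
    static const struct rte_eth_conf port_conf = {
        .rxmode = {
            .mq_mode = ETH_MQ_RX_RSS,
        },
        .rx_adv_conf = {
            .rss_conf = {
                .rss_key = NULL,  /* let the PMD use its default key */
                .rss_hf  = ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP,
            },
        },
    };

    static int
    setup_rss_port(uint8_t port, struct rte_mempool *pool)
    {
        const uint16_t nb_rxq = 2, nb_txq = 1;
        uint16_t q;
        int ret;

        ret = rte_eth_dev_configure(port, nb_rxq, nb_txq, &port_conf);
        if (ret < 0)
            return ret;

        for (q = 0; q < nb_rxq; q++) {
            ret = rte_eth_rx_queue_setup(port, q, 512,
                    rte_eth_dev_socket_id(port), NULL, pool);
            if (ret < 0)
                return ret;
        }

        ret = rte_eth_tx_queue_setup(port, 0, 512,
                rte_eth_dev_socket_id(port), NULL);
        if (ret < 0)
            return ret;

        /* Each distributor lcore then polls its own queue with
         * rte_eth_rx_burst(port, q, ...), q being 0 or 1. */
        return rte_eth_dev_start(port);
    }

On the generator side you'd want to vary the source IPs or ports so the hash actually spreads flows across both queues.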

Thanks,

Kyle

