[dpdk-dev] TX performance regression caused by the mbuf cachline split

Message ID 2601191342CEEE43887BDE71AB9772582142EB46@irsmsx105.ger.corp.intel.com (mailing list archive)
State Not Applicable, archived
Commit Message

Ananyev, Konstantin May 13, 2015, 9:03 a.m. UTC
  Hi Paul,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Paul Emmerich
> Sent: Tuesday, May 12, 2015 12:19 AM
> To: dev@dpdk.org
> Subject: Re: [dpdk-dev] TX performance regression caused by the mbuf cachline split
>
> Found a really simple solution that almost restores the original
> performance: just add a prefetch on alloc. For some reason, I assumed
> that this was already done since the troublesome commit I investigated
> mentioned something about prefetching... I guess the commit referred to
> the hardware prefetcher in the CPU.
>
> Adding an explicit prefetch command in the mbuf alloc function gives a
> throughput of 12.7/10.35 Mpps in my benchmark with the
> simple/full-featured tx path.
>
> DPDK 1.7.1 was at 14.1/10.7 Mpps. I guess I can live with that, since
> I'm primarily interested in the full-featured path and the drop from
> 10.7 to ~10.4 was due to another change.
>
> Patch: https://github.com/dpdk-org/dpdk/pull/2
> I also sent an email to the mailing list.
>
> I also think that the rx-path could also benefit from prefetching somewhere.

Before starting to discuss your findings, there is one thing in your test app that looks strange to me:
you use BATCH_SIZE==64 for TX packets, but your mempool cache_size==32.
This is not really a good choice, as it means that on each iteration your mempool cache will be exhausted
and you'll end up doing ring_dequeue().
I'd suggest you use something like '2 * BATCH_SIZE' for the mempool's cache size;
that should improve your numbers (at least it did for me).
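For reference, a minimal sketch of the mempool creation I have in mind; NB_MBUF and MBUF_SIZE
are the constants from your test app, and the rest mirrors your make_mempool():

    /* Size the per-lcore cache to comfortably cover one TX batch, so a burst
     * of allocations is served from the cache instead of ring_dequeue(). */
    struct rte_mempool *mp = rte_mempool_create("pool0",
            NB_MBUF, MBUF_SIZE,
            2 * BATCH_SIZE,                          /* per-lcore cache size */
            sizeof(struct rte_pktmbuf_pool_private),
            rte_pktmbuf_pool_init, NULL,             /* pool init */
            rte_pktmbuf_init, NULL,                  /* per-mbuf init */
            rte_socket_id(), 0);                     /* socket id, flags */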

About the patch:
So from what you are saying, the reason for the drop is not actually the TX path,
but rte_pktmbuf_alloc()->rte_pktmbuf_reset().
That makes sense: pktmbuf_reset() now has to update two cache lines instead of one.
On the other hand, rte_pktmbuf_alloc() was never considered a fast path
(our RX/TX routines don't use it), so we never put much effort into optimising it.

Though, I am really not a big fan of manual prefetching.
Its exact behaviour may vary from one CPU to another,
its real effect is hard to predict,
and in some cases it can even cause a performance degradation.
For example, on my IVB box your patch didn't show any difference at all.
So I think that 'prefetch' should be used only when it really gives a significant performance boost
and the same results can't be achieved by other methods.
For this particular case, that 'prefetch' should at least be moved from __rte_mbuf_raw_alloc()
to rte_pktmbuf_alloc(), to avoid any negative impact on the RX path.
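To illustrate, a rough sketch of such a move (hypothetical, not a patch I am proposing here;
'cacheline1' is the marker for the mbuf's second cache line):

    /* Hypothetical sketch: prefetch only in rte_pktmbuf_alloc(), so
     * __rte_mbuf_raw_alloc() - and therefore the RX path - stays untouched. */
    static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
    {
            struct rte_mbuf *m;

            if ((m = __rte_mbuf_raw_alloc(mp)) != NULL) {
                    /* warm the second cache line before rte_pktmbuf_reset()
                     * writes to it */
                    rte_prefetch0(&m->cacheline1);
                    rte_pktmbuf_reset(m);
            }
            return m;
    }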
Though, I suppose that scenario might be improved without a manual 'prefetch', by reordering the code a bit.
Below are two small patches that introduce rte_pktmbuf_bulk_alloc() and modify your test app to use it.
Could you give them a try and see whether they help to close the gap between 1.7.1 and 2.0?
I don't have a box with the same CPU at hand, but on my IVB box the results are quite promising:
at 1.2 GHz there is practically no difference in results for simple_tx (-0.33%),
and for full_tx the drop is reduced to 2%.
That's comparing DPDK 1.7.1 + test app with cache_size=2*batch_size against
the latest DPDK + test app with cache_size=2*batch_size + bulk_alloc.

Thanks
Konstantin

patch1:
  

Comments

Paul Emmerich Feb. 15, 2016, 7:15 p.m. UTC | #1
Hi,

here's a somewhat late follow-up. I've only recently found the need
(mostly for better support of XL710 NICs, which I still dislike, but
people are using them...) to seriously address DPDK 2.x support in MoonGen.

On 13.05.15 11:03, Ananyev, Konstantin wrote:
> Before starting to discuss your findings, there is one thing in your test app that looks strange to me:
> you use BATCH_SIZE==64 for TX packets, but your mempool cache_size==32.
> This is not really a good choice, as it means that on each iteration your mempool cache will be exhausted
> and you'll end up doing ring_dequeue().
> I'd suggest you use something like '2 * BATCH_SIZE' for the mempool's cache size;
> that should improve your numbers (at least it did for me).

Thanks for pointing that out. However, my real app did not have this bug 
and I also saw the performance improvement there.

> Though, I suppose that scenario might be improved without a manual 'prefetch', by reordering the code a bit.
> Below are two small patches that introduce rte_pktmbuf_bulk_alloc() and modify your test app to use it.
> Could you give them a try and see whether they help to close the gap between 1.7.1 and 2.0?
> I don't have a box with the same CPU at hand, but on my IVB box the results are quite promising:
> at 1.2 GHz there is practically no difference in results for simple_tx (-0.33%),
> and for full_tx the drop is reduced to 2%.
> That's comparing DPDK 1.7.1 + test app with cache_size=2*batch_size against
> the latest DPDK + test app with cache_size=2*batch_size + bulk_alloc.

The bulk_alloc patch is great and helps. I'd love to see such a function 
in DPDK.

I agree that this is a better solution than prefetching. I also can't 
see a difference with/without prefetching when using bulk alloc.


  Paul
  
Olivier Matz Feb. 19, 2016, 12:31 p.m. UTC | #2
Hi Paul,

On 02/15/2016 08:15 PM, Paul Emmerich wrote:
> The bulk_alloc patch is great and helps. I'd love to see such a function
> in DPDK.
> 

A patch has been submitted by Huawei. I guess it will be integrated
soon.
See http://dpdk.org/dev/patchwork/patch/10122/
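For reference, usage should look roughly like this, assuming the name and return convention
from that patch (rte_pktmbuf_alloc_bulk() returning 0 on success):

    /* Sketch only: allocate and reset a whole TX batch in one call,
     * then set up each packet as the test app does. */
    struct rte_mbuf *bufs[BATCH_SIZE];

    if (rte_pktmbuf_alloc_bulk(pool, bufs, BATCH_SIZE) != 0)
            return 0;   /* pool exhausted, nothing was allocated */

    for (uint32_t i = 0; i < BATCH_SIZE; i++) {
            rte_pktmbuf_data_len(bufs[i]) = 60;
            rte_pktmbuf_pkt_len(bufs[i]) = 60;
    }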


Regards,
Olivier
  

Patch

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index ab6de67..23d79ca 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -810,6 +810,45 @@  static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
        return (m);
 }

+static inline int
+rte_pktmbuf_bulk_alloc(struct rte_mempool *mp, struct rte_mbuf **m, uint32_t n)
+{
+       int32_t rc;
+       uint32_t i;
+
+       rc = rte_mempool_get_bulk(mp, (void **)m, n);
+
+       if (rc == 0) {
+               i = 0;
+               switch (n % 4) {
+               while (i != n) {
+                       case 0:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+                       case 3:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+                       case 2:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+                       case 1:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+               }
+               }
+       }
+
+       return rc;
+}
+
 /**
  * Attach packet mbuf to another packet mbuf.
  *

patch2:
diff --git a/main.c b/main.c
index 2aa9fcf..749c52c 100644
--- a/main.c
+++ b/main.c
@@ -71,7 +71,7 @@  static struct rte_mempool* make_mempool() {
        static int pool_id = 0;
        char pool_name[32];
        sprintf(pool_name, "pool%d", __sync_fetch_and_add(&pool_id, 1));
-       return rte_mempool_create(pool_name, NB_MBUF, MBUF_SIZE, 32,
+       return rte_mempool_create(pool_name, NB_MBUF, MBUF_SIZE, 2 * BATCH_SIZE,
                sizeof(struct rte_pktmbuf_pool_private),
                rte_pktmbuf_pool_init, NULL,
                rte_pktmbuf_init, NULL,
@@ -113,13 +113,21 @@  static uint32_t send_pkts(uint8_t port, struct rte_mempool* pool) {
        // alloc bufs
        struct rte_mbuf* bufs[BATCH_SIZE];
        uint32_t i;
+       int32_t rc;
+
+       rc = rte_pktmbuf_bulk_alloc(pool, bufs, RTE_DIM(bufs));
+       if (rc < 0) {
+               RTE_LOG(ERR, USER1,
+                       "%s: rte_pktmbuf_bulk_alloc(%zu) returns error code: %d\n",
+                       __func__, RTE_DIM(bufs), rc);
+               return 0;
+       }
+
        for (i = 0; i < BATCH_SIZE; i++) {
-               struct rte_mbuf* buf = rte_pktmbuf_alloc(pool);
-               rte_pktmbuf_data_len(buf) = 60;
-               rte_pktmbuf_pkt_len(buf) = 60;
-               bufs[i] = buf;
+               rte_pktmbuf_data_len(bufs[i]) = 60;
+               rte_pktmbuf_pkt_len(bufs[i]) = 60;
                // write seq number
-               uint64_t* pkt = rte_pktmbuf_mtod(buf, uint64_t*);
+               uint64_t* pkt = rte_pktmbuf_mtod(bufs[i], uint64_t*);
                pkt[0] = seq++;
        }
        // send pkts