[dpdk-dev] TX performance regression caused by the mbuf cachline split
Paul Emmerich
emmericp at net.in.tum.de
Mon Feb 15 20:15:23 CET 2016
Hi,
here's a somewhat late follow-up. I've only recently found the need to
seriously address DPDK 2.x support in MoonGen, mostly for better support
of XL710 NICs (which I still dislike, but people are using them...).
On 13.05.15 11:03, Ananyev, Konstantin wrote:
> Before start to discuss your findings, there is one thing in your test app that looks strange to me:
> You use BATCH_SIZE==64 for TX packets, but your mempool cache_size==32.
> This is not really a good choice, as it means that for each iteration your mempool cache will be exhausted,
> and you'll end up doing ring_dequeue().
> I'd suggest you use something like '2 * BATCH_SIZE' for the mempool cache size,
> that should improve your numbers (at least it did for me).
Thanks for pointing that out. However, my real app did not have this bug
and I also saw the performance improvement there.
> Though, I suppose that scenario might be improved without manual 'prefetch' - by reordering code a bit.
> Below are 2 small patches, that introduce rte_pktmbuf_bulk_alloc() and modifies your test app to use it.
> Could you give it a try and see would it help to close a gap between 1.7.1 and 2.0?
> I don't have a box with the same setup off-hand, but on my IVB box results are quite promising:
> on 1.2 GHz for simple_tx there is practically no difference in results (-0.33%),
> for full_tx the drop reduced to 2%.
> That's comparing DPDK 1.7.1 + testapp with cache_size=2*batch_size vs
> latest DPDK + testapp with cache_size=2*batch_size + bulk_alloc.
The bulk_alloc patch is great and helps. I'd love to see such a function
in DPDK.
I agree that this is a better solution than prefetching. With bulk
alloc, I can no longer see a difference with or without prefetching.
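For readers who want to experiment before such a function is merged: a
bulk allocator in the spirit of that patch can be sketched on top of
the public mempool API roughly as follows. This is my reconstruction,
not Konstantin's actual patch; the function name and the exact
per-mbuf init steps are assumptions. (A function along these lines,
rte_pktmbuf_alloc_bulk(), did land in mainline DPDK later.)

```c
#include <errno.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical sketch of a bulk mbuf allocator (name assumed).
 * The point is to replace 'count' independent pool accesses with a
 * single bulk dequeue, amortizing the per-packet mempool overhead. */
static inline int
pktmbuf_bulk_alloc(struct rte_mempool *pool,
                   struct rte_mbuf **mbufs, unsigned count)
{
    unsigned i;

    /* One bulk dequeue instead of 'count' single dequeues. */
    if (rte_mempool_get_bulk(pool, (void **)mbufs, count) != 0)
        return -ENOENT;

    /* Per-mbuf init, as rte_pktmbuf_alloc() would do it. */
    for (i = 0; i < count; i++) {
        rte_mbuf_refcnt_set(mbufs[i], 1);
        rte_pktmbuf_reset(mbufs[i]);
    }
    return 0;
}
```

Note that rte_mempool_get_bulk() is all-or-nothing: if the pool cannot
supply the full batch, the sketch returns -ENOENT and allocates nothing.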
Paul