[dpdk-dev] TX performance regression caused by the mbuf cachline split

Message ID 2601191342CEEE43887BDE71AB9772582142EB46@irsmsx105.ger.corp.intel.com (mailing list archive)
State Not Applicable, archived
Commit Message

Ananyev, Konstantin May 13, 2015, 9:03 a.m. UTC
  Hi Paul,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Paul Emmerich
> Sent: Tuesday, May 12, 2015 12:19 AM
> To: dev@dpdk.org
> Subject: Re: [dpdk-dev] TX performance regression caused by the mbuf cachline split
>
> Found a really simple solution that almost restores the original
> performance: just add a prefetch on alloc. For some reason, I assumed
> that this was already done since the troublesome commit I investigated
> mentioned something about prefetching... I guess the commit referred to
> the hardware prefetcher in the CPU.
>
> Adding an explicit prefetch command in the mbuf alloc function gives a
> throughput of 12.7/10.35 Mpps in my benchmark with the
> simple/full-featured tx path.
>
> DPDK 1.7.1 was at 14.1/10.7 Mpps. I guess I can live with that, since
> I'm primarily interested in the full-featured path and the drop from
> 10.7 to ~10.4 was due to another change.
>
> Patch: https://github.com/dpdk-org/dpdk/pull/2
> I also sent an email to the mailing list.
>
> I also think that the rx-path could also benefit from prefetching somewhere.

Before starting to discuss your findings, there is one thing in your test app that looks strange to me:
you use BATCH_SIZE==64 for TX packets, but your mempool cache_size==32.
This is not really a good choice, as it means that on each iteration your mempool cache will be exhausted
and you'll end up doing ring_dequeue().
I'd suggest you use something like '2 * BATCH_SIZE' for the mempool's cache size;
that should improve your numbers (at least it did for me).
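For reference, a minimal sketch of the mempool creation I have in mind; NB_MBUF and MBUF_SIZE
are the constants from your test app, and the rest mirrors your make_mempool():

    /* Size the per-lcore cache to comfortably cover one TX batch, so a burst
     * of allocations is served from the cache instead of ring_dequeue(). */
    struct rte_mempool *mp = rte_mempool_create("pool0",
            NB_MBUF, MBUF_SIZE,
            2 * BATCH_SIZE,                          /* per-lcore cache size */
            sizeof(struct rte_pktmbuf_pool_private),
            rte_pktmbuf_pool_init, NULL,             /* pool init */
            rte_pktmbuf_init, NULL,                  /* per-mbuf init */
            rte_socket_id(), 0);                     /* socket id, flags */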

About the patch:
So from what you are saying, the reason for the drop is not actually the TX path,
but rte_pktmbuf_alloc()->rte_pktmbuf_reset().
That makes sense: pktmbuf_reset() now has to update two cache lines instead of one.
On the other hand, rte_pktmbuf_alloc() was never considered a fast path
(our RX/TX routines don't use it), so we never put much effort into optimising it.

Though, I am really not a big fan of manual prefetching.
Its exact behaviour may vary from one CPU to another,
its real effect is hard to predict,
and in some cases it can even cause a performance degradation.
For example, on my IVB box your patch didn't show any difference at all.
So I think that 'prefetch' should be used only when it really gives a significant performance boost
and the same results can't be achieved by other methods.
For this particular case, that 'prefetch' should at least be moved from __rte_mbuf_raw_alloc()
to rte_pktmbuf_alloc(), to avoid any negative impact on the RX path.
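To illustrate, a rough sketch of such a move (hypothetical, not a patch I am proposing here;
'cacheline1' is the marker for the mbuf's second cache line):

    /* Hypothetical sketch: prefetch only in rte_pktmbuf_alloc(), so
     * __rte_mbuf_raw_alloc() - and therefore the RX path - stays untouched. */
    static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
    {
            struct rte_mbuf *m;

            if ((m = __rte_mbuf_raw_alloc(mp)) != NULL) {
                    /* warm the second cache line before rte_pktmbuf_reset()
                     * writes to it */
                    rte_prefetch0(&m->cacheline1);
                    rte_pktmbuf_reset(m);
            }
            return m;
    }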
Though, I suppose that scenario might be improved without a manual 'prefetch', by reordering the code a bit.
Below are two small patches that introduce rte_pktmbuf_bulk_alloc() and modify your test app to use it.
Could you give them a try and see whether they help to close the gap between 1.7.1 and 2.0?
I don't have a box with the same CPU at hand, but on my IVB box the results are quite promising:
at 1.2 GHz there is practically no difference in results for simple_tx (-0.33%),
and for full_tx the drop is reduced to 2%.
That's comparing DPDK 1.7.1 + test app with cache_size=2*batch_size against
the latest DPDK + test app with cache_size=2*batch_size + bulk_alloc.

Thanks
Konstantin

patch1:
  

Comments

Paul Emmerich Feb. 15, 2016, 7:15 p.m. UTC | #1
Hi,

here's a somewhat late follow-up. I've only recently found the need
(mostly for better support of XL710 NICs, which I still dislike, but
people are using them...) to seriously address DPDK 2.x support in MoonGen.

On 13.05.15 11:03, Ananyev, Konstantin wrote:
> Before starting to discuss your findings, there is one thing in your test app that looks strange to me:
> you use BATCH_SIZE==64 for TX packets, but your mempool cache_size==32.
> This is not really a good choice, as it means that on each iteration your mempool cache will be exhausted
> and you'll end up doing ring_dequeue().
> I'd suggest you use something like '2 * BATCH_SIZE' for the mempool's cache size;
> that should improve your numbers (at least it did for me).

Thanks for pointing that out. However, my real app did not have this bug 
and I also saw the performance improvement there.

> Though, I suppose that scenario might be improved without a manual 'prefetch', by reordering the code a bit.
> Below are two small patches that introduce rte_pktmbuf_bulk_alloc() and modify your test app to use it.
> Could you give them a try and see whether they help to close the gap between 1.7.1 and 2.0?
> I don't have a box with the same CPU at hand, but on my IVB box the results are quite promising:
> at 1.2 GHz there is practically no difference in results for simple_tx (-0.33%),
> and for full_tx the drop is reduced to 2%.
> That's comparing DPDK 1.7.1 + test app with cache_size=2*batch_size against
> the latest DPDK + test app with cache_size=2*batch_size + bulk_alloc.

The bulk_alloc patch is great and helps. I'd love to see such a function 
in DPDK.

I agree that this is a better solution than prefetching. I also can't 
see a difference with/without prefetching when using bulk alloc.


  Paul
  
Olivier Matz Feb. 19, 2016, 12:31 p.m. UTC | #2
Hi Paul,

On 02/15/2016 08:15 PM, Paul Emmerich wrote:
> The bulk_alloc patch is great and helps. I'd love to see such a function
> in DPDK.
> 

A patch has been submitted by Huawei. I guess it will be integrated
soon.
See http://dpdk.org/dev/patchwork/patch/10122/
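For reference, usage should look roughly like this, assuming the name and return convention
from that patch (rte_pktmbuf_alloc_bulk() returning 0 on success):

    /* Sketch only: allocate and reset a whole TX batch in one call,
     * then set up each packet as the test app does. */
    struct rte_mbuf *bufs[BATCH_SIZE];

    if (rte_pktmbuf_alloc_bulk(pool, bufs, BATCH_SIZE) != 0)
            return 0;   /* pool exhausted, nothing was allocated */

    for (uint32_t i = 0; i < BATCH_SIZE; i++) {
            rte_pktmbuf_data_len(bufs[i]) = 60;
            rte_pktmbuf_pkt_len(bufs[i]) = 60;
    }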


Regards,
Olivier
  

Patch

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index ab6de67..23d79ca 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -810,6 +810,45 @@  static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
        return (m);
 }

+static inline int
+rte_pktmbuf_bulk_alloc(struct rte_mempool *mp, struct rte_mbuf **m, uint32_t n)
+{
+       int32_t rc;
+       uint32_t i;
+
+       rc = rte_mempool_get_bulk(mp, (void **)m, n);
+
+       if (rc == 0) {
+               i = 0;
+               switch (n % 4) {
+               while (i != n) {
+                       case 0:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+                       case 3:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+                       case 2:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+                       case 1:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+               }
+               }
+       }
+
+       return rc;
+}
+
 /**
  * Attach packet mbuf to another packet mbuf.
  *

patch2:
diff --git a/main.c b/main.c
index 2aa9fcf..749c52c 100644
--- a/main.c
+++ b/main.c
@@ -71,7 +71,7 @@  static struct rte_mempool* make_mempool() {
        static int pool_id = 0;
        char pool_name[32];
        sprintf(pool_name, "pool%d", __sync_fetch_and_add(&pool_id, 1));
-       return rte_mempool_create(pool_name, NB_MBUF, MBUF_SIZE, 32,
+       return rte_mempool_create(pool_name, NB_MBUF, MBUF_SIZE, 2 * BATCH_SIZE,
                sizeof(struct rte_pktmbuf_pool_private),
                rte_pktmbuf_pool_init, NULL,
                rte_pktmbuf_init, NULL,
@@ -113,13 +113,21 @@  static uint32_t send_pkts(uint8_t port, struct rte_mempool* pool) {
        // alloc bufs
        struct rte_mbuf* bufs[BATCH_SIZE];
        uint32_t i;
+       int32_t rc;
+
+       rc = rte_pktmbuf_bulk_alloc(pool, bufs, RTE_DIM(bufs));
+       if (rc < 0) {
+               RTE_LOG(ERR, USER1,
+                       "%s: rte_pktmbuf_bulk_alloc(%zu) returns error code: %d\n",
+                       __func__, RTE_DIM(bufs), rc);
+               return 0;
+       }
+
        for (i = 0; i < BATCH_SIZE; i++) {
-               struct rte_mbuf* buf = rte_pktmbuf_alloc(pool);
-               rte_pktmbuf_data_len(buf) = 60;
-               rte_pktmbuf_pkt_len(buf) = 60;
-               bufs[i] = buf;
+               rte_pktmbuf_data_len(bufs[i]) = 60;
+               rte_pktmbuf_pkt_len(bufs[i]) = 60;
                // write seq number
-               uint64_t* pkt = rte_pktmbuf_mtod(buf, uint64_t*);
+               uint64_t* pkt = rte_pktmbuf_mtod(bufs[i], uint64_t*);
                pkt[0] = seq++;
        }
        // send pkts