[dpdk-dev] TX performance regression caused by the mbuf cacheline split

Ananyev, Konstantin konstantin.ananyev at intel.com
Wed May 13 11:03:18 CEST 2015


Hi Paul,

> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Paul Emmerich
> Sent: Tuesday, May 12, 2015 12:19 AM
> To: dev at dpdk.org
> Subject: Re: [dpdk-dev] TX performance regression caused by the mbuf cacheline split
> 
> Found a really simple solution that almost restores the original
> performance: just add a prefetch on alloc. For some reason, I assumed
> that this was already done since the troublesome commit I investigated
> mentioned something about prefetching... I guess the commit referred to
> the hardware prefetcher in the CPU.
> 
> Adding an explicit prefetch command in the mbuf alloc function gives a
> throughput of 12.7/10.35 Mpps in my benchmark with the
> simple/full-featured tx path.
> 
> DPDK 1.7.1 was at 14.1/10.7 Mpps. I guess I can live with that, since
> I'm primarily interested in the full-featured path and the drop from
> 10.7 to ~10.4 was due to another change.
> 
> Patch: https://github.com/dpdk-org/dpdk/pull/2
> I also sent an email to the mailing list.
> 
> I think that the RX path could also benefit from prefetching somewhere.
> 

Before we discuss your findings, there is one thing in your test app that looks strange to me:
you use BATCH_SIZE==64 for TX packets, but your mempool cache_size==32.
This is not really a good choice, as it means that on each iteration your mempool cache will be exhausted,
and you'll end up doing ring_dequeue().
I'd suggest you use something like '2 * BATCH_SIZE' for the mempool cache size;
that should improve your numbers (at least it did for me).
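For illustration, that is a one-argument change in the mempool creation (the
same change patch2 below applies; the socket id and flags arguments are shown
here with typical values):

/* A cache of 2 * BATCH_SIZE mbufs lets a whole TX burst be served from the
 * per-lcore mempool cache instead of falling back to ring_dequeue(). */
mp = rte_mempool_create(pool_name, NB_MBUF, MBUF_SIZE,
        2 * BATCH_SIZE /* cache_size */,
        sizeof(struct rte_pktmbuf_pool_private),
        rte_pktmbuf_pool_init, NULL,
        rte_pktmbuf_init, NULL,
        rte_socket_id(), 0);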

About the patch:
So from what you are saying, the reason for the drop is not actually the TX path,
but rte_pktmbuf_alloc()->rte_pktmbuf_reset().
That makes sense - pktmbuf_reset() now has to update 2 cache lines instead of one.
On the other hand, rte_pktmbuf_alloc() was never considered part of the fast path
(our RX/TX routines don't use it), so we never put a big effort into optimising it.
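To recap why the reset became more expensive, here is a rough sketch of the
post-split mbuf layout (field lists abbreviated, not the exact 2.0 definition):

struct rte_mbuf {
        MARKER cacheline0;
        /* 1st cache line: buf_addr, refcnt, nb_segs, port, ol_flags,
         * data_off, data_len, pkt_len, ... */

        MARKER cacheline1 __rte_cache_aligned;
        /* 2nd cache line: next, pool, TX offload fields, ... */
};

rte_pktmbuf_reset() writes fields in both halves (e.g. pkt_len and data_off
on the first line, next on the second), so resetting a freshly allocated mbuf
now touches two cache lines.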

Though, I am really not a big fan of manual prefetching.
Its exact behaviour may vary from one CPU to another,
its real effect is hard to predict,
and in some cases it can even cause a performance degradation.
On my IVB (Ivy Bridge) box, for example, your patch didn't show any difference at all.
So I think that 'prefetch' should be used only when it really gives a big performance boost
and the same results can't be achieved by other methods.
For that particular case, the 'prefetch' should at least be moved from __rte_mbuf_raw_alloc()
to rte_pktmbuf_alloc(), to avoid any negative impact on the RX path.
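As a rough, untested sketch of that relocation (it assumes the mbuf's second
cache line can be addressed via the 'cacheline1' marker in struct rte_mbuf):

static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
{
        struct rte_mbuf *m;

        if ((m = __rte_mbuf_raw_alloc(mp)) != NULL) {
                /* Prefetch the 2nd cache line only on the alloc+reset path,
                 * so RX routines that call __rte_mbuf_raw_alloc() directly
                 * are not affected. */
                rte_prefetch0(&m->cacheline1);
                rte_pktmbuf_reset(m);
        }
        return (m);
}
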
Though, I suppose that scenario might be improved without any manual 'prefetch' - by reordering the code a bit.
Below are 2 small patches that introduce rte_pktmbuf_bulk_alloc() and modify your test app to use it.
Could you give them a try and see whether they help to close the gap between 1.7.1 and 2.0?
I don't have a box with the same CPU off-hand, but on my IVB box the results are quite promising:
at 1.2 GHz, for simple_tx there is practically no difference in the results (-0.33%),
and for full_tx the drop is reduced to 2%.
That's comparing DPDK 1.7.1 + testapp with cache_size=2*batch_size vs
the latest DPDK + testapp with cache_size=2*batch_size + bulk_alloc.

Thanks
Konstantin

patch1:

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index ab6de67..23d79ca 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -810,6 +810,50 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
        return (m);
 }

+static inline int
+rte_pktmbuf_bulk_alloc(struct rte_mempool *mp, struct rte_mbuf **m, uint32_t n)
+{
+       int32_t rc;
+       uint32_t i;
+
+       rc = rte_mempool_get_bulk(mp, (void **)m, n);
+
+       /* With n == 0, the switch below would jump to 'case 0' and enter
+        * the loop body with no mbufs to initialise, hence the extra check. */
+       if (rc == 0 && n != 0) {
+               i = 0;
+               /* Unrolled initialisation loop (Duff's device): the switch
+                * jumps into the loop body to consume the n % 4 remainder,
+                * then each full pass initialises four mbufs. */
+               switch (n % 4) {
+               while (i != n) {
+                       case 0:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+                       case 3:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+                       case 2:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+                       case 1:
+                       RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
+                       rte_mbuf_refcnt_set(m[i], 1);
+                       rte_pktmbuf_reset(m[i]);
+                       i++;
+               } /* while */
+               } /* switch */
+       }
+
+       return rc;
+}
+
 /**
  * Attach packet mbuf to another packet mbuf.
  *
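
A note on the loop structure above, in case it looks odd: the switch that
jumps into the middle of the while body is a Duff's device - it handles the
n % 4 remainder on the first pass, then initialises four mbufs per iteration.
A plain (un-unrolled) equivalent of that initialisation loop would be:

for (i = 0; i != n; i++) {
        RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m[i]) == 0);
        rte_mbuf_refcnt_set(m[i], 1);
        rte_pktmbuf_reset(m[i]);
}

The unrolled version just shaves a bit of loop overhead off the alloc path.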

patch2:
diff --git a/main.c b/main.c
index 2aa9fcf..749c52c 100644
--- a/main.c
+++ b/main.c
@@ -71,7 +71,7 @@ static struct rte_mempool* make_mempool() {
        static int pool_id = 0;
        char pool_name[32];
        sprintf(pool_name, "pool%d", __sync_fetch_and_add(&pool_id, 1));
-       return rte_mempool_create(pool_name, NB_MBUF, MBUF_SIZE, 32,
+       return rte_mempool_create(pool_name, NB_MBUF, MBUF_SIZE, 2 * BATCH_SIZE,
                sizeof(struct rte_pktmbuf_pool_private),
                rte_pktmbuf_pool_init, NULL,
                rte_pktmbuf_init, NULL,
@@ -113,13 +113,21 @@ static uint32_t send_pkts(uint8_t port, struct rte_mempool* pool) {
        // alloc bufs
        struct rte_mbuf* bufs[BATCH_SIZE];
        uint32_t i;
+       int32_t rc;
+
+       rc = rte_pktmbuf_bulk_alloc(pool, bufs, RTE_DIM(bufs));
+       if (rc < 0) {
+               RTE_LOG(ERR, USER1,
+                       "%s: rte_pktmbuf_bulk_alloc(%zu) returns error code: %d\n",
+                       __func__, RTE_DIM(bufs), rc);
+               return 0;
+       }
+
        for (i = 0; i < BATCH_SIZE; i++) {
-               struct rte_mbuf* buf = rte_pktmbuf_alloc(pool);
-               rte_pktmbuf_data_len(buf) = 60;
-               rte_pktmbuf_pkt_len(buf) = 60;
-               bufs[i] = buf;
+               rte_pktmbuf_data_len(bufs[i]) = 60;
+               rte_pktmbuf_pkt_len(bufs[i]) = 60;
                // write seq number
-               uint64_t* pkt = rte_pktmbuf_mtod(buf, uint64_t*);
+               uint64_t* pkt = rte_pktmbuf_mtod(bufs[i], uint64_t*);
                pkt[0] = seq++;
        }
        // send pkts


