[dpdk-dev] net/i40e: add packet prefetch

Message ID 188971FCDA171749BED5DA74ABF3E6F03B6ACF0D@shsmsx102.ccr.corp.intel.com (mailing list archive)
State Not Applicable, archived
Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel-compilation success Compilation OK

Commit Message

Pei, Yulong April 1, 2017, 2:01 a.m. UTC
  Hi All

In non-vector mode, without this patch, single-core performance can reach 37.576 Mpps with 64-byte packets,
but after applying this patch, single-core performance drops to 34.343 Mpps with 64-byte packets.

Best Regards
Yulong Pei

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladyslav Buslov
Sent: Wednesday, March 1, 2017 6:57 PM
To: Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>
Cc: dev@dpdk.org
Subject: [dpdk-dev] [PATCH] net/i40e: add packet prefetch

Prefetch both cache lines of mbuf and first cache line of payload if CONFIG_RTE_PMD_PACKET_PREFETCH is set.

Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
---
 drivers/net/i40e/i40e_rxtx.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

 
 	rxdp = &rxq->rx_ring[alloc_idx];
 	for (i = 0; i < rxq->rx_free_thresh; i++) {
-		if (likely(i < (rxq->rx_free_thresh - 1)))
+		if (likely(i < (rxq->rx_free_thresh - 1))) {
 			/* Prefetch next mbuf */
-			rte_prefetch0(rxep[i + 1].mbuf);
+			rte_packet_prefetch(rxep[i + 1].mbuf->cacheline0);
+			rte_packet_prefetch(rxep[i + 1].mbuf->cacheline1);
+		}
 
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);
@@ -752,7 +763,8 @@ i40e_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 				I40E_RXD_QW1_LENGTH_PBUF_SHIFT) - rxq->crc_len;
 
 		rxm->data_off = RTE_PKTMBUF_HEADROOM;
-		rte_prefetch0(RTE_PTR_ADD(rxm->buf_addr, RTE_PKTMBUF_HEADROOM));
+		rte_packet_prefetch(RTE_PTR_ADD(rxm->buf_addr,
+						RTE_PKTMBUF_HEADROOM));
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = rx_packet_len;
@@ -939,7 +951,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		first_seg->ol_flags |= pkt_flags;
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_prefetch0(RTE_PTR_ADD(first_seg->buf_addr,
+		rte_packet_prefetch(RTE_PTR_ADD(first_seg->buf_addr,
 			first_seg->data_off));
 		rx_pkts[nb_rx++] = first_seg;
 		first_seg = NULL;
--
2.1.4
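
In isolation, the pattern the patch introduces is roughly the following sketch (not the patch itself; it assumes the rte_mbuf layout of the DPDK releases this patch targets, where cacheline0/cacheline1 are zero-size marker fields, and it compiles away when CONFIG_RTE_PMD_PACKET_PREFETCH is unset):

#include <rte_mbuf.h>
#include <rte_prefetch.h>

#ifdef RTE_PMD_PACKET_PREFETCH
#define rte_packet_prefetch(p)	rte_prefetch0(p)
#else
#define rte_packet_prefetch(p)	do {} while (0)
#endif

/* Warm both mbuf cache lines of the next RX entry and the first payload
 * cache line of the current one before the descriptor is parsed. */
static inline void
prefetch_rx_entry(struct rte_mbuf *next_mb, struct rte_mbuf *cur_mb)
{
	rte_packet_prefetch(next_mb->cacheline0);
	rte_packet_prefetch(next_mb->cacheline1);
	rte_packet_prefetch(RTE_PTR_ADD(cur_mb->buf_addr, RTE_PKTMBUF_HEADROOM));
}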
  

Comments

Ferruh Yigit April 3, 2017, 10:31 a.m. UTC | #1
On 4/1/2017 3:01 AM, Pei, Yulong wrote:
> Hi All
> 
> In non-vector mode, without this patch, single-core performance can reach 37.576 Mpps with 64-byte packets,
> but after applying this patch, single-core performance drops to 34.343 Mpps with 64-byte packets.

Thanks Yulong for the test.

Vladyslav,

Do the above test results match your results?

If there is a degradation, this patch should not be merged.

Thanks,
ferruh

> 
> Best Regards
> Yulong Pei
> 
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladyslav Buslov
> Sent: Wednesday, March 1, 2017 6:57 PM
> To: Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH] net/i40e: add packet prefetch
> 
> Prefetch both cache lines of mbuf and first cache line of payload if CONFIG_RTE_PMD_PACKET_PREFETCH is set.
> 
> Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
<..>
  
Ananyev, Konstantin April 3, 2017, 10:47 a.m. UTC | #2
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Pei, Yulong
> Sent: Saturday, April 1, 2017 3:02 AM
> To: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>; Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>;
> Yigit, Ferruh <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add packet prefetch
> 
> Hi All
> 
> In non-vector mode, without this patch, single-core performance can reach 37.576 Mpps with 64-byte packets,
> but after applying this patch, single-core performance drops to 34.343 Mpps with 64-byte packets.
> 
> Best Regards
> Yulong Pei
> 
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladyslav Buslov
> Sent: Wednesday, March 1, 2017 6:57 PM
> To: Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>
> Cc: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH] net/i40e: add packet prefetch
> 
> Prefetch both cache lines of mbuf and first cache line of payload if CONFIG_RTE_PMD_PACKET_PREFETCH is set.
> 
> Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
> ---
>  drivers/net/i40e/i40e_rxtx.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> index 48429cc..2b4e5c9 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -100,6 +100,12 @@
>  #define I40E_TX_OFFLOAD_NOTSUP_MASK \
>  		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
> 
> +#ifdef RTE_PMD_PACKET_PREFETCH
> +#define rte_packet_prefetch(p)	rte_prefetch0(p)
> +#else
> +#define rte_packet_prefetch(p)	do {} while (0)
> +#endif
> +
>  static uint16_t i40e_xmit_pkts_simple(void *tx_queue,
>  				      struct rte_mbuf **tx_pkts,
>  				      uint16_t nb_pkts);
> @@ -495,6 +501,9 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
>  		/* Translate descriptor info to mbuf parameters */
>  		for (j = 0; j < nb_dd; j++) {
>  			mb = rxep[j].mbuf;
> +			rte_packet_prefetch(
> +				RTE_PTR_ADD(mb->buf_addr,
> +						RTE_PKTMBUF_HEADROOM));
>  			qword1 = rte_le_to_cpu_64(\
>  				rxdp[j].wb.qword1.status_error_len);
>  			pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> @@ -578,9 +587,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> 
>  	rxdp = &rxq->rx_ring[alloc_idx];
>  	for (i = 0; i < rxq->rx_free_thresh; i++) {
> -		if (likely(i < (rxq->rx_free_thresh - 1)))
> +		if (likely(i < (rxq->rx_free_thresh - 1))) {
>  			/* Prefetch next mbuf */
> -			rte_prefetch0(rxep[i + 1].mbuf);
> +			rte_packet_prefetch(rxep[i + 1].mbuf->cacheline0);
> +			rte_packet_prefetch(rxep[i + 1].mbuf->cacheline1);

As far as I can see, the line above is the only real difference in that patch.
If so, it might be worth re-running the perf tests without that line?
Konstantin

> +		}
> 
>  		mb = rxep[i].mbuf;
>  		rte_mbuf_refcnt_set(mb, 1);
> @@ -752,7 +763,8 @@ i40e_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>  				I40E_RXD_QW1_LENGTH_PBUF_SHIFT) - rxq->crc_len;
> 
>  		rxm->data_off = RTE_PKTMBUF_HEADROOM;
> -		rte_prefetch0(RTE_PTR_ADD(rxm->buf_addr, RTE_PKTMBUF_HEADROOM));
> +		rte_packet_prefetch(RTE_PTR_ADD(rxm->buf_addr,
> +						RTE_PKTMBUF_HEADROOM));
>  		rxm->nb_segs = 1;
>  		rxm->next = NULL;
>  		rxm->pkt_len = rx_packet_len;
> @@ -939,7 +951,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
>  		first_seg->ol_flags |= pkt_flags;
> 
>  		/* Prefetch data of first segment, if configured to do so. */
> -		rte_prefetch0(RTE_PTR_ADD(first_seg->buf_addr,
> +		rte_packet_prefetch(RTE_PTR_ADD(first_seg->buf_addr,
>  			first_seg->data_off));
>  		rx_pkts[nb_rx++] = first_seg;
>  		first_seg = NULL;
> --
> 2.1.4
  
Vladyslav Buslov April 6, 2017, 9:29 a.m. UTC | #3
Ferruh,

In our case the patch significantly improves application performance (~40% more PPS on the load balancer core).
Using the DPDK examples, I can only reproduce the performance improvement with similarly designed apps like load_balancer.
With applications that send or free packets on the same core where they are received, performance is not improved.

It seems that this patch doesn't do anything (or even slightly degrades performance) for run-to-completion apps, so we will have to continue maintaining it as part of our internal branch.
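
Concretely, the pattern that benefits looks roughly like the sketch below (names are illustrative, not taken from the load_balancer example): the receiving core reads only the first cache line of each packet to pick a worker, so a payload prefetch issued earlier, while the PMD was still scanning its rings, has had time to complete before that read happens.

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32
#define NB_WORKERS 4

/* Receive a burst and pick a worker per packet from the first header
 * byte; handing the mbufs off to per-worker rings is left out. */
static uint16_t
classify_burst(uint16_t port, uint16_t queue, struct rte_mbuf **pkts,
	       uint8_t *worker_of)
{
	uint16_t i, nb;

	nb = rte_eth_rx_burst(port, queue, pkts, BURST_SIZE);
	for (i = 0; i < nb; i++) {
		/* First payload access; ideally already in cache thanks to
		 * the prefetch the PMD issued while filling its SW ring. */
		const uint8_t *hdr = rte_pktmbuf_mtod(pkts[i], const uint8_t *);

		worker_of[i] = hdr[0] % NB_WORKERS; /* stand-in for a real hash */
	}
	return nb;
}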

Regards,
Vlad

> -----Original Message-----
> From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
> Sent: Monday, April 03, 2017 1:31 PM
> To: Pei, Yulong; Vladyslav Buslov; Zhang, Helin; Wu, Jingjing
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] net/i40e: add packet prefetch
> 
> On 4/1/2017 3:01 AM, Pei, Yulong wrote:
> > Hi All
> >
> > In non-vector mode, without this patch, single-core performance can
> > reach 37.576 Mpps with 64-byte packets, but after applying this patch,
> > single-core performance drops to 34.343 Mpps with 64-byte packets.
> 
> Thanks Yulong for the test.
> 
> Vladyslav,
> 
> Do the above test results match your results?
> 
> If there is a degradation, this patch should not be merged.
> 
> Thanks,
> ferruh
> 
> >
> > Best Regards
> > Yulong Pei
> >
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladyslav Buslov
> > Sent: Wednesday, March 1, 2017 6:57 PM
> > To: Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing
> > <jingjing.wu@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>
> > Cc: dev@dpdk.org
> > Subject: [dpdk-dev] [PATCH] net/i40e: add packet prefetch
> >
> > Prefetch both cache lines of mbuf and first cache line of payload if
> CONFIG_RTE_PMD_PACKET_PREFETCH is set.
> >
> > Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
> <..>
  
Bruce Richardson April 6, 2017, 9:54 a.m. UTC | #4
On Mon, Apr 03, 2017 at 10:47:20AM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Pei, Yulong
> > Sent: Saturday, April 1, 2017 3:02 AM
> > To: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>; Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>;
> > Yigit, Ferruh <ferruh.yigit@intel.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add packet prefetch
> > 
> > Hi All
> > 
> > In non-vector mode, without this patch, single-core performance can reach 37.576 Mpps with 64-byte packets,
> > but after applying this patch, single-core performance drops to 34.343 Mpps with 64-byte packets.
> > 
> > Best Regards
> > Yulong Pei
> > 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladyslav Buslov
> > Sent: Wednesday, March 1, 2017 6:57 PM
> > To: Zhang, Helin <helin.zhang@intel.com>; Wu, Jingjing <jingjing.wu@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>
> > Cc: dev@dpdk.org
> > Subject: [dpdk-dev] [PATCH] net/i40e: add packet prefetch
> > 
> > Prefetch both cache lines of mbuf and first cache line of payload if CONFIG_RTE_PMD_PACKET_PREFETCH is set.
> > 
> > Signed-off-by: Vladyslav Buslov <vladyslav.buslov@harmonicinc.com>
> > ---
> >  drivers/net/i40e/i40e_rxtx.c | 20 ++++++++++++++++----
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> > index 48429cc..2b4e5c9 100644
> > --- a/drivers/net/i40e/i40e_rxtx.c
> > +++ b/drivers/net/i40e/i40e_rxtx.c
> > @@ -100,6 +100,12 @@
> >  #define I40E_TX_OFFLOAD_NOTSUP_MASK \
> >  		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
> > 
> > +#ifdef RTE_PMD_PACKET_PREFETCH
> > +#define rte_packet_prefetch(p)	rte_prefetch0(p)
> > +#else
> > +#define rte_packet_prefetch(p)	do {} while (0)
> > +#endif
> > +
> >  static uint16_t i40e_xmit_pkts_simple(void *tx_queue,
> >  				      struct rte_mbuf **tx_pkts,
> >  				      uint16_t nb_pkts);
> > @@ -495,6 +501,9 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> >  		/* Translate descriptor info to mbuf parameters */
> >  		for (j = 0; j < nb_dd; j++) {
> >  			mb = rxep[j].mbuf;
> > +			rte_packet_prefetch(
> > +				RTE_PTR_ADD(mb->buf_addr,
> > +						RTE_PKTMBUF_HEADROOM));
> >  			qword1 = rte_le_to_cpu_64(\
> >  				rxdp[j].wb.qword1.status_error_len);
> >  			pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > @@ -578,9 +587,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> > 
> >  	rxdp = &rxq->rx_ring[alloc_idx];
> >  	for (i = 0; i < rxq->rx_free_thresh; i++) {
> > -		if (likely(i < (rxq->rx_free_thresh - 1)))
> > +		if (likely(i < (rxq->rx_free_thresh - 1))) {
> >  			/* Prefetch next mbuf */
> > -			rte_prefetch0(rxep[i + 1].mbuf);
> > +			rte_packet_prefetch(rxep[i + 1].mbuf->cacheline0);
> > +			rte_packet_prefetch(rxep[i + 1].mbuf->cacheline1);
> 
> As far as I can see, the line above is the only real difference in that patch.
> If so, it might be worth re-running the perf tests without that line?
> Konstantin
>
The prefetch of buf_addr + headroom above also looks new.
Are both needed to get the performance boost you see?
We should also investigate whether the same effect can be achieved using a
runtime option rather than a compile-time setting. That would give us
the best of both worlds.

/Bruce
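
A runtime variant along these lines could look roughly like the sketch below (the prefetch_en field and the idea of wiring it to a devargs option are hypothetical, not something the i40e PMD provides): the flag is set once at queue setup, so when disabled the per-packet cost is a single well-predicted branch rather than a rebuild with a different config.

#include <stdint.h>
#include <rte_prefetch.h>

/* Hypothetical per-queue knob: in a real driver this would live in the RX
 * queue structure and be filled in from a devargs option at setup time. */
struct rxq_prefetch_opt {
	uint8_t prefetch_en;
};

static inline void
maybe_prefetch_payload(const struct rxq_prefetch_opt *opt, const void *addr)
{
	if (opt->prefetch_en)
		rte_prefetch0(addr);
}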
  
Bruce Richardson April 6, 2017, 9:54 a.m. UTC | #5
On Thu, Apr 06, 2017 at 09:29:12AM +0000, Vladyslav Buslov wrote:
> Ferruh,
> 
> In our case the patch significantly improves application performance (~40% more PPS on the load balancer core).
> Using the DPDK examples, I can only reproduce the performance improvement with similarly designed apps like load_balancer.
> With applications that send or free packets on the same core where they are received, performance is not improved.
> 
> It seems that this patch doesn't do anything (or even slightly degrades performance) for run-to-completion apps, so we will have to continue maintaining it as part of our internal branch.
> 
> Regards,
> Vlad
>
Is it possible to get the same performance boost in your app, while
avoiding the perf hit in run-to-completion apps, by using a runtime
rather than a compile-time option?

/Bruce
  

Patch

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 48429cc..2b4e5c9 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -100,6 +100,12 @@ 
 #define I40E_TX_OFFLOAD_NOTSUP_MASK \
 		(PKT_TX_OFFLOAD_MASK ^ I40E_TX_OFFLOAD_MASK)
 
+#ifdef RTE_PMD_PACKET_PREFETCH
+#define rte_packet_prefetch(p)	rte_prefetch0(p)
+#else
+#define rte_packet_prefetch(p)	do {} while (0)
+#endif
+
 static uint16_t i40e_xmit_pkts_simple(void *tx_queue,
 				      struct rte_mbuf **tx_pkts,
 				      uint16_t nb_pkts);
@@ -495,6 +501,9 @@  i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 		/* Translate descriptor info to mbuf parameters */
 		for (j = 0; j < nb_dd; j++) {
 			mb = rxep[j].mbuf;
+			rte_packet_prefetch(
+				RTE_PTR_ADD(mb->buf_addr,
+						RTE_PKTMBUF_HEADROOM));
 			qword1 = rte_le_to_cpu_64(\
 				rxdp[j].wb.qword1.status_error_len);
 			pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
@@ -578,9 +587,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)