[v1,3/3] net/i40e: auto-vectorization to speed up Tx free

Message ID: 20200306050427.66114-4-gavin.hu@arm.com
State: Superseded, archived
Delegated to: xiaolong ye
Series: i40e vPMD optimization on aarch64

Checks

Context               Check    Description
ci/checkpatch         success  coding style OK
ci/travis-robot       success  Travis build: passed
ci/Intel-compilation  fail     Compilation issues

Commit Message

Gavin Hu March 6, 2020, 5:04 a.m. UTC
Tx mbuf free is a hotspot for i40e on aarch64. As there are no
inter-loop dependencies, it is safe to enable auto-vectorization
to speed it up.

This patch showed a 2-3% performance gain on ThunderX2 and no degradation
on Arm N1SDP. The test case is a single-core RFC2544 zero-loss test.

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
---
 drivers/net/i40e/i40e_rxtx_vec_common.h | 5 +++++
 1 file changed, 5 insertions(+)
  

Comments

Jerin Jacob March 6, 2020, 7:44 a.m. UTC | #1
On Fri, Mar 6, 2020 at 10:35 AM Gavin Hu <gavin.hu@arm.com> wrote:
>
> Tx mbuf free is a hotspot for i40e on aarch64, as there are no
> inter-loop dependencies, it is safe to enable auto-vectorization
> to speed up.
>
> This patch showed 2~3% performance lift on ThunderX2 and no degradation
> on Arm N1SDP. The test case is single core RFC2544 zero-loss test.
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> ---
>  drivers/net/i40e/i40e_rxtx_vec_common.h | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
> index 0e6ffa007..fc0fa45d4 100644
> --- a/drivers/net/i40e/i40e_rxtx_vec_common.h
> +++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
> @@ -98,6 +98,11 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
>         if (likely(m != NULL)) {
>                 free[0] = m;
>                 nb_free = 1;
> +#if defined(__clang__)
> +#pragma clang loop vectorize(assume_safety)
> +#elif defined(__GNUC__)
> +#pragma GCC ivdep
> +#endif

IMO, it is better to abstract the compiler features (the pragma above
and __restrict__) as macros in rte_common.h or similar. That would help
support other compilers (ICC, or those used on Windows) and keep such
changes in one place.



>                 for (i = 1; i < n; i++) {
>                         m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
>                         if (likely(m != NULL)) {
> --
> 2.17.1
>
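
The suggestion above to wrap `__restrict__` in a macro can be sketched as follows. This is a minimal illustration, not code from the patch: the macro name `rte_restrict_sketch` and the helper `copy_u32` are hypothetical, and the compiler checks shown are one plausible arrangement rather than the project's actual choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical wrapper around the compiler-specific restrict keyword.
 * The name is invented for this sketch.
 */
#if defined(__GNUC__) || defined(__clang__)
#define rte_restrict_sketch __restrict__
#elif defined(_MSC_VER)
#define rte_restrict_sketch __restrict
#else
#define rte_restrict_sketch /* unknown compiler: no aliasing hint */
#endif

/*
 * Hypothetical helper: the qualifier promises the compiler that dst and
 * src do not overlap, so it may reorder or vectorize the copies.
 */
static void
copy_u32(uint32_t *rte_restrict_sketch dst,
	 const uint32_t *rte_restrict_sketch src, size_t n)
{
	size_t i;

	assert(dst != NULL && src != NULL);
	for (i = 0; i < n; i++)
		dst[i] = src[i];
}
```

Callers only ever see the macro, so supporting another compiler means touching a single `#if` chain instead of every driver that uses the qualifier.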
  
Thomas Monjalon March 6, 2020, 9:06 a.m. UTC | #2
06/03/2020 08:44, Jerin Jacob:
> On Fri, Mar 6, 2020 at 10:35 AM Gavin Hu <gavin.hu@arm.com> wrote:
> > --- a/drivers/net/i40e/i40e_rxtx_vec_common.h
> > +++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
> > @@ -98,6 +98,11 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
> >         if (likely(m != NULL)) {
> >                 free[0] = m;
> >                 nb_free = 1;
> > +#if defined(__clang__)
> > +#pragma clang loop vectorize(assume_safety)
> > +#elif defined(__GNUC__)
> > +#pragma GCC ivdep
> > +#endif
> 
> IMO, It is better to abstract the compiler features  (above compiler
> feature and __restrict__) as macros in
> rte_common.h or so. It will help to support other compilers(ICC or
> Windows) and enable them to have "changes" in one place.

I agree with the need for common abstraction.
  
Gavin Hu March 7, 2020, 3:03 p.m. UTC | #3
Hi Jerin,

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Friday, March 6, 2020 3:45 PM
> To: Gavin Hu <Gavin.Hu@arm.com>
> Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> <david.marchand@redhat.com>; thomas@monjalon.net;
> jerinj@marvell.com; Ye, Xiaolong <xiaolong.ye@intel.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v1 3/3] net/i40e: auto-vectorization to
> speed up Tx free
> 
> On Fri, Mar 6, 2020 at 10:35 AM Gavin Hu <gavin.hu@arm.com> wrote:
> >
> > Tx mbuf free is a hotspot for i40e on aarch64, as there are no
> > inter-loop dependencies, it is safe to enable auto-vectorization
> > to speed up.
> >
> > This patch showed 2~3% performance lift on ThunderX2 and no
> degradation
> > on Arm N1SDP. The test case is single core RFC2544 zero-loss test.
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > ---
> >  drivers/net/i40e/i40e_rxtx_vec_common.h | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h
> b/drivers/net/i40e/i40e_rxtx_vec_common.h
> > index 0e6ffa007..fc0fa45d4 100644
> > --- a/drivers/net/i40e/i40e_rxtx_vec_common.h
> > +++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
> > @@ -98,6 +98,11 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
> >         if (likely(m != NULL)) {
> >                 free[0] = m;
> >                 nb_free = 1;
> > +#if defined(__clang__)
> > +#pragma clang loop vectorize(assume_safety)
> > +#elif defined(__GNUC__)
> > +#pragma GCC ivdep
> > +#endif
> 
> IMO, It is better to abstract the compiler features  (above compiler
> feature and __restrict__) as macros in
> rte_common.h or so. It will help to support other compilers(ICC or
> Windows) and enable them to have "changes" in one place.

How about defining RTE_LOOP_AUTO_VECTORIZATION in rte_common.h?
#if defined(__clang__)
	define RTE_LOOP_AUTO_VECTORIZATION  \
		#pragma clang loop vectorize(assume_safety)
#elif defined(__GNUC__)
	define RTE_LOOP_AUTO_VECTORIZATION  \
		#pragma GCC ivdep
#else 
	define RTE_LOOP_AUTO_VECTORIZATION
#endif

If you agree, I will submit a v2. Thanks for your comments! 
/Gavin
> 
> 
> 
> >                 for (i = 1; i < n; i++) {
> >                         m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> >                         if (likely(m != NULL)) {
> > --
> > 2.17.1
> >
  
Jerin Jacob March 9, 2020, 7:35 a.m. UTC | #4
On Sat, Mar 7, 2020 at 8:34 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
>
> Hi Jerin,
>
> > On Fri, Mar 6, 2020 at 10:35 AM Gavin Hu <gavin.hu@arm.com> wrote:
> > >
> > > Tx mbuf free is a hotspot for i40e on aarch64, as there are no
> > > inter-loop dependencies, it is safe to enable auto-vectorization
> > > to speed up.
> > >
> > > This patch showed 2~3% performance lift on ThunderX2 and no
> > degradation
> > > on Arm N1SDP. The test case is single core RFC2544 zero-loss test.
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > ---
> > >  drivers/net/i40e/i40e_rxtx_vec_common.h | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h
> > b/drivers/net/i40e/i40e_rxtx_vec_common.h
> > > index 0e6ffa007..fc0fa45d4 100644
> > > --- a/drivers/net/i40e/i40e_rxtx_vec_common.h
> > > +++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
> > > @@ -98,6 +98,11 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
> > >         if (likely(m != NULL)) {
> > >                 free[0] = m;
> > >                 nb_free = 1;
> > > +#if defined(__clang__)
> > > +#pragma clang loop vectorize(assume_safety)
> > > +#elif defined(__GNUC__)
> > > +#pragma GCC ivdep
> > > +#endif
> >
> > IMO, It is better to abstract the compiler features  (above compiler
> > feature and __restrict__) as macros in
> > rte_common.h or so. It will help to support other compilers(ICC or
> > Windows) and enable them to have "changes" in one place.
>
> How about defining RTE_LOOP_AUTO_VECTORIZATION in the rte_common.h?

Other compiler attributes in rte_common.h start with a lowercase __rte
prefix (__rte_packed, __rte_unused, etc.).
I think a better name would be __rte_loop_auto_vectorize or similar.
No strong opinion on the name, though.

It is probably also worth testing on x86 and adding the performance
results to the git commit message, as this is common code.


> #if defined(__clang__)
>         define RTE_LOOP_AUTO_VECTORIZATION  \
>                 #pragma clang loop vectorize(assume_safety)
> #elif defined(__GNUC__)
>         define RTE_LOOP_AUTO_VECTORIZATION  \
>                 #pragma GCC ivdep
> #else
>         define RTE_LOOP_AUTO_VECTORIZATION
> #endif
> If you agree, I will submit a v2. Thanks for your comments!
> /Gavin
> >
> >
> >
> > >                 for (i = 1; i < n; i++) {
> > >                         m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> > >                         if (likely(m != NULL)) {
> > > --
> > > 2.17.1
> > >
  
Gavin Hu March 9, 2020, 9:23 a.m. UTC | #5
Hi Jerin,

> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Monday, March 9, 2020 3:36 PM
> To: Gavin Hu <Gavin.Hu@arm.com>
> Cc: dpdk-dev <dev@dpdk.org>; nd <nd@arm.com>; David Marchand
> <david.marchand@redhat.com>; thomas@monjalon.net; jerinj@marvell.com;
> Ye, Xiaolong <xiaolong.ye@intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v1 3/3] net/i40e: auto-vectorization to speed
> up Tx free
> 
> On Sat, Mar 7, 2020 at 8:34 PM Gavin Hu <Gavin.Hu@arm.com> wrote:
> >
> > Hi Jerin,
> >
> > > On Fri, Mar 6, 2020 at 10:35 AM Gavin Hu <gavin.hu@arm.com> wrote:
> > > >
> > > > Tx mbuf free is a hotspot for i40e on aarch64, as there are no
> > > > inter-loop dependencies, it is safe to enable auto-vectorization
> > > > to speed up.
> > > >
> > > > This patch showed 2~3% performance lift on ThunderX2 and no
> > > degradation
> > > > on Arm N1SDP. The test case is single core RFC2544 zero-loss test.
> > > >
> > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > ---
> > > >  drivers/net/i40e/i40e_rxtx_vec_common.h | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h
> > > b/drivers/net/i40e/i40e_rxtx_vec_common.h
> > > > index 0e6ffa007..fc0fa45d4 100644
> > > > --- a/drivers/net/i40e/i40e_rxtx_vec_common.h
> > > > +++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
> > > > @@ -98,6 +98,11 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
> > > >         if (likely(m != NULL)) {
> > > >                 free[0] = m;
> > > >                 nb_free = 1;
> > > > +#if defined(__clang__)
> > > > +#pragma clang loop vectorize(assume_safety)
> > > > +#elif defined(__GNUC__)
> > > > +#pragma GCC ivdep
> > > > +#endif
> > >
> > > IMO, It is better to abstract the compiler features  (above compiler
> > > feature and __restrict__) as macros in
> > > rte_common.h or so. It will help to support other compilers(ICC or
> > > Windows) and enable them to have "changes" in one place.
> >
> > How about defining RTE_LOOP_AUTO_VECTORIZATION in the
> rte_common.h?
> 
> Other compiler stuff in rte_common.h are starting with __rte in small
> letter(__rte_packed, __rte_unused) etc.
> I think, a better name would be __rte_loop_auto_vectorize or so.
> No strong opinion for the name though.
> 
> # Probably it is worth checking and add performance result of x86
> testing in git commit as well as it
> is common code.
Okay, I will do it. 
> 
> 
> > #if defined(__clang__)
> >         define RTE_LOOP_AUTO_VECTORIZATION  \
> >                 #pragma clang loop vectorize(assume_safety)
> > #elif defined(__GNUC__)
> >         define RTE_LOOP_AUTO_VECTORIZATION  \
> >                 #pragma GCC ivdep
> > #else
> >         define RTE_LOOP_AUTO_VECTORIZATION
> > #endif
> > If you agree, I will submit a v2. Thanks for your comments!
> > /Gavin
> > >
> > >
> > >
> > > >                 for (i = 1; i < n; i++) {
> > > >                         m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
> > > >                         if (likely(m != NULL)) {
> > > > --
> > > > 2.17.1
> > > >
  

Patch

diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index 0e6ffa007..fc0fa45d4 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -98,6 +98,11 @@  i40e_tx_free_bufs(struct i40e_tx_queue *txq)
 	if (likely(m != NULL)) {
 		free[0] = m;
 		nb_free = 1;
+#if defined(__clang__)
+#pragma clang loop vectorize(assume_safety)
+#elif defined(__GNUC__)
+#pragma GCC ivdep
+#endif
 		for (i = 1; i < n; i++) {
 			m = rte_pktmbuf_prefree_seg(txep[i].mbuf);
 			if (likely(m != NULL)) {