[dpdk-dev] examples/l3fwd: fix aliasing in port grouping

Message ID 20171102143114.24380-1-gprathyusha@caviumnetworks.com (mailing list archive)
State Superseded, archived
Headers

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Guduri Prathyusha Nov. 2, 2017, 2:31 p.m. UTC
  With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
pointers pointing to same chunk of memory and with -f-strict-aliasing the
pointers are assumed to be pointing to different memory and compiler
reorders instructions that depend on pnum and pn. This breaks port
grouping algorithm.

This patch eliminates the usage of union and uses memcpy for copying
gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
not call its library implementation but uses appropriate LD and ST
instructions directly and hence no performance overhead.

Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
---
 examples/l3fwd/l3fwd_neon.h | 11 +++--------
 examples/l3fwd/l3fwd_sse.h  | 11 +++--------
 2 files changed, 6 insertions(+), 16 deletions(-)

--
2.14.1
  

Comments

Ananyev, Konstantin Nov. 2, 2017, 2:46 p.m. UTC | #1
Hi,

> -----Original Message-----
> From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> Sent: Thursday, November 2, 2017 2:31 PM
> To: Kantecki, Tomasz <tomasz.kantecki@intel.com>
> Cc: Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Guduri
> Prathyusha <gprathyusha@caviumnetworks.com>
> Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> 
> With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
> undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> pointers pointing to same chunk of memory and with -f-strict-aliasing the
> pointers are assumed to be pointing to different memory and compiler
> reorders instructions that depend on pnum and pn. This breaks port
> grouping algorithm.
> 
> This patch eliminates the usage of union and uses memcpy for copying
> gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> not call its library implementation but uses appropriate LD and ST
> instructions directly and hence no performance overhead.
> 
> Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
> ---
>  examples/l3fwd/l3fwd_neon.h | 11 +++--------
>  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
>  2 files changed, 6 insertions(+), 16 deletions(-)
> 
> diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> index 4bc161394..10a602a04 100644
> --- a/examples/l3fwd/l3fwd_neon.h
> +++ b/examples/l3fwd/l3fwd_neon.h
> @@ -100,11 +100,6 @@ static inline uint16_t *
>  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
>  	     uint16x8_t dp2)
>  {
> -	union {
> -		uint16_t u16[FWDSTEP + 1];
> -		uint64_t u64;
> -	} *pnum = (void *)pn;
> -
>  	int32_t v;
>  	uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> 
> @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> 
>  	/* if dest port value has changed. */
>  	if (v != GRPMSK) {
> -		pnum->u64 = gptbl[v].pnum;
> -		pnum->u16[FWDSTEP] = 1;
> -		lp = pnum->u16 + gptbl[v].idx;
> +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> +		pn[FWDSTEP] = 1;
> +		lp = pn + gptbl[v].idx;
>  	}
> 
>  	return lp;
> diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> index 831760f02..79a71d77e 100644
> --- a/examples/l3fwd/l3fwd_sse.h
> +++ b/examples/l3fwd/l3fwd_sse.h
> @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
>  static inline uint16_t *
>  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
>  {
> -	union {
> -		uint16_t u16[FWDSTEP + 1];
> -		uint64_t u64;
> -	} *pnum = (void *)pn;
> -
>  	int32_t v;
> 
>  	dp1 = _mm_cmpeq_epi16(dp1, dp2);
> @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> 
>  	/* if dest port value has changed. */
>  	if (v != GRPMSK) {
> -		pnum->u64 = gptbl[v].pnum;
> -		pnum->u16[FWDSTEP] = 1;
> -		lp = pnum->u16 + gptbl[v].idx;
> +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> +		pn[FWDSTEP] = 1;
> +		lp = pn + gptbl[v].idx;

Could you explain a bit more here - which exactly instructions were reordered
and what kind of problems did it cause?
Specially on IA?
In any case I don't think using rte_memcpy is a good thing to use here:
it is a huge inline function - way too much to copy just 64 bit variable.
Konstantin

>  	}
> 
>  	return lp;
> --
> 2.14.1
  
Guduri Prathyusha Nov. 2, 2017, 3:33 p.m. UTC | #2
On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> Hi,
Hi
>
> > -----Original Message-----
> > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > Sent: Thursday, November 2, 2017 2:31 PM
> > To: Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > Cc: Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Guduri
> > Prathyusha <gprathyusha@caviumnetworks.com>
> > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> >
> > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
> > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > pointers pointing to same chunk of memory and with -f-strict-aliasing the
> > pointers are assumed to be pointing to different memory and compiler
> > reorders instructions that depend on pnum and pn. This breaks port
> > grouping algorithm.
> >
> > This patch eliminates the usage of union and uses memcpy for copying
> > gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> > not call its library implementation but uses appropriate LD and ST
> > instructions directly and hence no performance overhead.
> >
> > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
> > ---
> >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> >  2 files changed, 6 insertions(+), 16 deletions(-)
> >
> > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > index 4bc161394..10a602a04 100644
> > --- a/examples/l3fwd/l3fwd_neon.h
> > +++ b/examples/l3fwd/l3fwd_neon.h
> > @@ -100,11 +100,6 @@ static inline uint16_t *
> >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> >  	     uint16x8_t dp2)
> >  {
> > -	union {
> > -		uint16_t u16[FWDSTEP + 1];
> > -		uint64_t u64;
> > -	} *pnum = (void *)pn;
> > -
> >  	int32_t v;
> >  	uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> >
> > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> >
> >  	/* if dest port value has changed. */
> >  	if (v != GRPMSK) {
> > -		pnum->u64 = gptbl[v].pnum;
> > -		pnum->u16[FWDSTEP] = 1;
> > -		lp = pnum->u16 + gptbl[v].idx;
> > +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > +		pn[FWDSTEP] = 1;
> > +		lp = pn + gptbl[v].idx;
> >  	}
> >
> >  	return lp;
> > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > index 831760f02..79a71d77e 100644
> > --- a/examples/l3fwd/l3fwd_sse.h
> > +++ b/examples/l3fwd/l3fwd_sse.h
> > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
> >  static inline uint16_t *
> >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> >  {
> > -	union {
> > -		uint16_t u16[FWDSTEP + 1];
> > -		uint64_t u64;
> > -	} *pnum = (void *)pn;
> > -
> >  	int32_t v;
> >
> >  	dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> >
> >  	/* if dest port value has changed. */
> >  	if (v != GRPMSK) {
> > -		pnum->u64 = gptbl[v].pnum;
> > -		pnum->u16[FWDSTEP] = 1;
> > -		lp = pnum->u16 + gptbl[v].idx;
> > +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > +		pn[FWDSTEP] = 1;
> > +		lp = pn + gptbl[v].idx;
>
> Could you explain a bit more here - which exactly instructions were reordered
> and what kind of problems did it cause?
> Specially on IA?

This issue is observed on ARM since ARM gcc is more aggressive in
reordering than x86 gcc. In ARM when v != GRPMSK, the following
instructions ordering is not guarenteed because of strict aliasing.

lp[0] += gptbl[v].lpv;
pnum->u64 = gptbl[v].pnum;
pnum->u16[FWDSTEP] = 1;
lp = pnum->u16 + gptbl[v].idx;

That results in wrong lp[0] updation.
memcpy in this case will avoid this problem.

> In any case I don't think using rte_memcpy is a good thing to use here:
> it is a huge inline function - way too much to copy just 64 bit variable.

I agree that rte_memcpy is overhead in this case but how about using
memcpy that will not use library implementation if the size is constant.
memcpy with constant size uses built_in_memcpy that does not add
performance overhead.

Thoughts?

> Konstantin
>
> >  	}
> >
> >  	return lp;
> > --
> > 2.14.1
>
  
Ananyev, Konstantin Nov. 2, 2017, 3:52 p.m. UTC | #3
> -----Original Message-----
> From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> Sent: Thursday, November 2, 2017 3:34 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: dev@dpdk.org; Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Kantecki, Tomasz <tomasz.kantecki@intel.com>
> Subject: Re: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> 
> On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> > Hi,
> Hi
> >
> > > -----Original Message-----
> > > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > > Sent: Thursday, November 2, 2017 2:31 PM
> > > To: Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > > Cc: Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Guduri
> > > Prathyusha <gprathyusha@caviumnetworks.com>
> > > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > >
> > > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
> > > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > > pointers pointing to same chunk of memory and with -f-strict-aliasing the
> > > pointers are assumed to be pointing to different memory and compiler
> > > reorders instructions that depend on pnum and pn. This breaks port
> > > grouping algorithm.
> > >
> > > This patch eliminates the usage of union and uses memcpy for copying
> > > gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> > > not call its library implementation but uses appropriate LD and ST
> > > instructions directly and hence no performance overhead.
> > >
> > > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > > Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
> > > ---
> > >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> > >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> > >  2 files changed, 6 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > > index 4bc161394..10a602a04 100644
> > > --- a/examples/l3fwd/l3fwd_neon.h
> > > +++ b/examples/l3fwd/l3fwd_neon.h
> > > @@ -100,11 +100,6 @@ static inline uint16_t *
> > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > >  	     uint16x8_t dp2)
> > >  {
> > > -	union {
> > > -		uint16_t u16[FWDSTEP + 1];
> > > -		uint64_t u64;
> > > -	} *pnum = (void *)pn;
> > > -
> > >  	int32_t v;
> > >  	uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> > >
> > > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > >
> > >  	/* if dest port value has changed. */
> > >  	if (v != GRPMSK) {
> > > -		pnum->u64 = gptbl[v].pnum;
> > > -		pnum->u16[FWDSTEP] = 1;
> > > -		lp = pnum->u16 + gptbl[v].idx;
> > > +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > +		pn[FWDSTEP] = 1;
> > > +		lp = pn + gptbl[v].idx;
> > >  	}
> > >
> > >  	return lp;
> > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > index 831760f02..79a71d77e 100644
> > > --- a/examples/l3fwd/l3fwd_sse.h
> > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
> > >  static inline uint16_t *
> > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > >  {
> > > -	union {
> > > -		uint16_t u16[FWDSTEP + 1];
> > > -		uint64_t u64;
> > > -	} *pnum = (void *)pn;
> > > -
> > >  	int32_t v;
> > >
> > >  	dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > >
> > >  	/* if dest port value has changed. */
> > >  	if (v != GRPMSK) {
> > > -		pnum->u64 = gptbl[v].pnum;
> > > -		pnum->u16[FWDSTEP] = 1;
> > > -		lp = pnum->u16 + gptbl[v].idx;
> > > +		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > +		pn[FWDSTEP] = 1;
> > > +		lp = pn + gptbl[v].idx;
> >
> > Could you explain a bit more here - which exactly instructions were reordered
> > and what kind of problems did it cause?
> > Specially on IA?
> 
> This issue is observed on ARM since ARM gcc is more aggressive in
> reordering than x86 gcc.

Ok, then if x86 is not affected why to modify l3fwd_sse.h at all?
Unless there is a reproducible problem with x86 -
my preference would be to keep that file intact.

> In ARM when v != GRPMSK, the following
> instructions ordering is not guarenteed because of strict aliasing.
> 
> lp[0] += gptbl[v].lpv;
> pnum->u64 = gptbl[v].pnum;
> pnum->u16[FWDSTEP] = 1;
> lp = pnum->u16 + gptbl[v].idx;

Ok, so what in particular is reordered by the compiler:

 lp[0] += gptbl[v].lpv; (1)
 pnum->u64 = gptbl[v].pnum; (2)
 pnum->u16[FWDSTEP] = 1;   (3)
 lp = pnum->u16 + gptbl[v].idx; (4)

(2) and (3)?
If so I am not sure how it could be a problem:
they do stores to the different locations.
(1) and (4) as I can see shouldn't be reordered.
Anyway - if you think this a compiler reordering issue,
then adding rte_compiler_barrier() should fix the issue, right?

> 
> That results in wrong lp[0] updation.
> memcpy in this case will avoid this problem.
> 
> > In any case I don't think using rte_memcpy is a good thing to use here:
> > it is a huge inline function - way too much to copy just 64 bit variable.
> 
> I agree that rte_memcpy is overhead in this case but how about using
> memcpy that will not use library implementation if the size is constant.
> memcpy with constant size uses built_in_memcpy that does not add
> performance overhead.

On x86 rte_memcpy() doesn't call libc memcpy() at all - it is a separate function:
ib/librte_eal/common/include/arch/x86/rte_memcpy.h

> 
> Thoughts?

As I said - if x86 is  not affected - please keep l3fwd_sse.h intact.
If it does (still not sure how) - check would compiler barrier help here.
Konstantin
  
Guduri Prathyusha Nov. 2, 2017, 5:38 p.m. UTC | #4
Hi,

Some issue with my mutt command line and hence apologies for unpleasant formatting in the mail. Please see inline
  
Jianbo Liu Nov. 3, 2017, 3:21 a.m. UTC | #5
The 11/02/2017 15:52, Ananyev, Konstantin wrote:
>
>
> > -----Original Message-----
> > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > Sent: Thursday, November 2, 2017 3:34 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: dev@dpdk.org; Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> >
> > On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> > > Hi,
> > Hi
> > >
> > > > -----Original Message-----
> > > > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > > > Sent: Thursday, November 2, 2017 2:31 PM
> > > > To: Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > > > Cc: Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Guduri
> > > > Prathyusha <gprathyusha@caviumnetworks.com>
> > > > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > > >
> > > > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives

May I ask the detail version about the gcc you are using?

> > > > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > > > pointers pointing to same chunk of memory and with -f-strict-aliasing the
> > > > pointers are assumed to be pointing to different memory and compiler
> > > > reorders instructions that depend on pnum and pn. This breaks port
> > > > grouping algorithm.
> > > >
> > > > This patch eliminates the usage of union and uses memcpy for copying
> > > > gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> > > > not call its library implementation but uses appropriate LD and ST
> > > > instructions directly and hence no performance overhead.
> > > >
> > > > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > > > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > > > Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
> > > > ---
> > > >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> > > >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> > > >  2 files changed, 6 insertions(+), 16 deletions(-)
> > > >
> > > > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > > > index 4bc161394..10a602a04 100644
> > > > --- a/examples/l3fwd/l3fwd_neon.h
> > > > +++ b/examples/l3fwd/l3fwd_neon.h
> > > > @@ -100,11 +100,6 @@ static inline uint16_t *
> > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > >              uint16x8_t dp2)
> > > >  {
> > > > -       union {
> > > > -               uint16_t u16[FWDSTEP + 1];
> > > > -               uint64_t u64;
> > > > -       } *pnum = (void *)pn;
> > > > -
> > > >         int32_t v;
> > > >         uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> > > >
> > > > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > >
> > > >         /* if dest port value has changed. */
> > > >         if (v != GRPMSK) {
> > > > -               pnum->u64 = gptbl[v].pnum;
> > > > -               pnum->u16[FWDSTEP] = 1;
> > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > +               pn[FWDSTEP] = 1;
> > > > +               lp = pn + gptbl[v].idx;
> > > >         }
> > > >
> > > >         return lp;
> > > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > > index 831760f02..79a71d77e 100644
> > > > --- a/examples/l3fwd/l3fwd_sse.h
> > > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
> > > >  static inline uint16_t *
> > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > >  {
> > > > -       union {
> > > > -               uint16_t u16[FWDSTEP + 1];
> > > > -               uint64_t u64;
> > > > -       } *pnum = (void *)pn;
> > > > -
> > > >         int32_t v;
> > > >
> > > >         dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > > > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > >
> > > >         /* if dest port value has changed. */
> > > >         if (v != GRPMSK) {
> > > > -               pnum->u64 = gptbl[v].pnum;
> > > > -               pnum->u16[FWDSTEP] = 1;
> > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > +               pn[FWDSTEP] = 1;
> > > > +               lp = pn + gptbl[v].idx;
> > >
> > > Could you explain a bit more here - which exactly instructions were reordered
> > > and what kind of problems did it cause?
> > > Specially on IA?
> >
> > This issue is observed on ARM since ARM gcc is more aggressive in
> > reordering than x86 gcc.
>
> Ok, then if x86 is not affected why to modify l3fwd_sse.h at all?
> Unless there is a reproducible problem with x86 -
> my preference would be to keep that file intact.
>
> > In ARM when v != GRPMSK, the following
> > instructions ordering is not guarenteed because of strict aliasing.
> >
> > lp[0] += gptbl[v].lpv;
> > pnum->u64 = gptbl[v].pnum;
> > pnum->u16[FWDSTEP] = 1;
> > lp = pnum->u16 + gptbl[v].idx;
>
> Ok, so what in particular is reordered by the compiler:
>
>  lp[0] += gptbl[v].lpv; (1)
>  pnum->u64 = gptbl[v].pnum; (2)
>  pnum->u16[FWDSTEP] = 1;   (3)
>  lp = pnum->u16 + gptbl[v].idx; (4)
>
> (2) and (3)?
> If so I am not sure how it could be a problem:
> they do stores to the different locations.
> (1) and (4) as I can see shouldn't be reordered.
> Anyway - if you think this a compiler reordering issue,
> then adding rte_compiler_barrier() should fix the issue, right?

Agree.

>
> >
> > That results in wrong lp[0] updation.
> > memcpy in this case will avoid this problem.
> >
> > > In any case I don't think using rte_memcpy is a good thing to use here:
> > > it is a huge inline function - way too much to copy just 64 bit variable.
> >
> > I agree that rte_memcpy is overhead in this case but how about using
> > memcpy that will not use library implementation if the size is constant.
> > memcpy with constant size uses built_in_memcpy that does not add
> > performance overhead.
>
> On x86 rte_memcpy() doesn't call libc memcpy() at all - it is a separate function:
> ib/librte_eal/common/include/arch/x86/rte_memcpy.h
>
> >
> > Thoughts?
>
> As I said - if x86 is  not affected - please keep l3fwd_sse.h intact.
> If it does (still not sure how) - check would compiler barrier help here.
> Konstantin
>

--
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
  
Guduri Prathyusha Nov. 3, 2017, 5:42 a.m. UTC | #6
On Fri, Nov 03, 2017 at 11:21:43AM +0800, Jianbo.Liu@arm.com wrote:
> The 11/02/2017 15:52, Ananyev, Konstantin wrote:
> >
> >
> > > -----Original Message-----
> > > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > > Sent: Thursday, November 2, 2017 3:34 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > Cc: dev@dpdk.org; Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > >
> > > On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> > > > Hi,
> > > Hi
> > > >
> > > > > -----Original Message-----
> > > > > From: Guduri Prathyusha [mailto:gprathyusha@caviumnetworks.com]
> > > > > Sent: Thursday, November 2, 2017 2:31 PM
> > > > > To: Kantecki, Tomasz <tomasz.kantecki@intel.com>
> > > > > Cc: Jianbo.Liu@arm.com; guduriprathyusha@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Guduri
> > > > > Prathyusha <gprathyusha@caviumnetworks.com>
> > > > > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> > > > >
> > > > > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
>
> May I ask the detail version about the gcc you are using?
>
Am using gcc provided by ubuntu rootfs

root@localhost:~/dpdk# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/5/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
5.4.0-6ubuntu1~16.04.4'
--with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++
--prefix=/usr --program-suffix=-5 --enable-shared
--enable-linker-build-id --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --libdir=/usr/lib
--enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --enable-plugin --with-system-zlib
--disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-arm64/jre
--enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-arm64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-arm64
--with-arch-directory=aarch64
--with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror
--enable-checking=release --build=aarch64-linux-gnu
--host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4)

> > > > > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > > > > pointers pointing to same chunk of memory and with -f-strict-aliasing the
> > > > > pointers are assumed to be pointing to different memory and compiler
> > > > > reorders instructions that depend on pnum and pn. This breaks port
> > > > > grouping algorithm.
> > > > >
> > > > > This patch eliminates the usage of union and uses memcpy for copying
> > > > > gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> > > > > not call its library implementation but uses appropriate LD and ST
> > > > > instructions directly and hence no performance overhead.
> > > > >
> > > > > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > > > > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > > > > Signed-off-by: Guduri Prathyusha <gprathyusha@caviumnetworks.com>
> > > > > ---
> > > > >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> > > > >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> > > > >  2 files changed, 6 insertions(+), 16 deletions(-)
> > > > >
> > > > > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > > > > index 4bc161394..10a602a04 100644
> > > > > --- a/examples/l3fwd/l3fwd_neon.h
> > > > > +++ b/examples/l3fwd/l3fwd_neon.h
> > > > > @@ -100,11 +100,6 @@ static inline uint16_t *
> > > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > > >              uint16x8_t dp2)
> > > > >  {
> > > > > -       union {
> > > > > -               uint16_t u16[FWDSTEP + 1];
> > > > > -               uint64_t u64;
> > > > > -       } *pnum = (void *)pn;
> > > > > -
> > > > >         int32_t v;
> > > > >         uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> > > > >
> > > > > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > > >
> > > > >         /* if dest port value has changed. */
> > > > >         if (v != GRPMSK) {
> > > > > -               pnum->u64 = gptbl[v].pnum;
> > > > > -               pnum->u16[FWDSTEP] = 1;
> > > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > > +               pn[FWDSTEP] = 1;
> > > > > +               lp = pn + gptbl[v].idx;
> > > > >         }
> > > > >
> > > > >         return lp;
> > > > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > > > index 831760f02..79a71d77e 100644
> > > > > --- a/examples/l3fwd/l3fwd_sse.h
> > > > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > > > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
> > > > >  static inline uint16_t *
> > > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > > >  {
> > > > > -       union {
> > > > > -               uint16_t u16[FWDSTEP + 1];
> > > > > -               uint64_t u64;
> > > > > -       } *pnum = (void *)pn;
> > > > > -
> > > > >         int32_t v;
> > > > >
> > > > >         dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > > > > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
> > > > >
> > > > >         /* if dest port value has changed. */
> > > > >         if (v != GRPMSK) {
> > > > > -               pnum->u64 = gptbl[v].pnum;
> > > > > -               pnum->u16[FWDSTEP] = 1;
> > > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > > +               pn[FWDSTEP] = 1;
> > > > > +               lp = pn + gptbl[v].idx;
> > > >
> > > > Could you explain a bit more here - which exactly instructions were reordered
> > > > and what kind of problems did it cause?
> > > > Specially on IA?
> > >
> > > This issue is observed on ARM since ARM gcc is more aggressive in
> > > reordering than x86 gcc.
> >
> > Ok, then if x86 is not affected why to modify l3fwd_sse.h at all?
> > Unless there is a reproducible problem with x86 -
> > my preference would be to keep that file intact.
> >
> > > In ARM when v != GRPMSK, the following
> > > instructions ordering is not guarenteed because of strict aliasing.
> > >
> > > lp[0] += gptbl[v].lpv;
> > > pnum->u64 = gptbl[v].pnum;
> > > pnum->u16[FWDSTEP] = 1;
> > > lp = pnum->u16 + gptbl[v].idx;
> >
> > Ok, so what in particular is reordered by the compiler:
> >
> >  lp[0] += gptbl[v].lpv; (1)
> >  pnum->u64 = gptbl[v].pnum; (2)
> >  pnum->u16[FWDSTEP] = 1;   (3)
> >  lp = pnum->u16 + gptbl[v].idx; (4)
> >
> > (2) and (3)?
> > If so I am not sure how it could be a problem:
> > they do stores to the different locations.
> > (1) and (4) as I can see shouldn't be reordered.
> > Anyway - if you think this a compiler reordering issue,
> > then adding rte_compiler_barrier() should fix the issue, right?
>
> Agree.
>
Yes, rte_compiler_barrier() solves the issue. I will spin a V2 if you
are ok with it.
> >
> > >
> > > That results in wrong lp[0] updation.
> > > memcpy in this case will avoid this problem.
> > >
> > > > In any case I don't think using rte_memcpy is a good thing to use here:
> > > > it is a huge inline function - way too much to copy just 64 bit variable.
> > >
> > > I agree that rte_memcpy is overhead in this case but how about using
> > > memcpy that will not use library implementation if the size is constant.
> > > memcpy with constant size uses built_in_memcpy that does not add
> > > performance overhead.
> >
> > On x86 rte_memcpy() doesn't call libc memcpy() at all - it is a separate function:
> > ib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >
> > >
> > > Thoughts?
> >
> > As I said - if x86 is  not affected - please keep l3fwd_sse.h intact.
> > If it does (still not sure how) - check would compiler barrier help here.
> > Konstantin
> >
>
> --
> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
  

Patch

diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
index 4bc161394..10a602a04 100644
--- a/examples/l3fwd/l3fwd_neon.h
+++ b/examples/l3fwd/l3fwd_neon.h
@@ -100,11 +100,6 @@  static inline uint16_t *
 port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
 	     uint16x8_t dp2)
 {
-	union {
-		uint16_t u16[FWDSTEP + 1];
-		uint64_t u64;
-	} *pnum = (void *)pn;
-
 	int32_t v;
 	uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};

@@ -117,9 +112,9 @@  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,

 	/* if dest port value has changed. */
 	if (v != GRPMSK) {
-		pnum->u64 = gptbl[v].pnum;
-		pnum->u16[FWDSTEP] = 1;
-		lp = pnum->u16 + gptbl[v].idx;
+		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
+		pn[FWDSTEP] = 1;
+		lp = pn + gptbl[v].idx;
 	}

 	return lp;
diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
index 831760f02..79a71d77e 100644
--- a/examples/l3fwd/l3fwd_sse.h
+++ b/examples/l3fwd/l3fwd_sse.h
@@ -98,11 +98,6 @@  processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
 static inline uint16_t *
 port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)
 {
-	union {
-		uint16_t u16[FWDSTEP + 1];
-		uint64_t u64;
-	} *pnum = (void *)pn;
-
 	int32_t v;

 	dp1 = _mm_cmpeq_epi16(dp1, dp2);
@@ -114,9 +109,9 @@  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i dp2)

 	/* if dest port value has changed. */
 	if (v != GRPMSK) {
-		pnum->u64 = gptbl[v].pnum;
-		pnum->u16[FWDSTEP] = 1;
-		lp = pnum->u16 + gptbl[v].idx;
+		rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
+		pn[FWDSTEP] = 1;
+		lp = pn + gptbl[v].idx;
 	}

 	return lp;