[dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler

Jerin Jacob Kollanukkaran jerinj at marvell.com
Mon Jun 17 08:52:01 CEST 2019



> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>
> Sent: Monday, June 17, 2019 6:19 AM
> To: Jerin Jacob Kollanukkaran <jerinj at marvell.com>; dev at dpdk.org
> Cc: thomas at monjalon.net; Gavin Hu (Arm Technology China)
> <Gavin.Hu at arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli at arm.com>; nd <nd at arm.com>; nd <nd at arm.com>
> Subject: [EXT] RE: [dpdk-dev] [PATCH] acl: fix build issue with some arm64
> compiler
> 
> External Email
> 
> ----------------------------------------------------------------------
> > >
> > > Reduced the CC list (changing the topic slightly)
> > >
> > > > >
> > > > > My understanding is that the generated code for both your patch
> > > > > and my changes above is the same. Above suggested changes will
> > > > > conform to ACLE recommendation.
> > > >
> > > > Though instructions are different. Effective cycles are same even
> > > > though First dup updates the four positions.
> > > Can you elaborate on how the instructions are different?
> > > I wrote the following code with both the methods:
> > >
> > > uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t
> > > *p2, uint32_t *p3) {
> > >      uint32x4_t r = {*p0, *p1, *p2, *p3};
> > >
> > >      return r;
> > > }
> > >
> > > uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t
> > > *p2, uint32_t *p3) {
> > >      uint32x4_t r;
> > >
> > >      r = vdupq_n_u32 (* p0);
> > >      r = vsetq_lane_u32 (*p1, r, 1);
> > >      r = vsetq_lane_u32 (*p2, r, 2);
> > >      r = vsetq_lane_u32 (*p3, r, 3);
> > >
> > >      return r;
> > > }
> > >
> > > The generated code has the same instructions for both (omitted the
> > > unwanted
> > > parts):
> > >
> > > u32x4_gather_gcc:
> > >         ld1r    {v0.4s}, [x0]
> > >         ld1     {v0.s}[1], [x1]
> > >         ld1     {v0.s}[2], [x2]
> > >         ld1     {v0.s}[3], [x3]
> > >         ret
> > >
> > > u32x4_gather_acle:
> > >         ld1r    {v0.4s}, [x0]
> > >         ld1     {v0.s}[1], [x1]
> > >         ld1     {v0.s}[2], [x2]
> > >         ld1     {v0.s}[3], [x3]
> > >         ret
> > >
> > > The first 'ld1r' updates all the lanes in both the cases.
> >
> >
> > Please check actual generated code for ACL case. We can see difference
> I think there is something wrong with the way you are looking at the
> generated code. Please see comments below.

I am generating the dis assembly like below.
gdb -batch -ex 'file build/app/test ' -ex 'disassemble /rm search_neon_4'

You can try it out.

> 
> > > > To make forward progress send the v2 based on the updated logic
> > > > just to make ACLE  Spec happy, I don’t see any real reason to do
> > > > it though
> > > > 😊
> > > Thanks for the patch, it was important to make forward progress.
> > > But, I think we should carry forward the discussion as I plan to
> > > change other parts of DPDK on similar lines. I want to understand
> > > why you think there is no real reason. The ACLE recommendation
> > > mentions the
> > reasoning.
> >
> > # I see following in the ACLE spec. What is the actual reasoning?
> > "
> > ACLE does not define static construction of vector types. E.g.
> >  int32x4_t x = { 1, 2, 3, 4 };
> > Is not portable. Use the vcreate or vdup intrinsics to construct
> > values from scalars.
> > "
> Here is the complete text from ACLE 2.1
> 
> 12.2.6 Compatibility with other vector programming models Programmers
> should take particular care when combining the Neon Intrinsics API with
> alternative vector programming models; ACLE does not specify how the
> NEON Intrinsics API interoperates with them.
> For instance, the GCC vector extension permits include “arm_neon.h”
> ...
> uint32x2_t x = {0, 1}; // GCC extension.
> uint32_t y = vget_lane_s32 (x, 0); // ACLE NEON Intrinsic.
> But with this code the value stored in ‘y’ will depend on both the target
> architecture (AArch32 or AArch64) and whether the program is running in
> big- or little-endian mode.

I don’t have a big endian machine to test. I would be interesting to see 
The output in bigendian. 

> It is recommended that NEON Intrinsics be used consistently:
> include “arm_neon.h”
> ...
> const int temp[2] = {0, 1};
> uint32x2_t x = vld1_s32 (temp);
> uint32_t y = vget_lane_s32 (x, 0);
> 
> >
> > # Why does compiler(gcc) allows if it not indented to use?
> I do not have an answer. This is a recommendation and all that I am trying to
> say is, following the recommendation does not cost us anything in
> performance.

If there is no performance regression then no issue in changing to this format.

> 
> >
> > # I think, it may be time to introduce UndefinedBehaviorSanitizer
> > (UBSan) Gcc feature to DPDK to detect undefined behavior checks to
> > detect such case
> I am not sure if it helps here.
> 
> >
> > >
> >
> > > >
> > > > http://patches.dpdk.org/patch/54656/
> > > >


More information about the dev mailing list