[v11,1/3] eal/arm64: add 128-bit atomic compare exchange

Message ID 1571397690-14116-1-git-send-email-phil.yang@arm.com (mailing list archive)
State Accepted, archived
Delegated to: David Marchand

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/iol-intel-Performance success Performance Testing PASS
ci/Intel-compilation success Compilation OK
ci/iol-compilation success Compile Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS

Commit Message

Phil Yang Oct. 18, 2019, 11:21 a.m. UTC
This patch adds the implementation of the 128-bit atomic compare
exchange API on AArch64. The operation can be performed with the
64-bit 'ldxp/stxp' instructions. Moreover, on platforms with the LSE
atomic extension, it is implemented with 'casp' instructions for
better performance.

Since the '__ARM_FEATURE_ATOMICS' flag is only supported from GCC-9,
this patch adds a new config flag 'RTE_ARM_FEATURE_ATOMICS' to enable
the 'cas' version on older compilers.

The code uses the x0 register directly, so if cas_op_name() and
rte_atomic128_cmp_exchange() are both inlined, the x0 register may be
corrupted depending on the load of the parent function, which breaks
the arm64 ABI. Define the CAS operations as rte_noinline functions to
avoid the ABI break[1].

[1]5b40ec6b9662 ("mempool/octeontx2: fix possible arm64 ABI break").
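
A minimal sketch of the out-of-line helper shape described above, assuming a GCC-compatible compiler. `pair128_t`, `SKETCH_NOINLINE` and `sketch_cas_helper` are hypothetical names, and the body is a plain, non-atomic stand-in for the `casp` sequence so that it builds on any architecture:

```c
#include <stdint.h>

/* Hypothetical stand-in for rte_int128_t; not the DPDK definition. */
typedef struct {
	uint64_t val[2];
} pair128_t;

/* GCC/Clang attribute that DPDK wraps as __rte_noinline. */
#define SKETCH_NOINLINE __attribute__((noinline))

/*
 * Non-atomic placeholder for one CAS helper. In the patch the body pins
 * its operands to x0-x3. Keeping the function out of line means those
 * registers are only touched across a real call boundary, where AAPCS64
 * already treats x0-x7 as argument/result registers that the caller
 * does not expect to survive.
 */
static SKETCH_NOINLINE pair128_t
sketch_cas_helper(pair128_t *dst, pair128_t old, pair128_t updated)
{
	pair128_t seen = *dst;

	/* Store the new pair only when both halves match the expected pair. */
	if (seen.val[0] == old.val[0] && seen.val[1] == old.val[1])
		*dst = updated;

	/* Like the helpers in the patch, always return the value observed. */
	return seen;
}
```

The helper returns the observed value rather than a success flag, which mirrors how the patch lets the caller compare `old` with `expected` afterwards.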

Suggested-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Tested-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Jerin Jacob <jerinj@marvell.com>
---
v11:
1. Renamed cas_op_name by adding the data width tag 128.
2. Replaced __ldx/__stx inline functions with macros.
3. Explained in the commit log why the CAS operations are defined as
non-inline functions.

v10:
1. Removed all the rte tags for internal functions.
2. Removed the __MO_LOAD and _MO_STORE macros and kept the __HAS_ACQ
and __HAS_REL definitions under the non-LSE conditional branch.
3. Undefined each macro once it is no longer used.
4. Reworded the commit logs of the 1/3 and 2/3 patches to be more specific.

v9:
Updated 19.11 release note.

v8:
Fixed "WARNING:LONG_LINE: line over 80 characters" warnings with latest kernel
checkpatch.pl

v7:
1. Adjust code comment.

v6:
1. Put the RTE_ARM_FEATURE_ATOMICS flag into EAL group. (Jerin Jacob)
2. Keep rte_stack_lf_stubs.h doing nothing. (Gage Eads)
3. Fixed 32 bit build issue.

v5:
1. Enable RTE_ARM_FEATURE_ATOMICS on octeontx2 by default. (Jerin Jacob)
2. Record the reason for introducing "rte_stack_lf_stubs.h" in the git
commit. (Jerin Jacob)
3. Fixed a conditional MACRO error in rte_atomic128_cmp_exchange. (Jerin
Jacob)

v4:
1. Add RTE_ARM_FEATURE_ATOMICS flag to support LSE CASP instructions.
(Jerin Jacob)
2. Fix possible arm64 ABI break by making casp_op_name noinline. (Jerin
Jacob)
3. Add rte_stack_lf_stubs.h to reduce the ifdef clutter. (Gage
Eads/Jerin Jacob)

v3:
1. Avoid code duplication with a macro. (Jerin Jacob)
2. Map invalid memory orders to the strongest barrier. (Jerin Jacob)
3. Update doc/guides/prog_guide/env_abstraction_layer.rst. (Gage Eads)
4. Fix 32-bit x86 build issue. (Gage Eads)
5. Correct documentation issues in UT. (Gage Eads)

v2:
Initial version.

 config/arm/meson.build                             |   2 +
 config/common_base                                 |   3 +
 config/defconfig_arm64-octeontx2-linuxapp-gcc      |   1 +
 config/defconfig_arm64-thunderx2-linuxapp-gcc      |   1 +
 .../common/include/arch/arm/rte_atomic_64.h        | 151 +++++++++++++++++++++
 .../common/include/arch/x86/rte_atomic_64.h        |  12 --
 lib/librte_eal/common/include/generic/rte_atomic.h |  17 ++-
 7 files changed, 174 insertions(+), 13 deletions(-)
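
The diffstat reflects that `rte_int128_t` moves from the x86 header into the generic one so both architectures can share it. A rough sketch of that layout follows, assuming a GCC-compatible compiler. `sketch_int128_t` and `sketch_int128_equal` are hypothetical names, and `__SIZEOF_INT128__` is used here as a stand-in for the `RTE_ARCH_64` guard in the patch:

```c
#include <stdint.h>

/*
 * Two 64-bit halves overlaid with a native 128-bit integer where the
 * compiler provides one. The 16-byte alignment matters because the
 * paired load/store and CAS instructions operate on an aligned 16-byte
 * location.
 */
typedef struct {
	union {
		uint64_t val[2];
#ifdef __SIZEOF_INT128__
		__extension__ __int128 int128;
#endif
	};
} __attribute__((aligned(16))) sketch_int128_t;

/* Compare through val[] so the helper also works without __int128. */
static int
sketch_int128_equal(const sketch_int128_t *a, const sketch_int128_t *b)
{
	return a->val[0] == b->val[0] && a->val[1] == b->val[1];
}
```

Comparing through `val[]` avoids depending on the `int128` member, which the patch only declares on 64-bit builds.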
  

Comments

David Marchand Oct. 18, 2019, 2:16 p.m. UTC | #1
On Fri, Oct 18, 2019 at 1:22 PM Phil Yang <phil.yang@arm.com> wrote:
>
> This patch adds the implementation of the 128-bit atomic compare
> exchange API on AArch64. Using 64-bit 'ldxp/stxp' instructions
> can perform this operation. Moreover, on the LSE atomic extension
> accelerated platforms, it implemented by 'casp' instructions for
> better performance.
>
> Since the '__ARM_FEATURE_ATOMICS' flag only supports GCC-9, so this
> patch adds a new config flag 'RTE_ARM_FEATURE_ATOMICS' to enable the
> 'cas' version on elder version compilers.

Jerin, Phil,

I am getting a build error on the octeontx2 target:

{standard input}: Assembler messages:
{standard input}:672: Error: selected processor does not support `casp x0,x1,x2,x3,[x4]'
{standard input}:690: Error: selected processor does not support `caspa x0,x1,x2,x3,[x4]'
{standard input}:708: Error: selected processor does not support `caspl x0,x1,x2,x3,[x4]'
{standard input}:726: Error: selected processor does not support `caspal x0,x1,x2,x3,[x4]'
ninja: build stopped: subcommand failed.

Looking into the meson logs, I can see:

Native C compiler: ccache gcc (gcc 9.2.1 "gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1)")
Cross C compiler: aarch64-linux-gnu-gcc (gcc 8.2.1)
Host machine cpu family: aarch64
Host machine cpu: armv8-a
Target machine cpu family: aarch64
Target machine cpu: armv8-a
Build machine cpu family: x86_64
Build machine cpu: x86_64
...
Message: Implementer : Cavium
Compiler for C supports arguments -mcpu=octeontx2: NO
Message: []
Fetching value of define "__ARM_NEON" : 1
Fetching value of define "__ARM_FEATURE_CRC32" :
Fetching value of define "__ARM_FEATURE_CRYPTO" :


My toolchain does not support the octeontx2 target, but
RTE_ARM_FEATURE_ATOMICS ends up being set in the configuration anyway.
Tried with the Linaro toolchains (4.7.1, 7.4) mentioned in the dpdk
documentation, same result.

Looking at config/arm/meson.build, the "extra machine specific flags"
are appended to the configuration, regardless of what the compiler
replied when testing the machine args.
  
Jerin Jacob Oct. 18, 2019, 2:24 p.m. UTC | #2
On Fri, Oct 18, 2019 at 7:46 PM David Marchand
<david.marchand@redhat.com> wrote:
>
> On Fri, Oct 18, 2019 at 1:22 PM Phil Yang <phil.yang@arm.com> wrote:
> >
> > This patch adds the implementation of the 128-bit atomic compare
> > exchange API on AArch64. Using 64-bit 'ldxp/stxp' instructions
> > can perform this operation. Moreover, on the LSE atomic extension
> > accelerated platforms, it implemented by 'casp' instructions for
> > better performance.
> >
> > Since the '__ARM_FEATURE_ATOMICS' flag only supports GCC-9, so this
> > patch adds a new config flag 'RTE_ARM_FEATURE_ATOMICS' to enable the
> > 'cas' version on elder version compilers.
>
> Jerin, Phil,
>
> I am getting a build error on the octeontx2 target:
>
> {standard input}: Assembler messages:
> {standard input}:672: Error: selected processor does not support `casp
> x0,x1,x2,x3,[x4]'
> {standard input}:690: Error: selected processor does not support
> `caspa x0,x1,x2,x3,[x4]'
> {standard input}:708: Error: selected processor does not support
> `caspl x0,x1,x2,x3,[x4]'
> {standard input}:726: Error: selected processor does not support
> `caspal x0,x1,x2,x3,[x4]'
> ninja: build stopped: subcommand failed.
>
> Looking into the meson logs, I can see:
>
> Native C compiler: ccache gcc (gcc 9.2.1 "gcc (GCC) 9.2.1 20190827
> (Red Hat 9.2.1-1)")
> Cross C compiler: aarch64-linux-gnu-gcc (gcc 8.2.1)
> Host machine cpu family: aarch64
> Host machine cpu: armv8-a
> Target machine cpu family: aarch64
> Target machine cpu: armv8-a
> Build machine cpu family: x86_64
> Build machine cpu: x86_64
> ...
> Message: Implementer : Cavium
> Compiler for C supports arguments -mcpu=octeontx2: NO

The compiler needs either +lse or -mcpu=octeontx2 to generate the casp instructions.
Could you try this patch? I can submit it if it works for you.

[master][dpdk-next-net-mrvl] $ git diff
diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e16..466522786 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -96,7 +96,7 @@ machine_args_cavium = [
        ['0xa2', ['-mcpu=thunderxt81'], flags_thunderx_extra],
        ['0xa3', ['-mcpu=thunderxt83'], flags_thunderx_extra],
        ['0xaf', ['-march=armv8.1-a+crc+crypto','-mcpu=thunderx2t99'], flags_thunderx2_extra],
-       ['0xb2', ['-mcpu=octeontx2'], flags_octeontx2_extra]]
+       ['0xb2', ['-march=armv8.2-a+crc+crypto+lse','-mcpu=octeontx2'], flags_octeontx2_extra]]

 ## Arm implementer ID (ARM DDI 0487C.a, Section G7.2.106, Page G7-5321)
 impl_generic = ['Generic armv8', flags_generic, machine_args_generic]
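
The failure discussed here comes from the header choosing the `casp` path whenever either feature macro is defined, independent of what the assembler was told to accept. A small sketch of that compile-time selection, with `sketch_cas128_backend` as a hypothetical helper that only reports which branch was compiled in:

```c
/*
 * Report which 128-bit CAS implementation the preprocessor selected.
 * __ARM_FEATURE_ATOMICS is defined by the compiler itself, while
 * RTE_ARM_FEATURE_ATOMICS comes from the DPDK build configuration, so
 * the latter can be set even when the toolchain was not asked for LSE.
 */
static const char *
sketch_cas128_backend(void)
{
#if defined(__ARM_FEATURE_ATOMICS) || defined(RTE_ARM_FEATURE_ATOMICS)
	return "casp";
#else
	return "ldxp/stxp";
#endif
}
```

When the config flag forces the first branch, the emitted `casp*` mnemonics still have to be accepted by the assembler, which is why the machine args need `+lse` or a `-mcpu` value that implies it.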
  
David Marchand Oct. 18, 2019, 2:33 p.m. UTC | #3
On Fri, Oct 18, 2019 at 4:25 PM Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
> The compiler needs either +lse or mcpu=octeontx2 to generate casp instruction.
> Could you try this patch, I can submit a patch if it works for you.

Ah cool, I was looking at the march stuff.
Tried your patch, it works fine.

I'd say we can squash your bits in the current patch, since this was
unneeded before this patch.
Is this okay for you?


>
> [master][dpdk-next-net-mrvl] $ git diff
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 979018e16..466522786 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -96,7 +96,7 @@ machine_args_cavium = [
>         ['0xa2', ['-mcpu=thunderxt81'], flags_thunderx_extra],
>         ['0xa3', ['-mcpu=thunderxt83'], flags_thunderx_extra],
>         ['0xaf', ['-march=armv8.1-a+crc+crypto','-mcpu=thunderx2t99'], flags_thunderx2_extra],
> -       ['0xb2', ['-mcpu=octeontx2'], flags_octeontx2_extra]]
> +       ['0xb2', ['-march=armv8.2-a+crc+crypto+lse','-mcpu=octeontx2'], flags_octeontx2_extra]]
>
>  ## Arm implementer ID (ARM DDI 0487C.a, Section G7.2.106, Page G7-5321)
>  impl_generic = ['Generic armv8', flags_generic, machine_args_generic]

Thanks for the quick reply.
  
Jerin Jacob Oct. 18, 2019, 2:36 p.m. UTC | #4
On Fri, Oct 18, 2019 at 8:04 PM David Marchand
<david.marchand@redhat.com> wrote:
>
> >
> > The compiler needs either +lse or mcpu=octeontx2 to generate casp instruction.
> > Could you try this patch, I can submit a patch if it works for you.
>
> Ah cool, I was looking at the march stuff.
> Tried your patch, it works fine.
>
> I'd say we can squash your bits in the current patch, since this was
> unneeded before this patch.
> Is this okay for you?

Yup.

  
David Marchand Oct. 21, 2019, 8:24 a.m. UTC | #5
On Fri, Oct 18, 2019 at 4:36 PM Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
> > I'd say we can squash your bits in the current patch, since this was
> > unneeded before this patch.
> > Is this okay for you?
>
> Yup.
>
> >
> >
> > >
> > > [master][dpdk-next-net-mrvl] $ git diff
> > > diff --git a/config/arm/meson.build b/config/arm/meson.build
> > > index 979018e16..466522786 100644
> > > --- a/config/arm/meson.build
> > > +++ b/config/arm/meson.build
> > > @@ -96,7 +96,7 @@ machine_args_cavium = [
> > >         ['0xa2', ['-mcpu=thunderxt81'], flags_thunderx_extra],
> > >         ['0xa3', ['-mcpu=thunderxt83'], flags_thunderx_extra],
> > >         ['0xaf', ['-march=armv8.1-a+crc+crypto','-mcpu=thunderx2t99'], flags_thunderx2_extra],
> > > -       ['0xb2', ['-mcpu=octeontx2'], flags_octeontx2_extra]]
> > > +       ['0xb2', ['-march=armv8.2-a+crc+crypto+lse','-mcpu=octeontx2'], flags_octeontx2_extra]]
> > >
> > >  ## Arm implementer ID (ARM DDI 0487C.a, Section G7.2.106, Page G7-5321)
> > >  impl_generic = ['Generic armv8', flags_generic, machine_args_generic]
> >
> > Thanks for the quick reply.
> >

Applied with above fix.
Thanks.
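
For readers following the non-LSE path of the applied patch, the `success` memory order is split into one ordering for the exclusive load and one for the exclusive store. A minimal sketch of that mapping, assuming the GCC/Clang `__ATOMIC_*` constants. `sketch_ll_sc_order` and `sketch_split_order` are hypothetical names, not part of the patch:

```c
/* Orderings chosen for the load-exclusive and store-exclusive halves. */
struct sketch_ll_sc_order {
	int load_mo;
	int store_mo;
};

/*
 * Mirror of the __HAS_ACQ / __HAS_RLS tests in the patch: anything
 * other than relaxed or release needs an acquiring load, and release,
 * acq_rel or seq_cst need a releasing store.
 */
static struct sketch_ll_sc_order
sketch_split_order(int success)
{
	struct sketch_ll_sc_order o;
	int has_acq = success != __ATOMIC_RELAXED &&
		      success != __ATOMIC_RELEASE;
	int has_rls = success == __ATOMIC_RELEASE ||
		      success == __ATOMIC_ACQ_REL ||
		      success == __ATOMIC_SEQ_CST;

	o.load_mo = has_acq ? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED;
	o.store_mo = has_rls ? __ATOMIC_RELEASE : __ATOMIC_RELAXED;
	return o;
}
```

In the patch the two results then pick between `ldxp`/`ldaxp` and `stxp`/`stlxp` respectively.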
  

Patch

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 979018e..9f28271 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -71,11 +71,13 @@  flags_thunderx2_extra = [
 	['RTE_CACHE_LINE_SIZE', 64],
 	['RTE_MAX_NUMA_NODES', 2],
 	['RTE_MAX_LCORE', 256],
+	['RTE_ARM_FEATURE_ATOMICS', true],
 	['RTE_USE_C11_MEM_MODEL', true]]
 flags_octeontx2_extra = [
 	['RTE_MACHINE', '"octeontx2"'],
 	['RTE_MAX_NUMA_NODES', 1],
 	['RTE_MAX_LCORE', 24],
+	['RTE_ARM_FEATURE_ATOMICS', true],
 	['RTE_EAL_IGB_UIO', false],
 	['RTE_USE_C11_MEM_MODEL', true]]
 
diff --git a/config/common_base b/config/common_base
index e843a21..a96beb9 100644
--- a/config/common_base
+++ b/config/common_base
@@ -82,6 +82,9 @@  CONFIG_RTE_MAX_LCORE=128
 CONFIG_RTE_MAX_NUMA_NODES=8
 CONFIG_RTE_MAX_HEAPS=32
 CONFIG_RTE_MAX_MEMSEG_LISTS=64
+
+# Use ARM LSE ATOMIC instructions
+CONFIG_RTE_ARM_FEATURE_ATOMICS=n
 # each memseg list will be limited to either RTE_MAX_MEMSEG_PER_LIST pages
 # or RTE_MAX_MEM_MB_PER_LIST megabytes worth of memory, whichever is smaller
 CONFIG_RTE_MAX_MEMSEG_PER_LIST=8192
diff --git a/config/defconfig_arm64-octeontx2-linuxapp-gcc b/config/defconfig_arm64-octeontx2-linuxapp-gcc
index f20da24..7687dbe 100644
--- a/config/defconfig_arm64-octeontx2-linuxapp-gcc
+++ b/config/defconfig_arm64-octeontx2-linuxapp-gcc
@@ -9,6 +9,7 @@  CONFIG_RTE_MACHINE="octeontx2"
 CONFIG_RTE_CACHE_LINE_SIZE=128
 CONFIG_RTE_MAX_NUMA_NODES=1
 CONFIG_RTE_MAX_LCORE=24
+CONFIG_RTE_ARM_FEATURE_ATOMICS=y
 
 # Doesn't support NUMA
 CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=n
diff --git a/config/defconfig_arm64-thunderx2-linuxapp-gcc b/config/defconfig_arm64-thunderx2-linuxapp-gcc
index cc5c64b..af4a89c 100644
--- a/config/defconfig_arm64-thunderx2-linuxapp-gcc
+++ b/config/defconfig_arm64-thunderx2-linuxapp-gcc
@@ -9,3 +9,4 @@  CONFIG_RTE_MACHINE="thunderx2"
 CONFIG_RTE_CACHE_LINE_SIZE=64
 CONFIG_RTE_MAX_NUMA_NODES=2
 CONFIG_RTE_MAX_LCORE=256
+CONFIG_RTE_ARM_FEATURE_ATOMICS=y
diff --git a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
index 97060e4..d9ebccc 100644
--- a/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/arm/rte_atomic_64.h
@@ -1,5 +1,6 @@ 
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2015 Cavium, Inc
+ * Copyright(c) 2019 Arm Limited
  */
 
 #ifndef _RTE_ATOMIC_ARM64_H_
@@ -14,6 +15,9 @@  extern "C" {
 #endif
 
 #include "generic/rte_atomic.h"
+#include <rte_branch_prediction.h>
+#include <rte_compat.h>
+#include <rte_debug.h>
 
 #define dsb(opt) asm volatile("dsb " #opt : : : "memory")
 #define dmb(opt) asm volatile("dmb " #opt : : : "memory")
@@ -40,6 +44,153 @@  extern "C" {
 
 #define rte_cio_rmb() dmb(oshld)
 
+/*------------------------ 128 bit atomic operations -------------------------*/
+
+#if defined(__ARM_FEATURE_ATOMICS) || defined(RTE_ARM_FEATURE_ATOMICS)
+#define __ATOMIC128_CAS_OP(cas_op_name, op_string)                          \
+static __rte_noinline rte_int128_t                                          \
+cas_op_name(rte_int128_t *dst, rte_int128_t old,                            \
+		rte_int128_t updated)                                       \
+{                                                                           \
+	/* caspX instructions register pair must start from even-numbered
+	 * register at operand 1.
+	 * So, specify registers for local variables here.
+	 */                                                                 \
+	register uint64_t x0 __asm("x0") = (uint64_t)old.val[0];            \
+	register uint64_t x1 __asm("x1") = (uint64_t)old.val[1];            \
+	register uint64_t x2 __asm("x2") = (uint64_t)updated.val[0];        \
+	register uint64_t x3 __asm("x3") = (uint64_t)updated.val[1];        \
+	asm volatile(                                                       \
+		op_string " %[old0], %[old1], %[upd0], %[upd1], [%[dst]]"   \
+		: [old0] "+r" (x0),                                         \
+		[old1] "+r" (x1)                                            \
+		: [upd0] "r" (x2),                                          \
+		[upd1] "r" (x3),                                            \
+		[dst] "r" (dst)                                             \
+		: "memory");                                                \
+	old.val[0] = x0;                                                    \
+	old.val[1] = x1;                                                    \
+	return old;                                                         \
+}
+
+__ATOMIC128_CAS_OP(__cas_128_relaxed, "casp")
+__ATOMIC128_CAS_OP(__cas_128_acquire, "caspa")
+__ATOMIC128_CAS_OP(__cas_128_release, "caspl")
+__ATOMIC128_CAS_OP(__cas_128_acq_rel, "caspal")
+
+#undef __ATOMIC128_CAS_OP
+
+#endif
+
+__rte_experimental
+static inline int
+rte_atomic128_cmp_exchange(rte_int128_t *dst,
+				rte_int128_t *exp,
+				const rte_int128_t *src,
+				unsigned int weak,
+				int success,
+				int failure)
+{
+	/* Always do strong CAS */
+	RTE_SET_USED(weak);
+	/* Ignore memory ordering for failure, memory order for
+	 * success must be stronger or equal
+	 */
+	RTE_SET_USED(failure);
+	/* Find invalid memory order */
+	RTE_ASSERT(success == __ATOMIC_RELAXED
+			|| success == __ATOMIC_ACQUIRE
+			|| success == __ATOMIC_RELEASE
+			|| success == __ATOMIC_ACQ_REL
+			|| success == __ATOMIC_SEQ_CST);
+
+	rte_int128_t expected = *exp;
+	rte_int128_t desired = *src;
+	rte_int128_t old;
+
+#if defined(__ARM_FEATURE_ATOMICS) || defined(RTE_ARM_FEATURE_ATOMICS)
+	if (success == __ATOMIC_RELAXED)
+		old = __cas_128_relaxed(dst, expected, desired);
+	else if (success == __ATOMIC_ACQUIRE)
+		old = __cas_128_acquire(dst, expected, desired);
+	else if (success == __ATOMIC_RELEASE)
+		old = __cas_128_release(dst, expected, desired);
+	else
+		old = __cas_128_acq_rel(dst, expected, desired);
+#else
+#define __HAS_ACQ(mo) ((mo) != __ATOMIC_RELAXED && (mo) != __ATOMIC_RELEASE)
+#define __HAS_RLS(mo) ((mo) == __ATOMIC_RELEASE || (mo) == __ATOMIC_ACQ_REL || \
+					  (mo) == __ATOMIC_SEQ_CST)
+
+	int ldx_mo = __HAS_ACQ(success) ? __ATOMIC_ACQUIRE : __ATOMIC_RELAXED;
+	int stx_mo = __HAS_RLS(success) ? __ATOMIC_RELEASE : __ATOMIC_RELAXED;
+
+#undef __HAS_ACQ
+#undef __HAS_RLS
+
+	uint32_t ret = 1;
+
+	/* ldxp/ldaxp alone cannot guarantee a 128-bit atomic load.
+	 * Must write back src or old to verify the atomicity of the load.
+	 */
+	do {
+
+#define __LOAD_128(op_string, src, dst) {\
+	asm volatile(                    \
+		op_string " %0, %1, %2"  \
+		: "=&r" (dst.val[0]),    \
+		  "=&r" (dst.val[1])     \
+		: "Q" (src->val[0])      \
+		: "memory"); }
+
+		if (ldx_mo == __ATOMIC_RELAXED)
+			__LOAD_128("ldxp", dst, old)
+		else
+			__LOAD_128("ldaxp", dst, old)
+
+#undef __LOAD_128
+
+#define __STORE_128(op_string, dst, src, ret) {\
+	asm volatile(                          \
+		op_string " %w0, %1, %2, %3"   \
+		: "=&r" (ret)                  \
+		: "r" (src.val[0]),            \
+		  "r" (src.val[1]),            \
+		  "Q" (dst->val[0])            \
+		: "memory"); }
+
+		if (likely(old.int128 == expected.int128)) {
+			if (stx_mo == __ATOMIC_RELAXED)
+				__STORE_128("stxp", dst, desired, ret)
+			else
+				__STORE_128("stlxp", dst, desired, ret)
+		} else {
+			/* In the failure case (since 'weak' is ignored and only
+			 * weak == 0 is implemented), expected should contain
+			 * the atomically read value of dst. This means, 'old'
+			 * needs to be stored back to ensure it was read
+			 * atomically.
+			 */
+			if (stx_mo == __ATOMIC_RELAXED)
+				__STORE_128("stxp", dst, old, ret)
+			else
+				__STORE_128("stlxp", dst, old, ret)
+		}
+#undef __STORE_128
+
+	} while (unlikely(ret));
+#endif
+
+	/* Unconditionally updating expected removes
+	 * an 'if' statement.
+	 * expected should already be in register if
+	 * not in the cache.
+	 */
+	*exp = old;
+
+	return (old.int128 == expected.int128);
+}
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
index 1335d92..cfe7067 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
@@ -183,18 +183,6 @@  static inline void rte_atomic64_clear(rte_atomic64_t *v)
 
 /*------------------------ 128 bit atomic operations -------------------------*/
 
-/**
- * 128-bit integer structure.
- */
-RTE_STD_C11
-typedef struct {
-	RTE_STD_C11
-	union {
-		uint64_t val[2];
-		__extension__ __int128 int128;
-	};
-} __rte_aligned(16) rte_int128_t;
-
 __rte_experimental
 static inline int
 rte_atomic128_cmp_exchange(rte_int128_t *dst,
diff --git a/lib/librte_eal/common/include/generic/rte_atomic.h b/lib/librte_eal/common/include/generic/rte_atomic.h
index 24ff7dc..e6ab15a 100644
--- a/lib/librte_eal/common/include/generic/rte_atomic.h
+++ b/lib/librte_eal/common/include/generic/rte_atomic.h
@@ -1081,6 +1081,20 @@  static inline void rte_atomic64_clear(rte_atomic64_t *v)
 
 /*------------------------ 128 bit atomic operations -------------------------*/
 
+/**
+ * 128-bit integer structure.
+ */
+RTE_STD_C11
+typedef struct {
+	RTE_STD_C11
+	union {
+		uint64_t val[2];
+#ifdef RTE_ARCH_64
+		__extension__ __int128 int128;
+#endif
+	};
+} __rte_aligned(16) rte_int128_t;
+
 #ifdef __DOXYGEN__
 
 /**
@@ -1093,7 +1107,8 @@  static inline void rte_atomic64_clear(rte_atomic64_t *v)
  *     *exp = *dst
  * @endcode
  *
- * @note This function is currently only available for the x86-64 platform.
+ * @note This function is currently available for the x86-64 and aarch64
+ * platforms.
  *
  * @note The success and failure arguments must be one of the __ATOMIC_* values
  * defined in the C++11 standard. For details on their behavior, refer to the