[v3,1/2] eal: add 128-bit cmpset (x86-64 only)

Message ID 20190116151835.22424-2-gage.eads@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series Add non-blocking stack mempool handler |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/mellanox-Performance-Testing success Performance Testing PASS
ci/intel-Performance-Testing success Performance Testing PASS
ci/Intel-compilation success Compilation OK

Commit Message

Eads, Gage Jan. 16, 2019, 3:18 p.m. UTC
  This operation can be used for non-blocking algorithms, such as a
non-blocking stack or ring.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 .../common/include/arch/x86/rte_atomic_64.h        | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
  

Comments

Honnappa Nagarahalli Jan. 17, 2019, 3:45 p.m. UTC | #1
> Subject: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
> 
> This operation can be used for non-blocking algorithms, such as a non-
> blocking stack or ring.
> 
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>  .../common/include/arch/x86/rte_atomic_64.h        | 22
> ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> index fd2ec9c53..34c2addf8 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
Since this is a 128b operation should there be a new file created with the name rte_atomic_128.h?

> @@ -34,6 +34,7 @@
>  /*
>   * Inspired from FreeBSD src/sys/amd64/include/atomic.h
>   * Copyright (c) 1998 Doug Rabson
> + * Copyright (c) 2019 Intel Corporation
>   * All rights reserved.
>   */
> 
> @@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t
> *v)  }  #endif
> 
> +static inline int
> +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t
> +*src) {
The API name suggests it is a 128b operation. 'dst', 'exp' and 'src' should be pointers to 128b (__int128)? Or we could define our own data type.
Since, it is a new API, can we define it with memory orderings which will be more conducive to relaxed memory ordering based architectures? You can refer to [1] and [2] for guidance.
If this an external API, it requires 'experimental' tag.

1. https://github.com/ARM-software/progress64/blob/master/src/lockfree/aarch64.h#L63
2. https://github.com/ARM-software/progress64/blob/master/src/lockfree/x86-64.h#L34

> +	uint8_t res;
> +
> +	asm volatile (
> +		      MPLOCKED
> +		      "cmpxchg16b %[dst];"
> +		      " sete %[res]"
> +		      : [dst] "=m" (*dst),
> +			[res] "=r" (res)
> +		      : "c" (src[1]),
> +			"b" (src[0]),
> +			"m" (*dst),
> +			"d" (exp[1]),
> +			"a" (exp[0])
> +		      : "memory");
> +
> +	return res;
> +}
> +
>  #endif /* _RTE_ATOMIC_X86_64_H_ */
> --
> 2.13.6
  
Eads, Gage Jan. 17, 2019, 11:03 p.m. UTC | #2
> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, January 17, 2019 9:45 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
> 
> > Subject: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64
> > only)
> >
> > This operation can be used for non-blocking algorithms, such as a non-
> > blocking stack or ring.
> >
> > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > ---
> >  .../common/include/arch/x86/rte_atomic_64.h        | 22
> > ++++++++++++++++++++++
> >  1 file changed, 22 insertions(+)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > index fd2ec9c53..34c2addf8 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> Since this is a 128b operation should there be a new file created with the name
> rte_atomic_128.h?
> 
> > @@ -34,6 +34,7 @@
> >  /*
> >   * Inspired from FreeBSD src/sys/amd64/include/atomic.h
> >   * Copyright (c) 1998 Doug Rabson
> > + * Copyright (c) 2019 Intel Corporation
> >   * All rights reserved.
> >   */
> >
> > @@ -208,4 +209,25 @@ static inline void
> > rte_atomic64_clear(rte_atomic64_t
> > *v)  }  #endif
> >
> > +static inline int
> > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t
> > +*src) {
> The API name suggests it is a 128b operation. 'dst', 'exp' and 'src' should be
> pointers to 128b (__int128)? Or we could define our own data type.

I agree, I'm not a big fan of the 64b pointers here. I avoided __int128 originally because it fails to compile with -pedantic, but on second thought (and with your suggestion of a separate data type), we can resolve that with this typedef:

typedef struct {
        RTE_STD_C11 __int128 val;
} rte_int128_t;

> Since, it is a new API, can we define it with memory orderings which will be more
> conducive to relaxed memory ordering based architectures? You can refer to [1]
> and [2] for guidance.

I certainly see the value in controlling the operation's memory ordering, like in the __atomic intrinsics, but I'm not sure this patchset is the right place to address that. I see that work going a couple ways:
1. Expand the existing rte_atomicN_* interfaces with additional arguments. In that case, I'd prefer this be done in a separate patchset that addresses all the atomic operations, not just cmpset, so the interface changes are chosen according to the needs of the full set of atomic operations. If this approach is taken then there's no need to solve this while rte_atomic128_cmpset is experimental, since all the other functions are non-experimental anyway.

- Or -

2. Don't modify the existing rte_atomicN_* interfaces (or their strongly ordered behavior), and instead create new versions of them that take additional arguments. In this case, we can implement rte_atomic128_cmpset() as is and create a more flexible version in a later patchset.

Either way, I think the current interface (w.r.t. memory ordering options) can work and still leaves us in a good position for future changes/improvements.

> If this an external API, it requires 'experimental' tag.

Good catch -- will fix.

> 
> 1. https://github.com/ARM-
> software/progress64/blob/master/src/lockfree/aarch64.h#L63

I didn't know about aarch64's CASP instruction -- very cool! 

> 2. https://github.com/ARM-
> software/progress64/blob/master/src/lockfree/x86-64.h#L34
> 
> > +	uint8_t res;
> > +
> > +	asm volatile (
> > +		      MPLOCKED
> > +		      "cmpxchg16b %[dst];"
> > +		      " sete %[res]"
> > +		      : [dst] "=m" (*dst),
> > +			[res] "=r" (res)
> > +		      : "c" (src[1]),
> > +			"b" (src[0]),
> > +			"m" (*dst),
> > +			"d" (exp[1]),
> > +			"a" (exp[0])
> > +		      : "memory");
> > +
> > +	return res;
> > +}
> > +
> >  #endif /* _RTE_ATOMIC_X86_64_H_ */
> > --
> > 2.13.6
  
Honnappa Nagarahalli Jan. 18, 2019, 5:27 a.m. UTC | #3
> > >
> > > This operation can be used for non-blocking algorithms, such as a
> > > non- blocking stack or ring.
> > >
> > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > ---
> > >  .../common/include/arch/x86/rte_atomic_64.h        | 22
> > > ++++++++++++++++++++++
> > >  1 file changed, 22 insertions(+)
> > >
> > > diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > index fd2ec9c53..34c2addf8 100644
> > > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > Since this is a 128b operation should there be a new file created with
> > the name rte_atomic_128.h?
> >
> > > @@ -34,6 +34,7 @@
> > >  /*
> > >   * Inspired from FreeBSD src/sys/amd64/include/atomic.h
> > >   * Copyright (c) 1998 Doug Rabson
> > > + * Copyright (c) 2019 Intel Corporation
> > >   * All rights reserved.
> > >   */
> > >
> > > @@ -208,4 +209,25 @@ static inline void
> > > rte_atomic64_clear(rte_atomic64_t
> > > *v)  }  #endif
> > >
> > > +static inline int
> > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > +uint64_t
> > > +*src) {
> > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > should be pointers to 128b (__int128)? Or we could define our own data
> type.
> 
> I agree, I'm not a big fan of the 64b pointers here. I avoided __int128
> originally because it fails to compile with -pedantic, but on second thought
> (and with your suggestion of a separate data type), we can resolve that with
> this typedef:
> 
> typedef struct {
>         RTE_STD_C11 __int128 val;
> } rte_int128_t;
ok

> 
> > Since, it is a new API, can we define it with memory orderings which
> > will be more conducive to relaxed memory ordering based architectures?
> > You can refer to [1] and [2] for guidance.
> 
> I certainly see the value in controlling the operation's memory ordering, like in
> the __atomic intrinsics, but I'm not sure this patchset is the right place to
> address that. I see that work going a couple ways:
> 1. Expand the existing rte_atomicN_* interfaces with additional arguments. In
> that case, I'd prefer this be done in a separate patchset that addresses all the
> atomic operations, not just cmpset, so the interface changes are chosen
> according to the needs of the full set of atomic operations. If this approach is
> taken then there's no need to solve this while rte_atomic128_cmpset is
> experimental, since all the other functions are non-experimental anyway.
> 
> - Or -
> 
> 2. Don't modify the existing rte_atomicN_* interfaces (or their strongly
> ordered behavior), and instead create new versions of them that take
> additional arguments. In this case, we can implement rte_atomic128_cmpset()
> as is and create a more flexible version in a later patchset.
> 
> Either way, I think the current interface (w.r.t. memory ordering options) can
> work and still leaves us in a good position for future changes/improvements.
> 
I do not see the need to modify/extend the existing rte_atomicN_* APIs as the corresponding __atomic intrinsics serve as replacements. I expect that at some point, DPDK code base will not be using rte_atomicN_* APIs.
However, __atomic intrinsics do not support 128b wide parameters. Hence DPDK needs to write its own. Since this is the first API in that regard, I prefer that we start with a signature that resembles __atomic intrinsics which have been proven to provide best flexibility for all the platforms supported by DPDK.

> > If this an external API, it requires 'experimental' tag.
> 
> Good catch -- will fix.
> 
> >
> > 1. https://github.com/ARM-
> > software/progress64/blob/master/src/lockfree/aarch64.h#L63
> 
> I didn't know about aarch64's CASP instruction -- very cool!
> 
> > 2. https://github.com/ARM-
> > software/progress64/blob/master/src/lockfree/x86-64.h#L34
> >
> > > +	uint8_t res;
> > > +
> > > +	asm volatile (
> > > +		      MPLOCKED
> > > +		      "cmpxchg16b %[dst];"
> > > +		      " sete %[res]"
> > > +		      : [dst] "=m" (*dst),
> > > +			[res] "=r" (res)
> > > +		      : "c" (src[1]),
> > > +			"b" (src[0]),
> > > +			"m" (*dst),
> > > +			"d" (exp[1]),
> > > +			"a" (exp[0])
> > > +		      : "memory");
> > > +
> > > +	return res;
> > > +}
> > > +
> > >  #endif /* _RTE_ATOMIC_X86_64_H_ */
> > > --
> > > 2.13.6
  
Eads, Gage Jan. 18, 2019, 10:01 p.m. UTC | #4
> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, January 17, 2019 11:28 PM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
> 
> > > >
> > > > This operation can be used for non-blocking algorithms, such as a
> > > > non- blocking stack or ring.
> > > >
> > > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > > ---
> > > >  .../common/include/arch/x86/rte_atomic_64.h        | 22
> > > > ++++++++++++++++++++++
> > > >  1 file changed, 22 insertions(+)
> > > >
> > > > diff --git
> > > > a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > > b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > > index fd2ec9c53..34c2addf8 100644
> > > > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > Since this is a 128b operation should there be a new file created
> > > with the name rte_atomic_128.h?
> > >
> > > > @@ -34,6 +34,7 @@
> > > >  /*
> > > >   * Inspired from FreeBSD src/sys/amd64/include/atomic.h
> > > >   * Copyright (c) 1998 Doug Rabson
> > > > + * Copyright (c) 2019 Intel Corporation
> > > >   * All rights reserved.
> > > >   */
> > > >
> > > > @@ -208,4 +209,25 @@ static inline void
> > > > rte_atomic64_clear(rte_atomic64_t
> > > > *v)  }  #endif
> > > >
> > > > +static inline int
> > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > > +uint64_t
> > > > +*src) {
> > > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > > should be pointers to 128b (__int128)? Or we could define our own
> > > data
> > type.
> >
> > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > __int128 originally because it fails to compile with -pedantic, but on
> > second thought (and with your suggestion of a separate data type), we
> > can resolve that with this typedef:
> >
> > typedef struct {
> >         RTE_STD_C11 __int128 val;
> > } rte_int128_t;
> ok
> 
> >
> > > Since, it is a new API, can we define it with memory orderings which
> > > will be more conducive to relaxed memory ordering based architectures?
> > > You can refer to [1] and [2] for guidance.
> >
> > I certainly see the value in controlling the operation's memory
> > ordering, like in the __atomic intrinsics, but I'm not sure this
> > patchset is the right place to address that. I see that work going a couple
> ways:
> > 1. Expand the existing rte_atomicN_* interfaces with additional
> > arguments. In that case, I'd prefer this be done in a separate
> > patchset that addresses all the atomic operations, not just cmpset, so
> > the interface changes are chosen according to the needs of the full
> > set of atomic operations. If this approach is taken then there's no
> > need to solve this while rte_atomic128_cmpset is experimental, since all the
> other functions are non-experimental anyway.
> >
> > - Or -
> >
> > 2. Don't modify the existing rte_atomicN_* interfaces (or their
> > strongly ordered behavior), and instead create new versions of them
> > that take additional arguments. In this case, we can implement
> > rte_atomic128_cmpset() as is and create a more flexible version in a later
> patchset.
> >
> > Either way, I think the current interface (w.r.t. memory ordering
> > options) can work and still leaves us in a good position for future
> changes/improvements.
> >
> I do not see the need to modify/extend the existing rte_atomicN_* APIs as the
> corresponding __atomic intrinsics serve as replacements. I expect that at some
> point, DPDK code base will not be using rte_atomicN_* APIs.
> However, __atomic intrinsics do not support 128b wide parameters. Hence

I don't think that's correct. From the GCC docs:

"16-byte integral types are also allowed if `__int128' (see __int128) is supported by the architecture."

This works with x86 -64 -- I assume aarch64 also, but haven't confirmed.

Source: https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-Builtins.html

> DPDK needs to write its own. Since this is the first API in that regard, I prefer that
> we start with a signature that resembles __atomic intrinsics which have been
> proven to provide best flexibility for all the platforms supported by DPDK.
  
Honnappa Nagarahalli Jan. 22, 2019, 8:30 p.m. UTC | #5
Added other platform owners.

> > > > > @@ -208,4 +209,25 @@ static inline void
> > > > > rte_atomic64_clear(rte_atomic64_t
> > > > > *v)  }  #endif
> > > > >
> > > > > +static inline int
> > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > > > +uint64_t
> > > > > +*src) {
> > > > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > > > should be pointers to 128b (__int128)? Or we could define our own
> > > > data
> > > type.
> > >
> > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > __int128 originally because it fails to compile with -pedantic, but
> > > on second thought (and with your suggestion of a separate data
> > > type), we can resolve that with this typedef:
> > >
> > > typedef struct {
> > >         RTE_STD_C11 __int128 val;
> > > } rte_int128_t;
> > ok
> >
> > >
> > > > Since, it is a new API, can we define it with memory orderings
> > > > which will be more conducive to relaxed memory ordering based
> architectures?
> > > > You can refer to [1] and [2] for guidance.
> > >
> > > I certainly see the value in controlling the operation's memory
> > > ordering, like in the __atomic intrinsics, but I'm not sure this
> > > patchset is the right place to address that. I see that work going a
> > > couple
> > ways:
> > > 1. Expand the existing rte_atomicN_* interfaces with additional
> > > arguments. In that case, I'd prefer this be done in a separate
> > > patchset that addresses all the atomic operations, not just cmpset,
> > > so the interface changes are chosen according to the needs of the
> > > full set of atomic operations. If this approach is taken then
> > > there's no need to solve this while rte_atomic128_cmpset is
> > > experimental, since all the
> > other functions are non-experimental anyway.
> > >
> > > - Or -
> > >
> > > 2. Don't modify the existing rte_atomicN_* interfaces (or their
> > > strongly ordered behavior), and instead create new versions of them
> > > that take additional arguments. In this case, we can implement
> > > rte_atomic128_cmpset() as is and create a more flexible version in a
> > > later
> > patchset.
> > >
> > > Either way, I think the current interface (w.r.t. memory ordering
> > > options) can work and still leaves us in a good position for future
> > changes/improvements.
> > >
> > I do not see the need to modify/extend the existing rte_atomicN_* APIs
> > as the corresponding __atomic intrinsics serve as replacements. I
> > expect that at some point, DPDK code base will not be using
> rte_atomicN_* APIs.
> > However, __atomic intrinsics do not support 128b wide parameters.
> > Hence
> 
> I don't think that's correct. From the GCC docs:
> 
> "16-byte integral types are also allowed if `__int128' (see __int128) is
> supported by the architecture."
> 
> This works with x86 -64 -- I assume aarch64 also, but haven't confirmed.
> 
> Source: https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-
> Builtins.html
(following is based on my reading, not based on experiments)
My understanding is that the __atomic_load_n/store_n were implemented using a compare-and-swap HW instruction (due to lack of HW 128b atomic load and store instructions). This introduced the side effect or store/load respectively. Where as the user does not expect such side effects. As suggested in [1], it looks like GCC delegated the implementation to libatomic which 'it seems' uses locks to implement 128b __atomic intrinsics (needs to be verified)

If __atomic intrinsics, for any of the supported platforms, do not have an optimal implementation, I prefer to add a DPDK API as an abstraction. A given platform can choose to implement such an API using __atomic intrinsics if it wants. The DPDK API can be similar to __atomic_compare_exchange_n.

[1] https://patchwork.ozlabs.org/patch/721686/
[2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html

> 
> > DPDK needs to write its own. Since this is the first API in that
> > regard, I prefer that we start with a signature that resembles
> > __atomic intrinsics which have been proven to provide best flexibility for
> all the platforms supported by DPDK.
  
Eads, Gage Jan. 22, 2019, 10:25 p.m. UTC | #6
> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Tuesday, January 22, 2019 2:31 PM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; nd <nd@arm.com>;
> chaozhu@linux.vnet.ibm.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
> 
> Added other platform owners.
> 
> > > > > > @@ -208,4 +209,25 @@ static inline void
> > > > > > rte_atomic64_clear(rte_atomic64_t
> > > > > > *v)  }  #endif
> > > > > >
> > > > > > +static inline int
> > > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > > > > +uint64_t
> > > > > > +*src) {
> > > > > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > > > > should be pointers to 128b (__int128)? Or we could define our
> > > > > own data
> > > > type.
> > > >
> > > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > > __int128 originally because it fails to compile with -pedantic,
> > > > but on second thought (and with your suggestion of a separate data
> > > > type), we can resolve that with this typedef:
> > > >
> > > > typedef struct {
> > > >         RTE_STD_C11 __int128 val;
> > > > } rte_int128_t;
> > > ok
> > >
> > > >
> > > > > Since, it is a new API, can we define it with memory orderings
> > > > > which will be more conducive to relaxed memory ordering based
> > architectures?
> > > > > You can refer to [1] and [2] for guidance.
> > > >
> > > > I certainly see the value in controlling the operation's memory
> > > > ordering, like in the __atomic intrinsics, but I'm not sure this
> > > > patchset is the right place to address that. I see that work going
> > > > a couple
> > > ways:
> > > > 1. Expand the existing rte_atomicN_* interfaces with additional
> > > > arguments. In that case, I'd prefer this be done in a separate
> > > > patchset that addresses all the atomic operations, not just
> > > > cmpset, so the interface changes are chosen according to the needs
> > > > of the full set of atomic operations. If this approach is taken
> > > > then there's no need to solve this while rte_atomic128_cmpset is
> > > > experimental, since all the
> > > other functions are non-experimental anyway.
> > > >
> > > > - Or -
> > > >
> > > > 2. Don't modify the existing rte_atomicN_* interfaces (or their
> > > > strongly ordered behavior), and instead create new versions of
> > > > them that take additional arguments. In this case, we can
> > > > implement
> > > > rte_atomic128_cmpset() as is and create a more flexible version in
> > > > a later
> > > patchset.
> > > >
> > > > Either way, I think the current interface (w.r.t. memory ordering
> > > > options) can work and still leaves us in a good position for
> > > > future
> > > changes/improvements.
> > > >
> > > I do not see the need to modify/extend the existing rte_atomicN_*
> > > APIs as the corresponding __atomic intrinsics serve as replacements.
> > > I expect that at some point, DPDK code base will not be using
> > rte_atomicN_* APIs.
> > > However, __atomic intrinsics do not support 128b wide parameters.
> > > Hence
> >
> > I don't think that's correct. From the GCC docs:
> >
> > "16-byte integral types are also allowed if `__int128' (see __int128)
> > is supported by the architecture."
> >
> > This works with x86 -64 -- I assume aarch64 also, but haven't confirmed.
> >
> > Source: https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-
> > Builtins.html
> (following is based on my reading, not based on experiments) My understanding
> is that the __atomic_load_n/store_n were implemented using a compare-and-
> swap HW instruction (due to lack of HW 128b atomic load and store
> instructions). This introduced the side effect or store/load respectively. Where
> as the user does not expect such side effects. As suggested in [1], it looks like
> GCC delegated the implementation to libatomic which 'it seems' uses locks to
> implement 128b __atomic intrinsics (needs to be verified)
>
> If __atomic intrinsics, for any of the supported platforms, do not have an
> optimal implementation, I prefer to add a DPDK API as an abstraction. A given
> platform can choose to implement such an API using __atomic intrinsics if it
> wants. The DPDK API can be similar to __atomic_compare_exchange_n.
> 

Certainly. From the linked discussions, I see how this would affect the design of (hypothetical functions) rte_atomic128_read() and rte_atomic128_set(), but I don't see anything that suggests (for the architectures being discussed) that __atomic_compare_exchange_n is suboptimal.

> [1] https://patchwork.ozlabs.org/patch/721686/
> [2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> 
> >
> > > DPDK needs to write its own. Since this is the first API in that
> > > regard, I prefer that we start with a signature that resembles
> > > __atomic intrinsics which have been proven to provide best
> > > flexibility for
> > all the platforms supported by DPDK.
  
Honnappa Nagarahalli Jan. 24, 2019, 5:21 a.m. UTC | #7
> >
> > Added other platform owners.
> >
> > > > > > > @@ -208,4 +209,25 @@ static inline void
> > > > > > > rte_atomic64_clear(rte_atomic64_t
> > > > > > > *v)  }  #endif
> > > > > > >
> > > > > > > +static inline int
> > > > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > > > > > +uint64_t
> > > > > > > +*src) {
> > > > > > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > > > > > should be pointers to 128b (__int128)? Or we could define our
> > > > > > own data
> > > > > type.
> > > > >
> > > > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > > > __int128 originally because it fails to compile with -pedantic,
> > > > > but on second thought (and with your suggestion of a separate
> > > > > data type), we can resolve that with this typedef:
> > > > >
> > > > > typedef struct {
> > > > >         RTE_STD_C11 __int128 val; } rte_int128_t;
> > > > ok
> > > >
> > > > >
> > > > > > Since, it is a new API, can we define it with memory orderings
> > > > > > which will be more conducive to relaxed memory ordering based
> > > architectures?
> > > > > > You can refer to [1] and [2] for guidance.
> > > > >
> > > > > I certainly see the value in controlling the operation's memory
> > > > > ordering, like in the __atomic intrinsics, but I'm not sure this
> > > > > patchset is the right place to address that. I see that work
> > > > > going a couple
> > > > ways:
> > > > > 1. Expand the existing rte_atomicN_* interfaces with additional
> > > > > arguments. In that case, I'd prefer this be done in a separate
> > > > > patchset that addresses all the atomic operations, not just
> > > > > cmpset, so the interface changes are chosen according to the
> > > > > needs of the full set of atomic operations. If this approach is
> > > > > taken then there's no need to solve this while
> > > > > rte_atomic128_cmpset is experimental, since all the
> > > > other functions are non-experimental anyway.
> > > > >
> > > > > - Or -
> > > > >
> > > > > 2. Don't modify the existing rte_atomicN_* interfaces (or their
> > > > > strongly ordered behavior), and instead create new versions of
> > > > > them that take additional arguments. In this case, we can
> > > > > implement
> > > > > rte_atomic128_cmpset() as is and create a more flexible version
> > > > > in a later
> > > > patchset.
> > > > >
> > > > > Either way, I think the current interface (w.r.t. memory
> > > > > ordering
> > > > > options) can work and still leaves us in a good position for
> > > > > future
> > > > changes/improvements.
> > > > >
> > > > I do not see the need to modify/extend the existing rte_atomicN_*
> > > > APIs as the corresponding __atomic intrinsics serve as replacements.
> > > > I expect that at some point, DPDK code base will not be using
> > > rte_atomicN_* APIs.
> > > > However, __atomic intrinsics do not support 128b wide parameters.
> > > > Hence
> > >
> > > I don't think that's correct. From the GCC docs:
> > >
> > > "16-byte integral types are also allowed if `__int128' (see
> > > __int128) is supported by the architecture."
> > >
> > > This works with x86 -64 -- I assume aarch64 also, but haven't confirmed.
> > >
> > > Source:
> > > https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-
> > > Builtins.html
> > (following is based on my reading, not based on experiments) My
> > understanding is that the __atomic_load_n/store_n were implemented
> > using a compare-and- swap HW instruction (due to lack of HW 128b
> > atomic load and store instructions). This introduced the side effect
> > or store/load respectively. Where as the user does not expect such
> > side effects. As suggested in [1], it looks like GCC delegated the
> > implementation to libatomic which 'it seems' uses locks to implement
> > 128b __atomic intrinsics (needs to be verified)
> >
> > If __atomic intrinsics, for any of the supported platforms, do not
> > have an optimal implementation, I prefer to add a DPDK API as an
> > abstraction. A given platform can choose to implement such an API
> > using __atomic intrinsics if it wants. The DPDK API can be similar to
> __atomic_compare_exchange_n.
> >
> 
> Certainly. From the linked discussions, I see how this would affect the design
> of (hypothetical functions) rte_atomic128_read() and rte_atomic128_set(),
> but I don't see anything that suggests (for the architectures being discussed)
> that __atomic_compare_exchange_n is suboptimal.
I wrote some code and generated assembly to verify what is happening. On aarch64, this call is delegated to libatomic and libatomic needs to be linked. In the generated assembly, it is clear that it uses locks (pthread mutex lock) to provide atomicity for. For 32b and 64b the compiler generates the expected inline assembly. Both, ' __atomic_always_lock_free' and ' __atomic_is_lock_free' return 0 to indicate that 128b __atomic intrinsics are not lock free. (gcc - 8.2)

Out of curiosity, I also did similar experiments on x86 (CPU E5-2660 v4). Even after using -mcx16 flag the call is delegated to libatomic. I see the 'lock cmpxchg16b' in the generated assembly. But, ' __atomic_always_lock_free' and ' __atomic_is_lock_free' return 0 to indicate that 128b __atomic intrinsics are NOT lock free with gcc version 7.3.0. However with gcc version 5.4.0, ' __atomic_is_lock_free' returns 1. I found more discussion at [3]. However, I am not an expert on x86.

[3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

These problems do not exist with 32b and 64b.

> 
> > [1] https://patchwork.ozlabs.org/patch/721686/
> > [2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> >
> > >
> > > > DPDK needs to write its own. Since this is the first API in that
> > > > regard, I prefer that we start with a signature that resembles
> > > > __atomic intrinsics which have been proven to provide best
> > > > flexibility for
> > > all the platforms supported by DPDK.
  
Eads, Gage Jan. 25, 2019, 5:19 p.m. UTC | #8
> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, January 23, 2019 11:22 PM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; nd <nd@arm.com>;
> chaozhu@linux.vnet.ibm.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
> 
> > >
> > > Added other platform owners.
> > >
> > > > > > > > @@ -208,4 +209,25 @@ static inline void
> > > > > > > > rte_atomic64_clear(rte_atomic64_t
> > > > > > > > *v)  }  #endif
> > > > > > > >
> > > > > > > > +static inline int
> > > > > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t
> > > > > > > > +*exp, uint64_t
> > > > > > > > +*src) {
> > > > > > > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > > > > > > should be pointers to 128b (__int128)? Or we could define
> > > > > > > our own data
> > > > > > type.
> > > > > >
> > > > > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > > > > __int128 originally because it fails to compile with
> > > > > > -pedantic, but on second thought (and with your suggestion of
> > > > > > a separate data type), we can resolve that with this typedef:
> > > > > >
> > > > > > typedef struct {
> > > > > >         RTE_STD_C11 __int128 val; } rte_int128_t;
> > > > > ok
> > > > >
> > > > > >
> > > > > > > Since, it is a new API, can we define it with memory
> > > > > > > orderings which will be more conducive to relaxed memory
> > > > > > > ordering based
> > > > architectures?
> > > > > > > You can refer to [1] and [2] for guidance.
> > > > > >
> > > > > > I certainly see the value in controlling the operation's
> > > > > > memory ordering, like in the __atomic intrinsics, but I'm not
> > > > > > sure this patchset is the right place to address that. I see
> > > > > > that work going a couple
> > > > > ways:
> > > > > > 1. Expand the existing rte_atomicN_* interfaces with
> > > > > > additional arguments. In that case, I'd prefer this be done in
> > > > > > a separate patchset that addresses all the atomic operations,
> > > > > > not just cmpset, so the interface changes are chosen according
> > > > > > to the needs of the full set of atomic operations. If this
> > > > > > approach is taken then there's no need to solve this while
> > > > > > rte_atomic128_cmpset is experimental, since all the
> > > > > other functions are non-experimental anyway.
> > > > > >
> > > > > > - Or -
> > > > > >
> > > > > > 2. Don't modify the existing rte_atomicN_* interfaces (or
> > > > > > their strongly ordered behavior), and instead create new
> > > > > > versions of them that take additional arguments. In this case,
> > > > > > we can implement
> > > > > > rte_atomic128_cmpset() as is and create a more flexible
> > > > > > version in a later
> > > > > patchset.
> > > > > >
> > > > > > Either way, I think the current interface (w.r.t. memory
> > > > > > ordering
> > > > > > options) can work and still leaves us in a good position for
> > > > > > future
> > > > > changes/improvements.
> > > > > >
> > > > > I do not see the need to modify/extend the existing
> > > > > rte_atomicN_* APIs as the corresponding __atomic intrinsics serve as
> replacements.
> > > > > I expect that at some point, DPDK code base will not be using
> > > > rte_atomicN_* APIs.
> > > > > However, __atomic intrinsics do not support 128b wide parameters.
> > > > > Hence
> > > >
> > > > I don't think that's correct. From the GCC docs:
> > > >
> > > > "16-byte integral types are also allowed if `__int128' (see
> > > > __int128) is supported by the architecture."
> > > >
> > > > This works with x86 -64 -- I assume aarch64 also, but haven't confirmed.
> > > >
> > > > Source:
> > > > https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-
> > > > Builtins.html
> > > (following is based on my reading, not based on experiments) My
> > > understanding is that the __atomic_load_n/store_n were implemented
> > > using a compare-and- swap HW instruction (due to lack of HW 128b
> > > atomic load and store instructions). This introduced the side effect
> > > or store/load respectively. Where as the user does not expect such
> > > side effects. As suggested in [1], it looks like GCC delegated the
> > > implementation to libatomic which 'it seems' uses locks to implement
> > > 128b __atomic intrinsics (needs to be verified)
> > >
> > > If __atomic intrinsics, for any of the supported platforms, do not
> > > have an optimal implementation, I prefer to add a DPDK API as an
> > > abstraction. A given platform can choose to implement such an API
> > > using __atomic intrinsics if it wants. The DPDK API can be similar
> > > to
> > __atomic_compare_exchange_n.
> > >
> >
> > Certainly. From the linked discussions, I see how this would affect
> > the design of (hypothetical functions) rte_atomic128_read() and
> > rte_atomic128_set(), but I don't see anything that suggests (for the
> > architectures being discussed) that __atomic_compare_exchange_n is
> suboptimal.
> I wrote some code and generated assembly to verify what is happening. On
> aarch64, this call is delegated to libatomic and libatomic needs to be linked. In
> the generated assembly, it is clear that it uses locks (pthread mutex lock) to
> provide atomicity for. For 32b and 64b the compiler generates the expected
> inline assembly. Both, ' __atomic_always_lock_free' and '
> __atomic_is_lock_free' return 0 to indicate that 128b __atomic intrinsics are not
> lock free. (gcc - 8.2)
> 
> Out of curiosity, I also did similar experiments on x86 (CPU E5-2660 v4). Even
> after using -mcx16 flag the call is delegated to libatomic. I see the 'lock
> cmpxchg16b' in the generated assembly. But, ' __atomic_always_lock_free' and
> ' __atomic_is_lock_free' return 0 to indicate that 128b __atomic intrinsics are
> NOT lock free with gcc version 7.3.0. However with gcc version 5.4.0, '
> __atomic_is_lock_free' returns 1. I found more discussion at [3]. However, I am
> not an expert on x86.
> 
> [3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
> 
> These problems do not exist with 32b and 64b.
> 

Thanks for investigating this. If GCC doesn't compile optimal code for aarch64 (i.e. LDXP+STXP or CASP) I don't think we have a choice but to use our own implementation for 128-bit atomics, and it makes sense to model the interface after the __atomic instrinsics as you suggested.

> >
> > > [1] https://patchwork.ozlabs.org/patch/721686/
> > > [2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> > >
> > > >
> > > > > DPDK needs to write its own. Since this is the first API in that
> > > > > regard, I prefer that we start with a signature that resembles
> > > > > __atomic intrinsics which have been proven to provide best
> > > > > flexibility for
> > > > all the platforms supported by DPDK.
  

Patch

diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
index fd2ec9c53..34c2addf8 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
@@ -34,6 +34,7 @@ 
 /*
  * Inspired from FreeBSD src/sys/amd64/include/atomic.h
  * Copyright (c) 1998 Doug Rabson
+ * Copyright (c) 2019 Intel Corporation
  * All rights reserved.
  */
 
@@ -208,4 +209,25 @@  static inline void rte_atomic64_clear(rte_atomic64_t *v)
 }
 #endif
 
+static inline int
+rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
+{
+	uint8_t res;
+
+	asm volatile (
+		      MPLOCKED
+		      "cmpxchg16b %[dst];"
+		      " sete %[res]"
+		      : [dst] "=m" (*dst),
+			[res] "=r" (res)
+		      : "c" (src[1]),
+			"b" (src[0]),
+			"m" (*dst),
+			"d" (exp[1]),
+			"a" (exp[0])
+		      : "memory");
+
+	return res;
+}
+
 #endif /* _RTE_ATOMIC_X86_64_H_ */