[dpdk-dev] [PATCH v3 0/5] Add non-blocking ring

Ola Liljedahl Ola.Liljedahl at arm.com
Wed Jan 23 17:29:12 CET 2019


On Wed, 2019-01-23 at 16:02 +0000, Jerin Jacob Kollanukkaran wrote:
> On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> > 
> > On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > > 
> > > v3:
> > >  - Avoid the ABI break by putting 64-bit head and tail values in the same
> > >    cacheline as struct rte_ring's prod and cons members.
> > >  - Don't attempt to compile rte_atomic128_cmpset without
> > >    ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
> > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case someone
> > >    tries to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> > >  - Update the ring mempool to use experimental APIs
> > >  - Clarify that RING_F_NB is only limited to x86_64 currently; ARMv8.1-A
> > >    builds can eventually support it with the CASP instruction.
> > ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> > operation using LDXP/STXP.
> Just wondering what the performance difference would be between CASP and
> LDXP/STXP on a machine that supports LSE?
I think that is up to the microarchitecture. But one of the ideas behind
introducing the LSE atomics was that they should be "better" than the equivalent
code using exclusives. I think non-conditional LDxxx and STxxx atomics could be
better than using exclusives, while conditional atomics (CAS, CASP) might not be
so different. The reason has to do with cache coherency: a core can
speculatively snoop-unique the cache line which is targeted by an atomic
instruction, but to what extent that provides a benefit could depend on
whether the atomic actually performs a store or not.

> 
> I think we cannot detect the presence of LSE support at compile time.
> Right?
Unfortunately, AFAIK GCC doesn't notify the source code that it is targeting
v8.1+ with LSE support. If there were intrinsics for (certain) LSE instructions
(those not generated by the compiler, e.g. STxxx and CASP), we could use
some corresponding preprocessor define to detect the presence of such intrinsics
(such defines exist for other intrinsics, e.g. __ARM_FEATURE_QRDMX for the
SQRDMLAH/SQRDMLSH instructions and corresponding intrinsics).
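
As an illustration of the preprocessor-define pattern mentioned above, here is
a minimal sketch built on __ARM_FEATURE_QRDMX. The EXAMPLE_HAVE_QRDMX macro and
the example_have_qrdmx() function are hypothetical names made up for this
sketch. A define for LSE, if a toolchain provides one, could be tested the same
way; check your compiler's ACLE support rather than relying on this.

```c
#include <stdbool.h>

/* Compile-time feature test: the compiler defines __ARM_FEATURE_QRDMX when
 * the target supports SQRDMLAH/SQRDMLSH. On any other target the macro is
 * simply absent, so the fallback branch is taken. */
#if defined(__ARM_FEATURE_QRDMX)
#define EXAMPLE_HAVE_QRDMX 1
#else
#define EXAMPLE_HAVE_QRDMX 0
#endif

/* Hypothetical helper: reports what was decided at compile time. */
static inline bool example_have_qrdmx(void)
{
    return EXAMPLE_HAVE_QRDMX != 0;
}
```

Because the decision is made by the preprocessor, there is no runtime cost, but
the binary is then tied to the target it was compiled for.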

I have tried to interest the Arm GCC developers in this but have not yet
succeeded. Perhaps if we have more use cases where atomic intrinsics would be
useful, we could convince them to add such intrinsics to the ACLE (ARM C
Language Extensions). But we will never get intrinsics for exclusives; they are
deemed unsafe for explicit use from C. Instead we need to provide inline
assembler that contains the complete exclusives sequence. But in practice, using
inline assembler for LDXR and STXR separately seems to work, as I do in the
lockfree code linked below.

> 
> The dynamic one will be costly like,
Do you think so? Shouldn't this branch be perfectly predictable? Once in a while
it will fall out of the branch history table, but doesn't that mean the
application hasn't been executing this code for some time, so it is not really
performance critical?

> 
> if (hwcaps & HWCAP_ATOMICS) {
> 	casp
> } else {
> 	ldxp
> 	stxp
> }
> 
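
The runtime check in the pseudocode above could be sketched roughly as follows.
This assumes Linux on AArch64 (getauxval() and HWCAP_ATOMICS) and GCC/Clang
__atomic builtins; on other platforms the sketch always takes the fallback
path. All example_* names are hypothetical, and both code paths are stand-ins
that use a 64-bit compiler builtin rather than real CASP or LDXP/STXP sequences.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_ATOMICS */
#endif

/* Ask the kernel whether the CPU implements LSE atomics.
 * Returns false on platforms where the query is not available. */
static bool example_cpu_has_lse(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_ATOMICS)
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
    return false;
#endif
}

/* Stand-in for the CASP path (assumption: a real version would use
 * inline assembler or a compiler-generated LSE instruction). */
static bool example_cas_lse(uint64_t *loc, uint64_t *exp, uint64_t des)
{
    return __atomic_compare_exchange_n(loc, exp, des, false,
                                       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

/* Stand-in for the LDXP/STXP path (same assumption as above). */
static bool example_cas_excl(uint64_t *loc, uint64_t *exp, uint64_t des)
{
    return __atomic_compare_exchange_n(loc, exp, des, false,
                                       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

/* The hwcap query is done once; the hot path only tests a cached flag. */
static bool example_use_lse;

static void example_atomics_init(void)
{
    example_use_lse = example_cpu_has_lse();
}

static inline bool example_cas(uint64_t *loc, uint64_t *exp, uint64_t des)
{
    if (example_use_lse)            /* always goes the same way after init */
        return example_cas_lse(loc, exp, des);
    return example_cas_excl(loc, exp, des);
}
```

Since the flag never changes after example_atomics_init(), the branch in
example_cas() always resolves the same way, which is the basis for expecting it
to predict well.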
> > 
> > From an ARM perspective, I want all atomic operations to take memory
> > ordering arguments (e.g. acquire, release). Not all usages of e.g.
> +1
> 
> > 
> > atomic compare exchange require sequential consistency (which I think
> > is what the x86 cmpxchg instruction provides). DPDK functions should not
> > be modelled after x86 behaviour.
> > 
> > Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
> > are available here:
> > https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
> > 
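
To illustrate what "taking memory ordering arguments" could look like, here is
a minimal sketch using C11 <stdatomic.h> on a 64-bit value. The example_*
functions are hypothetical and are not DPDK or progress64 APIs; a 128-bit
variant would follow the same shape but needs platform-specific support.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical compare-exchange wrapper: the caller picks the ordering
 * for the success and failure cases instead of always paying for
 * sequential consistency. */
static inline bool
example_cas64(_Atomic uint64_t *loc, uint64_t *expected, uint64_t desired,
              memory_order success, memory_order failure)
{
    return atomic_compare_exchange_strong_explicit(loc, expected, desired,
                                                   success, failure);
}

/* Taking a lock only needs acquire ordering on success. */
static inline bool example_trylock(_Atomic uint64_t *lock)
{
    uint64_t unlocked = 0;
    return example_cas64(lock, &unlocked, 1,
                         memory_order_acquire, memory_order_relaxed);
}

/* Releasing the lock only needs release ordering. */
static inline void example_unlock(_Atomic uint64_t *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}

/* Raising a value to a running maximum (e.g. a statistic) needs no
 * ordering at all. Returns the last value observed before any update. */
static inline uint64_t example_fetch_max(_Atomic uint64_t *loc, uint64_t val)
{
    uint64_t old = atomic_load_explicit(loc, memory_order_relaxed);
    while (old < val &&
           !example_cas64(loc, &old, val,
                          memory_order_relaxed, memory_order_relaxed))
        ;   /* on failure, old is refreshed with the current value */
    return old;
}
```

On a weakly ordered architecture such as AArch64, the relaxed and
acquire/release variants can map to cheaper instructions than a sequentially
consistent compare-exchange, which is the motivation for exposing the ordering
to the caller.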
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl
