[dpdk-dev] [PATCH v7 1/3] doc: add generic atomic deprecation section

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Tue Jul 14 20:36:27 CEST 2020


Hi Phil,
	Few comments.

<snip>

> Subject: [PATCH v7 1/3] doc: add generic atomic deprecation section
I think we should change this as follows:
"doc: add optimizations using C11 atomic built-ins"

> 
> Add deprecating the generic rte_atomic_xx APIs to C11 atomic built-ins guide
> and examples.
I think same here, something like following would help:
"Add information about possible optimizations using C11 atomic built-ins"

> 
> Signed-off-by: Phil Yang <phil.yang at arm.com>
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli at arm.com>
> ---
>  doc/guides/prog_guide/writing_efficient_code.rst | 64
> +++++++++++++++++++++++-
>  1 file changed, 63 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/guides/prog_guide/writing_efficient_code.rst
> b/doc/guides/prog_guide/writing_efficient_code.rst
> index 849f63e..16d6188 100644
> --- a/doc/guides/prog_guide/writing_efficient_code.rst
> +++ b/doc/guides/prog_guide/writing_efficient_code.rst
> @@ -167,7 +167,13 @@ but with the added cost of lower throughput.
>  Locks and Atomic Operations
>  ---------------------------
> 
> -Atomic operations imply a lock prefix before the instruction,
> +This section describes some key considerations when using locks and
> +atomic operations in the DPDK environment.
> +
> +Locks
> +~~~~~
> +
> +On x86, atomic operations imply a lock prefix before the instruction,
>  causing the processor's LOCK# signal to be asserted during execution of the
> following instruction.
>  This has a big impact on performance in a multicore environment.
> 
> @@ -176,6 +182,62 @@ It can often be replaced by other solutions like per-
> lcore variables.
>  Also, some locking techniques are more efficient than others.
>  For instance, the Read-Copy-Update (RCU) algorithm can frequently replace
> simple rwlocks.
> 
> +Atomic Operations: Use C11 Atomic Built-ins
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +DPDK generic rte_atomic operations are implemented by `__sync built-ins
> +<https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_.
We can skip the hyper-link. Google search on "__sync built-in functions" leads to this page easily.

> +These __sync built-ins result in full barriers on aarch64, which are
> +unnecessary in many use cases. They can be replaced by `__atomic
> +built-ins

> +<https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html>`_
Same here, we can skip the hyper-link.

> +that conform to the C11 memory model and provide finer memory order
> control.
> +
> +So replacing the rte_atomic operations with __atomic built-ins might
> +improve performance for aarch64 machines.
> +
> +Some typical optimization cases are listed below:
> +
> +Atomicity
> +^^^^^^^^^
> +
> +Some use cases require atomicity alone, the ordering of the memory
> +operations does not matter. For example the packets statistics in
> +``virtio_xmit()`` function of ``vhost`` example application. It just
> +updates the number of transmitted packets, no subsequent logic depends
Instead of referring to the code here, it is better to keep it generic. May be something like below?
"For example, the packet statistics counters need to be incremented atomically but do not need any particular memory ordering. So, RELAXED memory ordering is sufficient".

> +on these counters. So the RELAXED memory ordering is sufficient.
> +
> +One-way Barrier
> +^^^^^^^^^^^^^^^
> +
> +Some use cases allow for memory reordering in one way while requiring
> +memory ordering in the other direction.
> +
Spinlock is a good example, making is little more generic would be helpful.

> +For example, the memory operations before the ``rte_spinlock_lock()``
Instead of referring to "rte_spinlock_lock" can you just say spinlock lock
> +can move to the critical section, but the memory operations in the
     ^^^ are allowed to
> +critical section cannot move above the lock. In this case, the full
                                ^^^^^^ are not allowed to
> +memory barrier in the compare-and-swap operation can be replaced to
                                                                                                                                 ^^ with
> +ACQUIRE. On the other hand, the memory operations after the
     ^^^^^^^^ ACQUIRE memory order.
> +``rte_spinlock_unlock()`` can move to the critical section, but the
                                                  ^^^ are allowed to
> +memory operations in the critical section cannot move below the unlock.
                                                                                ^^^^^^ are not allowed to
> +So the full barrier in the STORE operation can be replaced with RELEASE.
I think this above statement can be replaced with something like below:

"So the full barrier in the store operation can use RELEASE memory order."

> +
> +Reader-Writer Concurrency
> +^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Lock-free reader-writer concurrency is one of the common use cases in DPDK.
> +
> +The payload or the data that the writer wants to communicate to the
> +reader, can be written with RELAXED memory order. However, the guard
> +variable should be written with RELEASE memory order. This ensures that
> +the store to guard variable is observable only after the store to payload is
> observable.
> +Refer to ``rte_hash_cuckoo_insert_mw()`` for an example.
We can remove just this line above.

> +
> +Correspondingly, on the reader side, the guard variable should be read
> +with ACQUIRE memory order. The payload or the data the writer
> +communicated, can be read with RELAXED memory order. This ensures that,
> +if the store to guard variable is observable, the store to payload is also
> observable.
> +Refer to rte_hash ``search_one_bucket_lf()`` for an example.
Same here, suggest removing just the above line.

> +
>  Coding Considerations
>  ---------------------
> 
> --
> 2.7.4



More information about the dev mailing list