[RFC] ring: improve ring performance with C11 atomics

Wathsala Wathawana Vithanage wathsala.vithanage at arm.com
Thu Jun 8 15:21:22 CEST 2023


The solution presented in this RFC is not compliant with the C11 memory model.
__atomic_compare_exchange_n updates "expected" only when the CAS fails; on success it leaves "expected" untouched, so the head value later used to address ring elements is the one obtained by the earlier relaxed load, not a value produced by the CAS.
The assumption that the CAS in the producer/consumer head update carries an address dependency to the ring element accesses therefore does not hold under the semantics of the C11 memory model.
As a result, this solution may corrupt ring elements when executed on CPUs with weak memory models.
This RFC is therefore retracted.
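
To make the problem concrete, here is a minimal sketch of the head-update
pattern in question. All names and sizes are illustrative, not the DPDK
source; only the ordering properties matter.

#include <stdint.h>

static uint32_t head;          /* shared head index, illustrative */
static uint32_t ring[1024];    /* ring storage, illustrative      */

uint32_t
move_head_sketch(uint32_t n)
{
	uint32_t old_head, new_head;
	int success;

	old_head = __atomic_load_n(&head, __ATOMIC_RELAXED);
	do {
		new_head = old_head + n;
		/*
		 * On FAILURE the intrinsic writes the current value of
		 * head back into old_head. On SUCCESS it does not touch
		 * old_head at all, so the old_head consumed below comes
		 * from a relaxed load (or a failed CAS's write-back),
		 * and C11 guarantees no address dependency from the
		 * successful CAS to the ring[] access.
		 */
		success = __atomic_compare_exchange_n(&head, &old_head,
				new_head, 0, __ATOMIC_RELAXED,
				__ATOMIC_RELAXED);
	} while (!success);

	/* may be observed before the CAS on weakly ordered CPUs */
	return ring[old_head & 1023];
}

Because the successful exchange leaves old_head untouched, the compiler is
free to keep using the earlier relaxed-loaded value, and nothing in the C11
model orders the element access after the head update.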

> -----Original Message-----
> From: Wathsala Vithanage <wathsala.vithanage at arm.com>
> Sent: Friday, April 21, 2023 3:17 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli at arm.com>;
> konstantin.v.ananyev at yandex.ru; Feifei Wang <Feifei.Wang2 at arm.com>
> Cc: dev at dpdk.org; nd <nd at arm.com>; Wathsala Wathawana Vithanage
> <wathsala.vithanage at arm.com>
> Subject: [RFC] ring: improve ring performance with C11 atomics
> 
> The tail load in __rte_ring_move_cons_head and __rte_ring_move_prod_head
> can be changed from __ATOMIC_ACQUIRE to __ATOMIC_RELAXED.
> To calculate the addresses of the dequeued elements,
> __rte_ring_dequeue_elems uses the old_head updated by the
> __atomic_compare_exchange_n intrinsic in __rte_ring_move_cons_head.
> This creates an address dependency between the two operations, so
> __rte_ring_dequeue_elems cannot be reordered before
> __rte_ring_move_cons_head.
> Similarly, __rte_ring_enqueue_elems and __rte_ring_move_prod_head
> won't be reordered either.
> 
> Performance on Arm N1
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8 (Arm N1)                         |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 766730 | Total count: 651686  | Total count: 812125  |
> |                     |        Gain: -15%    |        Gain: 6%      |
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32 (Arm N1)                        |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 816745 | Total count: 646385  | Total count: 830935  |
> |                     |        Gain: -21%    |        Gain: 2%      |
> +-------------------------------------------------------------------+
> 
> Performance on x86-64 Cascade Lake
> Gain relative to generic implementation
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 8                                  |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 181640 | Total count: 181995  | Total count: 182791  |
> |                     |        Gain: 0.2%    |        Gain: 0.6%    |
> +-------------------------------------------------------------------+
> +-------------------------------------------------------------------+
> | Bulk enq/dequeue count on size 32                                 |
> +-------------------------------------------------------------------+
> | Generic             | C11 atomics          | C11 atomics improved |
> +-------------------------------------------------------------------+
> | Total count: 167495 | Total count: 161536  | Total count: 163190  |
> |                     |        Gain: -3.5%   |        Gain: -2.6%   |
> +-------------------------------------------------------------------+
> 
> Signed-off-by: Wathsala Vithanage <wathsala.vithanage at arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli at arm.com>
> Reviewed-by: Feifei Wang <feifei.wang2 at arm.com>
> ---
>  .mailmap                    |  1 +
>  lib/ring/rte_ring_c11_pvt.h | 18 +++++++++---------
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/.mailmap b/.mailmap
> index 4018f0fc47..367115d134 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1430,6 +1430,7 @@ Walter Heymans <walter.heymans at corigine.com>
>  Wang Sheng-Hui <shhuiw at gmail.com>
>  Wangyu (Eric) <seven.wangyu at huawei.com>
>  Waterman Cao <waterman.cao at intel.com>
> +Wathsala Vithanage <wathsala.vithanage at arm.com>
>  Weichun Chen <weichunx.chen at intel.com>
>  Wei Dai <wei.dai at intel.com>
>  Weifeng Li <liweifeng96 at 126.com>
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index f895950df4..1895f2bb0e 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -24,6 +24,13 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
>  	if (!single)
>  		rte_wait_until_equal_32(&ht->tail, old_val, __ATOMIC_RELAXED);
> 
> +	/*
> +	 * Updating of ht->tail cannot happen before elements are added to or
> +	 * removed from the ring, as it could result in data races between
> +	 * producer and consumer threads. Therefore ht->tail should be updated
> +	 * with release semantics to prevent ring data copy phase from sinking
> +	 * below it.
> +	 */
>  	__atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
>  }
> 
> @@ -69,11 +76,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -		/* load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		cons_tail = __atomic_load_n(&r->cons.tail,
> -					__ATOMIC_ACQUIRE);
> +					__ATOMIC_RELAXED);
> 
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
> @@ -145,12 +149,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  		/* Ensure the head is read before tail */
>  		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> 
> -		/* this load-acquire synchronize with store-release of ht->tail
> -		 * in update_tail.
> -		 */
>  		prod_tail = __atomic_load_n(&r->prod.tail,
> -					__ATOMIC_ACQUIRE);
> -
> +					__ATOMIC_RELAXED);
>  		/* The subtraction is done between two unsigned 32bits value
>  		 * (the result is always modulo 32 bits even if we have
>  		 * cons_head > prod_tail). So 'entries' is always between 0
> --
> 2.25.1


