[v3,1/2] ring: synchronize the load and store of the tail

Message ID 1540981587-88590-2-git-send-email-gavin.hu@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series ring library with c11 memory model bug fix and optimization |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Gavin Hu Oct. 31, 2018, 10:26 a.m. UTC
  Synchronize the load-acquire of the tail and the store-release
within update_tail, the store release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way for ordering as the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.

The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement, in case of MPMC, with two lcores and ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.

This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting load-
acquire from _atomic_compare_exchange_n up above.

The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i

Test result with this patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.86
 MP/MC bulk enq/dequeue (size: 8): 10.15
 SP/SC bulk enq/dequeue (size: 32): 1.94
 MP/MC bulk enq/dequeue (size: 32): 2.36

In comparison of the test result without this patch:
 SP/SC bulk enq/dequeue (size: 8): 6.67
 MP/MC bulk enq/dequeue (size: 8): 13.12
 SP/SC bulk enq/dequeue (size: 32): 2.04
 MP/MC bulk enq/dequeue (size: 32): 3.26

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)
  

Comments

Stephen Hemminger Oct. 31, 2018, 10:07 p.m. UTC | #1
On Wed, 31 Oct 2018 18:26:26 +0800
Gavin Hu <gavin.hu@arm.com> wrote:

> -		/*
> -		 *  The subtraction is done between two unsigned 32bits value
> +		/* load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		const uint32_t cons_tail

Please don't mix declarations and code. Although it is sometimes used in DPDK,
in general the style is to have declarations at the start of the block scope.
  
Gavin Hu Nov. 1, 2018, 9:56 a.m. UTC | #2
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Thursday, November 1, 2018 6:08 AM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; thomas@monjalon.net; olivier.matz@6wind.com;
> chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com;
> konstantin.ananyev@intel.com; jerin.jacob@caviumnetworks.com;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 1/2] ring: synchronize the load and store
> of the tail
>
> On Wed, 31 Oct 2018 18:26:26 +0800
> Gavin Hu <gavin.hu@arm.com> wrote:
>
> > -/*
> > - *  The subtraction is done between two unsigned 32bits
> value
> > +/* load-acquire synchronize with store-release of ht->tail
> > + * in update_tail.
> > + */
> > +const uint32_t cons_tail
>
> Please don't mix declarations and code. Although it is sometimes used in
> DPDK, in general the style is to have declarations at the start of the block
> scope.

Thanks for review, your comment was address in v4.
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
  

Patch

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 94df3c4..4851763 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -67,13 +67,18 @@  __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		*old_head = __atomic_load_n(&r->prod.head,
 					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
+							__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -131,15 +136,22 @@  __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
+
 		*old_head = __atomic_load_n(&r->cons.head,
 					__ATOMIC_ACQUIRE);
 
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
+					__ATOMIC_ACQUIRE);
+
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)