[dpdk-stable] patch 'ring: guarantee load/load order in enqueue and dequeue' has been queued to LTS release 16.11.4

luca.boccassi at gmail.com
Mon Nov 13 15:49:38 CET 2017


Hi,

FYI, your patch has been queued to LTS release 16.11.4

Note it hasn't been pushed to http://dpdk.org/browse/dpdk-stable yet.
It will be pushed if I get no objections before 11/15/17. So please
shout if anyone has objections.

Thanks.

Kind regards,
Luca Boccassi

---
From 0b16342d527b3ea5ca579274598ff8a52fcb7e8e Mon Sep 17 00:00:00 2001
From: Jia He <jia.he at hxt-semitech.com>
Date: Fri, 10 Nov 2017 03:30:42 +0000
Subject: [PATCH] ring: guarantee load/load order in enqueue and dequeue

[ upstream commit 9bc2cbb007c0a3335c5582357ae9f6d37ea0b654 ]

We observed an rte panic in mbuf_autotest on our Qualcomm arm64 server
(Amberwing).

Root cause:
In __rte_ring_move_cons_head()
...
        do {
                /* Restore n as it may change every loop */
                n = max;

                *old_head = r->cons.head;                //1st load
                const uint32_t prod_tail = r->prod.tail; //2nd load

On weakly ordered architectures (powerpc, arm), the 2nd load might be
reordered before the 1st load, which makes *entries bigger than intended.
This nasty reordering messes up enqueue/dequeue:

cpu1(producer)          cpu2(consumer)          cpu3(consumer)
                        load r->prod.tail
in enqueue:
load r->cons.tail
load r->prod.head

store r->prod.tail

                                                load r->cons.head
                                                load r->prod.tail
                                                ...
                                                store r->cons.{head,tail}
                        load r->cons.head

Then r->cons.head is bigger than the stale prod_tail, the unsigned
subtraction wraps around and makes *entries very big, and the consumer
advances incorrectly.
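
For illustration, here is a minimal standalone sketch of that failure
mode (the variable names follow the ring code; the concrete values are
made up):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                /* Stale prod.tail loaded (reordered) before the other
                 * consumer advanced the ring; fresh cons.head loaded
                 * after it did. */
                uint32_t prod_tail = 5;
                uint32_t cons_head = 8;

                /* The ring computes the number of available entries
                 * with unsigned arithmetic, so a "negative" difference
                 * wraps around to a huge count instead of zero. */
                uint32_t entries = prod_tail - cons_head;

                printf("entries = %u\n", entries); /* 4294967293 */
                return 0;
        }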

After this patch, the old cons.head is recalculated after a failure of
rte_atomic32_cmpset.
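
A condensed sketch of the resulting loop (paraphrased from the 16.11
dequeue path, eliding the size and empty-ring checks, so not the
verbatim code; n_requested stands in for the caller's count):

        uint32_t cons_head, prod_tail, entries;
        unsigned n, max = n_requested;
        int success;

        do {
                n = max;                  /* restore n, it may change */
                cons_head = r->cons.head; /* 1st load */
                rte_smp_rmb();            /* keep 2nd load after 1st */
                prod_tail = r->prod.tail; /* 2nd load */
                entries = prod_tail - cons_head;
                if (n > entries)
                        n = entries;
                /* on failure another consumer won the race; loop and
                 * reload the new cons.head */
                success = rte_atomic32_cmpset(&r->cons.head, cons_head,
                                              cons_head + n);
        } while (success == 0);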

There is no such issue on x86, because x86 has a strongly ordered memory
model. But rte_smp_rmb() has no runtime performance impact on x86, so
keep the same code for all architectures instead of adding
architecture-specific variants.
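
For reference, roughly what rte_smp_rmb() expands to on the two memory
models (paraphrased from the DPDK EAL headers; the exact definitions
vary by release and platform, so treat this as an approximation):

        /* x86: the CPU does not reorder loads with other loads, so
         * only the compiler has to be stopped from reordering; this
         * costs nothing at runtime. */
        #define rte_smp_rmb() asm volatile("" : : : "memory")

        /* arm64: a real hardware load barrier instruction is needed,
         * e.g. "dmb ishld" (inner shareable domain, loads). */
        #define rte_smp_rmb() asm volatile("dmb ishld" : : : "memory")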

Fixes: 50d769054872 ("ring: add burst API")

Signed-off-by: Jia He <jia.he at hxt-semitech.com>
Signed-off-by: Jie Liu <jie2.liu at hxt-semitech.com>
Signed-off-by: Bing Zhao <bing.zhao at hxt-semitech.com>
Acked-by: Jerin Jacob <jerin.jacob at caviumnetworks.com>
Acked-by: Jianbo Liu <jianbo.liu at arm.com>
---
 lib/librte_ring/rte_ring.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 32b8c8d2c..c1b311a8b 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -450,6 +450,12 @@ __rte_ring_mp_do_enqueue(struct rte_ring *r, void * const *obj_table,
 		n = max;
 
 		prod_head = r->prod.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
 		cons_tail = r->cons.tail;
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
@@ -642,6 +648,12 @@ __rte_ring_mc_do_dequeue(struct rte_ring *r, void **obj_table,
 		n = max;
 
 		cons_head = r->cons.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
 		prod_tail = r->prod.tail;
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
-- 
2.11.0


