[dpdk-dev,v2] ring: guarantee ordering of cons/prod loading when doing
Checks
Commit Message
We observed an rte panic in mbuf_autotest on our Qualcomm arm64 server.
For the possible race condition, please refer to [1].
Furthermore, there are 2 options, as suggested by Jerin:
1. use rte_smp_rmb
2. use load_acquire/store_release (refer to [2]).
CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by
default it is n.
The reason for providing 2 options is the difference in performance
benchmark results on different arm machines; please refer to [3].
Already functionally tested on the following machines:
on x86 (passed the compilation)
on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=y
on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=n
[1] http://dpdk.org/ml/archives/dev/2017-October/078366.html
[2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
[3] http://dpdk.org/ml/archives/dev/2017-October/080861.html
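For illustration, the two options amount to the following minimal sketch (hypothetical helper and struct names, GCC __atomic builtins; this is not the patch code itself):

```c
#include <stdint.h>

/* Stand-in for the prod/cons index pairs in struct rte_ring. */
struct headtail {
	uint32_t head;
	uint32_t tail;
};

/* Option 1: an explicit read barrier between the two loads
 * (rte_smp_rmb() in DPDK; modeled here with an acquire fence). */
static inline uint32_t
distance_rmb(struct headtail *prod, struct headtail *cons)
{
	uint32_t old_head = prod->head;
	__atomic_thread_fence(__ATOMIC_ACQUIRE);
	/* unsigned distance used in the free/used-entries calculation */
	return cons->tail - old_head;
}

/* Option 2: a one-direction load-acquire on the head; it pairs with a
 * store-release issued when the opposite tail is updated. */
static inline uint32_t
distance_acq(struct headtail *prod, struct headtail *cons)
{
	uint32_t old_head = __atomic_load_n(&prod->head, __ATOMIC_ACQUIRE);
	return cons->tail - old_head;
}
```

Which of the two is faster depends on the microarchitecture, hence the config option.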
---
Changelog:
V2: let users choose whether using load_acquire/store_release
V1: rte_smp_rmb() between 2 loads
Signed-off-by: Jia He <hejianet@gmail.com>
Signed-off-by: jie2.liu@hxt-semitech.com
Signed-off-by: bing.zhao@hxt-semitech.com
Signed-off-by: jia.he@hxt-semitech.com
Suggested-by: jerin.jacob@caviumnetworks.com
---
lib/librte_ring/Makefile | 4 +++-
lib/librte_ring/rte_ring.h | 38 ++++++++++++++++++++++++------
lib/librte_ring/rte_ring_arm64.h | 48 ++++++++++++++++++++++++++++++++++++++
lib/librte_ring/rte_ring_generic.h | 45 +++++++++++++++++++++++++++++++++++
4 files changed, 127 insertions(+), 8 deletions(-)
create mode 100644 lib/librte_ring/rte_ring_arm64.h
create mode 100644 lib/librte_ring/rte_ring_generic.h
Comments
Hi Jia,
> -----Original Message-----
> From: Jia He [mailto:hejianet@gmail.com]
> Sent: Thursday, November 2, 2017 8:44 AM
> To: jerin.jacob@caviumnetworks.com; dev@dpdk.org; olivier.matz@6wind.com
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; jianbo.liu@arm.com;
> hemant.agrawal@nxp.com; Jia He <hejianet@gmail.com>; jie2.liu@hxt-semitech.com; bing.zhao@hxt-semitech.com; jia.he@hxt-
> semitech.com
> Subject: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
>
> We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
> As for the possible race condition, please refer to [1].
>
> Furthermore, there are 2 options as suggested by Jerin:
> 1. use rte_smp_rmb
> 2. use load_acquire/store_release(refer to [2]).
> CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by
> default it is n;
>
> The reason why providing 2 options is due to the performance benchmark
> difference in different arm machines, please refer to [3].
>
> Already functionally tested on the machines as follows:
> on X86(passed the compilation)
> on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=y
> on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=n
>
> [1] http://dpdk.org/ml/archives/dev/2017-October/078366.html
> [2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
> [3] http://dpdk.org/ml/archives/dev/2017-October/080861.html
>
> ---
> Changelog:
> V2: let users choose whether using load_acquire/store_release
> V1: rte_smp_rmb() between 2 loads
>
> Signed-off-by: Jia He <hejianet@gmail.com>
> Signed-off-by: jie2.liu@hxt-semitech.com
> Signed-off-by: bing.zhao@hxt-semitech.com
> Signed-off-by: jia.he@hxt-semitech.com
> Suggested-by: jerin.jacob@caviumnetworks.com
> ---
> lib/librte_ring/Makefile | 4 +++-
> lib/librte_ring/rte_ring.h | 38 ++++++++++++++++++++++++------
> lib/librte_ring/rte_ring_arm64.h | 48 ++++++++++++++++++++++++++++++++++++++
> lib/librte_ring/rte_ring_generic.h | 45 +++++++++++++++++++++++++++++++++++
> 4 files changed, 127 insertions(+), 8 deletions(-)
> create mode 100644 lib/librte_ring/rte_ring_arm64.h
> create mode 100644 lib/librte_ring/rte_ring_generic.h
>
> diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> index e34d9d9..fa57a86 100644
> --- a/lib/librte_ring/Makefile
> +++ b/lib/librte_ring/Makefile
> @@ -45,6 +45,8 @@ LIBABIVER := 1
> SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
>
> # install includes
> -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
> +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> + rte_ring_arm64.h \
> + rte_ring_generic.h
>
> include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index 5e9b3b7..943b1f9 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -103,6 +103,18 @@ extern "C" {
> #include <rte_memzone.h>
> #include <rte_pause.h>
>
> +/* In those strong memory models (e.g. x86), there is no need to add rmb()
> + * between load and load.
> + * In those weak models(powerpc/arm), there are 2 choices for the users
> + * 1.use rmb() memory barrier
> + * 2.use one-direction load_acquire/store_release barrier
> + * It depends on performance test results. */
> +#ifdef RTE_ARCH_ARM64
> +#include "rte_ring_arm64.h"
> +#else
> +#include "rte_ring_generic.h"
> +#endif
> +
> #define RTE_TAILQ_RING_NAME "RTE_RING"
>
> enum rte_ring_queue_behavior {
> @@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
> while (unlikely(ht->tail != old_val))
> rte_pause();
>
> - ht->tail = new_val;
> + arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
> }
>
> /**
> @@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> /* Reset n to the initial burst count */
> n = max;
>
> - *old_head = r->prod.head;
> + *old_head = arch_rte_atomic_load(&r->prod.head,
> + __ATOMIC_ACQUIRE);
> const uint32_t cons_tail = r->cons.tail;
The code starts to look a bit messy with all these arch-specific macros...
So I wonder, wouldn't it be cleaner to:
1. move existing __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
into rte_ring_generic.h
2. Add rte_smp_rmb into generic __rte_ring_move_prod_head/__rte_ring_move_cons_head
(as was in v1 of your patch).
3. Introduce ARM specific versions of __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
in the rte_ring_arm64.h
That way we will keep the generic code simple and clean, while still allowing arch-specific optimizations.
> /*
> * The subtraction is done between two unsigned 32bits value
> @@ -430,8 +443,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> if (is_sp)
> r->prod.head = *new_head, success = 1;
> else
> - success = rte_atomic32_cmpset(&r->prod.head,
> - *old_head, *new_head);
> + success = arch_rte_atomic32_cmpset(&r->prod.head,
> + old_head, *new_head,
> + 0, __ATOMIC_ACQUIRE,
> + __ATOMIC_RELAXED);
> } while (unlikely(success == 0));
> return n;
> }
> @@ -470,7 +485,10 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
> goto end;
>
> ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
> +
> +#ifndef RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER
I wonder why do we need that macro?
Would be there situations when smp_wmb() are not needed here?
Konstantin
Hi Ananyev
On 11/2/2017 9:26 PM, Ananyev, Konstantin Wrote:
> Hi Jia,
>
>> -----Original Message-----
>> From: Jia He [mailto:hejianet@gmail.com]
>> Sent: Thursday, November 2, 2017 8:44 AM
>> To: jerin.jacob@caviumnetworks.com; dev@dpdk.org; olivier.matz@6wind.com
>> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; jianbo.liu@arm.com;
>> hemant.agrawal@nxp.com; Jia He <hejianet@gmail.com>; jie2.liu@hxt-semitech.com; bing.zhao@hxt-semitech.com; jia.he@hxt-
>> semitech.com
>> Subject: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
>>
>> We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
>> As for the possible race condition, please refer to [1].
>>
>> Furthermore, there are 2 options as suggested by Jerin:
>> 1. use rte_smp_rmb
>> 2. use load_acquire/store_release(refer to [2]).
>> CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by
>> default it is n;
>>
>> The reason why providing 2 options is due to the performance benchmark
>> difference in different arm machines, please refer to [3].
>>
>> Already functionally tested on the machines as follows:
>> on X86(passed the compilation)
>> on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=y
>> on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=n
>>
>> [1] http://dpdk.org/ml/archives/dev/2017-October/078366.html
>> [2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
>> [3] http://dpdk.org/ml/archives/dev/2017-October/080861.html
>>
>> ---
>> Changelog:
>> V2: let users choose whether using load_acquire/store_release
>> V1: rte_smp_rmb() between 2 loads
>>
>> Signed-off-by: Jia He <hejianet@gmail.com>
>> Signed-off-by: jie2.liu@hxt-semitech.com
>> Signed-off-by: bing.zhao@hxt-semitech.com
>> Signed-off-by: jia.he@hxt-semitech.com
>> Suggested-by: jerin.jacob@caviumnetworks.com
>> ---
>> lib/librte_ring/Makefile | 4 +++-
>> lib/librte_ring/rte_ring.h | 38 ++++++++++++++++++++++++------
>> lib/librte_ring/rte_ring_arm64.h | 48 ++++++++++++++++++++++++++++++++++++++
>> lib/librte_ring/rte_ring_generic.h | 45 +++++++++++++++++++++++++++++++++++
>> 4 files changed, 127 insertions(+), 8 deletions(-)
>> create mode 100644 lib/librte_ring/rte_ring_arm64.h
>> create mode 100644 lib/librte_ring/rte_ring_generic.h
>>
>> diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
>> index e34d9d9..fa57a86 100644
>> --- a/lib/librte_ring/Makefile
>> +++ b/lib/librte_ring/Makefile
>> @@ -45,6 +45,8 @@ LIBABIVER := 1
>> SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
>>
>> # install includes
>> -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
>> +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
>> + rte_ring_arm64.h \
>> + rte_ring_generic.h
>>
>> include $(RTE_SDK)/mk/rte.lib.mk
>> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
>> index 5e9b3b7..943b1f9 100644
>> --- a/lib/librte_ring/rte_ring.h
>> +++ b/lib/librte_ring/rte_ring.h
>> @@ -103,6 +103,18 @@ extern "C" {
>> #include <rte_memzone.h>
>> #include <rte_pause.h>
>>
>> +/* In those strong memory models (e.g. x86), there is no need to add rmb()
>> + * between load and load.
>> + * In those weak models(powerpc/arm), there are 2 choices for the users
>> + * 1.use rmb() memory barrier
>> + * 2.use one-direction load_acquire/store_release barrier
>> + * It depends on performance test results. */
>> +#ifdef RTE_ARCH_ARM64
>> +#include "rte_ring_arm64.h"
>> +#else
>> +#include "rte_ring_generic.h"
>> +#endif
>> +
>> #define RTE_TAILQ_RING_NAME "RTE_RING"
>>
>> enum rte_ring_queue_behavior {
>> @@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
>> while (unlikely(ht->tail != old_val))
>> rte_pause();
>>
>> - ht->tail = new_val;
>> + arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
>> }
>>
>> /**
>> @@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
>> /* Reset n to the initial burst count */
>> n = max;
>>
>> - *old_head = r->prod.head;
>> + *old_head = arch_rte_atomic_load(&r->prod.head,
>> + __ATOMIC_ACQUIRE);
>> const uint32_t cons_tail = r->cons.tail;
> The code starts to look a bit messy with all these arch-specific macros...
> So I wonder, wouldn't it be cleaner to:
>
> 1. move existing __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
> into rte_ring_generic.h
> 2. Add rte_smp_rmb into generic __rte_ring_move_prod_head/__rte_ring_move_cons_head
> (as was in v1 of your patch).
> 3. Introduce ARM specific versions of __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
> in the rte_ring_arm64.h
>
> That way we will keep the generic code simple and clean, while still allowing arch-specific optimizations.
Thanks for your review.
But as per your suggestion, there will be at least 2 copies of
__rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail.
Thus, if there are any bugs in the future, both copies will have to be
changed, right?
>
>> /*
>> * The subtraction is done between two unsigned 32bits value
>> @@ -430,8 +443,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
>> if (is_sp)
>> r->prod.head = *new_head, success = 1;
>> else
>> - success = rte_atomic32_cmpset(&r->prod.head,
>> - *old_head, *new_head);
>> + success = arch_rte_atomic32_cmpset(&r->prod.head,
>> + old_head, *new_head,
>> + 0, __ATOMIC_ACQUIRE,
>> + __ATOMIC_RELAXED);
>> } while (unlikely(success == 0));
>> return n;
>> }
>> @@ -470,7 +485,10 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
>> goto end;
>>
>> ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
>> +
>> +#ifndef RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER
> I wonder why do we need that macro?
> Would be there situations when smp_wmb() are not needed here?
If the dpdk user chooses the acquire/release config, the store_release
barrier in update_tail, together with the load_acquire barrier in
__rte_ring_move_{prod,cons}_head, guarantees the ordering. So smp_wmb()
is not required here. Please refer to the freebsd ring implementation
and Jerin's debug patch.
https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
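The pairing described above can be modeled as a tiny message-passing sketch (C11 stdatomic, hypothetical names; the real code operates on ring slots and the prod/cons tails):

```c
#include <stdatomic.h>
#include <stdint.h>

static uint32_t slot;          /* stands in for a ring entry (ENQUEUE_PTRS) */
static _Atomic uint32_t tail;  /* stands in for prod.tail */

/* Producer: write the entry, then publish it with a store-release on the
 * tail, as update_tail() does under the acquire/release config. */
static void
publish(uint32_t val, uint32_t new_tail)
{
	slot = val;
	atomic_store_explicit(&tail, new_tail, memory_order_release);
}

/* Consumer: a load-acquire on the tail orders the subsequent slot read
 * after it, so no separate smp_wmb()/smp_rmb() is needed on this path. */
static int
try_read(uint32_t expected_tail, uint32_t *out)
{
	if (atomic_load_explicit(&tail, memory_order_acquire) != expected_tail)
		return 0;
	*out = slot;
	return 1;
}
```

If `try_read()` observes the released tail value, C11 guarantees it also observes the slot write that happened before the release store.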
---
Cheers,
Jia
> Konstantin
>
>
> -----Original Message-----
> From: Jia He [mailto:hejianet@gmail.com]
> Sent: Thursday, November 2, 2017 3:43 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; jerin.jacob@caviumnetworks.com; dev@dpdk.org; olivier.matz@6wind.com
> Cc: Richardson, Bruce <bruce.richardson@intel.com>; jianbo.liu@arm.com; hemant.agrawal@nxp.com; jie2.liu@hxt-semitech.com;
> bing.zhao@hxt-semitech.com; jia.he@hxt-semitech.com
> Subject: Re: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
>
> Hi Ananyev
>
>
> On 11/2/2017 9:26 PM, Ananyev, Konstantin Wrote:
> > Hi Jia,
> >
> >> -----Original Message-----
> >> From: Jia He [mailto:hejianet@gmail.com]
> >> Sent: Thursday, November 2, 2017 8:44 AM
> >> To: jerin.jacob@caviumnetworks.com; dev@dpdk.org; olivier.matz@6wind.com
> >> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; jianbo.liu@arm.com;
> >> hemant.agrawal@nxp.com; Jia He <hejianet@gmail.com>; jie2.liu@hxt-semitech.com; bing.zhao@hxt-semitech.com; jia.he@hxt-
> >> semitech.com
> >> Subject: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
> >>
> >> We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
> >> As for the possible race condition, please refer to [1].
> >>
> >> Furthermore, there are 2 options as suggested by Jerin:
> >> 1. use rte_smp_rmb
> >> 2. use load_acquire/store_release(refer to [2]).
> >> CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by
> >> default it is n;
> >>
> >> The reason why providing 2 options is due to the performance benchmark
> >> difference in different arm machines, please refer to [3].
> >>
> > >> Already functionally tested on the machines as follows:
> >> on X86(passed the compilation)
> >> on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=y
> >> on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=n
> >>
> >> [1] http://dpdk.org/ml/archives/dev/2017-October/078366.html
> >> [2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
> >> [3] http://dpdk.org/ml/archives/dev/2017-October/080861.html
> >>
> >> ---
> >> Changelog:
> >> V2: let users choose whether using load_acquire/store_release
> >> V1: rte_smp_rmb() between 2 loads
> >>
> >> Signed-off-by: Jia He <hejianet@gmail.com>
> >> Signed-off-by: jie2.liu@hxt-semitech.com
> >> Signed-off-by: bing.zhao@hxt-semitech.com
> >> Signed-off-by: jia.he@hxt-semitech.com
> >> Suggested-by: jerin.jacob@caviumnetworks.com
> >> ---
> >> lib/librte_ring/Makefile | 4 +++-
> >> lib/librte_ring/rte_ring.h | 38 ++++++++++++++++++++++++------
> >> lib/librte_ring/rte_ring_arm64.h | 48 ++++++++++++++++++++++++++++++++++++++
> >> lib/librte_ring/rte_ring_generic.h | 45 +++++++++++++++++++++++++++++++++++
> >> 4 files changed, 127 insertions(+), 8 deletions(-)
> >> create mode 100644 lib/librte_ring/rte_ring_arm64.h
> >> create mode 100644 lib/librte_ring/rte_ring_generic.h
> >>
> >> diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> >> index e34d9d9..fa57a86 100644
> >> --- a/lib/librte_ring/Makefile
> >> +++ b/lib/librte_ring/Makefile
> >> @@ -45,6 +45,8 @@ LIBABIVER := 1
> >> SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
> >>
> >> # install includes
> >> -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
> >> +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> >> + rte_ring_arm64.h \
> >> + rte_ring_generic.h
> >>
> >> include $(RTE_SDK)/mk/rte.lib.mk
> >> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> >> index 5e9b3b7..943b1f9 100644
> >> --- a/lib/librte_ring/rte_ring.h
> >> +++ b/lib/librte_ring/rte_ring.h
> >> @@ -103,6 +103,18 @@ extern "C" {
> >> #include <rte_memzone.h>
> >> #include <rte_pause.h>
> >>
> >> +/* In those strong memory models (e.g. x86), there is no need to add rmb()
> >> + * between load and load.
> >> + * In those weak models(powerpc/arm), there are 2 choices for the users
> >> + * 1.use rmb() memory barrier
> > >> + * 2.use one-direction load_acquire/store_release barrier
> >> + * It depends on performance test results. */
> >> +#ifdef RTE_ARCH_ARM64
> >> +#include "rte_ring_arm64.h"
> >> +#else
> >> +#include "rte_ring_generic.h"
> >> +#endif
> >> +
> >> #define RTE_TAILQ_RING_NAME "RTE_RING"
> >>
> >> enum rte_ring_queue_behavior {
> >> @@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
> >> while (unlikely(ht->tail != old_val))
> >> rte_pause();
> >>
> >> - ht->tail = new_val;
> >> + arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
> >> }
> >>
> >> /**
> >> @@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> >> /* Reset n to the initial burst count */
> >> n = max;
> >>
> >> - *old_head = r->prod.head;
> >> + *old_head = arch_rte_atomic_load(&r->prod.head,
> >> + __ATOMIC_ACQUIRE);
> >> const uint32_t cons_tail = r->cons.tail;
> > The code starts to look a bit messy with all these arch-specific macros...
> > So I wonder, wouldn't it be cleaner to:
> >
> > 1. move existing __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
> > into rte_ring_generic.h
> > 2. Add rte_smp_rmb into generic __rte_ring_move_prod_head/__rte_ring_move_cons_head
> > (as was in v1 of your patch).
> > 3. Introduce ARM specific versions of __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
> > in the rte_ring_arm64.h
> >
> > That way we will keep the generic code simple and clean, while still allowing arch-specific optimizations.
> Thanks for your review.
> But as per your suggestion, there will be at least 2 copies of
> __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail.
> Thus, if there are any bugs in the future, both copies will have to be
> changed, right?
Yes, there would be some code duplication.
Though the generic code would be much cleaner and simpler to read in that case,
which I think is worth some duplication.
> >
> >> /*
> >> * The subtraction is done between two unsigned 32bits value
> >> @@ -430,8 +443,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> >> if (is_sp)
> >> r->prod.head = *new_head, success = 1;
> >> else
> >> - success = rte_atomic32_cmpset(&r->prod.head,
> >> - *old_head, *new_head);
> >> + success = arch_rte_atomic32_cmpset(&r->prod.head,
> >> + old_head, *new_head,
> >> + 0, __ATOMIC_ACQUIRE,
> >> + __ATOMIC_RELAXED);
> >> } while (unlikely(success == 0));
> >> return n;
> >> }
> >> @@ -470,7 +485,10 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
> >> goto end;
> >>
> >> ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
> >> +
> >> +#ifndef RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER
> > I wonder why do we need that macro?
> > Would be there situations when smp_wmb() are not needed here?
> If the dpdk user chooses the acquire/release config, the store_release
> barrier in update_tail, together with the load_acquire barrier in
> __rte_ring_move_{prod,cons}_head, guarantees the ordering. So smp_wmb()
> is not required here. Please refer to the freebsd ring implementation
> and Jerin's debug patch.
> https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h
> https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
Ok, so because you are doing
arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
inside update_tail(), you don't need mb() anymore here...
In that case, can we perhaps hide all that logic inside update_tail()?
I.e. move smp_rmb/smp_wmb into update_tail(), and then for arm64 you'll
have your special one with atomic_store().
Something like that for the generic version:
static __rte_always_inline void
update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
-		uint32_t single)
+		uint32_t single, uint32_t enqueue)
{
+	if (enqueue)
+		rte_smp_wmb();
+	else
+		rte_smp_rmb();
+
	/*
	 * If there are other enqueues/dequeues in progress that preceded us,
	 * we need to wait for them to complete
	 */
	if (!single)
		while (unlikely(ht->tail != old_val))
			rte_pause();

	ht->tail = new_val;
}
....
static __rte_always_inline unsigned int
__rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
		 unsigned int n, enum rte_ring_queue_behavior behavior,
		 int is_sp, unsigned int *free_space)
{
.....
	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
-	rte_smp_wmb();
-	update_tail(&r->prod, prod_head, prod_next, is_sp);
+	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
....
static __rte_always_inline unsigned int
__rte_ring_do_dequeue(struct rte_ring *r, void **obj_table,
		unsigned int n, enum rte_ring_queue_behavior behavior,
		int is_sc, unsigned int *available)
{
....
	DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *);
-	rte_smp_rmb();
-	update_tail(&r->cons, cons_head, cons_next, is_sc);
+	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
Konstantin
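Assembled as a standalone sketch, the generic update_tail() proposed above would read as follows (barriers and rte_pause() are stubbed for illustration; this is not the actual DPDK source):

```c
#include <stdint.h>

/* Stubs for illustration only; DPDK provides the real definitions. */
#define rte_smp_wmb() __atomic_thread_fence(__ATOMIC_RELEASE)
#define rte_smp_rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define rte_pause()   do { } while (0)

struct rte_ring_headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
};

static inline void
update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
	    uint32_t new_val, uint32_t single, uint32_t enqueue)
{
	/* The barrier moves inside: enqueue orders the entry stores before
	 * the tail store; dequeue orders the entry loads before it. */
	if (enqueue)
		rte_smp_wmb();
	else
		rte_smp_rmb();

	/* Wait for preceding enqueues/dequeues by other lcores to finish. */
	if (!single)
		while (ht->tail != old_val)
			rte_pause();

	ht->tail = new_val;
}
```

An arm64/C11 variant would keep the same signature but drop the explicit barriers and do the final store with release semantics instead.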
>
> ---
> Cheers,
> Jia
> > Konstantin
> >
> >
>
> --
> Cheers,
> Jia
-----Original Message-----
> Date: Thu, 2 Nov 2017 16:16:33 +0000
> From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
> To: Jia He <hejianet@gmail.com>, "jerin.jacob@caviumnetworks.com"
> <jerin.jacob@caviumnetworks.com>, "dev@dpdk.org" <dev@dpdk.org>,
> "olivier.matz@6wind.com" <olivier.matz@6wind.com>
> CC: "Richardson, Bruce" <bruce.richardson@intel.com>, "jianbo.liu@arm.com"
> <jianbo.liu@arm.com>, "hemant.agrawal@nxp.com" <hemant.agrawal@nxp.com>,
> "jie2.liu@hxt-semitech.com" <jie2.liu@hxt-semitech.com>,
> "bing.zhao@hxt-semitech.com" <bing.zhao@hxt-semitech.com>,
> "jia.he@hxt-semitech.com" <jia.he@hxt-semitech.com>
> Subject: RE: [PATCH v2] ring: guarantee ordering of cons/prod loading when
> doing
>
>
>
> > -----Original Message-----
> > From: Jia He [mailto:hejianet@gmail.com]
> > Sent: Thursday, November 2, 2017 3:43 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; jerin.jacob@caviumnetworks.com; dev@dpdk.org; olivier.matz@6wind.com
> > Cc: Richardson, Bruce <bruce.richardson@intel.com>; jianbo.liu@arm.com; hemant.agrawal@nxp.com; jie2.liu@hxt-semitech.com;
> > bing.zhao@hxt-semitech.com; jia.he@hxt-semitech.com
> > Subject: Re: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
> >
> > Hi Ananyev
> >
> >
> > On 11/2/2017 9:26 PM, Ananyev, Konstantin Wrote:
> > > Hi Jia,
> > >
> > >> -----Original Message-----
> > >> From: Jia He [mailto:hejianet@gmail.com]
> > >> Sent: Thursday, November 2, 2017 8:44 AM
> > >> To: jerin.jacob@caviumnetworks.com; dev@dpdk.org; olivier.matz@6wind.com
> > >> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; jianbo.liu@arm.com;
> > >> hemant.agrawal@nxp.com; Jia He <hejianet@gmail.com>; jie2.liu@hxt-semitech.com; bing.zhao@hxt-semitech.com; jia.he@hxt-
> > >> semitech.com
> > >> Subject: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
> > >>
> > >> We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
> > >> As for the possible race condition, please refer to [1].
> > >>
> > >> Furthermore, there are 2 options as suggested by Jerin:
> > >> 1. use rte_smp_rmb
> > >> 2. use load_acquire/store_release(refer to [2]).
> > >> CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by
> > >> default it is n;
> > >>
> > >> The reason why providing 2 options is due to the performance benchmark
> > >> difference in different arm machines, please refer to [3].
> > >>
> > >> Already functionally tested on the machines as follows:
> > >> on X86(passed the compilation)
> > >> on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=y
> > >> on arm64 with CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER=n
> > >>
> > >> [1] http://dpdk.org/ml/archives/dev/2017-October/078366.html
> > >> [2] https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L170
> > >> [3] http://dpdk.org/ml/archives/dev/2017-October/080861.html
> > >>
> > >> ---
> > >> Changelog:
> > >> V2: let users choose whether using load_acquire/store_release
> > >> V1: rte_smp_rmb() between 2 loads
> > >>
> > >> Signed-off-by: Jia He <hejianet@gmail.com>
> > >> Signed-off-by: jie2.liu@hxt-semitech.com
> > >> Signed-off-by: bing.zhao@hxt-semitech.com
> > >> Signed-off-by: jia.he@hxt-semitech.com
> > >> Suggested-by: jerin.jacob@caviumnetworks.com
> > >> ---
> > >> lib/librte_ring/Makefile | 4 +++-
> > >> lib/librte_ring/rte_ring.h | 38 ++++++++++++++++++++++++------
> > >> lib/librte_ring/rte_ring_arm64.h | 48 ++++++++++++++++++++++++++++++++++++++
> > >> lib/librte_ring/rte_ring_generic.h | 45 +++++++++++++++++++++++++++++++++++
> > >> 4 files changed, 127 insertions(+), 8 deletions(-)
> > >> create mode 100644 lib/librte_ring/rte_ring_arm64.h
> > >> create mode 100644 lib/librte_ring/rte_ring_generic.h
> > >>
> > >> diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> > >> index e34d9d9..fa57a86 100644
> > >> --- a/lib/librte_ring/Makefile
> > >> +++ b/lib/librte_ring/Makefile
> > >> @@ -45,6 +45,8 @@ LIBABIVER := 1
> > >> SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
> > >>
> > >> # install includes
> > >> -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
> > >> +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> > >> + rte_ring_arm64.h \
> > >> + rte_ring_generic.h
> > >>
> > >> include $(RTE_SDK)/mk/rte.lib.mk
> > >> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > >> index 5e9b3b7..943b1f9 100644
> > >> --- a/lib/librte_ring/rte_ring.h
> > >> +++ b/lib/librte_ring/rte_ring.h
> > >> @@ -103,6 +103,18 @@ extern "C" {
> > >> #include <rte_memzone.h>
> > >> #include <rte_pause.h>
> > >>
> > >> +/* In those strong memory models (e.g. x86), there is no need to add rmb()
> > >> + * between load and load.
> > >> + * In those weak models(powerpc/arm), there are 2 choices for the users
> > >> + * 1.use rmb() memory barrier
> > >> + * 2.use one-direction load_acquire/store_release barrier
> > >> + * It depends on performance test results. */
> > >> +#ifdef RTE_ARCH_ARM64
> > >> +#include "rte_ring_arm64.h"
> > >> +#else
> > >> +#include "rte_ring_generic.h"
> > >> +#endif
> > >> +
> > >> #define RTE_TAILQ_RING_NAME "RTE_RING"
> > >>
> > >> enum rte_ring_queue_behavior {
> > >> @@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
> > >> while (unlikely(ht->tail != old_val))
> > >> rte_pause();
> > >>
> > >> - ht->tail = new_val;
> > >> + arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
> > >> }
> > >>
> > >> /**
> > >> @@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> > >> /* Reset n to the initial burst count */
> > >> n = max;
> > >>
> > >> - *old_head = r->prod.head;
> > >> + *old_head = arch_rte_atomic_load(&r->prod.head,
> > >> + __ATOMIC_ACQUIRE);
> > >> const uint32_t cons_tail = r->cons.tail;
> > > The code starts to look a bit messy with all these arch-specific macros...
> > > So I wonder, wouldn't it be cleaner to:
> > >
> > > 1. move existing __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
> > > into rte_ring_generic.h
> > > 2. Add rte_smp_rmb into generic __rte_ring_move_prod_head/__rte_ring_move_cons_head
> > > (as was in v1 of your patch).
> > > 3. Introduce ARM specific versions of __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail
> > > in the rte_ring_arm64.h
> > >
> > > That way we will keep the generic code simple and clean, while still allowing arch-specific optimizations.
> > Thanks for your review.
> > But as per your suggestion, there will be at least 2 copies of
> > __rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail.
> > Thus, if there are any bugs in the future, both copies will have to be
> > changed, right?
>
> Yes, there would be some code duplication.
> Though the generic code would be much cleaner and simpler to read in that case,
> which I think is worth some duplication.
I agree with Konstantin here.
-----Original Message-----
> Date: Thu, 2 Nov 2017 08:43:30 +0000
> From: Jia He <hejianet@gmail.com>
> To: jerin.jacob@caviumnetworks.com, dev@dpdk.org, olivier.matz@6wind.com
> Cc: konstantin.ananyev@intel.com, bruce.richardson@intel.com,
> jianbo.liu@arm.com, hemant.agrawal@nxp.com, Jia He <hejianet@gmail.com>,
> jie2.liu@hxt-semitech.com, bing.zhao@hxt-semitech.com,
> jia.he@hxt-semitech.com
> Subject: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
> X-Mailer: git-send-email 2.7.4
>
> We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
> As for the possible race condition, please refer to [1].
Hi Jia,
In addition to Konstantin's comments, please find below some review
comments.
>
> Furthermore, there are 2 options as suggested by Jerin:
> 1. use rte_smp_rmb
> 2. use load_acquire/store_release(refer to [2]).
> CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by
I think a better name would be CONFIG_RTE_RING_USE_C11_MEM_MODEL
or something like that.
> default it is n;
>
> ---
> Changelog:
> V2: let users choose whether using load_acquire/store_release
> V1: rte_smp_rmb() between 2 loads
>
> Signed-off-by: Jia He <hejianet@gmail.com>
> Signed-off-by: jie2.liu@hxt-semitech.com
> Signed-off-by: bing.zhao@hxt-semitech.com
> Signed-off-by: jia.he@hxt-semitech.com
> Suggested-by: jerin.jacob@caviumnetworks.com
> ---
> lib/librte_ring/Makefile | 4 +++-
> lib/librte_ring/rte_ring.h | 38 ++++++++++++++++++++++++------
> lib/librte_ring/rte_ring_arm64.h | 48 ++++++++++++++++++++++++++++++++++++++
> lib/librte_ring/rte_ring_generic.h | 45 +++++++++++++++++++++++++++++++++++
> 4 files changed, 127 insertions(+), 8 deletions(-)
> create mode 100644 lib/librte_ring/rte_ring_arm64.h
> create mode 100644 lib/librte_ring/rte_ring_generic.h
>
> diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> index e34d9d9..fa57a86 100644
> --- a/lib/librte_ring/Makefile
> +++ b/lib/librte_ring/Makefile
> @@ -45,6 +45,8 @@ LIBABIVER := 1
> SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
>
> # install includes
> -SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
> +SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> + rte_ring_arm64.h \
It is really not specific to arm64. We could rename it to rte_ring_c11_mem.h or
something like that to reflect that the implementation is based on the C11
memory model.
> + rte_ring_generic.h
>
> include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index 5e9b3b7..943b1f9 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -103,6 +103,18 @@ extern "C" {
> #include <rte_memzone.h>
> #include <rte_pause.h>
>
> +/* In strong memory models (e.g. x86), there is no need to add an rmb()
> + * between two loads.
> + * In weak models (powerpc/arm), there are 2 choices for the users:
> + * 1. use an rmb() memory barrier
> + * 2. use a one-direction load_acquire/store_release barrier
> + * It depends on performance test results. */
> +#ifdef RTE_ARCH_ARM64
s/RTE_ARCH_ARM64/RTE_RING_USE_C11_MEM_MODEL/ and update the generic arm64 config.
That way it can be used by another architecture, like ppc, if it chooses to do so.
> +#include "rte_ring_arm64.h"
> +#else
> +#include "rte_ring_generic.h"
> +#endif
> +
> #define RTE_TAILQ_RING_NAME "RTE_RING"
>
> enum rte_ring_queue_behavior {
> @@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
> while (unlikely(ht->tail != old_val))
> rte_pause();
>
> - ht->tail = new_val;
> + arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
> }
>
> /**
> @@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> /* Reset n to the initial burst count */
> n = max;
>
> - *old_head = r->prod.head;
> + *old_head = arch_rte_atomic_load(&r->prod.head,
> + __ATOMIC_ACQUIRE);
Same as Konstantin's comments, i.e. move this function to the C11 memory model
header file.
> const uint32_t cons_tail = r->cons.tail;
> /*
> * The subtraction is done between two unsigned 32bits value
> @@ -430,8 +443,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> if (is_sp)
> r->prod.head = *new_head, success = 1;
> else
> - success = rte_atomic32_cmpset(&r->prod.head,
> - *old_head, *new_head);
> + success = arch_rte_atomic32_cmpset(&r->prod.head,
> + old_head, *new_head,
> + 0, __ATOMIC_ACQUIRE,
> + __ATOMIC_RELAXED);
> } while (unlikely(success == 0));
> return n;
> }
> @@ -470,7 +485,10 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
> goto end;
>
> ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
> +
> +#ifndef RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER
This #define will be removed if the function moves.
> rte_smp_wmb();
> +#endif
>
> update_tail(&r->prod, prod_head, prod_next, is_sp);
> end:
> @@ -516,7 +534,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> /* Restore n as it may change every loop */
> n = max;
>
> - *old_head = r->cons.head;
> + *old_head = arch_rte_atomic_load(&r->cons.head,
> + __ATOMIC_ACQUIRE);
> const uint32_t prod_tail = r->prod.tail;
> /* The subtraction is done between two unsigned 32bits value
> * (the result is always modulo 32 bits even if we have
> @@ -535,8 +554,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
> if (is_sc)
> r->cons.head = *new_head, success = 1;
> else
> - success = rte_atomic32_cmpset(&r->cons.head, *old_head,
> - *new_head);
> + success = arch_rte_atomic32_cmpset(&r->cons.head,
> + old_head, *new_head,
> + 0, __ATOMIC_ACQUIRE,
> + __ATOMIC_RELAXED);
> } while (unlikely(success == 0));
> return n;
> }
> @@ -575,7 +596,10 @@ __rte_ring_do_dequeue(struct rte_ring *r, void **obj_table,
> goto end;
>
> DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *);
> +
> +#ifndef RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER
This #define will be removed if the function moves.
> rte_smp_rmb();
> +#endif
Hi Jerin
On 11/3/2017 1:23 AM, Jerin Jacob Wrote:
> -----Original Message-----
>> Date: Thu, 2 Nov 2017 08:43:30 +0000
>> From: Jia He <hejianet@gmail.com>
>> To: jerin.jacob@caviumnetworks.com, dev@dpdk.org, olivier.matz@6wind.com
>> Cc: konstantin.ananyev@intel.com, bruce.richardson@intel.com,
>> jianbo.liu@arm.com, hemant.agrawal@nxp.com, Jia He <hejianet@gmail.com>,
>> jie2.liu@hxt-semitech.com, bing.zhao@hxt-semitech.com,
>> jia.he@hxt-semitech.com
>> Subject: [PATCH v2] ring: guarantee ordering of cons/prod loading when doing
>> X-Mailer: git-send-email 2.7.4
>>
>> We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
>> As for the possible race condition, please refer to [1].
> Hi Jia,
>
> In addition to Konstantin's comments, please find below some review
> comments.
>> Furthermore, there are 2 options as suggested by Jerin:
>> 1. use rte_smp_rmb
>> 2. use load_acquire/store_release(refer to [2]).
>> CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by
> I think the better name would be CONFIG_RTE_RING_USE_C11_MEM_MODEL
> or something like that.
OK, but how do we distinguish between the following 2 options?
CONFIG_RTE_RING_USE_C11_MEM_MODEL doesn't seem to be enough:
1. use rte_smp_rmb
2. use load_acquire/store_release (refer to [2]).
-----Original Message-----
> Date: Fri, 3 Nov 2017 09:46:40 +0800
> From: Jia He <hejianet@gmail.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: dev@dpdk.org, olivier.matz@6wind.com, konstantin.ananyev@intel.com,
> bruce.richardson@intel.com, jianbo.liu@arm.com, hemant.agrawal@nxp.com,
> jie2.liu@hxt-semitech.com, bing.zhao@hxt-semitech.com,
> jia.he@hxt-semitech.com
> Subject: Re: [PATCH v2] ring: guarantee ordering of cons/prod loading when
> doing
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
> Thunderbird/52.4.0
>
> Hi Jerin
>
>
> On 11/3/2017 1:23 AM, Jerin Jacob Wrote:
> > -----Original Message-----
> > >
> > > We watched a rte panic of mbuf_autotest in our qualcomm arm64 server.
> > > As for the possible race condition, please refer to [1].
> > Hi Jia,
> >
> > In addition to Konstantin's comments, please find below some review
> > comments.
> > > Furthermore, there are 2 options as suggested by Jerin:
> > > 1. use rte_smp_rmb
> > > 2. use load_acquire/store_release(refer to [2]).
> > > CONFIG_RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER is provided, and by
> > I think the better name would be CONFIG_RTE_RING_USE_C11_MEM_MODEL
> > or something like that.
> Ok, but how to distinguish following 2 options?
I have not clearly understood this question. For the arm64 case, you can add
CONFIG_RTE_RING_USE_C11_MEM_MODEL=y in config/defconfig_arm64-armv8a-*
Hi Jerin
On 11/3/2017 8:56 PM, Jerin Jacob Wrote:
> -----Original Message-----
>>
[...]
>> Ok, but how to distinguish following 2 options?
> No clearly understood this question. For arm64 case, you can add
> CONFIG_RTE_RING_USE_C11_MEM_MODEL=y in config/defconfig_arm64-armv8a-*
Sorry for my unclear expression.
I mean: should there be one additional config macro besides
CONFIG_RTE_RING_USE_C11_MEM_MODEL
for users to choose?
i.e.
- On x86: CONFIG_RTE_RING_USE_C11_MEM_MODEL=n
include rte_ring_generic.h, no changes
- On arm64: CONFIG_RTE_RING_USE_C11_MEM_MODEL=y
include rte_ring_c11_mem.h by default.
In rte_ring_c11_mem.h, implement new versions of
__rte_ring_move_prod_head/__rte_ring_move_cons_head/update_tail.
Then, how do we distinguish between the option of using rte_smp_rmb()
and that of using __atomic_load_n()/__atomic_store_n()?
Thanks for the clarification.
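One possible shape for the selection being debated here is a single compile-time switch choosing which inline implementation gets included. This sketch uses a hypothetical `RTE_RING_USE_C11_MEM_MODEL` define; it illustrates the idea only and is not what the series finally merged:

```c
#include <stdint.h>

/* Hypothetical single switch; in DPDK the real wiring would live in
 * config/defconfig_arm64-armv8a-* as suggested in the thread. */
#ifdef RTE_RING_USE_C11_MEM_MODEL
/* c11 header variant: one-direction acquire load, no extra rmb() */
static inline uint32_t ring_load_head(const volatile uint32_t *p)
{
	return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}
#else
/* generic header variant: plain load; on weakly ordered archs the
 * caller follows it with rte_smp_rmb() before reading the opposite
 * index, which is the other option discussed above */
static inline uint32_t ring_load_head(const volatile uint32_t *p)
{
	return *p;
}
#endif
```

With this layout only one knob exists: the `rmb()` flavour lives entirely in the generic header, and the acquire/release flavour entirely in the C11 header, so no second macro is needed to tell them apart.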
@@ -45,6 +45,8 @@ LIBABIVER := 1
SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
# install includes
-SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+ rte_ring_arm64.h \
+ rte_ring_generic.h
include $(RTE_SDK)/mk/rte.lib.mk
@@ -103,6 +103,18 @@ extern "C" {
#include <rte_memzone.h>
#include <rte_pause.h>
+/* In strong memory models (e.g. x86), there is no need to add an rmb()
+ * between two loads.
+ * In weak models (powerpc/arm), there are 2 choices for the users:
+ * 1. use an rmb() memory barrier
+ * 2. use a one-direction load_acquire/store_release barrier
+ * It depends on performance test results. */
+#ifdef RTE_ARCH_ARM64
+#include "rte_ring_arm64.h"
+#else
+#include "rte_ring_generic.h"
+#endif
+
#define RTE_TAILQ_RING_NAME "RTE_RING"
enum rte_ring_queue_behavior {
@@ -368,7 +380,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
while (unlikely(ht->tail != old_val))
rte_pause();
- ht->tail = new_val;
+ arch_rte_atomic_store(&ht->tail, new_val, __ATOMIC_RELEASE);
}
/**
@@ -408,7 +420,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
/* Reset n to the initial burst count */
n = max;
- *old_head = r->prod.head;
+ *old_head = arch_rte_atomic_load(&r->prod.head,
+ __ATOMIC_ACQUIRE);
const uint32_t cons_tail = r->cons.tail;
/*
* The subtraction is done between two unsigned 32bits value
@@ -430,8 +443,10 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
if (is_sp)
r->prod.head = *new_head, success = 1;
else
- success = rte_atomic32_cmpset(&r->prod.head,
- *old_head, *new_head);
+ success = arch_rte_atomic32_cmpset(&r->prod.head,
+ old_head, *new_head,
+ 0, __ATOMIC_ACQUIRE,
+ __ATOMIC_RELAXED);
} while (unlikely(success == 0));
return n;
}
@@ -470,7 +485,10 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
goto end;
ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
+
+#ifndef RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER
rte_smp_wmb();
+#endif
update_tail(&r->prod, prod_head, prod_next, is_sp);
end:
@@ -516,7 +534,8 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
/* Restore n as it may change every loop */
n = max;
- *old_head = r->cons.head;
+ *old_head = arch_rte_atomic_load(&r->cons.head,
+ __ATOMIC_ACQUIRE);
const uint32_t prod_tail = r->prod.tail;
/* The subtraction is done between two unsigned 32bits value
* (the result is always modulo 32 bits even if we have
@@ -535,8 +554,10 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
if (is_sc)
r->cons.head = *new_head, success = 1;
else
- success = rte_atomic32_cmpset(&r->cons.head, *old_head,
- *new_head);
+ success = arch_rte_atomic32_cmpset(&r->cons.head,
+ old_head, *new_head,
+ 0, __ATOMIC_ACQUIRE,
+ __ATOMIC_RELAXED);
} while (unlikely(success == 0));
return n;
}
@@ -575,7 +596,10 @@ __rte_ring_do_dequeue(struct rte_ring *r, void **obj_table,
goto end;
DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *);
+
+#ifndef RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER
rte_smp_rmb();
+#endif
update_tail(&r->cons, cons_head, cons_next, is_sc);
diff --git a/lib/librte_ring/rte_ring_arm64.h b/lib/librte_ring/rte_ring_arm64.h
new file mode 100644
@@ -0,0 +1,48 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2017 hxt-semitech. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of hxt-semitech nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_RING_ARM64_H_
+#define _RTE_RING_ARM64_H_
+
+#define arch_rte_atomic_store __atomic_store_n
+#define arch_rte_atomic32_cmpset __atomic_compare_exchange_n
+#ifdef RTE_ATOMIC_ACQUIRE_RELEASE_BARRIER_PREFER
+#define arch_rte_atomic_load __atomic_load_n
+#else
+#define arch_rte_atomic_load(a, b) \
+ *(a); \
+ rte_smp_rmb()
+
+#endif
+
+#endif /* _RTE_RING_ARM64_H_ */
+
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
new file mode 100644
@@ -0,0 +1,45 @@
+/*-
+ * BSD LICENSE
+ *
+ * Copyright(c) 2017 hxt-semitech. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of hxt-semitech nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_RING_GENERIC_H_
+#define _RTE_RING_GENERIC_H_
+
+#define arch_rte_atomic_load(a, b) *(a)
+#define arch_rte_atomic_store(a, b, c) \
+do { \
+ *(a) = b; \
+} while(0)
+#define arch_rte_atomic32_cmpset(a, b, c, d, e, f) \
+ rte_atomic32_cmpset(a, *(b), c)
+
+#endif /* _RTE_RING_GENERIC_H_ */
+