[v2] ring: enforce reading the tails before ring operations

Message ID 1551941137-33250-1-git-send-email-gavin.hu@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [v2] ring: enforce reading the tails before ring operations

Checks

Context Check Description
ci/intel-Performance-Testing success Performance Testing PASS
ci/mellanox-Performance-Testing success Performance Testing PASS
ci/Intel-compilation success Compilation OK
ci/checkpatch success coding style OK

Commit Message

Gavin Hu March 7, 2019, 6:45 a.m. UTC
In weak memory models, like arm64, reading the {prod,cons}.tail may get
reordered after reading or writing the ring slots, which corrupts the ring,
and stale data is observed.

This issue was reported by NXP on an 8-A72 DPAA2 board. The problem is most
likely caused by the missing acquire semantics when reading cons.tail (in
SP enqueue) or prod.tail (in SC dequeue), which makes it possible to read a
stale value from the ring slots.

For the MP (and MC) case, rte_atomic32_cmpset() already provides the required
ordering. This patch prevents reads and writes of the ring slots from getting
reordered before the read of {prod,cons}.tail for the SP (and SC) case.

Signed-off-by: gavin hu <gavin.hu@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Tested-by: Nipun Gupta <nipun.gupta@nxp.com>
---
 lib/librte_ring/rte_ring_generic.h | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)
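
For context, a minimal sketch (not the patch itself) of the single-producer
enqueue path the commit message describes, with the barrier where this patch
places it; names follow rte_ring_generic.h, the rest is simplified:

    /* free_entries is derived from cons.tail; the slot stores follow.
     * On a weakly ordered CPU the slot accesses can be hoisted above
     * the cons.tail load unless a barrier forbids it. */
    cons_tail = r->cons.tail;              /* must be read first */
    free_entries = capacity + cons_tail - prod_head;
    if (n <= free_entries) {
            r->prod.head = prod_head + n;
            rte_smp_rmb();                 /* barrier added by this patch */
            /* ... write the ring slots, then publish prod.tail ... */
    }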
  

Comments

Ilya Maximets March 7, 2019, 8:52 a.m. UTC | #1
On 07.03.2019 9:45, gavin hu wrote:
> In weak memory models, like arm64, reading the {prod,cons}.tail may get
> reordered after reading or writing the ring slots, which corrupts the ring,
> and stale data is observed.
> 
> This issue was reported by NXP on an 8-A72 DPAA2 board. The problem is most
> likely caused by the missing acquire semantics when reading cons.tail (in
> SP enqueue) or prod.tail (in SC dequeue), which makes it possible to read a
> stale value from the ring slots.
> 
> For the MP (and MC) case, rte_atomic32_cmpset() already provides the required
> ordering. This patch prevents reads and writes of the ring slots from getting
> reordered before the read of {prod,cons}.tail for the SP (and SC) case.

The read barrier rte_smp_rmb() is OK to prevent reads of the ring from getting
reordered before reading the tail. However, to prevent *writes* to the ring
from getting reordered *before reading* the tail you need a full memory
barrier, i.e. rte_smp_mb().

> 
> Signed-off-by: gavin hu <gavin.hu@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Tested-by: Nipun Gupta <nipun.gupta@nxp.com>
> ---
>  lib/librte_ring/rte_ring_generic.h | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
> index ea7dbe5..1bd3dfd 100644
> --- a/lib/librte_ring/rte_ring_generic.h
> +++ b/lib/librte_ring/rte_ring_generic.h
> @@ -90,9 +90,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
>  			return 0;
>  
>  		*new_head = *old_head + n;
> -		if (is_sp)
> -			r->prod.head = *new_head, success = 1;
> -		else
> +		if (is_sp) {
> +			r->prod.head = *new_head;
> +			rte_smp_rmb();
> +			success = 1;
> +		} else
>  			success = rte_atomic32_cmpset(&r->prod.head,
>  					*old_head, *new_head);
>  	} while (unlikely(success == 0));
> @@ -158,9 +160,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
>  			return 0;
>  
>  		*new_head = *old_head + n;
> -		if (is_sc)
> -			r->cons.head = *new_head, success = 1;
> -		else
> +		if (is_sc) {
> +			r->cons.head = *new_head;
> +			rte_smp_rmb();
> +			success = 1;
> +		} else
>  			success = rte_atomic32_cmpset(&r->cons.head, *old_head,
>  					*new_head);
>  	} while (unlikely(success == 0));
>
  
Gavin Hu March 7, 2019, 9:27 a.m. UTC | #2
<snip>
> 
> The read barrier rte_smp_rmb() is OK to prevent reads of the ring from
> getting reordered before reading the tail. However, to prevent *writes* to
> the ring from getting reordered *before reading* the tail you need a full
> memory barrier, i.e. rte_smp_mb().

rte_smp_rmb() is DMB(ishld), which orders LD/LD and LD/ST, while WMB (the ST option) orders ST/ST.
For more details, please refer to Table B2-1 "Encoding of the DMB and DSB <option> parameter" in
https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
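
For reference, a sketch of how these barriers map to DMB options on arm64 in
DPDK at the time (paraphrased from
lib/librte_eal/common/include/arch/arm/rte_atomic_64.h; check the tree for
the exact definitions):

    #define dmb(opt) asm volatile("dmb " #opt : : : "memory")

    #define rte_smp_mb()  dmb(ish)    /* full barrier */
    #define rte_smp_rmb() dmb(ishld)  /* orders LD/LD and LD/ST */
    #define rte_smp_wmb() dmb(ishst)  /* orders ST/ST */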

  
Ilya Maximets March 7, 2019, 9:48 a.m. UTC | #3
On 07.03.2019 12:27, Gavin Hu (Arm Technology China) wrote:
<snip>
> rte_smp_rmb() is DMB(ishld), which orders LD/LD and LD/ST, while WMB (the ST option) orders ST/ST.
> For more details, please refer to Table B2-1 "Encoding of the DMB and DSB <option> parameter" in
> https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile

I see. But you have to change the rte_smp_rmb() function definition in
lib/librte_eal/common/include/generic/rte_atomic.h and ensure that all
other architectures follow the same rules.
Otherwise, this change is logically wrong, because a read barrier under the
current definition cannot be used to order a Load with a Store.
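
For reference, the generic contract Ilya refers to, paraphrased from
lib/librte_eal/common/include/generic/rte_atomic.h (wording approximate):

    /**
     * General read memory barrier between lcores.
     *
     * Documented to guarantee only that LOADs before the barrier are
     * observed before LOADs after it; nothing is promised about
     * ordering earlier LOADs against later STOREs.
     */
    static inline void rte_smp_rmb(void);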

  
Gavin Hu March 7, 2019, 10:44 a.m. UTC | #4
<snip>
> I see. But you have to change the rte_smp_rmb() function definition in
> lib/librte_eal/common/include/generic/rte_atomic.h and ensure that all
> other architectures follow the same rules.
> Otherwise, this change is logically wrong, because a read barrier under the
> current definition cannot be used to order a Load with a Store.
> 

Good points, let me re-think how to handle this for other architectures.
A full MB is required for other architectures (x86? PPC?), but for arm, a read barrier (load/store and load/load) is enough.

  
Ananyev, Konstantin March 7, 2019, 11:17 a.m. UTC | #5
Hi Gavin,

<snip>
> Good points, let me re-think how to handle this for other architectures.
> A full MB is required for other architectures (x86? PPC?), but for arm, a read barrier (load/store and load/load) is enough.

For x86, I don't think you need any barrier here, as with the IA memory model:
- Reads are not reordered with other reads.
- Writes are not reordered with older reads.

BTW, could you explain a bit more why a barrier is necessary even on arm here?
As I can see, there is a data dependency between the tail value and
subsequent address calculations for ring writes/reads.
Isn't that sufficient to prevent re-ordering even for a weak memory model?
Konstantin
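
For reference, a sketch of the x86 side (paraphrased from DPDK's x86 and
generic rte_atomic.h headers): since the x86 memory model already forbids
LD/LD and LD/ST reordering, only the compiler has to be restrained, so the
SMP read barrier is a pure compiler barrier:

    #define rte_compiler_barrier() do {          \
            asm volatile ("" : : : "memory");    \
    } while (0)

    #define rte_smp_rmb() rte_compiler_barrier()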
 

  
Honnappa Nagarahalli March 8, 2019, 3:21 a.m. UTC | #6
<snip>
> For x86, I don't think you need any barrier here, as with the IA memory model:
> - Reads are not reordered with other reads.
> - Writes are not reordered with older reads.
Agree

> 
> BTW, could you explain a bit more why a barrier is necessary even on arm here?
> As I can see, there is a data dependency between the tail value and
> subsequent address calculations for ring writes/reads.
> Isn't that sufficient to prevent re-ordering even for a weak memory model?
The tail value affects 'n'. But the value of 'n' can be speculated because of the following 'if' statement:

if (unlikely(n > *free_entries))
        n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *free_entries;

The address calculations for the actual ring writes/reads do not depend on the tail value. Since 'n' can be speculated, the writes/reads can be moved up before the load of the tail value.
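
Put into code, a simplified sketch of the consumer (SC dequeue) side, where
the effect is easiest to see; this is not the exact DPDK code:

    cons_head = r->cons.head;
    prod_tail = r->prod.tail;                /* the load in question */
    entries   = prod_tail - cons_head;
    if (unlikely(n > entries))
            n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : entries;
    /* The slot addresses depend only on cons_head, not on prod_tail;
     * the 'if' above is only a control dependency.  A CPU that
     * predicts the branch can issue the slot loads before prod_tail
     * is read and so return stale slot contents. */
    DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *);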

> Konstantin
> 
> 
<snip>
  
Gavin Hu March 8, 2019, 4:23 a.m. UTC | #7
<snip>
Hi Ilya,

I would expand the rmb definition to cover load/store ordering, in addition to load/load.
For x86, with its strong memory ordering model, rmb is actually equivalent to mb, as it is
implemented as a compiler barrier (rte_compiler_barrier); arm32 is also this case.
For PPC, both 32-bit and 64-bit, rmb=wmb=mb: lwsync/sync orders load/store, load/load,
store/load and store/store; see the table on this page:
https://www.ibm.com/developerworks/systems/articles/powerpc.html

In summary, are we safe to expand this definition for all the architectures DPDK supports?
Any comments are welcome!

BR. Gavin

  
Honnappa Nagarahalli March 8, 2019, 5:06 a.m. UTC | #8
<snip>
> In summary, are we safe to expand this definition for all the architectures
> DPDK supports?
Essentially, it is a documentation bug, i.e. the current implementation of rte_smp_rmb() already behaves as a load/load and load/store barrier.


<snip>
  
Gavin Hu March 8, 2019, 5:27 a.m. UTC | #9
Hi Konstantin,

<snip>
> > For x86, I don't think you need any barrier here, as with the IA memory model:
> > - Reads are not reordered with other reads.
> > - Writes are not reordered with older reads.
> Agree

I understand that no instruction-level barriers are required here for IA, but how about the
compiler barrier, rte_compiler_barrier?

> >
> > BTW, could you explain a bit more why a barrier is necessary even on arm here?
> > As I can see, there is a data dependency between the tail value and
> > subsequent address calculations for ring writes/reads.
> > Isn't that sufficient to prevent re-ordering even for a weak memory model?
> The tail value affects 'n'. But the value of 'n' can be speculated because of
> the following 'if' statement:
> 
> if (unlikely(n > *free_entries))
>         n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *free_entries;
> 
> The address calculations for the actual ring writes/reads do not depend on the
> tail value. Since 'n' can be speculated, the writes/reads can be moved up
> before the load of the tail value.

Good explanation. The address calculations do not depend on tail/n; only the
limit/last one depends on it, and it can be speculated.

  
Ananyev, Konstantin March 8, 2019, 12:13 p.m. UTC | #10
<snip>
> Hi Ilya,
> 
> I would expand the rmb definition to cover load/store ordering, in addition to load/load.
> For x86, with its strong memory ordering model, rmb is actually equivalent to mb,

That's not exactly the case; on x86 we have:
smp_rmb == compiler_barrier
smp_mb is a proper memory barrier.

Konstantin

  
Gavin Hu March 8, 2019, 3:05 p.m. UTC | #11
Hi Konstantin,

<snip>
> > Hi Ilya,
> >
> > I would expand the rmb definition to cover load/store ordering, in addition
> > to load/load.
> > For x86, with its strong memory ordering model, rmb is actually equivalent
> > to mb,
> 
> That's not exactly the case; on x86 we have:
> smp_rmb == compiler_barrier
> smp_mb is a proper memory barrier.
> 
> Konstantin

Sorry, I did not make it clear.
Anyway, on x86, smp_rmb, as a compiler barrier, applies to load/store ordering, not only load/load.
This is also the case for arm, arm64, ppc32 and ppc64.
I will submit a patch to expand the definition of this API.

  
Ananyev, Konstantin March 8, 2019, 3:50 p.m. UTC | #12
<snip>
> Sorry, I did not make it clear.
> Anyway, on x86, smp_rmb, as a compiler barrier, applies to load/store ordering, not only load/load.

Yes, that's true, but I think that happened by coincidence,
not intentionally.

> This is also the case for arm, arm64, ppc32 and ppc64.
> I will submit a patch to expand the definition of this API.

I understand your intention, but doesn't that mean we would also need
to change not only rte_smp_rmb() but rte_rmb() too (to keep things consistent)?
That sounds worrying.
It might be better to keep the smp_rmb() definition as it is and introduce a new
function that fits your purposes (smp_rwmb or smp_load_store_barrier)?

Konstantin
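
A sketch of what such a helper might look like (name, placement and the
per-architecture mapping here are purely hypothetical, following the
suggestion above):

    /* Hypothetical: orders earlier LOADs against later LOADs and STOREs. */
    static inline void
    rte_smp_load_store_barrier(void)
    {
    #if defined(RTE_ARCH_ARM64)
            asm volatile("dmb ishld" : : : "memory");  /* LD/LD + LD/ST */
    #elif defined(RTE_ARCH_X86)
            rte_compiler_barrier();  /* TSO gives the ordering for free */
    #else
            rte_smp_mb();            /* conservative fallback */
    #endif
    }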

  
Ananyev, Konstantin March 8, 2019, 4:33 p.m. UTC | #13
<snip>
> > The tail value affects 'n'. But the value of 'n' can be speculated because of
> > the following 'if' statement:
> >
> > if (unlikely(n > *free_entries))
> >         n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *free_entries;
> >
> > The address calculations for the actual ring writes/reads do not depend on
> > the tail value.

Ok, agreed, I formulated it wrongly: only the limit value is dependent on cons.tail.
The address is not.

> > Since 'n' can be speculated, the writes/reads can be moved up
> > before the load of the tail value.

For my curiosity: ok, I understand that the 'n' value can be speculated,
and speculative stores could start before n is calculated properly...
But are you saying that such speculative store results might be visible to
other observers (a different cpu)?

  
Thomas Monjalon March 8, 2019, 11:18 p.m. UTC | #14
08/03/2019 16:50, Ananyev, Konstantin:
> 08/03/2019 16:05, Gavin Hu (Arm Technology China):
> > Anyway, on x86, smp_rmb, as a compiler barrier, applies to load/store ordering, not only load/load.
> 
> Yes, that's true, but I think that happened by coincidence,
> not intentionally.
> 
> > This is also the case for arm, arm64, ppc32 and ppc64.
> > I will submit a patch to expand the definition of this API.
> 
> I understand your intention, but doesn't that mean we would also need
> to change not only rte_smp_rmb() but rte_rmb() too (to keep things consistent)?
> That sounds worrying.
> It might be better to keep the smp_rmb() definition as it is and introduce a new
> function that fits your purposes (smp_rwmb or smp_load_store_barrier)?

How is it managed in other projects?
  
Honnappa Nagarahalli March 8, 2019, 11:48 p.m. UTC | #15
> 08/03/2019 16:50, Ananyev, Konstantin:
<snip>
> > I understand your intention, but doesn't that mean we would also need to
> > change not only rte_smp_rmb() but rte_rmb() too (to keep things consistent)?
> > That sounds worrying.
> > It might be better to keep the smp_rmb() definition as it is and introduce a
> > new function that fits your purposes (smp_rwmb or smp_load_store_barrier)?
Looking at the rte_rmb, rte_io_rmb and rte_cio_rmb implementations for Arm, they all provide a load/store barrier as well. If other architectures also provide a load/store barrier with rte_xxx_rmb, then we could extend the meaning of the existing APIs.

Even if a new API is provided, we would need to provide the same APIs for the IO and CIO variants.

> 
> How is it managed in other projects?
In my experience, I have usually been changing the algorithms to use the C11 memory model, so I have not come across this issue yet. Others can comment.
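
For reference, the C11 ring code already in the tree (rte_ring_c11_mem.h)
avoids this question by loading the opposing tail with acquire semantics; a
simplified sketch of its producer-head update:

    *old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
    /* The acquire load ensures the later ring-slot accesses cannot be
     * hoisted above this read of cons.tail. */
    cons_tail = __atomic_load_n(&r->cons.tail, __ATOMIC_ACQUIRE);
    *free_entries = (capacity + cons_tail - *old_head);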

  
Gavin Hu March 9, 2019, 10:28 a.m. UTC | #16
> > 08/03/2019 16:50, Ananyev, Konstantin:
> > > 08/03/2019 16:05, Gavin Hu (Arm Technology China):
> > > > Anyway, on x86, smp_rmb, as a compiler barrier, applies to
> load/store, not
> > only load/load.
> > >
> > > Yes, that's true, but I think that's happened by coincidence, not
> > > intentionally.
> > >
> > > > This is the case also for arm, arm64, ppc32, ppc64.
> > > > I will submit a patch to expand the definition of this API.
> > >
> > > I understand your intention, but does that mean we would also need to
> > > change not only rte_smp_rmb() but rte_rmb() too (to keep things
> > > consistent)?
> > > That sounds worrying.
> > > Might be better to keep smp_rmb() definition as it is, and introduce
> > > new function that fits your purposes (smp_rwmb or
> > > smp_load_store_barrier)?
> Looking at the rte_rmb, rte_io_rmb, and rte_cio_rmb implementations for
> Arm, they all provide a load/store barrier as well. If other architectures
> also provide a load/store barrier with rte_xxx_rmb, then we could extend
> the meaning of the existing APIs.
Further looking at the rte_rmb, rte_io_rmb, and rte_cio_rmb implementations for PPC64 and x86,
they also provide a load/store barrier. It is safe to extend the meaning of the existing rte_XXX_rmb APIs.
> 
> Even if a new API is provided, we need to provide the same APIs for the IO
> and CIO variants.
Since the rte_XXX_rmb APIs for all architectures already provide the desired load/store ordering,
a new API is redundant and not needed.

> >
> > How is it managed in other projects?
> In my experience, I have usually been changing the algorithms to use the
> C11 memory model. So, I have not come across this issue yet. Others can
> comment.
> 
> >
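
If the existing APIs were extended as argued above, the change would be
mostly one of documented contract; a sketch of how the generic
declaration's comment might read (hypothetical wording, today's comment
only promises load/load ordering):

/**
 * General read memory barrier.
 *
 * Guarantees that the LOAD operations generated before the barrier
 * occur before the LOAD and STORE operations generated after.
 */
static inline void rte_smp_rmb(void);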
  
Honnappa Nagarahalli March 10, 2019, 8:47 p.m. UTC | #17
> > > > > > >>> In weak memory models, like arm64, reading the
> > > > > > >>> {prod,cons}.tail may get reordered after reading or
> > > > > > >>> writing the ring slots, which corrupts the ring and stale data is
> observed.
> > > > > > >>>
> > > > > > >>> This issue was reported by NXP on 8-A72 DPAA2 board. The
> > > problem
> > > > > > >>> is
> > > > > > >> most
> > > > > > >>> likely caused by missing the acquire semantics when
> > > > > > >>> reading cons.tail (in SP enqueue) or prod.tail (in SC
> > > > > > >>> dequeue) which makes it possible to
> > > > > > read
> > > > > > >> a
> > > > > > >>> stale value from the ring slots.
> > > > > > >>>
> > > > > > >>> For MP (and MC) case, rte_atomic32_cmpset() already
> > > > > > >>> provides
> > > the
> > > > > > >> required
> > > > > > >>> ordering. This patch is to prevent reading and writing the
> > > > > > >>> ring slots get reordered before reading {prod,cons}.tail
> > > > > > >>> for SP (and SC)
> > > > case.
> > > > > > >>
> > > > > > >> Read barrier rte_smp_rmb() is OK to prevent reading the
> > > > > > >> ring get reordered before reading the tail. However, to
> > > > > > >> prevent *writing* the ring get reordered *before reading*
> > > > > > >> the tail you need a full memory barrier, i.e.
> > > > > > >> rte_smp_mb().
> > > > > > >
> > > > > > > rte_smp_rmb is DMB(ishld); ISHLD orders LD/LD and LD/ST,
> > > > > > > while WMB (ST option) orders ST/ST.
> > > > > > > For more details, please refer to: Table B2-1 Encoding of
> > > > > > > the DMB and DSB <option> parameter in
> > > > > > > https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
> > > > > >
> > > > > > I see. But you have to change the rte_smp_rmb() function
> > > > > > definition in
> > > > > > lib/librte_eal/common/include/generic/rte_atomic.h and assure that
> all other architectures follows same rules.
> > > > > > Otherwise, this change is logically wrong, because read
> > > > > > barrier in current definition could not be used to order Load with
> Store.
> > > > > >
> > > > >
> > > > > Good points, let me re-think how to handle for other architectures.
> > > > > Full MB is required for other architectures(x86? Ppc?), but for
> > > > > arm, read
> > > > barrier(load/store and load/load) is enough.
> > > >
> > > > For x86, I don't think you need any barrier here, as with IA memory
> mode:
> > > > -  Reads are not reordered with other reads.
> > > > - Writes are not reordered with older reads.
> > > Agree
> >
> > I understand that no instruction-level barriers are required for IA here,
> > but how about the compiler barrier, rte_compiler_barrier?
> >
> > >
> > > >
> > > > BTW, could you explain a bit more why barrier is necessary even on
> > > > arm
> > > here?
> > > > As I can see, there is a data dependency between the tail value
> > > > and subsequent address calculations for ring writes/reads.
> > > > Isn't that sufficient to prevent re-ordering even for weak memory model?
> > > The tail value affects 'n'. But, the value of 'n' can be speculated
> > > because of the following 'if' statement:
> > >
> > > if (unlikely(n > *free_entries))
> > >                         n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 :
> > > *free_entries;
> > >
> > > The address calculations for actual ring writes/reads do not depend
> > > on the tail value.
> 
> Ok, agreed, I formulated it wrongly: only the limit value depends on cons.tail.
> The address does not.
> 
> > > Since 'n' can be speculated, the writes/reads can be moved up
> > > before the load of the tail value.
> 
> For my curiosity: ok, I understand that 'n' value can be speculated, and
> speculative stores could start before n is calculated properly...
> But are you saying that such speculative store results might be visible to the
> other observers (different cpu)?
> 
You are correct. The speculative stores will NOT be visible to other observers till the value of 'n' is fixed. Speculative stores might have to be discarded depending on the value of 'n' (which will affect cache performance).
There is also a control dependency between the load of cons.tail and the stores to the ring. That should also keep the load and stores from getting reordered (though I am not sure if it still allows for speculative stores).
So, IMO, the barrier in enqueue is not needed. Is this what you wanted to drive at?
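
A short sketch of the control-dependency argument above, producer side
(editorial illustration; 'ring', 'obj' and 'head' are placeholders):

uint32_t t = r->cons.tail;              /* plain load of the tail    */
uint32_t free = r->capacity + t - head;
if (n > free)                           /* branch depends on t ...   */
	n = free;
ring[head & r->mask] = obj;             /* ... store after the branch */
/* The store is control-dependent on the tail load: a CPU may execute it
 * speculatively, but must not make it visible to other cores before the
 * branch (and therefore the tail value) resolves. Loads from the ring
 * slots have no such guarantee; they can complete speculatively, which
 * is why the dequeue side still needs a load/load + load/store barrier. */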
  
Ananyev, Konstantin March 11, 2019, 1:58 p.m. UTC | #18
> > > > > > > >>> In weak memory models, like arm64, reading the
> > > > > > > >>> {prod,cons}.tail may get reordered after reading or
> > > > > > > >>> writing the ring slots, which corrupts the ring and stale data is
> > observed.
> > > > > > > >>>
> > > > > > > >>> This issue was reported by NXP on 8-A72 DPAA2 board. The
> > > > problem
> > > > > > > >>> is
> > > > > > > >> most
> > > > > > > >>> likely caused by missing the acquire semantics when
> > > > > > > >>> reading cons.tail (in SP enqueue) or prod.tail (in SC
> > > > > > > >>> dequeue) which makes it possible to
> > > > > > > read
> > > > > > > >> a
> > > > > > > >>> stale value from the ring slots.
> > > > > > > >>>
> > > > > > > >>> For MP (and MC) case, rte_atomic32_cmpset() already
> > > > > > > >>> provides
> > > > the
> > > > > > > >> required
> > > > > > > >>> ordering. This patch is to prevent reading and writing the
> > > > > > > >>> ring slots get reordered before reading {prod,cons}.tail
> > > > > > > >>> for SP (and SC)
> > > > > case.
> > > > > > > >>
> > > > > > > >> Read barrier rte_smp_rmb() is OK to prevent reading the
> > > > > > > >> ring get reordered before reading the tail. However, to
> > > > > > > >> prevent *writing* the ring get reordered *before reading*
> > > > > > > >> the tail you need a full memory barrier, i.e.
> > > > > > > >> rte_smp_mb().
> > > > > > > >
> > > > > > > > rte_smp_rmb is DMB(ishld); ISHLD orders LD/LD and LD/ST,
> > > > > > > > while WMB (ST option) orders ST/ST.
> > > > > > > > For more details, please refer to: Table B2-1 Encoding of
> > > > > > > > the DMB and DSB <option> parameter in
> > > > > > > > https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
> > > > > > >
> > > > > > > I see. But you have to change the rte_smp_rmb() function
> > > > > > > definition in
> > > > > > > lib/librte_eal/common/include/generic/rte_atomic.h and assure that
> > all other architectures follows same rules.
> > > > > > > Otherwise, this change is logically wrong, because read
> > > > > > > barrier in current definition could not be used to order Load with
> > Store.
> > > > > > >
> > > > > >
> > > > > > Good points, let me re-think how to handle for other architectures.
> > > > > > Full MB is required for other architectures(x86? Ppc?), but for
> > > > > > arm, read
> > > > > barrier(load/store and load/load) is enough.
> > > > >
> > > > > For x86, I don't think you need any barrier here, as with IA memory
> > mode:
> > > > > -  Reads are not reordered with other reads.
> > > > > - Writes are not reordered with older reads.
> > > > Agree
> > >
> > > I understand that no instruction-level barriers are required for IA here,
> > > but how about the compiler barrier, rte_compiler_barrier?
> > >
> > > >
> > > > >
> > > > > BTW, could you explain a bit more why barrier is necessary even on
> > > > > arm
> > > > here?
> > > > > As I can see, there is a data dependency between the tail value
> > > > > and subsequent address calculations for ring writes/reads.
> > > > > Isn't that sufficient to prevent re-ordering even for weak memory model?
> > > > The tail value affects 'n'. But, the value of 'n' can be speculated
> > > > because of the following 'if' statement:
> > > >
> > > > if (unlikely(n > *free_entries))
> > > >                         n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 :
> > > > *free_entries;
> > > >
> > > > The address calculations for actual ring writes/reads do not depend
> > > > on the tail value.
> >
> > Ok, agreed, I formulated it wrongly: only the limit value depends on cons.tail.
> > The address does not.
> >
> > > > Since 'n' can be speculated, the writes/reads can be moved up
> > > > before the load of the tail value.
> >
> > For my curiosity: ok, I understand that 'n' value can be speculated, and
> > speculative stores could start before n is calculated properly...
> > But are you saying that such speculative store results might be visible to the
> > other observers (different cpu)?
> >
> You are correct. The speculative stores will NOT be visible to other observers till the value of 'n' is fixed. Speculative stores might have to be
> discarded depending on the value of 'n' (which will affect cache performance).
> There is also a control dependency between the load of cons.tail and the stores to the ring. That should also keep the load and stores from
> getting reordered (though I am not sure if it still allows for speculative stores).
> So, IMO, the barrier in enqueue is not needed. Is this what you wanted to drive at?

Yes, that was my thought.
Thanks
Konstantin
  

Patch

diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5..1bd3dfd 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -90,9 +90,11 @@  __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 			return 0;
 
 		*new_head = *old_head + n;
-		if (is_sp)
-			r->prod.head = *new_head, success = 1;
-		else
+		if (is_sp) {
+			r->prod.head = *new_head;
+			rte_smp_rmb();
+			success = 1;
+		} else
 			success = rte_atomic32_cmpset(&r->prod.head,
 					*old_head, *new_head);
 	} while (unlikely(success == 0));
@@ -158,9 +160,11 @@  __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 			return 0;
 
 		*new_head = *old_head + n;
-		if (is_sc)
-			r->cons.head = *new_head, success = 1;
-		else
+		if (is_sc) {
+			r->cons.head = *new_head;
+			rte_smp_rmb();
+			success = 1;
+		} else
 			success = rte_atomic32_cmpset(&r->cons.head, *old_head,
 					*new_head);
 	} while (unlikely(success == 0));