[1/2] ring: synchronize the load and store of the tail

Message ID 1539757786-226178-1-git-send-email-gavin.hu@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series [1/2] ring: synchronize the load and store of the tail |

Checks

Context Check Description
ci/Intel-compilation success Compilation OK
ci/checkpatch success coding style OK

Commit Message

Gavin Hu Oct. 17, 2018, 6:29 a.m. UTC
  Synchronize the load-acquire of the tail and the store-release
within update_tail, the store release ensures all the ring operations,
enqueue or dequeue, are seen by the observers on the other side as soon
as they see the updated tail. The load-acquire is needed here as the
data dependency is not a reliable way for ordering as the compiler might
break it by saving to temporary values to boost performance.
When computing the free_entries and avail_entries, use atomic semantics
to load the heads and tails instead.

The patch was benchmarked with test/ring_perf_autotest and it decreases
the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
For 1 lcore, it also improves a little, about 3 ~ 4%.
It is a big improvement, in case of MPMC, with two lcores and ring size
of 32, it saves latency up to (3.26-2.36)/3.26 = 27.6%.

This patch is a bug fix, while the improvement is a bonus. In our analysis
the improvement comes from the cacheline pre-filling after hoisting load-
acquire from _atomic_compare_exchange_n up above.

The test command:
$sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
1024 -- -i

Test result with this patch(two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.86
 MP/MC bulk enq/dequeue (size: 8): 10.15
 SP/SC bulk enq/dequeue (size: 32): 1.94
 MP/MC bulk enq/dequeue (size: 32): 2.36

In comparison of the test result without this patch:
 SP/SC bulk enq/dequeue (size: 8): 6.67
 MP/MC bulk enq/dequeue (size: 8): 13.12
 SP/SC bulk enq/dequeue (size: 32): 2.04
 MP/MC bulk enq/dequeue (size: 32): 3.26

Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)
  

Comments

Gavin Hu Oct. 17, 2018, 6:35 a.m. UTC | #1
Hi Jerin

As the 1st one of the 3-patch set was not concluded, I submit this 2-patch series to unblock the merge.

Best Regards,
Gavin

> -----Original Message-----
> From: Gavin Hu <gavin.hu@arm.com>
> Sent: Wednesday, October 17, 2018 2:30 PM
> To: dev@dpdk.org
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> jerin.jacob@caviumnetworks.com; stable@dpdk.org
> Subject: [PATCH 1/2] ring: synchronize the load and store of the tail
>
> Synchronize the load-acquire of the tail and the store-release within
> update_tail, the store release ensures all the ring operations, enqueue or
> dequeue, are seen by the observers on the other side as soon as they see
> the updated tail. The load-acquire is needed here as the data dependency is
> not a reliable way for ordering as the compiler might break it by saving to
> temporary values to boost performance.
> When computing the free_entries and avail_entries, use atomic semantics to
> load the heads and tails instead.
>
> The patch was benchmarked with test/ring_perf_autotest and it decreases
> the enqueue/dequeue latency by 5% ~ 27.6% with two lcores, the real gains
> are dependent on the number of lcores, depth of the ring, SPSC or MPMC.
> For 1 lcore, it also improves a little, about 3 ~ 4%.
> It is a big improvement, in case of MPMC, with two lcores and ring size of 32,
> it saves latency up to (3.26-2.36)/3.26 = 27.6%.
>
> This patch is a bug fix, while the improvement is a bonus. In our analysis the
> improvement comes from the cacheline pre-filling after hoisting load-
> acquire from _atomic_compare_exchange_n up above.
>
> The test command:
> $sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=\
> 1024 -- -i
>
> Test result with this patch(two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.86
>  MP/MC bulk enq/dequeue (size: 8): 10.15  SP/SC bulk enq/dequeue (size:
> 32): 1.94  MP/MC bulk enq/dequeue (size: 32): 2.36
>
> In comparison of the test result without this patch:
>  SP/SC bulk enq/dequeue (size: 8): 6.67
>  MP/MC bulk enq/dequeue (size: 8): 13.12  SP/SC bulk enq/dequeue (size:
> 32): 2.04  MP/MC bulk enq/dequeue (size: 32): 3.26
>
> Fixes: 39368ebfc6 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
>
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Reviewed-by: Jia He <justin.he@arm.com>
> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
>  lib/librte_ring/rte_ring_c11_mem.h | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> b/lib/librte_ring/rte_ring_c11_mem.h
> index 94df3c4..4851763 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -67,13 +67,18 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> unsigned int is_sp,
>  *old_head = __atomic_load_n(&r->prod.head,
>  __ATOMIC_ACQUIRE);
>
> -/*
> - *  The subtraction is done between two unsigned 32bits
> value
> +/* load-acquire synchronize with store-release of ht->tail
> + * in update_tail.
> + */
> +const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
> +
> __ATOMIC_ACQUIRE);
> +
> +/* The subtraction is done between two unsigned 32bits
> value
>   * (the result is always modulo 32 bits even if we have
>   * *old_head > cons_tail). So 'free_entries' is always
> between 0
>   * and capacity (which is < size).
>   */
> -*free_entries = (capacity + r->cons.tail - *old_head);
> +*free_entries = (capacity + cons_tail - *old_head);
>
>  /* check that we have enough room in ring */
>  if (unlikely(n > *free_entries))
> @@ -131,15 +136,22 @@ __rte_ring_move_cons_head(struct rte_ring *r, int
> is_sc,
>  do {
>  /* Restore n as it may change every loop */
>  n = max;
> +
>  *old_head = __atomic_load_n(&r->cons.head,
>  __ATOMIC_ACQUIRE);
>
> +/* this load-acquire synchronize with store-release of ht->tail
> + * in update_tail.
> + */
> +const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
> +__ATOMIC_ACQUIRE);
> +
>  /* The subtraction is done between two unsigned 32bits
> value
>   * (the result is always modulo 32 bits even if we have
>   * cons_head > prod_tail). So 'entries' is always between 0
>   * and size(ring)-1.
>   */
> -*entries = (r->prod.tail - *old_head);
> +*entries = (prod_tail - *old_head);
>
>  /* Set the actual entries for dequeue */
>  if (n > *entries)
> --
> 2.7.4

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
  
Thomas Monjalon Oct. 27, 2018, 2:39 p.m. UTC | #2
17/10/2018 08:35, Gavin Hu (Arm Technology China):
> Hi Jerin
> 
> As the 1st one of the 3-patch set was not concluded, I submit this 2-patch series to unblock the merge.

The thread is totally messed up because:
	- there is no cover letter
	- some different series (testpmd, i40e and doc) are in the same thread
	- v4 replies to a different series
	- this version should be a v5 but has no number
	- this version replies to the v3
	- patchwork still shows v3 and "v5"
	- replies from Ola are not quoting previous discussion

Because of all of this, it is really difficult to follow.
This is probably the reason of the lack of review outside of Arm.

One more issue: you must Cc the relevant maintainers.
Here:
	- Olivier for rte_ring
	- Chao for IBM platform
	- Bruce and Konstantin for x86

Guys, it is really cool to have more Arm developpers in DPDK.
But please consider better formatting your discussions, it is really
important in our contribution workflow.

I don't know what to do.
I suggest to wait for more feedbacks and integrate it in -rc2.
  
Jerin Jacob Oct. 27, 2018, 3 p.m. UTC | #3
-----Original Message-----
> Date: Sat, 27 Oct 2018 16:39:58 +0200
> From: Thomas Monjalon <thomas@monjalon.net>
> To: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org, "jerin.jacob@caviumnetworks.com"
>  <jerin.jacob@caviumnetworks.com>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "stable@dpdk.org" <stable@dpdk.org>, Ola
>  Liljedahl <Ola.Liljedahl@arm.com>, olivier.matz@6wind.com,
>  chaozhu@linux.vnet.ibm.com, bruce.richardson@intel.com,
>  konstantin.ananyev@intel.com
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
>  the tail
> 
> External Email
> 
> 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > Hi Jerin
> >
> > As the 1st one of the 3-patch set was not concluded, I submit this 2-patch series to unblock the merge.
> 
> The thread is totally messed up because:
>         - there is no cover letter
>         - some different series (testpmd, i40e and doc) are in the same thread
>         - v4 replies to a different series
>         - this version should be a v5 but has no number
>         - this version replies to the v3
>         - patchwork still shows v3 and "v5"
>         - replies from Ola are not quoting previous discussion
> 
> Because of all of this, it is really difficult to follow.
> This is probably the reason of the lack of review outside of Arm.
> 
> One more issue: you must Cc the relevant maintainers.
> Here:
>         - Olivier for rte_ring
>         - Chao for IBM platform
>         - Bruce and Konstantin for x86
> 
> Guys, it is really cool to have more Arm developpers in DPDK.
> But please consider better formatting your discussions, it is really
> important in our contribution workflow.
> 
> I don't know what to do.
> I suggest to wait for more feedbacks and integrate it in -rc2.

This series has been acked and tested. Sure, if we are looking for some
more feedback we can push to -rc2 if not it a good candidate to be
selected for -rc1.

> 
> 
>
  
Thomas Monjalon Oct. 27, 2018, 3:13 p.m. UTC | #4
27/10/2018 17:00, Jerin Jacob:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > Hi Jerin
> > >
> > > As the 1st one of the 3-patch set was not concluded, I submit this 2-patch series to unblock the merge.
> > 
> > The thread is totally messed up because:
> >         - there is no cover letter
> >         - some different series (testpmd, i40e and doc) are in the same thread
> >         - v4 replies to a different series
> >         - this version should be a v5 but has no number
> >         - this version replies to the v3
> >         - patchwork still shows v3 and "v5"
> >         - replies from Ola are not quoting previous discussion
> > 
> > Because of all of this, it is really difficult to follow.
> > This is probably the reason of the lack of review outside of Arm.
> > 
> > One more issue: you must Cc the relevant maintainers.
> > Here:
> >         - Olivier for rte_ring
> >         - Chao for IBM platform
> >         - Bruce and Konstantin for x86
> > 
> > Guys, it is really cool to have more Arm developpers in DPDK.
> > But please consider better formatting your discussions, it is really
> > important in our contribution workflow.
> > 
> > I don't know what to do.
> > I suggest to wait for more feedbacks and integrate it in -rc2.
> 
> This series has been acked and tested. Sure, if we are looking for some
> more feedback we can push to -rc2 if not it a good candidate to be
> selected for -rc1.

It has been acked and tested only for Arm platforms.
And Olivier, the ring maintainer, was not Cc.

I feel it is not enough.
  
Jerin Jacob Oct. 27, 2018, 3:34 p.m. UTC | #5
-----Original Message-----
> Date: Sat, 27 Oct 2018 17:13:10 +0200
> From: Thomas Monjalon <thomas@monjalon.net>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>, "dev@dpdk.org"
>  <dev@dpdk.org>, Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
>  "stable@dpdk.org" <stable@dpdk.org>, Ola Liljedahl
>  <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
>  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
>  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
>  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
>  <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
>  the tail
> 
> 
> 27/10/2018 17:00, Jerin Jacob:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > Hi Jerin
> > > >
> > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-patch series to unblock the merge.
> > >
> > > The thread is totally messed up because:
> > >         - there is no cover letter
> > >         - some different series (testpmd, i40e and doc) are in the same thread
> > >         - v4 replies to a different series
> > >         - this version should be a v5 but has no number
> > >         - this version replies to the v3
> > >         - patchwork still shows v3 and "v5"
> > >         - replies from Ola are not quoting previous discussion
> > >
> > > Because of all of this, it is really difficult to follow.
> > > This is probably the reason of the lack of review outside of Arm.
> > >
> > > One more issue: you must Cc the relevant maintainers.
> > > Here:
> > >         - Olivier for rte_ring
> > >         - Chao for IBM platform
> > >         - Bruce and Konstantin for x86
> > >
> > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > But please consider better formatting your discussions, it is really
> > > important in our contribution workflow.
> > >
> > > I don't know what to do.
> > > I suggest to wait for more feedbacks and integrate it in -rc2.
> >
> > This series has been acked and tested. Sure, if we are looking for some
> > more feedback we can push to -rc2 if not it a good candidate to be
> > selected for -rc1.
> 
> It has been acked and tested only for Arm platforms.
> And Olivier, the ring maintainer, was not Cc.
> 
> I feel it is not enough.

Sure, More reviews is already better. But lets keep as -rc2 target.


> 
>
  
Thomas Monjalon Oct. 27, 2018, 3:48 p.m. UTC | #6
27/10/2018 17:34, Jerin Jacob:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 27/10/2018 17:00, Jerin Jacob:
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > Hi Jerin
> > > > >
> > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-patch series to unblock the merge.
> > > >
> > > > The thread is totally messed up because:
> > > >         - there is no cover letter
> > > >         - some different series (testpmd, i40e and doc) are in the same thread
> > > >         - v4 replies to a different series
> > > >         - this version should be a v5 but has no number
> > > >         - this version replies to the v3
> > > >         - patchwork still shows v3 and "v5"
> > > >         - replies from Ola are not quoting previous discussion
> > > >
> > > > Because of all of this, it is really difficult to follow.
> > > > This is probably the reason of the lack of review outside of Arm.
> > > >
> > > > One more issue: you must Cc the relevant maintainers.
> > > > Here:
> > > >         - Olivier for rte_ring
> > > >         - Chao for IBM platform
> > > >         - Bruce and Konstantin for x86
> > > >
> > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > But please consider better formatting your discussions, it is really
> > > > important in our contribution workflow.
> > > >
> > > > I don't know what to do.
> > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > >
> > > This series has been acked and tested. Sure, if we are looking for some
> > > more feedback we can push to -rc2 if not it a good candidate to be
> > > selected for -rc1.
> > 
> > It has been acked and tested only for Arm platforms.
> > And Olivier, the ring maintainer, was not Cc.
> > 
> > I feel it is not enough.
> 
> Sure, More reviews is already better. But lets keep as -rc2 target.

Yes. If no objection, it will enter in -rc2.
After -rc2, we cannot do such core change anyway.
  
Gavin Hu Oct. 29, 2018, 2:51 a.m. UTC | #7
Hi Thomas and Jerin,

The patches were extensively and heavily reviewed by Arm internally. As the 1st patch was not concluded, so I create a new series(2 patches),


> -----Original Message-----
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Sent: Saturday, October 27, 2018 11:34 PM
> To: Thomas Monjalon <thomas@monjalon.net>
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org;
> Ola Liljedahl <Ola.Liljedahl@arm.com>; olivier.matz@6wind.com;
> chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com;
> konstantin.ananyev@intel.com
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
> the tail
>
> -----Original Message-----
> > Date: Sat, 27 Oct 2018 17:13:10 +0200
> > From: Thomas Monjalon <thomas@monjalon.net>
> > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Cc: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>,
> "dev@dpdk.org"
> >  <dev@dpdk.org>, Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>,
> > "stable@dpdk.org" <stable@dpdk.org>, Ola Liljedahl
> > <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
> >  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
> >  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
> >  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
> >  <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and
> > store of  the tail
> >
> >
> > 27/10/2018 17:00, Jerin Jacob:
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > Hi Jerin
> > > > >
> > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> patch series to unblock the merge.
> > > >
> > > > The thread is totally messed up because:
> > > >         - there is no cover letter
> > > >         - some different series (testpmd, i40e and doc) are in the same
> thread
> > > >         - v4 replies to a different series
> > > >         - this version should be a v5 but has no number
> > > >         - this version replies to the v3
> > > >         - patchwork still shows v3 and "v5"
> > > >         - replies from Ola are not quoting previous discussion
> > > >
> > > > Because of all of this, it is really difficult to follow.
> > > > This is probably the reason of the lack of review outside of Arm.
> > > >
> > > > One more issue: you must Cc the relevant maintainers.
> > > > Here:
> > > >         - Olivier for rte_ring
> > > >         - Chao for IBM platform
> > > >         - Bruce and Konstantin for x86
> > > >
> > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > But please consider better formatting your discussions, it is
> > > > really important in our contribution workflow.
> > > >
> > > > I don't know what to do.
> > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > >
> > > This series has been acked and tested. Sure, if we are looking for
> > > some more feedback we can push to -rc2 if not it a good candidate to
> > > be selected for -rc1.
> >
> > It has been acked and tested only for Arm platforms.
> > And Olivier, the ring maintainer, was not Cc.
> >
> > I feel it is not enough.
>
> Sure, More reviews is already better. But lets keep as -rc2 target.
>
>
> >
> >
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
  
Gavin Hu Oct. 29, 2018, 2:57 a.m. UTC | #8
Hi Thomas and Jerin,

The patches were extensively reviewed by Arm internally, as the 1st patch was not able to be concluded, I created a new patch series(2 patches).
How can I clean up this mess?
1. make all the previous patches Superseded?
2. We have two more new patches, should I submit the 4 patches (the old 2 patches + 2 new patches) with V2?

Best Regards,
Gavin


> -----Original Message-----
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Sent: Saturday, October 27, 2018 11:34 PM
> To: Thomas Monjalon <thomas@monjalon.net>
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org;
> Ola Liljedahl <Ola.Liljedahl@arm.com>; olivier.matz@6wind.com;
> chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com;
> konstantin.ananyev@intel.com
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
> the tail
>
> -----Original Message-----
> > Date: Sat, 27 Oct 2018 17:13:10 +0200
> > From: Thomas Monjalon <thomas@monjalon.net>
> > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Cc: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>,
> "dev@dpdk.org"
> >  <dev@dpdk.org>, Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>,
> > "stable@dpdk.org" <stable@dpdk.org>, Ola Liljedahl
> > <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
> >  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
> >  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
> >  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
> >  <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and
> > store of  the tail
> >
> >
> > 27/10/2018 17:00, Jerin Jacob:
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > Hi Jerin
> > > > >
> > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> patch series to unblock the merge.
> > > >
> > > > The thread is totally messed up because:
> > > >         - there is no cover letter
> > > >         - some different series (testpmd, i40e and doc) are in the same
> thread
> > > >         - v4 replies to a different series
> > > >         - this version should be a v5 but has no number
> > > >         - this version replies to the v3
> > > >         - patchwork still shows v3 and "v5"
> > > >         - replies from Ola are not quoting previous discussion
> > > >
> > > > Because of all of this, it is really difficult to follow.
> > > > This is probably the reason of the lack of review outside of Arm.
> > > >
> > > > One more issue: you must Cc the relevant maintainers.
> > > > Here:
> > > >         - Olivier for rte_ring
> > > >         - Chao for IBM platform
> > > >         - Bruce and Konstantin for x86
> > > >
> > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > But please consider better formatting your discussions, it is
> > > > really important in our contribution workflow.
> > > >
> > > > I don't know what to do.
> > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > >
> > > This series has been acked and tested. Sure, if we are looking for
> > > some more feedback we can push to -rc2 if not it a good candidate to
> > > be selected for -rc1.
> >
> > It has been acked and tested only for Arm platforms.
> > And Olivier, the ring maintainer, was not Cc.
> >
> > I feel it is not enough.
>
> Sure, More reviews is already better. But lets keep as -rc2 target.
>
>
> >
> >
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
  
Jerin Jacob Oct. 29, 2018, 10:16 a.m. UTC | #9
-----Original Message-----
> Date: Mon, 29 Oct 2018 02:57:17 +0000
> From: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>, Thomas Monjalon
>  <thomas@monjalon.net>
> CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "stable@dpdk.org" <stable@dpdk.org>, Ola
>  Liljedahl <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
>  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
>  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
>  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
>  <konstantin.ananyev@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
>  the tail
> 
> 
> Hi Thomas and Jerin,
> 
> The patches were extensively reviewed by Arm internally, as the 1st patch was not able to be concluded, I created a new patch series(2 patches).
> How can I clean up this mess?
> 1. make all the previous patches Superseded?
> 2. We have two more new patches, should I submit the 4 patches (the old 2 patches + 2 new patches) with V2?

I would suggest to supersede the old patches(not in this case, in any case when you
send new version and update the version number).

I would suggest send new patches as separate series. If it has dependency on
exiting Acked patches please mention that in cover letter.


> 
> Best Regards,
> Gavin
> 
> 
> > -----Original Message-----
> > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Sent: Saturday, October 27, 2018 11:34 PM
> > To: Thomas Monjalon <thomas@monjalon.net>
> > Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org;
> > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org;
> > Ola Liljedahl <Ola.Liljedahl@arm.com>; olivier.matz@6wind.com;
> > chaozhu@linux.vnet.ibm.com; bruce.richardson@intel.com;
> > konstantin.ananyev@intel.com
> > Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
> > the tail
> >
> > -----Original Message-----
> > > Date: Sat, 27 Oct 2018 17:13:10 +0200
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > Cc: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>,
> > "dev@dpdk.org"
> > >  <dev@dpdk.org>, Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>,
> > > "stable@dpdk.org" <stable@dpdk.org>, Ola Liljedahl
> > > <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
> > >  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
> > >  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
> > >  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
> > >  <konstantin.ananyev@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and
> > > store of  the tail
> > >
> > >
> > > 27/10/2018 17:00, Jerin Jacob:
> > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > > Hi Jerin
> > > > > >
> > > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> > patch series to unblock the merge.
> > > > >
> > > > > The thread is totally messed up because:
> > > > >         - there is no cover letter
> > > > >         - some different series (testpmd, i40e and doc) are in the same
> > thread
> > > > >         - v4 replies to a different series
> > > > >         - this version should be a v5 but has no number
> > > > >         - this version replies to the v3
> > > > >         - patchwork still shows v3 and "v5"
> > > > >         - replies from Ola are not quoting previous discussion
> > > > >
> > > > > Because of all of this, it is really difficult to follow.
> > > > > This is probably the reason of the lack of review outside of Arm.
> > > > >
> > > > > One more issue: you must Cc the relevant maintainers.
> > > > > Here:
> > > > >         - Olivier for rte_ring
> > > > >         - Chao for IBM platform
> > > > >         - Bruce and Konstantin for x86
> > > > >
> > > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > > But please consider better formatting your discussions, it is
> > > > > really important in our contribution workflow.
> > > > >
> > > > > I don't know what to do.
> > > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > > >
> > > > This series has been acked and tested. Sure, if we are looking for
> > > > some more feedback we can push to -rc2 if not it a good candidate to
> > > > be selected for -rc1.
> > >
> > > It has been acked and tested only for Arm platforms.
> > > And Olivier, the ring maintainer, was not Cc.
> > >
> > > I feel it is not enough.
> >
> > Sure, More reviews is already better. But lets keep as -rc2 target.
> >
> >
> > >
> > >
> IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
  
Thomas Monjalon Oct. 29, 2018, 10:47 a.m. UTC | #10
29/10/2018 11:16, Jerin Jacob:
> From: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> > 
> > Hi Thomas and Jerin,
> > 
> > The patches were extensively reviewed by Arm internally, as the 1st patch was not able to be concluded, I created a new patch series(2 patches).
> > How can I clean up this mess?
> > 1. make all the previous patches Superseded?
> > 2. We have two more new patches, should I submit the 4 patches (the old 2 patches + 2 new patches) with V2?
> 
> I would suggest to supersede the old patches(not in this case, in any case when you
> send new version and update the version number).

Why not in this case?
There are some old patches in patchwork which should be superseded.

> I would suggest send new patches as separate series. If it has dependency on
> exiting Acked patches please mention that in cover letter.

I would suggest also to stop top-posting, it doesn't help reading threads.


> > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > 27/10/2018 17:00, Jerin Jacob:
> > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > > > Hi Jerin
> > > > > > >
> > > > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> > > patch series to unblock the merge.
> > > > > >
> > > > > > The thread is totally messed up because:
> > > > > >         - there is no cover letter
> > > > > >         - some different series (testpmd, i40e and doc) are in the same
> > > thread
> > > > > >         - v4 replies to a different series
> > > > > >         - this version should be a v5 but has no number
> > > > > >         - this version replies to the v3
> > > > > >         - patchwork still shows v3 and "v5"
> > > > > >         - replies from Ola are not quoting previous discussion
> > > > > >
> > > > > > Because of all of this, it is really difficult to follow.
> > > > > > This is probably the reason of the lack of review outside of Arm.
> > > > > >
> > > > > > One more issue: you must Cc the relevant maintainers.
> > > > > > Here:
> > > > > >         - Olivier for rte_ring
> > > > > >         - Chao for IBM platform
> > > > > >         - Bruce and Konstantin for x86
> > > > > >
> > > > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > > > But please consider better formatting your discussions, it is
> > > > > > really important in our contribution workflow.
> > > > > >
> > > > > > I don't know what to do.
> > > > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > > > >
> > > > > This series has been acked and tested. Sure, if we are looking for
> > > > > some more feedback we can push to -rc2 if not it a good candidate to
> > > > > be selected for -rc1.
> > > >
> > > > It has been acked and tested only for Arm platforms.
> > > > And Olivier, the ring maintainer, was not Cc.
> > > >
> > > > I feel it is not enough.
> > >
> > > Sure, More reviews is already better. But lets keep as -rc2 target.
> > >
> > >
> > > >
> > > >
> > IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
  
Jerin Jacob Oct. 29, 2018, 11:10 a.m. UTC | #11
-----Original Message-----
> Date: Mon, 29 Oct 2018 11:47:05 +0100
> From: Thomas Monjalon <thomas@monjalon.net>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>, "Gavin Hu (Arm Technology
>  China)" <Gavin.Hu@arm.com>
> Cc: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
>  <Honnappa.Nagarahalli@arm.com>, "stable@dpdk.org" <stable@dpdk.org>, Ola
>  Liljedahl <Ola.Liljedahl@arm.com>, "olivier.matz@6wind.com"
>  <olivier.matz@6wind.com>, "chaozhu@linux.vnet.ibm.com"
>  <chaozhu@linux.vnet.ibm.com>, "bruce.richardson@intel.com"
>  <bruce.richardson@intel.com>, "konstantin.ananyev@intel.com"
>  <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
>  the tail
> 
> 
> 29/10/2018 11:16, Jerin Jacob:
> > From: "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>
> > >
> > > Hi Thomas and Jerin,
> > >
> > > The patches were extensively reviewed by Arm internally, as the 1st patch was not able to be concluded, I created a new patch series(2 patches).
> > > How can I clean up this mess?
> > > 1. make all the previous patches Superseded?
> > > 2. We have two more new patches, should I submit the 4 patches (the old 2 patches + 2 new patches) with V2?
> >
> > I would suggest to supersede the old patches(not in this case, in any case when you
> > send new version and update the version number).
> 
> Why not in this case?

I did not mean in this case particular. I meant in all cases.

> There are some old patches in patchwork which should be superseded.
> 
> > I would suggest send new patches as separate series. If it has dependency on
> > exiting Acked patches please mention that in cover letter.
> 
> I would suggest also to stop top-posting, it doesn't help reading threads.
> 
> 
> > > From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > 27/10/2018 17:00, Jerin Jacob:
> > > > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > > > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > > > > > Hi Jerin
> > > > > > > >
> > > > > > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> > > > patch series to unblock the merge.
> > > > > > >
> > > > > > > The thread is totally messed up because:
> > > > > > >         - there is no cover letter
> > > > > > >         - some different series (testpmd, i40e and doc) are in the same
> > > > thread
> > > > > > >         - v4 replies to a different series
> > > > > > >         - this version should be a v5 but has no number
> > > > > > >         - this version replies to the v3
> > > > > > >         - patchwork still shows v3 and "v5"
> > > > > > >         - replies from Ola are not quoting previous discussion
> > > > > > >
> > > > > > > Because of all of this, it is really difficult to follow.
> > > > > > > This is probably the reason of the lack of review outside of Arm.
> > > > > > >
> > > > > > > One more issue: you must Cc the relevant maintainers.
> > > > > > > Here:
> > > > > > >         - Olivier for rte_ring
> > > > > > >         - Chao for IBM platform
> > > > > > >         - Bruce and Konstantin for x86
> > > > > > >
> > > > > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > > > > But please consider better formatting your discussions, it is
> > > > > > > really important in our contribution workflow.
> > > > > > >
> > > > > > > I don't know what to do.
> > > > > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > > > > >
> > > > > > This series has been acked and tested. Sure, if we are looking for
> > > > > > some more feedback we can push to -rc2 if not it a good candidate to
> > > > > > be selected for -rc1.
> > > > >
> > > > > It has been acked and tested only for Arm platforms.
> > > > > And Olivier, the ring maintainer, was not Cc.
> > > > >
> > > > > I feel it is not enough.
> > > >
> > > > Sure, More reviews is already better. But lets keep as -rc2 target.
> > > >
> > > >
> > > > >
> > > > >
> > > IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
> 
> 
> 
> 
>
  
Mattias Rönnblom Nov. 3, 2018, 8:12 p.m. UTC | #12
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Thomas Monjalon
> Sent: den 27 oktober 2018 17:13
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> stable@dpdk.org; Ola Liljedahl <Ola.Liljedahl@arm.com>;
> olivier.matz@6wind.com; chaozhu@linux.vnet.ibm.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com
> Subject: Re: [dpdk-dev] [PATCH 1/2] ring: synchronize the load and store of
> the tail
> 
> 27/10/2018 17:00, Jerin Jacob:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > > 17/10/2018 08:35, Gavin Hu (Arm Technology China):
> > > > Hi Jerin
> > > >
> > > > As the 1st one of the 3-patch set was not concluded, I submit this 2-
> patch series to unblock the merge.
> > >
> > > The thread is totally messed up because:
> > >         - there is no cover letter
> > >         - some different series (testpmd, i40e and doc) are in the same
> thread
> > >         - v4 replies to a different series
> > >         - this version should be a v5 but has no number
> > >         - this version replies to the v3
> > >         - patchwork still shows v3 and "v5"
> > >         - replies from Ola are not quoting previous discussion
> > >
> > > Because of all of this, it is really difficult to follow.
> > > This is probably the reason of the lack of review outside of Arm.
> > >
> > > One more issue: you must Cc the relevant maintainers.
> > > Here:
> > >         - Olivier for rte_ring
> > >         - Chao for IBM platform
> > >         - Bruce and Konstantin for x86
> > >
> > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > But please consider better formatting your discussions, it is really
> > > important in our contribution workflow.
> > >
> > > I don't know what to do.
> > > I suggest to wait for more feedbacks and integrate it in -rc2.
> >
> > This series has been acked and tested. Sure, if we are looking for
> > some more feedback we can push to -rc2 if not it a good candidate to
> > be selected for -rc1.
> 
> It has been acked and tested only for Arm platforms.
> And Olivier, the ring maintainer, was not Cc.
> 
> I feel it is not enough.
> 

I've just run an out-of-tree test program I have for the DSW scheduler, which verify scheduler atomic semantics. The results are:
Non-C11 mode: pass
C11 mode before this patch set: fail
C11 mode after this patch set: pass

This suggests the current C11 mode is broken even on x86_64. I haven't been following this thread closely, so maybe this is known already.

I've also run an out-of-tree DSW throughput benchmark, and I've found that going from Non-C11 to C11 gives a 4% slowdown. After this patch, the slowdown is only 2,8%.

GCC 7.3.0 and a Skylake x86_64.
  
Honnappa Nagarahalli Nov. 5, 2018, 9:51 p.m. UTC | #13
> >
> > 27/10/2018 17:00, Jerin Jacob:
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > > > The thread is totally messed up because:
> > > >         - there is no cover letter
> > > >         - some different series (testpmd, i40e and doc) are in the
> > > > same
> > thread
> > > >         - v4 replies to a different series
> > > >         - this version should be a v5 but has no number
> > > >         - this version replies to the v3
> > > >         - patchwork still shows v3 and "v5"
> > > >         - replies from Ola are not quoting previous discussion
> > > >
> > > > Because of all of this, it is really difficult to follow.
> > > > This is probably the reason of the lack of review outside of Arm.
> > > >
> > > > One more issue: you must Cc the relevant maintainers.
> > > > Here:
> > > >         - Olivier for rte_ring
> > > >         - Chao for IBM platform
> > > >         - Bruce and Konstantin for x86
> > > >
> > > > Guys, it is really cool to have more Arm developpers in DPDK.
> > > > But please consider better formatting your discussions, it is
> > > > really important in our contribution workflow.
> > > >
> > > > I don't know what to do.
> > > > I suggest to wait for more feedbacks and integrate it in -rc2.
> > >
> > > This series has been acked and tested. Sure, if we are looking for
> > > some more feedback we can push to -rc2 if not it a good candidate to
> > > be selected for -rc1.
> >
> > It has been acked and tested only for Arm platforms.
> > And Olivier, the ring maintainer, was not Cc.
> >
> > I feel it is not enough.
> >
> 
> I've just run an out-of-tree test program I have for the DSW scheduler, which
> verify scheduler atomic semantics. The results are:
> Non-C11 mode: pass
> C11 mode before this patch set: fail
> C11 mode after this patch set: pass
> 
> This suggests the current C11 mode is broken even on x86_64. I haven't
> been following this thread closely, so maybe this is known already.
> 
> I've also run an out-of-tree DSW throughput benchmark, and I've found that
> going from Non-C11 to C11 gives a 4% slowdown. After this patch, the
> slowdown is only 2,8%.
This is interesting. The general understanding seems to be that C11 atomics should not add any additional instructions on x86. But, we still see some drop in performance. Is this attributed to compiler not being allowed to re-order?

> 
> GCC 7.3.0 and a Skylake x86_64.
  
Mattias Rönnblom Nov. 6, 2018, 11:03 a.m. UTC | #14
On 2018-11-05 22:51, Honnappa Nagarahalli wrote:
>> I've also run an out-of-tree DSW throughput benchmark, and I've found that
>> going from Non-C11 to C11 gives a 4% slowdown. After this patch, the
>> slowdown is only 2,8%.
> This is interesting. The general understanding seems to be that C11 atomics should not add any additional instructions on x86. But, we still see some drop in performance. Is this attributed to compiler not being allowed to re-order?
> 

I was lazy enough not to disassemble, so I don't know.

I would suggest non-C11 mode stays as the default on x86_64.
  

Patch

diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 94df3c4..4851763 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -67,13 +67,18 @@  __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		*old_head = __atomic_load_n(&r->prod.head,
 					__ATOMIC_ACQUIRE);
 
-		/*
-		 *  The subtraction is done between two unsigned 32bits value
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t cons_tail = __atomic_load_n(&r->cons.tail,
+							__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * *old_head > cons_tail). So 'free_entries' is always between 0
 		 * and capacity (which is < size).
 		 */
-		*free_entries = (capacity + r->cons.tail - *old_head);
+		*free_entries = (capacity + cons_tail - *old_head);
 
 		/* check that we have enough room in ring */
 		if (unlikely(n > *free_entries))
@@ -131,15 +136,22 @@  __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
+
 		*old_head = __atomic_load_n(&r->cons.head,
 					__ATOMIC_ACQUIRE);
 
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		const uint32_t prod_tail = __atomic_load_n(&r->prod.tail,
+					__ATOMIC_ACQUIRE);
+
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
 		 * cons_head > prod_tail). So 'entries' is always between 0
 		 * and size(ring)-1.
 		 */
-		*entries = (r->prod.tail - *old_head);
+		*entries = (prod_tail - *old_head);
 
 		/* Set the actual entries for dequeue */
 		if (n > *entries)