[v5,2/2] ring: move the atomic load of head above the loop

Message ID	1541157688-40012-3-git-send-email-gavin.hu@arm.com (mailing list archive)
State	Accepted, archived
Delegated to:	Thomas Monjalon
Headers	From: Gavin Hu <gavin.hu@arm.com> To: dev@dpdk.org Cc: thomas@monjalon.net, stephen@networkplumber.org, olivier.matz@6wind.com, chaozhu@linux.vnet.ibm.com, bruce.richardson@intel.com, konstantin.ananyev@intel.com, jerin.jacob@caviumnetworks.com, Honnappa.Nagarahalli@arm.com, gavin.hu@arm.com, stable@dpdk.org Date: Fri, 2 Nov 2018 19:21:28 +0800 Message-Id: <1541157688-40012-3-git-send-email-gavin.hu@arm.com> In-Reply-To: <1541157688-40012-1-git-send-email-gavin.hu@arm.com> References: <1541157688-40012-1-git-send-email-gavin.hu@arm.com> In-Reply-To: <1541066031-29125-1-git-send-email-gavin.hu@arm.com> References: <1541066031-29125-1-git-send-email-gavin.hu@arm.com> Subject: [dpdk-dev] [PATCH v5 2/2] ring: move the atomic load of head above the loop Precedence: list Errors-To: dev-bounces@dpdk.org Sender: "dev" <dev-bounces@dpdk.org>
Series	ring library with c11 memory model bug fix and optimization \| [v5,0/2] ring library with c11 memory model bug fix and optimization [v5,1/2] ring: synchronize the load and store of the tail [v5,2/2] ring: move the atomic load of head above the loop

Checks

Context	Check	Description
ci/Intel-compilation	success	Compilation OK
ci/checkpatch	success	coding style OK

Commit Message

Gavin Hu Nov. 2, 2018, 11:21 a.m. UTC

In __rte_ring_move_prod_head and __rte_ring_move_cons_head, move the
__atomic_load_n up and out of the do {} while loop. On failure, the
compare-exchange already updates old_head, so another load is costly
and unnecessary.

This helps latency a little, by about 1~5%.

 Test result with the patch (two cores):
 SP/SC bulk enq/dequeue (size: 8): 5.64
 MP/MC bulk enq/dequeue (size: 8): 9.58
 SP/SC bulk enq/dequeue (size: 32): 1.98
 MP/MC bulk enq/dequeue (size: 32): 2.30

Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier option")
Cc: stable@dpdk.org

Signed-off-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
Reviewed-by: Jia He <justin.he@arm.com>
Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 doc/guides/rel_notes/release_18_11.rst |  7 +++++++
 lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
 2 files changed, 11 insertions(+), 6 deletions(-)
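
The reasoning in the commit message can be illustrated with a small standalone sketch. It is not the DPDK code: struct demo_head, demo_move_head and demo_cas_refreshes_expected are hypothetical names made up for this example, and the memory orderings are only chosen to resemble the patch. It relies on the GCC/Clang __atomic builtins, where a failed __atomic_compare_exchange_n writes the current value of the target into the "expected" argument, which is why one load before the loop should be enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a ring head index; not the DPDK structure. */
struct demo_head {
	uint32_t head;
};

/*
 * Advance h->head by n and return the value it held before the update.
 * If retries is non-NULL, the number of failed attempts is stored there.
 */
static uint32_t
demo_move_head(struct demo_head *h, uint32_t n, unsigned int *retries)
{
	uint32_t old_head, new_head;
	unsigned int fails = 0;

	assert(h != NULL);

	/* Single explicit load, hoisted above the retry loop. */
	old_head = __atomic_load_n(&h->head, __ATOMIC_ACQUIRE);
	for (;;) {
		new_head = old_head + n;
		/* On failure the builtin stores the current h->head
		 * into old_head, so no reload is needed here.
		 */
		if (__atomic_compare_exchange_n(&h->head, &old_head, new_head,
				0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			break;
		fails++;
	}
	if (retries != NULL)
		*retries = fails;
	return old_head;
}

/* Return 1 if a failed compare-exchange refreshed the stale expected value. */
static int
demo_cas_refreshes_expected(void)
{
	uint32_t head = 10;
	uint32_t expected = 7; /* deliberately stale */
	int ok;

	ok = __atomic_compare_exchange_n(&head, &expected, 8,
			0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
	return !ok && expected == 10 && head == 10;
}
```

Calling demo_move_head() on a head of 5 with n = 3 should return 5 and leave the head at 8. Under contention the loop simply retries with the value the failed compare-exchange handed back. The real functions additionally recompute the free or available entries from the tail on every iteration, which this sketch leaves out.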

Comments

Bruce Richardson Nov. 2, 2018, 11:43 a.m. UTC | #1

On Fri, Nov 02, 2018 at 07:21:28PM +0800, Gavin Hu wrote:
> In __rte_ring_move_prod_head, move the __atomic_load_n up and out of
> the do {} while loop as upon failure the old_head will be updated,
> another load is costly and not necessary.
> 
> This helps a little on the latency,about 1~5%.
> 
>  Test result with the patch(two cores):
>  SP/SC bulk enq/dequeue (size: 8): 5.64
>  MP/MC bulk enq/dequeue (size: 8): 9.58
>  SP/SC bulk enq/dequeue (size: 32): 1.98
>  MP/MC bulk enq/dequeue (size: 32): 2.30
> 
> Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier option")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> Reviewed-by: Jia He <justin.he@arm.com>
> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
>  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
>  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/release_18_11.rst b/doc/guides/rel_notes/release_18_11.rst
> index 376128f..b68afab 100644
> --- a/doc/guides/rel_notes/release_18_11.rst
> +++ b/doc/guides/rel_notes/release_18_11.rst
> @@ -69,6 +69,13 @@ New Features
>    checked out against that dma mask and rejected if out of range. If more than
>    one device has addressing limitations, the dma mask is the more restricted one.
>  
> +* **Updated the ring library with C11 memory model.**
> +
> +  Updated the ring library with C11 memory model, in our tests the changes
> +  decreased latency by 27~29% and 3~15% for MPMC and SPSC cases respectively.
> +  The real improvements may vary with the number of contending lcores and the
> +  size of ring.
> +
Is this a little misleading, and will users expect massive performance
improvements generally? The C11 model seems to be used only on some, but
not all, arm platforms, and then only with "make" builds.

config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
config/common_armv8a_linuxapp:CONFIG_RTE_USE_C11_MEM_MODEL=y
config/common_base:CONFIG_RTE_USE_C11_MEM_MODEL=n
config/defconfig_arm64-thunderx-linuxapp-gcc:CONFIG_RTE_USE_C11_MEM_MODEL=n

/Bruce
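
The config lines above matter because the C11 path is chosen at build time: as far as I recall, rte_ring.h includes either rte_ring_c11_mem.h or rte_ring_generic.h depending on whether RTE_USE_C11_MEM_MODEL is defined. A minimal sketch of that kind of compile-time switch follows; demo_ring_mem_model is a hypothetical helper invented for illustration, not a DPDK API.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper, not part of DPDK: report which ring implementation
 * this translation unit was built for. The choice is fixed when the code
 * is compiled, so a build without the macro never runs the C11 path.
 */
static const char *
demo_ring_mem_model(void)
{
#ifdef RTE_USE_C11_MEM_MODEL
	return "c11";
#else
	return "generic";
#endif
}
```

Built with -DRTE_USE_C11_MEM_MODEL the helper reports "c11", otherwise "generic". A platform whose build config leaves the option off would therefore see no change from this series.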

Gavin Hu Nov. 3, 2018, 1:19 a.m. UTC | #2

> -----Original Message-----
> From: Bruce Richardson <bruce.richardson@intel.com>
> Sent: Friday, November 2, 2018 7:44 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> Cc: dev@dpdk.org; thomas@monjalon.net; stephen@networkplumber.org;
> olivier.matz@6wind.com; chaozhu@linux.vnet.ibm.com;
> konstantin.ananyev@intel.com; jerin.jacob@caviumnetworks.com;
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> Subject: Re: [PATCH v5 2/2] ring: move the atomic load of head above the
> loop
>
> On Fri, Nov 02, 2018 at 07:21:28PM +0800, Gavin Hu wrote:
> > In __rte_ring_move_prod_head, move the __atomic_load_n up and out
> of
> > the do {} while loop as upon failure the old_head will be updated,
> > another load is costly and not necessary.
> >
> > This helps a little on the latency,about 1~5%.
> >
> >  Test result with the patch(two cores):
> >  SP/SC bulk enq/dequeue (size: 8): 5.64  MP/MC bulk enq/dequeue (size:
> > 8): 9.58  SP/SC bulk enq/dequeue (size: 32): 1.98  MP/MC bulk
> > enq/dequeue (size: 32): 2.30
> >
> > Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier
> > option")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > Reviewed-by: Jia He <justin.he@arm.com>
> > Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > ---
> >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> >  2 files changed, 11 insertions(+), 6 deletions(-)
> >
> > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > b/doc/guides/rel_notes/release_18_11.rst
> > index 376128f..b68afab 100644
> > --- a/doc/guides/rel_notes/release_18_11.rst
> > +++ b/doc/guides/rel_notes/release_18_11.rst
> > @@ -69,6 +69,13 @@ New Features
> >    checked out against that dma mask and rejected if out of range. If more
> than
> >    one device has addressing limitations, the dma mask is the more
> restricted one.
> >
> > +* **Updated the ring library with C11 memory model.**
> > +
> > +  Updated the ring library with C11 memory model, in our tests the
> > + changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC
> cases respectively.
> > +  The real improvements may vary with the number of contending lcores
> > + and the  size of ring.
> > +
> Is this a little misleading, and will users expect massive performance
> improvements generally? The C11 model seems to be used only on some,
> but not all, arm platforms, and then only with "make" builds.
>
> config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
> config/common_armv8a_linuxapp:CONFIG_RTE_USE_C11_MEM_MODEL=y
> config/common_base:CONFIG_RTE_USE_C11_MEM_MODEL=n
> config/defconfig_arm64-thunderx-linuxapp-
> gcc:CONFIG_RTE_USE_C11_MEM_MODEL=n
>
> /Bruce

Thank you, Bruce, for the review. To limit the scope of the claimed improvement, I have rewritten the note as follows. Could you help review it? Feel free to change anything you like.
"Updated the ring library with the C11 memory model. Running ring_perf_autotest on the Cavium ThunderX2 platform, the changes decreased latency by 27~29% and 3~15% for the MPMC and SPSC cases (2 lcores) respectively. Note that the changes help relaxed memory ordering architectures (arm, ppc) only when CONFIG_RTE_USE_C11_MEM_MODEL=y is configured, and have no impact on strong memory ordering architectures like x86. The extent to which they help real use cases depends on other factors, such as the number of contending readers/writers, the size of the ring, and whether or not the ring is on the critical path."

/Gavin

Honnappa Nagarahalli Nov. 3, 2018, 9:34 a.m. UTC | #3

> > > ---
> > >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> > >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > > b/doc/guides/rel_notes/release_18_11.rst
> > > index 376128f..b68afab 100644
> > > --- a/doc/guides/rel_notes/release_18_11.rst
> > > +++ b/doc/guides/rel_notes/release_18_11.rst
> > > @@ -69,6 +69,13 @@ New Features
> > >    checked out against that dma mask and rejected if out of range.
> > > If more
> > than
> > >    one device has addressing limitations, the dma mask is the more
> > restricted one.
> > >
> > > +* **Updated the ring library with C11 memory model.**
> > > +
> > > +  Updated the ring library with C11 memory model, in our tests the
> > > + changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC
> > cases respectively.
> > > +  The real improvements may vary with the number of contending
> > > + lcores and the  size of ring.
> > > +
> > Is this a little misleading, and will users expect massive performance
> > improvements generally? The C11 model seems to be used only on some,
> > but not all, arm platforms, and then only with "make" builds.
> >
> > config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
This is an error. There is already an agreement that, on Arm-based platforms, the C11 memory model is used by default. Specific platforms can override it if required.
Would this be an acceptable change for RC2 or RC3?

> > config/common_armv8a_linuxapp:CONFIG_RTE_USE_C11_MEM_MODEL=y
> > config/common_base:CONFIG_RTE_USE_C11_MEM_MODEL=n
> > config/defconfig_arm64-thunderx-linuxapp-
> > gcc:CONFIG_RTE_USE_C11_MEM_MODEL=n
> >
> > /Bruce
> 
> Thank you Bruce for the review, to limit the scope of improvement, I rewrite
> the note as follows, could you help review? Feel free to change anything if you
> like.
> " Updated the ring library with C11 memory model, running
> ring_perf_autotest on Cavium ThunderX2 platform, the changes  decreased
> latency by 27~29% and 3~15% for MPMC and SPSC cases (2 lcores)
> respectively. Note the changes help the relaxed memory ordering
> architectures (arm, ppc) only when CONFIG_RTE_USE_C11_MEM_MODEL=y
> was configured, no impact on strong memory ordering architectures like x86.
> To what extent they help the real use cases depends on other factors, like the
> number of contending readers/writers, size of the ring, whether or not it is on
> the critical path."
> 
> /Gavin
IMO, mentioning the performance numbers requires mentioning the system configuration. I suggest we keep this somewhat vague (which will prompt users to run the test on their specific platform) and simple. Can I suggest the following:

"C11 memory model algorithm for ring library is updated. This results in improved performance on some Arm based platforms."

Olivier Matz Nov. 5, 2018, 9:44 a.m. UTC | #4

Hi,

On Sat, Nov 03, 2018 at 01:19:29AM +0000, Gavin Hu (Arm Technology China) wrote:
> 
> 
> > -----Original Message-----
> > From: Bruce Richardson <bruce.richardson@intel.com>
> > Sent: Friday, November 2, 2018 7:44 PM
> > To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>
> > Cc: dev@dpdk.org; thomas@monjalon.net; stephen@networkplumber.org;
> > olivier.matz@6wind.com; chaozhu@linux.vnet.ibm.com;
> > konstantin.ananyev@intel.com; jerin.jacob@caviumnetworks.com;
> > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org
> > Subject: Re: [PATCH v5 2/2] ring: move the atomic load of head above the
> > loop
> >
> > On Fri, Nov 02, 2018 at 07:21:28PM +0800, Gavin Hu wrote:
> > > In __rte_ring_move_prod_head, move the __atomic_load_n up and out
> > of
> > > the do {} while loop as upon failure the old_head will be updated,
> > > another load is costly and not necessary.
> > >
> > > This helps a little on the latency,about 1~5%.
> > >
> > >  Test result with the patch(two cores):
> > >  SP/SC bulk enq/dequeue (size: 8): 5.64  MP/MC bulk enq/dequeue (size:
> > > 8): 9.58  SP/SC bulk enq/dequeue (size: 32): 1.98  MP/MC bulk
> > > enq/dequeue (size: 32): 2.30
> > >
> > > Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier
> > > option")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > Reviewed-by: Jia He <justin.he@arm.com>
> > > Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > ---
> > >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> > >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > > b/doc/guides/rel_notes/release_18_11.rst
> > > index 376128f..b68afab 100644
> > > --- a/doc/guides/rel_notes/release_18_11.rst
> > > +++ b/doc/guides/rel_notes/release_18_11.rst
> > > @@ -69,6 +69,13 @@ New Features
> > >    checked out against that dma mask and rejected if out of range. If more
> > than
> > >    one device has addressing limitations, the dma mask is the more
> > restricted one.
> > >
> > > +* **Updated the ring library with C11 memory model.**
> > > +
> > > +  Updated the ring library with C11 memory model, in our tests the
> > > + changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC
> > cases respectively.
> > > +  The real improvements may vary with the number of contending lcores
> > > + and the  size of ring.
> > > +
> > Is this a little misleading, and will users expect massive performance
> > improvements generally? The C11 model seems to be used only on some,
> > but not all, arm platforms, and then only with "make" builds.
> >
> > config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
> > config/common_armv8a_linuxapp:CONFIG_RTE_USE_C11_MEM_MODEL=y
> > config/common_base:CONFIG_RTE_USE_C11_MEM_MODEL=n
> > config/defconfig_arm64-thunderx-linuxapp-
> > gcc:CONFIG_RTE_USE_C11_MEM_MODEL=n
> >
> > /Bruce
> 
> Thank you Bruce for the review, to limit the scope of improvement, I rewrite the note as follows, could you help review? Feel free to change anything if you like.
> " Updated the ring library with C11 memory model, running ring_perf_autotest on Cavium ThunderX2 platform, the changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC cases (2 lcores) respectively. Note the changes help the relaxed memory ordering architectures (arm, ppc) only when CONFIG_RTE_USE_C11_MEM_MODEL=y was configured, no impact on strong memory ordering architectures like x86. To what extent they help the real use cases depends on other factors, like the number of contending readers/writers, size of the ring, whether or not it is on the critical path."

I prefer your initial proposal which is more concise. What about
something like this?


* **Updated the C11 memory model version of ring library.**

  The latency is decreased for architectures using the C11 memory model
  version of the ring library.

  On Cavium ThunderX2 platform, the changes decreased latency by 27~29%
  and 3~15% for MPMC and SPSC cases respectively (with 2 lcores). The
  real improvements may vary with the number of contending lcores and
  the size of ring.


About the patch itself:
Acked-by: Olivier Matz <olivier.matz@6wind.com>

Thanks

Thomas Monjalon Nov. 5, 2018, 1:17 p.m. UTC | #5

03/11/2018 10:34, Honnappa Nagarahalli:
> > > > ---
> > > >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> > > >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> > > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > > > b/doc/guides/rel_notes/release_18_11.rst
> > > > index 376128f..b68afab 100644
> > > > --- a/doc/guides/rel_notes/release_18_11.rst
> > > > +++ b/doc/guides/rel_notes/release_18_11.rst
> > > > @@ -69,6 +69,13 @@ New Features
> > > >    checked out against that dma mask and rejected if out of range.
> > > > If more
> > > than
> > > >    one device has addressing limitations, the dma mask is the more
> > > restricted one.
> > > >
> > > > +* **Updated the ring library with C11 memory model.**
> > > > +
> > > > +  Updated the ring library with C11 memory model, in our tests the
> > > > + changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC
> > > cases respectively.
> > > > +  The real improvements may vary with the number of contending
> > > > + lcores and the  size of ring.
> > > > +
> > > Is this a little misleading, and will users expect massive performance
> > > improvements generally? The C11 model seems to be used only on some,
> > > but not all, arm platforms, and then only with "make" builds.
> > >
> > > config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
> This is an error. There is already an agreement that on Arm based platforms, C11 memory model would be used by default. Specific platforms can override it if required.
> Would this be an acceptable change for RC2 or RC3?

If NXP and Cavium agree, I think it can go in RC2.
For RC3, not sure.

Thomas Monjalon Nov. 5, 2018, 1:36 p.m. UTC | #6

05/11/2018 10:44, Olivier Matz:
> Hi,
> 
> On Sat, Nov 03, 2018 at 01:19:29AM +0000, Gavin Hu (Arm Technology China) wrote:
> > From: Bruce Richardson <bruce.richardson@intel.com>
> > > On Fri, Nov 02, 2018 at 07:21:28PM +0800, Gavin Hu wrote:
> > > > In __rte_ring_move_prod_head, move the __atomic_load_n up and out
> > > of
> > > > the do {} while loop as upon failure the old_head will be updated,
> > > > another load is costly and not necessary.
> > > >
> > > > This helps a little on the latency,about 1~5%.
> > > >
> > > >  Test result with the patch(two cores):
> > > >  SP/SC bulk enq/dequeue (size: 8): 5.64  MP/MC bulk enq/dequeue (size:
> > > > 8): 9.58  SP/SC bulk enq/dequeue (size: 32): 1.98  MP/MC bulk
> > > > enq/dequeue (size: 32): 2.30
> > > >
> > > > Fixes: 39368ebfc606 ("ring: introduce C11 memory model barrier
> > > > option")
> > > > Cc: stable@dpdk.org
> > > >
> > > > Signed-off-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > Reviewed-by: Ola Liljedahl <Ola.Liljedahl@arm.com>
> > > > Reviewed-by: Jia He <justin.he@arm.com>
> > > > Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > > Tested-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > > ---
> > > >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> > > >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> > > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > > > b/doc/guides/rel_notes/release_18_11.rst
> > > > index 376128f..b68afab 100644
> > > > --- a/doc/guides/rel_notes/release_18_11.rst
> > > > +++ b/doc/guides/rel_notes/release_18_11.rst
> > > > @@ -69,6 +69,13 @@ New Features
> > > >    checked out against that dma mask and rejected if out of range. If more
> > > than
> > > >    one device has addressing limitations, the dma mask is the more
> > > restricted one.
> > > >
> > > > +* **Updated the ring library with C11 memory model.**
> > > > +
> > > > +  Updated the ring library with C11 memory model, in our tests the
> > > > + changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC
> > > cases respectively.
> > > > +  The real improvements may vary with the number of contending lcores
> > > > + and the  size of ring.
> > > > +
> > > Is this a little misleading, and will users expect massive performance
> > > improvements generally? The C11 model seems to be used only on some,
> > > but not all, arm platforms, and then only with "make" builds.
> > >
> > > config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
> > > config/common_armv8a_linuxapp:CONFIG_RTE_USE_C11_MEM_MODEL=y
> > > config/common_base:CONFIG_RTE_USE_C11_MEM_MODEL=n
> > > config/defconfig_arm64-thunderx-linuxapp-
> > > gcc:CONFIG_RTE_USE_C11_MEM_MODEL=n
> > >
> > > /Bruce
> > 
> > Thank you Bruce for the review, to limit the scope of improvement, I rewrite the note as follows, could you help review? Feel free to change anything if you like.
> > " Updated the ring library with C11 memory model, running ring_perf_autotest on Cavium ThunderX2 platform, the changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC cases (2 lcores) respectively. Note the changes help the relaxed memory ordering architectures (arm, ppc) only when CONFIG_RTE_USE_C11_MEM_MODEL=y was configured, no impact on strong memory ordering architectures like x86. To what extent they help the real use cases depends on other factors, like the number of contending readers/writers, size of the ring, whether or not it is on the critical path."
> 
> I prefer your initial proposal which is more concise. What about
> something like this?
> 
> 
> * **Updated the C11 memory model version of ring library.**
> 
>   The latency is decreased for architectures using the C11 memory model
>   version of the ring library.
> 
>   On Cavium ThunderX2 platform, the changes decreased latency by 27~29%
>   and 3~15% for MPMC and SPSC cases respectively (with 2 lcores). The
>   real improvements may vary with the number of contending lcores and
>   the size of ring.
> 
> 
> About the patch itself:
> Acked-by: Olivier Matz <olivier.matz@6wind.com>

Series applied with the suggested notes.

Thanks

Jerin Jacob Nov. 5, 2018, 1:41 p.m. UTC | #7

-----Original Message-----
> Date: Mon, 05 Nov 2018 14:17:27 +0100
> From: Thomas Monjalon <thomas@monjalon.net>
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Cc: stable@dpdk.org, "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>,
>  Bruce Richardson <bruce.richardson@intel.com>, "dev@dpdk.org"
>  <dev@dpdk.org>, "stephen@networkplumber.org" <stephen@networkplumber.org>,
>  "olivier.matz@6wind.com" <olivier.matz@6wind.com>,
>  "chaozhu@linux.vnet.ibm.com" <chaozhu@linux.vnet.ibm.com>,
>  "konstantin.ananyev@intel.com" <konstantin.ananyev@intel.com>,
>  "jerin.jacob@caviumnetworks.com" <jerin.jacob@caviumnetworks.com>, nd
>  <nd@arm.com>, hemant.agrawal@nxp.com, shreyansh.jain@nxp.com
> Subject: Re: [dpdk-stable] [PATCH v5 2/2] ring: move the atomic load of
>  head above the loop
> 
> 03/11/2018 10:34, Honnappa Nagarahalli:
> > > > > ---
> > > > >  doc/guides/rel_notes/release_18_11.rst |  7 +++++++
> > > > >  lib/librte_ring/rte_ring_c11_mem.h     | 10 ++++------
> > > > >  2 files changed, 11 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/doc/guides/rel_notes/release_18_11.rst
> > > > > b/doc/guides/rel_notes/release_18_11.rst
> > > > > index 376128f..b68afab 100644
> > > > > --- a/doc/guides/rel_notes/release_18_11.rst
> > > > > +++ b/doc/guides/rel_notes/release_18_11.rst
> > > > > @@ -69,6 +69,13 @@ New Features
> > > > >    checked out against that dma mask and rejected if out of range.
> > > > > If more
> > > > than
> > > > >    one device has addressing limitations, the dma mask is the more
> > > > restricted one.
> > > > >
> > > > > +* **Updated the ring library with C11 memory model.**
> > > > > +
> > > > > +  Updated the ring library with C11 memory model, in our tests the
> > > > > + changes  decreased latency by 27~29% and 3~15% for MPMC and SPSC
> > > > cases respectively.
> > > > > +  The real improvements may vary with the number of contending
> > > > > + lcores and the  size of ring.
> > > > > +
> > > > Is this a little misleading, and will users expect massive performance
> > > > improvements generally? The C11 model seems to be used only on some,
> > > > but not all, arm platforms, and then only with "make" builds.
> > > >
> > > > config/arm/meson.build: ['RTE_USE_C11_MEM_MODEL', false]]
> > This is an error. There is already an agreement that on Arm based platforms, C11 memory model would be used by default. Specific platforms can override it if required.
> > Would this be an acceptable change for RC2 or RC3?
> 
> If NXP and Cavium agree, I think it can go in RC2.

Yes. The meson and make configs should be the same, i.e. on Arm-based
platforms the C11 memory model would be used by default. Specific
platforms can override it if required.

I think the meson config needs to be updated to be in line with the make config.


> For RC3, not sure.
> 
> 
>

Patch

diff --git a/doc/guides/rel_notes/release_18_11.rst b/doc/guides/rel_notes/release_18_11.rst
index 376128f..b68afab 100644
--- a/doc/guides/rel_notes/release_18_11.rst
+++ b/doc/guides/rel_notes/release_18_11.rst
@@ -69,6 +69,13 @@  New Features
   checked out against that dma mask and rejected if out of range. If more than
   one device has addressing limitations, the dma mask is the more restricted one.
 
+* **Updated the ring library with C11 memory model.**
+
+  Updated the ring library with C11 memory model, in our tests the changes
+  decreased latency by 27~29% and 3~15% for MPMC and SPSC cases respectively.
+  The real improvements may vary with the number of contending lcores and the
+  size of ring.
+
 * **Added hot-unplug handle mechanism.**
 
   ``rte_dev_hotplug_handle_enable`` and ``rte_dev_hotplug_handle_disable`` are
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 52da95a..7bc74a4 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -61,13 +61,11 @@  __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 	unsigned int max = n;
 	int success;
 
+	*old_head = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Reset n to the initial burst count */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->prod.head,
-					__ATOMIC_ACQUIRE);
-
 		/* load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -93,6 +91,7 @@  __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head is updated */
 			success = __atomic_compare_exchange_n(&r->prod.head,
 					old_head, *new_head,
 					0, __ATOMIC_ACQUIRE,
@@ -135,13 +134,11 @@  __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	int success;
 
 	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons.head, __ATOMIC_ACQUIRE);
 	do {
 		/* Restore n as it may change every loop */
 		n = max;
 
-		*old_head = __atomic_load_n(&r->cons.head,
-					__ATOMIC_ACQUIRE);
-
 		/* this load-acquire synchronize with store-release of ht->tail
 		 * in update_tail.
 		 */
@@ -166,6 +163,7 @@  __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
+			/* on failure, *old_head will be updated */
 			success = __atomic_compare_exchange_n(&r->cons.head,
 							old_head, *new_head,
 							0, __ATOMIC_ACQUIRE,