[dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod loading when doing enqueue/dequeue

Bruce Richardson bruce.richardson at intel.com
Mon Oct 23 11:10:59 CEST 2017


On Mon, Oct 23, 2017 at 09:05:24AM +0000, Kuusisaari, Juhamatti wrote:
> 
> Hi,
> 
> > On 10/20/2017 1:43 PM, Jerin Jacob Wrote:
> > > -----Original Message-----
> > >>
> > [...]
> > >> dependent on each other.  Thus a memory barrier is necessary.
> > > Yes. The barrier is necessary.  In fact, upstream FreeBSD fixed
> > > this issue for arm64. The DPDK ring implementation is derived from
> > > FreeBSD's buf_ring.h:
> > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > >
> > > I think the only outstanding issue is how to reduce the
> > > performance impact for arm64. I believe using acquire/release
> > > semantics instead of rte_smp_rmb() will reduce the performance
> > > overhead, like the similar ring implementations below:
> > > freebsd:
> > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > > odp:
> > > https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
> > >
> > > Jia, 1) Can you verify that the use of acquire/release semantics
> > > fixes the problem on your platform? e.g. the use of atomic_load_acq*
> > > in the reference code.
> > > 2) If so, what is the overhead between acquire/release and plain
> > > rte_smp_rmb() barriers? Based on that we need to decide which path
> > > to take.
> > I've tested 3 cases.  The new 3rd case is to use the load_acquire
> > barrier (half barrier) you mentioned at the above link.  The patch
> > looks like:
> >
> > @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> >                 /* Reset n to the initial burst count */
> >                 n = max;
> >
> > -               *old_head = r->prod.head;
> > -               const uint32_t cons_tail = r->cons.tail;
> > +               *old_head = atomic_load_acq_32(&r->prod.head);
> > +               const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);
> >
> > @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s
> >                 /* Restore n as it may change every loop */
> >                 n = max;
> >
> > -               *old_head = r->cons.head;
> > -               const uint32_t prod_tail = r->prod.tail;
> > +               *old_head = atomic_load_acq_32(&r->cons.head);
> > +               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);
> >                 /* The subtraction is done between two unsigned 32bits value
> >                  * (the result is always modulo 32 bits even if we have
> >                  * cons_head > prod_tail). So 'entries' is always between 0
> >                  * and size(ring)-1. */
> >
> > The half barrier patch passed the functional test.
> > 
> > As for the performance comparison on *arm64* (the debug patch is at
> > http://dpdk.org/ml/archives/dev/2017-October/079012.html), please
> > see the test results below:
> > 
> > [case 1] old codes, no barrier
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      689275.001200      task-clock (msec)         #    9.771 CPUs utilized
> >               6223      context-switches          #    0.009 K/sec
> >                 10      cpu-migrations            #    0.000 K/sec
> >                653      page-faults               #    0.001 K/sec
> >      1721190914583      cycles                    #    2.497 GHz
> >      3363238266430      instructions              #    1.95  insn per cycle
> >    <not supported>      branches
> >           27804740      branch-misses             #    0.00% of all branches
> >
> >       70.540618825 seconds time elapsed
> > 
> > [case 2] full barrier with rte_smp_rmb()
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      582557.895850      task-clock (msec)         #    9.752 CPUs utilized
> >               5242      context-switches          #    0.009 K/sec
> >                 10      cpu-migrations            #    0.000 K/sec
> >                665      page-faults               #    0.001 K/sec
> >      1454360730055      cycles                    #    2.497 GHz
> >       587197839907      instructions              #    0.40  insn per cycle
> >    <not supported>      branches
> >           27799687      branch-misses             #    0.00% of all branches
> >
> >       59.735582356 seconds time elapsed
> > 
> > [case 3] half barrier with load_acquire
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      660758.877050      task-clock (msec)         #    9.764 CPUs utilized
> >               5982      context-switches          #    0.009 K/sec
> >                 11      cpu-migrations            #    0.000 K/sec
> >                657      page-faults               #    0.001 K/sec
> >      1649875318044      cycles                    #    2.497 GHz
> >       591583257765      instructions              #    0.36  insn per cycle
> >    <not supported>      branches
> >           27994903      branch-misses             #    0.00% of all branches
> >
> >       67.672855107 seconds time elapsed
> > 
> > Please note the context-switches in the perf results. The test results
> > sorted by running time are: full barrier < half barrier < no barrier.
> >
> > AFAICT, in this case the cpu reordering adds to the possibility of
> > context switching and increases the running time.
> > 
> > Any ideas?
> 
> You could also try a case where you rearrange the rte_ring structure
> so that prod.head and cons.tail are part of a union with a 64-bit
> variable, and then read that 64-bit variable with one atomic read. I do
> not think the half barrier is even needed with this approach, as long
> as you can really read the 64-bit variable fully at once. This should
> speed things up.
> 
> Cheers, --
Yes, that would work, but unfortunately it means mixing producer and
consumer data onto the same cacheline, which may have other negative
effects on performance (depending on arch/platform) due to cores
invalidating each other's cachelines.
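
For illustration, a rough sketch of the combined-read idea (untested; the
union layout and helper name below are hypothetical, not the actual
rte_ring fields), using the GCC __atomic builtins:

    #include <stdint.h>

    /* Hypothetical layout: producer head and consumer tail packed into one
     * 64-bit word so both can be snapshotted with a single atomic load.
     * This is also what forces them onto the same cacheline, which is the
     * producer/consumer sharing concern raised above. */
    union head_tail {
            struct {
                    uint32_t prod_head;
                    uint32_t cons_tail;
            };
            uint64_t raw;
    };

    static inline void
    snapshot_head_tail(const union head_tail *ht,
                       uint32_t *prod_head, uint32_t *cons_tail)
    {
            union head_tail snap;

            /* one 64-bit atomic read returns a consistent pair, so no
             * ordering barrier between the two 32-bit values is needed */
            snap.raw = __atomic_load_n(&ht->raw, __ATOMIC_RELAXED);
            *prod_head = snap.prod_head;
            *cons_tail = snap.cons_tail;
    }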
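
For comparison, the half-barrier variant from the quoted patch, written
with the GCC __atomic builtins instead of FreeBSD's atomic_load_acq_32
(sketch only, assuming the current prod.head/cons.tail layout; the helper
name is made up):

    #include <rte_ring.h>

    static inline void
    load_heads_acq(const struct rte_ring *r,
                   uint32_t *old_head, uint32_t *cons_tail)
    {
            /* acquire on prod.head: the later cons.tail load cannot be
             * reordered before it, which is the ordering the full
             * rte_smp_rmb() provided between the two plain loads */
            *old_head  = __atomic_load_n(&r->prod.head, __ATOMIC_ACQUIRE);
            *cons_tail = __atomic_load_n(&r->cons.tail, __ATOMIC_ACQUIRE);
    }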

/Bruce

