[dpdk-dev] [PATCH v9 0/4] lib/rcu: add RCU library supporting QSBR mechanism

Honnappa Nagarahalli Honnappa.Nagarahalli at arm.com
Wed May 1 23:18:05 CEST 2019


> Subject: Re: [dpdk-dev] [PATCH v9 0/4] lib/rcu: add RCU library supporting
> QSBR mechanism
> 
> On Wed, May 01, 2019 at 02:56:48PM +0000, Honnappa Nagarahalli wrote:
> > >
> > > On Tue, Apr 30, 2019 at 10:54:15PM -0500, Honnappa Nagarahalli wrote:
> > > > Lock-less data structures provide scalability and determinism.
> > > > They enable use cases where locking may not be allowed (for ex:
> > > > real-time applications).
> > > >
> > > I know this is version 9 of the patch, so I'm sorry for the late
> > > comment, but I have to ask: Why re-invent this wheel?  There are
> > > already several Userspace
> > Thanks Neil, for asking the question. This has been debated before. Please
> refer to [2] for more details.
> >
> > liburcu [1] was explored as it seemed to be familiar to others in the
> community. I am not aware of any other library.
> >
> > There are unique requirements in DPDK and there is still scope for
> improvement from what is available. I have explained this in the cover letter
> without making a direct comparison to liburcu. Maybe it is worth tweaking the
> documentation to call this out explicitly.
> >
> I think what you're referring to here is the need for multiple QSBR variables,
> yes?  I'm not sure that's, strictly speaking, a requirement.  It seems like it's a
> performance improvement, but I'm not sure that's the case (see performance
> numbers below).
DPDK supports the service cores feature and a pipeline mode, in which a particular data structure is used by only a subset of the readers. These use cases affect the writer and the readers (which are on the data plane) in the following ways:

1) The writer does not need to wait for all the readers to report their quiescent state, only for the readers that use the data structure. The writer does not need to spend CPU cycles, or add to memory bandwidth, polling the unwanted readers. DPDK has use cases where the writer is on the data plane as well.
2) The readers that do not use the data structure do not have to spend cycles reporting their quiescent state. Note that these are data plane cycles.
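
To make the two points above concrete, here is a minimal single-threaded sketch of a per-data-structure QSBR variable. It is only a model: the struct layout and the names qs_var, qs_register, qs_quiescent, qs_start and qs_check are hypothetical and are not the API in this patch, and a real implementation needs atomics and memory ordering that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_READERS 64

/* Hypothetical model of one QSBR variable, one per data structure. */
struct qs_var {
    uint64_t registered;        /* bit i set: reader i uses this data structure */
    uint64_t token;             /* last grace-period token issued by the writer */
    uint64_t seen[MAX_READERS]; /* last token each reader has acknowledged */
};

/* Register a reader on this variable only. */
static void qs_register(struct qs_var *v, unsigned int tid)
{
    assert(tid < MAX_READERS);
    v->registered |= UINT64_C(1) << tid;
    v->seen[tid] = v->token;
}

/* Reader side: report a quiescent state on this variable. */
static void qs_quiescent(struct qs_var *v, unsigned int tid)
{
    assert(tid < MAX_READERS);
    v->seen[tid] = v->token;
}

/* Writer side: start a grace period and return its token. */
static uint64_t qs_start(struct qs_var *v)
{
    return ++v->token;
}

/*
 * Writer side: poll only the readers registered on this variable.
 * *polled receives the number of readers that were actually examined.
 */
static bool qs_check(const struct qs_var *v, uint64_t token,
                     unsigned int *polled)
{
    bool done = true;

    *polled = 0;
    for (unsigned int tid = 0; tid < MAX_READERS; tid++) {
        if (!(v->registered & (UINT64_C(1) << tid)))
            continue; /* this reader never touches the data structure */
        (*polled)++;
        if (v->seen[tid] < token)
            done = false;
    }
    return done;
}
```

With two of eight readers registered on a variable, the writer examines two entries per check, and the other six readers never call qs_quiescent on that variable.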

Other than this, please read in the cover letter about how the grace period and the critical section affect the overhead introduced by the QSBR mechanism. It also explains how this library solves this issue.

This is discussed in the thread I provided earlier [2].

> 
> Regarding performance, we can't keep using raw performance as a trump card
IMO, performance is NOT a 'trump card'. The whole essence of DPDK is performance. If not for performance, would DPDK exist?

> for all other aspects of the DPDK.  This entire patch is meant to improve
> performance, it seems like it would be worthwhile to gain the code
> consolidation and reuse benefits for the minor performance hit.
Apologies, I did not understand this. Can you please elaborate on the code consolidation part?

> 
> Further to performance, I may be misreading this, but I ran the integrated
> performance test you provided in this patch, as well as the benchmark tests for
> liburcw (trimmed for easier reading here)
Just to be sure, I believe you are referring to *liburcu*.

> 
> liburcw:
> [nhorman at hmswarspite benchmark]$ ./test_urcu 7 1 1 -v -a 0 -a 1 -a 2 -a 3 -a 4 -a 5 -a 6 -a 7 -a 0
> Adding CPU 0 affinity
> Adding CPU 1 affinity
> Adding CPU 2 affinity
> Adding CPU 3 affinity
> Adding CPU 4 affinity
> Adding CPU 5 affinity
> Adding CPU 6 affinity
> Adding CPU 7 affinity
> Adding CPU 0 affinity
> running test for 1 seconds, 7 readers, 1 writers.
> Writer delay : 0 loops.
> Reader duration : 0 loops.
> thread main  , tid 22712
> thread_begin reader, tid 22726
> thread_begin reader, tid 22729
> thread_begin reader, tid 22728
> thread_begin reader, tid 22727
> thread_begin reader, tid 22731
> thread_begin reader, tid 22730
> thread_begin reader, tid 22732
> thread_begin writer, tid 22733
> thread_end reader, tid 22729
> thread_end reader, tid 22731
> thread_end reader, tid 22730
> thread_end reader, tid 22728
> thread_end reader, tid 22727
> thread_end writer, tid 22733
> thread_end reader, tid 22726
> thread_end reader, tid 22732
> total number of reads : 1185640774, writes 264444
> SUMMARY /home/nhorman/git/userspace-rcu/tests/benchmark/.libs/lt-
> test_urcu testdur    1 nr_readers   7 rdur      0 wdur      0 nr_writers   1 wdelay
> 0 nr_reads   1185640774 nr_writes       264444 nr_ops   1185905218
> 
> DPDK:
> Perf test: 1 writer, 7 readers, 1 QSBR variable, 1 QSBR Query, Non-Blocking QSBR check
> Following numbers include calls to rte_hash functions
> Cycles per 1 update(online/update/offline): 813407
> Cycles per 1 check(start, check): 859679
> 
> 
> Both of these test QSBR RCU in each library using 7 readers and 1 writer.  It's a
> little bit of an apples to oranges comparison, as the tests run using slightly
Thanks for running the test. Yes, it is an apples-to-oranges comparison:
1) The test you are running is not the correct test, assuming the code for this test is [3].
2) That test is not QSBR.

I suggest you use [4] for your testing. It also needs further changes to match the test case in this patch. The function 'thr_reader' reports quiescent state every 1024 iterations; please change it to report every iteration.

After this you need to compare these results with the first test case in this patch.
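
As a rough illustration of why the reporting interval matters for the comparison, the sketch below counts how many quiescent state reports a reader makes for a given interval. It is a stand-alone model, not the code in [4]: report_quiescent_state is a hypothetical stub that only counts calls, and reader_loop is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the library's quiescent state call. */
static uint64_t qs_reports;

static void report_quiescent_state(void)
{
    qs_reports++;
}

/*
 * Run nr_reads read iterations and report a quiescent state every
 * 'interval' iterations. 'interval' must be a power of two
 * (1024 for the current behaviour, 1 to report on every iteration).
 * Returns the number of reports made.
 */
static uint64_t reader_loop(uint64_t nr_reads, uint64_t interval)
{
    assert(interval != 0 && (interval & (interval - 1)) == 0);

    qs_reports = 0;
    for (uint64_t i = 1; i <= nr_reads; i++) {
        /* read-side critical section would go here */
        if ((i & (interval - 1)) == 0)
            report_quiescent_state();
    }
    return qs_reports;
}
```

For 4096 reads, an interval of 1024 gives 4 reports and an interval of 1 gives 4096, so the per-read cost of reporting differs by that factor between the two settings.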

[3] https://github.com/urcu/userspace-rcu/blob/master/tests/benchmark/test_urcu.c
[4] https://github.com/urcu/userspace-rcu/blob/master/tests/benchmark/test_urcu_qsbr.c

> different parameters, and produce different output statistics, but I think they
> can be somewhat normalized.  Primarily the stat that stuck out to me was the
> DPDK Cycles per 1 update statistic, which I believe is effectively the number of
> cycles spent in the test / the number of writer updates.  On DPDK that number
> in this test run works out to 813407.  In the liburcw test, it reports the total
> number of ops (cycles), and the number of writes completed within those
> cycles.
> If we do the same division there we get 1185905218 / 264444 = 4484. I may be
> misreading something here, but that seems like a pretty significant write side
Yes, you are misreading. 'number of ops' is not cycles. It is the sum of 'nr_writes' and 'nr_reads'. The test runs for 1 sec (uses 'sleep'), so these are the number of operations done in 1 sec. You need to normalize to number of cycles using this data.
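
A minimal sketch of that normalization follows. The CPU frequency is not given anywhere in this thread, so the 2.0 GHz used in the example afterwards is purely an assumption; the sketch also assumes the thread being measured runs for the whole test duration on its own core. cycles_per_op is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Approximate cycles spent per operation by one thread:
 * (cycles available during the test) / (operations completed).
 * cpu_hz is an input the caller must supply; it is not part of
 * the test output.
 */
static uint64_t cycles_per_op(uint64_t cpu_hz, uint64_t test_secs,
                              uint64_t nr_ops)
{
    assert(nr_ops != 0);
    return (cpu_hz * test_secs) / nr_ops;
}
```

For example, with an assumed 2.0 GHz clock, cycles_per_op(2000000000, 1, 264444) is 7563 cycles per write for the run quoted above. That number is still not comparable with the DPDK figure until the QSBR test in [4] is used.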

> performance improvement over this implementation.
> 
> Neil
> 
> > [1] https://liburcu.org/
> > [2] http://mails.dpdk.org/archives/dev/2018-November/119875.html
> >
> > > RCU libraries that are mature and carried by Linux and BSD distributions.
> > > Why would we throw another one into DPDK instead of just using what's
> > > already available, mature and stable?
> > >
> > > Neil
> >
> >

