[dpdk-dev] [PATCH 1/3] rcu: add RCU library supporting QSBR mechanism

Ananyev, Konstantin konstantin.ananyev at intel.com
Thu Mar 28 12:15:31 CET 2019


> >
> > > +#define RTE_QSBR_CNT_THR_OFFLINE 0
> > > +#define RTE_QSBR_CNT_INIT 1
> > > +
> > > +/**
> > > + * RTE thread Quiescent State structure.
> > > + * Quiescent state counter array (array of 'struct
> > > +rte_rcu_qsbr_cnt'),
> > > + * whose size is dependent on the maximum number of reader threads
> > > + * (m_threads) using this variable is stored immediately following
> > > + * this structure.
> > > + */
> > > +struct rte_rcu_qsbr {
> > > +	uint64_t token __rte_cache_aligned;
> > > +	/**< Counter to allow for multiple simultaneous QS queries */
> > > +
> > > +	uint32_t num_elems __rte_cache_aligned;
> > > +	/**< Number of elements in the thread ID array */
> > > +	uint32_t m_threads;
> > > +	/**< Maximum number of threads this RCU variable will use */
> > > +
> > > +	uint64_t reg_thread_id[RTE_QSBR_THRID_ARRAY_ELEMS]
> > __rte_cache_aligned;
> > > +	/**< Registered thread IDs are stored in a bitmap array */
> >
> >
> > As I understand you ended up with fixed size array to avoid 2 variable size
> > arrays in this struct?
> Yes
> 
> > Is that big penalty for register/unregister() to either store a pointer to bitmap,
> > or calculate it based on num_elems value?
> In the last RFC I sent out [1], I tested the impact of having non-fixed size array. There 'was' a performance degradation in most of the
> performance tests. The issue was with calculating the address of per thread QSBR counters (not with the address calculation of the bitmap).
> With the current patch, I do not see the performance difference (the difference between the RFC and this patch are the memory orderings,
> they are masking any perf gain from having a fixed array). However, I have kept the fixed size array as the generated code does not have
> additional calculations to get the address of qsbr counter array elements.
> 
> [1] http://mails.dpdk.org/archives/dev/2019-February/125029.html

Ok I see, but can we then arrange them in a different way:
qsbr_cnt[] will start at the end of struct rte_rcu_qsbr
(same as you have it right now),
while the bitmap will be placed after qsbr_cnt[].
As I understand, register/unregister is not considered to be on the critical path,
so some perf degradation there doesn't matter.
Also, check() would need an extra address calculation for the bitmap,
but considering that we have to go through the whole bitmap (and in the worst case qsbr_cnt[])
anyway, that's probably not a big deal?
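The layout suggested above can be sketched roughly as follows. This is a minimal stand-alone model, not the patch's actual code: the struct and helper names (qsbr, qsbr_cnt_array, qsbr_thrid_bitmap) are hypothetical, the cache-alignment attributes are simplified away, and the point is only to show the extra address calculation the bitmap accessor would pay when it sits after a variable-size qsbr_cnt[].

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64

/* Hypothetical per-thread counter entry, one cache line each
 * (stands in for struct rte_rcu_qsbr_cnt). */
struct qsbr_cnt {
	uint64_t cnt;
	uint8_t pad[CACHE_LINE - sizeof(uint64_t)];
};

/* Simplified stand-in for struct rte_rcu_qsbr: the counter array
 * starts right after this fixed part, and the thread-ID bitmap
 * follows the counter array, as proposed above. */
struct qsbr {
	uint64_t token;
	uint32_t num_elems;	/* number of 64-bit words in the bitmap */
	uint32_t m_threads;	/* max number of reader threads */
};

/* Counter array begins immediately after the fixed part. */
static inline struct qsbr_cnt *
qsbr_cnt_array(struct qsbr *v)
{
	return (struct qsbr_cnt *)(v + 1);
}

/* Bitmap begins after the variable-size counter array; this is the
 * extra calculation register/unregister/check would perform. */
static inline uint64_t *
qsbr_thrid_bitmap(struct qsbr *v)
{
	return (uint64_t *)(qsbr_cnt_array(v) + v->m_threads);
}
```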

> 
> > As another thought - do we really need bitmap at all?
> The bitmap helps avoid accessing all the elements in the rte_rcu_qsbr_cnt array (as you have mentioned below). This provides the ability to
> scale the number of threads dynamically. For example: an application can create a qsbr variable with 48 max threads, but currently only 2 threads
> are active (due to traffic conditions).

I understand that the bitmap is supposed to speed up check() for
situations when most threads are unregistered.
My thought was that maybe the check() speedup for such a situation is not that critical.
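For reference, the speedup being discussed comes from check() scanning only the set bits of the bitmap instead of every counter slot. Below is a rough sketch of such a scan with hypothetical names (check_bitmap, reg_thread_id, thread_cnt); the real patch additionally uses acquire loads and handles the offline counter value, which this sketch omits.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a bitmap-driven check(): only counters of registered
 * threads are read, so with 2 of 48 threads registered just 2 cache
 * lines are touched instead of 48.  Memory orderings are omitted. */
static inline int
check_bitmap(const uint64_t *reg_thread_id, uint32_t num_elems,
	     const uint64_t *thread_cnt, uint64_t token)
{
	for (uint32_t i = 0; i < num_elems; i++) {
		uint64_t bmap = reg_thread_id[i];

		while (bmap) {
			/* index of lowest set bit = next registered thread */
			uint32_t id = i * 64 + __builtin_ctzll(bmap);

			if (thread_cnt[id] < token)
				return 0; /* reader not quiescent yet */
			bmap &= bmap - 1; /* clear lowest set bit */
		}
	}
	return 1; /* all registered readers have passed the token */
}
```

Unregistered slots are never dereferenced, which is exactly the scaling property described in the reply above.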

> 
> > Might it be possible to store a register value for each thread inside its
> > rte_rcu_qsbr_cnt:
> > struct rte_rcu_qsbr_cnt {uint64_t cnt; uint32_t register;}
> > __rte_cache_aligned; ?
> > That would cause check() to walk through all elems in the rte_rcu_qsbr_cnt array,
> > but on the other hand would help to avoid cache conflicts for register/unregister.
> With the addition of rte_rcu_qsbr_thread_online/offline APIs, the register/unregister APIs are not in critical path anymore. Hence, the
> cache conflicts are fine. The online/offline APIs work on thread specific cache lines and these are in the critical path.
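The bitmap-free alternative quoted above can be sketched as below. Names are hypothetical (qsbr_cnt_alt, check_no_bitmap); note the quoted field name `register` is a C keyword, so the sketch uses `registered` instead. The cost is that check() now walks every slot up to m_threads, touching one cache line per slot even for unregistered threads.

```c
#include <assert.h>
#include <stdint.h>

/* The registration flag lives inside the per-thread cache-aligned
 * counter entry, so no separate bitmap is needed. */
struct qsbr_cnt_alt {
	uint64_t cnt;		/* quiescent-state counter */
	uint32_t registered;	/* non-zero once the thread registers */
} __attribute__((aligned(64)));

/* check() walks every element and skips unregistered slots; unlike
 * the bitmap variant, it still reads each slot's cache line. */
static inline int
check_no_bitmap(const struct qsbr_cnt_alt *cnts, uint32_t m_threads,
		uint64_t token)
{
	for (uint32_t i = 0; i < m_threads; i++) {
		if (!cnts[i].registered)
			continue;
		if (cnts[i].cnt < token)
			return 0; /* this reader not quiescent yet */
	}
	return 1;
}
```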
> 
> >
> > > +} __rte_cache_aligned;
> > > +

