[dpdk-dev] [PATCH 1/4] eventdev: introduce event driven programming model

Jerin Jacob jerin.jacob at caviumnetworks.com
Sat Nov 26 03:54:55 CET 2016


On Fri, Nov 25, 2016 at 11:00:53AM +0000, Bruce Richardson wrote:
> On Fri, Nov 25, 2016 at 05:53:34AM +0530, Jerin Jacob wrote:
> > On Thu, Nov 24, 2016 at 04:35:56PM +0100, Thomas Monjalon wrote:
> > > 2016-11-24 07:29, Jerin Jacob:
> > > > On Wed, Nov 23, 2016 at 07:39:09PM +0100, Thomas Monjalon wrote:
> > > > > 2016-11-18 11:14, Jerin Jacob:
> > > > > > +Eventdev API - EXPERIMENTAL
> > > > > > +M: Jerin Jacob <jerin.jacob at caviumnetworks.com>
> > > > > > +F: lib/librte_eventdev/
> > > > > 
> > 
> > I don't think there is any portability issue here, I can explain.
> > 
> > The application level, we have two more use case to deal with non burst
> > variant
> > 
> > - latency critical work
> > - on dequeue, if application wants to deal with only one flow(i.e to
> >   avoid processing two different application flows to avoid cache trashing)
> > 
> > Selection of the burst variants will be based on
> > rte_event_dev_info_get() and rte_event_dev_configure()(see, max_event_port_dequeue_depth,
> > max_event_port_enqueue_depth, nb_event_port_dequeue_depth, nb_event_port_enqueue_depth )
> > So I don't think their is portability issue here and I don't want to waste my
> > CPU cycles on the for loop if application known to be working with non
> > bursts variant like below
> > 
> 
> If the application is known to be working on non-burst varients, then
> they always request a burst-size of 1, and skip the loop completely.
> There is no extra performance hit in that case in either the app or the
> driver (since the non-burst driver always returns 1, irrespective of the
> number requested).

Hmm. I am afraid there is.
On the app side, the const "1" cannot be optimized away by the compiler,
because the driver interface is function-pointer based.
On the driver side, the implementation would be for-loop based instead
of a plain access
(the compiler can never see the const "1" across the driver interface).

We are planning to implement burst mode as a kind of emulation mode and
have different schemes for burst and non-burst. We took a similar approach
in introducing rte_event_schedule(): split the responsibility so that the
SW driver can work without additional performance overhead and with a neat
driver interface.
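To make the "emulation mode" idea concrete, here is a minimal sketch, using
hypothetical names (struct event, dequeue_one_t, stub_port), not the actual
eventdev driver interface: a generic burst dequeue layered on top of a driver
that can only return one event per call.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event and driver-op types for illustration only. */
struct event { uint32_t flow_id; };

/* A non-burst driver op: returns 0 or 1 event per call. */
typedef uint16_t (*dequeue_one_t)(void *port, struct event *ev);

/* Generic burst emulation layered on the single-event op. */
static uint16_t
dequeue_burst_emulated(dequeue_one_t deq, void *port,
		       struct event ev[], uint16_t nb_events)
{
	uint16_t i;

	for (i = 0; i < nb_events; i++)
		if (deq(port, &ev[i]) == 0)	/* stop when queue is empty */
			break;
	return i;
}

/* Stub "hardware" port that hands out a fixed number of events. */
struct stub_port { uint16_t remaining; };

static uint16_t
stub_dequeue_one(void *port, struct event *ev)
{
	struct stub_port *p = port;

	if (p->remaining == 0)
		return 0;
	ev->flow_id = p->remaining--;
	return 1;
}
```

This keeps the per-driver fast path a plain single-event access; only devices
that genuinely batch need to provide their own burst op.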

If you are concerned about usability or a regression on the SW
driver, that's not the case: an application will use the non-burst variant
only if dequeue_depth == 1 and/or in the explicit case where latency matters.

On the portability side, we support both cases, and an application written
based on dequeue_depth will perform well on both implementations. IMO, there
is no other shortcut for a performance-optimized application running on
different models. I don't think it is an issue since, in the event model,
all cores are identical and the main loop can be selected based on
dequeue_depth if performance requires it (the main loop will be
function-pointer based anyway).
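The "main loop selected based on dequeue_depth" idea can be sketched as
follows. All names here (struct dev_info, dequeue_one, dequeue_burst) are
hypothetical stand-ins for illustration, not the actual eventdev API: the
application queries the device capability once and installs one of two
function-pointer main loops, so the depth-1 path never pays for a burst loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the actual eventdev API. */
struct event { uint32_t flow_id; };
struct dev_info { uint16_t max_dequeue_depth; };

static void process(struct event *ev) { (void)ev; }

/* Stub dequeue ops standing in for the function-pointer driver interface. */
static uint16_t dequeue_one(struct event *ev) { ev->flow_id = 0; return 1; }
static uint16_t
dequeue_burst(struct event ev[], uint16_t n)
{
	for (uint16_t i = 0; i < n; i++)
		ev[i].flow_id = i;
	return n;
}

/* Latency-critical / depth-1 main loop: no burst loop at all. */
static void
main_loop_single(void)
{
	struct event ev;

	if (dequeue_one(&ev))
		process(&ev);
}

/* Throughput-oriented burst main loop. */
static void
main_loop_burst(void)
{
	struct event ev[32];
	uint16_t n = dequeue_burst(ev, 32);

	for (uint16_t i = 0; i < n; i++)
		process(&ev[i]);
}

typedef void (*main_loop_t)(void);

/* Pick the main loop once, at setup time, from the device capabilities. */
static main_loop_t
select_main_loop(const struct dev_info *info)
{
	return info->max_dequeue_depth == 1 ? main_loop_single
					    : main_loop_burst;
}
```

The selection happens once in the slow path, so neither loop carries a
per-iteration branch for the other model.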

> 
> > nb_events = rte_event_dequeue_burst();
> > for(i=0; i < nb_events; i++){
> > 	process ev[i]
> > }
> > 
> > And mostly importantly the NPU can get almost same throughput
> > without burst variant so why not?
> > 
> > > 
> > > > > > +/**
> > > > > > + * Converts nanoseconds to *wait* value for rte_event_dequeue()
> > > > > > + *
> > > > > > + * If the device is configured with RTE_EVENT_DEV_CFG_PER_DEQUEUE_WAIT flag then
> > > > > > + * application can use this function to convert wait value in nanoseconds to
> > > > > > + * implementations specific wait value supplied in rte_event_dequeue()
> > > > > 
> > > > > Why is it implementation-specific?
> > > > > Why this conversion is not internal in the driver?
> > > > 
> > > > This is for performance optimization, otherwise in drivers
> > > > need to convert ns to ticks in "fast path"
> > > 
> > > So why not defining the unit of this timeout as CPU cycles like the ones
> > > returned by rte_get_timer_cycles()?
> > 
> > Because HW co-processor can run in different clock domain. Need not be at
> > CPU frequency.
> > 
> While I've no huge objection to this API, since it will not be
> implemented by our SW implementation, I'm just curious as to how much
> having this will save. How complicated is the arithmetic that needs to
> be done, and how many cycles on your platform is that going to take?

One load, plus a division and/or multiplication of (floating-point)
numbers. It could be 6-ish cycles or more, but it matters when the burst
size is small (worst case 1).
I think the software implementation could use rte_get_timer_cycles() here
if required. I think there is no harm in moving work to the slow path when
it can be, as in this case.
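The point about a co-processor running in its own clock domain can be shown
with a small sketch. The helper name and the split-multiply formulation are
illustrative assumptions: the conversion runs once in the slow path, given
the device's tick frequency (which need not equal the CPU frequency), and
the fast-path dequeue then just passes the precomputed tick value.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_SEC 1000000000ULL

/*
 * Hypothetical slow-path helper: convert a wait time in nanoseconds to
 * device ticks, given the co-processor's tick frequency in Hz. Splitting
 * the multiply into whole-second and remainder parts reduces the risk of
 * 64-bit overflow for large wait times.
 */
static uint64_t
wait_ns_to_ticks(uint64_t ns, uint64_t dev_tick_hz)
{
	return (ns / NS_PER_SEC) * dev_tick_hz +
	       (ns % NS_PER_SEC) * dev_tick_hz / NS_PER_SEC;
}
```

The application would call this once after configuration and reuse the
result for every dequeue, keeping the division out of the fast path.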
