[dpdk-dev] [PATCH v2 1/6] ethdev: add Tx preparation

Jerin Jacob jerin.jacob at caviumnetworks.com
Fri Sep 23 12:29:10 CEST 2016


On Fri, Sep 23, 2016 at 09:41:52AM +0000, Ananyev, Konstantin wrote:
Hi Konstantin,
> Hi Jerin,
> 
> > 
> > Hi Konstantin,
> > 
> > >
> > > Hi Jerin,
> > >
> > > > >
> > > > > Hi Jerin,
> > > >
> > > > Hi Konstantin,
> > > >
> > > > >
> > > > > > > > >
> > > > > > >
> > > > > > > [...]
> > > > > > >
> > > > > > > > > +
> > > > > > > > > +#ifdef RTE_ETHDEV_TX_PREP
> > > > > > > >
> > > > > > > > Sorry for being a bit late on that discussion, but what is the
> > > > > > > > point of having that config macro (RTE_ETHDEV_TX_PREP) at all?
> > > > > > > > As I can see right now, if the driver doesn't set up tx_pkt_prep,
> > > > > > > > then nb_pkts would be returned anyway...
> > > > > > > >
> > > > > > > > BTW, I have another question - should it be that way?
> > > > > > > > Shouldn't we return 0 (and set rte_errno=ENOTSUP) here if
> > > > > > > > dev->tx_pkt_prep == NULL?
> > > > > > > >
> > > > > > >
> > > > > > > It's an answer to Jerin's request discussed here:
> > > > > > > http://dpdk.org/ml/archives/dev/2016-September/046437.html
> > > > > > >
> > > > > > > When a driver doesn't support tx_prep, the default behavior is "we
> > > > > > > don't know the requirements, so we have nothing to do here". It will
> > > > > > > simplify application logic and improve performance for these drivers,
> > > > > > > I think. Catching this error on every burst may be problematic.
> > > > > > >
> > > > > > > As for the RTE_ETHDEV_TX_PREP macro, suggested by Jerin in the
> > > > > > > same thread, I still don't think it's the best solution to the
> > > > > > > problem he described. I have added it here for further discussion.
> > > > > > >
> > > > > > > Jerin, have you something to add?
> > > > > >
> > > > > > Nothing very specific to add here. I think I have tried to share
> > > > > > the rationale in
> > > > > > http://dpdk.org/ml/archives/dev/2016-September/046437.html
> > > > > >
> > > > >
> > > > > Ok, not sure I fully understand your intention here.
> > > > > I think I understand why you propose rte_eth_tx_prep() to do:
> > > > > 	if (!dev->tx_pkt_prep)
> > > > > 		return nb_pkts;
> > > > >
> > > > > That allows drivers to NOOP the tx_prep functionality without
> > > > > paying the price for callback invocation.
> > > >
> > > > In a true sense, returning nb_pkts makes it a functional NOP as
> > > > well (i.e. the PMD does not have any limitation on the Tx side, so all
> > > > packets are _good_ and no preparation is required).
> > > >
> > > >
> > > > > What I don't understand is why, with that in place, we also need a
> > > > > NOOP for the whole rte_eth_tx_prep():
> > > > > +static inline uint16_t
> > > > > +rte_eth_tx_prep(uint8_t port_id __rte_unused, uint16_t queue_id __rte_unused,
> > > > > +		struct rte_mbuf **tx_pkts __rte_unused, uint16_t nb_pkts) {
> > > > > +	return nb_pkts;
> > > > > +}
> > > > > +
> > > > > +#endif
> > > > >
> > > > > What are you trying to save here: just reading 'dev->tx_pkt_prep'?
> > > > > If so, then it doesn't seem that performance critical to me.
> > > > > Something else?
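
[For reference, a rough sketch of how the two rte_eth_tx_prep() variants fit
together under the config option, pieced together from the fragments quoted
in this thread. The rte_eth_devices[] lookup and the exact body of the
non-NOOP variant are my assumptions, not a literal copy of the patch:]

#ifdef RTE_ETHDEV_TX_PREP

static inline uint16_t
rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
		struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
{
	struct rte_eth_dev *dev = &rte_eth_devices[port_id];

	/* Driver-level NOOP: no Tx limitations, all packets are "good". */
	if (!dev->tx_pkt_prep)
		return nb_pkts;

	return (*dev->tx_pkt_prep)(dev->data->tx_queues[queue_id],
				   tx_pkts, nb_pkts);
}

#else

/* Compile-time NOOP: no load, no compare, no branch at all. */
static inline uint16_t
rte_eth_tx_prep(uint8_t port_id __rte_unused, uint16_t queue_id __rte_unused,
		struct rte_mbuf **tx_pkts __rte_unused, uint16_t nb_pkts)
{
	return nb_pkts;
}

#endif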
> > > >
> > > > The proposed scheme can make it a true NOOP from the compiler's
> > > > perspective too, if a target decides to do that. I have checked the
> > > > generated arm assembly; a non-true compiler NOOP has the following
> > > > instruction overhead at a minimum.
> > > >
> > > > 	# 1 load
> > > > 	# 1  mov
> > > > 	if (!dev->tx_pkt_prep)
> > > > 		return nb_pkts;
> > >
> > > Yep.
> > >
> > > >
> > > > 	# the compiler can't predict whether this function will be
> > > > 	# executed, so there is pressure on register allocation, and it
> > > > 	# most likely causes some stack push/pop depending on the outer
> > > > 	# function's register load (as it is an inline function)
> > >
> > >
> > > Well, I suppose the compiler wouldn't try to fill the function argument registers before the branch above.
> > 
> > Not the case with the arm gcc compiler (maybe depending on the outer function's register load).
> 
> Ok, so for my own curiosity (I am not very familiar with the ARM arch):
>  gcc generates several conditional execution instructions in a row to spill/fill 
> function argument registers, and that comes at a price at execution time if the condition is not met? 

Yes. That's what I saw (at least for the gcc 5.3 + arm64 back-end case) when I was
debugging the external mempool function pointer performance regression issue.
The sad part is that I couldn't find any gcc option to override it.


> 
> > The recent external pool manager function pointer conversion cost around
> > 700 kpps/core in local cache mode (even though the function pointers are
> > not executed).
> > 
> > >
> > > >
> > > > 	return (*dev->tx_pkt_prep)(dev->data->tx_queues[queue_id], tx_pkts,
> > > > nb_pkts);
> > > >
> > > > 	# 1 branch
> > > > 	if (unlikely(nb_prep < nb_rx)) {
> > > > 		# bunch of code not executed, but pressure on i cache
> > > > 		int i;
> > > > 		for (i = nb_prep; i < nb_rx; i++)
> > > > 	                 rte_pktmbuf_free(pkts_burst[i]);
> > > > 	}
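
[Again just for reference, a minimal sketch of the intended application-side
flow around tx_prep, using the usual rx/tx burst APIs; the port/queue
parameters, MAX_PKT_BURST and the forward_burst() wrapper are placeholders of
mine, not part of the patch:]

#include <rte_common.h>
#include <rte_branch_prediction.h>
#include <rte_mbuf.h>
#include <rte_ethdev.h>

#define MAX_PKT_BURST 32

static void
forward_burst(uint8_t rx_port, uint16_t rx_queue,
	      uint8_t tx_port, uint16_t tx_queue)
{
	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
	uint16_t nb_rx, nb_prep, nb_tx;
	int i;

	nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts_burst, MAX_PKT_BURST);

	/* Let the driver validate/fix up the packets for its Tx requirements. */
	nb_prep = rte_eth_tx_prep(tx_port, tx_queue, pkts_burst, nb_rx);
	if (unlikely(nb_prep < nb_rx)) {
		/* Drop the packets the driver could not prepare. */
		for (i = nb_prep; i < nb_rx; i++)
			rte_pktmbuf_free(pkts_burst[i]);
	}

	nb_tx = rte_eth_tx_burst(tx_port, tx_queue, pkts_burst, nb_prep);
	/* A real application would also handle nb_tx < nb_prep here. */
	RTE_SET_USED(nb_tx);
}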
> > > >
> > > > For a server target (IA or high-end armv8) with an external PCIe based
> > > > system it makes sense to have the RTE_ETHDEV_TX_PREP option enabled
> > > > (which is the case in the proposed patch), but on low-end arm platforms with
> > > > - limited cores
> > > > - less i-cache
> > > > - IPC == 1
> > > > - clocks around 1GHz
> > > > - most importantly, an _integrated_ NIC controller with no external PCIe
> > > >   support
> > > > it does not make much sense to waste the cycles/time on it.
> > > > A cycle saved is a cycle earned.
> > >
> > > Ok, so it is all to save one memory dereference and a comparison plus branch.
> > > Do you have any estimation of how badly it would hit low-end cpu performance?
> > 
> > Around 400 kpps/core. On a four-core system, around 2 Mpps (4 cores with
> > 2x10G ports).
> 
> So it is about ~7% for 2x10G (roughly 2 Mpps out of the ~29.8 Mpps 64-byte line rate), correct?
> I agree that seems big enough to keep the config option,
> even though I am not quite happy with introducing a new config option. 
> So no more objections from my side here.

Thanks.

That was for very low-end CPUs.
Even in the high-end CPU case, if it costs only a 100 kpps drop/core,
a Cavium configuration like 96 cores + >200G will lose at least 9.6 Mpps
(100 kpps x 96 cores) worth of cycles.


> Thanks 
> Konstantin
> 
> > 
> > > The reason I am asking: obviously I would prefer to avoid introducing a
> > > new build config option (that's the common dpdk coding practice these days) unless it is really important.
> > Practice is something we need to revisit based on new use cases/usage.
> > I think the scheme of NW cards that are not external-PCIe based is new to DPDK.
> > 
> > >
> > > >
> > > > Since DPDK compilation is based on the _target_, I really don't see any
> > > > issue with this approach, nor does it hurt anything on a server target.
> > > > So, IMO, it should be up to the target to decide what works better for it.
> > > >
> > > > Jerin
> > > >
> > > > > From my point of view a NOOP at the driver level is more than enough.
> > > > > Again, I would prefer not to introduce a new config option, if possible.
> > > > >
> > > > > Konstantin
> > > > >

