[dpdk-dev] [RFC PATCH 0/3] ethdev: few changes in rte_ethdev layer

Ananyev, Konstantin konstantin.ananyev at intel.com
Tue Dec 19 16:27:26 CET 2017


Hi Jerin,

> > > >
> > > > 2. Make it safe to remove rx/tx callback at runtime.
> > > > Right now it is not possible for the application to figure out
> > > > when it is safe to free removed callback handle and
> > > > associated with it resources(unless the queue is stopped).
> > > > That's probably not a big problem if all callbacks are static
> > > > hange through whole application lifetime)
> > > > and/or application doesn't allocate any resources for the callback handler.
> > > > Though if callbacks have to be added/removed dynamically and
> > > > callback handler would require more resources to operate properly -
> > > > then it might become an issue.
> > > > So patch #2 fixes that problem - now as soon as
> > > > rte_eth_remove_(rx|tx)_callback() completes successfully, application
> > > > can safely free all associated with the removed callback resources.
> > > >
> > > > Performance impact:
> > > > If application doesn't use RX/TX callbacks, then the tests I run didn't
> > > > reveal any performance degradation.
> > > > Though if application do use RX/TX callbacks - patch #2 does introduce
> > > > some slowdown.
> > > >
> > > > To be more specific here, on BDW (E5-2699 v4) 2.2GHz, 4x10Gb (X520-4)
> > > > with http://dpdk.org/dev/patchwork/patch/31864/ patch installed I got:
> > > > 1) testpmd ... --latencystats=1 - slowdown < 1%
> > > > 2) examples//l3fwd ... --parse-ptype - - slowdown < 1%
> > > > 3) examples/rxtx_callbacks - slowdown ~8%
> > > > All that in terms of packet throughput (Mpps).
> > >
> > > We tried on an arm64 machine; We got following result on host and guest
> > > using l3fwd.
> >
> > Thanks for testing it.
> > So these numbers below - are for l3fwd running with or without rx callback installed?
> > i.e. with or without '--parse-ptype' option (and on a NIC with/without ptype HW support)?
> 
> without '--parse-ptype' option(i.e without rx callback installed).
> That makes it sad, Even though application is not using the rx callback,
> We need to pay on a average 3% drop in performance(based on the
> use case/arch/SoC/topology etc)

Yes it is sad...
Do you have any idea why that happens?
Is that the same situation you described before: for arm compiler
often replaces small branches with conditional execution instructions
and miss-prediction for such instructions is not free?
Or is it something else?

Another thing - the series contains 2 main patches:
1) http://dpdk.org/dev/patchwork/patch/31866/
Introduces eth_queue_local and moves callback data into it.
That adds one extra memory reference for rx/tx function to read cb func and data.
2) http://dpdk.org/dev/patchwork/patch/31867/
Introduces some synchronization for cb data.

Is it possible to rerun your tests with just patch #1 installed?
Hopefully it would help us to figure out which of the patches (#1 or #2)  
is the main source of slowdown on arm machines.

> 
> >
> > >
> > >
> > > Note:
> > > +ve number means "Performance regression with patches"
> > > -ve number means "Performance improvement with patches"
> > >
> > > Relative performance difference in percentage:
> > >
> > > % difference on host (2 x 40G)
> > > pkt_sz  1c      2c      4c      8c
> > > 64	1.41	0.18	0.51	0.29
> > > 72	0.50	0.53	0.09	0.19
> > > 128	0.31	0.31	0.42	0.00
> > > 256	-0.44	-0.44	0.00	0.00
> > > 512	1.94	1.94	0.01	0.01
> > > 1024	0.00	0.01	0.00	0.01
> > > 1518	0.02	0.01	0.02	0.02
> > >
> > > % difference on guest (2 x 40G)
> > > pkt_sz  1c      2c      4c      8c
> > > 64      5.78	2.30	-2.45	-0.01
> > > 72	1.13	1.13	1.31	-4.29
> > > 128	1.36	1.36	1.54	-0.82
> > > 256	2.02	2.03	-0.01	-0.35
> > > 512	4.05	4.04	0.00	-0.46
> > > 1024	0.39	0.38	0.00	-0.74
> > > 1518	0.00	0.00	-0.05	-4.20
> > >
> > > I think, it is because we added more code under
> > > the RTE_ETHDEV_RXTX_CALLBACKS and it is enabled by
> > > default.
> > >
> > > I think, the impact will vary based on
> > > architecture and underneath i-cache, d-cache and IPC.
> > >
> > > I am just thinking, How we can avoid such impact?
> > > How about,
> > > # Remove RTE_ETHDEV_RXTX_CALLBACKS option
> > > # Hook Rx/TX callbacks under TX/RX offload flags
> > > # ethdev layer gives helper function to invoke
> > > callbacks to driver.
> >
> > I suppose you suggest that user would have to decide at queue_setup()
> > stage does he wants a callback(s) for RX/TX or no?
> 
> Yes. queue_setup or configure() time.
> 
> > As I can see, the main drawback with that approach - it would introduce new limitation.
> > Right now it is possible to add/remove RX/TX callback at runtime
> > (without stopping and reconfiguring the queue), and I suppose we do want to keep that ability.
> 
> For those _application_ where it can not decide or it need to add without
> stopping and re configuring the queue still can configure/enable the ethdev
> device with RX/TX callback option(and pay price for it)
> 
> > Again in some cases user can't decide does he wants a callback installed or not,
> > before he actually calls dev_start().
> > l3fwd ptype-parse callback is an example of such situation:
> > right now user has to start device first to figure out which ptypes are supported by this PMD
> > with desired configuration, then based on this information he will (or will not) setup the callback.
> 
> Good point. Can we remove that limitation from PMD perspective? IMO, ptype-parse is more of HW feature,
> All we need to do a lookup(single load) to convert to DPDK packet type.
> http://dpdk.org/browse/dpdk/tree/drivers/net/thunderx/nicvf_rxtx.c#n258
> 
> I see almost all the drivers has dummy check like this, I think, That makes it
> dependency with start().
> 
>         if (dev->rx_pkt_burst == i40e_recv_pkts ||
>             dev->rx_pkt_burst == i40e_recv_pkts_bulk_alloc ||
>             dev->rx_pkt_burst == i40e_recv_scattered_pkts ||
>             dev->rx_pkt_burst == i40e_recv_scattered_pkts_vec ||
>             dev->rx_pkt_burst == i40e_recv_pkts_vec)
>                 return ptypes;
> 

We probably can, but it would require extra work and changes inside the PMDs.
i40e is a good one - all rx functions support the same ptypes.
But let say for ixgbe ARM version of vRX doesn't support all ptypes that x86 vRX
and generic scalar RX do.
For fm10k - vector RX and scalar RX also support different ptype sets.  

> 
> > What I thought in same direction - introduce a notion of 'static' and 'dynamic' callbacks.
> > Static ones can't be removed while queue is not stopped while dynamic can.
> > But that also requires user to select at queue_setup() stage does he wants dynamic
> > callbacks for that queue or not (new offload flag or extra field in rxconf/txconf structure).
> > So again - same limitation for the user.
> > Another possibility here - have a flag 'dynamic' or 'static' not per queue, but per individual callback.
> > Yes, that would mean API change for add/remove callback functions.
> 
> At least on my test, the regression was caused because adding the code in
> tx_burst and rx_burst under the RTE_ETHDEV_RXTX_CALLBACKS(which enabled
> by default). By disabling the RTE_ETHDEV_RXTX_CALLBACKS option, I don't
> see any regression. If so, Will such regression fixed by introducing
> 'static' or 'dynamic' callbacks? I am happy to test and provide feedback.

Probably no then, but let's try the experiment I mentioned above first.

> 
> For me, I think, The better option is request from application for the
> need for Rx/TX callbacks and set the handler appropriate in the driver to
> have zero impact for the application which does not need callback at all.

It is possible, but:
- that would be quite significant change and would affect each RX and TX function   in each PMD.
- from other side - even now there are several libs inside DPDK that rely on ability to add/remove
  callbacks on the fly - pdump, latencystats.
  ethdev rx profiling  also relies on ability to install RX callback.

With all these things I presume most apps will configure their devices with rx/tx callback support enabled anyway.
Which makes me think - is it worth to introduce such changes at all.
Anyway, my first suggestion would be let's try to figure out what exactly causes such big slowdown on ARM.
Konstantin

> 
> at minimum, on those PMDs that support detecting the supported ptypes without
> device start()
> 
> 
> 
> 
> >
> > Konstantin
> >
> > > But, don't invoke from common code
> > > # When application needs the callback support, it
> > > configured the given RX/TX queue offloads
> > > # If the Rx/TX callback configure by the application
> > > then driver calls the helper functions exposed by
> > > the common code to invoke RX callbacks.
> >
> >
> > >
> > > Jerin
> > >
> > > >
> > > > Ability to safely remove callbacks at runtime implies
> > > > some sort of synchronization.
> > > > Even I tried to make it as light as possible,
> > > > probably some slowdown is unavoidable.
> > > > Of course instead of introducing these changes at rte_ethdev layer
> > > > similar technique could be applied on individual callback basis.
> > > > In that case it would be up to callback writer/installer to decide
> > > > does he/she need a removable at runtime callback or not.
> > > > Though in that case, each installed callback might introduce extra
> > > > synchronization overhead and slowdown.
> > > >
> > > > Konstantin Ananyev (3):
> > > >   ethdev: introduce eth_queue_local
> > > >   ethdev: make it safe to remove rx/tx callback at runtime
> > > >   doc: ethdev ABI change deprecation notice
> > > >
> > > >  doc/guides/rel_notes/deprecation.rst |   5 +
> > > >  lib/librte_ether/rte_ethdev.c        | 390 ++++++++++++++++++++++-------------
> > > >  lib/librte_ether/rte_ethdev.h        | 174 ++++++++++++----
> > > >  3 files changed, 387 insertions(+), 182 deletions(-)
> > > >
> > > > --
> > > > 2.13.5
> > > >


More information about the dev mailing list