[RFC] ethdev: fast path async flow API

Dariusz Sosnowski dsosnowski at nvidia.com
Tue Jan 30 17:08:35 CET 2024


> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit at amd.com>
> Sent: Tuesday, January 30, 2024 13:17
> To: Dariusz Sosnowski <dsosnowski at nvidia.com>; Stephen Hemminger
> <stephen at networkplumber.org>
> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas at monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko at oktetlabs.ru>; Ori Kam
> <orika at nvidia.com>; dev at dpdk.org
> Subject: Re: [RFC] ethdev: fast path async flow API 
> 
> On 1/30/2024 12:06 PM, Dariusz Sosnowski wrote:
> > Hi Ferruh,
> >
> >> -----Original Message-----
> >> From: Ferruh Yigit <ferruh.yigit at amd.com>
> >> Sent: Monday, January 29, 2024 18:36
> >> To: Dariusz Sosnowski <dsosnowski at nvidia.com>; Stephen Hemminger
> >> <stephen at networkplumber.org>
> >> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas at monjalon.net>;
> >> Andrew Rybchenko <andrew.rybchenko at oktetlabs.ru>; Ori Kam
> >> <orika at nvidia.com>; dev at dpdk.org
> >> Subject: Re: [RFC] ethdev: fast path async flow API
> >>
> >> On 1/29/2024 1:38 PM, Dariusz Sosnowski wrote:
> >>> Hi all,
> >>>
> >>>> -----Original Message-----
> >>>> From: Dariusz Sosnowski <dsosnowski at nvidia.com>
> >>>> Sent: Tuesday, January 23, 2024 12:37
> >>>> To: Stephen Hemminger <stephen at networkplumber.org>
> >>>> Cc: NBU-Contact-Thomas Monjalon (EXTERNAL)
> <thomas at monjalon.net>;
> >>>> Ferruh Yigit <ferruh.yigit at amd.com>; Andrew Rybchenko
> >>>> <andrew.rybchenko at oktetlabs.ru>; Ori Kam <orika at nvidia.com>;
> >>>> dev at dpdk.org
> >>>> Subject: RE: [RFC] ethdev: fast path async flow API
> >>>>
> >>>> Hi Stephen,
> >>>>
> >>>> Sorry for such a late response.
> >>>>
> >>>> From: Stephen Hemminger <stephen at networkplumber.org>
> >>>> Sent: Thursday, January 4, 2024 02:08
> >>>>> On Wed, 3 Jan 2024 19:14:49 +0000
> >>>>> Dariusz Sosnowski <dsosnowski at nvidia.com> wrote:
> >>>>>> In summary, in my opinion extending the async flow API with bulking
> >>>>>> capabilities or exposing the queue directly to the application is not
> >>>>>> desirable.
> >>>>>> This proposal aims to reduce the I-cache overhead in the async flow API by
> >>>>>> reusing the existing design pattern in DPDK - fast path functions are
> >>>>>> inlined into the application code and they call cached PMD callbacks.
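> >>>>>>
> >>>>>> To illustrate the pattern I'm referring to (the same shape that
> >>>>>> rte_eth_rx_burst() has), here is a rough sketch with hypothetical names -
> >>>>>> this is not the RFC code itself:
> >>>>>>
> >>>>>> #include <stdint.h>
> >>>>>>
> >>>>>> struct demo_fp_ops {
> >>>>>>         /* Callback and PMD context cached per port at configure time. */
> >>>>>>         int (*async_create)(void *pmd_ctx, const void *rule);
> >>>>>>         void *pmd_ctx;
> >>>>>> };
> >>>>>>
> >>>>>> /* One slot per port, indexed by port_id. */
> >>>>>> static struct demo_fp_ops demo_fp_ops_tbl[32];
> >>>>>>
> >>>>>> /* Inlined into the application: a single call through a cached callback,
> >>>>>>  * with no ops lookup on the fast path. */
> >>>>>> static inline int
> >>>>>> demo_async_create(uint16_t port_id, const void *rule)
> >>>>>> {
> >>>>>>         const struct demo_fp_ops *ops = &demo_fp_ops_tbl[port_id];
> >>>>>>
> >>>>>>         return ops->async_create(ops->pmd_ctx, rule);
> >>>>>> }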
> >>>>>
> >>>>> Inlining needs to be discouraged more in DPDK, because it only works if
> >>>>> the application ends up being built with DPDK from source.
> >>>>> It doesn't work for the Linux distro packaging model, symbol
> >>>>> versioning, etc.
> >>>>
> >>>> I understand the problem. In that case, I have a proposal.
> >>>> I had a chat with Thomas regarding this RFC, and he noticed that
> >>>> there are 2 separate changes proposed here:
> >>>>
> >>>> 1. Per-port callbacks for async flow API.
> >>>>     - Moves specified callbacks to the rte_flow_fp_ops struct and allows
> >>>>       PMDs to register them dynamically.
> >>>>     - Removes indirection at the API level - no need to call rte_flow_ops_get().
> >>>>     - Removes checking whether callbacks are defined - either the default
> >>>>       DPDK callback is used or the one provided by the PMD.
> >>>> 2. Make async flow API functions inlineable.
> >>>>
> >>>> Change (1) won't break ABI (existing callbacks in rte_flow_ops can
> >>>> be marked as deprecated for now and phased out later) and in my
> >>>> opinion is less controversial compared to change (2).
> >>>>
> >>>> I'll rerun the benchmarks for both changes separately to compare
> >>>> their benefits in isolation.
> >>>> This would allow us to decide if change (2) is worth the drawbacks
> >>>> it introduces.
> >>>>
> >>>> What do you think?
> >>>
> >>> I split the RFC into 2 parts:
> >>>
> >>> 1. Introduce per-port callbacks:
> >>>     - Introduce the rte_flow_fp_ops struct - it holds the fast path callbacks
> >>>       for each port. The PMD registers its callbacks through
> >>>       rte_flow_fp_ops_register().
> >>>     - Relevant rte_flow_async_* functions just pass their arguments to the
> >>>       fast path callbacks. Validation checks are done only if the
> >>>       RTE_FLOW_DEBUG macro is defined.
> >>>     - The biggest difference is the removal of callback access through
> >>>       rte_flow_get_ops().
> >>> 2. Inline async flow API functions.
> >>>     - Relevant rte_flow_async_* function definitions are moved to rte_flow.h
> >>>       and made inlineable.
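> >>>
> >>> Roughly, both changes together would look like the sketch below (simplified
> >>> placeholder names and signatures, not the actual patch):
> >>>
> >>> #include <stdint.h>
> >>> #include <errno.h>
> >>> #include <stddef.h>
> >>>
> >>> #define MAX_PORTS 32                    /* stands in for RTE_MAX_ETHPORTS */
> >>>
> >>> struct flow_handle;                     /* opaque flow rule handle */
> >>>
> >>> struct flow_fp_ops {
> >>>         struct flow_handle *(*async_create)(void *ctx, uint32_t queue_id,
> >>>                                             void *user_data);
> >>>         int (*async_destroy)(void *ctx, uint32_t queue_id,
> >>>                              struct flow_handle *flow);
> >>>         void *ctx;                      /* PMD private context */
> >>> };
> >>>
> >>> /* Part (1): per-port fast path ops, registered by the PMD. */
> >>> static struct flow_fp_ops flow_fp_ops_tbl[MAX_PORTS];
> >>>
> >>> int
> >>> flow_fp_ops_register(uint16_t port_id, const struct flow_fp_ops *ops)
> >>> {
> >>>         if (port_id >= MAX_PORTS || ops == NULL)
> >>>                 return -EINVAL;
> >>>         flow_fp_ops_tbl[port_id] = *ops;
> >>>         return 0;
> >>> }
> >>>
> >>> /* Part (2): a thin wrapper living in the header, inlineable into the
> >>>  * application. Validation is compiled in only under RTE_FLOW_DEBUG. */
> >>> static inline struct flow_handle *
> >>> flow_async_create(uint16_t port_id, uint32_t queue_id, void *user_data)
> >>> {
> >>>         struct flow_fp_ops *ops;
> >>>
> >>> #ifdef RTE_FLOW_DEBUG
> >>>         if (port_id >= MAX_PORTS)
> >>>                 return NULL;
> >>> #endif
> >>>         ops = &flow_fp_ops_tbl[port_id];
> >>>         return ops->async_create(ops->ctx, queue_id, user_data);
> >>> }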
> >>>
> >>> Here are the results of the benchmark:
> >>>
> >>> |                       | Avg Insertion | Diff over baseline | Avg Deletion | Diff over baseline |
> >>> |-----------------------|---------------|--------------------|--------------|--------------------|
> >>> | baseline (v24.03-rc0) |     5855.4    |                    |    9026.8    |                    |
> >>> | applied (1)           |     6384.2    |    +528.8 (+9%)    |   10054.2    |  +1027.4 (+11.4%)  |
> >>> | applied (2)           |     6434.6    |   +579.2 (+9.9%)   |   10011.4    |   +984.6 (+10.9%)  |
> >>>
> >>> Results are in KFlows/sec.
> >>> The benchmark is continuously inserting and deleting 2M flow rules.
> >>> These rules match on the IPv4 destination address and have a single
> >>> DROP action.
> >>> Flow rules are inserted and deleted using a single flow queue.
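> >>>
> >>> For reference, the insert/delete loop is similar in spirit to the following
> >>> sketch of public async flow API usage (DPDK 22.11+). Template table setup,
> >>> per-rule spec updates and completion handling are simplified; this is not
> >>> the exact benchmark code:
> >>>
> >>> #include <rte_common.h>
> >>> #include <rte_flow.h>
> >>>
> >>> /* flows[] must hold n_rules entries; pattern matches on the IPv4 destination
> >>>  * address, actions is a single DROP, both built from pre-created templates. */
> >>> static void
> >>> insert_delete_loop(uint16_t port, uint32_t queue,
> >>>                    struct rte_flow_template_table *tbl,
> >>>                    const struct rte_flow_item pattern[],
> >>>                    const struct rte_flow_action actions[],
> >>>                    struct rte_flow **flows, uint32_t n_rules)
> >>> {
> >>>         const struct rte_flow_op_attr attr = { .postpone = 0 };
> >>>         struct rte_flow_op_result res[64];
> >>>         struct rte_flow_error err;
> >>>         uint32_t i;
> >>>
> >>>         for (i = 0; i < n_rules; i++) {
> >>>                 flows[i] = rte_flow_async_create(port, queue, &attr, tbl,
> >>>                                                  pattern, 0, actions, 0,
> >>>                                                  NULL, &err);
> >>>                 rte_flow_push(port, queue, &err);
> >>>                 rte_flow_pull(port, queue, res, RTE_DIM(res), &err);
> >>>         }
> >>>         for (i = 0; i < n_rules; i++) {
> >>>                 rte_flow_async_destroy(port, queue, &attr, flows[i],
> >>>                                        NULL, &err);
> >>>                 rte_flow_push(port, queue, &err);
> >>>                 rte_flow_pull(port, queue, res, RTE_DIM(res), &err);
> >>>         }
> >>> }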
> >>>
> >>> Change (2) improves insertion rate performance by ~1% compared to (1),
> >>> but decreases the deletion rate by ~0.5%.
> >>> Based on these results, I think we can say that making rte_flow_async_*()
> >>> calls inlineable may not be worth it compared to the issues it causes.
> >>>
> >>> What do you all think about the results?
> >>>
> >>
> >> Hi Dariusz,
> >>
> >> As discussed before, converting APIs to static inline functions or
> >> exposing 'rte_eth_dev' has its cons, but applying only the first part
> >> above seems to get rid of them while keeping a reasonable performance
> >> gain, so I think we can continue with this approach and proceed with reviews.
> >>
> >> Most of the 'rte_flow_async_*' functions are already missing the function
> >> validity check, so having a default (dummy?) rte_flow_fp_ops for them
> >> even seems like an improvement there.
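> >>
> >> By "default (dummy)" ops I mean something along these lines (illustrative
> >> placeholders only): unregistered slots point at stubs that simply fail, so
> >> the hot path never needs a NULL-callback branch.
> >>
> >> #include <errno.h>
> >> #include <stdint.h>
> >> #include <stddef.h>
> >>
> >> struct flow_handle;
> >>
> >> static struct flow_handle *
> >> dummy_async_create(void *ctx, uint32_t queue_id, void *user_data)
> >> {
> >>         (void)ctx; (void)queue_id; (void)user_data;
> >>         return NULL;            /* no PMD callback registered */
> >> }
> >>
> >> static int
> >> dummy_async_destroy(void *ctx, uint32_t queue_id, struct flow_handle *flow)
> >> {
> >>         (void)ctx; (void)queue_id; (void)flow;
> >>         return -ENOSYS;         /* no PMD callback registered */
> >> }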
> >>
> >>
> >> Previously, only 'struct rte_flow_ops' was coming from the drivers; the
> >> ethdev layer didn't need to maintain anything.
> >> But with the 'rte_flow_fp_ops' struct, ethdev needs to store another
> >> array of fixed ('RTE_MAX_ETHPORTS') size, which will be another
> >> blocker in the future when converting these fixed arrays to dynamically
> >> allocated arrays.
> >>
> >> For this, does it still help if we add a new field to "struct
> >> rte_eth_dev", like "struct rte_flow_ops *flow_ops", similar to 'dev_ops'?
> >> The indirection will still be there, but it eliminates the 'rte_flow_get_ops()'
> >> call and the checks that come with it.
> >> If that makes sense, is there a chance to experiment with this and get some
> >> performance numbers?
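> >>
> >> Hypothetical shape of that suggestion (not an actual ethdev change):
> >>
> >> #include <stdint.h>
> >>
> >> struct flow_fp_ops {
> >>         int (*async_create)(void *ctx, uint32_t queue_id);
> >>         void *ctx;
> >> };
> >>
> >> struct eth_dev {                        /* stands in for struct rte_eth_dev */
> >>         void *data;
> >>         const void *dev_ops;
> >>         const struct flow_fp_ops *flow_ops;  /* new field, similar to dev_ops */
> >> };
> >>
> >> static struct eth_dev eth_devices[32];  /* stands in for rte_eth_devices[] */
> >>
> >> /* One extra pointer dereference compared to a dedicated static array, but
> >>  * no separate fixed-size array for ethdev to own and size. */
> >> static inline int
> >> flow_async_create(uint16_t port_id, uint32_t queue_id)
> >> {
> >>         const struct flow_fp_ops *ops = eth_devices[port_id].flow_ops;
> >>
> >>         return ops->async_create(ops->ctx, queue_id);
> >> }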
> > Which option do you have in mind specifically?
> >
> > 1. Keeping only fast path callbacks in "dev->flow_ops", so "struct rte_eth_dev"
> >    will hold only a pointer to "struct rte_flow_fp_ops" as defined in the RFC.
> >     - Only the async flow API will use "dev->flow_ops->callback".
> >     - Other APIs will go through "rte_flow_ops_get()".
> > 2. Keeping all flow callbacks in "dev->flow_ops", so "struct rte_flow_ops".
> >     - As a result, I think that "rte_flow_get_ops()" could be removed altogether,
> >       assuming that all functions have a default implementation, checking for
> >       port validity (ENODEV if the port is not valid) and returning ENOSYS.
> >
> > Regardless of the version, I can experiment with additional indirection.
> >
> >
> 
> Both can work; each has some cons.
> 
> I think "rte_flow_ops_get()" is the clearer approach and the rte_flow APIs are
> already using it, so no change is required, but using a separate struct for the
> fast path will create two different structs drivers need to fill, each populated
> in a slightly different way, which is not nice.
> 
> For this RFC I think we can go with option 1, and consider updating the rest if
> there is more motivation for it.

Ok, sounds good to me.

I redid the benchmark with the "dev->flow_ops->callback()" method.
There's no statistically significant difference between this and the static array of rte_flow_fp_ops, so I think we can go with option 1.
I'll send the patch for review as soon as possible.
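
For clarity, the two dispatch variants I compared look roughly like this
(simplified, hypothetical types - not the actual patch):

#include <stdint.h>

struct flow_fp_ops {
        int (*async_create)(void *ctx, uint32_t queue_id);
        void *ctx;
};

struct eth_dev {
        const struct flow_fp_ops *flow_ops;     /* option 1: pointer on the device */
};

static struct flow_fp_ops fp_ops_tbl[32];       /* dedicated static array variant */
static struct eth_dev eth_devices[32];

/* Variant A: dedicated static array indexed by port_id. */
static inline int
create_via_array(uint16_t port_id, uint32_t queue_id)
{
        const struct flow_fp_ops *ops = &fp_ops_tbl[port_id];

        return ops->async_create(ops->ctx, queue_id);
}

/* Variant B (the one we are going with): one extra dereference through the
 * device struct; the measured difference was not statistically significant. */
static inline int
create_via_dev(uint16_t port_id, uint32_t queue_id)
{
        const struct flow_fp_ops *ops = eth_devices[port_id].flow_ops;

        return ops->async_create(ops->ctx, queue_id);
}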

Best regards,
Dariusz Sosnowski

