[dpdk-dev] [RFC] tunnel endpoint hw acceleration enablement

Adrien Mazarguil adrien.mazarguil at 6wind.com
Mon Mar 5 17:23:33 CET 2018


On Mon, Feb 26, 2018 at 05:44:01PM +0000, Doherty, Declan wrote:
> On 13/02/2018 5:05 PM, Adrien Mazarguil wrote:
> > Hi,
> > 
> > Apologies for being late to this thread, I've read the ensuing discussion
> > (hope I didn't miss any) and also think rte_flow could be improved in
> > several ways to enable TEP support, in particular regarding the ordering of
> > actions.
> > 
> > On the other hand I'm not sure a dedicated API for TEP is needed at all. I'm
> > not convinced rte_security chose the right path and would like to avoid
> > repeating the same mistakes if possible, more below.
> > 
> > On Thu, Dec 21, 2017 at 10:21:13PM +0000, Doherty, Declan wrote:
> > > This RFC contains a proposal to add a new tunnel endpoint API to DPDK that when used
> > > in conjunction with rte_flow enables the configuration of inline data path encapsulation
> > > and decapsulation of tunnel endpoint network overlays on accelerated IO devices.
> > > 
> > > The proposed new API would provide for the creation, destruction, and
> > > monitoring of a tunnel endpoint in supporting hw, as well as capabilities APIs to allow the
> > > acceleration features to be discovered by applications.
<snip>
> > Although I'm not convinced an opaque object is the right approach, if we
> > choose this route I suggest the much simpler:
> > 
> >   struct rte_flow_action_tep_(encap|decap) {
> >       struct rte_tep *tep;
> >       uint32_t flow_id;
> >   };
> > 
> 
> That's a fair point. The only other item we currently had the encap/decap
> actions supporting was the Ethernet item, and going back to a comment from
> Boris, having the Ethernet header separate from the tunnel is probably not
> ideal anyway. One of our reasons for using an opaque tep item was to allow
> modification of the TEP independently of all the flows being carried on
> it. So for instance if the src or dst MAC needs to be modified, or the
> output port needs to be changed, the TEP itself could be modified.

Makes sense. I think there's now consensus that without a dedicated API, it
can be done through multiple rte_flow groups and "jump" actions targeting
them. Such actions remain to be formally defined though.
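
To give a rough idea of what I have in mind (purely illustrative, neither a
"jump" action nor its syntax exist at this point): per-flow rules in group 0
would only redirect matching traffic, while a single rule in group 1 would
hold all the TEP parameters and could be updated without touching the
per-flow rules:

 attr = egress group 0;
 pattern = eth / ipv4 / udp / end;
 actions = jump group 1 / end;

 attr = egress group 1;
 pattern = end;
 actions = vxlan_encap {TEP parameters} / end;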

In the meantime there is an alternative approach when opaque pattern
items/actions are unavoidable: by using negative values [1].

In addition to an opaque object to use with rte_flow, a PMD could return a
PMD-specific negative value cast as enum rte_flow_{item,action}_type and
usable with the associated port ID only.

An API could even initialize a pattern item or an action object directly:

 struct rte_flow_action tep_action;
 
 if (rte_tep_create(port_id, &tep_action, ...) != 0)
      rte_panic("nooooo!");
 /*
  * tep_action is now initialized with an opaque type and conf pointer, it
  * can be used with rte_flow_create() as part of an action list.
  */

[1] http://dpdk.org/doc/guides/prog_guide/rte_flow.html#negative-types
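
On the PMD side, the returned action type would simply be a negative value
known only to that port's driver. A minimal sketch of the idea (the -42
value and the private object below are placeholders):

 /* Somewhere inside the hypothetical rte_tep_create() implementation. */
 tep_action->type = (enum rte_flow_action_type)-42; /* PMD-specific, valid for this port only */
 tep_action->conf = pmd_tep_ctx; /* opaque PMD-private TEP context */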

<snip>
> > > struct rte_tep *tep = rte_tep_create(port_id, &attrs, pattern);
> > > 
> > > Once the tep context is created flows can then be directed to that endpoint for
> > > processing. The following sections will outline how the author envisages flow
> > > programming will work and also how TEP acceleration can be combined with other
> > > accelerations.
> > 
> > In order to allow a single TEP context object to be shared by multiple flow
> > rules, a whole new API must be implemented and applications still have to
> > additionally create one rte_flow rule per TEP flow_id to manage. While this
> > probably results in shorter flow rule patterns and action lists, is it
> > really worth it?
> > 
> > While I understand the reasons for this approach, I'd like to push for a
> > rte_flow-only API as much as possible, I'll provide suggestions below.
> > 
> 
> Not only are the rules shorter to implement, it could also greatly reduce
> the number of cycles required to add flows, both in terms of the
> application marshaling the data into rte_flow patterns and the PMD parsing
> those patterns every time a flow is added. In the case where 10k's of
> flows are getting added per second this could add a significant overhead
> on the system.

True, although only if the underlying hardware supports it; some PMDs may
still have to update each flow rule independently in order to expose such an
API. Applications can't be certain an update operation will be quick and
atomic.

<snip>
> > > /** VERY IMPORTANT NOTE **/
> > > One of the core concepts of this proposal is that actions which modify the
> > > packet are defined in the order which they are to be processed. So first decap
> > > outer ethernet header, then the outer TEP headers.
> > > I think this is not only logical from a usability point of view, it should also
> > > simplify the logic required in PMDs to parse the desired actions.
> > 
> > This. I've been thinking about it for a very long time but never got around
> > to submitting a patch: handling rte_flow actions in order, allowing repeated
> > identical actions and therefore getting rid of DUP.
> > 
> > The current approach was a bad design decision on my part, I'm convinced
> > it must be redefined before combinations become commonplace (right now no
> > PMD implements any action whose order matters as far as I know).
> > 
> 
> I don't think it was an issue with the original implementation, as it
> doesn't really become an issue until we start working with packet
> modifications. To that note, I think we only need to limit action ordering
> to actions which modify the packet itself. Actions like counting, marking
> and selecting the output, be it port/pf/vf/queue/rss, are all independent
> of the actions which modify the packet.

I think a behavior that differs depending on the action type would make the
API more difficult to implement and document. Limiting it now could also
result in the need for breaking it again later.

For instance an application may want to send an unencapsulated packet to
some queue, then encapsulate its copy twice before sending the result to
some VF:

 actions queue index 5 / vxlan_encap / vlan_encap / vf id 42 / end

If the application wanted two encapsulated copies instead:

 actions vxlan_encap / vlan_encap / queue index 5 / vf id 42 / end

Defining actions as always performed left-to-right is simpler and more
versatile in my opinion.
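
For reference, the first example above would translate to an action array
like the following through the C API. This is only a sketch: the
VXLAN_ENCAP/VLAN_ENCAP action types and their configuration structures are
still to be defined, only QUEUE and VF exist today.

 struct rte_flow_action_queue queue = { .index = 5 };
 struct rte_flow_action_vf vf = { .id = 42 };

 /* Actions are performed left to right, i.e. in array order. */
 struct rte_flow_action actions[] = {
     /* 1. deliver the unencapsulated packet to queue 5 */
     { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
     /* 2. encapsulate a copy twice (proposed actions, conf TBD) */
     { .type = RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP, .conf = &vxlan_conf },
     { .type = RTE_FLOW_ACTION_TYPE_VLAN_ENCAP, .conf = &vlan_conf },
     /* 3. send the result to VF 42 */
     { .type = RTE_FLOW_ACTION_TYPE_VF, .conf = &vf },
     { .type = RTE_FLOW_ACTION_TYPE_END },
 };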

<snip>
> > I see three main use cases for egress since we do not want a PMD to parse
> > traffic in software to determine if it's candidate for TEP encapsulation:
> > 
> > 1. Traffic generated/forwarded by an application.
> > 2. Same as 1. assuming an application is aware hardware can match egress
> >     traffic in addition to encapsulating it.
> > 3. Traffic fully processed internally in hardware.
> > 
> > To handle 1., in my opinion the most common use case, PMDs should rely on an
> > application-provided mark pattern item (the converse of the MARK action):
> > 
> >   attr = egress;
> >   pattern = mark is 42 / end;
> >   actions = vxlan_encap {many parameters} / end;
> > 
> > To handle 2, hardware with the ability to recognize and encapsulate outgoing
> > traffic is required (applications can rely on rte_flow_validate()):
> > 
> >   attr = egress;
> >   pattern = eth / ipv4 / tcp / end;
> >   actions = vxlan_encap {many parameters} / end;
> > 
> > For 3, a combination of ingress and egress can be needed on a given
> > rule. For clarity, one should assert where traffic comes from and where it's
> > supposed to go:
> > 
> >   attr = ingress egress;
> >   pattern = eth / ipv4 / tcp / port id 0 / end;
> >   actions = vxlan_encap {many parameters} / vf id 5 / end;

I take "ingress" back from this example, it doesn't make sense in its
context.
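
Also regarding 2., applications can probe whether a port supports it through
the existing rte_flow_validate() before relying on it. A sketch, assuming a
VXLAN_ENCAP action along the lines discussed in this thread (its
configuration structure and port_id are placeholders here):

 struct rte_flow_attr attr = { .egress = 1 };
 struct rte_flow_item pattern[] = {
     { .type = RTE_FLOW_ITEM_TYPE_ETH },
     { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
     { .type = RTE_FLOW_ITEM_TYPE_TCP },
     { .type = RTE_FLOW_ITEM_TYPE_END },
 };
 struct rte_flow_action actions[] = {
     /* proposed action, configuration yet to be defined */
     { .type = RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP, .conf = &encap_conf },
     { .type = RTE_FLOW_ACTION_TYPE_END },
 };
 struct rte_flow_error error;

 /* Non-zero means the PMD/hardware cannot handle 2., fall back to 1. */
 if (rte_flow_validate(port_id, &attr, pattern, actions, &error))
     /* use the mark-based approach or software encapsulation */;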

> > The {many parameters} for VXLAN_ENCAP obviously remain to be defined; they
> > either have to include everything needed to construct the L2, L3, L4 and
> > VXLAN headers, or be split into separate actions for each layer specified
> > in innermost-to-outermost order.
> > 
> > No need for dedicated mbuf TEP flags.
> 
> These all make sense to me, if we really want to avoid the TEP API. Just a
> point on 3: if using port representors, then the ingress port can be
> implied by the rule on which the tunnel is created.

By the way I will soon submit yet another RFC on this topic. It describes
how one could configure device switching through rte_flow and its
interaction with representor ports (assuming they exist) as a follow up to
all the recent discussion.

<snip>
> > - First question is what's your opinion regarding focusing on rte_flow
> >    instead of a TEP API? (Note for counters: one could add COUNT actions as
> >    well, what's currently missing is a way to share counters among several
> >    flow rules, which is planned as well)
> > 
> Technically I see no issue with either approach being workable, but I think
> the flow-based approach has issues in terms of usability and performance.
> In my mind, thinking of a TEP as a logical object which flows get mapped
> into maps very closely to how they are used functionally in network
> deployments, and is the way I've seen them supported in every TOR switch
> API/CLI I've ever used. I also think it should enable a more performant
> control path when you don't need to specify all the TEP parameters for
> every flow; this is not an inconsiderable overhead. Saying all that, I do
> see the value in the cleanness at an API level of using purely rte_flow,
> although I do wonder whether that will just end up moving that into the
> application domain.

I see, well that's a valid use case. If TEP is really supported as a kind of
action target and not as an opaque collection of multiple flow rules, it
makes sense to expose a dedicated action for it.

As previously described, I would suggest using a negative type generated by
an experimental API while work is being performed on rte_flow to add simple
low-level encaps (VLAN, VXLAN, etc.), support for action ordering and the
more-or-less related switching configuration.

We'll then determine if an opaque TEP API still makes sense and if an
official rte_flow action type should be assigned.

> > - Regarding dedicated encap/decap actions instead of generic ones, given all
> >    protocols have different requirements (e.g. ESP encap is on a whole
> >    different level of complexity and likely needs callbacks)?
> > 
> Agreed on the need for dedicated encap/decap TEP actions.
> 
> > - Regarding the reliance on a MARK meta pattern item as a standard means for
> >    applications to tag egress traffic so a PMD knows what to do?
> 
> I do like that as an approach, but how would it work for combined actions,
> TEP + IPsec SA?

A given MARK ID would correspond to a given list of actions that would
include both TEP + IPsec SA in whichever order was requested, not to a
specific action.
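
As a sketch of the resulting rule, reusing the negative TEP action type from
the earlier example together with the existing SECURITY action (the MARK
pattern item remains to be defined, and ipsec_session/mark_42/port_id are
placeholders; the order of the two actions is whatever the application
requested):

 struct rte_flow_action_security sec = { .security_session = ipsec_session };
 struct rte_flow_attr attr = { .egress = 1 };
 struct rte_flow_item pattern[] = {
     /* MARK as a pattern item remains to be defined */
     { .type = RTE_FLOW_ITEM_TYPE_MARK, .spec = &mark_42 },
     { .type = RTE_FLOW_ITEM_TYPE_END },
 };
 struct rte_flow_action actions[] = {
     /* opaque TEP encap action filled in by rte_tep_create() */
     { .type = tep_action.type, .conf = tep_action.conf },
     /* IPsec processing through the existing rte_security session */
     { .type = RTE_FLOW_ACTION_TYPE_SECURITY, .conf = &sec },
     { .type = RTE_FLOW_ACTION_TYPE_END },
 };

 flow = rte_flow_create(port_id, &attr, pattern, actions, &error);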

> > - I'd like to send a deprecation notice for rte_flow regarding handling of
> >    actions (documentation and change in some PMDs to reject currently valid
> >    but seldom used flow rules accordingly) instead of a new flow
> >    attribute. Would you ack such a change for 18.05?
> > 
> 
> Apologies, I completely missed the ack for the 18.05 part of the question
> when I first read this mail; the answer would have been yes. I was out of
> office due to illness for part of that week, which was part of the reason
> for the delay in responding to this mail. But if we only restrict the
> action ordering requirement to chained modification actions, do we still
> need the deprecation notice? It won't break any existing implementations
> since, as you note, nobody supports that yet.

Not in DPDK itself AFAIK but you never know, the change of behavior may
result in previously unseen bugs in applications.

-- 
Adrien Mazarguil
6WIND

