[dpdk-dev,01/22] ethdev: introduce generic flow API

Message ID 1c8a8e4fec73ed33836f1da9525b1b8b53048518.1479309720.git.adrien.mazarguil@6wind.com (mailing list archive)
State: Superseded, archived
Checks: checkpatch/checkpatch success (coding style OK)

Commit Message

Adrien Mazarguil Nov. 16, 2016, 4:23 p.m. UTC
  This new API supersedes all the legacy filter types described in
rte_eth_ctrl.h. It is slightly higher level and as a result relies more on
PMDs to process and validate flow rules.

Benefits:

- A unified API is easier to program for, applications do not have to be
  written for a specific filter type which may or may not be supported by
  the underlying device.

- The behavior of a flow rule is the same regardless of the underlying
  device, applications do not need to be aware of hardware quirks.

- Extensible by design, API/ABI breakage should rarely occur if at all.

- Documentation is self-standing, no need to look up elsewhere.

Existing filter types will be deprecated and removed in the near future.

Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
---
 MAINTAINERS                            |   4 +
 lib/librte_ether/Makefile              |   3 +
 lib/librte_ether/rte_eth_ctrl.h        |   1 +
 lib/librte_ether/rte_ether_version.map |  10 +
 lib/librte_ether/rte_flow.c            | 159 +++++
 lib/librte_ether/rte_flow.h            | 947 ++++++++++++++++++++++++++++
 lib/librte_ether/rte_flow_driver.h     | 177 ++++++
 7 files changed, 1301 insertions(+)
  

Comments

Xing, Beilei Nov. 18, 2016, 6:36 a.m. UTC | #1
Hi Adrien,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> Sent: Thursday, November 17, 2016 12:23 AM
> To: dev@dpdk.org
> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; De Lara Guarch,
> Pablo <pablo.de.lara.guarch@intel.com>; Olivier Matz
> <olivier.matz@6wind.com>
> Subject: [dpdk-dev] [PATCH 01/22] ethdev: introduce generic flow API
> 
> This new API supersedes all the legacy filter types described in rte_eth_ctrl.h.
> It is slightly higher level and as a result relies more on PMDs to process and
> validate flow rules.
> 
> Benefits:
> 
> - A unified API is easier to program for, applications do not have to be
>   written for a specific filter type which may or may not be supported by
>   the underlying device.
> 
> - The behavior of a flow rule is the same regardless of the underlying
>   device, applications do not need to be aware of hardware quirks.
> 
> - Extensible by design, API/ABI breakage should rarely occur if at all.
> 
> - Documentation is self-standing, no need to look up elsewhere.
> 
> Existing filter types will be deprecated and removed in the near future.
> 
> Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>


> +
> +/**
> + * Opaque type returned after successfully creating a flow.
> + *
> + * This handle can be used to manage and query the related flow (e.g.
> +to
> + * destroy it or retrieve counters).
> + */
> +struct rte_flow;
> +

As we discussed before, we use attr/pattern/actions to create and destroy a flow in the PMD,
but I don't think it's easy to clone the user-provided parameters and return the result
to the application as an rte_flow pointer. As you suggested:
/* PMD-specific code. */
 struct rte_flow {
    struct rte_flow_attr attr;
    struct rte_flow_item *pattern;
    struct rte_flow_action *actions;
 };

Because both pattern and actions are pointers, and there are also pointers inside struct
rte_flow_item and struct rte_flow_action, we would need to allocate iteratively during the
clone and free iteratively during destroy, so the resulting code seems rather ugly, right?

I think the application saves this info when creating a flow rule, so why not have the
application provide the attr/pattern/actions info to the PMD before calling the PMD API?

Thanks,
Beilei Xing
  
Adrien Mazarguil Nov. 18, 2016, 10:28 a.m. UTC | #2
Hi Beilei,

On Fri, Nov 18, 2016 at 06:36:31AM +0000, Xing, Beilei wrote:
> Hi Adrien,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> > Sent: Thursday, November 17, 2016 12:23 AM
> > To: dev@dpdk.org
> > Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; De Lara Guarch,
> > Pablo <pablo.de.lara.guarch@intel.com>; Olivier Matz
> > <olivier.matz@6wind.com>
> > Subject: [dpdk-dev] [PATCH 01/22] ethdev: introduce generic flow API
> > 
> > This new API supersedes all the legacy filter types described in rte_eth_ctrl.h.
> > It is slightly higher level and as a result relies more on PMDs to process and
> > validate flow rules.
> > 
> > Benefits:
> > 
> > - A unified API is easier to program for, applications do not have to be
> >   written for a specific filter type which may or may not be supported by
> >   the underlying device.
> > 
> > - The behavior of a flow rule is the same regardless of the underlying
> >   device, applications do not need to be aware of hardware quirks.
> > 
> > - Extensible by design, API/ABI breakage should rarely occur if at all.
> > 
> > - Documentation is self-standing, no need to look up elsewhere.
> > 
> > Existing filter types will be deprecated and removed in the near future.
> > 
> > Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> 
> 
> > +
> > +/**
> > + * Opaque type returned after successfully creating a flow.
> > + *
> > + * This handle can be used to manage and query the related flow (e.g.
> > +to
> > + * destroy it or retrieve counters).
> > + */
> > +struct rte_flow;
> > +
> 
> As we talked before, we use attr/pattern/actions to create and destroy a flow in PMD, 
> but I don't think it's easy to clone the user-provided parameters and return the result
> to the application as a rte_flow pointer.  As you suggested:
> /* PMD-specific code. */
>  struct rte_flow {
>     struct rte_flow_attr attr;
>     struct rte_flow_item *pattern;
>     struct rte_flow_action *actions;
>  };

Just to provide some context to the community since the above snippet comes
from private exchanges, I've suggested the above structure as a means to
create and remove rules in the same fashion as FDIR, by providing the rule
used for creation to the destroy callback.

As an opaque type, each PMD currently needs to implement its own version of
struct rte_flow. The above definition may ease transition from FDIR to
rte_flow for some PMDs, however they need to clone the entire
application-provided rule to do so because there is no requirement for it to
be kept allocated.

I've implemented such a function in testpmd (port_flow_new() in commit [1])
as an example.

 [1] http://dpdk.org/ml/archives/dev/2016-November/050266.html

However my suggestion is for PMDs to use their own HW-specific structure
that only contains relevant information, instead of being forced to drag
around large, non-native data that lacks useful context and requires
parsing every time. This is one benefit of using an opaque type in the
first place, the other being avoidance of ABI breakage.
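As a concrete illustration of such a HW-specific structure (hypothetical PMD-private code; all names below are made up for the sketch, not part of the patch):

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical PMD-private definition of the opaque type. Instead of
 * cloning the application-provided attr/pattern/actions, the PMD stores
 * only what it needs to address the rule in hardware later.
 */
struct rte_flow {
	uint32_t hw_rule_id;  /* Index of the rule in the device rule table. */
	uint32_t counter_id;  /* HW counter bound to a COUNT action, if any. */
};

/* A PMD create callback would allocate this small object on success. */
static struct rte_flow *
pmd_flow_alloc(uint32_t hw_rule_id, uint32_t counter_id)
{
	struct rte_flow *flow = malloc(sizeof(*flow));

	if (flow == NULL)
		return NULL;
	flow->hw_rule_id = hw_rule_id;
	flow->counter_id = counter_id;
	return flow;
}
```

The destroy callback then only has to release the HW rule designated by hw_rule_id and free this small object, with no generic rule to walk.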

> Because both pattern and actions are pointers, and there're also pointers in structure
> rte_flow_item and struct rte_flow_action. We need to iterate allocation during clone
> and iterate free during destroy, then seems that the code is something ugly, right?

Well, since I wrote that code, I won't easily admit it's ugly. Actually, I
think PMDs should not need to duplicate generic rules at all; those are only
defined as a common language between applications and PMDs, and both sides
are free to store rules internally in their own preferred, efficient format.

> I think application saves info when creating a flow rule, so why not application provide
> attr/pattern/actions info to PMD before calling PMD API?

They have to do so temporarily (e.g. allocated on the stack) while calling
rte_flow_create() and rte_flow_validate(), that's it. Once a rule is
created, there's no requirement for applications to keep anything around.
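To make this concrete, here is a self-contained sketch of stack-allocated rule parameters. The type and enum definitions are minimal stand-ins so the snippet compiles outside DPDK; real code would simply include the rte_flow.h introduced by this patch:

```c
#include <stddef.h>

/* Minimal stand-ins mirroring the definitions from rte_flow.h. */
enum rte_flow_item_type {
	RTE_FLOW_ITEM_TYPE_END,
	RTE_FLOW_ITEM_TYPE_ETH,
	RTE_FLOW_ITEM_TYPE_IPV4,
};
enum rte_flow_action_type {
	RTE_FLOW_ACTION_TYPE_END,
	RTE_FLOW_ACTION_TYPE_DROP,
};
struct rte_flow_item {
	enum rte_flow_item_type type;
	const void *spec;
	const void *last;
	const void *mask;
};
struct rte_flow_action {
	enum rte_flow_action_type type;
	const void *conf;
};

/* Count pattern items up to the END marker, as a PMD parser would. */
static size_t
pattern_len(const struct rte_flow_item *pattern)
{
	size_t n = 0;

	while (pattern[n].type != RTE_FLOW_ITEM_TYPE_END)
		++n;
	return n;
}

/*
 * The whole rule description lives on the stack for the duration of the
 * call; only the opaque handle returned by rte_flow_create() would need
 * to outlive this scope.
 */
static size_t
build_and_measure(void)
{
	struct rte_flow_item pattern[] = {
		{ .type = RTE_FLOW_ITEM_TYPE_ETH },
		{ .type = RTE_FLOW_ITEM_TYPE_IPV4 },
		{ .type = RTE_FLOW_ITEM_TYPE_END },
	};
	struct rte_flow_action actions[] = {
		{ .type = RTE_FLOW_ACTION_TYPE_DROP },
		{ .type = RTE_FLOW_ACTION_TYPE_END },
	};

	(void)actions; /* Would be passed to rte_flow_create() here. */
	return pattern_len(pattern);
}
```

Once the create call returns, these arrays can go out of scope; nothing in the API requires them to stay allocated.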

For simple applications such as testpmd, the generic format is probably
enough. More complex, existing applications such as ovs-dpdk may rather
choose to keep using an internal format that already fits their needs;
partially duplicating this information in rte_flow_attr and
rte_flow_item/rte_flow_action lists would waste memory. The conversion in
that case should only be performed when creating/validating flow rules.

In short, I fail to see any downside in keeping struct rte_flow opaque
to applications.

Best regards,
  
Kevin Traynor Nov. 30, 2016, 5:47 p.m. UTC | #3
Hi Adrien,

On 11/16/2016 04:23 PM, Adrien Mazarguil wrote:
> This new API supersedes all the legacy filter types described in
> rte_eth_ctrl.h. It is slightly higher level and as a result relies more on
> PMDs to process and validate flow rules.
> 
> Benefits:
> 
> - A unified API is easier to program for, applications do not have to be
>   written for a specific filter type which may or may not be supported by
>   the underlying device.
> 
> - The behavior of a flow rule is the same regardless of the underlying
>   device, applications do not need to be aware of hardware quirks.
> 
> - Extensible by design, API/ABI breakage should rarely occur if at all.
> 
> - Documentation is self-standing, no need to look up elsewhere.
> 
> Existing filter types will be deprecated and removed in the near future.

I'd suggest adding a deprecation notice to deprecation.rst, ideally with
a target release.
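For reference, such an entry (wording and scope purely illustrative, to be adjusted when the target release is decided) might read:

```
* The legacy filtering API described in ``rte_eth_ctrl.h``
  (``rte_eth_dev_filter_ctrl()`` and the related filter types) is
  superseded by the generic flow API (``rte_flow``) and will be
  deprecated, then removed in a future release.
```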

> 
> Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> ---
>  MAINTAINERS                            |   4 +
>  lib/librte_ether/Makefile              |   3 +
>  lib/librte_ether/rte_eth_ctrl.h        |   1 +
>  lib/librte_ether/rte_ether_version.map |  10 +
>  lib/librte_ether/rte_flow.c            | 159 +++++
>  lib/librte_ether/rte_flow.h            | 947 ++++++++++++++++++++++++++++
>  lib/librte_ether/rte_flow_driver.h     | 177 ++++++
>  7 files changed, 1301 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d6bb8f8..3b46630 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -243,6 +243,10 @@ M: Thomas Monjalon <thomas.monjalon@6wind.com>
>  F: lib/librte_ether/
>  F: scripts/test-null.sh
>  
> +Generic flow API
> +M: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> +F: lib/librte_ether/rte_flow*
> +
>  Crypto API
>  M: Declan Doherty <declan.doherty@intel.com>
>  F: lib/librte_cryptodev/
> diff --git a/lib/librte_ether/Makefile b/lib/librte_ether/Makefile
> index efe1e5f..9335361 100644
> --- a/lib/librte_ether/Makefile
> +++ b/lib/librte_ether/Makefile
> @@ -44,6 +44,7 @@ EXPORT_MAP := rte_ether_version.map
>  LIBABIVER := 5
>  
>  SRCS-y += rte_ethdev.c
> +SRCS-y += rte_flow.c
>  
>  #
>  # Export include files
> @@ -51,6 +52,8 @@ SRCS-y += rte_ethdev.c
>  SYMLINK-y-include += rte_ethdev.h
>  SYMLINK-y-include += rte_eth_ctrl.h
>  SYMLINK-y-include += rte_dev_info.h
> +SYMLINK-y-include += rte_flow.h
> +SYMLINK-y-include += rte_flow_driver.h
>  
>  # this lib depends upon:
>  DEPDIRS-y += lib/librte_net lib/librte_eal lib/librte_mempool lib/librte_ring lib/librte_mbuf
> diff --git a/lib/librte_ether/rte_eth_ctrl.h b/lib/librte_ether/rte_eth_ctrl.h
> index fe80eb0..8386904 100644
> --- a/lib/librte_ether/rte_eth_ctrl.h
> +++ b/lib/librte_ether/rte_eth_ctrl.h
> @@ -99,6 +99,7 @@ enum rte_filter_type {
>  	RTE_ETH_FILTER_FDIR,
>  	RTE_ETH_FILTER_HASH,
>  	RTE_ETH_FILTER_L2_TUNNEL,
> +	RTE_ETH_FILTER_GENERIC,
>  	RTE_ETH_FILTER_MAX
>  };
>  
> diff --git a/lib/librte_ether/rte_ether_version.map b/lib/librte_ether/rte_ether_version.map
> index 72be66d..b5d2547 100644
> --- a/lib/librte_ether/rte_ether_version.map
> +++ b/lib/librte_ether/rte_ether_version.map
> @@ -147,3 +147,13 @@ DPDK_16.11 {
>  	rte_eth_dev_pci_remove;
>  
>  } DPDK_16.07;
> +
> +DPDK_17.02 {
> +	global:
> +
> +	rte_flow_validate;
> +	rte_flow_create;
> +	rte_flow_destroy;
> +	rte_flow_query;
> +
> +} DPDK_16.11;
> diff --git a/lib/librte_ether/rte_flow.c b/lib/librte_ether/rte_flow.c
> new file mode 100644
> index 0000000..064963d
> --- /dev/null
> +++ b/lib/librte_ether/rte_flow.c
> @@ -0,0 +1,159 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright 2016 6WIND S.A.
> + *   Copyright 2016 Mellanox.

There's a Mellanox copyright but yours is the only Signed-off-by - is that
right?

> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of 6WIND S.A. nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <stdint.h>
> +
> +#include <rte_errno.h>
> +#include <rte_branch_prediction.h>
> +#include "rte_ethdev.h"
> +#include "rte_flow_driver.h"
> +#include "rte_flow.h"
> +
> +/* Get generic flow operations structure from a port. */
> +const struct rte_flow_ops *
> +rte_flow_ops_get(uint8_t port_id, struct rte_flow_error *error)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +	const struct rte_flow_ops *ops;
> +	int code;
> +
> +	if (unlikely(!rte_eth_dev_is_valid_port(port_id)))
> +		code = ENODEV;
> +	else if (unlikely(!dev->dev_ops->filter_ctrl ||
> +			  dev->dev_ops->filter_ctrl(dev,
> +						    RTE_ETH_FILTER_GENERIC,
> +						    RTE_ETH_FILTER_GET,
> +						    &ops) ||
> +			  !ops))
> +		code = ENOTSUP;
> +	else
> +		return ops;
> +	rte_flow_error_set(error, code, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			   NULL, rte_strerror(code));
> +	return NULL;
> +}
> +

Is it expected that the application or PMD will provide locking around
these functions if required? I think it's going to have to be the app.
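If it is indeed left to applications, serializing flow operations is straightforward. A generic sketch (not part of the patch; the lock name, wrapper, and stub callback are made up for illustration):

```c
#include <pthread.h>

/*
 * Application-level lock serializing all flow rule operations; the name
 * and scheme are illustrative, nothing in the API mandates them.
 */
static pthread_mutex_t flow_table_lock = PTHREAD_MUTEX_INITIALIZER;
static int rules_installed; /* Example of state guarded by the lock. */

/*
 * Wrapper an application might use around rte_flow_create()-style calls
 * so that control-path threads never race on the flow table.
 */
static int
app_install_rule(int (*create)(void))
{
	int ret;

	pthread_mutex_lock(&flow_table_lock);
	ret = create(); /* e.g. call rte_flow_create() and save the handle */
	if (ret == 0)
		++rules_installed;
	pthread_mutex_unlock(&flow_table_lock);
	return ret;
}

/* Stub standing in for a real creation routine. */
static int
fake_create(void)
{
	return 0;
}
```

The same mutex would also cover rte_flow_destroy()/rte_flow_flush() so that the bookkeeping stays consistent.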

> +/* Check whether a flow rule can be created on a given port. */
> +int
> +rte_flow_validate(uint8_t port_id,
> +		  const struct rte_flow_attr *attr,
> +		  const struct rte_flow_item pattern[],
> +		  const struct rte_flow_action actions[],
> +		  struct rte_flow_error *error)
> +{
> +	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (unlikely(!ops))
> +		return -rte_errno;
> +	if (likely(!!ops->validate))
> +		return ops->validate(dev, attr, pattern, actions, error);
> +	rte_flow_error_set(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			   NULL, rte_strerror(ENOTSUP));
> +	return -rte_errno;
> +}
> +
> +/* Create a flow rule on a given port. */
> +struct rte_flow *
> +rte_flow_create(uint8_t port_id,
> +		const struct rte_flow_attr *attr,
> +		const struct rte_flow_item pattern[],
> +		const struct rte_flow_action actions[],
> +		struct rte_flow_error *error)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
> +
> +	if (unlikely(!ops))
> +		return NULL;
> +	if (likely(!!ops->create))
> +		return ops->create(dev, attr, pattern, actions, error);
> +	rte_flow_error_set(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			   NULL, rte_strerror(ENOTSUP));
> +	return NULL;
> +}
> +
> +/* Destroy a flow rule on a given port. */
> +int
> +rte_flow_destroy(uint8_t port_id,
> +		 struct rte_flow *flow,
> +		 struct rte_flow_error *error)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
> +
> +	if (unlikely(!ops))
> +		return -rte_errno;
> +	if (likely(!!ops->destroy))
> +		return ops->destroy(dev, flow, error);
> +	rte_flow_error_set(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			   NULL, rte_strerror(ENOTSUP));
> +	return -rte_errno;
> +}
> +
> +/* Destroy all flow rules associated with a port. */
> +int
> +rte_flow_flush(uint8_t port_id,
> +	       struct rte_flow_error *error)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
> +
> +	if (unlikely(!ops))
> +		return -rte_errno;
> +	if (likely(!!ops->flush))
> +		return ops->flush(dev, error);
> +	rte_flow_error_set(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			   NULL, rte_strerror(ENOTSUP));
> +	return -rte_errno;
> +}
> +
> +/* Query an existing flow rule. */
> +int
> +rte_flow_query(uint8_t port_id,
> +	       struct rte_flow *flow,
> +	       enum rte_flow_action_type action,
> +	       void *data,
> +	       struct rte_flow_error *error)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
> +
> +	if (!ops)
> +		return -rte_errno;
> +	if (likely(!!ops->query))
> +		return ops->query(dev, flow, action, data, error);
> +	rte_flow_error_set(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			   NULL, rte_strerror(ENOTSUP));
> +	return -rte_errno;
> +}
> diff --git a/lib/librte_ether/rte_flow.h b/lib/librte_ether/rte_flow.h
> new file mode 100644
> index 0000000..211f307
> --- /dev/null
> +++ b/lib/librte_ether/rte_flow.h
> @@ -0,0 +1,947 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright 2016 6WIND S.A.
> + *   Copyright 2016 Mellanox.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of 6WIND S.A. nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef RTE_FLOW_H_
> +#define RTE_FLOW_H_
> +
> +/**
> + * @file
> + * RTE generic flow API
> + *
> + * This interface provides the ability to program packet matching and
> + * associated actions in hardware through flow rules.
> + */
> +
> +#include <rte_arp.h>
> +#include <rte_ether.h>
> +#include <rte_icmp.h>
> +#include <rte_ip.h>
> +#include <rte_sctp.h>
> +#include <rte_tcp.h>
> +#include <rte_udp.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * Flow rule attributes.
> + *
> + * Priorities are set on two levels: per group and per rule within groups.
> + *
> + * Lower values denote higher priority, the highest priority for both levels
> + * is 0, so that a rule with priority 0 in group 8 is always matched after a
> + * rule with priority 8 in group 0.
> + *
> + * Although optional, applications are encouraged to group similar rules as
> + * much as possible to fully take advantage of hardware capabilities
> + * (e.g. optimized matching) and work around limitations (e.g. a single
> + * pattern type possibly allowed in a given group).
> + *
> + * Group and priority levels are arbitrary and up to the application, they
> + * do not need to be contiguous nor start from 0, however the maximum number
> + * varies between devices and may be affected by existing flow rules.
> + *
> + * If a packet is matched by several rules of a given group for a given
> + * priority level, the outcome is undefined. It can take any path, may be
> + * duplicated or even cause unrecoverable errors.

I get what you are trying to do here wrt supporting multiple
pmds/hardware implementations and it's a good idea to keep it flexible.

Given that the outcome is undefined, it would be nice if the application
had a way of finding out the specific effects, for verification and
debugging.

> + *
> + * Note that support for more than a single group and priority level is not
> + * guaranteed.
> + *
> + * Flow rules can apply to inbound and/or outbound traffic (ingress/egress).
> + *
> + * Several pattern items and actions are valid and can be used in both
> + * directions. Those valid for only one direction are described as such.
> + *
> + * Specifying both directions at once is not recommended but may be valid in
> + * some cases, such as incrementing the same counter twice.
> + *
> + * Not specifying any direction is currently an error.
> + */
> +struct rte_flow_attr {
> +	uint32_t group; /**< Priority group. */
> +	uint32_t priority; /**< Priority level within group. */
> +	uint32_t ingress:1; /**< Rule applies to ingress traffic. */
> +	uint32_t egress:1; /**< Rule applies to egress traffic. */
> +	uint32_t reserved:30; /**< Reserved, must be zero. */
> +};
> +
> +/**
> + * Matching pattern item types.
> + *
> + * Items are arranged in a list to form a matching pattern for packets.
> + * They fall in two categories:
> + *
> + * - Protocol matching (ANY, RAW, ETH, IPV4, IPV6, ICMP, UDP, TCP, SCTP,
> + *   VXLAN and so on), usually associated with a specification
> + *   structure. These must be stacked in the same order as the protocol
> + *   layers to match, starting from L2.
> + *
> + * - Affecting how the pattern is processed (END, VOID, INVERT, PF, VF, PORT
> + *   and so on), often without a specification structure. Since they are
> + *   meta data that does not match packet contents, these can be specified
> + *   anywhere within item lists without affecting the protocol matching
> + *   items.
> + *
> + * See the description of individual types for more information. Those
> + * marked with [META] fall into the second category.
> + */
> +enum rte_flow_item_type {
> +	/**
> +	 * [META]
> +	 *
> +	 * End marker for item lists. Prevents further processing of items,
> +	 * thereby ending the pattern.
> +	 *
> +	 * No associated specification structure.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_END,
> +
> +	/**
> +	 * [META]
> +	 *
> +	 * Used as a placeholder for convenience. It is ignored and simply
> +	 * discarded by PMDs.
> +	 *
> +	 * No associated specification structure.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_VOID,
> +
> +	/**
> +	 * [META]
> +	 *
> +	 * Inverted matching, i.e. process packets that do not match the
> +	 * pattern.
> +	 *
> +	 * No associated specification structure.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_INVERT,
> +
> +	/**
> +	 * Matches any protocol in place of the current layer, a single ANY
> +	 * may also stand for several protocol layers.
> +	 *
> +	 * See struct rte_flow_item_any.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_ANY,
> +
> +	/**
> +	 * [META]
> +	 *
> +	 * Matches packets addressed to the physical function of the device.
> +	 *
> +	 * If the underlying device function differs from the one that would
> +	 * normally receive the matched traffic, specifying this item
> +	 * prevents it from reaching that device unless the flow rule
> +	 * contains a PF action. Packets are not duplicated between device
> +	 * instances by default.
> +	 *
> +	 * No associated specification structure.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_PF,
> +
> +	/**
> +	 * [META]
> +	 *
> +	 * Matches packets addressed to a virtual function ID of the device.
> +	 *
> +	 * If the underlying device function differs from the one that would
> +	 * normally receive the matched traffic, specifying this item
> +	 * prevents it from reaching that device unless the flow rule
> +	 * contains a VF action. Packets are not duplicated between device
> +	 * instances by default.
> +	 *
> +	 * See struct rte_flow_item_vf.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_VF,
> +
> +	/**
> +	 * [META]
> +	 *
> +	 * Matches packets coming from the specified physical port of the
> +	 * underlying device.
> +	 *
> +	 * The first PORT item overrides the physical port normally
> +	 * associated with the specified DPDK input port (port_id). This
> +	 * item can be provided several times to match additional physical
> +	 * ports.
> +	 *
> +	 * See struct rte_flow_item_port.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_PORT,
> +
> +	/**
> +	 * Matches a byte string of a given length at a given offset.
> +	 *
> +	 * See struct rte_flow_item_raw.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_RAW,
> +
> +	/**
> +	 * Matches an Ethernet header.
> +	 *
> +	 * See struct rte_flow_item_eth.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_ETH,
> +
> +	/**
> +	 * Matches an 802.1Q/ad VLAN tag.
> +	 *
> +	 * See struct rte_flow_item_vlan.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_VLAN,
> +
> +	/**
> +	 * Matches an IPv4 header.
> +	 *
> +	 * See struct rte_flow_item_ipv4.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_IPV4,
> +
> +	/**
> +	 * Matches an IPv6 header.
> +	 *
> +	 * See struct rte_flow_item_ipv6.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_IPV6,
> +
> +	/**
> +	 * Matches an ICMP header.
> +	 *
> +	 * See struct rte_flow_item_icmp.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_ICMP,
> +
> +	/**
> +	 * Matches a UDP header.
> +	 *
> +	 * See struct rte_flow_item_udp.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_UDP,
> +
> +	/**
> +	 * Matches a TCP header.
> +	 *
> +	 * See struct rte_flow_item_tcp.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_TCP,
> +
> +	/**
> +	 * Matches a SCTP header.
> +	 *
> +	 * See struct rte_flow_item_sctp.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_SCTP,
> +
> +	/**
> +	 * Matches a VXLAN header.
> +	 *
> +	 * See struct rte_flow_item_vxlan.
> +	 */
> +	RTE_FLOW_ITEM_TYPE_VXLAN,
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_ANY
> + *
> + * Matches any protocol in place of the current layer, a single ANY may also
> + * stand for several protocol layers.
> + *
> + * This is usually specified as the first pattern item when looking for a
> + * protocol anywhere in a packet.
> + *
> + * A maximum value of 0 requests matching any number of protocol layers
> + * above or equal to the minimum value, a maximum value lower than the
> + * minimum one is otherwise invalid.
> + *
> + * This type does not work with a range (struct rte_flow_item.last).
> + */
> +struct rte_flow_item_any {
> +	uint16_t min; /**< Minimum number of layers covered. */
> +	uint16_t max; /**< Maximum number of layers covered, 0 for infinity. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_VF
> + *
> + * Matches packets addressed to a virtual function ID of the device.
> + *
> + * If the underlying device function differs from the one that would
> + * normally receive the matched traffic, specifying this item prevents it
> + * from reaching that device unless the flow rule contains a VF
> + * action. Packets are not duplicated between device instances by default.
> + *
> + * - Likely to return an error or never match any traffic if this causes a
> + *   VF device to match traffic addressed to a different VF.
> + * - Can be specified multiple times to match traffic addressed to several
> + *   specific VFs.
> + * - Can be combined with a PF item to match both PF and VF traffic.
> + *
> + * A zeroed mask can be used to match any VF.

Can you refer explicitly to the id field here?

> + */
> +struct rte_flow_item_vf {
> +	uint32_t id; /**< Destination VF ID. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_PORT
> + *
> + * Matches packets coming from the specified physical port of the underlying
> + * device.
> + *
> + * The first PORT item overrides the physical port normally associated with
> + * the specified DPDK input port (port_id). This item can be provided
> + * several times to match additional physical ports.
> + *
> + * Note that physical ports are not necessarily tied to DPDK input ports
> + * (port_id) when those are not under DPDK control. Possible values are
> + * specific to each device, they are not necessarily indexed from zero and
> + * may not be contiguous.
> + *
> + * As a device property, the list of allowed values as well as the value
> + * associated with a port_id should be retrieved by other means.
> + *
> + * A zeroed mask can be used to match any port index.
> + */
> +struct rte_flow_item_port {
> +	uint32_t index; /**< Physical port index. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_RAW
> + *
> + * Matches a byte string of a given length at a given offset.
> + *
> + * Offset is either absolute (using the start of the packet) or relative to
> + * the end of the previous matched item in the stack, in which case negative
> + * values are allowed.
> + *
> + * If search is enabled, offset is used as the starting point. The search
> + * area can be delimited by setting limit to a nonzero value, which is the
> + * maximum number of bytes after offset where the pattern may start.
> + *
> + * Matching a zero-length pattern is allowed, doing so resets the relative
> + * offset for subsequent items.
> + *
> + * This type does not work with a range (struct rte_flow_item.last).
> + */
> +struct rte_flow_item_raw {
> +	uint32_t relative:1; /**< Look for pattern after the previous item. */
> +	uint32_t search:1; /**< Search pattern from offset (see also limit). */
> +	uint32_t reserved:30; /**< Reserved, must be set to zero. */
> +	int32_t offset; /**< Absolute or relative offset for pattern. */
> +	uint16_t limit; /**< Search area limit for start of pattern. */
> +	uint16_t length; /**< Pattern length. */
> +	uint8_t pattern[]; /**< Byte string to look for. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_ETH
> + *
> + * Matches an Ethernet header.
> + */
> +struct rte_flow_item_eth {
> +	struct ether_addr dst; /**< Destination MAC. */
> +	struct ether_addr src; /**< Source MAC. */
> +	unsigned int type; /**< EtherType. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_VLAN
> + *
> + * Matches an 802.1Q/ad VLAN tag.
> + *
> + * This type normally follows either RTE_FLOW_ITEM_TYPE_ETH or
> + * RTE_FLOW_ITEM_TYPE_VLAN.
> + */
> +struct rte_flow_item_vlan {
> +	uint16_t tpid; /**< Tag protocol identifier. */
> +	uint16_t tci; /**< Tag control information. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_IPV4
> + *
> + * Matches an IPv4 header.
> + *
> + * Note: IPv4 options are handled by dedicated pattern items.
> + */
> +struct rte_flow_item_ipv4 {
> +	struct ipv4_hdr hdr; /**< IPv4 header definition. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_IPV6.
> + *
> + * Matches an IPv6 header.
> + *
> + * Note: IPv6 options are handled by dedicated pattern items.
> + */
> +struct rte_flow_item_ipv6 {
> +	struct ipv6_hdr hdr; /**< IPv6 header definition. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_ICMP.
> + *
> + * Matches an ICMP header.
> + */
> +struct rte_flow_item_icmp {
> +	struct icmp_hdr hdr; /**< ICMP header definition. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_UDP.
> + *
> + * Matches a UDP header.
> + */
> +struct rte_flow_item_udp {
> +	struct udp_hdr hdr; /**< UDP header definition. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_TCP.
> + *
> + * Matches a TCP header.
> + */
> +struct rte_flow_item_tcp {
> +	struct tcp_hdr hdr; /**< TCP header definition. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_SCTP.
> + *
> + * Matches an SCTP header.
> + */
> +struct rte_flow_item_sctp {
> +	struct sctp_hdr hdr; /**< SCTP header definition. */
> +};
> +
> +/**
> + * RTE_FLOW_ITEM_TYPE_VXLAN.
> + *
> + * Matches a VXLAN header (RFC 7348).
> + */
> +struct rte_flow_item_vxlan {
> +	uint8_t flags; /**< Normally 0x08 (I flag). */
> +	uint8_t rsvd0[3]; /**< Reserved, normally 0x000000. */
> +	uint8_t vni[3]; /**< VXLAN identifier. */
> +	uint8_t rsvd1; /**< Reserved, normally 0x00. */
> +};
> +
> +/**
> + * Matching pattern item definition.
> + *
> + * A pattern is formed by stacking items starting from the lowest protocol
> + * layer to match. This stacking restriction does not apply to meta items
> + * which can be placed anywhere in the stack with no effect on the meaning
> + * of the resulting pattern.
> + *
> + * A stack is terminated by an END item.
> + *
> + * The spec field should be a valid pointer to a structure of the related
> + * item type. It may be set to NULL in many cases to use default values.
> + *
> + * Optionally, last can point to a structure of the same type to define an
> + * inclusive range. This is mostly supported by integer and address fields
> + * and may cause errors otherwise. Fields that do not support ranges must be set
> + * to the same value as their spec counterparts.
> + *
> + * By default all fields present in spec are considered relevant.* This

typo "*"

> + * behavior can be altered by providing a mask structure of the same type
> + * with applicable bits set to one. It can also be used to partially filter
> + * out specific fields (e.g. as an alternate means to match ranges of IP
> + * addresses).
> + *
> + * Note this is a simple bit-mask applied before interpreting the contents
> + * of spec and last, which may yield unexpected results if not used
> + * carefully. For example, if for an IPv4 address field, spec provides
> + * 10.1.2.3, last provides 10.3.4.5 and mask provides 255.255.0.0, the
> + * effective range is 10.1.0.0 to 10.3.255.255.
> + *
> + * * The defaults for data-matching items such as IPv4 when mask is not
> + *   specified actually depend on the underlying implementation since only
> + *   recognized fields can be taken into account.
> + */
> +struct rte_flow_item {
> +	enum rte_flow_item_type type; /**< Item type. */
> +	const void *spec; /**< Pointer to item specification structure. */
> +	const void *last; /**< Defines an inclusive range (spec to last). */
> +	const void *mask; /**< Bit-mask applied to spec and last. */
> +};
> +
> +/**
> + * Action types.
> + *
> + * Each possible action is represented by a type. Some have associated
> + * configuration structures. Several actions combined in a list can be
> + * assigned to a flow rule. That list is not ordered.
> + *
> + * They fall into three categories:
> + *
> + * - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
> + *   processing matched packets by subsequent flow rules, unless overridden
> + *   with PASSTHRU.
> + *
> + * - Non-terminating actions (PASSTHRU, DUP) that leave matched packets up
> + *   for additional processing by subsequent flow rules.
> + *
> + * - Other non-terminating meta actions that do not affect the fate of
> + *   packets (END, VOID, MARK, FLAG, COUNT).
> + *
> + * When several actions are combined in a flow rule, they should all have
> + * different types (e.g. dropping a packet twice is not possible). The
> + * defined behavior is for PMDs to only take into account the last action of
> + * a given type found in the list. PMDs still perform error checking on the
> + * entire list.

Why define that the PMD will interpret multiple actions of the same type
in this way? Would it not make more sense for the PMD to just return
EINVAL for an invalid set of actions? That seems more transparent to the
application.

> + *
> + * Note that PASSTHRU is the only action able to override a terminating
> + * rule.
> + */
> +enum rte_flow_action_type {
> +	/**
> +	 * [META]
> +	 *
> +	 * End marker for action lists. Prevents further processing of
> +	 * actions, thereby ending the list.
> +	 *
> +	 * No associated configuration structure.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_END,
> +
> +	/**
> +	 * [META]
> +	 *
> +	 * Used as a placeholder for convenience. It is ignored and simply
> +	 * discarded by PMDs.
> +	 *
> +	 * No associated configuration structure.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_VOID,
> +
> +	/**
> +	 * Leaves packets up for additional processing by subsequent flow
> +	 * rules. This is the default when a rule does not contain a
> +	 * terminating action, but can be specified to force a rule to
> +	 * become non-terminating.
> +	 *
> +	 * No associated configuration structure.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_PASSTHRU,
> +
> +	/**
> +	 * [META]
> +	 *
> +	 * Attaches a 32 bit value to packets.
> +	 *
> +	 * See struct rte_flow_action_mark.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_MARK,
> +
> +	/**
> +	 * [META]
> +	 *
> +	 * Flag packets. Similar to MARK but only affects ol_flags.
> +	 *
> +	 * Note: a distinctive flag must be defined for it.
> +	 *
> +	 * No associated configuration structure.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_FLAG,
> +
> +	/**
> +	 * Assigns packets to a given queue index.
> +	 *
> +	 * See struct rte_flow_action_queue.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_QUEUE,
> +
> +	/**
> +	 * Drops packets.
> +	 *
> +	 * PASSTHRU overrides this action if both are specified.
> +	 *
> +	 * No associated configuration structure.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_DROP,
> +
> +	/**
> +	 * [META]
> +	 *
> +	 * Enables counters for this rule.
> +	 *
> +	 * These counters can be retrieved and reset through rte_flow_query(),
> +	 * see struct rte_flow_query_count.
> +	 *
> +	 * No associated configuration structure.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_COUNT,
> +
> +	/**
> +	 * Duplicates packets to a given queue index.
> +	 *
> +	 * This is normally combined with QUEUE, however when used alone, it
> +	 * is actually similar to QUEUE + PASSTHRU.
> +	 *
> +	 * See struct rte_flow_action_dup.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_DUP,
> +
> +	/**
> +	 * Similar to QUEUE, except RSS is additionally performed on packets
> +	 * to spread them among several queues according to the provided
> +	 * parameters.
> +	 *
> +	 * See struct rte_flow_action_rss.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_RSS,
> +
> +	/**
> +	 * Redirects packets to the physical function (PF) of the current
> +	 * device.
> +	 *
> +	 * No associated configuration structure.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_PF,
> +
> +	/**
> +	 * Redirects packets to the virtual function (VF) of the current
> +	 * device with the specified ID.
> +	 *
> +	 * See struct rte_flow_action_vf.
> +	 */
> +	RTE_FLOW_ACTION_TYPE_VF,
> +};
> +
> +/**
> + * RTE_FLOW_ACTION_TYPE_MARK
> + *
> + * Attaches a 32 bit value to packets.
> + *
> + * This value is arbitrary and application-defined. For compatibility with
> + * FDIR it is returned in the hash.fdir.hi mbuf field. PKT_RX_FDIR_ID is
> + * also set in ol_flags.
> + */
> +struct rte_flow_action_mark {
> +	uint32_t id; /**< 32 bit value to return with packets. */
> +};

One use case I thought we would be able to cover for OVS is classification
in hardware, with the unique flow id sent with the packet to software.
But in OVS the ufid is 128 bits, so it won't fit here, and there is still
the miniflow extract overhead. I'm not sure if there is a practical way
around this.

Sugesh (cc'd) has looked at this before and may be able to comment or
correct me.

> +
> +/**
> + * RTE_FLOW_ACTION_TYPE_QUEUE
> + *
> + * Assigns packets to a given queue index.
> + *
> + * Terminating by default.
> + */
> +struct rte_flow_action_queue {
> +	uint16_t index; /**< Queue index to use. */
> +};
> +
> +/**
> + * RTE_FLOW_ACTION_TYPE_COUNT (query)
> + *
> + * Query structure to retrieve and reset flow rule counters.
> + */
> +struct rte_flow_query_count {
> +	uint32_t reset:1; /**< Reset counters after query [in]. */
> +	uint32_t hits_set:1; /**< hits field is set [out]. */
> +	uint32_t bytes_set:1; /**< bytes field is set [out]. */
> +	uint32_t reserved:29; /**< Reserved, must be zero [in, out]. */
> +	uint64_t hits; /**< Number of hits for this rule [out]. */
> +	uint64_t bytes; /**< Number of bytes through this rule [out]. */
> +};
> +
> +/**
> + * RTE_FLOW_ACTION_TYPE_DUP
> + *
> + * Duplicates packets to a given queue index.
> + *
> + * This is normally combined with QUEUE, however when used alone, it is
> + * actually similar to QUEUE + PASSTHRU.
> + *
> + * Non-terminating by default.
> + */
> +struct rte_flow_action_dup {
> +	uint16_t index; /**< Queue index to duplicate packets to. */
> +};
> +
> +/**
> + * RTE_FLOW_ACTION_TYPE_RSS
> + *
> + * Similar to QUEUE, except RSS is additionally performed on packets to
> + * spread them among several queues according to the provided parameters.
> + *
> + * Note: RSS hash result is normally stored in the hash.rss mbuf field,
> + * however it conflicts with the MARK action as they share the same
> + * space. When both actions are specified, the RSS hash is discarded and
> + * PKT_RX_RSS_HASH is not set in ol_flags. MARK has priority. The mbuf
> + * structure should eventually evolve to store both.
> + *
> + * Terminating by default.
> + */
> +struct rte_flow_action_rss {
> +	const struct rte_eth_rss_conf *rss_conf; /**< RSS parameters. */
> +	uint16_t queues; /**< Number of entries in queue[]. */
> +	uint16_t queue[]; /**< Queues indices to use. */

I'd try to avoid having both queue and queues - someone will say "huh?"
when reading the code. s/queues/num ?

> +};
> +
> +/**
> + * RTE_FLOW_ACTION_TYPE_VF
> + *
> + * Redirects packets to a virtual function (VF) of the current device.
> + *
> + * Packets matched by a VF pattern item can be redirected to their original
> + * VF ID instead of the specified one. This parameter may not be available
> + * and is not guaranteed to work properly if the VF part is matched by a
> + * prior flow rule or if packets are not addressed to a VF in the first
> + * place.

Not clear what you mean by "not guaranteed to work if...". Please return
failure when this action is used if it is not going to work.

> + *
> + * Terminating by default.
> + */
> +struct rte_flow_action_vf {
> +	uint32_t original:1; /**< Use original VF ID if possible. */
> +	uint32_t reserved:31; /**< Reserved, must be zero. */
> +	uint32_t id; /**< VF ID to redirect packets to. */
> +};
> +
> +/**
> + * Definition of a single action.
> + *
> + * A list of actions is terminated by a END action.
> + *
> + * For simple actions without a configuration structure, conf remains NULL.
> + */
> +struct rte_flow_action {
> +	enum rte_flow_action_type type; /**< Action type. */
> +	const void *conf; /**< Pointer to action configuration structure. */
> +};
> +
> +/**
> + * Opaque type returned after successfully creating a flow.
> + *
> + * This handle can be used to manage and query the related flow (e.g. to
> + * destroy it or retrieve counters).
> + */
> +struct rte_flow;
> +
> +/**
> + * Verbose error types.
> + *
> + * Most of them provide the type of the object referenced by struct
> + * rte_flow_error.cause.
> + */
> +enum rte_flow_error_type {
> +	RTE_FLOW_ERROR_TYPE_NONE, /**< No error. */
> +	RTE_FLOW_ERROR_TYPE_UNSPECIFIED, /**< Cause unspecified. */
> +	RTE_FLOW_ERROR_TYPE_HANDLE, /**< Flow rule (handle). */
> +	RTE_FLOW_ERROR_TYPE_ATTR_GROUP, /**< Group field. */
> +	RTE_FLOW_ERROR_TYPE_ATTR_PRIORITY, /**< Priority field. */
> +	RTE_FLOW_ERROR_TYPE_ATTR_INGRESS, /**< Ingress field. */
> +	RTE_FLOW_ERROR_TYPE_ATTR_EGRESS, /**< Egress field. */
> +	RTE_FLOW_ERROR_TYPE_ATTR, /**< Attributes structure. */
> +	RTE_FLOW_ERROR_TYPE_ITEM_NUM, /**< Pattern length. */
> +	RTE_FLOW_ERROR_TYPE_ITEM, /**< Specific pattern item. */
> +	RTE_FLOW_ERROR_TYPE_ACTION_NUM, /**< Number of actions. */
> +	RTE_FLOW_ERROR_TYPE_ACTION, /**< Specific action. */
> +};
> +
> +/**
> + * Verbose error structure definition.
> + *
> + * This object is normally allocated by applications and set by PMDs. The
> + * message points to a constant string which does not need to be freed by
> + * the application; however, its pointer can be considered valid only as long
> + * as its associated DPDK port remains configured. Closing the underlying
> + * device or unloading the PMD invalidates it.
> + *
> + * Both cause and message may be NULL regardless of the error type.
> + */
> +struct rte_flow_error {
> +	enum rte_flow_error_type type; /**< Cause field and error types. */
> +	const void *cause; /**< Object responsible for the error. */
> +	const char *message; /**< Human-readable error message. */
> +};
> +
> +/**
> + * Check whether a flow rule can be created on a given port.
> + *
> + * While this function has no effect on the target device, the flow rule is
> + * validated against its current configuration state and the returned value
> + * should be considered valid by the caller for that state only.
> + *
> + * The returned value is guaranteed to remain valid only as long as no
> + * successful calls to rte_flow_create() or rte_flow_destroy() are made in
> + * the meantime and no device parameters affecting flow rules in any way are
> + * modified, due to possible collisions or resource limitations (although in
> + * such cases EINVAL should not be returned).
> + *
> + * @param port_id
> + *   Port identifier of Ethernet device.
> + * @param[in] attr
> + *   Flow rule attributes.
> + * @param[in] pattern
> + *   Pattern specification (list terminated by the END pattern item).
> + * @param[in] actions
> + *   Associated actions (list terminated by the END action).
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + *
> + * @return
> + *   0 if flow rule is valid and can be created. A negative errno value
> + *   otherwise (rte_errno is also set), the following errors are defined:
> + *
> + *   -ENOSYS: underlying device does not support this functionality.
> + *
> + *   -EINVAL: unknown or invalid rule specification.
> + *
> + *   -ENOTSUP: valid but unsupported rule specification (e.g. partial
> + *   bit-masks are unsupported).
> + *
> + *   -EEXIST: collision with an existing rule.
> + *
> + *   -ENOMEM: not enough resources.
> + *
> + *   -EBUSY: action cannot be performed due to busy device resources, may
> + *   succeed if the affected queues or even the entire port are in a stopped
> + *   state (see rte_eth_dev_rx_queue_stop() and rte_eth_dev_stop()).
> + */
> +int
> +rte_flow_validate(uint8_t port_id,
> +		  const struct rte_flow_attr *attr,
> +		  const struct rte_flow_item pattern[],
> +		  const struct rte_flow_action actions[],
> +		  struct rte_flow_error *error);

Why not just use rte_flow_create() and get an error? Is it less
disruptive to do a validate and find the rule cannot be created, than
using a create directly?

> +
> +/**
> + * Create a flow rule on a given port.
> + *
> + * @param port_id
> + *   Port identifier of Ethernet device.
> + * @param[in] attr
> + *   Flow rule attributes.
> + * @param[in] pattern
> + *   Pattern specification (list terminated by the END pattern item).
> + * @param[in] actions
> + *   Associated actions (list terminated by the END action).
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + *
> + * @return
> + *   A valid handle in case of success, NULL otherwise and rte_errno is set
> + *   to the positive version of one of the error codes defined for
> + *   rte_flow_validate().
> + */
> +struct rte_flow *
> +rte_flow_create(uint8_t port_id,
> +		const struct rte_flow_attr *attr,
> +		const struct rte_flow_item pattern[],
> +		const struct rte_flow_action actions[],
> +		struct rte_flow_error *error);

General question - are these functions threadsafe? In the OVS example
you could have several threads wanting to create flow rules at the same
time for same or different ports.

> +
> +/**
> + * Destroy a flow rule on a given port.
> + *
> + * Failure to destroy a flow rule handle may occur when other flow rules
> + * depend on it, and destroying it would result in an inconsistent state.
> + *
> + * This function is only guaranteed to succeed if handles are destroyed in
> + * reverse order of their creation.

How can the application find this information out on error?

> + *
> + * @param port_id
> + *   Port identifier of Ethernet device.
> + * @param flow
> + *   Flow rule handle to destroy.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +rte_flow_destroy(uint8_t port_id,
> +		 struct rte_flow *flow,
> +		 struct rte_flow_error *error);
> +
> +/**
> + * Destroy all flow rules associated with a port.
> + *
> + * In the unlikely event of failure, handles are still considered destroyed
> + * and no longer valid but the port must be assumed to be in an inconsistent
> + * state.
> + *
> + * @param port_id
> + *   Port identifier of Ethernet device.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +rte_flow_flush(uint8_t port_id,
> +	       struct rte_flow_error *error);

rte_flow_destroy_all() would be more descriptive (but breaks your style)

> +
> +/**
> + * Query an existing flow rule.
> + *
> + * This function allows retrieving flow-specific data such as counters.
> + * Data is gathered by special actions which must be present in the flow
> + * rule definition.

re last sentence, it would be good if you can put a link to
RTE_FLOW_ACTION_TYPE_COUNT

> + *
> + * @param port_id
> + *   Port identifier of Ethernet device.
> + * @param flow
> + *   Flow rule handle to query.
> + * @param action
> + *   Action type to query.
> + * @param[in, out] data
> + *   Pointer to storage for the associated query data type.

can this be anything other than rte_flow_query_count?

> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +int
> +rte_flow_query(uint8_t port_id,
> +	       struct rte_flow *flow,
> +	       enum rte_flow_action_type action,
> +	       void *data,
> +	       struct rte_flow_error *error);
> +
> +#ifdef __cplusplus
> +}
> +#endif

I don't see a way to dump out all the rules for a port. I think this is
necessary for debugging. You could have a look through dpif.h in OVS
and see how dpif_flow_dump_next() is used; it might be a good reference.

Also, it would be nice if there were an API that would allow a test
packet to be injected and traced for debugging - although I'm not
exactly sure how well it could be traced. For reference:
http://developers.redhat.com/blog/2016/10/12/tracing-packets-inside-open-vswitch/

thanks,
Kevin.

> +
> +#endif /* RTE_FLOW_H_ */
> diff --git a/lib/librte_ether/rte_flow_driver.h b/lib/librte_ether/rte_flow_driver.h
> new file mode 100644
> index 0000000..a88c621
> --- /dev/null
> +++ b/lib/librte_ether/rte_flow_driver.h
> @@ -0,0 +1,177 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright 2016 6WIND S.A.
> + *   Copyright 2016 Mellanox.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of 6WIND S.A. nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef RTE_FLOW_DRIVER_H_
> +#define RTE_FLOW_DRIVER_H_
> +
> +/**
> + * @file
> + * RTE generic flow API (driver side)
> + *
> + * This file provides implementation helpers for internal use by PMDs; they
> + * are not intended to be exposed to applications and are not subject to ABI
> + * versioning.
> + */
> +
> +#include <stdint.h>
> +
> +#include <rte_errno.h>
> +#include "rte_flow.h"
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * Generic flow operations structure implemented and returned by PMDs.
> + *
> + * To implement this API, PMDs must handle the RTE_ETH_FILTER_GENERIC filter
> + * type in their .filter_ctrl callback function (struct eth_dev_ops) as well
> + * as the RTE_ETH_FILTER_GET filter operation.
> + *
> + * If successful, this operation must result in a pointer to a PMD-specific
> + * struct rte_flow_ops written to the argument address as described below:
> + *
> + *  // PMD filter_ctrl callback
> + *
> + *  static const struct rte_flow_ops pmd_flow_ops = { ... };
> + *
> + *  switch (filter_type) {
> + *  case RTE_ETH_FILTER_GENERIC:
> + *      if (filter_op != RTE_ETH_FILTER_GET)
> + *          return -EINVAL;
> + *      *(const void **)arg = &pmd_flow_ops;
> + *      return 0;
> + *  }
> + *
> + * See also rte_flow_ops_get().
> + *
> + * These callback functions are not supposed to be used by applications
> + * directly, which must rely on the API defined in rte_flow.h.
> + *
> + * Public-facing wrapper functions perform a few consistency checks so that
> + * unimplemented (i.e. NULL) callbacks simply return -ENOTSUP. These
> + * callbacks otherwise only differ by their first argument (with port ID
> + * already resolved to a pointer to struct rte_eth_dev).
> + */
> +struct rte_flow_ops {
> +	/** See rte_flow_validate(). */
> +	int (*validate)
> +		(struct rte_eth_dev *,
> +		 const struct rte_flow_attr *,
> +		 const struct rte_flow_item [],
> +		 const struct rte_flow_action [],
> +		 struct rte_flow_error *);
> +	/** See rte_flow_create(). */
> +	struct rte_flow *(*create)
> +		(struct rte_eth_dev *,
> +		 const struct rte_flow_attr *,
> +		 const struct rte_flow_item [],
> +		 const struct rte_flow_action [],
> +		 struct rte_flow_error *);
> +	/** See rte_flow_destroy(). */
> +	int (*destroy)
> +		(struct rte_eth_dev *,
> +		 struct rte_flow *,
> +		 struct rte_flow_error *);
> +	/** See rte_flow_flush(). */
> +	int (*flush)
> +		(struct rte_eth_dev *,
> +		 struct rte_flow_error *);
> +	/** See rte_flow_query(). */
> +	int (*query)
> +		(struct rte_eth_dev *,
> +		 struct rte_flow *,
> +		 enum rte_flow_action_type,
> +		 void *,
> +		 struct rte_flow_error *);
> +};
> +
> +/**
> + * Initialize generic flow error structure.
> + *
> + * This function also sets rte_errno to a given value.
> + *
> + * @param[out] error
> + *   Pointer to flow error structure (may be NULL).
> + * @param code
> + *   Related error code (rte_errno).
> + * @param type
> + *   Cause field and error types.
> + * @param cause
> + *   Object responsible for the error.
> + * @param message
> + *   Human-readable error message.
> + *
> + * @return
> + *   Pointer to flow error structure.
> + */
> +static inline struct rte_flow_error *
> +rte_flow_error_set(struct rte_flow_error *error,
> +		   int code,
> +		   enum rte_flow_error_type type,
> +		   void *cause,
> +		   const char *message)
> +{
> +	if (error) {
> +		*error = (struct rte_flow_error){
> +			.type = type,
> +			.cause = cause,
> +			.message = message,
> +		};
> +	}
> +	rte_errno = code;
> +	return error;
> +}
> +
> +/**
> + * Get generic flow operations structure from a port.
> + *
> + * @param port_id
> + *   Port identifier to query.
> + * @param[out] error
> + *   Pointer to flow error structure.
> + *
> + * @return
> + *   The flow operations structure associated with port_id, NULL in case of
> + *   error, in which case rte_errno is set and the error structure contains
> + *   additional details.
> + */
> +const struct rte_flow_ops *
> +rte_flow_ops_get(uint8_t port_id, struct rte_flow_error *error);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* RTE_FLOW_DRIVER_H_ */
>
  
Adrien Mazarguil Dec. 1, 2016, 8:36 a.m. UTC | #4
Hi Kevin,

On Wed, Nov 30, 2016 at 05:47:17PM +0000, Kevin Traynor wrote:
> Hi Adrien,
> 
> On 11/16/2016 04:23 PM, Adrien Mazarguil wrote:
> > This new API supersedes all the legacy filter types described in
> > rte_eth_ctrl.h. It is slightly higher level and as a result relies more on
> > PMDs to process and validate flow rules.
> > 
> > Benefits:
> > 
> > - A unified API is easier to program for, applications do not have to be
> >   written for a specific filter type which may or may not be supported by
> >   the underlying device.
> > 
> > - The behavior of a flow rule is the same regardless of the underlying
> >   device, applications do not need to be aware of hardware quirks.
> > 
> > - Extensible by design, API/ABI breakage should rarely occur if at all.
> > 
> > - Documentation is self-standing, no need to look up elsewhere.
> > 
> > Existing filter types will be deprecated and removed in the near future.
> 
> I'd suggest to add a deprecation notice to deprecation.rst, ideally with
> a target release.

Will do, though I'm not sure about the target release. It seems a bit early
since no PMD really supports this API yet.

[...]
> > diff --git a/lib/librte_ether/rte_flow.c b/lib/librte_ether/rte_flow.c
> > new file mode 100644
> > index 0000000..064963d
> > --- /dev/null
> > +++ b/lib/librte_ether/rte_flow.c
> > @@ -0,0 +1,159 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright 2016 6WIND S.A.
> > + *   Copyright 2016 Mellanox.
> 
> There's Mellanox copyright but you are the only signed-off-by - is that
> right?

Yes, I'm the primary maintainer for Mellanox PMDs and this API was designed
on their behalf to expose several features from mlx4/mlx5 as the existing
filter types had too many limitations.

[...]
> > +/* Get generic flow operations structure from a port. */
> > +const struct rte_flow_ops *
> > +rte_flow_ops_get(uint8_t port_id, struct rte_flow_error *error)
> > +{
> > +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> > +	const struct rte_flow_ops *ops;
> > +	int code;
> > +
> > +	if (unlikely(!rte_eth_dev_is_valid_port(port_id)))
> > +		code = ENODEV;
> > +	else if (unlikely(!dev->dev_ops->filter_ctrl ||
> > +			  dev->dev_ops->filter_ctrl(dev,
> > +						    RTE_ETH_FILTER_GENERIC,
> > +						    RTE_ETH_FILTER_GET,
> > +						    &ops) ||
> > +			  !ops))
> > +		code = ENOTSUP;
> > +	else
> > +		return ops;
> > +	rte_flow_error_set(error, code, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > +			   NULL, rte_strerror(code));
> > +	return NULL;
> > +}
> > +
> 
> Is it expected that the application or pmd will provide locking between
> these functions if required? I think it's going to have to be the app.

Locking is indeed expected to be performed by applications. This API only
documents places where locking would make sense if necessary and expected
behavior.

Like all control path APIs, this one assumes a single control thread.
Applications must take the necessary precautions.

[...]
> > +/**
> > + * Flow rule attributes.
> > + *
> > + * Priorities are set on two levels: per group and per rule within groups.
> > + *
> > + * Lower values denote higher priority, the highest priority for both levels
> > + * is 0, so that a rule with priority 0 in group 8 is always matched after a
> > + * rule with priority 8 in group 0.
> > + *
> > + * Although optional, applications are encouraged to group similar rules as
> > + * much as possible to fully take advantage of hardware capabilities
> > + * (e.g. optimized matching) and work around limitations (e.g. a single
> > + * pattern type possibly allowed in a given group).
> > + *
> > + * Group and priority levels are arbitrary and up to the application, they
> > + * do not need to be contiguous nor start from 0, however the maximum number
> > + * varies between devices and may be affected by existing flow rules.
> > + *
> > + * If a packet is matched by several rules of a given group for a given
> > + * priority level, the outcome is undefined. It can take any path, may be
> > + * duplicated or even cause unrecoverable errors.
> 
> I get what you are trying to do here wrt supporting multiple
> pmds/hardware implementations and it's a good idea to keep it flexible.
> 
> Given that the outcome is undefined, it would be nice that the
> application has a way of finding the specific effects for verification
> and debugging.

Right, however it was deemed a bit difficult to manage in many cases, hence
the vagueness.

For example, suppose two rules with the same group and priority, one
matching any IPv4 header, the other one any UDP header:

- TCPv4 packets => rule #1.
- UDPv6 packets => rule #2.
- UDPv4 packets => both?

That last one is perhaps invalid; checking that some unspecified protocol
combination does not overlap is expensive and may miss corner cases. Even
assuming this is not an issue, what if the application guarantees that no
UDPv4 packets can ever hit that rule?

Suggestions are welcome though, perhaps we can refine the description.

> > + *
> > + * Note that support for more than a single group and priority level is not
> > + * guaranteed.
> > + *
> > + * Flow rules can apply to inbound and/or outbound traffic (ingress/egress).
> > + *
> > + * Several pattern items and actions are valid and can be used in both
> > + * directions. Those valid for only one direction are described as such.
> > + *
> > + * Specifying both directions at once is not recommended but may be valid in
> > + * some cases, such as incrementing the same counter twice.
> > + *
> > + * Not specifying any direction is currently an error.
> > + */
> > +struct rte_flow_attr {
> > +	uint32_t group; /**< Priority group. */
> > +	uint32_t priority; /**< Priority level within group. */
> > +	uint32_t ingress:1; /**< Rule applies to ingress traffic. */
> > +	uint32_t egress:1; /**< Rule applies to egress traffic. */
> > +	uint32_t reserved:30; /**< Reserved, must be zero. */
> > +};
[...]
> > +/**
> > + * RTE_FLOW_ITEM_TYPE_VF
> > + *
> > + * Matches packets addressed to a virtual function ID of the device.
> > + *
> > + * If the underlying device function differs from the one that would
> > + * normally receive the matched traffic, specifying this item prevents it
> > + * from reaching that device unless the flow rule contains a VF
> > + * action. Packets are not duplicated between device instances by default.
> > + *
> > + * - Likely to return an error or never match any traffic if this causes a
> > + *   VF device to match traffic addressed to a different VF.
> > + * - Can be specified multiple times to match traffic addressed to several
> > + *   specific VFs.
> > + * - Can be combined with a PF item to match both PF and VF traffic.
> > + *
> > + * A zeroed mask can be used to match any VF.
> 
> can you refer explicitly to id

If you mean "VF" to "VF ID" then yes, will do it for v2.

> > + */
> > +struct rte_flow_item_vf {
> > +	uint32_t id; /**< Destination VF ID. */
> > +};
[...]
> > +/**
> > + * Matching pattern item definition.
> > + *
> > + * A pattern is formed by stacking items starting from the lowest protocol
> > + * layer to match. This stacking restriction does not apply to meta items
> > + * which can be placed anywhere in the stack with no effect on the meaning
> > + * of the resulting pattern.
> > + *
> > + * A stack is terminated by a END item.
> > + *
> > + * The spec field should be a valid pointer to a structure of the related
> > + * item type. It may be set to NULL in many cases to use default values.
> > + *
> > + * Optionally, last can point to a structure of the same type to define an
> > + * inclusive range. This is mostly supported by integer and address fields,
> > + * may cause errors otherwise. Fields that do not support ranges must be set
> > + * to the same value as their spec counterparts.
> > + *
> > + * By default all fields present in spec are considered relevant.* This
> 
> typo "*"

No, that's an asterisk for a footnote below. Perhaps it is a bit unusual;
would something like "[1]" look better?

> > + * behavior can be altered by providing a mask structure of the same type
> > + * with applicable bits set to one. It can also be used to partially filter
> > + * out specific fields (e.g. as an alternate mean to match ranges of IP
> > + * addresses).
> > + *
> > + * Note this is a simple bit-mask applied before interpreting the contents
> > + * of spec and last, which may yield unexpected results if not used
> > + * carefully. For example, if for an IPv4 address field, spec provides
> > + * 10.1.2.3, last provides 10.3.4.5 and mask provides 255.255.0.0, the
> > + * effective range is 10.1.0.0 to 10.3.255.255.
> > + *

See footnote below:

> > + * * The defaults for data-matching items such as IPv4 when mask is not
> > + *   specified actually depend on the underlying implementation since only
> > + *   recognized fields can be taken into account.
> > + */
> > +struct rte_flow_item {
> > +	enum rte_flow_item_type type; /**< Item type. */
> > +	const void *spec; /**< Pointer to item specification structure. */
> > +	const void *last; /**< Defines an inclusive range (spec to last). */
> > +	const void *mask; /**< Bit-mask applied to spec and last. */
> > +};
> > +
> > +/**
> > + * Action types.
> > + *
> > + * Each possible action is represented by a type. Some have associated
> > + * configuration structures. Several actions combined in a list can be
> > + * affected to a flow rule. That list is not ordered.
> > + *
> > + * They fall in three categories:
> > + *
> > + * - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
> > + *   processing matched packets by subsequent flow rules, unless overridden
> > + *   with PASSTHRU.
> > + *
> > + * - Non terminating actions (PASSTHRU, DUP) that leave matched packets up
> > + *   for additional processing by subsequent flow rules.
> > + *
> > + * - Other non terminating meta actions that do not affect the fate of
> > + *   packets (END, VOID, MARK, FLAG, COUNT).
> > + *
> > + * When several actions are combined in a flow rule, they should all have
> > + * different types (e.g. dropping a packet twice is not possible). The
> > + * defined behavior is for PMDs to only take into account the last action of
> > + * a given type found in the list. PMDs still perform error checking on the
> > + * entire list.
> 
> why do you define that the pmd will interpret multiple same type rules
> in this way...would it not make more sense for the pmd to just return
> EINVAL for an invalid set of rules? It seems more transparent for the
> application.

Well, I had to define something as a default. The reason is that any number
of VOID actions may be specified, and I did not want that to be a special
case, in order to keep PMD parsers as simple as possible. I'll settle for
EINVAL (or some other error) if at least one PMD maintainer other than Nelio
who intends to implement this API is not convinced by this explanation, all
right?

[...]
> > +/**
> > + * RTE_FLOW_ACTION_TYPE_MARK
> > + *
> > + * Attaches a 32 bit value to packets.
> > + *
> > + * This value is arbitrary and application-defined. For compatibility with
> > + * FDIR it is returned in the hash.fdir.hi mbuf field. PKT_RX_FDIR_ID is
> > + * also set in ol_flags.
> > + */
> > +struct rte_flow_action_mark {
> > +	uint32_t id; /**< 32 bit value to return with packets. */
> > +};
> 
> One use case I thought we would be able to do for OVS is classification
> in hardware and the unique flow id is sent with the packet to software.
> But in OVS the ufid is 128 bits, so it means we can't and there is still
> the miniflow extract overhead. I'm not sure if there is a practical way
> around this.
> 
> Sugesh (cc'd) has looked at this before and may be able to comment or
> correct me.

Yes, we settled on 32 bit because currently no known hardware implementation
supports more than this. If that changes, another action with a larger type
shall be provided (no ABI breakage).

Also, since even 64 bits would not be enough for the use case you mention,
there is no choice but to use this as an indirect value (such as an array or
hash table index/value).

[...]
> > +/**
> > + * RTE_FLOW_ACTION_TYPE_RSS
> > + *
> > + * Similar to QUEUE, except RSS is additionally performed on packets to
> > + * spread them among several queues according to the provided parameters.
> > + *
> > + * Note: RSS hash result is normally stored in the hash.rss mbuf field,
> > + * however it conflicts with the MARK action as they share the same
> > + * space. When both actions are specified, the RSS hash is discarded and
> > + * PKT_RX_RSS_HASH is not set in ol_flags. MARK has priority. The mbuf
> > + * structure should eventually evolve to store both.
> > + *
> > + * Terminating by default.
> > + */
> > +struct rte_flow_action_rss {
> > +	const struct rte_eth_rss_conf *rss_conf; /**< RSS parameters. */
> > +	uint16_t queues; /**< Number of entries in queue[]. */
> > +	uint16_t queue[]; /**< Queues indices to use. */
> 
> I'd try and avoid queue and queues - someone will say "huh?" when
> reading code. s/queues/num ?

Agreed, will update for v2.

> > +};
> > +
> > +/**
> > + * RTE_FLOW_ACTION_TYPE_VF
> > + *
> > + * Redirects packets to a virtual function (VF) of the current device.
> > + *
> > + * Packets matched by a VF pattern item can be redirected to their original
> > + * VF ID instead of the specified one. This parameter may not be available
> > + * and is not guaranteed to work properly if the VF part is matched by a
> > + * prior flow rule or if packets are not addressed to a VF in the first
> > + * place.
> 
> Not clear what you mean by "not guaranteed to work if...". Please return
> fail when this action is used if this is not going to work.

Again, this is a case where it is difficult for a PMD to determine if the
entire list of flow rules makes sense. Perhaps it does, perhaps whatever
goes through has already been filtered out of possible issues.

Here the documentation states the precautions an application should take to
guarantee it will work as intended. Perhaps it can be reworded (any
suggestion?), but a PMD can certainly not provide any strong guarantee.

> > + *
> > + * Terminating by default.
> > + */
> > +struct rte_flow_action_vf {
> > +	uint32_t original:1; /**< Use original VF ID if possible. */
> > +	uint32_t reserved:31; /**< Reserved, must be zero. */
> > +	uint32_t id; /**< VF ID to redirect packets to. */
> > +};
[...]
> > +/**
> > + * Check whether a flow rule can be created on a given port.
> > + *
> > + * While this function has no effect on the target device, the flow rule is
> > + * validated against its current configuration state and the returned value
> > + * should be considered valid by the caller for that state only.
> > + *
> > + * The returned value is guaranteed to remain valid only as long as no
> > + * successful calls to rte_flow_create() or rte_flow_destroy() are made in
> > + * the meantime and no device parameter affecting flow rules in any way are
> > + * modified, due to possible collisions or resource limitations (although in
> > + * such cases EINVAL should not be returned).
> > + *
> > + * @param port_id
> > + *   Port identifier of Ethernet device.
> > + * @param[in] attr
> > + *   Flow rule attributes.
> > + * @param[in] pattern
> > + *   Pattern specification (list terminated by the END pattern item).
> > + * @param[in] actions
> > + *   Associated actions (list terminated by the END action).
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + *
> > + * @return
> > + *   0 if flow rule is valid and can be created. A negative errno value
> > + *   otherwise (rte_errno is also set), the following errors are defined:
> > + *
> > + *   -ENOSYS: underlying device does not support this functionality.
> > + *
> > + *   -EINVAL: unknown or invalid rule specification.
> > + *
> > + *   -ENOTSUP: valid but unsupported rule specification (e.g. partial
> > + *   bit-masks are unsupported).
> > + *
> > + *   -EEXIST: collision with an existing rule.
> > + *
> > + *   -ENOMEM: not enough resources.
> > + *
> > + *   -EBUSY: action cannot be performed due to busy device resources, may
> > + *   succeed if the affected queues or even the entire port are in a stopped
> > + *   state (see rte_eth_dev_rx_queue_stop() and rte_eth_dev_stop()).
> > + */
> > +int
> > +rte_flow_validate(uint8_t port_id,
> > +		  const struct rte_flow_attr *attr,
> > +		  const struct rte_flow_item pattern[],
> > +		  const struct rte_flow_action actions[],
> > +		  struct rte_flow_error *error);
> 
> Why not just use rte_flow_create() and get an error? Is it less
> disruptive to do a validate and find the rule cannot be created, than
> using a create directly?

The rationale can be found in the original RFC, which I'll convert to actual
documentation in v2. In short:

- Calling rte_flow_validate() before rte_flow_create() is useless since
  rte_flow_create() also performs validation.

- We cannot possibly express a full static set of allowed flow rules, even
  if we could, it usually depends on the current hardware configuration
  therefore would not be static.

- rte_flow_validate() is thus provided as a replacement for capability
  flags. It can be used to determine during initialization if the underlying
  device can support the typical flow rules an application might want to
  provide later and do something useful with that information (e.g. always
  use software fallback due to HW limitations).

- rte_flow_validate() being a subset of rte_flow_create(), it is essentially
  free to expose.

> > +
> > +/**
> > + * Create a flow rule on a given port.
> > + *
> > + * @param port_id
> > + *   Port identifier of Ethernet device.
> > + * @param[in] attr
> > + *   Flow rule attributes.
> > + * @param[in] pattern
> > + *   Pattern specification (list terminated by the END pattern item).
> > + * @param[in] actions
> > + *   Associated actions (list terminated by the END action).
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + *
> > + * @return
> > + *   A valid handle in case of success, NULL otherwise and rte_errno is set
> > + *   to the positive version of one of the error codes defined for
> > + *   rte_flow_validate().
> > + */
> > +struct rte_flow *
> > +rte_flow_create(uint8_t port_id,
> > +		const struct rte_flow_attr *attr,
> > +		const struct rte_flow_item pattern[],
> > +		const struct rte_flow_action actions[],
> > +		struct rte_flow_error *error);
> 
> General question - are these functions threadsafe? In the OVS example
> you could have several threads wanting to create flow rules at the same
> time for same or different ports.

No they aren't, applications have to perform their own locking. The RFC (to
be converted to actual documentation in v2) says that:

- API operations are synchronous and blocking (``EAGAIN`` cannot be
  returned).

- There is no provision for reentrancy/multi-thread safety, although nothing
  should prevent different devices from being configured at the same
  time. PMDs may protect their control path functions accordingly.

> > +
> > +/**
> > + * Destroy a flow rule on a given port.
> > + *
> > + * Failure to destroy a flow rule handle may occur when other flow rules
> > + * depend on it, and destroying it would result in an inconsistent state.
> > + *
> > + * This function is only guaranteed to succeed if handles are destroyed in
> > + * reverse order of their creation.
> 
> How can the application find this information out on error?

Without maintaining a list, they cannot. The specified case is the only
possible guarantee. That does not mean PMDs should not do their best to
destroy flow rules, only that ordering must remain consistent in case one
cannot be destroyed.

What do you suggest?

> > + *
> > + * @param port_id
> > + *   Port identifier of Ethernet device.
> > + * @param flow
> > + *   Flow rule handle to destroy.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + *
> > + * @return
> > + *   0 on success, a negative errno value otherwise and rte_errno is set.
> > + */
> > +int
> > +rte_flow_destroy(uint8_t port_id,
> > +		 struct rte_flow *flow,
> > +		 struct rte_flow_error *error);
> > +
> > +/**
> > + * Destroy all flow rules associated with a port.
> > + *
> > + * In the unlikely event of failure, handles are still considered destroyed
> > + * and no longer valid but the port must be assumed to be in an inconsistent
> > + * state.
> > + *
> > + * @param port_id
> > + *   Port identifier of Ethernet device.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + *
> > + * @return
> > + *   0 on success, a negative errno value otherwise and rte_errno is set.
> > + */
> > +int
> > +rte_flow_flush(uint8_t port_id,
> > +	       struct rte_flow_error *error);
> 
> rte_flow_destroy_all() would be more descriptive (but breaks your style)

There are enough underscores as it is. I like flush, if enough people
complain we'll change it but it has to occur before the first public
release.

> > +
> > +/**
> > + * Query an existing flow rule.
> > + *
> > + * This function allows retrieving flow-specific data such as counters.
> > + * Data is gathered by special actions which must be present in the flow
> > + * rule definition.
> 
> re last sentence, it would be good if you can put a link to
> RTE_FLOW_ACTION_TYPE_COUNT

Will do, I did not know how until very recently.

> > + *
> > + * @param port_id
> > + *   Port identifier of Ethernet device.
> > + * @param flow
> > + *   Flow rule handle to query.
> > + * @param action
> > + *   Action type to query.
> > + * @param[in, out] data
> > + *   Pointer to storage for the associated query data type.
> 
> can this be anything other than rte_flow_query_count?

Likely in the future. I've only defined this one as a counterpart for
existing API functionality and because we wanted to expose it in mlx5.

> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + *
> > + * @return
> > + *   0 on success, a negative errno value otherwise and rte_errno is set.
> > + */
> > +int
> > +rte_flow_query(uint8_t port_id,
> > +	       struct rte_flow *flow,
> > +	       enum rte_flow_action_type action,
> > +	       void *data,
> > +	       struct rte_flow_error *error);
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> 
> I don't see a way to dump all the rules for a port out. I think this is
> neccessary for degbugging. You could have a look through dpif.h in OVS
> and see how dpif_flow_dump_next() is used, it might be a good reference.

DPDK does not maintain flow rules and, depending on hardware capabilities
and level of compliance, PMDs do not necessarily do it either, particularly
since it requires space and applications probably have a better method to
store these pointers for their own needs.

What you see here is only a PMD interface. Depending on application needs,
generic helper functions built on top of these may be added to manage flow
rules in the future.

> Also, it would be nice if there were an api that would allow a test
> packet to be injected and traced for debugging - although I'm not
> exactly sure how well it could be traced. For reference:
> http://developers.redhat.com/blog/2016/10/12/tracing-packets-inside-open-vswitch/

Thanks for the link, I'm not sure how you'd do this either. Remember, as
generic as it looks, this interface is only meant to configure the
underlying device. You need to see it as one big offload, everything else
is left to applications.
  
Kevin Traynor Dec. 2, 2016, 9:06 p.m. UTC | #5
On 12/01/2016 08:36 AM, Adrien Mazarguil wrote:
> Hi Kevin,
> 
> On Wed, Nov 30, 2016 at 05:47:17PM +0000, Kevin Traynor wrote:
>> Hi Adrien,
>>
>> On 11/16/2016 04:23 PM, Adrien Mazarguil wrote:
>>> This new API supersedes all the legacy filter types described in
>>> rte_eth_ctrl.h. It is slightly higher level and as a result relies more on
>>> PMDs to process and validate flow rules.
>>>
>>> Benefits:
>>>
>>> - A unified API is easier to program for, applications do not have to be
>>>   written for a specific filter type which may or may not be supported by
>>>   the underlying device.
>>>
>>> - The behavior of a flow rule is the same regardless of the underlying
>>>   device, applications do not need to be aware of hardware quirks.
>>>
>>> - Extensible by design, API/ABI breakage should rarely occur if at all.
>>>
>>> - Documentation is self-standing, no need to look up elsewhere.
>>>
>>> Existing filter types will be deprecated and removed in the near future.
>>
>> I'd suggest to add a deprecation notice to deprecation.rst, ideally with
>> a target release.
> 
> Will do, not so sure about the target release though. It seems a bit early
> since no PMD really supports this API yet.
> 
> [...]
>>> diff --git a/lib/librte_ether/rte_flow.c b/lib/librte_ether/rte_flow.c
>>> new file mode 100644
>>> index 0000000..064963d
>>> --- /dev/null
>>> +++ b/lib/librte_ether/rte_flow.c
>>> @@ -0,0 +1,159 @@
>>> +/*-
>>> + *   BSD LICENSE
>>> + *
>>> + *   Copyright 2016 6WIND S.A.
>>> + *   Copyright 2016 Mellanox.
>>
>> There's Mellanox copyright but you are the only signed-off-by - is that
>> right?
> 
> Yes, I'm the primary maintainer for Mellanox PMDs and this API was designed
> on their behalf to expose several features from mlx4/mlx5 as the existing
> filter types had too many limitations.
> 
> [...]
>>> +/* Get generic flow operations structure from a port. */
>>> +const struct rte_flow_ops *
>>> +rte_flow_ops_get(uint8_t port_id, struct rte_flow_error *error)
>>> +{
>>> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
>>> +	const struct rte_flow_ops *ops;
>>> +	int code;
>>> +
>>> +	if (unlikely(!rte_eth_dev_is_valid_port(port_id)))
>>> +		code = ENODEV;
>>> +	else if (unlikely(!dev->dev_ops->filter_ctrl ||
>>> +			  dev->dev_ops->filter_ctrl(dev,
>>> +						    RTE_ETH_FILTER_GENERIC,
>>> +						    RTE_ETH_FILTER_GET,
>>> +						    &ops) ||
>>> +			  !ops))
>>> +		code = ENOTSUP;
>>> +	else
>>> +		return ops;
>>> +	rte_flow_error_set(error, code, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
>>> +			   NULL, rte_strerror(code));
>>> +	return NULL;
>>> +}
>>> +
>>
>> Is it expected that the application or pmd will provide locking between
>> these functions if required? I think it's going to have to be the app.
> 
> Locking is indeed expected to be performed by applications. This API only
> documents places where locking would make sense if necessary and expected
> behavior.
> 
> Like all control path APIs, this one assumes a single control thread.
> Applications must take the necessary precautions.

If you look at OVS now, it's quite possible that you have two rx queues
serviced by different threads, which would also install the flow rules in
the software flow caches - possibly that could extend to adding hardware
flows. There could also be another thread that is querying for stats. So
anything that can be done to minimise the locking would be helpful -
maybe query() could be atomic and not require any locking?

> 
> [...]
>>> +/**
>>> + * Flow rule attributes.
>>> + *
>>> + * Priorities are set on two levels: per group and per rule within groups.
>>> + *
>>> + * Lower values denote higher priority, the highest priority for both levels
>>> + * is 0, so that a rule with priority 0 in group 8 is always matched after a
>>> + * rule with priority 8 in group 0.
>>> + *
>>> + * Although optional, applications are encouraged to group similar rules as
>>> + * much as possible to fully take advantage of hardware capabilities
>>> + * (e.g. optimized matching) and work around limitations (e.g. a single
>>> + * pattern type possibly allowed in a given group).
>>> + *
>>> + * Group and priority levels are arbitrary and up to the application, they
>>> + * do not need to be contiguous nor start from 0, however the maximum number
>>> + * varies between devices and may be affected by existing flow rules.
>>> + *
>>> + * If a packet is matched by several rules of a given group for a given
>>> + * priority level, the outcome is undefined. It can take any path, may be
>>> + * duplicated or even cause unrecoverable errors.
>>
>> I get what you are trying to do here wrt supporting multiple
>> pmds/hardware implementations and it's a good idea to keep it flexible.
>>
>> Given that the outcome is undefined, it would be nice that the
>> application has a way of finding the specific effects for verification
>> and debugging.
> 
> Right; however, it was deemed a bit difficult to manage in many cases,
> hence the vagueness.
> 
> For example, suppose two rules with the same group and priority, one
> matching any IPv4 header, the other one any UDP header:
> 
> - TCPv4 packets => rule #1.
> - UDPv6 packets => rule #2.
> - UDPv4 packets => both?
> 
> That last one is perhaps invalid. Checking that some unspecified protocol
> combination does not overlap is expensive and may miss corner cases. Even
> assuming this is not an issue, what if the application guarantees that no
> UDPv4 packets can ever hit that rule?

That's fine - I don't expect the software to be able to know what the
hardware will do with those rules. It's more about trying to get a dump
from the hardware if something goes wrong. Anyway, it's covered in a
comment later.

> 
> Suggestions are welcome, though; perhaps we can refine the description.
> 
>>> + *
>>> + * Note that support for more than a single group and priority level is not
>>> + * guaranteed.
>>> + *
>>> + * Flow rules can apply to inbound and/or outbound traffic (ingress/egress).
>>> + *
>>> + * Several pattern items and actions are valid and can be used in both
>>> + * directions. Those valid for only one direction are described as such.
>>> + *
>>> + * Specifying both directions at once is not recommended but may be valid in
>>> + * some cases, such as incrementing the same counter twice.
>>> + *
>>> + * Not specifying any direction is currently an error.
>>> + */
>>> +struct rte_flow_attr {
>>> +	uint32_t group; /**< Priority group. */
>>> +	uint32_t priority; /**< Priority level within group. */
>>> +	uint32_t ingress:1; /**< Rule applies to ingress traffic. */
>>> +	uint32_t egress:1; /**< Rule applies to egress traffic. */
>>> +	uint32_t reserved:30; /**< Reserved, must be zero. */
>>> +};
> [...]
>>> +/**
>>> + * RTE_FLOW_ITEM_TYPE_VF
>>> + *
>>> + * Matches packets addressed to a virtual function ID of the device.
>>> + *
>>> + * If the underlying device function differs from the one that would
>>> + * normally receive the matched traffic, specifying this item prevents it
>>> + * from reaching that device unless the flow rule contains a VF
>>> + * action. Packets are not duplicated between device instances by default.
>>> + *
>>> + * - Likely to return an error or never match any traffic if this causes a
>>> + *   VF device to match traffic addressed to a different VF.
>>> + * - Can be specified multiple times to match traffic addressed to several
>>> + *   specific VFs.
>>> + * - Can be combined with a PF item to match both PF and VF traffic.
>>> + *
>>> + * A zeroed mask can be used to match any VF.
>>
>> can you refer explicitly to id
> 
> If you mean "VF" to "VF ID" then yes, will do it for v2.
> 
>>> + */
>>> +struct rte_flow_item_vf {
>>> +	uint32_t id; /**< Destination VF ID. */
>>> +};
> [...]
>>> +/**
>>> + * Matching pattern item definition.
>>> + *
>>> + * A pattern is formed by stacking items starting from the lowest protocol
>>> + * layer to match. This stacking restriction does not apply to meta items
>>> + * which can be placed anywhere in the stack with no effect on the meaning
>>> + * of the resulting pattern.
>>> + *
>>> + * A stack is terminated by a END item.
>>> + *
>>> + * The spec field should be a valid pointer to a structure of the related
>>> + * item type. It may be set to NULL in many cases to use default values.
>>> + *
>>> + * Optionally, last can point to a structure of the same type to define an
>>> + * inclusive range. This is mostly supported by integer and address fields,
>>> + * may cause errors otherwise. Fields that do not support ranges must be set
>>> + * to the same value as their spec counterparts.
>>> + *
>>> + * By default all fields present in spec are considered relevant.* This
>>
>> typo "*"
> 
> No, that's an asterisk for a footnote below. Perhaps it is a bit unusual;
> would something like "[1]" look better?

Oh, I thought it was the start of a comment line gone astray. Maybe "See
note below"; no big deal though.

> 
>>> + * behavior can be altered by providing a mask structure of the same type
>>> + * with applicable bits set to one. It can also be used to partially filter
>>> + * out specific fields (e.g. as an alternate mean to match ranges of IP
>>> + * addresses).
>>> + *
>>> + * Note this is a simple bit-mask applied before interpreting the contents
>>> + * of spec and last, which may yield unexpected results if not used
>>> + * carefully. For example, if for an IPv4 address field, spec provides
>>> + * 10.1.2.3, last provides 10.3.4.5 and mask provides 255.255.0.0, the
>>> + * effective range is 10.1.0.0 to 10.3.255.255.
>>> + *
> 
> See footnote below:
> 
>>> + * * The defaults for data-matching items such as IPv4 when mask is not
>>> + *   specified actually depend on the underlying implementation since only
>>> + *   recognized fields can be taken into account.
>>> + */
>>> +struct rte_flow_item {
>>> +	enum rte_flow_item_type type; /**< Item type. */
>>> +	const void *spec; /**< Pointer to item specification structure. */
>>> +	const void *last; /**< Defines an inclusive range (spec to last). */
>>> +	const void *mask; /**< Bit-mask applied to spec and last. */
>>> +};
>>> +
>>> +/**
>>> + * Action types.
>>> + *
>>> + * Each possible action is represented by a type. Some have associated
>>> + * configuration structures. Several actions combined in a list can be
>>> + * affected to a flow rule. That list is not ordered.
>>> + *
>>> + * They fall in three categories:
>>> + *
>>> + * - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
>>> + *   processing matched packets by subsequent flow rules, unless overridden
>>> + *   with PASSTHRU.
>>> + *
>>> + * - Non terminating actions (PASSTHRU, DUP) that leave matched packets up
>>> + *   for additional processing by subsequent flow rules.
>>> + *
>>> + * - Other non terminating meta actions that do not affect the fate of
>>> + *   packets (END, VOID, MARK, FLAG, COUNT).
>>> + *
>>> + * When several actions are combined in a flow rule, they should all have
>>> + * different types (e.g. dropping a packet twice is not possible). The
>>> + * defined behavior is for PMDs to only take into account the last action of
>>> + * a given type found in the list. PMDs still perform error checking on the
>>> + * entire list.
>>
>> why do you define that the pmd will interpret multiple same type rules
>> in this way...would it not make more sense for the pmd to just return
>> EINVAL for an invalid set of rules? It seems more transparent for the
>> application.
> 
> Well, I had to define something as a default. The reason is that any number
> of VOID actions may be specified, and I did not want that to be a special
> case, in order to keep PMD parsers as simple as possible. I'll settle for
> EINVAL (or some other error) if at least one PMD maintainer other than Nelio
> who intends to implement this API is not convinced by this explanation, all
> right?

From an API perspective I think it's cleaner to pass or fail with the
input rather than change it. But yes, please take PMD maintainers' input
as to what is reasonable to check also.

> 
> [...]
>>> +/**
>>> + * RTE_FLOW_ACTION_TYPE_MARK
>>> + *
>>> + * Attaches a 32 bit value to packets.
>>> + *
>>> + * This value is arbitrary and application-defined. For compatibility with
>>> + * FDIR it is returned in the hash.fdir.hi mbuf field. PKT_RX_FDIR_ID is
>>> + * also set in ol_flags.
>>> + */
>>> +struct rte_flow_action_mark {
>>> +	uint32_t id; /**< 32 bit value to return with packets. */
>>> +};
>>
>> One use case I thought we would be able to do for OVS is classification
>> in hardware and the unique flow id is sent with the packet to software.
>> But in OVS the ufid is 128 bits, so it means we can't and there is still
>> the miniflow extract overhead. I'm not sure if there is a practical way
>> around this.
>>
>> Sugesh (cc'd) has looked at this before and may be able to comment or
>> correct me.
> 
> Yes, we settled on 32 bit because currently no known hardware implementation
> supports more than this. If that changes, another action with a larger type
> shall be provided (no ABI breakage).
> 
> Also, since even 64 bits would not be enough for the use case you mention,
> there is no choice but to use this as an indirect value (such as an array or
> hash table index/value).

ok, cool. I think Sugesh has other ideas anyway!

> 
> [...]
>>> +/**
>>> + * RTE_FLOW_ACTION_TYPE_RSS
>>> + *
>>> + * Similar to QUEUE, except RSS is additionally performed on packets to
>>> + * spread them among several queues according to the provided parameters.
>>> + *
>>> + * Note: RSS hash result is normally stored in the hash.rss mbuf field,
>>> + * however it conflicts with the MARK action as they share the same
>>> + * space. When both actions are specified, the RSS hash is discarded and
>>> + * PKT_RX_RSS_HASH is not set in ol_flags. MARK has priority. The mbuf
>>> + * structure should eventually evolve to store both.
>>> + *
>>> + * Terminating by default.
>>> + */
>>> +struct rte_flow_action_rss {
>>> +	const struct rte_eth_rss_conf *rss_conf; /**< RSS parameters. */
>>> +	uint16_t queues; /**< Number of entries in queue[]. */
>>> +	uint16_t queue[]; /**< Queues indices to use. */
>>
>> I'd try and avoid queue and queues - someone will say "huh?" when
>> reading code. s/queues/num ?
> 
> Agreed, will update for v2.
> 
>>> +};
>>> +
>>> +/**
>>> + * RTE_FLOW_ACTION_TYPE_VF
>>> + *
>>> + * Redirects packets to a virtual function (VF) of the current device.
>>> + *
>>> + * Packets matched by a VF pattern item can be redirected to their original
>>> + * VF ID instead of the specified one. This parameter may not be available
>>> + * and is not guaranteed to work properly if the VF part is matched by a
>>> + * prior flow rule or if packets are not addressed to a VF in the first
>>> + * place.
>>
>> Not clear what you mean by "not guaranteed to work if...". Please return
>> fail when this action is used if this is not going to work.
> 
> Again, this is a case where it is difficult for a PMD to determine if the
> entire list of flow rules makes sense. Perhaps it does, perhaps whatever
> goes through has already been filtered out of possible issues.
> 
> Here the documentation states the precautions an application should take to
> guarantee it will work as intended. Perhaps it can be reworded (any
> suggestion?), but a PMD can certainly not provide any strong guarantee.

I see your point. Maybe for easy check things the pmd would return fail,
but for more complex I agree it's too difficult.

> 
>>> + *
>>> + * Terminating by default.
>>> + */
>>> +struct rte_flow_action_vf {
>>> +	uint32_t original:1; /**< Use original VF ID if possible. */
>>> +	uint32_t reserved:31; /**< Reserved, must be zero. */
>>> +	uint32_t id; /**< VF ID to redirect packets to. */
>>> +};
> [...]
>>> +/**
>>> + * Check whether a flow rule can be created on a given port.
>>> + *
>>> + * While this function has no effect on the target device, the flow rule is
>>> + * validated against its current configuration state and the returned value
>>> + * should be considered valid by the caller for that state only.
>>> + *
>>> + * The returned value is guaranteed to remain valid only as long as no
>>> + * successful calls to rte_flow_create() or rte_flow_destroy() are made in
>>> + * the meantime and no device parameter affecting flow rules in any way are
>>> + * modified, due to possible collisions or resource limitations (although in
>>> + * such cases EINVAL should not be returned).
>>> + *
>>> + * @param port_id
>>> + *   Port identifier of Ethernet device.
>>> + * @param[in] attr
>>> + *   Flow rule attributes.
>>> + * @param[in] pattern
>>> + *   Pattern specification (list terminated by the END pattern item).
>>> + * @param[in] actions
>>> + *   Associated actions (list terminated by the END action).
>>> + * @param[out] error
>>> + *   Perform verbose error reporting if not NULL.
>>> + *
>>> + * @return
>>> + *   0 if flow rule is valid and can be created. A negative errno value
>>> + *   otherwise (rte_errno is also set), the following errors are defined:
>>> + *
>>> + *   -ENOSYS: underlying device does not support this functionality.
>>> + *
>>> + *   -EINVAL: unknown or invalid rule specification.
>>> + *
>>> + *   -ENOTSUP: valid but unsupported rule specification (e.g. partial
>>> + *   bit-masks are unsupported).
>>> + *
>>> + *   -EEXIST: collision with an existing rule.
>>> + *
>>> + *   -ENOMEM: not enough resources.
>>> + *
>>> + *   -EBUSY: action cannot be performed due to busy device resources, may
>>> + *   succeed if the affected queues or even the entire port are in a stopped
>>> + *   state (see rte_eth_dev_rx_queue_stop() and rte_eth_dev_stop()).
>>> + */
>>> +int
>>> +rte_flow_validate(uint8_t port_id,
>>> +		  const struct rte_flow_attr *attr,
>>> +		  const struct rte_flow_item pattern[],
>>> +		  const struct rte_flow_action actions[],
>>> +		  struct rte_flow_error *error);
>>
>> Why not just use rte_flow_create() and get an error? Is it less
>> disruptive to do a validate and find the rule cannot be created, than
>> using a create directly?
> 
> The rationale can be found in the original RFC, which I'll convert to actual
> documentation in v2. In short:
> 
> - Calling rte_flow_validate() before rte_flow_create() is useless since
>   rte_flow_create() also performs validation.
> 
> - We cannot possibly express a full static set of allowed flow rules, even
>   if we could, it usually depends on the current hardware configuration
>   therefore would not be static.
> 
> - rte_flow_validate() is thus provided as a replacement for capability
>   flags. It can be used to determine during initialization if the underlying
>   device can support the typical flow rules an application might want to
>   provide later and do something useful with that information (e.g. always
>   use software fallback due to HW limitations).
> 
> - rte_flow_validate() being a subset of rte_flow_create(), it is essentially
>   free to expose.

make sense now, thanks.
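As a sketch of the probe-at-initialization pattern described above (the validate function is stubbed out here; a real PMD would inspect the actual rule specification):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stub standing in for rte_flow_validate(); it only illustrates the
 * probe pattern, pretending the device supports rule kind 0 only. */
static int
flow_validate_stub(int rule_kind)
{
	return rule_kind == 0 ? 0 : -ENOTSUP;
}

/* Probe once at initialization with a typical rule the application
 * intends to use later; if it cannot be offloaded, commit to the
 * software fallback for the lifetime of the port. */
static bool
probe_hw_offload(void)
{
	return flow_validate_stub(0) == 0;
}
```

Since validation is also performed by create(), the probe costs nothing extra on the fast path; it simply replaces static capability flags.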

> 
>>> +
>>> +/**
>>> + * Create a flow rule on a given port.
>>> + *
>>> + * @param port_id
>>> + *   Port identifier of Ethernet device.
>>> + * @param[in] attr
>>> + *   Flow rule attributes.
>>> + * @param[in] pattern
>>> + *   Pattern specification (list terminated by the END pattern item).
>>> + * @param[in] actions
>>> + *   Associated actions (list terminated by the END action).
>>> + * @param[out] error
>>> + *   Perform verbose error reporting if not NULL.
>>> + *
>>> + * @return
>>> + *   A valid handle in case of success, NULL otherwise and rte_errno is set
>>> + *   to the positive version of one of the error codes defined for
>>> + *   rte_flow_validate().
>>> + */
>>> +struct rte_flow *
>>> +rte_flow_create(uint8_t port_id,
>>> +		const struct rte_flow_attr *attr,
>>> +		const struct rte_flow_item pattern[],
>>> +		const struct rte_flow_action actions[],
>>> +		struct rte_flow_error *error);
>>
>> General question - are these functions threadsafe? In the OVS example
>> you could have several threads wanting to create flow rules at the same
>> time for same or different ports.
> 
> No they aren't, applications have to perform their own locking. The RFC (to
> be converted to actual documentation in v2) says that:
> 
> - API operations are synchronous and blocking (``EAGAIN`` cannot be
>   returned).
> 
> - There is no provision for reentrancy/multi-thread safety, although nothing
>   should prevent different devices from being configured at the same
>   time. PMDs may protect their control path functions accordingly.

other comment above wrt locking.
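Given that the API leaves locking to applications, a multi-threaded user such as OVS would wrap the control-path calls in its own lock, for example per port. A minimal sketch (the create function is a stub, since the real call is not thread-safe per the API contract):

```c
#include <assert.h>
#include <pthread.h>

/* Placeholder for the opaque flow rule handle. */
struct flow_handle { int dummy; };

static struct flow_handle stub_handle;

/* Stub standing in for rte_flow_create(). */
static struct flow_handle *
flow_create_stub(void)
{
	return &stub_handle;
}

/* Per-port lock owned by the application; PMDs are not required to
 * serialize control-path calls themselves. */
static pthread_mutex_t flow_lock = PTHREAD_MUTEX_INITIALIZER;

static struct flow_handle *
flow_create_locked(void)
{
	struct flow_handle *flow;

	pthread_mutex_lock(&flow_lock);
	flow = flow_create_stub();
	pthread_mutex_unlock(&flow_lock);
	return flow;
}
```

The same wrapper pattern would apply to destroy(), flush() and query() on the same port.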

> 
>>> +
>>> +/**
>>> + * Destroy a flow rule on a given port.
>>> + *
>>> + * Failure to destroy a flow rule handle may occur when other flow rules
>>> + * depend on it, and destroying it would result in an inconsistent state.
>>> + *
>>> + * This function is only guaranteed to succeed if handles are destroyed in
>>> + * reverse order of their creation.
>>
>> How can the application find this information out on error?
> 
> Without maintaining a list, they cannot. The specified case is the only
> possible guarantee. That does not mean PMDs should not do their best to
> destroy flow rules, only that ordering must remain consistent in case of
> inability to destroy one.
> 
> What do you suggest?

I think if the app cannot remove a specific rule it may want to remove
all rules and deal with flows in software for a time. So once the app
knows it fails that should be enough.
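That fallback strategy can be sketched as follows (both flow calls are stubs; the point is only the error-handling shape, not the real API):

```c
#include <assert.h>
#include <stdbool.h>

static int destroy_rc;	/* return code of the stubbed destroy */
static bool flushed;	/* whether the full flush was triggered */

/* Stubs standing in for rte_flow_destroy() and rte_flow_flush(). */
static int flow_destroy_stub(void) { return destroy_rc; }
static void flow_flush_stub(void) { flushed = true; }

/* If one rule cannot be destroyed, drop all hardware rules and handle
 * the affected traffic in software instead of leaving the port in an
 * inconsistent state. */
static void
remove_rule_or_fallback(void)
{
	if (flow_destroy_stub() != 0) {
		flow_flush_stub();
		/* ...switch the datapath to the software flow table... */
	}
}
```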

> 
>>> + *
>>> + * @param port_id
>>> + *   Port identifier of Ethernet device.
>>> + * @param flow
>>> + *   Flow rule handle to destroy.
>>> + * @param[out] error
>>> + *   Perform verbose error reporting if not NULL.
>>> + *
>>> + * @return
>>> + *   0 on success, a negative errno value otherwise and rte_errno is set.
>>> + */
>>> +int
>>> +rte_flow_destroy(uint8_t port_id,
>>> +		 struct rte_flow *flow,
>>> +		 struct rte_flow_error *error);
>>> +
>>> +/**
>>> + * Destroy all flow rules associated with a port.
>>> + *
>>> + * In the unlikely event of failure, handles are still considered destroyed
>>> + * and no longer valid but the port must be assumed to be in an inconsistent
>>> + * state.
>>> + *
>>> + * @param port_id
>>> + *   Port identifier of Ethernet device.
>>> + * @param[out] error
>>> + *   Perform verbose error reporting if not NULL.
>>> + *
>>> + * @return
>>> + *   0 on success, a negative errno value otherwise and rte_errno is set.
>>> + */
>>> +int
>>> +rte_flow_flush(uint8_t port_id,
>>> +	       struct rte_flow_error *error);
>>
>> rte_flow_destroy_all() would be more descriptive (but breaks your style)
> 
> There are enough underscores as it is. I like flush, if enough people
> complain we'll change it but it has to occur before the first public
> release.
> 
>>> +
>>> +/**
>>> + * Query an existing flow rule.
>>> + *
>>> + * This function allows retrieving flow-specific data such as counters.
>>> + * Data is gathered by special actions which must be present in the flow
>>> + * rule definition.
>>
>> re last sentence, it would be good if you can put a link to
>> RTE_FLOW_ACTION_TYPE_COUNT
> 
> Will do, I did not know how until very recently.
> 
>>> + *
>>> + * @param port_id
>>> + *   Port identifier of Ethernet device.
>>> + * @param flow
>>> + *   Flow rule handle to query.
>>> + * @param action
>>> + *   Action type to query.
>>> + * @param[in, out] data
>>> + *   Pointer to storage for the associated query data type.
>>
>> can this be anything other than rte_flow_query_count?
> 
> Likely in the future. I've only defined this one as a counterpart for
> existing API functionality and because we wanted to expose it in mlx5.
> 
>>> + * @param[out] error
>>> + *   Perform verbose error reporting if not NULL.
>>> + *
>>> + * @return
>>> + *   0 on success, a negative errno value otherwise and rte_errno is set.
>>> + */
>>> +int
>>> +rte_flow_query(uint8_t port_id,
>>> +	       struct rte_flow *flow,
>>> +	       enum rte_flow_action_type action,
>>> +	       void *data,
>>> +	       struct rte_flow_error *error);
>>> +
>>> +#ifdef __cplusplus
>>> +}
>>> +#endif
>>
>> I don't see a way to dump all the rules for a port out. I think this is
>> neccessary for degbugging. You could have a look through dpif.h in OVS
>> and see how dpif_flow_dump_next() is used, it might be a good reference.
> 
> DPDK does not maintain flow rules and, depending on hardware capabilities
> and level of compliance, PMDs do not necessarily do it either, particularly
> since it requires space and application probably have a better method to
> store these pointers for their own needs.

understood

> 
> What you see here is only a PMD interface. Depending on applications needs,
> generic helper functions built on top of these may be added to manage flow
> rules in the future.

I'm thinking of the case where something goes wrong and I want to get a
dump of all the flow rules from hardware, not query the rules I think I
have. I don't see a way to do it or something to build a helper on top of?

> 
>> Also, it would be nice if there were an api that would allow a test
>> packet to be injected and traced for debugging - although I'm not
>> exactly sure how well it could be traced. For reference:
>> http://developers.redhat.com/blog/2016/10/12/tracing-packets-inside-open-vswitch/
> 
> Thanks for the link, I'm not sure how you'd do this either. Remember, as
> generic as it looks, this interface is only meant to configure the
> underlying device. You need to see it as one big offload, everything else
> is left to applications.
>
  
Chandran, Sugesh Dec. 6, 2016, 6:11 p.m. UTC | #6
Hi Adrien,
Thanks for sending out the patches,

Please find few comments below,


Regards
_Sugesh


> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Kevin Traynor
> Sent: Friday, December 2, 2016 9:07 PM
> To: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Cc: dev@dpdk.org; Thomas Monjalon <thomas.monjalon@6wind.com>; De
> Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; Olivier Matz
> <olivier.matz@6wind.com>; sugesh.chandran@intel.comn
> Subject: Re: [dpdk-dev] [PATCH 01/22] ethdev: introduce generic flow API
> 
>>>>>> [snip]
> >>> + *
> >>> + * Attaches a 32 bit value to packets.
> >>> + *
> >>> + * This value is arbitrary and application-defined. For
> >>> +compatibility with
> >>> + * FDIR it is returned in the hash.fdir.hi mbuf field.
> >>> +PKT_RX_FDIR_ID is
> >>> + * also set in ol_flags.
> >>> + */
> >>> +struct rte_flow_action_mark {
> >>> +	uint32_t id; /**< 32 bit value to return with packets. */ };
> >>
> >> One use case I thought we would be able to do for OVS is
> >> classification in hardware and the unique flow id is sent with the packet to
> software.
> >> But in OVS the ufid is 128 bits, so it means we can't and there is
> >> still the miniflow extract overhead. I'm not sure if there is a
> >> practical way around this.
> >>
> >> Sugesh (cc'd) has looked at this before and may be able to comment or
> >> correct me.
> >
> > Yes, we settled on 32 bit because currently no known hardware
> > implementation supports more than this. If that changes, another
> > action with a larger type shall be provided (no ABI breakage).
> >
> > Also since even 64 bit would not be enough for the use case you
> > mention, there is no choice but use this as an indirect value (such as
> > an array or hash table index/value).
> 
> ok, cool. I think Sugesh has other ideas anyway!
[Sugesh] It should be fine with 32 bits. We can manage it in OVS accordingly.
> 
> >
> > [...]
> >>> +/**
> >>> + * RTE_FLOW_ACTION_TYPE_RSS
> >>> + *
> >>> +
> >>> + *
> >>> + * Terminating by default.
> >>> + */
> >>> +struct rte_flow_action_vf {
> >>> +	uint32_t original:1; /**< Use original VF ID if possible. */
> >>> +	uint32_t reserved:31; /**< Reserved, must be zero. */
> >>> +	uint32_t id; /**< VF ID to redirect packets to. */ };
> > [...]
> >>> +/**
> >>> + * Check whether a flow rule can be created on a given port.
> >>> + *
> >>> + * While this function has no effect on the target device, the flow
> >>> +rule is
> >>> + * validated against its current configuration state and the
> >>> +returned value
> >>> + * should be considered valid by the caller for that state only.
> >>> + *
> >>> + * The returned value is guaranteed to remain valid only as long as
> >>> +no
> >>> + * successful calls to rte_flow_create() or rte_flow_destroy() are
> >>> +made in
> >>> + * the meantime and no device parameter affecting flow rules in any
> >>> +way are
> >>> + * modified, due to possible collisions or resource limitations
> >>> +(although in
> >>> + * such cases EINVAL should not be returned).
> >>> + *
> >>> + * @param port_id
> >>> + *   Port identifier of Ethernet device.
> >>> + * @param[in] attr
> >>> + *   Flow rule attributes.
> >>> + * @param[in] pattern
> >>> + *   Pattern specification (list terminated by the END pattern item).
> >>> + * @param[in] actions
> >>> + *   Associated actions (list terminated by the END action).
> >>> + * @param[out] error
> >>> + *   Perform verbose error reporting if not NULL.
> >>> + *
> >>> + * @return
> >>> + *   0 if flow rule is valid and can be created. A negative errno value
> >>> + *   otherwise (rte_errno is also set), the following errors are defined:
> >>> + *
> >>> + *   -ENOSYS: underlying device does not support this functionality.
> >>> + *
> >>> + *   -EINVAL: unknown or invalid rule specification.
> >>> + *
> >>> + *   -ENOTSUP: valid but unsupported rule specification (e.g. partial
> >>> + *   bit-masks are unsupported).
> >>> + *
> >>> + *   -EEXIST: collision with an existing rule.
> >>> + *
> >>> + *   -ENOMEM: not enough resources.
> >>> + *
> >>> + *   -EBUSY: action cannot be performed due to busy device resources,
> may
> >>> + *   succeed if the affected queues or even the entire port are in a
> stopped
> >>> + *   state (see rte_eth_dev_rx_queue_stop() and
> rte_eth_dev_stop()).
> >>> + */
> >>> +int
> >>> +rte_flow_validate(uint8_t port_id,
> >>> +		  const struct rte_flow_attr *attr,
> >>> +		  const struct rte_flow_item pattern[],
> >>> +		  const struct rte_flow_action actions[],
> >>> +		  struct rte_flow_error *error);
> >>
> >> Why not just use rte_flow_create() and get an error? Is it less
> >> disruptive to do a validate and find the rule cannot be created, than
> >> using a create directly?
> >
> > The rationale can be found in the original RFC, which I'll convert to
> > actual documentation in v2. In short:
> >
> > - Calling rte_flow_validate() before rte_flow_create() is useless since
> >   rte_flow_create() also performs validation.
> >
> > - We cannot possibly express a full static set of allowed flow rules, even
> >   if we could, it usually depends on the current hardware configuration
> >   therefore would not be static.
> >
> > - rte_flow_validate() is thus provided as a replacement for capability
> >   flags. It can be used to determine during initialization if the underlying
> >   device can support the typical flow rules an application might want to
> >   provide later and do something useful with that information (e.g. always
> >   use software fallback due to HW limitations).
> >
> > - rte_flow_validate() being a subset of rte_flow_create(), it is essentially
> >   free to expose.
> 
> make sense now, thanks.
[Sugesh]: We had this discussion earlier at the design stage about the time taken for programming the hardware,
and how to make it deterministic. How about having a timeout parameter as well for the rte_flow_* calls?
If the hardware flow insert times out, error out rather than waiting indefinitely, so that the application has
some control over the time to program the flow. It could be another set of APIs, something like rte_flow_create_timeout().

Are you going to provide any control over the initialization of the NIC to define the capability matrices?
For example, to operate in an L3 router mode, software may want to initialize the NIC port to only consider the L2 and L3 fields.
I assume the initialization is done based on the first rules that are programmed into the NIC?
> 
> >
> >>> +
> >>> +/**
> >>> + * Create a flow rule on a given port.
> >>> + *
> >>> + * @param port_id
> >>> + *   Port identifier of Ethernet device.
> >>> + * @param[in] attr
> >>> + *   Flow rule attributes.
> >>> + * @param[in] pattern
> >>> + *   Pattern specification (list terminated by the END pattern item).
> >>> + * @param[in] actions
> >>> + *   Associated actions (list terminated by the END action).
> >>> + * @param[out] error
> >>> + *   Perform verbose error reporting if not NULL.
> >>> + *
> >>> + * @return
> >>> + *   A valid handle in case of success, NULL otherwise and rte_errno is
> set
> >>> + *   to the positive version of one of the error codes defined for
> >>> + *   rte_flow_validate().
> >>> + */
> >>> +struct rte_flow *
> >>> +rte_flow_create(uint8_t port_id,
> >>> +		const struct rte_flow_attr *attr,
> >>> +		const struct rte_flow_item pattern[],
> >>> +		const struct rte_flow_action actions[],
> >>> +		struct rte_flow_error *error);
> >>
> >> General question - are these functions threadsafe? In the OVS example
> >> you could have several threads wanting to create flow rules at the
> >> same time for same or different ports.
> >
> > No they aren't, applications have to perform their own locking. The
> > RFC (to be converted to actual documentation in v2) says that:
> >
> > - API operations are synchronous and blocking (``EAGAIN`` cannot be
> >   returned).
> >
> > - There is no provision for reentrancy/multi-thread safety, although
> nothing
> >   should prevent different devices from being configured at the same
> >   time. PMDs may protect their control path functions accordingly.
> 
> other comment above wrt locking.
> 
> >
  
Xing, Beilei Dec. 8, 2016, 9 a.m. UTC | #7
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Adrien Mazarguil
> Sent: Thursday, November 17, 2016 12:23 AM
> To: dev@dpdk.org
> Cc: Thomas Monjalon <thomas.monjalon@6wind.com>; De Lara Guarch,
> Pablo <pablo.de.lara.guarch@intel.com>; Olivier Matz
> <olivier.matz@6wind.com>
> Subject: [dpdk-dev] [PATCH 01/22] ethdev: introduce generic flow API
> 
> This new API supersedes all the legacy filter types described in rte_eth_ctrl.h.
> It is slightly higher level and as a result relies more on PMDs to process and
> validate flow rules.
> 
> Benefits:
> 
> - A unified API is easier to program for, applications do not have to be
>   written for a specific filter type which may or may not be supported by
>   the underlying device.
> 
> - The behavior of a flow rule is the same regardless of the underlying
>   device, applications do not need to be aware of hardware quirks.
> 
> - Extensible by design, API/ABI breakage should rarely occur if at all.
> 
> - Documentation is self-standing, no need to look up elsewhere.
> 
> Existing filter types will be deprecated and removed in the near future.
> 
> Signed-off-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> ---
>  MAINTAINERS                            |   4 +
>  lib/librte_ether/Makefile              |   3 +
>  lib/librte_ether/rte_eth_ctrl.h        |   1 +
>  lib/librte_ether/rte_ether_version.map |  10 +
>  lib/librte_ether/rte_flow.c            | 159 +++++
>  lib/librte_ether/rte_flow.h            | 947 ++++++++++++++++++++++++++++
>  lib/librte_ether/rte_flow_driver.h     | 177 ++++++
>  7 files changed, 1301 insertions(+)
> 
> +/**
> + * RTE_FLOW_ITEM_TYPE_ETH
> + *
> + * Matches an Ethernet header.
> + */
> +struct rte_flow_item_eth {
> +	struct ether_addr dst; /**< Destination MAC. */
> +	struct ether_addr src; /**< Source MAC. */
> +	unsigned int type; /**< EtherType. */
Hi Adrien,

The EtherType field in the Ethernet header is 2 bytes, so I think "uint16_t type" is more appropriate here. What do you think?

Thanks,
Beilei Xing
> +};
> +
  
Adrien Mazarguil Dec. 8, 2016, 2:50 p.m. UTC | #8
Hi Beilei,

On Thu, Dec 08, 2016 at 09:00:05AM +0000, Xing, Beilei wrote:
[...]
> > +/**
> > + * RTE_FLOW_ITEM_TYPE_ETH
> > + *
> > + * Matches an Ethernet header.
> > + */
> > +struct rte_flow_item_eth {
> > +	struct ether_addr dst; /**< Destination MAC. */
> > +	struct ether_addr src; /**< Source MAC. */
> > +	unsigned int type; /**< EtherType. */
> Hi Adrien,
> 
> ETHERTYPE in ether header is 2 bytes, so I think "uint16_t type" is more appropriate here, what do you think?

You're right, thanks for catching this. I'll update it in v2 (soon).
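For reference, a sketch of the corrected item layout, assuming the EtherType is carried in network byte order like other protocol fields in pattern items (`ether_addr` is reduced to a bare byte array here for self-containment, and the struct name is illustrative):

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

struct ether_addr { uint8_t addr_bytes[6]; };

struct flow_item_eth {
	struct ether_addr dst; /* Destination MAC. */
	struct ether_addr src; /* Source MAC. */
	uint16_t type;         /* EtherType, network byte order. */
};

/* Fill a spec to match IPv4 traffic by EtherType. */
static void
match_ipv4_ethertype(struct flow_item_eth *spec)
{
	spec->type = htons(0x0800); /* ETHERTYPE_IP */
}
```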
  
Adrien Mazarguil Dec. 8, 2016, 3:09 p.m. UTC | #9
Hi Sugesh,

On Tue, Dec 06, 2016 at 06:11:38PM +0000, Chandran, Sugesh wrote:
[...]
> > >>> +int
> > >>> +rte_flow_validate(uint8_t port_id,
> > >>> +		  const struct rte_flow_attr *attr,
> > >>> +		  const struct rte_flow_item pattern[],
> > >>> +		  const struct rte_flow_action actions[],
> > >>> +		  struct rte_flow_error *error);
> > >>
> > >> Why not just use rte_flow_create() and get an error? Is it less
> > >> disruptive to do a validate and find the rule cannot be created, than
> > >> using a create directly?
> > >
> > > The rationale can be found in the original RFC, which I'll convert to
> > > actual documentation in v2. In short:
> > >
> > > - Calling rte_flow_validate() before rte_flow_create() is useless since
> > >   rte_flow_create() also performs validation.
> > >
> > > - We cannot possibly express a full static set of allowed flow rules, even
> > >   if we could, it usually depends on the current hardware configuration
> > >   therefore would not be static.
> > >
> > > - rte_flow_validate() is thus provided as a replacement for capability
> > >   flags. It can be used to determine during initialization if the underlying
> > >   device can support the typical flow rules an application might want to
> > >   provide later and do something useful with that information (e.g. always
> > >   use software fallback due to HW limitations).
> > >
> > > - rte_flow_validate() being a subset of rte_flow_create(), it is essentially
> > >   free to expose.
> > 
> > make sense now, thanks.
> [Sugesh]: We had this discussion earlier at the design stage about the time taken for programming the hardware,
> and how to make it deterministic. How about having a timeout parameter as well for the rte_flow_* calls?
> If the hardware flow insert times out, error out rather than waiting indefinitely, so that the application has
> some control over the time to program the flow. It could be another set of APIs, something like rte_flow_create_timeout().

Yes as discussed the existing API does not provide any timing constraints to
PMDs, validate() and create() may take forever to complete, although PMDs
are strongly encouraged to take as little time as possible.

Like you suggested, this could be done through distinct API calls. The
validate() function would also have its _timeout() counterpart since the set
of possible rules could be restricted in that mode.

> Are you going to provide any control over the initialization of the NIC to define the capability matrices?
> For example, to operate in an L3 router mode, software may want to initialize the NIC port to only consider the L2 and L3 fields.
> I assume the initialization is done based on the first rules that are programmed into the NIC?

Precisely, PMDs are supposed to determine the most appropriate device mode
to use in order to handle the requested rules. They may even switch to
another mode if necessary assuming this does not break existing constraints.

I think we've discussed an atomic (commit-based) mode of operation through
separate functions as well, where the application would attempt to create a
bunch of rules at once, possibly making it easier for PMDs to determine the
most appropriate mode of operation for the device.

All of these may be added later according to users feedback once the basic
API has settled.
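The commit-based mode mentioned above could take a shape like the following hypothetical sketch, where the application submits a batch of rules at once so a PMD can choose the best device mode before programming anything (none of these names exist in the API as proposed):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a flow rule description. */
struct rule_desc { int kind; };

/* Hypothetical bulk create: all-or-nothing semantics so the PMD can
 * reject the whole batch before touching the device. */
static int
flow_create_bulk_stub(const struct rule_desc *rules, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (rules[i].kind < 0)
			return -1; /* reject the entire batch */
	/* ...pick a device mode and program all rules here... */
	return 0;
}
```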
  
Adrien Mazarguil Dec. 8, 2016, 5:07 p.m. UTC | #10
On Fri, Dec 02, 2016 at 09:06:42PM +0000, Kevin Traynor wrote:
> On 12/01/2016 08:36 AM, Adrien Mazarguil wrote:
> > Hi Kevin,
> > 
> > On Wed, Nov 30, 2016 at 05:47:17PM +0000, Kevin Traynor wrote:
> >> Hi Adrien,
> >>
> >> On 11/16/2016 04:23 PM, Adrien Mazarguil wrote:
> >>> This new API supersedes all the legacy filter types described in
> >>> rte_eth_ctrl.h. It is slightly higher level and as a result relies more on
> >>> PMDs to process and validate flow rules.
> >>>
> >>> Benefits:
> >>>
> >>> - A unified API is easier to program for, applications do not have to be
> >>>   written for a specific filter type which may or may not be supported by
> >>>   the underlying device.
> >>>
> >>> - The behavior of a flow rule is the same regardless of the underlying
> >>>   device, applications do not need to be aware of hardware quirks.
> >>>
> >>> - Extensible by design, API/ABI breakage should rarely occur if at all.
> >>>
> >>> - Documentation is self-standing, no need to look up elsewhere.
> >>>
> >>> Existing filter types will be deprecated and removed in the near future.
> >>
> >> I'd suggest to add a deprecation notice to deprecation.rst, ideally with
> >> a target release.
> > 
> > Will do, not a sure about the target release though. It seems a bit early
> > since no PMD really supports this API yet.
> > 
> > [...]
> >>> diff --git a/lib/librte_ether/rte_flow.c b/lib/librte_ether/rte_flow.c
> >>> new file mode 100644
> >>> index 0000000..064963d
> >>> --- /dev/null
> >>> +++ b/lib/librte_ether/rte_flow.c
> >>> @@ -0,0 +1,159 @@
> >>> +/*-
> >>> + *   BSD LICENSE
> >>> + *
> >>> + *   Copyright 2016 6WIND S.A.
> >>> + *   Copyright 2016 Mellanox.
> >>
> >> There's Mellanox copyright but you are the only signed-off-by - is that
> >> right?
> > 
> > Yes, I'm the primary maintainer for Mellanox PMDs and this API was designed
> > on their behalf to expose several features from mlx4/mlx5 as the existing
> > filter types had too many limitations.
> > 
> > [...]
> >>> +/* Get generic flow operations structure from a port. */
> >>> +const struct rte_flow_ops *
> >>> +rte_flow_ops_get(uint8_t port_id, struct rte_flow_error *error)
> >>> +{
> >>> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> >>> +	const struct rte_flow_ops *ops;
> >>> +	int code;
> >>> +
> >>> +	if (unlikely(!rte_eth_dev_is_valid_port(port_id)))
> >>> +		code = ENODEV;
> >>> +	else if (unlikely(!dev->dev_ops->filter_ctrl ||
> >>> +			  dev->dev_ops->filter_ctrl(dev,
> >>> +						    RTE_ETH_FILTER_GENERIC,
> >>> +						    RTE_ETH_FILTER_GET,
> >>> +						    &ops) ||
> >>> +			  !ops))
> >>> +		code = ENOTSUP;
> >>> +	else
> >>> +		return ops;
> >>> +	rte_flow_error_set(error, code, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> >>> +			   NULL, rte_strerror(code));
> >>> +	return NULL;
> >>> +}
> >>> +
> >>
> >> Is it expected that the application or pmd will provide locking between
> >> these functions if required? I think it's going to have to be the app.
> > 
> > Locking is indeed expected to be performed by applications. This API only
> > documents places where locking would make sense if necessary and expected
> > behavior.
> > 
> > Like all control path APIs, this one assumes a single control thread.
> > Applications must take the necessary precautions.
> 
> If you look at OVS now it's quite possible that you have 2 rx queues
> serviced by different threads, that would also install the flow rules in
> the software flow caches - possibly that could extend to adding hardware
> flows. There could also be another thread that is querying for stats. So
> anything that can be done to minimise the locking would be helpful -
> maybe query() could be atomic and not require any locking?

I think we need basic functions with as few constraints as possible on PMDs
first, this API being somewhat complex to implement on their side. That
covers the common use case where applications have a single control thread
or otherwise perform locking on their own.

Once the basics are there for most PMDs, we may add new functions, items,
properties and actions that provide additional constraints (timing,
multi-threading and so on), which remain to be defined according to
feedback. It is designed to be extended without causing ABI breakage.

As for query(), let's see how PMDs handle it first. A race between query()
and create() on a given device is almost unavoidable without locking, same
for queries that reset counters in a given flow rule. Basic parallel queries
should not cause any harm otherwise, although this cannot be guaranteed yet.

> > [...]
> >>> +/**
> >>> + * Flow rule attributes.
> >>> + *
> >>> + * Priorities are set on two levels: per group and per rule within groups.
> >>> + *
> >>> + * Lower values denote higher priority, the highest priority for both levels
> >>> + * is 0, so that a rule with priority 0 in group 8 is always matched after a
> >>> + * rule with priority 8 in group 0.
> >>> + *
> >>> + * Although optional, applications are encouraged to group similar rules as
> >>> + * much as possible to fully take advantage of hardware capabilities
> >>> + * (e.g. optimized matching) and work around limitations (e.g. a single
> >>> + * pattern type possibly allowed in a given group).
> >>> + *
> >>> + * Group and priority levels are arbitrary and up to the application, they
> >>> + * do not need to be contiguous nor start from 0, however the maximum number
> >>> + * varies between devices and may be affected by existing flow rules.
> >>> + *
> >>> + * If a packet is matched by several rules of a given group for a given
> >>> + * priority level, the outcome is undefined. It can take any path, may be
> >>> + * duplicated or even cause unrecoverable errors.
> >>
> >> I get what you are trying to do here wrt supporting multiple
> >> pmds/hardware implementations and it's a good idea to keep it flexible.
> >>
> >> Given that the outcome is undefined, it would be nice that the
> >> application has a way of finding the specific effects for verification
> >> and debugging.
> > 
> > Right, however it was deemed a bit difficult to manage in many cases hence
> > the vagueness.
> > 
> > For example, suppose two rules with the same group and priority, one
> > matching any IPv4 header, the other one any UDP header:
> > 
> > - TCPv4 packets => rule #1.
> > - UDPv6 packets => rule #2.
> > - UDPv4 packets => both?
> > 
> > That last one is perhaps invalid, checking that some unspecified protocol
> > combination does not overlap is expensive and may miss corner cases, even
> > assuming this is not an issue, what if the application guarantees that no
> > UDPv4 packets can ever hit that rule?
> 
> that's fine - I don't expect the software to be able to know what the
> hardware will do with those rules. It's more about trying to get a dump
> from the hardware if something goes wrong. Anyway covered in comment later.
> 
> > 
> > Suggestions are welcome though, perhaps we can refine the description
> > 
> >>> + *
> >>> + * Note that support for more than a single group and priority level is not
> >>> + * guaranteed.
> >>> + *
> >>> + * Flow rules can apply to inbound and/or outbound traffic (ingress/egress).
> >>> + *
> >>> + * Several pattern items and actions are valid and can be used in both
> >>> + * directions. Those valid for only one direction are described as such.
> >>> + *
> >>> + * Specifying both directions at once is not recommended but may be valid in
> >>> + * some cases, such as incrementing the same counter twice.
> >>> + *
> >>> + * Not specifying any direction is currently an error.
> >>> + */
> >>> +struct rte_flow_attr {
> >>> +	uint32_t group; /**< Priority group. */
> >>> +	uint32_t priority; /**< Priority level within group. */
> >>> +	uint32_t ingress:1; /**< Rule applies to ingress traffic. */
> >>> +	uint32_t egress:1; /**< Rule applies to egress traffic. */
> >>> +	uint32_t reserved:30; /**< Reserved, must be zero. */
> >>> +};
> > [...]
> >>> +/**
> >>> + * RTE_FLOW_ITEM_TYPE_VF
> >>> + *
> >>> + * Matches packets addressed to a virtual function ID of the device.
> >>> + *
> >>> + * If the underlying device function differs from the one that would
> >>> + * normally receive the matched traffic, specifying this item prevents it
> >>> + * from reaching that device unless the flow rule contains a VF
> >>> + * action. Packets are not duplicated between device instances by default.
> >>> + *
> >>> + * - Likely to return an error or never match any traffic if this causes a
> >>> + *   VF device to match traffic addressed to a different VF.
> >>> + * - Can be specified multiple times to match traffic addressed to several
> >>> + *   specific VFs.
> >>> + * - Can be combined with a PF item to match both PF and VF traffic.
> >>> + *
> >>> + * A zeroed mask can be used to match any VF.
> >>
> >> can you refer explicitly to id
> > 
> > If you mean "VF" to "VF ID" then yes, will do it for v2.
> > 
> >>> + */
> >>> +struct rte_flow_item_vf {
> >>> +	uint32_t id; /**< Destination VF ID. */
> >>> +};
> > [...]
> >>> +/**
> >>> + * Matching pattern item definition.
> >>> + *
> >>> + * A pattern is formed by stacking items starting from the lowest protocol
> >>> + * layer to match. This stacking restriction does not apply to meta items
> >>> + * which can be placed anywhere in the stack with no effect on the meaning
> >>> + * of the resulting pattern.
> >>> + *
> >>> + * A stack is terminated by a END item.
> >>> + *
> >>> + * The spec field should be a valid pointer to a structure of the related
> >>> + * item type. It may be set to NULL in many cases to use default values.
> >>> + *
> >>> + * Optionally, last can point to a structure of the same type to define an
> >>> + * inclusive range. This is mostly supported by integer and address fields,
> >>> + * may cause errors otherwise. Fields that do not support ranges must be set
> >>> + * to the same value as their spec counterparts.
> >>> + *
> >>> + * By default all fields present in spec are considered relevant.* This
> >>
> >> typo "*"
> > 
> > No, that's an asterisk for a footnote below. Perhaps it is a bit unusual,
> > would something like "[1]" look better?
> 
> oh, I thought it was the start of a comment line gone astray. Maybe "See
> note below", no big deal though.

OK, will change it anyway for clarity.

> >>> + * behavior can be altered by providing a mask structure of the same type
> >>> + * with applicable bits set to one. It can also be used to partially filter
> >>> + * out specific fields (e.g. as an alternate means to match ranges of IP
> >>> + * addresses).
> >>> + *
> >>> + * Note this is a simple bit-mask applied before interpreting the contents
> >>> + * of spec and last, which may yield unexpected results if not used
> >>> + * carefully. For example, if for an IPv4 address field, spec provides
> >>> + * 10.1.2.3, last provides 10.3.4.5 and mask provides 255.255.0.0, the
> >>> + * effective range is 10.1.0.0 to 10.3.255.255.
> >>> + *
> > 
> > See footnote below:
> > 
> >>> + * * The defaults for data-matching items such as IPv4 when mask is not
> >>> + *   specified actually depend on the underlying implementation since only
> >>> + *   recognized fields can be taken into account.
> >>> + */
> >>> +struct rte_flow_item {
> >>> +	enum rte_flow_item_type type; /**< Item type. */
> >>> +	const void *spec; /**< Pointer to item specification structure. */
> >>> +	const void *last; /**< Defines an inclusive range (spec to last). */
> >>> +	const void *mask; /**< Bit-mask applied to spec and last. */
> >>> +};
> >>> +
> >>> +/**
> >>> + * Action types.
> >>> + *
> >>> + * Each possible action is represented by a type. Some have associated
> >>> + * configuration structures. Several actions combined in a list can be
> >>> + * affected to a flow rule. That list is not ordered.
> >>> + *
> >>> + * They fall in three categories:
> >>> + *
> >>> + * - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
> >>> + *   processing matched packets by subsequent flow rules, unless overridden
> >>> + *   with PASSTHRU.
> >>> + *
> >>> + * - Non terminating actions (PASSTHRU, DUP) that leave matched packets up
> >>> + *   for additional processing by subsequent flow rules.
> >>> + *
> >>> + * - Other non terminating meta actions that do not affect the fate of
> >>> + *   packets (END, VOID, MARK, FLAG, COUNT).
> >>> + *
> >>> + * When several actions are combined in a flow rule, they should all have
> >>> + * different types (e.g. dropping a packet twice is not possible). The
> >>> + * defined behavior is for PMDs to only take into account the last action of
> >>> + * a given type found in the list. PMDs still perform error checking on the
> >>> + * entire list.
> >>
> >> why do you define that the pmd will interpret multiple same type rules
> >> in this way...would it not make more sense for the pmd to just return
> >> EINVAL for an invalid set of rules? It seems more transparent for the
> >> application.
> > 
> > Well, I had to define something as a default. The reason is that any number
> > of VOID actions may be specified and I did not want that to be a special case in
> > order to keep PMD parsers as simple as possible. I'll settle for EINVAL (or
> > some other error) if at least one PMD maintainer other than Nelio who
> > intends to implement this API is not convinced by this explanation, all
> > right?
> 
> From an API perspective I think it's cleaner to pass or fail with the
> input rather than change it. But yes, please take pmd maintainers input
> as to what is reasonable to check also.
> 
> > 
> > [...]
> >>> +/**
> >>> + * RTE_FLOW_ACTION_TYPE_MARK
> >>> + *
> >>> + * Attaches a 32 bit value to packets.
> >>> + *
> >>> + * This value is arbitrary and application-defined. For compatibility with
> >>> + * FDIR it is returned in the hash.fdir.hi mbuf field. PKT_RX_FDIR_ID is
> >>> + * also set in ol_flags.
> >>> + */
> >>> +struct rte_flow_action_mark {
> >>> +	uint32_t id; /**< 32 bit value to return with packets. */
> >>> +};
> >>
> >> One use case I thought we would be able to do for OVS is classification
> >> in hardware and the unique flow id is sent with the packet to software.
> >> But in OVS the ufid is 128 bits, so it means we can't and there is still
> >> the miniflow extract overhead. I'm not sure if there is a practical way
> >> around this.
> >>
> >> Sugesh (cc'd) has looked at this before and may be able to comment or
> >> correct me.
> > 
> > Yes, we settled on 32 bit because currently no known hardware implementation
> > supports more than this. If that changes, another action with a larger type
> > shall be provided (no ABI breakage).
> > 
> > Also since even 64 bit would not be enough for the use case you mention,
> > there is no choice but use this as an indirect value (such as an array or
> > hash table index/value).
> 
> ok, cool. I think Sugesh has other ideas anyway!
> 
> > 
> > [...]
> >>> +/**
> >>> + * RTE_FLOW_ACTION_TYPE_RSS
> >>> + *
> >>> + * Similar to QUEUE, except RSS is additionally performed on packets to
> >>> + * spread them among several queues according to the provided parameters.
> >>> + *
> >>> + * Note: RSS hash result is normally stored in the hash.rss mbuf field,
> >>> + * however it conflicts with the MARK action as they share the same
> >>> + * space. When both actions are specified, the RSS hash is discarded and
> >>> + * PKT_RX_RSS_HASH is not set in ol_flags. MARK has priority. The mbuf
> >>> + * structure should eventually evolve to store both.
> >>> + *
> >>> + * Terminating by default.
> >>> + */
> >>> +struct rte_flow_action_rss {
> >>> +	const struct rte_eth_rss_conf *rss_conf; /**< RSS parameters. */
> >>> +	uint16_t queues; /**< Number of entries in queue[]. */
> >>> +	uint16_t queue[]; /**< Queues indices to use. */
> >>
> >> I'd try and avoid queue and queues - someone will say "huh?" when
> >> reading code. s/queues/num ?
> > 
> > Agreed, will update for v2.
> > 
> >>> +};
> >>> +
> >>> +/**
> >>> + * RTE_FLOW_ACTION_TYPE_VF
> >>> + *
> >>> + * Redirects packets to a virtual function (VF) of the current device.
> >>> + *
> >>> + * Packets matched by a VF pattern item can be redirected to their original
> >>> + * VF ID instead of the specified one. This parameter may not be available
> >>> + * and is not guaranteed to work properly if the VF part is matched by a
> >>> + * prior flow rule or if packets are not addressed to a VF in the first
> >>> + * place.
> >>
> >> Not clear what you mean by "not guaranteed to work if...". Please return
> >> fail when this action is used if this is not going to work.
> > 
> > Again, this is a case where it is difficult for a PMD to determine if the
> > entire list of flow rules makes sense. Perhaps it does, perhaps whatever
> > goes through has already been filtered out of possible issues.
> > 
> > Here the documentation states the precautions an application should take to
> > guarantee it will work as intended. Perhaps it can be reworded (any
> > suggestion?), but a PMD can certainly not provide any strong guarantee.
> 
> I see your point. Maybe for easy check things the pmd would return fail,
> but for more complex I agree it's too difficult.
> 
> > 
> >>> + *
> >>> + * Terminating by default.
> >>> + */
> >>> +struct rte_flow_action_vf {
> >>> +	uint32_t original:1; /**< Use original VF ID if possible. */
> >>> +	uint32_t reserved:31; /**< Reserved, must be zero. */
> >>> +	uint32_t id; /**< VF ID to redirect packets to. */
> >>> +};
> > [...]
> >>> +/**
> >>> + * Check whether a flow rule can be created on a given port.
> >>> + *
> >>> + * While this function has no effect on the target device, the flow rule is
> >>> + * validated against its current configuration state and the returned value
> >>> + * should be considered valid by the caller for that state only.
> >>> + *
> >>> + * The returned value is guaranteed to remain valid only as long as no
> >>> + * successful calls to rte_flow_create() or rte_flow_destroy() are made in
> >>> + * the meantime and no device parameters affecting flow rules in any way are
> >>> + * modified, due to possible collisions or resource limitations (although in
> >>> + * such cases EINVAL should not be returned).
> >>> + *
> >>> + * @param port_id
> >>> + *   Port identifier of Ethernet device.
> >>> + * @param[in] attr
> >>> + *   Flow rule attributes.
> >>> + * @param[in] pattern
> >>> + *   Pattern specification (list terminated by the END pattern item).
> >>> + * @param[in] actions
> >>> + *   Associated actions (list terminated by the END action).
> >>> + * @param[out] error
> >>> + *   Perform verbose error reporting if not NULL.
> >>> + *
> >>> + * @return
> >>> + *   0 if flow rule is valid and can be created. A negative errno value
> >>> + *   otherwise (rte_errno is also set), the following errors are defined:
> >>> + *
> >>> + *   -ENOSYS: underlying device does not support this functionality.
> >>> + *
> >>> + *   -EINVAL: unknown or invalid rule specification.
> >>> + *
> >>> + *   -ENOTSUP: valid but unsupported rule specification (e.g. partial
> >>> + *   bit-masks are unsupported).
> >>> + *
> >>> + *   -EEXIST: collision with an existing rule.
> >>> + *
> >>> + *   -ENOMEM: not enough resources.
> >>> + *
> >>> + *   -EBUSY: action cannot be performed due to busy device resources, may
> >>> + *   succeed if the affected queues or even the entire port are in a stopped
> >>> + *   state (see rte_eth_dev_rx_queue_stop() and rte_eth_dev_stop()).
> >>> + */
> >>> +int
> >>> +rte_flow_validate(uint8_t port_id,
> >>> +		  const struct rte_flow_attr *attr,
> >>> +		  const struct rte_flow_item pattern[],
> >>> +		  const struct rte_flow_action actions[],
> >>> +		  struct rte_flow_error *error);
> >>
> >> Why not just use rte_flow_create() and get an error? Is it less
> >> disruptive to do a validate and find the rule cannot be created, than
> >> using a create directly?
> > 
> > The rationale can be found in the original RFC, which I'll convert to actual
> > documentation in v2. In short:
> > 
> > - Calling rte_flow_validate() before rte_flow_create() is useless since
> >   rte_flow_create() also performs validation.
> > 
> > - We cannot possibly express a full static set of allowed flow rules, even
> >   if we could, it usually depends on the current hardware configuration
> >   therefore would not be static.
> > 
> > - rte_flow_validate() is thus provided as a replacement for capability
> >   flags. It can be used to determine during initialization if the underlying
> >   device can support the typical flow rules an application might want to
> >   provide later and do something useful with that information (e.g. always
> >   use software fallback due to HW limitations).
> > 
> > - rte_flow_validate() being a subset of rte_flow_create(), it is essentially
> >   free to expose.
> 
> make sense now, thanks.
> 
> > 
> >>> +
> >>> +/**
> >>> + * Create a flow rule on a given port.
> >>> + *
> >>> + * @param port_id
> >>> + *   Port identifier of Ethernet device.
> >>> + * @param[in] attr
> >>> + *   Flow rule attributes.
> >>> + * @param[in] pattern
> >>> + *   Pattern specification (list terminated by the END pattern item).
> >>> + * @param[in] actions
> >>> + *   Associated actions (list terminated by the END action).
> >>> + * @param[out] error
> >>> + *   Perform verbose error reporting if not NULL.
> >>> + *
> >>> + * @return
> >>> + *   A valid handle in case of success, NULL otherwise and rte_errno is set
> >>> + *   to the positive version of one of the error codes defined for
> >>> + *   rte_flow_validate().
> >>> + */
> >>> +struct rte_flow *
> >>> +rte_flow_create(uint8_t port_id,
> >>> +		const struct rte_flow_attr *attr,
> >>> +		const struct rte_flow_item pattern[],
> >>> +		const struct rte_flow_action actions[],
> >>> +		struct rte_flow_error *error);
> >>
> >> General question - are these functions threadsafe? In the OVS example
> >> you could have several threads wanting to create flow rules at the same
> >> time for same or different ports.
> > 
> > No they aren't, applications have to perform their own locking. The RFC (to
> > be converted to actual documentation in v2) says that:
> > 
> > - API operations are synchronous and blocking (``EAGAIN`` cannot be
> >   returned).
> > 
> > - There is no provision for reentrancy/multi-thread safety, although nothing
> >   should prevent different devices from being configured at the same
> >   time. PMDs may protect their control path functions accordingly.
> 
> other comment above wrt locking.
> 
> > 
> >>> +
> >>> +/**
> >>> + * Destroy a flow rule on a given port.
> >>> + *
> >>> + * Failure to destroy a flow rule handle may occur when other flow rules
> >>> + * depend on it, and destroying it would result in an inconsistent state.
> >>> + *
> >>> + * This function is only guaranteed to succeed if handles are destroyed in
> >>> + * reverse order of their creation.
> >>
> >> How can the application find this information out on error?
> > 
> > Without maintaining a list, they cannot. The specified case is the only
> > possible guarantee. That does not mean PMDs should not do their best to
> > destroy flow rules, only that ordering must remain consistent in case of
> > inability to destroy one.
> > 
> > What do you suggest?
> 
> I think if the app cannot remove a specific rule it may want to remove
> all rules and deal with flows in software for a time. So once the app
> knows it fails that should be enough.

OK, then since destruction may return an error already, is it fine?
Applications may call rte_flow_flush() (not supposed to fail unless there is
a serious issue, abort() in that case) and switch to SW fallback.

> >>> + *
> >>> + * @param port_id
> >>> + *   Port identifier of Ethernet device.
> >>> + * @param flow
> >>> + *   Flow rule handle to destroy.
> >>> + * @param[out] error
> >>> + *   Perform verbose error reporting if not NULL.
> >>> + *
> >>> + * @return
> >>> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> >>> + */
> >>> +int
> >>> +rte_flow_destroy(uint8_t port_id,
> >>> +		 struct rte_flow *flow,
> >>> +		 struct rte_flow_error *error);
> >>> +
> >>> +/**
> >>> + * Destroy all flow rules associated with a port.
> >>> + *
> >>> + * In the unlikely event of failure, handles are still considered destroyed
> >>> + * and no longer valid but the port must be assumed to be in an inconsistent
> >>> + * state.
> >>> + *
> >>> + * @param port_id
> >>> + *   Port identifier of Ethernet device.
> >>> + * @param[out] error
> >>> + *   Perform verbose error reporting if not NULL.
> >>> + *
> >>> + * @return
> >>> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> >>> + */
> >>> +int
> >>> +rte_flow_flush(uint8_t port_id,
> >>> +	       struct rte_flow_error *error);
> >>
> >> rte_flow_destroy_all() would be more descriptive (but breaks your style)
> > 
> > There are enough underscores as it is. I like flush; if enough people
> > complain we'll change it, but it has to occur before the first public
> > release.
> > 
> >>> +
> >>> +/**
> >>> + * Query an existing flow rule.
> >>> + *
> >>> + * This function allows retrieving flow-specific data such as counters.
> >>> + * Data is gathered by special actions which must be present in the flow
> >>> + * rule definition.
> >>
> >> re last sentence, it would be good if you can put a link to
> >> RTE_FLOW_ACTION_TYPE_COUNT
> > 
> > Will do, I did not know how until very recently.
> > 
> >>> + *
> >>> + * @param port_id
> >>> + *   Port identifier of Ethernet device.
> >>> + * @param flow
> >>> + *   Flow rule handle to query.
> >>> + * @param action
> >>> + *   Action type to query.
> >>> + * @param[in, out] data
> >>> + *   Pointer to storage for the associated query data type.
> >>
> >> can this be anything other than rte_flow_query_count?
> > 
> > Likely in the future. I've only defined this one as a counterpart for
> > existing API functionality and because we wanted to expose it in mlx5.
> > 
> >>> + * @param[out] error
> >>> + *   Perform verbose error reporting if not NULL.
> >>> + *
> >>> + * @return
> >>> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> >>> + */
> >>> +int
> >>> +rte_flow_query(uint8_t port_id,
> >>> +	       struct rte_flow *flow,
> >>> +	       enum rte_flow_action_type action,
> >>> +	       void *data,
> >>> +	       struct rte_flow_error *error);
> >>> +
> >>> +#ifdef __cplusplus
> >>> +}
> >>> +#endif
> >>
> >> I don't see a way to dump all the rules for a port out. I think this is
> >> neccessary for degbugging. You could have a look through dpif.h in OVS
> >> and see how dpif_flow_dump_next() is used, it might be a good reference.
> > 
> > DPDK does not maintain flow rules and, depending on hardware capabilities
> > and level of compliance, PMDs do not necessarily do it either, particularly
> > since it requires space and applications probably have a better method to
> > store these pointers for their own needs.
> 
> understood
> 
> > 
> > What you see here is only a PMD interface. Depending on applications needs,
> > generic helper functions built on top of these may be added to manage flow
> > rules in the future.
> 
> I'm thinking of the case where something goes wrong and I want to get a
> dump of all the flow rules from hardware, not query the rules I think I
> have. I don't see a way to do it or something to build a helper on top of?

Generic helper functions would exist on top of this API and would likely
maintain a list of flow rules themselves. The dump in that case would be
entirely implemented in software. I think that recovering flow rules from HW
may be complicated in many cases (even without taking storage allocation and
rules conversion issues into account), therefore if there is really a need
for it, we could perhaps add a dump() function that PMDs are free to
implement later.
  
Chandran, Sugesh Dec. 9, 2016, 12:18 p.m. UTC | #11
Hi Adrien,
Thank you for your comments,
Please see the reply below.

Regards
_Sugesh


> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Thursday, December 8, 2016 3:09 PM
> To: Chandran, Sugesh <sugesh.chandran@intel.com>
> Cc: Kevin Traynor <ktraynor@redhat.com>; dev@dpdk.org; Thomas
> Monjalon <thomas.monjalon@6wind.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>; Olivier Matz <olivier.matz@6wind.com>;
> sugesh.chandran@intel.comn
> Subject: Re: [dpdk-dev] [PATCH 01/22] ethdev: introduce generic flow API
> 
> Hi Sugesh,
> 
> On Tue, Dec 06, 2016 at 06:11:38PM +0000, Chandran, Sugesh wrote:
> [...]
> > > >>> +int
> > > >>> +rte_flow_validate(uint8_t port_id,
> > > >>> +		  const struct rte_flow_attr *attr,
> > > >>> +		  const struct rte_flow_item pattern[],
> > > >>> +		  const struct rte_flow_action actions[],
> > > >>> +		  struct rte_flow_error *error);
> > > >>
> > > >> Why not just use rte_flow_create() and get an error? Is it less
> > > >> disruptive to do a validate and find the rule cannot be created,
> > > >> than using a create directly?
> > > >
> > > > The rationale can be found in the original RFC, which I'll convert
> > > > to actual documentation in v2. In short:
> > > >
> > > > - Calling rte_flow_validate() before rte_flow_create() is useless since
> > > >   rte_flow_create() also performs validation.
> > > >
> > > > - We cannot possibly express a full static set of allowed flow rules, even
> > > >   if we could, it usually depends on the current hardware configuration
> > > >   therefore would not be static.
> > > >
> > > > - rte_flow_validate() is thus provided as a replacement for capability
> > > >   flags. It can be used to determine during initialization if the underlying
> > > >   device can support the typical flow rules an application might want to
> > > >   provide later and do something useful with that information (e.g.
> always
> > > >   use software fallback due to HW limitations).
> > > >
> > > > - rte_flow_validate() being a subset of rte_flow_create(), it is
> essentially
> > > >   free to expose.
> > >
> > > make sense now, thanks.
> > [Sugesh]: We had this discussion earlier at the design stage, about
> > the time taken to program the hardware and how to make it
> > deterministic. How about having a timeout parameter as well for the
> > rte_flow_* calls? If the hardware flow insertion times out, error out
> > rather than waiting indefinitely, so that the application has some
> > control over the time taken to program the flow. It could be another
> > set of APIs, something like rte_flow_create_timeout().
> 
> Yes as discussed the existing API does not provide any timing constraints to
> PMDs, validate() and create() may take forever to complete, although PMDs
> are strongly encouraged to take as little time as possible.
> 
> Like you suggested, this could be done through distinct API calls. The
> validate() function would also have its _timeout() counterpart since the set
> of possible rules could be restricted in that mode.
[Sugesh] Thanks! Looking forward to seeing an API set with that implementation
in the future :). I feel it's a must from the user application's point of view.
> 
> > Are you going to provide any control over the initialization of the NIC
> > to define the capability matrices? For e.g., to operate in an L3 router
> > mode, software may want to initialize the NIC port to only consider the
> > L2 and L3 fields.
> > I assume the initialization is done based on the first rules that are
> > programmed into the NIC?
> 
> Precisely, PMDs are supposed to determine the most appropriate device
> mode to use in order to handle the requested rules. They may even switch
> to another mode if necessary assuming this does not break existing
> constraints.
> 
> I think we've discussed an atomic (commit-based) mode of operation
> through separate functions as well, where the application would attempt to
> create a bunch of rules at once, possibly making it easier for PMDs to
> determine the most appropriate mode of operation for the device.
> 
> All of these may be added later according to users feedback once the basic
> API has settled.
[Sugesh] Yes, we discussed this before. However, I feel it makes sense
to provide some flexibility for the user/application to define a profile/mode of the device.
This way, the complexity of determining the mode is taken away from the PMD.
Looking at the P4 enablement patches in OVS, the mode definition APIs could be used in conjunction
with a P4 behavioral model.
For e.g.: a P4 model for an L2 switch makes OVS operate as an L2 switch. Using the mode definition APIs,
it is possible to impose the same behavioral model on the hardware too.
This way it is simple, clean and very predictable, though it needs an additional set of profile_define APIs.
I am sorry to provide this comment at such a late stage; however, looking at the adoption of eBPF and P4 makes me
think this way.
What do you think?
> 
> --
> Adrien Mazarguil
> 6WIND
  
Adrien Mazarguil Dec. 9, 2016, 4:38 p.m. UTC | #12
Hi Sugesh,

On Fri, Dec 09, 2016 at 12:18:03PM +0000, Chandran, Sugesh wrote:
[...]
> > > Are you going to provide any control over the initialization of NIC
> > > to define the capability matrices For eg; To operate in a L3 router mode,
> > software wanted to initialize the NIC port only to consider the L2 and L3
> > fields.
> > > I assume the initialization is done based on the first rules that are
> > programmed into the NIC.?
> > 
> > Precisely, PMDs are supposed to determine the most appropriate device
> > mode to use in order to handle the requested rules. They may even switch
> > to another mode if necessary assuming this does not break existing
> > constraints.
> > 
> > I think we've discussed an atomic (commit-based) mode of operation
> > through separate functions as well, where the application would attempt to
> > create a bunch of rules at once, possibly making it easier for PMDs to
> > determine the most appropriate mode of operation for the device.
> > 
> > All of these may be added later according to users feedback once the basic
> > API has settled.
> [Sugesh] Yes , we discussed about this before. However I feel that, it make sense
> to provide some flexibility to the user/application to define a profile/mode of the device.
> This way the complexity of determining the mode by itself will be taken away from PMD.
> Looking at the P4 enablement patches in OVS, the mode definition APIs can be used in conjunction
> P4 behavioral model. 
> For eg: A P4 model for a L2 switch operate OVS as a L2 switch. Using the mode definition APIs
> Its possible to impose the same behavioral model in the hardware too. 
> This way its simple, clean and very predictive though it needs to define an additional profile_define APIs.
> I am sorry to provide the comment at this stage,  However looking at the adoption of ebpf, P4 make me
> to think this way.
> What do you think?

What you suggest (device profile configuration) would be done by a separate
function in any case, so as long as everyone agrees on a generic method to
do so, no problem with extending rte_flow. By default in the meantime we'll
have to rely on PMDs to make the right decision.

Do you think it has to be defined from the beginning?
  
Chandran, Sugesh Dec. 12, 2016, 10:20 a.m. UTC | #13
Hi Adrien,

Regards
_Sugesh

> 
> Hi Sugesh,
> 
> On Fri, Dec 09, 2016 at 12:18:03PM +0000, Chandran, Sugesh wrote:
> [...]
> > > > Are you going to provide any control over the initialization of
> > > > NIC to define the capability matrices For eg; To operate in a L3
> > > > router mode,
> > > software wanted to initialize the NIC port only to consider the L2
> > > and L3 fields.
> > > > I assume the initialization is done based on the first rules that
> > > > are
> > > programmed into the NIC.?
> > >
> > > Precisely, PMDs are supposed to determine the most appropriate
> > > device mode to use in order to handle the requested rules. They may
> > > even switch to another mode if necessary assuming this does not
> > > break existing constraints.
> > >
> > > I think we've discussed an atomic (commit-based) mode of operation
> > > through separate functions as well, where the application would
> > > attempt to create a bunch of rules at once, possibly making it
> > > easier for PMDs to determine the most appropriate mode of operation
> for the device.
> > >
> > > All of these may be added later according to users feedback once the
> > > basic API has settled.
> > [Sugesh] Yes , we discussed about this before. However I feel that, it
> > make sense to provide some flexibility to the user/application to define a
> profile/mode of the device.
> > This way the complexity of determining the mode by itself will be taken
> away from PMD.
> > Looking at the P4 enablement patches in OVS, the mode definition APIs
> > can be used in conjunction
> > P4 behavioral model.
> > For eg: A P4 model for a L2 switch operate OVS as a L2 switch. Using
> > the mode definition APIs Its possible to impose the same behavioral model
> in the hardware too.
> > This way its simple, clean and very predictive though it needs to define an
> additional profile_define APIs.
> > I am sorry to provide the comment at this stage,  However looking at
> > the adoption of ebpf, P4 make me to think this way.
> > What do you think?
> 
> What you suggest (device profile configuration) would be done by a separate
> function in any case, so as long as everyone agrees on a generic method to
> do so, no problem with extending rte_flow. By default in the meantime we'll
> have to rely on PMDs to make the right decision.
[Sugesh] I am fine with the PMD making the decision on profile/mode selection in
the default case. However, we must provide an option for the application to define a mode,
and the PMD must honor it to avoid making an invalid mode change.
> 
> Do you think it has to be defined from the beginning?
[Sugesh] I feel it's going to be another big topic to decide what the proposed mode implementation will look like,
what the available modes should be, etc. So I am OK with considering it as not part of this flow API definition for now.
However, it would be good to mention it in the API comments section so users are aware. Do you agree with that?

> 
> --
> Adrien Mazarguil
> 6WIND
  
Adrien Mazarguil Dec. 12, 2016, 11:17 a.m. UTC | #14
Hi Sugesh,

On Mon, Dec 12, 2016 at 10:20:18AM +0000, Chandran, Sugesh wrote:
> Hi Adrien,
> 
> Regards
> _Sugesh
> 
> > -----Original Message-----
> > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > Sent: Friday, December 9, 2016 4:39 PM
> > To: Chandran, Sugesh <sugesh.chandran@intel.com>
> > Cc: Kevin Traynor <ktraynor@redhat.com>; dev@dpdk.org; Thomas
> > Monjalon <thomas.monjalon@6wind.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.guarch@intel.com>; Olivier Matz <olivier.matz@6wind.com>;
> > sugesh.chandran@intel.comn
> > Subject: Re: [dpdk-dev] [PATCH 01/22] ethdev: introduce generic flow API
> > 
> > Hi Sugesh,
> > 
> > On Fri, Dec 09, 2016 at 12:18:03PM +0000, Chandran, Sugesh wrote:
> > [...]
> > > > > Are you going to provide any control over the initialization of
> > > > > NIC to define the capability matrices For eg; To operate in a L3
> > > > > router mode,
> > > > software wanted to initialize the NIC port only to consider the L2
> > > > and L3 fields.
> > > > > I assume the initialization is done based on the first rules that
> > > > > are
> > > > programmed into the NIC.?
> > > >
> > > > Precisely, PMDs are supposed to determine the most appropriate
> > > > device mode to use in order to handle the requested rules. They may
> > > > even switch to another mode if necessary assuming this does not
> > > > break existing constraints.
> > > >
> > > > I think we've discussed an atomic (commit-based) mode of operation
> > > > through separate functions as well, where the application would
> > > > attempt to create a bunch of rules at once, possibly making it
> > > > easier for PMDs to determine the most appropriate mode of operation
> > for the device.
> > > >
> > > > All of these may be added later according to users feedback once the
> > > > basic API has settled.
> > > [Sugesh] Yes , we discussed about this before. However I feel that, it
> > > make sense to provide some flexibility to the user/application to define a
> > profile/mode of the device.
> > > This way the complexity of determining the mode by itself will be taken
> > away from PMD.
> > > Looking at the P4 enablement patches in OVS, the mode definition APIs
> > > can be used in conjunction
> > > P4 behavioral model.
> > > For eg: A P4 model for a L2 switch operate OVS as a L2 switch. Using
> > > the mode definition APIs Its possible to impose the same behavioral model
> > in the hardware too.
> > > This way its simple, clean and very predictive though it needs to define an
> > additional profile_define APIs.
> > > I am sorry to provide the comment at this stage,  However looking at
> > > the adoption of ebpf, P4 make me to think this way.
> > > What do you think?
> > 
> > What you suggest (device profile configuration) would be done by a separate
> > function in any case, so as long as everyone agrees on a generic method to
> > do so, no problem with extending rte_flow. By default in the meantime we'll
> > have to rely on PMDs to make the right decision.
> [Sugesh] I am fine with PMD is making the decision on profile/mode selection in
> Default case. However we must provide an option for the application to define a mode
> and PMD must honor with it to avoid making an invalid mode change.
> > 
> > Do you think it has to be defined from the beginning?
> [Sugesh] I feel it's going to be another big topic to decide how proposed mode implementation will be looks like,
> What should be available modes and etc.  So I am OK to consider as its not part of this flow API definition for now.
> However its good to mention that in the API comments section to be aware. Do you agree that?

Will do, I'll mention it in the "future evolutions" section.
  
Kevin Traynor Dec. 14, 2016, 11:48 a.m. UTC | #15
hi Adrien, sorry for the delay

<...>

>>>>
>>>> Is it expected that the application or pmd will provide locking between
>>>> these functions if required? I think it's going to have to be the app.
>>>
>>> Locking is indeed expected to be performed by applications. This API only
>>> documents places where locking would make sense if necessary and expected
>>> behavior.
>>>
>>> Like all control path APIs, this one assumes a single control thread.
>>> Applications must take the necessary precautions.
>>
>> If you look at OVS now it's quite possible that you have 2 rx queues
>> serviced by different threads, that would also install the flow rules in
>> the software flow caches - possibly that could extend to adding hardware
>> flows. There could also be another thread that is querying for stats. So
>> anything that can be done to minimise the locking would be helpful -
>> maybe query() could be atomic and not require any locking?
> 
> I think we need basic functions with as few constraints as possible on PMDs
> first, this API being somewhat complex to implement on their side. That
> covers the common use case where applications have a single control thread
> or otherwise perform locking on their own.
> 
> Once the basics are there for most PMDs, we may add new functions, items,
> properties and actions that provide additional constraints (timing,
> multi-threading and so on), which remain to be defined according to
> feedback. It is designed to be extended without causing ABI breakage.

I think Sugesh and I are trying to foresee some of the issues that may
arise when integrating with something like OVS. OTOH it's
hard/impossible to say what will be needed exactly in the API right now
to make it suitable for OVS.

So, I'm ok with the approach you are taking by exposing a basic API
but I think there should be an expectation that it may not be sufficient
for a project like OVS to integrate in and may take several
iterations/extensions - don't go anywhere!

> 
> As for query(), let's see how PMDs handle it first. A race between query()
> and create() on a given device is almost unavoidable without locking, same
> for queries that reset counters in a given flow rule. Basic parallel queries
> should not cause any harm otherwise, although this cannot be guaranteed yet.

You still have a race if there is locking, except it is for the lock,
but it has the same effect. The downside of my suggestion is that all
the PMDs would need to guarantee they could get stats atomically - I'm
not sure if they can or it's too restrictive.

> 

<...>

>>
>>>
>>>>> +
>>>>> +/**
>>>>> + * Destroy a flow rule on a given port.
>>>>> + *
>>>>> + * Failure to destroy a flow rule handle may occur when other flow rules
>>>>> + * depend on it, and destroying it would result in an inconsistent state.
>>>>> + *
>>>>> + * This function is only guaranteed to succeed if handles are destroyed in
>>>>> + * reverse order of their creation.
>>>>
>>>> How can the application find this information out on error?
>>>
>>> Without maintaining a list, they cannot. The specified case is the only
>>> possible guarantee. That does not mean PMDs should not do their best to
>>> destroy flow rules, only that ordering must remain consistent in case of
>>> inability to destroy one.
>>>
>>> What do you suggest?
>>
>> I think if the app cannot remove a specific rule it may want to remove
>> all rules and deal with flows in software for a time. So once the app
>> knows it fails that should be enough.
> 
> OK, then since destruction may return an error already, is it fine?
> Applications may call rte_flow_flush() (not supposed to fail unless there is
> a serious issue, abort() in that case) and switch to SW fallback.

yes, it's fine.

> 

<...>

>>>>> + * @param[out] error
>>>>> + *   Perform verbose error reporting if not NULL.
>>>>> + *
>>>>> + * @return
>>>>> + *   0 on success, a negative errno value otherwise and rte_errno is set.
>>>>> + */
>>>>> +int
>>>>> +rte_flow_query(uint8_t port_id,
>>>>> +	       struct rte_flow *flow,
>>>>> +	       enum rte_flow_action_type action,
>>>>> +	       void *data,
>>>>> +	       struct rte_flow_error *error);
>>>>> +
>>>>> +#ifdef __cplusplus
>>>>> +}
>>>>> +#endif
>>>>
>>>> I don't see a way to dump all the rules for a port out. I think this is
>>>> necessary for debugging. You could have a look through dpif.h in OVS
>>>> and see how dpif_flow_dump_next() is used, it might be a good reference.
>>>
>>> DPDK does not maintain flow rules and, depending on hardware capabilities
>>> and level of compliance, PMDs do not necessarily do it either, particularly
>>> since it requires space and application probably have a better method to
>>> store these pointers for their own needs.
>>
>> understood
>>
>>>
>>> What you see here is only a PMD interface. Depending on applications needs,
>>> generic helper functions built on top of these may be added to manage flow
>>> rules in the future.
>>
>> I'm thinking of the case where something goes wrong and I want to get a
>> dump of all the flow rules from hardware, not query the rules I think I
>> have. I don't see a way to do it or something to build a helper on top of?
> 
> Generic helper functions would exist on top of this API and would likely
> maintain a list of flow rules themselves. The dump in that case would be
> entirely implemented in software. I think that recovering flow rules from HW
> may be complicated in many cases (even without taking storage allocation and
> rules conversion issues into account), therefore if there is really a need
> for it, we could perhaps add a dump() function that PMDs are free to
> implement later.
> 

ok. Maybe there are some more generic stats that can be obtained from the
hardware that would help debugging and would suffice, like total flow
rule hits/misses (i.e. not on a per flow rule basis).

You can get this from the software flow caches and it's widely used for
debugging. e.g.

pmd thread numa_id 0 core_id 3:
	emc hits:0
	megaflow hits:0
	avg. subtable lookups per hit:0.00
	miss:0
  
Adrien Mazarguil Dec. 14, 2016, 1:54 p.m. UTC | #16
Hi Kevin,

On Wed, Dec 14, 2016 at 11:48:04AM +0000, Kevin Traynor wrote:
> hi Adrien, sorry for the delay
> 
> <...>
> 
> >>>>
> >>>> Is it expected that the application or pmd will provide locking between
> >>>> these functions if required? I think it's going to have to be the app.
> >>>
> >>> Locking is indeed expected to be performed by applications. This API only
> >>> documents places where locking would make sense if necessary and expected
> >>> behavior.
> >>>
> >>> Like all control path APIs, this one assumes a single control thread.
> >>> Applications must take the necessary precautions.
> >>
> >> If you look at OVS now it's quite possible that you have 2 rx queues
> >> serviced by different threads, that would also install the flow rules in
> >> the software flow caches - possibly that could extend to adding hardware
> >> flows. There could also be another thread that is querying for stats. So
> >> anything that can be done to minimise the locking would be helpful -
> >> maybe query() could be atomic and not require any locking?
> > 
> > I think we need basic functions with as few constraints as possible on PMDs
> > first, this API being somewhat complex to implement on their side. That
> > covers the common use case where applications have a single control thread
> > or otherwise perform locking on their own.
> > 
> > Once the basics are there for most PMDs, we may add new functions, items,
> > properties and actions that provide additional constraints (timing,
> > multi-threading and so on), which remain to be defined according to
> > feedback. It is designed to be extended without causing ABI breakage.
> 
> I think Sugesh and I are trying to foresee some of the issues that may
> arise when integrating with something like OVS. OTOH it's
> hard/impossible to say what will be needed exactly in the API right now
> to make it suitable for OVS.
> 
> So, I'm ok with the approach you are taking by exposing a basic API
> but I think there should be an expectation that it may not be sufficient
> for a project like OVS to integrate in and may take several
> iterations/extensions - don't go anywhere!
> 
> > 
> > As for query(), let's see how PMDs handle it first. A race between query()
> > and create() on a given device is almost unavoidable without locking, same
> > for queries that reset counters in a given flow rule. Basic parallel queries
> > should not cause any harm otherwise, although this cannot be guaranteed yet.
> 
> You still have a race if there is locking, except it is for the lock,
> but it has the same effect. The downside of my suggestion is that all
> the PMDs would need to guarantee they could gets stats atomically - I'm
> not sure if they can or it's too restrictive.
> 
> > 
> 
> <...>
> 
> >>
> >>>
> >>>>> +
> >>>>> +/**
> >>>>> + * Destroy a flow rule on a given port.
> >>>>> + *
> >>>>> + * Failure to destroy a flow rule handle may occur when other flow rules
> >>>>> + * depend on it, and destroying it would result in an inconsistent state.
> >>>>> + *
> >>>>> + * This function is only guaranteed to succeed if handles are destroyed in
> >>>>> + * reverse order of their creation.
> >>>>
> >>>> How can the application find this information out on error?
> >>>
> >>> Without maintaining a list, they cannot. The specified case is the only
> >>> possible guarantee. That does not mean PMDs should not do their best to
> >>> destroy flow rules, only that ordering must remain consistent in case of
> >>> inability to destroy one.
> >>>
> >>> What do you suggest?
> >>
> >> I think if the app cannot remove a specific rule it may want to remove
> >> all rules and deal with flows in software for a time. So once the app
> >> knows it fails that should be enough.
> > 
> > OK, then since destruction may return an error already, is it fine?
> > Applications may call rte_flow_flush() (not supposed to fail unless there is
> > a serious issue, abort() in that case) and switch to SW fallback.
> 
> yes, it's fine.
> 
> > 
> 
> <...>
> 
> >>>>> + * @param[out] error
> >>>>> + *   Perform verbose error reporting if not NULL.
> >>>>> + *
> >>>>> + * @return
> >>>>> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> >>>>> + */
> >>>>> +int
> >>>>> +rte_flow_query(uint8_t port_id,
> >>>>> +	       struct rte_flow *flow,
> >>>>> +	       enum rte_flow_action_type action,
> >>>>> +	       void *data,
> >>>>> +	       struct rte_flow_error *error);
> >>>>> +
> >>>>> +#ifdef __cplusplus
> >>>>> +}
> >>>>> +#endif
> >>>>
> >>>> I don't see a way to dump all the rules for a port out. I think this is
> >>>> necessary for debugging. You could have a look through dpif.h in OVS
> >>>> and see how dpif_flow_dump_next() is used, it might be a good reference.
> >>>
> >>> DPDK does not maintain flow rules and, depending on hardware capabilities
> >>> and level of compliance, PMDs do not necessarily do it either, particularly
> >>> since it requires space and application probably have a better method to
> >>> store these pointers for their own needs.
> >>
> >> understood
> >>
> >>>
> >>> What you see here is only a PMD interface. Depending on applications needs,
> >>> generic helper functions built on top of these may be added to manage flow
> >>> rules in the future.
> >>
> >> I'm thinking of the case where something goes wrong and I want to get a
> >> dump of all the flow rules from hardware, not query the rules I think I
> >> have. I don't see a way to do it or something to build a helper on top of?
> > 
> > Generic helper functions would exist on top of this API and would likely
> > maintain a list of flow rules themselves. The dump in that case would be
> > entirely implemented in software. I think that recovering flow rules from HW
> > may be complicated in many cases (even without taking storage allocation and
> > rules conversion issues into account), therefore if there is really a need
> > for it, we could perhaps add a dump() function that PMDs are free to
> > implement later.
> > 
> 
> ok. Maybe there are some more generic stats that can be got from the
> hardware that would help debugging that would suffice, like total flow
> rule hits/misses (i.e. not on a per flow rule basis).
> 
> You can get this from the software flow caches and it's widely used for
> debugging. e.g.
> 
> pmd thread numa_id 0 core_id 3:
> 	emc hits:0
> 	megaflow hits:0
> 	avg. subtable lookups per hit:0.00
> 	miss:0
> 

Perhaps a rule such as the following could do the trick:

 group: 42 (or priority 42)
 pattern: void
 actions: count / passthru

Assuming useful flow rules are defined with higher priorities (using lower
group ID or priority level) and provide a terminating action, this one would
count all packets that were not caught by them.

That is one example to illustrate how "global" counters can be requested by
applications.

Otherwise you could just make sure all rules contain mark / flag actions, in
which case mbufs would tell directly if they went through them or need
additional SW processing.
  
Kevin Traynor Dec. 14, 2016, 4:11 p.m. UTC | #17
On 12/14/2016 01:54 PM, Adrien Mazarguil wrote:

>>
>>>>>>> + * @param[out] error
>>>>>>> + *   Perform verbose error reporting if not NULL.
>>>>>>> + *
>>>>>>> + * @return
>>>>>>> + *   0 on success, a negative errno value otherwise and rte_errno is set.
>>>>>>> + */
>>>>>>> +int
>>>>>>> +rte_flow_query(uint8_t port_id,
>>>>>>> +	       struct rte_flow *flow,
>>>>>>> +	       enum rte_flow_action_type action,
>>>>>>> +	       void *data,
>>>>>>> +	       struct rte_flow_error *error);
>>>>>>> +
>>>>>>> +#ifdef __cplusplus
>>>>>>> +}
>>>>>>> +#endif
>>>>>>
>>>>>> I don't see a way to dump all the rules for a port out. I think this is
>>>>>> necessary for debugging. You could have a look through dpif.h in OVS
>>>>>> and see how dpif_flow_dump_next() is used, it might be a good reference.
>>>>>
>>>>> DPDK does not maintain flow rules and, depending on hardware capabilities
>>>>> and level of compliance, PMDs do not necessarily do it either, particularly
>>>>> since it requires space and application probably have a better method to
>>>>> store these pointers for their own needs.
>>>>
>>>> understood
>>>>
>>>>>
>>>>> What you see here is only a PMD interface. Depending on applications needs,
>>>>> generic helper functions built on top of these may be added to manage flow
>>>>> rules in the future.
>>>>
>>>> I'm thinking of the case where something goes wrong and I want to get a
>>>> dump of all the flow rules from hardware, not query the rules I think I
>>>> have. I don't see a way to do it or something to build a helper on top of?
>>>
>>> Generic helper functions would exist on top of this API and would likely
>>> maintain a list of flow rules themselves. The dump in that case would be
>>> entirely implemented in software. I think that recovering flow rules from HW
>>> may be complicated in many cases (even without taking storage allocation and
>>> rules conversion issues into account), therefore if there is really a need
>>> for it, we could perhaps add a dump() function that PMDs are free to
>>> implement later.
>>>
>>
>> ok. Maybe there are some more generic stats that can be got from the
>> hardware that would help debugging that would suffice, like total flow
>> rule hits/misses (i.e. not on a per flow rule basis).
>>
>> You can get this from the software flow caches and it's widely used for
>> debugging. e.g.
>>
>> pmd thread numa_id 0 core_id 3:
>> 	emc hits:0
>> 	megaflow hits:0
>> 	avg. subtable lookups per hit:0.00
>> 	miss:0
>>
> 
> Perhaps a rule such as the following could do the trick:
> 
>  group: 42 (or priority 42)
>  pattern: void
>  actions: count / passthru
> 
> Assuming useful flow rules are defined with higher priorities (using lower
> group ID or priority level) and provide a terminating action, this one would
> count all packets that were not caught by them.
> 
> That is one example to illustrate how "global" counters can be requested by
> applications.
> 
> Otherwise you could just make sure all rules contain mark / flag actions, in
> which case mbufs would tell directly if they went through them or need
> additional SW processing.
> 

ok, sounds like there's some options at least to work with on this which
is good. thanks.
  

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index d6bb8f8..3b46630 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -243,6 +243,10 @@  M: Thomas Monjalon <thomas.monjalon@6wind.com>
 F: lib/librte_ether/
 F: scripts/test-null.sh
 
+Generic flow API
+M: Adrien Mazarguil <adrien.mazarguil@6wind.com>
+F: lib/librte_ether/rte_flow*
+
 Crypto API
 M: Declan Doherty <declan.doherty@intel.com>
 F: lib/librte_cryptodev/
diff --git a/lib/librte_ether/Makefile b/lib/librte_ether/Makefile
index efe1e5f..9335361 100644
--- a/lib/librte_ether/Makefile
+++ b/lib/librte_ether/Makefile
@@ -44,6 +44,7 @@  EXPORT_MAP := rte_ether_version.map
 LIBABIVER := 5
 
 SRCS-y += rte_ethdev.c
+SRCS-y += rte_flow.c
 
 #
 # Export include files
@@ -51,6 +52,8 @@  SRCS-y += rte_ethdev.c
 SYMLINK-y-include += rte_ethdev.h
 SYMLINK-y-include += rte_eth_ctrl.h
 SYMLINK-y-include += rte_dev_info.h
+SYMLINK-y-include += rte_flow.h
+SYMLINK-y-include += rte_flow_driver.h
 
 # this lib depends upon:
 DEPDIRS-y += lib/librte_net lib/librte_eal lib/librte_mempool lib/librte_ring lib/librte_mbuf
diff --git a/lib/librte_ether/rte_eth_ctrl.h b/lib/librte_ether/rte_eth_ctrl.h
index fe80eb0..8386904 100644
--- a/lib/librte_ether/rte_eth_ctrl.h
+++ b/lib/librte_ether/rte_eth_ctrl.h
@@ -99,6 +99,7 @@  enum rte_filter_type {
 	RTE_ETH_FILTER_FDIR,
 	RTE_ETH_FILTER_HASH,
 	RTE_ETH_FILTER_L2_TUNNEL,
+	RTE_ETH_FILTER_GENERIC,
 	RTE_ETH_FILTER_MAX
 };
 
diff --git a/lib/librte_ether/rte_ether_version.map b/lib/librte_ether/rte_ether_version.map
index 72be66d..b5d2547 100644
--- a/lib/librte_ether/rte_ether_version.map
+++ b/lib/librte_ether/rte_ether_version.map
@@ -147,3 +147,13 @@  DPDK_16.11 {
 	rte_eth_dev_pci_remove;
 
 } DPDK_16.07;
+
+DPDK_17.02 {
+	global:
+
+	rte_flow_validate;
+	rte_flow_create;
+	rte_flow_destroy;
+	rte_flow_query;
+
+} DPDK_16.11;
diff --git a/lib/librte_ether/rte_flow.c b/lib/librte_ether/rte_flow.c
new file mode 100644
index 0000000..064963d
--- /dev/null
+++ b/lib/librte_ether/rte_flow.c
@@ -0,0 +1,159 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright 2016 6WIND S.A.
+ *   Copyright 2016 Mellanox.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of 6WIND S.A. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <stdint.h>
+
+#include <rte_errno.h>
+#include <rte_branch_prediction.h>
+#include "rte_ethdev.h"
+#include "rte_flow_driver.h"
+#include "rte_flow.h"
+
+/* Get generic flow operations structure from a port. */
+const struct rte_flow_ops *
+rte_flow_ops_get(uint8_t port_id, struct rte_flow_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_flow_ops *ops;
+	int code;
+
+	if (unlikely(!rte_eth_dev_is_valid_port(port_id)))
+		code = ENODEV;
+	else if (unlikely(!dev->dev_ops->filter_ctrl ||
+			  dev->dev_ops->filter_ctrl(dev,
+						    RTE_ETH_FILTER_GENERIC,
+						    RTE_ETH_FILTER_GET,
+						    &ops) ||
+			  !ops))
+		code = ENOTSUP;
+	else
+		return ops;
+	rte_flow_error_set(error, code, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			   NULL, rte_strerror(code));
+	return NULL;
+}
+
+/* Check whether a flow rule can be created on a given port. */
+int
+rte_flow_validate(uint8_t port_id,
+		  const struct rte_flow_attr *attr,
+		  const struct rte_flow_item pattern[],
+		  const struct rte_flow_action actions[],
+		  struct rte_flow_error *error)
+{
+	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+
+	if (unlikely(!ops))
+		return -rte_errno;
+	if (likely(!!ops->validate))
+		return ops->validate(dev, attr, pattern, actions, error);
+	rte_flow_error_set(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			   NULL, rte_strerror(ENOTSUP));
+	return -rte_errno;
+}
+
+/* Create a flow rule on a given port. */
+struct rte_flow *
+rte_flow_create(uint8_t port_id,
+		const struct rte_flow_attr *attr,
+		const struct rte_flow_item pattern[],
+		const struct rte_flow_action actions[],
+		struct rte_flow_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
+
+	if (unlikely(!ops))
+		return NULL;
+	if (likely(!!ops->create))
+		return ops->create(dev, attr, pattern, actions, error);
+	rte_flow_error_set(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			   NULL, rte_strerror(ENOTSUP));
+	return NULL;
+}
+
+/* Destroy a flow rule on a given port. */
+int
+rte_flow_destroy(uint8_t port_id,
+		 struct rte_flow *flow,
+		 struct rte_flow_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
+
+	if (unlikely(!ops))
+		return -rte_errno;
+	if (likely(!!ops->destroy))
+		return ops->destroy(dev, flow, error);
+	rte_flow_error_set(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			   NULL, rte_strerror(ENOTSUP));
+	return -rte_errno;
+}
+
+/* Destroy all flow rules associated with a port. */
+int
+rte_flow_flush(uint8_t port_id,
+	       struct rte_flow_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
+
+	if (unlikely(!ops))
+		return -rte_errno;
+	if (likely(!!ops->flush))
+		return ops->flush(dev, error);
+	rte_flow_error_set(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			   NULL, rte_strerror(ENOTSUP));
+	return -rte_errno;
+}
+
+/* Query an existing flow rule. */
+int
+rte_flow_query(uint8_t port_id,
+	       struct rte_flow *flow,
+	       enum rte_flow_action_type action,
+	       void *data,
+	       struct rte_flow_error *error)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	const struct rte_flow_ops *ops = rte_flow_ops_get(port_id, error);
+
+	if (!ops)
+		return -rte_errno;
+	if (likely(!!ops->query))
+		return ops->query(dev, flow, action, data, error);
+	rte_flow_error_set(error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			   NULL, rte_strerror(ENOTSUP));
+	return -rte_errno;
+}
diff --git a/lib/librte_ether/rte_flow.h b/lib/librte_ether/rte_flow.h
new file mode 100644
index 0000000..211f307
--- /dev/null
+++ b/lib/librte_ether/rte_flow.h
@@ -0,0 +1,947 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright 2016 6WIND S.A.
+ *   Copyright 2016 Mellanox.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of 6WIND S.A. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef RTE_FLOW_H_
+#define RTE_FLOW_H_
+
+/**
+ * @file
+ * RTE generic flow API
+ *
+ * This interface provides the ability to program packet matching and
+ * associated actions in hardware through flow rules.
+ */
+
+#include <rte_arp.h>
+#include <rte_ether.h>
+#include <rte_icmp.h>
+#include <rte_ip.h>
+#include <rte_sctp.h>
+#include <rte_tcp.h>
+#include <rte_udp.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Flow rule attributes.
+ *
+ * Priorities are set on two levels: per group and per rule within groups.
+ *
+ * Lower values denote higher priority, the highest priority for both levels
+ * is 0, so that a rule with priority 0 in group 8 is always matched after a
+ * rule with priority 8 in group 0.
+ *
+ * Although optional, applications are encouraged to group similar rules as
+ * much as possible to fully take advantage of hardware capabilities
+ * (e.g. optimized matching) and work around limitations (e.g. a single
+ * pattern type possibly allowed in a given group).
+ *
+ * Group and priority levels are arbitrary and up to the application, they
+ * do not need to be contiguous nor start from 0, however the maximum number
+ * varies between devices and may be affected by existing flow rules.
+ *
+ * If a packet is matched by several rules of a given group for a given
+ * priority level, the outcome is undefined. It can take any path, may be
+ * duplicated or even cause unrecoverable errors.
+ *
+ * Note that support for more than a single group and priority level is not
+ * guaranteed.
+ *
+ * Flow rules can apply to inbound and/or outbound traffic (ingress/egress).
+ *
+ * Several pattern items and actions are valid and can be used in both
+ * directions. Those valid for only one direction are described as such.
+ *
+ * Specifying both directions at once is not recommended but may be valid in
+ * some cases, such as incrementing the same counter twice.
+ *
+ * Not specifying any direction is currently an error.
+ */
+struct rte_flow_attr {
+	uint32_t group; /**< Priority group. */
+	uint32_t priority; /**< Priority level within group. */
+	uint32_t ingress:1; /**< Rule applies to ingress traffic. */
+	uint32_t egress:1; /**< Rule applies to egress traffic. */
+	uint32_t reserved:30; /**< Reserved, must be zero. */
+};
+
+/**
+ * Matching pattern item types.
+ *
+ * Items are arranged in a list to form a matching pattern for packets.
+ * They fall in two categories:
+ *
+ * - Protocol matching (ANY, RAW, ETH, IPV4, IPV6, ICMP, UDP, TCP, SCTP,
+ *   VXLAN and so on), usually associated with a specification
+ *   structure. These must be stacked in the same order as the protocol
+ *   layers to match, starting from L2.
+ *
+ * - Affecting how the pattern is processed (END, VOID, INVERT, PF, VF, PORT
+ *   and so on), often without a specification structure. Since they are
+ *   metadata that does not match packet contents, these can be specified
+ *   anywhere within item lists without affecting the protocol matching
+ *   items.
+ *
+ * See the description of individual types for more information. Those
+ * marked with [META] fall into the second category.
+ */
+enum rte_flow_item_type {
+	/**
+	 * [META]
+	 *
+	 * End marker for item lists. Prevents further processing of items,
+	 * thereby ending the pattern.
+	 *
+	 * No associated specification structure.
+	 */
+	RTE_FLOW_ITEM_TYPE_END,
+
+	/**
+	 * [META]
+	 *
+	 * Used as a placeholder for convenience. It is ignored and simply
+	 * discarded by PMDs.
+	 *
+	 * No associated specification structure.
+	 */
+	RTE_FLOW_ITEM_TYPE_VOID,
+
+	/**
+	 * [META]
+	 *
+	 * Inverted matching, i.e. process packets that do not match the
+	 * pattern.
+	 *
+	 * No associated specification structure.
+	 */
+	RTE_FLOW_ITEM_TYPE_INVERT,
+
+	/**
+	 * Matches any protocol in place of the current layer; a single ANY
+	 * may also stand for several protocol layers.
+	 *
+	 * See struct rte_flow_item_any.
+	 */
+	RTE_FLOW_ITEM_TYPE_ANY,
+
+	/**
+	 * [META]
+	 *
+	 * Matches packets addressed to the physical function of the device.
+	 *
+	 * If the underlying device function differs from the one that would
+	 * normally receive the matched traffic, specifying this item
+	 * prevents it from reaching that device unless the flow rule
+	 * contains a PF action. Packets are not duplicated between device
+	 * instances by default.
+	 *
+	 * No associated specification structure.
+	 */
+	RTE_FLOW_ITEM_TYPE_PF,
+
+	/**
+	 * [META]
+	 *
+	 * Matches packets addressed to a virtual function ID of the device.
+	 *
+	 * If the underlying device function differs from the one that would
+	 * normally receive the matched traffic, specifying this item
+	 * prevents it from reaching that device unless the flow rule
+	 * contains a VF action. Packets are not duplicated between device
+	 * instances by default.
+	 *
+	 * See struct rte_flow_item_vf.
+	 */
+	RTE_FLOW_ITEM_TYPE_VF,
+
+	/**
+	 * [META]
+	 *
+	 * Matches packets coming from the specified physical port of the
+	 * underlying device.
+	 *
+	 * The first PORT item overrides the physical port normally
+	 * associated with the specified DPDK input port (port_id). This
+	 * item can be provided several times to match additional physical
+	 * ports.
+	 *
+	 * See struct rte_flow_item_port.
+	 */
+	RTE_FLOW_ITEM_TYPE_PORT,
+
+	/**
+	 * Matches a byte string of a given length at a given offset.
+	 *
+	 * See struct rte_flow_item_raw.
+	 */
+	RTE_FLOW_ITEM_TYPE_RAW,
+
+	/**
+	 * Matches an Ethernet header.
+	 *
+	 * See struct rte_flow_item_eth.
+	 */
+	RTE_FLOW_ITEM_TYPE_ETH,
+
+	/**
+	 * Matches an 802.1Q/ad VLAN tag.
+	 *
+	 * See struct rte_flow_item_vlan.
+	 */
+	RTE_FLOW_ITEM_TYPE_VLAN,
+
+	/**
+	 * Matches an IPv4 header.
+	 *
+	 * See struct rte_flow_item_ipv4.
+	 */
+	RTE_FLOW_ITEM_TYPE_IPV4,
+
+	/**
+	 * Matches an IPv6 header.
+	 *
+	 * See struct rte_flow_item_ipv6.
+	 */
+	RTE_FLOW_ITEM_TYPE_IPV6,
+
+	/**
+	 * Matches an ICMP header.
+	 *
+	 * See struct rte_flow_item_icmp.
+	 */
+	RTE_FLOW_ITEM_TYPE_ICMP,
+
+	/**
+	 * Matches a UDP header.
+	 *
+	 * See struct rte_flow_item_udp.
+	 */
+	RTE_FLOW_ITEM_TYPE_UDP,
+
+	/**
+	 * Matches a TCP header.
+	 *
+	 * See struct rte_flow_item_tcp.
+	 */
+	RTE_FLOW_ITEM_TYPE_TCP,
+
+	/**
+	 * Matches an SCTP header.
+	 *
+	 * See struct rte_flow_item_sctp.
+	 */
+	RTE_FLOW_ITEM_TYPE_SCTP,
+
+	/**
+	 * Matches a VXLAN header.
+	 *
+	 * See struct rte_flow_item_vxlan.
+	 */
+	RTE_FLOW_ITEM_TYPE_VXLAN,
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_ANY
+ *
+ * Matches any protocol in place of the current layer; a single ANY may also
+ * stand for several protocol layers.
+ *
+ * This is usually specified as the first pattern item when looking for a
+ * protocol anywhere in a packet.
+ *
+ * A maximum value of 0 requests matching any number of protocol layers
+ * greater than or equal to the minimum value; a nonzero maximum value lower
+ * than the minimum one is invalid.
+ *
+ * This type does not work with a range (struct rte_flow_item.last).
+ */
+struct rte_flow_item_any {
+	uint16_t min; /**< Minimum number of layers covered. */
+	uint16_t max; /**< Maximum number of layers covered, 0 for infinity. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_VF
+ *
+ * Matches packets addressed to a virtual function ID of the device.
+ *
+ * If the underlying device function differs from the one that would
+ * normally receive the matched traffic, specifying this item prevents it
+ * from reaching that device unless the flow rule contains a VF
+ * action. Packets are not duplicated between device instances by default.
+ *
+ * - Likely to return an error or never match any traffic if this causes a
+ *   VF device to match traffic addressed to a different VF.
+ * - Can be specified multiple times to match traffic addressed to several
+ *   specific VFs.
+ * - Can be combined with a PF item to match both PF and VF traffic.
+ *
+ * A zeroed mask can be used to match any VF.
+ */
+struct rte_flow_item_vf {
+	uint32_t id; /**< Destination VF ID. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_PORT
+ *
+ * Matches packets coming from the specified physical port of the underlying
+ * device.
+ *
+ * The first PORT item overrides the physical port normally associated with
+ * the specified DPDK input port (port_id). This item can be provided
+ * several times to match additional physical ports.
+ *
+ * Note that physical ports are not necessarily tied to DPDK input ports
+ * (port_id) when those are not under DPDK control. Possible values are
+ * specific to each device; they are not necessarily indexed from zero and
+ * may not be contiguous.
+ *
+ * As a device property, the list of allowed values as well as the value
+ * associated with a port_id should be retrieved by other means.
+ *
+ * A zeroed mask can be used to match any port index.
+ */
+struct rte_flow_item_port {
+	uint32_t index; /**< Physical port index. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_RAW
+ *
+ * Matches a byte string of a given length at a given offset.
+ *
+ * Offset is either absolute (using the start of the packet) or relative to
+ * the end of the previous matched item in the stack, in which case negative
+ * values are allowed.
+ *
+ * If search is enabled, offset is used as the starting point. The search
+ * area can be delimited by setting limit to a nonzero value, which is the
+ * maximum number of bytes after offset where the pattern may start.
+ *
+ * Matching a zero-length pattern is allowed; doing so resets the relative
+ * offset for subsequent items.
+ *
+ * This type does not work with a range (struct rte_flow_item.last).
+ */
+struct rte_flow_item_raw {
+	uint32_t relative:1; /**< Look for pattern after the previous item. */
+	uint32_t search:1; /**< Search pattern from offset (see also limit). */
+	uint32_t reserved:30; /**< Reserved, must be set to zero. */
+	int32_t offset; /**< Absolute or relative offset for pattern. */
+	uint16_t limit; /**< Search area limit for start of pattern. */
+	uint16_t length; /**< Pattern length. */
+	uint8_t pattern[]; /**< Byte string to look for. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_ETH
+ *
+ * Matches an Ethernet header.
+ */
+struct rte_flow_item_eth {
+	struct ether_addr dst; /**< Destination MAC. */
+	struct ether_addr src; /**< Source MAC. */
+	uint16_t type; /**< EtherType. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_VLAN
+ *
+ * Matches an 802.1Q/ad VLAN tag.
+ *
+ * This type normally follows either RTE_FLOW_ITEM_TYPE_ETH or
+ * RTE_FLOW_ITEM_TYPE_VLAN.
+ */
+struct rte_flow_item_vlan {
+	uint16_t tpid; /**< Tag protocol identifier. */
+	uint16_t tci; /**< Tag control information. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_IPV4
+ *
+ * Matches an IPv4 header.
+ *
+ * Note: IPv4 options are handled by dedicated pattern items.
+ */
+struct rte_flow_item_ipv4 {
+	struct ipv4_hdr hdr; /**< IPv4 header definition. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_IPV6.
+ *
+ * Matches an IPv6 header.
+ *
+ * Note: IPv6 options are handled by dedicated pattern items.
+ */
+struct rte_flow_item_ipv6 {
+	struct ipv6_hdr hdr; /**< IPv6 header definition. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_ICMP.
+ *
+ * Matches an ICMP header.
+ */
+struct rte_flow_item_icmp {
+	struct icmp_hdr hdr; /**< ICMP header definition. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_UDP.
+ *
+ * Matches a UDP header.
+ */
+struct rte_flow_item_udp {
+	struct udp_hdr hdr; /**< UDP header definition. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_TCP.
+ *
+ * Matches a TCP header.
+ */
+struct rte_flow_item_tcp {
+	struct tcp_hdr hdr; /**< TCP header definition. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_SCTP.
+ *
+ * Matches an SCTP header.
+ */
+struct rte_flow_item_sctp {
+	struct sctp_hdr hdr; /**< SCTP header definition. */
+};
+
+/**
+ * RTE_FLOW_ITEM_TYPE_VXLAN.
+ *
+ * Matches a VXLAN header (RFC 7348).
+ */
+struct rte_flow_item_vxlan {
+	uint8_t flags; /**< Normally 0x08 (I flag). */
+	uint8_t rsvd0[3]; /**< Reserved, normally 0x000000. */
+	uint8_t vni[3]; /**< VXLAN identifier. */
+	uint8_t rsvd1; /**< Reserved, normally 0x00. */
+};
+
+/**
+ * Matching pattern item definition.
+ *
+ * A pattern is formed by stacking items starting from the lowest protocol
+ * layer to match. This stacking restriction does not apply to meta items
+ * which can be placed anywhere in the stack with no effect on the meaning
+ * of the resulting pattern.
+ *
+ * A stack is terminated by an END item.
+ *
+ * The spec field should be a valid pointer to a structure of the related
+ * item type. It may be set to NULL in many cases to use default values.
+ *
+ * Optionally, last can point to a structure of the same type to define an
+ * inclusive range. This is mostly supported by integer and address fields
+ * and may cause errors otherwise. Fields that do not support ranges must be
+ * set to the same value as their spec counterparts.
+ *
+ * By default all fields present in spec are considered relevant.* This
+ * behavior can be altered by providing a mask structure of the same type
+ * with applicable bits set to one. It can also be used to partially filter
+ * out specific fields (e.g. as an alternative means to match ranges of IP
+ * addresses).
+ *
+ * Note this is a simple bit-mask applied before interpreting the contents
+ * of spec and last, which may yield unexpected results if not used
+ * carefully. For example, if for an IPv4 address field, spec provides
+ * 10.1.2.3, last provides 10.3.4.5 and mask provides 255.255.0.0, the
+ * effective range is 10.1.0.0 to 10.3.255.255.
+ *
+ * * The defaults for data-matching items such as IPv4 when mask is not
+ *   specified actually depend on the underlying implementation since only
+ *   recognized fields can be taken into account.
+ */
+struct rte_flow_item {
+	enum rte_flow_item_type type; /**< Item type. */
+	const void *spec; /**< Pointer to item specification structure. */
+	const void *last; /**< Defines an inclusive range (spec to last). */
+	const void *mask; /**< Bit-mask applied to spec and last. */
+};
+
+/**
+ * Action types.
+ *
+ * Each possible action is represented by a type. Some have associated
+ * configuration structures. Several actions combined in a list can be
+ * assigned to a flow rule. That list is not ordered.
+ *
+ * They fall into three categories:
+ *
+ * - Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent
+ *   processing matched packets by subsequent flow rules, unless overridden
+ *   with PASSTHRU.
+ *
+ * - Non-terminating actions (PASSTHRU, DUP) that leave matched packets up
+ *   for additional processing by subsequent flow rules.
+ *
+ * - Other non-terminating meta actions that do not affect the fate of
+ *   packets (END, VOID, MARK, FLAG, COUNT).
+ *
+ * When several actions are combined in a flow rule, they should all have
+ * different types (e.g. dropping a packet twice is not possible). The
+ * defined behavior is for PMDs to only take into account the last action of
+ * a given type found in the list. PMDs still perform error checking on the
+ * entire list.
+ *
+ * Note that PASSTHRU is the only action able to override a terminating
+ * rule.
+ */
+enum rte_flow_action_type {
+	/**
+	 * [META]
+	 *
+	 * End marker for action lists. Prevents further processing of
+	 * actions, thereby ending the list.
+	 *
+	 * No associated configuration structure.
+	 */
+	RTE_FLOW_ACTION_TYPE_END,
+
+	/**
+	 * [META]
+	 *
+	 * Used as a placeholder for convenience. It is ignored and simply
+	 * discarded by PMDs.
+	 *
+	 * No associated configuration structure.
+	 */
+	RTE_FLOW_ACTION_TYPE_VOID,
+
+	/**
+	 * Leaves packets up for additional processing by subsequent flow
+	 * rules. This is the default when a rule does not contain a
+	 * terminating action, but can be specified to force a rule to
+	 * become non-terminating.
+	 *
+	 * No associated configuration structure.
+	 */
+	RTE_FLOW_ACTION_TYPE_PASSTHRU,
+
+	/**
+	 * [META]
+	 *
+	 * Attaches a 32-bit value to packets.
+	 *
+	 * See struct rte_flow_action_mark.
+	 */
+	RTE_FLOW_ACTION_TYPE_MARK,
+
+	/**
+	 * [META]
+	 *
+	 * Flag packets. Similar to MARK but only affects ol_flags.
+	 *
+	 * Note: a distinctive flag must be defined for it.
+	 *
+	 * No associated configuration structure.
+	 */
+	RTE_FLOW_ACTION_TYPE_FLAG,
+
+	/**
+	 * Assigns packets to a given queue index.
+	 *
+	 * See struct rte_flow_action_queue.
+	 */
+	RTE_FLOW_ACTION_TYPE_QUEUE,
+
+	/**
+	 * Drops packets.
+	 *
+	 * PASSTHRU overrides this action if both are specified.
+	 *
+	 * No associated configuration structure.
+	 */
+	RTE_FLOW_ACTION_TYPE_DROP,
+
+	/**
+	 * [META]
+	 *
+	 * Enables counters for this rule.
+	 *
+	 * These counters can be retrieved and reset through rte_flow_query(),
+	 * see struct rte_flow_query_count.
+	 *
+	 * No associated configuration structure.
+	 */
+	RTE_FLOW_ACTION_TYPE_COUNT,
+
+	/**
+	 * Duplicates packets to a given queue index.
+	 *
+	 * This is normally combined with QUEUE, however when used alone, it
+	 * is actually similar to QUEUE + PASSTHRU.
+	 *
+	 * See struct rte_flow_action_dup.
+	 */
+	RTE_FLOW_ACTION_TYPE_DUP,
+
+	/**
+	 * Similar to QUEUE, except RSS is additionally performed on packets
+	 * to spread them among several queues according to the provided
+	 * parameters.
+	 *
+	 * See struct rte_flow_action_rss.
+	 */
+	RTE_FLOW_ACTION_TYPE_RSS,
+
+	/**
+	 * Redirects packets to the physical function (PF) of the current
+	 * device.
+	 *
+	 * No associated configuration structure.
+	 */
+	RTE_FLOW_ACTION_TYPE_PF,
+
+	/**
+	 * Redirects packets to the virtual function (VF) of the current
+	 * device with the specified ID.
+	 *
+	 * See struct rte_flow_action_vf.
+	 */
+	RTE_FLOW_ACTION_TYPE_VF,
+};
+
+/**
+ * RTE_FLOW_ACTION_TYPE_MARK
+ *
+ * Attaches a 32-bit value to packets.
+ *
+ * This value is arbitrary and application-defined. For compatibility with
+ * FDIR it is returned in the hash.fdir.hi mbuf field. PKT_RX_FDIR_ID is
+ * also set in ol_flags.
+ */
+struct rte_flow_action_mark {
+	uint32_t id; /**< 32-bit value to return with packets. */
+};
+
+/**
+ * RTE_FLOW_ACTION_TYPE_QUEUE
+ *
+ * Assigns packets to a given queue index.
+ *
+ * Terminating by default.
+ */
+struct rte_flow_action_queue {
+	uint16_t index; /**< Queue index to use. */
+};
+
+/**
+ * RTE_FLOW_ACTION_TYPE_COUNT (query)
+ *
+ * Query structure to retrieve and reset flow rule counters.
+ */
+struct rte_flow_query_count {
+	uint32_t reset:1; /**< Reset counters after query [in]. */
+	uint32_t hits_set:1; /**< hits field is set [out]. */
+	uint32_t bytes_set:1; /**< bytes field is set [out]. */
+	uint32_t reserved:29; /**< Reserved, must be zero [in, out]. */
+	uint64_t hits; /**< Number of hits for this rule [out]. */
+	uint64_t bytes; /**< Number of bytes through this rule [out]. */
+};
+
+/**
+ * RTE_FLOW_ACTION_TYPE_DUP
+ *
+ * Duplicates packets to a given queue index.
+ *
+ * This is normally combined with QUEUE, however when used alone, it is
+ * actually similar to QUEUE + PASSTHRU.
+ *
+ * Non-terminating by default.
+ */
+struct rte_flow_action_dup {
+	uint16_t index; /**< Queue index to duplicate packets to. */
+};
+
+/**
+ * RTE_FLOW_ACTION_TYPE_RSS
+ *
+ * Similar to QUEUE, except RSS is additionally performed on packets to
+ * spread them among several queues according to the provided parameters.
+ *
+ * Note: RSS hash result is normally stored in the hash.rss mbuf field,
+ * however it conflicts with the MARK action as they share the same
+ * space. When both actions are specified, the RSS hash is discarded and
+ * PKT_RX_RSS_HASH is not set in ol_flags. MARK has priority. The mbuf
+ * structure should eventually evolve to store both.
+ *
+ * Terminating by default.
+ */
+struct rte_flow_action_rss {
+	const struct rte_eth_rss_conf *rss_conf; /**< RSS parameters. */
+	uint16_t queues; /**< Number of entries in queue[]. */
+	uint16_t queue[]; /**< Queue indices to use. */
+};
+
+/**
+ * RTE_FLOW_ACTION_TYPE_VF
+ *
+ * Redirects packets to a virtual function (VF) of the current device.
+ *
+ * Packets matched by a VF pattern item can be redirected to their original
+ * VF ID instead of the specified one. This parameter may not be available
+ * and is not guaranteed to work properly if the VF part is matched by a
+ * prior flow rule or if packets are not addressed to a VF in the first
+ * place.
+ *
+ * Terminating by default.
+ */
+struct rte_flow_action_vf {
+	uint32_t original:1; /**< Use original VF ID if possible. */
+	uint32_t reserved:31; /**< Reserved, must be zero. */
+	uint32_t id; /**< VF ID to redirect packets to. */
+};
+
+/**
+ * Definition of a single action.
+ *
+ * A list of actions is terminated by an END action.
+ *
+ * For simple actions without a configuration structure, conf remains NULL.
+ */
+struct rte_flow_action {
+	enum rte_flow_action_type type; /**< Action type. */
+	const void *conf; /**< Pointer to action configuration structure. */
+};
+
+/**
+ * Opaque type returned after successfully creating a flow.
+ *
+ * This handle can be used to manage and query the related flow (e.g. to
+ * destroy it or retrieve counters).
+ */
+struct rte_flow;
+
+/**
+ * Verbose error types.
+ *
+ * Most of them provide the type of the object referenced by struct
+ * rte_flow_error.cause.
+ */
+enum rte_flow_error_type {
+	RTE_FLOW_ERROR_TYPE_NONE, /**< No error. */
+	RTE_FLOW_ERROR_TYPE_UNSPECIFIED, /**< Cause unspecified. */
+	RTE_FLOW_ERROR_TYPE_HANDLE, /**< Flow rule (handle). */
+	RTE_FLOW_ERROR_TYPE_ATTR_GROUP, /**< Group field. */
+	RTE_FLOW_ERROR_TYPE_ATTR_PRIORITY, /**< Priority field. */
+	RTE_FLOW_ERROR_TYPE_ATTR_INGRESS, /**< Ingress field. */
+	RTE_FLOW_ERROR_TYPE_ATTR_EGRESS, /**< Egress field. */
+	RTE_FLOW_ERROR_TYPE_ATTR, /**< Attributes structure. */
+	RTE_FLOW_ERROR_TYPE_ITEM_NUM, /**< Pattern length. */
+	RTE_FLOW_ERROR_TYPE_ITEM, /**< Specific pattern item. */
+	RTE_FLOW_ERROR_TYPE_ACTION_NUM, /**< Number of actions. */
+	RTE_FLOW_ERROR_TYPE_ACTION, /**< Specific action. */
+};
+
+/**
+ * Verbose error structure definition.
+ *
+ * This object is normally allocated by applications and set by PMDs; the
+ * message points to a constant string which does not need to be freed by
+ * the application. However, its pointer can be considered valid only as
+ * long as its associated DPDK port remains configured. Closing the
+ * underlying device or unloading the PMD invalidates it.
+ *
+ * Both cause and message may be NULL regardless of the error type.
+ */
+struct rte_flow_error {
+	enum rte_flow_error_type type; /**< Cause field and error types. */
+	const void *cause; /**< Object responsible for the error. */
+	const char *message; /**< Human-readable error message. */
+};
+
+/**
+ * Check whether a flow rule can be created on a given port.
+ *
+ * While this function has no effect on the target device, the flow rule is
+ * validated against its current configuration state and the returned value
+ * should be considered valid by the caller for that state only.
+ *
+ * The returned value is guaranteed to remain valid only as long as no
+ * successful calls to rte_flow_create() or rte_flow_destroy() are made in
+ * the meantime and no device parameters affecting flow rules are modified
+ * in any way, due to possible collisions or resource limitations (although
+ * in such cases EINVAL should not be returned).
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param[in] attr
+ *   Flow rule attributes.
+ * @param[in] pattern
+ *   Pattern specification (list terminated by the END pattern item).
+ * @param[in] actions
+ *   Associated actions (list terminated by the END action).
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   0 if flow rule is valid and can be created. A negative errno value
+ *   otherwise (rte_errno is also set), the following errors are defined:
+ *
+ *   -ENOSYS: underlying device does not support this functionality.
+ *
+ *   -EINVAL: unknown or invalid rule specification.
+ *
+ *   -ENOTSUP: valid but unsupported rule specification (e.g. partial
+ *   bit-masks are unsupported).
+ *
+ *   -EEXIST: collision with an existing rule.
+ *
+ *   -ENOMEM: not enough resources.
+ *
+ *   -EBUSY: action cannot be performed due to busy device resources, may
+ *   succeed if the affected queues or even the entire port are in a stopped
+ *   state (see rte_eth_dev_rx_queue_stop() and rte_eth_dev_stop()).
+ */
+int
+rte_flow_validate(uint8_t port_id,
+		  const struct rte_flow_attr *attr,
+		  const struct rte_flow_item pattern[],
+		  const struct rte_flow_action actions[],
+		  struct rte_flow_error *error);
+
+/**
+ * Create a flow rule on a given port.
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param[in] attr
+ *   Flow rule attributes.
+ * @param[in] pattern
+ *   Pattern specification (list terminated by the END pattern item).
+ * @param[in] actions
+ *   Associated actions (list terminated by the END action).
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   A valid handle in case of success, NULL otherwise and rte_errno is set
+ *   to the positive version of one of the error codes defined for
+ *   rte_flow_validate().
+ */
+struct rte_flow *
+rte_flow_create(uint8_t port_id,
+		const struct rte_flow_attr *attr,
+		const struct rte_flow_item pattern[],
+		const struct rte_flow_action actions[],
+		struct rte_flow_error *error);
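To illustrate the intended call flow, here is a sketch (port setup and error handling abbreviated; the helper name is invented) that validates and then creates a rule directing all ingress IPv4 traffic to Rx queue 1:

```c
#include <stddef.h>

#include <rte_flow.h>

/* Hypothetical helper: create a rule sending all ingress IPv4 traffic
 * on port_id to Rx queue 1. Returns NULL on error with rte_errno set
 * and error filled in if non-NULL. */
static struct rte_flow *
ipv4_to_queue_1(uint8_t port_id, struct rte_flow_error *error)
{
	struct rte_flow_attr attr = { .ingress = 1 };
	struct rte_flow_item pattern[] = {
		{ .type = RTE_FLOW_ITEM_TYPE_ETH },
		/* NULL spec matches any IPv4 packet. */
		{ .type = RTE_FLOW_ITEM_TYPE_IPV4 },
		{ .type = RTE_FLOW_ITEM_TYPE_END },
	};
	struct rte_flow_action_queue queue = { .index = 1 };
	struct rte_flow_action actions[] = {
		{ .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
		{ .type = RTE_FLOW_ACTION_TYPE_END },
	};

	if (rte_flow_validate(port_id, &attr, pattern, actions, error))
		return NULL;
	return rte_flow_create(port_id, &attr, pattern, actions, error);
}
```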
+
+/**
+ * Destroy a flow rule on a given port.
+ *
+ * Failure to destroy a flow rule handle may occur when other flow rules
+ * depend on it, and destroying it would result in an inconsistent state.
+ *
+ * This function is only guaranteed to succeed if handles are destroyed in
+ * reverse order of their creation.
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param flow
+ *   Flow rule handle to destroy.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+rte_flow_destroy(uint8_t port_id,
+		 struct rte_flow *flow,
+		 struct rte_flow_error *error);
+
+/**
+ * Destroy all flow rules associated with a port.
+ *
+ * In the unlikely event of failure, handles are still considered destroyed
+ * and no longer valid but the port must be assumed to be in an inconsistent
+ * state.
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+rte_flow_flush(uint8_t port_id,
+	       struct rte_flow_error *error);
+
+/**
+ * Query an existing flow rule.
+ *
+ * This function allows retrieving flow-specific data such as counters.
+ * Data is gathered by special actions which must be present in the flow
+ * rule definition.
+ *
+ * @param port_id
+ *   Port identifier of Ethernet device.
+ * @param flow
+ *   Flow rule handle to query.
+ * @param action
+ *   Action type to query.
+ * @param[in, out] data
+ *   Pointer to storage for the associated query data type.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+rte_flow_query(uint8_t port_id,
+	       struct rte_flow *flow,
+	       enum rte_flow_action_type action,
+	       void *data,
+	       struct rte_flow_error *error);
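As a sketch, assuming the rule was created with a COUNT action (helper name invented; field validity depends on the hits_set/bytes_set flags):

```c
#include <stdint.h>

#include <rte_flow.h>

/* Hypothetical helper: read and reset the counters attached to a flow
 * rule created with RTE_FLOW_ACTION_TYPE_COUNT. */
static int
read_and_reset_counters(uint8_t port_id, struct rte_flow *flow,
			uint64_t *hits, struct rte_flow_error *error)
{
	struct rte_flow_query_count count = { .reset = 1 };
	int ret;

	ret = rte_flow_query(port_id, flow, RTE_FLOW_ACTION_TYPE_COUNT,
			     &count, error);
	if (ret)
		return ret;
	/* hits is only meaningful when the PMD sets hits_set. */
	*hits = count.hits_set ? count.hits : 0;
	return 0;
}
```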
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_FLOW_H_ */
diff --git a/lib/librte_ether/rte_flow_driver.h b/lib/librte_ether/rte_flow_driver.h
new file mode 100644
index 0000000..a88c621
--- /dev/null
+++ b/lib/librte_ether/rte_flow_driver.h
@@ -0,0 +1,177 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright 2016 6WIND S.A.
+ *   Copyright 2016 Mellanox.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of 6WIND S.A. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef RTE_FLOW_DRIVER_H_
+#define RTE_FLOW_DRIVER_H_
+
+/**
+ * @file
+ * RTE generic flow API (driver side)
+ *
+ * This file provides implementation helpers for internal use by PMDs; they
+ * are not intended to be exposed to applications and are not subject to ABI
+ * versioning.
+ */
+
+#include <stdint.h>
+
+#include <rte_errno.h>
+#include "rte_flow.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Generic flow operations structure implemented and returned by PMDs.
+ *
+ * To implement this API, PMDs must handle the RTE_ETH_FILTER_GENERIC filter
+ * type in their .filter_ctrl callback function (struct eth_dev_ops) as well
+ * as the RTE_ETH_FILTER_GET filter operation.
+ *
+ * If successful, this operation must result in a pointer to a PMD-specific
+ * struct rte_flow_ops written to the argument address as described below:
+ *
+ *  // PMD filter_ctrl callback
+ *
+ *  static const struct rte_flow_ops pmd_flow_ops = { ... };
+ *
+ *  switch (filter_type) {
+ *  case RTE_ETH_FILTER_GENERIC:
+ *      if (filter_op != RTE_ETH_FILTER_GET)
+ *          return -EINVAL;
+ *      *(const void **)arg = &pmd_flow_ops;
+ *      return 0;
+ *  }
+ *
+ * See also rte_flow_ops_get().
+ *
+ * These callback functions are not supposed to be used by applications
+ * directly, which must rely on the API defined in rte_flow.h.
+ *
+ * Public-facing wrapper functions perform a few consistency checks so that
+ * unimplemented (i.e. NULL) callbacks simply return -ENOTSUP. These
+ * callbacks otherwise differ only by their first argument (the port ID
+ * already resolved to a pointer to struct rte_eth_dev).
+ */
+struct rte_flow_ops {
+	/** See rte_flow_validate(). */
+	int (*validate)
+		(struct rte_eth_dev *,
+		 const struct rte_flow_attr *,
+		 const struct rte_flow_item [],
+		 const struct rte_flow_action [],
+		 struct rte_flow_error *);
+	/** See rte_flow_create(). */
+	struct rte_flow *(*create)
+		(struct rte_eth_dev *,
+		 const struct rte_flow_attr *,
+		 const struct rte_flow_item [],
+		 const struct rte_flow_action [],
+		 struct rte_flow_error *);
+	/** See rte_flow_destroy(). */
+	int (*destroy)
+		(struct rte_eth_dev *,
+		 struct rte_flow *,
+		 struct rte_flow_error *);
+	/** See rte_flow_flush(). */
+	int (*flush)
+		(struct rte_eth_dev *,
+		 struct rte_flow_error *);
+	/** See rte_flow_query(). */
+	int (*query)
+		(struct rte_eth_dev *,
+		 struct rte_flow *,
+		 enum rte_flow_action_type,
+		 void *,
+		 struct rte_flow_error *);
+};
+
+/**
+ * Initialize generic flow error structure.
+ *
+ * This function also sets rte_errno to a given value.
+ *
+ * @param[out] error
+ *   Pointer to flow error structure (may be NULL).
+ * @param code
+ *   Related error code (rte_errno).
+ * @param type
+ *   Cause field and error types.
+ * @param cause
+ *   Object responsible for the error.
+ * @param message
+ *   Human-readable error message.
+ *
+ * @return
+ *   Pointer to flow error structure.
+ */
+static inline struct rte_flow_error *
+rte_flow_error_set(struct rte_flow_error *error,
+		   int code,
+		   enum rte_flow_error_type type,
+		   void *cause,
+		   const char *message)
+{
+	if (error) {
+		*error = (struct rte_flow_error){
+			.type = type,
+			.cause = cause,
+			.message = message,
+		};
+	}
+	rte_errno = code;
+	return error;
+}
+
+/**
+ * Get generic flow operations structure from a port.
+ *
+ * @param port_id
+ *   Port identifier to query.
+ * @param[out] error
+ *   Pointer to flow error structure.
+ *
+ * @return
+ *   The flow operations structure associated with port_id, NULL in case of
+ *   error, in which case rte_errno is set and the error structure contains
+ *   additional details.
+ */
+const struct rte_flow_ops *
+rte_flow_ops_get(uint8_t port_id, struct rte_flow_error *error);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* RTE_FLOW_DRIVER_H_ */