[dpdk-dev] [RFC 1/2] net/tap: add eBPF to TAP device

Ophir Munk ophirmu at mellanox.com
Thu Nov 30 09:01:09 CET 2017


The DPDK traffic classifier is the rte_flow API, and the tap PMD must
support it, including RSS queue mapping actions. An example use case for
this requirement is failsafe transparently switching from a PCI device to
a TAP device while the RSS queues remain the same on both devices.

TC was chosen as the TAP classifier, but TC alone does not support RSS
queue mapping. This commit uses a combination of TC rules and eBPF
actions in order to support TAP RSS.

eBPF requires Linux version 3.19: it is effective only when running on an
appropriate kernel, and it must be compiled against matching Linux kernel
headers. If the kernel headers do not include the eBPF definitions, a
warning is issued at compile time and TAP RSS is not supported.

Signed-off-by: Pascal Mazon <pascal.mazon at 6wind.com>
Signed-off-by: Ophir Munk <ophirmu at mellanox.com>
---

This RFC suggests using a combination of TC rules and eBPF actions in order
to support TAP RSS.
The C source file (tap_bpf_insns.c) contains the eBPF "assembly
instructions" in the form of an array of struct bpf_insn. This array is
passed to the kernel for execution via the BPF system call.
The C source file (tap_bpf_program.c) from which the "assembly
instructions" were generated is included in the TAP source tree; however,
it does not take part in the DPDK compilation.
The TAP documentation will detail the process of generating the eBPF
"assembly instructions".
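
As an illustration only (not part of this patch), loading such an
instruction array boils down to a bpf(2) system call, roughly:

    /* Minimal sketch: load eBPF instructions via the bpf(2) syscall.
     * Function name and setup are illustrative only.
     */
    #include <linux/bpf.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int
    load_bpf_prog(const struct bpf_insn *insns, uint32_t insns_cnt)
    {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.prog_type = BPF_PROG_TYPE_SCHED_ACT;
            attr.insns = (uintptr_t)insns;
            attr.insn_cnt = insns_cnt;
            attr.license = (uintptr_t)"GPL";
            /* On success, returns a program fd usable as a TC action. */
            return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
    }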

eBPF programs controlled from the tap PMD will be used to match packets,
compute a hash given the configured key, and send packets using the desired
queue. In an eBPF program, it is typically not possible to edit the
queue_mapping field in the skb to direct the packet to the correct queue.
That part would be addressed by chaining a ``skbedit queue_mapping`` action.

A packet would go through these TC rules (on the local side of the tap netdevice):

 +-----+---------------------------+----------------------------------+----------+
 |PRIO |           Match           |              Action 1            | Action 2 |
 +=====+===========================+==================================+==========+
 |  1  | marked?                   | skbedit queue 'mark' --> DPDK    |          |
 +-----+---------------------------+----------------------------------+----------+
 |  2  | marked?                   | skbedit queue 'mark' --> DPDK    |          |
 +-----+---------------------------+----------------------------------+----------+
 | ... |                           |                                  |          |
 +-----+---------------------------+----------------------------------+----------+
 |  x  | ANY                       | BPF: append NULL 32bits for hash |          |
 |     |                           |                                  |          |
 +-----+---------------------------+----------------------------------+----------+
 |x + 1| ACTUAL FLOW RULE 1 MATCH  | ...                              |          |
 |     |                           |                                  |          |
 +-----+---------------------------+----------------------------------+----------+
 |x + 2| ACTUAL FLOW RULE 2 MATCH  | ...                              |          |
 |     |                           |                                  |          |
 +-----+---------------------------+----------------------------------+----------+
 | ... |                           |                                  |          |
 +-----+---------------------------+----------------------------------+----------+
 |  y  | FLOW RULE RSS 1 MATCH     | BPF compute hash into packet     |reclassify|
 |     |                           | tailroom && set queue in skb->cb |          |
 +-----+---------------------------+----------------------------------+----------+
 |y + 1| FLOW RULE RSS 2 MATCH     | BPF compute hash into packet     |reclassify|
 |     |                           | tailroom && set queue in skb->cb |          |
 +-----+---------------------------+----------------------------------+----------+
 | ... |                           |                                  |          |
 +-----+---------------------------+----------------------------------+----------+
 |  z  | ANY (default RSS)         | BPF compute hash into packet     |reclassify|
 |     |                           | tailroom && set queue in skb->cb |          |
 +-----+---------------------------+----------------------------------+----------+
 |  z  | ANY (isolate mode)        | DROP                             |          |
 +-----+---------------------------+----------------------------------+----------+

 Rules 1..x match marked packets and redirect them to their queues. On the
 first classification pass, packets are not yet marked, so they are not
 redirected. Only when they later go through the RSS rules y..z does BPF
 compute the RSS hash, set the queue in skb->cb, and reclassify the packet.
 The packet then goes through rules 1..x again, this time marked, and is
 redirected to its queue.
 Rules (x+1)..y are the non-RSS TC rules already used in DPDK versions
 prior to 18.02.

 doc/guides/prog_guide/rte_flow_tap.rst | 962 +++++++++++++++++++++++++++++++++
 drivers/net/tap/Makefile               |   6 +-
 drivers/net/tap/rte_eth_tap.h          |   7 +-
 drivers/net/tap/tap_bpf_elf.h          |  56 ++
 drivers/net/tap/tap_flow.c             | 336 ++++++++----
 5 files changed, 1263 insertions(+), 104 deletions(-)
 create mode 100644 doc/guides/prog_guide/rte_flow_tap.rst
 create mode 100644 drivers/net/tap/tap_bpf_elf.h

diff --git a/doc/guides/prog_guide/rte_flow_tap.rst b/doc/guides/prog_guide/rte_flow_tap.rst
new file mode 100644
index 0000000..04ddda6
--- /dev/null
+++ b/doc/guides/prog_guide/rte_flow_tap.rst
@@ -0,0 +1,962 @@
+=====================================
+Flow API support in TAP PMD, using TC
+=====================================
+
+.. contents::
+.. sectnum::
+
+.. footer::
+
+   v0.8 - page ###Page###
+
+.. raw:: pdf
+
+  PageBreak
+
+Rationale
+=========
+
+For this project, the tap PMD has to receive selected traffic from a different
+netdevice (refer to *VM migration with Microsoft Hyper-V and Mellanox
+ConnectX-3* document) and only cover the same set of rules as supported by the
+mlx4 PMD.
+
+The DPDK traffic classifier is the rte_flow API, and the tap PMD must therefore
+implement it. For that, TC was chosen for several reasons:
+
+- it happens very early in the kernel stack for ingress (faster than netfilter).
+- it supports dropping packets given a specific flow.
+- it supports redirecting packets to a different netdevice.
+- it has a "flower" classifier type that mostly covers the pattern items in
+  rte_flow.
+- it can be configured through a netlink socket, without an external tool.
+
+Modes of operation
+==================
+
+There should be two modes of operation for the tap PMD regarding rte_flow:
+*local* and *remote*. Only one mode can be in use at a time for a specific tap
+interface.
+
+The *local* mode would be the default one, if no specific parameter is specified
+in the command line. To start the application with tap in *remote* mode, set the
+``remote`` tap parameter to the interface you want to redirect packets from,
+e.g.::
+
+   testpmd -n 4 -c 0xf -m 1024 --vdev=net_tap,iface=tap0,remote=eth3 -- \
+     -i --burst=64 --coremask=0x2
+
+*Local* mode
+------------
+
+In *local* mode, flow rules would be applied as-is, on the tap netdevice itself
+(e.g.: ``tap0``).
+
+The typical use case is a Linux program (e.g. a webserver) communicating
+with the DPDK app through the tap netdevice::
+
+   +-------------------------+
+   |    DPDK application     |
+   +-------------------------+
+     |                     ^
+     | rte_flow   rte_flow |
+     v egress      ingress |
+   +-------------------------+
+   |        Tap PMD          |
+   +-------------------------+
+     |                     ^
+     | TC               TC |
+     v ingress      egress |
+   +-------------------------+              +-------------------------+
+   |                         |<-------------|                         |
+   |   Tap netdevice (tap0)  |              |  Linux app (webserver)  |
+   |                         |------------->|                         |
+   +-------------------------+              +-------------------------+
+
+.. raw:: pdf
+
+  PageBreak
+
+*Remote* mode
+-------------
+
+In *remote* mode, flow rules would be applied on the tap netdevice (e.g.:
+``tap0``), and use a similar match to redirect specific packets from another
+netdevice (e.g.: ``eth3``, a NetVSC netdevice in our project scenario)::
+
+   +-------------------------+
+   |    DPDK application     |
+   +-------------------------+
+     |                     ^
+     | rte_flow   rte_flow |
+     v egress      ingress |
+   +-------------------------+
+   |        Tap PMD          |
+   +-------------------------+
+     |                     ^
+     | TC               TC |
+     v ingress      egress |
+   +-------------------------+              +-------------------------+
+   |                         |<------------------redirection-------\  |
+   |   Tap netdevice (tap0)  |              |                       | |
+   |                         |------------->|-\       eth3          | |
+   +-------------------------+              +--|--------------------|-+
+                                               | TC              TC ^
+                                               | egress     ingress |
+                                               v                    |
+
+.. raw:: pdf
+
+  PageBreak
+
+rte_flow rules conversion
+=========================
+
+Netlink
+-------
+
+The only way to create TC rules in the kernel is through netlink messages.
+Two possibilities arise for managing TC rules:
+
+- Using native netlink API calls in the tap PMD
+- Calling the ``tc`` command from iproute2 inside our PMD, via ``system()``.
+
+The former approach is chosen, as library calls are faster than switching
+context and executing an external program from within the tap PMD. Moreover,
+the kernel TC API might expose features not yet implemented in iproute2, and a
+custom implementation enables finer tuning and better control.
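+
+As a minimal sketch (illustrative only, with most error handling omitted),
+the plumbing needed to talk TC over netlink is a raw ``NETLINK_ROUTE``
+socket::
+
+   #include <linux/netlink.h>
+   #include <sys/socket.h>
+   #include <unistd.h>
+
+   static int
+   tc_nl_socket(void)
+   {
+           struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
+           int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+
+           if (fd >= 0 && bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+                   close(fd);
+                   return -1;
+           }
+           return fd;
+   }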
+
+..
+   Some implementations for TC configuration through Netlink exist already. It's a
+   good source of inspiration on how to do it:
+
+   - iproute2's tc `source code`__
+   - ovs's tc implementation__ (not yet upstream)
+
+   __ https://github.com/shemminger/iproute2/tree/master/tc
+   __ https://mail.openvswitch.org/pipermail/ovs-dev/2016-November/324693.html
+
+Conversion examples
+-------------------
+
+Here are a few examples of rules and how they can be translated from rte_flow
+rules to TC rules. rte_flow rules will be expressed using testpmd's ``flow``
+command syntax, while TC rules will use iproute2 ``tc`` command syntax.
+
+**Notes**:
+   - rte_flow ``ingress`` direction can be translated into a TC ``egress`` rule,
+     and vice versa, when it applies to a tap interface, as TC considers the
+     kernel netdevice standpoint.
+   - in TC, redirecting a packet works by taking a packet from ``ingress`` and
+     sending to another device's ``egress``.
+
+*Local* mode
+~~~~~~~~~~~~
+
+#. Flow rule to give packets coming on the ``tap0`` interface to RX queue 0:
+
+   Using rte_flow::
+
+      flow validate 0 ingress pattern port index is 0 / end \
+        actions queue index 0 / end
+
+   Using ``tc``::
+
+      tc filter add dev tap0 parent 1: flower indev tap0 \
+        action skbedit queue_mapping 0
+
+#. Flow rule to get packets with source mac ``de:ad:ca:fe:00:02`` on RX queue 2:
+
+   Using rte_flow::
+
+      flow create 0 ingress pattern eth src is de:ad:ca:fe:00:02 / end \
+        actions queue 2 / end
+
+   Using ``tc``::
+
+      tc filter add dev tap0 parent 1: flower src_mac de:ad:ca:fe:00:02 \
+        action skbedit queue_mapping 2
+
+#. Flow rule to drop packets matching specific 5-tuple info:
+
+   Using rte_flow::
+
+      flow create 0 ingress pattern eth dst is 3a:80:ce:61:36:54 \
+        src is 52:43:7b:fd:ac:f3 / ipv4 src is 1.1.1.1 dst is 2.2.2.2 \
+        / udp src is 4444 dst is 5555 / end actions drop / end
+
+   Using ``tc``::
+
+      tc filter add dev tap0 parent 1: flower dst_mac 3a:80:ce:61:36:54 \
+        src_mac 52:43:7b:fd:ac:f3 eth_type ip src_ip 1.1.1.1 dst_ip 2.2.2.2 \
+        ip_proto udp src_port 4444 dst_port 5555 action drop
+
+*Remote* mode
+~~~~~~~~~~~~~
+
+In *remote* mode, an additional rule for redirecting packets is systematically
+required. The examples are similar to those in the previous section (the
+rte_flow rule will thus be omitted).
+
+#. TC rules to give packets coming on the ``eth3`` interface to ``tap0`` RX
+   queue 0::
+
+      # redirection rule
+      tc filter add dev eth3 parent ffff: flower indev eth3 \
+        action mirred egress redirect dev tap0
+      # actual tap rule
+      tc filter add dev tap0 parent 1: flower indev tap0 \
+        action skbedit queue_mapping 0
+
+#. TC rules to get packets with source mac ``de:ad:ca:fe:00:02`` on RX queue 2::
+
+      # redirection rule
+      tc filter add dev eth3 parent ffff: flower src_mac de:ad:ca:fe:00:02 \
+        action mirred egress redirect dev tap0
+      # actual tap rule
+      tc filter add dev tap0 parent 1: flower src_mac de:ad:ca:fe:00:02 \
+        action skbedit queue_mapping 2
+
+#. TC rules to drop packets matching specific 5-tuple info::
+
+      # redirection rule
+      tc filter add dev eth3 parent ffff:  flower dst_mac 3a:80:ce:61:36:54 \
+        src_mac 52:43:7b:fd:ac:f3 eth_type ip src_ip 1.1.1.1 dst_ip 2.2.2.2 \
+        ip_proto udp src_port 4444 dst_port 5555 \
+        action mirred egress redirect dev tap0
+      # actual tap rule
+      tc filter add dev tap0 parent 1: flower dst_mac 3a:80:ce:61:36:54 \
+        src_mac 52:43:7b:fd:ac:f3 eth_type ip src_ip 1.1.1.1 dst_ip 2.2.2.2 \
+        ip_proto udp src_port 4444 dst_port 5555 action drop
+
+One last thing, to redirect packets the other way around (from ``tap0`` to
+``eth3``), we would use a similar rule, exchanging interfaces and using an
+appropriate match, e.g.::
+
+   tc filter add dev tap0 parent ffff: flower indev tap0 \
+     action mirred egress redirect dev eth3
+
+..
+   **Note:** ``parent ffff:`` is for TC ``ingress`` while ``parent 1:`` is for TC
+   ``egress``.
+
+Broadcast and promiscuous support
++++++++++++++++++++++++++++++++++
+
+*Remote* mode requirements:
+
+#. When turning the tap netdevice promiscuous, the remote netdevice should
+   implicitly be turned promiscuous too, to get as many packets as possible.
+
+#. Packets matching the destination MAC configured in the tap PMD should be
+   redirected from the remote without being processed by the stack there in the
+   kernel.
+
+#. In promiscuous mode, an incoming packet should be duplicated to be processed
+   both by the tap PMD and the remote netdevice itself.
+
+#. Incoming packets with broadcast destination MAC (i.e.: ``ff:ff:ff:ff:ff:ff``)
+   should be duplicated to be processed both by the tap PMD and the remote
+   netdevice itself.
+
+#. Incoming packets with IPv6 multicast destination MAC (i.e.:
+   ``33:33:00:00:00:00/33:33:00:00:00:00``) should be duplicated to be processed
+   both by the tap PMD and the remote netdevice itself.
+
+#. Incoming packets with broadcast/multicast bit set in the destination MAC
+   (i.e.: ``01:00:00:00:00:00/01:00:00:00:00:00``) should be duplicated to be
+   processed both by the tap PMD and the remote netdevice itself.
+
+Each of these requirements (except the first one) can be directly translated
+into a TC rule, e.g.::
+
+   # local mac (notice the REDIRECT for mirred action):
+   tc filter add dev eth3 parent ffff: prio 1 flower dst_mac de:ad:be:ef:01:02 \
+     action mirred egress redirect dev tap0
+
+   # tap promisc:
+   tc filter add dev eth3 parent ffff: prio 2 basic \
+     action mirred egress mirror dev tap0
+
+   # broadcast:
+   tc filter add dev eth3 parent ffff: prio 3 flower dst_mac ff:ff:ff:ff:ff:ff \
+     action mirred egress mirror dev tap0
+
+   # broadcast v6 (can't express mac_mask with tc, but it works via netlink):
+   tc filter add dev eth3 parent ffff: prio 4 flower dst_mac 33:33:00:00:00:00 \
+     action mirred egress mirror dev tap0
+
+   # all_multi (can't express mac_mask with tc, but it works via netlink):
+   tc filter add dev eth3 parent ffff: prio 5 flower dst_mac 01:00:00:00:00:00 \
+     action mirred egress mirror dev tap0
+
+When promiscuous mode is switched off or on, the promiscuous TC rule (prio 2
+above) will be modified to have respectively an empty action (``continue``) or
+the ``mirror`` action.
+
+The first 5 priorities are always reserved, and can only be used for these
+filters.
+
+On top of that, the tap PMD can configure explicit rte_flow rules, translated as
+TC rules on both the remote netdevice and the tap netdevice. On the remote,
+those would need to be processed after the default rules handling promiscuous
+mode, broadcast and all_multi packets.
+
+When using the ``mirror`` action, the packet is duplicated and sent to the tap
+netdevice, while the original packet gets directly processed by the kernel
+without going through later TC rules for the remote. On the tap netdevice, the
+duplicated packet will go through tap TC rules and be classified depending on
+those rules.
+
+**Note:** It is possible to combine a ``mirror`` action and a ``continue``
+action for a single TC rule. Then the original packet would undergo remaining TC
+rules on the remote netdevice side.
+
+When using the ``redirect`` action, the behavior is similar on the tap side, but
+the packet is not duplicated, no further kernel processing is done for the
+remote side.
+
+The following diagram sums it up. A packet that matches a TC rule follows the
+associated action (the number in the diamond represents the rule prio as set
+in the above TC rules)::
+
+
+                        Incoming packet |
+                       on remote (eth3) |
+                                        | Going through
+                                        | TC ingress rules
+                                        v
+                                       / \
+                                      / 1 \
+                                     /     \    yes
+                                    /  mac  \____________________> tap0
+                                    \ match?/    redirected pkt
+                                     \     /
+                                      \   /
+                                       \ /
+                                        V no, then continue
+                                        | with TC rules
+                                        |
+                                        v
+                                       / \
+                                      / 2 \
+         eth3                  yes   /     \    yes
+       kernel <____________________ /promisc\____________________> tap0
+        stack      original pkt     \ match?/    duplicated pkt
+                                     \     /
+                                      \   /
+                                       \ /
+                                        V no, then continue
+                                        | with TC rules
+                                        |
+                                        v
+                                       / \
+                                      / 3 \
+         eth3                  yes   /     \    yes
+       kernel <____________________ / bcast \____________________> tap0
+        stack      original pkt     \ match?/    duplicated pkt
+                                     \     /
+                                      \   /
+                                       \ /
+                                        V no, then continue
+                                        | with TC rules
+                                        |
+                                        v
+                                       / \
+                                      / 4 \
+         eth3                  yes   /     \    yes
+       kernel <____________________ / bcast6\____________________> tap0
+        stack      original pkt     \ match?/    duplicated pkt
+                                     \     /
+                                      \   /
+                                       \ /
+                                        V no, then continue
+                                        | with TC rules
+                                        |
+                                        v
+                                       / \
+                                      / 5 \
+         eth3                  yes   / all \    yes
+       kernel <____________________ / multi \____________________> tap0
+        stack      original pkt     \ match?/    duplicated pkt
+                                     \     /
+                                      \   /
+                                       \ /
+                                        V no, then continue
+                                        | with TC rules
+                                        |
+                                        v
+                                        |
+                                        . remaining TC rules
+                                        .
+         eth3                           |
+       kernel <________________________/
+        stack      original pkt
+
+.. raw:: pdf
+
+  PageBreak
+
+Associating an rte_flow rule with a TC one
+==========================================
+
+A TC rule is identified by a ``priority`` (16-bit value) and a ``handle``
+(32-bit value). To delete a rule, the priority must be specified, and if several
+rules have the same priority, the handle is needed to select the correct one.
+
+..
+   Specifying an empty priority and handle when requesting a TC rule creation will
+   let the kernel automatically decide what values to set. In fact, the kernel will
+   start with a high priority (i.e. 49152) and subsequent rules will get decreasing
+   priorities (lower priorities get evaluated first).
+
+To avoid further requests to the kernel to identify what priority/handle has
+been automatically allocated, the tap PMD can set priorities and handles
+systematically when creating a rule.
+
+In *local* mode, an rte_flow rule should be translated into a single TC flow
+identified by priority+handle.
+
+In *remote* mode, an rte_flow rule requires two TC rules, one on the tap
+netdevice itself (for the correct action) and another one on the other netdevice
+where packets are redirected from. Both TC rules' priorities+handles must be
+stored for a specific rte_flow rule, and associated with the device they are
+applied on.
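+
+As a sketch, the bookkeeping per rte_flow rule in *remote* mode could look
+like this (illustrative structures, not the actual driver layout)::
+
+   struct tap_tc_ref {
+           uint16_t prio;    /* TC rule priority */
+           uint32_t handle;  /* TC rule handle */
+   };
+
+   struct tap_flow_refs {
+           struct tap_tc_ref tap;    /* rule on the tap netdevice */
+           struct tap_tc_ref remote; /* rule on the remote netdevice */
+   };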
+
+.. raw:: pdf
+
+  PageBreak
+
+Considerations regarding Flow API support
+=========================================
+
+Flow rule attributes
+--------------------
+
+Groups and priorities:
+   There is no native support of groups in TC. Instead, the priority field
+   (which is part of the netlink TC msg header) can be adapted. The four MSB
+   would be used to define the group (allowing for 16 groups), while the 12 LSB
+   would be left to define the actual priority (up to 4096).
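+
+   As a sketch, the encoding could look like this (macro names are
+   illustrative only)::
+
+      /* 16-bit TC priority: 4 MSB = rte_flow group, 12 LSB = priority */
+      #define TAP_GROUP_SHIFT 12
+      #define TAP_TC_PRIO(group, prio) \
+              ((uint16_t)(((group) << TAP_GROUP_SHIFT) | ((prio) & 0x0fff)))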
+
+   Rules with lower priorities are evaluated first. For rules with identical
+   priorities, the one with the highest handle value gets evaluated first.
+
+Direction:
+   Both ingress and egress filtering can be supported.
+
+Meta item types
+---------------
+
+Most applications will use: ``(END | VOID)``
+
+END, VOID:
+   Supported without problem.
+
+INVERT:
+   There is no easy way to support this in TC. It won't be supported.
+
+   **mlx4 will not support it either.**
+
+PF, VF, PORT:
+   Not applicable to a tap netdevice.
+
+Data matching item types
+------------------------
+
+Most applications will use:
+``ETH / (IPV4 | IPV6 | END) / (TCP | UDP | END) / END``
+
+ANY:
+   Should be supported.
+
+   **mlx4 will partially support it.**
+
+RAW:
+   It is not planned to support it for now. Matching raw packets would require
+   using a different classifier than "flower", which is the simplest and most
+   applicable one for most other cases. With TC, it is not possible to support
+   both "flower" and raw packet matching in the same rule.
+
+   **mlx4 will not support it either**.
+
+VLAN:
+   Matching VLAN ID and prio supported.
+   **Note: linux v4.9 required for VLAN support.**
+
+ETH, IPV4, IPV6, UDP, TCP:
+   Matching source/destination MAC/IP/port is supported, with masks.
+
+   **mlx4 does not support partial bit-masks (full or zeroed only).**
+
+ICMP:
+   By specifying the appropriate ether type, ICMP packets can be matched.
+   However, there is no support for ICMP type or code.
+
+   **mlx4 will not support it, however.**
+
+SCTP:
+   By specifying the appropriate IP protocol, SCTP packets can be matched.
+   However, no specific SCTP fields can be matched.
+
+   **mlx4 will not support it, however.**
+
+VXLAN:
+   VXLAN is not recognized by the "flower" classifier. Kernel-managed VXLAN
+   traffic would come through an additional netdevice, which falls outside
+   the scope of this project. VXLAN traffic should occur outside VMs anyway.
+
+Action types
+------------
+
+Most applications will use: ``(VOID | END | QUEUE | DROP) / END``
+
+Multiple actions are possible for TC flow rules; however, the kernel processes
+them in order. The implementation will need to order actions intelligently
+when creating them.
+
+VOID, END:
+   Supported.
+
+PASSTHRU:
+   The generic "continue" action can be used.
+
+   **mlx4 will not support it, however**.
+
+MARK / FLAG:
+   The mark is a field inside the skbuff. However, the tap PMD only reads
+   packet data, without that metadata. As an alternative, it may be possible
+   to create a specific queue to pass packets with a specific mark. Further
+   testing is needed to ensure this is feasible.
+
+QUEUE:
+   The ``skbedit`` action with the ``queue_mapping`` option enables directing
+   packets to specific queues.
+
+   As with rte_flow, when several ``skbedit queue_mapping`` actions are
+   specified in TC, only the last one is taken into account.
+
+DROP:
+   The generic "drop" action can be used. Packets will effectively be dropped,
+   and not left for the kernel to process.
+
+COUNT:
+   Stats are automatically stored in the kernel. The COUNT action will thus
+   be ignored when creating the rule. ``rte_flow_query()`` can be implemented
+   to request a rule's stats from the kernel.
+
+DUP:
+   Duplicating packets is not supported.
+
+RSS:
+   There's no built-in mechanism for RSS in TC.
+
+   By default, incoming packets go to the tap PMD queue 0. To support RSS in
+   software, several additional queues must be set up. Packets coming in on
+   queue 0 can be considered as requiring RSS, and the PMD will apply software
+   RSS (using something like ``rte_softrss()``) to select a queue for the
+   packet.
+
+PF, VF:
+   Not applicable to a tap netdevice.
+
+.. raw:: pdf
+
+  PageBreak
+
+TC limitations for flow collision
+=================================
+
+From the TC standpoint, filter rules with identical priorities do not collide
+as long as they specify values for the same fields in the TC message, with
+identical field masks, and differ in at least one value.
+
+Unfortunately, some flows that obviously are not colliding can be considered
+otherwise by the kernel when parsing the TC messages, and thus their creation
+would be rejected.
+
+Here is a table for matching TC fields with their flow API equivalent:
+
++------------------------------+-----------------------------------+-----------+
+|    TC message field          |      rte_flow API                 | maskable? |
++==============================+===================================+===========+
+| TCA_FLOWER_KEY_ETH_DST       | eth dst                           | yes       |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_ETH_SRC       | eth src                           | yes       |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_ETH_TYPE      | eth type is 0xZZZZ ||             | no        |
+|                              | eth / {ipv4|ipv6}                 |           |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_IP_PROTO      | eth / {ipv4|ipv6} / {tcp|udp}     | no        |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_IPV4_SRC      | eth / ipv4 src                    | yes       |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_IPV4_DST      | eth / ipv4 dst                    | yes       |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_IPV6_SRC      | eth / ipv6 src                    | yes       |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_IPV6_DST      | eth / ipv6 dst                    | yes       |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_L4_SRC        | eth / {ipv4|ipv6} / {tcp|udp} src | no        |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_L4_DST        | eth / {ipv4|ipv6} / {tcp|udp} dst | no        |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_VLAN_ID       | eth / vlan vid                    | no        |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_VLAN_PRIO     | eth / vlan pcp                    | no        |
++------------------------------+-----------------------------------+-----------+
+| TCA_FLOWER_KEY_VLAN_ETH_TYPE | eth / vlan tpid                   | no        |
++------------------------------+-----------------------------------+-----------+
+
+When creating rules with identical priorities, one must make sure that they
+would be translated in TC using the same fields as shown in the above table.
+
+The following flow rules can share the same priority, as they use the same
+fields with identical masks under the hood::
+
+   > flow create 0 ingress priority 0 pattern eth / ipv4 / end
+          actions drop / end
+   Flow rule #0 created
+   > flow create 0 ingress priority 0 pattern eth type is 0x86dd / end
+          actions drop / end
+   Flow rule #1 created
+
+**Note:** Both rules use ETH_TYPE (mask 0xffff) in their TC form.
+
+Sadly, the following flow rules cannot share the same priority, since fields for
+matching IPv4 and IPv6 src/dst addresses are different::
+
+   > flow create 0 ingress priority 1 pattern eth / ipv4 src is 1.1.1.1 / end
+          actions drop / end
+   Flow rule #0 created
+   > flow create 0 ingress priority 1 pattern eth / ipv6 src is ::1 / end
+          actions drop / end
+   PMD: Kernel refused TC filter rule creation (22): Invalid argument
+   Caught error type 2 (flow rule (handle)): overlapping rules
+
+**Note:** First rule uses ETH_TYPE and IPV4_SRC, while the second uses ETH_TYPE
+and IPV6_SRC.
+
+It is however possible to match different IPvX addresses with the same
+priority::
+
+   > flow create 0 ingress priority 2 pattern eth / ipv4 src is 1.1.1.1 / end
+          actions drop / end
+   Flow rule #0 created
+   > flow create 0 ingress priority 2 pattern eth / ipv4 src is 2.2.2.2 / end
+          actions drop / end
+   Flow rule #1 created
+
+If the first rule specifies both destination and source addresses, then the
+other rule with the same priority must too (with identical masks)::
+
+   > flow create 0 ingress priority 3 pattern eth / ipv4 src is 1.1.1.1
+          dst is 1.1.1.2 / end actions drop / end
+   Flow rule #0 created
+   > flow create 0 ingress priority 3 pattern eth / ipv4 src is 2.2.2.2 / end
+          actions drop / end
+   PMD: Kernel refused TC filter rule creation (22): Invalid argument
+   Caught error type 2 (flow rule (handle)): overlapping rules
+   > flow create 0 ingress priority 3 pattern eth / ipv4 src is 2.2.2.2
+          dst spec 2.2.2.3 dst mask 255.255.255.0 / end actions drop / end
+   PMD: Kernel refused TC filter rule creation (22): Invalid argument
+   Caught error type 2 (flow rule (handle)): overlapping rules
+   > flow create 0 ingress priority 3 pattern eth / ipv4 src is 2.2.2.2
+          dst is 2.2.2.3 / end actions drop / end
+   Flow rule #1 created
+
+**Note:** First rule uses ETH_TYPE, IPV4_SRC and IPV4_DST (with full masks). The
+two others must also use those to share the same priority.
+
+It is possible to match TCP/UDP packets with different ports regardless of the
+underlying L3 protocol, as long as the same fields are used (thus no L3 address
+specification). For instance::
+
+   > flow create 0 ingress priority 4 pattern eth / ipv4 / tcp dst is 3333 / end
+          actions drop / end
+   Flow rule #0 created
+   > flow create 0 ingress priority 4 pattern eth / ipv6 / udp dst is 4444 / end
+          actions drop / end
+   Flow rule #1 created
+   > flow create 0 ingress priority 4 pattern eth / ipv6 / udp src is 5555 / end
+          actions drop / end
+   PMD: Kernel refused TC filter rule creation (22): Invalid argument
+   Caught error type 2 (flow rule (handle)): overlapping rules
+
+**Note:** The first two rules use ETH_TYPE, IP_PROTO and L4_DST with different
+values but identical masks, so they are fine. The last rule uses L4_SRC instead
+of L4_DST.
+
+.. raw:: pdf
+
+  PageBreak
+
+RSS implementation for tap
+==========================
+
+There are several areas of research for a tap RSS implementation:
+
+#. userland implementation in tap PMD
+#. userland implementation in DPDK (generic)
+#. userland implementation using combination of TC rules and BPF filters/actions
+#. kernel-side implementation in tap driver
+#. kernel-side implementation as a BPF classifier/action
+#. kernel-side implementation as a separate TC action
+
++--------------+------------------------------+------------------------------+
+|              | Pros                         | Cons                         |
++==============+==============================+==============================+
+| tap PMD      | - no kernel upstreaming      | - tap PMD is supposed to be  |
+|              |                              |   simple, and would no longer|
+|              |                              |   be.                        |
+|              |                              |                              |
+|              |                              | - complex rework, with many  |
+|              |                              |   rings for enqueuing packets|
+|              |                              |   to the right queue         |
+|              |                              |                              |
+|              |                              | - slower                     |
+|              |                              |                              |
+|              |                              | - won't be accepted as it    |
+|              |                              |   doesn't make sense to redo |
+|              |                              |   what the kernel did        |
+|              |                              |   previously                 |
++--------------+------------------------------+------------------------------+
+| generic DPDK | - would be useful to others  | - design must be compatible  |
+|              |                              |   with most PMDs             |
+|              |                              |                              |
+|              |                              | - probably the longest to    |
+|              |                              |   develop                    |
+|              |                              |                              |
+|              |                              | - requires DPDK community    |
+|              |                              |   approval                   |
+|              |                              |                              |
+|              |                              | - requires heavy changes in  |
+|              |                              |   tap PMD itself anyway      |
++--------------+------------------------------+------------------------------+
+| TC rules     | - no kernel upstreaming      | - BPF is complicated to learn|
+| combination  |                              |                              |
+|              | - fast                       | - runtime BPF compilation or |
+|              |                              |   bytecode changes would be  |
+|              | - per-flow RSS               |   tricky                     |
+|              |                              |                              |
+|              | - no change in tap PMD       | - much rework in the tap PMD |
+|              |   datapath                   |   to handle lots of new      |
+|              |                              |   netlink messages / actions |
++--------------+------------------------------+------------------------------+
+| tap driver   | - pretty fast as it          | - might not be accepted by   |
+|              |   intervenes early in packet |   the kernel community as    |
+|              |   RX                         |   they may cling to their    |
+|              |                              |   jhash2 hashing function for|
+|              |                              |   RX.                        |
+|              |                              |                              |
+|              |                              | - only a single RSS context  |
++--------------+------------------------------+------------------------------+
+| BPF          | - fast                       | - BPF is complicated to learn|
+| classifier - |                              |                              |
+| action       | - per-flow RSS               | - would require changing the |
+|              |                              |   kernel API to support      |
+|              |                              |   editing queue_mapping in an|
+|              |                              |   skb                        |
+|              |                              |                              |
+|              |                              | - hashing would be performed |
+|              |                              |   for each queue of a        |
+|              |                              |   specific RSS context       |
+|              |                              |                              |
+|              |                              | - probably difficult to gain |
+|              |                              |   community acceptance       |
++--------------+------------------------------+------------------------------+
+| TC action    | - much more flexibility, with| - needs to be in sync with   |
+|              |   per-flow RSS, multiple     |   iproute2's tc program      |
+|              |   keys, multiple packet      |                              |
+|              |   fields for the hash...     | - kernel upstreaming is not  |
+|              |                              |   necessarily easy           |
+|              | - it's a separate kernel     |                              |
+|              |   module that can be         | - rework in tap PMD to       |
+|              |   maintained out-of-tree and |   support new RSS action and |
+|              |   optionally upstreamed      |   configuration              |
+|              |   anytime                    |                              |
+|              |                              |                              |
+|              | - most logical to be handled |                              |
+|              |   in kernel as RSS is        |                              |
+|              |   supposed to be computed in |                              |
+|              |   the "NIC" exactly once.    |                              |
+|              |                              |                              |
+|              | - fastest                    |                              |
+|              |                              |                              |
+|              | - no change in tap PMD       |                              |
+|              |   datapath                   |                              |
++--------------+------------------------------+------------------------------+
+
+TC rules using BPF from tap PMD
+-------------------------------
+
+The third solution is the best of the userland-based approaches. It does the
+job well and fast (the datapath runs in the kernel), keeps the logic in the
+kernel at runtime, supports flow-based RSS, and has the best potential to be
+accepted by the community.
+
+Advantages of this solution:
+
+- the hash can be recorded in the packet data and read in the tap PMD
+- no kernel customization, everything in DPDK
+- packets reach the tap PMD on the correct queue directly
+
+Drawbacks:
+
+- complicates the tap PMD a lot:
+
+  - 3 BPF programs
+  - new implicit rules
+  - new action and filter support
+  - packet stripping
+
+- numerous TC rules required (in proportion to the number of queues)
+- fast (kernel + JIT BPF), but several TC rules must be traversed
+
+BPF programs controlled from the tap PMD will be used to match packets,
+compute a hash given the configured key, and deliver packets to tap on the
+desired queue.
+
+Design
+~~~~~~
+
+BPF has a limited set of functions for editing the skb in TC. They are listed
+in ``linux/net/core/filter.c:tc_cls_act_func_proto()``:
+
+- skb_store_bytes
+- skb_load_bytes
+- skb_pull_data
+- csum_diff
+- csum_update
+- l3_csum_replace
+- l4_csum_replace
+- clone_redirect
+- get_cgroup_classid
+- skb_vlan_push
+- skb_vlan_pop
+- skb_change_proto
+- skb_change_type
+- skb_change_tail
+- skb_get_tunnel_key
+- skb_set_tunnel_key
+- skb_get_tunnel_opt
+- skb_set_tunnel_opt
+- redirect
+- get_route_realm
+- get_hash_recalc
+- set_hash_invalid
+- perf_event_output
+- get_smp_processor_id
+- skb_under_cgroup
+
+In a BPF program, it is typically not possible to edit the queue_mapping field
+to direct the packet to the correct queue. That part would be done by chaining
+a ``skbedit queue_mapping`` action.
+
+It is not possible either to directly prepend data to a packet (appending works,
+though).
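+
+As a sketch (assuming iproute2-style section annotations and the usual BPF
+helper wrappers; untested), appending the hash in the packet tailroom from
+a BPF action could look like::
+
+   __section("action") int
+   append_hash(struct __sk_buff *skb)
+   {
+           __u32 hash = 0; /* the computed hash would go here */
+
+           /* Grow the packet by 4 bytes of tailroom... */
+           if (bpf_skb_change_tail(skb, skb->len + 4, 0))
+                   return TC_ACT_OK;
+           /* ...and store the hash in the new last 4 bytes. */
+           bpf_skb_store_bytes(skb, skb->len - 4, &hash, sizeof(hash), 0);
+           return TC_ACT_PIPE;
+   }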
+
+A packet would go through these rules (on the local side of the tap netdevice):
+
++-----+---------------------------+----------------------------------+----------+
+|PRIO |           Match           |              Action 1            | Action 2 |
++=====+===========================+==================================+==========+
+|  1  | marked?                   | skbedit queue 'mark' --> DPDK    |          |
++-----+---------------------------+----------------------------------+----------+
+|  2  | marked?                   | skbedit queue 'mark' --> DPDK    |          |
++-----+---------------------------+----------------------------------+----------+
+| ... |                           |                                  |          |
++-----+---------------------------+----------------------------------+----------+
+|  x  | ANY                       | BPF: append NULL 32bits for hash |          |
+|     |                           |                                  |          |
++-----+---------------------------+----------------------------------+----------+
+|x + 1| ACTUAL FLOW RULE 1 MATCH  | ...                              |          |
+|     |                           |                                  |          |
++-----+---------------------------+----------------------------------+----------+
+|x + 2| ACTUAL FLOW RULE 2 MATCH  | ...                              |          |
+|     |                           |                                  |          |
++-----+---------------------------+----------------------------------+----------+
+| ... |                           |                                  |          |
++-----+---------------------------+----------------------------------+----------+
+|  y  | FLOW RULE RSS 1 MATCH     | BPF compute hash into packet     |reclassify|
+|     |                           | tailroom && set queue in skb->cb |          |
++-----+---------------------------+----------------------------------+----------+
+|y + 1| FLOW RULE RSS 2 MATCH     | BPF compute hash into packet     |reclassify|
+|     |                           | tailroom && set queue in skb->cb |          |
++-----+---------------------------+----------------------------------+----------+
+| ... |                           |                                  |          |
++-----+---------------------------+----------------------------------+----------+
+|  z  | ANY (default RSS)         | BPF compute hash into packet     |reclassify|
+|     |                           | tailroom && set queue in skb->cb |          |
++-----+---------------------------+----------------------------------+----------+
+|  z  | ANY (isolate mode)        | DROP                             |          |
++-----+---------------------------+----------------------------------+----------+
+
+
+TC kernel action
+----------------
+
+The last solution (implementing a TC action) would probably be the simplest to
+implement. It is also very flexible, opening more possibilities for combined
+filtering and RSS.
+
+For this solution, the following parameters could be used to configure RSS in a
+TC netlink message:
+
+``queues`` (u16 \*):
+  list of queues to spread incoming traffic over. This is effectively the
+  RETA (redirection table).
+  **Note:** the queue field in an ``skb`` is 16-bit wide, hence the type here.
+
+``key`` (u8 \*):
+  key to use for the Toeplitz-hash in this flow.
+
+``hash_fields`` (bitfield):
+  similar to what exists in DPDK, this bitfield should determine which packet
+  header fields to use for hashing. In practice, another means of configuring
+  which fields to pick may well be used.
+
+``algo`` (unsigned):
+  an enum value from the kernel act_rss header can be used to determine which
+  algorithm (implemented in the kernel) to use. Possible algorithms could be
+  Toeplitz, XOR, symmetric hash, etc.
+
+**Note:** The number of queues to use is automatically deduced from the
+``queues`` netlink attribute length. The ``key`` length can be similarly
+obtained.
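+
+As a sketch, the netlink attributes of such a hypothetical ``act_rss``
+module could be laid out as follows (names are illustrative only)::
+
+   enum {
+           TCA_RSS_UNSPEC,
+           TCA_RSS_PARMS,       /* generic action parameters */
+           TCA_RSS_QUEUES,      /* u16 array: the reta */
+           TCA_RSS_KEY,         /* u8 array: hash key */
+           TCA_RSS_HASH_FIELDS, /* bitfield: header fields to hash */
+           TCA_RSS_ALGO,        /* enum: toeplitz, xor, symmetric... */
+           __TCA_RSS_MAX,
+   };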
+
+.. raw:: pdf
+
+  PageBreak
+
+Appendix: TC netlink message
+============================
+
+**Note:** For deterministic behavior, TC queueing disciplines (QDISC), filters
+and classes must be flushed before starting to apply TC rules. There is a little
+bit of boilerplate (with specific netlink messages) to ensure TC rules can be
+applied. Typically, the TC ``ingress`` QDISC must be created first.
+
+For information, netlink messages regarding TC will look like this::
+
+      0          8         16         24         32
+      +----------+----------+----------+----------+  ---
+    0 |                  Length                   |     \
+      +---------------------+---------------------+      \
+    4 |         Type        |        Flags        |      |
+      +----------- ---------+---------------------+       >--  struct
+    8 |              Sequence number              |      |     nlmsghdr
+      +-------------------------------------------+      /
+   12 |           Process Port ID (PID)           |     /
+      +==========+==========+==========+==========+  ---
+   16 |  Family  |  Rsvd1   |      Reserved2      |     \
+      +----------+----------+---------------------+      \
+   20 |              Interface index              |      |
+      +-------------------------------------------+      |
+   24 |                  Handle                   |      |
+      +-------------------------------------------+       >--  struct
+   28 |               Parent handle               |      |     tcmsg
+      |        MAJOR        +        MINOR        |      |
+      +-------------------------------------------+      |
+   32 |                 TCM info                  |      /
+      |       priority      +      protocol       |     /
+      +===========================================+  ---
+      |                                           |
+      |                  Payload                  |
+      |                                           |
+        ........................................
+      |                                           |
+      |                                           |
+      +-------------------------------------------+
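+
+In C terms, this message is simply a ``struct nlmsghdr`` followed by a
+``struct tcmsg`` and nested ``TCA_*`` attributes (sketch)::
+
+   #include <linux/netlink.h>
+   #include <linux/rtnetlink.h>
+
+   struct tc_request {
+           struct nlmsghdr nh;  /* length, type, flags, seq, port ID */
+           struct tcmsg t;      /* family, ifindex, handle, parent, info */
+           char buf[512];       /* room for nested TCA_* attributes */
+   };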
diff --git a/drivers/net/tap/Makefile b/drivers/net/tap/Makefile
index 405b49e..9afae5e 100644
--- a/drivers/net/tap/Makefile
+++ b/drivers/net/tap/Makefile
@@ -39,6 +39,9 @@ EXPORT_MAP := rte_pmd_tap_version.map
 
 LIBABIVER := 1
 
+# TAP_MAX_QUEUES must be a power of 2 as it will be used for masking
+TAP_MAX_QUEUES = 16
+
 CFLAGS += -O3
 CFLAGS += -I$(SRCDIR)
 CFLAGS += -I.
@@ -47,6 +50,8 @@ LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
 LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs -lrte_hash
 LDLIBS += -lrte_bus_vdev
 
+CFLAGS += -DTAP_MAX_QUEUES=$(TAP_MAX_QUEUES)
+
 #
 # all source are stored in SRCS-y
 #
@@ -89,7 +94,6 @@ tap_autoconf.h: tap_autoconf.h.new
 		mv '$<' '$@'
 
 $(SRCS-$(CONFIG_RTE_LIBRTE_PMD_TAP):.c=.o): tap_autoconf.h
-
 clean_tap: FORCE
 	$Q rm -f -- tap_autoconf.h tap_autoconf.h.new
 
diff --git a/drivers/net/tap/rte_eth_tap.h b/drivers/net/tap/rte_eth_tap.h
index 829f32f..01ac153 100644
--- a/drivers/net/tap/rte_eth_tap.h
+++ b/drivers/net/tap/rte_eth_tap.h
@@ -45,7 +45,7 @@
 #include <rte_ether.h>
 
 #ifdef IFF_MULTI_QUEUE
-#define RTE_PMD_TAP_MAX_QUEUES	16
+#define RTE_PMD_TAP_MAX_QUEUES	TAP_MAX_QUEUES
 #else
 #define RTE_PMD_TAP_MAX_QUEUES	1
 #endif
@@ -90,6 +90,11 @@ struct pmd_internals {
 	int ioctl_sock;                   /* socket for ioctl calls */
 	int nlsk_fd;                      /* Netlink socket fd */
 	int flow_isolate;                 /* 1 if flow isolation is enabled */
+	int flower_support;               /* 1 if kernel supports flower, else 0 */
+	int flower_vlan_support;          /* 1 if kernel supports flower vlan keys */
+	int rss_enabled;                  /* 1 if RSS is enabled, else 0 */
+	/* implicit rules set when RSS is enabled */
+	LIST_HEAD(tap_rss_flows, rte_flow) rss_flows;
 	LIST_HEAD(tap_flows, rte_flow) flows;        /* rte_flow rules */
 	/* implicit rte_flow rules set when a remote device is active */
 	LIST_HEAD(tap_implicit_flows, rte_flow) implicit_flows;
diff --git a/drivers/net/tap/tap_bpf_elf.h b/drivers/net/tap/tap_bpf_elf.h
new file mode 100644
index 0000000..f3db1bf
--- /dev/null
+++ b/drivers/net/tap/tap_bpf_elf.h
@@ -0,0 +1,56 @@
+/*******************************************************************************
+
+  Copyright (C) 2015 Daniel Borkmann <daniel at iogearbox.net>
+
+  Copied from iproute2's include/bpf_elf.h, available at:
+    https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git
+
+  This file is licensed under GNU General Public License (GPL) v2.
+
+  The full GNU General Public License is included in this distribution in
+  the file called "LICENSE.GPL".
+
+*******************************************************************************/
+
+
+#ifndef __BPF_ELF__
+#define __BPF_ELF__
+
+#include <asm/types.h>
+
+/* Note:
+ *
+ * Below ELF section names and bpf_elf_map structure definition
+ * are not (!) kernel ABI. It's rather a "contract" between the
+ * application and the BPF loader in tc. For compatibility, the
+ * section names should stay as-is. Introduction of aliases, if
+ * needed, are a possibility, though.
+ */
+
+/* ELF section names, etc */
+#define ELF_SECTION_LICENSE	"license"
+#define ELF_SECTION_MAPS	"maps"
+#define ELF_SECTION_PROG	"prog"
+#define ELF_SECTION_CLASSIFIER	"classifier"
+#define ELF_SECTION_ACTION	"action"
+
+#define ELF_MAX_MAPS		64
+#define ELF_MAX_LICENSE_LEN	128
+
+/* Object pinning settings */
+#define PIN_NONE		0
+#define PIN_OBJECT_NS		1
+#define PIN_GLOBAL_NS		2
+
+/* ELF map definition */
+struct bpf_elf_map {
+	__u32 type;
+	__u32 size_key;
+	__u32 size_value;
+	__u32 max_elem;
+	__u32 flags;
+	__u32 id;
+	__u32 pinning;
+};
+
+#endif /* __BPF_ELF__ */
diff --git a/drivers/net/tap/tap_flow.c b/drivers/net/tap/tap_flow.c
index ffc0b85..43bab7d 100644
--- a/drivers/net/tap/tap_flow.c
+++ b/drivers/net/tap/tap_flow.c
@@ -43,6 +43,9 @@
 #include <tap_autoconf.h>
 #include <tap_tcmsgs.h>
 
+#include <linux/bpf.h>
+#include <linux/tc_act/tc_bpf.h>
+
 #ifndef HAVE_TC_FLOWER
 /*
  * For kernels < 4.2, this enum is not defined. Runtime checks will be made to
@@ -104,6 +107,23 @@ struct remote_rule {
 	int mirred;
 };
 
+struct action_data {
+	char id[16];
+
+	union {
+		struct tc_gact gact;
+		struct tc_mirred mirred;
+		struct skbedit {
+			struct tc_skbedit skbedit;
+			uint16_t queue;
+		} skbedit;
+		struct bpf {
+			int bpf_fd;
+			char *annotation;
+		} bpf;
+	};
+};
+
 static int tap_flow_create_eth(const struct rte_flow_item *item, void *data);
 static int tap_flow_create_vlan(const struct rte_flow_item *item, void *data);
 static int tap_flow_create_ipv4(const struct rte_flow_item *item, void *data);
@@ -134,6 +154,8 @@ struct remote_rule {
 		 int set,
 		 struct rte_flow_error *error);
 
+static int rss_enable(struct pmd_internals *pmd);
+
 static const struct rte_flow_ops tap_flow_ops = {
 	.validate = tap_flow_validate,
 	.create = tap_flow_create,
@@ -816,111 +838,64 @@ struct tap_flow_items {
 }
 
 /**
- * Transform a DROP/PASSTHRU action item in the provided flow for TC.
- *
- * @param[in, out] flow
- *   Flow to be filled.
- * @param[in] action
- *   Appropriate action to be set in the TCA_GACT_PARMS structure.
- *
- * @return
- *   0 if checks are alright, -1 otherwise.
+ * Transform an action item into a TC action, appended to the flow's
+ * netlink message.
+ *
+ * @param[in, out] flow
+ *   Flow to be filled.
+ * @param[in, out] act_index
+ *   Action index inside the nested actions attribute; incremented here.
+ * @param[in] adata
+ *   Action to add: its kind ("gact", "mirred", "skbedit" or "bpf") and
+ *   associated parameters.
+ *
+ * @return
+ *   0 if checks are alright, -1 otherwise.
  */
 static int
-add_action_gact(struct rte_flow *flow, int action)
+add_action(struct rte_flow *flow, size_t *act_index, struct action_data *adata)
 {
 	struct nlmsg *msg = &flow->msg;
-	size_t act_index = 1;
-	struct tc_gact p = {
-		.action = action
-	};
 
-	if (nlattr_nested_start(msg, TCA_FLOWER_ACT) < 0)
-		return -1;
-	if (nlattr_nested_start(msg, act_index++) < 0)
+	if (nlattr_nested_start(msg, ++(*act_index)) < 0)
 		return -1;
-	nlattr_add(&msg->nh, TCA_ACT_KIND, sizeof("gact"), "gact");
-	if (nlattr_nested_start(msg, TCA_ACT_OPTIONS) < 0)
-		return -1;
-	nlattr_add(&msg->nh, TCA_GACT_PARMS, sizeof(p), &p);
-	nlattr_nested_finish(msg); /* nested TCA_ACT_OPTIONS */
-	nlattr_nested_finish(msg); /* nested act_index */
-	nlattr_nested_finish(msg); /* nested TCA_FLOWER_ACT */
-	return 0;
-}
-
-/**
- * Transform a MIRRED action item in the provided flow for TC.
- *
- * @param[in, out] flow
- *   Flow to be filled.
- * @param[in] ifindex
- *   Netdevice ifindex, where to mirror/redirect packet to.
- * @param[in] action_type
- *   Either TCA_EGRESS_REDIR for redirection or TCA_EGRESS_MIRROR for mirroring.
- *
- * @return
- *   0 if checks are alright, -1 otherwise.
- */
-static int
-add_action_mirred(struct rte_flow *flow, uint16_t ifindex, uint16_t action_type)
-{
-	struct nlmsg *msg = &flow->msg;
-	size_t act_index = 1;
-	struct tc_mirred p = {
-		.eaction = action_type,
-		.ifindex = ifindex,
-	};
 
-	if (nlattr_nested_start(msg, TCA_FLOWER_ACT) < 0)
-		return -1;
-	if (nlattr_nested_start(msg, act_index++) < 0)
-		return -1;
-	nlattr_add(&msg->nh, TCA_ACT_KIND, sizeof("mirred"), "mirred");
+	nlattr_add(&msg->nh, TCA_ACT_KIND, strlen(adata->id), adata->id);
 	if (nlattr_nested_start(msg, TCA_ACT_OPTIONS) < 0)
 		return -1;
-	if (action_type == TCA_EGRESS_MIRROR)
-		p.action = TC_ACT_PIPE;
-	else /* REDIRECT */
-		p.action = TC_ACT_STOLEN;
-	nlattr_add(&msg->nh, TCA_MIRRED_PARMS, sizeof(p), &p);
+	if (strcmp("gact", adata->id) == 0) {
+		nlattr_add(&msg->nh, TCA_GACT_PARMS, sizeof(adata->gact),
+			   &adata->gact);
+	} else if (strcmp("mirred", adata->id) == 0) {
+		if (adata->mirred.eaction == TCA_EGRESS_MIRROR)
+			adata->mirred.action = TC_ACT_PIPE;
+		else /* REDIRECT */
+			adata->mirred.action = TC_ACT_STOLEN;
+		nlattr_add(&msg->nh, TCA_MIRRED_PARMS, sizeof(adata->mirred),
+			   &adata->mirred);
+	} else if (strcmp("skbedit", adata->id) == 0) {
+		nlattr_add(&msg->nh, TCA_SKBEDIT_PARMS,
+			   sizeof(adata->skbedit.skbedit),
+			   &adata->skbedit.skbedit);
+		nlattr_add16(&msg->nh, TCA_SKBEDIT_QUEUE_MAPPING,
+			     adata->skbedit.queue);
+	} else if (strcmp("bpf", adata->id) == 0) {
+		nlattr_add32(&msg->nh, TCA_ACT_BPF_FD, adata->bpf.bpf_fd);
+		nlattr_add(&msg->nh, TCA_ACT_BPF_NAME,
+			   strlen(adata->bpf.annotation) + 1,
+			   adata->bpf.annotation);
+	} else {
+		return -1;
+	}
 	nlattr_nested_finish(msg); /* nested TCA_ACT_OPTIONS */
 	nlattr_nested_finish(msg); /* nested act_index */
-	nlattr_nested_finish(msg); /* nested TCA_FLOWER_ACT */
 	return 0;
 }
 
 /**
- * Transform a QUEUE action item in the provided flow for TC.
- *
- * @param[in, out] flow
- *   Flow to be filled.
- * @param[in] queue
- *   Queue id to use.
- *
- * @return
- *   0 if checks are alright, -1 otherwise.
+ * Transform a list of generic actions into a TC action list nested under
+ * the given classifier attribute, and append it to the provided flow's
+ * netlink message.
+ *
+ * @param[in, out] flow
+ *   Flow to be filled.
+ * @param[in] nb_actions
+ *   Number of actions in the data array.
+ * @param[in] data
+ *   Array of generic actions to transform.
+ * @param[in] classifier_action
+ *   Classifier actions attribute to nest under (e.g. TCA_FLOWER_ACT or
+ *   TCA_BPF_ACT).
+ *
+ * @return
+ *   0 if checks are alright, -1 otherwise.
  */
 static int
-add_action_skbedit(struct rte_flow *flow, uint16_t queue)
+add_actions(struct rte_flow *flow, int nb_actions, struct action_data *data,
+	    int classifier_action)
 {
 	struct nlmsg *msg = &flow->msg;
-	size_t act_index = 1;
-	struct tc_skbedit p = {
-		.action = TC_ACT_PIPE
-	};
+	size_t act_index = 0;
+	int i;
 
-	if (nlattr_nested_start(msg, TCA_FLOWER_ACT) < 0)
-		return -1;
-	if (nlattr_nested_start(msg, act_index++) < 0)
+	if (nlattr_nested_start(msg, classifier_action) < 0)
 		return -1;
-	nlattr_add(&msg->nh, TCA_ACT_KIND, sizeof("skbedit"), "skbedit");
-	if (nlattr_nested_start(msg, TCA_ACT_OPTIONS) < 0)
-		return -1;
-	nlattr_add(&msg->nh, TCA_SKBEDIT_PARMS, sizeof(p), &p);
-	nlattr_add16(&msg->nh, TCA_SKBEDIT_QUEUE_MAPPING, queue);
-	nlattr_nested_finish(msg); /* nested TCA_ACT_OPTIONS */
-	nlattr_nested_finish(msg); /* nested act_index */
+	for (i = 0; i < nb_actions; i++)
+		if (add_action(flow, &act_index, data + i) < 0)
+			return -1;
-	nlattr_nested_finish(msg); /* nested TCA_FLOWER_ACT */
+	nlattr_nested_finish(msg); /* nested classifier_action */
 	return 0;
 }
@@ -1053,7 +1028,12 @@ struct tap_flow_items {
 		}
 	}
 	if (mirred && flow) {
-		uint16_t if_index = pmd->if_index;
+		struct action_data adata = {
+			.id = "mirred",
+			.mirred = {
+				.eaction = mirred,
+			},
+		};
 
 		/*
 		 * If attr->egress && mirred, then this is a special
@@ -1061,9 +1041,13 @@ struct tap_flow_items {
 		 * redirect packets coming from the DPDK App, out
 		 * through the remote netdevice.
 		 */
-		if (attr->egress)
-			if_index = pmd->remote_if_index;
-		if (add_action_mirred(flow, if_index, mirred) < 0)
+		adata.mirred.ifindex = attr->ingress ? pmd->if_index :
+			pmd->remote_if_index;
+		/* adata.mirred.action is derived from eaction in add_action() */
+		if (add_actions(flow, 1, &adata, TCA_FLOWER_ACT) < 0)
 			goto exit_action_not_supported;
 		else
 			goto end;
@@ -1077,14 +1061,33 @@ struct tap_flow_items {
 			if (action)
 				goto exit_action_not_supported;
 			action = 1;
-			if (flow)
-				err = add_action_gact(flow, TC_ACT_SHOT);
+			if (flow) {
+				struct action_data adata = {
+					.id = "gact",
+					.gact = {
+						.action = TC_ACT_SHOT,
+					},
+				};
+
+				err = add_actions(flow, 1, &adata,
+						  TCA_FLOWER_ACT);
+			}
 		} else if (actions->type == RTE_FLOW_ACTION_TYPE_PASSTHRU) {
 			if (action)
 				goto exit_action_not_supported;
 			action = 1;
-			if (flow)
-				err = add_action_gact(flow, TC_ACT_UNSPEC);
+			if (flow) {
+				struct action_data adata = {
+					.id = "gact",
+					.gact = {
+						/* continue */
+						.action = TC_ACT_UNSPEC,
+					},
+				};
+
+				err = add_actions(flow, 1, &adata,
+						  TCA_FLOWER_ACT);
+			}
 		} else if (actions->type == RTE_FLOW_ACTION_TYPE_QUEUE) {
 			const struct rte_flow_action_queue *queue =
 				(const struct rte_flow_action_queue *)
@@ -1096,22 +1099,30 @@ struct tap_flow_items {
 			if (!queue ||
 			    (queue->index > pmd->dev->data->nb_rx_queues - 1))
 				goto exit_action_not_supported;
-			if (flow)
-				err = add_action_skbedit(flow, queue->index);
+			if (flow) {
+				struct action_data adata = {
+					.id = "skbedit",
+					.skbedit = {
+						.skbedit = {
+							.action = TC_ACT_PIPE,
+						},
+						.queue = queue->index,
+					},
+				};
+
+				err = add_actions(flow, 1, &adata,
+					TCA_FLOWER_ACT);
+			}
 		} else if (actions->type == RTE_FLOW_ACTION_TYPE_RSS) {
-			/* Fake RSS support. */
 			const struct rte_flow_action_rss *rss =
 				(const struct rte_flow_action_rss *)
 				actions->conf;
 
-			if (action)
-				goto exit_action_not_supported;
-			action = 1;
-			if (!rss || rss->num < 1 ||
-			    (rss->queue[0] > pmd->dev->data->nb_rx_queues - 1))
+			if (action++)
 				goto exit_action_not_supported;
-			if (flow)
-				err = add_action_skbedit(flow, rss->queue[0]);
+			if (!pmd->rss_enabled)
+				err = rss_enable(pmd);
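+			/*
+			 * The RSS key and queue list are not consumed here:
+			 * in this RFC, queue dispatch is left to the BPF
+			 * program and TC rules installed by rss_enable().
+			 */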
+			(void)rss;
 		} else {
 			goto exit_action_not_supported;
 		}
@@ -1632,6 +1643,127 @@ int tap_flow_implicit_destroy(struct pmd_internals *pmd,
 	return 0;
 }
 
+#define BPF_PROGRAM "tap_bpf_program.o"
+
+/**
+ * Enable RSS on tap: create the TC rules dispatching packets to the
+ * receive queues.
+ *
+ * @param[in, out] pmd
+ *   Pointer to private structure.
+ *
+ * @return
+ *   0 on success, -1 otherwise.
+ */
+static int rss_enable(struct pmd_internals *pmd)
+{
+	struct rte_flow *rss_flow = NULL;
+	char section[64] = ""; /* ELF section name of the BPF program */
+	struct nlmsg *msg = NULL;
+	/* "object:[section]" annotation identifying the BPF program */
+	char annotation[256];
+	int bpf_fd;
+	int i;
+
+	/*
+	 * Add a rule per queue to match reclassified packets and direct them to
+	 * the correct queue.
+	 */
+	for (i = 0; i < pmd->dev->data->nb_rx_queues; i++) {
+		struct action_data adata = {
+			.id = "skbedit",
+			.skbedit = {
+				.skbedit = {
+					.action = TC_ACT_PIPE,
+				},
+				.queue = i,
+			},
+		};
+
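+		/*
+		 * Placeholder fd: the BPF classifier program for this queue
+		 * is expected to be loaded and its fd set here; loading is
+		 * not part of this patch.
+		 */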
+		bpf_fd = 0;
+
+		rss_flow = rte_malloc(__func__, sizeof(struct rte_flow), 0);
+		if (!rss_flow) {
+			RTE_LOG(ERR, PMD,
+				"Cannot allocate memory for rte_flow\n");
+			return -1;
+		}
+		msg = &rss_flow->msg;
+		tc_init_msg(msg, pmd->if_index, RTM_NEWTFILTER, NLM_F_REQUEST |
+			    NLM_F_ACK | NLM_F_EXCL | NLM_F_CREATE);
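+		/*
+		 * The filter priority (i + PRIORITY_OFFSET) goes in the
+		 * upper 16 bits of tcm_info, the protocol in the lower ones.
+		 */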
+		msg->t.tcm_info = TC_H_MAKE((i + PRIORITY_OFFSET) << 16,
+					    htons(ETH_P_ALL));
+		msg->t.tcm_parent = TC_H_MAKE(MULTIQ_MAJOR_HANDLE, 0);
+		tap_flow_set_handle(rss_flow);
+		nlattr_add(&msg->nh, TCA_KIND, sizeof("bpf"), "bpf");
+		if (nlattr_nested_start(msg, TCA_OPTIONS) < 0)
+			return -1;
+		nlattr_add32(&msg->nh, TCA_BPF_FD, bpf_fd);
+		snprintf(annotation, sizeof(annotation), "%s:[%s]",
+			 BPF_PROGRAM, section);
+		nlattr_add(&msg->nh, TCA_BPF_NAME, strlen(annotation) + 1,
+			   annotation);
+
+		if (add_actions(rss_flow, 1, &adata, TCA_BPF_ACT) < 0)
+			return -1;
+		nlattr_nested_finish(msg); /* nested TCA_OPTIONS */
+		/* Netlink message is now ready to be sent */
+		if (nl_send(pmd->nlsk_fd, &msg->nh) < 0)
+			return -1;
+		if (nl_recv_ack(pmd->nlsk_fd) < 0)
+			return -1;
+		LIST_INSERT_HEAD(&pmd->rss_flows, rss_flow, next);
+	}
+
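+	/*
+	 * Finally add a catch-all rule with the lowest priority (evaluated
+	 * last): run the BPF program on every packet that no per-queue rule
+	 * has matched yet.
+	 */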
+	snprintf(annotation, sizeof(annotation), "%s:[%s]", BPF_PROGRAM,
+		 section);
+	rss_flow = rte_malloc(__func__, sizeof(struct rte_flow), 0);
+	if (!rss_flow) {
+		RTE_LOG(ERR, PMD,
+			"Cannot allocate memory for rte_flow\n");
+		return -1;
+	}
+	msg = &rss_flow->msg;
+	tc_init_msg(msg, pmd->if_index, RTM_NEWTFILTER,
+		    NLM_F_REQUEST | NLM_F_ACK | NLM_F_EXCL | NLM_F_CREATE);
+	msg->t.tcm_info =
+		TC_H_MAKE((RTE_PMD_TAP_MAX_QUEUES + PRIORITY_OFFSET) << 16,
+			  htons(ETH_P_ALL));
+	msg->t.tcm_parent = TC_H_MAKE(MULTIQ_MAJOR_HANDLE, 0);
+	tap_flow_set_handle(rss_flow);
+	nlattr_add(&msg->nh, TCA_KIND, sizeof("flower"), "flower");
+	if (nlattr_nested_start(msg, TCA_OPTIONS) < 0)
+		return -1;
+
+	/* no fields for matching: all packets must match */
+	{
+		/* Actions */
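+		/*
+		 * Two chained actions: the BPF program first computes the
+		 * hash and selects the queue, then gact "continues"
+		 * (TC_ACT_UNSPEC) so the packet can be picked up by the
+		 * per-queue rules above.
+		 */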
+		struct action_data data[2] = {
+			[0] = {
+				.id = "bpf",
+				.bpf = {
+					.bpf_fd = bpf_fd,
+					.annotation = annotation,
+				},
+			},
+			[1] = {
+				.id = "gact",
+				.gact = {
+					/* continue */
+					.action = TC_ACT_UNSPEC,
+				},
+			},
+		};
+
+		if (add_actions(rss_flow, 2, data, TCA_FLOWER_ACT) < 0)
+			return -1;
+	}
+	nlattr_nested_finish(msg); /* nested TCA_OPTIONS */
+	/* Netlink message is now ready to be sent */
+	if (nl_send(pmd->nlsk_fd, &msg->nh) < 0)
+		return -1;
+	if (nl_recv_ack(pmd->nlsk_fd) < 0)
+		return -1;
+	LIST_INSERT_HEAD(&pmd->rss_flows, rss_flow, next);
+
+	pmd->rss_enabled = 1;
+	return 0;
+}
+
 /**
  * Manage filter operations.
  *
-- 
1.8.3.1