[dpdk-dev] [RFC] ethdev: abstraction layer for QoS hierarchical scheduler

Jerin Jacob jerin.jacob at caviumnetworks.com
Wed Jan 11 14:56:03 CET 2017

On Wed, Nov 30, 2016 at 06:16:50PM +0000, Cristian Dumitrescu wrote:
> This RFC proposes an ethdev-based abstraction layer for Quality of Service (QoS)
> hierarchical scheduler. The goal of the abstraction layer is to provide a simple
> generic API that is agnostic of the underlying HW, SW or mixed HW-SW complex
> implementation.

Thanks Cristian for bringing up this RFC.
This will help in integrating NPU QoS hierarchical schedulers into DPDK.

Overall, the RFC looks very good as a generic traffic manager. However, as an
NPU HW vendor, we feel that the specification needs to expose some HW constraints
and HW-specific features in a generic way so that it can be used effectively
with HW-based implementations.

I will try to describe inline the HW constraints and HW features of the HW-based
hierarchical scheduler found in Cavium SoCs. IMO, if other HW vendors share
their constraints for a "hardware based hierarchical scheduler",
then we could have a realistic HW/SW abstraction for the hierarchical scheduler.

> Q1: What is the benefit for having an abstraction layer for QoS hierarchical
> layer?
> A1: There is growing interest in the industry for handling various HW-based,
> SW-based or mixed hierarchical scheduler implementations using a unified DPDK
> API.


> Q4: Why have this abstraction layer into ethdev as opposed to a new type of
> device (e.g. scheddev) similar to ethdev, cryptodev, eventdev, etc?
> A4: Packets are sent to the Ethernet device using the ethdev API
> rte_eth_tx_burst() function, with the hierarchical scheduling taking place
> automatically (i.e. no SW intervention) in HW implementations. Basically, the
> hierarchical scheduler is done as part of packet TX operation.
> The hierarchical scheduler is typically the last stage before packet TX and it
> is tightly integrated with the TX stage. The hierarchical scheduler is just
> another offload feature of the Ethernet device, which needs to be accommodated
> by the ethdev API similar to any other offload feature (such as RSS, DCB,
> flow director, etc).
> Once the decision to schedule a specific packet has been taken, this packet
> cannot be dropped and it has to be sent over the wire as is, otherwise what
> takes place on the wire is not what was planned at scheduling time, so the
> scheduling is not accurate (Note: there are some devices which allow prepending
> headers to the packet after the scheduling stage at the expense of sending
> correction requests back to the scheduler, but this only strengthens the bond
> between scheduling and TX).

Makes sense.

> Q5: Given that the packet scheduling takes place automatically for pure HW
> implementations, how does packet scheduling take place for poll-mode SW
> implementations?
> A5: The API provided function rte_sched_run() is designed to take care of this.
> For HW implementations, this function typically does nothing. For SW
> implementations, this function is typically expected to perform dequeue of
> packets from the hierarchical scheduler and their write to Ethernet device TX
> queue, periodic flush of any buffers on enqueue-side into the hierarchical
> scheduler for burst-oriented implementations, etc.

Yes. In addition to that, if rte_sched_run() does nothing (for HW implementations),
then the _application_ should not call it. I think we need to introduce a
"service core" concept in DPDK to make this transparent from an application
perspective.

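A rough sketch of the idea, so that the decision to poll is not hard-coded in the
application. Everything here is hypothetical: the info struct, the needs_sw_run
flag and the stub standing in for rte_sched_run() are assumptions for
illustration, not part of the RFC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-port info; not part of the RFC. */
struct sched_port_info {
	bool needs_sw_run; /* true when a CPU core must drive the scheduler */
};

/* Stand-in for the RFC's rte_sched_run(); here it only counts calls. */
static unsigned int sched_run_calls;

static void sched_run_stub(uint8_t port_id)
{
	(void)port_id;
	sched_run_calls++;
}

/* What a service core (or the application itself) would do per iteration. */
static void sched_service_iteration(uint8_t port_id,
	const struct sched_port_info *info)
{
	/* Pure HW schedulers report needs_sw_run == false and are never polled. */
	if (info->needs_sw_run)
		sched_run_stub(port_id);
}
```

With something like this, a HW PMD costs no cycles on the polling core, and a SW
PMD gets polled without the application knowing which kind it is talking to.
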
> Q6: Which are the scheduling algorithms supported?
> A6: The fundamental scheduling algorithms that are supported are Strict Priority
> (SP) and Weighted Fair Queuing (WFQ). The SP and WFQ algorithms are supported at
> the level of each node of the scheduling hierarchy, regardless of the node
> level/position in the tree. The SP algorithm is used to schedule between sibling
> nodes with different priority, while WFQ is used to schedule between groups of
> siblings that have the same priority.
> Algorithms such as Weighed Round Robin (WRR), byte-level WRR, Deficit WRR
> (DWRR), etc are considered approximations of the ideal WFQ and are therefore
> assimilated to WFQ, although an associated implementation-dependent accuracy,
> performance and resource usage trade-off might exist.

Makes sense.

> Q7: Which are the supported congestion management algorithms?
> A7: Tail drop, head drop and Weighted Random Early Detection (WRED). They are
> available for every leaf node in the hierarchy, subject to the specific
> implementation supporting them.

We don't support tail drop, head drop or WRED for each leaf node in the hierarchy.
Instead, in some sense, congestion management is integrated into the HW mempool
block at ingress. So maybe we can have some sort of capability or info API that
reports the scheduler's capabilities to the application, so that it gets the big
picture instead of trying the individual resource APIs in the spec.

We do have support for querying the available free entries in a leaf queue to
figure out the load. But it may not be worth starting a service core
(rte_sched_run()) to implement the spec, due to the multicore communication
overhead. Instead, using the base HW support (a means to get the depth of a leaf
queue), the application or a library can do the congestion management.
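
For example, a software tail drop built on top of such a depth query could look
roughly like this. The struct and function names are made up for illustration;
the only thing assumed from the HW is a way to read back the free entries of a
leaf queue.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of what the HW reports for one leaf queue. */
struct leaf_queue_state {
	uint32_t capacity;     /* total entries in the leaf queue */
	uint32_t free_entries; /* free entries read back from HW */
};

/*
 * Tail drop in software: admit a burst only if the queue depth after
 * enqueuing it stays at or below max_depth.
 */
static bool leaf_admit(const struct leaf_queue_state *q, uint32_t n_pkts,
	uint32_t max_depth)
{
	uint64_t depth = (uint64_t)q->capacity - q->free_entries;

	return depth + n_pkts <= max_depth;
}
```

The application would call leaf_admit() before rte_eth_tx_burst() and drop (or
hold back) the burst when it returns false.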

Thoughts?

Does any other HW vendor support egress congestion management in HW?

> Q8: Is traffic shaping supported?
> A8: Yes, there are a number of shapers (rate limiters) that can be supported for
> each node in the hierarchy (built-in limit is currently set to 4 per node). Each
> shaper can be private to a node (used only by that node) or shared between
> multiple nodes.

Makes sense. We have dual-rate shapers (very similar to RFC 2697 and RFC 2698)
at all the nodes (obviously, only a single rate at the last node, the one closest
to the physical port). Just to understand: when we say 4 shapers per node, is it
four different rate limiters per node? Is there any RFC for a four-rate limiter,
like single (RFC 2697) and dual (RFC 2698)?
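
For reference, this is roughly what I mean by a dual-rate shaper: a minimal
color-blind two-rate three-color marker along the lines of RFC 2698. It is only
a sketch (names are made up, fractional tokens are dropped on refill, and it is
not how our HW or rte_meter implements it).

```c
#include <stdint.h>

enum color { COLOR_GREEN, COLOR_YELLOW, COLOR_RED };

struct trtcm {
	uint64_t cir, pir; /* committed/peak rate, bytes per second */
	uint64_t cbs, pbs; /* committed/peak burst size, bytes */
	uint64_t tc, tp;   /* tokens currently in each bucket, bytes */
	uint64_t last_ns;  /* timestamp of the last refill */
};

static void trtcm_init(struct trtcm *m, uint64_t cir, uint64_t pir,
	uint64_t cbs, uint64_t pbs, uint64_t now_ns)
{
	m->cir = cir; m->pir = pir;
	m->cbs = cbs; m->pbs = pbs;
	m->tc = cbs;  m->tp = pbs; /* both buckets start full */
	m->last_ns = now_ns;
}

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }

/* Color-blind mode: the incoming color of the packet is ignored. */
static enum color trtcm_color_blind(struct trtcm *m, uint64_t now_ns,
	uint32_t pkt_len)
{
	uint64_t dt = now_ns - m->last_ns;

	/* Refill both buckets, capped at their burst sizes. */
	m->tc = min_u64(m->cbs, m->tc + m->cir * dt / 1000000000ULL);
	m->tp = min_u64(m->pbs, m->tp + m->pir * dt / 1000000000ULL);
	m->last_ns = now_ns;

	if (m->tp < pkt_len)
		return COLOR_RED;    /* exceeds the peak rate */
	m->tp -= pkt_len;
	if (m->tc < pkt_len)
		return COLOR_YELLOW; /* within peak, exceeds committed */
	m->tc -= pkt_len;
	return COLOR_GREEN;
}
```

It is not obvious to me how four rates per node would map onto three colors,
hence the question.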

> Q9: What is the purpose of having shaper profiles and WRED profiles?
> A9: In most implementations, many shapers typically share the same configuration
> parameters, so defining shaper profiles simplifies the configuration task. Same
> considerations apply to WRED contexts and profiles.

Makes sense.

> Q11: Are on-the-fly changes of the scheduling hierarchy allowed by the API?
> A11: Yes. The actual changes take place subject to the specific implementation
> supporting them, otherwise error code is returned.

On-the-fly changes to the scheduling hierarchy are tricky in HW implementations
and come with a lot of constraints. Returning an error code is fine, but we need
to define what it takes to reconfigure the hierarchy if on-the-fly
reconfiguration is not supported.

The high-level constraints for reconfiguring the hierarchy in our HW are:
1) Stop adding additional packets to the leaf nodes.
2) Wait for the packets to drain out from the nodes.

Point (2) is internal to the implementation, so we can manage it.
For point (1), I guess the application may need to know about the constraint.
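
The sequence could be sketched as below. This is a toy model: the port struct,
the tick function standing in for HW draining, and the function names are all
hypothetical, just to show the order of operations and where the application
sees the constraint (enqueue is refused while reconfiguring).

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a scheduler port; not a real driver structure. */
struct sched_port {
	bool accept_enqueue;     /* false while a reconfiguration is pending */
	uint32_t pkts_in_flight; /* packets still inside the hierarchy */
	uint32_t n_levels;       /* stands in for "the hierarchy" */
};

static int sched_enqueue(struct sched_port *p)
{
	if (!p->accept_enqueue)
		return -EBUSY; /* point (1): visible to the application */
	p->pkts_in_flight++;
	return 0;
}

/* Stand-in for the HW transmitting one packet. */
static void sched_hw_tick(struct sched_port *p)
{
	if (p->pkts_in_flight > 0)
		p->pkts_in_flight--;
}

static int sched_reconfigure(struct sched_port *p, uint32_t new_levels,
	uint32_t max_ticks)
{
	p->accept_enqueue = false;                       /* 1) stop enqueue */
	while (p->pkts_in_flight > 0 && max_ticks > 0) { /* 2) drain */
		sched_hw_tick(p);
		max_ticks--;
	}
	if (p->pkts_in_flight > 0) {
		p->accept_enqueue = true; /* give up, keep the old hierarchy */
		return -ETIMEDOUT;
	}
	p->n_levels = new_levels; /* safe to change the hierarchy now */
	p->accept_enqueue = true;
	return 0;
}
```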

> Q13: Which are the possible options for the user when the Ethernet port does not
> support the scheduling hierarchy required by the user?
> A13: The following options are available to the user:
> i) abort
> ii) try out a new hierarchy (e.g. with less leaf nodes), if acceptable

As mentioned earlier, an additional API to get the capabilities will help here.
Some of the other capabilities that we believe will be useful for applications:

1) Maximum number of levels
2) Maximum nodes per level
3) Is congestion management supported?
4) Maximum priority per node

At least it will be useful for writing the example application.
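
Something along these lines, as a sketch only. The struct and field names are
placeholders I made up, not a proposal for the final naming.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability report filled in by the driver. */
struct sched_capability {
	uint32_t n_levels_max;          /* 1) maximum number of levels */
	uint32_t n_nodes_per_level_max; /* 2) maximum nodes per level */
	bool cman_supported;            /* 3) congestion management in HW? */
	uint32_t n_priorities_max;      /* 4) maximum priorities per node */
};

/* What the application wants to build. */
struct sched_request {
	uint32_t n_levels;
	uint32_t n_nodes_widest_level;
	bool needs_cman;
	uint32_t n_priorities;
};

/* Lets the application check a hierarchy up front instead of by trial. */
static bool sched_hierarchy_fits(const struct sched_capability *cap,
	const struct sched_request *req)
{
	return req->n_levels <= cap->n_levels_max &&
		req->n_nodes_widest_level <= cap->n_nodes_per_level_max &&
		req->n_priorities <= cap->n_priorities_max &&
		(!req->needs_cman || cap->cman_supported);
}
```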

> iii) wrap the Ethernet device into a new type of Ethernet device that has a SW
> front-end implementing the hierarchical scheduler (e.g. existing DPDK library
> librte_sched); instantiate the new device type on-the-fly and check if the
> hierarchy requirements can be met by the new device.

Do we want to wrap it into a new Ethernet device or let the application use the
software library directly? If it is the former:

Are we planning a generic SW-based driver for this, so that NICs that don't have
HW support can just reuse the SW driver instead of duplicating the code in all
the PMD drivers?

> Signed-off-by: Cristian Dumitrescu <cristian.dumitrescu at intel.com>
> ---
>  lib/librte_ether/rte_ethdev.h | 794 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 794 insertions(+)
>  mode change 100644 => 100755 lib/librte_ether/rte_ethdev.h
> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> old mode 100644
> new mode 100755
> index 9678179..d4d8604
> --- a/lib/librte_ether/rte_ethdev.h
> +++ b/lib/librte_ether/rte_ethdev.h
> @@ -182,6 +182,8 @@ extern "C" {
>  #include <rte_pci.h>
>  #include <rte_dev.h>
>  #include <rte_devargs.h>
> +#include <rte_meter.h>
> +#include <rte_red.h>


> +
> +enum rte_eth_sched_stats_counter {
> +	/**< Number of packets scheduled from current node. */
> +	/**< Number of bytes scheduled from current node. */
> +	/**< Number of packets currently waiting in the packet queue of current
> +	     leaf node. */
> +	/**< Number of bytes currently waiting in the packet queue of current
> +	     leaf node. */

Some of the other counters seen in HW implementations from the shaper (rate limiter) are


> +};
> +
> +/**
> +  * Node statistics counters
> +  */
> +struct rte_eth_sched_node_stats {
> +	/**< Number of packets scheduled from current node. */
> +	uint64_t n_pkts;
> +	/**< Number of bytes scheduled from current node. */
> +	uint64_t n_bytes;
> +	/**< Statistics counters for leaf nodes only */

We don't have support for stats on all the nodes. Since you have
rte_eth_sched_node_stats_get_enabled(), we are good.

> +	struct {
> +		/**< Number of packets dropped by current leaf node. */
> +		uint64_t n_pkts_dropped;
> +		/**< Number of bytes dropped by current leaf node. */
> +		uint64_t n_bytes_dropped;
> +		/**< Number of packets currently waiting in the packet queue of
> +		     current leaf node. */
> +		uint64_t n_pkts_queued;
> +		/**< Number of bytes currently waiting in the packet queue of
> +		     current leaf node. */
> +		uint64_t n_bytes_queued;
> +	} leaf;

Leaf stats look good to us.

> +};
> +
>  /**
> + * Scheduler WRED profile add
> + *
> + * Create a new WRED profile with ID set to *wred_profile_id*. The new profile
> + * is used to create one or several WRED contexts.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param wred_profile_id
> + *   WRED profile ID for the new profile. Needs to be unused.
> + * @param profile
> + *   WRED profile parameters. Needs to be pre-allocated and valid.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_eth_sched_wred_profile_add(uint8_t port_id,
> +	uint32_t wred_profile_id,
> +	struct rte_eth_sched_wred_params *profile);

How about returning the wred_profile_id from the driver? That looks like the easy
way to manage it from the driver's perspective (the driver can return the same
handle for similar profiles and use an opaque number to embed some other
information), and it is kind of the norm:
int rte_eth_sched_wred_profile_add(uint8_t port_id,
		struct rte_eth_sched_wred_params *profile);
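
To illustrate the driver side, here is a sketch of how a driver could hand out
the ID itself and share one HW slot between identical profiles. The parameter
struct is a simplified placeholder and the table and function names are made up;
a non-negative return value is the profile ID, a negative one is an errno.

```c
#include <errno.h>
#include <stdint.h>

#define WRED_PROFILES_MAX 4 /* arbitrary size for the sketch */

/* Simplified, hypothetical profile parameters. */
struct wred_params {
	uint16_t min_th;
	uint16_t max_th;
	uint16_t maxp_inv;
};

struct wred_table {
	struct wred_params params[WRED_PROFILES_MAX];
	uint32_t refcnt[WRED_PROFILES_MAX]; /* 0 means the slot is free */
};

static int wred_params_equal(const struct wred_params *a,
	const struct wred_params *b)
{
	return a->min_th == b->min_th && a->max_th == b->max_th &&
		a->maxp_inv == b->maxp_inv;
}

/* Returns a driver-chosen profile ID >= 0, or a negative errno. */
static int wred_profile_add(struct wred_table *t, const struct wred_params *p)
{
	int free_slot = -1;
	int i;

	for (i = 0; i < WRED_PROFILES_MAX; i++) {
		if (t->refcnt[i] == 0) {
			if (free_slot < 0)
				free_slot = i; /* remember the first free slot */
		} else if (wred_params_equal(&t->params[i], p)) {
			t->refcnt[i]++;    /* identical profile: share the slot */
			return i;
		}
	}
	if (free_slot < 0)
		return -ENOSPC;
	t->params[free_slot] = *p;
	t->refcnt[free_slot] = 1;
	return free_slot;
}
```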

> +/**
> + * Scheduler node add or remap
> + *
> + * When *node_id* is not a valid node ID, a new node with this ID is created and
> + * connected as child to the existing node identified by *parent_node_id*.
> + *
> + * When *node_id* is a valid node ID, this node is disconnected from its current
> + * parent and connected as child to another existing node identified by
> + * *parent_node_id *.
> + *
> + * This function can be called during port initialization phase (before the
> + * Ethernet port is started) for building the scheduler start-up hierarchy.
> + * Subject to the specific Ethernet port supporting on-the-fly scheduler
> + * hierarchy updates, this function can also be called during run-time (after
> + * the Ethernet port is started).
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID
> + * @param parent_node_id
> + *   Parent node ID. Needs to be the valid.
> + * @param params
> + *   Node parameters. Needs to be pre-allocated and valid.
> + * @return
> + *   0 on success, non-zero error code otherwise.

IMO, we need an explicit error number to identify a configuration error caused
by the Ethernet port having already been started.
And for that error code, we need to define the procedure for reconfiguring
the topology.

The recent rte_flow spec has its own error codes to give more visibility into the
failure, so that the application can choose better attributes for configuring.
For example, some of the limitations in our HW are:

1) Priorities are from 0 to 9 (error type: PRIORITY_NOT_SUPPORTED).
2) DWRR is applicable to only one set of priorities per children-to-parent
connection. Example: valid case: 0-1-1-1-2-3; invalid case: 0-1-1-1-3-2-(2).
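
A sketch of how such typed errors could look for the two limitations above. The
enum names other than PRIORITY_NOT_SUPPORTED are made up, and the check encodes
my reading of limitation (2): among the children of one parent, at most one
priority value may be shared by more than one child.

```c
#include <stdint.h>

#define SCHED_PRIORITY_MAX 9 /* limitation (1): priorities are 0..9 */

/* Hypothetical error types, in the spirit of the rte_flow error reporting. */
enum sched_error_type {
	SCHED_ERROR_NONE,
	SCHED_ERROR_PORT_STARTED,           /* on-the-fly update not supported */
	SCHED_ERROR_PRIORITY_NOT_SUPPORTED, /* priority outside 0..9 */
	SCHED_ERROR_DWRR_GROUPS             /* more than one DWRR group */
};

/* Validate the priorities of all children attached to one parent node. */
static enum sched_error_type sched_check_child_priorities(
	const uint32_t *prio, unsigned int n_children)
{
	unsigned int count[SCHED_PRIORITY_MAX + 1] = {0};
	unsigned int shared = 0;
	unsigned int i;

	for (i = 0; i < n_children; i++) {
		if (prio[i] > SCHED_PRIORITY_MAX)
			return SCHED_ERROR_PRIORITY_NOT_SUPPORTED;
		count[prio[i]]++;
	}
	/* Each priority held by several children forms one DWRR group. */
	for (i = 0; i <= SCHED_PRIORITY_MAX; i++)
		if (count[i] > 1)
			shared++;

	return shared > 1 ? SCHED_ERROR_DWRR_GROUPS : SCHED_ERROR_NONE;
}
```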

> + */
> +int rte_eth_sched_node_add(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	struct rte_eth_sched_node_params *params);
> +
> +/**
> + * 
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param queue_id
> + *   Queue ID. Needs to be valid.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_eth_sched_node_queue_set(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t queue_id);
> +

In a HW-based implementation, leaf node id == tx_queue_id, as hierarchical
scheduling is tightly coupled with the tx_queues (i.e. the leaf nodes). Do we need
such a translation (like specifying "queue_id" in struct
rte_eth_sched_node_params)? Since tx_queues are expressed as 0..n, how about
making the leaf node id the same? There is no such translation in HW, so it may
be difficult to implement. Do we really need this translation?

Other points:

HW can't understand any SW marking schemes applied at the ingress classification
level. For us, at the leaf node level all packets are in color-unaware mode with
the input color set to green (aka color-blind mode). On the subsequent levels, HW
adds color metadata to the packet based on the shapers.

With the above scheme, we have a few features for which we need to figure out how
to abstract them in a generic way, based on the SW implementation or other HW
vendors' constraints.

1) If the last-level color meta is YELLOW, HW can mark (write) 3 bits in the packet.
This is useful for sharing the color info across two different systems (like
updating the IP diffserv bits).

2) The need for an additional shaping param called _adjust_.
Typically the conditioning and scheduling algorithm is measured in bytes of
IP packets per second. We have a _signed_ adjust field (-255 to 255)
(it looks like other HW implementations have one too) to express the packet
length relative to the L2 length: a positive value to include the L1 overhead
(typically 20B: Ethernet preamble and Inter Frame Gap),
and a negative value to remove the L2 + VLAN header and take only the IP length, etc.
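
A small worked example of how _adjust_ would be applied. The function name is
made up, and I am assuming the L2 length is counted without the FCS.

```c
#include <stdint.h>

/*
 * Length charged to the shaper for one packet: the L2 length plus a signed
 * adjust in the range -255..255, clamped at zero.
 */
static uint32_t shaper_accounted_len(uint32_t l2_len, int32_t adjust)
{
	int64_t len = (int64_t)l2_len + adjust;

	return len > 0 ? (uint32_t)len : 0;
}
```

With adjust = +20 (preamble plus Inter Frame Gap), a 64B frame is charged as 84B
on the wire. With adjust = -18 (14B Ethernet header plus 4B VLAN tag), a 118B
tagged frame is charged as its 100B IP packet.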

