[dpdk-dev] [PATCH 2/2] ethdev: add hierarchical scheduler API

Jerin Jacob jerin.jacob at caviumnetworks.com
Thu Mar 2 12:47:01 CET 2017


On Fri, Feb 10, 2017 at 02:05:50PM +0000, Cristian Dumitrescu wrote:
> - SW fallback based on librte_sched library (to be later introduced by
>   standalone patch set)
> 
> [1] RFC: http://dpdk.org/ml/archives/dev/2016-November/050956.html
> [2] Jerin’s feedback on RFC: http://www.dpdk.org/ml/archives/dev/2017-January/054484.html
> [3] Hemant’s feedback on RFC: http://www.dpdk.org/ml/archives/dev/2017-January/054866.html
> 
> Signed-off-by: Cristian Dumitrescu <cristian.dumitrescu at intel.com>
> ---
>  MAINTAINERS                            |    4 +
> +	rte_scheddev_node_parent_update;
> +	rte_scheddev_node_shaper_update;
> +	rte_scheddev_node_shared_shaper_update;
> +	rte_scheddev_node_scheduling_mode_update;
> +	rte_scheddev_node_cman_update;


Since the scope goes beyond scheduling, i.e. it also covers CMAN, marking
and shaping, should we call it a traffic manager or something similar? How
about tmdev instead of scheddev? No strong opinion here, but I think it is
worth considering another name for this. (Crypto and eventdev have
schedulers too.)


> +	rte_scheddev_node_wred_context_update;
> +	rte_scheddev_node_shared_wred_context_update;
> +	rte_scheddev_mark_vlan_dei;
> +	rte_scheddev_mark_ip_ecn;
> +	rte_scheddev_mark_ip_dscp;
> +	rte_scheddev_stats_get_enabled;
> +	rte_scheddev_stats_enable;
> +	rte_scheddev_node_stats_get_enabled;
> +	rte_scheddev_node_stats_enable;
> +	rte_scheddev_node_stats_read;
>  
>  } DPDK_17.02;
> +int rte_scheddev_wred_profile_add(uint8_t port_id,
> +	uint32_t wred_profile_id,
> +	struct rte_scheddev_wred_params *profile,
> +	struct rte_scheddev_error *error)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +	const struct rte_scheddev_ops *ops =
> +		rte_scheddev_ops_get(port_id, error);
> +
> +	if (ops == NULL)
> +		return -rte_errno;
> +
> +	if (ops->wred_profile_add == NULL)
> +		return -rte_scheddev_error_set(error,
> +			ENOSYS,
> +			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,
> +			NULL,
> +			rte_strerror(ENOSYS));

IMO, the above piece of code gets duplicated in all the functions; it may
be a candidate for a macro or inline function.
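
Something along these lines (macro name and the final ops callback
invocation are my assumptions, just to illustrate):

#define RTE_SCHEDDEV_FUNC_GET(ops, port_id, func, error)		\
	const struct rte_scheddev_ops *ops =				\
		rte_scheddev_ops_get(port_id, error);			\
	if (ops == NULL)						\
		return -rte_errno;					\
	if (ops->func == NULL)						\
		return -rte_scheddev_error_set(error,			\
			ENOSYS,						\
			RTE_SCHEDDEV_ERROR_TYPE_UNSPECIFIED,		\
			NULL,						\
			rte_strerror(ENOSYS))

Each API function would then reduce to:

int rte_scheddev_wred_profile_add(uint8_t port_id,
	uint32_t wred_profile_id,
	struct rte_scheddev_wred_params *profile,
	struct rte_scheddev_error *error)
{
	struct rte_eth_dev *dev = &rte_eth_devices[port_id];

	/* Declares *ops* and returns when the callback is missing */
	RTE_SCHEDDEV_FUNC_GET(ops, port_id, wred_profile_add, error);

	return ops->wred_profile_add(dev, wred_profile_id, profile, error);
}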

> + * This interface provides the ability to configure the hierarchical scheduler
> + * feature in a generic way.
> + */
> +
> +#include <stdint.h>
> +
> +#include <rte_red.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/** Ethernet framing overhead
> +  *
> +  * Overhead fields per Ethernet frame:
> +  * 1. Preamble:                                            7 bytes;
> +  * 2. Start of Frame Delimiter (SFD):                      1 byte;
> +  * 3. Inter-Frame Gap (IFG):                              12 bytes.
> +  */
> +#define RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD                  20
> +
> +/**
> +  * Ethernet framing overhead plus Frame Check Sequence (FCS). Useful when FCS
> +  * is generated and added at the end of the Ethernet frame on TX side without
> +  * any SW intervention.
> +  */
> +#define RTE_SCHEDDEV_ETH_FRAMING_OVERHEAD_FCS              24
> +
> +/**< Invalid WRED profile ID */
> +#define RTE_SCHEDDEV_WRED_PROFILE_ID_NONE                  UINT32_MAX
> +
> +/**< Invalid shaper profile ID */
> +#define RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE                UINT32_MAX
> +
> +/**< Scheduler hierarchy root node ID */
> +#define RTE_SCHEDDEV_ROOT_NODE_ID                          UINT32_MAX
> +
> +
> +/**
> +  * Scheduler node capabilities
> +  */
> +struct rte_scheddev_node_capabilities {
> +	/**< Private shaper support. */
> +	int shaper_private_supported;
> +
> +	/**< Dual rate shaping support for private shaper. Valid only when
> +	 * private shaper is supported.
> +	 */
> +	int shaper_private_dual_rate_supported;
> +
> +	/**< Minimum committed/peak rate (bytes per second) for private
> +	 * shaper. Valid only when private shaper is supported.
> +	 */
> +	uint64_t shaper_private_rate_min;
> +
> +	/**< Maximum committed/peak rate (bytes per second) for private
> +	 * shaper. Valid only when private shaper is supported.
> +	 */
> +	uint64_t shaper_private_rate_max;
> +
> +	/**< Maximum number of supported shared shapers. The value of zero
> +	 * indicates that shared shapers are not supported.
> +	 */
> +	uint32_t shaper_shared_n_max;


Private vs shared: can we hide this detail inside the implementation?
What are the semantics of a shared shaper? We already have the profile
concept in the spec; can that replace shared shapers? I think it would
help the application if we could hide the private vs shared shaper detail
in the implementation.

> +
> +	/**< Items valid only for non-leaf nodes. */
> +	struct {
> +		/**< Maximum number of children nodes. */
> +		uint32_t n_children_max;
> +
> +		/**< Lowest priority supported. The value of 1 indicates that
> +		 * only priority 0 is supported, which essentially means that
> +		 * Strict Priority (SP) algorithm is not supported.
> +		 */
> +		uint32_t sp_priority_min;

As Hemant suggested, _level_ may be the right name here.

> +
> +		/**< Maximum number of sibling nodes that can have the same
> +		 * priority at any given time. When equal to *n_children_max*,
> +		 * it indicates that WFQ/WRR algorithms are not supported.
> +		 */
> +		uint32_t sp_n_children_max;

In our HW, a node can have 10 separate SP priority siblings, or a WRR
group with N downstream nodes. Valid configurations are:
- <=10 siblings with <=10 static priorities
- one sibling chosen as WRR, connecting N WRR siblings, plus <=9 static
  priority nodes

Not sure how that constraint maps here.

> +
> +		/**< WFQ algorithm support. */
> +		int scheduling_wfq_supported;
> +
> +		/**< WRR algorithm support. */
> +		int scheduling_wrr_supported;
> +
> +		/**< Maximum WFQ/WRR weight. */
> +		uint32_t scheduling_wfq_wrr_weight_max;
> +	} nonleaf;
> +
> +	/**< Items valid only for leaf nodes. */
> +	struct {
> +		/**< Head drop algorithm support. */
> +		int cman_head_drop_supported;
> +
> +		/**< Private WRED context support. */
> +		int cman_wred_context_private_supported;
> +
> +		/**< Maximum number of shared WRED contexts supported. The value
> +		 * of zero indicates that shared WRED contexts are not
> +		 * supported.
> +		 */
> +		uint32_t cman_wred_context_shared_n_max;
> +	} leaf;
> +};
> +
> +/**
> +  * Scheduler capabilities
> +  */
> +struct rte_scheddev_capabilities {
> +	/**< Maximum number of nodes. */
> +	uint32_t n_nodes_max;
> +
> +	/**< Maximum number of levels (i.e. number of nodes connecting the root
> +	 * node with any leaf node, including the root and the leaf).
> +	 */
> +	uint32_t n_levels_max;
> +
> +	/**< Maximum number of shapers, either private or shared. In case the
> +	 * implementation does not share any resource between private and
> +	 * shared shapers, it is typically equal to the sum between
> +	 * *shaper_private_n_max* and *shaper_shared_n_max*.
> +	 */
> +	uint32_t shaper_n_max;
> +
> +	/**< Maximum number of private shapers. Indicates the maximum number of
> +	 * nodes that can concurrently have the private shaper enabled.
> +	 */
> +	uint32_t shaper_private_n_max;
> +
> +	/**< Maximum number of shared shapers. The value of zero indicates that
> +	  * shared shapers are not supported.
> +	  */
> +	uint32_t shaper_shared_n_max;
> +
> +	/**< Maximum number of nodes that can share the same shared shaper. Only
> +	  * valid when shared shapers are supported.
> +	  */
> +	uint32_t shaper_shared_n_nodes_max;
> +
> +	/**< Maximum number of shared shapers that can be configured with dual
> +	  * rate shaping. The value of zero indicates that dual rate shaping
> +	  * support is not available for shared shapers.
> +	  */
> +	uint32_t shaper_shared_dual_rate_n_max;
> +
> +	/**< Minimum committed/peak rate (bytes per second) for shared
> +	  * shapers. Only valid when shared shapers are supported.
> +	  */
> +	uint64_t shaper_shared_rate_min;
> +
> +	/**< Maximum committed/peak rate (bytes per second) for shared
> +	  * shaper. Only valid when shared shapers are supported.
> +	  */
> +	uint64_t shaper_shared_rate_max;

As Hemant suggested, we need additional per-LEVEL capabilities. The number
of shared shapers and the number of nodes are limited in our HW, and we
cannot move a node from one level to another.
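
Roughly something along these lines (all names hypothetical, sketch only):

struct rte_scheddev_level_capabilities {
	uint32_t n_nodes_max;          /**< Max number of nodes on this level. */
	uint32_t shaper_private_n_max; /**< Max private shapers on this level. */
	uint32_t shaper_shared_n_max;  /**< Max shared shapers on this level. */

	/**< Summary of node capabilities for nodes on this level. */
	struct rte_scheddev_node_capabilities node;
};

int rte_scheddev_level_capabilities_get(uint8_t port_id,
	uint32_t level_id,
	struct rte_scheddev_level_capabilities *cap,
	struct rte_scheddev_error *error);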

> +
> +	/**< Minimum value allowed for packet length adjustment for
> +	  * private/shared shapers.
> +	  */
> +	int shaper_pkt_length_adjust_min;
> +
> +	/**< Maximum value allowed for packet length adjustment for
> +	  * private/shared shapers.
> +	  */
> +	int shaper_pkt_length_adjust_max;
> +
> +	/**< Maximum number of WRED contexts. */
> +	uint32_t cman_wred_context_n_max;
> +
> +	/**< Maximum number of private WRED contexts. Indicates the maximum
> +	  * number of leaf nodes that can concurrently have the private WRED
> +	  * context enabled.
> +	  */
> +	uint32_t cman_wred_context_private_n_max;
> +
> +	/**< Maximum number of shared WRED contexts. The value of zero indicates
> +	  * that shared WRED contexts are not supported.
> +	  */
> +	uint32_t cman_wred_context_shared_n_max;
> +
> +	/**< Maximum number of leaf nodes that can share the same WRED context.
> +	  * Only valid when shared WRED contexts are supported.
> +	  */
> +	uint32_t cman_wred_context_shared_n_nodes_max;
> +
> +	/**< Support for VLAN DEI packet marking. */
> +	int mark_vlan_dei_supported;
> +
> +	/**< Support for IPv4/IPv6 ECN marking of TCP packets. */
> +	int mark_ip_ecn_tcp_supported;
> +
> +	/**< Support for IPv4/IPv6 ECN marking of SCTP packets. */
> +	int mark_ip_ecn_sctp_supported;
> +
> +	/**< Support for IPv4/IPv6 DSCP packet marking. */
> +	int mark_ip_dscp_supported;
> +
> +	/**< Summary of node-level capabilities across all nodes. */
> +	struct rte_scheddev_node_capabilities node;
> +};
> +
> +/**
> +  * Congestion management (CMAN) mode
> +  *
> +  * This is used for controlling the admission of packets into a packet queue or
> +  * group of packet queues on congestion. On request of writing a new packet
> +  * into the current queue while the queue is full, the *tail drop* algorithm
> +  * drops the new packet while leaving the queue unmodified, as opposed to *head
> +  * drop* algorithm, which drops the packet at the head of the queue (the oldest
> +  * packet waiting in the queue) and admits the new packet at the tail of the
> +  * queue.
> +  *
> +  * The *Random Early Detection (RED)* algorithm works by proactively dropping
> +  * more and more input packets as the queue occupancy builds up. When the queue
> +  * is full or almost full, RED effectively works as *tail drop*. The *Weighted
> +  * RED* algorithm uses a separate set of RED thresholds for each packet color.
> +  */
> +enum rte_scheddev_cman_mode {
> +	RTE_SCHEDDEV_CMAN_TAIL_DROP = 0, /**< Tail drop */
> +	RTE_SCHEDDEV_CMAN_HEAD_DROP, /**< Head drop */
> +	RTE_SCHEDDEV_CMAN_WRED, /**< Weighted Random Early Detection (WRED) */
> +};
> +
> +/**
> +  * Color
> +  */
> +enum rte_scheddev_color {
> +	e_RTE_SCHEDDEV_GREEN = 0, /**< Green */
> +	e_RTE_SCHEDDEV_YELLOW,    /**< Yellow */
> +	e_RTE_SCHEDDEV_RED,       /**< Red */
> +	e_RTE_SCHEDDEV_COLORS     /**< Number of colors */

Maybe we can remove the e_ prefix here, to be in line with other DPDK enums.

> +};
> +
> +/**
> +  * WRED profile
> +  */
> +struct rte_scheddev_wred_params {
> +	/**< One set of RED parameters per packet color */
> +	struct rte_red_params red_params[e_RTE_SCHEDDEV_COLORS];
> +};
> +
> +/**
> +  * Token bucket
> +  */
> +struct rte_scheddev_token_bucket {
> +	/**< Token bucket rate (bytes per second) */
> +	uint64_t rate;
> +
> +	/**< Token bucket size (bytes), a.k.a. max burst size */
> +	uint64_t size;
> +};
> +
> +  * Each leaf node sits on top of a TX queue of the current Ethernet port.
> +  * Therefore, the leaf nodes are predefined with the node IDs of 0 .. (N-1),
> +  * where N is the number of TX queues configured for the current Ethernet port.
> +  * The non-leaf nodes have their IDs generated by the application.
> +  */
> +struct rte_scheddev_node_params {
> +	/**< Shaper profile for the private shaper. The absence of the private
> +	 * shaper for the current node is indicated by setting this parameter
> +	 * to RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE.
> +	 */

I think we need to add the node attributes (priority, weight, level) to the
node params and let rte_scheddev_node_add take only node_id and
parent_node_id.
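
i.e. roughly (sketch only):

struct rte_scheddev_node_params {
	uint32_t priority; /* SP priority vs its siblings */
	uint32_t weight;   /* WFQ/WRR weight within its priority */
	uint32_t level;    /* Hierarchy level the node sits on */
	uint32_t shaper_profile_id;
	/* ... rest of the existing fields ... */
};

int rte_scheddev_node_add(uint8_t port_id,
	uint32_t node_id,
	uint32_t parent_node_id,
	struct rte_scheddev_node_params *params,
	struct rte_scheddev_error *error);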


> +	uint32_t shaper_profile_id;
> +
> +	/**< User allocated array of valid shared shaper IDs. */
> +	uint32_t *shared_shaper_id;
> +
> +	/**< Number of shared shaper IDs in the *shared_shaper_id* array. */
> +	uint32_t n_shared_shapers;
> +
> +	union {
> +		/**< Parameters only valid for non-leaf nodes. */
> +		struct {
> +			/**< For each priority, indicates whether the children
> +			 * nodes sharing the same priority are to be scheduled
> +			 * by WFQ or by WRR. When NULL, it indicates that WFQ
> +			 * is to be used for all priorities. When non-NULL, it
> +			 * points to a pre-allocated array of *n_priority*
> +			 * elements, with a non-zero value element indicating
> +			 * WFQ and a zero value element for WRR.
> +			 */
> +			int *scheduling_mode_per_priority;
> +
> +			/**< Number of priorities. */
> +			uint32_t n_priorities;
> +		} nonleaf;
> +
> +		/**< Parameters only valid for leaf nodes. */
> +		struct {
> +			/**< Congestion management mode */
> +			enum rte_scheddev_cman_mode cman;
> +
> +			/**< WRED parameters (valid when *cman* is WRED). */
> +			struct {
> +				/**< WRED profile for private WRED context. */
> +				uint32_t wred_profile_id;
> +
> +				/**< User allocated array of shared WRED context
> +				 * IDs. The absence of a private WRED context
> +				 * for current leaf node is indicated by value
> +				 * RTE_SCHEDDEV_WRED_PROFILE_ID_NONE.
> +				 */
> +				uint32_t *shared_wred_context_id;
> +
> +				/**< Number of shared WRED context IDs in the
> +				 * *shared_wred_context_id* array.
> +				 */
> +				uint32_t n_shared_wred_contexts;
> +			} wred;
> +		} leaf;
> +	};
> +};
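
BTW, if I understand the proposed leaf node setup correctly, the
application would do something like this to attach a WRED-managed leaf to
its parent (sketch only; variable names are mine, and txq_id,
parent_node_id and wred_profile_id are assumed to be set up earlier):

	struct rte_scheddev_error error;
	struct rte_scheddev_node_params params = {
		.shaper_profile_id = RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE,
		.shared_shaper_id = NULL,
		.n_shared_shapers = 0,
		.leaf = {
			.cman = RTE_SCHEDDEV_CMAN_WRED,
			.wred = {
				.wred_profile_id = wred_profile_id,
				.shared_wred_context_id = NULL,
				.n_shared_wred_contexts = 0,
			},
		},
	};
	int ret;

	/* Leaf node IDs are predefined as 0 .. (number of TX queues - 1) */
	ret = rte_scheddev_node_add(port_id, txq_id, parent_node_id,
		0 /* priority */, 1 /* weight */, &params, &error);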
> +
> +	const char *message; /**< Human-readable error message. */
> +};
> +
> +/**
> + * Scheduler capabilities get
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param cap
> + *   Scheduler capabilities. Needs to be pre-allocated and valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_capabilities_get(uint8_t port_id,
> +	struct rte_scheddev_capabilities *cap,
> +	struct rte_scheddev_error *error);

As mentioned above, IMO, better to have level capabilities too.

> +
> +/**
> + * Scheduler node add
> + *
> + * When *node_id* is not a valid node ID, a new node with this ID is created and
> + * connected as child to the existing node identified by *parent_node_id*.
> + *
> + * When *node_id* is a valid node ID, this node is disconnected from its current
> + * parent and connected as child to another existing node identified by
> + * *parent_node_id *.
> + *
> + * This function can be called during port initialization phase (before the
> + * Ethernet port is started) for building the scheduler start-up hierarchy.
> + * Subject to the specific Ethernet port supporting on-the-fly scheduler
> + * hierarchy updates, this function can also be called during run-time (after
> + * the Ethernet port is started).
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID
> + * @param parent_node_id
> + *   Parent node ID. Needs to be valid.
> + * @param priority
> + *   Node priority. The highest node priority is zero. Used by the SP algorithm
> + *   running on the parent of the current node for scheduling this child node.
> + * @param weight
> + *   Node weight. The node weight is relative to the weight sum of all siblings
> + *   that have the same priority. The lowest weight is one. Used by the WFQ/WRR
> + *   algorithm running on the parent of the current node for scheduling this
> + *   child node.
> + * @param params
> + *   Node parameters. Needs to be pre-allocated and valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_add(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	uint32_t priority,
> +	uint32_t weight,

see struct rte_scheddev_node_params comment.

> +	struct rte_scheddev_node_params *params,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node delete
> + *
> + * Delete an existing node. This operation fails when this node currently has at
> + * least one user (i.e. child node).
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_delete(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);
> +
> +/**
> + * Scheduler node suspend
> + *
> + * Suspend an existing node.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_suspend(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);

What is the use case for this? Is it the same as setting CIR and PIR to
zero to drop the packets, or is it connected to dynamic topology changes?
IMO, dynamic topology change should be based on capability.


> +
> +/**
> + * Scheduler node resume
> + *
> + * Resume an existing node that was previously suspended.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_resume(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_error *error);
> +
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_parent_update(uint8_t port_id,
> +	uint32_t node_id,
> +	uint32_t parent_node_id,
> +	uint32_t priority,
> +	uint32_t weight,
> +	struct rte_scheddev_error *error);
> +

IMO, dynamic topology change should be based on capability.

> + * Scheduler node scheduling mode update
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param node_id
> + *   Node ID. Needs to be valid leaf node ID.
> + * @param scheduling_mode_per_priority
> + *   For each priority, indicates whether the children nodes sharing the same
> + *   priority are to be scheduled by WFQ or by WRR. When NULL, it indicates that
> + *   WFQ is to be used for all priorities. When non-NULL, it points to a
> + *   pre-allocated array of *n_priority* elements, with a non-zero value element
> + *   indicating WFQ and a zero value element for WRR.
> + * @param n_priorities
> + *   Number of priorities.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_scheduling_mode_update(uint8_t port_id,
> +	uint32_t node_id,
> +	int *scheduling_mode_per_priority,
> +	uint32_t n_priorities,
> +	struct rte_scheddev_error *error);

Do we really need to expose whether the driver implements WFQ or WRR? Any
weighted scheme should be fine, right? No strong opinion though.

> +/**
> + * Scheduler packet marking - VLAN DEI (IEEE 802.1Q)
> + *
> + * IEEE 802.1p maps the traffic class to the VLAN Priority Code Point (PCP)
> + * field (3 bits), while IEEE 802.1q maps the drop priority to the VLAN Drop
> + * Eligible Indicator (DEI) field (1 bit), which was previously named Canonical
> + * Format Indicator (CFI).
> + *
> + * All VLAN frames of a given color get their DEI bit set if marking is enabled
> + * for this color; otherwise, their DEI bit is left as is (either set or not).
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param mark_green
> + *   Set to non-zero value to enable marking of green packets and to zero to
> + *   disable it.
> + * @param mark_yellow
> + *   Set to non-zero value to enable marking of yellow packets and to zero to
> + *   disable it.
> + * @param mark_red
> + *   Set to non-zero value to enable marking of red packets and to zero to
> + *   disable it.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_mark_vlan_dei(uint8_t port_id,
> +	int mark_green,

We think we don't need marking for the green color across the marking APIs
(see the example call after the prototype below).

> +	int mark_yellow,
> +	int mark_red,
> +	struct rte_scheddev_error *error);
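
For reference, with the current prototype an application that skips green
marking would call (sketch):

	rte_scheddev_mark_vlan_dei(port_id, 0 /* green */, 1 /* yellow */,
		1 /* red */, &error);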
> +
> +/**
> + * Scheduler packet marking - IPv4 / IPv6 ECN (IETF RFC 3168)
> + *
> + * IETF RFCs 2474 and 3168 reorganize the IPv4 Type of Service (TOS) field
> +
> + * @param capability_stats_mask
> + *   Statistics counter types available for the current node. Needs to be
> + *   pre-allocated.
> + * @param enabled_stats_mask
> + *   Statistics counter types currently enabled for the current node. This is
> + *   a subset of *capability_stats_mask*. Needs to be pre-allocated.
> + * @param error
> + *   Error details. Filled in only on error, when not NULL.
> + * @return
> + *   0 on success, non-zero error code otherwise.
> + */
> +int rte_scheddev_node_stats_get_enabled(uint8_t port_id,
> +	uint32_t node_id,
> +	uint64_t *capability_stats_mask,
> +	uint64_t *enabled_stats_mask,
> +	struct rte_scheddev_error *error);
> +
> +/**
> +int rte_scheddev_node_stats_read(uint8_t port_id,
> +	uint32_t node_id,
> +	struct rte_scheddev_node_stats *stats,
> +	int clear,
> +	struct rte_scheddev_error *error);

We need to add a stats reset too, right?
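
e.g. (name hypothetical):

int rte_scheddev_node_stats_reset(uint8_t port_id,
	uint32_t node_id,
	struct rte_scheddev_error *error);

or, alternatively, document the *clear* parameter of
rte_scheddev_node_stats_read() as the only reset path.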

Jerin and Bala

> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* __INCLUDE_RTE_SCHEDDEV_H__ */

