[dpdk-dev] [PATCH 2/2] ethdev: add hierarchical scheduler API

Dumitrescu, Cristian cristian.dumitrescu at intel.com
Tue Feb 21 14:44:39 CET 2017


Hi Hemant,

> > +  * Scheduler node capabilities
> > +  */
> > +struct rte_scheddev_node_capabilities {

...<snip>

> > +		/**< Lowest priority supported. The value of 1 indicates that
> > +		 * only priority 0 is supported, which essentially means that
> > +		 * Strict Priority (SP) algorithm is not supported.
> > +		 */
> > +		uint32_t sp_priority_min;
> > +
> This can be simply sp_priority_level, with 0 indicating no support;
> 1 indicates priorities '0' and '1', and 7 indicates '0' to '7', i.e. a
> total of 8 priorities.

Yes, will pick a better name, as you suggested. 
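
For illustration, the renamed field could be documented along these lines
(a sketch only, exact name and wording TBD):

	/**< Highest strict priority level supported. Priorities 0 (highest)
	 * through sp_priority_level (lowest) are valid. The value of 0
	 * indicates that only priority 0 exists, i.e. the Strict Priority
	 * (SP) algorithm is not supported.
	 */
	uint32_t sp_priority_level;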

> 
> > +		/**< Maximum number of sibling nodes that can have the same
> > +		 * priority at any given time. When equal to *n_children_max*,
> > +		 * it indicates that WFQ/WRR algorithms are not supported.
> > +		 */
> > +		uint32_t sp_n_children_max;
> Not clear to me.
> OK, more than one child can have the same priority, then you apply WRR/WFQ
> among them.
> 
> However, there can be different sets, e.g. prio '0' and '1' have only one
> child each, while prio '2' has 6 children; then you apply WRR/WFQ among them.
> 

Yes, the parameter description seems wrong to me as well. The correct statement should
be: "When equal to 1, it indicates that WFQ/WRR algorithms are not supported", right?
I will fix the description.

Also, for the sake of clarity, it makes sense to mention in the
description that the sp_n_children_max value range is 1 .. n_children_max.
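
For reference, the corrected description could read along these lines
(sketch only):

	/**< Maximum number of sibling nodes that can have the same priority
	 * at any given time. When equal to 1, it indicates that WFQ/WRR
	 * algorithms are not supported. The value range is
	 * 1 .. *n_children_max*.
	 */
	uint32_t sp_n_children_max;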

> > +
> > +		/**< WFQ algorithm support. */
> > +		int scheduling_wfq_supported;
> > +
> > +		/**< WRR algorithm support. */
> > +		int scheduling_wrr_supported;
> > +
> > +		/**< Maximum WFQ/WRR weight. */
> > +		uint32_t scheduling_wfq_wrr_weight_max;
> > +	} nonleaf;
> > +
> > +	/**< Items valid only for leaf nodes. */
> > +	struct {
> > +		/**< Head drop algorithm support. */
> > +		int cman_head_drop_supported;
> > +
> > +		/**< Private WRED context support. */
> > +		int cman_wred_context_private_supported;
> > +
> 
> The context part is not clear to me.
> 

The WRED context is the WRED object/instance initialized based on a WRED profile.

A leaf node may or may not support a WRED context that it owns (the private
WRED context), which is what this parameter refers to, as well as zero or more
shared WRED contexts (shared with other nodes), which is what the parameter
cman_wred_context_shared_n_max below refers to. Makes sense?
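
As an illustration, here is a hypothetical way to configure a leaf node with
one private WRED context plus two shared WRED contexts (the IDs and the
RTE_SCHEDDEV_CMAN_WRED enum value are assumptions made for the sake of the
example):

	uint32_t shared_ids[] = {10, 11}; /* shared WRED context IDs */

	struct rte_scheddev_node_params leaf_params = {
		.shaper_profile_id = RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE,
		.leaf = {
			.cman = RTE_SCHEDDEV_CMAN_WRED,
			.wred = {
				/* Private WRED context, from WRED profile 5 */
				.wred_profile_id = 5,
				/* Shared WRED contexts (shared with other nodes) */
				.shared_wred_context_id = shared_ids,
				.n_shared_wred_contexts = 2,
			},
		},
	};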

> > +		/**< Maximum number of shared WRED contexts supported. The
> > +		 * value of zero indicates that shared WRED contexts are not
> > +		 * supported.
> > +		 */
> > +		uint32_t cman_wred_context_shared_n_max;
> > +	} leaf;
> 
> Non-leaf nodes may have different capabilities.
> 
> Your leaf node is like a QoS Queue; are you supporting a shaper on the leaf
> node as well?
> 

Yes, we do support shapers per leaf node as well. Probably a misunderstanding:
if you look closer at struct rte_scheddev_node_capabilities, the shaper related
parameters are in the common part of the structure, not in the inner non-leaf/leaf
specific parameter area, agree?

> 
> I would still prefer it if you separate the QoS Queue from a standard Sched
> node; the capabilities are different and it will be cleaner, at the cost
> of an increased number of structures and APIs.
> 

I seriously thought about this suggestion (which you stated in your previous email); I even went down
this path, only to eventually revert, as my conclusion was that it is not the best approach. What I
found is that this approach tends to create a lot of confusion about which API functions and data
structures are applicable to: (a) both leaf and non-leaf nodes, (b) leaf nodes only, (c) non-leaf nodes only.

Approach 1: Have a separate API for leaf nodes and non-leaf nodes. We now effectively have two different
API objects: leaf_node and nonleaf_node.
My conclusions against it:
- We end up duplicating 90% of the API;
- We also need to document and enforce clear rules for the interaction between leaf nodes and
non-leaf nodes, which adds a lot of complexity, as it doubles the number of variables the user needs
to consider.

Approach 2: Have most of the API common to leaf nodes and non-leaf nodes (for a common object called
node), but have some APIs applicable only to leaf nodes (named using leaf_node). This still creates the
impression that there are two different API objects (leaf_node and non-leaf node), which is false in this case.
My conclusions against it:
- It is likely the user will look for leaf-node specific APIs for most basic operations, and will not find them;
- Even if the user is aware that the "node" API is applicable to both leaf nodes and non-leaf nodes, it still
requires documenting and enforcing clear rules for the interaction between leaf nodes and non-leaf nodes,
which increases the complexity.

Approach 3: Have all APIs refer to a single object, consistently called node.
This is the approach that I ended up picking:
- It really simplifies the API and adds a lot of clarity.
- The differences between leaf and non-leaf nodes are clearly enforced in the API data structures, using
unions or structs explicitly called leaf and nonleaf.
- The only functions applicable to leaf nodes only / non-leaf nodes only are the node_xyz_update()
functions, called to update node parameter xyz post node-add operation. It is easy to tell which type of
node they apply to, as the parameter itself leaves no room for confusion.

One more note: it is straightforward to determine whether an existing node is a leaf or not, as the API
predefines leaf node IDs to 0 ... (N_TXQs - 1), which cannot be used for non-leaf nodes. This was done
as a result of some of your previous suggestions. I will actually add a new API (inline) function
rte_scheddev_node_is_leaf() to further clarify & enforce this rule.
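
A minimal sketch of what this helper could look like (assuming the leaf node
ID convention above, and that the number of TX queues can be read via
rte_eth_dev_info_get(); the exact prototype is not final):

	/* Leaf node IDs are predefined as 0 .. (N_TXQs - 1) */
	static inline int
	rte_scheddev_node_is_leaf(uint8_t port_id, uint32_t node_id)
	{
		struct rte_eth_dev_info dev_info;

		rte_eth_dev_info_get(port_id, &dev_info);
		return node_id < dev_info.nb_tx_queues;
	}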

> > +};
> > +
> > +/**
> > +  * Scheduler capabilities
> > +  */
> > +struct rte_scheddev_capabilities {
...<snip>
> > +	/**< Summary of node-level capabilities across all nodes. */
> > +	struct rte_scheddev_node_capabilities node;
> 
> This should be an array sized by the number of levels supported in the system.
> A non-leaf node at level 2 can have different capabilities than a level 3 node.
> 

My initial thinking was: to get the per-level capabilities, simply call the node-level
capability API for the first node (or any node) on that level. But now I see your point,
as this approach can only be used after the nodes have been added (since the ID of an existing node is required).

On the other hand, some implementations can be very flexible: they have a pool of nodes
that can be arranged into any topology, creating an arbitrary number of hierarchy levels
(i.e. distance from leaf to root), so the level notion is not that relevant anymore.

Therefore, I propose we add a new API function for the level-specific capability query (item 2 below,
prototype sketched after this list), so we arrive at the following set of capability related API functions:
1. rte_scheddev_capabilities_get(sched_params) = summary of capabilities for the entire hierarchy
2. rte_scheddev_level_capabilities_get(level_id, node_params) = summary of node capabilities for all nodes on the same level
3. rte_scheddev_node_capabilities_get(node_id, node_params) = summary of node capabilities for an existing node
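
For item 2, a sketch of the new prototype (mirroring the existing
rte_scheddev_node_capabilities_get(); the exact signature is TBD):

	int rte_scheddev_level_capabilities_get(uint8_t port_id,
		uint32_t level_id,
		struct rte_scheddev_node_capabilities *cap,
		struct rte_scheddev_error *error);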

Any concerns?

...<snip>

> > +  * Each leaf node sits on top of a TX queue of the current Ethernet port.
> > +  * Therefore, the leaf nodes are predefined with the node IDs of 0 .. (N-1),
> > +  * where N is the number of TX queues configured for the current Ethernet
> > +  * port. The non-leaf nodes have their IDs generated by the application.
> > +  */
> 
> 
> OK, that means 0 to N-1 is reserved for leaf nodes; the application will
> choose any value for non-leaf nodes?
> What will be the parent node ID for the root node?
> 

Yes!

For the parent node ID, see the definition at the top of the file:
	/**< Scheduler hierarchy root node ID */
	#define RTE_SCHEDDEV_ROOT_NODE_ID  UINT32_MAX
I will add a comment to the node_add() function as well to further clarify.

> > +struct rte_scheddev_node_params {
> > +	/**< Shaper profile for the private shaper. The absence of the private
> > +	 * shaper for the current node is indicated by setting this parameter
> > +	 * to RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE.
> > +	 */
> > +	uint32_t shaper_profile_id;
> > +
> > +	/**< User allocated array of valid shared shaper IDs. */
> > +	uint32_t *shared_shaper_id;
> > +
> > +	/**< Number of shared shaper IDs in the *shared_shaper_id* array. */
> > +	uint32_t n_shared_shapers;
> > +
> > +	union {
> > +		/**< Parameters only valid for non-leaf nodes. */
> > +		struct {
> > +			/**< For each priority, indicates whether the children
> > +			 * nodes sharing the same priority are to be scheduled
> > +			 * by WFQ or by WRR. When NULL, it indicates that WFQ
> > +			 * is to be used for all priorities. When non-NULL, it
> > +			 * points to a pre-allocated array of *n_priority*
> > +			 * elements, with a non-zero value element indicating
> > +			 * WFQ and a zero value element for WRR.
> > +			 */
> > +			int *scheduling_mode_per_priority;
> 
> What is the structure of the pointer element? Just a bool array?
> 

Yes, we decide between WFQ and WRR for each group of children with the same priority.
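
For example, for a node with 3 priority levels where priorities 0 and 1 use
WFQ and priority 2 uses WRR, the array could be set up as follows (a sketch,
assuming *params* is the struct rte_scheddev_node_params being filled in):

	int sched_mode[3] = {1, 1, 0}; /* non-zero = WFQ, zero = WRR */

	params.nonleaf.scheduling_mode_per_priority = sched_mode;
	params.nonleaf.n_priorities = 3;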

> > +
> > +			/**< Number of priorities. */
> > +			uint32_t n_priorities;
> > +		} nonleaf;
> > +
> > +		/**< Parameters only valid for leaf nodes. */
> > +		struct {
> > +			/**< Congestion management mode */
> > +			enum rte_scheddev_cman_mode cman;
> > +
> > +			/**< WRED parameters (valid when *cman* is WRED). */
> > +			struct {
> > +				/**< WRED profile for private WRED context. */
> > +				uint32_t wred_profile_id;
> > +
> > +				/**< User allocated array of shared WRED context
> > +				 * IDs. The absence of a private WRED context
> > +				 * for current leaf node is indicated by value
> > +				 * RTE_SCHEDDEV_WRED_PROFILE_ID_NONE.
> > +				 */
> > +				uint32_t *shared_wred_context_id;
> > +
> > +				/**< Number of shared WRED context IDs in the
> > +				 * *shared_wred_context_id* array.
> > +				 */
> > +				uint32_t n_shared_wred_contexts;
> > +			} wred;
> > +		} leaf;
> 
> need a bool is_leaf here to differentiate between leaf and non-leaf node.

This data structure is only used by the node_add() function. Whether the
current node is leaf or non-leaf is decided by the node_id parameter of the
node_add() function (as discussed above regarding the leaf node ID convention
enforced by the API), so IMO adding is_leaf to the parameter structure is
redundant.

> > +/**
> > + * Scheduler node capabilities get
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param node_id
> > + *   Node ID. Needs to be valid.
> > + * @param cap
> > + *   Scheduler node capabilities. Needs to be pre-allocated and valid.
> > + * @param error
> > + *   Error details. Filled in only on error, when not NULL.
> > + * @return
> > + *   0 on success, non-zero error code otherwise.
> > + */
> > +int rte_scheddev_node_capabilities_get(uint8_t port_id,
> > +	uint32_t node_id,
> > +	struct rte_scheddev_node_capabilities *cap,
> > +	struct rte_scheddev_error *error);
> > +
> 
> Node capabilities are already part of scheddev_capabilities?
> 
> What are you expecting to be different here? Unless you support different
> capabilities for each level, this may not be useful.
> 

Yes, you're right, I will merge stats capability query into the capability API.

> > +/**
> > + * Scheduler node add
> > + *
> > + * When *node_id* is not a valid node ID, a new node with this ID is created
> > + * and connected as child to the existing node identified by *parent_node_id*.
> > + *
> > + * When *node_id* is a valid node ID, this node is disconnected from its
> > + * current parent and connected as child to another existing node identified
> > + * by *parent_node_id*.
> > + *
> > + * This function can be called during port initialization phase (before the
> > + * Ethernet port is started) for building the scheduler start-up hierarchy.
> > + * Subject to the specific Ethernet port supporting on-the-fly scheduler
> > + * hierarchy updates, this function can also be called during run-time (after
> > + * the Ethernet port is started).
> 
> This should be a capability, whether dynamic_hierarchy_updates are
> supported or not.
> 

Agreed, will add.

> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param node_id
> > + *   Node ID
> > + * @param parent_node_id
> > + *   Parent node ID. Needs to be valid.
> 
> What will be the parent node ID for the root node? How is the root node
> created on the Ethernet port?
> 

The first node added to the hierarchy should have its parent set to
RTE_SCHEDDEV_ROOT_NODE_ID and becomes the _real_root_ node; all
the other nodes should be added as children of the _real_root_ node or its
descendants.

Will add comment to node_add() function to further clarify this.
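
For example, a minimal hierarchy could be built as follows (node ID 100 for
the non-leaf root is just an illustrative value above the TX queue range):

	struct rte_scheddev_error error;
	struct rte_scheddev_node_params params = {
		.shaper_profile_id = RTE_SCHEDDEV_SHAPER_PROFILE_ID_NONE,
	};

	/* The _real_root_ node, child of RTE_SCHEDDEV_ROOT_NODE_ID */
	rte_scheddev_node_add(port_id, 100, RTE_SCHEDDEV_ROOT_NODE_ID,
		0, 1, &params, &error);

	/* Leaf node 0 (TX queue 0), child of the _real_root_ node */
	rte_scheddev_node_add(port_id, 0, 100, 0, 1, &params, &error);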

> > + * @param priority
> > + *   Node priority. The highest node priority is zero. Used by the SP
> > + *   algorithm running on the parent of the current node for scheduling
> > + *   this child node.
> > + * @param weight
> > + *   Node weight. The node weight is relative to the weight sum of all
> > + *   siblings that have the same priority. The lowest weight is one. Used
> > + *   by the WFQ/WRR algorithm running on the parent of the current node
> > + *   for scheduling this child node.
> > + * @param params
> > + *   Node parameters. Needs to be pre-allocated and valid.
> > + * @param error
> > + *   Error details. Filled in only on error, when not NULL.
> > + * @return
> > + *   0 on success, non-zero error code otherwise.
> > + */
> > +int rte_scheddev_node_add(uint8_t port_id,
> > +	uint32_t node_id,
> > +	uint32_t parent_node_id,
> > +	uint32_t priority,
> > +	uint32_t weight,
> > +	struct rte_scheddev_node_params *params,
> > +	struct rte_scheddev_error *error);
> > +

...<snip>
> > +/**
> > + * Scheduler node parent update
> > + *
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param node_id
> > + *   Node ID. Needs to be valid.
> > + * @param parent_node_id
> > + *   Node ID for the new parent. Needs to be valid.
> > + * @param priority
> > + *   Node priority. The highest node priority is zero. Used by the SP
> > + *   algorithm running on the parent of the current node for scheduling
> > + *   this child node.
> > + * @param weight
> > + *   Node weight. The node weight is relative to the weight sum of all
> > + *   siblings that have the same priority. The lowest weight is one. Used
> > + *   by the WFQ/WRR algorithm running on the parent of the current node
> > + *   for scheduling this child node.
> > + * @param error
> > + *   Error details. Filled in only on error, when not NULL.
> > + * @return
> > + *   0 on success, non-zero error code otherwise.
> > + */
> > +int rte_scheddev_node_parent_update(uint8_t port_id,
> > +	uint32_t node_id,
> > +	uint32_t parent_node_id,
> > +	uint32_t priority,
> > +	uint32_t weight,
> > +	struct rte_scheddev_error *error);
> > +
> 
> The usage is not clear. How is it different from the node_add API?
> Is the intention to update a specific node, or to change the connection of a
> specific node to an existing or new parent?
> 

Yes.

The node_add() API function should be called only to create nodes that do
not exist yet. Basically, the provided node_id should not be in use when
node_add() is called. I will update the function description (sorry, I thought
I had already documented this, but it looks like it slipped through somehow).

All the node_xyz_update() API functions should be called only on nodes
that already exist, typically for run-time updates.

IMO this convention provides a clear way to differentiate between the creation
vs. post-creation update mechanisms.
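
For example (IDs illustrative; *params* and *error* as in the node_add()
example above):

	/* Create node 200 under parent 100 (node ID 200 not in use yet) */
	rte_scheddev_node_add(port_id, 200, 100, 0, 1, &params, &error);

	/* Run-time update: reconnect existing node 200 under parent 101,
	 * with new priority 1 and weight 2
	 */
	rte_scheddev_node_parent_update(port_id, 200, 101, 1, 2, &error);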

...<snip>

Regards,
Cristian


