[dpdk-dev] [RFC] libeventdev: event driven programming model framework for DPDK

Jerin Jacob jerin.jacob at caviumnetworks.com
Tue Aug 9 03:01:41 CEST 2016


Hi All,

Find below an RFC API specification which attempts to
define the standard application programming interface
for event driven programming in DPDK and to abstract HW based event devices.

These devices can support event scheduling and flow ordering
in HW and are typically found in NW SoCs as an integrated device or
as a PCI EP device.

The RFC APIs are inspired by the existing ethernet and crypto device frameworks.
The following requirements were considered when defining the RFC API.

1) APIs similar to existing Ethernet and crypto API framework for
    ○ Device creation, device Identification and device configuration
2) Enumerate libeventdev resources as numbers(0..N) to
    ○ Avoid ABI issues with handles
    ○ An event device may have millions of flow queues, so it is not practical
    to have handles for each flow queue and the associated name-based lookup
    in the multi-process case
3) Avoid struct mbuf changes
4) APIs to
    ○ Enumerate eventdev driver capabilities and resources
    ○ Enqueue events from l-core
    ○ Schedule events
    ○ Synchronize events
    ○ Maintain ingress order of the events
    ○ Run to completion support

Find below the URL for the complete API specification.

https://rawgit.com/jerinjacobk/libeventdev/master/rte_eventdev.h

I have created a supporting document to share the concepts of the
event driven programming model and the proposed API details to give
the specification better reach.
The presentation covers an introduction to event driven programming model concepts,
the characteristics of hardware-based event manager devices,
the RFC API proposal, an example use case, and the benefits of using the event driven programming model.

Find below the URL for the supporting document.

https://rawgit.com/jerinjacobk/libeventdev/master/DPDK-event_driven_programming_framework.pdf

git repo for the above documents:

https://github.com/jerinjacobk/libeventdev/

Looking forward to getting comments from both the application and the driver
implementation perspectives.

What follows is the text version of the above documents, for inline comments and discussion.
I intend to update that specification accordingly.

/**
 * Get the total number of event devices that have been successfully
 * initialised.
 *
 * @return
 *   The total number of usable event devices.
 */
extern uint8_t
rte_eventdev_count(void);

/**
 * Get the device identifier for the named event device.
 *
 * @param name
 *   Event device name to select the event device identifier.
 *
 * @return
 *   Returns event device identifier on success.
 *   - <0: Failure to find named event device.
 */
extern int
rte_eventdev_get_dev_id(const char *name);

/**
 * Return the NUMA socket to which a device is connected.
 *
 * @param dev_id
 *   The identifier of the device.
 * @return
 *   The NUMA socket id to which the device is connected or
 *   a default of zero if the socket could not be determined.
 *   - -1: dev_id value is out of range.
 */
extern int
rte_eventdev_socket_id(uint8_t dev_id);
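
For illustration, a minimal device discovery sketch using the three functions
above; the device name string and the app_* names are hypothetical:

static int app_socket_id;	/* hypothetical application state */

static int
app_find_eventdev(void)
{
	int dev_id;

	if (rte_eventdev_count() == 0)
		return -1;

	/* "event_hw0" is a hypothetical device name */
	dev_id = rte_eventdev_get_dev_id("event_hw0");
	if (dev_id < 0)
		return dev_id;

	/* the socket id can drive NUMA aware allocations for this device */
	app_socket_id = rte_eventdev_socket_id(dev_id);
	return dev_id;
}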

/**  Event device information */
struct rte_eventdev_info {
	const char *driver_name;	/**< Event driver name */
	struct rte_pci_device *pci_dev;	/**< PCI information */
	uint32_t min_sched_wait_ns;
	/**< Minimum supported scheduler wait delay in ns by this device */
	uint32_t max_sched_wait_ns;
	/**< Maximum supported scheduler wait delay in ns by this device */
	uint32_t sched_wait_ns;
	/**< Configured scheduler wait delay in ns of this device */
	uint32_t max_flow_queues_log2;
	/**< LOG2 of maximum flow queues supported by this device */
	uint8_t  max_sched_groups;
	/**< Maximum schedule groups supported by this device */
	uint8_t  max_sched_group_priority_levels;
	/**< Maximum schedule group priority levels supported by this device */
};

/**
 * Retrieve the contextual information of an event device.
 *
 * @param dev_id
 *   The identifier of the device.
 * @param[out] dev_info
 *   A pointer to a structure of type *rte_eventdev_info* to be filled with the
 *   contextual information of the device.
 */
extern void
rte_eventdev_info_get(uint8_t dev_id, struct rte_eventdev_info *dev_info);

/** Event device configuration structure */
struct rte_eventdev_config {
	uint32_t sched_wait_ns;
	/**< rte_event_schedule() wait for *sched_wait_ns* ns on this device */
	uint32_t nb_flow_queues_log2;
	/**< LOG2 of the number of flow queues to configure on this device */
	uint8_t  nb_sched_groups;
	/**< The number of schedule groups to configure on this device */
};

/**
 * Configure an event device.
 *
 * This function must be invoked first before any other function in the
 * API. This function can also be re-invoked when a device is in the
 * stopped state.
 *
 * The caller may use rte_eventdev_info_get() to get the capabilities of the
 * resources available in this event device.
 *
 * @param dev_id
 *   The identifier of the device to configure.
 * @param config
 *   The event device configuration structure.
 *
 * @return
 *   - 0: Success, device configured.
 *   - <0: Error code returned by the driver configuration function.
 */
extern int
rte_eventdev_configure(uint8_t dev_id, struct rte_eventdev_config *config);
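
For illustration, a minimal configuration sketch based on rte_eventdev_info
and rte_eventdev_config; the requested sizes are arbitrary example values,
not recommendations:

static int
app_eventdev_configure(uint8_t dev_id)
{
	struct rte_eventdev_info info;
	struct rte_eventdev_config cfg;

	rte_eventdev_info_get(dev_id, &info);

	/* example sizing, clamped to the device capabilities */
	cfg.nb_flow_queues_log2 = RTE_MIN(16U, info.max_flow_queues_log2);
	cfg.nb_sched_groups = RTE_MIN(2, info.max_sched_groups);
	cfg.sched_wait_ns = info.min_sched_wait_ns;

	return rte_eventdev_configure(dev_id, &cfg);
}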


#define RTE_EVENT_SCHED_GRP_PRI_HIGHEST	0
/**< Highest schedule group priority */
#define RTE_EVENT_SCHED_GRP_PRI_NORMAL	128
/**< Normal schedule group priority */
#define RTE_EVENT_SCHED_GRP_PRI_LOWEST	255
/**< Lowest schedule group priority */

struct rte_eventdev_sched_group_conf {
	rte_cpuset_t lcore_list;
	/**< List of l-cores that have membership in this schedule group */
	uint8_t priority;
	/**< Priority for this schedule group relative to other schedule groups.
	     If the requested *priority* cannot be mapped directly onto the event
	     device's *max_sched_group_priority_levels*, then the event driver
	     can normalize it to a supported priority value in the range of
	     [RTE_EVENT_SCHED_GRP_PRI_HIGHEST, RTE_EVENT_SCHED_GRP_PRI_LOWEST] */
	uint8_t enable_all_lcores;
	/**< Ignore *lcore_list* and enable all the l-cores */
};

/**
 * Allocate and set up a schedule group for an event device.
 *
 * @param dev_id
 *   The identifier of the device.
 * @param group_id
 *   The index of the schedule group to setup. The value must be in the range
 *   [0, nb_sched_groups - 1] previously supplied to rte_eventdev_configure().
 * @param group_conf
 *   The pointer to the configuration data to be used for the schedule group.
 *   NULL value is allowed, in which case the default configuration is used.
 * @param socket_id
 *   The *socket_id* argument is the socket identifier in case of NUMA.
 *   The value can be *SOCKET_ID_ANY* if there is no NUMA constraint for the
 *   DMA memory allocated for the schedule group.
 *
 * @return
 *   - 0: Success, schedule group correctly set up.
 *   - <0: Schedule group configuration failed
 */
extern int
rte_eventdev_sched_group_setup(uint8_t dev_id, uint8_t group_id,
		const struct rte_eventdev_sched_group_conf *group_conf,
		int socket_id);
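
A possible schedule group setup sketch; it assumes rte_cpuset_t can be
manipulated with the standard CPU_ZERO()/CPU_SET() macros and places all
worker l-cores into group 0 at normal priority:

static int
app_sched_group_setup(uint8_t dev_id)
{
	struct rte_eventdev_sched_group_conf grp;
	unsigned int lcore;

	CPU_ZERO(&grp.lcore_list);
	RTE_LCORE_FOREACH_SLAVE(lcore)
		CPU_SET(lcore, &grp.lcore_list);
	grp.priority = RTE_EVENT_SCHED_GRP_PRI_NORMAL;
	grp.enable_all_lcores = 0;

	return rte_eventdev_sched_group_setup(dev_id, 0, &grp, SOCKET_ID_ANY);
}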

/**
 * Get the number of schedule groups on a specific event device
 *
 * @param dev_id
 *   Event device identifier.
 * @return
 *   - The number of configured schedule groups
 */
extern uint16_t
rte_eventdev_sched_group_count(uint8_t dev_id);

/**
 * Get the priority of the schedule group on a specific event device
 *
 * @param dev_id
 *   Event device identifier.
 * @param group_id
 *   Schedule group identifier.
 * @return
 *   - The configured priority of the schedule group in
 *     [RTE_EVENT_SCHED_GRP_PRI_HIGHEST, RTE_EVENT_SCHED_GRP_PRI_LOWEST] range
 */
extern uint8_t
rte_eventdev_sched_group_priority(uint8_t dev_id, uint8_t group_id);

/**
 * Get the configured flow queue id mask of a specific event device
 *
 * *flow_queue_id_mask* can be used to generate a *flow_queue_id* value in the
 * range [0, (2^max_flow_queues_log2) - 1] of a specific event device.
 * The *flow_queue_id* value is used in the event enqueue operation, and the
 * scheduled event's *flow_queue_id* can be compared against the enqueued value.
 *
 * @param dev_id
 *   Event device identifier.
 * @return
 *   - The configured flow queue id mask
 */
extern uint32_t
rte_eventdev_flow_queue_id_mask(uint8_t dev_id);

/**
 * Start an event device.
 *
 * The device start step is the last one and consists of setting the schedule
 * groups and flow queues to start accepting events and scheduling them to l-cores.
 *
 * On success, all basic functions exported by the API (event enqueue,
 * event schedule and so on) can be invoked.
 *
 * @param dev_id
 *   Event device identifier
 * @return
 *   - 0: Success, device started.
 *   - <0: Error code of the driver device start function.
 */
extern int
rte_eventdev_start(uint8_t dev_id);

/**
 * Stop an event device. The device can be restarted with a call to
 * rte_eventdev_start()
 *
 * @param dev_id
 *   Event device identifier.
 */
extern void
rte_eventdev_stop(uint8_t dev_id);

/**
 * Close an event device. The device cannot be restarted!
 *
 * @param dev_id
 *   Event device identifier
 *
 * @return
 *  - 0 on successfully closing device
 *  - <0 on failure to close device
 */
extern int
rte_eventdev_close(uint8_t dev_id);


/* Scheduler synchronization method */

#define RTE_SCHED_SYNC_ORDERED		0
/**< Ordered flow queue synchronization
 *
 * Events from an ordered flow queue can be scheduled to multiple l-cores for
 * concurrent processing while maintaining the original event order. This
 * scheme enables the user to achieve high single flow throughput by avoiding
 * SW synchronization for ordering between l-cores.
 *
 * The source flow queue ordering is maintained when events are enqueued to
 * their destination queue(s) within the same ordered queue synchronization
 * context. An l-core holds the context until it requests another event from the
 * scheduler, which implicitly releases the context. The user may allow the
 * scheduler to release the context earlier than that by calling
 * rte_event_schedule_release().
 *
 * Events from the source flow queue appear in their original order when
 * dequeued from a destination flow queue irrespective of its
 * synchronization method. Event ordering is based on the received event(s),
 * but other (newly allocated or stored) events are also ordered when enqueued
 * within the same ordered context. Events not enqueued (e.g. freed or stored)
 * within the context are considered missing from reordering and are skipped at
 * this time (but can be ordered again within another context).
 *
 */

#define RTE_SCHED_SYNC_ATOMIC		1
/**< Atomic flow queue synchronization
 *
 * Events from an atomic flow queue can be scheduled only to a single l-core at
 * a time. The l-core is guaranteed to have exclusive (atomic) access to the
 * associated flow queue context, which enables the user to avoid SW
 * synchronization. Atomic flow queue also helps to maintain event ordering
 * since only one l-core at a time is able to process events from a flow queue.
 *
 * The atomic queue synchronization context is dedicated to the l-core until it
 * requests another event from the scheduler, which implicitly releases the
 * context. The user may allow the scheduler to release the context earlier than
 * that by calling rte_event_schedule_release().
 *
 */

#define RTE_SCHED_SYNC_PARALLEL		2
/**< Parallel flow queue
 *
 * The scheduler performs priority scheduling, load balancing and other such
 * functions, but does not provide additional event synchronization or ordering.
 * It is free to schedule events from a single parallel queue to multiple l-cores
 * for concurrent processing. The application is responsible for flow queue
 * context synchronization and event ordering (SW synchronization).
 *
 */

/* Event types to classify the event source */

#define RTE_EVENT_TYPE_ETHDEV		0x0
/**< The event generated from ethdev subsystem */
#define RTE_EVENT_TYPE_CRYPTODEV	0x1
/**< The event generated from cryptodev subsystem */
#define RTE_EVENT_TYPE_TIMERDEV		0x2
/**< The event generated from timerdev subsystem */
#define RTE_EVENT_TYPE_LCORE		0x3
/**< The event generated from l-core. Application may use *sub_event_type*
 * to further classify the event */
#define RTE_EVENT_TYPE_INVALID		0xf
/**< Invalid event type */
#define RTE_EVENT_TYPE_MAX		0x16

/** The generic rte_event structure to hold the event attributes */
struct rte_event {
	union {
		uint64_t u64;
		struct {
			uint32_t flow_queue_id;
			/**< Flow queue identifier to choose the flow queue in
			 * enqueue and schedule operation.
			 * The value must be in the range of
			 * rte_eventdev_flow_queue_id_mask() */
			uint8_t  sched_group_id;
			/**< Schedule group identifier to choose the schedule
			 * group in enqueue and schedule operation.
			 * The value must be in the range
			 * [0, nb_sched_groups - 1] previously supplied to
			 * rte_eventdev_configure(). */
			uint8_t  sched_sync;
			/**< Scheduler synchronization method associated
			 * with flow queue for enqueue and schedule operation */
			uint8_t  event_type;
			/**< Event type to classify the event source  */
			uint8_t  sub_event_type;
			/**< Sub-event types based on the event source */
		};
	};
	union {
		uintptr_t event;
		/**< Opaque event pointer */
		struct rte_mbuf *mbuf;
		/**< mbuf pointer if the scheduled event is associated with mbuf */
	};
};

/**
 *
 * Enqueue the event object supplied in the *rte_event* structure on the flow
 * queue identified by *flow_queue_id*, associated with the schedule group
 * *sched_group_id*, the scheduler synchronization method and the event type,
 * on an event device designated by its *dev_id*.
 *
 * @param dev_id
 *   Event device identifier.
 * @param ev
 *   Pointer to struct rte_event
 * @return
 *  - 0 on success
 *  - <0 on failure
 */
extern int
rte_eventdev_enqueue(uint8_t dev_id, struct rte_event *ev);
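
A usage sketch that wraps a received packet into an event; the flow_hash input
is assumed to come from the application's own classification step:

static int
app_enqueue_pkt(uint8_t dev_id, struct rte_mbuf *m, uint32_t flow_hash)
{
	struct rte_event ev;

	ev.flow_queue_id = flow_hash & rte_eventdev_flow_queue_id_mask(dev_id);
	ev.sched_group_id = 0;			/* must be < nb_sched_groups */
	ev.sched_sync = RTE_SCHED_SYNC_ORDERED;
	ev.event_type = RTE_EVENT_TYPE_LCORE;	/* SW generated event */
	ev.sub_event_type = 0;
	ev.mbuf = m;

	return rte_eventdev_enqueue(dev_id, &ev);
}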

/**
 * Enqueue a burst of events objects supplied in *rte_event* structure
 * on an event device designated by its *dev_id*.
 *
 * The rte_eventdev_enqueue_burst() function is invoked to enqueue
 * multiple event objects. It is the burst variant of the rte_eventdev_enqueue()
 * function.
 *
 * The *num* parameter is the number of event objects to enqueue which are
 * supplied in the *ev* array of *rte_event* structure.
 *
 * The rte_eventdev_enqueue_burst() function returns the number of
 * event objects it actually enqueued. A return value equal to
 * *num* means that all event objects have been enqueued.
 *
 * @param dev_id
 *   The identifier of the device.
 * @param ev
 *   The address of an array of *num* pointers to *rte_event* structure
 *   which contain the event object enqueue operations to be processed.
 * @param num
 *   The number of event objects to enqueue
 *
 * @return
 * The number of event objects actually enqueued on the event device. The return
 * value can be less than the value of the *num* parameter when the
 * event device's flow queue is full or if invalid parameters are specified in
 * a *rte_event*. If the return value is less than *num*, the remaining events at
 * the end of ev[] are not consumed, and the caller has to take care of them.
 */
extern int
rte_eventdev_enqueue_burst(uint8_t dev_id, struct rte_event *ev[], int num);
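
A sketch of the retry handling implied by the return value semantics above
(a busy retry is just one possible policy):

static void
app_enqueue_all(uint8_t dev_id, struct rte_event *burst[], int nb_total)
{
	int nb_sent = rte_eventdev_enqueue_burst(dev_id, burst, nb_total);

	/* events at the tail of burst[] were not consumed; retry them */
	while (nb_sent < nb_total)
		nb_sent += rte_eventdev_enqueue_burst(dev_id, &burst[nb_sent],
					nb_total - nb_sent);
}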

/**
 * Schedule an event to the caller l-core from the event device designated by
 * its *dev_id*.
 *
 * rte_event_schedule() does not dictate the specifics of the scheduling algorithm,
 * as each eventdev driver may have different criteria to schedule an event.
 * However, in general, from an application perspective the scheduler may use the
 * following scheme to dispatch an event to an l-core:
 *
 * 1) Selection of the schedule group
 *   a) The number of schedule groups available in the event device
 *   b) The caller l-core's membership in the schedule group
 *   c) The schedule group's priority relative to other schedule groups
 * 2) Selection of the flow queue and event
 *   a) The number of flow queues available in the event device
 *   b) The scheduler synchronization method associated with the flow queue
 *
 * On successful scheduler event dispatch, the caller l-core holds the scheduler
 * synchronization context associated with the dispatched event; an explicit
 * rte_event_schedule_release() or rte_event_schedule_ctxt_*() or the next
 * rte_event_schedule() call releases the context
 *
 * @param dev_id
 *   The identifier of the device.
 * @param[out] ev
 *   Pointer to struct rte_event. On successful event dispatch, the implementation
 *   updates the event attributes
 * @param wait
 *   When true, wait for an event until one is available or for the *sched_wait_ns*
 *   ns previously supplied to rte_eventdev_configure()
 *
 * @return
 * When true, a valid event has been dispatched by the scheduler.
 *
 */
extern bool
rte_event_schedule(uint8_t dev_id, struct rte_event *ev, bool wait);
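
A minimal worker loop sketch; app_process_stage() is a hypothetical
application stage handler:

static void
app_worker_loop(uint8_t dev_id)
{
	struct rte_event ev;

	while (1) {
		/* blocks for at most sched_wait_ns when wait == true */
		if (!rte_event_schedule(dev_id, &ev, true))
			continue;

		app_process_stage(&ev);		/* hypothetical stage handler */
		/* forward to the next stage, or transmit at the last stage */
		rte_eventdev_enqueue(dev_id, &ev);
	}
}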

/**
 * Schedule an event to the caller l-core from a specific schedule group
 * *group_id* of event device designated by its *dev_id*.
 *
 * Like rte_event_schedule(), but schedule group provided as argument *group_id*
 *
 * @param dev_id
 *   The identifier of the device.
 * @param group_id
 *   Schedule group identifier to select the schedule group for event dispatch
 * @param[out] ev
 *   Pointer to struct rte_event. On successful event dispatch, the implementation
 *   updates the event attributes
 * @param wait
 *   When true, wait for an event until one is available or for the *sched_wait_ns*
 *   ns previously supplied to rte_eventdev_configure()
 *
 * @return
 * When true, a valid event has been dispatched by the scheduler.
 *
 */
extern bool
rte_event_schedule_from_group(uint8_t dev_id, uint8_t group_id,
				struct rte_event *ev, bool wait);

/**
 * Release the current scheduler synchronization context associated with the
 * scheduler dispatched event
 *
 * If current scheduler synchronization context method is *RTE_SCHED_SYNC_ATOMIC*
 * then this function hints the scheduler that the user has completed critical
 * section processing in the current atomic context.
 * The scheduler is now allowed to schedule events from the same flow queue to
 * another l-core.
 * Early atomic context release may increase parallelism and thus system
 * performance, but the user needs to carefully design the split into critical
 * vs. non-critical sections.
 *
 * If current scheduler synchronization context method is *RTE_SCHED_SYNC_ORDERED*
 * then this function hints the scheduler that the user has done all enqueues
 * that need to maintain event order in the current ordered context.
 * The scheduler is allowed to release the ordered context of this l-core and
 * avoid reordering any following enqueues.
 * Early ordered context release may increase parallelism and thus system
 * performance, since scheduler may start reordering events sooner than the next
 * schedule call.
 *
 * If current scheduler synchronization context method is *RTE_SCHED_SYNC_PARALLEL*
 * then this function is a no-op.
 *
 * @param dev_id
 *   The identifier of the device.
 *
 */
extern void
rte_event_schedule_release(uint8_t dev_id);
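
An early release sketch for the atomic case; update_sequence_number() and
do_non_critical_work() are hypothetical application functions:

static void
app_atomic_stage(uint8_t dev_id)
{
	struct rte_event ev;

	if (!rte_event_schedule(dev_id, &ev, true))
		return;

	if (ev.sched_sync == RTE_SCHED_SYNC_ATOMIC) {
		update_sequence_number(&ev);	/* critical section, per flow queue */
		/* critical section done; let the scheduler hand this flow
		 * queue to another l-core while the rest runs here */
		rte_event_schedule_release(dev_id);
		do_non_critical_work(&ev);
	}
}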

/**
 * Update the current schedule context associated with caller l-core
 *
 * rte_event_schedule_ctxt_update() can be used to support run-to-completion
 * model where the application requires the current *event* to stay on the same
 * l-core as it moves through the series of processing stages, provided the
 * event type is *RTE_EVENT_TYPE_LCORE*.
 *
 * In the context of the run-to-completion model, rte_eventdev_enqueue()
 * and its associated rte_event_schedule() can be replaced by
 * rte_event_schedule_ctxt_update() if the caller requires the current event to
 * stay on the caller l-core with a new *flow_queue_id* and/or a new *sched_sync*
 * and/or a new *sub_event_type* value
 *
 * All of the arguments should be equal to their current schedule context values
 * unless the application needs the dispatcher to modify the event attributes
 * of a dispatched event.
 *
 * rte_event_schedule_ctxt_update() is a costly operation; splitting it into two
 * functions (rte_event_schedule_ctxt_update() and rte_event_schedule_ctxt_wait())
 * allows the caller to overlap the context update latency with other profitable
 * work
 *
 * @param dev_id
 *   The identifier of the device.
 * @param flow_queue_id
 *   The new flow queue identifier
 * @param sched_sync
 *   The new schedule synchronization method
 * @param sub_event_type
 *   The new sub_event_type where event_type == RTE_EVENT_TYPE_LCORE
 * @param wait
 *   When true, wait until context update completes
 *   When false, the request to update the attributes may optionally start an
 *   operation that may not have finished when this function returns.
 *   In that case, this function returns '1' to indicate that the application must
 *   call rte_event_schedule_ctxt_wait() before proceeding with an
 *   operation that requires the completion of the requested event attribute
 *   change
 * @return
 *  - <0 on failure
 *  - 0 if the event attribute update operation has completed.
 *  - 1 if the event attribute update operation has begun asynchronously.
 *
 */
extern int
rte_event_schedule_ctxt_update(uint8_t dev_id, uint32_t flow_queue_id,
		uint8_t  sched_sync, uint8_t sub_event_type, bool wait);

/**
 * Wait for l-core associated event update operation to complete on the
 * event device designated by its *dev_id*.
 *
 * The caller l-core waits until a previously started event attribute update
 * operation from the same l-core completes
 *
 * This function is invoked when rte_event_schedule_ctxt_update() returns '1'
 *
 * @param dev_id
 *   The identifier of the device.
 */
extern void
rte_event_schedule_ctxt_wait(uint8_t dev_id);
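
A sketch of overlapping the asynchronous context update with other work;
APP_STATE_SEQ_UPDATE, do_critical_section_work() and flow_queue_id_mask follow
the IPSec example later in this document, and prefetch_sa_state() is hypothetical:

static int
app_move_to_atomic_phase(uint8_t dev_id, uint32_t sa, uint32_t flow_queue_id_mask)
{
	int ret;

	/* keep the current event on this l-core and move it to the atomic phase */
	ret = rte_event_schedule_ctxt_update(dev_id, sa & flow_queue_id_mask,
			RTE_SCHED_SYNC_ATOMIC, APP_STATE_SEQ_UPDATE, false);
	if (ret < 0)
		return ret;
	if (ret == 1) {
		prefetch_sa_state(sa);		/* useful work while the update completes */
		rte_event_schedule_ctxt_wait(dev_id);
	}
	do_critical_section_work(sa);		/* now inside the atomic context */
	return 0;
}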

/**
 * Join the caller l-core to a schedule group *group_id* of the event device
 * designated by its *dev_id*.
 *
 * l-core membership in the schedule group can be configured with
 * rte_eventdev_sched_group_setup() prior to rte_eventdev_start()
 *
 * @param dev_id
 *   The identifier of the device.
 * @param group_id
 *   Schedule group identifier to select the schedule group to join
 *
 * @return
 *  - 0 on success
 *  - <0 on failure
 */
extern int
rte_event_schedule_group_join(uint8_t dev_id, uint8_t group_id);

/**
 * Leave the caller l-core from a schedule group *group_id* of the event device
 * designated by its *dev_id*.
 *
 * This function will unsubscribe the calling l-core from receiving events from
 * the specified schedule group *group_id*
 *
 * l-core membership in the schedule group can be configured with
 * rte_eventdev_sched_group_setup() prior to rte_eventdev_start()
 *
 * @param dev_id
 *   The identifier of the device.
 * @param group_id
 *   Schedule group identifier to select the schedule group to leave
 *
 * @return
 *  - 0 on success
 *  - <0 on failure
 */
extern int
rte_event_schedule_group_leave(uint8_t dev_id, uint8_t group_id);
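
A sketch of runtime scaling with join/leave; drain_high_prio_events() is a
hypothetical application function:

static void
app_help_high_prio_group(uint8_t dev_id, uint8_t high_prio_group)
{
	/* temporarily add this l-core to a busy, higher priority group,
	 * then restore the original membership */
	if (rte_event_schedule_group_join(dev_id, high_prio_group) != 0)
		return;
	drain_high_prio_events(dev_id);
	rte_event_schedule_group_leave(dev_id, high_prio_group);
}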


*************** text version of the presentation document ************************

Agenda
Event driven programming model concepts in data plane perspective
Characteristics of HW based event manager devices
libeventdev
Example use case - Simple IPSec outbound processing
Benefits of event driven programming model
Future work


Event driven programming model - Concepts
An event is an asynchronous notification from HW/SW to a CPU core
Typical examples of events in the dataplane are
Packets from an ethernet device
Crypto work completion notification from crypto HW
Timer expiry notification from timer HW
A CPU generates an event to notify another CPU (used in pipeline mode)
Event driven programming is a programming paradigm in which the flow of the program is determined by events

[Diagram: HW/SW event sources (packet events, timer expiry events, crypto done events, SW events) are enqueued into queue 0..N; the scheduler dispatches the events from the queues to core 0..n.]
Packet events, timer expiry events and crypto work completion events are the typical HW generated events
A core can also produce a SW event to notify another core of work completion
Queues 0..N store the events
The scheduler schedules an event to a core
The core processes the event and enqueues it to another downstream queue for further processing, or sends the event/packet to the wire

Characteristics of HW based event device
Millions of flow queues
Events associated with a single flow queue can be scheduled on multiple CPUs for concurrent processing while maintaining the original event order
Provides synchronization of the events without SW lock schemes
Priority based scheduling to enable QoS
An event device may have 1 to N schedule groups
Each core can be a member of any subset of schedule groups
Each core decides which schedule group(s) it accepts events from
Schedule groups provide a means to execute different functions on different cores
Flow queues are grouped into schedule groups
Core to schedule group membership can be changed at runtime to support scaling and to reduce the latency of critical work by assigning more cores at runtime
The event scheduler is implemented in HW to save CPU cycles



libeventdev components
[Diagram: schedule group 0..n, each with a configured priority (x, y, z) and each containing flow queue 0..n. Producers (packet events, timer expiry events, crypto done events, SW events) enqueue with {grp_id, flow_queue_id, schedule_sync, event_type, event}; core 0..n receive {grp, flow_queue_id, schedule_sync, event_type, event} from schedule(). Each core has a schedule group bitmask to capture the list of schedule groups it participates in via schedule(), e.g. core 0's bitmask 100011 covers group 0, group 1 and group n, while core 1's bitmask 000001 covers group 0 only. The library exposes the API interface northbound and the eventdev driver interface southbound.]

libeventdev - flow
The event driver registers with the libeventdev subsystem and the subsystem provides a unique device id
The application gets the device capabilities with rte_eventdev_info_get(dev_id), like
The number of schedule groups
The number of flow queues in a schedule group
The application configures the event device and each schedule group in the event device, like
The number of schedule groups and flow queues required
The priority of each schedule group and the list of l-cores associated with it
Connect schedule groups with other HW event producers in the system like ethdev, crypto etc
In the fastpath,
HW/SW enqueues the events to flow queues associated with schedule groups
A core gets an event through the scheduler by invoking rte_event_schedule() from the l-core
The core processes the event and enqueues it to another downstream queue for further processing, or sends the event/packet to the wire if it is the last stage of processing
rte_event_schedule() schedules the event based on
Selection of the schedule group
The caller l-core's membership in the schedule group
The schedule group's priority relative to other schedule groups
Selection of the flow queue and the event inside the schedule group
The scheduler sync method associated with the flow queue (ATOMIC vs ORDERED/PARALLEL)



Schedule sync methods (How events are Synchronized)
PARALLEL
Events from a parallel flow queue can be scheduled to multiple cores for concurrent processing
Ingress order is not maintained
ATOMIC
Events from an atomic flow queue can be scheduled only to a single core at a time
Enables critical sections in packet processing, like sequence number updates etc
Ingress order is maintained as only one event is outstanding at a time
ORDERED
Events from an ordered flow queue can be scheduled to multiple cores for concurrent processing
Ingress order is maintained
Enables high single flow throughput



ORDERED flow queue for ingress ordering
[Diagram: events 1..6 arrive in order on an ORDERED flow queue; rte_event_schedule() dispatches them to cores processing the ordered events in parallel (e.g. as 4, 6, 3, 1, 2, 5); when enqueued to any downstream flow queue they appear again in the original order 1..6.]
The source ORDERED flow queue's ingress order shall be maintained when events are enqueued to any downstream flow queue

Use case (Simple IPSec Outbound processing) 

[Diagram: packets received on port 0..6 RX flow through
PHASE1: POLICY/SA and ROUTE lookup in parallel (ORDERED),
PHASE2: SEQ number update per SA (ATOMIC),
PHASE3: HW assisted IPSec crypto,
PHASE4: a core sends the encrypted packets to the Tx port queues (ATOMIC), towards port 0..6 TX.]

Packets are enqueued into one of up to 1M flow queues based on a classification criterion (e.g. 5 tuple hash)
PHASE1 generates a unique SA based on the input packet and SA tables.
Each SA flow is processed in parallel.
A core enqueues on an ATOMIC flow queue for critical section processing per SA
The core issues the IPSec crypto request to HW
The crypto HW processes the crypto operations in the background
The crypto HW sends the crypto work completion event to notify the core.

Simple IPSec Outbound processing - Cores View
Each core (core 0..n) runs the same loop:

while (1) {
    event = rte_event_schedule();
    /* process the specific phase */
    /* call the appropriate enqueue() to send to
     *   - an atomic flow queue
     *   - a crypto HW engine queue
     *   - a TX port queue */
}

[Diagram: RX packet HW enqueues one of millions of flows to the ORDERED flow queues; per SA, a core enqueues on an ATOMIC flow queue for the critical section phase of the flow; the core enqueues the crypto work to the HW crypto assist; on completion of the crypto work, HW generates the crypto work completion notification; finally cores enqueue the encrypted packets to the Tx port queues.]

API Requirements
APIs similar to existing ethernet and crypto API framework for
Device creation, device Identification and device configuration
Enumerate libeventdev resources as numbers (0..N) to
Avoid ABI issues with handles
An event device may have millions of flow queues, so it is not practical to have handles for each flow queue and the associated name-based lookup in the multi-process case
Avoid struct mbuf changes
APIs to
Enumerate eventdev driver capabilities and resources
Enqueue events from l-core
Schedule events
Synchronize events
Maintain ingress order of the events


API - Slow path
APIs similar to the existing ethernet and crypto API framework for
Device creation - Physical event devices are discovered during the PCI probe/enumeration performed by the EAL at DPDK initialization, based on their PCI device identifier, i.e. each unique PCI BDF (bus, device, function)
Device identification - A unique device index is used to designate the event device in all functions exported by the eventdev API.
Device capability discovery
rte_eventdev_info_get() - To get the global resources of the event device, like the number of schedule groups and the number of flow queues per schedule group
Device configuration
rte_eventdev_configure() - configures the number of schedule groups and the number of flow queues in the schedule groups
rte_eventdev_sched_group_setup() - configures schedule group specific configuration like the priority and the list of l-cores that have membership in the schedule group
Device state change - rte_eventdev_start()/stop()/close(), like an ethdev device



API - Fast path
bool rte_event_schedule(uint8_t dev_id, struct rte_event *ev, bool wait);
Schedule an event to the caller l-core from the event device designated by its dev_id
bool rte_event_schedule_from_group(uint8_t dev_id, uint8_t group_id, struct rte_event *ev, bool wait);
Like rte_event_schedule(), but the schedule group is provided as the argument group_id
void rte_event_schedule_release(uint8_t dev_id);
Release the current scheduler synchronization context associated with the scheduler dispatched event
int rte_event_schedule_group_[join/leave](uint8_t dev_id, uint8_t group_id);
Joins/leaves the caller l-core to/from a schedule group
int rte_event_schedule_ctxt_update(uint8_t dev_id, uint32_t flow_queue_id, uint8_t sched_sync, uint8_t sub_event_type, bool wait);
rte_event_schedule_ctxt_update() can be used to support the run-to-completion model where the application requires the current *event* to stay on the same l-core as it moves through the series of processing stages, provided the event type is RTE_EVENT_TYPE_LCORE




Fast path APIs - Simple IPSec outbound example
#define APP_STATE_SEQ_UPDATE 0

/* on each l-core */
{
        struct rte_event ev;
        uint32_t flow_queue_id_mask = rte_eventdev_flow_queue_id_mask(eventdev);
        uint32_t sa, tx_port, tx_queue_id;
        bool ret;

        while (1) {
                ret = rte_event_schedule(eventdev, &ev, true);
                if (!ret)
                        continue;

                if (ev.event_type == RTE_EVENT_TYPE_ETHDEV) {
                        /* packets from HW rx ports proceed in parallel per flow (ORDERED) */
                        sa = outbound_sa_lookup(ev.mbuf);
                        /* modify the packet per the SA attributes */
                        /* find the tx port and tx queue from the routing table */

                        /* move to the next phase (atomic seq number update per sa) */
                        ev.flow_queue_id = sa & flow_queue_id_mask;
                        ev.sched_sync = RTE_SCHED_SYNC_ATOMIC;
                        ev.sub_event_type = APP_STATE_SEQ_UPDATE;
                        rte_eventdev_enqueue(eventdev, &ev);
                } else if (ev.event_type == RTE_EVENT_TYPE_LCORE &&
                           ev.sub_event_type == APP_STATE_SEQ_UPDATE) {
                        sa = ev.flow_queue_id;
                        /* do critical section work per sa */
                        do_critical_section_work(sa);

                        /* Issue the crypto request and generate the following
                         * event on crypto work completion */
                        ev.flow_queue_id = tx_port;
                        ev.sub_event_type = tx_queue_id;
                        ev.sched_sync = RTE_SCHED_SYNC_ATOMIC;
                        rte_cryptodev_event_enqueue(cryptodev, ev.mbuf, eventdev, ev);
                } else if (ev.event_type == RTE_EVENT_TYPE_CRYPTODEV) {
                        tx_port = ev.flow_queue_id;
                        tx_queue_id = ev.sub_event_type;
                        /* send the packet to the tx port/queue */
                }
        }
}


rte_event_schedule_ctxt_update() can be used to support the run-to-completion model where the application requires the current event to stay on the same l-core as it moves through the series of processing stages, provided the event type is RTE_EVENT_TYPE_LCORE (l-core to l-core communication)
For example, in the previous use case, the ATOMIC sequence number update per SA can be achieved as shown below

The scheduler context update is a costly operation; splitting it into two functions (rte_event_schedule_ctxt_update() and rte_event_schedule_ctxt_wait()) allows the application to overlap the context update latency with other profitable work


Run-to-completion model support

With rte_eventdev_enqueue() + rte_event_schedule() (the event may be dispatched to another l-core):

                        /* move to the next phase (atomic seq number update per sa) */
                        ev.flow_queue_id = sa & flow_queue_id_mask;
                        ev.sched_sync = RTE_SCHED_SYNC_ATOMIC;
                        ev.sub_event_type = APP_STATE_SEQ_UPDATE;
                        rte_eventdev_enqueue(eventdev, &ev);
                } else if (ev.event_type == RTE_EVENT_TYPE_LCORE &&
                           ev.sub_event_type == APP_STATE_SEQ_UPDATE) {
                        sa = ev.flow_queue_id;
                        /* do critical section work per sa */
                        do_critical_section_work(sa);

With rte_event_schedule_ctxt_update() (the event stays on the caller l-core):

                        /* move to the next phase (atomic seq number update per sa) */
                        rte_event_schedule_ctxt_update(eventdev,
                                        sa & flow_queue_id_mask, RTE_SCHED_SYNC_ATOMIC,
                                        APP_STATE_SEQ_UPDATE, true);

                        /* do critical section work per sa */
                        do_critical_section_work(sa);

Benefits of event driven programming model
Enable high single flow throughput with ORDERED schedule sync method
The processing stages are not bound to specific cores. It provides better load-balancing and scaling capabilities than traditional pipelining.
Prioritize: Guarantee lcores work on the highest priority event available
Support asynchronous operations which allow the cores to stay busy while hardware manages requests.
Removes the static mappings between cores and ports/rx queues
Scaling from 1 to N flows is easy as flows are not bound to specific cores


Future work
Integrate the event device with ethernet, crypto and timer subsystems in DPDK
Ethdev/event device integration is possible by extending 6WIND's new ingress classification specification, where a new action type can establish an ethdev port to eventdev schedule group connection
Cryptodev needs some changes at the configuration stage to set the crypto work completion event delivery mechanism
Spec out a timerdev for PCI based timer event devices (timer event devices generate a timer expiry event vs the callback in the existing SW based timer scheme)
The event driven model operates on a single event at a time. Need to create a helper API to make the final enqueues to a different HW block, like an ethdev tx-queue, burst in nature


