[dpdk-dev] [RFC PATCH 0/4] pktdev

Marc Sune marc.sune at bisdn.de
Mon Apr 20 08:51:26 CEST 2015



On 17/04/15 21:50, Wiles, Keith wrote:
> Hi Marc and Bruce,

Hi Keith, Bruce,

>
> On 4/17/15, 1:49 PM, "Marc Sune" <marc.sune at bisdn.de> wrote:
>
>>
>> On 17/04/15 17:16, Bruce Richardson wrote:
>>> Hi all,
>>>
>>> to continue this discussion a bit more, here is my, slightly different,
>>> slant
>>> on what a pktdev abstraction may look like.
>>>
>>> The primary objective I had in mind when drafting this is to provide the
>>> minimal abstraction that can be *easily* used as a common device
>>> abstraction for
>>> existing (and future) device types to be passed to dataplane code. The
>>> patchset
>>> demonstrates this by defining a minimal interface for pktdev - since I
>>> firmly
>>> believe the interface should be as small as possible - and then showing
>>> how that
>>> common interface can be used to unify rings and ethdevs under a common
>>> API for the
>>> datapath. I believe any attempt to unify things much beyond this to the
>>> control
>>> plane or setup phase is not worth doing - at least not initially - as at
>>> init time the code always needs to be aware of the underlying resource
>>> type in
>>> order to configure it properly for dataplane use.
>>>
>>> The overall objective I look to achieve is illustrated by the final
>>> patch in
>>> the series, which is a sample app where the same code is used for all
>>> cores,
>>> irrespective of the underlying device type.
>>>
>>> To get to that point, patch 1 defines the minimal API - just RX and TX.
>>> The .c
>>> file in the library is empty for simplicity, though I would see some
>>> functionality moving there when/if it makes sense e.g. the callback
>>> support
>>> from ethdev, as is done in Keith's patchset.
>>> Patch 2 then makes very minimal changes to ethdev to allow ethdevs to
>>> be used
>>> as pktdevs, and to make use of the pktdev functions when appropriate
>>> Patch 3 was, for me, the key test for this implementation - how hard
>>> was it to
>>> make an rte_ring usable as a pktdev too. Two single-line functions for
>>> RX/TX
>>> and a separate "converter" function proved to be all that was necessary
>>> here -
>>> and I believe simpler solutions may be possible too, as the extra
>>> structures
>>> allocated on conversion could be merged into the rte_ring structure
>>> itself and
>>> initialized on ring creation if we prefer that option. It is
>>> hoped/presumed that
>>> wrapping other structures, such as KNI, may prove to be just as easily
>>> done.
>>> [Not attempted yet - left as an exercise for the reader :-)].
>>>
>>> Now, in terms of pktdev vs ethdev, there is nothing in this proposal
>>> that
>>> cannot also be done using ethdev AFAIK. However, pktdev as outlined here
>>> should make the process far easier than trying to create a full PMD for
>>> something.
>>> All NIC specific functions, including things like stop/start, are
>>> stripped out,
>>> as they don't make sense for an rte_ring or other software objects.
>>> Also, the other thing this provides is that we can move away from just
>>> using
>>> port ids. Instead in the same way as we now reference
>>> rings/mempools/KNIs etc
>>> via pointer, we can do the same with ethernet ports as pktdevs on the
>>> data path.
>>> There was discussion previously on moving beyond 8-bit port ids. If we
>>> look to
>>> use ethdev as a common abstraction, I feel that change will soon have
>>> to be made
>>> causing a large amount of code churn.
>> Hi Richard,
>>
>> First thank you both for taking the time to look at this. I did not not
>> reply to Keith because you Richard summarized most of my concerns.
>>
>> I had a brief look to this second proposal. It is more aligned to what I
>> had in mind. But still I feel it is slightly too complicated. I don't
>> like much the necessary (in your approach) MACRO-like pkt_dev_data
>> struct. It is also slightly inconvenient that the user has to do:
>>
>> +	struct rte_pkt_dev *in = rte_eth_get_dev(0);
>>
>> +		struct rte_pkt_dev *out = rte_ring_get_dev(
>> +				rte_ring_create(name, 4096, rte_socket_id(), 0));
>>
>>
>>
>> What about something like (~pseudo-code):
>>
>> rte_pkt_dev_data.h:
>>
>>     enum rte_pkt_dev_type{
>>          RTE_PKT_DEV_ETH,
>>          RTE_PKT_DEV_RING,
>>          RTE_PKT_DEV_KNI,
>>          //Keep adding as more PMDs are supported
>>     };
>>
>>
>>     //This struct may be redundant if there is nothing more
>>     struct rte_pkt_dev_data{
>>          enum rte_pkt_dev_type;
>>          //Placeholder, maybe we need more...
>>     };
>>
>>     //Make RX/TX pktdev APIs more readable, but not really needed
>>     typedef void pkt_dev_t;
>>
>> (In all PMDs and e.g. KNI and RINGs):
>>
>>   struct rte_eth_dev {
>>      struct rte_pkt_dev_data pkt_dev;//
>> <++++++++++++++++++++++++++++++++++
>>      eth_rx_burst_t rx_pkt_burst; /**< Pointer to PMD receive function. */
>>      eth_tx_burst_t tx_pkt_burst; /**< Pointer to PMD transmit function.
>> */
>>      struct rte_eth_dev_data *data;  /**< Pointer to device data */
>>      /...
>>
>> rte_pkt_dev.h:
>>
>>        #include <rte_ethdev.h>
>>        //Include PMD (and non-PMD) TX/RX headers...
>>
>>     static inline uint16_t
>>     rte_pkt_tx_burst(pkt_dev_t* dev, uint16_t queue_id,
>>               struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
>>     {
>>          switch (((struct rte_pkt_dev_data*)dev)->type){
>>              case  RTE_PKT_DEV_ETH:
>>                  struct rte_eth_dev* eth_dev = (struct
>> rte_eth_dev*)pkt_dev;
>>                  rte_pkt_tx_burst(eth_dev, queue_id, tx_pkts, nb_pkts);
>>                  break;
>>              case RTE_PKT_DEV_RING:
>>                  //...
>>          }
>>     }
>>
>>     //...
> I do not see the functional different from my proposal other then a bit
> more search/replace, which does effect some of the PMD¹s. The changes to
> the PMD is just to use the macros as we now have a nested structure.

I think what I am proposing is different. See below.

>
> The nesting and movement some structures and code to a common location is
> to provide a better way to add other devices. Moving code into a common
> set of structures and APIs should only improve the code as we get more
> devices added to DPDK. The basic flow and design is the same.

In your proposal you are movig TX/RX queues as well as some shared stuff 
and, most importantly, you are using function pointers to execute the 
RX/TX routines.

What I was proposing is to try to add the minimum common shared state in 
order to properly demultiplex the RX/TX call and have a common set of 
abstract calls (the pkt_dev type). In a way, I was proposing to 
deliberately not have a shared struct rte_dev_data because I think the 
internals of the "pkt_dev" can be very different across devices (e.g. 
queues in kni vs eth port vs. crypto?). I treat the pkt_dev as a "black 
box" that conforms to TX/RX API, leaving the developer of that device to 
define its internal structures as it better suites the needs. I only use 
each of the specific device type TX/RX APIs (external to us, pkt_dev 
library) in rte_pkt_dev.h. This also simplifies the refactor required to 
eventually integrate the rte_pkt_dev library and builds it "on top" of 
the existing APIs.

The other important difference with both, Bruce and your approach, and 
mine is the use of function pointers for RX/TX. I don't use them, which 
makes the entire abstracted TX/RX (including the final RX/TX routines 
itself) functions be "inlinable".

Btw, I forgot to add something basic in the previous pseudo-code. The 
different types have to be conditionally compiled according to 
compiled-in DPDK libs:

rte_pkt_dev.h:

     #include <rte_config.h>

     //Eth devices
     #ifdef RTE_LIBRTE_ETHER
     #include <rte_ethdev.h>
     #endif

     //KNI
     #ifdef RTE_LIBRTE_KNI
     #include <rte_kni.h>
     #endif

     //...
     //Include PMD (and non-PMD) TX/RX headers...

    static inline uint16_t
    rte_pkt_tx_burst(pkt_dev_t* dev, uint16_t queue_id,
              struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
    {
         switch (((struct rte_pkt_dev_data*)dev)->type){
             #ifdef RTE_LIBRTE_ETHER
             case  RTE_PKT_DEV_ETH:
                 struct rte_eth_dev* eth_dev = (struct rte_eth_dev*)pkt_dev;
                 rte_pkt_tx_burst(eth_dev, queue_id, tx_pkts, nb_pkts);
                 break;
             #endif

             #ifdef RTE_LIBRTE_KNI
             case RTE_PKT_DEV_KNI:
                 //...
                 break;
             #endif

             default:
                     //Corrupted type or unsupported (without compiled 
support)
                     //Ignore or fail(fatal error)?
                     break;
         }
    }

    //...

>
>  From rte_common_device.h (format screwed with cut/paste and not all of the
> code)
>
>
> struct rte_dev_data {
> 	char name[RTE_DEV_NAME_MAX_LEN]; /**< Unique identifier name */
>
> 	void	**rx_queues;            /**< Array of pointers to RX queues. */
> 	void	**tx_queues;            /**< Array of pointers to TX queues. */
>
> 	uint16_t nb_rx_queues;          /**< Number of RX queues. */
> 	uint16_t nb_tx_queues;          /**< Number of TX queues. */
>
> 	uint16_t mtu;                   /**< Maximum Transmission Unit. */
> 	uint8_t  unit_id;               /**< Unit/Port ID for this instance */
> 	uint8_t  _pad0;             	/* alignment filler */
>
> 	void	*dev_private;           /**< PMD-specific private data */
> 	uint64_t rx_mbuf_alloc_failed;  /**< RX ring mbuf allocation failures. */
> 	uint32_t min_rx_buf_size;       /**< Common rx buffer size handled by all
> queues */
> };
>
> struct rte_dev_info {
> 	struct rte_pci_device   *pci_dev;   /**< Device PCI information. */
> 	const char      *driver_name;   	/**< Device Driver name. */
> 	unsigned int    if_index;       	/**< Index to bound host interface, or 0
> if none. */
> 	/* Use if_indextoname() to translate into an interface name. */
> 	unsigned int	_pad0;
> };
>
>
>
> struct rte_dev {
> 	dev_rx_burst_t          rx_pkt_burst;   /**< Pointer to PMD receive
> function. */
> 	dev_tx_burst_t          tx_pkt_burst;   /**< Pointer to PMD transmit
> function. */
> 	void				  	*data;      	/**< Pointer to device data */
> 	struct rte_dev_global	*gbl_data;      /**< Pointer to device global data
> */
> 	const struct rte_dev_drv *driver;       /**< Driver for this device */
> 	void					*dev_ops;       /**< Functions exported by PMD */
> 	struct rte_pci_device   *pci_dev;       /**< PCI info. supplied by
> probing */
> 	struct rte_dev_info		*dev_info;		/**< Pointer to the device info
> structure */
>
> 	/** User application callback for interrupts if present */
> 	struct rte_dev_cb_list  link_intr_cbs;
> 	/**
> 	* User-supplied functions called from rx_burst to post-process
> 	* received packets before passing them to the user
> 	*/
> 	struct rte_dev_rxtx_callback **post_rx_burst_cbs;
> 	/**
> 	* User-supplied functions called from tx_burst to pre-process
> 	* received packets before passing them to the driver for transmission.
> 	*/
> 	struct rte_dev_rxtx_callback **pre_tx_burst_cbs;

What is the indended use of these callbacks and what is the benefit? If 
the user is supplying the callback, why not putting it directly after 
the RX and before the TX call in user's code (anyway)?

> 	enum rte_dev_type		dev_type;       /**< Flag indicating the device type */
> 	uint8_t                 attached;        /**< Flag indicating the port is
> attached */
> };
>
>  From rte_ethdev.h
>
> struct rte_eth_dev_info {
> 	struct rte_dev_info	di;	/**< Common device information */
>
> 	/* Rest of structure is for private device data */
> 	uint32_t min_rx_bufsize; /**< Minimum size of RX buffer. */
> 	uint32_t max_rx_pktlen; /**< Maximum configurable length of RX pkt. */
> 	uint16_t max_rx_queues; /**< Maximum number of RX queues. */
> 	uint16_t max_tx_queues; /**< Maximum number of TX queues. */
> 	uint32_t max_mac_addrs; /**< Maximum number of MAC addresses. */
> 	uint32_t max_hash_mac_addrs;
> 	/** Maximum number of hash MAC addresses for MTA and UTA. */
> 	uint16_t max_vfs; /**< Maximum number of VFs. */
> 	uint16_t max_vmdq_pools; /**< Maximum number of VMDq pools. */
> 	uint32_t rx_offload_capa; /**< Device RX offload capabilities. */
> 	uint32_t tx_offload_capa; /**< Device TX offload capabilities. */
> 	uint16_t reta_size;
> 	/**< Device redirection table size, the total number of entries. */
> 	/** Bit mask of RSS offloads, the bit offset also means flow type */
> 	uint64_t flow_type_rss_offloads;
> 	struct rte_eth_rxconf default_rxconf; /**< Default RX configuration */
> 	struct rte_eth_txconf default_txconf; /**< Default TX configuration */
> 	uint16_t vmdq_queue_base; /**< First queue ID for VMDQ pools. */
> 	uint16_t vmdq_queue_num; /**< Queue number for VMDQ pools. */
> 	uint16_t vmdq_pool_base; /**< First ID of VMDQ pools. */
> };
>
> struct rte_eth_dev_data {
> 	struct rte_dev_data		dd;			/**< Common device data */
>
> 	/* Rest of the structure is private data */
> 	struct rte_eth_dev_sriov sriov;    /**< SRIOV data */
>
> 	struct rte_eth_link dev_link;
> 	/**< Link-level information & status */
>
> 	struct rte_eth_conf dev_conf;   /**< Configuration applied to device. */
> 	struct ether_addr* mac_addrs;	/**< Device Ethernet Link address. */
> 	uint64_t mac_pool_sel[ETH_NUM_RECEIVE_MAC_ADDR];
> 	/** bitmap array of associating Ethernet MAC addresses to pools */
> 	struct ether_addr* hash_mac_addrs;
>
> 	/** Device Ethernet MAC addresses of hash filtering. */
> 	uint8_t promiscuous  : 1,   /**< RX promiscuous mode ON(1) / OFF(0). */
> 		scattered_rx : 1,	/**< RX of scattered packets is ON(1) / OFF(0) */
> 		all_multicast: 1,   /**< RX all multicast mode ON(1) / OFF(0). */
> 		dev_started  : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
> };
>
> In the case of rte_ethdev.h the rte_eth_dev turns into a macro '#define
> rte_eth_dev rte_dev', in other devices they may have a rte_xyz_dev
> structure rte_dev as the first member and then private members. Also later
> if we find something private to ethdev then a structure can be created.
>
>
> Most of the code changes outside of these changes are because using a
> macro to fix the code was easier. Rewriting the functions to use real
> variables instead of casting void pointer could yield without the macros.
> Not sure as the performance impact in rewriting some of these functions to
> eliminate the macros.
>
> This is the heart of what I am proposing and it just so happens to allow
> Bruce¹ ideas as well. I have another patch I have not sent that includes
> Bruce¹s ideas using my suggestions.
>
> Being able to use a single API for all devices does have its benefits IMO,
> but I want to cleanup ethdev to allow for other devices to be added to the
> system.
>
> At some point (when the cryptodev is done) I want to add it to this
> framework, plus add more APIs for crypto similar to Linux Crypto API or
> something we can agree is a reasonable set of APIs for DPDK.

I currently don't see the benefit of abstracting beyond the RX/TX 
functions, and I see more cons than pros. Doing so we will also make 
librte_pkt_dev sort of mandatory, even if you are not using it our 
abstracted RX/TX functions at all.

regards
marc

>
>
> Regards,
>
> ++Keith
>>
>> Maybe it is oversimplified? With this approach tough, we would barely
>> touch the PMDs and the functionality can be built on top of the existing
>> PMDs.
>>
>> Thoughts?
>> Marc
>>
>>
>>> Bruce Richardson (4):
>>>     Add example pktdev implementation
>>>     Make ethdev explicitly a subclass of pktdev
>>>     add support for a ring to be a pktdev
>>>     example app showing pktdevs used in a chain
>>>
>>>    config/common_bsdapp           |   5 +
>>>    config/common_linuxapp         |   5 +
>>>    examples/pktdev/Makefile       |  57 +++++++++++
>>>    examples/pktdev/basicfwd.c     | 222
>>> +++++++++++++++++++++++++++++++++++++++++
>>>    lib/Makefile                   |   1 +
>>>    lib/librte_ether/rte_ethdev.h  |  26 ++---
>>>    lib/librte_pktdev/Makefile     |  56 +++++++++++
>>>    lib/librte_pktdev/rte_pktdev.c |  35 +++++++
>>>    lib/librte_pktdev/rte_pktdev.h | 144 ++++++++++++++++++++++++++
>>>    lib/librte_ring/rte_ring.c     |  41 ++++++++
>>>    lib/librte_ring/rte_ring.h     |   3 +
>>>    11 files changed, 582 insertions(+), 13 deletions(-)
>>>    create mode 100644 examples/pktdev/Makefile
>>>    create mode 100644 examples/pktdev/basicfwd.c
>>>    create mode 100644 lib/librte_pktdev/Makefile
>>>    create mode 100644 lib/librte_pktdev/rte_pktdev.c
>>>    create mode 100644 lib/librte_pktdev/rte_pktdev.h
>>>



More information about the dev mailing list