[dpdk-dev,RFC,v2] vhost: new rte_vhost API proposal

Message ID 1526648465-62579-1-git-send-email-dariuszx.stojaczyk@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Maxime Coquelin
Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel-compilation success Compilation OK

Commit Message

Stojaczyk, DariuszX May 18, 2018, 1:01 p.m. UTC
  rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
already been confirmed not to work with rte_vhost. virtio-user-scsi-pci
in QEMU 2.12 doesn't fully initialize its management queues at the SeaBIOS
stage. This is perfectly fine from the Vhost-user spec perspective, but
doesn't meet rte_vhost's expectations: rte_vhost waits for all queues
to be fully initialized before it allows the entire device to be
processed. Fixing rte_vhost directly would require a large number
of changes, which would completely break backwards compatibility.

This rte_vhost2 library is intended to smooth out the transition.
It exposes a low-level API for implementing new Vhost-user slaves.
The existing rte_vhost is about to be refactored to use the rte_vhost2
library underneath, and demanding backends can then use rte_vhost2
directly.

We're designing a fresh library here, so this opens up a great
number of possibilities for improvements to the public API that
will pay off significantly for future specification/library
extensions.

rte_vhost2 abstracts away most Vhost-user/virtio-vhost-user specifics
and allows developers to implement Vhost devices with ease.
It calls user-provided callbacks once a proper device initialization
state has been reached. That is: memory mappings have changed,
virtqueues are ready to be processed, features have changed at
runtime, etc.

Compared to rte_vhost, this lib additionally allows the following:
* ability to start/stop particular queues
* most callbacks are now asynchronous, which greatly simplifies
event handling for asynchronous applications and doesn't
make anything harder for synchronous ones.
* this is a low-level API. It doesn't have any vhost-net, nvme
or crypto references. These backend-specific libraries will
later be refactored to use *this* generic library underneath.
This implies that the library doesn't do any virtqueue processing;
it only delivers vring addresses to the user, so the user can
process virtqueues themselves.
* abstracting away Unix domain sockets (vhost-user) and PCI
(virtio-vhost-user).
* The API constrains how public functions can be called and how
internal data can change, so only minimal work is required
to ensure thread-safety. Possibly no mutexes are required at all.
* full Virtio 1.0/vhost-user specification compliance.

This patch only introduces the API. Some additional functions
for vDPA might still be required, but everything present here
so far shouldn't need changing.

Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
---
 lib/librte_vhost2/rte_vhost2.h | 331 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 331 insertions(+)
 create mode 100644 lib/librte_vhost2/rte_vhost2.h
  

Comments

Maxime Coquelin May 18, 2018, 1:50 p.m. UTC | #1
Hi Dariusz,

On 05/18/2018 03:01 PM, Dariusz Stojaczyk wrote:
> rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> stage. This is perfectly fine from the Vhost-user spec perspective, but
> doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> to be fully initialized before it allows the entire device to be
> processed. qFixing rte_vhost directly would require quite a big amount
> of changes, which would completely break backwards compatibility.
> 
> This rte_vhost2 library is intended to smooth out the transition.
> It exposes a low-level API for implementing new Vhost-user slaves.
> The existing rte_vhost is about to be refactored to use rte_vhost2
> library underneath, and demanding backends could now use rte_vhost2
> directly.

I like the idea, and the proposed way to smooth the transition.

I will certainly have other comments later, but please find below
the ones I have for the moment.

> We're designing a fresh library here, so this opens up a great
> amount of possibilities and improvements we can make to the public
> API that will pay off significantly for the sake of future
> specification/library extensions.
> 
> rte_vhost2 abstracts away most Vhost-user/virtio-vhost-user specifics
> and allows developers to implement Vhost devices with an ease.
> It calls user-provided callbacks once proper device initialization
> state has been reached. That is - memory mappings have changed,
> virtqueues are ready to be processed, features have changed in
> runtime, etc.
> 
> Compared to the rte_vhost, this lib additionally allows the following:
> * ability to start/stop particular queues
> * most callbacks are now asynchronous - it greatly simplifies
> the event handling for asynchronous applications and doesn't
> make anything harder for synchronous ones.
> * this is low-level API. It doesn't have any vhost-net, nvme
> or crypto references. These backend-specific libraries will
> be later refactored to use *this* generic library underneath.
> This implies that the library doesn't do any virtqueue processing,
> it only delivers vring addresses to the user, so he can process
> virtqueues by himself.
> * abstracting away Unix domain sockets (vhost-user) and PCI
> (virtio-vhost-user).
> * The API imposes how public functions can be called and how
> internal data can change, so there's only a minimal work required
> to ensure thread-safety. Possibly no mutexes are required at all.
> * full Virtio 1.0/vhost-user specification compliance.
> 
> This patch only introduces the API. Some additional functions
> for vDPA might be still required, but everything present here
> so far shouldn't need changing.
> 
> Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
> ---
>   lib/librte_vhost2/rte_vhost2.h | 331 +++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 331 insertions(+)
>   create mode 100644 lib/librte_vhost2/rte_vhost2.h
> 
> diff --git a/lib/librte_vhost2/rte_vhost2.h b/lib/librte_vhost2/rte_vhost2.h
> new file mode 100644
> index 0000000..385b093
> --- /dev/null
> +++ b/lib/librte_vhost2/rte_vhost2.h
> @@ -0,0 +1,331 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright (c) Intel Corporation.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_VHOST2_H_
> +#define _RTE_VHOST2_H_
> +
> +/**
> + * @file
> + * This library abstracts away most Vhost-user/virtio-vhost-user specifics
> + * and allows developers to implement Vhost devices with ease.
> + * It calls user-provided callbacks once proper device initialization
> + * state has been reached. That is - memory mappings have changed,
> + * virtqueues are ready to be processed, features have changed at runtime, etc.
> + */
> +
> +#include <stdint.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/* Not C++-aware. */
> +#include <linux/vhost.h>
> +
> +#define RTE_VHOST2_MEMORY_MAX_NREGIONS	8
> +
> +#define RTE_VHOST2_CLIENT		(1ULL << 0)
> +#define RTE_VHOST2_NO_RECONNECT		(1ULL << 1)
> +
> +enum rte_vhost2_set_config_type {
> +	/** Config changed on request by the vhost driver. */
> +	VHOST_SET_CONFIG_TYPE_MASTER = 0,
> +	/** Config is being restored after a successful migration. */
> +	VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> +};
> +
> +#define RTE_VHOST2_MSG_VERSION_MASK	(0x3)
> +#define RTE_VHOST2_MSG_REPLY_MASK	(0x1 << 2)
> +#define RTE_VHOST2_MSG_NEED_REPLY	(0x1 << 3)
> +
> +struct rte_vhost2_msg {
> +	uint32_t id;
> +	uint32_t flags;
> +	uint32_t size; /**< The following payload size. */
> +	void *payload;
> +	int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
> +};
> +
> +/** Single memory region. Both physically and virtually contiguous */
> +struct rte_vhost2_mem_region {
> +	uint64_t guest_phys_addr;
> +	uint64_t guest_user_addr;
> +	uint64_t host_user_addr;
> +	uint64_t size;
> +	void *mmap_addr;
> +	uint64_t mmap_size;
> +	int fd;
> +};
> +
> +struct rte_vhost2_memory {
> +	uint32_t nregions;
> +	struct rte_vhost2_mem_region regions[];
> +};
> +
> +/**
> + * Vhost device created and managed by rte_vhost2. Accessible via
> + * \c rte_vhost2_tgt_ops callbacks. This is only a part of the real
> + * vhost device data. This struct is published just for inline vdev
> + * functions to access their data directly.
> + */
> +struct rte_vhost2_dev {
> +	struct rte_vhost2_memory *mem;
> +	bool iommu; /**< \c VIRTIO_F_IOMMU_PLATFORM has been negotiated */
> +};
> +
> +/**
> + * Virtqueue created and managed by rte_vhost2. Accessible via
> + * \c rte_vhost2_tgt_ops callbacks.
> + */
> +struct rte_vhost2_vq {
> +	 struct vring_desc *desc;
> +	 struct vring_avail *avail;
> +	 struct vring_used *used;
> +	 /* available only if \c VHOST_F_LOG_ALL has been negotiated */
> +	 void *log;
> +	 uint16_t size;
> +};
> +
> +/**
> + * Device/queue related callbacks, all optional. Provided callback
> + * parameters are guaranteed not to be NULL unless explicitly specified.
> + */
> +struct rte_vhost2_tgt_ops {
> +	/**
> +	 * New driver connected. If this is completed with a non-zero status,
> +	 * rte_vhost2 will terminate the connection.
> +	 */
> +	void (*device_create)(struct rte_vhost2_dev *vdev);
> +	/**
> +	* Device is ready to operate. vdev data is now initialized. This callback
> +	* may be called multiple times as e.g. memory mappings can change
> +	* dynamically. All queues are guaranteed to be stopped by now.
> +	*/
> +	void (*device_init)(struct rte_vhost2_dev *vdev);
> +	/**
> +	* Features have changed at runtime. This is called at least once during
> +	* initialization, before `device_init`. Queues might still be running
> +	* at this point.
> +	*/
> +	void (*device_features_changed)(struct rte_vhost2_dev *vdev,
> +			uint64_t features);
> +	/**
> +	* Start processing vq. The `vq` is guaranteed not to be modified before
> +	* `queue_stop` is called.
> +	*/
> +	void (*queue_start)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +	/**
> +	* Stop processing vq. It shouldn't be accessed after this callback
> +	* completes (via \c rte_vhost2_tgt_cb_complete). This can be called
> +	* prior to shutdown or before actions that require changing vhost
> +	* device/vq state.
> +	*/
> +	void (*queue_stop)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +	/** Device disconnected. All queues are guaranteed to be stopped by now */
> +	void (*device_destroy)(struct rte_vhost2_dev *vdev);
> +	/**
> +	* Custom vhost-user message handler. This is called for
> +	* backend-specific messages (net/crypto/scsi) that weren't recognized
> +	* by the generic message parser. `msg` is available until
> +	* \c rte_vhost2_tgt_cb_complete is called.
> +	*/
> +	void (*custom_msg)(struct rte_vhost2_dev *vdev, struct rte_vhost2_msg *msg);
> +
> +	/** Interrupt handler, synchronous. */
> +	void (*queue_kick)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +	/**
> +	 * Full device config read, synchronous. Return 0 if `len` bytes of
> +	 * `config` have been successfully set, -1 otherwise.
> +	 */
> +	int (*get_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
> +			  uint32_t len);
> +	/**
> +	 * Device config changed by the driver, synchronous. `type` indicates
> +	 * the reason of change.
> +	 */
> +	int (*set_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
> +			  uint32_t offset, uint32_t len,
> +			  enum rte_vhost2_set_config_type type);
> +
> +	void *reserved[8]; /**< Reserved for future extension */
> +};
> +
> +/**
> + * Registers a new vhost target accepting remote connections. Multiple
> + * transports are available. It is possible to create a Vhost-user
> + * Unix domain socket polling local connections or to connect to a physical
> + * Virtio device and install an interrupt handler.
> + *
> + * This function is thread-safe.
> + *
> + * \param trtype type of the transport used, e.g. "vhost-user",
> + * "PCI-vhost-user", "PCI-vDPA".
> + * \param trid identifier of the device. For PCI this would be the BDF address,
> + * for vhost-user the socket name.
> + * \param trflags additional options for the specified transport
> + * \param trctx additional data for the specified transport. Can be NULL.
> + * \param tgt_ops callbacks to be called upon reaching specific initialization
> + * states.
> + * \param features supported Virtio features. To be negotiated with the
> + * driver ones. rte_vhost2 will append a couple of generic feature bits
> + * which are required by the Virtio spec. TODO list these features here
> + * \return 0 on success, negative errno otherwise
> + */
> +int rte_vhost2_tgt_register(const char *trtype, const char *trid,
> +			    uint64_t trflags, void *trctx,
> +			    struct rte_vhost2_tgt_ops *tgt_ops,
> +			    uint64_t features);

Couldn't the register API also pass the vdev?
Doing this, the backend could have rte_vhost2_dev in its device
struct.

> +
> +/**
> + * Finish async device tgt ops callback. Unless a tgt op has been documented
> + * as 'synchronous' this function must be called at the end of the op handler.
> + * It can be called either before or after the op handler returns. rte_vhost2
> + * won't call any tgt ops callbacks while another one hasn't been finished yet.
> + *
> + * This function is thread-safe.
> + *
> + * \param vdev vhost device
> + * \param rc 0 on success, negative errno otherwise. If a non-zero value is
> + * given, the current callback will be perceived as failed. A queue that failed
> + * to start won't need to be stopped.
> + */
> +void rte_vhost2_tgt_cb_complete(struct rte_vhost2_dev *vdev, int rc);
> +
> +/**
> + * Unregisters a vhost target asynchronously. All active queues will be
> + * stopped and all devices destroyed.
> + *
> + * This function is thread-safe.
> + *
> + * \param cb_fn callback to be called on finish. It'll be called from the same
> + * thread that calls \c rte_vhost2_tgt_ops.
> + * \param cb_arg argument for \c cb_fn
> + * \return 0 on success, negative errno otherwise. `cb_fn` won't be called
> + * if non-zero value is returned.
> + */
> +int rte_vhost2_tgt_unregister(const char *trtype, const char *trid,
> +			       void (*cb_fn)(void *arg), void *cb_arg);
> +
> +/**
> + * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
> + *
> + * This function is thread-safe.
> + *
> + * \param mem vhost device memory
> + * \param gpa guest physical address
> + * \param len length of the memory to translate (in bytes). If the requested
> + * memory chunk crosses a memory region boundary, *len will be set to
> + * the remaining, maximum length of virtually contiguous memory. In such
> + * a case the user will be required to call another gpa_to_vva(gpa + *len).
> + * \return vhost virtual address or NULL if requested `gpa` is not mapped.
> + */
> +static inline void *
> +rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t gpa, uint64_t *len)
> +{
> +	struct rte_vhost2_mem_region *r;
> +	uint32_t i;
> +
> +	for (i = 0; i < mem->nregions; i++) {
> +		r = &mem->regions[i];
> +		if (gpa >= r->guest_phys_addr &&
> +		    gpa <  r->guest_phys_addr + r->size) {
> +
> +			if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
> +				*len = r->guest_phys_addr + r->size - gpa;
> +			}
> +
> +			return (void *)(uintptr_t)(gpa - r->guest_phys_addr +
> +					r->host_user_addr);
> +		}
> +	}
> +	*len = 0;
> +
> +	return NULL;
> +}

Maybe we could take the opportunity to only have rte_vhost2_iova_to_vva.

> +/**
> + * Translate I/O virtual address to vhost address space.
> + *
> + * If VIRTIO_F_IOMMU_PLATFORM has been negotiated, this might issue
> + * a TLB miss request and wait for the TLB update response.
> + * If VIRTIO_F_IOMMU_PLATFORM has not been negotiated, `iova` is a physical
> + * address and `perm` is ignored.
> + *
> + * This function is thread-safe.
> + *
> + * \param vdev vhost device
> + * \param vq virtqueue. Must be started.
> + * \param iova I/O virtual address
> + * \param len length of the memory to translate (in bytes). If the requested
> + * memory chunk crosses a memory region boundary, *len will be set to
> + * the remaining, maximum length of virtually contiguous memory. In such
> + * a case the user will be required to call another iova_to_vva(iova + *len).
> + * \param perm VHOST_ACCESS_RO,VHOST_ACCESS_WO or VHOST_ACCESS_RW
> + * \return vhost virtual address or NULL if requested `iova` is not mapped
> + * or the `perm` doesn't match.
> + */
> +static inline void *
> +rte_vhost2_iova_to_vva(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq,
> +		       uint64_t iova, uint32_t *len, uint8_t perm)
> +{
> +	void *__vhost_iova_to_vva(struct virtio_net * dev, struct vhost_virtqueue * vq,
> +				  uint64_t iova, uint64_t size, uint8_t perm);
> +
> +	if (!vdev->iommu) {
> +		return rte_vhost2_gpa_to_vva(vdev->mem, iova, len);
> +	}
> +
> +	return __vhost_iova_to_vva(vdev, vq, iova, len, perm);
> +}
> +
> +/**
> + * Notify the driver about vq change. This is an eventfd_write for vhost-user
> + * or MMIO write for PCI devices.
> + *
> + * \param vdev vhost device
> + * \param vq virtqueue. Must be started.
> + */
> +void rte_vhost2_dev_call(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
> +
> +/**
> + * Notify the driver about device config change. This will result in \c
> + * rte_vhost2_tgt_ops->get_config being called.
> + *
> + * \param vdev vhost device
> + */
> +void rte_vhost2_dev_cfg_call(struct rte_vhost2_dev *vdev);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_VHOST2_H_ */
>
  
Yuanhan Liu May 20, 2018, 7:07 a.m. UTC | #2
On Fri, May 18, 2018 at 03:50:45PM +0200, Maxime Coquelin wrote:
> Hi Dariusz,
> 
> On 05/18/2018 03:01 PM, Dariusz Stojaczyk wrote:
> > rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> > been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> > in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> > stage. This is perfectly fine from the Vhost-user spec perspective, but
> > doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> > to be fully initialized before it allows the entire device to be
> > processed. qFixing rte_vhost directly would require quite a big amount
> > of changes, which would completely break backwards compatibility.
> > 
> > This rte_vhost2 library is intended to smooth out the transition.
> > It exposes a low-level API for implementing new Vhost-user slaves.
> > The existing rte_vhost is about to be refactored to use rte_vhost2
> > library underneath, and demanding backends could now use rte_vhost2
> > directly.
> 
> I like the idea, and the proposed way to smooth the transition.

Please be aware of that I just had a quick glimpse of this patch and it's
likely I don't have too much time to follow this.

However, I also like this idea. And thank you for working on it.

	--yliu
  
Stojaczyk, DariuszX May 22, 2018, 10:19 a.m. UTC | #3
Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Friday, May 18, 2018 9:51 PM
> On 05/18/2018 03:01 PM, Dariusz Stojaczyk wrote:
> > rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> > been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> > in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> > stage. This is perfectly fine from the Vhost-user spec perspective, but
> > doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> > to be fully initialized before it allows the entire device to be
> > processed. qFixing rte_vhost directly would require quite a big amount
> > of changes, which would completely break backwards compatibility.
> >
> > This rte_vhost2 library is intended to smooth out the transition.
> > It exposes a low-level API for implementing new Vhost-user slaves.
> > The existing rte_vhost is about to be refactored to use rte_vhost2
> > library underneath, and demanding backends could now use rte_vhost2
> > directly.
>
> I like the idea, and the proposed way to smooth the transition.
>
> I will certainly have other comments later, but please find below
> the ones I have for the moment.
>
> > <snip>
> > +
> > +/**
> > + * Registers a new vhost target accepting remote connections. Multiple
> > + * available transports are available. It is possible to create a Vhost-user
> > + * Unix domain socket polling local connections or connect to a physical
> > + * Virtio device and install an interrupt handler .
> > + *
> > + * This function is thread-safe.
> > + *
> > + * \param trtype type of the transport used, e.g. "vhost-user",
> > + * "PCI-vhost-user", "PCI-vDPA".
> > + * \param trid identifier of the device. For PCI this would be the BDF address,
> > + * for vhost-user the socket name.
> > + * \param trflags additional options for the specified transport
> > + * \param trctx additional data for the specified transport. Can be NULL.
> > + * \param tgt_ops callbacks to be called upon reaching specific initialization
> > + * states.
> > + * \param features supported Virtio features. To be negotiated with the
> > + * driver ones. rte_vhost2 will append a couple of generic feature bits
> > + * which are required by the Virtio spec. TODO list these features here
> > + * \return 0 on success, negative errno otherwise
> > + */
> > +int rte_vhost2_tgt_register(const char *trtype, const char *trid,
> > +			    uint64_t trflags, void *trctx,
> > +			    struct rte_vhost2_tgt_ops *tgt_ops,
> > +			    uint64_t features);
>
> Couldn't the register API also pass the vdev?
> Doing this, the backend could have rte_vhost2_dev in its device
> struct.


Please notice the register API is for registering targets, not devices. A single Vhost-user server target can spawn multiple devices - one for each connection. I know the nomenclature is different from rte_vhost, but since each connection uses its own (virt)queues it makes sense to call things this way.

Initially I thought about adding some rte_vhost2_tgt struct declaration that register function would return, but later on came to a conclusion that it would only complicate things for the library user. A parent struct that would keep rte_vhost2_tgt* needs to contain `const char *trtype` and `const char *trid` anyway, so it's just easier to use these two strings for target identification. 

> > <snip>
> > +/**
> > + * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
> > + *
> > + * This function is thread-safe.
> > + *
> > + * \param mem vhost device memory
> > + * \param gpa guest physical address
> > + * \param len length of the memory to translate (in bytes). If requested
> > + * memory chunk crosses memory region boundary, the *len will be set to
> > + * the remaining, maximum length of virtually contiguous memory. In such
> > + * case the user will be required to call another gpa_to_vva(gpa + *len).
> > + * \return vhost virtual address or NULL if requested `gpa` is not mapped.
> > + */
> > +static inline void *
> > +rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t gpa, uint64_t *len)
> > +{
> > +	struct rte_vhost2_mem_region *r;
> > +	uint32_t i;
> > +
> > +	for (i = 0; i < mem->nregions; i++) {
> > +		r = &mem->regions[i];
> > +		if (gpa >= r->guest_phys_addr &&
> > +		    gpa <  r->guest_phys_addr + r->size) {
> > +
> > +			if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
> > +				*len = r->guest_phys_addr + r->size - gpa;
> > +			}
> > +
> > +			return gpa - r->guest_phys_addr + r->host_user_addr;
> > +		}
> > +	}
> > +	*len = 0;
> > +
> > +	return 0;
> > +}
>
> Maybe we could take the opportunity to only have rte_vhost2_iova_to_vva.


Good idea; will remove it in v3.

Thanks,
D.
  
Stojaczyk, DariuszX May 29, 2018, 1:38 p.m. UTC | #4
> -----Original Message-----
> From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> Sent: Friday, May 25, 2018 12:06 PM
> On Fri, May 18, 2018 at 03:01:05PM +0200, Dariusz Stojaczyk wrote:
> > +struct rte_vhost2_msg {
> > +	uint32_t id;
> 
> Is this what the vhost-user specification calls the "request type"?  I
> suggest following the vhost-user spec terminology.
> 
> > +	uint32_t flags;
> > +	uint32_t size; /**< The following payload size. */
> > +	void *payload;
> > +	int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
> 
> Is it necessary to expose file descriptor passing in the API?
> virtio-vhost-user doesn't have file descriptor passing, so it's best if this
> can be hidden inside rte_vhost2.

So it's another argument for not exposing raw message handling to the user.
If there's some backend-specific vhost-user message in the future that contains an fd, it will need a set of new abstractions to work with virtio-vhost-user anyway.
I guess I'll bring back the original custom_msg idea from V1.

> 
> > +};
> > +
> > +/** Single memory region. Both physically and virtually contiguous */
> > +struct rte_vhost2_mem_region {
> > +	uint64_t guest_phys_addr;
> > +	uint64_t guest_user_addr;
> > +	uint64_t host_user_addr;
> > +	uint64_t size;
> > +	void *mmap_addr;
> > +	uint64_t mmap_size;
> > +	int fd;
> 
> virtio-vhost-user doesn't have an fd.  Why do API consumers need to
> know about the fd?

They don't. Ack. I'll strip this struct.

> 
> > +/**
> > + * Device/queue related callbacks, all optional. Provided callback
> > + * parameters are guaranteed not to be NULL unless explicitly
> specified.
> > + */
> 
> This is a good place to mention that all callbacks are asynchronous unless
> specified otherwise.  Without that knowledge statements below like "If
> this is completed with a non-zero status" are confusing on a void
> function.

Ack.

> 
> > +struct rte_vhost2_tgt_ops {
> > +	/**
> > +	 * New driver connected. If this is completed with a non-zero
> status,
> > +	 * rte_vhost2 will terminate the connection.
> > +	 */
> > +	void (*device_create)(struct rte_vhost2_dev *vdev);
> > +	/**
> > +	* Device is ready to operate. vdev data is now initialized. This
> callback
> > +	* may be called multiple times as e.g. memory mappings can
> change
> > +	* dynamically. All queues are guaranteed to be stopped by now.
> > +	*/
> > +	void (*device_init)(struct rte_vhost2_dev *vdev);
> > +	/**
> > +	* Features have changed in runtime. This is called at least once
> > +during
> 
> s/in/at/

Ack.

> 
> > +	/**
> > +	* Custom vhost-user message handler. This is called for
> > +	* backend-specific messages (net/crypto/scsi) that weren't
> recognized
> > +	* by the generic message parser. `msg` is available until
> > +	* \c rte_vhost2_tgt_cb_complete is called.
> > +	*/
> > +	void (*custom_msg)(struct rte_vhost2_dev *vdev, struct
> > +rte_vhost2_msg *msg);
> 
> What happens if rte_vhost2_tgt_cb_complete() is called with a negative
> rc?  Does the specific errno value matter?

My current implementation only checks for rc != 0 now. I'm still working this out.

> 
> Where is the API for sending a vhost-user reply message?

I didn't push any. Now that you've pointed out the fds in the public API, I think I'll roll this custom_msg stuff back to V1.

D.
  
Stojaczyk, DariuszX May 30, 2018, 12:24 p.m. UTC | #5
> -----Original Message-----
> From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> Sent: Wednesday, May 30, 2018 10:57 AM
> On Tue, May 29, 2018 at 01:38:33PM +0000, Stojaczyk, DariuszX wrote:
> >
> >
> > > -----Original Message-----
> > > From: Stefan Hajnoczi [mailto:stefanha@redhat.com]
> > > Sent: Friday, May 25, 2018 12:06 PM
> > > On Fri, May 18, 2018 at 03:01:05PM +0200, Dariusz Stojaczyk wrote:
> > > > +struct rte_vhost2_msg {
> > > > +	uint32_t id;
> > >
> > > Is this what the vhost-user specification calls the "request type"?
> > > I suggest following the vhost-user spec terminology.
> > >
> > > > +	uint32_t flags;
> > > > +	uint32_t size; /**< The following payload size. */
> > > > +	void *payload;
> > > > +	int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
> > >
> > > Is it necessary to expose file descriptor passing in the API?
> > > virtio-vhost-user doesn't have file descriptor passing, so it's best
> > > if this can be hidden inside rte_vhost2.
> >
> > So it's another argument for not exposing raw message handling to the
> user.
> > If there's some backend-specific vhost-user message in future that
> contains an fd, it will need a set of new abstractions to work with virtio-
> vhost-user anyway.
> > I guess I'll get back the original custom_msg idea from V1.
> 
> Hold on, have you looked at the device-specific messages defined in the
> vhost-user spec?  Do any of them even pass resources (file descriptors)?

There's at least VHOST_USER_NVME_SET_CQ_CALL (currently in review, not present in master yet). Vhost-nvme is one big hack that doesn't use any virtqueues, so it has its own message for the callfd.
  
Dariusz Stojaczyk June 13, 2018, 9:41 a.m. UTC | #6
Hi Stefan,
I'm sorry for the late response. My email client filtered out this
mail. I fixed it just now.

On Fri, Jun 8, 2018 at 15:29 Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Jun 07, 2018 at 05:12:20PM +0200, Dariusz Stojaczyk wrote:
> > The proposed structure for the new library is described below.
> >  * rte_vhost2.h
> >    - public API
> >    - registering targets with provided ops
> >    - unregistering targets
> >    - iova_to_vva()
> >  * transport.h/.c
> >    - implements rte_vhost2.h
> >    - allows registering vhost transports, which are opaquely required by the
> >      rte_vhost2.h API (target register function requires transport name).
> >    - exposes a set of callbacks to be implemented by each transport
> >  * vhost_user.c
> >    - vhost-user Unix domain socket transport
>
> This file should be called transport_unix.c or similar so it's clear
> that it only handles UNIX domain socket transport aspects, not general
> vhost-user protocol aspects.  If the distinction is unclear then
> layering violations are easy to make in the future (especially when
> people other than you contribute to the code).

Ack. Also, virtio-vhost-user transport still has to be placed in
drivers/ directory, right? We could move most of this library
somewhere into drivers/, leaving only rte_vhost2.h, transport.h and
transport.c in lib/librte_vhost2. What do you think?

>
> >    - does recvmsg()
> >    - uses the underlying vhost-user helper lib to process messages, but still
> >      handles some transport-specific ones, e.g. SET_MEM_TABLE
> >    - calls some of the rte_vhost2.h ops registered with a target
> >  * fd_man.h/.c
> >    - polls provided descriptors, calls user callbacks on fd events
> >    - based on the original rte_vhost version
> >    - additionally allows calling user-provided callbacks on the poll thread
>
> This is general-purpose functionality that should be a core DPDK utility.
>
> Are you sure you cannot use existing (e)poll functionality in DPDK?

We have to use poll here, and I haven't seen any DPDK APIs for poll,
only rte_epoll_*. Since received vhost-user messages may be handled
asynchronously, we have to temporarily remove an fd from the poll
group for the time each message is handled. We do it by setting the fd
in the polled fds array to -1; the poll(2) man page explicitly
suggests this as a way to ignore events on a descriptor.

>
> >  * vhost.h/.c
> >    - a transport-agnostic vhost-user library
> >    - calls most of the rte_vhost2.h ops registered with a target
> >    - manages virtqueues state
> >    - hopefully to be reused by the virtio-vhost-user
> >    - exposes a set of callbacks to be implemented by each transport
> >      (for e.g. sending message replies)
> >
> > This series includes only vhost-user transport. Virtio-vhost-user
> > is to be done later.
> >
> > The following items are still TBD:
> >   * vhost-user slave channel
> >   * get/set_config
> >   * cfg_call() implementation
> >   * IOTLB
> >   * NUMA awareness
>
> This is important to think about while the API is still being designed.
>
> Some initial thoughts.  NUMA affinity is optimal when:
>
> 1. The vring, indirect descriptor tables, and data buffers are allocated
>    on the same NUMA node.
>
> 2. The virtqueue interrupts go to vcpus associated with the same NUMA
>    node as the vring.
>
> 3. The guest NUMA node corresponds to the host NUMA node of the backing
>    storage device (e.g. NVMe PCI adapter).
>
> This way memory is local to the NVMe PCI adapter on the way down the
> stack when submitting I/O and back up again when completing I/O.
>
> Achieving #1 & #2 is up to the guest drivers.
>
> Achieving #3 is up to virtual machine configuration (QEMU command-line).
>
> The role that DPDK plays in all of this is that each vhost-user
> virtqueue should be polled by a thread that has been placed on the same
> host NUMA node mentioned above for #1, #2, and #3.
>
> Per-virtqueue state should also be allocated on this host NUMA node.

I agree on all points.

> Device backends should be able to query this information so they, too,
> can allocate memory with optimal NUMA affinity.

Of course. Since we offer raw vq pointers, they will be able to use
get_mempolicy(2). No additional APIs required.

>
> >   * Live migration
>
> Another important feature to design in from the beginning.

This is being worked on at the moment.
Thanks,
D.
  
Tiwei Bie June 25, 2018, 11:01 a.m. UTC | #7
On Thu, Jun 07, 2018 at 05:12:20PM +0200, Dariusz Stojaczyk wrote:
> rte_vhost is not vhost-user spec compliant. Some Vhost drivers have
> been already confirmed not to work with rte_vhost. virtio-user-scsi-pci
> in QEMU 2.12 doesn't fully initialize its management queues at SeaBIOS
> stage. This is perfectly fine from the Vhost-user spec perspective, but
> doesn't meet rte_vhost expectations. rte_vhost waits for all queues
> to be fully initialized before it allows the entire device to be
> processed.
> 
> The problem can be temporarily worked around by start-stopping entire
> device (all queues) on each vhost-user message that could change queue
> state. This would increase general VM boot time and is totally
> unacceptable though.
> 
> Fixing rte_vhost properly would require quite a big amount
> of changes which would completely break backward compatibility.
> We're introducing a new rte_vhost2 library intended to smooth the transition.
> It exposes a low-level API for implementing new vhost devices.
> The existing rte_vhost is about to be refactored to use rte_vhost2
> underneath and demanding users could now use rte_vhost2 directly.
> 
> We're designing a fresh library here, so this opens up a great
> amount of possibilities and improvements we can make to the public
> API that will pay off significantly for the sake of future
> specification/library extensions.
> 
> rte_vhost2 abstracts away most vhost-user/virtio-vhost-user specifics
> and allows developers to implement vhost devices with ease.
> It calls user-provided callbacks once proper device initialization
> state has been reached. That is - memory mappings have changed,
> virtqueues are ready to be processed, features have changed at
> runtime, etc.
> 
> Compared to the rte_vhost, this lib additionally allows the following:
> 
> * ability to start/stop particular queues
> * most callbacks are now asynchronous - it greatly simplifies
>   the event handling for asynchronous applications and doesn't
>   make anything harder for synchronous ones.
> * this is low-level vhost API. It doesn't have any vhost-net, nvme
>   or crypto references. These backend-specific libraries will
>   be later refactored to use *this* generic library underneath.
>   This implies that the library doesn't do any virtqueue processing,
>   it only delivers vring addresses to the user, who can then process
>   the virtqueues directly.
> * abstracting away Unix domain sockets (vhost-user) and PCI
>   (virtio-vhost-user).
> * APIs for interrupt-driven drivers
> * replacing gpa_to_vva function with an IOMMU-aware variant
> * The API imposes how public functions can be called and how
>   internal data can change, so there's only a minimal work required
>   to ensure thread-safety. Possibly no mutexes are required at all.
> * full Virtio 1.0/vhost-user specification compliance.
> 
> The proposed structure for the new library is described below.
>  * rte_vhost2.h
>    - public API
>    - registering targets with provided ops
>    - unregistering targets
>    - iova_to_vva()
>  * transport.h/.c
>    - implements rte_vhost2.h
>    - allows registering vhost transports, which are opaquely required by the
>      rte_vhost2.h API (target register function requires transport name).
>    - exposes a set of callbacks to be implemented by each transport
>  * vhost_user.c
>    - vhost-user Unix domain socket transport
>    - does recvmsg()
>    - uses the underlying vhost-user helper lib to process messages, but still
>      handles some transport-specific ones, e.g. SET_MEM_TABLE
>    - calls some of the rte_vhost2.h ops registered with a target
>  * fd_man.h/.c
>    - polls provided descriptors, calls user callbacks on fd events
>    - based on the original rte_vhost version
>    - additionally allows calling user-provided callbacks on the poll thread
>  * vhost.h/.c
>    - a transport-agnostic vhost-user library
>    - calls most of the rte_vhost2.h ops registered with a target
>    - manages virtqueues state
>    - hopefully to be reused by the virtio-vhost-user
>    - exposes a set of callbacks to be implemented by each transport
>      (for e.g. sending message replies)
> 
> This series includes only vhost-user transport. Virtio-vhost-user
> is to be done later.
> 
> The following items are still TBD:
>   * vhost-user slave channel
>   * get/set_config
>   * cfg_call() implementation
>   * IOTLB
>   * NUMA awareness
>   * Live migration
>   * vhost-net implementation (to port rte_vhost over)
>   * vhost-crypto implementation (to port rte_vhost over)
>   * vhost-user client mode (connecting to an external socket)
>   * various error logging
> 
> Dariusz Stojaczyk (7):
>   vhost2: add initial implementation
>   vhost2: add vhost-user helper layer
>   vhost2/vhost: handle queue_stop/device_destroy failure
>   vhost2: add asynchronous fd_man
>   vhost2: add initial vhost-user transport
>   vhost2/vhost: support protocol features
>   vhost2/user: implement tgt unregister path

Hi Dariusz,

Thank you for putting efforts in making the DPDK
vhost more generic!

From my understanding, your proposal is that:

1) Introduce rte_vhost2 to provide the APIs which
   allow users to implement vhost backends like
   SCSI, net, crypto, ..

2) Refactor the existing rte_vhost to use rte_vhost2.
   The rte_vhost will still provide below existing
   sets of APIs:
    1. The APIs which allow users to implement
       external vhost backends (these APIs were
       designed for SPDK previously)
    2. The APIs provided by the net backend
    3. The APIs provided by the crypto backend
   And above APIs in rte_vhost won't be changed.

Is my above understanding correct? Thanks!

Best regards,
Tiwei Bie


> 
>  lib/Makefile                             |   4 +-
>  lib/librte_vhost2/Makefile               |  25 +
>  lib/librte_vhost2/fd_man.c               | 288 +++++++++++
>  lib/librte_vhost2/fd_man.h               | 125 +++++
>  lib/librte_vhost2/rte_vhost2.h           | 304 +++++++++++
>  lib/librte_vhost2/rte_vhost2_version.map |  12 +
>  lib/librte_vhost2/transport.c            |  74 +++
>  lib/librte_vhost2/transport.h            |  32 ++
>  lib/librte_vhost2/vhost.c                | 728 ++++++++++++++++++++++++++
>  lib/librte_vhost2/vhost.h                | 203 ++++++++
>  lib/librte_vhost2/vhost_user.c           | 845 +++++++++++++++++++++++++++++++
>  11 files changed, 2638 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_vhost2/Makefile
>  create mode 100644 lib/librte_vhost2/fd_man.c
>  create mode 100644 lib/librte_vhost2/fd_man.h
>  create mode 100644 lib/librte_vhost2/rte_vhost2.h
>  create mode 100644 lib/librte_vhost2/rte_vhost2_version.map
>  create mode 100644 lib/librte_vhost2/transport.c
>  create mode 100644 lib/librte_vhost2/transport.h
>  create mode 100644 lib/librte_vhost2/vhost.c
>  create mode 100644 lib/librte_vhost2/vhost.h
>  create mode 100644 lib/librte_vhost2/vhost_user.c
> 
> -- 
> 2.11.0
  
Stojaczyk, DariuszX June 25, 2018, 12:17 p.m. UTC | #8
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tiwei Bie
> Sent: Monday, June 25, 2018 1:02 PM
> <snip>
> 
> Hi Dariusz,
> 

Hi Tiwei,

> Thank you for putting efforts in making the DPDK
> vhost more generic!
> 
> From my understanding, your proposal is that:
> 
> 1) Introduce rte_vhost2 to provide the APIs which
>    allow users to implement vhost backends like
>    SCSI, net, crypto, ..
> 

That's right.

> 2) Refactor the existing rte_vhost to use rte_vhost2.
>    The rte_vhost will still provide below existing
>    sets of APIs:
>     1. The APIs which allow users to implement
>        external vhost backends (these APIs were
>        designed for SPDK previously)
>     2. The APIs provided by the net backend
>     3. The APIs provided by the crypto backend
>    And above APIs in rte_vhost won't be changed.

That's correct. rte_vhost would register its own rte_vhost2_tgt_ops underneath and call the existing vhost_device_ops, e.g. starting the device once all queues are started.
Regards,
D.
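
To make the idea concrete, the compat layer could look roughly like this. This is a hypothetical sketch: the names struct vhost2_dev, compat_queue_start() and the counter fields are illustrative only, since the actual rte_vhost2 API is still under review:

```c
#include <stddef.h>

struct vhost2_dev {
	int nr_queues;		/* queues negotiated for this device */
	int nr_started;		/* queues started so far */
	void *opaque;		/* application context */
};

static int new_device_calls;	/* observable side effect for the demo */

/* Legacy-style callback the application registered with rte_vhost. */
static void legacy_new_device(void *opaque)
{
	(void)opaque;
	new_device_calls++;	/* real code would start packet processing */
}

/* rte_vhost2-style per-queue callback registered by the compat layer. */
static void compat_queue_start(struct vhost2_dev *dev, int qid)
{
	(void)qid;
	/* Preserve legacy rte_vhost semantics: fire new_device() only
	 * once every queue of the device has been started. */
	if (++dev->nr_started == dev->nr_queues)
		legacy_new_device(dev->opaque);
}
```

The reverse direction (a queue stopping) would symmetrically map onto the legacy destroy_device()/vring_state_changed() callbacks.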

> 
> Is my above understanding correct? Thanks!
> 
> Best regards,
> Tiwei Bie
>
  
Tiwei Bie June 26, 2018, 8:22 a.m. UTC | #9
On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tiwei Bie
> > Sent: Monday, June 25, 2018 1:02 PM
> > <snip>
> > 
> > Hi Dariusz,
> > 
> 
> Hi Tiwei,
> 
> > Thank you for putting efforts in making the DPDK
> > vhost more generic!
> > 
> > From my understanding, your proposal is that:
> > 
> > 1) Introduce rte_vhost2 to provide the APIs which
> >    allow users to implement vhost backends like
> >    SCSI, net, crypto, ..
> > 
> 
> That's right.
> 
> > 2) Refactor the existing rte_vhost to use rte_vhost2.
> >    The rte_vhost will still provide below existing
> >    sets of APIs:
> >     1. The APIs which allow users to implement
> >        external vhost backends (these APIs were
> >        designed for SPDK previously)
> >     2. The APIs provided by the net backend
> >     3. The APIs provided by the crypto backend
> >    And above APIs in rte_vhost won't be changed.
> 
> That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops underneath and will call existing vhost_device_ops for e.g. starting the device once all queues are started.

Currently I have below concerns and questions:

- The rte_vhost's problem is still there. Even though
  rte_vhost2 is introduced, the net and crypto backends
  in rte_vhost won't benefit from the new callbacks.

  The existing rte_vhost in DPDK not only provides the
  APIs for DPDK applications to implement the external
  backends. But also provides high performance net and
  crypto backends implementation (maybe more in the
  future). So it's important that besides the DPDK
  applications which implement their external backends,
  the DPDK applications which use the builtin backends
  will also benefit from the new callbacks.

  So we should have a clear plan for how the legacy
  callbacks in rte_vhost will be dealt with in the next step.

  Besides, the new library's name is a bit misleading.
  It makes the existing rte_vhost library sound like an
  obsolete library. But actually the existing rte_vhost
  isn't an obsolete library. It will still provide the
  net and crypto backends. So if we want to introduce
  this new library, we should give it a better name.

- It's possible to solve rte_vhost's problem you met
  by refactoring the existing vhost library directly
  instead of re-implementing a new vhost library from
  scratch and keeping the old one's problem as is.

  In this way, it will solve the problem you met and
  also solve the problem for rte_vhost. Why not go
  this way? Something like:

  Below is the existing callbacks set in rte_vhost.h:

  /**
   * Device and vring operations.
   */
  struct vhost_device_ops {
          ......
  };

  It's a legacy implementation, and doesn't really
  follow the DPDK API design (e.g. no rte_ prefix).
  We can design and implement a new message handling
  and a new set of callbacks for rte_vhost to solve
  the problem you met without changing the old one.
  Something like:

  struct rte_vhost_device_ops {
          ......
  }

  int
  vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg *msg)
  {
          ......

          if (!vdev->is_using_new_device_ops) {
                  // Call the existing message handler
                  return vhost_user_msg_handler_legacy(vdev, msg);
          }

          // Implement the new logic here
          ......
  }

  A vhost application is allowed to register only struct
  rte_vhost_device_ops or struct vhost_device_ops (which
  should be deprecated in the future). The two ops cannot
  be registered at the same time.

  The existing applications could use the old ops. And
  if an application registers struct rte_vhost_device_ops,
  the new callbacks and message handler will be used.

Best regards,
Tiwei Bie


> Regards,
> D.
> 
> > 
> > Is my above understanding correct? Thanks!
> > 
> > Best regards,
> > Tiwei Bie
> >
  
Thomas Monjalon June 26, 2018, 8:30 a.m. UTC | #10
26/06/2018 10:22, Tiwei Bie:
> On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > From: Tiwei Bie
> > > 
> > > Hi Dariusz,
> > 
> > Hi Tiwei,
> > 
> > > Thank you for putting efforts in making the DPDK
> > > vhost more generic!
> > > 
> > > From my understanding, your proposal is that:
> > > 
> > > 1) Introduce rte_vhost2 to provide the APIs which
> > >    allow users to implement vhost backends like
> > >    SCSI, net, crypto, ..
> > 
> > That's right.
> > 
> > > 2) Refactor the existing rte_vhost to use rte_vhost2.
> > >    The rte_vhost will still provide below existing
> > >    sets of APIs:
> > >     1. The APIs which allow users to implement
> > >        external vhost backends (these APIs were
> > >        designed for SPDK previously)
> > >     2. The APIs provided by the net backend
> > >     3. The APIs provided by the crypto backend
> > >    And above APIs in rte_vhost won't be changed.
> > 
> > That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops underneath and will call existing vhost_device_ops for e.g. starting the device once all queues are started.
> 
> Currently I have below concerns and questions:
> 
> - The rte_vhost's problem is still there. Even though
>   rte_vhost2 is introduced, the net and crypto backends
>   in rte_vhost won't benefit from the new callbacks.
> 
>   The existing rte_vhost in DPDK not only provides the
>   APIs for DPDK applications to implement the external
>   backends. But also provides high performance net and
>   crypto backends implementation (maybe more in the
>   future). So it's important that besides the DPDK
>   applications which implement their external backends,
>   the DPDK applications which use the builtin backends
>   will also benefit from the new callbacks.
> 
>   So we should have a clear plan on how will the legacy
>   callbacks in rte_vhost be dealt with in the next step.
> 
>   Besides, the new library's name is a bit misleading.
>   It makes the existing rte_vhost library sound like an
>   obsolete library. But actually the existing rte_vhost
>   isn't an obsolete library. It will still provide the
>   net and crypto backends. So if we want to introduce
>   this new library, we should give it a better name.
> 
> - It's possible to solve rte_vhost's problem you met
>   by refactoring the existing vhost library directly
>   instead of re-implementing a new vhost library from
>   scratch and keeping the old one's problem as is.

+1

>   In this way, it will solve the problem you met and
>   also solve the problem for rte_vhost. Why not go
>   this way? Something like:
> 
>   Below is the existing callbacks set in rte_vhost.h:
> 
>   /**
>    * Device and vring operations.
>    */
>   struct vhost_device_ops {
>           ......
>   };
> 
>   It's a legacy implementation, and doesn't really
>   follow the DPDK API design (e.g. no rte_ prefix).
>   We can design and implement a new message handling
>   and a new set of callbacks for rte_vhost to solve
>   the problem you met without changing the old one.
>   Something like:
> 
>   struct rte_vhost_device_ops {
>           ......
>   }
> 
>   int
>   vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg *msg)
>   {
>           ......
> 
>           if (!vdev->is_using_new_device_ops) {
>                   // Call the existing message handler
>                   return vhost_user_msg_handler_legacy(vdev, msg);
>           }
> 
>           // Implement the new logic here
>           ......
>   }
> 
>   A vhost application is allowed to register only struct
>   rte_vhost_device_ops or struct vhost_device_ops (which
>   should be deprecated in the future). The two ops cannot
>   be registered at the same time.
> 
>   The existing applications could use the old ops. And
>   if an application registers struct rte_vhost_device_ops,
>   the new callbacks and message handler will be used.
  
Stojaczyk, DariuszX June 26, 2018, 8:47 a.m. UTC | #11
> -----Original Message-----
> From: Bie, Tiwei
> Sent: Tuesday, June 26, 2018 10:22 AM
> To: Stojaczyk, DariuszX <dariuszx.stojaczyk@intel.com>
> Cc: Dariusz Stojaczyk <darek.stojaczyk@gmail.com>; dev@dpdk.org; Maxime
> Coquelin <maxime.coquelin@redhat.com>; Tetsuya Mukawa
> <mtetsuyah@gmail.com>; Stefan Hajnoczi <stefanha@redhat.com>; Thomas
> Monjalon <thomas@monjalon.net>; yliu@fridaylinux.org; Harris, James R
> <james.r.harris@intel.com>; Kulasek, TomaszX <tomaszx.kulasek@intel.com>;
> Wodkowski, PawelX <pawelx.wodkowski@intel.com>
> Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
> 
> On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tiwei Bie
> > > Sent: Monday, June 25, 2018 1:02 PM
> > > <snip>
> > >
> > > Hi Dariusz,
> > >
> >
> > Hi Tiwei,
> >
> > > Thank you for putting efforts in making the DPDK
> > > vhost more generic!
> > >
> > > From my understanding, your proposal is that:
> > >
> > > 1) Introduce rte_vhost2 to provide the APIs which
> > >    allow users to implement vhost backends like
> > >    SCSI, net, crypto, ..
> > >
> >
> > That's right.
> >
> > > 2) Refactor the existing rte_vhost to use rte_vhost2.
> > >    The rte_vhost will still provide below existing
> > >    sets of APIs:
> > >     1. The APIs which allow users to implement
> > >        external vhost backends (these APIs were
> > >        designed for SPDK previously)
> > >     2. The APIs provided by the net backend
> > >     3. The APIs provided by the crypto backend
> > >    And above APIs in rte_vhost won't be changed.
> >
> > That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops
> underneath and will call existing vhost_device_ops for e.g. starting the device
> once all queues are started.
> 
> Currently I have below concerns and questions:
> 
> - The rte_vhost's problem is still there. Even though
>   rte_vhost2 is introduced, the net and crypto backends
>   in rte_vhost won't benefit from the new callbacks.
> 
>   The existing rte_vhost in DPDK not only provides the
>   APIs for DPDK applications to implement the external
>   backends. But also provides high performance net and
>   crypto backends implementation (maybe more in the
>   future). So it's important that besides the DPDK
>   applications which implement their external backends,
>   the DPDK applications which use the builtin backends
>   will also benefit from the new callbacks.
> 
>   So we should have a clear plan on how will the legacy
>   callbacks in rte_vhost be dealt with in the next step.
> 
>   Besides, the new library's name is a bit misleading.
>   It makes the existing rte_vhost library sound like an
>   obsolete library. But actually the existing rte_vhost
>   isn't an obsolete library. It will still provide the
>   net and crypto backends. So if we want to introduce
>   this new library, we should give it a better name.
> 
> - It's possible to solve rte_vhost's problem you met
>   by refactoring the existing vhost library directly
>   instead of re-implementing a new vhost library from
>   scratch and keeping the old one's problem as is.
> 
>   In this way, it will solve the problem you met and
>   also solve the problem for rte_vhost. Why not go
>   this way? Something like:
> 
>   Below is the existing callbacks set in rte_vhost.h:
> 
>   /**
>    * Device and vring operations.
>    */
>   struct vhost_device_ops {
>           ......
>   };
> 
>   It's a legacy implementation, and doesn't really
>   follow the DPDK API design (e.g. no rte_ prefix).
>   We can design and implement a new message handling
>   and a new set of callbacks for rte_vhost to solve
>   the problem you met without changing the old one.
>   Something like:
> 
>   struct rte_vhost_device_ops {
>           ......
>   }
> 
>   int
>   vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg
> *msg)
>   {
>           ......
> 
>           if (!vdev->is_using_new_device_ops) {
>                   // Call the existing message handler
>                   return vhost_user_msg_handler_legacy(vdev, msg);
>           }
> 
>           // Implement the new logic here
>           ......
>   }
> 
>   A vhost application is allowed to register only struct
>   rte_vhost_device_ops or struct vhost_device_ops (which
>   should be deprecated in the future). The two ops cannot
>   be registered at the same time.
> 
>   The existing applications could use the old ops. And
>   if an application registers struct rte_vhost_device_ops,
>   the new callbacks and message handler will be used.

Please notice that some features like vIOMMU are not even a part of the public rte_vhost API. Only vhost-net benefits from vIOMMU right now. Separating vhost-net from a generic vhost library (rte_vhost2) would avoid such design mistakes in the future. What's the point of having a single rte_vhost library if some vhost-user features are only implemented for vhost-net?

> 
> Best regards,
> Tiwei Bie
> 
> 
> > Regards,
> > D.
> >
> > >
> > > Is my above understanding correct? Thanks!
> > >
> > > Best regards,
> > > Tiwei Bie
> > >
  
Tiwei Bie June 26, 2018, 9:14 a.m. UTC | #12
On Tue, Jun 26, 2018 at 04:47:33PM +0800, Stojaczyk, DariuszX wrote:
> > -----Original Message-----
> > From: Bie, Tiwei
> > Sent: Tuesday, June 26, 2018 10:22 AM
> > To: Stojaczyk, DariuszX <dariuszx.stojaczyk@intel.com>
> > Cc: Dariusz Stojaczyk <darek.stojaczyk@gmail.com>; dev@dpdk.org; Maxime
> > Coquelin <maxime.coquelin@redhat.com>; Tetsuya Mukawa
> > <mtetsuyah@gmail.com>; Stefan Hajnoczi <stefanha@redhat.com>; Thomas
> > Monjalon <thomas@monjalon.net>; yliu@fridaylinux.org; Harris, James R
> > <james.r.harris@intel.com>; Kulasek, TomaszX <tomaszx.kulasek@intel.com>;
> > Wodkowski, PawelX <pawelx.wodkowski@intel.com>
> > Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
> > 
> > On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tiwei Bie
> > > > Sent: Monday, June 25, 2018 1:02 PM
> > > > <snip>
> > > >
> > > > Hi Dariusz,
> > > >
> > >
> > > Hi Tiwei,
> > >
> > > > Thank you for putting efforts in making the DPDK
> > > > vhost more generic!
> > > >
> > > > From my understanding, your proposal is that:
> > > >
> > > > 1) Introduce rte_vhost2 to provide the APIs which
> > > >    allow users to implement vhost backends like
> > > >    SCSI, net, crypto, ..
> > > >
> > >
> > > That's right.
> > >
> > > > 2) Refactor the existing rte_vhost to use rte_vhost2.
> > > >    The rte_vhost will still provide below existing
> > > >    sets of APIs:
> > > >     1. The APIs which allow users to implement
> > > >        external vhost backends (these APIs were
> > > >        designed for SPDK previously)
> > > >     2. The APIs provided by the net backend
> > > >     3. The APIs provided by the crypto backend
> > > >    And above APIs in rte_vhost won't be changed.
> > >
> > > That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops
> > underneath and will call existing vhost_device_ops for e.g. starting the device
> > once all queues are started.
> > 
> > Currently I have below concerns and questions:
> > 
> > - The rte_vhost's problem is still there. Even though
> >   rte_vhost2 is introduced, the net and crypto backends
> >   in rte_vhost won't benefit from the new callbacks.
> > 
> >   The existing rte_vhost in DPDK not only provides the
> >   APIs for DPDK applications to implement the external
> >   backends. But also provides high performance net and
> >   crypto backends implementation (maybe more in the
> >   future). So it's important that besides the DPDK
> >   applications which implement their external backends,
> >   the DPDK applications which use the builtin backends
> >   will also benefit from the new callbacks.
> > 
> >   So we should have a clear plan on how will the legacy
> >   callbacks in rte_vhost be dealt with in the next step.
> > 
> >   Besides, the new library's name is a bit misleading.
> >   It makes the existing rte_vhost library sound like an
> >   obsolete library. But actually the existing rte_vhost
> >   isn't an obsolete library. It will still provide the
> >   net and crypto backends. So if we want to introduce
> >   this new library, we should give it a better name.
> > 
> > - It's possible to solve rte_vhost's problem you met
> >   by refactoring the existing vhost library directly
> >   instead of re-implementing a new vhost library from
> >   scratch and keeping the old one's problem as is.
> > 
> >   In this way, it will solve the problem you met and
> >   also solve the problem for rte_vhost. Why not go
> >   this way? Something like:
> > 
> >   Below is the existing callbacks set in rte_vhost.h:
> > 
> >   /**
> >    * Device and vring operations.
> >    */
> >   struct vhost_device_ops {
> >           ......
> >   };
> > 
> >   It's a legacy implementation, and doesn't really
> >   follow the DPDK API design (e.g. no rte_ prefix).
> >   We can design and implement a new message handling
> >   and a new set of callbacks for rte_vhost to solve
> >   the problem you met without changing the old one.
> >   Something like:
> > 
> >   struct rte_vhost_device_ops {
> >           ......
> >   }
> > 
> >   int
> >   vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg
> > *msg)
> >   {
> >           ......
> > 
> >           if (!vdev->is_using_new_device_ops) {
> >                   // Call the existing message handler
> >                   return vhost_user_msg_handler_legacy(vdev, msg);
> >           }
> > 
> >           // Implement the new logic here
> >           ......
> >   }
> > 
> >   A vhost application is allowed to register only struct
> >   rte_vhost_device_ops or struct vhost_device_ops (which
> >   should be deprecated in the future). The two ops cannot
> >   be registered at the same time.
> > 
> >   The existing applications could use the old ops. And
> >   if an application registers struct rte_vhost_device_ops,
> >   the new callbacks and message handler will be used.
> 
> Please notice that some features like vIOMMU are not even a part of the public rte_vhost API. Only vhost-net benefits from vIOMMU right now. Separating vhost-net from a generic vhost library (rte_vhost2) would avoid making such design mistakes in future. What's the point of having a single rte_vhost library, if some vhost-user features are only implemented for vhost-net.

These APIs can be safely added at any time,
and we can also ask developers to add public
APIs when new features warrant them in the
future. I don't think it's a big problem.

Best regards,
Tiwei Bie

> 
> > 
> > Best regards,
> > Tiwei Bie
> > 
> > 
> > > Regards,
> > > D.
> > >
> > > >
> > > > Is my above understanding correct? Thanks!
> > > >
> > > > Best regards,
> > > > Tiwei Bie
> > > >
  
Maxime Coquelin June 26, 2018, 9:38 a.m. UTC | #13
On 06/26/2018 11:14 AM, Tiwei Bie wrote:
> On Tue, Jun 26, 2018 at 04:47:33PM +0800, Stojaczyk, DariuszX wrote:
>>> -----Original Message-----
>>> From: Bie, Tiwei
>>> Sent: Tuesday, June 26, 2018 10:22 AM
>>> To: Stojaczyk, DariuszX <dariuszx.stojaczyk@intel.com>
>>> Cc: Dariusz Stojaczyk <darek.stojaczyk@gmail.com>; dev@dpdk.org; Maxime
>>> Coquelin <maxime.coquelin@redhat.com>; Tetsuya Mukawa
>>> <mtetsuyah@gmail.com>; Stefan Hajnoczi <stefanha@redhat.com>; Thomas
>>> Monjalon <thomas@monjalon.net>; yliu@fridaylinux.org; Harris, James R
>>> <james.r.harris@intel.com>; Kulasek, TomaszX <tomaszx.kulasek@intel.com>;
>>> Wodkowski, PawelX <pawelx.wodkowski@intel.com>
>>> Subject: Re: [dpdk-dev] [RFC v3 0/7] vhost2: new librte_vhost2 proposal
>>>
>>> On Mon, Jun 25, 2018 at 08:17:08PM +0800, Stojaczyk, DariuszX wrote:
>>>>> -----Original Message-----
>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Tiwei Bie
>>>>> Sent: Monday, June 25, 2018 1:02 PM
>>>>> <snip>
>>>>>
>>>>> Hi Dariusz,
>>>>>
>>>>
>>>> Hi Tiwei,
>>>>
>>>>> Thank you for putting efforts in making the DPDK
>>>>> vhost more generic!
>>>>>
>>>>>  From my understanding, your proposal is that:
>>>>>
>>>>> 1) Introduce rte_vhost2 to provide the APIs which
>>>>>     allow users to implement vhost backends like
>>>>>     SCSI, net, crypto, ..
>>>>>
>>>>
>>>> That's right.
>>>>
>>>>> 2) Refactor the existing rte_vhost to use rte_vhost2.
>>>>>     The rte_vhost will still provide below existing
>>>>>     sets of APIs:
>>>>>      1. The APIs which allow users to implement
>>>>>         external vhost backends (these APIs were
>>>>>         designed for SPDK previously)
>>>>>      2. The APIs provided by the net backend
>>>>>      3. The APIs provided by the crypto backend
>>>>>     And above APIs in rte_vhost won't be changed.
>>>>
>>>> That's correct. Rte_vhost would register its own rte_vhost2_tgt_ops
>>> underneath and will call existing vhost_device_ops for e.g. starting the device
>>> once all queues are started.
>>>
>>> Currently I have below concerns and questions:
>>>
>>> - The rte_vhost's problem is still there. Even though
>>>    rte_vhost2 is introduced, the net and crypto backends
>>>    in rte_vhost won't benefit from the new callbacks.
>>>
>>>    The existing rte_vhost in DPDK not only provides the
>>>    APIs for DPDK applications to implement the external
>>>    backends. But also provides high performance net and
>>>    crypto backends implementation (maybe more in the
>>>    future). So it's important that besides the DPDK
>>>    applications which implement their external backends,
>>>    the DPDK applications which use the builtin backends
>>>    will also benefit from the new callbacks.
>>>
>>>    So we should have a clear plan on how will the legacy
>>>    callbacks in rte_vhost be dealt with in the next step.
>>>
>>>    Besides, the new library's name is a bit misleading.
>>>    It makes the existing rte_vhost library sound like an
>>>    obsolete library. But actually the existing rte_vhost
>>>    isn't an obsolete library. It will still provide the
>>>    net and crypto backends. So if we want to introduce
>>>    this new library, we should give it a better name.
>>>
>>> - It's possible to solve rte_vhost's problem you met
>>>    by refactoring the existing vhost library directly
>>>    instead of re-implementing a new vhost library from
>>>    scratch and keeping the old one's problem as is.
>>>
>>>    In this way, it will solve the problem you met and
>>>    also solve the problem for rte_vhost. Why not go
>>>    this way? Something like:
>>>
>>>    Below is the existing callbacks set in rte_vhost.h:
>>>
>>>    /**
>>>     * Device and vring operations.
>>>     */
>>>    struct vhost_device_ops {
>>>            ......
>>>    };
>>>
>>>    It's a legacy implementation, and doesn't really
>>>    follow the DPDK API design (e.g. no rte_ prefix).
>>>    We can design and implement a new message handling
>>>    and a new set of callbacks for rte_vhost to solve
>>>    the problem you met without changing the old one.
>>>    Something like:
>>>
>>>    struct rte_vhost_device_ops {
>>>            ......
>>>    }
>>>
>>>    int
>>>    vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg
>>> *msg)
>>>    {
>>>            ......
>>>
>>>            if (!vdev->is_using_new_device_ops) {
>>>                    // Call the existing message handler
>>>                    return vhost_user_msg_handler_legacy(vdev, msg);
>>>            }
>>>
>>>            // Implement the new logic here
>>>            ......
>>>    }
>>>
>>>    A vhost application is allowed to register only struct
>>>    rte_vhost_device_ops or struct vhost_device_ops (which
>>>    should be deprecated in the future). The two ops cannot
>>>    be registered at the same time.
>>>
>>>    The existing applications could use the old ops. And
>>>    if an application registers struct rte_vhost_device_ops,
>>>    the new callbacks and message handler will be used.
>>
>> Please notice that some features like vIOMMU are not even a part of the public rte_vhost API. Only vhost-net benefits from vIOMMU right now. Separating vhost-net from a generic vhost library (rte_vhost2) would avoid making such design mistakes in the future. What's the point of having a single rte_vhost library if some vhost-user features are only implemented for vhost-net?
> 
> These APIs can be safely added at any time.
> And we can also ask developers to add public
> APIs if it matters when adding new features
> in the future. I don't think it's a big
> problem.

+1, I don't think it is a problem.
It is better to have it internal only at the beginning than having to
break the API.

Thanks,
Maxime
> Best regards,
> Tiwei Bie
> 
>>
>>>
>>> Best regards,
>>> Tiwei Bie
>>>
>>>
>>>> Regards,
>>>> D.
>>>>
>>>>>
>>>>> Is my above understanding correct? Thanks!
>>>>>
>>>>> Best regards,
>>>>> Tiwei Bie
>>>>>
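The dual-ops dispatch Tiwei sketches above can be made concrete. The following is a minimal, self-contained model of the idea — every name here (`vhost_dev`, `is_using_new_device_ops`, the handler functions) is a hypothetical stand-in taken from the pseudocode in the thread, not an actual rte_vhost symbol:

```c
#include <assert.h>

/* Hypothetical stand-ins for the structures discussed above. */
struct vhost_user_msg {
	int id;
};

struct vhost_dev {
	int is_using_new_device_ops; /* set when rte_vhost_device_ops was registered */
	int legacy_calls;            /* counters just to observe the routing */
	int new_calls;
};

/* Legacy path: the message handler that exists in rte_vhost today. */
static int
vhost_user_msg_handler_legacy(struct vhost_dev *vdev, struct vhost_user_msg *msg)
{
	(void)msg;
	vdev->legacy_calls++;
	return 0;
}

/* New path: handler driving the proposed rte_vhost_device_ops callbacks. */
static int
vhost_user_msg_handler_new(struct vhost_dev *vdev, struct vhost_user_msg *msg)
{
	(void)msg;
	vdev->new_calls++;
	return 0;
}

/*
 * Single entry point: route each message to exactly one handler, based on
 * which ops set the application registered. The two ops sets are mutually
 * exclusive, so no per-message feature negotiation is needed.
 */
static int
vhost_user_msg_handler(struct vhost_dev *vdev, struct vhost_user_msg *msg)
{
	if (!vdev->is_using_new_device_ops)
		return vhost_user_msg_handler_legacy(vdev, msg);
	return vhost_user_msg_handler_new(vdev, msg);
}
```

Because registration of the two ops sets is mutually exclusive, the branch is per-device, not per-message, and existing applications keep the legacy path untouched.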
  

Patch

diff --git a/lib/librte_vhost2/rte_vhost2.h b/lib/librte_vhost2/rte_vhost2.h
new file mode 100644
index 0000000..385b093
--- /dev/null
+++ b/lib/librte_vhost2/rte_vhost2.h
@@ -0,0 +1,331 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright (c) Intel Corporation.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_VHOST2_H_
+#define _RTE_VHOST2_H_
+
+/**
+ * @file
+ * This library abstracts away most Vhost-user/virtio-vhost-user specifics
+ * and allows developers to implement Vhost devices with ease.
+ * It calls user-provided callbacks once proper device initialization
+ * state has been reached. That is - memory mappings have changed,
+ * virtqueues are ready to be processed, features have changed in runtime, etc.
+ */
+
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <rte_branch_prediction.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Not C++-aware. */
+#include <linux/vhost.h>
+
+#define RTE_VHOST2_MEMORY_MAX_NREGIONS	8
+
+#define RTE_VHOST2_CLIENT		(1ULL << 0)
+#define RTE_VHOST2_NO_RECONNECT		(1ULL << 1)
+
+enum rte_vhost2_set_config_type {
+	/** Config changed on request by the vhost driver. */
+	VHOST_SET_CONFIG_TYPE_MASTER = 0,
+	/** Config is being restored after a successful migration. */
+	VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
+};
+
+#define RTE_VHOST2_MSG_VERSION_MASK	(0x3)
+#define RTE_VHOST2_MSG_REPLY_MASK	(0x1 << 2)
+#define RTE_VHOST2_MSG_NEED_REPLY	(0x1 << 3)
+
+struct rte_vhost2_msg {
+	uint32_t id;
+	uint32_t flags;
+	uint32_t size; /**< Size of the payload that follows. */
+	void *payload;
+	int fds[RTE_VHOST2_MEMORY_MAX_NREGIONS];
+};
+
+/** Single memory region. Both physically and virtually contiguous */
+struct rte_vhost2_mem_region {
+	uint64_t guest_phys_addr;
+	uint64_t guest_user_addr;
+	uint64_t host_user_addr;
+	uint64_t size;
+	void *mmap_addr;
+	uint64_t mmap_size;
+	int fd;
+};
+
+struct rte_vhost2_memory {
+	uint32_t nregions;
+	struct rte_vhost2_mem_region regions[];
+};
+
+/**
+ * Vhost device created and managed by rte_vhost2. Accessible via
+ * \c rte_vhost2_tgt_ops callbacks. This is only a part of the real
+ * vhost device data. This struct is published just for inline vdev
+ * functions to access their data directly.
+ */
+struct rte_vhost2_dev {
+	struct rte_vhost2_memory *mem;
+	bool iommu; /**< \c VIRTIO_F_IOMMU_PLATFORM has been negotiated */
+};
+
+/**
+ * Virtqueue created and managed by rte_vhost2. Accessible via
+ * \c rte_vhost2_tgt_ops callbacks.
+ */
+struct rte_vhost2_vq {
+	struct vring_desc *desc;
+	struct vring_avail *avail;
+	struct vring_used *used;
+	/** available only if \c VHOST_F_LOG_ALL has been negotiated */
+	void *log;
+	uint16_t size;
+};
+
+/**
+ * Device/queue related callbacks, all optional. Provided callback
+ * parameters are guaranteed not to be NULL unless explicitly specified.
+ */
+struct rte_vhost2_tgt_ops {
+	/**
+	 * New driver connected. If this is completed with a non-zero status,
+	 * rte_vhost2 will terminate the connection.
+	 */
+	void (*device_create)(struct rte_vhost2_dev *vdev);
+	/**
+	 * Device is ready to operate. vdev data is now initialized. This
+	 * callback may be called multiple times as e.g. memory mappings can
+	 * change dynamically. All queues are guaranteed to be stopped by now.
+	 */
+	void (*device_init)(struct rte_vhost2_dev *vdev);
+	/**
+	 * Features have changed at runtime. This is called at least once
+	 * during initialization, before `device_init`. Queues might still be
+	 * running at this point.
+	 */
+	void (*device_features_changed)(struct rte_vhost2_dev *vdev,
+			uint64_t features);
+	/**
+	 * Start processing vq. The `vq` is guaranteed not to be modified
+	 * before `queue_stop` is called.
+	 */
+	void (*queue_start)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+	/**
+	 * Stop processing vq. It shouldn't be accessed after this callback
+	 * completes (via \c rte_vhost2_tgt_cb_complete). This can be called
+	 * prior to shutdown or before actions that require changing vhost
+	 * device/vq state.
+	 */
+	void (*queue_stop)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+	/** Device disconnected. All queues are guaranteed to be stopped by now. */
+	void (*device_destroy)(struct rte_vhost2_dev *vdev);
+	/**
+	 * Custom vhost-user message handler. This is called for
+	 * backend-specific messages (net/crypto/scsi) that weren't recognized
+	 * by the generic message parser. `msg` is available until
+	 * \c rte_vhost2_tgt_cb_complete is called.
+	 */
+	void (*custom_msg)(struct rte_vhost2_dev *vdev, struct rte_vhost2_msg *msg);
+
+	/** Interrupt handler, synchronous. */
+	void (*queue_kick)(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+	/**
+	 * Full device config read, synchronous. Return 0 if `len` bytes of
+	 * `config` have been successfully set, -1 otherwise.
+	 */
+	int (*get_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
+			  uint32_t len);
+	/**
+	 * Device config changed by the driver, synchronous. `type` indicates
+	 * the reason of change.
+	 */
+	int (*set_config)(struct rte_vhost2_dev *vdev, uint8_t *config,
+			  uint32_t offset, uint32_t len,
+			  enum rte_vhost2_set_config_type type);
+
+	void *reserved[8]; /**< Reserved for future extension */
+};
+
+/**
+ * Registers a new vhost target accepting remote connections. Multiple
+ * transports are available. It is possible to create a Vhost-user
+ * Unix domain socket polling local connections, or to connect to a physical
+ * Virtio device and install an interrupt handler.
+ *
+ * This function is thread-safe.
+ *
+ * \param trtype type of the transport used, e.g. "vhost-user",
+ * "PCI-vhost-user", "PCI-vDPA".
+ * \param trid identifier of the device. For PCI this would be the BDF address,
+ * for vhost-user the socket name.
+ * \param trflags additional options for the specified transport
+ * \param trctx additional data for the specified transport. Can be NULL.
+ * \param tgt_ops callbacks to be called upon reaching specific initialization
+ * states.
+ * \param features supported Virtio features, to be negotiated with the
+ * driver's features. rte_vhost2 will append a couple of generic feature bits
+ * which are required by the Virtio spec. TODO list these features here
+ * \return 0 on success, negative errno otherwise
+ */
+int rte_vhost2_tgt_register(const char *trtype, const char *trid,
+			    uint64_t trflags, void *trctx,
+			    struct rte_vhost2_tgt_ops *tgt_ops,
+			    uint64_t features);
+
+/**
+ * Finish async device tgt ops callback. Unless a tgt op has been documented
+ * as 'synchronous' this function must be called at the end of the op handler.
+ * It can be called either before or after the op handler returns. rte_vhost2
+ * won't call any tgt ops callbacks while another one hasn't been finished yet.
+ *
+ * This function is thread-safe.
+ *
+ * \param vdev vhost device
+ * \param rc 0 on success, negative errno otherwise. If a non-zero value is
+ * given, the current callback will be perceived as failed. A queue that
+ * failed to start won't need to be stopped.
+ */
+void rte_vhost2_tgt_cb_complete(struct rte_vhost2_dev *vdev, int rc);
+
+/**
+ * Unregisters a vhost target asynchronously. All active queues will be
+ * stopped and all devices destroyed.
+ *
+ * This function is thread-safe.
+ *
+ * \param cb_fn callback to be called on finish. It'll be called from the same
+ * thread that calls \c rte_vhost2_tgt_ops.
+ * \param cb_arg argument for \c cb_fn
+ * \return 0 on success, negative errno otherwise. `cb_fn` won't be called
+ * if a non-zero value is returned.
+ */
+int rte_vhost2_tgt_unregister(const char *trtype, const char *trid,
+			       void (*cb_fn)(void *arg), void *cb_arg);
+
+/**
+ * Bypass VIRTIO_F_IOMMU_PLATFORM and translate gpa directly.
+ *
+ * This function is thread-safe.
+ *
+ * \param mem vhost device memory
+ * \param gpa guest physical address
+ * \param len length of the memory to translate (in bytes). If the requested
+ * memory chunk crosses a memory region boundary, *len will be set to the
+ * maximum remaining length of virtually contiguous memory. In such a case
+ * the user is required to make another gpa_to_vva(gpa + *len) call.
+ * \return vhost virtual address or NULL if the requested `gpa` is not mapped.
+ */
+static inline void *
+rte_vhost2_gpa_to_vva(struct rte_vhost2_memory *mem, uint64_t gpa, uint64_t *len)
+{
+	struct rte_vhost2_mem_region *r;
+	uint32_t i;
+
+	for (i = 0; i < mem->nregions; i++) {
+		r = &mem->regions[i];
+		if (gpa >= r->guest_phys_addr &&
+		    gpa <  r->guest_phys_addr + r->size) {
+
+			if (unlikely(*len > r->guest_phys_addr + r->size - gpa)) {
+				*len = r->guest_phys_addr + r->size - gpa;
+			}
+
+			return (void *)(uintptr_t)(gpa - r->guest_phys_addr +
+						   r->host_user_addr);
+		}
+	}
+	*len = 0;
+
+	return NULL;
+}
+
+/**
+ * Translate I/O virtual address to vhost address space.
+ *
+ * If VIRTIO_F_IOMMU_PLATFORM has been negotiated, this might potentially send
+ * a TLB miss and wait for the TLB update response.
+ * If VIRTIO_F_IOMMU_PLATFORM has not been negotiated, `iova` is a physical
+ * address and `perm` is ignored.
+ *
+ * This function is thread-safe.
+ *
+ * \param vdev vhost device
+ * \param vq virtqueue. Must be started.
+ * \param iova I/O virtual address
+ * \param len length of the memory to translate (in bytes). If the requested
+ * memory chunk crosses a memory region boundary, *len will be set to the
+ * maximum remaining length of virtually contiguous memory. In such a case
+ * the user is required to make another iova_to_vva(iova + *len) call.
+ * \param perm VHOST_ACCESS_RO, VHOST_ACCESS_WO or VHOST_ACCESS_RW
+ * \return vhost virtual address or NULL if the requested `iova` is not mapped
+ * or the `perm` doesn't match.
+ */
+static inline void *
+rte_vhost2_iova_to_vva(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq,
+		       uint64_t iova, uint32_t *len, uint8_t perm)
+{
+	/* Internal helper implemented by the library; declared here so the
+	 * inline function can call it. */
+	void *__vhost_iova_to_vva(struct rte_vhost2_dev *dev,
+				  struct rte_vhost2_vq *vq, uint64_t iova,
+				  uint32_t *len, uint8_t perm);
+
+	if (!vdev->iommu) {
+		uint64_t len64 = *len;
+		void *vva = rte_vhost2_gpa_to_vva(vdev->mem, iova, &len64);
+
+		*len = (uint32_t)len64;
+		return vva;
+	}
+
+	return __vhost_iova_to_vva(vdev, vq, iova, len, perm);
+}
+
+/**
+ * Notify the driver about vq change. This is an eventfd_write for vhost-user
+ * or an MMIO write for PCI devices.
+ *
+ * \param vdev vhost device
+ * \param vq virtqueue. Must be started.
+ */
+void rte_vhost2_dev_call(struct rte_vhost2_dev *vdev, struct rte_vhost2_vq *vq);
+
+/**
+ * Notify the driver about device config change. This will result in \c
+ * rte_vhost2_tgt_ops->get_config being called.
+ *
+ * \param vdev vhost device
+ */
+void rte_vhost2_dev_cfg_call(struct rte_vhost2_dev *vdev);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_VHOST2_H_ */
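The region lookup in the patch's inline `rte_vhost2_gpa_to_vva` can be exercised standalone. Below is a self-contained copy of that logic under assumed conditions: the struct layouts are trimmed local stand-ins for `rte_vhost2_mem_region`/`rte_vhost2_memory`, the `unlikely` hint is dropped, the missing pointer cast and `NULL` return are added, and the two-region memory map is fabricated for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed local stand-in for struct rte_vhost2_mem_region. */
struct mem_region {
	uint64_t guest_phys_addr;
	uint64_t host_user_addr;
	uint64_t size;
};

/* Trimmed local stand-in for struct rte_vhost2_memory. */
struct memory {
	uint32_t nregions;
	struct mem_region regions[2];
};

/*
 * Same algorithm as rte_vhost2_gpa_to_vva: find the region containing gpa,
 * clamp *len to the bytes remaining until the region end, and return the
 * corresponding host virtual address (NULL if gpa is unmapped).
 */
static void *
gpa_to_vva(struct memory *mem, uint64_t gpa, uint64_t *len)
{
	uint32_t i;

	for (i = 0; i < mem->nregions; i++) {
		struct mem_region *r = &mem->regions[i];

		if (gpa >= r->guest_phys_addr &&
		    gpa < r->guest_phys_addr + r->size) {
			if (*len > r->guest_phys_addr + r->size - gpa)
				*len = r->guest_phys_addr + r->size - gpa;

			return (void *)(uintptr_t)(gpa - r->guest_phys_addr +
						   r->host_user_addr);
		}
	}
	*len = 0;
	return NULL;
}
```

Note how a chunk that crosses a region boundary is not translated in one go: `*len` is clamped to the contiguous remainder, and the caller must issue a second `gpa_to_vva(gpa + *len)` call for the rest, exactly as the header's doc comment describes.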