[dpdk-dev,v2,5/7] net/virtio_user: add vhost kernel support

Message ID 1482477266-39199-6-git-send-email-jianfeng.tan@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Yuanhan Liu
Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel compilation success Compilation OK

Commit Message

Jianfeng Tan Dec. 23, 2016, 7:14 a.m. UTC
  This patch adds support for vhost kernel as the backend for virtio_user.
Three main hook functions are added:
  - vhost_kernel_setup() to open the char device; each vq pair needs one
    vhostfd;
  - vhost_kernel_ioctl() to communicate control messages with the vhost
    kernel module;
  - vhost_kernel_enable_queue_pair() to open the tap device and set it
    as the backend of the corresponding vhost fd (that is to say, vq pair).

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
---
 drivers/net/virtio/Makefile                      |   1 +
 drivers/net/virtio/virtio_user/vhost.h           |   2 +
 drivers/net/virtio/virtio_user/vhost_kernel.c    | 364 +++++++++++++++++++++++
 drivers/net/virtio/virtio_user/virtio_user_dev.c |  21 +-
 drivers/net/virtio/virtio_user/virtio_user_dev.h |   4 +
 5 files changed, 388 insertions(+), 4 deletions(-)
 create mode 100644 drivers/net/virtio/virtio_user/vhost_kernel.c
  

Comments

Yuanhan Liu Dec. 26, 2016, 7:44 a.m. UTC | #1
On Fri, Dec 23, 2016 at 07:14:24AM +0000, Jianfeng Tan wrote:
> + *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
                     ^^^^^^^^^

I think it should be 2016.


> +/* By default, vhost kernel module allows 64 regions, but DPDK allows
> + * 256 segments. As a relief, below function merges those virtually
> + * adjacent memsegs into one region.
> + */
> +static struct vhost_memory_kernel *
> +prepare_vhost_memory_kernel(void)
> +{
> +	uint32_t i, j, k = 0;
> +	struct rte_memseg *seg;
> +	struct vhost_memory_region *mr;
> +	struct vhost_memory_kernel *vm;
> +
> +	vm = malloc(sizeof(struct vhost_memory_kernel) +
> +		    VHOST_KERNEL_MAX_REGIONS *
> +		    sizeof(struct vhost_memory_region));
> +
> +	for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
> +		seg = &rte_eal_get_configuration()->mem_config->memseg[i];
> +		if (!seg->addr)
> +			break;
> +
> +		int new_region = 1;
> +
> +		for (j = 0; j < k; ++j) {
> +			mr = &vm->regions[j];
> +
> +			if (mr->userspace_addr + mr->memory_size ==
> +			    (uint64_t)seg->addr) {
> +				mr->memory_size += seg->len;
> +				new_region = 0;
> +				break;
> +			}
> +
> +			if ((uint64_t)seg->addr + seg->len ==
> +			    mr->userspace_addr) {
> +				mr->guest_phys_addr = (uint64_t)seg->addr;

Be careful, such a cast would break the 32-bit OS build.
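
A round trip through uintptr_t would keep it portable; a rough, untested sketch:

	/* cast via uintptr_t so the pointer-to-integer conversion is
	 * well defined on both 32-bit and 64-bit builds
	 */
	mr->guest_phys_addr = (uint64_t)(uintptr_t)seg->addr;
	mr->userspace_addr  = (uint64_t)(uintptr_t)seg->addr;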

> +static int
> +vhost_kernel_ioctl(struct virtio_user_dev *dev,
> +		   enum vhost_user_request req,
> +		   void *arg)
> +{
> +	int i, ret = -1;
> +	uint64_t req_kernel;
> +	struct vhost_memory_kernel *vm = NULL;
> +
> +	req_kernel = vhost_req_user_to_kernel[req];
> +
> +	if (req_kernel == VHOST_SET_MEM_TABLE) {
> +		vm = prepare_vhost_memory_kernel();
> +		if (!vm)
> +			return -1;
> +		arg = (void *)vm;
> +	}
> +
> +	/* Does not work when VIRTIO_F_IOMMU_PLATFORM now, why? */

Because this feature needs the vhost IOTLB support from the device
emulation. The patches for QEMU haven't been merged yet, but they have
been around for a while.

Since we don't have the support yet, for sure it won't work. But
I'm wondering why you have to disable it explicitly?

> +	if (req_kernel == VHOST_SET_FEATURES)
> +		*(uint64_t *)arg &= ~(1ULL << VIRTIO_F_IOMMU_PLATFORM);
> +
> +	for (i = 0; i < VHOST_KERNEL_MAX_QUEUES; ++i) {
> +		if (dev->vhostfds[i] < 0)
> +			continue;
> +
> +		ret = ioctl(dev->vhostfds[i], req_kernel, arg);
> +		if (ret < 0)
> +			break;
> +	}
> +
> +	if (vm)
> +		free(vm);
> +
> +	return ret;
> +}
> +
> +/**
> + * Set up environment to talk with a vhost kernel backend.
> + *
> + * @return
> + *   - (-1) if fail to set up;
> + *   - (>=0) if successful.
> + */
> +static int
> +vhost_kernel_setup(struct virtio_user_dev *dev)
> +{
> +	int vhostfd;
> +	uint32_t q;

Why not simply use 'i'?

> +static int
> +vhost_kernel_enable_queue_pair(struct virtio_user_dev *dev,
> +			       uint16_t pair_idx,
> +			       int enable)
> +{
> +	unsigned int features;

Better to rename it to "tap_features" or something? When I first saw
such a name, I thought it was about vhost features.

	--yliu
  
Jianfeng Tan Jan. 4, 2017, 7:22 a.m. UTC | #2
Sorry, I forgot to reply to this comment.

On 12/26/2016 3:44 PM, Yuanhan Liu wrote:
> [...]
>> +
>> +	/* Does not work when VIRTIO_F_IOMMU_PLATFORM now, why? */
> Because this feature need the vhost IOTLB support from the device
> emulation. Patches for QEMU hasn't been merged yet, but it has been
> there for a while.

Yes. And it's in need of help from QEMU.

>
> Since we don't have the support yet, for sure it won't work. But
> I'm wondering why you have to disable it explicitly?

Here we do not have QEMU. The frontend driver talks with vhost-net through 
virtio_user_dev, and both ends claim to support VIRTIO_F_IOMMU_PLATFORM, 
so this feature bit will be negotiated if we don't explicitly disable 
it. In my previous test, it failed in vhost_init_device_iotlb() of the 
vhost kernel module. My guess is that, for this feature, without the help 
of QEMU, the vhost kernel module cannot work independently.

Thanks,
Jianfeng
  
Yuanhan Liu Jan. 4, 2017, 7:46 a.m. UTC | #3
On Wed, Jan 04, 2017 at 03:22:08PM +0800, Tan, Jianfeng wrote:
> 
> Sorry, I forget to reply this comment.
> 
> On 12/26/2016 3:44 PM, Yuanhan Liu wrote:
> >[...]
> >>+
> >>+	/* Does not work when VIRTIO_F_IOMMU_PLATFORM now, why? */
> >Because this feature need the vhost IOTLB support from the device
> >emulation. Patches for QEMU hasn't been merged yet, but it has been
> >there for a while.
> 
> Yes. And it's in need of help from QEMU.
> 
> >
> >Since we don't have the support yet, for sure it won't work. But
> >I'm wondering why you have to disable it explicitly?
> 
> Here we do not have QEMU. Frontend driver talks with vhost-net through
> virtio_user_dev. And both ends claim to support VIRTIO_F_IOMMU_PLATFORM. So
> this feature bit will be negotiated if we don't explicitly disable it. In my
> previous test, it fails in vhost_init_device_iotlb() of vhost kernel module.
> My guess is that, for this feature, without the help of QEMU, vhost kernel
> module cannot work independently.

Negotiation is a joint effort between the driver, the device and the vhost
backend. If the device doesn't claim to support it, how could it be
negotiated?

Think of it this way: does an old QEMU (that doesn't support this feature) disable this
feature explicitly? No. Then why do we have to do that?

	--yliu
  
Jason Wang Jan. 9, 2017, 4:39 a.m. UTC | #4
On 2016年12月23日 15:14, Jianfeng Tan wrote:
> This patch add support vhost kernel as the backend for virtio_user.
> Three main hook functions are added:
>    - vhost_kernel_setup() to open char device, each vq pair needs one
>      vhostfd;
>    - vhost_kernel_ioctl() to communicate control messages with vhost
>      kernel module;
>    - vhost_kernel_enable_queue_pair() to open tap device and set it
>      as the backend of corresonding vhost fd (that is to say, vq pair).
>
> Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
> ---
>   drivers/net/virtio/Makefile                      |   1 +
>   drivers/net/virtio/virtio_user/vhost.h           |   2 +
>   drivers/net/virtio/virtio_user/vhost_kernel.c    | 364 +++++++++++++++++++++++
>   drivers/net/virtio/virtio_user/virtio_user_dev.c |  21 +-
>   drivers/net/virtio/virtio_user/virtio_user_dev.h |   4 +
>   5 files changed, 388 insertions(+), 4 deletions(-)
>   create mode 100644 drivers/net/virtio/virtio_user/vhost_kernel.c
>
> diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
> index 97972a6..faeffb2 100644
> --- a/drivers/net/virtio/Makefile
> +++ b/drivers/net/virtio/Makefile
> @@ -60,6 +60,7 @@ endif
>   
>   ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
>   SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
> +SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
>   SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/virtio_user_dev.c
>   SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user_ethdev.c
>   endif
> diff --git a/drivers/net/virtio/virtio_user/vhost.h b/drivers/net/virtio/virtio_user/vhost.h
> index bd67133..ffab13a 100644
> --- a/drivers/net/virtio/virtio_user/vhost.h
> +++ b/drivers/net/virtio/virtio_user/vhost.h
> @@ -120,4 +120,6 @@ struct virtio_user_backend_ops {
>   };
>   
>   struct virtio_user_backend_ops ops_user;
> +struct virtio_user_backend_ops ops_kernel;
> +
>   #endif
> diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c
> new file mode 100644
> index 0000000..8984c5c
> --- /dev/null
> +++ b/drivers/net/virtio/virtio_user/vhost_kernel.c
> @@ -0,0 +1,364 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <unistd.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <sys/ioctl.h>
> +#include <net/if.h>
> +#include <string.h>
> +#include <errno.h>
> +
> +#include <rte_memory.h>
> +#include <rte_eal_memconfig.h>
> +
> +#include "vhost.h"
> +#include "virtio_user_dev.h"
> +
> +struct vhost_memory_kernel {
> +	uint32_t nregions;
> +	uint32_t padding;
> +	struct vhost_memory_region regions[0];
> +};
> +
> +/* vhost kernel ioctls */
> +#define VHOST_VIRTIO 0xAF
> +#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)
> +#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)
> +#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
> +#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
> +#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory_kernel)
> +#define VHOST_SET_LOG_BASE _IOW(VHOST_VIRTIO, 0x04, __u64)
> +#define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int)
> +#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
> +#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
> +#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
> +#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
> +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
> +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
> +#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
> +#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
> +
> +/* TUN ioctls */
> +#define TUNSETIFF     _IOW('T', 202, int)
> +#define TUNGETFEATURES _IOR('T', 207, unsigned int)
> +#define TUNSETOFFLOAD  _IOW('T', 208, unsigned int)
> +#define TUNGETIFF      _IOR('T', 210, unsigned int)
> +#define TUNSETSNDBUF   _IOW('T', 212, int)
> +#define TUNGETVNETHDRSZ _IOR('T', 215, int)
> +#define TUNSETVNETHDRSZ _IOW('T', 216, int)
> +#define TUNSETQUEUE  _IOW('T', 217, int)
> +#define TUNSETVNETLE _IOW('T', 220, int)
> +#define TUNSETVNETBE _IOW('T', 222, int)
> +
> +/* TUNSETIFF ifr flags */
> +#define IFF_TAP          0x0002
> +#define IFF_NO_PI        0x1000
> +#define IFF_ONE_QUEUE    0x2000
> +#define IFF_VNET_HDR     0x4000
> +#define IFF_MULTI_QUEUE  0x0100
> +#define IFF_ATTACH_QUEUE 0x0200
> +#define IFF_DETACH_QUEUE 0x0400

Do we really want to duplicate those things which have been exposed by 
uapi here?

> +
> +/* Constants */
> +#define TUN_DEF_SNDBUF	(1ull << 20)
> +#define PATH_NET_TUN	"/dev/net/tun"
> +#define VHOST_KERNEL_MAX_REGIONS	64

Unfortunately not a constant any more since c9ce42f72fd0 ("vhost: add 
max_mem_regions module parameter").
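
If it helps, the limit could also be probed at runtime instead of being
hard-coded; a rough sketch (helper name is mine), assuming the module
parameter is exported read-only under /sys/module/vhost/parameters/:

	#include <stdio.h>

	/* fall back to 64 if the sysfs entry is absent (older kernels) */
	static unsigned int
	get_vhost_max_regions(void)
	{
		unsigned int max_regions = 64;
		FILE *f;

		f = fopen("/sys/module/vhost/parameters/max_mem_regions", "r");
		if (f) {
			if (fscanf(f, "%u", &max_regions) != 1)
				max_regions = 64;
			fclose(f);
		}
		return max_regions;
	}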

> +
> +static uint64_t vhost_req_user_to_kernel[] = {
> +	[VHOST_USER_SET_OWNER] = VHOST_SET_OWNER,
> +	[VHOST_USER_RESET_OWNER] = VHOST_RESET_OWNER,
> +	[VHOST_USER_SET_FEATURES] = VHOST_SET_FEATURES,
> +	[VHOST_USER_GET_FEATURES] = VHOST_GET_FEATURES,
> +	[VHOST_USER_SET_VRING_CALL] = VHOST_SET_VRING_CALL,
> +	[VHOST_USER_SET_VRING_NUM] = VHOST_SET_VRING_NUM,
> +	[VHOST_USER_SET_VRING_BASE] = VHOST_SET_VRING_BASE,
> +	[VHOST_USER_GET_VRING_BASE] = VHOST_GET_VRING_BASE,
> +	[VHOST_USER_SET_VRING_ADDR] = VHOST_SET_VRING_ADDR,
> +	[VHOST_USER_SET_VRING_KICK] = VHOST_SET_VRING_KICK,
> +	[VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE,
> +};
> +
> +/* By default, vhost kernel module allows 64 regions, but DPDK allows
> + * 256 segments. As a relief, below function merges those virtually
> + * adjacent memsegs into one region.
> + */
> +static struct vhost_memory_kernel *
> +prepare_vhost_memory_kernel(void)
> +{
> +	uint32_t i, j, k = 0;
> +	struct rte_memseg *seg;
> +	struct vhost_memory_region *mr;
> +	struct vhost_memory_kernel *vm;
> +
> +	vm = malloc(sizeof(struct vhost_memory_kernel) +
> +		    VHOST_KERNEL_MAX_REGIONS *
> +		    sizeof(struct vhost_memory_region));
> +
> +	for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
> +		seg = &rte_eal_get_configuration()->mem_config->memseg[i];
> +		if (!seg->addr)
> +			break;

If we're sure the number of regions is less than 64 (or the module 
parameter read from /sys), can we avoid the iteration here?

> +
> +		int new_region = 1;
> +
> +		for (j = 0; j < k; ++j) {
> +			mr = &vm->regions[j];
> +
> +			if (mr->userspace_addr + mr->memory_size ==
> +			    (uint64_t)seg->addr) {
> +				mr->memory_size += seg->len;
> +				new_region = 0;
> +				break;
> +			}
> +
> +			if ((uint64_t)seg->addr + seg->len ==
> +			    mr->userspace_addr) {
> +				mr->guest_phys_addr = (uint64_t)seg->addr;
> +				mr->userspace_addr = (uint64_t)seg->addr;
> +				mr->memory_size += seg->len;
> +				new_region = 0;
> +				break;
> +			}
> +		}
> +
> +		if (new_region == 0)
> +			continue;
> +
> +		mr = &vm->regions[k++];
> +		mr->guest_phys_addr = (uint64_t)seg->addr; /* use vaddr here! */
> +		mr->userspace_addr = (uint64_t)seg->addr;
> +		mr->memory_size = seg->len;
> +		mr->mmap_offset = 0;
> +
> +		if (k >= VHOST_KERNEL_MAX_REGIONS) {
> +			free(vm);
> +			return NULL;
> +		}
> +	}
> +
> +	vm->nregions = k;
> +	vm->padding = 0;
> +	return vm;
> +}
> +
> +static int
> +vhost_kernel_ioctl(struct virtio_user_dev *dev,
> +		   enum vhost_user_request req,
> +		   void *arg)
> +{
> +	int i, ret = -1;
> +	uint64_t req_kernel;
> +	struct vhost_memory_kernel *vm = NULL;
> +
> +	req_kernel = vhost_req_user_to_kernel[req];
> +
> +	if (req_kernel == VHOST_SET_MEM_TABLE) {
> +		vm = prepare_vhost_memory_kernel();
> +		if (!vm)
> +			return -1;
> +		arg = (void *)vm;
> +	}
> +
> +	/* Does not work when VIRTIO_F_IOMMU_PLATFORM now, why? */

I think the reason is that when VIRTIO_F_IOMMU_PLATFORM is negotiated, all 
addresses should be iova instead of gpa.

> +	if (req_kernel == VHOST_SET_FEATURES)
> +		*(uint64_t *)arg &= ~(1ULL << VIRTIO_F_IOMMU_PLATFORM);
> +
> +	for (i = 0; i < VHOST_KERNEL_MAX_QUEUES; ++i) {
> +		if (dev->vhostfds[i] < 0)
> +			continue;
> +
> +		ret = ioctl(dev->vhostfds[i], req_kernel, arg);
> +		if (ret < 0)
> +			break;
> +	}
> +
> +	if (vm)
> +		free(vm);
> +
> +	return ret;
> +}
> +
> +/**
> + * Set up environment to talk with a vhost kernel backend.
> + *
> + * @return
> + *   - (-1) if fail to set up;
> + *   - (>=0) if successful.
> + */
> +static int
> +vhost_kernel_setup(struct virtio_user_dev *dev)
> +{
> +	int vhostfd;
> +	uint32_t q;
> +
> +	for (q = 0; q < dev->max_queue_pairs; ++q) {
> +		vhostfd = open(dev->path, O_RDWR);
> +		if (vhostfd < 0) {
> +			PMD_DRV_LOG(ERR, "fail to open %s, %s",
> +				    dev->path, strerror(errno));
> +			return -1;
> +		}
> +
> +		dev->vhostfds[q] = vhostfd;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +vhost_kernel_set_backend(int vhostfd, int tapfd)
> +{
> +	struct vhost_vring_file f;
> +
> +	f.fd = tapfd;
> +	f.index = 0;
> +	if (ioctl(vhostfd, VHOST_NET_SET_BACKEND, &f) < 0) {
> +		PMD_DRV_LOG(ERR, "VHOST_NET_SET_BACKEND fails, %s",
> +				strerror(errno));
> +		return -1;
> +	}
> +
> +	f.index = 1;
> +	if (ioctl(vhostfd, VHOST_NET_SET_BACKEND, &f) < 0) {
> +		PMD_DRV_LOG(ERR, "VHOST_NET_SET_BACKEND fails, %s",
> +				strerror(errno));
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +vhost_kernel_enable_queue_pair(struct virtio_user_dev *dev,
> +			       uint16_t pair_idx,
> +			       int enable)
> +{
> +	unsigned int features;
> +	int sndbuf = TUN_DEF_SNDBUF;
> +	struct ifreq ifr;
> +	int hdr_size;
> +	int vhostfd;
> +	int tapfd;
> +
> +	vhostfd = dev->vhostfds[pair_idx];
> +
> +	if (!enable) {
> +		if (dev->tapfds[pair_idx]) {
> +			close(dev->tapfds[pair_idx]);
> +			dev->tapfds[pair_idx] = -1;
> +		}
> +		return vhost_kernel_set_backend(vhostfd, -1);

If this is used for things like ethtool -L in the guest, we should use 
TUNSETQUEUE here.

> +	} else if (dev->tapfds[pair_idx] >= 0) {
> +		return 0;
> +	}
> +
> +	if ((dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF)) ||
> +	    (dev->features & (1ULL << VIRTIO_F_VERSION_1)))
> +		hdr_size = sizeof(struct virtio_net_hdr_mrg_rxbuf);
> +	else
> +		hdr_size = sizeof(struct virtio_net_hdr);
> +
> +	/* TODO:
> +	 * 1. verify we can get/set vnet_hdr_len, tap_probe_vnet_hdr_len
> +	 * 2. get number of memory regions from vhost module parameter
> +	 * max_mem_regions, supported in newer version linux kernel
> +	 */
> +	tapfd = open(PATH_NET_TUN, O_RDWR);
> +	if (tapfd < 0) {
> +		PMD_DRV_LOG(ERR, "fail to open %s: %s",
> +			    PATH_NET_TUN, strerror(errno));
> +		return -1;
> +	}
> +
> +	/* Construct ifr */
> +	memset(&ifr, 0, sizeof(ifr));
> +	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
> +
> +	if (ioctl(tapfd, TUNGETFEATURES, &features) == -1) {
> +		PMD_DRV_LOG(ERR, "TUNGETFEATURES failed: %s", strerror(errno));
> +		goto error;
> +	}
> +	if (features & IFF_ONE_QUEUE)
> +		ifr.ifr_flags |= IFF_ONE_QUEUE;
> +
> +	/* Let tap instead of vhost-net handle vnet header, as the latter does
> +	 * not support offloading. And in this case, we should not set feature
> +	 * bit VHOST_NET_F_VIRTIO_NET_HDR.
> +	 */
> +	if (features & IFF_VNET_HDR) {
> +		ifr.ifr_flags |= IFF_VNET_HDR;
> +	} else {
> +		PMD_DRV_LOG(ERR, "TAP does not support IFF_VNET_HDR");
> +		goto error;
> +	}
> +
> +	if (dev->ifname)
> +		strncpy(ifr.ifr_name, dev->ifname, IFNAMSIZ);
> +	else
> +		strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ);
> +	if (ioctl(tapfd, TUNSETIFF, (void *)&ifr) == -1) {
> +		PMD_DRV_LOG(ERR, "TUNSETIFF failed: %s", strerror(errno));
> +		goto error;
> +	}

This requires CAP_NET_ADMIN, so we should really consider accepting a 
pre-created fd here.

> +
> +	fcntl(tapfd, F_SETFL, O_NONBLOCK);
> +
> +	if (ioctl(tapfd, TUNSETVNETHDRSZ, &hdr_size) < 0) {
> +		PMD_DRV_LOG(ERR, "TUNSETVNETHDRSZ failed: %s", strerror(errno));
> +		goto error;
> +	}
> +
> +	if (ioctl(tapfd, TUNSETSNDBUF, &sndbuf) < 0) {
> +		PMD_DRV_LOG(ERR, "TUNSETSNDBUF failed: %s", strerror(errno));
> +		goto error;
> +	}

Let's use INT_MAX as the default here to survive an evil consumer.
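
I.e. something like this (sketch; INT_MAX from <limits.h>):

	/* INT_MAX effectively means no sndbuf-based throttling */
	int sndbuf = INT_MAX;

	if (ioctl(tapfd, TUNSETSNDBUF, &sndbuf) < 0) {
		PMD_DRV_LOG(ERR, "TUNSETSNDBUF failed: %s", strerror(errno));
		goto error;
	}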

> +
> +	if (vhost_kernel_set_backend(vhostfd, tapfd) < 0)
> +		goto error;
> +
> +	dev->tapfds[pair_idx] = tapfd;
> +	if (!dev->ifname)
> +		dev->ifname = strdup(ifr.ifr_name);
> +
> +	return 0;
> +error:
> +	return -1;
> +}
> +
> +struct virtio_user_backend_ops ops_kernel = {
> +	.setup = vhost_kernel_setup,
> +	.send_request = vhost_kernel_ioctl,
> +	.enable_qp = vhost_kernel_enable_queue_pair
> +};
> diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.c b/drivers/net/virtio/virtio_user/virtio_user_dev.c
> index a818c29..c718b85 100644
> --- a/drivers/net/virtio/virtio_user/virtio_user_dev.c
> +++ b/drivers/net/virtio/virtio_user/virtio_user_dev.c
> @@ -219,7 +219,7 @@ is_vhost_user_by_type(const char *path)
>   static int
>   virtio_user_dev_setup(struct virtio_user_dev *dev)
>   {
> -	uint32_t i;
> +	uint32_t i, q;
>   
>   	dev->vhostfd = -1;
>   	for (i = 0; i < VIRTIO_MAX_VIRTQUEUES * 2 + 1; ++i) {
> @@ -227,12 +227,18 @@ virtio_user_dev_setup(struct virtio_user_dev *dev)
>   		dev->callfds[i] = -1;
>   	}
>   
> +	for (q = 0; q < VHOST_KERNEL_MAX_QUEUES; ++q) {
> +		dev->vhostfds[q] = -1;
> +		dev->tapfds[q] = -1;
> +	}
> +
>   	if (is_vhost_user_by_type(dev->path)) {
>   		dev->ops = &ops_user;
> -		return dev->ops->setup(dev);
> +	} else {
> +		dev->ops = &ops_kernel;
>   	}
>   
> -	return -1;
> +	return dev->ops->setup(dev);
>   }
>   
>   int
> @@ -284,7 +290,9 @@ virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues,
>   void
>   virtio_user_dev_uninit(struct virtio_user_dev *dev)
>   {
> -	uint32_t i;
> +	uint32_t i, q;
> +
> +	dev->ops->send_request(dev, VHOST_USER_RESET_OWNER, NULL);
>   
>   	for (i = 0; i < dev->max_queue_pairs * 2; ++i) {
>   		close(dev->callfds[i]);
> @@ -292,6 +300,11 @@ virtio_user_dev_uninit(struct virtio_user_dev *dev)
>   	}
>   
>   	close(dev->vhostfd);
> +
> +	for (q = 0; q < VHOST_KERNEL_MAX_QUEUES; ++q) {
> +		close(dev->vhostfds[q]);
> +		close(dev->tapfds[q]);
> +	}
>   }
>   
>   static uint8_t
> diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.h b/drivers/net/virtio/virtio_user/virtio_user_dev.h
> index 503a496..148b2e6 100644
> --- a/drivers/net/virtio/virtio_user/virtio_user_dev.h
> +++ b/drivers/net/virtio/virtio_user/virtio_user_dev.h
> @@ -44,6 +44,10 @@ struct virtio_user_dev {
>   	int		vhostfd;
>   
>   	/* for vhost_kernel backend */
> +	char		*ifname;
> +#define VHOST_KERNEL_MAX_QUEUES		8
> +	int		vhostfds[VHOST_KERNEL_MAX_QUEUES];
> +	int		tapfds[VHOST_KERNEL_MAX_QUEUES];
>   
>   	/* for both vhost_user and vhost_kernel */
>   	int		callfds[VIRTIO_MAX_VIRTQUEUES * 2 + 1];
  
Jason Wang Jan. 9, 2017, 4:51 a.m. UTC | #5
On 2017年01月04日 15:22, Tan, Jianfeng wrote:
>
> Sorry, I forget to reply this comment.
>
> On 12/26/2016 3:44 PM, Yuanhan Liu wrote:
>> [...]
>>> +
>>> +    /* Does not work when VIRTIO_F_IOMMU_PLATFORM now, why? */
>> Because this feature need the vhost IOTLB support from the device
>> emulation. Patches for QEMU hasn't been merged yet, but it has been
>> there for a while.
>
> Yes. And it's in need of help from QEMU.
>
>>
>> Since we don't have the support yet, for sure it won't work. But
>> I'm wondering why you have to disable it explicitly?
>
> Here we do not have QEMU. Frontend driver talks with vhost-net through 
> virtio_user_dev. And both ends claim to support 
> VIRTIO_F_IOMMU_PLATFORM. So this feature bit will be negotiated if we 
> don't explicitly disable it. In my previous test, it fails in 
> vhost_init_device_iotlb() of vhost kernel module.

Interesting, vhost_init_device_iotlb() only fails on OOM. Did you hit that?


> My guess is that, for this feature, without the help of QEMU, vhost 
> kernel module cannot work independently.

Technically it can, if your userspace supports the device IOTLB APIs too.

Or if you're using static mappings and preset all mappings through 
VHOST_IOTLB_UPDATE to make sure there are no IOTLB misses.

Thanks

>
> Thanks,
> Jianfeng
  
Jianfeng Tan Jan. 10, 2017, 6:11 a.m. UTC | #6
Hi Jason,


On 1/9/2017 12:39 PM, Jason Wang wrote:
>
>
> On 2016年12月23日 15:14, Jianfeng Tan wrote:
>> This patch add support vhost kernel as the backend for virtio_user.
>> Three main hook functions are added:
>>    - vhost_kernel_setup() to open char device, each vq pair needs one
>>      vhostfd;
>>    - vhost_kernel_ioctl() to communicate control messages with vhost
>>      kernel module;
>>    - vhost_kernel_enable_queue_pair() to open tap device and set it
>>      as the backend of corresonding vhost fd (that is to say, vq pair).
>>
>> Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
>> ---
>>
[...]
>> +/* TUNSETIFF ifr flags */
>> +#define IFF_TAP          0x0002
>> +#define IFF_NO_PI        0x1000
>> +#define IFF_ONE_QUEUE    0x2000
>> +#define IFF_VNET_HDR     0x4000
>> +#define IFF_MULTI_QUEUE  0x0100
>> +#define IFF_ATTACH_QUEUE 0x0200
>> +#define IFF_DETACH_QUEUE 0x0400
>
> Do we really want to duplicate those things which has been exposed by 
> uapi here?

You mean those defined in <linux/if_tun.h>? We can either redefine those 
common macros or include the standard header file, each with its own pros 
and cons. DPDK prefers the redefinition way as far as I understand, doesn't it?

>
>> +
>> +/* Constants */
>> +#define TUN_DEF_SNDBUF    (1ull << 20)
>> +#define PATH_NET_TUN    "/dev/net/tun"
>> +#define VHOST_KERNEL_MAX_REGIONS    64
>
> Unfortunate not a constant any more since c9ce42f72fd0 vhost: add 
> max_mem_regions module parameter.

Yes, I was considering ignoring this in the initial release. But it's 
not a big effort; I'll try to fix it in the next version.

>
>> +
>> +static uint64_t vhost_req_user_to_kernel[] = {
>> +    [VHOST_USER_SET_OWNER] = VHOST_SET_OWNER,
>> +    [VHOST_USER_RESET_OWNER] = VHOST_RESET_OWNER,
>> +    [VHOST_USER_SET_FEATURES] = VHOST_SET_FEATURES,
>> +    [VHOST_USER_GET_FEATURES] = VHOST_GET_FEATURES,
>> +    [VHOST_USER_SET_VRING_CALL] = VHOST_SET_VRING_CALL,
>> +    [VHOST_USER_SET_VRING_NUM] = VHOST_SET_VRING_NUM,
>> +    [VHOST_USER_SET_VRING_BASE] = VHOST_SET_VRING_BASE,
>> +    [VHOST_USER_GET_VRING_BASE] = VHOST_GET_VRING_BASE,
>> +    [VHOST_USER_SET_VRING_ADDR] = VHOST_SET_VRING_ADDR,
>> +    [VHOST_USER_SET_VRING_KICK] = VHOST_SET_VRING_KICK,
>> +    [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE,
>> +};
>> +
>> +/* By default, vhost kernel module allows 64 regions, but DPDK allows
>> + * 256 segments. As a relief, below function merges those virtually
>> + * adjacent memsegs into one region.
>> + */
>> +static struct vhost_memory_kernel *
>> +prepare_vhost_memory_kernel(void)
>> +{
>> +    uint32_t i, j, k = 0;
>> +    struct rte_memseg *seg;
>> +    struct vhost_memory_region *mr;
>> +    struct vhost_memory_kernel *vm;
>> +
>> +    vm = malloc(sizeof(struct vhost_memory_kernel) +
>> +            VHOST_KERNEL_MAX_REGIONS *
>> +            sizeof(struct vhost_memory_region));
>> +
>> +    for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
>> +        seg = &rte_eal_get_configuration()->mem_config->memseg[i];
>> +        if (!seg->addr)
>> +            break;
>
> If we're sure the number of regions is less than 64(or the module 
> parameter read from /sys), can we avoid the iteration here?

The "if" statement under the "for" statement can save us from all 
RTE_MAX_MEMSEG iteration.

>
>> +
>> +        int new_region = 1;
>> +
>> +        for (j = 0; j < k; ++j) {
>> +            mr = &vm->regions[j];
>> +
>> +            if (mr->userspace_addr + mr->memory_size ==
>> +                (uint64_t)seg->addr) {
>> +                mr->memory_size += seg->len;
>> +                new_region = 0;
>> +                break;
>> +            }
>> +
>> +            if ((uint64_t)seg->addr + seg->len ==
>> +                mr->userspace_addr) {
>> +                mr->guest_phys_addr = (uint64_t)seg->addr;
>> +                mr->userspace_addr = (uint64_t)seg->addr;
>> +                mr->memory_size += seg->len;
>> +                new_region = 0;
>> +                break;
>> +            }
>> +        }
>> +
>> +        if (new_region == 0)
>> +            continue;
>> +
>> +        mr = &vm->regions[k++];
>> +        mr->guest_phys_addr = (uint64_t)seg->addr; /* use vaddr 
>> here! */
>> +        mr->userspace_addr = (uint64_t)seg->addr;
>> +        mr->memory_size = seg->len;
>> +        mr->mmap_offset = 0;
>> +
>> +        if (k >= VHOST_KERNEL_MAX_REGIONS) {
>> +            free(vm);
>> +            return NULL;
>> +        }
>> +    }
>> +
>> +    vm->nregions = k;
>> +    vm->padding = 0;
>> +    return vm;
>> +}
>> +
>> +static int
>> +vhost_kernel_ioctl(struct virtio_user_dev *dev,
>> +           enum vhost_user_request req,
>> +           void *arg)
>> +{
>> +    int i, ret = -1;
>> +    uint64_t req_kernel;
>> +    struct vhost_memory_kernel *vm = NULL;
>> +
>> +    req_kernel = vhost_req_user_to_kernel[req];
>> +
>> +    if (req_kernel == VHOST_SET_MEM_TABLE) {
>> +        vm = prepare_vhost_memory_kernel();
>> +        if (!vm)
>> +            return -1;
>> +        arg = (void *)vm;
>> +    }
>> +
>> +    /* Does not work when VIRTIO_F_IOMMU_PLATFORM now, why? */
>
> I think the reason is when VIRTIO_F_IOMMU_PLATFORM is negotiated, all 
> address should be iova instead of gpa.
>

Yes, I agree. As we don't have to do memory protection within a single 
process, it's completely useless here, right?

>> +    if (req_kernel == VHOST_SET_FEATURES)
>> +        *(uint64_t *)arg &= ~(1ULL << VIRTIO_F_IOMMU_PLATFORM);
>> +
>> +    for (i = 0; i < VHOST_KERNEL_MAX_QUEUES; ++i) {
>> +        if (dev->vhostfds[i] < 0)
>> +            continue;
>> +
[...]
>> +    if (!enable) {
>> +        if (dev->tapfds[pair_idx]) {
>> +            close(dev->tapfds[pair_idx]);
>> +            dev->tapfds[pair_idx] = -1;
>> +        }
>> +        return vhost_kernel_set_backend(vhostfd, -1);
>
> If this is used to for thing like ethtool -L in guest, we should use 
> TUNSETQUEUE here.

Oops, I was missing this ioctl operation. Let me fix it in the next version.

>
>> +    } else if (dev->tapfds[pair_idx] >= 0) {
>> +        return 0;
>> +    }
>> +
>> +    if ((dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF)) ||
>> +        (dev->features & (1ULL << VIRTIO_F_VERSION_1)))
>> +        hdr_size = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>> +    else
>> +        hdr_size = sizeof(struct virtio_net_hdr);
>> +
>> +    /* TODO:
>> +     * 1. verify we can get/set vnet_hdr_len, tap_probe_vnet_hdr_len
>> +     * 2. get number of memory regions from vhost module parameter
>> +     * max_mem_regions, supported in newer version linux kernel
>> +     */
>> +    tapfd = open(PATH_NET_TUN, O_RDWR);
>> +    if (tapfd < 0) {
>> +        PMD_DRV_LOG(ERR, "fail to open %s: %s",
>> +                PATH_NET_TUN, strerror(errno));
>> +        return -1;
>> +    }
>> +
>> +    /* Construct ifr */
>> +    memset(&ifr, 0, sizeof(ifr));
>> +    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
>> +
>> +    if (ioctl(tapfd, TUNGETFEATURES, &features) == -1) {
>> +        PMD_DRV_LOG(ERR, "TUNGETFEATURES failed: %s", strerror(errno));
>> +        goto error;
>> +    }
>> +    if (features & IFF_ONE_QUEUE)
>> +        ifr.ifr_flags |= IFF_ONE_QUEUE;
>> +
>> +    /* Let tap instead of vhost-net handle vnet header, as the 
>> latter does
>> +     * not support offloading. And in this case, we should not set 
>> feature
>> +     * bit VHOST_NET_F_VIRTIO_NET_HDR.
>> +     */
>> +    if (features & IFF_VNET_HDR) {
>> +        ifr.ifr_flags |= IFF_VNET_HDR;
>> +    } else {
>> +        PMD_DRV_LOG(ERR, "TAP does not support IFF_VNET_HDR");
>> +        goto error;
>> +    }
>> +
>> +    if (dev->ifname)
>> +        strncpy(ifr.ifr_name, dev->ifname, IFNAMSIZ);
>> +    else
>> +        strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ);
>> +    if (ioctl(tapfd, TUNSETIFF, (void *)&ifr) == -1) {
>> +        PMD_DRV_LOG(ERR, "TUNSETIFF failed: %s", strerror(errno));
>> +        goto error;
>> +    }
>
> This requires CAP_NET_ADMIN, so we should really consider to accept a 
> pre-created fd here.

It sounds like future work to me. So far, all DPDK apps are running 
in privileged mode (including CAP_NET_ADMIN?).

>
>> +
>> +    fcntl(tapfd, F_SETFL, O_NONBLOCK);
>> +
>> +    if (ioctl(tapfd, TUNSETVNETHDRSZ, &hdr_size) < 0) {
>> +        PMD_DRV_LOG(ERR, "TUNSETVNETHDRSZ failed: %s", 
>> strerror(errno));
>> +        goto error;
>> +    }
>> +
>> +    if (ioctl(tapfd, TUNSETSNDBUF, &sndbuf) < 0) {
>> +        PMD_DRV_LOG(ERR, "TUNSETSNDBUF failed: %s", strerror(errno));
>> +        goto error;
>> +    }
>
> Let's use INT_MAX as default here to survive from evil consumer here.

Oh yes, I will fix it.

Thanks,
Jianfeng
  
Thomas Monjalon Jan. 10, 2017, 8:46 a.m. UTC | #7
2017-01-10 14:11, Tan, Jianfeng:
> On 1/9/2017 12:39 PM, Jason Wang wrote:
> > On 2016年12月23日 15:14, Jianfeng Tan wrote:
> [...]
> >> +/* TUNSETIFF ifr flags */
> >> +#define IFF_TAP          0x0002
> >> +#define IFF_NO_PI        0x1000
> >> +#define IFF_ONE_QUEUE    0x2000
> >> +#define IFF_VNET_HDR     0x4000
> >> +#define IFF_MULTI_QUEUE  0x0100
> >> +#define IFF_ATTACH_QUEUE 0x0200
> >> +#define IFF_DETACH_QUEUE 0x0400
> >
> > Do we really want to duplicate those things which has been exposed by 
> > uapi here?
> 
> You mean those defined in <linux/if_tun.h>? Redefine those common 
> macros, or include standard header file, with respective pros and cons. 
> DPDK prefers the redefinition way as far as I understand, doesn't it?

What is the benefit of copying defines instead of including?
  
Jianfeng Tan Jan. 10, 2017, 9:11 a.m. UTC | #8
On 1/10/2017 4:46 PM, Thomas Monjalon wrote:
> 2017-01-10 14:11, Tan, Jianfeng:
>> On 1/9/2017 12:39 PM, Jason Wang wrote:
>>> On 2016年12月23日 15:14, Jianfeng Tan wrote:
>> [...]
>>>> +/* TUNSETIFF ifr flags */
>>>> +#define IFF_TAP          0x0002
>>>> +#define IFF_NO_PI        0x1000
>>>> +#define IFF_ONE_QUEUE    0x2000
>>>> +#define IFF_VNET_HDR     0x4000
>>>> +#define IFF_MULTI_QUEUE  0x0100
>>>> +#define IFF_ATTACH_QUEUE 0x0200
>>>> +#define IFF_DETACH_QUEUE 0x0400
>>> Do we really want to duplicate those things which has been exposed by
>>> uapi here?
>> You mean those defined in <linux/if_tun.h>? Redefine those common
>> macros, or include standard header file, with respective pros and cons.
>> DPDK prefers the redefinition way as far as I understand, doesn't it?
> What is the benefit of copying defines instead of including?

(1) It avoids compile errors when an older header version misses some of the macros.
(2) It is better for portability (does not apply here).

I might take it for granted a little bit, as I see some very popular open 
source projects behave this way, like QEMU, HAProxy, etc.
  
Jianfeng Tan Jan. 11, 2017, 2:30 a.m. UTC | #9
Hi Jason,


On 1/9/2017 12:39 PM, Jason Wang wrote:
>> +    if (!enable) {
>> +        if (dev->tapfds[pair_idx]) {
>> +            close(dev->tapfds[pair_idx]);
>> +            dev->tapfds[pair_idx] = -1;
>> +        }
>> +        return vhost_kernel_set_backend(vhostfd, -1);
>
> If this is used to for thing like ethtool -L in guest, we should use 
> TUNSETQUEUE here.

To make it clear, why do we need to ioctl(..., TUNSETQUEUE, ...) here? 
According to Linux/Documentation/networking/tuntap.txt,
     "A new ioctl(TUNSETQUEUE) were introduced to enable or disable a 
queue. When
     calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And 
when
     calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The 
queue were
     enabled by default after it was created through TUNSETIFF."

As it's enabled by default, do you still see the necessity to call it 
explicitly?

Thanks,
Jianfeng
  
Jason Wang Jan. 11, 2017, 2:42 a.m. UTC | #10
On 2017年01月10日 14:11, Tan, Jianfeng wrote:
>
> Hi Jason,
>
>
> On 1/9/2017 12:39 PM, Jason Wang wrote:
>>
>>
>> On 2016年12月23日 15:14, Jianfeng Tan wrote:
>>> This patch add support vhost kernel as the backend for virtio_user.
>>> Three main hook functions are added:
>>>    - vhost_kernel_setup() to open char device, each vq pair needs one
>>>      vhostfd;
>>>    - vhost_kernel_ioctl() to communicate control messages with vhost
>>>      kernel module;
>>>    - vhost_kernel_enable_queue_pair() to open tap device and set it
>>>      as the backend of corresonding vhost fd (that is to say, vq pair).
>>>
>>> Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
>>> ---
>>>
> [...]
>>> +/* TUNSETIFF ifr flags */
>>> +#define IFF_TAP          0x0002
>>> +#define IFF_NO_PI        0x1000
>>> +#define IFF_ONE_QUEUE    0x2000
>>> +#define IFF_VNET_HDR     0x4000
>>> +#define IFF_MULTI_QUEUE  0x0100
>>> +#define IFF_ATTACH_QUEUE 0x0200
>>> +#define IFF_DETACH_QUEUE 0x0400
>>
>> Do we really want to duplicate those things which has been exposed by 
>> uapi here?
>
> You mean those defined in <linux/if_tun.h>? Redefine those common 
> macros, or include standard header file, with respective pros and 
> cons. DPDK prefers the redefinition way as far as I understand, 
> doesn't it?
>

Well, if you really want to do this, you may want to use an independent 
file. Then you can sync it with linux headers with a bash script.

>>
>>> +
>>> +/* Constants */
>>> +#define TUN_DEF_SNDBUF    (1ull << 20)
>>> +#define PATH_NET_TUN    "/dev/net/tun"
>>> +#define VHOST_KERNEL_MAX_REGIONS    64
>>
>> Unfortunate not a constant any more since c9ce42f72fd0 vhost: add 
>> max_mem_regions module parameter.
>
> Yes, I was considering to ignore this in the initial release. But it's 
> not a big effort, I'll try to fix it in latest version.
>
>>
>>> +
>>> +static uint64_t vhost_req_user_to_kernel[] = {
>>> +    [VHOST_USER_SET_OWNER] = VHOST_SET_OWNER,
>>> +    [VHOST_USER_RESET_OWNER] = VHOST_RESET_OWNER,
>>> +    [VHOST_USER_SET_FEATURES] = VHOST_SET_FEATURES,
>>> +    [VHOST_USER_GET_FEATURES] = VHOST_GET_FEATURES,
>>> +    [VHOST_USER_SET_VRING_CALL] = VHOST_SET_VRING_CALL,
>>> +    [VHOST_USER_SET_VRING_NUM] = VHOST_SET_VRING_NUM,
>>> +    [VHOST_USER_SET_VRING_BASE] = VHOST_SET_VRING_BASE,
>>> +    [VHOST_USER_GET_VRING_BASE] = VHOST_GET_VRING_BASE,
>>> +    [VHOST_USER_SET_VRING_ADDR] = VHOST_SET_VRING_ADDR,
>>> +    [VHOST_USER_SET_VRING_KICK] = VHOST_SET_VRING_KICK,
>>> +    [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE,
>>> +};
>>> +
>>> +/* By default, vhost kernel module allows 64 regions, but DPDK allows
>>> + * 256 segments. As a relief, below function merges those virtually
>>> + * adjacent memsegs into one region.
>>> + */
>>> +static struct vhost_memory_kernel *
>>> +prepare_vhost_memory_kernel(void)
>>> +{
>>> +    uint32_t i, j, k = 0;
>>> +    struct rte_memseg *seg;
>>> +    struct vhost_memory_region *mr;
>>> +    struct vhost_memory_kernel *vm;
>>> +
>>> +    vm = malloc(sizeof(struct vhost_memory_kernel) +
>>> +            VHOST_KERNEL_MAX_REGIONS *
>>> +            sizeof(struct vhost_memory_region));
>>> +
>>> +    for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
>>> +        seg = &rte_eal_get_configuration()->mem_config->memseg[i];
>>> +        if (!seg->addr)
>>> +            break;
>>
>> If we're sure the number of regions is less than 64(or the module 
>> parameter read from /sys), can we avoid the iteration here?
>
> The "if" statement under the "for" statement can save us from all 
> RTE_MAX_MEMSEG iteration.

But if we know the number of regions is smaller than the limit, there's 
no need even for this?

>
>>
>>> +
>>> +        int new_region = 1;
>>> +
>>> +        for (j = 0; j < k; ++j) {
>>> +            mr = &vm->regions[j];
>>> +
>>> +            if (mr->userspace_addr + mr->memory_size ==
>>> +                (uint64_t)seg->addr) {
>>> +                mr->memory_size += seg->len;
>>> +                new_region = 0;
>>> +                break;
>>> +            }
>>> +
>>> +            if ((uint64_t)seg->addr + seg->len ==
>>> +                mr->userspace_addr) {
>>> +                mr->guest_phys_addr = (uint64_t)seg->addr;
>>> +                mr->userspace_addr = (uint64_t)seg->addr;
>>> +                mr->memory_size += seg->len;
>>> +                new_region = 0;
>>> +                break;
>>> +            }
>>> +        }
>>> +
>>> +        if (new_region == 0)
>>> +            continue;
>>> +
>>> +        mr = &vm->regions[k++];
>>> +        mr->guest_phys_addr = (uint64_t)seg->addr; /* use vaddr 
>>> here! */
>>> +        mr->userspace_addr = (uint64_t)seg->addr;
>>> +        mr->memory_size = seg->len;
>>> +        mr->mmap_offset = 0;
>>> +
>>> +        if (k >= VHOST_KERNEL_MAX_REGIONS) {
>>> +            free(vm);
>>> +            return NULL;
>>> +        }
>>> +    }
>>> +
>>> +    vm->nregions = k;
>>> +    vm->padding = 0;
>>> +    return vm;
>>> +}
>>> +
>>> +static int
>>> +vhost_kernel_ioctl(struct virtio_user_dev *dev,
>>> +           enum vhost_user_request req,
>>> +           void *arg)
>>> +{
>>> +    int i, ret = -1;
>>> +    uint64_t req_kernel;
>>> +    struct vhost_memory_kernel *vm = NULL;
>>> +
>>> +    req_kernel = vhost_req_user_to_kernel[req];
>>> +
>>> +    if (req_kernel == VHOST_SET_MEM_TABLE) {
>>> +        vm = prepare_vhost_memory_kernel();
>>> +        if (!vm)
>>> +            return -1;
>>> +        arg = (void *)vm;
>>> +    }
>>> +
>>> +    /* Does not work when VIRTIO_F_IOMMU_PLATFORM now, why? */
>>
>> I think the reason is when VIRTIO_F_IOMMU_PLATFORM is negotiated, all 
>> address should be iova instead of gpa.
>>
>
> Yes, I agree. As we don't have to do memory protection in a single 
> process, so it's completely useless here, right?

Yes, if there's no IOMMU concept in this case.

>
>>> +    if (req_kernel == VHOST_SET_FEATURES)
>>> +        *(uint64_t *)arg &= ~(1ULL << VIRTIO_F_IOMMU_PLATFORM);
>>> +
>>> +    for (i = 0; i < VHOST_KERNEL_MAX_QUEUES; ++i) {
>>> +        if (dev->vhostfds[i] < 0)
>>> +            continue;
>>> +
> [...]
>>> +    if (!enable) {
>>> +        if (dev->tapfds[pair_idx]) {
>>> +            close(dev->tapfds[pair_idx]);
>>> +            dev->tapfds[pair_idx] = -1;
>>> +        }
>>> +        return vhost_kernel_set_backend(vhostfd, -1);
>>
>> If this is used to for thing like ethtool -L in guest, we should use 
>> TUNSETQUEUE here.
>
> Oops, I was missing this ioctl operation. Let me fix it in next version.
>
>>
>>> +    } else if (dev->tapfds[pair_idx] >= 0) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    if ((dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF)) ||
>>> +        (dev->features & (1ULL << VIRTIO_F_VERSION_1)))
>>> +        hdr_size = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>>> +    else
>>> +        hdr_size = sizeof(struct virtio_net_hdr);
>>> +
>>> +    /* TODO:
>>> +     * 1. verify we can get/set vnet_hdr_len, tap_probe_vnet_hdr_len
>>> +     * 2. get number of memory regions from vhost module parameter
>>> +     * max_mem_regions, supported in newer version linux kernel
>>> +     */
>>> +    tapfd = open(PATH_NET_TUN, O_RDWR);
>>> +    if (tapfd < 0) {
>>> +        PMD_DRV_LOG(ERR, "fail to open %s: %s",
>>> +                PATH_NET_TUN, strerror(errno));
>>> +        return -1;
>>> +    }
>>> +
>>> +    /* Construct ifr */
>>> +    memset(&ifr, 0, sizeof(ifr));
>>> +    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
>>> +
>>> +    if (ioctl(tapfd, TUNGETFEATURES, &features) == -1) {
>>> +        PMD_DRV_LOG(ERR, "TUNGETFEATURES failed: %s", 
>>> strerror(errno));
>>> +        goto error;
>>> +    }
>>> +    if (features & IFF_ONE_QUEUE)
>>> +        ifr.ifr_flags |= IFF_ONE_QUEUE;
>>> +
>>> +    /* Let tap instead of vhost-net handle vnet header, as the 
>>> latter does
>>> +     * not support offloading. And in this case, we should not set 
>>> feature
>>> +     * bit VHOST_NET_F_VIRTIO_NET_HDR.
>>> +     */
>>> +    if (features & IFF_VNET_HDR) {
>>> +        ifr.ifr_flags |= IFF_VNET_HDR;
>>> +    } else {
>>> +        PMD_DRV_LOG(ERR, "TAP does not support IFF_VNET_HDR");
>>> +        goto error;
>>> +    }
>>> +
>>> +    if (dev->ifname)
>>> +        strncpy(ifr.ifr_name, dev->ifname, IFNAMSIZ);
>>> +    else
>>> +        strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ);
>>> +    if (ioctl(tapfd, TUNSETIFF, (void *)&ifr) == -1) {
>>> +        PMD_DRV_LOG(ERR, "TUNSETIFF failed: %s", strerror(errno));
>>> +        goto error;
>>> +    }
>>
>> This requires CAP_NET_ADMIN, so we should really consider to accept a 
>> pre-created fd here.
>
> It sounds like a future work for me. So far, all DPDK apps are running 
> in privileged mode (including CAP_NET_ADMIN?).
>

That's not safe. Accepting a pre-created fd can solve this, and can even 
support macvtap.

>>
>>> +
>>> +    fcntl(tapfd, F_SETFL, O_NONBLOCK);
>>> +
>>> +    if (ioctl(tapfd, TUNSETVNETHDRSZ, &hdr_size) < 0) {
>>> +        PMD_DRV_LOG(ERR, "TUNSETVNETHDRSZ failed: %s", 
>>> strerror(errno));
>>> +        goto error;
>>> +    }
>>> +
>>> +    if (ioctl(tapfd, TUNSETSNDBUF, &sndbuf) < 0) {
>>> +        PMD_DRV_LOG(ERR, "TUNSETSNDBUF failed: %s", strerror(errno));
>>> +        goto error;
>>> +    }
>>
>> Let's use INT_MAX as default here to survive from evil consumer here.
>
> Oh yes, I will fix it.
>
> Thanks,
> Jianfeng
  
Jason Wang Jan. 11, 2017, 2:45 a.m. UTC | #11
On 2017年01月11日 10:30, Tan, Jianfeng wrote:
>
> Hi Jason,
>
>
> On 1/9/2017 12:39 PM, Jason Wang wrote:
>>> +    if (!enable) {
>>> +        if (dev->tapfds[pair_idx]) {
>>> +            close(dev->tapfds[pair_idx]);
>>> +            dev->tapfds[pair_idx] = -1;
>>> +        }
>>> +        return vhost_kernel_set_backend(vhostfd, -1);
>>
>> If this is used to for thing like ethtool -L in guest, we should use 
>> TUNSETQUEUE here.
>
> To make it clear, why we need to ioctl(..., TUNSETQUEUE, ...) here. 
> According to Linux/Documentation/networking/tuntap.txt,
>     "A new ioctl(TUNSETQUEUE) were introduced to enable or disable a 
> queue. When
>     calling it with IFF_DETACH_QUEUE flag, the queue were disabled. 
> And when
>     calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The 
> queue were
>     enabled by default after it was created through TUNSETIFF."
>
> As it's enabled by default, do you still see the necessity to call it 
> explicitly?

If you want to keep it enabled, no need. But if you want to disable one 
specific queue (which I believe is the case of !enable?), you need to 
call it.
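
Roughly something like this (a rough, untested sketch of the tap-side call;
the helper name is mine, per the struct ifreq interface described in
tuntap.txt):

	static int
	tap_set_queue(int tapfd, int attach)
	{
		struct ifreq ifr;

		memset(&ifr, 0, sizeof(ifr));
		ifr.ifr_flags = attach ? IFF_ATTACH_QUEUE : IFF_DETACH_QUEUE;
		/* disable or re-enable just this queue without closing the fd */
		return ioctl(tapfd, TUNSETQUEUE, (void *)&ifr);
	}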

Thanks

>
> Thanks,
> Jianfeng
  
Jianfeng Tan Jan. 11, 2017, 3:13 a.m. UTC | #12
On 1/11/2017 10:42 AM, Jason Wang wrote:
>
>
> On 2017年01月10日 14:11, Tan, Jianfeng wrote:
>>
>> Hi Jason,
>>
>>
>> On 1/9/2017 12:39 PM, Jason Wang wrote:
>>>
>>>
>>> On 2016年12月23日 15:14, Jianfeng Tan wrote:
>>>> This patch add support vhost kernel as the backend for virtio_user.
>>>> Three main hook functions are added:
>>>>    - vhost_kernel_setup() to open char device, each vq pair needs one
>>>>      vhostfd;
>>>>    - vhost_kernel_ioctl() to communicate control messages with vhost
>>>>      kernel module;
>>>>    - vhost_kernel_enable_queue_pair() to open tap device and set it
>>>>      as the backend of corresonding vhost fd (that is to say, vq 
>>>> pair).
>>>>
>>>> Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
>>>> ---
>>>>
>> [...]
>>>> +/* TUNSETIFF ifr flags */
>>>> +#define IFF_TAP          0x0002
>>>> +#define IFF_NO_PI        0x1000
>>>> +#define IFF_ONE_QUEUE    0x2000
>>>> +#define IFF_VNET_HDR     0x4000
>>>> +#define IFF_MULTI_QUEUE  0x0100
>>>> +#define IFF_ATTACH_QUEUE 0x0200
>>>> +#define IFF_DETACH_QUEUE 0x0400
>>>
>>> Do we really want to duplicate those things which has been exposed 
>>> by uapi here?
>>
>> You mean those defined in <linux/if_tun.h>? Redefine those common 
>> macros, or include standard header file, with respective pros and 
>> cons. DPDK prefers the redefinition way as far as I understand, 
>> doesn't it?
>>
>
> Well, if you really want to do this, you may want to use an 
> independent file. Then you can sync it with linux headers with a bash 
> script.

Agreed!

[...]
>>>> +static struct vhost_memory_kernel *
>>>> +prepare_vhost_memory_kernel(void)
>>>> +{
>>>> +    uint32_t i, j, k = 0;
>>>> +    struct rte_memseg *seg;
>>>> +    struct vhost_memory_region *mr;
>>>> +    struct vhost_memory_kernel *vm;
>>>> +
>>>> +    vm = malloc(sizeof(struct vhost_memory_kernel) +
>>>> +            VHOST_KERNEL_MAX_REGIONS *
>>>> +            sizeof(struct vhost_memory_region));
>>>> +
>>>> +    for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
>>>> +        seg = &rte_eal_get_configuration()->mem_config->memseg[i];
>>>> +        if (!seg->addr)
>>>> +            break;
>>>
>>> If we're sure the number of regions is less than 64(or the module 
>>> parameter read from /sys), can we avoid the iteration here?
>>
>> The "if" statement under the "for" statement can save us from all 
>> RTE_MAX_MEMSEG iteration.
>
> But if we know the number of regions is short than the limit, there's 
> even no need for this?

The problem is: we don't have a variable that tells us how many segments 
there are in DPDK. Anyway, we need to report an error when # of segments > 
# of regions. Or can you give a more specific example?

[...]
>>>> +    if (dev->ifname)
>>>> +        strncpy(ifr.ifr_name, dev->ifname, IFNAMSIZ);
>>>> +    else
>>>> +        strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ);
>>>> +    if (ioctl(tapfd, TUNSETIFF, (void *)&ifr) == -1) {
>>>> +        PMD_DRV_LOG(ERR, "TUNSETIFF failed: %s", strerror(errno));
>>>> +        goto error;
>>>> +    }
>>>
>>> This requires CAP_NET_ADMIN, so we should really consider to accept 
>>> a pre-created fd here.
>>
>> It sounds like a future work for me. So far, all DPDK apps are 
>> running in privileged mode (including CAP_NET_ADMIN?).
>>
>
> That's not safe. 

Yes.

> Accepting a per-created fd can solve this, and can even support macvtap

So you are proposing a way to specify the virtio_user parameters like 
--vdev=virtio_user,tapfd=fd1,xxx, right? As replied in another comment of 
yours, it's a great way to go. But I'm afraid we have no time to 
implement that in this cycle.

Thanks,
Jianfeng
  
Jason Wang Jan. 11, 2017, 3:23 a.m. UTC | #13
On 2017年01月11日 11:13, Tan, Jianfeng wrote:
>
>
> On 1/11/2017 10:42 AM, Jason Wang wrote:
>>
>>
>> On 2017年01月10日 14:11, Tan, Jianfeng wrote:
>>>
>>> Hi Jason,
>>>
>>>
>>> On 1/9/2017 12:39 PM, Jason Wang wrote:
>>>>
>>>>
>>>> On 2016年12月23日 15:14, Jianfeng Tan wrote:
>>>>> This patch add support vhost kernel as the backend for virtio_user.
>>>>> Three main hook functions are added:
>>>>>    - vhost_kernel_setup() to open char device, each vq pair needs one
>>>>>      vhostfd;
>>>>>    - vhost_kernel_ioctl() to communicate control messages with vhost
>>>>>      kernel module;
>>>>>    - vhost_kernel_enable_queue_pair() to open tap device and set it
>>>>>      as the backend of corresonding vhost fd (that is to say, vq 
>>>>> pair).
>>>>>
>>>>> Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
>>>>> ---
>>>>>
>>> [...]
>>>>> +/* TUNSETIFF ifr flags */
>>>>> +#define IFF_TAP          0x0002
>>>>> +#define IFF_NO_PI        0x1000
>>>>> +#define IFF_ONE_QUEUE    0x2000
>>>>> +#define IFF_VNET_HDR     0x4000
>>>>> +#define IFF_MULTI_QUEUE  0x0100
>>>>> +#define IFF_ATTACH_QUEUE 0x0200
>>>>> +#define IFF_DETACH_QUEUE 0x0400
>>>>
>>>> Do we really want to duplicate those things which has been exposed 
>>>> by uapi here?
>>>
>>> You mean those defined in <linux/if_tun.h>? Redefine those common 
>>> macros, or include standard header file, with respective pros and 
>>> cons. DPDK prefers the redefinition way as far as I understand, 
>>> doesn't it?
>>>
>>
>> Well, if you really want to do this, you may want to use an 
>> independent file. Then you can sync it with linux headers with a bash 
>> script.
>
> Agreed!
>
> [...]
>>>>> +static struct vhost_memory_kernel *
>>>>> +prepare_vhost_memory_kernel(void)
>>>>> +{
>>>>> +    uint32_t i, j, k = 0;
>>>>> +    struct rte_memseg *seg;
>>>>> +    struct vhost_memory_region *mr;
>>>>> +    struct vhost_memory_kernel *vm;
>>>>> +
>>>>> +    vm = malloc(sizeof(struct vhost_memory_kernel) +
>>>>> +            VHOST_KERNEL_MAX_REGIONS *
>>>>> +            sizeof(struct vhost_memory_region));
>>>>> +
>>>>> +    for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
>>>>> +        seg = &rte_eal_get_configuration()->mem_config->memseg[i];
>>>>> +        if (!seg->addr)
>>>>> +            break;
>>>>
>>>> If we're sure the number of regions is less than 64(or the module 
>>>> parameter read from /sys), can we avoid the iteration here?
>>>
>>> The "if" statement under the "for" statement can save us from all 
>>> RTE_MAX_MEMSEG iteration.
>>
>> But if we know the number of regions is short than the limit, there's 
>> even no need for this?
>
> The problem is: we don't have a variable to know how many segments are 
> there in DPDK. Anyway, we need to report error in the situation that # 
> of segments > # of regions. Or can you give a more specific example?
>
> [...]

If we don't know #segments, I'm fine with the current code.

>>>>> +    if (dev->ifname)
>>>>> +        strncpy(ifr.ifr_name, dev->ifname, IFNAMSIZ);
>>>>> +    else
>>>>> +        strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ);
>>>>> +    if (ioctl(tapfd, TUNSETIFF, (void *)&ifr) == -1) {
>>>>> +        PMD_DRV_LOG(ERR, "TUNSETIFF failed: %s", strerror(errno));
>>>>> +        goto error;
>>>>> +    }
>>>>
>>>> This requires CAP_NET_ADMIN, so we should really consider to accept 
>>>> a pre-created fd here.
>>>
>>> It sounds like a future work for me. So far, all DPDK apps are 
>>> running in privileged mode (including CAP_NET_ADMIN?).
>>>
>>
>> That's not safe. 
>
> Yes.
>
>> Accepting a per-created fd can solve this, and can even support macvtap
>
> So you are proposing a way to specify the virtio_user parameter like, 
> --vdev=virtio_user,tapfd=fd1,xxx, right? As replied in your another 
> comment, it's a great way to go. But I'm afraid we have no time to 
> implement that in this cycle.
>
> Thanks,
> Jianfeng

Ok, just want to show its advantages. It can be added on top.

And two more suggestions:
- better to split tap support out of the vhost file
- the kernel supports more than 8 queues in recent versions, so there's no need 
to limit it to 8. When running on an old kernel, TUNSETIFF will fail, which 
can give the user a hint to reduce #queues.

Thanks
  
Jianfeng Tan Jan. 12, 2017, 9:40 a.m. UTC | #14
Hi Jason,

> Ok, just want to show its advantages. It can be added on top.
>
> And two more suggestions:
> - better to split tap support out of vhost file

Good suggestion! Will do that in next version.

> - kernel support more than 8 queues on recent kernel, so there's no 
> need to limit it to 8. When running on old kernel, TUNSETIFF will fail 
> which can give user a hint to reduce #queues.
>

After a try, even if we remove such a restriction in the vhost_kernel backend, we 
are still restricted by VIRTIO_MAX_VIRTQUEUES (8) defined in the virtio PMD.

Thanks for the suggestions!

Jianfeng
  
Jason Wang Jan. 13, 2017, 2:27 a.m. UTC | #15
On 2017年01月12日 17:40, Tan, Jianfeng wrote:
> Hi Jason,
>
>> Ok, just want to show its advantages. It can be added on top.
>>
>> And two more suggestions:
>> - better to split tap support out of vhost file
>
> Good suggestion! Will do that in next version.
>
>> - kernel support more than 8 queues on recent kernel, so there's no 
>> need to limit it to 8. When running on old kernel, TUNSETIFF will 
>> fail which can give user a hint to reduce #queues.
>>
>
> After a try, even we remove such restriction in vhost_kernel backend, 
> we are still restricted by VIRTIO_MAX_VIRTQUEUES (8) defined in virtio 
> PMD.

I think we'd better remove this limitation in the future.

Thanks

>
> Thanks for the suggestions!
>
> Jianfeng
  

Patch

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 97972a6..faeffb2 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -60,6 +60,7 @@  endif
 
 ifeq ($(CONFIG_RTE_VIRTIO_USER),y)
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_user.c
+SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/vhost_kernel.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user/virtio_user_dev.c
 SRCS-$(CONFIG_RTE_LIBRTE_VIRTIO_PMD) += virtio_user_ethdev.c
 endif
diff --git a/drivers/net/virtio/virtio_user/vhost.h b/drivers/net/virtio/virtio_user/vhost.h
index bd67133..ffab13a 100644
--- a/drivers/net/virtio/virtio_user/vhost.h
+++ b/drivers/net/virtio/virtio_user/vhost.h
@@ -120,4 +120,6 @@  struct virtio_user_backend_ops {
 };
 
 struct virtio_user_backend_ops ops_user;
+struct virtio_user_backend_ops ops_kernel;
+
 #endif
diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c
new file mode 100644
index 0000000..8984c5c
--- /dev/null
+++ b/drivers/net/virtio/virtio_user/vhost_kernel.c
@@ -0,0 +1,364 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <net/if.h>
+#include <string.h>
+#include <errno.h>
+
+#include <rte_memory.h>
+#include <rte_eal_memconfig.h>
+
+#include "vhost.h"
+#include "virtio_user_dev.h"
+
+struct vhost_memory_kernel {
+	uint32_t nregions;
+	uint32_t padding;
+	struct vhost_memory_region regions[0];
+};
+
+/* vhost kernel ioctls */
+#define VHOST_VIRTIO 0xAF
+#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64)
+#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01)
+#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02)
+#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory_kernel)
+#define VHOST_SET_LOG_BASE _IOW(VHOST_VIRTIO, 0x04, __u64)
+#define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int)
+#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state)
+#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr)
+#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state)
+#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file)
+#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file)
+#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
+#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
+
+/* TUN ioctls */
+#define TUNSETIFF     _IOW('T', 202, int)
+#define TUNGETFEATURES _IOR('T', 207, unsigned int)
+#define TUNSETOFFLOAD  _IOW('T', 208, unsigned int)
+#define TUNGETIFF      _IOR('T', 210, unsigned int)
+#define TUNSETSNDBUF   _IOW('T', 212, int)
+#define TUNGETVNETHDRSZ _IOR('T', 215, int)
+#define TUNSETVNETHDRSZ _IOW('T', 216, int)
+#define TUNSETQUEUE  _IOW('T', 217, int)
+#define TUNSETVNETLE _IOW('T', 220, int)
+#define TUNSETVNETBE _IOW('T', 222, int)
+
+/* TUNSETIFF ifr flags */
+#define IFF_TAP          0x0002
+#define IFF_NO_PI        0x1000
+#define IFF_ONE_QUEUE    0x2000
+#define IFF_VNET_HDR     0x4000
+#define IFF_MULTI_QUEUE  0x0100
+#define IFF_ATTACH_QUEUE 0x0200
+#define IFF_DETACH_QUEUE 0x0400
+
+/* Constants */
+#define TUN_DEF_SNDBUF	(1ull << 20)
+#define PATH_NET_TUN	"/dev/net/tun"
+#define VHOST_KERNEL_MAX_REGIONS	64
+
+static uint64_t vhost_req_user_to_kernel[] = {
+	[VHOST_USER_SET_OWNER] = VHOST_SET_OWNER,
+	[VHOST_USER_RESET_OWNER] = VHOST_RESET_OWNER,
+	[VHOST_USER_SET_FEATURES] = VHOST_SET_FEATURES,
+	[VHOST_USER_GET_FEATURES] = VHOST_GET_FEATURES,
+	[VHOST_USER_SET_VRING_CALL] = VHOST_SET_VRING_CALL,
+	[VHOST_USER_SET_VRING_NUM] = VHOST_SET_VRING_NUM,
+	[VHOST_USER_SET_VRING_BASE] = VHOST_SET_VRING_BASE,
+	[VHOST_USER_GET_VRING_BASE] = VHOST_GET_VRING_BASE,
+	[VHOST_USER_SET_VRING_ADDR] = VHOST_SET_VRING_ADDR,
+	[VHOST_USER_SET_VRING_KICK] = VHOST_SET_VRING_KICK,
+	[VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE,
+};
+
+/* By default, vhost kernel module allows 64 regions, but DPDK allows
+ * 256 segments. As a relief, below function merges those virtually
+ * adjacent memsegs into one region.
+ */
+static struct vhost_memory_kernel *
+prepare_vhost_memory_kernel(void)
+{
+	uint32_t i, j, k = 0;
+	struct rte_memseg *seg;
+	struct vhost_memory_region *mr;
+	struct vhost_memory_kernel *vm;
+
+	vm = malloc(sizeof(struct vhost_memory_kernel) +
+		    VHOST_KERNEL_MAX_REGIONS *
+		    sizeof(struct vhost_memory_region));
+
+	for (i = 0; i < RTE_MAX_MEMSEG; ++i) {
+		seg = &rte_eal_get_configuration()->mem_config->memseg[i];
+		if (!seg->addr)
+			break;
+
+		int new_region = 1;
+
+		for (j = 0; j < k; ++j) {
+			mr = &vm->regions[j];
+
+			if (mr->userspace_addr + mr->memory_size ==
+			    (uint64_t)seg->addr) {
+				mr->memory_size += seg->len;
+				new_region = 0;
+				break;
+			}
+
+			if ((uint64_t)seg->addr + seg->len ==
+			    mr->userspace_addr) {
+				mr->guest_phys_addr = (uint64_t)seg->addr;
+				mr->userspace_addr = (uint64_t)seg->addr;
+				mr->memory_size += seg->len;
+				new_region = 0;
+				break;
+			}
+		}
+
+		if (new_region == 0)
+			continue;
+
+		mr = &vm->regions[k++];
+		mr->guest_phys_addr = (uint64_t)seg->addr; /* use vaddr here! */
+		mr->userspace_addr = (uint64_t)seg->addr;
+		mr->memory_size = seg->len;
+		mr->mmap_offset = 0;
+
+		if (k >= VHOST_KERNEL_MAX_REGIONS) {
+			free(vm);
+			return NULL;
+		}
+	}
+
+	vm->nregions = k;
+	vm->padding = 0;
+	return vm;
+}
+
+static int
+vhost_kernel_ioctl(struct virtio_user_dev *dev,
+		   enum vhost_user_request req,
+		   void *arg)
+{
+	int i, ret = -1;
+	uint64_t req_kernel;
+	struct vhost_memory_kernel *vm = NULL;
+
+	req_kernel = vhost_req_user_to_kernel[req];
+
+	if (req_kernel == VHOST_SET_MEM_TABLE) {
+		vm = prepare_vhost_memory_kernel();
+		if (!vm)
+			return -1;
+		arg = (void *)vm;
+	}
+
+	/* Does not work when VIRTIO_F_IOMMU_PLATFORM now, why? */
+	if (req_kernel == VHOST_SET_FEATURES)
+		*(uint64_t *)arg &= ~(1ULL << VIRTIO_F_IOMMU_PLATFORM);
+
+	for (i = 0; i < VHOST_KERNEL_MAX_QUEUES; ++i) {
+		if (dev->vhostfds[i] < 0)
+			continue;
+
+		ret = ioctl(dev->vhostfds[i], req_kernel, arg);
+		if (ret < 0)
+			break;
+	}
+
+	if (vm)
+		free(vm);
+
+	return ret;
+}
+
+/**
+ * Set up environment to talk with a vhost kernel backend.
+ *
+ * @return
+ *   - (-1) if fail to set up;
+ *   - (>=0) if successful.
+ */
+static int
+vhost_kernel_setup(struct virtio_user_dev *dev)
+{
+	int vhostfd;
+	uint32_t q;
+
+	for (q = 0; q < dev->max_queue_pairs; ++q) {
+		vhostfd = open(dev->path, O_RDWR);
+		if (vhostfd < 0) {
+			PMD_DRV_LOG(ERR, "fail to open %s, %s",
+				    dev->path, strerror(errno));
+			return -1;
+		}
+
+		dev->vhostfds[q] = vhostfd;
+	}
+
+	return 0;
+}
+
+static int
+vhost_kernel_set_backend(int vhostfd, int tapfd)
+{
+	struct vhost_vring_file f;
+
+	f.fd = tapfd;
+	f.index = 0;
+	if (ioctl(vhostfd, VHOST_NET_SET_BACKEND, &f) < 0) {
+		PMD_DRV_LOG(ERR, "VHOST_NET_SET_BACKEND fails, %s",
+				strerror(errno));
+		return -1;
+	}
+
+	f.index = 1;
+	if (ioctl(vhostfd, VHOST_NET_SET_BACKEND, &f) < 0) {
+		PMD_DRV_LOG(ERR, "VHOST_NET_SET_BACKEND fails, %s",
+				strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int
+vhost_kernel_enable_queue_pair(struct virtio_user_dev *dev,
+			       uint16_t pair_idx,
+			       int enable)
+{
+	unsigned int features;
+	int sndbuf = TUN_DEF_SNDBUF;
+	struct ifreq ifr;
+	int hdr_size;
+	int vhostfd;
+	int tapfd;
+
+	vhostfd = dev->vhostfds[pair_idx];
+
+	if (!enable) {
+		if (dev->tapfds[pair_idx]) {
+			close(dev->tapfds[pair_idx]);
+			dev->tapfds[pair_idx] = -1;
+		}
+		return vhost_kernel_set_backend(vhostfd, -1);
+	} else if (dev->tapfds[pair_idx] >= 0) {
+		return 0;
+	}
+
+	if ((dev->features & (1ULL << VIRTIO_NET_F_MRG_RXBUF)) ||
+	    (dev->features & (1ULL << VIRTIO_F_VERSION_1)))
+		hdr_size = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+	else
+		hdr_size = sizeof(struct virtio_net_hdr);
+
+	/* TODO:
+	 * 1. verify we can get/set vnet_hdr_len, tap_probe_vnet_hdr_len
+	 * 2. get number of memory regions from vhost module parameter
+	 * max_mem_regions, supported in newer version linux kernel
+	 */
+	tapfd = open(PATH_NET_TUN, O_RDWR);
+	if (tapfd < 0) {
+		PMD_DRV_LOG(ERR, "fail to open %s: %s",
+			    PATH_NET_TUN, strerror(errno));
+		return -1;
+	}
+
+	/* Construct ifr */
+	memset(&ifr, 0, sizeof(ifr));
+	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
+
+	if (ioctl(tapfd, TUNGETFEATURES, &features) == -1) {
+		PMD_DRV_LOG(ERR, "TUNGETFEATURES failed: %s", strerror(errno));
+		goto error;
+	}
+	if (features & IFF_ONE_QUEUE)
+		ifr.ifr_flags |= IFF_ONE_QUEUE;
+
+	/* Let tap instead of vhost-net handle vnet header, as the latter does
+	 * not support offloading. And in this case, we should not set feature
+	 * bit VHOST_NET_F_VIRTIO_NET_HDR.
+	 */
+	if (features & IFF_VNET_HDR) {
+		ifr.ifr_flags |= IFF_VNET_HDR;
+	} else {
+		PMD_DRV_LOG(ERR, "TAP does not support IFF_VNET_HDR");
+		goto error;
+	}
+
+	if (dev->ifname)
+		strncpy(ifr.ifr_name, dev->ifname, IFNAMSIZ);
+	else
+		strncpy(ifr.ifr_name, "tap%d", IFNAMSIZ);
+	if (ioctl(tapfd, TUNSETIFF, (void *)&ifr) == -1) {
+		PMD_DRV_LOG(ERR, "TUNSETIFF failed: %s", strerror(errno));
+		goto error;
+	}
+
+	fcntl(tapfd, F_SETFL, O_NONBLOCK);
+
+	if (ioctl(tapfd, TUNSETVNETHDRSZ, &hdr_size) < 0) {
+		PMD_DRV_LOG(ERR, "TUNSETVNETHDRSZ failed: %s", strerror(errno));
+		goto error;
+	}
+
+	if (ioctl(tapfd, TUNSETSNDBUF, &sndbuf) < 0) {
+		PMD_DRV_LOG(ERR, "TUNSETSNDBUF failed: %s", strerror(errno));
+		goto error;
+	}
+
+	if (vhost_kernel_set_backend(vhostfd, tapfd) < 0)
+		goto error;
+
+	dev->tapfds[pair_idx] = tapfd;
+	if (!dev->ifname)
+		dev->ifname = strdup(ifr.ifr_name);
+
+	return 0;
+error:
+	return -1;
+}
+
+struct virtio_user_backend_ops ops_kernel = {
+	.setup = vhost_kernel_setup,
+	.send_request = vhost_kernel_ioctl,
+	.enable_qp = vhost_kernel_enable_queue_pair
+};
diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.c b/drivers/net/virtio/virtio_user/virtio_user_dev.c
index a818c29..c718b85 100644
--- a/drivers/net/virtio/virtio_user/virtio_user_dev.c
+++ b/drivers/net/virtio/virtio_user/virtio_user_dev.c
@@ -219,7 +219,7 @@  is_vhost_user_by_type(const char *path)
 static int
 virtio_user_dev_setup(struct virtio_user_dev *dev)
 {
-	uint32_t i;
+	uint32_t i, q;
 
 	dev->vhostfd = -1;
 	for (i = 0; i < VIRTIO_MAX_VIRTQUEUES * 2 + 1; ++i) {
@@ -227,12 +227,18 @@  virtio_user_dev_setup(struct virtio_user_dev *dev)
 		dev->callfds[i] = -1;
 	}
 
+	for (q = 0; q < VHOST_KERNEL_MAX_QUEUES; ++q) {
+		dev->vhostfds[q] = -1;
+		dev->tapfds[q] = -1;
+	}
+
 	if (is_vhost_user_by_type(dev->path)) {
 		dev->ops = &ops_user;
-		return dev->ops->setup(dev);
+	} else {
+		dev->ops = &ops_kernel;
 	}
 
-	return -1;
+	return dev->ops->setup(dev);
 }
 
 int
@@ -284,7 +290,9 @@  virtio_user_dev_init(struct virtio_user_dev *dev, char *path, int queues,
 void
 virtio_user_dev_uninit(struct virtio_user_dev *dev)
 {
-	uint32_t i;
+	uint32_t i, q;
+
+	dev->ops->send_request(dev, VHOST_USER_RESET_OWNER, NULL);
 
 	for (i = 0; i < dev->max_queue_pairs * 2; ++i) {
 		close(dev->callfds[i]);
@@ -292,6 +300,11 @@  virtio_user_dev_uninit(struct virtio_user_dev *dev)
 	}
 
 	close(dev->vhostfd);
+
+	for (q = 0; q < VHOST_KERNEL_MAX_QUEUES; ++q) {
+		close(dev->vhostfds[q]);
+		close(dev->tapfds[q]);
+	}
 }
 
 static uint8_t
diff --git a/drivers/net/virtio/virtio_user/virtio_user_dev.h b/drivers/net/virtio/virtio_user/virtio_user_dev.h
index 503a496..148b2e6 100644
--- a/drivers/net/virtio/virtio_user/virtio_user_dev.h
+++ b/drivers/net/virtio/virtio_user/virtio_user_dev.h
@@ -44,6 +44,10 @@  struct virtio_user_dev {
 	int		vhostfd;
 
 	/* for vhost_kernel backend */
+	char		*ifname;
+#define VHOST_KERNEL_MAX_QUEUES		8
+	int		vhostfds[VHOST_KERNEL_MAX_QUEUES];
+	int		tapfds[VHOST_KERNEL_MAX_QUEUES];
 
 	/* for both vhost_user and vhost_kernel */
 	int		callfds[VIRTIO_MAX_VIRTQUEUES * 2 + 1];
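
As a footnote to the TODO in vhost_kernel_enable_queue_pair() above: on
kernels that expose it, the vhost module parameter max_mem_regions can be
read from sysfs instead of hard-coding VHOST_KERNEL_MAX_REGIONS. The
following is only an untested sketch; the helper name and the fallback to
64 regions are assumptions, not part of this patch.

#include <stdio.h>

/* Read /sys/module/vhost/parameters/max_mem_regions if present; fall
 * back to the historical vhost limit of 64 regions otherwise.
 */
static unsigned int
get_vhost_kernel_max_regions(void)
{
	unsigned int max_regions = 64;
	FILE *f = fopen("/sys/module/vhost/parameters/max_mem_regions", "r");

	if (f != NULL) {
		if (fscanf(f, "%u", &max_regions) != 1)
			max_regions = 64;
		fclose(f);
	}
	return max_regions;
}

int
main(void)
{
	printf("vhost kernel max regions: %u\n",
	       get_vhost_kernel_max_regions());
	return 0;
}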