[dpdk-dev,1/3] eal/vfio: add support for multiple container

Message ID 20180309230809.63361-2-xiao.w.wang@intel.com (mailing list archive)
State Superseded, archived
Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation fail Compilation issues

Commit Message

Xiao Wang March 9, 2018, 11:08 p.m. UTC
  From: Junjie Chen <junjie.j.chen@intel.com>

Currently the EAL VFIO framework binds a VFIO group fd to the default
container fd. In some cases, e.g. vDPA (vhost data path acceleration),
we want to put a VFIO group into a new container and program DMA
mappings via this new container, so this patch adds APIs to support
multiple containers.

Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
---
 lib/librte_eal/bsdapp/eal/eal.c          |  51 ++-
 lib/librte_eal/common/include/rte_vfio.h | 117 ++++++-
 lib/librte_eal/linuxapp/eal/eal_vfio.c   | 553 ++++++++++++++++++++++++++-----
 lib/librte_eal/linuxapp/eal/eal_vfio.h   |   2 +
 lib/librte_eal/rte_eal_version.map       |   7 +
 5 files changed, 629 insertions(+), 101 deletions(-)
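
For illustration, a minimal usage sketch built on the prototypes this
patch adds. The helper name and error handling are ours, not part of the
patch, and RTE_VFIO_TYPE1 is assumed to be visible to the caller as the
type1 IOMMU constant:

#include <rte_memory.h>
#include <rte_vfio.h>

/* Hypothetical per-device setup: give the device its own container,
 * pull its IOMMU group into it, and program one DMA mapping. */
static int
setup_device_container(int iommu_group_no, const struct rte_memseg *ms)
{
	int container_fd;

	container_fd = rte_vfio_create_container();
	if (container_fd < 0)
		return -1;

	/* move the device's group out of the default container */
	if (rte_vfio_bind_group_no(container_fd, iommu_group_no) < 0)
		goto err;

	/* program the IOMMU via the new container */
	if (rte_vfio_dma_map(container_fd, RTE_VFIO_TYPE1, ms) != 0)
		goto err;

	return container_fd;

err:
	/* destroy also unbinds any groups already attached */
	rte_vfio_destroy_container(container_fd);
	return -1;
}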
  

Comments

Anatoly Burakov March 14, 2018, 12:08 p.m. UTC | #1
On 09-Mar-18 11:08 PM, Xiao Wang wrote:
> From: Junjie Chen <junjie.j.chen@intel.com>
> 
> Currently the EAL VFIO framework binds a VFIO group fd to the default
> container fd. In some cases, e.g. vDPA (vhost data path acceleration),
> we want to put a VFIO group into a new container and program DMA
> mappings via this new container, so this patch adds APIs to support
> multiple containers.
> 
> Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> ---

I'm not going to get into the virtual vs. real device debate, but I do
have some issues with the VFIO side of things.

I'm not completely convinced this change is needed in the first place.
If the device driver manages its own groups anyway, it knows which VFIO
groups belong to it, so it can add/remove them without putting them into
separate containers. What is the purpose of keeping them in a separate
container as opposed to just keeping track of group IDs?

<...>


> +	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
> +
> +	if (vfio_cfg->vfio_container_fd < 0)
> +		return -1;
> +
> +	return vfio_cfg->vfio_container_fd;
> +}

Please correct me if I'm wrong, but this patch appears to be mistitled.
You're not really creating multiple containers, you're just partitioning
the existing one. Do we really need to open/store/close container fds
separately, if all we have is a single container anyway?

The semantics of this are also weird in multiprocess. When a secondary
process requests a container, we always create a new one, send it over
IPC, and close it afterwards. The code seems oblivious to the fact that
you may have several container fds, and does not know which one you are
asking for. We know it's all the same container, but that's clearly not
what the code appears to be doing.
  
Xiao Wang March 15, 2018, 4:49 p.m. UTC | #2
Hi Anatoly,

> -----Original Message-----
> From: Burakov, Anatoly
> Sent: Wednesday, March 14, 2018 8:08 PM
> To: Wang, Xiao W <xiao.w.wang@intel.com>; dev@dpdk.org
> Cc: Wang, Zhihong <zhihong.wang@intel.com>;
> maxime.coquelin@redhat.com; yliu@fridaylinux.org; Liang, Cunming
> <cunming.liang@intel.com>; Xu, Rosen <rosen.xu@intel.com>; Chen, Junjie J
> <junjie.j.chen@intel.com>; Daly, Dan <dan.daly@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/3] eal/vfio: add support for multiple
> container
> 
> On 09-Mar-18 11:08 PM, Xiao Wang wrote:
> > From: Junjie Chen <junjie.j.chen@intel.com>
> >
> > Currently the EAL VFIO framework binds a VFIO group fd to the default
> > container fd. In some cases, e.g. vDPA (vhost data path acceleration),
> > we want to put a VFIO group into a new container and program DMA
> > mappings via this new container, so this patch adds APIs to support
> > multiple containers.
> >
> > Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> > Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> > ---
> 
> I'm not going to get into the virtual vs. real device debate, but I do
> have some issues with the VFIO side of things.
> 
> I'm not completely convinced this change is needed in the first place.
> If the device driver manages its own groups anyway, it knows which VFIO
> groups belong to it, so it can add/remove them without putting them into
> separate containers. What is the purpose of keeping them in a separate
> container as opposed to just keeping track of group IDs?

The device driver needs a separate container so it can program the IOMMU
for the device with the VM's address translation table. So the driver
needs the devices to be put into new containers, rather than the default
one.
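
To make that concrete, a rough sketch of the flow described above; the
helper and its parameters are illustrative, and in the vDPA case the
HVA/GPA/size triple would come from the vhost-user memory table:

#include <string.h>
#include <stdint.h>
#include <rte_memory.h>
#include <rte_vfio.h>

/* Map one guest memory region through the device's own container,
 * so the device's IOMMU translates guest-physical addresses. */
static int
vdpa_map_region(int container_fd, uint64_t hva, uint64_t gpa,
		uint64_t len)
{
	struct rte_memseg ms;

	memset(&ms, 0, sizeof(ms));
	ms.addr_64 = hva;	/* host virtual address backing the region */
	ms.iova = gpa;		/* I/O address the device will use */
	ms.len = len;

	return rte_vfio_dma_map(container_fd, RTE_VFIO_TYPE1, &ms);
}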

> 
> <...>
> 
> > +	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
> > +
> > +	if (vfio_cfg->vfio_container_fd < 0)
> > +		return -1;
> > +
> > +	return vfio_cfg->vfio_container_fd;
> > +}
> 
> Please correct me if I'm wrong, but this patch appears to be mistitled.
> You're not really creating multiple containers, you're just partitioning
> the existing one. Do we really need to open/store/close container fds
> separately, if all we have is a single container anyway?

This driver creates new containers for devices; it needs each device to
have its own container, so that we can dma_map/unmap for the device via
its associated container.
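
For completeness, the matching teardown under the same illustrative
assumptions; per the patch, rte_vfio_destroy_container() also unbinds
any groups still attached before closing the fd:

/* Illustrative teardown: undo the region mapping programmed at
 * setup time (hva/gpa/len as in the sketch above), then release
 * the device's container. */
struct rte_memseg ms;

memset(&ms, 0, sizeof(ms));
ms.addr_64 = hva;
ms.iova = gpa;
ms.len = len;

rte_vfio_dma_unmap(container_fd, RTE_VFIO_TYPE1, &ms);
rte_vfio_destroy_container(container_fd);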

BRs,
Xiao

> 
> The semantics of this are also weird in multiprocess. When a secondary
> process requests a container, we always create a new one, send it over
> IPC, and close it afterwards. The code seems oblivious to the fact that
> you may have several container fds, and does not know which one you are
> asking for. We know it's all the same container, but that's clearly not
> what the code appears to be doing.
> 
> --
> Thanks,
> Anatoly
  

Patch

diff --git a/lib/librte_eal/bsdapp/eal/eal.c b/lib/librte_eal/bsdapp/eal/eal.c
index 4eafcb5ad..6cc321a70 100644
--- a/lib/librte_eal/bsdapp/eal/eal.c
+++ b/lib/librte_eal/bsdapp/eal/eal.c
@@ -38,6 +38,7 @@ 
 #include <rte_interrupts.h>
 #include <rte_bus.h>
 #include <rte_dev.h>
+#include <rte_vfio.h>
 #include <rte_devargs.h>
 #include <rte_version.h>
 #include <rte_atomic.h>
@@ -738,15 +739,6 @@  rte_eal_vfio_intr_mode(void)
 /* dummy forward declaration. */
 struct vfio_device_info;
 
-/* dummy prototypes. */
-int rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
-		int *vfio_dev_fd, struct vfio_device_info *device_info);
-int rte_vfio_release_device(const char *sysfs_base, const char *dev_addr, int fd);
-int rte_vfio_enable(const char *modname);
-int rte_vfio_is_enabled(const char *modname);
-int rte_vfio_noiommu_is_enabled(void);
-int rte_vfio_clear_group(int vfio_group_fd);
-
 int rte_vfio_setup_device(__rte_unused const char *sysfs_base,
 		      __rte_unused const char *dev_addr,
 		      __rte_unused int *vfio_dev_fd,
@@ -781,3 +773,44 @@  int rte_vfio_clear_group(__rte_unused int vfio_group_fd)
 {
 	return 0;
 }
+
+int rte_vfio_create_container(void)
+{
+	return -1;
+}
+
+int rte_vfio_destroy_container(__rte_unused int container_fd)
+{
+	return -1;
+}
+
+int rte_vfio_bind_group_no(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int rte_vfio_unbind_group_no(__rte_unused int container_fd,
+	__rte_unused int iommu_group_no)
+{
+	return -1;
+}
+
+int rte_vfio_dma_map(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int rte_vfio_dma_unmap(__rte_unused int container_fd,
+	__rte_unused int dma_type,
+	__rte_unused const struct rte_memseg *ms)
+{
+	return -1;
+}
+
+int rte_vfio_get_group_fd(__rte_unused int iommu_group_no)
+{
+	return -1;
+}
diff --git a/lib/librte_eal/common/include/rte_vfio.h b/lib/librte_eal/common/include/rte_vfio.h
index e981a6228..3aad9cace 100644
--- a/lib/librte_eal/common/include/rte_vfio.h
+++ b/lib/librte_eal/common/include/rte_vfio.h
@@ -123,6 +123,121 @@  int rte_vfio_noiommu_is_enabled(void);
 int
 rte_vfio_clear_group(int vfio_group_fd);
 
-#endif /* VFIO_PRESENT */
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Create a new container
+ * @return
+ *    the container fd if successful
+ *    < 0 if failed
+ */
+int __rte_experimental
+rte_vfio_create_container(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Destroy the container, unbinding all VFIO groups bound to it.
+ * @param container_fd
+ *   the container fd to destroy
+ * @return
+ *    0 if successful
+ *   !0 if failed
+ */
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Bind a group number to a container.
+ *
+ * @param container_fd
+ *   the container fd of container
+ * @param iommu_group_no
+ *   the iommu_group_no to bind to container
+ * @return
+ *    group fd if successful
+ *    < 0 if failed
+ */
+int __rte_experimental
+rte_vfio_bind_group_no(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Unbind a group from the specified container.
+ *
+ * @param container_fd
+ *   the container fd of container
+ * @param iommu_group_no
+ *   the iommu_group_no to delete from container
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_unbind_group_no(int container_fd, int iommu_group_no);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA mapping for a device in the specified container
+ *
+ * @param container_fd
+ *   the specified container fd
+ * @param dma_type
+ *   the dma type for mapping
+ * @param ms
+ *   the dma address region to map
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_map(int container_fd,
+	int dma_type,
+	const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Perform DMA unmapping for a device in the specified container
+ *
+ * @param container_fd
+ *   the specified container fd
+ * @param dma_type
+ *    the dma map type
+ * @param ms
+ *   the dma address region to unmap
+ * @return
+ *     0 if successful
+ *     !0 if failed
+ */
+int __rte_experimental
+rte_vfio_dma_unmap(int container_fd,
+	int dma_type,
+	const struct rte_memseg *ms);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * Get the group fd via the group number
+ * @param iommu_group_no
+ *  the group number
+ * @return
+ *     corresponding group fd if successful
+ *     -1 if failed
+ */
+int __rte_experimental
+rte_vfio_get_group_fd(int iommu_group_no);
 
+#endif /* VFIO_PRESENT */
 #endif /* _RTE_VFIO_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index e44ae4d04..939917da9 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -9,6 +9,7 @@ 
 
 #include <rte_log.h>
 #include <rte_memory.h>
+#include <rte_malloc.h>
 #include <rte_eal_memconfig.h>
 #include <rte_vfio.h>
 
@@ -19,7 +20,9 @@ 
 #ifdef VFIO_PRESENT
 
 /* per-process VFIO config */
-static struct vfio_config vfio_cfg;
+static struct vfio_config default_vfio_cfg;
+
+static struct vfio_config *vfio_cfgs[VFIO_MAX_CONTAINERS] = {&default_vfio_cfg};
 
 static int vfio_type1_dma_map(int);
 static int vfio_spapr_dma_map(int);
@@ -35,38 +38,13 @@  static const struct vfio_iommu_type iommu_types[] = {
 	{ RTE_VFIO_NOIOMMU, "No-IOMMU", &vfio_noiommu_dma_map},
 };
 
-int
-vfio_get_group_fd(int iommu_group_no)
+static int
+vfio_open_group_fd(int iommu_group_no)
 {
-	int i;
 	int vfio_group_fd;
 	char filename[PATH_MAX];
-	struct vfio_group *cur_grp;
-
-	/* check if we already have the group descriptor open */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == iommu_group_no)
-			return vfio_cfg.vfio_groups[i].fd;
-
-	/* Lets see first if there is room for a new group */
-	if (vfio_cfg.vfio_active_groups == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
-		return -1;
-	}
-
-	/* Now lets get an index for the new group */
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].group_no == -1) {
-			cur_grp = &vfio_cfg.vfio_groups[i];
-			break;
-		}
 
-	/* This should not happen */
-	if (i == VFIO_MAX_GROUPS) {
-		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
-		return -1;
-	}
-	/* if primary, try to open the group */
+	/* if in primary process, try to open the group */
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 		/* try regular group format */
 		snprintf(filename, sizeof(filename),
@@ -75,8 +53,8 @@  vfio_get_group_fd(int iommu_group_no)
 		if (vfio_group_fd < 0) {
 			/* if file not found, it's not an error */
 			if (errno != ENOENT) {
-				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-						strerror(errno));
+				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
+					filename, strerror(errno));
 				return -1;
 			}
 
@@ -86,8 +64,10 @@  vfio_get_group_fd(int iommu_group_no)
 			vfio_group_fd = open(filename, O_RDWR);
 			if (vfio_group_fd < 0) {
 				if (errno != ENOENT) {
-					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
-							strerror(errno));
+					RTE_LOG(ERR, EAL,
+						"Cannot open %s: %s\n",
+						filename,
+						strerror(errno));
 					return -1;
 				}
 				return 0;
@@ -95,21 +75,19 @@  vfio_get_group_fd(int iommu_group_no)
 			/* noiommu group found */
 		}
 
-		cur_grp->group_no = iommu_group_no;
-		cur_grp->fd = vfio_group_fd;
-		vfio_cfg.vfio_active_groups++;
 		return vfio_group_fd;
 	}
-	/* if we're in a secondary process, request group fd from the primary
+	/*
+	 * if we're in a secondary process, request group fd from the primary
 	 * process via our socket
 	 */
 	else {
-		int socket_fd, ret;
-
-		socket_fd = vfio_mp_sync_connect_to_primary();
+		int ret;
+		int socket_fd = vfio_mp_sync_connect_to_primary();
 
 		if (socket_fd < 0) {
-			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
+			RTE_LOG(ERR, EAL,
+				"  cannot connect to primary process!\n");
 			return -1;
 		}
 		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
@@ -122,6 +100,7 @@  vfio_get_group_fd(int iommu_group_no)
 			close(socket_fd);
 			return -1;
 		}
+
 		ret = vfio_mp_sync_receive_request(socket_fd);
 		switch (ret) {
 		case SOCKET_NO_FD:
@@ -132,9 +111,6 @@  vfio_get_group_fd(int iommu_group_no)
 			/* if we got the fd, store it and return it */
 			if (vfio_group_fd > 0) {
 				close(socket_fd);
-				cur_grp->group_no = iommu_group_no;
-				cur_grp->fd = vfio_group_fd;
-				vfio_cfg.vfio_active_groups++;
 				return vfio_group_fd;
 			}
 			/* fall-through on error */
@@ -147,70 +123,353 @@  vfio_get_group_fd(int iommu_group_no)
 	return -1;
 }
 
+static struct vfio_config *
+vfio_get_container(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++)
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return vfio_cfg;
+	}
+
+	return &default_vfio_cfg;
+}
 
 static int
-get_vfio_group_idx(int vfio_group_fd)
+vfio_get_container_idx(int container_fd)
 {
 	int i;
-	for (i = 0; i < VFIO_MAX_GROUPS; i++)
-		if (vfio_cfg.vfio_groups[i].fd == vfio_group_fd)
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		if (vfio_cfgs[i]->vfio_container_fd == container_fd)
 			return i;
+	}
+
+	return -1;
+}
+
+static int
+vfio_find_container_idx(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return i;
+		}
+	}
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_create_container(void)
+{
+	struct vfio_config *vfio_cfg;
+	int i;
+
+	/* Find an empty slot to store new vfio config */
+	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
+		if (vfio_cfgs[i] == NULL)
+			break;
+	}
+
+	if (i == VFIO_MAX_CONTAINERS) {
+		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
+		return -1;
+	}
+
+	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
+		RTE_CACHE_LINE_SIZE);
+	vfio_cfg = vfio_cfgs[i];
+
+	if (vfio_cfgs[i] == NULL)
+		return -ENOMEM;
+
+	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
+
+	for (i = 0 ; i < VFIO_MAX_GROUPS; i++) {
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+	}
+
+	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
+
+	if (vfio_cfg->vfio_container_fd < 0)
+		return -1;
+
+	return vfio_cfg->vfio_container_fd;
+}
+
+int __rte_experimental
+rte_vfio_destroy_container(int container_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, idx;
+
+	idx = vfio_get_container_idx(container_fd);
+	vfio_cfg = vfio_cfgs[idx];
+
+	if (!idx)
+		return 0;
+
+	if (idx < 0) {
+		RTE_LOG(ERR, EAL, "Invalid container fd\n");
+		return -1;
+	}
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no != -1)
+			rte_vfio_unbind_group_no(container_fd,
+				vfio_cfg->vfio_groups[i].group_no);
+
+	rte_free(vfio_cfgs[idx]);
+	vfio_cfgs[idx] = NULL;
+	close(container_fd);
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_bind_group_no(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *cur_vfio_cfg;
+	struct vfio_group *cur_grp;
+	int vfio_group_fd;
+	int i;
+
+	i = vfio_get_container_idx(container_fd);
+	cur_vfio_cfg = vfio_cfgs[i];
+
+	/* Check room for new group */
+	if (cur_vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (cur_vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &cur_vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	cur_vfio_cfg->vfio_active_groups++;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_vfio_unbind_group_no(int container_fd, int iommu_group_no)
+{
+	struct vfio_config *cur_vfio_cfg;
+	struct vfio_group *cur_grp;
+	int i;
+
+	i = vfio_get_container_idx(container_fd);
+
+	if (!i)
+		return 0;
+
+	cur_vfio_cfg = vfio_cfgs[i];
+
+	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
+		if (cur_vfio_cfg->vfio_groups[i].group_no == iommu_group_no) {
+			cur_grp = &cur_vfio_cfg->vfio_groups[i];
+			break;
+		}
+	}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Specified group number not found\n");
+		return -1;
+	}
+
+	if (close(cur_grp->fd) < 0) {
+		RTE_LOG(INFO, EAL, "Error when closing vfio_group_fd for"
+				" iommu_group_no %d\n",
+			iommu_group_no);
+		return -1;
+	}
+	cur_grp->group_no = -1;
+	cur_grp->fd = -1;
+	cur_vfio_cfg->vfio_active_groups--;
+
+	return 0;
+}
+
+int
+vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_group *cur_grp;
+	struct vfio_config *vfio_cfg;
+	int vfio_group_fd;
+	int i;
+
+	i = vfio_find_container_idx(iommu_group_no);
+	vfio_cfg = vfio_cfgs[i];
+
+	/* check if we already have the group descriptor open */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == iommu_group_no)
+			return vfio_cfg->vfio_groups[i].fd;
+
+	/* Lets see first if there is room for a new group */
+	if (vfio_cfg->vfio_active_groups == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups reached!\n");
+		return -1;
+	}
+
+	/* Now lets get an index for the new group */
+	for (i = 0; i < VFIO_MAX_GROUPS; i++)
+		if (vfio_cfg->vfio_groups[i].group_no == -1) {
+			cur_grp = &vfio_cfg->vfio_groups[i];
+			break;
+		}
+
+	/* This should not happen */
+	if (i == VFIO_MAX_GROUPS) {
+		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
+		return -1;
+	}
+
+	vfio_group_fd = vfio_open_group_fd(iommu_group_no);
+	if (vfio_group_fd < 0) {
+		RTE_LOG(ERR, EAL, "Failed to open group %d\n", iommu_group_no);
+		return -1;
+	}
+
+	cur_grp->group_no = iommu_group_no;
+	cur_grp->fd = vfio_group_fd;
+	vfio_cfg->vfio_active_groups++;
+
+	return vfio_group_fd;
+}
+
+static int
+get_vfio_group_idx(int vfio_group_fd)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		if (!vfio_cfgs[i])
+			continue;
+
+		vfio_cfg = vfio_cfgs[i];
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].fd == vfio_group_fd)
+				return j;
+		}
+	}
+
 	return -1;
 }
 
 static void
 vfio_group_device_get(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices++;
+		vfio_cfg->vfio_groups[i].devices++;
 }
 
 static void
 vfio_group_device_put(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1))
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 	else
-		vfio_cfg.vfio_groups[i].devices--;
+		vfio_cfg->vfio_groups[i].devices--;
 }
 
 static int
 vfio_group_device_count(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	i = get_vfio_group_idx(vfio_group_fd);
 	if (i < 0 || i > (VFIO_MAX_GROUPS - 1)) {
 		RTE_LOG(ERR, EAL, "  wrong vfio_group index (%d)\n", i);
 		return -1;
 	}
 
-	return vfio_cfg.vfio_groups[i].devices;
+	return vfio_cfg->vfio_groups[i].devices;
 }
 
 int
 rte_vfio_clear_group(int vfio_group_fd)
 {
+	struct vfio_config *vfio_cfg;
 	int i;
 	int socket_fd, ret;
 
+	vfio_cfg = vfio_get_container(vfio_group_fd);
+	if (!vfio_cfg)
+		RTE_LOG(ERR, EAL, "  wrong group fd (%d)\n", vfio_group_fd);
+
 	if (internal_config.process_type == RTE_PROC_PRIMARY) {
 
 		i = get_vfio_group_idx(vfio_group_fd);
 		if (i < 0)
 			return -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
-		vfio_cfg.vfio_active_groups--;
+		vfio_cfg->vfio_groups[i].group_no = -1;
+		vfio_cfg->vfio_groups[i].fd = -1;
+		vfio_cfg->vfio_groups[i].devices = 0;
+		vfio_cfg->vfio_active_groups--;
 		return 0;
 	}
 
@@ -261,9 +520,11 @@  rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
 	};
+	int vfio_container_fd;
 	int vfio_group_fd;
 	int iommu_group_no;
-	int ret;
+	int ret = 0;
+	int index;
 
 	/* get group number */
 	ret = vfio_get_group_no(sysfs_base, dev_addr, &iommu_group_no);
@@ -309,12 +570,14 @@  rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		return -1;
 	}
 
+	index = vfio_find_container_idx(iommu_group_no);
+	vfio_container_fd = vfio_cfgs[index]->vfio_container_fd;
+
 	/* check if group does not have a container yet */
 	if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
-
 		/* add group to a container */
 		ret = ioctl(vfio_group_fd, VFIO_GROUP_SET_CONTAINER,
-				&vfio_cfg.vfio_container_fd);
+				&vfio_container_fd);
 		if (ret) {
 			RTE_LOG(ERR, EAL, "  %s cannot add VFIO group to container, "
 					"error %i (%s)\n", dev_addr, errno, strerror(errno));
@@ -331,11 +594,12 @@  rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 		 * Note this can happen several times with the hotplug
 		 * functionality.
 		 */
+
 		if (internal_config.process_type == RTE_PROC_PRIMARY &&
-				vfio_cfg.vfio_active_groups == 1) {
+				vfio_cfgs[index]->vfio_active_groups == 1) {
 			/* select an IOMMU type which we will be using */
 			const struct vfio_iommu_type *t =
-				vfio_set_iommu_type(vfio_cfg.vfio_container_fd);
+				vfio_set_iommu_type(vfio_container_fd);
 			if (!t) {
 				RTE_LOG(ERR, EAL,
 					"  %s failed to select IOMMU type\n",
@@ -344,7 +608,13 @@  rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 				rte_vfio_clear_group(vfio_group_fd);
 				return -1;
 			}
-			ret = t->dma_map_func(vfio_cfg.vfio_container_fd);
+			/* DMA map for the default container only. */
+			if (default_vfio_cfg.vfio_container_fd ==
+				vfio_container_fd)
+				ret = t->dma_map_func(vfio_container_fd);
+			else
+				ret = 0;
+
 			if (ret) {
 				RTE_LOG(ERR, EAL,
 					"  %s DMA remapping failed, error %i (%s)\n",
@@ -388,7 +658,7 @@  rte_vfio_setup_device(const char *sysfs_base, const char *dev_addr,
 
 int
 rte_vfio_release_device(const char *sysfs_base, const char *dev_addr,
-		    int vfio_dev_fd)
+			int vfio_dev_fd)
 {
 	struct vfio_group_status group_status = {
 			.argsz = sizeof(group_status)
@@ -456,9 +726,9 @@  rte_vfio_enable(const char *modname)
 	int vfio_available;
 
 	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
-		vfio_cfg.vfio_groups[i].fd = -1;
-		vfio_cfg.vfio_groups[i].group_no = -1;
-		vfio_cfg.vfio_groups[i].devices = 0;
+		default_vfio_cfg.vfio_groups[i].fd = -1;
+		default_vfio_cfg.vfio_groups[i].group_no = -1;
+		default_vfio_cfg.vfio_groups[i].devices = 0;
 	}
 
 	/* inform the user that we are probing for VFIO */
@@ -480,12 +750,12 @@  rte_vfio_enable(const char *modname)
 		return 0;
 	}
 
-	vfio_cfg.vfio_container_fd = vfio_get_container_fd();
+	default_vfio_cfg.vfio_container_fd = vfio_get_container_fd();
 
 	/* check if we have VFIO driver enabled */
-	if (vfio_cfg.vfio_container_fd != -1) {
+	if (default_vfio_cfg.vfio_container_fd != -1) {
 		RTE_LOG(NOTICE, EAL, "VFIO support initialized\n");
-		vfio_cfg.vfio_enabled = 1;
+		default_vfio_cfg.vfio_enabled = 1;
 	} else {
 		RTE_LOG(NOTICE, EAL, "VFIO support could not be initialized\n");
 	}
@@ -497,7 +767,7 @@  int
 rte_vfio_is_enabled(const char *modname)
 {
 	const int mod_available = rte_eal_check_module(modname) > 0;
-	return vfio_cfg.vfio_enabled && mod_available;
+	return default_vfio_cfg.vfio_enabled && mod_available;
 }
 
 const struct vfio_iommu_type *
@@ -665,41 +935,87 @@  vfio_get_group_no(const char *sysfs_base,
 }
 
 static int
-vfio_type1_dma_map(int vfio_container_fd)
+do_vfio_type1_dma_map(int vfio_container_fd,
+	const struct rte_memseg *ms)
 {
-	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
-	int i, ret;
+	struct vfio_iommu_type1_dma_map dma_map;
+	int ret;
 
-	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
-	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
-		struct vfio_iommu_type1_dma_map dma_map;
+	if (ms->addr == NULL) {
+		RTE_LOG(ERR, EAL, "invalid dma addr");
+		return -1;
+	}
 
-		if (ms[i].addr == NULL)
-			break;
+	memset(&dma_map, 0, sizeof(dma_map));
+	dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
+	dma_map.vaddr = ms->addr_64;
+	dma_map.size = ms->len;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_map.iova = dma_map.vaddr;
+	else
+		dma_map.iova = ms->iova;
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
 
-		memset(&dma_map, 0, sizeof(dma_map));
-		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
-		dma_map.vaddr = ms[i].addr_64;
-		dma_map.size = ms[i].len;
-		if (rte_eal_iova_mode() == RTE_IOVA_VA)
-			dma_map.iova = dma_map.vaddr;
-		else
-			dma_map.iova = ms[i].iova;
-		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
 
-		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot set up DMA remapping, error %i (%s)\n",
+			errno,
+			strerror(errno));
+		return -1;
+	}
 
-		if (ret) {
-			RTE_LOG(ERR, EAL, "  cannot set up DMA remapping, "
-					  "error %i (%s)\n", errno,
-					  strerror(errno));
+	return 0;
+}
+
+static int
+do_vfio_type1_dma_unmap(int vfio_container_fd,
+	const struct rte_memseg *ms)
+{
+	int ret;
+	/* unmap the given memory segment from the container */
+	struct vfio_iommu_type1_dma_unmap dma_unmap;
+
+	memset(&dma_unmap, 0, sizeof(dma_unmap));
+	dma_unmap.argsz = sizeof(struct vfio_iommu_type1_dma_unmap);
+	dma_unmap.size = ms->len;
+	if (rte_eal_iova_mode() == RTE_IOVA_VA)
+		dma_unmap.iova = ms->addr_64;
+	else
+		dma_unmap.iova = ms->iova;
+	dma_unmap.flags = 0;
+
+	ret = ioctl(vfio_container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+	if (ret) {
+		RTE_LOG(ERR, EAL,
+			"  cannot unmap DMA, error %i (%s)\n",
+			errno,
+			strerror(errno));
 			return -1;
-		}
 	}
 
 	return 0;
 }
 
+static int
+vfio_type1_dma_map(int vfio_container_fd)
+{
+	const struct rte_memseg *ms = rte_eal_get_physmem_layout();
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
+		if (ms[i].addr == NULL)
+			break;
+		ret = do_vfio_type1_dma_map(vfio_container_fd, &ms[i]);
+		if (ret < 0)
+			return ret;
+	}
+
+	return ret;
+}
+
 static int
 vfio_spapr_dma_map(int vfio_container_fd)
 {
@@ -843,4 +1159,59 @@  rte_vfio_noiommu_is_enabled(void)
 	return c == 'Y';
 }
 
+int
+rte_vfio_dma_map(int container_fd, int dma_type,
+	const struct rte_memseg *ms)
+{
+
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_map(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma map for SPAPR type not support yet.");
+			return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int
+rte_vfio_dma_unmap(int container_fd, int dma_type,
+	const struct rte_memseg *ms)
+{
+	if (dma_type == RTE_VFIO_TYPE1) {
+		return do_vfio_type1_dma_unmap(container_fd, ms);
+	} else if (dma_type == RTE_VFIO_SPAPR) {
+		RTE_LOG(ERR, EAL,
+			"Additional dma unmap for SPAPR type not support yet.");
+			return -1;
+	} else if (dma_type == RTE_VFIO_NOIOMMU) {
+		return 0;
+	}
+
+	return -1;
+}
+
+int rte_vfio_get_group_fd(int iommu_group_no)
+{
+	struct vfio_config *vfio_cfg;
+	int i, j;
+
+	for (i = 0; i < VFIO_MAX_CONTAINERS; i++) {
+		vfio_cfg = vfio_cfgs[i];
+		if (!vfio_cfg)
+			continue;
+
+		for (j = 0; j < VFIO_MAX_GROUPS; j++) {
+			if (vfio_cfg->vfio_groups[j].group_no ==
+					iommu_group_no)
+				return vfio_cfg->vfio_groups[j].fd;
+		}
+	}
+
+	return -1;
+}
+
 #endif
diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.h b/lib/librte_eal/linuxapp/eal/eal_vfio.h
index 80595773e..716fe4551 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.h
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.h
@@ -157,6 +157,8 @@  int vfio_mp_sync_setup(void);
 #define SOCKET_NO_FD 0x1
 #define SOCKET_ERR 0xFF
 
+#define VFIO_MAX_CONTAINERS 256
+
 #endif /* VFIO_PRESENT */
 
 #endif /* EAL_VFIO_H_ */
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d12360235..fc78a1581 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -254,5 +254,12 @@  EXPERIMENTAL {
 	rte_service_set_runstate_mapped_check;
 	rte_service_set_stats_enable;
 	rte_service_start_with_defaults;
+	rte_vfio_create_container;
+	rte_vfio_destroy_container;
+	rte_vfio_bind_group_no;
+	rte_vfio_unbind_group_no;
+	rte_vfio_dma_map;
+	rte_vfio_dma_unmap;
+	rte_vfio_get_group_fd;
 
 } DPDK_18.02;