[PATCH] vfio: do not coalesce DMA mappings

Ding, Xuan xuan.ding at intel.com
Thu Jun 29 10:21:14 CEST 2023


Hi Nipun,

Thank you for taking the time to read this email.

Our QA team found that since commit "a399d7b5a994: do not coalesce DMA mappings" was introduced,
starting dpdk testpmd with the "--no-huge" parameter fails with "EAL: Cannot set up DMA remapping, error 28 (No space left on device)".
So they reported it on dpdk Bugzilla: https://bugs.dpdk.org/show_bug.cgi?id=1235.

I understand this feature keeps DPDK consistent with the kernel by not allowing memory segments to be merged.
The side effect is that testpmd with the "--no-huge" parameter can no longer start, because the large number of 4K pages exceeds the number of DMA mappings the IOMMU container allows.
Is this expected? Should we remove "--no-huge" from our test case?
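For context, the arithmetic behind the ENOSPC is roughly as follows (a back-of-the-envelope sketch; the 65535 limit is the default of the vfio_iommu_type1 module parameter dma_entry_limit, and the memory size here is just an illustrative figure, not what testpmd actually reserves):

```python
# Rough sketch: per-page VFIO mappings needed with --no-huge (4K pages)
# vs. the type1 IOMMU's default DMA entry limit.
PAGE_SIZE = 4096          # --no-huge uses ordinary 4K pages, not hugepages
DMA_ENTRY_LIMIT = 65535   # kernel default of vfio_iommu_type1.dma_entry_limit

def mappings_needed(mem_bytes: int) -> int:
    """Number of per-page VFIO_IOMMU_MAP_DMA calls for mem_bytes of memory."""
    return mem_bytes // PAGE_SIZE

# Mapping e.g. 1 GB page-by-page needs far more entries than the limit,
# so the kernel eventually returns -ENOSPC (error 28).
one_gb = 1 << 30
print(mappings_needed(one_gb))                     # 262144
print(mappings_needed(one_gb) > DMA_ENTRY_LIMIT)  # True
```

With 2MB hugepages the same memory needs 512x fewer entries, which is why the regression only shows up with "--no-huge".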

Regards,
Xuan

> -----Original Message-----
> From: Nipun Gupta <nipun.gupta at amd.com>
> Sent: Friday, December 30, 2022 5:59 PM
> To: dev at dpdk.org; thomas at monjalon.net; Burakov, Anatoly
> <anatoly.burakov at intel.com>; ferruh.yigit at amd.com
> Cc: nikhil.agarwal at amd.com; Nipun Gupta <nipun.gupta at amd.com>
> Subject: [PATCH] vfio: do not coalesce DMA mappings
> 
> At cleanup time, when the DMA unmap is done, the Linux kernel does not
> allow unmapping individual segments that were coalesced together while
> creating the DMA map for type1 IOMMU mappings. So, this change updates
> the mapping of the memory segments (hugepages) to be on a per-page basis.
> 
> Signed-off-by: Nipun Gupta <nipun.gupta at amd.com>
> ---
> 
> When hotplug of devices is used, multiple pages get coalesced and a single
> mapping gets created for these pages (using the APIs
> rte_memseg_contig_walk() and type1_map_contig()). At cleanup time, when
> the memory is released, VFIO does not clean up that memory and the
> following error is observed in the EAL for 2MB hugepages:
> EAL: Unexpected size 0 of DMA remapping cleared instead of 2097152
> 
> This is because VFIO does not clear the DMA mapping (refer to
> vfio_dma_do_unmap() -
> https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_type1.c#L1330),
> where it checks for the exact IOVA to free:
> https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_type1.c#L1418.
> 
> Thus this change updates the mappings to be created individually instead of
> coalescing them.
> 
>  lib/eal/linux/eal_vfio.c | 29 -----------------------------
>  1 file changed, 29 deletions(-)
> 
> diff --git a/lib/eal/linux/eal_vfio.c b/lib/eal/linux/eal_vfio.c
> index 549b86ae1d..56edccb0db 100644
> --- a/lib/eal/linux/eal_vfio.c
> +++ b/lib/eal/linux/eal_vfio.c
> @@ -1369,19 +1369,6 @@ rte_vfio_get_group_num(const char *sysfs_base,
>  	return 1;
>  }
> 
> -static int
> -type1_map_contig(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
> -		size_t len, void *arg)
> -{
> -	int *vfio_container_fd = arg;
> -
> -	if (msl->external)
> -		return 0;
> -
> -	return vfio_type1_dma_mem_map(*vfio_container_fd, ms->addr_64, ms->iova,
> -			len, 1);
> -}
> -
>  static int
>  type1_map(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
>  		void *arg)
> @@ -1396,10 +1383,6 @@ type1_map(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
>  	if (ms->iova == RTE_BAD_IOVA)
>  		return 0;
> 
> -	/* if IOVA mode is VA, we've already mapped the internal segments */
> -	if (!msl->external && rte_eal_iova_mode() == RTE_IOVA_VA)
> -		return 0;
> -
>  	return vfio_type1_dma_mem_map(*vfio_container_fd, ms->addr_64, ms->iova,
>  			ms->len, 1);
>  }
> @@ -1464,18 +1447,6 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
>  static int
>  vfio_type1_dma_map(int vfio_container_fd)
>  {
> -	if (rte_eal_iova_mode() == RTE_IOVA_VA) {
> -		/* with IOVA as VA mode, we can get away with mapping contiguous
> -		 * chunks rather than going page-by-page.
> -		 */
> -		int ret = rte_memseg_contig_walk(type1_map_contig,
> -				&vfio_container_fd);
> -		if (ret)
> -			return ret;
> -		/* we have to continue the walk because we've skipped the
> -		 * external segments during the config walk.
> -		 */
> -	}
>  	return rte_memseg_walk(type1_map, &vfio_container_fd);
>  }
> 
> --
> 2.25.1
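P.S. The kernel-side behaviour the patch works around can be modeled roughly like this (a simplified sketch only, not the actual vfio_iommu_type1.c logic): the type1 IOMMU records one entry per VFIO_IOMMU_MAP_DMA call, and an unmap request must match a recorded entry, so a single coalesced mapping cannot be torn down in per-segment pieces.

```python
# Simplified model of type1 IOMMU map/unmap bookkeeping (illustrative;
# the real checks live in vfio_dma_do_unmap() in vfio_iommu_type1.c).
class Type1Iommu:
    def __init__(self):
        self.dma = {}  # iova -> size, one entry per MAP_DMA call

    def map_dma(self, iova, size):
        self.dma[iova] = size

    def unmap_dma(self, iova, size):
        # The kernel looks up a recorded mapping for the requested range;
        # a sub-range of a larger coalesced mapping is not found.
        if self.dma.get(iova) != size:
            return 0           # nothing unmapped
        del self.dma[iova]
        return size            # bytes actually unmapped

# Coalesced: two 2MB hugepages mapped as one 4MB entry.
iommu = Type1Iommu()
iommu.map_dma(0x100000000, 4 << 20)
# Per-page cleanup then unmaps nothing, which EAL reports as
# "Unexpected size 0 of DMA remapping cleared instead of 2097152".
print(iommu.unmap_dma(0x100000000, 2 << 20))  # 0

# Per-page mapping (what the patch switches to) unmaps cleanly.
iommu2 = Type1Iommu()
iommu2.map_dma(0x100000000, 2 << 20)
iommu2.map_dma(0x100200000, 2 << 20)
print(iommu2.unmap_dma(0x100000000, 2 << 20))  # 2097152
```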


