[PATCH] vfio: do not coalesce DMA mappings

Ding, Xuan xuan.ding at intel.com
Fri Jun 30 07:58:25 CEST 2023


Hi Nipun,

Replies are inline.

> -----Original Message-----
> From: Nipun Gupta <nipun.gupta at amd.com>
> Sent: Friday, June 30, 2023 9:46 AM
> To: Ding, Xuan <xuan.ding at intel.com>; dev at dpdk.org;
> thomas at monjalon.net; Burakov, Anatoly <anatoly.burakov at intel.com>;
> ferruh.yigit at amd.com
> Cc: nikhil.agarwal at amd.com; He, Xingguang <xingguang.he at intel.com>; Ling,
> WeiX <weix.ling at intel.com>
> Subject: Re: [PATCH] vfio: do not coalesce DMA mappings
> 
> Hi Xuan,
> 
> Thanks for pointing out the issue and identifying the patch that introduced
> it. If you have answers to the queries below, please let me know:
> 
> Are there any other test cases that test "--no-huge" and pass?

Yes, there are test cases that add the "--no-huge" option to validate the 4K page size in async vhost.
Actually, the page size is decided by the front end, so I think this case can be removed.

Previously, testpmd could start with the "--no-huge" option (I am not sure whether there are dedicated test cases).
Cmd: ./build/app/dpdk-testpmd -l 5-6 -n 4 --no-huge -m 1024 -- -i

> 
> Also, if we change the "-m" option to provide lower memory, does the test
> pass?

"-m" option is also added and does not work.

> 
> When you mention that too many pages exceed the capability of the IOMMU, are
> you referring to the HW capability to create multiple mappings? Here it seems
> that with a 4K page size we need 256K pages, which exceeds the capacity?

Yes, that matches the result of my initial debugging.
The direct impact is that these testpmd cases cannot start now.
If this is expected, I think we can close this defect and drop the "--no-huge" option from the test cases.
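If we do want to keep the "--no-huge" cases, one possible workaround (a sketch, assuming the vfio_iommu_type1 module exposes its dma_entry_limit parameter, which I believe is true on recent upstream kernels) is to raise the per-container mapping limit:

```shell
# Sketch: inspect and (as root) raise the vfio_iommu_type1 per-container
# DMA mapping limit. ASSUMPTION: the parameter path below exists on your
# kernel; it was added upstream around v5.1 and may differ on distro kernels.
param=/sys/module/vfio_iommu_type1/parameters/dma_entry_limit

if [ -r "$param" ]; then
    echo "current dma_entry_limit: $(cat "$param")"
else
    echo "parameter not exposed; kernel likely uses the built-in default"
fi

# 1 GiB of 4 KiB pages needs this many per-page mappings:
echo "mappings needed for --no-huge -m 1024: $(( 1024 * 1024 * 1024 / 4096 ))"

# To raise the limit (affects newly opened containers only, needs root):
# echo 524288 > "$param"
```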

Regards,
Xuan

> 
> Regards,
> Nipun
> 
> On 6/29/2023 1:51 PM, Ding, Xuan wrote:
> > Hi Nipun,
> >
> > Thank you for taking the time to read this email.
> >
> > Our QA team found that since commit "a399d7b5a994: vfio: do not
> > coalesce DMA mappings" was introduced, dpdk-testpmd started with the
> > "--no-huge" parameter fails with "EAL: Cannot set up DMA remapping,
> > error 28 (No space left on device)".
> > So they reported it on the DPDK Bugzilla:
> > https://bugs.dpdk.org/show_bug.cgi?id=1235.
> >
> > I understand this feature is meant to keep consistent with the kernel
> > and not allow memory segments to be merged.
> > The side effect is that testpmd with the "--no-huge" parameter will not
> > be able to start, because the large number of pages exceeds the
> > capability of the IOMMU.
> > Is this expected? Should we remove "--no-huge" from our test case?
> >
> > Regards,
> > Xuan
> >
> >> -----Original Message-----
> >> From: Nipun Gupta <nipun.gupta at amd.com>
> >> Sent: Friday, December 30, 2022 5:59 PM
> >> To: dev at dpdk.org; thomas at monjalon.net; Burakov, Anatoly
> >> <anatoly.burakov at intel.com>; ferruh.yigit at amd.com
> >> Cc: nikhil.agarwal at amd.com; Nipun Gupta <nipun.gupta at amd.com>
> >> Subject: [PATCH] vfio: do not coalesce DMA mappings
> >>
> >> At cleanup time, when the DMA unmap is done, the Linux kernel does not
> >> allow unmapping of individual segments which were coalesced together
> >> while creating the DMA map for type1 IOMMU mappings. So, this change
> >> updates the mapping of the memory segments (hugepages) to be on a
> >> per-page basis.
> >>
> >> Signed-off-by: Nipun Gupta <nipun.gupta at amd.com>
> >> ---
> >>
> >> When hotplug of devices is used, multiple pages get coalesced and a
> >> single mapping gets created for these pages (using the APIs
> >> rte_memseg_contig_walk() and type1_map_contig()). At cleanup time,
> >> when the memory is released, VFIO does not clean up that memory and
> >> the following error is observed in the EAL for 2MB hugepages:
> >> EAL: Unexpected size 0 of DMA remapping cleared instead of 2097152
> >>
> >> This is because VFIO does not clear the DMA mapping (refer to the API
> >> vfio_dma_do_unmap() -
> >> https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_type1.c#L1330),
> >> where it checks for the exact IOVA to free:
> >> https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_type1.c#L1418.
> >>
> >> Thus this change updates the mappings to be created individually
> >> instead of coalescing them.
> >>
> >>   lib/eal/linux/eal_vfio.c | 29 -----------------------------
> >>   1 file changed, 29 deletions(-)
> >>
> >> diff --git a/lib/eal/linux/eal_vfio.c b/lib/eal/linux/eal_vfio.c
> >> index 549b86ae1d..56edccb0db 100644
> >> --- a/lib/eal/linux/eal_vfio.c
> >> +++ b/lib/eal/linux/eal_vfio.c
> >> @@ -1369,19 +1369,6 @@ rte_vfio_get_group_num(const char *sysfs_base,
> >>   	return 1;
> >>   }
> >>
> >> -static int
> >> -type1_map_contig(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
> >> -		size_t len, void *arg)
> >> -{
> >> -	int *vfio_container_fd = arg;
> >> -
> >> -	if (msl->external)
> >> -		return 0;
> >> -
> >> -	return vfio_type1_dma_mem_map(*vfio_container_fd, ms->addr_64, ms->iova,
> >> -			len, 1);
> >> -}
> >> -
> >>   static int
> >>   type1_map(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
> >>   		void *arg)
> >> @@ -1396,10 +1383,6 @@ type1_map(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
> >>   	if (ms->iova == RTE_BAD_IOVA)
> >>   		return 0;
> >>
> >> -	/* if IOVA mode is VA, we've already mapped the internal segments */
> >> -	if (!msl->external && rte_eal_iova_mode() == RTE_IOVA_VA)
> >> -		return 0;
> >> -
> >>   	return vfio_type1_dma_mem_map(*vfio_container_fd, ms->addr_64, ms->iova,
> >>   			ms->len, 1);
> >>   }
> >> @@ -1464,18 +1447,6 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
> >>   static int
> >>   vfio_type1_dma_map(int vfio_container_fd)
> >>   {
> >> -	if (rte_eal_iova_mode() == RTE_IOVA_VA) {
> >> -		/* with IOVA as VA mode, we can get away with mapping contiguous
> >> -		 * chunks rather than going page-by-page.
> >> -		 */
> >> -		int ret = rte_memseg_contig_walk(type1_map_contig,
> >> -				&vfio_container_fd);
> >> -		if (ret)
> >> -			return ret;
> >> -		/* we have to continue the walk because we've skipped the
> >> -		 * external segments during the config walk.
> >> -		 */
> >> -	}
> >>   	return rte_memseg_walk(type1_map, &vfio_container_fd);
> >>   }
> >>
> >> --
> >> 2.25.1
> >