[v2] kni: add IOVA va support for kni

Message ID 20190401095118.4176-1-kirankumark@marvell.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers
Series [v2] kni: add IOVA va support for kni

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/mellanox-Performance-Testing success Performance Testing PASS
ci/intel-Performance-Testing success Performance Testing PASS

Commit Message

Kiran Kumar Kokkilagadda April 1, 2019, 9:51 a.m. UTC
  From: Kiran Kumar K <kirankumark@marvell.com>

With the current KNI implementation, the kernel module works only in
IOVA=PA mode. This patch adds support for the kernel module to work
in IOVA=VA mode.

The idea is to maintain a mapping in the KNI module between user pages and
kernel pages, and in the fast path perform a lookup in this table to get
the kernel virtual address for the corresponding user virtual address.

In IOVA=VA mode, the memory allocated to the pool is physically
and virtually contiguous. We will take advantage of this and create a
mapping in the kernel. In the kernel we need a mapping for the queues
(tx_q, rx_q, ... slow path) and the mbuf memory (fast path).

At KNI init time, in the slow path we will create a mapping for the
queues and mbufs using get_user_pages, similar to af_xdp. Using the pool
memory base address, we will create a page map table for the mbufs,
which we will use in the fast path for kernel page translation.

At KNI init time, we will pass the base address of the pool and the size of
the pool to the kernel. In the kernel, using the get_user_pages API, we will
get the pages with size PAGE_SIZE and store the mapping and the start address
of the user space in a table.

In the fast path, for any user address we perform a PAGE_SHIFT
(user_addr >> PAGE_SHIFT) and subtract the start page from this value;
this gives the index of the kernel page within the page map table.
Adding the offset to this kernel page address gives the kernel address
for this user virtual address.

For example, the user pool base address is X and the size is S, which we pass
to the kernel. In the kernel we will create a mapping for this using
get_user_pages. Our page map table will look like
[Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2), ...] and the user start page will be U
(obtained from X >> PAGE_SHIFT).

For any user address Z, the index into the page map table is
((Z >> PAGE_SHIFT) - U). Adding the offset (Z & (PAGE_SIZE - 1)) to this
page address gives the kernel virtual address.
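
A minimal sketch of this translation in C (illustrative only; the names
start_page and page_map mirror the description above, and the patch's
get_kva() below implements the same arithmetic):

/* Translate a user virtual address to a kernel virtual address using
 * the prebuilt page map table (sketch, not part of the patch).
 */
static inline void *
usr_to_kva(uint64_t usr_addr, uint64_t start_page, void **page_map)
{
	/* Index of this user page within the page map table */
	uint64_t index = (usr_addr >> PAGE_SHIFT) - start_page;

	/* Kernel page address plus the offset within the page */
	return (char *)page_map[index] + (usr_addr & (PAGE_SIZE - 1));
}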

Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
---

V2 changes:
* Fixed build issue with older kernel

 kernel/linux/kni/kni_dev.h                    |  37 +++
 kernel/linux/kni/kni_misc.c                   | 215 +++++++++++++++++-
 kernel/linux/kni/kni_net.c                    | 114 ++++++++--
 .../eal/include/exec-env/rte_kni_common.h     |   8 +
 lib/librte_kni/rte_kni.c                      |  21 ++
 5 files changed, 369 insertions(+), 26 deletions(-)

--
2.17.1
  

Comments

Ferruh Yigit April 3, 2019, 4:29 p.m. UTC | #1
On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
> From: Kiran Kumar K <kirankumark@marvell.com>
> 
> With current KNI implementation kernel module will work only in
> IOVA=PA mode. This patch will add support for kernel module to work
> with IOVA=VA mode.

Thanks Kiran for removing the limitation. I have a few questions; can you please
help me understand?

And when this patch is ready, the restriction in 'linux/eal/eal.c', in
'rte_eal_init', should be removed, perhaps with this patch. I assume you are
already doing that to be able to test this patch.

> 
> The idea is to maintain a mapping in KNI module between user pages and
> kernel pages and in fast path perform a lookup in this table and get
> the kernel virtual address for corresponding user virtual address.
> 
> In IOVA=VA mode, the memory allocated to the pool is physically
> and virtually contiguous. We will take advantage of this and create a
> mapping in the kernel.In kernel we need mapping for queues
> (tx_q, rx_q,... slow path) and mbuf memory (fast path).

Is it?
As far as I know a mempool can have multiple chunks and they can be both virtually
and physically separated.

And even for a single chunk, that will be virtually contiguous, but will it be
physically contiguous?

> 
> At the KNI init time, in slow path we will create a mapping for the
> queues and mbuf using get_user_pages similar to af_xdp. Using pool
> memory base address, we will create a page map table for the mbuf,
> which we will use in the fast path for kernel page translation.
> 
> At KNI init time, we will pass the base address of the pool and size of
> the pool to kernel. In kernel, using get_user_pages API, we will get
> the pages with size PAGE_SIZE and store the mapping and start address
> of user space in a table.
> 
> In fast path for any user address perform PAGE_SHIFT
> (user_addr >> PAGE_SHIFT) and subtract the start address from this value,
> we will get the index of the kernel page with in the page map table.
> Adding offset to this kernel page address, we will get the kernel address
> for this user virtual address.
> 
> For example user pool base address is X, and size is S that we passed to
> kernel. In kernel we will create a mapping for this using get_user_pages.
> Our page map table will look like [Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2) ....]
> and user start page will be U (we will get it from X >> PAGE_SHIFT).
> 
> For any user address Z we will get the index of the page map table using
> ((Z >> PAGE_SHIFT) - U). Adding offset (Z & (PAGE_SIZE - 1)) to this
> address will give kernel virtual address.
> 
> Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>

<...>

> +int
> +kni_pin_pages(void *address, size_t size, struct page_info *mem)
> +{
> +	unsigned int gup_flags = FOLL_WRITE;
> +	long npgs;
> +	int err;
> +
> +	/* Get at least one page */
> +	if (size < PAGE_SIZE)
> +		size = PAGE_SIZE;
> +
> +	/* Compute number of user pages based on page size */
> +	mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
> +
> +	/* Allocate memory for the pages */
> +	mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
> +		      GFP_KERNEL | __GFP_NOWARN);
> +	if (!mem->pgs) {
> +		pr_err("%s: -ENOMEM\n", __func__);
> +		return -ENOMEM;
> +	}
> +
> +	down_write(&current->mm->mmap_sem);
> +
> +	/* Get the user pages from the user address*/
> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,9,0)
> +	npgs = get_user_pages((u64)address, mem->npgs,
> +				gup_flags, &mem->pgs[0], NULL);
> +#else
> +	npgs = get_user_pages(current, current->mm, (u64)address, mem->npgs,
> +				gup_flags, 0, &mem->pgs[0], NULL);
> +#endif
> +	up_write(&current->mm->mmap_sem);

This should work even if the memory is not physically contiguous, right? Where
exactly does the physically contiguous requirement come from?

<...>

> +
> +/* Get the kernel address from the user address using
> + * page map table. Will be used only in IOVA=VA mode
> + */
> +static inline void*
> +get_kva(uint64_t usr_addr, struct kni_dev *kni)
> +{
> +	uint32_t index;
> +	/* User page - start user page will give the index
> +	 * with in the page map table
> +	 */
> +	index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
> +
> +	/* Add the offset to the page address */
> +	return (kni->va_info.page_map[index].addr +
> +		(usr_addr & kni->va_info.page_mask));
> +
> +}
> +
>  /* physical address to kernel virtual address */
>  static void *
>  pa2kva(void *pa)
> @@ -186,7 +205,10 @@ kni_fifo_trans_pa2va(struct kni_dev *kni,
>  			return;
> 
>  		for (i = 0; i < num_rx; i++) {
> -			kva = pa2kva(kni->pa[i]);
> +			if (likely(kni->iova_mode == 1))
> +				kva = get_kva((u64)(kni->pa[i]), kni);

kni->pa[] now has IOVA addresses; for 'get_kva()' to work, shouldn't
'va_info.start_page' be calculated from 'mempool_memhdr->iova' instead of
'mempool_memhdr->addr'?

If this is working I must be missing something, but I am not able to find what it is.

<...>

> @@ -304,6 +304,27 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
>  	kni->group_id = conf->group_id;
>  	kni->mbuf_size = conf->mbuf_size;
> 
> +	dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
> +	if (dev_info.iova_mode) {
> +		struct rte_mempool_memhdr *hdr;
> +		uint64_t pool_size = 0;
> +
> +		/* In each pool header chunk, we will maintain the
> +		 * base address of the pool. This chunk is physically and
> +		 * virtually contiguous.
> +		 * This approach will work, only if the allocated pool
> +		 * memory is contiguous, else it won't work
> +		 */
> +		hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
> +		dev_info.mbuf_va = (void *)(hdr->addr);
> +
> +		/* Traverse the list and get the total size of the pool */
> +		STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
> +			pool_size += hdr->len;
> +		}

This code is aware that there may be multiple chunks, but assumes they are all
contiguous; I don't know if this assumption is correct.

Also I guess there is another assumption that there will be a single pktmbuf_pool
in the application, which is passed into KNI?
What if there are multiple pktmbuf_pools, like one for each PMD, will this work?
Then some mbufs in the KNI Rx fifo will come from a different pktmbuf_pool whose
pages we don't know, so we won't be able to get their kernel virtual address.
  
Kiran Kumar Kokkilagadda April 4, 2019, 5:03 a.m. UTC | #2
> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Wednesday, April 3, 2019 9:59 PM
> To: Kiran Kumar Kokkilagadda <kirankumark@marvell.com>
> Cc: dev@dpdk.org; Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Subject: [EXT] Re: [dpdk-dev] [PATCH v2] kni: add IOVA va support for kni
> 
> External Email
> 
> ----------------------------------------------------------------------
> On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
> > From: Kiran Kumar K <kirankumark@marvell.com>
> >
> > With current KNI implementation kernel module will work only in
> > IOVA=PA mode. This patch will add support for kernel module to work
> > with IOVA=VA mode.
> 
> Thanks Kiran for removing the limitation, I have a few questions, can you please
> help me understand.
> 
> And when this patch is ready, the restriction in 'linux/eal/eal.c', in 'rte_eal_init'
> should be removed, perhaps with this patch. I assume you already doing it to be
> able to test this patch.
> 

The user can choose the mode by passing --iova-mode=<va/pa>. I will remove the rte_kni module restriction.
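
For example (the application name and the core/memory options here are only
illustrative; the point is the EAL --iova-mode flag mentioned above):

./testpmd -l 0-3 -n 4 --iova-mode=va -- -i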
 

> >
> > The idea is to maintain a mapping in KNI module between user pages and
> > kernel pages and in fast path perform a lookup in this table and get
> > the kernel virtual address for corresponding user virtual address.
> >
> > In IOVA=VA mode, the memory allocated to the pool is physically and
> > virtually contiguous. We will take advantage of this and create a
> > mapping in the kernel.In kernel we need mapping for queues (tx_q,
> > rx_q,... slow path) and mbuf memory (fast path).
> 
> Is it?
> As far as I know mempool can have multiple chunks and they can be both
> virtually and physically separated.
> 
> And even for a single chunk, that will be virtually continuous, but will it be
> physically continuous?
> 

You are right, it does not have to be physically contiguous. I will change the description.
 

> >
> > At the KNI init time, in slow path we will create a mapping for the
> > queues and mbuf using get_user_pages similar to af_xdp. Using pool
> > memory base address, we will create a page map table for the mbuf,
> > which we will use in the fast path for kernel page translation.
> >
> > At KNI init time, we will pass the base address of the pool and size
> > of the pool to kernel. In kernel, using get_user_pages API, we will
> > get the pages with size PAGE_SIZE and store the mapping and start
> > address of user space in a table.
> >
> > In fast path for any user address perform PAGE_SHIFT (user_addr >>
> > PAGE_SHIFT) and subtract the start address from this value, we will
> > get the index of the kernel page with in the page map table.
> > Adding offset to this kernel page address, we will get the kernel
> > address for this user virtual address.
> >
> > For example user pool base address is X, and size is S that we passed
> > to kernel. In kernel we will create a mapping for this using get_user_pages.
> > Our page map table will look like [Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2)
> > ....] and user start page will be U (we will get it from X >> PAGE_SHIFT).
> >
> > For any user address Z we will get the index of the page map table
> > using ((Z >> PAGE_SHIFT) - U). Adding offset (Z & (PAGE_SIZE - 1)) to
> > this address will give kernel virtual address.
> >
> > Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
> 
> <...>
> 
> > +int
> > +kni_pin_pages(void *address, size_t size, struct page_info *mem) {
> > +	unsigned int gup_flags = FOLL_WRITE;
> > +	long npgs;
> > +	int err;
> > +
> > +	/* Get at least one page */
> > +	if (size < PAGE_SIZE)
> > +		size = PAGE_SIZE;
> > +
> > +	/* Compute number of user pages based on page size */
> > +	mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
> > +
> > +	/* Allocate memory for the pages */
> > +	mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
> > +		      GFP_KERNEL | __GFP_NOWARN);
> > +	if (!mem->pgs) {
> > +		pr_err("%s: -ENOMEM\n", __func__);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	down_write(&current->mm->mmap_sem);
> > +
> > +	/* Get the user pages from the user address*/ #if
> LINUX_VERSION_CODE
> > +>= KERNEL_VERSION(4,9,0)
> > +	npgs = get_user_pages((u64)address, mem->npgs,
> > +				gup_flags, &mem->pgs[0], NULL);
> > +#else
> > +	npgs = get_user_pages(current, current->mm, (u64)address, mem-
> >npgs,
> > +				gup_flags, 0, &mem->pgs[0], NULL); #endif
> > +	up_write(&current->mm->mmap_sem);
> 
> This should work even memory is physically not continuous, right? Where
> exactly physically continuous requirement is coming from?
> 

Yes, it is not necessary to be physically contiguous.

 


> <...>
> 
> > +
> > +/* Get the kernel address from the user address using
> > + * page map table. Will be used only in IOVA=VA mode  */ static
> > +inline void* get_kva(uint64_t usr_addr, struct kni_dev *kni) {
> > +	uint32_t index;
> > +	/* User page - start user page will give the index
> > +	 * with in the page map table
> > +	 */
> > +	index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
> > +
> > +	/* Add the offset to the page address */
> > +	return (kni->va_info.page_map[index].addr +
> > +		(usr_addr & kni->va_info.page_mask));
> > +
> > +}
> > +
> >  /* physical address to kernel virtual address */  static void *
> > pa2kva(void *pa) @@ -186,7 +205,10 @@ kni_fifo_trans_pa2va(struct
> > kni_dev *kni,
> >  			return;
> >
> >  		for (i = 0; i < num_rx; i++) {
> > -			kva = pa2kva(kni->pa[i]);
> > +			if (likely(kni->iova_mode == 1))
> > +				kva = get_kva((u64)(kni->pa[i]), kni);
> 
> kni->pa[] now has iova addresses, for 'get_kva()' to work shouldn't
> 'va_info.start_page' calculated from 'mempool_memhdr->iova' instead of
> 'mempool_memhdr->addr'
> 
> If this is working I must be missing something but not able to find what it is.
> 
> <...>
> 
 
In IOVA=VA mode, both values will be the same, right?


> > @@ -304,6 +304,27 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
> >  	kni->group_id = conf->group_id;
> >  	kni->mbuf_size = conf->mbuf_size;
> >
> > +	dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
> > +	if (dev_info.iova_mode) {
> > +		struct rte_mempool_memhdr *hdr;
> > +		uint64_t pool_size = 0;
> > +
> > +		/* In each pool header chunk, we will maintain the
> > +		 * base address of the pool. This chunk is physically and
> > +		 * virtually contiguous.
> > +		 * This approach will work, only if the allocated pool
> > +		 * memory is contiguous, else it won't work
> > +		 */
> > +		hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
> > +		dev_info.mbuf_va = (void *)(hdr->addr);
> > +
> > +		/* Traverse the list and get the total size of the pool */
> > +		STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
> > +			pool_size += hdr->len;
> > +		}
> 
> This code is aware that there may be multiple chunks, but assumes they are all
> continuous, I don't know if this assumption is correct.
> 
> Also I guess there is another assumption that there will be single pktmbuf_pool
> in the application which passed into kni?
> What if there are multiple pktmbuf_pool, like one for each PMD, will this work?
> Now some mbufs in kni Rx fifo will come from different pktmbuf_pool which we
> don't know their pages, so won't able to get their kernel virtual address.

All these chunks have to be virtually contiguous, otherwise this approach will not work. One thing we can do here is create the mapping for the complete huge page and make sure all the mbuf mempools fall within this range. As the huge page size is large (1G on x86_64, 512MB on ARM), we may not exceed that size even with multiple pktmbuf_pools. If mbuf pools were created outside this huge page, this approach will not work and we will fall back to PA mode.
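
A rough sketch of that fallback check (hypothetical helper, not part of this
patch; it only illustrates testing whether every chunk of the pool lies inside
one pre-mapped, virtually contiguous region):

/* Return 1 if every chunk of the pool fits inside the mapped region
 * [map_base, map_base + map_len); otherwise the caller would fall back
 * to IOVA=PA mode.
 */
static int
kni_pool_within_mapping(struct rte_mempool *mp,
			uint64_t map_base, uint64_t map_len)
{
	struct rte_mempool_memhdr *hdr;

	STAILQ_FOREACH(hdr, &mp->mem_list, next) {
		uint64_t start = (uint64_t)hdr->addr;

		if (start < map_base ||
		    start + hdr->len > map_base + map_len)
			return 0; /* chunk outside the mapping */
	}
	return 1; /* the page map can cover the whole pool */
}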
  
Burakov, Anatoly April 4, 2019, 9:57 a.m. UTC | #3
On 03-Apr-19 5:29 PM, Ferruh Yigit wrote:
> On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
>> From: Kiran Kumar K <kirankumark@marvell.com>
>>
>> With current KNI implementation kernel module will work only in
>> IOVA=PA mode. This patch will add support for kernel module to work
>> with IOVA=VA mode.
> 
> Thanks Kiran for removing the limitation, I have a few questions, can you please
> help me understand.
> 
> And when this patch is ready, the restriction in 'linux/eal/eal.c', in
> 'rte_eal_init' should be removed, perhaps with this patch. I assume you already
> doing it to be able to test this patch.
> 
>>
>> The idea is to maintain a mapping in KNI module between user pages and
>> kernel pages and in fast path perform a lookup in this table and get
>> the kernel virtual address for corresponding user virtual address.
>>
>> In IOVA=VA mode, the memory allocated to the pool is physically
>> and virtually contiguous. We will take advantage of this and create a
>> mapping in the kernel.In kernel we need mapping for queues
>> (tx_q, rx_q,... slow path) and mbuf memory (fast path).
> 
> Is it?
> As far as I know mempool can have multiple chunks and they can be both virtually
> and physically separated.
> 
> And even for a single chunk, that will be virtually continuous, but will it be
> physically continuous?

Just to clarify.

Within DPDK, we do not make a distinction between physical address and 
IOVA address - we never need the actual physical address, we just need 
the DMA addresses, which can either match the physical address, or be 
completely arbitrary (in our case, they will match VA addresses, but it 
doesn't have to be the case - in fact, 17.11 will, under some 
circumstances, populate IOVA addresses simply starting from address 0).

However, one has to remember that IOVA address *is not a physical 
address*. The pages backing a VA chunk may be *IOVA*-contiguous, but may 
not necessarily be *physically* contiguous. Under normal circumstances 
we really don't care, because the VFIO/IOMMU takes care of the mapping 
between IOVA and PA transparently for the hardware.

So, in IOVA as VA mode, the memory allocated to the mempool will be 
(within a given chunk) both VA and IOVA contiguous - but not necessarily 
*physically* contiguous! In fact, if you try calling rte_mem_virt2phy() 
on the mempool pages, you'll likely find that they aren't (I've seen 
cases where pages were mapped /backwards/!).

Therefore, unless by "physically contiguous" you mean "IOVA-contiguous", 
you *cannot* rely on memory in mempool being *physically* contiguous 
merely based on the fact that it's IOVA-contiguous.
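
For illustration, a minimal sketch (not part of the patch) that walks a
pktmbuf pool's chunks and prints the VA, IOVA and PA of each page using the
existing rte_mempool_mem_iter()/rte_mem_virt2iova()/rte_mem_virt2phy() APIs;
in IOVA=VA mode the VA and IOVA columns should match, while the PA column
typically will not:

#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>
#include <rte_common.h>
#include <rte_memory.h>
#include <rte_mempool.h>

static void
dump_chunk(struct rte_mempool *mp __rte_unused, void *opaque __rte_unused,
	   struct rte_mempool_memhdr *memhdr, unsigned int mem_idx)
{
	size_t off;

	/* Walk the chunk page by page and print each address flavour */
	for (off = 0; off < memhdr->len; off += (size_t)getpagesize()) {
		const void *va = RTE_PTR_ADD(memhdr->addr, off);

		printf("chunk %u: va=%p iova=0x%" PRIx64 " pa=0x%" PRIx64 "\n",
		       mem_idx, va,
		       (uint64_t)rte_mem_virt2iova(va),
		       (uint64_t)rte_mem_virt2phy(va));
	}
}

/* usage: rte_mempool_mem_iter(pktmbuf_pool, dump_chunk, NULL); */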

> 
>>
>> At the KNI init time, in slow path we will create a mapping for the
>> queues and mbuf using get_user_pages similar to af_xdp. Using pool
>> memory base address, we will create a page map table for the mbuf,
>> which we will use in the fast path for kernel page translation.
>>
>> At KNI init time, we will pass the base address of the pool and size of
>> the pool to kernel. In kernel, using get_user_pages API, we will get
>> the pages with size PAGE_SIZE and store the mapping and start address
>> of user space in a table.
>>
>> In fast path for any user address perform PAGE_SHIFT
>> (user_addr >> PAGE_SHIFT) and subtract the start address from this value,
>> we will get the index of the kernel page with in the page map table.
>> Adding offset to this kernel page address, we will get the kernel address
>> for this user virtual address.
>>
>> For example user pool base address is X, and size is S that we passed to
>> kernel. In kernel we will create a mapping for this using get_user_pages.
>> Our page map table will look like [Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2) ....]
>> and user start page will be U (we will get it from X >> PAGE_SHIFT).
>>
>> For any user address Z we will get the index of the page map table using
>> ((Z >> PAGE_SHIFT) - U). Adding offset (Z & (PAGE_SIZE - 1)) to this
>> address will give kernel virtual address.
>>
>> Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
> 
> <...>
> 
>> +int
>> +kni_pin_pages(void *address, size_t size, struct page_info *mem)
>> +{
>> +	unsigned int gup_flags = FOLL_WRITE;
>> +	long npgs;
>> +	int err;
>> +
>> +	/* Get at least one page */
>> +	if (size < PAGE_SIZE)
>> +		size = PAGE_SIZE;
>> +
>> +	/* Compute number of user pages based on page size */
>> +	mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
>> +
>> +	/* Allocate memory for the pages */
>> +	mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
>> +		      GFP_KERNEL | __GFP_NOWARN);
>> +	if (!mem->pgs) {
>> +		pr_err("%s: -ENOMEM\n", __func__);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	down_write(&current->mm->mmap_sem);
>> +
>> +	/* Get the user pages from the user address*/
>> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,9,0)
>> +	npgs = get_user_pages((u64)address, mem->npgs,
>> +				gup_flags, &mem->pgs[0], NULL);
>> +#else
>> +	npgs = get_user_pages(current, current->mm, (u64)address, mem->npgs,
>> +				gup_flags, 0, &mem->pgs[0], NULL);
>> +#endif
>> +	up_write(&current->mm->mmap_sem);
> 
> This should work even memory is physically not continuous, right? Where exactly
> physically continuous requirement is coming from?
> 
> <...>
> 
>> +
>> +/* Get the kernel address from the user address using
>> + * page map table. Will be used only in IOVA=VA mode
>> + */
>> +static inline void*
>> +get_kva(uint64_t usr_addr, struct kni_dev *kni)
>> +{
>> +	uint32_t index;
>> +	/* User page - start user page will give the index
>> +	 * with in the page map table
>> +	 */
>> +	index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
>> +
>> +	/* Add the offset to the page address */
>> +	return (kni->va_info.page_map[index].addr +
>> +		(usr_addr & kni->va_info.page_mask));
>> +
>> +}
>> +
>>   /* physical address to kernel virtual address */
>>   static void *
>>   pa2kva(void *pa)
>> @@ -186,7 +205,10 @@ kni_fifo_trans_pa2va(struct kni_dev *kni,
>>   			return;
>>
>>   		for (i = 0; i < num_rx; i++) {
>> -			kva = pa2kva(kni->pa[i]);
>> +			if (likely(kni->iova_mode == 1))
>> +				kva = get_kva((u64)(kni->pa[i]), kni);
> 
> kni->pa[] now has iova addresses, for 'get_kva()' to work shouldn't
> 'va_info.start_page' calculated from 'mempool_memhdr->iova' instead of
> 'mempool_memhdr->addr'
> 
> If this is working I must be missing something but not able to find what it is.
> 
> <...>
> 
>> @@ -304,6 +304,27 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
>>   	kni->group_id = conf->group_id;
>>   	kni->mbuf_size = conf->mbuf_size;
>>
>> +	dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
>> +	if (dev_info.iova_mode) {
>> +		struct rte_mempool_memhdr *hdr;
>> +		uint64_t pool_size = 0;
>> +
>> +		/* In each pool header chunk, we will maintain the
>> +		 * base address of the pool. This chunk is physically and
>> +		 * virtually contiguous.
>> +		 * This approach will work, only if the allocated pool
>> +		 * memory is contiguous, else it won't work
>> +		 */
>> +		hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
>> +		dev_info.mbuf_va = (void *)(hdr->addr);
>> +
>> +		/* Traverse the list and get the total size of the pool */
>> +		STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
>> +			pool_size += hdr->len;
>> +		}
> 
> This code is aware that there may be multiple chunks, but assumes they are all
> continuous, I don't know if this assumption is correct.

It is in fact incorrect - mempool chunks can be located anywhere in memory.

> 
> Also I guess there is another assumption that there will be single pktmbuf_pool
> in the application which passed into kni?
> What if there are multiple pktmbuf_pool, like one for each PMD, will this work?
> Now some mbufs in kni Rx fifo will come from different pktmbuf_pool which we
> don't know their pages, so won't able to get their kernel virtual address.
>
  
Ferruh Yigit April 4, 2019, 11:20 a.m. UTC | #4
On 4/4/2019 6:03 AM, Kiran Kumar Kokkilagadda wrote:
> 
> 
>> -----Original Message-----
>> From: Ferruh Yigit <ferruh.yigit@intel.com>
>> Sent: Wednesday, April 3, 2019 9:59 PM
>> To: Kiran Kumar Kokkilagadda <kirankumark@marvell.com>
>> Cc: dev@dpdk.org; Jerin Jacob <jerin.jacob@caviumnetworks.com>
>> Subject: [EXT] Re: [dpdk-dev] [PATCH v2] kni: add IOVA va support for kni
>>
>> External Email
>>
>> ----------------------------------------------------------------------
>> On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
>>> From: Kiran Kumar K <kirankumark@marvell.com>
>>>
>>> With current KNI implementation kernel module will work only in
>>> IOVA=PA mode. This patch will add support for kernel module to work
>>> with IOVA=VA mode.
>>
>> Thanks Kiran for removing the limitation, I have a few questions, can you please
>> help me understand.
>>
>> And when this patch is ready, the restriction in 'linux/eal/eal.c', in 'rte_eal_init'
>> should be removed, perhaps with this patch. I assume you already doing it to be
>> able to test this patch.
>>
> 
> User can choose the mode by passing --iova-mode=<va/pa>. I will remove the rte_kni module restriction.
>  
> 
>>>
>>> The idea is to maintain a mapping in KNI module between user pages and
>>> kernel pages and in fast path perform a lookup in this table and get
>>> the kernel virtual address for corresponding user virtual address.
>>>
>>> In IOVA=VA mode, the memory allocated to the pool is physically and
>>> virtually contiguous. We will take advantage of this and create a
>>> mapping in the kernel.In kernel we need mapping for queues (tx_q,
>>> rx_q,... slow path) and mbuf memory (fast path).
>>
>> Is it?
>> As far as I know mempool can have multiple chunks and they can be both
>> virtually and physically separated.
>>
>> And even for a single chunk, that will be virtually continuous, but will it be
>> physically continuous?
>>
> 
> You are right, it need not have to be physically contiguous. Will change the description.
>  
> 
>>>
>>> At the KNI init time, in slow path we will create a mapping for the
>>> queues and mbuf using get_user_pages similar to af_xdp. Using pool
>>> memory base address, we will create a page map table for the mbuf,
>>> which we will use in the fast path for kernel page translation.
>>>
>>> At KNI init time, we will pass the base address of the pool and size
>>> of the pool to kernel. In kernel, using get_user_pages API, we will
>>> get the pages with size PAGE_SIZE and store the mapping and start
>>> address of user space in a table.
>>>
>>> In fast path for any user address perform PAGE_SHIFT (user_addr >>
>>> PAGE_SHIFT) and subtract the start address from this value, we will
>>> get the index of the kernel page with in the page map table.
>>> Adding offset to this kernel page address, we will get the kernel
>>> address for this user virtual address.
>>>
>>> For example user pool base address is X, and size is S that we passed
>>> to kernel. In kernel we will create a mapping for this using get_user_pages.
>>> Our page map table will look like [Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2)
>>> ....] and user start page will be U (we will get it from X >> PAGE_SHIFT).
>>>
>>> For any user address Z we will get the index of the page map table
>>> using ((Z >> PAGE_SHIFT) - U). Adding offset (Z & (PAGE_SIZE - 1)) to
>>> this address will give kernel virtual address.
>>>
>>> Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
>>
>> <...>
>>
>>> +int
>>> +kni_pin_pages(void *address, size_t size, struct page_info *mem) {
>>> +	unsigned int gup_flags = FOLL_WRITE;
>>> +	long npgs;
>>> +	int err;
>>> +
>>> +	/* Get at least one page */
>>> +	if (size < PAGE_SIZE)
>>> +		size = PAGE_SIZE;
>>> +
>>> +	/* Compute number of user pages based on page size */
>>> +	mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
>>> +
>>> +	/* Allocate memory for the pages */
>>> +	mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
>>> +		      GFP_KERNEL | __GFP_NOWARN);
>>> +	if (!mem->pgs) {
>>> +		pr_err("%s: -ENOMEM\n", __func__);
>>> +		return -ENOMEM;
>>> +	}
>>> +
>>> +	down_write(&current->mm->mmap_sem);
>>> +
>>> +	/* Get the user pages from the user address*/ #if
>> LINUX_VERSION_CODE
>>> +>= KERNEL_VERSION(4,9,0)
>>> +	npgs = get_user_pages((u64)address, mem->npgs,
>>> +				gup_flags, &mem->pgs[0], NULL);
>>> +#else
>>> +	npgs = get_user_pages(current, current->mm, (u64)address, mem-
>>> npgs,
>>> +				gup_flags, 0, &mem->pgs[0], NULL); #endif
>>> +	up_write(&current->mm->mmap_sem);
>>
>> This should work even memory is physically not continuous, right? Where
>> exactly physically continuous requirement is coming from?
>>
> 
> Yes, it is not necessary to be physically contiguous.
> 
>  
> 
> 
>> <...>
>>
>>> +
>>> +/* Get the kernel address from the user address using
>>> + * page map table. Will be used only in IOVA=VA mode  */ static
>>> +inline void* get_kva(uint64_t usr_addr, struct kni_dev *kni) {
>>> +	uint32_t index;
>>> +	/* User page - start user page will give the index
>>> +	 * with in the page map table
>>> +	 */
>>> +	index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
>>> +
>>> +	/* Add the offset to the page address */
>>> +	return (kni->va_info.page_map[index].addr +
>>> +		(usr_addr & kni->va_info.page_mask));
>>> +
>>> +}
>>> +
>>>  /* physical address to kernel virtual address */  static void *
>>> pa2kva(void *pa) @@ -186,7 +205,10 @@ kni_fifo_trans_pa2va(struct
>>> kni_dev *kni,
>>>  			return;
>>>
>>>  		for (i = 0; i < num_rx; i++) {
>>> -			kva = pa2kva(kni->pa[i]);
>>> +			if (likely(kni->iova_mode == 1))
>>> +				kva = get_kva((u64)(kni->pa[i]), kni);
>>
>> kni->pa[] now has iova addresses, for 'get_kva()' to work shouldn't
>> 'va_info.start_page' calculated from 'mempool_memhdr->iova' instead of
>> 'mempool_memhdr->addr'
>>
>> If this is working I must be missing something but not able to find what it is.
>>
>> <...>
>>
>  
> In IOVA=VA mode, both the values will be same right?

I don't know for sure, but according to my understanding 'mempool_memhdr->addr' is
the application virtual address, 'mempool_memhdr->iova' is the device virtual address
and can be anything as long as the iommu is configured correctly.
So technically ->addr and ->iova can be the same, but I don't know whether that is
always the case in DPDK.

> 
> 
>>> @@ -304,6 +304,27 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
>>>  	kni->group_id = conf->group_id;
>>>  	kni->mbuf_size = conf->mbuf_size;
>>>
>>> +	dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
>>> +	if (dev_info.iova_mode) {
>>> +		struct rte_mempool_memhdr *hdr;
>>> +		uint64_t pool_size = 0;
>>> +
>>> +		/* In each pool header chunk, we will maintain the
>>> +		 * base address of the pool. This chunk is physically and
>>> +		 * virtually contiguous.
>>> +		 * This approach will work, only if the allocated pool
>>> +		 * memory is contiguous, else it won't work
>>> +		 */
>>> +		hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
>>> +		dev_info.mbuf_va = (void *)(hdr->addr);
>>> +
>>> +		/* Traverse the list and get the total size of the pool */
>>> +		STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
>>> +			pool_size += hdr->len;
>>> +		}
>>
>> This code is aware that there may be multiple chunks, but assumes they are all
>> continuous, I don't know if this assumption is correct.
>>
>> Also I guess there is another assumption that there will be single pktmbuf_pool
>> in the application which passed into kni?
>> What if there are multiple pktmbuf_pool, like one for each PMD, will this work?
>> Now some mbufs in kni Rx fifo will come from different pktmbuf_pool which we
>> don't know their pages, so won't able to get their kernel virtual address.
> 
> All these chunks have to be virtually contiguous, otherwise this approach will not work. Here one thing we can do is create the mapping for complete huge page, and make sure all the mbuff_mempools will be with in this offset. As huge page size will be big (x64 it is 1G, ARM 512MB), we may not exceed the size, even with multiple pktmbuf_pools. If mbuf_pools were created outside this huge_page, this approach will not work and we will fall back to PA mode.
> 

Those were also my two points:
- This requires all chunks to be virtually contiguous
- There should be only one pktmbuf_pool in the system for this to work

But both of the above assumptions are not correct.
And as far as I can see there is no way to detect them and act accordingly.
  
Ferruh Yigit April 4, 2019, 11:21 a.m. UTC | #5
On 4/4/2019 10:57 AM, Burakov, Anatoly wrote:
> On 03-Apr-19 5:29 PM, Ferruh Yigit wrote:
>> On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
>>> From: Kiran Kumar K <kirankumark@marvell.com>
>>>
>>> With current KNI implementation kernel module will work only in
>>> IOVA=PA mode. This patch will add support for kernel module to work
>>> with IOVA=VA mode.
>>
>> Thanks Kiran for removing the limitation, I have a few questions, can you please
>> help me understand.
>>
>> And when this patch is ready, the restriction in 'linux/eal/eal.c', in
>> 'rte_eal_init' should be removed, perhaps with this patch. I assume you already
>> doing it to be able to test this patch.
>>
>>>
>>> The idea is to maintain a mapping in KNI module between user pages and
>>> kernel pages and in fast path perform a lookup in this table and get
>>> the kernel virtual address for corresponding user virtual address.
>>>
>>> In IOVA=VA mode, the memory allocated to the pool is physically
>>> and virtually contiguous. We will take advantage of this and create a
>>> mapping in the kernel.In kernel we need mapping for queues
>>> (tx_q, rx_q,... slow path) and mbuf memory (fast path).
>>
>> Is it?
>> As far as I know mempool can have multiple chunks and they can be both virtually
>> and physically separated.
>>
>> And even for a single chunk, that will be virtually continuous, but will it be
>> physically continuous?
> 
> Just to clarify.
> 
> Within DPDK, we do not make a distinction between physical address and 
> IOVA address - we never need the actual physical address, we just need 
> the DMA addresses, which can either match the physical address, or be 
> completely arbitrary (in our case, they will match VA addresses, but it 
> doesn't have to be the case - in fact, 17.11 will, under some 
> circumstances, populate IOVA addresses simply starting from address 0).
> 
> However, one has to remember that IOVA address *is not a physical 
> address*. The pages backing a VA chunk may be *IOVA*-contiguous, but may 
> not necessarily be *physically* contiguous. Under normal circumstances 
> we really don't care, because the VFIO/IOMMU takes care of the mapping 
> between IOVA and PA transparently for the hardware.
> 
> So, in IOVA as VA mode, the memory allocated to the mempool will be 
> (within a given chunk) both VA and IOVA contiguous - but not necessarily 
> *physically* contiguous! In fact, if you try calling rte_mem_virt2phy() 
> on the mempool pages, you'll likely find that they aren't (i've seen 
> cases where pages were mapped /backwards/!).
> 
> Therefore, unless by "physically contiguous" you mean "IOVA-contiguous", 
> you *cannot* rely on memory in mempool being *physically* contiguous 
> merely based on the fact that it's IOVA-contiguous.

Thanks for the clarification.
  
Burakov, Anatoly April 4, 2019, 1:29 p.m. UTC | #6
On 04-Apr-19 12:20 PM, Ferruh Yigit wrote:
> On 4/4/2019 6:03 AM, Kiran Kumar Kokkilagadda wrote:
>>
>>
>>> -----Original Message-----
>>> From: Ferruh Yigit <ferruh.yigit@intel.com>
>>> Sent: Wednesday, April 3, 2019 9:59 PM
>>> To: Kiran Kumar Kokkilagadda <kirankumark@marvell.com>
>>> Cc: dev@dpdk.org; Jerin Jacob <jerin.jacob@caviumnetworks.com>
>>> Subject: [EXT] Re: [dpdk-dev] [PATCH v2] kni: add IOVA va support for kni
>>>
>>> External Email
>>>
>>> ----------------------------------------------------------------------
>>> On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
>>>> From: Kiran Kumar K <kirankumark@marvell.com>
>>>>
>>>> With current KNI implementation kernel module will work only in
>>>> IOVA=PA mode. This patch will add support for kernel module to work
>>>> with IOVA=VA mode.
>>>
>>> Thanks Kiran for removing the limitation, I have a few questions, can you please
>>> help me understand.
>>>
>>> And when this patch is ready, the restriction in 'linux/eal/eal.c', in 'rte_eal_init'
>>> should be removed, perhaps with this patch. I assume you already doing it to be
>>> able to test this patch.
>>>
>>
>> User can choose the mode by passing --iova-mode=<va/pa>. I will remove the rte_kni module restriction.
>>   
>>
>>>>
>>>> The idea is to maintain a mapping in KNI module between user pages and
>>>> kernel pages and in fast path perform a lookup in this table and get
>>>> the kernel virtual address for corresponding user virtual address.
>>>>
>>>> In IOVA=VA mode, the memory allocated to the pool is physically and
>>>> virtually contiguous. We will take advantage of this and create a
>>>> mapping in the kernel.In kernel we need mapping for queues (tx_q,
>>>> rx_q,... slow path) and mbuf memory (fast path).
>>>
>>> Is it?
>>> As far as I know mempool can have multiple chunks and they can be both
>>> virtually and physically separated.
>>>
>>> And even for a single chunk, that will be virtually continuous, but will it be
>>> physically continuous?
>>>
>>
>> You are right, it need not have to be physically contiguous. Will change the description.
>>   
>>
>>>>
>>>> At the KNI init time, in slow path we will create a mapping for the
>>>> queues and mbuf using get_user_pages similar to af_xdp. Using pool
>>>> memory base address, we will create a page map table for the mbuf,
>>>> which we will use in the fast path for kernel page translation.
>>>>
>>>> At KNI init time, we will pass the base address of the pool and size
>>>> of the pool to kernel. In kernel, using get_user_pages API, we will
>>>> get the pages with size PAGE_SIZE and store the mapping and start
>>>> address of user space in a table.
>>>>
>>>> In fast path for any user address perform PAGE_SHIFT (user_addr >>
>>>> PAGE_SHIFT) and subtract the start address from this value, we will
>>>> get the index of the kernel page with in the page map table.
>>>> Adding offset to this kernel page address, we will get the kernel
>>>> address for this user virtual address.
>>>>
>>>> For example user pool base address is X, and size is S that we passed
>>>> to kernel. In kernel we will create a mapping for this using get_user_pages.
>>>> Our page map table will look like [Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2)
>>>> ....] and user start page will be U (we will get it from X >> PAGE_SHIFT).
>>>>
>>>> For any user address Z we will get the index of the page map table
>>>> using ((Z >> PAGE_SHIFT) - U). Adding offset (Z & (PAGE_SIZE - 1)) to
>>>> this address will give kernel virtual address.
>>>>
>>>> Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
>>>
>>> <...>
>>>
>>>> +int
>>>> +kni_pin_pages(void *address, size_t size, struct page_info *mem) {
>>>> +	unsigned int gup_flags = FOLL_WRITE;
>>>> +	long npgs;
>>>> +	int err;
>>>> +
>>>> +	/* Get at least one page */
>>>> +	if (size < PAGE_SIZE)
>>>> +		size = PAGE_SIZE;
>>>> +
>>>> +	/* Compute number of user pages based on page size */
>>>> +	mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
>>>> +
>>>> +	/* Allocate memory for the pages */
>>>> +	mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
>>>> +		      GFP_KERNEL | __GFP_NOWARN);
>>>> +	if (!mem->pgs) {
>>>> +		pr_err("%s: -ENOMEM\n", __func__);
>>>> +		return -ENOMEM;
>>>> +	}
>>>> +
>>>> +	down_write(&current->mm->mmap_sem);
>>>> +
>>>> +	/* Get the user pages from the user address*/ #if
>>> LINUX_VERSION_CODE
>>>> +>= KERNEL_VERSION(4,9,0)
>>>> +	npgs = get_user_pages((u64)address, mem->npgs,
>>>> +				gup_flags, &mem->pgs[0], NULL);
>>>> +#else
>>>> +	npgs = get_user_pages(current, current->mm, (u64)address, mem-
>>>> npgs,
>>>> +				gup_flags, 0, &mem->pgs[0], NULL); #endif
>>>> +	up_write(&current->mm->mmap_sem);
>>>
>>> This should work even memory is physically not continuous, right? Where
>>> exactly physically continuous requirement is coming from?
>>>
>>
>> Yes, it is not necessary to be physically contiguous.
>>
>>   
>>
>>
>>> <...>
>>>
>>>> +
>>>> +/* Get the kernel address from the user address using
>>>> + * page map table. Will be used only in IOVA=VA mode  */ static
>>>> +inline void* get_kva(uint64_t usr_addr, struct kni_dev *kni) {
>>>> +	uint32_t index;
>>>> +	/* User page - start user page will give the index
>>>> +	 * with in the page map table
>>>> +	 */
>>>> +	index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
>>>> +
>>>> +	/* Add the offset to the page address */
>>>> +	return (kni->va_info.page_map[index].addr +
>>>> +		(usr_addr & kni->va_info.page_mask));
>>>> +
>>>> +}
>>>> +
>>>>   /* physical address to kernel virtual address */  static void *
>>>> pa2kva(void *pa) @@ -186,7 +205,10 @@ kni_fifo_trans_pa2va(struct
>>>> kni_dev *kni,
>>>>   			return;
>>>>
>>>>   		for (i = 0; i < num_rx; i++) {
>>>> -			kva = pa2kva(kni->pa[i]);
>>>> +			if (likely(kni->iova_mode == 1))
>>>> +				kva = get_kva((u64)(kni->pa[i]), kni);
>>>
>>> kni->pa[] now has iova addresses, for 'get_kva()' to work shouldn't
>>> 'va_info.start_page' calculated from 'mempool_memhdr->iova' instead of
>>> 'mempool_memhdr->addr'
>>>
>>> If this is working I must be missing something but not able to find what it is.
>>>
>>> <...>
>>>
>>   
>> In IOVA=VA mode, both the values will be same right?
> 
> I don't know for sure, but according my understanding 'mempool_memhdr->addr' is
> application virtual address, 'mempool_memhdr->iova' id device virtual address
> and can be anything as long as iommu configured correctly.
> So technically ->addr and ->iova can be same, but I don't know it is always the
> case in DPDK.

In the IOVA as VA case, they will be the same, except if external memory is 
used (in which case all bets are off, as this memory is not managed by 
DPDK and we can't enforce any constraints on it). That is not to say 
that IOVA being different from VA can't happen in the future, but the 
whole notion of "IOVA as VA" implies that IOVA is VA :)

> 
>>
>>
>>>> @@ -304,6 +304,27 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
>>>>   	kni->group_id = conf->group_id;
>>>>   	kni->mbuf_size = conf->mbuf_size;
>>>>
>>>> +	dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
>>>> +	if (dev_info.iova_mode) {
>>>> +		struct rte_mempool_memhdr *hdr;
>>>> +		uint64_t pool_size = 0;
>>>> +
>>>> +		/* In each pool header chunk, we will maintain the
>>>> +		 * base address of the pool. This chunk is physically and
>>>> +		 * virtually contiguous.
>>>> +		 * This approach will work, only if the allocated pool
>>>> +		 * memory is contiguous, else it won't work
>>>> +		 */
>>>> +		hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
>>>> +		dev_info.mbuf_va = (void *)(hdr->addr);
>>>> +
>>>> +		/* Traverse the list and get the total size of the pool */
>>>> +		STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
>>>> +			pool_size += hdr->len;
>>>> +		}
>>>
>>> This code is aware that there may be multiple chunks, but assumes they are all
>>> continuous, I don't know if this assumption is correct.
>>>
>>> Also I guess there is another assumption that there will be single pktmbuf_pool
>>> in the application which passed into kni?
>>> What if there are multiple pktmbuf_pool, like one for each PMD, will this work?
>>> Now some mbufs in kni Rx fifo will come from different pktmbuf_pool which we
>>> don't know their pages, so won't able to get their kernel virtual address.
>>
>> All these chunks have to be virtually contiguous, otherwise this approach will not work. Here one thing we can do is create the mapping for complete huge page, and make sure all the mbuff_mempools will be with in this offset. As huge page size will be big (x64 it is 1G, ARM 512MB), we may not exceed the size, even with multiple pktmbuf_pools. If mbuf_pools were created outside this huge_page, this approach will not work and we will fall back to PA mode.
>>
> 
> Those were also my two points,
> - This requires all chunks in to be virtually continuous
> - There should be only one pktmbuf_pools in system for this to work
> 
> But both above assumptions are not correct.
> And as far as I can see there is not way to detect them and act accordingly.
> 

Correct, there is no guarantee that an entire mempool will be contained 
within one page, or even two adjacent pages. There is nothing stopping 
the memchunks from being allocated backwards (i.e. memchunk1 may have a higher 
or lower address than memchunk2), or even from being allocated from pages of 
different sizes.
  

Patch

diff --git a/kernel/linux/kni/kni_dev.h b/kernel/linux/kni/kni_dev.h
index 688f574a4..055b8d59e 100644
--- a/kernel/linux/kni/kni_dev.h
+++ b/kernel/linux/kni/kni_dev.h
@@ -32,10 +32,47 @@ 
 /* Default carrier state for created KNI network interfaces */
 extern uint32_t dflt_carrier;

+struct iova_page_info {
+	/* User to kernel page table map, used for
+	 * fast path lookup
+	 */
+	struct mbuf_page {
+		void *addr;
+	} *page_map;
+
+	/* Page mask */
+	u64 page_mask;
+
+	/* Start page for user address */
+	u64 start_page;
+
+	struct page_info {
+		/* Physical pages returned by get_user_pages */
+		struct page **pgs;
+
+		/* Number of Pages returned by get_user_pages */
+		u32 npgs;
+	} page_info;
+
+	/* Queue info */
+	struct page_info tx_q;
+	struct page_info rx_q;
+	struct page_info alloc_q;
+	struct page_info free_q;
+	struct page_info req_q;
+	struct page_info resp_q;
+	struct page_info sync_va;
+};
+
 /**
  * A structure describing the private information for a kni device.
  */
 struct kni_dev {
+	/* Page info for IOVA=VA mode */
+	struct iova_page_info va_info;
+	/* IOVA mode 0 = PA, 1 = VA */
+	uint8_t iova_mode;
+
 	/* kni list */
 	struct list_head list;

diff --git a/kernel/linux/kni/kni_misc.c b/kernel/linux/kni/kni_misc.c
index 04c78eb87..0be0e1dfd 100644
--- a/kernel/linux/kni/kni_misc.c
+++ b/kernel/linux/kni/kni_misc.c
@@ -201,6 +201,122 @@  kni_dev_remove(struct kni_dev *dev)
 	return 0;
 }

+static void
+kni_unpin_pages(struct page_info *mem)
+{
+	u32 i;
+
+	/* Set the user pages as dirty, so that these pages will not be
+	 * allocated to other applications until we release them.
+	 */
+	for (i = 0; i < mem->npgs; i++) {
+		struct page *page = mem->pgs[i];
+
+		set_page_dirty_lock(page);
+		put_page(page);
+	}
+
+	kfree(mem->pgs);
+	mem->pgs = NULL;
+}
+
+static void
+kni_clean_queue(struct page_info *mem)
+{
+	if (mem->pgs) {
+		set_page_dirty_lock(mem->pgs[0]);
+		put_page(mem->pgs[0]);
+		kfree(mem->pgs);
+		mem->pgs = NULL;
+	}
+}
+
+static void
+kni_cleanup_iova(struct iova_page_info *mem)
+{
+	kni_unpin_pages(&mem->page_info);
+	kfree(mem->page_map);
+	mem->page_map = NULL;
+
+	kni_clean_queue(&mem->tx_q);
+	kni_clean_queue(&mem->rx_q);
+	kni_clean_queue(&mem->alloc_q);
+	kni_clean_queue(&mem->free_q);
+	kni_clean_queue(&mem->req_q);
+	kni_clean_queue(&mem->resp_q);
+	kni_clean_queue(&mem->sync_va);
+}
+
+int
+kni_pin_pages(void *address, size_t size, struct page_info *mem)
+{
+	unsigned int gup_flags = FOLL_WRITE;
+	long npgs;
+	int err;
+
+	/* Get at least one page */
+	if (size < PAGE_SIZE)
+		size = PAGE_SIZE;
+
+	/* Compute number of user pages based on page size */
+	mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
+
+	/* Allocate memory for the pages */
+	mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
+		      GFP_KERNEL | __GFP_NOWARN);
+	if (!mem->pgs) {
+		pr_err("%s: -ENOMEM\n", __func__);
+		return -ENOMEM;
+	}
+
+	down_write(&current->mm->mmap_sem);
+
+	/* Get the user pages from the user address */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,9,0)
+	npgs = get_user_pages((u64)address, mem->npgs,
+				gup_flags, &mem->pgs[0], NULL);
+#else
+	npgs = get_user_pages(current, current->mm, (u64)address, mem->npgs,
+				gup_flags, 0, &mem->pgs[0], NULL);
+#endif
+	up_write(&current->mm->mmap_sem);
+
+	/* We didn't get all the requested pages, throw error */
+	if (npgs != mem->npgs) {
+		if (npgs >= 0) {
+			mem->npgs = npgs;
+			err = -ENOMEM;
+			pr_err("%s: -ENOMEM\n", __func__);
+			goto out_pin;
+		}
+		err = npgs;
+		goto out_pgs;
+	}
+	return 0;
+
+out_pin:
+	kni_unpin_pages(mem);
+out_pgs:
+	kfree(mem->pgs);
+	mem->pgs = NULL;
+	return err;
+}
+
+static void*
+kni_map_queue(struct kni_dev *kni, u64 addr,
+			   struct page_info *mm)
+{
+	/* Map at least 1 page */
+	if (kni_pin_pages((void *)addr, PAGE_SIZE,
+			  mm) != 0) {
+		pr_err("Unable to pin pages\n");
+		return NULL;
+	}
+
+	return (page_address(mm->pgs[0]) +
+			   (addr & kni->va_info.page_mask));
+}
+
 static int
 kni_release(struct inode *inode, struct file *file)
 {
@@ -228,6 +344,11 @@  kni_release(struct inode *inode, struct file *file)
 		}

 		kni_dev_remove(dev);
+
+		/* IOVA=VA mode, unpin pages */
+		if (likely(dev->iova_mode == 1))
+			kni_cleanup_iova(&dev->va_info);
+
 		list_del(&dev->list);
 	}
 	up_write(&knet->kni_list_lock);
@@ -368,16 +489,91 @@  kni_ioctl_create(struct net *net, uint32_t ioctl_num,
 	strncpy(kni->name, dev_info.name, RTE_KNI_NAMESIZE);

 	/* Translate user space info into kernel space info */
-	kni->tx_q = phys_to_virt(dev_info.tx_phys);
-	kni->rx_q = phys_to_virt(dev_info.rx_phys);
-	kni->alloc_q = phys_to_virt(dev_info.alloc_phys);
-	kni->free_q = phys_to_virt(dev_info.free_phys);
+	kni->iova_mode = dev_info.iova_mode;

-	kni->req_q = phys_to_virt(dev_info.req_phys);
-	kni->resp_q = phys_to_virt(dev_info.resp_phys);
-	kni->sync_va = dev_info.sync_va;
-	kni->sync_kva = phys_to_virt(dev_info.sync_phys);
+	if (kni->iova_mode) {
+		u64 mbuf_addr;
+		int i;
+
+		/* map userspace memory info */
+		mbuf_addr = (u64)dev_info.mbuf_va;

+		/* Pre compute page mask, used in fast path */
+		kni->va_info.page_mask = (u64)(PAGE_SIZE - 1);
+
+		/* Store the start page address; this is the reference
+		 * for all the user virtual addresses
+		 */
+		kni->va_info.start_page = (mbuf_addr >> PAGE_SHIFT);
+
+		/* Get and pin the user pages */
+		if (kni_pin_pages(dev_info.mbuf_va, dev_info.mbuf_pool_size,
+			      &kni->va_info.page_info) != 0) {
+			pr_err("Unable to pin pages\n");
+			return -1;
+		}
+
+		/* Page map table between user and kernel pages */
+		kni->va_info.page_map = kcalloc(kni->va_info.page_info.npgs,
+						   sizeof(struct mbuf_page),
+						   GFP_KERNEL);
+		if (kni->va_info.page_map == NULL) {
+			pr_err("Out of memory\n");
+			return -ENOMEM;
+		}
+
+		/* Convert the user pages to kernel pages */
+		for (i = 0; i < kni->va_info.page_info.npgs; i++) {
+			kni->va_info.page_map[i].addr =
+				page_address(kni->va_info.page_info.pgs[i]);
+		}
+
+		/* map queues */
+		kni->tx_q = kni_map_queue(kni, dev_info.tx_phys,
+					  &kni->va_info.tx_q);
+		if (kni->tx_q == NULL)
+			goto iova_err;
+
+		kni->rx_q = kni_map_queue(kni, dev_info.rx_phys,
+					  &kni->va_info.rx_q);
+		if (kni->rx_q == NULL)
+			goto iova_err;
+
+		kni->alloc_q = kni_map_queue(kni, dev_info.alloc_phys,
+					     &kni->va_info.alloc_q);
+		if (kni->alloc_q == NULL)
+			goto iova_err;
+
+		kni->free_q = kni_map_queue(kni, dev_info.free_phys,
+					    &kni->va_info.free_q);
+		if (kni->free_q == NULL)
+			goto iova_err;
+
+		kni->req_q = kni_map_queue(kni, dev_info.req_phys,
+					   &kni->va_info.req_q);
+		if (kni->req_q == NULL)
+			goto iova_err;
+
+		kni->resp_q = kni_map_queue(kni, dev_info.resp_phys,
+					    &kni->va_info.resp_q);
+		if (kni->resp_q == NULL)
+			goto iova_err;
+
+		kni->sync_kva = kni_map_queue(kni, dev_info.sync_phys,
+					      &kni->va_info.sync_va);
+		if (kni->sync_kva == NULL)
+			goto iova_err;
+	} else {
+		/* Address translation for IOVA=PA mode */
+		kni->tx_q = phys_to_virt(dev_info.tx_phys);
+		kni->rx_q = phys_to_virt(dev_info.rx_phys);
+		kni->alloc_q = phys_to_virt(dev_info.alloc_phys);
+		kni->free_q = phys_to_virt(dev_info.free_phys);
+		kni->req_q = phys_to_virt(dev_info.req_phys);
+		kni->resp_q = phys_to_virt(dev_info.resp_phys);
+		kni->sync_kva = phys_to_virt(dev_info.sync_phys);
+	}
+	kni->sync_va = dev_info.sync_va;
 	kni->mbuf_size = dev_info.mbuf_size;

 	pr_debug("tx_phys:      0x%016llx, tx_q addr:      0x%p\n",
@@ -484,6 +680,9 @@  kni_ioctl_create(struct net *net, uint32_t ioctl_num,
 	up_write(&knet->kni_list_lock);

 	return 0;
+iova_err:
+	kni_cleanup_iova(&kni->va_info);
+	return -1;
 }

 static int
diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c
index 7371b6d58..83fbcf6f1 100644
--- a/kernel/linux/kni/kni_net.c
+++ b/kernel/linux/kni/kni_net.c
@@ -35,6 +35,25 @@  static void kni_net_rx_normal(struct kni_dev *kni);
 /* kni rx function pointer, with default to normal rx */
 static kni_net_rx_t kni_net_rx_func = kni_net_rx_normal;

+
+/* Get the kernel address from the user address using
+ * page map table. Will be used only in IOVA=VA mode
+ */
+static inline void*
+get_kva(uint64_t usr_addr, struct kni_dev *kni)
+{
+	uint32_t index;
+	/* User page - start user page will give the index
+	 * with in the page map table
+	 */
+	index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
+
+	/* Add the offset to the page address */
+	return (kni->va_info.page_map[index].addr +
+		(usr_addr & kni->va_info.page_mask));
+
+}
+
 /* physical address to kernel virtual address */
 static void *
 pa2kva(void *pa)
@@ -186,7 +205,10 @@  kni_fifo_trans_pa2va(struct kni_dev *kni,
 			return;

 		for (i = 0; i < num_rx; i++) {
-			kva = pa2kva(kni->pa[i]);
+			if (likely(kni->iova_mode == 1))
+				kva = get_kva((u64)(kni->pa[i]), kni);
+			else
+				kva = pa2kva(kni->pa[i]);
 			kni->va[i] = pa2va(kni->pa[i], kva);
 		}

@@ -263,8 +285,16 @@  kni_net_tx(struct sk_buff *skb, struct net_device *dev)
 	if (likely(ret == 1)) {
 		void *data_kva;

-		pkt_kva = pa2kva(pkt_pa);
-		data_kva = kva2data_kva(pkt_kva);
+
+		if (likely(kni->iova_mode == 1)) {
+			pkt_kva = get_kva((u64)pkt_pa, kni);
+			data_kva = (uint8_t *)pkt_kva +
+				(sizeof(struct rte_kni_mbuf) +
+				 pkt_kva->data_off);
+		} else {
+			pkt_kva = pa2kva(pkt_pa);
+			data_kva = kva2data_kva(pkt_kva);
+		}
 		pkt_va = pa2va(pkt_pa, pkt_kva);

 		len = skb->len;
@@ -333,11 +363,18 @@  kni_net_rx_normal(struct kni_dev *kni)
 	if (num_rx == 0)
 		return;

+
 	/* Transfer received packets to netif */
 	for (i = 0; i < num_rx; i++) {
-		kva = pa2kva(kni->pa[i]);
+		if (likely(kni->iova_mode == 1)) {
+			kva = get_kva((u64)kni->pa[i], kni);
+			data_kva = (uint8_t *)kva +
+				(sizeof(struct rte_kni_mbuf) + kva->data_off);
+		} else {
+			kva = pa2kva(kni->pa[i]);
+			data_kva = kva2data_kva(kva);
+		}
 		len = kva->pkt_len;
-		data_kva = kva2data_kva(kva);
 		kni->va[i] = pa2va(kni->pa[i], kva);

 		skb = dev_alloc_skb(len + 2);
@@ -363,8 +400,17 @@  kni_net_rx_normal(struct kni_dev *kni)
 				if (!kva->next)
 					break;

-				kva = pa2kva(va2pa(kva->next, kva));
-				data_kva = kva2data_kva(kva);
+				if (likely(kni->iova_mode == 1)) {
+					kva = get_kva(
+						(u64)va2pa(kva->next, kva),
+						kni);
+					data_kva = (uint8_t *)kva +
+					(sizeof(struct rte_kni_mbuf) +
+					 kva->data_off);
+				} else {
+					kva = pa2kva(va2pa(kva->next, kva));
+					data_kva = kva2data_kva(kva);
+				}
 			}
 		}

@@ -434,14 +480,31 @@  kni_net_rx_lo_fifo(struct kni_dev *kni)
 		num = ret;
 		/* Copy mbufs */
 		for (i = 0; i < num; i++) {
-			kva = pa2kva(kni->pa[i]);
-			len = kva->pkt_len;
-			data_kva = kva2data_kva(kva);
-			kni->va[i] = pa2va(kni->pa[i], kva);
+			if (likely(kni->iova_mode == 1)) {
+				kva = get_kva((u64)(kni->pa[i]), kni);
+				len = kva->pkt_len;
+				data_kva = (uint8_t *)kva +
+					(sizeof(struct rte_kni_mbuf) +
+					 kva->data_off);
+				kni->va[i] = pa2va(kni->pa[i], kva);
+				alloc_kva = get_kva((u64)(kni->alloc_pa[i]),
+						    kni);
+				alloc_data_kva = (uint8_t *)alloc_kva +
+					(sizeof(struct rte_kni_mbuf) +
+					 alloc_kva->data_off);
+				kni->alloc_va[i] = pa2va(kni->alloc_pa[i],
+							 alloc_kva);
+			} else {
+				kva = pa2kva(kni->pa[i]);
+				len = kva->pkt_len;
+				data_kva = kva2data_kva(kva);
+				kni->va[i] = pa2va(kni->pa[i], kva);

-			alloc_kva = pa2kva(kni->alloc_pa[i]);
-			alloc_data_kva = kva2data_kva(alloc_kva);
-			kni->alloc_va[i] = pa2va(kni->alloc_pa[i], alloc_kva);
+				alloc_kva = pa2kva(kni->alloc_pa[i]);
+				alloc_data_kva = kva2data_kva(alloc_kva);
+				kni->alloc_va[i] = pa2va(kni->alloc_pa[i],
+							 alloc_kva);
+			}

 			memcpy(alloc_data_kva, data_kva, len);
 			alloc_kva->pkt_len = len;
@@ -507,9 +570,15 @@  kni_net_rx_lo_fifo_skb(struct kni_dev *kni)

 	/* Copy mbufs to sk buffer and then call tx interface */
 	for (i = 0; i < num; i++) {
-		kva = pa2kva(kni->pa[i]);
+		if (likely(kni->iova_mode == 1)) {
+			kva = get_kva((u64)(kni->pa[i]), kni);
+			data_kva = (uint8_t *)kva +
+				(sizeof(struct rte_kni_mbuf) + kva->data_off);
+		} else {
+			kva = pa2kva(kni->pa[i]);
+			data_kva = kva2data_kva(kva);
+		}
 		len = kva->pkt_len;
-		data_kva = kva2data_kva(kva);
 		kni->va[i] = pa2va(kni->pa[i], kva);

 		skb = dev_alloc_skb(len + 2);
@@ -545,8 +614,17 @@  kni_net_rx_lo_fifo_skb(struct kni_dev *kni)
 				if (!kva->next)
 					break;

-				kva = pa2kva(va2pa(kva->next, kva));
-				data_kva = kva2data_kva(kva);
+				if (likely(kni->iova_mode == 1)) {
+					kva = get_kva(
+						(u64)(va2pa(kva->next, kva)),
+						kni);
+					data_kva = (uint8_t *)kva +
+					(sizeof(struct rte_kni_mbuf) +
+					 kva->data_off);
+				} else {
+					kva = pa2kva(va2pa(kva->next, kva));
+					data_kva = kva2data_kva(kva);
+				}
 			}
 		}

diff --git a/lib/librte_eal/linux/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/linux/eal/include/exec-env/rte_kni_common.h
index 5afa08713..897dd956f 100644
--- a/lib/librte_eal/linux/eal/include/exec-env/rte_kni_common.h
+++ b/lib/librte_eal/linux/eal/include/exec-env/rte_kni_common.h
@@ -128,6 +128,14 @@  struct rte_kni_device_info {
 	unsigned mbuf_size;
 	unsigned int mtu;
 	char mac_addr[6];
+
+	/* IOVA mode. 1 = VA, 0 = PA */
+	uint8_t iova_mode;
+
+	/* Pool size, will be used in kernel to map the
+	 * user pages
+	 */
+	uint64_t mbuf_pool_size;
 };

 #define KNI_DEVICE "kni"
diff --git a/lib/librte_kni/rte_kni.c b/lib/librte_kni/rte_kni.c
index 492e207a3..3bf19faa0 100644
--- a/lib/librte_kni/rte_kni.c
+++ b/lib/librte_kni/rte_kni.c
@@ -304,6 +304,27 @@  rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
 	kni->group_id = conf->group_id;
 	kni->mbuf_size = conf->mbuf_size;

+	dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
+	if (dev_info.iova_mode) {
+		struct rte_mempool_memhdr *hdr;
+		uint64_t pool_size = 0;
+
+		/* In each pool header chunk, we will maintain the
+		 * base address of the pool. This chunk is physically and
+		 * virtually contiguous.
+		 * This approach will work, only if the allocated pool
+		 * memory is contiguous, else it won't work
+		 */
+		hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
+		dev_info.mbuf_va = (void *)(hdr->addr);
+
+		/* Traverse the list and get the total size of the pool */
+		STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
+			pool_size += hdr->len;
+		}
+		dev_info.mbuf_pool_size = pool_size +
+			pktmbuf_pool->mz->len;
+	}
 	ret = ioctl(kni_fd, RTE_KNI_IOCTL_CREATE, &dev_info);
 	if (ret < 0)
 		goto ioctl_fail;