[dpdk-dev,07/10] linuxapp/eal_vfio: honor iova mode before mapping

Message ID 20170608110513.22548-8-santosh.shukla@caviumnetworks.com (mailing list archive)
State Changes Requested, archived
Delegated to: Thomas Monjalon
Headers

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Santosh Shukla June 8, 2017, 11:05 a.m. UTC
  Check iova mode and accordingly map iova to pa or va.

Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---
 lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
  

Comments

Maxime Coquelin July 5, 2017, 9:14 a.m. UTC | #1
On 06/08/2017 01:05 PM, Santosh Shukla wrote:
> Check iova mode and accordingly map iova to pa or va.
> 
> Signed-off-by: Santosh Shukla<santosh.shukla@caviumnetworks.com>
> Signed-off-by: Jerin Jacob<jerin.jacob@caviumnetworks.com>
> ---
>   lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> index 04914406f..348b7a7f4 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
>   		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
>   		dma_map.vaddr = ms[i].addr_64;
>   		dma_map.size = ms[i].len;
> -		dma_map.iova = ms[i].phys_addr;
> +		if (rte_eal_iova_mode() == RTE_IOVA_VA)
> +			dma_map.iova = dma_map.vaddr;
> +		else
> +			dma_map.iova = ms[i].phys_addr;
>   		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>   

IIUC, it is changing default behavior for VFIO devices.

I see a possible problem, but I'm not sure the case is valid.

Imagine you have two devices in the iommu group, and the two devices are
used in separate processes. Each process could try two different
physical addresses at the same virtual address, and so the second map
would fail.

By using physical addresses, you are safe against this problem.

Any thoughts?

Cheers,
Maxime
  
Jerin Jacob July 5, 2017, 3:43 p.m. UTC | #2
-----Original Message-----
> Date: Wed, 5 Jul 2017 11:14:01 +0200
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> To: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>  thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org
> CC: jerin.jacob@caviumnetworks.com, hemant.agrawal@nxp.com,
>  shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>  before mapping
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>  Thunderbird/52.1.0
> 
> 
> 
> On 06/08/2017 01:05 PM, Santosh Shukla wrote:
> > Check iova mode and accordingly map iova to pa or va.
> > 
> > Signed-off-by: Santosh Shukla<santosh.shukla@caviumnetworks.com>
> > Signed-off-by: Jerin Jacob<jerin.jacob@caviumnetworks.com>
> > ---
> >   lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
> >   1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > index 04914406f..348b7a7f4 100644
> > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
> >   		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
> >   		dma_map.vaddr = ms[i].addr_64;
> >   		dma_map.size = ms[i].len;
> > -		dma_map.iova = ms[i].phys_addr;
> > +		if (rte_eal_iova_mode() == RTE_IOVA_VA)
> > +			dma_map.iova = dma_map.vaddr;
> > +		else
> > +			dma_map.iova = ms[i].phys_addr;
> >   		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> 
> IIUC, it is changing default behavior for VFIO devices.
> 
> I see a possible problem, but I'm not sure the case is valid.
> 
> Imagine you have two devices in the iommu group, and the two devices are
> used in separate processes. Each process could try two different
> physical addresses at the same virtual address, and so the second map
> would fail.

IMO, Doesn't look like a problem. Here is the data flow

1) The vfio DMA map function(vfio_type1_dma_map()) will be called only
on primary process
http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359

2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure
that, the Secondary process has the _same_ virtual address as primary or
exit from on attach.
http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452 

3) Since secondary process adds the mapped the virtual address in step (2).
in the page table in OS. On SMMU entry miss(When device
request from I/O transaction), OS will load the mapping and update the SMMU 
"context" with page tables from MMU.

Let me add the background for why this feature is required in DPDK to
enable NPU style co-processors.

The traditional NICs the Rx path code look like this:
1) On control path, Fill the mempool with buffers
2) on rx_burst(), alloc the mbuf from mempool
3) SW has the mbuf in hand(which is a virtual address) and program the
HW with mbuf->buf_physaddr)
4) Return the last pushed mbuf(will be updated by HW by now)


On NPU style co-processors, situation is different as the buffer recycling 
has been done in HW unlike SW model. Here is the data flow:
1) On control path, Fill the HW mempool with buffers(Obviously the IOVA
address, which is PA in existing model)
2) on rx_burst, HW gives you IOVA address(as address as step 1)
3) As application expects VA to operate on it, rx_burst() needs to
convert to VA from PAA. Which is very costly.
Instead with this IOVA as VA scheme, We can avoid the cost of converting
with help of IOMMU/SMMU.

This patch set auto detects the mode based available of type devices in
bus and provides an option to override mode based on eal argument, so we
don't foresee any issue with this approach and welcome any alternative
approaches.

Similar problem exists in another part of the code in DPDK,
http://dpdk.org/browse/dpdk/tree/drivers/bus/fslmc/fslmc_vfio.c#n231
Its a conditional compilation based approach with duplicating the vfio
code and we are trying to fix the problem in a generic way so that
everyone can get benefited out of it.

Comments are welcome.

/Jerin

> 
> By using physical addresses, you are safe against this problem.
> 
> Any thoughts?
> 
> Cheers,
> Maxime
  
Maxime Coquelin July 6, 2017, 7:58 a.m. UTC | #3
On 07/05/2017 05:43 PM, Jerin Jacob wrote:
> -----Original Message-----
>> Date: Wed, 5 Jul 2017 11:14:01 +0200
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> To: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>   thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org
>> CC: jerin.jacob@caviumnetworks.com, hemant.agrawal@nxp.com,
>>   shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>   before mapping
>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>   Thunderbird/52.1.0
>>
>>
>>
>> On 06/08/2017 01:05 PM, Santosh Shukla wrote:
>>> Check iova mode and accordingly map iova to pa or va.
>>>
>>> Signed-off-by: Santosh Shukla<santosh.shukla@caviumnetworks.com>
>>> Signed-off-by: Jerin Jacob<jerin.jacob@caviumnetworks.com>
>>> ---
>>>    lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
>>>    1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>> index 04914406f..348b7a7f4 100644
>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
>>>    		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
>>>    		dma_map.vaddr = ms[i].addr_64;
>>>    		dma_map.size = ms[i].len;
>>> -		dma_map.iova = ms[i].phys_addr;
>>> +		if (rte_eal_iova_mode() == RTE_IOVA_VA)
>>> +			dma_map.iova = dma_map.vaddr;
>>> +		else
>>> +			dma_map.iova = ms[i].phys_addr;
>>>    		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>>
>> IIUC, it is changing default behavior for VFIO devices.
>>
>> I see a possible problem, but I'm not sure the case is valid.
>>
>> Imagine you have two devices in the iommu group, and the two devices are
>> used in separate processes. Each process could try two different
>> physical addresses at the same virtual address, and so the second map
>> would fail.
> 
> IMO, Doesn't look like a problem. Here is the data flow
> 
> 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only
> on primary process
> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359
> 
> 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure
> that, the Secondary process has the _same_ virtual address as primary or
> exit from on attach.
> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452
> 
> 3) Since secondary process adds the mapped the virtual address in step (2).
> in the page table in OS. On SMMU entry miss(When device
> request from I/O transaction), OS will load the mapping and update the SMMU
> "context" with page tables from MMU.

Ok thanks for the detailed info, but what about the case where the same
iommu group is used by two primary processes?

I don't know how frequent it is, but if ACS is not supported by either 
the endpoint or the the root port, then you would have to share the same 
IOMMU group for all the ports of your card. Right?

> Let me add the background for why this feature is required in DPDK to
> enable NPU style co-processors.
> 
> The traditional NICs the Rx path code look like this:
> 1) On control path, Fill the mempool with buffers
> 2) on rx_burst(), alloc the mbuf from mempool
> 3) SW has the mbuf in hand(which is a virtual address) and program the
> HW with mbuf->buf_physaddr)
> 4) Return the last pushed mbuf(will be updated by HW by now)
> 
> 
> On NPU style co-processors, situation is different as the buffer recycling
> has been done in HW unlike SW model. Here is the data flow:
> 1) On control path, Fill the HW mempool with buffers(Obviously the IOVA
> address, which is PA in existing model)
> 2) on rx_burst, HW gives you IOVA address(as address as step 1)
> 3) As application expects VA to operate on it, rx_burst() needs to
> convert to VA from PAA. Which is very costly.
> Instead with this IOVA as VA scheme, We can avoid the cost of converting
> with help of IOMMU/SMMU.
> 
> This patch set auto detects the mode based available of type devices in
> bus and provides an option to override mode based on eal argument, so we
> don't foresee any issue with this approach and welcome any alternative
> approaches.

I don't question the need of the feature for these kind of
co-processors, using VA as IOVA in your case seems very valid.

What concerns me is that we change the default behavior for all other
devices. Having an option to override is fine to me, but the default
mode should remain the same IMHO.

Wouldn't it be possible to default to VA as IOVA only when an HW mempool
is in use?

> Similar problem exists in another part of the code in DPDK,
> http://dpdk.org/browse/dpdk/tree/drivers/bus/fslmc/fslmc_vfio.c#n231
> Its a conditional compilation based approach with duplicating the vfio
> code and we are trying to fix the problem in a generic way so that
> everyone can get benefited out of it.
> 
> Comments are welcome.

Thanks,
Maxime

> /Jerin
> 
>>
>> By using physical addresses, you are safe against this problem.
>>
>> Any thoughts?
>>
>> Cheers,
>> Maxime
  
Jerin Jacob July 6, 2017, 9:49 a.m. UTC | #4
-----Original Message-----
> Date: Thu, 6 Jul 2017 09:58:41 +0200
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>  thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org,
>  hemant.agrawal@nxp.com, shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>  before mapping
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>  Thunderbird/52.1.0
> 
> 
> 
> On 07/05/2017 05:43 PM, Jerin Jacob wrote:
> > -----Original Message-----
> > > Date: Wed, 5 Jul 2017 11:14:01 +0200
> > > From: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > To: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
> > >   thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org
> > > CC: jerin.jacob@caviumnetworks.com, hemant.agrawal@nxp.com,
> > >   shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
> > > Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
> > >   before mapping
> > > User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
> > >   Thunderbird/52.1.0
> > > 
> > > 
> > > 
> > > On 06/08/2017 01:05 PM, Santosh Shukla wrote:
> > > > Check iova mode and accordingly map iova to pa or va.
> > > > 
> > > > Signed-off-by: Santosh Shukla<santosh.shukla@caviumnetworks.com>
> > > > Signed-off-by: Jerin Jacob<jerin.jacob@caviumnetworks.com>
> > > > ---
> > > >    lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
> > > >    1 file changed, 8 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > > index 04914406f..348b7a7f4 100644
> > > > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > > @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
> > > >    		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
> > > >    		dma_map.vaddr = ms[i].addr_64;
> > > >    		dma_map.size = ms[i].len;
> > > > -		dma_map.iova = ms[i].phys_addr;
> > > > +		if (rte_eal_iova_mode() == RTE_IOVA_VA)
> > > > +			dma_map.iova = dma_map.vaddr;
> > > > +		else
> > > > +			dma_map.iova = ms[i].phys_addr;
> > > >    		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> > > 
> > > IIUC, it is changing default behavior for VFIO devices.
> > > 
> > > I see a possible problem, but I'm not sure the case is valid.
> > > 
> > > Imagine you have two devices in the iommu group, and the two devices are
> > > used in separate processes. Each process could try two different
> > > physical addresses at the same virtual address, and so the second map
> > > would fail.
> > 
> > IMO, Doesn't look like a problem. Here is the data flow
> > 
> > 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only
> > on primary process
> > http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359
> > 
> > 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure
> > that, the Secondary process has the _same_ virtual address as primary or
> > exit from on attach.
> > http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452
> > 
> > 3) Since secondary process adds the mapped the virtual address in step (2).
> > in the page table in OS. On SMMU entry miss(When device
> > request from I/O transaction), OS will load the mapping and update the SMMU
> > "context" with page tables from MMU.
> 
> Ok thanks for the detailed info, but what about the case where the same
> iommu group is used by two primary processes?

Does that case exist with DPDK? We always need to blacklist same BDF in
the secondary process to make things work with existing DPDK setup. Which
make sense as well. Only primary process configures the HW blocks.

> 
> I don't know how frequent it is, but if ACS is not supported by either the
> endpoint or the the root port, then you would have to share the same IOMMU
> group for all the ports of your card. Right?

ACS is supported in our card(it not in bypass mode) and one mempool PCI BDF
comes as a IOMMU group.

If it in bypass mode anyway you use in vfio-noiommu mode as
there is no protection anyway.

> 
> > Let me add the background for why this feature is required in DPDK to
> > enable NPU style co-processors.
> > 
> > The traditional NICs the Rx path code look like this:
> > 1) On control path, Fill the mempool with buffers
> > 2) on rx_burst(), alloc the mbuf from mempool
> > 3) SW has the mbuf in hand(which is a virtual address) and program the
> > HW with mbuf->buf_physaddr)
> > 4) Return the last pushed mbuf(will be updated by HW by now)
> > 
> > 
> > On NPU style co-processors, situation is different as the buffer recycling
> > has been done in HW unlike SW model. Here is the data flow:
> > 1) On control path, Fill the HW mempool with buffers(Obviously the IOVA
> > address, which is PA in existing model)
> > 2) on rx_burst, HW gives you IOVA address(as address as step 1)
> > 3) As application expects VA to operate on it, rx_burst() needs to
> > convert to VA from PAA. Which is very costly.
> > Instead with this IOVA as VA scheme, We can avoid the cost of converting
> > with help of IOMMU/SMMU.
> > 
> > This patch set auto detects the mode based available of type devices in
> > bus and provides an option to override mode based on eal argument, so we
> > don't foresee any issue with this approach and welcome any alternative
> > approaches.
> 
> I don't question the need of the feature for these kind of
> co-processors, using VA as IOVA in your case seems very valid.
> 
> What concerns me is that we change the default behavior for all other
> devices. Having an option to override is fine to me, but the default
> mode should remain the same IMHO.

Doesn't seems to be a technical point. But I agree with your concern.
we will address it.
I think, we have two ways to address it.

option 1:
- In existing patch,
a) we are currently setting(internal_cfg->iova_mode = RTE_IOVA_PA)
  http://dpdk.org/dev/patchwork/patch/25192
b) only when with eal argument sets to RTE_IOVA_VA and then bus probed
value == RTE_IOVA_VA the final mode will be RTE_IOVA_VA
http://dpdk.org/dev/patchwork/patch/25193/
check the code after rte_bus_scan()

option 2:
On rte_pci_get_iommu_class() in http://dpdk.org/dev/patchwork/patch/25190/
we can check the rte_pci_device.id.vendor_id == CAVIUM to select the
mode so other type of devices safe.

I think, option 2 makes sense, as it gives foolproof auto detection scheme and
without effecting any other devices that not interested in this scheme

Does that address your concern about the patchset?

> Wouldn't it be possible to default to VA as IOVA only when an HW mempool
> is in use?

It will be too late as in the normal scheme of things, application
creates the pool.

> 
> > Similar problem exists in another part of the code in DPDK,
> > http://dpdk.org/browse/dpdk/tree/drivers/bus/fslmc/fslmc_vfio.c#n231
> > Its a conditional compilation based approach with duplicating the vfio
> > code and we are trying to fix the problem in a generic way so that
> > everyone can get benefited out of it.
> > 
> > Comments are welcome.
> 
> Thanks,
> Maxime
> 
> > /Jerin
> > 
> > > 
> > > By using physical addresses, you are safe against this problem.
> > > 
> > > Any thoughts?
> > > 
> > > Cheers,
> > > Maxime
  
Maxime Coquelin July 6, 2017, 10:59 a.m. UTC | #5
On 07/06/2017 11:49 AM, Jerin Jacob wrote:
> -----Original Message-----
>> Date: Thu, 6 Jul 2017 09:58:41 +0200
>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>   thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org,
>>   hemant.agrawal@nxp.com, shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>   before mapping
>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>   Thunderbird/52.1.0
>>
>>
>>
>> On 07/05/2017 05:43 PM, Jerin Jacob wrote:
>>> -----Original Message-----
>>>> Date: Wed, 5 Jul 2017 11:14:01 +0200
>>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>> To: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>>>    thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org
>>>> CC: jerin.jacob@caviumnetworks.com, hemant.agrawal@nxp.com,
>>>>    shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>>    before mapping
>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>    Thunderbird/52.1.0
>>>>
>>>>
>>>>
>>>> On 06/08/2017 01:05 PM, Santosh Shukla wrote:
>>>>> Check iova mode and accordingly map iova to pa or va.
>>>>>
>>>>> Signed-off-by: Santosh Shukla<santosh.shukla@caviumnetworks.com>
>>>>> Signed-off-by: Jerin Jacob<jerin.jacob@caviumnetworks.com>
>>>>> ---
>>>>>     lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
>>>>>     1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>> index 04914406f..348b7a7f4 100644
>>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
>>>>>     		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
>>>>>     		dma_map.vaddr = ms[i].addr_64;
>>>>>     		dma_map.size = ms[i].len;
>>>>> -		dma_map.iova = ms[i].phys_addr;
>>>>> +		if (rte_eal_iova_mode() == RTE_IOVA_VA)
>>>>> +			dma_map.iova = dma_map.vaddr;
>>>>> +		else
>>>>> +			dma_map.iova = ms[i].phys_addr;
>>>>>     		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>>>>
>>>> IIUC, it is changing default behavior for VFIO devices.
>>>>
>>>> I see a possible problem, but I'm not sure the case is valid.
>>>>
>>>> Imagine you have two devices in the iommu group, and the two devices are
>>>> used in separate processes. Each process could try two different
>>>> physical addresses at the same virtual address, and so the second map
>>>> would fail.
>>>
>>> IMO, Doesn't look like a problem. Here is the data flow
>>>
>>> 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only
>>> on primary process
>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359
>>>
>>> 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure
>>> that, the Secondary process has the _same_ virtual address as primary or
>>> exit from on attach.
>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452
>>>
>>> 3) Since secondary process adds the mapped the virtual address in step (2).
>>> in the page table in OS. On SMMU entry miss(When device
>>> request from I/O transaction), OS will load the mapping and update the SMMU
>>> "context" with page tables from MMU.
>>
>> Ok thanks for the detailed info, but what about the case where the same
>> iommu group is used by two primary processes?
> 
> Does that case exist with DPDK? We always need to blacklist same BDF in
> the secondary process to make things work with existing DPDK setup. Which
> make sense as well. Only primary process configures the HW blocks.

I meant the case when two BDF are in the same IOMMU group (if ACS is not
supported at some point in the hierarchy). And I meant two primary
processes running, like for example two containers running each a DPDK
application.

Maybe this is not a valid use-case (it is not secure, as it would break
isolation between the two containers), but it seems that it is something
DPDK allows today, if I'm not mistaken.

>>
>> I don't know how frequent it is, but if ACS is not supported by either the
>> endpoint or the the root port, then you would have to share the same IOMMU
>> group for all the ports of your card. Right?
> 
> ACS is supported in our card(it not in bypass mode) and one mempool PCI BDF
> comes as a IOMMU group.
> 
> If it in bypass mode anyway you use in vfio-noiommu mode as
> there is no protection anyway.
Yes.

>>
>>> Let me add the background for why this feature is required in DPDK to
>>> enable NPU style co-processors.
>>>
>>> The traditional NICs the Rx path code look like this:
>>> 1) On control path, Fill the mempool with buffers
>>> 2) on rx_burst(), alloc the mbuf from mempool
>>> 3) SW has the mbuf in hand(which is a virtual address) and program the
>>> HW with mbuf->buf_physaddr)
>>> 4) Return the last pushed mbuf(will be updated by HW by now)
>>>
>>>
>>> On NPU style co-processors, situation is different as the buffer recycling
>>> has been done in HW unlike SW model. Here is the data flow:
>>> 1) On control path, Fill the HW mempool with buffers(Obviously the IOVA
>>> address, which is PA in existing model)
>>> 2) on rx_burst, HW gives you IOVA address(as address as step 1)
>>> 3) As application expects VA to operate on it, rx_burst() needs to
>>> convert to VA from PAA. Which is very costly.
>>> Instead with this IOVA as VA scheme, We can avoid the cost of converting
>>> with help of IOMMU/SMMU.
>>>
>>> This patch set auto detects the mode based available of type devices in
>>> bus and provides an option to override mode based on eal argument, so we
>>> don't foresee any issue with this approach and welcome any alternative
>>> approaches.
>>
>> I don't question the need of the feature for these kind of
>> co-processors, using VA as IOVA in your case seems very valid.
>>
>> What concerns me is that we change the default behavior for all other
>> devices. Having an option to override is fine to me, but the default
>> mode should remain the same IMHO.
> 
> Doesn't seems to be a technical point. But I agree with your concern.
> we will address it.
> I think, we have two ways to address it.
> 
> option 1:
> - In existing patch,
> a) we are currently setting(internal_cfg->iova_mode = RTE_IOVA_PA)
>    http://dpdk.org/dev/patchwork/patch/25192
> b) only when with eal argument sets to RTE_IOVA_VA and then bus probed
> value == RTE_IOVA_VA the final mode will be RTE_IOVA_VA
> http://dpdk.org/dev/patchwork/patch/25193/
> check the code after rte_bus_scan()
> 
> option 2:
> On rte_pci_get_iommu_class() in http://dpdk.org/dev/patchwork/patch/25190/
> we can check the rte_pci_device.id.vendor_id == CAVIUM to select the
> mode so other type of devices safe.
> 
> I think, option 2 makes sense, as it gives foolproof auto detection scheme and
> without effecting any other devices that not interested in this scheme
> 
> Does that address your concern about the patchset?

Yes it does, or maybe create a new flag in struct rte_pci_driver's
drv_flags to provide a hint it prefers to use VA as IOVA?

It, of course, would just be a hint, and should be set only when other
conditions are met.

>> Wouldn't it be possible to default to VA as IOVA only when an HW mempool
>> is in use?
> 
> It will be too late as in the normal scheme of things, application
> creates the pool.

OK, makes sense.

Thanks,
Maxime

>>
>>> Similar problem exists in another part of the code in DPDK,
>>> http://dpdk.org/browse/dpdk/tree/drivers/bus/fslmc/fslmc_vfio.c#n231
>>> Its a conditional compilation based approach with duplicating the vfio
>>> code and we are trying to fix the problem in a generic way so that
>>> everyone can get benefited out of it.
>>>
>>> Comments are welcome.
>>
>> Thanks,
>> Maxime
>>
>>> /Jerin
>>>
>>>>
>>>> By using physical addresses, you are safe against this problem.
>>>>
>>>> Any thoughts?
>>>>
>>>> Cheers,
>>>> Maxime
  
Jerin Jacob July 6, 2017, 11:12 a.m. UTC | #6
-----Original Message-----
> Date: Thu, 6 Jul 2017 12:59:04 +0200
> From: Maxime Coquelin <maxime.coquelin@redhat.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>  thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org,
>  hemant.agrawal@nxp.com, shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>  before mapping
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>  Thunderbird/52.1.0
> 
> 
> 
> On 07/06/2017 11:49 AM, Jerin Jacob wrote:
> > -----Original Message-----
> > > Date: Thu, 6 Jul 2017 09:58:41 +0200
> > > From: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
> > >   thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org,
> > >   hemant.agrawal@nxp.com, shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
> > > Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
> > >   before mapping
> > > User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
> > >   Thunderbird/52.1.0
> > > 
> > > 
> > > 
> > > > > 
> > > > > 
> > > > > On 06/08/2017 01:05 PM, Santosh Shukla wrote:
> > > > > > Check iova mode and accordingly map iova to pa or va.
> > > > > > 
> > > > > > Signed-off-by: Santosh Shukla<santosh.shukla@caviumnetworks.com>
> > > > > > Signed-off-by: Jerin Jacob<jerin.jacob@caviumnetworks.com>
> > > > > > ---
> > > > > >     lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
> > > > > >     1 file changed, 8 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > > > > index 04914406f..348b7a7f4 100644
> > > > > > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > > > > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > > > > @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
> > > > > >     		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
> > > > > >     		dma_map.vaddr = ms[i].addr_64;
> > > > > >     		dma_map.size = ms[i].len;
> > > > > > -		dma_map.iova = ms[i].phys_addr;
> > > > > > +		if (rte_eal_iova_mode() == RTE_IOVA_VA)
> > > > > > +			dma_map.iova = dma_map.vaddr;
> > > > > > +		else
> > > > > > +			dma_map.iova = ms[i].phys_addr;
> > > > > >     		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> > > > > 
> > > > > IIUC, it is changing default behavior for VFIO devices.
> > > > > 
> > > > > I see a possible problem, but I'm not sure the case is valid.
> > > > > 
> > > > > Imagine you have two devices in the iommu group, and the two devices are
> > > > > used in separate processes. Each process could try two different
> > > > > physical addresses at the same virtual address, and so the second map
> > > > > would fail.
> > > > 
> > > > IMO, Doesn't look like a problem. Here is the data flow
> > > > 
> > > > 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only
> > > > on primary process
> > > > http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359
> > > > 
> > > > 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure
> > > > that, the Secondary process has the _same_ virtual address as primary or
> > > > exit from on attach.
> > > > http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452
> > > > 
> > > > 3) Since secondary process adds the mapped the virtual address in step (2).
> > > > in the page table in OS. On SMMU entry miss(When device
> > > > request from I/O transaction), OS will load the mapping and update the SMMU
> > > > "context" with page tables from MMU.
> > > 
> > > Ok thanks for the detailed info, but what about the case where the same
> > > iommu group is used by two primary processes?
> > 
> > Does that case exist with DPDK? We always need to blacklist same BDF in
> > the secondary process to make things work with existing DPDK setup. Which
> > make sense as well. Only primary process configures the HW blocks.
> 
> I meant the case when two BDF are in the same IOMMU group (if ACS is not
> supported at some point in the hierarchy). And I meant two primary
> processes running, like for example two containers running each a DPDK
> application.
> 
> Maybe this is not a valid use-case (it is not secure, as it would break
> isolation between the two containers), but it seems that it is something
> DPDK allows today, if I'm not mistaken.

Not sure. Doesn't seems to valid case with VFIO as without ACS anyway
it will break security (the all point of IOMMU protection == VFIO)

> 
> > > 
> > > I don't know how frequent it is, but if ACS is not supported by either the
> > > endpoint or the the root port, then you would have to share the same IOMMU
> > > group for all the ports of your card. Right?
> > 
> > ACS is supported in our card(it not in bypass mode) and one mempool PCI BDF
> > comes as a IOMMU group.
> > 
> > If it in bypass mode anyway you use in vfio-noiommu mode as
> > there is no protection anyway.
> > > What concerns me is that we change the default behavior for all other
> > > devices. Having an option to override is fine to me, but the default
> > > mode should remain the same IMHO.
> > 
> > Doesn't seems to be a technical point. But I agree with your concern.
> > we will address it.
> > I think, we have two ways to address it.
> > 
> > option 1:
> > - In existing patch,
> > a) we are currently setting(internal_cfg->iova_mode = RTE_IOVA_PA)
> >    http://dpdk.org/dev/patchwork/patch/25192
> > b) only when with eal argument sets to RTE_IOVA_VA and then bus probed
> > value == RTE_IOVA_VA the final mode will be RTE_IOVA_VA
> > http://dpdk.org/dev/patchwork/patch/25193/
> > check the code after rte_bus_scan()
> > 
> > option 2:
> > On rte_pci_get_iommu_class() in http://dpdk.org/dev/patchwork/patch/25190/
> > we can check the rte_pci_device.id.vendor_id == CAVIUM to select the
> > mode so other type of devices safe.
> > 
> > I think, option 2 makes sense, as it gives foolproof auto detection scheme and
> > without effecting any other devices that not interested in this scheme
> > 
> > Does that address your concern about the patchset?
> 
> Yes it does, or maybe create a new flag in struct rte_pci_driver's
> drv_flags to provide a hint it prefers to use VA as IOVA?
> 
> It, of course, would just be a hint, and should be set only when other
> conditions are met.

Yes. Makes sense. We will roll out v2 with option2 + your rte_pci_driver
suggestion.

Thanks a lot for the review. Appreciated.
  
Santosh Shukla July 6, 2017, 11:19 a.m. UTC | #7
On Thursday 06 July 2017 04:29 PM, Maxime Coquelin wrote:

>
> On 07/06/2017 11:49 AM, Jerin Jacob wrote:
>> -----Original Message-----
>>> Date: Thu, 6 Jul 2017 09:58:41 +0200
>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>>> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>>   thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org,
>>>   hemant.agrawal@nxp.com, shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>   before mapping
>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>   Thunderbird/52.1.0
>>>
>>>
>>>
>>> On 07/05/2017 05:43 PM, Jerin Jacob wrote:
>>>> -----Original Message-----
>>>>> Date: Wed, 5 Jul 2017 11:14:01 +0200
>>>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>> To: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>>>>    thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org
>>>>> CC: jerin.jacob@caviumnetworks.com, hemant.agrawal@nxp.com,
>>>>>    shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>>>    before mapping
>>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>>    Thunderbird/52.1.0
>>>>>
>>>>>
>>>>>
>>>>> On 06/08/2017 01:05 PM, Santosh Shukla wrote:
>>>>>> Check iova mode and accordingly map iova to pa or va.
>>>>>>
>>>>>> Signed-off-by: Santosh Shukla<santosh.shukla@caviumnetworks.com>
>>>>>> Signed-off-by: Jerin Jacob<jerin.jacob@caviumnetworks.com>
>>>>>> ---
>>>>>>     lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
>>>>>>     1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> index 04914406f..348b7a7f4 100644
>>>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
>>>>>>             dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
>>>>>>             dma_map.vaddr = ms[i].addr_64;
>>>>>>             dma_map.size = ms[i].len;
>>>>>> -        dma_map.iova = ms[i].phys_addr;
>>>>>> +        if (rte_eal_iova_mode() == RTE_IOVA_VA)
>>>>>> +            dma_map.iova = dma_map.vaddr;
>>>>>> +        else
>>>>>> +            dma_map.iova = ms[i].phys_addr;
>>>>>>             dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>>>>>
>>>>> IIUC, it is changing default behavior for VFIO devices.
>>>>>
>>>>> I see a possible problem, but I'm not sure the case is valid.
>>>>>
>>>>> Imagine you have two devices in the iommu group, and the two devices are
>>>>> used in separate processes. Each process could try two different
>>>>> physical addresses at the same virtual address, and so the second map
>>>>> would fail.
>>>>
>>>> IMO, Doesn't look like a problem. Here is the data flow
>>>>
>>>> 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only
>>>> on primary process
>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359
>>>>
>>>> 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure
>>>> that, the Secondary process has the _same_ virtual address as primary or
>>>> exit from on attach.
>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452
>>>>
>>>> 3) Since secondary process adds the mapped the virtual address in step (2).
>>>> in the page table in OS. On SMMU entry miss(When device
>>>> request from I/O transaction), OS will load the mapping and update the SMMU
>>>> "context" with page tables from MMU.
>>>
>>> Ok thanks for the detailed info, but what about the case where the same
>>> iommu group is used by two primary processes?
>>
>> Does that case exist with DPDK? We always need to blacklist same BDF in
>> the secondary process to make things work with existing DPDK setup. Which
>> make sense as well. Only primary process configures the HW blocks.
>
> I meant the case when two BDF are in the same IOMMU group (if ACS is not
> supported at some point in the hierarchy). And I meant two primary
> processes running, like for example two containers running each a DPDK
> application.
>
> Maybe this is not a valid use-case (it is not secure, as it would break
> isolation between the two containers), but it seems that it is something
> DPDK allows today, if I'm not mistaken.
>
I'm not sure how two primary process could run, as because latter primary process
would try accessing /var/run/.rte_config and would fail at this [1] point.

It's not valid use-case for dpdk (imo).
[1] http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal.c#n204

>>>
>>> I don't know how frequent it is, but if ACS is not supported by either the
>>> endpoint or the the root port, then you would have to share the same IOMMU
>>> group for all the ports of your card. Right?
>>
>> ACS is supported in our card(it not in bypass mode) and one mempool PCI BDF
>> comes as a IOMMU group.
>>
>> If it in bypass mode anyway you use in vfio-noiommu mode as
>> there is no protection anyway.
> Yes.
>
>>>
>>>> Let me add the background for why this feature is required in DPDK to
>>>> enable NPU style co-processors.
>>>>
>>>> The traditional NICs the Rx path code look like this:
>>>> 1) On control path, Fill the mempool with buffers
>>>> 2) on rx_burst(), alloc the mbuf from mempool
>>>> 3) SW has the mbuf in hand(which is a virtual address) and program the
>>>> HW with mbuf->buf_physaddr)
>>>> 4) Return the last pushed mbuf(will be updated by HW by now)
>>>>
>>>>
>>>> On NPU style co-processors, situation is different as the buffer recycling
>>>> has been done in HW unlike SW model. Here is the data flow:
>>>> 1) On control path, Fill the HW mempool with buffers(Obviously the IOVA
>>>> address, which is PA in existing model)
>>>> 2) on rx_burst, HW gives you IOVA address(as address as step 1)
>>>> 3) As application expects VA to operate on it, rx_burst() needs to
>>>> convert to VA from PAA. Which is very costly.
>>>> Instead with this IOVA as VA scheme, We can avoid the cost of converting
>>>> with help of IOMMU/SMMU.
>>>>
>>>> This patch set auto detects the mode based available of type devices in
>>>> bus and provides an option to override mode based on eal argument, so we
>>>> don't foresee any issue with this approach and welcome any alternative
>>>> approaches.
>>>
>>> I don't question the need of the feature for these kind of
>>> co-processors, using VA as IOVA in your case seems very valid.
>>>
>>> What concerns me is that we change the default behavior for all other
>>> devices. Having an option to override is fine to me, but the default
>>> mode should remain the same IMHO.
>>
>> Doesn't seems to be a technical point. But I agree with your concern.
>> we will address it.
>> I think, we have two ways to address it.
>>
>> option 1:
>> - In existing patch,
>> a) we are currently setting(internal_cfg->iova_mode = RTE_IOVA_PA)
>>    http://dpdk.org/dev/patchwork/patch/25192
>> b) only when with eal argument sets to RTE_IOVA_VA and then bus probed
>> value == RTE_IOVA_VA the final mode will be RTE_IOVA_VA
>> http://dpdk.org/dev/patchwork/patch/25193/
>> check the code after rte_bus_scan()
>>
>> option 2:
>> On rte_pci_get_iommu_class() in http://dpdk.org/dev/patchwork/patch/25190/
>> we can check the rte_pci_device.id.vendor_id == CAVIUM to select the
>> mode so other type of devices safe.
>>
>> I think, option 2 makes sense, as it gives foolproof auto detection scheme and
>> without effecting any other devices that not interested in this scheme
>>
>> Does that address your concern about the patchset?
>
> Yes it does, or maybe create a new flag in struct rte_pci_driver's
> drv_flags to provide a hint it prefers to use VA as IOVA?
>
> It, of course, would just be a hint, and should be set only when other
> conditions are met.
>
>>> Wouldn't it be possible to default to VA as IOVA only when an HW mempool
>>> is in use?
>>
>> It will be too late as in the normal scheme of things, application
>> creates the pool.
>
> OK, makes sense.
>
> Thanks,
> Maxime
>
>>>
>>>> Similar problem exists in another part of the code in DPDK,
>>>> http://dpdk.org/browse/dpdk/tree/drivers/bus/fslmc/fslmc_vfio.c#n231
>>>> Its a conditional compilation based approach with duplicating the vfio
>>>> code and we are trying to fix the problem in a generic way so that
>>>> everyone can get benefited out of it.
>>>>
>>>> Comments are welcome.
>>>
>>> Thanks,
>>> Maxime
>>>
>>>> /Jerin
>>>>
>>>>>
>>>>> By using physical addresses, you are safe against this problem.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Cheers,
>>>>> Maxime
  
Maxime Coquelin July 6, 2017, 1:08 p.m. UTC | #8
On 07/06/2017 01:19 PM, santosh wrote:
> On Thursday 06 July 2017 04:29 PM, Maxime Coquelin wrote:
> 
>>
>> On 07/06/2017 11:49 AM, Jerin Jacob wrote:
>>> -----Original Message-----
>>>> Date: Thu, 6 Jul 2017 09:58:41 +0200
>>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>>>> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>>>    thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org,
>>>>    hemant.agrawal@nxp.com, shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>>    before mapping
>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>    Thunderbird/52.1.0
>>>>
>>>>
>>>>
>>>> On 07/05/2017 05:43 PM, Jerin Jacob wrote:
>>>>> -----Original Message-----
>>>>>> Date: Wed, 5 Jul 2017 11:14:01 +0200
>>>>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>>> To: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>>>>>     thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org
>>>>>> CC: jerin.jacob@caviumnetworks.com, hemant.agrawal@nxp.com,
>>>>>>     shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>>>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>>>>     before mapping
>>>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>>>     Thunderbird/52.1.0
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 06/08/2017 01:05 PM, Santosh Shukla wrote:
>>>>>>> Check iova mode and accordingly map iova to pa or va.
>>>>>>>
>>>>>>> Signed-off-by: Santosh Shukla<santosh.shukla@caviumnetworks.com>
>>>>>>> Signed-off-by: Jerin Jacob<jerin.jacob@caviumnetworks.com>
>>>>>>> ---
>>>>>>>      lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
>>>>>>>      1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>> index 04914406f..348b7a7f4 100644
>>>>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
>>>>>>>              dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
>>>>>>>              dma_map.vaddr = ms[i].addr_64;
>>>>>>>              dma_map.size = ms[i].len;
>>>>>>> -        dma_map.iova = ms[i].phys_addr;
>>>>>>> +        if (rte_eal_iova_mode() == RTE_IOVA_VA)
>>>>>>> +            dma_map.iova = dma_map.vaddr;
>>>>>>> +        else
>>>>>>> +            dma_map.iova = ms[i].phys_addr;
>>>>>>>              dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>>>>>>
>>>>>> IIUC, it is changing default behavior for VFIO devices.
>>>>>>
>>>>>> I see a possible problem, but I'm not sure the case is valid.
>>>>>>
>>>>>> Imagine you have two devices in the iommu group, and the two devices are
>>>>>> used in separate processes. Each process could try two different
>>>>>> physical addresses at the same virtual address, and so the second map
>>>>>> would fail.
>>>>>
>>>>> IMO, Doesn't look like a problem. Here is the data flow
>>>>>
>>>>> 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only
>>>>> on primary process
>>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359
>>>>>
>>>>> 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure
>>>>> that, the Secondary process has the _same_ virtual address as primary or
>>>>> exit from on attach.
>>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452
>>>>>
>>>>> 3) Since secondary process adds the mapped the virtual address in step (2).
>>>>> in the page table in OS. On SMMU entry miss(When device
>>>>> request from I/O transaction), OS will load the mapping and update the SMMU
>>>>> "context" with page tables from MMU.
>>>>
>>>> Ok thanks for the detailed info, but what about the case where the same
>>>> iommu group is used by two primary processes?
>>>
>>> Does that case exist with DPDK? We always need to blacklist same BDF in
>>> the secondary process to make things work with existing DPDK setup. Which
>>> make sense as well. Only primary process configures the HW blocks.
>>
>> I meant the case when two BDF are in the same IOMMU group (if ACS is not
>> supported at some point in the hierarchy). And I meant two primary
>> processes running, like for example two containers running each a DPDK
>> application.
>>
>> Maybe this is not a valid use-case (it is not secure, as it would break
>> isolation between the two containers), but it seems that it is something
>> DPDK allows today, if I'm not mistaken.
>>
> I'm not sure how two primary process could run, as because latter primary process
> would try accessing /var/run/.rte_config and would fail at this [1] point.
> 
> It's not valid use-case for dpdk (imo).
> [1] http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal.c#n204

Yes this is possible. I had never used it before, but Thomas told me it
is supported by setting--file-prefix option. I had a trial, and I
confirm it works:
session 1> ./install/bin/testpmd -l 0,2 --socket-mem=1024 -w 
0000:05:00.0 --proc-type=primary --file-prefix=app1 -- --disable-hw-vlan 
-i --rxq=1 --txq=1         --nb-cores=1 --forward-mode=io
session 2> ./install/bin/testpmd -l 0,3 --socket-mem=1024 -w 
0000:05:00.1 --proc-type=primary --file-prefix=app2 -- --disable-hw-vlan 
-i --rxq=1 --txq=1         --nb-cores=1 --forward-mode=io

In the above example, two ports of the same card is used by two
processes. Note that in this case, ACS is supproted and both ports have
their own iommu group.

Maxime
  
Maxime Coquelin July 6, 2017, 1:11 p.m. UTC | #9
On 07/06/2017 03:08 PM, Maxime Coquelin wrote:
> 
> 
> On 07/06/2017 01:19 PM, santosh wrote:
>> On Thursday 06 July 2017 04:29 PM, Maxime Coquelin wrote:
>>
>>>
>>> On 07/06/2017 11:49 AM, Jerin Jacob wrote:
>>>> -----Original Message-----
>>>>> Date: Thu, 6 Jul 2017 09:58:41 +0200
>>>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>>>>> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>>>>    thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org,
>>>>>    hemant.agrawal@nxp.com, shreyansh.jain@nxp.com, 
>>>>> gaetan.rivet@6wind.com
>>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova 
>>>>> mode
>>>>>    before mapping
>>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>>    Thunderbird/52.1.0
>>>>>
>>>>>
>>>>>
>>>>> On 07/05/2017 05:43 PM, Jerin Jacob wrote:
>>>>>> -----Original Message-----
>>>>>>> Date: Wed, 5 Jul 2017 11:14:01 +0200
>>>>>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>>>> To: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>>>>>>     thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org
>>>>>>> CC: jerin.jacob@caviumnetworks.com, hemant.agrawal@nxp.com,
>>>>>>>     shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>>>>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor 
>>>>>>> iova mode
>>>>>>>     before mapping
>>>>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>>>>     Thunderbird/52.1.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 06/08/2017 01:05 PM, Santosh Shukla wrote:
>>>>>>>> Check iova mode and accordingly map iova to pa or va.
>>>>>>>>
>>>>>>>> Signed-off-by: Santosh Shukla<santosh.shukla@caviumnetworks.com>
>>>>>>>> Signed-off-by: Jerin Jacob<jerin.jacob@caviumnetworks.com>
>>>>>>>> ---
>>>>>>>>      lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
>>>>>>>>      1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c 
>>>>>>>> b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>>> index 04914406f..348b7a7f4 100644
>>>>>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>>> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
>>>>>>>>              dma_map.argsz = sizeof(struct 
>>>>>>>> vfio_iommu_type1_dma_map);
>>>>>>>>              dma_map.vaddr = ms[i].addr_64;
>>>>>>>>              dma_map.size = ms[i].len;
>>>>>>>> -        dma_map.iova = ms[i].phys_addr;
>>>>>>>> +        if (rte_eal_iova_mode() == RTE_IOVA_VA)
>>>>>>>> +            dma_map.iova = dma_map.vaddr;
>>>>>>>> +        else
>>>>>>>> +            dma_map.iova = ms[i].phys_addr;
>>>>>>>>              dma_map.flags = VFIO_DMA_MAP_FLAG_READ | 
>>>>>>>> VFIO_DMA_MAP_FLAG_WRITE;
>>>>>>>
>>>>>>> IIUC, it is changing default behavior for VFIO devices.
>>>>>>>
>>>>>>> I see a possible problem, but I'm not sure the case is valid.
>>>>>>>
>>>>>>> Imagine you have two devices in the iommu group, and the two 
>>>>>>> devices are
>>>>>>> used in separate processes. Each process could try two different
>>>>>>> physical addresses at the same virtual address, and so the second 
>>>>>>> map
>>>>>>> would fail.
>>>>>>
>>>>>> IMO, Doesn't look like a problem. Here is the data flow
>>>>>>
>>>>>> 1) The vfio DMA map function(vfio_type1_dma_map()) will be called 
>>>>>> only
>>>>>> on primary process
>>>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359 
>>>>>>
>>>>>>
>>>>>> 2) On secondary process, DPDK rte_eal_huge_page_attach() will make 
>>>>>> sure
>>>>>> that, the Secondary process has the _same_ virtual address as 
>>>>>> primary or
>>>>>> exit from on attach.
>>>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452 
>>>>>>
>>>>>>
>>>>>> 3) Since secondary process adds the mapped the virtual address in 
>>>>>> step (2).
>>>>>> in the page table in OS. On SMMU entry miss(When device
>>>>>> request from I/O transaction), OS will load the mapping and update 
>>>>>> the SMMU
>>>>>> "context" with page tables from MMU.
>>>>>
>>>>> Ok thanks for the detailed info, but what about the case where the 
>>>>> same
>>>>> iommu group is used by two primary processes?
>>>>
>>>> Does that case exist with DPDK? We always need to blacklist same BDF in
>>>> the secondary process to make things work with existing DPDK setup. 
>>>> Which
>>>> make sense as well. Only primary process configures the HW blocks.
>>>
>>> I meant the case when two BDF are in the same IOMMU group (if ACS is not
>>> supported at some point in the hierarchy). And I meant two primary
>>> processes running, like for example two containers running each a DPDK
>>> application.
>>>
>>> Maybe this is not a valid use-case (it is not secure, as it would break
>>> isolation between the two containers), but it seems that it is something
>>> DPDK allows today, if I'm not mistaken.
>>>
>> I'm not sure how two primary process could run, as because latter 
>> primary process
>> would try accessing /var/run/.rte_config and would fail at this [1] 
>> point.
>>
>> It's not valid use-case for dpdk (imo).
>> [1] 
>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal.c#n204
> 
> Yes this is possible. I had never used it before, but Thomas told me it
> is supported by setting--file-prefix option. I had a trial, and I
> confirm it works:
> session 1> ./install/bin/testpmd -l 0,2 --socket-mem=1024 -w 
> 0000:05:00.0 --proc-type=primary --file-prefix=app1 -- --disable-hw-vlan 
> -i --rxq=1 --txq=1         --nb-cores=1 --forward-mode=io
> session 2> ./install/bin/testpmd -l 0,3 --socket-mem=1024 -w 
> 0000:05:00.1 --proc-type=primary --file-prefix=app2 -- --disable-hw-vlan 
> -i --rxq=1 --txq=1         --nb-cores=1 --forward-mode=io
> 
> In the above example, two ports of the same card is used by two
> processes. Note that in this case, ACS is supproted and both ports have
> their own iommu group.

# ls -al /var/run/.app*
-rw-r-----. 1 root root 208420 Jul  6 09:08 /var/run/.app1_config
-rw-r--r--. 1 root root  49728 Jul  6 09:08 /var/run/.app1_hugepage_info
srwxr-xr-x. 1 root root      0 Jul  6 09:08 /var/run/.app1_mp_socket
-rw-r-----. 1 root root 208420 Jul  6 09:08 /var/run/.app2_config
-rw-r--r--. 1 root root  45584 Jul  6 09:08 /var/run/.app2_hugepage_info
srwxr-xr-x. 1 root root      0 Jul  6 09:08 /var/run/.app2_mp_socket
  
Santosh Shukla July 6, 2017, 2:13 p.m. UTC | #10
On Thursday 06 July 2017 06:41 PM, Maxime Coquelin wrote:

>
> On 07/06/2017 03:08 PM, Maxime Coquelin wrote:
>>
>>
>> On 07/06/2017 01:19 PM, santosh wrote:
>>> On Thursday 06 July 2017 04:29 PM, Maxime Coquelin wrote:
>>>
>>>>
>>>> On 07/06/2017 11:49 AM, Jerin Jacob wrote:
>>>>> -----Original Message-----
>>>>>> Date: Thu, 6 Jul 2017 09:58:41 +0200
>>>>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>>>>>> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>>>>>    thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org,
>>>>>>    hemant.agrawal@nxp.com, shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>>>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>>>>    before mapping
>>>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>>>    Thunderbird/52.1.0
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 07/05/2017 05:43 PM, Jerin Jacob wrote:
>>>>>>> -----Original Message-----
>>>>>>>> Date: Wed, 5 Jul 2017 11:14:01 +0200
>>>>>>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>>>>> To: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>>>>>>>     thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org
>>>>>>>> CC: jerin.jacob@caviumnetworks.com, hemant.agrawal@nxp.com,
>>>>>>>>     shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>>>>>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>>>>>>     before mapping
>>>>>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>>>>>     Thunderbird/52.1.0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 06/08/2017 01:05 PM, Santosh Shukla wrote:
>>>>>>>>> Check iova mode and accordingly map iova to pa or va.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Santosh Shukla<santosh.shukla@caviumnetworks.com>
>>>>>>>>> Signed-off-by: Jerin Jacob<jerin.jacob@caviumnetworks.com>
>>>>>>>>> ---
>>>>>>>>>      lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
>>>>>>>>>      1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>>>> index 04914406f..348b7a7f4 100644
>>>>>>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>>>> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
>>>>>>>>>              dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
>>>>>>>>>              dma_map.vaddr = ms[i].addr_64;
>>>>>>>>>              dma_map.size = ms[i].len;
>>>>>>>>> -        dma_map.iova = ms[i].phys_addr;
>>>>>>>>> +        if (rte_eal_iova_mode() == RTE_IOVA_VA)
>>>>>>>>> +            dma_map.iova = dma_map.vaddr;
>>>>>>>>> +        else
>>>>>>>>> +            dma_map.iova = ms[i].phys_addr;
>>>>>>>>>              dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>>>>>>>>
>>>>>>>> IIUC, it is changing default behavior for VFIO devices.
>>>>>>>>
>>>>>>>> I see a possible problem, but I'm not sure the case is valid.
>>>>>>>>
>>>>>>>> Imagine you have two devices in the iommu group, and the two devices are
>>>>>>>> used in separate processes. Each process could try two different
>>>>>>>> physical addresses at the same virtual address, and so the second map
>>>>>>>> would fail.
>>>>>>>
>>>>>>> IMO, Doesn't look like a problem. Here is the data flow
>>>>>>>
>>>>>>> 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only
>>>>>>> on primary process
>>>>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359
>>>>>>>
>>>>>>> 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure
>>>>>>> that, the Secondary process has the _same_ virtual address as primary or
>>>>>>> exit from on attach.
>>>>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452
>>>>>>>
>>>>>>> 3) Since secondary process adds the mapped the virtual address in step (2).
>>>>>>> in the page table in OS. On SMMU entry miss(When device
>>>>>>> request from I/O transaction), OS will load the mapping and update the SMMU
>>>>>>> "context" with page tables from MMU.
>>>>>>
>>>>>> Ok thanks for the detailed info, but what about the case where the same
>>>>>> iommu group is used by two primary processes?
>>>>>
>>>>> Does that case exist with DPDK? We always need to blacklist same BDF in
>>>>> the secondary process to make things work with existing DPDK setup. Which
>>>>> make sense as well. Only primary process configures the HW blocks.
>>>>
>>>> I meant the case when two BDF are in the same IOMMU group (if ACS is not
>>>> supported at some point in the hierarchy). And I meant two primary
>>>> processes running, like for example two containers running each a DPDK
>>>> application.
>>>>
>>>> Maybe this is not a valid use-case (it is not secure, as it would break
>>>> isolation between the two containers), but it seems that it is something
>>>> DPDK allows today, if I'm not mistaken.
>>>>
>>> I'm not sure how two primary process could run, as because latter primary process
>>> would try accessing /var/run/.rte_config and would fail at this [1] point.
>>>
>>> It's not valid use-case for dpdk (imo).
>>> [1] http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal.c#n204
>>
>> Yes this is possible. I had never used it before, but Thomas told me it
>> is supported by setting--file-prefix option. I had a trial, and I
>> confirm it works:
>> session 1> ./install/bin/testpmd -l 0,2 --socket-mem=1024 -w 0000:05:00.0 --proc-type=primary --file-prefix=app1 -- --disable-hw-vlan -i --rxq=1 --txq=1         --nb-cores=1 --forward-mode=io
>> session 2> ./install/bin/testpmd -l 0,3 --socket-mem=1024 -w 0000:05:00.1 --proc-type=primary --file-prefix=app2 -- --disable-hw-vlan -i --rxq=1 --txq=1         --nb-cores=1 --forward-mode=io
>>
>> In the above example, two ports of the same card is used by two
>> processes. Note that in this case, ACS is supproted and both ports have
>> their own iommu group.
>
> # ls -al /var/run/.app*
> -rw-r-----. 1 root root 208420 Jul  6 09:08 /var/run/.app1_config
> -rw-r--r--. 1 root root  49728 Jul  6 09:08 /var/run/.app1_hugepage_info
> srwxr-xr-x. 1 root root      0 Jul  6 09:08 /var/run/.app1_mp_socket
> -rw-r-----. 1 root root 208420 Jul  6 09:08 /var/run/.app2_config
> -rw-r--r--. 1 root root  45584 Jul  6 09:08 /var/run/.app2_hugepage_info
> srwxr-xr-x. 1 root root      0 Jul  6 09:08 /var/run/.app2_mp_socket
>
Yes, You're right, you can start two primary process, I missed that point. 
Use-case which you mentioned is ok, because they are under two different iommu
group so proposed scheme will work. It may not work for the case when ACS not present,
so its bypass mode which falls under vfio-noiommu category.

Having said that: Per discussion on [1]. The proposed scheme where 
bus makes decision based on pci_id and/or pci_drv will be a full proof
solution, and that way other types of devices will not be impacted. Right?

[1] https://www.mail-archive.com/dev@dpdk.org/msg70283.html
  
Maxime Coquelin July 6, 2017, 2:39 p.m. UTC | #11
On 07/06/2017 04:13 PM, santosh wrote:
> On Thursday 06 July 2017 06:41 PM, Maxime Coquelin wrote:
> 
>>
>> On 07/06/2017 03:08 PM, Maxime Coquelin wrote:
>>>
>>>
>>> On 07/06/2017 01:19 PM, santosh wrote:
>>>> On Thursday 06 July 2017 04:29 PM, Maxime Coquelin wrote:
>>>>
>>>>>
>>>>> On 07/06/2017 11:49 AM, Jerin Jacob wrote:
>>>>>> -----Original Message-----
>>>>>>> Date: Thu, 6 Jul 2017 09:58:41 +0200
>>>>>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>>>> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>>>>>>> CC: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>>>>>>     thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org,
>>>>>>>     hemant.agrawal@nxp.com, shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>>>>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>>>>>     before mapping
>>>>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>>>>     Thunderbird/52.1.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 07/05/2017 05:43 PM, Jerin Jacob wrote:
>>>>>>>> -----Original Message-----
>>>>>>>>> Date: Wed, 5 Jul 2017 11:14:01 +0200
>>>>>>>>> From: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>>>>>> To: Santosh Shukla <santosh.shukla@caviumnetworks.com>,
>>>>>>>>>      thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org
>>>>>>>>> CC: jerin.jacob@caviumnetworks.com, hemant.agrawal@nxp.com,
>>>>>>>>>      shreyansh.jain@nxp.com, gaetan.rivet@6wind.com
>>>>>>>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode
>>>>>>>>>      before mapping
>>>>>>>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>>>>>>>>>      Thunderbird/52.1.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 06/08/2017 01:05 PM, Santosh Shukla wrote:
>>>>>>>>>> Check iova mode and accordingly map iova to pa or va.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Santosh Shukla<santosh.shukla@caviumnetworks.com>
>>>>>>>>>> Signed-off-by: Jerin Jacob<jerin.jacob@caviumnetworks.com>
>>>>>>>>>> ---
>>>>>>>>>>       lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++--
>>>>>>>>>>       1 file changed, 8 insertions(+), 2 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>>>>> index 04914406f..348b7a7f4 100644
>>>>>>>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
>>>>>>>>>> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd)
>>>>>>>>>>               dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
>>>>>>>>>>               dma_map.vaddr = ms[i].addr_64;
>>>>>>>>>>               dma_map.size = ms[i].len;
>>>>>>>>>> -        dma_map.iova = ms[i].phys_addr;
>>>>>>>>>> +        if (rte_eal_iova_mode() == RTE_IOVA_VA)
>>>>>>>>>> +            dma_map.iova = dma_map.vaddr;
>>>>>>>>>> +        else
>>>>>>>>>> +            dma_map.iova = ms[i].phys_addr;
>>>>>>>>>>               dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>>>>>>>>>
>>>>>>>>> IIUC, it is changing default behavior for VFIO devices.
>>>>>>>>>
>>>>>>>>> I see a possible problem, but I'm not sure the case is valid.
>>>>>>>>>
>>>>>>>>> Imagine you have two devices in the iommu group, and the two devices are
>>>>>>>>> used in separate processes. Each process could try two different
>>>>>>>>> physical addresses at the same virtual address, and so the second map
>>>>>>>>> would fail.
>>>>>>>>
>>>>>>>> IMO, Doesn't look like a problem. Here is the data flow
>>>>>>>>
>>>>>>>> 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only
>>>>>>>> on primary process
>>>>>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359
>>>>>>>>
>>>>>>>> 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure
>>>>>>>> that, the Secondary process has the _same_ virtual address as primary or
>>>>>>>> exit from on attach.
>>>>>>>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452
>>>>>>>>
>>>>>>>> 3) Since secondary process adds the mapped the virtual address in step (2).
>>>>>>>> in the page table in OS. On SMMU entry miss(When device
>>>>>>>> request from I/O transaction), OS will load the mapping and update the SMMU
>>>>>>>> "context" with page tables from MMU.
>>>>>>>
>>>>>>> Ok thanks for the detailed info, but what about the case where the same
>>>>>>> iommu group is used by two primary processes?
>>>>>>
>>>>>> Does that case exist with DPDK? We always need to blacklist same BDF in
>>>>>> the secondary process to make things work with existing DPDK setup. Which
>>>>>> make sense as well. Only primary process configures the HW blocks.
>>>>>
>>>>> I meant the case when two BDF are in the same IOMMU group (if ACS is not
>>>>> supported at some point in the hierarchy). And I meant two primary
>>>>> processes running, like for example two containers running each a DPDK
>>>>> application.
>>>>>
>>>>> Maybe this is not a valid use-case (it is not secure, as it would break
>>>>> isolation between the two containers), but it seems that it is something
>>>>> DPDK allows today, if I'm not mistaken.
>>>>>
>>>> I'm not sure how two primary process could run, as because latter primary process
>>>> would try accessing /var/run/.rte_config and would fail at this [1] point.
>>>>
>>>> It's not valid use-case for dpdk (imo).
>>>> [1] http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal.c#n204
>>>
>>> Yes this is possible. I had never used it before, but Thomas told me it
>>> is supported by setting--file-prefix option. I had a trial, and I
>>> confirm it works:
>>> session 1> ./install/bin/testpmd -l 0,2 --socket-mem=1024 -w 0000:05:00.0 --proc-type=primary --file-prefix=app1 -- --disable-hw-vlan -i --rxq=1 --txq=1         --nb-cores=1 --forward-mode=io
>>> session 2> ./install/bin/testpmd -l 0,3 --socket-mem=1024 -w 0000:05:00.1 --proc-type=primary --file-prefix=app2 -- --disable-hw-vlan -i --rxq=1 --txq=1         --nb-cores=1 --forward-mode=io
>>>
>>> In the above example, two ports of the same card is used by two
>>> processes. Note that in this case, ACS is supproted and both ports have
>>> their own iommu group.
>>
>> # ls -al /var/run/.app*
>> -rw-r-----. 1 root root 208420 Jul  6 09:08 /var/run/.app1_config
>> -rw-r--r--. 1 root root  49728 Jul  6 09:08 /var/run/.app1_hugepage_info
>> srwxr-xr-x. 1 root root      0 Jul  6 09:08 /var/run/.app1_mp_socket
>> -rw-r-----. 1 root root 208420 Jul  6 09:08 /var/run/.app2_config
>> -rw-r--r--. 1 root root  45584 Jul  6 09:08 /var/run/.app2_hugepage_info
>> srwxr-xr-x. 1 root root      0 Jul  6 09:08 /var/run/.app2_mp_socket
>>
> Yes, You're right, you can start two primary process, I missed that point.
> Use-case which you mentioned is ok, because they are under two different iommu
> group so proposed scheme will work. It may not work for the case when ACS not present,
> so its bypass mode which falls under vfio-noiommu category.
> 
> Having said that: Per discussion on [1]. The proposed scheme where
> bus makes decision based on pci_id and/or pci_drv will be a full proof
> solution, and that way other types of devices will not be impacted. Right?


Right!

Thanks,
Maxime
> [1] https://www.mail-archive.com/dev@dpdk.org/msg70283.html
> 
>
  

Patch

diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index 04914406f..348b7a7f4 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -706,7 +706,10 @@  vfio_type1_dma_map(int vfio_container_fd)
 		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
 		dma_map.vaddr = ms[i].addr_64;
 		dma_map.size = ms[i].len;
-		dma_map.iova = ms[i].phys_addr;
+		if (rte_eal_iova_mode() == RTE_IOVA_VA)
+			dma_map.iova = dma_map.vaddr;
+		else
+			dma_map.iova = ms[i].phys_addr;
 		dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
 
 		ret = ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
@@ -792,7 +795,10 @@  vfio_spapr_dma_map(int vfio_container_fd)
 		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
 		dma_map.vaddr = ms[i].addr_64;
 		dma_map.size = ms[i].len;
-		dma_map.iova = ms[i].phys_addr;
+		if (rte_eal_iova_mode() == RTE_IOVA_VA)
+			dma_map.iova = dma_map.vaddr;
+		else
+			dma_map.iova = ms[i].phys_addr;
 		dma_map.flags = VFIO_DMA_MAP_FLAG_READ |
 				 VFIO_DMA_MAP_FLAG_WRITE;