[v2] net/virtio: add platform memory ordering feature support

Message ID 20181226163712.31596-1-i.maximets@samsung.com (mailing list archive)
State Superseded, archived
Delegated to: Maxime Coquelin
Series [v2] net/virtio: add platform memory ordering feature support

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/mellanox-Performance-Testing success Performance Testing PASS
ci/intel-Performance-Testing success Performance Testing PASS
ci/Intel-compilation success Compilation OK

Commit Message

Ilya Maximets Dec. 26, 2018, 4:37 p.m. UTC
VIRTIO_F_ORDER_PLATFORM is required to use proper memory barriers
in case of HW vhost implementations like vDPA.

DMA barriers (rte_cio_*) are sufficient for that purpose.

Previously known as VIRTIO_F_IO_BARRIER.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
---

Version 2:
  * rebased on current master (packed rings).

RFC --> Version 1:
  * Dropped vendor-specific hack to determine if we need real barriers.
  * Added VIRTIO_F_ORDER_PLATFORM feature definition and checking.

Note: Patch to change the name of the feature from VIRTIO_F_IO_BARRIER
      to VIRTIO_F_ORDER_PLATFORM is not merged yet:
      https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg04114.html

 drivers/net/virtio/virtio_ethdev.c |  2 ++
 drivers/net/virtio/virtio_ethdev.h |  3 ++-
 drivers/net/virtio/virtio_pci.h    |  7 ++++++
 drivers/net/virtio/virtio_rxtx.c   | 16 ++++++------
 drivers/net/virtio/virtqueue.h     | 39 ++++++++++++++++++++++++------
 5 files changed, 51 insertions(+), 16 deletions(-)
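
In short, the patch selects the barrier strength once at feature negotiation and passes it into every barrier call. A minimal sketch mirroring the diff below:

	/* At init time (see virtio_init_device() in the diff): */
	hw->weak_barriers = !vtpci_with_feature(hw, VIRTIO_F_ORDER_PLATFORM);

	/* In the datapath, e.g. before publishing the avail index: */
	virtio_wmb(vq->hw->weak_barriers);	/* rte_smp_wmb() or rte_cio_wmb() */
	vq->vq_ring.avail->idx = vq->vq_avail_idx;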
  

Comments

Shahaf Shuler Dec. 27, 2018, 10:07 a.m. UTC | #1
Hi Ilya, 

Wednesday, December 26, 2018 6:37 PM, Ilya Maximets:
> Subject: [dpdk-dev] [PATCH v2] net/virtio: add platform memory ordering
> feature support
> 
> VIRTIO_F_ORDER_PLATFORM is required to use proper memory barriers in
> case of HW vhost implementations like vDPA.
> 
> DMA barriers (rte_cio_*) are sufficient for that purpose.
> 
> Previously known as VIRTIO_F_IO_BARRIER.
> 
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> ---
> 
> Version 2:
>   * rebased on current master (packed rings).
> 
> RFC --> Version 1:
>   * Dropped vendor-specific hack to determine if we need real barriers.
>   * Added VIRTIO_F_ORDER_PLATFORM feature definition and checking.
> 
> Note: Patch to change the name of the feature from VIRTIO_F_IO_BARRIER
>       to VIRTIO_F_ORDER_PLATFORM is not merged yet:
> 
> https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg04114.html
> 
>  drivers/net/virtio/virtio_ethdev.c |  2 ++
>  drivers/net/virtio/virtio_ethdev.h |  3 ++-
>  drivers/net/virtio/virtio_pci.h    |  7 ++++++
>  drivers/net/virtio/virtio_rxtx.c   | 16 ++++++------
>  drivers/net/virtio/virtqueue.h     | 39 ++++++++++++++++++++++++------
>  5 files changed, 51 insertions(+), 16 deletions(-)

[...]

> 
>  /*
> - * Per virtio_config.h in Linux.
> + * Per virtio_ring.h in Linux.
>   *     For virtio_pci on SMP, we don't need to order with respect to MMIO
>   *     accesses through relaxed memory I/O windows, so smp_mb() et al are
>   *     sufficient.
>   *
> + *     For using virtio to talk to real devices (eg. vDPA) we do need real
> + *     barriers.
>   */
> -#define virtio_mb()	rte_smp_mb()
> -#define virtio_rmb()	rte_smp_rmb()
> -#define virtio_wmb()	rte_smp_wmb()
> +static inline void
> +virtio_mb(uint8_t weak_barriers)
> +{
> +	if (weak_barriers)
> +		rte_smp_mb();
> +	else
> +		rte_mb();
> +}
> +
> +static inline void
> +virtio_rmb(uint8_t weak_barriers)
> +{
> +	if (weak_barriers)
> +		rte_smp_rmb();
> +	else
> +		rte_cio_rmb();
> +}
> +
> +static inline void
> +virtio_wmb(uint8_t weak_barriers)
> +{
> +	if (weak_barriers)
> +		rte_smp_wmb();
> +	else
> +		rte_cio_wmb();

Just wondering if the cio barrier is enough here.
This virtio_wmb will be called, for example, in the following transmit sequence (split queue, not packed):
if (likely(nb_tx)) {
        vq_update_avail_idx(vq);                <--- virtio_wmb()

        if (unlikely(virtqueue_kick_prepare(vq))) {                 <--- no barrier
                virtqueue_notify(vq);
                PMD_TX_LOG(DEBUG, "Notified backend after xmit");
        }
}

Assuming the backend is a vDPA device, the notify will most likely be mapped to the device PCI BAR as write-combining.
This means we need to keep ordering here between two memory domains - the PCI domain and the host local memory. We must make sure that when the notify reaches the device, the store to the host local memory has already been written to memory (and is not sitting in the write buffer).
Is a compiler barrier enough for such a purpose, or do we need something stronger like sfence?

Note #1 - This issue (if it exists) is not caused directly by your code; however, since you are making things right here with respect to memory ordering, it is worth taking care of this one as well.
Note #2 - if there is indeed an issue here, for the packed queue we are somewhat OK, since virtio_mb() will be called before the kick. I am not sure why we need such a hard barrier and why sfence is not enough. Do you know?
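
To make the concern concrete, here is a minimal sketch (hypothetical names, not the actual driver code) of the ordering I am worried about:

#include <rte_atomic.h>
#include <rte_io.h>

/*
 * 'avail_idx' lives in host DRAM shared with the device; 'doorbell' is
 * an MMIO register in a (possibly write-combining) PCI BAR.
 */
static inline void
kick_device(volatile uint16_t *avail_idx, uint16_t new_idx,
	    volatile uint16_t *doorbell, uint16_t queue_id)
{
	*avail_idx = new_idx;	/* store to cacheable host memory */
	/*
	 * Without a store fence here (sfence on x86), the write-combined
	 * doorbell write below could reach the device before the store
	 * above is globally visible.
	 */
	rte_wmb();
	rte_write16(queue_id, doorbell);	/* notify: MMIO write */
}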


> +}
> 
>  #ifdef RTE_PMD_PACKET_PREFETCH
>  #define rte_packet_prefetch(p)  rte_prefetch1(p)
> @@ -325,7 +350,7 @@ virtqueue_enable_intr_packed(struct virtqueue *vq)
> 
> 
>  	if (vq->event_flags_shadow == RING_EVENT_FLAGS_DISABLE) {
> -		virtio_wmb();
> +		virtio_wmb(vq->hw->weak_barriers);
>  		vq->event_flags_shadow = RING_EVENT_FLAGS_ENABLE;
>  		*event_flags = vq->event_flags_shadow;
>  	}
> @@ -391,7 +416,7 @@ void vq_ring_free_inorder(struct virtqueue *vq, uint16_t desc_idx,
>  static inline void
>  vq_update_avail_idx(struct virtqueue *vq)
>  {
> -	virtio_wmb();
> +	virtio_wmb(vq->hw->weak_barriers);
>  	vq->vq_ring.avail->idx = vq->vq_avail_idx;
>  }
> 
> @@ -423,7 +448,7 @@ virtqueue_kick_prepare_packed(struct virtqueue *vq)
>  {
>  	uint16_t flags;
> 
> -	virtio_mb();
> +	virtio_mb(vq->hw->weak_barriers);
>  	flags = vq->ring_packed.device_event->desc_event_flags;
> 
>  	return flags != RING_EVENT_FLAGS_DISABLE;
> --
> 2.17.1
  
Ilya Maximets Jan. 9, 2019, 2:34 p.m. UTC | #2
On 27.12.2018 13:07, Shahaf Shuler wrote:
> Hi Ilya, 
> 
> Wednesday, December 26, 2018 6:37 PM, Ilya Maximets:
>> Subject: [dpdk-dev] [PATCH v2] net/virtio: add platform memory ordering
>> feature support
>>
>> VIRTIO_F_ORDER_PLATFORM is required to use proper memory barriers in
>> case of HW vhost implementations like vDPA.
>>
>> DMA barriers (rte_cio_*) are sufficient for that purpose.
>>
>> Previously known as VIRTIO_F_IO_BARRIER.
>>
>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>> ---
>>
>> Version 2:
>>   * rebased on current master (packed rings).
>>
>> RFC --> Version 1:
>>   * Dropped vendor-specific hack to determine if we need real barriers.
>>   * Added VIRTIO_F_ORDER_PLATFORM feature definition and checking.
>>
>> Note: Patch to change the name of the feature from VIRTIO_F_IO_BARRIER
>>       to VIRTIO_F_ORDER_PLATFORM is not merged yet:
>>
>> https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg04114.html
>>
>>  drivers/net/virtio/virtio_ethdev.c |  2 ++
>>  drivers/net/virtio/virtio_ethdev.h |  3 ++-
>>  drivers/net/virtio/virtio_pci.h    |  7 ++++++
>>  drivers/net/virtio/virtio_rxtx.c   | 16 ++++++------
>>  drivers/net/virtio/virtqueue.h     | 39 ++++++++++++++++++++++++------
>>  5 files changed, 51 insertions(+), 16 deletions(-)
> 
> [...]
> 
>>
>>  /*
>> - * Per virtio_config.h in Linux.
>> + * Per virtio_ring.h in Linux.
>>   *     For virtio_pci on SMP, we don't need to order with respect to MMIO
>>   *     accesses through relaxed memory I/O windows, so smp_mb() et al are
>>   *     sufficient.
>>   *
>> + *     For using virtio to talk to real devices (eg. vDPA) we do need real
>> + *     barriers.
>>   */
>> -#define virtio_mb()	rte_smp_mb()
>> -#define virtio_rmb()	rte_smp_rmb()
>> -#define virtio_wmb()	rte_smp_wmb()
>> +static inline void
>> +virtio_mb(uint8_t weak_barriers)
>> +{
>> +	if (weak_barriers)
>> +		rte_smp_mb();
>> +	else
>> +		rte_mb();
>> +}
>> +
>> +static inline void
>> +virtio_rmb(uint8_t weak_barriers)
>> +{
>> +	if (weak_barriers)
>> +		rte_smp_rmb();
>> +	else
>> +		rte_cio_rmb();
>> +}
>> +
>> +static inline void
>> +virtio_wmb(uint8_t weak_barriers)
>> +{
>> +	if (weak_barriers)
>> +		rte_smp_wmb();
>> +	else
>> +		rte_cio_wmb();
> 
> Just wondering if the cio barrier is enough here.
> This virtio_wmb will be called, for example, in the following transmit sequence (split queue, not packed):
> if (likely(nb_tx)) {
>         vq_update_avail_idx(vq);                <--- virtio_wmb()
> 
>         if (unlikely(virtqueue_kick_prepare(vq))) {                 <--- no barrier
>                 virtqueue_notify(vq);
>                 PMD_TX_LOG(DEBUG, "Notified backend after xmit");
>         }
> }
> 
> Assuming the backend is a vDPA device, the notify will most likely be mapped to the device PCI BAR as write-combining.
> This means we need to keep ordering here between two memory domains - the PCI domain and the host local memory. We must make sure that when the notify reaches the device, the store to the host local memory has already been written to memory (and is not sitting in the write buffer).
> Is a compiler barrier enough for such a purpose, or do we need something stronger like sfence?
> 
> Note #1 - This issue (if it exists) is not caused directly by your code; however, since you are making things right here with respect to memory ordering, it is worth taking care of this one as well.
> Note #2 - if there is indeed an issue here, for the packed queue we are somewhat OK, since virtio_mb() will be called before the kick. I am not sure why we need such a hard barrier and why sfence is not enough. Do you know?

Hmm. Thanks for spotting this.
The issue exists, but it's a bit different.
Looks like we have missed a virtio_mb() for the split case inside
virtqueue_kick_prepare(). It's required because we must ensure the
ordering between writing the data (updating avail->idx) and reading the
used->flags. Otherwise we could hit the case where the data was written
but the virtqueue wasn't notified, and vhost keeps waiting for a
notification.
This is not a vDPA-specific issue; it could happen with kernel vhost too.
Adding the virtio_mb() to virtqueue_kick_prepare() will solve the vDPA
case automatically. I'll prepare the patches.
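
A minimal sketch of what such a fix could look like, assuming the split-ring
definitions used elsewhere in this driver:

static inline int
virtqueue_kick_prepare(struct virtqueue *vq)
{
	/*
	 * Ensure the updated avail->idx is visible to the vhost side
	 * before reading used->flags; otherwise the notification
	 * could be missed.
	 */
	virtio_mb(vq->hw->weak_barriers);
	return !(vq->vq_ring.used->flags & VRING_USED_F_NO_NOTIFY);
}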


Besides that, Jens: between v10 and v11 of the packed queues support
patch-set you added the virtio_mb() to the kick_prepare function.
I guess this could be the root cause of the performance drop you
spotted in comparison with split queues. Maybe you could check and
confirm that point?
I couldn't find why you made that change (there are no such review
comments on the list), but it was the right decision.

virtio_mb() is really heavy. I'd like to avoid it somehow, but I don't
know how to do this yet.

> 
> 
>> +}
>>
>>  #ifdef RTE_PMD_PACKET_PREFETCH
>>  #define rte_packet_prefetch(p)  rte_prefetch1(p)
>> @@ -325,7 +350,7 @@ virtqueue_enable_intr_packed(struct virtqueue *vq)
>>
>>
>>  	if (vq->event_flags_shadow == RING_EVENT_FLAGS_DISABLE) {
>> -		virtio_wmb();
>> +		virtio_wmb(vq->hw->weak_barriers);
>>  		vq->event_flags_shadow = RING_EVENT_FLAGS_ENABLE;
>>  		*event_flags = vq->event_flags_shadow;
>>  	}
>> @@ -391,7 +416,7 @@ void vq_ring_free_inorder(struct virtqueue *vq, uint16_t desc_idx,
>>  static inline void
>>  vq_update_avail_idx(struct virtqueue *vq)
>>  {
>> -	virtio_wmb();
>> +	virtio_wmb(vq->hw->weak_barriers);
>>  	vq->vq_ring.avail->idx = vq->vq_avail_idx;
>>  }
>>
>> @@ -423,7 +448,7 @@ virtqueue_kick_prepare_packed(struct virtqueue *vq)
>>  {
>>  	uint16_t flags;
>>
>> -	virtio_mb();
>> +	virtio_mb(vq->hw->weak_barriers);
>>  	flags = vq->ring_packed.device_event->desc_event_flags;
>>
>>  	return flags != RING_EVENT_FLAGS_DISABLE;
>> --
>> 2.17.1
>
  
Michael S. Tsirkin Jan. 9, 2019, 3:50 p.m. UTC | #3
On Wed, Jan 09, 2019 at 05:34:38PM +0300, Ilya Maximets wrote:
> virtio_mb() is really heavy. I'd like to avoid it somehow, but I don't
> know how to do this yet.

Linux driver doesn't avoid it either.
  
Shahaf Shuler Jan. 10, 2019, 8:36 p.m. UTC | #4
Wednesday, January 9, 2019 5:50 PM, Michael S. Tsirkin:
> alejandro.lucero@netronome.com; Daniel Marcovitch
> <danielm@mellanox.com>
> Subject: Re: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
> ordering feature support
> 
> On Wed, Jan 09, 2019 at 05:34:38PM +0300, Ilya Maximets wrote:
> > virtio_mb() is really heavy. I'd like to avoid it somehow, but I don't
> > know how to do this yet.
> 
> Linux driver doesn't avoid it either.

I understand v3 was merged, but I would still like to continue the discussion and make sure everything is clear and agreed.

From the patch [1] description it is very clear why we need the rte_smp_mb() barrier.
However, I am not sure why this barrier is implemented as rte_mb in the vDPA case. In the vDPA case, both the read of the used ring and the write of the avail index target local cached memory.
The only write to uncacheable memory (device memory) is the notify itself.

As I mentioned, there is a need for a store fence before doing the notify, but for different reasons. So the vDPA use case and need is a bit different from what is presented in [1].

[1]
https://patches.dpdk.org/patch/49545/

> 
> --
> MST
  
Shahaf Shuler Jan. 15, 2019, 6:33 a.m. UTC | #5
Thursday, January 10, 2019 10:37 PM, Shahaf Shuler:
> Subject: RE: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
> ordering feature support
> 
> Wednesday, January 9, 2019 5:50 PM, Michael S. Tsirkin:
> > alejandro.lucero@netronome.com; Daniel Marcovitch
> > <danielm@mellanox.com>
> > Subject: Re: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
> > ordering feature support
> >
> > On Wed, Jan 09, 2019 at 05:34:38PM +0300, Ilya Maximets wrote:
> > > virtio_mb() is really heavy. I'd like to avoid it somehow, but I
> > > don't know how to do this yet.
> >
> > Linux driver doesn't avoid it either.
> 
> I understand v3 was merged, but I would still like to continue the discussion
> and make sure everything is clear and agreed.
> 
> From the patch [1] description it is very clear why we need the rte_smp_mb()
> barrier.
> However, I am not sure why this barrier is implemented as rte_mb in the vDPA
> case. In the vDPA case, both the read of the used ring and the write of the
> avail index target local cached memory.
> The only write to uncacheable memory (device memory) is the notify itself.
> 
> As I mentioned, there is a need for a store fence before doing the notify,
> but for different reasons. So the vDPA use case and need is a bit different
> from what is presented in [1].

Any answer?
It is a pity if we add redundant barriers which will degrade the driver performance.

> 
> [1]
> https://patches.dpdk.org/patch/49545/
> 
> >
> > --
> > MST
  
Ilya Maximets Jan. 15, 2019, 8:29 a.m. UTC | #6
On 15.01.2019 9:33, Shahaf Shuler wrote:
> Thursday, January 10, 2019 10:37 PM, Shahaf Shuler:
>> Subject: RE: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
>> ordering feature support
>>
>> Wednesday, January 9, 2019 5:50 PM, Michael S. Tsirkin:
>>> alejandro.lucero@netronome.com; Daniel Marcovitch
>>> <danielm@mellanox.com>
>>> Subject: Re: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
>>> ordering feature support
>>>
>>> On Wed, Jan 09, 2019 at 05:34:38PM +0300, Ilya Maximets wrote:
>>>> virtio_mb() is really heavy. I'd like to avoid it somehow, but I
>>>> don't know how to do this yet.
>>>
>>> Linux driver doesn't avoid it either.
>>
>> I understand v3 was merged, but I would still like to continue the
>> discussion and make sure everything is clear and agreed.
>>
>> From the patch [1] description it is very clear why we need the
>> rte_smp_mb() barrier.
>> However, I am not sure why this barrier is implemented as rte_mb in the
>> vDPA case. In the vDPA case, both the read of the used ring and the write
>> of the avail index target local cached memory.
>> The only write to uncacheable memory (device memory) is the notify
>> itself.
>>
>> As I mentioned, there is a need for a store fence before doing the
>> notify, but for different reasons. So the vDPA use case and need is a bit
>> different from what is presented in [1].
> 
> Any answer?
> It is a pity if we add redundant barriers which will degrade the driver performance.

Sorry for the late response. I have a lot of work with OVS right now.

Regarding your question.
Current code looks like this:

 1. Update ring.
 2. virtio_wmb()
 3. Update idx.
 4. virtio_mb()
 5. read flags.
 6. notify.

virtio_mb() is here to avoid reordering of steps 3 and 5.
i.e. we need a full barrier to ensure the order between store (idx update)
and load (reading the flags). Otherwise we could miss the notification.
We can't avoid the barrier here, because even x86 does not guarantee
the ordering of the local load with earlier local store.
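
Mapping these steps onto the split-ring Tx tail (just a sketch using the
function names from this driver; argument values are illustrative):

	/* 1. update ring */
	virtqueue_enqueue_xmit(txvq, txm, slots, use_indirect, can_push, 0);
	/* 2. virtio_wmb();  3. store avail->idx */
	vq_update_avail_idx(vq);
	/* 4. virtio_mb();   5. load used->flags */
	if (unlikely(virtqueue_kick_prepare(vq)))
		virtqueue_notify(vq);	/* 6. notify (doorbell write) */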

> 
>>
>> [1]
>> https://patches.dpdk.org/patch/49545/
>>
>>>
>>> --
>>> MST
> 
>
  
Shahaf Shuler Jan. 15, 2019, 8:55 a.m. UTC | #7
Tuesday, January 15, 2019 10:29 AM, Ilya Maximets:
> Subject: Re: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
> ordering feature support
> 
> On 15.01.2019 9:33, Shahaf Shuler wrote:
> > Thursday, January 10, 2019 10:37 PM, Shahaf Shuler:
> >> Subject: RE: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
> >> ordering feature support
> >>
> >> Wednesday, January 9, 2019 5:50 PM, Michael S. Tsirkin:
> >>> alejandro.lucero@netronome.com; Daniel Marcovitch
> >>> <danielm@mellanox.com>
> >>> Subject: Re: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
> >>> ordering feature support
> >>>
> >>> On Wed, Jan 09, 2019 at 05:34:38PM +0300, Ilya Maximets wrote:
> >>>> virtio_mb() is really heavy. I'd like to avoid it somehow, but I
> >>>> don't know how to do this yet.
> >>>
> >>> Linux driver doesn't avoid it either.
> >>
> >> I understand v3 was merged, but I would still like to continue the
> >> discussion and make sure everything is clear and agreed.
> >>
> >> From the patch [1] description it is very clear why we need the
> >> rte_smp_mb() barrier.
> >> However, I am not sure why this barrier is implemented as rte_mb in
> >> the vDPA case. In the vDPA case, both the read of the used ring and
> >> the write of the avail index target local cached memory.
> >> The only write to uncacheable memory (device memory) is the notify
> >> itself.
> >>
> >> As I mentioned, there is a need for a store fence before doing the
> >> notify, but for different reasons. So the vDPA use case and need is
> >> a bit different from what is presented in [1].
> >
> > Any answer?
> > It is a pity if we add redundant barriers which will degrade the driver
> > performance.
> 
> Sorry for the late response. I have a lot of work with OVS right now.
> 
> Regarding your question.
> Current code looks like this:
> 
>  1. Update ring.
>  2. virtio_wmb()
>  3. Update idx.
>  4. virtio_mb()
>  5. read flags.
>  6. notify.
> 
> virtio_mb() is here to avoid reordering of steps 3 and 5.
> i.e. we need a full barrier to ensure the order between store (idx update)
> and load (reading the flags). Otherwise we could miss the notification.
> We can't avoid the barrier here, because even x86 does not guarantee the
> ordering of the local load with earlier local store.

This is clear. You need the rte_smp_mb() here. My question is why you need the rte_mb() in case of vDPA? 
As you said, all accesses are local. 

Pasting your commit code:
/*                                                                        
 * Per virtio_ring.h in Linux.                                            
 *     For virtio_pci on SMP, we don't need to order with respect to MMIO 
 *     accesses through relaxed memory I/O windows, so smp_mb() et al are 
 *     sufficient.                                                        
 *                                                                        
 *     For using virtio to talk to real devices (eg. vDPA) we do need real
 *     barriers.                                                          
 */                                                                       
static inline void                                                        
virtio_mb(uint8_t weak_barriers)                                          
{                                                                         
        if (weak_barriers)                                                
                rte_smp_mb();                                             
        else                                                              
                rte_mb();                                                 
}                                                                         

> 
> >
> >>
> >> [1]
> >> https://patches.dpdk.org/patch/49545/
> >>
> >>>
> >>> --
> >>> MST
> >
> >
  
Ilya Maximets Jan. 15, 2019, 10:23 a.m. UTC | #8
On 15.01.2019 11:55, Shahaf Shuler wrote:
> Tuesday, January 15, 2019 10:29 AM, Ilya Maximets:
>> Subject: Re: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
>> ordering feature support
>>
>> On 15.01.2019 9:33, Shahaf Shuler wrote:
>>> Thursday, January 10, 2019 10:37 PM, Shahaf Shuler:
>>>> Subject: RE: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
>>>> ordering feature support
>>>>
>>>> Wednesday, January 9, 2019 5:50 PM, Michael S. Tsirkin:
>>>>> alejandro.lucero@netronome.com; Daniel Marcovitch
>>>>> <danielm@mellanox.com>
>>>>> Subject: Re: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
>>>>> ordering feature support
>>>>>
>>>>> On Wed, Jan 09, 2019 at 05:34:38PM +0300, Ilya Maximets wrote:
>>>>>> virtio_mb() is really heavy. I'd like to avoid it somehow, but I
>>>>>> don't know how to do this yet.
>>>>>
>>>>> Linux driver doesn't avoid it either.
>>>>
>>>> I understand v3 was merged, but I would still like to continue the
>>>> discussion and make sure everything is clear and agreed.
>>>>
>>>> From the patch [1] description it is very clear why we need the
>>>> rte_smp_mb() barrier.
>>>> However, I am not sure why this barrier is implemented as rte_mb in
>>>> the vDPA case. In the vDPA case, both the read of the used ring and
>>>> the write of the avail index target local cached memory.
>>>> The only write to uncacheable memory (device memory) is the notify
>>>> itself.
>>>>
>>>> As I mentioned, there is a need for a store fence before doing the
>>>> notify, but for different reasons. So the vDPA use case and need is
>>>> a bit different from what is presented in [1].
>>>
>>> Any answer?
>>> It is a pity if we add redundant barriers which will degrade the driver
>>> performance.
>>
>> Sorry for the late response. I have a lot of work with OVS right now.
>>
>> Regarding your question.
>> Current code looks like this:
>>
>>  1. Update ring.
>>  2. virtio_wmb()
>>  3. Update idx.
>>  4. virtio_mb()
>>  5. read flags.
>>  6. notify.
>>
>> virtio_mb() is here to avoid reordering of steps 3 and 5.
>> i.e. we need a full barrier to ensure the order between store (idx update)
>> and load (reading the flags). Otherwise we could miss the notification.
>> We can't avoid the barrier here, because even x86 does not guarantee the
>> ordering of the local load with earlier local store.
> 
> This is clear. You need the rte_smp_mb() here. My question is why you need the rte_mb() in case of vDPA? 
> As you said, all accesses are local.

Sorry, I misunderstood your question.
Memory accesses are not local here. We want the ring data/idx to be visible
to the vDPA HW before reading the flags.
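
In other words (a sketch of the strong-barrier path, with weak_barriers == 0):

	vq->vq_ring.avail->idx = vq->vq_avail_idx; /* store, read by device DMA */
	virtio_mb(0);                              /* rte_mb(): full fence      */
	flags = vq->vq_ring.used->flags;           /* load, written by device   */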

> 
> Pasting your commit code:
> /*                                                                        
>  * Per virtio_ring.h in Linux.                                            
>  *     For virtio_pci on SMP, we don't need to order with respect to MMIO 
>  *     accesses through relaxed memory I/O windows, so smp_mb() et al are 
>  *     sufficient.                                                        
>  *                                                                        
>  *     For using virtio to talk to real devices (eg. vDPA) we do need real
>  *     barriers.                                                          
>  */                                                                       
> static inline void                                                        
> virtio_mb(uint8_t weak_barriers)                                          
> {                                                                         
>         if (weak_barriers)                                                
>                 rte_smp_mb();                                             
>         else                                                              
>                 rte_mb();                                                 
> }                                                                         
> 
>>
>>>
>>>>
>>>> [1]
>>>> https://patches.dpdk.org/patch/49545/
>>>>
>>>>>
>>>>> --
>>>>> MST
>>>
>>>
  
Michael S. Tsirkin Feb. 12, 2019, 5:50 p.m. UTC | #9
On Tue, Jan 15, 2019 at 08:55:50AM +0000, Shahaf Shuler wrote:
> Tuesday, January 15, 2019 10:29 AM, Ilya Maximets:
> > Subject: Re: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
> > ordering feature support
> > 
> > On 15.01.2019 9:33, Shahaf Shuler wrote:
> > > Thursday, January 10, 2019 10:37 PM, Shahaf Shuler:
> > >> Subject: RE: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
> > >> ordering feature support
> > >>
> > >> Wednesday, January 9, 2019 5:50 PM, Michael S. Tsirkin:
> > >>> alejandro.lucero@netronome.com; Daniel Marcovitch
> > >>> <danielm@mellanox.com>
> > >>> Subject: Re: [dpdk-dev] [PATCH v2] net/virtio: add platform memory
> > >>> ordering feature support
> > >>>
> > >>> On Wed, Jan 09, 2019 at 05:34:38PM +0300, Ilya Maximets wrote:
> > >>>> virtio_mb() is really heavy. I'd like to avoid it somehow, but I
> > >>>> don't know how to do this yet.
> > >>>
> > >>> Linux driver doesn't avoid it either.
> > >>
> > >> I understand v3 was merged, but I would still like to continue the
> > >> discussion and make sure everything is clear and agreed.
> > >>
> > >> From the patch [1] description it is very clear why we need the
> > >> rte_smp_mb() barrier.
> > >> However, I am not sure why this barrier is implemented as rte_mb in
> > >> the vDPA case. In the vDPA case, both the read of the used ring and
> > >> the write of the avail index target local cached memory.
> > >> The only write to uncacheable memory (device memory) is the notify
> > >> itself.
> > >>
> > >> As I mentioned, there is a need for a store fence before doing the
> > >> notify, but for different reasons. So the vDPA use case and need is
> > >> a bit different from what is presented in [1].
> > >
> > > Any answer?
> > > It is a pity if we add redundant barriers which will degrade the driver
> > > performance.
> > 
> > Sorry for the late response. I have a lot of work with OVS right now.
> > 
> > Regarding your question.
> > Current code looks like this:
> > 
> >  1. Update ring.
> >  2. virtio_wmb()
> >  3. Update idx.
> >  4. virtio_mb()
> >  5. read flags.
> >  6. notify.
> > 
> > virtio_mb() is here to avoid reordering of steps 3 and 5.
> > i.e. we need a full barrier to ensure the order between store (idx update)
> > and load (reading the flags). Otherwise we could miss the notification.
> > We can't avoid the barrier here, because even x86 does not guarantee the
> > ordering of the local load with earlier local store.
> 
> This is clear. You need the rte_smp_mb() here. My question is why you need the rte_mb() in case of vDPA? 
> As you said, all accesses are local. 

Are you asking why the different barriers? Or are you asking why a barrier
is needed at all?

The barriers themselves are clearly needed.

But in my opinion some DPDK barrier implementations are sub-optimal and too
strong.  For example on Intel: the big question is whether anyone does
any non-temporal accesses. In the absence of these, and with non-cacheable
mappings, on x86 wmb and rmb should be a nop, and mb should be a locked
instruction.  It might make sense to add rte_dma_rmb/wmb/mb.
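
Roughly something like this (an x86-only sketch under the assumptions above;
these helpers do not exist in DPDK today):

	#define rte_dma_rmb() rte_compiler_barrier() /* x86 keeps load order */
	#define rte_dma_wmb() rte_compiler_barrier() /* and store order (no NT stores) */
	#define rte_dma_mb()  rte_mb()               /* store-load still needs mfence */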




> Pasting you commit code:
> /*                                                                        
>  * Per virtio_ring.h in Linux.                                            
>  *     For virtio_pci on SMP, we don't need to order with respect to MMIO 
>  *     accesses through relaxed memory I/O windows, so smp_mb() et al are 
>  *     sufficient.                                                        
>  *                                                                        
>  *     For using virtio to talk to real devices (eg. vDPA) we do need real
>  *     barriers.                                                          
>  */                                                                       
> static inline void                                                        
> virtio_mb(uint8_t weak_barriers)                                          
> {                                                                         
>         if (weak_barriers)                                                
>                 rte_smp_mb();                                             
>         else                                                              
>                 rte_mb();                                                 
> }                                                                         
> 
> > 
> > >
> > >>
> > >> [1]
> > >>
> > https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpa
> > >>
> > tches.dpdk.org%2Fpatch%2F49545%2F&amp;data=02%7C01%7Cshahafs%40
> > mellan
> > >>
> > ox.com%7C01907f1b2e0e4002cb7508d67ac38a98%7Ca652971c7d2e4d9ba6a4
> > d1492
> > >>
> > 56f461b%7C0%7C0%7C636831377591864200&amp;sdata=TSpc%2Fzyq2aq0N3
> > %2Bh4o
> > >> ro4std8ut%2FQU6%2BOeMDeeaQdsM%3D&amp;reserved=0
> > >>
> > >>>
> > >>> --
> > >>> MST
> > >
> > >
  

Patch

diff --git a/drivers/net/virtio/virtio_ethdev.c b/drivers/net/virtio/virtio_ethdev.c
index 446c338fc..6d461180c 100644
--- a/drivers/net/virtio/virtio_ethdev.c
+++ b/drivers/net/virtio/virtio_ethdev.c
@@ -1613,6 +1613,8 @@  virtio_init_device(struct rte_eth_dev *eth_dev, uint64_t req_features)
 	if (virtio_negotiate_features(hw, req_features) < 0)
 		return -1;
 
+	hw->weak_barriers = !vtpci_with_feature(hw, VIRTIO_F_ORDER_PLATFORM);
+
 	if (!hw->virtio_user_dev) {
 		pci_dev = RTE_ETH_DEV_TO_PCI(eth_dev);
 		rte_eth_copy_pci_info(eth_dev, pci_dev);
diff --git a/drivers/net/virtio/virtio_ethdev.h b/drivers/net/virtio/virtio_ethdev.h
index f8d8a56ab..b8aab7da4 100644
--- a/drivers/net/virtio/virtio_ethdev.h
+++ b/drivers/net/virtio/virtio_ethdev.h
@@ -35,7 +35,8 @@ 
 	 1ULL << VIRTIO_F_VERSION_1       |	\
 	 1ULL << VIRTIO_F_IN_ORDER        |	\
 	 1ULL << VIRTIO_F_RING_PACKED	  |	\
-	 1ULL << VIRTIO_F_IOMMU_PLATFORM)
+	 1ULL << VIRTIO_F_IOMMU_PLATFORM  |	\
+	 1ULL << VIRTIO_F_ORDER_PLATFORM)
 
 #define VIRTIO_PMD_SUPPORTED_GUEST_FEATURES	\
 	(VIRTIO_PMD_DEFAULT_GUEST_FEATURES |	\
diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index b22b62dad..38a0261da 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -129,6 +129,12 @@  struct virtnet_ctl;
  */
 #define VIRTIO_F_IN_ORDER 35
 
+/*
+ * This feature indicates that memory accesses by the driver and the device
+ * are ordered in a way described by the platform.
+ */
+#define VIRTIO_F_ORDER_PLATFORM 36
+
 /* The Guest publishes the used index for which it expects an interrupt
  * at the end of the avail ring. Host should ignore the avail->flags field. */
 /* The Host publishes the avail index for which it expects a kick
@@ -241,6 +247,7 @@  struct virtio_hw {
 	uint8_t     use_simple_rx;
 	uint8_t     use_inorder_rx;
 	uint8_t     use_inorder_tx;
+	uint8_t     weak_barriers;
 	bool        has_tx_offload;
 	bool        has_rx_offload;
 	uint16_t    port_id;
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 2309b71d6..ebb86ef70 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -1152,7 +1152,7 @@  virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 
 	nb_used = VIRTQUEUE_NUSED(vq);
 
-	virtio_rmb();
+	virtio_rmb(hw->weak_barriers);
 
 	num = likely(nb_used <= nb_pkts) ? nb_used : nb_pkts;
 	if (unlikely(num > VIRTIO_MBUF_BURST_SZ))
@@ -1361,7 +1361,7 @@  virtio_recv_pkts_inorder(void *rx_queue,
 	nb_used = RTE_MIN(nb_used, nb_pkts);
 	nb_used = RTE_MIN(nb_used, VIRTIO_MBUF_BURST_SZ);
 
-	virtio_rmb();
+	virtio_rmb(hw->weak_barriers);
 
 	PMD_RX_LOG(DEBUG, "used:%d", nb_used);
 
@@ -1549,7 +1549,7 @@  virtio_recv_mergeable_pkts(void *rx_queue,
 
 	nb_used = VIRTQUEUE_NUSED(vq);
 
-	virtio_rmb();
+	virtio_rmb(hw->weak_barriers);
 
 	PMD_RX_LOG(DEBUG, "used:%d", nb_used);
 
@@ -1940,7 +1940,7 @@  virtio_xmit_pkts_packed(void *tx_queue, struct rte_mbuf **tx_pkts,
 
 		/* Positive value indicates it need free vring descriptors */
 		if (unlikely(need > 0)) {
-			virtio_rmb();
+			virtio_rmb(hw->weak_barriers);
 			need = RTE_MIN(need, (int)nb_pkts);
 			virtio_xmit_cleanup_packed(vq, need);
 			need = slots - vq->vq_free_cnt;
@@ -1988,7 +1988,7 @@  virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
 	nb_used = VIRTQUEUE_NUSED(vq);
 
-	virtio_rmb();
+	virtio_rmb(hw->weak_barriers);
 	if (likely(nb_used > vq->vq_nentries - vq->vq_free_thresh))
 		virtio_xmit_cleanup(vq, nb_used);
 
@@ -2030,7 +2030,7 @@  virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 		/* Positive value indicates it need free vring descriptors */
 		if (unlikely(need > 0)) {
 			nb_used = VIRTQUEUE_NUSED(vq);
-			virtio_rmb();
+			virtio_rmb(hw->weak_barriers);
 			need = RTE_MIN(need, (int)nb_used);
 
 			virtio_xmit_cleanup(vq, need);
@@ -2086,7 +2086,7 @@  virtio_xmit_pkts_inorder(void *tx_queue,
 	PMD_TX_LOG(DEBUG, "%d packets to xmit", nb_pkts);
 	nb_used = VIRTQUEUE_NUSED(vq);
 
-	virtio_rmb();
+	virtio_rmb(hw->weak_barriers);
 	if (likely(nb_used > vq->vq_nentries - vq->vq_free_thresh))
 		virtio_xmit_cleanup_inorder(vq, nb_used);
 
@@ -2134,7 +2134,7 @@  virtio_xmit_pkts_inorder(void *tx_queue,
 		need = slots - vq->vq_free_cnt;
 		if (unlikely(need > 0)) {
 			nb_used = VIRTQUEUE_NUSED(vq);
-			virtio_rmb();
+			virtio_rmb(hw->weak_barriers);
 			need = RTE_MIN(need, (int)nb_used);
 
 			virtio_xmit_cleanup_inorder(vq, need);
diff --git a/drivers/net/virtio/virtqueue.h b/drivers/net/virtio/virtqueue.h
index d8ae5cdec..a66a37f61 100644
--- a/drivers/net/virtio/virtqueue.h
+++ b/drivers/net/virtio/virtqueue.h
@@ -19,15 +19,40 @@ 
 struct rte_mbuf;
 
 /*
- * Per virtio_config.h in Linux.
+ * Per virtio_ring.h in Linux.
  *     For virtio_pci on SMP, we don't need to order with respect to MMIO
  *     accesses through relaxed memory I/O windows, so smp_mb() et al are
  *     sufficient.
  *
+ *     For using virtio to talk to real devices (eg. vDPA) we do need real
+ *     barriers.
  */
-#define virtio_mb()	rte_smp_mb()
-#define virtio_rmb()	rte_smp_rmb()
-#define virtio_wmb()	rte_smp_wmb()
+static inline void
+virtio_mb(uint8_t weak_barriers)
+{
+	if (weak_barriers)
+		rte_smp_mb();
+	else
+		rte_mb();
+}
+
+static inline void
+virtio_rmb(uint8_t weak_barriers)
+{
+	if (weak_barriers)
+		rte_smp_rmb();
+	else
+		rte_cio_rmb();
+}
+
+static inline void
+virtio_wmb(uint8_t weak_barriers)
+{
+	if (weak_barriers)
+		rte_smp_wmb();
+	else
+		rte_cio_wmb();
+}
 
 #ifdef RTE_PMD_PACKET_PREFETCH
 #define rte_packet_prefetch(p)  rte_prefetch1(p)
@@ -325,7 +350,7 @@  virtqueue_enable_intr_packed(struct virtqueue *vq)
 
 
 	if (vq->event_flags_shadow == RING_EVENT_FLAGS_DISABLE) {
-		virtio_wmb();
+		virtio_wmb(vq->hw->weak_barriers);
 		vq->event_flags_shadow = RING_EVENT_FLAGS_ENABLE;
 		*event_flags = vq->event_flags_shadow;
 	}
@@ -391,7 +416,7 @@  void vq_ring_free_inorder(struct virtqueue *vq, uint16_t desc_idx,
 static inline void
 vq_update_avail_idx(struct virtqueue *vq)
 {
-	virtio_wmb();
+	virtio_wmb(vq->hw->weak_barriers);
 	vq->vq_ring.avail->idx = vq->vq_avail_idx;
 }
 
@@ -423,7 +448,7 @@  virtqueue_kick_prepare_packed(struct virtqueue *vq)
 {
 	uint16_t flags;
 
-	virtio_mb();
+	virtio_mb(vq->hw->weak_barriers);
 	flags = vq->ring_packed.device_event->desc_event_flags;
 
 	return flags != RING_EVENT_FLAGS_DISABLE;