vhost: batch used descriptors chains write-back with packed ring

Message ID: 20181128094700.14598-1-maxime.coquelin@redhat.com (mailing list archive)
State: Superseded, archived
Delegated to: Maxime Coquelin
Series: vhost: batch used descriptors chains write-back with packed ring

Checks

Context Check Description
ci/Intel-compilation success Compilation OK
ci/mellanox-Performance-Testing success Performance Testing PASS
ci/intel-Performance-Testing success Performance Testing PASS

Commit Message

Maxime Coquelin Nov. 28, 2018, 9:47 a.m. UTC
  Instead of writing back descriptor chains in order, let's
write the first chain's flags last to improve batching.

With the kernel's pktgen benchmark, a ~3% performance gain is measured.

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---
 lib/librte_vhost/virtio_net.c | 37 ++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 14 deletions(-)
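
The idea behind the patch, in short: the driver only polls the flags of
the first descriptor past the point it last processed, i.e. the head of
the batch being flushed. The id/len stores for all chains, and the flags
of the non-head chains, can therefore be written back with no
intermediate barriers; a single barrier followed by the head flags store
then publishes the whole batch at once. Below is a minimal sketch of that
ordering, assuming simplified stand-in types — ring_flush, packed_desc
and shadow_entry are hypothetical names, not the librte_vhost API, and
the per-chain write/len flag handling is elided:

#include <stdint.h>

#define F_AVAIL	(1 << 7)	/* VRING_DESC_F_AVAIL */
#define F_USED	(1 << 15)	/* VRING_DESC_F_USED */

struct packed_desc  { uint16_t id; uint32_t len; uint16_t flags; };
struct shadow_entry { uint16_t id; uint32_t len; uint16_t count; };

/* Stand-in for rte_smp_wmb(): order earlier stores before later ones. */
#define smp_wmb() __atomic_thread_fence(__ATOMIC_RELEASE)

static void
ring_flush(struct packed_desc *ring, uint16_t size,
	   const struct shadow_entry *shadow, uint16_t n,
	   uint16_t *last_used, int *wrap)
{
	uint16_t i, flags, head_idx = *last_used, head_flags = 0;

	if (n == 0)
		return;

	for (i = 0; i < n; i++) {
		uint16_t idx = *last_used;

		/* The used flags of each chain depend on the wrap counter. */
		flags = *wrap ? (F_AVAIL | F_USED) : 0;

		ring[idx].id = shadow[i].id;
		ring[idx].len = shadow[i].len;

		if (i > 0) {
			/* Non-head chain: its flags may be written now,
			 * the driver is not polling this slot. */
			ring[idx].flags = flags;
		} else {
			/* Head chain: defer the flags until the whole
			 * batch is in place. */
			head_idx = idx;
			head_flags = flags;
		}

		*last_used += shadow[i].count;
		if (*last_used >= size) {
			*wrap ^= 1;
			*last_used -= size;
		}
	}

	/* One barrier for the whole batch, then publish it by flipping
	 * the head descriptor's flags last. */
	smp_wmb();
	ring[head_idx].flags = head_flags;
}

The previous code also used a single barrier, but it flipped the head
chain's flags first, so each later flags store became visible to the
driver one by one. Writing the head last keeps the entire batch invisible
until one final store, which is where the batching win comes from.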
  

Comments

Jens Freimann Nov. 28, 2018, 10:05 a.m. UTC | #1
On Wed, Nov 28, 2018 at 10:47:00AM +0100, Maxime Coquelin wrote:
>Instead of writing back descriptor chains in order, let's
>write the first chain's flags last to improve batching.
>
>With the kernel's pktgen benchmark, a ~3% performance gain is measured.
>
>Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>---
> lib/librte_vhost/virtio_net.c | 37 ++++++++++++++++++++++-------------
> 1 file changed, 23 insertions(+), 14 deletions(-)
>

Tested-by: Jens Freimann <jfreimann@redhat.com>
Reviewed-by: Jens Freimann <jfreimann@redhat.com>
  
Ilya Maximets Dec. 5, 2018, 4:01 p.m. UTC | #2
On 28.11.2018 12:47, Maxime Coquelin wrote:
> Instead of writing back descriptor chains in order, let's
> write the first chain's flags last to improve batching.

I'm not sure if this is fully compliant with the virtio spec.
It says that 'each side (driver and device) are only required to poll
(or test) a single location in memory', but it does not forbid testing
other descriptors. So, if the driver tries to check not only
'the next device descriptor after the one they processed previously,
in circular order' but a few descriptors ahead, it could read
inconsistent memory, because there are no longer write barriers
between the flags and id/len updates for those descriptors.

What do you think?

> 
> With the kernel's pktgen benchmark, a ~3% performance gain is measured.
> 
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> Tested-by: Jens Freimann <jfreimann@redhat.com>
> Reviewed-by: Jens Freimann <jfreimann@redhat.com>
> ---
>  lib/librte_vhost/virtio_net.c | 37 ++++++++++++++++++++++-------------
>  1 file changed, 23 insertions(+), 14 deletions(-)
> 
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 5e1a1a727..f54642c2d 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -135,19 +135,10 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
>  			struct vhost_virtqueue *vq)
>  {
>  	int i;
> -	uint16_t used_idx = vq->last_used_idx;
> +	uint16_t head_flags, head_idx = vq->last_used_idx;
>  
> -	/* Split loop in two to save memory barriers */
> -	for (i = 0; i < vq->shadow_used_idx; i++) {
> -		vq->desc_packed[used_idx].id = vq->shadow_used_packed[i].id;
> -		vq->desc_packed[used_idx].len = vq->shadow_used_packed[i].len;
> -
> -		used_idx += vq->shadow_used_packed[i].count;
> -		if (used_idx >= vq->size)
> -			used_idx -= vq->size;
> -	}
> -
> -	rte_smp_wmb();
> +	if (unlikely(vq->shadow_used_idx == 0))
> +		return;
>  
>  	for (i = 0; i < vq->shadow_used_idx; i++) {
>  		uint16_t flags;
> @@ -165,12 +156,22 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
>  			flags &= ~VRING_DESC_F_AVAIL;
>  		}
>  
> -		vq->desc_packed[vq->last_used_idx].flags = flags;
> +		vq->desc_packed[vq->last_used_idx].id =
> +			vq->shadow_used_packed[i].id;
> +		vq->desc_packed[vq->last_used_idx].len =
> +			vq->shadow_used_packed[i].len;
> +
> +		if (i > 0) {
> +			vq->desc_packed[vq->last_used_idx].flags = flags;
>  
> -		vhost_log_cache_used_vring(dev, vq,
> +			vhost_log_cache_used_vring(dev, vq,
>  					vq->last_used_idx *
>  					sizeof(struct vring_packed_desc),
>  					sizeof(struct vring_packed_desc));
> +		} else {
> +			head_idx = vq->last_used_idx;
> +			head_flags = flags;
> +		}
>  
>  		vq->last_used_idx += vq->shadow_used_packed[i].count;
>  		if (vq->last_used_idx >= vq->size) {
> @@ -180,7 +181,15 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
>  	}
>  
>  	rte_smp_wmb();
> +
> +	vq->desc_packed[head_idx].flags = head_flags;
>  	vq->shadow_used_idx = 0;
> +
> +	vhost_log_cache_used_vring(dev, vq,
> +				head_idx *
> +				sizeof(struct vring_packed_desc),
> +				sizeof(struct vring_packed_desc));
> +
>  	vhost_log_cache_sync(dev, vq);
>  }
>  
>
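
To make the concern concrete, here is a hypothetical driver-side sketch —
illustrative names, reusing the simplified packed_desc layout from the
sketch above, not code from any real driver. A driver that speculatively
tests descriptors a few slots past its poll position can observe a
non-head chain's flags already flipped while that chain's id/len stores
are still unordered behind them:

#include <stdint.h>

#define F_AVAIL	(1 << 7)	/* VRING_DESC_F_AVAIL */
#define F_USED	(1 << 15)	/* VRING_DESC_F_USED */

struct packed_desc { uint16_t id; uint32_t len; uint16_t flags; };

/* Nonzero when flags mark the descriptor as used for the current wrap
 * state (avail and used bits both equal to the wrap counter). */
static int
desc_is_used(const struct packed_desc *d, int wrap)
{
	uint16_t flags = __atomic_load_n(&d->flags, __ATOMIC_ACQUIRE);
	int avail = !!(flags & F_AVAIL);
	int used = !!(flags & F_USED);

	return avail == used && used == wrap;
}

/* Speculative look-ahead at a descriptor past the poll position.  The
 * acquire load orders this CPU's reads, but it cannot substitute for
 * the missing store ordering on the device side: for any chain that is
 * not the head of the device's batch, id/len and flags were written
 * with no barrier in between, so 'len' here may be stale even though
 * the flags already read as used. */
static uint32_t
peek_ahead_len(const struct packed_desc *ring, uint16_t idx, int wrap)
{
	if (desc_is_used(&ring[idx], wrap))
		return ring[idx].len;	/* potentially inconsistent */
	return 0;
}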
  
Michael S. Tsirkin Dec. 6, 2018, 12:56 a.m. UTC | #3
On Wed, Dec 05, 2018 at 07:01:23PM +0300, Ilya Maximets wrote:
> On 28.11.2018 12:47, Maxime Coquelin wrote:
> > Instead of writing back descriptor chains in order, let's
> > write the first chain's flags last to improve batching.
> 
> I'm not sure if this is fully compliant with the virtio spec.
> It says that 'each side (driver and device) are only required to poll
> (or test) a single location in memory', but it does not forbid testing
> other descriptors. So, if the driver tries to check not only
> 'the next device descriptor after the one they processed previously,
> in circular order' but a few descriptors ahead, it could read
> inconsistent memory, because there are no longer write barriers
> between the flags and id/len updates for those descriptors.
> 
> What do you think?

Write barriers for SMP effects are quite cheap on most architectures.
So adding them before each flag write is probably not a big deal.


> > 
> > With the kernel's pktgen benchmark, a ~3% performance gain is measured.
> > 
> > Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Tested-by: Jens Freimann <jfreimann@redhat.com>
> > Reviewed-by: Jens Freimann <jfreimann@redhat.com>
> > ---
> >  lib/librte_vhost/virtio_net.c | 37 ++++++++++++++++++++++-------------
> >  1 file changed, 23 insertions(+), 14 deletions(-)
> > 
> > diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> > index 5e1a1a727..f54642c2d 100644
> > --- a/lib/librte_vhost/virtio_net.c
> > +++ b/lib/librte_vhost/virtio_net.c
> > @@ -135,19 +135,10 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
> >  			struct vhost_virtqueue *vq)
> >  {
> >  	int i;
> > -	uint16_t used_idx = vq->last_used_idx;
> > +	uint16_t head_flags, head_idx = vq->last_used_idx;
> >  
> > -	/* Split loop in two to save memory barriers */
> > -	for (i = 0; i < vq->shadow_used_idx; i++) {
> > -		vq->desc_packed[used_idx].id = vq->shadow_used_packed[i].id;
> > -		vq->desc_packed[used_idx].len = vq->shadow_used_packed[i].len;
> > -
> > -		used_idx += vq->shadow_used_packed[i].count;
> > -		if (used_idx >= vq->size)
> > -			used_idx -= vq->size;
> > -	}
> > -
> > -	rte_smp_wmb();
> > +	if (unlikely(vq->shadow_used_idx == 0))
> > +		return;
> >  
> >  	for (i = 0; i < vq->shadow_used_idx; i++) {
> >  		uint16_t flags;
> > @@ -165,12 +156,22 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
> >  			flags &= ~VRING_DESC_F_AVAIL;
> >  		}
> >  
> > -		vq->desc_packed[vq->last_used_idx].flags = flags;
> > +		vq->desc_packed[vq->last_used_idx].id =
> > +			vq->shadow_used_packed[i].id;
> > +		vq->desc_packed[vq->last_used_idx].len =
> > +			vq->shadow_used_packed[i].len;
> > +
> > +		if (i > 0) {

Specifically here?

> > +			vq->desc_packed[vq->last_used_idx].flags = flags;
> >  
> > -		vhost_log_cache_used_vring(dev, vq,
> > +			vhost_log_cache_used_vring(dev, vq,
> >  					vq->last_used_idx *
> >  					sizeof(struct vring_packed_desc),
> >  					sizeof(struct vring_packed_desc));
> > +		} else {
> > +			head_idx = vq->last_used_idx;
> > +			head_flags = flags;
> > +		}
> >  
> >  		vq->last_used_idx += vq->shadow_used_packed[i].count;
> >  		if (vq->last_used_idx >= vq->size) {
> > @@ -180,7 +181,15 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
> >  	}
> >  
> >  	rte_smp_wmb();
> > +
> > +	vq->desc_packed[head_idx].flags = head_flags;
> >  	vq->shadow_used_idx = 0;
> > +
> > +	vhost_log_cache_used_vring(dev, vq,
> > +				head_idx *
> > +				sizeof(struct vring_packed_desc),
> > +				sizeof(struct vring_packed_desc));
> > +
> >  	vhost_log_cache_sync(dev, vq);
> >  }
> >  
> >
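
A sketch of the remedy Michael points at, expressed as a delta on the
hypothetical ring_flush sketch earlier in this thread — again
illustrative, not the actual v2 change: keep deferring the head flags to
preserve the batching win, but precede each non-head flags store with a
write barrier so that every chain's id/len are ordered before its flags.

 		if (i > 0) {
 			/* Non-head chain: its flags may be written now,
 			 * the driver is not polling this slot. */
+			/* Order this chain's id/len stores before its
+			 * flags store, so a look-ahead driver never sees
+			 * used flags paired with stale id/len; on x86
+			 * this is only a compiler barrier. */
+			smp_wmb();
 			ring[idx].flags = flags;
 		} else {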
  
Maxime Coquelin Dec. 6, 2018, 5:10 p.m. UTC | #4
On 12/5/18 5:01 PM, Ilya Maximets wrote:
> On 28.11.2018 12:47, Maxime Coquelin wrote:
>> Instead of writing back descriptor chains in order, let's
>> write the first chain's flags last to improve batching.
> 
> I'm not sure if this is fully compliant with the virtio spec.
> It says that 'each side (driver and device) are only required to poll
> (or test) a single location in memory', but it does not forbid testing
> other descriptors. So, if the driver tries to check not only
> 'the next device descriptor after the one they processed previously,
> in circular order' but a few descriptors ahead, it could read
> inconsistent memory, because there are no longer write barriers
> between the flags and id/len updates for those descriptors.
> 
> What do you think?

Yes, that makes sense.
Moreover, it should have no cost on x86.

I'll fix it in v2.
Thanks,
Maxime

>>
>> With the kernel's pktgen benchmark, a ~3% performance gain is measured.
>>
>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Tested-by: Jens Freimann <jfreimann@redhat.com>
>> Reviewed-by: Jens Freimann <jfreimann@redhat.com>
>> ---
>>   lib/librte_vhost/virtio_net.c | 37 ++++++++++++++++++++++-------------
>>   1 file changed, 23 insertions(+), 14 deletions(-)
>>
>> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
>> index 5e1a1a727..f54642c2d 100644
>> --- a/lib/librte_vhost/virtio_net.c
>> +++ b/lib/librte_vhost/virtio_net.c
>> @@ -135,19 +135,10 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
>>   			struct vhost_virtqueue *vq)
>>   {
>>   	int i;
>> -	uint16_t used_idx = vq->last_used_idx;
>> +	uint16_t head_flags, head_idx = vq->last_used_idx;
>>   
>> -	/* Split loop in two to save memory barriers */
>> -	for (i = 0; i < vq->shadow_used_idx; i++) {
>> -		vq->desc_packed[used_idx].id = vq->shadow_used_packed[i].id;
>> -		vq->desc_packed[used_idx].len = vq->shadow_used_packed[i].len;
>> -
>> -		used_idx += vq->shadow_used_packed[i].count;
>> -		if (used_idx >= vq->size)
>> -			used_idx -= vq->size;
>> -	}
>> -
>> -	rte_smp_wmb();
>> +	if (unlikely(vq->shadow_used_idx == 0))
>> +		return;
>>   
>>   	for (i = 0; i < vq->shadow_used_idx; i++) {
>>   		uint16_t flags;
>> @@ -165,12 +156,22 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
>>   			flags &= ~VRING_DESC_F_AVAIL;
>>   		}
>>   
>> -		vq->desc_packed[vq->last_used_idx].flags = flags;
>> +		vq->desc_packed[vq->last_used_idx].id =
>> +			vq->shadow_used_packed[i].id;
>> +		vq->desc_packed[vq->last_used_idx].len =
>> +			vq->shadow_used_packed[i].len;
>> +
>> +		if (i > 0) {
>> +			vq->desc_packed[vq->last_used_idx].flags = flags;
>>   
>> -		vhost_log_cache_used_vring(dev, vq,
>> +			vhost_log_cache_used_vring(dev, vq,
>>   					vq->last_used_idx *
>>   					sizeof(struct vring_packed_desc),
>>   					sizeof(struct vring_packed_desc));
>> +		} else {
>> +			head_idx = vq->last_used_idx;
>> +			head_flags = flags;
>> +		}
>>   
>>   		vq->last_used_idx += vq->shadow_used_packed[i].count;
>>   		if (vq->last_used_idx >= vq->size) {
>> @@ -180,7 +181,15 @@ flush_shadow_used_ring_packed(struct virtio_net *dev,
>>   	}
>>   
>>   	rte_smp_wmb();
>> +
>> +	vq->desc_packed[head_idx].flags = head_flags;
>>   	vq->shadow_used_idx = 0;
>> +
>> +	vhost_log_cache_used_vring(dev, vq,
>> +				head_idx *
>> +				sizeof(struct vring_packed_desc),
>> +				sizeof(struct vring_packed_desc));
>> +
>>   	vhost_log_cache_sync(dev, vq);
>>   }
>>   
>>
  

Patch

diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5e1a1a727..f54642c2d 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -135,19 +135,10 @@  flush_shadow_used_ring_packed(struct virtio_net *dev,
 			struct vhost_virtqueue *vq)
 {
 	int i;
-	uint16_t used_idx = vq->last_used_idx;
+	uint16_t head_flags, head_idx = vq->last_used_idx;
 
-	/* Split loop in two to save memory barriers */
-	for (i = 0; i < vq->shadow_used_idx; i++) {
-		vq->desc_packed[used_idx].id = vq->shadow_used_packed[i].id;
-		vq->desc_packed[used_idx].len = vq->shadow_used_packed[i].len;
-
-		used_idx += vq->shadow_used_packed[i].count;
-		if (used_idx >= vq->size)
-			used_idx -= vq->size;
-	}
-
-	rte_smp_wmb();
+	if (unlikely(vq->shadow_used_idx == 0))
+		return;
 
 	for (i = 0; i < vq->shadow_used_idx; i++) {
 		uint16_t flags;
@@ -165,12 +156,22 @@  flush_shadow_used_ring_packed(struct virtio_net *dev,
 			flags &= ~VRING_DESC_F_AVAIL;
 		}
 
-		vq->desc_packed[vq->last_used_idx].flags = flags;
+		vq->desc_packed[vq->last_used_idx].id =
+			vq->shadow_used_packed[i].id;
+		vq->desc_packed[vq->last_used_idx].len =
+			vq->shadow_used_packed[i].len;
+
+		if (i > 0) {
+			vq->desc_packed[vq->last_used_idx].flags = flags;
 
-		vhost_log_cache_used_vring(dev, vq,
+			vhost_log_cache_used_vring(dev, vq,
 					vq->last_used_idx *
 					sizeof(struct vring_packed_desc),
 					sizeof(struct vring_packed_desc));
+		} else {
+			head_idx = vq->last_used_idx;
+			head_flags = flags;
+		}
 
 		vq->last_used_idx += vq->shadow_used_packed[i].count;
 		if (vq->last_used_idx >= vq->size) {
@@ -180,7 +181,15 @@  flush_shadow_used_ring_packed(struct virtio_net *dev,
 	}
 
 	rte_smp_wmb();
+
+	vq->desc_packed[head_idx].flags = head_flags;
 	vq->shadow_used_idx = 0;
+
+	vhost_log_cache_used_vring(dev, vq,
+				head_idx *
+				sizeof(struct vring_packed_desc),
+				sizeof(struct vring_packed_desc));
+
 	vhost_log_cache_sync(dev, vq);
 }