[dpdk-dev,1/2] net/virtio: fix performance regression due to TSO enabling

Message ID 1484108832-19907-2-git-send-email-yuanhan.liu@linux.intel.com (mailing list archive)
State Accepted, archived
Delegated to: Yuanhan Liu
Headers

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel compilation success Compilation OK

Commit Message

Yuanhan Liu Jan. 11, 2017, 4:27 a.m. UTC
  TSO is now enabled, but it's not actually being used by default in a
simple L2 forward mode. In such case, we have to zero the virtio net
headers, to inform the vhost backend that no offload is being used:

    hdr->csum_start = 0;
    hdr->csum_offset = 0;
    hdr->flags = 0;

    hdr->gso_type = 0;
    hdr->gso_size = 0;
    hdr->hdr_len = 0;

Such writes could be very costly; it introduces severe cache issues:
The above operations introduce cache write for each packet, which
stalls the read operation from the vhost backend.

The fact that virtio net header is initiated to zero in PMD driver
init stage means that these costly writes are unnecessary and could
be avoided:

    if (hdr->csum_start != 0)
        hdr->csum_start = 0;

And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
performance drop introduced by TSO enabling is recovered: it could
be up to 20% in micro benchmarking.

Fixes: 58169a9c8153 ("net/virtio: support Tx checksum offload")
Fixes: 696573046e9e ("net/virtio: support TSO")

Cc: Olivier Matz <olivier.matz@6wind.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: stable@dpdk.org
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 drivers/net/virtio/virtio_rxtx.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)
  

Comments

Maxime Coquelin Jan. 11, 2017, 7:59 a.m. UTC | #1
On 01/11/2017 05:27 AM, Yuanhan Liu wrote:
> TSO is now enabled, but it's not actually being used by default in a
> simple L2 forward mode. In such case, we have to zero the virtio net
> headers, to inform the vhost backend that no offload is being used:
>
>     hdr->csum_start = 0;
>     hdr->csum_offset = 0;
>     hdr->flags = 0;
>
>     hdr->gso_type = 0;
>     hdr->gso_size = 0;
>     hdr->hdr_len = 0;
>
> Such writes could be very costly; it introduces severe cache issues:
> The above operations introduce cache write for each packet, which
> stalls the read operation from the vhost backend.
>
> The fact that virtio net header is initiated to zero in PMD driver
> init stage means that these costly writes are unnecessary and could
> be avoided:
>
>     if (hdr->csum_start != 0)
>         hdr->csum_start = 0;
>
> And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
> performance drop introduced by TSO enabling is recovered: it could
> be up to 20% in micro benchmarking.
Very nice!

>
> Fixes: 58169a9c8153 ("net/virtio: support Tx checksum offload")
> Fixes: 696573046e9e ("net/virtio: support TSO")
>
> Cc: Olivier Matz <olivier.matz@6wind.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: stable@dpdk.org
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> ---
>  drivers/net/virtio/virtio_rxtx.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
> index 1e5a6b9..8ec2f1a 100644
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -258,6 +258,12 @@
>  		vtpci_with_feature(hw, VIRTIO_NET_F_HOST_TSO6);
>  }
>
> +/* avoid write operation when necessary, to lessen cache issues */
> +#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
> +	if ((var) != (val))			\
> +		(var) = (val);			\
> +} while (0)
As it is intended to go in -stable, I think this is fine to have it
only in the driver, but for v17.02, maybe we should have another patch 
on top that declares it somewhere so that other libs and drivers can
make use of it?

> +
>  static inline void
>  virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
>  		       uint16_t needed, int use_indirect, int can_push)
> @@ -337,9 +343,9 @@
>  			break;
>
>  		default:
> -			hdr->csum_start = 0;
> -			hdr->csum_offset = 0;
> -			hdr->flags = 0;
> +			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
> +			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
> +			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
>  			break;
>  		}
>
> @@ -355,9 +361,9 @@
>  				cookie->l3_len +
>  				cookie->l4_len;
>  		} else {
> -			hdr->gso_type = 0;
> -			hdr->gso_size = 0;
> -			hdr->hdr_len = 0;
> +			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
> +			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
> +			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
>  		}
>  	}
>
>

Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks!
Maxime
  
Yuanhan Liu Jan. 11, 2017, 8:08 a.m. UTC | #2
On Wed, Jan 11, 2017 at 08:59:28AM +0100, Maxime Coquelin wrote:
> >+/* avoid write operation when necessary, to lessen cache issues */
> >+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
> >+	if ((var) != (val))			\
> >+		(var) = (val);			\
> >+} while (0)
> As it is intended to go in -stable, I think this is fine to have it
> only in the driver, but for v17.02, maybe we should have another patch on
> top that declares it somewhere so that other libs and drivers can
> make use of it?

Yes, if it has any other usages :)


	--yliu
  
Olivier Matz Jan. 11, 2017, 8:22 a.m. UTC | #3
Hi Yuanhan,

Thanks for the fix.

On Wed, 11 Jan 2017 08:59:28 +0100, Maxime Coquelin
<maxime.coquelin@redhat.com> wrote:
> On 01/11/2017 05:27 AM, Yuanhan Liu wrote:
> > TSO is now enabled, but it's not actually being used by default in a
> > simple L2 forward mode. In such case, we have to zero the virtio net
> > headers, to inform the vhost backend that no offload is being used:
> >
> >     hdr->csum_start = 0;
> >     hdr->csum_offset = 0;
> >     hdr->flags = 0;
> >
> >     hdr->gso_type = 0;
> >     hdr->gso_size = 0;
> >     hdr->hdr_len = 0;
> >
> > Such writes could be very costly; it introduces severe cache issues:
> > The above operations introduce cache write for each packet, which
> > stalls the read operation from the vhost backend.
> >
> > The fact that virtio net header is initiated to zero in PMD driver
> > init stage means that these costly writes are unnecessary and could
> > be avoided:
> >
> >     if (hdr->csum_start != 0)
> >         hdr->csum_start = 0;
> >
> > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
> > performance drop introduced by TSO enabling is recovered: it could
> > be up to 20% in micro benchmarking.  
> Very nice!

Yep, I'm also impressed by the result.
I would have think this is something already done by the hardware and
transparent to the software.

> 
> >
> > Fixes: 58169a9c8153 ("net/virtio: support Tx checksum offload")
> > Fixes: 696573046e9e ("net/virtio: support TSO")
> >
> > Cc: Olivier Matz <olivier.matz@6wind.com>
> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: stable@dpdk.org
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>
> [...]
>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Reviewed-by: Olivier Matz <olivier.matz@6wind.com>
  
Thomas Monjalon Jan. 11, 2017, 2:51 p.m. UTC | #4
2017-01-11 12:27, Yuanhan Liu:
> The fact that virtio net header is initiated to zero in PMD driver
> init stage means that these costly writes are unnecessary and could
> be avoided:
> 
>     if (hdr->csum_start != 0)
>         hdr->csum_start = 0;
> 
> And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
> performance drop introduced by TSO enabling is recovered: it could
> be up to 20% in micro benchmarking.

This patch is adding a condition to assignments.
We need a benchmark on other architectures like ARM. Please anyone?


[...]
> +/* avoid write operation when necessary, to lessen cache issues */
> +#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
> +	if ((var) != (val))			\
> +		(var) = (val);			\
> +} while (0)
  
Yuanhan Liu Jan. 12, 2017, 2:30 a.m. UTC | #5
On Wed, Jan 11, 2017 at 03:51:22PM +0100, Thomas Monjalon wrote:
> 2017-01-11 12:27, Yuanhan Liu:
> > The fact that virtio net header is initiated to zero in PMD driver
> > init stage means that these costly writes are unnecessary and could
> > be avoided:
> > 
> >     if (hdr->csum_start != 0)
> >         hdr->csum_start = 0;
> > 
> > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
> > performance drop introduced by TSO enabling is recovered: it could
> > be up to 20% in micro benchmarking.
> 
> This patch is adding a condition to assignments.
> We need a benchmark on other architectures like ARM. Please anyone?

I think the cost of condition should be way lower than the cost from the
penalty introduced by the cache issue, that I don't see it would perform
bad on other platforms.

But, of course, testing is always welcome!

	--yliu
> 
> 
> [...]
> > +/* avoid write operation when necessary, to lessen cache issues */
> > +#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
> > +	if ((var) != (val))			\
> > +		(var) = (val);			\
> > +} while (0)
  
Jan Viktorin Jan. 12, 2017, 3:02 p.m. UTC | #6
On Thu, 12 Jan 2017 10:30:58 +0800
Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:

> On Wed, Jan 11, 2017 at 03:51:22PM +0100, Thomas Monjalon wrote:
> > 2017-01-11 12:27, Yuanhan Liu:  
> > > The fact that virtio net header is initiated to zero in PMD driver
> > > init stage means that these costly writes are unnecessary and could
> > > be avoided:
> > > 
> > >     if (hdr->csum_start != 0)
> > >         hdr->csum_start = 0;
> > > 
> > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
> > > performance drop introduced by TSO enabling is recovered: it could
> > > be up to 20% in micro benchmarking.  
> > 
> > This patch is adding a condition to assignments.
> > We need a benchmark on other architectures like ARM. Please anyone?  
> 
> I think the cost of condition should be way lower than the cost from the
> penalty introduced by the cache issue, that I don't see it would perform
> bad on other platforms.
> 
> But, of course, testing is always welcome!
> 
> 	--yliu

Hello,

we've done a synthetic measurement, principle briefly:

== Without condition check ==

start = gettimeofday();

for (i = 0; i < 1024*1024*128; ++i) {
	hdr->csum_start = 0;
	hdr->csum_offset = 0;
	hdr->flags = 0;
}

end = gettimeofday();


== With condition check ==

start = gettimeofday();

for (i = 0; i < 1024*1024*128; ++i) {
	ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
	ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
	ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
}

end = gettimeofday();


== Results ==

Computed as total time of all threads:

for i = 1..THREAD_COUNT:
	result += end[i] - start[i]

cpu           threads  without-check (ms)  with-check
Xeon E5-2670        1            516              529
Xeon E5-2670        2           1155              953
Xeon E5-2670        8           8947             5044
Xeon E5-2670       16          23335            16836
Zynq-7020 (armv7)   1           6735             7205
Zynq-7020 (armv7)   2          13753            14418

The advantage for Intel is evident when increasing the number
of threads.

However, on 32-bit ARMs we might expect some performance drop.

Regards
Jan

> > 
> > 
> > [...]  
> > > +/* avoid write operation when necessary, to lessen cache issues */
> > > +#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
> > > +	if ((var) != (val))			\
> > > +		(var) = (val);			\
> > > +} while (0)
  
Yuanhan Liu Jan. 13, 2017, 6:13 a.m. UTC | #7
On Thu, Jan 12, 2017 at 04:02:56PM +0100, Jan Viktorin wrote:
> On Thu, 12 Jan 2017 10:30:58 +0800
> Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> 
> > On Wed, Jan 11, 2017 at 03:51:22PM +0100, Thomas Monjalon wrote:
> > > 2017-01-11 12:27, Yuanhan Liu:  
> > > > The fact that virtio net header is initiated to zero in PMD driver
> > > > init stage means that these costly writes are unnecessary and could
> > > > be avoided:
> > > > 
> > > >     if (hdr->csum_start != 0)
> > > >         hdr->csum_start = 0;
> > > > 
> > > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
> > > > performance drop introduced by TSO enabling is recovered: it could
> > > > be up to 20% in micro benchmarking.  
> > > 
> > > This patch is adding a condition to assignments.
> > > We need a benchmark on other architectures like ARM. Please anyone?  
> > 
> > I think the cost of condition should be way lower than the cost from the
> > penalty introduced by the cache issue, that I don't see it would perform
> > bad on other platforms.
> > 
> > But, of course, testing is always welcome!
> > 
> > 	--yliu
> 
> Hello,
> 
> we've done a synthetic measurement, principle briefly:

Thanks!

> 
> == Without condition check ==
> 
> start = gettimeofday();
> 
> for (i = 0; i < 1024*1024*128; ++i) {
> 	hdr->csum_start = 0;
> 	hdr->csum_offset = 0;
> 	hdr->flags = 0;
> }
> 
> end = gettimeofday();
> 
> 
> == With condition check ==
> 
> start = gettimeofday();
> 
> for (i = 0; i < 1024*1024*128; ++i) {
> 	ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
> 	ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
> 	ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
> }
> 
> end = gettimeofday();

But it's not the test methodology I'd expect. You are purely testing
the instruction cycles. The drop on ARM is something more like "the
if instruction takes more cycles than the simple assignment".

This macro is used in the case that one process is heavily writing
same value (0 here) again and again while another process is heavily
read it also again and again. That means cache violation always
happen. With this macro, however, this cache issue could be avoided,
since no write happens.

For such workload, I don't think it would behaviour worse on ARM.

	--yliu

> == Results ==
> 
> Computed as total time of all threads:
> 
> for i = 1..THREAD_COUNT:
> 	result += end[i] - start[i]
> 
> cpu           threads  without-check (ms)  with-check
> Xeon E5-2670        1            516              529
> Xeon E5-2670        2           1155              953
> Xeon E5-2670        8           8947             5044
> Xeon E5-2670       16          23335            16836
> Zynq-7020 (armv7)   1           6735             7205
> Zynq-7020 (armv7)   2          13753            14418
> 
> The advantage for Intel is evident when increasing the number
> of threads.
> 
> However, on 32-bit ARMs we might expect some performance drop.
> 
> Regards
> Jan
> 
> > > 
> > > 
> > > [...]  
> > > > +/* avoid write operation when necessary, to lessen cache issues */
> > > > +#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
> > > > +	if ((var) != (val))			\
> > > > +		(var) = (val);			\
> > > > +} while (0)
  
Yuanhan Liu Jan. 16, 2017, 7:12 a.m. UTC | #8
On Fri, Jan 13, 2017 at 02:13:09PM +0800, Yuanhan Liu wrote:
> But it's not the test methodology I'd expect. You are purely testing
> the instruction cycles. The drop on ARM is something more like "the
> if instruction takes more cycles than the simple assignment".
> 
> This macro is used in the case that one process is heavily writing
> same value (0 here) again and again while another process is heavily
> read it also again and again. That means cache violation always
> happen. With this macro, however, this cache issue could be avoided,
> since no write happens.
> 
> For such workload, I don't think it would behaviour worse on ARM.

No reply yet; I will treat it as no objections, and please shout out if any.

Both applied to dpdk-next-virtio.

	--yliu
  
Yuanhan Liu Jan. 16, 2017, 11:21 a.m. UTC | #9
On Mon, Jan 16, 2017 at 12:14:04PM +0100, Michal Orsák wrote:
> On 16.1.2017 12:12, Yuanhan Liu wrote:
> >On Mon, Jan 16, 2017 at 12:05:04PM +0100, Michal Orsák wrote:
> >>On 16.1.2017 08:12, Yuanhan Liu wrote:
> >>>On Fri, Jan 13, 2017 at 02:13:09PM +0800, Yuanhan Liu wrote:
> >>>>But it's not the test methodology I'd expect. You are purely testing
> >>>>the instruction cycles. The drop on ARM is something more like "the
> >>>>if instruction takes more cycles than the simple assignment".
> >>>>
> >>>>This macro is used in the case that one process is heavily writing
> >>>>same value (0 here) again and again while another process is heavily
> >>>>read it also again and again. That means cache violation always
> >>>>happen. With this macro, however, this cache issue could be avoided,
> >>>>since no write happens.
> >>>>
> >>>>For such workload, I don't think it would behaviour worse on ARM.
> >>>No reply yet; I will treat it as no objections, and please shout out if any.
> >>>
> >>>Both applied to dpdk-next-virtio.
> >>>
> >>>	--yliu
> >>Hello,
> >>
> >>
> >>currently I am running short of time. If you have any test prepared which i
> >>can just ran, please send me a link.
> >No link, but you could try:
> >
> >- a typical PVP test
> >
> >- a txonly test: running txonly fwd mode in guest PMD while running
> >   rxonly in fwd mode.
> >
> >The second is a micro test, thus I saw way bigger boost.
> >
> >When are you available for the testing, btw?
> 25.1.2017+

Okay, I will hold on a while to apply them.

	--yliu
  
Yuanhan Liu Jan. 30, 2017, 1:30 p.m. UTC | #10
On Mon, Jan 16, 2017 at 12:26:41PM +0100, Michal Orsák wrote:
> >>>>>>For such workload, I don't think it would behaviour worse on ARM.
> >>>>>No reply yet; I will treat it as no objections, and please shout out if any.
> >>>>>
> >>>>>Both applied to dpdk-next-virtio.
> >>>>>
> >>>>>	--yliu
> >>>>Hello,
> >>>>
> >>>>
> >>>>currently I am running short of time. If you have any test prepared which i
> >>>>can just ran, please send me a link.
> >>>No link, but you could try:
> >>>
> >>>- a typical PVP test
> >>>
> >>>- a txonly test: running txonly fwd mode in guest PMD while running
> >>>   rxonly in fwd mode.
> >>>
> >>>The second is a micro test, thus I saw way bigger boost.
> >>>
> >>>When are you available for the testing, btw?
> >>25.1.2017+
> >Okay, I will hold on a while to apply them.
> Ok, I will send you results when I have them.

Again, no reply yet. I'm appying them to dpdk-next-virtio. I really
don't think it could perform worse in ARM, as it removes a costly
cache invalidation operation (which should be more expensive than the
instruction cycles).

OTOH, again, testing is welcome, if it's later proved to be worse on
ARM (which I highly doubt), I could revert them.

	--yliu
  
Maxime Coquelin Jan. 30, 2017, 1:54 p.m. UTC | #11
On 01/30/2017 02:30 PM, Yuanhan Liu wrote:
> On Mon, Jan 16, 2017 at 12:26:41PM +0100, Michal Orsák wrote:
>>>>>>>> For such workload, I don't think it would behaviour worse on ARM.
>>>>>>> No reply yet; I will treat it as no objections, and please shout out if any.
>>>>>>>
>>>>>>> Both applied to dpdk-next-virtio.
>>>>>>>
>>>>>>> 	--yliu
>>>>>> Hello,
>>>>>>
>>>>>>
>>>>>> currently I am running short of time. If you have any test prepared which i
>>>>>> can just ran, please send me a link.
>>>>> No link, but you could try:
>>>>>
>>>>> - a typical PVP test
>>>>>
>>>>> - a txonly test: running txonly fwd mode in guest PMD while running
>>>>>   rxonly in fwd mode.
>>>>>
>>>>> The second is a micro test, thus I saw way bigger boost.
>>>>>
>>>>> When are you available for the testing, btw?
>>>> 25.1.2017+
>>> Okay, I will hold on a while to apply them.
>> Ok, I will send you results when I have them.
>
> Again, no reply yet. I'm appying them to dpdk-next-virtio. I really
> don't think it could perform worse in ARM, as it removes a costly
> cache invalidation operation (which should be more expensive than the
> instruction cycles).
>
> OTOH, again, testing is welcome, if it's later proved to be worse on
> ARM (which I highly doubt), I could revert them.
Or have a per-architecture definition of ASSIGN_UNLESS_EQUAL if it
proved to regress on ARM?

Maxime
>
> 	--yliu
>
  
Yuanhan Liu Jan. 30, 2017, 2:10 p.m. UTC | #12
On Mon, Jan 30, 2017 at 02:54:16PM +0100, Maxime Coquelin wrote:
> >OTOH, again, testing is welcome, if it's later proved to be worse on
> >ARM (which I highly doubt), I could revert them.
> Or have a per-architecture definition of ASSIGN_UNLESS_EQUAL if it
> proved to regress on ARM?

Yes, we could try that.

	--yliu
  

Patch

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index 1e5a6b9..8ec2f1a 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -258,6 +258,12 @@ 
 		vtpci_with_feature(hw, VIRTIO_NET_F_HOST_TSO6);
 }
 
+/* avoid write operation when necessary, to lessen cache issues */
+#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
+	if ((var) != (val))			\
+		(var) = (val);			\
+} while (0)
+
 static inline void
 virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct rte_mbuf *cookie,
 		       uint16_t needed, int use_indirect, int can_push)
@@ -337,9 +343,9 @@ 
 			break;
 
 		default:
-			hdr->csum_start = 0;
-			hdr->csum_offset = 0;
-			hdr->flags = 0;
+			ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
 			break;
 		}
 
@@ -355,9 +361,9 @@ 
 				cookie->l3_len +
 				cookie->l4_len;
 		} else {
-			hdr->gso_type = 0;
-			hdr->gso_size = 0;
-			hdr->hdr_len = 0;
+			ASSIGN_UNLESS_EQUAL(hdr->gso_type, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->gso_size, 0);
+			ASSIGN_UNLESS_EQUAL(hdr->hdr_len, 0);
 		}
 	}