[dpdk-dev] [dpdk-stable] [PATCH 1/2] net/virtio: fix performance regression due to TSO enabling

Yuanhan Liu yuanhan.liu at linux.intel.com
Fri Jan 13 07:13:09 CET 2017


On Thu, Jan 12, 2017 at 04:02:56PM +0100, Jan Viktorin wrote:
> On Thu, 12 Jan 2017 10:30:58 +0800
> Yuanhan Liu <yuanhan.liu at linux.intel.com> wrote:
> 
> > On Wed, Jan 11, 2017 at 03:51:22PM +0100, Thomas Monjalon wrote:
> > > 2017-01-11 12:27, Yuanhan Liu:  
> > > > > The fact that the virtio net header is initialized to zero at the PMD
> > > > > driver init stage means that these costly writes are unnecessary and
> > > > > could be avoided:
> > > > 
> > > >     if (hdr->csum_start != 0)
> > > >         hdr->csum_start = 0;
> > > > 
> > > > And that's what the macro ASSIGN_UNLESS_EQUAL does. With this, the
> > > > performance drop introduced by TSO enabling is recovered: it could
> > > > be up to 20% in micro benchmarking.  
> > > 
> > > This patch is adding a condition to assignments.
> > > We need a benchmark on other architectures like ARM. Please anyone?  
> > 
> > > I think the cost of the condition should be way lower than the penalty
> > > introduced by the cache issue, so I don't see why it would perform
> > > badly on other platforms.
> > 
> > But, of course, testing is always welcome!
> > 
> > 	--yliu
> 
> Hello,
> 
> we've done a synthetic measurement. The principle, briefly:

Thanks!

> 
> == Without condition check ==
> 
> start = gettimeofday();
> 
> for (i = 0; i < 1024*1024*128; ++i) {
> 	hdr->csum_start = 0;
> 	hdr->csum_offset = 0;
> 	hdr->flags = 0;
> }
> 
> end = gettimeofday();
> 
> 
> == With condition check ==
> 
> start = gettimeofday();
> 
> for (i = 0; i < 1024*1024*128; ++i) {
> 	ASSIGN_UNLESS_EQUAL(hdr->csum_start, 0);
> 	ASSIGN_UNLESS_EQUAL(hdr->csum_offset, 0);
> 	ASSIGN_UNLESS_EQUAL(hdr->flags, 0);
> }
> 
> end = gettimeofday();

But that's not the test methodology I'd expect: you are purely measuring
instruction cycles. The drop on ARM looks more like "the if instruction
takes more cycles than the plain assignment".

This macro is meant for the case where one process keeps writing the
same value (0 here) again and again while another process keeps reading
it. Every such write invalidates the reader's copy of the cache line,
so the reader keeps missing in its cache. With this macro, however,
the cache issue is avoided, since no write happens.

For such a workload, I don't think it would behave worse on ARM.
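
A benchmark closer to that workload would keep a second thread reading
the same fields while the writer loops. Below is a minimal sketch under
stated assumptions, not what the PMD does: struct fake_hdr, reader(),
run_writer() and ASSIGN_UNLESS_EQUAL_RELAXED are all hypothetical names,
the fields are C11 relaxed atomics so that the sketch has no data race,
and nothing pins the two threads to different cores.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical variant of the patch's macro using relaxed atomics. */
#define ASSIGN_UNLESS_EQUAL_RELAXED(var, val) do {			\
	if (atomic_load_explicit(&(var), memory_order_relaxed) != (val))	\
		atomic_store_explicit(&(var), (val), memory_order_relaxed);	\
} while (0)

/* Hypothetical stand-in for the header fields; not struct virtio_net_hdr. */
struct fake_hdr {
	_Atomic uint8_t  flags;
	_Atomic uint16_t csum_start;
	_Atomic uint16_t csum_offset;
};

static struct fake_hdr hdr;	/* static storage: starts out all zero */
static atomic_int stop_reader;
static uint64_t sink;		/* keeps the reader's result observable */

/* Reader thread: keeps loading the three fields until told to stop. */
static void *reader(void *arg)
{
	uint64_t sum = 0;

	(void)arg;
	while (!atomic_load_explicit(&stop_reader, memory_order_relaxed)) {
		sum += atomic_load_explicit(&hdr.csum_start, memory_order_relaxed);
		sum += atomic_load_explicit(&hdr.csum_offset, memory_order_relaxed);
		sum += atomic_load_explicit(&hdr.flags, memory_order_relaxed);
	}
	sink = sum;
	return NULL;
}

/* Run the writer loop against a concurrent reader.
 * Returns the writer's elapsed time in ms, or -1.0 if the thread
 * could not be created. */
static double run_writer(int use_check, long iters)
{
	pthread_t tid;
	struct timespec t0, t1;
	long i;

	atomic_store(&stop_reader, 0);
	if (pthread_create(&tid, NULL, reader, NULL) != 0)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; ++i) {
		if (use_check) {
			ASSIGN_UNLESS_EQUAL_RELAXED(hdr.csum_start, 0);
			ASSIGN_UNLESS_EQUAL_RELAXED(hdr.csum_offset, 0);
			ASSIGN_UNLESS_EQUAL_RELAXED(hdr.flags, 0);
		} else {
			atomic_store_explicit(&hdr.csum_start, 0, memory_order_relaxed);
			atomic_store_explicit(&hdr.csum_offset, 0, memory_order_relaxed);
			atomic_store_explicit(&hdr.flags, 0, memory_order_relaxed);
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	atomic_store(&stop_reader, 1);
	pthread_join(tid, NULL);
	return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Calling run_writer(0, n) and run_writer(1, n) with the same n as the
quoted loops (1024*1024*128) and comparing the two times would measure
the write-versus-read contention rather than raw instruction cycles.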

	--yliu

> == Results ==
> 
> Computed as total time of all threads:
> 
> for i = 1..THREAD_COUNT:
> 	result += end[i] - start[i]
> 
> cpu           threads  without-check (ms)  with-check (ms)
> Xeon E5-2670        1            516              529
> Xeon E5-2670        2           1155              953
> Xeon E5-2670        8           8947             5044
> Xeon E5-2670       16          23335            16836
> Zynq-7020 (armv7)   1           6735             7205
> Zynq-7020 (armv7)   2          13753            14418
> 
> The advantage for Intel is evident when increasing the number
> of threads.
> 
> However, on 32-bit ARMs we might expect some performance drop.
> 
> Regards
> Jan
> 
> > > 
> > > 
> > > [...]  
> > > > +/* avoid write operation when necessary, to lessen cache issues */
> > > > +#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
> > > > +	if ((var) != (val))			\
> > > > +		(var) = (val);			\
> > > > +} while (0)  
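
For reference, a small sketch of how the quoted macro behaves on a
header-like struct. The struct demo_hdr and demo_clear_csum() below are
hypothetical and only there for illustration; the macro body is the one
from the patch. Note that the macro evaluates both arguments more than
once, so they should be free of side effects, and that the do/while (0)
wrapper lets it be used as a single statement.

```c
#include <stdint.h>

#define ASSIGN_UNLESS_EQUAL(var, val) do {	\
	if ((var) != (val))			\
		(var) = (val);			\
} while (0)

/* Hypothetical layout for illustration; not the real struct virtio_net_hdr. */
struct demo_hdr {
	uint8_t  flags;
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Clear the checksum-related fields, storing only to fields that are
 * not already 0. Returns how many fields actually needed a store. */
static int demo_clear_csum(struct demo_hdr *h)
{
	int dirty = (h->csum_start != 0) + (h->csum_offset != 0) +
		    (h->flags != 0);

	ASSIGN_UNLESS_EQUAL(h->csum_start, 0);
	ASSIGN_UNLESS_EQUAL(h->csum_offset, 0);
	ASSIGN_UNLESS_EQUAL(h->flags, 0);
	return dirty;
}
```

On a header that is already all zero, the function only reads, which is
the case the patch is trying to make cheap.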

