[dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy

Li, Xiaoyun xiaoyun.li at intel.com
Wed Oct 18 04:21:30 CEST 2017


Hi

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas at monjalon.net]
> Sent: Wednesday, October 18, 2017 05:24
> To: Li, Xiaoyun <xiaoyun.li at intel.com>; Ananyev, Konstantin
> <konstantin.ananyev at intel.com>; Richardson, Bruce
> <bruce.richardson at intel.com>
> Cc: dev at dpdk.org; Lu, Wenzhuo <wenzhuo.lu at intel.com>; Zhang, Helin
> <helin.zhang at intel.com>; ophirmu at mellanox.com
> Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> Hi,
> 
> 13/10/2017 11:01, Xiaoyun Li:
> > This patch dynamically selects the memcpy functions at run-time based
> > on the CPU flags that the current machine supports. It uses function
> > pointers which are bound to the appropriate functions at constructor time.
> > In addition, the AVX512 instruction set code is compiled only if the user
> > enables it in the config and the compiler supports it.
> >
> > Signed-off-by: Xiaoyun Li <xiaoyun.li at intel.com>
> > ---
> Keeping only the major changes of the patch for later discussions:
> [...]
> >  static inline void *
> >  rte_memcpy(void *dst, const void *src, size_t n)
> >  {
> > -	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> > -		return rte_memcpy_aligned(dst, src, n);
> > +	if (n <= RTE_X86_MEMCPY_THRESH)
> > +		return rte_memcpy_internal(dst, src, n);
> >  	else
> > -		return rte_memcpy_generic(dst, src, n);
> > +		return (*rte_memcpy_ptr)(dst, src, n);
> >  }
> [...]
> > +static inline void *
> > +rte_memcpy_internal(void *dst, const void *src, size_t n)
> > +{
> > +	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> > +		return rte_memcpy_aligned(dst, src, n);
> > +	else
> > +		return rte_memcpy_generic(dst, src, n);
> > +}
> 
> The significant change of this patch is to call a function pointer for packet
> size > 128 (RTE_X86_MEMCPY_THRESH). 
The perf drop is due to the function call replacing the inline call.
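
For reference, the run-time binding works roughly like the standalone sketch
below. It is simplified: the real patch binds the SSE/AVX2/AVX512 variants of
rte_memcpy through DPDK's CPU-flag checks, so the helper names here are only
placeholders.

#include <stddef.h>
#include <string.h>

/* Placeholder per-ISA implementations; in the patch these are the SSE/AVX2/
 * AVX512 variants built from rte_memcpy_internal. */
static void *memcpy_sse(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *memcpy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Pointer used for copies larger than RTE_X86_MEMCPY_THRESH. */
static void *(*rte_memcpy_ptr)(void *, const void *, size_t) = memcpy_sse;

/* Bound once, before main(), to the widest ISA the running CPU supports. */
__attribute__((constructor))
static void rte_memcpy_init(void)
{
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		rte_memcpy_ptr = memcpy_avx2;
}

int main(void)
{
	char src[256] = "test", dst[256];

	/* Sizes above the threshold go through the pre-bound pointer, so the
	 * compiler cannot inline the copy; that is where the overhead for
	 * smaller packets comes from. */
	rte_memcpy_ptr(dst, src, sizeof(src));
	return 0;
}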

> Please could you provide some benchmark numbers?
I ran memcpy_perf_test, which shows the time cost of memcpy. I ran it on Broadwell with SSE and AVX2.
But I just drew pictures and looked at the trend; I did not compute the exact percentage. Sorry about that.
The picture shows results for copy sizes of 2, 4, 6, 8, 9, 12, 16, 32, 64, 128, 192, 256, 320, 384, 448, 512, 768, 1024, 1518, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192.
In my test, as the size grows, the drop shrinks. (Copy time is used to indicate the perf.)
From the trend picture, when the size is smaller than 128 bytes, the perf drops a lot, almost 50%. Above 128 bytes, it approaches the original DPDK.
I computed it just now; it shows that between 128 bytes and 1024 bytes, the perf drops about 15%. Above 1024 bytes, the perf drops about 4%.
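
For anyone who wants to reproduce the trend without the DPDK perf test, a rough
timing loop in the spirit of memcpy_perf_test is below. It uses plain memcpy and
clock_gettime() just to illustrate the method; the real test measures rte_memcpy
in TSC cycles, so the absolute numbers will differ.

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Average nanoseconds per copy for a given size (illustration only). */
static double time_copy(size_t size, int iters)
{
	static char src[8192], dst[8192];
	struct timespec t0, t1;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		memcpy(dst, src, size);
		/* Keep the compiler from optimising the copy away. */
		asm volatile("" : : "r"(dst) : "memory");
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / iters;
}

int main(void)
{
	size_t sizes[] = { 64, 128, 256, 1024, 4096, 8192 };
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%4zu bytes: %.2f ns/copy\n",
		       sizes[i], time_copy(sizes[i], 1000000));
	return 0;
}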

> From a test done at Mellanox, there might be a performance degradation of
> about 15% in testpmd txonly with AVX2.
> Is there someone else seeing a performance degradation?

