[dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA platform

Ananyev, Konstantin konstantin.ananyev at intel.com
Fri Dec 16 12:47:37 CET 2016


Hi Zhiyong,

> > > > > >
> > > > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > > > >
> > > > > static inline void*
> > > > > rte_memset_huge(void *s, int c, size_t n) {
> > > > >    return __rte_memset_vector(s, c, n); }
> > > > >
> > > > > static inline void *
> > > > > rte_memset(void *s, int c, size_t n) {
> > > > > 	if (n < XXX)
> > > > > 		return rte_memset_scalar(s, c, n);
> > > > > 	else
> > > > > 		return rte_memset_huge(s, c, n); }
> > > > >
> > > > > XXX could be either a define, or could also be a variable, so it
> > > > > can be set up at startup, depending on the architecture.
> > > > >
> > > > > Would that work?
> > > > > Konstantin
> > > > >
> > > I have implemented the code for choosing the function at run time.
> > > rte_memcpy is used more frequently, so I tested it at run time.
> > >
> > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> > > extern rte_memcpy_vector_t rte_memcpy_vector;
> > >
> > > static inline void *
> > > rte_memcpy(void *dst, const void *src, size_t n)
> > > {
> > >         return rte_memcpy_vector(dst, src, n);
> > > }
> > >
> > > In order to reduce the overhead at run time, I assign the function address
> > > to the variable rte_memcpy_vector before main() starts, to init the var.
> > >
> > > static void __attribute__((constructor))
> > > rte_memcpy_init(void)
> > > {
> > > 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > > 		rte_memcpy_vector = rte_memcpy_avx2;
> > > 	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> > > 		rte_memcpy_vector = rte_memcpy_sse;
> > > 	else
> > > 		rte_memcpy_vector = memcpy;
> > > }
> >
> > I thought we discussed a bit different approach.
> > In which rte_memcpy_vector() (rte_memset_vector()) would be called only
> > after some cutoff point, i.e.:
> >
> > void
> > rte_memcpy(void *dst, const void *src, size_t len) {
> > 	if (len < N) memcpy(dst, src, len);
> > 	else rte_memcpy_vector(dst, src, len);
> > }
> >
> > If you just always call rte_memcpy_vector() for every len, then it means that
> > the compiler most likely always has to generate a proper call (no inlining
> > happens).
> 
> > For small lengths the price of the extra function call would probably outweigh
> > any potential gain from the SSE/AVX2 implementation.
> >
> > Konstantin
> 
> Yes, in fact, from my tests, for small lengths rte_memset is far better than glibc memset,
> while for large lengths rte_memset is only a bit better than memset,
> because memset uses AVX2/SSE too. Of course, it will use AVX512 on future machines.

Ok, thanks for clarification.
From previous mails I got the wrong impression that on big lengths
rte_memset_vector() is significantly faster than memset().

> 
> > For small lengths the price of the extra function call would probably outweigh
> > any potential gain.
> This is the key point. I think it should include the scalar optimization, not only the vector optimization.
> 
> rte_memset is always inlined, so for small lengths it will be better,
> since in some cases we are not sure that memset is inlined by the compiler.

Ok, so do you know in what cases memset() does not get inlined?
Is it when the len parameter can't be precomputed by the compiler
(i.e. is not a constant)?

So to me it sounds like:
- We don't need an optimized version of rte_memset() for big sizes.
- Which probably means we don't need arch-specific versions of rte_memset_vector() at all -
   for small sizes (<= 32B) the scalar version would be good enough.
- For big sizes we can just rely on memset().
Is that so?

> It seems that choosing the function at run time will lose the gains.
> The following was tested on Haswell with the patch code.

Not sure what columns 2 and 3 in the table below mean? 
Konstantin

> ** rte_memset() - memset perf tests
>         (C = compile-time constant) **
> ======= =============== ===============
>    Size memset in cache   memset in mem
> (bytes)         (ticks)         (ticks)
> ------- --------------- ---------------
> ============ 32B aligned ===============
>       3        3 -    8      19 -  128
>       4        4 -    8      13 -  128
>       8        2 -    7      19 -  128
>       9        2 -    7      19 -  127
>      12        2 -    7      19 -  127
>      17        3 -    8      19 -  132
>      64        3 -    8      28 -  168
>     128        7 -   13      54 -  200
>     255        8 -   20     100 -  223
>     511       14 -   20     187 -  314
>    1024       24 -   29     328 -  379
>    8192      198 -  225    1829 - 2193
> 
> Thanks
> Zhiyong


