[dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy

Li, Xiaoyun xiaoyun.li at intel.com
Mon Oct 2 02:12:25 CEST 2017


Hi

> That means that each file with '#include <re_memcpy.h> will have its own
> copy
> of that function:
> $ objdump -d  x86_64-native-linuxapp-gcc/app/testpmd  | grep
> '<rte_memcpy_init>:' | sort -u | wc -l
> 233
> Same story for rte_memcpy_ptr and rte_memcpy_DEFAULT, etc...
> Obviously we need (and want) only one copy of that stuff per binary.
> 
> > +#ifdef CC_SUPPORT_AVX2
> 
> Why do you assume this macro will be defined?
> By whom?
> There is no such macro with gcc:
> $ gcc -march=native -dM -E - </dev/null 2>&1 | grep AVX2
> #define __AVX2__ 1
> , and you don't define it yourself.
> When building with '-march=native' on BDW only rte_memcpy_DEFAULT get
> compiled.
> 

I defined it myself. But when I sort the patch, I forgot to modify the file in this version. Sorry about that.
It should be like this. To check whether the compiler supports AVX2 or AVX512.

diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index a813c91..92399ec 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -141,3 +141,17 @@ space:= $(empty) $(empty)
 CPUFLAGSTMP1 := $(addprefix RTE_CPUFLAG_,$(CPUFLAGS))
 CPUFLAGSTMP2 := $(subst $(space),$(comma),$(CPUFLAGSTMP1))
 CPUFLAGS_LIST := -DRTE_COMPILE_TIME_CPUFLAGS=$(CPUFLAGSTMP2)
+
+# Check if the compiler supports AVX512.
+CC_SUPPORT_AVX512 := $(shell $(CC) -march=skylake-avx512 -dM -E - < /dev/null 2>&1 | grep -q AVX512 && echo 1)
+ifeq ($(CC_SUPPORT_AVX512),1)
+ifeq ($(CONFIG_RTE_ENABLE_AVX512),y)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX512
+endif
+endif
+
+# Check if the compiler supports AVX2.
+CC_SUPPORT_AVX2 := $(shell $(CC) -march=core-avx2 -dM -E - < /dev/null 2>&1 | grep -q AVX2 && echo 1)
+ifeq ($(CC_SUPPORT_AVX2),1)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX2
+endif


> To summarize: as I understand the goal of that patch was
> (assuming that our current rte_memcpy() implementation is good in terms of
> both performance and functionality):
> 1. Based on current rte_memcpy() implementation define 3 x86 arch specific
> rte_memcpy flavors:
>   a) rte_memcpy_SSE
>   b) rte_memcpy_AVX2
>   c) rte_memcpy_AVX512
> 2. Select appropriate flavor based on current HW at runtime,
>     i.e. both 3 flavors should be present in the binary and selection should be
> made
>     at program startup.
> 
> As I can see none of the goals was achieved with the current patch,
> instead a lot of redundant code was introduced.
> So I think it is NACK for the current version.
> What I think need to be done instead:
> 
> 1.  mv lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
> 2. inside rte_memcpy_internal.h rename rte_memcpy() into
> rte_memcpy_internal().
> 3. create 3 files:
>     rte_memcpy_sse.c
>     rte_memcpy_avx2.c
>     rte_memcpy_avx512.c
> 
> Inside each of these files we define corresponding rte_memcpy_xxx()
> function.
> I.E:
> rte_memcpy_avx2.c:
> ....
> #ifndef RTE_MACHINE_CPUFLAG_AVX2
> #error "no avx2 support"
> endif
> 
> #include "rte_memcpy_internal.h"
> ...
> 
> void *
> rte_memcpy_avx2(void *dst, const void *src, size_t n)
> {
>    return rte_memcpy_internal(dst, src, n);
> }
> 
> 4. Make changes inside lib/librte_eal/Makefile to ensure that each of
> rte_memcpy_xxx()
> get build with appropriate -march flags (I.E: avx2 with -mavx2, etc.)
> You can use librte_acl/Makefile as a reference.
> 
> 5. Create rte_memcpy.c and put rte_memcpy_ptr/rte_memcpy_init()
> definitions in that file.
> 6. Create new rte_memcpy.h  and define rte_memcpy() in it:
> 
> ...
> #include <rte_memcpy_internal.h>
> ...
> 
> +#define RTE_X86_MEMCPY_THRESH 128
> static inline void *
> rte_memcpy(void *dst, const void *src, size_t n)
> {
> 	if (n <= RTE_X86_MEMCPY_THRESH)
>                   return rte_memcpy_internal(dst, src, n);
>                else
> 	   return (*rte_memcpy_ptr)(dst, src, n);
> }
> 
> 7. Test it properly - i.e. build dpdk with default target and make sure each of
> 3 flavors
> could be selected properly at runtime based on underlying arch.
> 
> 8. As a possible future improvement - with such changes we don't need a
> generic inline
> implementation. We can think about creating a faster version that need to
> copy
> <= 128B.
> 
> Konstantin
> 

Will modify it in next version.
Thanks.



More information about the dev mailing list