[dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms

Wang, Zhihong zhihong.wang at intel.com
Mon Sep 18 07:10:40 CEST 2017


> Hi Zhihong Wang
> 
> I tested the avx512 rte_memcpy and found that its performance with ovs dpdk
> is lower than that of the avx2 rte_memcpy.

Hi Haifeng,

AVX512 memcpy is marked as experimental and is disabled by default; its
benefit varies from case to case. Enable it only after the specific case
(SW + HW setup with the expected data pattern) has been verified.
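
As a first step on the HW side, something like the following could be used
to check that the CPU reports AVX512F at all. This is only a sketch: the
helper name has_avx512f is made up for illustration, and it assumes a Linux
system where /proc/cpuinfo lists the avx512f flag. It says nothing about
whether AVX512 actually helps a given workload.

```shell
# Hypothetical helper (not part of DPDK): succeeds if the given cpuinfo
# file lists the avx512f flag. Defaults to /proc/cpuinfo on Linux.
has_avx512f() {
    grep -qw avx512f "${1:-/proc/cpuinfo}"
}

if has_avx512f; then
    echo "avx512f reported: verify the real workload before setting CONFIG_RTE_ENABLE_AVX512=y"
else
    echo "avx512f not reported: leave CONFIG_RTE_ENABLE_AVX512 at its default"
fi
```

The flag check only rules a platform out; the SW setup and data pattern
still have to be measured with the actual application.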

BTW, micro benchmarks like test_memcpy_perf are not recommended for
reporting memcpy performance, as they are unlikely to reflect the
performance of real-world applications. Please find more details at
https://software.intel.com/en-us/articles/performance-optimization-of-memcpy-in-dpdk


Thanks
Zhihong

> 
> Results of the vm loop test for ovs dpdk:
> avx512 is *15*Gbps
> perf data:
>   0.52 │      vmovdq (%r8,%r10,1),%zmm0
>  95.33 │      sub    $0x40,%r9
>   0.45 │      add    $0x40,%r8
>   0.60 │      vmovdq %zmm0,-0x40(%r8)
>   1.84 │      cmp    $0x3f,%r9
>        │    ↓ ja     f20
>        │      lea    -0x40(%rsi),%r8
>   0.15 │      or     $0xffffffffffffffc0,%rsi
>   0.21 │      and    $0xffffffffffffffc0,%r8
>   0.00 │      lea    0x40(%rsi,%r8,1),%rsi
>   0.00 │      vmovdq (%rcx,%rsi,1),%zmm0
>   0.22 │      vmovdq %zmm0,(%rdx,%rsi,1)
>   0.67 │    ↓ jmpq   c78
>        │      mov    -0x128(%rbp),%rdi
>        │      rex.R
>        │      .byte  0x89
>        │      popfq
> 
> avx2 is *18.8*Gbps
> perf data:
>   0.96 │      add    %r9,%r13
>  66.04 │      vmovdq (%rdx),%ymm0
>   1.20 │      sub    $0x40,%rdi
>   1.53 │      add    $0x40,%rdx
>  10.83 │      vmovdq %ymm0,-0x40(%rdx,%r15,1)
>   8.64 │      vmovdq -0x20(%rdx),%ymm0
>   7.58 │      vmovdq %ymm0,-0x40(%rdx,%r13,1)
> 
> 
> dpdk version: v17.05
> ovs version: 2.8.90
> qemu version: QEMU emulator version 2.9.94 (v2.10.0-rc4-dirty)
> 
> gcc version: gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
> kernel version: 3.10.0
> 
> 
> compile dpdk:
> CONFIG_RTE_ENABLE_AVX512=y
> export DPDK_DIR=$PWD
> export DPDK_TARGET=x86_64-native-linuxapp-gcc
> export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
> make install T=$DPDK_TARGET DESTDIR=install
> 
> compile ovs:
> sh boot.sh
> ./configure CFLAGS="-g -O2" --with-dpdk=$DPDK_BUILD --prefix=/usr --localstatedir=/var --sysconfdir=/etc
> make -j
> make install
> 
> The test for dpdk test_memcpy_perf:
> avx2:
> ** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
> ======= ============== ============== ============== ==============
>    Size Cache to cache   Cache to mem   Mem to cache     Mem to mem
> (bytes)        (ticks)        (ticks)        (ticks)        (ticks)
> ------- -------------- -------------- -------------- --------------
> ========================== 32B aligned ============================
>      64       6 -   10      27 -   52      30 -   39      56 -   97
>     512      24 -   44     251 -  271     145 -  217     396 -  447
>    1024      35 -   78     394 -  433     252 -  319     609 -  670
> ------- -------------- -------------- -------------- --------------
> C    64       3 -    9      28 -   31      29 -   40      55 -   66
> C   512      25 -   55     253 -  268     139 -  268     397 -  410
> C  1024      32 -   83     394 -  416     250 -  396     612 -  687
> =========================== Unaligned =============================
>      64       8 -    9      85 -   71      45 -   45     125 -  121
>     512      33 -   49     282 -  305     153 -  252     420 -  478
>    1024      42 -   83     409 -  491     259 -  389     640 -  748
> ------- -------------- -------------- -------------- --------------
> C    64       4 -    9      42 -   46      39 -   46      76 -   90
> C   512      33 -   55     280 -  272     153 -  281     421 -  415
> C  1024      41 -   83     407 -  427     258 -  405     578 -  701
> ======= ============== ============== ============== ==============
> 
> avx512:
> ** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
> ======= ============== ============== ============== ==============
>    Size Cache to cache   Cache to mem   Mem to cache     Mem to mem
> (bytes)        (ticks)        (ticks)        (ticks)        (ticks)
> ------- -------------- -------------- -------------- --------------
> ========================== 64B aligned ============================
>      64       6 -    9      18 -   33      24 -   38      40 -   65
>     512      18 -   44     178 -  262     138 -  218     309 -  429
>    1024      27 -   79     338 -  430     250 -  322     560 -  674
> ------- -------------- -------------- -------------- --------------
> C    64       3 -    9      18 -   20      23 -   41      39 -   50
> C   512      15 -   54     205 -  270     134 -  268     304 -  409
> C  1024      24 -   83     371 -  414     242 -  400     550 -  692
> =========================== Unaligned =============================
>      64       8 -    9      87 -   74      45 -   48     125 -  118
>     512      23 -   49     298 -  311     150 -  250     437 -  482
>    1024      36 -   83     427 -  505     259 -  406     633 -  754
> ------- -------------- -------------- -------------- --------------
> C    64       4 -    9      42 -   46      39 -   46      76 -   94
> C   512      23 -   55     246 -  277     152 -  290     349 -  426
> C  1024      38 -   83     398 -  431     258 -  416     634 -  725
> ======= ============== ============== ============== ==============
> 


