[dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

Wang, Zhihong zhihong.wang at intel.com
Tue Jan 27 09:22:12 CET 2015



> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of EDMISON, Kelvin
> (Kelvin)
> Sent: Friday, January 23, 2015 2:22 AM
> To: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> On 2015-01-21, 3:54 PM, "Neil Horman" <nhorman at tuxdriver.com> wrote:
> 
> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> >> On Wed, 21 Jan 2015 13:26:20 +0000
> >> Bruce Richardson <bruce.richardson at intel.com> wrote:
> >>
> >> > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> >> > >
> >> > > On 21/01/15 14:02, Bruce Richardson wrote:
> >> > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> >> > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> >> > > >>>>-----Original Message-----
> >> > > >>>>From: Richardson, Bruce
> >> > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> >> > > >>>>To: Neil Horman
> >> > > >>>>Cc: Wang, Zhihong; dev at dpdk.org
> >> > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >> > > >>>>
> >> > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> >> > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong
> wrote:
> >> > > >>>>>>>-----Original Message-----
> >> > > >>>>>>>From: Neil Horman [mailto:nhorman at tuxdriver.com]
> >> > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> >> > > >>>>>>>To: Wang, Zhihong
> >> > > >>>>>>>Cc: dev at dpdk.org
> >> > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> optimization
> >> > > >>>>>>>
> >> > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> >>zhihong.wang at intel.com
> >> > > >>>>wrote:
> >> > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> >>AVX
> >> > > >>>>platforms.
> >> > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> >>and
> >> > > >>>>>>>>more test
> >> > > >>>>>>>points.
> >> > > >>>>>>>>Optimization techniques are summarized below:
> >> > > >>>>>>>>
> >> > > >>>>>>>>1. Utilize full cache bandwidth
> >> > > >>>>>>>>
> >> > > >>>>>>>>2. Enforce aligned stores
> >> > > >>>>>>>>
> >> > > >>>>>>>>3. Apply load address alignment based on architecture
> >>features
> >> > > >>>>>>>>
> >> > > >>>>>>>>4. Make load/store address available as early as possible
> >> > > >>>>>>>>
> >> > > >>>>>>>>5. General optimization techniques like inlining, branch
> >> > > >>>>>>>>reducing, prefetch pattern access
> >> > > >>>>>>>>
> >> > > >>>>>>>>Zhihong Wang (4):
> >> > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> >> > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> >> > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> >> > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both
> SSE
> >>and AVX
> >> > > >>>>>>>>     platforms
> >> > > >>>>>>>>
> >> > > >>>>>>>>  app/test/Makefile                                  |   6 +
> >> > > >>>>>>>>  app/test/test_memcpy.c                             |  52
> >>+-
> >> > > >>>>>>>>  app/test/test_memcpy_perf.c                        | 238
> >>+++++---
> >> > > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h           | 664
> >> > > >>>>>>>+++++++++++++++------
> >> > > >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> >> > > >>>>>>>>
> >> > > >>>>>>>>--
> >> > > >>>>>>>>1.9.3
> >> > > >>>>>>>>
> >> > > >>>>>>>>
> >> > > >>>>>>>Are you able to compile this with gcc 4.9.2?  The
> >>compilation of
> >> > > >>>>>>>test_memcpy_perf is taking forever for me.  It appears hung.
> >> > > >>>>>>>Neil
> >> > > >>>>>>Neil,
> >> > > >>>>>>
> >> > > >>>>>>Thanks for reporting this!
> >> > > >>>>>>It should compile, but will take quite some time if the CPU
> >> > > >>>>>>doesn't support AVX2. The reasons are:
> >> > > >>>>>>1. The SSE & AVX memcpy implementation is more complicated
> >> > > >>>>>>than the AVX2 version, so the compiler takes more time to
> >> > > >>>>>>compile and optimize it.
> >> > > >>>>>>2. The new test_memcpy_perf.c contains 126 constant memcpy
> >> > > >>>>>>calls for better test case coverage, which is quite a lot.
> >> > > >>>>>>
> >> > > >>>>>>I've just tested this patch on an Ivy Bridge machine with GCC
> >> > > >>>>>>4.9.2:
> >> > > >>>>>>1. The whole compile process takes 9'41" with the original
> >> > > >>>>>>test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls).
> >> > > >>>>>>2. It takes only 2'41" after I reduce the constant memcpy
> >> > > >>>>>>call count to 12 + 12 = 24.
> >> > > >>>>>>
> >> > > >>>>>>I'll reduce the memcpy calls in the next version of the patch.
> >> > > >>>>>>
> >> > > >>>>>ok, thank you.  I'm all for optimization, but I think a
> >> > > >>>>>compile that takes almost 10 minutes for a single file is
> >> > > >>>>>going to generate some raised eyebrows when end users start
> >> > > >>>>>tinkering with it.
> >> > > >>>>>
> >> > > >>>>>Neil
> >> > > >>>>>
> >> > > >>>>>>Zhihong (John)
> >> > > >>>>>>
> >> > > >>>>Even two minutes is a very long time to compile, IMHO. The
> >> > > >>>>whole of DPDK doesn't take that long to compile right now, and
> >> > > >>>>that's with a couple of huge header files with routing tables
> >> > > >>>>in them. Any chance you could cut compile time down to a few
> >> > > >>>>seconds while still having reasonable tests?
> >> > > >>>>Also, when there is AVX2 present on the system, what is the
> >>compile time
> >> > > >>>>like for that code?
> >> > > >>>>
> >> > > >>>>	/Bruce
> >> > > >>>Neil, Bruce,
> >> > > >>>
> >> > > >>>Some data first.
> >> > > >>>
> >> > > >>>Sandy Bridge without AVX2:
> >> > > >>>1. original w/ 10 constant memcpy: 2'25"
> >> > > >>>2. patch w/ 12 constant memcpy: 2'41"
> >> > > >>>3. patch w/ 63 constant memcpy: 9'41"
> >> > > >>>
> >> > > >>>Haswell with AVX2:
> >> > > >>>1. original w/ 10 constant memcpy: 1'57"
> >> > > >>>2. patch w/ 12 constant memcpy: 1'56"
> >> > > >>>3. patch w/ 63 constant memcpy: 3'16"
> >> > > >>>
> >> > > >>>Also, to address Bruce's question, we have to reduce test cases
> >> > > >>>to cut down compile time, because we use:
> >> > > >>>1. intrinsics instead of assembly, for better flexibility and to
> >> > > >>>utilize more compiler optimizations
> >> > > >>>2. complex function bodies for better performance
> >> > > >>>3. inlining
> >> > > >>>This increases compile time.
> >> > > >>>But I think it'd be okay to do that as long as we can select a
> >> > > >>>fair set of test points.
> >> > > >>>
> >> > > >>>It'd be great if you could give some suggestion, say, 12 points.
> >> > > >>>
> >> > > >>>Zhihong (John)
> >> > > >>>
> >> > > >>>
> >> > > >>While I agree that in the general case these long compilation
> >> > > >>times are painful for users, a 2-8x factor in memcpy operations
> >> > > >>is quite an improvement, especially in DPDK applications which
> >> > > >>(unfortunately) need to rely heavily on them -- e.g. IP
> >> > > >>fragmentation and reassembly.
> >> > > >>
> >> > > >>Why not have fast compilation by default, and a tunable config
> >> > > >>flag to enable a highly optimized version of rte_memcpy (e.g.
> >> > > >>RTE_EAL_OPT_MEMCPY)?
> >> > > >>
> >> > > >>Marc
> >> > > >>
> >> > > >Out of interest, are these 2-8x improvements something you have
> >>benchmarked
> >> > > >in these app scenarios? [i.e. not just in micro-benchmarks].
> >> > >
> >> > > How much that micro-speedup will end up affecting the performance
> >> > > of the entire application is something I cannot say, so I agree
> >> > > that we should probably have some additional benchmarks before
> >> > > deciding whether it pays off to maintain two versions of rte_memcpy.
> >> > >
> >> > > There are however a bunch of possible DPDK applications that could
> >> > > potentially benefit: IP fragmentation, tunneling and specialized
> >> > > DPI applications, among others, since they involve a reasonable
> >> > > amount of memcpys per packet. My point was, *if* it proves
> >> > > beneficial enough, why not have it as an option?
> >> > >
> >> > > Marc
> >> >
> >> > I agree, if it provides the speedups then we need to have it in - and
> >>quite possibly
> >> > on by default, even.
> >> >
> >> > /Bruce
> >>
> >> One issue I have is that as a vendor we need to ship one binary, not
> >> different distributions for each Intel chip variant. There is some
> >> support for multi-versioned functions, but only in the latest GCC,
> >> which isn't in Debian stable. And the multi-versioned functions are
> >> going to be more expensive than inlining. In some cases I have seen
> >> that the overhead of fancy instructions looks good but has nasty side
> >> effects like CPU stalls and/or increased power consumption which
> >> turns off turbo boost.
> >>
> >>
> >> Distros in general have the same problem with special-case
> >> optimizations.
> >>
> >What we really need is to do something like borrowing the alternatives
> >mechanism from the kernel so that we can dynamically replace
> >instructions at run time based on CPU flags.  That way we could make
> >the choice at run time, and wouldn't have to do a lot of special-case
> >jumping about.
> >Neil
> 
> +1.
> 
> I think it should be an anti-requirement that the build machine be the
> exact same chip as the deployment platform.
> 
> I like the cpu flag inspection approach.  It would help in the case where
> DPDK is in a VM and an odd set of CPU flags has been exposed.
> 
> If that approach doesn't work though, then perhaps DPDK memcpy could go
> through benchmarking at app startup time and select the most performant
> option out of a set, like mdraid's raid6 implementation does.  To give an
> example, this is what my systems print out at boot time re: raid6
> algorithm selection.
> raid6: sse2x1    3171 MB/s
> raid6: sse2x2    3925 MB/s
> raid6: sse2x4    4523 MB/s
> raid6: using algorithm sse2x4 (4523 MB/s)
> 
> Regards,
>    Kelvin
> 

Thanks for the proposal!

For DPDK, performance is always the most important concern. We need to utilize new architecture features to achieve that, so a per-architecture solution is necessary.
Even a few extra cycles can lead to bad performance if they're in a hot loop.
For instance, if DPDK takes 60 cycles to process a packet on average, then 3 extra cycles here mean a 5% performance drop.

The dynamic solution is doable, but it comes with performance penalties, even if small ones. It may also bring extra complexity, which can lead to unpredictable behavior and side effects.
For example, the dynamic solution can't do inline unrolling, which brings a significant performance benefit for small copies with constant length, like copying an Ethernet address (eth_addr).

We can investigate the VM scenario more.

Zhihong (John)

