[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

Wang, Zhihong zhihong.wang at intel.com
Sun Sep 25 07:41:55 CEST 2016



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Friday, September 23, 2016 9:41 PM
> To: Jianbo Liu <jianbo.liu at linaro.org>
> Cc: dev at dpdk.org; Wang, Zhihong <zhihong.wang at intel.com>; Yuanhan Liu
> <yuanhan.liu at linux.intel.com>; Maxime Coquelin
> <maxime.coquelin at redhat.com>
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> 2016-09-23 18:41, Jianbo Liu:
> > On 23 September 2016 at 10:56, Wang, Zhihong <zhihong.wang at intel.com>
> wrote:
> > .....
> > > This is expected because the 2nd patch is just a baseline and all optimization
> > > patches are organized in the rest of this patch set.
> > >
> > > I think you can do bottleneck analysis on ARM to see what's slowing down the
> > > perf, there might be some micro-arch complications there, most likely in
> > > memcpy.
> > >
> > > Do you use glibc's memcpy? I suggest hand-crafting it on your own.
> > >
> > > Could you publish the mrg_rxbuf=on data also? Since it's more widely used
> > > in terms of spec integrity.
> > >
> > I don't think it will be helpful for you, considering the differences
> > between x86 and arm.


Hi Jianbo,

This patch does help on ARM for small packets such as 64B ones, which
actually demonstrates the similarity between x86 and ARM in terms of
the caching optimization in this patch.

My estimation is based on:

 1. The last patch is for mrg_rxbuf=on, and since you said it helps
    perf, we can ignore it for now while we discuss mrg_rxbuf=off

 2. Vhost enqueue perf =
    Ring overhead + Virtio header overhead + Data memcpy overhead

 3. This patch helps small-packet traffic, which means it speeds up
    the ring + virtio header operations

 4. So when you say perf drops once the packet size exceeds 512B, this
    is most likely caused by memcpy on ARM not working well with this
    patch (a rough measurement sketch follows this list)
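
To make that breakdown concrete, here is a rough, hypothetical
instrumentation sketch (not code from this patch set) showing how the
three components could be measured separately with rte_rdtsc();
update_ring(), write_virtio_hdr() and copy_payload() are placeholder
names for the corresponding steps in the enqueue path, not DPDK APIs:

    /*
     * Hypothetical sketch: per-stage cycle breakdown of vhost enqueue.
     * Only rte_rdtsc() from <rte_cycles.h> is a real DPDK API here;
     * the commented-out stage calls are placeholders.
     */
    #include <stdint.h>
    #include <rte_cycles.h>

    struct enq_cycles {
            uint64_t ring;  /* avail/used ring handling           */
            uint64_t hdr;   /* virtio-net header write            */
            uint64_t copy;  /* payload copy into the guest buffer */
    };

    static inline void
    enqueue_one_profiled(struct enq_cycles *c)
    {
            uint64_t t0, t1, t2, t3;

            t0 = rte_rdtsc();
            /* update_ring(...);      placeholder: descriptor/index work   */
            t1 = rte_rdtsc();
            /* write_virtio_hdr(...); placeholder: virtio-net header write */
            t2 = rte_rdtsc();
            /* copy_payload(...);     placeholder: rte_memcpy() of data    */
            t3 = rte_rdtsc();

            c->ring += t1 - t0;
            c->hdr  += t2 - t1;
            c->copy += t3 - t2;
    }

If the copy bucket dominates once packets grow past 512B, that would
support the memcpy theory in point 4 above.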

I'm not saying glibc's memcpy is not good enough; it's just that this
is a rather special use case. And since on x86 we see that a
specialized memcpy + this patch gives significantly better performance
than any other combination, we suggest hand-crafting a specialized
memcpy for it.
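
For illustration only, a minimal sketch of what such a hand-crafted
copy could look like on ARMv8 with NEON intrinsics; this is an
assumption on my side, not code from this series, and vhost_copy_neon()
is a made-up name:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <arm_neon.h>

    /*
     * Sketch only: copy 64 bytes per iteration with NEON loads/stores,
     * falling back to libc memcpy for the tail.
     */
    static inline void
    vhost_copy_neon(void *dst, const void *src, size_t len)
    {
            uint8_t *d = dst;
            const uint8_t *s = src;

            while (len >= 64) {
                    vst1q_u8(d,      vld1q_u8(s));
                    vst1q_u8(d + 16, vld1q_u8(s + 16));
                    vst1q_u8(d + 32, vld1q_u8(s + 32));
                    vst1q_u8(d + 48, vld1q_u8(s + 48));
                    d += 64;
                    s += 64;
                    len -= 64;
            }
            if (len)
                    memcpy(d, s, len);
    }

Whether something like this actually beats glibc on your platform is
exactly what the profiling should tell us.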

Of course on ARM this is still just my speculation, and we need to
either prove it or find the actual root cause.

It would be **REALLY HELPFUL** if you could test this patch on ARM for
the mrg_rxbuf=on cases, to see whether this patch actually helps ARM at
all, since mrg_rxbuf=on is the more widely used configuration.


Thanks
Zhihong


> > So please move on with this patchset...
> 
> Jianbo,
> I don't understand.
> You said that the 2nd patch is a regression:
> -       volatile uint16_t       last_used_idx;
> +       uint16_t                last_used_idx;
> 
> And the overall series leads to a performance regression
> for packets > 512 B, right?
> But we don't know whether you have tested the v6 or not.
> 
> Zhihong talked about some improvements possible in rte_memcpy.
> ARM64 is using libc memcpy in rte_memcpy.
> 
> Now you seem to give up.
> Does it mean you accept having a regression in 16.11 release?
> Are you working on rte_memcpy?

