[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

Wang, Zhihong zhihong.wang at intel.com
Mon Oct 10 09:25:34 CEST 2016



> -----Original Message-----
> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> Sent: Monday, October 10, 2016 2:58 PM
> To: Wang, Zhihong <zhihong.wang at intel.com>
> Cc: Yuanhan Liu <yuanhan.liu at linux.intel.com>; Maxime Coquelin
> <maxime.coquelin at redhat.com>; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On 10 October 2016 at 14:22, Wang, Zhihong <zhihong.wang at intel.com>
> wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jianbo Liu [mailto:jianbo.liu at linaro.org]
> >> Sent: Monday, October 10, 2016 1:32 PM
> >> To: Yuanhan Liu <yuanhan.liu at linux.intel.com>
> >> Cc: Wang, Zhihong <zhihong.wang at intel.com>; Maxime Coquelin
> >> <maxime.coquelin at redhat.com>; dev at dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> >>
> >> On 10 October 2016 at 10:44, Yuanhan Liu <yuanhan.liu at linux.intel.com>
> >> wrote:
> >> > On Sun, Oct 09, 2016 at 12:09:07PM +0000, Wang, Zhihong wrote:
> >> >> > > > Tested with testpmd, host: txonly, guest: rxonly
> >> >> > > > size (bytes)     improvement (%)
> >> >> > > > 64                    4.12
> >> >> > > > 128                   6
> >> >> > > > 256                   2.65
> >> >> > > > 512                   -1.12
> >> >> > > > 1024                 -7.02
> >> >> > >
> >> >> > > There is a difference between Zhihong's code and the old I spotted
> in
> >> >> > > the first time: Zhihong removed the avail_idx prefetch. I
> understand
> >> >> > > the prefetch becomes a bit tricky when mrg-rx code path is
> >> considered;
> >> >> > > thus, I didn't comment on that.
> >> >> > >
> >> >> > > That's one of the difference that, IMO, could drop a regression. I
> then
> >> >> > > finally got a chance to add it back.
> >> >> > >
> >> >> > > A rough test shows it improves the performance of 1400B packet
> size
> >> >> > greatly
> >> >> > > in the "txonly in host and rxonly in guest" case: +33% is the number
> I
> >> get
> >> >> > > with my test server (Ivybridge).
> >> >> >
> >> >> > Thanks Yuanhan! I'll validate this on x86.
> >> >>
> >> >> Hi Yuanhan,
> >> >>
> >> >> Seems your code doesn't perform correctly. I write a new version
> >> >> of avail idx prefetch but didn't see any perf benefit.
> >> >>
> >> >> To be honest I doubt the benefit of this idea. The previous mrg_off
> >> >> code has this method but doesn't give any benefits.
> >> >
> >> > Good point. I thought of that before, too. But you know that I made it
> >> > in rush, that I didn't think further and test more.
> >> >
> >> > I looked the code a bit closer this time, and spotted a bug: the prefetch
> >> > actually didn't happen, due to following code piece:
> >> >
> >> >         if (vq->next_avail_idx >= NR_AVAIL_IDX_PREFETCH) {
> >> >                 prefetch_avail_idx(vq);
> >> >                 ...
> >> >         }
> >> >
> >> > Since vq->next_avail_idx is set to 0 at the entrance of enqueue path,
> >> > prefetch_avail_idx() will be called. The fix is easy though: just put
> >> > prefetch_avail_idx before invoking enqueue_packet.
> >> >
> >> > In summary, Zhihong is right, I see no more gains with that fix :(
> >> >
> >> > However, as stated, that's kind of the only difference I found between
> >> > yours and the old code, that maybe it's still worthwhile to have a
> >> > test on ARM, Jianbo?
> >> >
> >> I haven't tested it, but I think it could be no improvement for ARM either.
> >>
> >> A smalll suggestion for enqueue_packet:
> >>
> >> .....
> >> +       /* start copy from mbuf to desc */
> >> +       while (mbuf_avail || mbuf->next) {
> >> .....
> >>
> >> Considering pkt_len is in the first cache line (same as data_len),
> >> while next pointer is in the second cache line,
> >> is it better to check the total packet len, instead of the last mbuf's
> >> next pointer to jump out of while loop and avoid possible cache miss?
> >
> > Jianbo,
> >
> > Thanks for the reply!
> >
> > This idea sounds good, but it won't help the general perf in my
> > opinion, since the 2nd cache line is accessed anyway prior in
> > virtio_enqueue_offload.
> >
> Yes, you are right. I'm thinking of prefetching beforehand.
> And if it's a chained mbuf, virtio_enqueue_offload will not be called
> in next loop.
> 
> > Also this would bring a NULL check when actually access mbuf->next.
> >
> > BTW, could you please publish the number of:
> >
> >  1. mrg_rxbuf=on, comparison between original and original + this patch
> >
> >  2. mrg_rxbuf=off, comparison between original and original + this patch
> >
> > So we can have a whole picture of how this patch impact on ARM platform.
> >
> I think you already have got many results in my previous emails.
> Sorry I can't test right now and busy with other things.

We're still missing mrg on data.



More information about the dev mailing list