[dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue

Wang, Zhihong zhihong.wang at intel.com
Thu Oct 13 08:02:04 CEST 2016



> -----Original Message-----
> From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> Sent: Thursday, October 13, 2016 1:33 PM
> To: Wang, Zhihong <zhihong.wang at intel.com>
> Cc: Jianbo Liu <jianbo.liu at linaro.org>; Thomas Monjalon
> <thomas.monjalon at 6wind.com>; Maxime Coquelin
> <maxime.coquelin at redhat.com>; dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] vhost: optimize enqueue
> 
> On Wed, Oct 12, 2016 at 12:22:08PM +0000, Wang, Zhihong wrote:
> > > > >> >  3. What percentage drop are you seeing?
> > > > The test results:
> > > > size (bytes)    improvement (%)
> > > > 64                        3.92
> > > > 128                      11.51
> > > > 256                      24.16
> > > > 512                     -13.79
> > > > 1024                    -22.51
> > > > 1500                    -12.22
> > > > A correction: performance drops when the packet size is larger than
> > > > 512 bytes.
> > >
> > > I have thought this over twice. Unfortunately, I think I may need to
> > > NACK this series.
> > >
> > > Merging two code paths into one is really good: as you stated, it
> > > improves maintainability. But that holds only if we see no performance
> > > regression on either path after the refactor. Unfortunately, that's not
> > > the case here: it hurts the performance of one code path (non-mrg Rx).
> > >
> > > That makes me think we should not do the code path merge at all. I
> > > think that also aligns with what you said before (internally): we could
> > > do the merge only if it gives comparable performance before and after.
> > >
> > > Besides that, I don't quite like the approach you took in patch 2
> > > (rewrite enqueue): you made a lot of changes in one patch. That means if
> > > something goes wrong, it is hard to narrow down which change introduced
> > > the regression. Sadly, that's exactly what happened here. Weeks have
> > > passed, and I see no progress.
> > >
> > > That's the reason we like the idea of "one patch only does one thing, an
> > > atomic thing".
> >
> >
> > Yuanhan, folks,
> >
> > Thanks for the analysis. I disagree here though.
> >
> > I analyze, develop, and benchmark on x86 platforms, where this patch
> > works great.
> 
> Yes, that's a great effort! Thanks to your hard work, we know what the
> bottleneck is and how it can be improved.
> 
> However, you don't have to refactor the code (merge the two code paths
> into one) to apply those improvements. From what I can tell, there are two
> factors in this patchset that improve performance:
> 
> - copy the header together with the packet data
> 
> - shadow the used ring updates and flush them at once (see the sketch below)
> 
> The overall performance boost I got with your v6 patchset on the mrg-Rx
> code path is about 27% (in the PVP case). Applying just the 1st
> optimization yields about a 20% boost. The rest could be covered if we
> apply the 2nd optimization (I guess).
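
For context, the second item above (a "shadow" used ring) boils down to
recording used-ring entries in a private, cache-hot array and publishing them
to the shared ring with a single flush and one index update per burst, instead
of touching the shared ring once per packet. The sketch below only illustrates
that idea; it is not code from either patchset, and the field layout and the
helper names (update_shadow_used_ring, flush_shadow_used_ring) are made up for
this example.

#include <stdint.h>

/* Minimal stand-ins for the structures involved (illustrative only). */
struct vring_used_elem {
	uint32_t id;	/* head index of the used descriptor chain */
	uint32_t len;	/* number of bytes written to that chain */
};

struct vring_used {
	uint16_t flags;
	uint16_t idx;			/* index published to the guest */
	struct vring_used_elem ring[];
};

struct vhost_virtqueue {
	struct vring_used *used;		  /* shared with the guest */
	struct vring_used_elem *shadow_used_ring; /* private per-vq buffer */
	uint16_t shadow_used_idx;		  /* entries buffered so far */
	uint16_t last_used_idx;
	uint16_t size;				  /* ring size, power of two */
};

/* Record one used entry locally; nothing is written to the shared ring yet. */
static inline void
update_shadow_used_ring(struct vhost_virtqueue *vq,
			uint16_t desc_idx, uint32_t len)
{
	uint16_t i = vq->shadow_used_idx++;

	vq->shadow_used_ring[i].id  = desc_idx;
	vq->shadow_used_ring[i].len = len;
}

/* Copy the buffered entries to the shared used ring in one pass and publish
 * them with a single used->idx update for the whole burst. */
static inline void
flush_shadow_used_ring(struct vhost_virtqueue *vq)
{
	uint16_t i;

	for (i = 0; i < vq->shadow_used_idx; i++) {
		uint16_t used_idx = (vq->last_used_idx + i) & (vq->size - 1);

		vq->used->ring[used_idx] = vq->shadow_used_ring[i];
	}

	__sync_synchronize();	/* entries must be visible before the index */
	vq->used->idx += vq->shadow_used_idx;
	vq->last_used_idx += vq->shadow_used_idx;
	vq->shadow_used_idx = 0;
}

In the real code the flush would also have to handle dirty-page logging for
live migration and the guest notification, which are omitted here.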
> 
> That would be a clean way to optimize the vhost mergeable Rx path:
> 
> - you don't touch the non-mrg Rx path (well, you could still apply the
>   shadow_used_ring trick to it as well)
> 
>   This would at least make sure we don't see the kind of performance
>   regression reported by the ARM guys.
> 
> - you don't refactor the code
> 
>   A rewrite from scratch could introduce other issues besides the
>   performance regression; we may just not know about them yet.
> 
> 
> Does that make sense to you? If you agree, I think we could still make it
> into this release: these would be fairly small changes after all. For
> example, below is a patch that applies the 1st optimization tip on top of
> dpdk-next-virtio.


Thanks for this great idea. I think it's a better way to do it.
I'll start to make the patch then.


> 
> 	--yliu
> 
> ---------------------------------------------------------------
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 8a151af..0ddb5af 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -379,7 +379,7 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  			    uint16_t end_idx, struct rte_mbuf *m,
>  			    struct buf_vector *buf_vec)
>  {
> -	struct virtio_net_hdr_mrg_rxbuf virtio_hdr = {{0, 0, 0, 0, 0, 0}, 0};
> +	struct virtio_net_hdr_mrg_rxbuf *virtio_hdr;
>  	uint32_t vec_idx = 0;
>  	uint16_t start_idx = vq->last_used_idx;
>  	uint16_t cur_idx = start_idx;
> @@ -388,6 +388,8 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	uint32_t desc_offset, desc_avail;
>  	uint32_t cpy_len;
>  	uint16_t desc_idx, used_idx;
> +	uint16_t num_buffers = end_idx - start_idx;
> +	int hdr_copied = 0;
> 
>  	if (unlikely(m == NULL))
>  		return 0;
> @@ -399,16 +401,11 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  	if (buf_vec[vec_idx].buf_len < dev->vhost_hlen || !desc_addr)
>  		return 0;
> 
> -	rte_prefetch0((void *)(uintptr_t)desc_addr);
> +	virtio_hdr = (struct virtio_net_hdr_mrg_rxbuf *)(uintptr_t)desc_addr;
> +	rte_prefetch0(virtio_hdr);
> 
> -	virtio_hdr.num_buffers = end_idx - start_idx;
>  	LOG_DEBUG(VHOST_DATA, "(%d) RX: num merge buffers %d\n",
> -		dev->vid, virtio_hdr.num_buffers);
> -
> -	virtio_enqueue_offload(m, &virtio_hdr.hdr);
> -	copy_virtio_net_hdr(dev, desc_addr, virtio_hdr);
> -	vhost_log_write(dev, buf_vec[vec_idx].buf_addr, dev->vhost_hlen);
> -	PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
> +		dev->vid, num_buffers);
> 
>  	desc_avail  = buf_vec[vec_idx].buf_len - dev->vhost_hlen;
>  	desc_offset = dev->vhost_hlen;
> @@ -450,6 +447,15 @@ copy_mbuf_to_desc_mergeable(struct virtio_net *dev, struct vhost_virtqueue *vq,
>  			mbuf_avail  = rte_pktmbuf_data_len(m);
>  		}
> 
> +		if (hdr_copied == 0) {
> +			virtio_hdr->num_buffers = num_buffers;
> +			virtio_enqueue_offload(m, &virtio_hdr->hdr);
> +			vhost_log_write(dev, buf_vec[vec_idx].buf_addr, dev->vhost_hlen);
> +			PRINT_PACKET(dev, (uintptr_t)desc_addr, dev->vhost_hlen, 0);
> +
> +			hdr_copied = 1;
> +		}
> +
>  		cpy_len = RTE_MIN(desc_avail, mbuf_avail);
>  		rte_memcpy((void *)((uintptr_t)(desc_addr + desc_offset)),
>  			rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
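
Note that the diff above covers only the first item: instead of building a
virtio_net_hdr_mrg_rxbuf on the stack and copying it into the descriptor up
front, the header is written in place through virtio_hdr on the first pass of
the copy loop, next to the first data copy. The shadow used ring change would
then presumably come as a separate small patch on top.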

