[v4] vhost: add support for large buffers

Message ID 20191015161727.32570-1-fbl@sysclose.org (mailing list archive)
State Superseded, archived
Delegated to: Maxime Coquelin
Headers
Series [v4] vhost: add support for large buffers |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-compilation success Compile Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/Intel-compilation success Compilation OK
ci/iol-mellanox-Performance success Performance Testing PASS

Commit Message

Flavio Leitner Oct. 15, 2019, 4:17 p.m. UTC
  The rte_vhost_dequeue_burst supports two ways of dequeuing data.
If the data fits into a buffer, then all data is copied and a
single linear buffer is returned. Otherwise it allocates
additional mbufs and chains them together to return a multiple
segments mbuf.

While that covers most use cases, it forces applications that
need to work with larger data sizes to support multiple segments
mbufs. The non-linear characteristic brings complexity and
performance implications to the application.

To resolve the issue, add support to attach external buffer
to a pktmbuf and let the host provide during registration if
attaching an external buffer to pktmbuf is supported and if
only linear buffer are supported.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
---
 doc/guides/prog_guide/vhost_lib.rst |  35 +++++++++
 lib/librte_vhost/rte_vhost.h        |   4 +
 lib/librte_vhost/socket.c           |  22 ++++++
 lib/librte_vhost/vhost.c            |  22 ++++++
 lib/librte_vhost/vhost.h            |   4 +
 lib/librte_vhost/virtio_net.c       | 109 ++++++++++++++++++++++++----
 6 files changed, 182 insertions(+), 14 deletions(-)


- Changelog:
  v4:
    - allow to use pktmbuf if there is exact space
    - removed log message if the buffer is too big
    - fixed the length to include align padding
    - free allocated buf if shinfo fails
  v3:
    - prevent the new features to be used with zero copy
    - fixed sizeof() usage
    - fixed log msg indentation
    - removed/replaced asserts
    - used the correct virt2iova function
    - fixed the patch's title
    - OvS PoC code:
      https://github.com/fleitner/ovs/tree/rte_malloc-v3
  v2:
    - Used rte_malloc() instead of another mempool as suggested by Shahaf.
    - Added the documentation section.
    - Using driver registration to negotiate the features.
    - OvS PoC code:
      https://github.com/fleitner/ovs/commit/8fc197c40b1d4fda331686a7b919e9e2b670dda7
  

Comments

Ilya Maximets Oct. 15, 2019, 5:41 p.m. UTC | #1
Hi.
Not a full review.  Few comments inline.

Best regards, Ilya Maximets.

On 15.10.2019 18:17, Flavio Leitner wrote:
> The rte_vhost_dequeue_burst supports two ways of dequeuing data.
> If the data fits into a buffer, then all data is copied and a
> single linear buffer is returned. Otherwise it allocates
> additional mbufs and chains them together to return a multiple
> segments mbuf.
> 
> While that covers most use cases, it forces applications that
> need to work with larger data sizes to support multiple segments
> mbufs. The non-linear characteristic brings complexity and
> performance implications to the application.
> 
> To resolve the issue, add support to attach external buffer
> to a pktmbuf and let the host provide during registration if
> attaching an external buffer to pktmbuf is supported and if
> only linear buffer are supported.
> 
> Signed-off-by: Flavio Leitner <fbl@sysclose.org>
> ---
>   doc/guides/prog_guide/vhost_lib.rst |  35 +++++++++
>   lib/librte_vhost/rte_vhost.h        |   4 +
>   lib/librte_vhost/socket.c           |  22 ++++++
>   lib/librte_vhost/vhost.c            |  22 ++++++
>   lib/librte_vhost/vhost.h            |   4 +
>   lib/librte_vhost/virtio_net.c       | 109 ++++++++++++++++++++++++----
>   6 files changed, 182 insertions(+), 14 deletions(-)
> 
> 
> - Changelog:
>    v4:
>      - allow to use pktmbuf if there is exact space
>      - removed log message if the buffer is too big
>      - fixed the length to include align padding
>      - free allocated buf if shinfo fails
>    v3:
>      - prevent the new features to be used with zero copy
>      - fixed sizeof() usage
>      - fixed log msg indentation
>      - removed/replaced asserts
>      - used the correct virt2iova function
>      - fixed the patch's title
>      - OvS PoC code:
>        https://github.com/fleitner/ovs/tree/rte_malloc-v3
>    v2:
>      - Used rte_malloc() instead of another mempool as suggested by Shahaf.
>      - Added the documentation section.
>      - Using driver registration to negotiate the features.
>      - OvS PoC code:
>        https://github.com/fleitner/ovs/commit/8fc197c40b1d4fda331686a7b919e9e2b670dda7
> 
> 
> 
> diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
> index fc3ee4353..07e40e3c5 100644
> --- a/doc/guides/prog_guide/vhost_lib.rst
> +++ b/doc/guides/prog_guide/vhost_lib.rst
> @@ -117,6 +117,41 @@ The following is an overview of some key Vhost API functions:
>       Enabling this flag should only be done when the calling application does
>       not pre-fault the guest shared memory, otherwise migration would fail.
>   
> +  - ``RTE_VHOST_USER_LINEARBUF_SUPPORT``
> +
> +    Enabling this flag forces vhost dequeue function to only provide linear
> +    pktmbuf (no multi-segmented pktmbuf).
> +
> +    The vhost library by default provides a single pktmbuf for given a
> +    packet, but if for some reason the data doesn't fit into a single
> +    pktmbuf (e.g., TSO is enabled), the library will allocate additional
> +    pktmbufs from the same mempool and chain them together to create a
> +    multi-segmented pktmbuf.
> +
> +    However, the vhost application needs to support multi-segmented format.
> +    If the vhost application does not support that format and requires large
> +    buffers to be dequeue, this flag should be enabled to force only linear
> +    buffers (see RTE_VHOST_USER_EXTBUF_SUPPORT) or drop the packet.
> +
> +    It is disabled by default.
> +
> +  - ``RTE_VHOST_USER_EXTBUF_SUPPORT``
> +
> +    Enabling this flag allows vhost dequeue function to allocate and attach
> +    an external buffer to a pktmbuf if the pkmbuf doesn't provide enough
> +    space to store all data.
> +
> +    This is useful when the vhost application wants to support large packets
> +    but doesn't want to increase the default mempool object size nor to
> +    support multi-segmented mbufs (non-linear). In this case, a fresh buffer
> +    is allocated using rte_malloc() which gets attached to a pktmbuf using
> +    rte_pktmbuf_attach_extbuf().
> +
> +    See RTE_VHOST_USER_LINEARBUF_SUPPORT as well to disable multi-segmented
> +    mbufs for application that doesn't support chained mbufs.
> +
> +    It is disabled by default.
> +
>   * ``rte_vhost_driver_set_features(path, features)``
>   
>     This function sets the feature bits the vhost-user driver supports. The
> diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
> index 19474bca0..b821b5df4 100644
> --- a/lib/librte_vhost/rte_vhost.h
> +++ b/lib/librte_vhost/rte_vhost.h
> @@ -30,6 +30,10 @@ extern "C" {
>   #define RTE_VHOST_USER_DEQUEUE_ZERO_COPY	(1ULL << 2)
>   #define RTE_VHOST_USER_IOMMU_SUPPORT	(1ULL << 3)
>   #define RTE_VHOST_USER_POSTCOPY_SUPPORT		(1ULL << 4)
> +/* support mbuf with external buffer attached */
> +#define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
> +/* support only linear buffers (no chained mbufs) */
> +#define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
>   
>   /** Protocol features. */
>   #ifndef VHOST_USER_PROTOCOL_F_MQ
> diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
> index 274988c4d..e546be2a8 100644
> --- a/lib/librte_vhost/socket.c
> +++ b/lib/librte_vhost/socket.c
> @@ -40,6 +40,8 @@ struct vhost_user_socket {
>   	bool dequeue_zero_copy;
>   	bool iommu_support;
>   	bool use_builtin_virtio_net;
> +	bool extbuf;
> +	bool linearbuf;
>   
>   	/*
>   	 * The "supported_features" indicates the feature bits the
> @@ -232,6 +234,12 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
>   	if (vsocket->dequeue_zero_copy)
>   		vhost_enable_dequeue_zero_copy(vid);
>   
> +	if (vsocket->extbuf)
> +		vhost_enable_extbuf(vid);
> +
> +	if (vsocket->linearbuf)
> +		vhost_enable_linearbuf(vid);
> +
>   	RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);
>   
>   	if (vsocket->notify_ops->new_connection) {
> @@ -870,6 +878,8 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
>   		goto out_free;
>   	}
>   	vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> +	vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
> +	vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
>   
>   	/*
>   	 * Set the supported features correctly for the builtin vhost-user
> @@ -894,6 +904,18 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
>   	 * not compatible with postcopy.
>   	 */
>   	if (vsocket->dequeue_zero_copy) {
> +		if (vsocket->extbuf) {
> +			RTE_LOG(ERR, VHOST_CONFIG,
> +			"error: zero copy is incompatible with external buffers\n");
> +			ret = -1;
> +			goto out_free;

There should be 'out_mutex'.

> +		}
> +		if (vsocket->linearbuf) {
> +			RTE_LOG(ERR, VHOST_CONFIG,
> +			"error: zero copy is incompatible with linear buffers\n");
> +			ret = -1;
> +			goto out_free;

Ditto.

> +		}
>   		vsocket->supported_features &= ~(1ULL << VIRTIO_F_IN_ORDER);
>   		vsocket->features &= ~(1ULL << VIRTIO_F_IN_ORDER);
>   
> diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
> index cea44df8c..77457f538 100644
> --- a/lib/librte_vhost/vhost.c
> +++ b/lib/librte_vhost/vhost.c
> @@ -605,6 +605,28 @@ vhost_set_builtin_virtio_net(int vid, bool enable)
>   		dev->flags &= ~VIRTIO_DEV_BUILTIN_VIRTIO_NET;
>   }
>   
> +void
> +vhost_enable_extbuf(int vid)
> +{
> +	struct virtio_net *dev = get_device(vid);
> +
> +	if (dev == NULL)
> +		return;
> +
> +	dev->extbuf = 1;
> +}
> +
> +void
> +vhost_enable_linearbuf(int vid)
> +{
> +	struct virtio_net *dev = get_device(vid);
> +
> +	if (dev == NULL)
> +		return;
> +
> +	dev->linearbuf = 1;
> +}
> +
>   int
>   rte_vhost_get_mtu(int vid, uint16_t *mtu)
>   {
> diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
> index 5131a97a3..0346bd118 100644
> --- a/lib/librte_vhost/vhost.h
> +++ b/lib/librte_vhost/vhost.h
> @@ -302,6 +302,8 @@ struct virtio_net {
>   	rte_atomic16_t		broadcast_rarp;
>   	uint32_t		nr_vring;
>   	int			dequeue_zero_copy;
> +	int			extbuf;
> +	int			linearbuf;
>   	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
>   #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
>   	char			ifname[IF_NAME_SZ];
> @@ -476,6 +478,8 @@ void vhost_attach_vdpa_device(int vid, int did);
>   void vhost_set_ifname(int, const char *if_name, unsigned int if_len);
>   void vhost_enable_dequeue_zero_copy(int vid);
>   void vhost_set_builtin_virtio_net(int vid, bool enable);
> +void vhost_enable_extbuf(int vid);
> +void vhost_enable_linearbuf(int vid);
>   
>   struct vhost_device_ops const *vhost_driver_callback_get(const char *path);
>   
> diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
> index 5b85b832d..da69ab1db 100644
> --- a/lib/librte_vhost/virtio_net.c
> +++ b/lib/librte_vhost/virtio_net.c
> @@ -1289,6 +1289,93 @@ get_zmbuf(struct vhost_virtqueue *vq)
>   	return NULL;
>   }
>   
> +static void
> +virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
> +{
> +	rte_free(opaque);
> +}
> +
> +static int
> +virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
> +{
> +	struct rte_mbuf_ext_shared_info *shinfo = NULL;
> +	uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
> +	uint16_t buf_len;
> +	rte_iova_t iova;
> +	void *buf;
> +
> +	/* Try to use pkt buffer to store shinfo to reduce the amount of memory
> +	 * required, otherwise store shinfo in the new buffer.
> +	 */
> +	if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
> +		shinfo = rte_pktmbuf_mtod(pkt,
> +					  struct rte_mbuf_ext_shared_info *);
> +	else {
> +		total_len += sizeof(*shinfo) + sizeof(uintptr_t);
> +		total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> +	}
> +
> +	if (unlikely(total_len > UINT16_MAX))
> +		return -ENOSPC;
> +
> +	buf_len = total_len;
> +	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> +	if (unlikely(buf == NULL))
> +		return -ENOMEM;
> +
> +	/* Initialize shinfo */
> +	if (shinfo) {
> +		shinfo->free_cb = virtio_dev_extbuf_free;
> +		shinfo->fcb_opaque = buf;
> +		rte_mbuf_ext_refcnt_set(shinfo, 1);
> +	} else {
> +		shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> +					      virtio_dev_extbuf_free, buf);
> +		if (unlikely(shinfo == NULL)) {
> +			rte_free(buf);
> +			RTE_LOG(ERR, VHOST_DATA, "Failed to init shinfo\n");
> +			return -1;
> +		}
> +	}
> +
> +	iova = rte_malloc_virt2iova(buf);
> +	rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
> +	rte_pktmbuf_reset_headroom(pkt);
> +
> +	return 0;
> +}
> +
> +/*
> + * Allocate a host supported pktmbuf.
> + */
> +static __rte_always_inline struct rte_mbuf *
> +virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
> +			 uint32_t data_len)
> +{
> +	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> +
> +	if (unlikely(pkt == NULL))
> +		return NULL;
> +
> +	if (rte_pktmbuf_tailroom(pkt) >= data_len)
> +		return pkt;
> +
> +	/* attach an external buffer if supported */
> +	if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
> +		return pkt;
> +
> +	/* check if chained buffers are allowed */
> +	if (!dev->linearbuf)
> +		return pkt;

I guess, checking of the 'linearbuf' should go before checking the 'extbuf'.
The usecase is that allocation of several buffers from the memory pool is,
probably, faster than rte_malloc() + memory attaching.  So, if the 'linearbuf'
is not requested, it might be faster to use chained mbufs.

BTW, I'm not sure if we really need 2 separate options for this.
i.e. 1. +linear +extbuf --> extbuf allocated
      2. +linear -extbuf --> buffer dropped (is this case is useful at all?)
      3. -linear +extbuf --> chained mbufs might be preferred (see above)
      4. -linear -extbuf --> chained mbufs

Case 4 is a default case. Case 1 is our target case for supporting large buffers.
Case 3 might makes no much sense as the result is equal to case 4.
Case 2 might be not interesting for as at all, because it will lead to random
packet drops depending on their size.

But, if only cases 1 and 4 are valid and interesting to us, we could easily merge
both flags.

Thoughts?
  
Flavio Leitner Oct. 15, 2019, 6:44 p.m. UTC | #2
On Tue, 15 Oct 2019 19:41:52 +0200
Ilya Maximets <i.maximets@ovn.org> wrote:

> Hi.
> Not a full review.  Few comments inline.
> 
> Best regards, Ilya Maximets.

Thanks for reviewing Ilya, see below.

[...]
> > @@ -870,6 +878,8 @@ rte_vhost_driver_register(const char *path,
> > uint64_t flags) goto out_free;
> >   	}
> >   	vsocket->dequeue_zero_copy = flags &
> > RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> > +	vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
> > +	vsocket->linearbuf = flags &
> > RTE_VHOST_USER_LINEARBUF_SUPPORT; 
> >   	/*
> >   	 * Set the supported features correctly for the builtin
> > vhost-user @@ -894,6 +904,18 @@ rte_vhost_driver_register(const
> > char *path, uint64_t flags)
> >   	 * not compatible with postcopy.
> >   	 */
> >   	if (vsocket->dequeue_zero_copy) {
> > +		if (vsocket->extbuf) {
> > +			RTE_LOG(ERR, VHOST_CONFIG,
> > +			"error: zero copy is incompatible with
> > external buffers\n");
> > +			ret = -1;
> > +			goto out_free;  
> 
> There should be 'out_mutex'.

Good catch, going to fix that.

> 
> > +		}
> > +		if (vsocket->linearbuf) {
> > +			RTE_LOG(ERR, VHOST_CONFIG,
> > +			"error: zero copy is incompatible with
> > linear buffers\n");
> > +			ret = -1;
> > +			goto out_free;  
> 
> Ditto.

Yeah

[...]
> > +/*
> > + * Allocate a host supported pktmbuf.
> > + */
> > +static __rte_always_inline struct rte_mbuf *
> > +virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct
> > rte_mempool *mp,
> > +			 uint32_t data_len)
> > +{
> > +	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> > +
> > +	if (unlikely(pkt == NULL))
> > +		return NULL;
> > +
> > +	if (rte_pktmbuf_tailroom(pkt) >= data_len)
> > +		return pkt;
> > +
> > +	/* attach an external buffer if supported */
> > +	if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
> > +		return pkt;
> > +
> > +	/* check if chained buffers are allowed */
> > +	if (!dev->linearbuf)
> > +		return pkt;  
> 
> I guess, checking of the 'linearbuf' should go before checking the
> 'extbuf'. The usecase is that allocation of several buffers from the
> memory pool is, probably, faster than rte_malloc() + memory
> attaching.  So, if the 'linearbuf' is not requested, it might be
> faster to use chained mbufs.

The speed seems to depend on the length of the packet and it is not
just the allocation, but the overall overhead to deal with multiple
segments. If the preferred way is chained buffers, then just don't
pass any of the new flags because that's the default behavior today.
 
> BTW, I'm not sure if we really need 2 separate options for this.
> i.e. 1. +linear +extbuf --> extbuf allocated
>       2. +linear -extbuf --> buffer dropped (is this case is useful
> at all?) 3. -linear +extbuf --> chained mbufs might be preferred (see
> above) 4. -linear -extbuf --> chained mbufs
> 
> Case 4 is a default case. 

Yes, in this case if the packet length fits into the pktmbuf, a single
linear mbuf is returned otherwise it uses chained mbufs.

>  Case 1 is our target case for supporting
> large buffers. 

Right, here it enables external buffers, but we don't want chained
mbufs.

> Case 3 might makes no much sense as the result is
> equal to case 4.

Well case #3 either uses a single mbuf from mpool or use an external
buffer for big packets while in case #4 there is no support for big
packets at all.

> Case 2 might be not interesting for as at all,

Case #2 says the app only supports what fits into a single mbuf and
no external buffer nor chained buffers is supported.

> because it will lead to random packet drops depending on their size.

Either that or wrong processing.

> But, if only cases 1 and 4 are valid and interesting to us, we could
> easily merge both flags.

For OvS use-case today yes, but I created different flags because they
seemed different useful features to me and changing flags later can get
the API very confusing.

Thanks!
fbl
  

Patch

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index fc3ee4353..07e40e3c5 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -117,6 +117,41 @@  The following is an overview of some key Vhost API functions:
     Enabling this flag should only be done when the calling application does
     not pre-fault the guest shared memory, otherwise migration would fail.
 
+  - ``RTE_VHOST_USER_LINEARBUF_SUPPORT``
+
+    Enabling this flag forces vhost dequeue function to only provide linear
+    pktmbuf (no multi-segmented pktmbuf).
+
+    The vhost library by default provides a single pktmbuf for given a
+    packet, but if for some reason the data doesn't fit into a single
+    pktmbuf (e.g., TSO is enabled), the library will allocate additional
+    pktmbufs from the same mempool and chain them together to create a
+    multi-segmented pktmbuf.
+
+    However, the vhost application needs to support multi-segmented format.
+    If the vhost application does not support that format and requires large
+    buffers to be dequeue, this flag should be enabled to force only linear
+    buffers (see RTE_VHOST_USER_EXTBUF_SUPPORT) or drop the packet.
+
+    It is disabled by default.
+
+  - ``RTE_VHOST_USER_EXTBUF_SUPPORT``
+
+    Enabling this flag allows vhost dequeue function to allocate and attach
+    an external buffer to a pktmbuf if the pkmbuf doesn't provide enough
+    space to store all data.
+
+    This is useful when the vhost application wants to support large packets
+    but doesn't want to increase the default mempool object size nor to
+    support multi-segmented mbufs (non-linear). In this case, a fresh buffer
+    is allocated using rte_malloc() which gets attached to a pktmbuf using
+    rte_pktmbuf_attach_extbuf().
+
+    See RTE_VHOST_USER_LINEARBUF_SUPPORT as well to disable multi-segmented
+    mbufs for application that doesn't support chained mbufs.
+
+    It is disabled by default.
+
 * ``rte_vhost_driver_set_features(path, features)``
 
   This function sets the feature bits the vhost-user driver supports. The
diff --git a/lib/librte_vhost/rte_vhost.h b/lib/librte_vhost/rte_vhost.h
index 19474bca0..b821b5df4 100644
--- a/lib/librte_vhost/rte_vhost.h
+++ b/lib/librte_vhost/rte_vhost.h
@@ -30,6 +30,10 @@  extern "C" {
 #define RTE_VHOST_USER_DEQUEUE_ZERO_COPY	(1ULL << 2)
 #define RTE_VHOST_USER_IOMMU_SUPPORT	(1ULL << 3)
 #define RTE_VHOST_USER_POSTCOPY_SUPPORT		(1ULL << 4)
+/* support mbuf with external buffer attached */
+#define RTE_VHOST_USER_EXTBUF_SUPPORT	(1ULL << 5)
+/* support only linear buffers (no chained mbufs) */
+#define RTE_VHOST_USER_LINEARBUF_SUPPORT	(1ULL << 6)
 
 /** Protocol features. */
 #ifndef VHOST_USER_PROTOCOL_F_MQ
diff --git a/lib/librte_vhost/socket.c b/lib/librte_vhost/socket.c
index 274988c4d..e546be2a8 100644
--- a/lib/librte_vhost/socket.c
+++ b/lib/librte_vhost/socket.c
@@ -40,6 +40,8 @@  struct vhost_user_socket {
 	bool dequeue_zero_copy;
 	bool iommu_support;
 	bool use_builtin_virtio_net;
+	bool extbuf;
+	bool linearbuf;
 
 	/*
 	 * The "supported_features" indicates the feature bits the
@@ -232,6 +234,12 @@  vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
 	if (vsocket->dequeue_zero_copy)
 		vhost_enable_dequeue_zero_copy(vid);
 
+	if (vsocket->extbuf)
+		vhost_enable_extbuf(vid);
+
+	if (vsocket->linearbuf)
+		vhost_enable_linearbuf(vid);
+
 	RTE_LOG(INFO, VHOST_CONFIG, "new device, handle is %d\n", vid);
 
 	if (vsocket->notify_ops->new_connection) {
@@ -870,6 +878,8 @@  rte_vhost_driver_register(const char *path, uint64_t flags)
 		goto out_free;
 	}
 	vsocket->dequeue_zero_copy = flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
+	vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
+	vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
 
 	/*
 	 * Set the supported features correctly for the builtin vhost-user
@@ -894,6 +904,18 @@  rte_vhost_driver_register(const char *path, uint64_t flags)
 	 * not compatible with postcopy.
 	 */
 	if (vsocket->dequeue_zero_copy) {
+		if (vsocket->extbuf) {
+			RTE_LOG(ERR, VHOST_CONFIG,
+			"error: zero copy is incompatible with external buffers\n");
+			ret = -1;
+			goto out_free;
+		}
+		if (vsocket->linearbuf) {
+			RTE_LOG(ERR, VHOST_CONFIG,
+			"error: zero copy is incompatible with linear buffers\n");
+			ret = -1;
+			goto out_free;
+		}
 		vsocket->supported_features &= ~(1ULL << VIRTIO_F_IN_ORDER);
 		vsocket->features &= ~(1ULL << VIRTIO_F_IN_ORDER);
 
diff --git a/lib/librte_vhost/vhost.c b/lib/librte_vhost/vhost.c
index cea44df8c..77457f538 100644
--- a/lib/librte_vhost/vhost.c
+++ b/lib/librte_vhost/vhost.c
@@ -605,6 +605,28 @@  vhost_set_builtin_virtio_net(int vid, bool enable)
 		dev->flags &= ~VIRTIO_DEV_BUILTIN_VIRTIO_NET;
 }
 
+void
+vhost_enable_extbuf(int vid)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return;
+
+	dev->extbuf = 1;
+}
+
+void
+vhost_enable_linearbuf(int vid)
+{
+	struct virtio_net *dev = get_device(vid);
+
+	if (dev == NULL)
+		return;
+
+	dev->linearbuf = 1;
+}
+
 int
 rte_vhost_get_mtu(int vid, uint16_t *mtu)
 {
diff --git a/lib/librte_vhost/vhost.h b/lib/librte_vhost/vhost.h
index 5131a97a3..0346bd118 100644
--- a/lib/librte_vhost/vhost.h
+++ b/lib/librte_vhost/vhost.h
@@ -302,6 +302,8 @@  struct virtio_net {
 	rte_atomic16_t		broadcast_rarp;
 	uint32_t		nr_vring;
 	int			dequeue_zero_copy;
+	int			extbuf;
+	int			linearbuf;
 	struct vhost_virtqueue	*virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
 #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
 	char			ifname[IF_NAME_SZ];
@@ -476,6 +478,8 @@  void vhost_attach_vdpa_device(int vid, int did);
 void vhost_set_ifname(int, const char *if_name, unsigned int if_len);
 void vhost_enable_dequeue_zero_copy(int vid);
 void vhost_set_builtin_virtio_net(int vid, bool enable);
+void vhost_enable_extbuf(int vid);
+void vhost_enable_linearbuf(int vid);
 
 struct vhost_device_ops const *vhost_driver_callback_get(const char *path);
 
diff --git a/lib/librte_vhost/virtio_net.c b/lib/librte_vhost/virtio_net.c
index 5b85b832d..da69ab1db 100644
--- a/lib/librte_vhost/virtio_net.c
+++ b/lib/librte_vhost/virtio_net.c
@@ -1289,6 +1289,93 @@  get_zmbuf(struct vhost_virtqueue *vq)
 	return NULL;
 }
 
+static void
+virtio_dev_extbuf_free(void *addr __rte_unused, void *opaque)
+{
+	rte_free(opaque);
+}
+
+static int
+virtio_dev_extbuf_alloc(struct rte_mbuf *pkt, uint32_t size)
+{
+	struct rte_mbuf_ext_shared_info *shinfo = NULL;
+	uint32_t total_len = RTE_PKTMBUF_HEADROOM + size;
+	uint16_t buf_len;
+	rte_iova_t iova;
+	void *buf;
+
+	/* Try to use pkt buffer to store shinfo to reduce the amount of memory
+	 * required, otherwise store shinfo in the new buffer.
+	 */
+	if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo))
+		shinfo = rte_pktmbuf_mtod(pkt,
+					  struct rte_mbuf_ext_shared_info *);
+	else {
+		total_len += sizeof(*shinfo) + sizeof(uintptr_t);
+		total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
+	}
+
+	if (unlikely(total_len > UINT16_MAX))
+		return -ENOSPC;
+
+	buf_len = total_len;
+	buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
+	if (unlikely(buf == NULL))
+		return -ENOMEM;
+
+	/* Initialize shinfo */
+	if (shinfo) {
+		shinfo->free_cb = virtio_dev_extbuf_free;
+		shinfo->fcb_opaque = buf;
+		rte_mbuf_ext_refcnt_set(shinfo, 1);
+	} else {
+		shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
+					      virtio_dev_extbuf_free, buf);
+		if (unlikely(shinfo == NULL)) {
+			rte_free(buf);
+			RTE_LOG(ERR, VHOST_DATA, "Failed to init shinfo\n");
+			return -1;
+		}
+	}
+
+	iova = rte_malloc_virt2iova(buf);
+	rte_pktmbuf_attach_extbuf(pkt, buf, iova, buf_len, shinfo);
+	rte_pktmbuf_reset_headroom(pkt);
+
+	return 0;
+}
+
+/*
+ * Allocate a host supported pktmbuf.
+ */
+static __rte_always_inline struct rte_mbuf *
+virtio_dev_pktmbuf_alloc(struct virtio_net *dev, struct rte_mempool *mp,
+			 uint32_t data_len)
+{
+	struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
+
+	if (unlikely(pkt == NULL))
+		return NULL;
+
+	if (rte_pktmbuf_tailroom(pkt) >= data_len)
+		return pkt;
+
+	/* attach an external buffer if supported */
+	if (dev->extbuf && !virtio_dev_extbuf_alloc(pkt, data_len))
+		return pkt;
+
+	/* check if chained buffers are allowed */
+	if (!dev->linearbuf)
+		return pkt;
+
+	/* Data doesn't fit into the buffer and the host supports
+	 * only linear buffers
+	 */
+	rte_pktmbuf_free(pkt);
+
+	return NULL;
+}
+
 static __rte_noinline uint16_t
 virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	struct rte_mempool *mbuf_pool, struct rte_mbuf **pkts, uint16_t count)
@@ -1343,26 +1430,23 @@  virtio_dev_tx_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	for (i = 0; i < count; i++) {
 		struct buf_vector buf_vec[BUF_VECTOR_MAX];
 		uint16_t head_idx;
-		uint32_t dummy_len;
+		uint32_t buf_len;
 		uint16_t nr_vec = 0;
 		int err;
 
 		if (unlikely(fill_vec_buf_split(dev, vq,
 						vq->last_avail_idx + i,
 						&nr_vec, buf_vec,
-						&head_idx, &dummy_len,
+						&head_idx, &buf_len,
 						VHOST_ACCESS_RO) < 0))
 			break;
 
 		if (likely(dev->dequeue_zero_copy == 0))
 			update_shadow_used_ring_split(vq, head_idx, 0);
 
-		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
-		if (unlikely(pkts[i] == NULL)) {
-			RTE_LOG(ERR, VHOST_DATA,
-				"Failed to allocate memory for mbuf.\n");
+		pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, buf_len);
+		if (unlikely(pkts[i] == NULL))
 			break;
-		}
 
 		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i],
 				mbuf_pool);
@@ -1451,14 +1535,14 @@  virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 	for (i = 0; i < count; i++) {
 		struct buf_vector buf_vec[BUF_VECTOR_MAX];
 		uint16_t buf_id;
-		uint32_t dummy_len;
+		uint32_t buf_len;
 		uint16_t desc_count, nr_vec = 0;
 		int err;
 
 		if (unlikely(fill_vec_buf_packed(dev, vq,
 						vq->last_avail_idx, &desc_count,
 						buf_vec, &nr_vec,
-						&buf_id, &dummy_len,
+						&buf_id, &buf_len,
 						VHOST_ACCESS_RO) < 0))
 			break;
 
@@ -1466,12 +1550,9 @@  virtio_dev_tx_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
 			update_shadow_used_ring_packed(vq, buf_id, 0,
 					desc_count);
 
-		pkts[i] = rte_pktmbuf_alloc(mbuf_pool);
-		if (unlikely(pkts[i] == NULL)) {
-			RTE_LOG(ERR, VHOST_DATA,
-				"Failed to allocate memory for mbuf.\n");
+		pkts[i] = virtio_dev_pktmbuf_alloc(dev, mbuf_pool, buf_len);
+		if (unlikely(pkts[i] == NULL))
 			break;
-		}
 
 		err = copy_desc_to_mbuf(dev, vq, buf_vec, nr_vec, pkts[i],
 				mbuf_pool);