[dpdk-dev,v4,1/2] mbuf: support attaching external buffer to mbuf

Message ID 20180424013854.33749-1-yskoh@mellanox.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel-compilation fail apply patch file failure

Commit Message

Yongseok Koh April 24, 2018, 1:38 a.m. UTC
  This patch introduces a new way of attaching an external buffer to a mbuf.

Attaching an external buffer is quite similar to mbuf indirection in
replacing buffer addresses and length of a mbuf, but a few differences:
  - When an indirect mbuf is attached, refcnt of the direct mbuf would be
    2 as long as the direct mbuf itself isn't freed after the attachment.
    In such cases, the buffer area of a direct mbuf must be read-only. But
    external buffer has its own refcnt and it starts from 1. Unless
    multiple mbufs are attached to a mbuf having an external buffer, the
    external buffer is writable.
  - There's no need to allocate buffer from a mempool. Any buffer can be
    attached with appropriate free callback.
  - Smaller metadata is required to maintain shared data such as refcnt.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
---

** This patch can pass the mbuf_autotest. **

Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
will be submitted separately rebased on a differnet patchset which
accommodates new memory hotplug design to mlx PMDs.

v4:
* rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
  user can pass memory for shared data via shinfo argument.
* minor changes from review.

v3:
* implement external buffer attachment instead of introducing buf_off for
 mbuf indirection.

 lib/librte_mbuf/rte_mbuf.h | 289 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 260 insertions(+), 29 deletions(-)
  

Comments

Stephen Hemminger April 24, 2018, 5:01 a.m. UTC | #1
On Mon, 23 Apr 2018 18:38:53 -0700
Yongseok Koh <yskoh@mellanox.com> wrote:

> This patch introduces a new way of attaching an external buffer to a mbuf.
> 
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but a few differences:
>   - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>     2 as long as the direct mbuf itself isn't freed after the attachment.
>     In such cases, the buffer area of a direct mbuf must be read-only. But
>     external buffer has its own refcnt and it starts from 1. Unless
>     multiple mbufs are attached to a mbuf having an external buffer, the
>     external buffer is writable.
>   - There's no need to allocate buffer from a mempool. Any buffer can be
>     attached with appropriate free callback.
>   - Smaller metadata is required to maintain shared data such as refcnt.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>

I think this is a good idea. It looks more useful than indirect mbuf's for
the use case where received data needs to come from a non mempool area.

Does it have any performance impact? I would hope it doesn't impact
applications not using external buffers.

Is it possible to start with a refcnt > 1 for the mbuf?  I am thinking
of the case in netvsc where data is received into an area returned
from the host. The area is an RNDIS buffer and may contain several
packets.  A useful optimization would be for the driver return
mbufs which point to that buffer where starting refcnt value
is the number of packets in the buffer.  When refcnt goes to
0 the buffer would be returned to the host.

One other problem with this is that it adds an additional buffer
management constraint on the application. If for example the
mbuf's are going into a TCP stack and TCP can have very slow
readers; then the receive buffer might have a long lifetime.
Since the receive buffers are limited, eventually the receive
area runs out and no more packets are received. Much fingerpointing
and angry users ensue..
  
Yongseok Koh April 24, 2018, 11:47 a.m. UTC | #2
On Mon, Apr 23, 2018 at 10:01:07PM -0700, Stephen Hemminger wrote:
> On Mon, 23 Apr 2018 18:38:53 -0700
> Yongseok Koh <yskoh@mellanox.com> wrote:
> 
> > This patch introduces a new way of attaching an external buffer to a mbuf.
> > 
> > Attaching an external buffer is quite similar to mbuf indirection in
> > replacing buffer addresses and length of a mbuf, but a few differences:
> >   - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> >     2 as long as the direct mbuf itself isn't freed after the attachment.
> >     In such cases, the buffer area of a direct mbuf must be read-only. But
> >     external buffer has its own refcnt and it starts from 1. Unless
> >     multiple mbufs are attached to a mbuf having an external buffer, the
> >     external buffer is writable.
> >   - There's no need to allocate buffer from a mempool. Any buffer can be
> >     attached with appropriate free callback.
> >   - Smaller metadata is required to maintain shared data such as refcnt.
> > 
> > Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> 
> I think this is a good idea. It looks more useful than indirect mbuf's for
> the use case where received data needs to come from a non mempool area.

Actually Olivier's idea and I just implemented it for my need. :-)

> Does it have any performance impact? I would hope it doesn't impact
> applications not using external buffers.

It would have little. The only change which can impact regular cases is in
rte_pktmbuf_prefree_seg(). This critical path inlines rte_pktmbuf_detach() and
it becomes a little longer - a few more instructions to update refcnt and branch
to user-provided callback. In io fwd of testpmd with single core, I'm not seeing
any noticeable drop.

> Is it possible to start with a refcnt > 1 for the mbuf?  I am thinking
> of the case in netvsc where data is received into an area returned
> from the host. The area is an RNDIS buffer and may contain several
> packets.  A useful optimization would be for the driver return
> mbufs which point to that buffer where starting refcnt value
> is the number of packets in the buffer.  When refcnt goes to
> 0 the buffer would be returned to the host.

That's actually my use-case for mlx5 PMD. mlx5 device supports "Multi-Packet Rx
Queue".  The device can pack multiple packets into a single Rx buffer to reduce
PCIe overhead of control transactions. And it is also quite common for FPGA
based NICs. What I've done is allocating a big buffer (from a PMD-private
mempool) and reserve a space in the head to store metadata to manage another
refcnt, which gets decremented by registered callback func. And the callback
func will free the whole chunk if the refcnt gets to zero.

+--+----+--------------+---+----+--------------+---+---+- - -
|  |head|mbuf1 data    |sh |head|mbuf2 data    |sh |   |
|  |room|              |inf|room|              |inf|   |
+--+----+--------------+---+----+--------------+---+---+- - -
 ^
 |
 Metadata for the whole chunk, having another refcnt managed by PMD.
 fcb_opaque will have this pointer so that the callback func knows it.

> One other problem with this is that it adds an additional buffer
> management constraint on the application. If for example the
> mbuf's are going into a TCP stack and TCP can have very slow
> readers; then the receive buffer might have a long lifetime.
> Since the receive buffers are limited, eventually the receive
> area runs out and no more packets are received. Much fingerpointing
> and angry users ensue..

In such a case (buffer depletion), I memcpy the Rx packet to mbuf instead of
attaching it to the mbuf until buffers get available. Seems unavoidable
penalty but better than dropping packets.


Thanks,
Yongseok
  
Andrew Rybchenko April 24, 2018, 12:28 p.m. UTC | #3
On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> This patch introduces a new way of attaching an external buffer to a mbuf.
>
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but a few differences:
>    - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>      2 as long as the direct mbuf itself isn't freed after the attachment.
>      In such cases, the buffer area of a direct mbuf must be read-only. But
>      external buffer has its own refcnt and it starts from 1. Unless
>      multiple mbufs are attached to a mbuf having an external buffer, the
>      external buffer is writable.
>    - There's no need to allocate buffer from a mempool. Any buffer can be
>      attached with appropriate free callback.
>    - Smaller metadata is required to maintain shared data such as refcnt.

Really useful. Many thanks. See my notes below.

It worries me that detach is more expensive than it really required 
since it
requires to restore mbuf as direct. If mbuf mempool is used for mbufs
as headers for external buffers only all these actions are absolutely 
useless.

> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> ---
>
> ** This patch can pass the mbuf_autotest. **
>
> Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
> will be submitted separately rebased on a differnet patchset which
> accommodates new memory hotplug design to mlx PMDs.
>
> v4:
> * rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
>    user can pass memory for shared data via shinfo argument.
> * minor changes from review.
>
> v3:
> * implement external buffer attachment instead of introducing buf_off for
>   mbuf indirection.
>
>   lib/librte_mbuf/rte_mbuf.h | 289 ++++++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 260 insertions(+), 29 deletions(-)
>
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 06eceba37..7f6507a66 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -326,7 +326,7 @@ extern "C" {
>   		PKT_TX_MACSEC |		 \
>   		PKT_TX_SEC_OFFLOAD)
>   
> -#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
> +#define EXT_ATTACHED_MBUF    (1ULL << 61) /**< Mbuf having external buffer */

May be it should mention that shinfo is filled in.

>   
>   #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
>   
> @@ -566,8 +566,24 @@ struct rte_mbuf {
>   	/** Sequence number. See also rte_reorder_insert(). */
>   	uint32_t seqn;
>   
> +	struct rte_mbuf_ext_shared_info *shinfo;

I think it would be useful to add comment that it is used in the case of 
RTE_MBUF_HAS_EXTBUF() only.

> +
>   } __rte_cache_aligned;
>   
> +/**
> + * Function typedef of callback to free externally attached buffer.
> + */
> +typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
> +
> +/**
> + * Shared data at the end of an external buffer.
> + */
> +struct rte_mbuf_ext_shared_info {
> +	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
> +	void *fcb_opaque;                        /**< Free callback argument */
> +	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
> +};
> +
>   /**< Maximum number of nb_segs allowed. */
>   #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
>   
> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>   }
>   
>   /**
> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> + * otherwise.
> + *
> + * If a mbuf has its data in another mbuf and references it by mbuf
> + * indirection, this mbuf can be defined as a cloned mbuf.
> + */
> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +
> +/**
>    * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>    */
> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)

It is still confusing that INDIRECT != !DIRECT.
May be we have no good options right now, but I'd suggest to at least 
deprecate
RTE_MBUF_INDIRECT() and completely remove it in the next release.

> +
> +/**
> + * Returns TRUE if given mbuf has an external buffer, or FALSE otherwise.
> + *
> + * External buffer is a user-provided anonymous buffer.
> + */
> +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
>   
>   /**
>    * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> + *
> + * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
> + * can be defined as a direct mbuf.
>    */
> -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> +#define RTE_MBUF_DIRECT(mb) (!(RTE_MBUF_CLONED(mb) || RTE_MBUF_HAS_EXTBUF(mb)))

[...]

> @@ -1195,11 +1282,122 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>   }
>   
>   /**
> + * Attach an external buffer to a mbuf.
> + *
> + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> + * it, corresponding free callback function and its argument should be
> + * provided. This callback function will be called once all the mbufs are
> + * detached from the buffer.
> + *
> + * More mbufs can be attached to the same external buffer by
> + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> + * this API.
> + *
> + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> + * ``rte_pktmbuf_detach()``.
> + *
> + * Memory for shared data can be provided by shinfo argument. If shinfo is NULL,
> + * a few bytes in the trailer of the provided buffer will be dedicated for
> + * shared data (``struct rte_mbuf_ext_shared_info``) to store refcnt, callback
> + * function and so on. The pointer of shared data will be stored in m->shinfo.
> + *
> + * Attaching an external buffer is quite similar to mbuf indirection in
> + * replacing buffer addresses and length of a mbuf, but a few differences:
> + * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> + *   2 as long as the direct mbuf itself isn't freed after the attachment.
> + *   In such cases, the buffer area of a direct mbuf must be read-only. But
> + *   external buffer has its own refcnt and it starts from 1. Unless
> + *   multiple mbufs are attached to a mbuf having an external buffer, the
> + *   external buffer is writable.
> + * - There's no need to allocate buffer from a mempool. Any buffer can be
> + *   attached with appropriate free callback and its IO address.
> + * - Smaller metadata is required to maintain shared data such as refcnt.
> + *
> + * @warning
> + * @b EXPERIMENTAL: This API may change without prior notice.
> + * Once external buffer is enabled by allowing experimental API,
> + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> + * exclusive. A mbuf can be considered direct if it is neither indirect nor
> + * having external buffer.
> + *
> + * @param m
> + *   The pointer to the mbuf.
> + * @param buf_addr
> + *   The pointer to the external buffer we're attaching to.
> + * @param buf_iova
> + *   IO address of the external buffer we're attaching to.
> + * @param buf_len
> + *   The size of the external buffer we're attaching to. If memory for
> + *   shared data is not provided, buf_len must be larger than the size of
> + *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
> + *   enough, this function will return NULL.
> + * @param shinfo
> + *   User-provided memory for shared data. If NULL, a few bytes in the
> + *   trailer of the provided buffer will be dedicated for shared data.
> + * @param free_cb
> + *   Free callback function to call when the external buffer needs to be
> + *   freed.
> + * @param fcb_opaque
> + *   Argument for the free callback function.
> + *
> + * @return
> + *   A pointer to the new start of the data on success, return NULL
> + *   otherwise.
> + */
> +static inline char * __rte_experimental
> +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> +	rte_iova_t buf_iova, uint16_t buf_len,
> +	struct rte_mbuf_ext_shared_info *shinfo,
> +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> +{
> +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);

May I suggest to move it inside if (shinfo == NULL) to make it clear 
that it is not used if shinfo pointer is provided.

> +
> +	m->buf_addr = buf_addr;
> +	m->buf_iova = buf_iova;
> +
> +	if (shinfo == NULL) {
> +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> +					sizeof(*shinfo)), sizeof(uintptr_t));
> +		if ((void *)shinfo <= buf_addr)
> +			return NULL;
> +
> +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> +	} else {
> +		m->buf_len = buf_len;
> +	}
> +
> +	m->data_len = 0;
> +
> +	rte_pktmbuf_reset_headroom(m);

I would suggest to make data_off one more parameter.
If I have a buffer with data which I'd like to attach to an mbuf, I'd 
like to control data_off.

> +	m->ol_flags |= EXT_ATTACHED_MBUF;
> +	m->shinfo = shinfo;
> +
> +	rte_mbuf_ext_refcnt_set(shinfo, 1);

Why is assignment used here? Cannot we attach extbuf already attached to 
other mbuf?
May be shinfo should be initialized only if it is not provided (shinfo 
== NULL on input)?

> +	shinfo->free_cb = free_cb;
> +	shinfo->fcb_opaque = fcb_opaque;
> +
> +	return (char *)m->buf_addr + m->data_off;
> +}
> +
> +/**
> + * Detach the external buffer attached to a mbuf, same as
> + * ``rte_pktmbuf_detach()``
> + *
> + * @param m
> + *   The mbuf having external buffer.
> + */
> +#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
> +
> +/**
>    * Attach packet mbuf to another packet mbuf.
>    *
> - * After attachment we refer the mbuf we attached as 'indirect',
> - * while mbuf we attached to as 'direct'.
> - * The direct mbuf's reference counter is incremented.
> + * If the mbuf we are attaching to isn't a direct buffer and is attached to
> + * an external buffer, the mbuf being attached will be attached to the
> + * external buffer instead of mbuf indirection.
> + *
> + * Otherwise, the mbuf will be indirectly attached. After attachment we
> + * refer the mbuf we attached as 'indirect', while mbuf we attached to as
> + * 'direct'.  The direct mbuf's reference counter is incremented.
>    *
>    * Right now, not supported:
>    *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
> @@ -1213,19 +1411,18 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>    */
>   static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   {
> -	struct rte_mbuf *md;
> -
>   	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
>   	    rte_mbuf_refcnt_read(mi) == 1);
>   
> -	/* if m is not direct, get the mbuf that embeds the data */
> -	if (RTE_MBUF_DIRECT(m))
> -		md = m;
> -	else
> -		md = rte_mbuf_from_indirect(m);
> +	if (RTE_MBUF_HAS_EXTBUF(m)) {
> +		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
> +		mi->ol_flags = m->ol_flags;
> +	} else {
> +		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);

It looks like handling of the direct mbuf is lost here. May be it is 
intentional
to avoid branching since result will be the same for direct mbuf as well,
but looks confusing. Deserves at least a comment which explains why.
Ideally it should be proven by measurements.

> +		mi->priv_size = m->priv_size;
> +		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
> +	}
>   
> -	rte_mbuf_refcnt_update(md, 1);
> -	mi->priv_size = m->priv_size;
>   	mi->buf_iova = m->buf_iova;
>   	mi->buf_addr = m->buf_addr;
>   	mi->buf_len = m->buf_len;
> @@ -1241,7 +1438,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   	mi->next = NULL;
>   	mi->pkt_len = mi->data_len;
>   	mi->nb_segs = 1;
> -	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
>   	mi->packet_type = m->packet_type;
>   	mi->timestamp = m->timestamp;
>   
> @@ -1250,12 +1446,50 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   }
>   
>   /**
> - * Detach an indirect packet mbuf.
> + * @internal used by rte_pktmbuf_detach().
> + *
> + * Decrement the reference counter of the external buffer. When the
> + * reference counter becomes 0, the buffer is freed by pre-registered
> + * callback.
> + */
> +static inline void
> +__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
> +{
> +	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
> +	RTE_ASSERT(m->shinfo != NULL);
> +
> +	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
> +		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
> +}
> +
> +/**
> + * @internal used by rte_pktmbuf_detach().
> + *
> + * Decrement the direct mbuf's reference counter. When the reference
> + * counter becomes 0, the direct mbuf is freed.
> + */
> +static inline void
> +__rte_pktmbuf_free_direct(struct rte_mbuf *m)
> +{
> +	struct rte_mbuf *md = rte_mbuf_from_indirect(m);

Shouldn't it be done after below assertion? Just to be less confusing.

> +
> +	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
> +
> +	if (rte_mbuf_refcnt_update(md, -1) == 0) {

It is not directly related to the changeset, but rte_pktmbuf_prefree_seg()
has many optimizations which could be useful here:
  - do not update if refcnt is 1
  - do not set next/nb_seq if next is already NULL

> +		md->next = NULL;
> +		md->nb_segs = 1;
> +		rte_mbuf_refcnt_set(md, 1);
> +		rte_mbuf_raw_free(md);
> +	}
> +}

[...]
  
Olivier Matz April 24, 2018, 4:02 p.m. UTC | #4
Hi Andrew, Yongseok,

On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > This patch introduces a new way of attaching an external buffer to a mbuf.
> > 
> > Attaching an external buffer is quite similar to mbuf indirection in
> > replacing buffer addresses and length of a mbuf, but a few differences:
> >    - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> >      2 as long as the direct mbuf itself isn't freed after the attachment.
> >      In such cases, the buffer area of a direct mbuf must be read-only. But
> >      external buffer has its own refcnt and it starts from 1. Unless
> >      multiple mbufs are attached to a mbuf having an external buffer, the
> >      external buffer is writable.
> >    - There's no need to allocate buffer from a mempool. Any buffer can be
> >      attached with appropriate free callback.
> >    - Smaller metadata is required to maintain shared data such as refcnt.
> 
> Really useful. Many thanks. See my notes below.
> 
> It worries me that detach is more expensive than it really required since it
> requires to restore mbuf as direct. If mbuf mempool is used for mbufs
> as headers for external buffers only all these actions are absolutely
> useless.

I agree on the principle. And we have the same issue with indirect mbuf.
Currently, the assumption is that a free mbuf (inside a mempool) is
initialized as a direct mbuf. We can think about optimizations here,
but I'm not sure it should be in this patchset.

[...]

> > @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> >   }
> >   /**
> > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > + * otherwise.
> > + *
> > + * If a mbuf has its data in another mbuf and references it by mbuf
> > + * indirection, this mbuf can be defined as a cloned mbuf.
> > + */
> > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > +
> > +/**
> >    * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> >    */
> > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> 
> It is still confusing that INDIRECT != !DIRECT.
> May be we have no good options right now, but I'd suggest to at least
> deprecate
> RTE_MBUF_INDIRECT() and completely remove it in the next release.

Agree. I may have missed something, but is my previous suggestion
not doable?

- direct = embeds its own data      (and indirect = !direct)
- clone (or another name) = data is another mbuf
- extbuf = data is in an external buffer

Deprecating the macro is a good idea.

> > +	m->buf_addr = buf_addr;
> > +	m->buf_iova = buf_iova;
> > +
> > +	if (shinfo == NULL) {
> > +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> > +					sizeof(*shinfo)), sizeof(uintptr_t));
> > +		if ((void *)shinfo <= buf_addr)
> > +			return NULL;
> > +
> > +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> > +	} else {
> > +		m->buf_len = buf_len;
> > +	}
> > +
> > +	m->data_len = 0;
> > +
> > +	rte_pktmbuf_reset_headroom(m);
> 
> I would suggest to make data_off one more parameter.
> If I have a buffer with data which I'd like to attach to an mbuf, I'd like
> to control data_off.

Another option is to set the headroom to 0.
Because the after attaching the mbuf to an external buffer, we will
still require to set the length.

A user can do something like this:

	rte_pktmbuf_attach_extbuf(m, buf_va, buf_iova, buf_len, shinfo,
		free_cb, free_cb_arg);
	rte_pktmbuf_append(m, data_len + headroom);
	rte_pktmbuf_adj(m, headroom);

> 
> > +	m->ol_flags |= EXT_ATTACHED_MBUF;
> > +	m->shinfo = shinfo;
> > +
> > +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> 
> Why is assignment used here? Cannot we attach extbuf already attached to
> other mbuf?

In rte_pktmbuf_attach(), this is true. That's not illogical to
keep the same approach here. Maybe an assert could be added?

> May be shinfo should be initialized only if it is not provided (shinfo ==
> NULL on input)?

I don't get why, can you explain please?


Thanks,
Olivier
  
Andrew Rybchenko April 24, 2018, 6:21 p.m. UTC | #5
On 04/24/2018 07:02 PM, Olivier Matz wrote:
> Hi Andrew, Yongseok,
>
> On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
>> On 04/24/2018 04:38 AM, Yongseok Koh wrote:
>>> This patch introduces a new way of attaching an external buffer to a mbuf.
>>>
>>> Attaching an external buffer is quite similar to mbuf indirection in
>>> replacing buffer addresses and length of a mbuf, but a few differences:
>>>     - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>>>       2 as long as the direct mbuf itself isn't freed after the attachment.
>>>       In such cases, the buffer area of a direct mbuf must be read-only. But
>>>       external buffer has its own refcnt and it starts from 1. Unless
>>>       multiple mbufs are attached to a mbuf having an external buffer, the
>>>       external buffer is writable.
>>>     - There's no need to allocate buffer from a mempool. Any buffer can be
>>>       attached with appropriate free callback.
>>>     - Smaller metadata is required to maintain shared data such as refcnt.
>> Really useful. Many thanks. See my notes below.
>>
>> It worries me that detach is more expensive than it really required since it
>> requires to restore mbuf as direct. If mbuf mempool is used for mbufs
>> as headers for external buffers only all these actions are absolutely
>> useless.
> I agree on the principle. And we have the same issue with indirect mbuf.
> Currently, the assumption is that a free mbuf (inside a mempool) is
> initialized as a direct mbuf. We can think about optimizations here,
> but I'm not sure it should be in this patchset.

I agree that it should be addressed separately.

> [...]
>
>>> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>>>    }
>>>    /**
>>> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
>>> + * otherwise.
>>> + *
>>> + * If a mbuf has its data in another mbuf and references it by mbuf
>>> + * indirection, this mbuf can be defined as a cloned mbuf.
>>> + */
>>> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>> +
>>> +/**
>>>     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>>>     */
>>> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
>> It is still confusing that INDIRECT != !DIRECT.
>> May be we have no good options right now, but I'd suggest to at least
>> deprecate
>> RTE_MBUF_INDIRECT() and completely remove it in the next release.
> Agree. I may have missed something, but is my previous suggestion
> not doable?
>
> - direct = embeds its own data      (and indirect = !direct)
> - clone (or another name) = data is another mbuf
> - extbuf = data is in an external buffer

I guess the problem that it changes INDIRECT semantics since EXTBUF
is added as well. I think strictly speaking it is an API change.
Is it OK to make it without announcement?

> Deprecating the macro is a good idea.
>
>>> +	m->buf_addr = buf_addr;
>>> +	m->buf_iova = buf_iova;
>>> +
>>> +	if (shinfo == NULL) {
>>> +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
>>> +					sizeof(*shinfo)), sizeof(uintptr_t));
>>> +		if ((void *)shinfo <= buf_addr)
>>> +			return NULL;
>>> +
>>> +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
>>> +	} else {
>>> +		m->buf_len = buf_len;
>>> +	}
>>> +
>>> +	m->data_len = 0;
>>> +
>>> +	rte_pktmbuf_reset_headroom(m);
>> I would suggest to make data_off one more parameter.
>> If I have a buffer with data which I'd like to attach to an mbuf, I'd like
>> to control data_off.
> Another option is to set the headroom to 0.
> Because the after attaching the mbuf to an external buffer, we will
> still require to set the length.
>
> A user can do something like this:
>
> 	rte_pktmbuf_attach_extbuf(m, buf_va, buf_iova, buf_len, shinfo,
> 		free_cb, free_cb_arg);
> 	rte_pktmbuf_append(m, data_len + headroom);
> 	rte_pktmbuf_adj(m, headroom);
>
>>> +	m->ol_flags |= EXT_ATTACHED_MBUF;
>>> +	m->shinfo = shinfo;
>>> +
>>> +	rte_mbuf_ext_refcnt_set(shinfo, 1);
>> Why is assignment used here? Cannot we attach extbuf already attached to
>> other mbuf?
> In rte_pktmbuf_attach(), this is true. That's not illogical to
> keep the same approach here. Maybe an assert could be added?
>
>> May be shinfo should be initialized only if it is not provided (shinfo ==
>> NULL on input)?
> I don't get why, can you explain please?

May be I misunderstand how it should look like when one huge buffer
is partitioned. I thought that it should be only one shinfo per huge buffer
to control when it is not used any more by any mbufs with extbuf.

Other option is to have shinfo per small buf plus reference counter
per huge buf (which is decremented when small buf reference counter
becomes zero and free callback is executed). I guess it is assumed above.
My fear is that it is too much reference counters:
  1. mbuf reference counter
  2. small buf reference counter
  3. huge buf reference counter
May be it is possible use (1) for (2) as well?

Andrew.
  
Olivier Matz April 24, 2018, 7:15 p.m. UTC | #6
On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > Hi Andrew, Yongseok,
> > 
> > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > > > This patch introduces a new way of attaching an external buffer to a mbuf.
> > > > 
> > > > Attaching an external buffer is quite similar to mbuf indirection in
> > > > replacing buffer addresses and length of a mbuf, but a few differences:
> > > >     - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> > > >       2 as long as the direct mbuf itself isn't freed after the attachment.
> > > >       In such cases, the buffer area of a direct mbuf must be read-only. But
> > > >       external buffer has its own refcnt and it starts from 1. Unless
> > > >       multiple mbufs are attached to a mbuf having an external buffer, the
> > > >       external buffer is writable.
> > > >     - There's no need to allocate buffer from a mempool. Any buffer can be
> > > >       attached with appropriate free callback.
> > > >     - Smaller metadata is required to maintain shared data such as refcnt.
> > > Really useful. Many thanks. See my notes below.
> > > 
> > > It worries me that detach is more expensive than it really required since it
> > > requires to restore mbuf as direct. If mbuf mempool is used for mbufs
> > > as headers for external buffers only all these actions are absolutely
> > > useless.
> > I agree on the principle. And we have the same issue with indirect mbuf.
> > Currently, the assumption is that a free mbuf (inside a mempool) is
> > initialized as a direct mbuf. We can think about optimizations here,
> > but I'm not sure it should be in this patchset.
> 
> I agree that it should be addressed separately.
> 
> > [...]
> > 
> > > > @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> > > >    }
> > > >    /**
> > > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > > + * otherwise.
> > > > + *
> > > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > > + */
> > > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > +
> > > > +/**
> > > >     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > > >     */
> > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > It is still confusing that INDIRECT != !DIRECT.
> > > May be we have no good options right now, but I'd suggest to at least
> > > deprecate
> > > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > Agree. I may have missed something, but is my previous suggestion
> > not doable?
> > 
> > - direct = embeds its own data      (and indirect = !direct)
> > - clone (or another name) = data is another mbuf
> > - extbuf = data is in an external buffer
> 
> I guess the problem that it changes INDIRECT semantics since EXTBUF
> is added as well. I think strictly speaking it is an API change.
> Is it OK to make it without announcement?

In any case, there will be an ABI change, because an application
compiled for 18.02 will not be able to handle these new kind of
mbuf.

So unfortunatly yes, I think this kind of changes should first be
announced.

Thomas, what do you think?


> > Deprecating the macro is a good idea.
> > 
> > > > +	m->buf_addr = buf_addr;
> > > > +	m->buf_iova = buf_iova;
> > > > +
> > > > +	if (shinfo == NULL) {
> > > > +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> > > > +					sizeof(*shinfo)), sizeof(uintptr_t));
> > > > +		if ((void *)shinfo <= buf_addr)
> > > > +			return NULL;
> > > > +
> > > > +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> > > > +	} else {
> > > > +		m->buf_len = buf_len;
> > > > +	}
> > > > +
> > > > +	m->data_len = 0;
> > > > +
> > > > +	rte_pktmbuf_reset_headroom(m);
> > > I would suggest to make data_off one more parameter.
> > > If I have a buffer with data which I'd like to attach to an mbuf, I'd like
> > > to control data_off.
> > Another option is to set the headroom to 0.
> > Because the after attaching the mbuf to an external buffer, we will
> > still require to set the length.
> > 
> > A user can do something like this:
> > 
> > 	rte_pktmbuf_attach_extbuf(m, buf_va, buf_iova, buf_len, shinfo,
> > 		free_cb, free_cb_arg);
> > 	rte_pktmbuf_append(m, data_len + headroom);
> > 	rte_pktmbuf_adj(m, headroom);
> > 
> > > > +	m->ol_flags |= EXT_ATTACHED_MBUF;
> > > > +	m->shinfo = shinfo;
> > > > +
> > > > +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > Why is assignment used here? Cannot we attach extbuf already attached to
> > > other mbuf?
> > In rte_pktmbuf_attach(), this is true. That's not illogical to
> > keep the same approach here. Maybe an assert could be added?
> > 
> > > May be shinfo should be initialized only if it is not provided (shinfo ==
> > > NULL on input)?
> > I don't get why, can you explain please?
> 
> May be I misunderstand how it should look like when one huge buffer
> is partitioned. I thought that it should be only one shinfo per huge buffer
> to control when it is not used any more by any mbufs with extbuf.

OK I got it.

I think both approach could make sense:
- one shinfo per huge buffer
- or one shinfo per mbuf, and use the callback to manage another refcnt
  (like what Yongseok described)

So I agree with your proposal, shinfo should be initialized by
the caller if it is != NULL, else it can be initialized by
rte_pktmbuf_attach_extbuf().


> Other option is to have shinfo per small buf plus reference counter
> per huge buf (which is decremented when small buf reference counter
> becomes zero and free callback is executed). I guess it is assumed above.
> My fear is that it is too much reference counters:
>  1. mbuf reference counter
>  2. small buf reference counter
>  3. huge buf reference counter
> May be it is possible use (1) for (2) as well?

I would prefer to have only 2 reference counters, one in the mbuf
and one in the shinfo.
  
Thomas Monjalon April 24, 2018, 8:22 p.m. UTC | #7
24/04/2018 21:15, Olivier Matz:
> On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > > > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > > > + * otherwise.
> > > > > + *
> > > > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > > > + */
> > > > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > +
> > > > > +/**
> > > > >     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > > > >     */
> > > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > > It is still confusing that INDIRECT != !DIRECT.
> > > > May be we have no good options right now, but I'd suggest to at least
> > > > deprecate
> > > > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > > Agree. I may have missed something, but is my previous suggestion
> > > not doable?
> > > 
> > > - direct = embeds its own data      (and indirect = !direct)
> > > - clone (or another name) = data is another mbuf
> > > - extbuf = data is in an external buffer
> > 
> > I guess the problem that it changes INDIRECT semantics since EXTBUF
> > is added as well. I think strictly speaking it is an API change.
> > Is it OK to make it without announcement?
> 
> In any case, there will be an ABI change, because an application
> compiled for 18.02 will not be able to handle these new kind of
> mbuf.
> 
> So unfortunatly yes, I think this kind of changes should first be
> announced.
> 
> Thomas, what do you think?

What is the impact for the application developer?
Is there something to change in the application after this patch?
  
Yongseok Koh April 24, 2018, 9:53 p.m. UTC | #8
On Tue, Apr 24, 2018 at 10:22:45PM +0200, Thomas Monjalon wrote:
> 24/04/2018 21:15, Olivier Matz:
> > On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > > > > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > > > > + * otherwise.
> > > > > > + *
> > > > > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > > > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > > > > + */
> > > > > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > +
> > > > > > +/**
> > > > > >     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > > > > >     */
> > > > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > > > It is still confusing that INDIRECT != !DIRECT.
> > > > > May be we have no good options right now, but I'd suggest to at least
> > > > > deprecate
> > > > > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > > > Agree. I may have missed something, but is my previous suggestion
> > > > not doable?
> > > > 
> > > > - direct = embeds its own data      (and indirect = !direct)
> > > > - clone (or another name) = data is another mbuf
> > > > - extbuf = data is in an external buffer
> > > 
> > > I guess the problem that it changes INDIRECT semantics since EXTBUF
> > > is added as well. I think strictly speaking it is an API change.
> > > Is it OK to make it without announcement?
> > 
> > In any case, there will be an ABI change, because an application
> > compiled for 18.02 will not be able to handle these new kind of
> > mbuf.
> > 
> > So unfortunatly yes, I think this kind of changes should first be
> > announced.
> > 
> > Thomas, what do you think?
> 
> What is the impact for the application developer?
> Is there something to change in the application after this patch?

Let me address two concerns discussed here.

1) API breakage of RTE_MBUF_DIRECT()
Previously, direct == !indirect but now direct == !indirect && !extbuf. But to
set the new flag (EXT_ATTACHED_MBUF), the new API, rte_pktmbuf_attach_extbuf()
should be used and it is experimental. If application isn't compiled without
allowing experimental API or application doesn't use the new API, it is always
true that direct == !indirect. It looks logically okay to me. And FYI, it passed
the mbuf_autotest.

2) ABI breakage of mlx5's new Multi-Packet RQ (a.k.a MPRQ) feature
It's right that it could breadk ABI if the PMD delivers packets with external
buffer attached. But, the MPRQ feature is disabled by default and it can be
enabled only by the newly introduced PMD parameter (mprq_en). So, there's no
possibility that 18.02-based application receives a mbuf having an external
buffer. And, like Olivier mentioned, there's another ABI breakage by removing
control mbuf anyway.

So, I don't think there's need for developers to change their application after
this patch unless they want to use the new feature.


Thanks,
Yongseok
  
Thomas Monjalon April 24, 2018, 10:15 p.m. UTC | #9
24/04/2018 23:53, Yongseok Koh:
> On Tue, Apr 24, 2018 at 10:22:45PM +0200, Thomas Monjalon wrote:
> > 24/04/2018 21:15, Olivier Matz:
> > > On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > > > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > > > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > > > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > > > > > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > > > > > + * otherwise.
> > > > > > > + *
> > > > > > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > > > > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > > > > > + */
> > > > > > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > > +
> > > > > > > +/**
> > > > > > >     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > > > > > >     */
> > > > > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > > > > It is still confusing that INDIRECT != !DIRECT.
> > > > > > May be we have no good options right now, but I'd suggest to at least
> > > > > > deprecate
> > > > > > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > > > > Agree. I may have missed something, but is my previous suggestion
> > > > > not doable?
> > > > > 
> > > > > - direct = embeds its own data      (and indirect = !direct)
> > > > > - clone (or another name) = data is another mbuf
> > > > > - extbuf = data is in an external buffer
> > > > 
> > > > I guess the problem that it changes INDIRECT semantics since EXTBUF
> > > > is added as well. I think strictly speaking it is an API change.
> > > > Is it OK to make it without announcement?
> > > 
> > > In any case, there will be an ABI change, because an application
> > > compiled for 18.02 will not be able to handle these new kind of
> > > mbuf.
> > > 
> > > So unfortunatly yes, I think this kind of changes should first be
> > > announced.
> > > 
> > > Thomas, what do you think?
> > 
> > What is the impact for the application developer?
> > Is there something to change in the application after this patch?
> 
> Let me address two concerns discussed here.
> 
> 1) API breakage of RTE_MBUF_DIRECT()
> Previously, direct == !indirect but now direct == !indirect && !extbuf. But to
> set the new flag (EXT_ATTACHED_MBUF), the new API, rte_pktmbuf_attach_extbuf()
> should be used and it is experimental. If application isn't compiled without
> allowing experimental API or application doesn't use the new API, it is always
> true that direct == !indirect. It looks logically okay to me. And FYI, it passed
> the mbuf_autotest.
> 
> 2) ABI breakage of mlx5's new Multi-Packet RQ (a.k.a MPRQ) feature
> It's right that it could breadk ABI if the PMD delivers packets with external
> buffer attached. But, the MPRQ feature is disabled by default and it can be
> enabled only by the newly introduced PMD parameter (mprq_en). So, there's no
> possibility that 18.02-based application receives a mbuf having an external
> buffer. And, like Olivier mentioned, there's another ABI breakage by removing
> control mbuf anyway.
> 
> So, I don't think there's need for developers to change their application after
> this patch unless they want to use the new feature.

To summarize, this a feature addition, there is no breakage.
So I don't see what should be announced.

I think it could be integrated as experimental with a first
PMD implementation in 18.05. It will allow to test the feature
in the field, and have more feedbacks about how to improve the API.
  
Yongseok Koh April 24, 2018, 10:30 p.m. UTC | #10
On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> On 04/24/2018 04:38 AM, Yongseok Koh wrote:
[...]
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 06eceba37..7f6507a66 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -326,7 +326,7 @@ extern "C" {
> >   		PKT_TX_MACSEC |		 \
> >   		PKT_TX_SEC_OFFLOAD)
> > -#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
> > +#define EXT_ATTACHED_MBUF    (1ULL << 61) /**< Mbuf having external buffer */
> 
> May be it should mention that shinfo is filled in.

Okay.

> >   #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
> > @@ -566,8 +566,24 @@ struct rte_mbuf {
> >   	/** Sequence number. See also rte_reorder_insert(). */
> >   	uint32_t seqn;
> > +	struct rte_mbuf_ext_shared_info *shinfo;
> 
> I think it would be useful to add comment that it is used in the case of
> RTE_MBUF_HAS_EXTBUF() only.

Oops, I missed that. Thanks.

[...]
> > +static inline char * __rte_experimental
> > +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> > +	rte_iova_t buf_iova, uint16_t buf_len,
> > +	struct rte_mbuf_ext_shared_info *shinfo,
> > +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> > +{
> > +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> 
> May I suggest to move it inside if (shinfo == NULL) to make it clear that it
> is not used if shinfo pointer is provided.

Done.

[...]
> >   static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
> >   {
> > -	struct rte_mbuf *md;
> > -
> >   	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
> >   	    rte_mbuf_refcnt_read(mi) == 1);
> > -	/* if m is not direct, get the mbuf that embeds the data */
> > -	if (RTE_MBUF_DIRECT(m))
> > -		md = m;
> > -	else
> > -		md = rte_mbuf_from_indirect(m);
> > +	if (RTE_MBUF_HAS_EXTBUF(m)) {
> > +		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
> > +		mi->ol_flags = m->ol_flags;
> > +	} else {
> > +		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
> 
> It looks like handling of the direct mbuf is lost here. May be it is
> intentional
> to avoid branching since result will be the same for direct mbuf as well,
> but looks confusing. Deserves at least a comment which explains why.
> Ideally it should be proven by measurements.

Right, that was intentional to avoid the branch. Sometimes branch is more
expensive than arithmetic ops in core's pipeline. Will add a comment.

[...]
> > +static inline void
> > +__rte_pktmbuf_free_direct(struct rte_mbuf *m)
> > +{
> > +	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
> 
> Shouldn't it be done after below assertion? Just to be less confusing.

Right. Done.

> > +
> > +	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
> > +
> > +	if (rte_mbuf_refcnt_update(md, -1) == 0) {
> 
> It is not directly related to the changeset, but rte_pktmbuf_prefree_seg()
> has many optimizations which could be useful here:
>  - do not update if refcnt is 1
>  - do not set next/nb_seq if next is already NULL

Would be better to have a separate patch later.

Thanks,
Yongseok

> > +		md->next = NULL;
> > +		md->nb_segs = 1;
> > +		rte_mbuf_refcnt_set(md, 1);
> > +		rte_mbuf_raw_free(md);
> > +	}
> > +}
> 
> [...]
  
Yongseok Koh April 24, 2018, 11:34 p.m. UTC | #11
On Tue, Apr 24, 2018 at 09:15:38PM +0200, Olivier Matz wrote:
> On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > Hi Andrew, Yongseok,
> > > 
> > > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
[...]
> > > > > +	m->buf_addr = buf_addr;
> > > > > +	m->buf_iova = buf_iova;
> > > > > +
> > > > > +	if (shinfo == NULL) {
> > > > > +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> > > > > +					sizeof(*shinfo)), sizeof(uintptr_t));
> > > > > +		if ((void *)shinfo <= buf_addr)
> > > > > +			return NULL;
> > > > > +
> > > > > +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> > > > > +	} else {
> > > > > +		m->buf_len = buf_len;
> > > > > +	}
> > > > > +
> > > > > +	m->data_len = 0;
> > > > > +
> > > > > +	rte_pktmbuf_reset_headroom(m);
> > > > I would suggest to make data_off one more parameter.
> > > > If I have a buffer with data which I'd like to attach to an mbuf, I'd like
> > > > to control data_off.
> > > Another option is to set the headroom to 0.
> > > Because the after attaching the mbuf to an external buffer, we will
> > > still require to set the length.
> > > 
> > > A user can do something like this:
> > > 
> > > 	rte_pktmbuf_attach_extbuf(m, buf_va, buf_iova, buf_len, shinfo,
> > > 		free_cb, free_cb_arg);
> > > 	rte_pktmbuf_append(m, data_len + headroom);
> > > 	rte_pktmbuf_adj(m, headroom);

I'd take this option. Will make the change and document it.

> > > > > +	m->ol_flags |= EXT_ATTACHED_MBUF;
> > > > > +	m->shinfo = shinfo;
> > > > > +
> > > > > +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > > Why is assignment used here? Cannot we attach extbuf already attached to
> > > > other mbuf?
> > > In rte_pktmbuf_attach(), this is true. That's not illogical to
> > > keep the same approach here. Maybe an assert could be added?

Like I described in the doc, intention is to attach external buffer by
_attach_extbuf() for the first time and _attach() is just for additional mbuf
attachment. Will add an assert.

> > > > May be shinfo should be initialized only if it is not provided (shinfo ==
> > > > NULL on input)?
> > > I don't get why, can you explain please?
> > 
> > May be I misunderstand how it should look like when one huge buffer
> > is partitioned. I thought that it should be only one shinfo per huge buffer
> > to control when it is not used any more by any mbufs with extbuf.
> 
> OK I got it.
> 
> I think both approach could make sense:
> - one shinfo per huge buffer
> - or one shinfo per mbuf, and use the callback to manage another refcnt
>   (like what Yongseok described)
> 
> So I agree with your proposal, shinfo should be initialized by
> the caller if it is != NULL, else it can be initialized by
> rte_pktmbuf_attach_extbuf().

Also agreed. Will change.

> > Other option is to have shinfo per small buf plus reference counter
> > per huge buf (which is decremented when small buf reference counter
> > becomes zero and free callback is executed). I guess it is assumed above.
> > My fear is that it is too much reference counters:
> >  1. mbuf reference counter
> >  2. small buf reference counter
> >  3. huge buf reference counter
> > May be it is possible use (1) for (2) as well?
> 
> I would prefer to have only 2 reference counters, one in the mbuf
> and one in the shinfo.

Good discussion. It should be a design decision by user.

In my use-case, it would be a good idea to make all the mbufs in a same chunk
point to the same shared info in the head of the chunk and reset the refcnt of
shinfo to the total number of slices in the chunk.

+--+----+----+--------------+----+--------------+---+- - -
|global |head|mbuf1 data    |head|mbuf2 data    |   |
| shinfo|room|              |room|              |   |
+--+----+----+--------------+----+--------------+---+- - -


Thanks,
Yongseok
  
Olivier Matz April 25, 2018, 8:21 a.m. UTC | #12
On Tue, Apr 24, 2018 at 02:53:41PM -0700, Yongseok Koh wrote:
> On Tue, Apr 24, 2018 at 10:22:45PM +0200, Thomas Monjalon wrote:
> > 24/04/2018 21:15, Olivier Matz:
> > > On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > > > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > > > On Tue, Apr 24, 2018 at 03:28:33PM +0300, Andrew Rybchenko wrote:
> > > > > > On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> > > > > > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > > > > > + * otherwise.
> > > > > > > + *
> > > > > > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > > > > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > > > > > + */
> > > > > > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > > +
> > > > > > > +/**
> > > > > > >     * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > > > > > >     */
> > > > > > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > > > > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > > > > > It is still confusing that INDIRECT != !DIRECT.
> > > > > > May be we have no good options right now, but I'd suggest to at least
> > > > > > deprecate
> > > > > > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > > > > Agree. I may have missed something, but is my previous suggestion
> > > > > not doable?
> > > > > 
> > > > > - direct = embeds its own data      (and indirect = !direct)
> > > > > - clone (or another name) = data is another mbuf
> > > > > - extbuf = data is in an external buffer
> > > > 
> > > > I guess the problem that it changes INDIRECT semantics since EXTBUF
> > > > is added as well. I think strictly speaking it is an API change.
> > > > Is it OK to make it without announcement?
> > > 
> > > In any case, there will be an ABI change, because an application
> > > compiled for 18.02 will not be able to handle these new kind of
> > > mbuf.
> > > 
> > > So unfortunatly yes, I think this kind of changes should first be
> > > announced.
> > > 
> > > Thomas, what do you think?
> > 
> > What is the impact for the application developer?
> > Is there something to change in the application after this patch?
> 
> Let me address two concerns discussed here.
> 
> 1) API breakage of RTE_MBUF_DIRECT()
> Previously, direct == !indirect but now direct == !indirect && !extbuf. But to
> set the new flag (EXT_ATTACHED_MBUF), the new API, rte_pktmbuf_attach_extbuf()
> should be used and it is experimental. If application isn't compiled without
> allowing experimental API or application doesn't use the new API, it is always
> true that direct == !indirect. It looks logically okay to me. And FYI, it passed
> the mbuf_autotest.
> 
> 2) ABI breakage of mlx5's new Multi-Packet RQ (a.k.a MPRQ) feature
> It's right that it could breadk ABI if the PMD delivers packets with external
> buffer attached. But, the MPRQ feature is disabled by default and it can be
> enabled only by the newly introduced PMD parameter (mprq_en). So, there's no
> possibility that 18.02-based application receives a mbuf having an external
> buffer. And, like Olivier mentioned, there's another ABI breakage by removing
> control mbuf anyway.

Stricly speaking, it is possible that a user pass this parameter through
the application (which just forwards it) to the new dpdk. So there is an ABI
change. In short, if a user wants to enable an optimization of 18.05 on an
application compiled for 18.02, it will fail.

But I agree the impact is very limited.

We are a bit lucky, because:
- the mbuf size is aligned to 128, so it stays the same, and the priv area
  is after the 2nd cache line (note: we are at 112 bytes over 128 on x86_64).
- previously, the area where shinfo is added was filled with garbage. It has
  no impact because it is only accessed when the EXT flag is set.
- the unused flags are 0 by default.

Knowing there is an ABI breakage this release, it could also make sense to
try to limitate them and avoid breaking it again in 18.08.

So in my opinion, reagarding API/ABI, this patchset could go in for 18.05.

Olivier


> 
> So, I don't think there's need for developers to change their application after
> this patch unless they want to use the new feature.
> 
> 
> Thanks,
> Yongseok
  
Olivier Matz April 25, 2018, 8:28 a.m. UTC | #13
Hi Yongseok,

On Tue, Apr 24, 2018 at 06:02:44PM +0200, Olivier Matz wrote:
> > > @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> > >   }
> > >   /**
> > > + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > > + * otherwise.
> > > + *
> > > + * If a mbuf has its data in another mbuf and references it by mbuf
> > > + * indirection, this mbuf can be defined as a cloned mbuf.
> > > + */
> > > +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > +
> > > +/**
> > >    * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > >    */
> > > -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > > +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > 
> > It is still confusing that INDIRECT != !DIRECT.
> > May be we have no good options right now, but I'd suggest to at least
> > deprecate
> > RTE_MBUF_INDIRECT() and completely remove it in the next release.
> 
> Agree. I may have missed something, but is my previous suggestion
> not doable?
> 
> - direct = embeds its own data      (and indirect = !direct)
> - clone (or another name) = data is another mbuf
> - extbuf = data is in an external buffer

Any comment about this option?
  
Yongseok Koh April 25, 2018, 9:08 a.m. UTC | #14
> On Apr 25, 2018, at 1:28 AM, Olivier Matz <olivier.matz@6wind.com> wrote:
> 
> Hi Yongseok,
> 
> On Tue, Apr 24, 2018 at 06:02:44PM +0200, Olivier Matz wrote:
>>>> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>>>>  }
>>>>  /**
>>>> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
>>>> + * otherwise.
>>>> + *
>>>> + * If a mbuf has its data in another mbuf and references it by mbuf
>>>> + * indirection, this mbuf can be defined as a cloned mbuf.
>>>> + */
>>>> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>>> +
>>>> +/**
>>>>   * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>>>>   */
>>>> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>>> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
>>> 
>>> It is still confusing that INDIRECT != !DIRECT.
>>> May be we have no good options right now, but I'd suggest to at least
>>> deprecate
>>> RTE_MBUF_INDIRECT() and completely remove it in the next release.
>> 
>> Agree. I may have missed something, but is my previous suggestion
>> not doable?
>> 
>> - direct = embeds its own data      (and indirect = !direct)
>> - clone (or another name) = data is another mbuf
>> - extbuf = data is in an external buffer
> 
> Any comment about this option?

I liked your idea, so I defined RTE_MBUF_CLONED() and wanted to deprecate
RTE_MBUF_INDIRECT() in the coming release. But RTE_MBUF_DIRECT() can't be
(!RTE_MBUF_INDIRECT()) because it will logically include RTE_MBUF_HAS_EXTBUF().
I'm not sure I understand you correctly.

Can you please give me more guidelines so that I can take you idea?

Thanks,
Yongseok
  
Yongseok Koh April 25, 2018, 9:19 a.m. UTC | #15
> On Apr 25, 2018, at 2:08 AM, Yongseok Koh <yskoh@mellanox.com> wrote:
>> On Apr 25, 2018, at 1:28 AM, Olivier Matz <olivier.matz@6wind.com> wrote:
>> 
>> Hi Yongseok,
>> 
>> On Tue, Apr 24, 2018 at 06:02:44PM +0200, Olivier Matz wrote:
>>>>> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>>>>> }
>>>>> /**
>>>>> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
>>>>> + * otherwise.
>>>>> + *
>>>>> + * If a mbuf has its data in another mbuf and references it by mbuf
>>>>> + * indirection, this mbuf can be defined as a cloned mbuf.
>>>>> + */
>>>>> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>>>> +
>>>>> +/**
>>>>>  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>>>>>  */
>>>>> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
>>>>> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
>>>> 
>>>> It is still confusing that INDIRECT != !DIRECT.
>>>> May be we have no good options right now, but I'd suggest to at least
>>>> deprecate
>>>> RTE_MBUF_INDIRECT() and completely remove it in the next release.
>>> 
>>> Agree. I may have missed something, but is my previous suggestion
>>> not doable?
>>> 
>>> - direct = embeds its own data      (and indirect = !direct)
>>> - clone (or another name) = data is another mbuf
>>> - extbuf = data is in an external buffer
>> 
>> Any comment about this option?
> 
> I liked your idea, so I defined RTE_MBUF_CLONED() and wanted to deprecate
> RTE_MBUF_INDIRECT() in the coming release. But RTE_MBUF_DIRECT() can't be
> (!RTE_MBUF_INDIRECT()) because it will logically include RTE_MBUF_HAS_EXTBUF().
> I'm not sure I understand you correctly.
> 
> Can you please give me more guidelines so that I can take you idea?

Maybe, did you mean the following? Looks like doable but RTE_MBUF_DIRECT()
can't logically mean 'mbuf embeds its own data', right?

#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))

#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)

[...]

@@ -1327,7 +1572,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
        if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-               if (RTE_MBUF_INDIRECT(m))
+               if (RTE_MBUF_INDIRECT(m) || RTE_MBUF_HAS_EXTBUF(m))
                        rte_pktmbuf_detach(m);
 
                if (m->next != NULL) {
@@ -1339,7 +1584,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
        } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-               if (RTE_MBUF_INDIRECT(m))
+               if (RTE_MBUF_INDIRECT(m) || RTE_MBUF_HAS_EXTBUF(m))
                        rte_pktmbuf_detach(m);
 
                if (m->next != NULL) {


Thanks,
Yongseok
  
Andrew Rybchenko April 25, 2018, 2:45 p.m. UTC | #16
On 04/25/2018 02:34 AM, Yongseok Koh wrote:
> On Tue, Apr 24, 2018 at 09:15:38PM +0200, Olivier Matz wrote:
>> On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
>>> On 04/24/2018 07:02 PM, Olivier Matz wrote:
>>>>>> +	m->ol_flags |= EXT_ATTACHED_MBUF;
>>>>>> +	m->shinfo = shinfo;
>>>>>> +
>>>>>> +	rte_mbuf_ext_refcnt_set(shinfo, 1);
>>>>> Why is assignment used here? Cannot we attach extbuf already attached to
>>>>> other mbuf?
>>>> In rte_pktmbuf_attach(), this is true. That's not illogical to
>>>> keep the same approach here. Maybe an assert could be added?
> Like I described in the doc, intention is to attach external buffer by
> _attach_extbuf() for the first time and _attach() is just for additional mbuf
> attachment. Will add an assert.

Not sure that I understand. How should the second chunk with shared shinfo
of the huge buffer be attached to a new mbuf?

>>> Other option is to have shinfo per small buf plus reference counter
>>> per huge buf (which is decremented when small buf reference counter
>>> becomes zero and free callback is executed). I guess it is assumed above.
>>> My fear is that it is too much reference counters:
>>>   1. mbuf reference counter
>>>   2. small buf reference counter
>>>   3. huge buf reference counter
>>> May be it is possible use (1) for (2) as well?
>> I would prefer to have only 2 reference counters, one in the mbuf
>> and one in the shinfo.
> Good discussion. It should be a design decision by user.
>
> In my use-case, it would be a good idea to make all the mbufs in a same chunk
> point to the same shared info in the head of the chunk and reset the refcnt of
> shinfo to the total number of slices in the chunk.
>
> +--+----+----+--------------+----+--------------+---+- - -
> |global |head|mbuf1 data    |head|mbuf2 data    |   |
> | shinfo|room|              |room|              |   |
> +--+----+----+--------------+----+--------------+---+- - -

I don't understand how it can be achieved using proposed API.

Andrew.
  
Stephen Hemminger April 25, 2018, 3:06 p.m. UTC | #17
On Tue, 24 Apr 2018 22:22:45 +0200
Thomas Monjalon <thomas@monjalon.net> wrote:

> > > 
> > > I guess the problem that it changes INDIRECT semantics since EXTBUF
> > > is added as well. I think strictly speaking it is an API change.
> > > Is it OK to make it without announcement?  
> > 
> > In any case, there will be an ABI change, because an application
> > compiled for 18.02 will not be able to handle these new kind of
> > mbuf.
> > 
> > So unfortunatly yes, I think this kind of changes should first be
> > announced.
> > 
> > Thomas, what do you think?  
> 
> What is the impact for the application developer?
> Is there something to change in the application after this patch?

Maybe the use of external buffers should be negotiated as a receiver
flag (per queue) in the device driver.  If device wants external buffers
it sets that in capability flag, and only uses it application requests
it. This allows old applications to work with no semantic surprises.
  
Yongseok Koh April 25, 2018, 5:40 p.m. UTC | #18
On Wed, Apr 25, 2018 at 05:45:34PM +0300, Andrew Rybchenko wrote:
> On 04/25/2018 02:34 AM, Yongseok Koh wrote:
> > On Tue, Apr 24, 2018 at 09:15:38PM +0200, Olivier Matz wrote:
> > > On Tue, Apr 24, 2018 at 09:21:00PM +0300, Andrew Rybchenko wrote:
> > > > On 04/24/2018 07:02 PM, Olivier Matz wrote:
> > > > > > > +	m->ol_flags |= EXT_ATTACHED_MBUF;
> > > > > > > +	m->shinfo = shinfo;
> > > > > > > +
> > > > > > > +	rte_mbuf_ext_refcnt_set(shinfo, 1);
> > > > > > Why is assignment used here? Cannot we attach extbuf already attached to
> > > > > > other mbuf?
> > > > > In rte_pktmbuf_attach(), this is true. That's not illogical to
> > > > > keep the same approach here. Maybe an assert could be added?
> > Like I described in the doc, intention is to attach external buffer by
> > _attach_extbuf() for the first time and _attach() is just for additional mbuf
> > attachment. Will add an assert.
> 
> Not sure that I understand. How should the second chunk with shared shinfo
> of the huge buffer be attached to a new mbuf?

Okay, I think I know what you misunderstood. This patch itself has no
consideration about huge buffer. It is to simply attach an external buffer to an
mbuf. Slicing huge buffer and attaching multiple mbufs is one of use-cases of
this feature. Let me take a few examples.

rte_pktmbuf_attach_extbuf(m, buf, buf_iova, buf_len, NULL, fcb, fcb_arg);

                |<------ buf_len ------>|
 +----+         +----------------+------+
 | m  |-------->| ext buf        |shif  |
 |    |         |                |refc=1|
 +----+         +----------------+------+
                |<- m->buf_len ->|

rte_pktmbuf_attach(m1, m);

 +----+         +----------------+------+
 | m  |-------->| ext buf        |shif  |
 |    |         |                |refc=2|
 +----+         +----------------+------+
                ^
 +----+         |
 | m1 |---------+
 |    |
 +----+

rte_pktmbuf_attach_extbuf(m, buf, buf_iova, buf_len, shinfo, fcb, fcb_arg);

                                 |<------ buf_len ------>|
 +------+         +----+         +-----------------------+
 |shinfo|<--------| m  |-------->| ext buf               |
 |refc=1|         |    |         |                       |
 +------+         +----+         +-----------------------+
                                 |<----- m->buf_len ---->|

rte_pktmbuf_attach(m1, m);

 +------+         +----+         +-----------------------+
 |shinfo|<--------| m  |-------->| ext buf               |
 |refc=2|         |    |         |                       |
 +------+         +----+         +-----------------------+
     ^                           ^
     |            +----+         |
     +------------| m1 |---------+
                  |    |
                  +----+

> > > > Other option is to have shinfo per small buf plus reference counter
> > > > per huge buf (which is decremented when small buf reference counter
> > > > becomes zero and free callback is executed). I guess it is assumed above.
> > > > My fear is that it is too much reference counters:
> > > >   1. mbuf reference counter
> > > >   2. small buf reference counter
> > > >   3. huge buf reference counter
> > > > May be it is possible use (1) for (2) as well?
> > > I would prefer to have only 2 reference counters, one in the mbuf
> > > and one in the shinfo.
> > Good discussion. It should be a design decision by user.
> > 
> > In my use-case, it would be a good idea to make all the mbufs in a same chunk
> > point to the same shared info in the head of the chunk and reset the refcnt of
> > shinfo to the total number of slices in the chunk.
> > 
> > +--+----+----+--------------+----+--------------+---+- - -
> > |global |head|mbuf1 data    |head|mbuf2 data    |   |
> > | shinfo|room|              |room|              |   |
> > +--+----+----+--------------+----+--------------+---+- - -
> 
> I don't understand how it can be achieved using proposed API.
 
For the following use-case,
 +--+----+----+--------------+----+--------------+---+- - -
 |global |head|mbuf1 data    |head|mbuf2 data    |   |
 | shinfo|room|              |room|              |   |
 +--+----+----+--------------+----+--------------+---+- - -
 ^       |<---- buf_len ---->|
 |
 mem

User can do,

g_shinfo = (struct rte_mbuf_ext_shared_info *)mem;
buf = mem + sizeof (*g_shinfo);
buf_iova = get_iova(buf);
rte_pktmbuf_attach_extbuf(m1, buf, buf_iova, buf_len, g_shinfo, fcb, fcb_arg);
rte_mbuf_ext_refcnt_update(g_shinfo, 1);
rte_pktmbuf_reset_headroom(m1);
buf += buf_len;
buf_iova = get_iova(buf);
rte_pktmbuf_attach_extbuf(m2, buf, buf_iova, buf_len, g_shinfo, fcb, fcb_arg);
rte_mbuf_ext_refcnt_update(g_shinfo, 1);
rte_pktmbuf_reset_headroom(m2);


Does it make sense?

Thanks,
Yongseok
  
Olivier Matz April 25, 2018, 8 p.m. UTC | #19
On Wed, Apr 25, 2018 at 09:19:32AM +0000, Yongseok Koh wrote:
> > On Apr 25, 2018, at 2:08 AM, Yongseok Koh <yskoh@mellanox.com> wrote:
> >> On Apr 25, 2018, at 1:28 AM, Olivier Matz <olivier.matz@6wind.com> wrote:
> >> 
> >> Hi Yongseok,
> >> 
> >> On Tue, Apr 24, 2018 at 06:02:44PM +0200, Olivier Matz wrote:
> >>>>> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> >>>>> }
> >>>>> /**
> >>>>> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> >>>>> + * otherwise.
> >>>>> + *
> >>>>> + * If a mbuf has its data in another mbuf and references it by mbuf
> >>>>> + * indirection, this mbuf can be defined as a cloned mbuf.
> >>>>> + */
> >>>>> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> >>>>> +
> >>>>> +/**
> >>>>>  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> >>>>>  */
> >>>>> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> >>>>> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> >>>> 
> >>>> It is still confusing that INDIRECT != !DIRECT.
> >>>> May be we have no good options right now, but I'd suggest to at least
> >>>> deprecate
> >>>> RTE_MBUF_INDIRECT() and completely remove it in the next release.
> >>> 
> >>> Agree. I may have missed something, but is my previous suggestion
> >>> not doable?
> >>> 
> >>> - direct = embeds its own data      (and indirect = !direct)
> >>> - clone (or another name) = data is another mbuf
> >>> - extbuf = data is in an external buffer
> >> 
> >> Any comment about this option?
> > 
> > I liked your idea, so I defined RTE_MBUF_CLONED() and wanted to deprecate
> > RTE_MBUF_INDIRECT() in the coming release. But RTE_MBUF_DIRECT() can't be
> > (!RTE_MBUF_INDIRECT()) because it will logically include RTE_MBUF_HAS_EXTBUF().
> > I'm not sure I understand you correctly.
> > 
> > Can you please give me more guidelines so that I can take you idea?
> 
> Maybe, did you mean the following? Looks like doable but RTE_MBUF_DIRECT()
> can't logically mean 'mbuf embeds its own data', right?
> 
> #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> #define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> 
> #define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> #define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)

I was thinking about something like this:

#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)

#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_HAS_EXTBUF(mb) && !RTE_MBUF_CLONED(mb))
#define RTE_MBUF_INDIRECT(mb)   (!RTE_MBUF_DIRECT(mb))
  
Yongseok Koh April 25, 2018, 10:54 p.m. UTC | #20
On Wed, Apr 25, 2018 at 10:00:10PM +0200, Olivier Matz wrote:
> On Wed, Apr 25, 2018 at 09:19:32AM +0000, Yongseok Koh wrote:
> > > On Apr 25, 2018, at 2:08 AM, Yongseok Koh <yskoh@mellanox.com> wrote:
> > >> On Apr 25, 2018, at 1:28 AM, Olivier Matz <olivier.matz@6wind.com> wrote:
> > >> 
> > >> Hi Yongseok,
> > >> 
> > >> On Tue, Apr 24, 2018 at 06:02:44PM +0200, Olivier Matz wrote:
> > >>>>> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> > >>>>> }
> > >>>>> /**
> > >>>>> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> > >>>>> + * otherwise.
> > >>>>> + *
> > >>>>> + * If a mbuf has its data in another mbuf and references it by mbuf
> > >>>>> + * indirection, this mbuf can be defined as a cloned mbuf.
> > >>>>> + */
> > >>>>> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > >>>>> +
> > >>>>> +/**
> > >>>>>  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
> > >>>>>  */
> > >>>>> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > >>>>> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
> > >>>> 
> > >>>> It is still confusing that INDIRECT != !DIRECT.
> > >>>> May be we have no good options right now, but I'd suggest to at least
> > >>>> deprecate
> > >>>> RTE_MBUF_INDIRECT() and completely remove it in the next release.
> > >>> 
> > >>> Agree. I may have missed something, but is my previous suggestion
> > >>> not doable?
> > >>> 
> > >>> - direct = embeds its own data      (and indirect = !direct)
> > >>> - clone (or another name) = data is another mbuf
> > >>> - extbuf = data is in an external buffer
> > >> 
> > >> Any comment about this option?
> > > 
> > > I liked your idea, so I defined RTE_MBUF_CLONED() and wanted to deprecate
> > > RTE_MBUF_INDIRECT() in the coming release. But RTE_MBUF_DIRECT() can't be
> > > (!RTE_MBUF_INDIRECT()) because it will logically include RTE_MBUF_HAS_EXTBUF().
> > > I'm not sure I understand you correctly.
> > > 
> > > Can you please give me more guidelines so that I can take you idea?
> > 
> > Maybe, did you mean the following? Looks like doable but RTE_MBUF_DIRECT()
> > can't logically mean 'mbuf embeds its own data', right?
> > 
> > #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> > #define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> > 
> > #define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> > #define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> 
> I was thinking about something like this:
> 
> #define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> #define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> 
> #define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_HAS_EXTBUF(mb) && !RTE_MBUF_CLONED(mb))
> #define RTE_MBUF_INDIRECT(mb)   (!RTE_MBUF_DIRECT(mb))

Then, indirect should mean either having IND_ATTACHED_MBUF or having
EXT_ATTACHED_MBUF and it sounds weird. In a situation where EXT_ATTACHED_MBUF
(experimental) is never set, your definition of indirect will be same as now. No
breakage either.  Although your idea is logically same as the current patch,
there seems to be semantic conflict.

Thanks,
Yongseok
  

Patch

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 06eceba37..7f6507a66 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -326,7 +326,7 @@  extern "C" {
 		PKT_TX_MACSEC |		 \
 		PKT_TX_SEC_OFFLOAD)
 
-#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
+#define EXT_ATTACHED_MBUF    (1ULL << 61) /**< Mbuf having external buffer */
 
 #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
 
@@ -566,8 +566,24 @@  struct rte_mbuf {
 	/** Sequence number. See also rte_reorder_insert(). */
 	uint32_t seqn;
 
+	struct rte_mbuf_ext_shared_info *shinfo;
+
 } __rte_cache_aligned;
 
+/**
+ * Function typedef of callback to free externally attached buffer.
+ */
+typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
+
+/**
+ * Shared data at the end of an external buffer.
+ */
+struct rte_mbuf_ext_shared_info {
+	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
+	void *fcb_opaque;                        /**< Free callback argument */
+	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
+};
+
 /**< Maximum number of nb_segs allowed. */
 #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
 
@@ -688,14 +704,33 @@  rte_mbuf_to_baddr(struct rte_mbuf *md)
 }
 
 /**
+ * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
+ * otherwise.
+ *
+ * If a mbuf has its data in another mbuf and references it by mbuf
+ * indirection, this mbuf can be defined as a cloned mbuf.
+ */
+#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
+
+/**
  * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
  */
-#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
+#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)
+
+/**
+ * Returns TRUE if given mbuf has an external buffer, or FALSE otherwise.
+ *
+ * External buffer is a user-provided anonymous buffer.
+ */
+#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
 
 /**
  * Returns TRUE if given mbuf is direct, or FALSE otherwise.
+ *
+ * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
+ * can be defined as a direct mbuf.
  */
-#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
+#define RTE_MBUF_DIRECT(mb) (!(RTE_MBUF_CLONED(mb) || RTE_MBUF_HAS_EXTBUF(mb)))
 
 /**
  * Private data in case of pktmbuf pool.
@@ -821,6 +856,58 @@  rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
 
 #endif /* RTE_MBUF_REFCNT_ATOMIC */
 
+/**
+ * Reads the refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @return
+ *   Reference count number.
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
+{
+	return (uint16_t)(rte_atomic16_read(&shinfo->refcnt_atomic));
+}
+
+/**
+ * Set refcnt of an external buffer.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param new_value
+ *   Value set
+ */
+static inline void
+rte_mbuf_ext_refcnt_set(struct rte_mbuf_ext_shared_info *shinfo,
+	uint16_t new_value)
+{
+	rte_atomic16_set(&shinfo->refcnt_atomic, new_value);
+}
+
+/**
+ * Add given value to refcnt of an external buffer and return its new
+ * value.
+ *
+ * @param shinfo
+ *   Shared data of the external buffer.
+ * @param value
+ *   Value to add/subtract
+ * @return
+ *   Updated value
+ */
+static inline uint16_t
+rte_mbuf_ext_refcnt_update(struct rte_mbuf_ext_shared_info *shinfo,
+	int16_t value)
+{
+	if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1)) {
+		rte_mbuf_ext_refcnt_set(shinfo, 1 + value);
+		return 1 + value;
+	}
+
+	return (uint16_t)rte_atomic16_add_return(&shinfo->refcnt_atomic, value);
+}
+
 /** Mbuf prefetch */
 #define RTE_MBUF_PREFETCH_TO_FREE(m) do {       \
 	if ((m) != NULL)                        \
@@ -1195,11 +1282,122 @@  static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
 }
 
 /**
+ * Attach an external buffer to a mbuf.
+ *
+ * User-managed anonymous buffer can be attached to an mbuf. When attaching
+ * it, corresponding free callback function and its argument should be
+ * provided. This callback function will be called once all the mbufs are
+ * detached from the buffer.
+ *
+ * More mbufs can be attached to the same external buffer by
+ * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
+ * this API.
+ *
+ * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
+ * ``rte_pktmbuf_detach()``.
+ *
+ * Memory for shared data can be provided by shinfo argument. If shinfo is NULL,
+ * a few bytes in the trailer of the provided buffer will be dedicated for
+ * shared data (``struct rte_mbuf_ext_shared_info``) to store refcnt, callback
+ * function and so on. The pointer of shared data will be stored in m->shinfo.
+ *
+ * Attaching an external buffer is quite similar to mbuf indirection in
+ * replacing buffer addresses and length of a mbuf, but a few differences:
+ * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
+ *   2 as long as the direct mbuf itself isn't freed after the attachment.
+ *   In such cases, the buffer area of a direct mbuf must be read-only. But
+ *   external buffer has its own refcnt and it starts from 1. Unless
+ *   multiple mbufs are attached to a mbuf having an external buffer, the
+ *   external buffer is writable.
+ * - There's no need to allocate buffer from a mempool. Any buffer can be
+ *   attached with appropriate free callback and its IO address.
+ * - Smaller metadata is required to maintain shared data such as refcnt.
+ *
+ * @warning
+ * @b EXPERIMENTAL: This API may change without prior notice.
+ * Once external buffer is enabled by allowing experimental API,
+ * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
+ * exclusive. A mbuf can be considered direct if it is neither indirect nor
+ * having external buffer.
+ *
+ * @param m
+ *   The pointer to the mbuf.
+ * @param buf_addr
+ *   The pointer to the external buffer we're attaching to.
+ * @param buf_iova
+ *   IO address of the external buffer we're attaching to.
+ * @param buf_len
+ *   The size of the external buffer we're attaching to. If memory for
+ *   shared data is not provided, buf_len must be larger than the size of
+ *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
+ *   enough, this function will return NULL.
+ * @param shinfo
+ *   User-provided memory for shared data. If NULL, a few bytes in the
+ *   trailer of the provided buffer will be dedicated for shared data.
+ * @param free_cb
+ *   Free callback function to call when the external buffer needs to be
+ *   freed.
+ * @param fcb_opaque
+ *   Argument for the free callback function.
+ *
+ * @return
+ *   A pointer to the new start of the data on success, return NULL
+ *   otherwise.
+ */
+static inline char * __rte_experimental
+rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
+	rte_iova_t buf_iova, uint16_t buf_len,
+	struct rte_mbuf_ext_shared_info *shinfo,
+	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
+{
+	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
+
+	m->buf_addr = buf_addr;
+	m->buf_iova = buf_iova;
+
+	if (shinfo == NULL) {
+		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
+					sizeof(*shinfo)), sizeof(uintptr_t));
+		if ((void *)shinfo <= buf_addr)
+			return NULL;
+
+		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
+	} else {
+		m->buf_len = buf_len;
+	}
+
+	m->data_len = 0;
+
+	rte_pktmbuf_reset_headroom(m);
+	m->ol_flags |= EXT_ATTACHED_MBUF;
+	m->shinfo = shinfo;
+
+	rte_mbuf_ext_refcnt_set(shinfo, 1);
+	shinfo->free_cb = free_cb;
+	shinfo->fcb_opaque = fcb_opaque;
+
+	return (char *)m->buf_addr + m->data_off;
+}
+
+/**
+ * Detach the external buffer attached to a mbuf, same as
+ * ``rte_pktmbuf_detach()``
+ *
+ * @param m
+ *   The mbuf having external buffer.
+ */
+#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
+
+/**
  * Attach packet mbuf to another packet mbuf.
  *
- * After attachment we refer the mbuf we attached as 'indirect',
- * while mbuf we attached to as 'direct'.
- * The direct mbuf's reference counter is incremented.
+ * If the mbuf we are attaching to isn't a direct buffer and is attached to
+ * an external buffer, the mbuf being attached will be attached to the
+ * external buffer instead of mbuf indirection.
+ *
+ * Otherwise, the mbuf will be indirectly attached. After attachment we
+ * refer the mbuf we attached as 'indirect', while mbuf we attached to as
+ * 'direct'.  The direct mbuf's reference counter is incremented.
  *
  * Right now, not supported:
  *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
@@ -1213,19 +1411,18 @@  static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
  */
 static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 {
-	struct rte_mbuf *md;
-
 	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
 	    rte_mbuf_refcnt_read(mi) == 1);
 
-	/* if m is not direct, get the mbuf that embeds the data */
-	if (RTE_MBUF_DIRECT(m))
-		md = m;
-	else
-		md = rte_mbuf_from_indirect(m);
+	if (RTE_MBUF_HAS_EXTBUF(m)) {
+		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
+		mi->ol_flags = m->ol_flags;
+	} else {
+		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);
+		mi->priv_size = m->priv_size;
+		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
+	}
 
-	rte_mbuf_refcnt_update(md, 1);
-	mi->priv_size = m->priv_size;
 	mi->buf_iova = m->buf_iova;
 	mi->buf_addr = m->buf_addr;
 	mi->buf_len = m->buf_len;
@@ -1241,7 +1438,6 @@  static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 	mi->next = NULL;
 	mi->pkt_len = mi->data_len;
 	mi->nb_segs = 1;
-	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
 	mi->packet_type = m->packet_type;
 	mi->timestamp = m->timestamp;
 
@@ -1250,12 +1446,50 @@  static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
 }
 
 /**
- * Detach an indirect packet mbuf.
+ * @internal used by rte_pktmbuf_detach().
+ *
+ * Decrement the reference counter of the external buffer. When the
+ * reference counter becomes 0, the buffer is freed by pre-registered
+ * callback.
+ */
+static inline void
+__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
+{
+	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
+	RTE_ASSERT(m->shinfo != NULL);
+
+	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
+		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
+}
+
+/**
+ * @internal used by rte_pktmbuf_detach().
+ *
+ * Decrement the direct mbuf's reference counter. When the reference
+ * counter becomes 0, the direct mbuf is freed.
+ */
+static inline void
+__rte_pktmbuf_free_direct(struct rte_mbuf *m)
+{
+	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
+
+	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
+
+	if (rte_mbuf_refcnt_update(md, -1) == 0) {
+		md->next = NULL;
+		md->nb_segs = 1;
+		rte_mbuf_refcnt_set(md, 1);
+		rte_mbuf_raw_free(md);
+	}
+}
+
+/**
+ * Detach a packet mbuf from external buffer or direct buffer.
  *
+ *  - decrement refcnt and free the external/direct buffer if refcnt
+ *    becomes zero.
  *  - restore original mbuf address and length values.
  *  - reset pktmbuf data and data_len to their default values.
- *  - decrement the direct mbuf's reference counter. When the
- *  reference counter becomes 0, the direct mbuf is freed.
  *
  * All other fields of the given packet mbuf will be left intact.
  *
@@ -1264,10 +1498,14 @@  static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
  */
 static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 {
-	struct rte_mbuf *md = rte_mbuf_from_indirect(m);
 	struct rte_mempool *mp = m->pool;
 	uint32_t mbuf_size, buf_len, priv_size;
 
+	if (RTE_MBUF_HAS_EXTBUF(m))
+		__rte_pktmbuf_free_extbuf(m);
+	else
+		__rte_pktmbuf_free_direct(m);
+
 	priv_size = rte_pktmbuf_priv_size(mp);
 	mbuf_size = sizeof(struct rte_mbuf) + priv_size;
 	buf_len = rte_pktmbuf_data_room_size(mp);
@@ -1279,13 +1517,6 @@  static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 	rte_pktmbuf_reset_headroom(m);
 	m->data_len = 0;
 	m->ol_flags = 0;
-
-	if (rte_mbuf_refcnt_update(md, -1) == 0) {
-		md->next = NULL;
-		md->nb_segs = 1;
-		rte_mbuf_refcnt_set(md, 1);
-		rte_mbuf_raw_free(md);
-	}
 }
 
 /**
@@ -1309,7 +1540,7 @@  rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {
@@ -1321,7 +1552,7 @@  rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
-		if (RTE_MBUF_INDIRECT(m))
+		if (!RTE_MBUF_DIRECT(m))
 			rte_pktmbuf_detach(m);
 
 		if (m->next != NULL) {