[dpdk-dev] [PATCH v4 1/2] mbuf: support attaching external buffer to mbuf

Andrew Rybchenko arybchenko at solarflare.com
Tue Apr 24 14:28:33 CEST 2018


On 04/24/2018 04:38 AM, Yongseok Koh wrote:
> This patch introduces a new way of attaching an external buffer to a mbuf.
>
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but a few differences:
>    - When an indirect mbuf is attached, refcnt of the direct mbuf would be
>      2 as long as the direct mbuf itself isn't freed after the attachment.
>      In such cases, the buffer area of a direct mbuf must be read-only. But
>      external buffer has its own refcnt and it starts from 1. Unless
>      multiple mbufs are attached to a mbuf having an external buffer, the
>      external buffer is writable.
>    - There's no need to allocate buffer from a mempool. Any buffer can be
>      attached with appropriate free callback.
>    - Smaller metadata is required to maintain shared data such as refcnt.

Really useful. Many thanks. See my notes below.

It worries me that detach is more expensive than it really required 
since it
requires to restore mbuf as direct. If mbuf mempool is used for mbufs
as headers for external buffers only all these actions are absolutely 
useless.

> Signed-off-by: Yongseok Koh <yskoh at mellanox.com>
> ---
>
> ** This patch can pass the mbuf_autotest. **
>
> Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches
> will be submitted separately rebased on a differnet patchset which
> accommodates new memory hotplug design to mlx PMDs.
>
> v4:
> * rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo.
>    user can pass memory for shared data via shinfo argument.
> * minor changes from review.
>
> v3:
> * implement external buffer attachment instead of introducing buf_off for
>   mbuf indirection.
>
>   lib/librte_mbuf/rte_mbuf.h | 289 ++++++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 260 insertions(+), 29 deletions(-)
>
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 06eceba37..7f6507a66 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -326,7 +326,7 @@ extern "C" {
>   		PKT_TX_MACSEC |		 \
>   		PKT_TX_SEC_OFFLOAD)
>   
> -#define __RESERVED           (1ULL << 61) /**< reserved for future mbuf use */
> +#define EXT_ATTACHED_MBUF    (1ULL << 61) /**< Mbuf having external buffer */

May be it should mention that shinfo is filled in.

>   
>   #define IND_ATTACHED_MBUF    (1ULL << 62) /**< Indirect attached mbuf */
>   
> @@ -566,8 +566,24 @@ struct rte_mbuf {
>   	/** Sequence number. See also rte_reorder_insert(). */
>   	uint32_t seqn;
>   
> +	struct rte_mbuf_ext_shared_info *shinfo;

I think it would be useful to add comment that it is used in the case of 
RTE_MBUF_HAS_EXTBUF() only.

> +
>   } __rte_cache_aligned;
>   
> +/**
> + * Function typedef of callback to free externally attached buffer.
> + */
> +typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
> +
> +/**
> + * Shared data at the end of an external buffer.
> + */
> +struct rte_mbuf_ext_shared_info {
> +	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
> +	void *fcb_opaque;                        /**< Free callback argument */
> +	rte_atomic16_t refcnt_atomic;        /**< Atomically accessed refcnt */
> +};
> +
>   /**< Maximum number of nb_segs allowed. */
>   #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
>   
> @@ -688,14 +704,33 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>   }
>   
>   /**
> + * Returns TRUE if given mbuf is cloned by mbuf indirection, or FALSE
> + * otherwise.
> + *
> + * If a mbuf has its data in another mbuf and references it by mbuf
> + * indirection, this mbuf can be defined as a cloned mbuf.
> + */
> +#define RTE_MBUF_CLONED(mb)     ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +
> +/**
>    * Returns TRUE if given mbuf is indirect, or FALSE otherwise.
>    */
> -#define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> +#define RTE_MBUF_INDIRECT(mb)   RTE_MBUF_CLONED(mb)

It is still confusing that INDIRECT != !DIRECT.
May be we have no good options right now, but I'd suggest to at least 
deprecate
RTE_MBUF_INDIRECT() and completely remove it in the next release.

> +
> +/**
> + * Returns TRUE if given mbuf has an external buffer, or FALSE otherwise.
> + *
> + * External buffer is a user-provided anonymous buffer.
> + */
> +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
>   
>   /**
>    * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> + *
> + * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
> + * can be defined as a direct mbuf.
>    */
> -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> +#define RTE_MBUF_DIRECT(mb) (!(RTE_MBUF_CLONED(mb) || RTE_MBUF_HAS_EXTBUF(mb)))

[...]

> @@ -1195,11 +1282,122 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>   }
>   
>   /**
> + * Attach an external buffer to a mbuf.
> + *
> + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> + * it, corresponding free callback function and its argument should be
> + * provided. This callback function will be called once all the mbufs are
> + * detached from the buffer.
> + *
> + * More mbufs can be attached to the same external buffer by
> + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> + * this API.
> + *
> + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> + * ``rte_pktmbuf_detach()``.
> + *
> + * Memory for shared data can be provided by shinfo argument. If shinfo is NULL,
> + * a few bytes in the trailer of the provided buffer will be dedicated for
> + * shared data (``struct rte_mbuf_ext_shared_info``) to store refcnt, callback
> + * function and so on. The pointer of shared data will be stored in m->shinfo.
> + *
> + * Attaching an external buffer is quite similar to mbuf indirection in
> + * replacing buffer addresses and length of a mbuf, but a few differences:
> + * - When an indirect mbuf is attached, refcnt of the direct mbuf would be
> + *   2 as long as the direct mbuf itself isn't freed after the attachment.
> + *   In such cases, the buffer area of a direct mbuf must be read-only. But
> + *   external buffer has its own refcnt and it starts from 1. Unless
> + *   multiple mbufs are attached to a mbuf having an external buffer, the
> + *   external buffer is writable.
> + * - There's no need to allocate buffer from a mempool. Any buffer can be
> + *   attached with appropriate free callback and its IO address.
> + * - Smaller metadata is required to maintain shared data such as refcnt.
> + *
> + * @warning
> + * @b EXPERIMENTAL: This API may change without prior notice.
> + * Once external buffer is enabled by allowing experimental API,
> + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> + * exclusive. A mbuf can be considered direct if it is neither indirect nor
> + * having external buffer.
> + *
> + * @param m
> + *   The pointer to the mbuf.
> + * @param buf_addr
> + *   The pointer to the external buffer we're attaching to.
> + * @param buf_iova
> + *   IO address of the external buffer we're attaching to.
> + * @param buf_len
> + *   The size of the external buffer we're attaching to. If memory for
> + *   shared data is not provided, buf_len must be larger than the size of
> + *   ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If not
> + *   enough, this function will return NULL.
> + * @param shinfo
> + *   User-provided memory for shared data. If NULL, a few bytes in the
> + *   trailer of the provided buffer will be dedicated for shared data.
> + * @param free_cb
> + *   Free callback function to call when the external buffer needs to be
> + *   freed.
> + * @param fcb_opaque
> + *   Argument for the free callback function.
> + *
> + * @return
> + *   A pointer to the new start of the data on success, return NULL
> + *   otherwise.
> + */
> +static inline char * __rte_experimental
> +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> +	rte_iova_t buf_iova, uint16_t buf_len,
> +	struct rte_mbuf_ext_shared_info *shinfo,
> +	rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque)
> +{
> +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);

May I suggest to move it inside if (shinfo == NULL) to make it clear 
that it is not used if shinfo pointer is provided.

> +
> +	m->buf_addr = buf_addr;
> +	m->buf_iova = buf_iova;
> +
> +	if (shinfo == NULL) {
> +		shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end,
> +					sizeof(*shinfo)), sizeof(uintptr_t));
> +		if ((void *)shinfo <= buf_addr)
> +			return NULL;
> +
> +		m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> +	} else {
> +		m->buf_len = buf_len;
> +	}
> +
> +	m->data_len = 0;
> +
> +	rte_pktmbuf_reset_headroom(m);

I would suggest to make data_off one more parameter.
If I have a buffer with data which I'd like to attach to an mbuf, I'd 
like to control data_off.

> +	m->ol_flags |= EXT_ATTACHED_MBUF;
> +	m->shinfo = shinfo;
> +
> +	rte_mbuf_ext_refcnt_set(shinfo, 1);

Why is assignment used here? Cannot we attach extbuf already attached to 
other mbuf?
May be shinfo should be initialized only if it is not provided (shinfo 
== NULL on input)?

> +	shinfo->free_cb = free_cb;
> +	shinfo->fcb_opaque = fcb_opaque;
> +
> +	return (char *)m->buf_addr + m->data_off;
> +}
> +
> +/**
> + * Detach the external buffer attached to a mbuf, same as
> + * ``rte_pktmbuf_detach()``
> + *
> + * @param m
> + *   The mbuf having external buffer.
> + */
> +#define rte_pktmbuf_detach_extbuf(m) rte_pktmbuf_detach(m)
> +
> +/**
>    * Attach packet mbuf to another packet mbuf.
>    *
> - * After attachment we refer the mbuf we attached as 'indirect',
> - * while mbuf we attached to as 'direct'.
> - * The direct mbuf's reference counter is incremented.
> + * If the mbuf we are attaching to isn't a direct buffer and is attached to
> + * an external buffer, the mbuf being attached will be attached to the
> + * external buffer instead of mbuf indirection.
> + *
> + * Otherwise, the mbuf will be indirectly attached. After attachment we
> + * refer the mbuf we attached as 'indirect', while mbuf we attached to as
> + * 'direct'.  The direct mbuf's reference counter is incremented.
>    *
>    * Right now, not supported:
>    *  - attachment for already indirect mbuf (e.g. - mi has to be direct).
> @@ -1213,19 +1411,18 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>    */
>   static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   {
> -	struct rte_mbuf *md;
> -
>   	RTE_ASSERT(RTE_MBUF_DIRECT(mi) &&
>   	    rte_mbuf_refcnt_read(mi) == 1);
>   
> -	/* if m is not direct, get the mbuf that embeds the data */
> -	if (RTE_MBUF_DIRECT(m))
> -		md = m;
> -	else
> -		md = rte_mbuf_from_indirect(m);
> +	if (RTE_MBUF_HAS_EXTBUF(m)) {
> +		rte_mbuf_ext_refcnt_update(m->shinfo, 1);
> +		mi->ol_flags = m->ol_flags;
> +	} else {
> +		rte_mbuf_refcnt_update(rte_mbuf_from_indirect(m), 1);

It looks like handling of the direct mbuf is lost here. May be it is 
intentional
to avoid branching since result will be the same for direct mbuf as well,
but looks confusing. Deserves at least a comment which explains why.
Ideally it should be proven by measurements.

> +		mi->priv_size = m->priv_size;
> +		mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
> +	}
>   
> -	rte_mbuf_refcnt_update(md, 1);
> -	mi->priv_size = m->priv_size;
>   	mi->buf_iova = m->buf_iova;
>   	mi->buf_addr = m->buf_addr;
>   	mi->buf_len = m->buf_len;
> @@ -1241,7 +1438,6 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   	mi->next = NULL;
>   	mi->pkt_len = mi->data_len;
>   	mi->nb_segs = 1;
> -	mi->ol_flags = m->ol_flags | IND_ATTACHED_MBUF;
>   	mi->packet_type = m->packet_type;
>   	mi->timestamp = m->timestamp;
>   
> @@ -1250,12 +1446,50 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *m)
>   }
>   
>   /**
> - * Detach an indirect packet mbuf.
> + * @internal used by rte_pktmbuf_detach().
> + *
> + * Decrement the reference counter of the external buffer. When the
> + * reference counter becomes 0, the buffer is freed by pre-registered
> + * callback.
> + */
> +static inline void
> +__rte_pktmbuf_free_extbuf(struct rte_mbuf *m)
> +{
> +	RTE_ASSERT(RTE_MBUF_HAS_EXTBUF(m));
> +	RTE_ASSERT(m->shinfo != NULL);
> +
> +	if (rte_mbuf_ext_refcnt_update(m->shinfo, -1) == 0)
> +		m->shinfo->free_cb(m->buf_addr, m->shinfo->fcb_opaque);
> +}
> +
> +/**
> + * @internal used by rte_pktmbuf_detach().
> + *
> + * Decrement the direct mbuf's reference counter. When the reference
> + * counter becomes 0, the direct mbuf is freed.
> + */
> +static inline void
> +__rte_pktmbuf_free_direct(struct rte_mbuf *m)
> +{
> +	struct rte_mbuf *md = rte_mbuf_from_indirect(m);

Shouldn't it be done after below assertion? Just to be less confusing.

> +
> +	RTE_ASSERT(RTE_MBUF_INDIRECT(m));
> +
> +	if (rte_mbuf_refcnt_update(md, -1) == 0) {

It is not directly related to the changeset, but rte_pktmbuf_prefree_seg()
has many optimizations which could be useful here:
  - do not update if refcnt is 1
  - do not set next/nb_seq if next is already NULL

> +		md->next = NULL;
> +		md->nb_segs = 1;
> +		rte_mbuf_refcnt_set(md, 1);
> +		rte_mbuf_raw_free(md);
> +	}
> +}

[...]


More information about the dev mailing list