[dpdk-dev] [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf

Olivier Matz olivier.matz at 6wind.com
Mon Apr 23 18:18:43 CEST 2018


Hi Yongseok,

Please see some comments below.

On Wed, Apr 18, 2018 at 06:11:04PM -0700, Yongseok Koh wrote:
> This patch introduces a new way of attaching an external buffer to a mbuf.
> 
> Attaching an external buffer is quite similar to mbuf indirection in
> replacing buffer addresses and length of a mbuf, but a few differences:
>   - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
>     mbuf must be read-only. But external buffer has its own refcnt and it
>     starts from 1. Unless multiple mbufs are attached to a mbuf having an
>     external buffer, the external buffer is writable.

I'm wondering if "As refcnt of a direct mbuf is at least 2" should be
clarified. I guess we are talking about a direct mbuf that has another one
attached too.

I'm also not sure if I understand properly: to me, it is possible to have
an indirect mbuf that references a direct mbuf with a refcount of 1:
  m = rte_pktmbuf_alloc()
  mi = rte_pktmbuf_alloc()
  rte_pktmbuf_attach(mi, m)
  rte_pktmbuf_free(m)

>   - There's no need to allocate buffer from a mempool. Any buffer can be
>     attached with appropriate free callback.
>   - Smaller metadata is required to maintain shared data such as refcnt.
> Signed-off-by: Yongseok Koh <yskoh at mellanox.com>

[...]

> +/**
> + * Function typedef of callback to free externally attached buffer.
> + */
> +typedef void (*rte_mbuf_extbuf_free_callback_t)(void *addr, void *opaque);
> +
> +/**
> + * Shared data at the end of an external buffer.
> + */
> +struct rte_mbuf_ext_shared_info {
> +	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
> +	void *fcb_opaque;                        /**< Free callback argument */
> +	RTE_STD_C11
> +	union {
> +		rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
> +		uint16_t refcnt;          /**< Non-atomically accessed refcnt */

It looks that only refcnt_atomic is used.
I don't know if we really need the non-atomic one yet.


> +	};
> +};
> +
>  /**< Maximum number of nb_segs allowed. */
>  #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
>  
> @@ -693,9 +711,14 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
>  #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
>  
>  /**
> + * Returns TRUE if given mbuf has external buffer, or FALSE otherwise.
> + */
> +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> +
> +/**
>   * Returns TRUE if given mbuf is direct, or FALSE otherwise.
>   */
> -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> +#define RTE_MBUF_DIRECT(mb) (!RTE_MBUF_INDIRECT(mb) && !RTE_MBUF_HAS_EXTBUF(mb))

I'm a bit reticent to have RTE_MBUF_DIRECT(m) different of
!RTE_MBUF_INDIRECT(m), I feel it's not very natural.

What about:
- direct = embeds its own data
- clone (or another name) = data is another mbuf
- extbuf = data is in an external buffer


>  /**
>   * Private data in case of pktmbuf pool.
> @@ -821,6 +844,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
>  
>  #endif /* RTE_MBUF_REFCNT_ATOMIC */
>  
> +/**
> + * Reads the refcnt of an external buffer.
> + *
> + * @param shinfo
> + *   Shared data of the external buffer.
> + * @return
> + *   Reference count number.
> + */
> +static inline uint16_t
> +rte_extbuf_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)

What do you think about rte_mbuf_ext_refcnt_read() to keep name consistency?
(same for other functions below)

[...]

> @@ -1195,11 +1270,120 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
>  }
>  
>  /**
> + * Return shared data of external buffer of a mbuf.
> + *
> + * @param m
> + *   The pointer to the mbuf.
> + * @return
> + *   The address of the shared data.
> + */
> +static inline struct rte_mbuf_ext_shared_info *
> +rte_mbuf_ext_shinfo(struct rte_mbuf *m)
> +{
> +	return (struct rte_mbuf_ext_shared_info *)
> +		RTE_PTR_ADD(m->buf_addr, m->buf_len);
> +}

This forces to have the shared data at the end of the buffer. Is it
always possible? I think there are use-cases where the user may want to
specify another location for it.

For instance, an application mmaps a big file (locked in memory), and
wants to send mbufs pointing to this data without doing any copy.

Maybe adding a m->shinfo field would be a better choice, what do you
think?

This would certainly break the ABI, but I wonder if that patch does
not already break it. I mean, how would react an application compiled
for 18.02 if an EXTBUF is passed to it, knowing that many functions
are inline?


> +/**
> + * Attach an external buffer to a mbuf.
> + *
> + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> + * it, corresponding free callback function and its argument should be
> + * provided. This callback function will be called once all the mbufs are
> + * detached from the buffer.
> + *
> + * More mbufs can be attached to the same external buffer by
> + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> + * this API.
> + *
> + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> + * ``rte_pktmbuf_detach()``.
> + *
> + * A few bytes in the trailer of the provided buffer will be dedicated for
> + * shared data (``struct rte_mbuf_ext_shared_info``) to store refcnt,
> + * callback function and so on. The shared data can be referenced by
> + * ``rte_mbuf_ext_shinfo()``
> + *
> + * Attaching an external buffer is quite similar to mbuf indirection in
> + * replacing buffer addresses and length of a mbuf, but a few differences:
> + * - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
> + *   mbuf must be read-only. But external buffer has its own refcnt and it
> + *   starts from 1. Unless multiple mbufs are attached to a mbuf having an
> + *   external buffer, the external buffer is writable.
> + * - There's no need to allocate buffer from a mempool. Any buffer can be
> + *   attached with appropriate free callback.
> + * - Smaller metadata is required to maintain shared data such as refcnt.
> + *
> + * @warning
> + * @b EXPERIMENTAL: This API may change without prior notice.
> + * Once external buffer is enabled by allowing experimental API,
> + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> + * exclusive. A mbuf can be consiered direct if it is neither indirect nor

small typo:
consiered -> considered

> + * having external buffer.
> + *
> + * @param m
> + *   The pointer to the mbuf.
> + * @param buf_addr
> + *   The pointer to the external buffer we're attaching to.
> + * @param buf_len
> + *   The size of the external buffer we're attaching to. This must be larger
> + *   than the size of ``struct rte_mbuf_ext_shared_info`` and padding for
> + *   alignment. If not enough, this function will return NULL.
> + * @param free_cb
> + *   Free callback function to call when the external buffer needs to be freed.
> + * @param fcb_opaque
> + *   Argument for the free callback function.
> + * @return
> + *   A pointer to the new start of the data on success, return NULL otherwise.
> + */
> +static inline char * __rte_experimental
> +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> +	uint16_t buf_len, rte_mbuf_extbuf_free_callback_t free_cb,
> +	void *fcb_opaque)
> +{
> +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> +	struct rte_mbuf_ext_shared_info *shinfo;
> +
> +	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end, sizeof(*shinfo)),
> +			sizeof(uintptr_t));
> +
> +	if ((void *)shinfo <= buf_addr)
> +		return NULL;
> +
> +	m->buf_addr = buf_addr;
> +	m->buf_iova = rte_mempool_virt2iova(buf_addr);

Agree with Konstantin's comment. I think buf_iova should be an argument
of the function.


> +	m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);

Related to what I said above: I think m->buf_len should be set to
the buf_len argument, so a user can point to existing read-only
data.

[...]

Few more comments:

I think we still need to find a good way to advertise to the users
if a mbuf is writable or readable. Today, the rules are quite implicit.
There are surely some use cases where the mbuf is indirect but with
only one active user, meaning it could be READ-WRITE. We could target
18.08 for this.

One side question about your implementation in mlx. I guess the
hardware will write the mbuf data in a big contiguous buffer like
this:

+-------+--------------+--------+--------------+--------+- - -
|       |mbuf1 data    |        |mbuf2 data    |        |
|       |              |        |              |        |
+-------+--------------+--------+--------------+--------+- - -

Which will be transformed in:

+--+----+--------------+---+----+--------------+---+---+- - -
|  |head|mbuf1 data    |sh |head|mbuf2 data    |sh |   |
|  |room|              |inf|room|              |inf|   |
+--+----+--------------+---+----+--------------+---+---+- - -

So, there is one shinfo (i.e one refcount) for each mbuf.
How do you know when the big buffer is not used anymore?


To summarize, I like the idea of your patchset, this is close to
what I had in mind... which does not necessarly mean it is the good
way to do ;)

I'm a bit afraid about ABI breakage, we need to check that a
18.02-compiled application still works well with this change.

About testing, I don't know if you checked the mbuf autotests,
but it could also help to check that basic stuff still work.


Thanks,
Olivier


More information about the dev mailing list