[dpdk-dev] [PATCH v3 1/2] mbuf: support attaching external buffer to mbuf

Yongseok Koh yskoh at mellanox.com
Tue Apr 24 03:29:57 CEST 2018


On Mon, Apr 23, 2018 at 06:18:43PM +0200, Olivier Matz wrote:
> Hi Yongseok,
> 
> Please see some comments below.
> 
> On Wed, Apr 18, 2018 at 06:11:04PM -0700, Yongseok Koh wrote:
> > This patch introduces a new way of attaching an external buffer to a mbuf.
> > 
> > Attaching an external buffer is quite similar to mbuf indirection in
> > replacing buffer addresses and length of a mbuf, but a few differences:
> >   - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
> >     mbuf must be read-only. But external buffer has its own refcnt and it
> >     starts from 1. Unless multiple mbufs are attached to a mbuf having an
> >     external buffer, the external buffer is writable.
> 
> I'm wondering if "As refcnt of a direct mbuf is at least 2" should be
> clarified. I guess we are talking about a direct mbuf that has another one
> attached too.
> 
> I'm also not sure if I understand properly: to me, it is possible to have
> an indirect mbuf that references a direct mbuf with a refcount of 1:
>   m = rte_pktmbuf_alloc()
>   mi = rte_pktmbuf_alloc()
>   rte_pktmbuf_attach(mi, m)
>   rte_pktmbuf_free(m)

Totally agree. Will change the comment.

[...]
> > +struct rte_mbuf_ext_shared_info {
> > +	rte_mbuf_extbuf_free_callback_t free_cb; /**< Free callback function */
> > +	void *fcb_opaque;                        /**< Free callback argument */
> > +	RTE_STD_C11
> > +	union {
> > +		rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
> > +		uint16_t refcnt;          /**< Non-atomically accessed refcnt */
> 
> It looks that only refcnt_atomic is used.
> I don't know if we really need the non-atomic one yet.

Will remove.

> > +	};
> > +};
> > +
> >  /**< Maximum number of nb_segs allowed. */
> >  #define RTE_MBUF_MAX_NB_SEGS	UINT16_MAX
> >  
> > @@ -693,9 +711,14 @@ rte_mbuf_to_baddr(struct rte_mbuf *md)
> >  #define RTE_MBUF_INDIRECT(mb)   ((mb)->ol_flags & IND_ATTACHED_MBUF)
> >  
> >  /**
> > + * Returns TRUE if given mbuf has external buffer, or FALSE otherwise.
> > + */
> > +#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & EXT_ATTACHED_MBUF)
> > +
> > +/**
> >   * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> >   */
> > -#define RTE_MBUF_DIRECT(mb)     (!RTE_MBUF_INDIRECT(mb))
> > +#define RTE_MBUF_DIRECT(mb) (!RTE_MBUF_INDIRECT(mb) && !RTE_MBUF_HAS_EXTBUF(mb))
> 
> I'm a bit reticent to have RTE_MBUF_DIRECT(m) different of
> !RTE_MBUF_INDIRECT(m), I feel it's not very natural.
> 
> What about:
> - direct = embeds its own data
> - clone (or another name) = data is another mbuf
> - extbuf = data is in an external buffer

Good point. I'll clarify it in a new version by adding RTE_MBUF_CLONED().

> >  /**
> >   * Private data in case of pktmbuf pool.
> > @@ -821,6 +844,58 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
> >  
> >  #endif /* RTE_MBUF_REFCNT_ATOMIC */
> >  
> > +/**
> > + * Reads the refcnt of an external buffer.
> > + *
> > + * @param shinfo
> > + *   Shared data of the external buffer.
> > + * @return
> > + *   Reference count number.
> > + */
> > +static inline uint16_t
> > +rte_extbuf_refcnt_read(const struct rte_mbuf_ext_shared_info *shinfo)
> 
> What do you think about rte_mbuf_ext_refcnt_read() to keep name consistency?
> (same for other functions below)

No problem.

> [...]
> 
> > @@ -1195,11 +1270,120 @@ static inline int rte_pktmbuf_alloc_bulk(struct rte_mempool *pool,
> >  }
> >  
> >  /**
> > + * Return shared data of external buffer of a mbuf.
> > + *
> > + * @param m
> > + *   The pointer to the mbuf.
> > + * @return
> > + *   The address of the shared data.
> > + */
> > +static inline struct rte_mbuf_ext_shared_info *
> > +rte_mbuf_ext_shinfo(struct rte_mbuf *m)
> > +{
> > +	return (struct rte_mbuf_ext_shared_info *)
> > +		RTE_PTR_ADD(m->buf_addr, m->buf_len);
> > +}
> 
> This forces to have the shared data at the end of the buffer. Is it
> always possible? I think there are use-cases where the user may want to
> specify another location for it.
> 
> For instance, an application mmaps a big file (locked in memory), and
> wants to send mbufs pointing to this data without doing any copy.

Very good point. Will make rte_pktmbuf_attach_extbuf() take *shinfo as an
argument.

> Maybe adding a m->shinfo field would be a better choice, what do you
> think?

I like it to store in mbuf too.

> This would certainly break the ABI, but I wonder if that patch does
> not already break it. I mean, how would react an application compiled
> for 18.02 if an EXTBUF is passed to it, knowing that many functions
> are inline?

Even if I add shinfo field in rte_mbuf, I think it won't break the ABI. The
second cacheline is just 40B and it would simply make it 48B. Some code might
check the order/size of some fields (e.g. vPMD) in the struct, but if it is
added at the end of the struct, it should be okay. And there's no need to make a
change in a C file for this.

> > +/**
> > + * Attach an external buffer to a mbuf.
> > + *
> > + * User-managed anonymous buffer can be attached to an mbuf. When attaching
> > + * it, corresponding free callback function and its argument should be
> > + * provided. This callback function will be called once all the mbufs are
> > + * detached from the buffer.
> > + *
> > + * More mbufs can be attached to the same external buffer by
> > + * ``rte_pktmbuf_attach()`` once the external buffer has been attached by
> > + * this API.
> > + *
> > + * Detachment can be done by either ``rte_pktmbuf_detach_extbuf()`` or
> > + * ``rte_pktmbuf_detach()``.
> > + *
> > + * A few bytes in the trailer of the provided buffer will be dedicated for
> > + * shared data (``struct rte_mbuf_ext_shared_info``) to store refcnt,
> > + * callback function and so on. The shared data can be referenced by
> > + * ``rte_mbuf_ext_shinfo()``
> > + *
> > + * Attaching an external buffer is quite similar to mbuf indirection in
> > + * replacing buffer addresses and length of a mbuf, but a few differences:
> > + * - As refcnt of a direct mbuf is at least 2, the buffer area of a direct
> > + *   mbuf must be read-only. But external buffer has its own refcnt and it
> > + *   starts from 1. Unless multiple mbufs are attached to a mbuf having an
> > + *   external buffer, the external buffer is writable.
> > + * - There's no need to allocate buffer from a mempool. Any buffer can be
> > + *   attached with appropriate free callback.
> > + * - Smaller metadata is required to maintain shared data such as refcnt.
> > + *
> > + * @warning
> > + * @b EXPERIMENTAL: This API may change without prior notice.
> > + * Once external buffer is enabled by allowing experimental API,
> > + * ``RTE_MBUF_DIRECT()`` and ``RTE_MBUF_INDIRECT()`` are no longer
> > + * exclusive. A mbuf can be consiered direct if it is neither indirect nor
> 
> small typo:
> consiered -> considered

Will fix. Thanks.

> > + * having external buffer.
> > + *
> > + * @param m
> > + *   The pointer to the mbuf.
> > + * @param buf_addr
> > + *   The pointer to the external buffer we're attaching to.
> > + * @param buf_len
> > + *   The size of the external buffer we're attaching to. This must be larger
> > + *   than the size of ``struct rte_mbuf_ext_shared_info`` and padding for
> > + *   alignment. If not enough, this function will return NULL.
> > + * @param free_cb
> > + *   Free callback function to call when the external buffer needs to be freed.
> > + * @param fcb_opaque
> > + *   Argument for the free callback function.
> > + * @return
> > + *   A pointer to the new start of the data on success, return NULL otherwise.
> > + */
> > +static inline char * __rte_experimental
> > +rte_pktmbuf_attach_extbuf(struct rte_mbuf *m, void *buf_addr,
> > +	uint16_t buf_len, rte_mbuf_extbuf_free_callback_t free_cb,
> > +	void *fcb_opaque)
> > +{
> > +	void *buf_end = RTE_PTR_ADD(buf_addr, buf_len);
> > +	struct rte_mbuf_ext_shared_info *shinfo;
> > +
> > +	shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end, sizeof(*shinfo)),
> > +			sizeof(uintptr_t));
> > +
> > +	if ((void *)shinfo <= buf_addr)
> > +		return NULL;
> > +
> > +	m->buf_addr = buf_addr;
> > +	m->buf_iova = rte_mempool_virt2iova(buf_addr);
> 
> Agree with Konstantin's comment. I think buf_iova should be an argument
> of the function.

Oops, that was my silly mistake. I just copied this block from
rte_pktmbuf_init(). Then, I wanted to change it to rte_malloc_virt2iova() but I
forgot. I didn't realized it during my tests because mlx devices don't use iova
but virtaddr.

If it takes iova as an argument instead, it can be faster and it can use 'real'
external memory for packet DMA, e.g. storage application you mentioned. I mean,
even if a buffer isn't allocated inside DPDK (doesn't belong to one of memseg
list), this should work. Good suggestion!

> > +	m->buf_len = RTE_PTR_DIFF(shinfo, buf_addr);
> 
> Related to what I said above: I think m->buf_len should be set to
> the buf_len argument, so a user can point to existing read-only
> data.

I will make a change to have an argument of extra memory for shared data. And if
shinfo is passed as NULL, it will still spare bytes at the end, just for
convenience.

> [...]
> 
> Few more comments:
> 
> I think we still need to find a good way to advertise to the users
> if a mbuf is writable or readable. Today, the rules are quite implicit.
> There are surely some use cases where the mbuf is indirect but with
> only one active user, meaning it could be READ-WRITE. We could target
> 18.08 for this.

Right. That'll be very good to have.

> One side question about your implementation in mlx. I guess the
> hardware will write the mbuf data in a big contiguous buffer like
> this:
> 
> +-------+--------------+--------+--------------+--------+- - -
> |       |mbuf1 data    |        |mbuf2 data    |        |
> |       |              |        |              |        |
> +-------+--------------+--------+--------------+--------+- - -
> 
> Which will be transformed in:
> 
> +--+----+--------------+---+----+--------------+---+---+- - -
> |  |head|mbuf1 data    |sh |head|mbuf2 data    |sh |   |
> |  |room|              |inf|room|              |inf|   |
> +--+----+--------------+---+----+--------------+---+---+- - -
> 
> So, there is one shinfo (i.e one refcount) for each mbuf.
> How do you know when the big buffer is not used anymore?
 
 +--+----+--------------+---+----+--------------+---+---+- - -
 |  |head|mbuf1 data    |sh |head|mbuf2 data    |sh |   |
 |  |room|              |inf|room|              |inf|   |
 +--+----+--------------+---+----+--------------+---+---+- - -
  ^
  |
  Metadata for the whole chunk, having another refcnt managed by PMD.
  fcb_opaque will have this pointer so that the callback func knows it.

> To summarize, I like the idea of your patchset, this is close to
> what I had in mind... which does not necessarly mean it is the good
> way to do ;)
> 
> I'm a bit afraid about ABI breakage, we need to check that a
> 18.02-compiled application still works well with this change.

I had the same concern so I made rte_pktmbuf_attach_extbuf() __rte_experimental.
Although this new ol_flag is introduced, it can only be set by the new API and
the rest of changes won't be effective unless this flag is set.
RTE_MBUF_HAS_EXTBUF() will always be false if -DALLOW_EXPERIMENTAL_API isn't
specified or rte_pktmbuf_attach_extbuf() isn't called. And there's no change
needed in a C file. For this reason, I don't think there's ABI breakage.

Sounds correct?

> About testing, I don't know if you checked the mbuf autotests,
> but it could also help to check that basic stuff still work.

I'll make sure all the tests pass before I submit a new version.


Thanks,
Yongseok


More information about the dev mailing list