[dpdk-dev] [RFC v2 1/1] lib/ring: add scatter gather APIs

Ananyev, Konstantin konstantin.ananyev at intel.com
Fri Oct 9 10:05:29 CEST 2020


Hi lads,

> On Thu, Oct 08, 2020 at 08:44:13PM +0000, Honnappa Nagarahalli wrote:
> > <snip>
> >
> > >
> > > Hi Honnappa,
> > >
> > > From a quick walkthrough, I have some questions/comments, please see
> > > below.
> > Hi Olivier, appreciate your input.
> >
> > >
> > > On Tue, Oct 06, 2020 at 08:29:05AM -0500, Honnappa Nagarahalli wrote:
> > > > Add scatter gather APIs to avoid intermediate memcpy. Use cases that
> > > > involve copying large amount of data to/from the ring can benefit from
> > > > these APIs.
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli at arm.com>
> > > > ---
> > > >  lib/librte_ring/meson.build        |   3 +-
> > > >  lib/librte_ring/rte_ring_elem.h    |   1 +
> > > >  lib/librte_ring/rte_ring_peek_sg.h | 552
> > > > +++++++++++++++++++++++++++++
> > > >  3 files changed, 555 insertions(+), 1 deletion(-)  create mode 100644
> > > > lib/librte_ring/rte_ring_peek_sg.h
> > > >
> > > > diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
> > > > index 31c0b4649..377694713 100644
> > > > --- a/lib/librte_ring/meson.build
> > > > +++ b/lib/librte_ring/meson.build
> > > > @@ -12,4 +12,5 @@ headers = files('rte_ring.h',
> > > >  		'rte_ring_peek.h',
> > > >  		'rte_ring_peek_c11_mem.h',
> > > >  		'rte_ring_rts.h',
> > > > -		'rte_ring_rts_c11_mem.h')
> > > > +		'rte_ring_rts_c11_mem.h',
> > > > +		'rte_ring_peek_sg.h')
> > > > diff --git a/lib/librte_ring/rte_ring_elem.h
> > > > b/lib/librte_ring/rte_ring_elem.h index 938b398fc..7d3933f15 100644
> > > > --- a/lib/librte_ring/rte_ring_elem.h
> > > > +++ b/lib/librte_ring/rte_ring_elem.h
> > > > @@ -1079,6 +1079,7 @@ rte_ring_dequeue_burst_elem(struct rte_ring *r,
> > > > void *obj_table,
> > > >
> > > >  #ifdef ALLOW_EXPERIMENTAL_API
> > > >  #include <rte_ring_peek.h>
> > > > +#include <rte_ring_peek_sg.h>
> > > >  #endif
> > > >
> > > >  #include <rte_ring.h>
> > > > diff --git a/lib/librte_ring/rte_ring_peek_sg.h
> > > > b/lib/librte_ring/rte_ring_peek_sg.h
> > > > new file mode 100644
> > > > index 000000000..97d5764a6
> > > > --- /dev/null
> > > > +++ b/lib/librte_ring/rte_ring_peek_sg.h
> > > > @@ -0,0 +1,552 @@
> > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > + *
> > > > + * Copyright (c) 2020 Arm
> > > > + * Copyright (c) 2007-2009 Kip Macy kmacy at freebsd.org
> > > > + * All rights reserved.
> > > > + * Derived from FreeBSD's bufring.h
> > > > + * Used as BSD-3 Licensed with permission from Kip Macy.
> > > > + */
> > > > +
> > > > +#ifndef _RTE_RING_PEEK_SG_H_
> > > > +#define _RTE_RING_PEEK_SG_H_
> > > > +
> > > > +/**
> > > > + * @file
> > > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > > + * It is not recommended to include this file directly.
> > > > + * Please include <rte_ring_elem.h> instead.
> > > > + *
> > > > + * Ring Peek Scatter Gather APIs
> > >
> > > I am not fully convinced by the API name. To me, "scatter/gather" is
> > > associated to iovecs, like for instance in [1]. The wikipedia definition [2] may
> > > be closer though.
> > >
> > > [1]
> > > https://www.gnu.org/software/libc/manual/html_node/Scatter_002dGathe
> > > r.html
> > > [2] https://en.wikipedia.org/wiki/Gather-scatter_(vector_addressing)
> > The way I understand scatter-gather is, the data to be sent to something (like a device) is coming from multiple sources. It would require
> copying to put the data together to be contiguous. If the device supports scatter-gather, such copying is not required. The device can
> collect data from multiple locations and make it contiguous.
> >
> > In the case I was looking at, one part of the data was coming from the user of the API and another was generated by the API itself. If
> these two pieces of information have to be enqueued as a single object on the ring, I had to create an intermediate copy. But by exposing
> the ring memory to the user, the intermediate copy is avoided. Hence I called it scatter-gather.
> >
> > >
> > > What about "zero-copy"?
> > I think no-copy (nc for short) or user-copy (uc for short) would convey the meaning better. These would indicate that the rte_ring APIs are
> not copying the objects and it is left to the user to do the actual copy.
> >
> > >
> > > Also, the "peek" term looks also a bit confusing to me, but it existed before
> > > your patch. I understand it for dequeue, but not for enqueue.
> > I kept 'peek' there because the API still offers the 'peek' API capabilities. I am also not sure on what 'peek' means for enqueue API. The
> enqueue 'peek' API was provided to be symmetric with dequeue peek API.
> >
> > >
> > > Or, what about replacing the existing experimental peek API by this one?
> > > They look quite similar to me.
> > I agree, scatter gather APIs provide the peek capability and the no-copy benefits.
> > Konstantin, any opinions here?

Sorry, didn't have time yet, to look at this RFC properly.
Will try to do it next week, as I understand that's for 21.02 anyway?

> > >
> > > > + * Introduction of rte_ring with scatter gather serialized
> > > > + producer/consumer
> > > > + * (HTS sync mode) makes it possible to split public enqueue/dequeue
> > > > + API
> > > > + * into 3 phases:
> > > > + * - enqueue/dequeue start
> > > > + * - copy data to/from the ring
> > > > + * - enqueue/dequeue finish
> > > > + * Along with the advantages of the peek APIs, these APIs provide the
> > > > + ability
> > > > + * to avoid copying of the data to temporary area.
> > > > + *
> > > > + * Note that right now this new API is available only for two sync modes:
> > > > + * 1) Single Producer/Single Consumer (RTE_RING_SYNC_ST)
> > > > + * 2) Serialized Producer/Serialized Consumer (RTE_RING_SYNC_MT_HTS).
> > > > + * It is a user responsibility to create/init ring with appropriate
> > > > + sync
> > > > + * modes selected.
> > > > + *
> > > > + * Example usage:
> > > > + * // read 1 elem from the ring:
> > >
> > > Comment should be "prepare enqueuing 32 objects"
> > >
> > > > + * n = rte_ring_enqueue_sg_bulk_start(ring, 32, &sgd, NULL);
> > > > + * if (n != 0) {
> > > > + *	//Copy objects in the ring
> > > > + *	memcpy (sgd->ptr1, obj, sgd->n1 * sizeof(uintptr_t));
> > > > + *	if (n != sgd->n1)
> > > > + *		//Second memcpy because of wrapround
> > > > + *		n2 = n - sgd->n1;
> > > > + *		memcpy (sgd->ptr2, obj[n2], n2 * sizeof(uintptr_t));
> > >
> > > Missing { }
> > >
> > > > + *	rte_ring_dequeue_sg_finish(ring, n);
> > >
> > > Should be enqueue
> > >
> > Thanks, will correct them.
> >
> > > > + * }
> > > > + *
> > > > + * Note that between _start_ and _finish_ none other thread can
> > > > + proceed
> > > > + * with enqueue(/dequeue) operation till _finish_ completes.
> > > > + */
> > > > +
> > > > +#ifdef __cplusplus
> > > > +extern "C" {
> > > > +#endif
> > > > +
> > > > +#include <rte_ring_peek_c11_mem.h>
> > > > +
> > > > +/* Rock that needs to be passed between reserve and commit APIs */
> > > > +struct rte_ring_sg_data {
> > > > +	/* Pointer to the first space in the ring */
> > > > +	void **ptr1;
> > > > +	/* Pointer to the second space in the ring if there is wrap-around */
> > > > +	void **ptr2;
> > > > +	/* Number of elements in the first pointer. If this is equal to
> > > > +	 * the number of elements requested, then ptr2 is NULL.
> > > > +	 * Otherwise, subtracting n1 from number of elements requested
> > > > +	 * will give the number of elements available at ptr2.
> > > > +	 */
> > > > +	unsigned int n1;
> > > > +};
> > >
> > > Would it be possible to simply return the offset instead of this structure?
> > > The wrap could be managed by a __rte_ring_enqueue_elems() function.
> > Trying to use __rte_ring_enqueue_elems() will force temporary copy. See below.
> >
> > >
> > > I mean something like this:
> > >
> > > 	uint32_t start;
> > > 	n = rte_ring_enqueue_sg_bulk_start(ring, 32, &start, NULL);
> > > 	if (n != 0) {
> > > 		/* Copy objects in the ring. */
> > > 		__rte_ring_enqueue_elems(ring, start, obj, sizeof(uintptr_t),
> > > n);
> > For ex: 'obj' here is temporary copy.
> >
> > > 		rte_ring_enqueue_sg_finish(ring, n);
> > > 	}
> > >
> > > It would require to slightly change __rte_ring_enqueue_elems() to support
> > > to be called with prod_head >= size, and wrap in that case.
> > >
> > The alternate solution I can think of requires 3 things 1) the base address of the ring 2) Index to where to copy 3) the mask. With these 3
> things one could write the code like below:
> > for (i = 0; i < n; i++) {
> > 	ring_addr[(index + i) & mask] = obj[i]; // ANDing with mask will take care of wrap-around.
> > }
> >
> > However, I think this does not allow for passing the address in the ring to another function/API to copy the data (It is possible, but the user
> has to calculate the actual address, worry about the wrap-around, second pointer etc).
> >
> > The current approach hides some details and provides flexibility to the application to use the pointer the way it wants.
> 
> I agree that doing the access + masking manually looks too complex.
> 
> However I'm not sure to get why using __rte_ring_enqueue_elems() would
> result in an additional copy. I suppose the object that you want to
> enqueue is already stored somewhere?
> 
> For instance, let's say you have 10 objects to enqueue, located at
> different places:
> 
> 	void *obj_0_to_3 = <place where objects 0 to 3 are stored>;
> 	void *obj_4_to_7 = ...;
> 	void *obj_8_to_9 = ...;
> 	uint32_t start;
> 	n = rte_ring_enqueue_sg_bulk_start(ring, 10, &start, NULL);
> 	if (n != 0) {
> 		__rte_ring_enqueue_elems(ring, start, obj_0_to_3,
> 			sizeof(uintptr_t), 4);
> 		__rte_ring_enqueue_elems(ring, start + 4, obj_4_to_7,
> 			sizeof(uintptr_t), 4);
> 		__rte_ring_enqueue_elems(ring, start + 8, obj_8_to_9,
> 			sizeof(uintptr_t), 2);
> 		rte_ring_enqueue_sg_finish(ring, 10);
> 	}
> 


As I understand, It is not about different objects stored in different places,
it is about:
a) object is relatively big (16B+ ?)
b) You compose objects from values stored in few different places.

Let say you have:
struct elem_obj {uint64_t a; uint32_t b, c;};

And then you'd like to copy 'a' value from one location, 'b' from second,
and 'c' from third one.

Konstantin




More information about the dev mailing list