[dpdk-dev] [RFC v2] doc compression API for DPDK

Trahe, Fiona fiona.trahe at intel.com
Thu Feb 15 19:47:21 CET 2018


Hi Shally, Ahmed, 
Sorry for the delay in replying,
Comments below

> -----Original Message-----
> From: Verma, Shally [mailto:Shally.Verma at cavium.com]
> Sent: Wednesday, February 14, 2018 7:41 AM
> To: Ahmed Mansour <ahmed.mansour at nxp.com>; Trahe, Fiona <fiona.trahe at intel.com>;
> dev at dpdk.org
> Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya at cavium.com>; Gupta, Ashish
> <Ashish.Gupta at cavium.com>; Sahu, Sunila <Sunila.Sahu at cavium.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch at intel.com>; Challa, Mahipal <Mahipal.Challa at cavium.com>; Jain, Deepak K
> <deepak.k.jain at intel.com>; Hemant Agrawal <hemant.agrawal at nxp.com>; Roy Pledge
> <roy.pledge at nxp.com>; Youri Querry <youri.querry_1 at nxp.com>
> Subject: RE: [RFC v2] doc compression API for DPDK
> 
> Hi Ahmed,
> 
> >-----Original Message-----
> >From: Ahmed Mansour [mailto:ahmed.mansour at nxp.com]
> >Sent: 02 February 2018 01:53
> >To: Trahe, Fiona <fiona.trahe at intel.com>; Verma, Shally <Shally.Verma at cavium.com>; dev at dpdk.org
> >Cc: Athreya, Narayana Prasad <NarayanaPrasad.Athreya at cavium.com>; Gupta, Ashish
> <Ashish.Gupta at cavium.com>; Sahu, Sunila
> ><Sunila.Sahu at cavium.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch at intel.com>; Challa,
> Mahipal
> ><Mahipal.Challa at cavium.com>; Jain, Deepak K <deepak.k.jain at intel.com>; Hemant Agrawal
> <hemant.agrawal at nxp.com>; Roy
> >Pledge <roy.pledge at nxp.com>; Youri Querry <youri.querry_1 at nxp.com>
> >Subject: Re: [RFC v2] doc compression API for DPDK
> >
> >On 1/31/2018 2:03 PM, Trahe, Fiona wrote:
> >> Hi Ahmed, Shally,
> >>
> >> ///snip///
> >>>>>>>> D.1.1 Stateless and OUT_OF_SPACE
> >>>>>>>> ------------------------------------------------
> >>>>>>>> OUT_OF_SPACE is a condition when output buffer runs out of space
> >>>>> and
> >>>>>>> where PMD still has more data to produce. If PMD run into such
> >>>>> condition,
> >>>>>>> then it's an error condition in stateless processing.
> >>>>>>>> In such case, PMD resets itself and return with status
> >>>>>>> RTE_COMP_OP_STATUS_OUT_OF_SPACE with produced=consumed=0
> >>>>> i.e.
> >>>>>>> no input read, no output written.
> >>>>>>>> Application can resubmit an full input with larger output buffer size.
> >>>>>>> [Ahmed] Can we add an option to allow the user to read the data that
> >>>>> was
> >>>>>>> produced while still reporting OUT_OF_SPACE? this is mainly useful for
> >>>>>>> decompression applications doing search.
> >>>>>> [Shally] It is there but applicable for stateful operation type (please refer to
> >>>>> handling out_of_space under
> >>>>>> "Stateful Section").
> >>>>>> By definition, "stateless" here means that application (such as IPCOMP)
> >>>>> knows maximum output size
> >>>>>> guaranteedly and ensure that uncompressed data size cannot grow more
> >>>>> than provided output buffer.
> >>>>>> Such apps can submit an op with type = STATELESS and provide full input,
> >>>>> then PMD assume it has
> >>>>>> sufficient input and output and thus doesn't need to maintain any contexts
> >>>>> after op is processed.
> >>>>>> If application doesn't know about max output size, then it should process it
> >>>>> as stateful op i.e. setup op
> >>>>>> with type = STATEFUL and attach a stream so that PMD can maintain
> >>>>> relevant context to handle such
> >>>>>> condition.
> >>>>> [Fiona] There may be an alternative that's useful for Ahmed, while still
> >>>>> respecting the stateless concept.
> >>>>> In Stateless case where a PMD reports OUT_OF_SPACE in decompression
> >>>>> case
> >>>>> it could also return consumed=0, produced = x, where x>0. X indicates the
> >>>>> amount of valid data which has
> >>>>>  been written to the output buffer. It is not complete, but if an application
> >>>>> wants to search it it may be sufficient.
> >>>>> If the application still wants the data it must resubmit the whole input with a
> >>>>> bigger output buffer, and
> >>>>>  decompression will be repeated from the start, it
> >>>>>  cannot expect to continue on as the PMD has not maintained state, history
> >>>>> or data.
> >>>>> I don't think there would be any need to indicate this in capabilities, PMDs
> >>>>> which cannot provide this
> >>>>> functionality would always return produced=consumed=0, while PMDs which
> >>>>> can could set produced > 0.
> >>>>> If this works for you both, we could consider a similar case for compression.
> >>>>>
> >>>> [Shally] Sounds Fine to me. Though then in that case, consume should also be updated to actual
> >>> consumed by PMD.
> >>>> Setting consumed = 0 with produced > 0 doesn't correlate.
> >>> [Ahmed]I like Fiona's suggestion, but I also do not like the implication
> >>> of returning consumed = 0. At the same time returning consumed = y
> >>> implies to the user that it can proceed from the middle. I prefer the
> >>> consumed = 0 implementation, but I think a different return is needed to
> >>> distinguish it from OUT_OF_SPACE that the use can recover from. Perhaps
> >>> OUT_OF_SPACE_RECOVERABLE and OUT_OF_SPACE_TERMINATED. This also allows
> >>> future PMD implementations to provide recover-ability even in STATELESS
> >>> mode if they so wish. In this model STATELESS or STATEFUL would be a
> >>> hint for the PMD implementation to make optimizations for each case, but
> >>> it does not force the PMD implementation to limit functionality if it
> >>> can provide recover-ability.
> >> [Fiona] So you're suggesting the following:
> >> OUT_OF_SPACE - returned only on stateful operation. Not an error. Op.produced
> >>     can be used and next op in stream should continue on from op.consumed+1.
> >> OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
> >>     Error condition, no recovery possible.
> >>     consumed=produced=0. Application must resubmit all input data with
> >>     a bigger output buffer.
> >> OUT_OF_SPACE_RECOVERABLE - returned only on stateless operation, some recovery possible.
> >>      - consumed = 0, produced > 0. Application must resubmit all input data with
> >>         a bigger output buffer. However in decompression case, data up to produced
> >>         in dst buffer may be inspected/searched. Never happens in compression
> >>         case as output data would be meaningless.
> >>      - consumed > 0, produced > 0. PMD has stored relevant state and history and so
> >>         can convert to stateful, using op.produced and continuing from consumed+1.
> >> I don't expect our PMDs to use this last case, but maybe this works for others?
> >> I'm not convinced it's not just adding complexity. It sounds like a version of stateful
> >> without a stream, and maybe less efficient?
> >> If so should it respect the FLUSH flag? Which would have been FULL or FINAL in the op.
> >> Or treat it as FLUSH_NONE or SYNC? I don't know why an application would not
> >> simply have submitted a STATEFUL request if this is the behaviour it wants?
> >[Ahmed] I was actually suggesting the removal of OUT_OF_SPACE entirely
> >and replacing it with
> >OUT_OF_SPACE_TERMINATED - returned only on stateless operation.
> >        Error condition, no recovery possible.
> >        - consumed=0 produced=amount of data produced. Application must
> >resubmit all input data with
> >          a bigger output buffer to process all of the op
> >OUT_OF_SPACE_RECOVERABLE -  Normally returned on stateful operation. Not
> >an error. Op.produced
> >    can be used and next op in stream should continue on from op.consumed+1.
> >        -  consumed > 0, produced > 0. PMD has stored relevant state and
> >history and so
> >            can continue using op.produced and continuing from consumed+1.
> >
> >We would not return OUT_OF_SPACE_RECOVERABLE in stateless mode in our
> >implementation either.
> >
> >Regardless of speculative future PMDs. The more important aspect of this
> >for today is that the return status clearly determines
> >the meaning of "consumed". If it is RECOVERABLE then consumed is
> >meaningful. if it is TERMINATED then consumed in meaningless.
> >This way we take away the ambiguity of having OUT_OF_SPACE mean two
> >different user work flows.
> >
> >A speculative future PMD may be designed to return RECOVERABLE for
> >stateless ops that are attached to streams.
> >A future PMD may look to see if an op has a stream is attached and write
> >out the state there and go into recoverable mode.
> >in essence this leaves the choice up to the implementation and allows
> >the PMD to take advantage of stateless optimizations
> >so long as a "RECOVERABLE" scenario is rarely hit. The PMD will dump
> >context as soon as it fully processes an op. It will only
> >write context out in cases where the op chokes.
> >This futuristic PMD should ignore the FLUSH since this STATELESS mode as
> >indicated by the user and optimize
> 
> [Shally] IMO, it looks okay to have two separate return code TERMINATED and RECOVERABLE with
> definition as you mentioned and seem doable.
> So then it mean all following conditions:
> a. stateless with flush = full/final, no stream pointer provided , PMD can return TERMINATED i.e. user
> has to start all over again, it's a failure (as in current definition)
> b. stateless with flush = full/final, stream pointer provided, here it's up to PMD to return either
> TERMINATED or RECOVERABLE depending upon its ability (note if Recoverable, then PMD will maintain
> states in stream pointer)
> c. stateful with flush = full / NO_SYNC, stream pointer always there, PMD will
> TERMINATED/RECOVERABLE depending on STATEFUL_COMPRESSION/DECOMPRESSION feature flag
> enabled or not
[Fiona] I don't think the flush flag is relevant - it could be out of space on any flush flag, and if out of space
should ignore the flush flag. 
Is there a need for TERMINATED? - I didn't think it would ever need to be returned in stateful case.
 Why the ref to feature flag? If a PMD doesn't support a feature I think it should fail the op - not with
 out-of space, but unsupported or similar. Or it would fail on stream creation.

> 
> and one more exception case is:
> d. stateless with flush = full, no stream pointer provided, PMD can return RECOVERABLE i.e. PMD
> internally maintained that state somehow and consumed & produced > 0, so user can start consumed+1
> but there's restriction on user not to alter or change op until it is fully processed?!
[Fiona] Why the need for this case? 
There's always a restriction on user not to alter or change op until it is fully processed.
If a PMD can do this - why doesn't it create a stream when that API is called - and then it's same as b?

> 
> API currently takes care of case a and c, and case b can be supported if specification accept another
> proposal which mention optional usage of stream with stateless.
[Fiona] API has this, but as we agreed, not optional to call the create_stream() with an op_type 
parameter (stateful/stateless). PMD can return NULL or provide a stream, if the latter then that 
stream must be attached to ops.

 Until then API takes no difference to
> case b and c i.e. we can have op such as,
> - type= stateful with flush = full/final, stream pointer provided, PMD can return
> TERMINATED/RECOVERABLE according to its ability
> 
> Case d , is something exceptional, if there's requirement in PMDs to support it, then believe it will be
> doable with concept of different return code.
> 
[Fiona] That's not quite how I understood it. Can it be simpler and only following cases?
a. stateless with flush = full/final, no stream pointer provided , PMD can return TERMINATED i.e. user
    has to start all over again, it's a failure (as in current definition). 
    consumed = 0, produced=amount of data produced. This is usually 0, but in decompression 
    case a PMD may return > 0 and application may find it useful to inspect that data.
b. stateless with flush = full/final, stream pointer provided, here it's up to PMD to return either
    TERMINATED or RECOVERABLE depending upon its ability (note if Recoverable, then PMD will maintain
    states in stream pointer)
c. stateful with flush = any, stream pointer always there, PMD will return RECOVERABLE.
    op.produced can be used and next op in stream should continue on from op.consumed+1.
    Consumed=0, produced=0 is an unusual but allowed case. I'm not sure if it could ever happen, but
    no need to change state to TERMINATED in this case. There may be useful state/history 
    stored in the PMD, even though no output produced yet.

> >>>>>>>> D.2 Compression API Stateful operation
> >>>>>>>> ----------------------------------------------------------
> >>>>>>>>  A Stateful operation in DPDK compression means application invokes
> >>>>>>> enqueue burst() multiple times to process related chunk of data either
> >>>>>>> because
> >>>>>>>> - Application broke data into several ops, and/or
> >>>>>>>> - PMD ran into out_of_space situation during input processing
> >>>>>>>>
> >>>>>>>> In case of either one or all of the above conditions, PMD is required to
> >>>>>>> maintain state of op across enque_burst() calls and
> >>>>>>>> ops are setup with op_type RTE_COMP_OP_STATEFUL, and begin with
> >>>>>>> flush value = RTE_COMP_NO/SYNC_FLUSH and end at flush value
> >>>>>>> RTE_COMP_FULL/FINAL_FLUSH.
> >>>>>>>> D.2.1 Stateful operation state maintenance
> >>>>>>>> ---------------------------------------------------------------
> >>>>>>>> It is always an ideal expectation from application that it should parse
> >>>>>>> through all related chunk of source data making its mbuf-chain and
> >>>>> enqueue
> >>>>>>> it for stateless processing.
> >>>>>>>> However, if it need to break it into several enqueue_burst() calls, then
> >>>>> an
> >>>>>>> expected call flow would be something like:
> >>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>> [Ahmed] The work is now in flight to the PMD.The user will call dequeue
> >>>>>>> burst in a loop until all ops are received. Is this correct?
> >>>>>>>
> >>>>>>>> deque_burst(op) // should dequeue before we enqueue next
> >>>>>> [Shally] Yes. Ideally every submitted op need to be dequeued. However
> >>>>> this illustration is specifically in
> >>>>>> context of stateful op processing to reflect if a stream is broken into
> >>>>> chunks, then each chunk should be
> >>>>>> submitted as one op at-a-time with type = STATEFUL and need to be
> >>>>> dequeued first before next chunk is
> >>>>>> enqueued.
> >>>>>>
> >>>>>>>> enqueue_burst( |op.no_flush |)
> >>>>>>>> deque_burst(op) // should dequeue before we enqueue next
> >>>>>>>> enqueue_burst( |op.full_flush |)
> >>>>>>> [Ahmed] Why now allow multiple work items in flight? I understand that
> >>>>>>> occasionaly there will be OUT_OF_SPACE exception. Can we just
> >>>>> distinguish
> >>>>>>> the response in exception cases?
> >>>>>> [Shally] Multiples ops are allowed in flight, however condition is each op in
> >>>>> such case is independent of
> >>>>>> each other i.e. belong to different streams altogether.
> >>>>>> Earlier (as part of RFC v1 doc) we did consider the proposal to process all
> >>>>> related chunks of data in single
> >>>>>> burst by passing them as ops array but later found that as not-so-useful for
> >>>>> PMD handling for various
> >>>>>> reasons. You may please refer to RFC v1 doc review comments for same.
> >>>>> [Fiona] Agree with Shally. In summary, as only one op can be processed at a
> >>>>> time, since each needs the
> >>>>> state of the previous, to allow more than 1 op to be in-flight at a time would
> >>>>> force PMDs to implement internal queueing and exception handling for
> >>>>> OUT_OF_SPACE conditions you mention.
> >>> [Ahmed] But we are putting the ops on qps which would make them
> >>> sequential. Handling OUT_OF_SPACE conditions would be a little bit more
> >>> complex but doable.
> >> [Fiona] In my opinion this is not doable, could be very inefficient.
> >> There may be many streams.
> >> The PMD would have to have an internal queue per stream so
> >> it could adjust the next src offset and length in the OUT_OF_SPACE case.
> >> And this may ripple back though all subsequent ops in the stream as each
> >> source len is increased and its dst buffer is not big enough.
> >[Ahmed] Regarding multi op OUT_OF_SPACE handling.
> >The caller would still need to adjust
> >the src length/output buffer as you say. The PMD cannot handle
> >OUT_OF_SPACE internally.
> >After OUT_OF_SPACE occurs, the PMD should reject all ops in this stream
> >until it gets explicit
> >confirmation from the caller to continue working on this stream. Any ops
> >received by
> >the PMD should be returned to the caller with status STREAM_PAUSED since
> >the caller did not
> >explicitly acknowledge that it has solved the OUT_OF_SPACE issue.
> >These semantics can be enabled by adding a new function to the API
> >perhaps stream_resume().
> >This allows the caller to indicate that it acknowledges that it has seen
> >the issue and this op
> >should be used to resolve the issue. Implementations that do not support
> >this mode of use
> >can push back immediately after one op is in flight. Implementations
> >that support this use
> >mode can allow many ops from the same session
> >
> [Shally] Is it still in context of having single burst where all op belongs to one stream? If yes, I would still
> say it would add an overhead to PMDs especially if it is expected to work closer to HW (which I think is
> the case with DPDK PMD).
> Though your approach is doable but why this all cannot be in a layer above PMD? i.e. a layer above PMD
> can either pass one-op at a time with burst size = 1 OR can make chained mbuf of input and output and
> pass than as one op.
> Is it just to ease applications of chained mbuf burden or do you see any performance /use-case
> impacting aspect also?
> 
> if it is in context where each op belong to different stream in a burst, then why do we need
> stream_pause and resume? It is a expectations from app to pass more output buffer with consumed + 1
> from next call onwards as it has already
> seen OUT_OF_SPACE.
>
[Fiona] I still have concerns with this and would not want to support in our PMD.
TO make sure I understand, you want to send a burst of ops, with several from same stream.
If one causes OUT_OF_SPACE_RECOVERABLE, then the PMD should not process any 
subsequent ops in that stream. 
Should it return them in a dequeue_burst() with status still NOT_PROCESSED?
Or somehow drop them? How?
While still processing ops form other streams.
As we want to offload each op to hardware with as little CPU processing as possible we
would not want to open up each op to see which stream it's attached to and
make decisions to do per-stream storage, or drop it, or bypass hw and dequeue without processing.

Maybe we could add a capability if this behaviour is important for you?
e.g. ALLOW_ENQUEUE_MULTIPLE_STATEFUL_OPS ?
Our PMD would set this to 0. And expect no more than one op from a stateful stream
to be in flight at any time.  

 
> >Regarding the ordering of ops
> >We do force serialization of ops belonging to a stream in STATEFUL
> >operation. Related ops do
> >not go out of order and are given to available PMDs one at a time.
> >
> >>> The question is this mode of use useful for real
> >>> life applications or would we be just adding complexity? The technical
> >>> advantage of this is that processing of Stateful ops is interdependent
> >>> and PMDs can take advantage of caching and other optimizations to make
> >>> processing related ops much faster than switching on every op. PMDs have
> >>> maintain state of more than 32 KB for DEFLATE for every stream.
> >>>>> If the application has all the data, it can put it into chained mbufs in a single
> >>>>> op rather than
> >>>>> multiple ops, which avoids pushing all that complexity down to the PMDs.
> >>> [Ahmed] I think that your suggested scheme of putting all related mbufs
> >>> into one op may be the best solution without the extra complexity of
> >>> handling OUT_OF_SPACE cases, while still allowing the enqueuer extra
> >>> time If we have a way of marking mbufs as ready for consumption. The
> >>> enqueuer may not have all the data at hand but can enqueue the op with a
> >>> couple of empty mbus marked as not ready for consumption. The enqueuer
> >>> will then update the rest of the mbufs to ready for consumption once the
> >>> data is added. This introduces a race condition. A second flag for each
> >>> mbuf can be updated by the PMD to indicate that it processed it or not.
> >>> This way in cases where the PMD beat the application to the op, the
> >>> application will just update the op to point to the first unprocessed
> >>> mbuf and resend it to the PMD.
> >> [Fiona] This doesn't sound safe. You want to add data to a stream after you've
> >> enqueued the op. You would have to write to op.src.length at a time when the PMD
> >> might be reading it. Sounds like a lock would be necessary.
> >> Once the op has been enqueued, my understanding is its ownership is handed
> >> over to the PMD and the application should not touch it until it has been dequeued.
> >> I don't think it's a good idea to change this model.
> >> Can't the application just collect a stream of data in chained mbufs until it has
> >> enough to send an op, then construct the op and while waiting for that op to
> >> complete, accumulate the next batch of chained mbufs? Only construct the next op
> >> after the previous one is complete, based on the result of the previous one.
> >>
> >[Ahmed] Fair enough. I agree with you. I imagined it in a different way
> >in which each mbuf would have its own length.
> >The advantage to gain is in applications where there is one PMD user,
> >the down time between ops can be significant and setting up a single
> >producer consumer pair significantly reduces the CPU cycles and PMD down
> >time.
> >
> >////snip////


More information about the dev mailing list