[dpdk-stable] [PATCH v2] doc/compress: clarify error handling on data-plane

Shally Verma shallyv at marvell.com
Tue May 7 19:14:50 CEST 2019
Previous message: [dpdk-stable] [dpdk-dev] [PATCH] net/mlx5: fix missing release of Rx queue object
Next message: [dpdk-stable] [PATCH v2] doc/compress: clarify error handling on data-plane
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> -----Original Message-----
> From: Trahe, Fiona <fiona.trahe at intel.com>
> Sent: Tuesday, April 30, 2019 10:04 PM
> To: Shally Verma <shallyv at marvell.com>; dev at dpdk.org
> Cc: akhil.goyal at nxp.com; Ashish Gupta <ashishg at marvell.com>; Daly, Lee
> <lee.daly at intel.com>; Sunila Sahu <ssahu at marvell.com>; stable at dpdk.org;
> Trahe, Fiona <fiona.trahe at intel.com>
> Subject: RE: [PATCH v2] doc/compress: clarify error handling on data-plane
> 
> Hi Shally,
> 
> 
> > -----Original Message-----
> > From: Shally Verma [mailto:shallyv at marvell.com]
> > Sent: Thursday, April 18, 2019 1:12 PM
> > To: Trahe, Fiona <fiona.trahe at intel.com>; dev at dpdk.org
> > Cc: akhil.goyal at nxp.com; Ashish Gupta <ashishg at marvell.com>; Daly, Lee
> > <lee.daly at intel.com>; Sunila Sahu <ssahu at marvell.com>;
> > stable at dpdk.org
> > Subject: RE: [PATCH v2] doc/compress: clarify error handling on
> > data-plane
> >
> > Hi Fiona,
> >
> >
> > > -----Original Message-----
> > > From: Fiona Trahe <fiona.trahe at intel.com>
> > > Sent: Tuesday, April 9, 2019 8:26 PM
> > > To: dev at dpdk.org
> > > Cc: akhil.goyal at nxp.com; Ashish Gupta <ashishg at marvell.com>;
> > > lee.daly at intel.com; Sunila Sahu <ssahu at marvell.com>; Shally Verma
> > > <shallyv at marvell.com>; Fiona Trahe <fiona.trahe at intel.com>;
> > > stable at dpdk.org
> > > Subject: [PATCH v2] doc/compress: clarify error handling on
> > > data-plane
> > >
> > > Fixed some typos and clarified which errors should be returned when
> > > and why on the enqueue and dequeue APIs.
> > >
> > > Fixes: a584d3bea902 ("doc: add compressdev library guide")
> > > cc: stable at dpdk.org
> > >
> > > Signed-off-by: Fiona Trahe <fiona.trahe at intel.com>
> > > ---
> > > v2 changes:
> > >  - changed "0 or undefined" to just "undefined" as 0 is superfluous.
> > >
> > >
> > >  doc/guides/prog_guide/compressdev.rst | 46
> > > ++++++++++++++++++++++++++++++++---
> > >  1 file changed, 43 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/doc/guides/prog_guide/compressdev.rst
> > > b/doc/guides/prog_guide/compressdev.rst
> > > index ad9703753..c700dd103 100644
> > > --- a/doc/guides/prog_guide/compressdev.rst
> > > +++ b/doc/guides/prog_guide/compressdev.rst
> > > @@ -201,7 +201,7 @@ for stateful processing of ops.
> > >  Operation Status
> > >  ~~~~~~~~~~~~~~~~
> > >  Each operation carries a status information updated by PMD after it
> > > is processed.
> > > -following are currently supported status:
> > > +Following are currently supported:
> > >
> > >  - RTE_COMP_OP_STATUS_SUCCESS,
> > >      Operation is successfully completed @@ -227,14 +227,54 @@
> > > following are currently supported status:
> > >      is not an error case. Output data up to op.produced can be used and
> > >      next op in the stream should continue on from op.consumed+1.
> > >
> > > +Operation status after enqueue / dequeue
> > > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > +Some of the above values will only arise in the op after an
> > > +``rte_compressdev_enqueue_burst()``, some only after an
> > > +``rte_compressdev_dequeue_burst()``. For optimal performance on the
> > > +data-plane an application is not expected to check the
> > > +``op.status`` of all ops after both enqueue and dequeue, it should
> > > +be sufficient to only check after dequeue. To facilitate this
> > > +optimisation, most errors which may reasonably be expected to occur
> > > +in a production environment will be
> > > returned by the PMD on the ``dequeue``.
> > > +op.status may hold the following values after dequeue:
> > > +
> > > +- RTE_COMP_OP_STATUS_SUCCESS
> > > +- RTE_COMP_OP_STATUS_ERROR
> > > +- RTE_COMP_OP_STATUS_OUT_OF_SPACE_TERMINATED
> > > +- RTE_COMP_OP_STATUS_OUT_OF_SPACE_RECOVERABLE
> > > +
> > > +There are some exceptions whereby errors can occur on the ``enqueue``.
> > > +For any error which can occur in a production environment and can
> > > +be successful after a retry with the same op the PMD may return the
> > > +error on the enqueue.
> > This statement looks bit confusing.
> > Seems like we are trying to add a description regarding op status
> > check even after the enqueue call unlike current scenario, where app
> > only check for it after dequeue?
> [Fiona] The line following this explains that there is no need to check
> op.status in this case.
> Maybe it's not obvious that the application SHOULD check that all ops are
> enqueued?
> I can reword as:
> The application should always check the value returned by the enqueue.
> If less than the full burst is enqueued there's no need for the application to
> check op.status of any or every op - it can simply retry from the return
> value+1 in a later enqueue and expect success.
> 
I agree with the purpose of the patch, but I have the following confusion when I read the description above:

My understanding is that if the op status is INVALID_ARGS, or any ERROR which is permanent in nature,
then the returned nb_enqd will be less than the number of ops actually passed. Whatever the reason, any time the app gets nb_enqd < the number passed, it should check the status of the (nb_enqd + 1)th op to find
the exact cause of failure, and then either attempt a re-enqueue, correct the op preparation, or take any other appropriate action.

Also, STATUS_ERROR is very generic. It can be returned when the queue is full, in which case the app can re-attempt an enqueue of the same op, or
it can indicate an irrecoverable error on enqueue, in which case the app probably just has to reset everything. In the latter case it might not even be possible, by design, for the PMD
to push the op into the completion queue for the app to dequeue. I would suggest adding another status code which reflects a permanent error condition, i.e. an irrecoverable error code
which tells the app to perform a PMD qp reset/re-init to recover, and simplifying the description to just state the expected app behaviour needed to avoid an infinite loop.
It is then the app's choice whether or not to check the op status for errors after enqueue, depending on whether it is running in a production or a development environment.
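
To make the app-side behaviour concrete, here is a minimal sketch in plain C. It is only an illustration under stated assumptions, not DPDK code: the ST_* enum, struct op, enqueue_fn, enqueue_all() and the mock device are all hypothetical stand-ins for the RTE_COMP_OP_STATUS_* values and the enqueue burst call discussed in this thread. On a short enqueue it looks at the first op which was not accepted (index nb_enqd when counting from 0), stops on a permanent status, and bounds the number of retries so that it cannot loop forever.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the op status values discussed in this thread. */
enum op_status {
    ST_SUCCESS,
    ST_NOT_PROCESSED,  /* e.g. device queue full: a later retry can succeed */
    ST_INVALID_ARGS,   /* permanent: retrying the same op cannot succeed */
    ST_ERROR,          /* generic: the app cannot tell transient from permanent */
    ST_INVALID_STATE   /* permanent */
};

struct op { enum op_status status; };

/* Assumed enqueue shape: returns how many ops from the front were accepted. */
typedef uint16_t (*enqueue_fn)(struct op **ops, uint16_t nb_ops, void *ctx);

enum enq_result { ENQ_DONE, ENQ_BAD_OP, ENQ_GAVE_UP };

/* Enqueue every op, retrying short bursts, without ever looping forever. */
static enum enq_result
enqueue_all(enqueue_fn enq, void *ctx, struct op **ops, uint16_t nb_ops,
            unsigned max_retries, uint16_t *nb_done)
{
    uint16_t done = 0;
    unsigned retries = 0;

    assert(enq != NULL && ops != NULL && nb_done != NULL);
    while (done < nb_ops) {
        uint16_t n = enq(&ops[done], (uint16_t)(nb_ops - done), ctx);

        done = (uint16_t)(done + n);
        if (done == nb_ops)
            break;
        /* Short enqueue: ops[done] is the first op that was not accepted. */
        switch (ops[done]->status) {
        case ST_INVALID_ARGS:
        case ST_INVALID_STATE:
            *nb_done = done;       /* the caller must fix ops[done] */
            return ENQ_BAD_OP;
        default:
            if (n > 0) {
                retries = 0;       /* progress was made, start counting again */
            } else if (++retries > max_retries) {
                *nb_done = done;   /* no progress: treat as irrecoverable */
                return ENQ_GAVE_UP;
            }
        }
    }
    *nb_done = done;
    return ENQ_DONE;
}

/* Toy device used only to exercise the loop above. */
struct mock_dev {
    uint16_t room;   /* ops accepted per call */
    struct op *bad;  /* op rejected with ST_INVALID_ARGS, or NULL */
};

static uint16_t
mock_enqueue(struct op **ops, uint16_t nb_ops, void *ctx)
{
    struct mock_dev *dev = ctx;
    uint16_t i;

    for (i = 0; i < nb_ops; i++) {
        if (ops[i] == dev->bad) {
            ops[i]->status = ST_INVALID_ARGS;
            break;
        }
        if (i == dev->room) {
            ops[i]->status = ST_NOT_PROCESSED;
            break;
        }
    }
    return i;
}
```

A production fast path could drop the switch and keep only the retry bound; a debug build could keep both. A generic ST_ERROR lands in the bounded-retry branch here precisely because the app cannot tell whether it is transient, which is why a separate irrecoverable status code would help.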

Thanks
Shally


> If this isn't clear, can you suggest other wording - or expand on what's
> unclear.
> 
> 
> > So if less than the full burst is enqueued there's no
> > > +need for the application to check op.status - the application can
> > > +simply retry in a later enqueue and expect success. Though the
> > > +application
> > > is not expected to check for these, the values are as follows:
> > > +
> > > +- RTE_COMP_OP_STATUS_NOT_PROCESSED  - could occur if a hardware
> > > device's queue is full, after a dequeue a retry of the enqueue can
> > > be successful.
> > > +
> > > +- RTE_COMP_OP_STATUS_ERROR - could occur due to out-of-memory or other transient condition which could clear after a time.
> > > +
> > > +Other errors may also occur on an ``enqueue``, but they are only
> > > +expected to arise during development. As a retry with the same op
> > > +won't be successful, if a performant application wants to avoid
> > > +checking op.status on the enqueue it should ensure these never
> > > +arise in a production environment, e.g. by checking device
> > > +capabilities and validating
> > > input parameters before sending operations. Examples are:
> > > +
> > > +- RTE_COMP_OP_STATUS_INVALID_ARGS
> > > +- RTE_COMP_OP_STATUS_ERROR (if due to a condition which is not
> > > +transient)
> > How does app identify error is transient or permanent?
> [Fiona] The app doesn't - it's up to the PMD to decide what the appropriate
> return is using the logic described above. If the PMD encounters an error
> that's not transient, and can be reasonably expected in a production
> environment, then it should forward the error to the dequeue - where it
> expects the application to check for it.
> Should I add this note at the end of this section?
> 
> >
> > > +- RTE_COMP_OP_STATUS_INVALID_STATE
> > > +
> > > +If an application doesn't safeguard against these AND doesn't check
> > > +the op.status of the next op which was not enqueued, but just
> > > +retries, it could
> > > result in an infinite loop.
> > > +
> > Maybe you are trying to insist that whenever the number of enqueued ops <
> > number actually passed, then the caller should check the status of the
> > num_enqueued + 1th op to know the exact reason, and take corrective
> > measures before re-enqueuing it, for the following status conditions:
> > 1. INVALID_ARGS
> > 2. INVALID_STATE
> > 3 ERROR
> >
> > Is that correct?
> [Fiona] Only if the application has a slow-path for debug in non-production
> code.
> In production code, for high-performance, I'm saying it's reasonable not to
> do this check.
> 
> The reason for calling this out is we did have cases where errors were being
> returned on the enqueue, But the performance tool was not checking
> op.status of enqueued+1 and so getting into an infinite loop.
> Seems reasonable that a performant application would also not check.
> We've removed those cases from QAT PMD, and behave according to above
> logic.
> But  it's not described explicitly in this detail on the API - so think it's worth
> calling out this expectation.
> Does it make sense?
> 
> 
> >
> >
> > >  Produced, Consumed And Operation Status
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > >
> > >  - If status is RTE_COMP_OP_STATUS_SUCCESS,
> > >      consumed = amount of data read from input buffer, and
> > >      produced = amount of data written in destination buffer
> > > -- If status is RTE_COMP_OP_STATUS_FAILURE,
> > > -    consumed = produced = 0 or undefined
> > > +- If status is RTE_COMP_OP_STATUS_ERROR,
> > > +    consumed = produced = undefined
> > >  - If status is RTE_COMP_OP_STATUS_OUT_OF_SPACE_TERMINATED,
> > >      consumed = 0 and
> > >      produced = usually 0, but in decompression cases a PMD may
> > > return > 0
> > > --
> > Acked.
> >
> > > 2.13.6