[RFC,v2,3/7] ethdev: introduce Rx queue based limit watermark

Message ID 20220522055900.417282-4-spiked@nvidia.com (mailing list archive)
State Superseded, archived
Delegated to: Andrew Rybchenko
Headers
Series introduce per-queue limit watermark and host shaper |

Checks

Context Check Description
ci/checkpatch warning coding style issues

Commit Message

Spike Du May 22, 2022, 5:58 a.m. UTC
  LWM(limit watermark) describes the fullness of a Rx queue. If the Rx
queue fullness is above LWM, the device will trigger the event
RTE_ETH_EVENT_RX_LWM.
LWM is defined as a percentage of Rx queue size with valid value of
[0,99].
Setting LWM to 0 means disable it, which is the default.
When translate the percentage to queue descriptor number, the numbe
should be bigger than 0 and less than queue size.
Add LWM's configuration and query driver callbacks in eth_dev_ops.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 lib/ethdev/ethdev_driver.h | 22 ++++++++++++
 lib/ethdev/rte_ethdev.c    | 52 +++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h    | 74 +++++++++++++++++++++++++++++++++++++-
 lib/ethdev/version.map     |  4 +++
 4 files changed, 151 insertions(+), 1 deletion(-)
  

Comments

Stephen Hemminger May 22, 2022, 3:23 p.m. UTC | #1
On Sun, 22 May 2022 08:58:56 +0300
Spike Du <spiked@nvidia.com> wrote:

> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04cff8ee10..687ae5ff29 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
>  	 */
>  	union rte_eth_rxseg *rx_seg;
>  
> -	uint64_t reserved_64s[2]; /**< Reserved for future fields */
> +	/**
> +	 * Per-queue Rx limit watermark defined as percentage of Rx queue
> +	 * size. If Rx queue receives traffic higher than this percentage,
> +	 * the event RTE_ETH_EVENT_RX_LWM is triggered.
> +	 */
> +	uint8_t lwm;
> +
> +	uint8_t reserved_bits[3];
> +	uint32_t reserved_32s;
> +	uint64_t reserved_64s;
>  	void *reserved_ptrs[2];   /**< Reserved for future fields */
>  };
>  

Ok but, this is an ABI risk about this because reserved stuff was never required before.
Whenever is a reserved field is introduced the code (in this case rte_ethdev_configure).

Best practice would have been to have the code require all reserved fields be 0
in earlier releases. In this case an application is like to define a watermark
of zero; how will your code handle it.

Also, using 8 bits as percentage is different than how other API's handle this.
Since Rx queue size is in packets, why is this not in packets?
Also document what behavior of 0 is.

Why introduce new query/set operations? This should just be part of the overall
device configuration.
  
Stephen Hemminger May 22, 2022, 3:24 p.m. UTC | #2
On Sun, 22 May 2022 08:58:56 +0300
Spike Du <spiked@nvidia.com> wrote:

> LWM(limit watermark) describes the fullness of a Rx queue. If the Rx
> queue fullness is above LWM, the device will trigger the event
> RTE_ETH_EVENT_RX_LWM.
> LWM is defined as a percentage of Rx queue size with valid value of
> [0,99].
> Setting LWM to 0 means disable it, which is the default.
> When translate the percentage to queue descriptor number, the numbe
> should be bigger than 0 and less than queue size.
> Add LWM's configuration and query driver callbacks in eth_dev_ops.
> 
> Signed-off-by: Spike Du <spiked@nvidia.com>

One other objection, please don't invent yet another event channel
for this. It should be part of existing Rx interrupt logic.
  
Spike Du May 23, 2022, 2:18 a.m. UTC | #3
Hi,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Sunday, May 22, 2022 11:25 PM
> To: Spike Du <spiked@nvidia.com>
> Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-
> Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>; dev@dpdk.org;
> Raslan Darawsheh <rasland@nvidia.com>
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> External email: Use caution opening links or attachments
> 
> 
> On Sun, 22 May 2022 08:58:56 +0300
> Spike Du <spiked@nvidia.com> wrote:
> 
> > LWM(limit watermark) describes the fullness of a Rx queue. If the Rx
> > queue fullness is above LWM, the device will trigger the event
> > RTE_ETH_EVENT_RX_LWM.
> > LWM is defined as a percentage of Rx queue size with valid value of
> > [0,99].
> > Setting LWM to 0 means disable it, which is the default.
> > When translate the percentage to queue descriptor number, the numbe
> > should be bigger than 0 and less than queue size.
> > Add LWM's configuration and query driver callbacks in eth_dev_ops.
> >
> > Signed-off-by: Spike Du <spiked@nvidia.com>
> 
> One other objection, please don't invent yet another event channel for this.
> It should be part of existing Rx interrupt logic.

I think this is misunderstanding, the "event channel" is a specific concept in MLX5 PMD.
For the DPDK common code like testpmd and event register/callback, I'm using standard dpdk
interfaces.
  
Spike Du May 23, 2022, 3:01 a.m. UTC | #4
Hi, pls see below.

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Sunday, May 22, 2022 11:23 PM
> To: Spike Du <spiked@nvidia.com>
> Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-
> Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>; dev@dpdk.org;
> Raslan Darawsheh <rasland@nvidia.com>
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> External email: Use caution opening links or attachments
> 
> 
> On Sun, 22 May 2022 08:58:56 +0300
> Spike Du <spiked@nvidia.com> wrote:
> 
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > 04cff8ee10..687ae5ff29 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> >        */
> >       union rte_eth_rxseg *rx_seg;
> >
> > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > +     /**
> > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > +      * size. If Rx queue receives traffic higher than this percentage,
> > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > +      */
> > +     uint8_t lwm;
> > +
> > +     uint8_t reserved_bits[3];
> > +     uint32_t reserved_32s;
> > +     uint64_t reserved_64s;
> >       void *reserved_ptrs[2];   /**< Reserved for future fields */
> >  };
> >
> 
> Ok but, this is an ABI risk about this because reserved stuff was never
> required before.
> Whenever is a reserved field is introduced the code (in this case
> rte_ethdev_configure).
> 
> Best practice would have been to have the code require all reserved fields be
> 0 in earlier releases. In this case an application is like to define a watermark of
> zero; how will your code handle it.
Having watermark of 0 is desired, which is the default. LWM of 0 means the Rx
Queue's watermark is not monitored, hence no LWM event is generated.
> 
> Also, using 8 bits as percentage is different than how other API's handle this.
> Since Rx queue size is in packets, why is this not in packets?
The short answer is to simply the LWM configuration.
Rx queue descriptor is complex nowadays. 
For normal queue, user may configure LWM according to queue descriptor number easily.
But for below queues, it's not easy:
Take mprq as example, the testpmd cmd  options can be " -a 0000:03:00.0,rxqs_min_mprq=1,mprq_en=1,mprq_max_memcpy_len=465,mprq_log_stride_size=8,mprq_log_stride_num=3
-- --mbcache=512 -i  --nb-cores=7  --txd=1024 --rxd=1024 ", 
For MLX5 implementation,  the minimum "unit" in queue has 64 descriptors, the "unit" number is 16,  if you configure according to descriptor number(1024)
Here, you may easily set LWM as something like 512, but HW doesn't allow it, because 512 > 16. If you want the watermark to be half, the correct value is 8.
The same issue happens to feature like "Rx queue buffer split" where a packet can be split to multiple descriptors.
Using percentage doesn't have such issues, PMD will cover all the details.

> Also document what behavior of 0 is.
Sure. The behavior is like the old days without this feature, pls see above.

> Why introduce new query/set operations? This should just be part of the
> overall device configuration.
Due to different implementation. LWM can be a dynamic configuration which can help user design a flexible flow control.
User may feel ok with LWM of 80% to get high throughput, or later on with 50% to throttle the traffic responsively by handling LWM event in order to reduce drop.
Some driver like mlx5 may implement LWM event as one-time shot. When you receive LWM event, you need to reconfigure LWM in order to receive the event again, thus you will
not likely to be overwhelmed by the events.
These all require set operation.

For the query operation. The rte_event API rte_eth_dev_callback_process() is per-port API, it doesn't carry much information when an event happens.
When a LWM event happens, we need to know in which Rx queue it happens or optionally what's the current LWM percentage of this queue.
The query operation serves this purpose.


Regards,
Spike.
  
Morten Brørup May 23, 2022, 6:07 a.m. UTC | #5
> From: Spike Du [mailto:spiked@nvidia.com]
> Sent: Sunday, 22 May 2022 07.59
> 
> LWM(limit watermark) describes the fullness of a Rx queue. If the Rx
> queue fullness is above LWM, the device will trigger the event
> RTE_ETH_EVENT_RX_LWM.
> LWM is defined as a percentage of Rx queue size with valid value of
> [0,99].
> Setting LWM to 0 means disable it, which is the default.
> When translate the percentage to queue descriptor number, the numbe
> should be bigger than 0 and less than queue size.
> Add LWM's configuration and query driver callbacks in eth_dev_ops.
> 
> Signed-off-by: Spike Du <spiked@nvidia.com>
> ---


> @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
>  	 */
>  	union rte_eth_rxseg *rx_seg;
> 
> -	uint64_t reserved_64s[2]; /**< Reserved for future fields */
> +	/**
> +	 * Per-queue Rx limit watermark defined as percentage of Rx queue
> +	 * size. If Rx queue receives traffic higher than this
> percentage,
> +	 * the event RTE_ETH_EVENT_RX_LWM is triggered.
> +	 */
> +	uint8_t lwm;

Why percentage, why not 1/128th, or 1/16th? 2^N seems more logical, and I wonder if such high granularity is really necessary. Just a thought, it's not important.

If you stick with percentage, it only needs 7 bits, and you can make the remaining one bit reserved.

Also, please add here that 0 means disable.

> +
> +	uint8_t reserved_bits[3];
> +	uint32_t reserved_32s;
> +	uint64_t reserved_64s;
>  	void *reserved_ptrs[2];   /**< Reserved for future fields */
>  };
>
  
Thomas Monjalon May 23, 2022, 10:58 a.m. UTC | #6
23/05/2022 08:07, Morten Brørup:
> > +	uint8_t lwm;
> 
> Why percentage, why not 1/128th, or 1/16th? 2^N seems more logical, and I wonder if such high granularity is really necessary. Just a thought, it's not important.

I think percentage is the easiest to understand
and to share with other teams in design documents.

> If you stick with percentage, it only needs 7 bits, and you can make the remaining one bit reserved.
> 
> Also, please add here that 0 means disable.

Good idea.
  
Spike Du May 23, 2022, 2:10 p.m. UTC | #7
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Monday, May 23, 2022 6:59 PM
> To: Spike Du <spiked@nvidia.com>; Morten Brørup
> <mb@smartsharesystems.com>
> Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; dev@dpdk.org;
> Raslan Darawsheh <rasland@nvidia.com>
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> External email: Use caution opening links or attachments
> 
> 
> 23/05/2022 08:07, Morten Brørup:
> > > +   uint8_t lwm;
> >
> > Why percentage, why not 1/128th, or 1/16th? 2^N seems more logical, and
> I wonder if such high granularity is really necessary. Just a thought, it's not
> important.
> 
> I think percentage is the easiest to understand and to share with other teams
> in design documents.
> 
> > If you stick with percentage, it only needs 7 bits, and you can make the
> remaining one bit reserved.

Agree, will change to use 7 bits.
> >
> > Also, please add here that 0 means disable.

Sure.
> 
> Good idea.
>
  
Thomas Monjalon May 23, 2022, 2:39 p.m. UTC | #8
23/05/2022 16:10, Spike Du:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > > If you stick with percentage, it only needs 7 bits, and you can make the
> > remaining one bit reserved.
> 
> Agree, will change to use 7 bits.

I'm not sure it's worth introducing a bit field here.
  
Thomas Monjalon May 23, 2022, 9:45 p.m. UTC | #9
23/05/2022 05:01, Spike Du:
> From: Stephen Hemminger <stephen@networkplumber.org>
> > Spike Du <spiked@nvidia.com> wrote:
> > > --- a/lib/ethdev/rte_ethdev.h
> > > +++ b/lib/ethdev/rte_ethdev.h
> > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > >        */
> > >       union rte_eth_rxseg *rx_seg;
> > >
> > > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > +     /**
> > > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > +      * size. If Rx queue receives traffic higher than this percentage,
> > > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > +      */
> > > +     uint8_t lwm;
> > > +
> > > +     uint8_t reserved_bits[3];
> > > +     uint32_t reserved_32s;
> > > +     uint64_t reserved_64s;
> > 
> > Ok but, this is an ABI risk about this because reserved stuff was never
> > required before.

An ABI compatibility issue would be for an application compiled
with an old DPDK, and loading a new DPDK at runtime.
Let's think what would happen in such a case.

> > Whenever is a reserved field is introduced the code (in this case
> > rte_ethdev_configure).

rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
Then the library and drivers may interpret a wrong value.

> > Best practice would have been to have the code require all reserved fields be
> > 0 in earlier releases. In this case an application is like to define a watermark of
> > zero; how will your code handle it.
> 
> Having watermark of 0 is desired, which is the default. LWM of 0 means the Rx
> Queue's watermark is not monitored, hence no LWM event is generated.

The problem is to have a value not initialized.
I think the best approach is to not expose the LWM value
through this configuration structure.
If the need is to get the current value,
we should better add a field in the struct rte_eth_rxq_info.

[...]
> 
> > Why introduce new query/set operations? This should just be part of the
> > overall device configuration.

Thanks to the "set" function, we can avoid the ABI compat issue.

> Due to different implementation. LWM can be a dynamic configuration which can help user design a flexible flow control.
> User may feel ok with LWM of 80% to get high throughput, or later on with 50% to throttle the traffic responsively by handling LWM event in order to reduce drop.
> Some driver like mlx5 may implement LWM event as one-time shot. When you receive LWM event, you need to reconfigure LWM in order to receive the event again, thus you will
> not likely to be overwhelmed by the events.
> These all require set operation.

Yes it is better to allow dynamic watermark configuration,
not using the function rte_eth_rx_queue_setup().

> For the query operation. The rte_event API rte_eth_dev_callback_process() is per-port API, it doesn't carry much information when an event happens.
> When a LWM event happens, we need to know in which Rx queue it happens or optionally what's the current LWM percentage of this queue.
> The query operation serves this purpose.

Yes "query" has to be called in the event handler
because event structure is not specific to any event type.
  
Stephen Hemminger May 23, 2022, 10:54 p.m. UTC | #10
On Mon, 23 May 2022 03:01:20 +0000
Spike Du <spiked@nvidia.com> wrote:

> Hi, pls see below.
> 
> > -----Original Message-----
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > Sent: Sunday, May 22, 2022 11:23 PM
> > To: Spike Du <spiked@nvidia.com>
> > Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> > <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-
> > Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>; dev@dpdk.org;
> > Raslan Darawsheh <rasland@nvidia.com>
> > Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> > 
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Sun, 22 May 2022 08:58:56 +0300
> > Spike Du <spiked@nvidia.com> wrote:
> >   
> > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > 04cff8ee10..687ae5ff29 100644
> > > --- a/lib/ethdev/rte_ethdev.h
> > > +++ b/lib/ethdev/rte_ethdev.h
> > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > >        */
> > >       union rte_eth_rxseg *rx_seg;
> > >
> > > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > +     /**
> > > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > +      * size. If Rx queue receives traffic higher than this percentage,
> > > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > +      */
> > > +     uint8_t lwm;
> > > +
> > > +     uint8_t reserved_bits[3];
> > > +     uint32_t reserved_32s;
> > > +     uint64_t reserved_64s;
> > >       void *reserved_ptrs[2];   /**< Reserved for future fields */
> > >  };
> > >  
> > 
> > Ok but, this is an ABI risk about this because reserved stuff was never
> > required before.
> > Whenever is a reserved field is introduced the code (in this case
> > rte_ethdev_configure).
> > 
> > Best practice would have been to have the code require all reserved fields be
> > 0 in earlier releases. In this case an application is like to define a watermark of
> > zero; how will your code handle it.  
> Having watermark of 0 is desired, which is the default. LWM of 0 means the Rx
> Queue's watermark is not monitored, hence no LWM event is generated.
> > 
> > Also, using 8 bits as percentage is different than how other API's handle this.
> > Since Rx queue size is in packets, why is this not in packets?  
> The short answer is to simply the LWM configuration.
> Rx queue descriptor is complex nowadays. 
> For normal queue, user may configure LWM according to queue descriptor number easily.
> But for below queues, it's not easy:
> Take mprq as example, the testpmd cmd  options can be " -a 0000:03:00.0,rxqs_min_mprq=1,mprq_en=1,mprq_max_memcpy_len=465,mprq_log_stride_size=8,mprq_log_stride_num=3
> -- --mbcache=512 -i  --nb-cores=7  --txd=1024 --rxd=1024 ", 
> For MLX5 implementation,  the minimum "unit" in queue has 64 descriptors, the "unit" number is 16,  if you configure according to descriptor number(1024)
> Here, you may easily set LWM as something like 512, but HW doesn't allow it, because 512 > 16. If you want the watermark to be half, the correct value is 8.
> The same issue happens to feature like "Rx queue buffer split" where a packet can be split to multiple descriptors.
> Using percentage doesn't have such issues, PMD will cover all the details.
> 
> > Also document what behavior of 0 is.  
> Sure. The behavior is like the old days without this feature, pls see above.
> 
> > Why introduce new query/set operations? This should just be part of the
> > overall device configuration.  
> Due to different implementation. LWM can be a dynamic configuration which can help user design a flexible flow control.
> User may feel ok with LWM of 80% to get high throughput, or later on with 50% to throttle the traffic responsively by handling LWM event in order to reduce drop.
> Some driver like mlx5 may implement LWM event as one-time shot. When you receive LWM event, you need to reconfigure LWM in order to receive the event again, thus you will
> not likely to be overwhelmed by the events.
> These all require set operation.
> 
> For the query operation. The rte_event API rte_eth_dev_callback_process() is per-port API, it doesn't carry much information when an event happens.
> When a LWM event happens, we need to know in which Rx queue it happens or optionally what's the current LWM percentage of this queue.
> The query operation serves this purpose.
> 
> 
> Regards,
> Spike.
> 
> 

The bigger question is why does this have to be just MLX5 and why
can't it fit into the existing DPDK RX interrupt framework?

Linux and BSD have had this for years in their packet coalescing logic.
Ethtool provides ability to set lot of irq coalescing parameters like:

       ethtool -C|--coalesce devname [adaptive-rx on|off] [adaptive-tx on|off]
              [rx-usecs N] [rx-frames N] [rx-usecs-irq N] [rx-frames-irq N]
              [tx-usecs N] [tx-frames N] [tx-usecs-irq N] [tx-frames-irq N]
              [stats-block-usecs N] [pkt-rate-low N] [rx-usecs-low N]
              [rx-frames-low N] [tx-usecs-low N] [tx-frames-low N]
              [pkt-rate-high N] [rx-usecs-high N] [rx-frames-high N]
              [tx-usecs-high N] [tx-frames-high N] [sample-interval N]
              [cqe-mode-rx on|off] [cqe-mode-tx on|off]

It feels like this is just the DPDK version of a small subset of that.
Since many device already support IRQ coalescing, it would be best to build
one new API that has most of these. Rather than a MLX/Nvidia only API for
a single parameter.
  
Spike Du May 24, 2022, 2:50 a.m. UTC | #11
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Tuesday, May 24, 2022 5:46 AM
> To: Stephen Hemminger <stephen@networkplumber.org>; Spike Du
> <spiked@nvidia.com>
> Cc: dev@dpdk.org; Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; Raslan Darawsheh
> <rasland@nvidia.com>; ferruh.yigit@amd.com;
> andrew.rybchenko@oktetlabs.ru
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> External email: Use caution opening links or attachments
> 
> 
> 23/05/2022 05:01, Spike Du:
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > > Spike Du <spiked@nvidia.com> wrote:
> > > > --- a/lib/ethdev/rte_ethdev.h
> > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > > >        */
> > > >       union rte_eth_rxseg *rx_seg;
> > > >
> > > > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > > +     /**
> > > > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > > +      * size. If Rx queue receives traffic higher than this percentage,
> > > > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > > +      */
> > > > +     uint8_t lwm;
> > > > +
> > > > +     uint8_t reserved_bits[3];
> > > > +     uint32_t reserved_32s;
> > > > +     uint64_t reserved_64s;
> > >
> > > Ok but, this is an ABI risk about this because reserved stuff was
> > > never required before.
> 
> An ABI compatibility issue would be for an application compiled with an old
> DPDK, and loading a new DPDK at runtime.
> Let's think what would happen in such a case.
> 
> > > Whenever is a reserved field is introduced the code (in this case
> > > rte_ethdev_configure).
> 
> rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
> Then the library and drivers may interpret a wrong value.
> 
> > > Best practice would have been to have the code require all reserved
> > > fields be
> > > 0 in earlier releases. In this case an application is like to define
> > > a watermark of zero; how will your code handle it.
> >
> > Having watermark of 0 is desired, which is the default. LWM of 0 means
> > the Rx Queue's watermark is not monitored, hence no LWM event is
> generated.
> 
> The problem is to have a value not initialized.
> I think the best approach is to not expose the LWM value through this
> configuration structure.
> If the need is to get the current value, we should better add a field in the
> struct rte_eth_rxq_info.

At least from all the dpdk app/example code, rxconf is initialized to 0 then setup
The Rx queue, if user follows these examples we should not have ABI issue.
Since many people are concerned about rxconf change, it's ok to remove the LWM
Field there.
Yes, I think we can add lwm into rte_eth_rxq_info. If we can set Rx queue's attribute,
We should have a way to get it.

> 
> [...]
> >
> > > Why introduce new query/set operations? This should just be part of
> > > the overall device configuration.
> 
> Thanks to the "set" function, we can avoid the ABI compat issue.
> 
> > Due to different implementation. LWM can be a dynamic configuration
> which can help user design a flexible flow control.
> > User may feel ok with LWM of 80% to get high throughput, or later on with
> 50% to throttle the traffic responsively by handling LWM event in order to
> reduce drop.
> > Some driver like mlx5 may implement LWM event as one-time shot. When
> > you receive LWM event, you need to reconfigure LWM in order to receive
> the event again, thus you will not likely to be overwhelmed by the events.
> > These all require set operation.
> 
> Yes it is better to allow dynamic watermark configuration, not using the
> function rte_eth_rx_queue_setup().
> 
> > For the query operation. The rte_event API
> rte_eth_dev_callback_process() is per-port API, it doesn't carry much
> information when an event happens.
> > When a LWM event happens, we need to know in which Rx queue it
> happens or optionally what's the current LWM percentage of this queue.
> > The query operation serves this purpose.
> 
> Yes "query" has to be called in the event handler because event structure is
> not specific to any event type.
> 
>
  
Spike Du May 24, 2022, 3:46 a.m. UTC | #12
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, May 24, 2022 6:55 AM
> To: Spike Du <spiked@nvidia.com>
> Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-
> Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>; dev@dpdk.org;
> Raslan Darawsheh <rasland@nvidia.com>
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> External email: Use caution opening links or attachments
> 
> 
> On Mon, 23 May 2022 03:01:20 +0000
> Spike Du <spiked@nvidia.com> wrote:
> 
> > Hi, pls see below.
> >
> > > -----Original Message-----
> > > From: Stephen Hemminger <stephen@networkplumber.org>
> > > Sent: Sunday, May 22, 2022 11:23 PM
> > > To: Spike Du <spiked@nvidia.com>
> > > Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> > > <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-
> > > Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>; dev@dpdk.org;
> > > Raslan Darawsheh <rasland@nvidia.com>
> > > Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit
> > > watermark
> > >
> > > External email: Use caution opening links or attachments
> > >
> > >
> > > On Sun, 22 May 2022 08:58:56 +0300
> > > Spike Du <spiked@nvidia.com> wrote:
> > >
> > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > index
> > > > 04cff8ee10..687ae5ff29 100644
> > > > --- a/lib/ethdev/rte_ethdev.h
> > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > > >        */
> > > >       union rte_eth_rxseg *rx_seg;
> > > >
> > > > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > > +     /**
> > > > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > > +      * size. If Rx queue receives traffic higher than this percentage,
> > > > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > > +      */
> > > > +     uint8_t lwm;
> > > > +
> > > > +     uint8_t reserved_bits[3];
> > > > +     uint32_t reserved_32s;
> > > > +     uint64_t reserved_64s;
> > > >       void *reserved_ptrs[2];   /**< Reserved for future fields */
> > > >  };
> > > >
> > >
> > > Ok but, this is an ABI risk about this because reserved stuff was
> > > never required before.
> > > Whenever is a reserved field is introduced the code (in this case
> > > rte_ethdev_configure).
> > >
> > > Best practice would have been to have the code require all reserved
> > > fields be
> > > 0 in earlier releases. In this case an application is like to define
> > > a watermark of zero; how will your code handle it.
> > Having watermark of 0 is desired, which is the default. LWM of 0 means
> > the Rx Queue's watermark is not monitored, hence no LWM event is
> generated.
> > >
> > > Also, using 8 bits as percentage is different than how other API's handle
> this.
> > > Since Rx queue size is in packets, why is this not in packets?
> > The short answer is to simply the LWM configuration.
> > Rx queue descriptor is complex nowadays.
> > For normal queue, user may configure LWM according to queue descriptor
> number easily.
> > But for below queues, it's not easy:
> > Take mprq as example, the testpmd cmd  options can be " -a
> >
> 0000:03:00.0,rxqs_min_mprq=1,mprq_en=1,mprq_max_memcpy_len=465,
> mprq_lo
> > g_stride_size=8,mprq_log_stride_num=3
> > -- --mbcache=512 -i  --nb-cores=7  --txd=1024 --rxd=1024 ", For MLX5
> > implementation,  the minimum "unit" in queue has 64 descriptors, the
> > "unit" number is 16,  if you configure according to descriptor number(1024)
> Here, you may easily set LWM as something like 512, but HW doesn't allow it,
> because 512 > 16. If you want the watermark to be half, the correct value is 8.
> > The same issue happens to feature like "Rx queue buffer split" where a
> packet can be split to multiple descriptors.
> > Using percentage doesn't have such issues, PMD will cover all the details.
> >
> > > Also document what behavior of 0 is.
> > Sure. The behavior is like the old days without this feature, pls see above.
> >
> > > Why introduce new query/set operations? This should just be part of
> > > the overall device configuration.
> > Due to different implementation. LWM can be a dynamic configuration
> which can help user design a flexible flow control.
> > User may feel ok with LWM of 80% to get high throughput, or later on with
> 50% to throttle the traffic responsively by handling LWM event in order to
> reduce drop.
> > Some driver like mlx5 may implement LWM event as one-time shot. When
> > you receive LWM event, you need to reconfigure LWM in order to receive
> the event again, thus you will not likely to be overwhelmed by the events.
> > These all require set operation.
> >
> > For the query operation. The rte_event API
> rte_eth_dev_callback_process() is per-port API, it doesn't carry much
> information when an event happens.
> > When a LWM event happens, we need to know in which Rx queue it
> happens or optionally what's the current LWM percentage of this queue.
> > The query operation serves this purpose.
> >
> >
> > Regards,
> > Spike.
> >
> >
> 
> The bigger question is why does this have to be just MLX5 and why can't it fit
> into the existing DPDK RX interrupt framework?
> 
> Linux and BSD have had this for years in their packet coalescing logic.
> Ethtool provides ability to set lot of irq coalescing parameters like:
> 
>        ethtool -C|--coalesce devname [adaptive-rx on|off] [adaptive-tx on|off]
>               [rx-usecs N] [rx-frames N] [rx-usecs-irq N] [rx-frames-irq N]
>               [tx-usecs N] [tx-frames N] [tx-usecs-irq N] [tx-frames-irq N]
>               [stats-block-usecs N] [pkt-rate-low N] [rx-usecs-low N]
>               [rx-frames-low N] [tx-usecs-low N] [tx-frames-low N]
>               [pkt-rate-high N] [rx-usecs-high N] [rx-frames-high N]
>               [tx-usecs-high N] [tx-frames-high N] [sample-interval N]
>               [cqe-mode-rx on|off] [cqe-mode-tx on|off]
> 
> It feels like this is just the DPDK version of a small subset of that.
> Since many device already support IRQ coalescing, it would be best to build
> one new API that has most of these. Rather than a MLX/Nvidia only API for a
> single parameter.

I take MLX5 as example here because I only know how mlx5 works, I don't understand
How other NICs work.  It doesn't mean I try to change common code only to satisfy 
Mlx5 needs.

I think interrupt coalesce is different from LWM:
Interrupt coalesce is delay interrupt until a batch of packets(or an interval) is received. 
LWM intends to notify when a Rx queue is out of buffer. Delaying interrupt can't detect
A specific fullness value of the Rx queue, but LWM can if driver supports it.


Regards,
Spike.
  
Andrew Rybchenko May 24, 2022, 6:35 a.m. UTC | #13
On 5/23/22 17:39, Thomas Monjalon wrote:
> 23/05/2022 16:10, Spike Du:
>>> From: Thomas Monjalon <thomas@monjalon.net>
>>>> If you stick with percentage, it only needs 7 bits, and you can make the
>>> remaining one bit reserved.
>>
>> Agree, will change to use 7 bits.
> 
> I'm not sure it's worth introducing a bit field here.

+1
  
Thomas Monjalon May 24, 2022, 8:18 a.m. UTC | #14
24/05/2022 04:50, Spike Du:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 23/05/2022 05:01, Spike Du:
> > > From: Stephen Hemminger <stephen@networkplumber.org>
> > > > Spike Du <spiked@nvidia.com> wrote:
> > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > > > >        */
> > > > >       union rte_eth_rxseg *rx_seg;
> > > > >
> > > > > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > > > +     /**
> > > > > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > > > +      * size. If Rx queue receives traffic higher than this percentage,
> > > > > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > > > +      */
> > > > > +     uint8_t lwm;
> > > > > +
> > > > > +     uint8_t reserved_bits[3];
> > > > > +     uint32_t reserved_32s;
> > > > > +     uint64_t reserved_64s;
> > > >
> > > > Ok but, this is an ABI risk about this because reserved stuff was
> > > > never required before.
> > 
> > An ABI compatibility issue would be for an application compiled with an old
> > DPDK, and loading a new DPDK at runtime.
> > Let's think what would happen in such a case.
> > 
> > > > Whenever is a reserved field is introduced the code (in this case
> > > > rte_ethdev_configure).
> > 
> > rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
> > Then the library and drivers may interpret a wrong value.
> > 
> > > > Best practice would have been to have the code require all reserved
> > > > fields be
> > > > 0 in earlier releases. In this case an application is like to define
> > > > a watermark of zero; how will your code handle it.
> > >
> > > Having watermark of 0 is desired, which is the default. LWM of 0 means
> > > the Rx Queue's watermark is not monitored, hence no LWM event is
> > generated.
> > 
> > The problem is to have a value not initialized.
> > I think the best approach is to not expose the LWM value through this
> > configuration structure.
> > If the need is to get the current value, we should better add a field in the
> > struct rte_eth_rxq_info.
> 
> At least from all the dpdk app/example code, rxconf is initialized to 0 then setup
> The Rx queue, if user follows these examples we should not have ABI issue.
> Since many people are concerned about rxconf change, it's ok to remove the LWM
> Field there.
> Yes, I think we can add lwm into rte_eth_rxq_info. If we can set Rx queue's attribute,
> We should have a way to get it.

Unfortunately we cannot rely on examples for ABI compatibility.
My suggestion of moving the field in rte_eth_rxq_info
is not obvious because it could change the size of the struct.
But thanks to __rte_cache_min_aligned, it is OK.
Running pahole on this struct shows we have 50 bytes free:
        /* size: 128, cachelines: 2, members: 6 */
        /* padding: 50 */

The other option would be to get the LWM value with a "get" function.

What others prefer?
  
Morten Brørup May 24, 2022, 9:40 a.m. UTC | #15
> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Tuesday, 24 May 2022 08.35
> 
> On 5/23/22 17:39, Thomas Monjalon wrote:
> > 23/05/2022 16:10, Spike Du:
> >>> From: Thomas Monjalon <thomas@monjalon.net>
> >>>> If you stick with percentage, it only needs 7 bits, and you can
> make the
> >>> remaining one bit reserved.
> >>
> >> Agree, will change to use 7 bits.
> >
> > I'm not sure it's worth introducing a bit field here.
> 
> +1

It's not important for me either. Just stick with the full byte. I mainly mentioned it in case you considered reducing the parameter granularity to 1/16th instead of percent, so you would only need 4 bits.
  
Andrew Rybchenko May 25, 2022, 12:59 p.m. UTC | #16
On 5/24/22 11:18, Thomas Monjalon wrote:
> 24/05/2022 04:50, Spike Du:
>> From: Thomas Monjalon <thomas@monjalon.net>
>>> 23/05/2022 05:01, Spike Du:
>>>> From: Stephen Hemminger <stephen@networkplumber.org>
>>>>> Spike Du <spiked@nvidia.com> wrote:
>>>>>> --- a/lib/ethdev/rte_ethdev.h
>>>>>> +++ b/lib/ethdev/rte_ethdev.h
>>>>>> @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
>>>>>>         */
>>>>>>        union rte_eth_rxseg *rx_seg;
>>>>>>
>>>>>> -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
>>>>>> +     /**
>>>>>> +      * Per-queue Rx limit watermark defined as percentage of Rx queue
>>>>>> +      * size. If Rx queue receives traffic higher than this percentage,
>>>>>> +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
>>>>>> +      */
>>>>>> +     uint8_t lwm;
>>>>>> +
>>>>>> +     uint8_t reserved_bits[3];
>>>>>> +     uint32_t reserved_32s;
>>>>>> +     uint64_t reserved_64s;
>>>>>
>>>>> Ok but, this is an ABI risk about this because reserved stuff was
>>>>> never required before.
>>>
>>> An ABI compatibility issue would be for an application compiled with an old
>>> DPDK, and loading a new DPDK at runtime.
>>> Let's think what would happen in such a case.
>>>
>>>>> Whenever is a reserved field is introduced the code (in this case
>>>>> rte_ethdev_configure).
>>>
>>> rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
>>> Then the library and drivers may interpret a wrong value.
>>>
>>>>> Best practice would have been to have the code require all reserved
>>>>> fields be
>>>>> 0 in earlier releases. In this case an application is like to define
>>>>> a watermark of zero; how will your code handle it.
>>>>
>>>> Having watermark of 0 is desired, which is the default. LWM of 0 means
>>>> the Rx Queue's watermark is not monitored, hence no LWM event is
>>> generated.
>>>
>>> The problem is to have a value not initialized.
>>> I think the best approach is to not expose the LWM value through this
>>> configuration structure.
>>> If the need is to get the current value, we should better add a field in the
>>> struct rte_eth_rxq_info.
>>
>> At least from all the dpdk app/example code, rxconf is initialized to 0 then setup
>> The Rx queue, if user follows these examples we should not have ABI issue.
>> Since many people are concerned about rxconf change, it's ok to remove the LWM
>> Field there.
>> Yes, I think we can add lwm into rte_eth_rxq_info. If we can set Rx queue's attribute,
>> We should have a way to get it.
> 
> Unfortunately we cannot rely on examples for ABI compatibility.
> My suggestion of moving the field in rte_eth_rxq_info
> is not obvious because it could change the size of the struct.
> But thanks to __rte_cache_min_aligned, it is OK.
> Running pahole on this struct shows we have 50 bytes free:
>          /* size: 128, cachelines: 2, members: 6 */
>          /* padding: 50 */
> 
> The other option would be to get the LWM value with a "get" function.
> 
> What others prefer?

If I'm not mistaken the changeset breaks ABI in any case since
it adds a new event and changes MAX. If so, I'd wait for the
next ABI breaking release and do not touch reserved fields.
  
Thomas Monjalon May 25, 2022, 1:58 p.m. UTC | #17
25/05/2022 14:59, Andrew Rybchenko:
> On 5/24/22 11:18, Thomas Monjalon wrote:
> > 24/05/2022 04:50, Spike Du:
> >> From: Thomas Monjalon <thomas@monjalon.net>
> >>> 23/05/2022 05:01, Spike Du:
> >>>> From: Stephen Hemminger <stephen@networkplumber.org>
> >>>>> Spike Du <spiked@nvidia.com> wrote:
> >>>>>> --- a/lib/ethdev/rte_ethdev.h
> >>>>>> +++ b/lib/ethdev/rte_ethdev.h
> >>>>>> @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> >>>>>>         */
> >>>>>>        union rte_eth_rxseg *rx_seg;
> >>>>>>
> >>>>>> -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> >>>>>> +     /**
> >>>>>> +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> >>>>>> +      * size. If Rx queue receives traffic higher than this percentage,
> >>>>>> +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> >>>>>> +      */
> >>>>>> +     uint8_t lwm;
> >>>>>> +
> >>>>>> +     uint8_t reserved_bits[3];
> >>>>>> +     uint32_t reserved_32s;
> >>>>>> +     uint64_t reserved_64s;
> >>>>>
> >>>>> Ok but, this is an ABI risk about this because reserved stuff was
> >>>>> never required before.
> >>>
> >>> An ABI compatibility issue would be for an application compiled with an old
> >>> DPDK, and loading a new DPDK at runtime.
> >>> Let's think what would happen in such a case.
> >>>
> >>>>> Whenever is a reserved field is introduced the code (in this case
> >>>>> rte_ethdev_configure).
> >>>
> >>> rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
> >>> Then the library and drivers may interpret a wrong value.
> >>>
> >>>>> Best practice would have been to have the code require all reserved
> >>>>> fields be
> >>>>> 0 in earlier releases. In this case an application is like to define
> >>>>> a watermark of zero; how will your code handle it.
> >>>>
> >>>> Having watermark of 0 is desired, which is the default. LWM of 0 means
> >>>> the Rx Queue's watermark is not monitored, hence no LWM event is
> >>> generated.
> >>>
> >>> The problem is to have a value not initialized.
> >>> I think the best approach is to not expose the LWM value through this
> >>> configuration structure.
> >>> If the need is to get the current value, we should better add a field in the
> >>> struct rte_eth_rxq_info.
> >>
> >> At least from all the dpdk app/example code, rxconf is initialized to 0 then setup
> >> The Rx queue, if user follows these examples we should not have ABI issue.
> >> Since many people are concerned about rxconf change, it's ok to remove the LWM
> >> Field there.
> >> Yes, I think we can add lwm into rte_eth_rxq_info. If we can set Rx queue's attribute,
> >> We should have a way to get it.
> > 
> > Unfortunately we cannot rely on examples for ABI compatibility.
> > My suggestion of moving the field in rte_eth_rxq_info
> > is not obvious because it could change the size of the struct.
> > But thanks to __rte_cache_min_aligned, it is OK.
> > Running pahole on this struct shows we have 50 bytes free:
> >          /* size: 128, cachelines: 2, members: 6 */
> >          /* padding: 50 */
> > 
> > The other option would be to get the LWM value with a "get" function.
> > 
> > What others prefer?
> 
> If I'm not mistaken the changeset breaks ABI in any case since
> it adds a new event and changes MAX.

I think we can consider it as not a breakage (a rule should be added).
Last time we had to update this enum, this was the conclusion:
from https://git.dpdk.org/dpdk/commit/?id=44bf3c796be3f
"
The new event type addition in the enum is flagged as an ABI breakage,
so an ignore rule is added for these reasons:
- It is not changing value of existing types (except MAX)
- The new value is not used by existing API if the event is not
  registered
In general, it is safe adding new ethdev event types at the end of the
enum, because of event callback registration mechanism.
"

> If so, I'd wait for the
> next ABI breaking release and do not touch reserved fields.

In any case, rte_eth_rxconf is not a good fit
because we have a separate function for configuration.
It should be either in rte_eth_rxq_info or a specific "get" function.
  
Andrew Rybchenko May 25, 2022, 2:23 p.m. UTC | #18
On 5/25/22 16:58, Thomas Monjalon wrote:
> 25/05/2022 14:59, Andrew Rybchenko:
>> On 5/24/22 11:18, Thomas Monjalon wrote:
>>> 24/05/2022 04:50, Spike Du:
>>>> From: Thomas Monjalon <thomas@monjalon.net>
>>>>> 23/05/2022 05:01, Spike Du:
>>>>>> From: Stephen Hemminger <stephen@networkplumber.org>
>>>>>>> Spike Du <spiked@nvidia.com> wrote:
>>>>>>>> --- a/lib/ethdev/rte_ethdev.h
>>>>>>>> +++ b/lib/ethdev/rte_ethdev.h
>>>>>>>> @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
>>>>>>>>          */
>>>>>>>>         union rte_eth_rxseg *rx_seg;
>>>>>>>>
>>>>>>>> -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
>>>>>>>> +     /**
>>>>>>>> +      * Per-queue Rx limit watermark defined as percentage of Rx queue
>>>>>>>> +      * size. If Rx queue receives traffic higher than this percentage,
>>>>>>>> +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
>>>>>>>> +      */
>>>>>>>> +     uint8_t lwm;
>>>>>>>> +
>>>>>>>> +     uint8_t reserved_bits[3];
>>>>>>>> +     uint32_t reserved_32s;
>>>>>>>> +     uint64_t reserved_64s;
>>>>>>>
>>>>>>> Ok but, this is an ABI risk about this because reserved stuff was
>>>>>>> never required before.
>>>>>
>>>>> An ABI compatibility issue would be for an application compiled with an old
>>>>> DPDK, and loading a new DPDK at runtime.
>>>>> Let's think what would happen in such a case.
>>>>>
>>>>>>> Whenever is a reserved field is introduced the code (in this case
>>>>>>> rte_ethdev_configure).
>>>>>
>>>>> rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
>>>>> Then the library and drivers may interpret a wrong value.
>>>>>
>>>>>>> Best practice would have been to have the code require all reserved
>>>>>>> fields be
>>>>>>> 0 in earlier releases. In this case an application is like to define
>>>>>>> a watermark of zero; how will your code handle it.
>>>>>>
>>>>>> Having watermark of 0 is desired, which is the default. LWM of 0 means
>>>>>> the Rx Queue's watermark is not monitored, hence no LWM event is
>>>>> generated.
>>>>>
>>>>> The problem is to have a value not initialized.
>>>>> I think the best approach is to not expose the LWM value through this
>>>>> configuration structure.
>>>>> If the need is to get the current value, we should better add a field in the
>>>>> struct rte_eth_rxq_info.
>>>>
>>>> At least from all the dpdk app/example code, rxconf is initialized to 0 then setup
>>>> The Rx queue, if user follows these examples we should not have ABI issue.
>>>> Since many people are concerned about rxconf change, it's ok to remove the LWM
>>>> Field there.
>>>> Yes, I think we can add lwm into rte_eth_rxq_info. If we can set Rx queue's attribute,
>>>> We should have a way to get it.
>>>
>>> Unfortunately we cannot rely on examples for ABI compatibility.
>>> My suggestion of moving the field in rte_eth_rxq_info
>>> is not obvious because it could change the size of the struct.
>>> But thanks to __rte_cache_min_aligned, it is OK.
>>> Running pahole on this struct shows we have 50 bytes free:
>>>           /* size: 128, cachelines: 2, members: 6 */
>>>           /* padding: 50 */
>>>
>>> The other option would be to get the LWM value with a "get" function.
>>>
>>> What others prefer?
>>
>> If I'm not mistaken the changeset breaks ABI in any case since
>> it adds a new event and changes MAX.
> 
> I think we can consider it as not a breakage (a rule should be added).
> Last time we had to update this enum, this was the conclusion:
> from https://git.dpdk.org/dpdk/commit/?id=44bf3c796be3f
> "
> The new event type addition in the enum is flagged as an ABI breakage,
> so an ignore rule is added for these reasons:
> - It is not changing value of existing types (except MAX)
> - The new value is not used by existing API if the event is not
>    registered
> In general, it is safe adding new ethdev event types at the end of the
> enum, because of event callback registration mechanism.
> "

I see. Makes sense. Thanks for the information.

>> If so, I'd wait for the
>> next ABI breaking release and do not touch reserved fields.
> 
> In any case, rte_eth_rxconf is not a good fit
> because we have a separate function for configuration.

Yes, it is better to avoid two ways to configure the same
thing.

> It should be either in rte_eth_rxq_info or a specific "get" function.

I see no point to introduce specific get function for a single
value. I think that rte_eth_rxq_info is the right way to get
current value.
  

Patch

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc21d8..12ec5e7e19 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -470,6 +470,23 @@  typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev *dev,
 				    const struct rte_eth_rxconf *rx_conf,
 				    struct rte_mempool *mb_pool);
 
+/**
+ * @internal Set Rx queue limit watermark.
+ * see @rte_eth_rx_lwm_set()
+ */
+typedef int (*eth_rx_queue_lwm_set_t)(struct rte_eth_dev *dev,
+				      uint16_t rx_queue_id,
+				      uint8_t lwm);
+
+/**
+ * @internal Query queue limit watermark.
+ * see @rte_eth_rx_lwm_query()
+ */
+
+typedef int (*eth_rx_queue_lwm_query_t)(struct rte_eth_dev *dev,
+					uint16_t *rx_queue_id,
+					uint8_t *lwm);
+
 /** @internal Setup a transmit queue of an Ethernet device. */
 typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
 				    uint16_t tx_queue_id,
@@ -1168,6 +1185,11 @@  struct eth_dev_ops {
 	/** Priority flow control queue configure */
 	priority_flow_ctrl_queue_config_t priority_flow_ctrl_queue_config;
 
+	/** Set Rx queue limit watermark */
+	eth_rx_queue_lwm_set_t rx_queue_lwm_set;
+	/** Query Rx queue limit watermark */
+	eth_rx_queue_lwm_query_t rx_queue_lwm_query;
+
 	/** Set Unicast Table Array */
 	eth_uc_hash_table_set_t    uc_hash_table_set;
 	/** Set Unicast hash bitmap */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 8520aec561..0a46c71288 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -4429,6 +4429,58 @@  int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
 							queue_idx, tx_rate));
 }
 
+int rte_eth_rx_lwm_set(uint16_t port_id, uint16_t queue_id,
+		       uint8_t lwm)
+{
+	struct rte_eth_dev *dev;
+	struct rte_eth_dev_info dev_info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	if (queue_id > dev_info.max_rx_queues) {
+		RTE_ETHDEV_LOG(ERR,
+			"Set queue LWM:port %u: invalid queue ID=%u.\n",
+			port_id, queue_id);
+		return -EINVAL;
+	}
+
+	if (lwm > 99)
+		return -EINVAL;
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_lwm_set, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_lwm_set)(dev,
+							     queue_id, lwm));
+}
+
+int rte_eth_rx_lwm_query(uint16_t port_id, uint16_t *queue_id,
+			 uint8_t *lwm)
+{
+	struct rte_eth_dev_info dev_info;
+	struct rte_eth_dev *dev;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	if (queue_id == NULL)
+		return -EINVAL;
+	if (*queue_id >= dev_info.max_rx_queues)
+		*queue_id = 0;
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_lwm_query, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_lwm_query)(dev,
+							     queue_id, lwm));
+}
+
 RTE_INIT(eth_dev_init_fp_ops)
 {
 	uint32_t i;
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04cff8ee10..687ae5ff29 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1249,7 +1249,16 @@  struct rte_eth_rxconf {
 	 */
 	union rte_eth_rxseg *rx_seg;
 
-	uint64_t reserved_64s[2]; /**< Reserved for future fields */
+	/**
+	 * Per-queue Rx limit watermark defined as percentage of Rx queue
+	 * size. If Rx queue receives traffic higher than this percentage,
+	 * the event RTE_ETH_EVENT_RX_LWM is triggered.
+	 */
+	uint8_t lwm;
+
+	uint8_t reserved_bits[3];
+	uint32_t reserved_32s;
+	uint64_t reserved_64s;
 	void *reserved_ptrs[2];   /**< Reserved for future fields */
 };
 
@@ -3668,6 +3677,64 @@  int rte_eth_dev_get_vlan_offload(uint16_t port_id);
  */
 int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int on);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set Rx queue based limit watermark.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The index of the receive queue.
+ * @param lwm
+ *  The limit watermark percentage of Rx queue size which describes
+ *  the fullness of Rx queue. If the Rx queue fullness is above LWM,
+ *  the device will trigger the event RTE_ETH_EVENT_RX_LWM.
+ *  [1-99] to set a new LWM.
+ *  0 to disable watermark monitoring.
+ *
+ * @return
+ *   - 0 if successful.
+ *   - negative if failed.
+ */
+__rte_experimental
+int rte_eth_rx_lwm_set(uint16_t port_id, uint16_t queue_id, uint8_t lwm);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Query Rx queue based limit watermark.
+ * The function queries all queues in the port circularly until one
+ * pending LWM event is found or no pending LWM event is found.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The API caller sets the starting Rx queue id in the pointer.
+ *  If the queue_id is bigger than maximum queue id of the port,
+ *  it's rewinded to 0 so that application can keep calling
+ *  this function to handle all pending LWM events in the queues
+ *  with a simple increment between calls.
+ *  If a Rx queue has pending lwm event, the pointer is updated
+ *  with this Rx queue id; otherwise this pointer's content is
+ *  unchanged.
+ * @param lwm
+ *  The pointer to the limit watermark percentage of Rx queue.
+ *  If Rx queue with pending lwm event is found, the queue's LWM
+ *  percentage is stored in this pointer, otherwise the pointer's
+ *  content is unchanged.
+ *
+ * @return
+ *   - 1 if a Rx queue with pending lwm event is found.
+ *   - 0 if no Rx queue with pending lwm event is found.
+ *   - -EINVAL if queue_id is NULL.
+ */
+__rte_experimental
+int rte_eth_rx_lwm_query(uint16_t port_id, uint16_t *queue_id,
+			 uint8_t *lwm);
+
 typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
 		void *userdata);
 
@@ -3873,6 +3940,11 @@  enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	/**
+	 *  watermark value is exceeded in a queue.
+	 *  see @rte_eth_rx_lwm_set()
+	 */
+	RTE_ETH_EVENT_RX_LWM,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 20391ab29e..5cf44c5d1d 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -279,6 +279,10 @@  EXPERIMENTAL {
 	rte_flow_async_action_handle_create;
 	rte_flow_async_action_handle_destroy;
 	rte_flow_async_action_handle_update;
+
+	# added in 22.07
+	rte_eth_rx_lwm_set;
+	rte_eth_rx_lwm_query;
 };
 
 INTERNAL {