[dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue

Xueming(Steven) Li xuemingl at nvidia.com
Thu Aug 12 16:35:25 CEST 2021



> -----Original Message-----
> From: dev <dev-bounces at dpdk.org> On Behalf Of Xueming(Steven) Li
> Sent: Wednesday, August 11, 2021 8:59 PM
> To: Ferruh Yigit <ferruh.yigit at intel.com>; Jerin Jacob <jerinjacobk at gmail.com>
> Cc: dpdk-dev <dev at dpdk.org>; NBU-Contact-Thomas Monjalon <thomas at monjalon.net>; Andrew Rybchenko
> <andrew.rybchenko at oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> 
> 
> > -----Original Message-----
> > From: Ferruh Yigit <ferruh.yigit at intel.com>
> > Sent: Wednesday, August 11, 2021 8:04 PM
> > To: Xueming(Steven) Li <xuemingl at nvidia.com>; Jerin Jacob
> > <jerinjacobk at gmail.com>
> > Cc: dpdk-dev <dev at dpdk.org>; NBU-Contact-Thomas Monjalon
> > <thomas at monjalon.net>; Andrew Rybchenko
> > <andrew.rybchenko at oktetlabs.ru>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Jerin Jacob <jerinjacobk at gmail.com>
> > >> Sent: Wednesday, August 11, 2021 4:03 PM
> > >> To: Xueming(Steven) Li <xuemingl at nvidia.com>
> > >> Cc: dpdk-dev <dev at dpdk.org>; Ferruh Yigit <ferruh.yigit at intel.com>;
> > >> NBU-Contact-Thomas Monjalon <thomas at monjalon.net>; Andrew Rybchenko
> > >> <andrew.rybchenko at oktetlabs.ru>
> > >> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx
> > >> queue
> > >>
> > >> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl at nvidia.com> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: Jerin Jacob <jerinjacobk at gmail.com>
> > >>>> Sent: Monday, August 9, 2021 9:51 PM
> > >>>> To: Xueming(Steven) Li <xuemingl at nvidia.com>
> > >>>> Cc: dpdk-dev <dev at dpdk.org>; Ferruh Yigit
> > >>>> <ferruh.yigit at intel.com>; NBU-Contact-Thomas Monjalon
> > >>>> <thomas at monjalon.net>; Andrew Rybchenko
> > >>>> <andrew.rybchenko at oktetlabs.ru>
> > >>>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx
> > >>>> queue
> > >>>>
> > >>>> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl at nvidia.com> wrote:
> > >>>>>
> > >>>>> In the current DPDK framework, each RX queue is pre-loaded with
> > >>>>> mbufs for incoming packets. As the number of representors in a
> > >>>>> switch domain scales out, the memory consumption becomes
> > >>>>> significant. Most importantly, polling all ports leads to high
> > >>>>> cache-miss rates, high latency and low throughput.
> > >>>>>
> > >>>>> This patch introduces a shared RX queue. Ports with the same
> > >>>>> configuration in a switch domain can share an RX queue set by
> > >>>>> specifying a sharing group. Polling any queue using the same
> > >>>>> shared RX queue receives packets from all member ports. The
> > >>>>> source port is identified by mbuf->port.
> > >>>>>
> > >>>>> The number of queues per port in a shared group should be
> > >>>>> identical, and queue indexes are mapped 1:1 within the group.
> > >>>>>
> > >>>>> A shared RX queue is supposed to be polled on the same thread.
> > >>>>>
> > >>>>> Multiple groups are supported via group ID.
> > >>>>
> > >>>> Is this offload specific to representors? If so, can the name be changed to refer specifically to representors?
> > >>>
> > >>> Yes, both the PF and representors in a switch domain can take advantage of it.
> > >>>
> > >>>> If it is for a generic case, how will the flow ordering be maintained?
> > >>>
> > >>> Not quite sure that I understood your question. The control path
> > >>> is almost the same as before: PF and representor ports are still needed, and rte flows are not impacted.
> > >>> Queues are still needed for each member port; descriptors (mbufs) will
> > >>> be supplied from the shared Rx queue in my PMD implementation.
> > >>
> > >> My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > >> offload and multiple ethdev receive queues land in the same receive
> > >> queue, how is the flow order maintained for the respective receive
> > >> queues?
> > >
> > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in the case of a shared rxq:
> > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward it to the target fs.
> > > Packets from the same source port can be grouped into a small burst
> > > for processing, which accelerates performance if traffic comes from a
> > > limited number of ports. I'll introduce a common API to do shared-rxq
> > > forwarding, called with a packet-handling callback, so it suits all
> > > forwarding engines. Will send patches soon.
> > >
> >
> > All ports will put their packets into the same queue (shared queue),
> > right? Does this mean only a single core will poll? What will happen if multiple cores poll it; won't that cause problems?
> 
> This has been mentioned in the commit log: the shared rxq is supposed to be polled by a single thread (core). I think it should be a "MUST".
> Results are unexpected if multiple cores are polling; that's why I added a polling schedule check in testpmd.

V2 with testpmd code uploaded, please check.

> Similarly, for the rx/tx burst functions, a queue can't be polled on multiple threads (cores), and for performance reasons there is no such
> check in the API.
> 
> If users want to utilize multiple cores to distribute workloads, it's possible to define more groups; queues in different groups can be
> polled on multiple cores.
> 
> It's possible to poll every member port in a group, but it is not necessary; any port in the group can be polled to get packets for all
> ports in the group.
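A minimal polling sketch under these assumptions, using the names defined in this patch (DPDK API, not compilable standalone; `proxy_port` is whichever group member the application chooses to poll and is a hypothetical name):

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Poll one member port of the shared group: the burst may contain
 * packets from every port in the group, distinguished by mbuf->port. */
static void
poll_shared_group(uint16_t proxy_port, uint16_t queue_id)
{
	struct rte_mbuf *burst[32];
	uint16_t nb = rte_eth_rx_burst(proxy_port, queue_id, burst, 32);

	for (uint16_t i = 0; i < nb; i++) {
		uint16_t src = burst[i]->port; /* real source port */
		/* ... dispatch to the handler for port 'src' ... */
		(void)src;
		rte_pktmbuf_free(burst[i]);
	}
}
```

This loop must run on a single core per group, per the "MUST" above.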
> 
> If member ports are subject to hot plug/remove, it's possible to create a vdev with the same queue number, copy the rxq objects, and poll
> the vdev as a dedicated proxy for the group.
> 
> >
> > And if this requires specific changes in the application, I am not
> > sure about the solution; can't this work in a way that is transparent to the application?
> 
> Yes, we considered different options in the design stage. One possible solution is to cache received packets in rings; this can be done at
> the eth layer, but I'm afraid it brings fewer benefits, and the user still has to be aware of multi-core polling.
> This could be done as a wrapper PMD later, but that is more effort.
> 
> >
> > Overall, is this for optimizing memory for the port representors? If
> > so, can't we have a representor-specific solution? Reducing the scope can reduce the complexity it brings.
> 
> This feature supports both PF and representors, and yes, the major issue is the memory footprint of representors. Polling all representors
> also introduces more CPU cache-miss latency. This feature essentially aggregates all ports in a group as one port.
> On the other hand, it's useful for rte flow to create offloading flows using a representor as a regular port ID.
> 
> Any new solution/suggestion is welcome; my head is buried in PMD code :)
> 
> >
> > >> If this offload is only useful for the representor case, can we
> > >> make it specific to representors by changing its name and scope?
> > >
> > > It works for both PF and representors in the same switch domain; for an application like OVS, only a few changes are needed.
> > >
> > >>
> > >>
> > >>>
> > >>>>
> > >>>>>
> > >>>>> Signed-off-by: Xueming Li <xuemingl at nvidia.com>
> > >>>>> ---
> > >>>>>  doc/guides/nics/features.rst                    | 11 +++++++++++
> > >>>>>  doc/guides/nics/features/default.ini            |  1 +
> > >>>>>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > >>>>>  lib/ethdev/rte_ethdev.c                         |  1 +
> > >>>>>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > >>>>>  5 files changed, 30 insertions(+)
> > >>>>>
> > >>>>> diff --git a/doc/guides/nics/features.rst
> > >>>>> b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554
> > >>>>> 100644
> > >>>>> --- a/doc/guides/nics/features.rst
> > >>>>> +++ b/doc/guides/nics/features.rst
> > >>>>> @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > >>>>>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > >>>>>
> > >>>>>
> > >>>>> +.. _nic_features_shared_rx_queue:
> > >>>>> +
> > >>>>> +Shared Rx queue
> > >>>>> +---------------
> > >>>>> +
> > >>>>> +Supports shared Rx queue for ports in same switch domain.
> > >>>>> +
> > >>>>> +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > >>>>> +* **[provides] mbuf**: ``mbuf.port``.
> > >>>>> +
> > >>>>> +
> > >>>>>  .. _nic_features_packet_type_parsing:
> > >>>>>
> > >>>>>  Packet type parsing
> > >>>>> diff --git a/doc/guides/nics/features/default.ini
> > >>>>> b/doc/guides/nics/features/default.ini
> > >>>>> index 754184ddd4..ebeb4c1851 100644
> > >>>>> --- a/doc/guides/nics/features/default.ini
> > >>>>> +++ b/doc/guides/nics/features/default.ini
> > >>>>> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > >>>>>  Queue start/stop     =
> > >>>>>  Runtime Rx queue setup =
> > >>>>>  Runtime Tx queue setup =
> > >>>>> +Shared Rx queue      =
> > >>>>>  Burst mode info      =
> > >>>>>  Power mgmt address monitor =
> > >>>>>  MTU update           =
> > >>>>> diff --git a/doc/guides/prog_guide/switch_representation.rst
> > >>>>> b/doc/guides/prog_guide/switch_representation.rst
> > >>>>> index ff6aa91c80..45bf5a3a10 100644
> > >>>>> --- a/doc/guides/prog_guide/switch_representation.rst
> > >>>>> +++ b/doc/guides/prog_guide/switch_representation.rst
> > >>>>> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > >>>>>  .. [1] `Ethernet switch device driver model (switchdev)
> > >>>>>
> > >>>>> <https://www.kernel.org/doc/Documentation/networking/switchdev.t
> > >>>>> xt
> > >>>>>> `_
> > >>>>>
> > >>>>> +- Memory usage of representors is huge when number of
> > >>>>> +representor grows,
> > >>>>> +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > >>>>> +  Polling the large number of ports brings more CPU load, cache
> > >>>>> +miss and
> > >>>>> +  latency. Shared Rx queue can be used to share Rx queue
> > >>>>> +between PF and
> > >>>>> +  representors in same switch domain.
> > >>>>> +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > >>>>> +  is present in Rx offloading capability of device info.
> > >>>>> +Setting the
> > >>>>> +  offloading flag in device Rx mode or Rx queue configuration
> > >>>>> +to enable
> > >>>>> +  shared Rx queue. Polling any member port of shared Rx queue
> > >>>>> +can return
> > >>>>> +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > >>>>> +
> > >>>>>  Basic SR-IOV
> > >>>>>  ------------
> > >>>>>
> > >>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > >>>>> index 9d95cd11e1..1361ff759a 100644
> > >>>>> --- a/lib/ethdev/rte_ethdev.c
> > >>>>> +++ b/lib/ethdev/rte_ethdev.c
> > >>>>> @@ -127,6 +127,7 @@ static const struct {
> > >>>>>         RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > >>>>>         RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > >>>>>         RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > >>>>> +       RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > >>>>>  };
> > >>>>>
> > >>>>>  #undef RTE_RX_OFFLOAD_BIT2STR
> > >>>>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > >>>>> index d2b27c351f..a578c9db9d 100644
> > >>>>> --- a/lib/ethdev/rte_ethdev.h
> > >>>>> +++ b/lib/ethdev/rte_ethdev.h
> > >>>>> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > >>>>>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > >>>>>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > >>>>>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > >>>>> */
> > >>>>> +       uint32_t shared_group; /**< Shared port group index in
> > >>>>> + switch domain. */
> > >>>>>         /**
> > >>>>>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > >>>>>          * Only offloads set on rx_queue_offload_capa or
> > >>>>> rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > >>>>> #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > >>>>>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > >>>>>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > >>>>> +/**
> > >>>>> + * Rx queue is shared among ports in same switch domain to save
> > >>>>> +memory,
> > >>>>> + * avoid polling each port. Any port in group can be used to receive packets.
> > >>>>> + * Real source port number saved in mbuf->port field.
> > >>>>> + */
> > >>>>> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > >>>>>
> > >>>>>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > >>>>>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > >>>>> --
> > >>>>> 2.25.1
> > >>>>>
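A hedged sketch of how an application might enable the offload the patch above defines, following the documented flow (check `RTE_ETH_RX_OFFLOAD_SHARED_RXQ` in the Rx capability, set the flag and `shared_group` in the queue configuration). This uses the names as they appear in this v1 patch; it assumes DPDK headers and an initialized EAL, is not compilable standalone, and the helper name, descriptor count, and the caller-provided `nb_rxq`/`mp` are illustrative:

```c
#include <errno.h>
#include <rte_ethdev.h>

/* Sketch: configure all Rx queues of one port as members of a shared
 * group. Ports set up with the same 'group' value share their queues. */
static int
setup_shared_rxq(uint16_t port_id, uint16_t nb_rxq, uint32_t group,
		 struct rte_mempool *mp)
{
	struct rte_eth_dev_info info;
	struct rte_eth_rxconf rxconf;
	int ret = rte_eth_dev_info_get(port_id, &info);
	if (ret != 0)
		return ret;

	/* The offload must be advertised in the device's Rx capability. */
	if (!(info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
		return -ENOTSUP;

	rxconf = info.default_rxconf;
	rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rxconf.shared_group = group; /* new field from this patch */

	for (uint16_t q = 0; q < nb_rxq; q++) {
		ret = rte_eth_rx_queue_setup(port_id, q, 512,
					     rte_eth_dev_socket_id(port_id),
					     &rxconf, mp);
		if (ret != 0)
			return ret;
	}
	return 0;
}
```

Per the discussion above, each group is then polled from a single core; defining several groups lets the workload be spread across cores.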


