[RFC PATCH v1 0/4] Direct re-arming of buffers on receive side

Morten Brørup mb at smartsharesystems.com
Sun Dec 26 11:25:26 CET 2021


> From: Feifei Wang [mailto:feifei.wang2 at arm.com]
> Sent: Friday, 24 December 2021 17.46
> 
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of loads from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache, i.e. the receive side will free the buffers from the
> transmit side directly into its software ring. This will avoid the
> 256B of loads and stores introduced by the lcore cache. It also frees
> up the cache lines used by the lcore cache.
> 
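For readers who skim the cover letter without opening the patches, here is how I read the proposed re-arm path compared to the normal one, as a self-contained sketch. Every structure and field name below (demo_rxq, direct_rxrearm_enable, mapped_txq, and so on) is my own illustrative assumption rather than the actual i40e code in this series; only the idea itself, refilling the Rx sw-ring straight from the mapped Tx queue's completed mbufs instead of going through the mempool's per-lcore cache, is taken from the description above. Ring wrap-around and descriptor writes are omitted.

  #include <stdint.h>
  #include <rte_mbuf.h>
  #include <rte_mempool.h>

  #define REARM_THRESH 32

  /* Illustrative queue layouts, NOT the i40e structures. */
  struct demo_txq {
          struct rte_mbuf **sw_ring;   /* mbufs the NIC is done sending */
          uint16_t next_dd;            /* first slot that can be reused */
  };

  struct demo_rxq {
          struct rte_mbuf **sw_ring;   /* buffers to hand to the NIC */
          uint16_t rearm_start;
          struct rte_mempool *mp;
          int direct_rxrearm_enable;
          struct demo_txq *mapped_txq; /* set via the proposed mapping API */
  };

  static void
  demo_rx_rearm(struct demo_rxq *rxq)
  {
          struct rte_mbuf **dst = &rxq->sw_ring[rxq->rearm_start];

          if (rxq->direct_rxrearm_enable) {
                  /* Direct re-arm: reuse the mbufs the mapped Tx queue
                   * just completed, skipping the lcore cache entirely. */
                  struct demo_txq *txq = rxq->mapped_txq;
                  for (int i = 0; i < REARM_THRESH; i++)
                          dst[i] = txq->sw_ring[txq->next_dd + i];
                  txq->next_dd += REARM_THRESH;
          } else {
                  /* Normal mode: 32*8 = 256B of stores plus 256B of loads
                   * through the mempool's per-lcore cache. */
                  if (rte_mempool_get_bulk(rxq->mp, (void **)dst,
                                           REARM_THRESH) != 0)
                          return; /* no buffers available, retry later */
          }
          rxq->rearm_start += REARM_THRESH;
          /* ...then write the new buffer addresses into the Rx descriptors. */
  }
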
> However, this solution poses several constraints:
> 
> 1)The receive queue needs to know which transmit queue it should take
> the buffers from. The application logic decides which transmit port to
> use to send out the packets. In many use cases the NIC might have a
> single port ([1], [2], [3]), in which case a given transmit queue is
> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> is easy to configure.
> 
> If the NIC has 2 ports (there are several references), then we will
> have a 1:2 (Rx queue: Tx queue) mapping, which is still easy to
> configure. However, if this is generalized to 'N' ports, the
> configuration can be long. Moreover, the PMD would have to scan a list
> of transmit queues to pull the buffers from.

I disagree with the description of this constraint.

As I understand it, it doesn't matter how many ports or queues are in a NIC or system.

The constraint is narrower:

This patch requires that all packets ingressing on a given port/queue must egress on the specific port/queue that it has been configured to re-arm its buffers from. I.e. an application cannot route packets between multiple ports with this patch.

> 
> 2)The other factor that needs to be considered is 'run-to-completion'
> vs 'pipeline' models. In the run-to-completion model, the receive side
> and the transmit side are running on the same lcore serially. In the
> pipeline model, the receive side and the transmit side might be
> running on different lcores in parallel. This requires locking. This
> is not supported at this point.
> 
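Agreed that run-to-completion is the natural fit. To spell out why no locking is needed there: in the standard forwarding loop below (plain ethdev API, nothing from this patch set), Rx and Tx for a queue pair run serially on one lcore, so the Rx re-arm code can safely pull from the mapped Tx queue's sw-ring. In a pipeline model the two halves run on different lcores, and the shared sw-ring would need synchronization.

  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  #define BURST 32

  static void
  fwd_loop(uint16_t rx_port, uint16_t tx_port, uint16_t queue)
  {
          struct rte_mbuf *pkts[BURST];

          for (;;) {
                  uint16_t nb_rx = rte_eth_rx_burst(rx_port, queue,
                                                    pkts, BURST);
                  if (nb_rx == 0)
                          continue;
                  uint16_t nb_tx = rte_eth_tx_burst(tx_port, queue,
                                                    pkts, nb_rx);
                  /* Drop whatever the Tx queue could not accept. */
                  while (nb_tx < nb_rx)
                          rte_pktmbuf_free(pkts[nb_tx++]);
          }
  }
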
> 3)Tx and Rx buffers must be from the same mempool. We must also
> ensure that the Tx buffer free number is equal to the Rx buffer free
> number:
> (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
> Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
> is because tx_next_dd is a variable used to compute the Tx sw-ring
> free location; its value will be one round ahead of the position where
> the next free starts.
> 
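If I read constraint 3 correctly, the equality is what keeps the two rings moving in lock step, so a sanity check along the following lines at mapping time should be sufficient. The helper and parameter names are mine; only tx_rs_thresh and RTE_I40E_RXQ_REARM_THRESH come from the text above.

  #include <errno.h>
  #include <stdint.h>

  static int
  check_direct_rearm_thresholds(uint16_t tx_rs_thresh, uint16_t rearm_thresh)
  {
          /* Each direct re-arm moves exactly one Tx free batch into one
           * Rx re-arm batch; if the batch sizes differed, tx_next_dd
           * would no longer land on the start of the next reusable batch. */
          if (tx_rs_thresh != rearm_thresh)
                  return -EINVAL;
          /* The "same mempool" half of the constraint cannot be checked
           * here; it is a promise the application makes about the mbufs
           * it transmits on the mapped Tx queue. */
          return 0;
  }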

You are missing the fourth constraint:

4) The application must transmit all received packets immediately, i.e. QoS queueing and similar is prohibited.

> Current status in this RFC:
> 1)An API is added to allow for mapping a TX queue to a RX queue.
>   Currently it supports 1:1 mapping.
> 2)The i40e driver is changed to do the direct re-arm of the receive
>   side.
> 3)The L3fwd application is hacked to do the mapping for the following
> command (one core, two flows case):
>   $./examples/dpdk-l3fwd -n 4 -l 1 -a 0001:01:00.0 -a 0001:01:00.1
>   -- -p 0x3 -P --config='(0,0,1),(1,0,1)'
>   where:
>   Port 0 Rx queue 0 is mapped to Port 1 Tx queue 0
>   Port 1 Rx queue 0 is mapped to Port 0 Tx queue 0
> 
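For anyone who has not opened patch 4/4 yet: I assume the l3fwd hack boils down to a couple of init-time calls in the spirit of the sketch below. The function name and argument order are my guess from the "map a Tx queue to a Rx queue" description above, not necessarily what the patch actually exports.

  #include <stdint.h>

  /* Hypothetical init-time mapping for the "one core two flows" case. */
  static void
  setup_direct_rearm_mapping(void)
  {
          /* Port 0 Rx queue 0 re-arms from Port 1 Tx queue 0. */
          rte_eth_direct_rxrearm_map(0 /* rx port */, 0 /* rx queue */,
                                     1 /* tx port */, 0 /* tx queue */);
          /* Port 1 Rx queue 0 re-arms from Port 0 Tx queue 0. */
          rte_eth_direct_rxrearm_map(1, 0, 0, 0);
  }
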
> Testing status:
> 1)Tested L3fwd with the above command:
> The testing results for L3fwd are as follows:
> -------------------------------------------------------------------
> N1SDP:
> Base performance (with this patch)   With direct re-arm mode enabled
>       0%                                  +14.1%
> 
> Ampere Altra:
> Base performance (with this patch)   With direct re-arm mode enabled
>       0%                                  +17.1%
> -------------------------------------------------------------------
> This patch does not affect the performance of normal mode, and if
> direct-rearm mode is enabled, performance can be improved by 14% - 17%
> on N1SDP and Ampere Altra.
> 
> Feedback requested:
> 1) Has anyone done any similar experiments, any lessons learnt?
> 2) Feedback on API
> 
> Next steps:
> 1) Update the code to support 1:N (Rx queue : Tx queue) mapping
> 2) Automate the configuration in L3fwd sample application
> 
> Reference:
> [1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
> [2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
> [3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g
> 
> Feifei Wang (4):
>   net/i40e: enable direct re-arm mode
>   ethdev: add API for direct re-arm mode
>   net/i40e: add direct re-arm mode internal API
>   examples/l3fwd: give an example for direct rearm mode
> 
>  drivers/net/i40e/i40e_ethdev.c        |  34 ++++++
>  drivers/net/i40e/i40e_rxtx.h          |   4 +
>  drivers/net/i40e/i40e_rxtx_vec_neon.c | 149 +++++++++++++++++++++++++-
>  examples/l3fwd/main.c                 |   3 +
>  lib/ethdev/ethdev_driver.h            |  15 +++
>  lib/ethdev/rte_ethdev.c               |  14 +++
>  lib/ethdev/rte_ethdev.h               |  31 ++++++
>  lib/ethdev/version.map                |   3 +
>  8 files changed, 251 insertions(+), 2 deletions(-)
> 
> --
> 2.25.1
> 

The patch provides a significant performance improvement, but I am wondering if any real-world applications exist that would use it. Only a "router on a stick" (i.e. a single-port router) comes to mind, and that is probably sufficient to call it useful in the real world. Do you have any other examples to support the usefulness of this patch?

Anyway, the patch doesn't do any harm if unused, and the only performance cost is the "if (rxq->direct_rxrearm_enable)" branch in the Ethdev driver. So I don't oppose it.



