Re: [RFC PATCH v1] net/i40e: put mempool cache out of API

Feifei Wang Feifei.Wang2 at arm.com
Mon Jul 11 05:08:10 CEST 2022



> -----Original Message-----
> From: Feifei Wang
> Sent: Wednesday, July 6, 2022 7:36 PM
> To: 'Konstantin Ananyev' <konstantin.v.ananyev at yandex.ru>; 'Yuying
> Zhang' <Yuying.Zhang at intel.com>; 'Beilei Xing' <beilei.xing at intel.com>;
> Ruifeng Wang <Ruifeng.Wang at arm.com>
> Cc: 'dev at dpdk.org' <dev at dpdk.org>; nd <nd at arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli at arm.com>; nd <nd at arm.com>; nd
> <nd at arm.com>
> Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> 
> 
> 
> > -----Original Message-----
> > From: Feifei Wang
> > Sent: Wednesday, July 6, 2022 4:53 PM
> > To: Konstantin Ananyev <konstantin.v.ananyev at yandex.ru>; Yuying
> > Zhang <Yuying.Zhang at intel.com>; Beilei Xing <beilei.xing at intel.com>; Ruifeng
> > Wang <Ruifeng.Wang at arm.com>
> > Cc: dev at dpdk.org; nd <nd at arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli at arm.com>; nd <nd at arm.com>
> > Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> >
> >
> >
> > > -----Original Message-----
> > > From: Konstantin Ananyev <konstantin.v.ananyev at yandex.ru>
> > > Sent: Sunday, July 3, 2022 8:20 PM
> > > To: Feifei Wang <Feifei.Wang2 at arm.com>; Yuying Zhang
> > > <Yuying.Zhang at intel.com>; Beilei Xing <beilei.xing at intel.com>;
> > > Ruifeng Wang <Ruifeng.Wang at arm.com>
> > > Cc: dev at dpdk.org; nd <nd at arm.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli at arm.com>
> > > Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> > >
> > >
> > > > Referring to "i40e_tx_free_bufs_avx512", this patch takes the mempool
> > > > cache out of the API so that buffers are freed into it directly. There
> > > > are two changes compared with the previous version:
> > > > 1. change txep from "i40e_entry" to "i40e_vec_entry"
> > > > 2. take the cache out of the "mempool_bulk" API and copy buffers into
> > > > it directly
> > > >
> > > > Performance test with the l3fwd neon path:
> > > > 		with this patch
> > > > N1SDP:		no performance change
> > > > Ampere Altra:	+4.0%
> > > >
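For illustration, here is a minimal, simplified sketch of that idea (not the
actual patch code): a burst of Tx mbufs is written straight into the per-lcore
mempool cache instead of going through the put API, assuming the fast-free
case where all mbufs come from the same pool with refcnt 1. "pkts" and "n"
stand in for the mbufs gathered from the Tx sw-ring (txep) in the driver.

/*
 * Sketch only: free a burst of Tx mbufs directly into the per-lcore
 * mempool cache, mirroring what the generic cached put would do.
 */
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

static inline void
tx_free_bufs_cache_sketch(struct rte_mbuf **pkts, uint32_t n)
{
	struct rte_mempool *mp = pkts[0]->pool;
	struct rte_mempool_cache *cache =
		rte_mempool_default_cache(mp, rte_lcore_id());
	void **cache_objs;
	uint32_t i;

	/* No cache, or burst too large for it: fall back to the usual API. */
	if (cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE) {
		rte_mempool_put_bulk(mp, (void **)pkts, n);
		return;
	}

	/* Copy the mbuf pointers directly into the cache object array. */
	cache_objs = &cache->objs[cache->len];
	for (i = 0; i < n; i++)
		cache_objs[i] = pkts[i];
	cache->len += n;

	/* If the cache crossed its flush threshold, spill the excess
	 * back to the mempool ring, as the cached put path would do.
	 */
	if (cache->len >= cache->flushthresh) {
		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
					     cache->len - cache->size);
		cache->len = cache->size;
	}
}
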
> > >
> > >
> > Thanks for your detailed comments.
> >
> > > Thanks for RFC, appreciate your effort.
> > > So, as I understand - bypassing mempool put/get itself gives about
> > > 7-10% speedup for RX/TX on ARM platforms, correct?
> > [Feifei] Yes.
[Feifei] Sorry, I need to correct this. Actually, according to our tests in the
direct-rearm cover letter, the improvement is 7% to 14% on N1SDP and 14% to 17%
on Ampere Altra.
> >
> > >
> > > About direct-rearm RX approach you propose:
> > > After another thought, probably it is possible to re-arrange it in a
> > > way that would help avoid related negatives.
> > > The basic idea as follows:
> > >
> > > 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
> > >     Also make sw_ring de-coupled from RXQ itself, i.e:
> > >     when RXQ is stopped or even destroyed, related sw_ring may still
> > >     exist (probably ref-counter or RCU would be sufficient here).
> > >     All that means we need a common layout/api for rxq_sw_ring
> > >     and PMDs that would like to support direct-rearming will have to
> > >     use/obey it.
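As a purely hypothetical sketch of point 1 above (none of these names exist in
DPDK today), the common, ref-counted sw-ring layout could look roughly like
this, with the ring allowed to outlive its RXQ while an attached TXQ still
holds a reference:

#include <stdint.h>
#include <rte_mbuf.h>

struct shared_rxq_sw_ring {
	uint32_t refcnt;          /* owning RXQ + each attached TXQ; would
				   * need to be atomic or RCU-protected */
	uint16_t size;            /* number of entries, power of two */
	uint16_t rearm_head;      /* next entry the TX side may fill */
	uint16_t rearm_tail;      /* next filled entry the RX side will arm */
	struct rte_mbuf *mbufs[]; /* entries in [rearm_tail, rearm_head) are
				   * filled and waiting to be armed in HW */
};
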
> > [Feifei] De-coupling the sw-ring from the RXQ may lead to a dangerous
> > case: the RXQ is stopped but its sw-ring is still kept, and we may
> > forget to free that sw-ring in the end.
> > Furthermore, if we apply this, we need to split the RXQ close operation
> > and add an Rx sw-ring free operation when closing the TXQ. This is
> > complex and not conducive to future maintenance if the maintainer does
> > not understand direct-rearm mode very well.
> >
> > >
> > > 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
> > >     at txq_free_bufs() try to store released mbufs inside attached
> > >     sw_ring directly. If there is no attached sw_ring, or not enough
> > >     free space in it - continue with mempool_put() as usual.
> > >     Note that actual arming of HW RXDs still remains responsibility
> > >     of RX code-path:
> > >     rxq_rearm(rxq) {
> > >       ...
> > >       - check whether there are N already filled entries inside rxq_sw_ring.
> > >         if not, populate them from mempool (usual mempool_get()).
> > >       - arm related RXDs and mark these sw_ring entries as managed by HW.
> > >       ...
> > >     }
> > >
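A hypothetical sketch of that TX-driven step, building on the
shared_rxq_sw_ring sketch above (again, none of these names are real DPDK or
i40e symbols), could look like this; it also assumes the fast-free case where
all mbufs in the burst come from the same pool:

#include <rte_mbuf.h>
#include <rte_mempool.h>

static inline void
txq_free_bufs_direct_sketch(struct shared_rxq_sw_ring *ring,
			    struct rte_mbuf **pkts, uint16_t n)
{
	uint16_t mask, free_slots, i;

	if (ring != NULL) {
		mask = ring->size - 1;
		/* Entries outside [rearm_tail, rearm_head) are free. */
		free_slots = ring->size -
			(uint16_t)(ring->rearm_head - ring->rearm_tail);
		if (free_slots >= n) {
			/* Store released mbufs straight into the ring. */
			for (i = 0; i < n; i++)
				ring->mbufs[(ring->rearm_head + i) & mask] =
					pkts[i];
			ring->rearm_head += n;
			return;
		}
	}

	/* No attached ring, or not enough room: usual mempool path. */
	rte_mempool_put_bulk(pkts[0]->pool, (void **)pkts, n);
}

The actual arming of the HW RX descriptors would still happen in the RX path,
which consumes filled entries from rearm_tail and falls back to mempool_get()
when too few are available, as described above.
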
> > [Feifei] We tried both modes: direct-rearm and a direct-free scheme like
> > the one above. Comparing performance, we selected direct-rearm, which
> > improves performance by 7% - 14% on N1SDP, versus 3.6% - 7% for
> > direct-free.
> > Furthermore, I think putting the direct mode on the Tx or Rx side is
> > equivalent. For direct-rearm, if there is no Tx sw-ring, Rx will get
> > mbufs from the mempool. For direct-free, if there is no Rx sw-ring, Tx
> > will put mbufs into the mempool. In the end, what drives our decision
> > is the performance improvement.
> >
> > >
> > > So rxq_sw_ring will serve two purposes:
> > > - track mbufs that are managed by HW (that what it does now)
> > > - private (per RXQ) mbuf cache
> > >
> > > Now, if TXQ is stopped while RXQ is running - no extra
> > > synchronization is required, RXQ would just use
> > > mempool_get() to rearm its sw_ring itself.
> > >
> > > If RXQ is stopped while TXQ is still running - TXQ can still
> > > continue to populate related sw_ring till it gets full.
> > > Then it will continue with mempool_put() as usual.
> > > Of course, it means that a user who wants to use this feature should
> > > probably account for some extra mbufs for such a case, or maybe
> > > rxq_sw_ring can have an enable/disable flag to mitigate such a situation.
> > >
> > [Feifei] For direct-rearm, the key point is the communication between
> > the TXQ and RXQ when the TXQ is stopped. A de-coupled sw-ring is
> > complex; maybe we can simplify this and leave it to the application.
> > My thought is that if direct-rearm is enabled and users want to close
> > the Tx port, they must first close the mapped Rx port and disable the
> > direct-rearm feature. Then they can restart the Rx port.
> [Feifei] I think here we can refer to the "hairpin queue" check in
> 'rte_eth_dev_tx_queue_stop':
> "
> if (rte_eth_dev_is_tx_hairpin_queue(dev, tx_queue_id)) {
> 	RTE_ETHDEV_LOG(INFO,
> 		"Can't stop Tx hairpin queue %"PRIu16" of device with port_id=%"PRIu16"\n",
> 		tx_queue_id, port_id);
> 	return -EINVAL;
> }
> "
> >
> > > As another benefit here - such approach makes possible to use
> > > several TXQs (even from different devices) to rearm same RXQ.
> > [Feifei] Actually, for direct-rearm, several RXQs can be rearmed from
> > the same TXQ, so direct-rearm and direct-free are equivalent in this
> > respect. Furthermore, if multiple cores are used, I think we need to
> > consider synchronization of the shared state, and a lock is necessary.
> >
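As a hypothetical sketch of that synchronization point (reusing the earlier
shared_rxq_sw_ring and txq_free_bufs_direct_sketch names, which are not real
DPDK APIs), the fill step could be guarded by a per-ring lock; a trylock keeps
the fast path from spinning when another queue holds it:

#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_spinlock.h>

/* In practice this would be a field of the shared ring, not a global. */
static rte_spinlock_t rearm_lock = RTE_SPINLOCK_INITIALIZER;

static inline void
txq_free_bufs_direct_mc_sketch(struct shared_rxq_sw_ring *ring,
			       struct rte_mbuf **pkts, uint16_t n)
{
	/* If another TXQ holds the lock, fall back to the normal put. */
	if (ring != NULL && rte_spinlock_trylock(&rearm_lock)) {
		txq_free_bufs_direct_sketch(ring, pkts, n);
		rte_spinlock_unlock(&rearm_lock);
		return;
	}
	rte_mempool_put_bulk(pkts[0]->pool, (void **)pkts, n);
}
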
> > >
> > > Have to say, that I am still not sure that 10% RX/TX improvement is
> > > worth bypassing mempool completely and introducing all this extra
> > > complexity in RX/TX path.
> > [Feifei] Thus maybe we can avoid this complexity as much as possible.
> > We should not increase the complexity of the lower layer just for
> > convenience, but leave the choice to the user. If users want the extra
> > performance, they need to take on more consideration and operation.
> >
> > > But, if we'll still decide to go ahead with direct-rearming, this
> > > re-arrangement, I think, should help to keep things clear and avoid
> > > introducing new limitations in existing functionality.
> > >
> > > WDYT?
> > >
> > > Konstantin
> > >


