[dpdk-dev,v3,3/3] net/failsafe: improve Rx sub-devices iteration

Message ID 1513703669-29363-4-git-send-email-matan@mellanox.com (mailing list archive)
State Accepted, archived
Delegated to: Ferruh Yigit

Checks

Context               Check    Description
ci/checkpatch         success  coding style OK
ci/Intel-compilation  success  Compilation OK

Commit Message

Matan Azrad Dec. 19, 2017, 5:14 p.m. UTC
  Connecting the sub-devices to each other in a cyclic linked list helps
the Rx burst functions iterate over them, since there is no need to
check for sub-devices ring wraparound.

Create the aforementioned linked list and change the Rx burst
functions' iteration accordingly.
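
To illustrate the claim, a minimal standalone sketch (illustrative types
and names, not the driver code itself): with a cyclic list, the per-step
wraparound branch disappears.

    #include <stdint.h>

    #define NB_SUBS 4

    struct sub { struct sub *next; };
    /* assume the next pointers already form a cycle over this array */
    static struct sub subs[NB_SUBS];

    /* Indexed iteration pays a wraparound check on every step... */
    static struct sub *next_by_index(uint8_t *last)
    {
            uint8_t i = *last + 1;

            if (i == NB_SUBS)       /* the branch this patch removes */
                    i = 0;
            *last = i;
            return &subs[i];
    }

    /* ...while the cyclic list advances with a single pointer load. */
    static struct sub *next_by_list(struct sub *cur)
    {
            return cur->next;
    }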

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/failsafe/failsafe.c         |  5 ++++
 drivers/net/failsafe/failsafe_ops.c     |  1 +
 drivers/net/failsafe/failsafe_private.h |  5 ++--
 drivers/net/failsafe/failsafe_rxtx.c    | 51 ++++++++++-----------------------
 4 files changed, 25 insertions(+), 37 deletions(-)
  

Comments

Gaëtan Rivet Jan. 12, 2018, 10:28 a.m. UTC | #1
Hi Matan,

The other commits make sense to me so no issue there.
I'm just surprised by this one so a quick question.

On Tue, Dec 19, 2017 at 05:14:29PM +0000, Matan Azrad wrote:
> Connecting the sub-devices to each other in a cyclic linked list helps
> the Rx burst functions iterate over them, since there is no need to
> check for sub-devices ring wraparound.
> 
> Create the aforementioned linked list and change the Rx burst
> functions' iteration accordingly.

I'm surprised that a linked-list iteration, with the usual
dereferencing, is better than doing some integer arithmetic.
Maybe the locality of the referenced data helps.

Anyway, were you able to count the cycles gained by this change? It
might be interesting to take a measurement with a CPU-bound bench, such as with
a dummy device under the fail-safe (ring or such). MLX devices use a lot
of PCI bandwidth, so the bottleneck could be masked in a physical
setting.
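
For what it's worth, one possible way to build such a CPU-bound setup
(sketched here as an assumption, relying on the fail-safe PMD's documented
ability to take a ring vdev as a sub-device; nothing in this thread was
measured this way) is to drive it with testpmd:

    testpmd -l 0-1 -n 4 \
            --vdev='net_failsafe0,dev(net_ring0)' \
            -- -i --forward-mode=rxonly

With a ring sub-device the PCI bus is out of the picture, so cycles saved
in the fail-safe Rx path itself show up directly in `show port stats all`.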

No comments otherwise, if you are sure that this is a performance gain,
the implementation seems ok to me.

Regards,
  
Matan Azrad Jan. 12, 2018, 1:29 p.m. UTC | #2
Hi Gaetan

From: Gaëtan Rivet, Friday, January 12, 2018 12:29 PM
> Hi Matan,
> 
> The other commits make sense to me so no issue there.
> I'm just surprised by this one so a quick question.
> 
> On Tue, Dec 19, 2017 at 05:14:29PM +0000, Matan Azrad wrote:
> > Connecting the sub-devices to each other in a cyclic linked list
> > helps the Rx burst functions iterate over them, since there is no
> > need to check for sub-devices ring wraparound.
> >
> > Create the aforementioned linked list and change the Rx burst
> > functions' iteration accordingly.
> 
> I'm surprised that a linked-list iteration, with the usual dereferencing, is
> better than doing some integer arithmetic.

The memory references are the same as in the previous code: the linked-list elements still sit in contiguous memory, so the addresses probably stay in the cache.
The removed index calculation and wraparound branch are the likely cause of the performance gain.
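
To make that concrete, a minimal sketch with illustrative names (the real
construction is in the failsafe.c hunk below): the list nodes are the array
elements themselves, so walking the list touches exactly the memory an
indexed walk would.

    struct sub { struct sub *next; /* ... descriptor fields ... */ };

    /* The nodes are the array elements; no extra allocation.  The walk
     * visits s[0], s[1], ... in order, hitting the same cache lines an
     * indexed walk would. */
    static void link_ring(struct sub *s, int n)
    {
            int i;

            for (i = 1; i < n; i++)
                    s[i - 1].next = &s[i];  /* neighbour in the array */
            s[n - 1].next = &s[0];          /* wrap once, at init time */
    }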

> Maybe the locality of the referenced data helps.
> 
Sure.

> Anyway, were you able to count the cycles gained by this change? It might be
> interesting to take a measurement with a CPU-bound bench, such as with a dummy
> device under the fail-safe (ring or such). MLX devices use a lot of PCI
> bandwidth, so the bottleneck could be masked in a physical setting.
> 
> No comments otherwise, if you are sure that this is a performance gain, the
> implementation seems ok to me.

Yes, I checked it and clearly saw the small gain
(I just ran the test with and without this patch and compared the statistics).

Thanks, have a good weekend!

> Regards,
> --
> Gaëtan Rivet
> 6WIND
  
Gaëtan Rivet Jan. 12, 2018, 2:01 p.m. UTC | #3
On Fri, Jan 12, 2018 at 01:29:17PM +0000, Matan Azrad wrote:
> Hi Gaetan
> 
> From: Gaëtan Rivet, Friday, January 12, 2018 12:29 PM
> > Hi Matan,
> > 
> > The other commits make sense to me so no issue there.
> > I'm just surprised by this one so a quick question.
> > 
> > On Tue, Dec 19, 2017 at 05:14:29PM +0000, Matan Azrad wrote:
> > > Connecting the sub-devices to each other in a cyclic linked list
> > > helps the Rx burst functions iterate over them, since there is no
> > > need to check for sub-devices ring wraparound.
> > >
> > > Create the aforementioned linked list and change the Rx burst
> > > functions' iteration accordingly.
> > 
> > I'm surprised that a linked-list iteration, with the usual dereferencing, is
> > better than doing some integer arithmetic.
> 
> The memory references are the same as in the previous code: the linked-list elements still sit in contiguous memory, so the addresses probably stay in the cache.
> The removed index calculation and wraparound branch are the likely cause of the performance gain.
> 
> > Maybe the locality of the referenced data helps.
> > 
> Sure.

This means that the sub_device definition is critical for the datapath.
It probably goes beyond a cache line and could be optimized.
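
A hypothetical sketch of such an optimization (the field choice and split
are assumptions, and nothing in this patch actually reorders the struct):
group the members read on every burst at the front of a cache-aligned
struct, and push configuration-only data after them.

    #include <stdint.h>
    #include <rte_common.h>         /* __rte_cache_aligned */
    #include <rte_devargs.h>

    struct sub_device {
            /* hot: touched on every Rx/Tx burst */
            struct sub_device *next;        /* cyclic-list link */
            uint8_t sid;                    /* sub-device index */
            uint8_t state;
            /* cold: control path only */
            struct rte_devargs devargs;     /* config data, rarely read */
    } __rte_cache_aligned;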

> 
> > Anyway, were you able to count the cycles gained by this change? It might be
> > interesting to take a measurement with a CPU-bound bench, such as with a dummy
> > device under the fail-safe (ring or such). MLX devices use a lot of PCI
> > bandwidth, so the bottleneck could be masked in a physical setting.
> > 
> > No comments otherwise, if you are sure that this is a performance gain, the
> > implementation seems ok to me.
> 
> Yes, I checked it and clearly saw the small gain
> (I just ran the test with and without this patch and compared the statistics).

Oh I'm sure you checked, I just wanted to make sure you properly
considered the methodology.

Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>
  

Patch

diff --git a/drivers/net/failsafe/failsafe.c b/drivers/net/failsafe/failsafe.c
index 6bc5aba..d230c37 100644
--- a/drivers/net/failsafe/failsafe.c
+++ b/drivers/net/failsafe/failsafe.c
@@ -55,6 +55,7 @@ 
 {
 	uint8_t nb_subs;
 	int ret;
+	int i;
 
 	ret = failsafe_args_count_subdevice(dev, params);
 	if (ret)
@@ -72,6 +73,10 @@ 
 		ERROR("Could not allocate sub_devices");
 		return -ENOMEM;
 	}
+	/* Initiate static sub devices linked list. */
+	for (i = 1; i < nb_subs; i++)
+		PRIV(dev)->subs[i - 1].next = PRIV(dev)->subs + i;
+	PRIV(dev)->subs[i - 1].next = PRIV(dev)->subs;
 	return 0;
 }
 
diff --git a/drivers/net/failsafe/failsafe_ops.c b/drivers/net/failsafe/failsafe_ops.c
index e16a590..fe957ad 100644
--- a/drivers/net/failsafe/failsafe_ops.c
+++ b/drivers/net/failsafe/failsafe_ops.c
@@ -294,6 +294,7 @@ 
 	rxq->info.conf = *rx_conf;
 	rxq->info.nb_desc = nb_rx_desc;
 	rxq->priv = PRIV(dev);
+	rxq->sdev = PRIV(dev)->subs;
 	dev->data->rx_queues[rx_queue_id] = rxq;
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		ret = rte_eth_rx_queue_setup(PORT_ID(sdev),
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index 4196ece..54b5b91 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -62,8 +62,8 @@ 
 struct rxq {
 	struct fs_priv *priv;
 	uint16_t qid;
-	/* id of last sub_device polled */
-	uint8_t last_polled;
+	/* next sub_device to poll */
+	struct sub_device *sdev;
 	unsigned int socket_id;
 	struct rte_eth_rxq_info info;
 	rte_atomic64_t refcnt[];
@@ -100,6 +100,7 @@  struct fs_stats {
 
 struct sub_device {
 	/* Exhaustive DPDK device description */
+	struct sub_device *next;
 	struct rte_devargs devargs;
 	struct rte_bus *bus;
 	struct rte_device *dev;
diff --git a/drivers/net/failsafe/failsafe_rxtx.c b/drivers/net/failsafe/failsafe_rxtx.c
index 178294c..0ebf06a 100644
--- a/drivers/net/failsafe/failsafe_rxtx.c
+++ b/drivers/net/failsafe/failsafe_rxtx.c
@@ -94,36 +94,28 @@ 
 		  struct rte_mbuf **rx_pkts,
 		  uint16_t nb_pkts)
 {
-	struct fs_priv *priv;
 	struct sub_device *sdev;
 	struct rxq *rxq;
 	void *sub_rxq;
 	uint16_t nb_rx;
-	uint8_t nb_polled, nb_subs;
-	uint8_t i;
 
 	rxq = queue;
-	priv = rxq->priv;
-	nb_subs = priv->subs_tail - priv->subs_head;
-	nb_polled = 0;
-	for (i = rxq->last_polled; nb_polled < nb_subs; nb_polled++) {
-		i++;
-		if (i == priv->subs_tail)
-			i = priv->subs_head;
-		sdev = &priv->subs[i];
-		if (fs_rx_unsafe(sdev))
+	sdev = rxq->sdev;
+	do {
+		if (fs_rx_unsafe(sdev)) {
+			nb_rx = 0;
+			sdev = sdev->next;
 			continue;
+		}
 		sub_rxq = ETH(sdev)->data->rx_queues[rxq->qid];
 		FS_ATOMIC_P(rxq->refcnt[sdev->sid]);
 		nb_rx = ETH(sdev)->
 			rx_pkt_burst(sub_rxq, rx_pkts, nb_pkts);
 		FS_ATOMIC_V(rxq->refcnt[sdev->sid]);
-		if (nb_rx) {
-			rxq->last_polled = i;
-			return nb_rx;
-		}
-	}
-	return 0;
+		sdev = sdev->next;
+	} while (nb_rx == 0 && sdev != rxq->sdev);
+	rxq->sdev = sdev;
+	return nb_rx;
 }
 
 uint16_t
@@ -131,35 +123,24 @@ 
 			 struct rte_mbuf **rx_pkts,
 			 uint16_t nb_pkts)
 {
-	struct fs_priv *priv;
 	struct sub_device *sdev;
 	struct rxq *rxq;
 	void *sub_rxq;
 	uint16_t nb_rx;
-	uint8_t nb_polled, nb_subs;
-	uint8_t i;
 
 	rxq = queue;
-	priv = rxq->priv;
-	nb_subs = priv->subs_tail - priv->subs_head;
-	nb_polled = 0;
-	for (i = rxq->last_polled; nb_polled < nb_subs; nb_polled++) {
-		i++;
-		if (i == priv->subs_tail)
-			i = priv->subs_head;
-		sdev = &priv->subs[i];
+	sdev = rxq->sdev;
+	do {
 		RTE_ASSERT(!fs_rx_unsafe(sdev));
 		sub_rxq = ETH(sdev)->data->rx_queues[rxq->qid];
 		FS_ATOMIC_P(rxq->refcnt[sdev->sid]);
 		nb_rx = ETH(sdev)->
 			rx_pkt_burst(sub_rxq, rx_pkts, nb_pkts);
 		FS_ATOMIC_V(rxq->refcnt[sdev->sid]);
-		if (nb_rx) {
-			rxq->last_polled = i;
-			return nb_rx;
-		}
-	}
-	return 0;
+		sdev = sdev->next;
+	} while (nb_rx == 0 && sdev != rxq->sdev);
+	rxq->sdev = sdev;
+	return nb_rx;
 }
 
 uint16_t