[dpdk-dev,v3,3/3] net/failsafe: improve Rx sub-devices iteration
Commit Message
Connecting the sub-devices to each other in a cyclic linked list helps
the Rx burst functions iterate over them, because there is no longer a
need to check for sub-device ring wraparound.
Create the aforementioned linked list and change the Rx burst functions'
iteration accordingly.
Signed-off-by: Matan Azrad <matan@mellanox.com>
---
drivers/net/failsafe/failsafe.c | 5 ++++
drivers/net/failsafe/failsafe_ops.c | 1 +
drivers/net/failsafe/failsafe_private.h | 5 ++--
drivers/net/failsafe/failsafe_rxtx.c | 50 ++++++++++-----------------------
4 files changed, 24 insertions(+), 37 deletions(-)
Comments
Hi Matan,
The other commits make sense to me so no issue there.
I'm just surprised by this one so a quick question.
On Tue, Dec 19, 2017 at 05:14:29PM +0000, Matan Azrad wrote:
> Connecting the sub-devices to each other in a cyclic linked list helps
> the Rx burst functions iterate over them, because there is no longer a
> need to check for sub-device ring wraparound.
>
> Create the aforementioned linked list and change the Rx burst
> functions' iteration accordingly.
I'm surprised that a linked-list iteration, with the usual
dereferencing, is better than doing some integer arithmetic.
Maybe the locality of the referenced data helps.
Anyway, were you able to count the cycles gained by this change? It
might be interesting to do a measure with a CPU-bound bench, such as with
a dummy device under the fail-safe (ring or such). MLX devices use a lot
of PCI bandwidth, so the bottleneck could be masked in a physical
setting.
No comments otherwise, if you are sure that this is a performance gain,
the implementation seems ok to me.
Regards,
Hi Gaetan
From: Gaëtan Rivet, Friday, January 12, 2018 12:29 PM
> Hi Matan,
>
> The other commits make sense to me so no issue there.
> I'm just surprised by this one so a quick question.
>
> On Tue, Dec 19, 2017 at 05:14:29PM +0000, Matan Azrad wrote:
> > Connecting the sub-devices to each other in a cyclic linked list
> > helps the Rx burst functions iterate over them, because there is no
> > longer a need to check for sub-device ring wraparound.
> >
> > Create the aforementioned linked list and change the Rx burst
> > functions' iteration accordingly.
>
> I'm surprised that a linked-list iteration, with the usual dereferencing, is
> better than doing some integer arithmetic.
The memory references are the same as in the previous code: the linked-list elements still live in contiguous memory, so the addresses likely stay in cache.
The performance gain probably comes from removing the index calculations and the wraparound branch.
> Maybe the locality of the referenced data helps.
>
Sure.
> Anyway, were you able to count the cycles gained by this change? It might be
> interesting to do a measure with a CPU-bound bench, such as with a dummy
> device under the fail-safe (ring or such). MLX devices use a lot of PCI
> bandwidth, so the bottleneck could be masked in a physical setting.
>
> No comments otherwise, if you are sure that this is a performance gain, the
> implementation seems ok to me.
Yes, I checked it and saw a small but clear gain
(I just ran the test with and without this patch and compared the statistics).
Thanks, have a good weekend!
>
> Regards,
> --
> Gaëtan Rivet
> 6WIND
On Fri, Jan 12, 2018 at 01:29:17PM +0000, Matan Azrad wrote:
> Hi Gaetan
>
> From: Gaëtan Rivet, Friday, January 12, 2018 12:29 PM
> > Hi Matan,
> >
> > The other commits make sense to me so no issue there.
> > I'm just surprised by this one so a quick question.
> >
> > On Tue, Dec 19, 2017 at 05:14:29PM +0000, Matan Azrad wrote:
> > > Connecting the sub-devices to each other in a cyclic linked list
> > > helps the Rx burst functions iterate over them, because there is
> > > no longer a need to check for sub-device ring wraparound.
> > >
> > > Create the aforementioned linked list and change the Rx burst
> > > functions' iteration accordingly.
> >
> > I'm surprised that a linked-list iteration, with the usual dereferencing, is
> > better than doing some integer arithmetic.
>
> The memory references are the same as in the previous code: the linked-list elements still live in contiguous memory, so the addresses likely stay in cache.
> The performance gain probably comes from removing the index calculations and the wraparound branch.
>
> > Maybe the locality of the referenced data helps.
> >
> Sure.
This means that the sub_device definition is critical for the datapath.
It probably goes beyond a cache-line and could be optimized.
>
> > Anyway, were you able to count the cycles gained by this change? It might be
> > interesting to do a measure with a CPU-bound bench, such as with a dummy
> > device under the fail-safe (ring or such). MLX devices use a lot of PCI
> > bandwidth, so the bottleneck could be masked in a physical setting.
> >
> > No comments otherwise, if you are sure that this is a performance gain, the
> > implementation seems ok to me.
>
> Yes, I checked it and saw a small but clear gain
> (I just ran the test with and without this patch and compared the statistics).
Oh I'm sure you checked, I just wanted to make sure you properly
considered the methodology.
Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>
@@ -55,6 +55,7 @@
{
uint8_t nb_subs;
int ret;
+ int i;
ret = failsafe_args_count_subdevice(dev, params);
if (ret)
@@ -72,6 +73,10 @@
ERROR("Could not allocate sub_devices");
return -ENOMEM;
}
+ /* Initiate static sub devices linked list. */
+ for (i = 1; i < nb_subs; i++)
+ PRIV(dev)->subs[i - 1].next = PRIV(dev)->subs + i;
+ PRIV(dev)->subs[i - 1].next = PRIV(dev)->subs;
return 0;
}
@@ -294,6 +294,7 @@
rxq->info.conf = *rx_conf;
rxq->info.nb_desc = nb_rx_desc;
rxq->priv = PRIV(dev);
+ rxq->sdev = PRIV(dev)->subs;
dev->data->rx_queues[rx_queue_id] = rxq;
FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
ret = rte_eth_rx_queue_setup(PORT_ID(sdev),
@@ -62,8 +62,8 @@
struct rxq {
struct fs_priv *priv;
uint16_t qid;
- /* id of last sub_device polled */
- uint8_t last_polled;
+ /* next sub_device to poll */
+ struct sub_device *sdev;
unsigned int socket_id;
struct rte_eth_rxq_info info;
rte_atomic64_t refcnt[];
@@ -100,6 +100,7 @@ struct fs_stats {
struct sub_device {
/* Exhaustive DPDK device description */
+ struct sub_device *next;
struct rte_devargs devargs;
struct rte_bus *bus;
struct rte_device *dev;
@@ -94,36 +94,27 @@
struct rte_mbuf **rx_pkts,
uint16_t nb_pkts)
{
- struct fs_priv *priv;
struct sub_device *sdev;
struct rxq *rxq;
void *sub_rxq;
uint16_t nb_rx;
- uint8_t nb_polled, nb_subs;
- uint8_t i;
rxq = queue;
- priv = rxq->priv;
- nb_subs = priv->subs_tail - priv->subs_head;
- nb_polled = 0;
- for (i = rxq->last_polled; nb_polled < nb_subs; nb_polled++) {
- i++;
- if (i == priv->subs_tail)
- i = priv->subs_head;
- sdev = &priv->subs[i];
- if (fs_rx_unsafe(sdev))
+ sdev = rxq->sdev;
+ do {
+ if (fs_rx_unsafe(sdev)) {
+ nb_rx = 0;
continue;
+ }
sub_rxq = ETH(sdev)->data->rx_queues[rxq->qid];
FS_ATOMIC_P(rxq->refcnt[sdev->sid]);
nb_rx = ETH(sdev)->
rx_pkt_burst(sub_rxq, rx_pkts, nb_pkts);
FS_ATOMIC_V(rxq->refcnt[sdev->sid]);
- if (nb_rx) {
- rxq->last_polled = i;
- return nb_rx;
- }
- }
- return 0;
+ sdev = sdev->next;
+ } while (nb_rx == 0 && sdev != rxq->sdev);
+ rxq->sdev = sdev;
+ return nb_rx;
}
uint16_t
@@ -131,35 +122,24 @@
struct rte_mbuf **rx_pkts,
uint16_t nb_pkts)
{
- struct fs_priv *priv;
struct sub_device *sdev;
struct rxq *rxq;
void *sub_rxq;
uint16_t nb_rx;
- uint8_t nb_polled, nb_subs;
- uint8_t i;
rxq = queue;
- priv = rxq->priv;
- nb_subs = priv->subs_tail - priv->subs_head;
- nb_polled = 0;
- for (i = rxq->last_polled; nb_polled < nb_subs; nb_polled++) {
- i++;
- if (i == priv->subs_tail)
- i = priv->subs_head;
- sdev = &priv->subs[i];
+ sdev = rxq->sdev;
+ do {
RTE_ASSERT(!fs_rx_unsafe(sdev));
sub_rxq = ETH(sdev)->data->rx_queues[rxq->qid];
FS_ATOMIC_P(rxq->refcnt[sdev->sid]);
nb_rx = ETH(sdev)->
rx_pkt_burst(sub_rxq, rx_pkts, nb_pkts);
FS_ATOMIC_V(rxq->refcnt[sdev->sid]);
- if (nb_rx) {
- rxq->last_polled = i;
- return nb_rx;
- }
- }
- return 0;
+ sdev = sdev->next;
+ } while (nb_rx == 0 && sdev != rxq->sdev);
+ rxq->sdev = sdev;
+ return nb_rx;
}
uint16_t