[dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598

Ananyev, Konstantin konstantin.ananyev at intel.com
Tue Aug 25 19:33:57 CEST 2015


Hi Vlad,

> -----Original Message-----
> From: Vlad Zolotarov [mailto:vladz at cloudius-systems.com]
> Sent: Thursday, August 20, 2015 10:07 AM
> To: Ananyev, Konstantin; Lu, Wenzhuo
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
> 
> 
> 
> On 08/20/15 12:05, Vlad Zolotarov wrote:
> >
> >
> > On 08/20/15 11:56, Vlad Zolotarov wrote:
> >>
> >>
> >> On 08/20/15 11:41, Ananyev, Konstantin wrote:
> >>> Hi Vlad,
> >>>
> >>>> -----Original Message-----
> >>>> From: Vlad Zolotarov [mailto:vladz at cloudius-systems.com]
> >>>> Sent: Wednesday, August 19, 2015 11:03 AM
> >>>> To: Ananyev, Konstantin; Lu, Wenzhuo
> >>>> Cc: dev at dpdk.org
> >>>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
> >>>> above 1 for all NICs but 82598
> >>>>
> >>>>
> >>>>
> >>>> On 08/19/15 10:43, Ananyev, Konstantin wrote:
> >>>>> Hi Vlad,
> >>>>> Sorry for delay with review, I am OOO till next week.
> >>>>> Meanwhile, few questions/comments from me.
> >>>> Hi, Konstantin, long time no see... ;)
> >>>>
> >>>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a
> >>>>>> seastar-based
> >>>>>>>>>> application on x540 NIC.
> >>>>>>>>> Could you help to share with us how to reproduce the tx hang
> >>>>>>>>> issue,
> >>>>>> with using
> >>>>>>>>> typical DPDK examples?
> >>>>>>>> Sorry. I'm not very familiar with the typical DPDK examples to
> >>>>>>>> help u
> >>>>>>>> here. However this is quite irrelevant since without this
> >>>>>>>> patch
> >>>>>>>> ixgbe PMD obviously abuses the HW spec as has been explained
> >>>>>>>> above.
> >>>>>>>>
> >>>>>>>> We saw the issue when we stressed the xmit path with a lot of
> >>>>>>>> highly
> >>>>>>>> fragmented TCP frames (packets with up to 33 fragments with
> >>>>>>>> non-headers
> >>>>>>>> fragments as small as 4 bytes) with all offload features enabled.
> >>>>> Could you provide us with the pcap file to reproduce the issue?
> >>>> Well, the thing is it takes some time to reproduce it (a few
> >>>> minutes of
> >>>> heavy load) therefore a pcap would be quite large.
> >>> Probably you can upload it to some place, from which we will be able
> >>> to download it?
> >>
> >> I'll see what I can do but no promises...
> >
> > On second thought, a pcap file won't help u much since in order to
> > reproduce the issue u have to reproduce exactly the same structure of
> > clusters i give to HW and it's not what u see on wire in a TSO case.
> 
> And not only in a TSO case... ;)

I understand that, but my thought was that you could add some sort of TX callback for rte_eth_tx_burst()
to your code that would write each packet into a pcap file, and then re-run your hang scenario.
I know that it means extra work for you, but I think it would be very helpful if we were able to reproduce your hang scenario:
- if the HW guys confirm that setting the RS bit on every EOP descriptor is not really required,
  then we probably have to look at what else could cause it.
- it might be added to our validation cycle, to prevent hitting a similar problem in the future.
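
To make the capture idea concrete, here is a rough, self-contained sketch of the pcap-writing part in plain C. It deliberately uses no DPDK calls: `struct seg` is a hypothetical stand-in for a chained mbuf, and `pcap_open()`, `pcap_write_chain()` and `log_layout()` are illustrative names, not existing APIs. In a real application the same logic would be driven from the TX callback. Since the flattened frame alone loses the segment boundaries, the sketch also logs the per-segment lengths to a side file; that part is an assumption about what would be needed to replay the same cluster structure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for one segment of a multi-segment TX packet. */
struct seg {
    const uint8_t *data;
    uint32_t len;
    const struct seg *next;
};

/* Classic pcap global header (24 bytes), written in host byte order;
 * readers detect the byte order from the magic number. */
struct pcap_file_hdr {
    uint32_t magic;     /* 0xa1b2c3d4: microsecond timestamps */
    uint16_t ver_major; /* 2 */
    uint16_t ver_minor; /* 4 */
    int32_t  thiszone;
    uint32_t sigfigs;
    uint32_t snaplen;
    uint32_t linktype;  /* 1 = Ethernet */
};

/* Per-packet record header (16 bytes). */
struct pcap_rec_hdr {
    uint32_t ts_sec;
    uint32_t ts_usec;
    uint32_t caplen;
    uint32_t len;
};

/* Write the pcap global header. Returns 0 on success, -1 on error. */
static int pcap_open(FILE *f)
{
    struct pcap_file_hdr h = { 0xa1b2c3d4u, 2, 4, 0, 0, 65535, 1 };
    return fwrite(&h, sizeof(h), 1, f) == 1 ? 0 : -1;
}

/* Total byte length of a chain; optionally reports the segment count. */
static uint32_t chain_len(const struct seg *s, unsigned *nsegs)
{
    uint32_t total = 0;
    unsigned n = 0;
    for (; s != NULL; s = s->next) {
        total += s->len;
        n++;
    }
    if (nsegs != NULL)
        *nsegs = n;
    return total;
}

/* Append one packet record: header, then all segments back to back. */
static int pcap_write_chain(FILE *f, const struct seg *s,
                            uint32_t ts_sec, uint32_t ts_usec)
{
    struct pcap_rec_hdr r;
    assert(s != NULL);
    r.ts_sec = ts_sec;
    r.ts_usec = ts_usec;
    r.caplen = r.len = chain_len(s, NULL);
    if (fwrite(&r, sizeof(r), 1, f) != 1)
        return -1;
    for (; s != NULL; s = s->next)
        if (s->len != 0 && fwrite(s->data, 1, s->len, f) != s->len)
            return -1;
    return 0;
}

/* Side log: "<nsegs> <len0> <len1> ...\n", one line per packet. */
static int log_layout(FILE *f, const struct seg *s)
{
    unsigned nsegs = 0;
    chain_len(s, &nsegs);
    if (fprintf(f, "%u", nsegs) < 0)
        return -1;
    for (; s != NULL; s = s->next)
        if (fprintf(f, " %u", (unsigned)s->len) < 0)
            return -1;
    return fputc('\n', f) == EOF ? -1 : 0;
}
```

Each pcap record holds the flattened frame, and the side file holds one line per packet, e.g. `2 14 4` for a two-segment packet of 14 and 4 bytes.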
Thanks
Konstantin

> 
> >
> >>
> >>> Or might be you have some sort of scapy script to generate it?
> >>> I suppose we'll need something to reproduce the issue and verify the
> >>> fix.
> >>
> >> Since the original code abuses the HW spec u don't have to... ;)
> >>
> >>>
> >>>>> My concern with your approach is that it would affect TX performance.
> >>>> It certainly will ;) But it seems inevitable. See below.
> >>>>
> >>>>> Right now, for simple TX PMD usually reads only
> >>>>> (nb_tx_desc/tx_rs_thresh) TXDs,
> >>>>> While with your patch (if I understand it correctly) it has to
> >>>>> read all TXDs in the HW TX ring.
> >>>> If by "simple" u refer to an always single fragment per Tx packet -
> >>>> then u
> >>>> are absolutely correct.
> >>>>
> >>>> My initial patch was to only set RS on every EOP descriptor without
> >>>> changing the rs_thresh value and this patch worked.
> >>>> However HW spec doesn't ensure in a general case that packets are
> >>>> always
> >>>> handled/completion write-back completes in the same order the packets
> >>>> are placed on the ring (see "Tx arbitration schemes" chapter in 82599
> >>>> spec for instance). Therefore AFAIU one should not assume that if
> >>>> packet[x+1] DD bit is set then packet[x] is completed too.
> >>>  From my understanding, TX arbitration controls the order in which
> >>> TXDs from
> >>> different queues are fetched/processed.
> >>> But descriptors from the same TX queue are processed in FIFO order.
> >>> So, I think that  - yes, if TXD[x+1] DD bit is set, then TXD[x] is
> >>> completed too,
> >>> and setting RS on every EOP TXD should be enough.
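
The FIFO argument above can be illustrated with a small simulation, given here only as a sketch and not as the ixgbe PMD code. It assumes that descriptors of one queue complete in order and that DD is written back only where RS was set. All names (`struct txq`, `txq_enqueue()`, `hw_process()`, `txq_clean()`, `RING_SIZE`) are hypothetical. RS is set on every EOP descriptor, yet software still reclaims in chunks: it probes a single DD bit, on the EOP descriptor of the packet that covers the chunk boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 16u /* hypothetical ring size */

struct txd {
    bool eop; /* last descriptor of a packet */
    bool rs;  /* report status requested */
    bool dd;  /* descriptor done, written back by "HW" */
};

struct txq {
    struct txd ring[RING_SIZE];
    uint16_t last_id[RING_SIZE]; /* index of the owning packet's EOP */
    uint16_t tail;    /* next descriptor software fills */
    uint16_t hw_head; /* next descriptor the simulated HW processes */
    uint16_t clean;   /* next descriptor software reclaims */
};

/* Queue one packet of nsegs descriptors with RS on its EOP descriptor.
 * The caller is assumed to have checked that the ring has room. */
static void txq_enqueue(struct txq *q, uint16_t nsegs)
{
    assert(nsegs > 0);
    uint16_t eop = (uint16_t)((q->tail + nsegs - 1u) % RING_SIZE);
    for (uint16_t i = 0; i < nsegs; i++) {
        struct txd *d = &q->ring[q->tail];
        d->eop = (q->tail == eop);
        d->rs = d->eop;
        d->dd = false;
        q->last_id[q->tail] = eop;
        q->tail = (uint16_t)((q->tail + 1u) % RING_SIZE);
    }
}

/* Simulated HW: handle up to n descriptors strictly in FIFO order. */
static void hw_process(struct txq *q, uint16_t n)
{
    while (n-- > 0 && q->hw_head != q->tail) {
        struct txd *d = &q->ring[q->hw_head];
        if (d->rs)
            d->dd = true;
        q->hw_head = (uint16_t)((q->hw_head + 1u) % RING_SIZE);
    }
}

/* Reclaim in chunks of at least thresh descriptors by reading one DD bit.
 * Returns the number of descriptors reclaimed (0 if the chunk is not done). */
static uint16_t txq_clean(struct txq *q, uint16_t thresh)
{
    assert(thresh > 0);
    uint16_t queued = (uint16_t)((q->tail + RING_SIZE - q->clean) % RING_SIZE);
    if (queued < thresh)
        return 0;
    uint16_t probe = (uint16_t)((q->clean + thresh - 1u) % RING_SIZE);
    uint16_t eop = q->last_id[probe];
    if (!q->ring[eop].dd)
        return 0;
    /* FIFO completion: everything up to and including eop is done. */
    uint16_t n = (uint16_t)((eop + RING_SIZE - q->clean) % RING_SIZE + 1u);
    q->clean = (uint16_t)((eop + 1u) % RING_SIZE);
    return n;
}
```

If in-order completion did not hold, `txq_clean()` would be wrong, which is exactly the question for the HW people.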
> >>
> >> Ok. I'll rework the patch under this assumption then.
> >>
> >>>
> >>>> That's why I changed the patch to be as u see it now. However if I
> >>>> miss
> >>>> something here and your HW people ensure the in-order completion
> >>>> this of
> >>>> course may be changed back.
> >>>>
> >>>>> Even if we really need to setup RS bit in each TXD (I still doubt
> >>>>> we really do) - ,
> >>>> Well, if u doubt u may ask the guys from the Intel networking division
> >>>> that wrote the 82599 and x540 HW specs where they clearly state
> >>>> that. ;)
> >>> Good point, we'll see what we can do here :)
> >>> Konstantin
> >>>
> >>>>> I think inside PMD it still should be possible to check TX
> >>>>> completion in chunks.
> >>>>> Konstantin
> >>>>>
> >>>>>
> >>>>>>>> Thanks,
> >>>>>>>> vlad
> >>>>>>>>>> Signed-off-by: Vlad Zolotarov <vladz at cloudius-systems.com>
> >>>>>>>>>> ---
> >>>>>>>>>>     drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
> >>>>>>>>>>     drivers/net/ixgbe/ixgbe_rxtx.c   | 23
> >>>>>>>>>> ++++++++++++++++++++++-
> >>>>>>>>>>     2 files changed, 31 insertions(+), 1 deletion(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
> >>>>>>>>>> b/drivers/net/ixgbe/ixgbe_ethdev.c
> >>>>>>>>>> index b8ee1e9..6714fd9 100644
> >>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> >>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> >>>>>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev
> >>>>>>>>>> *dev,
> >>>>>>>> struct
> >>>>>>>>>> rte_eth_dev_info *dev_info)
> >>>>>>>>>>              .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
> >>>>>>>>>> ETH_TXQ_FLAGS_NOOFFLOADS,
> >>>>>>>>>>      };
> >>>>>>>>>> +
> >>>>>>>>>> +  /*
> >>>>>>>>>> +   * According to 82599 and x540 specifications RS bit
> >>>>>>>>>> *must* be
> >>>>>> set on
> >>>>>>>> the
> >>>>>>>>>> +   * last descriptor of *every* packet. Therefore we will
> >>>>>>>>>> not allow
> >>>>>> the
> >>>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> >>>>>>>>>> +   */
> >>>>>>>>>> +  if (hw->mac.type > ixgbe_mac_82598EB)
> >>>>>>>>>> + dev_info->default_txconf.tx_rs_thresh = 1;
> >>>>>>>>>> +
> >>>>>>>>>>      dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX *
> >>>>>>>>>> sizeof(uint32_t);
> >>>>>>>>>>      dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
> >>>>>>>>>>      dev_info->flow_type_rss_offloads =
> >>>>>>>>>> IXGBE_RSS_OFFLOAD_ALL; diff --
> >>>>>>>> git
> >>>>>>>>>> a/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>>>>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>>> index
> >>>>>>>>>> 91023b9..8dbdffc 100644
> >>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct
> >>>>>>>>>> rte_eth_dev
> >>>>>>>>>> *dev,
> >>>>>>>>>>      struct ixgbe_tx_queue *txq;
> >>>>>>>>>>      struct ixgbe_hw     *hw;
> >>>>>>>>>>      uint16_t tx_rs_thresh, tx_free_thresh;
> >>>>>>>>>> +  bool rs_deferring_allowed;
> >>>>>>>>>>
> >>>>>>>>>>      PMD_INIT_FUNC_TRACE();
> >>>>>>>>>>      hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> >>>>>>>>>>
> >>>>>>>>>>      /*
> >>>>>>>>>> +   * According to 82599 and x540 specifications RS bit
> >>>>>>>>>> *must* be
> >>>>>> set on
> >>>>>>>> the
> >>>>>>>>>> +   * last descriptor of *every* packet. Therefore we will
> >>>>>>>>>> not allow
> >>>>>> the
> >>>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> >>>>>>>>>> +   */
> >>>>>>>>>> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
> >>>>>>>>>> +
> >>>>>>>>>> +  /*
> >>>>>>>>>>       * Validate number of transmit descriptors.
> >>>>>>>>>>       * It must not exceed hardware maximum, and must be
> >>>>>>>>>> multiple
> >>>>>>>>>>       * of IXGBE_ALIGN.
> >>>>>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct
> >>>>>>>>>> rte_eth_dev
> >>>>>>>> *dev,
> >>>>>>>>>>       * to transmit a packet is greater than the number of
> >>>>>>>>>> free TX
> >>>>>>>>>>       * descriptors.
> >>>>>>>>>>       * The following constraints must be satisfied:
> >>>>>>>>>> +   *  tx_rs_thresh must be less than 2 for NICs for which RS
> >>>>>> deferring is
> >>>>>>>>>> +   *  forbidden (all but 82598).
> >>>>>>>>>>       *  tx_rs_thresh must be greater than 0.
> >>>>>>>>>>       *  tx_rs_thresh must be less than the size of the ring
> >>>>>>>>>> minus 2.
> >>>>>>>>>>       *  tx_rs_thresh must be less than or equal to
> >>>>>>>>>> tx_free_thresh.
> >>>>>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct
> >>>>>>>>>> rte_eth_dev
> >>>>>>>> *dev,
> >>>>>>>>>>       * When set to zero use default values.
> >>>>>>>>>>       */
> >>>>>>>>>>      tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
> >>>>>>>>>> -                  tx_conf->tx_rs_thresh :
> >>>>>>>>>> DEFAULT_TX_RS_THRESH);
> >>>>>>>>>> +                  tx_conf->tx_rs_thresh :
> >>>>>>>>>> +                  (rs_deferring_allowed ?
> >>>>>>>>>> DEFAULT_TX_RS_THRESH :
> >>>>>> 1));
> >>>>>>>>>>      tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
> >>>>>>>>>>                      tx_conf->tx_free_thresh :
> >>>>>>>>>> DEFAULT_TX_FREE_THRESH);
> >>>>>>>>>> +
> >>>>>>>>>> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {
> >>>>>>>>>> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than
> >>>>>>>>>> 2 since
> >>>>>> RS
> >>>>>>>> "
> >>>>>>>>>> + "must be set for every packet for this
> >>>>>> HW. "
> >>>>>>>>>> + "(tx_rs_thresh=%u port=%d queue=%d)",
> >>>>>>>>>> +                       (unsigned int)tx_rs_thresh,
> >>>>>>>>>> + (int)dev->data->port_id, (int)queue_idx);
> >>>>>>>>>> +          return -(EINVAL);
> >>>>>>>>>> +  }
> >>>>>>>>>> +
> >>>>>>>>>>      if (tx_rs_thresh >= (nb_desc - 2)) {
> >>>>>>>>>>              PMD_INIT_LOG(ERR, "tx_rs_thresh must be less
> >>>>>>>>>> than the
> >>>>>>>> number "
> >>>>>>>>>> "of TX descriptors minus 2. (tx_rs_thresh=%u
> >>>>>> "
> >>>>>>>>>> --
> >>>>>>>>>> 2.1.0
> >>
> >
