[dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598

Vlad Zolotarov vladz at cloudius-systems.com
Thu Aug 20 10:56:25 CEST 2015



On 08/20/15 11:41, Ananyev, Konstantin wrote:
> Hi Vlad,
>
>> -----Original Message-----
>> From: Vlad Zolotarov [mailto:vladz at cloudius-systems.com]
>> Sent: Wednesday, August 19, 2015 11:03 AM
>> To: Ananyev, Konstantin; Lu, Wenzhuo
>> Cc: dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
>>
>>
>>
>> On 08/19/15 10:43, Ananyev, Konstantin wrote:
>>> Hi Vlad,
>>> Sorry for delay with review, I am OOO till next week.
>>> Meanwhile, few questions/comments from me.
>> Hi, Konstantin, long time no see... ;)
>>
>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a
>>>> seastar-based
>>>>>>>> application on x540 NIC.
>>>>>>> Could you help to share with us how to reproduce the tx hang issue,
>>>> with using
>>>>>>> typical DPDK examples?
>>>>>> Sorry. I'm not very familiar with the typical DPDK examples to help u
>>>>>> here. However this is quite irrelevant since without this this patch
>>>>>> ixgbe PMD obviously abuses the HW spec as has been explained above.
>>>>>>
>>>>>> We saw the issue when u stressed the xmit path with a lot of highly
>>>>>> fragmented TCP frames (packets with up to 33 fragments with non-headers
>>>>>> fragments as small as 4 bytes) with all offload features enabled.
>>> Could you provide us with the pcap file to reproduce the issue?
>> Well, the thing is it takes some time to reproduce it (a few minutes of
>> heavy load) therefore a pcap would be quite large.
> Probably you can upload it to some place, from which we will be able to download it?

I'll see what I can do but no promises...

> Or might be you have some sort of scapy script to generate it?
> I suppose we'll need something to reproduce the issue and verify the fix.

Since the original code abuses the HW spec u don't have to... ;)

>
>>> My concern with you approach is that it would affect TX performance.
>> It certainly will ;) But it seem inevitable. See below.
>>
>>> Right now, for simple TX PMD usually reads only (nb_tx_desc/tx_rs_thresh) TXDs,
>>> While with your patch (if I understand it correctly) it has to read all TXDs in the HW TX ring.
>> If by "simple" u refer an always single fragment per Tx packet - then u
>> are absolutely correct.
>>
>> My initial patch was to only set RS on every EOP descriptor without
>> changing the rs_thresh value and this patch worked.
>> However HW spec doesn't ensure in a general case that packets are always
>> handled/completion write-back completes in the same order the packets
>> are placed on the ring (see "Tx arbitration schemes" chapter in 82599
>> spec for instance). Therefore AFAIU one should not assume that if
>> packet[x+1] DD bit is set then packet[x] is completed too.
>  From my understanding, TX arbitration controls the order in which TXDs from
> different queues are fetched/processed.
> But descriptors from the same TX queue are processed in FIFO order.
> So, I think that  - yes, if TXD[x+1] DD bit is set, then TXD[x] is completed too,
> and setting RS on every EOP TXD should be enough.

Ok. I'll rework the patch under this assumption then.

>
>> That's why I changed the patch to be as u see it now. However if I miss
>> something here and your HW people ensure the in-order completion this of
>> course may be changed back.
>>
>>> Even if we really need to setup RS bit in each TXD (I still doubt we really do) - ,
>> Well, if u doubt u may ask the guys from the Intel networking division
>> that wrote the 82599 and x540 HW specs where they clearly state that. ;)
> Good point, we'll see what we can do here :)
> Konstantin
>
>>> I think inside PMD it still should be possible to check TX completion in chunks.
>>> Konstantin
>>>
>>>
>>>>>> Thanks,
>>>>>> vlad
>>>>>>>> Signed-off-by: Vlad Zolotarov <vladz at cloudius-systems.com>
>>>>>>>> ---
>>>>>>>>     drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
>>>>>>>>     drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
>>>>>>>>     2 files changed, 31 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>> b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>> index b8ee1e9..6714fd9 100644
>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev,
>>>>>> struct
>>>>>>>> rte_eth_dev_info *dev_info)
>>>>>>>>              .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
>>>>>>>>                              ETH_TXQ_FLAGS_NOOFFLOADS,
>>>>>>>>      };
>>>>>>>> +
>>>>>>>> +  /*
>>>>>>>> +   * According to 82599 and x540 specifications RS bit *must* be
>>>> set on
>>>>>> the
>>>>>>>> +   * last descriptor of *every* packet. Therefore we will not allow
>>>> the
>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>> +   */
>>>>>>>> +  if (hw->mac.type > ixgbe_mac_82598EB)
>>>>>>>> +          dev_info->default_txconf.tx_rs_thresh = 1;
>>>>>>>> +
>>>>>>>>      dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);
>>>>>>>>      dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
>>>>>>>>      dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --
>>>>>> git
>>>>>>>> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>> index
>>>>>>>> 91023b9..8dbdffc 100644
>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
>>>>>>>> *dev,
>>>>>>>>      struct ixgbe_tx_queue *txq;
>>>>>>>>      struct ixgbe_hw     *hw;
>>>>>>>>      uint16_t tx_rs_thresh, tx_free_thresh;
>>>>>>>> +  bool rs_deferring_allowed;
>>>>>>>>
>>>>>>>>      PMD_INIT_FUNC_TRACE();
>>>>>>>>      hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>>>>
>>>>>>>>      /*
>>>>>>>> +   * According to 82599 and x540 specifications RS bit *must* be
>>>> set on
>>>>>> the
>>>>>>>> +   * last descriptor of *every* packet. Therefore we will not allow
>>>> the
>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>> +   */
>>>>>>>> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
>>>>>>>> +
>>>>>>>> +  /*
>>>>>>>>       * Validate number of transmit descriptors.
>>>>>>>>       * It must not exceed hardware maximum, and must be multiple
>>>>>>>>       * of IXGBE_ALIGN.
>>>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
>>>>>> *dev,
>>>>>>>>       * to transmit a packet is greater than the number of free TX
>>>>>>>>       * descriptors.
>>>>>>>>       * The following constraints must be satisfied:
>>>>>>>> +   *  tx_rs_thresh must be less than 2 for NICs for which RS
>>>> deferring is
>>>>>>>> +   *  forbidden (all but 82598).
>>>>>>>>       *  tx_rs_thresh must be greater than 0.
>>>>>>>>       *  tx_rs_thresh must be less than the size of the ring minus 2.
>>>>>>>>       *  tx_rs_thresh must be less than or equal to tx_free_thresh.
>>>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
>>>>>> *dev,
>>>>>>>>       * When set to zero use default values.
>>>>>>>>       */
>>>>>>>>      tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
>>>>>>>> -                  tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
>>>>>>>> +                  tx_conf->tx_rs_thresh :
>>>>>>>> +                  (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH :
>>>> 1));
>>>>>>>>      tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
>>>>>>>>                      tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
>>>>>>>> +
>>>>>>>> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {
>>>>>>>> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since
>>>> RS
>>>>>> "
>>>>>>>> +                            "must be set for every packet for this
>>>> HW. "
>>>>>>>> +                            "(tx_rs_thresh=%u port=%d queue=%d)",
>>>>>>>> +                       (unsigned int)tx_rs_thresh,
>>>>>>>> +                       (int)dev->data->port_id, (int)queue_idx);
>>>>>>>> +          return -(EINVAL);
>>>>>>>> +  }
>>>>>>>> +
>>>>>>>>      if (tx_rs_thresh >= (nb_desc - 2)) {
>>>>>>>>              PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the
>>>>>> number "
>>>>>>>>                           "of TX descriptors minus 2. (tx_rs_thresh=%u
>>>> "
>>>>>>>> --
>>>>>>>> 2.1.0



More information about the dev mailing list