[PATCH v2] net/failsafe: link_update request crashing at boot

Gaëtan Rivet grive at u256.net
Mon Nov 22 11:23:01 CET 2021


On Thu, Oct 21, 2021, at 23:42, vipul.ashri at oracle.com wrote:
> From: Vipul Ashri <vipul.ashri at oracle.com>
>
> failsafe crashed while sending early link_update request during
> boot time initialization.
> Based on debugging we found failsafe device was good but sub-
> devices were progressing towards initialization and SUBOPS macro
> where expanding macro gives [partial_dev]->dev_ops->link_update()
> execution of which triggered crash because dev_ops==0. similar
> crash seen at failsafe_eth_dev_close()
>
> Failsafe driver need a separate check for subdevices similar to
> "RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);" which is
> called to almost every eth_dev function.
>
> Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> Cc: stable at dpdk.org
> Signed-off-by: Vipul Ashri <vipul.ashri at oracle.com>

Hello Vipul,

I'm sorry for the delay, I missed your fix on the mailing list.

IIUC, the issue is that failsafe finished init and received an ethdev
operation call, but one of its sub-devices, although marked DEV_ACTIVE,
has its eth_dev->dev_ops field set to NULL.

It is really surprising to me, because there aren't many ways for a sub-device
to become DEV_ACTIVE.

The only two ways are:

  * by executing 'fs_dev_configure()', which will first execute
    rte_eth_dev_configure() on the sub-device, and on error would
    stop *without* setting DEV_ACTIVE.
    rte_eth_dev_configure() will itself execute
    RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV), so it would
    return negative errno and fs_dev_configure() would abort.

  * by executing 'fs_dev_remove()' on a sub-device that was DEV_STARTED
    to begin with; it is then demoted to DEV_ACTIVE once stopped.

So I don't understand yet how it is possible for a sub-device to become DEV_ACTIVE
while its eth_dev->dev_ops is NULL. It looks more like a bug, memory corruption, or
just an unexpected execution pattern.
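
To make the reasoning above concrete, here is a minimal, self-contained C
model of the two transitions into the active state, plus a defensive check in
the spirit of the proposed fix. It is only a sketch and not the failsafe
driver code: every name in it (sub_state, model_sub, model_configure(),
model_stop_on_remove(), model_guarded_link_update(), the port_valid flag) is
a hypothetical stand-in, and the port validity test is reduced to a simple
flag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the sub-device states; not the driver's enum. */
enum sub_state { SUB_UNDEFINED, SUB_PARSED, SUB_PROBED, SUB_ACTIVE, SUB_STARTED };

/* Hypothetical ops table with a single callback. */
struct model_ops {
	int (*link_update)(void);
};

/* Hypothetical sub-device: a state, a validity flag and an ops pointer. */
struct model_sub {
	enum sub_state state;
	int port_valid;                  /* assumption: models the port-id check */
	const struct model_ops *dev_ops; /* may be NULL on a half-initialized port */
};

static int model_link_update(void) { return 0; }
static const struct model_ops model_ops_table = { .link_update = model_link_update };

/* Path 1: configure. On error, return *without* promoting to SUB_ACTIVE. */
static int model_configure(struct model_sub *sub)
{
	assert(sub != NULL);
	if (!sub->port_valid)
		return -ENODEV; /* mirrors the "-ENODEV on invalid port" behaviour */
	sub->state = SUB_ACTIVE;
	return 0;
}

/* Path 2: a started sub-device falls back to SUB_ACTIVE once stopped. */
static void model_stop_on_remove(struct model_sub *sub)
{
	assert(sub != NULL);
	if (sub->state == SUB_STARTED)
		sub->state = SUB_ACTIVE;
}

/* Defensive variant: refuse the call unless the sub-device is usable. */
static int model_guarded_link_update(const struct model_sub *sub)
{
	if (sub == NULL || sub->state < SUB_ACTIVE || sub->dev_ops == NULL)
		return -ENODEV;
	return sub->dev_ops->link_update();
}
```

In this model a sub-device can only reach SUB_ACTIVE through one of the two
functions above, so a NULL dev_ops in that state would have to come from
somewhere outside those paths. The guarded call shows how an extra check
would turn the crash into an error return, although it would not explain how
the state was reached.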

Could you describe the execution in more detail?
In particular, please set the EAL log level to debug with the option:
' --log-level pmd.net.failsafe:debug '
for example while running testpmd or your DPDK app.
It should show the ethdev-level accesses to the sub-devices, and their error values.

Best regards,
-- 
Gaetan Rivet

