[dpdk-dev] [RFC] hot plug failure handle mechanism

Matan Azrad matan at mellanox.com
Thu May 24 16:57:48 CEST 2018


Hi Guo

Some questions.

From: Guo Jia
> As we know, hot plug is an important feature, whether it is used for
> datacenter device fail-safe and consumption management, or for
> dynamic deployment and SR-IOV live migration in SDN/NFV. It can bring
> higher flexibility and continuity to networking services in multiple
> industry use cases.
> 
> So let us see: DPDK is an important networking framework that combines
> packet control path/fast path libraries with a diverse set of PMD drivers. What can it
> do to help applications that want to build their own hot plug solution while they are
> processing packets with DPDK?
> 
> We already have a general device event mechanism, the failsafe driver, the bonding
> driver and the hot plug/unplug API in the framework, and apps can use these APIs to
> develop functionality. But for hot plug failure handling, that is, when removing
> a device at run time causes the app to trigger an MMIO error and crash, a
> mechanism to handle the failure on hot unplug is lacking. At present, the
> kernel only guarantees safe hot plug handling on the kernel side. On the user
> mode side, no third-party tools such as udev/driverctl specifically
> cover this part of the mechanism. Considering the feasibility of the
> implementation, runtime performance and generality across almost all user mode
> PMD drivers, a general hot plug failure handling mechanism in the DPDK
> framework is proposed here.
> 
> The hot plug failure handling mechanism would work as below:
> 1. Add a new bus op "handle_hot-unplug" to the bus to handle bus read/write
> errors. It is bus-specific and each kind of bus can implement its own logic.
> 2. Implement the PCI bus specific op "pci_handle_hot_unplug". In this function,
> based on the failure address, remap the memory that belongs to the
> corresponding unplugged device.
> 3. Implement a new sigbus handler and register it when device event
> monitoring starts. Once an MMIO sigbus error occurs, it triggers the above hot
> plug failure handling mechanism, which keeps the app that is processing
> packets from breaking and crashing, so it can go on to clean up, fail-safe
> or do other work.
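
For reference, steps 2 and 3 above could be sketched roughly as below. This is only an illustration under assumptions, not DPDK code: struct dev_region, sigbus_handler(), install_sigbus_handler() and simulate_unplug_and_read() are hypothetical names, a truncated file mapping stands in for the MMIO region of an unplugged device, and a single global region replaces the per-bus device lookup. It is Linux-specific, and mmap() is not on the POSIX list of async-signal-safe functions, so calling it from the handler is itself an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for one device memory region tracked by a bus. */
struct dev_region {
    void *addr;
    size_t len;
    volatile sig_atomic_t unplugged;
};

static struct dev_region g_region;

/* Step 3: SIGBUS handler that dispatches on the faulting address. */
static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t fault = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)g_region.addr;
    void *p;

    (void)sig;
    (void)ctx;

    /* Not a tracked device region: give up instead of looping on the fault. */
    if (g_region.addr == NULL || fault < base || fault >= base + g_region.len)
        _exit(EXIT_FAILURE);

    /* Step 2: remap the dead region to anonymous memory at the same address,
     * so the faulting access can be retried without crashing. */
    p = mmap(g_region.addr, g_region.len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED)
        _exit(EXIT_FAILURE);
    g_region.unplugged = 1;
}

static int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;   /* needed to receive si_addr */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Simulate an unplug: map a file, truncate it, then read the mapping.
 * Returns 0 if the handler recovered the access, -1 otherwise. */
static int simulate_unplug_and_read(uint32_t *value)
{
    char path[] = "/tmp/hotplug_sketch_XXXXXX";
    long pg = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    void *p;

    assert(pg > 0);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, pg) != 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)pg, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    g_region.addr = p;
    g_region.len = (size_t)pg;
    g_region.unplugged = 0;

    /* The backing store disappears: the next access raises SIGBUS. */
    if (ftruncate(fd, 0) != 0) {
        munmap(p, (size_t)pg);
        close(fd);
        return -1;
    }
    close(fd);

    /* Faults, the handler remaps, and the read is retried on zeroed memory. */
    *value = *(volatile uint32_t *)p;

    munmap(p, (size_t)pg);
    g_region.addr = NULL;
    return g_region.unplugged ? 0 : -1;
}
```

A caller would run install_sigbus_handler() once and then poll the unplugged flag from its forwarding loop to start the clean-up. The handler only makes the faulting access survivable; it does not undo whatever operation was in progress when the signal arrived.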

Can you explain more about what happens with all the threads (master thread, host thread, data-path threads)?
Can the signal happen only in a data-path thread, or also in a control thread?

What about resource leaks? (mainly relevant for control threads):
If you jump from the signal address to the restart address, how can you clean up the operation that had started when the signal arrived?

Matan.
> 4. We will also introduce the solution using testpmd to show an example of
> the whole procedure, like this:
> device unplug -> failure handle -> stop forwarding -> stop port -> close port -> detach
> port.
> 
> Best regards,
> 
> Jeff Guo


