[dpdk-dev] net/failsafe: fix Rx clean race

Message ID 1508651468-31866-1-git-send-email-matan@mellanox.com (mailing list archive)
State Accepted, archived
Delegated to: Ferruh Yigit
Checks

Context               Check    Description
ci/checkpatch         success  coding style OK
ci/Intel-compilation  success  Compilation OK

Commit Message

Matan Azrad Oct. 22, 2017, 5:51 a.m. UTC
In case of plug-out, the RMV interrupt callback sets the remove flag of
the removed sub-device. The next hotplug alarm cycle should read this
flag and, if the data path is clean, remove the sub-device.

When the application calls the fail-safe Rx burst, fail-safe tries to
call the rx_burst function of every STARTED sub-device. The remove flag
is not checked here, so fail-safe may call the rx_burst function of the
removed sub-device.

These two cases run in different threads, so there is a race between
the Rx clean check of the removed sub-device and an rx_burst call on
that sub-device, which makes the sub-device Rx unclean.

If the application calls rx_burst in a loop, the probability of finding
Rx clean is low, especially when there are few sub-devices or when the
rx_burst function of the removed sub-device takes a long time.

Each time the sub-device data path is found unclean, the next
opportunity to check it comes only in the next hotplug alarm cycle; the
default time between cycles is 2 seconds.

While fail-safe keeps trying to remove the sub-device, the sub-device
may reappear, and fail-safe cannot plug it back in until the removal
process is completed. During this time fail-safe may lose the services
of the primary sub-device, which may hurt application performance.

This patch adds a remove flag check to the safe rx_burst function.
This way, at most one more hotplug alarm cycle is needed to get the
sub-device clean for actual removal.

Fixes: 72a57bfd9a0e ("net/failsafe: add fast burst functions")
Cc: stable@dpdk.org

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/failsafe/failsafe_rxtx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
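
The interaction described in the commit message can be sketched as a
small standalone model. This is only an illustration under stated
assumptions: the struct, the reference counter and the model_* functions
below are hypothetical and are not the driver's actual code. The polling
side marks itself as inside the sub-device Rx burst, and the hotplug
alarm cycle removes the sub-device only when the remove flag is set and
nobody is inside.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one sub-device; not the real fail-safe struct. */
struct model_sdev {
	atomic_uint remove;    /* set by the (modeled) RMV interrupt callback */
	atomic_uint rx_refcnt; /* > 0 while a poller is inside the Rx burst */
	bool removed;          /* set once the hotplug cycle removed it */
};

/*
 * Polling thread: try to enter the sub-device Rx burst.
 * check_remove == false models the behavior before the patch,
 * check_remove == true models the behavior after it.
 * Returns false when the sub-device is skipped.
 */
static bool
model_rx_enter(struct model_sdev *s, bool check_remove)
{
	if (check_remove && atomic_load(&s->remove) != 0)
		return false;
	atomic_fetch_add(&s->rx_refcnt, 1);
	return true;
}

/* Polling thread: leave the sub-device Rx burst. */
static void
model_rx_leave(struct model_sdev *s)
{
	atomic_fetch_sub(&s->rx_refcnt, 1);
}

/*
 * Hotplug alarm cycle: remove the sub-device only if it is flagged
 * for removal and its Rx data path is clean (no poller inside).
 * Returns true once the sub-device has been removed.
 */
static bool
model_hotplug_cycle(struct model_sdev *s)
{
	if (atomic_load(&s->remove) != 0 &&
	    atomic_load(&s->rx_refcnt) == 0)
		s->removed = true;
	return s->removed;
}
```

In this model, a poller that entered before the flag was set can make
one hotplug cycle find the data path unclean. With the check enabled it
cannot enter again afterwards, so the following cycle finds the data
path clean.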
  

Comments

Gaëtan Rivet Oct. 26, 2017, 4:20 p.m. UTC | #1
Hello Matan,

I think the commit log could be shorter.
Proposing this, feel free to expand it if you prefer.

---8<---

When removing a device, the fail-safe checks that it is not within its
datapath before cleaning it.

When checking whether an Rx burst should be performed on a device, the
remove flag is not checked. Thus the port could still enter its datapath
and miss a removal round. Furthermore, there is a race between the
thread removing the device and the polling thread.

Check the remove flag before entering a sub-device Rx burst when in safe
mode. This check mitigates the aforementioned race condition.

--->8---

Otherwise,

Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>

  
Ferruh Yigit Oct. 26, 2017, 7:10 p.m. UTC | #2
On 10/26/2017 9:20 AM, Gaëtan Rivet wrote:
> Acked-by: Gaetan Rivet <gaetan.rivet@6wind.com>

Applied to dpdk-next-net/master, thanks.

(used suggested commit log, thanks.)
  

Patch

diff --git a/drivers/net/failsafe/failsafe_rxtx.c b/drivers/net/failsafe/failsafe_rxtx.c
index 7311421..70157c8 100644
--- a/drivers/net/failsafe/failsafe_rxtx.c
+++ b/drivers/net/failsafe/failsafe_rxtx.c
@@ -43,7 +43,8 @@ 
 {
 	return (ETH(sdev) == NULL) ||
 		(ETH(sdev)->rx_pkt_burst == NULL) ||
-		(sdev->state != DEV_STARTED);
+		(sdev->state != DEV_STARTED) ||
+		(sdev->remove != 0);
 }
 
 static inline int
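
For readers without the driver source at hand, the effect of the extra
condition can be shown with a self-contained sketch. All types and
model_* names here are hypothetical stand-ins (the `eth` member plays
the role of `ETH(sdev)`), and the burst loop is a simplification rather
than the driver's actual Rx path.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's types; not the real structs. */
enum model_dev_state { DEV_UNDEFINED, DEV_ACTIVE, DEV_STARTED };

typedef uint16_t (*model_rx_burst_fn)(void *rxq, void **pkts, uint16_t n);

struct model_eth_dev {
	model_rx_burst_fn rx_pkt_burst;
};

struct model_sub_device {
	struct model_eth_dev *eth;    /* plays the role of ETH(sdev) */
	enum model_dev_state state;
	volatile unsigned int remove; /* set by the RMV callback */
};

/* Same shape as the patched predicate: true means "do not poll". */
static inline bool
model_rx_unsafe(const struct model_sub_device *sdev)
{
	return (sdev->eth == NULL) ||
		(sdev->eth->rx_pkt_burst == NULL) ||
		(sdev->state != DEV_STARTED) ||
		(sdev->remove != 0);
}

/* Stub Rx burst: pretends one packet is available when n > 0. */
static uint16_t
model_stub_rx_burst(void *rxq, void **pkts, uint16_t n)
{
	(void)rxq;
	(void)pkts;
	return n > 0 ? 1 : 0;
}

/* Simplified safe burst: skip unsafe sub-devices, return first result. */
static uint16_t
model_safe_rx_burst(struct model_sub_device *sdevs, size_t nb_sdevs,
		    void **pkts, uint16_t n)
{
	for (size_t i = 0; i < nb_sdevs; i++) {
		uint16_t nb_rx;

		if (model_rx_unsafe(&sdevs[i]))
			continue;
		nb_rx = sdevs[i].eth->rx_pkt_burst(NULL, pkts, n);
		if (nb_rx > 0)
			return nb_rx;
	}
	return 0;
}
```

With `remove` left at 0, a STARTED sub-device is polled as before; once
the flag is set, the predicate reports it as unsafe and the burst loop
skips it.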