[dpdk-dev,v3,6/6] net/failsafe: fix removed device handling

Message ID 1513703415-29145-7-git-send-email-matan@mellanox.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Matan Azrad Dec. 19, 2017, 5:10 p.m. UTC
  There is time between the physical removal of the device until
sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
applications still don't know about the removal and may call sub-device
control operation which should return an error.

In previous code this error is reported to the application contrary to
fail-safe principle that the app should not be aware of device removal.

Add an removal check in each relevant control command error flow and
prevent an error report to application when the sub-device is removed.

Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
Fixes: b737a1e ("net/failsafe: support flow API")

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/failsafe/failsafe_flow.c    | 18 ++++++++++-------
 drivers/net/failsafe/failsafe_ops.c     | 35 ++++++++++++++++++++++-----------
 drivers/net/failsafe/failsafe_private.h | 12 +++++++++++
 3 files changed, 47 insertions(+), 18 deletions(-)
  

Comments

Gaëtan Rivet Dec. 19, 2017, 10:21 p.m. UTC | #1
Hi Matan,

On Tue, Dec 19, 2017 at 05:10:15PM +0000, Matan Azrad wrote:
> There is time between the physical removal of the device until
> sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> applications still don't know about the removal and may call sub-device
> control operation which should return an error.
> 
> In previous code this error is reported to the application contrary to
> fail-safe principle that the app should not be aware of device removal.
> 
> Add an removal check in each relevant control command error flow and
> prevent an error report to application when the sub-device is removed.
> 
> Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> Fixes: b737a1e ("net/failsafe: support flow API")
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>
> ---

<snip>

> +/*
> + * Check if error should be reported to the user.
> + */
> +static inline bool
> +fs_is_error(struct sub_device *sdev, int err)
> +{
> +	/* A device removal shouldn't be reported as an error. */
> +	if (err == 0 || sdev->remove == 1 || err == -EIO)
> +		return false;
> +	return true;
> +}

This is better, thanks.

However is there a reason you did not follow the same pattern as ethdev with
eth_err? I see the two functions as similar in their intent, making them
close to each other would be clearer to a reader being familiar with the
ethdev API and that would be interested in fail-safe.

What do you think?
  
Matan Azrad Dec. 20, 2017, 10:58 a.m. UTC | #2
Hi Gaetan

> -----Original Message-----
> From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> Sent: Wednesday, December 20, 2017 12:22 AM
> To: Matan Azrad <matan@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> <thomas@monjalon.net>; dev@dpdk.org
> Subject: Re: [PATCH v3 6/6] net/failsafe: fix removed device handling
> 
> Hi Matan,
> 
> On Tue, Dec 19, 2017 at 05:10:15PM +0000, Matan Azrad wrote:
> > There is time between the physical removal of the device until
> > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > applications still don't know about the removal and may call
> > sub-device control operation which should return an error.
> >
> > In previous code this error is reported to the application contrary to
> > fail-safe principle that the app should not be aware of device removal.
> >
> > Add an removal check in each relevant control command error flow and
> > prevent an error report to application when the sub-device is removed.
> >
> > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > Fixes: b737a1e ("net/failsafe: support flow API")
> >
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > ---
> 
> <snip>
> 
> > +/*
> > + * Check if error should be reported to the user.
> > + */
> > +static inline bool
> > +fs_is_error(struct sub_device *sdev, int err) {
> > +	/* A device removal shouldn't be reported as an error. */
> > +	if (err == 0 || sdev->remove == 1 || err == -EIO)
> > +		return false;
> > +	return true;
> > +}
> 
> This is better, thanks.
> 
> However is there a reason you did not follow the same pattern as ethdev
> with eth_err? I see the two functions as similar in their intent, making them
> close to each other would be clearer to a reader being familiar with the
> ethdev API and that would be interested in fail-safe.
> 
> What do you think?
> 

I think that there is a real different between eth_err function to fs_is_error:
ethdev uses eth_err function to adjust removal return value to be -EIO.
fail-safe uses fs_is_error function to check if an error should be reported to the user to save the fail-safe principle that the app should not be aware of device removal  -  this is the main idea that also causes me to change the name from fs_is_removed to fs_is_error.

> --
> Gaëtan Rivet
> 6WIND
  
Gaëtan Rivet Jan. 8, 2018, 10:57 a.m. UTC | #3
Hi Matan,

Sorry for the delay on this.

On Wed, Dec 20, 2017 at 10:58:29AM +0000, Matan Azrad wrote:
> Hi Gaetan
> 
> > -----Original Message-----
> > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > Sent: Wednesday, December 20, 2017 12:22 AM
> > To: Matan Azrad <matan@mellanox.com>
> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> > <thomas@monjalon.net>; dev@dpdk.org
> > Subject: Re: [PATCH v3 6/6] net/failsafe: fix removed device handling
> > 
> > Hi Matan,
> > 
> > On Tue, Dec 19, 2017 at 05:10:15PM +0000, Matan Azrad wrote:
> > > There is time between the physical removal of the device until
> > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > > applications still don't know about the removal and may call
> > > sub-device control operation which should return an error.
> > >
> > > In previous code this error is reported to the application contrary to
> > > fail-safe principle that the app should not be aware of device removal.
> > >
> > > Add an removal check in each relevant control command error flow and
> > > prevent an error report to application when the sub-device is removed.
> > >
> > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > Fixes: b737a1e ("net/failsafe: support flow API")

As stated previously, please do not include those fixes lines.

> > >
> > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > ---
> > 
> > <snip>
> > 
> > > +/*
> > > + * Check if error should be reported to the user.
> > > + */
> > > +static inline bool
> > > +fs_is_error(struct sub_device *sdev, int err) {
> > > +	/* A device removal shouldn't be reported as an error. */
> > > +	if (err == 0 || sdev->remove == 1 || err == -EIO)
> > > +		return false;
> > > +	return true;
> > > +}
> > 
> > This is better, thanks.
> > 
> > However is there a reason you did not follow the same pattern as ethdev
> > with eth_err? I see the two functions as similar in their intent, making them
> > close to each other would be clearer to a reader being familiar with the
> > ethdev API and that would be interested in fail-safe.
> > 
> > What do you think?
> > 
> 
> I think that there is a real different between eth_err function to fs_is_error:
> ethdev uses eth_err function to adjust removal return value to be -EIO.
> fail-safe uses fs_is_error function to check if an error should be reported to the user to save the fail-safe principle that the app should not be aware of device removal  -  this is the main idea that also causes me to change the name from fs_is_removed to fs_is_error.

I would have preferred if it followed the same pattern as ethdev (that
function be used to adjust the return value, not performing a flag check).

While better on its own, the pattern:

    if (fs_is_error(sdev, err)) {
            ERROR("xxxx");
            return err;
    }

is dangerous, as then the author is forbidden from returning err, assuming
err could be -EIO. He or she would be forced to return an explicit "0".
To be clear, here would be an easy mistake to do:

    if (fs_is_error(sdev, err)) {
            ERROR("xxxx");
    }
    return err;

And this kind of code-flow is not unusual, or even unwanted.
I dislike having this kind of implicit rule derived from using a helper
such as fs_is_error().

The alternative

    if ((err = fs_err(sdev, err))) {
            ERROR("xxxx");
            return err;
    }

Forces the value err to be set to the correct one.

This mistake can already be found in your patch:

> @@ -150,7 +150,7 @@
>                         continue;
>                 local_ret = rte_flow_destroy(PORT_ID(sdev),
>                                 flow->flows[i], error);
> -               if (local_ret) {
> +               if (fs_is_error(sdev, local_ret)) {
>                         ERROR("Failed to destroy flow on sub_device %d: %d",
>                                         i, local_ret);
>                         if (ret == 0)

Your environment does not include the function, but this is within
fs_flow_destroy (please update to include the context by the way
it helps a lot the review :). Afterward, line 162 ret is directly
used as return value.

Also, fs_err() would need to transform rte_errno when relevant (mostly
in failsafe_flow.c I think).

This is the kind of subtlety that needs to be avoided when designing
APIs, even internal ones. This will induce errors afterward and
complicate the maintenance of the codebase.

Best regards,
  
Matan Azrad Jan. 8, 2018, 12:55 p.m. UTC | #4
Hi Gaetan

From: Gaëtan Rivet, Monday, January 8, 2018 12:58 PM
> Hi Matan,
> 
> Sorry for the delay on this.
> 

It's OK in spite of I need to fetch it back :)

> On Wed, Dec 20, 2017 at 10:58:29AM +0000, Matan Azrad wrote:
> > Hi Gaetan
> >
> > > -----Original Message-----
> > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > Sent: Wednesday, December 20, 2017 12:22 AM
> > > To: Matan Azrad <matan@mellanox.com>
> > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> > > <thomas@monjalon.net>; dev@dpdk.org
> > > Subject: Re: [PATCH v3 6/6] net/failsafe: fix removed device
> > > handling
> > >
> > > Hi Matan,
> > >
> > > On Tue, Dec 19, 2017 at 05:10:15PM +0000, Matan Azrad wrote:
> > > > There is time between the physical removal of the device until
> > > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > > > applications still don't know about the removal and may call
> > > > sub-device control operation which should return an error.
> > > >
> > > > In previous code this error is reported to the application
> > > > contrary to fail-safe principle that the app should not be aware of
> device removal.
> > > >
> > > > Add an removal check in each relevant control command error flow
> > > > and prevent an error report to application when the sub-device is
> removed.
> > > >
> > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > > Fixes: b737a1e ("net/failsafe: support flow API")
> 
> As stated previously, please do not include those fixes lines.
> 
> > > >
> > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > ---
> > >
> > > <snip>
> > >
> > > > +/*
> > > > + * Check if error should be reported to the user.
> > > > + */
> > > > +static inline bool
> > > > +fs_is_error(struct sub_device *sdev, int err) {
> > > > +	/* A device removal shouldn't be reported as an error. */
> > > > +	if (err == 0 || sdev->remove == 1 || err == -EIO)
> > > > +		return false;
> > > > +	return true;
> > > > +}
> > >
> > > This is better, thanks.
> > >
> > > However is there a reason you did not follow the same pattern as
> > > ethdev with eth_err? I see the two functions as similar in their
> > > intent, making them close to each other would be clearer to a reader
> > > being familiar with the ethdev API and that would be interested in fail-
> safe.
> > >
> > > What do you think?
> > >
> >
> > I think that there is a real different between eth_err function to
> fs_is_error:
> > ethdev uses eth_err function to adjust removal return value to be -EIO.
> > fail-safe uses fs_is_error function to check if an error should be reported to
> the user to save the fail-safe principle that the app should not be aware of
> device removal  -  this is the main idea that also causes me to change the
> name from fs_is_removed to fs_is_error.
> 
> I would have preferred if it followed the same pattern as ethdev (that
> function be used to adjust the return value, not performing a flag check).
> 
> While better on its own, the pattern:
> 
>     if (fs_is_error(sdev, err)) {
>             ERROR("xxxx");
>             return err;
>     }
> 
> is dangerous, as then the author is forbidden from returning err, assuming
> err could be -EIO. He or she would be forced to return an explicit "0".
> To be clear, here would be an easy mistake to do:
> 
>     if (fs_is_error(sdev, err)) {
>             ERROR("xxxx");
>     }
>     return err;
> 
> And this kind of code-flow is not unusual, or even unwanted.
> I dislike having this kind of implicit rule derived from using a helper such as
> fs_is_error().
> 
> The alternative
> 
>     if ((err = fs_err(sdev, err))) {
>             ERROR("xxxx");
>             return err;
>     }
> 
> Forces the value err to be set to the correct one.
> 
Good point, will change it.

> This mistake can already be found in your patch:
> 
> > @@ -150,7 +150,7 @@
> >                         continue;
> >                 local_ret = rte_flow_destroy(PORT_ID(sdev),
> >                                 flow->flows[i], error);
> > -               if (local_ret) {
> > +               if (fs_is_error(sdev, local_ret)) {
> >                         ERROR("Failed to destroy flow on sub_device %d: %d",
> >                                         i, local_ret);
> >                         if (ret == 0)
> 

Sorry, I can't see any issue here.

> Your environment does not include the function, but this is within
> fs_flow_destroy (please update to include the context by the way it helps a
> lot the review :). Afterward, line 162 ret is directly used as return value.
> 
I don't understand what do you mean.

> Also, fs_err() would need to transform rte_errno when relevant (mostly in
> failsafe_flow.c I think).
> 
Your suggestion is always to update rte_errno to 0 in case the error is because of removal?

> This is the kind of subtlety that needs to be avoided when designing APIs,
> even internal ones. This will induce errors afterward and complicate the
> maintenance of the codebase.
>

Thanks for the lesson! 
 
> Best regards,
> 
> --
> Gaëtan Rivet
> 6WIND
  
Gaëtan Rivet Jan. 8, 2018, 1:46 p.m. UTC | #5
On Mon, Jan 08, 2018 at 12:55:49PM +0000, Matan Azrad wrote:
> Hi Gaetan
> 
> From: Gaëtan Rivet, Monday, January 8, 2018 12:58 PM
> > Hi Matan,
> > 
> > Sorry for the delay on this.
> > 
> 
> It's OK in spite of I need to fetch it back :)
> 
> > On Wed, Dec 20, 2017 at 10:58:29AM +0000, Matan Azrad wrote:
> > > Hi Gaetan
> > >
> > > > -----Original Message-----
> > > > From: Gaëtan Rivet [mailto:gaetan.rivet@6wind.com]
> > > > Sent: Wednesday, December 20, 2017 12:22 AM
> > > > To: Matan Azrad <matan@mellanox.com>
> > > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; Thomas Monjalon
> > > > <thomas@monjalon.net>; dev@dpdk.org
> > > > Subject: Re: [PATCH v3 6/6] net/failsafe: fix removed device
> > > > handling
> > > >
> > > > Hi Matan,
> > > >
> > > > On Tue, Dec 19, 2017 at 05:10:15PM +0000, Matan Azrad wrote:
> > > > > There is time between the physical removal of the device until
> > > > > sub-device PMDs get a RMV interrupt. At this time DPDK PMDs and
> > > > > applications still don't know about the removal and may call
> > > > > sub-device control operation which should return an error.
> > > > >
> > > > > In previous code this error is reported to the application
> > > > > contrary to fail-safe principle that the app should not be aware of
> > device removal.
> > > > >
> > > > > Add an removal check in each relevant control command error flow
> > > > > and prevent an error report to application when the sub-device is
> > removed.
> > > > >
> > > > > Fixes: a46f8d5 ("net/failsafe: add fail-safe PMD")
> > > > > Fixes: b737a1e ("net/failsafe: support flow API")
> > 
> > As stated previously, please do not include those fixes lines.
> > 
> > > > >
> > > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > > ---
> > > >
> > > > <snip>
> > > >
> > > > > +/*
> > > > > + * Check if error should be reported to the user.
> > > > > + */
> > > > > +static inline bool
> > > > > +fs_is_error(struct sub_device *sdev, int err) {
> > > > > +	/* A device removal shouldn't be reported as an error. */
> > > > > +	if (err == 0 || sdev->remove == 1 || err == -EIO)
> > > > > +		return false;
> > > > > +	return true;
> > > > > +}
> > > >
> > > > This is better, thanks.
> > > >
> > > > However is there a reason you did not follow the same pattern as
> > > > ethdev with eth_err? I see the two functions as similar in their
> > > > intent, making them close to each other would be clearer to a reader
> > > > being familiar with the ethdev API and that would be interested in fail-
> > safe.
> > > >
> > > > What do you think?
> > > >
> > >
> > > I think that there is a real different between eth_err function to
> > fs_is_error:
> > > ethdev uses eth_err function to adjust removal return value to be -EIO.
> > > fail-safe uses fs_is_error function to check if an error should be reported to
> > the user to save the fail-safe principle that the app should not be aware of
> > device removal  -  this is the main idea that also causes me to change the
> > name from fs_is_removed to fs_is_error.
> > 
> > I would have preferred if it followed the same pattern as ethdev (that
> > function be used to adjust the return value, not performing a flag check).
> > 
> > While better on its own, the pattern:
> > 
> >     if (fs_is_error(sdev, err)) {
> >             ERROR("xxxx");
> >             return err;
> >     }
> > 
> > is dangerous, as then the author is forbidden from returning err, assuming
> > err could be -EIO. He or she would be forced to return an explicit "0".
> > To be clear, here would be an easy mistake to do:
> > 
> >     if (fs_is_error(sdev, err)) {
> >             ERROR("xxxx");
> >     }
> >     return err;
> > 
> > And this kind of code-flow is not unusual, or even unwanted.
> > I dislike having this kind of implicit rule derived from using a helper such as
> > fs_is_error().
> > 
> > The alternative
> > 
> >     if ((err = fs_err(sdev, err))) {
> >             ERROR("xxxx");
> >             return err;
> >     }
> > 
> > Forces the value err to be set to the correct one.
> > 
> Good point, will change it.
> 
> > This mistake can already be found in your patch:
> > 
> > > @@ -150,7 +150,7 @@
> > >                         continue;
> > >                 local_ret = rte_flow_destroy(PORT_ID(sdev),
> > >                                 flow->flows[i], error);
> > > -               if (local_ret) {
> > > +               if (fs_is_error(sdev, local_ret)) {
> > >                         ERROR("Failed to destroy flow on sub_device %d: %d",
> > >                                         i, local_ret);
> > >                         if (ret == 0)
> > 
> 
> Sorry, I can't see any issue here.
> 

You're right, actually the code would still be correct.
I checked again the rest of the edit, there shouldn't be any issue,
usually "0" is explicitly returned.

Still, the point stands.

> > Your environment does not include the function, but this is within
> > fs_flow_destroy (please update to include the context by the way it helps a
> > lot the review :). Afterward, line 162 ret is directly used as return value.
> > 
> I don't understand what do you mean.
> 
> > Also, fs_err() would need to transform rte_errno when relevant (mostly in
> > failsafe_flow.c I think).
> > 
> Your suggestion is always to update rte_errno to 0 in case the error is because of removal?
> 

If the error is indeed due to the device being absent, then rte_errno
should be set back to its previous value I think.
  
Matan Azrad Jan. 8, 2018, 2 p.m. UTC | #6
Hi Gaetan

From: Gaëtan Rivet, Monday, January 8, 2018 3:47 PM
> On Mon, Jan 08, 2018 at 12:55:49PM +0000, Matan Azrad wrote:
> > Hi Gaetan
> >
> > From: Gaëtan Rivet, Monday, January 8, 2018 12:58 PM
> > > Hi Matan,
> > >
> > > Sorry for the delay on this.
> > >
> >
> > It's OK in spite of I need to fetch it back :)
> >
> > > On Wed, Dec 20, 2017 at 10:58:29AM +0000, Matan Azrad wrote:
<snip>
> > > And this kind of code-flow is not unusual, or even unwanted.
> > > I dislike having this kind of implicit rule derived from using a
> > > helper such as fs_is_error().
> > >
> > > The alternative
> > >
> > >     if ((err = fs_err(sdev, err))) {
> > >             ERROR("xxxx");
> > >             return err;
> > >     }
> > >
> > > Forces the value err to be set to the correct one.
> > >
> > Good point, will change it.
> >
> > > This mistake can already be found in your patch:
> > >
> > > > @@ -150,7 +150,7 @@
> > > >                         continue;
> > > >                 local_ret = rte_flow_destroy(PORT_ID(sdev),
> > > >                                 flow->flows[i], error);
> > > > -               if (local_ret) {
> > > > +               if (fs_is_error(sdev, local_ret)) {
> > > >                         ERROR("Failed to destroy flow on sub_device %d: %d",
> > > >                                         i, local_ret);
> > > >                         if (ret == 0)
> > >
> >
> > Sorry, I can't see any issue here.
> >
> 
> You're right, actually the code would still be correct.
> I checked again the rest of the edit, there shouldn't be any issue, usually "0"
> is explicitly returned.
> 
> Still, the point stands.
> 
Yes.

> > > Your environment does not include the function, but this is within
> > > fs_flow_destroy (please update to include the context by the way it
> > > helps a lot the review :). Afterward, line 162 ret is directly used as return
> value.
> > >
> > I don't understand what do you mean.
> >
> > > Also, fs_err() would need to transform rte_errno when relevant
> > > (mostly in failsafe_flow.c I think).
> > >
> > Your suggestion is always to update rte_errno to 0 in case the error is
> because of removal?
> >
> 
> If the error is indeed due to the device being absent, then rte_errno should
> be set back to its previous value I think.
So, I think it will require old rte_errno save before each device command...
Why not to set it to 0 in the special case(removal) by the new internal API?

> 
> --
> Gaëtan Rivet
> 6WIND
  
Gaëtan Rivet Jan. 8, 2018, 2:31 p.m. UTC | #7
On Mon, Jan 08, 2018 at 02:00:54PM +0000, Matan Azrad wrote:
> Hi Gaetan
> 

<snip>

> > > > Your environment does not include the function, but this is within
> > > > fs_flow_destroy (please update to include the context by the way it
> > > > helps a lot the review :). Afterward, line 162 ret is directly used as return
> > value.
> > > >
> > > I don't understand what do you mean.
> > >
> > > > Also, fs_err() would need to transform rte_errno when relevant
> > > > (mostly in failsafe_flow.c I think).
> > > >
> > > Your suggestion is always to update rte_errno to 0 in case the error is
> > because of removal?
> > >
> > 
> > If the error is indeed due to the device being absent, then rte_errno should
> > be set back to its previous value I think.
> So, I think it will require old rte_errno save before each device command...
> Why not to set it to 0 in the special case(removal) by the new internal API?
> 

Resetting it to 0 might be sufficient, yes.

There might be some old-school devs out-there that would such things as:

    do_thing_x();
    do_thing_y();
    do_thing_z();
    if (check_for_any_error(errno)) {
        abort();
    }

But I'm not too fond of this kind of pattern, so I'm not specifically
opposed to code that does not go with this flow.
  

Patch

diff --git a/drivers/net/failsafe/failsafe_flow.c b/drivers/net/failsafe/failsafe_flow.c
index 153ceee..123acb4 100644
--- a/drivers/net/failsafe/failsafe_flow.c
+++ b/drivers/net/failsafe/failsafe_flow.c
@@ -87,7 +87,7 @@ 
 		DEBUG("Calling rte_flow_validate on sub_device %d", i);
 		ret = rte_flow_validate(PORT_ID(sdev),
 				attr, patterns, actions, error);
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("Operation rte_flow_validate failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -111,7 +111,7 @@ 
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		flow->flows[i] = rte_flow_create(PORT_ID(sdev),
 				attr, patterns, actions, error);
-		if (flow->flows[i] == NULL) {
+		if (flow->flows[i] == NULL && fs_is_error(sdev, -rte_errno)) {
 			ERROR("Failed to create flow on sub_device %d",
 				i);
 			goto err;
@@ -150,7 +150,7 @@ 
 			continue;
 		local_ret = rte_flow_destroy(PORT_ID(sdev),
 				flow->flows[i], error);
-		if (local_ret) {
+		if (fs_is_error(sdev, local_ret)) {
 			ERROR("Failed to destroy flow on sub_device %d: %d",
 					i, local_ret);
 			if (ret == 0)
@@ -175,7 +175,7 @@ 
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_flow_flush on sub_device %d", i);
 		ret = rte_flow_flush(PORT_ID(sdev), error);
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("Operation rte_flow_flush failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -199,8 +199,12 @@ 
 
 	sdev = TX_SUBDEV(dev);
 	if (sdev != NULL) {
-		return rte_flow_query(PORT_ID(sdev),
-				flow->flows[SUB_ID(sdev)], type, arg, error);
+		int ret = rte_flow_query(PORT_ID(sdev),
+					 flow->flows[SUB_ID(sdev)],
+					 type, arg, error);
+
+		if (fs_is_error(sdev, ret))
+			return ret;
 	}
 	WARN("No active sub_device to query about its flow");
 	return -1;
@@ -223,7 +227,7 @@ 
 			WARN("flow isolation mode of sub_device %d in incoherent state.",
 				i);
 		ret = rte_flow_isolate(PORT_ID(sdev), set, error);
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("Operation rte_flow_isolate failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
diff --git a/drivers/net/failsafe/failsafe_ops.c b/drivers/net/failsafe/failsafe_ops.c
index e16a590..313ea2f 100644
--- a/drivers/net/failsafe/failsafe_ops.c
+++ b/drivers/net/failsafe/failsafe_ops.c
@@ -121,6 +121,8 @@ 
 					dev->data->nb_tx_queues,
 					&dev->data->dev_conf);
 		if (ret) {
+			if (!fs_is_error(sdev, ret))
+				continue;
 			ERROR("Could not configure sub_device %d", i);
 			return ret;
 		}
@@ -163,8 +165,11 @@ 
 			continue;
 		DEBUG("Starting sub_device %d", i);
 		ret = rte_eth_dev_start(PORT_ID(sdev));
-		if (ret)
+		if (ret) {
+			if (!fs_is_error(sdev, ret))
+				continue;
 			return ret;
+		}
 		sdev->state = DEV_STARTED;
 	}
 	if (PRIV(dev)->state < DEV_STARTED)
@@ -196,7 +201,7 @@ 
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_set_link_up on sub_device %d", i);
 		ret = rte_eth_dev_set_link_up(PORT_ID(sdev));
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_set_link_up failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -215,7 +220,7 @@ 
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_set_link_down on sub_device %d", i);
 		ret = rte_eth_dev_set_link_down(PORT_ID(sdev));
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_set_link_down failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -300,7 +305,7 @@ 
 				rx_queue_id,
 				nb_rx_desc, socket_id,
 				rx_conf, mb_pool);
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("RX queue setup failed for sub_device %d", i);
 			goto free_rxq;
 		}
@@ -366,7 +371,7 @@ 
 				tx_queue_id,
 				nb_tx_desc, socket_id,
 				tx_conf);
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("TX queue setup failed for sub_device %d", i);
 			goto free_txq;
 		}
@@ -445,7 +450,8 @@ 
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling link_update on sub_device %d", i);
 		ret = (SUBOPS(sdev, link_update))(ETH(sdev), wait_to_complete);
-		if (ret && ret != -1) {
+		if (ret && ret != -1 && sdev->remove == 0 &&
+		    rte_eth_dev_is_removed(PORT_ID(sdev)) == 0) {
 			ERROR("Link update failed for sub_device %d with error %d",
 			      i, ret);
 			return ret;
@@ -469,6 +475,7 @@ 
 fs_stats_get(struct rte_eth_dev *dev,
 	     struct rte_eth_stats *stats)
 {
+	struct rte_eth_stats backup;
 	struct sub_device *sdev;
 	uint8_t i;
 	int ret;
@@ -478,14 +485,20 @@ 
 		struct rte_eth_stats *snapshot = &sdev->stats_snapshot.stats;
 		uint64_t *timestamp = &sdev->stats_snapshot.timestamp;
 
+		rte_memcpy(&backup, snapshot, sizeof(backup));
 		ret = rte_eth_stats_get(PORT_ID(sdev), snapshot);
 		if (ret) {
+			if (!fs_is_error(sdev, ret)) {
+				rte_memcpy(snapshot, &backup, sizeof(backup));
+				goto inc;
+			}
 			ERROR("Operation rte_eth_stats_get failed for sub_device %d with error %d",
 				  i, ret);
 			*timestamp = 0;
 			return ret;
 		}
 		*timestamp = rte_rdtsc();
+inc:
 		failsafe_stats_increment(stats, snapshot);
 	}
 	return 0;
@@ -598,7 +611,7 @@ 
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_set_mtu on sub_device %d", i);
 		ret = rte_eth_dev_set_mtu(PORT_ID(sdev), mtu);
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_set_mtu failed for sub_device %d with error %d",
 			      i, ret);
 			return ret;
@@ -617,7 +630,7 @@ 
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_vlan_filter on sub_device %d", i);
 		ret = rte_eth_dev_vlan_filter(PORT_ID(sdev), vlan_id, on);
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_vlan_filter failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -651,7 +664,7 @@ 
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_flow_ctrl_set on sub_device %d", i);
 		ret = rte_eth_dev_flow_ctrl_set(PORT_ID(sdev), fc_conf);
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_flow_ctrl_set failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
@@ -688,7 +701,7 @@ 
 	RTE_ASSERT(index < FAILSAFE_MAX_ETHADDR);
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		ret = rte_eth_dev_mac_addr_add(PORT_ID(sdev), mac_addr, vmdq);
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_mac_addr_add failed for sub_device %"
 			      PRIu8 " with error %d", i, ret);
 			return ret;
@@ -730,7 +743,7 @@ 
 	FOREACH_SUBDEV_STATE(sdev, i, dev, DEV_ACTIVE) {
 		DEBUG("Calling rte_eth_dev_filter_ctrl on sub_device %d", i);
 		ret = rte_eth_dev_filter_ctrl(PORT_ID(sdev), type, op, arg);
-		if (ret) {
+		if (fs_is_error(sdev, ret)) {
 			ERROR("Operation rte_eth_dev_filter_ctrl failed for sub_device %d"
 			      " with error %d", i, ret);
 			return ret;
diff --git a/drivers/net/failsafe/failsafe_private.h b/drivers/net/failsafe/failsafe_private.h
index d81cc3c..585b554 100644
--- a/drivers/net/failsafe/failsafe_private.h
+++ b/drivers/net/failsafe/failsafe_private.h
@@ -34,6 +34,7 @@ 
 #ifndef _RTE_ETH_FAILSAFE_PRIVATE_H_
 #define _RTE_ETH_FAILSAFE_PRIVATE_H_
 
+#include <stdbool.h>
 #include <sys/queue.h>
 
 #include <rte_atomic.h>
@@ -375,4 +376,15 @@  int failsafe_eth_lsc_event_callback(uint16_t port_id,
 	rte_wmb();
 }
 
+/*
+ * Check if error should be reported to the user.
+ */
+static inline bool
+fs_is_error(struct sub_device *sdev, int err)
+{
+	/* A device removal shouldn't be reported as an error. */
+	if (err == 0 || sdev->remove == 1 || err == -EIO)
+		return false;
+	return true;
+}
 #endif /* _RTE_ETH_FAILSAFE_PRIVATE_H_ */