[dpdk-dev] [PATCH v3] net/mlx4: fix dev rmv not detected after port stop

Matan Azrad matan at mellanox.com
Wed Jan 31 18:07:56 CET 2018


Hi Adrien

From: Adrien Mazarguil , Sent: Wednesday, January 31, 2018 4:32 PM
> Hi Matan,
> 
> On Wed, Jan 31, 2018 at 01:44:41PM +0000, Matan Azrad wrote:
> > Hi Adrien
> >
> >  From: Adrien Mazarguil, Sent: Wednesday, January 31, 2018 12:44 PM
> > > On Wed, Jan 31, 2018 at 10:08:06AM +0000, Matan Azrad wrote:
> > > > Hi all
> > > >
> > > > From: Adrien Mazarguil
> > > > > On Tue, Jan 30, 2018 at 08:37:06PM +0000, Shahaf Shuler wrote:
> > > > > > Tuesday, January 30, 2018 11:40 AM, Adrien Mazarguil:
> > > > > > > Unfortunately I didn't get a chance to review this patch
> > > > > > > before it was
> > > > > applied.
> > > > > > > I'm not sure a stopped port is supposed to report events
> > > > > > > (interrupts). Will applications expect them to occur at this point?
> > > > > >
> > > > > > Why not?
> > > > > >
> > > > > > Stopped port is still counted as attached. The fact the
> > > > > > application stopped
> > > > > the packet receive on it doesn't mean it should not receive a
> > > > > sync events (such as the remove event).
> > > > > > async events, by definition, are not related to traffic being
> > > > > > flows through
> > > > > the port.
> > > > >
> > > > > My comment is based on my understanding of rte_eth_dev_stop(),
> > > which
> > > > > is a device (or port) is completely stopped, in a suspended
> > > > > state and no interrupts shall occur, as a means for applications
> > > > > to temporarily not be bothered by them until restarted.
> > > >
> > > > Stopping traffic is not saying that the application is not
> > > > interesting in the device, I think that you mean to dev_close().
> > >
> > > No, dev_close() releases resources and destroys configuration. Good
> > > luck using dev_start() or any other devop after dev_close().
> >
> > I'm just saying here that when the user call dev_close() it means he
> > is not interesting in the device any more, This is not the case in dev_stop().
> >
> > > > Any event may still be usable for application between dev_stop()
> > > > to dev_start(), especially RMV or LCS can still be interested.
> > >
> > > Possibly, but then, how come no PMD implements it that way?
> >
> > Again, maybe others PMDs make mistakes.
> 
> It's basically the same as stating that for all these years, neither PMD nor
> application maintainers understood the true meaning of this API.
> 
> Maybe you're right, maybe not. What's for sure is this patch unilaterally
> decided to modify an accepted behavior. Perhaps it's not a big deal but we
> must be cautious.

Agree.

I'm just saying my opinion here.

> > > Neither did mlx4 before this patch got applied. At the very least it
> > > cannot be considered a "fix".
> >
> > It fixes something.
> 
> Technically every time a feature is added, its absence gets "fixed" :)

:)
 
> > > > > Think about it that way: applications do not want to get
> > > > > interrupts immediately after the device is initialized, because
> > > > > they might not be ready to process them at this point. An
> > > > > explicit call to
> > > > > rte_eth_dev_start() tells the PMD when it's OK to do so. The
> > > > > converse is
> > > rte_eth_dev_stop().
> > > >
> > > > So, they can delay the event registration to the time they
> > > > interesting in the
> > > events.
> > > > And use event unregister when they are not interesting in it anymore.
> > >
> > > Of course you can ask application maintainers to adapt to the new
> > > behavior, or you know, leave things as they used to be.
> > >
> >
> > I don't know what any application does but for me it is a mistake to
> > stop all event processes in dev_stop(), Maybe for other application
> maintainers too.
> 
> Just like you, I don't know either what all the applications ever written for
> DPDK expect out of dev_stop(). What's for sure is that currently, LSC/RMV
> don't occcur afterward, the same way these events do not occur before
> dev_start().

Why not? RMV event can occur any time after probe.

> Any application possibly relying on this fact will break. In such a
> situation, a conservative approach is better.

If an application should fail to get event in stopped state it may fail in the previous code too:
The interrupt run from host thread so the next race may occur:
dev_start() : master thread.
Context switch.
RMV interrupt started to run callbacks: host thread.
Context switch.
dev_stop(): master thread.
Start reconfiguration: master thread. 
Context switch.
Callback running.

So, the only thing which can disable callback running after dev_stop() is callback unregistration before it.

> > > Setting up RMV/LSC callbacks is not the only configuration an
> > > application usually performs before calling dev_start(). Think about
> > > setting up flow rules, MAC addresses, VLANs, and so on, this on
> > > multiple ports before starting them up all at once. Previously it
> > > could be done in an unspecified order, now they have to take special care
> for RMV/LSC.
> >
> > Or maybe there callbacks code are already safe for it.
> > Or they manages the unregister\register calls in the right places.
> 
> That's my point, these "maybes" don't argue in favor of changing things.

What I'm saying is that callbacks should be safe or not registered in the right time.

> > > Many devops are only safe when called while a device is stopped.
> > > It's even documented in rte_ethdev.h.
> > >
> >
> > And?
> 
> ...And applications therefore often do all their configuration in an unspecified
> order while a port is stopped as a measure of safety. No extra care is taken
> for RMV/LSC. This uncertainty can be addressed by not modifying the current
> behavior.

Or they expect to get interrupt and the corner case will come later if we will not change it.

> > > > > Stopping traffic can already be achieved by not polling from the
> > > > > application side, calling rte_eth_dev_[rt]x_queue_stop() and/or
> > > > > toggling RX/TX interrupts through rte_eth_dev_[rt]x_intr_enable().
> > > > > rte_eth_dev_stop() provides lower-level device control.
> > > >
> > > > I think it makes sense only for Rx interrupt which is traffic
> > > > oriented(like stop
> > > and start).
> > >
> > > OK, looks like we disagree :)
> > >
> > > > > Perhaps documentation is not clear, however that's how LSC seems
> > > > > implemented in all PMDs; it gets disabled after
> > > > > rte_eth_dev_stop() and one should explicitly use
> > > > > rte_eth_link_get() to retrieve link status afterward. I think
> > > > > RMV should behave similarly with
> > > rte_eth_dev_is_removed().
> > > > > Adapting fail-safe should be easier than modifying all the
> > > > > remaining
> > > PMDs.
> > > >
> > > > Or maybe PMDs which do it make mistakes.
> > >
> > > I'm not convinced mlx4 is the only PMD doing the right thing, we
> > > should ask other maintainers as well as application writers' opinion
> > > on the matter. If it's not a problem for RMV/LSC to occur while a
> > > device is stopped, then I'll agree with the approach. It still won't make it a
> fix regardless.
> >
> > Let's think about RMV event, How can the application\other dpdk entities
> to know if the device was removed when it was in stopped state?
> > Checking it synchronically (by rte_eth_dev_is_removed()) can miss the
> removal just a moment after the device came back again, errors will occur
> and no one will know why.
> >
> > So, at least for RMV event, we need the notification also in stopped state.
> 
> You sent the rte_eth_dev_is_removed() series. You're aware that PMDs
> implementing this call benefit from an automatic is_removed() check on all
> remaining devops whenever some error occur.
> In short, an application will get notified simply by getting dev_start() (or any
> other callback) return -EIO and not being able to use the device.
 
Yes, but between dev_stop to dev_start may not be any ethdev API calling.

> PMDs that do not implement is_removed() (or in addition to it) could also
> artificially trigger a RMV event after dev_start() is called. As long as the PMD
> remains quiet while the device is stopped, it's fine.

How can the PMD do it after dev_start()? Initiate alarm in dev start function to do it later And entering into race again?

I think it is not worth the effort and this patch is doing the right thing(also some other PMDs) .

Thanks.
> > > > > > > In my opinion it's not a fix, as in, it doesn't address an
> > > > > > > issue introduced by the mentioned patch whose behavior was
> correct.
> > > > > > >
> > > > > > > It's probably too late to change it now and it does address
> > > > > > > an issue seen with a use case involving this PMD, however I
> > > > > > > think the fail-safe PMD could as well poll using the
> > > > > > > recently-added
> > > > > > > rte_eth_dev_is_removed() when it's aware the underlying port
> > > > > > > is
> > > > > stopped instead of expecting interrupts.
> 
> --
> Adrien Mazarguil
> 6WIND


More information about the dev mailing list