[PATCH v3] net/ice: fix crash on closing representor ports

Zhang, Qi Z qi.z.zhang at intel.com
Thu Nov 2 04:59:21 CET 2023



> -----Original Message-----
> From: Ye, MingjinX <mingjinx.ye at intel.com>
> Sent: Thursday, November 2, 2023 10:39 AM
> To: Zhang, Qi Z <qi.z.zhang at intel.com>; dev at dpdk.org
> Cc: Yang, Qiming <qiming.yang at intel.com>; Zhou, YidingX
> <yidingx.zhou at intel.com>; stable at dpdk.org
> Subject: RE: [PATCH v3] net/ice: fix crash on closing representor ports
> 
> 
> 
> > -----Original Message-----
> > From: Zhang, Qi Z <qi.z.zhang at intel.com>
> > Sent: Wednesday, November 1, 2023 6:49 PM
> > To: Ye, MingjinX <mingjinx.ye at intel.com>; dev at dpdk.org
> > Cc: Yang, Qiming <qiming.yang at intel.com>; Zhou, YidingX
> > <yidingx.zhou at intel.com>; stable at dpdk.org
> > Subject: RE: [PATCH v3] net/ice: fix crash on closing representor
> > ports
> >
> >
> >
> > > -----Original Message-----
> > > From: Ye, MingjinX <mingjinx.ye at intel.com>
> > > Sent: Wednesday, November 1, 2023 6:14 PM
> > > To: dev at dpdk.org
> > > Cc: Yang, Qiming <qiming.yang at intel.com>; Zhou, YidingX
> > > <yidingx.zhou at intel.com>; Ye, MingjinX <mingjinx.ye at intel.com>;
> > > stable at dpdk.org; Zhang, Qi Z <qi.z.zhang at intel.com>
> > > Subject: [PATCH v3] net/ice: fix crash on closing representor ports
> > >
> > > The data resource in struct rte_eth_dev is cleared and points to
> > > NULL when the DCF port is closed.
> > >
> > > If the DCF representor port is closed after the DCF port is closed,
> > > a segmentation fault occurs because the representor port accesses
> > > the data resource released by the DCF port.
> > >
> > > This patch checks if the resource is present before accessing.
> > >
> > > Fixes: 5674465a32c8 ("net/ice: add DCF VLAN handling")
> > > Fixes: da9cdcd1f372 ("net/ice: fix crash on representor port
> > > closing")
> > > Cc: stable at dpdk.org
> > >
> > > Signed-off-by: Mingjin Ye <mingjinx.ye at intel.com>
> > > ---
> > > v3: New solution.
> > > ---
> > >  drivers/net/ice/ice_dcf_vf_representor.c | 8 +++++---
> > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/net/ice/ice_dcf_vf_representor.c
> > > b/drivers/net/ice/ice_dcf_vf_representor.c
> > > index b9fcfc80ad..8c45e28f02 100644
> > > --- a/drivers/net/ice/ice_dcf_vf_representor.c
> > > +++ b/drivers/net/ice/ice_dcf_vf_representor.c
> > > @@ -111,14 +111,16 @@ ice_dcf_vf_repr_link_update(__rte_unused
> > struct
> > > rte_eth_dev *ethdev,  static __rte_always_inline struct ice_dcf_hw *
> > > ice_dcf_vf_repr_hw(struct ice_dcf_vf_repr *repr)  {
> > > -	struct ice_dcf_adapter *dcf_adapter =
> > > -			repr->dcf_eth_dev->data->dev_private;
> > > +	struct rte_eth_dev_data *dcf_data = repr->dcf_eth_dev->data;
> >
> > Seems this expose another issue, if dcf port already be closed, the
> > dcf_eth_dev instance could already be reused by another driver.
> Your analysis is spot on.
> 
> The reason for the issue:
> Affected by 36c46e738120c381cf663c96692454c5aa75e203 commit.
> da9cdcd1f37220e87db23993d6352637d71df25b commit does not fix the
> issue.
> 
> v2 patch:
> Notify every proxy port of DCF port status.
> 
> v3 patch:
> Based on da9cdcd1f37220e87db23993d636352637d71df25b, made
> optimization to fix the issue.
> 
> > So we can't assume dcf_eth_dev->data is NULL,  I think you can refine
> > based on v2's method, but don't update dcf_valid flag in representor
> > port's dev_stop.
> Implementation difficulties:
> 1. When the DCF is created, all associated proxy ports are created and each
> proxy port (struct rte_eth_dev) pointer is recorded in an array.
> 2. When shutting down the DCF, all all proxy ports are notified at
> ice_dcf_vf_repr_stop_all. Finally the array is deleted.
> 
> Based on the original scenario, notifying all agent ports of DCF state failure is
> the simplest and most effective way. This is very similar to the v2 patch
> scenario.

Yes, that's why I suggest to refine based on v2 after notice the issue on v3, seems we need to maintain a per repr flag for DCF status, but need to update it carefully.

> 
> Check each DCF port flag?
> If the DCF port has failed and the resource has been released. the DCF port
> status, naturally, is not available. This is very similar to the v3 patch scenario.

Check dev->data is NULL does not work, a better way is to check eth_dev->state and device name to make sure it is still used by DCF.
But I think v2's method is more reliable.


> 
> Other ideas
> 1. Re-implement DCF port and proxy port communication, can't directly use
> (struct rte_eth_dev) pointer to access the device resources on the other end.
> 
> 2. Designed to return an error when removing the proxy port directly. The
> proxy port device is managed by the control DCF (the proxy port is created by
> the DCF port and the removal work should be done by it).
> 
> The above 2 options will bring a huge amount of work. It seems beyond the
> scope of this discussion.
> 



More information about the stable mailing list