[dpdk-dev] [PATCH v3 6/7] net/mlx4: mitigate Tx path memory barriers

Matan Azrad matan at mellanox.com
Tue Oct 31 12:35:29 CET 2017


Hi Adrien

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> Sent: Tuesday, October 31, 2017 12:17 PM
> To: Matan Azrad <matan at mellanox.com>
> Cc: dev at dpdk.org; Ophir Munk <ophirmu at mellanox.com>
> Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory barriers
> 
> Hi Matan,
> 
> On Mon, Oct 30, 2017 at 07:47:20PM +0000, Matan Azrad wrote:
> > Hi Adrien
> >
> > > -----Original Message-----
> > > From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > > Sent: Monday, October 30, 2017 4:24 PM
> > > To: Matan Azrad <matan at mellanox.com>
> > > Cc: dev at dpdk.org; Ophir Munk <ophirmu at mellanox.com>
> > > Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory
> > > barriers
> > >
> > > On Mon, Oct 30, 2017 at 10:07:28AM +0000, Matan Azrad wrote:
> > > > Replace most of the memory barriers by compiler barriers since
> > > > they are all targeted to the DRAM; This improves code efficiency
> > > > for systems which force store order between different addresses.
> > > >
> > > > Only the doorbell record store should be protected by memory
> > > > barrier since it is targeted to the PCI memory domain.
> > > >
> > > > Limit pre byte count store compiler barrier for systems with cache
> > > > line size smaller than 64B (TXBB size).
> > > >
> > > > Signed-off-by: Matan Azrad <matan at mellanox.com>
> > >
> > > This sounds like an interesting performance improvement, can you
> > > share the typical or expected amount (percentage/hard numbers) for a
> > > given use case as part of the commit log?
> > >
> >
> > Yes, it improves performance, I will share numbers.
> 
> First I must add I thought rte_io_[rw]mb() was really only a renamed
> compiler barrier, I better understand its purpose now, thanks.
> 
> (more below.)
> 
> > > More comments below.
> > >
> > > > ---
> > > >  drivers/net/mlx4/mlx4_rxtx.c | 11 ++++++-----
> > > >  1 file changed, 6 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/net/mlx4/mlx4_rxtx.c
> > > > b/drivers/net/mlx4/mlx4_rxtx.c index 8ea8851..482c399 100644
> > > > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > > > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > > > @@ -168,7 +168,7 @@ struct pv {
> > > >  		/*
> > > >  		 * Make sure we read the CQE after we read the ownership
> > > bit.
> > > >  		 */
> > > > -		rte_rmb();
> > > > +		rte_io_rmb();
> > >
> > > OK for this one since the rest of the code should not be run due to
> > > the condition (I'm not even sure even a compiler barrier is necessary at all
> here).
> > >
> > > >  #ifndef NDEBUG
> > > >  		if (unlikely((cqe->owner_sr_opcode &
> > > MLX4_CQE_OPCODE_MASK) ==
> > > >  			     MLX4_CQE_OPCODE_ERROR)) { @@ -203,7 +203,7
> @@ struct pv {
> > > >  	 */
> > > >  	cq->cons_index = cons_index;
> > > >  	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index &
> > > MLX4_CQ_DB_CI_MASK);
> > > > -	rte_wmb();
> > > > +	rte_io_wmb();
> > >
> > > This one could be removed entirely as well, which is more or less
> > > what the move to a compiler barrier does. Nothing in subsequent code
> > > depends on this doorbell being written, so this can piggy back on
> > > any subsequent rte_wmb().
> >
> > Yes, you right, probably this code was taken from multi thread
> implementation.
> > >
> > > On the other hand in my opinion a barrier (compiler or otherwise)
> > > might be needed before the doorbell write, to make clear it cannot
> > > somehow be done earlier in case something attempts to optimize it
> away.
> > >
> > I think we can remove it entirely (compiler can't optimize the ci_db store
> since in depends in previous code (cons_index).
> 
> Right, however you may still run into issues if the compiler determines the
> final cons_index value by looking at the loop and decides to store it before
> entering/leaving it. That's the kind of problematic optimization I was thinking
> of.
> 
> The barrier in that sense is just to assert the order of seemingly unrelated
> load/stores.

I think that If I left the rte_io_rmb after CQE owner check we can earn both:
1. The concern of read ordering while reading the owner before using CQE.
2. The ci_db concern: the compiler must read the last CQE(which is not valid and we have no more stamp to do) before it knows the last value of cons_index. 
So we can remove entirely this rte_io_wmb in completion function.
What do you think? 
 
> > > >  			/* Fill the control parameters for this packet. */ @@ -
> > > 533,7
> > > > +534,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> > > >  		 * setting ownership bit (because HW can start
> > > >  		 * executing as soon as we do).
> > > >  		 */
> > > > -		rte_wmb();
> > > > +		rte_io_wmb();
> > >
> > > This one looks dangerous. A compiler barrier is not strong enough to
> > > guarantee the order in which CPU will execute instructions, it only
> > > makes sure what follows the barrier doesn't appear before it in the
> generated code.
> > >
> > As I investigated, I understood that for CPUs which don't save store order
> between different addresses(arm,ppc), the rte_io_wmb is converted to
> rte_wmb.
> > So for thus who save it(x86) we just need the right order in compiler code
> because all the relevant stores are targeted to same memory domain(DRAM)
> and therefore also the actual store is guaranteed.
> > Unlike doorbell store which directed to different memory domain (PCI).
> > So the only place which need rte_wmb() is before doorbell write.
> 
> Fair enough, although after re-reading the code I think there's another issue
> present since the beginning: both ctrl and dseg pointers are not volatile, this
> fact doesn't guarantee intermediate writes will occur in the expected order
> or even at all, even in the presence of a barrier.
> 
> The volatile attribute should be inherited from both struct mlx4_cq and struct
> mlx4_sq (buf, db and most if not all other pointers). I think a separate fixes
> commit should add it for safety.

Great notice , I will add it, Thanks!
> 
> > > Unless the comment above this barrier is wrong, this change may
> > > cause hard- to-debug issues down the road, you should drop it.
> > >
> > > >  		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
> > > >  					      ((sq->head & sq->txbb_cnt) ?
> > > >  						       MLX4_BIT_WQE_OWN :
> > > 0));
> > > > --
> > > > 1.8.3.1
> > > >
> > >
> > > --
> > > Adrien Mazarguil
> > > 6WIND
> >
> > Thanks!
> 
> --
> Adrien Mazarguil
> 6WIND


More information about the dev mailing list