[dpdk-dev,v3,6/7] net/mlx4: mitigate Tx path memory barriers

Message ID 1509358049-18854-7-git-send-email-matan@mellanox.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Matan Azrad Oct. 30, 2017, 10:07 a.m. UTC
  Replace most of the memory barriers with compiler barriers, since they
all target DRAM; this improves code efficiency on systems that enforce
store ordering between different addresses.

Only the doorbell record store should be protected by a memory barrier,
since it targets the PCI memory domain.

Limit the compiler barrier before the byte count store to systems whose
cache line size is smaller than 64B (the TXBB size).

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)
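
To illustrate the resulting model (a minimal sketch using the names from
the diff below, not the actual driver flow):

	/*
	 * Stores targeting DRAM (WQE fields): on x86 the CPU already keeps
	 * them in program order, so a compiler barrier is enough, and on
	 * arm/ppc rte_io_wmb() expands to a real memory barrier anyway.
	 */
	dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
	rte_io_wmb();
	ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode);
	/*
	 * The doorbell register lives in the PCI memory domain: a full
	 * write barrier is still required so the device cannot observe
	 * the doorbell before the WQE stores above.
	 */
	rte_wmb();
	/* (doorbell write to PCI memory goes here) */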
  

Comments

Adrien Mazarguil Oct. 30, 2017, 2:23 p.m. UTC | #1
On Mon, Oct 30, 2017 at 10:07:28AM +0000, Matan Azrad wrote:
> Replace most of the memory barriers with compiler barriers, since they
> all target DRAM; this improves code efficiency on systems that enforce
> store ordering between different addresses.
> 
> Only the doorbell record store should be protected by a memory barrier,
> since it targets the PCI memory domain.
> 
> Limit the compiler barrier before the byte count store to systems whose
> cache line size is smaller than 64B (the TXBB size).
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>

This sounds like an interesting performance improvement, can you share the
typical or expected amount (percentage/hard numbers) for a given use case as
part of the commit log?

More comments below.

> ---
>  drivers/net/mlx4/mlx4_rxtx.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> index 8ea8851..482c399 100644
> --- a/drivers/net/mlx4/mlx4_rxtx.c
> +++ b/drivers/net/mlx4/mlx4_rxtx.c
> @@ -168,7 +168,7 @@ struct pv {
>  		/*
>  		 * Make sure we read the CQE after we read the ownership bit.
>  		 */
> -		rte_rmb();
> +		rte_io_rmb();

OK for this one since the rest of the code should not be run due to the
condition (I'm not even sure a compiler barrier is necessary at all
here).

>  #ifndef NDEBUG
>  		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
>  			     MLX4_CQE_OPCODE_ERROR)) {
> @@ -203,7 +203,7 @@ struct pv {
>  	 */
>  	cq->cons_index = cons_index;
>  	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
> -	rte_wmb();
> +	rte_io_wmb();

This one could be removed entirely as well, which is more or less what the
move to a compiler barrier does. Nothing in subsequent code depends on this
doorbell being written, so this can piggy back on any subsequent rte_wmb().

On the other hand in my opinion a barrier (compiler or otherwise) might be
needed before the doorbell write, to make clear it cannot somehow be done
earlier in case something attempts to optimize it away.
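
For instance, something along these lines (a sketch only, reusing the
names from the patch; whether a compiler barrier suffices here is to be
verified):

	cq->cons_index = cons_index;
	/*
	 * Keep the doorbell store after the CQE processing above; a
	 * compiler barrier might be all that is needed for that on x86.
	 */
	rte_io_wmb();
	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
	sq->tail = sq->tail + nr_txbbs;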

>  	sq->tail = sq->tail + nr_txbbs;
>  	/* Update the list of packets posted for transmission. */
>  	elts_comp -= pkts;
> @@ -321,6 +321,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
>  		 * control segment.
>  		 */
>  		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> +#if RTE_CACHE_LINE_SIZE < 64
>  			/*
>  			 * Need a barrier here before writing the byte_count
>  			 * fields to make sure that all the data is visible
> @@ -331,6 +332,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
>  			 * data, and end up sending the wrong data.
>  			 */
>  			rte_io_wmb();
> +#endif /* RTE_CACHE_LINE_SIZE */

Interesting one.

>  			dseg->byte_count = byte_count;
>  		} else {
>  			/*
> @@ -469,8 +471,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
>  				break;
>  			}
>  #endif /* NDEBUG */
> -			/* Need a barrier here before byte count store. */
> -			rte_io_wmb();
> +			/* Never be TXBB aligned, no need compiler barrier. */

The reason there was a barrier here at all was unclear, so if it's really
useless, you don't even need to describe why.

>  			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
>  
>  			/* Fill the control parameters for this packet. */
> @@ -533,7 +534,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
>  		 * setting ownership bit (because HW can start
>  		 * executing as soon as we do).
>  		 */
> -		rte_wmb();
> +		rte_io_wmb();

This one looks dangerous. A compiler barrier is not strong enough to
guarantee the order in which the CPU will execute instructions; it only
makes sure what follows the barrier doesn't appear before it in the
generated code.
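
To spell out the difference on x86 (expansions as I recall them from the
arch headers, worth double-checking):

	rte_compiler_barrier();	/* asm volatile ("" : : : "memory"),
				 * constrains the compiler only. */
	rte_wmb();		/* _mm_sfence(), also orders the CPU's
				 * stores, including those to PCI memory. */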

Unless the comment above this barrier is wrong, this change may cause
hard-to-debug issues down the road; you should drop it.

>  		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
>  					      ((sq->head & sq->txbb_cnt) ?
>  						       MLX4_BIT_WQE_OWN : 0));
> -- 
> 1.8.3.1
>
  
Matan Azrad Oct. 30, 2017, 7:47 p.m. UTC | #2
Hi Adrien

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Monday, October 30, 2017 4:24 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: dev@dpdk.org; Ophir Munk <ophirmu@mellanox.com>
> Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory barriers
> 
> On Mon, Oct 30, 2017 at 10:07:28AM +0000, Matan Azrad wrote:
> > Replace most of the memory barriers by compiler barriers since they
> > are all targeted to the DRAM; This improves code efficiency for
> > systems which force store order between different addresses.
> >
> > Only the doorbell record store should be protected by memory barrier
> > since it is targeted to the PCI memory domain.
> >
> > Limit pre byte count store compiler barrier for systems with cache
> > line size smaller than 64B (TXBB size).
> >
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
> 
> This sounds like an interesting performance improvement, can you share the
> typical or expected amount (percentage/hard numbers) for a given use case
> as part of the commit log?
> 

Yes, it improves performance, I will share numbers.

> More comments below.
> 
> > ---
> >  drivers/net/mlx4/mlx4_rxtx.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/mlx4/mlx4_rxtx.c
> > b/drivers/net/mlx4/mlx4_rxtx.c index 8ea8851..482c399 100644
> > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > @@ -168,7 +168,7 @@ struct pv {
> >  		/*
> >  		 * Make sure we read the CQE after we read the ownership bit.
> >  		 */
> > -		rte_rmb();
> > +		rte_io_rmb();
> 
> OK for this one since the rest of the code should not be run due to the
> condition (I'm not even sure a compiler barrier is necessary at all here).
> 
> >  #ifndef NDEBUG
> >  		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
> >  			     MLX4_CQE_OPCODE_ERROR)) {
> > @@ -203,7 +203,7 @@ struct pv {
> >  	 */
> >  	cq->cons_index = cons_index;
> >  	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
> > -	rte_wmb();
> > +	rte_io_wmb();
> 
> This one could be removed entirely as well, which is more or less what the
> move to a compiler barrier does. Nothing in subsequent code depends on
> this doorbell being written, so this can piggy back on any subsequent
> rte_wmb().

Yes, you are right; this code was probably taken from a multi-threaded implementation.
> 
> On the other hand in my opinion a barrier (compiler or otherwise) might be
> needed before the doorbell write, to make clear it cannot somehow be done
> earlier in case something attempts to optimize it away.
> 
I think we can remove it entirely (the compiler can't optimize the ci_db store away since it depends on previous code (cons_index)).

> >  	sq->tail = sq->tail + nr_txbbs;
> >  	/* Update the list of packets posted for transmission. */
> >  	elts_comp -= pkts;
> > @@ -321,6 +321,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> >  		 * control segment.
> >  		 */
> >  		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> > +#if RTE_CACHE_LINE_SIZE < 64
> >  			/*
> >  			 * Need a barrier here before writing the byte_count
> >  			 * fields to make sure that all the data is visible
> > @@ -331,6 +332,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> >  			 * data, and end up sending the wrong data.
> >  			 */
> >  			rte_io_wmb();
> > +#endif /* RTE_CACHE_LINE_SIZE */
> 
> Interesting one.
> 
> >  			dseg->byte_count = byte_count;
> >  		} else {
> >  			/*
> > @@ -469,8 +471,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> >  				break;
> >  			}
> >  #endif /* NDEBUG */
> > -			/* Need a barrier here before byte count store. */
> > -			rte_io_wmb();
> > +			/* Never be TXBB aligned, no need compiler barrier. */
> 
> The reason there was a barrier here at all was unclear, so if it's really useless,
> you don't even need to describe why.

It is because there is a barrier at the similar stage in the multi-segment path.
I think it can help future reviews.

> 
> >  			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
> >
> >  			/* Fill the control parameters for this packet. */
> > @@ -533,7 +534,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> >  		 * setting ownership bit (because HW can start
> >  		 * executing as soon as we do).
> >  		 */
> > -		rte_wmb();
> > +		rte_io_wmb();
> 
> This one looks dangerous. A compiler barrier is not strong enough to
> guarantee the order in which the CPU will execute instructions; it only
> makes sure what follows the barrier doesn't appear before it in the generated code.
> 
As I investigated, I found that for CPUs which don't preserve store order between different addresses (arm, ppc), rte_io_wmb() is mapped to rte_wmb().
So for those which do preserve it (x86), we only need the right order in the compiled code, because all the relevant stores target the same memory domain (DRAM), and therefore the actual store order is guaranteed as well.
This is unlike the doorbell store, which is directed to a different memory domain (PCI).
So the only place which needs rte_wmb() is before the doorbell write.
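
In other words, the per-arch expansions look roughly like this (abridged,
as I understand the current headers):

/* x86:   rte_wmb()    -> _mm_sfence()
 *        rte_io_wmb() -> rte_compiler_barrier()
 * armv8: rte_wmb()    -> dsb(st)
 *        rte_io_wmb() -> rte_wmb()
 * ppc64: rte_wmb()    -> asm volatile ("sync" : : : "memory")
 *        rte_io_wmb() -> rte_wmb()
 */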

> Unless the comment above this barrier is wrong, this change may cause
> hard-to-debug issues down the road; you should drop it.
> 
> >  		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
> >  					      ((sq->head & sq->txbb_cnt) ?
> >  						       MLX4_BIT_WQE_OWN : 0));
> > --
> > 1.8.3.1
> >
> 
> --
> Adrien Mazarguil
> 6WIND

Thanks!
  
Adrien Mazarguil Oct. 31, 2017, 10:17 a.m. UTC | #3
Hi Matan,

On Mon, Oct 30, 2017 at 07:47:20PM +0000, Matan Azrad wrote:
> Hi Adrien
> 
> > -----Original Message-----
> > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > Sent: Monday, October 30, 2017 4:24 PM
> > To: Matan Azrad <matan@mellanox.com>
> > Cc: dev@dpdk.org; Ophir Munk <ophirmu@mellanox.com>
> > Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory barriers
> > 
> > On Mon, Oct 30, 2017 at 10:07:28AM +0000, Matan Azrad wrote:
> > > Replace most of the memory barriers with compiler barriers, since
> > > they all target DRAM; this improves code efficiency on systems that
> > > enforce store ordering between different addresses.
> > >
> > > Only the doorbell record store should be protected by a memory
> > > barrier, since it targets the PCI memory domain.
> > >
> > > Limit the compiler barrier before the byte count store to systems
> > > whose cache line size is smaller than 64B (the TXBB size).
> > >
> > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > 
> > This sounds like an interesting performance improvement, can you share the
> > typical or expected amount (percentage/hard numbers) for a given use case
> > as part of the commit log?
> > 
> 
> Yes, it improves performance, I will share numbers.

First I must add I thought rte_io_[rw]mb() was really only a renamed
compiler barrier; I better understand its purpose now, thanks.

(more below.)

> > More comments below.
> > 
> > > ---
> > >  drivers/net/mlx4/mlx4_rxtx.c | 11 ++++++-----
> > >  1 file changed, 6 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx4/mlx4_rxtx.c
> > > b/drivers/net/mlx4/mlx4_rxtx.c index 8ea8851..482c399 100644
> > > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > > @@ -168,7 +168,7 @@ struct pv {
> > >  		/*
> > >  		 * Make sure we read the CQE after we read the ownership bit.
> > >  		 */
> > > -		rte_rmb();
> > > +		rte_io_rmb();
> > 
> > OK for this one since the rest of the code should not be run due to
> > the condition (I'm not even sure a compiler barrier is necessary at all here).
> > 
> > >  #ifndef NDEBUG
> > >  		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
> > >  			     MLX4_CQE_OPCODE_ERROR)) {
> > > @@ -203,7 +203,7 @@ struct pv {
> > >  	 */
> > >  	cq->cons_index = cons_index;
> > >  	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
> > > -	rte_wmb();
> > > +	rte_io_wmb();
> > 
> > This one could be removed entirely as well, which is more or less what the
> > move to a compiler barrier does. Nothing in subsequent code depends on
> > this doorbell being written, so this can piggy back on any subsequent
> > rte_wmb().
> 
> Yes, you are right; this code was probably taken from a multi-threaded implementation.
> > 
> > On the other hand in my opinion a barrier (compiler or otherwise) might be
> > needed before the doorbell write, to make clear it cannot somehow be done
> > earlier in case something attempts to optimize it away.
> > 
> I think we can remove it entirely (the compiler can't optimize the ci_db store away since it depends on previous code (cons_index)).

Right, however you may still run into issues if the compiler determines the
final cons_index value by looking at the loop and decides to store it before
entering/leaving it. That's the kind of problematic optimization I was
thinking of.

The barrier in that sense is just to assert the order of seemingly unrelated
load/stores.
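
For example, nothing would prevent a compiler from transforming
(hypothetical illustration, not actual generated code):

	for (i = 0; i < n; i++)
		cons_index++;	/* plus per-CQE work */
	*cq->set_ci_db = rte_cpu_to_be_32(cons_index & MLX4_CQ_DB_CI_MASK);

into a version that performs the doorbell store before the per-CQE work,
since the final cons_index value can be computed up front.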

> > >  	sq->tail = sq->tail + nr_txbbs;
> > >  	/* Update the list of packets posted for transmission. */
> > >  	elts_comp -= pkts;
> > > @@ -321,6 +321,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> > >  		 * control segment.
> > >  		 */
> > >  		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> > > +#if RTE_CACHE_LINE_SIZE < 64
> > >  			/*
> > >  			 * Need a barrier here before writing the byte_count
> > >  			 * fields to make sure that all the data is visible
> > > @@ -331,6 +332,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> > >  			 * data, and end up sending the wrong data.
> > >  			 */
> > >  			rte_io_wmb();
> > > +#endif /* RTE_CACHE_LINE_SIZE */
> > 
> > Interesting one.
> > 
> > >  			dseg->byte_count = byte_count;
> > >  		} else {
> > >  			/*
> > > @@ -469,8 +471,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> > >  				break;
> > >  			}
> > >  #endif /* NDEBUG */
> > > -			/* Need a barrier here before byte count store. */
> > > -			rte_io_wmb();
> > > +			/* Never be TXBB aligned, no need compiler barrier. */
> > 
> > The reason there was a barrier here at all was unclear, so if it's really useless,
> > you don't even need to describe why.
> 
> It is because there is a barrier at the similar stage in the multi-segment path.
> I think it can help future reviews.

OK.

> > 
> > >  			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
> > >
> > >  			/* Fill the control parameters for this packet. */
> > > @@ -533,7 +534,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> > >  		 * setting ownership bit (because HW can start
> > >  		 * executing as soon as we do).
> > >  		 */
> > > -		rte_wmb();
> > > +		rte_io_wmb();
> > 
> > This one looks dangerous. A compiler barrier is not strong enough to
> > guarantee the order in which the CPU will execute instructions; it only
> > makes sure what follows the barrier doesn't appear before it in the generated code.
> > 
> As I investigated, I found that for CPUs which don't preserve store order between different addresses (arm, ppc), rte_io_wmb() is mapped to rte_wmb().
> So for those which do preserve it (x86), we only need the right order in the compiled code, because all the relevant stores target the same memory domain (DRAM), and therefore the actual store order is guaranteed as well.
> This is unlike the doorbell store, which is directed to a different memory domain (PCI).
> So the only place which needs rte_wmb() is before the doorbell write.

Fair enough, although after re-reading the code I think there's another
issue present since the beginning: both the ctrl and dseg pointers are not
volatile; this means intermediate writes are not guaranteed to occur in
the expected order, or even at all, even in the presence of a barrier.

The volatile attribute should be inherited from both struct mlx4_cq and
struct mlx4_sq (buf, db and most if not all other pointers). I think a
separate fixes commit should add it for safety.
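
i.e. something like this (a sketch of the kind of change I mean; the real
field lists in mlx4_prm.h are longer and the names may differ slightly):

	struct mlx4_sq {
		volatile uint8_t *buf;        /* SQ buffer. */
		volatile uint32_t *db;        /* Pointer to the doorbell. */
		/* ... */
	};

	struct mlx4_cq {
		volatile uint32_t *set_ci_db; /* Pointer to the CI doorbell. */
		volatile uint8_t *buf;        /* CQ buffer. */
		/* ... */
	};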

> > Unless the comment above this barrier is wrong, this change may cause
> > hard-to-debug issues down the road; you should drop it.
> > 
> > >  		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
> > >  					      ((sq->head & sq->txbb_cnt) ?
> > >  						       MLX4_BIT_WQE_OWN : 0));
> > > --
> > > 1.8.3.1
> > >
> > 
> > --
> > Adrien Mazarguil
> > 6WIND
> 
> Thanks!
  
Matan Azrad Oct. 31, 2017, 11:35 a.m. UTC | #4
Hi Adrien

> -----Original Message-----
> From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> Sent: Tuesday, October 31, 2017 12:17 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: dev@dpdk.org; Ophir Munk <ophirmu@mellanox.com>
> Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory barriers
> 
> Hi Matan,
> 
> On Mon, Oct 30, 2017 at 07:47:20PM +0000, Matan Azrad wrote:
> > Hi Adrien
> >
> > > -----Original Message-----
> > > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > Sent: Monday, October 30, 2017 4:24 PM
> > > To: Matan Azrad <matan@mellanox.com>
> > > Cc: dev@dpdk.org; Ophir Munk <ophirmu@mellanox.com>
> > > Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory
> > > barriers
> > >
> > > On Mon, Oct 30, 2017 at 10:07:28AM +0000, Matan Azrad wrote:
> > > > Replace most of the memory barriers with compiler barriers, since
> > > > they all target DRAM; this improves code efficiency on systems
> > > > that enforce store ordering between different addresses.
> > > >
> > > > Only the doorbell record store should be protected by a memory
> > > > barrier, since it targets the PCI memory domain.
> > > >
> > > > Limit the compiler barrier before the byte count store to systems
> > > > whose cache line size is smaller than 64B (the TXBB size).
> > > >
> > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > >
> > > This sounds like an interesting performance improvement, can you
> > > share the typical or expected amount (percentage/hard numbers) for a
> > > given use case as part of the commit log?
> > >
> >
> > Yes, it improves performance, I will share numbers.
> 
> First I must add I thought rte_io_[rw]mb() was really only a renamed
> compiler barrier; I better understand its purpose now, thanks.
> 
> (more below.)
> 
> > > More comments below.
> > >
> > > > ---
> > > >  drivers/net/mlx4/mlx4_rxtx.c | 11 ++++++-----
> > > >  1 file changed, 6 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/net/mlx4/mlx4_rxtx.c
> > > > b/drivers/net/mlx4/mlx4_rxtx.c index 8ea8851..482c399 100644
> > > > --- a/drivers/net/mlx4/mlx4_rxtx.c
> > > > +++ b/drivers/net/mlx4/mlx4_rxtx.c
> > > > @@ -168,7 +168,7 @@ struct pv {
> > > >  		/*
> > > >  		 * Make sure we read the CQE after we read the ownership bit.
> > > >  		 */
> > > > -		rte_rmb();
> > > > +		rte_io_rmb();
> > >
> > > OK for this one since the rest of the code should not be run due to
> > > the condition (I'm not even sure a compiler barrier is necessary at
> > > all here).
> > >
> > > >  #ifndef NDEBUG
> > > >  		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
> > > >  			     MLX4_CQE_OPCODE_ERROR)) {
> > > > @@ -203,7 +203,7 @@ struct pv {
> > > >  	 */
> > > >  	cq->cons_index = cons_index;
> > > >  	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
> > > > -	rte_wmb();
> > > > +	rte_io_wmb();
> > >
> > > This one could be removed entirely as well, which is more or less
> > > what the move to a compiler barrier does. Nothing in subsequent code
> > > depends on this doorbell being written, so this can piggy back on
> > > any subsequent rte_wmb().
> >
> > Yes, you are right; this code was probably taken from a multi-threaded
> > implementation.
> > >
> > > On the other hand in my opinion a barrier (compiler or otherwise)
> > > might be needed before the doorbell write, to make clear it cannot
> > > somehow be done earlier in case something attempts to optimize it away.
> > >
> > I think we can remove it entirely (the compiler can't optimize the ci_db
> > store away since it depends on previous code (cons_index)).
> 
> Right, however you may still run into issues if the compiler determines the
> final cons_index value by looking at the loop and decides to store it before
> entering/leaving it. That's the kind of problematic optimization I was thinking
> of.
> 
> The barrier in that sense is just to assert the order of seemingly unrelated
> load/stores.

I think that if I leave the rte_io_rmb() after the CQE owner check, we address both concerns:
1. The read-ordering concern: the owner bit is read before the CQE is used.
2. The ci_db concern: the compiler must read the last CQE (which is not valid, so there is no more stamping to do) before it knows the final value of cons_index.
So we can remove this rte_io_wmb() from the completion function entirely.
What do you think?
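
The resulting completion flow would look roughly like this (sketch,
structure abridged and constant names from memory):

	do {
		/* ... locate the next CQE ... */
		if (unlikely(!!(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK) ^
			     !!(cons_index & cq->cqe_cnt)))
			break;
		/*
		 * Read the CQE only after the ownership bit; this also
		 * forces the last (invalid) CQE read before cons_index is
		 * final, so the doorbell store below cannot be hoisted.
		 */
		rte_io_rmb();
		/* ... consume the CQE, advance cons_index ... */
	} while (1);
	cq->cons_index = cons_index;
	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
	/* No barrier here anymore. */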
 
> > > >  			/* Fill the control parameters for this packet. */
> > > > @@ -533,7 +534,7 @@ static int handle_multi_segs(struct rte_mbuf *buf,
> > > >  		 * setting ownership bit (because HW can start
> > > >  		 * executing as soon as we do).
> > > >  		 */
> > > > -		rte_wmb();
> > > > +		rte_io_wmb();
> > >
> > > This one looks dangerous. A compiler barrier is not strong enough to
> > > guarantee the order in which the CPU will execute instructions; it only
> > > makes sure what follows the barrier doesn't appear before it in the
> > > generated code.
> > >
> > As I investigated, I found that for CPUs which don't preserve store order
> > between different addresses (arm, ppc), rte_io_wmb() is mapped to
> > rte_wmb().
> > So for those which do preserve it (x86), we only need the right order in
> > the compiled code, because all the relevant stores target the same memory
> > domain (DRAM), and therefore the actual store order is guaranteed as well.
> > This is unlike the doorbell store, which is directed to a different memory
> > domain (PCI).
> > So the only place which needs rte_wmb() is before the doorbell write.
> 
> Fair enough, although after re-reading the code I think there's another
> issue present since the beginning: both the ctrl and dseg pointers are not
> volatile; this means intermediate writes are not guaranteed to occur in the
> expected order, or even at all, even in the presence of a barrier.
> 
> The volatile attribute should be inherited from both struct mlx4_cq and struct
> mlx4_sq (buf, db and most if not all other pointers). I think a separate fixes
> commit should add it for safety.

Good catch, I will add it, thanks!
> 
> > > Unless the comment above this barrier is wrong, this change may
> > > cause hard-to-debug issues down the road; you should drop it.
> > >
> > > >  		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
> > > >  					      ((sq->head & sq->txbb_cnt) ?
> > > >  						       MLX4_BIT_WQE_OWN : 0));
> > > > --
> > > > 1.8.3.1
> > > >
> > >
> > > --
> > > Adrien Mazarguil
> > > 6WIND
> >
> > Thanks!
> 
> --
> Adrien Mazarguil
> 6WIND
  
Adrien Mazarguil Oct. 31, 2017, 1:21 p.m. UTC | #5
Hi Matan,

On Tue, Oct 31, 2017 at 11:35:29AM +0000, Matan Azrad wrote:
<snip>
> > -----Original Message-----
> > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > Sent: Tuesday, October 31, 2017 12:17 PM
> > To: Matan Azrad <matan@mellanox.com>
> > Cc: dev@dpdk.org; Ophir Munk <ophirmu@mellanox.com>
> > Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory barriers
> > 
> > Hi Matan,
> > 
> > On Mon, Oct 30, 2017 at 07:47:20PM +0000, Matan Azrad wrote:
> > > Hi Adrien
> > >
> > > > -----Original Message-----
> > > > From: Adrien Mazarguil [mailto:adrien.mazarguil@6wind.com]
> > > > Sent: Monday, October 30, 2017 4:24 PM
> > > > To: Matan Azrad <matan@mellanox.com>
> > > > Cc: dev@dpdk.org; Ophir Munk <ophirmu@mellanox.com>
> > > > Subject: Re: [PATCH v3 6/7] net/mlx4: mitigate Tx path memory
> > > > barriers
> > > >
> > > > On Mon, Oct 30, 2017 at 10:07:28AM +0000, Matan Azrad wrote:
> > > > > Replace most of the memory barriers with compiler barriers, since
> > > > > they all target DRAM; this improves code efficiency on systems
> > > > > that enforce store ordering between different addresses.
> > > > >
> > > > > Only the doorbell record store should be protected by a memory
> > > > > barrier, since it targets the PCI memory domain.
> > > > >
> > > > > Limit the compiler barrier before the byte count store to systems
> > > > > whose cache line size is smaller than 64B (the TXBB size).
> > > > >
> > > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
<snip>
> > > > >  	cq->cons_index = cons_index;
> > > > >  	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
> > > > > -	rte_wmb();
> > > > > +	rte_io_wmb();
> > > >
> > > > This one could be removed entirely as well, which is more or less
> > > > what the move to a compiler barrier does. Nothing in subsequent code
> > > > depends on this doorbell being written, so this can piggy back on
> > > > any subsequent rte_wmb().
> > >
> > > Yes, you are right; this code was probably taken from a multi-threaded
> > > implementation.
> > > >
> > > > On the other hand in my opinion a barrier (compiler or otherwise)
> > > > might be needed before the doorbell write, to make clear it cannot
> > > > somehow be done earlier in case something attempts to optimize it away.
> > > >
> > > I think we can remove it entirely (the compiler can't optimize the ci_db
> > > store away since it depends on previous code (cons_index)).
> > 
> > Right, however you may still run into issues if the compiler determines the
> > final cons_index value by looking at the loop and decides to store it before
> > entering/leaving it. That's the kind of problematic optimization I was thinking
> > of.
> > 
> > The barrier in that sense is just to assert the order of seemingly unrelated
> > load/stores.
> 
> I think that if I leave the rte_io_rmb() after the CQE owner check, we address both concerns:
> 1. The read-ordering concern: the owner bit is read before the CQE is used.
> 2. The ci_db concern: the compiler must read the last CQE (which is not valid, so there is no more stamping to do) before it knows the final value of cons_index.
> So we can remove this rte_io_wmb() from the completion function entirely.
> What do you think?

That's right, this means there's a barrier before the doorbell write in any
case, OK then.

Just make sure cq->set_ci_db is volatile in a prior "fix" commit as
described in my previous suggestion; otherwise the remaining barriers won't
guarantee much. Thanks.
  

Patch

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 8ea8851..482c399 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -168,7 +168,7 @@  struct pv {
 		/*
 		 * Make sure we read the CQE after we read the ownership bit.
 		 */
-		rte_rmb();
+		rte_io_rmb();
 #ifndef NDEBUG
 		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
 			     MLX4_CQE_OPCODE_ERROR)) {
@@ -203,7 +203,7 @@  struct pv {
 	 */
 	cq->cons_index = cons_index;
 	*cq->set_ci_db = rte_cpu_to_be_32(cq->cons_index & MLX4_CQ_DB_CI_MASK);
-	rte_wmb();
+	rte_io_wmb();
 	sq->tail = sq->tail + nr_txbbs;
 	/* Update the list of packets posted for transmission. */
 	elts_comp -= pkts;
@@ -321,6 +321,7 @@  static int handle_multi_segs(struct rte_mbuf *buf,
 		 * control segment.
 		 */
 		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
+#if RTE_CACHE_LINE_SIZE < 64
 			/*
 			 * Need a barrier here before writing the byte_count
 			 * fields to make sure that all the data is visible
@@ -331,6 +332,7 @@  static int handle_multi_segs(struct rte_mbuf *buf,
 			 * data, and end up sending the wrong data.
 			 */
 			rte_io_wmb();
+#endif /* RTE_CACHE_LINE_SIZE */
 			dseg->byte_count = byte_count;
 		} else {
 			/*
@@ -469,8 +471,7 @@  static int handle_multi_segs(struct rte_mbuf *buf,
 				break;
 			}
 #endif /* NDEBUG */
-			/* Need a barrier here before byte count store. */
-			rte_io_wmb();
+			/* Never be TXBB aligned, no need compiler barrier. */
 			dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
 
 			/* Fill the control parameters for this packet. */
@@ -533,7 +534,7 @@  static int handle_multi_segs(struct rte_mbuf *buf,
 		 * setting ownership bit (because HW can start
 		 * executing as soon as we do).
 		 */
-		rte_wmb();
+		rte_io_wmb();
 		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
 					      ((sq->head & sq->txbb_cnt) ?
 						       MLX4_BIT_WQE_OWN : 0));