[dpdk-dev,v2,4/7] net/mlx4: merge Tx path functions

Message ID 1508768520-4810-5-git-send-email-ophirmu@mellanox.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Ophir Munk Oct. 23, 2017, 2:21 p.m. UTC
  From: Matan Azrad <matan@mellanox.com>

Merge the tx_burst and mlx4_post_send functions so that the remaining
WQ space is checked only once per packet instead of twice.

This should improve performance.

Signed-off-by: Matan Azrad <matan@mellanox.com>
---
 drivers/net/mlx4/mlx4_rxtx.c | 353 +++++++++++++++++++++----------------------
 1 file changed, 170 insertions(+), 183 deletions(-)
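
In short (a condensed sketch of the change, illustrative only; see the full
diff at the bottom): the send queue room check is now done once, directly
inside the burst loop, instead of again in a separate mlx4_post_send() call.

 for (i = 0; i != pkts_n; ++i) {
 	struct rte_mbuf *buf = pkts[i];
 	int wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
 			buf->nb_segs * sizeof(struct mlx4_wqe_data_seg);
 	int nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);

 	/* Single room check for this packet's WQE. */
 	if (((sq->head - sq->tail) + nr_txbbs +
 	     sq->headroom_txbbs) >= sq->txbb_cnt ||
 	    nr_txbbs > MLX4_MAX_WQE_TXBBS)
 		break;
 	/* ... build and post the WQE inline, no second check ... */
 	sq->head += nr_txbbs;
 }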
  

Comments

Nélio Laranjeiro Oct. 24, 2017, 1:51 p.m. UTC | #1
On Mon, Oct 23, 2017 at 02:21:57PM +0000, Ophir Munk wrote:
> From: Matan Azrad <matan@mellanox.com>
> 
> Merge tx_burst and mlx4_post_send functions to prevent
> double asking about WQ remain space.
> 
> This should improve performance.
> 
> Signed-off-by: Matan Azrad <matan@mellanox.com>
> ---
>  drivers/net/mlx4/mlx4_rxtx.c | 353 +++++++++++++++++++++----------------------
>  1 file changed, 170 insertions(+), 183 deletions(-)

What are the real expectations you have for the remaining patches of the
series?

According to this commit log's claim that "This should improve
performance", there are still too many barriers at the packet/segment level
to improve anything.

The point is, mlx4_burst_tx() should write all the WQEs without any
barrier, since it is processing a burst of packets (whereas the Verbs
functions may only process a single packet).  The only barrier which
should be present is the one ensuring that all host memory writes are
flushed before triggering the Tx doorbell.

There are also too many handled cases which are useless in a burst
situation; this function needs to be rewritten down to its minimal use
case, i.e. processing a valid burst of packets/segments and triggering the
Tx doorbell at the end of the burst.
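
A rough sketch of the structure being described (illustrative only, not
actual driver code):

 for (i = 0; i != pkts_n; ++i) {
 	/* Write the WQE(s) of pkts[i]; no per-packet barrier. */
 }
 /* Flush host memory once for the whole burst. */
 rte_wmb();
 /* Trigger the Tx doorbell once. */
 return i;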

Regards,
  
Ophir Munk Oct. 24, 2017, 8:36 p.m. UTC | #2
Hi,

On Tuesday, October 24, 2017 4:52 PM, Nélio Laranjeiro wrote:
> 
> On Mon, Oct 23, 2017 at 02:21:57PM +0000, Ophir Munk wrote:
> > From: Matan Azrad <matan@mellanox.com>
> >
> > Merge tx_burst and mlx4_post_send functions to prevent double asking
> > about WQ remain space.
> >
> > This should improve performance.
> >
> > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > ---
> >  drivers/net/mlx4/mlx4_rxtx.c | 353
> > +++++++++++++++++++++----------------------
> >  1 file changed, 170 insertions(+), 183 deletions(-)
> 
> What are the real expectation you have on the remaining patches of the
> series?
> 
> According to the comment of this commit log "This should improve
> performance" there are too many barriers at each packet/segment level to
> improve something.
> 
> The point is, mlx4_burst_tx() should write all the WQE without any barrier as
> it is processing a burst of packets (whereas Verbs functions which may only
> process a single packet).  

> The lonely barrier which should be present is the
> one to ensure that all the host memory is flushed before triggering the Tx
> doorbell.
> 

There is a known ConnectX-3 HW limitation: the first 4 bytes of every TXBB (64-byte chunk) must be
written in reverse order (from the last TXBB to the first TXBB).
The last 60 bytes of any TXBB can be written in any order (before writing the first 4 bytes).
Is your last statement (using a single barrier) in accordance with this limitation? Please explain.
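
For reference, the merged code in the patch below already handles this
constraint within a single WQE: a byte_count that falls at the start of a
TXBB is postponed into the pv[] bounce buffer and written only after a
barrier, in reverse order:

		/* Write the first DWORD of each TXBB saved earlier. */
		if (pv_counter) {
			/* Need a barrier before writing the byte_count. */
			rte_io_wmb();
			for (--pv_counter; pv_counter >= 0; pv_counter--)
				pv[pv_counter].dseg->byte_count =
						pv[pv_counter].val;
		}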

> There is also too many cases handled which are useless in bursts situation,
> this function needs to be re-written to its minimal use case i.e.  processing a
> valid burst of packets/segments and triggering at the end of the burst the Tx
> doorbell.
> 
> Regards,
> 
> --
> Nélio Laranjeiro
> 6WIND
  
Nélio Laranjeiro Oct. 25, 2017, 7:50 a.m. UTC | #3
On Tue, Oct 24, 2017 at 08:36:52PM +0000, Ophir Munk wrote:
> Hi,
> 
> On Tuesday, October 24, 2017 4:52 PM, Nélio Laranjeiro wrote:
> > 
> > On Mon, Oct 23, 2017 at 02:21:57PM +0000, Ophir Munk wrote:
> > > From: Matan Azrad <matan@mellanox.com>
> > >
> > > Merge tx_burst and mlx4_post_send functions to prevent double asking
> > > about WQ remain space.
> > >
> > > This should improve performance.
> > >
> > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > ---
> > >  drivers/net/mlx4/mlx4_rxtx.c | 353
> > > +++++++++++++++++++++----------------------
> > >  1 file changed, 170 insertions(+), 183 deletions(-)
> > 
> > What are the real expectation you have on the remaining patches of the
> > series?
> > 
> > According to the comment of this commit log "This should improve
> > performance" there are too many barriers at each packet/segment level to
> > improve something.
> > 
> > The point is, mlx4_burst_tx() should write all the WQE without any barrier as
> > it is processing a burst of packets (whereas Verbs functions which may only
> > process a single packet).  
> 
> > The lonely barrier which should be present is the
> > one to ensure that all the host memory is flushed before triggering the Tx
> > doorbell.
> > 
> 
> There is a known ConnectX-3 HW limitation: the first 4 bytes of every
> TXWBB (64 bytes chunks) should be 
> written in a reversed order (from last TXWBB to first TXWBB).

This means the first WQE filled by the burst function acts as the doorbell.
In that situation, its first four bytes can be written just before
leaving the burst function, after a write memory barrier.

As long as this first WQE is incomplete, the NIC won't start processing
the packets.  Per-packet memory barriers become useless.

It gives something like:

 uint32_t tx_bb_db = 0;
 void *first_wqe = NULL;

 /*
  * Prepare all packets by writing the WQEs, except for the first 4
  * bytes of the first WQE.
  */
 for () {
 	if (!first_wqe) {
 		first_wqe = wqe;
 		tx_bb_db = foo;
 	}
 }
 /* Leaving. */
 rte_wmb();
 *(uint32_t *)first_wqe = tx_bb_db;
 return n;

> The last 60 bytes of any TXWBB can be written in any order (before
> writing the first 4 bytes).
> Is your last statement (using lonely barrier) is in accordance with
> this limitation? Please explain.
> 
> > There is also too many cases handled which are useless in bursts situation,
> > this function needs to be re-written to its minimal use case i.e.  processing a
> > valid burst of packets/segments and triggering at the end of the burst the Tx
> > doorbell.
> > 

Regards,
  
Matan Azrad Oct. 26, 2017, 10:31 a.m. UTC | #4
Hi Nelio

I think the memory barrier discussion is not relevant to this patch (if it becomes relevant I will create a new one).
Please see my comments inline.

Regarding this specific patch, I didn't see any comment from you. Do you agree with it?
 
> -----Original Message-----
> From: Nélio Laranjeiro [mailto:nelio.laranjeiro@6wind.com]
> Sent: Wednesday, October 25, 2017 10:50 AM
> To: Ophir Munk <ophirmu@mellanox.com>
> Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org;
> Thomas Monjalon <thomas@monjalon.net>; Olga Shern
> <olgas@mellanox.com>; Matan Azrad <matan@mellanox.com>
> Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path functions
> 
> On Tue, Oct 24, 2017 at 08:36:52PM +0000, Ophir Munk wrote:
> > Hi,
> >
> > On Tuesday, October 24, 2017 4:52 PM, Nélio Laranjeiro wrote:
> > >
> > > On Mon, Oct 23, 2017 at 02:21:57PM +0000, Ophir Munk wrote:
> > > > From: Matan Azrad <matan@mellanox.com>
> > > >
> > > > Merge tx_burst and mlx4_post_send functions to prevent double
> > > > asking about WQ remain space.
> > > >
> > > > This should improve performance.
> > > >
> > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > ---
> > > >  drivers/net/mlx4/mlx4_rxtx.c | 353
> > > > +++++++++++++++++++++----------------------
> > > >  1 file changed, 170 insertions(+), 183 deletions(-)
> > >
> > > What are the real expectation you have on the remaining patches of
> > > the series?
> > >
> > > According to the comment of this commit log "This should improve
> > > performance" there are too many barriers at each packet/segment
> > > level to improve something.
> > >
> > > The point is, mlx4_burst_tx() should write all the WQE without any
> > > barrier as it is processing a burst of packets (whereas Verbs
> > > functions which may only process a single packet).
> >
> > > The lonely barrier which should be present is the one to ensure that
> > > all the host memory is flushed before triggering the Tx doorbell.
> > >
> >
> > There is a known ConnectX-3 HW limitation: the first 4 bytes of every
> > TXWBB (64 bytes chunks) should be
> > written in a reversed order (from last TXWBB to first TXWBB).
> 
> This means the first WQE filled by the burst function is the doorbell.
> In such situation, the first four bytes of it can be written before
> leaving the burst function and after a write memory barrier.
> 
> Until this first WQE is not complete, the NIC won't start processing the
> packets.  Memory barriers per packets becomes useless.

I think this is not true, since the mlx4 HW can prefetch later TXBBs whose first 4
bytes are valid even while the first WQE is still not valid (please read the spec).

> 
> It gives something like:
> 
>  uint32_t tx_bb_db = 0;
>  void *first_wqe = NULL;
> 
>  /*
>   * Prepare all Packets by writing the WQEs without the 4 first bytes of
>   * the first WQE.
>   */
>  for () {
>  	if (!wqe) {
> 		first_wqe = wqe;
> 		tx_bb_db = foo;
> 	}
>  }
>  /* Leaving. */
>  rte_wmb();
>  *(uin32_t*)wqe = tx_bb_db;
>  return n;
>

I will check whether we can do it in 2 loops:
Write the last 60B of each TXBB.
Memory barrier.
Write the first 4B of each TXBB.
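
Roughly, with hypothetical helpers just to show the intended ordering:

 /* Pass 1: write the last 60B of every TXBB, in any order. */
 for (i = 0; i < nr_txbbs; i++)
 	write_txbb_body(&txbb[i]);		/* hypothetical helper */
 /* Make the bodies visible before any first dword is written. */
 rte_io_wmb();
 /* Pass 2: write the first 4B of each TXBB, from last TXBB to first. */
 for (i = nr_txbbs - 1; i >= 0; i--)
 	write_txbb_first_dword(&txbb[i]);	/* hypothetical helper */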

> > The last 60 bytes of any TXWBB can be written in any order (before
> > writing the first 4 bytes).
> > Is your last statement (using lonely barrier) is in accordance with
> > this limitation? Please explain.
> >
> > > There is also too many cases handled which are useless in bursts
> situation,
> > > this function needs to be re-written to its minimal use case i.e.
> processing a
> > > valid burst of packets/segments and triggering at the end of the burst the
> Tx
> > > doorbell.
> > >
> 
> Regards,
> 
> --
> Nélio Laranjeiro
> 6WIND
  
Nélio Laranjeiro Oct. 26, 2017, 12:12 p.m. UTC | #5
On Thu, Oct 26, 2017 at 10:31:06AM +0000, Matan Azrad wrote:
> Hi Nelio
> 
> I think the memory barrier discussion is not relevant for this patch
> (if it will be relevant I will create new one).
> Please see my comments inline.

It was not my only comment.  There is also useless code, such as handling
null segments in the packets, which is not allowed in DPDK.

> Regarding this specific patch, I didn't see any comment from you, Are
> you agree with it? 
>  
> > -----Original Message-----
> > From: Nélio Laranjeiro [mailto:nelio.laranjeiro@6wind.com]
> > Sent: Wednesday, October 25, 2017 10:50 AM
> > To: Ophir Munk <ophirmu@mellanox.com>
> > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org;
> > Thomas Monjalon <thomas@monjalon.net>; Olga Shern
> > <olgas@mellanox.com>; Matan Azrad <matan@mellanox.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path functions
> > 
> > On Tue, Oct 24, 2017 at 08:36:52PM +0000, Ophir Munk wrote:
> > > Hi,
> > >
> > > On Tuesday, October 24, 2017 4:52 PM, Nélio Laranjeiro wrote:
> > > >
> > > > On Mon, Oct 23, 2017 at 02:21:57PM +0000, Ophir Munk wrote:
> > > > > From: Matan Azrad <matan@mellanox.com>
> > > > >
> > > > > Merge tx_burst and mlx4_post_send functions to prevent double
> > > > > asking about WQ remain space.
> > > > >
> > > > > This should improve performance.
> > > > >
> > > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > > ---
> > > > >  drivers/net/mlx4/mlx4_rxtx.c | 353
> > > > > +++++++++++++++++++++----------------------
> > > > >  1 file changed, 170 insertions(+), 183 deletions(-)
> > > >
> > > > What are the real expectation you have on the remaining patches of
> > > > the series?
> > > >
> > > > According to the comment of this commit log "This should improve
> > > > performance" there are too many barriers at each packet/segment
> > > > level to improve something.
> > > >
> > > > The point is, mlx4_burst_tx() should write all the WQE without any
> > > > barrier as it is processing a burst of packets (whereas Verbs
> > > > functions which may only process a single packet).
> > >
> > > > The lonely barrier which should be present is the one to ensure that
> > > > all the host memory is flushed before triggering the Tx doorbell.
> > > >
> > >
> > > There is a known ConnectX-3 HW limitation: the first 4 bytes of every
> > > TXWBB (64 bytes chunks) should be
> > > written in a reversed order (from last TXWBB to first TXWBB).
> > 
> > This means the first WQE filled by the burst function is the doorbell.
> > In such situation, the first four bytes of it can be written before
> > leaving the burst function and after a write memory barrier.
> > 
> > Until this first WQE is not complete, the NIC won't start processing the
> > packets.  Memory barriers per packets becomes useless.
> 
> I think this is not true, Since mlx4 HW can prefetch advanced TXbbs if their first 4
> bytes are valid in spite of the first WQE is still not valid (please read the spec).

A compiler barrier is enough on x86 to prevent the stores from being
re-ordered; on arm you need a memory barrier, and there is a macro in DPDK
for that, rte_io_wmb().

Before triggering the doorbell you must flush the cache; this is the only
place where rte_wmb() should be used.
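
For reference, this is how the two are currently split in the code below
(excerpted); the rte_wmb() is still issued per packet there, which is the
point under discussion:

		/* Within a WQE: order data stores before byte_count. */
		rte_io_wmb();
		dseg->byte_count = byte_count;
		/* ... */
		/* Before handing ownership to the HW: flush host memory. */
		rte_wmb();
		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
					      ((sq->head & sq->txbb_cnt) ?
					       MLX4_BIT_WQE_OWN : 0));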

> > It gives something like:
> > 
> >  uint32_t tx_bb_db = 0;
> >  void *first_wqe = NULL;
> > 
> >  /*
> >   * Prepare all Packets by writing the WQEs without the 4 first bytes of
> >   * the first WQE.
> >   */
> >  for () {
> >  	if (!wqe) {
> > 		first_wqe = wqe;
> > 		tx_bb_db = foo;
> > 	}
> >  }
> >  /* Leaving. */
> >  rte_wmb();
> >  *(uin32_t*)wqe = tx_bb_db;
> >  return n;
> >
> 
> I will take care to check if we can do 2 loops:
> Write all  last 60B per TXbb.
> Memory barrier.
> Write all first 4B per TXbbs.
> 
> > > The last 60 bytes of any TXWBB can be written in any order (before
> > > writing the first 4 bytes).
> > > Is your last statement (using lonely barrier) is in accordance with
> > > this limitation? Please explain.
> > >
> > > > There is also too many cases handled which are useless in bursts
> > situation,
> > > > this function needs to be re-written to its minimal use case i.e.
> > processing a
> > > > valid burst of packets/segments and triggering at the end of the burst the
> > Tx
> > > > doorbell.
> > > >
> > 
> > Regards,
> > 
> > --
> > Nélio Laranjeiro
> > 6WIND

Regards,
  
Matan Azrad Oct. 26, 2017, 12:30 p.m. UTC | #6
Hi Nelio
Please see my comments below (3).


> -----Original Message-----
> From: Nélio Laranjeiro [mailto:nelio.laranjeiro@6wind.com]
> Sent: Thursday, October 26, 2017 3:12 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: Ophir Munk <ophirmu@mellanox.com>; Adrien Mazarguil
> <adrien.mazarguil@6wind.com>; dev@dpdk.org; Thomas Monjalon
> <thomas@monjalon.net>; Olga Shern <olgas@mellanox.com>; Mordechay
> Haimovsky <motih@mellanox.com>
> Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path functions
> 
> On Thu, Oct 26, 2017 at 10:31:06AM +0000, Matan Azrad wrote:
> > Hi Nelio
> >
> > I think the memory barrier discussion is not relevant for this patch
> > (if it will be relevant I will create new one).
> > Please see my comments inline.
> 
> It was not my single comment.  There is also useless code like having null
> segments in the packets which is not allowed on DPDK.

Sorry, but I can't find such comments in the previous mails.
Moreover, this comment (which I see for the first time) is not relevant to this patch and asks for something else.
All this patch does is merge 2 functions so that the remaining WQ space
is checked only once...
Removing memory/compiler barriers or dealing with null segments is not in scope here.

> 
> > Regarding this specific patch, I didn't see any comment from you, Are
> > you agree with it?
> >
> > > -----Original Message-----
> > > From: Nélio Laranjeiro [mailto:nelio.laranjeiro@6wind.com]
> > > Sent: Wednesday, October 25, 2017 10:50 AM
> > > To: Ophir Munk <ophirmu@mellanox.com>
> > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org;
> > > Thomas Monjalon <thomas@monjalon.net>; Olga Shern
> > > <olgas@mellanox.com>; Matan Azrad <matan@mellanox.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path
> > > functions
> > >
> > > On Tue, Oct 24, 2017 at 08:36:52PM +0000, Ophir Munk wrote:
> > > > Hi,
> > > >
> > > > On Tuesday, October 24, 2017 4:52 PM, Nélio Laranjeiro wrote:
> > > > >
> > > > > On Mon, Oct 23, 2017 at 02:21:57PM +0000, Ophir Munk wrote:
> > > > > > From: Matan Azrad <matan@mellanox.com>
> > > > > >
> > > > > > Merge tx_burst and mlx4_post_send functions to prevent double
> > > > > > asking about WQ remain space.
> > > > > >
> > > > > > This should improve performance.
> > > > > >
> > > > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > > > ---
> > > > > >  drivers/net/mlx4/mlx4_rxtx.c | 353
> > > > > > +++++++++++++++++++++----------------------
> > > > > >  1 file changed, 170 insertions(+), 183 deletions(-)
> > > > >
> > > > > What are the real expectation you have on the remaining patches
> > > > > of the series?
> > > > >
> > > > > According to the comment of this commit log "This should improve
> > > > > performance" there are too many barriers at each packet/segment
> > > > > level to improve something.
> > > > >
> > > > > The point is, mlx4_burst_tx() should write all the WQE without
> > > > > any barrier as it is processing a burst of packets (whereas
> > > > > Verbs functions which may only process a single packet).
> > > >
> > > > > The lonely barrier which should be present is the one to ensure
> > > > > that all the host memory is flushed before triggering the Tx doorbell.
> > > > >
> > > >
> > > > There is a known ConnectX-3 HW limitation: the first 4 bytes of
> > > > every TXWBB (64 bytes chunks) should be written in a reversed
> > > > order (from last TXWBB to first TXWBB).
> > >
> > > This means the first WQE filled by the burst function is the doorbell.
> > > In such situation, the first four bytes of it can be written before
> > > leaving the burst function and after a write memory barrier.
> > >
> > > Until this first WQE is not complete, the NIC won't start processing
> > > the packets.  Memory barriers per packets becomes useless.
> >
> > I think this is not true, Since mlx4 HW can prefetch advanced TXbbs if
> > their first 4 bytes are valid in spite of the first WQE is still not valid (please
> read the spec).
> 
> A compiler barrier is enough on x86 to forbid the CPU to re-order the
> instructions, on arm you need a memory barrier, there is a macro in DPDK for
> that, rte_io_wmb().
> 

We are also using a compiler barrier here.

> Before triggering the doorbell you must flush the case, this is the only place
> where the rte_wmb() should be used.
> 

We are also using the memory barrier only for this reason.

> > > It gives something like:
> > >
> > >  uint32_t tx_bb_db = 0;
> > >  void *first_wqe = NULL;
> > >
> > >  /*
> > >   * Prepare all Packets by writing the WQEs without the 4 first bytes of
> > >   * the first WQE.
> > >   */
> > >  for () {
> > >  	if (!wqe) {
> > > 		first_wqe = wqe;
> > > 		tx_bb_db = foo;
> > > 	}
> > >  }
> > >  /* Leaving. */
> > >  rte_wmb();
> > >  *(uin32_t*)wqe = tx_bb_db;
> > >  return n;
> > >
> >
> > I will take care to check if we can do 2 loops:
> > Write all  last 60B per TXbb.
> > Memory barrier.
> > Write all first 4B per TXbbs.
> >
> > > > The last 60 bytes of any TXWBB can be written in any order (before
> > > > writing the first 4 bytes).
> > > > Is your last statement (using lonely barrier) is in accordance
> > > > with this limitation? Please explain.
> > > >
> > > > > There is also too many cases handled which are useless in bursts
> > > situation,
> > > > > this function needs to be re-written to its minimal use case i.e.
> > > processing a
> > > > > valid burst of packets/segments and triggering at the end of the
> > > > > burst the
> > > Tx
> > > > > doorbell.
> > > > >
> > >
> > > Regards,
> > >
> > > --
> > > Nélio Laranjeiro
> > > 6WIND
> 
> Regards,
> 
> --
> Nélio Laranjeiro
> 6WIND
  
Nélio Laranjeiro Oct. 26, 2017, 1:44 p.m. UTC | #7
On Thu, Oct 26, 2017 at 12:30:54PM +0000, Matan Azrad wrote:
> Hi Nelio
> Please see my comments below (3).
> 
> 
> > -----Original Message-----
> > From: Nélio Laranjeiro [mailto:nelio.laranjeiro@6wind.com]
> > Sent: Thursday, October 26, 2017 3:12 PM
> > To: Matan Azrad <matan@mellanox.com>
> > Cc: Ophir Munk <ophirmu@mellanox.com>; Adrien Mazarguil
> > <adrien.mazarguil@6wind.com>; dev@dpdk.org; Thomas Monjalon
> > <thomas@monjalon.net>; Olga Shern <olgas@mellanox.com>; Mordechay
> > Haimovsky <motih@mellanox.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path functions
> > 
> > On Thu, Oct 26, 2017 at 10:31:06AM +0000, Matan Azrad wrote:
> > > Hi Nelio
> > >
> > > I think the memory barrier discussion is not relevant for this patch
> > > (if it will be relevant I will create new one).
> > > Please see my comments inline.
> > 
> > It was not my single comment.  There is also useless code like having null
> > segments in the packets which is not allowed on DPDK.
> 
> Sorry, but I can't find comments in the previous mails.

You should search in the series.

> Moreover  this comment(first time I see it) is not relevant to this patch and asking something else.
> All what this patch does is to merge 2 functions to prevent double
> asking about WQ remain space...

Again in the series itself.

The point is, this series embeds 7 patches for "performance improvement",
whereas the only improvement is avoiding a call to an outside function by
copy/pasting it into the PMD.
In fact it will save a few cycles, but the improvement could have been
much larger if it was not a bare copy/paste.

The real question is: what is the improvement?  If the improvement is
significant, it is worth having this series; otherwise it is not, as it
may also bring bugs which would get fixed in their original source
whereas this copy will keep them.

> Remove memory\compiler barriers or dealing with null segments are not in the scope here. 
> 
> > 
> > > Regarding this specific patch, I didn't see any comment from you, Are
> > > you agree with it?
> > >
> > > > -----Original Message-----
> > > > From: Nélio Laranjeiro [mailto:nelio.laranjeiro@6wind.com]
> > > > Sent: Wednesday, October 25, 2017 10:50 AM
> > > > To: Ophir Munk <ophirmu@mellanox.com>
> > > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org;
> > > > Thomas Monjalon <thomas@monjalon.net>; Olga Shern
> > > > <olgas@mellanox.com>; Matan Azrad <matan@mellanox.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path
> > > > functions
> > > >
> > > > On Tue, Oct 24, 2017 at 08:36:52PM +0000, Ophir Munk wrote:
> > > > > Hi,
> > > > >
> > > > > On Tuesday, October 24, 2017 4:52 PM, Nélio Laranjeiro wrote:
> > > > > >
> > > > > > On Mon, Oct 23, 2017 at 02:21:57PM +0000, Ophir Munk wrote:
> > > > > > > From: Matan Azrad <matan@mellanox.com>
> > > > > > >
> > > > > > > Merge tx_burst and mlx4_post_send functions to prevent double
> > > > > > > asking about WQ remain space.
> > > > > > >
> > > > > > > This should improve performance.
> > > > > > >
> > > > > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > > > > ---
> > > > > > >  drivers/net/mlx4/mlx4_rxtx.c | 353
> > > > > > > +++++++++++++++++++++----------------------
> > > > > > >  1 file changed, 170 insertions(+), 183 deletions(-)
> > > > > >
> > > > > > What are the real expectation you have on the remaining patches
> > > > > > of the series?
> > > > > >
> > > > > > According to the comment of this commit log "This should improve
> > > > > > performance" there are too many barriers at each packet/segment
> > > > > > level to improve something.
> > > > > >
> > > > > > The point is, mlx4_burst_tx() should write all the WQE without
> > > > > > any barrier as it is processing a burst of packets (whereas
> > > > > > Verbs functions which may only process a single packet).
> > > > >
> > > > > > The lonely barrier which should be present is the one to ensure
> > > > > > that all the host memory is flushed before triggering the Tx doorbell.
> > > > > >
> > > > >
> > > > > There is a known ConnectX-3 HW limitation: the first 4 bytes of
> > > > > every TXWBB (64 bytes chunks) should be written in a reversed
> > > > > order (from last TXWBB to first TXWBB).
> > > >
> > > > This means the first WQE filled by the burst function is the doorbell.
> > > > In such situation, the first four bytes of it can be written before
> > > > leaving the burst function and after a write memory barrier.
> > > >
> > > > Until this first WQE is not complete, the NIC won't start processing
> > > > the packets.  Memory barriers per packets becomes useless.
> > >
> > > I think this is not true, Since mlx4 HW can prefetch advanced TXbbs if
> > > their first 4 bytes are valid in spite of the first WQE is still not valid (please
> > read the spec).
> > 
> > A compiler barrier is enough on x86 to forbid the CPU to re-order the
> > instructions, on arm you need a memory barrier, there is a macro in DPDK for
> > that, rte_io_wmb().
> > 
> We are also using compiler barrier here.
> 
> > Before triggering the doorbell you must flush the case, this is the only place
> > where the rte_wmb() should be used.
> > 
> 
> We are also using memory barrier only for this reason.
> 
> > > > It gives something like:
> > > >
> > > >  uint32_t tx_bb_db = 0;
> > > >  void *first_wqe = NULL;
> > > >
> > > >  /*
> > > >   * Prepare all Packets by writing the WQEs without the 4 first bytes of
> > > >   * the first WQE.
> > > >   */
> > > >  for () {
> > > >  	if (!wqe) {
> > > > 		first_wqe = wqe;
> > > > 		tx_bb_db = foo;
> > > > 	}
> > > >  }
> > > >  /* Leaving. */
> > > >  rte_wmb();
> > > >  *(uin32_t*)wqe = tx_bb_db;
> > > >  return n;
> > > >
> > >
> > > I will take care to check if we can do 2 loops:
> > > Write all  last 60B per TXbb.
> > > Memory barrier.
> > > Write all first 4B per TXbbs.
> > >
> > > > > The last 60 bytes of any TXWBB can be written in any order (before
> > > > > writing the first 4 bytes).
> > > > > Is your last statement (using lonely barrier) is in accordance
> > > > > with this limitation? Please explain.
> > > > >
> > > > > > There is also too many cases handled which are useless in bursts
> > > > situation,
> > > > > > this function needs to be re-written to its minimal use case i.e.
> > > > processing a
> > > > > > valid burst of packets/segments and triggering at the end of the
> > > > > > burst the
> > > > Tx
> > > > > > doorbell.
> > > > > >
> > > >
> > > > Regards,
> > > >
> > > > --
> > > > Nélio Laranjeiro
> > > > 6WIND
> > 
> > Regards,
> > 
> > --
> > Nélio Laranjeiro
> > 6WIND
  
Matan Azrad Oct. 26, 2017, 4:21 p.m. UTC | #8
Hi Nelio

> -----Original Message-----
> From: Nélio Laranjeiro [mailto:nelio.laranjeiro@6wind.com]
> Sent: Thursday, October 26, 2017 4:44 PM
> To: Matan Azrad <matan@mellanox.com>
> Cc: Ophir Munk <ophirmu@mellanox.com>; Adrien Mazarguil
> <adrien.mazarguil@6wind.com>; dev@dpdk.org; Thomas Monjalon
> <thomas@monjalon.net>; Olga Shern <olgas@mellanox.com>; Mordechay
> Haimovsky <motih@mellanox.com>
> Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path functions
> 
> On Thu, Oct 26, 2017 at 12:30:54PM +0000, Matan Azrad wrote:
> > Hi Nelio
> > Please see my comments below (3).
> >
> >
> > > -----Original Message-----
> > > From: Nélio Laranjeiro [mailto:nelio.laranjeiro@6wind.com]
> > > Sent: Thursday, October 26, 2017 3:12 PM
> > > To: Matan Azrad <matan@mellanox.com>
> > > Cc: Ophir Munk <ophirmu@mellanox.com>; Adrien Mazarguil
> > > <adrien.mazarguil@6wind.com>; dev@dpdk.org; Thomas Monjalon
> > > <thomas@monjalon.net>; Olga Shern <olgas@mellanox.com>;
> Mordechay
> > > Haimovsky <motih@mellanox.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path
> > > functions
> > >
> > > On Thu, Oct 26, 2017 at 10:31:06AM +0000, Matan Azrad wrote:
> > > > Hi Nelio
> > > >
> > > > I think the memory barrier discussion is not relevant for this
> > > > patch (if it will be relevant I will create new one).
> > > > Please see my comments inline.
> > >
> > > It was not my single comment.  There is also useless code like
> > > having null segments in the packets which is not allowed on DPDK.
> >
> > Sorry, but I can't find comments in the previous mails.
> 
> You should search in the series,
> 
> > Moreover  this comment(first time I see it) is not relevant to this patch and
> asking something else.
> > All what this patch does is to merge 2 functions to prevent double
> > asking about WQ remain space...
> 
> Again in the series itself.
> 
> The point, this series embed 7 patches for "performance improvement",
> whereas the single improvement is avoiding to call an outside function by
> copy/pasting it into the PMD.
> In fact it will save few cycles, but this improvements could have been much
> more if the it was not a bare copy/paste.
> 

This simple merge improves throughput by 0.2 Mpps in my setup.
If you have further improvements (other than reducing the if statements) regarding this merge, please suggest them.

> The real question is what is the improvement?  If the improvement is
> significant, it worse having this series, otherwise it does not as it may also
> bring some bugs which may be resolve from its original source whereas this
> one will remain.
> 

Each commit in this series improves performance significantly, and together they brought us to our target.

By the way, I think the series-level discussion should be on patch 0 :)

> > Remove memory\compiler barriers or dealing with null segments are not in
> the scope here.
> >
> > >
> > > > Regarding this specific patch, I didn't see any comment from you,
> > > > Are you agree with it?
> > > >
> > > > > -----Original Message-----
> > > > > From: Nélio Laranjeiro [mailto:nelio.laranjeiro@6wind.com]
> > > > > Sent: Wednesday, October 25, 2017 10:50 AM
> > > > > To: Ophir Munk <ophirmu@mellanox.com>
> > > > > Cc: Adrien Mazarguil <adrien.mazarguil@6wind.com>; dev@dpdk.org;
> > > > > Thomas Monjalon <thomas@monjalon.net>; Olga Shern
> > > > > <olgas@mellanox.com>; Matan Azrad <matan@mellanox.com>
> > > > > Subject: Re: [dpdk-dev] [PATCH v2 4/7] net/mlx4: merge Tx path
> > > > > functions
> > > > >
> > > > > On Tue, Oct 24, 2017 at 08:36:52PM +0000, Ophir Munk wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Tuesday, October 24, 2017 4:52 PM, Nélio Laranjeiro wrote:
> > > > > > >
> > > > > > > On Mon, Oct 23, 2017 at 02:21:57PM +0000, Ophir Munk wrote:
> > > > > > > > From: Matan Azrad <matan@mellanox.com>
> > > > > > > >
> > > > > > > > Merge tx_burst and mlx4_post_send functions to prevent
> > > > > > > > double asking about WQ remain space.
> > > > > > > >
> > > > > > > > This should improve performance.
> > > > > > > >
> > > > > > > > Signed-off-by: Matan Azrad <matan@mellanox.com>
> > > > > > > > ---
> > > > > > > >  drivers/net/mlx4/mlx4_rxtx.c | 353
> > > > > > > > +++++++++++++++++++++----------------------
> > > > > > > >  1 file changed, 170 insertions(+), 183 deletions(-)
> > > > > > >
> > > > > > > What are the real expectation you have on the remaining
> > > > > > > patches of the series?
> > > > > > >
> > > > > > > According to the comment of this commit log "This should
> > > > > > > improve performance" there are too many barriers at each
> > > > > > > packet/segment level to improve something.
> > > > > > >
> > > > > > > The point is, mlx4_burst_tx() should write all the WQE
> > > > > > > without any barrier as it is processing a burst of packets
> > > > > > > (whereas Verbs functions which may only process a single
> packet).
> > > > > >
> > > > > > > The lonely barrier which should be present is the one to
> > > > > > > ensure that all the host memory is flushed before triggering the Tx
> doorbell.
> > > > > > >
> > > > > >
> > > > > > There is a known ConnectX-3 HW limitation: the first 4 bytes
> > > > > > of every TXWBB (64 bytes chunks) should be written in a
> > > > > > reversed order (from last TXWBB to first TXWBB).
> > > > >
> > > > > This means the first WQE filled by the burst function is the doorbell.
> > > > > In such situation, the first four bytes of it can be written
> > > > > before leaving the burst function and after a write memory barrier.
> > > > >
> > > > > Until this first WQE is not complete, the NIC won't start
> > > > > processing the packets.  Memory barriers per packets becomes
> useless.
> > > >
> > > > I think this is not true, Since mlx4 HW can prefetch advanced
> > > > TXbbs if their first 4 bytes are valid in spite of the first WQE
> > > > is still not valid (please
> > > read the spec).
> > >
> > > A compiler barrier is enough on x86 to forbid the CPU to re-order
> > > the instructions, on arm you need a memory barrier, there is a macro
> > > in DPDK for that, rte_io_wmb().
> > >
> > We are also using compiler barrier here.
> >
> > > Before triggering the doorbell you must flush the case, this is the
> > > only place where the rte_wmb() should be used.
> > >
> >
> > We are also using memory barrier only for this reason.
> >
> > > > > It gives something like:
> > > > >
> > > > >  uint32_t tx_bb_db = 0;
> > > > >  void *first_wqe = NULL;
> > > > >
> > > > >  /*
> > > > >   * Prepare all Packets by writing the WQEs without the 4 first bytes of
> > > > >   * the first WQE.
> > > > >   */
> > > > >  for () {
> > > > >  	if (!wqe) {
> > > > > 		first_wqe = wqe;
> > > > > 		tx_bb_db = foo;
> > > > > 	}
> > > > >  }
> > > > >  /* Leaving. */
> > > > >  rte_wmb();
> > > > >  *(uin32_t*)wqe = tx_bb_db;
> > > > >  return n;
> > > > >
> > > >
> > > > I will take care to check if we can do 2 loops:
> > > > Write all  last 60B per TXbb.
> > > > Memory barrier.
> > > > Write all first 4B per TXbbs.
> > > >
> > > > > > The last 60 bytes of any TXWBB can be written in any order
> > > > > > (before writing the first 4 bytes).
> > > > > > Is your last statement (using lonely barrier) is in accordance
> > > > > > with this limitation? Please explain.
> > > > > >
> > > > > > > There is also too many cases handled which are useless in
> > > > > > > bursts
> > > > > situation,
> > > > > > > this function needs to be re-written to its minimal use case i.e.
> > > > > processing a
> > > > > > > valid burst of packets/segments and triggering at the end of
> > > > > > > the burst the
> > > > > Tx
> > > > > > > doorbell.
> > > > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > --
> > > > > Nélio Laranjeiro
> > > > > 6WIND
> > >
> > > Regards,
> > >
> > > --
> > > Nélio Laranjeiro
> > > 6WIND
> 
> --
> Nélio Laranjeiro
> 6WIND
  

Patch

diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
index 4a77be8..014a6d3 100644
--- a/drivers/net/mlx4/mlx4_rxtx.c
+++ b/drivers/net/mlx4/mlx4_rxtx.c
@@ -267,183 +267,6 @@  rte_be32_t mlx4_txq_add_mr(struct txq *txq, struct rte_mempool *mp, uint32_t i)
 }
 
 /**
- * Posts a single work request to a send queue.
- *
- * @param txq
- *   Target Tx queue.
- * @param pkt
- *   Packet to transmit.
- *
- * @return
- *   0 on success, negative errno value otherwise and rte_errno is set.
- */
-static inline int
-mlx4_post_send(struct txq *txq, struct rte_mbuf *pkt)
-{
-	struct mlx4_wqe_ctrl_seg *ctrl;
-	struct mlx4_wqe_data_seg *dseg;
-	struct mlx4_sq *sq = &txq->msq;
-	struct rte_mbuf *buf;
-	union {
-		uint32_t flags;
-		uint16_t flags16[2];
-	} srcrb;
-	uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
-	uintptr_t addr;
-	uint32_t owner_opcode = MLX4_OPCODE_SEND;
-	uint32_t byte_count;
-	int wqe_real_size;
-	int nr_txbbs;
-	struct pv *pv = (struct pv *)txq->bounce_buf;
-	int pv_counter = 0;
-
-	/* Calculate the needed work queue entry size for this packet. */
-	wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
-			pkt->nb_segs * sizeof(struct mlx4_wqe_data_seg);
-	nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
-	/*
-	 * Check that there is room for this WQE in the send queue and that
-	 * the WQE size is legal.
-	 */
-	if (((sq->head - sq->tail) + nr_txbbs +
-	     sq->headroom_txbbs) >= sq->txbb_cnt ||
-	    nr_txbbs > MLX4_MAX_WQE_TXBBS) {
-		return -ENOSPC;
-	}
-	/* Get the control and data entries of the WQE. */
-	ctrl = (struct mlx4_wqe_ctrl_seg *)mlx4_get_send_wqe(sq, head_idx);
-	dseg = (struct mlx4_wqe_data_seg *)((uintptr_t)ctrl +
-					    sizeof(struct mlx4_wqe_ctrl_seg));
-	/* Fill the data segments with buffer information. */
-	for (buf = pkt; buf != NULL; buf = buf->next, dseg++) {
-		addr = rte_pktmbuf_mtod(buf, uintptr_t);
-		rte_prefetch0((volatile void *)addr);
-		/* Handle WQE wraparound. */
-		if (unlikely(dseg >= (struct mlx4_wqe_data_seg *)sq->eob))
-			dseg = (struct mlx4_wqe_data_seg *)sq->buf;
-		dseg->addr = rte_cpu_to_be_64(addr);
-		/* Memory region key (big endian) for this memory pool. */
-		dseg->lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(buf));
-#ifndef NDEBUG
-		if (unlikely(dseg->lkey == rte_cpu_to_be_32((uint32_t)-1))) {
-			/* MR does not exist. */
-			DEBUG("%p: unable to get MP <-> MR association",
-			      (void *)txq);
-			/*
-			 * Restamp entry in case of failure.
-			 * Make sure that size is written correctly
-			 * Note that we give ownership to the SW, not the HW.
-			 */
-			ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-			mlx4_txq_stamp_freed_wqe(sq, head_idx,
-				     (sq->head & sq->txbb_cnt) ? 0 : 1);
-			return -EFAULT;
-		}
-#endif /* NDEBUG */
-		if (likely(buf->data_len)) {
-			byte_count = rte_cpu_to_be_32(buf->data_len);
-		} else {
-			/*
-			 * Zero length segment is treated as inline segment
-			 * with zero data.
-			 */
-			byte_count = RTE_BE32(0x80000000);
-		}
-		/*
-		 * If the data segment is not at the beginning of a
-		 * Tx basic block (TXBB) then write the byte count,
-		 * else postpone the writing to just before updating the
-		 * control segment.
-		 */
-		if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
-			/*
-			 * Need a barrier here before writing the byte_count
-			 * fields to make sure that all the data is visible
-			 * before the byte_count field is set.
-			 * Otherwise, if the segment begins a new cacheline,
-			 * the HCA prefetcher could grab the 64-byte chunk and
-			 * get a valid (!= 0xffffffff) byte count but stale
-			 * data, and end up sending the wrong data.
-			 */
-			rte_io_wmb();
-			dseg->byte_count = byte_count;
-		} else {
-			/*
-			 * This data segment starts at the beginning of a new
-			 * TXBB, so we need to postpone its byte_count writing
-			 * for later.
-			 */
-			pv[pv_counter].dseg = dseg;
-			pv[pv_counter++].val = byte_count;
-		}
-	}
-	/* Write the first DWORD of each TXBB save earlier. */
-	if (pv_counter) {
-		/* Need a barrier here before writing the byte_count. */
-		rte_io_wmb();
-		for (--pv_counter; pv_counter  >= 0; pv_counter--)
-			pv[pv_counter].dseg->byte_count = pv[pv_counter].val;
-	}
-	/* Fill the control parameters for this packet. */
-	ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
-	/*
-	 * For raw Ethernet, the SOLICIT flag is used to indicate that no ICRC
-	 * should be calculated.
-	 */
-	txq->elts_comp_cd -= nr_txbbs;
-	if (unlikely(txq->elts_comp_cd <= 0)) {
-		txq->elts_comp_cd = txq->elts_comp_cd_init;
-		srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
-				       MLX4_WQE_CTRL_CQ_UPDATE);
-	} else {
-		srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
-	}
-	/* Enable HW checksum offload if requested */
-	if (txq->csum &&
-	    (pkt->ol_flags &
-	     (PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM))) {
-		const uint64_t is_tunneled = (pkt->ol_flags &
-					      (PKT_TX_TUNNEL_GRE |
-					       PKT_TX_TUNNEL_VXLAN));
-
-		if (is_tunneled && txq->csum_l2tun) {
-			owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
-					MLX4_WQE_CTRL_IL4_HDR_CSUM;
-			if (pkt->ol_flags & PKT_TX_OUTER_IP_CKSUM)
-				srcrb.flags |=
-					RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM);
-		} else {
-			srcrb.flags |= RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
-						MLX4_WQE_CTRL_TCP_UDP_CSUM);
-		}
-	}
-	if (txq->lb) {
-		/*
-		 * Copy destination MAC address to the WQE, this allows
-		 * loopback in eSwitch, so that VFs and PF can communicate
-		 * with each other.
-		 */
-		srcrb.flags16[0] = *(rte_pktmbuf_mtod(pkt, uint16_t *));
-		ctrl->imm = *(rte_pktmbuf_mtod_offset(pkt, uint32_t *,
-						      sizeof(uint16_t)));
-	} else {
-		ctrl->imm = 0;
-	}
-	ctrl->srcrb_flags = srcrb.flags;
-	/*
-	 * Make sure descriptor is fully written before
-	 * setting ownership bit (because HW can start
-	 * executing as soon as we do).
-	 */
-	rte_wmb();
-	ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
-					      ((sq->head & sq->txbb_cnt) ?
-					       MLX4_BIT_WQE_OWN : 0));
-	sq->head += nr_txbbs;
-	return 0;
-}
-
-/**
  * DPDK callback for Tx.
  *
  * @param dpdk_txq
@@ -466,7 +289,8 @@  mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 	unsigned int bytes_sent = 0;
 	unsigned int i;
 	unsigned int max;
-	int err;
+	struct mlx4_sq *sq = &txq->msq;
+	struct pv *pv = (struct pv *)txq->bounce_buf;
 
 	assert(txq->elts_comp_cd != 0);
 	mlx4_txq_complete(txq);
@@ -485,6 +309,20 @@  mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			(((elts_head + 1) == elts_n) ? 0 : elts_head + 1);
 		struct txq_elt *elt_next = &(*txq->elts)[elts_head_next];
 		struct txq_elt *elt = &(*txq->elts)[elts_head];
+		uint32_t owner_opcode = MLX4_OPCODE_SEND;
+		struct mlx4_wqe_ctrl_seg *ctrl;
+		struct mlx4_wqe_data_seg *dseg;
+		struct rte_mbuf *sbuf;
+		union {
+			uint32_t flags;
+			uint16_t flags16[2];
+		} srcrb;
+		uint32_t head_idx = sq->head & sq->txbb_cnt_mask;
+		uintptr_t addr;
+		uint32_t byte_count;
+		int wqe_real_size;
+		int nr_txbbs;
+		int pv_counter = 0;
 
 		/* Clean up old buffer. */
 		if (likely(elt->buf != NULL)) {
@@ -503,18 +341,167 @@  mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
 			} while (tmp != NULL);
 		}
 		RTE_MBUF_PREFETCH_TO_FREE(elt_next->buf);
-		/* Post the packet for sending. */
-		err = mlx4_post_send(txq, buf);
-		if (unlikely(err)) {
+
+		/*
+		 * Calculate the needed work queue entry size
+		 * for this packet.
+		 */
+		wqe_real_size = sizeof(struct mlx4_wqe_ctrl_seg) +
+				buf->nb_segs * sizeof(struct mlx4_wqe_data_seg);
+		nr_txbbs = MLX4_SIZE_TO_TXBBS(wqe_real_size);
+		/*
+		 * Check that there is room for this WQE in the send
+		 * queue and that the WQE size is legal.
+		 */
+		if (((sq->head - sq->tail) + nr_txbbs +
+		     sq->headroom_txbbs) >= sq->txbb_cnt ||
+		    nr_txbbs > MLX4_MAX_WQE_TXBBS) {
 			elt->buf = NULL;
-			goto stop;
+			break;
+		}
+		/* Get the control and data entries of the WQE. */
+		ctrl = (struct mlx4_wqe_ctrl_seg *)
+				mlx4_get_send_wqe(sq, head_idx);
+		dseg = (struct mlx4_wqe_data_seg *)((uintptr_t)ctrl +
+				sizeof(struct mlx4_wqe_ctrl_seg));
+		/* Fill the data segments with buffer information. */
+		for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
+			addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
+			rte_prefetch0((volatile void *)addr);
+			/* Handle WQE wraparound. */
+			if (unlikely(dseg >=
+			    (struct mlx4_wqe_data_seg *)sq->eob))
+				dseg = (struct mlx4_wqe_data_seg *)sq->buf;
+			dseg->addr = rte_cpu_to_be_64(addr);
+			/* Memory region key (big endian). */
+			dseg->lkey = mlx4_txq_mp2mr(txq, mlx4_txq_mb2mp(sbuf));
+	#ifndef NDEBUG
+			if (unlikely(dseg->lkey ==
+				rte_cpu_to_be_32((uint32_t)-1))) {
+				/* MR does not exist. */
+				DEBUG("%p: unable to get MP <-> MR association",
+				      (void *)txq);
+				/*
+				 * Restamp entry in case of failure.
+				 * Make sure that size is written correctly
+				 * Note that we give ownership to the SW,
+				 * not the HW.
+				 */
+				ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+				mlx4_txq_stamp_freed_wqe(sq, head_idx,
+					     (sq->head & sq->txbb_cnt) ? 0 : 1);
+				elt->buf = NULL;
+				break;
+			}
+	#endif /* NDEBUG */
+			if (likely(sbuf->data_len)) {
+				byte_count = rte_cpu_to_be_32(sbuf->data_len);
+			} else {
+				/*
+				 * Zero length segment is treated as inline
+				 * segment with zero data.
+				 */
+				byte_count = RTE_BE32(0x80000000);
+			}
+			/*
+			 * If the data segment is not at the beginning
+			 * of a Tx basic block (TXBB) then write the
+			 * byte count, else postpone the writing to
+			 * just before updating the control segment.
+			 */
+			if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
+				/*
+				 * Need a barrier here before writing the
+				 * byte_count fields to make sure that all the
+				 * data is visible before the byte_count field
+				 * is set. otherwise, if the segment begins a
+				 * new cacheline, the HCA prefetcher could grab
+				 * the 64-byte chunk and get a valid
+				 * (!= 0xffffffff) byte count but stale data,
+				 * and end up sending the wrong data.
+				 */
+				rte_io_wmb();
+				dseg->byte_count = byte_count;
+			} else {
+				/*
+				 * This data segment starts at the beginning of
+				 * a new TXBB, so we need to postpone its
+				 * byte_count writing for later.
+				 */
+				pv[pv_counter].dseg = dseg;
+				pv[pv_counter++].val = byte_count;
+			}
+		}
+		/* Write the first DWORD of each TXBB save earlier. */
+		if (pv_counter) {
+			/* Need a barrier before writing the byte_count. */
+			rte_io_wmb();
+			for (--pv_counter; pv_counter  >= 0; pv_counter--)
+				pv[pv_counter].dseg->byte_count =
+						pv[pv_counter].val;
+		}
+		/* Fill the control parameters for this packet. */
+		ctrl->fence_size = (wqe_real_size >> 4) & 0x3f;
+		/*
+		 * For raw Ethernet, the SOLICIT flag is used to indicate
+		 * that no ICRC should be calculated.
+		 */
+		txq->elts_comp_cd -= nr_txbbs;
+		if (unlikely(txq->elts_comp_cd <= 0)) {
+			txq->elts_comp_cd = txq->elts_comp_cd_init;
+			srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT |
+					       MLX4_WQE_CTRL_CQ_UPDATE);
+		} else {
+			srcrb.flags = RTE_BE32(MLX4_WQE_CTRL_SOLICIT);
 		}
+		/* Enable HW checksum offload if requested */
+		if (txq->csum &&
+		    (buf->ol_flags &
+		     (PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM))) {
+			const uint64_t is_tunneled = (buf->ol_flags &
+						      (PKT_TX_TUNNEL_GRE |
+						       PKT_TX_TUNNEL_VXLAN));
+
+			if (is_tunneled && txq->csum_l2tun) {
+				owner_opcode |= MLX4_WQE_CTRL_IIP_HDR_CSUM |
+						MLX4_WQE_CTRL_IL4_HDR_CSUM;
+				if (buf->ol_flags & PKT_TX_OUTER_IP_CKSUM)
+					srcrb.flags |=
+					    RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM);
+			} else {
+				srcrb.flags |=
+					RTE_BE32(MLX4_WQE_CTRL_IP_HDR_CSUM |
+						MLX4_WQE_CTRL_TCP_UDP_CSUM);
+			}
+		}
+		if (txq->lb) {
+			/*
+			 * Copy destination MAC address to the WQE, this allows
+			 * loopback in eSwitch, so that VFs and PF can
+			 * communicate with each other.
+			 */
+			srcrb.flags16[0] = *(rte_pktmbuf_mtod(buf, uint16_t *));
+			ctrl->imm = *(rte_pktmbuf_mtod_offset(buf, uint32_t *,
+					      sizeof(uint16_t)));
+		} else {
+			ctrl->imm = 0;
+		}
+		ctrl->srcrb_flags = srcrb.flags;
+		/*
+		 * Make sure descriptor is fully written before
+		 * setting ownership bit (because HW can start
+		 * executing as soon as we do).
+		 */
+		rte_wmb();
+		ctrl->owner_opcode = rte_cpu_to_be_32(owner_opcode |
+					      ((sq->head & sq->txbb_cnt) ?
+						       MLX4_BIT_WQE_OWN : 0));
+		sq->head += nr_txbbs;
 		elt->buf = buf;
 		bytes_sent += buf->pkt_len;
 		++elts_comp;
 		elts_head = elts_head_next;
 	}
-stop:
 	/* Take a shortcut if nothing must be sent. */
 	if (unlikely(i == 0))
 		return 0;