[dpdk-dev] URGENT please help. Issue on ixgbe_tx_free_bufs version 2.0.0

Ariel Rodriguez arodriguez at callistech.com
Sun Nov 15 23:58:27 CET 2015


Hi Bruce, I'm going to list the results of the tests.

I will start with the second hint you proposed:

2) I upgraded our custom DPDK application to the latest DPDK code (2.1.0)
and the issue is still there.

1) I tested the load balancer app with the latest DPDK code (2.1.0) on the
82599ES 10-Gigabit SFI/SFP+ NIC with tapped traffic, and the results are:

   a) It works fine after 6 hours of running. (For timing reasons I can't wait
longer, but the issue always happened within 5 hours of running, so I assume
we are fine in this test.)

   b) I made a change to the load balancer code so that the worker code
behaves like our DPDK application. The change simply gives the worker code
enough load (in terms of core cycles) that the rx core drops several packets
because the ring between the workers and the rx core is full. (Our
application drops several packets because the worker code is not fast
enough.) A sketch of this change is shown below.

       In this last test, the segmentation fault arose, at exactly the same
line that I previously reported.
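
For reference, the load I added to the load balancer workers is conceptually
like the sketch below. This is illustrative only: the helper name and the
cycle budget are made up for this mail, not the exact patch; the idea is
simply to burn CPU cycles per packet so the worker ring fills up and the rx
core starts dropping, as in our real application.

    #include <rte_cycles.h>   /* rte_rdtsc() */
    #include <rte_atomic.h>   /* rte_pause() */

    /* Burn roughly 'budget' TSC cycles for each packet handled by a worker. */
    static inline void
    worker_burn_cycles(uint64_t budget)
    {
            uint64_t start = rte_rdtsc();

            while (rte_rdtsc() - start < budget)
                    rte_pause();
    }

The worker loop then calls worker_burn_cycles() once per dequeued mbuf before
forwarding it to the tx ring.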

Debugging and reading the code in ixgbe_rxtx.c, I see some weird things.

  - The core dump of the issue is always around line 260 of the
ixgbe_rxtx.c code.
  - Looking at the function "ixgbe_tx_free_bufs" at line 132, I understand
there is a check for the RS-bit write-back mechanism: the
IXGBE_ADVTXD_STAT_DD bit is tested, and then the code indexes into the
sw_ring of the tx queue to get an ixgbe_tx_entry (variable name txep).
(See the sketch after this list.)

  - The txep->mbuf entry is totally corrupted because it holds an invalid
memory address; I compared that address with the mbuf mempool range and it is
not even close to being valid. However, the address of the ixgbe_tx_entry
itself is valid and within the range of the zmalloc'd software ring structure
constructed at initialization.

 - The txep pointer is the first one in the sw_ring, because
txq->tx_next_dd is 31 and txq->tx_rs_thresh is 32:
txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);

 - txq->tx_rs_thresh is 32. I use the default values, just passing NULL to
the corresponding *_queue_setup functions.

 - The weirdest thing is that the next entry in the software ring (the next
ixgbe_tx_entry) is valid and has a valid mbuf memory address.
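
To make the observations above concrete, here is an abridged sketch of the
free path as I read it (paraphrased from the 2.x ixgbe_rxtx.c, not a verbatim
copy). The DD bit of the descriptor at tx_next_dd is tested first, and only
then is the index of the first software-ring entry to free computed. With
tx_next_dd = 31 and tx_rs_thresh = 32 that index is 31 - (32 - 1) = 0, i.e.
exactly the sw_ring entry that holds the corrupted mbuf pointer:

    /* Abridged and paraphrased from ixgbe_tx_free_bufs(); not verbatim. */
    static inline int
    ixgbe_tx_free_bufs(struct ixgbe_tx_queue *txq)
    {
            struct ixgbe_tx_entry *txep;
            uint32_t status;

            /* Check the DD bit of the descriptor at the RS threshold. */
            status = txq->tx_ring[txq->tx_next_dd].wb.status;
            if (!(status & IXGBE_ADVTXD_STAT_DD))
                    return 0;

            /* First sw_ring entry to free: 31 - (32 - 1) = 0 in our case. */
            txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);

            /* rte_pktmbuf_free_seg(txep->mbuf) is then called for each of the
             * tx_rs_thresh entries; this is where the crash happens, because
             * txep->mbuf holds a garbage address. */
            ...
    }

(The tx_rs_thresh of 32 is the driver default, since we pass NULL as the
txconf argument to rte_eth_tx_queue_setup().)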

I don't know how to continue, because I'm trying to find out where I could be
corrupting the mbuf associated with the ixgbe_tx_entry. I debugged and tested
every part of the worker core code, trying to find a bad mbuf or mbuf
corruption before the enqueue on the tx ring. The tx core and the rx core are
exactly the same as the ones in the load balancer example (this applies to
our application); no issue there. If there were corruption of the mbuf in the
worker code, the segmentation fault would have to happen before the tx queue
ring enqueue. (I test several fields of the mbuf before enqueueing it: the
->port field, ->data_len, etc.)
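
For completeness, the per-mbuf checks I run in the worker before enqueueing
to the tx ring look roughly like the sketch below. The helper name and the
specific fields are just for illustration; rte_mbuf_sanity_check() is the
library call that panics with a reason if the mbuf header is obviously
corrupted.

    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    /* Illustrative pre-enqueue check; 'pool' is the mbuf mempool that all
     * packets are expected to come from. Returns 0 if something looks wrong. */
    static inline int
    mbuf_looks_sane(struct rte_mbuf *m, struct rte_mempool *pool)
    {
            if (m == NULL || m->pool != pool)
                    return 0;
            if (m->data_len > m->buf_len)
                    return 0;

            /* Panics (printing the reason) if the mbuf header is corrupted. */
            rte_mbuf_sanity_check(m, 1);
            return 1;
    }

None of these checks ever fire in the worker, yet the mbuf pointer recovered
later from the sw_ring is garbage.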

In the second test with the load balancer code I could not see a relationship
between the packets dropped in the rx core and the mbuf corruption in the
ixgbe_tx_entry.


Waiting for some advice...

Regards

Ariel Horacio Rodriguez.

On Tue, Nov 10, 2015 at 8:50 AM, Ariel Rodriguez <arodriguez at callistech.com>
wrote:

> Thank you very much for your rapid response.
>
> 1) The I/O part is the same as the load balancer. The worker part is
> different. The tx part also uses the QoS scheduler framework. I will try to
> run the example and see what happens.
>
> 2) Yes, I can. I will do that too.
>
> The NIC is an 82599ES 10-Gigabit SFI/SFP+ with tapped traffic (it is a
> hardware bypass device from the vendor Silicom).
>
> I developed a similar app without the tx part. It just receives a copy of
> the traffic (around 6 Gbps and 400,000 concurrent flows) and then frees the
> mbufs. It works like a charm.
>
> This issue is strange... If I disable the QoS scheduler code and make the
> tx code drop all packets instead of calling rte_eth_tx_burst (which is like
> disabling the tx core), the issue happens in rte_eth_rx_burst, which
> returns a corrupted mbuf (rx core).
>
> Could the NIC be behaving abnormally?
>
> I will try the two things you mentioned before.
>
> Regards .
>
> Ariel Horacio Rodriguez
> On Tue, Nov 10, 2015 at 01:35:21AM -0300, Ariel Rodriguez wrote:
> > Dear dpdk experts.
> >
> > I'm having a recurrent segmentation fault in the
> > function ixgbe_tx_free_bufs (ixgbe_rxtx.c:150) (I enabled -g3 -O0).
> >
> > Examining the core dump, I found this:
> >
> > txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);
> >
> > txq->tx_next_dd = 31
> > txq->tx_rs_thresh = 32
> >
> > Obviously txep points to the first element, but
> >
> > *(txep).mbuf == INVALID MBUF ADDRESS
> >
> > The same applies to
> >
> > *(txep+1).mbuf ; *(txep +2).mbuf;*(txep+3).mbuf
> >
> > from *(txep+4).mbuf to *(txep+31).mbuf seem to be valid because I'm able
> > to dereference the mbufs.
> >
> >
> > Note:
> >
> > I disabled CONFIG_RTE_IXGBE_INC_VECTOR because I got similar behavior; I
> > thought the problem would disappear by disabling that feature.
> >
> >
> > The program always runs well for 4 or 5 hours and then crashes... always
> > at the same line.
> >
> > this is the backtrace of the program:
> >
> > #0  0x0000000000677a64 in rte_atomic16_read (v=0x47dc14c18b14) at /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/generic/rte_atomic.h:151
> > #1  0x0000000000677c1d in rte_mbuf_refcnt_read (m=0x47dc14c18b00) at /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:411
> > #2  0x000000000067a13c in __rte_pktmbuf_prefree_seg (m=0x47dc14c18b00) at /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:778
> > #3  rte_pktmbuf_free_seg (m=0x47dc14c18b00) at /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:810
> > #4  ixgbe_tx_free_bufs (txq=0x7ffb40ae52c0) at /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:150
> > #5  tx_xmit_pkts (tx_queue=0x7ffb40ae52c0, tx_pkts=0x64534770 <app+290608>, nb_pkts=32) at /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:256
> > #6  0x000000000067c6f3 in ixgbe_xmit_pkts_simple (tx_queue=0x7ffb40ae52c0, tx_pkts=0x64534570 <app+290096>, nb_pkts=80) at /opt/dpdk-2.0.0/lib/librte_pmd_ixgbe/ixgbe_rxtx.c:343
> > #7  0x00000000004ec93d in rte_eth_tx_burst (port_id=1 '\001', queue_id=0, tx_pkts=0x64534570 <app+290096>, nb_pkts=144) at /opt/dpdk-2.0.0/x86_64-native-linuxapp-gcc/include/rte_ethdev.h:2572
> >
> >
> Hi,
>
> I'd like a bit more information to help debug your problem:
> * what application are you running when you see this crash? If it's an app
> of your
> own making, can you reproduce the crash using one of the standard DPDK
> apps, or
> example apps, e.g. testpmd, l2fwd, etc.
>
> * Can you also try to verify if the crash occurs with the latest DPDK code
> available in git from dpdk.org?
>
> Regards,
> /Bruce
>

