[dpdk-users] bonding driver LACP mode issues

Alex Kiselev kiselev99 at gmail.com
Mon Jul 10 14:37:58 CEST 2017


Hello.

I've managed to gather more information about my problem, and it looks
like I have pinpointed its source when my LACP bond port stops
forwarding packets.

At first, I thought that the cause of the problem was the LACP
protocol. But turning on RTE_LIBRTE_BOND_DEBUG_8023AD showed that
both the switch ports (21, 22) and my app's bond ports (0, 1) are
perfectly synchronized:


on the switch:
xx # sho lacp lag 21

Lag   Actor    Actor  Partner           Partner  Partner Agg   Actor
      Sys-Pri  Key    MAC               Sys-Pri  Key     Count MAC
--------------------------------------------------------------------------------
21          0  0x03fd 00:e0:ed:7b:ce:08   65535  0x0021      2 00:04:96:83:6d:2f

Port list:

Member     Port      Rx           Sel          Mux            Actor     Partner
Port       Priority  State        Logic        State          Flags     Port
--------------------------------------------------------------------------------
21         0         Current      Selected     Collect-Dist   A-GSCD--  1
22         0         Current      Selected     Collect-Dist   A-GSCD--  2
================================================================================
Actor Flags: A-Activity, T-Timeout, G-Aggregation, S-Synchronization
             C-Collecting, D-Distributing, F-Defaulted, E-Expired
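
To compare the switch's flag column with the state={ ... } lists in the PMD dump, the state octet can be rendered in the same letter format. This is only a minimal sketch: it assumes the standard LACP state bit order (Activity in bit 0 up to Expired in bit 7), and lacp_flags_str is a made-up helper name, not part of DPDK or of the switch software.

```c
#include <stdint.h>

/*
 * Render an LACP state octet in the switch's flag format, e.g. "A-GSCD--".
 * Assumed bit order (bit 0 first): Activity, Timeout, Aggregation,
 * Synchronization, Collecting, Distributing, Defaulted, Expired.
 * buf must have room for 8 characters plus the terminating NUL.
 */
static void lacp_flags_str(uint8_t state, char buf[9])
{
    static const char letters[8] = { 'A', 'T', 'G', 'S', 'C', 'D', 'F', 'E' };
    int i;

    for (i = 0; i < 8; i++)
        buf[i] = (state & (1u << i)) ? letters[i] : '-';
    buf[8] = '\0';
}
```

With ACT, AGG, SYNC, COL and DIST set (0x01 | 0x04 | 0x08 | 0x10 | 0x20 = 0x3D), the helper yields "A-GSCD--", which is what the switch shows for both member ports.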



Jul 10 16:38:31 xxx the_router.lag[22009]: PMD: 250434656 [Port 0:
tx_machine] sending LACP frame
Jul 10 16:38:31 xxx the_router.lag[22009]: PMD: LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0100
       state={ ACT AGG SYNC COL DIST }
  }
  partner={ tlv=02, len=14
    pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FD03
       state={ ACT AGG SYNC COL DIST }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}
Jul 10 16:38:33 bizin the_router.lag[22009]: PMD: 250436556 [Port 0:
rx_machine] LACP -> CURRENT
Jul 10 16:38:33 bizin the_router.lag[22009]: PMD: LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FD03
       state={ ACT AGG SYNC COL DIST }
  }
  partner={ tlv=02, len=14
    pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0100
       state={ ACT AGG SYNC COL DIST }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}



Jul 10 16:40:24 bizin the_router.lag[22009]: PMD: 250547261 [Port 1:
tx_machine] sending LACP frame
Jul 10 16:40:24 bizin the_router.lag[22009]: PMD: LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0200
       state={ ACT AGG SYNC COL DIST }
  }
  partner={ tlv=02, len=14
    pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FE03
       state={ ACT AGG SYNC COL DIST }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}
Jul 10 16:40:28 bizin the_router.lag[22009]: PMD: 250551162 [Port 1:
rx_machine] LACP -> CURRENT
Jul 10 16:40:28 bizin the_router.lag[22009]: PMD: LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FE03
       state={ ACT AGG SYNC COL DIST }
  }
  partner={ tlv=02, len=14
    pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0200
       state={ ACT AGG SYNC COL DIST }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}
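
Note that the 16-bit fields in the PMD dump read byte-swapped compared to the switch output: key=2100 in the dump against Partner Key 0x0021 on the switch, and key=FD03 against Actor Key 0x03fd. I assume the debug print shows these fields as they are stored in the frame (network byte order) on a little-endian host, without converting them. Under that assumption, a small sketch for converting the dumped values (swap16 is a hypothetical helper name):

```c
#include <stdint.h>

/*
 * Swap the two bytes of a 16-bit value. On a little-endian host this is
 * what ntohs() does, so it turns a dumped field such as key=2100 into the
 * value the switch displays (0x0021).
 */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

Applied to the dumps above, swap16(0x2100) gives 0x0021 and swap16(0xFD03) gives 0x03FD, matching the keys on the switch, and p_num=0100 and p_num=0200 become 1 and 2, matching the Partner Port column.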

Then I started looking at TX errors and noticed that in some
cases (I send ICMP echo request packets and expect my app to send
replies back) all reply packets are dropped because
rte_eth_tx_burst reports that none of the packets were sent, and in the
rest of the cases I receive all ICMP replies with zero packet loss.

rte_eth_stats_get also reports that no packets are transmitted on slave
ports 0 and 1 when I am not receiving echo replies.

So it looks like one bonding slave port fails to send packets while the
other slave port has no problem sending.

At the same time, both bonding ports have no problem sending LACPDU packets.

I am not sure whether both slave ports receive packets normally, as the
switch sends all test ICMP streams from the same port.

Also, rte_eth_bond_slaves_get and rte_eth_bond_active_slaves_get
report that the bonding port has 2 slaves, which is correct: the
bond port is created with 2 slaves.

xxx ~ # rcli sh port bond stat 3
bond port 3:
slaves: 0, 1
active slaves: 0, 1

Looking at the source code of the bonding driver has turned up nothing so far.

So, the question is: why, after some time of normal operation (last
time the app had been working for 4 days), does the bonding driver stop
sending packets?
Is there anything else I can do to troubleshoot this situation?

I would appreciate any help.
Thank you in advance.

-- 
Alex Kiselev

