[dpdk-users] bonding driver LACP mode issues
Alex Kiselev
kiselev99 at gmail.com
Mon Jul 10 14:37:58 CEST 2017
Hello.
I've managed to gather more information about my problem, and it looks
like I have pinpointed its source: my LACP bond port stops
forwarding packets.
At first I thought the cause of the problem was the LACP
protocol. But turning on RTE_LIBRTE_BOND_DEBUG_8023AD showed that
both the switch ports (21, 22) and my app's bond slave ports (0, 1) are
perfectly synchronized:
on the switch:
xx # sho lacp lag 21
Lag Actor Actor Partner Partner Partner Agg Actor
Sys-Pri Key MAC Sys-Pri Key Count MAC
--------------------------------------------------------------------------------
21 0 0x03fd 00:e0:ed:7b:ce:08 65535 0x0021 2 00:04:96:83:6d:2f
Port list:
Member Port Rx Sel Mux Actor Partner
Port Priority State Logic State Flags Port
--------------------------------------------------------------------------------
21 0 Current Selected Collect-Dist A-GSCD-- 1
22 0 Current Selected Collect-Dist A-GSCD-- 2
================================================================================
Actor Flags: A-Activity, T-Timeout, G-Aggregation, S-Synchronization
C-Collecting, D-Distributing, F-Defaulted, E-Expired
Jul 10 16:38:31 xxx the_router.lag[22009]: PMD: 250434656 [Port 0:
tx_machine] sending LACP frame
Jul 10 16:38:31 xxx the_router.lag[22009]: PMD: LACP: {
subtype= 01
ver_num=01
actor={ tlv=01, len=14
pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0100
state={ ACT AGG SYNC COL DIST }
}
partner={ tlv=02, len=14
pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FD03
state={ ACT AGG SYNC COL DIST }
}
collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}
Jul 10 16:38:33 bizin the_router.lag[22009]: PMD: 250436556 [Port 0:
rx_machine] LACP -> CURRENT
Jul 10 16:38:33 bizin the_router.lag[22009]: PMD: LACP: {
subtype= 01
ver_num=01
actor={ tlv=01, len=14
pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FD03
state={ ACT AGG SYNC COL DIST }
}
partner={ tlv=02, len=14
pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0100
state={ ACT AGG SYNC COL DIST }
}
collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}
Jul 10 16:40:24 bizin the_router.lag[22009]: PMD: 250547261 [Port 1:
tx_machine] sending LACP frame
Jul 10 16:40:24 bizin the_router.lag[22009]: PMD: LACP: {
subtype= 01
ver_num=01
actor={ tlv=01, len=14
pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0200
state={ ACT AGG SYNC COL DIST }
}
partner={ tlv=02, len=14
pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FE03
state={ ACT AGG SYNC COL DIST }
}
collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}
Jul 10 16:40:28 bizin the_router.lag[22009]: PMD: 250551162 [Port 1:
rx_machine] LACP -> CURRENT
Jul 10 16:40:28 bizin the_router.lag[22009]: PMD: LACP: {
subtype= 01
ver_num=01
actor={ tlv=01, len=14
pri=0000, system=00:04:96:83:6D:2F, key=FD03, p_pri=0000 p_num=FE03
state={ ACT AGG SYNC COL DIST }
}
partner={ tlv=02, len=14
pri=FFFF, system=00:E0:ED:7B:CE:08, key=2100, p_pri=FF00 p_num=0200
state={ ACT AGG SYNC COL DIST }
}
collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00}
Then I started looking at tx errors and noticed that in some
cases (I send ICMP echo request packets and expect my app to send
replies back) all reply packets are dropped because
rte_eth_tx_burst indicates that none of the packets were sent, while in the
remaining cases I receive all ICMP replies with zero packet loss.
rte_eth_stats_get also reports that no packets are transmitted on slave
ports 0 and 1 when I am not receiving echo replies.
So it looks like one bonding slave port fails to send packets while the
other slave port has no problem sending.
At the same time, both bonding slave ports have no problem sending LACPDU packets.
I am not sure whether both slave ports receive packets normally, since the
switch sends all test ICMP streams from the same port.
Also, rte_eth_bond_slaves_get and rte_eth_bond_active_slaves_get
report that the bond port has 2 slaves, which is correct: the
bond port is created with 2 slaves.
xxx ~ # rcli sh port bond stat 3
bond port 3:
slaves: 0, 1
active slaves: 0, 1
Reading the source code of the bonding driver has so far turned up nothing.
So the question is: why, after some time of normal operation (last
time the app had been running for 4 days), does the bonding driver stop
sending packets?
Are there any other things I can do to troubleshoot this situation?
I would appreciate any help.
Thank you in advance.
--
Alex Kiselev