Bug 483 - Bond 8023ad LACP handshake sometimes fails
Summary: Bond 8023ad LACP handshake sometimes fails
Status: UNCONFIRMED
Alias: None
Product: DPDK
Classification: Unclassified
Component: ethdev
Version: 19.11
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assignee: Markopeng
URL:
Depends on:
Blocks:
 
Reported: 2020-05-21 14:12 CEST by Markopeng
Modified: 2021-12-07 12:48 CET

Description Markopeng 2020-05-21 14:12:01 CEST
My bond has two slave ports, and the two hosts are connected through a switch.
I enabled the DPDK 802.3ad debug output by defining the macro RTE_LIBRTE_BOND_DEBUG_8023AD.


Port 0 MAC: ac:f9:70:88:f3:26
Port 1 MAC: ac:f9:70:88:f3:27
BOND MAC: ac:f9:70:88:f3:26

When tx_machine sends LACP frames with the Port 1 MAC (ac:f9:70:88:f3:27) as the actor system address, the handshake fails.


When the LACP handshake fails, the log looks like this:

----------

  997 [Port 0: rx_machine] -> INITIALIZE
   997 [Port 0: periodic_machine] -> NO_PERIODIC ( begind LACP active )
   997 [Port 0: mux_machine] -> DETACHED
   997 [Port 0: selection_logic] -> SELECTED: ID=  1
	aggregator found aggregator ID=  1
   997 [Port 0: mux_machine] DETACHED -> WAITING

1995 [Port 1: tx_machine] Sending LACP frame
bond_print_lacp(122) - LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=FFFF, system=AC:F9:70:88:F3:27, key=2100, p_pri=FF00 p_num=0200
       state={ ACT AGG DEF EXP }
  }
  partner={ tlv=02, len=14
    pri=FFFF, system=00:00:00:00:00:00, key=0100, p_pri=FF00 p_num=0000
       state={ ACT TIMEOUT AGG }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00 }
  1995 [Port 0: tx_machine] Sending LACP frame
bond_print_lacp(122) - LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=FFFF, system=AC:F9:70:88:F3:27, key=2100, p_pri=FF00 p_num=0100
       state={ ACT AGG DEF EXP }
  }
  partner={ tlv=02, len=14
    pri=FFFF, system=00:00:00:00:00:00, key=0100, p_pri=FF00 p_num=0000
       state={ ACT TIMEOUT AGG }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00 }
  2095 [Port 1: mux_machine] ATTACHED Entered
  2594 [Port 1: tx_machine] Sending LACP frame
----------

When the LACP handshake succeeds, the log looks like this:
----------

     0 [Port 0: rx_machine] -> INITIALIZE
     0 [Port 0: periodic_machine] -> NO_PERIODIC ( begind LACP active )
     0 [Port 0: mux_machine] -> DETACHED
    99 [Port 0: mux_machine] DETACHED -> WAITING
Waiting for slaves to become active...
Port 2 MAC: ac:f9:70:88:f3:26
   236 [Port 1: rx_machine] -> INITIALIZE
   236 [Port 1: periodic_machine] -> NO_PERIODIC ( begind LACP active )
   236 [Port 1: mux_machine] -> DETACHED
   236 [Port 1: selection_logic] -> SELECTED: ID=  0
	aggregator found aggregator ID=  0
   236 [Port 1: mux_machine] DETACHED -> WAITING
  1034 [Port 0: tx_machine] Sending LACP frame

1034 [Port 0: tx_machine] Sending LACP frame
bond_print_lacp(122) - LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=FFFF, system=AC:F9:70:88:F3:26, key=2100, p_pri=FF00 p_num=0100
       state={ ACT AGG DEF EXP }
  }
  partner={ tlv=02, len=14
    pri=FFFF, system=00:00:00:00:00:00, key=0100, p_pri=FF00 p_num=0000
       state={ ACT TIMEOUT AGG }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00 }
  1234 [Port 1: tx_machine] Sending LACP frame
bond_print_lacp(122) - LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=FFFF, system=AC:F9:70:88:F3:26, key=2100, p_pri=FF00 p_num=0200
       state={ ACT AGG DEF EXP }
  }
  partner={ tlv=02, len=14
    pri=FFFF, system=00:00:00:00:00:00, key=0100, p_pri=FF00 p_num=0000
       state={ ACT TIMEOUT AGG }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00 }
  2032 [Port 0: tx_machine] Sending LACP frame

2332 [Port 1: rx_machine] LACP -> CURRENT
bond_print_lacp(122) - LACP: {
  subtype= 01
  ver_num=01
  actor={ tlv=01, len=14
    pri=0080, system=F8:98:EF:69:83:91, key=417F, p_pri=0080 p_num=0600
       state={ ACT TIMEOUT AGG }
  }
  partner={ tlv=02, len=14
    pri=FFFF, system=AC:F9:70:88:F3:26, key=2100, p_pri=FF00 p_num=0200
       state={ ACT AGG DEF EXP }
  }
  collector={info=03, length=10, max_delay=0000
, type_term=00, terminator_length = 00 }
  2332 [Port 1: mux_machine] ATTACHED Entered

----------

From what I observed, whenever the log prints "SELECTED: ID=  1", the wrong MAC address is used to send LACP frames.

The selection_logic function chooses the wrong aggregator_port_id here:

rte_eth_bond_8023ad.c:749

	case AGG_STABLE:
		if (default_slave == slaves_count)
			new_agg_id = slaves[slave_id];
		else
			new_agg_id = slaves[default_slave]; // sometimes new_agg_id will be 1

Why does the LACP handshake succeed sometimes?

The "slaves" array is filled in a non-deterministic order by the function "activate_slave".
When port 0 occupies slaves[0], it works correctly.
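
The order dependence described above can be shown with a minimal standalone sketch. It is not the DPDK code: agg_by_position and agg_by_lowest_port_id are hypothetical helpers, and default_slave is assumed to be index 0 of the active-slave array. The first helper mirrors the slaves[default_slave] lookup, and the second picks the lowest port ID regardless of the order in which the array was filled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mirrors the slaves[default_slave] lookup.
 * The result depends on the order in which ports were added to the array. */
static uint16_t
agg_by_position(const uint16_t *slaves, uint16_t default_slave)
{
	return slaves[default_slave];
}

/* Hypothetical helper: returns the lowest port ID in the array,
 * so the result does not depend on the activation order. */
static uint16_t
agg_by_lowest_port_id(const uint16_t *slaves, uint16_t slaves_count)
{
	uint16_t i, min_id;

	assert(slaves_count > 0); /* caller must pass at least one slave */
	min_id = slaves[0];
	for (i = 1; i < slaves_count; i++) {
		if (slaves[i] < min_id)
			min_id = slaves[i];
	}
	return min_id;
}
```

With the array filled as {0, 1}, both helpers return port 0. With {1, 0}, the positional lookup returns port 1 (the failing case in the log above), while the lowest-ID lookup still returns port 0.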
Comment 1 Ajit Khaparde 2020-05-23 07:52:19 CEST
Chas - can you please check? Thanks
Comment 2 Markopeng 2020-05-31 03:40:02 CEST
index b77a37d..48242a9 100644
--- a/FStackV1.12/dpdk/drivers/net/bonding/rte_eth_bond_8023ad.c
+++ b/FStackV1.12/dpdk/drivers/net/bonding/rte_eth_bond_8023ad.c
@@ -658,6 +658,25 @@ max_index(uint64_t *a, int n)
        return max_i;
 }

+static uint16_t
+min_index(uint16_t *a, uint16_t n)
+{
+    if (n <= 0)
+        return -1;
+
+    int i, min_i = 0;
+    uint64_t min = a[0];
+
+    for (i = 1; i < n; ++i) {
+        if (a[i] < min) {
+            min = a[i];
+            min_i = i;
+        }
+    }
+
+    return min_i;
+}
+
 /**
  * Function assigns port to aggregator.
  *
@@ -728,7 +747,11 @@ selection_logic(struct bond_dev_private *internals, uint16_t slave_id)
                if (default_slave == slaves_count)
                        new_agg_id = slaves[slave_id];
                else
-                       new_agg_id = slaves[default_slave];
+               {
+            //new_agg_id = slaves[default_slave];
+            agg_new_idx = min_index(slaves, slaves_count);
+            new_agg_id = slaves[agg_new_idx];
+               }
                break;
        default:
                if (default_slave == slaves_count)

This is my patch. It works well for me.
Comment 3 Ajit Khaparde 2020-05-31 04:37:31 CEST
Markopeng - can you formally submit the patch to the mailing list? Thanks
Comment 4 Christian 2021-12-07 12:48:21 CET
Thanks, this works for me.
