Bug 284
| Summary: | Secondary not able to Rx/TX after primary dies in symmetric multiprocess | | |
|---|---|---|---|
| Product: | DPDK | Reporter: | Oleksandr (oleksandr.gromovskyi) |
| Component: | other | Assignee: | Ferruh YIGIT (ferruh.yigit) |
| Status: | CONFIRMED --- | | |
| Severity: | major | CC: | anatoly.burakov |
| Priority: | Normal | | |
| Version: | 18.11 | | |
| Target Milestone: | --- | | |
| Hardware: | x86 | | |
| OS: | Linux | | |
| Attachments: | config file; modified sample app | | |

Description
Oleksandr
2019-05-23 16:23:18 CEST
I took a quick look at the code of the app in question, and i don't see anything that would prevent it from working. I'll try reproducing this.

Thank you. I checked versions 18.02.1, 18.11 and 19.05, and all show the same behavior: while the primary is alive i'm able to rx/tx data, but immediately after the primary stops or crashes the secondaries are not able to rx/tx any data, and no error is reported.

If it goes as far back as 18.02, then i definitely know where *not* to look :) (i think it's safe to say we can rule out both the memory changes done in 18.05 and the multiprocess hotplug work done in later releases)

Yeah, btw i checked 2.2.0 and all works as expected :)

That'll make a good bisect starting point, just in case. I didn't get a chance to reproduce it as of yet, it'll hopefully be this afternoon. As an aside, i don't think this model of letting secondary processes start without the primary should be supported any longer. I don't think it's even possible to initialize a secondary process without the primary, seeing how hotplug and VFIO rely on IPC. Or rather, the process will be initialized alright, but none of the ports would actually work. However, that's a different discussion.

It's not the case of secondary without primary at all. I'm starting the primary and then a secondary with the "same" functionality. Then when the primary crashes, the secondary can detect it and continue serving data rx/tx as if the primary were still alive (this can be used as application redundancy). This is also the idea of the Symmetric Multi-process Example of DPDK. This example is present in all dpdk versions, but it looks like at least in version 18.02.1 and higher it does not work properly when the primary dies.

I can reproduce this on 18.11. Seeing how there's quite a bit of bisecting to do, with various build system changes since 2.2, figuring out the root cause will take a while.

Hello. Are there any updates on the status of this bug?

No updates from my side, i haven't looked at it just yet.

I've attempted to reproduce this on 2.2. It doesn't work there either. Are you sure you can successfully do that with v2.2?

Yes, i used 2.2.0 and it worked as expected. I will attach my .config and the sample modified to print stats each second. The command is in my first message.

Created attachment 58: config file

Created attachment 59: modified sample app

I am not able to reproduce that. Whenever i load the secondary process after exiting the primary, i don't get any traffic. I use testpmd with "start tx_first" as a traffic generator, and it works after running the process as primary, but doesn't RX or forward anything when running the process as secondary.

Command i use for primary:

> sudo DPDK/examples/multi_process/symmetric_mp/build/symmetric_mp -w 81:00.0
> -w 81:00.1 -c 1 -n 4 --proc-type=primary --file-prefix mp_test -- -p 3
> --num-procs=2 --proc-id=0

> sudo DPDK/examples/multi_process/symmetric_mp/build/symmetric_mp -w 81:00.0
> -w 81:00.1 -c 1 -n 4 --proc-type=secondary --file-prefix mp_test -- -p 3
> --num-procs=2 --proc-id=0

Command i use for secondary:

> sudo DPDK/examples/multi_process/symmetric_mp/build/symmetric_mp -w 81:00.0
> -w 81:00.1 -c 2 -n 4 --proc-type=auto --file-prefix mp_test -- -p 3
> --num-procs=2 --proc-id=1

Testpmd logs:

```
testpmd> start tx_first
rxonly packet forwarding - ports=2 - cores=1 - streams=2 - NUMA support enabled, MP allocation mode: native
Logical Core 1 (socket 0) forwards packets on 2 streams:
  RX P=0/Q=0 (socket 0) -> TX P=1/Q=0 (socket 0) peer=02:00:00:00:00:01
  RX P=1/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00

  rxonly packet forwarding packets/burst=32
  nb forwarding cores=1 - nb forwarding ports=2
  port 0: RX queue number: 1 Tx queue number: 1
    Rx offloads=0x0 Tx offloads=0x0
    RX queue: 0
      RX desc=256 - RX free threshold=32
      RX threshold registers: pthresh=8 hthresh=8 wthresh=0
      RX Offloads=0x0
    TX queue: 0
      TX desc=256 - TX free threshold=32
      TX threshold registers: pthresh=32 hthresh=0 wthresh=0
      TX offloads=0x0 - TX RS bit threshold=32
  port 1: RX queue number: 1 Tx queue number: 1
    Rx offloads=0x0 Tx offloads=0x0
    RX queue: 0
      RX desc=256 - RX free threshold=32
      RX threshold registers: pthresh=8 hthresh=8 wthresh=0
      RX Offloads=0x0
    TX queue: 0
      TX desc=256 - TX free threshold=32
      TX threshold registers: pthresh=32 hthresh=0 wthresh=0
      TX offloads=0x0 - TX RS bit threshold=32
testpmd> stop
Telling cores to stop...
Waiting for lcores to finish...

  ---------------------- Forward statistics for port 0 ----------------------
  RX-packets: 32 RX-dropped: 0 RX-total: 32
  TX-packets: 32 TX-dropped: 0 TX-total: 32
  ----------------------------------------------------------------------------

  ---------------------- Forward statistics for port 1 ----------------------
  RX-packets: 32 RX-dropped: 0 RX-total: 32
  TX-packets: 32 TX-dropped: 0 TX-total: 32
  ----------------------------------------------------------------------------

  +++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
  RX-packets: 64 RX-dropped: 0 RX-total: 64
  TX-packets: 64 TX-dropped: 0 TX-total: 64
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Done.
```

Then, after re-running primary as secondary:

```
testpmd> start tx_first
rxonly packet forwarding - ports=2 - cores=1 - streams=2 - NUMA support enabled, MP allocation mode: native
Logical Core 1 (socket 0) forwards packets on 2 streams:
  RX P=0/Q=0 (socket 0) -> TX P=1/Q=0 (socket 0) peer=02:00:00:00:00:01
  RX P=1/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00

  rxonly packet forwarding packets/burst=32
  nb forwarding cores=1 - nb forwarding ports=2
  port 0: RX queue number: 1 Tx queue number: 1
    Rx offloads=0x0 Tx offloads=0x0
    RX queue: 0
      RX desc=256 - RX free threshold=32
      RX threshold registers: pthresh=8 hthresh=8 wthresh=0
      RX Offloads=0x0
    TX queue: 0
      TX desc=256 - TX free threshold=32
      TX threshold registers: pthresh=32 hthresh=0 wthresh=0
      TX offloads=0x0 - TX RS bit threshold=32
  port 1: RX queue number: 1 Tx queue number: 1
    Rx offloads=0x0 Tx offloads=0x0
    RX queue: 0
      RX desc=256 - RX free threshold=32
      RX threshold registers: pthresh=8 hthresh=8 wthresh=0
      RX Offloads=0x0
    TX queue: 0
      TX desc=256 - TX free threshold=32
      TX threshold registers: pthresh=32 hthresh=0 wthresh=0
      TX offloads=0x0 - TX RS bit threshold=32
testpmd> stop
Telling cores to stop...
Waiting for lcores to finish...

  ---------------------- Forward statistics for port 0 ----------------------
  RX-packets: 0 RX-dropped: 0 RX-total: 0
  TX-packets: 32 TX-dropped: 0 TX-total: 32
  ----------------------------------------------------------------------------

  ---------------------- Forward statistics for port 1 ----------------------
  RX-packets: 0 RX-dropped: 0 RX-total: 0
  TX-packets: 32 TX-dropped: 0 TX-total: 32
  ----------------------------------------------------------------------------

  +++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
  RX-packets: 0 RX-dropped: 0 RX-total: 0
  TX-packets: 64 TX-dropped: 0 TX-total: 64
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Done.
```

Logs from your sample app run as primary:

```
<...>
Checking link statusdone
Port 0 Link Up - speed 10000 Mbps - full-duplex
Port 1 Link Up - speed 10000 Mbps - full-duplex
APP: Finished Process Init.
Lcore 0 using ports 0 1
lcore 0 using queue 0 of each port
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 32, TX - 32, Drop - 0
Port 1: RX - 32, TX - 32, Drop - 0
Port 0: RX - 32, TX - 32, Drop - 0
Port 1: RX - 32, TX - 32, Drop - 0
Port 0: RX - 32, TX - 32, Drop - 0
Port 1: RX - 32, TX - 32, Drop - 0
<...>
```

Logs after re-running as secondary:

```
APP: Finished Process Init.
Lcore 0 using ports 0 1
lcore 0 using queue 0 of each port
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
Port 0: RX - 0, TX - 0, Drop - 0
Port 1: RX - 0, TX - 0, Drop - 0
```

So, whatever the issue is, i cannot confirm that it is not present in v2.2.0 on my setup - it looks like it was there since a long time ago.

Not sure that what you did is the same as what i did. First of all i start primary and secondary at the same time, and check that both receive. Then i stop the primary and check that the secondary is still able to receive data. I see that you configured different queues for primary and secondary (--proc-id=). Are you sure that the secondary receives data at all, even when the primary is alive?

Hm, it seems that i've somewhat misinterpreted the original report and attempted to reproduce the issue the wrong way. Still, i can't reproduce your behavior either - RX stops working in secondary whenever i kill the primary process.

Log from primary:

Port 0: RX - 45153335, TX - 0, Drop - 0
Port 1: RX - 0, TX - 45153335, Drop - 0
Port 0: RX - 48282679, TX - 0, Drop - 0
Port 1: RX - 0, TX - 48282679, Drop - 0
Port 0: RX - 51445248, TX - 0, Drop - 0
Port 1: RX - 0, TX - 51445248, Drop - 0
Port 0: RX - 54610948, TX - 0, Drop - 0
Port 1: RX - 0, TX - 54610948, Drop - 0
Port 0: RX - 57694344, TX - 0, Drop - 0
Port 1: RX - 0, TX - 57694344, Drop - 0
^C
Exiting on signal 2

Log from secondary:

Port 0: RX - 793763, TX - 0, Drop - 0
Port 1: RX - 0, TX - 793763, Drop - 0
Port 0: RX - 908510, TX - 0, Drop - 0
Port 1: RX - 0, TX - 908510, Drop - 0
Port 0: RX - 1023473, TX - 0, Drop - 0
Port 1: RX - 0, TX - 1023473, Drop - 0
Port 0: RX - 1066257, TX - 0, Drop - 0
Port 1: RX - 0, TX - 1066257, Drop - 0
Port 0: RX - 1066257, TX - 0, Drop - 0
Port 1: RX - 0, TX - 1066257, Drop - 0
Port 0: RX - 1066257, TX - 0, Drop - 0
Port 1: RX - 0, TX - 1066257, Drop - 0
Port 0: RX - 1066257, TX - 0, Drop - 0
Port 1: RX - 0, TX - 1066257, Drop - 0

Note: RX stops, so all RX values after a certain point are the same.

To clarify: this is the log from version 2.2.0, not latest. The issue happens on all versions i try it with. So, it doesn't look like it was "introduced"; more likely it was always there (or alternatively it is PMD-specific).

Some more debugging revealed that this scenario actually works on 19.08 latest master, but only with VFIO. It only fails with igb_uio. Can you check if this is the case on your setup?

Interesting. I worked only with igb_uio, can you explain how to do the test with vfio?

The same way you test with igb_uio, only instead of binding your NICs to igb_uio, you bind them to the vfio-pci driver (and make sure your IOMMU is either enabled or in pass-through mode - the latter will work with both VFIO and igb_uio, so you can use that if you want to quickly switch between them).

Understood, i will try, but it will take some time to prepare the test env because iommu is not enabled on my setup.

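The VFIO setup described in the reply above can be sketched roughly as follows. This is a hedged outline, not a verified procedure for the reporter's machine: `iommu_group_count` and `bind_to_vfio` are illustrative helper names, `DPDK_DIR` is a hypothetical variable pointing at a DPDK source tree, and the PCI addresses in the usage comment are the ones that appear later in this thread. It assumes the `dpdk-devbind.py` script from the DPDK `usertools` directory and a kernel that exposes IOMMU groups under `/sys/kernel/iommu_groups`.

```shell
#!/bin/sh
# Sketch only: check for IOMMU groups, then bind ports to vfio-pci.

# Count the IOMMU groups the kernel has created. Zero groups usually means
# the IOMMU is disabled (BIOS or kernel command line), in which case binding
# a device to vfio-pci is expected to fail.
# $1 (optional): directory to inspect; defaults to the usual sysfs location.
iommu_group_count() {
    dir="${1:-/sys/kernel/iommu_groups}"
    [ -d "$dir" ] || { echo 0; return; }
    ls -1 "$dir" | wc -l | tr -d ' '
}

# Load the vfio-pci module and bind the given PCI devices to it (needs root).
# DPDK_DIR is a hypothetical variable, not something defined by DPDK itself.
bind_to_vfio() {
    modprobe vfio-pci &&
    "${DPDK_DIR:-.}/usertools/dpdk-devbind.py" --bind=vfio-pci "$@"
}

# Example usage (not executed here):
#   [ "$(iommu_group_count)" -gt 0 ] && bind_to_vfio 0000:01:00.0 0000:01:00.1
#   "$DPDK_DIR/usertools/dpdk-devbind.py" --status
```

Checking the group count first is only a convenience: it separates "IOMMU is off" from other bind failures before any driver is touched.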
Had time to do some testing. iommu enabled:

[root@localhost ~]# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-229.1.2.47109.MSSr1.el7.centos.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet isolcpus=2,3 processor.max_cstate=0 intel_idle.max_cstate=0 iommu=pt intel_iommu=on
[root@localhost usertools]# dmesg | grep -e IOMMU -e DMAR
[ 0.000000] Intel-IOMMU: enabled

but when i try to bind a device to vfio i'm getting this error:

Enter PCI address of device to bind to VFIO driver: 0000:05:00.1
Error: bind failed for 0000:05:00.1 - Cannot bind to driver vfio-pci

Maybe i need to do something additional.

Enable IOMMU in the BIOS? If you don't see a bunch of address mappings in dmesg, you don't have IOMMU enabled.

Understood, i thought enabling it in linux was enough. It will take some time, as all my servers are remote and in a different location.

I was able to go further, now i see:

[root@localhost symmetric_mp]# ./build/symmetric_mp -c 1 -n 4 --proc-type=auto -- -p 3 --num-procs=2 --proc-id=0
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Auto-detected process type: PRIMARY
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: Error - exiting with code: 1
  Cause: No Ethernet ports - bye

That'll get you nowhere - of course you can run without ports, but the *point* of this exercise is to do it *with* ports :)

Of course i bound some ports to vfio:

Network devices using DPDK-compatible driver
============================================
0000:01:00.0 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' drv=vfio-pci unused=ixgbe,igb_uio
0000:01:00.1 '82599ES 10-Gigabit SFI/SFP+ Network Connection 10fb' drv=vfio-pci unused=ixgbe,igb_uio

Network devices using kernel driver
===================================
0000:00:19.0 '82579LM Gigabit Network Connection 1502' if=eno1 drv=e1000e unused=igb_uio,vfio-pci *Active*

Why aren't they being initialized then? Can you enable debug log?

It's already with debug logs, as in the code before rte_eal_init i added rte_log_set_global_level(RTE_LOG_DEBUG).

I don't know what you have in your code, but what you have pasted here aren't debug logs. You want to add --log-level=eal,8 to the EAL command line.

Removed rte_log_set_global_level and added what you proposed to the parameters:

[root@localhost symmetric_mp]# ./build/symmetric_mp -c 1 -n 4 --proc-type=auto -- -p 3 --num-procs=2 --proc-id=0 --log-level=eal,8
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Auto-detected process type: PRIMARY
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: Error - exiting with code: 1
  Cause: No Ethernet ports - bye

You've added it to the app command-line, not to the EAL command-line. Can we perhaps take this to IRC so as not to spam the bug tracker?

Yes please.

Further investigation by Oleksandr showed that this is indeed a problem that is specific to igb_uio, and is not reproducible with VFIO. Reassigning to Ferruh.
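
The EAL versus application command-line distinction from the log-level exchange above can be illustrated with a small sketch. DPDK applications take EAL options before the first `--` and application options after it; the `classify_args` helper below is hypothetical and only labels each argument by the side it falls on, it does not run DPDK. The option strings are copied from the commands in this thread.

```shell
#!/bin/sh
# Sketch only: label each argument as an EAL option (before the first "--")
# or an application option (after it). classify_args is an illustrative
# helper, not part of DPDK.
classify_args() {
    side=EAL
    for arg in "$@"; do
        if [ "$side" = EAL ] && [ "$arg" = "--" ]; then
            side=APP    # everything after the first "--" goes to the app
            continue
        fi
        printf '%s: %s\n' "$side" "$arg"
    done
}

# Placement used in the thread: --log-level=eal,8 lands on the APP side.
classify_args -c 1 -n 4 --proc-type=auto -- -p 3 --num-procs=2 --proc-id=0 --log-level=eal,8

# Presumably intended placement: the option sits before "--".
classify_args -c 1 -n 4 --proc-type=auto --log-level=eal,8 -- -p 3 --num-procs=2 --proc-id=0
```

With the second ordering, the corresponding invocation would presumably be `./build/symmetric_mp -c 1 -n 4 --proc-type=auto --log-level=eal,8 -- -p 3 --num-procs=2 --proc-id=0`.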