Doing PVP testing, we noticed that testpmd startup time increases noticeably when the --legacy-mem option is not passed. With 1G hugepages and the command lines below, we measure ~2 seconds with --legacy-mem and ~13 seconds without it:

testpmd -c 7 -n 4 --socket-mem 2048,0 -w 0000:00:02.0 --legacy-mem -- --burst 64 -i --rxq=2 --txq=2 --rxd=4096 --txd=1024 --coremask=6 --auto-start --port-topology=chained --forward-mode=macswap

testpmd -c 7 -n 4 --socket-mem 2048,0 -w 0000:00:02.0 -- --burst 64 -i --rxq=2 --txq=2 --rxd=4096 --txd=1024 --coremask=6 --auto-start --port-topology=chained --forward-mode=macswap
Is it the EAL startup time that's increased, or is it the mempool creation time that's increased?
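If it helps to separate the two, a minimal sketch along these lines could tell where the time goes. This is hypothetical, not testpmd code; it just times rte_eal_init() and an mbuf pool creation (with the same parameters as testpmd's log) using clock_gettime():

/* Hypothetical timing sketch: measure EAL init and mbuf pool creation
 * separately to see which phase accounts for the extra startup time. */
#include <stdio.h>
#include <time.h>
#include <rte_eal.h>
#include <rte_mbuf.h>

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(int argc, char **argv)
{
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (rte_eal_init(argc, argv) < 0)
        return 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Pool parameters copied from testpmd's "create a new mbuf pool" log. */
    struct rte_mempool *mp = rte_pktmbuf_pool_create("mbuf_pool_socket_0",
            179456, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("EAL init: %.3fs, mempool creation: %.3fs\n",
           elapsed(t0, t1), elapsed(t1, t2));
    return mp == NULL;
}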
Hi Anatoly, I have simplified the setup: I just need to run testpmd on the host, without any devices bound. The regression seems to happen in mlockall().

Without --legacy-mem:

# strace -T -e trace=mlockall ./install/bin/testpmd -l 0,2,3,4,5 -m 1024 -n 4
EAL: Detected 8 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: PCI device 0000:04:00.0 on NUMA socket 0
EAL: probe driver: 8086:1583 net_i40e
EAL: PCI device 0000:04:00.1 on NUMA socket 0
EAL: probe driver: 8086:1583 net_i40e
EAL: PCI device 0000:05:00.0 on NUMA socket 0
EAL: probe driver: 8086:1572 net_i40e
EAL: PCI device 0000:05:00.1 on NUMA socket 0
EAL: probe driver: 8086:1572 net_i40e
testpmd: No probed ethernet devices
*mlockall(MCL_CURRENT|MCL_FUTURE) = 0 <5.702764>*
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=179456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Done
No commandline core given, start packet forwarding
io packet forwarding - ports=0 - cores=0 - streams=0 - NUMA support enabled, MP allocation mode: native
io packet forwarding packets/burst=32
nb forwarding cores=1 - nb forwarding ports=0
Press enter to exit

With --legacy-mem:

# strace -T -e trace=mlockall ./install/bin/testpmd -l 0,2,3,4,5 -m 1024 -n 4 --legacy-mem
EAL: Detected 8 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: PCI device 0000:04:00.0 on NUMA socket 0
EAL: probe driver: 8086:1583 net_i40e
EAL: PCI device 0000:04:00.1 on NUMA socket 0
EAL: probe driver: 8086:1583 net_i40e
EAL: PCI device 0000:05:00.0 on NUMA socket 0
EAL: probe driver: 8086:1572 net_i40e
EAL: PCI device 0000:05:00.1 on NUMA socket 0
EAL: probe driver: 8086:1572 net_i40e
testpmd: No probed ethernet devices
*mlockall(MCL_CURRENT|MCL_FUTURE) = 0 <1.393868>*
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=179456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Done
No commandline core given, start packet forwarding
io packet forwarding - ports=0 - cores=0 - streams=0 - NUMA support enabled, MP allocation mode: native
io packet forwarding packets/burst=32
nb forwarding cores=1 - nb forwarding ports=0
Press enter to exit
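For reference, mlockall(MCL_CURRENT | MCL_FUTURE) locks (and faults in) everything the process currently has mapped, plus future mappings, so its duration depends on the mappings EAL has created by that point. A toy sketch like the one below (not testpmd code; the 1 GiB anonymous mapping is an arbitrary placeholder, and it needs root or a sufficient RLIMIT_MEMLOCK) shows the same kind of dependency outside DPDK:

/* Toy sketch (not testpmd code): time mlockall(MCL_CURRENT | MCL_FUTURE)
 * after creating an anonymous mapping, to illustrate that its cost
 * depends on what the process already has mapped. */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
    size_t len = 1UL << 30; /* 1 GiB anonymous mapping, arbitrary */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct timespec t0, t1;

    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    int ret = mlockall(MCL_CURRENT | MCL_FUTURE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("mlockall() = %d, took %.3fs\n", ret,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    munmap(p, len);
    return 0;
}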
Testpmd has a --no-mlockall option - I presume the issue does not exist with it?
Correct, no issue with this option passed
Just to be clear, with --no-mlockall the issue is not seen at startup time, as mlockall() is not called. But the cost may simply be deferred to the first time the pages are accessed.
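As a hedged illustration of that deferral (this is not something testpmd does), one could pre-fault a buffer by touching one byte per page, which moves the cost back to an explicit, measurable place:

/* Sketch: without mlockall(), pages are only faulted in on first access,
 * so the cost shifts from startup to the first use. This toy program
 * pre-faults an anonymous buffer by touching one byte per page; the
 * 256 MiB size is arbitrary. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 256UL << 20; /* 256 MiB, arbitrary */
    long page = sysconf(_SC_PAGESIZE);
    volatile char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    for (size_t off = 0; off < len; off += (size_t)page)
        buf[off] = 0; /* fault the page in now instead of on first use */

    printf("pre-faulted %zu pages\n", len / (size_t)page);
    munmap((void *)buf, len);
    return 0;
}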
I worked on reproducing the issue with an upstream kernel. With upstream kernel v5.0, I managed to reproduce the same behaviour. However, I noticed that the issue only reproduces with the tuned realtime-virtual-host profile [0] enabled. Switching back to Red Hat's v4.18 kernel, the issue also does not reproduce when the tuned realtime-virtual-host profile is disabled.

Note 1: when the profile is disabled, the mlockall() duration is ~40ms both with and without the --legacy-mem option.

Note 2: there is an NMI watchdog regression in upstream kernels v4.19+, which makes the NMI watchdog fire even though it is disabled via its sysctl when using the tuned realtime-virtual-host profile. I reported it here: https://lkml.org/lkml/2019/3/8/111

[0]: https://github.com/redhat-performance/tuned
Not to sidestep or downplay the issue, but I don't believe other applications do mlockall() for their memory (although we do recommend that in "Writing efficient code" section of our programmer's guide). I wonder how important this call really is.
I know OVS has a command-line option to do mlockall(). It could be problematic if a VNF application uses it, as startup time might be important. That said, I don't think it is a DPDK issue, now that we see the time is spent in mlockall(). So I assigned the bug to myself and will keep it open so that I can post a conclusion once the problem is fixed, or at least understood.