[dpdk-dev] VMXNET3 on vmware, ping delay

Vass, Sandor (Nokia - HU/Budapest) sandor.vass at nokia.com
Thu Jun 25 23:13:59 CEST 2015


It seems I have found the cause, but I still don't understand the reason.
So, let me describe my setup a bit further. I installed VMware Workstation on my laptop, which has a mobile i5 CPU: 2 cores with hyperthreading, so effectively 4 logical cores.
In VMware I assigned the C1 and C2 nodes 1 CPU with 1 core each; BR has 1 CPU with 4 cores allocated (the maximum possible).

If I run 'basicfwd' or the multi-process master (with two clients) on any of cores 2, 3 or 4, then the ping replies arrive immediately (less than 0.5 ms) and the transfer speed is high right away (starting from ~30 MB/s and reaching around 80-90 MB/s with basicfwd, and with test-pmd as well *).

If I allocate them to core 1 (with the clients on any other cores), then the ping behaves as I originally described: 1-second delays. When I tried to transfer a bigger file (using scp), it started really slowly (around 16-32 KB/s) and sometimes even stalled. Later it got faster, as Matthew wrote, but it never went above 20-30 MB/s.
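
For reference, the hot loop of basicfwd is essentially a busy-poll over rte_eth_rx_burst()/rte_eth_tx_burst(); a minimal sketch (port numbers and burst size are placeholders, not the exact example code) looks like the one below. Whatever lcore runs this loop sits at 100% CPU, which is why its placement seems to matter so much:

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Minimal basicfwd-style forwarding loop: poll one port, forward to the
 * other, free whatever the TX queue could not take. It never blocks. */
static void
forward_loop(uint8_t rx_port, uint8_t tx_port)
{
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
                const uint16_t nb_rx =
                        rte_eth_rx_burst(rx_port, 0, bufs, BURST_SIZE);
                if (nb_rx == 0)
                        continue;

                const uint16_t nb_tx =
                        rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);

                for (uint16_t i = nb_tx; i < nb_rx; i++)
                        rte_pktmbuf_free(bufs[i]);
        }
}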

test-pmd worked from the start.
That is because test-pmd requires at least 2 cores and I always passed '-c 3'. Checking with top, I could see that it always used CPU #2 (top showed the second CPU at 100% utilization).
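
As far as I understand, '-c 3' is a hex coremask selecting lcores 0 and 1, and test-pmd puts its forwarding thread on the first worker lcore, which would explain why the second CPU showed 100%. To verify where the polling thread really lands, something like the small sketch below can be called from inside the loop; it assumes a Linux/glibc target for sched_getcpu():

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <rte_lcore.h>

/* Print the EAL lcore id of the calling thread and the Linux CPU it is
 * currently scheduled on, to confirm which vCPU is doing the polling. */
static void
report_placement(void)
{
        printf("lcore %u is running on CPU %d\n",
               rte_lcore_id(), sched_getcpu());
}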

Can anyone tell me the reason for this behavior? Using CPU 1 there are huge latencies, while using the other CPUs everything works as expected...
Checking on the laptop (Windows Task Manager), I could see that none of the VMs was pinning any single host CPU to 100%. The DPDK process's 100% utilization was somehow spread across the physical CPU cores, so no single core was used exclusively by a VM. Why is it different when I use the first vCPU of BR rather than the others? It doesn't seem that C1 and C2 are blocking that CPU. In any case, the host OS already uses all the cores (though not heavily).


Rashmin, thanks for the docs. I think I had already seen that one, but I didn't take it seriously. I assumed latency tuning on VMware ESXi is relevant when one wants to go from 5 ms down to 0.5 ms, but I was seeing 1000 ms latency at low load... I will check whether those parameters apply to Workstation at all.


*) The top speed of the multi-process master-client example was around 20-30 MB/s, right from the start. I think this is a normal limitation, because the processes have to talk to each other through shared memory, so it is slower in any case. I didn't test its speed with the master process bound to core 1.
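
For context on why I expect the multi-process case to be slower: the server and clients hand mbuf pointers to each other over rte_rings in shared memory, so every packet costs an extra enqueue/dequeue. Roughly, with placeholder names:

#include <rte_ring.h>
#include <rte_mbuf.h>

/* Server side: hand one packet to a client over its ring; drop it if the
 * ring is full. rte_ring_enqueue() returns 0 on success. */
static void
pass_to_client(struct rte_ring *client_ring, struct rte_mbuf *pkt)
{
        if (rte_ring_enqueue(client_ring, pkt) != 0)
                rte_pktmbuf_free(pkt);
}

/* Client side: fetch the next packet, or NULL if the ring is empty. */
static struct rte_mbuf *
next_from_server(struct rte_ring *client_ring)
{
        void *obj = NULL;

        if (rte_ring_dequeue(client_ring, &obj) != 0)
                return NULL;
        return obj;
}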

Sandor

-----Original Message-----
From: ext Patel, Rashmin N [mailto:rashmin.n.patel at intel.com] 
Sent: Thursday, June 25, 2015 10:56 PM
To: Matthew Hall; Vass, Sandor (Nokia - HU/Budapest)
Cc: dev at dpdk.org
Subject: RE: [dpdk-dev] VMXNET3 on vmware, ping delay

For tuning ESXi and the vSwitch for latency-sensitive workloads, I recall the following paper published by VMware, which you can try out: https://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf

In this setup (VMware plus a DPDK VM using VMXNET3), most of the latency lies in the vmware-native-driver / vmkernel / vmxnet3-backend / vmx-emulation threads in ESXi. So tuning ESXi (as explained in the white paper above) and/or making sure these important threads are not starved can improve latency, and in some cases throughput, in this setup.

Thanks,
Rashmin

-----Original Message-----
From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Matthew Hall
Sent: Thursday, June 25, 2015 8:19 AM
To: Vass, Sandor (Nokia - HU/Budapest)
Cc: dev at dpdk.org
Subject: Re: [dpdk-dev] VMXNET3 on vmware, ping delay

On Thu, Jun 25, 2015 at 09:14:53AM +0000, Vass, Sandor (Nokia - HU/Budapest) wrote:
> According to my understanding each packet should go through BR as fast
> as possible, but it seems that rte_eth_rx_burst retrieves packets
> only when there are at least 2 packets on the RX queue of the NIC. At
> least most of the time, as there are cases (rarely, according to my
> console log) when it can retrieve 1 packet, and sometimes only 3
> packets can be retrieved...

By default DPDK is optimized for throughput, not latency. Try a test with heavier traffic.

There is also some work going on now for a DPDK interrupt-driven mode, which will behave more like traditional interrupt-driven Ethernet drivers instead of poll-mode drivers.
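
From what I've seen in the patches, the idea is to sleep on an event fd while the RX queue is idle and fall back to polling once traffic arrives; roughly something like the sketch below (names taken from the in-flight patches, so treat it as a sketch rather than a final API):

#include <rte_ethdev.h>
#include <rte_interrupts.h>

/* Block until the RX queue raises an interrupt, then return so the
 * caller can resume rte_eth_rx_burst() polling. */
static void
wait_for_rx(uint8_t port, uint16_t queue)
{
        struct rte_epoll_event ev;

        /* register the queue interrupt with this thread's epoll instance */
        rte_eth_dev_rx_intr_ctl_q(port, queue, RTE_EPOLL_PER_THREAD,
                                  RTE_INTR_EVENT_ADD, NULL);

        rte_eth_dev_rx_intr_enable(port, queue);
        rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, -1 /* no timeout */);
        rte_eth_dev_rx_intr_disable(port, queue);
}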

Though I'm not an expert on it, there are also a number of ways to optimize for latency, which hopefully others can discuss... or you could search the archives / web site / Intel tuning documentation.

Matthew.

