Bug 29 - pktgen hangs when it tries to send packets through libvirt driver, works for all other drivers
Summary: pktgen hangs when it tries to send packets through libvirt driver, works for all other drivers
Status: CONFIRMED
Alias: None
Product: DPDK
Classification: Unclassified
Component: ethdev
Version: 18.02
Hardware: x86 Linux
Importance: Normal normal
Target Milestone: ---
Assignee: Keith Wiles
URL:
Depends on:
Blocks:
 
Reported: 2018-04-27 22:03 CEST by Gregory Shimansky
Modified: 2019-06-26 15:43 CEST
CC List: 4 users



Attachments
Vagrantfile and scripts to create test VMs (3.88 KB, application/x-zip-compressed)
2018-04-27 22:03 CEST, Gregory Shimansky
Details

Description Gregory Shimansky 2018-04-27 22:03:47 CEST
Created attachment 6
Vagrantfile and scripts to create test VMs

This is a bug we encountered in the NFF-Go project when we try to use pktgen in two VMs connected directly through an internal network. On Qemu/KVM we use a UDP tunnel for the internal network connection; on VirtualBox we use a VB internal network.

I attached the Vagrantfile and scripts.sh that we use in our project to set up the VMs. You need Vagrant version 2.0.x to use this Vagrantfile, plus its reload plugin. Install it with the command "vagrant plugin install vagrant-reload". If you connect to the internet through a proxy, also install the vagrant-proxyconf plugin with "vagrant plugin install vagrant-proxyconf".

To use the libvirt provisioner it is also important to install the vagrant-libvirt plugin ("vagrant plugin install vagrant-libvirt"), because otherwise VirtualBox is used to create the VMs. For libvirt we also use an "images" storage pool, because the default location is often too small to hold all of the VM images. If you don't need it, comment out line 37 in the Vagrantfile. Otherwise it is necessary to create the pool with the following commands:

virsh pool-define-as images dir --target /localdisk/libvirt
virsh pool-start images

After setting up, you can create the VMs with the command "vagrant up --provider=libvirt". It creates and provisions two VMs and downloads and builds the NFF-Go project, which also includes DPDK 18.02 and pktgen 3.4.9. If you used VirtualBox, it is necessary to manually change the type of NICs 2 and 3 from Intel to virtio.

When the VMs are created, use "vagrant ssh" to access each of them. There is a predefined bash function "bindports" which binds the two ports that are connected to each other on these VMs to the DPDK igb_uio driver. There is also a bash function "runpktgen" which starts pktgen on ports 0 and 1.
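For reference, here is a rough sketch of what those two functions amount to. The PCI addresses are guesses based on the QEMU device slots shown below (addr=0x6 and addr=0x7 on pci.0), the core layout is illustrative, and the tools are assumed to be on PATH; the authoritative commands are in scripts.sh:

# bindports: attach the two back-to-back virtio NICs to igb_uio
sudo modprobe uio
sudo insmod ${NFF_GO}/dpdk/dpdk/build/kmod/igb_uio.ko   # path depends on the NFF-Go build layout
sudo dpdk-devbind.py --bind=igb_uio 0000:00:06.0 0000:00:07.0

# runpktgen: start pktgen on ports 0 and 1
# core 0 -> display/CLI, cores 1-2 -> port 0 rx/tx, cores 3-4 -> port 1 rx/tx
sudo pktgen -l 0-4 -n 4 -- -P -m "[1:2].0, [3:4].1"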

When pktgen is started, try the commands "start 0" and "start 1". It starts sending packets, but only for a very brief period of time; it quickly stops, and only restarting the program allows it to send some more packets before it hangs again. This behavior happens only with the virtio driver, and the pktgen author believes the cause of the problem lies there: https://github.com/pktgen/Pktgen-DPDK/issues/148#issuecomment-380972000

If you don't want to deal with vagrant, here are two Qemu command lines that vagrant starts for these two VMs:

libvirt+ 28402     1  4 12:44 ?        00:06:03 /usr/bin/qemu-system-x86_64 -name guest=tests_fedora-1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-61-tests_fedora-1/master-key.aes -machine pc-i440fx-bionic,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX,vme=on,ss=on,vmx=on,f16c=on,rdrand=on,hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on,pdpe1gb=on,abm=on -m 4096 -realtime mlock=off -smp 8,sockets=8,cores=1,threads=1 -uuid 9cb78faf-9894-4fef-8833-8416689163cb -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-61-tests_fedora-1/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/localdisk/libvirt/tests_fedora-1.img,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=33,id=hostnet0,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:b8:95:7f,bus=pci.0,addr=0x5 -netdev socket,udp=127.0.0.1:44403,localaddr=127.0.0.1:44409,id=hostnet1 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=3c:fd:fe:a4:dd:f0,bus=pci.0,addr=0x6 -netdev socket,udp=127.0.0.1:44404,localaddr=127.0.0.1:44410,id=hostnet2 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=52:54:00:3b:d5:7d,bus=pci.0,addr=0x7 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:7 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on

libvirt+ 36753     1  4 12:47 ?        00:05:39 /usr/bin/qemu-system-x86_64 -name guest=tests_fedora-0,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-62-tests_fedora-0/master-key.aes -machine pc-i440fx-bionic,accel=kvm,usb=off,dump-guest-core=off -cpu Haswell-noTSX,vme=on,ss=on,vmx=on,f16c=on,rdrand=on,hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on,pdpe1gb=on,abm=on -m 4096 -realtime mlock=off -smp 8,sockets=8,cores=1,threads=1 -uuid 1e2c2002-d1e3-4985-a403-254762fbf30d -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-62-tests_fedora-0/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/localdisk/libvirt/tests_fedora-0.img,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=30,id=hostnet0,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:26:6a:fb,bus=pci.0,addr=0x5 -netdev socket,udp=127.0.0.1:44409,localaddr=127.0.0.1:44403,id=hostnet1 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:d5:31:ca,bus=pci.0,addr=0x6 -netdev socket,udp=127.0.0.1:44410,localaddr=127.0.0.1:44404,id=hostnet2 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=52:54:00:f7:b9:48,bus=pci.0,addr=0x7 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:6 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -msg timestamp=on
Comment 1 Gregory Shimansky 2018-04-27 22:12:43 CEST
There is also a comment from the pktgen maintainer that virtio modifies mbuf structures, which may interfere with how pktgen works: https://github.com/pktgen/Pktgen-DPDK/issues/148#issuecomment-383088512
Comment 2 Keith Wiles 2018-05-02 19:09:11 CEST
What are the DPDK (dpdk-devbind.py) and Pktgen command line options for each nff-go instance you are executing?

I see three interfaces, but I assumed you were using virtio. Are these eth0-2 virtio interfaces?
Comment 3 Gregory Shimansky 2018-05-02 20:00:29 CEST
The first interface is the management interface and is used to connect to the external world. The second and third interfaces are connected to each other, like this:

     |             |
 nff-go-0 ==== nff-go-1

There is a scripts.sh file which is sourced when you log in, so you should have two functions: "bindports" and "runpktgen". The first binds the second and third ports to the DPDK driver built inside the NFF-Go source tree, and the second launches pktgen on these two ports.
Comment 4 Keith Wiles 2018-05-03 15:51:51 CEST
I want to make some changes to DPDK or pktgen inside the vagrant instance. What is the command to rebuild dpdk and pktgen?
Comment 5 Gregory Shimansky 2018-05-03 16:28:52 CEST
You can run make within the ${NFF_GO}/dpdk directory. It should rebuild both DPDK and pktgen.
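That is (a hedged sketch; the exact targets are whatever the Makefile in the NFF-Go tree defines):

cd ${NFF_GO}/dpdk
make    # rebuilds the bundled DPDK 18.02 tree and pktgen 3.4.9 against it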
Comment 6 Keith Wiles 2018-05-03 18:02:28 CEST
I found a line in the file virtio_rxtx.c at around line 1048: 'need = RTE_MIN(need, (int)nb_used);'. The problem appears when nb_used is zero and virtio_xmit_cleanup is called with zero.

I commented out the line and now pktgen runs. It appears a deadlock happens when nb_used is zero and we need a ring entry. I do not know the virtio driver that well; the virtio maintainer(s) need to determine the correct fix.
Comment 7 Gregory Shimansky 2018-05-03 20:07:43 CEST
I tried to run pktgen with this change several times on Libvirt/Qemu and VirtualBox VMs, and got several different results:

1. Several times on Qemu the terminal froze and the VM froze, so I had to revive it with the "vagrant reload nff-go-0" command because I couldn't ssh into the system any more. But the packets were actually going; I saw them arriving on the second VM, so pktgen was successfully sending packets!
2. Many times pktgen crashed with a segmentation fault, both on Qemu and VB.
3. Several times on VB I saw it start to send packets just fine.
Comment 8 Gregory Shimansky 2018-05-03 21:23:07 CEST
I suspect that in case #1 in the comment above, something bad happens with the Linux kernel driver on the network interface used by ssh, because the VM doesn't actually freeze, the kernel doesn't crash, and pktgen executes just fine. There is just no ssh connection, which is why the terminal freezes.
Comment 9 Keith Wiles 2018-05-03 22:01:16 CEST
Pktgen is consuming all of the cycles sending traffic, and the VM cannot timeshare enough to manage anything else in the VM.

This means to me that the VM is telling us we have 8 cores, but in reality that is not the case; most likely it is one or two physical cores. Look at how QEMU is set up and see if you can increase the number of cores, as DPDK wants physical cores.
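A quick way to check this (a sketch; the -smp values are the ones visible in the QEMU command lines above, the domain name is taken from "-name guest=", everything else is standard tooling):

# inside the guest: what topology does the VM report?
lscpu | grep -E 'Socket|Core|Thread|^CPU\(s\)'

# on the host: the domains were started with "-smp 8,sockets=8,cores=1,threads=1";
# whether those 8 vCPUs land on distinct physical cores depends on host pinning
virsh vcpupin tests_fedora-0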

I ran pktgen a number of times and never saw a crash, so I am guessing it is not a pktgen problem but a DPDK or virtio problem.
Comment 10 Daria Kolistratova 2018-07-04 16:56:16 CEST
I have a similar problem on AWS images with the ENA driver.
Comment 11 Ajit Khaparde 2018-09-16 00:18:22 CEST
Who gets to work on this?
Comment 12 Tiwei Bie 2019-01-16 06:17:58 CET
I tried to reproduce this issue (pktgen stops sending packets after a few seconds) in a similar setup (two VMs with virtio ports connected by a UDP tunnel; one of them runs pktgen and the other runs testpmd in rxonly mode). It seems the issue has already been fixed in the latest code (pktgen-3.6.2 / DPDK v19.02-rc2). (The issue can still be reproduced in this setup with pktgen-3.4.9 / DPDK 18.02.)
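For reference, the receive side of such a check can be as simple as the following (a sketch, not the exact command line that was used; the testpmd path depends on how DPDK was built):

# on the receiving VM: run testpmd interactively on the virtio port
sudo ./build/app/testpmd -l 0-1 -n 4 -- -i
testpmd> set fwd rxonly
testpmd> start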
Comment 13 Gregory Shimansky 2019-01-25 00:49:51 CET
The issue is fixed in the sense that pktgen no longer stops sending packets (I see them arriving on the other side). But pktgen's UI freezes, and I can only kill it from another console because it doesn't accept any commands after I run "start 0". Possibly it is now an application bug, not a DPDK one.

I used DPDK 19.02.0-rc3 (0a703f0f36c11b6f23fad4fab9e79c308811329d) and pktgen-3.6.4.
Comment 14 Keith Wiles 2019-01-25 03:58:53 CET
DPDK also had a bug reported like the one you referenced, and they just reported that it was fixed after testing it again.

For this one I have to wait until I am back from vacation, and I am having an operation on my shoulder when I get back. This means it may be some time before I can get to this one.

Comment 15 Keith Wiles 2019-01-25 10:12:16 CET
I remember now: the testing that DPDK did was around virtio, not VMs.

I do not use VMs much, but sometimes the CPUs in a VM are not mapped to CPUs in the host. Because pktgen needs a minimum of two CPUs to run, could it be that the configuration of the VMs is not really using two different CPUs?

Pktgen tries to make sure you do not put the rx/tx core on the same core as the CLI and display. Could the VM be putting the vCPUs on the same cores?

Not sure if that is the problem here. 
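For what it is worth, pktgen's core mapping is explicit, so one way to rule this out is to give the CLI/display core and the rx/tx cores distinct IDs (illustrative values; this assumes at least three usable cores in the guest and pktgen on PATH):

# core 0 -> pktgen CLI/display, core 1 -> port 0 rx/tx, core 2 -> port 1 rx/tx
sudo pktgen -l 0-2 -n 4 -- -P -m "1.0, 2.1"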

Comment 16 Keith Wiles 2019-06-26 15:42:11 CEST
Which is the case: does the pktgen UI freeze up while still sending traffic, or does it stop sending altogether after a few seconds?
