[dpdk-dev] virtio kills qemu VM after stopping/starting ports

Tan, Jianfeng jianfeng.tan at intel.com
Fri Sep 2 14:53:41 CEST 2016


Hi Kyle,


On 9/2/2016 8:35 PM, Kyle Larose wrote:
>
>> -----Original Message-----
>> From: Tan, Jianfeng [mailto:jianfeng.tan at intel.com]
>> Sent: Friday, September 02, 2016 2:56 AM
>> To: Kyle Larose; dev at dpdk.org
>> Cc: huawei.xie at intel.com; yuanhan.liu at linux.intel.com
>> Subject: Re: virtio kills qemu VM after stopping/starting ports
>>
>> Hi Kyle,
>>
>>
>> On 9/2/2016 4:53 AM, Kyle Larose wrote:
>>> Hello everyone,
>>>
>>> In my own testing, I recently stumbled across an issue where I could
>>> get qemu to exit when sending traffic to my application. To do this, I
>>> simply needed to do the following:
>>> 1) Start my virtio interfaces
>>> 2) Send some traffic into/out of the interfaces
>>> 3) Stop the interfaces
>>> 4) Start the interfaces
>>> 5) Send some more traffic
>>>
>>> At this point, I would lose connectivity to my VM. Further
>>> investigation revealed qemu exiting with the following log:
>>> 	2016-09-01T15:45:32.119059Z qemu-kvm: Guest moved used index from 5 to 1
>>>
>>> I found the following bug report against qemu, reported by a user of
>>> DPDK: https://bugs.launchpad.net/qemu/+bug/1558175
>>>
>>> That thread seems to have stalled out, so I think we probably should
>>> deal with the problem within DPDK itself. Either way, later in the bug
>>> report chain, we see a link to this patch to DPDK:
>>> http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648. The submitter
>>> of the bug report claims that this patch fixes the problem. Perhaps it
>>> does.
>>
>> I once had a chance to analyze the bug you refer to here. The patch
>> (http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648) does not fix that
>> bug. The root cause of that bug is: when the DPDK application gets
>> killed, nobody tells the vhost backend to stop. So when the DPDK app is
>> restarted, those hugepages are reused and initialized to all zero, and
>> the "idx" fields stored in that memory are inevitably reset to 0. I have
>> written a patch that notifies the vhost backend to stop when the DPDK
>> app gets killed suddenly (more accurately, when /dev/uioX gets closed),
>> and it will be sent out soon. That patch does not fix your problem
>> either, since you did not kill the app. (I should not mix discussion of
>> that bug in here; my point is that they are different problems.)
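
For context, the "idx" fields in question are plain free-running counters
in the ring structures that live in that shared hugepage memory; roughly,
per the virtio split-ring layout (a sketch for illustration, not the exact
DPDK definitions):

    /* Avail ring: the driver (guest) bumps idx as it posts buffers. */
    struct vring_avail {
            uint16_t flags;
            uint16_t idx;       /* free-running, written by the driver */
            uint16_t ring[];
    };

    /* Used ring: the device/backend bumps idx as it returns buffers. */
    struct vring_used_elem {
            uint32_t id;
            uint32_t len;
    };

    struct vring_used {
            uint16_t flags;
            uint16_t idx;       /* free-running, written by the backend */
            struct vring_used_elem ring[];
    };

QEMU keeps its own shadow copies of these counters, so if the ring memory
is wiped or re-initialized underneath it, the values in guest memory jump
backwards relative to the shadows, which is the kind of mismatch behind
errors like "Guest moved used index from 5 to 1".
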
>>
>>> However, it introduces a new problem: if I remove the patch, I cannot
>>> reproduce the problem. So that leads me to believe that it has caused
>>> a regression.
>>> To summarize the patch's changes: it changes the virtio_dev_stop
>>> function to flag the device as stopped, and it stops the device when
>>> closing/uninitializing it. However, there is a seemingly unintended
>>> side effect.
>>> In virtio_dev_start, we have the following block of code:
>>>
>>> 	/* On restart after stop do not touch queues */
>>> 	if (hw->started)
>>> 		return 0;
>>>
>>> 	/* Do final configuration before rx/tx engine starts */
>>> 	virtio_dev_rxtx_start(dev);
>>>
>>> ….
>>>
>>> Prior to the patch, if an interface were stopped then started, without
>>> restarting the application, the queues would be left as-is, because
>>> hw->started would be set to 1.
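
Concretely (a paraphrased sketch of the change, not the exact DPDK
sources), the commit amounts to clearing that flag on stop, which defeats
the early return quoted above on the next start:

    /* Sketch only, paraphrasing commit 9a0615af7746: virtio_dev_stop()
     * now flags the device as stopped. */
    static void
    virtio_dev_stop(struct rte_eth_dev *dev)
    {
            struct virtio_hw *hw = dev->data->dev_private;

            hw->started = 0;    /* before the patch this stayed at 1 */
            /* existing link-down handling omitted from this sketch */
    }

So on the next virtio_dev_start() the "do not touch queues" shortcut no
longer fires, virtio_dev_rxtx_start() re-initializes the rings in guest
memory, and the vhost backend is left holding stale indices, which lines
up with the failure reported at the top of this thread.
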
>>
>> Yes, my previous patch did break this behavior (that on stopping and
>> re-starting the device, the queues would be left as-is), and that leads
>> to the problem here. And you are proposing to restore it.
>>
>> But is this the right behavior to follow?
>> On the one hand, when we stop the virtio device, should we notify the
>> vhost backend to stop too? Currently, we just flag it as stopped.
>> On the other hand, many PMDs now mix the device initialization and
>> configuration code into their dev_start functions; that is to say, we
>> re-configure the device every time we start it (maybe to make sure that
>> any changed configuration takes effect). Then what if we increase the
>> number of virtio queues?
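
For what it is worth, one possible direction for a longer-term fix (an
untested sketch of the idea only, not a proposed patch) would be to make
stop/start symmetric: actually reset the device on stop, so QEMU/vhost is
told to stop the rings, and then redo negotiation and queue setup on start
instead of taking the early return. The PMD's existing vtpci_reset()
helper, which writes status 0, could serve for the stop side:

    /* Untested sketch of the idea discussed above. */
    static void
    virtio_dev_stop(struct rte_eth_dev *dev)
    {
            struct virtio_hw *hw = dev->data->dev_private;

            vtpci_reset(hw);    /* status 0: backend stops the virtqueues */
            hw->started = 0;
    }

The start side would then have to renegotiate features and rebuild the
rings every time, which is also where a changed queue count could take
effect.
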
> Hi Jianfeng,
>
>
> Let me see if my understanding is correct. You're saying that there is a problem introduced by your patch, but going back to the original behaviour is not ideal: it still leaves a gap with respect to changing the number of queues with virtio. The ideal solution would have us properly tear down the device when it is stopped, and then properly reinitialize it when it is started? That seems reasonable. However, I'm not sure what it would take.

Yes, exactly.

> Does it make sense to submit a patch like I suggested in the short term, and have somebody work on a more robust long term solution? How could we go about making sure that happens? (I can't allocate much time to fixing this, so I likely won't be able to do anything more complicated than what I've proposed).

Let's see if the maintainers have any suggestions here. (And I personally
do not oppose your proposal for the short term.)

Thanks,
Jianfeng