Bug 337
| Summary: | Live migration with dpdk (in host) + vhost-user + dpdk (in guest) fails: Failed to load virtio-net:virtio | | |
|---|---|---|---|
| Product: | DPDK | Reporter: | Pei Zhang (pezhang) |
| Component: | vhost/virtio | Assignee: | Maxime Coquelin (maxime.coquelin) |
| Status: | CONFIRMED | | |
| Severity: | normal | CC: | ajit.khaparde, amorenoz |
| Priority: | Normal | | |
| Version: | 19.08 | | |
| Target Milestone: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
Description
Pei Zhang, 2019-08-12 05:10:17 CEST
Maxime, can you please take a look? Thanks.

I have been able to reproduce this issue and bisected it to the following commit:

```
commit bbe29a9bd7ab6feab9a52051c32092a94ee886eb
Author: Jerin Jacob <jerinj@marvell.com>
Date:   Mon Jul 22 14:56:53 2019 +0200

    eal/linux: select IOVA as VA mode for default case

    When bus layer reports the preferred mode as RTE_IOVA_DC then select
    the RTE_IOVA_VA mode:

    - All drivers work in RTE_IOVA_VA mode, irrespective of physical
      address availability.
    - By default, a mempool asks for IOVA-contiguous memory using
      RTE_MEMZONE_IOVA_CONTIG. This is slow in RTE_IOVA_PA mode and it
      may affect the application boot time.

    Signed-off-by: Jerin Jacob <jerinj@marvell.com>
    Acked-by: Anatoly Burakov <anatoly.burakov@intel.com>
    Signed-off-by: David Marchand <david.marchand@redhat.com>
```

This commit only changes the default IOVA mode from IOVA_PA to IOVA_VA, so it merely reveals an underlying problem. I confirmed this by verifying that the affected version (19.08) works fine and stable with "--iova-mode pa", while the stable version (18.11) fails in the same manner if "--iova-mode va" is used.

On the QEMU side, the code that detects the error is:

```c
vdev->vq[i].inuse = (uint16_t)(vdev->vq[i].last_avail_idx -
                               vdev->vq[i].used_idx);
if (vdev->vq[i].inuse > vdev->vq[i].vring.num) {
    error_report("VQ %d size 0x%x < last_avail_idx 0x%x - "
                 "used_idx 0x%x",
                 i, vdev->vq[i].vring.num,
                 vdev->vq[i].last_avail_idx,
                 vdev->vq[i].used_idx);
    return -1;
}
```

One of the times I reproduced it, I looked at the index values on the sending QEMU just before it sent the vmstates:

    size 0x100 | last_avail_idx 0x3aa0 | used_idx 0x3aa0

And just after loading the vmstates on the receiving QEMU:

    VQ 0 size 0x100 < last_avail_idx 0x3aa0 - used_idx 0xbda0

At first I suspected an endianness issue, but then confirmed that virtio_lduw_phys_cached handles it properly. So it might be that the memory caches don't get properly synchronized before the migration takes place.
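The sanity check relies on unsigned 16-bit wraparound: the avail and used indexes are free-running counters, so their `uint16_t` difference gives the number of in-flight descriptors even after the counters wrap past 0xffff. A minimal standalone sketch of the same arithmetic (helper names are mine, not QEMU's):

```c
#include <stdint.h>

/* Number of in-flight descriptors. The avail/used indexes are free-running
 * 16-bit counters, so unsigned subtraction (modulo 2^16) stays correct
 * even after they wrap past 0xffff. */
static uint16_t vq_inuse(uint16_t last_avail_idx, uint16_t used_idx)
{
    return (uint16_t)(last_avail_idx - used_idx);
}

/* Mirrors QEMU's consistency check: in-flight descriptors can never
 * exceed the ring size; if they do, the restored state is corrupt. */
static int vq_state_valid(uint16_t vring_num, uint16_t last_avail_idx,
                          uint16_t used_idx)
{
    return vq_inuse(last_avail_idx, used_idx) <= vring_num;
}
```

With the values from the failing migration (ring size 0x100, last_avail_idx 0x3aa0, used_idx 0xbda0), `vq_inuse` yields 0x7d00, far above the ring size, so the destination rejects the device state.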
Further research confirms that the vhost backend is not logging the dirty pages properly when IOVA_VA is used. This makes QEMU send outdated pages to the destination, which causes the failure.
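For context on the logging path: per the vhost-user protocol, the dirty log is a bitmap with one bit per 4 KiB page of guest-physical memory, indexed by guest-physical address (GPA). The sketch below (my own illustrative function, loosely modeled on DPDK's vhost log write path, not its actual code) shows the GPA-keyed marking:

```c
#include <stdint.h>

#define VHOST_LOG_PAGE 4096ULL  /* one dirty bit per 4 KiB guest-physical page */

/* Illustrative sketch: mark the guest-physical pages touched by a write of
 * `len` bytes at `gpa` as dirty in the log bitmap shared with QEMU, so the
 * pre-copy loop re-transfers them. Indexing MUST use the GPA. */
static void log_write_gpa(uint8_t *log_base, uint64_t log_size,
                          uint64_t gpa, uint64_t len)
{
    uint64_t page;

    if (len == 0)
        return;
    for (page = gpa / VHOST_LOG_PAGE;
         page <= (gpa + len - 1) / VHOST_LOG_PAGE;
         page++) {
        if (page / 8 < log_size)
            log_base[page / 8] |= (uint8_t)(1u << (page % 8));
    }
}
```

The failure reported here is consistent with the backend feeding an IOVA (equal to a host virtual address in IOVA_VA mode) instead of a GPA into this kind of indexing: the wrong bits get set, QEMU never resends the pages that actually changed, and the destination loads a stale used_idx.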