[dpdk-dev] vhost compliant virtio based networking interface in container

Xie, Huawei huawei.xie at intel.com
Mon Sep 14 05:15:52 CEST 2015


On 9/8/2015 12:45 PM, Tetsuya Mukawa wrote:
> On 2015/09/07 14:54, Xie, Huawei wrote:
>> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>>> On 2015/08/25 18:56, Xie, Huawei wrote:
>>>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>>>>> Hi Xie and Yanping,
>>>>>
>>>>>
>>>>> May I ask you some questions?
>>>>> It seems we are also developing almost the same thing.
>>>> Good to know that we are tackling the same problem and have a similar
>>>> idea.
>>>> What is your status now? We have the POC running, and it is compliant
>>>> with DPDK vhost.
>>>> Interrupt-like notification isn't supported.
>>> We implemented the vhost PMD first, so we have just started implementing this one.
>>>
>>>>> On 2015/08/20 19:14, Xie, Huawei wrote:
>>>>>> Added dev at dpdk.org
>>>>>>
>>>>>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>>>>>> Yanping:
>>>>>>> I read your mail; it seems what we did is quite similar. Here I wrote a
>>>>>>> quick mail to describe our design. Let me know if it is the same thing.
>>>>>>>
>>>>>>> Problem Statement:
>>>>>>> We don't have a high-performance networking interface in containers for
>>>>>>> NFV. The current veth-pair-based interface can't be easily accelerated.
>>>>>>>
>>>>>>> The key components involved:
>>>>>>>     1.    DPDK-based virtio PMD driver in the container.
>>>>>>>     2.    Device simulation framework in the container.
>>>>>>>     3.    DPDK (or kernel) vhost running on the host.
>>>>>>>
>>>>>>> How is virtio created?
>>>>>>> A:  There is no "real" virtio-pci device in the container environment.
>>>>>>> 1) The host maintains pools of memory and shares memory with the container.
>>>>>>> This could be accomplished by the host sharing a hugepage file with the container.
>>>>>>> 2) The container creates virtio rings on the shared memory.
>>>>>>> 3) The container creates mbuf memory pools on the shared memory.
>>>>>>> 4) The container sends the memory and vring information to vhost through
>>>>>>> vhost messages. This could be done either through an ioctl call or a
>>>>>>> vhost-user message.
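>>>>>>>
>>>>>>> (A minimal sketch of step 1), assuming a hypothetical shared hugepage
>>>>>>> file path; the rings and mbuf pools of steps 2) and 3) would then be
>>>>>>> placed inside this mapping. This is illustrative, not the actual POC code:)
>>>>>>>
>>>>>>> #include <fcntl.h>
>>>>>>> #include <sys/mman.h>
>>>>>>> #include <unistd.h>
>>>>>>>
>>>>>>> #define SHARED_HUGEFILE "/mnt/huge/container1/rtemap_0" /* hypothetical */
>>>>>>> #define REGION_SIZE     (256UL << 20)                   /* example size */
>>>>>>>
>>>>>>> /* Map the hugepage file the host shares with the container; virtio
>>>>>>>  * rings and mbuf pools are then carved out of this region. */
>>>>>>> static void *map_shared_region(void)
>>>>>>> {
>>>>>>>         int fd = open(SHARED_HUGEFILE, O_RDWR);
>>>>>>>         void *base;
>>>>>>>
>>>>>>>         if (fd < 0)
>>>>>>>                 return NULL;
>>>>>>>         base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
>>>>>>>                     MAP_SHARED, fd, 0);
>>>>>>>         close(fd);
>>>>>>>         return base == MAP_FAILED ? NULL : base;
>>>>>>> }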
>>>>>>>
>>>>>>> How is the vhost message sent?
>>>>>>> A: There are two alternative ways to do this.
>>>>>>> 1) The customized virtio PMD is responsible for all vring creation
>>>>>>> and vhost message sending.
>>>>> Above is our approach so far.
>>>>> It seems Yanping also takes this kind of approach.
>>>>> We are using vhost-user functionality instead of using the vhost-net
>>>>> kernel module.
>>>>> Probably this is the difference between Yanping and us.
>>>> In my current implementation, the device simulation layer talks to "user
>>>> space" vhost through the cuse interface. It could also be done through the
>>>> vhost-user socket. This isn't the key point.
>>>> Here "vhost-user" is kind of confusing; maybe "user space vhost" is more
>>>> accurate, covering either cuse or the unix domain socket. :)
>>>>
>>>> As for Yanping, they are now connecting to the vhost-net kernel module, but
>>>> they are also trying to connect to "user space" vhost. Correct me if I am wrong.
>>>> Yes, there is some difference between these two. The vhost-net kernel module
>>>> can directly access another process's memory, while with user space vhost
>>>> (cuse/socket) we need to do the memory mapping ourselves.
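>>>>
>>>> (For illustration only, not the librte_vhost code: in the vhost-user
>>>> socket case, the memory mapping works by passing the fd that backs the
>>>> shared memory over the unix socket as SCM_RIGHTS ancillary data, so the
>>>> vhost side can mmap() the same pages. Roughly:)
>>>>
>>>> #include <string.h>
>>>> #include <sys/socket.h>
>>>> #include <sys/uio.h>
>>>>
>>>> /* Send a message plus one file descriptor (e.g. the hugepage file
>>>>  * backing a guest memory region) over the vhost-user unix socket. */
>>>> static ssize_t send_with_fd(int sock, void *payload, size_t len, int fd)
>>>> {
>>>>         struct iovec iov = { .iov_base = payload, .iov_len = len };
>>>>         char cmsgbuf[CMSG_SPACE(sizeof(int))];
>>>>         struct msghdr msg = {
>>>>                 .msg_iov = &iov, .msg_iovlen = 1,
>>>>                 .msg_control = cmsgbuf, .msg_controllen = sizeof(cmsgbuf),
>>>>         };
>>>>         struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
>>>>
>>>>         cmsg->cmsg_level = SOL_SOCKET;
>>>>         cmsg->cmsg_type = SCM_RIGHTS;
>>>>         cmsg->cmsg_len = CMSG_LEN(sizeof(int));
>>>>         memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
>>>>         return sendmsg(sock, &msg, 0);
>>>> }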
>>>>> BTW, we are going to submit a vhost PMD for DPDK-2.2.
>>>>> This PMD is implemented on librte_vhost.
>>>>> It allows a DPDK application to handle a vhost-user (or vhost-cuse) backend
>>>>> as a normal NIC port.
>>>>> This PMD should work with both Xie's and Yanping's approaches.
>>>>> (In the case of Yanping's approach, we may need vhost-cuse.)
>>>>>
>>>>>>> 2) We could do this through a lightweight device simulation framework.
>>>>>>>     The device simulation creates a simple PCI bus. On the PCI bus,
>>>>>>> virtio-net PCI devices are created. The device simulation provides an
>>>>>>> IOAPI for MMIO/IO access.
>>>>> Does it mean you implemented a kernel module?
>>>>> If so, do you still need vhost-cuse functionality to handle vhost
>>>>> messages in userspace?
>>>> The device simulation is a library running in user space in the container.
>>>> It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI
>>>> devices.
>>>> The virtio-container-PMD configures the virtio-net pseudo devices
>>>> through the IOAPI provided by the device simulation rather than through IO
>>>> instructions as in KVM.
>>>> Why do we use device simulation?
>>>> We could create other virtio devices in the container, and provide a common
>>>> way to talk to the vhost-xx module.
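>>>>
>>>> (The IOAPI might look roughly like this; the names are purely
>>>> illustrative, not the actual framework interface:)
>>>>
>>>> #include <stdint.h>
>>>>
>>>> /* Hypothetical IO callbacks the PMD calls instead of inb/outb or MMIO
>>>>  * loads/stores; the device simulation backs them with the emulated
>>>>  * virtio-net device state. */
>>>> struct ioport_ops {
>>>>         uint32_t (*pio_read)(void *dev, uint64_t offset, unsigned int size);
>>>>         void     (*pio_write)(void *dev, uint64_t offset, unsigned int size,
>>>>                               uint32_t value);
>>>> };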
>>> Thanks for the explanation.
>>> At first reading, I thought the difference between approach 1 and
>>> approach 2 was whether we need to implement a new kernel module or not.
>>> But now I understand how you implemented it.
>>>
>>> Please let me explain our design more.
>>> We might use a somewhat similar approach to handle a pseudo virtio-net
>>> device in DPDK.
>>> (Anyway, we haven't finished implementing it yet, so this overview might
>>> have some technical problems.)
>>>
>>> Step1. Separate the virtio-net and vhost-user socket related code from QEMU,
>>> then implement it as a separate program.
>>> The program also has the below features.
>>>  - Create a directory that contains almost the same files as
>>> /sys/bus/pci/devices/<pci address>/*
>>>    (To scan these files located outside sysfs, we need to fix EAL)
>>>  - This dummy device is driven by 'dummy-virtio-net-driver'. This name is
>>> specified by the '<pci addr>/driver' file.
>>>  - Create a shared file that represents the PCI configuration space, mmap
>>> it, and also specify the path in the '<pci addr>/resource_path' file.
>>>
>>> The program will be GPL, but it will act like a bridge on the shared
>>> memory between the virtio-net PMD and the DPDK vhost backend.
>>> Actually, it will work under the virtio-net PMD, but we don't need to link
>>> against it. So I guess we don't have a GPL license issue.
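>>>
>>> (Just to illustrate the sysfs-like directory idea; the paths and helper
>>> below are made up, while 0x1af4/0x1000 are the usual legacy virtio-net
>>> PCI IDs:)
>>>
>>> #include <stdio.h>
>>>
>>> /* Write one small text attribute file, mimicking a sysfs entry. */
>>> static void write_attr(const char *dir, const char *name, const char *val)
>>> {
>>>         char path[256];
>>>         FILE *f;
>>>
>>>         snprintf(path, sizeof(path), "%s/%s", dir, name);
>>>         f = fopen(path, "w");
>>>         if (f != NULL) {
>>>                 fprintf(f, "%s\n", val);
>>>                 fclose(f);
>>>         }
>>> }
>>>
>>> /* e.g.
>>>  * write_attr("/tmp/dummy-pci/0000:00:01.0", "vendor", "0x1af4");
>>>  * write_attr("/tmp/dummy-pci/0000:00:01.0", "device", "0x1000");
>>>  * write_attr("/tmp/dummy-pci/0000:00:01.0", "driver", "dummy-virtio-net-driver");
>>>  * write_attr("/tmp/dummy-pci/0000:00:01.0", "resource_path", "/tmp/dummy-pci/bar0");
>>>  */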
>>>
>>> Step2. Fix the PCI scan code of EAL to scan dummy devices.
>>>  - To scan the above files, extend pci_scan() of EAL.
>>>
>>> Step3. Add a new kdrv type to EAL.
>>>  - To handle the 'dummy-virtio-net-driver', add a new kdrv type to EAL.
>>>
>>> Step4. Implement pci_dummy_virtio_net_map/unmap().
>>>  - It will have almost the same functionality as pci_uio_map(), but for the
>>> dummy virtio-net device.
>>>  - The dummy device will be mmapped using the path specified in '<pci
>>> addr>/resource_path'.
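>>>
>>> (A rough sketch of what such a map function could do under the above
>>> assumptions; none of this is existing EAL code:)
>>>
>>> #include <fcntl.h>
>>> #include <stdio.h>
>>> #include <string.h>
>>> #include <sys/mman.h>
>>> #include <unistd.h>
>>>
>>> /* Read the backing file path from '<pci addr>/resource_path', then mmap
>>>  * it so the PMD can reach the dummy device's shared config space. */
>>> static void *dummy_virtio_net_map(const char *resource_path_file, size_t len)
>>> {
>>>         char path[256];
>>>         FILE *f = fopen(resource_path_file, "r");
>>>         void *addr;
>>>         int fd;
>>>
>>>         if (f == NULL)
>>>                 return NULL;
>>>         if (fgets(path, sizeof(path), f) == NULL) {
>>>                 fclose(f);
>>>                 return NULL;
>>>         }
>>>         fclose(f);
>>>         path[strcspn(path, "\n")] = '\0';
>>>
>>>         fd = open(path, O_RDWR);
>>>         if (fd < 0)
>>>                 return NULL;
>>>         addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>>         close(fd);
>>>         return addr == MAP_FAILED ? NULL : addr;
>>> }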
>>>
>>> Step5. Add a new compile option for the virtio-net device to replace the IO
>>> functions.
>>>  - The IO functions of the virtio-net PMD will be replaced by read() and
>>> write() accesses to the shared memory.
>>>  - Add a notification mechanism to the IO functions. This will be used when
>>> a write() to the shared memory is done.
>>>  (Not sure exactly, but we probably need it.)
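>>>
>>> (A minimal sketch of what the replaced IO accessors might look like,
>>> assuming the config space is the mmap()ed shared file and an eventfd-like
>>> fd carries the notification; all names here are hypothetical:)
>>>
>>> #include <stdint.h>
>>> #include <unistd.h>
>>>
>>> struct dummy_virtio_hw {
>>>         volatile uint8_t *cfg;  /* mmap()ed shared config/BAR file */
>>>         int kick_fd;            /* notification fd shared with the backend */
>>> };
>>>
>>> /* Register access becomes a plain load/store on the shared mapping. */
>>> static uint8_t dummy_io_read8(struct dummy_virtio_hw *hw, unsigned int off)
>>> {
>>>         return hw->cfg[off];
>>> }
>>>
>>> static void dummy_io_write8(struct dummy_virtio_hw *hw, unsigned int off,
>>>                             uint8_t val)
>>> {
>>>         hw->cfg[off] = val;
>>> }
>>>
>>> /* Queue notify: wake the backend instead of issuing an IO instruction. */
>>> static void dummy_notify_queue(struct dummy_virtio_hw *hw)
>>> {
>>>         uint64_t one = 1;
>>>
>>>         (void)write(hw->kick_fd, &one, sizeof(one));
>>> }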
>>>
>>> Does it make sense?
>>> I guess Steps 1 & 2 are different from your approach, but the rest might be
>>> similar.
>>>
>>> Actually, we just need sysfs entries for a virtio-net dummy device, but
>>> so far I don't have a good way to register them from user space without
>>> loading a kernel module.
>> Tetsuya:
>> I don't quite get the details. Who will create those sysfs entries? A
>> kernel module, right?
> Hi Xie,
>
> I don't create sysfs entries. I just create a directory that contains
> files that look like sysfs entries,
> and initialize EAL with not only sysfs but also the above directory.
>
> In the last quoted sentence, I wanted to say that we just need files that
> look like sysfs entries.
> But I don't know a good way to create files under sysfs without loading a
> kernel module.
> That is why I try to create the additional directory instead.
>
>> The virtio-net device is configured through reads/writes to shared
>> memory (between host and guest), right?
> Yes, I agree.
>
>> Where are the shared vring and shared memory created? On a shared hugepage
>> between host and guest?
> The virtqueues (vrings) are on the guest hugepage.
>
> Let me explain.
> The guest container should have read/write access to a part of the hugepage
> directory on the host.
> (For example, /mnt/huge/container1/ is shared between host and guest.)
> Also, host and guest need to communicate through a unix domain socket.
> (For example, host and guest can communicate using
> "/tmp/container1/sock".)
>
> If we can do it like the above, a virtio-net PMD on the guest can create
> virtqueues (vrings) on its hugepage and write this information to a
> pseudo virtio-net device that is a process created in the guest container.
> Then the pseudo virtio-net device sends it to the vhost-user backend (the host
> DPDK application) through a unix domain socket.
>
> So with my plan, there are 3 processes:
> the DPDK applications on host and guest, plus a process that works like a
> virtio-net device.
>
>> Who will talk to dpdkvhost?
> If we need to talk to a cuse device or the vhost-net kernel module, the
> above pseudo virtio-net device could do the talking.
> (But, so far, my target is only vhost-user.)
>
>>> This is because I need to change pci_scan() also.
>>>
>>> It seems you have implemented a virtio-net pseudo device under the BSD license.
>>> If so, it would be nice for this kind of PMD to use it.
>> Currently it is based on the native Linux KVM tool.
> Great, I hadn't noticed this option.
>
>>> In case it takes a lot of time to implement some missing
>>> functionality like interrupt mode, using QEMU code might be one of the
>>> options.
>> For interrupt mode, I plan to use eventfd for sleep/wake, but I have not
>> tried it yet.
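>>
>> (Just a sketch of the eventfd sleep/wake idea, not a tested implementation;
>> names are illustrative:)
>>
>> #include <stdint.h>
>> #include <sys/eventfd.h>
>> #include <unistd.h>
>>
>> /* The PMD blocks on an eventfd shared with the backend; the backend
>>  * writes to it when new packets are queued, waking the rx thread. */
>> static int irq_fd;
>>
>> static int irq_init(void)
>> {
>>         irq_fd = eventfd(0, 0);
>>         return irq_fd;
>> }
>>
>> static void irq_sleep(void)
>> {
>>         uint64_t cnt;
>>
>>         (void)read(irq_fd, &cnt, sizeof(cnt));
>> }
>>
>> static void irq_wake(void)
>> {
>>         uint64_t one = 1;
>>
>>         (void)write(irq_fd, &one, sizeof(one));
>> }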
>>> Anyway, we just need a good virtual NIC between containers and the host,
>>> so we are not wedded to our particular approach and implementation.
>> Do you have comments on my implementation?
>> We could publish the version without the device framework first, for
>> reference.
> No, I don't. Could you please share it?
> I am looking forward to seeing it.
OK, we are removing the device framework. We hope to publish it in one
month's time.

>
> Tetsuya
>


