OVS DPDK DMA-Dev library/Design Discussion

Maxime Coquelin maxime.coquelin at redhat.com
Wed Mar 30 11:25:05 CEST 2022



On 3/30/22 04:02, Hu, Jiayu wrote:
> 
> 
>> -----Original Message-----
>> From: Ilya Maximets <i.maximets at ovn.org>
>> Sent: Wednesday, March 30, 2022 1:45 AM
>> To: Morten Brørup <mb at smartsharesystems.com>; Richardson, Bruce
>> <bruce.richardson at intel.com>
>> Cc: i.maximets at ovn.org; Maxime Coquelin <maxime.coquelin at redhat.com>;
>> Van Haaren, Harry <harry.van.haaren at intel.com>; Pai G, Sunil
>> <sunil.pai.g at intel.com>; Stokes, Ian <ian.stokes at intel.com>; Hu, Jiayu
>> <jiayu.hu at intel.com>; Ferriter, Cian <cian.ferriter at intel.com>; ovs-
>> dev at openvswitch.org; dev at dpdk.org; Mcnamara, John
>> <john.mcnamara at intel.com>; O'Driscoll, Tim <tim.odriscoll at intel.com>;
>> Finn, Emma <emma.finn at intel.com>
>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>
>> On 3/29/22 19:13, Morten Brørup wrote:
>>>> From: Bruce Richardson [mailto:bruce.richardson at intel.com]
>>>> Sent: Tuesday, 29 March 2022 19.03
>>>>
>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
>>>>>> From: Maxime Coquelin [mailto:maxime.coquelin at redhat.com]
>>>>>> Sent: Tuesday, 29 March 2022 18.24
>>>>>>
>>>>>> Hi Morten,
>>>>>>
>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haaren at intel.com]
>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
>>>>>>>>
>>>>>>>>> From: Morten Brørup <mb at smartsharesystems.com>
>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
>>>>>>>>>
>>>>>>>>> Having thought more about it, I think that a completely
>>>>>>>>> different architectural approach is required:
>>>>>>>>>
>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
>>>>>>>>> packet burst functions, each optimized for different CPU vector
>>>>>>>>> instruction sets. The availability of a DMA engine should be
>>>>>>>>> treated the same way. So I suggest that PMDs copying packet
>>>>>>>>> contents, e.g. memif, pcap, vmxnet3, should implement DMA
>>>>>>>>> optimized RX and TX packet burst functions.
>>>>>>>>>
>>>>>>>>> Similarly for the DPDK vhost library.
>>>>>>>>>
>>>>>>>>> In such an architecture, it would be the application's job to
>>>>>>>>> allocate DMA channels and assign them to the specific PMDs that
>>>>>>>>> should use them. But the actual use of the DMA channels would
>>>>>>>>> move down below the application and into the DPDK PMDs and
>>>>>>>>> libraries.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Med venlig hilsen / Kind regards, -Morten Brørup
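A minimal sketch of the application-side step described above, using the
DPDK dmadev API: the application allocates and starts a DMA channel and
only hands the resulting device id down to the PMD or library that will
use it. The queue depth and the error handling are illustrative only.

#include <rte_dmadev.h>

static int
app_alloc_dma_channel(int16_t dma_dev_id)
{
	struct rte_dma_conf dev_conf = { .nb_vchans = 1 };
	struct rte_dma_vchan_conf vchan_conf = {
		.direction = RTE_DMA_DIR_MEM_TO_MEM,
		.nb_desc = 1024,	/* illustrative queue depth */
	};

	if (rte_dma_configure(dma_dev_id, &dev_conf) != 0)
		return -1;
	if (rte_dma_vchan_setup(dma_dev_id, 0, &vchan_conf) != 0)
		return -1;
	if (rte_dma_start(dma_dev_id) != 0)
		return -1;

	/* The id is now passed down to the PMD / vhost library; the
	 * actual copies are issued below the application. */
	return 0;
}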
>>>>>>>>
>>>>>>>> Hi Morten,
>>>>>>>>
>>>>>>>> That's *exactly* how this architecture is designed & implemented.
>>>>>>>> 1.	The DMA configuration and initialization is up to the
>>>>>>>> application (OVS).
>>>>>>>> 2.	The VHost library is passed the DMA-dev ID, and its new async
>>>>>>>> rx/tx APIs, and uses the DMA device to accelerate the copy.
>>>>>>>>
>>>>>>>> Looking forward to talking on the call that just started.
>>>>>>>> Regards, -Harry
>>>>>>>>
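For reference, a sketch of how that split looks with the dmadev-based
vhost async API from the DPDK 22.03 timeframe. The exact prototypes
changed between releases, so treat the signatures below as assumptions
rather than as the authoritative API.

#include <rte_mbuf.h>
#include <rte_vhost.h>
#include <rte_vhost_async.h>

/* Control path: the application tells vhost which DMA channel it may
 * use and registers the virtqueue for async operation. */
static int
app_enable_async(int vid, uint16_t queue_id, int16_t dma_id, uint16_t vchan)
{
	if (rte_vhost_async_dma_configure(dma_id, vchan) != 0)
		return -1;
	return rte_vhost_async_channel_register(vid, queue_id);
}

/* Datapath: the copies are hidden inside the library; the application
 * only submits bursts and reaps completed mbufs. */
static void
app_vhost_tx(int vid, uint16_t queue_id, struct rte_mbuf **pkts,
	     uint16_t count, int16_t dma_id, uint16_t vchan)
{
	struct rte_mbuf *done[32];
	uint16_t n;

	rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count,
				       dma_id, vchan);
	n = rte_vhost_poll_enqueue_completed(vid, queue_id, done,
					     RTE_DIM(done), dma_id, vchan);
	rte_pktmbuf_free_bulk(done, n);
}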
>>>>>>>
>>>>>>> OK, thanks - as I said on the call, I haven't looked at the
>>>>>>> patches.
>>>>>>>
>>>>>>> Then, I suppose that the TX completions can be handled in the TX
>>>>>>> function, and the RX completions can be handled in the RX function,
>>>>>>> just like the Ethdev PMDs handle packet descriptors:
>>>>>>>
>>>>>>> TX_Burst(tx_packet_array):
>>>>>>> 1.	Clean up descriptors processed by the NIC chip. --> Process
>>>>>>> TX DMA channel completions. (Effectively, the 2nd pipeline stage.)
>>>>>>> 2.	Pass on the tx_packet_array to the NIC chip descriptors. -->
>>>>>>> Pass on the tx_packet_array to the TX DMA channel. (Effectively,
>>>>>>> the 1st pipeline stage.)
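A sketch of that two-stage pattern expressed with the dmadev API. Only
the rte_dma_* calls are real DPDK API; struct dma_txq and the txq_*
helpers are hypothetical bookkeeping.

#include <stdbool.h>
#include <stdint.h>
#include <rte_dmadev.h>
#include <rte_mbuf.h>

struct dma_txq {
	int16_t dma_id;
	uint16_t vchan;
	/* ... descriptor ring state ... */
};

/* Hypothetical helpers: mark 'done' descriptors available to the peer,
 * and return the destination iova for a packet's payload. */
static void txq_complete_descs(struct dma_txq *q, uint16_t done);
static rte_iova_t txq_dst_iova(struct dma_txq *q, struct rte_mbuf *pkt);

static uint16_t
dma_tx_burst(struct dma_txq *q, struct rte_mbuf **pkts, uint16_t nb)
{
	uint16_t last_idx, done, i;
	bool error = false;

	/* Stage 2: reap copies that finished since the last call and
	 * only then make their descriptors visible. */
	done = rte_dma_completed(q->dma_id, q->vchan, UINT16_MAX,
				 &last_idx, &error);
	txq_complete_descs(q, done);

	/* Stage 1: hand the new packets to the DMA channel. */
	for (i = 0; i < nb; i++) {
		if (rte_dma_copy(q->dma_id, q->vchan,
				 rte_mbuf_data_iova(pkts[i]),
				 txq_dst_iova(q, pkts[i]),
				 rte_pktmbuf_data_len(pkts[i]), 0) < 0)
			break;
	}
	rte_dma_submit(q->dma_id, q->vchan);
	return i;
}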
>>>>>>
>>>>>> The problem is that the Tx function might not be called again, so
>>>>>> the packets enqueued in 2. may never be completed from a Virtio
>>>>>> point of view. IOW, the packets will be copied to the Virtio
>>>>>> descriptors' buffers, but the descriptors will not be made
>>>>>> available to the Virtio driver.
>>>>>
>>>>> In that case, the application needs to call TX_Burst() periodically
>>>>> with an empty array, for completion purposes.
>>>>>
>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK
>>>>> library, to handle DMA completion. It might even handle multiple DMA
>>>>> channels, if convenient - and if possible without locking or other
>>>>> weird complexity.
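In terms of the dma_tx_burst() sketch earlier, either variant reduces to
periodically driving stage 2 with nothing to submit; a TX_Keepalive()
style wrapper (the name is hypothetical, as above) would be little more
than:

static inline void
tx_keepalive(struct dma_txq *q)
{
	/* No new packets; this only reaps DMA completions and flushes
	 * the corresponding descriptors to the guest. */
	dma_tx_burst(q, NULL, 0);
}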
>>>>>
>>>>> Here is another idea, inspired by a presentation at one of the DPDK
>>>>> Userspace conferences. It may be wishful thinking, though:
>>>>>
>>>>> Add an additional transaction to each DMA burst; a special
>>>>> transaction containing the memory write operation that makes the
>>>>> descriptors available to the Virtio driver.
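A sketch of what that extra transaction could look like with dmadev:
after the payload copies, one more fenced copy overwrites the
guest-visible used-ring index from a host-side shadow, so no CPU
write-back is needed. The two iova parameters are hypothetical
bookkeeping; the flags are real dmadev flags.

#include <rte_dmadev.h>

static int
dma_enqueue_index_update(int16_t dma_id, uint16_t vchan,
			 rte_iova_t shadow_used_idx,
			 rte_iova_t guest_used_idx)
{
	/* The fence keeps this copy from being executed before the
	 * payload copies already queued on the same vchan. */
	return rte_dma_copy(dma_id, vchan, shadow_used_idx,
			    guest_used_idx, sizeof(uint16_t),
			    RTE_DMA_OP_FLAG_FENCE | RTE_DMA_OP_FLAG_SUBMIT);
}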
>>
>> I was talking with Maxime after the call today about the same idea.
>> And it looks fairly doable, I would say.
> 
> If the idea is to make the DMA engine update the used ring's index (2B) and
> the packed ring descriptor's flags (2B), yes, it will work functionally. But
> considering the offloading cost of DMA, it would hurt performance. In
> addition, the latency of a small DMA copy is much higher than that of a CPU
> copy, so it will also increase latency.

I agree that writing back descriptors using DMA can be sub-optimal,
especially for the packed ring, where the head descriptor's flags have
to be written last.
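For context, the ordering constraint on the packed ring looks roughly
like this on the CPU side: the payload and all other descriptor fields
must be visible before the head descriptor's flags flip ownership, hence
the release store last. The struct below is a simplified local mirror of
the packed descriptor layout, for illustration only.

#include <stdint.h>

struct packed_desc {		/* simplified packed-ring descriptor */
	uint64_t addr;
	uint32_t len;
	uint16_t id;
	uint16_t flags;
};

static inline void
packed_ring_publish(struct packed_desc *head, uint16_t flags)
{
	/* Payload copies (CPU or DMA) and all other descriptor fields
	 * must be complete before this store. */
	__atomic_store_n(&head->flags, flags, __ATOMIC_RELEASE);
}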

Are you sure about latency? With the current solution, the descriptor
write-backs can happen quite some time after the DMA transfers are
done, can't they?

>>
>>>>>
>>>>
>>>> That is something that can work, so long as the receiver is operating
>>>> in polling mode. For cases where virtio interrupts are enabled, you
>>>> still need to do a write to the eventfd in the kernel in vhost to
>>>> signal the virtio side. That's not something that can be offloaded to
>>>> a DMA engine, sadly, so we still need some form of completion call.
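That completion call ultimately ends in a write to the virtqueue's call
eventfd, which a DMA engine cannot issue on our behalf. A simplified
version of the kick (callfd being the fd vhost received from the
frontend; the interrupt-suppression checks are elided):

#include <sys/eventfd.h>

static void
vq_kick_guest(int callfd)
{
	/* Any non-zero value wakes up the guest's interrupt path. */
	eventfd_write(callfd, (eventfd_t)1);
}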
>>>
>>> I guess that virtio interrupts are the most widely deployed scenario,
>>> so let's ignore the DMA TX completion transaction for now - and call
>>> it a possible future optimization for specific use cases. So it seems
>>> that some form of completion call is unavoidable.
>>>
>>
>> We could separate the actual kick of the guest from the data transfer.
>> If interrupts are enabled, this means that the guest is not actively
>> polling, i.e. we can allow some extra latency by performing the actual
>> kick from the rx context, or, as Maxime said, if the DMA engine can
>> generate interrupts when the DMA queue is empty, the vhost thread may
>> listen to them and kick the guest if needed. This will additionally
>> remove the extra system call from the fast path.
> 
> Separating the kick from the data transfer is a very good idea. But it
> requires a dedicated control-plane thread to kick the guest after a DMA
> interrupt. Anyway, we can try this optimization in the future.

Yes, it requires a dedicated thread, but I don't think this is really
an issue. Interrupt mode can be considered a slow path.
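A rough sketch of that slow-path arrangement: a dedicated thread blocks
until it learns that the DMA queue behind a virtqueue has drained, then
performs the kick, keeping the eventfd system call out of the PMD fast
path. How the "DMA idle" notification is delivered (a dmadev interrupt,
an eventfd signalled by the datapath, ...) is exactly what is left open
here, so wait_for_dma_idle() is purely hypothetical.

#include <stdbool.h>
#include <sys/eventfd.h>

struct kick_ctx {
	int callfd;		/* vhost-user call eventfd for the virtqueue */
	bool intr_enabled;	/* guest is not actively polling */
};

/* Hypothetical: blocks until the DMA channel serving this virtqueue
 * has no in-flight copies left. */
static void wait_for_dma_idle(struct kick_ctx *ctx);

static void *
kick_thread(void *arg)
{
	struct kick_ctx *ctx = arg;

	for (;;) {
		wait_for_dma_idle(ctx);
		if (ctx->intr_enabled)
			eventfd_write(ctx->callfd, 1);
	}
	return NULL;
}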

> 
> Thanks,
> Jiayu


