[dpdk-dev] [RFC PATCH] net/virtio: Align Virtio-net header on cache line in receive path

Maxime Coquelin maxime.coquelin at redhat.com
Wed Mar 1 08:36:24 CET 2017



On 02/23/2017 06:49 AM, Yuanhan Liu wrote:
> On Wed, Feb 22, 2017 at 10:36:36AM +0100, Maxime Coquelin wrote:
>>
>>
>> On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
>>> On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
>>>> This patch aligns the Virtio-net header on a cache-line boundary to
>>>> optimize cache utilization, as it puts the Virtio-net header (which
>>>> is always accessed) on the same cache line as the packet header.
>>>>
>>>> For example with an application that forwards packets at L2 level,
>>>> a single cache-line will be accessed with this patch, instead of
>>>> two before.
>>>
>>> I'm assuming you were testing pkt size <= (64 - hdr_size)?
>>
>> No, I tested with 64-byte packets only.
>
> Oh, my bad, I overlooked it. While you were saying "a single cache
> line", I was thinking of putting the virtio net hdr and the "whole"
> packet data in a single cache line, which is not possible for a 64B
> pkt size.
>
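Right, only the vnet header and the first bytes of the packet data end
up in the same cache line. To make the layout change concrete, it boils
down to something like the sketch below. This is a simplified
illustration, not the actual patch code: it assumes the 12-byte
mergeable header, RTE_PKTMBUF_HEADROOM being a multiple of the
cache-line size, and the helper name is made up.

#include <rte_mbuf.h>

#define VNET_HDR_SIZE 12 /* sizeof(struct virtio_net_hdr_mrg_rxbuf), assumed */

static inline void *
rx_vnet_hdr_addr(struct rte_mbuf *m)
{
	/*
	 * Before: the header ends where the packet data begins (offset
	 * RTE_PKTMBUF_HEADROOM), so the header and the packet start sit
	 * in two different cache lines.
	 * With the alignment: the header itself starts on a cache-line
	 * boundary and the packet data follows it in the same line.
	 */
	m->data_off = RTE_PKTMBUF_HEADROOM + VNET_HDR_SIZE;
	return (char *)m->buf_addr + RTE_PKTMBUF_HEADROOM;
}
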
>> I ran some more tests this morning with different packet sizes,
>> and also changed the mbuf size on the guest side to get multi-
>> buffer packets:
>>
>> +-------+--------+--------+-------------------------+
>> | Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
>> +-------+--------+--------+-------------------------+
>> |    64 |   2048 |  11.05 |                   11.78 |
>> |   128 |   2048 |  10.66 |                   11.48 |
>> |   256 |   2048 |  10.47 |                   11.21 |
>> |   512 |   2048 |  10.22 |                   10.88 |
>> |  1024 |   2048 |   7.65 |                    7.84 |
>> |  1500 |   2048 |   6.25 |                    6.45 |
>> |  2000 |   2048 |   5.31 |                    5.43 |
>> |  2048 |   2048 |   5.32 |                    4.25 |
>> |  1500 |    512 |   3.89 |                    3.98 |
>> |  2048 |    512 |   1.96 |                    2.02 |
>> +-------+--------+--------+-------------------------+
>
> Could you share more info, say is it a PVP test? Is mergeable on?
> What's the fwd mode?

No, this is not a PVP benchmark; I have neither another server nor a
packet generator connected back-to-back to my Haswell machine.

This is a simple micro-benchmark: vhost PMD in txonly mode, Virtio PMD
in rxonly mode. In this configuration, mergeable is ON and no offloads
are disabled in the QEMU cmdline.

That's why I would be interested in more testing on recent hardware
with a PVP benchmark. Is that something that could be run in the Intel
lab?

I did some more trials, and I think that most of the gain seen in this
micro-benchmark may in fact come from the vhost side.
Indeed, I monitored the number of packets dequeued at each
.rx_pkt_burst() call, and I can see that the vq contains packets only
once every 20 calls. On the vhost side, monitoring shows that it always
succeeds in writing its bursts, i.e. the vq is never full.
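To give an idea of what I mean by monitoring on the virtio side, it is
roughly this kind of wrapper around the rx burst call in the test app
(hypothetical helper, not the actual code I used):

#include <rte_ethdev.h>

static uint64_t rx_calls, rx_nonempty, rx_pkts;

static inline uint16_t
monitored_rx_burst(uint8_t port, uint16_t queue,
		   struct rte_mbuf **pkts, uint16_t nb)
{
	uint16_t n = rte_eth_rx_burst(port, queue, pkts, nb);

	rx_calls++;
	if (n > 0) {
		rx_nonempty++;	/* here: roughly 1 call out of 20 */
		rx_pkts += n;
	}
	return n;
}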

>>>> In the case of multi-buffer packets, next segments will be aligned on
>>>> a cache-line boundary, instead of on a cache-line boundary minus the
>>>> size of the vnet header as before.
>>>
>>> The other thing is that this patch always makes the pkt data
>>> cache-unaligned for the first packet, which makes Zhihong's
>>> optimization on memcpy (for big packets) useless.
>>>
>>>    commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
>>>    Author: Zhihong Wang <zhihong.wang at intel.com>
>>>    Date:   Tue Dec 6 20:31:06 2016 -0500
>>
>> I did run some loopback tests with large packets as well, and I see a
>> small gain with my patch (fwd io on both ends):
>>
>> +-------+--------+--------+-------------------------+
>> | Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
>> +-------+--------+--------+-------------------------+
>> |  1500 |   2048 |   4.05 |                    4.14 |
>> +-------+--------+--------+-------------------------+
>
> Weird, that basically means Zhihong's patch doesn't work? Could you add
> one more column here: what's the data when rolling back to the point
> before Zhihong's commit?

I've added this to my ToDo list; don't expect results before next week.

>>>
>>>        Signed-off-by: Zhihong Wang <zhihong.wang at intel.com>
>>>        Reviewed-by: Yuanhan Liu <yuanhan.liu at linux.intel.com>
>>>        Tested-by: Lei Yao <lei.a.yao at intel.com>
>>
>> Does this need to be cache-line aligned?
>
> Nope, the alignment size differs between platforms: AVX512 needs 64B
> alignment, while AVX2 needs 32B alignment.
>
>> I also tried to align the pkt on a 16-byte boundary, basically putting
>> the header at a HEADROOM + 4 bytes offset, but I didn't measure any
>> gain on Haswell,
>
> The fast rte_memcpy path (when dst & src are well aligned) on Haswell
> (with AVX2) requires 32B alignment. Even a 16B boundary would still
> take the slow path. From this point of view, the extra pad does not
> change anything; thus, no gain is expected.
>
>> and even a drop on SandyBridge.
>
> That's weird: SandyBridge requires 16B alignment, meaning the extra
> pad should put it into the fast path of rte_memcpy, yet the
> performance is worse.

Thanks for the info, I will run more tests to try to explain this.
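The first thing I plan to check is whether we really end up in
rte_memcpy's aligned path on each platform. Something as simple as the
helper below (hypothetical debug code, not meant for the tree), called
on the copy source and destination in the rx path, should tell:

#include <stdio.h>
#include <stdint.h>

static inline void
dump_copy_alignment(const void *dst, const void *src)
{
	/* Print the alignment of dst/src modulo 16B (SSE) and 32B (AVX2). */
	printf("dst: mod16=%lu mod32=%lu, src: mod16=%lu mod32=%lu\n",
	       (unsigned long)((uintptr_t)dst % 16),
	       (unsigned long)((uintptr_t)dst % 32),
	       (unsigned long)((uintptr_t)src % 16),
	       (unsigned long)((uintptr_t)src % 32));
}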

Cheers,
Maxime
>
> 	--yliu
>
>> I understand your point regarding aligned memcpy, but I'm surprised I
>> don't see its expected superiority with my benchmarks.
>> Any thoughts?

