[dpdk-dev,RFC] net/virtio: Align Virtio-net header on cache line in receive path

Message ID 20170221173243.20779-1-maxime.coquelin@redhat.com (mailing list archive)
State Rejected, archived
Delegated to: Yuanhan Liu

Checks

Context Check Description
ci/Intel-compilation success Compilation OK

Commit Message

Maxime Coquelin Feb. 21, 2017, 5:32 p.m. UTC
  This patch aligns the Virtio-net header on a cache-line boundary to
optimize cache utilization, as it puts the Virtio-net header (which
is always accessed) on the same cache line as the packet header.

For example, with an application that forwards packets at the L2 level,
a single cache line will be accessed with this patch, instead of two
before.

In the case of multi-buffer packets, the next segments will be aligned
on a cache-line boundary, instead of on a cache-line boundary minus the
size of the vnet header as before.
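
For reference, the offset arithmetic can be spelled out as a small
standalone sketch (assuming the default 128-byte RTE_PKTMBUF_HEADROOM,
a 12-byte mergeable virtio-net header, 64-byte cache lines and a
cache-line-aligned buf_addr; these values are assumptions for
illustration, not part of the patch):

    #include <stdio.h>

    #define CACHE_LINE  64
    #define HEADROOM    128  /* assumed RTE_PKTMBUF_HEADROOM */
    #define VNET_HDR_SZ 12   /* assumed virtio_net_hdr_mrg_rxbuf size */

    int main(void)
    {
        /* Before: header placed right before the packet data. */
        unsigned hdr_before = HEADROOM - VNET_HDR_SZ;  /* 116 -> line 1 */
        unsigned pkt_before = HEADROOM;                /* 128 -> line 2 */

        /* After: header at the headroom boundary, packet right after. */
        unsigned hdr_after = HEADROOM;                 /* 128 -> line 2 */
        unsigned pkt_after = HEADROOM + VNET_HDR_SZ;   /* 140 -> line 2 */

        printf("before: hdr in line %u, pkt in line %u\n",
               hdr_before / CACHE_LINE, pkt_before / CACHE_LINE);
        printf("after:  hdr in line %u, pkt in line %u\n",
               hdr_after / CACHE_LINE, pkt_after / CACHE_LINE);
        return 0;
    }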

Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
---

Hi,

I'm sending this patch as an RFC because I get strange results on SandyBridge.

For micro-benchmarks, I measure a +6% gain on Haswell, but I get a big
performance drop on SandyBridge (~-18%).
When running the PVP benchmark on SandyBridge, however, I measure a +4%
performance gain.

So I'd like to call for testing on this patch, especially PVP-like testing
on newer architectures.

Regarding SandyBridge, I would be interested to know whether we should
take the performance drop into account; for example, in the last release
we merged a patch anyway even though it caused a performance drop on SB.

Cheers,
Maxime

 drivers/net/virtio/virtio_rxtx.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)
  

Comments

Yuanhan Liu Feb. 22, 2017, 1:37 a.m. UTC | #1
On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
> This patch aligns the Virtio-net header on a cache-line boundary to
> optimize cache utilization, as it puts the Virtio-net header (which
> is always accessed) on the same cache line as the packet header.
> 
> For example with an application that forwards packets at L2 level,
> a single cache-line will be accessed with this patch, instead of
> two before.

I'm assuming you were testing pkt size <= (64 - hdr_size)?

> In case of multi-buffers packets, next segments will be aligned on
> a cache-line boundary, instead of cache-line boundary minus size of
> vnet header before.

Another thing is that this patch always makes the pkt data
cache-unaligned for the first packet, which makes Zhihong's
optimization on memcpy (for big packets) useless.

    commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
    Author: Zhihong Wang <zhihong.wang@intel.com>
    Date:   Tue Dec 6 20:31:06 2016 -0500
    
        eal: optimize aligned memcpy on x86
    
        This patch optimizes rte_memcpy for well aligned cases, where both
        dst and src addr are aligned to maximum MOV width. It introduces a
        dedicated function called rte_memcpy_aligned to handle the aligned
        cases with simplified instruction stream. The existing rte_memcpy
        is renamed as rte_memcpy_generic. The selection between them 2 is
        done at the entry of rte_memcpy.
    
        The existing rte_memcpy is for generic cases, it handles unaligned
        copies and make store aligned, it even makes load aligned for micro
        architectures like Ivy Bridge. However alignment handling comes at
        a price: It adds extra load/store instructions, which can cause
        complications sometime.
    
        DPDK Vhost memcpy with Mergeable Rx Buffer feature as an example:
        The copy is aligned, and remote, and there is header write along
        which is also remote. In this case the memcpy instruction stream
        should be simplified, to reduce extra load/store, therefore reduce
        the probability of load/store buffer full caused pipeline stall, to
        let the actual memcpy instructions be issued and let H/W prefetcher
        goes to work as early as possible.
    
        This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
        up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
        from 64 to 1500 bytes.
    
        The test can also be conducted without NIC, by setting loopback
        traffic between Virtio and Vhost. For example, modify the macro
        TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
        rebuild and start testpmd in both host and guest, then "start" on
        one side and "start tx_first 32" on the other.
    
        Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
        Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
        Tested-by: Lei Yao <lei.a.yao@intel.com>
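
The selection mentioned above is basically an alignment check at the
entry point. A minimal sketch of the idea (not the actual rte_memcpy
code; the 32-byte mask for AVX2 is an assumption for illustration):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define ALIGNMENT_MASK 0x1F  /* assumed 32-byte alignment (AVX2) */

    /* Stand-in for rte_memcpy_aligned: simplified instruction stream,
     * no extra loads/stores to fix up alignment. */
    static void *copy_aligned(void *dst, const void *src, size_t n)
    {
        return memcpy(dst, src, n);
    }

    /* Stand-in for rte_memcpy_generic: handles unaligned copies. */
    static void *copy_generic(void *dst, const void *src, size_t n)
    {
        return memcpy(dst, src, n);
    }

    void *copy_dispatch(void *dst, const void *src, size_t n)
    {
        /* Fast path only when dst and src share the required alignment. */
        if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
            return copy_aligned(dst, src, n);
        return copy_generic(dst, src, n);
    }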
    
> 
> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
> 
> Hi,
> 
> I send this patch as RFC because I get strange results on SandyBridge.
> 
> For micro-benchmarks, I measure a +6% gain on Haswell, but I get a big
> performance drop on SandyBridge (~-18%).
> When running PVP benchmark on SandyBridge, I measure a +4% performance
> gain though.
> 
> So I'd like to call for testing on this patch, especially PVP-like testing
> on newer architectures.
> 
> Regarding SandyBridge, I would be interrested to know whether we should
> take the performance drop into account, as we for example had one patch in
> last release that cause a performance drop on SB we merged anyway.

Sorry, would you remind me which patch it is?

	--yliu
  
Yang, Zhiyong Feb. 22, 2017, 2:49 a.m. UTC | #2
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yuanhan Liu
> Sent: Wednesday, February 22, 2017 9:38 AM
> To: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: Liang, Cunming <cunming.liang@intel.com>; Tan, Jianfeng
> <jianfeng.tan@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [RFC PATCH] net/virtio: Align Virtio-net header on
> cache line in receive path
> 
> On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
> > This patch aligns the Virtio-net header on a cache-line boundary to
> > optimize cache utilization, as it puts the Virtio-net header (which is
> > always accessed) on the same cache line as the packet header.
> >
> > For example with an application that forwards packets at L2 level, a
> > single cache-line will be accessed with this patch, instead of two
> > before.
> 
> I'm assuming you were testing pkt size <= (64 - hdr_size)?
> 
> > In case of multi-buffers packets, next segments will be aligned on a
> > cache-line boundary, instead of cache-line boundary minus size of vnet
> > header before.
> 
> The another thing is, this patch always makes the pkt data cache unaligned
> for the first packet, which makes Zhihong's optimization on memcpy (for big
> packet) useless.

Why couldn't we keep the pkt data always starting on the cache-line boundary?
In the multi-buffer case, the first segment would remain unchanged, and the
next segments could be handled as Maxime said.

Thanks
Zhiyong
  
Maxime Coquelin Feb. 22, 2017, 9:36 a.m. UTC | #3
On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
> On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
>> This patch aligns the Virtio-net header on a cache-line boundary to
>> optimize cache utilization, as it puts the Virtio-net header (which
>> is always accessed) on the same cache line as the packet header.
>>
>> For example with an application that forwards packets at L2 level,
>> a single cache-line will be accessed with this patch, instead of
>> two before.
>
> I'm assuming you were testing pkt size <= (64 - hdr_size)?

No, I tested with 64-byte packets only.
I ran some more tests this morning with different packet sizes,
and also with a changed mbuf size on the guest side to get
multi-buffer packets:

+-------+--------+--------+-------------------------+
| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
+-------+--------+--------+-------------------------+
|    64 |   2048 |  11.05 |                   11.78 |
|   128 |   2048 |  10.66 |                   11.48 |
|   256 |   2048 |  10.47 |                   11.21 |
|   512 |   2048 |  10.22 |                   10.88 |
|  1024 |   2048 |   7.65 |                    7.84 |
|  1500 |   2048 |   6.25 |                    6.45 |
|  2000 |   2048 |   5.31 |                    5.43 |
|  2048 |   2048 |   5.32 |                    4.25 |
|  1500 |    512 |   3.89 |                    3.98 |
|  2048 |    512 |   1.96 |                    2.02 |
+-------+--------+--------+-------------------------+

Overall we can see it is almost always beneficial.
The only case where we see a drop is the 2048/2048 case, which is
explained by the fact that two buffers are needed, since the vnet
header + pkt no longer fits in 2048 bytes.
It could be fixed by aligning the vnet header on the cache line before,
inside the headroom.


>
>> In case of multi-buffers packets, next segments will be aligned on
>> a cache-line boundary, instead of cache-line boundary minus size of
>> vnet header before.
>
> The another thing is, this patch always makes the pkt data cache
> unaligned for the first packet, which makes Zhihong's optimization
> on memcpy (for big packet) useless.
>
>     commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
>     Author: Zhihong Wang <zhihong.wang@intel.com>
>     Date:   Tue Dec 6 20:31:06 2016 -0500
>
>         eal: optimize aligned memcpy on x86
>
>         This patch optimizes rte_memcpy for well aligned cases, where both
>         dst and src addr are aligned to maximum MOV width. It introduces a
>         dedicated function called rte_memcpy_aligned to handle the aligned
>         cases with simplified instruction stream. The existing rte_memcpy
>         is renamed as rte_memcpy_generic. The selection between them 2 is
>         done at the entry of rte_memcpy.
>
>         The existing rte_memcpy is for generic cases, it handles unaligned
>         copies and make store aligned, it even makes load aligned for micro
>         architectures like Ivy Bridge. However alignment handling comes at
>         a price: It adds extra load/store instructions, which can cause
>         complications sometime.
>
>         DPDK Vhost memcpy with Mergeable Rx Buffer feature as an example:
>         The copy is aligned, and remote, and there is header write along
>         which is also remote. In this case the memcpy instruction stream
>         should be simplified, to reduce extra load/store, therefore reduce
>         the probability of load/store buffer full caused pipeline stall, to
>         let the actual memcpy instructions be issued and let H/W prefetcher
>         goes to work as early as possible.
>
>         This patch is tested on Ivy Bridge, Haswell and Skylake, it provides
>         up to 20% gain for Virtio Vhost PVP traffic, with packet size ranging
>         from 64 to 1500 bytes.
>
>         The test can also be conducted without NIC, by setting loopback
>         traffic between Virtio and Vhost. For example, modify the macro
>         TXONLY_DEF_PACKET_LEN to the requested packet size in testpmd.h,
>         rebuild and start testpmd in both host and guest, then "start" on
>         one side and "start tx_first 32" on the other.

I also ran some loopback tests with large packets, and I see a small
gain with my patch (fwd io on both ends):

+-------+--------+--------+-------------------------+
| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
+-------+--------+--------+-------------------------+
|  1500 |   2048 |   4.05 |                    4.14 |
+-------+--------+--------+-------------------------+


>
>         Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
>         Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>         Tested-by: Lei Yao <lei.a.yao@intel.com>

Does this need to be cache-line aligned?
I also tried to align the packet on a 16-byte boundary, basically
putting the header at a HEADROOM + 4 bytes offset, but I didn't measure
any gain on Haswell, and even saw a drop on SandyBridge.


I understand your point regarding aligned memcpy, but I'm surprised I
don't see its expected superiority with my benchmarks.
Any thoughts?

Cheers,
Maxime

>>
>> Signed-off-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>> ---
>>
>> Hi,
>>
>> I send this patch as RFC because I get strange results on SandyBridge.
>>
>> For micro-benchmarks, I measure a +6% gain on Haswell, but I get a big
>> performance drop on SandyBridge (~-18%).
>> When running PVP benchmark on SandyBridge, I measure a +4% performance
>> gain though.
>>
>> So I'd like to call for testing on this patch, especially PVP-like testing
>> on newer architectures.
>>
>> Regarding SandyBridge, I would be interrested to know whether we should
>> take the performance drop into account, as we for example had one patch in
>> last release that cause a performance drop on SB we merged anyway.
>
> Sorry, would you remind me which patch it is?
>
> 	--yliu
>
  
Maxime Coquelin Feb. 22, 2017, 9:39 a.m. UTC | #4
On 02/22/2017 03:49 AM, Yang, Zhiyong wrote:
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Yuanhan Liu
>> Sent: Wednesday, February 22, 2017 9:38 AM
>> To: Maxime Coquelin <maxime.coquelin@redhat.com>
>> Cc: Liang, Cunming <cunming.liang@intel.com>; Tan, Jianfeng
>> <jianfeng.tan@intel.com>; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [RFC PATCH] net/virtio: Align Virtio-net header on
>> cache line in receive path
>>
>> On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
>>> This patch aligns the Virtio-net header on a cache-line boundary to
>>> optimize cache utilization, as it puts the Virtio-net header (which is
>>> always accessed) on the same cache line as the packet header.
>>>
>>> For example with an application that forwards packets at L2 level, a
>>> single cache-line will be accessed with this patch, instead of two
>>> before.
>>
>> I'm assuming you were testing pkt size <= (64 - hdr_size)?
>>
>>> In case of multi-buffers packets, next segments will be aligned on a
>>> cache-line boundary, instead of cache-line boundary minus size of vnet
>>> header before.
>>
>> The another thing is, this patch always makes the pkt data cache unaligned
>> for the first packet, which makes Zhihong's optimization on memcpy (for big
>> packet) useless.
>
> Why not could  we keep pkt data starting always on  the cache-line boundary?
> In case of multi-buffer, the first remains unchanged, next segments can do as
> Maxime said that.

It is not possible to have both the first and the next buffers aligned on
a cache-line boundary, as we don't know whether a descriptor will be the
first one or a subsequent one on the Virtio side when we refill the vq.

Cheers,
Maxime

>
> Thanks
> Zhiyong
>
  
Yuanhan Liu Feb. 23, 2017, 5:49 a.m. UTC | #5
On Wed, Feb 22, 2017 at 10:36:36AM +0100, Maxime Coquelin wrote:
> 
> 
> On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
> >On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
> >>This patch aligns the Virtio-net header on a cache-line boundary to
> >>optimize cache utilization, as it puts the Virtio-net header (which
> >>is always accessed) on the same cache line as the packet header.
> >>
> >>For example with an application that forwards packets at L2 level,
> >>a single cache-line will be accessed with this patch, instead of
> >>two before.
> >
> >I'm assuming you were testing pkt size <= (64 - hdr_size)?
> 
> No, I tested with 64 bytes packets only.

Oh, my bad, I overlooked it. While you were saying "a single cache
line", I was thinking of putting the virtio net hdr and the "whole"
packet data in a single cache line, which is not possible for a pkt
size of 64B.

> I run some more tests this morning with different packet sizes,
> and also with changing the mbuf size on guest side to have multi-
> buffers packets:
> 
> +-------+--------+--------+-------------------------+
> | Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> +-------+--------+--------+-------------------------+
> |    64 |   2048 |  11.05 |                   11.78 |
> |   128 |   2048 |  10.66 |                   11.48 |
> |   256 |   2048 |  10.47 |                   11.21 |
> |   512 |   2048 |  10.22 |                   10.88 |
> |  1024 |   2048 |   7.65 |                    7.84 |
> |  1500 |   2048 |   6.25 |                    6.45 |
> |  2000 |   2048 |   5.31 |                    5.43 |
> |  2048 |   2048 |   5.32 |                    4.25 |
> |  1500 |    512 |   3.89 |                    3.98 |
> |  2048 |    512 |   1.96 |                    2.02 |
> +-------+--------+--------+-------------------------+

Could you share more info, say is it a PVP test? Is mergeable on?
What's the fwd mode?

> >>In case of multi-buffers packets, next segments will be aligned on
> >>a cache-line boundary, instead of cache-line boundary minus size of
> >>vnet header before.
> >
> >The another thing is, this patch always makes the pkt data cache
> >unaligned for the first packet, which makes Zhihong's optimization
> >on memcpy (for big packet) useless.
> >
> >    commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
> >    Author: Zhihong Wang <zhihong.wang@intel.com>
> >    Date:   Tue Dec 6 20:31:06 2016 -0500
> 
> I did run some loopback test with large packet also, an I see a small gain
> with my patch (fwd io on both ends):
> 
> +-------+--------+--------+-------------------------+
> | Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> +-------+--------+--------+-------------------------+
> |  1500 |   2048 |   4.05 |                    4.14 |
> +-------+--------+--------+-------------------------+

Weird, that basically means Zhihong's patch doesn't work? Could you add
one more column here: what's the data when rolling back to the point
before Zhihong's commit?

> >
> >        Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> >        Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >        Tested-by: Lei Yao <lei.a.yao@intel.com>
> 
> Does this need to be cache-line aligned?

Nope, the alignment size differs between platforms: AVX512 needs 64B
alignment, while AVX2 needs 32B alignment.

> I also tried to align pkt on 16bytes boundary, basically putting header
> at HEADROOM + 4 bytes offset, but I didn't measured any gain on
> Haswell,

The fast rte_memcpy path (when dst & src are well aligned) on Haswell
(with AVX2) requires 32B alignment. Even with the 16B boundary, it would
still take the slow path. From this point of view, the extra pad does
not change anything; thus, no gain is expected.
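
Concretely, with an assumed 128-byte headroom, a 12-byte mergeable
header and a cache-line-aligned buf_addr, the packet data would then
start at offset 144, and a quick compile-time check shows why only the
16B requirement is met:

    /* Packet data offset when the vnet header sits at HEADROOM + 4
     * (all sizes are assumptions for illustration). */
    enum { PKT_OFF = 128 + 4 + 12 };  /* = 144 */

    _Static_assert(PKT_OFF % 16 == 0,
                   "16B aligned: meets the SandyBridge fast-path requirement");
    _Static_assert(PKT_OFF % 32 != 0,
                   "not 32B aligned: misses the Haswell/AVX2 fast path");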

> and even a drop on SandyBridge.

That's weird: SandyBridge requires only 16B alignment, meaning the extra
pad should put it into the fast path of rte_memcpy, whereas the
performance is worse.

	--yliu

> I understand your point regarding aligned memcpy, but I'm surprised I
> don't see its expected superiority with my benchmarks.
> Any thoughts?
  
Maxime Coquelin March 1, 2017, 7:36 a.m. UTC | #6
On 02/23/2017 06:49 AM, Yuanhan Liu wrote:
> On Wed, Feb 22, 2017 at 10:36:36AM +0100, Maxime Coquelin wrote:
>>
>>
>> On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
>>> On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
>>>> This patch aligns the Virtio-net header on a cache-line boundary to
>>>> optimize cache utilization, as it puts the Virtio-net header (which
>>>> is always accessed) on the same cache line as the packet header.
>>>>
>>>> For example with an application that forwards packets at L2 level,
>>>> a single cache-line will be accessed with this patch, instead of
>>>> two before.
>>>
>>> I'm assuming you were testing pkt size <= (64 - hdr_size)?
>>
>> No, I tested with 64 bytes packets only.
>
> Oh, my bad, I overlooked it. While you were saying "a single cache
> line", I was thinking putting the virtio net hdr and the "whole"
> packet data in single cache line, which is not possible for pkt
> size 64B.
>
>> I run some more tests this morning with different packet sizes,
>> and also with changing the mbuf size on guest side to have multi-
>> buffers packets:
>>
>> +-------+--------+--------+-------------------------+
>> | Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
>> +-------+--------+--------+-------------------------+
>> |    64 |   2048 |  11.05 |                   11.78 |
>> |   128 |   2048 |  10.66 |                   11.48 |
>> |   256 |   2048 |  10.47 |                   11.21 |
>> |   512 |   2048 |  10.22 |                   10.88 |
>> |  1024 |   2048 |   7.65 |                    7.84 |
>> |  1500 |   2048 |   6.25 |                    6.45 |
>> |  2000 |   2048 |   5.31 |                    5.43 |
>> |  2048 |   2048 |   5.32 |                    4.25 |
>> |  1500 |    512 |   3.89 |                    3.98 |
>> |  2048 |    512 |   1.96 |                    2.02 |
>> +-------+--------+--------+-------------------------+
>
> Could you share more info, say is it a PVP test? Is mergeable on?
> What's the fwd mode?

No, this is not a PVP benchmark; I have neither another server nor a
packet generator connected back-to-back to my Haswell machine.

This is a simple micro-benchmark: vhost PMD in txonly, Virtio PMD in
rxonly. In this configuration, mergeable is ON and no offload is
disabled in the QEMU cmdline.

That's why I would be interested in more testing on recent hardware
with the PVP benchmark. Is it something that could be run in the Intel
lab?

I did some more trials, and I think that most of the gain seen in this
microbenchmark could in fact happen on the vhost side.
Indeed, I monitored the number of packets dequeued at each .rx_pkt_burst()
call, and I can see there are packets in the vq only once every 20
calls. On the vhost side, monitoring shows that it always succeeds in
writing its bursts, i.e. the vq is never full.

>>>> In case of multi-buffers packets, next segments will be aligned on
>>>> a cache-line boundary, instead of cache-line boundary minus size of
>>>> vnet header before.
>>>
>>> The another thing is, this patch always makes the pkt data cache
>>> unaligned for the first packet, which makes Zhihong's optimization
>>> on memcpy (for big packet) useless.
>>>
>>>    commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
>>>    Author: Zhihong Wang <zhihong.wang@intel.com>
>>>    Date:   Tue Dec 6 20:31:06 2016 -0500
>>
>> I did run some loopback test with large packet also, an I see a small gain
>> with my patch (fwd io on both ends):
>>
>> +-------+--------+--------+-------------------------+
>> | Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
>> +-------+--------+--------+-------------------------+
>> |  1500 |   2048 |   4.05 |                    4.14 |
>> +-------+--------+--------+-------------------------+
>
> Wierd, that basically means Zhihong's patch doesn't work? Could you add
> one more colum here: what's the data when roll back to the point without
> Zhihong's commit?

I add this to my ToDo list, don't expect results before next week.

>>>
>>>        Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
>>>        Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>>>        Tested-by: Lei Yao <lei.a.yao@intel.com>
>>
>> Does this need to be cache-line aligned?
>
> Nope, the alignment size is different with different platforms. AVX512
> needs a 64B alignment, while AVX2 needs 32B alignment.
>
>> I also tried to align pkt on 16bytes boundary, basically putting header
>> at HEADROOM + 4 bytes offset, but I didn't measured any gain on
>> Haswell,
>
> The fast rte_memcpy path (when dst & src is well aligned) on Haswell
> (with AVX2) requires 32B alignment. Even the 16B boundary would make
> it into the slow path. From this point of view, the extra pad does
> not change anything. Thus, no gain is expected.
>
>> and even a drop on SandyBridge.
>
> That's weird, SandyBridge requries the 16B alignment, meaning the extra
> pad should put it into fast path of rte_memcpy, whereas the performance
> is worse.

Thanks for the info, I will run more tests to explain this.

Cheers,
Maxime
>
> 	--yliu
>
>> I understand your point regarding aligned memcpy, but I'm surprised I
>> don't see its expected superiority with my benchmarks.
>> Any thoughts?
  
Yuanhan Liu March 6, 2017, 8:46 a.m. UTC | #7
On Wed, Mar 01, 2017 at 08:36:24AM +0100, Maxime Coquelin wrote:
> 
> 
> On 02/23/2017 06:49 AM, Yuanhan Liu wrote:
> >On Wed, Feb 22, 2017 at 10:36:36AM +0100, Maxime Coquelin wrote:
> >>
> >>
> >>On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
> >>>On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
> >>>>This patch aligns the Virtio-net header on a cache-line boundary to
> >>>>optimize cache utilization, as it puts the Virtio-net header (which
> >>>>is always accessed) on the same cache line as the packet header.
> >>>>
> >>>>For example with an application that forwards packets at L2 level,
> >>>>a single cache-line will be accessed with this patch, instead of
> >>>>two before.
> >>>
> >>>I'm assuming you were testing pkt size <= (64 - hdr_size)?
> >>
> >>No, I tested with 64 bytes packets only.
> >
> >Oh, my bad, I overlooked it. While you were saying "a single cache
> >line", I was thinking putting the virtio net hdr and the "whole"
> >packet data in single cache line, which is not possible for pkt
> >size 64B.
> >
> >>I run some more tests this morning with different packet sizes,
> >>and also with changing the mbuf size on guest side to have multi-
> >>buffers packets:
> >>
> >>+-------+--------+--------+-------------------------+
> >>| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> >>+-------+--------+--------+-------------------------+
> >>|    64 |   2048 |  11.05 |                   11.78 |
> >>|   128 |   2048 |  10.66 |                   11.48 |
> >>|   256 |   2048 |  10.47 |                   11.21 |
> >>|   512 |   2048 |  10.22 |                   10.88 |
> >>|  1024 |   2048 |   7.65 |                    7.84 |
> >>|  1500 |   2048 |   6.25 |                    6.45 |
> >>|  2000 |   2048 |   5.31 |                    5.43 |
> >>|  2048 |   2048 |   5.32 |                    4.25 |
> >>|  1500 |    512 |   3.89 |                    3.98 |
> >>|  2048 |    512 |   1.96 |                    2.02 |
> >>+-------+--------+--------+-------------------------+
> >
> >Could you share more info, say is it a PVP test? Is mergeable on?
> >What's the fwd mode?
> 
> No, this is not PVP benchmark, I have neither another server nor a packet
> generator connected to my Haswell machine back-to-back.
> 
> This is simple micro-benchmark, vhost PMD in txonly, Virtio PMD in
> rxonly. In this configuration, mergeable is ON and no offload disabled
> in QEMU cmdline.

Okay, I see. So the boost, as you have stated, comes from saving two
cache-line accesses down to one. Before that, vhost wrote 2 cache lines,
while the virtio pmd read 2 cache lines: one for reading the header,
another one for reading the ether header to update xstats (the fwd mode
you tested does not access the ether header itself).
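
For reference, the kind of per-packet stats update that touches the
Ethernet header looks roughly like this (a simplified sketch, not the
exact virtio PMD code; the struct and function names are made up):

    #include <stdint.h>

    struct sketch_stats {
        uint64_t multicast;
        uint64_t broadcast;
    };

    /* Reading the destination MAC is what pulls in the cache line that
     * holds the start of the packet data, even in rxonly forwarding. */
    void sketch_update_xstats(struct sketch_stats *st, const uint8_t *dst_mac)
    {
        if (dst_mac[0] & 0x01) {  /* group (multicast) bit set */
            int is_bcast = 1;
            for (int i = 0; i < 6; i++)
                is_bcast &= (dst_mac[i] == 0xff);
            if (is_bcast)
                st->broadcast++;
            else
                st->multicast++;
        }
    }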

> That's why I would be interested in more testing on recent hardware
> with PVP benchmark. Is it something that could be run in Intel lab?

I think Yao Lei could help on that? But as stated, I think it may
break the performance for big packets. And I also wouldn't expect a big
boost even for 64B in the PVP test, judging that it's only a 6% boost in
micro-benchmarking.

	--yliu
> 
> I did some more trials, and I think that most of the gain seen in this
> microbenchmark  could happen in fact on vhost side.
> Indeed, I monitored the number of packets dequeued at each .rx_pkt_burst()
> call, and I can see there are packets in the vq only once every 20
> calls. On Vhost side, monitoring shows that it always succeeds to write
> its burts, i.e. the vq is never full.
> 
> >>>>In case of multi-buffers packets, next segments will be aligned on
> >>>>a cache-line boundary, instead of cache-line boundary minus size of
> >>>>vnet header before.
> >>>
> >>>The another thing is, this patch always makes the pkt data cache
> >>>unaligned for the first packet, which makes Zhihong's optimization
> >>>on memcpy (for big packet) useless.
> >>>
> >>>   commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f
> >>>   Author: Zhihong Wang <zhihong.wang@intel.com>
> >>>   Date:   Tue Dec 6 20:31:06 2016 -0500
> >>
> >>I did run some loopback test with large packet also, an I see a small gain
> >>with my patch (fwd io on both ends):
> >>
> >>+-------+--------+--------+-------------------------+
> >>| Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> >>+-------+--------+--------+-------------------------+
> >>|  1500 |   2048 |   4.05 |                    4.14 |
> >>+-------+--------+--------+-------------------------+
> >
> >Wierd, that basically means Zhihong's patch doesn't work? Could you add
> >one more colum here: what's the data when roll back to the point without
> >Zhihong's commit?
> 
> I add this to my ToDo list, don't expect results before next week.
> 
> >>>
> >>>       Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> >>>       Reviewed-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> >>>       Tested-by: Lei Yao <lei.a.yao@intel.com>
> >>
> >>Does this need to be cache-line aligned?
> >
> >Nope, the alignment size is different with different platforms. AVX512
> >needs a 64B alignment, while AVX2 needs 32B alignment.
> >
> >>I also tried to align pkt on 16bytes boundary, basically putting header
> >>at HEADROOM + 4 bytes offset, but I didn't measured any gain on
> >>Haswell,
> >
> >The fast rte_memcpy path (when dst & src is well aligned) on Haswell
> >(with AVX2) requires 32B alignment. Even the 16B boundary would make
> >it into the slow path. From this point of view, the extra pad does
> >not change anything. Thus, no gain is expected.
> >
> >>and even a drop on SandyBridge.
> >
> >That's weird, SandyBridge requries the 16B alignment, meaning the extra
> >pad should put it into fast path of rte_memcpy, whereas the performance
> >is worse.
> 
> Thanks for the info, I will run more tests to explain this.
> 
> Cheers,
> Maxime
> >
> >	--yliu
> >
> >>I understand your point regarding aligned memcpy, but I'm surprised I
> >>don't see its expected superiority with my benchmarks.
> >>Any thoughts?
  
Maxime Coquelin March 6, 2017, 2:11 p.m. UTC | #8
On 03/06/2017 09:46 AM, Yuanhan Liu wrote:
> On Wed, Mar 01, 2017 at 08:36:24AM +0100, Maxime Coquelin wrote:
>>
>>
>> On 02/23/2017 06:49 AM, Yuanhan Liu wrote:
>>> On Wed, Feb 22, 2017 at 10:36:36AM +0100, Maxime Coquelin wrote:
>>>>
>>>>
>>>> On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
>>>>> On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
>>>>>> This patch aligns the Virtio-net header on a cache-line boundary to
>>>>>> optimize cache utilization, as it puts the Virtio-net header (which
>>>>>> is always accessed) on the same cache line as the packet header.
>>>>>>
>>>>>> For example with an application that forwards packets at L2 level,
>>>>>> a single cache-line will be accessed with this patch, instead of
>>>>>> two before.
>>>>>
>>>>> I'm assuming you were testing pkt size <= (64 - hdr_size)?
>>>>
>>>> No, I tested with 64 bytes packets only.
>>>
>>> Oh, my bad, I overlooked it. While you were saying "a single cache
>>> line", I was thinking putting the virtio net hdr and the "whole"
>>> packet data in single cache line, which is not possible for pkt
>>> size 64B.
>>>
>>>> I run some more tests this morning with different packet sizes,
>>>> and also with changing the mbuf size on guest side to have multi-
>>>> buffers packets:
>>>>
>>>> +-------+--------+--------+-------------------------+
>>>> | Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
>>>> +-------+--------+--------+-------------------------+
>>>> |    64 |   2048 |  11.05 |                   11.78 |
>>>> |   128 |   2048 |  10.66 |                   11.48 |
>>>> |   256 |   2048 |  10.47 |                   11.21 |
>>>> |   512 |   2048 |  10.22 |                   10.88 |
>>>> |  1024 |   2048 |   7.65 |                    7.84 |
>>>> |  1500 |   2048 |   6.25 |                    6.45 |
>>>> |  2000 |   2048 |   5.31 |                    5.43 |
>>>> |  2048 |   2048 |   5.32 |                    4.25 |
>>>> |  1500 |    512 |   3.89 |                    3.98 |
>>>> |  2048 |    512 |   1.96 |                    2.02 |
>>>> +-------+--------+--------+-------------------------+
>>>
>>> Could you share more info, say is it a PVP test? Is mergeable on?
>>> What's the fwd mode?
>>
>> No, this is not PVP benchmark, I have neither another server nor a packet
>> generator connected to my Haswell machine back-to-back.
>>
>> This is simple micro-benchmark, vhost PMD in txonly, Virtio PMD in
>> rxonly. In this configuration, mergeable is ON and no offload disabled
>> in QEMU cmdline.
>
> Okay, I see. So the boost, as you have stated, comes from saving two
> cache line access to one. Before that, vhost write 2 cache lines,
> while the virtio pmd reads 2 cache lines: one for reading the header,
> another one for reading the ether header, for updating xstats (there
> is no ether access in the fwd mode you tested).
>
>> That's why I would be interested in more testing on recent hardware
>> with PVP benchmark. Is it something that could be run in Intel lab?
>
> I think Yao Lei could help on that? But as stated, I think it may
> break the performance for bit packets. And I also won't expect big
> boost even for 64B in PVP test, judging that it's only 6% boost in
> micro bechmarking.
That would be great.
Note that on SandyBridge, on which I see a perf drop with the
microbenchmark, I get a 4% gain on the PVP benchmark. So on recent
hardware that shows a gain on the microbenchmark, I'm curious about the
gain with the PVP bench.

Cheers,
Maxime
  
Yao, Lei A March 8, 2017, 6:01 a.m. UTC | #9
> -----Original Message-----
> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
> Sent: Monday, March 6, 2017 10:11 PM
> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> Cc: Liang, Cunming <cunming.liang@intel.com>; Tan, Jianfeng
> <jianfeng.tan@intel.com>; dev@dpdk.org; Wang, Zhihong
> <zhihong.wang@intel.com>; Yao, Lei A <lei.a.yao@intel.com>
> Subject: Re: [RFC PATCH] net/virtio: Align Virtio-net header on cache line in
> receive path
> 
> 
> 
> On 03/06/2017 09:46 AM, Yuanhan Liu wrote:
> > On Wed, Mar 01, 2017 at 08:36:24AM +0100, Maxime Coquelin wrote:
> >>
> >>
> >> On 02/23/2017 06:49 AM, Yuanhan Liu wrote:
> >>> On Wed, Feb 22, 2017 at 10:36:36AM +0100, Maxime Coquelin wrote:
> >>>>
> >>>>
> >>>> On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
> >>>>> On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
> >>>>>> This patch aligns the Virtio-net header on a cache-line boundary to
> >>>>>> optimize cache utilization, as it puts the Virtio-net header (which
> >>>>>> is always accessed) on the same cache line as the packet header.
> >>>>>>
> >>>>>> For example with an application that forwards packets at L2 level,
> >>>>>> a single cache-line will be accessed with this patch, instead of
> >>>>>> two before.
> >>>>>
> >>>>> I'm assuming you were testing pkt size <= (64 - hdr_size)?
> >>>>
> >>>> No, I tested with 64 bytes packets only.
> >>>
> >>> Oh, my bad, I overlooked it. While you were saying "a single cache
> >>> line", I was thinking putting the virtio net hdr and the "whole"
> >>> packet data in single cache line, which is not possible for pkt
> >>> size 64B.
> >>>
> >>>> I run some more tests this morning with different packet sizes,
> >>>> and also with changing the mbuf size on guest side to have multi-
> >>>> buffers packets:
> >>>>
> >>>> +-------+--------+--------+-------------------------+
> >>>> | Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
> >>>> +-------+--------+--------+-------------------------+
> >>>> |    64 |   2048 |  11.05 |                   11.78 |
> >>>> |   128 |   2048 |  10.66 |                   11.48 |
> >>>> |   256 |   2048 |  10.47 |                   11.21 |
> >>>> |   512 |   2048 |  10.22 |                   10.88 |
> >>>> |  1024 |   2048 |   7.65 |                    7.84 |
> >>>> |  1500 |   2048 |   6.25 |                    6.45 |
> >>>> |  2000 |   2048 |   5.31 |                    5.43 |
> >>>> |  2048 |   2048 |   5.32 |                    4.25 |
> >>>> |  1500 |    512 |   3.89 |                    3.98 |
> >>>> |  2048 |    512 |   1.96 |                    2.02 |
> >>>> +-------+--------+--------+-------------------------+
> >>>
> >>> Could you share more info, say is it a PVP test? Is mergeable on?
> >>> What's the fwd mode?
> >>
> >> No, this is not PVP benchmark, I have neither another server nor a packet
> >> generator connected to my Haswell machine back-to-back.
> >>
> >> This is simple micro-benchmark, vhost PMD in txonly, Virtio PMD in
> >> rxonly. In this configuration, mergeable is ON and no offload disabled
> >> in QEMU cmdline.
> >
> > Okay, I see. So the boost, as you have stated, comes from saving two
> > cache line access to one. Before that, vhost write 2 cache lines,
> > while the virtio pmd reads 2 cache lines: one for reading the header,
> > another one for reading the ether header, for updating xstats (there
> > is no ether access in the fwd mode you tested).
> >
> >> That's why I would be interested in more testing on recent hardware
> >> with PVP benchmark. Is it something that could be run in Intel lab?
> >
> > I think Yao Lei could help on that? But as stated, I think it may
> > break the performance for bit packets. And I also won't expect big
> > boost even for 64B in PVP test, judging that it's only 6% boost in
> > micro bechmarking.
> That would be great.
> Note that on SandyBridge, on which I see a drop in perf with
> microbenchmark, I get a 4% gain on PVP benchmark. So on recent hardware
> that show a gain on microbenchmark, I'm curious of the gain with PVP
> bench.
> 
Hi, Maxime, Yuanhan

I have executed the PVP and loopback performance tests on my Ivy Bridge server.
OS: Ubuntu 16.04
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
Kernel: 4.4.0
gcc: 5.4.0
I use MAC forwarding for the test.

The performance baseline is commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f,
"eal: optimize aligned memcpy on x86".
I can see a big performance drop on both the mergeable and non-mergeable
paths after applying this patch.
Mergeable path, loopback test:
packet size    performance compare
64             -21.76%
128            -17.79%
260            -20.25%
520            -14.80%
1024            -9.34%
1500            -6.16%

Non-mergeable path, loopback test:
packet size    performance compare
64             -13.72%
128            -10.35%
260            -16.40%
520            -14.78%
1024           -10.48%
1500            -6.91%

Mergeable path, PVP test:
packet size    performance compare
64             -16.33%

Non-mergeable path, PVP test:
packet size    performance compare
64              -8.69%

Best Regards
Lei


> Cheers,
> Maxime
  
Maxime Coquelin March 9, 2017, 2:38 p.m. UTC | #10
On 03/08/2017 07:01 AM, Yao, Lei A wrote:
>
>
>> -----Original Message-----
>> From: Maxime Coquelin [mailto:maxime.coquelin@redhat.com]
>> Sent: Monday, March 6, 2017 10:11 PM
>> To: Yuanhan Liu <yuanhan.liu@linux.intel.com>
>> Cc: Liang, Cunming <cunming.liang@intel.com>; Tan, Jianfeng
>> <jianfeng.tan@intel.com>; dev@dpdk.org; Wang, Zhihong
>> <zhihong.wang@intel.com>; Yao, Lei A <lei.a.yao@intel.com>
>> Subject: Re: [RFC PATCH] net/virtio: Align Virtio-net header on cache line in
>> receive path
>>
>>
>>
>> On 03/06/2017 09:46 AM, Yuanhan Liu wrote:
>>> On Wed, Mar 01, 2017 at 08:36:24AM +0100, Maxime Coquelin wrote:
>>>>
>>>>
>>>> On 02/23/2017 06:49 AM, Yuanhan Liu wrote:
>>>>> On Wed, Feb 22, 2017 at 10:36:36AM +0100, Maxime Coquelin wrote:
>>>>>>
>>>>>>
>>>>>> On 02/22/2017 02:37 AM, Yuanhan Liu wrote:
>>>>>>> On Tue, Feb 21, 2017 at 06:32:43PM +0100, Maxime Coquelin wrote:
>>>>>>>> This patch aligns the Virtio-net header on a cache-line boundary to
>>>>>>>> optimize cache utilization, as it puts the Virtio-net header (which
>>>>>>>> is always accessed) on the same cache line as the packet header.
>>>>>>>>
>>>>>>>> For example with an application that forwards packets at L2 level,
>>>>>>>> a single cache-line will be accessed with this patch, instead of
>>>>>>>> two before.
>>>>>>>
>>>>>>> I'm assuming you were testing pkt size <= (64 - hdr_size)?
>>>>>>
>>>>>> No, I tested with 64 bytes packets only.
>>>>>
>>>>> Oh, my bad, I overlooked it. While you were saying "a single cache
>>>>> line", I was thinking putting the virtio net hdr and the "whole"
>>>>> packet data in single cache line, which is not possible for pkt
>>>>> size 64B.
>>>>>
>>>>>> I run some more tests this morning with different packet sizes,
>>>>>> and also with changing the mbuf size on guest side to have multi-
>>>>>> buffers packets:
>>>>>>
>>>>>> +-------+--------+--------+-------------------------+
>>>>>> | Txpkt | Rxmbuf | v17.02 | v17.02 + vnet hdr align |
>>>>>> +-------+--------+--------+-------------------------+
>>>>>> |    64 |   2048 |  11.05 |                   11.78 |
>>>>>> |   128 |   2048 |  10.66 |                   11.48 |
>>>>>> |   256 |   2048 |  10.47 |                   11.21 |
>>>>>> |   512 |   2048 |  10.22 |                   10.88 |
>>>>>> |  1024 |   2048 |   7.65 |                    7.84 |
>>>>>> |  1500 |   2048 |   6.25 |                    6.45 |
>>>>>> |  2000 |   2048 |   5.31 |                    5.43 |
>>>>>> |  2048 |   2048 |   5.32 |                    4.25 |
>>>>>> |  1500 |    512 |   3.89 |                    3.98 |
>>>>>> |  2048 |    512 |   1.96 |                    2.02 |
>>>>>> +-------+--------+--------+-------------------------+
>>>>>
>>>>> Could you share more info, say is it a PVP test? Is mergeable on?
>>>>> What's the fwd mode?
>>>>
>>>> No, this is not PVP benchmark, I have neither another server nor a packet
>>>> generator connected to my Haswell machine back-to-back.
>>>>
>>>> This is simple micro-benchmark, vhost PMD in txonly, Virtio PMD in
>>>> rxonly. In this configuration, mergeable is ON and no offload disabled
>>>> in QEMU cmdline.
>>>
>>> Okay, I see. So the boost, as you have stated, comes from saving two
>>> cache line access to one. Before that, vhost write 2 cache lines,
>>> while the virtio pmd reads 2 cache lines: one for reading the header,
>>> another one for reading the ether header, for updating xstats (there
>>> is no ether access in the fwd mode you tested).
>>>
>>>> That's why I would be interested in more testing on recent hardware
>>>> with PVP benchmark. Is it something that could be run in Intel lab?
>>>
>>> I think Yao Lei could help on that? But as stated, I think it may
>>> break the performance for bit packets. And I also won't expect big
>>> boost even for 64B in PVP test, judging that it's only 6% boost in
>>> micro bechmarking.
>> That would be great.
>> Note that on SandyBridge, on which I see a drop in perf with
>> microbenchmark, I get a 4% gain on PVP benchmark. So on recent hardware
>> that show a gain on microbenchmark, I'm curious of the gain with PVP
>> bench.
>>
> Hi, Maxime, Yuanhan
>
> I have execute the PVP and loopback performance test on my Ivy bridge server.
> OS:Ubutnu16.04
> CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
> Kernal:  4.4.0
> gcc : 5.4.0
> I use MAC forward for test.
>
> Performance base is commit f5472703c0bdfc29c46fc4b2ca445bce3dc08c9f,
> "eal: optimize aligned memcpy on x86".
> I can see big performance drop on Mergeable and no-mergeable path
> after apply this patch
> Mergebale Path loopback test			
> packet size		Performance compare	
> 64		 		-21.76%
> 128				-17.79%
> 260				-20.25%
> 520				-14.80%
> 1024				-9.34%
> 1500				-6.16%
>
> No-mergeable  Path loopback test	
> packet size			
> 64				-13.72%
> 128				-10.35%
> 260				-16.40%
> 520				-14.78%
> 1024				-10.48%
> 1500				-6.91%
>
> Mergeable Path PVP test			
> packet size		Performance compare		
> 64	                                               -16.33%
>
> No-mergeable Path PVP test			
> packet size			
> 64		                               -8.69%

Thanks, Yao, for the testing.
I'm surprised by the PVP results, as even on SandyBridge, where I get a
perf drop on micro-benchmarks, I get an improvement with PVP.

I'll try to reproduce some tests on Ivy Bridge to understand what is
happening.

Cheers,
Maxime
  

Patch

diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index cab6e8f..ef95dde 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -182,7 +182,6 @@  static inline int
 virtqueue_enqueue_recv_refill(struct virtqueue *vq, struct rte_mbuf *cookie)
 {
 	struct vq_desc_extra *dxp;
-	struct virtio_hw *hw = vq->hw;
 	struct vring_desc *start_dp;
 	uint16_t needed = 1;
 	uint16_t head_idx, idx;
@@ -203,10 +202,8 @@  virtqueue_enqueue_recv_refill(struct virtqueue *vq, struct rte_mbuf *cookie)
 
 	start_dp = vq->vq_ring.desc;
 	start_dp[idx].addr =
-		VIRTIO_MBUF_ADDR(cookie, vq) +
-		RTE_PKTMBUF_HEADROOM - hw->vtnet_hdr_size;
-	start_dp[idx].len =
-		cookie->buf_len - RTE_PKTMBUF_HEADROOM + hw->vtnet_hdr_size;
+		VIRTIO_MBUF_ADDR(cookie, vq) + RTE_PKTMBUF_HEADROOM;
+	start_dp[idx].len = cookie->buf_len - RTE_PKTMBUF_HEADROOM;
 	start_dp[idx].flags =  VRING_DESC_F_WRITE;
 	idx = start_dp[idx].next;
 	vq->vq_desc_head_idx = idx;
@@ -768,7 +765,7 @@  virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 		}
 
 		rxm->port = rxvq->port_id;
-		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM + hdr_size;
 		rxm->ol_flags = 0;
 		rxm->vlan_tci = 0;
 
@@ -778,7 +775,7 @@  virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 		rxm->data_len = (uint16_t)(len[i] - hdr_size);
 
 		hdr = (struct virtio_net_hdr *)((char *)rxm->buf_addr +
-			RTE_PKTMBUF_HEADROOM - hdr_size);
+			RTE_PKTMBUF_HEADROOM);
 
 		if (hw->vlan_strip)
 			rte_vlan_strip(rxm);
@@ -892,13 +889,13 @@  virtio_recv_mergeable_pkts(void *rx_queue,
 		}
 
 		header = (struct virtio_net_hdr_mrg_rxbuf *)((char *)rxm->buf_addr +
-			RTE_PKTMBUF_HEADROOM - hdr_size);
+			RTE_PKTMBUF_HEADROOM);
 		seg_num = header->num_buffers;
 
 		if (seg_num == 0)
 			seg_num = 1;
 
-		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM + hdr_size;
 		rxm->nb_segs = seg_num;
 		rxm->next = NULL;
 		rxm->ol_flags = 0;
@@ -944,7 +941,7 @@  virtio_recv_mergeable_pkts(void *rx_queue,
 			while (extra_idx < rcv_cnt) {
 				rxm = rcv_pkts[extra_idx];
 
-				rxm->data_off = RTE_PKTMBUF_HEADROOM - hdr_size;
+				rxm->data_off = RTE_PKTMBUF_HEADROOM;
 				rxm->next = NULL;
 				rxm->pkt_len = (uint32_t)(len[extra_idx]);
 				rxm->data_len = (uint16_t)(len[extra_idx]);