Bug 48
| Summary: | Unexpected performance regression since CVE-2018-1059 fix with vector path | | |
|---|---|---|---|
| Product: | DPDK | Reporter: | Maxime Coquelin (maxime.coquelin) |
| Component: | vhost/virtio | Assignee: | Maxime Coquelin (maxime.coquelin) |
| Status: | CONFIRMED | | |
| Severity: | normal | CC: | ajit.khaparde, john.mcnamara, maxime.coquelin, thomas, tiwei.bie |
| Priority: | Normal | | |
| Version: | 18.05 | | |
| Target Milestone: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
Description
Maxime Coquelin 2018-05-18 18:01:41 CEST
Maxime, I hope this is still a valid concern. Right?
Ferruh, if yes, what do we do? Who do you think can look at this?
Thanks,
Ajit

Hi Ajit,

I think this is still a valid concern. As far as I understand, the expected drop is not the result of a calculation but a rough estimation, so there is a chance the expectation is wrong and the current performance drop is unavoidable. Still, I believe it is worth investigating this further, looking for more optimizations, and confirming whether the drop really is unavoidable. I think Maxime and Tiwei are the right people for this. Marvin (Marvin Liu <yong.liu@intel.com>) is also very interested in virtio performance numbers, so it may be good to include him too.

Hi,

I managed to reproduce the performance regression on my testbed. The setup consists in doing an io loop using DPDK v18.08 in the guest, with the host running DPDK from before and after the CVE series:

- "orig" refers to commit 91c6de7eb7ac, before the CVE series.
- "cve" refers to commit 9553e6e40888, after the CVE series.

mergeable: orig/cve
64B:   12.79/12.56
128B:  12.22/11.52
260B:  10.02/9.72
520B:   8.28/8.15
1024B:  5.99/5.91
1500B:  4.60/4.53

non-mergeable: orig/cve
64B:   12.68/12.56
128B:  12.21/11.55
260B:  10.00/9.66
520B:   8.36/8.10
1024B:  5.95/5.92
1500B:  4.61/4.54

I ran perf on the host side to find where the regression happens (perf record -C <PMD cpu> -e cycles:pp). The profiling below is for the 128B packet size, the one showing the biggest regression.

Perf annotate for "orig":

 0.03 │  add    %ax,0x1e(%rbx)
      │ update_used_idx():
      │         if (unlikely(count == 0))
 0.00 │  test   %edx,%edx
      │ ↓ je    1806
      │         vq->used->idx += count;
      │  mov    0xa0(%rsp),%rbx
 0.03 │  mov    0x10(%rbx),%rdx
 0.51 │  add    %ax,0x2(%rdx)
      │ vhost_log_used_vring():
      │         vhost_log_write(dev, vq->log_guest_addr + offset, len);
      │  mov    0x40(%rbx),%rax
      │ vhost_log_write():
      │         if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
 0.02 │  mov    0xb8(%rsp),%rbx
      │  testb  $0x4,0xb(%rbx)
 0.02 │ ↓ jne   173c
      │ _mm_mfence():
      │ }
      │
      │ extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
      │ _mm_mfence (void)
      │ {
      │         __builtin_ia32_mfence ();
 3.55 │ 546: mfence
      │ vhost_vring_call():
      │ {
      │         /* Flush used->idx update before we read avail->flags. */
      │         rte_mb();
      │
      │         /* Don't kick guest if we don't reach index specified by guest. */
      │         if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
 0.23 │  mov    0xb8(%rsp),%rax
 0.15 │  mov    0x8(%rax),%rax
      │  test   $0x20000000,%eax
 0.03 │ ↓ je    16fe
      │                 uint16_t old = vq->signalled_used;
      │                 uint16_t new = vq->last_used_idx;
      │  mov    0xa0(%rsp),%rbx
      │
      │                 VHOST_LOG_DEBUG(VHOST_DATA, "%s: used_event_idx=%d, old=%d, new=%d\n",
      │                         __func__,
      │                         vhost_used_event(vq),
      │                         old, new);
      │                 if (vhost_need_event(vhost_used_event(vq), new, old)
      │  mov    0x18(%rbx),%esi
      │  mov    0x8(%rbx),%rcx
      │                 uint16_t new = vq->last_used_idx;
      │  movzwl 0x1e(%rbx),%edx
      │                 if (vhost_need_event(vhost_used_event(vq), new, old)
      │  movzwl 0x4(%rcx,%rsi,2),%esi
      │ vhost_need_event():
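For readers less used to interleaved annotate output: the samples in the "orig" profile fall in the used-ring flush and guest-notification path. Below is a rough C sketch of that logic, for illustration only: the struct and function names are simplified stand-ins rather than the actual DPDK definitions (which live under lib/librte_vhost), and vhost_used_event() normally reads the event index from the avail ring instead of taking it as a parameter.

```c
#include <stdint.h>

/* Simplified stand-ins for the real vhost structures. */
struct vring_used_hdr {
	uint16_t flags;
	uint16_t idx;	/* the field bumped by update_used_idx() */
};

struct vq_sketch {
	struct vring_used_hdr *used;
	uint16_t last_used_idx;
	uint16_t signalled_used;
};

#define VIRTIO_RING_F_EVENT_IDX 29	/* bit 29: the "test $0x20000000" above */

/* vring_need_event() as defined by the virtio spec: notify only if the
 * index the guest asked to be woken at lies in the window (old, new]. */
static inline int
need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

/* Rough equivalent of the code path sampled in the "orig" profile. */
static void
flush_used_and_maybe_kick(uint64_t features, struct vq_sketch *vq,
			  uint16_t count, uint16_t used_event)
{
	if (count == 0)
		return;

	vq->used->idx += count;		/* update_used_idx() */

	__sync_synchronize();		/* rte_mb(): the mfence at 3.55% above */

	if (features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
		uint16_t old = vq->signalled_used;
		uint16_t new_idx = vq->last_used_idx;

		if (need_event(used_event, new_idx, old)) {
			/* the real code writes to the virtqueue's callfd here */
		}
	}
}
```

This notification path itself is not where the regression shows up; the relevant difference appears in the copy loop profiled next.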
Perf annotate for "cve":

      │ e12: test   %rcx,%rcx
      │ ↑ je    b30
      │ _mm256_loadu_si256():
      │         return (__m256i) __builtin_ia32_loaddqu256 ((char const *)__P);
      │  vmovdqu -0x20(%r8,%rcx,1),%ymm0
      │ _mm256_storeu_si256():
      │         __builtin_ia32_storedqu256 ((char *)__P, (__v32qi)__A);
      │  vmovdqu %ymm0,-0x20(%rdx,%rcx,1)
      │ ↑ jmpq  b30
      │  nop
      │ rte_memcpy_aligned():
      │         if (n <= 32) {
 0.06 │ e30: cmp    $0x20,%rax
      │ ↓ jbe   fd0
      │         if (n <= 64) {
 0.03 │  cmp    $0x40,%rax
      │ ↓ jbe   1058
 0.10 │  lea    0x20(%rdx),%r8
 0.65 │  mov    %rdx,%r9
 0.07 │  mov    %rax,%rsi
      │ copy_desc_to_mbuf():
      │         rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
 0.02 │  mov    %rdi,%rcx
 0.08 │  sub    %rdi,%r9
 0.68 │  sub    %rdi,%r8
 0.04 │  nop
      │ _mm256_loadu_si256():
      │         return (__m256i) __builtin_ia32_loaddqu256 ((char const *)__P);
15.29 │ e60: vmovdqu (%rcx),%ymm0
      │ rte_memcpy_aligned():
      │         for (; n >= 64; n -= 64) {
 0.04 │  sub    $0x40,%rsi
      │         src = (const uint8_t *)src + 64;
 0.81 │  add    $0x40,%rcx
      │ _mm256_storeu_si256():
      │         __builtin_ia32_storedqu256 ((char *)__P, (__v32qi)__A);
 2.26 │  vmovdqu %ymm0,-0x40(%rcx,%r9,1)
      │ _mm256_loadu_si256():
      │         return (__m256i) __builtin_ia32_loaddqu256 ((char const *)__P);
 0.44 │  vmovdqu -0x20(%rcx),%ymm0
      │ _mm256_storeu_si256():
      │         __builtin_ia32_storedqu256 ((char *)__P, (__v32qi)__A);
 1.85 │  vmovdqu %ymm0,-0x40(%rcx,%r8,1)
      │ rte_memcpy_aligned():
      │         for (; n >= 64; n -= 64) {
 0.74 │  cmp    $0x3f,%rsi
 0.15 │ ↑ ja    e60
 0.12 │  lea    -0x40(%rax),%rsi
      │ _mm256_storeu_si256():
 0.08 │  sub    %eax,%r12d
 0.05 │  shr    $0x6,%rsi
 0.62 │  lea    0x1(%rsi),%rcx
 0.13 │  neg    %rsi
 0.05 │  shl    $0x6,%rsi
 0.09 │  shl    $0x6,%rcx
      │ rte_memcpy_aligned():
      │         (const uint8_t *)src - 64 + n);

From the profiling results above, it seems that the regression happens in the dequeue path, when copying the packet content from the descriptor buffer to the mbuf one. The generated assembly changes quite a lot before and after the CVE series.

As the desc buffer is prefetched just before the packet copy, I wondered whether the prefetch could have ended up closer to the vmovdqu instruction, leaving it less time to take effect. To check this hypothesis, I commented out the prefetch in both the orig and cve cases to see whether their benchmark results would get closer. This is not the case: performance remains the same as in the results above.

The next step would be to compare hardware performance counters in both cases. If you have other ideas, please let me know, I have the setup in place.
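For reference, the hotspot in the "cve" profile sits in the 64-byte copy loop of rte_memcpy_aligned(), inlined into copy_desc_to_mbuf(). The sketch below is only a simplified model of the copy pattern as it appears in the annotate output (small-size branches, 64-byte main loop, overlapping 64-byte tail); it is not the actual rte_memcpy implementation, and the function name is made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <immintrin.h>

/* Simplified model of the copy pattern visible in the "cve" annotate
 * output above; NOT the real rte_memcpy_aligned() from rte_memcpy.h. */
static void
copy_like_rte_memcpy_aligned(uint8_t *dst, const uint8_t *src, size_t n)
{
	if (n <= 32) {			/* "if (n <= 32)" branch at e30 */
		memcpy(dst, src, n);	/* the real code uses small vector copies */
		return;
	}
	if (n <= 64) {			/* two possibly overlapping 32B copies */
		_mm256_storeu_si256((__m256i *)dst,
			_mm256_loadu_si256((const __m256i *)src));
		_mm256_storeu_si256((__m256i *)(dst + n - 32),
			_mm256_loadu_si256((const __m256i *)(src + n - 32)));
		return;
	}

	/* Main loop: 64 bytes per iteration with two 32-byte vmovdqu
	 * load/store pairs; the 15.29% sample above is the first load. */
	for (; n >= 64; n -= 64, src += 64, dst += 64) {
		_mm256_storeu_si256((__m256i *)dst,
			_mm256_loadu_si256((const __m256i *)src));
		_mm256_storeu_si256((__m256i *)(dst + 32),
			_mm256_loadu_si256((const __m256i *)(src + 32)));
	}

	/* Tail: re-copy the last 64 bytes of the region, overlapping what the
	 * loop already wrote ("(const uint8_t *)src - 64 + n" in the annotate). */
	_mm256_storeu_si256((__m256i *)(dst - 64 + n),
		_mm256_loadu_si256((const __m256i *)(src - 64 + n)));
	_mm256_storeu_si256((__m256i *)(dst - 32 + n),
		_mm256_loadu_si256((const __m256i *)(src - 32 + n)));
}
```

As a side note, the prefetch mentioned above is presumably a rte_prefetch0() call, which on x86 compiles to a single prefetcht0 hint, so commenting it out only removes one instruction ahead of this loop; that makes the proposed comparison of hardware performance counters a reasonable next step.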