A small performance drop was expected when integrating the CVE-2018-1059 fix. While this expectation is verified for most cases (less than 3% drop), a bigger drop is seen when using the vector path in the guest with small packets. The goal of this ticket is to see whether this performance regression can be improved.

For reference, below is a copy of the mail from Lei Yao reporting the issue:

During the 18.05-rc1 performance testing, I found that this patch set brings a slight performance drop on the mergeable and normal paths, and a big performance drop on the vector path. Could you have a check on this? I know this patch is important for security. Not sure if there is any way to improve the performance.

Mergeable path:
  64B:    0.80%
  128B:  -2.75%
  260B:  -2.93%
  520B:  -2.72%
  1024B: -1.18%
  1500B: -0.65%

Normal path:
  64B:   -1.47%
  128B:  -7.43%
  260B:  -3.66%
  520B:  -2.52%
  1024B: -1.19%
  1500B: -0.78%

Vector path:
  64B:   -8.60%
  128B:  -3.54%
  260B:  -2.63%
  520B:  -6.12%
  1024B: -1.05%
  1500B: -1.20%

CPU info: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
OS: Ubuntu 16.04
Maxime, I hope this is still a valid concern, right?
Ferruh, if yes, what do we do? Who do you think can look at this?

Thanks,
Ajit
Hi Ajit,

I think this is still a valid concern. As far as I understand, the expected drop was not the result of a calculation but a rough estimation, so there is a chance that the expectation is wrong and the current performance drop is unavoidable. Still, I believe it is worth investigating this more, looking for further optimizations, and confirming whether the drop really is unavoidable.

I think Maxime and Tiwei are the right people for this. Also, Marvin (Marvin Liu <yong.liu@intel.com>) is very interested in virtio performance numbers, so it may be good to include him too.
Hi,

I managed to reproduce the performance regression on my testbed. The setup consists of an IO loop using DPDK v18.08 in the guest, with DPDK on the host at versions before and after the CVE series:

"orig" refers to commit 91c6de7eb7ac, before the CVE series.
"cve" refers to commit 9553e6e40888, after the CVE series.

Mergeable (orig/cve):
  64B:   12.79/12.56
  128B:  12.22/11.52
  260B:  10.02/9.72
  520B:   8.28/8.15
  1024B:  5.99/5.91
  1500B:  4.60/4.53

Non-mergeable (orig/cve):
  64B:   12.68/12.56
  128B:  12.21/11.55
  260B:  10.00/9.66
  520B:   8.36/8.10
  1024B:  5.95/5.92
  1500B:  4.61/4.54

I ran perf on the host side to find where the regression happens (perf record -C <PMD cpu> -e cycles:pp). The profiling was done with the 128B packet size, the one showing the biggest regression.

Perf annotate for "orig":

 0.03 │        add    %ax,0x1e(%rbx)
      │      update_used_idx():
      │              if (unlikely(count == 0))
 0.00 │        test   %edx,%edx
      │      ↓ je     1806
      │              vq->used->idx += count;
      │        mov    0xa0(%rsp),%rbx
 0.03 │        mov    0x10(%rbx),%rdx
 0.51 │        add    %ax,0x2(%rdx)
      │      vhost_log_used_vring():
      │              vhost_log_write(dev, vq->log_guest_addr + offset, len);
      │        mov    0x40(%rbx),%rax
      │      vhost_log_write():
      │              if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
 0.02 │        mov    0xb8(%rsp),%rbx
      │        testb  $0x4,0xb(%rbx)
 0.02 │      ↓ jne    173c
      │      _mm_mfence():
      │      extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
      │      _mm_mfence (void)
      │      {
      │              __builtin_ia32_mfence ();
 3.55 │ 546:   mfence
      │      vhost_vring_call():
      │      {
      │              /* Flush used->idx update before we read avail->flags. */
      │              rte_mb();
      │
      │              /* Don't kick guest if we don't reach index specified by guest. */
      │              if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
 0.23 │        mov    0xb8(%rsp),%rax
 0.15 │        mov    0x8(%rax),%rax
      │        test   $0x20000000,%eax
 0.03 │      ↓ je     16fe
      │              uint16_t old = vq->signalled_used;
      │              uint16_t new = vq->last_used_idx;
      │        mov    0xa0(%rsp),%rbx
      │
      │              VHOST_LOG_DEBUG(VHOST_DATA, "%s: used_event_idx=%d, old=%d, new=%d\n",
      │                      __func__,
      │                      vhost_used_event(vq),
      │                      old, new);
      │              if (vhost_need_event(vhost_used_event(vq), new, old)
      │        mov    0x18(%rbx),%esi
      │        mov    0x8(%rbx),%rcx
      │              uint16_t new = vq->last_used_idx;
      │        movzwl 0x1e(%rbx),%edx
      │              if (vhost_need_event(vhost_used_event(vq), new, old)
      │        movzwl 0x4(%rcx,%rsi,2),%esi
      │      vhost_need_event():

Perf annotate for "cve":

      │ e12:   test   %rcx,%rcx
      │      ↑ je     b30
      │      _mm256_loadu_si256():
      │        return (__m256i) __builtin_ia32_loaddqu256 ((char const *)__P);
      │        vmovdqu -0x20(%r8,%rcx,1),%ymm0
      │      _mm256_storeu_si256():
      │        __builtin_ia32_storedqu256 ((char *)__P, (__v32qi)__A);
      │        vmovdqu %ymm0,-0x20(%rdx,%rcx,1)
      │      ↑ jmpq   b30
      │        nop
      │      rte_memcpy_aligned():
      │              if (n <= 32) {
 0.06 │ e30:   cmp    $0x20,%rax
      │      ↓ jbe    fd0
      │              if (n <= 64) {
 0.03 │        cmp    $0x40,%rax
      │      ↓ jbe    1058
 0.10 │        lea    0x20(%rdx),%r8
 0.65 │        mov    %rdx,%r9
 0.07 │        mov    %rax,%rsi
      │      copy_desc_to_mbuf():
      │              rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
 0.02 │        mov    %rdi,%rcx
 0.08 │        sub    %rdi,%r9
 0.68 │        sub    %rdi,%r8
 0.04 │        nop
      │      _mm256_loadu_si256():
      │        return (__m256i) __builtin_ia32_loaddqu256 ((char const *)__P);
15.29 │ e60:   vmovdqu (%rcx),%ymm0
      │      rte_memcpy_aligned():
      │              for (; n >= 64; n -= 64) {
 0.04 │        sub    $0x40,%rsi
      │              src = (const uint8_t *)src + 64;
 0.81 │        add    $0x40,%rcx
      │      _mm256_storeu_si256():
      │        __builtin_ia32_storedqu256 ((char *)__P, (__v32qi)__A);
 2.26 │        vmovdqu %ymm0,-0x40(%rcx,%r9,1)
      │      _mm256_loadu_si256():
      │        return (__m256i) __builtin_ia32_loaddqu256 ((char const *)__P);
 0.44 │        vmovdqu -0x20(%rcx),%ymm0
      │      _mm256_storeu_si256():
      │        __builtin_ia32_storedqu256 ((char *)__P, (__v32qi)__A);
 1.85 │        vmovdqu %ymm0,-0x40(%rcx,%r8,1)
      │      rte_memcpy_aligned():
      │              for (; n >= 64; n -= 64) {
 0.74 │        cmp    $0x3f,%rsi
 0.15 │      ↑ ja     e60
 0.12 │        lea    -0x40(%rax),%rsi
      │      _mm256_storeu_si256():
 0.08 │        sub    %eax,%r12d
 0.05 │        shr    $0x6,%rsi
 0.62 │        lea    0x1(%rsi),%rcx
 0.13 │        neg    %rsi
 0.05 │        shl    $0x6,%rsi
 0.09 │        shl    $0x6,%rcx
      │      rte_memcpy_aligned():
      │              (const uint8_t *)src - 64 + n);

From the profiling results above, it seems that the regression happens in the dequeue path, when copying the packet content from the desc buffer to the mbuf one. The generated assembly changes quite a lot before and after the CVE series.

As the desc buffer is prefetched just before the packet copy, I wondered whether the prefetch could have moved closer to the vmovdqu instruction. To check this hypothesis, I commented out the prefetch in both the orig and cve cases to see if their benchmark results would get closer. This is not the case: performance remains the same as in the perf results above.

The next step would be to compare hardware performance counters in both cases. If you have other ideas, please let me know; I have the setup in place.