Bug 48
| Summary: | Unexpected performance regression since CVE-2018-1059 fix with vector path | | |
|---|---|---|---|
| Product: | DPDK | Reporter: | Maxime Coquelin (maxime.coquelin) |
| Component: | vhost/virtio | Assignee: | Maxime Coquelin (maxime.coquelin) |
| Status: | CONFIRMED | | |
| Severity: | normal | CC: | ajit.khaparde, john.mcnamara, maxime.coquelin, thomas, tiwei.bie |
| Priority: | Normal | | |
| Version: | 18.05 | | |
| Target Milestone: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
Description
Maxime Coquelin 2018-05-18 18:01:41 CEST
Maxime, I hope this is still a valid concern. Right?
Ferruh, if yes, what do we do? Who do you think can look at this?
Thanks,
Ajit

Hi Ajit,

I think this is still a valid concern. As far as I understand, the expected drop is not the result of a calculation but a rough estimation, so there is a chance the expectation is wrong and the current performance drop is unavoidable. Still, I believe it is worth investigating this further, looking for more optimizations, and confirming whether the drop really is unavoidable. I think Maxime and Tiwei are the right people for this. Marvin (Marvin Liu <yong.liu@intel.com>) is also very interested in virtio performance numbers, so it may be good to include him too.

Hi,

I managed to reproduce the performance regression on my testbed. The setup consists in doing an io loop using DPDK v18.08 in the guest, with the host running DPDK from before and after the CVE series:

- "orig" refers to commit 91c6de7eb7ac, before the CVE series.
- "cve" refers to commit 9553e6e40888, after the CVE series.

mergeable: orig/cve
64B:   12.79/12.56
128B:  12.22/11.52
260B:  10.02/9.72
520B:   8.28/8.15
1024B:  5.99/5.91
1500B:  4.60/4.53

non-mergeable: orig/cve
64B:   12.68/12.56
128B:  12.21/11.55
260B:  10.00/9.66
520B:   8.36/8.10
1024B:  5.95/5.92
1500B:  4.61/4.54

I ran perf on the host side to find where the regression happens (perf record -C <PMD cpu> -e cycles:pp). The profiling below is for the 128B packet size, the one showing the biggest regression.

Perf annotate for "orig":

 0.03 │  add    %ax,0x1e(%rbx)
      │ update_used_idx():
      │         if (unlikely(count == 0))
 0.00 │  test   %edx,%edx
      │ ↓ je    1806
      │         vq->used->idx += count;
      │  mov    0xa0(%rsp),%rbx
 0.03 │  mov    0x10(%rbx),%rdx
 0.51 │  add    %ax,0x2(%rdx)
      │ vhost_log_used_vring():
      │         vhost_log_write(dev, vq->log_guest_addr + offset, len);
      │  mov    0x40(%rbx),%rax
      │ vhost_log_write():
      │         if (likely(((dev->features & (1ULL << VHOST_F_LOG_ALL)) == 0) ||
 0.02 │  mov    0xb8(%rsp),%rbx
      │  testb  $0x4,0xb(%rbx)
 0.02 │ ↓ jne   173c
      │ _mm_mfence():
      │ }
      │
      │ extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
      │ _mm_mfence (void)
      │ {
      │         __builtin_ia32_mfence ();
 3.55 │ 546: mfence
      │ vhost_vring_call():
      │ {
      │         /* Flush used->idx update before we read avail->flags. */
      │         rte_mb();
      │
      │         /* Don't kick guest if we don't reach index specified by guest. */
      │         if (dev->features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
 0.23 │  mov    0xb8(%rsp),%rax
 0.15 │  mov    0x8(%rax),%rax
      │  test   $0x20000000,%eax
 0.03 │ ↓ je    16fe
      │                 uint16_t old = vq->signalled_used;
      │                 uint16_t new = vq->last_used_idx;
      │  mov    0xa0(%rsp),%rbx
      │
      │                 VHOST_LOG_DEBUG(VHOST_DATA, "%s: used_event_idx=%d, old=%d, new=%d\n",
      │                         __func__,
      │                         vhost_used_event(vq),
      │                         old, new);
      │                 if (vhost_need_event(vhost_used_event(vq), new, old)
      │  mov    0x18(%rbx),%esi
      │  mov    0x8(%rbx),%rcx
      │                 uint16_t new = vq->last_used_idx;
      │  movzwl 0x1e(%rbx),%edx
      │                 if (vhost_need_event(vhost_used_event(vq), new, old)
      │  movzwl 0x4(%rcx,%rsi,2),%esi
      │ vhost_need_event():
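For readers less used to interleaved annotate output: the samples in the "orig" profile fall in the used-ring flush and guest-notification path. Below is a rough C sketch of that logic, for illustration only: the struct and function names are simplified stand-ins rather than the actual DPDK definitions (which live under lib/librte_vhost), and vhost_used_event() normally reads the event index from the avail ring instead of taking it as a parameter.

```c
#include <stdint.h>

/* Simplified stand-ins for the real vhost structures. */
struct vring_used_hdr {
	uint16_t flags;
	uint16_t idx;	/* the field bumped by update_used_idx() */
};

struct vq_sketch {
	struct vring_used_hdr *used;
	uint16_t last_used_idx;
	uint16_t signalled_used;
};

#define VIRTIO_RING_F_EVENT_IDX 29	/* bit 29: the "test $0x20000000" above */

/* vring_need_event() as defined by the virtio spec: notify only if the
 * index the guest asked to be woken at lies in the window (old, new]. */
static inline int
need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

/* Rough equivalent of the code path sampled in the "orig" profile. */
static void
flush_used_and_maybe_kick(uint64_t features, struct vq_sketch *vq,
			  uint16_t count, uint16_t used_event)
{
	if (count == 0)
		return;

	vq->used->idx += count;		/* update_used_idx() */

	__sync_synchronize();		/* rte_mb(): the mfence at 3.55% above */

	if (features & (1ULL << VIRTIO_RING_F_EVENT_IDX)) {
		uint16_t old = vq->signalled_used;
		uint16_t new_idx = vq->last_used_idx;

		if (need_event(used_event, new_idx, old)) {
			/* the real code writes to the virtqueue's callfd here */
		}
	}
}
```

This notification path itself is not where the regression shows up; the relevant difference appears in the copy loop profiled next.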
Perf annotate for "cve":

      │ e12: test   %rcx,%rcx
      │ ↑ je    b30
      │ _mm256_loadu_si256():
      │         return (__m256i) __builtin_ia32_loaddqu256 ((char const *)__P);
      │  vmovdqu -0x20(%r8,%rcx,1),%ymm0
      │ _mm256_storeu_si256():
      │         __builtin_ia32_storedqu256 ((char *)__P, (__v32qi)__A);
      │  vmovdqu %ymm0,-0x20(%rdx,%rcx,1)
      │ ↑ jmpq  b30
      │  nop
      │ rte_memcpy_aligned():
      │         if (n <= 32) {
 0.06 │ e30: cmp    $0x20,%rax
      │ ↓ jbe   fd0
      │         if (n <= 64) {
 0.03 │  cmp    $0x40,%rax
      │ ↓ jbe   1058
 0.10 │  lea    0x20(%rdx),%r8
 0.65 │  mov    %rdx,%r9
 0.07 │  mov    %rax,%rsi
      │ copy_desc_to_mbuf():
      │         rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
 0.02 │  mov    %rdi,%rcx
 0.08 │  sub    %rdi,%r9
 0.68 │  sub    %rdi,%r8
 0.04 │  nop
      │ _mm256_loadu_si256():
      │         return (__m256i) __builtin_ia32_loaddqu256 ((char const *)__P);
15.29 │ e60: vmovdqu (%rcx),%ymm0
      │ rte_memcpy_aligned():
      │         for (; n >= 64; n -= 64) {
 0.04 │  sub    $0x40,%rsi
      │         src = (const uint8_t *)src + 64;
 0.81 │  add    $0x40,%rcx
      │ _mm256_storeu_si256():
      │         __builtin_ia32_storedqu256 ((char *)__P, (__v32qi)__A);
 2.26 │  vmovdqu %ymm0,-0x40(%rcx,%r9,1)
      │ _mm256_loadu_si256():
      │         return (__m256i) __builtin_ia32_loaddqu256 ((char const *)__P);
 0.44 │  vmovdqu -0x20(%rcx),%ymm0
      │ _mm256_storeu_si256():
      │         __builtin_ia32_storedqu256 ((char *)__P, (__v32qi)__A);
 1.85 │  vmovdqu %ymm0,-0x40(%rcx,%r8,1)
      │ rte_memcpy_aligned():
      │         for (; n >= 64; n -= 64) {
 0.74 │  cmp    $0x3f,%rsi
 0.15 │ ↑ ja    e60
 0.12 │  lea    -0x40(%rax),%rsi
      │ _mm256_storeu_si256():
 0.08 │  sub    %eax,%r12d
 0.05 │  shr    $0x6,%rsi
 0.62 │  lea    0x1(%rsi),%rcx
 0.13 │  neg    %rsi
 0.05 │  shl    $0x6,%rsi
 0.09 │  shl    $0x6,%rcx
      │ rte_memcpy_aligned():
      │         (const uint8_t *)src - 64 + n);

From the profiling results above, it seems that the regression happens in the dequeue path, when copying the packet content from the descriptor buffer to the mbuf one. The generated assembly changes quite a lot before and after the CVE series.

As the desc buffer is prefetched just before the packet copy, I wondered whether the prefetch could have ended up closer to the vmovdqu instruction, leaving it less time to take effect. To check this hypothesis, I commented out the prefetch in both the orig and cve cases to see whether their benchmark results would get closer. This is not the case: performance remains the same as in the results above.

The next step would be to compare hardware performance counters in both cases. If you have other ideas, please let me know, I have the setup in place.
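For reference, the hotspot in the "cve" profile sits in the 64-byte copy loop of rte_memcpy_aligned(), inlined into copy_desc_to_mbuf(). The sketch below is only a simplified model of the copy pattern as it appears in the annotate output (small-size branches, 64-byte main loop, overlapping 64-byte tail); it is not the actual rte_memcpy implementation, and the function name is made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <immintrin.h>

/* Simplified model of the copy pattern visible in the "cve" annotate
 * output above; NOT the real rte_memcpy_aligned() from rte_memcpy.h. */
static void
copy_like_rte_memcpy_aligned(uint8_t *dst, const uint8_t *src, size_t n)
{
	if (n <= 32) {			/* "if (n <= 32)" branch at e30 */
		memcpy(dst, src, n);	/* the real code uses small vector copies */
		return;
	}
	if (n <= 64) {			/* two possibly overlapping 32B copies */
		_mm256_storeu_si256((__m256i *)dst,
			_mm256_loadu_si256((const __m256i *)src));
		_mm256_storeu_si256((__m256i *)(dst + n - 32),
			_mm256_loadu_si256((const __m256i *)(src + n - 32)));
		return;
	}

	/* Main loop: 64 bytes per iteration with two 32-byte vmovdqu
	 * load/store pairs; the 15.29% sample above is the first load. */
	for (; n >= 64; n -= 64, src += 64, dst += 64) {
		_mm256_storeu_si256((__m256i *)dst,
			_mm256_loadu_si256((const __m256i *)src));
		_mm256_storeu_si256((__m256i *)(dst + 32),
			_mm256_loadu_si256((const __m256i *)(src + 32)));
	}

	/* Tail: re-copy the last 64 bytes of the region, overlapping what the
	 * loop already wrote ("(const uint8_t *)src - 64 + n" in the annotate). */
	_mm256_storeu_si256((__m256i *)(dst - 64 + n),
		_mm256_loadu_si256((const __m256i *)(src - 64 + n)));
	_mm256_storeu_si256((__m256i *)(dst - 32 + n),
		_mm256_loadu_si256((const __m256i *)(src - 32 + n)));
}
```

As a side note, the prefetch mentioned above is presumably a rte_prefetch0() call, which on x86 compiles to a single prefetcht0 hint, so commenting it out only removes one instruction ahead of this loop; that makes the proposed comparison of hardware performance counters a reasonable next step.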