[dpdk-dev,v4] vhost_user: protect active rings from async ring changes

Message ID 991767555.2220884.1513860074866.JavaMail.zimbra@redhat.com (mailing list archive)
State Not Applicable, archived
Delegated to: Yuanhan Liu
Checks

ci/checkpatch: warning (coding style issues)

Commit Message

Victor Kaplansky Dec. 21, 2017, 12:41 p.m. UTC
  ----- Original Message -----
> From: "Stephen Hemminger" <stephen@networkplumber.org>
> To: "Victor Kaplansky" <vkaplans@redhat.com>
> Cc: dev@dpdk.org, stable@dpdk.org, "Jens Freimann" <jfreiman@redhat.com>, "Maxime Coquelin"
> <maxime.coquelin@redhat.com>, "Yuanhan Liu" <yliu@fridaylinux.org>, "Tiwei Bie" <tiwei.bie@intel.com>, "Jianfeng
> Tan" <jianfeng.tan@intel.com>
> Sent: Wednesday, December 20, 2017 10:19:45 PM
> Subject: Re: [dpdk-dev] [PATCH v4] vhost_user: protect active rings from async ring changes
> 
> On Wed, 20 Dec 2017 15:06:30 -0500 (EST)
> Victor Kaplansky <vkaplans@redhat.com> wrote:
> 
> > > Wrapping locking inline's adds nothing and makes life harder
> > > for static analysis tools.
> > 
> > Yep. In this case it inhibits the details of how the locking is
> > implemented (e.g. the name of the lock). It also facilitates
> > replacement of locking mechanism, by another implementation.
> > See below.
> 
> YAGNI You aren't gonna need it.
> 
> Don't build infrastructure for things that you forsee.

Good point, thanks. I'll simplify this.


> 
> > > 
> > > The bigger problem is that doing locking on all enqueue/dequeue
> > > can have a visible performance impact. Did you measure that?
> > > 
> > > Could you invent an RCUish mechanism using compiler barriers?
> > >   
> > 
> > I've played a bit with measuring performance impact. Successful
> > lock adds on the average about 30 cycles on my Haswell cpu.
> > (and it successes 99.999...% of time).
> > 
> > I can investigate it more, but my initial feeling is that adding a
> > memory barrier (the real one, not the compiler barrier) would add
> > about the same overhead.
> > 
> > By the way, the way update_queuing_status() in
> > drivers/net/vhost/rte_eth_vhost.c tries to avoid contention with
> > the active queue by playing with "allow_queuing" and "while_queuing"
> > seems to be broken, since memory barriers are missing.
> 
> CPU cycles alone don't matter on modern x86.
> What matters is cache and instructions per cycle.
> In this case locking requires locked instruction which causes the cpu
> prefetching and instruction pipeline to stall.
> 

I agree. I've measured the total overhead of the added lock/unlock pair, and
it comes to around 28 cycles per pair on my 3.5GHz Haswell.
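
A measurement can be done along these lines (a sketch only: it reuses the
rte_spinlock_t definitions from the test program below, doesn't subtract the
loop overhead, and assumes a constant-TSC x86 CPU):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>          /* __rdtsc() */

/* Time the cost of an uncontended lock/unlock pair.
 * Reuses rte_spinlock_t / rte_spinlock_lock / rte_spinlock_unlock
 * from the test program below; warm-up and serialization are omitted. */
static void
measure_lock_overhead(void)
{
        rte_spinlock_t sl = { 0 };
        const uint64_t iters = 100000000;
        uint64_t i, start, cycles;

        start = __rdtsc();
        for (i = 0; i < iters; i++) {
                rte_spinlock_lock(&sl);
                rte_spinlock_unlock(&sl);
        }
        cycles = __rdtsc() - start;

        printf("%.1f cycles per lock/unlock pair\n",
               (double)cycles / iters);
}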

From "Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3A: System Programming Guide, Part 1":

        In the Pentium 4, Intel Xeon, and P6 family processors, the
        locking operation is handled with either a cache lock or bus
        lock. If a memory access is cacheable and affects only a
        single cache line, a cache lock is invoked and the system
        bus and the actual memory location in system memory are not
        locked during the operation. Here, other Pentium 4, Intel
        Xeon, or P6 family processors on the bus write-back any
        modified data and invalidate their caches as necessary to
        maintain system memory coherency. If the memory access is
        not cacheable and/or it crosses a cache line boundary, the
        processor’s LOCK# signal is asserted and the processor does
        not respond to requests for bus control during the locked
        operation.

So the whole memory bus is locked only if the memory access is not cacheable
or crosses a cache line boundary.
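In our case the lock is a single naturally aligned 4-byte int, so it always
stays within one cache line and only a cache lock is taken. If we ever want
to make that explicit, something like this would do (illustrative only;
DPDK's __rte_cache_aligned has the same effect):

/* Illustrative: keep the lock within a single cache line, so the locked
 * xchg takes a cache lock rather than asserting LOCK# on the bus.
 * 64 bytes is the usual x86 cache line size. */
typedef struct {
        volatile int locked;
} __attribute__((aligned(64))) cache_aligned_spinlock_t;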

Anyway, I'm open to ways to reduce this overhead. This patch fixes a host of
critical bugs reported in Bugzilla, so if we could pull this fix in now and
optimize it later in a subsequent patch, that would be great.

See below a quick test program I've used to measure the overhead.
It also demonstrates the problem I'm trying to fix. Do you have any ideas
on using RCU, or other ways to reduce the overhead?

BTW, our implementation of rte_spinlock_unlock() could be slightly faster
if we used a regular move instead of an xchg instruction.
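
Something along these lines, as a sketch (x86-only, and the name is just a
placeholder): on x86 stores are not reordered with earlier loads or stores,
so a compiler barrier before a plain store is enough; other architectures
would need a real release barrier.

static inline void
rte_spinlock_unlock_mov(rte_spinlock_t *sl)
{
        /* Keep the critical section's accesses ordered before the store. */
        asm volatile ("" ::: "memory");   /* compiler barrier only */
        sl->locked = 0;                   /* plain mov instead of locked xchg */
}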

Also, our implementation of rte_spinlock_lock() could be faster if we
optimized it for the success path by making the conditional branch fall
through, or better yet, by reimplementing the spinlock using GCC builtins.
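
For instance, roughly like this (a sketch using GCC builtins and a
test-and-test-and-set loop; the name is a placeholder, not a proposed API):

static inline void
rte_spinlock_lock_tts(rte_spinlock_t *sl)
{
        /* __builtin_expect marks the uncontended path as the likely one,
         * so the compiler lays it out as the fall-through. */
        while (__builtin_expect(__atomic_exchange_n(&sl->locked, 1,
                                                    __ATOMIC_ACQUIRE), 0)) {
                /* Contended: spin read-only to avoid bouncing the cache
                 * line with locked operations, then retry the exchange. */
                while (__atomic_load_n(&sl->locked, __ATOMIC_RELAXED))
                        __builtin_ia32_pause();
        }
}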
  

Patch

#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

/* build me with:
      gcc -march=native -std=gnu11 -O3 tlock.c -o tlock -lpthread
*/

typedef struct {
        volatile int locked; /**< lock status 0 = unlocked, 1 = locked */
} rte_spinlock_t;

static inline void
rte_spinlock_lock(rte_spinlock_t *sl)
{
        int lock_val = 1;
        asm volatile (
                        "1:\n"
                        "xchg %[locked], %[lv]\n"
                        "test %[lv], %[lv]\n"
                        "jz 3f\n"
                        "2:\n"
                        "pause\n"
                        "cmpl $0, %[locked]\n"
                        "jnz 2b\n"
                        "jmp 1b\n"
                        "3:\n"
                        : [locked] "=m" (sl->locked), [lv] "=q" (lock_val)
                        : "[lv]" (lock_val)
                        : "memory");
}

static inline void
rte_spinlock_unlock(rte_spinlock_t *sl)
{
        int unlock_val = 0;
        asm volatile (
                        "xchg %[locked], %[ulv]\n"
                        : [locked] "=m" (sl->locked), [ulv] "=q" (unlock_val)
                        : "[ulv]" (unlock_val)
                        : "memory");
}

static unsigned * volatile pointer;
static rte_spinlock_t reader_access;

/* Control-path thread: periodically replaces the shared page under the
 * reader, the way vhost-user can replace a ring under the datapath.
 * Without the locking, main() can write to the old page after it has
 * been unmapped and crash. */
void *
worker(void *unused)
{
        (void)unused;
        while (1) {
                unsigned *new_pointer = (unsigned *) mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                unsigned *old_pointer = pointer;

                rte_spinlock_lock(&reader_access);
                pointer = new_pointer;
                rte_spinlock_unlock(&reader_access);

                munmap(old_pointer, 4096);

                usleep(10000);
        }
        return 0;
}

int main()
{
        pthread_t t;
        pointer = (unsigned *) mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        pthread_create(&t, 0, worker, NULL);

        /* Datapath loop: every access to the shared page is done under
         * the lock. */
        unsigned n = 400000000;

        while (n--) {
                rte_spinlock_lock(&reader_access);
                *pointer = 1;
                rte_spinlock_unlock(&reader_access);
        }

        pthread_cancel(t);
        return 0;
}