[dpdk-dev] Proposal of mbuf Race Condition Detection

Ananyev, Konstantin konstantin.ananyev at intel.com
Fri Apr 14 01:19:45 CEST 2017



> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of jigsaw
> Sent: Thursday, April 13, 2017 9:59 PM
> To: Adrien Mazarguil <adrien.mazarguil at 6wind.com>
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] Proposal of mbuf Race Condition Detection
> 
> Hi Adrien,
> 
> Thanks for your comment.
> 
> The LOCK/UNLOCK pair may be called by the user application only, for
> several reasons.
> 
> 1. If a library calls LOCK, the user application may be punished
> unexpectedly. Consider what happens if the Rx burst function takes the LOCK
> on core #1, and the mbuf is then handed over from core #1 to core #2 through
> an enqueue/dequeue operation, as below (a code sketch of this handoff
> follows these three points):
> 
> Rx(1) --> Enqueue(1) --> Dequeue(2)
> LOCKED                       Panic!
> 
> Core #2 will then panic, because the mbuf is still owned by core #1, which
> never released it.
> 
> 2. Rx and Tx both usually work in burst mode, combined with prefetch
> operations. LOCK and UNLOCK do not work well with prefetching, because they
> require access to the mbuf header data.
> 
> 3. The critical section tends to be small. Here we (the user application)
> need to find a balance: a longer critical section gives better-protected
> code, but a longer duration also leads to more complicated business logic.
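> 
> As an illustration of point 1, here is a minimal sketch of the handoff,
> assuming a plain rte_ring between the two cores (the function names and the
> one-packet burst are made up for brevity):
> 
>     #include <rte_ethdev.h>
>     #include <rte_mbuf.h>
>     #include <rte_ring.h>
> 
>     /* core #1: receive and hand the mbuf over through a ring */
>     void
>     rx_and_enqueue(struct rte_ring *ring, uint8_t port, uint16_t queue)
>     {
>         struct rte_mbuf *m;
> 
>         if (rte_eth_rx_burst(port, queue, &m, 1) == 1) {
>             /* if rte_eth_rx_burst() took RTE_MBUF_LOCK(m) internally,
>              * core #1 would still hold the lock at this point */
>             rte_ring_enqueue(ring, m);
>         }
>     }
> 
>     /* core #2: pick the mbuf up and touch its data */
>     void
>     dequeue_and_use(struct rte_ring *ring)
>     {
>         void *obj;
> 
>         if (rte_ring_dequeue(ring, &obj) == 0) {
>             struct rte_mbuf *m = obj;
> 
>             /* buf_addr was NULL-ed by the lock taken on core #1, so this
>              * perfectly legal access panics on core #2 */
>             rte_pktmbuf_mtod(m, char *)[0] = 0;
>         }
>     }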
> 
> Hence, we urge the user application to use LOCK/UNLOCK with common sense
> about critical sections.
> 
> Take examples/l3fwd as an example. A proper LOCK/UNLOCK pairing is shown
> below:
> 
> ==========================================
> diff --git a/examples/l3fwd/l3fwd_em_hlm_sse.h
> b/examples/l3fwd/l3fwd_em_hlm_sse.h
> index 7714a20..9db0190 100644
> --- a/examples/l3fwd/l3fwd_em_hlm_sse.h
> +++ b/examples/l3fwd/l3fwd_em_hlm_sse.h
> @@ -299,6 +299,15 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf
> **pkts_burst,
> 
>         for (j = 0; j < n; j += 8) {
> 
> +            RTE_MBUF_LOCK(pkts_burst[j]);
> +            RTE_MBUF_LOCK(pkts_burst[j+1]);
> +            RTE_MBUF_LOCK(pkts_burst[j+2]);
> +            RTE_MBUF_LOCK(pkts_burst[j+3]);
> +            RTE_MBUF_LOCK(pkts_burst[j+4]);
> +            RTE_MBUF_LOCK(pkts_burst[j+5]);
> +            RTE_MBUF_LOCK(pkts_burst[j+6]);
> +            RTE_MBUF_LOCK(pkts_burst[j+7]);
> +
>                 uint32_t pkt_type =
>                         pkts_burst[j]->packet_type &
>                         pkts_burst[j+1]->packet_type &
> @@ -333,8 +342,10 @@ l3fwd_em_send_packets(int nb_rx, struct rte_mbuf
> **pkts_burst,
>                 }
>         }
> 
> -       for (; j < nb_rx; j++)
> +      for (; j < nb_rx; j++) {
> +              RTE_MBUF_LOCK(pkts_burst[j]);
>                 dst_port[j] = em_get_dst_port(qconf, pkts_burst[j], portid);
> +      }
> 
>         send_packets_multi(qconf, pkts_burst, dst_port, nb_rx);
> 
> diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> index 1afa1f0..2938558 100644
> --- a/examples/l3fwd/l3fwd_sse.h
> +++ b/examples/l3fwd/l3fwd_sse.h
> @@ -322,6 +322,9 @@ send_packetsx4(struct lcore_conf *qconf, uint8_t port,
> struct rte_mbuf *m[],
> 
>         len = qconf->tx_mbufs[port].len;
> 
> +      for (j = 0; j < num; ++j)
> +            RTE_MBUF_UNLOCK(m[j]);
> +
>         /*
>          * If TX buffer for that queue is empty, and we have enough packets,
>          * then send them straightway.
> @@ -492,8 +495,10 @@ send_packets_multi(struct lcore_conf *qconf, struct
> rte_mbuf **pkts_burst,
>                 if (likely(pn != BAD_PORT))
>                         send_packetsx4(qconf, pn, pkts_burst + j, k);
>                 else
> -                       for (m = j; m != j + k; m++)
> +                      for (m = j; m != j + k; m++) {
> +                              RTE_MBUF_UNLOCK(pkts_burst[m]);
>                                 rte_pktmbuf_free(pkts_burst[m]);
> +                      }
> 
>         }
>  }
> ==========================================
> 
> Note that a data race may or may not have a visible consequence. If two
> cores unknowingly process the same mbuf at different times, they may never
> notice it; but if two cores access the same mbuf at the same physical time,
> the consequence is usually visible (a crash). We are not looking for a
> solution that captures every potential data race; instead, we want a
> solution that can capture data races that happen simultaneously on two or
> more cores. Therefore, we do not need to push the locking boundaries as wide
> as possible. We apply locking only where we are focused on processing a
> single mbuf.
> 
> In a real-life application, a core spends quite some time on each mbuf,
> anywhere from a few hundred to a few thousand cycles, and usually no more
> than a handful of mbufs are involved at once. This is an ideal use case for
> locking mbufs.
> 
> I agree that the race detection shall not be compiled in by default, since
> mtod is called often and every mtod implies a visit to the per-core cache.
> Further, recursive LOCK/UNLOCK calls shall be supported as well. But I don't
> think the refcnt logic should be taken into account; these two are
> orthogonal designs, IMO. ***** Please correct me if I am wrong here. *****
> 
> Neither is LOCK/UNLOCK aware of mbuf alloc/free, for the same reason. That
> is why I said LOCK/UNLOCK needs to survive mbuf allocation and
> initialization.
> 
> Of course we need to support locking multiple mbufs at the same time. For
> each core, we then reserve, say, 8 slots. They work exactly like a
> direct-mapped cache: a few bits of the mbuf address (3 bits for 8 slots)
> select the slot an mbuf maps to. If that slot is already occupied, we
> perform an eviction; that is, the new mbuf takes the place of the old one,
> and the old one is then, unfortunately, implicitly UNLOCKed.
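> 
> A rough sketch of what this per-core table and the eviction path could look
> like, as functions sitting behind the proposed RTE_MBUF_LOCK()/
> RTE_MBUF_UNLOCK() macros (the type names, the slot count, the address bits
> and the barrier choice are all illustrative assumptions):
> 
>     #include <stdint.h>
>     #include <rte_mbuf.h>
>     #include <rte_per_lcore.h>
>     #include <rte_atomic.h>
> 
>     #define MBUF_LOCK_SLOTS 8          /* per-core, direct mapped */
> 
>     struct mbuf_lock_slot {
>         struct rte_mbuf *m;            /* current owner, NULL if free */
>         void *saved_buf_addr;          /* original buf_addr while locked */
>     };
> 
>     struct mbuf_lock_tbl {
>         struct mbuf_lock_slot slot[MBUF_LOCK_SLOTS];
>     };
> 
>     static RTE_DEFINE_PER_LCORE(struct mbuf_lock_tbl, mbuf_lock_tbl);
> 
>     static inline struct mbuf_lock_slot *
>     mbuf_lock_slot_of(struct rte_mbuf *m)
>     {
>         /* 3 address bits pick one of the 8 slots, like a direct-mapped
>          * cache; the shift and mask stand in for any cheap hash */
>         uintptr_t idx = ((uintptr_t)m >> 11) & (MBUF_LOCK_SLOTS - 1);
> 
>         return &RTE_PER_LCORE(mbuf_lock_tbl).slot[idx];
>     }
> 
>     static inline void
>     mbuf_lock(struct rte_mbuf *m)      /* behind RTE_MBUF_LOCK() */
>     {
>         struct mbuf_lock_slot *s = mbuf_lock_slot_of(m);
> 
>         if (s->m != NULL && s->m != m) {
>             /* eviction: the previous occupant is implicitly UNLOCKed */
>             s->m->buf_addr = s->saved_buf_addr;
>         }
>         s->m = m;
>         s->saved_buf_addr = m->buf_addr;
>         m->buf_addr = NULL;            /* other cores now fault in mtod */
>         rte_smp_wmb();
>     }
> 
>     static inline void
>     mbuf_unlock(struct rte_mbuf *m)    /* behind RTE_MBUF_UNLOCK() */
>     {
>         struct mbuf_lock_slot *s = mbuf_lock_slot_of(m);
> 
>         if (s->m == m) {               /* unless already evicted */
>             m->buf_addr = s->saved_buf_addr;
>             s->m = NULL;
>         }
>         rte_smp_wmb();
>     }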
> 
> Honestly, I have not yet tried this approach in a real-life application.
> But I have been thinking about the problem of data race detection for a
> long time, and I find that the restrictions and requirements make this
> solution the only viable one. There are hundreds of papers published on
> data race detection, but even the lightest of the options carries at least
> a 5x performance penalty, not to mention the space complexity, making it
> inapplicable in practice.
> 
> Again, please, if anyone has had the same painful experience with data race
> bugs, share your concerns with me. It would be nice to come up with some
> practical device to address this problem. I believe 6Wind and other IP
> stack vendors must share the same feeling and opinion.
> 
> thx &
> rgds,
> 
> Qinglai
> 
> 
> 
> On Thu, Apr 13, 2017 at 5:59 PM, Adrien Mazarguil <adrien.mazarguil at 6wind.com> wrote:
> 
> > Hi Qinglai,
> >
> > On Thu, Apr 13, 2017 at 04:38:19PM +0300, jigsaw wrote:
> > > Hi,
> > >
> > > I have a proposal for mbuf race condition detection and I would like to
> > > get your opinions before committing any patch.
> > >
> > > A race condition is the worst bug I can think of: it causes a crash long
> > > after the first crime scene, and the consequence is unpredictable and
> > > difficult to understand.
> > >
> > > Besides, race conditions are very difficult to reproduce. Usually we get
> > > a coredump from a live network, but the coredump itself delivers nearly
> > > zero info. We just know that the mbuf is somehow broken, and it was
> > > perhaps overwritten by another core at some point.
> > >
> > > There are tools such as Valgrind and ThreadSanitizer to capture this
> > > kind of fault, but the overhead brought by these tools is too high for
> > > them to be deployed in performance tests, not to mention in a live
> > > network. And race conditions rarely happen under low pressure.
> > >
> > > Even worse, the tools mentioned are, even in theory, not capable of
> > > capturing this scenario, because they require explicit calls to locking
> > > primitives such as pthread mutexes; without explicit lock/unlock, the
> > > semantics are not understood by the tools.
> > >
> > > With such known restrictions, I have a proposal roughly as below.
> > >
> > > The idea is to ask the coder to do an explicit lock/unlock on each mbuf
> > > that matters. The pseudo code is as below:
> > >
> > >     RTE_MBUF_LOCK(m);
> > >     /* use mbuf as usual */
> > >     ...
> > >     RTE_MBUF_UNLOCK(m);
> > >
> > > During the protected critical section, only the core that holds the lock
> > > can access the mbuf; other cores, if they dare to use the mbuf, will be
> > > punished by a segfault.
> > >
> > > Since the proposal shall be feasible at least in performance tests, the
> > > overhead of locking and punishment must be small, and access to the mbuf
> > > must stay as usual.
> > >
> > > Hence, RTE_MBUF_LOCK actually does the following (pseudo code):
> > >
> > > RTE_MBUF_LOCK(m)
> > > {
> > >     store_per_core_cache(m, m->buf_addr);
> > >     m->buf_addr = NULL;
> > >     mb(); // memory barrier
> > > }
> > >
> > > And RTE_MBUF_UNLOCK is simply the reverse:
> > >
> > > RTE_MBUF_UNLOCK(m)
> > > {
> > >     purge_per_core_cache(m);
> > >     m->buf_addr = load_per_core_cache(m);
> > >     mb();
> > > }
> > >
> > > As we can see, the idea is to overwrite m->buf_addr with NULL and store
> > > the original value in a per-core cache. Other cores trying to access the
> > > mbuf during the critical section will certainly be punished.
> > >
> > > Then, of course, we need to keep the owner core working. This is done by
> > > changing mtod, as below:
> > >
> > > #define rte_pktmbuf_mtod_offset(m, t, o) \
> > >     ((t)((char *)(m)->buf_addr + load_per_core_cache(m) + (m)->data_off +
> > > (o)))
> > >
> > > The per-core mbuf cache works like a key-value database, organized as a
> > > direct-mapped cache. If an eviction happens (a newly arriving mbuf must
> > > kick out an old one), then we restore the buf_addr of the evicted mbuf.
> > > Of course, RTE_MBUF_UNLOCK then has to take care of this situation.
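> > >
> > > To make the helpers concrete, here is a rough sketch of what they could
> > > look like on top of a small direct-mapped per-core table (the table
> > > layout, the slot count and the address bits are illustrative
> > > assumptions, not a settled design):
> > >
> > >     #include <stdint.h>
> > >     #include <rte_mbuf.h>
> > >     #include <rte_per_lcore.h>
> > >
> > >     #define MBUF_CACHE_SLOTS 8
> > >
> > >     struct mbuf_cache_slot {
> > >         struct rte_mbuf *m;        /* owning mbuf, NULL if slot free */
> > >         uintptr_t buf_addr;        /* stashed original buf_addr */
> > >     };
> > >
> > >     struct mbuf_cache {
> > >         struct mbuf_cache_slot slot[MBUF_CACHE_SLOTS];
> > >     };
> > >
> > >     static RTE_DEFINE_PER_LCORE(struct mbuf_cache, mbuf_cache);
> > >
> > >     static inline struct mbuf_cache_slot *
> > >     mbuf_cache_slot_of(struct rte_mbuf *m)
> > >     {
> > >         /* direct mapping from a few mbuf address bits to a slot;
> > >          * the shift and mask stand in for any cheap hash */
> > >         uintptr_t idx = ((uintptr_t)m >> 11) & (MBUF_CACHE_SLOTS - 1);
> > >
> > >         return &RTE_PER_LCORE(mbuf_cache).slot[idx];
> > >     }
> > >
> > >     static inline void
> > >     store_per_core_cache(struct rte_mbuf *m, void *buf_addr)
> > >     {
> > >         struct mbuf_cache_slot *s = mbuf_cache_slot_of(m);
> > >
> > >         if (s->m != NULL && s->m != m)
> > >             /* eviction: restore buf_addr of the previous occupant */
> > >             s->m->buf_addr = (void *)s->buf_addr;
> > >         s->m = m;
> > >         s->buf_addr = (uintptr_t)buf_addr;
> > >     }
> > >
> > >     static inline void
> > >     purge_per_core_cache(struct rte_mbuf *m)
> > >     {
> > >         struct mbuf_cache_slot *s = mbuf_cache_slot_of(m);
> > >
> > >         if (s->m == m)
> > >             s->m = NULL;
> > >     }
> > >
> > >     /* Returns the stashed buf_addr when the calling core holds m, and
> > >      * 0 otherwise. With buf_addr set to NULL while locked, the sum in
> > >      * rte_pktmbuf_mtod_offset() above yields the real address for the
> > >      * owner and for unlocked mbufs, and a near-NULL pointer (segfault)
> > >      * for any other core touching a locked mbuf. */
> > >     static inline uintptr_t
> > >     load_per_core_cache(struct rte_mbuf *m)
> > >     {
> > >         struct mbuf_cache_slot *s = mbuf_cache_slot_of(m);
> > >
> > >         return (s->m == m) ? s->buf_addr : 0;
> > >     }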
> > >
> > > Note that we expect LOCK/UNLOCK to work even when the mbuf is freed and
> > > allocated again, since we don't trust the refcnt. A double free is
> > > actually very rare, but a data race can occur without one. Therefore, we
> > > need to survive mbuf allocation, that is, rte_pktmbuf_init, which resets
> > > buf_addr.
> > >
> > > Further, the other dereferences of buf_addr in rte_mbuf.h need to be
> > > revised, but that is not a big deal (I hope).
> > >
> > > The runtime overhead of this implementation is very light, and it can
> > > probably be turned on even in a live network. And it is not intrusive at
> > > all: no change is needed in struct rte_mbuf; we just need O(N) space
> > > (where N is the number of cores) and O(1) runtime complexity.
> > >
> > > Alright, that is basically what is on my mind. Before any patch is
> > > provided, or any perf analysis is made, I would like to know what the
> > > risks are from a design point of view. Please leave your feedback.
> >
> > Your proposal makes sense but I'm not sure where developers should call
> > RTE_MBUF_LOCK() and RTE_MBUF_UNLOCK(). Is it targeted at user applications
> > only? Is it to be used internally by DPDK as well? Does it include PMDs?
> >
> > I think any overhead outside of debugging mode, as minimal as it is, is too
> > much of a "punishment" for well-behaving applications. The cost of a memory
> > barrier can be extremely high, and this is one of the reasons DPDK processes
> > mbufs in bulk every time it gets the chance. RTE_MBUF_LOCK() and
> > RTE_MBUF_UNLOCK() should thus exist under some compilation option disabled
> > by default, and be thought of as an additional debugging tool.
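> >
> > As an illustration, such a gate could look roughly like the following (the
> > option name and the backing functions are made up):
> >
> >     #ifdef RTE_LIBRTE_MBUF_DEBUG_RACE
> >     #define RTE_MBUF_LOCK(m)   rte_mbuf_debug_lock(m)
> >     #define RTE_MBUF_UNLOCK(m) rte_mbuf_debug_unlock(m)
> >     #else
> >     /* compiled out: zero overhead for well-behaving applications */
> >     #define RTE_MBUF_LOCK(m)   do { (void)(m); } while (0)
> >     #define RTE_MBUF_UNLOCK(m) do { (void)(m); } while (0)
> >     #endif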
> >
> > Also, the implementation would probably be more expensive than your
> > suggestion; in my opinion the reference count must be taken into account
> > and/or RTE_MBUF_LOCK() must be recursive, because this is how mbufs
> > work. Freeing mbufs multiple times is perfectly valid in many cases.
> >
> > How can one lock several mbufs at once by the way?
> >
> > Since it affects performance, this can only make sense as a debugging tool
> > developers can use when they encounter some corruption issue they cannot
> > identify, somewhat like poisoning buffers before they are freed. Not sure
> > it's worth the trouble.

I agree with Adrien here - it doesn't look worth it:
On one side, it adds quite a lot of overhead and complexity to the mbuf code.
On the other side, its usage seems pretty limited - there would be a lot of cases where it wouldn't be
able to detect the race condition anyway.
So my opinion is NACK.
BTW, if your app intentionally reads/writes the same mbufs simultaneously from different threads
without some explicit synchronization, then most likely there is something wrong with it.
Konstantin

