[dpdk-stable] [dpdk-dev] [PATCH] lib/distributor: fix deadlock issue for aarch64

Ruifeng Wang (Arm Technology China) Ruifeng.Wang at arm.com
Wed Oct 9 07:52:03 CEST 2019


> -----Original Message-----
> From: David Marchand <david.marchand at redhat.com>
> Sent: Wednesday, October 9, 2019 03:47
> To: Aaron Conole <aconole at redhat.com>
> Cc: Ruifeng Wang (Arm Technology China) <Ruifeng.Wang at arm.com>; David
> Hunt <david.hunt at intel.com>; dev <dev at dpdk.org>; hkalra at marvell.com;
> Gavin Hu (Arm Technology China) <Gavin.Hu at arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli at arm.com>; nd <nd at arm.com>; dpdk
> stable <stable at dpdk.org>
> Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH] lib/distributor: fix deadlock
> issue for aarch64
> 
> On Tue, Oct 8, 2019 at 7:06 PM Aaron Conole <aconole at redhat.com> wrote:
> >
> > Ruifeng Wang <ruifeng.wang at arm.com> writes:
> >
> > > Distributor and worker threads rely on data structs in a cache line
> > > for synchronization. The shared data structs were not protected.
> > > This caused a deadlock on platforms with weaker memory ordering, such
> > > as aarch64.
> > > Fix this issue by adding memory barriers to ensure synchronization
> > > among cores.
> > >
> > > Bugzilla ID: 342
> > > Fixes: 775003ad2f96 ("distributor: add new burst-capable library")
> > > Cc: stable at dpdk.org
> > >
> > > Signed-off-by: Ruifeng Wang <ruifeng.wang at arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu at arm.com>
> > > ---
> >
> > I see a failure in the distributor_autotest (on one of the builds):
> >
> > 64/82 DPDK:fast-tests / distributor_autotest  FAIL     0.37 s (exit status 255 or signal 127 SIGinvalid)
> >
> > --- command ---
> >
> > DPDK_TEST='distributor_autotest'
> > /home/travis/build/ovsrobot/dpdk/build/app/test/dpdk-test -l 0-1
> > --file-prefix=distributor_autotest
> >
> > --- stdout ---
> >
> > EAL: Probing VFIO support...
> >
> > APP: HPET is not enabled, using TSC as default timer
> >
> > RTE>>distributor_autotest
> >
> > === Basic distributor sanity tests ===
> >
> > Worker 0 handled 32 packets
> >
> > Sanity test with all zero hashes done.
> >
> > Worker 0 handled 32 packets
> >
> > Sanity test with non-zero hashes done
> >
> > === testing big burst (single) ===
> >
> > Sanity test of returned packets done
> >
> > === Sanity test with mbuf alloc/free (single) ===
> >
> > Sanity test with mbuf alloc/free passed
> >
> > Too few cores to run worker shutdown test
> >
> > === Basic distributor sanity tests ===
> >
> > Worker 0 handled 32 packets
> >
> > Sanity test with all zero hashes done.
> >
> > Worker 0 handled 32 packets
> >
> > Sanity test with non-zero hashes done
> >
> > === testing big burst (burst) ===
> >
> > Sanity test of returned packets done
> >
> > === Sanity test with mbuf alloc/free (burst) ===
> >
> > Line 326: Packet count is incorrect, 1048568, expected 1048576
> >
> > Test Failed
> >
> > RTE>>
> >
> > --- stderr ---
> >
> > EAL: Detected 2 lcore(s)
> >
> > EAL: Detected 1 NUMA nodes
> >
> > EAL: Multi-process socket /var/run/dpdk/distributor_autotest/mp_socket
> >
> > EAL: Selected IOVA mode 'PA'
> >
> > EAL: No available hugepages reported in hugepages-1048576kB
> >
> > -------
> >
> > Not sure how to help debug further.  I'll re-start the job to see if
> > it 'clears' up - but I guess there may be a delicate synchronization
> > somewhere that needs to be accounted for.
> 
> Same here, and with the same loop I used before, it can be caught quickly.
> 
> # time (log=/tmp/$$.log; while true; do echo distributor_autotest | \
>   taskset -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level *:8 \
>   -l 0-1 >$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log; rm -f $log)
> 
Thanks, Aaron and David, for the report. I can reproduce this issue with the script.
I will fix it in the next version.
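
The reproduction loop quoted above can be sketched as a small reusable shell function. This is only an illustration under stated assumptions: the name run_until_fail, its arguments, and the cap on the number of runs are hypothetical and are not part of DPDK or of the original one-liner. Only the dpdk-test invocation in the trailing comment is taken from the quoted command.

```shell
#!/bin/sh
# Hypothetical helper (not part of DPDK): rerun a command until its output
# no longer contains a success marker, then print the log of the failing run.
# Usage: run_until_fail MARKER MAX_RUNS COMMAND [ARGS...]
run_until_fail() {
    marker=$1
    max_runs=$2
    shift 2
    log=$(mktemp) || return 2
    runs=0
    while [ "$runs" -lt "$max_runs" ]; do
        runs=$((runs + 1))
        # Capture stdout and stderr of this run, overwriting the previous log.
        "$@" >"$log" 2>&1
        if ! grep -q "$marker" "$log"; then
            echo "failed on run $runs"
            cat "$log"
            rm -f "$log"
            return 1
        fi
    done
    rm -f "$log"
    echo "no failure in $runs runs"
    return 0
}

# Example call, reusing the command from the quoted loop (the cap of 1000
# runs is an arbitrary assumption):
# run_until_fail 'Test OK' 1000 sh -c \
#   'echo distributor_autotest | taskset -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level "*:8" -l 0-1'
```

Unlike the original loop, this sketch stops after a fixed number of runs, so it also terminates when the failure no longer reproduces.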

> [snip]
> 
> RTE>>distributor_autotest
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 0
> EAL: Restoring previous memory policy: 0
> EAL: request: mp_malloc_sync
> EAL: Heap on socket 0 was expanded by 2MB
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 0
> EAL: Restoring previous memory policy: 0
> EAL: alloc_pages_on_heap(): couldn't allocate physically contiguous space
> EAL: Trying to obtain current memory policy.
> EAL: Setting policy MPOL_PREFERRED for socket 0
> EAL: Restoring previous memory policy: 0
> EAL: request: mp_malloc_sync
> EAL: Heap on socket 0 was expanded by 8MB
> === Basic distributor sanity tests ===
> Worker 0 handled 32 packets
> Sanity test with all zero hashes done.
> Worker 0 handled 32 packets
> Sanity test with non-zero hashes done
> === testing big burst (single) ===
> Sanity test of returned packets done
> 
> === Sanity test with mbuf alloc/free (single) ===
> Sanity test with mbuf alloc/free passed
> 
> Too few cores to run worker shutdown test
> === Basic distributor sanity tests ===
> Worker 0 handled 32 packets
> Sanity test with all zero hashes done.
> Worker 0 handled 32 packets
> Sanity test with non-zero hashes done
> === testing big burst (burst) ===
> Sanity test of returned packets done
> 
> === Sanity test with mbuf alloc/free (burst) ===
> Line 326: Packet count is incorrect, 1048568, expected 1048576
> Test Failed
> RTE>>
> real    0m36.668s
> user    1m7.293s
> sys    0m1.560s
> 
> Could be worth running this loop on all tests? (not talking about the CI, it
> would be a manual effort to catch lurking issues).
> 
> 
> --
> David Marchand

