[dpdk-dev] Random mbuf corruption
Gray, Mark D
mark.d.gray at intel.com
Tue Jun 24 10:05:58 CEST 2014
>
> Paul,
>
> Thanks for the advice; we ran memtest as well as the Dell complete system
> diagnostic and neither found an issue. The plot thickens, though!
>
> Our admins messed up our kickstart labels and what I *thought* was CentOS
> 6.4 was actually RHEL 6.4 and the problem seems to be following the CentOS
> 6.4 installations -- the current configuration of success/failure is:
> 1 server - Westmere - RHEL 6.4 -- works
> 1 server - Sandy Bridge - RHEL 6.4 -- works
> 2 servers - Sandy Bridge - CentOS 6.4 -- fails
>
> Given that the hardware seems otherwise stable/checks out I'm trying to
> figure out how to determine if this is:
> a) our software has a bug
> b) a kernel/hugetlbfs bug
> c) a DPDK 1.6.0r2 bug
>
> I have seen similar issues where calling rte_eal_init too late in a process
> causes similar problems (e.g. calling 'free' on memory that was allocated
> with 'malloc' before 'rte_eal_init' was called results in a segfault in
> libc), which seems odd to me, but in this case we are calling rte_eal_init
> as the first thing we do in main().
I have seen the following issues cause mbuf corruption of this type:
1. Calling rte_pktmbuf_free() on an mbuf and then still using a reference
to that mbuf.
2. Calling rte_pktmbuf_free() and rte_pktmbuf_alloc() from a plain pthread (i.e.
not a "dpdk" thread). This corrupted the per-lcore mbuf cache.
Neither is pleasant to debug, especially if you are sharing the mempool between
primary and secondary processes. I have no debugging tips other than a careful
code review of every place an mbuf is freed or allocated.
Mark