[dpdk-dev] Random mbuf corruption

Gray, Mark D mark.d.gray at intel.com
Tue Jun 24 10:05:58 CEST 2014


> 
> Paul,
> 
> Thanks for the advice; we ran memtest as well as the Dell complete system
> diagnostic and neither found an issue. The plot thickens, though!
> 
> Our admins messed up our kickstart labels and what I *thought* was CentOS
> 6.4 was actually RHEL 6.4 and the problem seems to be following the CentOS
> 6.4 installations -- the current configuration of success/failure is:
>   1 server - Westmere - RHEL 6.4 -- works
>   1 server - Sandy Bridge - RHEL 6.4 -- works
>   2 servers - Sandy Bridge - CentOS 6.4 -- fails
> 
> Given that the hardware seems otherwise stable/checks out I'm trying to
> figure out how to determine if this is:
>   a) our software has a bug
>   b) a kernel/hugetlbfs bug
>   c) a DPDK 1.6.0r2 bug
> 
> I have also seen similar issues where rte_eal_init is called too late in
> a process (for example, calling 'free' on memory that was allocated with
> 'malloc' before 'rte_eal_init' results in a segfault in libc), which
> seems odd to me, but in this case we are calling rte_eal_init as the
> first thing we do in main().
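
Calling rte_eal_init before anything else is the right ordering. For
reference, a minimal sketch of it (error handling kept short, setup
elided):

#include <stdlib.h>

#include <rte_eal.h>

int
main(int argc, char **argv)
{
	/* Initialize the EAL before any other allocation or DPDK call. */
	int ret = rte_eal_init(argc, argv);

	if (ret < 0)
		return EXIT_FAILURE;

	/* rte_eal_init() consumed its own arguments; ours follow. */
	argc -= ret;
	argv += ret;

	/* ... mempool/port setup and lcore launch go here ... */
	return 0;
}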

I have seen the following issues cause mbuf corruption of this type:

1. Calling rte_pktmbuf_free() on an mbuf and then continuing to use a
reference to that mbuf (a use-after-free).
2. Calling rte_pktmbuf_free() or rte_pktmbuf_alloc() from a plain pthread
(i.e. not a "dpdk" lcore thread), which corrupts the per-lcore mbuf cache
(both patterns are sketched after this list).

These are not pleasant to debug, especially if you are sharing the mempool
between primary and secondary processes. I have no debugging tips other
than a careful code review of every place an mbuf is freed or allocated.
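
One build-time aid that may help (assuming rebuilding DPDK is an option
for you): enabling CONFIG_RTE_LIBRTE_MBUF_DEBUG and
CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG adds per-object cookies and sanity
checks, so many of these corruptions panic at the offending free or
alloc instead of surfacing later. With those on, something like the
sketch below (names are mine) can be called periodically:

#include <stdio.h>

#include <rte_mempool.h>

/*
 * Audit the pool: with CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=y this checks
 * per-object cookies and per-lcore cache counts, and panics on the
 * first corrupt object rather than failing silently later.
 */
static void
check_pool(struct rte_mempool *pool, unsigned int pool_size)
{
	rte_mempool_audit(pool);

	/* A double free can also show up as an over-full pool. */
	if (rte_mempool_count(pool) > pool_size)
		printf("mempool %s over-full: possible double free\n",
		       pool->name);
}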

Mark

