Bug 59 - Cannot start secondary processes at all on Red Hat EL7
Summary: Cannot start secondary processes at all on Red Hat EL7
Status: RESOLVED INVALID
Alias: None
Product: DPDK
Classification: Unclassified
Component: core
Version: 18.05
Hardware: x86 Linux
Importance: Normal major
Target Milestone: ---
Assignee: dev
URL:
Depends on:
Blocks:
 
Reported: 2018-06-01 18:05 CEST by Matteo Lanzuisi
Modified: 2018-07-10 00:14 CEST
CC List: 1 user



Attachments
Core Dump GDB Output (3.04 KB, text/plain)
2018-06-01 18:05 CEST, Matteo Lanzuisi

Description Matteo Lanzuisi 2018-06-01 18:05:42 CEST
Created attachment 7
Core Dump GDB Output

Hi all,

I was using DPDK 2.2.0 on Red Hat EL6, where it worked with this configuration:

- a tiny process called the "rte_eal_get_physmem_layout()" function;
- the primary process started with "--base-virtaddr=X" where X is the value returned by the tiny process;
- the secondary process started and attached to the rings and mempools created by the primary process.

Now I downloaded dpdk-18.02.1.tar.xz, recompiled all the processes with the new version, changed some functions for compatibility and ran everything on Red Hat EL7.
ASLR is disabled, but when starting the primary process I got warnings like

"WARNING! Base virtual address hint (0x7fff80000000 != 0x7ffefffcd000) not respected!"

so I manually changed the "--base-virtaddr" parameter until the warnings went away.
Once the warnings no longer appeared, the secondary process started failing with a segfault.

The address I set to avoid the warnings is "0x7ffa3ffcd000".

In the attached core dump you can see that the address requested by the secondary process is "0x7ffabfd47080", and gdb reports "Cannot access memory at address 0x7ffabfd47080".

The function that generates the error is "rte_ctrlmbuf_alloc()".

I don't know whether some other kernel or hugepage parameter must be set.
Comment 1 Matteo Lanzuisi 2018-06-15 12:37:26 CEST
I changed the affected version from 18.02 to 18.05 after taking the following steps:

- applied OS patches, so the system is now Red Hat EL 7.5
- kernel version is 3.10.0-862.3.2.el7.x86_64
- changed the DPDK version from 18.02 to 18.05, because 18.02.1 no longer compiles on RHEL 7.5
- made all the changes my application code needed to build against dpdk-18.05, including deleting the ctrlmbuf calls and the rte_eal_get_physmem_layout() call
- removed the "--base-virtaddr" option from the primary process, because I could not find any address that made the "WARNING: Base ..." output disappear
- tried with ASLR both enabled and disabled, with the same behaviour.

The secondary process still fails at startup; its output is:

EAL: Detected 40 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket_2397_68d2e387007be
EAL: Probing VFIO support...
EAL: Cannot get a virtual area at requested address: 0x7ffff7fb9000 (got 0x7ffff7eb0000)
EAL: Cannot attach to memzone list
EAL: FATAL: Cannot init memzone

EAL: Cannot init memzone

One more piece of information:
- a tiny secondary process that does nothing except call rte_eal_init() works fine, while the real secondary process (with the same input parameters) fails with the error shown above.

Any suggestions, or pointers to related problems and solutions, would be appreciated.
Comment 2 Matteo Lanzuisi 2018-06-15 14:01:09 CEST
I kept testing with my tiny secondary process: after rte_eal_init() succeeds, it fails in "rte_pktmbuf_alloc()", so I think I have reproduced the original error.

Reading symbols from /root/libifcea/c/test_ifcea...done.
[New LWP 8502]
[New LWP 8512]
[New LWP 8511]
[New LWP 8513]
[New LWP 8514]
[New LWP 8509]
[New LWP 8515]
[New LWP 8510]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `./test_ifcea'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000000 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.x86_64 libgcc-4.8.5-28.el7_5.1.x86_64 numactl-libs-2.0.9-7.el7.x86_64
(gdb) where
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff79b5783 in rte_mempool_ops_dequeue_bulk (mp=0x7fd6ffd46500, obj_table=0x7fffffffd7a8, n=1)
    at /root/dpdk/x86_64-native-linuxapp-gcc/include/rte_mempool.h:679
#2  0x00007ffff79b5997 in __mempool_generic_get (cache=0x0, n=1, obj_table=0x7fffffffd7a8, mp=0x7fd6ffd46500)
    at /root/dpdk/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1523
#3  rte_mempool_generic_get (cache=0x0, n=1, obj_table=0x7fffffffd7a8, mp=0x7fd6ffd46500)
    at /root/dpdk/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1558
#4  rte_mempool_get_bulk (n=1, obj_table=0x7fffffffd7a8, mp=0x7fd6ffd46500) at /root/dpdk/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1591
#5  rte_mempool_get (obj_p=0x7fffffffd7a8, mp=0x7fd6ffd46500) at /root/dpdk/x86_64-native-linuxapp-gcc/include/rte_mempool.h:1617
#6  rte_mbuf_raw_alloc (mp=0x7fd6ffd46500) at /root/dpdk/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:992
#7  0x00007ffff79b5a9b in rte_pktmbuf_alloc (mp=0x7fd6ffd46500) at /root/dpdk/x86_64-native-linuxapp-gcc/include/rte_mbuf.h:1253
#8  0x00007ffff79b5cb6 in if_cea_send_gen_info (fd=-1, ring_name=0x7fffffffd9f0 "RB_CEA_CFG_WRITE_0", p_msg=0x7fffffffda70, len_msg=8)
    at ifcea_filter.c:185
#9  0x00007ffff79b2a3c in if_cea_manager_phisical_ch (info_th=0x6cb980, p_cfg_ch=0x6bde40 <cfg_ch>) at ifcea.c:2449
#10 0x00007ffff79b49c0 in if_cea_open_logical_channel (p_cfg_ch=0x6bde40 <cfg_ch>) at ifcea.c:3486
#11 0x0000000000404c6a in leggi_cfg () at ifcea_main.c:698
#12 0x0000000000406173 in main () at ifcea_main.c:1183
(gdb)
Comment 3 Matteo Lanzuisi 2018-06-21 18:46:48 CEST
Update:

The "rte_mempool_ops_dequeue_bulk()" segmentation fault also happens with the following releases:

- dpdk-17.11-7 (Redhat rpm)
- dpdk-17.11.3 (dpdk.org source code)
- dpdk-18.02.2 (dpdk.org source code)
Comment 4 Matteo Lanzuisi 2018-07-04 09:49:45 CEST
I found that it was a build problem.
I was not using the rte.extapp.mk file to build my application; instead I was statically linking only the rte_ring, rte_mempool and rte_eal libraries.
The build reported no errors, but at runtime the application failed as described.

I think this can be closed.
Comment 5 Ajit Khaparde 2018-07-10 00:12:58 CEST
Closing based on comment #4.
Comment 6 Ajit Khaparde 2018-07-10 00:14:05 CEST
Closing as invalid since it was a problem with compilation.
