Bug 50 - Secondary process launch is unreliable
Summary: Secondary process launch is unreliable
Status: RESOLVED FIXED
Alias: None
Product: DPDK
Classification: Unclassified
Component: core
Version: 18.05
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assignee: Anatoly Burakov
URL:
Depends on:
Blocks:
 
Reported: 2018-05-22 18:15 CEST by Anatoly Burakov
Modified: 2018-11-20 20:29 CET

Description Anatoly Burakov 2018-05-22 18:15:47 CEST
Secondary process initialization was already known to be unreliable, with numerous workarounds available and documented. However, DPDK 18.05 introduced a rework of the memory subsystem, which has made the situation worse because much more memory is now pre-reserved at initialization.

In addition, investigation revealed that DPDK secondary processes launched using the fork()/execv() method from within another DPDK application (as is the case with some DPDK unit tests) carry an additional risk of initialization failure due to how the pthread library appears to work under these conditions. Specifically, the two IPC threads created by EAL take up a lot of address space (more than 10x the usual amount), thereby interfering with secondary process mappings and causing initialization failure. In this case, disabling ASLR may actually make the situation worse, as a non-deterministic but sometimes lucky memory layout is replaced by deterministically wrong address mappings.

Things to try when affected by this issue:

* Using --base-virtaddr to relocate DPDK memory segments at initialization
* Not running DPDK applications using the fork()/execv() method
* Enabling or disabling ASLR
* Recompiling the binary with a different configuration
* Reducing the amount of preallocated memory (adjusting config options CONFIG_RTE_MAX_MEM_MB_PER_TYPE and/or CONFIG_RTE_MAX_MEM_MB_PER_LIST)
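
As a rough illustration of the first and third items, the sketch below builds command lines for a primary and a secondary process and reads the current Linux ASLR setting. The binary name `./dpdk-app`, the helper names, and the base address value are assumptions made for the example. Only `--proc-type`, `--base-virtaddr` and `/proc/sys/kernel/randomize_va_space` are real EAL or kernel interfaces. A suitable base address differs per machine and must be found experimentally.

```python
def eal_args(binary, proc_type, base_virtaddr=None, app_args=()):
    """Build an argv list for a DPDK application (hypothetical helper)."""
    argv = [binary, f"--proc-type={proc_type}"]
    if base_virtaddr is not None:
        # --base-virtaddr asks EAL to start its memory maps at this address.
        argv += ["--base-virtaddr", f"{base_virtaddr:#x}"]
    if app_args:
        # Everything after "--" is passed to the application itself.
        argv.append("--")
        argv.extend(app_args)
    return argv


def aslr_mode(path="/proc/sys/kernel/randomize_va_space"):
    """Return the Linux ASLR setting (0 = disabled, 1 or 2 = enabled).

    Returns None if the file cannot be read or parsed.
    """
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


# Assumed value for illustration only; not a recommended default.
BASE = 0x500000000000

primary = eal_args("./dpdk-app", "primary", base_virtaddr=BASE)
secondary = eal_args("./dpdk-app", "secondary")

print(" ".join(primary))
print(" ".join(secondary))
print("ASLR setting:", aslr_mode())
```

The argument lists can be passed to `subprocess.Popen` from a launcher that is not itself a DPDK application, which also avoids the fork()/execv() case described above.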
Comment 1 Vipin Varghese 2018-05-29 16:31:48 CEST
Hi Anatoly, during internal tests and validation on an 18.05 base, we did see this recur more often.

Enforcing the use of --base-virtaddr with ASLR enabled in the primary process gave successful secondary launches in 1000 consecutive runs with the same binary.
Comment 2 Anatoly Burakov 2018-05-29 16:34:02 CEST
Hi Vipin,

I'm glad to hear that you were able to work around it. Unfortunately, this is not always the case. In our internal tests, it works on some machines and in some circumstances but not others, and picking the right --base-virtaddr value is also tricky.
Comment 3 Vipin Varghese 2018-05-29 16:45:52 CEST
Hi, thanks for the update. This is true with regard to the base address. So I follow this procedure:
1) Run the primary process.
2) Run pmap -x <process id>.
3) Locate the area where the process's own mmap calls and library loads are placed.
4) Find an area that can house the huge pages that will be mmap'ed.
5) Share the address as a possible offset.

Example:
1) Process A is the primary.
2) Using pmap, locate the start area (0000000100220000       0       0       0 -----   [ anon ], 00007f7378000000     132       4       4 rw---   [ anon ])
3) Pick an address area for the 2GB of huge pages, starting from '0000500000000000'.

But as you correctly pointed out, this might not hold for all scenarios and compilers. The start offset and stack sizes will vary.

Thank you for the input; I appreciate the help.
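
The manual procedure above can be approximated with a short script. This is a minimal sketch that assumes the input uses the /proc/<pid>/maps line format (`start-end perms ...`) rather than pmap output. The sample text reuses the two addresses from the pmap example, but the region sizes, the other entries, and all function names are made up for illustration. It picks a 1GB-aligned address near the middle of the largest unmapped gap, which is only a heuristic and, as noted, will not suit every machine.

```python
def parse_maps(text):
    """Parse /proc/<pid>/maps-style text into sorted (start, end) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or "-" not in fields[0]:
            continue  # skip blank or malformed lines
        start, end = (int(part, 16) for part in fields[0].split("-", 1))
        regions.append((start, end))
    return sorted(regions)


def largest_gap(regions):
    """Return (gap_start, gap_end) of the largest hole between mappings."""
    best = None
    for (_, prev_end), (next_start, _) in zip(regions, regions[1:]):
        if best is None or next_start - prev_end > best[1] - best[0]:
            best = (prev_end, next_start)
    return best


def suggest_base(gap, size, align=1 << 30):
    """Return an aligned address near the middle of the gap, or None."""
    lo, hi = gap
    base = ((lo + hi) // 2) & ~(align - 1)  # round the midpoint down
    if base < lo:
        base = (lo + align - 1) & ~(align - 1)  # round gap start up instead
    return base if base + size <= hi else None


# Hypothetical mappings; only two of the start addresses come from the thread.
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/app
100220000-100241000 rw-p 00000000 00:00 0 [heap]
7f7378000000-7f7378021000 rw-p 00000000 00:00 0
7ffc1a2b0000-7ffc1a2d1000 rw-p 00000000 00:00 0 [stack]
"""

regions = parse_maps(SAMPLE)
gap = largest_gap(regions)
base = suggest_base(gap, size=2 << 30)  # room for 2GB of huge pages
print(f"largest gap: {gap[0]:#x}-{gap[1]:#x}")
print(f"candidate --base-virtaddr: {base:#x}")
```

For a running primary process, the same functions can be fed the contents of `/proc/<pid>/maps` instead of `SAMPLE`.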
Comment 4 Ajit Khaparde 2018-08-24 20:16:52 CEST
Vipin,
Is this still an issue? Or can we close this now?

Thanks
Ajit
Comment 5 Vipin Varghese 2018-08-27 06:47:28 CEST
Hi Ajit,

I do not have any issue with the ticket. I have proactively shared my observations and findings. If you would like to close the ticket, that is OK with me.


Hence marking this as 'resolved' with sub category 'works for me'

thanks
Vipin Varghese
Comment 6 Anatoly Burakov 2018-08-27 10:56:38 CEST
Hi Vipin, Ajit,

I don't think we can close this yet. Vipin is not the only one having this issue, and it is not being fixed because it cannot be 100% fixed. That said, there is a patch from Alejandro Lucero that will improve the situation (it sets the base virtual address to a certain value by default), so once it gets merged, I will be OK with marking this as fixed.
Comment 7 Ajit Khaparde 2018-08-30 10:57:55 CEST
Anatoly,
Can you point to the patch that Alejandro has submitted?
We can keep an eye on it as well and close it once it is accepted.

Thanks
Ajit
Comment 8 Anatoly Burakov 2018-08-30 10:59:56 CEST
@Ajit I don't think there's a current patch for 18.11, only for older DPDK versions. He mentioned that he would submit it for 18.11, but if he doesn't, I'll forward-port the relevant changes to 18.11 and submit them myself.
Comment 9 Anatoly Burakov 2018-11-20 16:31:24 CET
@Ajit

Thanks to Alejandro's work on IOVA [1], this is no longer such a pressing problem.

[1] http://patches.dpdk.org/project/dpdk/list/?series=1717&state=*

This can now be closed.
Comment 10 Ajit Khaparde 2018-11-20 20:29:44 CET
Closing the bug based on the last comment. Thanks Anatoly, Alejandro.