808 – Unexpected error log 'EAL: Cannot find action: mp_malloc_response'

Summary: Unexpected error log 'EAL: Cannot find action: mp_malloc_response'

Status:	UNCONFIRMED

Alias:	None

Product:	DPDK
Classification:	Unclassified
Component:	core
Version:	20.11
Hardware:	x86 Linux

Importance:	Normal (priority), normal (severity)
Target Milestone:	---
Assignee:	Anatoly Burakov

URL:

Depends on:
Blocks:

Reported:	2021-09-10 08:57 CEST by li_hong.bi
Modified:	2021-09-23 13:28 CEST
CC List:	2 users

Description li_hong.bi 2021-09-10 08:57:22 CEST

Our test environment has two CPU sockets (two NUMA nodes). We would like DPDK to detect the CPU sockets automatically and allocate hugepage memory according to the application's runtime requirements, without manual configuration, so we pass "--socket-mem 0,0 --socket-limit 0,0" in the DPDK initialization parameter list.

We ran the following recovery test for our DPDK application in a k8s container:
Test steps:
1. Run 'kill -9' to terminate the DPDK primary process.
2. Restart the DPDK primary process.
3. Start a DPDK secondary process after the primary process has restarted successfully.
Repeat steps 1-3 for three rounds.
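The three steps above, repeated for three rounds, can be written out as a plan of commands. This is a minimal Python sketch under stated assumptions: the binary name `./dpdk-app` and the `$PRIMARY_PID` placeholder are hypothetical, `--proc-type` is the standard EAL option for choosing the primary or secondary role, and the memory options are the ones quoted in this report. It only builds the commands; it does not run them.

```python
import shlex

# Hypothetical binary name; the report does not name the real application.
APP = "./dpdk-app"
# Memory options exactly as quoted in the report.
EAL_MEM_OPTS = ["--socket-mem", "0,0", "--socket-limit", "0,0"]


def eal_cmd(proc_type):
    """Command line for one process; --proc-type selects primary/secondary."""
    return [APP, f"--proc-type={proc_type}", *EAL_MEM_OPTS]


def recovery_plan(rounds=3):
    """Commands for the kill/restart test, one group of three per round.

    $PRIMARY_PID is a placeholder for the PID of the running primary.
    """
    steps = []
    for _ in range(rounds):
        steps.append("kill -9 $PRIMARY_PID")            # step 1
        steps.append(shlex.join(eal_cmd("primary")))    # step 2
        steps.append(shlex.join(eal_cmd("secondary")))  # step 3
    return steps


if __name__ == "__main__":
    for cmd in recovery_plan():
        print(cmd)
```

In a real script, step 3 would have to start only after the primary has finished rte_eal_init(), which is the ordering the reporter confirms later in the thread.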

Expected result:
- No error log is printed.
Test result:
- The error log "EAL: Cannot find action: mp_malloc_response" was printed by the DPDK secondary process during the third round of the test. (The error was NOT seen in the first two rounds.)
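Some context for the message itself, stated as our understanding rather than something confirmed in this thread: DPDK's EAL multi-process IPC dispatches each incoming message by name to a handler ("action") that the receiving process has registered, e.g. with rte_mp_action_register(), and logs "Cannot find action: <name>" when no handler is registered under that name. The toy Python model below mimics only that name-based lookup. It is not DPDK code: the registry, `register_action`, `process_msg`, and the `example_action` name are hypothetical, and the model says nothing about why the secondary lacked a handler for mp_malloc_response in this report.

```python
# Toy model of name-based IPC dispatch; not DPDK code.
# Hypothetical registry: message name -> handler. In DPDK, handlers are
# registered per process (e.g. with rte_mp_action_register()); this dict
# only mimics that lookup.
actions = {}
# Collected log lines, standing in for the EAL log output.
eal_log = []


def register_action(name, handler):
    """Register a handler for messages carrying this name."""
    actions[name] = handler


def process_msg(name, payload):
    """Dispatch an incoming message to the handler registered for its name."""
    handler = actions.get(name)
    if handler is None:
        # No handler registered under this name in the receiving process.
        eal_log.append(f"EAL: Cannot find action: {name}")
        return None
    return handler(payload)


if __name__ == "__main__":
    register_action("example_action", lambda payload: "handled")
    process_msg("example_action", b"")      # dispatched to the handler
    process_msg("mp_malloc_response", b"")  # no handler registered here
    print(eal_log)
```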

Could you help check this issue?
Note:
We don't know whether "--socket-mem 0,0 --socket-limit 0,0" is correct.
If not, could you give us some suggestions?

Comment 1 Ajit Khaparde 2021-09-16 06:34:28 CEST

Anatoly, can you take a look at this? Thanks.

Comment 2 Anatoly Burakov 2021-09-16 11:40:20 CEST

First, there's no need to specify socket-mem and socket-limit at all in this case; the defaults will work fine.

Second, the test description looks either incomplete or incorrect. If you are restarting a primary process, you should also kill all of its associated secondary processes, but as far as I remember you wouldn't be able to restart the primary while any secondary processes were still running. So either you did not include that step in your test description, or your test is incorrect *and* we have a bug in primary/secondary detection. I would really appreciate it if you could clarify this point.

Comment 3 li_hong.bi 2021-09-18 04:09:28 CEST

After we run 'kill -9' on the primary process, the secondary process is stopped as well.
We then restart the primary and secondary processes after cleaning the "/run/dpdk/" directory.

Do we need to clean any other directories for DPDK?

Comment 4 Anatoly Burakov 2021-09-20 15:56:28 CEST

Based on your description, I don't see why things wouldn't work.

However, I've tried reproducing this with testpmd and couldn't reproduce what you see, but I did spot something interesting: testpmd specifically will not die immediately after you kill it. Instead, it notifies other processes that memory is being freed (which is expected), but since the secondary process has already died, no response from the secondary is received, so the primary waits for the IPC timeout to expire.
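To make the timeout behaviour described above concrete, here is a toy Python model (not DPDK code) of a process that sends one request to all of its peers and waits for replies until a deadline. In DPDK itself this kind of exchange goes through EAL IPC requests such as rte_mp_request_sync(), which takes a timeout; the `request_sync` helper, the peer names, and the "mem_freed" message below are hypothetical. A live peer answers at once, while a peer that has already exited never answers, so the caller only returns once the timeout expires.

```python
import queue
import threading
import time


def request_sync(peers, timeout_s):
    """Toy model: send one request to every peer, wait up to timeout_s.

    peers maps a (hypothetical) peer name to a handler callable, or to None
    for a peer whose process has already died and will never reply.
    Returns (replies, elapsed_seconds).
    """
    replies = queue.Queue()
    for name, handler in peers.items():
        if handler is None:
            continue  # dead peer: the request goes out, but nobody answers
        # Each live peer answers from its own thread, like a separate process.
        threading.Thread(
            target=lambda n=name, h=handler: replies.put((n, h("mem_freed"))),
            daemon=True,
        ).start()

    received = {}
    start = time.monotonic()
    deadline = start + timeout_s
    # Wait until every peer has replied or the deadline passes.
    while len(received) < len(peers):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            name, answer = replies.get(timeout=remaining)
        except queue.Empty:
            break  # timed out waiting for the missing replies
        received[name] = answer
    return received, time.monotonic() - start


if __name__ == "__main__":
    # A live secondary replies at once; a dead one forces the full timeout.
    print(request_sync({"secondary": lambda msg: "ack"}, 2.0))
    print(request_sync({"secondary": None}, 2.0))
```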

Which specific process displays the error you are seeing?

Comment 5 li_hong.bi 2021-09-23 05:27:41 CEST

This error log was printed by the secondary process, and our application runs in a k8s pod container.
This issue only occurs when DPDK is configured with '--socket-mem 0,0 --socket-limit 0,0'. When we configure '--socket-mem 1000,1000 --socket-limit 1000,1000', we do not see the issue.

Comment 6 Anatoly Burakov 2021-09-23 11:47:17 CEST

Does the secondary process start after the primary has finished initializing, or do they start at the same time?

Comment 7 li_hong.bi 2021-09-23 12:06:22 CEST

The secondary process started initializing DPDK after the primary process had finished initializing DPDK (rte_eal_init), not at the same time.

Comment 8 Anatoly Burakov 2021-09-23 13:28:51 CEST

Does this happen on bare metal (as opposed to inside a k8s container)? I cannot reproduce this issue on bare metal, and I don't have a ready-made k8s setup.

Alternatively, if you could provide me with a set of instructions on how to reproduce this with k8s and one of the built-in apps (test, testpmd, etc.), it would go a very long way towards diagnosing and fixing any potential issues.
