Our test environment has two CPU sockets (2 NUMA nodes). We would like DPDK to detect the CPU sockets automatically and allocate hugepage memory according to the application's runtime requirements, without manual designation, so we added "--socket-mem 0,0 --socket-limit 0,0" to the DPDK initialization parameter list.

We ran the following recovery test case for our DPDK application in a k8s container:

Test steps:
1. Run 'kill -9' to terminate the DPDK primary process.
2. Restart the DPDK primary process.
3. Run a DPDK secondary process after the primary process has restarted successfully.
Repeat steps 1-3 for 3 rounds.

Expected result:
- No error log is found.

Test result:
- The error log "EAL: Cannot find action: mp_malloc_response" was printed by the DPDK secondary process after the 3rd round. (The error was NOT found in the first two rounds.)

Could you help check this issue?

Note: We don't know whether "--socket-mem 0,0 --socket-limit 0,0" is correct. If not, could you give some suggestions?
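The test steps above could be scripted roughly as follows. This is only a minimal sketch: `reproduce`, `primary_cmd`, `secondary_cmd`, and the timing parameters are hypothetical names, the commands would have to be replaced with the real application (or a builtin app) and its EAL arguments, and fixed sleeps stand in for a proper "primary has finished rte_eal_init" check. It also assumes any cleanup between rounds is handled elsewhere.

```python
import subprocess
import tempfile
import time

ERROR_TEXT = "EAL: Cannot find action: mp_malloc_response"


def reproduce(primary_cmd, secondary_cmd, rounds=3,
              settle_s=1.0, secondary_timeout_s=10.0):
    """Run the kill/restart/attach cycle and return the round numbers
    in which the secondary's output contained ERROR_TEXT."""
    failing_rounds = []
    primary = subprocess.Popen(primary_cmd)
    try:
        for round_no in range(1, rounds + 1):
            time.sleep(settle_s)          # crude wait for primary init
            primary.kill()                # step 1: SIGKILL, like 'kill -9'
            primary.wait()
            primary = subprocess.Popen(primary_cmd)   # step 2: restart primary
            time.sleep(settle_s)          # crude wait for primary init

            # Step 3: run the secondary and capture everything it prints.
            with tempfile.TemporaryFile(mode="w+") as log:
                secondary = subprocess.Popen(
                    secondary_cmd, stdout=log, stderr=subprocess.STDOUT)
                try:
                    secondary.wait(timeout=secondary_timeout_s)
                except subprocess.TimeoutExpired:
                    secondary.kill()
                    secondary.wait()
                log.seek(0)
                if ERROR_TEXT in log.read():
                    failing_rounds.append(round_no)
    finally:
        primary.kill()
        primary.wait()
    return failing_rounds
```

A result such as `[3]` would correspond to the behaviour reported here (error only in the third round), while an empty list would match the expected result.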
Anatoly, can you take a look at this? Thanks.
First, there's no need to specify socket-mem and socket-limit at all in this case; the defaults will work fine. Second, the test description looks to be either incomplete or incorrect. If you are restarting a primary process, you should also kill all the associated secondary processes along with it. As far as I remember, you wouldn't be able to restart the primary if any secondary processes were still running, so either you did not include that step in your test description, or your test is incorrect *and* we have a bug in primary/secondary detection. I would really appreciate it if you could clarify this point.
After we run 'kill -9' on the primary process, the secondary process is stopped. We then clean the "/run/dpdk/" directory and restart the primary and secondary processes. Do we need to clean any other directories for DPDK?
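For reference, the cleanup step we perform between rounds is roughly equivalent to the sketch below. The function name `clean_runtime_dir` is hypothetical, and the sketch only covers the "/run/dpdk/" directory mentioned above; it makes no claim about whether other locations also need cleaning.

```python
import shutil
from pathlib import Path


def clean_runtime_dir(runtime_root="/run/dpdk"):
    """Remove every entry under runtime_root and return the removed names.

    The directory itself is kept. A missing directory is not an error.
    """
    root = Path(runtime_root)
    removed = []
    if not root.is_dir():
        return removed
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)      # per-prefix subdirectory and its files
        else:
            entry.unlink()            # stray file or symlink
        removed.append(entry.name)
    return removed
```

It should only be called once both the primary and all secondary processes have exited, since it deletes the files they share.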
Based on your description, I don't see why things wouldn't work. However, I've tried reproducing this with testpmd and couldn't reproduce what you see, but I did spot something interesting: testpmd specifically will not die immediately after you kill it. Instead, it will notify other processes about memory being freed (which is expected), but since the secondary process has already died, no response from the secondary is received, which results in the primary waiting for the IPC timeout to expire. Which specific process displays the error you are seeing?
This error log was printed by the secondary process, and our application runs in a single k8s pod container. The issue only occurs when DPDK is configured with '--socket-mem 0,0 --socket-limit 0,0'. When we configure '--socket-mem 1000,1000 --socket-limit 1000,1000', the issue is not seen.
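To make the compared configurations explicit, here is a small sketch that builds the EAL argument list for each case. The helper `eal_args` is a hypothetical name; only the `--proc-type`, `--socket-mem`, and `--socket-limit` options themselves come from DPDK, and the rest of the application's arguments are omitted.

```python
def eal_args(proc_type, socket_mem=None, socket_limit=None):
    """Build a list of EAL arguments; options left as None are omitted,
    so the EAL defaults apply."""
    args = [f"--proc-type={proc_type}"]
    if socket_mem is not None:
        # One value per NUMA node, in megabytes, comma separated.
        args.append("--socket-mem=" + ",".join(str(v) for v in socket_mem))
    if socket_limit is not None:
        args.append("--socket-limit=" + ",".join(str(v) for v in socket_limit))
    return args


# Configuration where the error is seen.
failing = eal_args("primary", socket_mem=(0, 0), socket_limit=(0, 0))
# Configuration where the error is not seen.
working = eal_args("primary", socket_mem=(1000, 1000), socket_limit=(1000, 1000))
# Defaults only, as suggested earlier in this thread.
defaults = eal_args("primary")
```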
Does the secondary process start after the primary has finished initializing, or do they start at the same time?
The secondary process started initializing DPDK only after the primary process had finished initializing DPDK (rte_eal_init); they did not start at the same time.
Does this happen on baremetal (as opposed to inside a k8s container)? I cannot reproduce this issue on baremetal, and I don't have a ready-made k8s setup. Alternatively, if you could provide me with a set of instructions on how to reproduce this with k8s and one of the builtin apps (test, testpmd, etc.), it would go a very long way towards diagnosing and fixing any potential issues.