Bug 918 - RTE EAL is unable to complete its work
Summary: RTE EAL is unable to complete its work
Status: UNCONFIRMED
Alias: None
Product: DPDK
Classification: Unclassified
Component: core
Version: 20.11
Hardware: All All
Importance: Normal normal
Target Milestone: ---
Assignee: dev
URL:
Depends on:
Blocks:
 
Reported: 2021-12-27 14:53 CET by Roman E. Chechnev
Modified: 2024-02-28 01:35 CET



Attachments
threads list before rte_eal_init (518.50 KB, image/png)
2021-12-27 14:54 CET, Roman E. Chechnev
threads list after rte_eal_init (428.93 KB, image/png)
2021-12-27 14:55 CET, Roman E. Chechnev
threads list after rte_eal_cleanup (556.36 KB, image/png)
2021-12-27 14:56 CET, Roman E. Chechnev
Dataplane SDK (614.24 KB, image/png)
2021-12-27 14:58 CET, Roman E. Chechnev
Dataplane SDK (739.44 KB, image/png)
2021-12-27 15:01 CET, Roman E. Chechnev

Description Roman E. Chechnev 2021-12-27 14:53:23 CET
Hi guys,

I am one of the two authors of Dataplane SDK, which consists of many C-based OOP modules. Some of them are related to DPDK.

For example:

sb_class.dpdk.eal
sb_class.dpdk.lpm
sb_class.dpdk.mempool
sb_class.dpdk.ring
sb_class.netdev.dpdk
sb_class.netdev.dpdk.eth
sb_class.netdev.dpdk.eth.bonding
sb_class.netdev.dpdk.pcap
sb_class.netdev.dpdk.tap
sb_class.task.node.group.dpdk.eth
sb_class.task.node.group.dpdk.tap

I also have a module sb_class.hal (Hardware Abstraction Layer). Its object runs before sb_object.dpdk.eal, and its task is to detect the hardware and initialize the synonym table that maps DPDK device numbers to SDK object names.

sb_object.hal also calls the /usr/bin/dpdk-devbind.py script to bind the network cards to the required drivers.

That is, the order of initialization is as follows:

sb_object.some_project       - initializes its children (listed below)
sb_object.hal                - hardware detection, bind drivers
sb_object.dpdk.eal           - rte_eal_init
sb_object.dpdk.mempool       - rte_pktmbuf_pool_create
sb_object.netdev.dpdk.eth    - rte_eth_dev_configure queues etc
sb_object.lcore.scheduler.roundrobin - sb_lcore_launch
*** other objects ***

The application terminates in reverse order.

This is what I noticed: when sb_object.dpdk.eal exits, I call the following code (on the lcore that started everything):

void IMPLEMENTATION (object_free) (
    sb_status_t * a_out_err,
    sb_class_t * a_class,
    sb_object_t * a_obj)
{
    sb_object_dpdk_eal_t * obj
        = (sb_object_dpdk_eal_t *) a_obj;

    sb_object_super_class_free (a_out_err, a_class, a_obj);

    obj->eal_cmd_line[0] = '\0';

#if RTE_VERSION_NUM (19, 2, 0, 0) <= RTE_VERSION
    {
        unsigned lcore_id;
        // wait for threads to stop
        // RTE_LCORE_FOREACH_WORKER (lcore_id) {
        RTE_LCORE_FOREACH (lcore_id) {
            rte_eal_wait_lcore (lcore_id);
        }
        // clean up the EAL
        rte_eal_cleanup ();
    }
#endif
}

In gdb I still see that DPDK has not finished and its threads are still running.

How can I stop DPDK completely?

This matters because when object_free of sb_object.hal is called, it attempts to bind the network cards back to the kernel stack while DPDK is still running. The parent is then killed, and as a result the whole application is marked as "defunct" and only a hardware reboot helps.

Please tell me how to shut down DPDK correctly so that:
all DPDK resources are freed
all threads created by DPDK are terminated

Regards, screenshots attached


Comment 1 Roman E. Chechnev 2021-12-27 14:54:25 CET
Created attachment 185 [details]
threads list before rte_eal_init
Comment 2 Roman E. Chechnev 2021-12-27 14:55:08 CET
Created attachment 186 [details]
threads list after rte_eal_init
Comment 3 Roman E. Chechnev 2021-12-27 14:56:12 CET
Created attachment 187 [details]
threads list after rte_eal_cleanup
Comment 4 Roman E. Chechnev 2021-12-27 14:58:38 CET
Created attachment 188 [details]
Dataplane SDK
Comment 5 Roman E. Chechnev 2021-12-27 15:01:06 CET
Created attachment 189 [details]
Dataplane SDK
Comment 6 Stephen Hemminger 2024-02-28 01:35:19 CET
The normal way to do a safe shutdown is to call the following on the main thread (the one that called rte_eal_mp_remote_launch()):

rte_eal_mp_wait_lcore(); // waits for all worker lcore threads to finish
rte_eal_cleanup(); // cleans up other threads, devices and memory

At that point the process should be back in the state it was in before calling rte_eal_init().
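
The wait step returns only once every worker function has returned, so the workers need some way to be told to stop before the main thread waits for them. The sketch below models that ordering with plain POSIX threads rather than DPDK, so that it compiles on its own. The names force_quit, worker_main, run_and_shutdown and NUM_WORKERS are hypothetical, and pthread_join only stands in for the wait step; it is not how DPDK implements it.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_WORKERS 2  /* hypothetical worker count */

static atomic_bool force_quit;       /* set by the main thread to request exit */
static atomic_int workers_finished;  /* bumped by each worker just before it returns */

/* Worker body: loop until asked to quit, then return so it can be waited for. */
static void *worker_main(void *arg)
{
    (void)arg;
    while (!atomic_load(&force_quit))
        sched_yield();  /* real work (e.g. polling queues) would go here */
    atomic_fetch_add(&workers_finished, 1);
    return NULL;
}

/* Start the workers, request shutdown, wait for them.
 * Returns the number of workers that returned, or -1 if not all could start. */
int run_and_shutdown(void)
{
    pthread_t tids[NUM_WORKERS];
    int started = 0;

    atomic_store(&force_quit, false);
    atomic_store(&workers_finished, 0);

    while (started < NUM_WORKERS &&
           pthread_create(&tids[started], NULL, worker_main, NULL) == 0)
        started++;

    /* 1. Ask the workers to leave their loops. */
    atomic_store(&force_quit, true);

    /* 2. Wait for every started worker to return
     *    (stand-in for the wait step in the comment above). */
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    /* 3. Only at this point would shared resources be released
     *    (stand-in for the cleanup step). */
    return (started == NUM_WORKERS) ? atomic_load(&workers_finished) : -1;
}
```

If step 1 is skipped, step 2 blocks forever, which is the same symptom as a worker lcore function that never returns.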
