Bug 382 - rte_eth: rx/tx callbacks invoked without lock protection
Summary: rte_eth: rx/tx callbacks invoked without lock protection
Status: UNCONFIRMED
Alias: None
Product: DPDK
Classification: Unclassified
Component: ethdev (show other bugs)
Version: 18.11
Hardware: All All
: Normal normal
Target Milestone: ---
Assignee: dev
URL:
Depends on:
Blocks:
 
Reported: 2020-01-09 08:52 CET by Linfan
Modified: 2020-01-10 04:05 CET (History)
2 users (show)



Attachments

Description Linfan 2020-01-09 08:52:34 CET
Hi, all  
    I launch my DPDK app, and then use dpdk-pdump to capture wire packets. When I stop dpdk-pdump with ctrl+c, my DPDK app crash. Here is the coredump backtrace:  

```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/dpvs -- -l 0,1,2,3,4,5,6,7,8 -w 0000:04:00.0 -w 0000:04:00.1 --legacy'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005555558d24e3 in rte_eth_rx_burst (port_id=1, queue_id=2, rx_pkts=0x55555721ed10 <lcore_conf+2386256>, nb_pkts=32)
    at /builds/lbc/kigw/kigw_scripts/dpdk-stable-18.11.2/build/include/rte_ethdev.h:3888
3888    /builds/lbc/kigw/kigw_scripts/dpdk-stable-18.11.2/build/include/rte_ethdev.h: No such file or directory.
[Current thread is 1 (Thread 0x7ffff3d36700 (LWP 29228))]
(gdb) bt
#0  0x00005555558d24e3 in rte_eth_rx_burst (port_id=1, queue_id=2, rx_pkts=0x55555721ed10 <lcore_conf+2386256>, nb_pkts=32)
    at /builds/lbc/kigw/kigw_scripts/dpdk-stable-18.11.2/build/include/rte_ethdev.h:3888
#1  0x00005555558d8743 in netif_rx_burst (pid=1, qconf=0x55555721ed00 <lcore_conf+2386240>) at /builds/lbc/kigw/src/netif.c:1618
#2  0x00005555558dc331 in lcore_job_recv_fwd (arg=0x0) at /builds/lbc/kigw/src/netif.c:2424
#3  0x00005555558e16de in do_lcore_job (job=0x555555f6f980 <netif_jobs>) at /builds/lbc/kigw/src/netif.c:4265
#4  0x00005555558e184d in netif_loop (dummy=0x0) at /builds/lbc/kigw/src/netif.c:4307
#5  0x00005555556cfb9d in eal_thread_loop ()
#6  0x00007ffff67744a4 in start_thread (arg=0x7ffff3d36700) at pthread_create.c:456
#7  0x00007ffff62abd0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
```

I followed the source code in rte_ethdev.h where the segmentation fault occurred in line 3888:  

```
static inline uint16_t
rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
		 struct rte_mbuf **rx_pkts, const uint16_t nb_pkts)
{
	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
	uint16_t nb_rx;

#ifdef RTE_LIBRTE_ETHDEV_DEBUG
	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, 0);
	RTE_FUNC_PTR_OR_ERR_RET(*dev->rx_pkt_burst, 0);

	if (queue_id >= dev->data->nb_rx_queues) {
		RTE_ETHDEV_LOG(ERR, "Invalid RX queue_id=%u\n", queue_id);
		return 0;
	}
#endif
	nb_rx = (*dev->rx_pkt_burst)(dev->data->rx_queues[queue_id],
				     rx_pkts, nb_pkts);

#ifdef RTE_ETHDEV_RXTX_CALLBACKS
	if (unlikely(dev->post_rx_burst_cbs[queue_id] != NULL)) {
		struct rte_eth_rxtx_callback *cb =
				dev->post_rx_burst_cbs[queue_id];

		do {
			nb_rx = cb->fn.rx(port_id, queue_id, rx_pkts, nb_rx,
						nb_pkts, cb->param);
			cb = cb->next;
		} while (cb != NULL);
	}
#endif

	return nb_rx;
}
```

Found that callback list doesn't protected by any lock. These may have concurrent issues when another process like dpdk-pdump unregister tx/rx callbacks before it exits. The registration/unregistration of tx/rx callbacks hold a spinlock. I wonder why there is no lock in invocation of callback list.  
```
int
rte_eth_remove_rx_callback(uint16_t port_id, uint16_t queue_id,
		const struct rte_eth_rxtx_callback *user_cb)
{
#ifndef RTE_ETHDEV_RXTX_CALLBACKS
	return -ENOTSUP;
#endif
	/* Check input parameters. */
	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
	if (user_cb == NULL ||
			queue_id >= rte_eth_devices[port_id].data->nb_rx_queues)
		return -EINVAL;

	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
	struct rte_eth_rxtx_callback *cb;
	struct rte_eth_rxtx_callback **prev_cb;
	int ret = -EINVAL;

	rte_spinlock_lock(&rte_eth_rx_cb_lock);
	prev_cb = &dev->post_rx_burst_cbs[queue_id];
	for (; *prev_cb != NULL; prev_cb = &cb->next) {
		cb = *prev_cb;
		if (cb == user_cb) {
			/* Remove the user cb from the callback list. */
			*prev_cb = cb->next;
			ret = 0;
			break;
		}
	}
	rte_spinlock_unlock(&rte_eth_rx_cb_lock);

	return ret;
} 
```

Please check and conform if this is a bug.

Thanks,
Linfan Hu
Comment 1 Thomas Monjalon 2020-01-09 09:45:16 CET
There is no lock protection in the datapath for performance reason.

The application must take care of stopping Rx on the devices before unregistering the callbacks.

Is it OK for you?
Comment 2 Stephen Hemminger 2020-01-09 17:10:53 CET
On Thu, 09 Jan 2020 07:52:34 +0000
bugzilla@dpdk.org wrote:

> https://bugs.dpdk.org/show_bug.cgi?id=382
> 
>             Bug ID: 382
>            Summary: rte_eth: rx/tx callbacks invoked without lock
>                     protection
>            Product: DPDK
>            Version: 18.11
>           Hardware: All
>                 OS: All
>             Status: UNCONFIRMED
>           Severity: normal
>           Priority: Normal
>          Component: ethdev
>           Assignee: dev@dpdk.org
>           Reporter: zhongdahulinfan@163.com
>   Target Milestone: ---
> 
> Hi, all  
>     I launch my DPDK app, and then use dpdk-pdump to capture wire packets.
>     When
> I stop dpdk-pdump with ctrl+c, my DPDK app crash. Here is the coredump
> backtrace:  
> 
> ```

dpdk-pdump needs a signal handler to catch control-c and unregister its
callback.
Comment 3 Linfan 2020-01-10 03:28:28 CET
Hi Stephen, 
  dpdk-pdump has its signal handler: 

static void
signal_handler(int sig_num)
{
	if (sig_num == SIGINT) {
		printf("\n\nSignal %d received, preparing to exit...\n",
				sig_num);
		quit_signal = 1;
	}
}

dpdk-pdump wil cleanup pdump resources, including removing callbacks, before it exits.
Comment 4 Linfan 2020-01-10 04:05:49 CET
Hi Thomas, 
  Our DPDK application runs online and stopping RX and/or TX is not acceptable. 

Given that callbacks are allowed to register/unregister dynamically, I think protection mechanism should be considered while calling callbacks. Maybe RCU is tolerable If lock is too expansive?

Note You need to log in before you can comment on or make changes to this bug.