Hello, We have run into a timing issue between threads when using the memif interface type and need some guidance. Our application has a DPDK based process operating (among other things) a memif server interface. The problem is exposed when this memif interface receives a memif.disconnect message from the remote client, while in the middle of an rte_eth_rx_burst() on this same memif interface. As the IRQ message handling is on its own thread as compared to the DPDK worker thread doing the rx_burst, this resulted in a crash. The backtraces for which have been shared below. How does one ensure there are guard rails in place to gracefully exit the rx-burst when a disconnect occurs? Or, how do we properly modify the code such that we defer responding to the disconnect CB after the rx-burst operation has completed? We are utilizing DPDK 21.11.2. I have diff’d dpdks-stable:22.11.3 in ./drivers/net/memif, but I do not see anything obvious that would address this. I did a similar diff for dpdk:23.07, but do not see anything obvious there either. -Mike (gdb) thread 1 [Switching to thread 1 (Thread 0x7f17e2813600 (LWP 470))] #0 0x00007f17e374d225 in eth_memif_rx (queue=0x1189023b00, bufs=0x7f17e28100e8, nb_pkts=32) at ../git/drivers/net/memif/rte_eth_memif.c:338 338 last_slot = __atomic_load_n(&ring->head, __ATOMIC_ACQUIRE); (gdb) bt #0 0x00007f17e374d225 in eth_memif_rx (queue=0x1189023b00, bufs=0x7f17e28100e8, nb_pkts=32) at ../git/drivers/net/memif/rte_eth_memif.c:338 #1 0x000000000047e6fb in rte_eth_rx_burst (nb_pkts=32, rx_pkts=0x7f17e28100e8, queue_id=0, port_id=<optimized out>) at /usr/include/rte_ethdev.h:5368 #2 pmd_main_loop () at ../git/swfw/api/src/swfwPmd.c:1086 #3 0x000000000047f309 in pmd_launch_one_lcore (dummy=<optimized out>) at ../git/my_process.c:1157 #4 0x00007f17f7070e7c in eal_thread_loop (arg=<optimized out>) at ../git/lib/eal/linux/eal_thread.c:146 #5 0x00007f17f4c3da72 in start_thread (arg=<optimized out>) at pthread_create.c:442 #6 0x00007f17f4cbf930 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 (gdb) l 333 ring_size = 1 << mq->log2_ring_size; 334 mask = ring_size - 1; 335 336 if (type == MEMIF_RING_C2S) { 337 cur_slot = mq->last_head; 338 last_slot = __atomic_load_n(&ring->head, __ATOMIC_ACQUIRE); 339 } else { 340 cur_slot = mq->last_tail; 341 last_slot = __atomic_load_n(&ring->tail, __ATOMIC_ACQUIRE); 342 } (gdb) p ring->head Cannot access memory at address 0x7f17d8e58006 (gdb) thread 19 [Switching to thread 19 (Thread 0x7f17f0804600 (LWP 468))] #0 0x00007f17f4caf97b in __GI___close (fd=494) at ../sysdeps/unix/sysv/linux/close.c:27 27 return SYSCALL_CANCEL (close, fd); (gdb) bt #0 0x00007f17f4caf97b in __GI___close (fd=494) at ../sysdeps/unix/sysv/linux/close.c:27 #1 0x00007f17e374f01f in memif_free_regions (dev=dev@entry=0x7f17f727f000 <rte_eth_devices+99072>) at ../git/drivers/net/memif/rte_eth_memif.c:882 #2 0x00007f17e37475d0 in memif_disconnect (dev=0x7f17f727f000 <rte_eth_devices+99072>) at ../git/drivers/net/memif/memif_socket.c:623 #3 0x00007f17f7091bd2 in eal_intr_process_interrupts (nfds=<optimized out>, events=<optimized out>) at ../git/lib/eal/linux/eal_interrupts.c:1026 #4 eal_intr_handle_interrupts (totalfds=<optimized out>, pfd=20) at ../git/lib/eal/linux/eal_interrupts.c:1100 #5 eal_intr_thread_main (arg=<optimized out>) at ../git/lib/eal/linux/eal_interrupts.c:1172 #6 0x00007f17f4c3da72 in start_thread (arg=<optimized out>) at pthread_create.c:442 #7 0x00007f17f4cbf930 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
We believe there is an additional improvement required to ensure the memif attibutes for a given interface/queue is valid. This is because memif.disconnect will zero out the MQ attributes. As such, the following checks are needed up front in the memif rx-burst function. diff --git a/drivers/net/memif/rte_eth_memif.c b/drivers/net/memif/rte_eth_memif.c index 34a25133ca..2ae5db9929 100644 --- a/drivers/net/memif/rte_eth_memif.c +++ b/drivers/net/memif/rte_eth_memif.c @@ -295,7 +295,12 @@ static uint16_t eth_memif_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) { struct memif_queue *mq = queue; + if (unlikely(mq->mempool == NULL)) + return 0; struct pmd_internals *pmd = rte_eth_devices[mq->in_port].data->dev_private; + if (unlikely((pmd->flags & ETH_MEMIF_FLAG_CONNECTED) == 0)) + return 0; + struct pmd_process_private *proc_private = rte_eth_devices[mq->in_port].process_private; memif_ring_t *ring = memif_get_ring_from_queue(proc_private, mq); @@ -313,8 +318,6 @@ eth_memif_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) int ret; struct rte_eth_link link; - if (unlikely((pmd->flags & ETH_MEMIF_FLAG_CONNECTED) == 0)) - return 0; if (unlikely(ring == NULL)) { /* Secondary process will attempt to request regions. */ ret = rte_eth_link_get(mq->in_port, &link); **************************************************************** Here is the supporting evidence of why the above change is necessary. This is from the same BT provided when this defect was submitted. **************************************************************** (gdb) f #0 0x00007f17e374d225 in eth_memif_rx (queue=0x1189023b00, bufs=0x7f17e28100e8, nb_pkts=32) at ../git/drivers/net/memif/rte_eth_memif.c:338 338 last_slot = __atomic_load_n(&ring->head, __ATOMIC_ACQUIRE); (gdb) p mq $78 = (struct memif_queue *) 0x1189023b00 (gdb) p mq->in_port $79 = 0 (gdb) p *mq $80 = { mempool = 0x0, pmd = 0x0, type = MEMIF_RING_C2S, region = 0, in_port = 0, ring_offset = 0, last_head = 0, last_tail = 0, buffers = 0x0, n_pkts = 0, n_bytes = 0, intr_handle = 0x0, log2_ring_size = 0 '\000' }