Bug 183

Summary:	Problem using cloned rte_mbuf buffers with KNI interface
Product:	DPDK	Reporter:	Dinesh (dinesh.kp78)
Component:	other	Assignee:	dev
Status:	CONFIRMED ---
Severity:	normal	CC:	sunnylandh, zhouyates
Priority:	Normal
Version:	18.11
Target Milestone:	---
Hardware:	All
OS:	Linux
Attachments:	test case

Description Dinesh 2019-01-07 16:29:42 CET

The problem appears in DPDK 18.11.

We send cloned rte_mbuf packets to a kernel virtual interface via the KNI API. This worked fine up to DPDK 18.05, but after upgrading to DPDK 18.11 we see empty packets being delivered on the KNI interface.

environment setup
--------------------------

dpdk-devbind.py --status-dev net

Network devices using DPDK-compatible driver
============================================
0000:00:0b.0 '82540EM Gigabit Ethernet Controller 100e' drv=igb_uio unused=e1000
0000:00:0c.0 '82540EM Gigabit Ethernet Controller 100e' drv=igb_uio unused=e1000

Network devices using kernel driver
===================================
0000:00:03.0 'Virtio network device 1000' if=eth0 drv=virtio-pci unused=virtio_pci,igb_uio *Active*
0000:00:04.0 'Virtio network device 1000' if=eth1 drv=virtio-pci unused=virtio_pci,igb_uio *Active*
0000:00:05.0 'Virtio network device 1000' if=eth2 drv=virtio-pci unused=virtio_pci,igb_uio *Active*

DPDK kernel modules loaded
--------------------------
lsmod | grep igb_uio
igb_uio                13506  2 
uio                    19259  5 igb_uio
lsmod | grep rte_kni
rte_kni                28122  1 

Redhat linux7
uname -r
3.10.0-862.9.1.el7.x86_64 #1 SMP Wed Jun 27 04:30:39 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux


Problem simulation
--------------------------
To simulate the scenario, I modified the kni_ingress() function in dpdk-18.11\examples\kni\main.c to call rte_pktmbuf_clone() before sending packets to the KNI interface:

static void
kni_ingress(struct kni_port_params *p)
{
	uint8_t i;
	uint16_t port_id;
	unsigned nb_rx, num, k;
	uint32_t nb_kni;
	struct rte_mbuf *pkts_burst[PKT_BURST_SZ];
	struct rte_mbuf *pkt;
	
	if (p == NULL)
		return;

	nb_kni = p->nb_kni;
	port_id = p->port_id;
	for (i = 0; i < nb_kni; i++) {
		/* Burst rx from eth */
		nb_rx = rte_eth_rx_burst(port_id, 0, pkts_burst, PKT_BURST_SZ);
		if (unlikely(nb_rx > PKT_BURST_SZ)) {
			RTE_LOG(ERR, APP, "Error receiving from eth\n");
			return;
		}
		
		/* ----------- clone pkt start ----------- */
		for (k = 0; k < nb_rx; k++) {
			pkt = pkts_burst[k];
			/*
			 * Cloning from 'pkt->pool' wastes memory: a clone only needs new
			 * metadata plus a reference to the original data, so a separate
			 * pool with no data room would be better. Reusing the same pool
			 * is fine for this test.
			 */
			pkts_burst[k] = rte_pktmbuf_clone(pkt, pkt->pool);
			rte_pktmbuf_free(pkt);
		}
		/* ----------- clone pkt end ----------- */
		
		/* Burst tx to kni */
		num = rte_kni_tx_burst(p->kni[i], pkts_burst, nb_rx);
		if (num)
			kni_stats[port_id].rx_packets += num;

		rte_kni_handle_request(p->kni[i]);
		if (unlikely(num < nb_rx)) {
			/* Free mbufs not tx to kni interface */
			kni_burst_free_mbufs(&pkts_burst[num], nb_rx - num);
			kni_stats[port_id].rx_dropped += nb_rx - num;
		}
	}
}



# /tmp/18.11/kni -l 0-1 -n 4 -b 0000:00:03.0 -b 0000:00:04.0 -b 0000:00:05.0 --proc-type=auto -m 512 -- -p 0x1 -P --config="(0,0,1)"
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Auto-detected process type: PRIMARY
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: PCI device 0000:00:03.0 on NUMA socket -1
EAL:   Device is blacklisted, not initializing
EAL: PCI device 0000:00:04.0 on NUMA socket -1
EAL:   Device is blacklisted, not initializing
EAL: PCI device 0000:00:05.0 on NUMA socket -1
EAL:   Device is blacklisted, not initializing
EAL: PCI device 0000:00:0b.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 8086:100e net_e1000_em
EAL: PCI device 0000:00:0c.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 8086:100e net_e1000_em
APP: Initialising port 0 ...
KNI: pci: 00:0b:00 	 8086:100e

Checking link status
.....done
Port0 Link Up - speed 1000Mbps - full-duplex
APP: ========================
APP: KNI Running
APP: kill -SIGUSR1 8903
APP:     Show KNI Statistics.
APP: kill -SIGUSR2 8903
APP:     Zero KNI Statistics.
APP: ========================
APP: Lcore 1 is writing to port 0
APP: Lcore 0 is reading from port 0
APP: Configure network interface of 0 up
KNI: Configure promiscuous mode of 0 to 1


Bring up vEth0 interface created by kni example app
# ifconfig vEth0 up

Dump the content received on vEth0 interface
# tcpdump -vv -i vEth0
tcpdump: listening on vEth0, link-type EN10MB (Ethernet), capture size 262144 bytes
13:38:31.982085 [|ether]
13:38:32.050576 [|ether]
13:38:32.099805 [|ether]
13:38:32.151790 [|ether]
13:38:32.206755 [|ether]
13:38:32.253135 [|ether]
13:38:32.298773 [|ether]
13:38:32.345555 [|ether]
13:38:32.388859 [|ether]
13:38:32.467562 [|ether]


On sending packets to the "00:0b:00" interface with tcpreplay, I see packets with empty content received on "vEth0". Sometimes the kni example app also crashes with a segmentation fault.

After analysing the rte_kni net driver, it appears that the physical-to-virtual address conversion is incorrect, perhaps because of the memory management changes in recent DPDK versions. (I can also confirm that the modified kni example works perfectly fine on DPDK 18.02.)

As a workaround, I used the --legacy-mem switch when starting the kni app. It looks promising: I could receive and dump cloned packets without any issue.
#/tmp/18.11/kni -l 0-1 -n 4 -b 0000:00:03.0 -b 0000:00:04.0 -b 0000:00:05.0 --proc-type=auto --legacy-mem -m 512 -- -p 0x1 -P --config="(0,0,1)"


Could someone confirm that it's a bug in DPDK 18.11 ?

Thanks,
Dinesh

Comment 1 Junxiao Shi 2019-07-02 20:59:48 CEST

I'm seeing this bug at DPDK commit 196a46fab6eeb3ce2039e3bcaca80f8ba43ffc8d, which is later than the 19.05 release.
In the current version, when a DPDK application sends a packet to KNI, three functions are involved:

1. Userspace rte_kni_tx_burst puts mbufs into kni->rx_q.
2. Kernel kni_net_rx_normal gets mbufs from kni->rx_q, allocates skb and copies mbuf payload into skb, passes skb to kernel network stack, and puts mbufs into kni->free_q.
3. Userspace kni_free_mbufs gets mbufs from kni->free_q and deallocates them.

Three kinds of address are used in this process:

* userspace virtual address "va"
* kernel virtual address "kva"
* physical address "pa"

The kni->rx_q queue uses "pa", while the kni->free_q queue uses "va".

Userspace has va2pa address conversion function.
Kernel has pa2kva, pa2va, and va2pa address conversion functions.

Both va2pa functions are declared as: (cast operators omitted)

static void *
va2pa(struct rte_mbuf *m)
{
  return m - (m->buf_addr - m->buf_iova);
}

Essentially, it assumes the offset between "va" and "pa" equals the difference between buf_addr and buf_iova.
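
With the casts written back in, the helper looks roughly like the sketch below. `struct fake_mbuf` is a stand-in holding only the two fields that va2pa reads; it is not the real `struct rte_mbuf`, and the exact cast types are an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct rte_mbuf: only the fields va2pa reads. */
struct fake_mbuf {
	void *buf_addr;    /* virtual address of the data buffer */
	uint64_t buf_iova; /* physical/IO address of the data buffer */
};

/*
 * Shift the mbuf's own virtual address by the va-to-pa offset of its
 * data buffer. This is only right when header and buffer share one
 * offset, which is the assumption described above.
 */
static void *
va2pa(struct fake_mbuf *m)
{
	assert(m != NULL);
	return (void *)((uintptr_t)m -
			((uintptr_t)m->buf_addr - (uintptr_t)m->buf_iova));
}
```

For a direct mbuf whose buffer lives in the same mapping as its header, the result is the header address shifted by the same offset as the buffer.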

Now look at rte_pktmbuf_attach, the function that makes an indirect mbuf:

  mi->buf_iova = m->buf_iova;
  mi->buf_addr = m->buf_addr;
  mi->buf_len = m->buf_len;

It copies buf_addr and buf_iova from direct mbuf to indirect mbuf!
Since the direct mbuf could come from a different mempool, it may have a different offset between "va" and "pa".
Using it to implement va2pa would give incorrect results.

I tested this theory with the following snippet:

    struct rte_mbuf* md = rte_pktmbuf_alloc(mp0);
    assert(md != NULL);
    printf("A md=%p md.buf_iova=%lx md.buf_addr=%p\n", md, md->buf_iova, md->buf_addr);
    rte_memcpy(rte_pktmbuf_prepend(md, sizeof(ethhdr)), &ethhdr, sizeof(ethhdr));
    memset(rte_pktmbuf_append(md, 256), 0x01, 256);
    rte_mbuf_sanity_check(md, true);

    struct rte_mbuf* mi = rte_pktmbuf_alloc(mp1);
    assert(mi != NULL);
    printf("B mi=%p mi.buf_iova=%lx mi.buf_addr=%p\n", mi, mi->buf_iova, mi->buf_addr);

    rte_pktmbuf_attach(mi, md);
    rte_pktmbuf_free(md);
    printf("C mi=%p mi.buf_iova=%lx mi.buf_addr=%p\n", mi, mi->buf_iova, mi->buf_addr);
    rte_mbuf_sanity_check(mi, true);

Output is:

A md=0x173455480 md.buf_iova=259855500 md.buf_addr=0x173455500
B mi=0x172221ac0 mi.buf_iova=258a21b40 mi.buf_addr=0x172221b40
C mi=0x172221ac0 mi.buf_iova=259855500 mi.buf_addr=0x173455500

On line A and line C, (buf_iova - buf_addr) equals 0xE6400000.
On line B, before making mi indirect, (buf_iova - buf_addr) equals 0xE6800000.
According to line B, mi's "pa" should be 0x258A21AC0.
However, after mi is attached to md, va2pa would calculate its "pa" to be 0x258621AC0.
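
The arithmetic can be checked with plain integers. This is a minimal sketch that only replays the values from log lines A, B and C; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from log lines A, B and C. */
#define MI_VA       UINT64_C(0x172221ac0)
#define MI_BUF_ADDR UINT64_C(0x172221b40) /* line B, before attach */
#define MI_BUF_IOVA UINT64_C(0x258a21b40)
#define MD_BUF_ADDR UINT64_C(0x173455500) /* lines A and C */
#define MD_BUF_IOVA UINT64_C(0x259855500)

/* Same arithmetic as va2pa, on plain integers. */
static uint64_t
va2pa_calc(uint64_t m_va, uint64_t buf_addr, uint64_t buf_iova)
{
	assert(m_va != 0);
	return m_va - (buf_addr - buf_iova);
}

/* pa of mi from its own buffer fields (line B): the correct value. */
static uint64_t
mi_pa_before_attach(void)
{
	return va2pa_calc(MI_VA, MI_BUF_ADDR, MI_BUF_IOVA);
}

/* pa of mi from the fields copied from md (line C): what va2pa yields. */
static uint64_t
mi_pa_after_attach(void)
{
	return va2pa_calc(MI_VA, MD_BUF_ADDR, MD_BUF_IOVA);
}
```

The two results differ by 0x400000, which is exactly the difference between the two mempools' offsets.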

Kernel pa2va function has the same problem.

This problem can manifest itself in several forms:

* Packets that contain an indirect mbuf cannot be transmitted.
* When CONFIG_RTE_LIBRTE_MBUF_DEBUG is enabled, kni_free_mbufs's invocation of rte_pktmbuf_free throws "bad mbuf pool" error.
* Kernel panics on skb_put when a 2-segment packet contains an indirect mbuf as the second segment. In this case, kernel calculates wrong address of the indirect mbuf, and therefore data_len field may exceed packet's pkt_len, causing skb overflow.

I think the solution is reimplementing va2pa function using rte_mempool_virt2iova. pa2va needs a similar solution.

Comment 2 Yangchao Zhou 2019-07-03 13:18:17 CEST

pa2va works fine with pa and kva pointing to the same mbuf. This is the way it is currently used.

Comment 3 Junxiao Shi 2019-07-03 15:19:18 CEST

(In reply to Yangchao Zhou from comment #2)
> pa2va works fine with pa and kva pointing to the same mbuf.

No, it does not work with indirect mbufs.

From the example in my earlier message:

mi's va is 0x172221ac0; mi's pa is 0x258A21AC0.
As an indirect mbuf, its buf_iova and buf_addr are copied from the corresponding direct mbuf (line C).
Using the current pa2va function, mi's va would be calculated as 0x172621AC0.
This does not match the correct va 0x172221ac0.
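
The same check in the other direction, as a sketch. It assumes the kernel's pa2va is the plain inverse of va2pa, i.e. pa + (buf_addr - buf_iova); `pa2va_calc` is an illustrative name, not the kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Inverse of the va2pa arithmetic, on plain integers (assumed form). */
static uint64_t
pa2va_calc(uint64_t m_pa, uint64_t buf_addr, uint64_t buf_iova)
{
	assert(m_pa != 0);
	return m_pa + (buf_addr - buf_iova);
}

/*
 * mi's true pa combined with the buf_addr/buf_iova it inherited from
 * md (log line C).
 */
static uint64_t
mi_va_as_seen_by_kernel(void)
{
	return pa2va_calc(UINT64_C(0x258a21ac0),
			  UINT64_C(0x173455500),
			  UINT64_C(0x259855500));
}
```

The computed va lands 0x400000 above the real mbuf, so anything the kernel returns through free_q points at the wrong object.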

Comment 4 Junxiao Shi 2019-07-03 15:40:15 CEST

> I think the solution is reimplementing va2pa function using
> rte_mempool_virt2iova. pa2va needs a similar solution.

I attempted this method a bit, but found it difficult:

* rte_mempool_virt2iova relies on struct rte_mempool_objhdr.
* This in turn requires declaring struct rte_kni_mempool_objhdr in kernel headers.
* sizeof(struct rte_mempool_objhdr) changes depending on RTE_LIBRTE_MEMPOOL_DEBUG setting.
* rte_kni_common.h does not include rte_config.h and cannot access RTE_LIBRTE_MEMPOOL_DEBUG setting.

I believe the best way forward is still https://patches.dpdk.org/patch/55323/ ,
in which librte_kni performs va2pa translation for all segments.
However, as I noted above, pa2va in kernel does not work for indirect mbufs.

I suggest this solution:

* librte_kni performs va2pa translation for all segments, as the patch 55323 does.
* va for each mbuf is written to mbuf.userdata field.
  It is safe to overwrite because mbuf is now owned by and would be freed by KNI,
  and no application is expected to access mbuf.userdata anymore.
* Kernel module performs pa2va translation by reading mbuf.userdata field.

Further optimization is possible if we could spare another 64-bit field:
Instead of changing mbuf.next to pa in userspace, and changing it back in kernel,
mbuf.next can stay as va, while pa for next segment is written to another field
such as mbuf.hash. Kernel then traverses segments using that field, and does
not need to restore va into mbuf.next.
Even with this optimization, writing va to mbuf.userdata is still necessary
at least for the first segment.
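
A rough sketch of the userdata idea, using a stand-in struct rather than the real `struct rte_mbuf`. How userspace obtains the correct pa for an indirect mbuf is outside this sketch, so it is passed in as a hypothetical per-mempool offset; `kni_prepare` and `kni_recover_va` are illustrative names, not existing KNI functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the mbuf fields this scheme touches. */
struct fake_mbuf {
	uint64_t buf_addr; /* may belong to another pool if indirect */
	uint64_t buf_iova;
	uint64_t userdata; /* overwritten with the mbuf's own va */
};

/*
 * Userspace side: record the mbuf's va in userdata and return its pa.
 * pool_off is a hypothetical (pa - va) offset of the pool that owns
 * the mbuf header; it does not depend on buf_addr or buf_iova.
 */
static uint64_t
kni_prepare(struct fake_mbuf *m, uint64_t m_va, uint64_t pool_off)
{
	assert(m != NULL);
	m->userdata = m_va;
	return m_va + pool_off;
}

/* Kernel side: recover the userspace va with no address arithmetic. */
static uint64_t
kni_recover_va(const struct fake_mbuf *m)
{
	assert(m != NULL);
	return m->userdata;
}
```

Because the kernel reads the va back instead of deriving it, the buf_addr and buf_iova values inherited from a direct mbuf in another pool no longer affect the result.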

Comment 5 Junxiao Shi 2019-07-10 22:33:34 CEST

Created attachment 51
test case

This is the test case to reproduce the logs I posted on Jul 02.

Comment 6 Yangchao Zhou 2019-07-30 08:03:53 CEST

One temporary workaround I can think of: send the direct mbufs to KNI after the rte_pktmbuf_attach(). This should work, but it has some restrictions, such as not being able to send to multiple KNI interfaces.
A more complete mechanism is still needed to deal with the VA/PA address translation issues in KNI.

Comment 7 Junxiao Shi 2019-09-10 21:24:06 CEST

(In reply to Yangchao Zhou from comment #6)
> send the direct mbufs to KNI after the rte_pktmbuf_attach().
> This should work, but not being able to send to multiple KNI interfaces.

My use case may require sending to multiple KNI interfaces, so this is unacceptable.

I've posted partial code for my Comment 4 design:
https://patches.dpdk.org/patch/59107/