Bug 183
Summary: | Problem using cloned rte_mbuf buffers with KNI interface | | |
---|---|---|---|
Product: | DPDK | Reporter: | Dinesh (dinesh.kp78) |
Component: | other | Assignee: | dev |
Status: | CONFIRMED --- | | |
Severity: | normal | CC: | sunnylandh, zhouyates |
Priority: | Normal | | |
Version: | 18.11 | | |
Target Milestone: | --- | | |
Hardware: | All | | |
OS: | Linux | | |
Attachments: | test case | | |
Description
Dinesh
2019-01-07 16:29:42 CET
I'm seeing this bug in DPDK 196a46fab6eeb3ce2039e3bcaca80f8ba43ffc8d, which is later than the 19.05 release.

In the current version, when a DPDK application sends a packet to KNI, three functions are involved:

1. Userspace rte_kni_tx_burst puts mbufs into kni->rx_q.
2. Kernel kni_net_rx_normal gets mbufs from kni->rx_q, allocates an skb and copies the mbuf payload into it, passes the skb to the kernel network stack, and puts the mbufs into kni->free_q.
3. Userspace kni_free_mbufs gets mbufs from kni->free_q and deallocates them.

Three kinds of addresses are used in this process:

* userspace virtual address "va"
* kernel virtual address "kva"
* physical address "pa"

The kni->rx_q queue uses "pa", while the kni->free_q queue uses "va". Userspace has a va2pa address conversion function. The kernel has pa2kva, pa2va, and va2pa address conversion functions.

Both va2pa functions are declared as follows (cast operators omitted):

    static void *
    va2pa(struct rte_mbuf *m)
    {
        return m - (m->buf_addr - m->buf_iova);
    }

Essentially, this assumes that the offset between an mbuf's "va" and "pa" equals the difference between its buf_addr and buf_iova.

Now look at rte_pktmbuf_attach, the function that makes an indirect mbuf:

    mi->buf_iova = m->buf_iova;
    mi->buf_addr = m->buf_addr;
    mi->buf_len = m->buf_len;

It copies buf_addr and buf_iova from the direct mbuf to the indirect mbuf! Since the direct mbuf could come from a different mempool, it may have a different offset between "va" and "pa". Using the copied fields to implement va2pa on the indirect mbuf would give incorrect results.
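To make the offset assumption concrete, here is a minimal sketch that models the addresses as plain 64-bit integers. It does not use DPDK: `struct model_mbuf`, `model_va2pa`, and `model_attach` are hypothetical stand-ins that only mirror the arithmetic quoted above, and the field `self_va` is an assumption introduced to represent the mbuf's own userspace address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the fields va2pa depends on.
 * This is NOT the real struct rte_mbuf. */
struct model_mbuf {
    uint64_t self_va;   /* userspace virtual address of the mbuf header (assumed field) */
    uint64_t buf_addr;  /* virtual address of the data buffer */
    uint64_t buf_iova;  /* IO/physical address of the data buffer */
};

/* Same arithmetic as the va2pa quoted in this report:
 * pa = va - (buf_addr - buf_iova). Unsigned wrap-around is intentional. */
static uint64_t model_va2pa(const struct model_mbuf *m)
{
    return m->self_va - (m->buf_addr - m->buf_iova);
}

/* Mirrors the assignments quoted from rte_pktmbuf_attach:
 * the indirect mbuf inherits the direct mbuf's buffer addresses. */
static void model_attach(struct model_mbuf *mi, const struct model_mbuf *md)
{
    assert(mi != NULL && md != NULL);
    mi->buf_iova = md->buf_iova;
    mi->buf_addr = md->buf_addr;
}
```

If the two model mbufs come from pools with different va-to-pa offsets, `model_va2pa(&mi)` returns one value before `model_attach` and a different value afterwards, even though the mbuf header itself has not moved. That is the failure mode described in this report.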
I tested this theory with the following snippet:

    /* Direct mbuf, allocated from mempool mp0. */
    struct rte_mbuf* md = rte_pktmbuf_alloc(mp0);
    assert(md != NULL);
    printf("A md=%p md.buf_iova=%lx md.buf_addr=%p\n", md, md->buf_iova, md->buf_addr);
    rte_memcpy(rte_pktmbuf_prepend(md, sizeof(ethhdr)), &ethhdr, sizeof(ethhdr));
    memset(rte_pktmbuf_append(md, 256), 0x01, 256);
    rte_mbuf_sanity_check(md, true);

    /* Second mbuf, allocated from a different mempool mp1. */
    struct rte_mbuf* mi = rte_pktmbuf_alloc(mp1);
    assert(mi != NULL);
    printf("B mi=%p mi.buf_iova=%lx mi.buf_addr=%p\n", mi, mi->buf_iova, mi->buf_addr);

    /* Make mi an indirect mbuf that points at md's buffer. */
    rte_pktmbuf_attach(mi, md);
    rte_pktmbuf_free(md);
    printf("C mi=%p mi.buf_iova=%lx mi.buf_addr=%p\n", mi, mi->buf_iova, mi->buf_addr);
    rte_mbuf_sanity_check(mi, true);

Output is:

    A md=0x173455480 md.buf_iova=259855500 md.buf_addr=0x173455500
    B mi=0x172221ac0 mi.buf_iova=258a21b40 mi.buf_addr=0x172221b40
    C mi=0x172221ac0 mi.buf_iova=259855500 mi.buf_addr=0x173455500

On line A and line C, (buf_iova - buf_addr) equals 0xE6400000. On line B, before making mi indirect, (buf_iova - buf_addr) equals 0xE6800000. According to line B, mi's "pa" should be 0x258A21AC0. However, after mi is attached to md, va2pa would calculate its "pa" to be 0x258621AC0. The kernel pa2va function has the same problem.

This problem can manifest itself in several forms:

* Packets that contain an indirect mbuf cannot be transmitted.
* When CONFIG_RTE_LIBRTE_MBUF_DEBUG is enabled, kni_free_mbufs's invocation of rte_pktmbuf_free throws a "bad mbuf pool" error.
* The kernel panics on skb_put when a 2-segment packet contains an indirect mbuf as the second segment. In this case, the kernel calculates the wrong address for the indirect mbuf, and therefore the data_len field may exceed the packet's pkt_len, causing skb overflow.

I think the solution is reimplementing the va2pa function using rte_mempool_virt2iova. pa2va needs a similar solution.

pa2va works fine with pa and kva pointing to the same mbuf. This is the way it is currently used.
(In reply to Yangchao Zhou from comment #2)
> pa2va works fine with pa and kva pointing to the same mbuf.

No, it does not work with indirect mbufs. From the example in my earlier message: mi's va is 0x172221ac0; mi's pa is 0x258A21AC0. As an indirect mbuf, its buf_iova and buf_addr are copied from the corresponding direct mbuf (line C). Using the current pa2va function, mi's va would be calculated as 0x172621AC0. This does not match the correct va 0x172221ac0.

> I think the solution is reimplementing va2pa function using
> rte_mempool_virt2iova. pa2va needs a similar solution.

I attempted this method a bit, but found it difficult:

* rte_mempool_virt2iova relies on struct rte_mempool_objhdr.
* This in turn requires declaring struct rte_kni_mempool_objhdr in kernel headers.
* sizeof(struct rte_mempool_objhdr) changes depending on the RTE_LIBRTE_MEMPOOL_DEBUG setting.
* rte_kni_common.h does not include rte_config.h and cannot access the RTE_LIBRTE_MEMPOOL_DEBUG setting.

I believe the best way forward is still https://patches.dpdk.org/patch/55323/ , in which librte_kni performs va2pa translation for all segments. However, as I noted above, pa2va in the kernel does not work for indirect mbufs.

I suggest this solution:

* librte_kni performs va2pa translation for all segments, as patch 55323 does.
* The va for each mbuf is written to the mbuf.userdata field. It is safe to overwrite because the mbuf is now owned by, and would be freed by, KNI, and no application is expected to access mbuf.userdata anymore.
* The kernel module performs pa2va translation by reading the mbuf.userdata field.

Further optimization is possible if we could spare another 64-bit field: instead of changing mbuf.next to pa in userspace and changing it back in the kernel, mbuf.next can stay as va, while the pa for the next segment is written to another field such as mbuf.hash. The kernel then traverses segments using that field, and does not need to restore va into mbuf.next.
Even with this optimization, writing va to mbuf.userdata is still necessary at least for the first segment.

Created attachment 51: test case
This is the test case to reproduce the logs I posted on Jul 02.
I can think of a temporary solution: send the direct mbufs to KNI after the rte_pktmbuf_attach(). This should work, but there are some restrictions, such as not being able to send to multiple KNI interfaces. A more complete mechanism is still needed to deal with the VA/PA address translation issues in KNI.

(In reply to Yangchao Zhou from comment #6)
> send the direct mbufs to KNI after the rte_pktmbuf_attach().
> This should work, but not being able to send to multiple KNI interfaces.

My use case may require sending to multiple KNI interfaces, so this is unacceptable. I've posted partial code for my Comment 4 design: https://patches.dpdk.org/patch/59107/
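For readers following the thread, the Comment 4 idea (userspace records each mbuf's va in mbuf.userdata, and the kernel reads it back instead of computing it from an offset) can be illustrated with a small integer model. This is only a sketch under stated assumptions, not the code in the patch linked above: `struct model_mbuf` and the three `model_*` helpers are hypothetical names, and `model_pa2va_offset` assumes the kernel helper uses the same buf_addr/buf_iova offset arithmetic described earlier in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model, NOT the real struct rte_mbuf. */
struct model_mbuf {
    uint64_t buf_addr;  /* virtual address of the data buffer */
    uint64_t buf_iova;  /* IO/physical address of the data buffer */
    uint64_t userdata;  /* stands in for mbuf.userdata */
};

/* Userspace side (hypothetical helper): before handing the mbuf to KNI,
 * remember its own userspace virtual address in userdata. */
static void model_tx_prepare(struct model_mbuf *m, uint64_t self_va)
{
    assert(m != NULL);
    m->userdata = self_va;
}

/* Offset-based translation, assumed to match the arithmetic in this report:
 * va = pa + (buf_addr - buf_iova). Wrong for indirect mbufs. */
static uint64_t model_pa2va_offset(const struct model_mbuf *m, uint64_t pa)
{
    return pa + (m->buf_addr - m->buf_iova);
}

/* Proposed alternative: read back the va stored by userspace. */
static uint64_t model_pa2va_userdata(const struct model_mbuf *m)
{
    return m->userdata;
}
```

With an indirect mbuf whose buffer fields were inherited from another pool, the offset-based helper can return an address that is not the mbuf's own va, while the userdata-based helper returns whatever userspace stored, regardless of which pool the buffer belongs to. The trade-off noted in the thread is that mbuf.userdata is overwritten once KNI owns the mbuf.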