To access the summary, slides, and video links for a specific session, click the corresponding tab below.
DPDK and PCIe Gen4 Benchmarking
Amir Ancel, Mellanox & Keesang Song, AMD
This collaborative presentation with AMD will introduce PCIe fundamentals for networking engineers, including the new features in PCIe 4.0. We will then show DPDK performance when running a 200Gb/s Mellanox device using PCIe Gen4 and an AMD 2nd Generation EPYC (Rome) CPU. The presentation will also show peak performance as well as the key advantages of the new architecture, which optimizes local and remote NUMA node performance.
DPDK PMD for NTB
Jingjing Wu & Omkar Maslekar, Intel
NTB (Non-Transparent Bridge) provides a non-transparent bridge between two separate systems so that they can communicate with each other. Many use cases can benefit from this technique, such as fault tolerance and visual acceleration. In this presentation, we will share our recent work on enabling a DPDK polling mode driver for NTB. First, we will briefly introduce the NTB raw device driver skeleton. Then we will present the implementation details of how memory windows, doorbell and scratchpad registers are used to perform the handshake between the two systems. Finally, we will introduce an efficient ring design on the mapped memory; based on this ring layout, typical DPDK applications can transmit packets seamlessly over the NTB device.
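As a rough sketch of how such a device might be driven from an application, the snippet below hands a burst of mbufs to a raw device through the generic rawdev enqueue call. The device id, the burst size, and the use of the context argument to select a queue are assumptions for illustration; the NTB PMD defines its own conventions.

```c
#include <stdint.h>
#include <rte_rawdev.h>
#include <rte_mbuf.h>

#define NTB_BURST_MAX 32

/* Hand a burst of mbufs to an NTB raw device. Device id, burst size and the
 * interpretation of the context argument are assumptions for this sketch. */
static int
ntb_send_burst(uint16_t ntb_dev_id, uint16_t queue_id,
	       struct rte_mbuf **pkts, unsigned int n)
{
	struct rte_rawdev_buf bufs[NTB_BURST_MAX];
	struct rte_rawdev_buf *buf_ptrs[NTB_BURST_MAX];
	unsigned int i;

	if (n > NTB_BURST_MAX)
		n = NTB_BURST_MAX;

	for (i = 0; i < n; i++) {
		bufs[i].buf_addr = pkts[i];	/* hand each mbuf to the raw device */
		buf_ptrs[i] = &bufs[i];
	}

	/* The PMD moves the data through the NTB memory window, updates the
	 * ring it keeps on the mapped memory, and rings the peer's doorbell. */
	return rte_rawdev_enqueue_buffers(ntb_dev_id, buf_ptrs, n,
					  (rte_rawdev_obj_t)(uintptr_t)queue_id);
}
```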
DPDK Acceleration with GPU
Elena Agostini, Nvidia, Cliff Burdick, ViaSat & Shahaf Shuler, Mellanox
We demonstrate the applicability of GPUs as packet processing accelerators, especially for compute-intensive tasks. The following techniques and challenges will be discussed:
– Allowing GPUDirect RDMA Rx and Tx, in which the packets are exchanged directly between the NIC and the GPU.
– For zero copy, mbuf data needs to be located in memory accessible to both devices; therefore the external buffer feature of mbuf is used, with the external buffer located in GPU on-chip memory or GPU-addressable CPU memory (a sketch follows below).
– The Rx queue can optionally be configured to split incoming packets between CPU and GPU memory, which allows CPU processing of packet headers and GPU direct access to packet payloads.
Several applications demonstrate these techniques, including:
– An L2 forwarding application using a CUDA kernel.
– An application matching flows to process on the GPU, with the use of CPU/GPU header/data split.
– A modified version of testpmd using GPU memory.
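A minimal sketch of the zero-copy approach, assuming the external buffer sits in GPU-addressable CPU memory (one of the two options listed above); gpu_va/gpu_iova are placeholders for memory obtained from the CUDA driver and made known to DPDK (e.g. via rte_extmem_register() and a device DMA mapping):

```c
#include <rte_mbuf.h>

/* No-op free callback: in this sketch the external buffer's lifetime is
 * managed elsewhere (by the GPU allocator), so nothing is released here. */
static void
gpu_buf_free_cb(void *addr, void *opaque)
{
	(void)addr;
	(void)opaque;
}

/* Attach one packet-sized slice of a GPU-addressable buffer to an mbuf so
 * the NIC can DMA into it directly (zero copy). */
static struct rte_mbuf *
attach_gpu_slice(struct rte_mempool *mp, void *gpu_va, rte_iova_t gpu_iova,
		 uint16_t slice_len)
{
	struct rte_mbuf_ext_shared_info *shinfo;
	struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
	uint16_t buf_len = slice_len;

	if (m == NULL)
		return NULL;

	/* This helper places the shared info at the tail of the external
	 * buffer, so it assumes the buffer is CPU-writable; for GPU on-chip
	 * memory the shinfo would be kept in host memory instead. */
	shinfo = rte_pktmbuf_ext_shinfo_init_helper(gpu_va, &buf_len,
						    gpu_buf_free_cb, NULL);
	if (shinfo == NULL) {
		rte_pktmbuf_free(m);
		return NULL;
	}

	rte_pktmbuf_attach_extbuf(m, gpu_va, gpu_iova, buf_len, shinfo);
	return m;
}
```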
DPDK Unikernel with Unikraft
Sharan Santhanam, NEC Laboratories Europe GmbH
Unikernels have shown immense performance potential (e.g., throughput in the range of 10-40 Gb/s, boot times of only a few ms, image sizes of only hundreds of KBs). However, most of these have been built manually and have used rather obscure or research-prototype software (e.g., the Click modular router) to handle packets.
In this talk we will present how we tackle these two issues at once. First, we will describe Unikraft, a Linux Foundation project that drastically reduces the time needed to develop new unikernels. Second, we will show our port of DPDK to it, the result of which is, to the best of our knowledge, the first unikernel fully specialized to run DPDK-only workloads. Finally, we will show performance numbers from running this unikernel and discuss future work.
Running Multi-process DPDK App on Kubernetes with Operator SDK
Yasufumi Ogawa, NTT
We will present an approach to running a multi-process DPDK app on Kubernetes by using the Operator SDK. We have developed a DPDK application called Soft Patch Panel (SPP) for Service Function Chaining in NFV environments; it connects DPDK apps running on the host, in virtual machines, and in containers. Multus can be used to run a DPDK app on Kubernetes, but the supported types of network interface are still limited. SPP provides several types of PMD, for example physical, vhost, and ring. We have achieved zero-copy packet forwarding between DPDK container apps on Kubernetes by using the Operator SDK, a toolkit for managing Kubernetes-native applications. An Operator makes it possible to manage complex stateful applications on top of Kubernetes and is well suited to managing a multi-process app. For SPP, we defined a custom resource manager through which users can organize processes via the Kubernetes CLI. In terms of implementation, the Operator SDK provides scaffolding and code-generation tools that bootstrap a new project quickly, so that you can deploy your application rapidly.
DPDK & Containers: Challenges + Solutions
Wang Yong, ZTE
When DPDK is applied to containerized scenarios, it introduces problems and challenges that are not encountered in more conventional deployments. This presentation focuses on several typical problems and challenges and gives corresponding solutions or suggestions.
Transparent Container Solution for DPDK Applications
Tanya Brokhman, SW Architect & Shahar Belkar, Toga Networks
During the presentation, we will present an innovative plug-in, developed by our team in TRC, which enables DPDK applications to run inside a container with virtually no bandwidth or latency penalty compared with the same application running directly on the host. Our solution extends the Docker CNM capabilities by enabling users to run DPDK applications inside a Docker container using DPDK for networking, and delivers the best performance on the market for applications running DPDK over containers. We welcome you to join our trip on the DPDK traffic highway!
OVS DPDK Pitfalls in Openstack and Kubernetes
Yi Yang, Inspur
Our customers require high-performance networking, so we are working to switch from OVS to OVS-DPDK. However, we have encountered many issues that seem insoluble unless we change our infrastructure, which brings many challenges. One example is very poor tap interface performance, even though OpenStack floating IP, router and SNAT all depend on the tap interface. In this presentation I will show all the issues we found and share them with the community so that developers in the community can help fix them.
Offloading Context Aware Flows, OVS-DPDK Connection Tracking Use Case
Roni Bar Yanai, Mellanox
Flow Offloads for DPDK Applications: The Partial, The Full, and The Graceful
Mesut Ali Ergin, Intel
DPDK offers libraries to accelerate packet processing workloads running on a wide variety of CPU architectures. Some of these libraries rely on offloading tasks to hardware entities other than the CPU cores in order to accelerate the functionality they provide. There are also libraries designed to facilitate applications' offload requests to the relevant hardware. Among those, the rte_flow API provides a generic means to offload the matching of specific ingress or egress traffic, as well as the actions taken on the matched packets. In this presentation, we will demonstrate the benefits of using rte_flow offload capabilities in an OVS-DPDK case study and discuss practical implications as to when, where and how much one can offload. We will also discuss some potential algorithms and improvements to DPDK to be able to efficiently and gracefully partition and utilize the packet processing resources in the platform.
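As context for the discussion, here is a minimal rte_flow sketch; the specific match (IPv4/UDP destination port 4789) and the target Rx queue are arbitrary examples, not taken from the talk:

```c
#include <rte_byteorder.h>
#include <rte_flow.h>

/* Steer ingress IPv4/UDP traffic with destination port 4789 to Rx queue 1 on
 * ethdev `port_id`. An application such as OVS-DPDK builds these structures
 * from its own flow tables. */
static struct rte_flow *
offload_example(uint16_t port_id, struct rte_flow_error *err)
{
	struct rte_flow_attr attr = { .ingress = 1 };
	struct rte_flow_item_udp udp_spec = {
		.hdr = { .dst_port = RTE_BE16(4789) },
	};
	struct rte_flow_item_udp udp_mask = {
		.hdr = { .dst_port = RTE_BE16(0xffff) },
	};
	struct rte_flow_item pattern[] = {
		{ .type = RTE_FLOW_ITEM_TYPE_ETH },
		{ .type = RTE_FLOW_ITEM_TYPE_IPV4 },
		{ .type = RTE_FLOW_ITEM_TYPE_UDP,
		  .spec = &udp_spec, .mask = &udp_mask },
		{ .type = RTE_FLOW_ITEM_TYPE_END },
	};
	struct rte_flow_action_queue queue = { .index = 1 };
	struct rte_flow_action actions[] = {
		{ .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
		{ .type = RTE_FLOW_ACTION_TYPE_END },
	};

	/* The PMD decides whether (and how) the rule is offloaded to the NIC. */
	return rte_flow_create(port_id, &attr, pattern, actions, err);
}
```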
Stabilizing the DPDK ABI and What it Means for You
Stephen Hemminger, Microsoft
DPDK has its roots as a toolkit for developing packet processing appliances, where raw packet processing performance is traditionally the highest priority. Since then it has grown into the new usage models of Network Function Virtualization and cloud, where there are now competing demands: to continue the pace of innovation while also providing ABI stability, seamless upgrades, long term support, and OS packaging as the primary means of distribution.
ABI stability will help bring the numerous benefits listed above, and possibly more; however, it will mean changes to the often permissive culture that has existed around ABI changes in the past. This presentation will dig into what these changes will mean for the end consumers of DPDK, network operators and telecom equipment manufacturers, and how they will ultimately be a positive change for the DPDK user experience.
A Comparison Between HTM and Lock-Free Algorithms
Dharmik Thakkar, Arm
As the number of CPU cores packed into a single SoC increases, the scalability of algorithms becomes important. In this presentation, I will talk about Hardware Transactional Memory (HTM) and lock-free mechanisms in terms of how they work, their requirements, and their challenges. Both of these mechanisms improve scalability and thereby speed up the execution of multi-threaded software. DPDK is in a unique position in that the rte_hash library implements both an HTM-optimized algorithm and a lock-free algorithm. The presentation will then compare the performance of the HTM and lock-free algorithms in the rte_hash library.
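For reference, both variants are selected through the extra_flag field when creating an rte_hash table; a minimal sketch, with table size and key type chosen arbitrarily:

```c
#include <rte_hash.h>
#include <rte_hash_crc.h>

/* Create two tables with identical parameters: one asking for HTM-based
 * writer concurrency, the other for the lock-free read/write concurrency
 * algorithm. */
static void
create_tables(void)
{
	struct rte_hash_parameters params = {
		.name = "htm_table",
		.entries = 1 << 20,
		.key_len = sizeof(uint32_t),
		.hash_func = rte_hash_crc,
		.socket_id = 0,
		.extra_flag = RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD |
			      RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT,
	};
	struct rte_hash *htm_tbl = rte_hash_create(&params);

	params.name = "lf_table";
	params.extra_flag = RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF;
	struct rte_hash *lf_tbl = rte_hash_create(&params);

	(void)htm_tbl;
	(void)lf_tbl;
}
```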
Rte_flow Optimization in i40e Driver
Chenmin Sun, Intel
rte_flow is widely used to accelerate packet processing in cloud services, so the flow refresh rate is vitally important. Currently, flow insertion and deletion operations are slow in the original driver, which limits the ability of typical cloud switching applications such as OVS-DPDK and VPP to respond in a timely manner in a rapidly changing cloud network.
This presentation introduces the rte_flow optimization for the i40e driver. In the refactored code, we introduced rte_bitmap and a software pipeline to manage hardware resources and avoid synchronous waiting on the hardware. Meanwhile, the consumed cycles are further reduced by optimizing the dynamic memory allocation code. The performance of the revised code is 20,000 times better than the original code.
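To illustrate the kind of bookkeeping rte_bitmap enables, here is a hedged sketch of tracking free flow-rule slots in software so that inserts and deletes never have to wait on the hardware; MAX_FLOW_RULES and the slot semantics are assumptions, not the driver's actual constants:

```c
#include <rte_bitmap.h>
#include <rte_malloc.h>

#define MAX_FLOW_RULES 8192	/* illustrative table size */

static struct rte_bitmap *free_slots;

/* Allocate the bitmap and mark every hardware flow-rule slot as free. */
static int
flow_slot_init(void)
{
	uint32_t sz = rte_bitmap_get_memory_footprint(MAX_FLOW_RULES);
	void *mem = rte_zmalloc("flow_slots", sz, RTE_CACHE_LINE_SIZE);
	uint32_t i;

	if (mem == NULL)
		return -1;
	free_slots = rte_bitmap_init(MAX_FLOW_RULES, mem, sz);
	if (free_slots == NULL)
		return -1;
	for (i = 0; i < MAX_FLOW_RULES; i++)
		rte_bitmap_set(free_slots, i);	/* all slots start out free */
	return 0;
}

/* Grab a free slot without touching the hardware; returns -1 when full. */
static int
flow_slot_alloc(void)
{
	uint32_t pos = 0;
	uint64_t slab = 0;

	if (rte_bitmap_scan(free_slots, &pos, &slab) == 0)
		return -1;
	pos += __builtin_ctzll(slab);	/* first set bit within the slab */
	rte_bitmap_clear(free_slots, pos);
	return (int)pos;
}
```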
Finally, the presentation will demonstrate that the rte_flow optimization yields a large performance improvement in the OVS-DPDK hardware offload scenario.
Custom Meta Data in PMDs
Honnappa Nagarahalli, ARM
There are packet processing applications, created before DPDK came into existence, in open source as well as in private development. Open source examples include VPP and OVS. These applications define their own packet metadata, and their protocol stacks use that metadata extensively. They have integrated DPDK to make use of its rich set of PMDs. However, they cannot use the metadata from rte_mbuf directly in their protocol stacks, as that would require rewriting the protocol stack. Hence they end up converting from rte_mbuf to their application-specific metadata format, which results in a performance penalty of roughly 20% to 30%. This is forcing these applications to write their own native PMDs, resulting in duplicated code and effort across DPDK and these projects.
It is possible to create an abstraction layer in the PMDs such that the descriptor-to-rte_mbuf conversion code can be user defined. This would allow applications to avoid the rte_mbuf to application-specific metadata conversion, saving the performance penalty.
This presentation talks about the need for the abstraction layer, how such an abstraction can be created, and its benefits. Please note that this is still work in progress. There is no guarantee that it will succeed; in that case, the presentation will cover what was attempted and the issues faced, and perhaps the community can suggest solutions.
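Purely as an illustration of the idea (the proposal is work in progress, and none of these names exist in DPDK), such an abstraction might take the form of a per-queue callback that the PMD invokes instead of building an rte_mbuf:

```c
#include <stdint.h>

/* Hypothetical application-native packet metadata, e.g. a VPP-style buffer
 * header; fields chosen only for illustration. */
struct app_pkt_meta {
	void    *data;
	uint16_t data_len;
	uint16_t l3_offset;
	uint32_t flow_hash;
};

/* Hypothetical callback the application registers with the PMD's Rx path;
 * the PMD would invoke it once per completed descriptor, writing directly
 * into application-owned metadata instead of an rte_mbuf. */
typedef void (*app_desc_to_meta_t)(const void *hw_desc, void *pkt_data,
				   struct app_pkt_meta *meta);

/* Hypothetical registration API; not part of DPDK today. */
int app_rx_meta_callback_register(uint16_t port_id, uint16_t queue_id,
				  app_desc_to_meta_t cb);
```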
Hairpin – Offloading Load Balancer and Gateway Applications
Ori Kam, Mellanox
This presentation details the hairpin feature, which is used to offload forwarding of traffic from the wire back to the wire while modifying the packet headers.
The feature is managed via ethdev and is proposed for 19.11.
Hairpin is also a good fit for QoS features.
The talk will show the use cases and the improvements that can be achieved using this feature.
It will also show the future roadmap, including hairpin between ports and between devices.
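Based on the ethdev API proposed for 19.11, a minimal sketch of binding an Rx queue to a Tx queue on the same port looks roughly as follows (queue numbers and descriptor counts are arbitrary examples):

```c
#include <rte_ethdev.h>

/* Bind Rx queue 1 of `port_id` to its own Tx queue 1, so matched traffic is
 * looped from wire to wire inside the NIC. */
static int
setup_hairpin(uint16_t port_id)
{
	struct rte_eth_hairpin_conf conf = { .peer_count = 1 };
	int ret;

	conf.peers[0].port = port_id;	/* same port: wire-to-wire loop */
	conf.peers[0].queue = 1;	/* peer Tx queue for the Rx side */
	ret = rte_eth_rx_hairpin_queue_setup(port_id, 1, 512, &conf);
	if (ret != 0)
		return ret;

	conf.peers[0].queue = 1;	/* peer Rx queue for the Tx side */
	ret = rte_eth_tx_hairpin_queue_setup(port_id, 1, 512, &conf);
	if (ret != 0)
		return ret;

	/* Traffic is then steered into the hairpin Rx queue with an rte_flow
	 * rule, e.g. an RTE_FLOW_ACTION_TYPE_QUEUE action on queue 1. */
	return 0;
}
```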
HW Offloaded Regex/DPI Appliance
Shahaf Shuler, Mellanox
A previous talk, at the Bordeaux summit, focused on the new regex subsystem in DPDK, where the regex device acts as a look-aside accelerator.
This talk is a follow-up to that one, widening the scope to all the components a Regex/DPI appliance needs.
We will overview the common software pipeline of a Regex/DPI appliance and describe the DPDK components that help the application orchestrate the data movement, for example a connection awareness library, IPsec/TLS termination, flow classification, and more.
In particular, we will describe how the different pipeline stages can be offloaded to hardware using existing or newly introduced APIs.
The Design and Implementation of a New User-level DPDK TCP Stack in Rust
Lilith Stephenson, Microsoft Research
Modern datacenter applications require low-latency, high-throughput access to the network. Using DPDK, applications can achieve significantly better performance by bypassing the OS kernel; however, they still need support for traditional networking protocols like TCP. Existing user-level TCP libraries simply re-purpose existing kernel stacks or optimize only for high throughput, not low latency. We found that these libraries are too slow to meet the latency requirements of datacenter applications, with new 100Gb datacenter networks offering 5 microsecond RTTs. To meet our requirements, we built a new TCP stack from the ground up for DPDK applications using Rust. Rust provides both memory and concurrency safety while remaining appropriate for low-latency environments. In this talk, I discuss our experience building a new low-latency TCP stack using Rust. I will present preliminary performance experiments and welcome input and contributions from the DPDK community in the continuing development of this stack.
TLDKv2: the TCP/IP Stack for Elastic and Ephemeral Serverless Apps
Jianfeng Tan, Ant Financial & Konstantin Ananyev, Intel
TLDK is a “DPDK-native” userspace TCP/IP stack targeting extreme performance, but it also inherits some shortcomings of DPDK (for example, a heavy and nearly static memory footprint).
In cloud-native environments, we need a stack that is performant but also (or more importantly) easy to use, lightweight, scalable, robust, and secure.
In this talk, we will present our work to enhance TLDK to meet these requirements. To ease integration of existing applications, a socket layer (POSIX semantics, I/O event notification facility) is added. To reduce the initial memory footprint while keeping the performance, a dynamic memory model is adopted at different levels (memseg, mempool, and stream management); an instance can start with just a few MB and scale to a large number of open connections. Finally, we will talk about the test frameworks for functional testing, performance regression, and fuzzing.
Validating DPDK Application Portability in Multi-cloud/Hybrid-cloud Environments
Subarna Kar, Microsoft
As DPDK gains new and complex features with each release, there is increasing divergence in feature support between NIC vendors. Developers want their DPDK-based SDN applications to work on a large number of underlying platforms, especially in a multi-cloud or hybrid-cloud environment. There may be performance differences between platforms depending on the feature set supported by the underlying adapter, but the actual functionality should not break.
This talk will discuss some of the DPDK usage patterns typically encountered in our SDN environment, and will especially focus on some of the challenges we have encountered in using the rte_flow APIs for network packet filtering. rte_flow supports a wide range of patterns and actions that are often not supported by the various drivers that offer DPDK support. Currently the best-known method to find out whether a flow can be offloaded to a NIC is to code it using rte_flow and then verify it manually. This verification approach is cumbersome because it relies on accurately coding the target feature set and requires expert knowledge of the physical hardware.
We propose a more efficient approach based on a test suite that creates flows for common use cases and runs them against all drivers. This gives developers an overview of the features supported by each driver.
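A hedged sketch of what such probing could look like, using the existing rte_flow_validate() call; the flow_case table and its contents are assumptions standing in for the proposed test suite:

```c
#include <stdio.h>
#include <rte_flow.h>

/* One candidate rule: a named pattern/action combination to probe. */
struct flow_case {
	const char *name;
	const struct rte_flow_attr *attr;
	const struct rte_flow_item *pattern;
	const struct rte_flow_action *actions;
};

/* Ask the driver behind `port_id` whether it could offload each candidate
 * rule, without actually creating any flows. */
static void
probe_driver_support(uint16_t port_id, const struct flow_case *cases, int n)
{
	struct rte_flow_error err;
	int i;

	for (i = 0; i < n; i++) {
		int ret = rte_flow_validate(port_id, cases[i].attr,
					    cases[i].pattern,
					    cases[i].actions, &err);
		printf("port %u: %-24s %s\n", port_id, cases[i].name,
		       ret == 0 ? "offloadable" : "not supported");
	}
}
```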
4G/5G Granular RSS Challenge
Roni Bar Yanai, Mellanox
Lately we have seen a massive trend of 4G/5G towards virtualization: vRAN, vEPC, MEC, etc. As demand continues to grow rapidly, vendors are seeking offload solutions. We will present a short introduction to the 4G/5G world and its virtualization trends, and then present the required RSS granularity support. 4G/5G requires new RSS modes per traffic type, for example RSS on the inner source IP (over a GTP tunnel), or RSS on the destination IP for non-tunneled traffic (at a termination point); for some use cases RSS must be symmetric, while RSS is done on different fields according to the traffic direction. All options should work in harmony and with flexibility, while still supporting all existing modes. We will show a demo recently done for one of the vendors and discuss the requirements and API.
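As one concrete example of such granularity, the sketch below expresses "RSS on the inner source IP of GTP-U traffic" with the existing rte_flow RSS action; whether a given NIC/PMD accepts this exact combination is an assumption:

```c
#include <rte_flow.h>

/* Hash GTP-U encapsulated traffic on the inner source IPv4 address only
 * (level 2 selects the inner-most headers), spreading it over `queues`.
 * Key handling is omitted for brevity. */
static struct rte_flow *
gtp_inner_rss(uint16_t port_id, const uint16_t *queues, uint16_t nb_queues,
	      struct rte_flow_error *err)
{
	struct rte_flow_attr attr = { .ingress = 1 };
	struct rte_flow_item pattern[] = {
		{ .type = RTE_FLOW_ITEM_TYPE_ETH },
		{ .type = RTE_FLOW_ITEM_TYPE_IPV4 },
		{ .type = RTE_FLOW_ITEM_TYPE_UDP },
		{ .type = RTE_FLOW_ITEM_TYPE_GTPU },
		{ .type = RTE_FLOW_ITEM_TYPE_IPV4 },
		{ .type = RTE_FLOW_ITEM_TYPE_END },
	};
	struct rte_flow_action_rss rss = {
		.level = 2,			/* hash on the inner headers */
		.types = ETH_RSS_IPV4 | ETH_RSS_L3_SRC_ONLY,
		.queue_num = nb_queues,
		.queue = queues,
	};
	struct rte_flow_action actions[] = {
		{ .type = RTE_FLOW_ACTION_TYPE_RSS, .conf = &rss },
		{ .type = RTE_FLOW_ACTION_TYPE_END },
	};

	return rte_flow_create(port_id, &attr, pattern, actions, err);
}
```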
Using DPDK APIs as the I/F between UPF-C and UPF-U
Brian Klaff & Barak Perlman, Ethernity
UPF (User Plane Function) is the main data path element in the 3GPP architecture for 5G.
Several carriers have announced plans to place the UPF in edge locations as part of their 5G deployments.
Carriers are looking for HW acceleration for UPF, as compute resources at edge locations are limited.
There’s a need to define a standard interface between the UPF application (UPF-C) and the SmartNICs (UPF-U).
We suggest using DPDK APIs as the interface between UPF-C and UPF-U.
The presentation will also list the missing APIs that need to be added to DPDK to fully offload UPF functionality.