[dpdk-users] Understanding ACL Implementation

Nate Jensen Nate.Jensen at sdl.usu.edu
Fri Jan 6 23:51:55 CET 2017


Hello DPDK Users:

I have some in-depth questions about DPDK's ACL matching capabilities.  To answer them, I have been scouring the web, reading the available documentation, and poring over the source code to the best of my ability.  As I dig through the source, I am struggling to understand all of the nuances of the implementation, and I was hoping someone could offer a higher-level discussion of how the ACL matching algorithm behaves.  If no such discussion exists, I will continue to plow through the source, but it isn't a fast process.

Currently, my organization has a product that does ACL matching with a TCAM.  Because of this, I can make hard guarantees about our system's performance.  A TCAM has some desirable traits:

1. Line-rate matching of 64K frames
2. No performance degradation as the rule set grows in size (until the TCAM is exhausted, of course)
3. No restrictions on the rule set (i.e. no rule set behaves better or worse than another; they all work at line rate)
4. Adding or removing a rule is fast
5. No preprocessing or "compilation" of the rule set is required

While the TCAM is nice, I am interested in moving our product away from specialized hardware by replacing the TCAM with DPDK.  However, I have concerns about the ACL matching capability provided by DPDK.  I would like to be able to make assertions about which rule sets perform better than others.  Specifically, I would like to know:

1.  What types of rules increase memory use?
2.  What types of rules lead to decreased performance?
3.  Has benchmarking been done to find pathological worst-case rule sets?  What do these rule sets look like?

Personally, I have built several "home grown" ACL matching algorithms on top of DPDK.  I have experimented with many variants of tries, with cross-producting all permutations of fields/dimensions, and with other approaches.  None of my implementations matches the observed capability of the ACL implementation that DPDK provides.  I have even generated, via script, random ACL rule sets that can be quite large (10K rules).  When these are loaded into the l3fwd-acl example program, performance is still quite strong.  Rule sets that were a pathological worst case for my implementations were generally handled gracefully by DPDK ACLs.  I need to understand this implementation, but I tend to get lost when working through the source.
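For concreteness, my generated rule sets use the rule-file syntax from the l3fwd-acl sample application guide, which, if I have it right, looks roughly like this ('@' lines are ACL rules, 'R' lines are route rules, the trailing number on an 'R' line is the forwarding port, and the values here are arbitrary):

    R0.0.0.0/0 0.0.0.0/0 0 : 65535 0 : 65535 0x00/0x00 0
    @192.168.0.0/16 10.10.0.0/24 0 : 65535 1024 : 4096 0x06/0xff

The fields are source address/depth, destination address/depth, source port range, destination port range, and protocol/mask.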

I understand that the ACL implementation uses tries with an 8-bit stride.  However, I don't understand how the DPDK implementation efficiently handles the don't-cares or ranges that may appear in the various search fields.  My homegrown trie implementation accounted for them in one of two ways:

1.  Expand the trie to account for don't-cares (sketched in the code after this list) -- This was effective, but pathological rule sets caused huge amounts of RAM usage.  A don't-care bit early in a rule resulted in many duplicate copies of subtrees beneath the don't-care paths.
2.  Use a "ternary" trie with backtracking -- This was also effective, but because the trie was ternary, the traversal might have to backtrack to follow the don't-care path.  In the pathological worst case, the traversal essentially had to examine every rule, and performance was abysmal.

What I need to understand is the DPDK ACL algorithm and the motivation behind it.  It seems that ACLs are generated in two phases.  The build phase evaluates each rule and then sorts the rules by their "wildness" -- a concept/heuristic that is currently lost on me.  I then see that the algorithm develops a trie for each rule.  This initial trie seems to be a 1-bit-stride trie that may be quite large.  The algorithm then moves to the "generate" phase, where the internal tries are compacted into something more space/time efficient.  It is unclear to me how the algorithm avoids both of the following (see the API sketch after this list):

1. Memory bloat if runtime performance is favored
2. Degraded matching performance if memory efficiency is favored
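For reference, below is a stripped-down use of the librte_acl API on a two-field rule, as I currently understand it (the names, field layout, and values are mine, not from any DPDK sample; error checking and the byte-order handling of the matched data are glossed over).  If I read rte_acl.h correctly, rte_acl_config.max_size is the knob that caps the memory of the generated runtime structures, so it appears to be where the space/time tension surfaces:

    #include <string.h>
    #include <rte_acl.h>

    enum { F_PROTO, F_DST, NUM_FIELDS };

    /* The library documents that the first field must be one byte wide,
     * hence the protocol byte leads.  Offsets are relative to the start
     * of the buffer later handed to rte_acl_classify(). */
    static struct rte_acl_field_def defs[NUM_FIELDS] = {
            { .type = RTE_ACL_FIELD_TYPE_BITMASK, .size = sizeof(uint8_t),
              .field_index = F_PROTO, .input_index = F_PROTO, .offset = 0 },
            { .type = RTE_ACL_FIELD_TYPE_MASK, .size = sizeof(uint32_t),
              .field_index = F_DST, .input_index = F_DST, .offset = 1 },
    };

    RTE_ACL_RULE_DEF(acl_rule, NUM_FIELDS);

    static struct rte_acl_ctx *
    build_example(void)
    {
            struct rte_acl_param prm = {
                    .name = "example-acl",          /* hypothetical name */
                    .socket_id = SOCKET_ID_ANY,
                    .rule_size = RTE_ACL_RULE_SZ(NUM_FIELDS),
                    .max_rule_num = 1024,
            };
            struct rte_acl_ctx *ctx = rte_acl_create(&prm);

            struct acl_rule r;
            memset(&r, 0, sizeof(r));
            r.data.category_mask = 1;
            r.data.priority = 1;
            r.data.userdata = 42;                  /* returned on a match */
            r.field[F_PROTO].value.u8 = 6;         /* TCP, matched exactly */
            r.field[F_PROTO].mask_range.u8 = 0xff;
            r.field[F_DST].value.u32 = 0x0a000000; /* 10.0.0.0 */
            r.field[F_DST].mask_range.u32 = 8;     /* /8: 24 don't-care bits */
            rte_acl_add_rules(ctx, (const struct rte_acl_rule *)&r, 1);

            struct rte_acl_config cfg;
            memset(&cfg, 0, sizeof(cfg));
            cfg.num_categories = 1;
            cfg.num_fields = NUM_FIELDS;
            memcpy(cfg.defs, defs, sizeof(defs));
            cfg.max_size = 0;  /* 0 = uncapped; a nonzero cap bounds the
                                * generated tries at some runtime cost */
            rte_acl_build(ctx, &cfg);  /* the "compile" step in question */
            return ctx;
    }

At runtime, rte_acl_classify() then walks the compacted tries over buffers laid out according to those offsets.  It is the behavior of rte_acl_build(), as max_size and the shape of the rule set vary, that I am trying to characterize.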

If someone has any background on the implemented algorithm that could help me see the forest for the trees, I would be grateful.  I am also quite willing to document what I learn and feed that documentation back to the community if it is of value.  I would likewise be willing to share my test results showing performance observations from best/average/pathological rule sets.

Best Regards,
Nathan Jensen


