[RFC] ethdev: fast path async flow API

Dariusz Sosnowski dsosnowski at nvidia.com
Thu Jan 4 17:08:47 CET 2024


Hi Konstantin,

> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev at huawei.com>
> Sent: Thursday, January 4, 2024 09:47
> > > This is a blocker, showstopper for me.
> > +1
> >
> > > Have you considered having something like
> > >    rte_flow_create_bulk()
> > >
> > > or better yet a Linux io_uring style API?
> > >
> > > A ring style API would allow for better mixed operations across the
> > > board and get rid of the I-cache overhead which is the root cause of
> > > the need for inline.
> > Existing async flow API is somewhat close to the io_uring interface.
> > The difference being that queue is not directly exposed to the application.
> > Application interacts with the queue using rte_flow_async_* APIs (e.g.,
> > places operations in the queue, pushes them to the HW).
> > Such design has some benefits over a flow API which exposes the queue to
> > the user:
> > - Easier to use - Applications do not manage the queue directly, they do it
> > through exposed APIs.
> > - Consistent with other DPDK APIs - In other libraries, queues are
> > manipulated through API, not directly by an application.
> > - Lower memory usage - only HW primitives are needed (e.g., HW queue
> > on PMD side), no need to allocate separate application queues.
> >
> > Bulking of flow operations is a tricky subject.
> > Compared to packet processing, where it is desired to keep the
> > manipulation of raw packet data to the minimum (e.g., only packet
> > headers are accessed), during flow rule creation all items and actions
> > must be processed by the PMD to create a flow rule.
> > The amount of memory consumed by items and actions themselves during
> > this process might be non-negligible.
> > If flow rule operations were bulked, the size of the working set of
> > memory would increase, which could have negative consequences on cache
> > behavior.
> > So, it might be the case that by utilizing bulking the I-cache overhead
> > is removed, but the D-cache overhead is added.
> 
> Is rte_flow struct really that big?
> We do bulk processing for mbufs, crypto_ops, etc., and usually bulk
> processing improves performance, not degrades it.
> Of course bulk size has to be somewhat reasonable.
It does not really depend on the size of the rte_flow struct itself (it's opaque to the user), but rather on the sizes of the items and actions which are the parameters of flow operations.
To create a flow through the async flow API the following is needed (a rough sketch of such a call follows the list):
- array of items and their spec,
- array of actions and their configuration,
- pointer to template table,
- indexes of pattern and actions templates to be used.
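
For reference, a minimal sketch of how a single creation could be enqueued and pushed with the existing async API - queue 0, template indexes 0 and the already prepared table/pattern/actions are placeholders, and error handling is trimmed:

#include <rte_common.h>
#include <rte_flow.h>

/* Sketch only: enqueue one flow creation on flow queue 0 of the port,
 * then push the postponed operations to the HW and poll for results.
 * "table", "pattern" and "actions" are assumed to be prepared upfront
 * (template table created, item/action arrays filled in by the app). */
static int
enqueue_one_flow(uint16_t port_id,
                 struct rte_flow_template_table *table,
                 const struct rte_flow_item pattern[],
                 const struct rte_flow_action actions[])
{
        const struct rte_flow_op_attr op_attr = { .postpone = 1 };
        struct rte_flow_op_result res[8];
        struct rte_flow_error error;
        struct rte_flow *flow;
        int n;

        /* Place the operation in the flow queue (pattern/actions template
         * index 0 refers to the templates used to create the table). */
        flow = rte_flow_async_create(port_id, 0, &op_attr, table,
                                     pattern, 0, actions, 0,
                                     NULL, &error);
        if (flow == NULL)
                return -1;

        /* Push all postponed operations from queue 0 to the HW. */
        if (rte_flow_push(port_id, 0, &error) != 0)
                return -1;

        /* Retrieve results of completed operations. */
        n = rte_flow_pull(port_id, 0, res, RTE_DIM(res), &error);
        return n < 0 ? -1 : 0;
}
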
If we assume a simple case of ETH/IPV4/TCP/END match and COUNT/RSS/END actions, then we have at most:
- 4 items (32B each) + 3 specs (20B each) = 188B
- 3 actions (16B each) + 2 configurations (4B and 40B) = 92B
- 8B for table pointer
- 2B for template indexes
In total = 290B.
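
To make the numbers above concrete, the per-flow parameters for this example could look roughly like this (a sketch only - fields are left zeroed since only the memory footprint matters here, and byte counts are approximate for a 64-bit build):

#include <rte_flow.h>

/* ETH / IPV4 / TCP / END match with COUNT / RSS / END actions. */
static const struct rte_flow_item_eth eth_spec;        /* ~20B */
static const struct rte_flow_item_ipv4 ipv4_spec;      /* 20B (IPv4 header) */
static const struct rte_flow_item_tcp tcp_spec;        /* 20B (TCP header) */

static const struct rte_flow_item pattern[] = {        /* 4 x 32B */
        { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_spec },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &ipv4_spec },
        { .type = RTE_FLOW_ITEM_TYPE_TCP, .spec = &tcp_spec },
        { .type = RTE_FLOW_ITEM_TYPE_END },
};

static const struct rte_flow_action_count count_conf;  /* 4B */
static const struct rte_flow_action_rss rss_conf;      /* ~40B */

static const struct rte_flow_action actions[] = {      /* 3 x 16B */
        { .type = RTE_FLOW_ACTION_TYPE_COUNT, .conf = &count_conf },
        { .type = RTE_FLOW_ACTION_TYPE_RSS, .conf = &rss_conf },
        { .type = RTE_FLOW_ACTION_TYPE_END },
};
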
A bulk API could be designed so that a single bulk call operates on a single set of tables and templates - this would remove a few bytes (the table pointer and template indexes).
Flow actions can be based on actions templates (so no per-flow action configuration is needed), but items' specs are still needed.
This would leave us at 236B, so at least 4 cache lines (assuming everything is tightly packed) for a single flow - almost twice the size of an mbuf.
Depending on the bulk size, this might add up to a much more significant chunk of the cache.
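
Just to illustrate the shape such a bulk variant could take (a hypothetical prototype only - this is not an existing or proposed DPDK function):

#include <rte_flow.h>

/* Hypothetical, for illustration: create n_flows rules on one flow queue
 * in a single call, all of them using the same template table and the
 * same pattern/actions template indexes. The per-call working set is then
 * n_flows * (items + specs + actions + confs), which is the D-cache cost
 * discussed above. */
int
rte_flow_async_create_bulk(uint16_t port_id, uint32_t queue_id,
                           const struct rte_flow_op_attr *op_attr,
                           struct rte_flow_template_table *table,
                           uint8_t pattern_template_index,
                           uint8_t actions_template_index,
                           const struct rte_flow_item *patterns[],
                           const struct rte_flow_action *actions[],
                           struct rte_flow **flows,
                           uint16_t n_flows,
                           struct rte_flow_error *error);
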

I don't want to dismiss the idea - I think it's worth evaluating.
However, I'm not entirely confident that a bulking API would bring performance benefits.

Best regards,
Dariusz Sosnowski
