1231 – l3fwd: perf reports affected by silently enabling "fast free"

Status:	UNCONFIRMED

Alias:	None

Product:	DPDK
Classification:	Unclassified
Component:	examples
Version:	unspecified
Hardware:	All All

Importance:	Normal normal
Target Milestone:	---
Assignee:	dev

URL:

Depends on:
Blocks:

Reported:	2023-05-16 10:37 CEST by Morten Brørup
Modified:	2023-05-19 08:33 CEST
CC List:	3 users

Description Morten Brørup 2023-05-16 10:37:03 CEST

The l3fwd example application is used for benchmarking, including the official NIC performance reports published on the DPDK web site [1].

A patch to silently enable the "fast free" (RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) optimization was applied to l3fwd [2][3] in January 2018.

This means that the performance reports starting from DPDK 18.02 do not reflect generic performance, but the performance of applications that meet the preconditions for using this optimization feature.

To prevent misleading performance results, such non-generic optimizations should not be silently enabled; they should be explicitly enabled using a command line option in l3fwd.

The same applies to the coming "buffer recycle" optimization.
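The difference between silent and explicit enabling can be sketched in plain C. This is only an illustration under stated assumptions: the option name `--tx-fast-free` is hypothetical (l3fwd has no such option in this report), `TX_OFFLOAD_MBUF_FAST_FREE` is a local stand-in for RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE with an illustrative bit position, and `tx_offload_capa` stands for the capability mask a driver would report. No DPDK headers are used, so the block compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE. The bit position is
 * illustrative only; real code should use the macro from rte_ethdev.h. */
#define TX_OFFLOAD_MBUF_FAST_FREE (UINT64_C(1) << 16)

/* Hypothetical option name, invented for this sketch. */
#define OPT_TX_FAST_FREE "--tx-fast-free"

/* Returns true if the hypothetical opt-in flag is on the command line. */
static bool
fast_free_requested(int argc, char *const argv[])
{
	assert(argv != NULL);
	for (int i = 1; i < argc; i++)
		if (strcmp(argv[i], OPT_TX_FAST_FREE) == 0)
			return true;
	return false;
}

/* Silent policy: enable whenever the device advertises the capability. */
static uint64_t
tx_offloads_silent(uint64_t tx_offload_capa)
{
	return tx_offload_capa & TX_OFFLOAD_MBUF_FAST_FREE;
}

/* Explicit policy: enable only if the user asked for it AND the device
 * advertises the capability. */
static uint64_t
tx_offloads_explicit(uint64_t tx_offload_capa, bool requested)
{
	if (!requested)
		return 0;
	return tx_offload_capa & TX_OFFLOAD_MBUF_FAST_FREE;
}
```

With the explicit policy, the optimization shows up in the benchmark's command line, so a published test setup discloses it automatically.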

[1]: https://core.dpdk.org/perf-reports/
[2]: http://inbox.dpdk.org/dev/cover.1514280003.git.shahafs@mellanox.com/
[3]: http://git.dpdk.org/dpdk/commit/examples/l3fwd/main.c?id=1ef9600b2d20078538ca4082f9a4adf2d9bd2ab2

PS: I am not opposed to NIC vendors publishing performance results using non-generic optimizations such as "fast free" or the coming "buffer recycle", but it should be fully disclosed which optimizations have been used to achieve better results.

Comment 1 Stephen Hemminger 2023-05-16 17:04:18 CEST

If you followed that logic, then l3fwd would be frozen, with no changes ever allowed.
It would be better to allow changes to l3fwd, but keep a history of footnotes on changes that affect performance.

Comment 2 Morten Brørup 2023-05-16 17:40:18 CEST

I don't mind this or other application-specific optimizations being added to the reference benchmark application, l3fwd.

I object to application-specific optimizations being enabled by default, because they only work for applications meeting a certain set of preconditions.

Regarding "fast free", the documentation [4] mentions: The application must guarantee that per-queue all mbufs comes from the same mempool and has refcnt = 1. (AFAIK, there are more preconditions, but they are not documented. And they are probably irrelevant for the discussion of this bug.)
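As a rough illustration of the two documented preconditions, the following self-contained C sketch checks a burst of packets before transmission. The types `mock_mempool` and `mock_mbuf` and the function `fast_free_preconditions_hold` are hypothetical stand-ins invented for this example; they are not DPDK APIs, and a real application would normally guarantee these properties by design rather than check them per packet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-ins for struct rte_mempool and
 * struct rte_mbuf; only the fields needed for the check are modeled. */
struct mock_mempool {
	int id;
};

struct mock_mbuf {
	const struct mock_mempool *pool; /* mempool the mbuf was allocated from */
	uint16_t refcnt;                 /* reference count */
};

/* Returns true if every packet in the burst comes from the mempool
 * associated with the TX queue and has a reference count of exactly 1. */
static bool
fast_free_preconditions_hold(struct mock_mbuf *const pkts[], size_t n,
			     const struct mock_mempool *queue_pool)
{
	assert(pkts != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (pkts[i]->pool != queue_pool)
			return false; /* mbuf from a different mempool */
		if (pkts[i]->refcnt != 1)
			return false; /* mbuf is shared (extra reference held) */
	}
	return true;
}
```

An application that clones mbufs, or that transmits packets from several mempools on one queue, would fail this check and therefore could not use the optimization.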

The coming "buffer recycle" optimization is targeting a very specific and narrow use case. Should that optimization also be silently enabled by default?

If we allow non-generic optimizations to be silently added, the l3fwd performance results will gradually become meaningless, because they only apply to a very narrow use case.

It should be up to the user of l3fwd to determine which application-specific optimizations are relevant for the user's application.

If the NIC vendors want to enable all of the application-specific optimizations when creating their performance test reports, that is perfectly fine with me; it will be documented in their test reports (at least in the test setup's "Command Line" description), instead of hidden away in some footnote coming with the l3fwd application.

[4]: https://elixir.bootlin.com/dpdk/v23.03/source/lib/ethdev/rte_ethdev.h#L1560

Comment 3 Slava 2023-05-17 09:40:05 CEST

Moreover, testpmd also enables the MBUF_FAST_FREE offload by default if the PMD reports support for it.

I agree that the offloads configured for benchmarking should be documented.
What to consider as generic is questionable.

Single mbuf packets, originating from the same pool, with no extra references made by the application (send-and-forget), are definitely one of the typical cases.
For now, we have two basic benchmarking scenarios, Zero-Packet-Loss and Single-Core-Rate. We have no traffic requirements other than packet size, and how packets are handled in the driver and application datapath is completely behind the scenes.

So, the question is whether we should take a closer look at packet handling and consider the relevant sets of enabled offloads. In other words, consider what could be "a generic case" and extend the benchmarking if we find any interesting ones.

Comment 4 Morten Brørup 2023-05-17 10:32:07 CEST

In my opinion, the term "generic" means without any limitations on the application. If an optimization feature is unavailable to some applications, it is not a "generic" optimization, regardless of whether the majority of applications can use it.

The choice of relevant sets of offloads depends 100 % on the application, and thus should be configurable by a command line parameter to the benchmark application. If someone's application cannot use "fast free", they should be able to use the reference benchmark application, l3fwd, to obtain performance data for their own comparisons.

I agree that "fast free" can probably be used by the vast majority of applications with very little effort from the application developers, so it would probably make sense for NIC vendors to continue publishing performance data with this optimization feature enabled.

We also need to decide which optimization features should be enabled in the performance regression tests in the CI.

Expanding on your last sentence, I would like to see a benchmark test with packet queueing, so the mbufs are not already in the cache (from ingress processing) when being transmitted. However, I don't have time to provide such a test application myself, so I have learned to accept that all the CI performance regression tests are based on run-to-completion with mbufs residing in the cache - this probably reflects the vast majority of DPDK application use cases anyway. (Note that the MBUF structure is described as the first cache line holding ingress information and the second cache line holding egress information. The current run-to-completion tests don't consider this, but effectively treat the entire mbuf as one unified structure.)

Comment 5 Ruifeng Wang 2023-05-19 08:33:46 CEST

(In reply to Morten Brørup from comment #2)
> 
> The coming "buffer recycle" optimization is targeting a very specific and
> narrow use case. Should that optimization also be silently enabled by
> default?

Hi Morten,

The l3fwd example was used to measure the gain of "buffer recycle" during feature development. That raised concerns similar to yours, so the author is following the community's suggestion to integrate the feature into testpmd with a new forwarding engine. It will not be silently enabled.

Thanks and regards.
