[PATCH v4 1/4] eal: add generic support for reading PMU events

Tomasz Duszynski tduszynski at marvell.com
Thu Jan 5 22:14:18 CET 2023


Hi Morten, 

A few comments inline. 

>-----Original Message-----
>From: Morten Brørup <mb at smartsharesystems.com>
>Sent: Wednesday, December 14, 2022 11:41 AM
>To: Tomasz Duszynski <tduszynski at marvell.com>; dev at dpdk.org
>Cc: thomas at monjalon.net; Jerin Jacob Kollanukkaran <jerinj at marvell.com>; zhoumin at loongson.cn;
>mattias.ronnblom at ericsson.com
>Subject: [EXT] RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
>
>External Email
>
>----------------------------------------------------------------------
>+CC: Mattias, see my comment below about per-thread constructor for this
>
>> From: Tomasz Duszynski [mailto:tduszynski at marvell.com]
>> Sent: Wednesday, 14 December 2022 10.39
>>
>> Hello Morten,
>>
>> Thanks for review. Answers inline.
>>
>> [...]
>>
>> > > +/**
>> > > + * @file
>> > > + *
>> > > + * PMU event tracing operations
>> > > + *
>> > > + * This file defines generic API and types necessary to setup PMU and
>> > > + * read selected counters in runtime.
>> > > + */
>> > > +
>> > > +/**
>> > > + * A structure describing a group of events.
>> > > + */
>> > > +struct rte_pmu_event_group {
>> > > +	int *fds; /**< array of event descriptors */
>> > > +	void **mmap_pages; /**< array of pointers to mmapped perf_event_mmap_page structures */
>> >
>> > There seems to be a lot of indirection involved here. Why are these arrays
>> > not statically sized, instead of dynamically allocated?
>> >
>>
>> Different architectures/pmus impose limits on the number of simultaneously
>> enabled counters. So in order to relieve the pain of thinking about it and
>> adding macros for each and every arch I decided to allocate the number the
>> user wants dynamically. The assumption also holds that the user knows the
>> tradeoffs of using too many counters and hence will not enable too many
>> events at once.
>
>The DPDK convention is to use fixed size arrays (with a maximum size, e.g. RTE_MAX_ETHPORTS) in the
>fast path, for performance reasons.
>
>Please use fixed size arrays instead of dynamically allocated arrays.
>

I do agree that from a maintenance angle fixed arrays are much more convenient,
but once optimization kicks in that statement does not necessarily hold true
anymore.

For example, in this case performance dropped by ~0.3%, which is insignificant imo.
So given the simpler code, the next patchset will use fixed arrays.
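
For reference, a minimal sketch of what the fixed-array variant could look like;
the RTE_MAX_NUM_GROUP_EVENTS name and its value are placeholders here, not
something already in the patch:

#define RTE_MAX_NUM_GROUP_EVENTS 8 /* upper bound on simultaneously enabled events */

struct rte_pmu_event_group {
	struct perf_event_mmap_page *mmap_pages[RTE_MAX_NUM_GROUP_EVENTS];
	/**< array of pointers to mmapped perf_event_mmap_page structures */
	int fds[RTE_MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
	bool enabled; /**< true if group was enabled on particular lcore */
};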

>>
>> > Also, what is the reason for making the type struct perf_event_mmap_page
>> > **mmap_pages opaque by using void **mmap_pages instead?
>>
>> I think that part doing mmap/munmap was written first, hence void **
>> was chosen in the first place.
>
>Please update it, so the actual type is reflected here.
>
>>
>> >
>> > > +	bool enabled; /**< true if group was enabled on particular lcore */
>> > > +};
>> > > +
>> > > +/**
>> > > + * A structure describing an event.
>> > > + */
>> > > +struct rte_pmu_event {
>> > > +	char *name; /** name of an event */
>> > > +	int index; /** event index into fds/mmap_pages */
>> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
>> > > +};
>> > > +
>> > > +/**
>> > > + * A PMU state container.
>> > > + */
>> > > +struct rte_pmu {
>> > > +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
>> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
>> > > +	int num_group_events; /**< number of events in a group */
>> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
>
>The event_list is used in slow path only, so it can remain a list - i.e. no change requested here.
>:-)
>
>> > > +};
>> > > +
>> > > +/** Pointer to the PMU state container */
>> > > +extern struct rte_pmu *rte_pmu;
>> >
>> > Again, why not just extern struct rte_pmu, instead of dynamic allocation?
>> >
>>
>> No strong opinions here since this is a matter of personal preference.
>> Can be removed in the next version.
>
>Yes, please.
>
>>
>> > > +
>> > > +/** Each architecture supporting PMU needs to provide its own version */
>> > > +#ifndef rte_pmu_pmc_read
>> > > +#define rte_pmu_pmc_read(index) ({ 0; })
>> > > +#endif
>> > > +
>> > > +/**
>> > > + * @internal
>> > > + *
>> > > + * Read PMU counter.
>> > > + *
>> > > + * @param pc
>> > > + *   Pointer to the mmapped user page.
>> > > + * @return
>> > > + *   Counter value read from hardware.
>> > > + */
>> > > +__rte_internal
>> > > +static __rte_always_inline uint64_t
>> > > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
>> > > +{
>> > > +	uint64_t offset, width, pmc = 0;
>> > > +	uint32_t seq, index;
>> > > +	int tries = 100;
>> > > +
>> > > +	for (;;) {
>
>As a matter of personal preference, I would write this loop differently:
>
>+ for (tries = 100; tries != 0; tries--) {
>
>> > > +		seq = pc->lock;
>> > > +		rte_compiler_barrier();
>> > > +		index = pc->index;
>> > > +		offset = pc->offset;
>> > > +		width = pc->pmc_width;
>> > > +
>> > > +		if (likely(pc->cap_user_rdpmc && index)) {
>
>Why "&& index"? The way I read [man perf_event_open], index 0 is perfectly valid.
>

Valid indexes start at 1; 0 means that the hw counter is either stopped or not
active. Maybe this is not immediately clear from the man page, but there is an
example later on showing how to get the actual number.
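
Paraphrasing the read sequence sketched in the man page (illustration only, using
the man page's rdpmc() pseudo-function, not a drop-in replacement for the patch):

uint32_t idx = pc->index; /* 0 means the counter cannot be read via rdpmc right now */
uint64_t count = 0;

if (pc->cap_user_rdpmc && idx != 0)
	count = pc->offset + rdpmc(idx - 1); /* hw register number is idx - 1 */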

>[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
>
>> > > +			pmc = rte_pmu_pmc_read(index - 1);
>> > > +			pmc <<= 64 - width;
>> > > +			pmc >>= 64 - width;
>> > > +		}
>> > > +
>> > > +		rte_compiler_barrier();
>> > > +
>> > > +		if (likely(pc->lock == seq))
>> > > +			return pmc + offset;
>> > > +
>> > > +		if (--tries == 0) {
>> > > +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
>> > > +			break;
>> > > +		}
>
>- Remove the 4 above lines of code, and move the debug log message to the end
>of the function instead.
>
>> > > +	}
>
>+ RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
>
>> > > +
>> > > +	return 0;
>> > > +}
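
Just to be sure we are talking about the same code, the restructured function
would then read along these lines (a sketch combining both of your suggestions,
nothing final):

__rte_internal
static __rte_always_inline uint64_t
rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
{
	uint64_t offset, width, pmc = 0;
	uint32_t seq, index;
	int tries;

	for (tries = 100; tries != 0; tries--) {
		seq = pc->lock;
		rte_compiler_barrier();
		index = pc->index;
		offset = pc->offset;
		width = pc->pmc_width;

		/* index 0 means the counter is stopped or not active */
		if (likely(pc->cap_user_rdpmc && index)) {
			pmc = rte_pmu_pmc_read(index - 1);
			pmc <<= 64 - width;
			pmc >>= 64 - width;
		}

		rte_compiler_barrier();

		if (likely(pc->lock == seq))
			return pmc + offset;
	}

	RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");

	return 0;
}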
>> > > +
>> > > +/**
>> > > + * @internal
>> > > + *
>> > > + * Enable group of events for a given lcore.
>> > > + *
>> > > + * @param lcore_id
>> > > + *   The identifier of the lcore.
>> > > + * @return
>> > > + *   0 in case of success, negative value otherwise.
>> > > + */
>> > > +__rte_internal
>> > > +int
>> > > +rte_pmu_enable_group(int lcore_id);
>> > > +
>> > > +/**
>> > > + * @warning
>> > > + * @b EXPERIMENTAL: this API may change without prior notice
>> > > + *
>> > > + * Add event to the group of enabled events.
>> > > + *
>> > > + * @param name
>> > > + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
>> > > + * @return
>> > > + *   Event index in case of success, negative value otherwise.
>> > > + */
>> > > +__rte_experimental
>> > > +int
>> > > +rte_pmu_add_event(const char *name);
>> > > +
>> > > +/**
>> > > + * @warning
>> > > + * @b EXPERIMENTAL: this API may change without prior notice
>> > > + *
>> > > + * Read hardware counter configured to count occurrences of an event.
>> > > + *
>> > > + * @param index
>> > > + *   Index of an event to be read.
>> > > + * @return
>> > > + *   Event value read from register. In case of errors or lack of support
>> > > + *   0 is returned. In other words, a stream of zeros in a trace file
>> > > + *   indicates a problem with reading a particular PMU event register.
>> > > + */
>> > > +__rte_experimental
>> > > +static __rte_always_inline uint64_t rte_pmu_read(int index)
>
>The index type can be changed from int to uint32_t. This also eliminates the "(index < 0" part of
>the comparison further below in this function.
>

That's true. 
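
A sketch of how the function could then look; this assumes the state container
becomes a plain struct and num_group_events becomes unsigned too, per the
comments above:

__rte_experimental
static __rte_always_inline uint64_t
rte_pmu_read(uint32_t index)
{
	struct rte_pmu_event_group *group = &rte_pmu.group[rte_lcore_id()];

	if (!group->enabled) {
		if (rte_pmu_enable_group(rte_lcore_id()))
			return 0;

		group->enabled = true;
	}

	/* no "< 0" leg needed anymore since index is unsigned */
	if (index >= rte_pmu.num_group_events)
		return 0;

	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
}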

>> > > +{
>> > > +	int lcore_id = rte_lcore_id();
>> > > +	struct rte_pmu_event_group *group;
>> > > +	int ret;
>> > > +
>> > > +	if (!rte_pmu)
>> > > +		return 0;
>> > > +
>> > > +	group = &rte_pmu->group[lcore_id];
>> > > +	if (!group->enabled) {
>
>Optimized: if (unlikely(!group->enabled)) {
>

The compiler will optimize the branch correctly by itself. An extra hint is not
required.

>> > > +		ret = rte_pmu_enable_group(lcore_id);
>> > > +		if (ret)
>> > > +			return 0;
>> > > +
>> > > +		group->enabled = true;
>> > > +	}
>> >
>> > Why is the group not enabled in the setup function, rte_pmu_add_event(),
>> > instead of here, in the hot path?
>> >
>>
>> When this is executed for the very first time the cpu will obviously have
>> more work to do, but afterwards the setup path is not taken, hence far
>> fewer cpu cycles are required.
>>
>> Setup is executed solely by the main lcore, before the worker lcores are
>> launched, hence some info passed to the perf_event_open() syscall is
>> missing, the pid (via rte_gettid()) being an example here.
>
>OK. Thank you for the explanation. Since impossible at setup, it has to be done at runtime.
>
>@Mattias: Another good example of something that would belong in per-thread constructors, as my
>suggested feature creep in [1].
>
>[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/
>
>>
>> > > +
>> > > +	if (index < 0 || index >= rte_pmu->num_group_events)
>
>Optimized: if (unlikely(index >= rte_pmu.num_group_events))
>
>> > > +		return 0;
>> > > +
>> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
>> >
>> > Using fixed size arrays instead of multiple indirections via pointers
>> > is faster. It could be:
>> >
>> > return rte_pmu_read_userpage((struct perf_event_mmap_page *)rte_pmu.group[lcore_id].mmap_pages[index]);
>> >
>> > With or without suggested performance improvements...
>> >
>> > Series-acked-by: Morten Brørup <mb at smartsharesystems.com>
>>
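
To wrap up, typical usage of the API discussed in this thread would look more or
less like this (sketch only; the "cpu_cycles" event name and the error handling
are illustrative, and the PMU subsystem is assumed to be initialized by EAL):

#include <rte_pmu.h>

static int event_index;

/* slow path, main lcore: register the event before workers start polling */
int
app_setup_pmu(void)
{
	event_index = rte_pmu_add_event("cpu_cycles");
	return event_index < 0 ? event_index : 0;
}

/* fast path, any lcore: read the counter; 0 signals an error or no support */
uint64_t
app_sample_cycles(void)
{
	return rte_pmu_read(event_index);
}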


