[PATCH v4 1/4] eal: add generic support for reading PMU events

Morten Brørup mb at smartsharesystems.com
Thu Jan 5 23:07:41 CET 2023


> From: Tomasz Duszynski [mailto:tduszynski at marvell.com]
> Sent: Thursday, 5 January 2023 22.14
> 
> Hi Morten,
> 
> A few comments inline.
> 
> >From: Morten Brørup <mb at smartsharesystems.com>
> >Sent: Wednesday, December 14, 2022 11:41 AM
> >
> >External Email
> >
> >----------------------------------------------------------------------
> >+CC: Mattias, see my comment below about per-thread constructor for
> >this
> >
> >> From: Tomasz Duszynski [mailto:tduszynski at marvell.com]
> >> Sent: Wednesday, 14 December 2022 10.39
> >>
> >> Hello Morten,
> >>
> >> Thanks for review. Answers inline.
> >>
> >> [...]
> >>
> >> > > +/**
> >> > > + * @file
> >> > > + *
> >> > > + * PMU event tracing operations
> >> > > + *
> >> > > + * This file defines generic API and types necessary to setup
> >> > > + * PMU and read selected counters in runtime.
> >> > > + */
> >> > > +
> >> > > +/**
> >> > > + * A structure describing a group of events.
> >> > > + */
> >> > > +struct rte_pmu_event_group {
> >> > > +	int *fds; /**< array of event descriptors */
> >> > > +	void **mmap_pages; /**< array of pointers to mmapped
> >> > > perf_event_attr structures */
> >> >
> >> > There seems to be a lot of indirection involved here. Why are
> >> > these arrays not statically sized, instead of dynamically
> >> > allocated?
> >> >
> >>
> >> Different architectures/PMUs impose limits on the number of
> >> simultaneously enabled counters. So, in order to relieve the pain of
> >> thinking about it and adding macros for each and every arch, I
> >> decided to dynamically allocate the number the user wants. The
> >> assumption also holds that the user knows about the tradeoffs of
> >> using too many counters, and hence will not enable too many events
> >> at once.
> >
> >The DPDK convention is to use fixed size arrays (with a maximum size,
> >e.g. RTE_MAX_ETHPORTS) in the fast path, for performance reasons.
> >
> >Please use fixed size arrays instead of dynamically allocated arrays.
> >
> 
> I do agree that from a maintenance angle fixed arrays are much more
> convenient, but when optimization kicks in, that statement does not
> necessarily hold true anymore.
> 
> For example, in this case performance dropped by ~0.3%, which is
> insignificant imo. So, given the simpler code, the next patchset will
> use fixed arrays.

I fail to understand how pointer chasing can perform better than obtaining an address by multiplying by a constant. Modern CPUs work in mysterious ways, and you obviously tested this, so I believe your test results. But in theory, pointer chasing touches more cache lines, and should perform worse in a loaded system where pointers in the chain have been evicted from the cache.

Anyway, you agreed to use fixed arrays, so I am happy. :-)
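For the record, the contrast under discussion boils down to a few lines. This is a standalone sketch (the struct and field names are made up for illustration, not the patch's actual types): the fixed-size layout needs one load at an address computed by a constant multiply, while the dynamic layout adds a dependent pointer load first, touching an extra cache line.

```c
#define MAX_EVENTS 16 /* hypothetical cap, in the spirit of RTE_MAX_ETHPORTS */

/* Fixed-size layout: fds is embedded in the struct, so reading fds[i]
 * is a single load at base + i * sizeof(int). */
struct group_fixed {
    int fds[MAX_EVENTS];
};

/* Dynamic layout: reading fds[i] first loads the fds pointer, then
 * dereferences it - two dependent loads, potentially two cache misses
 * on a loaded system where the pointer has been evicted. */
struct group_dynamic {
    int *fds;
};

int read_fixed(const struct group_fixed *g, int i)
{
    return g->fds[i];   /* one load */
}

int read_dynamic(const struct group_dynamic *g, int i)
{
    return g->fds[i];   /* two dependent loads */
}
```

The source-level syntax is identical; only the generated address computation differs, which is why the measured difference can be small when the pointer stays cache-hot.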

> 
> >>
> >> > Also, what is the reason for keeping the type struct
> >> > perf_event_mmap_page **mmap_pages opaque by using
> >> > void **mmap_pages instead?
> >>
> >> I think, that part doing mmap/munmap was written first hence void **
> >> was chosen in the first place.
> >
> >Please update it, so the actual type is reflected here.
> >
> >>
> >> >
> >> > > +	bool enabled; /**< true if group was enabled on particular lcore */
> >> > > +};
> >> > > +
> >> > > +/**
> >> > > + * A structure describing an event.
> >> > > + */
> >> > > +struct rte_pmu_event {
> >> > > +	char *name; /** name of an event */
> >> > > +	int index; /** event index into fds/mmap_pages */
> >> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */ };
> >> > > +
> >> > > +/**
> >> > > + * A PMU state container.
> >> > > + */
> >> > > +struct rte_pmu {
> >> > > +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> >> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> >> > > +	int num_group_events; /**< number of events in a group */
> >> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> >
> >The event_list is used in the slow path only, so it can remain a
> >list - i.e. no change requested here. :-)
> >
> >> > > +};
> >> > > +
> >> > > +/** Pointer to the PMU state container */
> >> > > +extern struct rte_pmu *rte_pmu;
> >> >
> >> > Again, why not just extern struct rte_pmu, instead of dynamic
> >> > allocation?
> >> >
> >>
> >> No strong opinions here, since this is a matter of personal
> >> preference. Can be removed in the next version.
> >
> >Yes, please.
> >
> >>
> >> > > +
> >> > > +/** Each architecture supporting PMU needs to provide its own version */
> >> > > +#ifndef rte_pmu_pmc_read
> >> > > +#define rte_pmu_pmc_read(index) ({ 0; })
> >> > > +#endif
> >> > > +
> >> > > +/**
> >> > > + * @internal
> >> > > + *
> >> > > + * Read PMU counter.
> >> > > + *
> >> > > + * @param pc
> >> > > + *   Pointer to the mmapped user page.
> >> > > + * @return
> >> > > + *   Counter value read from hardware.
> >> > > + */
> >> > > +__rte_internal
> >> > > +static __rte_always_inline uint64_t
> >> > > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> >> > > +{
> >> > > +	uint64_t offset, width, pmc = 0;
> >> > > +	uint32_t seq, index;
> >> > > +	int tries = 100;
> >> > > +
> >> > > +	for (;;) {
> >
> >As a matter of personal preference, I would write this loop
> >differently:
> >
> >+ for (tries = 100; tries != 0; tries--) {
> >
> >> > > +		seq = pc->lock;
> >> > > +		rte_compiler_barrier();
> >> > > +		index = pc->index;
> >> > > +		offset = pc->offset;
> >> > > +		width = pc->pmc_width;
> >> > > +
> >> > > +		if (likely(pc->cap_user_rdpmc && index)) {
> >
> >Why "&& index"? The way I read [man perf_event_open], index 0 is
> perfectly valid.
> >
> 
> Valid indices start at 1; 0 means that the hw counter is either
> stopped or isn't active. Maybe this is not immediately clear from the
> man page, but there's an example later on showing how to get the
> actual number.

OK. Thanks for the reference.

Please add a comment about the special meaning of index 0 in the code.
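To make the requested comment concrete, here is a standalone sketch of how the read loop might look with the review feedback above folded in (bounded for-loop, log moved past the loop, index 0 documented as a sentinel). The struct and the counter-read helper are simplified stand-ins so the sketch compiles on its own; the real layout lives in <linux/perf_event.h>, and the compiler barriers from the patch are elided here.

```c
#include <stdint.h>

/* Mock of the kernel's perf_event_mmap_page, with only the fields used
 * below (illustrative, not the real definition). */
struct mock_mmap_page {
    uint32_t lock;
    uint32_t index;
    int64_t offset;
    uint16_t pmc_width;
    uint64_t cap_user_rdpmc;
};

/* Stubbed hardware counter read; in the patch this is the arch-specific
 * rte_pmu_pmc_read() (e.g. rdpmc on x86). */
static uint64_t mock_pmc_read(uint32_t hw_index)
{
    (void)hw_index;
    return 1000;
}

static uint64_t read_userpage(struct mock_mmap_page *pc)
{
    uint64_t offset, width, pmc;
    uint32_t seq, index;
    int tries;

    for (tries = 100; tries != 0; tries--) {
        /* real code needs compiler barriers around these reads */
        seq = pc->lock;
        index = pc->index;
        offset = (uint64_t)pc->offset;
        width = pc->pmc_width;
        pmc = 0;

        /* Per perf_event_open(2): index == 0 means the counter is
         * stopped or not scheduled on this CPU; valid rdpmc indices
         * start at 1, so the hardware register number is index - 1. */
        if (pc->cap_user_rdpmc && index != 0) {
            pmc = mock_pmc_read(index - 1);
            pmc <<= 64 - width;   /* keep only the low 'width' bits */
            pmc >>= 64 - width;
        }

        if (pc->lock == seq)      /* no writer raced us; value consistent */
            return pmc + offset;
    }

    /* debug log about failing to get the userpage lock would go here */
    return 0;
}
```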

> 
> >[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
> >
> >> > > +			pmc = rte_pmu_pmc_read(index - 1);
> >> > > +			pmc <<= 64 - width;
> >> > > +			pmc >>= 64 - width;
> >> > > +		}
> >> > > +
> >> > > +		rte_compiler_barrier();
> >> > > +
> >> > > +		if (likely(pc->lock == seq))
> >> > > +			return pmc + offset;
> >> > > +
> >> > > +		if (--tries == 0) {
> >> > > +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> >> > > +			break;
> >> > > +		}
> >
> >- Remove the 4 above lines of code, and move the debug log message
> >to the end of the function instead.
> >
> >> > > +	}
> >
> >+ RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> >
> >> > > +
> >> > > +	return 0;
> >> > > +}
> >> > > +
> >> > > +/**
> >> > > + * @internal
> >> > > + *
> >> > > + * Enable group of events for a given lcore.
> >> > > + *
> >> > > + * @param lcore_id
> >> > > + *   The identifier of the lcore.
> >> > > + * @return
> >> > > + *   0 in case of success, negative value otherwise.
> >> > > + */
> >> > > +__rte_internal
> >> > > +int
> >> > > +rte_pmu_enable_group(int lcore_id);
> >> > > +
> >> > > +/**
> >> > > + * @warning
> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
> >> > > + *
> >> > > + * Add event to the group of enabled events.
> >> > > + *
> >> > > + * @param name
> >> > > + *   Name of an event listed under
> >> > > /sys/bus/event_source/devices/pmu/events.
> >> > > + * @return
> >> > > + *   Event index in case of success, negative value otherwise.
> >> > > + */
> >> > > +__rte_experimental
> >> > > +int
> >> > > +rte_pmu_add_event(const char *name);
> >> > > +
> >> > > +/**
> >> > > + * @warning
> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
> >> > > + *
> >> > > + * Read hardware counter configured to count occurrences of an
> >> event.
> >> > > + *
> >> > > + * @param index
> >> > > + *   Index of an event to be read.
> >> > > + * @return
> >> > > + *   Event value read from register. In case of errors or lack of
> >> > > + *   support, 0 is returned. In other words, a stream of zeros in a
> >> > > + *   trace file indicates a problem with reading a particular PMU
> >> > > + *   event register.
> >> > > + */
> >> > > +__rte_experimental
> >> > > +static __rte_always_inline uint64_t rte_pmu_read(int index)
> >
> >The index type can be changed from int to uint32_t. This also
> >eliminates the "index < 0" part of the comparison further below in
> >this function.
> >
> 
> That's true.
> 
> >> > > +{
> >> > > +	int lcore_id = rte_lcore_id();
> >> > > +	struct rte_pmu_event_group *group;
> >> > > +	int ret;
> >> > > +
> >> > > +	if (!rte_pmu)
> >> > > +		return 0;
> >> > > +
> >> > > +	group = &rte_pmu->group[lcore_id];
> >> > > +	if (!group->enabled) {
> >
> >Optimized: if (unlikely(!group->enabled)) {
> >
> 
> The compiler will optimize the branch correctly by itself. An extra
> hint is not required.

I haven't reviewed the output from this, so I'll take your word for it. I suggested the unlikely() because I previously tested some very simple code, and it optimized for taking the "if":

#include <stdbool.h>
#include <stdlib.h>

void testb(bool b)
{
    if (!b)
        exit(1);

    exit(99);
}

I guess I should experiment with more realistic code, and update my optimization notes!

You could add the unlikely() for readability purposes. ;-)
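For what it's worth, here is a minimal standalone illustration of the hint in question. DPDK's own likely()/unlikely() live in rte_branch_prediction.h; this sketch defines an equivalent macro on the same GCC/Clang builtin so it compiles on its own, and the function is a made-up stand-in for the fast path, not the patch's code.

```c
/* Equivalent of DPDK's unlikely() from rte_branch_prediction.h,
 * defined locally so this sketch is self-contained. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical fast-path function: the hint asks the compiler to lay
 * out the enabled case as the fall-through (hot) path and move the
 * one-time setup/error case out of line. */
static int read_counter(int enabled)
{
    if (unlikely(!enabled))
        return 0;   /* cold: group not yet enabled */

    return 42;      /* hot: counter value would be read here */
}
```

Even if the compiler guesses right without it, the annotation documents which path the author considers cold.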

> 
> >> > > +		ret = rte_pmu_enable_group(lcore_id);
> >> > > +		if (ret)
> >> > > +			return 0;
> >> > > +
> >> > > +		group->enabled = true;
> >> > > +	}
> >> >
> >> > Why is the group not enabled in the setup function,
> >> > rte_pmu_add_event(), instead of here, in the hot path?
> >> >
> >>
> >> When this is executed for the very first time, the CPU will
> >> obviously have more work to do, but afterwards the setup path is
> >> not taken, hence far fewer CPU cycles are required.
> >>
> >> Setup is executed solely by the main lcore, before worker lcores
> >> are launched, hence some info passed to the SYS_perf_event_open
> >> ioctl() is missing, the pid (via rte_gettid()) being an example
> >> here.
> >
> >OK. Thank you for the explanation. Since it is impossible at setup,
> >it has to be done at runtime.
> >
> >@Mattias: Another good example of something that would belong in
> >per-thread constructors, as my suggested feature creep in [1].
> >
> >[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/
> >
> >>
> >> > > +
> >> > > +	if (index < 0 || index >= rte_pmu->num_group_events)
> >
> >Optimized: if (unlikely(index >= rte_pmu.num_group_events))
> >
> >> > > +		return 0;
> >> > > +
> >> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page
> >> > > *)group->mmap_pages[index]);
> >> >
> >> > Using fixed size arrays instead of multiple indirections via
> >> > pointers is faster. It could be:
> >> >
> >> > return rte_pmu_read_userpage((struct perf_event_mmap_page
> >> > *)rte_pmu.group[lcore_id].mmap_pages[index]);
> >> >
> >> > With or without the suggested performance improvements...
> >> >
> >> > Series-acked-by: Morten Brørup <mb at smartsharesystems.com>
> >>
> 


