[dpdk-dev] [PATCH v1 1/7] power_intrinsics: allow monitor checks inversion

Ananyev, Konstantin konstantin.ananyev at intel.com
Thu Jun 24 11:47:03 CEST 2021


> >>>>>>>
> >>>>>>>> Previously, the semantics of power monitor were such that we were
> >>>>>>>> checking current value against the expected value, and if they matched,
> >>>>>>>> then the sleep was aborted. This is somewhat inflexible, because it only
> >>>>>>>> allowed us to check for a specific value.
> >>>>>>>>
> >>>>>>>> This commit adds an option to reverse the check, so that we can have
> >>>>>>>> monitor sleep aborted if the expected value *doesn't* match what's in
> >>>>>>>> memory. This allows us to both implement all currently implemented
> >>>>>>>> driver code, as well as support more use cases which don't easily map to
> >>>>>>>> previous semantics (such as waiting on writes to AF_XDP counter value).
> >>>>>>>>
> >>>>>>>> Since the old behavior is the default, no need to adjust existing
> >>>>>>>> implementations.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov at intel.com>
> >>>>>>>> ---
> >>>>>>>>      lib/eal/include/generic/rte_power_intrinsics.h | 4 ++++
> >>>>>>>>      lib/eal/x86/rte_power_intrinsics.c             | 5 ++++-
> >>>>>>>>      2 files changed, 8 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/lib/eal/include/generic/rte_power_intrinsics.h b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>>>> index dddca3d41c..1006c2edfc 100644
> >>>>>>>> --- a/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>>>> +++ b/lib/eal/include/generic/rte_power_intrinsics.h
> >>>>>>>> @@ -31,6 +31,10 @@ struct rte_power_monitor_cond {
> >>>>>>>>                             *   4, or 8. Supplying any other value will result in
> >>>>>>>>                             *   an error.
> >>>>>>>>                             */
> >>>>>>>> +     uint8_t invert;  /**< Invert check for expected value (e.g. instead of
> >>>>>>>> +                       *   checking if `val` matches something, check if
> >>>>>>>> +                       *   `val` *doesn't* match a particular value)
> >>>>>>>> +                       */
> >>>>>>>>      };
> >>>>>>>>
> >>>>>>>>      /**
> >>>>>>>> diff --git a/lib/eal/x86/rte_power_intrinsics.c b/lib/eal/x86/rte_power_intrinsics.c
> >>>>>>>> index 39ea9fdecd..5d944e9aa4 100644
> >>>>>>>> --- a/lib/eal/x86/rte_power_intrinsics.c
> >>>>>>>> +++ b/lib/eal/x86/rte_power_intrinsics.c
> >>>>>>>> @@ -117,7 +117,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> >>>>>>>>                   const uint64_t masked = cur_value & pmc->mask;
> >>>>>>>>
> >>>>>>>>                   /* if the masked value is already matching, abort */
> >>>>>>>> -             if (masked == pmc->val)
> >>>>>>>> +             if (!pmc->invert && masked == pmc->val)
> >>>>>>>> +                     goto end;
> >>>>>>>> +             /* same, but for inverse check */
> >>>>>>>> +             if (pmc->invert && masked != pmc->val)
> >>>>>>>>                           goto end;
> >>>>>>>>           }
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hmm..., such approach looks too 'patchy'...
> >>>>>>> Can we at least replace 'inver' with something like:
> >>>>>>> enum rte_power_monitor_cond_op {
> >>>>>>>             _EQ, NEQ,...
> >>>>>>> };
> >>>>>>> Then at least new comparions ops can be added in future.
> >>>>>>> Even better I think would be to just leave to PMD to provide a comparison callback.
> >>>>>>> Will make things really simple and generic:
> >>>>>>> struct rte_power_monitor_cond {
> >>>>>>>          volatile void *addr;
> >>>>>>>          int (*cmp)(uint64_t val);
> >>>>>>>          uint8_t size;
> >>>>>>> };
> >>>>>>> And then in rte_power_monitor(...):
> >>>>>>> ....
> >>>>>>> const uint64_t cur_value = __get_umwait_val(pmc->addr, pmc->size);
> >>>>>>> if (pmc->cmp(cur_value) != 0)
> >>>>>>>             goto end;
> >>>>>>> ....
> >>>>>>>
> >>>>>>
> >>>>>> I like the idea of a callback, but these are supposed to be
> >>>>>> intrinsic-like functions, so putting too much into them is contrary to
> >>>>>> their goal, and it's going to make the API hard to use in simpler cases
> >>>>>> (e.g. when we're explicitly calling rte_power_monitor as opposed to
> >>>>>> letting the RX callback do it for us). For example, event/dlb code calls
> >>>>>> rte_power_monitor explicitly.
> >>>>>
> >>>>> Good point, I didn't know that.
> >>>>> Would be interesting to see how do they use it.
> >>>>
> >>>> To be fair, it should be possible to rewrite their code using a
> >>>> callback. Perhaps adding a (void *) parameter for any custom data
> >>>> related to the callback (because C doesn't have closures...), but
> >>>> otherwise it should be doable, so the question isn't that it's
> >>>> impossible to rewrite event/dlb code to use callbacks, it's more of an
> >>>> issue with complicating usage of already-not-quite-straightforward API
> >>>> even more.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> It's going to be especially "fun" to do these indirect function calls
> >>>>>> from inside transactional region on call to multi-monitor.
> >>>>>
> >>>>> But the callback is not supposed to do any memory reads/writes.
> >>>>> Just mask/compare of the provided value with some constant.
> >>>>
> >>>> Yeah, but with callbacks we can't really control that, can we? I mean i
> >>>> guess a *sane* implementation wouldn't do that, but still, it's
> >>>> theoretically possible to perform more complex checks and even touch
> >>>> some unrelated data in the process.
> >>>
> >>> Yep, PMD developer can ignore recommendations and do whatever
> >>> he wants in the call-back. We can't control it.
> >>> If he touches some memory in it - probably there will be more spurious wakeups and less power saves.
> >>> In principle it is the same with all other PMD dev-ops - we have to trust that they are
> >>> doing what they have to.
> >>
> >> I did a quick prototype for this, and i don't think it is going to work.
> >>
> >> Callbacks with just "current value" as argument will be pretty limited
> >> and will only really work for cases where we know what we are expecting.
> >> However, for cases like event/dlb or net/mlx5, the expected value is (or
> >> appears to be) dependent upon some internal device data, and is not
> >> constant like in case of net/ixgbe for example.
> >>
> >> This can be fixed by passing an opaque pointer, either by storing it in
> >> the monitor condition, or by passing it directly to rte_power_monitor at
> >> invocation time.
> >>
> >> The latter doesn't work well because when we call rte_power_monitor from
> >> inside the rte_power library, we lack the context necessary to get said
> >> opaque pointer.
> >>
> >> The former doesn't work either, because the only place where we can get
> >> this argument is inside get_monitor_addr, but the opaque pointer must
> >> persist after we exit that function in order to avoid use-after-free -
> >> which means that it either has to be statically allocated (which means
> >> it's not thread-safe for a non-trivial case), or dynamically allocated
> >> (which a big no-no on a hotpath).
> >
> > If I get you right, expected_value (and probably mask) can be variable ones.
> > So for callback approach to work we need to pass all this as parameters
> > to PMD comparison callback:
> > int pmc_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
> > Correct?
> 
> If we have both expected value, mask, and current value, then what's the
> point of the callback? The point of the callback would be to pass just
> the current value, and let the callback decide what's the expected value
> and how to compare it.

For me the main point of callback is to hide PMD specific comparison semantics.
Basically they provide us with some values in struct rte_power_monitor_cond,
and then it is up to them how to interpret them in their comparison function.
All we'll do for them: will read the value at address provided. 
I understand that it looks like an overkill, as majority of these comparison functions
will be like:
int cmp_callback(uint64_t real_val, uint64_t expected_val, uint64_t mask)
{
	return ((real_val & mask) == expected_val);
}  
Though qsort() and bsearch() work in a similar manner, and everyone seems ok with it.

> 
> So, we can either let callback handle expected values itself by having
> an opaque callback-specific argument (which means it has to persist
> between .get_monitor_addr() and rte_power_monitor() calls), 

But that's what we doing already - PMD fills rte_power_monitor_cond values
for us, we store them somewhere and then use them to decide should we go to sleep or not.
All callback does - moves actual values interpretation back to PMD:
Right now:
PMD:      provide PMC values
POWER: store PMC values somewhere
                read the value at address provided in PMC
                interpret PMC values and newly read value and make the decision

With callback:   
PMD:      provide PMC values
POWER: store PMC values somewhere
                read the value at address provided in PMC
PMD:      interpret PMC values and newly read value and make the decision

Or did you mean something different here?

>or we do the
> comparisons inside rte_power_monitor(), and store the expected/mask
> values in the monitor condition, and *don't* have any callbacks at all.
> Are you suggesting an alternative to the above two options?

As I said in my first mail - we can just replace 'inverse' with 'op'.
That at least will make this API extendable, if someone will need
something different in future.    

Another option is 

> 
> >
> >>
> >> Any other suggestions? :)
> >>
> >>>
> >>>>
> >>>>>
> >>>>>> I'm not
> >>>>>> opposed to having a callback here, but maybe others have more thoughts
> >>>>>> on this?
> >>>>>>
> >>>>>> --
> >>>>>> Thanks,
> >>>>>> Anatoly
> >>>>
> >>>>
> >>>> --
> >>>> Thanks,
> >>>> Anatoly
> >>
> >>
> >> --
> >> Thanks,
> >> Anatoly
> 
> 
> --
> Thanks,
> Anatoly


More information about the dev mailing list