[dpdk-dev] mbuf changes

Morten Brørup mb at smartsharesystems.com
Tue Oct 25 16:25:08 CEST 2016


Lots of good arguments from Konstantin below!

Thomas also wrote something worth repeating:
"The mbuf space must be kept jealously like a DPDK treasure :)"

Maybe we should take a step away from the keyboard and consider the decision process and guidance criteria for mbuf members.

Here is an example of what I am thinking (a rough layout sketch follows the list):

1. It is the intention to optimize mbuf handling for the most likely fast path through the entire system in the most common applications.
2. It is the intention that the most likely fast path through the RX handler and RX processing of the most common applications only needs to touch the first cache line.
3. The first cache line should be preferred for metadata set by most common NIC hardware in the RX handler.
4. The second cache line should be preferred for metadata required by most common NIC hardware in the TX handler.
5. The first cache line could be used for metadata used by the RX processing in the most likely fast path of most common applications.
6. The second cache line could be used for metadata used by the TX processing in the most likely fast path of most common applications.
7. The RX and TX handlers are equally important, and the RX handler should not be optimized at the cost of the TX handler.
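
For illustration, here is a minimal sketch of the kind of grouping these criteria imply. It is NOT the actual rte_mbuf layout; the field names and sizes are made up for the example, it assumes 64-bit pointers and 64-byte cache lines, and a real mbuf carries many more fields.

/*
 * Illustrative sketch only - not the real struct rte_mbuf.
 * Criteria 3 and 5: RX metadata grouped in the first cache line.
 * Criteria 4 and 6: TX/exception metadata grouped in the second one.
 */
#include <stdint.h>
#include <assert.h>

struct example_mbuf {
        /* First cache line: RX handler and RX fast path. */
        void     *buf_addr;      /* packet data buffer (8 B on 64-bit) */
        uint64_t  ol_flags;      /* offload flags set on RX */
        uint32_t  pkt_len;       /* total packet length */
        uint16_t  data_len;      /* length in this segment */
        uint16_t  port;          /* input port */
        uint32_t  rss_hash;      /* RSS hash from the NIC */
        uint32_t  packet_type;   /* parsed packet type */
        uint8_t   pad0[32];      /* pad out to the cache-line boundary */

        /* Second cache line: TX handler and less common paths. */
        uint16_t  nb_segs;       /* number of segments (needed by TX) */
        uint16_t  tx_l2_l3_len;  /* header lengths for TX offloads */
        uint32_t  seqn;          /* SW sequence number (reorder) */
        uint8_t   pad1[56];      /* pad out to the cache-line boundary */
};

static_assert(sizeof(struct example_mbuf) == 128, "two cache lines");

The exact split is of course up for debate; the point is only that criteria like the above give a rule for deciding which half a new field should go into.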

Please note the use of "most common" and "most likely". It should be clear which cases we are optimizing for.

We should probably also define "the most common applications" (e.g. a Layer 3 router with or without VPN) and "the most likely fast path" (e.g. a single packet matching a series of lookups and ultimately forwarded to a single port).

It might also make sense to document the mbuf fields in more detail somewhere, e.g. the need for nb_segs in the NIC's TX handler.


Med venlig hilsen / kind regards
- Morten Brørup


> -----Original Message-----
> From: Ananyev, Konstantin [mailto:konstantin.ananyev at intel.com]
> Sent: Tuesday, October 25, 2016 3:39 PM
> To: Olivier Matz; Richardson, Bruce; Morten Brørup
> Cc: Adrien Mazarguil; Wiles, Keith; dev at dpdk.org; Oleg Kuporosov
> Subject: RE: [dpdk-dev] mbuf changes
> 
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Olivier Matz
> > Sent: Tuesday, October 25, 2016 1:49 PM
> > To: Richardson, Bruce <bruce.richardson at intel.com>; Morten Brørup
> > <mb at smartsharesystems.com>
> > Cc: Adrien Mazarguil <adrien.mazarguil at 6wind.com>; Wiles, Keith
> > <keith.wiles at intel.com>; dev at dpdk.org; Oleg Kuporosov
> > <olegk at mellanox.com>
> > Subject: Re: [dpdk-dev] mbuf changes
> >
> >
> >
> > On 10/25/2016 02:45 PM, Bruce Richardson wrote:
> > > On Tue, Oct 25, 2016 at 02:33:55PM +0200, Morten Brørup wrote:
> > >> Comments at the end.
> > >>
> > >> Med venlig hilsen / kind regards
> > >> - Morten Brørup
> > >>
> > >>> -----Original Message-----
> > >>> From: Bruce Richardson [mailto:bruce.richardson at intel.com]
> > >>> Sent: Tuesday, October 25, 2016 2:20 PM
> > >>> To: Morten Brørup
> > >>> Cc: Adrien Mazarguil; Wiles, Keith; dev at dpdk.org; Olivier Matz;
> > >>> Oleg Kuporosov
> > >>> Subject: Re: [dpdk-dev] mbuf changes
> > >>>
> > >>> On Tue, Oct 25, 2016 at 02:16:29PM +0200, Morten Brørup wrote:
> > >>>> Comments inline.
> > >>>>
> > >>>>> -----Original Message-----
> > >>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce
> > >>>>> Richardson
> > >>>>> Sent: Tuesday, October 25, 2016 1:14 PM
> > >>>>> To: Adrien Mazarguil
> > >>>>> Cc: Morten Brørup; Wiles, Keith; dev at dpdk.org; Olivier Matz;
> > >>>>> Oleg Kuporosov
> > >>>>> Subject: Re: [dpdk-dev] mbuf changes
> > >>>>>
> > >>>>> On Tue, Oct 25, 2016 at 01:04:44PM +0200, Adrien Mazarguil
> wrote:
> > >>>>>> On Tue, Oct 25, 2016 at 12:11:04PM +0200, Morten Brørup wrote:
> > >>>>>>> Comments inline.
> > >>>>>>>
> > >>>>>>> Med venlig hilsen / kind regards
> > >>>>>>> - Morten Brørup
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> -----Original Message-----
> > >>>>>>>> From: Adrien Mazarguil [mailto:adrien.mazarguil at 6wind.com]
> > >>>>>>>> Sent: Tuesday, October 25, 2016 11:39 AM
> > >>>>>>>> To: Bruce Richardson
> > >>>>>>>> Cc: Wiles, Keith; Morten Brørup; dev at dpdk.org; Olivier Matz;
> > >>>>>>>> Oleg Kuporosov
> > >>>>>>>> Subject: Re: [dpdk-dev] mbuf changes
> > >>>>>>>>
> > >>>>>>>> On Mon, Oct 24, 2016 at 05:25:38PM +0100, Bruce Richardson
> > >>> wrote:
> > >>>>>>>>> On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith
> > >>> wrote:
> > >>>>>>>> [...]
> > >>>>>>>>>>> On Oct 24, 2016, at 10:49 AM, Morten Brørup
> > >>>>>>>> <mb at smartsharesystems.com> wrote:
> > >>>>>>>> [...]
> > >>>>
> > >>>>>>>>> One other point I'll mention is that we need to have a
> > >>>>>>>>> discussion on how/where to add in a timestamp value into
> > >>>>>>>>> the mbuf. Personally, I think it can be in a union with
> > >>>>>>>>> the sequence number value, but I also suspect that 32 bits
> > >>>>>>>>> of a timestamp is not going to be enough for many.
> > >>>>>>>>>
> > >>>>>>>>> Thoughts?
> > >>>>>>>>
> > >>>>>>>> If we consider that timestamp representation should use
> > >>>>>>>> nanosecond granularity, a 32-bit value may likely wrap
> > >>>>>>>> around too quickly to be useful. We can also assume that
> > >>>>>>>> applications requesting timestamps may care more about
> > >>>>>>>> latency than throughput; Oleg found that using the second
> > >>>>>>>> cache line for this purpose had a noticeable impact [1].
> > >>>>>>>>
> > >>>>>>>>  [1] http://dpdk.org/ml/archives/dev/2016-October/049237.html
> > >>>>>>>
> > >>>>>>> I agree with Oleg about the latency vs. throughput
> > >>>>>>> importance for such applications.
> > >>>>>>>
> > >>>>>>> If you need high resolution timestamps, consider them to be
> > >>>>>>> generated by the NIC RX driver, possibly by the hardware
> > >>>>>>> itself
> > >>>>>>> (http://w3new.napatech.com/features/time-precision/hardware-time-stamp),
> > >>>>>>> so the timestamp belongs in the first cache line. And I am
> > >>>>>>> proposing that it should have the highest possible accuracy,
> > >>>>>>> which makes the value hardware dependent.
> > >>>>>>>
> > >>>>>>> Furthermore, I am arguing that we leave it up to the
> > >>>>>>> application to keep track of the slowly moving bits (i.e.
> > >>>>>>> counting whole seconds, hours and calendar date) out of
> > >>>>>>> band, so we don't use precious space in the mbuf. The
> > >>>>>>> application doesn't need the NIC RX driver's fast path to
> > >>>>>>> capture which date (or even which second) a packet was
> > >>>>>>> received. Yes, it adds complexity to the application, but we
> > >>>>>>> can't set aside 64 bits for a generic timestamp. Or as a
> > >>>>>>> weird tradeoff: put the fast moving 32 bits in the first
> > >>>>>>> cache line and the slow moving 32 bits in the second cache
> > >>>>>>> line, as a placeholder for the application to fill out if
> > >>>>>>> needed. Yes, it means that the application needs to check
> > >>>>>>> the time and update its variable holding the slow moving
> > >>>>>>> time once every second or so; but that should be doable
> > >>>>>>> without significant effort.
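
To make the out-of-band idea above concrete, here is a minimal sketch. It is not an existing DPDK API; reconstruct_ts_ns and epoch_ns are made-up names, and it assumes the application refreshes its 64-bit reference time often enough that any packet was timestamped within about two seconds of it.

#include <stdint.h>

/*
 * Sketch only: combine the 32-bit fast-moving nanosecond value stored
 * in the mbuf with a full 64-bit reference time ("epoch_ns") that the
 * application refreshes out of band, e.g. once per polling loop.
 */
static inline uint64_t
reconstruct_ts_ns(uint32_t mbuf_ts32, uint64_t epoch_ns)
{
        uint64_t full = (epoch_ns & ~(uint64_t)UINT32_MAX) | mbuf_ts32;
        int64_t diff = (int64_t)(full - epoch_ns);

        /* Pick the candidate within half a wrap (~2.1 s) of the epoch. */
        if (diff > (int64_t)(UINT32_MAX / 2))
                full -= (uint64_t)UINT32_MAX + 1;
        else if (diff < -(int64_t)(UINT32_MAX / 2))
                full += (uint64_t)UINT32_MAX + 1;
        return full;
}
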
> > >>>>>>
> > >>>>>> That's a good point; however, without a 64 bit value, elapsed
> > >>>>>> time between two arbitrary mbufs cannot be measured reliably
> > >>>>>> due to not enough context; one way or another the low
> > >>>>>> resolution value is also needed.
> > >>>>>>
> > >>>>>> Obviously latency-sensitive applications are unlikely to
> > >>>>>> perform lengthy buffering and require this, but I'm not sure
> > >>>>>> about all the possible use-cases. Considering many NICs
> > >>>>>> expose 64 bit timestamps, I suggest we do not truncate them.
> > >>>>>>
> > >>>>>> I'm not a fan of the weird tradeoff either; PMDs will be
> > >>>>>> tempted to fill the extra 32 bits whenever they can and
> > >>>>>> negate the performance improvement of the first cache line.
> > >>>>>
> > >>>>> I would tend to agree, and I don't really see any convenient
> > >>>>> way to avoid putting in a 64-bit field for the timestamp in
> > >>>>> cache line 0. If we are ok with having this overlap/partially
> > >>>>> overlap with sequence number, it will use up an extra 4B of
> > >>>>> storage in that cacheline.
> > >>>>
> > >>>> I agree about the lack of convenience! And Adrien certainly has
> > >>>> a point about PMD temptations.
> > >>>>
> > >>>> However, I still don't think that a NIC's ability to date-stamp
> > >>>> a packet is sufficient reason to put a date-stamp in cache line
> > >>>> 0 of the mbuf. Storing only the fast moving 32 bits in cache
> > >>>> line 0 seems like a good compromise to me.
> > >>>>
> > >>>> Maybe you can find just one more byte, so it can hold 17
> > >>>> minutes with nanosecond resolution. (I'm joking!)
> > >>>>
> > >>>> Please don't sacrifice the sequence number for the
> > >>>> seconds/hours/days part of a timestamp. Maybe it could be
> > >>>> configurable to use a 32 bit or 64 bit timestamp.
> > >>>>
> > >>> Do you see both timestamp and sequence numbers being used
> > >>> together? I would have thought that apps would either use one
> > >>> or the other? However, your suggestion is workable in any case,
> > >>> to allow the sequence number to overlap just the high 32 bits
> > >>> of the timestamp, rather than the low.
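
For reference, the overlap being discussed could look something like the sketch below. It is not the rte_mbuf definition; the type and field names are made up, the anonymous struct needs C11, and the high/low split assumes a little-endian host.

#include <stdint.h>

/* Sketch only: seqn aliases the slow-moving high half of a 64-bit
 * timestamp, leaving the fast-moving low half usable for timing. */
union ts_or_seqn {
        uint64_t timestamp;        /* full 64-bit timestamp */
        struct {
                uint32_t ts_low;   /* low 32 bits (fast moving) */
                uint32_t seqn;     /* overlaps the high 32 bits */
        };
};
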
> > >>
> > >> In our case, I can foresee sequence numbers used for packet
> > >> processing and timestamps for timing analysis (and possibly for
> > >> packet capturing, when being used).
> 
> Great, but right now none of these fields are filled from NIC HW by
> the PMD RX function (except the RFC for Mellanox provided by Oleg,
> but again that is a pure SW implementation).
> So I would repeat my question: why should these fields stay in the
> first cache line?
> I understand that it would speed up some particular application, but
> there are plenty of apps which do use different metadata.
> Let's say l2/l3/l4 len - that is very useful information for upper
> layer (L3/L4) packet processing.
> Should the people who do use it start to push for moving those fields
> into the first cache line too?
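
As a small illustration of that kind of upper-layer use, a sketch only (the struct and field names here are made up for the example, not the rte_mbuf accessors): per-packet l2/l3 lengths let L3/L4 code locate headers without re-parsing.

#include <stdint.h>

struct pkt_meta {
        uint8_t *data;    /* start of the frame */
        uint16_t l2_len;  /* e.g. 14 for plain Ethernet */
        uint16_t l3_len;  /* e.g. 20 for an IPv4 header without options */
};

/* Return a pointer to the L4 header using the stored lengths. */
static inline uint8_t *
l4_header(const struct pkt_meta *pkt)
{
        return pkt->data + pkt->l2_len + pkt->l3_len;
}
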
> 
> > For timing analysis, we don’t need long durations; e.g. 4 seconds
> > with 32-bit nanosecond resolution suffices. And for packet capturing
> > we are perfectly capable of adding the slowly moving 32 bits of the
> > timestamp to our output data stream without fetching it from the
> > mbuf.
> > >>
> >
> > We should keep in mind that today we have the seqn field but it is
> > not used by any PMD. In case it is implemented, would it be a
> > per-queue sequence number? Is it useful from an application view?
> 
> Exactly - it depends on the SW that uses that field, right?
> It could be per process / per group of lcores / per port / per queue,
> etc., depending on what the upper layer needs.
> 
> >
> > This field is only used by the librte_reorder library, and in my
> > opinion, we should consider moving it to the second cache line since
> > it is not filled by the PMD.
> 
> +1
> 
> >
> >
> > > For the 32-bit timestamp case, it might be useful to have a
> > > right-shift value passed in to the ethdev driver. If we assume a
> > > NIC with nanosecond resolution (or a TSC value with resolution of
> > > that order of magnitude), then the app can choose to have 1 ns
> > > resolution with 4 second wraparound, or alternatively 4 ns
> > > resolution with 16 second wraparound, or even microsecond
> > > resolution with wraparound of over an hour.
> > > The cost is obviously just a shift op in the driver code per
> > > packet - hopefully with multiple packets done at a time using
> > > vector operations.
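
As a quick illustration of that tradeoff, a sketch only: there is no such ethdev option today, and ts32_from_ns is a made-up name.

#include <stdint.h>

/* Sketch only: the driver right-shifts a 64-bit nanosecond counter by a
 * configurable amount before truncating to the 32-bit mbuf field,
 * trading resolution for a longer wraparound period. */
static inline uint32_t
ts32_from_ns(uint64_t ts_ns, unsigned int shift)
{
        return (uint32_t)(ts_ns >> shift);
}

/*
 * shift = 0  ->  1 ns steps,  ~4.3 s   until wraparound
 * shift = 2  ->  4 ns steps,  ~17 s    until wraparound
 * shift = 10 -> ~1 us steps,  ~73 min  until wraparound
 */
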
> >
> >
> > About the timestamp, we can manage to find 64 bits in the first
> > cache line, without sacrificing any field we have today.
> 
> We can, I suppose, but again: what for, and who will fill it?
> If the PMD, then where will it get this information?
> If it is from rte_rdtsc() or clock(), then why can't the upper layer
> do it itself?
> Konstantin
> 
> > The question is more for the fields we may want to add later.
> >
> > To answer the question of the size of the timestamp, the first
> > question is to know what precision is required by the applications
> > using it.
> >
> > I don't quite like the idea of splitting the timestamp across the 2
> > cache lines; I think it would not be easy to use.
> >
> >
> > Olivier

