[dpdk-dev,1/3] mbuf: embedding timestamp into the packet

Message ID 1476369308-17021-2-git-send-email-olegk@mellanox.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers

Commit Message

Oleg Kuporosov Oct. 13, 2016, 2:35 p.m. UTC
  The hard requirement of financial services industry is accurate
timestamping aligned with the packet itself. This patch is to satisfy
this requirement:

- include uint64_t timestamp field into rte_mbuf with minimal impact to
  throughput/latency. Keep it just simple uint64_t in ns (more than 580
  years) would be enough for immediate needs while using full
  struct timespec with twice bigger size would have much stronger
  performance impact as missed cacheline0.

- it is possible as there is 6-bytes gap in 1st cacheline (fast path)
  and moving uint16_t vlan_tci_outer field to 2nd cacheline.

- such move will only impact for pretty rare usable VLAN RX stripping
  mode for outer TCI (it used only for one NIC i40e from the whole set and
  allows to keep minimal performance impact for RX/TX timestamps.

Signed-off-by: Oleg Kuporosov <olegk@mellanox.com>
---
 lib/librte_mbuf/rte_mbuf.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
  

Comments

Olivier Matz Oct. 18, 2016, 3:43 p.m. UTC | #1
On 10/13/2016 04:35 PM, Oleg Kuporosov wrote:
> The hard requirement of financial services industry is accurate
> timestamping aligned with the packet itself. This patch is to satisfy
> this requirement:
> 
> - include uint64_t timestamp field into rte_mbuf with minimal impact to
>   throughput/latency. Keep it just simple uint64_t in ns (more than 580
>   years) would be enough for immediate needs while using full
>   struct timespec with twice bigger size would have much stronger
>   performance impact as missed cacheline0.
> 
> - it is possible as there is 6-bytes gap in 1st cacheline (fast path)
>   and moving uint16_t vlan_tci_outer field to 2nd cacheline.
> 
> - such move will only impact for pretty rare usable VLAN RX stripping
>   mode for outer TCI (it used only for one NIC i40e from the whole set and
>   allows to keep minimal performance impact for RX/TX timestamps.

This argument is difficult to accept. One can say you are adding
a field for a pretty rare case used by only one NIC :)

Honestly, I'm not able to judge whether timestamp is more important than
vlan_tci_outer. As room is tight in the first cache line, your patch
submission is the occasion to raise the question: how to decide what
should be in the first part of the mbuf? There are also some other
candidates for moving: m->seqn is only used in librte_reorder and it
is not set in the RX part of a driver.

About the timestamp, it would be valuable to have other opinions,
not only about the placement of the field in the structure, but also
to check that this API is also usable for other NICs.

Have you measured the impact of having the timestamp in the second part
of the mbuf?

Changing the mbuf structure should happen as rarely as possible, and we
have to make sure we take the correct decisions. I think we will
discuss this at dpdk userland 2016.


Apart from that, I wonder if an ol_flag should be added to tell that
the timestamp field is valid in the mbuf.

Regards,
Olivier
  
Pattan, Reshma Oct. 19, 2016, 10:14 a.m. UTC | #2
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
> Sent: Tuesday, October 18, 2016 4:44 PM
> To: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 1/3] mbuf: embedding timestamp into the
> packet
> 
> 
> 
> On 10/13/2016 04:35 PM, Oleg Kuporosov wrote:
> > The hard requirement of financial services industry is accurate
> > timestamping aligned with the packet itself. This patch is to satisfy
> > this requirement:
> >
> > - include uint64_t timestamp field into rte_mbuf with minimal impact to
> >   throughput/latency. Keep it just simple uint64_t in ns (more than 580
> >   years) would be enough for immediate needs while using full
> >   struct timespec with twice bigger size would have much stronger
> >   performance impact as missed cacheline0.
> >
> > - it is possible as there is 6-bytes gap in 1st cacheline (fast path)
> >   and moving uint16_t vlan_tci_outer field to 2nd cacheline.
> >
> > - such move will only impact for pretty rare usable VLAN RX stripping
> >   mode for outer TCI (it used only for one NIC i40e from the whole set and
> >   allows to keep minimal performance impact for RX/TX timestamps.
> 
> This argument is difficult to accept. One can say you are adding a field for a
> pretty rare case used by only one NIC :)
> 
> Honestly, I'm not able to judge whether timestamp is more important than
> vlan_tci_outer. As room is tight in the first cache line, your patch submission is
> the occasion to raise the question: how to decide what should be in the first part
> of the mbuf? There are also some other candidates for moving: m->seqn is only
> used in librte_reorder and it is not set in the RX part of a driver.
> 
> About the timestamp, it would be valuable to have other opinions, not only
> about the placement of the field in the structure, but also to check that this API
> is also usable for other NICs.
> 
> Have you measured the impact of having the timestamp in the second part of
> the mbuf?
> 
> Changing the mbuf structure should happen as rarely as possible, and we have to
> make sure we take the correct decisions. I think we will discuss this at dpdk
> userland 2016.
> 
> 

I just read this mail chain, to make every one aware again, I am emphasizing on the point that I am also adding new "time_arraived" field 
to mbuf struct as part of  below 17.02 Road map item. 

" Extended Stats (Latency and Bit Rate Statistics): Enhance the Extended NIC Stats (Xstats) implementation to support the collection and reporting of latency and bit rate measurements. Latency statistics will include min, max and average latency, and jitter. Bit rate statistics will include peak and average bit rate aggregated over a user-defined time period. This will be implemented for IXGBE and I40E."

 Adding new field  " time_arrived" to the 2nd cache line of rte_mbuf struct to mark the packet arrival time for latency measurements. 
Adding this new filed to second cache line will not break the ABI. I implemented a new library to mark the time as part of rte_eth_rx callbacks
and use that time stamp inside rte_eth_tx callback for measuring the latency. Below is the latest v3 RFC patch for reference. 
http://dpdk.org/dev/patchwork/patch/16655/

Comments are welcome.

Thanks,
Reshma
  
Ananyev, Konstantin Oct. 19, 2016, 1:31 p.m. UTC | #3
Hi lads,
> 
> 
> 
> On 10/13/2016 04:35 PM, Oleg Kuporosov wrote:
> > The hard requirement of financial services industry is accurate
> > timestamping aligned with the packet itself. This patch is to satisfy
> > this requirement:
> >
> > - include uint64_t timestamp field into rte_mbuf with minimal impact to
> >   throughput/latency. Keep it just simple uint64_t in ns (more than 580
> >   years) would be enough for immediate needs while using full
> >   struct timespec with twice bigger size would have much stronger
> >   performance impact as missed cacheline0.
> >
> > - it is possible as there is 6-bytes gap in 1st cacheline (fast path)
> >   and moving uint16_t vlan_tci_outer field to 2nd cacheline.
> >
> > - such move will only impact for pretty rare usable VLAN RX stripping
> >   mode for outer TCI (it used only for one NIC i40e from the whole set and
> >   allows to keep minimal performance impact for RX/TX timestamps.
> 
> This argument is difficult to accept. One can say you are adding
> a field for a pretty rare case used by only one NIC :)
> 
> Honestly, I'm not able to judge whether timestamp is more important than
> vlan_tci_outer. As room is tight in the first cache line, your patch
> submission is the occasion to raise the question: how to decide what
> should be in the first part of the mbuf? There are also some other
> candidates for moving: m->seqn is only used in librte_reorder and it
> is not set in the RX part of a driver.
> 
> About the timestamp, it would be valuable to have other opinions,
> not only about the placement of the field in the structure, but also
> to check that this API is also usable for other NICs.
> 
> Have you measured the impact of having the timestamp in the second part
> of the mbuf?

My vote also would be to have timestamp in the second cache line.
About moving seqn to the 2-nd cache line too - that's probably a fair point.

About the rest of the patch: 
Do you really need to put that code into the PMDs itself?
Can't the same result be achieved by using RX callbacks?
Again that approach would work with any PMD and
there would be no need to modify PMD code itself.

Another thing, that I am in doubt why to use system time?
Wouldn't raw HW TSC value (rte_rdtsc()) do here?
Konstantin

> 
> Changing the mbuf structure should happen as rarely as possible, and we
> have to make sure we take the correct decisions. I think we will
> discuss this at dpdk userland 2016.
> 
> 
> Apart from that, I wonder if an ol_flag should be added to tell that
> the timestamp field is valid in the mbuf.
> 
> Regards,
> Olivier
  
Oleg Kuporosov Oct. 19, 2016, 5:08 p.m. UTC | #4
Hello Oliver

Great thanks for your review and con

> > - include uint64_t timestamp field into rte_mbuf with minimal impact to
> >   throughput/latency. Keep it just simple uint64_t in ns (more than 580
> >   years) would be enough for immediate needs while using full
> >   struct timespec with twice bigger size would have much stronger
> >   performance impact as missed cacheline0.
> >
> > - it is possible as there is 6-bytes gap in 1st cacheline (fast path)
> >   and moving uint16_t vlan_tci_outer field to 2nd cacheline.
> >
> > - such move will only impact for pretty rare usable VLAN RX stripping
> >   mode for outer TCI (it used only for one NIC i40e from the whole set and
> >   allows to keep minimal performance impact for RX/TX timestamps.
> 
> This argument is difficult to accept. One can say you are adding a field for a
> pretty rare case used by only one NIC :)

It  may be looks so, but in fact not only for one NIC. Absence of timestamp there 
Required from developers implement its support in out of DPDK data path with
Local DPDK patching which also may lead some penalty in accuracy.
Good example here is open source implementation of tracing library - 
https://github.com/wanduow/libtrace/tree/libtrace4

These folks patched DPDK to have timestamp for every packet (Intel DPDK Patches folder)
And used it in out of band to store and later processing (lib/format_dpdk.c).
That was actually the starting point of my investigations.
Such approach is working for 1GBb case but not for 10-100 cases.

> 
> Honestly, I'm not able to judge whether timestamp is more important than
> vlan_tci_outer. As room is tight in the first cache line, your patch submission
> is the occasion to raise the question: how to decide what should be in the
> first part of the mbuf? There are also some other candidates for moving: m-
> >seqn is only used in librte_reorder and it is not set in the RX part of a driver.

Agree, it is difficult to decide, my thoughts were:
- there is hole (6 bytes) which wasn't marked as reserved for any planned extensions;
-vlan_tci_outer is being used by only one NIC (i40e) so far, 2nd NIC (fm10k) is using it
With comment:
		 * mbuf->vlan_tci_outer is an idle field in fm10k driver,
		 * so it can be selected to store sglort value.
To store some another value under some specific "if".

Also for i40e it is under #ifndef RTE_LIBRTE_I40E_16BYTE_RX_DESC which per doc should be
Enabled for high throughput of small packets. So in default case (disabled) it anyway has
some performance penalty with using 32 bytes descriptor and moving it to 2nd CL would
not hit big additional penalty. Unfortunately I have no such NIC to measure.
Is there any data how often double tagging in being used  in  DPDK applications?

Another my thought was to have at the end of 1st CL enum which may hold
Reserved fields per specific use cases and data widths (uint8, 2xuint4, 4xuint2, 8xbytes).
 
> 
> About the timestamp, it would be valuable to have other opinions, not only
> about the placement of the field in the structure, but also to check that this
> API is also usable for other NICs.

Sure, but I didn't change timesync/timestamping API itself.

> Have you measured the impact of having the timestamp in the second part
> of the mbuf?

Yes, the worst case with prefetching of 2nd CL it is 5.7-5.9 % penalty for combined RX+TX
And expectedly much worse without prefetching.
In the best case it is 0.3..0.5 % for RX only. It can be explained by much harder cache
trashing when TX is "on".
 
> Changing the mbuf structure should happen as rarely as possible, and we
> have to make sure we take the correct decisions. I think we will discuss this at
> dpdk userland 2016.

Oh, yes, please discuss, I would not be able to join. :(

> Apart from that, I wonder if an ol_flag should be added to tell that the
> timestamp field is valid in the mbuf.

Oliver, there is PKT_RX_IEEE1588_TMST flag already.

Best regards,
Oleg.
  
Oleg Kuporosov Oct. 19, 2016, 5:40 p.m. UTC | #5
Hello Reshma

> I just read this mail chain, to make every one aware again, I am emphasizing
> on the point that I am also adding new "time_arraived" field to mbuf struct as
> part of  below 17.02 Road map item.

Thanks for your work for extending statistics support in DPDK.

"time_arrived" points here for mostly RX path, while more common term used "timestamp"
Allows also using it for TX path as well in the future.

Best regards,
Oleg.
  
Oleg Kuporosov Oct. 20, 2016, 8:03 a.m. UTC | #6
Hello Konstantin,

> 
> My vote also would be to have timestamp in the second cache line.
> About moving seqn to the 2-nd cache line too - that's probably a fair point.

It may impact throughput till ~6% for applications required such embedded 
Timestamps.
 
> About the rest of the patch:
> Do you really need to put that code into the PMDs itself?
> Can't the same result be achieved by using RX callbacks?
> Again that approach would work with any PMD and there would be no need
> to modify PMD code itself.

Correct, the approach with using callbacs (rte_eth_timesync_read_[r|t]x_timestamp())
Has also some Cons for this use case:
- FSI needs the most accurate stamping as possible by reasons were described in
Cover letter
- callback will be called from user app and so you have to take into account 
Difference between time when packet was released by NIC and callback call
- such difference is not easy to estimate correctly due to dependency on CPU type,
Its frequency and current load conditions
- even estimated it would be hard without additional performance penalty to align
Packet with timestamp, taking account that call may actually placed from
Different thread or even process.

It looks the least impacting and correct way is to have timestamp in rte_mbuf and fill
It in Rx burst procedure.

> Another thing, that I am in doubt why to use system time?
> Wouldn't raw HW TSC value (rte_rdtsc()) do here?

System time is being used for periodic clock synchronization between wall
clock and NIC to estimate NIC clock deviation. It is in assumption the system itself is
synchronized by PTP from master clock. It is run on context of control thread.

Thanks,
Oleg.
  
Jan Blunck Oct. 20, 2016, 10:57 a.m. UTC | #7
On Thu, Oct 20, 2016 at 4:03 AM, Oleg Kuporosov <olegk@mellanox.com> wrote:
> Hello Konstantin,
>
>>
>> My vote also would be to have timestamp in the second cache line.
>> About moving seqn to the 2-nd cache line too - that's probably a fair point.
>
> It may impact throughput till ~6% for applications required such embedded
> Timestamps.
>
>> About the rest of the patch:
>> Do you really need to put that code into the PMDs itself?
>> Can't the same result be achieved by using RX callbacks?
>> Again that approach would work with any PMD and there would be no need
>> to modify PMD code itself.
>
> Correct, the approach with using callbacs (rte_eth_timesync_read_[r|t]x_timestamp())
> Has also some Cons for this use case:
> - FSI needs the most accurate stamping as possible by reasons were described in
> Cover letter

From my experience this is only true if there is near-zero performance
impact. From my perspective this is only relevant if the used hardware
supports offloading of writing the timestamps. Everything else is a
huge impact if its unconditionally enabled.

The regulatory requirements are already covered by the exchange
protocols which means that timestamps are already present in the
network packet payload (generated by the exchange trading system
and/or the trading application itself). In the end it is the exchange
itself and its members that are regulated. I can see that this might
be interesting for exchange members allowing sponsored naked access
(for non-exchange members) to generate data that they are not
front-running their clients.

I doubt that this non-functional requirement is important enough to
sacrifice the functional requirement of supporting QinQ.

> - callback will be called from user app and so you have to take into account
> Difference between time when packet was released by NIC and callback call

Have you looked at using dedicated preallocated trace buffers that are
filled with timestamps values? This should work fine for getting some
inside into the latency between application readiness and the actual
time the burst happened.

Thanks,
Jan

> - such difference is not easy to estimate correctly due to dependency on CPU type,
> Its frequency and current load conditions
> - even estimated it would be hard without additional performance penalty to align
> Packet with timestamp, taking account that call may actually placed from
> Different thread or even process.
>
> It looks the least impacting and correct way is to have timestamp in rte_mbuf and fill
> It in Rx burst procedure.
>
>> Another thing, that I am in doubt why to use system time?
>> Wouldn't raw HW TSC value (rte_rdtsc()) do here?
>
> System time is being used for periodic clock synchronization between wall
> clock and NIC to estimate NIC clock deviation. It is in assumption the system itself is
> synchronized by PTP from master clock. It is run on context of control thread.
>
> Thanks,
> Oleg.
  
Pattan, Reshma Oct. 25, 2016, 12:39 p.m. UTC | #8
Hi,

> -----Original Message-----
> From: Oleg Kuporosov [mailto:olegk@mellanox.com]
> Sent: Wednesday, October 19, 2016 6:40 PM
> To: Pattan, Reshma <reshma.pattan@intel.com>; Olivier Matz
> <olivier.matz@6wind.com>
> Cc: Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 1/3] mbuf: embedding timestamp into the
> packet
> 
> Hello Reshma
> 
> > I just read this mail chain, to make every one aware again, I am
> > emphasizing on the point that I am also adding new "time_arraived"
> > field to mbuf struct as part of  below 17.02 Road map item.
> 
> Thanks for your work for extending statistics support in DPDK.
> 
> "time_arrived" points here for mostly RX path, while more common term used
> "timestamp"
> Allows also using it for TX path as well in the future.
> 

I will take a note of this for next revision.

Thanks,
Reshma
  
Olivier Matz Jan. 24, 2017, 3:27 p.m. UTC | #9
On Thu, 13 Oct 2016 14:35:06 +0000, Oleg Kuporosov <olegk@mellanox.com>
wrote:
> The hard requirement of financial services industry is accurate
> timestamping aligned with the packet itself. This patch is to satisfy
> this requirement:
> 
> - include uint64_t timestamp field into rte_mbuf with minimal impact
> to throughput/latency. Keep it just simple uint64_t in ns (more than
> 580 years) would be enough for immediate needs while using full
>   struct timespec with twice bigger size would have much stronger
>   performance impact as missed cacheline0.
> 
> - it is possible as there is 6-bytes gap in 1st cacheline (fast path)
>   and moving uint16_t vlan_tci_outer field to 2nd cacheline.
> 
> - such move will only impact for pretty rare usable VLAN RX stripping
>   mode for outer TCI (it used only for one NIC i40e from the whole
> set and allows to keep minimal performance impact for RX/TX
> timestamps.
> 
> Signed-off-by: Oleg Kuporosov <olegk@mellanox.com>
> ---
>  lib/librte_mbuf/rte_mbuf.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 23b7bf8..1e1f2ed 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -851,8 +851,7 @@ struct rte_mbuf {
>  
>  	uint32_t seqn; /**< Sequence number. See also
> rte_reorder_insert() */ 
> -	/** Outer VLAN TCI (CPU order), valid if
> PKT_RX_QINQ_STRIPPED is set. */
> -	uint16_t vlan_tci_outer;
> +	uint64_t timestamp;       /**< Packet's timestamp, usually
> in ns */ 
>  	/* second cache line - fields only used in slow path or on
> TX */ MARKER cacheline1 __rte_cache_min_aligned;
> @@ -885,6 +884,9 @@ struct rte_mbuf {
>  		};
>  	};
>  
> +	/** Outer VLAN TCI (CPU order), valid if
> PKT_RX_QINQ_STRIPPED is set. */
> +	uint16_t vlan_tci_outer;
> +
>  	/** Size of the application private data. In case of an
> indirect
>  	 * mbuf, it stores the direct mbuf private data size. */
>  	uint16_t priv_size;

FYI, I posted a RFC patchset that introduces the timestamp field in the
mbuf for v17.05:
http://dpdk.org/ml/archives/dev/2017-January/056187.html


Regards,
Olivier
  

Patch

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 23b7bf8..1e1f2ed 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -851,8 +851,7 @@  struct rte_mbuf {
 
 	uint32_t seqn; /**< Sequence number. See also rte_reorder_insert() */
 
-	/** Outer VLAN TCI (CPU order), valid if PKT_RX_QINQ_STRIPPED is set. */
-	uint16_t vlan_tci_outer;
+	uint64_t timestamp;       /**< Packet's timestamp, usually in ns */
 
 	/* second cache line - fields only used in slow path or on TX */
 	MARKER cacheline1 __rte_cache_min_aligned;
@@ -885,6 +884,9 @@  struct rte_mbuf {
 		};
 	};
 
+	/** Outer VLAN TCI (CPU order), valid if PKT_RX_QINQ_STRIPPED is set. */
+	uint16_t vlan_tci_outer;
+
 	/** Size of the application private data. In case of an indirect
 	 * mbuf, it stores the direct mbuf private data size. */
 	uint16_t priv_size;