[dpdk-dev] [PATCH 1/2] mbuf: fix performance/cache resource issue with 128-byte cache line targets

Jerin Jacob jerin.jacob at caviumnetworks.com
Wed Dec 9 15:49:37 CET 2015


On Wed, Dec 09, 2015 at 01:44:44PM +0000, Ananyev, Konstantin wrote:

Hi Konstantin,

>
> Hi Jerin,
>
> > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > Sent: Tuesday, December 08, 2015 5:49 PM
> > To: Ananyev, Konstantin
> > Cc: dev at dpdk.org; thomas.monjalon at 6wind.com; Richardson, Bruce; olivier.matz at 6wind.com; Dumitrescu, Cristian
> > Subject: Re: [dpdk-dev] [PATCH 1/2] mbuf: fix performance/cache resource issue with 128-byte cache line targets
> >
> > On Tue, Dec 08, 2015 at 04:07:46PM +0000, Ananyev, Konstantin wrote:
> > > >
> > > > Hi Konstantin,
> > > >
> > > > > Hi Jerin,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob [mailto:jerin.jacob at caviumnetworks.com]
> > > > > > Sent: Sunday, December 06, 2015 3:59 PM
> > > > > > To: dev at dpdk.org
> > > > > > > Cc: thomas.monjalon at 6wind.com; Richardson, Bruce; olivier.matz at 6wind.com; Dumitrescu, Cristian; Ananyev, Konstantin; Jerin Jacob
> > > > > > Subject: [dpdk-dev] [PATCH 1/2] mbuf: fix performance/cache resource issue with 128-byte cache line targets
> > > > > >
> > > > > > No need to split mbuf structure to two cache lines for 128-byte cache line
> > > > > > size targets as it can fit on a single 128-byte cache line.
> > > > > >
> > > > > > Signed-off-by: Jerin Jacob <jerin.jacob at caviumnetworks.com>
> > > > > > ---
> > > > > >  app/test/test_mbuf.c                                          | 4 ++++
> > > > > >  lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h | 4 ++++
> > > > > >  lib/librte_mbuf/rte_mbuf.h                                    | 2 ++
> > > > > >  3 files changed, 10 insertions(+)
> > > > > >
> > > > > > diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
> > > > > > index b32bef6..5e21075 100644
> > > > > > --- a/app/test/test_mbuf.c
> > > > > > +++ b/app/test/test_mbuf.c
> > > > > > @@ -930,7 +930,11 @@ test_failing_mbuf_sanity_check(void)
> > > > > >  static int
> > > > > >  test_mbuf(void)
> > > > > >  {
> > > > > > +#if RTE_CACHE_LINE_SIZE == 64
> > > > > >  	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != RTE_CACHE_LINE_SIZE * 2);
> > > > > > +#elif RTE_CACHE_LINE_SIZE == 128
> > > > > > +	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != RTE_CACHE_LINE_SIZE);
> > > > > > +#endif
> > > > > >
> > > > > >  	/* create pktmbuf pool if it does not exist */
> > > > > >  	if (pktmbuf_pool == NULL) {
> > > > > > > diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > > > > > index bd1cc09..e724af7 100644
> > > > > > --- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > > > > > +++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > > > > > @@ -121,8 +121,12 @@ struct rte_kni_mbuf {
> > > > > >  	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
> > > > > >  	uint16_t data_len;      /**< Amount of data in segment buffer. */
> > > > > >
> > > > > > +#if RTE_CACHE_LINE_SIZE == 64
> > > > > >  	/* fields on second cache line */
> > > > > >  	char pad3[8] __attribute__((__aligned__(RTE_CACHE_LINE_SIZE)));
> > > > > > +#elif RTE_CACHE_LINE_SIZE == 128
> > > > > > +	char pad3[24];
> > > > > > +#endif
> > > > > >  	void *pool;
> > > > > >  	void *next;
> > > > > >  };
> > > > > > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > > > > > index f234ac9..0bf55e0 100644
> > > > > > --- a/lib/librte_mbuf/rte_mbuf.h
> > > > > > +++ b/lib/librte_mbuf/rte_mbuf.h
> > > > > > @@ -813,8 +813,10 @@ struct rte_mbuf {
> > > > > >
> > > > > >  	uint16_t vlan_tci_outer;  /**< Outer VLAN Tag Control Identifier (CPU order) */
> > > > > >
> > > > > > +#if RTE_CACHE_LINE_SIZE == 64
> > > > > >  	/* second cache line - fields only used in slow path or on TX */
> > > > > >  	MARKER cacheline1 __rte_cache_aligned;
> > > > > > +#endif
> > > > >
> > > > > I suppose you'll need to keep same space reserved for first 64B even on systems with 128B cache-line.
> > > > > Otherwise we can endup with different mbuf format for systems with 128B cache-line.
> > > >
> > > > Just to understand, Is there any issue in mbuf format being different
> > > > across the systems. I think, we are not sending the mbuf over the wire
> > > > or sharing with different system etc. right?
> > >
> > > No, we don't have to support that.
> > > At least I am not aware about such cases.
> > >
> > > >
> > > > Yes, I do understand the KNI dependency with mbuf.
> > >
> > > Are you asking about what will be broken (except KNI) if mbuf layout IA and ARM would be different?
> > > Probably nothing right now, except vector RX/TX.
> > > But they are not supported on ARM anyway, and if someone will implement them in future, it
> > > might be completely different from IA one.
> > > It just seems wrong to me to have different mbuf layout for each architecture.
> >
> > It's not architecture specific, it's machine and PMD specific.
> > Typical ARM machines are 64-bytes CL but ThunderX and Power8  have
> > 128-byte CL.
>
> Ok, didn't know that.
> Thanks for clarification.
>
> >
> > It's PMD specific also, There are some NIC's which can write application
> > interested fields before the packet in DDR, typically one CL size is dedicated
> > for that.
> > So there is an overhead to emulate the standard mbuf layout(Which the
> > application shouldn't care about) i.e
> > - reserve the space for generic mbuf layout
> > - reserve the space for HW mbuf write
> > - on packet receive, copy the content from HW mbuf space to generic
> >   buf layout(space and additional cache misses for each packet, bad :-( )
>
> Each different NIC model has different format of HW descriptors
> That's what each PMD have to do: read information from HW specific layout,
> interpret it and fill mbuf.
> I suppose that's the price all of us have to pay.
> Otherwise it would mean that DPDK app would be able to work only with one
> NIC model and if you'll have to rebuild your app each time the underlying HW
> Is going to change.

Yes, I understand your view.
But accommodating embedded NW hardware, or an enterprise version of
NW hardware that is derived from an embedded design, is going to be
VERY expensive this way.
I know of certain HW that can write the complete mbuf info, with chaining
also done in HW (i.e. libmbuf/libmempool kind of libraries are not required
at all in the fast path).

So I am just thinking: to support all PMD devices at runtime instead
of at compile time (the issue you have correctly pointed out), why can't
we "attach" the operations of libmbuf/libmempool (get the flags, put and
get the buffers, etc.) to a PMD as function pointers?

If a given platform/PMD doesn't have HW support, it can use the
software-based libmbuf/libmempool and attach that to the PMD driver; if it
does have HW support, the specific PMD can override those function pointers.

The application then just uses the function pointers associated with the
PMD, so multiple PMDs are supported at runtime without losing the HW
capability for pool and mbuf management.

I think that will improve DPDK's ability to deal with the wide
variety of NW HW (enterprise, embedded or hybrid).
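
To make the idea concrete, here is a rough sketch of what I mean. It is
not existing DPDK API: every name below (demo_buf, demo_buf_ops,
demo_sw_pool, demo_pmd, demo_pmd_attach) is made up for illustration, and
the software pool is just a tiny LIFO free list standing in for
libmbuf/libmempool. A PMD with HW buffer management would pass its own ops
table, any other PMD falls back to the software one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer descriptor; stands in for the mbuf. */
struct demo_buf {
	uint32_t flags;
	uint16_t data_len;
};

/* Hypothetical per-PMD table of buffer operations. */
struct demo_buf_ops {
	struct demo_buf *(*get)(void *pool);
	void (*put)(void *pool, struct demo_buf *b);
	uint32_t (*get_flags)(const struct demo_buf *b);
};

/* Software fallback: a fixed-size LIFO free list. */
#define DEMO_POOL_SIZE 4

struct demo_sw_pool {
	struct demo_buf storage[DEMO_POOL_SIZE];
	struct demo_buf *free_list[DEMO_POOL_SIZE];
	unsigned int n_free;
};

static void demo_sw_pool_init(struct demo_sw_pool *p)
{
	unsigned int i;

	for (i = 0; i < DEMO_POOL_SIZE; i++) {
		p->storage[i].flags = 0;
		p->storage[i].data_len = 0;
		p->free_list[i] = &p->storage[i];
	}
	p->n_free = DEMO_POOL_SIZE;
}

static struct demo_buf *demo_sw_get(void *pool)
{
	struct demo_sw_pool *p = pool;

	if (p->n_free == 0)
		return NULL; /* pool exhausted */
	return p->free_list[--p->n_free];
}

static void demo_sw_put(void *pool, struct demo_buf *b)
{
	struct demo_sw_pool *p = pool;

	assert(p->n_free < DEMO_POOL_SIZE); /* catch double free */
	p->free_list[p->n_free++] = b;
}

static uint32_t demo_sw_get_flags(const struct demo_buf *b)
{
	return b->flags;
}

static const struct demo_buf_ops demo_sw_ops = {
	.get = demo_sw_get,
	.put = demo_sw_put,
	.get_flags = demo_sw_get_flags,
};

/* Hypothetical PMD handle carrying its buffer ops and pool. */
struct demo_pmd {
	const struct demo_buf_ops *ops;
	void *pool;
};

/* A PMD with HW pool support passes hw_ops; NULL selects the SW ops. */
static void demo_pmd_attach(struct demo_pmd *pmd,
			    const struct demo_buf_ops *hw_ops, void *pool)
{
	pmd->ops = hw_ops != NULL ? hw_ops : &demo_sw_ops;
	pmd->pool = pool;
}
```

The application would only ever call pmd->ops->get(pmd->pool) and friends,
so the same binary could run on top of either kind of PMD. The obvious
cost is an indirect call per operation in the fast path, which would need
to be measured.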

>
> >
> > So, It's critical to abstract the mbuf to support such HW capable NICs.
> > The application should be interested in the fields of mbuf, not the
> > actual layout.Maybe we can take up this with external mem pool manager.
> >
> > >
> > > >
> > > > > Another thing - now we have __rte_cache_aligned all over the places, and I don't know is to double
> > > > > sizes of all these structures is a good idea.
> > > >
> > > > I thought so, the only concern I have, what if, the struct split to 64
> > > > and one cache line is shared between two core/two different structs which have
> > > > the different type of operation(most likely). One extensive write and other one
> > > > read, The write makes the line dirty start evicting and read core is
> > > > going to suffer. Any thoughts?
> > > >
> > > > If its tradeoff between amount memory and performance, I think, it makes sense
> > > > to stick the performance in data plane, Hence split option may be not useful?
> > > > right?
> > >
> > > I understand that for most cases you would like to have your data structures CL aligned -
> > > to avoid performance penalties.
> > > I just suggest to have RTE_CACHE_MIN_LINE_SIZE(==64) for few cases when it might be plausible.
> > > As an example:
> > > struct rte_mbuf {
> > > 	...
> > > 	MARKER cacheline1 __rte_cache_min_aligned;
> > > 	...
> > > } _rte_cache_aligned;
> >
> > I agree(in last email also). I will send next revision based on this,
>
> Ok.
>
> > But kni muf definition, bitmap change we need to have some #define,
> > so I have proposed some scheme in the last email(See below)[1]. Any thoughts?
>
> Probably I am missing something, but with RTE_CACHE_MIN_LINE_SIZE,
> and keeping mbuf layout intact, why do we need:
> #if RTE_CACHE_LINE_SIZE == 64\
> for_64;
> #elif RTE_CACHE_LINE_SIZE == 128\
> for_128;\
> #endif
> for rte_mbuf.h and friends at all?

Not required. I was trying a macro for the common code, but I think we can
live without it. That's neat too. Will address in the next version.

>
> Inside kni_common.h, we can change:
> - char pad3[8] __attribute__((__aligned__(RTE_CACHE_LINE_SIZE)));
> + char pad3[8] __attribute__((__aligned__(RTE_CACHE_MIN_LINE_SIZE)));
> To keep it in sync with rte_mbuf.h
>
> Inside test_mbuf.c:
> - RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != RTE_CACHE_LINE_SIZE * 2);
> + RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != RTE_CACHE_MIN_LINE_SIZE * 2);
>
> For rte_bitmap.h, and similar stuff - if we'll have
> CONFIG_RTE_CACHE_LINE_SIZE_LOG2 defined in the config file,
> and will make  RTE_CACHE_LINE_SIZE derived from it,
> then it would fix such problems?

Makes sense. Will address in the next version.
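
Just to check my understanding of the combined proposal, a small
standalone sketch follows. The macro names are the ones from this thread,
but the definitions, the default LOG2 value of 7 (a 128-byte target) and
the cut-down struct demo_mbuf are my assumptions for illustration, not the
real rte_mbuf or the final patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: would normally come from the build config file.
 * 6 means 64-byte cache lines, 7 means 128-byte cache lines. */
#ifndef CONFIG_RTE_CACHE_LINE_SIZE_LOG2
#define CONFIG_RTE_CACHE_LINE_SIZE_LOG2 7
#endif

/* Target cache line size, derived from the LOG2 value. */
#define RTE_CACHE_LINE_SIZE     (1 << CONFIG_RTE_CACHE_LINE_SIZE_LOG2)
/* Smallest cache line size across supported targets. */
#define RTE_CACHE_MIN_LINE_SIZE 64

#define __rte_cache_aligned \
	__attribute__((__aligned__(RTE_CACHE_LINE_SIZE)))
#define __rte_cache_min_aligned \
	__attribute__((__aligned__(RTE_CACHE_MIN_LINE_SIZE)))

/* Cut-down stand-in for struct rte_mbuf (illustration only). */
struct demo_mbuf {
	/* first 64 bytes: fast path fields */
	void *buf_addr;
	uint32_t pkt_len;
	uint16_t data_len;

	/* second 64 bytes: slow path / TX fields, always at offset 64 */
	void *pool __rte_cache_min_aligned;
	void *next;
} __rte_cache_aligned;

/* Same size and layout on 64-byte and 128-byte targets. */
static_assert(sizeof(struct demo_mbuf) == RTE_CACHE_MIN_LINE_SIZE * 2,
	      "mbuf must span exactly two minimum-size cache lines");
static_assert(offsetof(struct demo_mbuf, pool) == RTE_CACHE_MIN_LINE_SIZE,
	      "second half must start at the minimum cache line size");
```

With this, only the alignment of the whole structure follows the target
cache line size, and the #if RTE_CACHE_LINE_SIZE == 64 blocks are no
longer needed in rte_mbuf.h, kni_common.h or test_mbuf.c.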

>
> Konstantin
>
>
> >
> > >
> > > So we would have mbuf with the same size and layout, but different alignment for IA and ARM.
> > >
> > > Another example, where RTE_CACHE_MIN_LINE_SIZE  could be used:
> > > struct rte_eth_(rxq|txq)_info.
> > > There is no real need to have them 128B aligned for ARM.
> > > The main purpose why they were defined as '__rte_cache_aligned' -
> > > just to reserve some space for future expansion.
> >
> > makes sense
> >
> > >
> > > Konstantin
> > >
> > > >
> > > >
> > > > > Again,  #if RTE_CACHE_LINE_SIZE == 64 ... all over the places looks a bit clumsy.
> > > > > Wonder can we have __rte_cache_aligned/ RTE_CACHE_LINE_SIZE architecture specific,
> > > >
> > > > I think, its architecture specific now
> > > >
> > > > > but introduce RTE_CACHE_MIN_LINE_SIZE(==64)/ __rte_cache_min_aligned and used it for mbuf
> > > > > (and might be other places).
> > > >
> > > > Yes, it will help in this specific mbuf case and it make sense
> > > > if mbuf going to stay within 2 x 64 CL.
> > > >
> > > > AND/OR
> > > >
> > > > can we introduce something like below to reduce the clutter in
> > > > other places, macro name is just not correct, trying to share the view
> > > >
> > > > #define rte_cacheline_diff(for_64, for_128)\
> > > > do {\
> > > > #if RTE_CACHE_LINE_SIZE == 64\
> > > > for_64;
> > > > #elif RTE_CACHE_LINE_SIZE == 128\
> > > > for_128;\
> > > > #endif
> > > >
> > > > OR
> > > >
> > > > Typedef struct rte_mbuf to new abstract type and define for 64 bytes and
> > > > 128 byte
> >
> > [1] some proposals list.
> >
> > > >
> > > > Jerin
> > > >
> > > > > Konstantin
> > > > >
> > > > >

