[dpdk-dev] rte_memcpy - fence and stream
Morten Brørup
mb at smartsharesystems.com
Tue Jun 22 23:55:55 CEST 2021
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Morten Brørup
> Sent: Thursday, 27 May 2021 20.15
>
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Thursday, 27 May 2021 19.22
> >
> > On Thu, May 27, 2021 at 10:39:59PM +0530, Manish Sharma wrote:
> > > For the case I have, hardly 2% of the data buffers which are being
> > > copied get looked at - mostly it's for DMA.
Which data buffers are you not looking at, Manish? The original data buffers, or the copies, or both?
> > > Having a version of DPDK
> > > memcopy that does non temporal copies would definitely be good.
> > > If in my case, I have a lot of CPUs doing the copy in parallel, would
> > > I/OAT driver copy accelerator still help?
> > >
> > It will depend upon the size of the copies being done. For bigger packets
> > the accelerator can help free up CPU cycles for other things.
> >
> > However, if only 2% of the data which is being copied gets looked at, why
> > does it need to be copied? Can the original buffers not be used in that
> > case?
>
> I can only speak for myself here...
>
> Our firmware has a packet capture feature with a filter.
>
> If a packet matches the capture filter, a metadata header and the
> relevant part of the packet contents ("snap length" in tcpdump
> terminology) are appended to a large memory area (the "capture buffer")
> using rte_pktmbuf_read/rte_memcpy. This capture buffer is only read
> through the GUI or management API by the network administrator, i.e. it
> will only be read minutes or hours later, so there is no need to put
> any of it in any CPU cache.
>
> It does not make sense to clone and hold on to many thousands of mbufs
> when we only need some of their contents. So we copy the contents
> instead of increasing the mbuf refcount.
>
> We currently only use our packet capture feature for R&D purposes, so
> we have not optimized it yet. However, we will need to optimize it for
> production use at some point. So I find this discussion initiated by
> Manish very interesting.
>
> -Morten
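To make the capture path described above concrete, here is a minimal standalone sketch of appending one record (metadata header plus at most "snap length" bytes) to a capture buffer. Everything in it is an assumption for illustration: struct capture_hdr, struct capture_buf and capture_append() are hypothetical names, the header layout is invented, and plain memcpy() stands in for rte_pktmbuf_read/rte_memcpy so that it builds without DPDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-record metadata header; the real layout is not given. */
struct capture_hdr {
	uint64_t timestamp; /* caller-supplied timestamp */
	uint32_t pkt_len;   /* original packet length */
	uint32_t cap_len;   /* bytes actually stored (<= snap length) */
};

/* Hypothetical linear capture buffer with a simple write offset. */
struct capture_buf {
	uint8_t *base;
	size_t size;
	size_t used;
};

/*
 * Append one record: the header followed by at most snap_len bytes of the
 * packet. Returns the number of packet bytes stored, or -1 if the record
 * does not fit in the remaining space.
 */
static int
capture_append(struct capture_buf *cb, const void *pkt, uint32_t pkt_len,
		uint32_t snap_len, uint64_t timestamp)
{
	struct capture_hdr hdr;
	uint32_t cap_len = pkt_len < snap_len ? pkt_len : snap_len;
	size_t need = sizeof(hdr) + cap_len;

	if (need > cb->size - cb->used)
		return -1;

	hdr.timestamp = timestamp;
	hdr.pkt_len = pkt_len;
	hdr.cap_len = cap_len;

	/* Plain memcpy() stands in for rte_pktmbuf_read()/rte_memcpy(). */
	memcpy(cb->base + cb->used, &hdr, sizeof(hdr));
	memcpy(cb->base + cb->used + sizeof(hdr), pkt, cap_len);
	cb->used += need;
	return (int)cap_len;
}
```

Since the buffer is only read much later, the two copies into cb->base are the stores that could use a non-temporal variant.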
Here's some code for inspiration. I haven't tested it yet, and it can be optimized further.
/**
 * Copy 16 bytes from one location to another, using non-temporal storage
 * at the destination.
 * The locations must not overlap.
 *
 * @param dst
 *   Pointer to the destination of the data.
 *   Must be aligned on a 16-byte boundary.
 * @param src
 *   Pointer to the source data.
 *   Does not need to be aligned on any particular boundary.
 */
static __rte_always_inline void
rte_mov16_aligned16_non_temporal(uint8_t *dst, const uint8_t *src)
{
	__m128i xmm0;

	xmm0 = _mm_loadu_si128((const __m128i *)src);
	_mm_stream_si128((__m128i *)dst, xmm0);
}
/**
 * Copy bytes from one location to another, using non-temporal storage
 * at the destination.
 * The locations must not overlap.
 *
 * @param dst
 *   Pointer to the destination of the data.
 *   Must be aligned on a 16-byte boundary.
 * @param src
 *   Pointer to the source data.
 *   Does not need to be aligned on any particular boundary.
 * @param n
 *   Number of bytes to copy.
 *   Must be divisible by 4.
 * @return
 *   Pointer to the destination data.
 */
static __rte_always_inline void *
rte_memcpy_aligned16_non_temporal(void *dst, const void *src, size_t n)
{
	void * const ret = dst;

	RTE_ASSERT(!((uintptr_t)dst & 0xF));
	RTE_ASSERT(!(n & 3));

	/* Copy 16-byte blocks: unaligned load, streaming store. */
	while (n >= 16) {
		rte_mov16_aligned16_non_temporal(dst, src);
		src = (const uint8_t *)src + 16;
		dst = (uint8_t *)dst + 16;
		n -= 16;
	}
	/*
	 * Copy the remaining 8 and/or 4 bytes. src may be unaligned, so load
	 * through memcpy() (<string.h>) instead of dereferencing a cast pointer.
	 */
	if (n & 8) {
		int64_t a;

		memcpy(&a, src, sizeof(a));
		_mm_stream_si64((long long int *)dst, a);
		src = (const uint8_t *)src + 8;
		dst = (uint8_t *)dst + 8;
		n -= 8;
	}
	if (n & 4) {
		int32_t a;

		memcpy(&a, src, sizeof(a));
		_mm_stream_si32((int32_t *)dst, a);
		src = (const uint8_t *)src + 4;
		dst = (uint8_t *)dst + 4;
		n -= 4;
	}
	return ret;
}
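
On the "fence" part of the subject: non-temporal stores are weakly ordered, so something has to order them before another core or a device is told that the data is ready. Below is a minimal standalone sketch of that pattern, not part of the proposal above. stream_copy_and_fence() is a hypothetical helper, it only handles sizes that are a multiple of 16, and it falls back to plain memcpy() where SSE2 is not available so that it builds anywhere. In DPDK code the fence would more likely be written as rte_wmb(), which I believe maps to an sfence on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/*
 * Hypothetical helper: copy n bytes (n must be a multiple of 16) to a
 * 16-byte aligned destination with streaming stores, then fence so the
 * stores are ordered before any store that follows the call.
 */
static void
stream_copy_and_fence(void *dst, const void *src, size_t n)
{
	assert(((uintptr_t)dst & 0xF) == 0);
	assert((n & 0xF) == 0);
#if defined(__SSE2__)
	uint8_t *d = dst;
	const uint8_t *s = src;

	for (; n >= 16; n -= 16, s += 16, d += 16) {
		/* The source may be unaligned; the destination is aligned. */
		__m128i x = _mm_loadu_si128((const __m128i *)s);

		_mm_stream_si128((__m128i *)d, x);
	}
	/* Order the streaming stores before later stores, e.g. a "ready" flag. */
	_mm_sfence();
#else
	/* Assumption: no SSE2, so use an ordinary cached copy instead. */
	memcpy(dst, src, n);
#endif
}
```

For a capture buffer that is filled packet by packet, one fence after a batch of copies (before publishing the new write offset) should be enough; fencing after every small copy would cost more than it gains.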