[dpdk-dev] rte_memcpy - fence and stream

Bruce Richardson bruce.richardson at intel.com
Tue May 25 11:20:24 CEST 2021


On Mon, May 24, 2021 at 11:43:24PM +0530, Manish Sharma wrote:
> I am looking at the source for rte_memcpy (this is a discussion only for
> x86-64)
> 
> For one of the cases, when aligned correctly, it uses
> 
> /**
>  * Copy 64 bytes from one location to another,
>  * locations should not overlap.
>  */
> static __rte_always_inline void
> rte_mov64(uint8_t *dst, const uint8_t *src)
> {
>         __m512i zmm0;
> 
>         zmm0 = _mm512_loadu_si512((const void *)src);
>         _mm512_storeu_si512((void *)dst, zmm0);
> }
> 
> I had some questions about this:
> 
> 1. What I dont see is any use of x86 fence(rmb,wmb) instructions. Is that
> not required in this case and if not, why isnt it needed?
> 
> 2. Are the  mm512_loadu_si512 and  _mm512_storeu_si512 non temporal?
> 
They are not non-temporal, so don't need fences.

> 3. Why isn't the code using  stream variants, _mm512_stream_load_si512 and
> friends? It would not pollute the cache, so should be better - unless the
> required fence instructions cause a drop in performance?
> 
Whether the stream variants perform better really depends on what you are
trying to measure. However, in the vast majority of cases a core is making
a copy of data to work on it, in which case having the data not in cache is
going to cause massive stalls on the next access to that data as it's
fetched from DRAM. Therefore, the best option is to use regular
instructions meaning that the data is in local cache afterwards, giving far
better performance when the data is actually used.

> 4. Do the _mm512_stream_load_si512 need fence instructions? Based on my
> reading of the spec, the answer is yes - but wanted to confirm.
>
I believe a fence would be necessary for safety, yes. 


More information about the dev mailing list