[dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics

Ananyev, Konstantin konstantin.ananyev at intel.com
Fri Nov 13 11:35:06 CET 2015



> -----Original Message-----
> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of harish.patil at qlogic.com
> Sent: Sunday, November 08, 2015 7:40 PM
> To: dev at dpdk.org
> Subject: [dpdk-dev] [PATCH] l3fwd: Fix l3fwd crash due to unaligned load/store intrinsics
> 
> From: Harish Patil <harish.patil at qlogic.com>
> 
> l3fwd app expects PMDs to return packets whose L2 header is
> 16-byte aligned due to usage of _mm_load_si128()/_mm_store_si128()
> intrinsics in the app. However, most protocol stacks expect
> packets whose IP/L3 header is aligned on a 16-byte boundary.
> 
> Based on the recommendations received on dpdk-dev, we are changing
> the l3fwd app to use _mm_loadu_si128()/_mm_storeu_si128() so that the
> address need not be 16-byte aligned, thereby preventing the crash.
> We have tested that there is no performance impact due to this
> change.
> 
> Signed-off-by: Harish Patil <harish.patil at qlogic.com>
> ---

Acked-by: Konstantin Ananyev <konstantin.ananyev at intel.com>
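
For context, a small sketch of the alignment problem the patch addresses.
It is illustrative only: ETH_HDR_LEN, l2_misalignment(), is_aligned16() and
load_eth16() are hypothetical names, not l3fwd code. If a PMD places the IP
header on a 16-byte boundary, the 14-byte Ethernet header in front of it
starts 2 bytes past a 16-byte boundary, so an aligned load from it may fault
while an unaligned load is fine.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128 */

#define ETH_HDR_LEN 14u /* Ethernet header: 6 + 6 + 2 bytes */

/* Offset (mod 16) of the L2 header when the L3 header is 16-byte aligned. */
static inline unsigned l2_misalignment(void)
{
    return (16u - ETH_HDR_LEN % 16u) % 16u;
}

/* Non-zero if p sits on a 16-byte boundary, as _mm_load_si128() requires. */
static inline int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Load the first 16 bytes at l2_hdr; valid for any alignment. */
static inline __m128i load_eth16(const void *l2_hdr)
{
    return _mm_loadu_si128((const __m128i *)l2_hdr);
}
```

With an IP-aligned packet, is_aligned16() on the Ethernet header returns 0,
which is exactly the case where the aligned intrinsics are unsafe.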

As a side note:
In fact, with a gcc build I do see a slight regression: ~1%
for the 4 ports over 1 core test case.
Though I think the problem is not in the patch itself.
For some reason unknown to me, gcc treats aligned and unaligned load/store
intrinsics differently (at least in this particular case).
With aligned load/store in use, it generates code that is pretty close to the source:
4 loads first, then 4 BLENDs, then 4 stores (with some interleaved scalar instructions, of course).
But with the unaligned ones gcc starts to mix loads and blends for the same register, so now it is:
load x0; blend x0; load x1; blend x1; ...
As if the source code was:

te[0] = _mm_loadu_si128(p[0]);
te[0] = _mm_blend_epi16(te[0], ve[0], MASK_ETH);
te[1] = _mm_loadu_si128(p[1]);
te[1] = _mm_blend_epi16(te[1], ve[1], MASK_ETH);
...

So the load latency is not hidden any more.
I tried it with different versions of gcc - same story for all of them.
Clang doesn't have this issue and generates similar code for both
aligned and unaligned intrinsics.

The only way to fix it that I can think of is to put rte_compiler_barrier() just before the first blend intrinsic.
That helped: now there are no noticeable differences in the generated code or in the results before and after the patch.
So I suppose I'll have to submit a patch after yours to fix that problem.
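
A minimal sketch of the barrier placement described above, not the actual
l3fwd code. The helper name rewrite_eth4(), the local compiler_barrier()
macro (a stand-in for rte_compiler_barrier(), assumed here to be an empty
asm statement with a memory clobber) and the MASK_ETH value of 0x3f are
assumptions made for illustration. The target attribute is gcc/clang
specific and is only there so the sketch builds without -msse4.1.

```c
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_blend_epi16 */

/* Assumed mask: take the low six 16-bit lanes (12 bytes, dst + src MAC)
 * from the second operand and keep the rest from the first. */
#define MASK_ETH 0x3f

/* Stand-in for rte_compiler_barrier(): emits no instruction, but stops
 * the compiler from moving memory accesses across this point. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Overwrite the MAC addresses of 4 packets with the values in ve[].
 * pkt[i] points at the L2 header and may have any alignment. */
__attribute__((target("sse4.1")))
static void rewrite_eth4(uint8_t *pkt[4], const __m128i ve[4])
{
    __m128i te[4];

    /* Issue all four unaligned loads first... */
    te[0] = _mm_loadu_si128((const __m128i *)pkt[0]);
    te[1] = _mm_loadu_si128((const __m128i *)pkt[1]);
    te[2] = _mm_loadu_si128((const __m128i *)pkt[2]);
    te[3] = _mm_loadu_si128((const __m128i *)pkt[3]);

    /* ...and ask the compiler not to interleave them with the blends. */
    compiler_barrier();

    te[0] = _mm_blend_epi16(te[0], ve[0], MASK_ETH);
    te[1] = _mm_blend_epi16(te[1], ve[1], MASK_ETH);
    te[2] = _mm_blend_epi16(te[2], ve[2], MASK_ETH);
    te[3] = _mm_blend_epi16(te[3], ve[3], MASK_ETH);

    _mm_storeu_si128((__m128i *)pkt[0], te[0]);
    _mm_storeu_si128((__m128i *)pkt[1], te[1]);
    _mm_storeu_si128((__m128i *)pkt[2], te[2]);
    _mm_storeu_si128((__m128i *)pkt[3], te[3]);
}
```

The barrier does not change the result, only the order in which the
compiler is allowed to schedule the loads relative to the blends; whether it
helps with a given gcc version is best checked in the generated assembly.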
Konstantin


