[PATCH v2 2/2] net: have checksum routines accept unaligned data

Ferruh Yigit ferruh.yigit at xilinx.com
Fri Jul 8 16:44:14 CEST 2022


On 7/8/2022 1:56 PM, Mattias Rönnblom wrote:
> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
> data through a uint16_t pointer, which allowed the compiler to assume
> the data was 16-bit aligned. This in turn would, with certain
> architectures and compiler flag combinations, result in code with SIMD
> load or store instructions that require aligned data.
> 
> This patch keeps the old algorithm, but data is read using memcpy()
> instead of direct pointer access, forcing the compiler to always
> generate code that handles unaligned input. The __may_alias__ GCC
> attribute is no longer needed.
> 
> The data on which the Internet checksum functions operate is almost
> always 16-bit aligned, but there are exceptions. In particular, the
> PDCP protocol header may (literally) have an odd size.
> 
> Performance impact seems to range from none to a very slight
> regression.
> 
> Bugzilla ID: 1035
> Cc: stable at dpdk.org
> 
> ---
> 
> v2:
>    * Simplified the odd-length conditional (Morten Brørup).
> 
> Reviewed-by: Morten Brørup <mb at smartsharesystems.com>
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom at ericsson.com>
> ---
>   lib/net/rte_ip.h | 17 ++++++++++-------
>   1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> index b502481670..a0334d931e 100644
> --- a/lib/net/rte_ip.h
> +++ b/lib/net/rte_ip.h
> @@ -160,18 +160,21 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr)
>   static inline uint32_t
>   __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>   {
> -	/* extend strict-aliasing rules */
> -	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> -	const u16_p *u16_buf = (const u16_p *)buf;
> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> +	const void *end;
>   
> -	for (; u16_buf != end; ++u16_buf)
> -		sum += *u16_buf;
> +	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * sizeof(uint16_t));
> +	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> +		uint16_t v;
> +
> +		memcpy(&v, buf, sizeof(uint16_t));
> +		sum += v;
> +	}
>   
>   	/* if length is odd, keeping it byte order independent */
>   	if (unlikely(len % 2)) {
>   		uint16_t left = 0;
> -		*(unsigned char *)&left = *(const unsigned char *)end;
> +
> +		memcpy(&left, end, 1);
>   		sum += left;
>   	}
>   

Hi Mattias,

I got the following results [1] with the patches applied, on [2].
Can you shed some light on a few questions I have?
1) For 1500 bytes, why does 'Unaligned' access give better performance
than 'Aligned' access?
2) Why do the 21/101-byte cases take almost double the cycles of the
20/100-byte cases?
3) Why is performance for 1501 bytes better than for 1500 bytes?


Btw, I don't see any noticeable performance difference with and
without the patch.
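
For reference, here is the patched loop restated as a standalone sketch
(hypothetical name, no DPDK dependencies), mirroring the hunk quoted
above. Reading through memcpy() keeps each 16-bit access well-defined
for any alignment, so the compiler can no longer assume aligned input:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* hypothetical standalone equivalent of the patched __rte_raw_cksum() */
static uint32_t
raw_cksum_sketch(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	const unsigned char *end = p + (len - len % 2);

	for (; p != end; p += sizeof(uint16_t)) {
		uint16_t v;

		/* unaligned-safe 16-bit load; compilers inline this */
		memcpy(&v, p, sizeof(v));
		sum += v;
	}

	/* if length is odd, keep the result byte-order independent */
	if (len % 2) {
		uint16_t left = 0;

		memcpy(&left, end, 1);
		sum += left;
	}

	return sum;
}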

[1]
RTE>>cksum_perf_autotest
### rte_raw_cksum() performance ###
Alignment  Block size    TSC cycles/block  TSC cycles/byte
Aligned           20                25.1             1.25
Unaligned         20                25.1             1.25
Aligned           21                51.5             2.45
Unaligned         21                51.5             2.45
Aligned          100                28.2             0.28
Unaligned        100                28.2             0.28
Aligned          101                54.5             0.54
Unaligned        101                54.5             0.54
Aligned         1500               188.9             0.13
Unaligned       1500               138.7             0.09
Aligned         1501               114.1             0.08
Unaligned       1501               110.1             0.07
Test OK
RTE>>


[2]
AMD EPYC 7543P
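
As a side note, a quick sketch of how a test program could exercise the
unaligned path; the odd start offset stands in for an odd-sized header
such as PDCP's, and the example assumes the usual DPDK headers:

#include <stdint.h>
#include <rte_common.h>
#include <rte_ip.h>

/* align the backing store so that &data[1] is guaranteed misaligned */
static uint8_t data[1501] __rte_aligned(2);

uint16_t
unaligned_cksum_example(void)
{
	/* with the patch applied this is well-defined on all
	 * architectures, regardless of the buffer's alignment */
	return rte_raw_cksum(&data[1], 1500);
}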

