[PATCH] net: fix checksum with unaligned buffer

Emil Berg emil.berg at ericsson.com
Mon Jun 20 12:37:41 CEST 2022


> -----Original Message-----
> From: Morten Brørup <mb at smartsharesystems.com>
> Sent: den 17 juni 2022 11:07
> To: Emil Berg <emil.berg at ericsson.com>
> Cc: stable at dpdk.org; bugzilla at dpdk.org; hofors at lysator.liu.se;
> olivier.matz at 6wind.com; dev at dpdk.org
> Subject: RE: [PATCH] net: fix checksum with unaligned buffer
> 
> > From: Morten Brørup [mailto:mb at smartsharesystems.com]
> > Sent: Friday, 17 June 2022 10.45
> >
> > With this patch, the checksum can be calculated on an unligned part of
> > a packet buffer.
> > I.e. the buf parameter is no longer required to be 16 bit aligned.
> >
> > The DPDK invariant that packet buffers must be 16 bit aligned remains
> > unchanged.
> > This invariant also defines how to calculate the 16 bit checksum on an
> > unaligned part of a packet buffer.
> >
> > Bugzilla ID: 1035
> > Cc: stable at dpdk.org
> >
> > Signed-off-by: Morten Brørup <mb at smartsharesystems.com>
> > ---
> >  lib/net/rte_ip.h | 17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > b502481670..8e301d9c26 100644
> > --- a/lib/net/rte_ip.h
> > +++ b/lib/net/rte_ip.h
> > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t len,
> > uint32_t sum)  {
> >  	/* extend strict-aliasing rules */
> >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > -	const u16_p *u16_buf = (const u16_p *)buf;
> > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > +	const u16_p *u16_buf;
> > +	const u16_p *end;
> > +
> > +	/* if buffer is unaligned, keeping it byte order independent */
> > +	if (unlikely((uintptr_t)buf & 1)) {
> > +		uint16_t first = 0;
> > +		if (unlikely(len == 0))
> > +			return 0;
> > +		((unsigned char *)&first)[1] = *(const unsigned
> char *)buf;
> > +		sum += first;
> > +		buf = (const void *)((uintptr_t)buf + 1);
> > +		len--;
> > +	}
> >
> > +	u16_buf = (const u16_p *)buf;
> > +	end = u16_buf + len / sizeof(*u16_buf);
> >  	for (; u16_buf != end; ++u16_buf)
> >  		sum += *u16_buf;
> >
> > --
> > 2.17.1
> 
> @Emil, can you please test this patch with an unaligned buffer on your
> application to confirm that it produces the expected result.

Hi!

I tested the patch. It doesn't seem to produce the same results. I think the problem is that it always starts summing from an even address, the sum should always start from the first byte according to the checksum specification. Can I instead propose something Mattias Rönnblom sent me?
--------------------------------------------------------------------------------------------------------------
const void *end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) * sizeof(uint16_t));

for (; buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
    uint16_t v;
    memcpy(&v, buf, sizeof(uint16_t));
    sum += v;
}

/* if length is odd, keeping it byte order independent */
if (unlikely(len % 2)) {
    uint16_t left = 0;
    *(unsigned char *)&left = *(const unsigned char *)end;
    sum += left;
}
--------------------------------------------------------------------------------------------------------------
Note that the last block is the same as before. Amazingly I see no measurable performance hit from this compared to the previous one (-O3, march=native). Looking at the previous the loop body may compile to (x86):
--------------------------------------------------------------------------------------------------------------
vmovdqa (%rdx),%xmm1
vpmovzxwd %xmm1,%xmm0
vpsrldq $0x8,%xmm1,%xmm1
vpmovzxwd %xmm1,%xmm1
vpaddd %xmm1,%xmm0,%xmm0
cmp    $0xf,%rax
jbe    0x7ff7a0dfb1a9
--------------------------------------------------------------------------------------------------------------
while Mattias' memcpy solution:
--------------------------------------------------------------------------------------------------------------
vmovdqu (%rcx),%ymm0
add    $0x20,%rcx
vpmovzxwd %xmm0,%ymm1
vextracti128 $0x1,%ymm0,%xmm0
vpmovzxwd %xmm0,%ymm0
vpaddd %ymm0,%ymm1,%ymm0
vpaddd %ymm0,%ymm2,%ymm2
cmp    %r9,%rcx
jne    0x555555556380
--------------------------------------------------------------------------------------------------------------
Thus two extra instructions in the loop, but I suspect it may be memory bound, leading to no measurable performance difference. 

Any comments?

/Emil


More information about the stable mailing list