[PATCH] net: fix checksum with unaligned buffer

Emil Berg emil.berg at ericsson.com
Tue Jun 21 09:16:44 CEST 2022



> -----Original Message-----
> From: Morten Brørup <mb at smartsharesystems.com>
> Sent: den 20 juni 2022 12:58
> To: Emil Berg <emil.berg at ericsson.com>
> Cc: stable at dpdk.org; bugzilla at dpdk.org; hofors at lysator.liu.se;
> olivier.matz at 6wind.com; dev at dpdk.org
> Subject: RE: [PATCH] net: fix checksum with unaligned buffer
> 
> > From: Emil Berg [mailto:emil.berg at ericsson.com]
> > Sent: Monday, 20 June 2022 12.38
> >
> > > From: Morten Brørup <mb at smartsharesystems.com>
> > > Sent: den 17 juni 2022 11:07
> > >
> > > > From: Morten Brørup [mailto:mb at smartsharesystems.com]
> > > > Sent: Friday, 17 June 2022 10.45
> > > >
> > > > With this patch, the checksum can be calculated on an unaligned
> > > > part of a packet buffer.
> > > > I.e. the buf parameter is no longer required to be 16 bit aligned.
> > > >
> > > > The DPDK invariant that packet buffers must be 16 bit aligned
> > > > remains unchanged.
> > > > This invariant also defines how to calculate the 16 bit checksum
> > > > on an unaligned part of a packet buffer.
> > > >
> > > > Bugzilla ID: 1035
> > > > Cc: stable at dpdk.org
> > > >
> > > > Signed-off-by: Morten Brørup <mb at smartsharesystems.com>
> > > > ---
> > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> > > > index b502481670..8e301d9c26 100644
> > > > --- a/lib/net/rte_ip.h
> > > > +++ b/lib/net/rte_ip.h
> > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> > > >  {
> > > >  	/* extend strict-aliasing rules */
> > > >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > > +	const u16_p *u16_buf;
> > > > +	const u16_p *end;
> > > > +
> > > > +	/* if buffer is unaligned, keeping it byte order independent */
> > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > +		uint16_t first = 0;
> > > > +		if (unlikely(len == 0))
> > > > +			return 0;
> > > > +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > > > +		sum += first;
> > > > +		buf = (const void *)((uintptr_t)buf + 1);
> > > > +		len--;
> > > > +	}
> > > >
> > > > +	u16_buf = (const u16_p *)buf;
> > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > >  	for (; u16_buf != end; ++u16_buf)
> > > >  		sum += *u16_buf;
> > > >
> > > > --
> > > > 2.17.1
> > >
> > > @Emil, can you please test this patch with an unaligned buffer on
> > > your application to confirm that it produces the expected result.
> >
> > Hi!
> >
> > I tested the patch. It doesn't seem to produce the same results. I
> > think the problem is that it always starts summing from an even
> > address, whereas the sum should always start from the first byte
> > according to the checksum specification. Can I instead propose
> > something Mattias Rönnblom sent me?
> 
> I assume that it produces the same result when the "buf" parameter is
> aligned?
> 
> And when the "buf" parameter is unaligned, I don't expect it to produce the
> same results as the simple algorithm!
> 
> This was the whole point of the patch: I expect the overall packet buffer to
> be 16 bit aligned, and the checksum to be a partial checksum of such a 16 bit
> aligned packet buffer. When calling this function, I assume that the "buf" and
> "len" parameters point to a part of such a packet buffer. If these
> expectations are correct, the simple algorithm will produce incorrect results
> when "buf" is unaligned.
> 
> I was asking you to test whether the checksum on the packet is correct when
> your application modifies an unaligned part of the packet and uses this
> function to update the checksum.
> 

Now I understand your use case. Your use case seems to be about partial checksums, some of which may start at an unaligned address within an otherwise aligned packet.

Our use case is about calculating the full checksum of a nested packet. That nested packet may start at an unaligned address.
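
For concreteness, here is a minimal sketch of that usage (the helper name, the single-segment assumption, and the offset handling are mine, purely for illustration, not code from our application):

----------------------------------------------------------------------
#include <rte_ip.h>
#include <rte_mbuf.h>

/* Hypothetical helper: full checksum of a nested packet that starts at
 * byte offset "inner_off" within a single-segment mbuf. "inner_off" may
 * be odd, so the pointer handed to rte_raw_cksum may be unaligned. */
static uint16_t
nested_pkt_cksum(const struct rte_mbuf *m, size_t inner_off)
{
    const void *inner = rte_pktmbuf_mtod_offset(m, const void *, inner_off);
    size_t inner_len = rte_pktmbuf_data_len(m) - inner_off;

    /* The first two bytes of "inner" must be treated as the first 16-bit
     * word, exactly as if the nested packet were a standalone, aligned
     * buffer. */
    return rte_raw_cksum(inner, inner_len);
}
----------------------------------------------------------------------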

The difference is basically whether we sum over aligned addresses or not, handling the leading and trailing bytes appropriately.
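
To make the difference concrete, here is a small standalone illustration (the bytes are made up and the pairing is spelled out by hand; neither implementation looks like this). For a 5-byte region that starts at an odd address, the two methods pair the bytes into 16-bit words differently, so the sums differ:

----------------------------------------------------------------------
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    /* 5 bytes that start at an odd address inside an otherwise
     * 16-bit aligned packet buffer */
    const uint8_t b[5] = {0x01, 0x02, 0x03, 0x04, 0x05};
    uint32_t full = 0;     /* our pairing:   (b0,b1) (b2,b3) (b4,0)  */
    uint32_t partial = 0;  /* patch pairing: (0,b0) (b1,b2) (b3,b4)  */
    uint8_t tmp[2];
    uint16_t w;

    /* full checksum of the nested packet: words start at b0 */
    memcpy(&w, &b[0], 2); full += w;
    memcpy(&w, &b[2], 2); full += w;
    tmp[0] = b[4]; tmp[1] = 0;
    memcpy(&w, tmp, 2); full += w;

    /* partial checksum relative to the enclosing packet: the odd leading
     * byte is padded with a zero in front, so the following words stay
     * aligned with the enclosing buffer */
    tmp[0] = 0; tmp[1] = b[0];
    memcpy(&w, tmp, 2); partial += w;
    memcpy(&w, &b[1], 2); partial += w;
    memcpy(&w, &b[3], 2); partial += w;

    printf("full=0x%04x partial=0x%04x\n", (unsigned)full, (unsigned)partial);
    return 0;
}
----------------------------------------------------------------------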

Your method does not work for our case, since we want to treat the first two bytes as the first word. But I do understand that both methods are useful.

Note that your method breaks the API. Previously (assuming no crash due to low optimization levels, more forgiving hardware, or a different compiler or compiler version), the existing method calculated the checksum treating the first two bytes as the first word.
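
To spell out what I mean, here is a rough model (mine, not the DPDK code) of what the pre-patch loop effectively computed for an unaligned "buf" whenever the compiler happened to emit plain, alignment-tolerant 16-bit loads:

----------------------------------------------------------------------
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Rough model only: bytes 0 and 1 of "buf" form the first 16-bit word,
 * whatever the alignment of "buf". */
static uint32_t
old_effective_cksum(const void *buf, size_t len, uint32_t sum)
{
    const unsigned char *p = buf;
    size_t i;
    uint16_t w;

    for (i = 0; i + 1 < len; i += 2) {
        memcpy(&w, p + i, sizeof(w));  /* word = bytes (i, i + 1) */
        sum += w;
    }

    /* trailing odd byte, padded with zero, kept byte order independent */
    if (len & 1) {
        uint16_t last = 0;
        *(unsigned char *)&last = p[len - 1];
        sum += last;
    }

    return sum;
}
----------------------------------------------------------------------

That is also the behaviour the memcpy-based loop quoted below keeps, just made explicit and portable.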

> 
> > ----------------------------------------------------------------------
> > const void *end = RTE_PTR_ADD(buf,
> >     (len / sizeof(uint16_t)) * sizeof(uint16_t));
> >
> > for (; buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> >     uint16_t v;
> >     memcpy(&v, buf, sizeof(uint16_t));
> >     sum += v;
> > }
> >
> > /* if length is odd, keeping it byte order independent */
> > if (unlikely(len % 2)) {
> >     uint16_t left = 0;
> >     *(unsigned char *)&left = *(const unsigned char *)end;
> >     sum += left;
> > }
> > ----------------------------------------------------------------------
> > Note that the last block is the same as before. Amazingly, I see no
> > measurable performance hit from this compared to the previous one
> > (-O3, -march=native). The previous version's loop body may compile to
> > the following (x86):
> > ----------------------------------------------------------------------
> > vmovdqa (%rdx),%xmm1
> > vpmovzxwd %xmm1,%xmm0
> > vpsrldq $0x8,%xmm1,%xmm1
> > vpmovzxwd %xmm1,%xmm1
> > vpaddd %xmm1,%xmm0,%xmm0
> > cmp    $0xf,%rax
> > jbe    0x7ff7a0dfb1a9
> > ----------------------------------------------------------------------
> > while Mattias' memcpy solution:
> > ----------------------------------------------------------------------
> > vmovdqu (%rcx),%ymm0
> > add    $0x20,%rcx
> > vpmovzxwd %xmm0,%ymm1
> > vextracti128 $0x1,%ymm0,%xmm0
> > vpmovzxwd %xmm0,%ymm0
> > vpaddd %ymm0,%ymm1,%ymm0
> > vpaddd %ymm0,%ymm2,%ymm2
> > cmp    %r9,%rcx
> > jne    0x555555556380
> > ----------------------------------------------------------------------
> > Thus there are two extra instructions in the loop, but I suspect it is
> > memory bound, leading to no measurable performance difference.
> >
> > Any comments?
> >
> > /Emil

