[dpdk-dev,5/5] net/virtio: fix Tso when mbuf is shared

Message ID 20170109184625.7884290a@platinum (mailing list archive)
State Not Applicable, archived
Headers

Checks

Context Check Description
ci/Intel compilation success Compilation OK

Commit Message

Olivier Matz Jan. 9, 2017, 5:46 p.m. UTC
  Hi Yuanhan,

On Wed, 14 Dec 2016 15:27:50 +0800, Yuanhan Liu
<yuanhan.liu@linux.intel.com> wrote:
> Firstly sorry for late response!

No problem, I fully understand ;)

> On Thu, Nov 24, 2016 at 09:56:38AM +0100, Olivier Matz wrote:
> > With virtio, doing tso requires to modify the network
> > packet data:  
> 
> I thought more about it this time, and I'm wondering why it's needed.
> 
> > - the dpdk API requires to set the l4 checksum to an
> >   Intel-Nic-like pseudo header checksum that does
> >   not include the ip length  
> 
> If the packet is for a NIC pmd driver in the end, then the NIC driver
> (or application) would handle the checksum correctly.  You could check
> the tx_prep patchset for example.
> 
> > - the virtio peer expects that the l4 checksum is
> >   a standard pseudo header checksum.  
> 
> For this case, the checksum is then not needed: we could assume the
> data between virtio to virtio transmission on the same host is always
> valid, that checksum validation is unnecessary.
> 
> So, in either case, it doesn't seem to me we have to generate the
> checksum here. Or am I miss something?

The virtio specifications requires that the L4 checksum is set to the
pseudo header checksum. You can search for "pseudo header" in the
following doc:
http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1.0-cs04.pdf

Especially in 5.1.6.2.1, we can see that if we use the csum flag, we
must set the checksum to phdr, and if we do tso, we must set the csum
flag.

We can check that this is really needed with Linux vhost by replaying
the test plan described at [1].

[1] http://dpdk.org/ml/archives/dev/2016-October/048793.html

If we add the following patch to disable the checksum fix (on top of
this patchset), the test1 "large packets (lro/tso)" won't work.



In one direction ("flow1" in the test desc), large packets are
transmitted from host on the ixgbe interface, and received by the
guest. Then, testpmd bridges the packet to the virtio interface. But
the packet is not received by the host.

> 
> OTOH, even if it does, I still see some issues (see below).
> 
> >  		/* TCP Segmentation Offload */
> >  		if (cookie->ol_flags & PKT_TX_TCP_SEG) {
> > -			virtio_tso_fix_cksum(cookie);
> > +			offset = virtio_tso_fix_cksum(cookie,
> > +				RTE_PTR_ADD(hdr,
> > start_dp[hdr_idx].len),
> > +				VIRTIO_MAX_HDR_SZ);
> > +			if (offset > 0) {
> > +				RTE_ASSERT(can_push != 0);  
> 
> I think it's (can_push == 0) ?
> 

Yes, indeed. I'll fix that in next version.

> > +				start_dp[hdr_idx].len += offset;  
> 
> Actually, there is an assumption if you do this, that the backend
> driver must have to support ANY_LAYOUT. Otherwise, it won't work: the
> driver would expect the header and packet data is totally separated
> into two desc buffers.
> 
> Though the assumption is most likely true in nowadays, I don't think
> it's a guarantee.

Right.

There are at least 2 options for this one:

- try to use 2 different descriptors (the patch is probably harder,
  and it may slow-down the case where ANY_LAYOUT is supported)

- refuse to initialize with TSO enabled if ANY_LAYOUT is not supported.

If you think ANY_LAYOUT is most likely true today, we could choose
option 2. Let me know what's your preference here.

Thank you for the review.

Olivier
  

Comments

Yuanhan Liu Jan. 16, 2017, 6:48 a.m. UTC | #1
On Mon, Jan 09, 2017 at 06:46:25PM +0100, Olivier Matz wrote:
> The virtio specifications requires that the L4 checksum is set to the
> pseudo header checksum. You can search for "pseudo header" in the
> following doc:
> http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1.0-cs04.pdf
> 
> Especially in 5.1.6.2.1, we can see that if we use the csum flag, we
> must set the checksum to phdr, and if we do tso, we must set the csum
> flag.
> 
> We can check that this is really needed with Linux vhost by replaying
> the test plan described at [1].
> 
> [1] http://dpdk.org/ml/archives/dev/2016-October/048793.html
> 
> If we add the following patch to disable the checksum fix (on top of
> this patchset), the test1 "large packets (lro/tso)" won't work.
> 
> --- a/drivers/net/virtio/virtio_rxtx.c
> +++ b/drivers/net/virtio/virtio_rxtx.c
> @@ -224,6 +224,9 @@
>         uint32_t tmp;
>         int shared = 0;
>  
> +        if (1)
> +               return 0;
> +
>         /* mbuf is write-only, we need to copy the headers in a linear
> buffer */ if (unlikely(rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
>                 shared = 1;
> 
> 
> In one direction ("flow1" in the test desc), large packets are
> transmitted from host on the ixgbe interface, and received by the
> guest. Then, testpmd bridges the packet to the virtio interface. But
> the packet is not received by the host.

I hope I could have time to dig this further, since, honestly, I don't
quite like this patch: it makes things un-maintainable.

Besides that, I think we have similar issue with nic drivers. See the
rte_net_intel_cksum_flags_prepare() function introduced at commit
4fb7e803eb1a ("ethdev: add Tx preparation").

Cc more people here. And here is a quick background for them: NIC drivers
doing TSO need change the mbuf (say, for cksum updating), however, as
Stephen pointed out, we could not do that if the mbuf is shared: I don't
see such checks in the driver code as well.

> There are at least 2 options for this one:
> 
> - try to use 2 different descriptors (the patch is probably harder,
>   and it may slow-down the case where ANY_LAYOUT is supported)
> 
> - refuse to initialize with TSO enabled if ANY_LAYOUT is not supported.
> 
> If you think ANY_LAYOUT is most likely true today, we could choose
> option 2. Let me know what's your preference here.

Maybe we could go with a simpler one: COW. Yeah, it costs more, but this
would be rare, that it should be OKay, right? Besides, we just need copy
the heading mbuf.

	--yliu
  
Olivier Matz Jan. 17, 2017, 11:18 a.m. UTC | #2
Hi Yuanhan,

On Mon, 16 Jan 2017 14:48:19 +0800, Yuanhan Liu
<yuanhan.liu@linux.intel.com> wrote:
> On Mon, Jan 09, 2017 at 06:46:25PM +0100, Olivier Matz wrote:
> > The virtio specifications requires that the L4 checksum is set to
> > the pseudo header checksum. You can search for "pseudo header" in
> > the following doc:
> > http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1.0-cs04.pdf
> > 
> > Especially in 5.1.6.2.1, we can see that if we use the csum flag, we
> > must set the checksum to phdr, and if we do tso, we must set the
> > csum flag.
> > 
> > We can check that this is really needed with Linux vhost by
> > replaying the test plan described at [1].
> > 
> > [1] http://dpdk.org/ml/archives/dev/2016-October/048793.html
> > 
> > If we add the following patch to disable the checksum fix (on top of
> > this patchset), the test1 "large packets (lro/tso)" won't work.
> > 
> > --- a/drivers/net/virtio/virtio_rxtx.c
> > +++ b/drivers/net/virtio/virtio_rxtx.c
> > @@ -224,6 +224,9 @@
> >         uint32_t tmp;
> >         int shared = 0;
> >  
> > +        if (1)
> > +               return 0;
> > +
> >         /* mbuf is write-only, we need to copy the headers in a
> > linear buffer */ if (unlikely(rte_pktmbuf_data_is_shared(m, 0,
> > hdrlen))) { shared = 1;
> > 
> > 
> > In one direction ("flow1" in the test desc), large packets are
> > transmitted from host on the ixgbe interface, and received by the
> > guest. Then, testpmd bridges the packet to the virtio interface. But
> > the packet is not received by the host.  
> 
> I hope I could have time to dig this further, since, honestly, I don't
> quite like this patch: it makes things un-maintainable.

Well, I'm not that proud of the patch, but that's the best solution
I've found. Nevertheless saying it makes things un-maintainable looks a
bit excessive to me :)

The option of reallocating a mbuf, copy and fix network headers in it
looks even more complex to me (that was my first approach).

> Besides that, I think we have similar issue with nic drivers. See the
> rte_net_intel_cksum_flags_prepare() function introduced at commit
> 4fb7e803eb1a ("ethdev: add Tx preparation").

Yes, that was discussed a bit. See [1] and the subsequent mails.
http://dpdk.org/ml/archives/dev/2016-December/051014.html

My opinion is that tx_burst() should not change the mbuf data, it's
always been like this. For Intel NICs, there is no issue since the DPDK
API is derived from Intel NICs API, so there is no fix to do in the
mbuf data.

For tx_prepare(), it's explicitly said that it can update the data.
If tx_prepare() becomes mandatory, it will naturally fix this issue
without modifying the driver, because the phdr csum calculation will be
done in tx_prepare().

An alternative is to mark this as a known issue for now, and wait until
tx_prepare() is mandatory.


> Cc more people here. And here is a quick background for them: NIC
> drivers doing TSO need change the mbuf (say, for cksum updating),
> however, as Stephen pointed out, we could not do that if the mbuf is
> shared: I don't see such checks in the driver code as well.
> 
> > There are at least 2 options for this one:
> > 
> > - try to use 2 different descriptors (the patch is probably harder,
> >   and it may slow-down the case where ANY_LAYOUT is supported)
> > 
> > - refuse to initialize with TSO enabled if ANY_LAYOUT is not
> > supported.
> > 
> > If you think ANY_LAYOUT is most likely true today, we could choose
> > option 2. Let me know what's your preference here.  
> 
> Maybe we could go with a simpler one: COW. Yeah, it costs more, but
> this would be rare, that it should be OKay, right? Besides, we just
> need copy the heading mbuf.

Could you detail what you mean by COW in this context? Do you mean
reallocating a new mbuf? If yes, it's not only a problem of cost:

- There is no mempool pointer associated to tx queues, so we cannot
  allocate a mbuf. Reusing a mempool pointer from the current mbuf looks
  risky, because it can be a special pool, like a pool dedicated for
  clones, without data.

- It makes allocation error quite hard to manage, it would require some
  rework in tx functions.


Thanks for your review and the discussion. Let me know what you think.

Regards,
Olivier
  
Yuanhan Liu Jan. 18, 2017, 5:03 a.m. UTC | #3
On Tue, Jan 17, 2017 at 12:18:25PM +0100, Olivier Matz wrote:
> > I hope I could have time to dig this further, since, honestly, I don't
> > quite like this patch: it makes things un-maintainable.
> 
> Well, I'm not that proud of the patch, but that's the best solution
> I've found. Nevertheless saying it makes things un-maintainable looks a
> bit excessive to me :)

Aha... really sorry about that!

But honestly, I'd say again, it makes thing more complex, just for fixing
a corner and rare issue. I'd try to avoid that.

> 
> The option of reallocating a mbuf, copy and fix network headers in it
> looks even more complex to me (that was my first approach).
> 
> > Besides that, I think we have similar issue with nic drivers. See the
> > rte_net_intel_cksum_flags_prepare() function introduced at commit
> > 4fb7e803eb1a ("ethdev: add Tx preparation").
> 
> Yes, that was discussed a bit. See [1] and the subsequent mails.
> http://dpdk.org/ml/archives/dev/2016-December/051014.html

Thanks for the info, and I'm pretty Okay with that.

> My opinion is that tx_burst() should not change the mbuf data, it's
> always been like this. For Intel NICs, there is no issue since the DPDK
> API is derived from Intel NICs API, so there is no fix to do in the
> mbuf data.
> 
> For tx_prepare(), it's explicitly said that it can update the data.
> If tx_prepare() becomes mandatory, it will naturally fix this issue
> without modifying the driver, because the phdr csum calculation will be
> done in tx_prepare().
> 
> An alternative is to mark this as a known issue for now, and wait until
> tx_prepare() is mandatory.

I see no reason to wait. Though my understanding is it may not be a
mandatory so far, but user is supposed to calculate the pseudo-header
checksum by themself before. Now they have one more option: tx_prepare.

That means, in either way, user has to do some extra works to make
TSO work (either by themself or call tx_prepare). So I don't think
it'd be an issue?

	--yliu
  
Olivier Matz Jan. 24, 2017, 10:51 a.m. UTC | #4
On Wed, 18 Jan 2017 13:03:48 +0800, Yuanhan Liu
<yuanhan.liu@linux.intel.com> wrote:
> On Tue, Jan 17, 2017 at 12:18:25PM +0100, Olivier Matz wrote:
> > > I hope I could have time to dig this further, since, honestly, I
> > > don't quite like this patch: it makes things un-maintainable.  
> > 
> > Well, I'm not that proud of the patch, but that's the best solution
> > I've found. Nevertheless saying it makes things un-maintainable
> > looks a bit excessive to me :)  
> 
> Aha... really sorry about that!
> 
> But honestly, I'd say again, it makes thing more complex, just for
> fixing a corner and rare issue. I'd try to avoid that.
> 
> > 
> > The option of reallocating a mbuf, copy and fix network headers in
> > it looks even more complex to me (that was my first approach).
> >   
> > > Besides that, I think we have similar issue with nic drivers. See
> > > the rte_net_intel_cksum_flags_prepare() function introduced at
> > > commit 4fb7e803eb1a ("ethdev: add Tx preparation").  
> > 
> > Yes, that was discussed a bit. See [1] and the subsequent mails.
> > http://dpdk.org/ml/archives/dev/2016-December/051014.html  
> 
> Thanks for the info, and I'm pretty Okay with that.
> 
> > My opinion is that tx_burst() should not change the mbuf data, it's
> > always been like this. For Intel NICs, there is no issue since the
> > DPDK API is derived from Intel NICs API, so there is no fix to do
> > in the mbuf data.
> > 
> > For tx_prepare(), it's explicitly said that it can update the data.
> > If tx_prepare() becomes mandatory, it will naturally fix this issue
> > without modifying the driver, because the phdr csum calculation
> > will be done in tx_prepare().
> > 
> > An alternative is to mark this as a known issue for now, and wait
> > until tx_prepare() is mandatory.  
> 
> I see no reason to wait. Though my understanding is it may not be a
> mandatory so far, but user is supposed to calculate the pseudo-header
> checksum by themself before. Now they have one more option:
> tx_prepare.
>
> That means, in either way, user has to do some extra works to make
> TSO work (either by themself or call tx_prepare). So I don't think
> it'd be an issue?

Yes sounds good. I'll check how a tx_prepare() could be implemented for
virtio, in order to fix this issue.

In the meanwhile, for the 17.02, I think it could be good to highlight
the problem in the known issues, what do you think?


Thanks
Olivier
  
Yuanhan Liu Jan. 28, 2017, 12:32 p.m. UTC | #5
On Tue, Jan 24, 2017 at 11:51:20AM +0100, Olivier MATZ wrote:
> On Wed, 18 Jan 2017 13:03:48 +0800, Yuanhan Liu
> <yuanhan.liu@linux.intel.com> wrote:
> > On Tue, Jan 17, 2017 at 12:18:25PM +0100, Olivier Matz wrote:
> > > > I hope I could have time to dig this further, since, honestly, I
> > > > don't quite like this patch: it makes things un-maintainable.  
> > > 
> > > Well, I'm not that proud of the patch, but that's the best solution
> > > I've found. Nevertheless saying it makes things un-maintainable
> > > looks a bit excessive to me :)  
> > 
> > Aha... really sorry about that!
> > 
> > But honestly, I'd say again, it makes thing more complex, just for
> > fixing a corner and rare issue. I'd try to avoid that.
> > 
> > > 
> > > The option of reallocating a mbuf, copy and fix network headers in
> > > it looks even more complex to me (that was my first approach).
> > >   
> > > > Besides that, I think we have similar issue with nic drivers. See
> > > > the rte_net_intel_cksum_flags_prepare() function introduced at
> > > > commit 4fb7e803eb1a ("ethdev: add Tx preparation").  
> > > 
> > > Yes, that was discussed a bit. See [1] and the subsequent mails.
> > > http://dpdk.org/ml/archives/dev/2016-December/051014.html  
> > 
> > Thanks for the info, and I'm pretty Okay with that.
> > 
> > > My opinion is that tx_burst() should not change the mbuf data, it's
> > > always been like this. For Intel NICs, there is no issue since the
> > > DPDK API is derived from Intel NICs API, so there is no fix to do
> > > in the mbuf data.
> > > 
> > > For tx_prepare(), it's explicitly said that it can update the data.
> > > If tx_prepare() becomes mandatory, it will naturally fix this issue
> > > without modifying the driver, because the phdr csum calculation
> > > will be done in tx_prepare().
> > > 
> > > An alternative is to mark this as a known issue for now, and wait
> > > until tx_prepare() is mandatory.  
> > 
> > I see no reason to wait. Though my understanding is it may not be a
> > mandatory so far, but user is supposed to calculate the pseudo-header
> > checksum by themself before. Now they have one more option:
> > tx_prepare.
> >
> > That means, in either way, user has to do some extra works to make
> > TSO work (either by themself or call tx_prepare). So I don't think
> > it'd be an issue?
> 
> Yes sounds good. I'll check how a tx_prepare() could be implemented for
> virtio, in order to fix this issue.

Thanks.

> In the meanwhile, for the 17.02, I think it could be good to highlight
> the problem in the known issues, what do you think?

Yes, I think it's a good idea.

	--yliu
  

Patch

--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -224,6 +224,9 @@ 
        uint32_t tmp;
        int shared = 0;
 
+        if (1)
+               return 0;
+
        /* mbuf is write-only, we need to copy the headers in a linear
buffer */ if (unlikely(rte_pktmbuf_data_is_shared(m, 0, hdrlen))) {
                shared = 1;