[dpdk-dev] [PATCH v6 1/6] ethdev: add Tx preparation

Olivier Matz olivier.matz at 6wind.com
Tue Oct 18 16:57:09 CEST 2016


Hi Tomasz,

I think the principle of tx_prep() is good, it may for instance help to
remove the function virtio_tso_fix_cksum() from the virtio, and maybe
even change the mbuf TSO/cksum API.

I have some questions/comments below, I'm sorry it comes very late.

On 10/14/2016 05:05 PM, Tomasz Kulasek wrote:
> Added API for `rte_eth_tx_prep`
> 
> uint16_t rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id,
> 	struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> 
> Added fields to the `struct rte_eth_desc_lim`:
> 
> 	uint16_t nb_seg_max;
> 		/**< Max number of segments per whole packet. */
> 
> 	uint16_t nb_mtu_seg_max;
> 		/**< Max number of segments per one MTU */

Not sure I understand the second one. Is this is case of TSO?

Is it a usual limitation in different network hardware?
Can this info be retrieved/used by the application?

> 
> Created `rte_pkt.h` header with common used functions:
> 
> int rte_validate_tx_offload(struct rte_mbuf *m)
> 	to validate general requirements for tx offload in packet such a
> 	flag completness. In current implementation this function is called
> 	optionaly when RTE_LIBRTE_ETHDEV_DEBUG is enabled.
> 
> int rte_phdr_cksum_fix(struct rte_mbuf *m)
> 	to fix pseudo header checksum for TSO and non-TSO tcp/udp packets
> 	before hardware tx checksum offload.
> 	 - for non-TSO tcp/udp packets full pseudo-header checksum is
> 	   counted and set.
> 	 - for TSO the IP payload length is not included.

Why not in rte_net.h?


> [...]
>  
> @@ -2816,6 +2825,82 @@ rte_eth_tx_burst(uint8_t port_id, uint16_t queue_id,
>  	return (*dev->tx_pkt_burst)(dev->data->tx_queues[queue_id], tx_pkts, nb_pkts);
>  }
>  
> +/**
> + * Process a burst of output packets on a transmit queue of an Ethernet device.
> + *
> + * The rte_eth_tx_prep() function is invoked to prepare output packets to be
> + * transmitted on the output queue *queue_id* of the Ethernet device designated
> + * by its *port_id*.
> + * The *nb_pkts* parameter is the number of packets to be prepared which are
> + * supplied in the *tx_pkts* array of *rte_mbuf* structures, each of them
> + * allocated from a pool created with rte_pktmbuf_pool_create().
> + * For each packet to send, the rte_eth_tx_prep() function performs
> + * the following operations:
> + *
> + * - Check if packet meets devices requirements for tx offloads.

Do you mean hardware requirements?
Can the application be aware of these requirements? I mean capability
flags, or something in dev_infos?

Maybe the comment could be more precise?

> + * - Check limitations about number of segments.
> + *
> + * - Check additional requirements when debug is enabled.

What kind of additional requirements?

> + *
> + * - Update and/or reset required checksums when tx offload is set for packet.
> + *

By reading this, I think it may not be clear for the user about what
should be set in the mbuf. In mbuf API, it is said:

 * TCP segmentation offload. To enable this offload feature for a
 * packet to be transmitted on hardware supporting TSO:
 *  - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
 *    PKT_TX_TCP_CKSUM)
 *  - set the flag PKT_TX_IPV4 or PKT_TX_IPV6
 *  - if it's IPv4, set the PKT_TX_IP_CKSUM flag and write the IP checksum
 *    to 0 in the packet
 *  - fill the mbuf offload information: l2_len, l3_len, l4_len, tso_segsz
 *  - calculate the pseudo header checksum without taking ip_len in account,
 *    and set it in the TCP header. Refer to rte_ipv4_phdr_cksum() and
 *    rte_ipv6_phdr_cksum() that can be used as helpers.


If I understand well, using tx_prep(), the user will have to do the
same except writing the IP checksum to 0, and without setting the
TCP pseudo header checksum, right?


> + * The rte_eth_tx_prep() function returns the number of packets ready to be
> + * sent. A return value equal to *nb_pkts* means that all packets are valid and
> + * ready to be sent.
> + *
> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param queue_id
> + *   The index of the transmit queue through which output packets must be
> + *   sent.
> + *   The value must be in the range [0, nb_tx_queue - 1] previously supplied
> + *   to rte_eth_dev_configure().
> + * @param tx_pkts
> + *   The address of an array of *nb_pkts* pointers to *rte_mbuf* structures
> + *   which contain the output packets.
> + * @param nb_pkts
> + *   The maximum number of packets to process.
> + * @return
> + *   The number of packets correct and ready to be sent. The return value can be
> + *   less than the value of the *tx_pkts* parameter when some packet doesn't
> + *   meet devices requirements with rte_errno set appropriately.
> + */

Can we add the constraint that invalid packets are left untouched?

I think most of the time there will be a software fallback in that
case, so it would be good to ensure that this function does not change
the flags or the packet data.

Another thing that could be interesting for the caller is to know the
reason of the failure. Maybe the different errno types could be detailed
here. For instance:
- EINVAL: offload flags are not correctly set (i.e. would fail whatever
  the hardware)
- ENOTSUP: the offload feature is not supported by the hardware
- ...

> +
> +#ifdef RTE_ETHDEV_TX_PREP
> +
> +static inline uint16_t
> +rte_eth_tx_prep(uint8_t port_id, uint16_t queue_id, struct rte_mbuf **tx_pkts,
> +		uint16_t nb_pkts)
> +{
> +	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
> +
> +	if (!dev->tx_pkt_prep)
> +		return nb_pkts;
> +
> +#ifdef RTE_LIBRTE_ETHDEV_DEBUG
> +	if (queue_id >= dev->data->nb_tx_queues) {
> +		RTE_PMD_DEBUG_TRACE("Invalid TX queue_id=%d\n", queue_id);
> +		rte_errno = -EINVAL;
> +		return 0;
> +	}
> +#endif

Why checking the queue_id but not the port_id?

Maybe the API comment should also be modified to say that the port_id
has to be valid, because most ethdev functions do the check.

> +
> +	return (*dev->tx_pkt_prep)(dev->data->tx_queues[queue_id],
> +			tx_pkts, nb_pkts);
> +}
> +
> +#else
> +
> +static inline uint16_t
> +rte_eth_tx_prep(__rte_unused uint8_t port_id, __rte_unused uint16_t queue_id,
> +		__rte_unused struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> +{
> +	return nb_pkts;
> +}
> +
> +#endif
> +

nit: I wonder if the #else part should be inside the function instead
(with the proper RTE_SET_USED()), it would avoid to define the prototype
twice.

>  typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
>  		void *userdata);
>  
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 109e666..cfd6284 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -276,6 +276,15 @@ extern "C" {
>   */
>  #define PKT_TX_OUTER_IPV4   (1ULL << 59)
>  
> +#define PKT_TX_OFFLOAD_MASK (    \
> +		PKT_TX_IP_CKSUM |        \
> +		PKT_TX_L4_MASK |         \
> +		PKT_TX_OUTER_IP_CKSUM |  \
> +		PKT_TX_TCP_SEG |         \
> +		PKT_TX_QINQ_PKT |        \
> +		PKT_TX_VLAN_PKT |        \
> +		PKT_TX_TUNNEL_MASK)
> +

Could you add an API comment?

> --- /dev/null
> +++ b/lib/librte_net/rte_pkt.h
> 
> [...]
> +/**
> + * Validate general requirements for tx offload in packet.
> + */

The API comment does not have the usual format.

> +static inline int
> +rte_validate_tx_offload(struct rte_mbuf *m)

should be const struct rte_mbuf *m

> +{
> +	uint64_t ol_flags = m->ol_flags;
> +
> +	/* Does packet set any of available offloads? */
> +	if (!(ol_flags & PKT_TX_OFFLOAD_MASK))
> +		return 0;
> +
> +	/* IP checksum can be counted only for IPv4 packet */
> +	if ((ol_flags & PKT_TX_IP_CKSUM) && (ol_flags & PKT_TX_IPV6))
> +		return -EINVAL;
> +
> +	if (ol_flags & (PKT_TX_L4_MASK | PKT_TX_TCP_SEG))
> +		/* IP type not set */
> +		if (!(ol_flags & (PKT_TX_IPV4 | PKT_TX_IPV6)))
> +			return -EINVAL;
> +
> +	if (ol_flags & PKT_TX_TCP_SEG)
> +		/* PKT_TX_IP_CKSUM offload not set for IPv4 TSO packet */
> +		if ((m->tso_segsz == 0) ||
> +				((ol_flags & PKT_TX_IPV4) && !(ol_flags & PKT_TX_IP_CKSUM)))
> +			return -EINVAL;
> +
> +	/* PKT_TX_OUTER_IP_CKSUM set for non outer IPv4 packet. */
> +	if ((ol_flags & PKT_TX_OUTER_IP_CKSUM) && !(ol_flags & PKT_TX_OUTER_IPV4))
> +		return -EINVAL;
> +
> +	return 0;
> +}

It looks this function is only used when RTE_LIBRTE_ETHDEV_DEBUG is
set.

I'd say this function should go in rte_mbuf.h, because it's purely
related to mbuf flags, and does not rely on packet data or network
headers.


> +
> +/**
> + * Fix pseudo header checksum for TSO and non-TSO tcp/udp packets before
> + * hardware tx checksum.
> + * For non-TSO tcp/udp packets full pseudo-header checksum is counted and set.
> + * For TSO the IP payload length is not included.
> + */

The API comment should be fixed.

> +static inline int
> +rte_phdr_cksum_fix(struct rte_mbuf *m)
> +{
> +	struct ipv4_hdr *ipv4_hdr;
> +	struct ipv6_hdr *ipv6_hdr;
> +	struct tcp_hdr *tcp_hdr;
> +	struct udp_hdr *udp_hdr;
> +	uint64_t ol_flags = m->ol_flags;
> +	uint64_t inner_l3_offset = m->l2_len;
> +
> +	if (ol_flags & PKT_TX_OUTER_IP_CKSUM)
> +		inner_l3_offset += m->outer_l2_len + m->outer_l3_len;
> +
> +	if ((ol_flags & PKT_TX_UDP_CKSUM) == PKT_TX_UDP_CKSUM) {
> +		if (ol_flags & PKT_TX_IPV4) {
> +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> +					inner_l3_offset);
> +
> +			if (ol_flags & PKT_TX_IP_CKSUM)
> +				ipv4_hdr->hdr_checksum = 0;
> +
> +			udp_hdr = (struct udp_hdr *)((char *)ipv4_hdr + m->l3_len);
> +			udp_hdr->dgram_cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> +		} else {
> +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> +					inner_l3_offset);
> +			/* non-TSO udp */
> +			udp_hdr = rte_pktmbuf_mtod_offset(m, struct udp_hdr *,
> +					inner_l3_offset + m->l3_len);
> +			udp_hdr->dgram_cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> +		}
> +	} else if ((ol_flags & PKT_TX_TCP_CKSUM) ||
> +			(ol_flags & PKT_TX_TCP_SEG)) {
> +		if (ol_flags & PKT_TX_IPV4) {
> +			ipv4_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *,
> +					inner_l3_offset);
> +
> +			if (ol_flags & PKT_TX_IP_CKSUM)
> +				ipv4_hdr->hdr_checksum = 0;
> +
> +			/* non-TSO tcp or TSO */
> +			tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + m->l3_len);
> +			tcp_hdr->cksum = rte_ipv4_phdr_cksum(ipv4_hdr, ol_flags);
> +		} else {
> +			ipv6_hdr = rte_pktmbuf_mtod_offset(m, struct ipv6_hdr *,
> +					inner_l3_offset);
> +			/* non-TSO tcp or TSO */
> +			tcp_hdr = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
> +					inner_l3_offset + m->l3_len);
> +			tcp_hdr->cksum = rte_ipv6_phdr_cksum(ipv6_hdr, ol_flags);
> +		}
> +	}
> +	return 0;
> +}
> +
> +#endif /* _RTE_PKT_H_ */
> 

The function expects that all the network headers are in the first, and
that each of them is contiguous.

Also, I had an interesting remark from Stephen [1] on a similar code.
If the mbuf is a clone, it will modify the data of the direct mbuf,
which should be read-only. Note that it is likely to happen in
a TCP stack, because the packet is kept locally in case it has to
be retransmitted. Cloning a mbuf is more efficient than duplicating
it.

I plan to fix it in virtio code by "uncloning" the headers.

[1] http://dpdk.org/ml/archives/dev/2016-October/048873.html



Regards,
Olivier


More information about the dev mailing list