[dpdk-dev,v7,2/3] lib/gro: add TCP/IPv4 GRO support

Message ID 1498459430-116048-3-git-send-email-jiayu.hu@intel.com (mailing list archive)
State Superseded, archived
Checks

Context Check Description
ci/Intel-compilation success Compilation OK
ci/checkpatch success coding style OK

Commit Message

Hu, Jiayu June 26, 2017, 6:43 a.m. UTC
  In this patch, we introduce five APIs to support TCP/IPv4 GRO.
- gro_tcp_tbl_create: create a TCP reassembly table, which is used to
    merge packets.
- gro_tcp_tbl_destroy: free the memory of a TCP reassembly table.
- gro_tcp_tbl_flush: flush all packets from a TCP reassembly table.
- gro_tcp_tbl_timeout_flush: flush timeout packets from a TCP
    reassembly table.
- gro_tcp4_reassemble: reassemble an incoming TCP/IPv4 packet.

The TCP/IPv4 GRO API assumes all input packets have correct IPv4
and TCP checksums, and it doesn't update these checksums for merged
packets. If input packets are IP fragmented, the API assumes they are
complete packets (i.e. with L4 headers). A usage sketch of these APIs
follows.
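
A minimal sketch of the intended call sequence (the flow counts,
timeout, burst size and max packet size are illustrative, and
process_unmerged() is a hypothetical helper):

    struct gro_tcp_tbl *tbl;
    struct rte_mbuf *out[64];
    uint16_t i, nb_out;
    int32_t ret;

    tbl = gro_tcp_tbl_create(rte_socket_id(), 16, 8);
    if (tbl == NULL)
        rte_exit(EXIT_FAILURE, "cannot create TCP reassembly table\n");

    for (i = 0; i < nb_rx; i++) {
        /* >0: merged into an existing packet, 0: inserted as a
         * new item, <0: cannot be processed (e.g. no payload) */
        ret = gro_tcp4_reassemble(pkts[i], tbl, 65535);
        if (ret < 0)
            process_unmerged(pkts[i]);    /* hypothetical helper */
    }

    /* hand back packets that stayed in the table too long */
    nb_out = gro_tcp_tbl_timeout_flush(tbl, timeout_cycles, out, 64);

    gro_tcp_tbl_destroy(tbl);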

In TCP GRO, we use a table structure, called the TCP reassembly table,
to reassemble packets. Both TCP/IPv4 and TCP/IPv6 GRO use the same
table structure. A TCP reassembly table includes a key array and an
item array, where the key array keeps the criteria to merge packets and
the item array keeps packet information.

One key in the key array points to an item group, which consists of
packets that have the same criteria value. If two packets can be
merged, they must be in the same item group. Each key in the key array
includes two parts:
- criteria: the criteria for merging packets. If two packets can be
    merged, they must have the same criteria value.
- start_index: the index of the first incoming packet of the item group.

Each element in the item array keeps the information of one packet. It
mainly includes two parts:
- pkt: packet address
- next_pkt_index: the index of the next packet in the same item group.
    All packets in the same item group are chained by next_pkt_index.
    With next_pkt_index, we can locate all packets in the same item
    group one by one.

Processing an incoming packet takes three steps:
a. check if the packet should be processed. Packets with the following
    properties won't be processed:
	- packets without data (e.g. SYN, SYN-ACK)
b. traverse the key array to find a key which has the same criteria
    value as the incoming packet. If one is found, go to step c.
    Otherwise, insert a new key and insert the packet into the item
    array.
c. locate the first packet in the item group via the start_index in
    the key, then traverse all packets in the item group via
    next_pkt_index. If a packet that can merge with the incoming one
    is found, merge them together. Otherwise, insert the incoming
    packet into this item group. A condensed sketch of steps b and c
    is shown below.
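
The sketch below uses the table fields defined by this patch;
prev_idx bookkeeping and error handling are omitted, and
insert_into_item_group() is a hypothetical helper:

    for (i = 0; i < tbl->max_key_num; i++) {
        if (tbl->keys[i].is_valid == 0 ||
                memcmp(&tbl->keys[i].key, &key, sizeof(key)) != 0)
            continue;
        /* step c: walk the item group chained by next_pkt_idx */
        for (cur = tbl->keys[i].start_index;
                cur != INVALID_ARRAY_INDEX;
                cur = tbl->items[cur].next_pkt_idx) {
            if (check_seq_option(tbl->items[cur].pkt,
                        tcp_hdr, tcp_hl) > 0)
                return merge_two_tcp4_packets(tbl->items[cur].pkt,
                        pkt, max_packet_size);
        }
        /* no mergeable packet found: append to this item group */
        return insert_into_item_group(tbl, i, pkt);
    }
    /* step b fallback: no matching key - insert a new key and item */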

Signed-off-by: Jiayu Hu <jiayu.hu@intel.com>
---
 doc/guides/rel_notes/release_17_08.rst |   7 +
 lib/librte_gro/Makefile                |   1 +
 lib/librte_gro/rte_gro.c               | 123 ++++++++--
 lib/librte_gro/rte_gro.h               |   6 +-
 lib/librte_gro/rte_gro_tcp.c           | 394 +++++++++++++++++++++++++++++++++
 lib/librte_gro/rte_gro_tcp.h           | 191 ++++++++++++++++
 6 files changed, 706 insertions(+), 16 deletions(-)
 create mode 100644 lib/librte_gro/rte_gro_tcp.c
 create mode 100644 lib/librte_gro/rte_gro_tcp.h
  

Comments

Ananyev, Konstantin June 28, 2017, 11:56 p.m. UTC | #1
Hi Jiayu,

> 
> In this patch, we introduce five APIs to support TCP/IPv4 GRO.
> - gro_tcp_tbl_create: create a TCP reassembly table, which is used to
>     merge packets.
> - gro_tcp_tbl_destroy: free memory space of a TCP reassembly table.
> - gro_tcp_tbl_flush: flush all packets from a TCP reassembly table.
> - gro_tcp_tbl_timeout_flush: flush timeout packets from a TCP
>     reassembly table.
> - gro_tcp4_reassemble: reassemble an inputted TCP/IPv4 packet.
> 
> TCP/IPv4 GRO API assumes all inputted packets are with correct IPv4
> and TCP checksums. And TCP/IPv4 GRO API doesn't update IPv4 and TCP
> checksums for merged packets. If inputted packets are IP fragmented,
> TCP/IPv4 GRO API assumes they are complete packets (i.e. with L4
> headers).
> 
> In TCP GRO, we use a table structure, called TCP reassembly table, to
> reassemble packets. Both TCP/IPv4 and TCP/IPv6 GRO use the same table
> structure. A TCP reassembly table includes a key array and a item array,
> where the key array keeps the criteria to merge packets and the item
> array keeps packet information.
> 
> One key in the key array points to an item group, which consists of
> packets which have the same criteria value. If two packets are able to
> merge, they must be in the same item group. Each key in the key array
> includes two parts:
> - criteria: the criteria of merging packets. If two packets can be
>     merged, they must have the same criteria value.
> - start_index: the index of the first incoming packet of the item group.
> 
> Each element in the item array keeps the information of one packet. It
> mainly includes two parts:
> - pkt: packet address
> - next_pkt_index: the index of the next packet in the same item group.
>     All packets in the same item group are chained by next_pkt_index.
>     With next_pkt_index, we can locate all packets in the same item
>     group one by one.
> 
> To process an incoming packet needs three steps:
> a. check if the packet should be processed. Packets with the following
>     properties won't be processed:
> 	- packets without data (e.g. SYN, SYN-ACK)
> b. traverse the key array to find a key which has the same criteria
>     value with the incoming packet. If find, goto step c. Otherwise,
>     insert a new key and insert the packet into the item array.
> c. locate the first packet in the item group via the start_index in the
>     key. Then traverse all packets in the item group via next_pkt_index.
>     If find one packet which can merge with the incoming one, merge them
>     together. If can't find, insert the packet into this item group.
> 
> Signed-off-by: Jiayu Hu <jiayu.hu@intel.com>
> ---
>  doc/guides/rel_notes/release_17_08.rst |   7 +
>  lib/librte_gro/Makefile                |   1 +
>  lib/librte_gro/rte_gro.c               | 123 ++++++++--
>  lib/librte_gro/rte_gro.h               |   6 +-
>  lib/librte_gro/rte_gro_tcp.c           | 394 +++++++++++++++++++++++++++++++++
>  lib/librte_gro/rte_gro_tcp.h           | 191 ++++++++++++++++
>  6 files changed, 706 insertions(+), 16 deletions(-)
>  create mode 100644 lib/librte_gro/rte_gro_tcp.c
>  create mode 100644 lib/librte_gro/rte_gro_tcp.h
> 
> diff --git a/doc/guides/rel_notes/release_17_08.rst b/doc/guides/rel_notes/release_17_08.rst
> index 842f46f..f067247 100644
> --- a/doc/guides/rel_notes/release_17_08.rst
> +++ b/doc/guides/rel_notes/release_17_08.rst
> @@ -75,6 +75,13 @@ New Features
> 
>    Added support for firmwares with multiple Ethernet ports per physical port.
> 
> +* **Add Generic Receive Offload API support.**
> +
> +  Generic Receive Offload (GRO) API supports to reassemble TCP/IPv4
> +  packets. GRO API assumes all inputted packets are with correct
> +  checksums. GRO API doesn't update checksums for merged packets. If
> +  inputted packets are IP fragmented, GRO API assumes they are complete
> +  packets (i.e. with L4 headers).
> 
>  Resolved Issues
>  ---------------
> diff --git a/lib/librte_gro/Makefile b/lib/librte_gro/Makefile
> index 7e0f128..e89344d 100644
> --- a/lib/librte_gro/Makefile
> +++ b/lib/librte_gro/Makefile
> @@ -43,6 +43,7 @@ LIBABIVER := 1
> 
>  # source files
>  SRCS-$(CONFIG_RTE_LIBRTE_GRO) += rte_gro.c
> +SRCS-$(CONFIG_RTE_LIBRTE_GRO) += rte_gro_tcp.c
> 
>  # install this header file
>  SYMLINK-$(CONFIG_RTE_LIBRTE_GRO)-include += rte_gro.h
> diff --git a/lib/librte_gro/rte_gro.c b/lib/librte_gro/rte_gro.c
> index 33275e8..5b89928 100644
> --- a/lib/librte_gro/rte_gro.c
> +++ b/lib/librte_gro/rte_gro.c
> @@ -32,11 +32,15 @@
> 
>  #include <rte_malloc.h>
>  #include <rte_mbuf.h>
> +#include <rte_ethdev.h>
> 
>  #include "rte_gro.h"
> +#include "rte_gro_tcp.h"
> 
> -static gro_tbl_create_fn tbl_create_functions[GRO_TYPE_MAX_NUM];
> -static gro_tbl_destroy_fn tbl_destroy_functions[GRO_TYPE_MAX_NUM];
> +static gro_tbl_create_fn tbl_create_functions[GRO_TYPE_MAX_NUM] = {
> +	gro_tcp_tbl_create, NULL};
> +static gro_tbl_destroy_fn tbl_destroy_functions[GRO_TYPE_MAX_NUM] = {
> +	gro_tcp_tbl_destroy, NULL};
> 
>  struct rte_gro_tbl *rte_gro_tbl_create(uint16_t socket_id,
>  		uint16_t max_flow_num,
> @@ -94,32 +98,121 @@ void rte_gro_tbl_destroy(struct rte_gro_tbl *gro_tbl)
>  }
> 
>  uint16_t
> -rte_gro_reassemble_burst(struct rte_mbuf **pkts __rte_unused,
> +rte_gro_reassemble_burst(struct rte_mbuf **pkts,
>  		const uint16_t nb_pkts,
> -		const struct rte_gro_param param __rte_unused)
> +		const struct rte_gro_param param)
>  {
> -	return nb_pkts;
> +	uint16_t i;
> +	uint16_t nb_after_gro = nb_pkts;
> +	uint32_t item_num = RTE_MIN(nb_pkts, param.max_flow_num *
> +			param.max_item_per_flow);
> +
> +	/* allocate a reassembly table for TCP/IPv4 GRO */
> +	uint32_t tcp_item_num = RTE_MIN(item_num, GRO_MAX_BURST_ITEM_NUM);
> +	struct gro_tcp_tbl tcp_tbl;
> +	struct gro_tcp_key tcp_keys[tcp_item_num];
> +	struct gro_tcp_item tcp_items[tcp_item_num];
> +
> +	struct rte_mbuf *unprocess_pkts[nb_pkts];
> +	uint16_t unprocess_num = 0;
> +	int32_t ret;
> +
> +	memset(tcp_keys, 0, sizeof(struct gro_tcp_key) *
> +			tcp_item_num);
> +	memset(tcp_items, 0, sizeof(struct gro_tcp_item) *
> +			tcp_item_num);
> +	tcp_tbl.keys = tcp_keys;
> +	tcp_tbl.items = tcp_items;
> +	tcp_tbl.key_num = 0;
> +	tcp_tbl.item_num = 0;
> +	tcp_tbl.max_key_num = tcp_item_num;
> +	tcp_tbl.max_item_num = tcp_item_num;
> +
> +	for (i = 0; i < nb_pkts; i++) {
> +		if (RTE_ETH_IS_IPV4_HDR(pkts[i]->packet_type)) {

Why not just use && for these two conditions?

> +			if ((pkts[i]->packet_type & RTE_PTYPE_L4_TCP) &&
> +				(param.desired_gro_types &
> +					 GRO_TCP_IPV4)) {

No need to check param.desired_gro_types inside the loop.
You can do that before the loop.
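
E.g. something along these lines, combining both points (an untested
sketch):

    int do_tcp4_gro = (param.desired_gro_types & GRO_TCP_IPV4) != 0;

    for (i = 0; i < nb_pkts; i++) {
        if (do_tcp4_gro &&
                RTE_ETH_IS_IPV4_HDR(pkts[i]->packet_type) &&
                (pkts[i]->packet_type & RTE_PTYPE_L4_TCP)) {
            ret = gro_tcp4_reassemble(pkts[i], &tcp_tbl,
                    param.max_packet_size);
            if (ret > 0)
                nb_after_gro--;
            else if (ret < 0)
                unprocess_pkts[unprocess_num++] = pkts[i];
        } else
            unprocess_pkts[unprocess_num++] = pkts[i];
    }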

> +				ret = gro_tcp4_reassemble(pkts[i],
> +						&tcp_tbl,
> +						param.max_packet_size);
> +				/* merge successfully */
> +				if (ret > 0)
> +					nb_after_gro--;
> +				else if (ret < 0)
> +					unprocess_pkts[unprocess_num++] =
> +						pkts[i];
> +			} else
> +				unprocess_pkts[unprocess_num++] =
> +					pkts[i];
> +		} else
> +			unprocess_pkts[unprocess_num++] =
> +				pkts[i];
> +	}
> +
> +	/* re-arrange GROed packets */
> +	if (nb_after_gro < nb_pkts) {
> +		if (param.desired_gro_types & GRO_TCP_IPV4)
> +			i = gro_tcp_tbl_flush(&tcp_tbl, pkts, nb_pkts);
> +		if (unprocess_num > 0) {
> +			memcpy(&pkts[i], unprocess_pkts,
> +					sizeof(struct rte_mbuf *) *
> +					unprocess_num);
> +			i += unprocess_num;
> +		}
> +		if (nb_pkts > i)
> +			memset(&pkts[i], 0,
> +					sizeof(struct rte_mbuf *) *
> +					(nb_pkts - i));
> +	}

Why do you need to zero remaining pkts[]?

> +	return nb_after_gro;
>  }
> 
> -int rte_gro_reassemble(struct rte_mbuf *pkt __rte_unused,
> -		struct rte_gro_tbl *gro_tbl __rte_unused)
> +int rte_gro_reassemble(struct rte_mbuf *pkt,
> +		struct rte_gro_tbl *gro_tbl)
>  {
> +	if (unlikely(pkt == NULL))
> +		return -1;
> +
> +	if (RTE_ETH_IS_IPV4_HDR(pkt->packet_type)) {
> +		if ((pkt->packet_type & RTE_PTYPE_L4_TCP) &&
> +				(gro_tbl->desired_gro_types &
> +				 GRO_TCP_IPV4))
> +			return gro_tcp4_reassemble(pkt,
> +					gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
> +					gro_tbl->max_packet_size);
> +	}
> +
>  	return -1;
>  }
> 
> -uint16_t rte_gro_flush(struct rte_gro_tbl *gro_tbl __rte_unused,
> -		uint64_t desired_gro_types __rte_unused,
> -		struct rte_mbuf **out __rte_unused,
> -		const uint16_t max_nb_out __rte_unused)
> +uint16_t rte_gro_flush(struct rte_gro_tbl *gro_tbl,
> +		uint64_t desired_gro_types,
> +		struct rte_mbuf **out,
> +		const uint16_t max_nb_out)
>  {
> +	desired_gro_types = desired_gro_types &
> +		gro_tbl->desired_gro_types;
> +	if (desired_gro_types & GRO_TCP_IPV4)
> +		return gro_tcp_tbl_flush(
> +				gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
> +				out,
> +				max_nb_out);
>  	return 0;
>  }
> 
>  uint16_t
> -rte_gro_timeout_flush(struct rte_gro_tbl *gro_tbl __rte_unused,
> -		uint64_t desired_gro_types __rte_unused,
> -		struct rte_mbuf **out __rte_unused,
> -		const uint16_t max_nb_out __rte_unused)
> +rte_gro_timeout_flush(struct rte_gro_tbl *gro_tbl,
> +		uint64_t desired_gro_types,
> +		struct rte_mbuf **out,
> +		const uint16_t max_nb_out)
>  {
> +	desired_gro_types = desired_gro_types &
> +		gro_tbl->desired_gro_types;
> +	if (desired_gro_types & GRO_TCP_IPV4)
> +		return gro_tcp_tbl_timeout_flush(
> +				gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
> +				gro_tbl->max_timeout_cycles,
> +				out, max_nb_out);
>  	return 0;
>  }
> diff --git a/lib/librte_gro/rte_gro.h b/lib/librte_gro/rte_gro.h
> index f9d36e8..a30b1c6 100644
> --- a/lib/librte_gro/rte_gro.h
> +++ b/lib/librte_gro/rte_gro.h
> @@ -45,7 +45,11 @@ extern "C" {
> 
>  /* max number of supported GRO types */
>  #define GRO_TYPE_MAX_NUM 64
> -#define GRO_TYPE_SUPPORT_NUM 0	/**< current supported GRO num */
> +#define GRO_TYPE_SUPPORT_NUM 1	/**< current supported GRO num */
> +
> +/* TCP/IPv4 GRO flag */
> +#define GRO_TCP_IPV4_INDEX 0
> +#define GRO_TCP_IPV4 (1ULL << GRO_TCP_IPV4_INDEX)
> 
>  /**
>   * GRO table, which is used to merge packets. It keeps many reassembly
> diff --git a/lib/librte_gro/rte_gro_tcp.c b/lib/librte_gro/rte_gro_tcp.c
> new file mode 100644
> index 0000000..c0eaa45
> --- /dev/null
> +++ b/lib/librte_gro/rte_gro_tcp.c
> @@ -0,0 +1,394 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2017 Intel Corporation. All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <rte_malloc.h>
> +#include <rte_mbuf.h>
> +#include <rte_cycles.h>
> +
> +#include <rte_ethdev.h>
> +#include <rte_ip.h>
> +#include <rte_tcp.h>
> +
> +#include "rte_gro_tcp.h"
> +
> +void *gro_tcp_tbl_create(uint16_t socket_id,
> +		uint16_t max_flow_num,
> +		uint16_t max_item_per_flow)
> +{
> +	size_t size;
> +	uint32_t entries_num;
> +	struct gro_tcp_tbl *tbl;
> +
> +	entries_num = max_flow_num * max_item_per_flow;
> +	entries_num = entries_num > GRO_TCP_TBL_MAX_ITEM_NUM ?
> +		GRO_TCP_TBL_MAX_ITEM_NUM : entries_num;
> +
> +	if (entries_num == 0)
> +		return NULL;
> +
> +	tbl = (struct gro_tcp_tbl *)rte_zmalloc_socket(
> +			__func__,
> +			sizeof(struct gro_tcp_tbl),
> +			RTE_CACHE_LINE_SIZE,
> +			socket_id);

Here and everywhere - rte_malloc() can fail.
Add proper error handling.
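
For instance, something like (illustrative only):

    tbl = rte_zmalloc_socket(__func__, sizeof(*tbl),
            RTE_CACHE_LINE_SIZE, socket_id);
    if (tbl == NULL)
        return NULL;

    tbl->items = rte_zmalloc_socket(__func__,
            sizeof(struct gro_tcp_item) * entries_num,
            RTE_CACHE_LINE_SIZE, socket_id);
    if (tbl->items == NULL) {
        rte_free(tbl);
        return NULL;
    }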

> +
> +	size = sizeof(struct gro_tcp_item) * entries_num;
> +	tbl->items = (struct gro_tcp_item *)rte_zmalloc_socket(
> +			__func__,
> +			size,
> +			RTE_CACHE_LINE_SIZE,
> +			socket_id);
> +	tbl->max_item_num = entries_num;
> +
> +	size = sizeof(struct gro_tcp_key) * entries_num;
> +	tbl->keys = (struct gro_tcp_key *)rte_zmalloc_socket(
> +			__func__,
> +			size, RTE_CACHE_LINE_SIZE,
> +			socket_id);
> +	tbl->max_key_num = entries_num;
> +	return tbl;
> +}
> +
> +void gro_tcp_tbl_destroy(void *tbl)
> +{
> +	struct gro_tcp_tbl *tcp_tbl = (struct gro_tcp_tbl *)tbl;
> +
> +	if (tcp_tbl) {
> +		if (tcp_tbl->items)

No need to, rte_free(NULL) is a valid construction.
Same below.

> +			rte_free(tcp_tbl->items);
> +		if (tcp_tbl->keys)
> +			rte_free(tcp_tbl->keys);
> +		rte_free(tcp_tbl);
> +	}
> +}
> +
> +/**
> + * merge two TCP/IPv4 packets without update checksums.
> + */
> +static int
> +merge_two_tcp4_packets(struct rte_mbuf *pkt_src,
> +		struct rte_mbuf *pkt,
> +		uint32_t max_packet_size)
> +{
> +	struct ipv4_hdr *ipv4_hdr1, *ipv4_hdr2;
> +	struct tcp_hdr *tcp_hdr1;
> +	uint16_t ipv4_ihl1, tcp_hl1, tcp_dl1;
> +	struct rte_mbuf *tail;
> +
> +	/* parse the given packet */
> +	ipv4_hdr1 = (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt,
> +				struct ether_hdr *) + 1);

You probably shouldn't assume that l2_len is always 14B long.

> +	ipv4_ihl1 = IPv4_HDR_LEN(ipv4_hdr1);
> +	tcp_hdr1 = (struct tcp_hdr *)((char *)ipv4_hdr1 + ipv4_ihl1);
> +	tcp_hl1 = TCP_HDR_LEN(tcp_hdr1);
> +	tcp_dl1 = rte_be_to_cpu_16(ipv4_hdr1->total_length) - ipv4_ihl1
> +		- tcp_hl1;
> +
> +	/* parse the original packet */
> +	ipv4_hdr2 = (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt_src,
> +				struct ether_hdr *) + 1);
> +
> +	if (pkt_src->pkt_len + tcp_dl1 > max_packet_size)
> +		return -1;
> +
> +	/* remove the header of the incoming packet */
> +	rte_pktmbuf_adj(pkt, sizeof(struct ether_hdr) +
> +			ipv4_ihl1 + tcp_hl1);
> +
> +	/* chain the two packet together */
> +	tail = rte_pktmbuf_lastseg(pkt_src);
> +	tail->next = pkt;

What I see as a problem here:
You have to reparse the packet and find its last segment for each new
segment. That seems like a big overhead.
It would be better to parse the packet once and then store that
information inside the mbuf: l2_len/l3_len/l4_len, etc.
You can probably even avoid parsing inside your library - by making it
a prerequisite that the caller fills these fields properly.

Similar thought about lastseg - would be good to store it somewhere inside your table.
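
For illustration, the parsing could then collapse to something like
this (assuming the caller has filled l2_len/l3_len/l4_len):

    struct ipv4_hdr *ipv4_hdr = rte_pktmbuf_mtod_offset(pkt,
            struct ipv4_hdr *, pkt->l2_len);
    struct tcp_hdr *tcp_hdr = rte_pktmbuf_mtod_offset(pkt,
            struct tcp_hdr *, pkt->l2_len + pkt->l3_len);
    uint16_t tcp_dl = rte_be_to_cpu_16(ipv4_hdr->total_length) -
            pkt->l3_len - pkt->l4_len;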

> +
> +	/* update IP header */
> +	ipv4_hdr2->total_length = rte_cpu_to_be_16(
> +			rte_be_to_cpu_16(
> +				ipv4_hdr2->total_length)
> +			+ tcp_dl1);
> +
> +	/* update mbuf metadata for the merged packet */
> +	pkt_src->nb_segs++;

Why do you assume that incoming packet always contains only one segment?

> +	pkt_src->pkt_len += pkt->pkt_len;
> +	return 1;
> +}
> +
> +static int
> +check_seq_option(struct rte_mbuf *pkt,
> +		struct tcp_hdr *tcp_hdr,
> +		uint16_t tcp_hl)
> +{
> +	struct ipv4_hdr *ipv4_hdr1;
> +	struct tcp_hdr *tcp_hdr1;
> +	uint16_t ipv4_ihl1, tcp_hl1, tcp_dl1;
> +	uint32_t sent_seq1, sent_seq;
> +	int ret = -1;
> +
> +	ipv4_hdr1 = (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt,
> +				struct ether_hdr *) + 1);
> +	ipv4_ihl1 = IPv4_HDR_LEN(ipv4_hdr1);
> +	tcp_hdr1 = (struct tcp_hdr *)((char *)ipv4_hdr1 + ipv4_ihl1);
> +	tcp_hl1 = TCP_HDR_LEN(tcp_hdr1);
> +	tcp_dl1 = rte_be_to_cpu_16(ipv4_hdr1->total_length) - ipv4_ihl1
> +		- tcp_hl1;
> +	sent_seq1 = rte_be_to_cpu_32(tcp_hdr1->sent_seq) + tcp_dl1;
> +	sent_seq = rte_be_to_cpu_32(tcp_hdr->sent_seq);
> +
> +	/* check if the two packets are neighbor */
> +	if ((sent_seq ^ sent_seq1) == 0) {

Why not just "if (sent_seq == sent_seq1)"?

> +		/* check if TCP option field equals */
> +		if (tcp_hl1 > sizeof(struct tcp_hdr)) {

And what if tcp_hl1 == sizeof(struct tcp_hdr), but tcp_hl > tcp_hl1?
I think you need to remove that check.
  
> +			if ((tcp_hl1 != tcp_hl) ||
> +					(memcmp(tcp_hdr1 + 1,
> +							tcp_hdr + 1,
> +							tcp_hl - sizeof
> +							(struct tcp_hdr))
> +					 == 0))
> +				ret = 1;
> +		}
> +	}
> +	return ret;
> +}
> +
> +static uint32_t
> +find_an_empty_item(struct gro_tcp_tbl *tbl)
> +{
> +	uint32_t i;
> +
> +	for (i = 0; i < tbl->max_item_num; i++)
> +		if (tbl->items[i].is_valid == 0)
> +			return i;
> +	return INVALID_ARRAY_INDEX;
> +}
> +
> +static uint32_t
> +find_an_empty_key(struct gro_tcp_tbl *tbl)
> +{
> +	uint32_t i;
> +
> +	for (i = 0; i < tbl->max_key_num; i++)
> +		if (tbl->keys[i].is_valid == 0)
> +			return i;
> +	return INVALID_ARRAY_INDEX;
> +}
> +
> +int32_t
> +gro_tcp4_reassemble(struct rte_mbuf *pkt,
> +		struct gro_tcp_tbl *tbl,
> +		uint32_t max_packet_size)
> +{
> +	struct ether_hdr *eth_hdr;
> +	struct ipv4_hdr *ipv4_hdr;
> +	struct tcp_hdr *tcp_hdr;
> +	uint16_t ipv4_ihl, tcp_hl, tcp_dl;
> +
> +	struct tcp_key key;
> +	uint32_t cur_idx, prev_idx, item_idx;
> +	uint32_t i, key_idx;
> +
> +	eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
> +	ipv4_hdr = (struct ipv4_hdr *)(eth_hdr + 1);
> +	ipv4_ihl = IPv4_HDR_LEN(ipv4_hdr);
> +
> +	/* check if the packet should be processed */
> +	if (ipv4_ihl < sizeof(struct ipv4_hdr))
> +		goto fail;
> +	tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + ipv4_ihl);
> +	tcp_hl = TCP_HDR_LEN(tcp_hdr);
> +	tcp_dl = rte_be_to_cpu_16(ipv4_hdr->total_length) - ipv4_ihl
> +		- tcp_hl;
> +	if (tcp_dl == 0)
> +		goto fail;
> +
> +	/* find a key and traverse all packets in its item group */
> +	key.eth_saddr = eth_hdr->s_addr;
> +	key.eth_daddr = eth_hdr->d_addr;
> +	key.ip_src_addr[0] = rte_be_to_cpu_32(ipv4_hdr->src_addr);
> +	key.ip_dst_addr[0] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);

Your key.ip_src_addr[1-3] still contains some junk.
How is the memcmp below supposed to work properly?
BTW, why do you need 4 elements here - why not just "uint32_t ip_src_addr;"?
Same for ip_dst_addr.

> +	key.src_port = rte_be_to_cpu_16(tcp_hdr->src_port);
> +	key.dst_port = rte_be_to_cpu_16(tcp_hdr->dst_port);
> +	key.recv_ack = rte_be_to_cpu_32(tcp_hdr->recv_ack);
> +	key.tcp_flags = tcp_hdr->tcp_flags;
> +
> +	for (i = 0; i < tbl->max_key_num; i++) {
> +		if (tbl->keys[i].is_valid &&
> +				(memcmp(&(tbl->keys[i].key), &key,
> +						sizeof(struct tcp_key))
> +				 == 0)) {
> +			cur_idx = tbl->keys[i].start_index;
> +			prev_idx = cur_idx;
> +			while (cur_idx != INVALID_ARRAY_INDEX) {
> +				if (check_seq_option(tbl->items[cur_idx].pkt,
> +							tcp_hdr,
> +							tcp_hl) > 0) {

As I remember, Linux GRO also checks the IPv4 packet_id - it should be consecutive.

> +					if (merge_two_tcp4_packets(
> +								tbl->items[cur_idx].pkt,
> +								pkt,
> +								max_packet_size) > 0) {
> +						/* successfully merge two packets */
> +						tbl->items[cur_idx].is_groed = 1;
> +						return 1;
> +					}

If you allow more than one packet per flow to be stored in the table,
then you should be prepared for a new segment to fill a gap between two
packets. Probably the easiest thing is to not allow more than one
'item' per flow.

> +					/**
> +					 * fail to merge two packets since
> +					 * it's beyond the max packet length.
> +					 * Insert it into the item group.
> +					 */
> +					goto insert_to_item_group;
> +				} else {
> +					prev_idx = cur_idx;
> +					cur_idx = tbl->items[cur_idx].next_pkt_idx;
> +				}
> +			}
> +			/**
> +			 * find a corresponding item group but fails to find
> +			 * one packet to merge. Insert it into this item group.
> +			 */
> +insert_to_item_group:
> +			item_idx = find_an_empty_item(tbl);
> +			/* the item number is greater than the max value */
> +			if (item_idx == INVALID_ARRAY_INDEX)
> +				return -1;
> +			tbl->items[prev_idx].next_pkt_idx = item_idx;
> +			tbl->items[item_idx].pkt = pkt;
> +			tbl->items[item_idx].is_groed = 0;
> +			tbl->items[item_idx].next_pkt_idx = INVALID_ARRAY_INDEX;
> +			tbl->items[item_idx].is_valid = 1;
> +			tbl->items[item_idx].start_time = rte_rdtsc();
> +			tbl->item_num++;
> +			return 0;
> +		}
> +	}
> +
> +	/**
> +	 * merge fail as the given packet has a new key.
> +	 * So insert a new key.
> +	 */
> +	item_idx = find_an_empty_item(tbl);
> +	key_idx = find_an_empty_key(tbl);
> +	/**
> +	 * if current key or item number is greater than the max
> +	 * value, don't insert the packet into the table and return
> +	 * immediately.
> +	 */
> +	if (item_idx == INVALID_ARRAY_INDEX ||
> +			key_idx == INVALID_ARRAY_INDEX)
> +		return -1;
> +	tbl->items[item_idx].pkt = pkt;
> +	tbl->items[item_idx].next_pkt_idx = INVALID_ARRAY_INDEX;
> +	tbl->items[item_idx].is_groed = 0;
> +	tbl->items[item_idx].is_valid = 1;
> +	tbl->items[item_idx].start_time = rte_rdtsc();

You can pass start-time as a parameter instead.

> +	tbl->item_num++;
> +
> +	memcpy(&(tbl->keys[key_idx].key),
> +			&key, sizeof(struct tcp_key));
> +	tbl->keys[key_idx].start_index = item_idx;
> +	tbl->keys[key_idx].is_valid = 1;
> +	tbl->key_num++;
> +
> +	return 0;
> +fail:

Please try to avoid goto whenever possible.
Looks really ugly.

> +	return -1;
> +}
> +
> +uint16_t gro_tcp_tbl_flush(struct gro_tcp_tbl *tbl,
> +		struct rte_mbuf **out,
> +		const uint16_t nb_out)
> +{
> +	uint32_t i, num = 0;
> +
> +	if (nb_out < tbl->item_num)
> +		return 0;

And how would the user know how many items are currently in the table?

> +
> +	for (i = 0; i < tbl->max_item_num; i++) {
> +		if (tbl->items[i].is_valid) {
> +			out[num++] = tbl->items[i].pkt;
> +			tbl->items[i].is_valid = 0;
> +			tbl->item_num--;
> +		}
> +	}
> +	memset(tbl->keys, 0, sizeof(struct gro_tcp_key) *
> +			tbl->max_key_num);
> +	tbl->key_num = 0;
> +
> +	return num;
> +}
> +
> +uint16_t
> +gro_tcp_tbl_timeout_flush(struct gro_tcp_tbl *tbl,
> +		uint64_t timeout_cycles,
> +		struct rte_mbuf **out,
> +		const uint16_t nb_out)
> +{
> +	uint16_t k;
> +	uint32_t i, j;
> +	uint64_t current_time;
> +
> +	if (nb_out == 0)
> +		return 0;
> +	k = 0;
> +	current_time = rte_rdtsc();
> +
> +	for (i = 0; i < tbl->max_key_num; i++) {
> +		if (tbl->keys[i].is_valid) {

Seems pretty expensive to traverse the whole table...
Would it be worth having some sort of LRU list?

> +			j = tbl->keys[i].start_index;
> +			while (j != INVALID_ARRAY_INDEX) {
> +				if (current_time - tbl->items[j].start_time >=
> +						timeout_cycles) {
> +					out[k++] = tbl->items[j].pkt;
> +					tbl->items[j].is_valid = 0;
> +					tbl->item_num--;
> +					j = tbl->items[j].next_pkt_idx;
> +
> +					if (k == nb_out &&
> +							j == INVALID_ARRAY_INDEX) {
> +						/* delete the key */
> +						tbl->keys[i].is_valid = 0;
> +						tbl->key_num--;
> +						goto end;

Please rearrange the code to avoid gotos.

> +					} else if (k == nb_out &&
> +							j != INVALID_ARRAY_INDEX) {
> +						/* update the first item index */
> +						tbl->keys[i].start_index = j;
> +						goto end;
> +					}
> +				}
> +			}
> +			/* delete the key, as all of its packets are flushed */
> +			tbl->keys[i].is_valid = 0;
> +			tbl->key_num--;
> +		}
> +		if (tbl->key_num == 0)
> +			goto end;
> +	}
> +end:
> +	return k;
> +}
> diff --git a/lib/librte_gro/rte_gro_tcp.h b/lib/librte_gro/rte_gro_tcp.h
> new file mode 100644
> index 0000000..a9a7aca
> --- /dev/null
> +++ b/lib/librte_gro/rte_gro_tcp.h
> @@ -0,0 +1,191 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2017 Intel Corporation. All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_GRO_TCP_H_
> +#define _RTE_GRO_TCP_H_
> +
> +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> +#define TCP_HDR_LEN(tcph) \
> +	((tcph->data_off >> 4) * 4)
> +#define IPv4_HDR_LEN(iph) \
> +	((iph->version_ihl & 0x0f) * 4)
> +#else
> +#define TCP_DATAOFF_MASK 0x0f
> +#define TCP_HDR_LEN(tcph) \
> +	((tcph->data_off & TCP_DATAOFF_MASK) * 4)
> +#define IPv4_HDR_LEN(iph) \
> +	((iph->version_ihl >> 4) * 4)
> +#endif
> +
> +#define INVALID_ARRAY_INDEX 0xffffffffUL
> +#define GRO_TCP_TBL_MAX_ITEM_NUM (UINT32_MAX - 1)
> +
> +/* criteria of mergeing packets */
> +struct tcp_key {
> +	struct ether_addr eth_saddr;
> +	struct ether_addr eth_daddr;
> +	uint32_t ip_src_addr[4];	/**< IPv4 uses the first 8B */
> +	uint32_t ip_dst_addr[4];
> +
> +	uint32_t recv_ack;	/**< acknowledgment sequence number. */
> +	uint16_t src_port;
> +	uint16_t dst_port;
> +	uint8_t tcp_flags;	/**< TCP flags. */
> +};
> +
> +struct gro_tcp_key {
> +	struct tcp_key key;
> +	uint32_t start_index;	/**< the first packet index of the flow */
> +	uint8_t is_valid;
> +};
> +
> +struct gro_tcp_item {
> +	struct rte_mbuf *pkt;	/**< packet address. */
> +	/* the time when the packet in added into the table */
> +	uint64_t start_time;
> +	uint32_t next_pkt_idx;	/**< next packet index. */
> +	/* flag to indicate if the packet is GROed */
> +	uint8_t is_groed;
> +	uint8_t is_valid;	/**< flag indicates if the item is valid */

Why do you need these 2 flags at all?
Why not just reset, say, pkt to NULL for an invalid item?

> +};
> +
> +/**
> + * TCP reassembly table. Both TCP/IPv4 and TCP/IPv6 use the same table
> + * structure.
> + */
> +struct gro_tcp_tbl {
> +	struct gro_tcp_item *items;	/**< item array */
> +	struct gro_tcp_key *keys;	/**< key array */
> +	uint32_t item_num;	/**< current item number */
> +	uint32_t key_num;	/**< current key num */
> +	uint32_t max_item_num;	/**< item array size */
> +	uint32_t max_key_num;	/**< key array size */
> +};
> +
> +/**
> + * This function creates a TCP reassembly table.
> + *
> + * @param socket_id
> + *  socket index where the Ethernet port connects to.
> + * @param max_flow_num
> + *  the maximum number of flows in the TCP GRO table
> + * @param max_item_per_flow
> + *  the maximum packet number per flow.
> + * @return
> + *  if create successfully, return a pointer which points to the
> + *  created TCP GRO table. Otherwise, return NULL.
> + */
> +void *gro_tcp_tbl_create(uint16_t socket_id,
> +		uint16_t max_flow_num,
> +		uint16_t max_item_per_flow);
> +
> +/**
> + * This function destroys a TCP reassembly table.
> + * @param tbl
> + *  a pointer points to the TCP reassembly table.
> + */
> +void gro_tcp_tbl_destroy(void *tbl);
> +
> +/**
> + * This function searches for a packet in the TCP reassembly table to
> + * merge with the inputted one. To merge two packets is to chain them
> + * together and update packet headers. If the packet is without data
> + * (e.g. SYN, SYN-ACK packet), this function returns immediately.
> + * Otherwise, the packet is either merged, or inserted into the table.
> + * Besides, if there is no available space to insert the packet, this
> + * function returns immediately too.
> + *
> + * This function assumes the inputted packet is with correct IPv4 and
> + * TCP checksums. And if two packets are merged, it won't re-calculate
> + * IPv4 and TCP checksums. Besides, if the inputted packet is IP
> + * fragmented, it assumes the packet is complete (with TCP header).
> + *
> + * @param pkt
> + *  packet to reassemble.
> + * @param tbl
> + *  a pointer that points to a TCP reassembly table.
> + * @param max_packet_size
> + *  max packet length after merged
> + * @return
> + *  if the packet doesn't have data, or there is no available space
> + *  in the table to insert a new item or a new key, return a negative
> + *  value. If the packet is merged successfully, return an positive
> + *  value. If the packet is inserted into the table, return 0.
> + */
> +int32_t
> +gro_tcp4_reassemble(struct rte_mbuf *pkt,
> +		struct gro_tcp_tbl *tbl,
> +		uint32_t max_packet_size);
> +
> +/**
> + * This function flushes all packets in a TCP reassembly table to
> + * applications, and without updating checksums for merged packets.
> + * If the array which is used to keep flushed packets is not large
> + * enough, error happens and this function returns immediately.
> + *
> + * @param tbl
> + *  a pointer that points to a TCP GRO table.
> + * @param out
> + *  pointer array which is used to keep flushed packets. Applications
> + *  should guarantee it's large enough to hold all packets in the table.
> + * @param nb_out
> + *  the element number of out.
> + * @return
> + *  the number of flushed packets. If out is not large enough to hold
> + *  all packets in the table, return 0.
> + */
> +uint16_t
> +gro_tcp_tbl_flush(struct gro_tcp_tbl *tbl,
> +		struct rte_mbuf **out,
> +		const uint16_t nb_out);
> +
> +/**
> + * This function flushes timeout packets in a TCP reassembly table to
> + * applications, and without updating checksums for merged packets.
> + *
> + * @param tbl
> + *  a pointer that points to a TCP GRO table.
> + * @param timeout_cycles
> + *  the maximum time that packets can stay in the table.
> + * @param out
> + *  pointer array which is used to keep flushed packets.
> + * @param nb_out
> + *  the element number of out.
> + * @return
> + *  the number of packets that are returned.
> + */
> +uint16_t
> +gro_tcp_tbl_timeout_flush(struct gro_tcp_tbl *tbl,
> +		uint64_t timeout_cycles,
> +		struct rte_mbuf **out,
> +		const uint16_t nb_out);
> +#endif
> --
> 2.7.4
  
Hu, Jiayu June 29, 2017, 2:26 a.m. UTC | #2
Hi Konstantin,

On Thu, Jun 29, 2017 at 07:56:10AM +0800, Ananyev, Konstantin wrote:
> Hi Jiayu,
> 
> > 
> > In this patch, we introduce five APIs to support TCP/IPv4 GRO.
> > - gro_tcp_tbl_create: create a TCP reassembly table, which is used to
> >     merge packets.
> > - gro_tcp_tbl_destroy: free memory space of a TCP reassembly table.
> > - gro_tcp_tbl_flush: flush all packets from a TCP reassembly table.
> > - gro_tcp_tbl_timeout_flush: flush timeout packets from a TCP
> >     reassembly table.
> > - gro_tcp4_reassemble: reassemble an inputted TCP/IPv4 packet.
> > 
> > TCP/IPv4 GRO API assumes all inputted packets are with correct IPv4
> > and TCP checksums. And TCP/IPv4 GRO API doesn't update IPv4 and TCP
> > checksums for merged packets. If inputted packets are IP fragmented,
> > TCP/IPv4 GRO API assumes they are complete packets (i.e. with L4
> > headers).
> > 
> > In TCP GRO, we use a table structure, called TCP reassembly table, to
> > reassemble packets. Both TCP/IPv4 and TCP/IPv6 GRO use the same table
> > structure. A TCP reassembly table includes a key array and a item array,
> > where the key array keeps the criteria to merge packets and the item
> > array keeps packet information.
> > 
> > One key in the key array points to an item group, which consists of
> > packets which have the same criteria value. If two packets are able to
> > merge, they must be in the same item group. Each key in the key array
> > includes two parts:
> > - criteria: the criteria of merging packets. If two packets can be
> >     merged, they must have the same criteria value.
> > - start_index: the index of the first incoming packet of the item group.
> > 
> > Each element in the item array keeps the information of one packet. It
> > mainly includes two parts:
> > - pkt: packet address
> > - next_pkt_index: the index of the next packet in the same item group.
> >     All packets in the same item group are chained by next_pkt_index.
> >     With next_pkt_index, we can locate all packets in the same item
> >     group one by one.
> > 
> > To process an incoming packet needs three steps:
> > a. check if the packet should be processed. Packets with the following
> >     properties won't be processed:
> > 	- packets without data (e.g. SYN, SYN-ACK)
> > b. traverse the key array to find a key which has the same criteria
> >     value with the incoming packet. If find, goto step c. Otherwise,
> >     insert a new key and insert the packet into the item array.
> > c. locate the first packet in the item group via the start_index in the
> >     key. Then traverse all packets in the item group via next_pkt_index.
> >     If find one packet which can merge with the incoming one, merge them
> >     together. If can't find, insert the packet into this item group.
> > 
> > Signed-off-by: Jiayu Hu <jiayu.hu@intel.com>
> > ---
> >  doc/guides/rel_notes/release_17_08.rst |   7 +
> >  lib/librte_gro/Makefile                |   1 +
> >  lib/librte_gro/rte_gro.c               | 123 ++++++++--
> >  lib/librte_gro/rte_gro.h               |   6 +-
> >  lib/librte_gro/rte_gro_tcp.c           | 394 +++++++++++++++++++++++++++++++++
> >  lib/librte_gro/rte_gro_tcp.h           | 191 ++++++++++++++++
> >  6 files changed, 706 insertions(+), 16 deletions(-)
> >  create mode 100644 lib/librte_gro/rte_gro_tcp.c
> >  create mode 100644 lib/librte_gro/rte_gro_tcp.h
> > 
> > diff --git a/doc/guides/rel_notes/release_17_08.rst b/doc/guides/rel_notes/release_17_08.rst
> > index 842f46f..f067247 100644
> > --- a/doc/guides/rel_notes/release_17_08.rst
> > +++ b/doc/guides/rel_notes/release_17_08.rst
> > @@ -75,6 +75,13 @@ New Features
> > 
> >    Added support for firmwares with multiple Ethernet ports per physical port.
> > 
> > +* **Add Generic Receive Offload API support.**
> > +
> > +  Generic Receive Offload (GRO) API supports to reassemble TCP/IPv4
> > +  packets. GRO API assumes all inputted packets are with correct
> > +  checksums. GRO API doesn't update checksums for merged packets. If
> > +  inputted packets are IP fragmented, GRO API assumes they are complete
> > +  packets (i.e. with L4 headers).
> > 
> >  Resolved Issues
> >  ---------------
> > diff --git a/lib/librte_gro/Makefile b/lib/librte_gro/Makefile
> > index 7e0f128..e89344d 100644
> > --- a/lib/librte_gro/Makefile
> > +++ b/lib/librte_gro/Makefile
> > @@ -43,6 +43,7 @@ LIBABIVER := 1
> > 
> >  # source files
> >  SRCS-$(CONFIG_RTE_LIBRTE_GRO) += rte_gro.c
> > +SRCS-$(CONFIG_RTE_LIBRTE_GRO) += rte_gro_tcp.c
> > 
> >  # install this header file
> >  SYMLINK-$(CONFIG_RTE_LIBRTE_GRO)-include += rte_gro.h
> > diff --git a/lib/librte_gro/rte_gro.c b/lib/librte_gro/rte_gro.c
> > index 33275e8..5b89928 100644
> > --- a/lib/librte_gro/rte_gro.c
> > +++ b/lib/librte_gro/rte_gro.c
> > @@ -32,11 +32,15 @@
> > 
> >  #include <rte_malloc.h>
> >  #include <rte_mbuf.h>
> > +#include <rte_ethdev.h>
> > 
> >  #include "rte_gro.h"
> > +#include "rte_gro_tcp.h"
> > 
> > -static gro_tbl_create_fn tbl_create_functions[GRO_TYPE_MAX_NUM];
> > -static gro_tbl_destroy_fn tbl_destroy_functions[GRO_TYPE_MAX_NUM];
> > +static gro_tbl_create_fn tbl_create_functions[GRO_TYPE_MAX_NUM] = {
> > +	gro_tcp_tbl_create, NULL};
> > +static gro_tbl_destroy_fn tbl_destroy_functions[GRO_TYPE_MAX_NUM] = {
> > +	gro_tcp_tbl_destroy, NULL};
> > 
> >  struct rte_gro_tbl *rte_gro_tbl_create(uint16_t socket_id,
> >  		uint16_t max_flow_num,
> > @@ -94,32 +98,121 @@ void rte_gro_tbl_destroy(struct rte_gro_tbl *gro_tbl)
> >  }
> > 
> >  uint16_t
> > -rte_gro_reassemble_burst(struct rte_mbuf **pkts __rte_unused,
> > +rte_gro_reassemble_burst(struct rte_mbuf **pkts,
> >  		const uint16_t nb_pkts,
> > -		const struct rte_gro_param param __rte_unused)
> > +		const struct rte_gro_param param)
> >  {
> > -	return nb_pkts;
> > +	uint16_t i;
> > +	uint16_t nb_after_gro = nb_pkts;
> > +	uint32_t item_num = RTE_MIN(nb_pkts, param.max_flow_num *
> > +			param.max_item_per_flow);
> > +
> > +	/* allocate a reassembly table for TCP/IPv4 GRO */
> > +	uint32_t tcp_item_num = RTE_MIN(item_num, GRO_MAX_BURST_ITEM_NUM);
> > +	struct gro_tcp_tbl tcp_tbl;
> > +	struct gro_tcp_key tcp_keys[tcp_item_num];
> > +	struct gro_tcp_item tcp_items[tcp_item_num];
> > +
> > +	struct rte_mbuf *unprocess_pkts[nb_pkts];
> > +	uint16_t unprocess_num = 0;
> > +	int32_t ret;
> > +
> > +	memset(tcp_keys, 0, sizeof(struct gro_tcp_key) *
> > +			tcp_item_num);
> > +	memset(tcp_items, 0, sizeof(struct gro_tcp_item) *
> > +			tcp_item_num);
> > +	tcp_tbl.keys = tcp_keys;
> > +	tcp_tbl.items = tcp_items;
> > +	tcp_tbl.key_num = 0;
> > +	tcp_tbl.item_num = 0;
> > +	tcp_tbl.max_key_num = tcp_item_num;
> > +	tcp_tbl.max_item_num = tcp_item_num;
> > +
> > +	for (i = 0; i < nb_pkts; i++) {
> > +		if (RTE_ETH_IS_IPV4_HDR(pkts[i]->packet_type)) {
> 
> Why just not && for these 2 conditions?
> 
> > +			if ((pkts[i]->packet_type & RTE_PTYPE_L4_TCP) &&
> > +				(param.desired_gro_types &
> > +					 GRO_TCP_IPV4)) {
> 
> No need to check param.desired_gro_types inside the loop.
> You can do that before the loop.

Yes, I should do it before the loop. Thanks, I will change it.

> 
> > +				ret = gro_tcp4_reassemble(pkts[i],
> > +						&tcp_tbl,
> > +						param.max_packet_size);
> > +				/* merge successfully */
> > +				if (ret > 0)
> > +					nb_after_gro--;
> > +				else if (ret < 0)
> > +					unprocess_pkts[unprocess_num++] =
> > +						pkts[i];
> > +			} else
> > +				unprocess_pkts[unprocess_num++] =
> > +					pkts[i];
> > +		} else
> > +			unprocess_pkts[unprocess_num++] =
> > +				pkts[i];
> > +	}
> > +
> > +	/* re-arrange GROed packets */
> > +	if (nb_after_gro < nb_pkts) {
> > +		if (param.desired_gro_types & GRO_TCP_IPV4)
> > +			i = gro_tcp_tbl_flush(&tcp_tbl, pkts, nb_pkts);
> > +		if (unprocess_num > 0) {
> > +			memcpy(&pkts[i], unprocess_pkts,
> > +					sizeof(struct rte_mbuf *) *
> > +					unprocess_num);
> > +			i += unprocess_num;
> > +		}
> > +		if (nb_pkts > i)
> > +			memset(&pkts[i], 0,
> > +					sizeof(struct rte_mbuf *) *
> > +					(nb_pkts - i));
> > +	}
> 
> Why do you need to zero remaining pkts[]?

Thanks, I will remove these resetting operations.

> 
> > +	return nb_after_gro;
> >  }
> > 
> > -int rte_gro_reassemble(struct rte_mbuf *pkt __rte_unused,
> > -		struct rte_gro_tbl *gro_tbl __rte_unused)
> > +int rte_gro_reassemble(struct rte_mbuf *pkt,
> > +		struct rte_gro_tbl *gro_tbl)
> >  {
> > +	if (unlikely(pkt == NULL))
> > +		return -1;
> > +
> > +	if (RTE_ETH_IS_IPV4_HDR(pkt->packet_type)) {
> > +		if ((pkt->packet_type & RTE_PTYPE_L4_TCP) &&
> > +				(gro_tbl->desired_gro_types &
> > +				 GRO_TCP_IPV4))
> > +			return gro_tcp4_reassemble(pkt,
> > +					gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
> > +					gro_tbl->max_packet_size);
> > +	}
> > +
> >  	return -1;
> >  }
> > 
> > -uint16_t rte_gro_flush(struct rte_gro_tbl *gro_tbl __rte_unused,
> > -		uint64_t desired_gro_types __rte_unused,
> > -		struct rte_mbuf **out __rte_unused,
> > -		const uint16_t max_nb_out __rte_unused)
> > +uint16_t rte_gro_flush(struct rte_gro_tbl *gro_tbl,
> > +		uint64_t desired_gro_types,
> > +		struct rte_mbuf **out,
> > +		const uint16_t max_nb_out)
> >  {
> > +	desired_gro_types = desired_gro_types &
> > +		gro_tbl->desired_gro_types;
> > +	if (desired_gro_types & GRO_TCP_IPV4)
> > +		return gro_tcp_tbl_flush(
> > +				gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
> > +				out,
> > +				max_nb_out);
> >  	return 0;
> >  }
> > 
> >  uint16_t
> > -rte_gro_timeout_flush(struct rte_gro_tbl *gro_tbl __rte_unused,
> > -		uint64_t desired_gro_types __rte_unused,
> > -		struct rte_mbuf **out __rte_unused,
> > -		const uint16_t max_nb_out __rte_unused)
> > +rte_gro_timeout_flush(struct rte_gro_tbl *gro_tbl,
> > +		uint64_t desired_gro_types,
> > +		struct rte_mbuf **out,
> > +		const uint16_t max_nb_out)
> >  {
> > +	desired_gro_types = desired_gro_types &
> > +		gro_tbl->desired_gro_types;
> > +	if (desired_gro_types & GRO_TCP_IPV4)
> > +		return gro_tcp_tbl_timeout_flush(
> > +				gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
> > +				gro_tbl->max_timeout_cycles,
> > +				out, max_nb_out);
> >  	return 0;
> >  }
> > diff --git a/lib/librte_gro/rte_gro.h b/lib/librte_gro/rte_gro.h
> > index f9d36e8..a30b1c6 100644
> > --- a/lib/librte_gro/rte_gro.h
> > +++ b/lib/librte_gro/rte_gro.h
> > @@ -45,7 +45,11 @@ extern "C" {
> > 
> >  /* max number of supported GRO types */
> >  #define GRO_TYPE_MAX_NUM 64
> > -#define GRO_TYPE_SUPPORT_NUM 0	/**< current supported GRO num */
> > +#define GRO_TYPE_SUPPORT_NUM 1	/**< current supported GRO num */
> > +
> > +/* TCP/IPv4 GRO flag */
> > +#define GRO_TCP_IPV4_INDEX 0
> > +#define GRO_TCP_IPV4 (1ULL << GRO_TCP_IPV4_INDEX)
> > 
> >  /**
> >   * GRO table, which is used to merge packets. It keeps many reassembly
> > diff --git a/lib/librte_gro/rte_gro_tcp.c b/lib/librte_gro/rte_gro_tcp.c
> > new file mode 100644
> > index 0000000..c0eaa45
> > --- /dev/null
> > +++ b/lib/librte_gro/rte_gro_tcp.c
> > @@ -0,0 +1,394 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2017 Intel Corporation. All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#include <rte_malloc.h>
> > +#include <rte_mbuf.h>
> > +#include <rte_cycles.h>
> > +
> > +#include <rte_ethdev.h>
> > +#include <rte_ip.h>
> > +#include <rte_tcp.h>
> > +
> > +#include "rte_gro_tcp.h"
> > +
> > +void *gro_tcp_tbl_create(uint16_t socket_id,
> > +		uint16_t max_flow_num,
> > +		uint16_t max_item_per_flow)
> > +{
> > +	size_t size;
> > +	uint32_t entries_num;
> > +	struct gro_tcp_tbl *tbl;
> > +
> > +	entries_num = max_flow_num * max_item_per_flow;
> > +	entries_num = entries_num > GRO_TCP_TBL_MAX_ITEM_NUM ?
> > +		GRO_TCP_TBL_MAX_ITEM_NUM : entries_num;
> > +
> > +	if (entries_num == 0)
> > +		return NULL;
> > +
> > +	tbl = (struct gro_tcp_tbl *)rte_zmalloc_socket(
> > +			__func__,
> > +			sizeof(struct gro_tcp_tbl),
> > +			RTE_CACHE_LINE_SIZE,
> > +			socket_id);
> 
> Here and everywhere - rte_malloc() can fail.
> Add proper error handling.

Thanks, I will modify it.

> 
> > +
> > +	size = sizeof(struct gro_tcp_item) * entries_num;
> > +	tbl->items = (struct gro_tcp_item *)rte_zmalloc_socket(
> > +			__func__,
> > +			size,
> > +			RTE_CACHE_LINE_SIZE,
> > +			socket_id);
> > +	tbl->max_item_num = entries_num;
> > +
> > +	size = sizeof(struct gro_tcp_key) * entries_num;
> > +	tbl->keys = (struct gro_tcp_key *)rte_zmalloc_socket(
> > +			__func__,
> > +			size, RTE_CACHE_LINE_SIZE,
> > +			socket_id);
> > +	tbl->max_key_num = entries_num;
> > +	return tbl;
> > +}
> > +
> > +void gro_tcp_tbl_destroy(void *tbl)
> > +{
> > +	struct gro_tcp_tbl *tcp_tbl = (struct gro_tcp_tbl *)tbl;
> > +
> > +	if (tcp_tbl) {
> > +		if (tcp_tbl->items)
> 
> No need to, rte_free(NULL) is a valid construction.
> Same below.

Thanks.

> 
> > +			rte_free(tcp_tbl->items);
> > +		if (tcp_tbl->keys)
> > +			rte_free(tcp_tbl->keys);
> > +		rte_free(tcp_tbl);
> > +	}
> > +}
> > +
> > +/**
> > + * merge two TCP/IPv4 packets without update checksums.
> > + */
> > +static int
> > +merge_two_tcp4_packets(struct rte_mbuf *pkt_src,
> > +		struct rte_mbuf *pkt,
> > +		uint32_t max_packet_size)
> > +{
> > +	struct ipv4_hdr *ipv4_hdr1, *ipv4_hdr2;
> > +	struct tcp_hdr *tcp_hdr1;
> > +	uint16_t ipv4_ihl1, tcp_hl1, tcp_dl1;
> > +	struct rte_mbuf *tail;
> > +
> > +	/* parse the given packet */
> > +	ipv4_hdr1 = (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt,
> > +				struct ether_hdr *) + 1);
> 
> You probably shouldn't assume that l2_len is always 14B long.
> 
> > +	ipv4_ihl1 = IPv4_HDR_LEN(ipv4_hdr1);
> > +	tcp_hdr1 = (struct tcp_hdr *)((char *)ipv4_hdr1 + ipv4_ihl1);
> > +	tcp_hl1 = TCP_HDR_LEN(tcp_hdr1);
> > +	tcp_dl1 = rte_be_to_cpu_16(ipv4_hdr1->total_length) - ipv4_ihl1
> > +		- tcp_hl1;
> > +
> > +	/* parse the original packet */
> > +	ipv4_hdr2 = (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt_src,
> > +				struct ether_hdr *) + 1);
> > +
> > +	if (pkt_src->pkt_len + tcp_dl1 > max_packet_size)
> > +		return -1;
> > +
> > +	/* remove the header of the incoming packet */
> > +	rte_pktmbuf_adj(pkt, sizeof(struct ether_hdr) +
> > +			ipv4_ihl1 + tcp_hl1);
> > +
> > +	/* chain the two packet together */
> > +	tail = rte_pktmbuf_lastseg(pkt_src);
> > +	tail->next = pkt;
> 
> What I see as a problem here:
> You have to reparse your packet and do lastseg for it for each new segment.
> That seems like a big overhead.
> Would be good instead to parse the packet once and then store that infomatio
> inside mbuf: l2_len/l3_len/l4_len, etc.
> You can probably even avoid parsing inside your library - by adding as a prerequisite
> for the caller to fill these fields properly.
> 
> Similar thought about lastseg - would be good to store it somewhere inside your table.

Yes, it's faster and simpler to store the lastseg in the table. I will modify it.

> 
> > +
> > +	/* update IP header */
> > +	ipv4_hdr2->total_length = rte_cpu_to_be_16(
> > +			rte_be_to_cpu_16(
> > +				ipv4_hdr2->total_length)
> > +			+ tcp_dl1);
> > +
> > +	/* update mbuf metadata for the merged packet */
> > +	pkt_src->nb_segs++;
> 
> Why do you assume that incoming packet always contains only one segment?

I think there is a bug here: I need to handle multi-segment packets.
Thanks, I will modify it.

> 
> > +	pkt_src->pkt_len += pkt->pkt_len;
> > +	return 1;
> > +}
> > +
> > +static int
> > +check_seq_option(struct rte_mbuf *pkt,
> > +		struct tcp_hdr *tcp_hdr,
> > +		uint16_t tcp_hl)
> > +{
> > +	struct ipv4_hdr *ipv4_hdr1;
> > +	struct tcp_hdr *tcp_hdr1;
> > +	uint16_t ipv4_ihl1, tcp_hl1, tcp_dl1;
> > +	uint32_t sent_seq1, sent_seq;
> > +	int ret = -1;
> > +
> > +	ipv4_hdr1 = (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt,
> > +				struct ether_hdr *) + 1);
> > +	ipv4_ihl1 = IPv4_HDR_LEN(ipv4_hdr1);
> > +	tcp_hdr1 = (struct tcp_hdr *)((char *)ipv4_hdr1 + ipv4_ihl1);
> > +	tcp_hl1 = TCP_HDR_LEN(tcp_hdr1);
> > +	tcp_dl1 = rte_be_to_cpu_16(ipv4_hdr1->total_length) - ipv4_ihl1
> > +		- tcp_hl1;
> > +	sent_seq1 = rte_be_to_cpu_32(tcp_hdr1->sent_seq) + tcp_dl1;
> > +	sent_seq = rte_be_to_cpu_32(tcp_hdr->sent_seq);
> > +
> > +	/* check if the two packets are neighbor */
> > +	if ((sent_seq ^ sent_seq1) == 0) {
> 
> Why just not if sent_seq == sent_seq1?

Thanks, I will change it.

> 
> > +		/* check if TCP option field equals */
> > +		if (tcp_hl1 > sizeof(struct tcp_hdr)) {
> 
> And what if tcp_hl1 == sizeof(struct tcp_hdr), but tcp_hl > tcp_hl1?
> I think you need to remove that check.

I will remove it. Thanks.

>   
> > +			if ((tcp_hl1 != tcp_hl) ||
> > +					(memcmp(tcp_hdr1 + 1,
> > +							tcp_hdr + 1,
> > +							tcp_hl - sizeof
> > +							(struct tcp_hdr))
> > +					 == 0))
> > +				ret = 1;
> > +		}
> > +	}
> > +	return ret;
> > +}
> > +
> > +static uint32_t
> > +find_an_empty_item(struct gro_tcp_tbl *tbl)
> > +{
> > +	uint32_t i;
> > +
> > +	for (i = 0; i < tbl->max_item_num; i++)
> > +		if (tbl->items[i].is_valid == 0)
> > +			return i;
> > +	return INVALID_ARRAY_INDEX;
> > +}
> > +
> > +static uint32_t
> > +find_an_empty_key(struct gro_tcp_tbl *tbl)
> > +{
> > +	uint32_t i;
> > +
> > +	for (i = 0; i < tbl->max_key_num; i++)
> > +		if (tbl->keys[i].is_valid == 0)
> > +			return i;
> > +	return INVALID_ARRAY_INDEX;
> > +}
> > +
> > +int32_t
> > +gro_tcp4_reassemble(struct rte_mbuf *pkt,
> > +		struct gro_tcp_tbl *tbl,
> > +		uint32_t max_packet_size)
> > +{
> > +	struct ether_hdr *eth_hdr;
> > +	struct ipv4_hdr *ipv4_hdr;
> > +	struct tcp_hdr *tcp_hdr;
> > +	uint16_t ipv4_ihl, tcp_hl, tcp_dl;
> > +
> > +	struct tcp_key key;
> > +	uint32_t cur_idx, prev_idx, item_idx;
> > +	uint32_t i, key_idx;
> > +
> > +	eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
> > +	ipv4_hdr = (struct ipv4_hdr *)(eth_hdr + 1);
> > +	ipv4_ihl = IPv4_HDR_LEN(ipv4_hdr);
> > +
> > +	/* check if the packet should be processed */
> > +	if (ipv4_ihl < sizeof(struct ipv4_hdr))
> > +		goto fail;
> > +	tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + ipv4_ihl);
> > +	tcp_hl = TCP_HDR_LEN(tcp_hdr);
> > +	tcp_dl = rte_be_to_cpu_16(ipv4_hdr->total_length) - ipv4_ihl
> > +		- tcp_hl;
> > +	if (tcp_dl == 0)
> > +		goto fail;
> > +
> > +	/* find a key and traverse all packets in its item group */
> > +	key.eth_saddr = eth_hdr->s_addr;
> > +	key.eth_daddr = eth_hdr->d_addr;
> > +	key.ip_src_addr[0] = rte_be_to_cpu_32(ipv4_hdr->src_addr);
> > +	key.ip_dst_addr[0] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
> 
> Your key.ip_src_addr[1-3] still contains some junk.
> How memcmp below supposed to worj properly?

When we allocate an item, we already guarantee that its memory is
zeroed, so the memcmp works correctly.

> BTW why do you need 4 elems here, why just not uint32_t ip_src_addr;?
> Same for ip_dst_addr.

I think TCP/IPv4 and TCP/IPv6 can share the same table structure, so I
use a 128-bit IP address here. Do you mean we should use different
structures for TCP/IPv4 and TCP/IPv6?

> 
> > +	key.src_port = rte_be_to_cpu_16(tcp_hdr->src_port);
> > +	key.dst_port = rte_be_to_cpu_16(tcp_hdr->dst_port);
> > +	key.recv_ack = rte_be_to_cpu_32(tcp_hdr->recv_ack);
> > +	key.tcp_flags = tcp_hdr->tcp_flags;
> > +
> > +	for (i = 0; i < tbl->max_key_num; i++) {
> > +		if (tbl->keys[i].is_valid &&
> > +				(memcmp(&(tbl->keys[i].key), &key,
> > +						sizeof(struct tcp_key))
> > +				 == 0)) {
> > +			cur_idx = tbl->keys[i].start_index;
> > +			prev_idx = cur_idx;
> > +			while (cur_idx != INVALID_ARRAY_INDEX) {
> > +				if (check_seq_option(tbl->items[cur_idx].pkt,
> > +							tcp_hdr,
> > +							tcp_hl) > 0) {
> 
> As I remember linux gro also check ipv4 packet_id - it should be consecutive.

IP fragments of one packet have the same IP ID, but other packets' IDs
are consecutive. Since we suppose GRO can merge IP fragmented packets,
I think we shouldn't check whether the IP ID is consecutive here. What
do you think?

> 
> > +					if (merge_two_tcp4_packets(
> > +								tbl->items[cur_idx].pkt,
> > +								pkt,
> > +								max_packet_size) > 0) {
> > +						/* successfully merge two packets */
> > +						tbl->items[cur_idx].is_groed = 1;
> > +						return 1;
> > +					}
> 
> If you allow more then packet per flow to be stored in the table, then you should be
> prepared that new segment can fill a gap between 2 packets.
> Probably the easiest thing - don't allow more then one 'item' per flow.

We allow the table to store same flow but out-of-order arriving packets. For
these packets, they will occupy different 'items' and we won't re-merge them.
For example, there are three same flow TCP packets: p1, p2 and p3. And p1 arrives
first, then p3, and last is p2. So TCP GRO will allocate one 'item' for p1 and one
'item' for p3, and when p2 arrives, p2 will be merged with p1. Therefore, in the
table, we will have two 'items': item1 to store the merged p1 and p2, item2 to store p3.

As you can see, TCP GRO can only merge sequentially arriving packets. If we want to
merge all out-of-order arriving packets, we need to re-process those packets which
are already processed and have one 'item'. IMO, this procedure will be very complicated.
So we don't do that.

Sorry, I don't understand how to allow one 'item' per flow, because packets arrive
out-of-order. If we don't re-process the packets which already have one 'item', how
can we guarantee it?
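
As a sketch of the scenario above (indices are illustrative only,
assuming the next_pkt_idx chaining from this patch):

/*
 * p1 (seq=1) arrives: items[0] = p1, key.start_index = 0
 * p3 (seq=3) arrives: not a neighbor of p1 -> items[1] = p3,
 *                     items[0].next_pkt_idx = 1
 * p2 (seq=2) arrives: check_seq_option() matches items[0], so p2 is
 *                     appended to p1 -> items[0] now holds p1+p2
 *
 * items[0] (p1+p2) and items[1] (p3) are never re-merged afterwards.
 */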
 
> 
> > +					/**
> > +					 * fail to merge two packets since
> > +					 * it's beyond the max packet length.
> > +					 * Insert it into the item group.
> > +					 */
> > +					goto insert_to_item_group;
> > +				} else {
> > +					prev_idx = cur_idx;
> > +					cur_idx = tbl->items[cur_idx].next_pkt_idx;
> > +				}
> > +			}
> > +			/**
> > +			 * find a corresponding item group but fails to find
> > +			 * one packet to merge. Insert it into this item group.
> > +			 */
> > +insert_to_item_group:
> > +			item_idx = find_an_empty_item(tbl);
> > +			/* the item number is greater than the max value */
> > +			if (item_idx == INVALID_ARRAY_INDEX)
> > +				return -1;
> > +			tbl->items[prev_idx].next_pkt_idx = item_idx;
> > +			tbl->items[item_idx].pkt = pkt;
> > +			tbl->items[item_idx].is_groed = 0;
> > +			tbl->items[item_idx].next_pkt_idx = INVALID_ARRAY_INDEX;
> > +			tbl->items[item_idx].is_valid = 1;
> > +			tbl->items[item_idx].start_time = rte_rdtsc();
> > +			tbl->item_num++;
> > +			return 0;
> > +		}
> > +	}
> > +
> > +	/**
> > +	 * merge fail as the given packet has a new key.
> > +	 * So insert a new key.
> > +	 */
> > +	item_idx = find_an_empty_item(tbl);
> > +	key_idx = find_an_empty_key(tbl);
> > +	/**
> > +	 * if current key or item number is greater than the max
> > +	 * value, don't insert the packet into the table and return
> > +	 * immediately.
> > +	 */
> > +	if (item_idx == INVALID_ARRAY_INDEX ||
> > +			key_idx == INVALID_ARRAY_INDEX)
> > +		return -1;
> > +	tbl->items[item_idx].pkt = pkt;
> > +	tbl->items[item_idx].next_pkt_idx = INVALID_ARRAY_INDEX;
> > +	tbl->items[item_idx].is_groed = 0;
> > +	tbl->items[item_idx].is_valid = 1;
> > +	tbl->items[item_idx].start_time = rte_rdtsc();
> 
> You can pass start-time as a parameter instead.

Thanks, I will modify it.
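
Something like this, as a sketch (the exact new signature is hypothetical;
the caller would sample the timestamp once per burst and pass it in):

int32_t
gro_tcp4_reassemble(struct rte_mbuf *pkt,
		struct gro_tcp_tbl *tbl,
		uint32_t max_packet_size,
		uint64_t start_time);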

> 
> > +	tbl->item_num++;
> > +
> > +	memcpy(&(tbl->keys[key_idx].key),
> > +			&key, sizeof(struct tcp_key));
> > +	tbl->keys[key_idx].start_index = item_idx;
> > +	tbl->keys[key_idx].is_valid = 1;
> > +	tbl->key_num++;
> > +
> > +	return 0;
> > +fail:
> 
> Please try to avoid goto whenever possible.
> Looks really ugly.

Thanks, I will modify it.
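
Since the 'fail' label only returns -1, the gotos can simply become direct
returns, e.g.:

	if (ipv4_ihl < sizeof(struct ipv4_hdr))
		return -1;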

> 
> > +	return -1;
> > +}
> > +
> > +uint16_t gro_tcp_tbl_flush(struct gro_tcp_tbl *tbl,
> > +		struct rte_mbuf **out,
> > +		const uint16_t nb_out)
> > +{
> > +	uint32_t i, num = 0;
> > +
> > +	if (nb_out < tbl->item_num)
> > +		return 0;
> 
> And how would the user know how many items are now in the table?

I will add an API to tell users the item number. Thanks.
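
As a sketch, it could be a trivial accessor (the name is hypothetical):

static inline uint32_t
gro_tcp_tbl_item_num(const struct gro_tcp_tbl *tbl)
{
	return tbl->item_num;
}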

> 
> > +
> > +	for (i = 0; i < tbl->max_item_num; i++) {
> > +		if (tbl->items[i].is_valid) {
> > +			out[num++] = tbl->items[i].pkt;
> > +			tbl->items[i].is_valid = 0;
> > +			tbl->item_num--;
> > +		}
> > +	}
> > +	memset(tbl->keys, 0, sizeof(struct gro_tcp_key) *
> > +			tbl->max_key_num);
> > +	tbl->key_num = 0;
> > +
> > +	return num;
> > +}
> > +
> > +uint16_t
> > +gro_tcp_tbl_timeout_flush(struct gro_tcp_tbl *tbl,
> > +		uint64_t timeout_cycles,
> > +		struct rte_mbuf **out,
> > +		const uint16_t nb_out)
> > +{
> > +	uint16_t k;
> > +	uint32_t i, j;
> > +	uint64_t current_time;
> > +
> > +	if (nb_out == 0)
> > +		return 0;
> > +	k = 0;
> > +	current_time = rte_rdtsc();
> > +
> > +	for (i = 0; i < tbl->max_key_num; i++) {
> > +		if (tbl->keys[i].is_valid) {
> 
> Seems pretty expensive to traverse the whole table...
> Would it be worth having some sort of LRU list?
> 
> > +			j = tbl->keys[i].start_index;
> > +			while (j != INVALID_ARRAY_INDEX) {
> > +				if (current_time - tbl->items[j].start_time >=
> > +						timeout_cycles) {
> > +					out[k++] = tbl->items[j].pkt;
> > +					tbl->items[j].is_valid = 0;
> > +					tbl->item_num--;
> > +					j = tbl->items[j].next_pkt_idx;
> > +
> > +					if (k == nb_out &&
> > +							j == INVALID_ARRAY_INDEX) {
> > +						/* delete the key */
> > +						tbl->keys[i].is_valid = 0;
> > +						tbl->key_num--;
> > +						goto end;
> 
> Please rearrange the code to avoid gotos.
> 
> > +					} else if (k == nb_out &&
> > +							j != INVALID_ARRAY_INDEX) {
> > +						/* update the first item index */
> > +						tbl->keys[i].start_index = j;
> > +						goto end;
> > +					}
> > +				}
> > +			}
> > +			/* delete the key, as all of its packets are flushed */
> > +			tbl->keys[i].is_valid = 0;
> > +			tbl->key_num--;
> > +		}
> > +		if (tbl->key_num == 0)
> > +			goto end;
> > +	}
> > +end:
> > +	return k;
> > +}
> > diff --git a/lib/librte_gro/rte_gro_tcp.h b/lib/librte_gro/rte_gro_tcp.h
> > new file mode 100644
> > index 0000000..a9a7aca
> > --- /dev/null
> > +++ b/lib/librte_gro/rte_gro_tcp.h
> > @@ -0,0 +1,191 @@
> > +/*-
> > + *   BSD LICENSE
> > + *
> > + *   Copyright(c) 2017 Intel Corporation. All rights reserved.
> > + *
> > + *   Redistribution and use in source and binary forms, with or without
> > + *   modification, are permitted provided that the following conditions
> > + *   are met:
> > + *
> > + *     * Redistributions of source code must retain the above copyright
> > + *       notice, this list of conditions and the following disclaimer.
> > + *     * Redistributions in binary form must reproduce the above copyright
> > + *       notice, this list of conditions and the following disclaimer in
> > + *       the documentation and/or other materials provided with the
> > + *       distribution.
> > + *     * Neither the name of Intel Corporation nor the names of its
> > + *       contributors may be used to endorse or promote products derived
> > + *       from this software without specific prior written permission.
> > + *
> > + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> > + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> > + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> > + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> > + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> > + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> > + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> > + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> > + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> > + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> > + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> > + */
> > +
> > +#ifndef _RTE_GRO_TCP_H_
> > +#define _RTE_GRO_TCP_H_
> > +
> > +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> > +#define TCP_HDR_LEN(tcph) \
> > +	((tcph->data_off >> 4) * 4)
> > +#define IPv4_HDR_LEN(iph) \
> > +	((iph->version_ihl & 0x0f) * 4)
> > +#else
> > +#define TCP_DATAOFF_MASK 0x0f
> > +#define TCP_HDR_LEN(tcph) \
> > +	((tcph->data_off & TCP_DATAOFF_MASK) * 4)
> > +#define IPv4_HDR_LEN(iph) \
> > +	((iph->version_ihl >> 4) * 4)
> > +#endif
> > +
> > +#define INVALID_ARRAY_INDEX 0xffffffffUL
> > +#define GRO_TCP_TBL_MAX_ITEM_NUM (UINT32_MAX - 1)
> > +
> > +/* criteria of merging packets */
> > +struct tcp_key {
> > +	struct ether_addr eth_saddr;
> > +	struct ether_addr eth_daddr;
> > +	uint32_t ip_src_addr[4];	/**< IPv4 uses the first 8B */
> > +	uint32_t ip_dst_addr[4];
> > +
> > +	uint32_t recv_ack;	/**< acknowledgment sequence number. */
> > +	uint16_t src_port;
> > +	uint16_t dst_port;
> > +	uint8_t tcp_flags;	/**< TCP flags. */
> > +};
> > +
> > +struct gro_tcp_key {
> > +	struct tcp_key key;
> > +	uint32_t start_index;	/**< the first packet index of the flow */
> > +	uint8_t is_valid;
> > +};
> > +
> > +struct gro_tcp_item {
> > +	struct rte_mbuf *pkt;	/**< packet address. */
> > +	/* the time when the packet is added into the table */
> > +	uint64_t start_time;
> > +	uint32_t next_pkt_idx;	/**< next packet index. */
> > +	/* flag to indicate if the packet is GROed */
> > +	uint8_t is_groed;
> > +	uint8_t is_valid;	/**< flag indicates if the item is valid */
> 
> Why do you need these 2 flags at all?
> Why not just reset let say pkt to NULL for invalid item?

Thanks, I will use NULL to replace is_valid.
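
For example, find_an_empty_item() would then become (a sketch, assuming
pkt == NULL marks an invalid item):

static uint32_t
find_an_empty_item(struct gro_tcp_tbl *tbl)
{
	uint32_t i;

	for (i = 0; i < tbl->max_item_num; i++)
		if (tbl->items[i].pkt == NULL)
			return i;
	return INVALID_ARRAY_INDEX;
}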

> 
> > +};
> > +
> > +/**
> > + * TCP reassembly table. Both TCP/IPv4 and TCP/IPv6 use the same table
> > + * structure.
> > + */
> > +struct gro_tcp_tbl {
> > +	struct gro_tcp_item *items;	/**< item array */
> > +	struct gro_tcp_key *keys;	/**< key array */
> > +	uint32_t item_num;	/**< current item number */
> > +	uint32_t key_num;	/**< current key num */
> > +	uint32_t max_item_num;	/**< item array size */
> > +	uint32_t max_key_num;	/**< key array size */
> > +};
> > +
> > +/**
> > + * This function creates a TCP reassembly table.
> > + *
> > + * @param socket_id
> > + *  socket index where the Ethernet port connects to.
> > + * @param max_flow_num
> > + *  the maximum number of flows in the TCP GRO table
> > + * @param max_item_per_flow
> > + *  the maximum packet number per flow.
> > + * @return
> > + *  on success, return a pointer to the created TCP GRO table.
> > + *  Otherwise, return NULL.
> > + */
> > +void *gro_tcp_tbl_create(uint16_t socket_id,
> > +		uint16_t max_flow_num,
> > +		uint16_t max_item_per_flow);
> > +
> > +/**
> > + * This function destroys a TCP reassembly table.
> > + * @param tbl
> > + *  a pointer that points to the TCP reassembly table.
> > + */
> > +void gro_tcp_tbl_destroy(void *tbl);
> > +
> > +/**
> > + * This function searches for a packet in the TCP reassembly table to
> > + * merge with the inputted one. To merge two packets is to chain them
> > + * together and update packet headers. If the packet is without data
> > + * (e.g. SYN, SYN-ACK packet), this function returns immediately.
> > + * Otherwise, the packet is either merged, or inserted into the table.
> > + * Besides, if there is no available space to insert the packet, this
> > + * function returns immediately too.
> > + *
> > + * This function assumes the inputted packet is with correct IPv4 and
> > + * TCP checksums. And if two packets are merged, it won't re-calculate
> > + * IPv4 and TCP checksums. Besides, if the inputted packet is IP
> > + * fragmented, it assumes the packet is complete (with TCP header).
> > + *
> > + * @param pkt
> > + *  packet to reassemble.
> > + * @param tbl
> > + *  a pointer that points to a TCP reassembly table.
> > + * @param max_packet_size
> > + *  max packet length after merging
> > + * @return
> > + *  if the packet doesn't have data, or there is no available space
> > + *  in the table to insert a new item or a new key, return a negative
> > + *  value. If the packet is merged successfully, return a positive
> > + *  value. If the packet is inserted into the table, return 0.
> > + */
> > +int32_t
> > +gro_tcp4_reassemble(struct rte_mbuf *pkt,
> > +		struct gro_tcp_tbl *tbl,
> > +		uint32_t max_packet_size);
> > +
> > +/**
> > + * This function flushes all packets in a TCP reassembly table to
> > + * applications, without updating checksums for merged packets.
> > + * If the array which is used to keep flushed packets is not large
> > + * enough, an error occurs and this function returns immediately.
> > + *
> > + * @param tbl
> > + *  a pointer that points to a TCP GRO table.
> > + * @param out
> > + *  pointer array which is used to keep flushed packets. Applications
> > + *  should guarantee it's large enough to hold all packets in the table.
> > + * @param nb_out
> > + *  the element number of out.
> > + * @return
> > + *  the number of flushed packets. If out is not large enough to hold
> > + *  all packets in the table, return 0.
> > + */
> > +uint16_t
> > +gro_tcp_tbl_flush(struct gro_tcp_tbl *tbl,
> > +		struct rte_mbuf **out,
> > +		const uint16_t nb_out);
> > +
> > +/**
> > + * This function flushes timeout packets in a TCP reassembly table to
> > + * applications, without updating checksums for merged packets.
> > + *
> > + * @param tbl
> > + *  a pointer that points to a TCP GRO table.
> > + * @param timeout_cycles
> > + *  the maximum time that packets can stay in the table.
> > + * @param out
> > + *  pointer array which is used to keep flushed packets.
> > + * @param nb_out
> > + *  the element number of out.
> > + * @return
> > + *  the number of packets that are returned.
> > + */
> > +uint16_t
> > +gro_tcp_tbl_timeout_flush(struct gro_tcp_tbl *tbl,
> > +		uint64_t timeout_cycles,
> > +		struct rte_mbuf **out,
> > +		const uint16_t nb_out);
> > +#endif
> > --
> > 2.7.4
  
Ananyev, Konstantin June 30, 2017, 12:07 p.m. UTC | #3
Hi Jiayu,

> > > +
> > > +int32_t
> > > +gro_tcp4_reassemble(struct rte_mbuf *pkt,
> > > +		struct gro_tcp_tbl *tbl,
> > > +		uint32_t max_packet_size)
> > > +{
> > > +	struct ether_hdr *eth_hdr;
> > > +	struct ipv4_hdr *ipv4_hdr;
> > > +	struct tcp_hdr *tcp_hdr;
> > > +	uint16_t ipv4_ihl, tcp_hl, tcp_dl;
> > > +
> > > +	struct tcp_key key;
> > > +	uint32_t cur_idx, prev_idx, item_idx;
> > > +	uint32_t i, key_idx;
> > > +
> > > +	eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
> > > +	ipv4_hdr = (struct ipv4_hdr *)(eth_hdr + 1);
> > > +	ipv4_ihl = IPv4_HDR_LEN(ipv4_hdr);
> > > +
> > > +	/* check if the packet should be processed */
> > > +	if (ipv4_ihl < sizeof(struct ipv4_hdr))
> > > +		goto fail;
> > > +	tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + ipv4_ihl);
> > > +	tcp_hl = TCP_HDR_LEN(tcp_hdr);
> > > +	tcp_dl = rte_be_to_cpu_16(ipv4_hdr->total_length) - ipv4_ihl
> > > +		- tcp_hl;
> > > +	if (tcp_dl == 0)
> > > +		goto fail;
> > > +
> > > +	/* find a key and traverse all packets in its item group */
> > > +	key.eth_saddr = eth_hdr->s_addr;
> > > +	key.eth_daddr = eth_hdr->d_addr;
> > > +	key.ip_src_addr[0] = rte_be_to_cpu_32(ipv4_hdr->src_addr);
> > > +	key.ip_dst_addr[0] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
> >
> > Your key.ip_src_addr[1-3] still contains some junk.
> > How is memcmp below supposed to work properly?
> 
> > When we allocate an item, we already guarantee the content of its
> > memory space is 0. So the memcmp won't go wrong.

key is allocated on the stack.
Obviously, fields that are not initialized manually will contain undefined values,
i.e. ip_src_addr[1-3].
Then below you are doing:
memcmp(&(tbl->keys[i].key), &key, sizeof(struct tcp_key));
...
memcpy(&(tbl->keys[key_idx].key), &key, sizeof(struct tcp_key));

So I still think you are comparing/copying some junk here.

> 
> > BTW why do you need 4 elems here, why just not uint32_t ip_src_addr;?
> > Same for ip_dst_addr.
> 
> I think tcp6 and tcp4 can share the same table structure, so I use a
> 128-bit IP address here. You mean we need to use different structures
> for tcp4 and tcp6?

That would be my preference.

> 
> >
> > > +	key.src_port = rte_be_to_cpu_16(tcp_hdr->src_port);
> > > +	key.dst_port = rte_be_to_cpu_16(tcp_hdr->dst_port);
> > > +	key.recv_ack = rte_be_to_cpu_32(tcp_hdr->recv_ack);
> > > +	key.tcp_flags = tcp_hdr->tcp_flags;
> > > +
> > > +	for (i = 0; i < tbl->max_key_num; i++) {
> > > +		if (tbl->keys[i].is_valid &&
> > > +				(memcmp(&(tbl->keys[i].key), &key,
> > > +						sizeof(struct tcp_key))
> > > +				 == 0)) {
> > > +			cur_idx = tbl->keys[i].start_index;
> > > +			prev_idx = cur_idx;
> > > +			while (cur_idx != INVALID_ARRAY_INDEX) {
> > > +				if (check_seq_option(tbl->items[cur_idx].pkt,
> > > +							tcp_hdr,
> > > +							tcp_hl) > 0) {
> >
> > As I remember linux gro also checks ipv4 packet_id - it should be consecutive.
> 
> IP fragmented packets have the same IP ID, but others are consecutive.

Yes, you assume that they are consecutive.
But the only way to know for sure that they are - check it.
Another thing - I think we need to perform GRO only for TCP packets with only ACK bit set
(i.e. - no GRO for FIN/SYN/PUSH/URG/, etc.).

> Since we suppose GRO can merge IP fragmented packets, I think we shouldn't check if
> the IP ID is consecutive here. What do you think?
> 
> >
> > > +					if (merge_two_tcp4_packets(
> > > +								tbl->items[cur_idx].pkt,
> > > +								pkt,
> > > +								max_packet_size) > 0) {
> > > +						/* successfully merge two packets */
> > > +						tbl->items[cur_idx].is_groed = 1;
> > > +						return 1;
> > > +					}
> >
> > If you allow more than one packet per flow to be stored in the table, then you should be
> > prepared that a new segment can fill a gap between 2 packets.
> > Probably the easiest thing - don't allow more than one 'item' per flow.
> 
> We allow the table to store same flow but out-of-order arriving packets. For
> these packets, they will occupy different 'items' and we won't re-merge them.
> For example, there are three same flow TCP packets: p1, p2 and p3. And p1 arrives
> first, then p3, and last is p2. So TCP GRO will allocate one 'item' for p1 and one
> 'item' for p3, and when p2 arrives, p2 will be merged with p1. Therefore, in the
> table, we will have two 'items': item1 to store the merged p1 and p2, item2 to store p3.
> 
> As you can see, TCP GRO can only merge sequentially arriving packets. If we want to
> merge all out-of-order arriving packets, we need to re-process those packets which
> are already processed and have one 'item'. IMO, this procedure will be very complicated.
> So we don't do that.
> 
> Sorry, I don't understand how to allow one 'item' per flow, because packets arrive
> out-of-order. If we don't re-process the packets which already have one 'item', how
> can we guarantee it?

As I understand, you'll have an input array:
<seq=2, seq=1> - you wouldn't be able to merge it.
So I think your merge needs to be prepared to merge both smaller and bigger sequences.
About one item, my thought: instead of allowing N items per key (flow), for simplicity
just allow one item per flow.
In that case we wouldn't allow holes, but would still be able to merge reordered packets.
Alternatively, you can store items ordered by seq and, after a merge, try to merge
neighbor items too.

> 
> >
> > > +					/**
> > > +					 * fail to merge two packets since
> > > +					 * it's beyond the max packet length.
> > > +					 * Insert it into the item group.
> > > +					 */
> > > +					goto insert_to_item_group;
> > > +				} else {
> > > +					prev_idx = cur_idx;
> > > +					cur_idx = tbl->items[cur_idx].next_pkt_idx;
> > > +				}
> > > +			}
> > > +			/**
> > > +			 * find a corresponding item group but fails to find
> > > +			 * one packet to merge. Insert it into this item group.
> > > +			 */
> > > +insert_to_item_group:
> > > +			item_idx = find_an_empty_item(tbl);
> > > +			/* the item number is greater than the max value */
> > > +			if (item_idx == INVALID_ARRAY_INDEX)
> > > +				return -1;
> > > +			tbl->items[prev_idx].next_pkt_idx = item_idx;
> > > +			tbl->items[item_idx].pkt = pkt;
> > > +			tbl->items[item_idx].is_groed = 0;
> > > +			tbl->items[item_idx].next_pkt_idx = INVALID_ARRAY_INDEX;
> > > +			tbl->items[item_idx].is_valid = 1;
> > > +			tbl->items[item_idx].start_time = rte_rdtsc();
> > > +			tbl->item_num++;
> > > +			return 0;
> > > +		}
> > > +	}
> > > +
> > > +	/**
> > > +	 * merge fail as the given packet has a new key.
> > > +	 * So insert a new key.
> > > +	 */
> > > +	item_idx = find_an_empty_item(tbl);
> > > +	key_idx = find_an_empty_key(tbl);
> > > +	/**
> > > +	 * if current key or item number is greater than the max
> > > +	 * value, don't insert the packet into the table and return
> > > +	 * immediately.
> > > +	 */
> > > +	if (item_idx == INVALID_ARRAY_INDEX ||
> > > +			key_idx == INVALID_ARRAY_INDEX)
> > > +		return -1;
> > > +	tbl->items[item_idx].pkt = pkt;
> > > +	tbl->items[item_idx].next_pkt_idx = INVALID_ARRAY_INDEX;
> > > +	tbl->items[item_idx].is_groed = 0;
> > > +	tbl->items[item_idx].is_valid = 1;
> > > +	tbl->items[item_idx].start_time = rte_rdtsc();
> >
  
Hu, Jiayu June 30, 2017, 3:40 p.m. UTC | #4
Hi Konstantin,

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Friday, June 30, 2017 8:07 PM
> To: Hu, Jiayu <jiayu.hu@intel.com>
> Cc: dev@dpdk.org; Tan, Jianfeng <jianfeng.tan@intel.com>;
> stephen@networkplumber.org; yliu@fridaylinux.org; Wu, Jingjing
> <jingjing.wu@intel.com>; Yao, Lei A <lei.a.yao@intel.com>; Wiles, Keith
> <keith.wiles@intel.com>; Bie, Tiwei <tiwei.bie@intel.com>
> Subject: RE: [PATCH v7 2/3] lib/gro: add TCP/IPv4 GRO support
> 
> Hi Jiayu,
> 
> > > > +
> > > > +int32_t
> > > > +gro_tcp4_reassemble(struct rte_mbuf *pkt,
> > > > +		struct gro_tcp_tbl *tbl,
> > > > +		uint32_t max_packet_size)
> > > > +{
> > > > +	struct ether_hdr *eth_hdr;
> > > > +	struct ipv4_hdr *ipv4_hdr;
> > > > +	struct tcp_hdr *tcp_hdr;
> > > > +	uint16_t ipv4_ihl, tcp_hl, tcp_dl;
> > > > +
> > > > +	struct tcp_key key;
> > > > +	uint32_t cur_idx, prev_idx, item_idx;
> > > > +	uint32_t i, key_idx;
> > > > +
> > > > +	eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
> > > > +	ipv4_hdr = (struct ipv4_hdr *)(eth_hdr + 1);
> > > > +	ipv4_ihl = IPv4_HDR_LEN(ipv4_hdr);
> > > > +
> > > > +	/* check if the packet should be processed */
> > > > +	if (ipv4_ihl < sizeof(struct ipv4_hdr))
> > > > +		goto fail;
> > > > +	tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + ipv4_ihl);
> > > > +	tcp_hl = TCP_HDR_LEN(tcp_hdr);
> > > > +	tcp_dl = rte_be_to_cpu_16(ipv4_hdr->total_length) - ipv4_ihl
> > > > +		- tcp_hl;
> > > > +	if (tcp_dl == 0)
> > > > +		goto fail;
> > > > +
> > > > +	/* find a key and traverse all packets in its item group */
> > > > +	key.eth_saddr = eth_hdr->s_addr;
> > > > +	key.eth_daddr = eth_hdr->d_addr;
> > > > +	key.ip_src_addr[0] = rte_be_to_cpu_32(ipv4_hdr->src_addr);
> > > > +	key.ip_dst_addr[0] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
> > >
> > > Your key.ip_src_addr[1-3] still contains some junk.
> > > How is memcmp below supposed to work properly?
> >
> > When we allocate an item, we already guarantee the content of its
> > memory space is 0. So the memcmp won't go wrong.
> 
> key is allocated on the stack.
> Obviously, fields that are not initialized manually will contain undefined values,
> i.e. ip_src_addr[1-3].
> Then below you are doing:
> memcmp(&(tbl->keys[i].key), &key, sizeof(struct tcp_key));
> ...
> memcpy(&(tbl->keys[key_idx].key), &key, sizeof(struct tcp_key));
> 
> So I still think you are comparing/copying some junk here.

Oh, yes. The key is allocated on the stack. Thanks, I will modify it.
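
Something like this at the top of gro_tcp4_reassemble() (a sketch):

	struct tcp_key key;

	memset(&key, 0, sizeof(key));	/* clear the unused address words */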

> 
> >
> > > BTW why do you need 4 elems here, why just not uint32_t ip_src_addr;?
> > > Same for ip_dst_addr.
> >
> > I think tcp6 and tcp4 can share the same table structure, so I use a
> > 128-bit IP address here. You mean we need to use different structures
> > for tcp4 and tcp6?
> 
> That would be my preference.

Got it. I will modify it. Thanks.
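
For example, an IPv4-only key could shrink to something like this (a
sketch; field names follow this patch):

struct tcp4_key {
	struct ether_addr eth_saddr;
	struct ether_addr eth_daddr;
	uint32_t ip_src_addr;
	uint32_t ip_dst_addr;
	uint32_t recv_ack;
	uint16_t src_port;
	uint16_t dst_port;
	uint8_t tcp_flags;
};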

> 
> >
> > >
> > > > +	key.src_port = rte_be_to_cpu_16(tcp_hdr->src_port);
> > > > +	key.dst_port = rte_be_to_cpu_16(tcp_hdr->dst_port);
> > > > +	key.recv_ack = rte_be_to_cpu_32(tcp_hdr->recv_ack);
> > > > +	key.tcp_flags = tcp_hdr->tcp_flags;
> > > > +
> > > > +	for (i = 0; i < tbl->max_key_num; i++) {
> > > > +		if (tbl->keys[i].is_valid &&
> > > > +				(memcmp(&(tbl->keys[i].key), &key,
> > > > +						sizeof(struct tcp_key))
> > > > +				 == 0)) {
> > > > +			cur_idx = tbl->keys[i].start_index;
> > > > +			prev_idx = cur_idx;
> > > > +			while (cur_idx != INVALID_ARRAY_INDEX) {
> > > > +				if (check_seq_option(tbl->items[cur_idx].pkt,
> > > > +							tcp_hdr,
> > > > +							tcp_hl) > 0) {
> > >
> > > As I remember linux gro also checks ipv4 packet_id - it should be
> > > consecutive.
> >
> > IP fragmented packets have the same IP ID, but others are consecutive.
> 
> Yes, you assume that they are consecutive.
> But the only way to know for sure that they are - check it.

Yes, I will modify it. Thanks.

> Another thing - I think we need to perform GRO only for TCP packets with
> only ACK bit set
> (i.e. - no GRO for FIN/SYN/PUSH/URG/, etc.).

Currently, we don't process packets whose payload length is 0. So if the packets
are SYN or FIN, we won't process them, since their payload length is 0. For URG,
PSH and RST, TCP/IPv4 GRO may still process these packets.

You are right. We shouldn't process packets whose URG, PSH or RST bit is set.
Thanks, I will modify it. 
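
As a sketch of this check, plus the IP ID check above (0x10 is the ACK
bit per RFC 793; the per-item 'prev_ip_id' bookkeeping is an assumption,
not something this patch stores):

/* only data segments with nothing but the ACK bit set are candidates */
static inline int
is_gro_candidate(const struct tcp_hdr *tcp_hdr)
{
	return tcp_hdr->tcp_flags == 0x10;
}

/* merging would additionally require a consecutive IPv4 ID */
static inline int
ip_id_is_consecutive(uint16_t prev_ip_id, const struct ipv4_hdr *ipv4_hdr)
{
	return rte_be_to_cpu_16(ipv4_hdr->packet_id) ==
		(uint16_t)(prev_ip_id + 1);
}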

> 
> > Since we suppose GRO can merge IP fragmented packets, I think we shouldn't
> > check if the IP ID is consecutive here. What do you think?
> >
> > >
> > > > +					if (merge_two_tcp4_packets(
> > > > +								tbl->items[cur_idx].pkt,
> > > > +								pkt,
> > > > +								max_packet_size) > 0) {
> > > > +						/* successfully merge two packets */
> > > > +						tbl->items[cur_idx].is_groed = 1;
> > > > +						return 1;
> > > > +					}
> > >
> > > If you allow more than one packet per flow to be stored in the table, then you
> > > should be prepared that a new segment can fill a gap between 2 packets.
> > > Probably the easiest thing - don't allow more than one 'item' per flow.
> >
> > We allow the table to store same flow but out-of-order arriving packets. For
> > these packets, they will occupy different 'items' and we won't re-merge them.
> > For example, there are three same flow TCP packets: p1, p2 and p3. And p1 arrives
> > first, then p3, and last is p2. So TCP GRO will allocate one 'item' for p1 and one
> > 'item' for p3, and when p2 arrives, p2 will be merged with p1. Therefore, in the
> > table, we will have two 'items': item1 to store the merged p1 and p2, item2 to
> > store p3.
> >
> > As you can see, TCP GRO can only merge sequentially arriving packets. If we
> > want to merge all out-of-order arriving packets, we need to re-process those
> > packets which are already processed and have one 'item'. IMO, this procedure
> > will be very complicated. So we don't do that.
> >
> > Sorry, I don't understand how to allow one 'item' per flow, because packets
> > arrive out-of-order. If we don't re-process the packets which already have one
> > 'item', how can we guarantee it?
> 
> As I understand, you'll have an input array:
> <seq=2, seq=1> - you wouldn't be able to merge it.
> So I think your merge needs to be prepared to merge both smaller and bigger
> sequences.

Oh yes, it's much better. I will modify it.
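
For instance, the neighbor check could report the direction (a sketch;
seq/len values are in host byte order, and the return convention is
illustrative):

/* 1: new segment follows the stored one; -1: it precedes it; 0: no match */
static int
seq_neighbor(uint32_t stored_seq, uint16_t stored_dl,
		uint32_t new_seq, uint16_t new_dl)
{
	if (new_seq == stored_seq + stored_dl)
		return 1;
	if (stored_seq == new_seq + new_dl)
		return -1;
	return 0;
}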

> About one item, my thought: instead of allowing N items per key (flow), for
> simplicity just allow one item per flow.
> In that case we wouldn't allow holes, but would still be able to merge
> reordered packets.
> Alternatively, you can store items ordered by seq and, after a merge, try to
> merge neighbor items too.

Yes, when we insert a new item, we can chain it with the packets of its item
group, ordered by seq. After processing all packets, we need to traverse each
item group and try to merge neighbors. But when will 'merge neighbors' happen?
When flushing packets from the table (e.g. gro_tcp_tbl_timeout_flush)?
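
For example, if we merge at insertion time, it could look like this (a
sketch, assuming a seq-ordered chain, the seq_neighbor() idea above, and
pkt == NULL marking invalid items):

static void
try_merge_with_next(struct gro_tcp_tbl *tbl, uint32_t cur,
		uint32_t max_packet_size)
{
	uint32_t next = tbl->items[cur].next_pkt_idx;

	if (next == INVALID_ARRAY_INDEX)
		return;
	/* a seq_neighbor() check on cur/next must pass before merging */
	if (merge_two_tcp4_packets(tbl->items[cur].pkt,
			tbl->items[next].pkt, max_packet_size) > 0) {
		tbl->items[cur].next_pkt_idx = tbl->items[next].next_pkt_idx;
		tbl->items[next].pkt = NULL;
		tbl->item_num--;
	}
}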

BRs,
Jiayu
> 
> >
> > >
> > > > +					/**
> > > > +					 * fail to merge two packets since
> > > > +					 * it's beyond the max packet length.
> > > > +					 * Insert it into the item group.
> > > > +					 */
> > > > +					goto insert_to_item_group;
> > > > +				} else {
> > > > +					prev_idx = cur_idx;
> > > > +					cur_idx = tbl->items[cur_idx].next_pkt_idx;
> > > > +				}
> > > > +			}
> > > > +			/**
> > > > +			 * find a corresponding item group but fails to find
> > > > +			 * one packet to merge. Insert it into this item group.
> > > > +			 */
> > > > +insert_to_item_group:
> > > > +			item_idx = find_an_empty_item(tbl);
> > > > +			/* the item number is greater than the max value */
> > > > +			if (item_idx == INVALID_ARRAY_INDEX)
> > > > +				return -1;
> > > > +			tbl->items[prev_idx].next_pkt_idx = item_idx;
> > > > +			tbl->items[item_idx].pkt = pkt;
> > > > +			tbl->items[item_idx].is_groed = 0;
> > > > +			tbl->items[item_idx].next_pkt_idx = INVALID_ARRAY_INDEX;
> > > > +			tbl->items[item_idx].is_valid = 1;
> > > > +			tbl->items[item_idx].start_time = rte_rdtsc();
> > > > +			tbl->item_num++;
> > > > +			return 0;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	/**
> > > > +	 * merge fail as the given packet has a new key.
> > > > +	 * So insert a new key.
> > > > +	 */
> > > > +	item_idx = find_an_empty_item(tbl);
> > > > +	key_idx = find_an_empty_key(tbl);
> > > > +	/**
> > > > +	 * if current key or item number is greater than the max
> > > > +	 * value, don't insert the packet into the table and return
> > > > +	 * immediately.
> > > > +	 */
> > > > +	if (item_idx == INVALID_ARRAY_INDEX ||
> > > > +			key_idx == INVALID_ARRAY_INDEX)
> > > > +		return -1;
> > > > +	tbl->items[item_idx].pkt = pkt;
> > > > +	tbl->items[item_idx].next_pkt_idx = INVALID_ARRAY_INDEX;
> > > > +	tbl->items[item_idx].is_groed = 0;
> > > > +	tbl->items[item_idx].is_valid = 1;
> > > > +	tbl->items[item_idx].start_time = rte_rdtsc();
> > >
  

Patch

diff --git a/doc/guides/rel_notes/release_17_08.rst b/doc/guides/rel_notes/release_17_08.rst
index 842f46f..f067247 100644
--- a/doc/guides/rel_notes/release_17_08.rst
+++ b/doc/guides/rel_notes/release_17_08.rst
@@ -75,6 +75,13 @@  New Features
 
   Added support for firmwares with multiple Ethernet ports per physical port.
 
+* **Add Generic Receive Offload API support.**
+
+  Generic Receive Offload (GRO) API supports reassembling TCP/IPv4
+  packets. GRO API assumes all inputted packets are with correct
+  checksums. GRO API doesn't update checksums for merged packets. If
+  inputted packets are IP fragmented, GRO API assumes they are complete
+  packets (i.e. with L4 headers).
 
 Resolved Issues
 ---------------
diff --git a/lib/librte_gro/Makefile b/lib/librte_gro/Makefile
index 7e0f128..e89344d 100644
--- a/lib/librte_gro/Makefile
+++ b/lib/librte_gro/Makefile
@@ -43,6 +43,7 @@  LIBABIVER := 1
 
 # source files
 SRCS-$(CONFIG_RTE_LIBRTE_GRO) += rte_gro.c
+SRCS-$(CONFIG_RTE_LIBRTE_GRO) += rte_gro_tcp.c
 
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_GRO)-include += rte_gro.h
diff --git a/lib/librte_gro/rte_gro.c b/lib/librte_gro/rte_gro.c
index 33275e8..5b89928 100644
--- a/lib/librte_gro/rte_gro.c
+++ b/lib/librte_gro/rte_gro.c
@@ -32,11 +32,15 @@ 
 
 #include <rte_malloc.h>
 #include <rte_mbuf.h>
+#include <rte_ethdev.h>
 
 #include "rte_gro.h"
+#include "rte_gro_tcp.h"
 
-static gro_tbl_create_fn tbl_create_functions[GRO_TYPE_MAX_NUM];
-static gro_tbl_destroy_fn tbl_destroy_functions[GRO_TYPE_MAX_NUM];
+static gro_tbl_create_fn tbl_create_functions[GRO_TYPE_MAX_NUM] = {
+	gro_tcp_tbl_create, NULL};
+static gro_tbl_destroy_fn tbl_destroy_functions[GRO_TYPE_MAX_NUM] = {
+	gro_tcp_tbl_destroy, NULL};
 
 struct rte_gro_tbl *rte_gro_tbl_create(uint16_t socket_id,
 		uint16_t max_flow_num,
@@ -94,32 +98,121 @@  void rte_gro_tbl_destroy(struct rte_gro_tbl *gro_tbl)
 }
 
 uint16_t
-rte_gro_reassemble_burst(struct rte_mbuf **pkts __rte_unused,
+rte_gro_reassemble_burst(struct rte_mbuf **pkts,
 		const uint16_t nb_pkts,
-		const struct rte_gro_param param __rte_unused)
+		const struct rte_gro_param param)
 {
-	return nb_pkts;
+	uint16_t i;
+	uint16_t nb_after_gro = nb_pkts;
+	uint32_t item_num = RTE_MIN(nb_pkts, param.max_flow_num *
+			param.max_item_per_flow);
+
+	/* allocate a reassembly table for TCP/IPv4 GRO */
+	uint32_t tcp_item_num = RTE_MIN(item_num, GRO_MAX_BURST_ITEM_NUM);
+	struct gro_tcp_tbl tcp_tbl;
+	struct gro_tcp_key tcp_keys[tcp_item_num];
+	struct gro_tcp_item tcp_items[tcp_item_num];
+
+	struct rte_mbuf *unprocess_pkts[nb_pkts];
+	uint16_t unprocess_num = 0;
+	int32_t ret;
+
+	memset(tcp_keys, 0, sizeof(struct gro_tcp_key) *
+			tcp_item_num);
+	memset(tcp_items, 0, sizeof(struct gro_tcp_item) *
+			tcp_item_num);
+	tcp_tbl.keys = tcp_keys;
+	tcp_tbl.items = tcp_items;
+	tcp_tbl.key_num = 0;
+	tcp_tbl.item_num = 0;
+	tcp_tbl.max_key_num = tcp_item_num;
+	tcp_tbl.max_item_num = tcp_item_num;
+
+	for (i = 0; i < nb_pkts; i++) {
+		if (RTE_ETH_IS_IPV4_HDR(pkts[i]->packet_type)) {
+			if ((pkts[i]->packet_type & RTE_PTYPE_L4_TCP) &&
+				(param.desired_gro_types &
+					 GRO_TCP_IPV4)) {
+				ret = gro_tcp4_reassemble(pkts[i],
+						&tcp_tbl,
+						param.max_packet_size);
+				/* merge successfully */
+				if (ret > 0)
+					nb_after_gro--;
+				else if (ret < 0)
+					unprocess_pkts[unprocess_num++] =
+						pkts[i];
+			} else
+				unprocess_pkts[unprocess_num++] =
+					pkts[i];
+		} else
+			unprocess_pkts[unprocess_num++] =
+				pkts[i];
+	}
+
+	/* re-arrange GROed packets */
+	if (nb_after_gro < nb_pkts) {
+		if (param.desired_gro_types & GRO_TCP_IPV4)
+			i = gro_tcp_tbl_flush(&tcp_tbl, pkts, nb_pkts);
+		if (unprocess_num > 0) {
+			memcpy(&pkts[i], unprocess_pkts,
+					sizeof(struct rte_mbuf *) *
+					unprocess_num);
+			i += unprocess_num;
+		}
+		if (nb_pkts > i)
+			memset(&pkts[i], 0,
+					sizeof(struct rte_mbuf *) *
+					(nb_pkts - i));
+	}
+	return nb_after_gro;
 }
 
-int rte_gro_reassemble(struct rte_mbuf *pkt __rte_unused,
-		struct rte_gro_tbl *gro_tbl __rte_unused)
+int rte_gro_reassemble(struct rte_mbuf *pkt,
+		struct rte_gro_tbl *gro_tbl)
 {
+	if (unlikely(pkt == NULL))
+		return -1;
+
+	if (RTE_ETH_IS_IPV4_HDR(pkt->packet_type)) {
+		if ((pkt->packet_type & RTE_PTYPE_L4_TCP) &&
+				(gro_tbl->desired_gro_types &
+				 GRO_TCP_IPV4))
+			return gro_tcp4_reassemble(pkt,
+					gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
+					gro_tbl->max_packet_size);
+	}
+
 	return -1;
 }
 
-uint16_t rte_gro_flush(struct rte_gro_tbl *gro_tbl __rte_unused,
-		uint64_t desired_gro_types __rte_unused,
-		struct rte_mbuf **out __rte_unused,
-		const uint16_t max_nb_out __rte_unused)
+uint16_t rte_gro_flush(struct rte_gro_tbl *gro_tbl,
+		uint64_t desired_gro_types,
+		struct rte_mbuf **out,
+		const uint16_t max_nb_out)
 {
+	desired_gro_types = desired_gro_types &
+		gro_tbl->desired_gro_types;
+	if (desired_gro_types & GRO_TCP_IPV4)
+		return gro_tcp_tbl_flush(
+				gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
+				out,
+				max_nb_out);
 	return 0;
 }
 
 uint16_t
-rte_gro_timeout_flush(struct rte_gro_tbl *gro_tbl __rte_unused,
-		uint64_t desired_gro_types __rte_unused,
-		struct rte_mbuf **out __rte_unused,
-		const uint16_t max_nb_out __rte_unused)
+rte_gro_timeout_flush(struct rte_gro_tbl *gro_tbl,
+		uint64_t desired_gro_types,
+		struct rte_mbuf **out,
+		const uint16_t max_nb_out)
 {
+	desired_gro_types = desired_gro_types &
+		gro_tbl->desired_gro_types;
+	if (desired_gro_types & GRO_TCP_IPV4)
+		return gro_tcp_tbl_timeout_flush(
+				gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
+				gro_tbl->max_timeout_cycles,
+				out, max_nb_out);
 	return 0;
 }
diff --git a/lib/librte_gro/rte_gro.h b/lib/librte_gro/rte_gro.h
index f9d36e8..a30b1c6 100644
--- a/lib/librte_gro/rte_gro.h
+++ b/lib/librte_gro/rte_gro.h
@@ -45,7 +45,11 @@  extern "C" {
 
 /* max number of supported GRO types */
 #define GRO_TYPE_MAX_NUM 64
-#define GRO_TYPE_SUPPORT_NUM 0	/**< current supported GRO num */
+#define GRO_TYPE_SUPPORT_NUM 1	/**< current supported GRO num */
+
+/* TCP/IPv4 GRO flag */
+#define GRO_TCP_IPV4_INDEX 0
+#define GRO_TCP_IPV4 (1ULL << GRO_TCP_IPV4_INDEX)
 
 /**
  * GRO table, which is used to merge packets. It keeps many reassembly
diff --git a/lib/librte_gro/rte_gro_tcp.c b/lib/librte_gro/rte_gro_tcp.c
new file mode 100644
index 0000000..c0eaa45
--- /dev/null
+++ b/lib/librte_gro/rte_gro_tcp.c
@@ -0,0 +1,394 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_malloc.h>
+#include <rte_mbuf.h>
+#include <rte_cycles.h>
+
+#include <rte_ethdev.h>
+#include <rte_ip.h>
+#include <rte_tcp.h>
+
+#include "rte_gro_tcp.h"
+
+void *gro_tcp_tbl_create(uint16_t socket_id,
+		uint16_t max_flow_num,
+		uint16_t max_item_per_flow)
+{
+	size_t size;
+	uint32_t entries_num;
+	struct gro_tcp_tbl *tbl;
+
+	entries_num = max_flow_num * max_item_per_flow;
+	entries_num = entries_num > GRO_TCP_TBL_MAX_ITEM_NUM ?
+		GRO_TCP_TBL_MAX_ITEM_NUM : entries_num;
+
+	if (entries_num == 0)
+		return NULL;
+
+	tbl = (struct gro_tcp_tbl *)rte_zmalloc_socket(
+			__func__,
+			sizeof(struct gro_tcp_tbl),
+			RTE_CACHE_LINE_SIZE,
+			socket_id);
+
+	size = sizeof(struct gro_tcp_item) * entries_num;
+	tbl->items = (struct gro_tcp_item *)rte_zmalloc_socket(
+			__func__,
+			size,
+			RTE_CACHE_LINE_SIZE,
+			socket_id);
+	tbl->max_item_num = entries_num;
+
+	size = sizeof(struct gro_tcp_key) * entries_num;
+	tbl->keys = (struct gro_tcp_key *)rte_zmalloc_socket(
+			__func__,
+			size, RTE_CACHE_LINE_SIZE,
+			socket_id);
+	tbl->max_key_num = entries_num;
+	return tbl;
+}
+
+void gro_tcp_tbl_destroy(void *tbl)
+{
+	struct gro_tcp_tbl *tcp_tbl = (struct gro_tcp_tbl *)tbl;
+
+	if (tcp_tbl) {
+		if (tcp_tbl->items)
+			rte_free(tcp_tbl->items);
+		if (tcp_tbl->keys)
+			rte_free(tcp_tbl->keys);
+		rte_free(tcp_tbl);
+	}
+}
+
+/**
+ * merge two TCP/IPv4 packets without updating checksums.
+ */
+static int
+merge_two_tcp4_packets(struct rte_mbuf *pkt_src,
+		struct rte_mbuf *pkt,
+		uint32_t max_packet_size)
+{
+	struct ipv4_hdr *ipv4_hdr1, *ipv4_hdr2;
+	struct tcp_hdr *tcp_hdr1;
+	uint16_t ipv4_ihl1, tcp_hl1, tcp_dl1;
+	struct rte_mbuf *tail;
+
+	/* parse the given packet */
+	ipv4_hdr1 = (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt,
+				struct ether_hdr *) + 1);
+	ipv4_ihl1 = IPv4_HDR_LEN(ipv4_hdr1);
+	tcp_hdr1 = (struct tcp_hdr *)((char *)ipv4_hdr1 + ipv4_ihl1);
+	tcp_hl1 = TCP_HDR_LEN(tcp_hdr1);
+	tcp_dl1 = rte_be_to_cpu_16(ipv4_hdr1->total_length) - ipv4_ihl1
+		- tcp_hl1;
+
+	/* parse the original packet */
+	ipv4_hdr2 = (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt_src,
+				struct ether_hdr *) + 1);
+
+	if (pkt_src->pkt_len + tcp_dl1 > max_packet_size)
+		return -1;
+
+	/* remove the header of the incoming packet */
+	rte_pktmbuf_adj(pkt, sizeof(struct ether_hdr) +
+			ipv4_ihl1 + tcp_hl1);
+
+	/* chain the two packet together */
+	tail = rte_pktmbuf_lastseg(pkt_src);
+	tail->next = pkt;
+
+	/* update IP header */
+	ipv4_hdr2->total_length = rte_cpu_to_be_16(
+			rte_be_to_cpu_16(
+				ipv4_hdr2->total_length)
+			+ tcp_dl1);
+
+	/* update mbuf metadata for the merged packet */
+	pkt_src->nb_segs++;
+	pkt_src->pkt_len += pkt->pkt_len;
+	return 1;
+}
+
+static int
+check_seq_option(struct rte_mbuf *pkt,
+		struct tcp_hdr *tcp_hdr,
+		uint16_t tcp_hl)
+{
+	struct ipv4_hdr *ipv4_hdr1;
+	struct tcp_hdr *tcp_hdr1;
+	uint16_t ipv4_ihl1, tcp_hl1, tcp_dl1;
+	uint32_t sent_seq1, sent_seq;
+	int ret = -1;
+
+	ipv4_hdr1 = (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt,
+				struct ether_hdr *) + 1);
+	ipv4_ihl1 = IPv4_HDR_LEN(ipv4_hdr1);
+	tcp_hdr1 = (struct tcp_hdr *)((char *)ipv4_hdr1 + ipv4_ihl1);
+	tcp_hl1 = TCP_HDR_LEN(tcp_hdr1);
+	tcp_dl1 = rte_be_to_cpu_16(ipv4_hdr1->total_length) - ipv4_ihl1
+		- tcp_hl1;
+	sent_seq1 = rte_be_to_cpu_32(tcp_hdr1->sent_seq) + tcp_dl1;
+	sent_seq = rte_be_to_cpu_32(tcp_hdr->sent_seq);
+
+	/* check if the two packets are neighbor */
+	if ((sent_seq ^ sent_seq1) == 0) {
+		/* check if TCP option field equals */
+		if (tcp_hl1 > sizeof(struct tcp_hdr)) {
+			if ((tcp_hl1 != tcp_hl) ||
+					(memcmp(tcp_hdr1 + 1,
+							tcp_hdr + 1,
+							tcp_hl - sizeof
+							(struct tcp_hdr))
+					 == 0))
+				ret = 1;
+		}
+	}
+	return ret;
+}
+
+static uint32_t
+find_an_empty_item(struct gro_tcp_tbl *tbl)
+{
+	uint32_t i;
+
+	for (i = 0; i < tbl->max_item_num; i++)
+		if (tbl->items[i].is_valid == 0)
+			return i;
+	return INVALID_ARRAY_INDEX;
+}
+
+static uint32_t
+find_an_empty_key(struct gro_tcp_tbl *tbl)
+{
+	uint32_t i;
+
+	for (i = 0; i < tbl->max_key_num; i++)
+		if (tbl->keys[i].is_valid == 0)
+			return i;
+	return INVALID_ARRAY_INDEX;
+}
+
+int32_t
+gro_tcp4_reassemble(struct rte_mbuf *pkt,
+		struct gro_tcp_tbl *tbl,
+		uint32_t max_packet_size)
+{
+	struct ether_hdr *eth_hdr;
+	struct ipv4_hdr *ipv4_hdr;
+	struct tcp_hdr *tcp_hdr;
+	uint16_t ipv4_ihl, tcp_hl, tcp_dl;
+
+	struct tcp_key key;
+	uint32_t cur_idx, prev_idx, item_idx;
+	uint32_t i, key_idx;
+
+	eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
+	ipv4_hdr = (struct ipv4_hdr *)(eth_hdr + 1);
+	ipv4_ihl = IPv4_HDR_LEN(ipv4_hdr);
+
+	/* check if the packet should be processed */
+	if (ipv4_ihl < sizeof(struct ipv4_hdr))
+		goto fail;
+	tcp_hdr = (struct tcp_hdr *)((char *)ipv4_hdr + ipv4_ihl);
+	tcp_hl = TCP_HDR_LEN(tcp_hdr);
+	tcp_dl = rte_be_to_cpu_16(ipv4_hdr->total_length) - ipv4_ihl
+		- tcp_hl;
+	if (tcp_dl == 0)
+		goto fail;
+
+	/* find a key and traverse all packets in its item group */
+	key.eth_saddr = eth_hdr->s_addr;
+	key.eth_daddr = eth_hdr->d_addr;
+	key.ip_src_addr[0] = rte_be_to_cpu_32(ipv4_hdr->src_addr);
+	key.ip_dst_addr[0] = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
+	key.src_port = rte_be_to_cpu_16(tcp_hdr->src_port);
+	key.dst_port = rte_be_to_cpu_16(tcp_hdr->dst_port);
+	key.recv_ack = rte_be_to_cpu_32(tcp_hdr->recv_ack);
+	key.tcp_flags = tcp_hdr->tcp_flags;
+
+	for (i = 0; i < tbl->max_key_num; i++) {
+		if (tbl->keys[i].is_valid &&
+				(memcmp(&(tbl->keys[i].key), &key,
+						sizeof(struct tcp_key))
+				 == 0)) {
+			cur_idx = tbl->keys[i].start_index;
+			prev_idx = cur_idx;
+			while (cur_idx != INVALID_ARRAY_INDEX) {
+				if (check_seq_option(tbl->items[cur_idx].pkt,
+							tcp_hdr,
+							tcp_hl) > 0) {
+					if (merge_two_tcp4_packets(
+								tbl->items[cur_idx].pkt,
+								pkt,
+								max_packet_size) > 0) {
+						/* successfully merge two packets */
+						tbl->items[cur_idx].is_groed = 1;
+						return 1;
+					}
+					/**
+					 * fail to merge two packets since
+					 * it's beyond the max packet length.
+					 * Insert it into the item group.
+					 */
+					goto insert_to_item_group;
+				} else {
+					prev_idx = cur_idx;
+					cur_idx = tbl->items[cur_idx].next_pkt_idx;
+				}
+			}
+			/**
+			 * find a corresponding item group but fails to find
+			 * one packet to merge. Insert it into this item group.
+			 */
+insert_to_item_group:
+			item_idx = find_an_empty_item(tbl);
+			/* the item number is greater than the max value */
+			if (item_idx == INVALID_ARRAY_INDEX)
+				return -1;
+			tbl->items[prev_idx].next_pkt_idx = item_idx;
+			tbl->items[item_idx].pkt = pkt;
+			tbl->items[item_idx].is_groed = 0;
+			tbl->items[item_idx].next_pkt_idx = INVALID_ARRAY_INDEX;
+			tbl->items[item_idx].is_valid = 1;
+			tbl->items[item_idx].start_time = rte_rdtsc();
+			tbl->item_num++;
+			return 0;
+		}
+	}
+
+	/**
+	 * merge fail as the given packet has a new key.
+	 * So insert a new key.
+	 */
+	item_idx = find_an_empty_item(tbl);
+	key_idx = find_an_empty_key(tbl);
+	/**
+	 * if current key or item number is greater than the max
+	 * value, don't insert the packet into the table and return
+	 * immediately.
+	 */
+	if (item_idx == INVALID_ARRAY_INDEX ||
+			key_idx == INVALID_ARRAY_INDEX)
+		return -1;
+	tbl->items[item_idx].pkt = pkt;
+	tbl->items[item_idx].next_pkt_idx = INVALID_ARRAY_INDEX;
+	tbl->items[item_idx].is_groed = 0;
+	tbl->items[item_idx].is_valid = 1;
+	tbl->items[item_idx].start_time = rte_rdtsc();
+	tbl->item_num++;
+
+	memcpy(&(tbl->keys[key_idx].key),
+			&key, sizeof(struct tcp_key));
+	tbl->keys[key_idx].start_index = item_idx;
+	tbl->keys[key_idx].is_valid = 1;
+	tbl->key_num++;
+
+	return 0;
+fail:
+	return -1;
+}
+
+uint16_t gro_tcp_tbl_flush(struct gro_tcp_tbl *tbl,
+		struct rte_mbuf **out,
+		const uint16_t nb_out)
+{
+	uint32_t i, num = 0;
+
+	if (nb_out < tbl->item_num)
+		return 0;
+
+	for (i = 0; i < tbl->max_item_num; i++) {
+		if (tbl->items[i].is_valid) {
+			out[num++] = tbl->items[i].pkt;
+			tbl->items[i].is_valid = 0;
+			tbl->item_num--;
+		}
+	}
+	memset(tbl->keys, 0, sizeof(struct gro_tcp_key) *
+			tbl->max_key_num);
+	tbl->key_num = 0;
+
+	return num;
+}
+
+uint16_t
+gro_tcp_tbl_timeout_flush(struct gro_tcp_tbl *tbl,
+		uint64_t timeout_cycles,
+		struct rte_mbuf **out,
+		const uint16_t nb_out)
+{
+	uint16_t k;
+	uint32_t i, j;
+	uint64_t current_time;
+
+	if (nb_out == 0)
+		return 0;
+	k = 0;
+	current_time = rte_rdtsc();
+
+	for (i = 0; i < tbl->max_key_num; i++) {
+		if (tbl->keys[i].is_valid) {
+			j = tbl->keys[i].start_index;
+			while (j != INVALID_ARRAY_INDEX) {
+				if (current_time - tbl->items[j].start_time >=
+						timeout_cycles) {
+					out[k++] = tbl->items[j].pkt;
+					tbl->items[j].is_valid = 0;
+					tbl->item_num--;
+					j = tbl->items[j].next_pkt_idx;
+
+					if (k == nb_out &&
+							j == INVALID_ARRAY_INDEX) {
+						/* delete the key */
+						tbl->keys[i].is_valid = 0;
+						tbl->key_num--;
+						goto end;
+					} else if (k == nb_out &&
+							j != INVALID_ARRAY_INDEX) {
+						/* update the first item index */
+						tbl->keys[i].start_index = j;
+						goto end;
+					}
+				}
+			}
+			/* delete the key, as all of its packets are flushed */
+			tbl->keys[i].is_valid = 0;
+			tbl->key_num--;
+		}
+		if (tbl->key_num == 0)
+			goto end;
+	}
+end:
+	return k;
+}
diff --git a/lib/librte_gro/rte_gro_tcp.h b/lib/librte_gro/rte_gro_tcp.h
new file mode 100644
index 0000000..a9a7aca
--- /dev/null
+++ b/lib/librte_gro/rte_gro_tcp.h
@@ -0,0 +1,191 @@ 
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_GRO_TCP_H_
+#define _RTE_GRO_TCP_H_
+
+#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
+#define TCP_HDR_LEN(tcph) \
+	((tcph->data_off >> 4) * 4)
+#define IPv4_HDR_LEN(iph) \
+	((iph->version_ihl & 0x0f) * 4)
+#else
+#define TCP_DATAOFF_MASK 0x0f
+#define TCP_HDR_LEN(tcph) \
+	((tcph->data_off & TCP_DATAOFF_MASK) * 4)
+#define IPv4_HDR_LEN(iph) \
+	((iph->version_ihl >> 4) * 4)
+#endif
+
+#define INVALID_ARRAY_INDEX 0xffffffffUL
+#define GRO_TCP_TBL_MAX_ITEM_NUM (UINT32_MAX - 1)
+
+/* criteria of merging packets */
+struct tcp_key {
+	struct ether_addr eth_saddr;
+	struct ether_addr eth_daddr;
+	uint32_t ip_src_addr[4];	/**< IPv4 uses the first 8B */
+	uint32_t ip_dst_addr[4];
+
+	uint32_t recv_ack;	/**< acknowledgment sequence number. */
+	uint16_t src_port;
+	uint16_t dst_port;
+	uint8_t tcp_flags;	/**< TCP flags. */
+};
+
+struct gro_tcp_key {
+	struct tcp_key key;
+	uint32_t start_index;	/**< the first packet index of the flow */
+	uint8_t is_valid;
+};
+
+struct gro_tcp_item {
+	struct rte_mbuf *pkt;	/**< packet address. */
+	/* the time when the packet is added into the table */
+	uint64_t start_time;
+	uint32_t next_pkt_idx;	/**< next packet index. */
+	/* flag to indicate if the packet is GROed */
+	uint8_t is_groed;
+	uint8_t is_valid;	/**< flag indicates if the item is valid */
+};
+
+/**
+ * TCP reassembly table. Both TCP/IPv4 and TCP/IPv6 use the same table
+ * structure.
+ */
+struct gro_tcp_tbl {
+	struct gro_tcp_item *items;	/**< item array */
+	struct gro_tcp_key *keys;	/**< key array */
+	uint32_t item_num;	/**< current item number */
+	uint32_t key_num;	/**< current key num */
+	uint32_t max_item_num;	/**< item array size */
+	uint32_t max_key_num;	/**< key array size */
+};
+
+/**
+ * This function creates a TCP reassembly table.
+ *
+ * @param socket_id
+ *  socket index where the Ethernet port connects to.
+ * @param max_flow_num
+ *  the maximum number of flows in the TCP GRO table
+ * @param max_item_per_flow
+ *  the maximum packet number per flow.
+ * @return
+ *  on success, return a pointer to the created TCP GRO table.
+ *  Otherwise, return NULL.
+ */
+void *gro_tcp_tbl_create(uint16_t socket_id,
+		uint16_t max_flow_num,
+		uint16_t max_item_per_flow);
+
+/**
+ * This function destroys a TCP reassembly table.
+ * @param tbl
+ *  a pointer that points to the TCP reassembly table.
+ */
+void gro_tcp_tbl_destroy(void *tbl);
+
+/**
+ * This function searches for a packet in the TCP reassembly table to
+ * merge with the inputted one. To merge two packets is to chain them
+ * together and update packet headers. If the packet is without data
+ * (e.g. SYN, SYN-ACK packet), this function returns immediately.
+ * Otherwise, the packet is either merged, or inserted into the table.
+ * Besides, if there is no available space to insert the packet, this
+ * function returns immediately too.
+ *
+ * This function assumes the inputted packet is with correct IPv4 and
+ * TCP checksums. And if two packets are merged, it won't re-calculate
+ * IPv4 and TCP checksums. Besides, if the inputted packet is IP
+ * fragmented, it assumes the packet is complete (with TCP header).
+ *
+ * @param pkt
+ *  packet to reassemble.
+ * @param tbl
+ *  a pointer that points to a TCP reassembly table.
+ * @param max_packet_size
+ *  max packet length after merging
+ * @return
+ *  if the packet doesn't have data, or there is no available space
+ *  in the table to insert a new item or a new key, return a negative
+ *  value. If the packet is merged successfully, return a positive
+ *  value. If the packet is inserted into the table, return 0.
+ */
+int32_t
+gro_tcp4_reassemble(struct rte_mbuf *pkt,
+		struct gro_tcp_tbl *tbl,
+		uint32_t max_packet_size);
+
+/**
+ * This function flushes all packets in a TCP reassembly table to
+ * applications, without updating checksums for merged packets.
+ * If the array which is used to keep flushed packets is not large
+ * enough, an error occurs and this function returns immediately.
+ *
+ * @param tbl
+ *  a pointer that points to a TCP GRO table.
+ * @param out
+ *  pointer array which is used to keep flushed packets. Applications
+ *  should guarantee it's large enough to hold all packets in the table.
+ * @param nb_out
+ *  the element number of out.
+ * @return
+ *  the number of flushed packets. If out is not large enough to hold
+ *  all packets in the table, return 0.
+ */
+uint16_t
+gro_tcp_tbl_flush(struct gro_tcp_tbl *tbl,
+		struct rte_mbuf **out,
+		const uint16_t nb_out);
+
+/**
+ * This function flushes timeout packets in a TCP reassembly table to
+ * applications, without updating checksums for merged packets.
+ *
+ * @param tbl
+ *  a pointer that points to a TCP GRO table.
+ * @param timeout_cycles
+ *  the maximum time that packets can stay in the table.
+ * @param out
+ *  pointer array which is used to keep flushed packets.
+ * @param nb_out
+ *  the element number of out.
+ * @return
+ *  the number of packets that are returned.
+ */
+uint16_t
+gro_tcp_tbl_timeout_flush(struct gro_tcp_tbl *tbl,
+		uint64_t timeout_cycles,
+		struct rte_mbuf **out,
+		const uint16_t nb_out);
+#endif