pcapng: fix write more packets than IOV_MAX limit

Message ID 20220725152811.409447-1-kuka@cesnet.cz (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series pcapng: fix write more packets than IOV_MAX limit |

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel-compilation fail Compilation issues
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-x86_64-unit-testing fail Testing issues
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-x86_64-compile-testing fail Testing issues
ci/intel-Testing success Testing PASS

Commit Message

Mário Kuka July 25, 2022, 3:28 p.m. UTC
  The rte_pcapng_write_packets() function fails when we try to write more
packets than the IOV_MAX limit. The error is caused by the writev()
system call, which is limited by the IOV_MAX limit. The iovcnt argument
is valid if it is greater than 0 and less than or equal to IOV_MAX as
defined in <limits.h>.

To avoid this problem, we can split the iovec buffer into smaller
chunks with a maximum size of IOV_MAX and write them sequentially by
calling the writev() repeatedly.

Fixes: 8d23ce8f5ee9 ("pcapng: add new library for writing pcapng files")
Cc: stephen@networkplumber.org

Signed-off-by: Mário Kuka <kuka@cesnet.cz>
---
 app/test/test_pcapng.c  | 42 ++++++++++++++++++++++++++++-
 lib/pcapng/rte_pcapng.c | 58 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 98 insertions(+), 2 deletions(-)
  

Comments

Stephen Hemminger July 29, 2022, 3:58 p.m. UTC | #1
On Fri, 29 Jul 2022 09:18:39 +0200
Mário Kuka <kuka@cesnet.cz> wrote:

> This patchset contains fixes for some issues that occur when writing a 
> large burst of packets at once, such as writing more packets than the
> IOV_MAX limit and the problem of partial writing of a packet to a file 
> if the writev() system call performs a partial write.
> 
> The typical use of pcapng in our cases is to copy the packets into a 
> separate buffer and the process of writing to the file is done in some
> slow path, for example by writing in a separate thread or at the end of
> the application, where we don't mind the limitation of the typically 
> slow speed of the storage medium.
> 
> Mário Kuka (2):
>   pcapng: fix write more packets than IOV_MAX limit
>   pcapng: check if writev() returns a partial write
> 
>  app/test/test_pcapng.c  | 42 +++++++++++++++++-
>  lib/pcapng/rte_pcapng.c | 96 +++++++++++++++++++++++++++++++++--------
>  2 files changed, 120 insertions(+), 18 deletions(-)
> 

You ignored my feedback from earlier patch.
It seems you are adding a lot more here that is necessary.
  
Stephen Hemminger July 29, 2022, 4 p.m. UTC | #2
On Fri, 29 Jul 2022 09:18:41 +0200
Mário Kuka <kuka@cesnet.cz> wrote:

> +pcapng_writev(int fd, struct iovec *iov, const int count)
> +{
> +	size_t total = 0;
> +	int at = 0;
> +
> +	while (at < count) {
> +		/*
> +		 * Note: writev() can return the following on a write request:
> +		 *     Complete:
> +		 *         written = [sum of all iov.iov_len]
> +		 *     Partial:
> +		 *         written < [sum of all iov.iov_len]
> +		 *     Deferred:
> +		 *         written = -1, errno = [EAGAIN]
> +		 *
> +		 * Partial and deferred writes are only possible with O_NONBLOCK set.
> +		 *
> +		 * If we get a partial result, we have to call the writev() again on any ivo buffers
> +		 * that have not been fully written.
> +		 */
> +		ssize_t written = writev(fd, &iov[at], count - at);
> +		if (unlikely(written < 0))
> +			return written;
> +
> +		total += written;
> +		at += pcapng_update_iov(&iov[at], count - at, written);
> +	}
> +
> +	return total;

Since this is being written to a file, handling partial writes makes little
sense. The only case where partial write would happen would be if filesystem
was full. Retrying just adds unnecessary complexity.

If you really want to track this, then add a dropped counter.
  
Mário Kuka July 29, 2022, 5:08 p.m. UTC | #3
> Since this is being written to a file, handling partial writes makes little
> sense. The only case where partial write would happen would be if filesystem
> was full. Retrying just adds unnecessary complexity.
>
> If you really want to track this, then add a dropped counter.

But the file descriptor doesn't have to refer to just a regular file, what
if it's a socket or a pipe or some device? The pcapng documentation doesn't
say anything about any restrictions, so the implementation should be fully
generic. What's the point of a function to write packets to a file 
descriptor
where there's a risk that it won't write all the packets or that the 
file will
by corrupted due to a partial write and still not even let me know about 
it?

On 29/07/2022 18:00, Stephen Hemminger wrote:
> On Fri, 29 Jul 2022 09:18:41 +0200
> Mário Kuka <kuka@cesnet.cz> wrote:
>
>> +pcapng_writev(int fd, struct iovec *iov, const int count)
>> +{
>> +	size_t total = 0;
>> +	int at = 0;
>> +
>> +	while (at < count) {
>> +		/*
>> +		 * Note: writev() can return the following on a write request:
>> +		 *     Complete:
>> +		 *         written = [sum of all iov.iov_len]
>> +		 *     Partial:
>> +		 *         written < [sum of all iov.iov_len]
>> +		 *     Deferred:
>> +		 *         written = -1, errno = [EAGAIN]
>> +		 *
>> +		 * Partial and deferred writes are only possible with O_NONBLOCK set.
>> +		 *
>> +		 * If we get a partial result, we have to call the writev() again on any ivo buffers
>> +		 * that have not been fully written.
>> +		 */
>> +		ssize_t written = writev(fd, &iov[at], count - at);
>> +		if (unlikely(written < 0))
>> +			return written;
>> +
>> +		total += written;
>> +		at += pcapng_update_iov(&iov[at], count - at, written);
>> +	}
>> +
>> +	return total;
> Since this is being written to a file, handling partial writes makes little
> sense. The only case where partial write would happen would be if filesystem
> was full. Retrying just adds unnecessary complexity.
>
> If you really want to track this, then add a dropped counter.
  
Mário Kuka July 29, 2022, 5:33 p.m. UTC | #4
> You ignored my feedback from earlier patch.
> It seems you are adding a lot more here that is necessary.

I split the changes into two separate patches, where [PATCH v2 1/2] 
addresses
the IOV_MAX limit issue and where I've incorporated your feedback + 
added a unit
test that tests this situation. if I did something wrong, let me know.

The problem of partial writig is addressed in the second patch [PATCH v2 
2/2]

On 29/07/2022 17:58, Stephen Hemminger wrote:
> On Fri, 29 Jul 2022 09:18:39 +0200
> Mário Kuka <kuka@cesnet.cz> wrote:
>
>> This patchset contains fixes for some issues that occur when writing a
>> large burst of packets at once, such as writing more packets than the
>> IOV_MAX limit and the problem of partial writing of a packet to a file
>> if the writev() system call performs a partial write.
>>
>> The typical use of pcapng in our cases is to copy the packets into a
>> separate buffer and the process of writing to the file is done in some
>> slow path, for example by writing in a separate thread or at the end of
>> the application, where we don't mind the limitation of the typically
>> slow speed of the storage medium.
>>
>> Mário Kuka (2):
>>    pcapng: fix write more packets than IOV_MAX limit
>>    pcapng: check if writev() returns a partial write
>>
>>   app/test/test_pcapng.c  | 42 +++++++++++++++++-
>>   lib/pcapng/rte_pcapng.c | 96 +++++++++++++++++++++++++++++++++--------
>>   2 files changed, 120 insertions(+), 18 deletions(-)
>>
> You ignored my feedback from earlier patch.
> It seems you are adding a lot more here that is necessary.
  
Stephen Hemminger July 29, 2022, 6:14 p.m. UTC | #5
On Fri, 29 Jul 2022 19:08:41 +0200
Mário Kuka <kuka@cesnet.cz> wrote:

> > Since this is being written to a file, handling partial writes makes little
> > sense. The only case where partial write would happen would be if filesystem
> > was full. Retrying just adds unnecessary complexity.
> >
> > If you really want to track this, then add a dropped counter.  
> 
> But the file descriptor doesn't have to refer to just a regular file, what
> if it's a socket or a pipe or some device? The pcapng documentation doesn't
> say anything about any restrictions, so the implementation should be fully
> generic. What's the point of a function to write packets to a file 
> descriptor
> where there's a risk that it won't write all the packets or that the 
> file will
> by corrupted due to a partial write and still not even let me know about 
> it?

As pcapng is used in the dpdk application it writes to a file.
You could repurpose it to something else, but even a pipe will not
give partial writes unless you configure the pipe as non-blocking.

Writing to a non-blocking pipe is going to have a load of other problems.

This seems like a purely hypothetical case, can't see why it needs to be addressed.
  
Mário Kuka Aug. 1, 2022, 8:42 a.m. UTC | #6
> As pcapng is used in the dpdk application it writes to a file.
> You could repurpose it to something else, but even a pipe will not
> give partial writes unless you configure the pipe as non-blocking.
>
> Writing to a non-blocking pipe is going to have a load of other problems.
>
> This seems like a purely hypothetical case, can't see why it needs to be addressed.

OK, I'm sending patch v3 which only fixes the IVO_MAX limit issue.
I've removed the changes related to the partial write check.

On 29/07/2022 20:14, Stephen Hemminger wrote:
> On Fri, 29 Jul 2022 19:08:41 +0200
> Mário Kuka <kuka@cesnet.cz> wrote:
>
>>> Since this is being written to a file, handling partial writes makes little
>>> sense. The only case where partial write would happen would be if filesystem
>>> was full. Retrying just adds unnecessary complexity.
>>>
>>> If you really want to track this, then add a dropped counter.
>> But the file descriptor doesn't have to refer to just a regular file, what
>> if it's a socket or a pipe or some device? The pcapng documentation doesn't
>> say anything about any restrictions, so the implementation should be fully
>> generic. What's the point of a function to write packets to a file
>> descriptor
>> where there's a risk that it won't write all the packets or that the
>> file will
>> by corrupted due to a partial write and still not even let me know about
>> it?
> As pcapng is used in the dpdk application it writes to a file.
> You could repurpose it to something else, but even a pipe will not
> give partial writes unless you configure the pipe as non-blocking.
>
> Writing to a non-blocking pipe is going to have a load of other problems.
>
> This seems like a purely hypothetical case, can't see why it needs to be addressed.
  

Patch

diff --git a/app/test/test_pcapng.c b/app/test/test_pcapng.c
index 320dacea34..7f51946fff 100644
--- a/app/test/test_pcapng.c
+++ b/app/test/test_pcapng.c
@@ -110,7 +110,7 @@  test_setup(void)
 	}
 
 	/* Make a pool for cloned packets */
-	mp = rte_pktmbuf_pool_create_by_ops("pcapng_test_pool", NUM_PACKETS,
+	mp = rte_pktmbuf_pool_create_by_ops("pcapng_test_pool", IOV_MAX + NUM_PACKETS,
 					    0, 0,
 					    rte_pcapng_mbuf_size(pkt_len),
 					    SOCKET_ID_ANY, "ring_mp_sc");
@@ -237,6 +237,45 @@  test_validate(void)
 	return ret;
 }
 
+static int
+test_write_over_limit_iov_max(void)
+{
+	struct rte_mbuf *orig;
+	struct rte_mbuf *clones[IOV_MAX + NUM_PACKETS] = { };
+	struct dummy_mbuf mbfs;
+	unsigned int i;
+	ssize_t len;
+
+	/* make a dummy packet */
+	mbuf1_prepare(&mbfs, pkt_len);
+
+	/* clone them */
+	orig  = &mbfs.mb[0];
+	for (i = 0; i < IOV_MAX + NUM_PACKETS; i++) {
+		struct rte_mbuf *mc;
+
+		mc = rte_pcapng_copy(port_id, 0, orig, mp, pkt_len,
+				rte_get_tsc_cycles(), 0);
+		if (mc == NULL) {
+			fprintf(stderr, "Cannot copy packet\n");
+			return -1;
+		}
+		clones[i] = mc;
+	}
+
+	/* write it to capture file */
+	len = rte_pcapng_write_packets(pcapng, clones, IOV_MAX + NUM_PACKETS);
+
+	rte_pktmbuf_free_bulk(clones, IOV_MAX + NUM_PACKETS);
+
+	if (len <= 0) {
+		fprintf(stderr, "Write of packets failed\n");
+		return -1;
+	}
+
+	return 0;
+}
+
 static void
 test_cleanup(void)
 {
@@ -256,6 +295,7 @@  unit_test_suite test_pcapng_suite  = {
 		TEST_CASE(test_write_packets),
 		TEST_CASE(test_write_stats),
 		TEST_CASE(test_validate),
+		TEST_CASE(test_write_over_limit_iov_max),
 		TEST_CASES_END()
 	}
 };
diff --git a/lib/pcapng/rte_pcapng.c b/lib/pcapng/rte_pcapng.c
index 06ad712bd1..5762f89cb9 100644
--- a/lib/pcapng/rte_pcapng.c
+++ b/lib/pcapng/rte_pcapng.c
@@ -567,6 +567,62 @@  mbuf_burst_segs(struct rte_mbuf *pkts[], unsigned int n)
 	return iovcnt;
 }
 
+/*
+ * Update iov after writev() has returned written. We must find how many iov
+ * buffers (from beginning) have been written. The first buffer that was not
+ * written fully is to be updated accordingly.
+ *
+ * Returns offset of buffer that was not written fully.
+ */
+static int
+pcapng_update_iov(struct iovec *iov, const int count, size_t written)
+{
+	for (int i = 0; written > 0 && i < count; ++i) {
+		if (written < iov[i].iov_len) {
+			/* found buffer that was not written fully */
+			iov[i].iov_base = RTE_PTR_ADD(iov[i].iov_base, written);
+			iov[i].iov_len -= written;
+
+			return i;
+		}
+
+		/* buffer fully written, zero it and skip */
+		written -= iov[i].iov_len;
+
+		iov[i].iov_base = NULL;
+		iov[i].iov_len = 0;
+	}
+
+	return count;
+}
+
+/*
+ * Writes all iovcnt buffers of data described by iov to the file associated with
+ * the file descriptor fd.
+ *
+ * Note: POSIX.1-2001 allows an implementation to place a limit on the number
+ *       of items that can be passed in iov. An implementation can advertise
+ *       its limit by defining IOV_MAX in <limits.h>.
+ */
+static ssize_t
+pcapng_writev(int fd, struct iovec *iov, const int count)
+{
+	size_t total = 0;
+	int at = 0;
+
+	while (at < count) {
+		const int iov_cnt = RTE_MIN(count - at, IOV_MAX);
+		ssize_t wlen = writev(fd, &iov[at], iov_cnt);
+		if (unlikely(wlen < 0))
+			return wlen;
+
+		total += wlen;
+		at += pcapng_update_iov(&iov[at], iov_cnt, wlen);
+	}
+
+	return total;
+}
+
 /* Write pre-formatted packets to file. */
 ssize_t
 rte_pcapng_write_packets(rte_pcapng_t *self,
@@ -601,7 +657,7 @@  rte_pcapng_write_packets(rte_pcapng_t *self,
 		} while ((m = m->next));
 	}
 
-	ret = writev(self->outfd, iov, iovcnt);
+	ret = pcapng_writev(self->outfd, iov, iovcnt);
 	if (unlikely(ret < 0))
 		rte_errno = errno;
 	return ret;