Bug 60 - rte_event_port_unlink() causes subsequent events to end up in wrong port
Summary: rte_event_port_unlink() causes subsequent events to end up in wrong port
Status: RESOLVED FIXED
Alias: None
Product: DPDK
Classification: Unclassified
Component: eventdev
Version: 17.11
Hardware: x86 Linux
Importance: Normal major
Target Milestone: ---
Assignee: Harry van Haaren (Intel)
URL:
Depends on:
Blocks:
 
Reported: 2018-06-04 09:21 CEST by Matias Elo
Modified: 2018-09-26 16:28 CEST
CC: 2 users



Attachments
Test application (13.81 KB, application/mbox)
2018-06-04 09:21 CEST, Matias Elo

Description Matias Elo 2018-06-04 09:21:18 CEST
Created attachment 8 [details]
Test application

I'm seeing some unexpected(?) behavior when calling rte_event_port_unlink()
with the SW eventdev driver (DPDK 17.11.2/18.02.1,
RTE_EVENT_MAX_QUEUES_PER_DEV=255). After calling rte_event_port_unlink(),
the enqueued events may end up either back to the unlinked port or to port
zero.

Scenario:

- Run the SW eventdev on a service core
- Start eventdev with e.g. 16 ports. Each core will have a dedicated port.
- Create 1 atomic queue and link all active ports to it (some ports may not
be linked).
- Allocate some events and enqueue them to the created queue
- Next, each worker core does a number of scheduling rounds concurrently.
E.g.

uint64_t rx_events = 0;
while(rx_events < SCHED_ROUNDS) {
	num_deq = rte_event_dequeue_burst(dev_id, port_id, ev, 1, 0);

	if (num_deq) {
		rx_events++;
		rte_event_enqueue_burst(dev_id, port_id, ev, 1);
	}
}

- This works fine, but problems occur when doing cleanup after the first
loop finishes on some core.
E.g.

rte_event_port_unlink(dev_id, port_id, NULL, 0);

while(1) {
	num_deq = rte_event_dequeue_burst(dev_id, port_id, ev, 1, 0);

	if (num_deq == 0)
		break;

	rte_event_enqueue_burst(dev_id, port_id, ev, 1);
}

- The events enqueued in the cleanup loop will randomly end up either back in
the same port (which has already been unlinked) or in port zero, which is not
used at all (port IDs are mapped from rte_lcore_id).

As far as I understand the eventdev API, an eventdev port shouldn't have to be
linked to the target queue for enqueue to work properly.

I've attached a simple test application for reproducing this issue.
# sudo ./eventdev --vdev event_sw0 -s 0x2

Below is an example rte_event_dev_dump() output when processing events with two
cores (ports 2 and 3). The rest of the ports are not linked at all, but events
still end up in port zero, stalling the system.


Regards,
Matias

EventDev todo-fix-name: ports 16, qids 1
	rx   908342
	drop 0
	tx   908342
	sched calls: 42577156
	sched cq/qid call: 43120490
	sched no IQ enq: 42122057
	sched no CQ enq: 42122064
	inflight 32, credits: 4064
  Port 0 
	rx   0	drop 0	tx   2	inflight 2
	Max New: 1024	Avg cycles PP: 0	Credits: 0
	Receive burst distribution:
		0:-nan% 
	rx ring used:    0	free: 4096
	cq ring used:    2	free:   14
  Port 1 
	rx   0	drop 0	tx   0	inflight 0
	Max New: 1024	Avg cycles PP: 0	Credits: 0
	Receive burst distribution:
		0:-nan% 
	rx ring used:    0	free: 4096
	cq ring used:    0	free:   16
  Port 2 
	rx   524292	drop 0	tx   524290	inflight 0
	Max New: 1024	Avg cycles PP: 190	Credits: 30
	Receive burst distribution:
		0:98% 1-4:1.82% 
	rx ring used:    0	free: 4096
	cq ring used:    0	free:   16
  Port 3 
	rx   384050	drop 0	tx   384050	inflight 0
	Max New: 1024	Avg cycles PP: 191	Credits: 0
	Receive burst distribution:
		0:100% 1-4:0.04% 
	rx ring used:    0	free: 4096
	cq ring used:    0	free:   16
...
  Port 15 
	rx   0	drop 0	tx   0	inflight 0
	Max New: 1024	Avg cycles PP: 0	Credits: 0
	Receive burst distribution:
		0:-nan% 
	rx ring used:    0	free: 4096
	cq ring used:    0	free:   16
  Queue 0 (Atomic)
	rx   908342	drop 0	tx   908342
	Per Port Stats:
	  Port 0: Pkts: 2	Flows: 1
	  Port 1: Pkts: 0	Flows: 0
	  Port 2: Pkts: 524290	Flows: 0
	  Port 3: Pkts: 384050	Flows: 0
	  Port 4: Pkts: 0	Flows: 0
	  Port 5: Pkts: 0	Flows: 0
	  Port 6: Pkts: 0	Flows: 0
	  Port 7: Pkts: 0	Flows: 0
	  Port 8: Pkts: 0	Flows: 0
	  Port 9: Pkts: 0	Flows: 0
	  Port 10: Pkts: 0	Flows: 0
	  Port 11: Pkts: 0	Flows: 0
	  Port 12: Pkts: 0	Flows: 0
	  Port 13: Pkts: 0	Flows: 0
	  Port 14: Pkts: 0	Flows: 0
	  Port 15: Pkts: 0	Flows: 0
	-- iqs empty --
Comment 1 Harry van Haaren (Intel) 2018-06-05 18:43:32 CEST
> From: Jerin Jacob [mailto:jerin.jacob@caviumnetworks.com]
> Sent: Monday, June 4, 2018 9:20 AM
> To: bugzilla@dpdk.org
> Cc: dev@dpdk.org; Van Haaren, Harry <harry.van.haaren@intel.com>; Ma, Liang
> J <liang.j.ma@intel.com>; hemant.agrawal@nxp.com; sunil.kori@nxp.com;
> nipun.gupta@nxp.com
> Subject: Re: [dpdk-dev] [Bug 60] rte_event_port_unlink() causes subsequent
> events to end up in wrong port
> 
> -----Original Message-----
> [The original bug report is quoted here in full; see the Description above.
> The quote ends with:]
> >
> > As far as I understand the eventdev API, an eventdev port shouldn't have
> > to be linked to the target queue for enqueue to work properly.
> 
> That is a grey area in the spec. The octeontx driver works the way you
> described. I am not sure about the SW driver (CC:
> harry.van.haaren@intel.com). If there is no performance impact for any of
> the drivers and it is doable for all HW and SW implementations, then we can
> do it that way (CC: all PMD maintainers).
> 
> Not related to this question: are you planning to use rte_event_port_unlink()
> in the fast path?
> Does rte_event_dev_stop() work for you, if it is in the slow path?


Hi Matias,

Thanks for opening this; from memory, the sw_port_unlink() code does attempt to handle that correctly.

Having a quick look: we scan the queue's link array for the port to unlink, and if we find the queue->port combination, we copy the furthest link in the array into the found position and reduce the number of mapped queues by one (i.e. we keep the array contiguous from 0 to num_mapped_queues).

The appropriate rte_smp_wmb() is in place to avoid race conditions between threads there.
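
Roughly, the bookkeeping I'm describing looks something like the sketch below - this is a from-memory illustration with simplified struct/field names (sketch_qid, cq_map, cq_num_mapped_cqs), not the exact event/sw code:

#include <stdint.h>

/* Simplified stand-in for the per-queue link bookkeeping (illustrative
 * names, not the real sw_evdev structures). */
struct sketch_qid {
	uint8_t cq_map[64];          /* ports (CQs) currently linked to this queue */
	uint32_t cq_num_mapped_cqs;  /* number of valid entries in cq_map[] */
};

/* Unlink one port from a queue: find the queue->port entry, copy the
 * furthest link into the freed slot, and shrink the count by one so the
 * array stays contiguous from 0 to cq_num_mapped_cqs - 1. */
static void
sketch_qid_unlink_port(struct sketch_qid *qid, uint8_t port_id)
{
	uint32_t i;

	for (i = 0; i < qid->cq_num_mapped_cqs; i++) {
		if (qid->cq_map[i] == port_id) {
			qid->cq_map[i] = qid->cq_map[qid->cq_num_mapped_cqs - 1];
			qid->cq_num_mapped_cqs--;
			break;
		}
	}
}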

I think this should handle the unlink case you mention; however, perhaps you have identified a genuine bug. If you have more info or a sample config/app that easily demonstrates the issue, that would help us reproduce and debug it here.

Unfortunately I will be away until next week, but I will check up on this thread once I'm back in the office.

Regards, -Harry
Comment 2 Matias Elo 2018-06-06 08:08:41 CEST
Hi Harry,

Thanks for looking into this. An example application for demonstrating this issue should be attached to this bug report ('Test application': evendev_test.diff). It can be run simply by:

sudo ./eventdev --vdev event_sw0 -s 0x2

Regards,
Matias
Comment 3 Matias Elo 2018-06-12 13:39:43 CEST
I did some further digging, and it seems the problem is at least somehow related to sw_qid.cq_next_tx getting out of sync with sw_qid.cq_num_mapped_cqs. After the following change, events no longer end up in the wrong port (the fix covers atomic queues only).

diff --git a/drivers/event/sw/sw_evdev_scheduler.c b/drivers/event/sw/sw_evdev_scheduler.c
index 8a2c9d4f9..c9dd19683 100644
--- a/drivers/event/sw/sw_evdev_scheduler.c
+++ b/drivers/event/sw/sw_evdev_scheduler.c
@@ -80,7 +80,7 @@ sw_schedule_atomic_to_cq(struct sw_evdev *sw, struct sw_qid * const qid,
 
                if (cq < 0) {
                        uint32_t cq_idx = qid->cq_next_tx++;
-                       if (qid->cq_next_tx == qid->cq_num_mapped_cqs)
+                       if (qid->cq_next_tx >= qid->cq_num_mapped_cqs)
                                qid->cq_next_tx = 0;
                        cq = qid->cq_map[cq_idx];


However, a worker in the eventdev test application (this bug report's attachment) is still able to receive events even though it has unlinked all queues.
Comment 4 Bill Fischofer 2018-06-25 23:10:17 CEST
Can we get an update on this bug? What's needed to help move the analysis and resolution along?  

Thanks.
Comment 5 Harry van Haaren (Intel) 2018-07-10 15:21:06 CEST
Hey Matias, Bill,

Apologies for delays - I'm travelling a lot recently, and I'll be away again next week.

I've been thinking about the problem, and we can break it down into two parts:
A) The next_tx index was getting out of bounds (as fixed by Matias in comment 3 above; a suggested fix is here: https://mails.dpdk.org/archives/dev/2018-June/104620.html)

B) The unlink() call can be made from a thread other than the sw scheduler thread. As such, when unlink() returns, it is *not* guaranteed (currently) by the API or by the implementation that no events from the unlinked queue could end up in that port. Somehow we need to implement/document a method to allow the application to unlink and get known and reproducible results.

If you have input on what you would like this to look like from the application POV, please do provide it!

Regards, -Harry
Comment 6 Matias Elo 2018-07-11 09:01:04 CEST
> B) The unlink() call can be made from a thread other than the sw scheduler
> thread. As such, when unlink() returns, it is *not* guaranteed (currently)
> by the API or by the implementation that no events from the unlinked queue
> could end up in that port. Somehow we need to implement/document a method to
> allow the application to unlink and get known and reproducible results.
> 
> If you have input on what you would like this to look like from the
> application POV, please do provide it!

From the application POV, it is OK if some events are scheduled to the
particular port after unlink, as long as there is a method to find out when
all of these events have been dequeued.

E.g. a clean-up loop like the following would be fine:

	rte_event_port_unlink(dev_id, port_id, NULL, 0);

	while(1) {
		num_deq = rte_event_dequeue_burst(dev_id, port_id, ev, 1, 0);

		if (num_deq == 0)
			break;

		/* Pass event back to scheduler */
		rte_event_enqueue_burst(dev_id, port_id, ev, 1);
	}


Regards,
Matias
Comment 7 Harry van Haaren (Intel) 2018-07-13 18:24:10 CEST
Hey,

> From the application POV, it is OK if some events are scheduled to the
> particular port after unlink, as long as there is a method to find out when
> all of these events have been dequeued.

Yes I think this addresses the core problem well. Thanks also for the code snippet - I agree that the core will have to "poll" a few times to understand when the port is no longer linked.

Using num_deq as the indicator of "unlink complete" doesn't sit right with me. I'm considering different approaches that would make the unlink() function return only when the unlink is complete; however, this doesn't scale well, as the SW scheduling core and the core that calls unlink() can easily race.

Using a separate API might be a solution (e.g. polling until rte_event_unlinks_in_process() == 0) or something similar? I'm not sure yet; I am giving it some thought, just haven't got to a clean solution yet.

As mentioned before, I'm away for next week - I'll catch up on any posts here then and we can work to a "real" solution.
Regards, -Harry
Comment 8 Bill Fischofer 2018-07-13 19:35:53 CEST
(In reply to Harry van Haaren (Intel) from comment #7)
> Hey,
> 
> > From the application POV, it is OK if some events are scheduled to the
> > particular port after unlink, as long as there is a method to find out
> > when all of these events have been dequeued.
> 
> Yes I think this addresses the core problem well. Thanks also for the code
> snippet - I agree that the core will have to "poll" a few times to
> understand when the port is no longer linked.
> 
> Using num_deq as the indicator of "unlink complete" doesn't sit right with
> me. I'm considering different approaches that would make the unlink()
> function return only when the unlink is complete; however, this doesn't
> scale well, as the SW scheduling core and the core that calls unlink() can
> easily race.

Synchronous waits are undesirable unless the expected wait time is brief. 

> 
> Using a separate API might be a solution (e.g. polling until
> rte_event_unlinks_in_process() == 0) or something similar? I'm not sure yet;
> I am giving it some thought, just haven't got to a clean solution yet.

The way this is typically solved is that the request to unlink/quiesce/etc. initiates a process that concludes with a special completion event, which serves as a "pipe cleaner" acknowledging that the requested operation is now complete and no further events will be received. This avoids the ambiguities associated with (temporary) "no event" results while the pipe is draining.
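
For illustration only, the worker-side drain could look roughly like the sketch below. The APP_UNLINK_DONE_MAGIC marker and the drain function are hypothetical - how the completion event gets routed to the unlinked port is exactly the part the eventdev API/PMD would need to define:

#include <stdbool.h>
#include <stdint.h>
#include <rte_eventdev.h>

/* Hypothetical application-defined marker value carried in ev.u64 by a
 * "pipe cleaner" event enqueued after the unlink request. */
#define APP_UNLINK_DONE_MAGIC 0xDEADC1EAULL

/* Keep servicing the port until the pipe-cleaner event arrives; ordinary
 * events are passed back to the scheduler as in the loops above. */
static void
app_drain_until_unlink_done(uint8_t dev_id, uint8_t port_id)
{
	struct rte_event ev;
	bool done = false;

	while (!done) {
		uint16_t n = rte_event_dequeue_burst(dev_id, port_id, &ev, 1, 0);

		if (n == 0)
			continue;	/* pipe still draining, keep polling */

		if (ev.u64 == APP_UNLINK_DONE_MAGIC)
			done = true;	/* pipe cleaner seen: no more events */
		else
			rte_event_enqueue_burst(dev_id, port_id, &ev, 1);
	}
}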

> 
> As mentioned before, I'm away for next week - I'll catch up on any posts
> here then and we can work to a "real" solution.
> Regards, -Harry
Comment 9 Matias Elo 2018-07-16 09:33:06 CEST

> > Using num_deq as the indicator of "unlink complete" doesn't sit right with
> > me. I'm considering different approaches that would make the unlink()
> > function return only when the unlink is complete; however, this doesn't
> > scale well, as the SW scheduling core and the core that calls unlink() can
> > easily race.
> 
> Synchronous waits are undesirable unless the expected wait time is brief. 

Agree, making unlink() blocking would make the function pretty much unusable in the fast path.

> Using a separate API might be a solution (e.g. polling until
> rte_event_unlinks_in_process() == 0) or something similar? I'm not sure yet;
> I am giving it some thought, just haven't got to a clean solution yet.

This could work, although simply expanding the rte_event_port_unlink() documentation (e.g. stating that there may be pre-scheduled events and that the application is responsible for processing them) could achieve the same thing, if one could use num_deq as the indicator of unlink completion.

-Matias
Comment 10 Harry van Haaren (Intel) 2018-07-23 19:02:36 CEST
So I'm considering both options: the "pipe cleaner" event, and having the unlink() call return only when un-linking is complete. I'll need to dig a bit deeper into which threads execute which parts of the unlink in the event/sw driver, to avoid race conditions and such.

@Matias, do you think an event as Bill proposed could be a good solution? It would mean the application code needs to continue polling until this special event arrives, and then you can understand that the queue unlink is complete?

Side-note, we should take this discussion to the dpdk-dev mailing list - I don't think bugzillas are being followed by the larger community.

Thanks for the input, -Harry
Comment 11 Matias Elo 2018-07-24 08:07:22 CEST
> @Matias, do you think an event as Bill proposed could be a good solution? It
> would mean the application code needs to continue polling until this special
> event arrives, and then you can understand that the queue unlink is complete?

Bill's "pipe cleaner" solution should work well.

-Matias
Comment 12 Harry van Haaren (Intel) 2018-07-24 19:25:19 CEST
(In reply to Matias Elo from comment #11)
> > @Matias, do you think an event as Bill proposed could be a good solution?
> > It would mean the application code needs to continue polling until this
> > special event arrives, and then you can understand that the queue unlink
> > is complete?
> 
> Bill's "pipe cleaner" solution should work well.

OK I'm starting to get the bigger picture, just to confirm;

Your use case here is that you want to unlink() the *last* queue from a particular port, and then know when the CPU core that was polling the port can be put to sleep, correct?

Or does your use case also include eg: 4 queues linked to 1 port, and unlinking just *one* of the queues from the port, and identifying when that queue unlink is complete?
Comment 13 Matias Elo 2018-07-25 09:26:48 CEST
> Your use case here is that you want to unlink() the *last* queue from a
> particular port, and then know when the CPU core that was polling the port
> can be put to sleep, correct?

This is definitely the more important use case.

> Or does your use case also include eg: 4 queues linked to 1 port, and
> unlinking just *one* of the queues from the port, and identifying when that
> queue unlink is complete?

This information could be useful, but I don't consider it mandatory, at least in our use case. What do you think, Bill?

As a side note, the rte_event_dev_start() requirement that all queues are linked makes writing applications which link/unlink queues at runtime cumbersome. E.g. I have to dummy-link all queues before rte_event_dev_start() and then unlink them after the call. This alone wouldn't be a big issue, but rte_event_dev_start() is also called inside rte_event_eth_rx_adapter_create().
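
For reference, a rough sketch of the workaround (error handling and the real port/queue layout omitted; this only illustrates the dummy-link dance, it is not lifted from our application):

#include <rte_eventdev.h>

/* Work around the "all queues must be linked" start-time check: link
 * everything to one port before start, then undo the links that are not
 * actually wanted and set up the real runtime links afterwards. */
static void
start_with_dummy_links(uint8_t dev_id, uint8_t dummy_port)
{
	/* A NULL queue list links the port to all configured queues. */
	rte_event_port_link(dev_id, dummy_port, NULL, NULL, 0);

	rte_event_dev_start(dev_id);

	/* A NULL queue list unlinks the port from all of its queues again. */
	rte_event_port_unlink(dev_id, dummy_port, NULL, 0);
}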
Comment 14 Bill Fischofer 2018-07-25 13:53:34 CEST
I agree with Matias that the key need is to know when the unlink operation has fully completed and that no further events will be presented from that queue. 

This is all part of enabling the event configuration to be more flexible and dynamic. This is especially important for applications that scale in response to changing workloads, which is typical in most production scenarios.
Comment 15 Harry van Haaren (Intel) 2018-07-27 19:00:07 CEST
@Matias,
Re "all queues linked at start" topic: Lets leave this as a seperate issue. I do think it is worth discussing that on the mailing list though. I recall that we added the check because if a queue is not being drained (no queue -> port mapping), and events are being enqueued to it by the application (in error..) then it will deadlock the eventdev - without anything that the SW PMD can do about it. If it causes you a headache, please bring it up - I'd like Jerin's input on that one as Eventdev maintainer.

Re detecting when unlink is complete; What if we added an API such as this:
int32_t rte_event_unlinks_in_progress()? This would allow an application to call unlink(), then start polling unlinks_in_progress() until it reaches 0, and then stop polling the Eventdev ports that were just unlinked. Does that sound like a good solution?
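
Usage from the application side would be roughly as sketched below - note the function is only a proposal at this point, so the prototype is illustrative:

#include <stdint.h>
#include <rte_eventdev.h>

/* Proposed, not yet existing: number of unlink operations the PMD has not
 * yet completed for this port (0 == all unlinks done). */
int32_t rte_event_unlinks_in_progress(uint8_t dev_id, uint8_t port_id);

static void
quiesce_port(uint8_t dev_id, uint8_t port_id)
{
	struct rte_event ev;

	rte_event_port_unlink(dev_id, port_id, NULL, 0);

	/* Keep draining until the PMD reports the unlink as complete; any
	 * events still delivered to this port go back to the scheduler. */
	while (rte_event_unlinks_in_progress(dev_id, port_id) > 0) {
		uint16_t n = rte_event_dequeue_burst(dev_id, port_id, &ev, 1, 0);

		if (n)
			rte_event_enqueue_burst(dev_id, port_id, &ev, 1);
	}

	/* The core polling this port can now be put to sleep. */
}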

Again - we need to bring this to the attention of the DPDK dev mailing list. I think it's best if it comes from the requester (that's you @Matias :) as it will give better context. If you prefer, I can kick off a discussion otherwise.
Comment 16 Bill Fischofer 2018-07-27 19:06:40 CEST
(In reply to Harry van Haaren (Intel) from comment #15)
> @Matias,
> Re "all queues linked at start" topic: Lets leave this as a seperate issue.
> I do think it is worth discussing that on the mailing list though. I recall
> that we added the check because if a queue is not being drained (no queue ->
> port mapping), and events are being enqueued to it by the application (in
> error..) then it will deadlock the eventdev - without anything that the SW
> PMD can do about it. If it causes you a headache, please bring it up - I'd
> like Jerin's input on that one as Eventdev maintainer.
> 
> Re detecting when unlink is complete; What if we added an API such as this:
> int32_t rte_event_unlinks_in_progress()? This would allow an application to
> call unlink(), then start polling unlinks_in_progress() until it reaches 0,
> and then stop polling the Eventdev ports that were just unlinked. Does that
> sound like a good solution?

"Are we there yet?" polling APIs aren't very scalable, and should be discouraged in an event framework. 

> 
> Again - we need to bring this to the attention of the DPDK dev mailing list.
> I think it's best if it comes from the requester (that's you @Matias :) as it
> will give better context. If you prefer, I can kick off a discussion
> otherwise.

That's a very high-volume list. I agree a mailing list would be better for design discussions but I'm wondering if there's a more focused way of doing that?
Comment 17 Harry van Haaren (Intel) 2018-07-27 19:13:00 CEST
(In reply to Bill Fischofer from comment #16)
> (In reply to Harry van Haaren (Intel) from comment #15)
> > Re detecting when unlink is complete; What if we added an API such as this:
> > int32_t rte_event_unlinks_in_progress()? This would allow an application to
> > call unlink(), then start polling unlinks_in_progress() until it reaches 0,
> > and then stop polling the Eventdev ports that were just unlinked. Does that
> > sound like a good solution?
> 
> "Are we there yet?" polling APIs aren't very scalable, and should be
> discouraged in an event framework. 

Given that unlink() will not be a very common operation (once every minute or so at most?), I don't see an issue with having a polling task. "Polling" doesn't have to mean 100% busy polling; depending on how the application wishes to structure its thread model, it can be a non-datapath thread which occasionally checks whether the unlink() is complete.


> > Again - we need to bring this to the attention of the DPDK dev mailing
> > list. I think it's best if it comes from the requester (that's you @Matias
> > :) as it will give better context. If you prefer, I can kick off a
> > discussion otherwise.
> 
> That's a very high-volume list. I agree a mailing list would be better for
> design discussions but I'm wondering if there's a more focused way of doing
> that?

The DPDK mailing list is the (only) central place for design discussions such as this, and has always been used for them in DPDK. If the mailing list causes issues, please raise it as a separate topic, perhaps at the Userspace event.

To continue progress on this topic, I'll wait for Matias to kick off a thread on dev@dpdk.org.
Comment 18 Harry van Haaren (Intel) 2018-09-21 11:52:20 CEST
Status update;
Discussion continued on the list (https://mails.dpdk.org/archives/dev/2018-September/110929.html). Matias sent a ping, and there is now a v2 patch on the mailing list: http://patches.dpdk.org/patch/45007/
Comment 19 Harry van Haaren (Intel) 2018-09-21 12:30:42 CEST
Patch sent to fix the "sub bug" of events arriving after unlink:
http://patches.dpdk.org/patch/45066/
Comment 20 Harry van Haaren (Intel) 2018-09-26 16:28:35 CEST
Patchset now applied:
http://patches.dpdk.org/patch/45188/
http://patches.dpdk.org/patch/45066/

Will close this bug.
