[dpdk-stable] Re: [RFC PATCH v1 4/6] app/eventdev: add release barriers for pipeline test

Feifei Wang Feifei.Wang2 at arm.com
Fri Jan 8 08:12:54 CET 2021


Hi, Pavan



> -----Original Message-----

> From: Pavan Nikhilesh Bhagavatula <pbhagavatula at marvell.com>

> Sent: January 5, 2021 17:29

> To: Feifei Wang <Feifei.Wang2 at arm.com>; jerinj at marvell.com; Harry

> van Haaren <harry.van.haaren at intel.com>

> Cc: dev at dpdk.org; nd <nd at arm.com>; Honnappa Nagarahalli

> <Honnappa.Nagarahalli at arm.com>; stable at dpdk.org; Ruifeng Wang

> <Ruifeng.Wang at arm.com>; nd <nd at arm.com>

> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers for pipeline

> test

>

> Hi Feifei,

>

> >Hi, Pavan

> >

> >Sorry for my late reply and thanks very much for your review.

> >

> >> -----Original Message-----

> >> From: Pavan Nikhilesh Bhagavatula <pbhagavatula at marvell.com>

> >> Sent: December 22, 2020 18:33

> >> To: Feifei Wang <Feifei.Wang2 at arm.com>; jerinj at marvell.com;

> >Harry van

> >> Haaren <harry.van.haaren at intel.com>; Pavan Nikhilesh

> >> <pbhagavatula at caviumnetworks.com>

> >> Cc: dev at dpdk.org; nd <nd at arm.com>; Honnappa Nagarahalli

> >> <Honnappa.Nagarahalli at arm.com>; stable at dpdk.org; Phil Yang

> >> <Phil.Yang at arm.com>

> >> Subject: RE: [RFC PATCH v1 4/6] app/eventdev: add release barriers

> >for

> >> pipeline test

> >>

> >>

> >> >Add release barriers before updating the processed packets for

> >> >worker lcores to ensure the worker lcore has really finished data

> >> >processing and then it can update the processed packets number.

> >> >

> >>

> >> I believe we can live with minor inaccuracies in the stats being presented,

> >> as atomics are pretty heavy when the scheduler is limited to a burst size of 1.

> >>

> >> One option is to move it before a pipeline operation (pipeline_event_tx,

> >> pipeline_fwd_event etc.) as they imply an implicit release barrier (as all

> >> the changes done to the event should be visible to the next core).

> >

> >If I understand correctly, you mean that we should move the release barriers

> >before pipeline_event_tx or pipeline_fwd_event. This can ensure the

> >event has been processed before the next core begins to tx/fwd. For

> >example:

>

> What I meant was that event APIs such as `rte_event_enqueue_burst` and

> `rte_event_eth_tx_adapter_enqueue`

> act as an implicit release barrier, and the API `rte_event_dequeue_burst` acts

> as an implicit acquire barrier.



>

> Since the pipeline_* tests start with a dequeue() and end with an enqueue(), I

> don't believe we need barriers in between.



Sorry for my earlier misunderstanding. I agree with you that no barriers are

needed between dequeue and enqueue.
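
To double-check my understanding, the pairing looks like this in a worker loop (a minimal sketch; process_event() is only an illustrative placeholder for the per-stage work):

    struct rte_event ev;

    /* implicit acquire: everything the previous core wrote before its
     * enqueue is visible once this dequeue returns
     */
    rte_event_dequeue_burst(dev, port, &ev, 1, 0);

    process_event(&ev);

    /* implicit release: all of the processing above is visible to
     * whichever core dequeues this event next
     */
    rte_event_enqueue_burst(dev, port, &ev, 1);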



Now, let's go back to the beginning. With this patch, our barrier is mainly

for the synchronization variable "w->processed_pkts". As we all know, the event is first

dequeued and then enqueued; only after this can the event be treated as a processed event

and included in the statistics ("w->processed_pkts++").



Thus, we add a release barrier before "w->processed_pkts++" to prevent this operation from

being executed ahead of time, for example:

dequeue  ->  w->processed_pkts++  ->  enqueue

In that ordering the worker hasn't actually finished processing the event, yet the event is

already treated as processed and included in the statistics.
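
So the ordering we want is exactly what the patch enforces (a sketch of the pattern this patch uses):

    rte_event_dequeue_burst(dev, port, &ev, 1, 0);
    /* ... process the event ... */
    rte_event_enqueue_burst(dev, port, &ev, 1);

    /* release: all stores above (processing and enqueue of the event)
     * are visible before the updated counter is visible to the main core
     */
    __atomic_fetch_add(&(w->processed_pkts), 1, __ATOMIC_RELEASE);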



______________________________________________________________________________



By the way, I have two other questions about the pipeline processing test in "test_pipeline_queue".

1. When do we start counting processed events (w->processed_pkts)?

For the fwd mode (internal_port = false), when we choose a single stage, the application increments

the number of events processed after "pipeline_event_enqueue". However, when we choose multiple

stages, the application increments the number of events processed before "pipeline_event_enqueue". So

maybe we can unify this. For example, for the multi-stage case:



	if (cq_id == last_queue) {
		ev.queue_id = tx_queue[ev.mbuf->port];
		rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
		pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
+		pipeline_event_enqueue(dev, port, &ev);
		w->processed_pkts++;
	} else {
		ev.queue_id++;
		pipeline_fwd_event(&ev, sched_type_list[cq_id]);
+		pipeline_event_enqueue(dev, port, &ev);
	}

-	pipeline_event_enqueue(dev, port, &ev);
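
For clarity, the multi-stage fwd loop body after this suggestion would read (a sketch; the plain increment is shown, and the release version from this patch would apply in the same way):

    if (cq_id == last_queue) {
            ev.queue_id = tx_queue[ev.mbuf->port];
            rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
            pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
            pipeline_event_enqueue(dev, port, &ev);
            /* counted only after the event is handed off, matching
             * the single-stage fwd behaviour
             */
            w->processed_pkts++;
    } else {
            ev.queue_id++;
            pipeline_fwd_event(&ev, sched_type_list[cq_id]);
            pipeline_event_enqueue(dev, port, &ev);
    }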



2. Is "pipeline_event_enqueue" needed after "pipeline_event_tx" in tx mode?

In single_stage_burst_tx mode, after "pipeline_event_tx" the worker enqueues the event again

via "pipeline_event_enqueue_burst", so maybe we should skip the rest of the iteration after

"pipeline_event_tx", for example:



		if (ev[i].sched_type == RTE_SCHED_TYPE_ATOMIC) {
			pipeline_event_tx(dev, port, &ev[i]);
			ev[i].op = RTE_EVENT_OP_RELEASE;
			w->processed_pkts++;
+			continue;
		} else {
			ev[i].queue_id++;
			pipeline_fwd_event(&ev[i],
					RTE_SCHED_TYPE_ATOMIC);
		}
	}

	pipeline_event_enqueue_burst(dev, port, ev, nb_rx);





>

> >

> >if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {
> >+	__atomic_thread_fence(__ATOMIC_RELEASE);
> >	pipeline_event_tx(dev, port, &ev);
> >	w->processed_pkts++;
> >} else {
> >	ev.queue_id++;
> >+	__atomic_thread_fence(__ATOMIC_RELEASE);
> >	pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >	pipeline_event_enqueue(dev, port, &ev);

> >

> >However, there are two reasons not to do this:

> >

> >First, compared with the other tests in app/eventdev (for example, the

> >eventdev perf test), the wmb is placed after the event operation to ensure

> >the operation has finished before w->processed_pkts++.

>

> In the case of the perf_* tests, starting with a dequeue() and finally ending with a

> mempool_put() should also act as an implicit acquire/release pair, making the stats

> consistent?



For the perf tests, this consistency comes from a wmb placed after mempool_put().

Please refer to this link:

http://patches.dpdk.org/patch/85634/
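
That is, the perf worker pattern is roughly the following (a sketch based on my reading of that patch; the pool variable name is only illustrative):

    rte_mempool_put(pool, ev.event_ptr);

    /* wmb: the put above is visible before the counter update */
    __atomic_thread_fence(__ATOMIC_RELEASE);

    w->processed_pkts++;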



>

> >So, if we move the release barriers before tx/fwd, the tests in

> >app/eventdev may become inconsistent with each other. This may reduce the

> >maintainability of the code and make it difficult to understand.

> >

> >Second, it is a test case; though a heavy barrier may cause performance

> >degradation, it ensures that the processing order and the test

> >result are correct. And maybe for a test case, correctness is more

> >important than performance.

> >

>

> Most of our internal perf tests run on 24/48 core combinations, and since the

> Octeontx2 event device driver supports a burst size of 1, it will show up as a

> huge performance degradation.



For the impact on performance, I ran the test with the software event device driver; here are some test results:

------------------------------------------------------------------------------------------------------------------------------------

Architecture: aarch64

NIC: ixgbe-82599

CPU: Cortex-A72

BURST_SIZE: 1

Command: ./dpdk-test-eventdev -l 0-15 -s 0x2 --vdev=event_sw0 -- --test=pipeline_queue --wlcore=4-14 --prod_type_ethdev --stlist=a,a

Traffic: one flow, 64-byte packets, TX rate: 1.4 Mpps



Without this patch:

0.954 Mpps, avg 0.953 Mpps



With this patch:

0.932 Mpps, avg 0.930 Mpps

------------------------------------------------------------------------------------------------------------------------------------



Based on the results above, there is no significant performance degradation with this patch.

This is because the release barrier only applies to "w->processed_pkts++". It just ensures that the worker core

increments the number of events processed after the enqueue; it doesn't affect dequeue/enqueue themselves:



dequeue -> enqueue -> release barrier -> w->processed_pkts++



On the other hand, I infer that the reason for the slight decrease in measured performance is that the release barrier

prevents "w->processed_pkts++" from being executed before the event has been processed (enqueued). But I think this

test result is closer to the real performance.
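
(For the worker-side release to order anything, I assume the main core pairs it with an acquire load when it samples the counter, e.g.:

    uint64_t pkts = __atomic_load_n(&w->processed_pkts, __ATOMIC_ACQUIRE);

so that whenever it observes the incremented count, it also observes the completed event processing.)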

Sorry that we have no Octeontx2 device, so there is no test result for the Octeontx2 event device driver. Could you

please help us test this patch on Octeontx2 when convenient? Thanks very much.



Best Regards

Feifei



>

> >So, due to the two reasons above, I'm ambivalent about what we should do as

> >the next step.

> >

> >Best Regards

> >Feifei

>

> Regards,

> Pavan.

>

> >

> >> >Fixes: 314bcf58ca8f ("app/eventdev: add pipeline queue worker

> >> >functions")

> >> >Cc: pbhagavatula at marvell.com

> >> >Cc: stable at dpdk.org

> >> >

> >> >Signed-off-by: Phil Yang <phil.yang at arm.com>

> >> >Signed-off-by: Feifei Wang <feifei.wang2 at arm.com>

> >> >Reviewed-by: Ruifeng Wang <ruifeng.wang at arm.com>

> >> >---

> >> > app/test-eventdev/test_pipeline_queue.c | 64 +++++++++++++++++++++----
> >> > 1 file changed, 56 insertions(+), 8 deletions(-)
> >> >
> >> >diff --git a/app/test-eventdev/test_pipeline_queue.c b/app/test-eventdev/test_pipeline_queue.c
> >> >index 7bebac34f..0c0ec0ceb 100644
> >> >--- a/app/test-eventdev/test_pipeline_queue.c
> >> >+++ b/app/test-eventdev/test_pipeline_queue.c
> >> >@@ -30,7 +30,13 @@ pipeline_queue_worker_single_stage_tx(void *arg)
> >> >
> >> > 		if (ev.sched_type == RTE_SCHED_TYPE_ATOMIC) {
> >> > 			pipeline_event_tx(dev, port, &ev);
> >> >-			w->processed_pkts++;
> >> >+
> >> >+			/* release barrier here ensures stored operation
> >> >+			 * of the event completes before the number of
> >> >+			 * processed pkts is visible to the main core
> >> >+			 */
> >> >+			__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+					__ATOMIC_RELEASE);
> >> > 		} else {
> >> > 			ev.queue_id++;
> >> > 			pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >> >@@ -59,7 +65,13 @@ pipeline_queue_worker_single_stage_fwd(void *arg)
> >> > 		rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
> >> > 		pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >> > 		pipeline_event_enqueue(dev, port, &ev);
> >> >-		w->processed_pkts++;
> >> >+
> >> >+		/* release barrier here ensures stored operation
> >> >+		 * of the event completes before the number of
> >> >+		 * processed pkts is visible to the main core
> >> >+		 */
> >> >+		__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+				__ATOMIC_RELEASE);
> >> > 	}
> >> >
> >> > 	return 0;
> >> >@@ -84,7 +96,13 @@ pipeline_queue_worker_single_stage_burst_tx(void *arg)
> >> > 			if (ev[i].sched_type == RTE_SCHED_TYPE_ATOMIC) {
> >> > 				pipeline_event_tx(dev, port, &ev[i]);
> >> > 				ev[i].op = RTE_EVENT_OP_RELEASE;
> >> >-				w->processed_pkts++;
> >> >+
> >> >+				/* release barrier here ensures stored operation
> >> >+				 * of the event completes before the number of
> >> >+				 * processed pkts is visible to the main core
> >> >+				 */
> >> >+				__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+						__ATOMIC_RELEASE);
> >> > 			} else {
> >> > 				ev[i].queue_id++;
> >> > 				pipeline_fwd_event(&ev[i],
> >> >@@ -121,7 +139,13 @@ pipeline_queue_worker_single_stage_burst_fwd(void *arg)
> >> > 		}
> >> >
> >> > 		pipeline_event_enqueue_burst(dev, port, ev, nb_rx);
> >> >-		w->processed_pkts += nb_rx;
> >> >+
> >> >+		/* release barrier here ensures stored operation
> >> >+		 * of the event completes before the number of
> >> >+		 * processed pkts is visible to the main core
> >> >+		 */
> >> >+		__atomic_fetch_add(&(w->processed_pkts), nb_rx,
> >> >+				__ATOMIC_RELEASE);
> >> > 	}
> >> >
> >> > 	return 0;
> >> >@@ -146,7 +170,13 @@ pipeline_queue_worker_multi_stage_tx(void *arg)
> >> >
> >> > 		if (ev.queue_id == tx_queue[ev.mbuf->port]) {
> >> > 			pipeline_event_tx(dev, port, &ev);
> >> >-			w->processed_pkts++;
> >> >+
> >> >+			/* release barrier here ensures stored operation
> >> >+			 * of the event completes before the number of
> >> >+			 * processed pkts is visible to the main core
> >> >+			 */
> >> >+			__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+					__ATOMIC_RELEASE);
> >> > 			continue;
> >> > 		}
> >> >
> >> >@@ -180,7 +210,13 @@ pipeline_queue_worker_multi_stage_fwd(void *arg)
> >> > 			ev.queue_id = tx_queue[ev.mbuf->port];
> >> > 			rte_event_eth_tx_adapter_txq_set(ev.mbuf, 0);
> >> > 			pipeline_fwd_event(&ev, RTE_SCHED_TYPE_ATOMIC);
> >> >-			w->processed_pkts++;
> >> >+
> >> >+			/* release barrier here ensures stored operation
> >> >+			 * of the event completes before the number of
> >> >+			 * processed pkts is visible to the main core
> >> >+			 */
> >> >+			__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+					__ATOMIC_RELEASE);
> >> > 		} else {
> >> > 			ev.queue_id++;
> >> > 			pipeline_fwd_event(&ev, sched_type_list[cq_id]);
> >> >@@ -214,7 +250,13 @@ pipeline_queue_worker_multi_stage_burst_tx(void *arg)
> >> > 			if (ev[i].queue_id == tx_queue[ev[i].mbuf->port]) {
> >> > 				pipeline_event_tx(dev, port, &ev[i]);
> >> > 				ev[i].op = RTE_EVENT_OP_RELEASE;
> >> >-				w->processed_pkts++;
> >> >+
> >> >+				/* release barrier here ensures stored operation
> >> >+				 * of the event completes before the number of
> >> >+				 * processed pkts is visible to the main core
> >> >+				 */
> >> >+				__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+						__ATOMIC_RELEASE);
> >> > 				continue;
> >> > 			}
> >> >
> >> >@@ -254,7 +296,13 @@ pipeline_queue_worker_multi_stage_burst_fwd(void *arg)
> >> >
> >> > 				rte_event_eth_tx_adapter_txq_set(ev[i].mbuf, 0);
> >> > 				pipeline_fwd_event(&ev[i],
> >> > 						RTE_SCHED_TYPE_ATOMIC);
> >> >-				w->processed_pkts++;
> >> >+
> >> >+				/* release barrier here ensures stored operation
> >> >+				 * of the event completes before the number of
> >> >+				 * processed pkts is visible to the main core
> >> >+				 */
> >> >+				__atomic_fetch_add(&(w->processed_pkts), 1,
> >> >+						__ATOMIC_RELEASE);
> >> > 			} else {
> >> > 				ev[i].queue_id++;
> >> > 				pipeline_fwd_event(&ev[i],
> >> >--
> >> >2.17.1



