[dpdk-dev,v6,2/2] eal: add asynchronous request API to DPDK IPC

Message ID 5d94df0912159b96aa71da5249598bd8c744a198.1522159146.git.anatoly.burakov@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation fail Compilation issues

Commit Message

Anatoly Burakov March 27, 2018, 1:59 p.m. UTC
  This API is similar to the blocking API that is already present,
but reply will be received in a separate callback by the caller
(callback specified at the time of request, rather than registering
for it in advance).

Under the hood, we create a separate thread to deal with replies to
asynchronous requests, that will just wait to be notified by the
main thread, or woken up on a timer.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Jianfeng Tan <jianfeng.tan@intel.com>
---

Notes:
    v6:
      - address review comments from Jianfeng
      - do not add dummy item to queue by default
    
    v5:
      - addressed review comments from Jianfeng
      - split into two patches to avoid rename noise
      - do not mark ignored message as processed
    v4:
      - rebase on top of latest IPC Improvements patchset [2]
    
    v3:
      - added support for MP_IGN messages introduced in
        IPC improvements v5 patchset
    v2:
      - fixed deadlocks and race conditions by not calling
        callbacks while iterating over sync request list
      - fixed use-after-free by making a copy of request
      - changed API to also give user a copy of original
        request, so that they know to which message the
        callback is a reply to
      - fixed missing .map file entries
    
    This patch is dependent upon previously published patchsets
    for IPC fixes [1] and improvements [2].
    
    rte_mp_action_unregister and rte_mp_async_reply_unregister
    do the same thing - should we perhaps make it one function?
    
    [1] http://dpdk.org/dev/patchwork/bundle/aburakov/IPC_Fixes/
    [2] http://dpdk.org/dev/patchwork/bundle/aburakov/IPC_Improvements/

 lib/librte_eal/common/eal_common_proc.c | 458 +++++++++++++++++++++++++++++++-
 lib/librte_eal/common/include/rte_eal.h |  36 +++
 lib/librte_eal/rte_eal_version.map      |   1 +
 3 files changed, 482 insertions(+), 13 deletions(-)
  

Comments

Thomas Monjalon March 27, 2018, 4:33 p.m. UTC | #1
27/03/2018 15:59, Anatoly Burakov:
> Under the hood, we create a separate thread to deal with replies to
> asynchronous requests, that will just wait to be notified by the
> main thread, or woken up on a timer.

I really don't like that a library is creating a thread.
We don't even know where the thread is created (which core).
Can it be a rte_service? or in the interrupt thread?


> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -225,6 +225,7 @@ EXPERIMENTAL {
>  	rte_mp_action_unregister;
>  	rte_mp_sendmsg;
>  	rte_mp_request;
> +	rte_mp_request_async;

So there is rte_mp_request and rte_mp_request_async?
You should rename rte_mp_request, I guess.

>  	rte_mp_reply;
>  	rte_service_attr_get;
>  	rte_service_attr_reset_all;
  
Jianfeng Tan March 28, 2018, 2:08 a.m. UTC | #2
Hi Thomas ,

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, March 28, 2018 12:34 AM
> To: Burakov, Anatoly
> Cc: dev@dpdk.org; Tan, Jianfeng; Ananyev, Konstantin; Van Haaren, Harry
> Subject: Re: [dpdk-dev] [PATCH v6 2/2] eal: add asynchronous request API to
> DPDK IPC
> 
> 27/03/2018 15:59, Anatoly Burakov:
> > Under the hood, we create a separate thread to deal with replies to
> > asynchronous requests, that will just wait to be notified by the
> > main thread, or woken up on a timer.
> 
> I really don't like that a library is creating a thread.
> We don't even know where the thread is created (which core).
> Can it be a rte_service? or in the interrupt thread?

Agree that we'd better not adding so many threads in a library.

I was considering to merge all the threads into the interrupt thread, however, we don't have an interrupt thread in freebsd. Further, we don't implement alarm API in freebsd. That's why I tend to current implementation, and optimize it later.

For rte_service, it may be not a good idea to reply on it as it needs explicit API calls to setup.

> 
> 
> > --- a/lib/librte_eal/rte_eal_version.map
> > +++ b/lib/librte_eal/rte_eal_version.map
> > @@ -225,6 +225,7 @@ EXPERIMENTAL {
> >  	rte_mp_action_unregister;
> >  	rte_mp_sendmsg;
> >  	rte_mp_request;
> > +	rte_mp_request_async;
> 
> So there is rte_mp_request and rte_mp_request_async?
> You should rename rte_mp_request, I guess.

+1.

Thanks,
Jianfeng

> 
> >  	rte_mp_reply;
> >  	rte_service_attr_get;
> >  	rte_service_attr_reset_all;
> 
> 
> 
>
  
Thomas Monjalon March 28, 2018, 7:29 a.m. UTC | #3
28/03/2018 04:08, Tan, Jianfeng:
> Hi Thomas ,
> 
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > 27/03/2018 15:59, Anatoly Burakov:
> > > Under the hood, we create a separate thread to deal with replies to
> > > asynchronous requests, that will just wait to be notified by the
> > > main thread, or woken up on a timer.
> > 
> > I really don't like that a library is creating a thread.
> > We don't even know where the thread is created (which core).
> > Can it be a rte_service? or in the interrupt thread?
> 
> Agree that we'd better not adding so many threads in a library.
> 
> I was considering to merge all the threads into the interrupt thread, however, we don't have an interrupt thread in freebsd. Further, we don't implement alarm API in freebsd. That's why I tend to current implementation, and optimize it later.

I would prefer we improve the current code now instead of polluting more
with more uncontrolled threads.

> For rte_service, it may be not a good idea to reply on it as it needs explicit API calls to setup.

I don't see the issue of the explicit API.
The IPC is a new service.
  
Van Haaren, Harry March 28, 2018, 8:22 a.m. UTC | #4
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, March 28, 2018 8:30 AM
> To: Tan, Jianfeng <jianfeng.tan@intel.com>
> Cc: Burakov, Anatoly <anatoly.burakov@intel.com>; dev@dpdk.org; Ananyev,
> Konstantin <konstantin.ananyev@intel.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v6 2/2] eal: add asynchronous request API to
> DPDK IPC
> 
> 28/03/2018 04:08, Tan, Jianfeng:
> > Hi Thomas ,
> >
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > 27/03/2018 15:59, Anatoly Burakov:
> > > > Under the hood, we create a separate thread to deal with replies to
> > > > asynchronous requests, that will just wait to be notified by the
> > > > main thread, or woken up on a timer.
> > >
> > > I really don't like that a library is creating a thread.
> > > We don't even know where the thread is created (which core).
> > > Can it be a rte_service? or in the interrupt thread?
> >
> > Agree that we'd better not adding so many threads in a library.
> >
> > I was considering to merge all the threads into the interrupt thread,
> however, we don't have an interrupt thread in freebsd. Further, we don't
> implement alarm API in freebsd. That's why I tend to current implementation,
> and optimize it later.
> 
> I would prefer we improve the current code now instead of polluting more
> with more uncontrolled threads.
> 
> > For rte_service, it may be not a good idea to reply on it as it needs
> explicit API calls to setup.
> 
> I don't see the issue of the explicit API.
> The IPC is a new service.

Although I do like to see new services, if we want to enable "core" dpdk functionality with Services, we need a proper designed solution for that. Service cores is not intended for "occasional" work - there is no method to block and sleep on a specific service until work becomes available, so this would imply a busy-polling. Using a service (hence busy polling) for rte_malloc()-based memory mapping requests is inefficient, and total overkill :)

For this patch I suggest to use some blocking-read capable mechanism.

The above said, in the longer term it would be good to have a design that allows new file-descriptors to be added to a "dpdk core" thread, which performs occasional lengthy work if the FD has data available.
  
Jianfeng Tan March 28, 2018, 8:55 a.m. UTC | #5
Hi Thomas and Harry,


On 3/28/2018 4:22 PM, Van Haaren, Harry wrote:
>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
>> Sent: Wednesday, March 28, 2018 8:30 AM
>> To: Tan, Jianfeng <jianfeng.tan@intel.com>
>> Cc: Burakov, Anatoly <anatoly.burakov@intel.com>; dev@dpdk.org; Ananyev,
>> Konstantin <konstantin.ananyev@intel.com>; Van Haaren, Harry
>> <harry.van.haaren@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH v6 2/2] eal: add asynchronous request API to
>> DPDK IPC
>>
>> 28/03/2018 04:08, Tan, Jianfeng:
>>> Hi Thomas ,
>>>
>>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
>>>> 27/03/2018 15:59, Anatoly Burakov:
>>>>> Under the hood, we create a separate thread to deal with replies to
>>>>> asynchronous requests, that will just wait to be notified by the
>>>>> main thread, or woken up on a timer.
>>>> I really don't like that a library is creating a thread.
>>>> We don't even know where the thread is created (which core).
>>>> Can it be a rte_service? or in the interrupt thread?
>>> Agree that we'd better not adding so many threads in a library.
>>>
>>> I was considering to merge all the threads into the interrupt thread,
>> however, we don't have an interrupt thread in freebsd. Further, we don't
>> implement alarm API in freebsd. That's why I tend to current implementation,
>> and optimize it later.
>>
>> I would prefer we improve the current code now instead of polluting more
>> with more uncontrolled threads.
>>
>>> For rte_service, it may be not a good idea to reply on it as it needs
>> explicit API calls to setup.
>>
>> I don't see the issue of the explicit API.
>> The IPC is a new service.

My concern is that not every DPDK application sets up rte_service, but 
IPC will be used for very fundamental functions, like memory allocation. 
We could not possibly ask all DPDK applications to add rte_service now.

And also take Harry's comments below into consideration, most likely, we 
will move these threads into interrupt thread now by adding

> Although I do like to see new services, if we want to enable "core" dpdk functionality with Services, we need a proper designed solution for that. Service cores is not intended for "occasional" work - there is no method to block and sleep on a specific service until work becomes available, so this would imply a busy-polling. Using a service (hence busy polling) for rte_malloc()-based memory mapping requests is inefficient, and total overkill :)
>
> For this patch I suggest to use some blocking-read capable mechanism.

The problem here is that we add too many threads; blocking-read does not 
decrease # of threads.

>
> The above said, in the longer term it would be good to have a design that allows new file-descriptors to be added to a "dpdk core" thread, which performs occasional lengthy work if the FD has data available.

Interrupt thread vs rte_service, which is the direction to go? We 
actually have some others threads, in vhost and even virtio-user; we can 
also avoid those threads if we have a clear direction.

Thanks,
Jianfeng
  
Van Haaren, Harry March 28, 2018, 9:10 a.m. UTC | #6
> -----Original Message-----
> From: Tan, Jianfeng
> Sent: Wednesday, March 28, 2018 9:55 AM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Thomas Monjalon
> <thomas@monjalon.net>
> Cc: Burakov, Anatoly <anatoly.burakov@intel.com>; dev@dpdk.org; Ananyev,
> Konstantin <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v6 2/2] eal: add asynchronous request API to
> DPDK IPC
> 
> Hi Thomas and Harry,

Hey,


> On 3/28/2018 4:22 PM, Van Haaren, Harry wrote:
> >> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> >> Sent: Wednesday, March 28, 2018 8:30 AM
> >> To: Tan, Jianfeng <jianfeng.tan@intel.com>
> >> Cc: Burakov, Anatoly <anatoly.burakov@intel.com>; dev@dpdk.org; Ananyev,
> >> Konstantin <konstantin.ananyev@intel.com>; Van Haaren, Harry
> >> <harry.van.haaren@intel.com>
> >> Subject: Re: [dpdk-dev] [PATCH v6 2/2] eal: add asynchronous request API
> to
> >> DPDK IPC
> >>
> >> 28/03/2018 04:08, Tan, Jianfeng:
> >>> Hi Thomas ,
> >>>
> >>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> >>>> 27/03/2018 15:59, Anatoly Burakov:
> >>>>> Under the hood, we create a separate thread to deal with replies to
> >>>>> asynchronous requests, that will just wait to be notified by the
> >>>>> main thread, or woken up on a timer.
> >>>> I really don't like that a library is creating a thread.
> >>>> We don't even know where the thread is created (which core).
> >>>> Can it be a rte_service? or in the interrupt thread?
> >>> Agree that we'd better not adding so many threads in a library.
> >>>
> >>> I was considering to merge all the threads into the interrupt thread,
> >> however, we don't have an interrupt thread in freebsd. Further, we don't
> >> implement alarm API in freebsd. That's why I tend to current
> implementation,
> >> and optimize it later.
> >>
> >> I would prefer we improve the current code now instead of polluting more
> >> with more uncontrolled threads.
> >>
> >>> For rte_service, it may be not a good idea to reply on it as it needs
> >> explicit API calls to setup.
> >>
> >> I don't see the issue of the explicit API.
> >> The IPC is a new service.
> 
> My concern is that not every DPDK application sets up rte_service, but
> IPC will be used for very fundamental functions, like memory allocation.
> We could not possibly ask all DPDK applications to add rte_service now.
> 
> And also take Harry's comments below into consideration, most likely, we
> will move these threads into interrupt thread now by adding

I don't suggest moving everything into interrupt thread.

We need to ensure the interrupt thread (aka, link status interrupt thread) is not
busy for too long. Certain work may cause un-acceptable delay in the interrupt
thread, hence perhaps we should have 2 core DPDK threads:

1) EAL interrupt thread: handles only interrupts, so is ~always blocked on some FDs

2) EAL "core" thread: handles the IPC work here, and any other irregular work that
   can take some time to complete. Adding all core DPDK work to this thread enables
   us to refactor and create a scalable / coherent design.


> > Although I do like to see new services, if we want to enable "core" dpdk
> functionality with Services, we need a proper designed solution for that.
> Service cores is not intended for "occasional" work - there is no method to
> block and sleep on a specific service until work becomes available, so this
> would imply a busy-polling. Using a service (hence busy polling) for
> rte_malloc()-based memory mapping requests is inefficient, and total
> overkill :)
> >
> > For this patch I suggest to use some blocking-read capable mechanism.
> 
> The problem here is that we add too many threads; blocking-read does not
> decrease # of threads.

To correct my statement just above - some method capable of blocking reads (as opposed to
busy polling). In my mind, this method *must* allow waiting on multiple FDs,
as to *require* only 1 thread. We could use more if it makes sense to do so.


> > The above said, in the longer term it would be good to have a design that
> allows new file-descriptors to be added to a "dpdk core" thread, which
> performs occasional lengthy work if the FD has data available.
> 
> Interrupt thread vs rte_service, which is the direction to go? We
> actually have some others threads, in vhost and even virtio-user; we can
> also avoid those threads if we have a clear direction.

I don't think that service is the correct hammer for this problem.

As to interrupt thread or creating a new thread, what makes more sense given
the current codebase? Is the current implementation an acceptable short-term
solution, that gets reworked to be more generic in future?
  
Bruce Richardson March 28, 2018, 9:11 a.m. UTC | #7
On Wed, Mar 28, 2018 at 09:29:35AM +0200, Thomas Monjalon wrote:
> 28/03/2018 04:08, Tan, Jianfeng:
> > Hi Thomas ,
> > 
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > 27/03/2018 15:59, Anatoly Burakov:
> > > > Under the hood, we create a separate thread to deal with replies to
> > > > asynchronous requests, that will just wait to be notified by the
> > > > main thread, or woken up on a timer.
> > > 
> > > I really don't like that a library is creating a thread.
> > > We don't even know where the thread is created (which core).
> > > Can it be a rte_service? or in the interrupt thread?
> > 
> > Agree that we'd better not adding so many threads in a library.
> > 
> > I was considering to merge all the threads into the interrupt thread, however, we don't have an interrupt thread in freebsd. Further, we don't implement alarm API in freebsd. That's why I tend to current implementation, and optimize it later.
> 
> I would prefer we improve the current code now instead of polluting more
> with more uncontrolled threads.
> 
+1
I think it would be worthwhile adding an interrupt thread to BSD, and it
should not be a massive amount of work. Having a single interrupt thread
has a lot of benefits, I think.

/Bruce
  
Anatoly Burakov March 28, 2018, 9:21 a.m. UTC | #8
On 28-Mar-18 9:55 AM, Tan, Jianfeng wrote:
> Hi Thomas and Harry,
> 
> 
> On 3/28/2018 4:22 PM, Van Haaren, Harry wrote:
>>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
>>> Sent: Wednesday, March 28, 2018 8:30 AM
>>> To: Tan, Jianfeng <jianfeng.tan@intel.com>
>>> Cc: Burakov, Anatoly <anatoly.burakov@intel.com>; dev@dpdk.org; Ananyev,
>>> Konstantin <konstantin.ananyev@intel.com>; Van Haaren, Harry
>>> <harry.van.haaren@intel.com>
>>> Subject: Re: [dpdk-dev] [PATCH v6 2/2] eal: add asynchronous request 
>>> API to
>>> DPDK IPC
>>>
>>> 28/03/2018 04:08, Tan, Jianfeng:
>>>> Hi Thomas ,
>>>>
>>>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
>>>>> 27/03/2018 15:59, Anatoly Burakov:
>>>>>> Under the hood, we create a separate thread to deal with replies to
>>>>>> asynchronous requests, that will just wait to be notified by the
>>>>>> main thread, or woken up on a timer.
>>>>> I really don't like that a library is creating a thread.
>>>>> We don't even know where the thread is created (which core).
>>>>> Can it be a rte_service? or in the interrupt thread?
>>>> Agree that we'd better not adding so many threads in a library.
>>>>
>>>> I was considering to merge all the threads into the interrupt thread,
>>> however, we don't have an interrupt thread in freebsd. Further, we don't
>>> implement alarm API in freebsd. That's why I tend to current 
>>> implementation,
>>> and optimize it later.
>>>
>>> I would prefer we improve the current code now instead of polluting more
>>> with more uncontrolled threads.
>>>
>>>> For rte_service, it may be not a good idea to reply on it as it needs
>>> explicit API calls to setup.
>>>
>>> I don't see the issue of the explicit API.
>>> The IPC is a new service.
> 
> My concern is that not every DPDK application sets up rte_service, but 
> IPC will be used for very fundamental functions, like memory allocation. 
> We could not possibly ask all DPDK applications to add rte_service now.
> 
> And also take Harry's comments below into consideration, most likely, we 
> will move these threads into interrupt thread now by adding
> 
>> Although I do like to see new services, if we want to enable "core" 
>> dpdk functionality with Services, we need a proper designed solution 
>> for that. Service cores is not intended for "occasional" work - there 
>> is no method to block and sleep on a specific service until work 
>> becomes available, so this would imply a busy-polling. Using a service 
>> (hence busy polling) for rte_malloc()-based memory mapping requests is 
>> inefficient, and total overkill :)
>>
>> For this patch I suggest to use some blocking-read capable mechanism.
> 
> The problem here is that we add too many threads; blocking-read does not 
> decrease # of threads.
> 
>>
>> The above said, in the longer term it would be good to have a design 
>> that allows new file-descriptors to be added to a "dpdk core" thread, 
>> which performs occasional lengthy work if the FD has data available.
> 
> Interrupt thread vs rte_service, which is the direction to go? We 
> actually have some others threads, in vhost and even virtio-user; we can 
> also avoid those threads if we have a clear direction.
> 
> Thanks,
> Jianfeng
> 

Hi all,

First of all, @Thomas, this is not a "new library" - it's part of EAL. 
We're going to be removing a few threads from EAL as it is because of 
IPC (Jianfeng has already submitted patches for those), so i don't think 
it's such a big deal to have two IPC threads instead of one. I'm open to 
suggestions on how to make this work without a second thread, but i 
don't see it.

We've discussed possibility of using rte_service internally, but decided 
against it for reasons already outlined by Harry - it's not a suitable 
mechanism for this kind of thing, not as it is.

Using interrupt thread for this _will_ work, however this will require a 
a lot more changes, as currently alarm API allocates everything through 
rte_malloc, while we want to use IPC for rte_malloc (which would make it 
a circular dependency). So it'll probably be more API and more 
complexity for dealing with malloc vs rte_malloc allocations. Hence the 
least-bad approach taken here: a new thread.
  
Thomas Monjalon March 28, 2018, 9:53 a.m. UTC | #9
28/03/2018 11:21, Burakov, Anatoly:
> On 28-Mar-18 9:55 AM, Tan, Jianfeng wrote:
> > On 3/28/2018 4:22 PM, Van Haaren, Harry wrote:
> >> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> >>> 28/03/2018 04:08, Tan, Jianfeng:
> >>>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> >>>>> 27/03/2018 15:59, Anatoly Burakov:
> >>>>>> Under the hood, we create a separate thread to deal with replies to
> >>>>>> asynchronous requests, that will just wait to be notified by the
> >>>>>> main thread, or woken up on a timer.
> >>>>> 
> >>>>> I really don't like that a library is creating a thread.
> >>>>> We don't even know where the thread is created (which core).
> >>>>> Can it be a rte_service? or in the interrupt thread?
> >>>> 
> >>>> Agree that we'd better not adding so many threads in a library.
> >>>>
> >>>> I was considering to merge all the threads into the interrupt thread,
> >>> however, we don't have an interrupt thread in freebsd. Further, we don't
> >>> implement alarm API in freebsd. That's why I tend to current 
> >>> implementation,
> >>> and optimize it later.
> >>>
> >>> I would prefer we improve the current code now instead of polluting more
> >>> with more uncontrolled threads.
> >>>
> >>>> For rte_service, it may be not a good idea to reply on it as it needs
> >>> explicit API calls to setup.
> >>>
> >>> I don't see the issue of the explicit API.
> >>> The IPC is a new service.
> > 
> > My concern is that not every DPDK application sets up rte_service, but 
> > IPC will be used for very fundamental functions, like memory allocation. 
> > We could not possibly ask all DPDK applications to add rte_service now.
> > 
> > And also take Harry's comments below into consideration, most likely, we 
> > will move these threads into interrupt thread now by adding
> > 
> >> Although I do like to see new services, if we want to enable "core" 
> >> dpdk functionality with Services, we need a proper designed solution 
> >> for that. Service cores is not intended for "occasional" work - there 
> >> is no method to block and sleep on a specific service until work 
> >> becomes available, so this would imply a busy-polling. Using a service 
> >> (hence busy polling) for rte_malloc()-based memory mapping requests is 
> >> inefficient, and total overkill :)
> >>
> >> For this patch I suggest to use some blocking-read capable mechanism.
> > 
> > The problem here is that we add too many threads; blocking-read does not 
> > decrease # of threads.
> > 
> >>
> >> The above said, in the longer term it would be good to have a design 
> >> that allows new file-descriptors to be added to a "dpdk core" thread, 
> >> which performs occasional lengthy work if the FD has data available.
> > 
> > Interrupt thread vs rte_service, which is the direction to go? We 
> > actually have some others threads, in vhost and even virtio-user; we can 
> > also avoid those threads if we have a clear direction.
> > 
> > Thanks,
> > Jianfeng
> > 
> 
> Hi all,
> 
> First of all, @Thomas, this is not a "new library" - it's part of EAL.

I did not say it is a new library.

> We're going to be removing a few threads from EAL as it is because of 
> IPC (Jianfeng has already submitted patches for those),

I don't understand.
Which threads are you going to remove? Which patch?

> so i don't think 
> it's such a big deal to have two IPC threads instead of one. I'm open to 
> suggestions on how to make this work without a second thread, but i 
> don't see it.

I am not against the second thread.
I am against both threads :)

> We've discussed possibility of using rte_service internally, but decided 
> against it for reasons already outlined by Harry - it's not a suitable 
> mechanism for this kind of thing, not as it is.
> 
> Using interrupt thread for this _will_ work, however this will require a 
> a lot more changes, as currently alarm API allocates everything through 
> rte_malloc, while we want to use IPC for rte_malloc (which would make it 
> a circular dependency). So it'll probably be more API and more 
> complexity for dealing with malloc vs rte_malloc allocations. Hence the 
> least-bad approach taken here: a new thread.

If everybody is happy enough with "least bad" design and not trying
to improve the core design, what can I say?
  
Anatoly Burakov March 28, 2018, 10:42 a.m. UTC | #10
On 28-Mar-18 10:53 AM, Thomas Monjalon wrote:
> 28/03/2018 11:21, Burakov, Anatoly:
>> On 28-Mar-18 9:55 AM, Tan, Jianfeng wrote:
>>> On 3/28/2018 4:22 PM, Van Haaren, Harry wrote:
>>>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
>>>>> 28/03/2018 04:08, Tan, Jianfeng:
>>>>>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
>>>>>>> 27/03/2018 15:59, Anatoly Burakov:
>>>>>>>> Under the hood, we create a separate thread to deal with replies to
>>>>>>>> asynchronous requests, that will just wait to be notified by the
>>>>>>>> main thread, or woken up on a timer.
>>>>>>>
>>>>>>> I really don't like that a library is creating a thread.
>>>>>>> We don't even know where the thread is created (which core).
>>>>>>> Can it be a rte_service? or in the interrupt thread?
>>>>>>
>>>>>> Agree that we'd better not adding so many threads in a library.
>>>>>>
>>>>>> I was considering to merge all the threads into the interrupt thread,
>>>>> however, we don't have an interrupt thread in freebsd. Further, we don't
>>>>> implement alarm API in freebsd. That's why I tend to current
>>>>> implementation,
>>>>> and optimize it later.
>>>>>
>>>>> I would prefer we improve the current code now instead of polluting more
>>>>> with more uncontrolled threads.
>>>>>
>>>>>> For rte_service, it may be not a good idea to reply on it as it needs
>>>>> explicit API calls to setup.
>>>>>
>>>>> I don't see the issue of the explicit API.
>>>>> The IPC is a new service.
>>>
>>> My concern is that not every DPDK application sets up rte_service, but
>>> IPC will be used for very fundamental functions, like memory allocation.
>>> We could not possibly ask all DPDK applications to add rte_service now.
>>>
>>> And also take Harry's comments below into consideration, most likely, we
>>> will move these threads into interrupt thread now by adding
>>>
>>>> Although I do like to see new services, if we want to enable "core"
>>>> dpdk functionality with Services, we need a proper designed solution
>>>> for that. Service cores is not intended for "occasional" work - there
>>>> is no method to block and sleep on a specific service until work
>>>> becomes available, so this would imply a busy-polling. Using a service
>>>> (hence busy polling) for rte_malloc()-based memory mapping requests is
>>>> inefficient, and total overkill :)
>>>>
>>>> For this patch I suggest to use some blocking-read capable mechanism.
>>>
>>> The problem here is that we add too many threads; blocking-read does not
>>> decrease # of threads.
>>>
>>>>
>>>> The above said, in the longer term it would be good to have a design
>>>> that allows new file-descriptors to be added to a "dpdk core" thread,
>>>> which performs occasional lengthy work if the FD has data available.
>>>
>>> Interrupt thread vs rte_service, which is the direction to go? We
>>> actually have some others threads, in vhost and even virtio-user; we can
>>> also avoid those threads if we have a clear direction.
>>>
>>> Thanks,
>>> Jianfeng
>>>
>>
>> Hi all,
>>
>> First of all, @Thomas, this is not a "new library" - it's part of EAL.
> 
> I did not say it is a new library.
> 
>> We're going to be removing a few threads from EAL as it is because of
>> IPC (Jianfeng has already submitted patches for those),
> 
> I don't understand.
> Which threads are you going to remove? Which patch?

There are separate patches that get rid of some EAL threads we have from 
before (e.g. VFIO thread) - this is the reason IPC was created in the 
first place. These aren't relevant to this patchset per se, i'm just 
saying that it's not like we're adding to the pile of already existing 
threads without taking anything away :)

> 
>> so i don't think
>> it's such a big deal to have two IPC threads instead of one. I'm open to
>> suggestions on how to make this work without a second thread, but i
>> don't see it.
> 
> I am not against the second thread.
> I am against both threads :)

Well, the first thread is already in DPDK. To provide some context, 
first implementation for DPDK IPC was suggested for 17.11, and (without 
many conceptual changes) was merged in 18.02. I think it's a bit late to 
be against both threads :)

> 
>> We've discussed possibility of using rte_service internally, but decided
>> against it for reasons already outlined by Harry - it's not a suitable
>> mechanism for this kind of thing, not as it is.
>>
>> Using interrupt thread for this _will_ work, however this will require a
>> a lot more changes, as currently alarm API allocates everything through
>> rte_malloc, while we want to use IPC for rte_malloc (which would make it
>> a circular dependency). So it'll probably be more API and more
>> complexity for dealing with malloc vs rte_malloc allocations. Hence the
>> least-bad approach taken here: a new thread.
> 
> If everybody is happy enough with "least bad" design and not trying
> to improve the core design, what can I say?

I'm not against trying to improve the core design. I'm just saying that, 
had this kind of feedback been provided just a bit earlier, I would've 
had time to fix it in time for deadlines. However, because memory rework 
patchset depends on this API, i would suggest merging it in now, as is, 
and commit to a roadmap of improvements for next release(s).

For starters, we could plan on removing alarm thread's dependency on 
rte_malloc and just use regular malloc API's in there, and rework 
asynchronous IPC API to use that instead. This shouldn't be much work, 
and will presumably make you halfway happy, as one of the threads will 
be gone :)

We can then look into removing the second thread and moving the entirety 
of DPDK IPC into the interrupt thread. I'm not too sure how would that 
work, but i haven't looked at it in any detail, so maybe it is feasible.

Can we agree on this? It would be great to do everything perfectly from 
the first try, but having a goal in sight and working towards it is fine 
too, even if not all of the steps we take are perfect.
  
Thomas Monjalon March 28, 2018, 11:26 a.m. UTC | #11
28/03/2018 12:42, Burakov, Anatoly:
> On 28-Mar-18 10:53 AM, Thomas Monjalon wrote:
> > 28/03/2018 11:21, Burakov, Anatoly:
> >> so i don't think
> >> it's such a big deal to have two IPC threads instead of one. I'm open to
> >> suggestions on how to make this work without a second thread, but i
> >> don't see it.
> > 
> > I am not against the second thread.
> > I am against both threads :)
> 
> Well, the first thread is already in DPDK. To provide some context, 
> first implementation for DPDK IPC was suggested for 17.11, and (without 
> many conceptual changes) was merged in 18.02. I think it's a bit late to 
> be against both threads :)

No, it's never too late to discuss.
Merging a patch does not prevent discussing it :)

> >> We've discussed possibility of using rte_service internally, but decided
> >> against it for reasons already outlined by Harry - it's not a suitable
> >> mechanism for this kind of thing, not as it is.
> >>
> >> Using interrupt thread for this _will_ work, however this will require a
> >> a lot more changes, as currently alarm API allocates everything through
> >> rte_malloc, while we want to use IPC for rte_malloc (which would make it
> >> a circular dependency). So it'll probably be more API and more
> >> complexity for dealing with malloc vs rte_malloc allocations. Hence the
> >> least-bad approach taken here: a new thread.
> > 
> > If everybody is happy enough with "least bad" design and not trying
> > to improve the core design, what can I say?
> 
> I'm not against trying to improve the core design. I'm just saying that, 
> had this kind of feedback been provided just a bit earlier, I would've 
> had time to fix it in time for deadlines. However, because memory rework 
> patchset depends on this API, i would suggest merging it in now, as is, 
> and commit to a roadmap of improvements for next release(s).

Actually, you had the feedback yourself from the beginning.
You decided to gave up with interrupt thread because its implementation
is not complete (and maybe far from perfect).
There are some communities where it is not acceptable to workaround
core issues because of timing issues. I think we accept it in DPDK,
but I continue to question it, in order to be sure that everybody is OK
with this kind of tradeoff.

> For starters, we could plan on removing alarm thread's dependency on 
> rte_malloc and just use regular malloc API's in there, and rework 
> asynchronous IPC API to use that instead. This shouldn't be much work, 
> and will presumably make you halfway happy, as one of the threads will 
> be gone :)
> 
> We can then look into removing the second thread and moving the entirety 
> of DPDK IPC into the interrupt thread. I'm not too sure how would that 
> work, but i haven't looked at it in any detail, so maybe it is feasible.
> 
> Can we agree on this? It would be great to do everything perfectly from 
> the first try, but having a goal in sight and working towards it is fine 
> too, even if not all of the steps we take are perfect.

The main concern is API.
If all these changes are internal only, and does not involve any major
API change, then I guess it is OK to pospone them in next release.
  
Anatoly Burakov March 28, 2018, 12:21 p.m. UTC | #12
On 28-Mar-18 12:26 PM, Thomas Monjalon wrote:
> 28/03/2018 12:42, Burakov, Anatoly:
>> On 28-Mar-18 10:53 AM, Thomas Monjalon wrote:
>>> 28/03/2018 11:21, Burakov, Anatoly:
>> I'm not against trying to improve the core design. I'm just saying that,
>> had this kind of feedback been provided just a bit earlier, I would've
>> had time to fix it in time for deadlines. However, because memory rework
>> patchset depends on this API, i would suggest merging it in now, as is,
>> and commit to a roadmap of improvements for next release(s).
> 
> Actually, you had the feedback yourself from the beginning.
> You decided to gave up with interrupt thread because its implementation
> is not complete (and maybe far from perfect).

That's not quite how i see it, but OK, suppose so.

> There are some communities where it is not acceptable to workaround
> core issues because of timing issues. I think we accept it in DPDK,
> but I continue to question it, in order to be sure that everybody is OK
> with this kind of tradeoff.

The way i see it, not all API's are equal; some are more important than 
others. This is a new, experimental API that is not core to any DPDK 
function - it's not used on any hotpaths nor is it even that demanding 
(the two threads will be sleeping 99.999% of the time anyway). I think 
we're allowed to experiment on it before settling on an implementation 
that satisfies everyone :)

>> For starters, we could plan on removing alarm thread's dependency on
>> rte_malloc and just use regular malloc API's in there, and rework
>> asynchronous IPC API to use that instead. This shouldn't be much work,
>> and will presumably make you halfway happy, as one of the threads will
>> be gone :)
>>
>> We can then look into removing the second thread and moving the entirety
>> of DPDK IPC into the interrupt thread. I'm not too sure how would that
>> work, but i haven't looked at it in any detail, so maybe it is feasible.
>>
>> Can we agree on this? It would be great to do everything perfectly from
>> the first try, but having a goal in sight and working towards it is fine
>> too, even if not all of the steps we take are perfect.
> 
> The main concern is API.
> If all these changes are internal only, and does not involve any major
> API change, then I guess it is OK to pospone them in next release.
> 

Yes, all of this is/will be internal to DPDK IPC - no externally visible 
changes whatsoever.
  

Patch

diff --git a/lib/librte_eal/common/eal_common_proc.c b/lib/librte_eal/common/eal_common_proc.c
index 52b6ab2..139b653 100644
--- a/lib/librte_eal/common/eal_common_proc.c
+++ b/lib/librte_eal/common/eal_common_proc.c
@@ -26,6 +26,7 @@ 
 #include <rte_errno.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
+#include <rte_tailq.h>
 
 #include "eal_private.h"
 #include "eal_filesystem.h"
@@ -60,13 +61,32 @@  struct mp_msg_internal {
 	struct rte_mp_msg msg;
 };
 
+struct async_request_param {
+	rte_mp_async_reply_t clb;
+	struct rte_mp_reply user_reply;
+	struct timespec end;
+	int n_responses_processed;
+};
+
 struct pending_request {
 	TAILQ_ENTRY(pending_request) next;
-	int reply_received;
+	enum {
+		REQUEST_TYPE_SYNC,
+		REQUEST_TYPE_ASYNC
+	} type;
 	char dst[PATH_MAX];
 	struct rte_mp_msg *request;
 	struct rte_mp_msg *reply;
-	pthread_cond_t cond;
+	int reply_received;
+	RTE_STD_C11
+	union {
+		struct {
+			struct async_request_param *param;
+		} async;
+		struct {
+			pthread_cond_t cond;
+		} sync;
+	};
 };
 
 TAILQ_HEAD(pending_request_list, pending_request);
@@ -74,9 +94,12 @@  TAILQ_HEAD(pending_request_list, pending_request);
 static struct {
 	struct pending_request_list requests;
 	pthread_mutex_t lock;
+	pthread_cond_t async_cond;
 } pending_requests = {
 	.requests = TAILQ_HEAD_INITIALIZER(pending_requests.requests),
-	.lock = PTHREAD_MUTEX_INITIALIZER
+	.lock = PTHREAD_MUTEX_INITIALIZER,
+	.async_cond = PTHREAD_COND_INITIALIZER
+	/**< used in async requests only */
 };
 
 /* forward declarations */
@@ -273,7 +296,12 @@  process_msg(struct mp_msg_internal *m, struct sockaddr_un *s)
 			memcpy(sync_req->reply, msg, sizeof(*msg));
 			/* -1 indicates that we've been asked to ignore */
 			sync_req->reply_received = m->type == MP_REP ? 1 : -1;
-			pthread_cond_signal(&sync_req->cond);
+
+			if (sync_req->type == REQUEST_TYPE_SYNC)
+				pthread_cond_signal(&sync_req->sync.cond);
+			else if (sync_req->type == REQUEST_TYPE_ASYNC)
+				pthread_cond_signal(
+					&pending_requests.async_cond);
 		} else
 			RTE_LOG(ERR, EAL, "Drop mp reply: %s\n", msg->name);
 		pthread_mutex_unlock(&pending_requests.lock);
@@ -320,6 +348,189 @@  mp_handle(void *arg __rte_unused)
 }
 
 static int
+timespec_cmp(const struct timespec *a, const struct timespec *b)
+{
+	if (a->tv_sec < b->tv_sec)
+		return -1;
+	if (a->tv_sec > b->tv_sec)
+		return 1;
+	if (a->tv_nsec < b->tv_nsec)
+		return -1;
+	if (a->tv_nsec > b->tv_nsec)
+		return 1;
+	return 0;
+}
+
+enum async_action {
+	ACTION_NONE, /**< don't do anything */
+	ACTION_FREE, /**< free the action entry, but don't trigger callback */
+	ACTION_TRIGGER /**< trigger callback, then free action entry */
+};
+
+static enum async_action
+process_async_request(struct pending_request *sr, const struct timespec *now)
+{
+	struct async_request_param *param;
+	struct rte_mp_reply *reply;
+	bool timeout, received, last_msg;
+
+	param = sr->async.param;
+	reply = &param->user_reply;
+
+	/* did we timeout? */
+	timeout = timespec_cmp(&param->end, now) <= 0;
+
+	/* did we receive a response? */
+	received = sr->reply_received != 0;
+
+	/* if we didn't time out, and we didn't receive a response, ignore */
+	if (!timeout && !received)
+		return ACTION_NONE;
+
+	/* if we received a response, adjust relevant data and copy mesasge. */
+	if (sr->reply_received == 1 && sr->reply) {
+		struct rte_mp_msg *msg, *user_msgs, *tmp;
+
+		msg = sr->reply;
+		user_msgs = reply->msgs;
+
+		tmp = realloc(user_msgs, sizeof(*msg) *
+				(reply->nb_received + 1));
+		if (!tmp) {
+			RTE_LOG(ERR, EAL, "Fail to alloc reply for request %s:%s\n",
+				sr->dst, sr->request->name);
+			/* this entry is going to be removed and its message
+			 * dropped, but we don't want to leak memory, so
+			 * continue.
+			 */
+		} else {
+			user_msgs = tmp;
+			reply->msgs = user_msgs;
+			memcpy(&user_msgs[reply->nb_received],
+					msg, sizeof(*msg));
+			reply->nb_received++;
+		}
+
+		/* mark this request as processed */
+		param->n_responses_processed++;
+	} else if (sr->reply_received == -1) {
+		/* we were asked to ignore this process */
+		reply->nb_sent--;
+	}
+	free(sr->reply);
+
+	last_msg = param->n_responses_processed == reply->nb_sent;
+
+	return last_msg ? ACTION_TRIGGER : ACTION_FREE;
+}
+
+static void
+trigger_async_action(struct pending_request *sr)
+{
+	struct async_request_param *param;
+	struct rte_mp_reply *reply;
+
+	param = sr->async.param;
+	reply = &param->user_reply;
+
+	param->clb(sr->request, reply);
+
+	/* clean up */
+	free(sr->async.param->user_reply.msgs);
+	free(sr->async.param);
+	free(sr->request);
+}
+
+static void *
+async_reply_handle(void *arg __rte_unused)
+{
+	struct pending_request *sr;
+	struct timeval now;
+	struct timespec timeout, ts_now;
+	while (1) {
+		struct pending_request *trigger = NULL;
+		int ret;
+		bool nowait = false;
+		bool timedwait = false;
+
+		pthread_mutex_lock(&pending_requests.lock);
+
+		/* scan through the list and see if there are any timeouts that
+		 * are earlier than our current timeout.
+		 */
+		TAILQ_FOREACH(sr, &pending_requests.requests, next) {
+			if (sr->type != REQUEST_TYPE_ASYNC)
+				continue;
+			if (!timedwait || timespec_cmp(&sr->async.param->end,
+					&timeout) < 0) {
+				memcpy(&timeout, &sr->async.param->end,
+					sizeof(timeout));
+				timedwait = true;
+			}
+
+			/* sometimes, we don't even wait */
+			if (sr->reply_received) {
+				nowait = true;
+				break;
+			}
+		}
+
+		if (nowait)
+			ret = 0;
+		else if (timedwait)
+			ret = pthread_cond_timedwait(
+					&pending_requests.async_cond,
+					&pending_requests.lock, &timeout);
+		else
+			ret = pthread_cond_wait(&pending_requests.async_cond,
+					&pending_requests.lock);
+
+		if (gettimeofday(&now, NULL) < 0) {
+			RTE_LOG(ERR, EAL, "Cannot get current time\n");
+			break;
+		}
+		ts_now.tv_nsec = now.tv_usec * 1000;
+		ts_now.tv_sec = now.tv_sec;
+
+		if (ret == 0 || ret == ETIMEDOUT) {
+			struct pending_request *next;
+			/* we've either been woken up, or we timed out */
+
+			/* we have still the lock, check if anything needs
+			 * processing.
+			 */
+			TAILQ_FOREACH_SAFE(sr, &pending_requests.requests, next,
+					next) {
+				enum async_action action;
+				if (sr->type != REQUEST_TYPE_ASYNC)
+					continue;
+
+				action = process_async_request(sr, &ts_now);
+				if (action == ACTION_FREE) {
+					TAILQ_REMOVE(&pending_requests.requests,
+							sr, next);
+					free(sr);
+				} else if (action == ACTION_TRIGGER &&
+						trigger == NULL) {
+					TAILQ_REMOVE(&pending_requests.requests,
+							sr, next);
+					trigger = sr;
+				}
+			}
+		}
+		pthread_mutex_unlock(&pending_requests.lock);
+		if (trigger) {
+			trigger_async_action(trigger);
+			free(trigger);
+		}
+	};
+
+	RTE_LOG(ERR, EAL, "ERROR: asynchronous requests disabled\n");
+
+	return NULL;
+}
+
+static int
 open_socket_fd(void)
 {
 	char peer_name[PATH_MAX] = {0};
@@ -382,7 +593,7 @@  rte_mp_channel_init(void)
 	char thread_name[RTE_MAX_THREAD_NAME_LEN];
 	char path[PATH_MAX];
 	int dir_fd;
-	pthread_t tid;
+	pthread_t mp_handle_tid, async_reply_handle_tid;
 
 	/* create filter path */
 	create_socket_path("*", path, sizeof(path));
@@ -419,7 +630,16 @@  rte_mp_channel_init(void)
 		return -1;
 	}
 
-	if (pthread_create(&tid, NULL, mp_handle, NULL) < 0) {
+	if (pthread_create(&mp_handle_tid, NULL, mp_handle, NULL) < 0) {
+		RTE_LOG(ERR, EAL, "failed to create mp thead: %s\n",
+			strerror(errno));
+		close(mp_fd);
+		mp_fd = -1;
+		return -1;
+	}
+
+	if (pthread_create(&async_reply_handle_tid, NULL,
+			async_reply_handle, NULL) < 0) {
 		RTE_LOG(ERR, EAL, "failed to create mp thead: %s\n",
 			strerror(errno));
 		close(mp_fd);
@@ -430,7 +650,11 @@  rte_mp_channel_init(void)
 
 	/* try best to set thread name */
 	snprintf(thread_name, RTE_MAX_THREAD_NAME_LEN, "rte_mp_handle");
-	rte_thread_setname(tid, thread_name);
+	rte_thread_setname(mp_handle_tid, thread_name);
+
+	/* try best to set thread name */
+	snprintf(thread_name, RTE_MAX_THREAD_NAME_LEN, "rte_mp_async_handle");
+	rte_thread_setname(async_reply_handle_tid, thread_name);
 
 	/* unlock the directory */
 	flock(dir_fd, LOCK_UN);
@@ -602,18 +826,77 @@  rte_mp_sendmsg(struct rte_mp_msg *msg)
 }
 
 static int
-mp_request_one(const char *dst, struct rte_mp_msg *req,
+mp_request_async(const char *dst, struct rte_mp_msg *req,
+		struct async_request_param *param)
+{
+	struct rte_mp_msg *reply_msg;
+	struct pending_request *sync_req, *exist;
+	int ret;
+
+	sync_req = malloc(sizeof(*sync_req));
+	reply_msg = malloc(sizeof(*reply_msg));
+	if (sync_req == NULL || reply_msg == NULL) {
+		RTE_LOG(ERR, EAL, "Could not allocate space for sync request\n");
+		rte_errno = ENOMEM;
+		ret = -1;
+		goto fail;
+	}
+
+	memset(sync_req, 0, sizeof(*sync_req));
+	memset(reply_msg, 0, sizeof(*reply_msg));
+
+	sync_req->type = REQUEST_TYPE_ASYNC;
+	strcpy(sync_req->dst, dst);
+	sync_req->request = req;
+	sync_req->reply = reply_msg;
+	sync_req->async.param = param;
+
+	/* queue already locked by caller */
+
+	exist = find_sync_request(dst, req->name);
+	if (!exist) {
+		TAILQ_INSERT_TAIL(&pending_requests.requests, sync_req, next);
+	} else {
+		RTE_LOG(ERR, EAL, "A pending request %s:%s\n", dst, req->name);
+		rte_errno = EEXIST;
+		ret = -1;
+		goto fail;
+	}
+
+	ret = send_msg(dst, req, MP_REQ);
+	if (ret < 0) {
+		RTE_LOG(ERR, EAL, "Fail to send request %s:%s\n",
+			dst, req->name);
+		ret = -1;
+		goto fail;
+	} else if (ret == 0) {
+		ret = 0;
+		goto fail;
+	}
+
+	param->user_reply.nb_sent++;
+
+	return 0;
+fail:
+	free(sync_req);
+	free(reply_msg);
+	return ret;
+}
+
+static int
+mp_request_sync(const char *dst, struct rte_mp_msg *req,
 	       struct rte_mp_reply *reply, const struct timespec *ts)
 {
 	int ret;
 	struct rte_mp_msg msg, *tmp;
 	struct pending_request sync_req, *exist;
 
+	sync_req.type = REQUEST_TYPE_SYNC;
 	sync_req.reply_received = 0;
 	strcpy(sync_req.dst, dst);
 	sync_req.request = req;
 	sync_req.reply = &msg;
-	pthread_cond_init(&sync_req.cond, NULL);
+	pthread_cond_init(&sync_req.sync.cond, NULL);
 
 	pthread_mutex_lock(&pending_requests.lock);
 	exist = find_sync_request(dst, req->name);
@@ -637,7 +920,7 @@  mp_request_one(const char *dst, struct rte_mp_msg *req,
 	reply->nb_sent++;
 
 	do {
-		ret = pthread_cond_timedwait(&sync_req.cond,
+		ret = pthread_cond_timedwait(&sync_req.sync.cond,
 				&pending_requests.lock, ts);
 	} while (ret != 0 && ret != ETIMEDOUT);
 
@@ -703,7 +986,7 @@  rte_mp_request(struct rte_mp_msg *req, struct rte_mp_reply *reply,
 
 	/* for secondary process, send request to the primary process only */
 	if (rte_eal_process_type() == RTE_PROC_SECONDARY)
-		return mp_request_one(eal_mp_socket_path(), req, reply, &end);
+		return mp_request_sync(eal_mp_socket_path(), req, reply, &end);
 
 	/* for primary process, broadcast request, and collect reply 1 by 1 */
 	mp_dir = opendir(mp_dir_path);
@@ -732,7 +1015,7 @@  rte_mp_request(struct rte_mp_msg *req, struct rte_mp_reply *reply,
 		snprintf(path, sizeof(path), "%s/%s", mp_dir_path,
 			 ent->d_name);
 
-		if (mp_request_one(path, req, reply, &end))
+		if (mp_request_sync(path, req, reply, &end))
 			ret = -1;
 	}
 	/* unlock the directory */
@@ -744,9 +1027,158 @@  rte_mp_request(struct rte_mp_msg *req, struct rte_mp_reply *reply,
 }
 
 int __rte_experimental
-rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
+rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
+		rte_mp_async_reply_t clb)
 {
+	struct rte_mp_msg *copy;
+	struct pending_request *dummy;
+	struct async_request_param *param;
+	struct rte_mp_reply *reply;
+	int dir_fd, ret = 0;
+	DIR *mp_dir;
+	struct dirent *ent;
+	struct timeval now;
+	struct timespec *end;
+	bool dummy_used = false;
+
+	RTE_LOG(DEBUG, EAL, "request: %s\n", req->name);
+
+	if (check_input(req) == false)
+		return -1;
+	if (gettimeofday(&now, NULL) < 0) {
+		RTE_LOG(ERR, EAL, "Faile to get current time\n");
+		rte_errno = errno;
+		return -1;
+	}
+	copy = malloc(sizeof(*copy));
+	dummy = malloc(sizeof(*dummy));
+	param = malloc(sizeof(*param));
+	if (copy == NULL || dummy == NULL || param == NULL) {
+		RTE_LOG(ERR, EAL, "Failed to allocate memory for async reply\n");
+		rte_errno = ENOMEM;
+		goto fail;
+	}
+
+	memset(copy, 0, sizeof(*copy));
+	memset(dummy, 0, sizeof(*dummy));
+	memset(param, 0, sizeof(*param));
+
+	/* copy message */
+	memcpy(copy, req, sizeof(*copy));
+
+	param->n_responses_processed = 0;
+	param->clb = clb;
+	end = &param->end;
+	reply = &param->user_reply;
+
+	end->tv_nsec = (now.tv_usec * 1000 + ts->tv_nsec) % 1000000000;
+	end->tv_sec = now.tv_sec + ts->tv_sec +
+			(now.tv_usec * 1000 + ts->tv_nsec) / 1000000000;
+	reply->nb_sent = 0;
+	reply->nb_received = 0;
+	reply->msgs = NULL;
+
+	/* we have to lock the request queue here, as we will be adding a bunch
+	 * of requests to the queue at once, and some of the replies may arrive
+	 * before we add all of the requests to the queue.
+	 */
+	pthread_mutex_lock(&pending_requests.lock);
 
+	/* we have to ensure that callback gets triggered even if we don't send
+	 * anything, therefore earlier we have allocated a dummy request. fill
+	 * it, and put it on the queue if we don't send any requests.
+	 */
+	dummy->type = REQUEST_TYPE_ASYNC;
+	dummy->request = copy;
+	dummy->reply = NULL;
+	dummy->async.param = param;
+	dummy->reply_received = 1; /* short-circuit the timeout */
+
+	/* for secondary process, send request to the primary process only */
+	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+		ret = mp_request_async(eal_mp_socket_path(), copy, param);
+
+		/* if we didn't send anything, put dummy request on the queue */
+		if (ret == 0 && reply->nb_sent == 0) {
+			TAILQ_INSERT_TAIL(&pending_requests.requests, dummy,
+					next);
+			dummy_used = true;
+		}
+
+		pthread_mutex_unlock(&pending_requests.lock);
+
+		/* if we couldn't send anything, clean up */
+		if (ret != 0)
+			goto fail;
+		return 0;
+	}
+
+	/* for primary process, broadcast request */
+	mp_dir = opendir(mp_dir_path);
+	if (!mp_dir) {
+		RTE_LOG(ERR, EAL, "Unable to open directory %s\n", mp_dir_path);
+		rte_errno = errno;
+		goto unlock_fail;
+	}
+	dir_fd = dirfd(mp_dir);
+
+	/* lock the directory to prevent processes spinning up while we send */
+	if (flock(dir_fd, LOCK_EX)) {
+		RTE_LOG(ERR, EAL, "Unable to lock directory %s\n",
+			mp_dir_path);
+		rte_errno = errno;
+		goto closedir_fail;
+	}
+
+	while ((ent = readdir(mp_dir))) {
+		char path[PATH_MAX];
+
+		if (fnmatch(mp_filter, ent->d_name, 0) != 0)
+			continue;
+
+		snprintf(path, sizeof(path), "%s/%s", mp_dir_path,
+			 ent->d_name);
+
+		if (mp_request_async(path, copy, param))
+			ret = -1;
+	}
+	/* if we didn't send anything, put dummy request on the queue */
+	if (ret == 0 && reply->nb_sent == 0) {
+		TAILQ_INSERT_HEAD(&pending_requests.requests, dummy, next);
+		dummy_used = true;
+	}
+
+	/* trigger async request thread wake up */
+	pthread_cond_signal(&pending_requests.async_cond);
+
+	/* finally, unlock the queue */
+	pthread_mutex_unlock(&pending_requests.lock);
+
+	/* unlock the directory */
+	flock(dir_fd, LOCK_UN);
+
+	/* dir_fd automatically closed on closedir */
+	closedir(mp_dir);
+
+	/* if dummy was unused, free it */
+	if (!dummy_used)
+		free(dummy);
+
+	return ret;
+closedir_fail:
+	closedir(mp_dir);
+unlock_fail:
+	pthread_mutex_unlock(&pending_requests.lock);
+fail:
+	free(dummy);
+	free(param);
+	free(copy);
+	return -1;
+}
+
+int __rte_experimental
+rte_mp_reply(struct rte_mp_msg *msg, const char *peer)
+{
 	RTE_LOG(DEBUG, EAL, "reply: %s\n", msg->name);
 
 	if (check_input(msg) == false)
diff --git a/lib/librte_eal/common/include/rte_eal.h b/lib/librte_eal/common/include/rte_eal.h
index 044474e..87ebfd0 100644
--- a/lib/librte_eal/common/include/rte_eal.h
+++ b/lib/librte_eal/common/include/rte_eal.h
@@ -230,6 +230,16 @@  struct rte_mp_reply {
 typedef int (*rte_mp_t)(const struct rte_mp_msg *msg, const void *peer);
 
 /**
+ * Asynchronous reply function typedef used by other components.
+ *
+ * As we create socket channel for primary/secondary communication, use
+ * this function typedef to register action for coming responses to asynchronous
+ * requests.
+ */
+typedef int (*rte_mp_async_reply_t)(const struct rte_mp_msg *request,
+		const struct rte_mp_reply *reply);
+
+/**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
  *
@@ -321,6 +331,32 @@  rte_mp_request(struct rte_mp_msg *req, struct rte_mp_reply *reply,
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
  *
+ * Send a request to the peer process and expect a reply in a separate callback.
+ *
+ * This function sends a request message to the peer process, and will not
+ * block. Instead, reply will be received in a separate callback.
+ *
+ * @param req
+ *   The req argument contains the customized request message.
+ *
+ * @param ts
+ *   The ts argument specifies how long we can wait for the peer(s) to reply.
+ *
+ * @param clb
+ *   The callback to trigger when all responses for this request have arrived.
+ *
+ * @return
+ *  - On success, return 0.
+ *  - On failure, return -1, and the reason will be stored in rte_errno.
+ */
+int __rte_experimental
+rte_mp_request_async(struct rte_mp_msg *req, const struct timespec *ts,
+		rte_mp_async_reply_t clb);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
  * Send a reply to the peer process.
  *
  * This function will send a reply message in response to a request message
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index d123602..328a0be 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -225,6 +225,7 @@  EXPERIMENTAL {
 	rte_mp_action_unregister;
 	rte_mp_sendmsg;
 	rte_mp_request;
+	rte_mp_request_async;
 	rte_mp_reply;
 	rte_service_attr_get;
 	rte_service_attr_reset_all;