mem: fix deadlock on secondary allocation

Message ID 4e0688f841f6ba2408fde949aabce8e36c0d46f0.1611934186.git.anatoly.burakov@intel.com (mailing list archive)
State Accepted, archived
Delegated to: Thomas Monjalon
Series mem: fix deadlock on secondary allocation

Checks

Context                        Check    Description
ci/checkpatch                  success  coding style OK
ci/iol-broadcom-Functional     success  Functional Testing PASS
ci/iol-broadcom-Performance    success  Performance Testing PASS
ci/Intel-compilation           success  Compilation OK
ci/intel-Testing               success  Testing PASS
ci/iol-intel-Performance       success  Performance Testing PASS
ci/iol-abi-testing             success  Testing PASS
ci/iol-testing                 warning  Testing issues

Commit Message

Anatoly Burakov Jan. 29, 2021, 3:29 p.m. UTC
The previous fix used `rte_malloc_heap_socket_is_external()` to check
whether the heap was an external heap. However, that API is thread-safe
(it takes the memory lock itself), and when we're inside the allocation
process, we're already write-locked, so calling
`rte_malloc_heap_socket_is_external()` results in a deadlock followed
by a timeout.

Fix it by replacing the API call with a check against the maximum number
of NUMA nodes, because external heaps always have socket IDs at or above
that number.

Fixes: 7ac31e82bc8f ("mem: improve parameter checking on memory hotplug")

Reported-by: Jim Harris <james.r.harris@intel.com>

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/common/malloc_mp.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
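
The locking problem described in the commit message can be illustrated with a small standalone sketch. Everything here is hypothetical and is not DPDK code: `toy_rwlock`, `locked_query()` and `query_while_write_locked()` are made-up stand-ins for the shared memory lock, the thread-safe query API and the allocation path. The sketch uses try-lock operations so that it returns an error where a blocking lock would wait, which is the situation that ends in a timeout.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock (hypothetical): -1 = write-locked, 0 = free, >0 = reader count. */
struct toy_rwlock {
	atomic_int state;
};

static struct toy_rwlock mem_lock; /* static storage: starts at 0, i.e. free */

static bool toy_write_trylock(struct toy_rwlock *l)
{
	int expected = 0;

	/* succeeds only when nobody holds the lock */
	return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

static void toy_write_unlock(struct toy_rwlock *l)
{
	atomic_store(&l->state, 0);
}

static bool toy_read_trylock(struct toy_rwlock *l)
{
	int cur = atomic_load(&l->state);

	/* readers may enter only while no writer is present */
	while (cur >= 0) {
		if (atomic_compare_exchange_weak(&l->state, &cur, cur + 1))
			return true;
	}
	return false;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
	atomic_fetch_sub(&l->state, 1);
}

/* Stand-in for a "thread-safe" query API: it takes the read lock itself. */
static int locked_query(void)
{
	if (!toy_read_trylock(&mem_lock))
		return -1; /* a blocking lock would wait here instead */
	toy_read_unlock(&mem_lock);
	return 0;
}

/* Stand-in for the allocation path: the write lock is already held. */
static int query_while_write_locked(void)
{
	int ret;

	if (!toy_write_trylock(&mem_lock))
		return -2;
	ret = locked_query();
	toy_write_unlock(&mem_lock);
	return ret;
}
```

With the lock free, `locked_query()` returns 0. Called through `query_while_write_locked()`, it returns -1, because the read lock cannot be acquired while the write lock is held.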
  

Comments

Thomas Monjalon Jan. 29, 2021, 3:40 p.m. UTC | #1
29/01/2021 16:29, Anatoly Burakov:
> Previous fix used `rte_malloc_heap_socket_is_external()` to check if the
> heap was an external heap. However, that API is thread-safe, and when
> we're inside the allocation process, we're already write-locked, so
> calling `rte_malloc_heap_socket_is_external()` will result in a
> deadlock followed by a timeout.
> 
> Fix it by replacing the API call with a check against maximum number of
> NUMA nodes, because external heaps always have higher socket ID's.

Are there any unit tests for this kind of thing?

> 
> Fixes: 7ac31e82bc8f ("mem: improve parameter checking on memory hotplug")
> 
> Reported-by: Jim Harris <james.r.harris@intel.com>
> 

No need for a blank line here.

> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_eal/common/malloc_mp.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_eal/common/malloc_mp.c b/lib/librte_eal/common/malloc_mp.c
> index 0b19d4d5fb..b1f7f7824b 100644
> --- a/lib/librte_eal/common/malloc_mp.c
> +++ b/lib/librte_eal/common/malloc_mp.c
> -	/* for allocations, we must only use internal heaps */
> -	if (rte_malloc_heap_socket_is_external(heap->socket_id)) {
> +	/*
> +	 * for allocations, we must only use internal heaps, but since the
> +	 * rte_malloc_heap_socket_is_external() is thread-safe and we're already
> +	 * read-locked, we'll have to take advantage of the fac that internal

fac -> fact?

> +	 * socket ID's are always lower than RTE_MAX_NUMA_NODES.
> +	 */
> +	if (heap->socket_id >= RTE_MAX_NUMA_NODES) {
  
Anatoly Burakov Jan. 29, 2021, 4:07 p.m. UTC | #2
On 29-Jan-21 3:40 PM, Thomas Monjalon wrote:
> 29/01/2021 16:29, Anatoly Burakov:
>> Previous fix used `rte_malloc_heap_socket_is_external()` to check if the
>> heap was an external heap. However, that API is thread-safe, and when
>> we're inside the allocation process, we're already write-locked, so
>> calling `rte_malloc_heap_socket_is_external()` will result in a
>> deadlock followed by a timeout.
>>
>> Fix it by replacing the API call with a check against maximum number of
>> NUMA nodes, because external heaps always have higher socket ID's.
> 
> Is there some unit tests for such thing?

I couldn't reproduce this using autotests, but Jim has SPDK tests which
triggered this error.

Since this depends on a secondary process, any test would
necessarily have to be manual here, I think.

> 
>>
>> Fixes: 7ac31e82bc8f ("mem: improve parameter checking on memory hotplug")
>>
>> Reported-by: Jim Harris <james.r.harris@intel.com>
>>
> 
> No need of blank line here.

Need to update my scripts :P

> 
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>> ---
>>   lib/librte_eal/common/malloc_mp.c | 9 +++++++--
>>   1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/lib/librte_eal/common/malloc_mp.c b/lib/librte_eal/common/malloc_mp.c
>> index 0b19d4d5fb..b1f7f7824b 100644
>> --- a/lib/librte_eal/common/malloc_mp.c
>> +++ b/lib/librte_eal/common/malloc_mp.c
>> -	/* for allocations, we must only use internal heaps */
>> -	if (rte_malloc_heap_socket_is_external(heap->socket_id)) {
>> +	/*
>> +	 * for allocations, we must only use internal heaps, but since the
>> +	 * rte_malloc_heap_socket_is_external() is thread-safe and we're already
>> +	 * read-locked, we'll have to take advantage of the fac that internal
> 
> fac -> fact?
> 

Yes.

>> +	 * socket ID's are always lower than RTE_MAX_NUMA_NODES.
>> +	 */
>> +	if (heap->socket_id >= RTE_MAX_NUMA_NODES) {
> 
> 
> 
>
  
Thomas Monjalon Jan. 29, 2021, 11:37 p.m. UTC | #3
29/01/2021 17:07, Burakov, Anatoly:
> On 29-Jan-21 3:40 PM, Thomas Monjalon wrote:
> > 29/01/2021 16:29, Anatoly Burakov:
> >> Previous fix used `rte_malloc_heap_socket_is_external()` to check if the
> >> heap was an external heap. However, that API is thread-safe, and when
> >> we're inside the allocation process, we're already write-locked, so
> >> calling `rte_malloc_heap_socket_is_external()` will result in a
> >> deadlock followed by a timeout.
> >>
> >> Fix it by replacing the API call with a check against maximum number of
> >> NUMA nodes, because external heaps always have higher socket ID's.
> > 
> > Is there some unit tests for such thing?
> 
> I couldn't reproduce this using autotests, but Jim has SPDK tests which 
> triggered this error.
> 
> Since this is dependent upon secondary process, any test would 
> necessarily have to be manual here, i think.
> 
> >> Fixes: 7ac31e82bc8f ("mem: improve parameter checking on memory hotplug")
> >>
> >> Reported-by: Jim Harris <james.r.harris@intel.com>
> >>
> > 
> > No need of blank line here.
> 
> Need to update my scripts :P
> 
> >> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >> ---
[...]
> >> +	/*
> >> +	 * for allocations, we must only use internal heaps, but since the
> >> +	 * rte_malloc_heap_socket_is_external() is thread-safe and we're already
> >> +	 * read-locked, we'll have to take advantage of the fac that internal
> > 
> > fac -> fact?
> 
> Yes.
> 
> >> +	 * socket ID's are always lower than RTE_MAX_NUMA_NODES.
> >> +	 */

Applied with minor changes, thanks.
  

Patch

diff --git a/lib/librte_eal/common/malloc_mp.c b/lib/librte_eal/common/malloc_mp.c
index 0b19d4d5fb..b1f7f7824b 100644
--- a/lib/librte_eal/common/malloc_mp.c
+++ b/lib/librte_eal/common/malloc_mp.c
@@ -241,8 +241,13 @@  handle_alloc_request(const struct malloc_mp_req *m,
 
 	heap = &mcfg->malloc_heaps[ar->malloc_heap_idx];
 
-	/* for allocations, we must only use internal heaps */
-	if (rte_malloc_heap_socket_is_external(heap->socket_id)) {
+	/*
+	 * for allocations, we must only use internal heaps, but since the
+	 * rte_malloc_heap_socket_is_external() is thread-safe and we're already
+	 * read-locked, we'll have to take advantage of the fact that internal
+	 * socket ID's are always lower than RTE_MAX_NUMA_NODES.
+	 */
+	if (heap->socket_id >= RTE_MAX_NUMA_NODES) {
 		RTE_LOG(ERR, EAL, "Attempting to allocate from external heap\n");
 		return -1;
 	}
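
The replacement check in the patch needs no lock because it only compares the heap's own socket ID against a constant. A minimal standalone sketch of that idea follows. The names `sketch_heap`, `sketch_heap_is_external()` and `sketch_handle_alloc()` are hypothetical, and `SKETCH_MAX_NUMA_NODES` is an assumed value standing in for `RTE_MAX_NUMA_NODES`, which is a build-time constant in DPDK.

```c
#include <stdbool.h>

/* Assumed value for illustration only; stands in for RTE_MAX_NUMA_NODES. */
#define SKETCH_MAX_NUMA_NODES 32

/* Hypothetical, reduced heap descriptor: only the field the check needs. */
struct sketch_heap {
	int socket_id;
};

/*
 * Classify a heap without taking any lock: internal socket IDs are
 * always lower than the maximum number of NUMA nodes, so anything at
 * or above that limit is treated as external.
 */
static bool sketch_heap_is_external(const struct sketch_heap *heap)
{
	return heap->socket_id >= SKETCH_MAX_NUMA_NODES;
}

/* Mirrors the shape of the patched check: reject external heaps. */
static int sketch_handle_alloc(const struct sketch_heap *heap)
{
	if (sketch_heap_is_external(heap))
		return -1;
	return 0;
}
```

Under these assumptions, a heap with socket ID 0 is accepted, and a heap whose socket ID equals `SKETCH_MAX_NUMA_NODES` is rejected.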