[RFC] mem: do not merge elems from different heap allocs

Message ID 20181129191330.31792-1-james.r.harris@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [RFC] mem: do not merge elems from different heap allocs

Checks

Context Check Description
ci/Intel-compilation success Compilation OK
ci/checkpatch warning coding style issues

Commit Message

Harris, James R Nov. 29, 2018, 7:13 p.m. UTC
  SPDK uses the rte_mem_event_callback_register API to
create RDMA memory regions (MRs) for newly allocated regions
of memory.  This is used in both the SPDK NVMe-oF target
and the NVMe-oF host driver.

DPDK creates internal malloc_elem structures for these
allocated regions.  As users malloc and free memory, DPDK
will sometimes merge malloc_elems that originated from
different allocations that were notified through the
registered mem_event callback routine.  This results
in subsequent allocations that can span across multiple
RDMA MRs.  This requires SPDK to check each DPDK buffer to
see if it crosses an MR boundary, and if so, to add
considerable logic and complexity to describe that
buffer before it can be accessed by the RNIC.  It is somewhat
analogous to rte_malloc returning a buffer that is not
IOVA-contiguous.

As a malloc_elem gets split and some of these elements
get freed, it can also result in DPDK sending an
RTE_MEM_EVENT_FREE notification for a subset of the
original RTE_MEM_EVENT_ALLOC notification.  This is also
problematic for RDMA memory regions, since unregistering
the memory region is all-or-nothing.  It is not possible
to unregister part of a memory region.

SPDK is currently working around this latter issue by
setting the RTE_MEMSEG_FLAG_DO_NOT_FREE on each of the
memory segments associated with an RTE_MEM_EVENT_ALLOC
notification.  But the first issue with merged elements
crossing MR boundaries is still possible and we have
some rather simple allocation patterns that can
trigger it.  So we would like to propose disabling the
merging of malloc_elems that originate from different
RTE_MEM_EVENT_ALLOC events.

This patch demonstrates how this merging could be
avoided.  There are some allocation patterns (especially
frequent multi-huge-page rte_mallocs and rte_frees) that
could result in higher memory usage which need to be
evaluated.  It's possible this behavior would need to be
explicitly enabled through an init flag or a separate API,
or implicitly based on whether any memory registration
callbacks have been registered - implementation of one
of those is deferred pending feedback on this RFC.

Signed-off-by: Jim Harris <james.r.harris@intel.com>
---
 lib/librte_eal/common/malloc_elem.c | 14 ++++++++++----
 lib/librte_eal/common/malloc_elem.h |  6 +++++-
 lib/librte_eal/common/malloc_heap.c |  7 ++++++-
 3 files changed, 21 insertions(+), 6 deletions(-)
  

Comments

Anatoly Burakov Dec. 4, 2018, 4:49 p.m. UTC | #1
On 29-Nov-18 7:13 PM, Jim Harris wrote:
> SPDK uses the rte_mem_event_callback_register API to
> create RDMA memory regions (MRs) for newly allocated regions
> of memory.  This is used in both the SPDK NVMe-oF target
> and the NVMe-oF host driver.
> 
> DPDK creates internal malloc_elem structures for these
> allocated regions.  As users malloc and free memory, DPDK
> will sometimes merge malloc_elems that originated from
> different allocations that were notified through the
> registered mem_event callback routine.  This results
> in subsequent allocations that can span across multiple
> RDMA MRs.  This requires SPDK to check each DPDK buffer to
> see if it crosses an MR boundary, and if so, to add
> considerable logic and complexity to describe that
> buffer before it can be accessed by the RNIC.  It is somewhat
> analogous to rte_malloc returning a buffer that is not
> IOVA-contiguous.
> 
> As a malloc_elem gets split and some of these elements
> get freed, it can also result in DPDK sending an
> RTE_MEM_EVENT_FREE notification for a subset of the
> original RTE_MEM_EVENT_ALLOC notification.  This is also
> problematic for RDMA memory regions, since unregistering
> the memory region is all-or-nothing.  It is not possible
> to unregister part of a memory region.
> 
> SPDK is currently working around this latter issue by
> setting the RTE_MEMSEG_FLAG_DO_NOT_FREE on each of the
> memory segments associated with an RTE_MEM_EVENT_ALLOC
> notification.  But the first issue with merged elements
> crossing MR boundaries is still possible and we have
> some rather simple allocation patterns that can
> trigger it.  So we would like to propose disabling the
> merging of malloc_elems that originate from different
> RTE_MEM_EVENT_ALLOC events.
> 
> This patch demonstrates how this merging could be
> avoided.  There are some allocation patterns (especially
> frequent multi-huge-page rte_mallocs and rte_frees) that
> could result in higher memory usage which need to be
> evaluated.  It's possible this behavior would need to be
> explicitly enabled through an init flag or a separate API,
> or implicitly based on whether any memory registration
> callbacks have been registered - implementation of one
> of those is deferred pending feedback on this RFC.
> 
> Signed-off-by: Jim Harris <james.r.harris@intel.com>

Hi Jim,

I can see the use case, and i don't think there's a better way to do 
this than to fix it at the allocator stage.

You're correct in pointing out that some amount of memory will be wasted 
with this approach due to free space fragmentation. However, if we make 
this behavior non-default, that is fine.

One thing i'm a bit concerned about is that this change will push malloc 
element size into two cache lines on any arch that is 64-bit and has a 
64-byte cache line. Malloc is not a performance API, so structure size 
is not important as long as it's cache-line aligned, but wasting an 
additional 64 bytes per allocation on some architectures (including 
ours!) can be a concern.

Moreover, this extra overhead will be there regardless of whether this 
functionality is enabled or disabled (unless we're talking #ifdef-ery, 
which i'd rather not do). This overhead will even be there on FreeBSD, 
which doesn't support dynamic memory allocation in the first place. I 
don't see a good way to avoid this other than #ifdef-magic, but as i 
said, i'd really like to avoid having another config option (or worse, 
OS-specific code in common files).

I personally think an additional cache line per allocation on x86_64 and 
similar archs isn't *that* big of a price to pay for this functionality 
(which i think is useful for certain use cases like yours!), but i'm 
adding Techboard folks to this thread in case i'm mistaken on that front :)

Provided we go forward with the proposal, i believe the best way to 
implement this would be through an EAL flag.

Also, some code comments below.

> ---
>   lib/librte_eal/common/malloc_elem.c | 14 ++++++++++----
>   lib/librte_eal/common/malloc_elem.h |  6 +++++-
>   lib/librte_eal/common/malloc_heap.c |  7 ++++++-
>   3 files changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
> index 9d3dcb6a9..785ecdadd 100644
> --- a/lib/librte_eal/common/malloc_elem.c
> +++ b/lib/librte_eal/common/malloc_elem.c
> @@ -110,7 +110,8 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
>    */
>   void
>   malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
> -		struct rte_memseg_list *msl, size_t size)
> +		struct rte_memseg_list *msl, size_t size,
> +		struct malloc_elem *orig_elem, size_t orig_size)
>   {
>   	elem->heap = heap;
>   	elem->msl = msl;
> @@ -120,6 +121,8 @@ malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
>   	elem->state = ELEM_FREE;
>   	elem->size = size;
>   	elem->pad = 0;
> +	elem->orig_elem = orig_elem;
> +	elem->orig_size = orig_size;
>   	set_header(elem);
>   	set_trailer(elem);
>   }
> @@ -278,7 +281,8 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
>   	const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem;
>   	const size_t new_elem_size = elem->size - old_elem_size;
>   
> -	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size);
> +	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
> +			 elem->orig_elem, elem->orig_size);
>   	split_pt->prev = elem;
>   	split_pt->next = next_elem;
>   	if (next_elem)
> @@ -469,7 +473,8 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem)
>   	 * with it, need to remove from free list.
>   	 */
>   	if (elem->next != NULL && elem->next->state == ELEM_FREE &&
> -			next_elem_is_adjacent(elem)) {
> +			next_elem_is_adjacent(elem) &&
> +			elem->orig_elem == elem->next->orig_elem) {

Why not make this check part of adjacency test? IMO it would be much 
clearer that way.

>   		void *erase;
>   		size_t erase_len;
>   
> @@ -490,7 +495,8 @@ malloc_elem_join_adjacent_free(struct malloc_elem *elem)
>   	 * with it, need to remove from free list.
>   	 */
>   	if (elem->prev != NULL && elem->prev->state == ELEM_FREE &&
> -			prev_elem_is_adjacent(elem)) {
> +			prev_elem_is_adjacent(elem) &&
> +			elem->orig_elem == elem->prev->orig_elem) {

Same as above.

>   		struct malloc_elem *new_elem;
>   		void *erase;
>   		size_t erase_len;
> diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h
> index e2bda4c02..207767c94 100644
> --- a/lib/librte_eal/common/malloc_elem.h
> +++ b/lib/librte_eal/common/malloc_elem.h
> @@ -32,6 +32,8 @@ struct malloc_elem {
>   	volatile enum elem_state state;
>   	uint32_t pad;
>   	size_t size;
> +	struct malloc_elem *orig_elem;
> +	size_t orig_size;
>   #ifdef RTE_MALLOC_DEBUG
>   	uint64_t header_cookie;         /* Cookie marking start of data */
>   	                                /* trailer cookie at start + size */

This changes struct size to two cache lines on x86_64, which breaks the 
malloc autotest. That's not the fault of this patchset (malloc autotest 
was kind of stupid in how it handled overhead anyway), but this can be 
fixed by detecting malloc overhead size at runtime. Something like the 
following:


static int
test_multi_alloc_statistics(void)
{
	<snip>
	struct rte_malloc_socket_stats pre_stats, post_stats;
	int overhead = 0;
	
	rte_malloc_get_socket_stats(socket, &pre_stats);

	/* allocate one cacheline */
	dummy = rte_zmalloc_socket(NULL, RTE_CACHE_LINE_SIZE, 0, socket);
	if (dummy == NULL)
		return -1;

	rte_malloc_get_socket_stats(socket, &post_stats);

	/* after subtracting cache line, remainder is overhead */
	overhead = post_stats.heap_allocsz_bytes - pre_stats.heap_allocsz_bytes;
	overhead -= RTE_CACHE_LINE_SIZE;

	rte_free(dummy);

	<proceed with test>
}

> @@ -116,7 +118,9 @@ void
>   malloc_elem_init(struct malloc_elem *elem,
>   		struct malloc_heap *heap,
>   		struct rte_memseg_list *msl,
> -		size_t size);
> +		size_t size,
> +		struct malloc_elem *orig_elem,
> +		size_t orig_size);
>   
>   void
>   malloc_elem_insert(struct malloc_elem *elem);
> diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
> index c6a6d4f6b..602528299 100644
> --- a/lib/librte_eal/common/malloc_heap.c
> +++ b/lib/librte_eal/common/malloc_heap.c
> @@ -94,7 +94,7 @@ malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
>   {
>   	struct malloc_elem *elem = start;
>   
> -	malloc_elem_init(elem, heap, msl, len);
> +	malloc_elem_init(elem, heap, msl, len, elem, len);
>   
>   	malloc_elem_insert(elem);
>   
> @@ -857,6 +857,11 @@ malloc_heap_free(struct malloc_elem *elem)
>   	if (elem->size < page_sz)
>   		goto free_unlock;
>   
> +	/* we can only free memory back to the system as it was
> +	 * originally allocated */
> +	if (elem->size != elem->orig_size)
> +		goto free_unlock;
> +
>   	/* probably, but let's make sure, as we may not be using up full page */
>   	start = elem;
>   	len = elem->size;
>
  

Patch

diff --git a/lib/librte_eal/common/malloc_elem.c b/lib/librte_eal/common/malloc_elem.c
index 9d3dcb6a9..785ecdadd 100644
--- a/lib/librte_eal/common/malloc_elem.c
+++ b/lib/librte_eal/common/malloc_elem.c
@@ -110,7 +110,8 @@  malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
  */
 void
 malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
-		struct rte_memseg_list *msl, size_t size)
+		struct rte_memseg_list *msl, size_t size,
+		struct malloc_elem *orig_elem, size_t orig_size)
 {
 	elem->heap = heap;
 	elem->msl = msl;
@@ -120,6 +121,8 @@  malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 	elem->state = ELEM_FREE;
 	elem->size = size;
 	elem->pad = 0;
+	elem->orig_elem = orig_elem;
+	elem->orig_size = orig_size;
 	set_header(elem);
 	set_trailer(elem);
 }
@@ -278,7 +281,8 @@  split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
 	const size_t old_elem_size = (uintptr_t)split_pt - (uintptr_t)elem;
 	const size_t new_elem_size = elem->size - old_elem_size;
 
-	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size);
+	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
+			 elem->orig_elem, elem->orig_size);
 	split_pt->prev = elem;
 	split_pt->next = next_elem;
 	if (next_elem)
@@ -469,7 +473,8 @@  malloc_elem_join_adjacent_free(struct malloc_elem *elem)
 	 * with it, need to remove from free list.
 	 */
 	if (elem->next != NULL && elem->next->state == ELEM_FREE &&
-			next_elem_is_adjacent(elem)) {
+			next_elem_is_adjacent(elem) &&
+			elem->orig_elem == elem->next->orig_elem) {
 		void *erase;
 		size_t erase_len;
 
@@ -490,7 +495,8 @@  malloc_elem_join_adjacent_free(struct malloc_elem *elem)
 	 * with it, need to remove from free list.
 	 */
 	if (elem->prev != NULL && elem->prev->state == ELEM_FREE &&
-			prev_elem_is_adjacent(elem)) {
+			prev_elem_is_adjacent(elem) &&
+			elem->orig_elem == elem->prev->orig_elem) {
 		struct malloc_elem *new_elem;
 		void *erase;
 		size_t erase_len;
diff --git a/lib/librte_eal/common/malloc_elem.h b/lib/librte_eal/common/malloc_elem.h
index e2bda4c02..207767c94 100644
--- a/lib/librte_eal/common/malloc_elem.h
+++ b/lib/librte_eal/common/malloc_elem.h
@@ -32,6 +32,8 @@  struct malloc_elem {
 	volatile enum elem_state state;
 	uint32_t pad;
 	size_t size;
+	struct malloc_elem *orig_elem;
+	size_t orig_size;
 #ifdef RTE_MALLOC_DEBUG
 	uint64_t header_cookie;         /* Cookie marking start of data */
 	                                /* trailer cookie at start + size */
@@ -116,7 +118,9 @@  void
 malloc_elem_init(struct malloc_elem *elem,
 		struct malloc_heap *heap,
 		struct rte_memseg_list *msl,
-		size_t size);
+		size_t size,
+		struct malloc_elem *orig_elem,
+		size_t orig_size);
 
 void
 malloc_elem_insert(struct malloc_elem *elem);
diff --git a/lib/librte_eal/common/malloc_heap.c b/lib/librte_eal/common/malloc_heap.c
index c6a6d4f6b..602528299 100644
--- a/lib/librte_eal/common/malloc_heap.c
+++ b/lib/librte_eal/common/malloc_heap.c
@@ -94,7 +94,7 @@  malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
 {
 	struct malloc_elem *elem = start;
 
-	malloc_elem_init(elem, heap, msl, len);
+	malloc_elem_init(elem, heap, msl, len, elem, len);
 
 	malloc_elem_insert(elem);
 
@@ -857,6 +857,11 @@  malloc_heap_free(struct malloc_elem *elem)
 	if (elem->size < page_sz)
 		goto free_unlock;
 
+	/* we can only free memory back to the system as it was
+	 * originally allocated */
+	if (elem->size != elem->orig_size)
+		goto free_unlock;
+
 	/* probably, but let's make sure, as we may not be using up full page */
 	start = elem;
 	len = elem->size;