[dpdk-stable] patch 'mem: improve segment list preallocation' has been queued to stable release 18.08.1

Kevin Traynor ktraynor at redhat.com
Mon Nov 26 12:36:41 CET 2018


On 11/26/2018 11:16 AM, Burakov, Anatoly wrote:
> Hi Kevin,
> 
> FYI
> 
> http://patches.dpdk.org/patch/48338/
> 

Thanks Anatoly. There are a couple more "batches of patches" still to do, so
I'll make sure it's included (after it's applied on dpdk).

> Thanks,
> Anatoly
> 
> 
>> -----Original Message-----
>> From: Kevin Traynor [mailto:ktraynor at redhat.com]
>> Sent: Thursday, November 22, 2018 4:49 PM
>> To: Burakov, Anatoly <anatoly.burakov at intel.com>
>> Cc: dpdk stable <stable at dpdk.org>
>> Subject: patch 'mem: improve segment list preallocation' has been queued
>> to stable release 18.08.1
>>
>> Hi,
>>
>> FYI, your patch has been queued to stable release 18.08.1
>>
>> Note it hasn't been pushed to http://dpdk.org/browse/dpdk-stable yet.
>> It will be pushed if I get no objections before 11/28/18. So please shout if
>> anyone has objections.
>>
>> Also note that after the patch there's a diff of the upstream commit vs the
>> patch applied to the branch. If the code is different (ie: not only metadata
>> diffs), due for example to a change in context or macro names, please
>> double check it.
>>
>> Thanks.
>>
>> Kevin Traynor
>>
>> ---
>> From 6d552c83eacba7be4b4f2efbafd58724e07e2330 Mon Sep 17 00:00:00 2001
>> From: Anatoly Burakov <anatoly.burakov at intel.com>
>> Date: Fri, 5 Oct 2018 09:29:44 +0100
>> Subject: [PATCH] mem: improve segment list preallocation
>>
>> [ upstream commit 1dd342d0fdc4f72102f0b48c89b6a39f029004fe ]
>>
>> Current code to preallocate segment lists is trying to do everything in one go,
>> and thus ends up being convoluted, hard to understand, and, most
>> importantly, does not scale beyond initial assumptions about number of
>> NUMA nodes and number of page sizes, and therefore has issues on some
>> configurations.
>>
>> Instead of fixing these issues in the existing code, simply rewrite it to be
>> slightly less clever but much more logical, and provide ample comments to
>> explain exactly what is going on.
>>
>> We cannot use the same approach for 32-bit code because the limitations of
>> the target dictate current socket-centric approach rather than type-centric
>> approach we use on 64-bit target, so 32-bit code is left unmodified. FreeBSD
>> doesn't support NUMA so there's no complexity involved there, and thus its
>> code is much more readable and not worth changing.
>>
>> Fixes: 1d406458db47 ("mem: make segment preallocation OS-specific")
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov at intel.com>
>> ---
>>  lib/librte_eal/linuxapp/eal/eal_memory.c | 179 +++++++++++++++++------
>>  1 file changed, 137 insertions(+), 42 deletions(-)
>>
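(A quick aside to make the patch easier to review: the central idea is that
a "memory type" is a (page size, NUMA socket) pair, and preallocation is now
sized per type rather than per socket. Below is a minimal standalone sketch
of that enumeration -- illustrative code only, not part of the patch; the
page sizes and socket IDs are made-up examples.)

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    int main(void)
    {
        /* assumed example system: 2 hugepage sizes, 2 NUMA sockets */
        uint64_t page_sizes[] = { 1ULL << 21, 1ULL << 30 }; /* 2M, 1G */
        int sockets[] = { 0, 1 };
        unsigned int n_types = 0;

        /* one memory type per (page size, socket) pair, as in the patch */
        for (size_t p = 0; p < 2; p++)
            for (size_t s = 0; s < 2; s++, n_types++)
                printf("type %u: socket %d, page size %" PRIu64 "\n",
                       n_types, sockets[s], page_sizes[p]);

        /* 2 page sizes x 2 sockets = 4 memory types in total */
        return 0;
    }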
>> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> index 6131bfde2..cc2d3fb69 100644
>> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> @@ -2096,7 +2096,13 @@ memseg_primary_init(void)
>>  {
>>  	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
>> -	int i, socket_id, hpi_idx, msl_idx = 0;
>> +	struct memtype {
>> +		uint64_t page_sz;
>> +		int socket_id;
>> +	} *memtypes = NULL;
>> +	int i, hpi_idx, msl_idx;
>>  	struct rte_memseg_list *msl;
>> -	uint64_t max_mem, total_mem;
>> +	uint64_t max_mem, max_mem_per_type;
>> +	unsigned int max_seglists_per_type;
>> +	unsigned int n_memtypes, cur_type;
>>
>>  	/* no-huge does not need this at all */
>> @@ -2104,8 +2110,49 @@ memseg_primary_init(void)
>>  		return 0;
>>
>> -	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
>> -	total_mem = 0;
>> +	/*
>> +	 * figuring out amount of memory we're going to have is a long and very
>> +	 * involved process. the basic element we're operating with is a memory
>> +	 * type, defined as a combination of NUMA node ID and page size (so that
>> +	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
>> +	 *
>> +	 * deciding amount of memory going towards each memory type is a
>> +	 * balancing act between maximum segments per type, maximum memory per
>> +	 * type, and number of detected NUMA nodes. the goal is to make sure
>> +	 * each memory type gets at least one memseg list.
>> +	 *
>> +	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
>> +	 *
>> +	 * the total amount of memory per type is limited by either
>> +	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
>> +	 * of detected NUMA nodes. additionally, maximum number of segments per
>> +	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
>> +	 * smaller page sizes, it can take hundreds of thousands of segments to
>> +	 * reach the above specified per-type memory limits.
>> +	 *
>> +	 * additionally, each type may have multiple memseg lists associated
>> +	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
>> +	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
>> +	 *
>> +	 * the number of memseg lists per type is decided based on the above
>> +	 * limits, and also taking number of detected NUMA nodes, to make sure
>> +	 * that we don't run out of memseg lists before we populate all NUMA
>> +	 * nodes with memory.
>> +	 *
>> +	 * we do this in three stages. first, we collect the number of types.
>> +	 * then, we figure out memory constraints and populate the list of
>> +	 * would-be memseg lists. then, we go ahead and allocate the memseg
>> +	 * lists.
>> +	 */
>>
>> -	/* create memseg lists */
>> +	/* create space for mem types */
>> +	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
>> +	memtypes = calloc(n_memtypes, sizeof(*memtypes));
>> +	if (memtypes == NULL) {
>> +		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
>> +		return -1;
>> +	}
>> +
>> +	/* populate mem types */
>> +	cur_type = 0;
>>  	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
>>  			hpi_idx++) {
>> @@ -2116,9 +2163,6 @@ memseg_primary_init(void)
>>  		hugepage_sz = hpi->hugepage_sz;
>>
>> -		for (i = 0; i < (int) rte_socket_count(); i++) {
>> -			uint64_t max_type_mem, total_type_mem = 0;
>> -			int type_msl_idx, max_segs, total_segs = 0;
>> -
>> -			socket_id = rte_socket_id_by_idx(i);
>> +		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
>> +			int socket_id = rte_socket_id_by_idx(i);
>>
>>  #ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
>> @@ -2126,47 +2170,98 @@ memseg_primary_init(void)
>>  				break;
>>  #endif
>> +			memtypes[cur_type].page_sz = hugepage_sz;
>> +			memtypes[cur_type].socket_id = socket_id;
>>
>> -			if (total_mem >= max_mem)
>> -				break;
>> +			RTE_LOG(DEBUG, EAL, "Detected memory type: "
>> +				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
>> +				socket_id, hugepage_sz);
>> +		}
>> +	}
>>
>> -			max_type_mem = RTE_MIN(max_mem - total_mem,
>> -				(uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20);
>> -			max_segs = RTE_MAX_MEMSEG_PER_TYPE;
>> +	/* set up limits for types */
>> +	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
>> +	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
>> +			max_mem / n_memtypes);
>> +	/*
>> +	 * limit maximum number of segment lists per type to ensure there's
>> +	 * space for memseg lists for all NUMA nodes with all page sizes
>> +	 */
>> +	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
>>
>> -			type_msl_idx = 0;
>> -			while (total_type_mem < max_type_mem &&
>> -					total_segs < max_segs) {
>> -				uint64_t cur_max_mem, cur_mem;
>> -				unsigned int n_segs;
>> +	if (max_seglists_per_type == 0) {
>> +		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
>> +			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
>> +		return -1;
>> +	}
>>
>> -				if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
>> -					RTE_LOG(ERR, EAL,
>> -						"No more space in memseg lists, please increase %s\n",
>> -						RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
>> -					return -1;
>> -				}
>> +	/* go through all mem types and create segment lists */
>> +	msl_idx = 0;
>> +	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
>> +		unsigned int cur_seglist, n_seglists, n_segs;
>> +		unsigned int max_segs_per_type, max_segs_per_list;
>> +		struct memtype *type = &memtypes[cur_type];
>> +		uint64_t max_mem_per_list, pagesz;
>> +		int socket_id;
>>
>> -				msl = &mcfg->memsegs[msl_idx++];
>> +		pagesz = type->page_sz;
>> +		socket_id = type->socket_id;
>>
>> -				cur_max_mem = max_type_mem - total_type_mem;
>> +		/*
>> +		 * we need to create segment lists for this type. we must take
>> +		 * into account the following things:
>> +		 *
>> +		 * 1. total amount of memory we can use for this memory type
>> +		 * 2. total amount of memory per memseg list allowed
>> +		 * 3. number of segments needed to fit the amount of memory
>> +		 * 4. number of segments allowed per type
>> +		 * 5. number of segments allowed per memseg list
>> +		 * 6. number of memseg lists we are allowed to take up
>> +		 */
>>
>> -				cur_mem = get_mem_amount(hugepage_sz,
>> -						cur_max_mem);
>> -				n_segs = cur_mem / hugepage_sz;
>> +		/* calculate how much segments we will need in total */
>> +		max_segs_per_type = max_mem_per_type / pagesz;
>> +		/* limit number of segments to maximum allowed per type */
>> +		max_segs_per_type = RTE_MIN(max_segs_per_type,
>> +				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
>> +		/* limit number of segments to maximum allowed per list */
>> +		max_segs_per_list = RTE_MIN(max_segs_per_type,
>> +				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
>>
>> -				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
>> -						socket_id, type_msl_idx))
>> -					return -1;
>> +		/* calculate how much memory we can have per segment list */
>> +		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
>> +				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
>>
>> -				total_segs += msl->memseg_arr.len;
>> -				total_type_mem = total_segs * hugepage_sz;
>> -				type_msl_idx++;
>> +		/* calculate how many segments each segment list will have */
>> +		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
>>
>> -				if (alloc_va_space(msl)) {
>> -					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
>> -					return -1;
>> -				}
>> +		/* calculate how many segment lists we can have */
>> +		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
>> +				max_mem_per_type / max_mem_per_list);
>> +
>> +		/* limit number of segment lists according to our maximum */
>> +		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
>> +
>> +		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
>> +				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
>> +			n_seglists, n_segs, socket_id, pagesz);
>> +
>> +		/* create all segment lists */
>> +		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
>> +			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
>> +				RTE_LOG(ERR, EAL,
>> +					"No more space in memseg lists, please increase %s\n",
>> +					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
>> +				return -1;
>> +			}
>> +			msl = &mcfg->memsegs[msl_idx++];
>> +
>> +			if (alloc_memseg_list(msl, pagesz, n_segs,
>> +					socket_id, cur_seglist))
>> +				return -1;
>> +
>> +			if (alloc_va_space(msl)) {
>> +				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
>> +				return -1;
>> +			}
>>  		}
>> -			total_mem += total_type_mem;
>>  	}
>> --
>> 2.19.0
>>
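(To sanity-check the sizing arithmetic in the hunk above, here is a
standalone sketch that runs the same per-type steps outside of DPDK. The
MAX_* constants are illustrative stand-ins for the RTE_MAX_* build-time
options, not necessarily the shipped defaults, and MIN stands in for
RTE_MIN.)

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* illustrative stand-ins for the RTE_MAX_* config options */
    #define MAX_MEM_MB           524288ULL
    #define MAX_MEM_MB_PER_TYPE  131072ULL
    #define MAX_MEM_MB_PER_LIST   32768ULL
    #define MAX_MEMSEG_PER_TYPE   32768ULL
    #define MAX_MEMSEG_PER_LIST    8192ULL
    #define MAX_MEMSEG_LISTS         64ULL

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int main(void)
    {
        /* assumed machine: 2 sockets x 2 page sizes = 4 memory types */
        uint64_t n_memtypes = 4;
        uint64_t page_sz = 1ULL << 21; /* size one 2M type */

        /* global limits, as set up once before the per-type loop */
        uint64_t max_mem = MAX_MEM_MB << 20;
        uint64_t max_mem_per_type = MIN(MAX_MEM_MB_PER_TYPE << 20,
                max_mem / n_memtypes);
        uint64_t max_seglists_per_type = MAX_MEMSEG_LISTS / n_memtypes;

        /* the same per-type calculation as the loop body in the patch */
        uint64_t max_segs_per_type = MIN(max_mem_per_type / page_sz,
                MAX_MEMSEG_PER_TYPE);
        uint64_t max_segs_per_list = MIN(max_segs_per_type,
                MAX_MEMSEG_PER_LIST);
        uint64_t max_mem_per_list = MIN(max_segs_per_list * page_sz,
                MAX_MEM_MB_PER_LIST << 20);
        uint64_t n_segs = MIN(max_segs_per_list, max_mem_per_list / page_sz);
        uint64_t n_seglists = MIN(max_segs_per_type / n_segs,
                max_mem_per_type / max_mem_per_list);
        n_seglists = MIN(n_seglists, max_seglists_per_type);

        printf("%" PRIu64 " lists x %" PRIu64 " segments of %" PRIu64
               " bytes\n", n_seglists, n_segs, page_sz);
        return 0;
    }

With these example numbers the 2M type gets 4 memseg lists of 8192 segments
each, i.e. 64G of reservable VA space for that type, and the 16-list-per-type
cap guarantees every type gets at least one memseg list.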
>> ---
>>   Diff of the applied patch vs upstream commit (please double-check if non-empty:
>> ---
>> --- -	2018-11-22 16:47:32.741839268 +0000
>> +++ 0018-mem-improve-segment-list-preallocation.patch	2018-11-22 16:47:32.000000000 +0000
>> @@ -1,8 +1,10 @@
>> -From 1dd342d0fdc4f72102f0b48c89b6a39f029004fe Mon Sep 17 00:00:00 2001
>> +From 6d552c83eacba7be4b4f2efbafd58724e07e2330 Mon Sep 17 00:00:00 2001
>>  From: Anatoly Burakov <anatoly.burakov at intel.com>
>>  Date: Fri, 5 Oct 2018 09:29:44 +0100
>>  Subject: [PATCH] mem: improve segment list preallocation
>>
>> +[ upstream commit 1dd342d0fdc4f72102f0b48c89b6a39f029004fe ]
>> +
>>  Current code to preallocate segment lists is trying to do
>>  everything in one go, and thus ends up being convoluted,
>>  hard to understand, and, most importantly, does not scale beyond
>> @@ -21,7 +23,6 @@
>>  its code is much more readable and not worth changing.
>>
>>  Fixes: 1d406458db47 ("mem: make segment preallocation OS-specific")
>> -Cc: stable at dpdk.org
>>
>>  Signed-off-by: Anatoly Burakov <anatoly.burakov at intel.com>
>>  ---
>> @@ -29,10 +30,10 @@
>>   1 file changed, 137 insertions(+), 42 deletions(-)
>>
>>  diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> -index 04f264818..19e686eb6 100644
>> +index 6131bfde2..cc2d3fb69 100644
>>  --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>>  +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> -@@ -2132,7 +2132,13 @@ memseg_primary_init(void)
>> +@@ -2096,7 +2096,13 @@ memseg_primary_init(void)
>>   {
>>   	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
>>  -	int i, socket_id, hpi_idx, msl_idx = 0;
>> @@ -48,7 +49,7 @@
>>  +	unsigned int n_memtypes, cur_type;
>>
>>   	/* no-huge does not need this at all */
>> -@@ -2140,8 +2146,49 @@ memseg_primary_init(void)
>> +@@ -2104,8 +2110,49 @@ memseg_primary_init(void)
>>   		return 0;
>>
>>  -	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
>> @@ -101,7 +102,7 @@
>>  +	cur_type = 0;
>>   	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
>>   			hpi_idx++) {
>> -@@ -2152,9 +2199,6 @@ memseg_primary_init(void)
>> +@@ -2116,9 +2163,6 @@ memseg_primary_init(void)
>>   		hugepage_sz = hpi->hugepage_sz;
>>
>>  -		for (i = 0; i < (int) rte_socket_count(); i++) {
>> @@ -113,7 +114,7 @@
>>  +			int socket_id = rte_socket_id_by_idx(i);
>>
>>   #ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
>> -@@ -2162,47 +2206,98 @@ memseg_primary_init(void)
>> +@@ -2126,47 +2170,98 @@ memseg_primary_init(void)
>>   				break;
>>   #endif
>>  +			memtypes[cur_type].page_sz = hugepage_sz;


