[dpdk-stable] patch 'mem: improve segment list preallocation' has been queued to stable release 18.08.1
Kevin Traynor
ktraynor at redhat.com
Mon Nov 26 12:36:41 CET 2018
On 11/26/2018 11:16 AM, Burakov, Anatoly wrote:
> Hi Kevin,
>
> FYI
>
> http://patches.dpdk.org/patch/48338/
>
Thanks Anatoly. There are a couple more "batches of patches" to do, so I
will make sure it's included (after it's applied on dpdk).
> Thanks,
> Anatoly
>
>
>> -----Original Message-----
>> From: Kevin Traynor [mailto:ktraynor at redhat.com]
>> Sent: Thursday, November 22, 2018 4:49 PM
>> To: Burakov, Anatoly <anatoly.burakov at intel.com>
>> Cc: dpdk stable <stable at dpdk.org>
>> Subject: patch 'mem: improve segment list preallocation' has been queued
>> to stable release 18.08.1
>>
>> Hi,
>>
>> FYI, your patch has been queued to stable release 18.08.1
>>
>> Note it hasn't been pushed to http://dpdk.org/browse/dpdk-stable yet.
>> It will be pushed if I get no objections before 11/28/18. So please shout if
>> anyone has objections.
>>
>> Also note that after the patch there's a diff of the upstream commit vs the
>> patch applied to the branch. If the code is different (ie: not only metadata
>> diffs), due for example to a change in context or macro names, please
>> double check it.
>>
>> Thanks.
>>
>> Kevin Traynor
>>
>> ---
>> From 6d552c83eacba7be4b4f2efbafd58724e07e2330 Mon Sep 17 00:00:00 2001
>> From: Anatoly Burakov <anatoly.burakov at intel.com>
>> Date: Fri, 5 Oct 2018 09:29:44 +0100
>> Subject: [PATCH] mem: improve segment list preallocation
>>
>> [ upstream commit 1dd342d0fdc4f72102f0b48c89b6a39f029004fe ]
>>
>> Current code to preallocate segment lists is trying to do everything in one go,
>> and thus ends up being convoluted, hard to understand, and, most
>> importantly, does not scale beyond initial assumptions about number of
>> NUMA nodes and number of page sizes, and therefore has issues on some
>> configurations.
>>
>> Instead of fixing these issues in the existing code, simply rewrite it to be
>> slightly less clever but much more logical, and provide ample comments to
>> explain exactly what is going on.
>>
>> We cannot use the same approach for 32-bit code because the limitations of
>> the target dictate current socket-centric approach rather than type-centric
>> approach we use on 64-bit target, so 32-bit code is left unmodified. FreeBSD
>> doesn't support NUMA so there's no complexity involved there, and thus its
>> code is much more readable and not worth changing.
>>
>> Fixes: 1d406458db47 ("mem: make segment preallocation OS-specific")
>>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov at intel.com>
>> ---
>> lib/librte_eal/linuxapp/eal/eal_memory.c | 179 +++++++++++++++++------
>> 1 file changed, 137 insertions(+), 42 deletions(-)
>>
>> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> index 6131bfde2..cc2d3fb69 100644
>> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> @@ -2096,7 +2096,13 @@ memseg_primary_init(void)
>>  {
>>  	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
>> -	int i, socket_id, hpi_idx, msl_idx = 0;
>> +	struct memtype {
>> +		uint64_t page_sz;
>> +		int socket_id;
>> +	} *memtypes = NULL;
>> +	int i, hpi_idx, msl_idx;
>>  	struct rte_memseg_list *msl;
>> -	uint64_t max_mem, total_mem;
>> +	uint64_t max_mem, max_mem_per_type;
>> +	unsigned int max_seglists_per_type;
>> +	unsigned int n_memtypes, cur_type;
>>
>>  	/* no-huge does not need this at all */
>> @@ -2104,8 +2110,49 @@ memseg_primary_init(void)
>>  		return 0;
>>
>> -	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
>> -	total_mem = 0;
>> +	/*
>> +	 * figuring out amount of memory we're going to have is a long and very
>> +	 * involved process. the basic element we're operating with is a memory
>> +	 * type, defined as a combination of NUMA node ID and page size (so that
>> +	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
>> +	 *
>> +	 * deciding amount of memory going towards each memory type is a
>> +	 * balancing act between maximum segments per type, maximum memory per
>> +	 * type, and number of detected NUMA nodes. the goal is to make sure
>> +	 * each memory type gets at least one memseg list.
>> +	 *
>> +	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
>> +	 *
>> +	 * the total amount of memory per type is limited by either
>> +	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
>> +	 * of detected NUMA nodes. additionally, maximum number of segments per
>> +	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
>> +	 * smaller page sizes, it can take hundreds of thousands of segments to
>> +	 * reach the above specified per-type memory limits.
>> +	 *
>> +	 * additionally, each type may have multiple memseg lists associated
>> +	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
>> +	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
>> +	 *
>> +	 * the number of memseg lists per type is decided based on the above
>> +	 * limits, and also taking number of detected NUMA nodes, to make sure
>> +	 * that we don't run out of memseg lists before we populate all NUMA
>> +	 * nodes with memory.
>> +	 *
>> +	 * we do this in three stages. first, we collect the number of types.
>> +	 * then, we figure out memory constraints and populate the list of
>> +	 * would-be memseg lists. then, we go ahead and allocate the memseg
>> +	 * lists.
>> +	 */
>>
>> -	/* create memseg lists */
>> +	/* create space for mem types */
>> +	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
>> +	memtypes = calloc(n_memtypes, sizeof(*memtypes));
>> +	if (memtypes == NULL) {
>> +		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
>> +		return -1;
>> +	}
>> +
>> +	/* populate mem types */
>> +	cur_type = 0;
>>  	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
>>  			hpi_idx++) {
>> @@ -2116,9 +2163,6 @@ memseg_primary_init(void)
>>  		hugepage_sz = hpi->hugepage_sz;
>>
>> -		for (i = 0; i < (int) rte_socket_count(); i++) {
>> -			uint64_t max_type_mem, total_type_mem = 0;
>> -			int type_msl_idx, max_segs, total_segs = 0;
>> -
>> -			socket_id = rte_socket_id_by_idx(i);
>> +		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
>> +			int socket_id = rte_socket_id_by_idx(i);
>>
>>  #ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
>> @@ -2126,47 +2170,98 @@ memseg_primary_init(void)
>>  				break;
>>  #endif
>> +			memtypes[cur_type].page_sz = hugepage_sz;
>> +			memtypes[cur_type].socket_id = socket_id;
>>
>> -			if (total_mem >= max_mem)
>> -				break;
>> +			RTE_LOG(DEBUG, EAL, "Detected memory type: "
>> +				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
>> +				socket_id, hugepage_sz);
>> +		}
>> +	}
>>
>> -			max_type_mem = RTE_MIN(max_mem - total_mem,
>> -				(uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20);
>> -			max_segs = RTE_MAX_MEMSEG_PER_TYPE;
>> +	/* set up limits for types */
>> +	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
>> +	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
>> +			max_mem / n_memtypes);
>> +	/*
>> +	 * limit maximum number of segment lists per type to ensure there's
>> +	 * space for memseg lists for all NUMA nodes with all page sizes
>> +	 */
>> +	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
>>
>> -			type_msl_idx = 0;
>> -			while (total_type_mem < max_type_mem &&
>> -					total_segs < max_segs) {
>> -				uint64_t cur_max_mem, cur_mem;
>> -				unsigned int n_segs;
>> +	if (max_seglists_per_type == 0) {
>> +		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
>> +			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
>> +		return -1;
>> +	}
>>
>> -				if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
>> -					RTE_LOG(ERR, EAL,
>> -						"No more space in memseg lists, please increase %s\n",
>> -						RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
>> -					return -1;
>> -				}
>> +	/* go through all mem types and create segment lists */
>> +	msl_idx = 0;
>> +	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
>> +		unsigned int cur_seglist, n_seglists, n_segs;
>> +		unsigned int max_segs_per_type, max_segs_per_list;
>> +		struct memtype *type = &memtypes[cur_type];
>> +		uint64_t max_mem_per_list, pagesz;
>> +		int socket_id;
>>
>> -				msl = &mcfg->memsegs[msl_idx++];
>> +		pagesz = type->page_sz;
>> +		socket_id = type->socket_id;
>>
>> -				cur_max_mem = max_type_mem - total_type_mem;
>> +		/*
>> +		 * we need to create segment lists for this type. we must take
>> +		 * into account the following things:
>> +		 *
>> +		 * 1. total amount of memory we can use for this memory type
>> +		 * 2. total amount of memory per memseg list allowed
>> +		 * 3. number of segments needed to fit the amount of memory
>> +		 * 4. number of segments allowed per type
>> +		 * 5. number of segments allowed per memseg list
>> +		 * 6. number of memseg lists we are allowed to take up
>> +		 */
>>
>> -				cur_mem = get_mem_amount(hugepage_sz,
>> -						cur_max_mem);
>> -				n_segs = cur_mem / hugepage_sz;
>> +		/* calculate how much segments we will need in total */
>> +		max_segs_per_type = max_mem_per_type / pagesz;
>> +		/* limit number of segments to maximum allowed per type */
>> +		max_segs_per_type = RTE_MIN(max_segs_per_type,
>> +				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
>> +		/* limit number of segments to maximum allowed per list */
>> +		max_segs_per_list = RTE_MIN(max_segs_per_type,
>> +				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
>>
>> -				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
>> -						socket_id, type_msl_idx))
>> -					return -1;
>> +		/* calculate how much memory we can have per segment list */
>> +		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
>> +				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
>>
>> -				total_segs += msl->memseg_arr.len;
>> -				total_type_mem = total_segs * hugepage_sz;
>> -				type_msl_idx++;
>> +		/* calculate how many segments each segment list will have */
>> +		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
>>
>> -				if (alloc_va_space(msl)) {
>> -					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
>> -					return -1;
>> -				}
>> +		/* calculate how many segment lists we can have */
>> +		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
>> +				max_mem_per_type / max_mem_per_list);
>> +
>> +		/* limit number of segment lists according to our maximum */
>> +		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
>> +
>> +		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
>> +				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
>> +			n_seglists, n_segs, socket_id, pagesz);
>> +
>> +		/* create all segment lists */
>> +		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
>> +			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
>> +				RTE_LOG(ERR, EAL,
>> +					"No more space in memseg lists, please increase %s\n",
>> +					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
>> +				return -1;
>> +			}
>> +			msl = &mcfg->memsegs[msl_idx++];
>> +
>> +			if (alloc_memseg_list(msl, pagesz, n_segs,
>> +					socket_id, cur_seglist))
>> +				return -1;
>> +
>> +			if (alloc_va_space(msl)) {
>> +				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
>> +				return -1;
>>  			}
>> -			total_mem += total_type_mem;
>>  		}
>>  	}
>> --
>> 2.19.0
>>
>> ---
>> Diff of the applied patch vs upstream commit (please double-check if
>> non-empty):
>> ---
>> --- -	2018-11-22 16:47:32.741839268 +0000
>> +++ 0018-mem-improve-segment-list-preallocation.patch	2018-11-22 16:47:32.000000000 +0000
>> @@ -1,8 +1,10 @@
>> -From 1dd342d0fdc4f72102f0b48c89b6a39f029004fe Mon Sep 17 00:00:00 2001
>> +From 6d552c83eacba7be4b4f2efbafd58724e07e2330 Mon Sep 17 00:00:00 2001
>>  From: Anatoly Burakov <anatoly.burakov at intel.com>
>>  Date: Fri, 5 Oct 2018 09:29:44 +0100
>>  Subject: [PATCH] mem: improve segment list preallocation
>>
>> +[ upstream commit 1dd342d0fdc4f72102f0b48c89b6a39f029004fe ]
>> +
>>  Current code to preallocate segment lists is trying to do everything in one
>>  go, and thus ends up being convoluted, hard to understand, and, most
>>  importantly, does not scale beyond
>> @@ -21,7 +23,6 @@ its code is much more readable and not worth changing.
>>
>>  Fixes: 1d406458db47 ("mem: make segment preallocation OS-specific")
>> -Cc: stable at dpdk.org
>>
>>  Signed-off-by: Anatoly Burakov <anatoly.burakov at intel.com>
>>  ---
>> @@ -29,10 +30,10 @@
>>   1 file changed, 137 insertions(+), 42 deletions(-)
>>
>>  diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> -index 04f264818..19e686eb6 100644
>> +index 6131bfde2..cc2d3fb69 100644
>>  --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>>  +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> -@@ -2132,7 +2132,13 @@ memseg_primary_init(void)
>> +@@ -2096,7 +2096,13 @@ memseg_primary_init(void)
>>   {
>>   	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
>>  -	int i, socket_id, hpi_idx, msl_idx = 0;
>> @@ -48,7 +49,7 @@
>>  +	unsigned int n_memtypes, cur_type;
>>
>>   	/* no-huge does not need this at all */
>> -@@ -2140,8 +2146,49 @@ memseg_primary_init(void)
>> +@@ -2104,8 +2110,49 @@ memseg_primary_init(void)
>>   		return 0;
>>
>>  -	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
>> @@ -101,7 +102,7 @@
>>  +	cur_type = 0;
>>   	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
>>   			hpi_idx++) {
>> -@@ -2152,9 +2199,6 @@ memseg_primary_init(void)
>> +@@ -2116,9 +2163,6 @@ memseg_primary_init(void)
>>   		hugepage_sz = hpi->hugepage_sz;
>>
>>  -		for (i = 0; i < (int) rte_socket_count(); i++) {
>> @@ -113,7 +114,7 @@
>>  +			int socket_id = rte_socket_id_by_idx(i);
>>
>>   #ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
>> -@@ -2162,47 +2206,98 @@ memseg_primary_init(void)
>> +@@ -2126,47 +2170,98 @@ memseg_primary_init(void)
>>   				break;
>>   #endif
>>  +			memtypes[cur_type].page_sz = hugepage_sz;