[dpdk-stable] patch 'mem: improve segment list preallocation' has been queued to stable release 18.08.1
Burakov, Anatoly
anatoly.burakov at intel.com
Mon Nov 26 12:16:11 CET 2018
Hi Kevin,
FYI
http://patches.dpdk.org/patch/48338/
Thanks,
Anatoly
> -----Original Message-----
> From: Kevin Traynor [mailto:ktraynor at redhat.com]
> Sent: Thursday, November 22, 2018 4:49 PM
> To: Burakov, Anatoly <anatoly.burakov at intel.com>
> Cc: dpdk stable <stable at dpdk.org>
> Subject: patch 'mem: improve segment list preallocation' has been queued
> to stable release 18.08.1
>
> Hi,
>
> FYI, your patch has been queued to stable release 18.08.1
>
> Note it hasn't been pushed to http://dpdk.org/browse/dpdk-stable yet.
> It will be pushed if I get no objections before 11/28/18. So please shout if
> anyone has objections.
>
> Also note that after the patch there's a diff of the upstream commit vs the
> patch applied to the branch. If the code is different (ie: not only metadata
> diffs), due for example to a change in context or macro names, please
> double check it.
>
> Thanks.
>
> Kevin Traynor
>
> ---
> From 6d552c83eacba7be4b4f2efbafd58724e07e2330 Mon Sep 17 00:00:00 2001
> From: Anatoly Burakov <anatoly.burakov at intel.com>
> Date: Fri, 5 Oct 2018 09:29:44 +0100
> Subject: [PATCH] mem: improve segment list preallocation
>
> [ upstream commit 1dd342d0fdc4f72102f0b48c89b6a39f029004fe ]
>
> Current code to preallocate segment lists is trying to do everything in one go,
> and thus ends up being convoluted, hard to understand, and, most
> importantly, does not scale beyond initial assumptions about number of
> NUMA nodes and number of page sizes, and therefore has issues on some
> configurations.
>
> Instead of fixing these issues in the existing code, simply rewrite it to be
> slightly less clever but much more logical, and provide ample comments to
> explain exactly what is going on.
>
> We cannot use the same approach for 32-bit code because the limitations of
> the target dictate current socket-centric approach rather than type-centric
> approach we use on 64-bit target, so 32-bit code is left unmodified. FreeBSD
> doesn't support NUMA so there's no complexity involved there, and thus its
> code is much more readable and not worth changing.
>
> Fixes: 1d406458db47 ("mem: make segment preallocation OS-specific")
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov at intel.com>
> ---
> lib/librte_eal/linuxapp/eal/eal_memory.c | 179 +++++++++++++++++------
> 1 file changed, 137 insertions(+), 42 deletions(-)
>
> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
> index 6131bfde2..cc2d3fb69 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> @@ -2096,7 +2096,13 @@ memseg_primary_init(void)
>  {
>  	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
> -	int i, socket_id, hpi_idx, msl_idx = 0;
> +	struct memtype {
> +		uint64_t page_sz;
> +		int socket_id;
> +	} *memtypes = NULL;
> +	int i, hpi_idx, msl_idx;
>  	struct rte_memseg_list *msl;
> -	uint64_t max_mem, total_mem;
> +	uint64_t max_mem, max_mem_per_type;
> +	unsigned int max_seglists_per_type;
> +	unsigned int n_memtypes, cur_type;
> 
>  	/* no-huge does not need this at all */
> @@ -2104,8 +2110,49 @@ memseg_primary_init(void)
>  		return 0;
> 
> -	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
> -	total_mem = 0;
> +	/*
> +	 * figuring out amount of memory we're going to have is a long and very
> +	 * involved process. the basic element we're operating with is a memory
> +	 * type, defined as a combination of NUMA node ID and page size (so that
> +	 * e.g. 2 sockets with 2 page sizes yield 4 memory types in total).
> +	 *
> +	 * deciding amount of memory going towards each memory type is a
> +	 * balancing act between maximum segments per type, maximum memory per
> +	 * type, and number of detected NUMA nodes. the goal is to make sure
> +	 * each memory type gets at least one memseg list.
> +	 *
> +	 * the total amount of memory is limited by RTE_MAX_MEM_MB value.
> +	 *
> +	 * the total amount of memory per type is limited by either
> +	 * RTE_MAX_MEM_MB_PER_TYPE, or by RTE_MAX_MEM_MB divided by the number
> +	 * of detected NUMA nodes. additionally, maximum number of segments per
> +	 * type is also limited by RTE_MAX_MEMSEG_PER_TYPE. this is because for
> +	 * smaller page sizes, it can take hundreds of thousands of segments to
> +	 * reach the above specified per-type memory limits.
> +	 *
> +	 * additionally, each type may have multiple memseg lists associated
> +	 * with it, each limited by either RTE_MAX_MEM_MB_PER_LIST for bigger
> +	 * page sizes, or RTE_MAX_MEMSEG_PER_LIST segments for smaller ones.
> +	 *
> +	 * the number of memseg lists per type is decided based on the above
> +	 * limits, and also taking number of detected NUMA nodes, to make sure
> +	 * that we don't run out of memseg lists before we populate all NUMA
> +	 * nodes with memory.
> +	 *
> +	 * we do this in three stages. first, we collect the number of types.
> +	 * then, we figure out memory constraints and populate the list of
> +	 * would-be memseg lists. then, we go ahead and allocate the memseg
> +	 * lists.
> +	 */
> 
> -	/* create memseg lists */
> +	/* create space for mem types */
> +	n_memtypes = internal_config.num_hugepage_sizes * rte_socket_count();
> +	memtypes = calloc(n_memtypes, sizeof(*memtypes));
> +	if (memtypes == NULL) {
> +		RTE_LOG(ERR, EAL, "Cannot allocate space for memory types\n");
> +		return -1;
> +	}
> +
> +	/* populate mem types */
> +	cur_type = 0;
>  	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
>  			hpi_idx++) {
> @@ -2116,9 +2163,6 @@ memseg_primary_init(void)
>  		hugepage_sz = hpi->hugepage_sz;
> 
> -		for (i = 0; i < (int) rte_socket_count(); i++) {
> -			uint64_t max_type_mem, total_type_mem = 0;
> -			int type_msl_idx, max_segs, total_segs = 0;
> -
> -			socket_id = rte_socket_id_by_idx(i);
> +		for (i = 0; i < (int) rte_socket_count(); i++, cur_type++) {
> +			int socket_id = rte_socket_id_by_idx(i);
> 
>  #ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
> @@ -2126,47 +2170,98 @@ memseg_primary_init(void)
>  				break;
>  #endif
> +			memtypes[cur_type].page_sz = hugepage_sz;
> +			memtypes[cur_type].socket_id = socket_id;
> 
> -			if (total_mem >= max_mem)
> -				break;
> +			RTE_LOG(DEBUG, EAL, "Detected memory type: "
> +				"socket_id:%u hugepage_sz:%" PRIu64 "\n",
> +				socket_id, hugepage_sz);
> +		}
> +	}
> 
> -			max_type_mem = RTE_MIN(max_mem - total_mem,
> -					(uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20);
> -			max_segs = RTE_MAX_MEMSEG_PER_TYPE;
> +	/* set up limits for types */
> +	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
> +	max_mem_per_type = RTE_MIN((uint64_t)RTE_MAX_MEM_MB_PER_TYPE << 20,
> +			max_mem / n_memtypes);
> +	/*
> +	 * limit maximum number of segment lists per type to ensure there's
> +	 * space for memseg lists for all NUMA nodes with all page sizes
> +	 */
> +	max_seglists_per_type = RTE_MAX_MEMSEG_LISTS / n_memtypes;
> 
> -			type_msl_idx = 0;
> -			while (total_type_mem < max_type_mem &&
> -					total_segs < max_segs) {
> -				uint64_t cur_max_mem, cur_mem;
> -				unsigned int n_segs;
> +	if (max_seglists_per_type == 0) {
> +		RTE_LOG(ERR, EAL, "Cannot accommodate all memory types, please increase %s\n",
> +			RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
> +		return -1;
> +	}
> 
> -				if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
> -					RTE_LOG(ERR, EAL,
> -						"No more space in memseg lists, please increase %s\n",
> -						RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
> -					return -1;
> -				}
> +	/* go through all mem types and create segment lists */
> +	msl_idx = 0;
> +	for (cur_type = 0; cur_type < n_memtypes; cur_type++) {
> +		unsigned int cur_seglist, n_seglists, n_segs;
> +		unsigned int max_segs_per_type, max_segs_per_list;
> +		struct memtype *type = &memtypes[cur_type];
> +		uint64_t max_mem_per_list, pagesz;
> +		int socket_id;
> 
> -				msl = &mcfg->memsegs[msl_idx++];
> +		pagesz = type->page_sz;
> +		socket_id = type->socket_id;
> 
> -				cur_max_mem = max_type_mem - total_type_mem;
> +		/*
> +		 * we need to create segment lists for this type. we must take
> +		 * into account the following things:
> +		 *
> +		 * 1. total amount of memory we can use for this memory type
> +		 * 2. total amount of memory per memseg list allowed
> +		 * 3. number of segments needed to fit the amount of memory
> +		 * 4. number of segments allowed per type
> +		 * 5. number of segments allowed per memseg list
> +		 * 6. number of memseg lists we are allowed to take up
> +		 */
> 
> -				cur_mem = get_mem_amount(hugepage_sz,
> -						cur_max_mem);
> -				n_segs = cur_mem / hugepage_sz;
> +		/* calculate how much segments we will need in total */
> +		max_segs_per_type = max_mem_per_type / pagesz;
> +		/* limit number of segments to maximum allowed per type */
> +		max_segs_per_type = RTE_MIN(max_segs_per_type,
> +				(unsigned int)RTE_MAX_MEMSEG_PER_TYPE);
> +		/* limit number of segments to maximum allowed per list */
> +		max_segs_per_list = RTE_MIN(max_segs_per_type,
> +				(unsigned int)RTE_MAX_MEMSEG_PER_LIST);
> 
> -				if (alloc_memseg_list(msl, hugepage_sz, n_segs,
> -						socket_id, type_msl_idx))
> -					return -1;
> +		/* calculate how much memory we can have per segment list */
> +		max_mem_per_list = RTE_MIN(max_segs_per_list * pagesz,
> +				(uint64_t)RTE_MAX_MEM_MB_PER_LIST << 20);
> 
> -				total_segs += msl->memseg_arr.len;
> -				total_type_mem = total_segs * hugepage_sz;
> -				type_msl_idx++;
> +		/* calculate how many segments each segment list will have */
> +		n_segs = RTE_MIN(max_segs_per_list, max_mem_per_list / pagesz);
> 
> -				if (alloc_va_space(msl)) {
> -					RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
> -					return -1;
> -				}
> +		/* calculate how many segment lists we can have */
> +		n_seglists = RTE_MIN(max_segs_per_type / n_segs,
> +				max_mem_per_type / max_mem_per_list);
> +
> +		/* limit number of segment lists according to our maximum */
> +		n_seglists = RTE_MIN(n_seglists, max_seglists_per_type);
> +
> +		RTE_LOG(DEBUG, EAL, "Creating %i segment lists: "
> +				"n_segs:%i socket_id:%i hugepage_sz:%" PRIu64 "\n",
> +			n_seglists, n_segs, socket_id, pagesz);
> +
> +		/* create all segment lists */
> +		for (cur_seglist = 0; cur_seglist < n_seglists; cur_seglist++) {
> +			if (msl_idx >= RTE_MAX_MEMSEG_LISTS) {
> +				RTE_LOG(ERR, EAL,
> +					"No more space in memseg lists, please increase %s\n",
> +					RTE_STR(CONFIG_RTE_MAX_MEMSEG_LISTS));
> +				return -1;
> +			}
> +			msl = &mcfg->memsegs[msl_idx++];
> +
> +			if (alloc_memseg_list(msl, pagesz, n_segs,
> +					socket_id, cur_seglist))
> +				return -1;
> +
> +			if (alloc_va_space(msl)) {
> +				RTE_LOG(ERR, EAL, "Cannot allocate VA space for memseg list\n");
> +				return -1;
> +			}
>  		}
> -		total_mem += total_type_mem;
>  	}
>  }
> --
> 2.19.0
>
> ---
> Diff of the applied patch vs upstream commit (please double-check if non-empty):
> ---
> --- -	2018-11-22 16:47:32.741839268 +0000
> +++ 0018-mem-improve-segment-list-preallocation.patch	2018-11-22 16:47:32.000000000 +0000
> @@ -1,8 +1,10 @@
> -From 1dd342d0fdc4f72102f0b48c89b6a39f029004fe Mon Sep 17 00:00:00 2001
> +From 6d552c83eacba7be4b4f2efbafd58724e07e2330 Mon Sep 17 00:00:00 2001
>  From: Anatoly Burakov <anatoly.burakov at intel.com>
>  Date: Fri, 5 Oct 2018 09:29:44 +0100
>  Subject: [PATCH] mem: improve segment list preallocation
> 
> +[ upstream commit 1dd342d0fdc4f72102f0b48c89b6a39f029004fe ]
> +
>  Current code to preallocate segment lists is trying to do everything in
>  one go, and thus ends up being convoluted, hard to understand, and, most
>  importantly, does not scale beyond
> @@ -21,7 +23,6 @@ its code is much more readable and not worth changing.
> 
>  Fixes: 1d406458db47 ("mem: make segment preallocation OS-specific")
> -Cc: stable at dpdk.org
> 
>  Signed-off-by: Anatoly Burakov <anatoly.burakov at intel.com>
>  ---
> @@ -29,10 +30,10 @@
>   1 file changed, 137 insertions(+), 42 deletions(-)
> 
>  diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
> -index 04f264818..19e686eb6 100644
> +index 6131bfde2..cc2d3fb69 100644
>  --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>  +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> -@@ -2132,7 +2132,13 @@ memseg_primary_init(void)
> +@@ -2096,7 +2096,13 @@ memseg_primary_init(void)
>   {
>   	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
>  -	int i, socket_id, hpi_idx, msl_idx = 0;
> @@ -48,7 +49,7 @@
>  +	unsigned int n_memtypes, cur_type;
>  
>   	/* no-huge does not need this at all */
> -@@ -2140,8 +2146,49 @@ memseg_primary_init(void)
> +@@ -2104,8 +2110,49 @@ memseg_primary_init(void)
>   		return 0;
>  
>  -	max_mem = (uint64_t)RTE_MAX_MEM_MB << 20;
> @@ -101,7 +102,7 @@
>  +	cur_type = 0;
>   	for (hpi_idx = 0; hpi_idx < (int) internal_config.num_hugepage_sizes;
>   			hpi_idx++) {
> -@@ -2152,9 +2199,6 @@ memseg_primary_init(void)
> +@@ -2116,9 +2163,6 @@ memseg_primary_init(void)
>   		hugepage_sz = hpi->hugepage_sz;
>  
>  -		for (i = 0; i < (int) rte_socket_count(); i++) {
> @@ -113,7 +114,7 @@
>  +			int socket_id = rte_socket_id_by_idx(i);
>  
>   #ifndef RTE_EAL_NUMA_AWARE_HUGEPAGES
> -@@ -2162,47 +2206,98 @@ memseg_primary_init(void)
> +@@ -2126,47 +2170,98 @@ memseg_primary_init(void)
>   				break;
>   #endif
>  +			memtypes[cur_type].page_sz = hugepage_sz;